Tag Archives: ICML

Intervening on early readouts for mitigating spurious features and simplicity bias

Machine learning models in the real world are often trained on limited data that may contain unintended statistical biases. For example, in the CELEBA celebrity image dataset, a disproportionate number of female celebrities have blond hair, leading to classifiers incorrectly predicting “blond” as the hair color for most female faces — here, gender is a spurious feature for predicting hair color. Such unfair biases could have significant consequences in critical applications such as medical diagnosis.

Surprisingly, recent work has also discovered an inherent tendency of deep networks to amplify such statistical biases, through the so-called simplicity bias of deep learning. This bias is the tendency of deep networks to identify weakly predictive features early in the training, and continue to anchor on these features, failing to identify more complex and potentially more accurate features.

With the above in mind, we propose simple and effective fixes to this dual challenge of spurious features and simplicity bias by applying early readouts and feature forgetting. First, in “Using Early Readouts to Mediate Featural Bias in Distillation”, we show that making predictions from early layers of a deep network (referred to as “early readouts”) can automatically signal issues with the quality of the learned representations. In particular, these predictions are more often wrong, and more confidently wrong, when the network is relying on spurious features. We use this erroneous confidence to improve outcomes in model distillation, a setting where a larger “teacher” model guides the training of a smaller “student” model. Then in “Overcoming Simplicity Bias in Deep Networks using a Feature Sieve”, we intervene directly on these indicator signals by making the network “forget” the problematic features and consequently look for better, more predictive features. This substantially improves the model’s ability to generalize to unseen domains compared to previous approaches. Our AI Principles and our Responsible AI practices guide how we research and develop these advanced applications and help us address the challenges posed by statistical biases.

Animation comparing hypothetical responses from two models trained with and without the feature sieve.

Early readouts for debiasing distillation

We first illustrate the diagnostic value of early readouts and their application in debiased distillation, i.e., making sure that the student model inherits the teacher model’s resilience to feature bias through distillation. We start with a standard distillation framework where the student is trained with a mixture of label matching (minimizing the cross-entropy loss between student outputs and the ground-truth labels) and teacher matching (minimizing the KL divergence loss between student and teacher outputs for any given input).

Suppose one trains a linear decoder, i.e., a small auxiliary neural network named as Aux, on top of an intermediate representation of the student model. We refer to the output of this linear decoder as an early readout of the network representation. Our finding is that early readouts make more errors on instances that contain spurious features, and further, the confidence on those errors is higher than the confidence associated with other errors. This suggests that confidence on errors from early readouts is a fairly strong, automated indicator of the model’s dependence on potentially spurious features.

Illustrating the usage of early readouts (i.e., output from the auxiliary layer) in debiasing distillation. Instances that are confidently mispredicted in the early readouts are upweighted in the distillation loss.

We used this signal to modulate the contribution of the teacher in the distillation loss on a per-instance basis, and found significant improvements in the trained student model as a result.

We evaluated our approach on standard benchmark datasets known to contain spurious correlations (Waterbirds, CelebA, CivilComments, MNLI). Each of these datasets contain groupings of data that share an attribute potentially correlated with the label in a spurious manner. As an example, the CelebA dataset mentioned above includes groups such as {blond male, blond female, non-blond male, non-blond female}, with models typically performing the worst on the {non-blond female} group when predicting hair color. Thus, a measure of model performance is its worst group accuracy, i.e., the lowest accuracy among all known groups present in the dataset. We improved the worst group accuracy of student models on all datasets; moreover, we also improved overall accuracy in three of the four datasets, showing that our improvement on any one group does not come at the expense of accuracy on other groups. More details are available in our paper.

Comparison of Worst Group Accuracies of different distillation techniques relative to that of the Teacher model. Our method outperforms other methods on all datasets.

Overcoming simplicity bias with a feature sieve

In a second, closely related project, we intervene directly on the information provided by early readouts, to improve feature learning and generalization. The workflow alternates between identifying problematic features and erasing identified features from the network. Our primary hypothesis is that early features are more prone to simplicity bias, and that by erasing (“sieving”) these features, we allow richer feature representations to be learned.

Training workflow with feature sieve. We alternate between identifying problematic features (using training iteration) and erasing them from the network (using forgetting iteration).

We describe the identification and erasure steps in more detail:

  • Identifying simple features: We train the primary model and the readout model (AUX above) in conventional fashion via forward- and back-propagation. Note that feedback from the auxiliary layer does not back-propagate to the main network. This is to force the auxiliary layer to learn from already-available features rather than create or reinforce them in the main network.
  • Applying the feature sieve: We aim to erase the identified features in the early layers of the neural network with the use of a novel forgetting loss, Lf , which is simply the cross-entropy between the readout and a uniform distribution over labels. Essentially, all information that leads to nontrivial readouts are erased from the primary network. In this step, the auxiliary network and upper layers of the main network are kept unchanged.

We can control specifically how the feature sieve is applied to a given dataset through a small number of configuration parameters. By changing the position and complexity of the auxiliary network, we control the complexity of the identified- and erased features. By modifying the mixing of learning and forgetting steps, we control the degree to which the model is challenged to learn more complex features. These choices, which are dataset-dependent, are made via hyperparameter search to maximize validation accuracy, a standard measure of generalization. Since we include “no-forgetting” (i.e., the baseline model) in the search space, we expect to find settings that are at least as good as the baseline.

Below we show features learned by the baseline model (middle row) and our model (bottom row) on two benchmark datasets — biased activity recognition (BAR) and animal categorization (NICO). Feature importance was estimated using post-hoc gradient-based importance scoring (GRAD-CAM), with the orange-red end of the spectrum indicating high importance, while green-blue indicates low importance. Shown below, our trained models focus on the primary object of interest, whereas the baseline model tends to focus on background features that are simpler and spuriously correlated with the label.

Feature importance scoring using GRAD-CAM on activity recognition (BAR) and animal categorization (NICO) generalization benchmarks. Our approach (last row) focuses on the relevant objects in the image, whereas the baseline (ERM; middle row) relies on background features that are spuriously correlated with the label.

Through this ability to learn better, generalizable features, we show substantial gains over a range of relevant baselines on real-world spurious feature benchmark datasets: BAR, CelebA Hair, NICO and ImagenetA, by margins up to 11% (see figure below). More details are available in our paper.

Our feature sieve method improves accuracy by significant margins relative to the nearest baseline for a range of feature generalization benchmark datasets.

Conclusion

We hope that our work on early readouts and their use in feature sieving for generalization will both spur the development of a new class of adversarial feature learning approaches and help improve the generalization capability and robustness of deep learning systems.


Acknowledgements

The work on applying early readouts to debiasing distillation was conducted in collaboration with our academic partners Durga Sivasubramanian, Anmol Reddy and Prof. Ganesh Ramakrishnan at IIT Bombay. We extend our sincere gratitude to Praneeth Netrapalli and Anshul Nasery for their feedback and recommendations. We are also grateful to Nishant Jain, Shreyas Havaldar, Rachit Bansal, Kartikeya Badola, Amandeep Kaur and the whole cohort of pre-doctoral researchers at Google Research India for taking part in research discussions. Special thanks to Tom Small for creating the animation used in this post.

Source: Google AI Blog


Scalable spherical CNNs for scientific applications

Typical deep learning models for computer vision, like convolutional neural networks (CNNs) and vision transformers (ViT), process signals assuming planar (flat) spaces. For example, digital images are represented as a grid of pixels on a plane. However, this type of data makes up only a fraction of the data we encounter in scientific applications. Variables sampled from the Earth's atmosphere, like temperature and humidity, are naturally represented on the sphere. Some kinds of cosmological data and panoramic photos are also spherical signals, and are better treated as such.

Using methods designed for planar images to process spherical signals is problematic for a couple of reasons. First, there is a sampling problem, i.e., there is no way of defining uniform grids on the sphere, which are needed for planar CNNs and ViTs, without heavy distortion.

When projecting the sphere into a plane, the patch represented by the red circle is heavily distorted near the poles. This sampling problem hurts the accuracy of conventional CNNs and ViTs on spherical inputs.

Second, signals and local patterns on the sphere are often complicated by rotations, so models need a way to address that. We would like equivariance to 3D rotations, which ensures that learned features follow the rotations of the input. This leads to better utilization of the model parameters and allows training with less data. Equivariance to 3D rotations is also useful in most settings where inputs don’t have a preferred orientation, such as 3D shapes and molecules.

Drone racing with panoramic cameras. Here the sharp turns result in large 3D rotations of the spherical image. We would like our models to be robust to such rotations. Source: https://www.youtube.com/watch?v=_J7qXbbXY80 (licensed under CC BY)
In the atmosphere, it is common to see similar patterns appearing at different positions and orientations. We would like our models to share parameters to recognize these patterns.

With the above challenges in mind, in “Scaling Spherical CNNs”, presented at ICML 2023, we introduce an open-source library in JAX for deep learning on spherical surfaces. We demonstrate how applications of this library match or surpass state-of-the-art performance on weather forecasting and molecular property prediction benchmarks, tasks that are typically addressed with transformers and graph neural networks.


Background on spherical CNNs

Spherical CNNs solve both the problems of sampling and of robustness to rotation by leveraging spherical convolution and cross-correlation operations, which are typically computed via generalized Fourier transforms. For planar surfaces, however, convolution with small filters is faster, because it can be performed on regular grids without using Fourier transforms. The higher computational cost for spherical inputs has so far restricted the application of spherical CNNs to small models and datasets and low resolution datasets.


Our contributions

We have implemented the spherical convolutions from spin-weighted spherical CNNs in JAX with a focus on speed, and have enabled distributed training over a large number of TPUs using data parallelism. We also introduced a new phase collapse activation and spectral batch normalization layer, and a new residual block that improves accuracy and efficiency, which allows training more accurate models up to 100x larger than before. We apply these new models on molecular property regression and weather forecasting.

We scale spherical CNNs by up to two orders of magnitude in terms of feature sizes and model capacity, compared to the literature: Cohen’18Esteves’18Esteves’20, and Cobb’21VGG-19 is included as a conventional CNN reference. Our largest model for weather forecasting has 256 x 256 x 78 inputs and outputs, and runs 96 convolutional layers during training with a lowest internal resolution of 128 x 128 x 256.

Molecular property regression

Predicting properties of molecules has applications in drug discovery, where the goal is to quickly screen numerous molecules in search of those with desirable properties. Similar models may also be relevant in the design of drugs targeting the interaction between proteins. Current methods in computational or experimental quantum chemistry are expensive, which motivates the use of machine learning.

Molecules can be represented by a set of atoms and their positions in 3D space; rotations of the molecule change the positions but not the molecular properties. This motivates the application of spherical CNNs because of their rotation equivariance. However, molecules are not defined as signals on the sphere so the first step is to map them to a set of spherical functions. We do so by leveraging physics-based interactions between the atoms of the molecule.

Each atom is represented by a set of spherical signals accumulating physical interactions with other atoms of each type (shown in the three panels on the right). For example, the oxygen atom (O; top panel) has a channel for oxygen (indicated by the sphere labeled “O” on the left) and hydrogen (“H”, right). The accumulated Coulomb forces on the oxygen atom with respect to the two hydrogen atoms is indicated by the red shaded regions on the bottom of the sphere labeled “H”. Because the oxygen atom contributes no forces to itself, the “O” sphere is uniform. We include extra channels for the Van der Waals forces.

Spherical CNNs are applied to each atom's features, and results are later combined to produce the property predictions. This results in state-of-the art performance in most properties as typically evaluated in the QM9 benchmark:

Error comparison against the state-of-the-art on 12 properties of QM9 (see the dataset paper for details). We show TorchMD-Net and PaiNN results, normalizing TorchMD-Net errors to 1.0 (lower is better). Our model, shown in green, outperforms the baselines in most targets.

Weather forecasting

Accurate climate forecasts serve as invaluable tools for providing timely warnings of extreme weather events, enabling effective water resource management, and guiding informed infrastructure planning. In a world increasingly threatened by climate disasters, there is an urgency to deliver forecasts much faster and more accurately over a longer time horizon than general circulation models. Forecasting models will also be important for predicting the safety and effectiveness of efforts intended to combat climate change, such as climate interventions. The current state-of-the-art uses costly numerical models based on fluid dynamics and thermodynamics, which tend to drift after a few days.

Given these challenges, there is an urgency for machine learning researchers to address climate forecasting problems, as data-driven techniques have the potential of both reducing the computational cost and improving long range accuracy. Spherical CNNs are suitable for this task since atmospheric data is natively presented on the sphere. They can also efficiently handle repeating patterns at different positions and orientations that are common in such data.

We apply our models to several weather forecasting benchmarks and outperform or match neural weather models based on conventional CNNs (specifically, 1, 2, and 3). Below we show results in a test setting where the model takes a number of atmospheric variables as input and predicts their values six hours ahead. The model is then iteratively applied on its own predictions to produce longer forecasts. During training, the model predicts up to three days ahead, and is evaluated up to five days. Keisler proposed a graph neural network for this task, but we show that spherical CNNs can match the GNN accuracy in the same setting.

Iterative weather forecasting up to five days (120h) ahead with spherical CNNs. The animations show the specific humidity forecast at a given pressure and its error.
Wind speed and temperature forecasts with spherical CNNs.

Additional resources

Our JAX library for efficient spherical CNNs is now available. We have shown applications to molecular property regression and weather forecasting, and we believe the library will be helpful in other scientific applications, as well as in computer vision and 3D vision.

Weather forecasting is an active area of research at Google with the goal of building more accurate and robust models — like Graphcast, a recent ML-based mid-range forecasting model — and to build tools that enable further advancement across the research community, such as the recently released WeatherBench 2.


Acknowledgements

This work was done in collaboration with Jean-Jacques Slotine, and is based on previous collaborations with Kostas Daniilidis and Christine Allen-Blanchette. We thank Stephan Hoyer, Stephan Rasp, and Ignacio Lopez-Gomez for helping with data processing and evaluation, and Fei Sha, Vivian Yang, Anudhyan Boral, Leonardo Zepeda-Núñez, and Avram Hershko for suggestions and discussions. We are thankful to Michael Riley and Corinna Cortes for supporting and encouraging this project.

Source: Google AI Blog


Google at ICML 2023

Groups across Google actively pursue research in the field of machine learning (ML), ranging from theory and application. We build ML systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, algorithm development, and more. We aim to build a more collaborative ecosystem with the broader ML research community through open-sourcing tools and datasets, publishing our work, and actively participating in conferences.

Google is proud to be a Diamond Sponsor of the 40th International Conference on Machine Learning (ICML 2023), a premier annual conference, which is being held this week in Honolulu, Hawaii. As a leader in ML research, Google has a strong presence at this year’s conference with over 120 accepted papers and active involvement in a number of workshops and tutorials. Google is also proud to be a Platinum Sponsor for both the LatinX in AI and Women in Machine Learning workshops. We look forward to sharing some of our extensive ML research and expanding our partnership with the broader ML research community.

Registered for ICML 2023? We hope you’ll visit the Google booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Visit the @GoogleAI Twitter account to find out about Google booth activities (e.g., demos and Q&A sessions). See Google DeepMind’s blog to learn about their technical participation at ICML 2023.

Take a look below to learn more about the Google research being presented at ICML 2023 (Google affiliations in bold).



Board and Organizing Committee

Board Members include: Corinna Cortes, Hugo Larochelle
Tutorial Chairs include: Hanie Sedghi



Google Research booth activities

Presenters: Bryan Perozzi, Anton Tsitsulin, Brandon Mayer
Title: Unsupervised Graph Embedding @ Google (paper, EXPO workshop)
Tuesday, July 25th at 10:30 AM HST

Presenters: Zheng Xu
Title: Federated Learning of Gboard Language Models with Differential Privacy (paper 1, paper 2, blog post)
Tuesday, July 25th at 3:30 PM HST

Presenters: Thomas Kipf
Title: Self-supervised scene understanding (paper 1, paper 2)
Wednesday, July 26th at 10:30 AM HST

Presenters: Johannes von Oswald, Max Vladymyrov
Title: Transformers learn in-context by gradient descent (paper)
Wednesday, July 26th at 3:30 PM HST



Accepted papers

Scaling Vision Transformers to 22 Billion Parameters (see blog post)
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby

Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias

Best of Both Worlds Policy Optimization
Christoph Dann, Chen-Yu Wei, Julian Zimmert

Inflow, Outflow, and Reciprocity in Machine Learning
Mukund Sundararajan, Walid Krichene

Transformers Learn In-Context by Gradient Descent
Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models
Luke Vilnis, Yury Zemlyanskiy, Patrick Murray*, Alexandre Passos*, Sumit Sanghai

Differentially Private Hierarchical Clustering with Provable Approximation Guarantees (see blog post)
Jacob Imola*, Alessandro Epasto, Mohammad Mahdian, Vincent Cohen-Addad, Vahab Mirrokni

Multi-Epoch Matrix Factorization Mechanisms for Private Machine Learning
Christopher A. Choquette-Choo, H. Brendan McMahan, Keith Rush, Abhradeep Thakurta

Random Classification Noise Does Not Defeat All Convex Potential Boosters Irrespective of Model Choice
Yishay Mansour, Richard Nock, Robert Williamson

Simplex Random Features
Isaac Reid, Krzysztof Choromanski, Valerii Likhosherstov, Adrian Weller

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova

Mu2SLAM: Multitask, Multilingual Speech and Language Models
Yong Cheng, Yu Zhang, Melvin Johnson, Wolfgang Macherey, Ankur Bapna

Robust Budget Pacing with a Single Sample
Santiago Balseiro, Rachitesh Kumar*, Vahab Mirrokni, Balasubramanian Sivan, Di Wang

A Statistical Perspective on Retrieval-Based Models
Soumya Basu, Ankit Singh Rawat, Manzil Zaheer

Approximately Optimal Core Shapes for Tensor Decompositions
Mehrdad Ghadiri, Matthew Fahrbach, Gang Fu, Vahab Mirrokni

Efficient List-Decodable Regression Using Batches
Abhimanyu Das, Ayush Jain*, Weihao Kong, Rajat Sen

Efficient Training of Language Models Using Few-Shot Learning
Sashank J. Reddi, Sobhan Miryoosefi, Stefani Karp, Shankar Krishnan, Satyen Kale, Seungyeon Kim, Sanjiv Kumar

Fully Dynamic Submodular Maximization Over Matroids
Paul Duetting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam

GFlowNet-EM for Learning Compositional Latent Variable Models
Edward J Hu, Nikolay Malkin, Moksh Jain, Katie Everett, Alexandros Graikos, Yoshua Bengio

Improved Online Learning Algorithms for CTR Prediction in Ad Auctions
Zhe Feng, Christopher Liaw, Zixin Zhou

Large Language Models Struggle to Learn Long-Tail Knowledge
Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, Colin Raffel

Multi-channel Autobidding with Budget and ROI Constraints
Yuan Deng, Negin Golrezaei, Patrick Jaillet, Jason Cheuk Nam Liang, Vahab Mirrokni

Multi-layer Neural Networks as Trainable Ladders of Hilbert Spaces
Zhengdao Chen

On User-Level Private Convex Optimization
Badih Ghazi, Pritish Kamath, Ravi Kumar, Raghu Meka, Pasin Manurangsi, Chiyuan Zhang

PAC Generalization via Invariant Representations
Advait U Parulekar, Karthikeyan Shanmugam, Sanjay Shakkottai

Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice
Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard, Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Menard, Mohammad Gheshlaghi Azar, Remi Munos, Olivier Pietquin, Matthieu Geist,Csaba Szepesvari, Wataru Kumagai, Yutaka Matsuo

Speeding Up Bellman Ford via Minimum Violation Permutations
Silvio Lattanzi, Ola Svensson, Sergei Vassilvitskii

Statistical Indistinguishability of Learning Algorithms
Alkis Kalavasis, Amin Karbasi, Shay Moran, Grigoris Velegkas

Test-Time Adaptation with Slot-Centric Models
Mihir Prabhudesai, Anirudh Goyal, Sujoy Paul, Sjoerd van Steenkiste, Mehdi S. M. Sajjadi, Gaurav Aggarwal, Thomas Kipf, Deepak Pathak, Katerina Fragkiadaki>

Algorithms for Bounding Contribution for Histogram Estimation Under User-Level Privacy
Yuhan Liu*, Ananda Theertha Suresh, Wennan Zhu, Peter Kairouz, Marco Gruteser

Bandit Online Linear Optimization with Hints and Queries
Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

CLUTR: Curriculum Learning via Unsupervised Task Representation Learning
Abdus Salam Azad, Izzeddin Gur, Jasper Emhoff, Nathaniel Alexis, Aleksandra Faust, Pieter Abbeel, Ion Stoica

CSP: Self-Supervised Contrastive Spatial Pre-training for Geospatial-Visual Representations
Gengchen Mai, Ni Lao, Yutong He, Jiaming Song, Stefano Ermon

Ewald-Based Long-Range Message Passing for Molecular Graphs
Arthur Kosmala, Johannes Gasteiger, Nicholas Gao, Stephan Günnemann

Fast (1+ε)-Approximation Algorithms for Binary Matrix Factorization
Ameya Velingker, Maximilian Vötsch, David Woodruff, Samson Zhou

Federated Linear Contextual Bandits with User-Level Differential Privacy
Ruiquan Huang, Huanyu Zhang, Luca Melis, Milan Shen, Meisam Hejazinia, Jing Yang

Investigating the Role of Model-Based Learning in Exploration and Transfer
Jacob C Walker, Eszter Vértes, Yazhe Li, Gabriel Dulac-Arnold, Ankesh Anand, Theophane Weber, Jessica B Hamrick

Label Differential Privacy and Private Training Data Release
Robert Busa-Fekete, Andres Munoz, Umar Syed, Sergei Vassilvitskii

Lifelong Language Pretraining with Distribution-Specialized Experts
Wuyang Chen*, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cui

Multi-User Reinforcement Learning with Low Rank Rewards
Dheeraj Mysore Nagaraj, Suhas S Kowshik, Naman Agarwal, Praneeth Netrapalli, Prateek Jain

Multi-View Masked World Models for Visual Robotic Manipulation
Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, Pieter Abbeel

PaLM-E: An Embodied Multimodal Language Model (see blog post)
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter,Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence

Private Federated Learning with Autotuned Compression
Enayat Ullah*, Christopher A. Choquette-Choo, Peter Kairouz, Sewoong Oh

Refined Regret for Adversarial MDPs with Linear Function Approximation
Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert

Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory
Justin Cui, Ruoche Wan, Si Si, Cho-Jui Hsieh

SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance
Amit Attia, Tomer Koren

The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation
Mark Rowland, Yunhao Tang, Clare Lyle, Rémi Munos, Marc G. Bellemare, Will Dabney

Unveiling The Mask of Position-Information Pattern Through the Mist of Image Features
Chieh Hubert Lin, Hung-Yu Tseng, Hsin-Ying Lee, Maneesh Kumar Singh, Ming-Hsuan Yang

User-Level Private Stochastic Convex Optimization with Optimal Rates
Raef Bassily, Ziteng Sun

A Simple Zero-Shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models
James Urquhart Allingham*, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, Balaji Lakshminarayanan

Can Large Language Models Reason About Program Invariants?
Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, Pengcheng Yin

Concurrent Shuffle Differential Privacy Under Continual Observation
Jay Tenenbaum, Haim Kaplan, Yishay Mansour, Uri Stemmer

Constant Matters: Fine-Grained Error Bound on Differentially Private Continual Observation
Hendrik Fichtenberger, Monika Henzinger, Jalaj Upadhyay

Cross-Entropy Loss Functions: Theoretical Analysis and Applications
Anqi Mao, Mehryar Mohri, Yutao Zhong

Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation
Orin Levy, Alon Cohen, Asaf Cassel, Yishay Mansour

Fairness in Streaming Submodular Maximization Over a Matroid Constraint
Marwa El Halabi, Federico Fusco, Ashkan Norouzi-Fard, Jakab Tardos, Jakub Tarnawski

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (see blog post)
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, Adam Roberts

Graph Reinforcement Learning for Network Control via Bi-level Optimization
Daniele Gammelli, James Harrison, Kaidi Yang, Marco Pavone, Filipe Rodrigues, Francisco C. Pereira

Learning-Augmented Private Algorithms for Multiple Quantile Release
Mikhail Khodak*, Kareem Amin, Travis Dick, Sergei Vassilvitskii

LegendreTron: Uprising Proper Multiclass Loss Learning
Kevin H Lam, Christian Walder, Spiridon Penev, Richard Nock

Measuring the Impact of Programming Language Distribution
Gabriel Orlanski*, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta*

Multi-task Differential Privacy Under Distribution Skew
Walid Krichene, Prateek Jain, Shuang Song, Mukund Sundararajan, Abhradeep Thakurta, Li Zhang

Muse: Text-to-Image Generation via Masked Generative Transformers
Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, José Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan

On the Convergence of Federated Averaging with Cyclic Client Participation
Yae Jee Cho, Pranay Sharma, Gauri Joshi, Zheng Xu, Satyen Kale, Tong Zhang

Optimal Stochastic Non-smooth Non-convex Optimization Through Online-to-Non-convex Conversion
Ashok Cutkosky, Harsh Mehta, Francesco Orabona

Out-of-Domain Robustness via Targeted Augmentations
Irena Gao, Shiori Sagawa, Pang Wei Koh, Tatsunori Hashimoto, Percy Liang

Polynomial Time and Private Learning of Unbounded Gaussian Mixture Models
Jamil Arbas, Hassan Ashtiani, Christopher Liaw

Pre-computed Memory or On-the-Fly Encoding? A Hybrid Approach to Retrieval Augmentation Makes the Most of Your Compute
Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua Ainslie, Sumit Sanghai, Fei Sha, William W. Cohen

Scalable Adaptive Computation for Iterative Generation
Allan Jabri*, David J. Fleet, Ting Chen

Scaling Spherical CNNs
Carlos Esteves, Jean-Jacques Slotine, Ameesh Makadia

STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition
Yucheng Lu, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christopher De Sa, Amir Yazdanbakhsh

Stratified Adversarial Robustness with Rejection
Jiefeng Chen, Jayaram Raghuram, Jihye Choi, Xi Wu, Yingyu Liang, Somesh Jha

When Does Privileged information Explain Away Label Noise?
Guillermo Ortiz-Jimenez*, Mark Collier, Anant Nawalgaria, Alexander D'Amour, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou

Adaptive Computation with Elastic Input Sequence
Fuzhao Xue*, Valerii Likhosherstov, Anurag Arnab, Neil Houlsby, Mostafa Dehghani, Yang You

Can Neural Network Memorization Be Localized?
Pratyush Maini, Michael C. Mozer, Hanie Sedghi, Zachary C. Lipton, J. Zico Kolter, Chiyuan Zhang

Controllability-Aware Unsupervised Skill Discovery
Seohong Park, Kimin Lee, Youngwoon Lee, Pieter Abbeel

Efficient Learning of Mesh-Based Physical Simulation with Bi-Stride Multi-Scale Graph Neural Network
Yadi Cao, Menglei Chai, Minchen Li, Chenfanfu Jiang

Federated Heavy Hitter Recovery Under Linear Sketching
Adria Gascon, Peter Kairouz, Ziteng Sun, Ananda Theertha Suresh

Graph Generative Model for Benchmarking Graph Neural Networks
Minji Yoon, Yue Wu, John Palowitch, Bryan Perozzi, Russ Salakhutdinov

H-Consistency Bounds for Pairwise Misranking Loss Surrogates
Anqi Mao, Mehryar Mohri, Yutao Zhong

Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation
Uri Sherman, Tomer Koren, Yishay Mansour

Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames
Ondrej Biza*, Sjoerd van Steenkiste, Mehdi S. M. Sajjadi, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Thomas Kipf

Multi-task Off-Policy Learning from Bandit Feedback
Joey Hong, Branislav Kveton, Manzil Zaheer, Sumeet Katariya, Mohammad Ghavamzadeh

Optimal No-Regret Learning for One-Sided Lipschitz Functions
Paul Duetting, Guru Guruganesh, Jon Schneider, Joshua Ruizhi Wang

Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games
Batuhan Yardim, Semih Cayci, Matthieu Geist, Niao He

Regret Minimization and Convergence to Equilibria in General-Sum Markov Games
Liad Erez, Tal Lancewicki, Uri Sherman, Tomer Koren, Yishay Mansour

Reinforcement Learning Can Be More Efficient with Multiple Rewards
Christoph Dann, Yishay Mansour, Mehryar Mohri

Reinforcement Learning with History-Dependent Dynamic Contexts
Guy Tennenholtz, Nadav Merlis, Lior Shani, Martin Mladenov, Craig Boutlier

User-Defined Event Sampling and Uncertainty Quantification in Diffusion Models for Physical Dynamical Systems
Marc Anton Finzi*, Anudhyan Boral, Andrew Gordon Wilson, Fei Sha, Leonardo Zepeda-Nunez

Discrete Key-Value Bottleneck
Frederik Träuble, Anirudh Goyal, Nasim Rahaman, Michael Curtis Mozer, Kenji Kawaguchi, Yoshua Bengio, Bernhard Schölkopf

DSGD-CECA: Decentralized SGD with Communication-Optimal Exact Consensus Algorithm
Lisang Ding, Kexin Jin, Bicheng Ying, Kun Yuan, Wotao Yin

Exphormer: Sparse Transformers for Graphs
Hamed Shirzad, Ameya Velingker, Balaji Venkatachalam, Danica J. Sutherland, Ali Kemal Sinop

Fast, Differentiable and Sparse Top-k: A Convex Analysis Perspective
Michael Eli Sander*, Joan Puigcerver, Josip Djolonga, Gabriel Peyré, Mathieu Blondel

Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation
Aditya Mate, Bryan Wilder, Aparna Taneja, Milind Tambe

In Search for a Generalizable Method for Source Free Domain Adaptation
Malik Boudiaf*, Tom Denton, Bart van Merrienboer, Vincent Dumoulin, Eleni Triantafillou

Learning Rate Schedules in the Presence of Distribution Shift
Matthew Fahrbach, Adel Javanmard, Vahab Mirrokni, Pratik Worah

Not All Semantics Are Created Equal: Contrastive Self-Supervised Learning with Automatic Temperature Individualization
Zi-Hao Qiu, Quanqi Hu, Zhuoning Yuan, Denny Zhou, Lijun Zhang, Tianbao Yang

On the Relationship Between Explanation and Prediction: A Causal View
Amir-Hossein Karimi*, Krikamol Muandet, Simon Kornblith, Bernhard Schölkopf, Been Kim

On the Role of Attention in Prompt-Tuning
Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis

PLay: Parametrically Conditioned Layout Generation Using Latent Diffusion
Chin-Yi Cheng, Forrest Huang, Gang Li, Yang Li

The Power of Learned Locally Linear Models for Nonlinear Policy Optimization
Daniel Pfrommer, Max Simchowitz, Tyler Westenbroek, Nikolai Matni, Stephen Tu

Relevant Walk Search for Explaining Graph Neural Networks
Ping Xiong, Thomas Schnake, Michael Gastegger, Grégoire Montavon, Klaus Robert Muller,Shinichi Nakajima

Repository-Level Prompt Generation for Large Language Models of Code
Disha Shrivastava, Hugo Larochelle, Daniel Tarlow

Robust and Private Stochastic Linear Bandits
Vasileios Charisopoulos*, Hossein Esfandiari, Vahab Mirrokni

Simple Diffusion: End-to-End Diffusion for High Resolution Images
Emiel Hoogeboom, Jonathan Heek, Tim Salimans

Tied-Augment: Controlling Representation Similarity Improves Data Augmentation
Emirhan Kurtulus, Zichao Li, Yann Dauphin, Ekin D. Cubuk

Why Is Public Pre-Training Necessary for Private Model Training?
Arun Ganesh, Mahdi Haghifam*, Milad Nasr, Sewoong Oh, Thomas Steinke, Om Thakkar, Abhradeep Guha Thakurta, Lun Wang

A Connection Between One-Step RL and Critic Regularization in Reinforcement Learning
Benjamin Eysenbach, Matthieu Geist, Sergey Levine, Ruslan Salakhutdinov

Beyond Uniform Lipschitz Condition in Differentially Private Optimization
Rudrajit Das*, Satyen Kale, Zheng Xu, Tong Zhang, Sujay Sanghavi

Efficient Graph Field Integrators Meet Point Clouds
Krzysztof Choromanski, Arijit Sehanobish, Han Lin, Yunfan Zhao, Eli Berger, Tetiana Parshakova, Alvin Pan, David Watkins, Tianyi Zhang, Valerii Likhosherstov, Somnath Basu Roy Chowdhury, Avinava Dubey, Deepali Jain, Tamas Sarlos, Snigdha Chaturvedi, Adrian Weller

Fast as CHITA: Neural Network Pruning with Combinatorial Optimization
Riade Benbaki, Wenyu Chen, Xiang Meng, Hussein Hazimeh, Natalia Ponomareva, Zhe Zhao, Rahul Mazumder

Jump-Start Reinforcement Learning (see blog post)
Ikechukwu Uchendu*, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, Sergey Levine, Karol Hausman

Learning in POMDPs is Sample-Efficient with Hindsight Observability
Jonathan Lee, Alekh Agarwal, Christoph Dann, Tong Zhang

Low-Variance Gradient Estimation in Unrolled Computation Graphs with ES-Single
Paul Vicol

Masked Trajectory Models for Prediction, Representation, and Control
Philipp Wu, Arjun Majumdar, Kevin Stone, Yixin Lin, Igor Mordatch, Pieter Abbeel, Aravind Rajeswaran

Overcoming Simplicity Bias in Deep Networks Using a Feature Sieve
Rishabh Tiwari, Pradeep Shenoy

Pairwise Ranking Losses of Click-Through Rates Prediction for Welfare Maximization in Ad Auctions
Boxiang Lyu, Zhe Feng, Zachary Robertson, Sanmi Koyejo

Predictive Flows for Faster Ford-Fulkerson
Sami Davies, Benjamin Moseley, Sergei Vassilvitskii, Yuyan Wang

Scaling Laws for Multilingual Neural Machine Translation
Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag, Orhan Firat

Sequential Monte Carlo Learning for Time Series Structure Discovery
Feras Saad, Brian Patton, Matthew Douglas Hoffman, Rif A. Saurous, Vikash Mansinghka

Stochastic Gradient Succeeds for Bandits
Jincheng Mei, Zixin Zhong, Bo Dai, Alekh Agarwal, Csaba Szepesvari, Dale Schuurmans

Subset-Based Instance Optimality in Private Estimation
Travis Dick, Alex Kulesza, Ziteng Sun, Ananda Theertha Suresh

The Unreasonable Effectiveness of Few-Shot Learning for Machine Translation
Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, Orhan Firat



Tutorials

Self-Supervised Learning in Vision: from Research Advances to Best Practices
Xinlei Chen, Ishan Misra, Randall Balestriero, Mathilde Caron, Christoph Feichtenhofer, Mark Ibrahim

How to DP-fy ML: A Practical Tutorial to Machine Learning with Differential Privacy (see blog post)
Sergei Vassilvitskii, Natalia Ponomareva, Zheng Xu

Recent Advances in the Generalization Theory of Neural Networks
Tengyu Ma, Alex Damian



EXPO Day workshops

Graph Neural Networks in Tensorflow: A Practical Guide
Workshop Organizers include: Bryan Perozzi, Anton Tsitsulin, Brandon Mayer, Jonathan Halcrow



Google sponsored affinity workshops

LatinX in AI (LAXAI)
Platinum Sponsor
Keynote Speaker: Monica Ribero
Panelist: Yao Qin

Women in Machine Learning (WiML)
Platinum Sponsor
Panelists: Yao Qin



Workshops

Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities
Organizer: Peter Kairouz, Zheng Xu
Speaker: Brendan McMahan

Interpretable Machine Learning in Healthcare (IMLH)
Organizer: Ramin Zabih

Knowledge and Logical Reasoning in the Era of Data-Driven Learning
Organizer: Beliz Günel

The Many Facets of Preference-Based Learning (MFPL)
Organizer: Robert Busa-Fekete, Mohammad Ghavamzadeh

The Synergy of Scientific and Machine Learning Modelling (SynS & ML)
Speaker: Sercan Arik

Theory of Mind in Communicating Agents
Organizer: Pei Zhou

Artificial Intelligence & Human Computer Interaction
Organizer: Yang Li, Forrest Huang

Data-Centric Machine Learning Research (DMLR)
Organizer: Alicia Parrish, Najoung Kim
Speaker: Peter Mattson

Neural Compression: from Information Theory to Applications
Speaker: Johannes Ballé
Panelist: George Toderici

Neural Conversational AI Workshop - What’s Left to TEACH (Trustworthy, Enhanced, Adaptable, Capable and Human-centric) Chatbots?
Organizer: Ahmad Beirami

Spurious Correlations, Invariance and Stability (SCIS)
Organizer: Amir Feder


* Work done while at Google

Source: Google AI Blog


Predicting Spreadsheet Formulas from Semi-structured Contexts

Hundreds of millions of people use spreadsheets, and formulas in those spreadsheets allow users to perform sophisticated analyses and transformations on their data. Although formula languages are simpler than general-purpose programming languages, writing these formulas can still be tedious and error-prone, especially for end-users. We've previously developed tools to understand patterns in spreadsheet data to automatically fill missing values in a column, but they were not built to support the process of writing formulas.

In "SpreadsheetCoder: Formula Prediction from Semi-structured Context", published at ICML 2021, we describe a new model that learns to automatically generate formulas based on the rich context around a target cell. When a user starts writing a formula with the “=” sign in a target cell, the system generates possible relevant formulas for that cell by learning patterns of formulas in historical spreadsheets. The model uses the data present in neighboring rows and columns of the target cell as well as the header row as context. It does this by first embedding the contextual structure of a spreadsheet table, consisting of neighboring and header cells, and then generates the desired spreadsheet formula using this contextual embedding. The formula is generated as two components: 1) the sequence of operators (e.g., SUM, IF, etc.), and 2) the corresponding ranges on which the operators are applied (e.g., “A2:A10”). The feature based on this model is now generally available to Google Sheets users.

Given the user’s intent to enter a formula in cells B7, C7, and D7, the system automatically infers the most likely formula the user might want to write in those cells.
Given the target cell (D4), the model uses the header and surrounding cell values as context to generate the target formula consisting of the corresponding sequence of operators and range.

Model Architecture
The model uses an encoder-decoder architecture that allows the flexibility to embed multiple types of contextual information (such as that contained in neighboring rows, columns, headers, etc.) in the encoder, which the decoder can use to generate desired formulas. To compute the embedding of the tabular context, it first uses a BERT-based architecture to encode several rows above and below the target cell (together with the header row). The content in each cell includes its data type (such as numeric, string, etc.) and its value, and the cell contents present in the same row are concatenated together into a token sequence to be embedded using the BERT encoder. Similarly, it encodes several columns to the left and to the right of the target cell. Finally, it performs a row-wise and column-wise convolution on the two BERT encoders to compute an aggregated representation of the context.

The decoder uses a long short-term memory (LSTM) architecture to generate the desired target formula as a sequence of tokens by first predicting a formula-sketch (consisting of formula operators without ranges) and then generating the corresponding ranges using cell addresses relative to the target cell. It additionally leverages an attention mechanism to compute attention vectors over the header and cell data, which are concatenated to the LSTM output layer before making the predictions.

The overall architecture of the formula prediction model.

In addition to the data present in neighboring rows and columns, the model also leverages additional information from the high-level sheet structure, such as headers. Using TPUs for model predictions, we ensure low latency on generating formula suggestions and are able to handle more requests on fewer machines.

Leveraging the high-level spreadsheet structure, the model can learn ranges that span thousands of rows.

Results
In the paper, we trained the model on a corpus of spreadsheets created by and shared with Googlers. We split 46k Google Sheets with formulas into 42k for training, 2.3k for validation, and 1.7k for testing. The model achieves a 42.5% top-1 full-formula accuracy, and 57.4% top-1 formula-sketch accuracy, both of which we find high enough to be practically useful in our initial user studies. We perform an ablation study, in which we test several simplifications of the model by removing different components, and find that having row- and column-based context embedding as well as header information is important for models to perform well.

The performance of different ablations of our model with increasing lengths of the target formula.

Conclusion
Our model illustrates the benefits of learning to represent the two-dimensional relational structure of the spreadsheet tables together with high-level structural information, such as table headers, to facilitate formula predictions. There are several exciting research directions, both in terms of designing new model architectures to incorporate more tabular structure as well as extending the model to support more applications such as bug detection and automated chart creation in spreadsheets. We are also looking forward to seeing how users use this feature and learning from feedback for future improvements.

Acknowledgements
We gratefully acknowledge the key contributions of the other team members, including Alexander Burmistrov, Xinyun Chen, Hanjun Dai, Prashant Khurana, Petros Maniatis, Rahul Srinivasan, Charles Sutton, Amanuel Taddesse, Peilun Zhang, and Denny Zhou.

Source: Google AI Blog


Google at ICML 2021

Groups across Google are actively pursuing research across the field of machine learning, ranging from theory to application. With scalable tools and architectures, we build machine learning systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, and more.

Google is proud to be a Platinum Sponsor of the thirty-eighth International Conference on Machine Learning (ICML 2021), a premier annual event happening this week. As a leader in machine learning research — with over 100 accepted publications and Googlers participating in workshops — we look forward to our continued partnership with the broader machine learning research community.

Registered for ICML 2021? We hope you’ll visit the Google virtual booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Take a look below to learn more about the Google research being presented at ICML 2021 (Google affiliations in bold).

Organizing Committee
ICML Board Members include: Corinna Cortes, Hugo Larochelle, Shakir Mohamed
ICML Emeritus Board includes: William Cohen, Andrew McCallum
Tutorial Co-Chair member: Quoc Lee

Publications
Attention Is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas

Scalable Evaluation of Multi-agent Reinforcement Learning with Melting Pot
Joel Z. Leibo, Edgar Duéñez-Guzmán, Alexander Sasha Vezhnevets, John P. Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charles Beattie, Igor Mordatch, Thore Graepel

On the Optimality of Batch Policy Optimization Algorithms
Chenjun Xiao, Yifan Wu, Tor Lattimore, Bo Dai, Jincheng Mei, Lihong Li*, Csaba Szepesvari, Dale Schuurmans

Low-Rank Sinkhorn Factorization
Meyer Scetbon, Marco Cuturi, Gabriel Peyré

Oops I Took A Gradient: Scalable Sampling for Discrete Distributions
Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, Chris J. Maddison

PID Accelerated Value Iteration Algorithm
Amir-Massoud Farahmand, Mohammad Ghavamzadeh

Dueling Convex Optimization
Aadirupa Saha, Tomer Koren, Yishay Mansour

What Are Bayesian Neural Network Posteriors Really Like?
Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, Andrew Gordon Wilson

Offline Reinforcement Learning with Pseudometric Learning
Robert Dadashi, Shideh Rezaeifar, Nino Vieillard, Léonard Hussenot, Olivier Pietquin, Matthieu Geist

Revisiting Rainbow: Promoting More Insightful and Inclusive Deep Reinforcement Learning Research (see blog post)
Johan S. Obando-Ceron, Pablo Samuel Castro

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL
Seyed Kamyar Seyed Ghasemipour*, Dale Schuurmans, Shixiang Shane Gu

Variational Data Assimilation with a Learned Inverse Observation Operator
Thomas Frerix, Dmitrii Kochkov, Jamie A. Smith, Daniel Cremers, Michael P. Brenner, Stephan Hoyer

Tilting the Playing Field: Dynamical Loss Functions for Machine Learning
Miguel Ruiz-Garcia, Ge Zhang, Samuel S. Schoenholz, Andrea J. Liu

Model-Based Reinforcement Learning via Latent-Space Collocation
Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch, Sergey Levine

Momentum Residual Neural Networks
Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

OmniNet: Omnidirectional Representations from Transformers
Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler

Synthesizer: Rethinking Self-Attention for Transformer Models
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

Towards Domain-Agnostic Contrastive Learning
Vikas Verma, Minh-Thang Luong, Kenji Kawaguchi, Hieu Pham, Quoc V. Le

Randomized Entity-wise Factorization for Multi-agent Reinforcement Learning
Shariq Iqbal, Christian A. Schroeder de Witt, Bei Peng, Wendelin Böhmer, Shimon Whiteson, Fei Sha

LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning
Yuhuai Wu, Markus Rabe, Wenda Li, Jimmy Ba, Roger Grosse, Christian Szegedy

Emergent Social Learning via Multi-agent Reinforcement Learning
Kamal Ndousse, Douglas Eck, Sergey Levine, Natasha Jaques

Improved Contrastive Divergence Training of Energy-Based Models
Yilun Du, Shuang Li, Joshua Tenenbaum, Igor Mordatch

Characterizing Structural Regularities of Labeled Data in Overparameterized Models
Ziheng Jiang*, Chiyuan Zhang, Kunal Talwar, Michael Mozer

Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills
Yevgen Chebotar, Karol Hausman, Yao Lu, Ted Xiao, Dmitry Kalashnikov, Jake Varley, Alex Irpan, Benjamin Eysenbach, Ryan Julian, Chelsea Finn, Sergey Levine

PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning
Angelos Filos, Clare Lyle, Yarin Gal, Sergey Levine, Natasha Jaques, Gregory Farquhar

EfficientNetV2: Smaller Models and Faster Training
Mingxing Tan, Quoc V. Le

Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies
Paul Vicol, Luke Metz, Jascha Sohl-Dickstein

Federated Composite Optimization
Honglin Yuan*, Manzil Zaheer, Sashank Reddi

Light RUMs
Flavio Chierichetti, Ravi Kumar, Andrew Tomkins

Catformer: Designing Stable Transformers via Sensitivity Analysis
Jared Quincy Davis, Albert Gu, Krzysztof Choromanski, Tri Dao, Christopher Re, Chelsea Finn, Percy Liang

Representation Matters: Offline Pretraining for Sequential Decision Making
Mengjiao Yang, Ofir Nachum

Variational Empowerment as Representation Learning for Goal-Conditioned Reinforcement Learning
Jongwook Choi*, Archit Sharma*, Honglak Lee, Sergey Levine, Shixiang Shane Gu

Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization
Wesley Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux

Whitening and Second Order Optimization Both Make Information in the Dataset Unusable During Training, and Can Reduce or Prevent Generalization
Neha S. Wadia, Daniel Duckworth, Samuel S. Schoenholz, Ethan Dyer, Jascha Sohl-Dickstein

Understanding Invariance via Feedforward Inversion of Discriminatively Trained Classifiers
Piotr Teterwak*, Chiyuan Zhang, Dilip Krishnan, Michael C. Mozer

Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning
Hiroki Furuta, Tatsuya Matsushima, Tadashi Kozuno, Yutaka Matsuo, Sergey Levine, Ofir Nachum, Shixiang Shane Gu

Hyperparameter Selection for Imitation Learning
Leonard Hussenot, Marcin Andrychowicz, Damien Vincent, Robert Dadashi, Anton Raichuk, Lukasz Stafiniak, Sertan Girgin, Raphael Marinier, Nikola Momchev, Sabela Ramos, Manu Orsini, Olivier Bachem, Matthieu Geist, Olivier Pietquin

Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces
Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank J. Reddi, Sanjiv Kumar

Revenue-Incentive Tradeoffs in Dynamic Reserve Pricing
Yuan Deng, Sebastien Lahaie, Vahab Mirrokni, Song Zuo

Debiasing a First-Order Heuristic for Approximate Bi-Level Optimization
Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Davis, Adrian Weller

Characterizing the Gap Between Actor-Critic and Policy Gradient
Junfeng Wen, Saurabh Kumar, Ramki Gummadi, Dale Schuurmans

Composing Normalizing Flows for Inverse Problems
Jay Whang, Erik Lindgren, Alexandros Dimakis

Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with √T Regret
Asaf Cassel, Tomer Koren

Learning to Price Against a Moving Target
Renato Paes Leme, Balasubramanian Sivan, Yifeng Teng, Pratik Worah

Fairness and Bias in Online Selection
Jose Correa, Andres Cristi, Paul Duetting, Ashkan Norouzi-Fard

The Impact of Record Linkage on Learning from Feature Partitioned Data
Richard Nock, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Jakub Nabaglo, Giorgio Patrini, Guillaume Smith, Brian Thorne

Reserve Price Optimization for First Price Auctions in Display Advertising
Zhe Feng*, Sébastien Lahaie, Jon Schneider, Jinchao Ye

A Regret Minimization Approach to Iterative Learning Control
Naman Agarwal, Elad Hazan, Anirudha Majumdar, Karan Singh

A Statistical Perspective on Distillation
Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

Best Model Identification: A Rested Bandit Formulation
Leonardo Cella, Massimiliano Pontil, Claudio Gentile

Generalised Lipschitz Regularisation Equals Distributional Robustness
Zac Cranko, Zhan Shi, Xinhua Zhang, Richard Nock, Simon Kornblith

Stochastic Multi-armed Bandits with Unrestricted Delay Distributions
Tal Lancewicki, Shahar Segal, Tomer Koren, Yishay Mansour

Regularized Online Allocation Problems: Fairness and Beyond
Santiago Balseiro, Haihao Lu, Vahab Mirrokni

Implicit Rate-Constrained Optimization of Non-decomposable Objectives
Abhishek Kumar, Harikrishna Narasimhan, Andrew Cotter

Leveraging Non-uniformity in First-Order Non-Convex Optimization
Jincheng Mei, Yue Gao, Bo Dai, Csaba Szepesvari, Dale Schuurmans

Dynamic Balancing for Model Selection in Bandits and RL
Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, Manish Purohit

Adversarial Dueling Bandits
Aadirupa Saha, Tomer Koren, Yishay Mansour

Optimizing Black-Box Metrics with Iterative Example Weighting
Gaurush Hiranandani*, Jatin Mathur, Harikrishna Narasimhan, Mahdi Milani Fard, Oluwasanmi Koyejo

Relative Deviation Margin Bounds
Corinna Cortes, Mehryar Mohri, Ananda Theertha Suresh

MC-LSTM: Mass-Conserving LSTM
Pieter-Jan Hoedt, Frederik Kratzert, Daniel Klotz, Christina Halmich, Markus Holzleitner, Grey Nearing, Sepp Hochreiter, Günter Klambauer

12-Lead ECG Reconstruction via Koopman Operators
Authors:Tomer Golany, Kira Radinsky, Daniel Freedman, Saar Minha

Finding Relevant Information via a Discrete Fourier Expansion
Mohsen Heidari, Jithin Sreedharan, Gil Shamir, Wojciech Szpankowski

LEGO: Latent Execution-Guided Reasoning for Multi-hop Question Answering on Knowledge Graphs
Hongyu Ren, Hanjun Dai, Bo Dai, Xinyun Chen, Michihiro Yasunaga, Haitian Sun, Dale Schuurmans, Jure Leskovec, Denny Zhou

SpreadsheetCoder: Formula Prediction from Semi-structured Context
Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, Denny Zhou

Combinatorial Blocking Bandits with Stochastic Delays
Alexia Atsidakou, Orestis Papadigenopoulos, Soumya Basu, Constantine Caramani, Sanjay Shakkottai

Beyond log2(T) Regret for Decentralized Bandits in Matching Markets
Soumya Basu, Karthik Abinav Sankararaman, Abishek Sankararaman

Robust Pure Exploration in Linear Bandits with Limited Budget
Ayya Alieva, Ashok Cutkosky, Abhimanyu Das

Latent Programmer: Discrete Latent Codes for Program Synthesis
Joey Hong, David Dohan, Rishabh Singh, Charles Sutton, Manzil Zaheer

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (see blog post)
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig

On Linear Identifiability of Learned Representations
Geoffrey Roeder, Luke Metz, Diederik P. Kingma

Hierarchical Clustering of Data Streams: Scalable Algorithms and Approximation Guarantees
Anand Rajagopalan, Fabio Vitale, Danny Vainstein, Gui Citovsky, Cecilia M Procopiuc, Claudio Gentile

Differentially Private Quantiles
Jennifer Gillenwater, Matthew Joseph, Alex Kulesza

Active Covering
Heinrich Jiang, Afshin Rostamizadeh

Sharf: Shape-Conditioned Radiance Fields from a Single View
Konstantinos Rematas, Ricardo Martin-Brualla, Vittorio Ferrari

Learning a Universal Template for Few-Shot Dataset Generalization
Eleni Triantafillou*, Hugo Larochelle, Richard Zemel, Vincent Dumoulin

Private Alternating Least Squares: Practical Private Matrix Completion with Tighter Rates
Steve Chien, Prateek Jain, Walid Krichene, Steffen Rendle, Shuang Song, Abhradeep Thakurta, Li Zhang

Differentially-Private Clustering of Easy Instances
Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer, Eliad Tsfadia

Label-Only Membership Inference Attacks
Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, Nicolas Papernot

Neural Feature Matching in Implicit 3D Representations
Yunlu Chen, Basura Fernando, Hakan Bilen, Thomas Mensink, Efstratios Gavves

Locally Private k-Means in One Round
Alisa Chang, Badih Ghazi, Ravi Kumar, Pasin Manurangsi

Large-Scale Meta-learning with Continual Trajectory Shifting
Jaewoong Shin, Hae Beom Lee, Boqing Gong, Sung Ju Hwang

Statistical Estimation from Dependent Data
Vardis Kandiros, Yuval Dagan, Nishanth Dikkala, Surbhi Goel, Constantinos Daskalakis

Oneshot Differentially Private Top-k Selection
Gang Qiao, Weijie J. Su, Li Zhang

Unsupervised Part Representation by Flow Capsules
Sara Sabour, Andrea Tagliasacchi, Soroosh Yazdani, Geoffrey E. Hinton, David J. Fleet

Private Stochastic Convex Optimization: Optimal Rates in L1 Geometry
Hilal Asi, Vitaly Feldman, Tomer Koren, Kunal Talwar

Practical and Private (Deep) Learning Without Sampling or Shuffling
Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, Zheng Xu

Differentially Private Aggregation in the Shuffle Model: Almost Central Accuracy in Almost a Single Message
Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, Amer Sinha

Leveraging Public Data for Practical Private Query Release
Terrance Liu, Giuseppe Vietri, Thomas Steinke, Jonathan Ullman, Zhiwei Steven Wu

Meta-Thompson Sampling
Branislav Kveton, Mikhail Konobeev, Manzil Zaheer, Chih-wei Hsu, Martin Mladenov, Craig Boutilier, Csaba Szepesvári

Implicit-PDF: Non-parametric Representation of Probability Distributions on the Rotation Manifold
Kieran A Murphy, Carlos Esteves, Varun Jampani, Srikumar Ramalingam, Ameesh Makadia

Improving Ultrametrics Embeddings Through Coresets
Vincent Cohen-Addad, Rémi de Joannis de Verclos, Guillaume Lagarde

A Discriminative Technique for Multiple-Source Adaptation
Corinna Cortes, Mehryar Mohri, Ananda Theertha Suresh, Ningshan Zhang

Self-Supervised and Supervised Joint Training for Resource-Rich Machine Translation
Yong Cheng, Wei Wang*, Lu Jiang, Wolfgang Macherey

Correlation Clustering in Constant Many Parallel Rounds
Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski

Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time
Laxman Dhulipala, David Eisenstat, Jakub Łącki, Vahab Mirrokni, Jessica Shi

Meta-learning Bidirectional Update Rules
Mark Sandler, Max Vladymyrov, Andrey Zhmoginov, Nolan Miller, Andrew Jackson, Tom Madams, Blaise Aguera y Arcas

Discretization Drift in Two-Player Games
Mihaela Rosca, Yan Wu, Benoit Dherin, David G.T. Barrett

Reasoning Over Virtual Knowledge Bases With Open Predicate Relations
Haitian Sun*, Pat Verga, Bhuwan Dhingra, Ruslan Salakhutdinov, William W. Cohen

Learn2Hop: Learned Optimization on Rough Landscapes
Amil Merchant, Luke Metz, Samuel Schoenholz, Ekin Cubuk

Locally Adaptive Label Smoothing Improves Predictive Churn
Dara Bahri, Heinrich Jiang

Overcoming Catastrophic Forgetting by Bayesian Generative Regularization
Patrick H. Chen, Wei Wei, Cho-jui Hsieh, Bo Dai

Workshops (only Google affiliations are noted)
LatinX in AI (LXAI) Research at ICML 2021
Hosts: Been Kim, Natasha Jaques

Uncertainty and Robustness in Deep Learning
Organizers: Balaji Lakshminarayanan, Jasper Snoek Invited Speaker: Dustin Tran

Reinforcement Learning for Real Life
Organizers: Minmin Chen, Lihong Li Invited Speaker: Ed Chi

Interpretable Machine Learning in Healthcare
Organizers: Alan Karthikesalingam Invited Speakers: Abhijit Guha Roy, Jim Winkens

The Neglected Assumptions in Causal Inference
Organizer: Alexander D'Amour

ICML Workshop on Algorithmic Recourse
Invited Speakers: Been Kim, Berk Ustun

A Blessing in Disguise: The Prospects and Perils of Adversarial Machine Learning
Invited Speaker: Nicholas Carlini

Overparameterization: Pitfalls and Opportunities
Organizers: Yasaman Bahri, Hanie Sedghi

Information-Theoretic Methods for Rigorous, Responsible, and Reliable Machine Learning (ITR3)
Invited Speaker: Thomas Steinke

Beyond First-Order Methods in Machine Learning Systems
Invited Speaker: Courtney Paquette

ICML 2021 Workshop: Self-Supervised Learning for Reasoning and Perception
Invited Speaker: Chelsea Finn

Workshop on Reinforcement Learning Theory
Invited Speaker: Bo Dai

Tutorials (only Google affiliations are noted)
Responsible AI in Industry: Practical Challenges and Lessons Learned
Organizers: Ben Packer

Online and Non-stochastic Control
Organizers: Elad Hazan

Random Matrix Theory and ML (RMT +ML)
Organizers: Fabian Pedregosa, Jeffrey Pennington, Courntey Paquette Self-Attention for Computer Vision Organizers: Prajit Ramachandran, Ashish Vaswani

* Indicates work done while at Google

Source: Google AI Blog


Google at ICML 2020



Machine learning is a key strategic focus at Google, with highly active groups pursuing research in virtually all aspects of the field, including deep learning and more classical algorithms, exploring theory as well as application. We utilize scalable tools and architectures to build machine learning systems that enable us to solve deep scientific and engineering challenges in areas of language, speech, translation, music, visual processing and more.

As a leader in machine learning research, Google is proud to be a Platinum Sponsor of the thirty-seventh International Conference on Machine Learning (ICML 2020), a premier annual event taking place virtually this week. With over 100 accepted publications and Googlers participating in workshops, we look forward to our continued collaboration with the larger machine learning research community.

If you're registered for ICML 2020, we hope you'll visit the Google virtual booth to learn more about the exciting work, creativity and fun that goes into solving some of the field's most interesting challenges. You can also learn more about the Google research being presented at ICML 2020 in the list below (Google affiliations bolded).

ICML Expo
Google Dataset Search: Building an Open Ecosystem for Dataset Discovery
Natasha Noy

End-to-end Bayesian inference workflows in TensorFlow Probability
Colin Carroll

Publications
Population-Based Black-Box Optimization for Biological Sequence Design
Christof Angermueller, David Belanger, Andreea Gane, Zelda Mariet, David Dohan, Kevin Murphy, Lucy Colwell, D Sculley

Predictive Coding for Locally-Linear Control
Rui Shu, Tung Nguyen, Yinlam Chow, Tuan Pham, Khoat Than, Mohammad Ghavamzadeh, Stefano Ermon, Hung Bui

FedBoost: A Communication-Efficient Algorithm for Federated Learning
Jenny Hamer, Mehryar Mohri, Ananda Theertha Suresh

Faster Graph Embeddings via Coarsening
Matthew Fahrbach, Gramoz Goranci, Richard Peng, Sushant Sachdeva, Chi Wang

Revisiting Fundamentals of Experience Replay
William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, Will Dabney

Boosting for Control of Dynamical Systems
Naman Agarwal, Nataly Brukhim, Elad Hazan, Zhou Lu

Neural Clustering Processes
Ari Pakman, Yueqi Wang, Catalin Mitelut, JinHyung Lee, Liam Paninski

The Tree Ensemble Layer: Differentiability Meets Conditional Computation
Hussein Hazimeh, Natalia Ponomareva, Petros Mol, Zhenyu Tan, Rahul Mazumder

Representations for Stable Off-Policy Reinforcement Learning
Dibya Ghosh, Marc Bellemare

REALM: Retrieval-Augmented Language Model Pre-Training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang

Context Aware Local Differential Privacy
Jayadev Acharya, Keith Bonawitz, Peter Kairouz, Daniel Ramage, Ziteng Sun

Scalable Deep Generative Modeling for Sparse Graphs
Hanjun Dai, Azade Nazi, Yujia Li, Bo Dai, Dale Schuurmans

Deep k-NN for Noisy Labels
Dara Bahri, Heinrich Jiang, Maya Gupta

Revisiting Spatial Invariance with Low-Rank Local Connectivity
Gamaleldin F. Elsayed, Prajit Ramachandran, Jonathon Shlens, Simon Kornblith

SCAFFOLD: Stochastic Controlled Averaging for Federated Learning
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, Ananda Theertha Suresh

Incremental Sampling Without Replacement for Sequence Models
Kensen Shi, David Bieber, Charles Sutton

SoftSort: A Continuous Relaxation for the argsort Operator
Sebastian Prillo, Julian Martin Eisenschlos

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation (see blog post)
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson

Learning to Stop While Learning to Predict
Xinshi Chen, Hanjun Dai, Yu Li, Xin Gao, Le Song

Bandits with Adversarial Scaling
Thodoris Lykouris, Vahab Mirrokni, Renato Paes Leme

SimGANs: Simulator-Based Generative Adversarial Networks for ECG Synthesis to Improve Deep ECG Classification
Tomer Golany, Daniel Freedman, Kira Radinsky

Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization
Geoffrey Negiar, Gideon Dresdner, Alicia Yi-Ting Tsai, Laurent El Ghaoui, Francesco Locatello, Robert M. Freund, Fabian Pedregosa

Implicit differentiation of Lasso-type models for hyperparameter optimization
Quentin Bertrand, Quentin Klopfenstein, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon

Infinite attention: NNGP and NTK for deep attention networks
Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, Roman Novak

Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently
Asaf Cassel, Alon Cohen, Tomer Koren

Adversarial Learning Guarantees for Linear Hypotheses and Neural Networks
Pranjal Awasthi, Natalie Frank, Mehryar Mohri

Random Hypervolume Scalarizations for Provable Multi-Objective Black Box Optimization
Daniel Golovin, Qiuyi (Richard) Zhang

Generating Programmatic Referring Expressions via Program Synthesis
Jiani Huang, Calvin Smith, Osbert Bastani, Rishabh Singh, Aws Albarghouthi, Mayur Naik

Optimizing Long-term Social Welfare in Recommender Systems: A Constrained Matching Approach
Martin Mladenov, Elliot Creager, Omer Ben-Porat, Kevin Swersky, Richard Zemel, Craig Boutilier

AutoML-Zero: Evolving Machine Learning Algorithms From Scratch (see blog post)
Esteban Real, Chen Liang, David R. So, Quoc V. Le

How Good is the Bayes Posterior in Deep Neural Networks Really?
Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin

Which Tasks Should Be Learned Together in Multi-task Learning?
Trevor Standley, Amir R. Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, Silvio Savarese

Influence Diagram Bandits: Variational Thompson Sampling for Structured Bandit Problems
Tong Yu, Branislav Kveton, Zheng Wen, Ruiyi Zhang, Ole J. Mengshoel

Disentangling Trainability and Generalization in Deep Neural Networks
Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz

The Many Shapley Values for Model Explanation
Mukund Sundararajan, Amir Najmi

Neural Contextual Bandits with UCB-based Exploration
Dongruo Zhou, Lihong Li, Quanquan Gu

Automatic Shortcut Removal for Self-Supervised Representation Learning
Matthias Minderer, Olivier Bachem, Neil Houlsby, Michael Tschannen

Federated Learning with Only Positive Labels
Felix X. Yu, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

How Recurrent Networks Implement Contextual Processing in Sentiment Analysis
Niru Maheswaranathan, David Sussillo

Supervised Learning: No Loss No Cry
Richard Nock, Aditya Krishna Menon

Ready Policy One: World Building Through Active Learning
Philip Ball, Jack Parker-Holder, Aldo Pacchiano, Krzysztof Choromanski, Stephen Roberts

Weakly-Supervised Disentanglement Without Compromises
Francesco Locatello, Ben Poole, Gunnar Raetsch, Bernhard Schölkopf, Olivier Bachem, Michael Tschannen

Fast Differentiable Sorting and Ranking
Mathieu Blondel, Olivier Teboul, Quentin Berthet, Josip Djolonga

Debiased Sinkhorn barycenters
Hicham Janati, Marco Cuturi, Alexandre Gramfort

Interpretable, Multidimensional, Multimodal Anomaly Detection with Negative Sampling for Detection of Device Failure
John Sipple

Accelerating Large-Scale Inference with Anisotropic Vector Quantization
Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, Sanjiv Kumar

An Optimistic Perspective on Offline Reinforcement Learning (see blog post)
Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi

The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization
Ben Adlam, Jeffrey Pennington

Private Query Release Assisted by Public Data
Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, Zhiwei Steven Wu

Learning and Evaluating Contextual Embedding of Source Code
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

Evaluating Machine Accuracy on ImageNet
Vaishaal Shankar, Rebecca Roelofs, Horia Mania, Alex Fang, Benjamin Recht, Ludwig Schmidt

Imputer: Sequence Modelling via Imputation and Dynamic Programming
William Chan, Chitwan Saharia, Geoffrey Hinton, Mohammad Norouzi, Navdeep Jaitly

Domain Aggregation Networks for Multi-Source Domain Adaptation
Junfeng Wen, Russell Greiner, Dale Schuurmans

Planning to Explore via Self-Supervised World Models
Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak

Context-Aware Dynamics Model for Generalization in Model-Based Reinforcement Learning
Kimin Lee, Younggyo Seo, Seunghyun Lee, Honglak Lee, Jinwoo Shin

Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search
Binghong Chen, Chengtao Li, Hanjun Dai, Le Song

On the Consistency of Top-k Surrogate Losses
Forest Yang, Sanmi Koyejo

Dual Mirror Descent for Online Allocation Problems
Haihao Lu, Santiago Balseiro, Vahab Mirrokni

Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors
Michael W. Dusenberry, Ghassen Jerfel, Yeming Wen, Yi-An Ma, Jasper Snoek, Katherine Heller, Balaji Lakshminarayanan, Dustin Tran

Batch Stationary Distribution Estimation
Junfeng Wen, Bo Dai, Lihong Li, Dale Schuurmans

Small-GAN: Speeding Up GAN Training Using Core-Sets
Samarth Sinha, Han Zhang, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Augustus Odena

Data Valuation Using Reinforcement Learning
Jinsung Yoon, Sercan ‎Ö. Arik, Tomas Pfister

A Game Theoretic Perspective on Model-Based Reinforcement Learning
Aravind Rajeswaran, Igor Mordatch, Vikash Kumar

Encoding Musical Style with Transformer Autoencoders
Kristy Choi, Curtis Hawthorne, Ian Simon, Monica Dinculescu, Jesse Engel

The Shapley Taylor Interaction Index
Kedar Dhamdhere, Mukund Sundararajan, Ashish Agarwal

Multidimensional Shape Constraints
Maya Gupta, Erez Louidor, Olexander Mangylov, Nobu Morioka, Taman Narayan, Sen Zhao

Private Counting from Anonymous Messages: Near-Optimal Accuracy with Vanishing Communication Overhead
Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh

Learning to Score Behaviors for Guided Policy Optimization
Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Anna Choromanska, Krzysztof Choromanski, Michael I. Jordan

Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations
Florian Tramèr, Jens Behrmann, Nicholas Carlini, Nicolas Papernot, Jörn-Henrik Jacobsen

Optimizing Black-Box Metrics with Adaptive Surrogates
Qijia Jiang, Olaoluwa Adigun, Harikrishna Narasimhan, Mahdi Milani Fard, Maya Gupta

Circuit-Based Intrinsic Methods to Detect Overfitting
Sat Chatterjee, Alan Mishchenko

Automatic Reparameterisation of Probabilistic Programs
Maria I. Gorinova, Dave Moore, Matthew D. Hoffman

Stochastic Flows and Geometric Optimization on the Orthogonal Group
Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov, Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos, Adrian Weller, Vikas Sindhwani

Black-Box Variational Inference as a Parametric Approximation to Langevin Dynamics
Matthew Hoffman, Yi-An Ma

Concise Explanations of Neural Networks Using Adversarial Training
Prasad Chalasani, Jiefeng Chen, Amrita Roy Chowdhury, Somesh Jha, Xi Wu

p-Norm Flow Diffusion for Local Graph Clustering
Shenghao Yang, Di Wang, Kimon Fountoulakis

Empirical Study of the Benefits of Overparameterization in Learning Latent Variable Models
Rares-Darius Buhai, Yoni Halpern, Yoon Kim, Andrej Risteski, David Sontag

Robust Pricing in Dynamic Mechanism Design
Yuan Deng, Sébastien Lahaie, Vahab Mirrokni

Differentiable Product Quantization for Learning Compact Embedding Layers
Ting Chen, Lala Li, Yizhou Sun

Adaptive Region-Based Active Learning
Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, Ningshan Zhang

Countering Language Drift with Seeded Iterated Learning
Yuchen Lu, Soumye Singhal, Florian Strub, Olivier Pietquin, Aaron Courville

Does Label Smoothing Mitigate Label Noise?
Michal Lukasik, Srinadh Bhojanapalli, Aditya Krishna Menon, Sanjiv Kumar

Acceleration Through Spectral Density Estimation
Fabian Pedregosa, Damien Scieur

Momentum Improves Normalized SGD
Ashok Cutkosky, Harsh Mehta

ConQUR: Mitigating Delusional Bias in Deep Q-Learning
Andy Su, Jayden Ooi, Tyler Lu, Dale Schuurmans, Craig Boutilier

Online Learning with Imperfect Hints
Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

Go Wide, Then Narrow: Efficient Training of Deep Thin Networks
Denny Zhou, Mao Ye, Chen Chen, Tianjian Meng, Mingxing Tan, Xiaodan Song, Quoc Le, Qiang Liu, Dale Schuurmans

On Implicit Regularization in β-VAEs
Abhishek Kumar, Ben Poole

Is Local SGD Better than Minibatch SGD?
Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Sreb

A Simple Framework for Contrastive Learning of Visual Representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton

Universal Average-Case Optimality of Polyak Momentum
Damien Scieur, Fabian Pedregosa

An Imitation Learning Approach for Cache Replacement
Evan Zheran Liu, Milad Hashemi, Kevin Swersky, Parthasarathy Ranganathan, Junwhan Ahn

Collapsed Amortized Variational Inference for Switching Nonlinear Dynamical Systems
Zhe Dong, Bryan A. Seybold, Kevin P. Murphy, Hung H. Bui

Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels
Lu Jiang, Di Huang, Mason Liu, Weilong Yang

Optimizing Data Usage via Differentiable Rewards
Xinyi Wang, Hieu Pham, Paul Michel, Antonios Anastasopoulos, Jaime Carbonell, Graham Neubig

Sparse Sinkhorn Attention
Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan

One Policy to Control Them All: Shared Modular Policies for Agent-Agnostic Control
Wenlong Huang, Igor Mordatch, Deepak Pathak

On Thompson Sampling with Langevin Algorithms
Eric Mazumdar, Aldo Pacchiano, Yi-An Ma, Peter L. Bartlett, Michael I. Jordan

Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection
Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, Qiang Liu

On the Global Convergence Rates of Softmax Policy Gradient Methods
Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans

Concept Bottleneck Models
Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, Percy Liang

Supervised Quantile Normalization for Low-Rank Matrix Approximation
Marco Cuturi, Olivier Teboul, Jonathan Niles-Weed, Jean-Philippe Vert

Missing Data Imputation Using Optimal Transport
Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi

Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention Over Modules
Sarthak Mittal, Alex Lamb, Anirudh Goyal, Vikram Voleti, Murray Shanahan, Guillaume Lajoie, Michael Mozer, Yoshua Bengio

Stochastic Optimization for Regularized Wasserstein Estimators
Marin Ballu, Quentin Berthet, Francis Bach

Low-Rank Bottleneck in Multi-head Attention Models
Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Jakkam Reddi, Sanjiv Kumar

Rigging the Lottery: Making All Tickets Winners
Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, Erich Elsen

Online Learning with Dependent Stochastic Feedback Graphs
Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, Ningshan Zhang

Calibration, Entropy Rates, and Memory in Language Models
Mark Braverman, Xinyi Chen, Sham Kakade, Karthik Narasimhan, Cyril Zhang, Yi Zhang

Composable Sketches for Functions of Frequencies: Beyond the Worst Case
Edith Cohen, Ofir Geri, Rasmus Pagh

Energy-Based Processes for Exchangeable Data
Mengjiao Yang, Bo Dai, Hanjun Dai, Dale Schuurmans

Near-Optimal Regret Bounds for Stochastic Shortest Path
Alon Cohen, Haim Kaplan, Yishay Mansour, Aviv Rosenberg

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization (see blog post)
Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu

The Complexity of Finding Stationary Points with Stochastic Gradient Descent
Yoel Drori, Ohad Shamir

The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks
Jakub Swiatkowski, Kevin Roth, Bas Veeling, Linh Tran, Josh Dillon, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin

Regularized Optimal Transport is Ground Cost Adversarial
François-Pierre Paty, Marco Cuturi

Workshops
New In ML
Invited Speaker: Nicolas Le Roux
Organizers: Zhen Xu, Sparkle Russell-Puleri, Zhengying Liu, Sinead A Williamson, Matthias W Seeger, Wei-Wei Tu, Samy Bengio, Isabelle Guyon

LatinX in AI
Workshop Advisor: Pablo Samuel Castro

Women in Machine Learning Un-Workshop
Invited Speaker: Doina Precup
Sponsor Expo Speaker: Jennifer Wei

Queer in AI
Invited Speaker: Shakir Mohamed

Workshop on Continual Learning
Organizers: Haytham Fayek, Arslan Chaudhry, David Lopez-Paz, Eugene Belilovsky, Jonathan Schwarz, Marc Pickett, Rahaf Aljundi, Sayna Ebrahimi, Razvan Pascanu, Puneet Dokania

5th ICML Workshop on Human Interpretability in Machine Learning (WHI)
Organizers: Kush Varshney, Adrian Weller, Alice Xiang, Amit Dhurandhar, Been Kim, Dennis Wei, Umang Bhatt

Self-supervision in Audio and Speech
Organizers: Mirco Ravanelli, Dmitriy Serdyuk, R Devon Hjelm, Bhuvana Ramabhadran, Titouan Parcollet

Workshop on eXtreme Classification: Theory and Applications
Invited Speakers: Sanjiv Kumar

Healthcare Systems, Population Health, and the Role of Health-tech
Organizers: Krzysztof Choromanski, David Cheikhi, Jared Davis, Valerii Likhosherstov, Achille Nazaret, Achraf Bahamou, Xingyou Song, Mrugank Akarte, Jack Parker-Holder, Jacob Bergquist, Yuan Gao, Aldo Pacchiano, Tamas Sarlos, Adrian Weller, Vikas Sindhwani

Theoretical Foundations of Reinforcement Learning
Program Committee: Alon Cohen, Chris Dann

Uncertainty and Robustness in Deep Learning Workshop (UDL)
Invited Speaker: Justin Gilmer

Organizers: Sharon Li, Balaji Lakshminarayanan, Dan Hendrycks, Thomas Dietterich, Jasper Snoek
Program Committee: Jeremiah Liu, Jie Ren, Rodolphe Jenatton, Zack Nado, Alexander Alemi, Florian Wenzel, Mike Dusenberry, Raphael Lopes

Beyond First Order Methods in Machine Learning Systems
Industry Panel: Jonathan Hseu

Object-Oriented Learning: Perception, Representation, and Reasoning
Invited Speakers: Thomas Kipf, Igor Mordatch

Graph Representation Learning and Beyond (GRL+)
Organizers: Michael Bronstein, Andreea Deac, William L. Hamilton, Jessica B. Hamrick, Milad Hashemi, Stefanie Jegelka, Jure Leskovec, Renjie Liao, Federico Monti, Yizhou Sun, Kevin Swersky, Petar Veličković, Rex Ying, Marinka Žitnik
Speakers: Thomas Kipf
Program Committee: Bryan Perozzi, Kevin Swersky, Milad Hashemi, Thomas Kipf, Ting Cheng

ML Interpretability for Scientific Discovery
Organizers: Subhashini Venugopalan, Michael Brenner, Scott Linderman, Been Kim
Program Committee: Akinori Mitani, Arunachalam Narayanaswamy, Avinash Varadarajan, Awa Dieng, Benjamin Sanchez-Lengeling, Bo Dai, Stephan Hoyer, Subham Sekhar Sahoo, Suhani Vora
Steering Committee: John Platt, Mukund Sundararajan, Jon Kleinberg

Negative Dependence and Submodularity for Machine Learning
Organizers: Zelda Mariet, Mike Gartrell, Michal Derezinski

7th ICML Workshop on Automated Machine Learning (AutoML)
Organizers: Charles Weill, Katharina Eggensperger, Matthias Feurer, Frank Hutter, Marius Lindauer, Joaquin Vanschoren

Federated Learning for User Privacy and Data Confidentiality
Keynote: Brendan McMahan
Program Committee: Peter Kairouz, Jakub Konecný

MLRetrospectives: A Venue for Self-Reflection in ML Research
Speaker: Margaret Mitchell

Machine Learning for Media Discovery
Speaker: Ed Chi

INNF+: Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models
Organizers: Chin-Wei Huang, David Krueger, Rianne van den Berg, George Papamakarios, Chris Cremer, Ricky Chen, Danilo Rezende

4th Lifelong Learning Workshop
Program Committee: George Tucker, Marlos C. Machado

2nd ICML Workshop on Human in the Loop Learning (HILL)
Organizers: Shanghang Zhang, Xin Wang, Fisher Yu, Jiajun Wu, Trevor Darrell

Machine Learning for Global Health
Organizers: Danielle Belgrave, Danielle Belgrave, Stephanie Hyland, Charles Onu, Nicholas Furnham, Ernest Mwebaze, Neil Lawrence

Committee
Social Chair: Adam White

Work performed while at Google

Source: Google AI Blog


AutoML-Zero: Evolving Code that Learns



Machine learning (ML) has seen tremendous successes recently, which were made possible by ML algorithms like deep neural networks that were discovered through years of expert research. The difficulty involved in this research fueled AutoML, a field that aims to automate the design of ML algorithms. So far, AutoML has focused on constructing solutions by combining sophisticated hand-designed components. A typical example is that of neural architecture search, a subfield in which one builds neural networks automatically out of complex layers (e.g., convolutions, batch-norm, and dropout), and the topic of much research.

An alternative approach to using these hand-designed components in AutoML is to search for entire algorithms from scratch. This is challenging because it requires the exploration of vast and sparse search spaces, yet it has great potential benefits — it is not biased toward what we already know and potentially allows for the discovery of new and better ML architectures. By analogy, if one were building a house from scratch, there is more potential for flexibility or improvement than if one was constructing a house using only prefabricated rooms. However, the discovery of such housing designs may be more difficult because there are many more possible ways to combine the bricks and mortar than there are of combining pre-made designs of entire rooms. As such, early research into algorithm learning from scratch focused on one aspect of the algorithm, to reduce the search space and compute required, such as the learning rule, and has not been revisited much since the early 90s. Until now.

Extending our research into evolutionary AutoML, our recent paper, to be published at ICML 2020, demonstrates that it is possible to successfully evolve ML algorithms from scratch. The approach we propose, called AutoML-Zero, starts from empty programs and, using only basic mathematical operations as building blocks, applies evolutionary methods to automatically find the code for complete ML algorithms. Given small image classification problems, our method rediscovered fundamental ML techniques, such as 2-layer neural networks with backpropagation, linear regression and the like, which have been invented by researchers throughout the years. This result demonstrates the plausibility of automatically discovering more novel ML algorithms to address harder problems in the future.

Evolving Learning Algorithms from Scratch
We use a variant of classic evolutionary methods to search the space of algorithms. These methods have proved useful in discovering computer programs since the 80s. Their simplicity and scalability makes them especially suitable for the discovery of learning algorithms.

In our case, a population is initialized with empty programs. It then evolves in repeating cycles to produce better and better learning algorithms. At each cycle, two (or more) random models compete and the most accurate model gets to be a parent. The parent clones itself to produce a child, which gets mutated. That is, the child’s code is modified in a random way, which could mean, for example, arbitrarily inserting, removing or modifying a line in the code. The mutated algorithm is then evaluated on image classification tasks.
A population is initialized with empty programs. Many generations later, we see a more evolved population and two of its algorithms compete. The most accurate wins to produce a child. After many such events, the final population contains highly accurate classifiers.
Exploring a Difficult Search Space
Our AutoML-Zero setup, in contrast to much previous AutoML work, makes the search space very sparse — an accurate algorithm might be as rare as 1 in 1012 candidates. This is due to the granularity of the building blocks provided to the algorithm, which include only basic operations such as variable assignment, addition, and matrix multiplication. In such an environment, a random search will not find a solution in a reasonable amount of time, yet evolution can be tens of thousands of times faster, according to our measurements. We distributed the search on multiple machines that occasionally exchange algorithms (analogous to migration in real life). We also constructed small proxy classification tasks on which to evaluate each child algorithm, and executed this evaluation with highly optimized code.

Despite the sparsity, the evolutionary search discovers more complex and effective techniques as time passes. Initially, the simplest algorithms appear, which represent linear models with hard-coded weights. In time, stochastic gradient descent (SGD) is invented to learn the weights, in spite of the gradient itself not having been provided as a building block. Though flawed at first, SGD gets fixed relatively quickly, starting a series of improvements to the prediction and learning algorithm. Within our toy scenario, the process discovers several concepts known to have been useful to the research community. In the end, our approach manages to construct a model that outperforms hand-designs of comparable complexity.
Progress of an evolution experiment. As time passes, from left to right, we see the algorithms becoming more complex and more accurate.
The Evolved Algorithm
The figure above includes the best evolved algorithm produced by our method. This final algorithm includes techniques such as noise injection as data augmentation, bilinear model, gradient normalization, and weight averaging, and the improvement over the baseline also transfers to datasets that are not used during search. Our paper describes how the different lines in the evolved code implement each of these techniques, and verifies their value through ablation studies.

Through more experiments, we show that it is possible to guide the evolutionary search by controlling "the habitat" — i.e., the tasks on which the evolutionary process evaluates the fitness of the algorithms. For example, when we reduce the amount of data, the noisy ReLU emerges, which helps with regularization. Or when we reduce the number of training steps, we witness the emergence of learning rate decay, which enables faster convergence. Targeted discoveries such as these are important — while it may be interesting if an automatic tool-inventing machine comes up with a hammer or a needle, it is much more interesting if it comes up with a hammer when you show it some nails and a needle when you show it some thread. By analogy, in our work the noisy ReLU ("hammer") is discovered when in the presence of little data ("nails") and the learning rate decay when in the presence of few training steps.

Conclusion
We consider this to be preliminary work. We have yet to evolve fundamentally new algorithms, but it is encouraging that the evolved algorithm can surpass simple neural networks that exist within the search space. Right now, the search process requires significant compute.* As the coming years scale up available hardware and as the search methods become more efficient, it is likely that the search space will become more inclusive and the results will improve. We are excited at the prospects of discovering novel machine learning algorithms as we further our understanding of AutoML-Zero.

Acknowledgements
We want to thank our co-authors, David R. So and Quoc V. Le, and the many who helped us through discussions during the project and paper writing, including Samy Bengio, Vincent Vanhoucke, Doug Eck, Charles Sutton, Yanping Huang, Jacques Pienaar, Jeff Dean, and particularly Gabriel Bender, Hanxiao Liu, Rishabh Singh, Chiyuan Zhang, and Hieu Pham. We also want to especially thank Tom Small for contributing the animations in this post.


* The electricity consumption for the experiments (run in 2019) was matched with the purchase of renewable energy.

Source: Google AI Blog


PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization



Students are often tasked with reading a document and producing a summary (for example, a book report) to demonstrate both reading comprehension and writing ability. This abstractive text summarization is one of the most challenging tasks in natural language processing, involving understanding of long passages, information compression, and language generation. The dominant paradigm for training machine learning models to do this is sequence-to-sequence (seq2seq) learning, where a neural network learns to map input sequences to output sequences. While these seq2seq models were initially developed using recurrent neural networks, Transformer encoder-decoder models have recently become favored as they are more effective at modeling the dependencies present in the long sequences encountered in summarization.

Transformer models combined with self-supervised pre-training (e.g., BERT, GPT-2, RoBERTa, XLNet, ALBERT, T5, ELECTRA) have shown to be a powerful framework for producing general language learning, achieving state-of-the-art performance when fine-tuned on a wide array of language tasks. In prior work, the self-supervised objectives used in pre-training have been somewhat agnostic to the down-stream application in favor of generality; we wondered whether better performance could be achieved if the self-supervised objective more closely mirrored the final task.

In “PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization” (to appear at the 2020 International Conference on Machine Learning), we designed a pre-training self-supervised objective (called gap-sentence generation) for Transformer encoder-decoder models to improve fine-tuning performance on abstractive summarization, achieving state-of-the-art results on 12 diverse summarization datasets. Supplementary to the paper, we are also releasing the training code and model checkpoints on GitHub.

A Self-Supervised Objective for Summarization
Our hypothesis is that the closer the pre-training self-supervised objective is to the final down-stream task, the better the fine-tuning performance. In PEGASUS pre-training, several whole sentences are removed from documents and the model is tasked with recovering them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. This is an incredibly difficult task that may seem impossible, even for people, and we don’t expect the model to solve it perfectly. However, such a challenging task encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task. The advantage of this self-supervision is that you can create as many examples as there are documents, without any human annotation, which is often the bottleneck in purely supervised systems.
A self-supervised example for PEGASUS during pre-training. The model is trained to output all the masked sentences.
We found that choosing “important” sentences to mask worked best, making the output of self-supervised examples even more similar to a summary. We automatically identified these sentences by finding those that were most similar to the rest of the document according to a metric called ROUGE. ROUGE computes the similarity of two texts by computing n-gram overlaps using a score from 0 to 100 (ROUGE-1, ROUGE-2, and ROUGE-L are three common variants).

Similar to other recent methods, such as T5, we pre-trained our model on a very large corpus of web-crawled documents, then we fine-tuned the model on 12 public down-stream abstractive summarization datasets, resulting in new state-of-the-art results as measured by automatic metrics, while using only 5% of the number of parameters of T5. The datasets were chosen to be diverse, including news articles, scientific papers, patents, short stories, e-mails, legal documents, and how-to directions, showing that the model framework is adaptive to a wide-variety of topics.

Fine-Tuning with Small Numbers of Examples
While PEGASUS showed remarkable performance with large datasets, we were surprised to learn that the model didn’t require a large number of examples for fine-tuning to get near state-of-the-art performance:
ROUGE scores (three variants, higher is better) vs. the number of supervised examples across four selected summarization datasets. The dotted-line shows the Transformer encoder-decoder performance with full-supervision, but without pre-training.
With only 1000 fine-tuning examples, we were able to perform better in most tasks than a strong baseline (Transformer encoder-decoder) that used the full supervised data, which in some cases had many orders of magnitude more examples. This “sample efficiency” greatly increases the usefulness of text summarization models as it significantly lowers the scale and cost of supervised data collection, which in the case of summarization is very expensive.

Human-Quality summaries
While we find automatic metrics such as ROUGE are useful proxies for measuring progress during model development, they only provide limited information and don’t tell us the whole story, such as fluency or a comparison to human performance. To this end, we conducted a human evaluation, where raters were asked to compare summaries from our model with human ones (without knowing which is which). This has some similarities to the Turing test.
Human raters were asked to rate model and human-written summaries without knowing which was which. The document is truncated here for illustration, but raters see the full text.
We performed the experiment with 3 different datasets and found that human raters do not consistently prefer the human summaries to those from our model. Furthermore, our models trained with only 1000 examples performed nearly as well. In particular, with the much studied XSum and CNN/Dailymail datasets, the model achieves human-like performance using only 1000 examples. This suggests large datasets of supervised examples are no longer necessary for summarization, opening up many low-cost use-cases.

A Test of Comprehension: Counting Ships
Following this post is an example article from the XSum dataset and the model-generated abstractive summary. As we can see, the model correctly abstracts and paraphrases four named frigates (HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall) as “four Royal Navy frigates”, something an extractive approach could not do since “four” is not mentioned anywhere. Was this a fluke or did the model actually count? One way to find out is to add and remove ships to see if the count changes.

As can be seen below, the model successfully “counts” ships from 2 to 5. However, when we add a sixth ship, the “HMS Alphabet”, it miscounts it as “seven”. So it appears the model has learned to count small numbers of items in a list, but does not yet generalize as elegantly as we would hope. Still, we think this rudimentary counting ability is impressive as it was not explicitly programmed into the model, and it demonstrates a limited amount of “symbolic reasoning” by the model.

PEGASUS code and model release
To support on-going research in this field and ensure reproducibility, we are releasing the PEGASUS code and model checkpoints on GitHub. This includes fine-tuning code which can be used to adapt PEGASUS to other summarization datasets.

Acknowledgements
This work has been a collaborative effort involving Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. We thank the T5 and Google News teams for providing datasets for pre-training PEGASUS.

Source: Google AI Blog


PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization



Students are often tasked with reading a document and producing a summary (for example, a book report) to demonstrate both reading comprehension and writing ability. This abstractive text summarization is one of the most challenging tasks in natural language processing, involving understanding of long passages, information compression, and language generation. The dominant paradigm for training machine learning models to do this is sequence-to-sequence (seq2seq) learning, where a neural network learns to map input sequences to output sequences. While these seq2seq models were initially developed using recurrent neural networks, Transformer encoder-decoder models have recently become favored as they are more effective at modeling the dependencies present in the long sequences encountered in summarization.

Transformer models combined with self-supervised pre-training (e.g., BERT, GPT-2, RoBERTa, XLNet, ALBERT, T5, ELECTRA) have shown to be a powerful framework for producing general language learning, achieving state-of-the-art performance when fine-tuned on a wide array of language tasks. In prior work, the self-supervised objectives used in pre-training have been somewhat agnostic to the down-stream application in favor of generality; we wondered whether better performance could be achieved if the self-supervised objective more closely mirrored the final task.

In “PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization” (to appear at the 2020 International Conference on Machine Learning), we designed a pre-training self-supervised objective (called gap-sentence generation) for Transformer encoder-decoder models to improve fine-tuning performance on abstractive summarization, achieving state-of-the-art results on 12 diverse summarization datasets. Supplementary to the paper, we are also releasing the training code and model checkpoints on GitHub.

A Self-Supervised Objective for Summarization
Our hypothesis is that the closer the pre-training self-supervised objective is to the final down-stream task, the better the fine-tuning performance. In PEGASUS pre-training, several whole sentences are removed from documents and the model is tasked with recovering them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. This is an incredibly difficult task that may seem impossible, even for people, and we don’t expect the model to solve it perfectly. However, such a challenging task encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task. The advantage of this self-supervision is that you can create as many examples as there are documents, without any human annotation, which is often the bottleneck in purely supervised systems.
A self-supervised example for PEGASUS during pre-training. The model is trained to output all the masked sentences.
We found that choosing “important” sentences to mask worked best, making the output of self-supervised examples even more similar to a summary. We automatically identified these sentences by finding those that were most similar to the rest of the document according to a metric called ROUGE. ROUGE computes the similarity of two texts by computing n-gram overlaps using a score from 0 to 100 (ROUGE-1, ROUGE-2, and ROUGE-L are three common variants).

Similar to other recent methods, such as T5, we pre-trained our model on a very large corpus of web-crawled documents, then we fine-tuned the model on 12 public down-stream abstractive summarization datasets, resulting in new state-of-the-art results as measured by automatic metrics, while using only 5% of the number of parameters of T5. The datasets were chosen to be diverse, including news articles, scientific papers, patents, short stories, e-mails, legal documents, and how-to directions, showing that the model framework is adaptive to a wide-variety of topics.

Fine-Tuning with Small Numbers of Examples
While PEGASUS showed remarkable performance with large datasets, we were surprised to learn that the model didn’t require a large number of examples for fine-tuning to get near state-of-the-art performance:
ROUGE scores (three variants, higher is better) vs. the number of supervised examples across four selected summarization datasets. The dotted-line shows the Transformer encoder-decoder performance with full-supervision, but without pre-training.
With only 1000 fine-tuning examples, we were able to perform better in most tasks than a strong baseline (Transformer encoder-decoder) that used the full supervised data, which in some cases had many orders of magnitude more examples. This “sample efficiency” greatly increases the usefulness of text summarization models as it significantly lowers the scale and cost of supervised data collection, which in the case of summarization is very expensive.

Human-Quality summaries
While we find automatic metrics such as ROUGE are useful proxies for measuring progress during model development, they only provide limited information and don’t tell us the whole story, such as fluency or a comparison to human performance. To this end, we conducted a human evaluation, where raters were asked to compare summaries from our model with human ones (without knowing which is which). This has some similarities to the Turing test.
Human raters were asked to rate model and human-written summaries without knowing which was which. The document is truncated here for illustration, but raters see the full text.
We performed the experiment with 3 different datasets and found that human raters do not consistently prefer the human summaries to those from our model. Furthermore, our models trained with only 1000 examples performed nearly as well. In particular, with the much studied XSum and CNN/Dailymail datasets, the model achieves human-like performance using only 1000 examples. This suggests large datasets of supervised examples are no longer necessary for summarization, opening up many low-cost use-cases.

A Test of Comprehension: Counting Ships
Following this post is an example article from the XSum dataset and the model-generated abstractive summary. As we can see, the model correctly abstracts and paraphrases four named frigates (HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall) as “four Royal Navy frigates”, something an extractive approach could not do since “four” is not mentioned anywhere. Was this a fluke or did the model actually count? One way to find out is to add and remove ships to see if the count changes.

As can be seen below, the model successfully “counts” ships from 2 to 5. However, when we add a sixth ship, the “HMS Alphabet”, it miscounts it as “seven”. So it appears the model has learned to count small numbers of items in a list, but does not yet generalize as elegantly as we would hope. Still, we think this rudimentary counting ability is impressive as it was not explicitly programmed into the model, and it demonstrates a limited amount of “symbolic reasoning” by the model.

PEGASUS code and model release
To support on-going research in this field and ensure reproducibility, we are releasing the PEGASUS code and model checkpoints on GitHub. This includes fine-tuning code which can be used to adapt PEGASUS to other summarization datasets.

Acknowledgements
This work has been a collaborative effort involving Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. We thank the T5 and Google News teams for providing datasets for pre-training PEGASUS.

Source: Google AI Blog


Recursive Sketches for Modular Deep Learning



Much of classical machine learning (ML) focuses on utilizing available data to make more accurate predictions. More recently, researchers have considered other important objectives, such as how to design algorithms to be small, efficient, and robust. With these goals in mind, a natural research objective is the design of a system on top of neural networks that efficiently stores information encoded within—in other words, a mechanism to compute a succinct summary (a “sketch”) of how a complex deep network processes its inputs. Sketching is a rich field of study that dates back to the foundational work of Alon, Matias, and Szegedy, which can enable neural networks to efficiently summarize information about their inputs.

For example: Imagine stepping into a room and briefly viewing the objects within. Modern machine learning is excellent at answering immediate questions, known at training time, about this scene: “Is there a cat? How big is said cat?” Now, suppose we view this room every day over the course of a year. People can reminisce about the times they saw the room: “How often did the room contain a cat? Was it usually morning or night when we saw the room?”. However, can one design systems that are also capable of efficiently answering such memory-based questions even if they are unknown at training time?

In “Recursive Sketches for Modular Deep Learning”, recently presented at ICML 2019, we explore how to succinctly summarize how a machine learning model understands its input. We do this by augmenting an existing (already trained) machine learning model with “sketches” of its computation, using them to efficiently answer memory-based questions—for example, image-to-image-similarity and summary statistics—despite the fact that they take up much less memory than storing the entire original computation.

Basic Sketching Algorithms
In general, sketching algorithms take a vector x and produce an output sketch vector that behaves like x but whose storage cost is much smaller. The fact that the storage cost is much smaller allows one to succinctly store information about the network, which is critical for efficiently answering memory-based questions. In the simplest case, a linear sketch x is given by the matrix-vector product Ax where A is a wide matrix, i.e., the number of columns is equal to the original dimension of x and the number of rows is equal to the new reduced dimension. Such methods have led to a variety of efficient algorithms for basic tasks on massive datasets, such as estimating fundamental statistics (e.g., histogram, quantiles and interquartile range), finding popular items (known as frequent elements), as well as estimating the number of distinct elements (known as support size) and the related tasks of norms and entropy estimation.
A simple method to sketch the vector x is to multiply it by a wide matrix A to produce a lower-dimensional vector y.
This basic approach works well in the relatively simple case of linear regression, where it is possible to identify important data dimensions simply by the magnitude of weights (under the common assumption that they have uniform variance). However, many modern machine learning models are actually deep neural networks and are based on high-dimensional embeddings (such as Word2Vec, Image Embeddings, Glove, DeepWalk and BERT), which makes the task of summarizing the operation of the model on the input much more difficult. However, a large subset of these more complex networks are modular, allowing us to generate accurate sketches of their behavior, in spite of their complexity.

Neural Network Modularity
A modular deep network consists of several independent neural networks (modules) that only communicate via one’s output serving as another’s input. This concept has inspired several practical architectures, including Neural Modular Networks, Capsule Neural Networks and PathNet. It is also possible to split other canonical architectures to view them as modular networks and apply our approach. For example, convolutional neural networks (CNNs) are traditionally understood to behave in a modular fashion; they detect basic concepts and attributes in their lower layers and build up to detecting more complex objects in their higher layers. In this view, the convolution kernels correspond to modules. A cartoon depiction of a modular network is given below.
This is a cartoon depiction of a modular network for image processing. Data flows from the bottom of the figure to the top through the modules represented with blue boxes. Note that modules in the lower layers correspond to basic objects, such as edges in an image, while modules in upper layers correspond to more complex objects, like humans or cats. Also notice that in this imaginary modular network, the output of the face module is generic enough to be used by both the human and cat modules.
Sketch Requirements
To optimize our approach for these modular networks, we identified several desired properties that a network sketch should satisfy:
  • Sketch-to-Sketch Similarity: The sketches of two unrelated network operations (either in terms of the present modules or in terms of the attribute vectors) should be very different; on the other hand, the sketches of two similar network operations should be very close.
  • Attribute Recovery: The attribute vector, e.g., the activations of any node of the graph can be approximately recovered from the top-level sketch.
  • Summary Statistics: If there are multiple similar objects, we can recover summary statistics about them. For example, if an image has multiple cats, we can count how many there are. Note that we want to do this without knowing the questions ahead of time.
  • Graceful Erasure: Erasing a suffix of the top-level sketch maintains the above properties (but would smoothly increase the error).
  • Network Recovery: Given sufficiently many (input, sketch) pairs, the wiring of the edges of the network as well as the sketch function can be approximately recovered.
This is a 2D cartoon depiction of the sketch-to-sketch similarity property. Each vector represents a sketch and related sketches are more likely to cluster together.
The Sketching Mechanism
The sketching mechanism we propose can be applied to a pre-trained modular network. It produces a single top-level sketch summarizing the operation of this network, simultaneously satisfying all of the desired properties above. To understand how it does this, it helps to first consider a one-layer network. In this case, we ensure that all the information pertaining to a specific node is “packed” into two separate subspaces, one corresponding to the node itself and one corresponding to its associated module. Using suitable projections, the first subspace lets us recover the attributes of the node whereas the second subspace facilitates quick estimates of summary statistics. Both subspaces help enforce the aforementioned sketch-to-sketch similarity property. We demonstrate that these properties hold if all the involved subspaces are chosen independently at random.

Of course, extra care has to be taken when extending this idea to networks with more than one layer—which leads to our recursive sketching mechanism. Due to their recursive nature, these sketches can be “unrolled” to identify sub-components, capturing even complicated network structures. Finally, we utilize a dictionary learning algorithm tailored to our setup to prove that the random subspaces making up the sketching mechanism together with the network architecture can be recovered from a sufficiently large number of (input, sketch) pairs.

Future Directions
The question of succinctly summarizing the operation of a network seems to be closely related to that of model interpretability. It would be interesting to investigate whether ideas from the sketching literature can be applied to this domain. Our sketches could also be organized in a repository to implicitly form a “knowledge graph”, allowing patterns to be identified and quickly retrieved. Moreover, our sketching mechanism allows for seamlessly adding new modules to the sketch repository—it would be interesting to explore whether this feature can have applications to architecture search and evolving network topologies. Finally, our sketches can be viewed as a way of organizing previously encountered information in memory, e.g., images that share the same modules or attributes would share subcomponents of their sketches. This, on a very high level, is similar to the way humans use prior knowledge to recognize objects and generalize to unencountered situations.

Acknowledgements
This work was the joint effort of Badih Ghazi, Rina Panigrahy and Joshua R. Wang.

Source: Google AI Blog