Author Archives: Google AI

Recognizing Pose Similarity in Images and Videos

Everyday actions, such as jogging, reading a book, pouring water, or playing sports, can be viewed as a sequence of poses, consisting of the position and orientation of a person’s body. An understanding of poses from images and videos is a crucial step for enabling a range of applications, including augmented reality display, full-body gesture control, and physical exercise quantification. However, a 3-dimensional pose captured in two dimensions in images and videos appears different depending on the viewpoint of the camera. The ability to recognize similarity in 3D pose using only 2D information will help vision systems better understand the world.

In “View-Invariant Probabilistic Embedding for Human Pose” (Pr-VIPE), a spotlight paper at ECCV 2020, we present a new algorithm for human pose perception that recognizes similarity in human body poses across different camera views by mapping 2D body pose keypoints to a view-invariant embedding space. This ability enables tasks, such as pose retrieval, action recognition, action video synchronization, and more. Compared to existing models that directly map 2D pose keypoints to 3D pose keypoints, the Pr-VIPE embedding space is (1) view-invariant, (2) probabilistic in order to capture 2D input ambiguity, and (3) does not require camera parameters during training or inference. Trained with in-lab setting data, the model works on in-the-wild images out of the box, given a reasonably good 2D pose estimator (e.g., PersonLab, BlazePose, among others). The model is simple, results in compact embeddings, and can be trained (in ~1 day) using 15 CPUs. We have released the code on our GitHub repo.

Pr-VIPE can be directly applied to align videos from different views.

The input to Pr-VIPE is a set of 2D keypoints, from any 2D pose estimator that produces a minimum of 13 body keypoints, and the output is the mean and variance of the pose embedding. The distances between embeddings of 2D poses correlate to their similarities in absolute 3D pose space. Our approach is based on two observations:

  • The same 3D pose may appear very different in 2D as the viewpoint changes.
  • The same 2D pose can be projected from different 3D poses.

The first observation motivates the need for view-invariance. To accomplish this, we define the matching probability, i.e., the likelihood that different 2D poses were projected from the same, or similar 3D poses. The matching probability predicted by Pr-VIPE for matching pose pairs should be higher than for non-matching pairs.

To address the second observation, Pr-VIPE utilizes a probabilistic embedding formulation. Because many 3D poses can project to the same or similar 2D poses, the model input exhibits an inherent ambiguity that is difficult to capture through deterministic mapping point-to-point in embedding space. Therefore, we map a 2D pose through a probabilistic mapping to an embedding distribution, of which we use the variance to represent the uncertainty of the input 2D pose. As an example, in the figure below the third 2D view of the 3D pose on the left is similar to the first 2D view of a different 3D pose on the right, so we map them into a similar location in the embedding space with large variances.

Pr-VIPE enables vision systems to recognize 2D poses across views. We embed 2D poses using Pr-VIPE such that the embeddings are (1) view-invariant (2D projections of similar 3D poses are embedded close together) and (2) probabilistic. By embedding detected 2D poses, Pr-VIPE enables direct retrieval of pose images from different views, and can also be applied to action recognition and video alignment.
During training, we use 2D poses from two sources: multi-view images and projections of groundtruth 3D poses. Triplets of 2D poses (anchor, positive, and negative) are selected from a batch, where the anchor and positive are two different projections of the same 3D pose, and the negative is a projection of a non-matching 3D pose. Pr-VIPE then estimates the matching probability of 2D pose pairs from their embeddings.
During training, we push the matching probability of positive pairs to be close to 1 with a positive pairwise loss in which we minimize the embedding distance between positive pairs, and the matching probability of negative pairs to be small by maximizing the ratio of the matching probabilities between positive and negative pairs with a triplet ratio loss.
Overview of the Pr-VIPE model. During training, we apply three losses (triplet ratio loss, positive pairwise loss, and a prior loss that applies a unit Gaussian prior to our embeddings). During inference, the model maps an input 2D pose to a probabilistic, view-invariant embedding.
Probabilistic Embedding
Pr-VIPE maps a 2D pose to a probabilistic embedding as a multivariate Gaussian distribution using a sampling-based approach for similarity score computation between two distributions. During training, we use a Gaussian prior loss to regularize the predicted distribution.

We propose a new cross-view pose retrieval benchmark to evaluate the view-invariance property of the embedding. Given a monocular pose image, cross-view retrieval aims to retrieve the same pose from different views without using camera parameters. The results demonstrate that Pr-VIPE retrieves poses more accurately across views compared to baseline methods in both evaluated datasets (Human3.6M, MPI-INF-3DHP).

Pr-VIPE retrieves poses across different views more accurately relative to the baseline method (3D pose estimation).

Common 3D pose estimation methods (such as the simple baseline used for comparison above, SemGCN, and EpipolarPose, amongst many others), predict 3D poses in camera coordinates, which are not directly view-invariant. Thus, rigid alignment between every query-index pair is required for retrieval using estimated 3D poses, which is computationally expensive due to the need for singular value decomposition (SVD). In contrast, Pr-VIPE embeddings can be directly used for distance computation in Euclidean space, without any post-processing.

View-invariant pose embedding can be applied to many image and video related tasks. Below, we show Pr-VIPE applied to cross-view retrieval on in-the-wild images without using camera parameters.

We can retrieve in-the-wild images from different views without using camera parameters by embedding the detected 2D pose using Pr-VIPE. Using the query image (top row), we search for a matching pose from a different camera view and we show the nearest neighbor retrieval (bottom row). This enables us to search for matching poses across camera views more easily.

The same Pr-VIPE model can also be used for video alignment. To do so, we stack Pr-VIPE embeddings within a small time window, and use the dynamic time warping (DTW) algorithm to align video pairs.

Manual video alignment is difficult and time-consuming. Here, Pr-VIPE is applied to automatically align videos of the same action repeated from different views.

The video alignment distance calculated via DTW can then be used for action recognition by classifying videos using nearest neighbor search. We evaluate the Pr-VIPE embedding using the Penn Action dataset and demonstrate that using the Pr-VIPE embedding without fine-tuning on the target dataset, yields highly competitive recognition accuracy. In addition, we show that Pr-VIPE even achieves relatively accurate results using only videos from a single view in the index set.

Pr-VIPE recognizes action across views using pose inputs only, and is comparable to or better than methods using pose only or with additional context information (such as Iqbal et al., Liu and Yuan, Luvizon et al., and Du et al.). When action labels are only available for videos from a single view, Pr-VIPE (1-view only) can still achieve relatively accurate results.

We introduce the Pr-VIPE model for mapping 2D human poses to a view-invariant probabilistic embedding space, and show that the learned embeddings can be directly used for pose retrieval, action recognition, and video alignment. Our cross-view retrieval benchmark can be used to test the view-invariant property of other embeddings. We look forward to hearing about what you can do with pose embeddings!

Special thanks to Jiaping Zhao, Liang-Chieh Chen, Long Zhao (Rutgers University), Liangzhe Yuan, Yuxiao Wang, Florian Schroff, Hartwig Adam, and the Mobile Vision team for the wonderful collaboration and support.

Source: Google AI Blog

Google Research: Looking Back at 2020, and Forward to 2021

When I joined Google over 20 years ago, we were just figuring out how to really start on the journey of making a high quality and comprehensive search service for information on the web, using lots of curiously wired computers. Fast forward to today, and while we’re taking on a much broader array of technical challenges, it’s still with the same overarching goal of organizing the world's information and making it universally accessible and useful. In 2020, as the world has been reshaped by COVID-19, we saw the ways research-developed technologies could help billions of people better communicate, understand the world, and get things done. I’m proud of what we’ve accomplished, and excited about new possibilities on the horizon.

The goal of Google Research is to work on long-term, ambitious problems across a wide range of important topics — from predicting the spread of COVID-19, to designing algorithms, to learning to translate more and more languages automatically, to mitigating bias in ML models. In the spirit of our annual reviews for 2019, 2018, and more narrowly focused reviews of some work in 2017 and 2016, this post covers key Google Research highlights from this unusual year. This is a long post, but grouped into many different sections. Hopefully, there’s something interesting in here for everyone! For a more comprehensive look, please see our >750 research publications in 2020.

COVID-19 and Health
As the impact of COVID-19 took a tremendous toll on people’s lives, researchers and developers around the world rallied together to develop tools and technologies to help public health officials and policymakers understand and respond to the pandemic. Apple and Google partnered in 2020 to develop the Exposure Notifications System (ENS), a Bluetooth-enabled privacy-preserving technology that allows people to be notified if they have been exposed to others who have tested positive for COVID-19. ENS supplements traditional contact tracing efforts and has been deployed by public health authorities in more than 50 countries, states and regions to help curb the spread of infection.

In the early days of the pandemic, public health officials signalled their need for more comprehensive data to combat the virus’ rapid spread. Our Community Mobility Reports, which provide anonymized insights into movement trends, are helping researchers not only understand the impact of policies like stay-at-home directives and social distancing, and also conduct economic forecasting.

Community Mobility Reports: Navigate and download a report for regions of interest.

Our own researchers have also explored using this anonymized data to forecast COVID-19 spread using graph neural networks instead of traditional time series-based models.

Although the research community knew little about this disease and secondary effects initially, we’re learning more every day. Our COVID-19 Search Trends symptoms allows researchers to explore temporal or symptomatic associations, such as anosmia — the loss of smell that is sometimes a symptom of the virus. To further support the broader research community, we launched Google Health Studies app to provide the public ways to participate in research studies.

Our COVID-19 Search Trends are helping researchers study the link between the disease’s spread and symptom-related searches.

Teams across Google are contributing tools and resources to the broader scientific community, which is working to address the health and economic impacts of the virus.

A spatio-temporal graph for modelling COVID-19 Spread.

Accurate information is critical in dealing with public health threats. We collaborated with many product teams at Google in order to improve information quality about COVID-19 in Google News and Search through supporting fact checking efforts, as well as similar efforts in YouTube.

We helped multilingual communities get equal access to critical COVID-19 information by sponsoring localization of’s weekly Situation Reports and developing a COVID-19 open source parallel dataset in collaboration with Translators Without Borders.

Modelling a complex global event is particularly challenging and requires more comprehensive epidemiological datasets, the development of novel interpretable models and agent-based simulators to inform the public health response. Machine learning techniques have also helped in other ways from deploying natural language understanding to helping researchers quickly navigate the mountains of COVID-19 scientific literature, applying anonymization technology to protect privacy while making useful datasets available, and exploring whether public health can conduct faster screening with fewer tests via Bayesian group testing.

These are only a sample of the many pieces of work that happened across Google to help users and public health authorities respond to COVID-19. For more, see using technology to help take on COVID-19.

Research in Machine Learning for Medical Diagnostics
We continue to make headway helping clinicians harness the power of ML to deliver better care for more patients. This year we have described notable advances in applying computer vision to aid doctors in the diagnosis and management of cancer, including helping to make sure that doctors don’t miss potentially cancerous polyps during colonoscopies, and showing that an ML system can achieve substantially higher accuracy than pathologists in Gleason grading of prostate tissue, enabling radiologists to achieve significant reductions in both false negative and false positive results when examining X-rays for signs of breast cancer.

To determine the aggressiveness of prostate cancers, pathologists examine a biopsy and assign it a Gleason grade. In published research, our system was able to grade with higher accuracy than a cohort of pathologists who have not had specialist training in prostate cancer. The first stage of the deep learning system assigns a Gleason grade to every region in a biopsy. In this biopsy, green indicates Gleason pattern 3, while yellow indicates Gleason pattern 4.

We’ve also been working on systems to help identify skin disease, help detect age-related macular degeneration (the leading cause of blindness in the U.S. and U.K., and the third-largest cause of blindness worldwide), and on potential novel non-invasive diagnostics (e.g., being able to detect signs of anemia from retinal images).

Our study examines how a deep learning model can quantify hemoglobin levels — a measure doctors use to detect anemia — from retinal images.

This year has also brought exciting demonstrations of how these same technologies can peer into the human genome. Google’s open-source tool, DeepVariant, identifies genomic variants in sequencing data using a convolutional neural network, and this year won the FDA Challenge for best accuracy in 3 out of 4 categories. Using this same tool, a study led by the Dana-Farber Cancer Institute improved diagnostic yield by 14% for genetic variants that lead to prostate cancer and melanoma in a cohort of 2,367 cancer patients.

Research doesn’t end at measurement of experimental accuracy. Ultimately, truly helping patients receive better care requires understanding how ML tools will affect people in the real world. This year we began work with Mayo Clinic to develop a machine learning system to assist in radiotherapy planning and to better understand how this technology could be deployed into clinical practice. With our partners in Thailand, we’ve used diabetic eye disease screening as a test case in how we can build systems with people at the center, and recognize the fundamental role of diversity, equity, and inclusion in building tools for a healthier world.

Weather, Environment and Climate Change
Machine learning can help us better understand the environment and make useful predictions to help people in both their everyday life as well as in disaster situations. For weather and precipitation forecasting, computationally intensive physics-based models like NOAA’s HRRR have long reigned supreme. We have been able to show, though, that ML-based forecasting systems can predict current precipitation with much better spatial resolution (“Is it raining in my local park in Seattle?” and not just “Is it raining in Seattle?”) and can produce short-term forecasts of up to eight hours that are considerably more accurate than HRRR, and can compute the forecast more quickly, yet with higher temporal and spatial resolution.

A visualization of predictions made over the course of roughly one day. Left: The 1-hour HRRR prediction made at the top of each hour, the limit to how often HRRR provides predictions. Center: The ground truth, i.e., what we are trying to predict. Right: The predictions made by our model. Our predictions are every 2 minutes (displayed here every 15 minutes) at roughly 10 times the spatial resolution made by HRRR. Notice that we capture the general motion and general shape of the storm.

We’ve also developed an improved technique called HydroNets, which uses a network of neural networks to model the actual river systems in the world to more accurately understand the interactions of upstream water levels to downstream inundation, resulting in more accurate water-level predictions and flood forecasting. Using these techniques, we've expanded our coverage of flood alerts by 20x in India and Bangladesh, helping to better protect more than 200 million people in 250,000 square kilometers.

An illustration of the HydroNets architecture.

Better analysis of satellite imagery data can also give Google users a better understanding of the impact and extent of wildfires (which caused devastating effects in California and Australia this year). We showed that automated analysis of satellite imagery can help with rapid assessment of damage after natural disasters even with limited prior satellite imagery. It can also aid urban tree-planting efforts by helping cities assess their current tree canopy coverage and where they should focus on planting new trees. We’ve also shown how machine learning techniques that leverage temporal context can help improve ecological and wildlife monitoring.

Based on this work, we’re excited to partner with NOAA on using AI and ML to amplify NOAA’s environmental monitoring, weather forecasting and climate research using Google Cloud’s infrastructure.

Machine learning continues to provide amazing opportunities for improving accessibility, because it can learn to transfer one kind of sensory input into others. As one example, we released Lookout, an Android application that can help visually impaired users by identifying packaged foods, both in a grocery store and also in their kitchen cupboard at home. The machine learning system behind Lookout demonstrates that a powerful-but-compact machine learning model can accomplish this in real-time on a phone for nearly 2 million products.

Similarly, people who communicate with sign language find it difficult to use video conferencing systems because even if they are signing, they are not detected as actively speaking by audio-based speaker detection systems. Developing Real-Time, Automatic Sign Language Detection for Video Conferencing presents a real-time sign language detection model and demonstrates how it can be used to provide video conferencing systems with a mechanism to identify the person signing as the active speaker.

We also enabled useful Android accessibility capabilities such as Voice Access and Sound Notifications for important household sounds.

Live Caption was expanded to support calls on the Pixel phone with the ability to caption phone calls and video calls. This came out of the Live Relay research project, which enables deaf and hard of hearing people to make calls without assistance.

Applications of ML to Other Fields
Machine learning continues to prove vital in helping us make progress across many fields of science. In 2020, in collaboration with the FlyEM team at HHMI Janelia Research Campus, we released the drosophila hemibrain connectome, the large synapse-resolution map of brain connectivity, reconstructed using large-scale machine learning models applied to high-resolution electron microscope imaging of brain tissue. This connectome information will aid neuroscientists in a wide variety of inquiries, helping us all better understand how brains function. Be sure to check out the very fly interactive 3-D UI!

The application of ML to problems in systems biology is also on the rise. Our Google Accelerated Science team, in collaboration with our colleagues at Calico, have been applying machine learning to yeast, to get a better understanding of how genes work together as a whole system. We’ve also been exploring how to use model-based reinforcement learning in order to design biological sequences like DNA or proteins that have desirable properties for medical or industrial uses. Model-based RL is used to improve sample efficiency. At each round of experimentation the policy is trained offline using a simulator fit on functional measurements from prior rounds. On various tasks like designing DNA transcription factor binding sites, designing antimicrobial proteins, and optimizing the energy of Ising models based on protein structures, we find that model-based RL is an attractive alternative to existing methods.

In partnership with X-Chem Pharmaceuticals and ZebiAI, we have also been developing ML techniques to do “virtual screening” of promising molecular compounds computationally. Previous work in this area has tended to focus on relatively small sets of related compounds, but in this work, we are trying to use DNA-encoded small molecule libraries in order to be able to generalize to find “hits” across a wide swath of chemical space, reducing the need for slower, physical-based lab work in order to progress from idea to working pharmaceutical.

We’ve also seen success applying machine learning to core computer science and computer systems problems, a growing trend that is spawning entire new conferences like MLSys. In Learning-based Memory Allocation for C++ Server Workloads, a neural network-based language model predicts context-sensitive per-allocation site object lifetime information, and then uses this to organize the heap so as to reduce fragmentation. It is able to reduce fragmentation by up to 78% while only using huge pages (which are better for TLB behavior). End-to-End, Transferable Deep RL for Graph Optimization described an end-to-end transferable deep reinforcement learning method for computational graph optimization that shows 33%-60% speedup on three graph optimization tasks compared to TensorFlow default optimization, with 15x faster convergence over prior computation graph optimization methods.

Overview of GO: An end-to-end graph policy network that combines graph embedding and sequential attention.

As described in Chip Design with Deep Reinforcement Learning, we have also been applying reinforcement learning to the problem of place-and-route in computer chip design. This is normally a very time-consuming, labor-intensive process, and is a major reason that going from an idea for a chip to actually having a fully designed and fabricated chip takes so long. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapidly generating optimized placements for previously unseen chip blocks. The system is able to generate placements that usually outperform those of human chip design experts, and we have been using this system (running on TPUs) to do placement and layout for major portions of future generations of TPUs. Menger is a recent infrastructure we’ve built for large-scale distributed reinforcement learning that is yielding promising performance for difficult RL tasks such as chip design.

Macro placements of Ariane, an open-source RISC-V processor, as training progresses. On the left, the policy is being trained from scratch, and on the right, a pre-trained policy is being fine-tuned for this chip. Each rectangle represents an individual macro placement. Notice how the cavity that is occupied by non-macro logic cells that is discovered by the from-scratch policy is already present from the outset in the pre-trained policy’s placement.

Responsible AI
The Google AI Principles guide our development of advanced technologies. We continue to invest in responsible AI research and tools, update our recommended technical practices in this area, and share regular updates — including a 2020 blog post and report — on our progress in implementation.

To help better understand the behavior of language models, we developed the Language Interpretability Tool (LIT), a toolkit for better interpretability of language models, enabling interactive exploration and analysis of their decisions. We developed techniques for measuring gendered correlations in pre-trained language models and scalable techniques for reducing gender bias in Google Translate. We used the kernel trick to propose a simple method to estimate the influence of a training data example on an individual prediction. To help non-specialists interpret machine learning results, we extended the TCAV technique introduced in 2019 to now provide a complete and sufficient set of concepts. With the original TCAV work, we were able to say that ‘fur’ and ‘long ears’ are important concepts for ‘rabbit’ prediction. With this work, we can also say that these two concepts are enough to fully explain the prediction; you don’t need any other concepts. Concept bottleneck models are a technique to make models more interpretable by training them so that one of the layers is aligned with pre-defined expert concepts (e.g., “bone spurs present”, or “wing color”, as shown below) before making a final prediction for a task, so that we can not only interpret but also turn on/off these concepts on the fly.

Aligning predictions to pre-identified concepts can make models more interpretable, as described in Concept Bottleneck Models.

In collaboration with many other institutions, we also looked into memorization effects of language models, showing that training data extraction attacks are realistic threats on state-of-the-art large language models. This finding along with a result that embedding models can leak information can have significant privacy implications (especially for models trained on private data). In Thieves of Sesame Street: Model Extraction on BERT-based APIs, we demonstrated that attackers with only API access to a language model could create models whose outputs had very high correlation with the original model, even with relatively few API queries to the original model. Subsequent work demonstrated that attackers can extract smaller models with arbitrary accuracy. On the AI Principle of safety we demonstrated that thirteen published defenses to adversarial examples can be circumvented despite attempting to perform evaluations using adaptive attacks. Our work focuses on laying out the methodology and the approach necessary to perform an adaptive attack, and thus will allow the community to make further progress in building more robust models.

Examining the way in which machine learning systems themselves are examined is also an important area of exploration. In collaboration with the Partnership on AI, we defined a framework for how to audit the use of machine learning in software product settings, drawing on lessons from the aerospace, medical devices, and finance industries and their best practices. In joint work with University of Toronto and MIT, we identified several ethical concerns that can arise when auditing the performance of facial recognition systems. In joint work with the University of Washington, we identified some important considerations related to diversity and inclusion when choosing subsets for evaluating algorithmic fairness. As an initial step in making responsible AI work for the next billion users and to help understand if notions of fairness were consistent in different parts of the world, we analyzed and created a framework for algorithmic fairness in India, accounting for datasets, fairness optimizations, infrastructures, and ecosystems

The Model Cards work that was introduced in collaboration with the University of Toronto in 2019 has been growing in influence. Indeed, many well-known models like OpenAI’s GPT-2 and GPT-3, many of Google’s MediaPipe models and various Google Cloud APIs have all adopted Model Cards as a way of giving users of a machine learning model more information about the model’s development and the observed behavior of the model under different conditions. To make this easier for others to adopt for their own machine learning models, we also introduced the Model Card Toolkit for easier model transparency reporting. In order to increase transparency in ML development practices, we demonstrate the applicability of a range of best practices throughout the dataset development lifecycle, including data requirements specification and data acceptance testing.

In collaboration with the U.S. National Science Foundation (NSF), we announced and helped to fund a National AI Research Institute for Human-AI Interaction and Collaboration. We also released the MinDiff framework, a new regularization technique available in the TF Model Remediation library for effectively and efficiently mitigating unfair biases when training ML models, along with ML-fairness gym for building simple simulations that explore potential long-run impacts of deploying machine learning-based decision systems in social environments.

In addition to developing frameworks for fairness, we developed approaches for identifying and improving the health and quality of experiences with Recommender Systems, including using reinforcement learning to introduce safer trajectories. We also continue to work on improving the reliability of our machine learning systems, where we’ve seen that approaches such as generating adversarial examples can improve robustness and that robustness approaches can improve fairness.

Differential privacy is a way to formally quantify privacy protections and requires a rethinking of the most basic algorithms to operate in a way that they do not leak information about any particular individual. In particular, differential privacy can help in addressing memorization effects and information leakage of the kinds mentioned above. In 2020 there were a number of exciting developments, from more efficient ways of computing private empirical risk minimizers to private clustering methods with tight approximation guarantees and private sketching algorithms. We also open sourced the differential privacy libraries that lie at the core of our internal tools, taking extra care to protect against leakage caused by the floating point representation of real numbers. These are the exact same tools that we use to produce differentially private COVID-19 mobility reports that have been a valuable source of anonymous data for researchers and policymakers.

To help developers assess the privacy properties of their classification models we released an ML privacy testing library in Tensorflow. We hope this library will be the starting point of a robust privacy testing suite that can be used by any machine learning developer around the world.

Membership inference attack on models for CIFAR10. The x-axis is the test accuracy of the model, and y-axis is vulnerability score (lower means more private). Vulnerability grows while test accuracy remains the same — better generalization could prevent privacy leakage.

In addition to pushing the state of the art in developing private algorithms, I am excited about the advances we made in weaving privacy into the fabric of our products. One of the best examples is Chrome’s Privacy Sandbox, which changes the underpinnings of the advertising ecosystem and helps systematically protect individuals’ privacy. As part of the project, we proposed and evaluated a number of different APIs, including federated learning of cohorts (FLoC) for interest based targeting, and aggregate APIs for differentially private measurement.

Launched in 2017, federated learning is now a complete research field unto itself, with over 3000 publications on federated learning appearing in 2020 alone. Our cross-institutional Advances and Open Problems in Federated Learning survey paper published in 2019 has been cited 367 times in the past year, and an updated version will soon be published in the Foundations & Trends in Machine Learning series. In July, we hosted a Workshop on Federated Learning and Analytics, and made all research talks and a TensorFlow Federated tutorial publicly available.

The lifecycle of an FL-trained model and the various actors in a federated learning system.

We continue to push the state of the art in federated learning, including the development of new federated optimization algorithms including adaptive learning algorithms, posterior averaging algorithms, and techniques for mimicking centralized algorithms in federated settings, substantial improvements in complimentary cryptographic protocols, and more. We announced and deployed federated analytics, enabling data science over raw data that is stored locally on users’ devices. New uses of federated learning in Google products include contextual emoji suggestions in Gboard, and pioneering privacy-preserving medical research with Google Health Studies. Furthermore, in Privacy Amplification via Random Check-Ins we presented the first privacy accounting mechanism for Federated Learning.

Security for our users is also an area of considerable interest for us. In 2020, we continued to improve protections for Gmail users, by deploying a new ML-based document scanner that provides protection against malicious documents, which increased malicious office document detection by 10% on a daily basis. Thanks to its ability to generalize, this tool has been very effective at blocking some adversarial malware campaigns that elude other detection mechanisms and increased our detection rate by 150% in some cases.

On the account protection side, we released a fully open-source security key firmware to help advance state of art in the two factor authentication space, staying focused on security keys as the best way to protect accounts against phishing.

Natural Language Understanding
Better understanding of language is an area where we saw considerable progress this year. Much of the work in this space from Google and elsewhere now relies on Transformers, a particular style of neural network model originally developed for language problems (but with a growing body of evidence that they are also useful for images, videos, speech, protein folding, and a wide variety of other domains).

One area of excitement is in dialog systems that can chat with a user about something of interest, often encompassing multiple turns of interaction. While successful work in this area to date has involved creating systems that are specialized around particular topics (e.g., Duplex) these systems cannot carry on general conversations. In pursuit of the general research goal of creating systems capable of much more open-ended dialog, in 2020 we described Meena, a learned conversational agent that aspirationally can chat about anything. Meena achieves high scores on a dialog system metric called SSA, which measures both sensibility and specificity of responses. We’ve seen that as we scale up the model size of Meena, it is able to achieve lower perplexity and, as shown in the paper, lower perplexity correlates extremely closely with improved SSA.

A chat between Meena (left) and a person (right).

One well-known issue with generative language models and dialog systems is that when discussing factual data, the model’s capacity may not be large enough to remember every specific detail about a topic, so they generate language that is plausible but incorrect. (This is not unique to machines — people can commit these errors too.) To address this in dialog systems, we are exploring ways to augment a conversational agent by giving it access to external information sources (e.g., a large corpus of documents or a search engine API), and developing learning techniques to use this as an additional resource in order to generate language that is consistent with the retrieved text. Work in this area includes integrating retrieval into language representation models (and a key underlying technology for this to work well is something like ScaNN, an efficient vector similarity search, to efficiently match the desired information to information in the corpus of text). Once appropriate content is found, it can be better understood with approaches like using neural networks to find answers in tables and extracting structured data from templatic documents. Our work on PEGASUS, a state-of-the-art model for abstractive text summarization can also help to create automatic summaries from any piece of text, a general technique useful in conversations, retrieval systems, and many other places.

Efficiency of NLP models has also been a significant focus for our work in 2020. Techniques like transfer learning and multi-task learning can dramatically help with making general NLP models usable for new tasks with modest amounts of computation. Work in this vein includes transfer learning explorations in T5, sparse activation of models (as in our GShard work mentioned below), and more efficient model pre-training with ELECTRA. Several threads of work also look to improve on the basic Transformer architecture, including Reformer, which uses locality-sensitive hashing and reversible computation to more efficiently support much larger attention windows, Performers, which use an approach for attention that scales linearly rather than quadratically (and discusses its use in the context of protein modeling), and ETC and BigBird, which utilize global and sparse random connections, to enable linear scaling for larger and structured sequences. We also explored techniques for creating very lightweight NLP models that are 100x smaller than a larger BERT model, but perform nearly as well for some tasks, making them very suitable for on-device NLP. In Encode, Tag and Realize, we also explored new approaches for generative text models that use edit operations rather than fully general text generation, which can have advantages in computation requirements for generation, more control over the generated text, and require less training data.

Language Translation
Effective language translation helps bring the world closer together by enabling us to all communicate, despite speaking different languages. To date, over a billion people around the world use Google Translate, and last year we added support for five new languages (Kinyarwanda, Odia, Tatar, Turkmen and Uyghur, collectively spoken by 75 million people). Translation quality continues to improve, showing an average +5 BLEU point gain across more than 100 languages from May 2019 to May 2020, through a wide variety of techniques like improved model architectures and training, better handling of noise in datasets, multilingual transfer and multi-task learning, and better use of monolingual data to improve low-resource languages (those without much written public content on the web), directly in line with our goals of improving ML fairness of machine learning systems to provide benefits to the greatest number of people possible.

We strongly believe that continued scaling of multilingual translation models will bring further quality improvements, especially to the billions of speakers of low-resource languages around the world. In GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, Google researchers showed that training sparsely-activated multilingual translation models of up to 600 billion parameters leads to major improvements in translation quality for 100 languages as measured by BLEU score improvement over a baseline of a separate 400M parameter monolingual baseline model for each language. Three trends stood out in this work, illustrated by Figure 6 in the paper, reproduced below (see the paper for complete discussion):

  • The BLEU score improvements from multilingual training are high for all languages but are even higher for low-resource languages (right hand side of graph is higher than the left) whose speakers represent billions of people in some of the world’s most marginalized communities. Each rectangle on the figure represents languages with 1B speakers.
  • The larger and deeper the model, the larger the BLEU score improvements were across all languages (the lines hardly ever cross).
  • Large, sparse models also show a ~10x to 100x improvement in computational efficiency for model training over training a large, dense model, while simultaneously matching or significantly exceeding the BLEU scores of the large, dense model (computational efficiency discussed in paper).
An illustration of the significant gains in translation quality across 100 languages for large, sparsely-activated language models described in GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.

We’re actively working on bringing the benefits demonstrated in this GShard research work to Google Translate, as well as training single models that cover 1000 languages, including languages like Dhivehi and Sudanese Arabic (while sharing some challenges that needed solving along the way).

We also developed techniques to create language-agnostic representations of sentences for BERT models, which can help with developing better translation models. To more effectively evaluate translation quality, we introduced BLEURT, a new metric for evaluating language generation for tasks like translation that considers the semantics of the generated text, rather than just the amount of word overlap with ground-truth data, illustrated in the table below.

Machine Learning Algorithms
We continue to develop new machine learning algorithms and approaches for training that enable systems to learn more quickly and from less supervised data. By replaying intermediate results during training of neural networks, we find that we can fill idle time on ML accelerators and therefore can train neural networks faster. By changing the connectivity of neurons dynamically during training, we can find better solutions compared with statically-connected neural networks. We also developed SimCLR, a new self-supervised and semi-supervised learning technique that simultaneously maximizes agreement between differently transformed views of the same image and minimizes agreement between transformed views of different images. This approach significantly improves on the best self-supervised learning techniques.

ImageNet top-1 accuracy of linear classifiers trained on representations learned with different self-supervised methods (pretrained on ImageNet). Gray cross indicates supervised ResNet-50.

We also extended the idea of contrastive learning to the supervised regime, resulting in a loss function that significantly improves over cross-entropy for supervised classification problems.

Reinforcement Learning
Reinforcement learning (RL), which learns to make good long-term decisions from limited experience, has been an important focus area for us. An important challenge in RL is to learn to make decisions from few data points, and we’ve improved RL algorithm efficiency through learning from fixed datasets, learning from other agents, and improving exploration.

A major focus area this year has been around offline RL, which relies solely on fixed, previously collected datasets (for example, from previous experiments or human demonstrations), extending RL to the applications that can’t collect training data on-the-fly. We’ve introduced a duality approach to RL, developed improved algorithms for off-policy evaluation, estimating confidence intervals, and offline policy optimization. In addition, we’re collaborating with the broader community to tackle these problems by releasing open-source benchmark datasets, and DQN dataset for Atari.

Offline RL on Atari games using the DQN Replay Dataset.

Another line of research improved sample efficiency by learning from other agents through apprenticeship learning. We developed methods to learn from informed agents, matching other agent’s distribution, or learning from adversarial examples. To improve the exploration in RL, we explored bonus-based exploration methods including imitation techniques able to mimic structured exploration arising in agents having prior knowledge about their environment.

We’ve also made significant advances in the mathematical theory of reinforcement learning. One of our main areas of research was studying reinforcement learning as an optimization process. We found connections to the Frank-Wolfe algorithm, momentum methods, KL divergence regularization, operator theory, and convergence analysis; some of these insights led to an algorithm that achieves state-of-the-art performance in challenging RL benchmarks and discovery that polynomial transfer functions avoid convergence problems associated with softmax, both in RL and supervised learning. We’ve made some exciting progress on the topic of safe reinforcement learning, where one seeks to discover optimal control rules while respecting important experimental constraints. This includes a framework for safe policy optimization. We studied efficient RL-based algorithms for solving a class of problems known as mean field games, which model systems with a large number of decision-makers, from mobile networks to electric grids.

We’ve made breakthroughs toward generalization to new tasks and environments, an important challenge for scaling up RL to complex real-world problems. A 2020 focus area was population-based learning-to-learn methods, where another RL or evolutionary agent trained a population of RL agents to create a curriculum of emergent complexity, and discover new state-of-the-art RL algorithms. Learning to estimate the importance of data points in the training set and parts of visual input with selective attention resulted in significantly more skillful RL agents.

Overview of our method and illustration of data processing flow in AttentionAgent. Top: Input transformation — A sliding window segments an input image into smaller patches, and then “flattens” them for future processing. Middle: Patch election — The modified self-attention module holds votes between patches to generate a patch importance vector. Bottom: Action generation — AttentionAgent picks the patches of the highest importance, extracts corresponding features and makes decisions based on them.

Further, we made progress in model-based RL by showing that learning predictive behavior models accelerates RL learning, and enables decentralized cooperative multi-agent tasks in diverse teams, and learning long-term behavior models. Observing that skills bring predictable changes in the environment, we discover skills without supervision. Better representations stabilize RL learning, while hierarchical latent spaces and value-improvement paths yield better performance.

We shared open source tools for scaling up and productionizing RL. To expand the scope and problems tackled by users, we’ve introduced SEED, a massively parallel RL agent, released a library for measuring the RL algorithm reliability, and a new version of TF-Agents that includes distributed RL, TPU support, and a full set of bandit algorithms. In addition, we performed a large empirical study of RL algorithms to improve hyperparameter selection and algorithm design.

Finally, in collaboration with Loon, we trained and deployed RL to more efficiently control stratospheric balloons, improving both power usage and their ability to navigate.

Using learning algorithms to develop new machine learning techniques and solutions, or meta-learning, is a very active and exciting area of research. In much of our previous work in this area, we’ve created search spaces that look at how to find ways to combine sophisticated hand-designed components together in interesting ways. In AutoML-Zero: Evolving Code that Learns, we took a different approach, by giving an evolutionary algorithm a search space consisting of very primitive operations (like addition, subtraction, variable assignment, and matrix multiplication) in order to see if it was possible to evolve modern ML algorithms from scratch. The presence of useful learning algorithms in this space is incredibly sparse, so it is remarkable that the system was able to progressively evolve more and more sophisticated ML algorithms. As shown in the figure below, the system reinvents many of the most important ML discoveries over the past 30 years, such as linear models, gradient descent, rectified linear units, effective learning rate settings and weight initializations, and gradient normalization.

We also used meta-learning to discover a variety of new efficient architectures for object detection in both still images and videos. Last year’s work on EfficientNet for efficient image classification architectures showed significant accuracy improvements and computational cost reductions for image classification. In follow-on work this year, EfficientDet: Towards Scalable and Efficient Object Detection builds on top of the EfficientNet work to derive new efficient architectures for object detection and localization, showing remarkable improvements in both highest absolute accuracy, as well as computational cost reductions of 13-42x over previous approaches to achieve a given level of accuracy.

EfficientDet achieves state-of-the-art 52.2 mAP, up 1.5 points from the prior state of the art (not shown since it is at 3045B FLOPs) on COCO test-dev under the same setting. Under the same accuracy constraint, EfficientDet models are 4x-9x smaller and use 13x-42x less computation than previous detectors.

Our work on SpineNet describes a meta-learned architecture that can retain spatial information more effectively, allowing detection to be done at finer resolution. We also focused on learning effective architectures for a variety of video classification problems. AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures, AssembleNet++: Assembling Modality Representations via Attention Connections, and AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification demonstrate how to use evolutionary algorithms to create novel state-of-the-art video processing machine learning architectures.

This approach can also be used to develop effective model architectures for time series forecasting. Using AutoML for Time Series Forecasting describes the system that discovers new forecasting models via an automated search over a search space involving many interesting kinds of low-level building blocks, and its effectiveness was demonstrated in the Kaggle M5 Forecasting Competition, by generating an algorithm and system that placed 138th out of 5558 participants (top 2.5%). While many of the competitive forecasting models required months of manual effort to create, our AutoML solution found the model in a short time with only a moderate compute cost (500 CPUs for 2 hours) and no human intervention.

Better Understanding of ML Algorithms and Models
Deeper understanding of machine learning algorithms and models is crucial for designing and training more effective models, as well as understanding when models may fail. Last year, we focused on fundamental questions around representation power, optimization, model generalization, and label noise, among others. As mentioned earlier in this post, Transformer networks have had a huge impact on modeling language, speech and vision problems, but what is the class of functions represented by these models? Recently we showed that transformers are universal approximators for sequence-to-sequence functions. Furthermore, sparse transformers also remain universal approximators even when they use just a linear number of interactions among the tokens. We have been developing new optimization techniques based on layerwise adaptive learning rates to improve the convergence speed of transformers, e.g., Large batch optimization for deep learning (LAMB): Training BERT in 76 minutes.

As neural networks are made wider and deeper, they often train faster and generalize better. This is a core mystery in deep learning since classical learning theory suggests that large networks should overfit more. We are working to understand neural networks in this overparameterized regime. In the limit of infinite width, neural networks take on a surprisingly simple form, and are described by a Neural Network Gaussian Process (NNGP) or Neural Tangent Kernel (NTK). We studied this phenomenon theoretically and experimentally, and released Neural Tangents, an open-source software library written in JAX that allows researchers to build and train infinite-width neural networks.

Left: A schematic showing how deep neural networks induce simple input / output maps as they become infinitely wide. Right: As the width of a neural network increases, we see that the distribution of outputs over different random instantiations of the network becomes Gaussian.

As finite width networks are made larger, they also demonstrate peculiar double descent phenomena — where they generalize better, then worse, then better again with increasing width. We have shown that this phenomenon can be explained by a novel bias-variance decomposition, and further that it can sometimes manifest as triple descent.

Lastly, in real-world problems, one often needs to deal with significant label noise. For instance, in large scale learning scenarios, weakly labeled data is available in abundance with large label noise. We have developed new techniques for distilling effective supervision from severe label noise leading to state-of-the-art results. We have further analyzed the effects of training neural networks with random labels, and shown that it leads to alignment between network parameters and input data, enabling faster downstream training than initializing from scratch. We have also explored questions such as whether label smoothing or gradient clipping can mitigate label noise, leading to new insights for developing robust training techniques with noisy labels.

Algorithmic Foundations and Theory
2020 was a productive year for our work in algorithmic foundations and theory, with several impactful research publications and notable results. On the optimization front, our paper on edge-weighted online bipartite matching develops a new technique for online competitive algorithms and solves a thirty-year old open problem for the edge-weighted variant with applications in efficient online ad allocation. Along with this work in online allocation, we developed dual mirror descent techniques that generalize to a variety of models with additional diversity and fairness constraints, and published a sequence of papers on the topic of online optimization with ML advice in online scheduling, online learning and online linear optimization. Another research result gave the first improvement in 50 years on the classic bipartite matching in dense graphs. Finally, another paper solves a long-standing open problem about chasing convex bodies online — using an algorithm from The Book, no less.

We also continued our work in scalable graph mining and graph-based learning and hosted the Graph Mining & Learning at Scale Workshop at NeurIPS’20, which covered work on scalable graph algorithms including graph clustering, graph embedding, causal inference, and graph neural networks. As part of the workshop, we showed how to solve several fundamental graph problems faster, both in theory and practice, by augmenting standard synchronous computation frameworks like MapReduce with a distributed hash-table similar to a BigTable. Our extensive empirical study validates the practical relevance of the AMPC model inspired by our use of distributed hash tables in massive parallel algorithms for hierarchical clustering and connected components, and our theoretical results show how to solve many of these problems in constant distributed rounds, greatly improving upon our previous results. We also achieved exponential speedup for computing PageRank and random walks. On the graph-based learning side, we presented Grale, our framework for designing graphs for use in machine learning. Furthermore, we presented our work on more scalable graph neural network models, where we show that PageRank can be used to greatly accelerate inference in GNNs.

In market algorithms, an area at the intersection of computer science and economics, we continued our research in designing improved online marketplaces, such as measuring incentive properties of ad auctions, two-sided markets, and optimizing order statistics in ad selection. In the area of repeated auctions, we developed frameworks to make dynamic mechanisms robust against lack of forecasting or estimation errors of the current market and/or the future market, leading to provably tight low-regret dynamic mechanisms. Later, we characterized when it is possible to achieve the asymptotically optimal objective through geometry-based criteria. We also compared the equilibrium outcome of a range of budget management strategies used in practice, showed their impact on the tradeoff between revenue and buyers' utility and shed light on their incentive properties. Additionally, we continued our research in learning optimal auction parameters, and settled the complexity of batch-learning with revenue loss. We designed the optimal regret and studied combinatorial optimization for contextual auction pricing, and developed a new active learning framework for auctions and improved the approximation for posted-price auctions. Finally, motivated by the importance of incentives in ad auctions, and in the hope to help advertisers study the impact of incentives in auctions, we introduce a data-driven metric to quantify how much a mechanism deviates from incentive compatibility.

Machine Perception
Perceiving the world around us — understanding, modeling and acting on visual, auditory and multimodal input — continues to be a research area with tremendous potential to be beneficial in our everyday lives.

In 2020, deep learning powered new approaches that bring 3D computer vision and computer graphics closer together. CvxNet, deep implicit functions for 3D shapes, neural voxel rendering and CoReNet are a few examples of this direction. Furthermore, our research on representing scenes as neural radiance fields (aka NeRF, see also this blog post) is a good example of how Google Research's academic collaborations stimulate rapid progress in the area of neural volume rendering.

In Learning to Factorize and Relight a City, a collaboration with UC Berkeley, we proposed a learning-based framework for disentangling outdoor scenes into temporally-varying illumination and permanent scene factors. This gives the ability to change lighting effects and scene geometry for any Street View panorama, or even turn it into a full-day timelapse video.

Our work on generative human shape and articulated pose models introduces a statistical, articulated 3D human shape modeling pipeline, within a fully trainable, modular, deep learning framework. Such models enable 3D human pose and shape reconstruction of people from a single photo to better understand the scene.

Overview of end-to-end statistical 3D articulated human shape model construction in GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models.

The growing area of media compression using neural networks continued to make strong progress in 2020, not only on learned image compression, but also in deep approaches to video compression, volume compression and nice results in deep distortion-agnostic image watermarking.

Samples of encoded and cover images for Distortion Agnostic Deep Watermarking. First row: Cover image with no embedded message. Second row: Encoded image from HiDDeN combined distortion model. Third row: Encoded images from our model. Fourth row: Normalized difference of the encoded image and cover image for the HiDDeN combined model. Fifth row: Normalized difference for our model

Additional important themes in perceptual research included:

Engaging with the broader research community through open sourcing of solutions and datasets is another important aspect of furthering perceptual research. In 2020, we open sourced multiple new perceptual inference capabilities and solutions in MediaPipe, such as on-device face, hand and pose prediction, real-time body pose tracking, real-time iris tracking and depth estimation, and real-time 3D object detection.

We continued to make strides to improve experiences and promote helpfulness on mobile devices through ML-based solutions. Our ability to run sophisticated natural language processing on-device, enabling more natural conversational features, continues to improve. In 2020, we expanded Call Screen and launched Hold for Me to allow users to save time when performing mundane tasks, and we also launched language-based actions and language navigability of our Recorder app to aid productivity.

We have used Google's Duplex technology to make calls to businesses and confirm things like temporary closures. This has enabled us to make 3 million updates to business information globally, that have been seen over 20 billion times on Maps and Search. We also used text to speech technology for easier access to web pages, by enabling Google Assistant to read it aloud, supporting 42 languages.

We also continued to make meaningful improvements to imaging applications. We made it easier to capture precious moments on Pixel with innovative controls and new ways to relight, edit, enhance and relive them again in Google Photos. For the Pixel camera, beginning with Pixel 4 and 4a, we added Live HDR+, which uses machine learning to approximate the vibrance and balanced exposure and appearance of HDR+ burst photography in real time in the viewfinder. We also created dual exposure controls, which allow the brightness of shadows and highlights in a scene to be adjusted independently — live in the viewfinder.

More recently, we introduced Portrait Light, a new post-capture feature for the Pixel Camera and Google Photos apps that adds a simulated directional light source to portraits. This feature is again one that is powered by machine learning, having been trained on 70 different people, photographed one light at a time, in our pretty cool 331-LED Light Stage computational illumination system.

In the past year, Google researchers were excited to contribute to many new (and timely) ways of using Google products. Here are a few examples

In the area of robotics research, we’ve made tremendous progress in our ability to learn more and more complex, safe and robust robot behaviors with less and less data, using many of the RL techniques described earlier in the post.

Transporter Networks are a novel approach to learning how to represent robotic tasks as spatial displacements. Representing relations between objects and the robot end-effectors, as opposed to absolute positions in the environment, makes learning robust transformations of the workspace very efficient.

In Grounding Language in Play, we demonstrated how a robot can be taught to follow natural language instructions (in many languages!). This required a scalable approach to collecting paired data of natural language instructions and robot behaviors. One key insight is that this can be accomplished by asking robot operators to simply play with the robot, and label after-the-fact what instructions would have led to the robot accomplishing the same task.

We also explored doing away with robots altogether (by having humans use a camera-equipped grasping stick) for even more scalable data collection, and how to efficiently transfer visual representations across robotic tasks.

We investigated how to learn very agile strategies for robot locomotion, by taking inspiration from nature, using evolutionary meta-learning strategies, human demonstrations, and various approaches to training data-efficient controllers using deep reinforcement learning.

One increased emphasis this year has been on safety: how do we deploy safe delivery drones in the real world? How do we explore the world in a way that always allows the robot to recover from its mistakes? How do we certify the stability of learned behaviors? This is a critical area of research on which we expect to see increased focus in the future.

Quantum Computing
Our Quantum AI team continued its work to establish practical uses of quantum computing. We ran experimental algorithms on our Sycamore processors to simulate systems relevant to chemistry and physics. These simulations are approaching a scale at which they can not be performed on classical computers anymore, making good on Feynman’s original idea of using quantum computers as an efficient means to simulate systems in which quantum effects are important. We published new quantum algorithms, for instance to perform precise processor calibration, to show an advantage for quantum machine learning or to test quantum enhanced optimization. We also worked on programming models to make it easier to express quantum algorithms. We released qsim, an efficient simulation tool to develop and test quantum algorithms with up to 40 qubits on Google Cloud.

We continued to follow our roadmap towards building a universal error-corrected quantum computer. Our next milestone is the demonstration that quantum error correction can work in practice. To achieve this, we will show that a larger grid of qubits can hold logical information exponentially longer than a smaller grid, even though individual components such as qubits, couplers or I/O devices have imperfections. We are also particularly excited that we now have our own cleanroom which should significantly increase the speed and quality of our processor fabrication.

Supporting the Broader Developer and Researcher Community
This year marked TensorFlow’s 5th birthday, passing 160M downloads. The TensorFlow community continued its impressive growth with new special interest groups, TensorFlow User Groups, TensorFlow Certificates, AI Service partners, and inspiring demos #TFCommunitySpotlight. We significantly improved TF 2.x with seamless TPU support, out of the box performance (and best-in-class performance on MLPerf 0.7), data preprocessing, distribution strategy and a new NumPy API.

We also added many more capabilities to the TensorFlow Ecosystem to help developers and researchers in their workflows: Sounds of India demonstrated going from research to production in under 90 days, using TFX for training and TF.js for deployment in the browser. With Mesh TensorFlow, we pushed the boundaries of model parallelism to provide ultra-high image resolution image analysis. We open-sourced the new TF runtime, TF Profiler for model performance debugging, and tools for Responsible AI, such as the Model Card Toolkit for model transparency and a privacy testing library. With we made it possible to easily host, track, and share your ML experiments for free.

In addition, we redoubled our investment in JAX, an open-source, research-focused ML system that has been actively developed over the past two years. Researchers at Google and beyond are now using JAX in a wide range of fields, including differential privacy, neural rendering, physics-informed networks, fast attention, molecular dynamics, tensor networks, neural tangent kernels, and neural ODEs. JAX accelerates research at DeepMind, powering a growing ecosystem of libraries and work on GANs, meta-gradients, reinforcement learning, and more. We also used JAX and the Flax neural network library to build record-setting MLPerf benchmark submissions, which we demonstrated live at NeurIPS on a large TPU Pod slice with a next-generation Cloud TPU user experience (slides, video, sign-up form). Finally, we’re ensuring that JAX works seamlessly with TF ecosystem tooling, from for data preprocessing and TensorBoard for experiment visualization to the TF Profiler for performance debugging, with more to come in 2021.

Many recent research breakthroughs have been enabled by increased computing power, and we make more than 500 petaflops of Cloud TPU computing power available for free to researchers around the world via the TFRC program to help broaden access to the machine learning research frontier. More than 120 TFRC-supported papers have been published to date, many of which would not have been possible without the computing resources that the program provides. For example, TFRC researchers have recently developed simulations of wildfire spread, helped analyze COVID-19 content and vaccine sentiment changes on social media networks, and advanced our collective understanding of the lottery ticket hypothesis and neural network pruning. Members of the TFRC community have also published experiments with Persian poetry, won a Kaggle contest on fine-grained fashion image segmentation, and shared tutorials and open-source tools as starting points for others. In 2021, we will change the name of the TFRC program to the TPU Research Cloud program to be more inclusive now that Cloud TPUs support JAX and PyTorch in addition to TensorFlow.

Finally, this was a huge year for Colab. Usage doubled, and we launched productivity features to help people do their work more efficiently, including improved Drive integration and access to the Colab VM via the terminal. And we launched Colab Pro to enable users to access faster GPUs, longer runtimes and more memory.

Open Datasets and Dataset Search
Open datasets with clear and measurable goals are often very helpful in driving forward the field of machine learning. To help the research community find interesting datasets, we continue to index a wide variety of open datasets sourced from many different organizations with Google Dataset Search. We also think it's important to create new datasets for the community to explore and to develop new techniques, while ensuring that we share open data responsibly. This year, in addition to open datasets to help address the COVID crisis, we released a number of open datasets across many different areas:

Research Community Interaction
We are proud to enthusiastically support and participate in the broader research community. In 2020, Google researchers presented over 500 papers at leading research conferences, additionally serving on program committees, organizing workshops, tutorials and numerous other activities aimed at collectively progressing the state of the art in the field. To learn more about our contributions to some of the larger research conferences this year, please see our blog posts for ICLR 2020, CVPR 2020, ACL 2020, ICML 2020, ECCV 2020 and NeurIPS 2020.

In 2020 we supported external research with $37M in funding, including $8.5M in COVID research, $8M in research inclusion and equity, and $2M in responsible AI research. In February, we announced the 2019 Google Faculty Research Award Recipients, funding research proposals from 150 faculty members throughout the world. Among this group, 27% self-identified as members of historically underrepresented groups within technology. We also announced a new Research Scholar Program to support early-career professors who are pursuing research in fields relevant to Google via unrestricted gifts. As we have for more than a decade, we selected a group of incredibly talented PhD student researchers to receive Google PhD Fellowships, which provides funding for graduate studies, as well as mentorship as they pursue their research, and opportunities to interact with other Google PhD Fellows.

We are also expanding the ways that we support inclusion and bring new voices into the field of computer science. In 2020, we created a new Award for Inclusion Research program that supports academic research in computing and technology addressing the needs of underrepresented populations. In the inaugural set of awards, we selected 16 proposals for funding with 25 principal investigators, focused on topics around diversity and inclusion, algorithmic bias, education innovation, health tools, accessibility, gender bias, AI for social good, security, and social justice. We additionally partnered with the Computing Alliance of Hispanic-Serving Institutions (CAHSI) and the CMD-IT Diversifying Future Leadership in the Professoriate Alliance (FLIP) to create an award program for doctoral students from traditionally underrepresented backgrounds to support the last year of the completion of the dissertation requirements.

In 2019, Google’s CS Research Mentorship Program (CSRMP) helped provide mentoring to 37 undergraduate students to introduce them to conducting computer science research. Based on the success of the program in 2019/2020, we’re excited to greatly expand this program in 2020/2021 and will have hundreds of Google researchers mentoring hundreds of undergraduate students in order to encourage more people from underrepresented backgrounds to pursue computer science research careers. Finally, in October we provided exploreCSR awards to 50 institutions around the world for the 2020 academic year. These awards fund faculty to host workshops for undergraduates from underrepresented groups in order to encourage them to pursue CS research.

Looking Forward to 2021 and Beyond
I’m excited about what’s to come, from our technical work on next-generation AI models, to the very human work of growing our community of researchers.

We’ll keep ensuring our research is done responsibly and has a positive impact, using our AI Principles as a guiding framework and applying particular scrutiny to topics that can have broad societal impact. This post covers just a few of the many papers on responsible AI that Google published in the past year. While pursuing our research, we’ll focus on:

  • Promoting research integrity: We’ll make sure Google keeps conducting a wide range of research in an appropriate manner, and provides comprehensive, scientific views on a variety of challenging, interesting topics.
  • Responsible AI development: Tackling tough topics will remain core to our work, and Google will continue creating new ML algorithms to make machine learning more efficient and accessible, developing approaches to combat unfair bias in language models, devising new techniques for ensuring privacy in learning systems, and much more. And importantly, beyond looking at AI development with a suitably critical eye, we’re eager to see what techniques we and others in the community can develop to mitigate risks and make sure new technologies have equitable, positive impacts on society.
  • Advancing diversity, equity, and inclusion: We care deeply that the people who are building influential products and computing systems better reflect the people using these products all around the world. Our efforts here are both within Google Research, as well as within the wider research and academic communities — we’ll be calling upon the academic and industry partners we work with to advance these efforts together. On a personal level, I am deeply committed to improving representation in computer science, having spent hundreds of hours working towards these goals over the last few years, as well as supporting universities like Berkeley, CMU, Cornell, Georgia Tech, Howard, UW, and numerous other organizations that work to advance inclusiveness. This is important to me, to Google, and to the broader computer science community.

Finally, looking ahead to the year, I’m particularly enthusiastic about the possibilities of building more general-purpose machine learning models that can handle a variety of modalities and that can automatically learn to accomplish new tasks with very few training examples. Advances in this area will empower people with dramatically more capable products, bringing better translation, speech recognition, language understanding and creative tools to billions of people all around the world. This kind of exploration and impact is what keeps us excited about our work!

Thanks to Martin Abadi, Marc Bellemare, Elie Bursztein, Zhifeng Chen, Ed Chi, Charina Chou, Katherine Chou, Eli Collins, Greg Corrado, Corinna Cortes, Tiffany Deng, Tulsee Doshi, Robin Dua, Kemal El Moujahid, Aleksandra Faust, Orhan Firat, Jen Gennai, Till Hennig, Ben Hutchinson, Alex Ingerman, Tomáš Ižo, Matthew Johnson, Been Kim, Sanjiv Kumar, Yul Kwon, Steve Langdon, James Laudon, Quoc Le, Yossi Matias, Brendan McMahan, Aranyak Mehta, Vahab Mirrokni, Meg Mitchell, Hartmut Neven, Mohammad Norouzi, Timothy Novikoff, Michael Piatek, Florence Poirel, David Salesin, Nithya Sambasivan, Navin Sarma, Tom Small, Jascha Sohl-Dickstein, Zak Stone, Rahul Sukthankar, Mukund Sundararajan, Andreas Terzis, Sergei Vassilvitskii, Vincent Vanhoucke, and Leslie Yeh and others for helpful feedback and for drafting portions of this post, and to the entire Research and Health communities at Google for everyone’s contributions towards this work.

Source: Google AI Blog

End-to-End, Transferable Deep RL for Graph Optimization

An increasing number of applications are driven by large and complex neural networks trained on diverse sets of accelerators. This process is facilitated by ML compilers that map high-level computational graphs to low-level, device-specific executables. In doing so, ML compilers need to solve many optimization problems, including graph rewriting, assignment of operations on devices, operation fusion, layout and tiling of tensors, and scheduling. For example, in a device placement problem, the compiler needs to determine the mapping between operations in the computational graph to the target physical devices so that an objective function, such as training step time, can be minimized. The placement performance is determined by a mixture of intricate factors, including inter-device network bandwidth, peak device memory, co-location constraints, etc., making it challenging for heuristics or search-based algorithms, which typically settle for fast, but sub-optimal, solutions. Furthermore, heuristics are hard to develop and maintain, especially as newer model architectures emerge.

Recent attempts at using learning-based approaches have demonstrated promising results, but they have a number of limitations that make them infeasible to be deployed in practice. Firstly, these approaches do not easily generalize to unseen graphs, especially those arising from newer model architectures, and second, they have poor sample efficiency, leading to high resource consumption during training. Finally, they are only able to solve a single optimization task, and consequently, do not capture the dependencies across the tightly coupled optimization problems in the compilation stack.

In “Transferable Graph Optimizers for ML Compilers”, recently published as an oral paper at NeurIPS 2020, we propose an end-to-end, transferable deep reinforcement learning method for computational graph optimization (GO) that overcomes all of the above limitations. We demonstrate 33%-60% speedup on three graph optimization tasks compared to TensorFlow default optimization. On a diverse set of representative graphs consisting of up to 80,000 nodes, including Inception-v3, Transformer-XL, and WaveNet, GO achieves an average 21% improvement over expert optimization and an 18% improvement over the prior state of the art with 15x faster convergence.

Graph Optimization Problems in ML Compilers
There are three coupled optimization tasks that frequently arise in ML compilers, which we formulate as decision problems that can be solved using a learned policy. The decision problems for each of the tasks can be reframed as making a decision for each node in the computational graph.

The first optimization task is device placement, where the goal is to determine how best to assign the nodes of the graph to the physical devices on which it runs such that the end-to-end run time is minimized.

The second optimization task is operation scheduling. An operation in a computational graph is ready to run when its incoming tensors are present in the device memory. A frequently used scheduling strategy is to maintain a ready queue of operations for each device and schedule operations in first-in-first-out order. However, this scheduling strategy does not take into account the downstream operations placed on other devices that might be blocked by an operation, and often leads to schedules with underutilized devices. To find schedules that can keep track of such cross-device dependencies, our approach uses a priority-based scheduling algorithm that schedules operations in the ready queue based on the priority of each. Similar to device placement, operation scheduling can then be formulated as the problem of learning a policy that assigns a priority for each node in the graph to maximize a reward based on run time.

The third optimization task is operation fusion. For brevity we omit a detailed discussion of this problem here, and instead just note that similar to priority-based scheduling, operation fusion can also use a priority-based algorithm to decide which nodes to fuse. The goal of the policy network in this case is again to assign a priority for each node in the graph.

Finally, it is important to recognize that the decisions taken in each of the three optimization problems can affect the optimal decision for the other problems. For example, placing two nodes on two different devices effectively disables fusion and introduces a communication delay that can influence scheduling.

RL Policy Network Architecture
Our research presents GO, a deep RL framework that can be adapted to solve each of the aforementioned optimization problems — both individually as well as jointly. There are three key aspects of the proposed architecture:

First, we use graph neural networks (specifically GraphSAGE) to capture the topological information encoded in the computational graph. The inductive network of GraphSAGE leverages node attribute information to generalize to previously unseen graphs, which enables decision making for unseen data without incurring significant cost on training.

Second, computational graphs for many models often contain more than 10k nodes. Solving the optimization problems effectively over such large scales requires that the network is able to capture long-range dependencies between nodes. GO’s architecture includes a scalable attention network that uses segment-level recurrence to capture such long-range node dependencies.

Third, ML compilers need to solve optimization problems over a wide variety of graphs from different application domains. A naive strategy of training a shared policy network with heterogeneous graphs is unlikely to capture the idiosyncrasies of a particular class of graphs. To overcome this, GO uses a feature modulation mechanism that allows the network to specialize for specific graph types without increasing the number of parameters.

Overview of GO: An end-to-end graph policy network that combines graph embedding and sequential attention.

To jointly solve multiple dependent optimization tasks, GO has the ability to add additional recurrent attention layers for each task with parameters shared across different tasks. The recurrent attention layers with residual connections of actions enables tracking inter-task dependencies.

Multi-task policy network that extends GO’s policy network with additional recurrent attention layers for each task and residual connections. GE: Graph Embedding, FC: Fully-Connected Layer, Nxf: fusion action dimension, Fxd: placement action dimension, Nxs: scheduling action dimension.

Next, we present evaluation results on a single-task speedup on a device placement task based on real-hardware measurements, generalization to unseen graphs with different GO variants, and multi-task performance jointly optimizing operations fusion, device placement, and scheduling.

To evaluate the performance of this architecture, we apply GO to a device placement problem based on real-hardware evaluation, where we first train the model separately on each of our workloads. This approach, called GO-one, consistently outperforms expert manual placement (HP), TensorFlow METIS placement, and Hierarchical Device Placement (HDP) — the current state-of-the-art reinforcement learning-based device placement. Importantly, with the efficient end-to-end single-shot placement, GO-one has a 15x speedup in convergence time of the placement network over HDP.

Our empirical results show that GO-one consistently outperforms expert placement, TensorFlow METIS placement, and hierarchical device placement (HDP). Because GO is designed in a way to scale up to extremely large graphs consisting of over 80,000 nodes like an 8-layer Google Neural Machine Translation (GNMT) model, it outperforms previous approaches, including HDP, REGAL, and Placeto. GO achieves optimized graph runtimes for large graphs like GNMT that are 21.7% and 36.5% faster than HP and HDP, respectively. Overall, GO-one achieves on average 20.5% and 18.2% run time reduction across a diverse set of 14 graphs, compared to HP and HDP respectively. Importantly, with the efficient end-to-end single-shot placement, GO-one has a 15x speedup in convergence time of the placement network over HDP.

GO generalizes to unseen graphs using offline pre-training followed by fine-tuning on the unseen graphs. During pre-training, we train GO on heterogeneous subsets of graphs from the training set. We train GO for 1000 steps on each such batch of graphs before switching to the next. This pretrained model is then fine-tuned (GO-generalization+finetune) on hold-out graphs for fewer than 50 steps, which typically takes less than one minute. GO-generalization+finetune for hold-out graphs outperforms both expert placement and HDP consistently on all datasets, and on average matches GO-one.

We also run inference directly on just the pre-trained model without any fine-tuning for the target hold-out graphs, and name this GO-generalization-zeroshot. The performance of this untuned model is only marginally worse than GO-generalization+finetune, while being slightly better than expert placement and HDP. This indicates that both graph embedding and the learned policies transfer efficiently, allowing the model to generalize to the unseen data.

Generalization across heterogeneous workload graphs. The figure shows a comparison of two different generalization strategies for GO when trained with graphs from 5 (except the held-out one) of the 6 workloads (Inception-v3, AmoebaNet, recurrent neural network language model (RNNLM), Google Neural Machine Translation (GNMT), Transformer-XL (TRFXL), WaveNet), and evaluated on the held-out workload (x-axis).

Co-optimizing placement, scheduling, and fusion (pl+sch+fu):
Optimizing simultaneously for placement, scheduling and fusion provides 30%-73% speedup compared to the single-gpu unoptimized case and 33%-60% speedup compared to TensorFlow default placement, scheduling, and fusion. Comparing to optimizing each tasks individually, multi-task GO (pl+sch+fu) outperforms single-task GO (p | sch | fu) — optimizing all tasks, one at a time — by an average of 7.8%. Furthermore, for all workloads, co-optimizing all three tasks offers faster run time than optimizing any two of them and using the default policy for the third.

Run time for various workloads on multi-task optimizations. TF-default: TF GPU default placement, fusion, and scheduling. hp-only: human placement only with default scheduling and fusion. pl-only: GO placement only with default scheduling and fusion. pl | sch: GO optimizes placement and scheduling individually with default fusion. pl+sch: multi-task GO co-optimizes placement and scheduling with default fusion. sch+fu: multi-task GO co-optimizes scheduling and fusion with human placement. pl | sch | fu: GO optimizes placement, scheduling, and fusion separately. pl+sch+fu: multi-task GO co-optimizes placement, scheduling, and fusion.

The increasing complexity and diversity of hardware accelerators has made the development of robust and adaptable ML frameworks onerous and time-consuming, often requiring multiple years of effort from hundreds of engineers. In this article, we demonstrated that many of the optimization problems in such frameworks can be solved efficiently and optimally using a carefully designed learned approach.

This is joint work with Daniel Wong, Amirali Abdolrashidi, Peter Ma, Qiumin Xu, Hanxiao Liu, Mangpo Phitchaya Phothilimthana, Shen Wang, Anna Goldie, Azalia Mirhoseini, and James Laudon.

Source: Google AI Blog

Privacy Considerations in Large Language Models

Machine learning-based language models trained to predict the next word in a sentence have become increasingly capable, common, and useful, leading to groundbreaking improvements in applications like question-answering, translation, and more. But as language models continue to advance, new and unexpected risks can be exposed, requiring the research community to proactively work to develop new ways to mitigate potential problems.

One such risk is the potential for models to leak details from the data on which they’re trained. While this may be a concern for all large language models, additional issues may arise if a model trained on private data were to be made publicly available. Because these datasets can be large (hundreds of gigabytes) and pull from a range of sources, they can sometimes contain sensitive data, including personally identifiable information (PII) — names, phone numbers, addresses, etc., even if trained on public data. This raises the possibility that a model trained using such data could reflect some of these private details in its output. It is therefore important to identify and minimize the risks of such leaks, and to develop strategies to address the issue for future models.

If one prompts the GPT-2 language model with the prefix “East Stroudsburg Stroudsburg...”, it will autocomplete a long block of text that contains the full name, phone number, email address, and physical address of a particular person whose information was included in GPT-2’s training data.

In “Extracting Training Data from Large Language Models”, a collaboration with OpenAI, Apple, Stanford, Berkeley, and Northeastern University, we demonstrate that, given only the ability to query a pre-trained language model, it is possible to extract specific pieces of training data that the model has memorized. As such, training data extraction attacks are realistic threats on state-of-the-art large language models. This research represents an early, critical step intended to inform researchers about this class of vulnerabilities, so that they may take steps to mitigate these weaknesses.

Ethics of Language Model Attacks
A training data extraction attack has the greatest potential for harm when applied to a model that is available to the public, but for which the dataset used in training is not. However, since conducting this research on such a dataset could have harmful consequences, we instead mount a proof of concept training data extraction attack on GPT-2, a large, publicly available language model developed by OpenAI, that was trained using only public data. While this work focuses on GPT-2 specifically, the results apply to understanding what privacy threats are possible on large language models generally.

As with other privacy- and security-related research, it is important to consider the ethics of such attacks before actually performing them. To minimize the potential risk of this work, the training data extraction attack in this work was developed using publicly available data. Furthermore, the GPT-2 model itself was made public by OpenAI in 2019, and the training data used to train GPT-2 was collected from the public internet, and is available for download by anyone who follows the data collection process documented in the GPT-2 paper.

Additionally, in accordance with responsible computer security disclosure norms, we followed up with individuals whose PII was extracted, and secured their permission before including references to this data in publication. Further, in all publications of this work, we have redacted any personally identifying information that may identify individuals. We have also worked closely with OpenAI in the analysis of GPT-2.

The Training Data Extraction Attack
By design, language models make it very easy to generate a large amount of output data. By seeding the model with random short phrases, the model can generate millions of continuations, i.e., probable phrases that complete the sentence. Most of the time, these continuations will be benign strings of sensible text. For example, when asked to predict the continuation of the string “Mary had a little…”, a language model will have high confidence that the next token is the word “lamb”. However, if one particular training document happened to repeat the string “Mary had a little wombat” many times, the model might predict that phrase instead.

The goal of a training data extraction attack is then to sift through the millions of output sequences from the language model and predict which text is memorized. To accomplish this, our approach leverages the fact that models tend to be more confident on results captured directly from their training data. These membership inference attacks enable us to predict if a result was used in the training data by checking the confidence of the model on a particular sequence.

The main technical contribution of this work is the development of a method for inferring membership with high accuracy along with techniques for sampling from models in a way that encourages the output of memorized content. We tested a number of different sampling strategies, the most successful of which generates text conditioned on a wide variety of input phrases. We then compare the output of two different language models. When one model has high confidence in a sequence, but the other (equally accurate) model has low confidence in a sequence, it's likely that the first model has memorized the data.

Out of 1800 candidate sequences from the GPT-2 language model, we extracted over 600 that were memorized from the public training data, with the total number limited by the need for manual verification. The memorized examples cover a wide range of content, including news headlines, log messages, JavaScript code, PII, and more. Many of these examples are memorized even though they appear infrequently in the training dataset. For example, for many samples of PII we extract are found in only a single document in the dataset. However, in most of these cases, the originating document contains multiple instances of the PII, and as a result, the model still learns it as high likelihood text.

Finally, we also find that the larger the language model, the more easily it memorizes training data. For example, in one experiment we find that the 1.5 billion parameter GPT-2 XL model memorizes 10 times more information than the 124 million parameter GPT-2 Small model. Given that the research community has already trained models 10 to 100 times larger, this means that as time goes by, more work will be required to monitor and mitigate this problem in increasingly large language models.

While we demonstrate these attacks on GPT-2 specifically, they show potential flaws in all large generative language models. The fact that these attacks are possible has important consequences for the future of machine learning research using these types of models.

Fortunately, there are several ways to mitigate this issue. The most straightforward solution is to ensure that models do not train on any potentially problematic data. But this can be difficult to do in practice.

The use of differential privacy, which allows training on a dataset without revealing any details of individual training examples, is one of the most principled techniques to train machine learning models with privacy. In TensorFlow, this can be achieved with the use of the tensorflow/privacy module (or similar for PyTorch or JAX) that is a drop-in replacement for existing optimizers. Even this can have limitations and won’t prevent memorization of content that is repeated often enough. If this is not possible, we recommend at least measuring how much memorization occurs so appropriate action can be taken.

Language models continue to demonstrate great utility and flexibility—yet, like all innovations, they can also pose risks. Developing them responsibly means proactively identifying those risks and developing ways to mitigate them. We hope that this effort to highlight current weaknesses in large language modeling will raise awareness of this challenge in the broader machine learning community and motivate researchers to continue to develop effective techniques to train models with reduced memorization.

This work was performed jointly with Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel.

Source: Google AI Blog

Portrait Light: Enhancing Portrait Lighting with Machine Learning

Professional portrait photographers are able to create compelling photographs by using specialized equipment, such as off-camera flashes and reflectors, and expert knowledge to capture just the right illumination of their subjects. In order to allow users to better emulate professional-looking portraits, we recently released Portrait Light, a new post-capture feature for the Pixel Camera and Google Photos apps that adds a simulated directional light source to portraits, with the directionality and intensity set to complement the lighting from the original photograph.

Example image with and without Portrait Light applied. Note how Portrait Light contours the face, adding dimensionality, volume, and visual interest.

In the Pixel Camera on Pixel 4, Pixel 4a, Pixel 4a (5G), and Pixel 5, Portrait Light is automatically applied post-capture to images in the default mode and to Night Sight photos that include people — just one person or even a small group. In Portrait Mode photographs, Portrait Light provides more dramatic lighting to accompany the shallow depth-of-field effect already applied, resulting in a studio-quality look. But because lighting can be a personal choice, Pixel users who shoot in Portrait Mode can manually re-position and adjust the brightness of the applied lighting within Google Photos to match their preference. For those running Google Photos on Pixel 2 or newer, this relighting capability is also available for many pre-existing portrait photographs.

Pixel users can adjust a portrait’s lighting as they like in Google Photos, after capture.

Today we present the technology behind Portrait Light. Inspired by the off-camera lights used by portrait photographers, Portrait Light models a repositionable light source that can be added into the scene, with the initial lighting direction and intensity automatically selected to complement the existing lighting in the photo. We accomplish this by leveraging novel machine learning models, each trained using a diverse dataset of photographs captured in the Light Stage computational illumination system. These models enabled two new algorithmic capabilities:

  1. Automatic directional light placement: For a given portrait, the algorithm places a synthetic directional light in the scene consistent with how a photographer would have placed an off-camera light source in the real world.
  2. Synthetic post-capture relighting: For a given lighting direction and portrait, synthetic light is added in a way that looks realistic and natural.

These innovations enable Portrait Light to help create attractive lighting at any moment for every portrait — all on your mobile device.

Automatic Light Placement
Photographers usually rely on perceptual cues when deciding how to augment environmental illumination with off-camera light sources. They assess the intensity and directionality of the light falling on the face, and also adjust their subject’s head pose to complement it. To inform Portrait Light’s automatic light placement, we developed computational equivalents to these two perceptual signals.

First, we trained a novel machine learning model to estimate a high dynamic range, omnidirectional illumination profile for a scene based on an input portrait. This new lighting estimation model infers the direction, relative intensity, and color of all light sources in the scene coming from all directions, considering the face as a light probe. We also estimate the head pose of the portrait’s subject using MediaPipe Face Mesh.

Estimating the high dynamic range, omnidirectional illumination profile from an input portrait. The three spheres at the right of each image, diffuse (top), matte silver (middle), and mirror (bottom), are rendered using the estimated illumination, each reflecting the color, intensity, and directionality of the environmental lighting.

Using these clues, we determine the direction from which the synthetic lighting should originate. In studio portrait photography, the main off-camera light source, or key light, is placed about 30° above the eyeline and between 30° and 60° off the camera axis, when looking overhead at the scene. We follow this guideline for a classic portrait look, enhancing any pre-existing lighting directionality in the scene while targeting a balanced, subtle key-to-fill lighting ratio of about 2:1.

Data-Driven Portrait Relighting
Given a desired lighting direction and portrait, we next trained a new machine learning model to add the illumination from a directional light source to the original photograph. Training the model required millions of pairs of portraits both with and without extra light. Photographing such a dataset in normal settings would have been impossible because it requires near-perfect registration of portraits captured across different lighting conditions.

Instead, we generated training data by photographing seventy different people using the Light Stage computational illumination system. This spherical lighting rig includes 64 cameras with different viewpoints and 331 individually-programmable LED light sources. We photographed each individual illuminated one-light-at-a-time (OLAT) by each light, which generates their reflectance field — or their appearance as illuminated by the discrete sections of the spherical environment. The reflectance field encodes the unique color and light-reflecting properties of the subject’s skin, hair, and clothing — how shiny or dull each material appears. Due to the superposition principle for light, these OLAT images can then be linearly added together to render realistic images of the subject as they would appear in any image-based lighting environment, with complex light transport phenomena like subsurface scattering correctly represented.

Using the Light Stage, we photographed many individuals with different face shapes, genders, skin tones, hairstyles, and clothing/accessories. For each person, we generated synthetic portraits in many different lighting environments, both with and without the added directional light, rendering millions of pairs of images. This dataset encouraged model performance across diverse lighting environments and individuals.

Photographing an individual as illuminated one-light-at-a-time in the Google Light Stage, a 360° computational illumination rig.
Left: Example images from an individual’s photographed reflectance field, their appearance in the Light Stage as illuminated one-light-at-a-time. Right: The images can be added together to form the appearance of the subject in any novel lighting environment.

Learning Detail-Preserving Relighting Using the Quotient Image
Rather than trying to directly predict the output relit image, we trained the relighting model to output a low-resolution quotient image, i.e., a per-pixel multiplier that when upsampled can be applied to the original input image to produce the desired output image with the contribution of the extra light source added. This technique is computationally efficient and encourages only low-frequency lighting changes, without impacting high-frequency image details, which are directly transferred from the input to maintain image quality.

Supervising Relighting with Geometry Estimation
When photographers add an extra light source into a scene, its orientation relative to the subject’s facial geometry determines how much brighter each part of the face appears. To model the optical behavior of light sources reflecting off relatively matte surfaces, we first trained a machine learning model to estimate surface normals given the input photograph, and then applied Lambert’s law to compute a “light visibility map” for the desired lighting direction. We provided this light visibility map as input to the quotient image predictor, ensuring that the model is trained using physics-based insights.

The pipeline of our relighting network. Given an input portrait, we estimate per-pixel surface normals, which we then use to compute a light visibility map. The model is trained to produce a low-resolution quotient image that, when upsampled and applied as a multiplier to the original image, produces the original portrait with an extra light source added synthetically into the scene.

We optimized the full pipeline to run at interactive frame-rates on mobile devices, with total model size under 10 MB. Here are a few examples of Portrait Light in action.

Portrait Light in action.

Getting the Most Out of Portrait Light
You can try Portrait Light in the Pixel Camera and change the light position and brightness to your liking in Google Photos. For those who use Dual Exposure Controls, Portrait Light can be applied post-capture for additional creative flexibility to find just the right balance between light and shadow. On existing images from your Google Photos library, try it where faces are slightly underexposed, where Portrait Light can illuminate and highlight your subject. It will especially benefit images with a single individual posed directly at the camera.

We see Portrait Light as the first step on the journey towards creative post-capture lighting controls for mobile cameras, powered by machine learning.

Portrait Light is the result of a collaboration between Google Research, Google Daydream, Pixel, and Google Photos teams. Key contributors include: Yun-Ta Tsai, Rohit Pandey, Sean Fanello, Chloe LeGendre, Michael Milne, Ryan Geiss, Sam Hasinoff, Dillon Sharlet, Christoph Rhemann, Peter Denny, Kaiwen Guo, Philip Davidson, Jonathan Taylor, Mingsong Dou, Pavel Pidlypenskyi, Peter Lincoln, Jay Busch, Matt Whalen, Jason Dourgarian, Geoff Harvey, Cynthia Herrera, Sergio Orts Escolano, Paul Debevec, Jonathan Barron, Sofien Bouaziz, Clement Ng, Rachit Gupta, Jesse Evans, Ryan Campbell, Sonya Mollinger, Emily To, Yichang Shih, Jana Ehmann, Wan-Chun Alex Ma, Christina Tong, Tim Smith, Tim Ruddick, Bill Strathearn, Jose Lima, Chia-Kai Liang, David Salesin, Shahram Izadi, Navin Sarma, Nisha Masharani, Zachary Senzer.

1  Work conducted while at Google. 

Source: Google AI Blog

MediaPipe Holistic — Simultaneous Face, Hand and Pose Prediction, on Device

Real-time, simultaneous perception of human pose, face landmarks and hand tracking on mobile devices can enable a variety of impactful applications, such as fitness and sport analysis, gesture control and sign language recognition, augmented reality effects and more. MediaPipe, an open-source framework designed specifically for complex perception pipelines leveraging accelerated inference (e.g., GPU or CPU), already offers fast and accurate, yet separate, solutions for these tasks. Combining them all in real-time into a semantically consistent end-to-end solution is a uniquely difficult problem requiring simultaneous inference of multiple, dependent neural networks.

Today, we are excited to announce MediaPipe Holistic, a solution to this challenge that provides a novel state-of-the-art human pose topology that unlocks novel use cases. MediaPipe Holistic consists of a new pipeline with optimized pose, face and hand components that each run in real-time, with minimum memory transfer between their inference backends, and added support for interchangeability of the three components, depending on the quality/speed tradeoffs. When including all three components, MediaPipe Holistic provides a unified topology for a groundbreaking 540+ keypoints (33 pose, 21 per-hand and 468 facial landmarks) and achieves near real-time performance on mobile devices. MediaPipe Holistic is being released as part of MediaPipe and is available on-device for mobile (Android, iOS) and desktop. We are also introducing MediaPipe’s new ready-to-use APIs for research (Python) and web (JavaScript) to ease access to the technology.

Top: MediaPipe Holistic results on sport and dance use-cases. Bottom: “Silence” and “Hello” gestures. Note, that our solution consistently identifies a hand as either right (blue color) or left (orange color).

Pipeline and Quality
The MediaPipe Holistic pipeline integrates separate models for pose, face and hand components, each of which are optimized for their particular domain. However, because of their different specializations, the input to one component is not well-suited for the others. The pose estimation model, for example, takes a lower, fixed resolution video frame (256x256) as input. But if one were to crop the hand and face regions from that image to pass to their respective models, the image resolution would be too low for accurate articulation. Therefore, we designed MediaPipe Holistic as a multi-stage pipeline, which treats the different regions using a region appropriate image resolution.

First, MediaPipe Holistic estimates the human pose with BlazePose’s pose detector and subsequent keypoint model. Then, using the inferred pose key points, it derives three regions of interest (ROI) crops for each hand (2x) and the face, and employs a re-crop model to improve the ROI (details below). The pipeline then crops the full-resolution input frame to these ROIs and applies task-specific face and hand models to estimate their corresponding keypoints. Finally, all key points are merged with those of the pose model to yield the full 540+ keypoints.

MediaPipe Holistic pipeline overview.

To streamline the identification of ROIs, a tracking approach similar to the one used for the standalone face and hand pipelines is utilized. This approach assumes that the object doesn't move significantly between frames, using an estimation from the previous frame as a guide to the object region in the current one. However, during fast movements, the tracker can lose the target, which requires the detector to re-localize it in the image. MediaPipe Holistic uses pose prediction (on every frame) as an additional ROI prior to reduce the response time of the pipeline when reacting to fast movements. This also enables the model to retain semantic consistency across the body and its parts by preventing a mixup between left and right hands or body parts of one person in the frame with another.

In addition, the resolution of the input frame to the pose model is low enough that the resulting ROIs for face and hands are still too inaccurate to guide the re-cropping of those regions, which require a precise input crop to remain lightweight. To close this accuracy gap we use lightweight face and hand re-crop models that play the role of spatial transformers and cost only ~10% of the corresponding model's inference time.

 Tracking pipeline (baseline)   9.8%   3.1% 
 Pipeline without re-crops   11.8%   3.5% 
 Pipeline with re-crops   9.7%   3.1% 
Hand prediction quality.The mean error per hand (MEH) is normalized by the hand size. The face landmarks error (FLE) is normalized by the inter-pupillary distance.

MediaPipe Holistic requires coordination between up to 8 models per frame — 1 pose detector, 1 pose landmark model, 3 re-crop models and 3 keypoint models for hands and face. While building this solution, we optimized not only machine learning models, but also pre- and post-processing algorithms (e.g., affine transformations), which take significant time on most devices due to pipeline complexity. In this case, moving all the pre-processing computations to GPU resulted in ~1.5 times overall pipeline speedup depending on the device. As a result, MediaPipe Holistic runs in near real-time performance even on mid-tier devices and in the browser.

 Phone   FPS 
 Google Pixel 2 XL   18 
 Samsung S9+   20 
 15-inch MacBook Pro 2017   15 
Performance on various mid-tier devices, measured in frames per second (FPS) using TFLite GPU.

The multi-stage nature of the pipeline provides two more performance benefits. As models are mostly independent, they can be replaced with lighter or heavier versions (or turned off completely) depending on the performance and accuracy requirements. Also, once pose is inferred, one knows precisely whether hands and face are within the frame bounds, allowing the pipeline to skip inference on those body parts.

MediaPipe Holistic, with its 540+ key points, aims to enable a holistic, simultaneous perception of body language, gesture and facial expressions. Its blended approach enables remote gesture interfaces, as well as full-body AR, sports analytics, and sign language recognition. To demonstrate the quality and performance of the MediaPipe Holistic, we built a simple remote control interface that runs locally in the browser and enables a compelling user interaction, no mouse or keyboard required. The user can manipulate objects on the screen, type on a virtual keyboard while sitting on the sofa, and point to or touch specific face regions (e.g., mute or turn off the camera). Underneath it relies on accurate hand detection with subsequent gesture recognition mapped to a "trackpad" space anchored to the user’s shoulder, enabling remote control from up to 4 meters.

This technique for gesture control can unlock various novel use-cases when other human-computer interaction modalities are not convenient. Try it out in our web demo and prototype your own ideas with it.

In-browser touchless control demos. Left: Palm picker, touch interface, keyboard. Right: Distant touchless keyboard. Try it out!

MediaPipe for Research and Web
To accelerate ML research as well as its adoption in the web developer community, MediaPipe now offers ready-to-use, yet customizable ML solutions in Python and in JavaScript. We are starting with those in our previous publications: Face Mesh, Hands and Pose, including MediaPipe Holistic, with many more to come. Try them directly in the web browser: for Python using the notebooks in MediaPipe on Google Colab, and for JavaScript with your own webcam input in MediaPipe on CodePen!

We hope the release of MediaPipe Holistic will inspire the research and development community members to build new unique applications. We anticipate that these pipelines will open up avenues for future research into challenging domains, such as sign-language recognition, touchless control interfaces, or other complex use cases. We are looking forward to seeing what you can build with it!

Complex and dynamic hand gestures. Videos by Dr. Bill Vicars, used with permission.

Special thanks to all our team members who worked on the tech with us: Fan Zhang, Gregory Karpiak, Kanstantsin Sokal, Juhyun Lee, Hadon Nash, Chuo-Ling Chang, Jiuqiang Tang, Nikolay Chirkov, Camillo Lugaresi, George Sung, Michael Hays, Tyler Mullen, Chris McClanahan, Ekaterina Ignasheva, Marat Dukhan, Artsiom Ablavatski, Yury Kartynnik, Karthik Raveendran, Andrei Vakunov, Andrei Tkachenka, Suril Shah, Buck Bourdon, Ming Guang Yong, Esha Uboweja, Siarhei Kazakou, Andrei Kulik, Matsvei Zhdanovich, and Matthias Grundmann.

Source: Google AI Blog

Google at NeurIPS 2020

This week marks the beginning of the 34th annual Conference on Neural Information Processing Systems (NeurIPS 2020), the biggest machine learning conference of the year. Held virtually for the first time, this conference includes invited talks, demonstrations and presentations of some of the latest in machine learning research. As a Platinum Sponsor of NeurIPS 2020, Google will have a strong presence with more than 180 accepted papers, additionally contributing to and learning from the broader academic research community via talks, posters, workshops and tutorials.

If you are registered for NeurIPS 2020, we hope you’ll visit our virtual booth and chat with our researchers about the projects and opportunities at Google that go into solving the world's most challenging research problems, and to see demonstrations of some of the exciting research we pursue, such as Transformers for image recognition, Tone Transfer, large-scale distributed RL, recreating historical streetscapes and much more. You can also learn more about our work being presented in the list below (Google affiliations highlighted in blue).

Organizing Committees

General Chair: Hugo Larochelle

Workshop Co-Chair: Sanmi Koyejo

Diversity and Inclusion Chairs include: Katherine Heller

Expo Chair: Pablo Samuel Castro

Senior Area Chairs include: Corinna Cortes, Fei Sha, Mohammad Ghavamzadeh, Sanjiv Kumar, Charles Sutton, Dale Schuurmans, David Duvenaud, Elad Hazan, Marco Cuturi, Peter Bartlett, Samy Bengio, Tong Zhang, Claudio Gentile, Kevin Murphy, Cordelia Schmid, Amir Globerson

Area Chairs include: Boqing Gong, Afshin Rostamizadeh, Alex Kulesza, Branislav Kveton, Craig Boutilier, Heinrich Jiang, Manzil Zaheer, Silvio Lattanzi, Slav Petrov, Srinadh Bhojanapalli, Rodolphe Jenatton, Mathieu Blondel, Aleksandra Faust, Alexey Dosovitskiy, Ashish Vaswani, Augustus Odena, Balaji Lakshminarayanan, Ben Poole, Colin Raffel, Danny Tarlow, David Ha, Denny Zhou, Dumitru Erhan, Dustin Tran, George Tucker, Honglak Lee, Ilya Tolstikhin, Jasper Snoek, Jean-Philippe Vert, Jeffrey Pennington, Kevin Swersky, Matthew Johnson, Minmin Chen, Mohammad Norouzi, Moustapha Cisse, Naman Agarwal, Nicholas Carlini, Olivier Bachem, Tim Salimans, Vincent Dumoulin, Yann Dauphin, Andrew Dai, Izhak Shafran, Karthik Sridharan, Abhinav Gupta, Abhishek Kumar, Adam White, Aditya Menon, Kun Zhang, Ce Liu, Cristian Sminchisescu, Hossein Mobahi, Phillip IsolaTomer Koren, Chelsea Finn, Amin Karbasi

NeurIPS 2020 Foundation Board includes: Michael Mozer, Samy Bengio, Corinna Cortes, Hugo Larochelle, John C. Platt, Fernando Pereira

Accepted Papers

Rankmax: An Adaptive Projection Alternative to the Softmax Function
Weiwei Kong*, Walid Krichene, Nicolas Mayoraz, Steffen Rendle, Li Zhang

Unsupervised Sound Separation Using Mixture Invariant Training
Scott Wisdom, Efthymios Tzinis*, Hakan Erdogan, Ron Weiss, Kevin Wilson, John Hershey

Learning to Select Best Forecast Tasks for Clinical Outcome Prediction
Yuan Xue, Nan Du, Anne Mottram, Martin Seneviratne, Andrew M. Dai

Interpretable Sequence Learning for Covid-19 Forecasting
Sercan O. Arık, Chun-Liang Li, Jinsung Yoon, Rajarishi Sinha, Arkady Epshteyn, Long T. Le, Vikas Menon, Shashank Singh, Leyou Zhang, Nate Yoder, Martin Nikoltchev, Yash Sonthalia, Hootan Nakhost, Elli Kanal, Tomas Pfister

Towards Learning Convolutions from Scratch
Behnam Neyshabur

Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design
Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, Sergey Levine

Inverse Rational Control with Partially Observable Continuous Nonlinear Dynamics
Minhae Kwon, Saurabh Daptardar, Paul Schrater, Xaq Pitkow

Off-Policy Evaluation via the Regularized Lagrangian
Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, Dale Schuurmans

CoinDICE: Off-Policy Confidence Interval Estimation
Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, Dale Schuurmans

Unsupervised Data Augmentation for Consistency Training
Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le

VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain
Jinsung Yoon, Yao Zhang, James Jordon, Mihaela van der Schaar

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
Zihang Dai, Guokun Lai, Yiming Yang, Quoc Le

Big Bird: Transformers for Longer Sequences
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed

Provably Efficient Neural Estimation of Structural Equation Models: An Adversarial Approach
Luofeng Liao, You-Lin Chen, Zhuoran Yang, Bo Dai, Zhaoran Wang, Mladen Kolar

Conservative Q-Learning for Offline Reinforcement Learning
Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine

MOReL: Model-Based Offline Reinforcement Learning
Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, Thorsten Joachims

Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness
Long Zhao, Ting Liu, Xi Peng, Dimitris Metaxas

Generative View Synthesis: From Single-view Semantics to Novel-view Images
Tewodros Habtegebrial, Varun Jampani, Orazio Gallo, Didier Stricker

PIE-NET: Parametric Inference of Point Cloud Edges
Xiaogang Wang, Yuelang Xu, Kai Xu, Andrea Tagliasacchi, Bin Zhou, Ali Mahdavi-Amiri, Hao Zhang

Enabling Certification of Verification-Agnostic Networks via Memory-Efficient Semidefinite Programming
Sumanth Dathathri, Krishnamurthy (Dj) Dvijotham, Alex Kurakin, Aditi Raghunathan, Jonathan Uesato, Rudy Bunel, Shreya Shankar, Jacob Steinhardt, Ian Goodfellow*, Percy Liang, Pushmeet Kohli

An Analysis of SVD for Deep Rotation Estimation
Jake Levinson, Carlos Esteves, Kefan Chen, Noah Snavely, Angjoo Kanazawa, Afshin Rostamizadeh, Ameesh Makadia

Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces
Guy Lorberbom, Chris J. Maddison, Nicolas Heess, Tamir Hazan, Daniel Tarlow

Faster Differentially Private Samplers via Rényi Divergence Analysis of Discretized Langevin MCMC
Arun Ganesh*, Kunal Talwar*

DISK: Learning Local Features with Policy Gradient
Michał J. Tyszkiewicz, Pascal Fua, Eduard Trulls

Robust Large-margin Learning in Hyperbolic Space
Melanie Weber*, Manzil Zaheer, Ankit Singh Rawat, Aditya Menon, Sanjiv Kumar

Gamma-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction
Michael Janner, Igor Mordatch, Sergey Levine

Adversarially Robust Streaming Algorithms via Differential Privacy
Avinatan Hassidim, Haim Kaplan, Yishay Mansour, Yossi Matias, Uri Stemmer

Faster DBSCAN via Subsampled Similarity Queries
Heinrich Jiang, Jennifer Jang, Jakub Łacki

Exact Recovery of Mangled Clusters with Same-Cluster Queries
Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, Andrea Paudice

A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs
Nevena Lazic, Dong Yin, Mehrdad Farajtabar, Nir Levine, Dilan Görür, Chris Harris, Dale Schuurmans

Fairness in Streaming Submodular Maximization: Algorithms and Hardness
Marwa El Halabi, Slobodan Mitrović, Ashkan Norouzi-Fard, Jakab Tardos, Jakub Tarnawski

Efficient Active Learning of Sparse Halfspaces with Arbitrary Bounded Noise
Chicheng Zhang, Jie Shen, Pranjal Awasthi

Private Learning of Halfspaces: Simplifying the Construction and Reducing the Sample Complexity
Haim Kaplan, Yishay Mansour, Uri Stemmer, Eliad Tsfadia

Synthetic Data Generators -- Sequential and Private
Olivier Bousquet, Roi Livni, Shay Moran

Learning Discrete Distributions: User vs Item-level Privacy
Yuhan Liu, Ananda Theertha Suresh, Felix Xinnan X. Yu, Sanjiv Kumar, Michael Riley

Learning Differential Equations that are Easy to Solve
Jacob Kelly, Jesse Bettencourt, Matthew J. Johnson, David K. Duvenaud

An Optimal Elimination Algorithm for Learning a Best Arm
Avinatan Hassidim, Ron Kupfer, Yaron Singer

The Convex Relaxation Barrier, Revisited: Tightened Single-Neuron Relaxations for Neural Network Verification
Christian Tjandraatmadja, Ross Anderson, Joey Huchette, Will Ma, Krunal Kishor Patel*, Juan Pablo Vielma

Escaping the Gravitational Pull of Softmax
Jincheng Mei, Chenjun Xiao, Bo Dai, Lihong Li*, Csaba Szepesvari, Dale Schuurmans

The Complexity of Adversarially Robust Proper Learning of Halfspaces with Agnostic Noise
Ilias Diakonikolas, Daniel M. Kane, Pasin Manurangsi

PAC-Bayes Learning Bounds for Sample-Dependent Priors
Pranjal Awasthi, Satyen Kale, Stefani Karp, Mehryar Mohri

Fictitious Play for Mean Field Games: Continuous Time Analysis and Applications
Sarah Perrin, Julien Perolat, Mathieu Lauriere, Matthieu Geist, Romuald Elie, Olivier Pietquin

What Do Neural Networks Learn When Trained With Random Labels?
Hartmut Maennel, Ibrahim M. Alabdulmohsin, Ilya O. Tolstikhin, Robert Baldock*, Olivier Bousquet, Sylvain Gelly, Daniel Keysers

Online Planning with Lookahead Policies
Yonathan Efroni, Mohammad Ghavamzadeh, Shie Mannor

Smoothly Bounding User Contributions in Differential Privacy
Alessandro Epasto, Mohammad Mahdian, Jieming Mao, Vahab Mirrokni, Lijie Ren

Differentially Private Clustering: Tight Approximation Ratios
Badih Ghazi, Ravi Kumar, Pasin Manurangsi

Hitting the High Notes: Subset Selection for Maximizing Expected Order Statistics
Aranyak Mehta, Uri Nadav, Alexandros Psomas*, Aviad Rubinstein

Myersonian Regression
Allen Liu, Renato Leme, Jon Schneider

Assisted Learning: A Framework for Multi-Organization Learning
Xun Xian, Xinran Wang, Jie Ding, Reza Ghanadan

Adversarial Robustness via Robust Low Rank Representations
Pranjal Awasthi, Himanshu Jain, Ankit Singh Rawat, Aravindan Vijayaraghavan

Multi-Plane Program Induction with 3D Box Priors
Yikai Li, Jiayuan Mao, Xiuming Zhang, Bill Freeman, Josh Tenenbaum, Noah Snavely, Jiajun Wu

Privacy Amplification via Random Check-Ins
Borja Balle, Peter Kairouz, Brendan McMahan, Om Dipakbhai Thakkar, Abhradeep Thakurta

Rethinking Pre-training and Self-training
Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin Dogus Cubuk, Quoc Le

Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing
Arthur Delarue, Ross Anderson, Christian Tjandraatmadja

Online Agnostic Boosting via Regret Minimization
Nataly Brukhim, Xinyi Chen, Elad Hazan, Shay Moran*

From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering
Ines Chami, Albert Gu, Vaggos Chatziafratis, Christopher Ré

Faithful Embeddings for Knowledge Base Queries
Haitian Sun, Andrew Arnold*, Tania Bedrax Weiss, Fernando Pereira, William W. Cohen

Contextual Reserve Price Optimization in Auctions via Mixed Integer Programming
Joey Huchette, Haihao Lu, Hossein Esfandiari, Vahab Mirrokni

An Operator View of Policy Gradient Methods
Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux

Reinforcement Learning with Feedback Graphs
Christoph Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, Karthik Sridharan

On Completeness-aware Concept-Based Explanations in Deep Neural Networks
Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, Pradeep Ravikumar

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement
Benjamin Eysenbach, Xinyang Geng, Sergey Levine, Ruslan Salakhutdinov

The Flajolet-Martin Sketch Itself Preserves Differential Privacy: Private Counting with Minimal Space
Adam Smith, Shuang Song, Abhradeep Thakurta

What is Being Transferred in Transfer Learning?
Behnam Neyshabur, Hanie Sedghi, Chiyuan Zhang

Latent Bandits Revisited
Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed, Craig Boutilier

MetaSDF: Meta-Learning Signed Distance Functions
Vincent Sitzmann, Eric Chan, Richard Tucker, Noah Snavely, Gordon Wetzstein

Measuring Robustness to Natural Distribution Shifts in Image Classification
Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, Ludwig Schmidt

Robust Optimization for Fairness with Noisy Protected Groups
Serena Wang, Wenshuo Guo, Harikrishna Narasimhan, Andrew Cotter, Maya Gupta, Michael I. Jordan

Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration
Hanjun Dai, Rishabh Singh, Bo Dai, Charles Sutton, Dale Schuurmans

Breaking the Communication-Privacy-Accuracy Trilemma
Wei-Ning Chen, Peter Kairouz, Ayfer Ozgur

Differentiable Meta-Learning of Bandit Policies
Craig Boutilier, Chih-wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, Manzil Zaheer

Multi-Stage Influence Function
Hongge Chen*, Si Si, Yang Li, Ciprian Chelba, Sanjiv Kumar, Duane Boning, Cho-Jui Hsieh

Compositional Visual Generation with Energy Based Models
Yilun Du, Shuang Li, Igor Mordatch

O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

Curriculum By Smoothing
Samarth Sinha, Animesh Garg, Hugo Larochelle

Online Linear Optimization with Many Hints
Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

Prediction with Corrupted Expert Advice
Idan Amir, Idan Attias, Tomer Koren, Roi Livni, Yishay Mansour

Agnostic Learning with Multiple Objectives
Corinna Cortes, Mehryar Mohri, Javier Gonzalvo, Dmitry Storcheus

CoSE: Compositional Stroke Embeddings
Emre Aksan, Thomas Deselaers*, Andrea Tagliasacchi, Otmar Hilliges

Reparameterizing Mirror Descent as Gradient Descent
Ehsan Amid, Manfred K. Warmuth

Understanding Double Descent Requires A Fine-Grained Bias-Variance Decomposition
Ben Adlam, Jeffrey Pennington

DisARM: An Antithetic Gradient Estimator for Binary Latent Variables
Zhe Dong, Andriy Mnih, George Tucker

Big Self-Supervised Models are Strong Semi-Supervised Learners
Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, Geoffrey Hinton

JAX MD: A Framework for Differentiable Physics
Samuel S. Schoenholz, Ekin D. Cubuk

Gradient Surgery for Multi-Task Learning
Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, Chelsea Finn

LoopReg: Self-supervised Learning of Implicit Surface Correspondences, Pose and Shape for 3D Human Mesh Registration
Bharat Lal Bhatnagar, Cristian Sminchisescu, Christian Theobalt, Gerard Pons-Moll

ICE-BeeM: Identifiable Conditional Energy-Based Deep Models Based on Nonlinear ICA
Ilyes Khemakhem, Ricardo P. Monti, Diederik P. Kingma, Aapo Hyvärinen

Demystifying Orthogonal Monte Carlo and Beyond
Han Lin, Haoxian Chen, Tianyi Zhang, Clement Laroche, Krzysztof Choromanski

FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, Colin Raffel

Compositional Generalization via Neural-Symbolic Stack Machines
Xinyun Chen, Chen Liang, Adams Wei Yu, Dawn Song, Denny Zhou

Universally Quantized Neural Compression
Eirikur Agustsson, Lucas Theis

Self-Distillation Amplifies Regularization in Hilbert Space
Hossein Mobahi, Mehrdad Farajtabar, Peter L. Bartlett

ShapeFlow: Learnable Deformation Flows Among 3D Shapes
Chiyu “Max" Jiang, Jingwei Huang, Andrea Tagliasacchi, Leonidas Guibas

Entropic Optimal Transport between Unbalanced Gaussian Measures has a Closed Form
Hicham Janati, Boris Muzellec, Gabriel Peyré, Marco Cuturi

High-Fidelity Generative Image Compression
Fabian Mentzer*, George Toderici, Michael Tschannen*, Eirikur Agustsson

COT-GAN: Generating Sequential Data via Causal Optimal Transport
Tianlin Xu, Li K. Wenliang, Michael Munn, Beatrice Acciaio

When Do Neural Networks Outperform Kernel Methods?
Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari

Sense and Sensitivity Analysis: Simple Post-Hoc Analysis of Bias Due to Unobserved Confounding
Victor Veitch, Anisha Zaveri

Exemplar VAE: Linking Generative Models, Nearest Neighbor Retrieval, and Data Augmentation
Sajad Norouzi, David J. Fleet, Mohamamd Norouzi

Mitigating Forgetting in Online Continual Learning via Instance-Aware Parameterization
Hung-Jen Chen, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, Min Sun
Consistent Plug-in Classifiers for Complex Objectives and Constraints
Shiv Kumar Tavker, Harish Guruprasad Ramaswamy, Harikrishna Narasimhan

Online MAP Inference of Determinantal Point Processes
Aditya Bhaskara, Amin Karbasi, Silvio Lattanzi, Morteza Zadimoghaddam

Organizing Recurrent Network Dynamics by Task-computation to Enable Continual Learning
Lea Duncker, Laura Driscoll, Krishna V. Shenoy, Maneesh Sahani, David Sussillo

RL Unplugged: A Collection of Benchmarks for Offline Reinforcement Learning
Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Thomas Paine, Sergio Gómez, Konrad Zolna, Rishabh Agarwal, Josh S. Merel, Daniel J. Mankowitz, Cosmin Paduraru, Gabriel Dulac-Arnold, Jerry Li, Mohammad Norouzi, Matthew Hoffman, Nicolas Heess, Nando de Freitas

Neural Execution Engines: Learning to Execute Subroutines
Yujun Yan*, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, Milad Hashemi

Spin-Weighted Spherical CNNs
Carlos Esteves, Ameesh Makadia, Kostas Daniilidis

An Efficient Nonconvex Reformulation of Stagewise Convex Optimization Problems
Rudy R. Bunel, Oliver Hinder, Srinadh Bhojanapalli, Krishnamurthy Dvijotham

Stochastic Optimization with Laggard Data Pipelines
Naman Agarwal, Rohan Anil, Tomer Koren, Kunal Talwar*, Cyril Zhang*

Regularizing Towards Permutation Invariance In Recurrent Models
Edo Cohen-Karlik, Avichai Ben David, Amir Globerson

Fast and Accurate kk-means++ via Rejection Sampling
Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler*, Ola Svensson

Fairness Without Demographics Through Adversarially Reweighted Learning
Preethi Lahoti*, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, Ed Chi

Gradient Estimation with Stochastic Softmax Tricks
Max Paulus, Dami Choi, Daniel Tarlow, Andreas Krause, Chris J. Maddison

Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout
Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, Dragomir Anguelov

A Spectral Energy Distance for Parallel Speech Synthesis
Alexey A. Gritsenko, Tim Salimans, Rianne van den Berg, Jasper Snoek, Nal Kalchbrenner

Ode to an ODE
Krzysztof Choromanski, Jared Quincy Davis, Valerii Likhosherstov, Xingyou Song, Jean-Jacques Slotine, Jacob Varley, Honglak Lee, Adrian Weller, Vikas Sindhwani

RandAugment: Practical Automated Data Augmentation with a Reduced Search Space
Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, Quoc Le

On Adaptive Attacks to Adversarial Example Defenses
Florian Tramer, Nicholas Carlini, Wieland Brendel, Aleksander Madry

Fair Performance Metric Elicitation
Gaurush Hiranandani, Harikrishna Narasimhan, Oluwasanmi O. Koyejo

Robust Pre-Training by Adversarial Contrastive Learning
Ziyu Jiang, Tianlong Chen, Ting Chen, Zhangyang Wang

Why are Adaptive Methods Good for Attention Models?
Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, Suvrit Sra

PyGlove: Symbolic Programming for Automated Machine Learning
Daiyi Peng, Xuanyi Dong, Esteban Real, Mingxing Tan, Yifeng Lu, Gabriel Bender, Hanxiao Liu, Adam Kraft, Chen Liang, Quoc Le

Fair Hierarchical Clustering
Sara Ahmadian, Alessandro Epasto, Marina Knittel, Ravi Kumar, Mohammad Mahdian, Benjamin Moseley, Philip Pham, Sergei Vassilvitskii, Yuyan Wang

Fairness with Overlapping Groups; a Probabilistic Perspective
Forest Yang*, Moustapha Cisse, Sanmi Koyejo

Differentiable Top-k with Optimal Transport
Yujia Xie*, Hanjun Dai, Minshuo Chen, Bo Dai, Tuo Zhao, Hongyuan Zha, Wei Wei, Tomas Pfister

The Origins and Prevalence of Texture Bias in Convolutional Neural Networks
Katherine Hermann, Ting Chen, Simon Kornblith

Approximate Heavily-Constrained Learning with Lagrange Multiplier Models
Harikrishna Narasimhan, Andrew Cotter, Yichen Zhou, Serena Wang, Wenshuo Guo

Evaluating Attribution for Graph Neural Networks
Benjamin Sanchez-Lengeling, Jennifer Wei, Brian Lee, Emily Reif, Peter Wang, Wesley Wei Qian, Kevin McCloskey, Lucy Colwell, Alexander Wiltschko

Sliding Window Algorithms for k-Clustering Problems
Michele Borassi, Alessandro Epasto, Silvio Lattanzi, Sergei Vassilvitskii, Morteza Zadimoghaddam

Meta-Learning Requires Meta-Augmentation
Janarthanan Rajendran*, Alex Irpan, Eric Jang

What Makes for Good Views for Contrastive Learning?
Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

Supervised Contrastive Learning
Prannay Khosla*, Piotr Teterwak*, Chen Wang*, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, Dilip Krishnan

Critic Regularized Regression
Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh Merel, Jost Tobias Springenberg, Scott Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, Nando de Freitas

Off-Policy Imitation Learning from Observations
Zhuangdi Zhu, Kaixiang Lin, Bo Dai, Jiayu Zhou

Effective Diversity in Population Based Reinforcement Learning
Jack Parker-Holder, Aldo Pacchiano, Krzysztof Choromanski, Stephen Roberts

Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards
Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, Honglak Lee

Object-Centric Learning with Slot Attention
Francesco Locatello*, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, Thomas Kipf

On the Power of Louvain in the Stochastic Block Model
Vincent Cohen-Addad, Adrian Kosowski, Frederik Mallmann-Trenn, David Saulpic

Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks
David Bieber, Charles Sutton, Hugo Larochelle, Daniel Tarlow

SMYRF - Efficient Attention using Asymmetric Clustering
Giannis Daras, Nikita Kitaev, Augustus Odena, Alexandros G. Dimakis

Graph Contrastive Learning with Augmentations
Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, Yang Shen

WOR and p's: Sketches for ℓp-Sampling Without Replacement
Edith Cohen, Rasmus Pagh, David P. Woodruff

Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains
Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, Ren Ng

Model Selection in Contextual Stochastic Bandit Problems
Aldo Pacchiano, My Phan, Yasin Abbasi Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, Csaba Szepesvari

Adapting to Misspecification in Contextual Bandits
Dylan J. Foster, Claudio Gentile, Mehryar Mohri, Julian Zimmert

Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning
Nino Vieillard, Tadashi Kozunoú, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist

Learning with Differentiable Pertubed Optimizers
Quentin Berthet, Mathieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, Francis Bach

Munchausen Reinforcement Learning
Nino Vieillard, Olivier Pietquin, Matthieu Geist

Log-Likelihood Ratio Minimizing Flows: Towards Robust and Quantifiable Neural Distribution Alignment
Ben Usman, Avneesh Sud, Nick Dufour, Kate Saenko

Your GAN is Secretly an Energy-based Model and You Should Use Discriminator Driven Latent Sampling
Tong Che, Ruixiang Zhang, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, Yoshua Bengio

Sample Complexity of Uniform Convergence for Multicalibration
Eliran Shabat, Lee Cohen, Yishay Mansour

Implicit Regularization and Convergence for Weight Normalization
Xiaoxia Wu, Edgar Dobriban, Tongzheng Ren, Shanshan Wu, Zhiyuan Li, Suriya Gunasekar, Rachel Ward, Qiang Liu

Most ReLU Networks Suffer from ℓ² Adversarial Perturbations
Amit Daniely, Hadas Shacham

Geometric Exploration for Online Control
Orestis Plevrakis, Elad Hazan

PLLay: Efficient Topological Layer Based on Persistent Landscapes
Kwangho Kim, Jisu Kim, Manzil Zaheer, Joon Sik Kim, Frederic Chazal, Larry Wasserman

Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness
Jeremiah Zhe Liu*, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, Balaji Lakshminarayanan

Bayesian Deep Ensembles via the Neural Tangent Kernel
Bobby He, Balaji Lakshminarayanan, Yee Whye Teh

Hyperparameter Ensembles for Robustness and Uncertainty Quantification
Florian Wenzel, Jasper Snoek, Dustin Tran, Rodolphe Jenatton

Conic Descent and its Application to Memory-efficient Optimization Over Positive Semidefinite Matrices
John Duchi, Oliver Hinder, Andrew Naber, Yinyu Ye

On the Training Dynamics of Deep Networks with L₂ Regularization
Aitor Lewkowycz, Guy Gur-Ari

The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks
Wei Hu*, Lechao Xiao, Ben Adlam, Jeffrey Pennington

Adaptive Probing Policies for Shortest Path Routing
Aditya Bhaskara, Sreenivas Gollapudi, Kostas Kollias, Kamesh Munagala

Optimal Approximation — Smoothness Tradeoffs for Soft-Max Functions
Alessandro Epasto, Mohammad Mahdian, Vahab Mirrokni, Emmanouil Zampetakis

An Unsupervised Information-Theoretic Perceptual Quality Metric
Sangnie Bhardwaj, Ian Fischer, Johannes Ballé, Troy Chinen

Learning Graph Structure With A Finite-State Automaton Layer
Daniel Johnson, Hugo Larochelle, Daniel Tarlow

Estimating Training Data Influence by Tracing Gradient Descent
Garima Pruthi, Frederick Liu, Satyen Kale, Mukund Sundararajan


Practical Uncertainty Estimation and Out-of-Distribution Robustness in Deep Learning
Organizers: Dustin Tran, Balaji Lakshminarayanan, Jasper Snoek

Abstraction & Reasoning in AI systems: Modern Perspectives
Organizers: Francois Chollet, Melanie Mitchell, Christian Szegedy

Policy Optimization in Reinforcement Learning
Organizers: Sham M Kakade, Martha White, Nicolas Le Roux

Federated Learning and Analytics: Industry Meets Academia
Organizers: Brendan McMahan, Virginia Smith, Peter Kairouz

Deep Implicit Layers: Neural ODEs, Equilibrium Models, and Differentiable Optimization
Organizers: David Duvenaud, J. Zico Kolter, Matthew Johnson

Beyond Accuracy: Grounding Evaluation Metrics for Human-Machine Learning Systems
Organizers: Praveen Chandar, Fernando Diaz, Brian St. Thomas


Black in AI Workshop @ NeurIPS 2020 (Diamond Sponsor)
Mentorship Roundtables: Natasha Jacques

LatinX in AI Workshop @ NeurIPS 2020 (Platinum Sponsor)
Organizers include: Pablo Samuel Castro
Invited Speaker: Fernanda Viégas
Mentorship Roundtables: Tomas Izo

Queer in AI Workshop @ NeurIPS 2020 (Platinum Sponsor)
Organizers include: Raphael Gontijo Lopes

Women in Machine Learning (Platinum Sponsor)
Organizers include: Xinyi Chen, Jessica Schrouff
Invited Speaker: Fernanda Viégas
Sponsor Talk: Jessica Schrouff
Mentorship Roundtables: Hanie Sedghi, Marc Bellemare, Katherine Heller, Rianne van den Berg, Natalie Schluter, Colin Raffel, Azalia Mirhoseini, Emily Denton, Jesse Engel, Anusha Ramesh, Matt Johnson, Jeff Dean, Laurent Dinh, Samy Bengio, Yasaman Bahri, Corinna Cortes, Nicolas le Roux, Hugo Larochelle, Sergio Guadarrama, Natasha Jaques, Pablo Samuel Castro, Elaine Le, Cory Silvear

Muslims in ML
Organizers include: Mohammad Norouzi

Resistance AI Workshop
Organizers include: Elliot Creager, Raphael Gontijo Lopes

Privacy Preserving Machine Learning — PriML and PPML Joint Edition
Organizers include: Adria Gascon, Mariana Raykova

OPT2020: Optimization for Machine Learning
Organizers include: Courtney Paquette

Human in the Loop Dialogue Systems
Organizers include: Rahul Goel
Invited Speaker: Ankur Parikh

Self-Supervised Learning for Speech and Audio Processing
Organizers include: Tara Sainath
Invited Speaker: Bhuvana Ramabhadran

3rd Robot Learning Workshop
Organizers include: Alex Bewley, Vincent Vanhoucke
Invited Speaker: Pete Florence

Deep Reinforcement Learning
Organizers include: Chelsea Finn
Invited Speaker: Marc Bellemare

Machine Learning for Engineering Modeling, Simulation and Design
Organizers include: Stephan Hoyer

Machine Learning for Molecules
Organizers include: Jennifer Wei
Invited Speaker: Benjamin Sanchez-Lengeling

The Challenges of Real World Reinforcement Learning
Organizers include: Gabriel Dulac-Arnold
Invited Speaker: Chelsea Finn

Workshop on Computer Assisted Programming (CAP)
Organizers include: Charles Sutton, Augustus Odena

Self-Supervised Learning — Theory and Practice
Organizers include: Barret Zoph
Invited Speaker: Quoc V. Le

Deep Learning Through Information Geometry
Organizers include: Alexander Alemi


Drifting Efficiently Through the Stratosphere Using Deep Reinforcement Learning
Organizers include: Sal Candido

Accelerating Eye Movement Research via Smartphone Gaze
Organizers include: Vidhya Navalpakkam

Mining and Learning with Graphs at Scale
Organizers include: Bryan Perozzi, Vahab Mirrokni, Jonathan Halcrow, Jakub Lacki

*Work performed while at Google

Source: Google AI Blog

Using AutoML for Time Series Forecasting

Time series forecasting is an important research area for machine learning (ML), particularly where accurate forecasting is critical, including several industries such as retail, supply chain, energy, finance, etc. For example, in the consumer goods domain, improving the accuracy of demand forecasting by 10-20% can reduce inventory by 5% and increase revenue by 2-3%. Current ML-based forecasting solutions are usually built by experts and require significant manual effort, including model construction, feature engineering and hyper-parameter tuning. However, such expertise may not be broadly available, which can limit the benefits of applying ML towards time series forecasting challenges.

To address this, automated machine learning (AutoML) is an approach that makes ML more widely accessible by automating the process of creating ML models, and has recently accelerated both ML research and the application of ML to real-world problems. For example, the initial work on neural architecture search enabled breakthroughs in computer vision, such as NasNet, AmoebaNet, and EfficientNet, and in natural language processing, such as Evolved Transformer. More recently, AutoML has also been applied to tabular data.

Today we introduce a scalable end-to-end AutoML solution for time series forecasting, which meets three key criteria:

  • Fully automated: The solution takes in data as input, and produces a servable TensorFlow model as output with no human intervention.
  • Generic: The solution works for most time series forecasting tasks and automatically searches for the best model configuration for each task.
  • High-quality: The produced models have competitive quality compared to those manually crafted for specific tasks.

We demonstrate the success of this approach through participation in the M5 forecasting competition, where this AutoML solution achieved competitive performance against hand-crafted models with moderate compute cost.

Challenges in Time Series Forecasting
Time series forecasting presents several challenges to machine learning models. First, the uncertainty is often high since the goal is to predict the future based on historical data. Unlike other machine learning problems, the test set, for example, future product sales, might have a different distribution from the training and validation set, which are extracted from the historical data. Second, the time series data from the real world often suffers from missing data and high intermittency (i.e., when a high fraction of the time series has the value of zero). Some time series tasks may not have historical data available and suffer from the cold start problem, for example, when predicting the sales of a new product. Third, since we aim to build a fully automated generic solution, the same solution needs to apply to a variety of datasets, which can vary significantly in the domain (product sales, web traffic, etc), the granularity (daily, hourly, etc), the history length, the types of features (categorical, numerical, date time, etc), and so on.

An AutoML Solution
To tackle these challenges, we designed an end-to-end TensorFlow pipeline with a specialized search space for time series forecasting. It is based on an encoder-decoder architecture, in which an encoder transforms the historical information in a time series into a set of vectors, and a decoder generates the future predictions based on these vectors. Inspired by the state-of-the-art sequence models, such as Transformer and WaveNet, and best practices in time series forecasting, our search space included components such as attention, dilated convolution, gating, skip connections, and different feature transformations. The resulting AutoML solution searches for the best combination of these components as well as core hyperparameters.

To combat the uncertainty in predicting the future of a time series, an ensemble of the top models discovered in the search is used to make final predictions. The diversity in the top models made the predictions more robust to uncertainty and less prone to overfitting the historical data. To handle time series with missing data, we fill in the gaps with a trainable vector and let the model learn to adapt to the missing time steps. To address intermittency, we predict, for each future time step, not only the value, but also the probability that the value at this time step is non-zero, and combine the two predictions. Finally, we found that the automated search is able to adjust the architecture and hyperparameter choices for different datasets, which makes the AutoML solution generic and automates the modeling efforts.

Benchmarking in Forecasting Competitions
To benchmark our AutoML solution, we participated in the M5 forecasting competition, the latest in the M-competition series, which is one of the most important competitions in the forecasting community, with a long history spanning nearly 40 years. This most recent competition was hosted on Kaggle and used a dataset from Walmart product sales, the real-world nature of which makes the problem quite challenging.

We participated in the competition with our fully automated solution and achieved a rank of 138 out of 5558 participants (top 2.5%) on the final leaderboard, which is in the silver medal zone. Participants in the competition had almost four months to produce their models. While many of the competitive forecasting models required months of manual effort to create, our AutoML solution found the model in a short time with only a moderate compute cost (500 CPUs for 2 hours) and no human intervention.

We also benchmarked our AutoML forecasting solution on several other Kaggle datasets and found that on average it outperforms 92% of hand-crafted models, despite its limited resource use.

Evaluation of the AutoML Forecasting solution on other Kaggle Datasets (Rossman Store Sales, Web Traffic, Favorita Grocery Sales) besides M5.

This work demonstrates the strength of an end-to-end AutoML solution for time series forecasting, and we are excited about its potential impact on real-world applications.

This project was a joint effort of Google Brain team members Chen Liang, Da Huang, Yifeng Lu and Quoc V. Le. We also thank Junwei Yuan, Xingwei Yang, Dawei Jia, Chenyu Zhao, Tin-yun Ho, Meng Wang, Yaguang Li, Nicolas Loeff, Manish Kurse, Kyle Anderson and Nishant Patil for their collaboration.

Source: Google AI Blog

Transformers for Image Recognition at Scale

While convolutional neural networks (CNNs) have been used in computer vision since the 1980s, they were not at the forefront until 2012 when AlexNet surpassed the performance of contemporary state-of-the-art image recognition methods by a large margin. Two factors helped enable this breakthrough: (i) the availability of training sets like ImageNet, and (ii) the use of commoditized GPU hardware, which provided significantly more compute for training. As such, since 2012, CNNs have become the go-to model for vision tasks.

The benefit of using CNNs was that they avoided the need for hand-designed visual features, instead learning to perform tasks directly from data “end to end”. However, while CNNs avoid hand-crafted feature-extraction, the architecture itself is designed specifically for images and can be computationally demanding. Looking forward to the next generation of scalable vision models, one might ask whether this domain-specific design is necessary, or if one could successfully leverage more domain agnostic and computationally efficient architectures to achieve state-of-the-art results.

As a first step in this direction, we present the Vision Transformer (ViT), a vision model based as closely as possible on the Transformer architecture originally designed for text-based tasks. ViT represents an input image as a sequence of image patches, similar to the sequence of word embeddings used when applying Transformers to text, and directly predicts class labels for the image. ViT demonstrates excellent performance when trained on sufficient data, outperforming a comparable state-of-the-art CNN with four times fewer computational resources. To foster additional research in this area, we have open-sourced both the code and models.

The Vision Transformer treats an input image as a sequence of patches, akin to a series of word embeddings generated by a natural language processing (NLP) Transformer.

The Vision Transformer
The original text Transformer takes as input a sequence of words, which it then uses for classification, translation, or other NLP tasks. For ViT, we make the fewest possible modifications to the Transformer design to make it operate directly on images instead of words, and observe how much about image structure the model can learn on its own.

ViT divides an image into a grid of square patches. Each patch is flattened into a single vector by concatenating the channels of all pixels in a patch and then linearly projecting it to the desired input dimension. Because Transformers are agnostic to the structure of the input elements we add learnable position embeddings to each patch, which allow the model to learn about the structure of the images. A priori, ViT does not know about the relative location of patches in the image, or even that the image has a 2D structure — it must learn such relevant information from the training data and encode structural information in the position embeddings.

Scaling Up

We first train ViT on ImageNet, where it achieves a best score of 77.9% top-1 accuracy. While this is decent for a first attempt, it falls far short of the state of the art — the current best CNN trained on ImageNet with no extra data reaches 85.8%. Despite mitigation strategies (e.g., regularization), ViT overfits the ImageNet task due to its lack of inbuilt knowledge about images.

To investigate the impact of dataset size on model performance, we train ViT on ImageNet-21k (14M images, 21k classes) and JFT (300M images, 18k classes), and compare the results to a state-of-the-art CNN, Big Transfer (BiT), trained on the same datasets. As previously observed, ViT performs significantly worse than the CNN equivalent (BiT) when trained on ImageNet (1M images). However, on ImageNet-21k (14M images) performance is comparable, and on JFT (300M images), ViT now outperforms BiT.

Finally, we investigate the impact of the amount of computation involved in training the models. For this, we train several different ViT models and CNNs on JFT. These models span a range of model sizes and training durations. As a result, they require varying amounts of compute for training. We observe that, for a given amount of compute, ViT yields better performance than the equivalent CNNs.

Left: Performance of ViT when pre-trained on different datasets. Right: ViT yields a good performance/compute trade-off.

High-Performing Large-Scale Image Recognition
Our data suggest that (1) with sufficient training ViT can perform very well, and (2) ViT yields an excellent performance/compute trade-off at both smaller and larger compute scales. Therefore, to see if performance improvements carried over to even larger scales, we trained a 600M-parameter ViT model.

This large ViT model attains state-of-the-art performance on multiple popular benchmarks, including 88.55% top-1 accuracy on ImageNet and 99.50% on CIFAR-10. ViT also performs well on the cleaned-up version of the ImageNet evaluations set “ImageNet-Real”, attaining 90.72% top-1 accuracy. Finally, ViT works well on diverse tasks, even with few training data points. For example, on the VTAB-1k suite (19 tasks with 1,000 data points each), ViT attains 77.63%, significantly ahead of the single-model state of the art (SOTA) (76.3%), and even matching SOTA attained by an ensemble of multiple models (77.6%). Most importantly, these results are obtained using fewer compute resources compared to previous SOTA CNNs, e.g., 4x fewer than the pre-trained BiT models.

Vision Transformer matches or outperforms state-of-the-art CNNs on popular benchmarks. Left: Popular image classification tasks (ImageNet, including new validation labels ReaL, and CIFAR, Pets, and Flowers). Right: Average across 19 tasks in the VTAB classification suite.

To gain some intuition into what the model learns, we visualize some of its internal workings. First, we look at the position embeddings — parameters that the model learns to encode the relative location of patches — and find that ViT is able to reproduce an intuitive image structure. Each position embedding is most similar to others in the same row and column, indicating that the model has recovered the grid structure of the original images. Second, we examine the average spatial distance between one element attending to another for each transformer block. At higher layers (depths of 10-20) only global features are used (i.e., large attention distances), but the lower layers (depths 0-5) capture both global and local features, as indicated by a large range in the mean attention distance. By contrast, only local features are present in the lower layers of a CNN. These experiments indicate that ViT can learn features hard-coded into CNNs (such as awareness of grid structure), but is also free to learn more generic patterns, such as a mix of local and global features at lower layers, that can aid generalization.

Left: ViT learns the grid like structure of the image patches via its position embeddings. Right: The lower layers of ViT contain both global and local features, the higher layers contain only global features.

While CNNs have revolutionized computer vision, our results indicate that models tailor-made for imaging tasks may be unnecessary, or even sub-optimal. With ever-increasing dataset sizes, and the continued development of unsupervised and semi-supervised methods, the development of new vision architectures that train more efficiently on these datasets becomes increasingly important. We believe ViT is a preliminary step towards generic, scalable architectures that can solve many vision tasks, or even tasks from many domains, and are excited for future developments.

A preprint of our work as well as code and models are publically available.

We would like to thank our co-authors in Berlin, Zürich, and Amsterdam: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, and Jakob Uszkoreit. We would like to thank Andreas Steiner for crucial help with infrastructure and open-sourcing, Joan Puigcerver and Maxim Neumann for work on large-scale training infrastructure, and Dmitry Lepikhin, Aravindh Mahendran, Daniel Keysers, Mario Lučić, Noam Shazeer, and Colin Raffel for useful discussions. Finally, we thank Tom Small for creating the Visual Transformer animation in this post.

Source: Google AI Blog

Navigating Recorder Transcripts Easily, with Smart Scrolling

Last year we launched Recorder, a new kind of recording app that made audio recording smarter and more useful by leveraging on-device machine learning (ML) to transcribe the recording, highlight audio events, and suggest appropriate tags for titles. Recorder makes editing, sharing and searching through transcripts easier. Yet because Recorder can transcribe very long recordings (up to 18 hours!), it can still be difficult for users to find specific sections, necessitating a new solution to quickly navigate such long transcripts.

To increase the navigability of content, we introduce Smart Scrolling, a new ML-based feature in Recorder that automatically marks important sections in the transcript, chooses the most representative keywords from each section, and then surfaces those keywords on the vertical scrollbar, like chapter headings. The user can then scroll through the keywords or tap on them to quickly navigate to the sections of interest. The models used are lightweight enough to be executed on-device without the need to upload the transcript, thus preserving user privacy.

Smart Scrolling feature UX

Under the hood
The Smart Scrolling feature is composed of two distinct tasks. The first extracts representative keywords from each section and the second picks which sections in the text are the most informative and unique.

For each task, we utilize two different natural language processing (NLP) approaches: a distilled bidirectional transformer (BERT) model pre-trained on data sourced from a Wikipedia dataset, alongside a modified extractive term frequency–inverse document frequency (TF-IDF) model. By using the bidirectional transformer and the TF-IDF-based models in parallel for both the keyword extraction and important section identification tasks, alongside aggregation heuristics, we were able to harness the advantages of each approach and mitigate their respective drawbacks (more on this in the next section).

The bidirectional transformer is a neural network architecture that employs a self-attention mechanism to achieve context-aware processing of the input text in a non-sequential fashion. This enables parallel processing of the input text to identify contextual clues both before and after a given position in the transcript.

Bidirectional Transformer-based model architecture

The extractive TF-IDF approach rates terms based on their frequency in the text compared to their inverse frequency in the trained dataset, and enables the finding of unique representative terms in the text.

Both models were trained on publicly available conversational datasets that were labeled and evaluated by independent raters. The conversational datasets were from the same domains as the expected product use cases, focusing on meetings, lectures, and interviews, thus ensuring the same word frequency distribution (Zipf’s law).

Extracting Representative Keywords
The TF-IDF-based model detects informative keywords by giving each word a score, which corresponds to how representative this keyword is within the text. The model does so, much like a standard TF-IDF model, by utilizing the ratio of the number of occurrences of a given word in the text compared to the whole of the conversational data set, but it also takes into account the specificity of the term, i.e., how broad or specific it is. Furthermore, the model then aggregates these features into a score using a pre-trained function curve. In parallel, the bidirectional transformer model, which was fine tuned on the task of extracting keywords, provides a deep semantic understanding of the text, enabling it to extract precise context-aware keywords.

The TF-IDF approach is conservative in the sense that it is prone to finding uncommon keywords in the text (high bias), while the drawback for the bidirectional transformer model is the high variance of the possible keywords that can be extracted. But when used together, these two models complement each other, forming a balanced bias-variance tradeoff.

Once the keyword scores are retrieved from both models, we normalize and combine them by utilizing NLP heuristics (e.g., the weighted average), removing duplicates across sections, and eliminating stop words and verbs. The output of this process is an ordered list of suggested keywords for each of the sections.

Rating A Section’s Importance
The next task is to determine which sections should be highlighted as informative and unique. To solve this task, we again combine the two models mentioned above, which yield two distinct importance scores for each of the sections. We compute the first score by taking the TF-IDF scores of all the keywords in the section and weighting them by their respective number of appearances in the section, followed by a summation of these individual keyword scores. We compute the second score by running the section text through the bidirectional transformer model, which was also trained on the sections rating task. The scores from both models are normalized and then combined to yield the section score.

Smart Scrolling pipeline architecture

Some Challenges
A significant challenge in the development of Smart Scrolling was how to identify whether a section or keyword is important - what is of great importance to one person can be of less importance to another. The key was to highlight sections only when it is possible to extract helpful keywords from them.

To do this, we configured the solution to select the top scored sections that also have highly rated keywords, with the number of sections highlighted proportional to the length of the recording. In the context of the Smart Scrolling features, a keyword was more highly rated if it better represented the unique information of the section.

To train the model to understand this criteria, we needed to prepare a labeled training dataset tailored to this task. In collaboration with a team of skilled raters, we applied this labeling objective to a small batch of examples to establish an initial dataset in order to evaluate the quality of the labels and instruct the raters in cases where there were deviations from what was intended. Once the labeling process was complete we reviewed the labeled data manually and made corrections to the labels as necessary to align them with our definition of importance.

Using this limited labeled dataset, we ran automated model evaluations to establish initial metrics on model quality, which were used as a less-accurate proxy to the model quality, enabling us to quickly assess the model performance and apply changes in the architecture and heuristics. Once the solution metrics were satisfactory, we utilized a more accurate manual evaluation process over a closed set of carefully chosen examples that represented expected Recorder use cases. Using these examples, we tweaked the model heuristics parameters to reach the desired level of performance using a reliable model quality evaluation.

Runtime Improvements
After the initial release of Recorder, we conducted a series of user studies to learn how to improve the usability and performance of the Smart Scrolling feature. We found that many users expect the navigational keywords and highlighted sections to be available as soon as the recording is finished. Because the computation pipeline described above can take a considerable amount of time to compute on long recordings, we devised a partial processing solution that amortizes this computation over the whole duration of the recording. During recording, each section is processed as soon as it is captured, and then the intermediate results are stored in memory. When the recording is done, Recorder aggregates the intermediate results.

When running on a Pixel 5, this approach reduced the average processing time of an hour long recording (~9K words) from 1 minute 40 seconds to only 9 seconds, while outputting the same results.

The goal of Recorder is to improve users’ ability to access their recorded content and navigate it with ease. We have already made substantial progress in this direction with the existing ML features that automatically suggest title words for recordings and enable users to search recordings for sounds and text. Smart Scrolling provides additional text navigation abilities that will further improve the utility of Recorder, enabling users to rapidly surface sections of interest, even for long recordings.

Bin Zhang, Sherry Lin, Isaac Blankensmith, Henry Liu‎, Vincent Peng‎, Guilherme Santos‎, Tiago Camolesi, Yitong Lin, James Lemieux, Thomas Hall‎, Kelly Tsai‎, Benny Schlesinger, Dror Ayalon, Amit Pitaru, Kelsie Van Deman, Console Chen, Allen Su, Cecile Basnage, Chorong Johnston‎, Shenaz Zack, Mike Tsao, Brian Chen, Abhinav Rastogi, Tracy Wu, Yvonne Yang‎.

Source: Google AI Blog