
Understanding Transfer Learning for Medical Imaging



As deep neural networks are applied to an increasingly diverse set of domains, transfer learning has emerged as a highly popular technique in developing deep learning models. In transfer learning, the neural network is trained in two stages: 1) pretraining, where the network is generally trained on a large-scale benchmark dataset representing a wide diversity of labels/categories (e.g., ImageNet); and 2) fine-tuning, where the pretrained network is further trained on the specific target task of interest, which may have fewer labeled examples than the pretraining dataset. The pretraining step helps the network learn general features that can be reused on the target task.
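
As a rough illustration of this two-stage paradigm (not the exact training setup used in the work described below), the following Keras sketch loads ImageNet-pretrained weights and then fine-tunes on a target task; the number of classes, input size, optimizer settings and the train_ds dataset are placeholders.

```python
import tensorflow as tf

NUM_CLASSES = 5  # placeholder, e.g., five diabetic retinopathy grades

# Stage 1: start from weights pretrained on the large-scale ImageNet benchmark.
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(224, 224, 3))

# Stage 2: fine-tune the pretrained network on the target task of interest.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# `train_ds` is a placeholder tf.data.Dataset of (image, label) batches.
# model.fit(train_ds, epochs=10)
```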

This kind of two-stage paradigm has become extremely popular in many settings, and particularly so in medical imaging. In the context of transfer learning, standard architectures designed for ImageNet with corresponding pretrained weights are fine-tuned on medical tasks ranging from interpreting chest x-rays and identifying eye diseases, to early detection of Alzheimer’s disease. Despite its widespread use, however, the precise effects of transfer learning are not yet well understood. While recent work challenges many common assumptions, including the effects on performance improvement, contribution of the underlying architecture and impact of pretraining dataset type and size, these results are all in the natural image setting, and leave many questions open for specialized domains, such as medical images.

In our NeurIPS 2019 paper, “Transfusion: Understanding Transfer Learning for Medical Imaging,” we investigate these central questions for transfer learning in medical imaging tasks. Through both a detailed performance evaluation and analysis of neural network hidden representations, we uncover many surprising conclusions, such as the limited benefits of transfer learning for performance on the tested medical imaging tasks, a detailed characterization of how representations evolve through the training process across different models and hidden layers, and feature-independent benefits of transfer learning for convergence speed.

Performance Evaluation
We first performed a thorough study on the effect of transfer learning on model performance. We compared models trained from random initialization directly on the target tasks to models pretrained on ImageNet and then fine-tuned on the same tasks. We looked at two large-scale medical imaging tasks — diagnosing diabetic retinopathy from fundus photographs and identifying five different diseases from chest x-rays. We evaluated various neural network architectures, including both standard architectures popularly used for medical imaging (ResNet50, Inception-v3) and a family of simple, lightweight convolutional neural networks that consist of four or five layers of the standard convolution-batchnorm-ReLU progression, or CBRs.
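
For concreteness, here is a minimal sketch of a CBR-style network of the kind described above; the filter counts, strides and input size are illustrative assumptions rather than the exact configurations evaluated in the paper.

```python
import tensorflow as tf

def cbr_block(x, filters):
    # One convolution-batchnorm-ReLU (CBR) unit.
    x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

def build_cbr(num_classes=5, input_shape=(224, 224, 3),
              filters=(32, 64, 128, 256)):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for f in filters:                      # four CBR blocks
        x = cbr_block(x, f)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_cbr()
```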

The results from evaluating all of these models on the different tasks with and without transfer learning give us four main takeaways:
  • Surprisingly, transfer learning does not significantly affect performance on medical imaging tasks, with models trained from scratch performing nearly as well as standard ImageNet transferred models.
  • On the medical imaging tasks, the much smaller CBR models perform at a level comparable to the standard ImageNet architectures.
  • As the CBR models are much smaller and shallower than the standard ImageNet models, they perform much worse on ImageNet classification, highlighting that ImageNet performance is not indicative of performance on medical tasks.
  • The two medical tasks are much smaller in size than ImageNet (~200k vs ~1.2m training images), but in the very small data regime, there may only be a few thousand training examples. We evaluated transfer learning in this very small data regime, finding that while there was a larger gap in performance between transfer and training from scratch for large models (ResNet), this was not true for smaller models (CBRs), suggesting that the large models designed for ImageNet might be too overparameterized for the very small data regime.
Representation Analysis
We next study the degree to which transfer learning affects the kinds of features and representations learned by the neural networks. Given the similar performance, does transfer learning result in different representations from random initialization? Is knowledge from the pretraining step reused, and if so, where? To answer these questions, we analyze and compare the hidden representations (i.e., representations learned in the latent layers of the network) in the different neural networks trained to solve these tasks. This quantitative analysis can be challenging, due to the complexity and lack of alignment in different hidden layers. But a recent method, singular vector canonical correlation analysis (SVCCA; code and tutorials), based on canonical correlation analysis (CCA), helps overcome these challenges, and can be used to calculate a similarity score between a pair of hidden representations.
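
The following simplified NumPy sketch shows the core idea behind an SVCCA-style similarity score: reduce each set of activations with an SVD, then average the canonical correlations between the reduced subspaces. The variance threshold and shapes are illustrative, and the released SVCCA code handles practical details that this sketch omits.

```python
import numpy as np

def svcca_similarity(acts1, acts2, keep_var=0.99):
    """acts1, acts2: (num_datapoints, num_neurons) activation matrices."""
    def svd_reduce(a):
        a = a - a.mean(axis=0, keepdims=True)           # center each neuron
        u, s, _ = np.linalg.svd(a, full_matrices=False)
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep_var) + 1
        return u[:, :k] * s[:k]                          # top singular directions

    x, y = svd_reduce(acts1), svd_reduce(acts2)
    # CCA via QR: the canonical correlations are the singular values of Qx^T Qy.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.mean(np.clip(rho, 0.0, 1.0)))

# Example: two sets of activations recorded for the same 1,000 inputs.
a = np.random.randn(1000, 256)
b = a @ np.random.randn(256, 256) + 0.1 * np.random.randn(1000, 256)
print(svcca_similarity(a, b))
```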

Similarity scores are computed for some of the hidden representations from the top latent layers of the networks (closer to the output) between networks trained from random initialization and networks trained from pretrained ImageNet weights. As a baseline, we also compute similarity scores of representations learned from different random initializations. For large models, representations learned from random initialization are much more similar to each other than those learned from transfer learning. For smaller models, there is greater overlap between representation similarity scores.
Representation similarity scores between networks trained from random initialization and networks trained from pretrained ImageNet weights (orange), and baseline similarity scores of representations trained from two different random initializations (blue). Higher values indicate greater similarity. For larger models, representations learned from random initialization are much more similar to each other than those learned through transfer. This is not the case for smaller models.
The reason for this difference between large and small models becomes clear with further investigation into the hidden representations. Large models change less through training, even from random initialization. We perform multiple experiments that illustrate this, from simple filter visualizations to tracking changes between different layers through fine-tuning.

When we combine the results of all the experiments from the paper, we can assemble a table summarizing how much representations change through training on the medical task across (i) transfer learning, (ii) model size and (iii) lower/higher layers.
Effects on Convergence: Feature-Independent Benefits and Hybrid Approaches
One consistent effect of transfer learning was a significant speedup in the time taken for the model to converge. But having seen the mixed results for feature reuse from our representational study, we looked into whether there were other properties of the pretrained weights that might contribute to this speedup. Surprisingly, we found a feature-independent benefit of pretraining — the weight scaling.

We initialized the weights of the neural network as independent and identically distributed (iid), just like random initialization, but using the mean and variance of the pretrained weights. We called this initialization the Mean Var Init, which keeps the pretrained weight scaling but destroys all the features. This Mean Var Init offered significant speedups over random initialization across model architectures and tasks, suggesting that the pretraining process of transfer learning also helps with good weight conditioning.
Filter visualization of weights initialized according to pretrained ImageNet weights, Random Init, and Mean Var Init. Only the ImageNet Init filters have pretrained (Gabor-like) structure, as Rand Init and Mean Var weights are iid.
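
A minimal sketch of the Mean Var Init idea is shown below: for each weight tensor, sample new iid values using only the mean and variance of the corresponding pretrained weights, discarding the learned features. Drawing the samples from a Gaussian is an illustrative choice here.

```python
import numpy as np
import tensorflow as tf

def mean_var_init(pretrained_model, target_model):
    """Initialize target_model with iid weights that match the per-variable
    mean and variance of pretrained_model's weights (features destroyed)."""
    new_weights = []
    for w in pretrained_model.get_weights():
        mu, sigma = w.mean(), w.std()
        new_weights.append(np.random.normal(mu, sigma, size=w.shape))
    target_model.set_weights(new_weights)

# Example (assumes both models share exactly the same architecture).
pretrained = tf.keras.applications.ResNet50(weights="imagenet")
scratch = tf.keras.applications.ResNet50(weights=None)
mean_var_init(pretrained, scratch)
```
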
Recall that our earlier experiments suggested that feature reuse primarily occurs in the lowest layers. To understand this, we performed weight transfusion experiments, where only a subset of the pretrained weights (corresponding to a contiguous set of layers) are transferred, with the remainder of weights being randomly initialized. Comparing convergence speeds of these transfused networks with full transfer learning further supports the conclusion that feature reuse is primarily happening in the lowest layers.
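
As a rough sketch of this kind of weight transfusion, one can copy pretrained weights into only the earliest layers of an otherwise randomly initialized network; selecting layers by index, as below, is a simplification of the block-level transfer used in the paper, and the layer count is arbitrary.

```python
import tensorflow as tf

def transfuse_lowest_layers(pretrained_model, target_model, num_layers):
    """Copy pretrained weights into the first `num_layers` layers of
    target_model; the remaining layers keep their random initialization."""
    for src, dst in zip(pretrained_model.layers[:num_layers],
                        target_model.layers[:num_layers]):
        dst.set_weights(src.get_weights())

# Example: reuse only the earliest layers of ResNet50 (count is illustrative).
pretrained = tf.keras.applications.ResNet50(weights="imagenet", include_top=False)
scratch = tf.keras.applications.ResNet50(weights=None, include_top=False)
transfuse_lowest_layers(pretrained, scratch, num_layers=20)
```
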
Learning curves comparing the convergence speed with AUC on the test set. Using only the scaling of the pretrained weights (Mean Var Init) helps with convergence speed. The figures compare the standard transfer learning and the Mean Var initialization scheme to training from random initialization.
This suggests hybrid approaches to transfer learning, where instead of reusing the full neural network architecture, we can recycle its lowest layers and redesign the upper layers to better suit the target task. This gives us most of the benefits of transfer learning while further enabling flexible model design. In the Figure below, we show the effect of reusing pretrained weights up to Block2 in Resnet50, halving the remainder of the channels, initializing those layers randomly, and then training end-to-end. This matches the performance and convergence of full transfer learning.
Hybrid approaches to transfer learning on Resnet50 (left) and CBR models (right) — reusing a subset of the weights and slimming the remainder of the network (Slim), and using mathematically synthesized Gabors for conv1 (Synthetic Gabor).
The figure above also shows the results of an extreme version of this partial reuse, transferring only the very first convolutional layer with mathematically synthesized Gabor filters (pictured below). Using just these (synthetic) weights offers significant speedups, and hints at many other creative hybrid approaches.
Synthetic Gabor filters used to initialize the first layer of neural networks in some of the experiments in this paper. The Gabor filters are generated as grayscale images and repeated across the RGB channels. Left: Low frequencies. Right: High frequencies.
Conclusion and Open Questions
Transfer learning is a central technique for many domains. In this paper we provide insights on some of its fundamental properties in the medical imaging context, studying performance, feature reuse, the effect of different architectures, convergence and hybrid approaches. Many interesting open questions remain: How much of the original task has the model forgotten? Why do large models change less? Can we get further gains matching higher order moments of pretrained weight statistics? Are the results similar for other tasks, such as segmentation? We look forward to tackling these questions in future work!

Acknowledgements
Special thanks to Samy Bengio and Jon Kleinberg, who are co-authors on this work. Thanks also to Geoffrey Hinton for helpful feedback.

Source: Google AI Blog


Developing Deep Learning Models for Chest X-rays with Adjudicated Image Labels



With millions of diagnostic examinations performed annually, chest X-rays are an important and accessible clinical imaging tool for the detection of many diseases. However, their usefulness can be limited by challenges in interpretation, which requires rapid and thorough evaluation of a two-dimensional image depicting complex, three-dimensional organs and disease processes. Indeed, early-stage lung cancers or pneumothoraces (collapsed lungs) can be missed on chest X-rays, leading to serious adverse outcomes for patients.

Advances in machine learning (ML) present an exciting opportunity to create new tools to help experts interpret medical images. Recent efforts have shown promise in improving lung cancer detection in radiology, prostate cancer grading in pathology, and differential diagnoses in dermatology. For chest X-ray images in particular, large, de-identified public image sets are available to researchers across disciplines, and have facilitated several valuable efforts to develop deep learning models for X-ray interpretation. However, obtaining accurate clinical labels for the very large image sets needed for deep learning can be difficult. Most efforts have either applied rule-based natural language processing (NLP) to radiology reports or relied on image review by individual readers, both of which may introduce inconsistencies or errors that can be especially problematic during model evaluation. Another challenge involves assembling datasets that represent an adequately diverse spectrum of cases (i.e., ensuring inclusion of both “hard” cases and “easy” cases that represent the full spectrum of disease presentation). Finally, some chest X-ray findings are non-specific and depend on clinical information about the patient to fully understand their significance. As such, establishing labels that are clinically meaningful and have consistent definitions can be a challenging component of developing machine learning models that use only the image as input. Without standardized and clinically meaningful datasets as well as rigorous reference standard methods, successful application of ML to interpretation of chest X-rays will be hindered.

To help address these issues, we recently published “Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation” in the journal Radiology. In this study we developed deep learning models to classify four clinically important findings on chest X-rays — pneumothorax, nodules and masses, fractures, and airspace opacities. These target findings were selected in consultation with radiologists and clinical colleagues, so as to focus on conditions that are both critical for patient care and for which chest X-ray images alone are an important and accessible first-line imaging study. Selection of these findings also allowed model evaluation using only de-identified images without additional clinical data.

Models were evaluated using thousands of held-out images from each dataset for which we collected high-quality labels using a panel-based adjudication process among board-certified radiologists. Four separate radiologists also independently reviewed the held-out images in order to compare radiologist accuracy to that of the deep learning models (using the panel-based image labels as the reference standard). For all four findings and across both datasets, the deep learning models demonstrated radiologist-level performance. We are sharing the adjudicated labels for the publicly available data here to facilitate additional research.

Data Overview
This work leveraged over 600,000 images sourced from two de-identified datasets. The first dataset was developed in collaboration with co-authors at the Apollo Hospitals, and consists of a diverse set of chest X-rays obtained over several years from multiple locations across the Apollo Hospitals network. The second dataset is the publicly available ChestX-ray14 image set released by the National Institutes of Health (NIH). This second dataset has served as an important resource for many machine learning efforts, yet has limitations stemming from issues with the accuracy and clinical interpretation of the currently available labels.
Chest X-ray depicting an upper left lobe pneumothorax identified by the model and the adjudication panel, but missed by the individual radiologist readers. Left: The original image. Right: The same image with the most important regions for the model prediction highlighted in orange.
Training Set Labels Using Deep Learning and Visual Image Review
For very large datasets consisting of hundreds of thousands of images, such as those needed to train highly accurate deep learning models, it is impractical to manually assign image labels. As such, we developed a separate, text-based deep learning model to extract image labels using the de-identified radiology reports associated with each X-ray. This NLP model was then applied to provide labels for over 560,000 images from the Apollo Hospitals dataset used for training the computer vision models.
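
The NLP model itself is not described in detail here, but as a hedged illustration of the general approach, a multi-label text classifier over report text could look something like the following; the vocabulary size, architecture and dataset name are assumptions for this sketch, not the model used in the study.

```python
import tensorflow as tf

FINDINGS = ["pneumothorax", "nodule_or_mass", "fracture", "airspace_opacity"]

# Illustrative text-classification model: radiology report text in, one
# probability per finding out (multi-label, so a sigmoid per output).
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20000, output_mode="int", output_sequence_length=256)
# vectorizer.adapt(report_text_ds)   # `report_text_ds` is a placeholder dataset

model = tf.keras.Sequential([
    vectorizer,
    tf.keras.layers.Embedding(20000, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(len(FINDINGS), activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(multi_label=True)])
```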

To reduce noise from any errors introduced by the text-based label extraction and also to provide the relevant labels for a substantial number of the ChestX-ray14 images, approximately 37,000 images across the two datasets were visually reviewed by radiologists. These were separate from the NLP-based labels and helped to ensure high-quality labels across such a large, diverse set of training images.

Creating and Sharing Improved Reference Standard Labels
To generate high-quality reference standard labels for model evaluation, we utilized a panel-based adjudication process, whereby three radiologists reviewed all final tune and test set images and resolved disagreements through discussion. This often allowed difficult findings that were initially only detected by a single radiologist to be identified and documented appropriately. To reduce the risk of bias based on any individual radiologist’s personality or seniority, the discussions took place anonymously via an online discussion and adjudication system.

Because the lack of available adjudicated labels was a significant initial barrier to our work, we are sharing with the research community all of the adjudicated labels for the publicly available ChestX-ray14 dataset, including 2,412 training/validation set images and 1,962 test set images (4,374 images in total). We hope that these labels will facilitate future machine learning efforts and enable better apples-to-apples comparisons between machine learning models for chest X-ray interpretation.

Future Outlook
This work presents several contributions: (1) releasing adjudicated labels for images from a publicly available dataset; (2) a method to scale accurate labeling of training data using a text-based deep learning model; (3) evaluation using a diverse set of images with expert-adjudicated reference standard labels; and ultimately (4) radiologist-level performance of deep learning models for clinically important findings on chest X-rays.

However, with regard to model performance, achieving expert-level accuracy on average is just a part of the story. Even though overall accuracy for the deep learning models was consistently similar to that of radiologists for any given finding, performance for both varied across datasets. For example, the sensitivity for detecting pneumothorax among radiologists was approximately 79% for the ChestX-ray14 images, but was only 52% for the same radiologists on the other dataset, suggesting a more difficult collection of cases in the latter. This highlights the importance of validating deep learning tools on multiple, diverse datasets and eventually across the patient populations and clinical settings in which any model is intended to be used.

The performance differences between datasets also emphasize the need for standardized evaluation image sets with accurate reference standards in order to allow comparison across studies. For example, if two different models for the same finding were evaluated using different datasets, comparing performance would be of minimal value without knowing additional details such as the case mix, model error modes, or radiologist performance on the same cases.

Finally, the model often identified findings that were consistently missed by radiologists, and vice versa. As such, strategies that combine the unique “skills” of both the deep learning systems and human experts are likely to hold the most promise for realizing the potential of AI applications in medical image interpretation.

Acknowledgements
Key contributors to this project at Google include Sid Mittal, Gavin Duggan, Anna Majkowska, Scott McKinney, Andrew Sellergren, David Steiner, Krish Eswaran, Po-Hsuan Cameron Chen, Yun Liu, Shravya Shetty, and Daniel Tse. Significant contributions and input were also made by radiologist collaborators Joshua Reicher, Alexander Ding, and Sreenivasa Raju Kalidindi. The authors would also like to acknowledge many members of the Google Health radiology team including Jonny Wong, Diego Ardila, Zvika Ben-Haim, Rory Sayres, Shahar Jamshy, Shabir Adeel, Mikhail Fomitchev, Akinori Mitani, Quang Duong, William Chen and Sahar Kazemzadeh. Sincere appreciation also goes to the many radiologists who enabled this work through their expert image interpretation efforts throughout the project.

Source: Google AI Blog


Astrophotography with Night Sight on Pixel Phones



Taking pictures of outdoor scenes at night has so far been the domain of large cameras, such as DSLRs, which are able to achieve excellent image quality, provided photographers are willing to put up with bulky equipment and sometimes tricky postprocessing. A few years ago, experiments with phone camera nighttime photography produced pleasing results, but the methods employed were impractical for all but the most dedicated users.

Night Sight, introduced last year as part of the Google Camera App for the Pixel 3, allows phone photographers to take good-looking handheld shots in environments so dark that the normal camera mode would produce grainy, severely underexposed images. In a previous blog post our team described how Night Sight is able to do this, with a technical discussion presented at SIGGRAPH Asia 2019.

This year’s version of Night Sight pushes the boundaries of low-light photography with phone cameras. By allowing exposures up to 4 minutes on Pixel 4, and 1 minute on Pixel 3 and 3a, the latest version makes it possible to take sharp and clear pictures of the stars in the night sky or of nighttime landscapes without any artificial light.
The Milky Way as seen from the summit of Haleakala volcano on a cloudless and moonless September night, captured using the Google Camera App running on a Pixel 4 XL phone. The image has not been retouched or post-processed in any way. It shows significantly more detail than a person can see with the unaided eye on a night this dark. The dust clouds along the Milky Way are clearly visible, the sky is covered with thousands of stars, and unlike human night vision, the picture is colorful.
A Brief Overview of Night Sight
The amount of light detected by the camera’s image sensor inherently has some uncertainty, called “shot noise,” which causes images to look grainy. The visibility of shot noise decreases as the amount of light increases; therefore, it is best for the camera to gather as much light as possible to produce a high-quality image.

How much light reaches the image sensor in a given amount of time is limited by the aperture of the camera lens. Extending the exposure time for a photo increases the total amount of light captured, but if the exposure is long, motion in the scene being photographed and unsteadiness of the handheld camera can cause blur. To overcome this, Night Sight splits the exposure into a sequence of multiple frames with shorter exposure times and correspondingly less motion blur. The frames are first aligned, compensating for both camera shake and in-scene motion, and then averaged, with careful treatment of cases where perfect alignment is not possible. While individual frames may be fairly grainy, the combined, averaged image looks much cleaner.
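
A toy NumPy simulation illustrates why merging frames helps: averaging N aligned frames reduces the standard deviation of independent noise by roughly a factor of √N. The photon counts and read-noise level below are made up, and the alignment and robust-merging steps that Night Sight performs are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
true_scene = rng.uniform(0.0, 0.05, size=(256, 256))   # a very dark scene

def noisy_frame(scene, read_noise=0.01):
    # Shot noise is Poisson in the photon count; add a little read noise too.
    photons = rng.poisson(scene * 1000) / 1000.0
    return photons + rng.normal(0.0, read_noise, scene.shape)

frames = np.stack([noisy_frame(true_scene) for _ in range(15)])
single = frames[0]
merged = frames.mean(axis=0)      # assumes the frames are already aligned

print("single-frame noise:", np.std(single - true_scene))
print("merged noise:      ", np.std(merged - true_scene))
```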

Experimenting with Exposure Time
Soon after the original Night Sight was released, we started to investigate taking photos in very dark outdoor environments with the goal of capturing the stars. We realized that, just as with our previous experiments, high quality pictures would require exposure times of several minutes. Clearly, this cannot work with a handheld camera; the phone would have to be placed on a tripod, a rock, or whatever else might be available to hold the camera steady.

Just as with handheld Night Sight photos, nighttime landscape shots must take motion in the scene into account — trees sway in the wind, clouds drift across the sky, and the moon and the stars rise in the east and set in the west. Viewers will tolerate motion-blurred clouds and tree branches in a photo that is otherwise sharp, but motion-blurred stars that look like short line segments look wrong. To mitigate this, we split the exposure into frames with exposure times short enough to make the stars look like points of light. Taking pictures of real night skies, we found that the per-frame exposure time should not exceed 16 seconds.
Motion-blurred stars in a single-frame two-minute exposure.
While the number of frames we can capture for a single photo, and therefore the total exposure time, is limited by technical considerations, we found that it is more tightly constrained by the photographer’s patience. Few are willing to wait more than four minutes for a picture, so we limited a single Night Sight image to at most 15 frames with up to 16 seconds per frame.

Sixteen-second exposures allow us to capture enough light to produce recognizable images, but a usable camera app capable of taking pictures that look great must deal with additional issues that are unique to low-light photography.

Dark Current and Hot Pixels
Dark current causes CMOS image sensors to record a spurious signal, as if the pixels were exposed to a small amount of light, even when no actual light is present. The effect is negligible when exposure times are short, but it becomes significant with multi-second captures. Due to unavoidable imperfections in the sensor’s silicon substrate, some pixels exhibit higher dark current than their neighbors. In a recorded frame these “warm pixels,” as well as defective “hot pixels,” are visible as tiny bright dots.

Warm and hot pixels can be identified by comparing the values of neighboring pixels within the same frame and across the sequence of frames recorded for a photo, and looking for outliers. Once an outlier has been detected, it is concealed by replacing its value with the average of its neighbors. Since the original pixel value is discarded, there is a loss of image information, but in practice this does not noticeably affect image quality.
Left: A small region of a long-exposure image with hot pixels, and warm pixels caused by dark current nonuniformity. Right: The same image after outliers have been removed. Fine details in the landscape, including small points of light, are preserved.
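
A simplified single-frame version of this outlier concealment might look like the sketch below: flag pixels that deviate strongly from their local median and replace them with the average of their neighbors. The detection threshold is illustrative, and the production pipeline also compares values across the frame sequence, which this sketch does not.

```python
import numpy as np
from scipy.ndimage import convolve, median_filter

def conceal_hot_pixels(frame, threshold=5.0):
    """Detect single-frame outliers (hot/warm pixels) and replace each one
    with the average of its 8 neighbors."""
    local_median = median_filter(frame, size=3)
    residual = frame - local_median
    noise_sigma = 1.4826 * np.median(np.abs(residual)) + 1e-12  # robust estimate
    outliers = np.abs(residual) > threshold * noise_sigma

    kernel = np.ones((3, 3)); kernel[1, 1] = 0.0        # exclude the center pixel
    neighbor_mean = convolve(frame, kernel / kernel.sum(), mode="reflect")
    return np.where(outliers, neighbor_mean, frame), outliers

# Example: a dark frame with a couple of simulated hot pixels.
rng = np.random.default_rng(1)
frame = rng.normal(0.02, 0.005, size=(128, 128))
frame[10, 17] = frame[90, 40] = 1.0
cleaned, outliers = conceal_hot_pixels(frame)
print("hot pixels found:", int(outliers.sum()))
```
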
Scene Composition
Mobile phones use their screens as electronic viewfinders — the camera captures a continuous stream of frames that is displayed as a live video in order to aid with shot composition. The frames are simultaneously used by the camera’s autofocus, auto exposure, and auto white balance systems.

To feel responsive to the photographer, the viewfinder is updated at least 15 times per second, which limits the viewfinder frame exposure time to 66 milliseconds. This makes it challenging to display a detailed image in low-light environments. At light levels below the rough equivalent of a full moon or so, the viewfinder becomes mostly gray — maybe showing a few bright stars, but none of the landscape — and composing a shot becomes difficult.

To assist in framing the scene in extremely low light, Night Sight displays a “post-shutter viewfinder”. After the shutter button has been pressed, each long-exposure frame is displayed on the screen as soon as it has been captured. With exposure times up to 16 seconds, these frames have collected almost 250 times more light than the regular viewfinder frames, allowing the photographer to easily see image details as soon as the first frame has been captured. The composition can then be adjusted by moving the phone while the exposure continues. Once the composition is correct, the initial shot can be stopped, and a second shot can be captured where all frames have the desired composition.
Left: The live Night Sight viewfinder in a very dark outdoor environment. Except for a few points of light from distant buildings, the landscape and the sky are largely invisible. Right: The post-shutter viewfinder during a long exposure shot. The image is much clearer; it updates after every long-exposure frame.
Autofocus
Autofocus ensures that the image captured by the camera is sharp. In normal operation, the incoming viewfinder frames are analyzed to determine how far the lens must be from the sensor to produce an in-focus image, but in very low light the viewfinder frames can be so dark and grainy that autofocus fails due to lack of detectable image detail. When this happens, Night Sight on Pixel 4 switches to “post-shutter autofocus.” After the user presses the shutter button, the camera captures two autofocus frames with exposure times up to one second, long enough to detect image details even in low light. These frames are used only to focus the lens and do not contribute directly to the final image.

Even though using long-exposure frames for autofocus leads to consistently sharp images at light levels low enough that the human visual system cannot clearly distinguish objects, sometimes it gets too dark even for post-shutter autofocus. In this case the camera instead focuses at infinity. In addition, Night Sight includes manual focus buttons, allowing the user to focus on nearby objects in very dark conditions.

Sky Processing
When images of very dark environments are viewed on a screen, they are displayed much brighter than the original scenes. This can change the viewer’s perception of the time of day when the photos were captured. At night we expect the sky to be dark. If a picture taken at night shows a bright sky, then we see it as a daytime scene, perhaps with slightly unusual lighting.

This effect is countered in Night Sight by selectively darkening the sky in photos of low-light scenes. To do this, we use machine learning to detect which regions of an image represent sky. An on-device convolutional neural network, trained on over 100,000 images that were manually labeled by tracing the outlines of sky regions, identifies each pixel in a photograph as “sky” or “not sky.”
A landscape picture taken on a bright full-moon night, without sky processing (left half), and with sky darkening (right half). Note that the landscape is not darkened.
Sky detection also makes it possible to perform sky-specific noise reduction, and to selectively increase contrast to make features like clouds, color gradients, or the Milky Way more prominent.
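
As a toy illustration of selective sky darkening, the sketch below assumes a per-pixel sky probability map from the segmentation network is already available and simply blends in a darkening gain where that probability is high; the gain value and blending curve are illustrative, not the tuning used in Night Sight.

```python
import numpy as np

def darken_sky(image, sky_prob, sky_gain=0.6):
    """image: float RGB in [0, 1]; sky_prob: per-pixel 'sky' probability from
    the segmentation model. Darken only where sky_prob is high."""
    gain = 1.0 - (1.0 - sky_gain) * sky_prob[..., None]   # 1.0 on land, sky_gain on sky
    return np.clip(image * gain, 0.0, 1.0)

# Example with placeholder data.
image = np.random.rand(256, 256, 3).astype(np.float32)
sky_prob = np.zeros((256, 256), dtype=np.float32)
sky_prob[:128] = 1.0           # pretend the top half of the frame is sky
darkened = darken_sky(image, sky_prob)
```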

Results
With the phone on a tripod, Night Sight produces sharp pictures of star-filled skies, and as long as there is at least a small amount of moonlight, landscapes will be clear and colorful.

Of course, the phone’s capabilities are not limitless, and there is always room for improvement. Although nighttime scenes are dark overall, they often contain bright light sources such as the moon, distant street lamps, or prominent stars. While we can capture a moonlit landscape, or details on the surface of the moon, the extremely large brightness range, which can exceed 500,000:1, so far prevents us from capturing both in the same image. Also, when the stars are the only source of illumination, we can take clear pictures of the sky, but the landscape is only visible as a silhouette.

For Pixel 4 we have been using the brightest part of the Milky Way, near the constellation Sagittarius, as a benchmark for the quality of images of a moonless sky. By that standard Night Sight is doing very well. Although Milky Way photos exhibit some residual noise, they are pleasing to look at, showing more stars and more detail than a person can see looking at the real night sky.
Examples of photos taken with the Google Camera App on Pixel 4. An album with more pictures can be found here.
Tips and Tricks
In the course of developing and testing Night Sight astrophotography we gained some experience taking outdoor nighttime pictures with Pixel phones, and we’d like to share a list of tips and tricks that have worked for us. You can find it here.

Acknowledgements
Night Sight is an ongoing collaboration between several teams at Google. Key contributors to the project include from the Gcam team, Orly Liba, Nikhil Karnad, Charles He, Manfred Ernst, Michael Milne, Andrew Radin, Navin Sarma, Jon Barron, Yun-Ta Tsai, Tianfan Xue, Jiawen Chen, Dillon Sharlet, Ryan Geiss, Sam Hasinoff, Alex Schiffhauer, Yael Pritch Knaan and Marc Levoy; from the Super Res Zoom team, Bart Wronski, Peyman Milanfar, and Ignacio Garcia Dorado; from the Google camera app team, Emily To, Gabriel Nava, Sushil Nath, Isaac Reynolds, and Michelle Chen; from the Android platform team, Ryan Chan, Ying Chen Lou, and Bob Hung; from the Mobile Vision team, Longqi (Rocky) Cai, Huizhong Chen, Emily Manoogian, Nicole Maffeo, and Tomer Meron; from Machine Perception, Elad Eban and Yair Movshovitz-Attias.

Source: Google AI Blog


RecSim: A Configurable Simulation Platform for Recommender Systems



Significant advances in machine learning, speech recognition, and language technologies are rapidly transforming the way in which recommender systems engage with users. As a result, collaborative interactive recommenders (CIRs), recommender systems that engage in a deliberate sequence of interactions with a user to best meet that user's needs, have emerged as a tangible goal for online services.

Despite this, the deployment of CIRs has been limited by challenges in developing algorithms and models that reflect the qualitative characteristics of sequential user interaction. Reinforcement learning (RL) is the de facto standard ML approach for addressing sequential decision problems, and as such is a natural paradigm for modeling and optimizing sequential interaction in recommender systems. However, it remains under-investigated and under-utilized in CIRs in both research and practice. One major impediment is the lack of general-purpose simulation platforms for sequential recommender settings, whereas simulation has been one of the primary means for developing and evaluating RL algorithms in real-world applications like robotics.

To address this, we have developed RᴇᴄSɪᴍ (available here), a configurable platform for authoring simulation environments to facilitate the study of RL algorithms in recommender systems (and CIRs in particular). RᴇᴄSɪᴍ allows both researchers and practitioners to test the limits of existing RL methods in synthetic recommender settings. RecSim’s aim is to support simulations that mirror specific aspects of user behavior found in real recommender systems and serve as a controlled environment for developing, evaluating and comparing recommender models and algorithms, especially RL systems designed for sequential user-system interaction.

As an open-source platform, RᴇᴄSɪᴍ: (i) facilitates research at the intersection of RL and recommender systems; (ii) encourages reproducibility and model-sharing; (iii) aids recommender-systems practitioners interested in applying RL by allowing models and algorithms to be rapidly tested and refined in simulation, before incurring the potential cost (e.g., time, user impact) of live experiments; and (iv) serves as a resource for academic-industry collaboration through the release of “realistic” stylized models of user behavior without revealing user data or sensitive industry strategies.

Reinforcement Learning and Recommendation Systems
One challenge in applying RL to recommenders is that most recommender research is developed and evaluated using static datasets that do not reflect the sequential, repeated interaction a recommender has with its users. Even those with temporal extent, such as MovieLens 1M, do not (easily) support predictions about the long-term performance of novel recommender policies that differ significantly from those used to collect the data, as many of the factors that impact user choice are not recorded within the data. This makes the evaluation of even basic RL algorithms very difficult, especially when it comes to reasoning about the long-term consequences of some new recommendation policy — research shows changes in policy can have long-term, cumulative impact on user behavior. The ability to model such user behaviors in a simulated environment, and devise and test new recommendation algorithms, including those using RL, can greatly accelerate the research and development cycle for such problems.

Overview of RᴇᴄSɪᴍ
RᴇᴄSɪᴍ simulates a recommender agent’s interaction with an environment consisting of a user model, a document model and a user choice model. The agent interacts with the environment by recommending sets or lists of documents (known as slates) to users, and has access to observable features of simulated individual users and documents to make recommendations. The user model samples users from a distribution over (configurable) user features (e.g., latent features, like interests or satisfaction; observable features, like user demographic; and behavioral features, such as visit frequency or time budget). The document model samples items from a prior distribution over document features, both latent (e.g., quality) and observable (e.g., length, popularity). This prior, as all other components of RᴇᴄSɪᴍ, can be specified by the simulation developer, possibly informed (or learned) from application data.

The level of observability for both user and document features is customizable. When the agent recommends documents to a user, the response is determined by a user-choice model, which can access observable document features and all user features. Other aspects of a user’s response (e.g., time spent engaging with the recommendation) can depend on latent document features, such as document topic or quality. Once a document is consumed, the user state undergoes a transition through a configurable user transition model, since user satisfaction or interests might change.
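
To make these components concrete, here is a highly simplified, self-contained simulation loop in the same spirit; it is not the RᴇᴄSɪᴍ API, and all of the names, distributions and transition rules below are made-up stand-ins for the configurable user, document and choice models described above.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_TOPICS, NUM_DOCS, SLATE_SIZE = 5, 20, 3

# Document model: each document has a topic (observable) and a latent quality.
doc_topics = rng.integers(0, NUM_TOPICS, size=NUM_DOCS)
doc_quality = rng.normal(0.0, 1.0, size=NUM_DOCS)        # latent

# User model: latent interest per topic and a time budget that shrinks as the
# user consumes documents (a simple stand-in for a transition model).
user_interest = rng.normal(0.0, 1.0, size=NUM_TOPICS)
time_budget = 20.0

def choice_model(slate):
    # Multinomial-logit choice over the slate based on interest in each topic.
    scores = np.array([user_interest[doc_topics[d]] for d in slate])
    probs = np.exp(scores) / np.exp(scores).sum()
    return rng.choice(slate, p=probs)

while time_budget > 0:
    slate = rng.choice(NUM_DOCS, size=SLATE_SIZE, replace=False)  # random agent
    chosen = choice_model(slate)
    engagement = 1.0 + max(doc_quality[chosen], 0.0)              # reward signal
    # User transition: interest drifts toward the consumed topic; budget shrinks.
    user_interest[doc_topics[chosen]] += 0.1
    time_budget -= engagement
```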

We note that RᴇᴄSɪᴍ provides the ability to easily author specific aspects of user behavior of interest to the researcher or practitioner, while ignoring others. This can provide the critical ability to focus on modeling and algorithmic techniques designed for novel phenomena of interest (as we illustrate in two applications below). This type of abstraction is often critical to scientific modeling. Consequently, high-fidelity simulation of all elements of user behavior is not an explicit goal of RᴇᴄSɪᴍ. That said, we expect that it may also serve as a platform that supports “sim-to-real” transfer in certain cases (see below).
Data Flow through components of RᴇᴄSɪᴍ. Colors represent different model components — user and user-choice models (green), document model (blue), and the recommender agent (red).
Applications
We have used RᴇᴄSɪᴍ to investigate several key research problems that arise in the use of RL in recommender systems. For example, slate recommendation gives rise to challenging RL problems, since the space of possible actions grows exponentially with slate size, creating difficulties for exploration, generalization and action optimization. We used RᴇᴄSɪᴍ to develop a novel decomposition technique that exploits simple, widely applicable assumptions about user choice behavior to tractably compute Q-values of entire recommendation slates. In particular, RᴇᴄSɪᴍ was used to test a number of experimental hypotheses, such as algorithm performance and robustness to different assumptions about user behavior.

Future Work
While RᴇᴄSɪᴍ provides ample opportunity for researchers and practitioners to probe and question assumptions made by RL/recommender algorithms in stylized environments, we are developing several important extensions. These include: (i) methodologies to fit stylized user models to usage logs to partially address the “sim-to-real” gap; (ii) the development of natural APIs using TensorFlow’s probabilistic APIs to facilitate model specification and learning, as well as scaling up simulation and inference algorithms using accelerators and distributed execution; and (iii) the extension to full-factor, mixed-mode interaction models that will be the hallmark of modern CIRs — e.g., language-based dialogue, preference elicitation, explanations, etc.

Our hope is that RᴇᴄSɪᴍ will serve as a valuable resource that bridges the gap between recommender systems and RL research — the use cases above are examples of how it can be used in this fashion. We also plan to pursue it as a platform to support academic-industry collaborations, through the sharing of stylized models of user behavior that, at suitable levels of abstraction, reflect a degree of realism that can drive useful model and algorithm development.

Further details of the RᴇᴄSɪᴍ framework can be found in the white paper, while code and colabs/tutorials are available here.

Acknowledgements
We thank our collaborators and early adopters of RᴇᴄSɪᴍ, including the other members of the RᴇᴄSɪᴍ team: Eugene Ie, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu and Craig Boutilier.

Source: Google AI Blog


New Solutions for Quantum Gravity with TensorFlow



Recent strides in machine learning (ML) research have led to the development of tools useful for research problems well beyond the realm for which they were designed. The value of these tools when applied to topics ranging from teaching robots how to throw to predicting the olfactory properties of molecules is now beginning to be realized. Inspired by advances such as these, we undertook the challenge of applying TensorFlow, a computing platform normally used for ML, to advance the understanding of fundamental physics.

Perhaps the biggest open problem in fundamental theoretical physics is that our current understanding of quantum mechanics only includes three of the four fundamental forces — the electromagnetic, strong, and weak forces. There is currently no complete quantum theory that also includes the force of gravitation, while still matching experimental observations, i.e., an accurate model of quantum gravity.

One promising approach to a unified model that includes quantum gravity, which has survived many mathematical consistency checks, is called M-Theory, or “The Theory formerly known as Strings,” introduced in 1995 by Edward Witten. In the everyday world, we all experience four dimensions — three spatial dimensions (x, y, and z), plus time (t). M-Theory predicts that, at very short lengths, the Universe is described, instead, by eleven dimensions. But, as one can imagine, establishing the connection between the four-dimensional world that we observe and the 11-dimensional world predicted by M-theory is exceedingly difficult to do analytically. In fact, it might require analytic manipulation of equations having more terms than there are electrons in the Universe.

This summer, we published an article in the Journal of High Energy Physics where we introduced novel ways to address such problems through creative use of ML technology. Using simplifications enabled by TensorFlow, we managed to bring the total number of known (stable or unstable) equilibrium solutions for one particular type of M-Theory spacetime geometries to 194, including a new and tachyon-free four-dimensional model universe. The geometries that we studied are special in that they are still (barely) accessible with exact calculations that do not require neglecting potentially important terms. We have also released a short instructive Google colab as well as a more powerful Python library for use in related research.

Applying TensorFlow to M-Theory
This work is predicated on the key observation that a mixed numerical and analytic approach can be more powerful than a purely analytical method. Instead of attempting to find analytic solutions with brute force, we use a numerical approach that leverages TensorFlow for the initial search for solutions to the model. This then yields hypotheses about which specific combinations can be tested and analyzed with stringent mathematical methods, ultimately proving the actual existence of a conjectured solution. This represents a novel methodology for making further progress in theoretical physics.
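
As a toy illustration of the numerical half of this approach, one can use TensorFlow's automatic differentiation to minimize a stationarity measure |∇V|² of a scalar potential, yielding candidate equilibria that can then be examined with exact methods. The potential below is a made-up stand-in, not the supergravity scalar potential studied in the paper.

```python
import tensorflow as tf

# Toy scalar 'potential' standing in for the (much more complicated) scalar
# potential of the actual model; its form and dimensionality are illustrative.
def potential(phi):
    return tf.reduce_sum(tf.cos(phi)) + 0.1 * tf.reduce_sum(phi**2)

def stationarity(phi):
    # |grad V|^2 vanishes exactly at equilibrium (stationary) points of V.
    with tf.GradientTape() as inner_tape:
        v = potential(phi)
    grad = inner_tape.gradient(v, phi)
    return tf.reduce_sum(grad**2)

phi = tf.Variable(tf.random.normal([8], seed=0))
opt = tf.keras.optimizers.SGD(learning_rate=0.05)

for step in range(2000):
    with tf.GradientTape() as tape:
        loss = stationarity(phi)
    opt.apply_gradients([(tape.gradient(loss, phi), phi)])

print("stationarity violation:", float(stationarity(phi)))
print("candidate equilibrium:", phi.numpy())   # starting point for exact analysis
```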

Conclusion
We hope that these results will be an important step in interpreting M-theory, and demonstrate how the research community can use new ML tools, such as TensorFlow, to approach other similarly complex problems. We are already applying the newly discovered methods in further theoretical physics research.

Acknowledgements
This research was conducted by Iulia M. Comşa, Moritz Firsching, and Thomas Fischbacher. Additional thanks go to Jyrki Alakuijala, Rahul Sukthankar, and Jay Yagnik for encouragement and support.

Source: Google AI Blog


SPICE: Self-Supervised Pitch Estimation



A sound’s pitch is a qualitative measure of its frequency, where a sound with a high pitch is higher in frequency than one of low pitch. Through tracking relative differences in pitch, our auditory system is able to recognize audio features, such as a song’s melody. Pitch estimation has received a great deal of attention over the past decades, due to its central importance in several domains, ranging from music information retrieval to speech analysis.

Traditionally, simple signal processing pipelines were proposed to estimate pitch, working either in the time domain (e.g., pYIN) or in the frequency domain (e.g., SWIPE). But until recently, machine learning methods had not been able to outperform such hand-crafted signal processing pipelines. This was due to the lack of annotated data, which is particularly tedious and difficult to obtain at the temporal and frequency resolution required to train fully supervised models. The CREPE model was able to overcome these limitations to achieve state-of-the-art results by training on a synthetically generated dataset combined with other manually annotated datasets.

In our recent paper, we present a different approach to training pitch estimation models in the absence of annotated data. Inspired by the observation that for humans, including professional musicians, it is typically much easier to estimate relative pitch (the frequency interval between two notes) than absolute pitch (the true fundamental frequency), we designed SPICE (Self-supervised PItCh Estimation) to solve a similar task. This approach relies on self-supervision by defining an auxiliary task (also known as a pretext task) that can be learned in a completely unsupervised way.
Constant-Q transform of an audio clip, superimposed on a representation of pitch as estimated by SPICE (video).
The SPICE model consists of a convolutional encoder, which produces a single scalar embedding that maps linearly to pitch. To accomplish this, we feed two signals to the encoder, a reference signal along with a signal that is pitch shifted from the reference by a random, known amount. Then, we devise a loss function that forces the difference between the scalar embeddings to be proportional to the known difference in pitch. For convenience, we perform pitch shifting in the domain defined by the constant-Q transform (CQT), because this corresponds to a simple translation along the log-spaced frequency axis.
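
A much-simplified sketch of this training signal is shown below: two versions of a CQT frame, offset by a known number of bins, are passed through the same encoder, and a loss penalizes deviations of the output difference from a value proportional to the known shift. The dense encoder, the pitch scale, the use of tf.roll for the shift and the Huber delta are all illustrative simplifications; the actual model uses a convolutional encoder plus a reconstruction loss and a confidence head.

```python
import tensorflow as tf

huber = tf.keras.losses.Huber(delta=0.25)

def relative_pitch_loss(y_ref, y_shifted, shift_bins, pitch_scale=0.02):
    # The difference between the two scalar outputs should be proportional
    # to the (known) pitch shift, measured here in CQT bins.
    predicted_diff = y_ref - y_shifted
    target_diff = (pitch_scale * tf.cast(shift_bins, tf.float32)
                   * tf.ones_like(predicted_diff))
    return huber(target_diff, predicted_diff)

# Tiny stand-in encoder: one scalar output per CQT frame.
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

def train_step(cqt_frames, optimizer, max_shift=12):
    # Self-supervision: pitch-shift by translating the CQT frame by k bins.
    k = tf.random.uniform([], -max_shift, max_shift + 1, dtype=tf.int32)
    shifted = tf.roll(cqt_frames, shift=k, axis=-1)     # crude CQT translation
    with tf.GradientTape() as tape:
        loss = relative_pitch_loss(encoder(cqt_frames), encoder(shifted), k)
    grads = tape.gradient(loss, encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, encoder.trainable_variables))
    return loss

# Usage (placeholder dataset of CQT frames, shape [batch, num_bins]):
# optimizer = tf.keras.optimizers.Adam(1e-4)
# for batch in cqt_dataset: train_step(batch, optimizer)
```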

Pitch is well defined only when the underlying signal is harmonic, i.e., when it contains components with integer multiples of the fundamental frequency. So, an important function of the model is to determine when the output is meaningful and reliable. For example, in the figure below, there is no harmonic signal in the interval between 1.2s and 2s resulting in low enough confidence in the pitch estimation that no pitch estimate is generated. SPICE is designed to learn the level of confidence of the pitch estimation in a self-supervised fashion, instead of relying on handcrafted solutions.
SPICE model architecture (simplified). Two pitch-shifted versions of the same CQT frame are fed to two encoders with shared weights. The loss is designed to make the difference between the outputs of the encoders proportional to the relative pitch difference. In addition (not shown), a reconstruction loss is added to regularize the model. The model also learns to produce the confidence of the pitch estimation.
We evaluate our model against publicly available datasets and show that we outperform handcrafted baselines while matching the level of accuracy attained by CREPE, despite having no access to ground truth labels. In addition, by properly augmenting our data during training, SPICE is also able to operate in noisy conditions, e.g., to extract pitch from the singing voice when this is mixed in with background music. The chart below shows a comparison between SWIPE (a hand-crafted signal-processing method), CREPE (a fully supervised model) and SPICE (a self-supervised model) on the MIR-1k dataset.
Evaluation on the MIR-1k dataset, mixing in background music at different signal-to-noise ratios.
The SPICE model has been deployed in FreddieMeter, a web app in which singers can score their performance against Freddie Mercury.

Acknowledgments

The work described here was authored by Beat Gfeller, Christian Frank, Dominik Roblek, Matt Sharifi, Marco Tagliasacchi and Mihajlo Velimirović. We are grateful for all discussions and feedback on this work that we received from our colleagues at Google. The SingingVoices dataset used for training the models in this work has been collected by Alexandra Gherghina as part of FreddieMeter, which is using SPICE and a vocal timbre similarity model to understand how closely a singer matches Freddie Mercury.

Source: Google AI Blog


Introducing the Next Generation of On-Device Vision Models: MobileNetV3 and MobileNetEdgeTPU



On-device machine learning (ML) is an essential component in enabling privacy-preserving, always-available and responsive intelligence. This need to bring on-device machine learning to compute and power-limited devices has spurred the development of algorithmically-efficient neural network models and hardware capable of performing billions of math operations per second, while consuming only a few milliwatts of power. The recently launched Google Pixel 4 exemplifies this trend, and ships with the Pixel Neural Core that contains an instantiation of the Edge TPU architecture, Google’s machine learning accelerator for edge computing devices, and powers Pixel 4 experiences such as face unlock, a faster Google Assistant and unique camera features. Similarly, algorithms, such as MobileNets, have been critical for the success of on-device ML by providing compact and efficient neural network models for mobile vision applications.

Today we are pleased to announce the release of source code and checkpoints for MobileNetV3 and the Pixel 4 Edge TPU-optimized counterpart MobileNetEdgeTPU model. These models are the culmination of the latest advances in hardware-aware AutoML techniques as well as several advances in architecture design. On mobile CPUs, MobileNetV3 is twice as fast as MobileNetV2 with equivalent accuracy, and advances the state-of-the-art for mobile computer vision networks. On the Pixel 4 Edge TPU hardware accelerator, the MobileNetEdgeTPU model pushes the boundary further by improving model accuracy while simultaneously reducing the runtime and power consumption.

Building MobileNetV3
In contrast with the hand-designed previous version of MobileNet, MobileNetV3 relies on AutoML to find the best possible architecture in a search space friendly to mobile computer vision tasks. To most effectively exploit the search space we deploy two techniques in sequence — MnasNet and NetAdapt. First, we search for a coarse architecture using MnasNet, which uses reinforcement learning to select the optimal configuration from a discrete set of choices. Then we fine-tune the architecture using NetAdapt, a complementary technique that trims under-utilized activation channels in small decrements. To provide the best possible performance under different conditions we have produced both large and small models.
Comparison of accuracy vs. latency for mobile models on the ImageNet classification task using the Google Pixel 4 CPU.
MobileNetV3 Search Space
The MobileNetV3 search space builds on multiple recent advances in architecture design that we adapt for the mobile environment. First, we introduce a new activation function called hard-swish (h-swish), which is based on the Swish nonlinearity function. The critical drawback of the Swish function is that it is very inefficient to compute on mobile hardware. So, instead, we use an approximation that can be efficiently expressed as a product of two piecewise linear functions.
Next we introduce the mobile-friendly squeeze-and-excitation block, which replaces the classical sigmoid function with a piecewise linear approximation.
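
In code, these two ingredients are straightforward; the sketch below follows the published definitions (h-swish(x) = x · ReLU6(x + 3) / 6, and a ReLU6-based hard sigmoid gating the squeeze-and-excitation block), with the reduction ratio chosen for illustration.

```python
import tensorflow as tf

def hard_sigmoid(x):
    # Piecewise-linear replacement for the sigmoid: relu6(x + 3) / 6.
    return tf.nn.relu6(x + 3.0) / 6.0

def hard_swish(x):
    # h-swish: x * relu6(x + 3) / 6, an efficient approximation of Swish.
    return x * hard_sigmoid(x)

def squeeze_excite(inputs, reduction=4):
    """Mobile-friendly squeeze-and-excitation: gate channels with hard_sigmoid."""
    channels = inputs.shape[-1]
    s = tf.keras.layers.GlobalAveragePooling2D()(inputs)            # squeeze
    s = tf.keras.layers.Dense(channels // reduction, activation="relu")(s)
    s = tf.keras.layers.Dense(channels)(s)                          # excite
    s = tf.keras.layers.Reshape((1, 1, channels))(s)
    return inputs * hard_sigmoid(s)                                 # rescale channels

# Example on a dummy feature map.
x = tf.random.normal([1, 14, 14, 64])
y = squeeze_excite(hard_swish(x))
```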

Combining h-swish plus mobile-friendly squeeze-and-excitation with a modified version of the inverted bottleneck structure introduced in MobileNetV2 yielded a new building block for MobileNetV3.
MobileNetV3 extends the MobileNetV2 inverted bottleneck structure by adding h-swish and mobile friendly squeeze-and-excitation as searchable options.
These parameters defined the search space used in constructing MobileNetV3:
  • Size of expansion layer
  • Degree of squeeze-excite compression
  • Choice of activation function: h-swish or ReLU
  • Number of layers for each resolution block
We also introduced a new efficient last stage at the end of the network that further reduced latency by 15%.
MobileNetV3 Object Detection and Semantic Segmentation
In addition to classification models, we also introduced MobileNetV3 object detection models, which reduced detection latency by 25% relative to MobileNetV2 at the same accuracy for the COCO dataset.

In order to optimize MobileNetV3 for efficient semantic segmentation, we introduced a low-latency segmentation decoder called Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). This new decoder contains three branches, one for low resolution semantic features, one for higher resolution details, and one for light-weight attention. The combination of LR-ASPP and MobileNetV3 reduces the latency by over 35% on the high resolution Cityscapes Dataset.

MobileNet for Edge TPUs
The Edge TPU in Pixel 4 is similar in architecture to the Edge TPU in the Coral line of products, but customized to meet the requirements of key camera features in Pixel 4. The accelerator-aware AutoML approach substantially reduces the manual process involved in designing and optimizing neural networks for hardware accelerators. Crafting the neural architecture search space is an important part of this approach and centers around the inclusion of neural network operations that are known to improve hardware utilization. While operations such as squeeze-and-excite and swish non-linearity have been shown to be essential in building compact and fast CPU models, these operations tend to perform suboptimally on Edge TPU and hence are excluded from the search space. The minimalistic variants of MobileNetV3 also forgo the use of these operations (i.e., squeeze-and-excite, swish, and 5x5 convolutions) to allow easier portability to a variety of other hardware accelerators such as DSPs and GPUs.

The neural network architecture search, incentivized to jointly optimize the model accuracy and Edge TPU latency, produces the MobileNetEdgeTPU model that achieves lower latency for a fixed accuracy (or higher accuracy for a fixed latency) than existing mobile models such as MobileNetV2 and minimalistic MobileNetV3. Compared with the EfficientNet-EdgeTPU model (optimized for the Edge TPU in Coral), these models are designed to run at a much lower latency on Pixel 4, albeit at the cost of some loss in accuracy.

Although reducing the model’s power consumption was not a part of the search objective, the lower latency of the MobileNetEdgeTPU models also helps reduce the average Edge TPU power use. The MobileNetEdgeTPU model consumes less than 50% the power of the minimalistic MobileNetV3 model at comparable accuracy.
Left: Comparison of the accuracy on the ImageNet classification task between MobileNetEdgeTPU and other image classification networks designed for mobile when running on the Pixel 4 Edge TPU. MobileNetEdgeTPU achieves higher accuracy and lower latency compared with other models. Right: Average Edge TPU power in watts for different classification models running at 30 frames per second (fps).
Object Detection Using MobileNetEdgeTPU
The MobileNetEdgeTPU classification model also serves as an effective feature extractor for object detection tasks. Compared with MobileNetV2-based detection models, MobileNetEdgeTPU models offer a significant improvement in model quality (measured as the mean average precision, mAP) on the COCO14 minival dataset at comparable runtimes on the Edge TPU. The MobileNetEdgeTPU detection model has a latency of 6.6ms and achieves an mAP score of 24.3, while MobileNetV2-based detection models achieve an mAP of 22 and take 6.8ms per inference.

The Need for Hardware-Aware Models
While the results shown above highlight the power, performance, and quality benefits of MobileNetEdgeTPU models, it is important to note that the improvements arise due to the fact that these models have been customized to run on the Edge TPU accelerator.
When running on a mobile CPU, MobileNetEdgeTPU delivers inferior performance compared with models that have been tuned specifically for mobile CPUs (such as MobileNetV3). MobileNetEdgeTPU models perform a much greater number of operations, so it is not surprising that they run slower on mobile CPUs, which exhibit a more linear relationship between a model’s compute requirements and its runtime.
MobileNetV3 is still the best performing network when using a mobile CPU as the deployment target.
For Researchers and Developers
The MobileNetV3 and MobileNetEdgeTPU code, as well as both floating point and quantized checkpoints for ImageNet classification, are available on the MobileNet GitHub page. Open source implementations of MobileNetV3 and MobileNetEdgeTPU object detection are available in the TensorFlow Object Detection API. An open source implementation of MobileNetV3 semantic segmentation is available in TensorFlow through DeepLab.

Acknowledgements
This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Berkin Akin, Okan Arikan, Gabriel Bender, Bo Chen, Liang-Chieh Chen, Grace Chu, Eddy Hsu, John Joseph, Pieter-jan Kindermans, Quoc Le, Owen Lin, Hanxiao Liu, Yun Long, Ravi Narayanaswami, Ruoming Pang, Mark Sandler, Mingxing Tan, Vijay Vasudevan, Weijun Wang, Dong Hyuk Woo, Dmitry Kalenichenko, Yunyang Xiong, Yukun Zhu and support from Hartwig Adam, Blaise Agüera y Arcas, Chidu Krishnan and Steve Molloy.

Source: Google AI Blog


New Insights into Human Mobility with Privacy Preserving Aggregation



Understanding human mobility is crucial for predicting epidemics, planning urban and transit infrastructure, understanding people’s responses to conflict and natural disasters, and other important domains. Formerly, the state of the art in mobility data was based on cell carrier logs or location "check-ins", and was therefore available only in the limited areas where the telecom provider operates. As a result, cross-border movement and long-distance travel were typically not captured, because users tend not to use their SIM cards outside the country covered by their subscription plans and datasets are often bound to specific regions. Additionally, such measures involved considerable time lags and were available only within limited time ranges and geographical areas.

In contrast, de-identified aggregate flows of populations around the world can now be computed from phones' location sensors at a uniform spatial resolution. This metric has the potential to be extremely useful for urban planning since it can be measured in a direct and timely way. The use of de-identified and aggregated population flow data collected at a global level via smartphones could shed additional light on city organization, for example, while requiring significantly fewer resources than existing methods.

In “Hierarchical Organization of Urban Mobility and Its Connection with City Livability”, we show that the hierarchical organization of these mobility patterns — statistics on how populations move about in aggregate — is associated with higher use of public transportation, improved walkability, lower pollutant emissions per capita, and better health indicators, including easier accessibility to hospitals. This work, which appears in Nature Communications, contributes to a better characterization of city organization and supports a stronger quantitative perspective in the efforts to improve urban livability and sustainability.
Visualization of privacy-first computation of the mobility map. Individual data points are automatically aggregated together with differential privacy noise added. Then, flows of these aggregate and obfuscated populations are studied.
Computing a Global Mobility Map While Preserving User Privacy
In line with our AI principles, we have designed a method for analyzing population mobility with privacy-preserving techniques at its core. To ensure that no individual user’s journey can be identified, we create representative models of aggregate data by employing a technique called differential privacy, together with k-anonymity, to aggregate population flows over time. Initially implemented in 2014, this approach to differential privacy intentionally adds random “noise” to the data in a way that maintains both users' privacy and the data's accuracy at an aggregate level. We use this method to aggregate data collected from smartphones of users who have deliberately chosen to opt-in to Location History, in order to better understand global patterns of population movements.

The model only considers de-identified location readings aggregated to geographical areas of predetermined sizes (e.g., S2 cells). It "snaps" each reading into a spacetime bucket by discretizing time into longer intervals (e.g., weeks) and latitude/longitude into a unique identifier of the geographical area. Aggregating into these large spacetime buckets goes beyond protecting individual privacy — it can even protect the privacy of communities.
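The snippet below is a minimal sketch of this snapping step. The production pipeline uses S2 cells and its own interval lengths; here a simple latitude/longitude grid (cell_deg) and a weekly time interval stand in purely for illustration, and the function name spacetime_bucket is an assumption.

```python
from datetime import datetime, timezone

def spacetime_bucket(lat, lng, timestamp, cell_deg=0.05, interval_days=7):
    """Sketch only: snap a location reading into a coarse spacetime bucket."""
    # Discretize time into fixed-length intervals (e.g., weeks).
    epoch_days = int(timestamp.timestamp() // 86400)
    time_bucket = epoch_days // interval_days
    # Discretize latitude/longitude into a coarse grid-cell identifier
    # (a stand-in for the S2 cell id used in practice).
    cell_id = (round(lat / cell_deg), round(lng / cell_deg))
    return (time_bucket, cell_id)

# Two nearby readings on the same day fall into the same spacetime bucket.
b1 = spacetime_bucket(48.8566, 2.3522, datetime(2019, 10, 1, 9, tzinfo=timezone.utc))
b2 = spacetime_bucket(48.8570, 2.3530, datetime(2019, 10, 1, 18, tzinfo=timezone.utc))
assert b1 == b2
```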

Finally, for each pair of geographical areas, the system computes the relative flow between the areas over a given time interval, applies differential privacy filters, and outputs the global, anonymized, and aggregated mobility map. The dataset is generated only once, and only mobility flows involving a sufficiently large number of accounts are processed by the model. This design is limited to heavily aggregated flows of populations, such as those already used as a vital source of information for estimates of live traffic and parking availability, which protects individual data from being manually identified. The resulting map is indexed for efficient lookup and used to fuel the modeling described below.
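As a rough sketch of what applying differential privacy filters can look like, the code below adds Laplace noise to per-pair flow counts and drops flows below a minimum count. The epsilon, the threshold, and the function name aggregate_flows are illustrative assumptions and do not reflect the parameters of the production pipeline.

```python
from collections import Counter

import numpy as np

def aggregate_flows(transitions, epsilon=0.5, min_count=100, seed=0):
    """Sketch only. transitions: iterable of (origin_bucket, destination_bucket) pairs."""
    counts = Counter(transitions)
    rng = np.random.default_rng(seed)
    noisy_flows = {}
    for pair, count in counts.items():
        # Laplace mechanism: noise scale = sensitivity / epsilon.
        noisy = count + rng.laplace(loc=0.0, scale=1.0 / epsilon)
        # Keep only flows backed by a sufficiently large number of accounts.
        if noisy >= min_count:
            noisy_flows[pair] = noisy
    return noisy_flows
```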

Mobility Map Applications
Aggregate mobility of people in cities around the globe defines the city and, in turn, its impact on the people who live there. We define a metric, the flow hierarchy (Φ), derived entirely from the mobility map, that quantifies the hierarchical organization of cities. While hierarchies across cities have been extensively studied since Christaller’s work in the 1930s, for individual cities, the focus has been primarily on the differences between core and peripheral structures, as well as whether cities are mono- or poly-centric. Our results instead show that the reality is much richer than previously thought. The mobility map enables a quantitative demonstration that cities lie across a spectrum of hierarchical organization that strongly correlates with a series of important quality of life indicators, including health and transportation.

Below we see an example of two cities — Paris and Los Angeles. Though they have almost the same population size, those two populations move in very different ways. Paris is mono-centric, with an "onion" structure that has a distinct high-mobility city center (red), which progressively decreases as we move away from the center (in order: orange, yellow, green, blue). On the other hand, Los Angeles is truly poly-centric, with a large number of high-mobility areas scattered throughout the region.
Mobility maps of Paris (left) and Los Angeles (right). Both cities have similar population sizes, but very different mobility patterns. Paris has an "onion" structure exhibiting a distinct center with a high degree of mobility (red) that progressively decreases as we move away from the center (in order: orange, yellow, green, blue). In contrast, Los Angeles has a large number of high-mobility areas scattered throughout the region.
More hierarchical cities — in terms of flows being primarily between hotspots of similar activity levels — have values of flow hierarchy Φ closer to the upper limit of 1 and tend to have greater levels of uniformity in their spatial distribution of movements, wider use of public transportation, higher levels of walkability, lower pollution emissions, and better indicators of various measures of health. Returning to our example, the flow hierarchy of Paris is Φ=0.93 (in the top quartile across all 174 cities sampled), while that of Los Angeles is 0.86 (bottom quartile).

We find that existing measures of urban structure, such as population density and sprawl composite indices, correlate with flow hierarchy, but the flow hierarchy additionally conveys information about behavioral and socioeconomic factors.
Connecting flow hierarchy Φ with urban indicators in a sample of US cities. Proportion of trips as a function of Φ, broken down by mode share: private car, public transportation, and walking. Sample city names that appear in the plot: ATL (Atlanta), CHA (Charlotte), CHI (Chicago), HOU (Houston), LA (Los Angeles), MIN (Minneapolis), NY (New York City), and SF (San Francisco). We see that cities with higher flow hierarchy exhibit significantly higher rates of public transportation use, less car use, and more walkability.
Measures of urban sprawl require composite indices built up from much more detailed information on land use, population, density of jobs, and street geography among others (sometimes up to 20 different variables). In addition to the extensive data requirements, such metrics are also costly to obtain. For example, censuses and surveys require a massive deployment of resources in terms of interviews, and are only standardized at a country level, hindering the correct quantification of sprawl indices at a global scale. On the other hand, the flow hierarchy, being constructed from mobility information alone, is significantly less expensive to compile (involving only computer processing cycles), and is available in real-time.

Given the ongoing debate on the optimal structure of cities, the flow hierarchy introduces a different conceptual perspective compared to existing measures, and can shed new light on the organization of cities. From a public-policy point of view, we see that cities with a greater degree of mobility hierarchy tend to have more desirable urban indicators. Given that this hierarchy is a measure of proximity and direct connectivity between socioeconomic hubs, a possible direction could be to shape opportunity and demand in a way that facilitates a greater degree of hub-to-hub movement rather than a hub-to-spoke architecture. The proximity of hubs can be generated through appropriate land use, which can be shaped by data-driven zoning laws in terms of business, residence or service areas. The presence of efficient public transportation and lower use of cars is another important factor. Perhaps a combination of policies, such as congestion pricing used to disincentivize private transportation to socioeconomic hubs, along with building public transportation in a targeted fashion to directly connect the hubs, may well prove useful.

Next Steps
This work is part of our larger AI for Social Good efforts, a program that focuses Google's expertise on addressing humanitarian and environmental challenges. These mobility maps are only the first step toward making an impact in epidemiology, infrastructure planning, and disaster response, while ensuring high privacy standards.

The work discussed here goes to great lengths to ensure privacy is maintained. We are also working on newer techniques, such as on-device federated learning, to go a step further and enable computing aggregate flows without personal data leaving the device at all. By using distributed secure aggregation protocols or randomized responses, global flows can be computed without even the aggregator having knowledge of individual data points being aggregated. This technique has also been applied to help secure Chrome from malicious attacks.
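As an illustration of the randomized-response idea mentioned above, the sketch below has each device report a possibly flipped bit, so the aggregator can estimate a population rate without learning any individual's true value. The parameter p_truth and both function names are hypothetical choices for this example, not part of any production system.

```python
import random

def randomized_response(true_value: bool, p_truth: float = 0.75) -> bool:
    """Sketch only: report the truth with probability p_truth, else a coin flip."""
    if random.random() < p_truth:
        return true_value
    return random.random() < 0.5

def estimate_rate(reports, p_truth: float = 0.75) -> float:
    # E[report] = p_truth * true_rate + (1 - p_truth) * 0.5, so invert it.
    reported_rate = sum(reports) / len(reports)
    return (reported_rate - (1 - p_truth) * 0.5) / p_truth

# Example: estimate how often a (synthetic) population visited a given area.
population = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(v) for v in population]
print(estimate_rate(reports))  # close to 0.3, without trusting any single report
```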

Acknowledgements
This work resulted from a collaboration of Aleix Bassolas and José J. Ramasco from the Institute for Cross-Disciplinary Physics and Complex Systems (IFISC, CSIC-UIB), Brian Dickinson, Hugo Barbosa-Filho, Gourab Ghoshal, Surendra A. Hazarie, and Henry Kautz from the Computer Science Department and Ghoshal Lab at the University of Rochester, Riccardo Gallotti from the Bruno Kessler Foundation, and Xerxes Dotiwalla, Paul Eastham, Bryant Gipson, Onur Kucuktunc, Allison Lieber, Adam Sadilek at Google.

The differential privacy library used in this work is open source and available on our GitHub repo.

Source: Google AI Blog


Highlights from the 2019 Google AI Residency Program



This fall marks the successful conclusion to the fourth year of the Google AI Residency Program. Started in 2016 with 27 individuals in Mountain View, CA, the 12-month program has grown to nearly 100 residents from nine locations across the globe. Program participants have gone on to great success in PhD programs, academia, non-profits, and industry. Many have also become full-time Google researchers.

The program’s latest installment was our most successful yet, as residents advanced progress in a broad range of research fields, such as machine perception, algorithms and optimization, language understanding, healthcare and many more. Below are a handful of innovative projects from some of this year’s alumni.
  • A large-scale study on cross-lingual transfer in massive multilingual neural machine translation models (recently highlighted as part of this post), trained on billions of sentence pairs from more than 100 languages in order to significantly improve translation for both low- and high-resource languages.
    Visualization of the clustering of encoder representations of all modeled languages, based on representational similarity. Encoder representations of different languages cluster according to linguistic similarity. Languages are color-coded by their linguistic family.
  • A generative model for Scalable Vector Graphics (SVGs), which can be used to aid designers in generating fonts.
    Top: Unlike pixel representations of icons (right), in this case a "6", SVGs (left; middle) are scale-invariant representations. Bottom: By modelling SVGs directly, we can aid artists in quickly and intuitively iterating over typography designs.
  • A method to learn GANs using discrepancy divergence, a measure that accounts for both the loss function and hypothesis set to provide theoretical learning guarantees.
    As more generators are added to the DGAN ensemble, more modes in the real distribution are covered. From left to right: 1 generator, 5 generators, and 10 generators.
  • A likelihood ratio method for deep generative models that effectively corrects for confounding background statistics to improve out-of-distribution (OOD) detection, and a new benchmark dataset for OOD detection in genomics.
    Log-likelihood (left) and log likelihood-ratio (right) of each pixel for Fashion-MNIST. The likelihood is dominated by the “background” pixels, whereas the likelihood ratio focuses on the “semantic” pixels and is thus better for OOD detection.
  • A study showing when label smoothing helps, focusing on its impact on calibration of predictions, representations learned by the penultimate layer and effectiveness of knowledge distillation.
    2D-projection of representations of three CIFAR100 classes. Without label smoothing, examples are spread, but with label smoothing each example is encouraged to be equally distant to the clusters of the other classes, attenuating intra-class variation and inter-class similarity structure.
The successes of our AI residents go beyond academic publishing. Their achievements include:
  • Organizing a workshop, bringing together experts in theoretical physics and deep learning, to explore how tools from physics can shed light on the theory of deep learning.
  • Founding Queer in AI, an organization for fostering a community of queer researchers and raising awareness of queer issues in AI/ML.
  • Organizing a hands-on TensorFlow tutorial on using Deep Learning for Natural Language Processing.
  • Automatically learning neural net architectures with AdaNet, an open-source, TensorFlow-based framework.
  • Developing Coconet, the model behind the first AI-powered Doodle (created to celebrate renowned German composer and musician Johann Sebastian Bach).
Also, beginning with the next program cycle, residents will be hosted for a duration of 12 months, with the option of extending up to 18 months! This exciting shift comes as part of our effort to improve the overall program experience and outcomes for residents as the program continues to grow and scale.

If you are interested in joining our fifth cohort, applications for the 2020 Google AI Residency program are now open! Visit g.co/airesidency/apply for more information on how to apply. Please submit your application as soon as possible, as we will be considering candidates on a rolling basis. Please see g.co/airesidency for more resident profiles, past resident publications, blog posts and stories. We can’t wait to see where the next year will take us, and hope you’ll consider joining our research teams across the world!

Source: Google AI Blog


The Visual Task Adaptation Benchmark



Deep learning has revolutionized computer vision, with state-of-the-art deep networks learning useful representations directly from raw pixels, leading to unprecedented performance on many vision tasks. However, learning these representations from scratch typically requires hundreds of thousands of training examples. This burden can be reduced by using pre-trained representations, which have become widely available through services such as TensorFlow Hub (TF Hub) and PyTorch Hub. But their ubiquity can itself be a hindrance. For example, for the task of extracting features from images, there can be over 100 models from which to choose. It is hard to know which methods provide the best representations, since different sub-fields use different evaluation protocols, which do not always reflect the final performance on new tasks.

The overarching goal of representation research is to learn representations a single time on large amounts of generic data without the need to train them from scratch for each task, thus reducing data requirements across all vision tasks. But in order to reach that goal, the research community must have a uniform benchmark against which existing and future methods can be evaluated.

To address this problem, we are releasing "The Visual Task Adaptation Benchmark" (VTAB, available on GitHub), a diverse, realistic, and challenging representation benchmark based on one principle — a better representation is one that yields better performance on unseen tasks, with limited in-domain data. Inspired by benchmarks that have driven progress in other fields of machine learning (ML), such as ImageNet for natural image classification, GLUE for Natural Language Processing, and Atari for reinforcement learning, VTAB follows similar guidelines: (i) minimal constraints on solutions to encourage creativity; (ii) a focus on practical considerations; and (iii) challenging tasks for evaluation.

The Benchmark
VTAB is an evaluation protocol designed to measure progress towards general and useful visual representations, and consists of a suite of evaluation vision tasks that a learning algorithm must solve. These algorithms may use pre-trained visual representations to assist them and must satisfy only two requirements:
    i) They must not be pre-trained on any of the data (labels or input images) used in the downstream evaluation tasks.
    ii) They must not contain hardcoded, task-specific logic. Alternatively put, the evaluation tasks must be treated like a test set — unseen.
These constraints ensure that solutions that are successful when applied to VTAB will be able to generalize to future tasks.

The VTAB protocol begins with the application of an algorithm (A) to a number of independent tasks, drawn from a broad distribution of vision problems. The algorithm may be pre-trained on upstream data to yield a model that contains visual representations, but it must also define an adaptation strategy that consumes a small training set for each downstream task and returns a model that makes task-specific predictions. The algorithm’s final score is its average test score across tasks.
The VTAB protocol. Algorithm A is applied to many tasks T, drawn from a broad distribution of vision problems PT. In the example, pet classification, remote sensing, and maze localization are shown.
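The protocol can be summarized in a few lines of Python. This is only a sketch of the evaluation loop described above, not the benchmark implementation available on GitHub; the adapt and evaluate methods and the task attributes are assumed interfaces for illustration.

```python
def evaluate_vtab(algorithm, tasks, examples_per_task=1000):
    """Sketch of the VTAB evaluation loop; interfaces here are assumed."""
    scores = []
    for task in tasks:
        # Each downstream task exposes only a small in-domain training set.
        small_train_set = task.train_examples[:examples_per_task]
        # The algorithm's adaptation strategy returns a task-specific model.
        model = algorithm.adapt(small_train_set)
        # Score that model on the task's held-out test set.
        scores.append(model.evaluate(task.test_examples))
    # The final score is the average test score across all tasks.
    return sum(scores) / len(scores)
```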
VTAB includes 19 evaluation tasks that span a variety of domains, divided into three groups — natural, specialized, and structured. Natural image tasks include images of the natural world captured through standard cameras, representing generic objects, fine-grained classes, or abstract concepts. Specialized tasks utilize images captured using specialist equipment, such as medical images or remote sensing. The structured tasks often derive from artificial environments that target understanding of specific changes between images, such as predicting the distance to an object in a 3D scene (e.g., DeepMind Lab), counting objects (e.g., CLEVR), or detecting orientation (e.g., dSprites for disentangled representations).

While highly diverse, all of the tasks in VTAB share one common feature — people can solve them relatively easily after training on just a few examples. To assess algorithmic generalization to new tasks with limited data, performance is evaluated using only 1000 examples per task. Evaluation using the full dataset can be performed for comparison with previous publications.

Findings Using VTAB
We performed a large-scale study testing a number of popular visual representation learning algorithms against VTAB. The study included generative models (GANs and VAEs), self-supervised models, semi-supervised models and supervised models. All of the algorithms were pre-trained on the ImageNet dataset. We also compared each of these approaches using no pre-trained representations, i.e., training “from-scratch”. The figure below summarizes the main pattern of results.
Performance of different classes of representation learning algorithms across different task groups: natural, specialized and structured. Each bar shows the average performance of all methods in that class across all tasks in the group.
Overall we find that generative models do not perform as well as the other methods, even worse than from-scratch training. However, self-supervised models perform much better, significantly outperforming from-scratch training. Better still is supervised learning using the ImageNet labels. Interestingly, while supervised learning is significantly better on the Natural group of tasks, self-supervised learning is close on the other two groups whose domains are more dissimilar to ImageNet.

The best performing representation learning algorithm, of those we tested, is S4L, which combines both supervised and self-supervised pre-training losses. The figure below contrasts S4L with standard supervised ImageNet pre-training. S4L appears to improve performance particularly on the Structured tasks. However, representation learning yields a much smaller benefit over from-scratch training on groups other than the Natural tasks, indicating that much progress is still required to attain a universal visual representation.
Top: Performance of S4L versus from-scratch training. Each bar corresponds to a task. Positive-valued bars indicate tasks where S4L outperforms from-scratch. Negative bars indicate that from-scratch performed better. Bottom: S4L versus Supervised training on ImageNet. Positive bars indicate that S4L performs better. The bar colour indicates the task group: Red=Natural, Green=Specialized, Blue=Structured. We can see that additional self-supervision tends to help on structured tasks beyond just using ImageNet labels.
Summary
The code to run VTAB is available on GitHub, including the 19 evaluation datasets and exact data splits. Having a publicly available set of benchmarks ensures the reproducibility of results. Progress is tracked with the public leaderboard, and the models evaluated are uploaded to TF Hub for public use and reproduction. A shell script is provided to perform adaptation and evaluation on all the tasks, with a standardized evaluation protocol making VTAB readily accessible across the industry. VTAB can be executed on both TPUs and GPUs and is efficient to run: comparable results can be obtained with a single NVIDIA Tesla P100 accelerator in a few hours.

The Visual Task Adaptation Benchmark has helped us better understand which visual representations generalize to the broad spectrum of vision tasks, and provides direction for future research. We hope these resources are useful in driving progress toward general and practical visual representations, and, as a result, bring deep learning to the long tail of vision problems with limited labelled data.

Acknowledgements
The core team behind this work includes Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, and Sylvain Gelly.

Source: Google AI Blog