Improving the Effectiveness of Diabetic Retinopathy Models

Two years ago, we announced our inaugural work in training deep learning models for diabetic retinopathy (DR), a complication of diabetes that is one of the fastest-growing causes of vision loss. Based on this research, we set out to apply our technology to improve health outcomes in the world. At the same time, we’ve continued our efforts to improve the model’s performance, explainability, and applicability in clinical settings. Today, we are sharing our research progress toward these goals, as well as announcing a new partner in Thailand.

Improving Model Performance with High-quality Labels
The performance of DR deep learning models is critically important, especially when subtle errors have the potential to generate a misdiagnosis. Earlier this year we published a paper in the journal Ophthalmology that looked at how we could improve our model by 1) moving toward a more granular 5-point grading scale (versus the previous 2-class system) and 2) incorporating adjudication by a panel of retinal specialists. During the adjudication process, a group of retinal specialists debated any case with disagreement until everyone agreed on the final grade. Compared to simply taking a majority vote, this method of resolving disagreements was more accurate and allowed for the identification of subtle findings, such as microaneurysms.

To increase the efficiency of the adjudication process, we carefully selected a small subset (0.22%) of images to use as a tuning set, substantially improving model performance by optimizing model hyperparameters on this more accurate reference standard. When we subsequently measured the rate of agreement against a test set of images with an adjudicated reference standard, the kappa scores (a measurement of agreement that ranges from 0 [random] to 1 [perfect agreement]) for individual retinal specialists, ophthalmologists, and the algorithm ranged from 0.82-0.91, 0.80-0.84, and 0.84, respectively.
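As a rough illustration of what a kappa score measures, here is a minimal sketch of a chance-corrected agreement score between two graders. The post does not say which kappa variant was used; this sketch assumes quadratic weighting (a common choice for ordinal scales) and grades encoded as integers 0 to 4. The function name `weighted_kappa` is hypothetical.

```python
from collections import Counter

def weighted_kappa(rater_a, rater_b, num_grades=5):
    """Quadratic-weighted Cohen's kappa for two equal-length lists of integer grades."""
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (grade_a, grade_b)
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            # Assumption: penalty grows with the squared distance between grades.
            weight = (i - j) ** 2 / (num_grades - 1) ** 2
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * hist_a[i] * hist_b[j] / (n * n)
    if expected_disagreement == 0.0:
        return 1.0  # both raters always gave the same single grade
    return 1.0 - observed_disagreement / expected_disagreement
```

A score of 1 means the two graders always agree, while a score near 0 means they agree no more often than chance would predict from their individual grade distributions.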

Making our Models More Transparent
As we deploy this technology, it is important that we take the proper steps to ensure that it is transparent and trusted. To that end, we have been exploring ways to explain how the model is making its predictions, with the goal of making the DR model a better diagnostic tool and aid for doctors.

In our latest study, to be published today in Ophthalmology, we demonstrate methods by which explanations of deep learning algorithms can be shown to ophthalmologists to increase both the accuracy and confidence of their grading for diabetic eye disease. Using the results of the model trained and validated on high quality labels from our earlier study, we generated different forms of potential assistance for general ophthalmologists. We presented to the physicians the algorithm’s predicted scores for different DR severity levels as well as heatmaps highlighting image regions that most strongly drove its predictions. Using this assistance, we saw a significant increase in physicians’ diagnostic accuracy, as well as improved confidence in their diagnosis.

We saw clear evidence that showing model predictions could help physicians catch pathology they otherwise might have missed. In the retinal image below, our adjudication panel found signs of vision-threatening DR. This was missed by 2 of 3 doctors who graded it without assistance, but caught by all 3 doctors who graded it when they saw the model predictions (which accurately detected the pathology).
On the left is a fundus image graded as having proliferative (vision-threatening) DR by an adjudication panel of ophthalmologists (ground truth). On the top right is an illustration of our deep learning model’s predicted scores (“P” = proliferative, the most severe form of DR). On the bottom right is the set of grades given by physicians without assistance (“Unassisted”) and those who saw the model’s predictions (“Grades Only”).
We also saw evidence that physicians and the model can work together in a way that provides more accuracy than either individually. In the retinal image below, our adjudication panel of retina specialists considered it to have moderate DR. Without assistance, two out of three ophthalmologists grading the image marked it as no DR. In real-world settings, this situation could result in a patient missing a needed referral to a specialist.
On the left is a retinal fundus image graded as having moderate DR (“Mo”) by an adjudication panel of ophthalmologists (ground truth). On the top right is an illustration of the predicted scores (“N” = no DR, “Mi” = Mild DR, “Mo” = Moderate DR) from the model. On the bottom right is the set of scores given by physicians without assistance (“Unassisted”) and those who saw the model’s predictions (“Grades Only”).
In this particular case, our model also indicated evidence for no DR. However, when ophthalmologists saw the model’s predictions, all three gave the correct answer. Seeing that the model saw some evidence for Moderate -- even if it wasn’t the highest score -- may prompt doctors to examine particular cases more carefully for pathology they may otherwise miss. We are excited to develop assistance that works like this, where human and machine learning abilities complement each other.

A New Partner in our Global Efforts
With the help of screening programs and in collaboration with Verily, we have laid a robust foundation for the implementation of these highly accurate systems in real world clinical settings. Working with doctors at Aravind Eye Hospitals and Sankara Nethralaya in India, and now, through our new partnership with the Rajavithi Hospital, affiliated with the Department of Medical Services, Ministry of Public Health in Thailand, we are validating the model performance with patients from broad screening programs. Given the positive results of our model on their real patient population, we are now beginning to pilot the model in their screening programs. We’re looking forward to a very busy 2019!

Source: Google AI Blog

Grasp2Vec: Learning Object Representations from Self-Supervised Grasping

From a remarkably young age, people are capable of recognizing their favorite objects and picking them up, despite never being explicitly taught how to do so. According to cognitive developmental research, the ability to interact with objects in the world plays a crucial role in the emergence of object perception and manipulation capabilities, such as targeted grasping. By interacting with the world around them, people are able to learn with self-supervision: we know what actions we took, and we learn from the outcome. In robotics, this type of self-supervised learning is actively researched because it enables robotic systems to learn without the need for large amounts of training data or manual supervision.

Inspired by the concept of object permanence, we propose Grasp2Vec, a simple yet highly effective algorithm for acquiring object representations. Grasp2Vec is based on the intuition that an attempt to pick up anything provides several pieces of information: if a robot grasps an object and holds it up, the object had to be in the scene before the grasp. Furthermore, the robot knows that the object it grasped is currently in its gripper, and therefore has been removed from the scene. By using this form of self-supervision, the robot can learn to recognize the object by the visual change in the scene after the grasp.
Building on our prior collaboration with X Robotics, where a series of robots learn in parallel to grasp household objects using only monocular camera inputs, we use a robotic arm to grasp objects “unintentionally”, and that experience enables the learning of a rich representation of objects. These representations can then be used to acquire “intentional grasping” capabilities, where the robot arm can then pick up user-commanded objects.
Constructing a Perceptual Reward Function
In the framework of reinforcement learning (RL), task success is measured via a “reward function”. By maximizing that reward, robots can teach themselves diverse grasping skills from scratch. Engineering a reward function is easy when success can be measured by simple sensor measurements. A simple example of this is a button that supplies rewards directly to a robot when it is pushed.

However, engineering a reward function is much more difficult when our success criterion depends on perceptual understanding of the task at hand. Consider the task of instance grasping, where a robot is presented a picture of a desired object being held in the gripper. After the robot attempts to grasp that object, it inspects the contents of the gripper. The reward function for this task comes down to answering the question of object recognition: Do these objects match?
On the left, the gripper is holding the brush and there are some objects (yellow cup, blue plastic block) in the background. On the right, the gripper is holding the yellow cup and the brush is in the background. If the left image was the desired outcome, a good reward function should “understand” that the two images above correspond to different objects.
In order to solve this recognition problem, we need a perception system that extracts meaningful object concepts from unstructured image data (without any human annotations), learning the visual perception of objects in an unsupervised fashion. At their core, unsupervised learning algorithms work because they make structural assumptions about data. It is common to assume that images can be compressed into a low-dimensional space, and that frames in a video can be predicted from previous frames. However, without further assumptions on the content of the data, these are usually insufficient for learning disentangled object representations.

What if we used a robot to physically disentangle objects from each other during data collection? The field of robotics presents an exciting opportunity for representation learning because robots can manipulate objects, thus providing the factors of variation needed in data. Our method relies on the insight that grasping an object removes it from the scene. This yields 1) an image of the scene before grasping, 2) an image of the scene after grasping and 3) an isolated view of the grasped object itself.
Left: Objects before the grasp. Center: Objects after the grasp. Right: The Grasped object.
If we then consider an embedding function that extracts “the set of objects” from images, it should preserve the following subtractive relation:
objects_before_grasp - objects_after_grasp = grasped_object
We implement this equality relation using a fully convolutional architecture and a simple metric learning algorithm. At training time, the architecture shown below embeds the pre-grasp images and post-grasp images into a dense spatial feature map. The maps are mean-pooled into vectors and the difference between the “before grasp” and “after grasp” vectors represents a set of objects. This vector and the corresponding vector representation of the grasped object are pushed to equivalence via the N-Pairs objective.
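To make the subtractive relation concrete, here is a minimal sketch of a one-directional N-pairs style loss on toy vectors. It assumes the pooled embeddings are already computed (the convolutional network is omitted), and the vectors, helper names, and batch layout are all hypothetical; the exact form of the objective used in the paper may differ, for example by being symmetrized.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

def n_pairs_loss(anchors, positives):
    """Mean cross-entropy of matching each anchor to its own positive within the batch."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_norm - logits[i]
    return total / len(anchors)

# Toy mean-pooled embeddings for a batch of two grasps (made-up numbers).
before = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
after = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
grasped = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# objects_before_grasp - objects_after_grasp should match grasped_object.
anchors = [subtract(b, a) for b, a in zip(before, after)]
loss_matched = n_pairs_loss(anchors, grasped)
loss_swapped = n_pairs_loss(anchors, grasped[::-1])
```

The loss is lower when each difference vector is paired with the object that was actually grasped than when the pairing is swapped, which is the signal that pushes the two representations toward equivalence.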
Once trained, two useful properties emerge naturally from our model.

1. Object Similarity
The first property is that a cosine distance between vector embeddings allows us to compare objects and determine whether they are identical. This can be used to implement reward functions for reinforcement learning, and allow robots to learn instance grasping without human-provided labels.
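One way such a similarity-based reward could look is sketched below. The threshold value and the function names are assumptions for illustration only; the post does not specify how the similarity is turned into a reward.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def instance_grasp_reward(goal_embedding, outcome_embedding, threshold=0.9):
    """Return 1.0 if the grasped object looks like the goal object, else 0.0.

    The 0.9 threshold is a hypothetical value chosen for this example.
    """
    similarity = cosine_similarity(goal_embedding, outcome_embedding)
    return 1.0 if similarity >= threshold else 0.0
```

Here `goal_embedding` would be the embedding of the picture of the desired object, and `outcome_embedding` the embedding of what ended up in the gripper.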
2. Localizing Target Objects
The second property is that we can combine scene spatial maps and object embeddings to localize a “query object” in image space. By taking the element-wise product of spatial feature maps and the vector corresponding to the query object, we can find all the pixels in the spatial map that “match” the query object.
Using Grasp2Vec embeddings to localize objects in a scene. The image on the top left shows the objects in the bin. On the bottom left is the query object we wish to grasp. By taking the dot product of the query object vector with the spatial features of the scene image, we get a per-pixel “activation map” (top right image) of how similar that region of the image is to the query. This response map can be used to approach the object for grasping.
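The per-pixel activation map described in the caption can be sketched as a dot product between the query vector and the feature vector at every spatial location. This is a minimal illustration on nested lists, assuming a feature map of shape height x width x depth; the function names are hypothetical.

```python
def activation_map(spatial_features, query):
    """Dot product of the query vector with the feature vector at each location."""
    return [
        [sum(f * q for f, q in zip(cell, query)) for cell in row]
        for row in spatial_features
    ]

def argmax_location(heatmap):
    """Return the (row, column) of the strongest response in the heatmap."""
    best_value, best_row, best_col = max(
        (value, r, c)
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
    )
    return best_row, best_col
```

For a query built from several objects, the same functions apply once the query vector is replaced by the average of the individual object vectors.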
Our method also works when there are multiple objects that match the query object, or even if the query consists of multiple objects (the average of two vectors). For example, here is a scenario where it detects multiple orange blocks in a scene.
The resulting “heatmap” can be used to plan the robot approach to the target object(s). We combine Grasp2Vec’s localization and instance recognition capabilities with our “grasp anything” policies to obtain a success rate of 80% on objects seen during data collection and 59% on novel objects the robot hasn’t encountered before.

In our paper, we show how robotic grasping skills can generate the data used for learning object-centric representations. We then can use representation learning to “bootstrap” more complex skills like instance grasping, all while retaining the self-supervised learning properties of our autonomous grasping system.

Besides our own work, a number of recent papers have also studied how self-supervised interaction can be used to acquire representations, by grasping, pushing, and otherwise manipulating objects in the environment. Going forward, we are excited not only for what machine learning can bring to robotics by way of better perception and control, but also what robotics can bring to machine learning in new paradigms of self-supervision.

This research was conducted by Eric Jang, Coline Devin, Vincent Vanhoucke, and Sergey Levine. We’d like to thank Adrian Li, Alex Irpan, Anthony Brohan, Chelsea Finn, Christian Howard, Corey Lynch, Dmitry Kalashnikov, Ian Wilkes, Ivonne Fajardo, Julian Ibarz, Ming Zhao, Peter Pastor, Pierre Sermanet, Stephen James, Tsung-Yi Lin, Yunfei Bai, and many others at Google, X, and the broader robotics community who contributed to improving this work.

Source: Google AI Blog

Providing Gender-Specific Translations in Google Translate

Over the past few years, Google Translate has made significant improvements to translation quality by switching to an end-to-end neural network-based system. At the same time, we realized that translations from our models can reflect societal biases, such as gender bias. Specifically, languages differ a lot in how they represent gender, and when there are ambiguities during translation, the systems tend to pick gender choices that reflect societal asymmetries, resulting in biased translations. For instance, Google Translate historically translated the Turkish equivalent of “He/she is a doctor” into the masculine form, and the Turkish equivalent of “He/she is a nurse” into the feminine form.

Recently, we announced that we’re taking the first step at reducing gender bias in our translations. We now provide both feminine and masculine translations when translating single-word queries from English to four different languages (French, Italian, Portuguese, and Spanish), and when translating phrases and sentences from Turkish to English.
Gender-specific translations on the Google Translate website.
Supporting gender-specific translations for single-word queries involved enriching our underlying dictionary with gender attributes. Supporting gender-specific translations for longer queries (phrases and sentences) was particularly challenging and involved making significant changes to our translation framework. For these longer queries, we focused initially on Turkish-to-English translation. We developed a three-step approach to solve the problem of providing a masculine and feminine translation in English for a gender-neutral query in Turkish.
Detecting Gender-Neutral Queries
Many Turkish sentences that refer to people are gender-neutral, but not all are. Detecting which queries are eligible for gender-specific translations is a hard problem because Turkish is morphologically complex, meaning that reference to a person can either be explicit with a gender-neutral pronoun (e.g. O, Ona) or implicitly encoded. For example, the sentence “Biliyor mu?” has no explicit gender-neutral pronoun but can be translated as either “Does she know?” or “Does he know?”. This complexity means that we cannot use a simple list of gender-neutral pronouns to detect gender-neutral Turkish queries and need a machine-learned system. We estimate that approximately 10% of Turkish Translate queries are ambiguous, and eligible for both feminine and masculine translations.

To detect these queries, we use state-of-the-art text classification algorithms (same as those used in our Cloud Natural Language API) to build a system that is able to detect when a given Turkish query is gender-neutral. Since this introduces an additional step before obtaining the translations, we had to carefully balance model complexity with latency. We trained our system on thousands of human-rated Turkish examples, where raters were asked to judge whether a given example is gender-neutral or not. Our final classification system is a convolutional neural network that can accurately detect queries which require gender-specific translations.

Generating Gender-Specific Translations
Next, we enhanced our underlying Neural Machine Translation (NMT) system to produce feminine and masculine translations when requested. When no gender is requested, we trained the model to produce the default translation. This involved:
  • Identifying and dividing our parallel training data into those with feminine words, those with masculine and those with ungendered words.
  • Adding an additional input token to the beginning of the sentence to specify the required gender to translate to, similar to how we build multilingual NMT systems:
    • <2MALE> O bir doktor → He is a doctor
    • <2FEMALE> O bir doktor → She is a doctor
  • Training our enhanced NMT model on the feminine, masculine and ungendered data sources. We experimented with various mixing ratios for these sources to enable the model to perform equally well on the three tasks.
If a user's query is determined to be gender-neutral, we add a gender prefix to the translation request. For these requests, our final NMT model can reliably produce feminine and masculine translations 99% of the time. Additionally, the system maintains translation quality on queries without the gender prefix.
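The prefixing step can be sketched as follows. The tokens `<2MALE>` and `<2FEMALE>` come from the examples above; the function name, the dictionary keys, and the boolean flag standing in for the classifier output are assumptions made for this illustration.

```python
GENDER_TOKENS = {"masculine": "<2MALE>", "feminine": "<2FEMALE>"}

def build_requests(query, is_gender_neutral):
    """Return the model inputs to send for one query, keyed by request type."""
    requests = {"default": query}  # the ungendered request carries no prefix
    if is_gender_neutral:
        for gender, token in GENDER_TOKENS.items():
            requests[gender] = f"{token} {query}"
    return requests
```

For example, `build_requests("O bir doktor", True)` yields a default request plus one masculine and one feminine request, each with its token prepended to the sentence.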

Checking for Accuracy
Finally, we have a step that decides whether to display the gender-specific translations. Since the training data that produces the masculine translation is different from the training data that produces the feminine translation, there may be differences between the two translations unrelated to gender. If the gender-specific translations are determined to be low quality, we show only the single default translation. To determine the quality of the gender-specific translations, we verify:
  • If the requested feminine translation is feminine.
  • If the requested masculine translation is masculine.
  • If the feminine and masculine translations are exactly equivalent with the exception of gender-related changes. Even minor changes in the wording between the translations will result in being filtered.
Top: The masculine and feminine translations differ only with respect to gender i.e. “he” and “his” vs “she” and “her”. Hence, we will show gender-specific translations. Bottom: The masculine and feminine translations differ correctly with respect to gender i.e. “he” vs “she”. However, the change from “really” to “actually” is not related to gender. Hence, we will filter gender-specific translations and display the default translation.
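The third check can be approximated with a token-by-token comparison. This is only a sketch: the word pairs below are a small hypothetical list, the comparison ignores punctuation and casing subtleties, and the production system is not described in enough detail to reproduce.

```python
# Hypothetical mapping from masculine English words to their feminine counterparts.
GENDER_PAIRS = {"he": "she", "his": "her", "him": "her", "himself": "herself"}

def differs_only_by_gender(masculine, feminine):
    """True if the two translations are identical apart from gendered word swaps."""
    m_tokens = masculine.lower().split()
    f_tokens = feminine.lower().split()
    if len(m_tokens) != len(f_tokens):
        return False
    for m_word, f_word in zip(m_tokens, f_tokens):
        if m_word == f_word:
            continue
        if GENDER_PAIRS.get(m_word) != f_word:
            return False  # a difference unrelated to gender: filter this pair
    return True
```

With made-up sentences, "he loves his dog" and "she loves her dog" pass the check, while "he is really tired" and "she is actually tired" fail it because of the unrelated change from "really" to "actually".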
Putting it all together, input sentences first go through the classifier, which detects whether they’re eligible for gender-specific translations. If the classifier says “yes”, we send three requests to our enhanced NMT model—a feminine request, a masculine request and an ungendered request. Our final step takes into account all three responses and decides whether to display gender-specific translations or a single default translation. This step is still quite conservative in order to maximize the quality of gender-specific translations shown; hence our overall recall is only around 60%. We plan to increase our coverage and add support for more complex sentences in future iterations.

This is just the first step toward addressing gender bias in machine-translation systems and reiterates Google’s commitment to fairness in machine learning. In the future, we plan to extend gender-specific translations to more languages and to address non-binary gender in translations.

This effort has been successful thanks to the hard work of a lot of people including, but not limited to, the following (in alphabetical order of last name): Lindsey Boran, HyunJeong Choe, Héctor Fernández Alcalde, Orhan Firat, Qin Gao, Rick Genter, Macduff Hughes, Tolga Kayadelen, James Kuczmarski, Tatiana Lando, Liu Liu, Michael Mandl, Nihal Meriç Atilla, Mengmeng Niu, Adnan Ozturel, Emily Pitler, Kathy Ray, John Richardson, Larissa Rinaldi, Alex Rudnick, Apu Shah, Jason Smith, Antonio Stella, Romina Stella, Jana Strnadova, Katrin Tomanek, Barak Turovsky, Dan Schwarz, Shilp Vaishnav, Clayton Watts, Kellie Webster, Colin Young, Pendar Yousefi, Candice Zhang and Min Zhao.

Source: Google AI Blog

Adding Diversity to Images with Open Images Extended

Recently, we introduced the Inclusive Images Kaggle competition, part of the NeurIPS 2018 Competition Track, with the goal of stimulating research into the effect of geographic skews in training datasets on ML model performance, and to spur innovation in developing more inclusive models. While the competition has concluded, the broader movement to build more diverse datasets is just beginning.

Today, we’re announcing Open Images Extended, a new branch of Google’s Open Images dataset, which is intended to be a collection of complementary datasets with additional images and/or annotations that better represent global diversity. The first set we are adding is the Crowdsourced extension which is seeded with 478K+ images donated by Crowdsource app users from all around the world.

About the Crowdsourced Extension of Open Images Extended
To bring greater geographic diversity to Open Images, we enabled the global community of Crowdsource app users to photograph the world around them and make their photos available to researchers and developers as part of the Open Images Extended dataset. A large majority of these images are from India, with some representation from the Middle East, Africa and Latin America.

The images focus on some key categories like household objects, plants & animals, food, and people in various professions (all faces are blurred to protect privacy). Detailed information about the composition of the dataset can be found here.
Pictures from India and Singapore contributed using the Crowdsource app.
Get Involved
This is an early step on a long journey. To build inclusive ML products, training data must represent global diversity along several dimensions. To that end, we invite the global community to help expand the Open Images Extended dataset by contributing imagery from your own hometown and community. Download the Crowdsource Android app to contribute images you’ve taken with your phone, or contact us if there are other image repositories (that you have the rights for) that you’re interested in adding to the Open Images dataset.

The release of Open Images Extended has been possible thanks to the hard work of a lot of people including, but not limited to, the following (in alphabetical order of last name): James Atwood, Pallavi Baljekar, Peggy Chi, Tulsee Doshi, Tom Duerig, Vittorio Ferrari, Akshay Gaur, Victor Gomes, Yoni Halpern, Gursheesh Kaur, Mahima Pushkarna, Jigyasa Saxena, D. Sculley, Richa Singh, and Rachelle Summers.

Source: Google AI Blog

TF-Ranking: A Scalable TensorFlow Library for Learning-to-Rank

Ranking, the process of ordering a list of items in a way that maximizes the utility of the entire list, is applicable in a wide range of domains, from search engines and recommender systems to machine translation, dialogue systems and even computational biology. In applications like these (and many others), researchers often utilize a set of supervised machine learning techniques called learning-to-rank. In many cases, these learning-to-rank techniques are applied to datasets that are prohibitively large, scenarios where the scalability of TensorFlow could be an advantage. However, there is currently no out-of-the-box support for applying learning-to-rank techniques in TensorFlow. To the best of our knowledge, there are also no other open source libraries that specialize in applying learning-to-rank techniques at scale.

Today, we are excited to share TF-Ranking, a scalable TensorFlow-based library for learning-to-rank. As described in our recent paper, TF-Ranking provides a unified framework that includes a suite of state-of-the-art learning-to-rank algorithms, and supports pairwise or listwise loss functions, multi-item scoring, ranking metric optimization, and unbiased learning-to-rank.

TF-Ranking is fast and easy to use, and creates high-quality ranking models. The unified framework gives ML researchers, practitioners and enthusiasts the ability to evaluate and choose among an array of different ranking models within a single library. Moreover, we strongly believe that a key to a useful open source library is not only providing sensible defaults, but also empowering our users to develop their own custom models. Therefore, we provide flexible APIs, within which the users can define and plug in their own customized loss functions, scoring functions and metrics.

Existing Algorithms and Metrics Support
The objective of learning-to-rank algorithms is minimizing a loss function defined over a list of items to optimize the utility of the list ordering for any given application. TF-Ranking supports a wide range of standard pointwise, pairwise and listwise loss functions as described in prior work. This ensures that researchers using the TF-Ranking library are able to reproduce and extend previously published baselines, and practitioners can make the most informed choices for their applications. Furthermore, TF-Ranking can handle sparse features (like raw text) through embeddings and scales to hundreds of millions of training instances. Thus, anyone who is interested in building real-world data intensive ranking systems such as web search or news recommendation, can use TF-Ranking as a robust, scalable solution.
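To give a feel for the difference between pairwise and listwise losses, here is a minimal plain-Python sketch of one example of each. These are textbook formulations written for illustration, not the TF-Ranking implementations, and the function names are hypothetical.

```python
import math

def pairwise_logistic_loss(scores, labels):
    """Average log(1 + exp(-(s_i - s_j))) over pairs where item i is more relevant than item j."""
    total, num_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                num_pairs += 1
    return total / num_pairs if num_pairs else 0.0

def softmax_listwise_loss(scores, labels):
    """Cross-entropy between the labels and a softmax over the whole list of scores."""
    top = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return -sum(label * (s - log_norm) for s, label in zip(scores, labels))
```

A pointwise loss, by contrast, would compare each score with its own label independently, for instance with a squared error per item.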

Empirical evaluation is an important part of any machine learning or information retrieval research. To ensure compatibility with prior work, we support many of the commonly used ranking metrics, including Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). We also make it easy to visualize these metrics at training time on TensorBoard, an open source TensorFlow visualization dashboard.
An example of the NDCG metric (Y-axis) along the training steps (X-axis) displayed in the TensorBoard. It shows the overall progress of the metrics during training. Different methods can be compared directly on the dashboard. Best models can be selected based on the metric.
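For readers unfamiliar with these metrics, the following sketch computes the reciprocal rank and NDCG for a single ranked list of relevance labels. It assumes the common exponential gain, 2^relevance - 1, and a base-2 logarithmic discount; other conventions exist, and TF-Ranking's own metric code is not reproduced here.

```python
import math

def reciprocal_rank(relevances):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, relevance in enumerate(relevances, start=1):
        if relevance > 0:
            return 1.0 / rank
    return 0.0

def dcg(relevances):
    """Discounted cumulative gain of a list given in ranked order."""
    return sum(
        (2 ** relevance - 1) / math.log2(rank + 1)
        for rank, relevance in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Mean Reciprocal Rank is then the average of `reciprocal_rank` over all queries in the evaluation set.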
Multi-Item Scoring
TF-Ranking supports a novel scoring mechanism wherein multiple items (e.g., web pages) can be scored jointly, an extension of the traditional scoring paradigm in which single items are scored independently. One challenge in multi-item scoring is inference, where items have to be grouped and scored in subgroups; the scores are then accumulated per item and used for sorting. To make these complexities transparent to the user, TF-Ranking provides a List-In-List-Out (LILO) API to wrap all this logic in the exported TF models.
The TF-Ranking library supports multi-item scoring architecture, an extension of traditional single-item scoring.
As we demonstrate in recent work, multi-item scoring is competitive in its performance to the state-of-the-art learning-to-rank models such as RankNet, MART, and LambdaMART on a public LETOR benchmark.
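The group-score-accumulate-sort flow of multi-item scoring can be sketched as below. How TF-Ranking actually forms subgroups is not described in this post, so enumerating all combinations of a fixed size is an assumption, as are the function names and the idea of averaging accumulated scores.

```python
from itertools import combinations

def multi_item_rank(items, group_score_fn, group_size=2):
    """Score items jointly in subgroups, accumulate per-item scores, then sort."""
    totals = [0.0] * len(items)
    counts = [0] * len(items)
    for group in combinations(range(len(items)), group_size):
        # group_score_fn sees the whole subgroup and returns one score per member.
        scores = group_score_fn([items[i] for i in group])
        for index, score in zip(group, scores):
            totals[index] += score
            counts[index] += 1
    averaged = [t / c if c else 0.0 for t, c in zip(totals, counts)]
    order = sorted(range(len(items)), key=lambda i: averaged[i], reverse=True)
    return [items[i] for i in order]

def relative_to_group_mean(group):
    """Toy joint scorer: each item's score depends on the other items in its group."""
    mean = sum(group) / len(group)
    return [value - mean for value in group]
```

Calling `multi_item_rank([1.0, 3.0, 2.0], relative_to_group_mean)` takes a list in and returns the reordered list, which is the spirit of a list-in-list-out interface.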

Ranking Metric Optimization
An important research challenge in learning-to-rank is direct optimization of ranking metrics (such as the previously mentioned NDCG and MRR). These metrics, while being able to measure the performance of ranking systems better than the standard classification metrics like Area Under the Curve (AUC), have the unfortunate property of being either discontinuous or flat. Therefore standard stochastic gradient descent optimization of these metrics is problematic.

In recent work, we proposed a novel method, LambdaLoss, which provides a principled probabilistic framework for ranking metric optimization. In this framework, metric-driven loss functions can be designed and optimized by an expectation-maximization procedure. The TF-Ranking library integrates the recent advances in direct metric optimization and provides an implementation of LambdaLoss. We are hopeful that this will encourage and facilitate further research advances in the important area of ranking metric optimization.

Unbiased Learning-to-Rank
Prior research has shown that given a ranked list of items, users are much more likely to interact with the first few results, regardless of their relevance. This observation has inspired research interest in unbiased learning-to-rank, and led to the development of unbiased evaluation and several unbiased learning algorithms, based on re-weighting training instances. In the TF-Ranking library, metrics are implemented to support unbiased evaluation and losses are implemented for unbiased learning by natively supporting re-weighting to overcome the inherent biases in user interaction datasets.
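One widely used form of instance re-weighting is inverse propensity weighting, sketched below. The propensity values (the estimated probability that a user examines each position) are made-up inputs, and the function names are hypothetical; this is an illustration of the idea rather than TF-Ranking's API.

```python
def inverse_propensity_weights(click_positions, propensity_by_position):
    """Weight each clicked item by 1 / P(position was examined)."""
    return [1.0 / propensity_by_position[p] for p in click_positions]

def weighted_mean_loss(losses, weights):
    """Average per-example losses, counting each example in proportion to its weight."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)
```

A click at a rarely examined position receives a large weight, which compensates for the fact that relevant items shown lower in the list are seen, and therefore clicked, less often.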

Getting Started with TF-Ranking
TF-Ranking implements the TensorFlow Estimator interface, which greatly simplifies machine learning programming by encapsulating training, evaluation, prediction and export for serving. TF-Ranking is well integrated with the rich TensorFlow ecosystem. As described above, you can use TensorBoard to visualize ranking metrics like NDCG and MRR, as well as to pick the best model checkpoints using these metrics. Once your model is ready, it is easy to deploy it in production using TensorFlow Serving.

If you’re interested in trying TF-Ranking for yourself, please check out our GitHub repo, and walk through the tutorial examples. TF-Ranking is an active research project, and we welcome your feedback and contributions. We are excited to see how TF-Ranking can help the information retrieval and machine learning research communities.

This project was only possible thanks to the members of the core TF-Ranking team: Rama Pasumarthi, Cheng Li, Sebastian Bruch, Nadav Golbandi, Stephan Wolf, Jan Pfeifer, Rohan Anil, Marc Najork, Patrick McGregor and Clemens Mewald. We thank the members of the TensorFlow team for their advice and support: Alexandre Passos, Mustafa Ispir, Karmel Allison, Martin Wicke, and others. Finally, we extend our special thanks to our collaborators, interns and early adopters: Suming Chen, Zhen Qin, Chirag Sethi, Maryam Karimzadehgan, Makoto Uchida, Yan Zhu, Qingyao Ai, Brandon Tran, Donald Metzler, Mike Colagrosso, and many others at Google who helped in evaluating and testing the early versions of TF-Ranking.

Source: Google AI Blog

The NeurIPS 2018 Test of Time Award: The Trade-Offs of Large Scale Learning

Progress in machine learning (ML) is happening so rapidly that it can sometimes feel like any idea or algorithm more than 2 years old is already outdated or superseded by something better. However, old ideas sometimes remain relevant even when a large fraction of the scientific community has turned away from them. This is often a question of context: an idea which may seem to be a dead end in a particular context may become wildly successful in a different one. In the specific case of deep learning (DL), the growth of both the availability of data and computing power renewed interest in the area and significantly influenced research directions.

The NIPS 2008 paper “The Trade-Offs of Large Scale Learning” by Léon Bottou (then at NEC Labs, now at Facebook AI Research) and Olivier Bousquet (Google AI, Zürich) is a good example of this phenomenon. As the recent recipient of the NeurIPS 2018 Test of Time Award, this seminal work investigated the interplay between data and computation in ML, showing that if one is limited by computing power but can make use of a large dataset, it is more efficient to perform a small amount of computation on many individual training examples rather than to perform extensive computation on a subset of the data. This demonstrated the power of an old algorithm, stochastic gradient descent, which is nowadays used in pretty much all applications of DL.

Optimization and the Challenge of Scale
Many ML algorithms can be thought of as the combination of two main ingredients:
  • A model, which is a set of possible functions that will be used to fit the data.
  • An optimization algorithm which specifies how to find the best function in that set.
Back in the ’90s, the datasets used in ML were much smaller than the ones in use today, and while artificial neural networks had already led to some successes, they were considered hard to train. In the early 2000s, with the introduction of kernel machines (SVMs in particular), neural networks went out of fashion. Simultaneously, attention shifted away from the optimization algorithms that had been used to train neural networks (stochastic gradient descent) toward those used for kernel machines (quadratic programming). One important difference is that in the former case, training examples are used one at a time to perform gradient steps (this is called “stochastic”), while in the latter case, all training examples are used at each iteration (this is called “batch”).
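The stochastic/batch distinction can be made concrete with a small NumPy sketch. It assumes a toy least-squares problem (the data, weights and function names are hypothetical, chosen only for illustration): the batch gradient touches every example, while the stochastic gradient touches one, and averaging the per-example gradients recovers the batch gradient.

```python
import numpy as np


def batch_gradient(w, X, y):
    """Gradient of the mean squared error over ALL examples (one batch step)."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def stochastic_gradient(w, x_i, y_i):
    """Gradient of the squared error on ONE example (one stochastic step)."""
    return 2.0 * x_i * (x_i @ w - y_i)


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # toy features
y = X @ np.array([1.0, -2.0, 0.5])      # toy targets from hypothetical weights

w = np.zeros(3)
g_batch = batch_gradient(w, X, y)
# Averaging the single-example gradients over the dataset gives the batch gradient.
g_mean_of_stochastic = np.mean(
    [stochastic_gradient(w, X[i], y[i]) for i in range(len(y))], axis=0)
```

A single stochastic gradient is therefore a noisy but unbiased estimate of the batch gradient, obtained at a fraction of the cost.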

As the size of the training sets increased, the ability of optimization algorithms to handle large amounts of data efficiently became a bottleneck. For example, in the case of quadratic programming, running time scales at least quadratically in the number of examples. In other words, if you double your training set size, your training will take at least 4 times longer. Hence, lots of effort went into trying to make these algorithms scale to larger training sets (see for example Large Scale Kernel Machines).

People who had experience with training neural networks knew that stochastic gradient descent was comparatively easier to scale to large datasets, but unfortunately its convergence is very slow (it takes lots of iterations to reach an accuracy comparable to that of a batch algorithm), so it wasn’t clear that this would be a solution to the scaling problem.

Stochastic Algorithms Scale Better
In the context of ML, the number of iterations needed to optimize the cost function is actually not the main concern: there is no point in perfectly tuning your model since you will essentially “overfit” to the training data. So why not reduce the computational effort that you put into tuning the model and instead spend the effort processing more data?

The work of Léon and Olivier provided a formal study of this phenomenon: by considering access to a large amount of data and assuming the limiting factor is computation, they showed that it is better to perform a minimal amount of computation on each individual training example (thus processing more of them) rather than performing extensive computation on a smaller amount of data.

In doing so, they also demonstrated that among various possible optimization algorithms, stochastic gradient descent is the best. This was confirmed by many experiments and led to a renewed interest in online optimization algorithms which are now in extensive use in ML.
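The trade-off can be illustrated with a toy experiment. This is a sketch under stated assumptions, not the analysis from the paper: the noisy linear-regression problem, the step sizes, and the budget of 5000 per-example gradient evaluations are all hypothetical choices. Both options spend the same budget, either as one cheap SGD step on each of many examples or as many full passes over a small subset.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20                        # number of features (hypothetical)
true_w = rng.normal(size=d)   # ground-truth weights of the toy problem
noise_std = 1.0
budget = 5000                 # total per-example gradient evaluations allowed


def sample(n):
    """Draw n noisy examples from the toy linear model."""
    X = rng.normal(size=(n, d))
    y = X @ true_w + noise_std * rng.normal(size=n)
    return X, y


# Option A: SGD, one cheap step on each of `budget` examples.
X_big, y_big = sample(budget)
w_sgd = np.zeros(d)
for x_i, y_i in zip(X_big, y_big):
    w_sgd -= 0.005 * 2.0 * x_i * (x_i @ w_sgd - y_i)

# Option B: batch gradient descent, spending the same budget on 50 examples
# (budget / 50 = 100 full passes over the small subset).
X_small, y_small = X_big[:50], y_big[:50]
w_batch = np.zeros(d)
for _ in range(budget // len(y_small)):
    grad = 2.0 * X_small.T @ (X_small @ w_batch - y_small) / len(y_small)
    w_batch -= 0.1 * grad

# With standard-normal features, the excess test error equals the squared
# distance between the learned weights and true_w.
err_sgd = float(np.sum((w_sgd - true_w) ** 2))
err_batch = float(np.sum((w_batch - true_w) ** 2))
print(f"SGD on {budget} examples: {err_sgd:.3f}")
print(f"Batch GD on 50 examples:  {err_batch:.3f}")
```

In this setup the batch run fits its 50 examples closely but inherits their noise, while the SGD run sees far more data for the same amount of computation.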

Mysteries Remain
In the following years, many variants of stochastic gradient descent were developed both in the convex case and in the non-convex one (particularly relevant for DL). The most common variant now is the so-called “mini-batch” SGD where one considers a small number (~10-100) of training examples at each iteration, and performs several passes over the training set, with a couple of clever tricks to scale the gradient appropriately. Most ML libraries provide a default implementation of such an algorithm and it is arguably one of the pillars of DL.
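A minimal sketch of mini-batch SGD on a least-squares problem follows. The function name, default batch size, step size and number of passes are hypothetical, and averaging the gradient over the batch is assumed here as one simple way of scaling it; real libraries offer many refinements.

```python
import numpy as np


def minibatch_sgd(X, y, batch_size=32, epochs=5, lr=0.05, seed=0):
    """Fit least-squares weights with mini-batch SGD over several passes."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle on every pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Average over the batch so the step size does not depend on batch size.
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w
```

Called on noiseless data generated from known weights, for example `minibatch_sgd(X, X @ w_true)`, the returned vector should approach `w_true`.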

While this analysis provided a solid foundation for understanding the properties of this algorithm, the amazing and sometimes surprising successes of DL continue to raise many more questions for the scientific community. In particular, the role of this algorithm in the generalization properties of deep networks has been repeatedly demonstrated but is still poorly understood. This means that a lot of fascinating questions are yet to be explored which could lead to a better understanding of the algorithms currently in use and the development of even more efficient algorithms in the future.

The perspective proposed by Léon and Olivier in their collaboration 10 years ago provided a significant boost to the development of the algorithm that is nowadays the workhorse of ML systems that benefit our lives daily, and we offer our sincere congratulations to both authors on this well-deserved award.

Google at NeurIPS 2018

This week, Montréal hosts the 32nd annual Conference on Neural Information Processing Systems (NeurIPS 2018), the biggest machine learning conference of the year. The conference includes invited talks, demonstrations and presentations of some of the latest in machine learning research. Google will have a strong presence at NeurIPS 2018, with more than 400 Googlers attending in order to contribute to, and learn from, the broader academic research community via talks, posters, workshops, competitions and tutorials. We will be presenting work that pushes the boundaries of what is possible in language understanding, translation, speech recognition and visual & audio perception, with Googlers co-authoring nearly 100 accepted papers (see below).

At the forefront of machine learning, Google is actively exploring virtually all aspects of the field spanning both theory and applications. This research is often inspired by real product needs, but increasingly it is driven by scientific curiosity. Given the range of research projects that we pursue, we have found it useful to define a new framework which helps crystallize the goals of projects and allows us to measure progress and success in appropriate ways. Our contributions to NeurIPS and to the broader research community in general are integral to our research mission.

If you are attending NeurIPS 2018, we hope you’ll stop by our booth to chat with our researchers about the projects and opportunities at Google that go into solving the world's most challenging research problems, and to see demonstrations of some of the exciting research we pursue. You can also learn more about our work being presented in the list below.

Google is a Platinum Sponsor of NeurIPS 2018.

NeurIPS Foundation Board
Corinna Cortes, John C. Platt, Fernando Pereira

NeurIPS Organizing Committee
General Chair: Samy Bengio
Program Co-Chair: Hugo Larochelle
Party Chair: Douglas Eck
Diversity and Inclusion Co-Chair: Katherine A. Heller

NeurIPS Program Committee
Senior Area Chairs include: Angela Yu, Claudio Gentile, Cordelia Schmid, Corinna Cortes, Csaba Szepesvari, Dale Schuurmans, Elad Hazan, Mehryar Mohri, Raia Hadsell, Satyen Kale, Yishay Mansour, Afshin Rostamizadeh, Alex Kulesza

Area Chairs include: Amin Karbasi, Amir Globerson, Amit Daniely, Andras Gyorgy, Andriy Mnih, Been Kim, Branislav Kveton, Ce Liu, D Sculley, Danilo Rezende, Danny Tarlow, David Balduzzi, Denny Zhou, Dilan Gorur, Dumitru Erhan, George Dahl, Graham Taylor, Ian Goodfellow, Jasper Snoek, Jean-Philippe Vert, Jia Deng, Jon Shlens, Karen Simonyan, Kevin Swersky, Kun Zhang, Lihong Li, Marc G. Bellemare, Marco Cuturi, Maya Gupta, Michael Bowling, Michalis Titsias, Mohammad Norouzi, Mouhamadou Moustapha Cisse, Nicolas Le Roux, Remi Munos, Sanjiv Kumar, Sanmi Koyejo, Sergey Levine, Silvia Chiappa, Slav Petrov, Surya Ganguli, Timnit Gebru, Timothy Lillicrap, Viren Jain, Vitaly Feldman, Vitaly Kuznetsov

Workshops Program Committee includes: Mehryar Mohri, Sergey Levine

Accepted Papers
3D-Aware Scene Manipulation via Inverse Graphics
Shunyu Yao, Tzu Ming Harry Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, William T. Freeman, Joshua B. Tenenbaum

A Retrieve-and-Edit Framework for Predicting Structured Outputs
Tatsunori Hashimoto, Kelvin Guu, Yonatan Oren, Percy Liang

Adversarial Attacks on Stochastic Bandits
Kwang-Sung Jun, Lihong Li, Yuzhe Ma, Xiaojin Zhu

Adversarial Examples that Fool both Computer Vision and Time-Limited Humans
Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, Jascha Sohl-Dickstein

Adversarially Robust Generalization Requires More Data
Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, Aleksander Madry

Are GANs Created Equal? A Large-Scale Study
Mario Lucic, Karol Kurach, Marcin Michalski, Olivier Bousquet, Sylvain Gelly

Collaborative Learning for Deep Neural Networks
Guocong Song, Wei Chai

Completing State Representations using Spectral Learning
Nan Jiang, Alex Kulesza, Satinder Singh

Content Preserving Text Generation with Attribute Controls
Lajanugen Logeswaran, Honglak Lee, Samy Bengio

Context-aware Synthesis and Placement of Object Instances
Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz

Co-regularized Alignment for Unsupervised Domain Adaptation
Abhishek Kumar, Prasanna Sattigeri, Kahini Wadhawan, Leonid Karlinsky, Rogerio Feris, William T. Freeman, Gregory Wornell

cpSGD: Communication-efficient and differentially-private distributed SGD
Naman Agarwal, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, H. Brendan McMahan

Data Center Cooling Using Model-Predictive Control
Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, Greg Imwalle

Data-Efficient Hierarchical Reinforcement Learning
Ofir Nachum, Shixiang Gu, Honglak Lee, Sergey Levine

Deep Attentive Tracking via Reciprocative Learning
Shi Pu, Yibing Song, Chao Ma, Honggang Zhang, Ming-Hsuan Yang

Generalizing Point Embeddings Using the Wasserstein Space of Elliptical Distributions
Boris Muzellec, Marco Cuturi

GLoMo: Unsupervised Learning of Transferable Relational Graphs
Zhilin Yang, Jake (Junbo) Zhao, Bhuwan Dhingra, Kaiming He, William W. Cohen, Ruslan Salakhutdinov, Yann LeCun

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking
Patrick Chen, Si Si, Yang Li, Ciprian Chelba, Cho-Jui Hsieh

Interpreting Neural Network Judgments via Minimal, Stable, and Symbolic Corrections
Xin Zhang, Armando Solar-Lezama, Rishabh Singh

Learning Hierarchical Semantic Image Manipulation through Structured Representations
Seunghoon Hong, Xinchen Yan, Thomas Huang, Honglak Lee

Learning Temporal Point Processes via Reinforcement Learning
Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, Le Song

Learning Towards Minimum Hyperspherical Energy
Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, Le Song

Mesh-TensorFlow: Deep Learning for Supercomputers
Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman

MiME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare
Edward Choi, Cao Xiao, Walter F. Stewart, Jimeng Sun

Searching for Efficient Multi-Scale Architectures for Dense Image Prediction
Liang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, Jonathon Shlens

SplineNets: Continuous Neural Decision Graphs
Cem Keskin, Shahram Izadi

Task-Driven Convolutional Recurrent Models of the Visual System
Aran Nayebi, Daniel Bear, Jonas Kubilius, Kohitij Kar, Surya Ganguli, David Sussillo, James J. DiCarlo, Daniel L. K. Yamins

To Trust or Not to Trust a Classifier
Heinrich Jiang, Been Kim, Melody Guan, Maya Gupta

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Algorithms and Theory for Multiple-Source Adaptation
Judy Hoffman, Mehryar Mohri, Ningshan Zhang

A Lyapunov-based Approach to Safe Reinforcement Learning
Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, Mohammad Ghavamzadeh

Adaptive Methods for Nonconvex Optimization
Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, Sanjiv Kumar

Assessing Generative Models via Precision and Recall
Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, Sylvain Gelly

A Loss Framework for Calibrated Anomaly Detection
Aditya Menon, Robert Williamson

Blockwise Parallel Decoding for Deep Autoregressive Models
Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou

Contextual Pricing for Lipschitz Buyers
Jieming Mao, Renato Leme, Jon Schneider

Coupled Variational Bayes via Optimization Embedding
Bo Dai, Hanjun Dai, Niao He, Weiyang Liu, Zhen Liu, Jianshu Chen, Lin Xiao, Le Song

Data Amplification: A Unified and Competitive Approach to Property Estimation
Yi Hao, Alon Orlitsky, Ananda Theertha Suresh, Yihong Wu

Deep Network for the Integrated 3D Sensing of Multiple People in Natural Images
Elisabeta Marinoiu, Mihai Zanfir, Alin-Ionut Popa, Cristian Sminchisescu

Deep Non-Blind Deconvolution via Generalized Low-Rank Approximation
Wenqi Ren, Jiawei Zhang, Lin Ma, Jinshan Pan, Xiaochun Cao, Wei Liu, Ming-Hsuan Yang

Diminishing Returns Shape Constraints for Interpretability and Regularization
Maya Gupta, Dara Bahri, Andrew Cotter, Kevin Canini

DropBlock: A Regularization Method for Convolutional Networks
Golnaz Ghiasi, Tsung-Yi Lin, Quoc V. Le

Generalization Bounds for Uniformly Stable Algorithms
Vitaly Feldman, Jan Vondrak

Geometrically Coupled Monte Carlo Sampling
Mark Rowland, Krzysztof Choromanski, Francois Chalus, Aldo Pacchiano, Tamas Sarlos, Richard E. Turner, Adrian Weller

GILBO: One Metric to Measure Them All
Alexander A. Alemi, Ian Fischer

Insights on Representational Similarity in Neural Networks with Canonical Correlation
Ari S. Morcos, Maithra Raghu, Samy Bengio

Improving Online Algorithms via ML Predictions
Manish Purohit, Zoya Svitkina, Ravi Kumar

Learning to Exploit Stability for 3D Scene Parsing
Yilun Du, Zhijian Liu, Hector Basevi, Ales Leonardis, William T. Freeman, Josh Tenenbaum, Jiajun Wu

Maximizing Induced Cardinality Under a Determinantal Point Process
Jennifer Gillenwater, Alex Kulesza, Sergei Vassilvitskii, Zelda Mariet

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing
Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V. Le, Ni Lao

PCA of High Dimensional Random Walks with Comparison to Neural Network Training
Joseph M. Antognini, Jascha Sohl-Dickstein

Predictive Approximate Bayesian Computation via Saddle Points
Yingxiang Yang, Bo Dai, Negar Kiyavash, Niao He

Recurrent World Models Facilitate Policy Evolution
David Ha, Jürgen Schmidhuber

Sanity Checks for Saliency Maps
Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, Been Kim

Simple, Distributed, and Accelerated Probabilistic Programming
Dustin Tran, Matthew Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew Johnson, Rif A. Saurous

Tangent: Automatic Differentiation Using Source-Code Transformation for Dynamically Typed Array Programming
Bart van Merriënboer, Dan Moldovan, Alex Wiltschko

The Emergence of Multiple Retinal Cell Types Through Efficient Coding of Natural Movies
Samuel A. Ocko, Jack Lindsey, Surya Ganguli, Stephane Deny

The Everlasting Database: Statistical Validity at a Fair Price
Blake Woodworth, Vitaly Feldman, Saharon Rosset, Nathan Srebro

The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network
Jeffrey Pennington, Pratik Worah

A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks
Kimin Lee, Kibok Lee, Honglak Lee, Jinwoo Shin

Autoconj: Recognizing and Exploiting Conjugacy Without a Domain-Specific Language
Matthew D. Hoffman, Matthew Johnson, Dustin Tran

A Bayesian Nonparametric View on Count-Min Sketch
Diana Cai, Michael Mitzenmacher, Ryan Adams (no longer at Google)

Automatic Differentiation in ML: Where We are and Where We Should be Going
Bart van Merriënboer, Olivier Breuleux, Arnaud Bergeron, Pascal Lamblin

Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures
Sergey Bartunov, Adam Santoro, Blake A. Richards, Geoffrey E. Hinton, Timothy P. Lillicrap

Deep Generative Models for Distribution-Preserving Lossy Compression
Michael Tschannen, Eirikur Agustsson, Mario Lucic

Deep Structured Prediction with Nonlinear Output Transformations
Colin Graber, Ofer Meshi, Alexander Schwing

Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning
Supasorn Suwajanakorn, Noah Snavely, Jonathan Tompson, Mohammad Norouzi

Transfer Learning with Neural AutoML
Catherine Wong, Neil Houlsby, Yifeng Lu, Andrea Gesmundo

Efficient Gradient Computation for Structured Output Learning with Rational and Tropical Losses
Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, Dmitry Storcheus, Scott Yang

Cooperative neural networks (CoNN): Exploiting prior independence structure for improved classification
Harsh Shrivastava, Eugene Bart, Bob Price, Hanjun Dai, Bo Dai, Srinivas Aluru

Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization
Blake Woodworth, Jialei Wang, Brendan McMahan, Nathan Srebro

Hierarchical Reinforcement Learning for Zero-shot Generalization with Subtask Dependencies
Sungryull Sohn, Junhyuk Oh, Honglak Lee

Human-in-the-Loop Interpretability Prior
Isaac Lage, Andrew Slavin Ross, Been Kim, Samuel J. Gershman, Finale Doshi-Velez

Joint Autoregressive and Hierarchical Priors for Learned Image Compression
David Minnen, Johannes Ballé, George D Toderici

Large-Scale Computation of Means and Clusters for Persistence Diagrams Using Optimal Transport
Théo Lacombe, Steve Oudot, Marco Cuturi

Learning to Reconstruct Shapes from Unseen Classes
Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Joshua B. Tenenbaum, William T. Freeman, Jiajun Wu

Large Margin Deep Networks for Classification
Gamaleldin Fathy Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, Samy Bengio

Mallows Models for Top-k Lists
Flavio Chierichetti, Anirban Dasgupta, Shahrzad Haddadan, Ravi Kumar, Silvio Lattanzi

Meta-Learning MCMC Proposals
Tongzhou Wang, Yi Wu, Dave Moore, Stuart Russell

Non-delusional Q-Learning and Value-Iteration
Tyler Lu, Dale Schuurmans, Craig Boutilier

Online Learning of Quantum States
Scott Aaronson, Xinyi Chen, Elad Hazan, Satyen Kale, Ashwin Nayak

Online Reciprocal Recommendation with Theoretical Performance Guarantees
Fabio Vitale, Nikos Parotsidis, Claudio Gentile

Optimal Algorithms for Continuous Non-monotone Submodular and DR-Submodular Maximization
Rad Niazadeh, Tim Roughgarden, Joshua R. Wang

Policy Regret in Repeated Games
Raman Arora, Michael Dinitz, Teodor Vanislavov Marinov, Mehryar Mohri

Provable Variational Inference for Constrained Log-Submodular Models
Josip Djolonga, Stefanie Jegelka, Andreas Krause

Realistic Evaluation of Deep Semi-Supervised Learning Algorithms
Avital Oliver, Augustus Odena, Colin Raffel, Ekin D. Cubuk, Ian J. Goodfellow

Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion
Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, Honglak Lee

Visual Object Networks: Image Generation with Disentangled 3D Representations
Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum, William T. Freeman

Watch Your Step: Learning Node Embeddings via Graph Attention
Sami Abu-El-Haija, Bryan Perozzi, Rami AlRfou, Alexander Alemi

2nd Workshop on Machine Learning on the Phone and Other Consumer Devices
Co-Chairs include: Sujith Ravi, Wei Chai, Hrishikesh Aradhye

Bayesian Deep Learning
Workshop Organizers include: Kevin Murphy

Continual Learning
Workshop Organizers include: Marc Pickett

The Second Conversational AI Workshop – Today's Practice and Tomorrow's Potential
Workshop Organizers include: Dilek Hakkani-Tur

Visually Grounded Interaction and Language
Workshop Organizers include: Olivier Pietquin

Workshop on Ethical, Social and Governance Issues in AI
Workshop Organizers include: D. Sculley

AI for Social Good
Workshop Program Committee includes: Samuel Greydanus

Black in AI
Workshop Organizers: Mouhamadou Moustapha Cisse, Timnit Gebru
Program Committee: Irwan Bello, Samy Bengio, Ian Goodfellow, Hugo Larochelle, Margaret Mitchell

Interpretability and Robustness in Audio, Speech, and Language
Workshop Organizers include: Ehsan Variani, Bhuvana Ramabhadran

LatinX in AI
Workshop Organizers include: Pablo Samuel Castro
Program Committee includes: Sergio Guadarrama

Machine Learning for Systems
Workshop Organizers include: Anna Goldie, Azalia Mirhoseini, Kevin Swersky, Milad Hashemi
Program Committee includes: Simon Kornblith, Nicholas Frosst, Amir Yazdanbakhsh, Azade Nazi, James Bradbury, Sharan Narang, Martin Maas, Carlos Villavieja

Queer in AI
Workshop Organizers include: Raphael Gontijo Lopes

Second Workshop on Machine Learning for Creativity and Design
Workshop Organizers include: Jesse Engel, Adam Roberts

Workshop on Security in Machine Learning
Workshop Organizers include: Nicolas Papernot

Visualization for Machine Learning
Fernanda Viégas, Martin Wattenberg

Highlights from the 2018 Google PhD Fellowship Summit

Google created the PhD Fellowship Program to recognize and support outstanding graduate students doing exceptional research in Computer Science and related disciplines. This program provides a unique opportunity for students pursuing a graduate degree in Computer Science (or a related field) who seek to influence the future of technology. Now in its tenth year, the program has helped support close to 400 graduate students in Australia, China and East Asia, India, North America, Europe, the Middle East, and Africa, the most recent region to award Google Fellowships.
Every year, Google PhD Fellows are invited to our global PhD Fellowship Summit where they are exposed to state-of-the-art research pursued across Google, and are given the opportunity to network with Google’s research community as well as other PhD Fellows from around the world. Below we share some highlights from our most recent summit, and also announce the newest recipients.

Summit Highlights
At this year’s annual Global PhD Fellowship Summit, Fellows from around the world converged on our Mountain View campus for two days of talks, focused discussions, sharing research work, and networking. VP of Education and University Programs Maggie Johnson welcomed the Fellows and presented Google's approach to research and its various outreach efforts that encourage collaboration with academia. The agenda also included talks on a range of topics, starting with an opening keynote from Principal Scientist Maya Gupta on controlling machine learning models with constraints and goals to make them do what you want, followed by researchers Andrew Tomkins, Rahul Sukthankar, Sai Teja Peddinti, Amin Vahdat, Martin Stumpe, Ed Chi and Ciera Jaspan giving talks from a variety of research perspectives. A closing presentation was given by Jeff Dean, Senior Fellow and SVP of Google AI, who spoke about using deep learning to solve a variety of challenging research problems at Google.
Starting clockwise from top left: Researchers Rahul Sukthankar and Ed Chi talking with Fellow attendees; Jeff Dean delivering the closing talk; Poster session in full swing.
Fellows had the chance to connect with each other and Google researchers to discuss their work during a poster event, as well as receive feedback from leaders in their fields in smaller deep dives. A panel discussion composed of Fellow alumni, two from academia and two from Google, provided both perspectives on career paths.
Google Fellows attending the 2018 PhD Fellowship Summit.
The Complete List of 2018 Google PhD Fellows
We believe that the Google PhD Fellows represent some of the best and brightest young researchers around the globe in Computer Science, and it is our ongoing goal to support them as they make their mark on the world. As such, we would like to announce the latest recipients from China and East Asia, India, Australia and Africa, who join the North America, Europe and Middle East Fellows we announced last April. Congratulations to all of this year’s awardees! The complete list of recipients is:

Algorithms, Optimizations and Markets
Emmanouil Zampetakis, Massachusetts Institute of Technology
Manuela Fischer, ETH Zurich
Pranjal Dutta, Chennai Mathematical Institute
Thodoris Lykouris, Cornell University
Yuan Deng, Duke University

Computational Neuroscience
Ella Batty, Columbia University
Neha Spenta Wadia, University of California - Berkeley
Reuben Feinman, New York University

Human Computer Interaction
Gierad Laput, Carnegie Mellon University
Mike Schaekermann, University of Waterloo
Minsuk (Brian) Kahng, Georgia Institute of Technology
Niels van Berkel, The University of Melbourne
Siqi Wu, Australian National University
Xiang Zhang, The University of New South Wales

Machine Learning
Abhijeet Awasthi, Indian Institute of Technology - Bombay
Aditi Raghunathan, Stanford University
Futoshi Futami, University of Tokyo
Lin Chen, Yale University
Qian Yu, University of Southern California
Ravid Shwartz-Ziv, Hebrew University
Shuai Li, Chinese University of Hong Kong
Shuang Liu, University of California - San Diego
Stephen Tu, University of California - Berkeley
Steven James, University of the Witwatersrand
Xinchen Yan, University of Michigan
Zelda Mariet, Massachusetts Institute of Technology

Machine Perception, Speech Technology and Computer Vision
Antoine Miech, INRIA
Arsha Nagrani, University of Oxford
Arulkumar S, Indian Institute of Technology - Madras
Joseph Redmon, University of Washington
Raymond Yeh, University of Illinois - Urbana-Champaign
Shanmukha Ramakrishna Vedantam, Georgia Institute of Technology

Mobile Computing
Lili Wei, Hong Kong University of Science & Technology
Rizanne Elbakly, Egypt-Japan University of Science and Technology
Shilin Zhu, University of California - San Diego

Natural Language Processing
Anne Cocos, University of Pennsylvania
Hongwei Wang, Shanghai Jiao Tong University
Jonathan Herzig, Tel Aviv University
Rotem Dror, Technion - Israel Institute of Technology
Shikhar Vashishth, Indian Institute of Science - Bangalore
Yang Liu, University of Edinburgh
Yoon Kim, Harvard University
Zhehuai Chen, Shanghai Jiao Tong University
Imane Khaouja, Université Internationale de Rabat

Privacy and Security
Aayush Jain, University of California - Los Angeles

Programming Technology and Software Engineering
Gowtham Kaki, Purdue University
Joseph Benedict Nyansiro, University of Dar es Salaam
Reyhaneh Jabbarvand, University of California - Irvine
Victor Lanvin, Fondation Sciences Mathématiques de Paris

Quantum Computing
Erika Ye, California Institute of Technology

Structured Data and Database Management
Lingjiao Chen, University of Wisconsin - Madison

Systems and Networking
Andrea Lattuada, ETH Zurich
Chen Sun, Tsinghua University
Lana Josipovic, EPFL
Michael Schaarschmidt, University of Cambridge
Rachee Singh, University of Massachusetts - Amherst
Stephen Mallon, The University of Sydney

Learning to Predict Depth on the Pixel 3 Phones

Portrait Mode on the Pixel smartphones lets you take professional-looking images that draw attention to a subject by blurring the background behind it. Last year, we described, among other things, how we compute depth with a single camera from its Phase-Detection Autofocus (PDAF) pixels (also known as dual-pixel autofocus) using a traditional non-learned stereo algorithm. This year, on the Pixel 3, we turn to machine learning to improve depth estimation and produce even better Portrait Mode results.
Left: The original HDR+ image. Right: A comparison of Portrait Mode results using depth from traditional stereo and depth from machine learning. The learned depth result has fewer errors. Notably, in the traditional stereo result, many of the horizontal lines behind the man are incorrectly estimated to be at the same depth as the man and are kept sharp.
(Mike Milne)
A Short Recap
As described in last year’s blog post, Portrait Mode uses a neural network to determine which pixels correspond to people versus the background, and augments this two-layer person segmentation mask with depth information derived from the PDAF pixels. This is meant to enable a depth-dependent blur, which is closer to what a professional camera does.

PDAF pixels work by capturing two slightly different views of a scene, shown below. Flipping between the two views, we see that the person is stationary, while the background moves horizontally, an effect referred to as parallax. Because parallax is a function of the point’s distance from the camera and the distance between the two viewpoints, we can estimate depth by matching each point in one view with its corresponding point in the other view.
The two PDAF images on the left and center look very similar, but in the crop on the right you can see the parallax between them. It is most noticeable on the circular structure in the middle of the crop.
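The relationship between parallax and depth can be sketched with the textbook relation for an idealized, rectified pinhole stereo pair. This is an assumption made for illustration: the function and parameter names are hypothetical, and it is not the model used for PDAF data, where the observed shift also depends on lens characteristics and focus distance.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth (in meters) of a point seen by two rectified pinhole cameras.

    disparity_px:    horizontal shift of the point between the views, in pixels
    focal_length_px: focal length expressed in pixels
    baseline_m:      distance between the two viewpoints, in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return focal_length_px * baseline_m / disparity_px
```

Depth is inversely proportional to disparity, so a tiny baseline yields tiny disparities, and small matching errors then translate into large depth errors.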
However, finding these correspondences in PDAF images (a method called depth from stereo) is extremely challenging because scene points barely move between the views. Furthermore, all stereo techniques suffer from the aperture problem. That is, if you look at the scene through a small aperture, it is impossible to find correspondence for lines parallel to the stereo baseline, i.e., the line connecting the two cameras. In other words, when looking at the horizontal lines in the figure above (or vertical lines in portrait orientation shots), any proposed shift of these lines in one view with respect to the other view looks about the same. In last year’s Portrait Mode, all these factors could result in errors in depth estimation and cause unpleasant artifacts.
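The aperture problem can be reproduced in a few lines of NumPy. The sketch below is a toy setup with synthetic stripe images and a hypothetical sum-of-squared-differences matcher, not the production stereo algorithm: for stripes parallel to the horizontal baseline, every candidate shift matches equally well, whereas for vertical stripes only the true shift does.

```python
import numpy as np


def matching_costs(left, right, max_shift):
    """Sum of squared differences for each candidate horizontal shift of `right`."""
    costs = []
    for s in range(max_shift + 1):
        shifted = np.roll(right, s, axis=1)
        costs.append(float(np.sum((left - shifted) ** 2)))
    return costs


stripe = (np.arange(32) % 8 < 4).astype(float)
horizontal_stripes = np.tile(stripe[:, None], (1, 32))   # lines parallel to the baseline
vertical_stripes = np.tile(stripe[None, :], (32, 1))     # lines perpendicular to it

true_shift = 2
costs_h = matching_costs(
    horizontal_stripes, np.roll(horizontal_stripes, -true_shift, axis=1), 4)
costs_v = matching_costs(
    vertical_stripes, np.roll(vertical_stripes, -true_shift, axis=1), 4)
```

Here `costs_h` is flat across all shifts, so the matcher has no way to pick one, while `costs_v` has a single minimum at the true shift.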

Improving Depth Estimation
With Portrait Mode on the Pixel 3, we fix these errors by utilizing the fact that the parallax used by depth from stereo algorithms is only one of many depth cues present in images. For example, points that are far away from the in-focus plane appear less sharp than ones that are closer, giving us a defocus depth cue. In addition, even when viewing an image on a flat screen, we can accurately tell how far away things are because we know the rough size of everyday objects (e.g. one can use the number of pixels in a photograph of a person’s face to estimate how far away it is). This is called a semantic cue.
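The face-size example of a semantic cue comes down to similar triangles in an idealized pinhole camera. The sketch below is a hypothetical illustration with made-up parameter names; a learned network picks up such cues implicitly rather than through an explicit formula.

```python
def distance_from_known_size(real_height_m, image_height_px, focal_length_px):
    """Distance (in meters) to an object of known physical height.

    Uses the pinhole relation: image_height / focal_length = real_height / distance.
    """
    if image_height_px <= 0:
        raise ValueError("the object must span at least one pixel")
    return focal_length_px * real_height_m / image_height_px
```

An object that spans half as many pixels is estimated to be twice as far away, which is why knowing the rough size of everyday objects carries depth information.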

Designing a hand-crafted algorithm to combine these different cues is extremely difficult, but by using machine learning, we can do so while also better exploiting the PDAF parallax cue. Specifically, we train a convolutional neural network, written in TensorFlow, that takes as input the PDAF pixels and learns to predict depth. This new and improved ML-based method of depth estimation is what powers Portrait Mode on the Pixel 3.
Our convolutional neural network takes as input the PDAF images and outputs a depth map. The network uses an encoder-decoder style architecture with skip connections and residual blocks.
Training the Neural Network
In order to train the network, we need lots of PDAF images and corresponding high-quality depth maps. And since we want our predicted depth to be useful for Portrait Mode, we also need the training data to be similar to pictures that users take with their smartphones.

To accomplish this, we built our own custom “Frankenphone” rig that contains five Pixel 3 phones, along with a Wi-Fi-based solution that allowed us to simultaneously capture pictures from all of the phones (within a tolerance of ~2 milliseconds). With this rig, we computed high-quality depth from photos by using structure from motion and multi-view stereo.
Left: Custom rig used to collect training data. Middle: An example capture flipping between the five images. Synchronization between the cameras ensures that we can calculate depth for dynamic scenes, such as this one. Right: Ground truth depth. Low confidence points, i.e., points where stereo matches are not reliable due to weak texture, are colored in black and are not used during training. (Sam Ansari and Mike Milne)
The data captured by this rig is ideal for training a network for the following main reasons:
  • Five viewpoints ensure that there is parallax in multiple directions and hence no aperture problem.
  • The arrangement of the cameras ensures that a point in an image is usually visible in at least one other image, resulting in fewer points with no correspondences.
  • The baseline, i.e., the distance between the cameras, is much larger than our PDAF baseline, resulting in more accurate depth estimation.
  • Synchronization between the cameras ensures that we can calculate depth for dynamic scenes like the one above.
  • Portability of the rig ensures that we can capture photos in the wild, simulating the photos users take with their smartphones.
However, even though the data captured from this rig is ideal, it is still extremely challenging to predict the absolute depth of objects in a scene, because a given PDAF pair can correspond to a range of different depth maps (depending on lens characteristics, focus distance, etc.). To account for this, we instead predict the relative depths of objects in the scene, which is sufficient for producing pleasing Portrait Mode results.
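One way to make "relative depth" concrete is to compare depth maps only after removing an unknown scale and offset. The sketch below assumes that form of ambiguity, which this post does not specify, and it is not the training loss used for Portrait Mode; the function names are hypothetical.

```python
from statistics import median


def normalize_relative(depths):
    """Remove offset (median) and scale (mean absolute deviation) from depths."""
    center = median(depths)
    spread = sum(abs(d - center) for d in depths) / len(depths)
    if spread == 0:
        raise ValueError("depths must not all be identical")
    return [(d - center) / spread for d in depths]


def relative_depth_error(predicted, reference):
    """Mean absolute difference after both inputs are normalized."""
    pred = normalize_relative(predicted)
    ref = normalize_relative(reference)
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


reference = [1.0, 2.0, 4.0, 8.0]
rescaled = [3.0 * d + 5.0 for d in reference]   # same scene, different scale and offset
reordered = [8.0, 4.0, 2.0, 1.0]                # near and far objects swapped

print(relative_depth_error(rescaled, reference))
print(relative_depth_error(reordered, reference))
```

Under this measure a prediction that differs from the reference only by a positive scale and an offset counts as correct, while a prediction that swaps near and far objects does not.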

Putting it All Together
This ML-based depth estimation needs to run fast on the Pixel 3, so that users don’t have to wait too long for their Portrait Mode shots. However, to get good depth estimates that make use of subtle defocus and parallax cues, we have to feed full resolution, multi-megapixel PDAF images into the network. To ensure fast results, we use TensorFlow Lite, a cross-platform solution for running machine learning models on mobile and embedded devices, and the Pixel 3’s powerful GPU to compute depth quickly despite our abnormally large inputs. We then combine the resulting depth estimates with masks from our person segmentation neural network to produce beautiful Portrait Mode results.

Try it Yourself
In Google Camera App version 6.1 and later, our depth maps are embedded in Portrait Mode images. This means you can use the Google Photos depth editor to change the amount of blur and the focus point after capture. You can also use third-party depth extractors to extract the depth map from a JPEG and take a look at it yourself. Also, here is an album showing the relative depth maps and the corresponding Portrait Mode images for the traditional stereo and the learning-based approaches.

This work wouldn’t have been possible without Sam Ansari, Yael Pritch Knaan, David Jacobs, Jiawen Chen, Juhyun Lee and Andrei Kulik. Special thanks to Mike Milne and Andy Radin who captured data with the five-camera rig.

Source: Google AI Blog

A Structured Approach to Unsupervised Depth Learning from Monocular Videos

Perceiving the depth of a scene is an important task for an autonomous robot: the ability to accurately estimate how far objects are from the robot is crucial for obstacle avoidance, safe planning, and navigation. While depth can be obtained (and learned) from sensor data, such as LIDAR, it is also possible to learn it in an unsupervised manner from a monocular camera only, relying on the motion of the robot and the resulting different views of the scene. In doing so, the “ego-motion” (the motion of the robot/camera between two frames) is also learned, which provides localization of the robot itself. While this approach has a long history, coming from the structure-from-motion and multi-view geometry paradigms, new learning-based techniques, more specifically for unsupervised learning of depth and ego-motion by using deep neural networks, have advanced the state of the art, including work by Zhou et al., and our own prior research which aligns 3D point clouds of the scene during training.

Despite these efforts, learning to predict scene depth and ego-motion remains an ongoing challenge, specifically when handling highly dynamic scenes and estimating proper depth of moving objects. Because previous research efforts for unsupervised monocular learning do not model moving objects, they can consistently misestimate the depth of such objects, often mapping it to infinity.

In “Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos”, to appear in AAAI 2019, we propose a novel approach which is able to model moving objects and produces high quality depth estimation results. Unlike previous methods for unsupervised learning from monocular videos, our approach is able to recover the correct depth for moving objects. In our paper, we also propose a seamless online refinement technique that can further improve quality and be applied for transfer across datasets. Furthermore, to encourage even more advanced approaches to onboard robotics learning, we have open sourced the code in TensorFlow.
Previous work (middle row) has not been able to correctly estimate the depth of moving objects, mapping them to infinity (dark blue regions in the heatmap). Our approach (right) provides much better depth estimates.
A key idea in our approach is to introduce structure into the learning framework. That is, instead of relying on a neural network to learn depth directly, we treat the monocular scene as 3D, composed of moving objects, including the robot itself. The respective motions are modeled as independent transformations (rotations and translations) in the scene, which are then used to model the 3D geometry and estimate all the objects’ motions. Additionally, knowing which objects may potentially move (e.g., cars, people, bicycles, etc.) helps us learn separate motion vectors for them even if they may be static. By decomposing the scene into 3D and individual objects, better depth and ego-motion in the scene are learned, especially on very dynamic scenes.
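The idea of giving each object its own rotation and translation can be illustrated with plain geometry. The sketch below is not the open-sourced model: it assumes a pinhole camera with the principal point at the origin, rotation about the vertical axis only, and a hypothetical scene whose coordinates and motions were made up for this example.

```python
import math


def rigid_transform(point, yaw_rad, translation):
    """Rotate a 3D point about the vertical (y) axis, then translate it."""
    x, y, z = point
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return (c * x + s * z + tx, y + ty, -s * x + c * z + tz)


def project(point, focal_px):
    """Pinhole projection with the principal point at the image origin."""
    x, y, z = point
    return (focal_px * x / z, focal_px * y / z)


def next_pixel(point, ego_motion, object_motion=None, focal_px=1000.0):
    """Predict where a 3D point lands in the next frame.

    Each motion is a (yaw_rad, (tx, ty, tz)) pair in camera coordinates.
    """
    if object_motion is not None:
        point = rigid_transform(point, *object_motion)
    point = rigid_transform(point, *ego_motion)
    return project(point, focal_px)


# Hypothetical scene: the camera drives 1 m forward, so static points come 1 m closer.
ego = (0.0, (0.0, 0.0, -1.0))
car_motion = (0.0, (0.5, 0.0, 1.0))  # the car drifts right while driving 1 m forward
building = (2.0, 0.0, 20.0)
car = (-2.0, 0.0, 10.0)

print("building, ego-motion only:", next_pixel(building, ego))
print("car, ego-motion only:", next_pixel(car, ego))
print("car, ego-motion plus its own motion:", next_pixel(car, ego, car_motion))
```

Using only the ego-motion places the car at a different pixel than using the ego-motion together with the car's own transform, which is why a single transformation for the whole scene cannot explain frames that contain moving objects.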

We tested this method on both the KITTI and Cityscapes urban driving datasets, and found that it outperforms state-of-the-art approaches and comes close in quality to methods which used stereo pair videos as training supervision. Importantly, we are able to correctly recover the depth of a car moving at the same speed as the ego-motion vehicle. This has been challenging previously: in this case, the moving vehicle appears (in a monocular input) as static, exhibiting the same behavior as the static horizon, resulting in an inferred infinite depth. While stereo inputs can solve that ambiguity, our approach is the first one that is able to correctly infer that from a monocular input.
Previous work with monocular inputs was not able to extract moving objects and incorrectly mapped them to infinity.
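The infinite-depth failure follows directly from the geometry. The worked example below is a sketch under simplifying assumptions made here: a pinhole camera that moves straight forward, pixel coordinates measured from the principal point, and hypothetical scene values. For a static point, x = fX/Z before the move and x' = fX/(Z - t) after it, so Z = t * x' / (x' - x).

```python
import math


def depth_assuming_static(x_before, x_after, forward_motion_m):
    """Depth before the move, if the point is assumed static and the camera moves forward."""
    displacement = x_after - x_before
    if displacement == 0:
        return math.inf  # no apparent motion: the point behaves like the horizon
    return forward_motion_m * x_after / displacement


FOCAL_PX = 1000.0   # hypothetical focal length in pixels
EGO_FORWARD_M = 1.0  # hypothetical camera motion between two frames

# Static building at X = 2 m, Z = 20 m: it comes 1 m closer between frames.
building_before = FOCAL_PX * 2.0 / 20.0
building_after = FOCAL_PX * 2.0 / 19.0
print(depth_assuming_static(building_before, building_after, EGO_FORWARD_M))

# Car at X = -2 m, Z = 10 m that also drives 1 m forward: its relative position is unchanged.
car_before = FOCAL_PX * -2.0 / 10.0
car_after = FOCAL_PX * -2.0 / 10.0
print(depth_assuming_static(car_before, car_after, EGO_FORWARD_M))
```

The building is recovered at about 20 m, while the co-moving car, which does not shift in the image, is assigned infinite depth as long as it is treated as part of the static scene.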
Furthermore, since objects are treated individually in our method, the algorithm is able to provide motion vectors for each individual object, i.e., an estimate of where each object is heading:
Example depth results for a dynamic scene together with estimates of the motion vectors of the individual objects (rotation angles are estimated too, but for simplicity are not shown).
In addition to these results, this research provides motivation for further exploring what an unsupervised learning approach can achieve, as monocular inputs are cheaper and easier to deploy than stereo or LIDAR sensors. As can be seen in the figures below, in both the KITTI and Cityscapes datasets, the supervision sensor (be it stereo or LIDAR) is missing values and may occasionally be misaligned with the camera input, which happens due to time delay.
Depth prediction from monocular video input on the KITTI dataset (middle row), compared to ground truth depth from a LIDAR sensor; the latter does not cover the full scene and has missing and noisy values. Ground truth depth is not used during training.
Depth prediction on the Cityscapes dataset. Left to right: image, baseline, our method and ground truth provided by stereo. Note the missing values in the stereo ground truth. Also note that our algorithm is able to achieve these results without any ground truth depth supervision.
Our results also provide the best ego-motion estimates among state-of-the-art methods, which is crucial for autonomous robots, as ego-motion provides localization of the robots while moving in the environment. The video below shows results from our method, visualizing the speed and turning angle obtained from the inferred ego-motion. While the outputs of both depth and ego-motion are valid only up to a scalar, we can see that the method is able to estimate the relative speed of the car when slowing down and stopping.
Depth and ego-motion prediction. Follow the speed and the turning angle indicator to see the estimates when the car is taking a turn or stopping for a red light.
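Because the ego-motion is only known up to a scalar, speed can be read off only in relative terms. The sketch below shows one way to do that, assuming per-frame translation vectors are available; the normalization by the peak value and the sample translations are choices made for this illustration, not the visualization code behind the video.

```python
import math


def relative_speeds(translations):
    """Per-frame speed as a fraction of the fastest frame in the sequence."""
    norms = [math.sqrt(tx * tx + ty * ty + tz * tz) for tx, ty, tz in translations]
    peak = max(norms)
    if peak == 0:
        return [0.0] * len(norms)
    return [n / peak for n in norms]


# Hypothetical per-frame translations of a car that slows down and stops.
translations = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
# The same motion expressed with a different, unknown scale factor.
rescaled = [tuple(7.0 * c for c in t) for t in translations]

print(relative_speeds(translations))
print(relative_speeds(rescaled))
```

Both sequences give the same relative speeds, so slowing down and stopping remain visible even though the absolute speed is not recoverable from the monocular estimate alone.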
Transfer Across Domains
An important characteristic of a learning algorithm is its adaptability when moved to an unknown environment. In this work we further introduce an online refinement approach which continues to learn online while collecting new data. Below are examples of improvement of the estimated depth quality, after training on Cityscapes and online refinement on KITTI.
Online refinement when training on the Cityscapes data and testing on KITTI. The images show depth prediction of the trained model, and of the trained model with online refinement. Depth prediction with online refinement better outlines the objects in the scene.
We further tested on a notably different dataset and setting, i.e., on an indoor dataset collected by the Fetch robot, while the training is done on the outdoor urban driving Cityscapes dataset. As expected, there is a large discrepancy between these datasets. Despite this, we observe that the online learning technique is able to obtain better depth estimates than the baseline.
Results of online adaptation when transferring the learned model from Cityscapes (an outdoor dataset collected from a moving car) to a dataset collected indoors by the Fetch robot. The bottom row shows improved depth after applying online refinement.
In summary, this work addresses unsupervised learning of depth and ego-motion from a monocular camera, and tackles the problem in highly dynamic scenes. It achieves high-quality depth and ego-motion results, with quality comparable to stereo, and puts forward the idea of incorporating structure in the learning process. More notably, our proposed combination of unsupervised learning of depth and ego-motion from monocular video only and online adaptation demonstrates a powerful concept, because not only can it learn in an unsupervised manner from simple video, but it can also be transferred easily to other datasets.

This research was conducted by Vincent Casser, Soeren Pirk, Reza Mahjourian and Anelia Angelova. We would like to thank Ayzaan Wahid for his help with data collection and Martin Wicke and Vincent Vanhoucke for their support and encouragement.

Source: Google AI Blog