Author Archives: Research Blog

Google’s Workshop on AI/ML Research and Practice in India

Last month, Google Bangalore hosted the Workshop on Artificial Intelligence and Machine Learning, with the goal of fostering collaboration between the academic and industry research communities in India. This forum was designed to exchange current research and industry projects in AI & ML, and included faculty and researchers from Indian Institutes of Technology (IITs) and other leading universities in India, along with industry practitioners from Amazon, Delhivery, Flipkart, LinkedIn, Myntra, Microsoft, Ola and many more. Participants spoke on the ongoing research and work being undertaken in India in deep learning, computer vision, natural language processing, systems and generative models (you can access all the presentations from the workshop here).

Google’s Jeff Dean and Prabhakar Raghavan kicked off the workshop by sharing Google’s uses of deep learning to solve challenging problems and reinventing productivity using AI. Additional keynotes were delivered by Googlers Rajen Sheth and Roberto Bayardo. We also hosted a panel discussion on the challenges and future of AI/ML ecosystem in India, moderated by Google Bangalore’s Pankaj Gupta. Panel participants included Anirban Dasgupta (IIT Gandhinagar), Chiranjib Bhattacharyya of the Indian Institute of Science (IISc), Ashish Tendulkar and Srinivas Raaghav (Google India) and Shourya Roy (American Express Big Data Labs).
Prabhakar Raghavan’s keynote address
The workshop agenda included five broad sessions with presentations by attendees in the following areas:
Pankaj Gupta moderating the panel discussion
Summary and Next Steps
As in many countries around the world, we are seeing increased dialog on various aspects of AI and ML in multiple contexts in India. This workshop hosted 80 attendees representing 9 universities and 36 companies contributing 28 excellent talks, with many opportunities for discussing challenges and opportunities for AI/ML in India. Google will continue to foster this exchange of ideas across a diverse set of folks and applications. As part of this, we also announced the upcoming research awards round (applications due June 4) to support up to seven faculty members in India on their AI/ML research, and new work on an accelerator program for Indian entrepreneurs focused primarily on AI/ML technologies. Please keep an eye out for more information about these programs.

Introducing the CVPR 2018 On-Device Visual Intelligence Challenge

Over the past year, there have been exciting innovations in the design of deep networks for vision applications on mobile devices, such as the MobileNet model family and integer quantization. Many of these innovations have been driven by performance metrics that focus on meaningful user experiences in real-world mobile applications, requiring inference to be both low-latency and accurate. While the accuracy of a deep network model can be conveniently estimated with well established benchmarks in the computer vision community, latency is surprisingly difficult to measure and no uniform metric has been established. This lack of measurement platforms and uniform metrics have hampered the development of performant mobile applications.

Today, we are happy to announce the On-device Visual Intelligence Challenge (OVIC), part of the Low-Power Image Recognition Challenge Workshop at the 2018 Computer Vision and Pattern Recognition conference (CVPR2018). A collaboration with Purdue University, the University of North Carolina and IEEE, OVIC is a public competition for real-time image classification that uses state-of-the-art Google technology to significantly lower the barrier to entry for mobile development. OVIC provides two key features to catalyze innovation: a unified latency metric and an evaluation platform.

A Unified Metric
OVIC focuses on the establishment of a unified metric aligned directly with accurate and performant operation on mobile devices. The metric is defined as the number of correct classifications within a specified per-image average time limit of 33ms. This latency limit allows every frame in a live 30 frames-per-second video to be processed, thus providing a seamless user experience1. Prior to OVIC, it was tricky to enforce such a limit due to the difficulty in accurately and uniformly measuring latency as would be experienced in real-world applications on real-world devices. Without a repeatable mobile development platform, researchers have relied primarily on approximate metrics for latency that are convenient to compute, such as the number of multiply-accumulate operations (MACs). The intuition is that multiply-accumulate constitutes the most time-consuming operation in a deep neural network, so their count should be indicative of the overall latency. However, these metrics are often poor predictors of on-device latency due to many aspects of the models that can impact the average latency of each MAC in typical implementations.
Even though the number of multiply-accumulate operations (# MACs) is the most commonly used metric to approximate on-device latency, it is a poor predictor of latency. Using data from various quantized and floating point MobileNet V1 and V2 based models, this graph plots on-device latency on a common reference device versus the number of MACs. It is clear that models with similar latency can have very different MACs, and vice versa.
The graph above shows that while the number of MACs is correlated with the inference latency, there is significant variation in the mapping. Thus number of MACs is a poor proxy for latency, and since latency directly affects users’ experiences, we believe it is paramount to optimize latency directly rather than focusing on limiting the number of MACs as a proxy.

An Evaluation Platform
As mentioned above, a primary issue with latency is that it has previously been challenging to measure reliably and repeatably, due to variations in implementation, running environment and hardware architectures. Recent successes in mobile development overcome these challenges with the help of a convenient mobile development platform, including optimized kernels for mobile CPUs, light-weight portable model formats, increasingly capable mobile devices, and more. However, these various platforms have traditionally required resources and development capabilities that are only available to larger universities and industry.

With that in mind, we are releasing OVIC’s evaluation platform that includes a number of components designed to make mobile development and evaluations that can be replicated and compared accessible to the broader research community:
  • TOCO compiler for optimizing TensorFlow models for efficient inference
  • TensorFlow Lite inference engine for mobile deployment
  • A benchmarking SDK that can be run locally on any Android phone
  • Sample models to showcase successful mobile architectures that run inference in floating-point and quantized modes
  • Google’s benchmarking tool for reliable latency measurements on specific Pixel phones (available to registered contestants).
Using these tools available in OVIC, a participant can conveniently incorporate measurement of on-device latency into their design loop without having to worry about optimizing kernels, purchasing latency/power measurement devices, or designing the framework to drive them. The only requirement for entry is experiences with training computer vision models in TensorFlow, which can be found in this tutorial.

With OVIC, we encourage the entire research community to improve the classification performance of low-latency high-accuracy models towards new frontiers, as shown in the following graphic.
Sampling of current MobileNet mobile models illustrating the tradeoff between increased accuracy and reduced latency.
We cordially invite you to participate here before the deadline on June 15th, and help us discover new mobile vision architectures that will propel development into the future.

We would like to acknowledge our core contributors Achille Brighton, Alec Go, Andrew Howard, Hartwig Adam, Mark Sandler and Xiao Zhang. We would also like to acknowledge our external collaborators Alex Berg and Yung-Hsiang Lu. We give special thanks to Andre Hentz, Andrew Selle, Benoit Jacob, Brad Krueger, Dmitry Kalenichenko, Megan Cummins, Pete Warden, Rajat Monga, Shiyu Hu and Yicheng Fan.

1 Alternatively the same metric could encourage even lower power operation by only processing a subset of the images in the input stream.

DeepVariant Accuracy Improvements for Genetic Datatypes

Last December we released DeepVariant, a deep learning model that has been trained to analyze genetic sequences and accurately identify the differences, known as variants, that make us all unique. Our initial post focused on how DeepVariant approaches “variant calling” as an image classification problem, and is able to achieve greater accuracy than previous methods.

Today we are pleased to announce the launch of DeepVariant v0.6, which includes some major accuracy improvements. In this post we describe how we train DeepVariant, and how we were able to improve DeepVariant's accuracy for two common sequencing scenarios, whole exome sequencing and polymerase chain reaction sequencing, simply by adding representative data into DeepVariant's training process.

Many Types of Sequencing Data
Approaches to genomic sequencing vary depending on the type of DNA sample (e.g., from blood or saliva), how the DNA was processed (e.g., amplification techniques), which technology was used to sequence the data (e.g., instruments can vary even within the same manufacturer) and what section or how much of the genome was sequenced. These differences result in a very large number of sequencing "datatypes".

Typically, variant calling tools have been tuned for one specific datatype and perform relatively poorly on others. Given the extensive time and expertise involved in tuning variant callers for new datatypes, it seemed infeasible to customize each tool for every one. In contrast, with DeepVariant we are able to improve accuracy for new datatypes simply by including representative data in the training process, without negatively impacting overall performance.

Truth Sets for Variant Calling
Deep learning models depend on having high quality data for training and evaluation. In the field of genomics, the Genome in a Bottle (GIAB) consortium, which is hosted by the National Institute of Standards and Technology (NIST), produces human genomes for use in technology development, evaluation, and optimization. The benefit of working with GIAB benchmarking genomes is that their true sequence is known (at least to the extent currently possible). To achieve this, GIAB takes a single person's DNA and repeatedly sequences it using a wide variety of laboratory methods and sequencing technologies (i.e. many datatypes) and analyzes the resulting data using many different variant calling tools. A tremendous amount of work then follows to evaluate and adjudicate discrepancies to produce a high-confidence "truth set" for each genome.

The majority of DeepVariant’s training data is from the first benchmarking genome released by GIAB, HG001. The sample, from a woman of northern European ancestry, was made available as part of the International HapMap Project, the first large-scale effort to identify common patterns of human genetic variation. Because DNA from HG001 is commercially available and so well characterized, it is often the first sample used to test new sequencing technologies and variant calling tools. By using many replicates and different datatypes of HG001, we can generate millions of training examples which helps DeepVariant learn to accurately classify many datatypes, and even generalize to datatypes it has never seen before.

Improved Exome Model in v0.5
In the v0.5 release we formalized a benchmarking-compatible training strategy to withhold from training a complete sample, HG002, as well as any data from chromosome 20. HG002, the second benchmarking genome released by GIAB, is from a male of Ashkenazi Jewish ancestry. Testing on this sample, which differs in both sex and ethnicity from HG001, helps to ensure that DeepVariant is performing well for diverse populations. Additionally reserving chromosome 20 for testing guarantees that we can evaluate DeepVariant's accuracy for any datatype that has truth data available.

In v0.5 we also focused on exome data, which is the subset of the genome that directly codes for proteins. The exome is only ~1% of the whole human genome, so whole exome sequencing (WES) costs less than whole genome sequencing (WGS). The exome also harbors many variants of clinical significance which makes it useful for both researchers and clinicians. To increase exome accuracy we added a variety of WES datatypes, provided by DNAnexus, to DeepVariant's training data. The v0.5 WES model shows 43% fewer indel (insertion-deletion) errors and a 22% reduction in single nucleotide polymorphism (SNP) errors.
The total number of exome errors for HG002 across DeepVariant versions, broken down by indel errors (left) and SNP errors (right). Errors are either false positive (FP), colored yellow, or false negative (FN), colored blue. The largest accuracy jump is between v0.4 and v0.5, largely attributable to a reduction in indel FPs.
Improved Whole Genome Sequencing Model for PCR+ data in v0.6
Our newest release of DeepVariant, v0.6, focuses on improved accuracy for data that has undergone DNA amplification via polymerase chain reaction (PCR) prior to sequencing. PCR is an easy and inexpensive way to amplify very small quantities of DNA, and once sequenced results in what is known as PCR positive (PCR+) sequencing data. It is well known, however, that PCR can be prone to bias and errors, and non-PCR-based (or PCR-free) DNA preparation methods are increasingly common. DeepVariant's training data prior to the v0.6 release was exclusively PCR-free data, and PCR+ was one of the few datatypes for which DeepVariant had underperformed in external evaluations. By adding PCR+ examples to DeepVariant's training data, also provided by DNAnexus, we have seen significant accuracy improvements for this datatype, including a 60% reduction in indel errors.
DeepVariant v0.6 shows major accuracy improvements for PCR+ data, largely attributable to a reduction in indel errors. Here we re-analyze two PCR+ samples that were used in external evaluations, including DNAnexus on the left (see details in figure 10) and bcbio on the right, showing how indel accuracy improves with each DeepVariant version.
Independent evaluations of DeepVariant v0.6 from both DNAnexus and bcbio are also available. Their analyses support our findings of improved indel accuracy, and also include comparisons to other variant calling tools.

Looking Forward
We released DeepVariant as open source software to encourage collaboration and to accelerate the use of this technology to solve real world problems. As the pace of innovation in sequencing technologies continues to grow, including more clinical applications, we are optimistic that DeepVariant can be further extended to produce consistent and highly accurate results. We hope that researchers will use DeepVariant v0.6 to accelerate discoveries, and if there is a sequencing datatype that you would like to see us prioritize, please let us know.

An Augmented Reality Microscope for Cancer Detection

Applications of deep learning to medical disciplines including ophthalmology, dermatology, radiology, and pathology have recently shown great promise to increase both the accuracy and availability of high-quality healthcare to patients around the world. At Google, we have also published results showing that a convolutional neural network is able to detect breast cancer metastases in lymph nodes at a level of accuracy comparable to a trained pathologist. However, because direct tissue visualization using a compound light microscope remains the predominant means by which a pathologist diagnoses illness, a critical barrier to the widespread adoption of deep learning in pathology is the dependence on having a digital representation of the microscopic tissue.

Today, in a talk delivered at the Annual Meeting of the American Association for Cancer Research (AACR), with an accompanying paper “An Augmented Reality Microscope for Real-time Automated Detection of Cancer” (under review), we describe a prototype Augmented Reality Microscope (ARM) platform that we believe can possibly help accelerate and democratize the adoption of deep learning tools for pathologists around the world. The platform consists of a modified light microscope that enables real-time image analysis and presentation of the results of machine learning algorithms directly into the field of view. Importantly, the ARM can be retrofitted into existing light microscopes found in hospitals and clinics around the world using low-cost, readily-available components, and without the need for whole slide digital versions of the tissue being analyzed.
Modern computational components and deep learning models, such as those built upon TensorFlow, will allow a wide range of pre-trained models to run on this platform. As in a traditional analog microscope, the user views the sample through the eyepiece. A machine learning algorithm projects its output back into the optical path in real-time. This digital projection is visually superimposed on the original (analog) image of the specimen to assist the viewer in localizing or quantifying features of interest. Importantly, the computation and visual feedback updates quickly — our present implementation runs at approximately 10 frames per second, so the model output updates seamlessly as the user scans the tissue by moving the slide and/or changing magnification.
Left: Schematic overview of the ARM. A digital camera captures the same field of view (FoV) as the user and passes the image to an attached compute unit capable of running real-time inference of a machine learning model. The results are fed back into a custom AR display which is inline with the ocular lens and projects the model output on the same plane as the slide. Right: A picture of our prototype which has been retrofitted into a typical clinical-grade light microscope.
In principle, the ARM can provide a wide variety of visual feedback, including text, arrows, contours, heatmaps, or animations, and is capable of running many types of machine learning algorithms aimed at solving different problems such as object detection, quantification, or classification.

As a demonstration of the potential utility of the ARM, we configured it to run two different cancer detection algorithms: one that detects breast cancer metastases in lymph node specimens, and another that detects prostate cancer in prostatectomy specimens. These models can run at magnifications between 4-40x, and the result of a given model is displayed by outlining detected tumor regions with a green contour. These contours help draw the pathologist’s attention to areas of interest without obscuring the underlying tumor cell appearance.
Example view through the lens of the ARM. These images show examples of the lymph node metastasis model with 4x, 10x, 20x, and 40x microscope objectives.
While both cancer models were originally trained on images from a whole slide scanner with a significantly different optical configuration, the models performed remarkably well on the ARM with no additional re-training. For example, the lymph node metastasis model had an area-under-the-curve (AUC) of 0.98 and our prostate cancer model had an AUC of 0.96 for cancer detection in the field of view (FoV) when run on the ARM, only slightly decreased performance than obtained on WSI. We believe it is likely that the performance of these models can be further improved by additional training on digital images captured directly from the ARM itself.

We believe that the ARM has potential for a large impact on global health, particularly for the diagnosis of infectious diseases, including tuberculosis and malaria, in developing countries. Furthermore, even in hospitals that will adopt a digital pathology workflow in the near future, ARM could be used in combination with the digital workflow where scanners still face major challenges or where rapid turnaround is required (e.g. cytology, fluorescent imaging, or intra-operative frozen sections). Of course, light microscopes have proven useful in many industries other than pathology, and we believe the ARM can be adapted for a broad range of applications across healthcare, life sciences research, and material science. We’re excited to continue to explore how the ARM can help accelerate the adoption of machine learning for positive impact around the world.

Introducing Semantic Experiences with Talk to Books and Semantris

Natural language understanding has evolved substantially in the past few years, in part due to the development of word vectors that enable algorithms to learn about the relationships between words, based on examples of actual language usage. These vector models map semantically similar phrases to nearby points based on equivalence, similarity or relatedness of ideas and language. Last year, we used hierarchical vector models of language to make improvements to Smart Reply for Gmail. More recently, we’ve been exploring other applications of these methods.

Today, we are proud to share Semantic Experiences, a website showing two examples of how these new capabilities can drive applications that weren’t possible before. Talk to Books is an entirely new way to explore books by starting at the sentence level, rather than the author or topic level. Semantris is a word association game powered by machine learning, where you type out words associated with a given prompt. We have also published “Universal Sentence Encoder”, which describes the models used for these examples in more detail. Lastly, we’ve provided a pretrained semantic TensorFlow module for the community to experiment with their own sentence and phrase encoding.

Modeling approach
Our approach extends the idea of representing language in a vector space by creating vectors for larger chunks of language such as full sentences and small paragraphs. Since language is composed of hierarchies of concepts, we create the vectors using a hierarchy of modules, each of which considers features that correspond to sequences at different temporal scales. Relatedness, synonymy, antonymy, meronymy, holonymy, and many other types of relationships may all be represented in vector space language models if we train them in the right way and then pose the right “questions”. We describe this method in our paper, “Efficient Natural Language Response for Smart Reply.”

Talk to Books
With Talk to Books, we provide an entirely new way to explore books. You make a statement or ask a question, and the tool finds sentences in books that respond, with no dependence on keyword matching. In a sense you are talking to the books, getting responses which can help you determine if you’re interested in reading them or not.
Talk to Books
The models driving this experience were trained on a billion conversation-like pairs of sentences, learning to identify what a good response might look like. Once you ask your question (or make a statement), the tools searches all the sentences in over 100,000 books to find the ones that respond to your input based on semantic meaning at the sentence level; there are no predefined rules bounding the relationship between what you put in and the results you get.

This capability is unique and can help you find interesting books that a keyword search might not surface, but there’s still room for improvement. For example, this experiment works at the sentence level (rather than at the paragraph level, as in Smart Reply for Gmail) so a “good” matching sentence can still be taken out of context. You might find books and passages that you didn’t expect, or the reason a particular passage was highlighted might not be obvious. You may also notice that being well-known does not make a book sort to the top; this experiment looks only at how well the individual sentences match up. However, one benefit of this is that the tool may help people discover unexpected authors and titles, and surface books in a way that is fresh and innovative.

We are also providing Semantris, a word association game that is powered by this technology. When you enter a word or phrase, the game ranks all of the words on-screen, scoring them based on how well they respond to what you typed. Again, similarity, opposites and neighboring concepts are all fair-game using this semantic model. Try it out yourself to see what we mean! The time pressure in the Arcade version (shown below) will tempt you to enter in single words as prompts. The Blocks version has no time pressure, which makes it a great place to try out entering in phrases and sentences. You may enjoy exploring how obscure you can be with your hints.
Semantris Arcade
The examples we’re sharing today are just a few of the possible ways to think about experience and application design using these new tools. Other potential applications include classification, semantic similarity, semantic clustering, whitelist applications (selecting the right response from many alternatives), and semantic search (of which Talk to Books is an example). We hope you’ll come up with many more, inspired by these example applications. We look forward to seeing original and innovative uses of our TensorFlow models by the developer community.

Talk to Books was developed by Aaron Phillips, Amin Ahmad, Rachel Bernstein, Aaron Cohen, Noah Constant, Ray Kurzweil, Igor Krivokon, Vladimir Magay, Peter McKenzie, Bryan Richter, Chris Tar, and Dave Uthus. Semantris was developed by Ben Pietrzak, RJ Mical, Steve Pucci, Maria Voitovich, Mo Adeleye, Diana Huang, Catherine McCurry, Tomomi Sohn, and Connor Moore. We'd also like to acknowledge Hallie Benjamin, Eric Breck, Mario Guajardo-Céspedes, Yoni Halpern, Margaret Mitchell, Ben Packer, Andrew Smart and Lucy Vasserman.

Seeing More with In Silico Labeling of Microscopy Images

In the fields of biology and medicine, microscopy allows researchers to observe details of cells and molecules which are unavailable to the naked eye. Transmitted light microscopy, where a biological sample is illuminated on one side and imaged, is relatively simple and well-tolerated by living cultures but produces images which can be difficult to properly assess. Fluorescence microscopy, in which biological objects of interest (such as cell nuclei) are specifically targeted with fluorescent molecules, simplifies analysis but requires complex sample preparation. With the increasing application of machine learning to the field of microscopy, including algorithms used to automatically assess the quality of images and assist pathologists diagnosing cancerous tissue, we wondered if we could develop a deep learning system that could combine the benefits of both microscopy techniques while minimizing the downsides.

With “In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images”, appearing today in Cell, we show that a deep neural network can predict fluorescence images from transmitted light images, generating labeled, useful, images without modifying cells and potentially enabling longitudinal studies in unmodified cells, minimally invasive cell screening for cell therapies, and investigations using large numbers of simultaneous labels. We also open sourced our network, along with the complete training and test data, a trained model checkpoint, and example code.

Transmitted light microscopy techniques are easy to use, but can produce images in which it can be hard to tell what’s going on. An example is the following image from a phase-contrast microscope, in which the intensity of a pixel indicates the degree to which light was phase-shifted as it passed through the sample.
Transmitted light (phase-contrast) image of a human motor neuron culture derived from induced pluripotent stem cells. Outset 1 shows a cluster of cells, possibly neurons. Outset 2 shows a flaw in the image obscuring underlying cells. Outset 3 shows neurites. Outset 4 shows what appear to be dead cells. Scale bar is 40 μm. Source images for this and the following figures come from the Finkbeiner lab at the Gladstone Institutes.
In the above figure, it’s difficult to tell how many cells are in the cluster in Outset 1, or the locations and states of the cells in Outset 4 (hint: there’s a barely-visible flat cell in the upper-middle). It’s also difficult to get fine structures consistently in focus, such as the neurites in Outset 3.

We can get more information out of transmitted light microscopy by acquiring images in z-stacks: sets of images registered in (x, y) where z (the distance from the camera) is systematically varied. This causes different parts of the cells to come in and out of focus, which provides information about a sample’s 3D structure. Unfortunately, it often takes a trained eye to make sense of the z-stack, and analysis of such z-stacks has largely defied automation. An example z-stack is shown below.
A phase-contrast z-stack of the same cells. Note how the appearance changes as the focus is shifted. Now we can see that the fuzzy shape in the lower right of Outset 1 is a single oblong cell, and that the rightmost cell in Outset 4 is taller than the uppermost cell, possibly indicating that it has undergone programmed cell death.
In contrast, fluorescence microscopy images are easier to analyze, because samples are prepared with carefully engineered fluorescent labels which light up just what the researchers want to see. For example, most human cells have exactly one nucleus, so a nuclear label (such as the blue one below) makes it possible for simple tools to find and count cells in an image.
Fluorescence microscopy image of the same cells. The blue fluorescent label localizes to DNA, highlighting cell nuclei. The green fluorescent label localizes to a protein found only in dendrites, a neural substructure. The red fluorescent label localizes to a protein found only in axons, another neural substructure. With these labels it is much easier to understand what’s happening in the sample. For example, the green and red labels in Outset 1 confirm this is a neural cluster. The red label in Outset 3 shows that the neurites are axons, not dendrites. The upper-left blue dot in Outset 4 reveals a previously hard-to-see nucleus, and the lack of a blue dot for the cell at the left shows it to be DNA-free cellular debris.
However, fluorescence microscopy can have significant downsides. First, there is the complexity and variability introduced by the sample preparation and fluorescent labels themselves. Second, when there are many different fluorescent labels in a sample, spectral overlap can make it hard to tell which color belongs to which label, typically limiting researchers to three or four simultaneous labels in a sample. Third, fluorescence labeling may be toxic to cells and sometimes involves protocols that outright kill them, which makes labeling difficult to use in longitudinal studies where the same cells are followed through time.

Seeing more with deep learning
In our paper, we show that a deep neural network can predict fluorescence images from transmitted light z-stacks. To do this, we created a dataset of transmitted light z-stacks matched to fluorescence images and trained a neural network to predict the fluorescence images from the z-stacks. The following diagram explains the process.
Overview of our system. (A) The dataset of training examples: pairs of transmitted light images from z-stacks with pixel-registered sets of fluorescence images of the same scene. Several different fluorescent labels were used to generate fluorescence images and were varied between training examples; the checkerboard images indicate fluorescent labels which were not acquired for a given example. (B) The untrained deep network was (C) trained on the data A. (D) A z-stack of images of a novel scene. (E) The trained network, C, is used to predict fluorescence labels learned from A for each pixel in the novel images, D.
In the course of this work we developed a new neural network composed of three kinds of basic building-blocks, inspired by the modular design of Inception: an in-scale configuration which does not change the spatial scaling of the features, a down-scale configuration which doubles the spatial scaling, and an up-scale configuration which halves it. This lets us break the hard problem of network architecture design into two easier problems: the arrangement of the building-blocks (macro-architecture), and the design of the building-blocks themselves (micro-architecture). We solved the first problem using design principles discussed in the paper, and the second via an automated search powered by Google Hypertune.

To make sure our method was sound, we validated our model using data from an Alphabet lab as well as two external partners: Steve Finkbeiner's lab at the Gladstone Institutes, and the Rubin Lab at Harvard. These data spanned three transmitted light imaging modalities (bright-field, phase-contrast, and differential interference contrast) and three culture types (human motor neurons derived from induced pluripotent stem cells, rat cortical cultures, and human breast cancer cells). We found that our method can accurately predict several labels including those for nuclei, cell type (e.g. neural), and cell state (e.g. cell death). The following figure shows the model’s predictions alongside the transmitted light input and fluorescence ground-truth for our motor neuron example.
Animation showing the same cells in transmitted light and fluorescence imaging, along with predicted fluorescence labels from our model. Outset 2 shows the model predicts the correct labels despite the artifact in the input image. Outset 3 shows the model infers these processes are axons, possibly because of their distance from the nearest cells. Outset 4 shows the model sees the hard-to-see cell at the top, and correctly identifies the object at the left as DNA-free cell debris.
Try it for yourself!
We’ve open sourced our model, along with our full dataset, code for training and inference, and an example. We’re pleased to report that new labels can be learned with minimal additional training data: in the paper and example code, we show a new label may be learned from a single image. This is due to a phenomenon called transfer learning, where a model can learn a new task more quickly and using less training data if it has already mastered similar tasks.

We hope the ability to generate labeled, useful, images without modifying cells will open up completely new kinds of experiments in biology and medicine. If you’re excited to try this technology in your own work, please read the paper or check out the code!

We thank the Google Accelerated Science team for originating and developing this project and its publication, and additionally Kevin P. Murphy for supporting its publication. We thank Mike Ando, Youness Bennani, Amy Chung-Yu Chou, Jason Freidenfelds, Jason Miller, Kevin P. Murphy, Philip Nelson, Patrick Riley, and Samuel Yang for ideas and editing help with this post. This study was supported by NINDS (NS091046, NS083390, NS101995), the NIH’s National Institute on Aging (AG065151, AG058476), the NIH’s National Human Genome Research Institute (HG008105), Google, the ALS Association, and the Michael J. Fox Foundation.

Looking to Listen: Audio-Visual Speech Separation

People are remarkably good at focusing their attention on a particular person in a noisy environment, mentally “muting” all other voices and sounds. Known as the cocktail party effect, this capability comes natural to us humans. However, automatic speech separation — separating an audio signal into its individual speech sources — while a well-studied problem, remains a significant challenge for computers.

In “Looking to Listen at the Cocktail Party”, we present a deep learning audio-visual model for isolating a single speech signal from a mixture of sounds such as other voices and background noise. In this work, we are able to computationally produce videos in which speech of specific people is enhanced while all other sounds are suppressed. Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context. We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking.
A unique aspect of our technique is in combining both the auditory and visual signals of an input video to separate the speech. Intuitively, movements of a person’s mouth, for example, should correlate with the sounds produced as that person is speaking, which in turn can help identify which parts of the audio correspond to that person. The visual signal not only improves the speech separation quality significantly in cases of mixed speech (compared to speech separation using audio alone, as we demonstrate in our paper), but, importantly, it also associates the separated, clean speech tracks with the visible speakers in the video.
The input to our method is a video with one or more people speaking, where the speech of interest is interfered by other speakers and/or background noise. The output is a decomposition of the input audio track into clean speech tracks, one for each person detected in the video.
An Audio-Visual Speech Separation Model
To generate training examples, we started by gathering a large collection of 100,000 high-quality videos of lectures and talks from YouTube. From these videos, we extracted segments with a clean speech (e.g. no mixed music, audience sounds or other speakers) and with a single speaker visible in the video frames. This resulted in roughly 2000 hours of video clips, each of a single person visible to the camera and talking with no background interference. We then used this clean data to generate “synthetic cocktail parties” -- mixtures of face videos and their corresponding speech from separate video sources, along with non-speech background noise we obtained from AudioSet.

Using this data, we were able to train a multi-stream convolutional neural network-based model to split the synthetic cocktail mixture into separate audio streams for each speaker in the video. The input to the network are visual features extracted from the face thumbnails of detected speakers in each frame, and a spectrogram representation of the video’s soundtrack. During training, the network learns (separate) encodings for the visual and auditory signals, then it fuses them together to form a joint audio-visual representation. With that joint representation, the network learns to output a time-frequency mask for each speaker. The output masks are multiplied by the noisy input spectrogram and converted back to a time-domain waveform to obtain an isolated, clean speech signal for each speaker. For full details, see our paper.
Our multi-stream, neural network-based model architecture.
Here are some more speech separation and enhancement results by our method. Sound by others than the selected speakers can be entirely suppressed or suppressed to the desired level.
To highlight the utilization of visual information by our model, we took two different parts from the same video of Google’s CEO, Sundar Pichai, and placed them side by side. It is very difficult to perform speech separation in this scenario using only characteristic speech frequencies contained in the audio, however our audio-visual model manages to properly separate the speech even in this challenging case.
Application to Speech Recognition
Our method can also potentially be used as a pre-process for speech recognition and automatic video captioning. Handling overlapping speakers is a known challenge for automatic captioning systems, and separating the audio to the different sources could help in presenting more accurate and easy-to-read captions.
You can similarly see and compare the captions before and after speech separation in all the other videos in this post and on our website, by turning on closed captions in the YouTube player when playing the videos (“cc” button at the lower right corner of the player).

On our project web page you can find more results, as well as comparisons with state-of-the-art audio-only speech separation and with other recent audio-visual speech separation work.
We envision a wide range of applications for this technology. We are currently exploring opportunities for incorporating it into various Google products. Stay tuned!

The research described in this post was done by Ariel Ephrat (as an intern), Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, Bill Freeman and Michael Rubinstein. We would like to thank Yossi Matias and Google Research Israel for their support for the project, and John Hershey for his valuable feedback. We also thank Arkady Ziefman for his help with animations and figures, and Rachel Soh for helping us procure permissions for video content in our results.

Announcing the 2018 Google PhD Fellows for North America, Europe and the Middle East

Google created the PhD Fellowship program in 2009 to recognize and support outstanding graduate students doing exceptional research in Computer Science and related disciplines. Now in its ninth year, our fellowship program has supported hundreds of future faculty, industry researchers, innovators and entrepreneurs.

Reflecting our continuing commitment to supporting and building relationships with the academic community, we are excited to announce the 39 recipients from North America, Europe and the Middle East. We offer our sincere congratulations to the 2018 Google PhD Fellows.

Algorithms, Optimizations and Markets
Emmanouil Zampetakis, Massachusetts Institute of Technology
Manuela Fischer, ETH Zurich CS
Thodoris Lykouris, Cornell University
Yuan Deng, Duke University

Computational Neuroscience
Ella Batty, Columbia University
Neha Spenta Wadia, University of California, Berkeley
Reuben Feinman, New York University

Human-Computer Interaction
Gierad Laput, Carnegie Mellon University
Mike Schaekermann, University of Waterloo
Minsuk (Brian) Kahng, Georgia Tech

Machine Learning
Aditi Raghunathan, Stanford University
Lin Chen, Yale University
Qian Yu, University of Southern California
Ravid Shwartz-Ziv, The Hebrew University of Jerusalem
Shuang Liu, University of California, San Diego
Stephen Tu, University of California, Berkeley
Xinchen Yan, University of Michigan, Ann Arbor
Zelda Mariet, Massachusetts Institute of Technology

Mobile Computing
Shilin Zhu, University of California, San Diego

Machine Perception, Speech Technology and Computer Vision
Antoine Miech, INRIA
Arsha Nagrani, University of Oxford (ES)
Joseph Redmon, University of Washington
Raymond Yeh, University of Illinois, Urbana-Champaign
Shanmukha Ramakrishna Vedantam, Georgia Tech

Natural Language Processing
Anne Cocos, University of Pennsylvania
Jonathan Herzig, Tel-Aviv University
Rotem Dror, Technion - Israel Institute of Technology
Yang Liu, The University of Edinburgh
Yoon Kim, Harvard University

Privacy and Security
Aayush Jain, University of California, Los Angeles

Programming Technology and Software Engineering
Gowtham Kaki, Purdue University, West Lafayette
Reyhaneh Jabbarvand, University of California, Irvine
Victor Lanvin, Fondation Sciences Mathématiques de Paris

Quantum Computing
Erika Ye, California Institute of Technology

Structured Data and Database Management
Lingjiao Chen, University of Wisconsin

Systems and Networking
Andrea Lattuada, ETH Zurich CS
Lana Josipović, EPFL CS
Michael Schaarschmidt, University of Cambridge - Computer Laboratory
Rachee Singh, University of Massachusetts, Amherst

MobileNetV2: The Next Generation of On-Device Computer Vision Networks

Last year we introduced MobileNetV1, a family of general purpose computer vision neural networks designed with mobile devices in mind to support classification, detection and more. The ability to run deep networks on personal mobile devices improves user experience, offering anytime, anywhere access, with additional benefits for security, privacy, and energy consumption. As new applications emerge allowing users to interact with the real world in real time, so does the need for ever more efficient neural networks.

Today, we are pleased to announce the availability of MobileNetV2 to power the next generation of mobile vision applications. MobileNetV2 is a significant improvement over MobileNetV1 and pushes the state of the art for mobile visual recognition including classification, object detection and semantic segmentation. MobileNetV2 is released as part of TensorFlow-Slim Image Classification Library, or you can start exploring MobileNetV2 right away in coLaboratory. Alternately, you can download the notebook and explore it locally using Jupyter. MobileNetV2 is also available as modules on TF-Hub, and pretrained checkpoints can be found on github.

MobileNetV2 builds upon the ideas from MobileNetV1 [1], using depthwise separable convolution as efficient building blocks. However, V2 introduces two new features to the architecture: 1) linear bottlenecks between the layers, and 2) shortcut connections between the bottlenecks1. The basic structure is shown below.
Overview of MobileNetV2 Architecture. Blue blocks represent composite convolutional building blocks as shown above.
The intuition is that the bottlenecks encode the model’s intermediate inputs and outputs while the inner layer encapsulates the model’s ability to transform from lower-level concepts such as pixels to higher level descriptors such as image categories. Finally, as with traditional residual connections, shortcuts enable faster training and better accuracy. You can learn more about the technical details in our paper, “MobileNet V2: Inverted Residuals and Linear Bottlenecks”.

How does it compare to the first generation of MobileNets?
Overall, the MobileNetV2 models are faster for the same accuracy across the entire latency spectrum. In particular, the new models use 2x fewer operations, need 30% fewer parameters and are about 30-40% faster on a Google Pixel phone than MobileNetV1 models, all while achieving higher accuracy.
MobileNetV2 improves speed (reduced latency) and increased ImageNet Top 1 accuracy
MobileNetV2 is a very effective feature extractor for object detection and segmentation. For example, for detection when paired with the newly introduced SSDLite [2] the new model is about 35% faster with the same accuracy than MobileNetV1. We have open sourced the model under the Tensorflow Object Detection API [4].

Mobile CPU
MobileNetV1 + SSDLite

To enable on-device semantic segmentation, we employ MobileNetV2 as a feature extractor in a reduced form of DeepLabv3 [3], that was announced recently. On the semantic segmentation benchmark, PASCAL VOC 2012, our resulting model attains a similar performance as employing MobileNetV1 as feature extractor, but requires 5.3 times fewer parameters and 5.2 times fewer operations in terms of Multiply-Adds.

MobileNetV1 + DeepLabV3

As we have seen MobileNetV2 provides a very efficient mobile-oriented model that can be used as a base for many visual recognition tasks. We hope by sharing it with the broader academic and open-source community we can help to advance research and application development.

We would like to acknowledge our core contributors Menglong Zhu, Andrey Zhmoginov and Liang-Chieh Chen. We also give special thanks to Bo Chen, Dmitry Kalenichenko, Skirmantas Kligys, Mathew Tang, Weijun Wang, Benoit Jacob, George Papandreou and Hartwig Adam.

  1. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H, arXiv:1704.04861, 2017.
  2. MobileNetV2: Inverted Residuals and Linear Bottlenecks, Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC. arXiv preprint. arXiv:1801.04381, 2018.
  3. Rethinking Atrous Convolution for Semantic Image Segmentation, Chen LC, Papandreou G, Schroff F, Adam H. arXiv:1706.05587, 2017.
  4. Speed/accuracy trade-offs for modern convolutional object detectors, Huang J, Rathod V, Sun C, Zhu M, Korattikara A, Fathi A, Fischer I, Wojna Z, Song Y, Guadarrama S, Murphy K, CVPR 2017.
  5. Deep Residual Learning for Image Recognition, He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. arXiv:1512.03385,2015

1 The shortcut (also known as skip) connections, popularized by ResNets[5] are commonly used to connect the non-bottleneck layers. MobilenNetV2 inverts this notion and connects the bottlenecks directly.

Investing in France’s AI Ecosystem

Recently, we announced the launch of a new AI research team in our Paris office. And today DeepMind has also announced a new AI research presence in Paris. We are excited about expanding Google’s research presence in Europe, which bolsters the efforts of the existing groups in our Zürich and London offices. As strong supporters of academic research, we are also excited to foster collaborations with France’s vibrant academic ecosystem.

Our research teams in Paris will focus on fundamental AI research, as well as important applications of these ideas to areas such as Health, Science or Arts. They will publish and open-source their results to advance the state-of-the-art in core areas such as Deep Learning and Reinforcement Learning.

Our approach to research is based on building a strong connection with the academic community; contributing to training the next generation of scientists and establishing a bridge between academic and industrial research. We believe that both objectives are key to fostering a healthy research ecosystem that will flourish in the long term. These ideas are very much aligned with some of the recommendations that Fields Medalist and member of French Parliament Cédric Villani is putting forward in his report on AI to the French government.

As we expand our teams in France, we have several initiatives that illustrate our commitment to these goals:
  • We are sponsoring “Artificial Intelligence and Visual Computing” Chair at École Polytechnique (one of the leading higher education institutions in France) which will support their education initiatives in AI
  • We just established a partnership with INRIA for conducting collaborative research projects
  • We are funding academic research with unrestricted grants mostly dedicated to the support of PhD and postdoc positions through our Faculty Research Awards and PhD Fellowship programs, as well as our Focused Research Awards. As one example, we have recently funded a project on large scale optimization of neural networks led by Francis Bach (INRIA and ENS) and Alexandre d’Aspremont (CNRS and ENS)
  • We are working on offering CIFRE PhD positions (joint PhD positions between Google and an academic lab) as well as internships for PhD students
Additionally, we are pleased to announce that one of the world’s leading experts in computer vision, Cordelia Schmid, will begin a dual appointment at INRIA and Google Paris. These kind of appointments, together with our Visiting Faculty program, are a great way to share ideas and research challenges, and utilize Google's world-class computing infrastructure to explore new projects at industrial scale.

France has a long tradition of research and educational excellence, and has a very dynamic and active machine learning community. This makes it a great place to pursue our goal of building AI-enabled technologies that can benefit everyone, through fundamental advances in machine learning and related fields.