Author Archives: Research Blog

MultiModel: Multi-Task Machine Learning Across Domains

Over the last decade, the application and performance of Deep Learning has progressed at an astonishing rate. However, the current state of the field is that the neural network architectures are highly specialized to specific domains of application. An important question remains unanswered: Will a convergence between these domains facilitate a unified model capable of performing well across multiple domains?

Today, we present MultiModel, a neural network architecture that draws from the success of vision, language and audio networks to simultaneously solve a number of problems spanning multiple domains, including image recognition, translation and speech recognition. While strides have been made in this direction before, namely in Google’s Multilingual Neural Machine Translation System used in Google Translate, MultiModel is a first step towards the convergence of vision, audio and language understanding into a single network.

The inspiration for how MultiModel handles multiple domains comes from how the brain transforms sensory input from different modalities (such as sound, vision or taste), into a single shared representation and back out in the form of language or actions. As an analog to these modalities and the transformations they perform, MultiModel has a number of small modality-specific sub-networks for audio, images, or text, and a shared model consisting of an encoder, input/output mixer and decoder, as illustrated below.
MultiModel architecture: small modality-specific sub-networks work with a shared encoder, I/O mixer and decoder. Each petal represents a modality, transforming to and from the internal representation.
We demonstrate that MultiModel is capable of learning eight different tasks simultaneously: it can detect objects in images, provide captions, recognize speech, translate between four pairs of languages, and do grammatical constituency parsing at the same time. The input is given to the model together with a very simple signal that determines which output we are requesting. Below we illustrate a few examples taken from a MultiModel trained jointly on these eight tasks1:
When designing MultiModel it became clear that certain elements from each domain of research (vision, language and audio) were integral to the model’s success in related tasks. We demonstrate that these computational primitives (such as convolutions, attention, or mixture-of-experts layers) clearly improve performance on their originally intended domain of application, while not hindering MultiModel’s performance on other tasks. It is not only possible to achieve good performance while training jointly on multiple tasks, but on tasks with limited quantities of data, the performance actually improves. To our surprise, this happens even if the tasks come from different domains that would appear to have little in common, e.g., an image recognition task can improve performance on a language task.

It is important to note that while MultiModel does not establish new performance records, it does provide insight into the dynamics of multi-domain multi-task learning in neural networks, and the potential for improved learning on data-limited tasks by the introduction of auxiliary tasks. There is a longstanding saying in machine learning: “the best regularizer is more data”; in MultiModel, this data can be sourced across domains, and consequently can be obtained more easily than previously thought. MultiModel provides evidence that training in concert with other tasks can lead to good results and improve performance on data-limited tasks.

Many questions about multi-domain machine learning remain to be studied, and we will continue to work on tuning Multimodel and improving its performance. To allow this research to progress quickly, we open-sourced MultiModel as part of the Tensor2Tensor library. We believe that such synergetic models trained on data from multiple domains will be the next step in deep learning and will ultimately solve tasks beyond the reach of current narrowly trained networks.

This work is a collaboration between Googlers Łukasz Kaiser, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones and Jakob Uszkoreit, and Aidan N. Gomez from the University of Toronto. It was performed while Aidan was working with the Google Brain team.

1 The 8 tasks were: (1) speech recognition (WSJ corpus), (2) image classification (ImageNet), (3) image captioning (MS COCO), (4) parsing (Penn Treebank), (5) English-German translation, (6) German-English translation, (7) English-French translation, (8) French-English translation (all using WMT data-sets).

Accelerating Deep Learning Research with the Tensor2Tensor Library

Deep Learning (DL) has enabled the rapid advancement of many useful technologies, such as machine translation, speech recognition and object detection. In the research community, one can find code open-sourced by the authors to help in replicating their results and further advancing deep learning. However, most of these DL systems use unique setups that require significant engineering effort and may only work for a specific problem or architecture, making it hard to run new experiments and compare the results.

Today, we are happy to release Tensor2Tensor (T2T), an open-source system for training deep learning models in TensorFlow. T2T facilitates the creation of state-of-the art models for a wide variety of ML applications, such as translation, parsing, image captioning and more, enabling the exploration of various ideas much faster than previously possible. This release also includes a library of datasets and models, including the best models from a few recent papers (Attention Is All You Need, Depthwise Separable Convolutions for Neural Machine Translation and One Model to Learn Them All) to help kick-start your own DL research.

Translation Model
Training time
BLEU (difference from baseline)
Transformer (T2T)
3 days on 8 GPU
28.4 (+7.8)
SliceNet (T2T)
6 days on 32 GPUs
26.1 (+5.5)
1 day on 64 GPUs
26.0 (+5.4)
18 days on 1 GPU
25.1 (+4.5)
1 day on 96 GPUs
24.6 (+4.0)
8 days on 32 GPUs
23.8 (+3.2)
MOSES (phrase-based baseline)
20.6 (+0.0)
BLEU scores (higher is better) on the standard WMT English-German translation task.
As an example of the kind of improvements T2T can offer, we applied the library to machine translation. As you can see in the table above, two different T2T models, SliceNet and Transformer, outperform the previous state-of-the-art, GNMT+MoE. Our best T2T model, Transformer, is 3.8 points better than the standard GNMT model, which itself was 4 points above the baseline phrase-based translation system, MOSES. Notably, with T2T you can approach previous state-of-the-art results with a single GPU in one day: a small Transformer model (not shown above) gets 24.9 BLEU after 1 day of training on a single GPU. Now everyone with a GPU can tinker with great translation models on their own: our github repo has instructions on how to do that.

Modular Multi-Task Training
The T2T library is built with familiar TensorFlow tools and defines multiple pieces needed in a deep learning system: data-sets, model architectures, optimizers, learning rate decay schemes, hyperparameters, and so on. Crucially, it enforces a standard interface between all these parts and implements current ML best practices. So you can pick any data-set, model, optimizer and set of hyperparameters, and run the training to check how it performs. We made the architecture modular, so every piece between the input data and the predicted output is a tensor-to-tensor function. If you have a new idea for the model architecture, you don’t need to replace the whole setup. You can keep the embedding part and the loss and everything else, just replace the model body by your own function that takes a tensor as input and returns a tensor.

This means that T2T is flexible, with training no longer pinned to a specific model or dataset. It is so easy that even architectures like the famous LSTM sequence-to-sequence model can be defined in a few dozen lines of code. One can also train a single model on multiple tasks from different domains. Taken to the limit, you can even train a single model on all data-sets concurrently, and we are happy to report that our MultiModel, trained like this and included in T2T, yields good results on many tasks even when training jointly on ImageNet (image classification), MS COCO (image captioning), WSJ (speech recognition), WMT (translation) and the Penn Treebank parsing corpus. It is the first time a single model has been demonstrated to be able to perform all these tasks at once.

Built-in Best Practices
With this initial release, we also provide scripts to generate a number of data-sets widely used in the research community1, a handful of models2, a number of hyperparameter configurations, and a well-performing implementation of other important tricks of the trade. While it is hard to list them all, if you decide to run your model with T2T you’ll get for free the correct padding of sequences and the corresponding cross-entropy loss, well-tuned parameters for the Adam optimizer, adaptive batching, synchronous distributed training, well-tuned data augmentation for images, label smoothing, and a number of hyper-parameter configurations that worked very well for us, including the ones mentioned above that achieve the state-of-the-art results on translation and may help you get good results too.

As an example, consider the task of parsing English sentences into their grammatical constituency trees. This problem has been studied for decades and competitive methods were developed with a lot of effort. It can be presented as a sequence-to-sequence problem and be solved with neural networks, but it used to require a lot of tuning. With T2T, it took us only a few days to add the parsing data-set generator and adjust our attention transformer model to train on this problem. To our pleasant surprise, we got very good results in only a week:

Parsing Model
F1 score (higher is better)
Transformer (T2T)
Dyer et al.
Zhu et al.
Socher et al.
Vinyals & Kaiser et al.
Parsing F1 scores on the standard test set, section 23 of the WSJ. We only compare here models trained discriminatively on the Penn Treebank WSJ training set, see the paper for more results.

Contribute to Tensor2Tensor
In addition to exploring existing models and data-sets, you can easily define your own model and add your own data-sets to Tensor2Tensor. We believe the already included models will perform very well for many NLP tasks, so just adding your data-set might lead to interesting results. By making T2T modular, we also make it very easy to contribute your own model and see how it performs on various tasks. In this way the whole community can benefit from a library of baselines and deep learning research can accelerate. So head to our github repository, try the new models, and contribute your own!

The release of Tensor2Tensor was only possible thanks to the widespread collaboration of many engineers and researchers. We want to acknowledge here the core team who contributed (in alphabetical order): Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, Jakob Uszkoreit, Ashish Vaswani.

1 We include a number of datasets for image classification (MNIST, CIFAR-10, CIFAR-100, ImageNet), image captioning (MS COCO), translation (WMT with multiple languages including English-German and English-French), language modelling (LM1B), parsing (Penn Treebank), natural language inference (SNLI), speech recognition (TIMIT), algorithmic problems (over a dozen tasks from reversing through addition and multiplication to algebra) and we will be adding more and welcome your data-sets too.

2 Including LSTM sequence-to-sequence RNNs, convolutional networks also with separable convolutions (e.g., Xception), recently researched models like ByteNet or the Neural GPU, and our new state-of-the-art models mentioned in this post that we will be actively updating in the repository.

Supercharge your Computer Vision models with the TensorFlow Object Detection API

(Cross-posted on the Google Open Source Blog)

At Google, we develop flexible state-of-the-art machine learning (ML) systems for computer vision that not only can be used to improve our products and services, but also spur progress in the research community. Creating accurate ML models capable of localizing and identifying multiple objects in a single image remains a core challenge in the field, and we invest a significant amount of time training and experimenting with these systems.
Detected objects in a sample image (from the COCO dataset) made by one of our models. Image credit: Michael Miley, original image.
Last October, our in-house object detection system achieved new state-of-the-art results, and placed first in the COCO detection challenge. Since then, this system has generated results for a number of research publications1,2,3,4,5,6,7 and has been put to work in Google products such as NestCam, the similar items and style ideas feature in Image Search and street number and name detection in Street View.

Today we are happy to make this system available to the broader research community via the TensorFlow Object Detection API. This codebase is an open-source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models. Our goals in designing this system was to support state-of-the-art models while allowing for rapid exploration and research. Our first release contains the following:
The SSD models that use MobileNet are lightweight, so that they can be comfortably run in real time on mobile devices. Our winning COCO submission in 2016 used an ensemble of the Faster RCNN models, which are are more computationally intensive but significantly more accurate. For more details on the performance of these models, see our CVPR 2017 paper.

Are you ready to get started?
We’ve certainly found this code to be useful for our computer vision needs, and we hope that you will as well. Contributions to the codebase are welcome and please stay tuned for our own further updates to the framework. To get started, download the code here and try detecting objects in some of your own images using the Jupyter notebook, or training your own pet detector on Cloud ML engine!

The release of the Tensorflow Object Detection API and the pre-trained model zoo has been the result of widespread collaboration among Google researchers with feedback and testing from product groups. In particular we want to highlight the contributions of the following individuals:

Core Contributors: Derek Chow, Chen Sun, Menglong Zhu, Matthew Tang, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Jasper Uijlings, Viacheslav Kovalevskyi, Kevin Murphy

Also special thanks to: Andrew Howard, Rahul Sukthankar, Vittorio Ferrari, Tom Duerig, Chuck Rosenberg, Hartwig Adam, Jing Jing Long, Victor Gomes, George Papandreou, Tyler Zhu

  1. Speed/accuracy trade-offs for modern convolutional object detectors, Huang et al., CVPR 2017 (paper describing this framework)
  2. Towards Accurate Multi-person Pose Estimation in the Wild, Papandreou et al., CVPR 2017
  3. YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video, Real et al., CVPR 2017 (see also our blog post)
  4. Beyond Skip Connections: Top-Down Modulation for Object Detection, Shrivastava et al., arXiv preprint arXiv:1612.06851, 2016
  5. Spatially Adaptive Computation Time for Residual Networks, Figurnov et al., CVPR 2017
  6. AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions, Gu et al., arXiv preprint arXiv:1705.08421, 2017
  7. MobileNets: Efficient convolutional neural networks for mobile vision applications, Howard et al., arXiv preprint arXiv:1704.04861, 2017

MobileNets: Open-Source Models for Efficient On-Device Vision

Deep learning has fueled tremendous progress in the field of computer vision in recent years, with neural networks repeatedly pushing the frontier of visual recognition technology. While many of those technologies such as object, landmark, logo and text recognition are provided for internet-connected devices through the Cloud Vision API, we believe that the ever-increasing computational power of mobile devices can enable the delivery of these technologies into the hands of our users, anytime, anywhere, regardless of internet connection. However, visual recognition for on device and embedded applications poses many challenges — models must run quickly with high accuracy in a resource-constrained environment making use of limited computation, power and space.

Today we are pleased to announce the release of MobileNets, a family of mobile-first computer vision models for TensorFlow, designed to effectively maximize accuracy while being mindful of the restricted resources for an on-device or embedded application. MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases. They can be built upon for classification, detection, embeddings and segmentation similar to how other popular large scale models, such as Inception, are used.
Example use cases include detection, fine-grain classification, attributes and geo-localization.
This release contains the model definition for MobileNets in TensorFlow using TF-Slim, as well as 16 pre-trained ImageNet classification checkpoints for use in mobile projects of all sizes. The models can be run efficiently on mobile devices with TensorFlow Mobile.

Model Checkpoint
Million MACs
Million Parameters
Top-1 Accuracy
Top-5 Accuracy
Choose the right MobileNet model to fit your latency and size budget. The size of the network in memory and on disk is proportional to the number of parameters. The latency and power usage of the network scales with the number of Multiply-Accumulates (MACs) which measures the number of fused Multiplication and Addition operations. Top-1 and Top-5 accuracies are measured on the ILSVRC dataset.
We are excited to share MobileNets with the open-source community. Information for getting started can be found at the TensorFlow-Slim Image Classification Library. To learn how to run models on-device please go to TensorFlow Mobile. You can read more about the technical details of MobileNets in our paper, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.

MobileNets were made possible with the hard work of many engineers and researchers throughout Google. Specifically we would like to thank:

Core Contributors: Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam

Special thanks to: Benoit Jacob, Skirmantas Kligys, George Papandreou, Liang-Chieh Chen, Derek Chow, Sergio Guadarrama, Jonathan Huang, Andre Hentz, Pete Warden

The Machine Intelligence Behind Gboard

Most people spend a significant amount of time each day using mobile-device keyboards: composing emails, texting, engaging in social media, and more. Yet, mobile keyboards are still cumbersome to handle. The average user is roughly 35% slower typing on a mobile device than on a physical keyboard. To change that, we recently provided many exciting improvements to Gboard for Android, working towards our vision of creating an intelligent mechanism that enables faster input while offering suggestions and correcting mistakes, in any language you choose.

With the realization that the way a mobile keyboard translates touch inputs into text is similar to how a speech recognition system translates voice inputs into text, we leveraged our experience in Speech Recognition to pursue our vision. First, we created robust spatial models that map fuzzy sequences of raw touch points to keys on the keyboard, just like acoustic models map sequences of sound bites to phonetic units. Second, we built a powerful core decoding engine based on finite state transducers (FST) to determine the likeliest word sequence given an input touch sequence. With its mathematical formalism and broad success in speech applications, we knew that an FST decoder would offer the flexibility needed to support a variety of complex keyboard input behaviors as well as language features. In this post, we will detail what went into the development of both of these systems.

Neural Spatial Models
Mobile keyboard input is subject to errors that are generally attributed to “fat finger typing” (or tracing spatially similar words in glide typing, as illustrated below) along with cognitive and motor errors (manifesting in misspellings, character insertions, deletions or swaps, etc). An intelligent keyboard needs to be able to account for these errors and predict the intended words rapidly and accurately. As such, we built a spatial model for Gboard that addresses these errors at the character level, mapping the touch points on the screen to actual keys.
Average glide trails for two spatially-similar words: “Vampire” and “Value”.
Up to recently, Gboard used a Gaussian model to quantify the probability of tapping neighboring keys and a rule-based model to represent cognitive and motor errors. These models were simple and intuitive, but they didn’t allow us to directly optimize metrics that correlate with better typing quality. Drawing on our experience with Voice Search acoustic models we replaced both the Gaussian and rule-based models with a single, highly efficient long short-term memory (LSTM) model trained with a connectionist temporal classification (CTC) criterion.

However, training this model turned out to be a lot more complicated than we had anticipated. While acoustic models are trained from human-transcribed audio data, one cannot easily transcribe millions of touch point sequences and glide traces. So the team exploited user-interaction signals, e.g. reverted auto-corrections and suggestion picks as negative and positive semi-supervised learning signals, to form rich training and test sets.
Raw data points corresponding to the word “could” (left), and normalized sampled trajectory with per-sample variances (right).
A plethora of techniques from the speech recognition literature was used to iterate on the NSM models to make them small and fast enough to be run on any device. The TensorFlow infrastructure was used to train hundreds of models, optimizing various signals surfaced by the keyboard: completions, suggestions, gliding, etc. After more than a year of work, the resulting models were about 6 times faster and 10 times smaller than the initial versions, they also showed about 15% reduction in bad autocorrects and 10% reduction in wrongly decoded gestures on offline datasets.

Finite-State Transducers
While the NSM uses spatial information to help determine what was tapped or swiped, there are additional constraints — lexical and grammatical — that can be brought to bear. A lexicon tells us what words occur in a language and a probabilistic grammar tells us what words are likely to follow other words. To encode this information we use finite-state transducers. FSTs have long been a key component of Google’s speech recognition and synthesis systems. They provide a principled way to represent various probabilistic models (lexicons, grammars, normalizers, etc) used in natural language processing together with the mathematical framework needed to manipulate, optimize, combine and search the models*.

In Gboard, a key-to-word transducer compactly represents the keyboard lexicon as shown in the figure below. It encodes the mapping from key sequences to words, allowing for alternative key sequences and optional spaces.
This transducer encodes “I”, “I’ve”, “If” along paths from the start state (the bold circle 1) to final states (the double circle states 0 and 1). Each arc is labeled with an input key (before the “:”) and a corresponding output word (after the “:”) where ε encodes the empty symbol. The apostrophe in “I’ve” can be omitted. The user may skip the space bar sometimes. To account for that, the space key transition between words in the transducer is optional. The ε and space back arcs allow accepting more than one word.
A probabilistic n-gram transducer is used to represent the language model for the keyboard. A state in the model represents an (up to) n-1 word context and an arc leaving that state is labeled with a successor word together with its probability of following that context (estimated from textual data). These, together with the spatial model that gives the likelihoods of sequences of key touches (discrete tap entries or continuous gestures in glide typing), are combined and explored with a beam search.

Generic FST principles, such as streaming, support for dynamic models, etc took us a long way towards building a new keyboard decoder, but several new functionalities also had to be added. When you speak, you don’t need the decoder to complete your words or guess what you will say next to save you a few syllables; but when you type, you appreciate the help of word completions and predictions. Also, we wanted the keyboard to provide seamless multilingual support, as shown below.
Trilingual input typing in Gboard.
It was a complex effort to get our new decoder off the ground, but the principled nature of FSTs has many benefits. For example, supporting transliterations for languages like Hindi is just a simple extension of the generic decoder.

Transliteration Models
In many languages with complex scripts, romanization systems have been developed to map characters into the Latin alphabet, often according to their phonetic pronunciations. For example, the Pinyin “xièxiè” corresponds to the Chinese characters “谢谢” (“thank you”). A Pinyin keyboard allows users to conveniently type words on a QWERTY layout and have them automatically “translated” into the target script. Likewise, a transliterated Hindi keyboard allows users to type “daanth” for “दांत” (teeth). Whereas Pinyin is an agreed-upon romanization system, Hindi transliterations are more fuzzy; for example “daant” would be a valid alternative for “दांत”.
Transliterated glide input for Hindi.
Just as we have a transducer mapping from letter sequences to words (a lexicon) and a weighted language model automaton providing probabilities for word sequences, we built weighted transducer mappings between Latin key sequences and target script symbol sequences for 22 Indic languages. Some languages have multiple writing systems (Bodo for example can be written in the Bengali or Devanagari scripts) so between transliterated and native layouts, we built 57 new input methods in just a few months.

The general nature of the FST decoder let us leverage all the work we had done to support completions, predictions, glide typing and many UI features with no extra effort, allowing us to offer a rich experience to our Indian users right from the start.

A More Intelligent Keyboard
All in all, our recent work cut the decoding latency by 50%, reduced the fraction of words users have to manually correct by more than 10%, allowed us to launch transliteration support for the 22 official languages of India, and enabled many new features you may have noticed.

While we hope that these recent changes improve your typing experience, we recognize that on-device typing is by no means solved. Gboard can still make suggestions that seem nonintuitive or of low utility and gestures can still be decoded to words a human would never pick. However, our shift towards powerful machine intelligence algorithms has opened new spaces that we’re actively exploring to make more useful tools and products for our users worldwide.

This work was done by Cyril Allauzen, Ouais Alsharif, Lars Hellsten, Tom Ouyang, Brian Roark and David Rybach, with help from Speech Data Operation team. Special thanks go to Johan Schalkwyk and Corinna Cortes for their support.

* The toolbox of relevant algorithms is available in the OpenFst open-source library.

Introducing the TensorFlow Research Cloud

Researchers require enormous computational resources to train the machine learning (ML) models that have delivered recent breakthroughs in medical imaging, neural machine translation, game playing, and many other domains. We believe that significantly larger amounts of computation will make it possible for researchers to invent new types of ML models that will be even more accurate and useful.

To accelerate the pace of open machine-learning research, we are introducing the TensorFlow Research Cloud (TFRC), a cluster of 1,000 Cloud TPUs that will be made available free of charge to support a broad range of computationally-intensive research projects that might not be possible otherwise.
The TensorFlow Research Cloud offers researchers the following benefits:
  • Access to Google’s all-new Cloud TPUs that accelerate both training and inference
  • Up to 180 teraflops of floating-point performance per Cloud TPU
  • 64 GB of ultra-high-bandwidth memory per Cloud TPU
  • Familiar TensorFlow programming interfaces
You can sign up here to request to be notified when the TensorFlow Research Cloud application process opens, and you can optionally share more information about your computational needs. We plan to evaluate applications on a rolling basis in search of the most creative and ambitious proposals.

The TensorFlow Research Cloud program is not limited to academia — we recognize that people with a wide range of affiliations, roles, and expertise are making major machine learning research contributions, and we especially encourage those with non-traditional backgrounds to apply. Access will be granted to selected individuals for limited amounts of compute time, and researchers are welcome to apply multiple times with multiple projects.
Since the main goal of the TensorFlow Research Cloud is to benefit the open machine learning research community as a whole, successful applicants will be expected to do the following:
  • Share their TFRC-supported research with the world through peer-reviewed publications, open-source code, blog posts, or other open media
  • Share concrete, constructive feedback with Google to help us improve the TFRC program and the underlying Cloud TPU platform over time
  • Imagine a future in which ML acceleration is abundant and develop new kinds of machine learning models in anticipation of that future
For businesses interested in using Cloud TPUs for proprietary research and development, we will offer a parallel Cloud TPU Alpha program. You can sign up here to learn more about this program. We recommend participating in the Cloud TPU Alpha program if you are interested in any of the following:
  • Accelerating training of proprietary ML models; models that take weeks to train on other hardware can be trained in days or even hours on Cloud TPUs
  • Accelerating batch processing of industrial-scale datasets: images, videos, audio, unstructured text, structured data, etc.
  • Processing live requests in production using larger and more complex ML models than ever before
We hope the TensorFlow Research Cloud will allow as many researchers as possible to explore the frontier of machine learning research and extend it with new discoveries! We encourage you to sign up today to be among the first to know as more information becomes available.

Using Machine Learning to Explore Neural Network Architecture

At Google, we have successfully applied deep learning models to many applications, from image recognition to speech recognition to machine translation. Typically, our machine learning models are painstakingly designed by a team of engineers and scientists. This process of manually designing machine learning models is difficult because the search space of all possible models can be combinatorially large — a typical 10-layer network can have ~1010 candidate networks! For this reason, the process of designing networks often takes a significant amount of time and experimentation by those with significant machine learning expertise.
Our GoogleNet architecture. Design of this network required many years of careful experimentation and refinement from initial versions of convolutional architectures.
To make this process of designing machine learning models much more accessible, we’ve been exploring ways to automate the design of machine learning models. Among many algorithms we’ve studied, evolutionary algorithms [1] and reinforcement learning algorithms [2] have shown great promise. But in this blog post, we’ll focus on our reinforcement learning approach and the early results we’ve gotten so far.

In our approach (which we call "AutoML"), a controller neural net can propose a “child” model architecture, which can then be trained and evaluated for quality on a particular task. That feedback is then used to inform the controller how to improve its proposals for the next round. We repeat this process thousands of times — generating new architectures, testing them, and giving that feedback to the controller to learn from. Eventually the controller learns to assign high probability to areas of architecture space that achieve better accuracy on a held-out validation dataset, and low probability to areas of architecture space that score poorly. Here’s what the process looks like:
We’ve applied this approach to two heavily benchmarked datasets in deep learning: image recognition with CIFAR-10 and language modeling with Penn Treebank. On both datasets, our approach can design models that achieve accuracies on par with state-of-art models designed by machine learning experts (including some on our own team!).

So, what kind of neural nets does it produce? Let’s take one example: a recurrent architecture that’s trained to predict the next word on the Penn Treebank dataset. On the left here is a neural net designed by human experts. On the right is a recurrent architecture created by our method:

The machine-chosen architecture does share some common features with the human design, such as using addition to combine input and previous hidden states. However, there are some notable new elements — for example, the machine-chosen architecture incorporates a multiplicative combination (the left-most blue node on the right diagram labeled “elem_mult”). This type of combination is not common for recurrent networks, perhaps because researchers see no obvious benefit for having it. Interestingly, a simpler form of this approach was recently suggested by human designers, who also argued that this multiplicative combination can actually alleviate gradient vanishing/exploding issues, suggesting that the machine-chosen architecture was able to discover a useful new neural net architecture.

This approach may also teach us something about why certain types of neural nets work so well. The architecture on the right here has many channels so that the gradient can flow backwards, which may help explain why LSTM RNNs work better than standard RNNs.

Going forward, we’ll work on careful analysis and testing of these machine-generated architectures to help refine our understanding of them. If we succeed, we think this can inspire new types of neural nets and make it possible for non-experts to create neural nets tailored to their particular needs, allowing machine learning to have a greater impact to everyone.


[1] Large-Scale Evolution of Image Classifiers, Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, Alex Kurakin. International Conference on Machine Learning, 2017.

[2] Neural Architecture Search with Reinforcement Learning, Barret Zoph, Quoc V. Le. International Conference on Learning Representations, 2017.

Efficient Smart Reply, now for Gmail

Last year we launched Smart Reply, a feature for Inbox by Gmail that uses machine learning to suggest replies to email. Since the initial release, usage of Smart Reply has grown significantly, making up about 12% of replies in Inbox on mobile. Based on our examination of the use of Smart Reply in Inbox and our ideas about how humans learn and use language, we have created a new version of Smart Reply for Gmail. This version increases the percentage of usable suggestions and is more algorithmically efficient.

Novel thinking: hierarchy
Inspired by how humans understand languages and concepts, we turned to hierarchical models of language, an approach that uses hierarchies of modules, each of which can learn, remember, and recognize a sequential pattern.

The content of language is deeply hierarchical, reflected in the structure of language itself, going from letters to words to phrases to sentences to paragraphs to sections to chapters to books to authors to libraries, etc. Consider the message, "That interesting person at the cafe we like gave me a glance." The hierarchical chunks in this sentence are highly variable. The subject of the sentence is "That interesting person at the cafe we like." The modifier "interesting" tells us something about the writer's past experiences with the person. We are told that the location of an incident involving both the writer and the person is "at the cafe." We are also told that "we," meaning the writer and the person being written to, like the cafe. Additionally, each word is itself part of a hierarchy, sometimes more than one. A cafe is a type of restaurant which is a type of store which is a type of establishment, and so on.

In proposing an appropriate response to this message we might consider the meaning of the word "glance," which is potentially ambiguous. Was it a positive gesture? In that case, we might respond, "Cool!" Or was it a negative gesture? If so, does the subject say anything about how the writer felt about the negative exchange? A lot of information about the world, and an ability to make reasoned judgments, are needed to make subtle distinctions.

Given enough examples of language, a machine learning approach can discover many of these subtle distinctions. Moreover, a hierarchical approach to learning is well suited to the hierarchical nature of language. We have found that this approach works well for suggesting possible responses to emails. We use a hierarchy of modules, each of which considers features that correspond to sequences at different temporal scales, similar to how we understand speech and language.
Each module processes inputs and provides transformed representations of those inputs on its outputs (which are, in turn, available for the next level). In the Smart Reply system, and the figure above, the repeated structure has two layers of hierarchy. The first makes each feature useful as a predictor of the final result, and the second combines these features. By definition, the second works at a more abstract representation and considers a wider timescale.

By comparison, the initial release of Smart Reply encoded input emails word-by-word with a long-short-term-memory (LSTM) recurrent neural network, and then decoded potential replies with yet another word-level LSTM. While this type of modeling is very effective in many contexts, even with Google infrastructure, it’s an approach that requires substantial computation resources. Instead of working word-by-word, we found an effective and highly efficient path by processing the problem more all-at-once, by comparing a simple hierarchy of vector representations of multiple features corresponding to longer time spans.

We have also considered whether the mathematical space of these vector representations is implicitly semantic. Do the hierarchical network representations reflect a coarse “understanding” of the actual meaning of the inputs and the responses in order to determine which go together, or do they reflect more consistent syntactical patterns? Given many real examples of which pairs go together and, perhaps more importantly which do not, we found that our networks are surprisingly effective and efficient at deriving representations that meet the training requirements.
So far we see that the system can find responses that are on point, without an overlap of keywords or even synonyms of keywords.More directly, we’re delighted when the system suggests results that show understanding and are helpful.

The key to this work is the confidence and trust people give us when they use the Smart Reply feature. As always, thank you for showing us the ways that work (and the ways that don’t!). With your help, we’ll do our best to keep learning.

Coarse Discourse: A Dataset for Understanding Online Discussions

Every day, participants of online communities form and share their opinions, experiences, advice and social support, most of which is expressed freely and without much constraint. These online discussions are often a key resource of information for many important topics, such as parenting, fitness, travel and more. However, these discussions also are intermixed with a clutter of disagreements, humor, flame wars and trolling, requiring readers to filter the content before getting the information they are looking for. And while the field of Information Retrieval actively explores ways to allow users to more efficiently find, navigate and consume this content, there is a lack of shared datasets on forum discussions to aid in understanding these discussions a bit better.

To aid researchers in this space, we are releasing the Coarse Discourse dataset, the largest dataset of annotated online discussions to date. The Coarse Discourse contains over half a million human annotations of publicly available online discussions on a random sample of over 9,000 threads from 130 communities from

To create this dataset, we developed the Coarse Discourse taxonomy of forum comments by going through a small set of forum threads, reading every comment, and deciding what role the comments played in the discussion. We then repeated and revised this exercise with crowdsourced human editors to validate the reproducibility of the taxonomy's discourse types, which include: announcement, question, answer, agreement, disagreement, appreciation, negative reaction, elaboration, and humor. From this data, over 100,000 comments were independently annotated by the crowdsourced editors for discourse type and relation. Along with the raw annotations from crowdsourced editors, we also provide the Coarse Discourse annotation task guidelines used by the editors to help with collecting data for other forums and refining the task further.
An example thread annotated with discourse types and relations. Early findings suggest that question answering is a prominent use case in most communities, while some communities are more converationally focused, with back-and-forth interactions.
For machine learning and natural language processing researchers trying to characterize the nature of online discussions, we hope that this dataset is a useful resource. Visit our GitHub repository to download the data. For more details, check out our ICWSM paper, “Characterizing Online Discussion Using Coarse Discourse Sequences.”

This work was done by Amy Zhang during her internship at Google. We would also like to thank Bryan Culbertson, Olivia Rhinehart, Eric Altendorf, David Huynh, Nancy Chang, Chris Welty and our crowdsourced editors.

Neural Network-Generated Illustrations in Allo

Taking, sharing, and viewing selfies has become a daily habit for many — the car selfie, the cute-outfit selfie, the travel selfie, the I-woke-up-like-this selfie. Apart from a social capacity, self-portraiture has long served as a means for self and identity exploration. For some, it’s about figuring out who they are. For others it’s about projecting how they want to be perceived. Sometimes it’s both.

Photography in the form of a selfie is a very direct form of expression. It comes with a set of rules bounded by reality. Illustration, on the other hand, empowers people to define themselves - it’s warmer and less fraught than reality.
Today, Google is introducing a feature in Allo that uses a combination of neural networks and the work of artists to turn your selfie into a personalized sticker pack. Simply snap a selfie, and it’ll return an automatically generated illustrated version of you, on the fly, with customization options to help you personalize the stickers even further.
What makes you, you?
The traditional computer vision approach to mapping selfies to art would be to analyze the pixels of an image and algorithmically determine attribute values by looking at pixel values to measure color, shape, or texture. However, people today take selfies in all types of lighting conditions and poses. And while people can easily pick out and recognize qualitative features, like eye color, regardless of the lighting condition, this is a very complex task for computers. When people look at eye color, they don’t just interpret the pixel values of blue or green, but take into account the surrounding visual context.

In order to account for this, we explored how we could enable an algorithm to pick out qualitative features in a manner similar to the way people do, rather than the traditional approach of hand coding how to interpret every permutation of lighting condition, eye color, etc. While we could have trained a large convolutional neural network from scratch to attempt to accomplish this, we wondered if there was a more efficient way to get results, since we expected that learning to interpret a face into an illustration would be a very iterative process.

That led us to run some experiments, similar to DeepDream, on some of Google's existing more general-purpose computer vision neural networks. We discovered that a few neurons among the millions in these networks were good at focusing on things they weren’t explicitly trained to look at that seemed useful for creating personalized stickers. Additionally, by virtue of being large general-purpose neural networks they had already figured out how to abstract away things they didn’t need. All that was left to do was to provide a much smaller number of human labeled examples to teach the classifiers to isolate out the qualities that the neural network already knew about the image.

To create an illustration of you that captures the qualities that would make it recognizable to your friends, we worked alongside an artistic team to create illustrations that represented a wide variety of features. Artists initially designed a set of hairstyles, for example, that they thought would be representative, and with the help of human raters we used these hairstyles to train the network to match the right illustration to the right selfie. We then asked human raters to judge the sticker output against the input image to see how well it did. In some instances, they determined that some styles were not well represented, so the artists created more that the neural network could learn to identify as well.
Raters were asked to classify hairstyles that the icon on the left resembled closest. Then, once consensus was reached, resident artist Lamar Abrams drew a representation of what they had in common.
Avoiding the uncanny valley
In the study of aesthetics, a well-known problem is the uncanny valley - the hypothesis that human replicas which appear almost, but not exactly, like real human beings can feel repulsive. In machine learning, this could be compounded if were confronted by a computer’s perception of you, versus how you may think of yourself, which can be at odds.

Rather than aim to replicate a person’s appearance exactly, pursuing a lower resolution model, like emojis and stickers, allows the team to explore expressive representation by returning an image that is less about reproducing reality and more about breaking the rules of representation.
The team worked with artist Lamar Abrams to design the features that make up more than 563 quadrillion combinations.
Translating pixels to artistic illustrations
Reconciling how the computer perceives you with how you perceive yourself and what you want to project is truly an artistic exercise. This makes a customization feature that includes different hairstyles, skin tones, and nose shapes, essential. After all, illustration by its very nature can be subjective. Aesthetics are defined by race, culture, and class which can lead to creating zones of exclusion without consciously trying. As such, we strove to create a space for a range of race, age, masculinity, femininity, and/or androgyny. Our teams continue to evaluate the research results to help prevent against incorporating biases while training the system.
Creating a broad palette for identity and sentiment
There is no such thing as a ‘universal aesthetic’ or ‘a singular you’. The way people talk to their parents is different than how they talk to their friends which is different than how they talk to their colleagues. It’s not enough to make an avatar that is a literal representation of yourself when there are many versions of you. To address that, the Allo team is working with a range of artistic voices to help others extend their own voice. This first style that launched today speaks to your sarcastic side but the next pack might be more cute for those sincere moments. Then after that, maybe they’ll turn you into a dog. If emojis broadened the world of communication it’s not hard to imagine how this technology and language evolves. What will be most exciting is listening to what people say with it.

This feature is starting to roll out in Allo today for Android, and will come soon to Allo on iOS.

This work was made possible through a collaboration of the Allo Team and Machine Perception researchers at Google. We additionally thank Lamar Abrams, Koji Ashida, Forrester Cole, Jennifer Daniel, Shiraz Fuman, Dilip Krishnan, Inbar Mosseri, Aaron Sarna, and Bhavik Singh.