Teaching Robots to Understand Semantic Concepts

Machine learning can allow robots to acquire complex skills, such as grasping and opening doors. However, learning these skills requires us to manually program reward functions that the robots then attempt to optimize. In contrast, people can understand the goal of a task just from watching someone else do it, or simply by being told what the goal is. We can do this because we draw on our own prior knowledge about the world: when we see someone cut an apple, we understand that the goal is to produce two slices, regardless of what type of apple it is, or what kind of tool is used to cut it. Similarly, if we are told to pick up the apple, we understand which object we are to grab because we can ground the word “apple” in the environment: we know what it means.

These are semantic concepts: salient events like producing two slices, and object categories denoted by words such as “apple.” Can we teach robots to understand semantic concepts, to get them to follow simple commands specified through categorical labels or user-provided examples? In this post, we discuss some of our recent work on robotic learning that combines experience that is autonomously gathered by the robot, which is plentiful but lacks human-provided labels, with human-labeled data that allows a robot to understand semantics. We will describe how robots can use their experience to understand the salient events in a human-provided demonstration, mimic human movements despite the differences between human and robot bodies, and understand semantic categories, like “toy” and “pen”, to pick up objects based on user commands.

Understanding human demonstrations with deep visual features
In the first set of experiments, which appear in our paper Unsupervised Perceptual Rewards for Imitation Learning, our aim is to enable a robot to understand a task, such as opening a door, from seeing only a small number of unlabeled human demonstrations. By analyzing these demonstrations, the robot must work out which semantically salient event constitutes task success, and then use reinforcement learning to perform it.
Examples of human demonstrations (left) and the corresponding robotic imitation (right).
Unsupervised learning on very small datasets is one of the most challenging scenarios in machine learning. To make this feasible, we use deep visual features from a large network trained for image recognition on ImageNet. Such features are known to be sensitive to semantic concepts, while maintaining invariance to nuisance variables such as appearance and lighting. We use these features to interpret user-provided demonstrations, and show that it is indeed possible to learn reward functions in an unsupervised fashion from a few demonstrations and without retraining.
Example of reward functions learned solely from observation for the door opening tasks. Rewards progressively increase from zero to the maximum reward as a task is completed.
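As a rough illustration of the idea, and not the method from the paper, the sketch below scores a frame by how close its feature vector is to the average feature vector of the final frames of a few demonstrations. The short lists stand in for activations of a pretrained ImageNet network; the function names, the toy 3-dimensional features and the exponential distance-to-reward mapping are all assumptions made for this example.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def reward(frame_features, goal_features, scale=1.0):
    """Map squared distance to the goal features into (0, 1]; 1 means 'at the goal'."""
    sq_dist = sum((f - g) ** 2 for f, g in zip(frame_features, goal_features))
    return math.exp(-scale * sq_dist)

# Toy stand-ins for deep features of the last frame of three demonstrations.
demo_final_frames = [[0.9, 0.1, 1.0], [1.0, 0.0, 0.9], [1.1, 0.2, 1.1]]
goal = mean_vector(demo_final_frames)

# A toy trajectory moving from a "door closed" to a "door open" feature vector.
trajectory = [[0.0, 0.0, 0.0], [0.5, 0.05, 0.5], [1.0, 0.1, 1.0]]
rewards = [reward(f, goal) for f in trajectory]
```

In this toy setting the reward rises along the trajectory and approaches its maximum at the last frame, which mirrors the progressive increase described in the caption above.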
After learning a reward function from observation only, we use it to guide a robot to learn a door opening task, using only the images to evaluate the reward function. With the help of an initial kinesthetic demonstration that succeeds about 10% of the time, the robot learns to improve to 100% accuracy using the learned reward function.
Learning progression.
Emulating human movements with self-supervision and imitation
In Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation, we propose a novel approach to learn about the world from observation and demonstrate it through self-supervised pose imitation. Our approach relies primarily on co-occurrence in time and space for supervision: by training to distinguish frames from different times of a video, it learns to disentangle and organize reality into useful abstract representations.
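One common way to turn "distinguish frames from different times" into a training signal is a triplet loss, sketched below under simplifying assumptions. The anchor and the positive are embeddings of the same moment seen from two viewpoints, and the negative is an embedding of a different moment. The margin value and the 2-D toy embeddings are hypothetical, and a real system would compute the embeddings with a neural network.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: zero once the negative is farther than the positive by `margin`."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)

# Toy 2-D embeddings (hypothetical values).
anchor = [0.0, 1.0]      # view 1, time t
positive = [0.1, 0.9]    # view 2, same time t
negative = [1.0, 0.0]    # view 1, a different time t'

well_separated = triplet_loss(anchor, positive, negative)
violating = triplet_loss(anchor, negative, positive)  # roles swapped on purpose
```

The first call yields zero loss because the co-occurring frames are already close, while the swapped call is penalized; minimizing this loss pulls simultaneous views together and pushes temporally distant frames apart.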

In a pose imitation task, for example, different dimensions of the representation may encode different joints of a human or robotic body. Rather than defining by hand a mapping between human and robot joints (which is ambiguous in the first place because of physiological differences), we let the robot learn to imitate in an end-to-end fashion. When our model is simultaneously trained on human and robot observations, it naturally discovers the correspondence between the two, even though no correspondence is provided. We thus obtain a robot that can imitate human poses without having ever been given a correspondence between humans and robots.
Self-supervised human pose imitation by a robot.
Striking evidence of the benefits of learning end-to-end is the many-to-one and highly non-linear joint mapping shown above. In this example, the up-down motion involves many joints for the human while only one joint is needed for the robot. We show that the robot has discovered this highly complex mapping on its own, without any explicit human pose information.

Grasping with semantic object categories
The experiments above illustrate how a person can specify a goal for a robot through an example demonstration, in which case the robot must interpret the semantics of the task: salient events and relevant features of the pose. What if, instead of showing the task, the human simply wants to tell the robot what to do? This also requires the robot to understand semantics, in order to identify which objects in the world correspond to the semantic category specified by the user. In End-to-End Learning of Semantic Grasping, we study how a combination of manually labeled and autonomously collected data can be used to perform the task of semantic grasping, where the robot must pick up an object from a cluttered bin that matches a user-specified class label, such as “eraser” or “toy.”
In our semantic grasping setup, the robotic arm is tasked with picking up an object corresponding to a user-provided semantic category (e.g. Legos).
To learn how to perform semantic grasping, our robots first gather a large dataset of grasping data by autonomously attempting to pick up a large variety of objects, as detailed in our previous post and prior work. This data by itself can allow a robot to pick up objects, but doesn’t allow it to understand how to associate them with semantic labels. To enable an understanding of semantics, we again enlist a modest amount of human supervision. Each time a robot successfully grasps an object, it presents it to the camera in a canonical pose, as illustrated below.
The robot presents objects to the camera after grasping. These images can be used to label which object category was picked up.
A subset of these images is then labeled by human labelers. Since the presentation images show the object in a canonical pose, it is easy to then propagate these labels to the remaining presentation images by training a classifier on the labeled examples. The labeled presentation images then tell the robot which object was actually picked up, and it can associate this label, in hindsight, with the images that it observed while picking up that object from the bin.
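The label propagation step described above can be sketched as follows. This is a minimal illustration that assumes each presentation image has already been reduced to a small feature vector, and it uses a nearest-centroid classifier purely for brevity; the classifier actually used, the feature values and the episode identifiers are not given in the post and are hypothetical here.

```python
from collections import defaultdict

def fit_centroids(features, labels):
    """Average the feature vectors of each class."""
    groups = defaultdict(list)
    for vec, label in zip(features, labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }

def predict(centroids, vec):
    """Return the label whose centroid is closest to `vec`."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vec, centroids[label])),
    )

# Human-labeled subset of presentation images (toy features).
labeled_features = [[0.9, 0.1], [1.0, 0.2], [0.1, 0.9], [0.2, 1.0]]
labeled_classes = ["eraser", "eraser", "toy", "toy"]

# Remaining presentation images, one per successful grasp episode.
unlabeled_features = [[0.8, 0.0], [0.0, 0.8]]
episode_ids = ["episode_17", "episode_42"]

centroids = fit_centroids(labeled_features, labeled_classes)
propagated = [predict(centroids, vec) for vec in unlabeled_features]

# Hindsight labeling: each grasp episode inherits the label of its presentation image.
episode_labels = dict(zip(episode_ids, propagated))
```

Once every episode carries a label, the images recorded during the grasp itself can be paired with that label for training.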

Using this labeled dataset, we can then train a two-stream model that predicts which object will be grasped, conditioned on the current image and the actions that the robot might take. The two-stream model that we employ is inspired by the dorsal-ventral decomposition observed in the human visual cortex, where the ventral stream reasons about the semantic class of objects, while the dorsal stream reasons about the geometry of the grasp. Crucially, the ventral stream can incorporate auxiliary data consisting of labeled images of objects (not necessarily from the robot), while the dorsal stream can incorporate auxiliary data of grasping that does not have semantic labels, allowing the entire system to be trained more effectively using larger amounts of heterogeneously labeled data. In this way, we can combine a limited amount of human labels with a large amount of autonomously collected robotic data to grasp objects based on desired semantic category, as illustrated in the video below:
Future Work
Our experiments show how limited semantically labeled data can be combined with data that is collected and labeled automatically by the robots, in order to enable robots to understand events, object categories, and user demonstrations. In the future, we might imagine that robotic systems could be trained with a combination of user-annotated data and ever-increasing autonomously collected datasets, improving robotic capability and easing the engineering burden of designing autonomous robots. Furthermore, as robotic systems collect more and more automatically annotated data in the real world, this data can be used to improve not just robotic systems, but also systems for computer vision, speech recognition, and natural language processing that can all benefit from such large auxiliary data sources.

Of course, we are not the first to consider the intersection of robotics and semantics. Extensive prior work in natural language understanding, robotic perception, grasping, and imitation learning has considered how semantics and action can be combined in a robotic system. However, the experiments we discussed above might point the way to future work into combining self-supervised and human-labeled data in the context of autonomous robotic systems.

The research described in this post was performed by Pierre Sermanet, Kelvin Xu, Corey Lynch, Jasmine Hsu, Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, and Sergey Levine. We also thank Mrinal Kalakrishnan, Ali Yahya, and Yevgen Chebotar for developing the policy learning framework used for the door task, and John-Michael Burke for conducting experiments for semantic grasping.

Unsupervised Perceptual Rewards for Imitation Learning was presented at RSS 2017 by Kelvin Xu, and Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation will be presented this week at the CVPR Workshop on Deep Learning for Robotic Vision.

Google at CVPR 2017

From July 21-26, Honolulu, Hawaii hosts the 2017 Conference on Computer Vision and Pattern Recognition (CVPR 2017), the premier annual computer vision event comprising the main conference and several co-located workshops and tutorials. As a leader in computer vision research and a Platinum Sponsor, Google will have a strong presence at CVPR 2017 — over 250 Googlers will be in attendance to present papers and invited talks at the conference, and to organize and participate in multiple workshops.

If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively pursuing the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind Headset Removal for Virtual and Mixed Reality, Image Compression with Neural Networks, Jump, TensorFlow Object Detection API and much more.

You can learn more about our research being presented at CVPR 2017 in the list below.

Organizing Committee
Corporate Relations Chair - Mei Han
Area Chairs include - Alexander Toshev, Ce Liu, Vittorio Ferrari, David Lowe

Training object class detectors with click supervision
Dim Papadopoulos, Jasper Uijlings, Frank Keller, Vittorio Ferrari

Unsupervised Pixel-Level Domain Adaptation With Generative Adversarial Networks
Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, Dilip Krishnan

BranchOut: Regularization for Online Ensemble Tracking With Convolutional Neural Networks
Bohyung Han, Jack Sim, Hartwig Adam

Enhancing Video Summarization via Vision-Language Embedding
Bryan A. Plummer, Matthew Brown, Svetlana Lazebnik

Learning by Association — A Versatile Semi-Supervised Training Method for Neural Networks
Philip Haeusser, Alexander Mordvintsev, Daniel Cremers

Context-Aware Captions From Context-Agnostic Supervision
Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, Gal Chechik

Spatially Adaptive Computation Time for Residual Networks
Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, Ruslan Salakhutdinov

Xception: Deep Learning With Depthwise Separable Convolutions
François Chollet

Deep Metric Learning via Facility Location
Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, Kevin Murphy

Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors
Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy

Synthesizing Normalized Faces From Facial Identity Features
Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, William T. Freeman

Towards Accurate Multi-Person Pose Estimation in the Wild
George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin Murphy

GuessWhat?! Visual Object Discovery Through Multi-Modal Dialogue
Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron Courville

Learning discriminative and transformation covariant local feature detectors
Xu Zhang, Felix X. Yu, Svebor Karaman, Shih-Fu Chang

Full Resolution Image Compression With Recurrent Neural Networks
George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, Michele Covell

Learning From Noisy Large-Scale Datasets With Minimal Supervision
Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, Serge Belongie

Unsupervised Learning of Depth and Ego-Motion From Video
Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe

Cognitive Mapping and Planning for Visual Navigation
Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik

Fast Fourier Color Constancy
Jonathan T. Barron, Yun-Ta Tsai

On the Effectiveness of Visible Watermarks
Tali Dekel, Michael Rubinstein, Ce Liu, William T. Freeman

YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video
Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, Vincent Vanhoucke

Deep Learning for Robotic Vision
Organizers include: Anelia Angelova, Kevin Murphy
Program Committee includes: George Papandreou, Nathan Silberman, Pierre Sermanet

The Fourth Workshop on Fine-Grained Visual Categorization
Organizers include: Yang Song
Advisory Panel includes: Hartwig Adam
Program Committee includes: Anelia Angelova, Yuning Chai, Nathan Frey, Jonathan Krause, Catherine Wah, Weijun Wang

Language and Vision Workshop
Organizers include: R. Sukthankar

The First Workshop on Negative Results in Computer Vision
Organizers include: R. Sukthankar, W. Freeman, J. Malik

Visual Understanding by Learning from Web Data
General Chairs include: Jesse Berent, Abhinav Gupta, Rahul Sukthankar
Program Chairs include: Wei Li

YouTube-8M Large-Scale Video Understanding Challenge
General Chairs: Paul Natsev, Rahul Sukthankar
Program Chairs: Joonseok Lee, George Toderici
Challenge Organizers: Sami Abu-El-Haija, Anja Hauth, Nisarg Kothari, Hanhan Li, Sobhan Naderi Parizi, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan, Jian Wang

An Update to Open Images – Now with Bounding-Boxes

Last year we introduced Open Images, a collaborative release of ~9 million images annotated with labels spanning over 6000 object categories, designed to be a useful dataset for machine learning research. The initial release featured image-level labels automatically produced by a computer vision model similar to Google Cloud Vision API, for all 9M images in the training set, and a validation set of 167K images with 1.2M human-verified image-level labels.

Today, we introduce an update to Open Images that adds a total of ~2M bounding-boxes to the existing dataset, along with several million additional image-level labels. Details include:
  • 1.2M bounding-boxes around objects for 600 categories on the training set. These have been produced semi-automatically by an enhanced version of the technique outlined in [1], and are all human-verified.
  • Complete bounding-box annotation for all object instances of the 600 categories on the validation set, all manually drawn (830K boxes). The bounding-box annotations in the training and validation sets will enable research on object detection on this dataset. The 600 categories offer a broader range than those in the ILSVRC and COCO detection challenges, and include new objects such as fedora hat and snowman.
  • 4.3M human-verified image-level labels on the training set (over all categories). This will enable large-scale experiments on object classification, based on a clean training set with reliable labels.
Annotated images from the Open Images dataset. Left: FAMILY MAKING A SNOWMAN by mwvchamber. Right: STANZA STUDENTI.S.S. ANNUNZIATA by ersupalermo. Both images used under CC BY 2.0 license. See more examples here.
We hope that this update to Open Images will stimulate the broader research community to experiment with object classification and detection models, and facilitate the development and evaluation of new techniques.

[1] We don't need no bounding-boxes: Training object class detectors using only human verification, Papadopoulos, Uijlings, Keller, and Ferrari, CVPR 2016

Motion Stills — Now on Android

Last year, we launched Motion Stills, an iOS app that stabilizes your Live Photos and lets you view and share them as looping GIFs and videos. Since then, Motion Stills has been well received, being listed as one of the top apps of 2016 by The Verge and Mashable. However, from its initial release, the community has been asking us to also make Motion Stills available for Android. We listened to your feedback and today, we're excited to announce that we’re bringing this technology, and more, to devices running Android 5.1 and later!
Motion Stills on Android: Instant stabilization on your device.
With Motion Stills on Android we built a new recording experience where everything you capture is instantly transformed into delightful short clips that are easy to watch and share. You can capture a short Motion Still with a single tap like a photo, or condense a longer recording using a new feature we call Fast Forward. In addition to stabilizing your recordings, Motion Stills on Android comes with an improved trimming algorithm that guards against pocket shots and accidental camera shakes. All of this is done during capture on your Android device, no internet connection required!

New streaming pipeline
For this release, we redesigned our existing iOS video processing pipeline to use a streaming approach that processes each frame of a video as it is being recorded. By computing intermediate motion metadata, we are able to immediately stabilize the recording while still performing loop optimization over the full sequence. All this leads to instant results after recording — no waiting required to share your new GIF.
Capture using our streaming pipeline gives you instant results.
In order to display your Motion Stills stream immediately, our algorithm computes and stores the necessary stabilizing transformation as a low resolution texture map. We leverage this texture to apply the stabilization transform using the GPU in real-time during playback, instead of writing a new, stabilized video that would tax your mobile hardware and battery.
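To make the streaming idea concrete, here is a deliberately simplified sketch and not the app's actual algorithm. It assumes camera motion can be reduced to one horizontal offset per frame, smooths that path with a causal exponential moving average as each frame arrives, and stores only a per-frame correction that is applied at playback. The class name, the smoothing factor and the sample offsets are all hypothetical.

```python
class StreamingStabilizer:
    """Keeps a running smoothed camera path and records one correction per frame."""

    def __init__(self, smoothing=0.5):
        self.smoothing = smoothing   # 0 = no smoothing, closer to 1 = smoother path
        self.smoothed = None
        self.corrections = []        # compact per-frame metadata, not a new video

    def add_frame(self, camera_x):
        """Process one frame as it is recorded; returns the correction to apply."""
        if self.smoothed is None:
            self.smoothed = camera_x
        else:
            self.smoothed = (self.smoothing * self.smoothed
                             + (1.0 - self.smoothing) * camera_x)
        correction = self.smoothed - camera_x
        self.corrections.append(correction)
        return correction

def stabilized_path(raw_path, corrections):
    """At playback, shift each frame by its stored correction."""
    return [x + c for x, c in zip(raw_path, corrections)]

raw = [0.0, 4.0, 0.0, 4.0, 0.0]   # shaky, hypothetical horizontal camera offsets
stabilizer = StreamingStabilizer(smoothing=0.5)
for x in raw:
    stabilizer.add_frame(x)
playback = stabilized_path(raw, stabilizer.corrections)
```

The design point this mirrors is that only the small list of corrections is stored, so the original frames are never re-encoded; the stabilizing shift is applied when the clip is displayed.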

Fast Forward
Fast Forward allows you to speed up and condense a longer recording into a short, easy to share clip. The same pipeline described above allows Fast Forward to process up to a full minute of video, right on your phone. You can even change the speed of playback (from 1x to 8x) after recording. To make this possible, we encode videos with a denser I-frame spacing to enable efficient seeking and playback. We also employ additional optimizations in the Fast Forward mode. For instance, we apply adaptive temporal downsampling in the linear solver and long-range stabilization for smooth results over the whole sequence.
Fast Forward condenses your recordings into easy to share clips.
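The link between playback speed and I-frame spacing can be illustrated with a small calculation. The sketch assumes integer speed factors, uniform frame skipping, and that each displayed frame is reached by decoding forward from the most recent I-frame; these are simplifications for illustration, and the function names and frame counts are hypothetical.

```python
def frames_to_show(num_frames, speed):
    """Indices of the frames displayed when playing `speed` times faster."""
    if not 1 <= speed <= 8:
        raise ValueError("speed must be between 1x and 8x")
    return list(range(0, num_frames, speed))

def nearest_keyframe(frame_index, keyframe_interval):
    """Last I-frame at or before `frame_index`; decoding must start there."""
    return (frame_index // keyframe_interval) * keyframe_interval

def decode_cost(frame_index, keyframe_interval):
    """Number of frames decoded to reach `frame_index` after a seek."""
    return frame_index - nearest_keyframe(frame_index, keyframe_interval) + 1

shown = frames_to_show(num_frames=24, speed=8)
sparse_cost = sum(decode_cost(i, keyframe_interval=30) for i in shown)
dense_cost = sum(decode_cost(i, keyframe_interval=4) for i in shown)
```

With I-frames every 30 frames, reaching the three displayed frames means decoding 27 frames in this toy model, versus 3 frames with an I-frame every 4 frames, which is why denser spacing makes seeking cheaper.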
Try out Motion Stills
Motion Stills is an app for us to experiment and iterate quickly with short-form video technology, gathering valuable feedback along the way. The tools our users find most fun and useful may be integrated later on into existing products like Google Photos. Download Motion Stills for Android from the Google Play store—available for mobile phones running Android 5.1 and later—and share your favorite clips on social media with hashtag #motionstills.

Motion Stills would not have been possible without the help of many Googlers. We want to especially acknowledge the work of Matthias Grundmann in advancing our stabilization technology, as well as our UX and interaction designers Jacob Zukerman, Ashley Ma and Mark Bowers.

Facets: An Open Source Visualization Tool for Machine Learning Training Data

(Cross-posted on the Google Open Source Blog)

Getting the best results out of a machine learning (ML) model requires that you truly understand your data. However, ML datasets can contain hundreds of millions of data points, each consisting of hundreds (or even thousands) of features, making it nearly impossible to understand an entire dataset in an intuitive fashion. Visualization can help unlock nuances and insights in large datasets. A picture may be worth a thousand words, but an interactive visualization can be worth even more.

Working with the PAIR initiative, we’ve released Facets, an open source visualization tool to aid in understanding and analyzing ML datasets. Facets consists of two visualizations that allow users to see a holistic picture of their data at different granularities. Get a sense of the shape of each feature of the data using Facets Overview, or explore a set of individual observations using Facets Dive. These visualizations allow you to debug your data which, in machine learning, is as important as debugging your model. They can easily be used inside of Jupyter notebooks or embedded into webpages. In addition to the open source code, we've also created a Facets demo website. This website allows anyone to visualize their own datasets directly in the browser without the need for any software installation or setup, and without the data ever leaving your computer.

Facets Overview
Facets Overview automatically gives users a quick understanding of the distribution of values across the features of their datasets. Multiple datasets, such as a training set and a test set, can be compared on the same visualization. Common data issues that can hamper machine learning are pushed to the forefront, such as: unexpected feature values, features with high percentages of missing values, features with unbalanced distributions, and feature distribution skew between datasets.
Facets Overview visualization of the six numeric features of the UCI Census datasets[1]. The features are sorted by non-uniformity, with the feature with the most non-uniform distribution at the top. Numbers in red indicate possible trouble spots, in this case numeric features with a high percentage of values set to 0. The histograms at right allow you to compare the distributions between the training data (blue) and test data (orange).

Facets Overview visualization showing two of the nine categorical features of the UCI Census datasets[1]. The features are sorted by distribution distance, with the feature with the biggest skew between the training (blue) and test (orange) datasets at the top. Notice in the “Target” feature that the label values differ between the training and test datasets, due to a trailing period in the test set (“<=50K” vs “<=50K.”). This can be seen in the chart for the feature and also in the entries in the “top” column of the table. This label mismatch would cause a model trained and tested on this data to not be evaluated correctly.
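The two kinds of issue called out in these captions can also be checked by hand. The sketch below does not use Facets or its API; it is plain Python over made-up rows in the spirit of the UCI Census example, with hypothetical function and variable names, showing the share of zero values in a numeric feature and the label values that differ between the training and test sets.

```python
def fraction_of_zeros(values):
    """Share of entries in a numeric feature that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)

def label_mismatch(train_labels, test_labels):
    """Label values that appear in only one of the two datasets."""
    train_set, test_set = set(train_labels), set(test_labels)
    return sorted(train_set - test_set), sorted(test_set - train_set)

# Hypothetical rows; the real datasets are far larger.
capital_gain = [0, 0, 0, 5178, 0, 0, 0, 0, 15024, 0]
train_target = ["<=50K", ">50K", "<=50K"]
test_target = ["<=50K.", ">50K.", "<=50K."]

zero_share = fraction_of_zeros(capital_gain)
only_in_train, only_in_test = label_mismatch(train_target, test_target)

# Stripping the trailing period makes the two label sets agree again.
normalized_test = [label.rstrip(".") for label in test_target]
remaining = label_mismatch(train_target, normalized_test)
```

A check like this catches the trailing-period problem before evaluation, where otherwise no test label would ever match a predicted training label.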
Facets Dive
Facets Dive provides an easy-to-customize, intuitive interface for exploring the relationship between the data points across the different features of a dataset. With Facets Dive, you control the position, color and visual representation of each data point based on its feature values. If the data points have images associated with them, the images can be used as the visual representations.
Facets Dive visualization showing all 16281 data points in the UCI Census test dataset[1]. The animation shows a user coloring the data points by one feature (“Relationship”), faceting in one dimension by a continuous feature (“Age”) and then faceting in another dimension by a discrete feature (“Marital Status”).
Facets Dive visualization of a large number of face drawings from the “Quick, Draw!” Dataset, showing the relationship between the number of strokes and points in the drawings and the ability for the “Quick, Draw!” classifier to correctly categorize them as faces.
Fun Fact: In large datasets, such as the CIFAR-10 dataset[2], a small human labelling error can easily go unnoticed. We inspected the CIFAR-10 dataset with Dive and were able to catch a frog-cat – an image of a frog that had been incorrectly labelled as a cat!
Exploration of the CIFAR-10 dataset using Facets Dive. Here we facet the ground truth labels by row and the predicted labels by column. This produces a confusion matrix view, allowing us to drill into particular kinds of misclassifications. In this particular case, the ML model incorrectly labels some small percentage of true cats as frogs. The interesting thing we find by putting the real images in the confusion matrix is that one of these "true cats" that the model predicted to be a frog is, on visual inspection, actually a frog. With Facets Dive, we can determine that this one misclassification wasn't a true misclassification by the model, but rather an incorrectly labeled example in the dataset.
Can you spot the frog-cat?
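The confusion-matrix view described above amounts to grouping examples by their (ground truth, prediction) pair. A minimal sketch of that grouping follows; it is independent of Facets, and the labels and function name are hypothetical. The indices collected in an off-diagonal cell are the examples worth inspecting by eye, since they are either model errors or labeling errors.

```python
from collections import defaultdict

def facet_by_labels(true_labels, predicted_labels):
    """Group example indices into cells keyed by (true label, predicted label)."""
    cells = defaultdict(list)
    for index, (truth, predicted) in enumerate(zip(true_labels, predicted_labels)):
        cells[(truth, predicted)].append(index)
    return dict(cells)

true_labels = ["cat", "cat", "frog", "cat", "frog"]
predicted = ["cat", "frog", "frog", "cat", "frog"]

cells = facet_by_labels(true_labels, predicted)

# Examples labeled "cat" that the model called "frog": candidates for a frog-cat.
suspects = cells.get(("cat", "frog"), [])
```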

We’ve gotten great value out of Facets inside of Google and are excited to share the visualizations with the world. We hope they can help you discover new and interesting things about your data that lead you to create more powerful and accurate machine learning models. And since they are open source, you can customize the visualizations for your specific needs or contribute to the project to help us all better understand our data. If you have feedback about your experience with Facets, please let us know what you think.

This work is a collaboration between Mahima Pushkarna, James Wexler and Jimbo Wilson, with input from the entire Big Picture team. We would also like to thank Justine Tunney for providing us with the build tooling.

[1] Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

[2] Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.

Using Deep Learning to Create Professional-Level Photographs

Machine learning (ML) excels in many areas with well defined goals. Tasks where there exists a right or wrong answer help with the training process and allow the algorithm to achieve its desired goal, whether it be correctly identifying objects in images or providing a suitable translation from one language to another. However, there are areas where objective evaluations are not available. For example, whether a photograph is beautiful is measured by its aesthetic value, which is a highly subjective concept.
A professional(?) photograph of Jasper National Park, Canada.
To explore how ML can learn subjective concepts, we introduce an experimental deep-learning system for artistic content creation. It mimics the workflow of a professional photographer, roaming landscape panoramas from Google Street View and searching for the best composition, then carrying out various postprocessing operations to create an aesthetically pleasing image. Our virtual photographer “travelled” ~40,000 panoramas in areas like the Alps, Banff and Jasper National Parks, Big Sur in California, and Yellowstone National Park, and returned with creations that are quite impressive, some even approaching professional quality, as judged by professional photographers.

Training the Model
While aesthetics can be modelled using datasets like AVA, using such a model naively to enhance photos may miss some aspects of aesthetics and, for example, make a photo over-saturated. Using supervised learning to learn multiple aspects of aesthetics properly, however, may require a labelled dataset that is intractable to collect.

Our approach relies only on a collection of professional quality photos, without before/after image pairs, or any additional labels. It breaks down aesthetics into multiple aspects automatically, each of which is learned individually with negative examples generated by a coupled image operation. By keeping these image operations semi-“orthogonal”, we can enhance a photo on its composition, saturation/HDR level and dramatic lighting with fast and separable optimizations:
A panorama (a) is cropped into (b), with saturation and HDR strength enhanced in (c), and with dramatic mask applied in (d). Each step is guided by one learned aspect of aesthetics.
A traditional image filter was used to generate negative training examples for saturation, HDR detail and composition. We also introduce a special operation named dramatic mask, which was created jointly while learning the concept of dramatic lighting. The negative examples were generated by applying a combination of image filters that modify brightness randomly on professional photos, degrading their appearance. For the training we use a generative adversarial network (GAN), where a generative model creates a mask to fix lighting for negative examples, while a discriminative model tries to distinguish enhanced results from the real professional ones. Unlike shape-fixed filters such as vignette, dramatic mask adds content-aware brightness adjustment to a photo. The competitive nature of GAN training leads to good variations of such suggestions. You can read more about the training details in our paper.
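The way negative examples are produced by degrading professional photos can be sketched as follows. This is a toy illustration only: the image is a tiny grayscale grid with values in [0, 1], and the per-row random brightness factor, the `max_change` parameter and the function name are assumptions for this example, not the filters used in the paper.

```python
import random

def degrade_brightness(image, rng, max_change=0.5):
    """Scale each row's brightness by a random factor and clamp to [0, 1]."""
    degraded = []
    for row in image:
        factor = 1.0 + rng.uniform(-max_change, max_change)
        degraded.append([min(1.0, max(0.0, pixel * factor)) for pixel in row])
    return degraded

rng = random.Random(0)  # fixed seed so the sketch is repeatable
professional_photo = [[0.2, 0.4, 0.6], [0.3, 0.5, 0.7]]  # toy grayscale image
negative_example = degrade_brightness(professional_photo, rng)

# (image, label) pairs for a discriminator: 1 = professional, 0 = degraded.
training_pairs = [(professional_photo, 1), (negative_example, 0)]
```

Because the degraded copy comes from a known professional photo, no human labeling is needed to obtain the negative class.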

Some creations of our system from Google Street View are shown below. As you can see, the application of the trained aesthetic filters creates some dramatic results (including the image we started this post with!):
Jasper National Park, Canada.
Interlaken, Switzerland.
Park Parco delle Orobie Bergamasche, Italy.
Jasper National Park, Canada.
Professional Evaluation
To judge how successful our algorithm was, we designed a “Turing-test”-like experiment: we mixed our creations with other photos of varying quality and showed them to several professional photographers. They were instructed to assign a quality score to each photo, with the scores defined as follows:
  • 1: Point-and-shoot without consideration for composition, lighting etc.
  • 2: Good photos from general population without a background in photography. Nothing artistic stands out.
  • 3: Semi-pro. Great photos showing clear artistic aspects. The photographer is on the right track of becoming a professional.
  • 4: Pro.
In the following chart, each curve shows scores from professional photographers for images within a certain predicted score range. For our creations with a high predicted score, about 40% of the ratings they received are at “semi-pro” to “pro” levels.
Scores received from professional photographers for photos with different predicted scores.
Future Work
The Street View panoramas served as a testing bed for our project. Someday this technique might even help you to take better photos in the real world. We compiled a showcase of photos created to our satisfaction. If you see a photo you like, you can click on it to bring up a nearby Street View panorama. Would you make the same decision if you were there holding the camera at that moment?

This work was done by Hui Fang and Meng Zhang from Machine Perception at Google Research. We would like to thank Vahid Kazemi for his earlier work in predicting AVA scores using the Inception network, and Sagarika Chalasani, Nick Beato, Bryan Klingner and Rupert Breheny for their help in processing Google Street View panoramas. We would like to thank Peyman Milanfar, Tomas Izo, Christian Szegedy, Jon Barron and Sergey Ioffe for their helpful reviews and comments. Huge thanks to our anonymous professional photographers!

Building Your Own Neural Machine Translation System in TensorFlow

Machine translation – the task of automatically translating between languages – is one of the most active research areas in the machine learning community. Among the many approaches to machine translation, sequence-to-sequence ("seq2seq") models [1, 2] have recently enjoyed great success and have become the de facto standard in most commercial translation systems, such as Google Translate, thanks to their ability to use deep neural networks to capture sentence meanings. However, while there is an abundance of material on seq2seq models such as OpenNMT or tf-seq2seq, there is a lack of material that teaches people both the knowledge and the skills to easily build high-quality translation systems.

Today we are happy to announce a new Neural Machine Translation (NMT) tutorial for TensorFlow that gives readers a full understanding of seq2seq models and shows how to build a competitive translation model from scratch. The tutorial is aimed at making the process as simple as possible, starting with some background knowledge on NMT and walking through code details to build a vanilla system. It then dives into the attention mechanism [3, 4], a key ingredient that allows NMT systems to handle long sentences. Finally, the tutorial provides details on how to replicate key features in Google’s NMT (GNMT) system [5] to train on multiple GPUs.
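To make the attention idea concrete, here is a minimal sketch in plain Python. It assumes dot-product scoring, one of the scoring functions discussed in [4]; it is not the tutorial's code, and the function and variable names are illustrative. Given the current decoder state (the query) and all encoder states, it computes one weight per source position and a context vector that is the weighted average of the encoder states.

```python
import math


def dot_attention(query, encoder_states):
    """Return (weights, context) for dot-product attention over encoder states."""
    # Score each source position by its dot product with the decoder state.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]

    # Softmax over the scores (shifted by the max for numerical stability).
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]

    # The context vector is the attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(weight * state[i] for weight, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example: the query aligns with the first encoder state.
weights, context = dot_attention([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
print(weights, context)
```

Because the context vector is recomputed at every decoding step, the decoder can look back at whichever source words matter most for the next target word, rather than relying on a single fixed-size summary of a long sentence.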

The tutorial also contains detailed benchmark results, which users can replicate on their own. Our models provide a strong open-source baseline with performance on par with GNMT results [5]. We achieve 24.4 BLEU points on the popular WMT’14 English-German translation task. Other benchmark results (English-Vietnamese, German-English) can be found in the tutorial.

In addition, this tutorial showcases the fully dynamic seq2seq API (released with TensorFlow 1.2) aimed at making building seq2seq models clean and easy:
  • Easily read and preprocess dynamically sized input sequences using the new input pipeline in
  • Use padded batching and sequence length bucketing to improve training and inference speeds.
  • Train seq2seq models using popular architectures and training schedules, including several types of attention and scheduled sampling.
  • Perform inference in seq2seq models using in-graph beam search.
  • Optimize seq2seq models for multi-GPU settings.
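The padded batching and sequence-length bucketing mentioned in the list above can be sketched without any framework. The version below is a simplified stand-in in plain Python, not the TensorFlow input pipeline; `PAD_ID`, `bucket_width` and the function names are assumptions made for the example. Sequences of similar length are grouped together, and each batch is padded only to the length of its own longest member.

```python
from collections import defaultdict

PAD_ID = 0  # hypothetical id reserved for padding


def bucket_and_pad(sequences, bucket_width, batch_size):
    """Group sequences of similar length, then pad each batch to its own max length."""
    buckets = defaultdict(list)
    for seq in sequences:
        # Sequences whose lengths fall in the same width-sized range share a bucket.
        buckets[len(seq) // bucket_width].append(seq)

    batches = []
    for bucket_id in sorted(buckets):
        bucket = buckets[bucket_id]
        for start in range(0, len(bucket), batch_size):
            batch = bucket[start:start + batch_size]
            lengths = [len(seq) for seq in batch]
            max_len = max(lengths)
            padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
            # Keep the true lengths so a model can ignore the padded positions.
            batches.append((padded, lengths))
    return batches


def padding_count(batches):
    """Count how many padded positions the batches contain in total."""
    return sum(len(row) - n for padded, lengths in batches
               for row, n in zip(padded, lengths))


sequences = [[5, 6], [7], [1, 2, 3, 4, 5, 6, 7, 8], [9, 9, 9], [4] * 7]
bucketed = bucket_and_pad(sequences, bucket_width=4, batch_size=2)
unbucketed = bucket_and_pad(sequences, bucket_width=100, batch_size=len(sequences))
print(padding_count(bucketed), padding_count(unbucketed))
```

Less padding means fewer wasted computations per batch, which is where the speed-up comes from. A real pipeline would typically also shuffle within buckets so that batches differ between epochs.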
We hope this will help spur the creation of, and experimentation with, many new NMT models by the research community. To get started on your own research, check out the tutorial on GitHub!

Core contributors
Thang Luong, Eugene Brevdo, and Rui Zhao.

We would like to especially thank our collaborator on the NMT project, Rui Zhao. Without his tireless effort, this tutorial would not have been possible. Additional thanks go to Denny Britz, Anna Goldie, Derek Murray, and Cinjon Resnick for their work bringing new features to TensorFlow and the seq2seq library. Lastly, we thank Lukasz Kaiser for the initial help on the seq2seq codebase; Quoc Le for the suggestion to replicate GNMT; Yonghui Wu and Zhifeng Chen for details on the GNMT systems; as well as the Google Brain team for their support and feedback!

[1] Sequence to sequence learning with neural networks, Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. NIPS, 2014.
[2] Learning phrase representations using RNN encoder-decoder for statistical machine translation, Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. EMNLP, 2014.
[3] Neural machine translation by jointly learning to align and translate, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. ICLR, 2015.
[4] Effective approaches to attention-based neural machine translation, Minh-Thang Luong, Hieu Pham, and Christopher D Manning. EMNLP, 2015.
[5] Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean. Technical Report, 2016.

Revisiting the Unreasonable Effectiveness of Data

There has been remarkable success in the field of computer vision over the past decade, much of which can be directly attributed to the application of deep learning models to this machine perception task. Furthermore, since 2012 there have been significant advances in representation capabilities of these systems due to (a) deeper models with high complexity, (b) increased computational power and (c) availability of large-scale labeled data. And while every year we get further increases in computational power and the model complexity (from 7-layer AlexNet to 101-layer ResNet), available datasets have not scaled accordingly. A 101-layer ResNet with significantly more capacity than AlexNet is still trained with the same 1M images from ImageNet circa 2011. As researchers, we have always wondered: if we scale up the amount of training data 10x, will the accuracy double? How about 100x or maybe even 300x? Will the accuracy plateau or will we continue to see increasing gains with more and more data?
While GPU computation power and model sizes have continued to increase over the last five years, the size of the largest training dataset has surprisingly remained constant.
In our paper, “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”, we take the first steps towards clearing the clouds of mystery surrounding the relationship between “enormous data” and deep learning. Our goal was to explore: (a) whether visual representations can still be improved by feeding more and more images with noisy labels to existing algorithms; (b) the nature of the relationship between data and performance on standard vision tasks such as classification, object detection and image segmentation; (c) state-of-the-art models for all the tasks in computer vision using large-scale learning.

Of course, the elephant in the room is: where can we obtain a dataset that is 300x larger than ImageNet? At Google, we have been continuously working on building such datasets automatically to improve computer vision algorithms. Specifically, we have built an internal dataset of 300M images that are labeled with 18291 categories, which we call JFT-300M. The images are labeled using an algorithm that uses a complex mixture of raw web signals, connections between web pages and user feedback. This results in over one billion labels for the 300M images (a single image can have multiple labels). Of the billion image labels, approximately 375M are selected via an algorithm that aims to maximize label precision of selected images. However, there is still considerable noise in the labels: approximately 20% of the labels for selected images are noisy. Since there is no exhaustive annotation, we have no way to estimate the recall of the labels.

Our experimental results validate some of the hypotheses but also produce some surprises:
  • Better Representation Learning Helps. Our first observation is that large-scale data helps in representation learning, which in turn improves the performance on each vision task we study. Our findings suggest that a collective effort to build a large-scale dataset for pretraining is important. They also suggest a bright future for unsupervised and semi-supervised representation learning approaches. It seems the scale of data continues to overpower noise in the label space.
  • Performance increases linearly with orders of magnitude of training data.  Perhaps the most surprising finding is the relationship between performance on vision tasks and the amount of training data (log-scale) used for representation learning. We find that this relationship is still linear! Even at 300M training images, we do not observe any plateauing effect for the tasks studied.
    Figure: Object detection performance when pre-trained on different subsets of JFT-300M from scratch. The x-axis is the dataset size in log-scale; the y-axis is the detection performance in mAP@[.5,.95] on the COCO-minival subset.
  • Capacity is Crucial. We also observe that to fully exploit 300M images, one needs higher capacity (deeper) models. For example, in the case of ResNet-50, the gain on the COCO object detection benchmark is much smaller (1.87%) than the gain when using ResNet-152 (3%).
  • New state of the art results. Our paper presents new state-of-the-art results on several benchmarks using the models learned from JFT-300M. For example, a single model (without any bells and whistles) can now achieve 37.4 AP as compared to 34.3 AP on the COCO detection benchmark.
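The statement that performance grows linearly with the order of magnitude of training data corresponds to a model of the form score ≈ a + b · log10(N), where N is the number of training images. The sketch below fits such a line with ordinary least squares in plain Python. The numbers are synthetic and generated from a known line purely to show the mechanics; they are not results from the paper, and the function name is illustrative.

```python
import math


def fit_log_linear(dataset_sizes, scores):
    """Least-squares fit of score = intercept + slope * log10(dataset_size)."""
    xs = [math.log10(size) for size in dataset_sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic example (NOT measurements): scores generated from 20 + 2 * log10(N).
sizes = [10_000_000, 30_000_000, 100_000_000, 300_000_000]
synthetic_scores = [20.0 + 2.0 * math.log10(size) for size in sizes]

slope, intercept = fit_log_linear(sizes, synthetic_scores)
print(f"gain per 10x more data: {slope:.2f} points")
```

Under this model the slope is the gain obtained from each tenfold increase in data, so a constant slope with no flattening at the largest sizes is what "no plateauing" looks like numerically.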
It is important to highlight that the training regime, learning schedules and parameters we used are based on our understanding of training ConvNets with 1M images from ImageNet. Since we do not search for the optimal set of hyper-parameters in this work (which would have required considerable computational effort), it is highly likely that these results are not the best ones you can obtain when using this scale of data. Therefore, we consider the quantitative performance reported to be an underestimate of the actual impact of data.

This work does not focus on task-specific data, such as exploring whether more bounding boxes affect model performance. We believe that, although challenging, obtaining large-scale task-specific data should be the focus of future study. Furthermore, building a dataset of 300M images should not be a final goal - as a community, we should explore whether models continue to improve in the regime of even larger (1 billion+ image) datasets.

Core Contributors
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta

This work would not have been possible without the significant efforts of the Image Understanding and Expander teams at Google who built the massive JFT dataset. We would specifically like to thank Tom Duerig, Neil Alldrin, Howard Zhou, Lu Chen, David Cai, Gal Chechik, Zheyun Feng, Xiangxin Zhu and Rahul Sukthankar for their help. Also big thanks to the VALE team for APIs and specifically, Jonathan Huang, George Papandreou, Liang-Chieh Chen and Kevin Murphy for helpful discussions.

The Google Brain Residency Program — One Year Later

“Coming from a background in statistics, physics, and chemistry, the Google Brain Residency was my first exposure to both deep learning and serious programming. I enjoyed the autonomy that I was given to research diverse topics of my choosing: deep learning for computer vision and language, reinforcement learning, and theory. I originally intended to pursue a statistics PhD but my experience here spurred me to enroll in the Stanford CS program starting this fall!”
- Melody Guan, 2016 Google Brain Residency Alumna

This month marks the end of an incredibly successful year for our first class of the Google Brain Residency Program. This one-year program was created as an opportunity for individuals from diverse educational backgrounds and experiences to dive into research in machine learning and deep learning. Over the past year, the Residents familiarized themselves with the literature, designed and implemented experiments at Google scale, and engaged in cutting edge research in a wide variety of subjects ranging from theory to robotics to music generation.

To date, the inaugural class of Residents have published over 30 papers at leading machine learning publication venues such as ICLR (15), ICML (11), CVPR (3), EMNLP (2), RSS, GECCO, ISMIR, ISMB and Cosyne. An additional 18 papers are currently under review at NIPS, ICCV, BMVC and Nature Methods. Two of the above papers were published in Distill, exploring how deconvolution causes checkerboard artifacts and presenting ways of visualizing a generative model of handwriting.
A Distill article by residents interactively explores how a neural network generates handwriting.
A system that explores how robots can learn to imitate human motion from observation. For more details, see “Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation” (Co-authored by Resident Corey Lynch, along with P. Sermanet, J. Hsu, S. Levine, accepted to CVPR Workshop 2017).
A model that uses reinforcement learning to train distributed deep learning networks at large scale by optimizing the assignment of computations to hardware devices. For more details, see “Device Placement Optimization with Reinforcement Learning” (Co-authored by Residents Azalia Mirhoseini and Hieu Pham, along with Q. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, J. Dean, submitted to ICML 2017).
An approach to automate the process of discovering optimization methods, with a focus on deep learning architectures. Final version of the paper “Neural Optimizer Search with Reinforcement Learning” (Co-authored by Residents Irwan Bello and Barret Zoph, along with V. Vasudevan, Q. Le, submitted to ICML 2017) coming soon.
Residents have also made significant contributions to the open source community with general-purpose sequence-to-sequence models (used for example in translation), music synthesis, mimicking human sketching, subsampling a sequence for model training, an efficient “attention” mechanism for models, and time series analysis (particularly for neuroscience).

The end of the program year marks our Residents embarking on the next stages in their careers. Many are continuing their research careers on the Google Brain team as full-time employees. Others have chosen to enter top machine learning Ph.D. programs at schools such as Stanford University, UC Berkeley, Cornell University, Oxford University, NYU, University of Toronto and CMU. We could not be more proud to see where their hard work and experiences will take them next!

As we “graduate” our first class, this week we welcome our next class of 35 incredibly talented Residents, who join us with a wide range of experience and educational backgrounds. We can’t wait to see how they will build on the successes of our first class and continue to push the team in new and exciting directions. We look forward to another exciting year of research and innovation ahead of us!

Applications to the 2018 Residency program will open in September 2017. To learn more about the program, visit

MultiModel: Multi-Task Machine Learning Across Domains

Over the last decade, the application and performance of Deep Learning has progressed at an astonishing rate. However, the current state of the field is that the neural network architectures are highly specialized to specific domains of application. An important question remains unanswered: Will a convergence between these domains facilitate a unified model capable of performing well across multiple domains?

Today, we present MultiModel, a neural network architecture that draws from the success of vision, language and audio networks to simultaneously solve a number of problems spanning multiple domains, including image recognition, translation and speech recognition. While strides have been made in this direction before, namely in Google’s Multilingual Neural Machine Translation System used in Google Translate, MultiModel is a first step towards the convergence of vision, audio and language understanding into a single network.

The inspiration for how MultiModel handles multiple domains comes from how the brain transforms sensory input from different modalities (such as sound, vision or taste), into a single shared representation and back out in the form of language or actions. As an analog to these modalities and the transformations they perform, MultiModel has a number of small modality-specific sub-networks for audio, images, or text, and a shared model consisting of an encoder, input/output mixer and decoder, as illustrated below.
MultiModel architecture: small modality-specific sub-networks work with a shared encoder, I/O mixer and decoder. Each petal represents a modality, transforming to and from the internal representation.
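The data flow in this architecture can be sketched as a simple dispatch: a modality-specific input network maps raw input into the shared representation, the shared body processes it, and a modality-specific output network maps the result back out. The code below is a toy illustration only. Every function in it is a hypothetical placeholder (plain arithmetic on lists), not a component of MultiModel or Tensor2Tensor, and the single `shared_body` function stands in for the shared encoder, I/O mixer and decoder.

```python
from typing import Callable, Dict, List

Vector = List[float]


# Hypothetical toy input sub-networks: raw input -> shared representation.
def text_input_net(tokens: List[str]) -> Vector:
    return [float(len(token)) for token in tokens]


def image_input_net(pixel_rows: List[List[float]]) -> Vector:
    return [sum(row) / len(row) for row in pixel_rows]


# Hypothetical toy output sub-network: shared representation -> class index.
def class_output_net(representation: Vector) -> int:
    return max(range(len(representation)), key=lambda i: representation[i])


def shared_body(representation: Vector) -> Vector:
    """Placeholder for the shared encoder, I/O mixer and decoder."""
    total = sum(representation) or 1.0
    return [value / total for value in representation]


INPUT_NETS: Dict[str, Callable] = {"text": text_input_net, "image": image_input_net}
OUTPUT_NETS: Dict[str, Callable] = {"class": class_output_net}


def run_task(input_modality: str, output_modality: str, raw_input):
    """Route any input through its modality net, the shared body, then an output net."""
    representation = INPUT_NETS[input_modality](raw_input)
    processed = shared_body(representation)  # same parameters for every task
    return OUTPUT_NETS[output_modality](processed)


print(run_task("image", "class", [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]))
print(run_task("text", "class", ["a", "bbb"]))
```

The point of the structure is that only the small input and output networks differ between modalities, while everything in `shared_body` is reused across all tasks.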
We demonstrate that MultiModel is capable of learning eight different tasks simultaneously: it can detect objects in images, provide captions, recognize speech, translate between four pairs of languages, and do grammatical constituency parsing at the same time. The input is given to the model together with a very simple signal that determines which output we are requesting. Below we illustrate a few examples taken from a MultiModel trained jointly on these eight tasks¹:
When designing MultiModel it became clear that certain elements from each domain of research (vision, language and audio) were integral to the model’s success in related tasks. We demonstrate that these computational primitives (such as convolutions, attention, or mixture-of-experts layers) clearly improve performance on their originally intended domain of application, while not hindering MultiModel’s performance on other tasks. It is not only possible to achieve good performance while training jointly on multiple tasks, but on tasks with limited quantities of data, the performance actually improves. To our surprise, this happens even if the tasks come from different domains that would appear to have little in common, e.g., an image recognition task can improve performance on a language task.

It is important to note that while MultiModel does not establish new performance records, it does provide insight into the dynamics of multi-domain multi-task learning in neural networks, and the potential for improved learning on data-limited tasks by the introduction of auxiliary tasks. There is a longstanding saying in machine learning: “the best regularizer is more data”; in MultiModel, this data can be sourced across domains, and consequently can be obtained more easily than previously thought. MultiModel provides evidence that training in concert with other tasks can lead to good results and improve performance on data-limited tasks.

Many questions about multi-domain machine learning remain to be studied, and we will continue to work on tuning MultiModel and improving its performance. To allow this research to progress quickly, we open-sourced MultiModel as part of the Tensor2Tensor library. We believe that such synergetic models trained on data from multiple domains will be the next step in deep learning and will ultimately solve tasks beyond the reach of current narrowly trained networks.

This work is a collaboration between Googlers Łukasz Kaiser, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones and Jakob Uszkoreit, and Aidan N. Gomez from the University of Toronto. It was performed while Aidan was working with the Google Brain team.

¹ The 8 tasks were: (1) speech recognition (WSJ corpus), (2) image classification (ImageNet), (3) image captioning (MS COCO), (4) parsing (Penn Treebank), (5) English-German translation, (6) German-English translation, (7) English-French translation, (8) French-English translation (all using WMT data-sets).