Category Archives: Research Blog

The latest news on Google Research

Google Research Awards 2016

We’ve just completed another round of the Google Research Awards, our annual open call for proposals on computer science and related topics including machine learning, machine perception, natural language processing, and security. Our grants cover tuition for a graduate student and provide both faculty and students the opportunity to work directly with Google researchers and engineers.

This round we received 876 proposals covering 44 countries and over 300 universities. After expert reviews and committee discussions, we decided to fund 143 projects. Here are a few observations from this round:

Congratulations to the well-deserving recipients of this round’s awards. If you are interested in applying for the next round (deadline is September 30th), please visit our website for more information.

Preprocessing for Machine Learning with tf.Transform

When applying machine learning to real world datasets, a lot of effort is required to preprocess data into a format suitable for standard machine learning models, such as neural networks. This preprocessing takes a variety of forms, from converting between formats, to tokenizing and stemming text and forming vocabularies, to performing a variety of numerical operations such as normalization.

Today we are announcing tf.Transform, a library for TensorFlow that allows users to define preprocessing pipelines and run these using large scale data processing frameworks, while also exporting the pipeline in a way that can be run as part of a TensorFlow graph. Users define a pipeline by composing modular Python functions, which tf.Transform then executes with Apache Beam, a framework for large-scale, efficient, distributed data processing. Apache Beam pipelines can be run on Google Cloud Dataflow with planned support for running with other frameworks. The TensorFlow graph exported by tf.Transform enables the preprocessing steps to be replicated when the trained model is used to make predictions, such as when serving the model with Tensorflow Serving.

A common problem encountered when running machine learning models in production is "training-serving skew", where the data seen at serving time differs in some way from the data used to train the model, leading to reduced prediction quality. tf.Transform ensures that no skew can arise during preprocessing, by guaranteeing that the serving-time transformations are exactly the same as those performed at training time, in contrast to when training-time and serving-time preprocessing are implemented separately in two different environments (e.g., Apache Beam and TensorFlow, respectively).

In addition to facilitating preprocessing, tf.Transform allows users to compute summary statistics for their datasets. Understanding the data is very important in every machine learning project, as subtle errors can arise from making wrong assumptions about what the underlying data look like. By making the computation of summary statistics easy and efficient, tf.Transform allows users to check their assumptions about both raw and preprocessed data.
tf.Transform allows users to define a preprocessing pipeline. Users can materialize the preprocessed data for use in TensorFlow training, and also export a tf.Transform graph that encodes the transformations as a TensorFlow graph. This transformation graph can then be incorporated into the model graph used for inference.
We’re excited to be releasing this latest addition to the TensorFlow ecosystem, and we hope users will find it useful for preprocessing and understanding their data.

We wish to thank the following members of the tf.Transform team for their contributions to this project: Clemens Mewald, Robert Bradshaw, Rajiv Bharadwaja, Elmer Garduno, Afshin Rostamizadeh, Neoklis Polyzotis, Abhi Rao, Joe Toth, Neda Mirian, Dinesh Kulkarni, Robbie Haertel, Cyril Bortolato and Slaven Bilac. We also wish to thank the TensorFlow, TensorFlow Serving and Cloud Dataflow teams for their support.

Headset “Removal” for Virtual and Mixed Reality

Virtual Reality (VR) enables remarkably immersive experiences, offering new ways to view the world and the ability to explore novel environments, both real and imaginary. However, compared to physical reality, sharing these experiences with others can be difficult, as VR headsets make it challenging to create a complete picture of the people participating in the experience.

Some of this disconnect is alleviated by Mixed Reality (MR), a related medium that shares the virtual context of a VR user in a two dimensional video format allowing other viewers to get a feel for the user’s virtual experience. Even though MR facilitates sharing, the headset continues to block facial expressions and eye gaze, presenting a significant hurdle to a fully engaging experience and complete view of the person in VR.

Google Machine Perception researchers, in collaboration with Daydream Labs and YouTube Spaces, have been working on solutions to address this problem wherein we reveal the user’s face by virtually “removing” the headset and create a realistic see-through effect.
VR user captured in front of a green-screen is blended with the virtual environment to generate the MR output: Traditional MR output has the user face occluded, while our result reveals the face. Note how the headset is modified with a marker to aid tracking.
Our approach uses a combination of 3D vision, machine learning and graphics techniques, and is best explained in the context of enhancing Mixed Reality video (also discussed in the Google-VR blog). It consists of three main components:

Dynamic face model capture
The core idea behind our technique is to use a 3D model of the user’s face as a proxy for the hidden face. This proxy is used to synthesize the face in the MR video, thereby creating an impression of the headset being removed. First, we capture a personalized 3D face model for the user with what we call gaze-dependent dynamic appearance. This initial calibration step requires the user to sit in front of a color+depth camera and a monitor, and then track a marker on the monitor with their eyes. We use this one-time calibration procedure — which typically takes less than a minute — to acquire a 3D face model of the user, and learn a database that maps appearance images (or textures) to different eye-gaze directions and blinks. This gaze database (i.e. the face model with textures indexed by eye-gaze) allows us to dynamically change the appearance of the face during synthesis and generate any desired eye-gaze, thus making the synthesized face look natural and alive
On the left, the user’s face is captured by a camera as she tracks a marker on the monitor with her eyes. On the right, we show the dynamic nature of reconstructed 3D face model: by moving or clicking on the mouse, we are able to simulate both apparent eye gaze and blinking.
Calibration and Alignment
Creating a Mixed Reality video requires a specialized setup consisting of an external camera, calibrated and time-synced with the headset. The camera captures a video stream of the VR user in front of a green screen and then composites a cutout of the user with the virtual world to create the final MR video. An important step here is to accurately estimate the calibration (the fixed 3D transformation) between the camera and headset coordinate systems. These calibration techniques typically involve significant manual intervention and are done in multiple steps. We simplify the process by adding a physical marker to the front of the headset and tracking it visually in 3D, which allows us to optimize for the calibration parameters automatically from the VR session.

For headset “removal”, we need to align the 3D face model with the visible portion of the face in the camera stream, so that they would blend seamlessly with each other. A reasonable proxy to this alignment is to place the face model just behind the headset. The calibration described above, coupled with VR headset tracking, provides sufficient information to determine this placement, allowing us to modify the camera stream by rendering the virtual face into it.

Compositing and Rendering
Having tackled the alignment, the last step involves producing a suitable rendering of the 3D face model, consistent with the content in the camera stream. We are able to reproduce the true eye-gaze of the user by combining our dynamic gaze database with an HTC Vive headset that has been modified by SMI to incorporate eye-tracking technology. Images from these eye trackers lack sufficient detail to directly reproduce the occluded face region, but are well suited to provide fine-grained gaze information. Using the live gaze data from the tracker, we synthesize a face proxy that accurately represents the user’s attention and blinks. At run-time, the gaze database, captured in the preprocessing step, is searched for the most appropriate face image corresponding to the query gaze state, while also respecting aesthetic considerations such as temporal smoothness. Additionally, to account for lighting changes between gaze database acquisition and run-time, we apply color correction and feathering, such that the synthesized face region matches with the rest of the face.

Humans are highly sensitive to artifacts on faces, and even small imperfections in synthesis of the occluded face can feel unnatural and distracting, a phenomenon known as the “uncanny valley.” To mitigate this problem, we do not remove the headset completely, instead we have chosen a user experience that conveys a ‘scuba mask effect’ by compositing the color corrected face proxy with a translucent headset. Reminding the viewer of the presence of the headset helps us avoid the uncanny valley, and also makes our algorithm robust to small errors in alignment and color correction.

This modified camera stream, displaying a see-through headset, with the user’s face revealed and their true eye-gaze recreated, is subsequently merged with the virtual environment to create the final MR video.

Results and Extensions
We have used our headset removal technology to enhance Mixed Reality, allowing the medium to not only convey a VR user’s interaction with the virtual environment but also show their face in a natural and convincing fashion. The example below demonstrates our tech applied to an artist using Google Tilt Brush in a virtual environment:
An artist creates 3D art using Google Tilt Brush, shown in Mixed Reality. On the top is the traditional MR result where the face is hidden behind the headset. On the bottom is our result, which reveals the entire face and eyes for a more natural and engaging experience.
While we have shown the potential of our technology, its applications extend beyond Mixed Reality. Headset removal is poised to enhance communication and social interaction in VR itself with diverse applications like VR video conference meetings, multiplayer VR gaming, and exploration with friends and family. Going from an utterly blank headset to being able to see, with photographic realism, the faces of fellow VR users promises to be a significant transition in the VR world, and we are excited to be a part of it.

The CS Capacity Program – New Tools and SIGCSE 2017

The CS Capacity program was launched in March of 2015 to help address a dramatic increase in undergraduate computer science enrollments that is creating serious resource and pedagogical challenges for many colleges and universities. Over the last two years, a diverse group of universities have been working to develop successful strategies that support the expansion of high-quality CS programs at the undergraduate level. Their work focuses on innovations in teaching and technologies that support scaling while ensuring the engagement of women and underrepresented students. These innovations could provide assistance to many other institutions that are challenged to provide a high-quality educational experience to an increasing number of introductory-level students.

The cohort of CS Capacity institutions include George Mason University, Mount Holyoke College, Rutgers University, and the University California Berkeley which are working individually, and Duke University, North Carolina State University, the University of Florida, and the University of North Carolina which are working together. These institution each brings a unique approach to addressing CS capacity challenges. Two years into the program, we're sharing an update on some of the great projects and ideas to emerge so far.

At George Mason, for example, computer science professor Jeff Offutt and his team have developed an online system to provide self-paced learning for CS1 and CS2 classes that allows learners through the learning materials wore quickly or slowly depending on their needs. The system, called SPARC, includes course content, practice and assessment exercises (including automated testing), mini-lectures, and daily inspirations. This team has also launched a program to recruit and train undergraduate tutorial assistants to increase learning support. For more information on SPARC, contact Jeff Offutt at

The MaGE Peer Mentor program at Mount Holyoke College is addressing its increasing CS student enrollment by preparing undergraduate peer mentors to provide effective feedback on coding assignments and contribute to an inclusive learning environment. One of the major elements of these program is an online course that helps to recruit and train students to be undergraduate peer mentors. Mount Holyoke has made their entire online course curriculum for the peer mentor program available so that other institutions can incorporate all or part of it to assist with preparing their own student tutors. For more information on the MaGE curriculum, contact Heather Pon-Barry at
MaGE Program Students and Faculty from Mount Holyoke College
At University of California, Berkeley, the CS Capacity team is focused on providing access to increased and better tutoring. They’ve instituted a small-group tutoring program that includes weekend mastery learning sessions, increased office hours support, designated discussions section, project checkpoint deadlines, exam/homework/lab/discussion walkthrough videos, and a new office hours app that tracks student satisfaction with office hours. For more information on Berkeley’s interventions, contact Josh Hug at

The CS Capacity team at Rutgers has been exploring the gender gap at multiple levels using a longitudinal study across four required CS classes (paper to be published in the proceedings of the SIGCSE 2017 Technical Symposium). They’re investigating several factors that may impact the retention of women and underrepresented student populations, including intention to major in CS, grades, and prior experience. They’ve also been defining an additional set of feature set to improve their use of Autolab (a course management system with automated grading). This work includes building a hint system to provide more information for students who are struggling with a concept or assignment, crowd-sourcing grading, and studying how students think about CS content and the kinds of errors they are making. The Rutgers team will be publishing their study results in the proceedings of the SIGCSE 2017 Technical Symposium. For more information on these tools, contact Andrew Tjang at

The team consisting of Duke, NCSU, UNC, and UF have produced and plan to share tools to improve the student learning experience. My Digital Hand (MDH) is a free online tool for managing and tracking one-to-one peer teaching sessions (for example, helping to keep track of how many hours peer mentors are spending with mentees). MDH supports best practice in peer teaching and mitigates some of the observed challenges in taking peer teaching to scale. The team has also been working on ASCEND (Adaptive Student Computing Environment with Natural Language Dialogue), an Eclipse plug-in designed to facilitate remote synchronous peer teaching sessions. Students can share their projects with a peer teaching fellow (PTF) and chat as the PTF leads the student through a session. ASCEND helps instructors better understand current practice by logging all programming actions and textual chats in real time to a database. For more information on these tools, contact Jeff Forbes at

Several of the CS Capacity principle investigators will be presenting papers on these new interventions and tools at the SIGCSE conference in March. Faculty from the CS capacity program will also be presenting a panel and roundtable discussion session called “New Tools and Solutions to Address the CS Capacity Crunch.” If you’re attending SIGCSE this year, we hope you’ll join us on Thursday, March 9, from 3:45-5:00 pm.

Given the likelihood that CS undergraduate enrollments will continue to climb, it is critical that the CS education community continue to find, test, and share solutions and tools that enable institutions to effectively teach more students while maintaining the quality of the education experience for students. Faculty from the CS Capacity program will continue to share their solutions and results with the community via CS education conferences and publications.

An updated YouTube-8M, a video understanding challenge, and a CVPR workshop. Oh my!

Last September, we released the YouTube-8M dataset, which spans millions of videos labeled with thousands of classes, in order to spur innovation and advancement in large-scale video understanding. More recently, other teams at Google have released datasets such as Open Images and YouTube-BoundingBoxes that, along with YouTube-8M, can be used to accelerate image and video understanding. To further these goals, today we are releasing an update to the YouTube-8M dataset, and in collaboration with Google Cloud Machine Learning and, we are also organizing a video understanding competition and an affiliated CVPR’17 Workshop.

An Updated YouTube-8M
The new and improved YouTube-8M includes cleaner and more verbose labels (twice as many labels per video, on average), a cleaned-up set of videos, and for the first time, the dataset includes pre-computed audio features, based on a state-of-the-art audio modeling architecture, in addition to the previously released visual features. The audio and visual features are synchronized in time, at 1-second temporal granularity, which makes YouTube-8M a large-scale multi-modal dataset, and opens up opportunities for exciting new research on joint audio-visual (temporal) modeling. Key statistics on the new version are illustrated below (more details here).
A tree-map visualization of the updated YouTube-8M dataset, organized into 24 high-level verticals, including the top-200 most frequent entities, plus the top-5 entities for each vertical.
Sample videos from the top-18 high-level verticals in the YouTube-8M dataset.
The Google Cloud & YouTube-8M Video Understanding Challenge
We are also excited to announce the Google Cloud & YouTube-8M Video Understanding Challenge, in partnership with Google Cloud and The challenge invites participants to build audio-visual content classification models using YouTube-8M as training data, and to then label ~700K unseen test videos. It will be hosted as a Kaggle competition, sponsored by Google Cloud, and will feature a $100,000 prize pool for the top performers (details here). In order to enable wider participation in the competition, Google Cloud is also offering credits so participants can optionally do model training and exploration using Google Cloud Machine Learning. Open-source TensorFlow code, implementing a few baseline classification models for YouTube-8M, along with training and evaluation scripts, is available at Github. For details on getting started with local or cloud-based training, please see our README and the getting started guide on Kaggle.

The CVPR 2017 Workshop on YouTube-8M Large-Scale Video Understanding
We will announce the results of the challenge and host invited talks by distinguished researchers at the 1st YouTube-8M Workshop, to be held July 26, 2017, at the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) in Honolulu, Hawaii. The workshop will also feature presentations by top-performing challenge participants and a selected set of paper submissions. We invite researchers to submit papers describing novel research, experiments, or applications based on YouTube-8M dataset, including papers summarizing their participation in the above challenge.

We designed this dataset with scale and diversity in mind, and hope lessons learned here will generalize to many video domains (YouTube-8M captures over 20 diverse video domains). We believe the challenge can also accelerate research by enabling researchers without access to big data or compute clusters to explore and innovate at previously unprecedented scale. Please join us in advancing video understanding!

This post reflects the work of many others within Machine Perception at Google Research, including Sami Abu-El-Haija, Anja Hauth, Nisarg Kothari, Joonseok Lee, Hanhan Li, Sobhan Naderi Parizi, Rahul Sukthankar, George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan, Jiang Wang, as well as Philippe Poutonnet and Mike Styer from Google Cloud, and our partners at Kaggle. We are grateful for the support and advice from many others at Google Research, Google Cloud, and YouTube, and especially thank Aren Jansen, Jort Gemmeke, Dan Ellis, and the Google Research Sound Understanding team for providing the audio features in the updated dataset.

Announcing TensorFlow 1.0

In just its first year, TensorFlow has helped researchers, engineers, artists, students, and many others make progress with everything from language translation to early detection of skin cancer and preventing blindness in diabetics. We’re excited to see people using TensorFlow in over 6000 open-source repositories online.

Today, as part of the first annual TensorFlow Developer Summit, hosted in Mountain View and livestreamed around the world, we’re announcing TensorFlow 1.0:

It’s faster: TensorFlow 1.0 is incredibly fast! XLA lays the groundwork for even more performance improvements in the future, and now includes tips & tricks for tuning your models to achieve maximum speed. We’ll soon publish updated implementations of several popular models to show how to take full advantage of TensorFlow 1.0 - including a 7.3x speedup on 8 GPUs for Inception v3 and 58x speedup for distributed Inception v3 training on 64 GPUs!

It’s more flexible: TensorFlow 1.0 introduces a high-level API for TensorFlow, with tf.layers, tf.metrics, and tf.losses modules. We’ve also announced the inclusion of a new tf.keras module that provides full compatibility with Keras, another popular high-level neural networks library.

It’s more production-ready than ever: TensorFlow 1.0 promises Python API stability (details here), making it easier to pick up new features without worrying about breaking your existing code.

Other highlights from TensorFlow 1.0:
  • Python APIs have been changed to resemble NumPy more closely. For this and other backwards-incompatible changes made to support API stability going forward, please use our handy migration guide and conversion script.
  • Experimental APIs for Java and Go
  • Higher-level API modules tf.layers, tf.metrics, and tf.losses - brought over from tf.contrib.learn after incorporating skflow and TF Slim
  • Experimental release of XLA, a domain-specific compiler for TensorFlow graphs, that targets CPUs and GPUs. XLA is rapidly evolving - expect to see more progress in upcoming releases.
  • Introduction of the TensorFlow Debugger (tfdbg), a command-line interface and API for debugging live TensorFlow programs.
  • New Android demos for object detection and localization, and camera-based image stylization.
  • Installation improvements: Python 3 docker images have been added, and TensorFlow’s pip packages are now PyPI compliant. This means TensorFlow can now be installed with a simple invocation of pip install tensorflow.
We’re thrilled to see the pace of development in the TensorFlow community around the world. To hear more about TensorFlow 1.0 and how it’s being used, you can watch the TensorFlow Developer Summit talks on YouTube, covering recent updates from higher-level APIs to TensorFlow on mobile to our new XLA compiler, as well as the exciting ways that TensorFlow is being used:

Click here for a link to the livestream and video playlist (individual talks will be posted online later in the day).

The TensorFlow ecosystem continues to grow with new techniques like Fold for dynamic batching and tools like the Embedding Projector along with updates to our existing tools like TensorFlow Serving. We’re incredibly grateful to the community of contributors, educators, and researchers who have made advances in deep learning available to everyone. We look forward to working with you on forums like GitHub issues, Stack Overflow, @TensorFlow, the group and at future events.

On-Device Machine Intelligence

To build the cutting-edge technologies that enable conversational understanding and image recognition, we often apply combinations of machine learning technologies such as deep neural networks and graph-based machine learning. However, the machine learning systems that power most of these applications run in the cloud and are computationally intensive and have significant memory requirements. What if you want machine intelligence to run on your personal phone or smartwatch, or on IoT devices, regardless of whether they are connected to the cloud?

Yesterday, we announced the launch of Android Wear 2.0, along with brand new wearable devices, that will run Google's first entirely “on-device” ML technology for powering smart messaging. This on-device ML system, developed by the Expander research team, enables technologies like Smart Reply to be used for any application, including third-party messaging apps, without ever having to connect with the cloud…so now you can respond to incoming chat messages directly from your watch, with a tap.
The research behind this began last year while our team was developing the machine learning systems that enable conversational understanding capability in Allo and Inbox. The Android Wear team reached out to us and was interested to know whether it would be possible to deploy this Smart Reply technology directly onto a smart device. Because of the limited computing power and memory on smart devices, we quickly realized that it was not possible to do so. Our product manager, Patrick McGregor, realized that this presented a unique challenge and an opportunity for the Expander team to return to the drawing board to design a completely new, lightweight, machine learning architecture — not only to enable Smart Reply on Android Wear, but also to power a wealth of other on-device mobile applications. Together with Tom Rudick, Nathan Beach, and other colleagues from the Android Wear team, we set out to build the new system.

Learning with Projections
A simple strategy to build lightweight conversational models might be to create a small dictionary of common rules (input → reply mappings) on the device and use a naive look-up strategy at inference time. This can work for simple prediction tasks involving a small set of classes using a handful of features (such as binary sentiment classification from text, e.g. “I love this movie” conveys a positive sentiment whereas the sentence “The acting was horrible” is negative). But, it does not scale to complex natural language tasks involving rich vocabularies and the wide language variability observed in chat messages. On the other hand, machine learning models like recurrent neural networks (such as LSTMs), in conjunction with graph learning, have proven to be extremely powerful tools for complex sequence learning in natural language understanding tasks, including Smart Reply. However, compressing such rich models to fit in device memory and produce robust predictions at low computation cost (rapidly on-demand) is extremely challenging. Early experiments with restricting the model to predict only a small handful of replies or using other techniques like quantization or character-level models did not produce useful results.

Instead, we built a different solution for the on-device ML system. We first use a fast, efficient mechanism to group similar incoming messages and project them to similar (“nearby”) bit vector representations. While there are several ways to perform this projection step, such as using word embeddings or encoder networks, we employ a modified version of locality sensitive hashing (LSH) to reduce dimension from millions of unique words to a short, fixed-length sequence of bits. This allows us to compute a projection for an incoming message very fast, on-the-fly, with a small memory footprint on the device since we do not need to store the incoming messages, word embeddings, or even the full model used for training.
Projection step: Similar messages are grouped together and projected to nearby vectors. For example, the messages "hey, how's it going?" and "How's it going buddy?" share similar content and might be projected to the same vector 11100011. Another related message “Howdy, everything going well?” is mapped to a nearby vector 11100110 that differs only in 2 bits.
Next, our system takes the incoming message along with its projections and jointly trains a “message projection model” that learns to predict likely replies using our semi-supervised graph learning framework. The graph learning framework enables training a robust model by combining semantic relationships from multiple sources — message/reply interactions, word/phrase similarity, semantic cluster information — learning useful projection operations that can be mapped to good reply predictions.
Learning step: (Top) Messages along with projections and corresponding replies, if available, are used in a machine learning framework to jointly learn a “message projection model”. (Bottom) The message projection model learns to associate replies with the projections of the corresponding incoming messages. For example, the model projects two different messages “Howdy, everything going well?” and “How’s it going buddy?” (bottom center) to nearby bit vectors and learns to map these to relevant replies (bottom right).
It’s worth noting that while the message projection model can be trained using complex machine learning architectures and the power of the cloud, as described above, the model itself resides and performs inference completely on device. Apps running on the device can pass a user’s incoming messages and receive reply predictions from the on-device model without data leaving the device. The model can also be adapted to cater to the user’s writing style and individual preferences to provide a personalized experience.
Inference step: The model applies the learned projections to an incoming message (or sequence of messages) and suggests relevant and diverse replies. Inference is performed on the device, allowing the model to adapt to user data and personal writing styles.
To get the on-device system to work out of the box, we had to make a few additional improvements such as optimizing for speeding up computations on device and generating rich, diverse replies from the model. We will have a forthcoming scientific publication that describes the on-device machine learning work in more detail.

Converse from Your Wrist
When we embarked on our journey to build this technology from scratch, we weren’t sure if the predictions would be useful or of sufficient quality. We’re quite surprised and excited about how well it works even on Android wearable devices with very limited computation and memory resources. We look forward to continuing to improve the models to provide users with more delightful conversational experiences, and we will be leveraging this on-device ML platform to enable completely new applications in the months to come.

You can now use this feature to respond to your messages directly from your Google watches or any watch that runs Android Wear 2.0. It is already enabled on Google Hangouts, Google Messenger, and many third-party messaging apps. We also provide an API for developers of third-party Wear apps.

On behalf of the Google Expander team, I would also like to thank the following people who helped make this technology a success: Andrei Broder, Andrew Tomkins, David Singleton, Mirko Ranieri, Robin Dua and Yicheng Fan.

Announcing TensorFlow Fold: Deep Learning With Dynamic Computation Graphs

In much of machine learning, data used for training and inference undergoes a preprocessing step, where multiple inputs (such as images) are scaled to the same dimensions and stacked into batches. This lets high-performance deep learning libraries like TensorFlow run the same computation graph across all the inputs in the batch in parallel. Batching exploits the SIMD capabilities of modern GPUs and multi-core CPUs to speed up execution. However, there are many problem domains where the size and structure of the input data varies, such as parse trees in natural language understanding, abstract syntax trees in source code, DOM trees for web pages and more. In these cases, the different inputs have different computation graphs that don't naturally batch together, resulting in poor processor, memory, and cache utilization.

Today we are releasing TensorFlow Fold to address these challenges. TensorFlow Fold makes it easy to implement deep-learning models that operate over data of varying size and structure. Furthermore, TensorFlow Fold brings the benefits of batching to such models, resulting in a speedup of more than 10x on CPU, and more than 100x on GPU, over alternative implementations. This is made possible by dynamic batching, introduced in our paper Deep Learning with Dynamic Computation Graphs.
This animation shows a recursive neural network run with dynamic batching. Operations with the same color are batched together, which lets TensorFlow run them faster. The Embed operation converts words to vector representations. The fully connected (FC) operation combines word vectors to form vector representations of phrases. The output of the network is a vector representation of an entire sentence. Although only a single parse tree of a sentence is shown, the same network can run, and batch together operations, over multiple parse trees of arbitrary shapes and sizes.
The TensorFlow Fold library will initially build a separate computation graph from each input.
Because the individual inputs may have different sizes and structures, the computation graphs may as well. Dynamic batching then automatically combines these graphs to take advantage of opportunities for batching, both within and across inputs, and inserts additional instructions to move data between the batched operations (see our paper for technical details).

To learn more, head over to our github site. We hope that TensorFlow Fold will be useful for researchers and practitioners implementing neural networks with dynamic computation graphs in TensorFlow.

This work was done under the supervision of Peter Norvig.

Advancing Research on Video Understanding with the YouTube-BoundingBoxes Dataset

One of the most challenging research areas in machine learning today is enabling computers to understand what a scene is about. For example, while humans know that a ball that disappears behind a wall only to reappear a moment later is very likely the same object, this is not at all obvious to an algorithm. Understanding this requires not only a global picture of what objects are contained in each frame of a video, but also where those objects are located within the frame and their locations over time. Just last year we published YouTube-8M, a dataset consisting of automatically labelled YouTube videos. And while this helps further progress in the field, it is only one piece to the puzzle.

Today, in order to facilitate progress in video understanding research, we are introducing YouTube-BoundingBoxes, a dataset consisting of 5 million bounding boxes spanning 23 object categories, densely labeling segments from 210,000 YouTube videos. To date, this is the largest manually annotated video dataset containing bounding boxes, which track objects in temporally contiguous frames. The dataset is designed to be large enough to train large-scale models, and be representative of videos captured in natural settings. Importantly, the human-labelled annotations contain objects as they appear in the real world with partial occlusions, motion blur and natural lighting.
Summary of dataset statistics. Bar Chart: Relative number of detections in existing image (red) and video (blue) data sets. The YouTube BoundingBoxes dataset (YT-BB) is at the bottom, is at the bottom. Table: The three columns are counts for: classification annotations, bounding boxes, and unique videos with bounding boxes. Full details on the dataset can be found in the preprint.
A key feature of this dataset is that bounding box annotations are provided for entire video segments. These bounding box annotations may be used to train models that explicitly leverage this temporal information to identify, localize and track objects over time. In a video, individual annotated objects might become entirely occluded and later return in subsequent frames. These annotations of individual objects are sometimes not recognizable from individual frames, but can be understood and recognized in the context of the video if the objects are localized and tracked accurately.
Three video segments, sampled at 1 frame per second. The final frame of each example shows how it is visually challenging to recognize the bounded object, due to blur or occlusion (train example, blue arrow). However, temporally-related frames, where the object has been more clearly identified, can allow object classes to be inferred. Note how only visible parts are included in the box: the orange arrow in the bear example (middle row) points to the hidden head. The dog example illustrates tight bounding boxes that track the tail (orange arrows) and foot (blue arrows). The airplane example illustrates how partial objects are annotated (first frame) tracked across changes in perspective, occlusions and camera cuts.
We hope that this dataset might ultimately aid the computer vision and machine learning community and lead to new methods for analyzing and understanding real world vision problems. You can learn more about the dataset in this associated preprint.

The work was greatly helped along by Xin Pan and Thomas Silva, as well as support and advice from Manfred Georg, Sami Abu-El-Haija, Susanna Ricco and George Toderici.

Using Machine Learning to predict parking difficulty

"When Solomon said there was a time and a place for everything he had not encountered the problem of parking his automobile." -Bob Edwards, Broadcast Journalist

Much of driving is spent either stuck in traffic or looking for parking. With products like Google Maps and Waze, it is our long-standing goal to help people navigate the roads easily and efficiently. But until now, there wasn’t a tool to address the all-too-common parking woes.

Last week, we launched a new feature for Google Maps for Android across 25 US cities that offers predictions about parking difficulty close to your destination so you can plan accordingly. Providing this feature required addressing some significant challenges:
  • Parking availability is highly variable, based on factors like the time, day of week, weather, special events, holidays, and so on. Compounding the problem, there is almost no real time information about free parking spots.
  • Even in areas with internet-connected parking meters providing information on availability, this data doesn’t account for those who park illegally, park with a permit, or depart early from still-paid meters.
  • Roads form a mostly-planar graph, but parking structures may be more complex, with traffic flows across many levels, possibly with different layouts.
  • Both the supply and the demand for parking are in constant flux, so even the best system is at risk of being outdated as soon as it’s built.
To face these challenges, we used a unique combination of crowdsourcing and machine learning (ML) to build a system that can provide you with parking difficulty information for your destination, and even help you decide what mode of travel to take — in a pre-launch experiment, we saw a significant increase in clicks on the transit travel mode button, indicating that users with additional knowledge of parking difficulty were more likely to consider public transit rather than driving.

Three technical pieces were required to build the algorithms behind the parking difficulty feature: good ground truth data from crowdsourcing, an appropriate ML model and a robust set of features to train the model on.

Ground Truth Data
Gathering high-quality ground truth data is often a key challenge in building any ML solution. We began by asking individuals at a diverse set of locations and times if they found the parking difficult. But we learned that answers to subjective questions like this produces inconsistent results - for a given location and time, one person may answer that it was “easy” to find parking while another found it “difficult.” Switching to objective questions like “How long did it it take to find parking?” led to an increase in answer confidence, enabling us to crowdsource a high-quality set of ground truth data with over 100K responses.

Model Features
With this data available, we began to determine features we could train a model on. Fortunately, we were able to turn to the wisdom of the crowd, and utilize anonymous aggregated information from users who opt to share their location data, which already is a vital source of information for estimates of live traffic or popular times and visit durations.

We quickly discovered that even with this data, some unique challenges remain. For example, our system shouldn’t be fooled into thinking parking is plentiful if someone is parking in a gated or private lot. Users arriving by taxi might look like a sign of abundant parking at the front door, and similarly, public-transit users might seem to park at bus stops. These false positives, and many others, all have the potential to mislead an ML system.

So we needed more robust aggregate features. Perhaps not surprisingly, the inspiration for one of these features came from our own backyard in downtown Mountain View. If Google navigation observes many users circling downtown Mountain View during lunchtime along trajectories like this one, it strongly suggests that parking might be difficult:
Our team thought about how to recognize this “fingerprint” of difficult parking as a feature to train on. In this case, we aggregate the difference between when a user should have arrived at a destination if they simply drove to the front door, versus when they actually arrived, taking into account circling, parking, and walking. If many users show a large gap between these two times, we expect this to be a useful signal that parking is difficult.

From there, we continued to develop more features that took into account, for any particular destination, dispersion of parking locations, time-of-day and date dependence of parking (e.g. what if users park close to a destination in the early morning, but further away at busier hours?), historical parking data and more. In the end, we decided on roughly twenty different features along these lines for our model. Then it was time to tune the model performance.

Model Selection & Training
We decided to use a standard logistic regression ML model for this feature, for a few different reasons. First, the behavior of logistic regression is well understood, and it tends to be resilient to noise in the training data; this is a useful property when the data comes from crowdsourcing a complicated response variable like difficulty of parking. Second, it’s natural to interpret the output of these models as the probability that parking will be difficult, which we can then map into descriptive terms like “Limited parking” or “Easy.” Third, it’s easy to understand the influence of each specific feature, which makes it easier to verify that the model is behaving reasonably. For example, when we started the training process, many of us thought that the “fingerprint” feature described above would be the “silver bullet” that would crack the problem for us. We were surprised to note that this wasn’t the case at all — in fact, it was features based on the dispersion of parking locations that turned out to be one of the most powerful predictors of parking difficulty.

With our model in hand, we were able to generate an estimate for difficulty of parking at any place and time. The figure below gives a few examples of the output of our system, which is then used to provide parking difficulty estimates for a given destination. Parking on Monday mornings, for instance, is difficult throughout the city, especially in the busiest financial and retail areas. On Saturday night, things are busy again, but now predominantly in the areas with restaurants and attractions.
Output of our parking difficulty model in the Financial District and Union Square areas of San Francisco. Red denotes a higher confidence that parking is difficult. Top row: a typical Monday at ~8am (left) and ~9pm (right). Bottom row: the same times but on a typical Saturday.
We’re excited about the opportunities to continue to improve the model quality based on user feedback. If we are able to better understand parking difficulty, we will be able to develop new and smarter forms of parking assistance — we’re very excited about future applications of ML to help make transportation more enjoyable!