Category Archives: Research Blog

The latest news on Google Research

Adding Sound Effect Information to YouTube Captions

The effect of audio on our perception of the world can hardly be overstated. Its importance as a communication medium via speech is obviously the most familiar, but there is also significant information conveyed by ambient sounds. These ambient sounds create context that we instinctively respond to, like getting startled by sudden commotion, the use of music as a narrative element, or how laughter is used as an audience cue in sitcoms.

Since 2009, YouTube has provided automatic caption tracks for videos, focusing heavily on speech transcription in order to make the content hosted more accessible. However, without similar descriptions of the ambient sounds in videos, much of the information and impact of a video is not captured by speech transcription alone. To address this, we announced the addition of sound effect information to the automatic caption track in YouTube videos, enabling greater access to the richness of all the audio content.

In this post, we discuss the backend system developed for this effort, a collaboration among the Accessibility, Sound Understanding and YouTube teams that used machine learning (ML) to enable the first ever automatic sound effect captioning system for YouTube.
Click the CC button to see the sound effect captioning system in action.
The application of ML – in this case, a Deep Neural Network (DNN) model – to the captioning task presented unique challenges. While the process of analyzing the time-domain audio signal of a video to detect various ambient sounds is similar to other well known classification problems (such as object detection in images), in a product setting the solution faces additional difficulties. In particular, given an arbitrary segment of audio, we need our models to be able to 1) detect the desired sounds, 2) temporally localize the sound in the segment and 3) effectively integrate it in the caption track, which may have parallel and independent speech recognition results.

A DNN Model for Ambient Sound
The first challenge we faced in developing the model was the task of obtaining enough labeled data suitable for training our neural network. While labeled ambient sound information is difficult to come by, we were able to generate a large enough dataset for training using weakly labeled data. But of all the ambient sounds in a given video, which ones should we train our DNN to detect?

For the initial launch of this feature, we chose [APPLAUSE], [MUSIC] and [LAUGHTER], prioritized based upon our analysis of human-created caption tracks, which indicates that they are among the most frequent sounds that are manually captioned. While the sound space is obviously far richer and provides even more contextually relevant information than these three classes, the semantic information conveyed by these sound effects in the caption track is relatively unambiguous, as opposed to sounds like [RING], which raise the question of “what was it that rang – a bell, an alarm, a phone?”

Much of our initial work on detecting these ambient sounds also included developing the infrastructure and analysis frameworks to enable scaling for future work, including both the detection of sound events and their integration into the automatic caption track. Investing in the development of this infrastructure has the added benefit of allowing us to easily incorporate more sound types in the future, as we expand our algorithms to understand a wider vocabulary of sounds (e.g. [RING], [KNOCK], [BARK]). In doing so, we will be able to incorporate the detected sounds into the narrative to provide more relevant information (e.g. [PIANO MUSIC], [RAUCOUS APPLAUSE]) to viewers.

Dense Detections to Captions
When a video is uploaded to YouTube, the sound effect recognition pipeline runs on the audio stream in the video. The DNN looks at short segments of audio and predicts whether that segment contains any one of the sound events of interest – since multiple sound effects can co-occur, our model makes a prediction at each time step for each of the sound effects. The segment window is then slid to the right (i.e. a slightly later point in time) and the model is used to make a prediction again, and so on until it reaches the end of the audio. This results in a dense stream of the (likelihood of) presence of each of the sound events in our vocabulary at 100 frames per second.
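To make the sliding-window idea concrete, here is a minimal Python sketch. It is not the production pipeline: `dense_predictions` and `toy_scorer` are hypothetical names, and the scorer is a trivial stand-in for the DNN that returns one independent score per sound class.

```python
from typing import Callable, List, Sequence


def dense_predictions(
    frames: Sequence[float],
    score_window: Callable[[Sequence[float]], List[float]],
    window: int,
    hop: int = 1,
) -> List[List[float]]:
    """Slide a fixed-size window over `frames` and score every position.

    Each output row holds one score per sound class, so several classes
    can be "present" at the same time step.
    """
    outputs = []
    for start in range(0, len(frames) - window + 1, hop):
        outputs.append(score_window(frames[start:start + window]))
    return outputs


def toy_scorer(segment: Sequence[float]) -> List[float]:
    # Hypothetical stand-in for the DNN: two classes scored from mean energy.
    energy = sum(abs(x) for x in segment) / len(segment)
    return [min(1.0, energy), min(1.0, energy / 2)]


stream = dense_predictions([0.0, 0.0, 1.0, 1.0, 1.0, 0.0], toy_scorer, window=2)
```

With a hop of one frame, the output has one row of per-class scores for every window position, which is the dense stream that the later smoothing step consumes.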

The dense prediction stream is not directly exposed to the user, of course, since that would result in captions flickering on and off, and because we know that a number of sound effects have some degree of temporal continuity when they occur; e.g. “music” and “applause” will usually be present for a few seconds at least. To incorporate this intuition, we smooth over the dense prediction stream using a modified Viterbi algorithm containing two states: ON and OFF, with the predicted segments for each sound effect corresponding to the ON state. The figure below provides an illustration of the process in going from the dense detections to the final segments determined to contain sound effects of interest.
(Left) The dense sequence of probabilities from our DNN for the occurrence over time of a single sound category in a video. (Center) Binarized segments based on the modified Viterbi algorithm. (Right) The duration-based filter removes segments that are shorter in duration than desired for the class.
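The smoothing step can be sketched with a textbook two-state Viterbi decoder. This is only an illustration, not the modified algorithm used in the system: the `stay` transition probability and the use of the raw DNN probability as the ON emission likelihood are assumptions made for this example.

```python
import math
from typing import List, Sequence


def viterbi_on_off(probs: Sequence[float], stay: float = 0.9) -> List[int]:
    """Most likely OFF (0) / ON (1) state path for per-frame probabilities.

    `stay` is an assumed probability of remaining in the same state between
    consecutive frames; higher values penalize flickering more strongly.
    """
    if not probs:
        return []
    eps = 1e-9
    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)

    def emit(p: float, state: int) -> float:
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithm finite
        return math.log(p if state == 1 else 1.0 - p)

    score = [emit(probs[0], 0), emit(probs[0], 1)]
    back = []  # back[t][state] = best previous state for frame t + 1
    for p in probs[1:]:
        new_score, pointers = [], []
        for state in (0, 1):
            candidates = [
                score[prev] + (log_stay if prev == state else log_switch)
                for prev in (0, 1)
            ]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            new_score.append(candidates[best_prev] + emit(p, state))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the final frame.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


smoothed = viterbi_on_off([0.1, 0.1, 0.9, 0.2, 0.9, 0.9, 0.1, 0.1])
spike = viterbi_on_off([0.1, 0.1, 0.9, 0.1, 0.1])
```

In the first call the one-frame dip to 0.2 is absorbed into a single ON segment, and in the second call an isolated one-frame spike is not enough to leave the OFF state. Plain thresholding would have produced a flickering caption in both cases.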
A classification-based system such as this one will certainly have some errors, and needs to be able to trade off false positives against missed detections as per the product goals. For example, due to the weak labels in the training dataset, the model was often confused between events that tended to co-occur: a segment labeled “laugh” would usually contain both speech and laughter, and the model for “laugh” would have a hard time distinguishing them in test data. In our system, we allow further restrictions based on time spent in the ON state (i.e. do not determine sound X to be detected unless it was determined to be present for at least Y seconds) to push performance toward a desired point in the precision-recall curve.
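The duration restriction described above can be sketched as a small post-processing step over the ON/OFF sequence. The function names and the one-second threshold are hypothetical; only the rate of 100 frames per second comes from the post.

```python
from typing import List, Sequence, Tuple

FRAMES_PER_SECOND = 100  # prediction rate stated in the post


def on_segments(states: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of each run of ON (1) states, end exclusive."""
    segments = []
    start = None
    for i, state in enumerate(states):
        if state == 1 and start is None:
            start = i
        elif state != 1 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(states)))
    return segments


def filter_by_duration(
    segments: Sequence[Tuple[int, int]],
    min_seconds: float,
    fps: int = FRAMES_PER_SECOND,
) -> List[Tuple[int, int]]:
    """Keep only segments that stayed ON for at least `min_seconds`."""
    min_frames = min_seconds * fps
    return [(a, b) for a, b in segments if (b - a) >= min_frames]


# 0.3 s of ON, then 2.5 s of ON, separated by OFF frames.
states = [0] * 50 + [1] * 30 + [0] * 20 + [1] * 250 + [0] * 10
segments = on_segments(states)
kept = filter_by_duration(segments, min_seconds=1.0)
```

Raising `min_seconds` for a class removes more short detections, which trades recall for precision; lowering it does the opposite.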

Once we were satisfied with the performance of our system in temporally localizing sound effect captions based on our offline evaluation metrics, we were faced with the following: how do we combine the sound effect and speech captions to create a single automatic caption track, and how (or when) do we present sound effect information to the user to make it most useful to them?

Adding Sound Effect Information into the Automatic Captions Track
Once we had a system capable of accurately detecting and classifying the ambient sounds in video, we investigated how to convey that information to the viewer in an effective way. In collaboration with our User Experience (UX) research teams, we explored various design options and tested them in a qualitative pilot usability study. The participants of the study had different hearing levels and varying needs for captions. We asked participants a number of questions, including whether the sound effect captions improved their overall experience and their ability to follow events in the video and extract relevant information from the caption track. The goal was to understand the effect of variables such as:
  • Using separate parts of the screen for speech and sound effect captions.
  • Interleaving the speech and sound effect captions as they occur.
  • Only showing sound effect captions at the end of sentences or when there is a pause in speech (even if they occurred in the middle of speech).
  • How hearing users perceive captions when watching with the sound off.
While it wasn’t surprising that almost all users appreciated the added sound effect information when it was accurate, we also paid specific attention to the feedback when the sound detection system made an error (a false positive when determining presence of a sound, or failing to detect an occurrence). This presented a surprising result: when sound effect information was incorrect, it did not detract from the participant’s experience in roughly 50% of the cases. Based upon participant feedback, the reasons for this appear to be:
  • Participants who could hear the audio were able to ignore the inaccuracies.
  • Participants who could not hear the audio interpreted the error as the presence of a sound event, and that they had not missed out on critical speech information.
Overall, users reported that they would be fine with the system making the occasional mistake as long as it was able to provide good information far more often than not.

Looking Forward
Our work toward enabling automatic sound effect captions for YouTube videos and the initial rollout is a step toward making the richness of content in videos more accessible to our users who experience videos in different ways and in different environments that require captions. We’ve developed a framework to enrich the automatic caption track with sound effects, but there is still much to be done here. We hope that this will spur further work and discussion in the community around improving captions using not only automatic techniques, but also around ways to make creator-generated and community-contributed caption tracks richer (including perhaps, starting with the auto-captions) and better to further improve the viewing experience for our users.

Distill: Supporting Clarity in Machine Learning

Science isn't just about discovering new results. It’s also about human understanding. Scientists need to develop notations, analogies, visualizations, and explanations of ideas. This human dimension of science isn't a minor side project. It's deeply tied to the heart of science.

That’s why, in collaboration with OpenAI, DeepMind, YC Research, and others, we’re excited to announce the launch of Distill, a new open science journal and ecosystem supporting human understanding of machine learning. Distill is an independent organization, dedicated to fostering a new segment of the research community.

Modern web technology gives us powerful new tools for expressing this human dimension of science. We can create interactive diagrams and user interfaces that enable intuitive exploration of research ideas. Over the last few years we've seen many incredible demonstrations of this kind of work.
An interactive diagram explaining the Neural Turing Machine from Olah & Carter, 2016.
Unfortunately, while there are a plethora of conferences and journals in machine learning, there aren’t any research venues that are dedicated to publishing this kind of work. This is partly an issue of focus, and partly because traditional publication venues can't, by virtue of their medium, support interactive visualizations. Without a venue to publish in, many significant contributions don’t count as “real academic contributions” and their authors can’t access the academic support structure.

That’s why Distill aims to build an ecosystem to support this kind of work, starting with three pieces: a research journal, prizes recognizing outstanding work, and tools to facilitate the creation of interactive articles.
Distill is an ecosystem to support clarity in Machine Learning.
Led by a diverse steering committee of leaders from the machine learning and user interface communities, we are very excited to see where Distill will go. To learn more about Distill, see the overview page or read the latest articles.

Announcing Guetzli: A New Open Source JPEG Encoder

(Cross-posted on the Google Open Source Blog)

At Google, we care about giving users the best possible online experience, both through our own services and products and by contributing new tools and industry standards for use by the online community. That’s why we’re excited to announce Guetzli, a new open source algorithm that creates high quality JPEG images with file sizes 35% smaller than currently available methods, enabling webmasters to create webpages that can load faster and use even less data.

Guetzli [guɛtsli] — cookie in Swiss German — is a JPEG encoder for digital images and web graphics that can enable faster online experiences by producing smaller JPEG files while still maintaining compatibility with existing browsers, image processing applications and the JPEG standard. From the practical viewpoint this is very similar to our Zopfli algorithm, which produces smaller PNG and gzip files without needing to introduce a new format, and different than the techniques used in RNN-based image compression, RAISR, and WebP, which all need client and ecosystem changes for compression gains at internet scale.

The visual quality of JPEG images is directly correlated to its multi-stage compression process: color space transform, discrete cosine transform, and quantization. Guetzli specifically targets the quantization stage in which the more visual quality loss is introduced, the smaller the resulting file. Guetzli strikes a balance between minimal loss and file size by employing a search algorithm that tries to overcome the difference between the psychovisual modeling of JPEG's format, and Guetzli’s psychovisual model, which approximates color perception and visual masking in a more thorough and detailed way than what is achievable by simpler color transforms and the discrete cosine transform. However, while Guetzli creates smaller image file sizes, the tradeoff is that these search algorithms take significantly longer to create compressed images than currently available methods.
Figure 1. 16x16 pixel synthetic example of a phone line hanging against a blue sky, traditionally a case where JPEG compression algorithms suffer from artifacts. Uncompressed original is on the left. Guetzli (on the right) shows fewer ringing artifacts than libjpeg (middle) and has a smaller file size.
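The size-versus-quality tradeoff at the quantization stage can be illustrated with plain uniform scalar quantization. This sketch is not Guetzli's search procedure or its psychovisual model; the coefficient values and step sizes are made up for the example.

```python
from typing import List, Tuple


def quantize(coefficients: List[float], step: float) -> List[int]:
    """Divide each transform coefficient by the step size and round to an integer."""
    return [round(c / step) for c in coefficients]


def dequantize(levels: List[int], step: float) -> List[float]:
    """Reconstruct approximate coefficients from the quantized levels."""
    return [q * step for q in levels]


def zeros_and_max_error(coefficients: List[float], step: float) -> Tuple[int, float]:
    """Count zeroed coefficients (cheap to encode) and the worst reconstruction error."""
    levels = quantize(coefficients, step)
    restored = dequantize(levels, step)
    error = max(abs(a - b) for a, b in zip(coefficients, restored))
    return levels.count(0), error


# Hypothetical DCT coefficients for one block, largest magnitude first.
coeffs = [312.0, -45.0, 18.0, 7.0, -3.0, 1.5, 0.8, -0.4]
fine_zeros, fine_error = zeros_and_max_error(coeffs, step=2.0)
coarse_zeros, coarse_error = zeros_and_max_error(coeffs, step=16.0)
```

The coarser step zeroes more coefficients, which makes the block cheaper to encode, but it also increases the reconstruction error. An encoder that chooses quantization carefully is looking for the step sizes where that extra error is least visible.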
And while Guetzli produces smaller image file sizes without sacrificing quality, we additionally found that, in experiments where compressed image file sizes are kept constant, human raters consistently preferred the images Guetzli produced over libjpeg images, even when the libjpeg files were the same size or even slightly larger. We think this makes the slower compression a worthy tradeoff.
Figure 2. 20x24 pixel zoomed areas from a picture of a cat’s eye. Uncompressed original on the left. Guetzli (on the right) shows fewer ringing artifacts than libjpeg (middle) without requiring a larger file size.
It is our hope that webmasters and graphic designers will find Guetzli useful and apply it to their photographic content, making users’ experience smoother on image-heavy websites in addition to reducing load times and bandwidth costs for mobile users. Last, we hope that the new explicitly psychovisual approach in Guetzli will inspire further image and video compression research.

An Upgrade to SyntaxNet, New Models and a Parsing Competition

At Google, we continuously improve the language understanding capabilities used in applications ranging from generation of email responses to translation. Last summer, we open-sourced SyntaxNet, a neural-network framework for analyzing and understanding the grammatical structure of sentences. Included in our release was Parsey McParseface, a state-of-the-art model that we had trained for analyzing English, followed quickly by a collection of pre-trained models for 40 additional languages, which we dubbed Parsey's Cousins. While we were excited to share our research and to provide these resources to the broader community, building machine learning systems that work well for languages other than English remains an ongoing challenge. We are excited to announce a few new research resources, available now, that address this problem.

SyntaxNet Upgrade
We are releasing a major upgrade to SyntaxNet. This upgrade incorporates nearly a year’s worth of our research on multilingual language understanding, and is available to anyone interested in building systems for processing and understanding text. At the core of the upgrade is a new technology that enables learning of richly layered representations of input sentences. More specifically, the upgrade extends TensorFlow to allow joint modeling of multiple levels of linguistic structure, and to allow neural-network architectures to be created dynamically during processing of a sentence or document.

Our upgrade makes it, for example, easy to build character-based models that learn to compose individual characters into words (e.g. ‘c-a-t’ spells ‘cat’). By doing so, the models can learn that words can be related to each other because they share common parts (e.g. ‘cats’ is the plural of ‘cat’ and shares the same stem; ‘wildcat’ is a type of ‘cat’). Parsey and Parsey’s Cousins, on the other hand, operated over sequences of words. As a result, they were forced to memorize words seen during training and relied mostly on the context to determine the grammatical function of previously unseen words.
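As a rough illustration of why character-level input helps, the sketch below compares words by the character trigrams they share. This is not how SyntaxNet composes characters into words (the upgrade learns that composition with neural networks); `char_ngrams` and `overlap` are hypothetical helpers that only show that related word forms share parts, which a word-level model cannot see.

```python
from typing import Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Character n-grams of a word, with < and > marking the word boundaries."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


cat_cats = overlap("cat", "cats")        # shares the stem
cat_wildcat = overlap("cat", "wildcat")  # shares the ending
cat_dog = overlap("cat", "dog")          # shares nothing
```

A model that treats each word as an opaque symbol sees ‘cat’, ‘cats’ and ‘wildcat’ as three unrelated items, whereas any character-level representation exposes the shared pieces.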

As an example, consider the following (meaningless but grammatically correct) sentence:
The gostak distims the doshes.
This sentence was originally coined by Andrew Ingraham who explained: “You do not know what this means; nor do I. But if we assume that it is English, we know that the doshes are distimmed by the gostak. We know too that one distimmer of doshes is a gostak.” Systematic patterns in morphology and syntax allow us to guess the grammatical function of words even when they are completely novel: we understand that ‘doshes’ is the plural of the noun ‘dosh’ (similar to the ‘cats’ example above) or that ‘distims’ is the third person singular of the verb ‘distim’. Based on this analysis we can then derive the overall structure of this sentence even though we have never seen the words before.

To showcase the new capabilities provided by our upgrade to SyntaxNet, we are releasing a set of new pretrained models called ParseySaurus. These models use the character-based input representation mentioned above and are thus much better at predicting the meaning of new words based both on their spelling and how they are used in context. The ParseySaurus models are far more accurate than Parsey’s Cousins (reducing errors by as much as 25%), particularly for morphologically-rich languages like Russian, or agglutinative languages like Turkish and Hungarian. In those languages there can be dozens of forms for each word and many of these forms might never be observed during training - even in a very large corpus.

Consider the following fictitious Russian sentence, where again the stems are meaningless, but the suffixes define an unambiguous interpretation of the sentence structure:
Even though our Russian ParseySaurus model has never seen these words, it can correctly analyze the sentence by inspecting the character sequences which constitute each word. In doing so, the system can determine many properties of the words (notice how many more morphological features there are here than in the English example). To see the sentence as ParseySaurus does, here is a visualization of how the model analyzes this sentence:
Each square represents one node in the neural network graph, and lines show the connections between them. The left-side “tail” of the graph shows the model consuming the input as one long string of characters. These are intermittently passed to the right side, where the rich web of connections shows the model composing words into phrases and producing a syntactic parse. Check out the full-size rendering here.

A Competition
You might be wondering whether character-based modeling is all we need or whether there are other techniques that might be important. SyntaxNet has lots more to offer, like beam search and different training objectives, but there are of course also many other possibilities. To find out what works well in practice, we are helping co-organize a multilingual parsing competition at this year’s Conference on Computational Natural Language Learning (CoNLL) with the goal of building syntactic parsing systems that work well in real-world settings and for 45 different languages.

The competition is made possible by the Universal Dependencies (UD) initiative, whose goal is to develop cross-linguistically consistent treebanks. Because machine learned models can only be as good as the data that they have access to, we have been contributing data to UD since 2013. For the competition, we partnered with UD and DFKI to build a new multilingual evaluation set consisting of 1000 sentences that have been translated into 20+ different languages and annotated by linguists with parse trees. This evaluation set is the first of its kind (in the past, each language had its own independent evaluation set) and will enable more consistent cross-lingual comparisons. Because the sentences have the same meaning and have been annotated according to the same guidelines, we will be able to get closer to answering the question of which languages might be harder to parse.

We hope that the upgraded SyntaxNet framework and our pre-trained ParseySaurus models will inspire researchers to participate in the competition. We have additionally created a tutorial showing how to load a Docker image and train models on the Google Cloud Platform, to facilitate participation by smaller teams with limited resources. So, if you have an idea for making your own models with the SyntaxNet framework, sign up to compete! We believe that the configurations that we are releasing are a good place to start, but we look forward to seeing how participants will be able to extend and improve these models or perhaps create better ones!

Thanks to everyone involved who made this competition happen, including our collaborators at UD-Pipe, who provide another baseline implementation to make it easy to enter the competition. Happy parsing from the main developers, Chris Alberti, Daniel Andor, Ivan Bogatyy, Mark Omernick, Zora Tung and Ji Ma!

Quick Access in Drive: Using Machine Learning to Save You Time

At Google, we research cutting-edge machine learning (ML) techniques that allow us to provide products and services aimed at helping you focus on what’s important. From providing language translations to understanding images to helping you respond to emails, it is our goal to help you save time, making life — and work — a little more convenient.

Recent studies have shown that finding information is second only to managing email as a drain on workplace productivity. To help address this, last year we launched Quick Access, a feature in Google Drive that uses ML to surface the most relevant documents as soon as you visit the Google Drive home screen. Originally available only for G Suite customers on Android, Quick Access is now available for anyone who uses Google Drive (on the Web, Android, and iOS), saving you from having to enter a search or to browse through your folders. Our metrics show that Quick Access takes you to the documents you need in half the time compared to manually navigating or searching.
Quick Access uses deep neural networks to determine patterns from various signals, such as activity in Drive, meetings on your Calendar, and more, to anticipate your needs and show the appropriate documents on the Drive home screen. Traditional ML approaches require domain experts to derive complex features from data, which are in turn used to train the model. For Quick Access, however, we constructed thousands of simple features from the various signals above (for instance, the timestamps of the last 20 edit events on a document would constitute 20 simple input features), and combined them with the power of deep neural networks to learn from the aggregated activity of our users. By using deep neural networks we were able to develop accurate predictive models with simpler features and less feature engineering effort.
Quick Access suggestions on the top row in Drive on a desktop browser.
The model computes a relevance score for each of the documents in Drive and the top scoring documents are presented on the home screen. For example, if you have a Calendar entry for a meeting with a coworker in the next few minutes, Quick Access might predict that the presentation you’ve been working on with that coworker is more relevant compared to your monthly budget spreadsheet or the photos you uploaded last week. If you’ve been updating a spreadsheet every weekend, then next weekend, Quick Access will likely display that spreadsheet ahead of the other documents you viewed during the week.
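The feature construction and ranking described above can be sketched as follows. Everything here is illustrative: the function names, the padding value and especially `toy_score` are hypothetical, with the scorer standing in for the trained deep neural network. Only the idea of turning the last 20 edit timestamps into 20 simple features comes from the post.

```python
from typing import Callable, Dict, List, Sequence

NUM_EDIT_FEATURES = 20  # "the last 20 edit events", as described in the post


def edit_recency_features(
    edit_timestamps: Sequence[float], now: float, size: int = NUM_EDIT_FEATURES
) -> List[float]:
    """One simple feature per recent edit: seconds since that edit, newest first.

    Documents with fewer than `size` edits are padded with -1.0 (an assumed
    convention for "no such event").
    """
    recent = sorted(edit_timestamps, reverse=True)[:size]
    ages = [now - t for t in recent]
    return ages + [-1.0] * (size - len(ages))


def top_documents(
    features_by_doc: Dict[str, List[float]],
    score: Callable[[List[float]], float],
    k: int = 3,
) -> List[str]:
    """Score every document and return the k highest-scoring document names."""
    ranked = sorted(features_by_doc, key=lambda d: score(features_by_doc[d]), reverse=True)
    return ranked[:k]


def toy_score(features: List[float]) -> float:
    # Hypothetical stand-in for the trained network: favor recently edited documents.
    ages = [a for a in features if a >= 0]
    return -min(ages) if ages else float("-inf")


now = 1000.0
edits = {"slides": [990.0, 900.0], "budget": [100.0], "photos": []}
features = {doc: edit_recency_features(ts, now) for doc, ts in edits.items()}
suggested = top_documents(features, toy_score, k=2)
```

In a real system the score would come from a model trained on many such simple features across several signals, rather than from a hand-written rule like the one in `toy_score`.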

We hope Quick Access helps you use Drive more effectively, allowing you to save time and be more productive. To learn more, watch this talk from Google Cloud Next ‘17 that dives into more details on the ML behind Quick Access.

Thanks to Alexandrin Popescul and Marc Najork for contributions that made this application of machine learning technology possible. This work was in close collaboration with several engineers on the Drive team including Sean Abraham, Brian Calaci, Mike Colagrosso, Mike Procopio, Jesse Sterr, and Timothy Vis.

Assisting Pathologists in Detecting Cancer with Deep Learning

A pathologist’s report after reviewing a patient’s biological tissue samples is often the gold standard in the diagnosis of many diseases. For cancer in particular, a pathologist’s diagnosis has a profound impact on a patient’s therapy. The reviewing of pathology slides is a very complex task, requiring years of training to gain the expertise and experience to do well.

Even with this extensive training, there can be substantial variability in the diagnoses given by different pathologists for the same patient, which can lead to misdiagnoses. For example, agreement in diagnosis for some forms of breast cancer can be as low as 48%, and similarly low for prostate cancer. The lack of agreement is not surprising given the massive amount of information that must be reviewed in order to make an accurate diagnosis. Pathologists are responsible for reviewing all the biological tissues visible on a slide. However, there can be many slides per patient, each of which is 10+ gigapixels when digitized at 40X magnification. Imagine having to go through a thousand 10 megapixel (MP) photos, and having to be responsible for every pixel. Needless to say, this is a lot of data to cover, and often time is limited.

To address these issues of limited time and diagnostic variability, we are investigating how deep learning can be applied to digital pathology, by creating an automated detection algorithm that can naturally complement pathologists’ workflow. We used images (graciously provided by the Radboud University Medical Center) which have also been used for the 2016 ISBI Camelyon Challenge¹ to train algorithms that were optimized for localization of breast cancer that has spread (metastasized) to lymph nodes adjacent to the breast.

The results? Standard “off-the-shelf” deep learning approaches like Inception (aka GoogLeNet) worked reasonably well for both tasks, although the tumor probability prediction heatmaps produced were a bit noisy. After additional customization, including training networks to examine the image at different magnifications (much like what a pathologist does), we showed that it was possible to train a model that either matched or exceeded the performance of a pathologist who had unlimited time to examine the slides.
Left: Images from two lymph node biopsies. Middle: earlier results of our deep learning tumor detection. Right: our current results. Notice the visibly reduced noise (potential false positives) between the two versions.
In fact, the prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint². We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset. Even more exciting for us was that our model generalized very well, even to images that were acquired from a different hospital using different scanners. For full details, see our paper “Detecting Cancer Metastases on Gigapixel Pathology Images”.
A closeup of a lymph node biopsy. The tissue contains a breast cancer metastasis as well as macrophages, which look similar to tumor but are benign normal tissue. Our algorithm successfully identifies the tumor region (bright green) and is not confused by the macrophages.
While these results are promising, there are a few important caveats to consider.
  • Like most metrics, the FROC localization score is not perfect. Here, the FROC score is defined as the sensitivity (percentage of tumors detected) at a few pre-defined average false positives per slide. It is pretty rare for a pathologist to make a false positive call (mistaking normal cells as tumor). For example, the score of 73% mentioned above corresponds to a 73% sensitivity and zero false positives. By contrast, our algorithm’s sensitivity rises when more false positives are allowed. At 8 false positives per slide, our algorithms had a sensitivity of 92%.
  • These algorithms perform well for the tasks for which they are trained, but lack the breadth of knowledge and experience of human pathologists — for example, being able to detect other abnormalities that the model has not been explicitly trained to classify (e.g. inflammatory process, autoimmune disease, or other types of cancer).
  • To ensure the best clinical outcome for patients, these algorithms need to be incorporated in a way that complements the pathologist’s workflow. We envision that algorithms such as ours could improve the efficiency and consistency of pathologists. For example, pathologists could reduce their false negative rates (percentage of undetected tumors) by reviewing the top ranked predicted tumor regions including up to 8 false positive regions per slide. As another example, these algorithms could enable pathologists to easily and accurately measure tumor size, a factor that is associated with prognosis.
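The FROC-style score mentioned in the first caveat can be sketched as below. This is a simplified illustration rather than the challenge's official evaluation code: it assumes each true-positive detection corresponds to a distinct tumor, and the list of false-positive rates is an assumption, since the post only says that a few pre-defined rates are used.

```python
from typing import List, Sequence, Tuple

Detection = Tuple[float, bool]  # (confidence score, hits a real tumor?)

# Assumed operating points, in average false positives per slide.
ASSUMED_FP_RATES = (0.25, 0.5, 1, 2, 4, 8)


def sensitivity_at_fp_rate(
    detections: Sequence[Detection],
    num_tumors: int,
    num_slides: int,
    max_avg_fp: float,
) -> float:
    """Best sensitivity reachable while averaging at most `max_avg_fp` false positives per slide."""
    true_pos = false_pos = 0
    best = 0.0
    # Lowering the threshold one detection at a time, most confident first.
    for _, is_tumor in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tumor:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / num_slides > max_avg_fp:
            break
        best = max(best, true_pos / num_tumors)
    return best


def froc_score(
    detections: Sequence[Detection],
    num_tumors: int,
    num_slides: int,
    fp_rates: Sequence[float] = ASSUMED_FP_RATES,
) -> float:
    """Average sensitivity over the chosen false-positive rates."""
    values: List[float] = [
        sensitivity_at_fp_rate(detections, num_tumors, num_slides, rate)
        for rate in fp_rates
    ]
    return sum(values) / len(values)


toy_detections = [
    (0.9, True), (0.8, False), (0.7, True),
    (0.6, False), (0.5, True), (0.4, False),
]
at_zero_fp = sensitivity_at_fp_rate(toy_detections, num_tumors=4, num_slides=2, max_avg_fp=0)
at_one_fp = sensitivity_at_fp_rate(toy_detections, num_tumors=4, num_slides=2, max_avg_fp=1)
score = froc_score(toy_detections, num_tumors=4, num_slides=2)
```

The toy data shows the behavior described in the caveat: sensitivity at zero false positives is lower than sensitivity when some false positives per slide are tolerated, so a single averaged score hides where on that curve a reader or an algorithm actually operates.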
Training models is just the first of many steps in translating interesting research to a real product. From clinical validation to regulatory approval, much of the journey from “bench to bedside” still lies ahead — but we are off to a very promising start, and we hope by sharing our work, we will be able to accelerate progress in this space.

1 For those who might be interested, the Camelyon17 challenge, which builds upon the 2016 challenge, is currently underway.

2 The pathologist ended up spending 30 hours on this task on 130 slides.

Google Research Awards 2016

We’ve just completed another round of the Google Research Awards, our annual open call for proposals on computer science and related topics including machine learning, machine perception, natural language processing, and security. Our grants cover tuition for a graduate student and provide both faculty and students the opportunity to work directly with Google researchers and engineers.

This round we received 876 proposals covering 44 countries and over 300 universities. After expert reviews and committee discussions, we decided to fund 143 projects. Here are a few observations from this round:

Congratulations to the well-deserving recipients of this round’s awards. If you are interested in applying for the next round (deadline is September 30th), please visit our website for more information.

Preprocessing for Machine Learning with tf.Transform

When applying machine learning to real world datasets, a lot of effort is required to preprocess data into a format suitable for standard machine learning models, such as neural networks. This preprocessing takes a variety of forms, from converting between formats, to tokenizing and stemming text and forming vocabularies, to performing a variety of numerical operations such as normalization.

Today we are announcing tf.Transform, a library for TensorFlow that allows users to define preprocessing pipelines and run these using large scale data processing frameworks, while also exporting the pipeline in a way that can be run as part of a TensorFlow graph. Users define a pipeline by composing modular Python functions, which tf.Transform then executes with Apache Beam, a framework for large-scale, efficient, distributed data processing. Apache Beam pipelines can be run on Google Cloud Dataflow with planned support for running with other frameworks. The TensorFlow graph exported by tf.Transform enables the preprocessing steps to be replicated when the trained model is used to make predictions, such as when serving the model with TensorFlow Serving.

A common problem encountered when running machine learning models in production is "training-serving skew", where the data seen at serving time differs in some way from the data used to train the model, leading to reduced prediction quality. tf.Transform ensures that no skew can arise during preprocessing, by guaranteeing that the serving-time transformations are exactly the same as those performed at training time, in contrast to when training-time and serving-time preprocessing are implemented separately in two different environments (e.g., Apache Beam and TensorFlow, respectively).
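The idea of avoiding skew by sharing one transformation between training and serving can be illustrated with a minimal plain-Python sketch. This is not the tf.Transform API: the names `analyze` and `make_transform`, the feature names, and the sample rows are all assumptions made for illustration. A full pass over the training data computes the statistics, and the resulting function is then reused unchanged for serving requests.

```python
import math
from collections import Counter


def analyze(rows):
    """Full pass over the training data: compute the statistics the transform needs."""
    ages = [row["age"] for row in rows]
    mean = sum(ages) / len(ages)
    variance = sum((a - mean) ** 2 for a in ages) / len(ages)
    # Guard against zero variance so the transform never divides by zero.
    std = math.sqrt(variance) or 1.0
    # Assign vocabulary ids by descending frequency, ties broken alphabetically.
    counts = Counter(row["city"] for row in rows)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    vocab = {city: index for index, city in enumerate(ordered)}
    return {"mean": mean, "std": std, "vocab": vocab}


def make_transform(stats):
    """Freeze the analyzed statistics into one function used for training AND serving."""
    def transform(row):
        return {
            "age_scaled": (row["age"] - stats["mean"]) / stats["std"],
            # Cities unseen during analysis map to a single out-of-vocabulary id.
            "city_id": stats["vocab"].get(row["city"], len(stats["vocab"])),
        }
    return transform


# Hypothetical training data.
training_rows = [
    {"age": 20, "city": "Paris"},
    {"age": 30, "city": "Lagos"},
    {"age": 40, "city": "Paris"},
]
transform = make_transform(analyze(training_rows))

# The same function object handles both phases, so the two cannot drift apart.
training_examples = [transform(row) for row in training_rows]
serving_example = transform({"age": 30, "city": "Quito"})
```

Because the mean, standard deviation, and vocabulary are captured once and reused, a serving request is scaled and encoded exactly as the training rows were. Skew of the kind described above would appear if the serving path recomputed or re-implemented any of these steps separately.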

In addition to facilitating preprocessing, tf.Transform allows users to compute summary statistics for their datasets. Understanding the data is very important in every machine learning project, as subtle errors can arise from making wrong assumptions about what the underlying data look like. By making the computation of summary statistics easy and efficient, tf.Transform allows users to check their assumptions about both raw and preprocessed data.
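As a rough illustration of how summary statistics expose wrong assumptions about data, the following plain-Python sketch summarizes one numeric column. It does not use tf.Transform; the function `summarize` and the sample values are hypothetical.

```python
import statistics


def summarize(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
    }


# Hypothetical raw column: one missing value and one likely data-entry error (400).
raw_ages = [20, 30, None, 40, 400]
summary = summarize(raw_ages)
```

Here the maximum and the gap between mean and median both point to the outlier, and the missing count flags a value that a downstream numerical transform would otherwise have to handle silently.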
tf.Transform allows users to define a preprocessing pipeline. Users can materialize the preprocessed data for use in TensorFlow training, and also export a tf.Transform graph that encodes the transformations as a TensorFlow graph. This transformation graph can then be incorporated into the model graph used for inference.
We’re excited to be releasing this latest addition to the TensorFlow ecosystem, and we hope users will find it useful for preprocessing and understanding their data.

We wish to thank the following members of the tf.Transform team for their contributions to this project: Clemens Mewald, Robert Bradshaw, Rajiv Bharadwaja, Elmer Garduno, Afshin Rostamizadeh, Neoklis Polyzotis, Abhi Rao, Joe Toth, Neda Mirian, Dinesh Kulkarni, Robbie Haertel, Cyril Bortolato and Slaven Bilac. We also wish to thank the TensorFlow, TensorFlow Serving and Cloud Dataflow teams for their support.

Headset “Removal” for Virtual and Mixed Reality

Virtual Reality (VR) enables remarkably immersive experiences, offering new ways to view the world and the ability to explore novel environments, both real and imaginary. However, compared to physical reality, sharing these experiences with others can be difficult, as VR headsets make it challenging to create a complete picture of the people participating in the experience.

Some of this disconnect is alleviated by Mixed Reality (MR), a related medium that shares the virtual context of a VR user in a two-dimensional video format, allowing other viewers to get a feel for the user’s virtual experience. Even though MR facilitates sharing, the headset continues to block facial expressions and eye gaze, presenting a significant hurdle to a fully engaging experience and complete view of the person in VR.

Google Machine Perception researchers, in collaboration with Daydream Labs and YouTube Spaces, have been working on solutions to address this problem wherein we reveal the user’s face by virtually “removing” the headset and create a realistic see-through effect.
A VR user captured in front of a green screen is blended with the virtual environment to generate the MR output: traditional MR output has the user’s face occluded, while our result reveals the face. Note how the headset is modified with a marker to aid tracking.
Our approach uses a combination of 3D vision, machine learning and graphics techniques, and is best explained in the context of enhancing Mixed Reality video (also discussed in the Google-VR blog). It consists of three main components:

Dynamic face model capture
The core idea behind our technique is to use a 3D model of the user’s face as a proxy for the hidden face. This proxy is used to synthesize the face in the MR video, thereby creating an impression of the headset being removed. First, we capture a personalized 3D face model for the user with what we call gaze-dependent dynamic appearance. This initial calibration step requires the user to sit in front of a color+depth camera and a monitor, and then track a marker on the monitor with their eyes. We use this one-time calibration procedure, which typically takes less than a minute, to acquire a 3D face model of the user, and learn a database that maps appearance images (or textures) to different eye-gaze directions and blinks. This gaze database (i.e., the face model with textures indexed by eye-gaze) allows us to dynamically change the appearance of the face during synthesis and generate any desired eye-gaze, thus making the synthesized face look natural and alive.
On the left, the user’s face is captured by a camera as she tracks a marker on the monitor with her eyes. On the right, we show the dynamic nature of reconstructed 3D face model: by moving or clicking on the mouse, we are able to simulate both apparent eye gaze and blinking.
Calibration and Alignment
Creating a Mixed Reality video requires a specialized setup consisting of an external camera, calibrated and time-synced with the headset. The camera captures a video stream of the VR user in front of a green screen and then composites a cutout of the user with the virtual world to create the final MR video. An important step here is to accurately estimate the calibration (the fixed 3D transformation) between the camera and headset coordinate systems. These calibration techniques typically involve significant manual intervention and are done in multiple steps. We simplify the process by adding a physical marker to the front of the headset and tracking it visually in 3D, which allows us to optimize for the calibration parameters automatically from the VR session.

For headset “removal”, we need to align the 3D face model with the visible portion of the face in the camera stream, so that they blend seamlessly with each other. A reasonable proxy for this alignment is to place the face model just behind the headset. The calibration described above, coupled with VR headset tracking, provides sufficient information to determine this placement, allowing us to modify the camera stream by rendering the virtual face into it.
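The placement step can be sketched as a composition of rigid transforms. The example below is a simplified illustration, not the method used in this work. It assumes the calibration is a fixed transform from the tracking coordinate system to the camera, that the headset pose is reported in that tracking system, and that the face sits a fixed offset behind the headset. All numeric values are hypothetical, and pure translations are used for brevity; rotations compose the same way.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def apply(transform, point):
    """Apply a 4x4 rigid transform to a 3D point."""
    x, y, z = point
    vec = (x, y, z, 1.0)
    return tuple(sum(transform[i][k] * vec[k] for k in range(4)) for i in range(3))


def translation(tx, ty, tz):
    """Build a 4x4 transform that only translates (no rotation)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


# Hypothetical fixed calibration: tracking space -> camera space.
camera_from_tracking = translation(0.0, -1.5, 2.0)
# Hypothetical per-frame headset pose: headset space -> tracking space.
tracking_from_headset = translation(0.2, 1.6, 0.0)

# Compose once per frame: headset space -> camera space.
camera_from_headset = matmul(camera_from_tracking, tracking_from_headset)

# Assumed face anchor: 8 cm behind the headset along its local z axis.
face_in_headset = (0.0, 0.0, -0.08)
face_in_camera = apply(camera_from_headset, face_in_headset)
```

Only `tracking_from_headset` changes from frame to frame in this sketch, which is why estimating the fixed calibration once is enough to place the face model for the whole session.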

Compositing and Rendering
Having tackled the alignment, the last step involves producing a suitable rendering of the 3D face model, consistent with the content in the camera stream. We are able to reproduce the true eye-gaze of the user by combining our dynamic gaze database with an HTC Vive headset that has been modified by SMI to incorporate eye-tracking technology. Images from these eye trackers lack sufficient detail to directly reproduce the occluded face region, but are well suited to provide fine-grained gaze information. Using the live gaze data from the tracker, we synthesize a face proxy that accurately represents the user’s attention and blinks. At run-time, the gaze database, captured in the preprocessing step, is searched for the most appropriate face image corresponding to the query gaze state, while also respecting aesthetic considerations such as temporal smoothness. Additionally, to account for lighting changes between gaze database acquisition and run-time, we apply color correction and feathering, such that the synthesized face region matches with the rest of the face.
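One way to picture the run-time database search is a nearest-neighbor lookup with a penalty for jumping away from the previously shown entry. The sketch below is an assumption about how such a search could look, not the actual algorithm: the cost function, the `smoothness` weight, the two-angle gaze representation, and the texture file names are all hypothetical, and blink handling is omitted.

```python
import math


def select_frame(database, query_gaze, previous_index=None, smoothness=0.5):
    """Return the index of the database entry that best matches the query gaze.

    Cost = distance to the query gaze
         + smoothness * distance to the gaze of the previously shown entry.
    """
    best_index, best_cost = None, math.inf
    for index, entry in enumerate(database):
        cost = math.dist(entry["gaze"], query_gaze)
        if previous_index is not None:
            previous_gaze = database[previous_index]["gaze"]
            cost += smoothness * math.dist(entry["gaze"], previous_gaze)
        if cost < best_cost:
            best_index, best_cost = index, cost
    return best_index


# Hypothetical database: gaze as (horizontal, vertical) angles in degrees.
gaze_database = [
    {"gaze": (-10.0, 0.0), "texture": "look_left.png"},
    {"gaze": (0.0, 0.0), "texture": "look_center.png"},
    {"gaze": (10.0, 0.0), "texture": "look_right.png"},
]

first = select_frame(gaze_database, (1.0, 0.5))
smoothed = select_frame(gaze_database, (6.0, 0.0), previous_index=first)
unsmoothed = select_frame(gaze_database, (6.0, 0.0))
```

With the smoothness term, the second query stays on the center texture even though the right-looking texture is slightly closer to the query, which trades a small gaze error for less frame-to-frame flicker.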

Humans are highly sensitive to artifacts on faces, and even small imperfections in synthesis of the occluded face can feel unnatural and distracting, a phenomenon known as the “uncanny valley.” To mitigate this problem, we do not remove the headset completely; instead, we have chosen a user experience that conveys a ‘scuba mask effect’ by compositing the color-corrected face proxy with a translucent headset. Reminding the viewer of the presence of the headset helps us avoid the uncanny valley, and also makes our algorithm robust to small errors in alignment and color correction.
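The translucent-headset effect can be approximated with standard alpha blending of a headset pixel over the synthesized face pixel. This is a minimal per-pixel sketch under that assumption; the function `blend`, the opacity value, and the pixel colors are hypothetical and not taken from the actual renderer.

```python
def blend(face_rgb, headset_rgb, headset_opacity):
    """Composite a translucent headset pixel over a face pixel (channels in 0..255)."""
    if not 0.0 <= headset_opacity <= 1.0:
        raise ValueError("headset_opacity must be between 0 and 1")
    return tuple(
        round(headset_opacity * h + (1.0 - headset_opacity) * f)
        for f, h in zip(face_rgb, headset_rgb)
    )


face_pixel = (200, 160, 140)      # color-corrected face proxy (hypothetical)
headset_pixel = (40, 40, 40)      # dark headset shell (hypothetical)
scuba_pixel = blend(face_pixel, headset_pixel, headset_opacity=0.25)
```

An opacity of 0 would show only the synthesized face and an opacity of 1 only the headset. Intermediate values keep the headset visible as a tint, which is what lets small alignment or color errors in the face proxy go unnoticed.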

This modified camera stream, displaying a see-through headset, with the user’s face revealed and their true eye-gaze recreated, is subsequently merged with the virtual environment to create the final MR video.

Results and Extensions
We have used our headset removal technology to enhance Mixed Reality, allowing the medium to not only convey a VR user’s interaction with the virtual environment but also show their face in a natural and convincing fashion. The example below demonstrates our tech applied to an artist using Google Tilt Brush in a virtual environment:
An artist creates 3D art using Google Tilt Brush, shown in Mixed Reality. On the top is the traditional MR result where the face is hidden behind the headset. On the bottom is our result, which reveals the entire face and eyes for a more natural and engaging experience.
While we have shown the potential of our technology, its applications extend beyond Mixed Reality. Headset removal is poised to enhance communication and social interaction in VR itself with diverse applications like VR video conference meetings, multiplayer VR gaming, and exploration with friends and family. Going from an utterly blank headset to being able to see, with photographic realism, the faces of fellow VR users promises to be a significant transition in the VR world, and we are excited to be a part of it.

The CS Capacity Program – New Tools and SIGCSE 2017

The CS Capacity program was launched in March of 2015 to help address a dramatic increase in undergraduate computer science enrollments that is creating serious resource and pedagogical challenges for many colleges and universities. Over the last two years, a diverse group of universities have been working to develop successful strategies that support the expansion of high-quality CS programs at the undergraduate level. Their work focuses on innovations in teaching and technologies that support scaling while ensuring the engagement of women and underrepresented students. These innovations could provide assistance to many other institutions that are challenged to provide a high-quality educational experience to an increasing number of introductory-level students.

The cohort of CS Capacity institutions includes George Mason University, Mount Holyoke College, Rutgers University, and the University of California, Berkeley, which are working individually, and Duke University, North Carolina State University, the University of Florida, and the University of North Carolina, which are working together. Each of these institutions brings a unique approach to addressing CS capacity challenges. Two years into the program, we're sharing an update on some of the great projects and ideas to emerge so far.

At George Mason, for example, computer science professor Jeff Offutt and his team have developed an online system to provide self-paced learning for CS1 and CS2 classes that allows learners to move through the learning materials more quickly or slowly depending on their needs. The system, called SPARC, includes course content, practice and assessment exercises (including automated testing), mini-lectures, and daily inspirations. This team has also launched a program to recruit and train undergraduate tutorial assistants to increase learning support. For more information on SPARC, contact Jeff Offutt at

The MaGE Peer Mentor program at Mount Holyoke College is addressing its increasing CS student enrollment by preparing undergraduate peer mentors to provide effective feedback on coding assignments and contribute to an inclusive learning environment. One of the major elements of this program is an online course that helps to recruit and train students to be undergraduate peer mentors. Mount Holyoke has made its entire online course curriculum for the peer mentor program available so that other institutions can incorporate all or part of it to assist with preparing their own student tutors. For more information on the MaGE curriculum, contact Heather Pon-Barry at
MaGE Program Students and Faculty from Mount Holyoke College
At the University of California, Berkeley, the CS Capacity team is focused on providing access to increased and better tutoring. They’ve instituted a small-group tutoring program that includes weekend mastery learning sessions, increased office hours support, designated discussion sections, project checkpoint deadlines, exam/homework/lab/discussion walkthrough videos, and a new office hours app that tracks student satisfaction with office hours. For more information on Berkeley’s interventions, contact Josh Hug at

The CS Capacity team at Rutgers has been exploring the gender gap at multiple levels using a longitudinal study across four required CS classes (paper to be published in the proceedings of the SIGCSE 2017 Technical Symposium). They’re investigating several factors that may impact the retention of women and underrepresented student populations, including intention to major in CS, grades, and prior experience. They’ve also been defining an additional feature set to improve their use of Autolab (a course management system with automated grading). This work includes building a hint system to provide more information for students who are struggling with a concept or assignment, crowd-sourcing grading, and studying how students think about CS content and the kinds of errors they are making. For more information on these tools, contact Andrew Tjang at

The team consisting of Duke, NCSU, UNC, and UF has produced and plans to share tools to improve the student learning experience. My Digital Hand (MDH) is a free online tool for managing and tracking one-to-one peer teaching sessions (for example, helping to keep track of how many hours peer mentors are spending with mentees). MDH supports best practice in peer teaching and mitigates some of the observed challenges in taking peer teaching to scale. The team has also been working on ASCEND (Adaptive Student Computing Environment with Natural Language Dialogue), an Eclipse plug-in designed to facilitate remote synchronous peer teaching sessions. Students can share their projects with a peer teaching fellow (PTF) and chat as the PTF leads the student through a session. ASCEND helps instructors better understand current practice by logging all programming actions and textual chats in real time to a database. For more information on these tools, contact Jeff Forbes at

Several of the CS Capacity principal investigators will be presenting papers on these new interventions and tools at the SIGCSE conference in March. Faculty from the CS Capacity program will also be presenting a panel and roundtable discussion session called “New Tools and Solutions to Address the CS Capacity Crunch.” If you’re attending SIGCSE this year, we hope you’ll join us on Thursday, March 9, from 3:45-5:00 pm.

Given the likelihood that CS undergraduate enrollments will continue to climb, it is critical that the CS education community continue to find, test, and share solutions and tools that enable institutions to effectively teach more students while maintaining the quality of the education experience for students. Faculty from the CS Capacity program will continue to share their solutions and results with the community via CS education conferences and publications.