Author Archives: Research Blog

Four years of Schema.org – Recent Progress and Looking Forward

Posted by Ramanathan Guha, Google Fellow

In 2011, we announced schema.org, a new initiative from Google, Bing and Yahoo! to create and support a common vocabulary for structured data markup on web pages. Since that time, schema.org has been a resource for webmasters looking to add markup to their pages so that search engines can use that data to index content better and surface it in new experiences like rich snippets, GMail, and the Google App.

Schema.org, which provides a growing vocabulary for describing various kinds of entity in terms of properties and relationships, has become increasingly important as the Web transitions to a multi-device, mobile-oriented world. We are now seeing schema.org being used on many millions of Web sites, defining data types and properties common across applications, platforms and products, in order to enhance the user experience by delivering the most relevant information they need, when they need it.

Schema.org in Google Rich Snippets

Schema.org in Google Knowledge Graph panels

Schema.org in Recipe carousels

In Schema.org: Evolution of Structured Data on the Web, an overview article published this week on ACM, we report some key schema.org adoption metrics from a sample of 10 billion pages from a combination of the Google index and Web Data Commons. In this sample, 31.3% of pages have schema.org markup, up from 22% one year ago. Structured data markup is now a core part of the modern web.

The schema.org group at W3C is now amongst the largest active W3C communities, serving as a hub for diverse groups exploring schemas covering diverse topics such as sports, healthcare, e-commerce, food packaging, bibliography and digital archive management. Other companies, also make use of the same data to build different applications, and as new use cases arise further schemas are integrated via community discussion at W3C. Each of these topics in turn have subtle inter-relationships - for example schemas for food packaging, for flight reservations, for recipes and for restaurant menus, each have different approaches to describing food restrictions and allergies. Rather than try to force a common unified approach across these domains, schema.org's evolution is pragmatic, driven by the combination of available Web data, and the likelihood of mainstream consuming applications.

Schema.org is also finding new kinds of uses. One exciting line of work is the use of schema.org marked up pages as training corpus for machine learning. John Foley, Michael Bendersky and Vanja Josifovski used schema.org data to build a system that can learn to recognize events that may be geographically local to a particular user. Other researchers are looking at using schema.org pages with similar markup, but in different languages, to automatically create parallel corpora for machine translation.

Four years after its launch, Schema.org is entering its next phase, with more of the vocabulary development taking place in a more distributed fashion, as extensions. As schema.org adoption has grown, a number groups with more specialized vocabularies have expressed interest in extending schema.org with their terms. Examples of this include real estate, product, finance, medical and bibliographic information. A number of extensions, for topics ranging from automobiles to product details, are already underway. In such a model, schema.org itself is just the core, providing a unifying vocabulary and congregation forum as necessary.

Source: Google Research Blog

Text-to-Speech for low resource languages (episode 2): Building a parametric voice

Posted by Alexander Gutkin, Google Speech Team

This is the second episode in the series of posts reporting on the work we are doing to build text-to-speech (TTS) systems for low resource languages. In the previous episode, we described the crowdsourced data collection effort for Project Unison. In this episode, we describe our work to construct a parametric voice based on that data.

In our previous episode, we described building TTS systems for low resource languages, and how one of the objectives of data collection for such systems was to quickly build a database representing multiple speakers. There are two main justifications for this approach. First, professional voice talents are often not available for under-resourced languages, so we need to record ordinary people who get tired reading tedious text rather quickly. Hence, the amount of text a person can record is rather limited and we need multiple speakers for a reasonably sized database that can be used by others as well. Second, we wanted to be able to create a voice that sounds human but is not identifiable as a real person. Various concatenative approaches to speech synthesis, such as unit selection, are not very suitable for this problem. This is because the selection algorithm may join acoustic units from different speakers generating a very unnatural sounding result.

Adopting parametric speech synthesis techniques is an attractive approach to building multi-speaker corpora described above. This is because in parametric synthesis the training stage of the statistical component will take care of multiple-speakers by estimating an averaged out representation of various acoustic parameters representing each individual speaker. Depending on number of speakers in the corpus, their acoustic similarity and ratio of speaker genders, the resulting acoustic model can represent an average voice that is indistinguishable from human and yet cannot be traced back to any actual speakers recorded during the data collection.

We decided to use two different approaches to acoustic modeling in our experiments. The first approach uses Hidden Markov Models (HMMs). This well-established technique was pioneered by Prof. Keiichi Tokuda at Nagoya Institute of Technology, Japan and has been widely adopted in academia and industry. It is also supported by a dedicated open-source HMM synthesis toolkit. The resulting models are small enough to fit on mobile devices.

The second approach relies on Recurrent Neural Networks (RNNs). RNNs have feedback loops in their topology, allowing them to model temporal dependencies between various phonemes in human speech. In 2013, Heiga Zen, Andrew Senior and Mike Schuster proposed a neural network-based model that mimics deep structure of human speech production for speech synthesis. The model has further been extended into a Long Short-Term Memory (LSTM) RNN. This allows long term memorization, which is good for speech applications. Earlier this year, Heiga Zen and Hasim Sak described the LSTM RNN architecture that has been specifically designed for fast speech synthesis. The LSTM RNNs are also used in our Automatic Speech Recognition (ASR) systems recently mentioned in our blog.

Using the Hidden Markov Model (HMM) and LSTM RNN synthesizers described above, we experimented with a multi-speaker Bangla corpus totaling 1526 utterances (waveforms and corresponding transcriptions) from five different speakers. We also built a third system that utilizes LSTM RNN acoustic model, but this time we made it small and fast enough to run on a mobile phone.

We synthesized the following Bangla sentence "এটি একটি বাংলা বাক্যের উদাহরণ" translated from “This is an example sentence in Bangla”. Though HMM synthesizer output can sound intelligible, it does exhibit some classic downsides with a voice that sounds buzzy and muffled. With the LSTM RNN configuration for mobile devices, the resulting audio sounds clearer and has improved intonation over the HMM version. We also tried a LSTM RNN configuration with more network nodes (and thus not suitable for low-end mobile devices) to generate this waveform - the quality is slightly better but is not a huge improvement over the more lightweight LSTM RNN version. We hypothesize that this is due to the fact that a neural network with many nodes has more parameters and thus requires more data to train.

These early results are encouraging for several reasons. First, they confirm that natural-sounding speech synthesis based on multiple speakers is practically possible. It is also significant that the total number of recordings used was relatively small, yet were able to build intelligible parametric speech synthesis. This means that it is possible to collect training data for such a speech synthesizer by engaging the help of volunteers who are not professional voice artists, for a short period of time per person. Using multiple volunteers is an advantage: it results in more diverse data, and the resulting synthetic voice does not represent any specific individual. This approach may well be the foundation for bringing speech technology to many more traditionally under-served languages.

NEXT UP: But can it say, “Google”? (Ep.3)

Source: Google Research Blog

Making online learning even easier with a re-envisioned Course Builder

Posted by Adam Feldman, Product Manager and Pavel Simakov, Technical Lead, Course Builder Team

(Cross-posted on the Google for Education blog)

The Course Builder team believes in enabling new and better ways to learn (for both the instructor and learner). Today's release of Course Builder v1.10 furthers these goals in three ways, by being easier to use, embeddable and applicable to more types of content.

Easier to use
We took a step back and re-envisioned the menus and navigation of the administrative interface based on the steps instructors take as they create a course. These are designed to help you through the process of creating, styling, publishing and managing your courses. This re-imagined design gives a solid foundation for future versions of Course Builder.

A completely redesigned navigation simplifies content authoring and configuration.

To support this redesign, we’ve also completely revamped our documentation. There’s now one home for all of Course Builder’s materials: Google Open Online Education. Here, you’ll find everything you need to conceptualize and construct your content, create a course using Course Builder, and even develop new modules to extend Course Builder’s capabilities. The content now reflects the latest features and organization. This re-imagined design gives a solid foundation for future versions of Course Builder.

Embeddable assessment support
What if you want to use some of Course Builder’s features but already have an existing learning site? To help with these situations, Course Builder now supports embeddable assessments (graded questions and answers with an optional due date). Simply create your assessments in Course Builder, copy the JavaScript snippet and paste it on any site. Your users will be able to complete the assessments from the comfort of your existing site and you’ll be able to benefit from Course Builder’s per-question feedback, auto-grading and analytics with just two short lines of code that are automatically generated for you.

We started with embeddable assessments because evaluation is so important to learning, but we don’t plan to stop there. Watch for additional embeddable components in the future.

Applicable to more types of content
Many types of online learning content, like tutorials, exercises and documentation, are a lot like online courses. For instance, they might involve presenting content to users, having them do exercises or assessments and allowing them to stop and return later. Yet, you might not think of them as traditional courses.

To make Course Builder a better fit for a broader set of online content, we’ve added a new “guides” experience. Guides are a new way for students to browse and consume your content. Compared to typical online courses -- which can enforce a strict linear path (from unit 1 to unit 2, etc.) -- guides present your content as a non-numbered list. Users are free to enter and exit in any order. It also allows you to show the content for many courses together.

You could imagine each guide being a documentation page or tutorial section. Guides also work with any existing Course Builder units and can be made available by simply enabling that feature in the dashboard. Here are a couple of our courses, when viewed as guides:

Within each guide, the user is guided through the steps, which could be portions of a docs page or lessons in a unit, as in this example from the “Power Searching with Google” sample course:

By letting users jump in and out of the content as they like, guides are ideally suited to the on-the-go learner and look great on phones and tablets. It’s our first foray into responsive mobile design... but it won’t be our last.

Guides currently support public courses, but we’ll be adding registration, enhanced statefulness and interface customization, as well as elements of dynamic learning (think of a personalized list of guides).

This release has focused on making Course Builder easier to use and more relevant. It sets up the framework to give future features a natural home. It adds embeddable assessments to make Course Builder useful in more places. And it introduces guides, a new, less linear format for consuming content.

For a full list of features, see the release notes, and let us know what you think. Keep on learning!

Source: Google Research Blog

When can Quantum Annealing win?

Posted by Hartmut Neven, Director of Engineering

During the last two years, the Google Quantum AI team has made progress in understanding the physics governing quantum annealers. We recently applied these new insights to construct proof-of-principle optimization problems and programmed these into the D-Wave 2X quantum annealer that Google operates jointly with NASA. The problems were designed to demonstrate that quantum annealing can offer runtime advantages for hard optimization problems characterized by rugged energy landscapes.

We found that for problem instances involving nearly 1000 binary variables, quantum annealing significantly outperforms its classical counterpart, simulated annealing. It is more than 10⁸ times faster than simulated annealing running on a single core. We also compared the quantum hardware to another algorithm called Quantum Monte Carlo. This is a method designed to emulate the behavior of quantum systems, but it runs on conventional processors. While the scaling with size between these two methods is comparable, they are again separated by a large factor sometimes as high as 10⁸.

Time to find the optimal solution with 99% probability for different problem sizes. We compare Simulated Annealing (SA), Quantum Monte Carlo (QMC) and D-Wave 2X. Shown are the 50, 75 and 85 percentiles over a set of 100 instances. We observed a speedup of many orders of magnitude for the D-Wave 2X quantum annealer for this optimization problem characterized by rugged energy landscapes. For such problems quantum tunneling is a useful computational resource to traverse tall and narrow energy barriers.

While these results are intriguing and very encouraging, there is more work ahead to turn quantum enhanced optimization into a practical technology. The design of next generation annealers must facilitate the embedding of problems of practical relevance. For instance, we would like to increase the density and control precision of the connections between the qubits as well as their coherence. Another enhancement we wish to engineer is to support the representation not only of quadratic optimization, but of higher order optimization as well. This necessitates that not only pairs of qubits can interact directly but also larger sets of qubits. Our quantum hardware group is working on these improvements which will make it easier for users to input hard optimization problems. For higher-order optimization problems, rugged energy landscapes will become typical. Problems with such landscapes stand to benefit from quantum optimization because quantum tunneling makes it easier to traverse tall and narrow energy barriers.

We should note that there are algorithms, such as techniques based on cluster finding, that can exploit the sparse qubit connectivity in the current generation of D-Wave processors and still solve our proof-of-principle problems faster than the current quantum hardware. But due to the denser connectivity of next generation annealers, we expect those methods will become ineffective. Also, in our experience we find that lean stochastic local search techniques such as simulated annealing are often the most competitive for hard problems with little structure to exploit. Therefore, we regard simulated annealing as a generic classical competition that quantum annealing needs to beat. We are optimistic that the significant runtime gains we have found will carry over to commercially relevant problems as they occur in tasks relevant to machine intelligence.

For details please refer to http://arxiv.org/abs/1512.02206.

Source: Google Research Blog

How to Classify Images with TensorFlow

Posted by Pete Warden, Software Engineer

Prior to joining Google, I spent a lot of time trying to get computers to recognize objects in images. At Jetpac my colleagues and I built mustache detectors to recognize bars full of hipsters, blue sky detectors to find pubs with beer gardens, and dog detectors to spot canine-friendly cafes. At first, we used the traditional computer vision approaches that I'd used my whole career, writing a big ball of custom logic to laboriously recognize one object at a time. For example, to spot sky I'd first run a color detection filter over the whole image looking for shades of blue, and then look at the upper third. If it was mostly blue, and the lower portion of the image wasn't, then I'd classify that as probably a photo of the outdoors.

I'd been an engineer working on vision problems since the late 90's, and the sad truth was that unless you had a research team and plenty of time behind you, this sort of hand-tailored hack was the only way to get usable results. As you can imagine, the results were far from perfect and each detector I wrote was a custom job, and didn't help me with the next thing I needed to recognize. This probably seems laughable to anybody who didn't work in computer vision in the recent past! It's such a primitive way of solving the problem, it sounds like it should have been superseded long ago.

That's why I was so excited when I started to play around with deep learning. It became clear as I tried them out that the latest approaches using convolutional neural networks were producing far better results than my hand-tuned code on similar problems. Not only that, the process of training a detector for a new class of object was much easier. I didn't have to think about what features to detect, I'd just supply a network with new training examples and it would take it from there.

Those experiences converted me into a deep learning enthusiast, and so when Jetpac was acquired and I had the chance to join Google and work with many of the stars of the field, I couldn't resist. What impressed me more than anything was the team's willingness to share their knowledge with the rest of the world.

I'm especially happy that we've just managed to release TensorFlow, our internal machine learning framework, because it gives me a chance to show practical, usable examples of why I'm so convinced deep learning is an essential tool for anybody working with images, speech, or text in ML.

Given my background, my favorite first example is using a deep network to spot objects in an image. One of the early showcases for the new approach to neural networks was an annual competition to recognize 1,000 different classes of objects, from the Imagenet data set, and TensorFlow includes a pre-trained network for that task. If you look inside the examples folder in the source code, you'll see “label_image”, which is a small C++ application for using that network.

The README has the instructions for building TensorFlow on your machine, downloading the binary files defining the network, and compiling the sample code. Once it's all built, just run it with no arguments, and you should see a list of results showing "Military Uniform" at the top. This is running on the default image of Admiral Grace Hopper, and correctly spots her attire.

Image via Wikipedia

After that, try pointing it at your own images using the “--image” command line flag, and you should see a set of labels for each. If you want to know more about what's going on under the hood, the C++ section of the TensorFlow Inception tutorial goes into a lot more detail.

The only things it will spot are those that are in the original 1,000 Imagenet classes, and it will always try to find something, which can lead to some funny results. There are no people categories, so on portraits you'll often see objects that are associated with people like seat belts or oxygen masks, or in Lincoln’s case, a bow tie!

Image via U.S History Images

If the image is poorly lit, then “nematode” is usually the top pick since most training photos of those are taken in very dim surroundings. It's also not perfect in its identification, with an error rate of 5.6% for getting the right label in the top five results. However, that’s not all that bad considering Stanford’s Andrej Karpathy found that even someone who was trained at the job could only achieve a slightly-better 5.1% error doing the same task manually. We can do even better if we combine the outputs of four trained models into an "ensemble", with an error rate of just 3.5%.

It's unlikely that the set of labels it produces is exactly what you need for your application, so the next step would be to train your own network. That is a much bigger task than running a pre-trained one like this, but one of the things I like about TensorFlow is that it spans the whole lifecycle of a machine learning model, from experimentation, to training, and into production, as this example shows. To get started training, I'd recommend looking at this simple tutorial on recognizing hand-drawn digits from the MNIST data set.

I hope that sharing this framework will help developers build amazing user experiences we’d never even think of. We’ve been having a massive amount of fun with TensorFlow, and I can’t wait to see what interesting image tools you build using it!

Source: Google Research Blog

NIPS 2015 and Machine Learning Research at Google

Posted by Sanjiv Kumar, Research Scientist

This week, Montreal hosts the 29^th Annual Conference on Neural Information Processing Systems (NIPS 2015), a machine learning and computational neuroscience conference that includes invited talks, demonstrations and oral and poster presentations of some of the latest in machine learning research. Google will have a strong presence at NIPS 2015, with over 140 Googlers attending in order to contribute to and learn from the broader academic research community by presenting technical talks and posters, in addition to hosting workshops and tutorials.

Research at Google is at the forefront of innovation in Machine Intelligence, actively exploring virtually all aspects of machine learning including classical algorithms as well as cutting-edge techniques such as deep learning. Focusing on both theory as well as application, much of our work on language understanding, speech, translation, visual processing, ranking, and prediction relies on Machine Intelligence. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, and develop learning approaches to understand and generalize.

If you are attending NIPS 2015, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people. You can also learn more about our research being presented at NIPS 2015 in the list below (Googlers highlighted in blue).

Google is a Platinum Sponsor of NIPS 2015.

PROGRAM ORGANIZERS
General Chairs
Corinna Cortes, Neil D. Lawrence
Program Committee includes:
Samy Bengio, Gal Chechik, Ian Goodfellow, Shakir Mohamed, Ilya Sutskever

ORAL SESSIONS
Learning Theory and Algorithms for Forecasting Non-stationary Time Series
Vitaly Kuznetsov, Mehryar Mohri

SPOTLIGHT SESSIONS
Distributed Submodular Cover: Succinctly Summarizing Massive Data
Baharan Mirzasoleiman, Amin Karbasi, Ashwinkumar Badanidiyuru, Andreas Krause

Spatial Transformer Networks
Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

Pointer Networks
Oriol Vinyals, Meire Fortunato, Navdeep Jaitly

Structured Transforms for Small-Footprint Deep Learning
Vikas Sindhwani, Tara Sainath, Sanjiv Kumar

Spherical Random Features for Polynomial Kernels
Jeffrey Pennington, Felix Yu, Sanjiv Kumar

POSTERS
Learning to Transduce with Unbounded Memory
Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, Phil Blunsom

Deep Knowledge Tracing
Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas Guibas, Jascha Sohl-Dickstein

Hidden Technical Debt in Machine Learning Systems
D Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison

Grammar as a Foreign Language
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton

Stochastic Variational Information Maximisation
Shakir Mohamed, Danilo Rezende

Embedding Inference for Structured Multilabel Prediction
Farzaneh Mirzazadeh, Siamak Ravanbakhsh, Bing Xu, Nan Ding, Dale Schuurmans

On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators
Changyou Chen, Nan Ding, Lawrence Carin

Spectral Norm Regularization of Orthonormal Representations for Graph Transduction
Rakesh Shivanna, Bibaswan Chatterjee, Raman Sankaran, Chiranjib Bhattacharyya, Francis Bach

Differentially Private Learning of Structured Discrete Distributions
Ilias Diakonikolas, Moritz Hardt, Ludwig Schmidt

Nearly Optimal Private LASSO
Kunal Talwar, Li Zhang, Abhradeep Thakurta

Learning Continuous Control Policies by Stochastic Value Gradients
Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Tom Erez, Yuval Tassa

Gradient Estimation Using Stochastic Computation Graphs
John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer

Teaching Machines to Read and Comprehend
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, Phil Blunsom

Bayesian dark knowledge
Anoop Korattikara, Vivek Rathod, Kevin Murphy, Max Welling

Generalization in Adaptive Data Analysis and Holdout Reuse
Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth

Semi-supervised Sequence Learning
Andrew Dai, Quoc Le

Natural Neural Networks
Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, Koray Kavukcuoglu

Revenue Optimization against Strategic Buyers
Andres Munoz Medina, Mehryar Mohri

WORKSHOPS
Feature Extraction: Modern Questions and Challenges
Workshop Chairs include: Dmitry Storcheus, Afshin Rostamizadeh, Sanjiv Kumar
Program Committee includes: Jeffery Pennington, Vikas Sindhwani

NIPS Time Series Workshop
Invited Speakers include: Mehryar Mohri
Panelists include: Corinna Cortes

Nonparametric Methods for Large Scale Representation Learning
Invited Speakers include: Amr Ahmed

Machine Learning for Spoken Language Understanding and Interaction
Invited Speakers include: Larry Heck

Adaptive Data Analysis
Organizers include: Moritz Hardt

Deep Reinforcement Learning
Organizers include : David Silver
Invited Speakers include: Sergey Levine

Advances in Approximate Bayesian Inference
Organizers include : Shakir Mohamed
Panelists include: Danilo Rezende

Cognitive Computation: Integrating Neural and Symbolic Approaches
Invited Speakers include: Ramanathan V. Guha, Geoffrey Hinton, Greg Wayne

Transfer and Multi-Task Learning: Trends and New Perspectives
Invited Speakers include: Mehryar Mohri
Poster presentations include: Andres Munoz Medina

Learning and privacy with incomplete data and weak supervision
Organizers include : Felix Yu
Program Committee includes: Alexander Blocker, Krzysztof Choromanski, Sanjiv Kumar
Speakers include: Nando de Freitas

Black Box Learning and Inference
Organizers include : Ali Eslami
Keynotes include: Geoff Hinton

Quantum Machine Learning
Invited Speakers include: Hartmut Neven

Bayesian Nonparametrics: The Next Generation
Invited Speakers include: Amr Ahmed

Bayesian Optimization: Scalability and Flexibility
Organizers include: Nando de Freitas

Reasoning, Attention, Memory (RAM)
Invited speakers include: Alex Graves, Ilya Sutskever

Extreme Classification 2015: Multi-class and Multi-label Learning in Extremely Large Label Spaces
Panelists include: Mehryar Mohri

Machine Learning Systems
Invited speakers include: Jeff Dean

SYMPOSIA
Brains, Mind and Machines
Invited Speakers include: Geoffrey Hinton, Demis Hassabis

Deep Learning Symposium
Program Committee Members include: Samy Bengio, Phil Blunsom, Nando De Freitas, Ilya Sutskever, Andrew Zisserman
Invited Speakers include: Max Jaderberg, Sergey Ioffe, Alexander Graves

Algorithms Among Us: The Societal Impacts of Machine Learning
Panelists include: Shane Legg

TUTORIALS
NIPS 2015 Deep Learning Tutorial
Geoffrey E. Hinton, Yoshua Bengio, Yann LeCun

Large-Scale Distributed Systems for Training Neural Networks
Jeff Dean, Oriol Vinyals

Source: Google Research Blog

TensorFlow – Google’s latest machine learning system, open sourced for everyone

Posted by Jeff Dean, Senior Google Fellow, and Rajat Monga, Technical Lead

Deep Learning has had a huge impact on computer science, making it possible to explore new frontiers of research and to develop amazingly useful products that millions of people use every day. Our internal deep learning infrastructure DistBelief, developed in 2011, has allowed Googlers to build ever larger neural networks and scale training to thousands of cores in our datacenters. We’ve used it to demonstrate that concepts like “cat” can be learned from unlabeled YouTube images, to improve speech recognition in the Google app by 25%, and to build image search in Google Photos. DistBelief also trained the Inception model that won Imagenet’s Large Scale Visual Recognition Challenge in 2014, and drove our experiments in automated image captioning as well as DeepDream.

While DistBelief was very successful, it had some limitations. It was narrowly targeted to neural networks, it was difficult to configure, and it was tightly coupled to Google’s internal infrastructure -- making it nearly impossible to share research code externally.

Today we’re proud to announce the open source release of TensorFlow -- our second-generation machine learning system, specifically designed to correct these shortcomings. TensorFlow is general, flexible, portable, easy-to-use, and completely open source. We added all this while improving upon DistBelief’s speed, scalability, and production readiness -- in fact, on some benchmarks, TensorFlow is twice as fast as DistBelief (see the whitepaper for details of TensorFlow’s programming model and implementation).

TensorFlow has extensive built-in support for deep learning, but is far more general than that -- any computation that you can express as a computational flow graph, you can compute with TensorFlow (see some examples). Any gradient-based machine learning algorithm will benefit from TensorFlow’s auto-differentiation and suite of first-rate optimizers. And it’s easy to express your new ideas in TensorFlow via the flexible Python interface.

Inspecting a model with TensorBoard, the visualization tool

TensorFlow is great for research, but it’s ready for use in real products too. TensorFlow was built from the ground up to be fast, portable, and ready for production service. You can move your idea seamlessly from training on your desktop GPU to running on your mobile phone. And you can get started quickly with powerful machine learning tech by using our state-of-the-art example model architectures. For example, we plan to release our complete, top shelf ImageNet computer vision model on TensorFlow soon.

But the most important thing about TensorFlow is that it’s yours. We’ve open-sourced TensorFlow as a standalone library and associated tools, tutorials, and examples with the Apache 2.0 license so you’re free to use TensorFlow at your institution (no matter where you work).

Our deep learning researchers all use TensorFlow in their experiments. Our engineers use it to infuse Google Search with signals derived from deep neural networks, and to power the magic features of tomorrow. We’ll continue to use TensorFlow to serve machine learning in products, and our research team is committed to sharing TensorFlow implementations of our published ideas. We hope you’ll join us at www.tensorflow.org.

Source: Google Research Blog

Computer, respond to this email.

Posted by Greg Corrado*, Senior Research Scientist

Machine Intelligence for You

What I love about working at Google is the opportunity to harness cutting-edge machine intelligence for users’ benefit. Two recent Research Blog posts talked about how we’ve used machine learning in the form of deep neural networks to improve voice search and YouTube thumbnails. Today we can share something even wilder -- Smart Reply, a deep neural network that writes email.

I get a lot of email, and I often peek at it on the go with my phone. But replying to email on mobile is a real pain, even for short replies. What if there were a system that could automatically determine if an email was answerable with a short reply, and compose a few suitable responses that I could edit or send with just a tap?

Some months ago, Bálint Miklós from the Gmail team asked me if such a thing might be possible. I said it sounded too much like passing the Turing Test to get our hopes up... but having collaborated before on machine learning improvements to spam detection and email categorization, we thought we’d give it a try.

There’s a long history of research on both understanding and generating natural language for applications like machine translation. Last year, Google researchers Oriol Vinyals, Ilya Sutskever, and Quoc Le proposed fusing these two tasks in what they called sequence-to-sequence learning. This end-to-end approach has many possible applications, but one of the most unexpected that we’ve experimented with is conversational synthesis. Early results showed that we could use sequence-to-sequence learning to power a chatbot that was remarkably fun to play with, despite having included no explicit knowledge of language in the program.

Obviously, there’s a huge gap between a cute research chatbot and a system that I want helping me draft email. It was still an open question if we could build something that was actually useful to our users. But one engineer on our team, Anjuli Kannan, was willing to take on the challenge. Working closely with both Machine Intelligence researchers and Gmail engineers, she elaborated and experimented with the sequence-to-sequence research ideas. The result is the industrial strength neural network that runs at the core of the Smart Reply feature we’re launching this week.

How it works

A naive attempt to build a response generation system might depend on hand-crafted rules for common reply scenarios. But in practice, any engineer’s ability to invent “rules” would be quickly outstripped by the tremendous diversity with which real people communicate. A machine-learned system, by contrast, implicitly captures diverse situations, writing styles, and tones. These systems generalize better, and handle completely new inputs more gracefully than brittle, rule-based systems ever could.

Diagram by Chris Olah

Like other sequence-to-sequence models, the Smart Reply System is built on a pair of recurrent neural networks, one used to encode the incoming email and one to predict possible responses. The encoding network consumes the words of the incoming email one at a time, and produces a vector (a list of numbers). This vector, which Geoff Hinton calls a “thought vector,” captures the gist of what is being said without getting hung up on diction -- for example, the vector for "Are you free tomorrow?" should be similar to the vector for "Does tomorrow work for you?" The second network starts from this thought vector and synthesizes a grammatically correct reply one word at a time, like it’s typing it out. Amazingly, the detailed operation of each network is entirely learned, just by training the model to predict likely responses.

One challenge of working with emails is that the inputs and outputs of the model can be hundreds of words long. This is where the particular choice of recurrent neural network type really matters. We used a variant of a "long short-term-memory" network (or LSTM for short), which is particularly good at preserving long-term dependencies, and can home in on the part of the incoming email that is most useful in predicting a response, without being distracted by less relevant sentences before and after.

Of course, there's another very important factor in working with email, which is privacy. In developing Smart Reply we adhered to the same rigorous user privacy standards we’ve always held -- in other words, no humans reading your email. This means researchers have to get machine learning to work on a data set that they themselves cannot read, which is a little like trying to solve a puzzle while blindfolded -- but a challenge makes it more interesting!

Getting it right

Our first prototype of the system had a few unexpected quirks. We wanted to generate a few candidate replies, but when we asked our neural network for the three most likely responses, it’d cough up triplets like “How about tomorrow?” “Wanna get together tomorrow?” “I suggest we meet tomorrow.” That’s not really much of a choice for users. The solution was provided by Sujith Ravi, whose team developed a great machine learning system for mapping natural language responses to semantic intents. This was instrumental in several phases of the project, and was critical to solving the "response diversity problem": by knowing how semantically similar two responses are, we can suggest responses that are different not only in wording, but in their underlying meaning.

Another bizarre feature of our early prototype was its propensity to respond with “I love you” to seemingly anything. As adorable as this sounds, it wasn’t really what we were hoping for. Some analysis revealed that the system was doing exactly what we’d trained it to do, generate likely responses -- and it turns out that responses like “Thanks", "Sounds good", and “I love you” are super common -- so the system would lean on them as a safe bet if it was unsure. Normalizing the likelihood of a candidate reply by some measure of that response's prior probability forced the model to predict responses that were not just highly likely, but also had high affinity to the original message. This made for a less lovey, but far more useful, email assistant.

Give it a try

We’re actually pretty amazed at how well this works. We’ll be rolling this feature out on Inbox for Android and iOS later this week, and we hope you’ll try it for yourself! Tap on a Smart Reply suggestion to start editing it. If it’s perfect as is, just tap send. Two-tap email on the go -- just like Bálint envisioned.

* This blog post may or may not have actually been written by a neural network.^↩

Source: Google Research Blog

How to measure translation quality in your user interfaces

Posted by Javier Bargas-Avila, User Experience Research at Google

Worldwide, there are about 200 languages that are spoken by at least 3 million people. In this global context, software developers are required to translate their user interfaces into many languages. While graphical user interfaces have evolved substantially when compared to text-based user interfaces, they still rely heavily on textual information. The perceived language quality of translated user interfaces (UIs) can have a significant impact on the overall quality and usability of a product. But how can software developers and product managers learn more about the quality of a translation when they don’t speak the language themselves?

Key information in interaction elements and content are mostly conveyed through text. This aspect can be illustrated by removing text elements from a UI, as shown in the the figure below.

Three versions of the YouTube UI: (a) the original, (b) YouTube without text elements, and (c) YouTube without graphic elements. It gets apparent how the textless version is stripped of the most useful information: it is almost impossible to choose a video to watch and navigating the site is impossible.

In "Measuring user rated language quality: Development and validation of the user interface Language Quality Survey (LQS)", recently published in the International Journal of Human-Computer Studies, we describe the development and validation of a survey that enables users to provide feedback about the language quality of the user interface.

UIs are generally developed in one source language and translated afterwards string by string. The process of translation is prone to errors and might introduce problems that are not present in the source. These problems are most often due to difficulties in the translation process. For example, the word “auto” can be translated to French as automatique (automatic) or automobile (car), which obviously has a different meaning. Translators might chose the wrong term if context is missing during the process. Another problem arises from words that behave as a verb when placed in a button or as a noun if part of a label. For example, “access” can stand for “you have access” (as a label) or “you can request access” (as a button).

Further pitfalls are gender, prepositions without context or other characteristics of the source text that might influence translation. These problems sometimes even get aggravated by the fact that translations are made by different linguists at different points in time. Such mistranslations might not only negatively affect trustworthiness and brand perception, but also the acceptance of the product and its perceived usefulness.

This work was motivated by the fact that in 2012, the YouTube internationalization team had anecdotal evidence which suggested that some language versions of YouTube might benefit from improvement efforts. While expert evaluations led to significant improvements of text quality, these evaluations were expensive and time-consuming. Therefore, it was decided to develop a survey that enables users to provide feedback about the language quality of the user interface to allow a scalable way of gathering quantitative data about language quality.

The Language Quality Survey (LQS) contains 10 questions about language quality. The first five questions form the factor “Readability”, which describes how natural and smooth to read the used text is. For instance, one question targets ease of understanding (“How easy or difficult to understand is the text used in the [product name] interface?”). Questions 6 to 9 summarize the frequency of (in)consistencies in the text, called “Linguistic Correctness”. The full survey can be found in the publication.

Case study: applying the LQS in the field

As the LQS was developed to discover problematic translations of the YouTube interface and allow focused quality improvement efforts, it was made available in over 60 languages and data were gathered for all these versions of the YouTube interface. To understand the quality of each UI version, we compared the results for the translated versions to the source language (here: US-English). We inspected first the global item, in combination with Linguistic Correctness and Readability. Second, we inspected each item separately, to understand which notion of Linguistic Correctness or Readability showed worse (or better) values. Here are some results:

The data revealed that about one third of the languages showed subpar language quality levels, when compared to the source language.
To understand the source of these problems and fix them, we analyzed the qualitative feedback users had provided (every time someone selected the lower two end scale points, pointing at a problem in the language, a text box was surfaced, asking them to provide examples or links to illustrate the issues).
The analysis of these comments provided linguists with valuable feedback of various kinds. For instance, users pointed to confusing terminology, untranslated words that were missed during translation, typographical or grammatical problems, words that were translated but are commonly used in English, or screenshots in help pages that were in English but needed to be localized. Some users also pointed to readability aspects such as sections with old fashioned or too formal tone as well as too informal translations, complex technical or legal wordings, unnatural translations or rather lengthy sections of text. In some languages users also pointed to text that was too small or criticized the readability of the font that was used.
In parallel, in-depth expert reviews (so-called “language find-its”) were organized. In these sessions, a group of experts for each language met and screened all of YouTube to discover aspects of the language that could be improved and decided on concrete actions to fix them. By using the LQS data to select target languages, it was possible to reduce the number of language find-its to about one third of the original estimation (if all languages had been screened).

LQS has since been successfully adapted and used for various Google products such as Docs, Analytics, or AdWords. We have found the LQS to be a reliable, valid and useful tool to approach language quality evaluation and improvement. The LQS can be regarded as a small piece in the puzzle of understanding and improving localization quality. Google is making this survey broadly available, so that everyone can start improving their products for everyone around the world.

Source: Google Research Blog

Improving YouTube video thumbnails with deep neural nets

Posted by Weilong Yang and Min-hsuan Tsai, Video Content Analysis team and the YouTube Creator team

Video thumbnails are often the first things viewers see when they look for something interesting to watch. A strong, vibrant, and relevant thumbnail draws attention, giving viewers a quick preview of the content of the video, and helps them to find content more easily. Better thumbnails lead to more clicks and views for video creators.

Inspired by the recent remarkable advances of deep neural networks (DNNs) in computer vision, such as image and video classification, our team has recently launched an improved automatic YouTube "thumbnailer" in order to help creators showcase their video content. Here is how it works.

The Thumbnailer Pipeline

While a video is being uploaded to YouTube, we first sample frames from the video at one frame per second. Each sampled frame is evaluated by a quality model and assigned a single quality score. The frames with the highest scores are selected, enhanced and rendered as thumbnails with different sizes and aspect ratios. Among all the components, the quality model is the most critical and turned out to be the most challenging to develop. In the latest version of the thumbnailer algorithm, we used a DNN for the quality model. So, what is the quality model measuring, and how is the score calculated?

The main processing pipeline of the thumbnailer.

(Training) The Quality Model

Unlike the task of identifying if a video contains your favorite animal, judging the visual quality of a video frame can be very subjective - people often have very different opinions and preferences when selecting frames as video thumbnails. One of the main challenges we faced was how to collect a large set of well-annotated training examples to feed into our neural network. Fortunately, on YouTube, in addition to having algorithmically generated thumbnails, many YouTube videos also come with carefully designed custom thumbnails uploaded by creators. Those thumbnails are typically well framed, in-focus, and center on a specific subject (e.g. the main character in the video). We consider these custom thumbnails from popular videos as positive (high-quality) examples, and randomly selected video frames as negative (low-quality) examples. Some examples of the training images are shown below.

Example training images.

The visual quality model essentially solves a problem we call "binary classification": given a frame, is it of high quality or not? We trained a DNN on this set using a similar architecture to the Inception network in GoogLeNet that achieved the top performance in the ImageNet 2014 competition.

Results

Compared to the previous automatically generated thumbnails, the DNN-powered model is able to select frames with much better quality. In a human evaluation, the thumbnails produced by our new models are preferred to those from the previous thumbnailer in more than 65% of side-by-side ratings. Here are some examples of how the new quality model performs on YouTube videos:

Example frames with low and high quality score from the DNN quality model, from video “Grand Canyon Rock Squirrel”.

Thumbnails generated by old vs. new thumbnailer algorithm.

We recently launched this new thumbnailer across YouTube, which means creators can start to choose from higher quality thumbnails generated by our new thumbnailer. Next time you see an awesome YouTube thumbnail, don’t hesitate to give it a thumbs up. ;)

googblogs.com

All Google blogs and Press in one site

Author Archives: Research Blog

Four years of Schema.org – Recent Progress and Looking Forward

Source: Google Research Blog

Text-to-Speech for low resource languages (episode 2): Building a parametric voice

Source: Google Research Blog

Making online learning even easier with a re-envisioned Course Builder

Source: Google Research Blog

When can Quantum Annealing win?

Source: Google Research Blog

How to Classify Images with TensorFlow

Source: Google Research Blog

NIPS 2015 and Machine Learning Research at Google

Source: Google Research Blog

TensorFlow – Google’s latest machine learning system, open sourced for everyone

Source: Google Research Blog

Computer, respond to this email.

Source: Google Research Blog

How to measure translation quality in your user interfaces

Source: Google Research Blog

Improving YouTube video thumbnails with deep neural nets

Source: Google Research Blog