Tag Archives: Research

Harness the Power of Machine Learning in Your Browser with Deeplearn.js

Machine learning (ML) has become an increasingly powerful tool, one that can be applied to a wide variety of areas spanning object recognition, language translation, health and more. However, the development of ML systems is often restricted to those with computational resources and the technical expertise to work with commonly available ML libraries.

With PAIR — an initiative to study and redesign human interactions with ML — we want to open machine learning up to as many people as possible. In pursuit of that goal, we are excited to announce deeplearn.js 0.1.0, an open source WebGL-accelerated JavaScript library for machine learning that runs entirely in your browser, with no installations and no backend.
There are many reasons to bring machine learning into the browser. A client-side ML library can be a platform for interactive explanations, for rapid prototyping and visualization, and even for offline computation. And if nothing else, the browser is one of the world's most popular programming platforms.

While web machine learning libraries have existed for years (e.g., Andrej Karpathy's convnetjs) they have been limited by the speed of Javascript, or have been restricted to inference rather than training (e.g., TensorFire). By contrast, deeplearn.js offers a significant speedup by exploiting WebGL to perform computations on the GPU, along with the ability to do full backpropagation.

The API mimics the structure of TensorFlow and NumPy, with a delayed execution model for training (like TensorFlow), and an immediate execution model for inference (like NumPy). We have also implemented versions of some of the most commonly-used TensorFlow operations. With the release of deeplearn.js, we will be providing tools to export weights from TensorFlow checkpoints, which will allow authors to import them into web pages for deeplearn.js inference.

You can explore the potential of this library by training a convolutional neural network to recognize photos and handwritten digits — all in your browser without writing a single line of code.
We're releasing a series of demos that show deeplearn.js in action. Play with an image classifier that uses your webcam in real-time and watch the network’s internal representations of what it sees. Or generate abstract art videos at a smooth 60 frames per second. The deeplearn.js homepage contains these and other demos.

Our vision is that this library will significantly increase visibility and engagement with machine learning, giving developers access to powerful tools while simultaneously providing the everyday user with a way to interact with them. We’re looking forward to collaborating with the open source community to drive this vision forward.

Exploring strategies to decarbonize electricity

Climate change is one of the greatest challenges of our time, and the way we generate and use electricity now is a major contributor to that issue. To solve it, we need to find a way to eliminate the carbon emissions associated with our electricity as quickly and as cheaply as possible.

Many analysts have come up with a number of possible solutions: renewable energy plus increased energy storage capacity, nuclear power, carbon capture and sequestration from fossil fuels, or a mixture of these. But we realized that the different answers came from different assumptions that people were making about what combination of those technologies and policies would lead to a positive change.

To help our team understand these dynamics, we created a tool that allows us to quickly see how different assumptions—wind, solar, coal, nuclear, for example—affect the future cost to generate electricity and the amount of carbon dioxide emitted.

We created a simplified model of the electrical grid, where demand is always fulfilled at least cost. By “least cost,” we mean the cost of constructing and maintaining power plants, and generating electricity (with fuel, if required). For a given set of assumptions, the model determines the amount of generation capacity to build and when to turn on which type of generator. Our model is similar to others proposed in other research, but we’ve simplified the model to make it run fast.

We then ran the model hundreds of thousands of times with different assumptions, using our computing infrastructure. We gather all of the runs of the model and present them in a simple web page. Anyone —from students to energy policy wonks—can try different assumptions and see how those assumptions will affect the cost and CO2. The web UI is available for you to try: you can explore the how utilities decide to dispatch their generation capacity, then can test different assumptions. Finally, you can compare different assumptions and share them with others.


We’ve written up the technical details of the model in this paper. In case you want to change the assumptions in the model, we are also releasing the code on Github. The paper shows how the cost of generation technologies change as a function of the fraction of demand that they fulfill. The paper also discusses the limitations and validity of the model.

One interesting conclusion of the paper: if we can find a zero-carbon, 24x7 electricity source that costs about $2200/kW to build, it can displace carbon emission from the electricity grid in less than 27 years. We hope that the tool and the paper help people understand their assumptions about the future of electricity, and stimulate research into climate and energy.

Google at ICML 2017

Machine learning (ML) is a key strategic focus at Google, with highly active groups pursuing research in virtually all aspects of the field, including deep learning and more classical algorithms, exploring theory as well as application. We utilize scalable tools and architectures to build machine learning systems that enable us to solve deep scientific and engineering challenges in areas of language, speech, translation, music, visual processing and more.

As a leader in ML research, Google is proud to be a Platinum Sponsor of the thirty-fourth International Conference on Machine Learning (ICML 2017), a premier annual event supported by the International Machine Learning Society taking place this week in Sydney, Australia. With over 130 Googlers attending the conference to present publications and host workshops, we look forward to our continued colalboration with the larger ML research community.

If you're attending ICML 2017, we hope you'll visit the Google booth and talk with our researchers to learn more about the exciting work, creativity and fun that goes into solving some of the field's most interesting challenges. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind Facets, neural audio synthesis with Nsynth, a Q&A session on the Google Brain Residency program and much more. You can also learn more about our research being presented at ICML 2017 in the list below (Googlers highlighted in blue).

ICML 2017 Committees
Senior Program Committee includes: Alex Kulesza, Amr Ahmed, Andrew Dai, Corinna Cortes, George Dahl, Hugo Larochelle, Matthew Hoffman, Maya Gupta, Moritz Hardt, Quoc Le

Sponsorship Co-Chair: Ryan Adams

Robust Adversarial Reinforcement Learning
Lerrel Pinto, James Davidson, Rahul Sukthankar, Abhinav Gupta

Tight Bounds for Approximate Carathéodory and Beyond
Vahab Mirrokni, Renato Leme, Adrian Vladu, Sam Wong

Sharp Minima Can Generalize For Deep Nets
Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio

Geometry of Neural Network Loss Surfaces via Random Matrix Theory
Jeffrey Pennington, Yasaman Bahri

Conditional Image Synthesis with Auxiliary Classifier GANs
Augustus Odena, Christopher Olah, Jon Shlens

Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo
Maithra Raghu, Ben Poole, Surya Ganguli, Jon Kleinberg, Jascha Sohl-Dickstein

On the Expressive Power of Deep Neural Networks
Maithra Raghu, Ben Poole, Surya Ganguli, Jon Kleinberg, Jascha Sohl-Dickstein

AdaNet: Adaptive Structural Learning of Artificial Neural Networks
Corinna Cortes, Xavi Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, Scott Yang

Learned Optimizers that Scale and Generalize
Olga Wichrowska, Niru Maheswaranathan, Matthew Hoffman, Sergio Gomez, Misha Denil, Nando de Freitas, Jascha Sohl-Dickstein

Adaptive Feature Selection: Computationally Efficient Online Sparse Linear Regression under RIP
Satyen Kale, Zohar Karnin, Tengyuan Liang, David Pal

Algorithms for ℓp Low-Rank Approximation
Flavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Panigrahy, David Woodruff

Consistent k-Clustering
Silvio Lattanzi, Sergei Vassilvitskii

Input Switched Affine Networks: An RNN Architecture Designed for Interpretability
Jakob Foerster, Justin Gilmer, Jan Chorowski, Jascha Sohl-Dickstein, David Sussillo

Online and Linear-Time Attention by Enforcing Monotonic Alignments
Colin RaffelThang Luong, Peter Liu, Ron Weiss, Douglas Eck

Gradient Boosted Decision Trees for High Dimensional Sparse Output
Si Si, Huan Zhang, Sathiya Keerthi, Dhruv Mahajan, Inderjit Dhillon, Cho-Jui Hsieh

Sequence Tutor: Conservative fine-tuning of sequence generation models with KL-control
Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, Jose Hernandez-Lobato, Richard E Turner, Douglas Eck

Uniform Convergence Rates for Kernel Density Estimation
Heinrich Jiang

Density Level Set Estimation on Manifolds with DBSCAN
Heinrich Jiang

Maximum Selection and Ranking under Noisy Comparisons
Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, Ananda Suresh

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
Cinjon Resnick, Adam Roberts, Jesse Engel, Douglas Eck, Sander Dieleman, Karen Simonyan, Mohammad Norouzi

Distributed Mean Estimation with Limited Communication
Ananda Suresh, Felix Yu, Sanjiv Kumar, Brendan McMahan

Learning to Generate Long-term Future via Hierarchical Prediction
Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, Honglak Lee

Variational Boosting: Iteratively Refining Posterior Approximations
Andrew Miller, Nicholas J Foti, Ryan Adams

RobustFill: Neural Program Learning under Noisy I/O
Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, Pushmeet Kohli

A Unified Maximum Likelihood Approach for Estimating Symmetric Properties of Discrete Distributions
Jayadev Acharya, Hirakendu Das, Alon Orlitsky, Ananda Suresh

Axiomatic Attribution for Deep Networks
Ankur Taly, Qiqi Yan,,Mukund Sundararajan

Differentiable Programs with Neural Libraries
Alex L Gaunt, Marc Brockschmidt, Nate Kushman, Daniel Tarlow

Latent LSTM Allocation: Joint Clustering and Non-Linear Dynamic Modeling of Sequence Data
Manzil Zaheer, Amr Ahmed, Alex Smola

Device Placement Optimization with Reinforcement Learning
Azalia Mirhoseini, Hieu Pham, Quoc Le, Benoit Steiner, Mohammad Norouzi, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Samy Bengio, Jeff Dean

Canopy — Fast Sampling with Cover Trees
Manzil Zaheer, Satwik Kottur, Amr Ahmed, Jose Moura, Alex Smola

Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning
Junhyuk Oh, Satinder Singh, Honglak Lee, Pushmeet Kohli

Probabilistic Submodular Maximization in Sub-Linear Time
Serban Stan, Morteza Zadimoghaddam, Andreas Krause, Amin Karbasi

Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs
Michael Gygli, Mohammad Norouzi, Anelia Angelova

Stochastic Generative Hashing
Bo Dai, Ruiqi Guo, Sanjiv Kumar, Niao He, Le Song

Accelerating Eulerian Fluid Simulation With Convolutional Networks
Jonathan Tompson, Kristofer D Schlachter, Pablo Sprechmann, Ken Perlin

Large-Scale Evolution of Image Classifiers
Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc Le, Alexey Kurakin

Neural Message Passing for Quantum Chemistry
Justin Gilmer, Samuel Schoenholz, Patrick Riley, Oriol Vinyals, George Dahl

Neural Optimizer Search with Reinforcement Learning
Irwan BelloBarret Zoph, Vijay Vasudevan, Quoc Le

Implicit Generative Models
Organizers include: Ian Goodfellow

Learning to Generate Natural Language
Accepted Papers include:
Generating High-Quality and Informative Conversation Responses with Sequence-to-Sequence Models
Louis Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, Ray Kurzweil

Lifelong Learning: A Reinforcement Learning Approach
Accepted Papers include:
Bridging the Gap Between Value and Policy Based Reinforcement Learning
Ofir Nachum, Mohammad Norouzi, Kelvin Xu, Dale Schuurmans

Principled Approaches to Deep Learning
Organizers include: Robert Gens
Program Committee includes: Jascha Sohl-Dickstein

Workshop on Human Interpretability in Machine Learning (WHI)
Organizers include: Been Kim

ICML Workshop on TinyML: ML on a Test-time Budget for IoT, Mobiles, and Other Applications
Invited speakers include: Sujith Ravi

Deep Structured Prediction
Organizers include: Gal Chechik, Ofer Meshi
Program Committee includes: Vitaly Kuznetsov, Kevin Murphy
Invited Speakers include: Ryan Adams
Accepted Papers include:
Filtering Variational Objectives
Chris J Maddison, Dieterich Lawson, George Tucker, Mohammad Norouzi, Nicolas Heess, Arnaud Doucet, Andriy Mnih, Yee Whye Teh
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models
George Tucker, Andriy Mnih, Chris J Maddison, Dieterich Lawson, Jascha Sohl-Dickstein

Machine Learning in Speech and Language Processing
Organizers include: Tara Sainath
Invited speakers include: Ron Weiss

Picky Learners: Choosing Alternative Ways to Process Data
Invited speakers include: Tomer Koren
Organizers include: Corinna Cortes, Mehryar Mohri

Private and Secure Machine Learning
Keynote Speakers include: Ilya Mironov

Reproducibility in Machine Learning Research
Invited Speakers include: Hugo Larochelle, Francois Chollet
Organizers include: Samy Bengio

Time Series Workshop
Organizers include: Vitaly Kuznetsov

Interpretable Machine Learning
Presenters include: Been Kim

Google at ACL 2017

This week, Vancouver, Canada hosts the 2017 Annual Meeting of the Association for Computational Linguistics (ACL 2017), the premier conference in the field of natural language understanding, covering a broad spectrum of diverse research areas that are concerned with computational approaches to natural language.

As a leader in natural language processing & understanding and a Platinum sponsor of ACL 2017, Google will be on hand to showcase research interests that include syntax, semantics, discourse, conversation, multilingual modeling, sentiment analysis, question answering, summarization, and generally building better systems using labeled and unlabeled data, state-of-the-art modeling and learning from indirect supervision.

If you’re attending ACL 2017, we hope that you’ll stop by the Google booth to check out some demos, meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. Learn more about the Google research being presented at ACL 2017 below (Googlers highlighted in blue).

Organizing Committee
Area Chairs include: Sujith Ravi (Machine Learning), Thang Luong (Machine Translation)
Publication Chairs include: Margaret Mitchell (Advisory)

Accepted Papers
A Polynomial-Time Dynamic Programming Algorithm for Phrase-Based Decoding with a Fixed Distortion Limit
Yin-Wen Chang, Michael Collins
(Oral Session)

Cross-Sentence N-ary Relation Extraction with Graph LSTMs
Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, Wen-Tau Yih
(Oral Session)

Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision
Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, Ni Lao

Coarse-to-Fine Question Answering for Long Documents
Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, Jonathan Berant

Automatic Compositor Attribution in the First Folio of Shakespeare
Maria Ryskina, Hannah Alpert-Abrams, Dan Garrette, Taylor Berg-Kirkpatrick

A Nested Attention Neural Hybrid Model for Grammatical Error Correction
Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, Jianfeng Gao

Get To The Point: Summarization with Pointer-Generator Networks
Abigail See, Peter J. Liu, Christopher D. Manning

Identifying 1950s American Jazz Composers: Fine-Grained IsA Extraction via Modifier Composition
Ellie Pavlick*, Marius Pasca

Learning to Skim Text
Adams Wei Yu, Hongrae Lee, Quoc Le

2017 ACL Student Research Workshop
Program Committee includes: Emily Pitler, Brian Roark, Richard Sproat

WiNLP: Women and Underrepresented Minorities in Natural Language Processing
Organizers include: Margaret Mitchell
Gold Sponsor

BUCC: 10th Workshop on Building and Using Comparable Corpora
Scientific Committee includes: Richard Sproat

CLPsych: Computational Linguistics and Clinical Psychology – From Linguistic Signal to Clinical
Program Committee includes: Brian Roark, Richard Sproat

Repl4NLP: 2nd Workshop on Representation Learning for NLP
Program Committee includes: Ankur Parikh, John Platt

RoboNLP: Language Grounding for Robotics
Program Committee includes: Ankur Parikh, Tom Kwiatkowski

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Management Group includes: Slav Petrov

CoNLL-SIGMORPHON-2017 Shared Task: Universal Morphological Reinflection
Organizing Committee includes: Manaal Faruqui
Invited Speaker: Chris Dyer

SemEval: 11th International Workshop on Semantic Evaluation
Organizers include: Daniel Cer

ALW1: 1st Workshop on Abusive Language Online
Panelists include: Margaret Mitchell

EventStory: Events and Stories in the News
Program Committee includes: Silvia Pareti

NMT: 1st Workshop on Neural Machine Translation
Organizing Committee includes: Thang Luong
Program Committee includes: Hieu Pham, Taro Watanabe
Invited Speaker: Quoc Le

Natural Language Processing for Precision Medicine
Hoifung Poon, Chris Quirk, Kristina Toutanova, Wen-tau Yih

Deep Learning for Dialogue Systems
Yun-Nung Chen, Asli Celikyilmaz, Dilek Hakkani-Tur

* Contributed during an internship at Google.

Expressions in Virtual Reality

Recently Google Machine Perception researchers, in collaboration with Daydream Labs and YouTube Spaces, presented a solution for virtual headset ‘removal’ for mixed reality in order to create a more rich and engaging VR experience. While that work could infer eye-gaze directions and blinks, enabled by a headset modified with eye-tracking technology, a richer set of facial expressions — which are key to understanding a person's experience in VR, as well as conveying important social engagement cues — were missing.

Today we present an approach to infer select facial action units and expressions entirely by analyzing a small part of the face while the user is engaged in a virtual reality experience. Specifically, we show that images of the user’s eyes captured from an infrared (IR) gaze-tracking camera within a VR headset are sufficient to infer at least a subset of facial expressions without the use of any external cameras or additional sensors.
Left: A user wearing a VR HMD modified with eye-tracking used for expression classification (Note that no external camera is used in our method; this is just for visualization). Right: inferred expression from eye images using our model. A video demonstrating the work can be seen here.
We use supervised deep learning to classify facial expressions from images of the eyes and surrounding areas, which typically contain the iris, sclera, eyelids and may include parts of the eyebrows and top of cheeks. Obtaining large scale annotated data from such novel sensors is a challenging task, hence we collected training data by capturing 46 subjects while performing a set of facial expressions.

To perform expression classification, we fine-tuned a variant of the widespread Inception architecture with TensorFlow using weights from a model trained to convergence on Imagenet. We attempted to partially remove variance due to differences in participant appearance (i.e., individual differences that do not depend on expression), inspired by the standard practice of mean image subtraction. Since this variance removal occurs within-subject, it is effectively personalization. Further details, along with examples of eye-images, and results are presented in our accompanying paper.

Results and Extensions
We demonstrate that the information required to classify a variety of facial expressions is reliably present in IR eye images captured by a commercial HMD sensor, and that this information can be decoded using a CNN-based method, even though classifying facial expressions from eye-images alone is non-trivial even for humans. Our model inference can be performed in real-time, and we show this can be used to generate expressive avatars in real-time, which can function as an expressive surrogate for users engaged in VR. This interaction mechanism also yields a more intuitive interface for sharing expression in VR as opposed to gestures or keyboard inputs.

The ability to capture a user’s facial expressions using existing eye-tracking cameras enables a fully mobile solution to facial performance capture in VR, without additional external cameras. This technology extends beyond animating cartoon avatars; it could be used to provide a richer headset removal experience, enhancing communication and social interaction in VR by transmitting far more authentic and emotionally textured information.

The research described in this post was performed by Steven Hickson (as an intern), Nick Dufour, Avneesh Sud, Vivek Kwatra and Irfan Essa. We also thank Hayes Raffle and Alex Wong from Daydream, and Chris Bregler, Sergey Ioffe and authors of TF-Slim from Google Research for their guidance and suggestions.

This technology, along with headset removal, will be demonstrated at Siggraph 2017 Emerging Technologies.

So there I was, firing a megawatt plasma collider at work…

Wait, what? Why is Google interested in plasma physics?

Google is always interested in solving complex engineering problems, and few are more complex than fusion. Physicists have been trying since the 1950s to control the fusion of hydrogen atoms into helium, which is the same process that powers the Sun. The key to harnessing this power is to confine hydrogen plasmas for long enough to get more energy out from fusion reactions than was put in. This point is called “breakeven.” If it works, it would represent a technological breakthrough, and could provide an abundant source of zero-carbon energy.

There are currently several large academic and government research efforts in fusion. Just to rattle off a few, in plasma fusion there are tokamak machines like ITER and stellarator machines like Wendelstein 7-X. The stellarator design actually goes back to 1951, so physicists have been working on this for a while. Oh yeah, and if you like giant lasers, there’s the National Ignition Facility which users lasers to generate X-rays to generate fusion reactions. So far, none of these has gotten to the economic breakeven point.

All of these efforts involve complex experiments with many variables, providing an opportunity for Google to help, with our strength in computing and machine learning. Today, we’re publishing “Achievement of Sustained Net Plasma Heating in a Fusion Experiment with the Optometrist Algorithm” in Scientific Reports. This paper describes the first results of Google’s collaboration with the physicists and engineers at Tri Alpha Energy, taking a step towards the breakeven goal.

Did you really just say that you got to fire a plasma collider?

Yeah. Tri Alpha Energy has a unique scheme for plasma confinement called a field-reversed configuration that’s predicted to get more stable as the energy goes up, in contrast to other methods where plasmas get harder to control as you heat them. Tri Alpha built a giant ionized plasma machine, C-2U, that fills an entire warehouse in an otherwise unassuming office park. The plasma that this machine generates and confines exhibits all kinds of highly nonlinear behavior. The machine itself pushes the envelope of how much electrical power can be applied to generate and confine the plasma in such a small space over such a short time. It’s a complex machine with more than 1000 knobs and switches, an investment (not ours!) in exploring clean energy north of $100 million. This is a high-stakes optimization problem, dealing with both plasma performance and equipment constraints. This is where Google comes in.
End-on view of C-2U
Wait, why not just simulate what will happen? Isn’t this simple physics?

The “simple” simulations using magnetohydrodynamics don’t really apply. Even if these machines operated in that limit, which they very much don’t, the simulations make fluid dynamics simulations look easy! The reality is much more complicated, as the ion temperature is three times larger than the electron temperature, so the plasma is far out of thermal equilibrium, also, the fluid approximation is totally invalid, so you have to track at least some of the trillion+ individual particles, so the whole thing is beyond what we know how to do even with Google-scale compute resources.

So why are we doing this? Real experiments! With atoms not bits! At Google we love to run experiments and optimize things. We thought it would be a great challenge to see if we could help Tri Alpha. They run a plasma “shot” on the C-2U machine every 8 minutes. Each shot consists of creating two spinning blobs of plasma in the vacuum sealed innards of C-2U, smashing them together at over 600,000 miles per hour, creating a bigger, hotter, spinning football of plasma. Then they blast it continuously with particle beams (actually neutral hydrogen atoms) to keep it spinning. They hang on to the spinning football with magnetic fields for as long as 10 milliseconds. They’re trying to experimentally verify that these advanced beam driven field-reversed plasma configurations behave as expected by theory. If they do, this scheme could lead to net-energy-out fusion.

Now 8 minutes sounds like a long time (which is the time it takes for C-2U to cool, recharge, and get ready for another 10 ms shot), but when you’re sitting in the control room during an experimental campaign, it goes by really quickly. There are a lot of sensor outputs to look at, to try to figure out how the plasma was behaving. Before you know it, the power supplies are charged again, and they’re ready for another go!

What was that about optimization? What are you optimizing?

That’s the thing, it’s not completely obvious what good plasma performance is. Of course, Tri Alpha has some of the world’s best plasma physicists, but even they disagree on what “good” is. We can boil down the machine controls to “only” 30 parameters or so, but when you have to wait 8 minutes per experiment, it’s a pretty hard problem even with a concrete objective. Also, it’s not entirely known, day-to-day, what the reliable operating envelope of the machine is. And it keeps changing since the quality of the vacuum keeps changing and electrodes wear out and...

So we boil the problem down to “let’s find plasma behaviors that an expert human plasma physicist thinks are interesting, and let’s not break the machine when we’re doing it.” We developed the Optometrist Algorithm, which is sort of a Markov Chain Monte Carlo (MCMC) where the likelihood function being explored is in the plasma physicist’s mind rather than being explicitly written down. Just like getting an eyeglass prescription, the algorithm presents the expert human with machine settings and the associated outcomes. They can just use their judgement on what is interesting, and what is unhealthy for the machine. These could be “That initial collision looked really strong!” or “The edge biasing is actually working well now!” or “Wow, that was awesome, but the electrode current was way too high, let’s not do that again!” The key improvement we provided was a technique to search the high-dimensional space of machine parameters efficiently.

Oh, I like MCMC, it’s like the best thing ever!

I knew you’d like that bit. Using this technique, we actually found something really interesting. As we describe in our paper, we found a regime where the neutral particle beams dumping energy into the plasma were able to completely balance the cooling losses, and the total energy in the plasma actually went up after formation. It was only for about 2 milliseconds, but still, it was a first! Since rising energy due to neutral beam heating was not necessarily expected for C-2U, it would have been difficult to plug into an objective function. We really needed a human expert to notice. This was a classic case of humans and computers doing a better job together than either could have separately. You know how it is — when you think you have an optimization problem, and you optimize the objective, you usually just look at the result and say, “No no no, that’s not what I meant,” and you add some other term and repeat until you get sick of it?

That hasn’t happened to me. This week. Yet.

Yeah, so we just cut out that iteration and let the expert human use their judgment. This learning from human preferences is becoming a thing. Google and Tri Alpha made a pretty good team for it, for a really important problem.

So what now?

So actually, Tri Alpha learned everything they could have from C-2U and then dismantled it. They built a new machine called Norman (after their late co-founder Norman Rostoker) in the same warehouse. It’s much more powerful both in plasma acceleration and in neutral particle beams. It also has a more sophisticated system to confine the plasma in the central region. The pressure vessel, accelerators, and banks of capacitors and power supplies cover the building’s concrete floor.
They just achieved “first plasma” on it. They’re hoping, with our help, to verify this theoretical prediction that the plasma will actually behave better in the “burning plasma” regime. If they can do that over the next 18 months, it will be a lot more likely that the field-reversed configuration is a viable approach for breakeven fusion. In that case, Tri Alpha will try to build their follow-on design, an actual demonstration power generator. That one won’t fit in their warehouse!

On the Google side, we wish to thank John Platt, Michael Dikovsky, Patrick Riley and Ross Koningstein for their significant contributions to this work. We thank the Google Accelerated Science team for their continual support. We are also grateful to the entire team at Tri Alpha for giving us the opportunity to try our hand at optimization for this crucially important problem.

Teaching Robots to Understand Semantic Concepts

Machine learning can allow robots to acquire complex skills, such as grasping and opening doors. However, learning these skills requires us to manually program reward functions that the robots then attempt to optimize. In contrast, people can understand the goal of a task just from watching someone else do it, or simply by being told what the goal is. We can do this because we draw on our own prior knowledge about the world: when we see someone cut an apple, we understand that the goal is to produce two slices, regardless of what type of apple it is, or what kind of tool is used to cut it. Similarly, if we are told to pick up the apple, we understand which object we are to grab because we can ground the word “apple” in the environment: we know what it means.

These are semantic concepts: salient events like producing two slices, and object categories denoted by words such as “apple.” Can we teach robots to understand semantic concepts, to get them to follow simple commands specified through categorical labels or user-provided examples? In this post, we discuss some of our recent work on robotic learning that combines experience that is autonomously gathered by the robot, which is plentiful but lacks human-provided labels, with human-labeled data that allows a robot to understand semantics. We will describe how robots can use their experience to understand the salient events in a human-provided demonstration, mimic human movements despite the differences between human robot bodies, and understand semantic categories, like “toy” and “pen”, to pick up objects based on user commands.

Understanding human demonstrations with deep visual features
In the first set of experiments, which appear in our paper Unsupervised Perceptual Rewards for Imitation Learning, our is aim is to enable a robot to understand a task, such as opening a door, from seeing only a small number of unlabeled human demonstrations. By analyzing these demonstrations, the robot must understand what is the semantically salient event that constitutes task success, and then use reinforcement learning to perform it.
Examples of human demonstrations (left) and the corresponding robotic imitation (right).
Unsupervised learning on very small datasets is one of the most challenging scenarios in machine learning. To make this feasible, we use deep visual features from a large network trained for image recognition on ImageNet. Such features are known to be sensitive to semantic concepts, while maintaining invariance to nuisance variables such as appearance and lighting. We use these features to interpret user-provided demonstrations, and show that it is indeed possible to learn reward functions in an unsupervised fashion from a few demonstrations and without retraining.
Example of reward functions learned solely from observation for the door opening tasks. Rewards progressively increase from zero to the maximum reward as a task is completed.
After learning a reward function from observation only, we use it to guide a robot to learn a door opening task, using only the images to evaluate the reward function. With the help of an initial kinesthetic demonstration that succeeds about 10% of the time, the robot learns to improve to 100% accuracy using the learned reward function.
Learning progression.
Emulating human movements with self-supervision and imitation.
In Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation, we propose a novel approach to learn about the world from observation and demonstrate it through self-supervised pose imitation. Our approach relies primarily on co-occurrence in time and space for supervision: by training to distinguish frames from different times of a video, it learns to disentangle and organize reality into useful abstract representations.

In a pose imitation task for example, different dimensions of the representation may encode for different joints of a human or robotic body. Rather than defining by hand a mapping between human and robot joints (which is ambiguous in the first place because of physiological differences), we let the robot learn to imitate in an end-to-end fashion. When our model is simultaneously trained on human and robot observations, it naturally discovers the correspondence between the two, even though no correspondence is provided. We thus obtain a robot that can imitate human poses without having ever been given a correspondence between humans and robots.
Self-supervised human pose imitation by a robot.
A striking evidence of the benefits of learning end-to-end is the many-to-one and highly non-linear joints mapping shown above. In this example, the up-down motion involves many joints for the human while only one joint is needed for the robot. We show that the robot has discovered this highly complex mapping on its own, without any explicit human pose information.

Grasping with semantic object categories
The experiments above illustrate how a person can specify a goal for a robot through an example demonstration, in which case the robots must interpret the semantics of the task -- salient events and relevant features of the pose. What if instead of showing the task, the human simply wants to tell it to what to do? This also requires the robot to understand semantics, in order to identify which objects in the world correspond to the semantic category specified by the user. In End-to-End Learning of Semantic Grasping, we study how a combination of manually labeled and autonomously collected data can be used to perform the task of semantic grasping, where the robot must pick up an object from a cluttered bin that matches a user-specified class label, such as “eraser” or “toy.”
In our semantic grasping setup, the robotic arm is tasked with picking up an object corresponding to a user-provided semantic category (e.g. Legos).
To learn how to perform semantic grasping, our robots first gather a large dataset of grasping data by autonomously attempting to pick up a large variety of objects, as detailed in our previous post and prior work. This data by itself can allow a robot to pick up objects, but doesn’t allow it to understand how to associate them with semantic labels. To enable an understanding of semantics, we again enlist a modest amount of human supervision. Each time a robot successfully grasps an object, it presents it to the camera in a canonical pose, as illustrated below.
The robot presents objects to the camera after grasping. These images can be used to label which object category was picked up.
A subset of these images is then labeled by human labelers. Since the presentation images show the object in a canonical pose, it is easy to then propagate these labels to the remaining presentation images by training a classifier on the labeled examples. The labeled presentation images then tell the robot which object was actually picked up, and it can associate this label, in hindsight, with the images that it observed while picking up that object from the bin.

Using this labeled dataset, we can then train a two-stream model that predicts which object will be grasped, conditioned on the current image and the actions that the robot might take. The two-stream model that we employ is inspired by the dorsal-ventral decomposition observed in the human visual cortex, where the ventral stream reasons about the semantic class of objects, while the dorsal stream reasons about the geometry of the grasp. Crucially, the ventral stream can incorporate auxiliary data consisting of labeled images of objects (not necessarily from the robot), while the dorsal stream can incorporate auxiliary data of grasping that does not have semantic labels, allowing the entire system to be trained more effectively using larger amounts of heterogeneously labeled data. In this way, we can combine a limited amount of human labels with a large amount of autonomously collected robotic data to grasp objects based on desired semantic category, as illustrated in the video below:
Future Work
Our experiments show how limited semantically labeled data can be combined with data that is collected and labeled automatically by the robots, in order to enable robots to understand events, object categories, and user demonstrations. In the future, we might imagine that robotic systems could be trained with a combination of user-annotated data and ever-increasing autonomously collected datasets, improving robotic capability and easing the engineering burden of designing autonomous robots. Furthermore, as robotic systems collect more and more automatically annotated data in the real world, this data can be used to improve not just robotic systems, but also systems for computer vision, speech recognition, and natural language processing that can all benefit from such large auxiliary data sources.

Of course, we are not the first to consider the intersection of robotics and semantics. Extensive prior work in natural language understanding, robotic perception, grasping, and imitation learning has considered how semantics and action can be combined in a robotic system. However, the experiments we discussed above might point the way to future work into combining self-supervised and human-labeled data in the context of autonomous robotic systems.

The research described in this post was performed by Pierre Sermanet, Kelvin Xu, Corey Lynch, Jasmine Hsu, Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, and Sergey Levine. We also thank Mrinal Kalakrishnan, Ali Yahya, and Yevgen Chebotar for developing the policy learning framework used for the door task, and John-Michael Burke for conducting experiments for semantic grasping.

Unsupervised Perceptual Rewards for Imitation Learning was presented at RSS 2017 by Kelvin Xu, and Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation will be presented this week at the CVPR Workshop on Deep Learning for Robotic Vision.

Google at CVPR 2017

From July 21-26, Honolulu, Hawaii hosts the 2017 Conference on Computer Vision and Pattern Recognition (CVPR 2017), the premier annual computer vision event comprising the main conference and several co-located workshops and tutorials. As a leader in computer vision research and a Platinum Sponsor, Google will have a strong presence at CVPR 2017 — over 250 Googlers will be in attendance to present papers and invited talks at the conference, and to organize and participate in multiple workshops.

If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively pursuing the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind Headset Removal for Virtual and Mixed Reality, Image Compression with Neural Networks, Jump, TensorFlow Object Detection API and much more.

You can learn more about our research being presented at CVPR 2017 in the list below (Googlers highlighted in blue).

Organizing Committee
Corporate Relations Chair - Mei Han
Area Chairs include - Alexander Toshev, Ce Liu, Vittorio Ferrari, David Lowe

Training object class detectors with click supervision
Dim Papadopoulos, Jasper Uijlings, Frank Keller, Vittorio Ferrari

Unsupervised Pixel-Level Domain Adaptation With Generative Adversarial Networks
Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, Dilip Krishnan

BranchOut: Regularization for Online Ensemble Tracking With Convolutional Neural Networks Bohyung Han, Jack Sim, Hartwig Adam

Enhancing Video Summarization via Vision-Language Embedding
Bryan A. Plummer, Matthew Brown, Svetlana Lazebnik

Learning by Association — A Versatile Semi-Supervised Training Method for Neural Networks Philip Haeusser, Alexander Mordvintsev, Daniel Cremers

Context-Aware Captions From Context-Agnostic Supervision
Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, Gal Chechik

Spatially Adaptive Computation Time for Residual Networks
Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan HuangDmitry Vetrov, Ruslan Salakhutdinov

Xception: Deep Learning With Depthwise Separable Convolutions
François Chollet

Deep Metric Learning via Facility Location
Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, Kevin Murphy

Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors
Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy

Synthesizing Normalized Faces From Facial Identity Features
Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, William T. Freeman

Towards Accurate Multi-Person Pose Estimation in the Wild
George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin Murphy

GuessWhat?! Visual Object Discovery Through Multi-Modal Dialogue
Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron Courville

Learning discriminative and transformation covariant local feature detectors
Xu Zhang, Felix X. Yu, Svebor Karaman, Shih-Fu Chang

Full Resolution Image Compression With Recurrent Neural Networks
George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, Michele Covell

Learning From Noisy Large-Scale Datasets With Minimal Supervision
Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, Serge Belongie

Unsupervised Learning of Depth and Ego-Motion From Video
Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe

Cognitive Mapping and Planning for Visual Navigation
Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik

Fast Fourier Color Constancy
Jonathan T. Barron, Yun-Ta Tsai

On the Effectiveness of Visible Watermarks
Tali Dekel, Michael Rubinstein, Ce Liu, William T. Freeman

YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video
Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, Vincent Vanhoucke

Deep Learning for Robotic Vision
Organizers include: Anelia Angelova, Kevin Murphy
Program Committee includes: George Papandreou, Nathan Silberman, Pierre Sermanet

The Fourth Workshop on Fine-Grained Visual Categorization
Organizers include: Yang Song
Advisory Panel includes: Hartwig Adam
Program Committee includes: Anelia Angelova, Yuning Chai, Nathan Frey, Jonathan Krause, Catherine Wah, Weijun Wang

Language and Vision Workshop
Organizers include: R. Sukthankar

The First Workshop on Negative Results in Computer Vision
Organizers include: R. Sukthankar, W. Freeman, J. Malik

Visual Understanding by Learning from Web Data
General Chairs include: Jesse Berent, Abhinav Gupta, Rahul Sukthankar
Program Chairs include: Wei Li

YouTube-8M Large-Scale Video Understanding Challenge
General Chairs: Paul Natsev, Rahul Sukthankar
Program Chairs: Joonseok Lee, George Toderici
Challenge Organizers: Sami Abu-El-Haija, Anja Hauth, Nisarg Kothari, Hanhan Li, Sobhan Naderi Parizi, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan, Jian Wang

An Update to Open Images – Now with Bounding-Boxes

Last year we introduced Open Images, a collaborative release of ~9 million images annotated with labels spanning over 6000 object categories, designed to be a useful dataset for machine learning research. The initial release featured image-level labels automatically produced by a computer vision model similar to Google Cloud Vision API, for all 9M images in the training set, and a validation set of 167K images with 1.2M human-verified image-level labels.

Today, we introduce an update to Open Images, which contains the addition of a total of ~2M bounding-boxes to the existing dataset, along with several million additional image-level labels. Details include:
  • 1.2M bounding-boxes around objects for 600 categories on the training set. These have been produced semi-automatically by an enhanced version of the technique outlined in [1], and are all human-verified.
  • Complete bounding-box annotation for all object instances of the 600 categories on the validation set, all manually drawn (830K boxes). The bounding-box annotations in the training and validations sets will enable research on object detection on this dataset. The 600 categories offer a broader range than those in the ILSVRC and COCO detection challenges, and include new objects such as fedora hat and snowman.
  • 4.3M human-verified image-level labels on the training set (over all categories). This will enable large-scale experiments on object classification, based on a clean training set with reliable labels.
Annotated images from the Open Images dataset. Left: FAMILY MAKING A SNOWMAN by mwvchamber. Right: STANZA STUDENTI.S.S. ANNUNZIATA by ersupalermo. Both images used under CC BY 2.0 license. See more examples here.
We hope that this update to Open Images will stimulate the broader research community to experiment with object classification and detection models, and facilitate the development and evaluation of new techniques.

[1] We don't need no bounding-boxes: Training object class detectors using only human verification, Papadopoulos, Uijlings, Keller, and Ferrari, CVPR 2016

Motion Stills — Now on Android

Last year, we launched Motion Stills, an iOS app that stabilizes your Live Photos and lets you view and share them as looping GIFs and videos. Since then, Motion Stills has been well received, being listed as one of the top apps of 2016 by The Verge and Mashable. However, from its initial release, the community has been asking us to also make Motion Stills available for Android. We listened to your feedback and today, we're excited to announce that we’re bringing this technology, and more, to devices running Android 5.1 and later!
Motion Stills on Android: Instant stabilization on your device.
With Motion Stills on Android we built a new recording experience where everything you capture is instantly transformed into delightful short clips that are easy to watch and share. You can capture a short Motion Still with a single tap like a photo, or condense a longer recording into a new feature we call Fast Forward. In addition to stabilizing your recordings, Motion Stills on Android comes with an improved trimming algorithm that guards against pocket shots and accidental camera shakes. All of this is done during capture on your Android device, no internet connection required!

New streaming pipeline
For this release, we redesigned our existing iOS video processing pipeline to use a streaming approach that processes each frame of a video as it is being recorded. By computing intermediate motion metadata, we are able to immediately stabilize the recording while still performing loop optimization over the full sequence. All this leads to instant results after recording — no waiting required to share your new GIF.
Capture using our streaming pipeline gives you instant results.
In order to display your Motion Stills stream immediately, our algorithm computes and stores the necessary stabilizing transformation as a low resolution texture map. We leverage this texture to apply the stabilization transform using the GPU in real-time during playback, instead of writing a new, stabilized video that would tax your mobile hardware and battery.

Fast Forward
Fast Forward allows you to speed up and condense a longer recording into a short, easy to share clip. The same pipeline described above allows Fast Forward to process up to a full minute of video, right on your phone. You can even change the speed of playback (from 1x to 8x) after recording. To make this possible, we encode videos with a denser I-frame spacing to enable efficient seeking and playback. We also employ additional optimizations in the Fast Forward mode. For instance, we apply adaptive temporal downsampling in the linear solver and long-range stabilization for smooth results over the whole sequence.
Fast Forward condenses your recordings into easy to share clips.
Try out Motion Stills
Motion Stills is an app for us to experiment and iterate quickly with short-form video technology, gathering valuable feedback along the way. The tools our users find most fun and useful may be integrated later on into existing products like Google Photos. Download Motion Stills for Android from the Google Play store—available for mobile phones running Android 5.1 and later—and share your favorite clips on social media with hashtag #motionstills.

Motion Stills would not have been possible without the help of many Googlers. We want to especially acknowledge the work of Matthias Grundmann in advancing our stabilization technology, as well as our UX and interaction designers Jacob Zukerman, Ashley Ma and Mark Bowers.