Posted by Tuomas Haarnoja, Student Researcher and Sergey Levine, Faculty Advisor, Robotics at Google
Deep reinforcement learning (RL) provides the promise of fully automated learning of robotic behaviors directly from experience and interaction in the real world, due to its ability to process complex sensory input using general-purpose neural network representations. However, many existing RL algorithms require days or weeks (or more) worth of real-world data in order to converge to the desired behavior. Furthermore, such systems can be tough to deploy on complex robotic systems (such as legged robots) which can easily get damaged during the exploration phase, hyperparameter settings can be challenging to tune, and various safety considerations can introduce further limitations.
In collaboration with UC Berkeley, we recently released Soft Actor-Critic (SAC), a stable and efficient deep RL algorithm suitable for real-world robotic skill learning that is well-aligned with the requirements of robotic experimentation. Importantly, SAC is efficient enough to solve real-world robot tasks in only a handful of hours, and works on a variety of environments with a single set of hyperparameters. Below, we discuss some of the research behind SAC, and also describe some of our recent experiments.
Requirements for Real-World Robotic Learning Real-world robotic experimentation brings significant challenges, such as constant interruptions in the data stream due to hardware failures and manual resets, and smooth exploration to avoid mechanical wear and tear on the robot, which set additional restrictions to both the algorithm and its implementation, including (but not limited to):
Good sample efficiency to lower the learning time
Minimal number of hyperparameters that require tuning
Reusing already collected data on different scenarios (known as off-policy learning)
Ensuring that learning and exploration does not damage the hardware
Soft Actor-Critic Soft actor-critic is based on maximum entropy reinforcement learning, a framework that aims to both maximize the expected reward (which is the standard RL objective) and to maximize the policy's entropy. Policies with higher entropy are more random, which intuitively means that maximum entropy reinforcement learning prefers the most random policy that still achieves a high reward.
Why might this be desirable for robotic learning? The most obvious reason is that policies optimized for maximum entropy will be more robust: if the policy can tolerate highly random behavior during training, it is more likely to respond successfully to unexpected perturbations at test time. However, a more subtle reason is that training for maximum entropy can improve both the algorithm's robustness to hyperparameters and its sample efficiency (to learn more, see this BAIR blog post, and this tutorial).
Soft actor-critic maximizes the entropy augmented reward by learning a stochastic policy that maps states to actions and a Q-function that estimates the objective value of the current policy, optimizing them using approximate dynamic programming. In doing so, SAC views the objective as a grounded way to derive better reinforcement learning algorithms that perform consistently and are sample efficient enough to be applicable to real-world robotic applications. For technical details please see our technical report.
Performance of SAC We evaluated SAC with two tasks: 1) quadrupedal walking with the Minitaur robot from Ghost Robotics, and 2) rotating a valve with a three finger Dynamixel Claw. Learning to walk presents a substantial challenge, as the robot is underactuated, and must therefore delicately balance contact forces on the legs to make forward progress. An untrained policy can lose balance and fall, and too many falls will eventually damage the robot, making sample-efficient learning essential.
Although we trained our policy only on flat terrain, we subsequently tested it on varied terrains and obstacles. In principle, policies learned with soft actor-critic should be robust to test-time perturbations, because they are trained to maximize entropy (i.e., inject maximal noise) at training-time. Indeed, we observe that the policies learned with our method are robust to these perturbations without any additional learning.
Illustration of learned walking, using SAC implemented on the Minitaur robot. A full video of the learning process can be found at our project website.
The manipulation task requires the hand to rotate a valve-like object so that the colored peg faces to the right, as shown below. This task is exceptionally challenging due to both the perception challenges and the need to control a hand with 9 degrees of freedom. In order to perceive the valve, the robot must use raw RGB images shown in the inset at the bottom right. The initial position of the valve is reset uniformly at random for each episode, forcing the policy to learn to use the raw RGB images to perceive the current valve orientation.
Soft actor-critic solves both of these tasks quickly: the Minitaur locomotion takes 2 hours, and the valve-turning task from image observations takes 20 hours. We also learned a policy for the valve-turning task without images by providing the actual valve position as an observation to the policy. Soft actor-critic can learn this easier version of the valve task in 3 hours. For comparison, prior work has used natural policy gradients to learn the same task without images in 7.4 hours.
Conclusion Our work demonstrates that deep reinforcement learning based on maximum entropy framework can be applied to learn robot skills in challenging real-world settings. Since the policies are learned directly in the real world, they exhibit robustness to variations in the environment, which can be difficult to obtain otherwise. We also showed that we can learn directly from high-dimensional image observations, which represents a significant challenge in classical robotics. We hope that the release of SAC helps other research teams in their effort to adopt deep RL for more complex real-world tasks in the future.
Acknowledgements This research was done in collaboration between Google and UC Berkeley. We would like to thank all the people who were involved, including Sehoon Ha, Kristian Hartikainen, Jie Tan, George Tucker, Vincent Vanhoucke and Aurick Zhou.
Posted by Jeff Dean, Senior Fellow and Google AI Lead, on behalf of the entire Google Research Community
2018 was an exciting year for Google's research teams, with our work advancing technology in many ways, including fundamental computer science research results and publications, the application of our research to emerging areas new to Google (such as healthcare and robotics), open source software contributions and strong collaborations with Google product teams, all aimed at providing useful tools and services. Below, we highlight just some of our efforts from 2018, and we look forward to what will come in the new year. For a more comprehensive look, please see our publications in 2018.
Ethical Principles and AI Over the past few years, we have observed major advances in AI and the positive impact it can have on our products and the everyday lives of our billions of users. For those of us working in this field, we care deeply that AI is a force for good in the world, and that it is applied ethically, and to problems that are beneficial to society. This year we published the Google AI Principles, supported with a set of responsible AI practices outlining technical recommendations for implementation. In combination they provide a framework for us to evaluate our own development of AI, and we hope that other organizations can also use these principles to help shape their own thinking. It's important to note that because this field is evolving quite rapidly, best practices in some of the principles noted, such as "Avoid creating or reinforcing unfair bias" or "Be accountable to people", are also changing and improving as we and others conduct new research in areas like ML fairness and model interpretability. This research in turn leads to advances in our products to make them more inclusive and less biased, such as our work on reducing gender biases in Google Translate, and allows the exploration and release of more inclusive image datasets and models that enable computer vision to work for the diversity of global cultures. Furthermore, this work allows us to share best practices with the broader research community with the Fairness Module in the Machine Learning Crash Course.
AI for Social Good The potential of AI to make dramatic impacts on many areas of social and societal importance is clear. One example of how AI can be applied to real-world problems is our work on flood prediction. In collaboration with many teams across Google, this research aims to provide accurate and timely fine-grained information about the likely extent and scope of flooding, enabling those in flood-prone regions to make better decisions about how best to protect themselves and their property. A second example is our work on earthquake aftershock prediction, where we showed that a machine learning (ML) model can predict aftershock locations much more accurately than traditional physics-based models. Perhaps more importantly, because the ML model was designed to be interpretable, scientists have been able to make new discoveries about the behavior of aftershocks, leading to not only more accurate predictions, but also new levels of understanding.
Assistive Technology Much of our research centered on using ML and computer science to help our users accomplish things faster and more effectively. Often, these results in collaborations with various product teams to release the fruits of this research in various product features and settings. One example is Google Duplex, a system that requires research in natural language and dialogue understanding, speech recognition, text-to-speech, user understanding and effective UI design to all come together to enable an experience whereby a user can say "Can you book me a haircut at 4 PM today?", and a virtual agent will interact on your behalf over the telephone to handle the necessary details.
Other examples include Smart Compose, a tool that uses predictive models to give relevant suggestions about how to compose emails, making the process of email composition faster and easier, and Sound Search, a technology built on the Now Playing feature that enables you to discover what song is playing fast and accurately. Additionally, Smart Linkify in Android shows how we can use an on-device ML model to make many different kinds of text that appear on the screen of your phone more useful by understanding the kind of text you're selecting (e.g. knowing that something is an address, so we can offer a shortcut to a maps or direction link).
Quantum computing Quantum computing is an emerging paradigm for computing that promises the ability to solve challenging problems that no classical computer can solve. We have been actively pursuing research in this area for the past several years, and we believe the field is on the cusp of demonstrating this capability for at least one problem (so-called quantum supremacy), which will be a watershed event for the field. Over the last year we produced a number of exciting new results, including the development of Bristlecone, a new 72-qubit quantum computing device, which scales the size of problems that can be tackled in quantum computers in the run-up towards quantum supremacy.
A Bristlecone chip being installed by Research Scientist Marissa Giustina at the Quantum AI Lab in Santa Barbara.
Natural Language Understanding Natural language research at Google had an exciting 2018, with a mix of basic research as well as product-focused collaborations. We developed improvements to our Transformer work from 2017, resulting in a new parallel-in-time version of the model called the Universal Transformer that shows strong gains across a number of natural language tasks including translation and linguistic reasoning. We also developed BERT, the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus, that can then be fine-tuned on a wide variety of natural language tasks using transfer learning. BERT shows significant improvements over previous state-of-the-art results on 11 natural language tasks.
BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks.
In the audio domain, we proposed a method for unsupervised learning of semantic audio representations as well as significant improvements to expressive and human-like speech synthesis. Multimodal perception is an increasingly important research topic. Looking to Listen combines visual and auditory cues in an input video to isolate and enhance the speech of desired speakers in a video. This technology could support a range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where multiple people are speaking.
Enabling perception on resource-constrained platforms has becoming increasingly important. MobileNetV2 is Google's next-generation mobile computer vision model and our MobileNets are used widely across academia and industry. MorphNet proposes an efficient method for learning the structure of deep networks that results in across-the-board performance improvements on image and audio models while respecting computational resource constraints, and more recent work on automatic generation of mobile network architectures demonstrates that even higher performance is possible.
Computational Photography The improvements in quality and versatility of cell phone cameras over the last few years has been nothing short of remarkable. A modest part of this is improvements in the actual physical sensors used in phones, but a much greater part of it is due to advances in the scientific field of computational photography. Our research teams publish their new research techniques, and work closely with the Android and Consumer Hardware teams at Google to deliver this research into your hands in the latest Pixel and Android phones and other devices. In 2014, we introduced HDR+, a technique whereby the camera captures a burst of frames, aligns the frames in software, and merges them together with computational software. Originally in the HDR+ work, this was to enable pictures to have higher dynamic range than was possible with a single exposure. However, capturing a burst of frames and then performing computational analysis of these frames is a general approach that has enabled many advances in cameras in 2018. For example, it allowed the development of Motion Photos in Pixel 2 and the Augmented Reality mode in Motion Stills.
Motion photos on the Pixel 2 in Google Photos. For more examples, check out this Google Photos album.
Augmented chicken family with Motion Stills AR mode.
This year, one of our primary efforts in computational photography research was to create a new capability called Night Sight, which enables Pixel phone cameras to "see in the dark", earning praise by both press and users. Of course, Night Sight is just one of the new software-enabled camera features our teams have developed to help you take the perfect photo, including using ML to provide better portrait mode shots, seeing better and further with Super Res Zoom and capturing special moments with Top Shot and Google Clips.
Performance comparison of ADAM and AMSGRAD on a synthetic example of a simple one dimensional convex problem inspired by our examples of non-convergence. The first two plots (left and center) are for the online setting and the the last one (right) is for the stochastic setting.
Software Systems A large part of our research on software systems continues to relate to building machine-learning models and to TensorFlow in particular. For example, we published on the design and implementation of dynamic control flow for TensorFlow 1.0. Some of our newer research introduces a system that we call Mesh TensorFlow, which makes it easy to specify large-scale distributed computations with model parallelism, sometimes with billions of parameters. As another example, we released a library for scalable deep neural ranking using TensorFlow.
The TF-Ranking library supports multi-item scoring architecture, an extension of traditional single-item scoring.
We also released JAX, an accelerator-backed variant of NumPy that supports automatic differentiation of Python functions to arbitrary order. While JAX is not part of TensorFlow, it leverages some of the same underlying software infrastructure (e.g. XLA), and some of its ideas and algorithms have been helpful to our TensorFlow projects. Finally, we continued our research on the security and privacy of machine learning, and our development of open source frameworks for safety and privacy in AI systems, such as CleverHans and TensorFlow Privacy.
Another important research direction for us is the application of ML to software systems, at many levels of the stack. For instance, we continued work on placement of computations onto devices, with a hierarchical model, and we contributed to learning memory access patterns. We also continued to explore how learned indices could be used to replace traditional index structures in database systems and storage systems. As I wrote last year, we believe that we are just scratching the surface in terms of the use of machine learning in computer systems.
The Hierarchical Planner's placement of a NMT (4-layer) model. White denotes CPU and the four colors each represent one of the GPUs. Note that every step of every layer is allocated across multiple GPUs. This placement is 53.7% faster than that generated by a human expert.
In 2018 we learned about Spectre and Meltdown, new classes of serious security vulnerabilities in modern computer processors, thanks to Google's Project Zero team in collaboration with others. These and related vulnerabilities will keep computer architecture researchers quite busy. In our continuing efforts to model CPU behavior, our Compiler Research team integrated their tool for measuring machine instruction latency and port pressure into LLVM, making possible better compilation decisions.
Running a large-scale web service such as content hosting, requires load balancing with stability in a dynamic environment. We developed a consistent hashing scheme with tight provable guarantees on the maximum load of each server, and deployed it for our cloud customers in Google Cloud Pub/Sub. After making an earlier version of our paper available, engineers at Vimeo found the paper, implemented and open sourced it in haproxy, and used it for their load balancing project at Vimeo. The results were dramatic: applying these algorithmic ideas helped them decrease the cache bandwidth by a factor of almost 8, eliminating a scaling bottleneck.
TPUs Tensor Processing Units (TPUs) are Google's internally-developed ML hardware accelerators, designed from the ground up to power both training and inference at scale. TPUs have enabled Google research breakthroughs such as BERT (discussed previously), and they also allow researchers around the world to build on Google research via open source and to pursue new breakthroughs of their own. For example, anyone can fine-tune BERT on TPUs for free via Colab, and the TensorFlow Research Cloud has given thousands of researchers the opportunity to benefit from even larger amounts of free Cloud TPU computing power. We've also made multiple generations of TPU hardware commercially available as Cloud TPUs, including ML supercomputers called Cloud TPU Pods that make large-scale ML training much more accessible. Internally, in addition to enabling faster advances in ML research, TPUs have driven major improvements across Google's core products, including Search, YouTube, Gmail, Google Assistant, Google Translate, and many others. We look forward to seeing ML teams both here at Google and elsewhere achieve even more with ML via the unprecedented computing scale that TPUs provide.
An individual TPU v3 device (left) and a portion of a TPU v3 Pod (right). TPU v3 is the latest generation of Google's Tensor Processing Unit (TPU) hardware. Available to external customers as Cloud TPU v3, these systems are liquid-cooled for maximum performance (computer chips + liquid = exciting!), and a full TPU v3 Pod can apply more than 100 petaflops of computational power to the world's largest ML problems.
Open Source Software and Datasets Releasing open source software and the creation of new public datasets are two major ways that we contribute to the research and software engineering communities. One of our largest efforts in this space is TensorFlow, a widely popular system for ML computations that we released in November 2015. We celebrated TensorFlow's third birthday in 2018, and during this time, TensorFlow has been downloaded more than 30M times, with over 1700 contributors adding 45,000 commits. In 2018, TensorFlow had eight major releases and added major capabilities such as eager execution and distribution strategies. We launched public design reviews engaging the community in the development process, and we engaged contributors via special interest groups. With the launches of associated products such as TensorFlow Lite, TensorFlow.js and TensorFlow Probability, the TensorFlow ecosystem grew dramatically in 2018.
Real-time evolution of the tSNE embedding for the complete MNIST dataset. The dataset contains images of 60,000 handwritten digits. You can find a live demo here.
Public datasets are often a great source of inspiration that lead to great progress across many fields, since they give the broader community both access to interesting data and problems as well as a healthy competitive drive to achieve better results on a variety of tasks. This year we were happy to release Google Dataset Search, a new tool for finding public datasets from all of the web. Over the years we have also curated and released many new, novel datasets, including everything from millions of general annotated images or videos, to a crowd-source Bengali dataset for speech recognition to robot arm grasping datasets and more. In 2018, we added even more datasets to that list.
Visualization of the fluid annotation interface in action on image from COCO dataset. Image credit: gamene, original image.
From time-to-time, we also help establish new kinds of challenges for the research community, so that we can all work together on solving difficult research problems. Often these are done with the release of a new dataset, but not always. This year, we established new challenges around the Inclusive Images Challenge, to work towards making more robust models that are free from many kinds of biases, the iNaturalist 2018 Challenge which aims to enable computers' fine-grained discrimination of visual categories (such as species of plants in an image), a Kaggle "Quick, Draw!" Doodle Recognition Challenge to create a better classifier for the QuickDraw challenge game, and Conceptual Captions, a larger-scale image captioning dataset and challenge aimed at enabling better image captioning model research.
Applications of AI to Other Fields In 2018, we have applied ML to a wide variety of problems in the physical and biological sciences. Using ML, we can supply scientists with the equivalent of hundreds or thousands of research assistants digging through data, which then frees the scientists to become more creative and productive.
A pre-trained TensorFlow model rates focus quality for a montage of microscope image patches of cells in Fiji (ImageJ). Hue and lightness of the borders denote predicted focus quality and prediction uncertainty, respectively.
Health For the past several years, we have been applying ML to health, an area that affects every one of us, and is also one where we believe ML can make a tremendous difference by augmenting the intuitions and experience of healthcare professionals. Our general approach in this space is to collaborate with healthcare organizations to tackle basic research problems (using feedback from clinical experts to make our results more robust), and then publish the results in well-respected, peer-reviewed scientific and clinical journals. Once the research has been clinically and scientifically validated, we then conduct user and HCI research to understand how we can deploy this in real-world clinical settings. In 2018, we expanded our efforts across the broad space of computer-aided diagnostics to clinical task predictions as well.
On the left is a retinal fundus image graded as having moderate DR ("Mo") by an adjudication panel of ophthalmologists (ground truth). On the top right is an illustration of the predicted scores ("N" = no DR, "Mi" = Mild DR, "Mo" = Moderate DR) from the model. On the bottom right is the set of scores given by physicians without assistance ("Unassisted") and those who saw the model's predictions ("Grades Only").
When applying ML to historically-collected data, it's important to understand the populations that have experienced human and structural biases in the past and how those biases have been codified in the data. Machine-learning offers an opportunity to detect and address bias and to proactively advance health equity, which we are designing our systems to do.
Research Outreach We interact with the external research community in many different ways, including faculty engagement and student support. We are proud to host hundreds of undergraduate, M.S. and Ph.D. students as interns during the academic year, as well as providing multi-year Ph.D. fellowships to students throughout North America, Europe, and the Middle East. In addition to financial support, each of the fellowship recipients is assigned one or more Google researchers as a mentor, and we bring together all the fellows for an annual Google Ph.D. Fellowship Summit, where they are exposed to state-of-the-art research being pursued at Google and given the opportunity to network with Google's researchers as well as other PhD Fellows from around the world. Complementing this fellowship program is the Google AI Residency, a way of allowing people who want to learn to conduct deep learning research to spend a year working alongside and being mentored by researchers at Google. Now in its third year, residents are embedded in various teams across Google's global offices, pursuing research in areas such as machine learning, perception, algorithms and optimization, language understanding, healthcare and much more. With applications having just closed for the fourth year of this program, we are excited to see the research the new cohort of residents will pursue in 2019.
Each year, we also support a number of faculty members and students on research projects through our Google Faculty Research Awards program. In 2018, we also continued to host workshops at Google locations for faculty and graduate students in particular areas, including a workshop on AI/ML Research and Practice hosted in our Bangalore, India office, an Algorithms & Optimization Workshop hosted in our Zürich office, a workshop on healthcare applications of ML hosted in Sunnyvale and a workshop on Fairness and Bias in ML hosted in our Cambridge, MA office.
New Places, New Faces In 2018, we were excited to welcome many new people with a wide range of backgrounds into our research organization. We announced our first AI research office in Africa, located in Accra, Ghana. We expanded our AI research presence in Paris, Tokyo and Amsterdam, and opened a research lab in Princeton. We continue to hire talented people into our offices all over the world, and you can learn more about joining our research efforts here.
Looking Forward to 2019 This blog post summarizes just a small fraction of the research performed in 2018. As we look back on 2018, we're excited (and proud!) of the breadth and depth of what we have accomplished. In 2019, we look forward to having even more impact on Google's direction and products, as well as on the broader research and engineering community!
Posted by Li Zhang and Wei (Alex) Hong, Software Engineers
Life is full of meaningful moments — from a child’s first step to an impromptu jump for joy — that one wishes could be preserved with a picture. However, because these moments are often unpredictable, missing that perfect shot is a frustrating problem that smartphone camera users face daily. Using our experience from developing Google Clips, we wondered if we could develop new techniques for the Pixel 3 camera that would allow everyone to capture the perfect shot every time.
Top Shot is a new feature recently launched with Pixel 3 that helps you to capture precious moments precisely and automatically at the press of the shutter button. Top Shot saves and analyzes the image frames before and after the shutter press on the device in real-time using computer vision techniques, and recommends several alternative high-quality HDR+ photos.
Examples of Top Shot on Pixel 3. On the left, a better smiling shot is recommended. On the right, a better jump shot is recommended. The recommended images are high-quality HDR+ shots.
Capturing Multiple Moments When a user opens the Pixel 3 Camera app, Top Shot is enabled by default, helping to capture the perfect moment by analyzing images taken both before and after the shutter press. Each image is analyzed for some qualitative features (e.g., whether the subject is smiling or not) in real-time and entirely on-device to preserve privacy and minimize latency. Each image is also associated with additional signals, such as optical flow of the image, exposure time, and gyro sensor data to form the input features used to score the frame quality.
When you press the shutter button, Top Shot captures up to 90 images from 1.5 seconds before and after the shutter press, selecting up to two alternative shots to save in high resolution — the original shutter frame and high-res alternatives for you to review (other lower-res frames can also be reviewed as desired). The shutter frame is processed and saved first. The best alternative shots are saved afterwards. Google’s Visual Core on Pixel 3 is used to process these top alternative shots as HDR+ images with a very small amount of extra latency, and are embedded into the file of the Motion Photo.
Top-level diagram of Top Shot capture.
Given Top Shot runs in the camera as a background process, it must have very low power consumption. As such, Top Shot uses a hardware-accelerated MobileNet-based single shot detector (SSD). The execution of such optimized models is also throttled by power and thermal limits.
Recognizing Top Moments When we set out to understand how to enable people to capture the best moments with their camera, we focused on three key attributes: 1) functional qualities like lighting, 2) objective attributes (are the subject's eyes open? Are they smiling?), and 3) subjective qualities like emotional expressions. We designed a computer vision model to recognize these attributes while operating in a low-latency, on-device mode.
During our development process, we started with a vanilla MobileNet model and set out to optimize for Top Shot, arriving at a customized architecture that operated within our accuracy, latency and power tradeoff constraints. Our neural network design detects low-level visual attributes in early layers, like whether the subject is blurry, and then dedicates additional compute and parameters toward more complex objective attributes like whether the subject's eyes are open, and subjective attributes like whether there is an emotional expression of amusement or surprise. We trained our model using knowledge distillation over a large number of diverse face images using quantization during both training and inference.
We then adopted a layered Generalized Additive Model (GAM) to provide quality scores for faces and combine them into a weighted-average “frame faces” score. This model made it easy for us to interpret and identify the exact causes of success or failure, enabling rapid iteration to improve the quality and performance of our attributes model. The number of free parameters was on the order of dozens, so we could optimize these using Google's black box optimizer, Vizier, in tandem with any other parameters that affected selection quality.
Frame Scoring Model While Top Shot prioritizes for face analysis, there are good moments in which faces are not the primary subject. To handle those use cases, we include the following additional scores in the overall frame quality score:
Subject motion saliency score — the low-resolution optical flow between the current frame and the previous frame is estimated in ISP to determine if there is salient object motion in the scene.
Global motion blur score — estimated from the camera motion and the exposure time. The camera motion is calculated from sensor data from the gyroscope and OIS (optical image stabilization).
“3A” scores — the status of auto exposure, auto focus, and auto white balance, are also considered.
All the individual scores are used to train a model predicting an overall quality score, which matches the frame preference of human raters, to maximize end-to-end product quality.
End-to-End Quality and Fairness Most of the above components are each evaluated for accuracy independently However, Top Shot presents requirements that are uniquely challenging since it’s running real-time in the Pixel Camera. Additionally, we needed to ensure that all these signals are combined in a system with favorable results. That means we need to gauge our predictions against what our users perceive as the “top shot.”
To test this, we collected data from hundreds of volunteers, along with their opinions of which frames (out of up to 90!) looked best. This donated dataset covers many typical use cases, e.g. portraits, selfies, actions, landscapes, etc.
Many of the 3-second clips provided by Top Shot had more than one good shot, so it was important for us to engineer our quality metrics to handle this. We used some modified versions of traditional Precision and Recall, some classic ranking metrics (such as Mean Reciprocal Rank), and a few others that were designed specifically for the Top Shot task as our objective. In addition to these metrics, we additionally investigated causes of image quality issues we saw during development, leading to improvements in avoiding blur, handling multiple faces better, and more. In doing so, we were able to steer the model towards a set of selections people were likely to rate highly.
Importantly, we tested the Top Shot system for fairness to make sure that our product can offer a consistent experience to a very wide range of users. We evaluated the accuracy of each signal used in Top Shot on several different subgroups of people (based on gender, age, ethnicity, etc), testing for accuracy of each signal across those subgroups.
Conclusion Top Shot is just one example of how Google leverages optimized hardware and cutting-edge machine learning to provide useful tools and services. We hope you’ll find this feature useful, and we’re committed to further improving the capabilities of mobile phone photography!
Acknowledgements This post reflects the work of a large group of Google engineers, research scientists, and others including: Ari Gilder, Aseem Agarwala, Brendan Jou, David Karam, Eric Penner, Farooq Ahmad, Henri Astre, Hillary Strickland, Marius Renn, Matt Bridges, Maxwell Collins, Navid Shiee, Ryan Gordon, Sarah Clinckemaillie, Shu Zhang, Vivek Kesarwani, Xuhui Jia, Yukun Zhu, Yuzo Watanabe and Chris Breithaupt.
Posted by Elad Hazan and Yoram Singer, Research Scientists, Google AI and Princeton University
Google has long partnered with academia to advance research, collaborating with universities all over the world on joint research projects which result in novel developments in Computer Science, Engineering, and related fields. Today we announce the latest of these academic partnerships in the form of a new lab, across the street from Princeton University’s historic Nassau Hall, opening early next year. By fostering closer collaborations with faculty and students at Princeton, the lab aims to broaden research in multiple facets of machine learning, focusing its initial research efforts on optimization methods for large-scale machine learning, control theory and reinforcement learning. Below we give a brief overview of the research progress thus far.
Large-Scale Optimization Imagine you have gone for a mountain hike and have run out of water. You need to get to a lake. How can you do so most efficiently? This is a matter of optimizing your route, and the mathematical analogue of this is the gradient descent method. You therefore move in the direction of steepest descent until you find the nearest lake at the bottom of your path. In the language of optimization, the location of the lake is referred to as a (local) minimum. The trajectory of gradient descent resembles the path, shown below, a thirsty yet avid hiker would take in order to get down to a lake as fast as she can.
Gradient descent (GD), and its randomized version, stochastic gradient descent (SGD), are the methods of choice for optimizing the weights of neural networks. Stacking all of the parameters together, we form a set of cells organized into vectors Let us take a simplistic view and assume that our neural net merely has 5 different parameters. Taking a gradient descent step amounts to subtracting the gradient vector (red) from the current set of parameters (blue) and putting the result back into the parameter vector.
Going back to our avid hiker, suppose she finds an unmarked path that is long and narrow, with limited visibility as she gazes down. If she follows the descent method her path would zig-zag down the hill, as shown in the illustration below on the left. However, she can now make faster progress by exploiting the skewed geometry of the terrain. That is, she can make a bigger leap forward than to the sides. In the context of gradient descent, pacing up is called acceleration. A popular class of acceleration methods is named adaptive regularization, or adaptive preconditioning, first introduced by the AdaGrad algorithm devised in collaboration with Prof. John Duchi from Stanford while he was at Google.
The idea is to change the geometry of the landscape of the optimization objective to make it easier for gradient descent to work. In order to do so, preconditioning methods stretch and rotate the space. The terrain after preconditioning looks like the serene, perfectly spherical lake above on the right, and the descent trajectory is a straight line! Procedurally, instead of subtracting the gradient vectors from the parameters vector per-se, adaptive preconditioning first multiplies the gradient by a 5×5 multicell structure, called a matrix preconditioner, as shown below.
This preconditioning operation yields a stretched and rotated gradient which is then subtracted as before, allowing much faster progress toward a basin. However, there is a downside to preconditioning, namely, its computational cost. Instead of subtracting a 5-dimensional gradient vector from a 5-dimensional parameter vector, the preconditioning transformation itself requires 5×5=25 operations. Suppose we would like to precondition gradients in order to learn a deep network with 10 million parameters. A single preconditioning step would require 100 trillion operations. In order to save computation, a diagonal version in which preconditioning amounts to stretching sans rotation was also introduced in the original AdaGrad paper. The diagonal version was later adopted and modified, yielding another very successful algorithm called Adam.
This simplified diagonal preconditioning imposes only a marginal additional cost to gradient descent. However, oversimplification has its own downside: we are no longer able to rotate our space. Going back to our hiker, if the deep-and-narrow canyon runs from southeast to northwest, she can no longer take large westward leaps. Had we provided her with a “rigged” compass in which the north pole is in the northwest, she could have followed her descent procedure as before. In high dimensions, the analog of compass rigging is full-matrix preconditioning. We thus asked ourselves whether we could devise a preconditioning method that is computationally efficient while allowing for the equivalent of coordinate rotations.
At Google AI Princeton, we developed a new method for full-matrix adaptive preconditioning at roughly the same computational cost as the commonly-used diagonal restriction. Details can be found in the paper, but the key idea behind the method is depicted below. Instead of using a full matrix, we replace the preconditioning matrix by a product of three matrices: a tall & thin matrix, a (small) square matrix, and a short & fat matrix. The vast amount of computation is performed using the smaller matrix. If we ￼have d parameters, instead of a single large d × d matrix, the matrices maintained by GGT (shorthand for the operation Gradient GradientT), the proposed method, are of sizes d × k, k × k, k × d respectively.
For reasonable choices of k, which can be thought of as the “window size” of the algorithm, the computational bottleneck has been mitigated from a single large matrix, to that of a much smaller kkmatrix. In our implementation we typically choose k to be, say, 50, and maintaining the smaller square matrix is significantly less expensive while yielding good empirical performance. When compared to other adaptive methods on standard deep learning tasks, GGT is competitive with AdaGrad and Adam.
Spectral Filtering for Control and Reinforcement Learning Another broad mission of Google’s research group in Princeton is to develop principled building ￼blocks for decision-making systems. In particular, the group strives to leverage provable guarantees from the field of online learning, which studies the robust (worst-case) guarantees of decision-making algorithms under uncertainty. An online algorithm is said to attain a no-regret guarantee if it learns to make decisions as well as the best "offline" decision in hindsight. Ideas from this field have already enabled many innovations within theoretical computer science, and provide a mathematically elegant framework to study a widely-used technique called boosting. We envision using ideas from online learning to broaden the toolkit of modern reinforcement learning.
With that goal in mind, and in collaboration with researchers and students at Princeton, we developed the algorithmic technique of spectral filtering for estimation and control of linear dynamical systems (see severalrecentpublications). In this setting, noisy observations (e.g., location sensor measurements) are being streamed from an unknown source. The source of the signal is a system whose state evolves over time following a set of linear equations (e.g. Newton's laws). To forecast future signals (prediction), or to perform actions which bring the system to a desired state (control), the usual approach starts with learning the model explicitly (a task termed system identification), which is often slow and inaccurate.
Spectral filtering circumvents the need to model the dynamics explicitly, by reformulating prediction and control as convex programs, enabling provable no-regret guarantees. A major component of the technique is that of a new signal processing transformation. The idea is to summarize the long history of past input signals through convolution with a tailored bank of filters, and then use this representation to predict the dynamical system’s future outputs. Each filter compresses the input signal into a single real number, by taking a weighted combination of the previous inputs.
A set of filters depicted in a plot of filter amplitude versus time. With our technique of spectral filtering, multiple filters are used to predict the state of a linear dynamical system at any given time. Each filter is a set of weights used to summarize past observations, such that combining them in a weighted fashion, over time allows us to accurately predict the system.
The mathematical derivation of these weights (filters) has an interesting connection to the spectral theory of Hankel matrices.
Looking Forward We are excited about the progress we have made thus far in partnership with Princeton’s faculty and students, and we look forward to the official opening of the lab in the coming weeks. It has long been Google’s view that both industry and academia benefit significantly from an open research culture, and we look forward to our continued close collaboration.
Acknowledgments The research and results discussed in this post would not have been possible without contributions from the following researchers: Naman Agarwal, Brian Bullins, Xinyi Chen, Udaya Ghai, Tomer Koren, Karan Singh, Cyril Zhang, Yi Zhang, and visiting professor Sham Kakade. Since joining Google earlier this year, the research team has been working remotely from both the Google NYC office as well as the Princeton University campus, and they look forward to moving into the new Google space across from the Princeton campus in the weeks to come.
Posted by Jarrod McClean, Senior Research Scientist and Hartmut Neven, Director of Engineering, Google AI Quantum Team
Since its inception, the Google AI Quantum team has pushed to understand the role of quantum computing in machine learning. The existence of algorithms with provable advantages for global optimization suggest that quantum computers may be useful for training existing models within machine learning more quickly, and we are building experimental quantum computers to investigate how intricate quantum systems can carry out these computations. While this may prove invaluable, it does not yet touch on the tantalizing idea that quantum computers might be able to provide a way to learn more about complex patterns in physical systems that conventional computers cannot in any reasonable amount of time.
Today we talk about two recent papers from the Google AI Quantum team that make progress towards understanding the power of quantum computers for learning tasks. The first constructs a quantum model of neural networks to investigate how a popular classification task might be carried out on quantum processors. In the second paper, we show how peculiar features of quantum geometry change the strategies for training these networks in comparison to their classical counterparts, and offer guidance towards more robust training of these networks.
In “Classification with Quantum Neural Networks on Near Term Processors”, we construct a model of quantum neural networks (QNNs) that is specifically designed to work on quantum processors that are expected to be available in the near term. While the current work is primarily theoretical, their structure facilitates implementation and testing on quantum computers in the immediate future. These QNNs can be adapted through supervised learning of labeled data, and we show that it is possible to train a QNN to classify images in the famous MNIST dataset. Follow up work in this area with larger quantum devices may pit the ability of quantum networks to learn patterns against popular classical networks.
Quantum Neural Network for classification. Here we depict a sample quantum neural network, where in contrast to hidden layers in classical deep neural networks, the boxes represent entangling actions, or “quantum gates”, on qubits. In a superconducting qubit setup this could be enacted through a microwave control pulse corresponding to each box.
In “Barren Plateaus in Quantum Neural Network Training Landscapes”, we focus on the training of quantum neural networks, and probe questions related to a key difficulty in classical neural networks, which is the problem of vanishing or exploding gradients. In conventional neural networks, a good unbiased initial guess for the neuron weights often involves randomization, although there can be some difficulties as well. Our paper shows that peculiar features of quantum geometry unequivocally prevent this from being a good strategy in the quantum case, instead taking you to barren plateaus. The implications of this work may guide future strategies for initializing and training quantum neural networks.
QNN vanishing gradient: concentration of measure in high dimensional spaces. In very high dimensional spaces, such as those explored by quantum computers, the vast majority of states counterintuitively sit near the equator of the hypersphere (left). This means that any smooth function on this space will tend to take a value very close to its mean with overwhelming probability when selected at random (right).
This research sets the stage for improvements in both the construction and training of quantum neural networks. In particular, experimental realizations of quantum neural networks using hardware at Google will enable rapid exploration of quantum neural networks in the near term. We hope that the insights from the geometry of these states will lead to new algorithms to train these networks that will be essential to unlocking their full potential.
Improving Model Performance with High-quality Labels The performance of DR deep learning models is critically important, especially when subtle errors have the potential to generate a misdiagnosis. Earlier this year we published a paper in the journal Ophthalmology that looked at how we could improve our model by 1) moving toward a more granular 5-point grading scale (versus the previous 2-class system) and 2) incorporating adjudication by a panel of retinal specialists. During the adjudication process, a group of retinal specialists debated any case with disagreement until everyone agreed on the final grade. Compared to simply taking a majority vote, this method of resolving disagreements was more accurate and allowed for the identification of subtle findings, such as microaneurysms.
To increase the efficiency of the adjudication process, we carefully selected a small subset (0.22%) of images to use as a tuning set, substantially improving model performance by optimizing model hyperparameters on this more accurate reference standard. When we subsequently measured the rate of agreement against a test set of images with an adjudicated reference standard, the kappa scores (a measurement of agreement that ranges from 0 [random] to 1 [perfect agreement]) for individual retinal specialists, ophthalmologists, and the algorithm ranged from 0.82-0.91, 0.80-0.84, and 0.84, respectively.
Making our Models More Transparent As we deploy this technology, it is important that we take the proper steps to ensure that it is transparent and trusted. To that end, we have been exploring ways to explain how the model is making its predictions, with the goal of making the DR model a better diagnostic tool and aid for doctors.
In our latest study, to be published today in Ophthalmology, we demonstrate methods by which explanations of deep learning algorithms can be shown to ophthalmologists to increase both the accuracy and confidence of their grading for diabetic eye disease. Using the results of the model trained and validated on high quality labels from our earlier study, we generated different forms of potential assistance for general ophthalmologists. We presented to the physicians the algorithm’s predicted scores for different DR severity levels as well as heatmaps highlighting image regions that most strongly drove its predictions. Using this assistance, we saw a significant increase in physicians’ diagnostic accuracy, as well as improved confidence in their diagnosis.
We saw clear evidence that showing model predictions could help physicians catch pathology they otherwise might have missed. In the retinal image below, our adjudication panel found signs of vision-threatening DR. This was missed by 2 of 3 doctors who graded it without assistance; but caught by all 3 doctors who graded it when they saw the model predictions (which accurately detected the pathology).
On the left is a fundus image graded as having proliferative (vision-threatening) DR by an adjudication panel of ophthalmologists (ground truth). On the top right is an illustration of our deep learning model’s predicted scores (“P” = proliferative, the most severe form of DR). On the bottom right is the set of grades given by physicians without assistance (“Unassisted”) and those who saw the model’s predictions (“Grades Only”).
We also saw evidence that physicians and the model can work together in a way that provides more accuracy than either individually. In the retinal image below, our adjudication panel of retina specialists considered it to have moderate DR. Without assistance, two out of three ophthalmologists grading the image marked it as no DR. In real-world settings, this situation could result in a patient missing a needed referral to a specialist.
On the left is a retinal fundus image graded as having moderate DR (“Mo”) by an adjudication panel of ophthalmologists (ground truth). On the top right is an illustration of the predicted scores (“N” = no DR, “Mi” = Mild DR, “Mo” = Moderate DR) from the model. On the bottom right is the set of scores given by physicians without assistance (“Unassisted”) and those who saw the model’s predictions (“Grades Only”).
In this particular case, our model also indicated evidence for no DR. However, when ophthalmologists saw the model’s predictions, all three gave the correct answer. Seeing that the model saw some evidence for Moderate -- even if it wasn’t the highest score -- may prompt doctors to examine particular cases more carefully for pathology they may otherwise miss. We are excited to develop assistance that works like this, where human and machine learning abilities complement each other.
A New Partner in our Global Efforts With the help of screening programs and in collaboration with Verily, we have laid a robust foundation for the implementation of these highly accurate systems in real world clinical settings. Working with doctors at Aravind Eye Hospitals and Sankara Nethralaya in India, and now, through our new partnership with the Rajavithi Hospital, affiliated with the Department of Medical Services, Ministry of Public Health in Thailand, we are validating the model performance with patients from broad screening programs. Given the positive results of our model on their real patient population, we are now beginning to pilot the model in their screening programs. We’re looking forward to a very busy 2019!
Posted by Eric Jang, Software Engineer, Robotics at Google and Coline Devin, Berkeley PhD Student and former Research Intern
From a remarkably young age, people are capable of recognizing their favorite objects and picking them up, despite never being explicitly taught how to do so. According to cognitive developmental research, the ability to interact with objects in the world plays a crucial role in the emergence of object perception and manipulation capabilities, such as targeted grasping. By interacting with the world around them, people are able to learn with self-supervision: we know what actions we took, and we learn from the outcome. In robotics, this type of self-supervised learning is actively researched because it enables robotic systems to learn without the need for large amounts of training data or manual supervision.
Inspired by the concept of object permanence, we propose Grasp2Vec, a simple yet highly effective algorithm for acquiring object representations. Grasp2Vec is based on the intuition that an attempt to pick up anything provides several pieces of information — if a robot grasps an object and holds it up, the object had to be in the scene before the grasp. Furthermore, the robot knows that the object it grasped is currently in its gripper, and therefore has been removed from the scene. By using this form of self supervision, the robot can learn to recognize the object by the visual change in the scene after the grasp.
Building on our prior collaboration with X Robotics, where a series of robots learn in parallel to grasp household objects using only monocular camera inputs, we use a robotic arm to grasp objects “unintentionally”, and that experience enables the learning of a rich representation of objects. These representations can then be used to acquire “intentional grasping” capabilities, where the robot arm can then pick up user-commanded objects. Constructing a Perceptual Reward Function In the framework of reinforcement learning (RL), task success is measured via a “reward function”. By maximizing that reward, robots can teach themselves diverse grasping skills from scratch. Engineering a reward function is easy when success can be measured by simple sensor measurements. A simple example of this is a button that supplies rewards directly to a robot when it is pushed.
However, engineering a reward function is much more difficult when our success criteria depends on perceptual understanding of the task at hand. Consider the task of instance grasping, where a robot is presented a picture of a desired object being held in the gripper. After the robot attempts to grasp that object, it inspects the contents of the gripper. The reward function for this task comes down to answering the question of object recognition: Do these objects match?
On the left, the gripper is holding the brush and there are some objects (yellow cup, blue plastic block) in the background. On the right, the gripper is holding the yellow cup and the brush is in the background. If the left image was the desired outcome, a good reward function should “understand” that the two images above correspond to different objects.
In order to solve this recognition problem, we need a perception system that extracts meaningful object concepts from unstructured image data (without any human annotations), learning the visual perception of objects in an unsupervised fashion. At their core, unsupervised learning algorithms work because they make structural assumptions about data. It is common to assume that images can be compressed into a low-dimensional space, and that frames in a video can be predicted from previous frames. However, without further assumptions on the content of the data, these are usually insufficient for learning disentangled object representations.
What if we used a robot to physically disentangle objects from each other during data collection? The field of robotics presents an exciting opportunity for representation learning because robots can manipulate objects, thus providing the factors of variation needed in data. Our method relies on the insight that grasping an object removes it from the scene. This yields 1) an image of the scene before grasping, 2) an image of the scene after grasping and 3) an isolated view of the grasped object itself.
Left: Objects before the grasp. Center: Objects after the grasp. Right: The Grasped object.
If we then consider an embedding function that extracts “the set of objects” from images, it should preserve the following subtractive relation:
We implement this equality relation using a fully convolutional architecture and a simple metric learning algorithm. At training time, the architecture shown below embeds the pre-grasp images and post-grasp images into a dense spatial feature map. The maps are mean-pooled into vectors and the difference between the “before grasp” and “after grasp” vectors represents a set of objects. This vector and the corresponding vector representation of the grasped object are pushed to equivalence via the N-Pairs objective.
Once trained, two useful properties emerge naturally from our model.
1. Object Similarity The first property is that a cosine distance between vector embeddings allows us to compare objects and determine whether they are identical. This can be used to implement reward functions for reinforcement learning, and allow robots to learn instance grasping without human-provided labels.
2. Localizing Target Objects The second property is that we can combine scene spatial maps and object embeddings to localize a “query object” in image space. By taking the element-wise product of spatial feature maps and the vector corresponding to the query object, we can find all the pixels in the spatial map that “match” the query object.
Using Grasp2Vec embeddings to localize objects in a scene. The image on the top left shows the objects in the bin. On the bottom left is the query object we wish to grasp. By taking the dot product of the query object vector with the spatial features of the scene image, we get a per-pixel “activation map” (top right image) of how similar that region of the image is to the query. This response map can be used to approach the object for grasping.
Our method also works when there are multiple objects that match the query object, or even if the query consists of multiple objects (the average of two vectors). For example, here is a scenario where it detects multiple orange blocks in a scene.
The resulting “heatmap” can be used to plan the robot approach to the target object(s). We combine Grasp2Vec’s localization and instance recognition capabilities with our “grasp anything” policies to obtain a success rate of 80% on objects seen during data collection and 59% on novel objects the robot hasn’t encountered before.
Conclusion In our paper, we show how robotic grasping skills can generate the data used for learning object-centric representations. We then can use representation learning to “bootstrap” more complex skills like instance grasping, all while retaining the self-supervised learning properties of our autonomous grasping system.
Besides our own work, a number of recent papers have also studied how self-supervised interaction can be used to acquire representations, by grasping, pushing, and otherwise manipulating objects in the environment. Going forward, we are excited not only for what machine learning can bring to robotics by way of better perception and control, but also what robotics can bring to machine learning in new paradigms of self-supervision.
Acknowledgements This research was conducted by Eric Jang, Coline Devin, Vincent Vanhoucke, and Sergey Levine. We’d like to thank Adrian Li, Alex Irpan, Anthony Brohan, Chelsea Finn, Christian Howard, Corey Lynch, Dmitry Kalashnikov, Ian Wilkes, Ivonne Fajardo, Julian Ibarz, Ming Zhao, Peter Pastor, Pierre Sermanet, Stephen James, Tsung-Yi Lin, Yunfei Bai, and many others at Google, X, and the broader robotics community who contributed to improving this work.
Posted by Melvin Johnson, Senior Software Engineer, Google Translate
Over the past few years, Google Translate has made significant improvements to translation quality by switching to an end-to-end neural network-based system. At the same time, we realized that translations from our models can reflect societal biases, such as gender bias. Specifically, languages differ a lot in how they represent gender, and when there are ambiguities during translation, the systems tend to pick gender choices that reflect societal asymmetries, resulting in biased translations. For instance, Google Translate historically translated the Turkish equivalent of “He/she is a doctor” into the masculine form, and the Turkish equivalent of “He/she is a nurse” into the feminine form.
Recently, we announced that we’re taking the first step at reducing gender bias in our translations. We now provide both feminine and masculine translations when translating single-word queries from English to four different languages (French, Italian, Portuguese, and Spanish), and when translating phrases and sentences from Turkish to English.
Gender-specific translations on the Google Translate website.
Supporting gender-specific translations for single-word queries involved enriching our underlying dictionary with gender attributes. Supporting gender-specific translations for longer queries (phrases and sentences) was particularly challenging and involved making significant changes to our translation framework. For these longer queries, we focused initially on Turkish-to-English translation. We developed a three-step approach to solve the problem of providing a masculine and feminine translation in English for a gender-neutral query in Turkish.
Detecting Gender-Neutral Queries Many Turkish sentences that refer to people are gender-neutral, but not all are. Detecting which queries are eligible for gender-specific translations is a hard problem because Turkish is morphologically complex, meaning that reference to a person can either be explicit with a gender-neutral pronoun (e.g. O, Ona) or implicitly encoded. For example, the sentence “Biliyor mu?” has no explicit gender-neutral pronoun but can be translated as either “Does she know?” or “Does he know?”. This complexity means that we cannot use a simple list of gender-neutral pronouns to detect gender-neutral Turkish queries and need a machine-learned system. We estimate that approximately 10% of Turkish Translate queries are ambiguous, and eligible for both feminine and masculine translations.
To detect these queries, we use state-of-the-art text classification algorithms (same as those used in our Cloud Natural Language API) to build a system that is able to detect when a given Turkish query is gender-neutral. Since this introduces an additional step before obtaining the translations, we had to carefully balance model complexity with latency. We trained our system on thousands of human-rated Turkish examples, where raters were asked to judge whether a given example is gender-neutral or not. Our final classification system is a convolutional neural network that can accurately detect queries which require gender-specific translations.
Generating Gender-Specific Translations Next, we enhanced our underlying Neural Machine Translation (NMT) system to produce feminine and masculine translations when requested. When no gender is requested, we trained the model to produce the default translation. This involved:
Identifying and dividing our parallel training data into those with feminine words, those with masculine and those with ungendered words.
Adding an additional input token to the beginning of the sentence to specify the required gender to translate to, similar to how we build multilingual NMT systems:
<2MALE> O bir doktor → He is a doctor
<2FEMALE> O bir doktor → She is a doctor
Training our enhanced NMT model on the feminine, masculine and ungendered data sources. We experimented with various mixing ratios for these sources to enable the model to perform equally well on the three tasks.
If a user's query is determined to be gender-neutral, we add a gender prefix to the translation request. For these requests, our final NMT model can reliably produce feminine and masculine translations 99% of the time. Additionally, the system maintains translation quality on queries without the gender prefix.
Checking for Accuracy Finally, we have a step that decides whether to display the gender-specific translations. Since the training data that produces the masculine translation is different from the training data that produces the feminine translation, there may be differences between the two translations unrelated to gender. If the gender-specific translations are determined to be low quality, we show only the single default translation. To determine the quality of the gender-specific translations, we verify:
If the requested feminine translation is feminine.
If the requested masculine translation is masculine.
If the feminine and masculine translations are exactly equivalent with the exception of gender-related changes. Even minor changes in the wording between the translations will result in being filtered.
Top: The masculine and feminine translations differ only with respect to gender i.e. “he” and “his” vs “she” and “her”. Hence, we will show gender-specific translations. Bottom: The masculine and feminine translations differ correctly with respect to gender i.e. “he” vs “she”. However, the change from “really” to “actually” is not related to gender. Hence, we will filter gender-specific translations and display the default translation.
Putting it all together, input sentences first go through the classifier, which detects whether they’re eligible for gender-specific translations. If the classifier says “yes”, we send three requests to our enhanced NMT model—a feminine request, a masculine request and an ungendered request. Our final step takes into account all three responses and decides whether to display gender-specific translations or a single default translation. This step is still quite conservative in order to maximize the quality of gender-specific translations shown; hence our overall recall is only around 60%. We plan to increase our coverage and add support for more complex sentences in future iterations.
This is just the first step toward addressing gender bias in machine-translation systems and reiterates Google’s commitment to fairness in machine learning. In the future, we plan to extend gender-specific translations to more languages and to address non-binary gender in translations.
Acknowledgements: This effort has been successful thanks to the hard work of a lot of people including, but not limited to, the following (in alphabetical order of last name): Lindsey Boran, HyunJeong Choe, Héctor Fernández Alcalde, Orhan Firat, Qin Gao, Rick Genter, Macduff Hughes, Tolga Kayadelen, James Kuczmarski, Tatiana Lando, Liu Liu, Michael Mandl, Nihal Meriç Atilla, Mengmeng Niu, Adnan Ozturel, Emily Pitler, Kathy Ray, John Richardson, Larissa Rinaldi, Alex Rudnick, Apu Shah, Jason Smith, Antonio Stella, Romina Stella, Jana Strnadova, Katrin Tomanek, Barak Turovsky, Dan Schwarz, Shilp Vaishnav, Clayton Watts, Kellie Webster, Colin Young, Pendar Yousefi, Candice Zhang and Min Zhao.
Posted by Anurag Batra and Parker Barnes, Product Managers, Google AI
Recently, we introduced the Inclusive Images Kaggle competition, part of the NeurIPS 2018 Competition Track, with the goal of stimulating research into the effect of geographic skews in training datasets on ML model performance, and to spur innovation in developing more inclusive models. While the competition has concluded, the broader movement to build more diverse datasets is just beginning.
Today, we’re announcing Open Images Extended, a new branch of Google’s Open Images dataset, which is intended to be a collection of complementary datasets with additional images and/or annotations that better represent global diversity. The first set we are adding is the Crowdsourced extension which is seeded with 478K+ images donated by Crowdsource app users from all around the world.
About the Crowdsourced Extension of Open Images Extended To bring greater geographic diversity to Open Images, we enabled the global community of Crowdsource app users to photograph the world around them and make their photos available to researchers and developers as part of the Open Images Extended dataset. A large majority of these images are from India, with some representation from the Middle East, Africa and Latin America.
The images, focus on some key categories like household objects, plants & animals, food, and people in various professions (all faces are blurred to protect privacy). Detailed information about the composition of the dataset can be found here.
Pictures from India and Singapore contributed using the Crowdsource app.
Get Involved This is an early step on a long journey. To build inclusive ML products, training data must represent global diversity along several dimensions. To that end, we invite the global community to help expand the Open Images Extended dataset by contributing imagery from your own hometown and community. Download the Crowdsource Android app to contribute images you’ve taken from your phone, or contact us if there are other image repositories (that you have the rights for) that you’re interested in adding to open-images dataset.
Acknowledgements The release of Open Images Extended has been possible thanks to the hard work of a lot of people including, but not limited to the following (in alphabetical order of last name): James Atwood, Pallavi Baljekar, Peggy Chi, Tulsee Doshi, Tom Duerig, Vittorio Ferrari, Akshay Gaur, Victor Gomes, Yoni Halpern, Gursheesh Kaur, Mahima Pushkarna, Jigyasa Saxena, D. Sculley, Richa Singh, Rachelle Summers.
Posted by Xuanhui Wang and Michael Bendersky, Software Engineers, Google AI
Ranking, the process of ordering a list of items in a way that maximizes the utility of the entire list, is applicable in a wide range of domains, from search engines and recommender systems to machine translation, dialogue systems and even computational biology. In applications like these (and many others), researchers often utilize a set of supervised machine learning techniques called learning-to-rank. In many cases, these learning-to-rank techniques are applied to datasets that are prohibitively large — scenarios where the scalability of TensorFlow could be an advantage. However, there is currently no out-of-the-box support for applying learning-to-rank techniques in TensorFlow. To the best of our knowledge, there are also no other open source libraries that specialize in applying learning-to-rank techniques at scale.
TF-Ranking is fast and easy to use, and creates high-quality ranking models. The unified framework gives ML researchers, practitioners and enthusiasts the ability to evaluate and choose among an array of different ranking models within a single library. Moreover, we strongly believe that a key to a useful open source library is not only providing sensible defaults, but also empowering our users to develop their own custom models. Therefore, we provide flexible API's, within which the users can define and plug in their own customized loss functions, scoring functions and metrics.
Existing Algorithms and Metrics Support The objective of learning-to-rank algorithms is minimizing a loss function defined over a list of items to optimize the utility of the list ordering for any given application. TF-Ranking supports a wide range of standard pointwise, pairwise and listwise loss functions as described in prior work. This ensures that researchers using the TF-Ranking library are able to reproduce and extend previously published baselines, and practitioners can make the most informed choices for their applications. Furthermore, TF-Ranking can handle sparse features (like raw text) through embeddings and scales to hundreds of millions of training instances. Thus, anyone who is interested in building real-world data intensive ranking systems such as web search or news recommendation, can use TF-Ranking as a robust, scalable solution.
Empirical evaluation is an important part of any machine learning or information retrieval research. To ensure compatibility with prior work, we support many of the commonly used ranking metrics, including Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). We also make it easy to visualize these metrics at training time on TensorBoard, an open source TensorFlow visualization dashboard.
An example of the NDCG metric (Y-axis) along the training steps (X-axis) displayed in the TensorBoard. It shows the overall progress of the metrics during training. Different methods can be compared directly on the dashboard. Best models can be selected based on the metric.
Multi-Item Scoring TF-Ranking supports a novel scoring mechanism wherein multiple items (e.g., web pages) can be scored jointly, an extension of the traditional scoring paradigm in which single items are scored independently. One challenge in multi-item scoring is the difficulty for inference where items have to be grouped and scored in subgroups. Then, scores are accumulated per-item and used for sorting. To make these complexities transparent to the user, TF-Ranking provides a List-In-List-Out (LILO) API to wrap all this logic in the exported TF models.
The TF-Ranking library supports multi-item scoring architecture, an extension of traditional single-item scoring.
As we demonstrate in recent work, multi-item scoring is competitive in its performance to the state-of-the-art learning-to-rank models such as RankNet, MART, and LambdaMART on a public LETOR benchmark.
Ranking Metric Optimization An important research challenge in learning-to-rank is direct optimization of ranking metrics (such as the previously mentioned NDCG and MRR). These metrics, while being able to measure the performance of ranking systems better than the standard classification metrics like Area Under the Curve (AUC), have the unfortunate property of being either discontinuous or flat. Therefore standard stochastic gradient descent optimization of these metrics is problematic.
In recent work, we proposed a novel method, LambdaLoss, which provides a principled probabilistic framework for ranking metric optimization. In this framework, metric-driven loss functions can be designed and optimized by an expectation-maximization procedure. The TF-Ranking library integrates the recent advances in direct metric optimization and provides an implementation of LambdaLoss. We are hopeful that this will encourage and facilitate further research advances in the important area of ranking metric optimization.
Unbiased Learning-to-Rank Prior research has shown that given a ranked list of items, users are much more likely to interact with the first few results, regardless of their relevance. This observation has inspired research interest in unbiased learning-to-rank, and led to the development of unbiased evaluation and several unbiased learning algorithms, based on training instances re-weighting. In the TF-Ranking library, metrics are implemented to support unbiased evaluation and losses are implemented for unbiased learning by natively supporting re-weighting to overcome the inherent biases in user interactions datasets.
Getting Started with TF-Ranking TF-Ranking implements the TensorFlow Estimator interface, which greatly simplifies machine learning programming by encapsulating training, evaluation, prediction and export for serving. TF-Ranking is well integrated with the rich TensorFlow ecosystem. As described above, you can use Tensorboard to visualize ranking metrics like NDCG and MRR, as well as to pick the best model checkpoints using these metrics. Once your model is ready, it is easy to deploy it in production using TensorFlow Serving.
If you’re interested in trying TF-Ranking for yourself, please check out our GitHub repo, and walk through the tutorial examples. TF-Ranking is an active research project, and we welcome your feedback and contributions. We are excited to see how TF-Ranking can help the information retrieval and machine learning research communities.
Acknowledgements This project was only possible thanks to the members of the core TF-Ranking team: Rama Pasumarthi, Cheng Li, Sebastian Bruch, Nadav Golbandi, Stephan Wolf, Jan Pfeifer, Rohan Anil, Marc Najork, Patrick McGregor and Clemens Mewald. We thank the members of the TensorFlow team for their advice and support: Alexandre Passos, Mustafa Ispir, Karmel Allison, Martin Wicke, and others. Finally, we extend our special thanks to our collaborators, interns and early adopters: Suming Chen, Zhen Qin, Chirag Sethi, Maryam Karimzadehgan, Makoto Uchida, Yan Zhu, Qingyao Ai, Brandon Tran, Donald Metzler, Mike Colagrosso, and many others at Google who helped in evaluating and testing the early versions of TF-Ranking.