Posted by Danijar Hafner, Student Researcher, Google AI
Research into how artificial agents can improve their decisions over time is progressing rapidly via reinforcement learning (RL). For this technique, an agent observes a stream of sensory inputs (e.g. camera images) while choosing actions (e.g. motor commands), and sometimes receives a reward for achieving a specified goal. Model-free approaches to RL aim to directly predict good actions from the sensory observations, enabling DeepMind's DQN to play Atari and other agents to control robots. However, this black-box approach often requires several weeks of simulated interaction to learn through trial and error, limiting its usefulness in practice.
Model-based RL, in contrast, attempts to have agents learn how the world behaves in general. Instead of directly mapping observations to actions, this allows an agent to explicitly plan ahead, to more carefully select actions by "imagining" their long-term outcomes. Model-based approaches have achieved substantial successes, including AlphaGo, which imagines taking sequences of moves on a fictitious board with the known rules of the game. However, to leverage planning in unknown environments (such as controlling a robot given only pixels as input), the agent must learn the rules or dynamics from experience. Because such dynamics models in principle allow for higher efficiency and natural multi-task learning, creating models that are accurate enough for successful planning is a long-standing goal of RL.
To spur progress on this research challenge and in collaboration with DeepMind, we present the Deep Planning Network (PlaNet) agent, which learns a world model from image inputs only and successfully leverages it for planning. PlaNet solves a variety of image-based control tasks, competing with advanced model-free agents in terms of final performance while being 5000% more data efficient on average. We are additionally releasing the source code for the research community to build upon.
The PlaNet agent learning to solve a variety of continuous control tasks from images in 2000 attempts. Previous agents that do not learn a model of the environment often require 50 times as many attempts to reach comparable performance.
How PlaNet Works In short, PlaNet learns a dynamics model given image inputs and efficiently plans with it to gather new experience. In contrast to previous methods that plan over images, we rely on a compact sequence of hidden or latent states. This is called a latent dynamics model: instead of directly predicting from one image to the next image, we predict the latent state forward. The image and reward at each step are then generated from the corresponding latent state. By compressing the images in this way, the agent can automatically learn more abstract representations, such as positions and velocities of objects, making it easier to predict forward without having to generate images along the way.
Learned Latent Dynamics Model: In a latent dynamics model, the information of the input images is integrated into the hidden states (green) using the encoder network (grey trapezoids). The hidden state is then projected forward in time to predict future images (blue trapezoids) and rewards (blue rectangle).
To learn an accurate latent dynamics model, we introduce:
A Recurrent State Space Model: A latent dynamics model with both deterministic and stochastic components, allowing the model to predict a variety of possible futures as needed for robust planning, while remembering information over many time steps. Our experiments indicate that both components are crucial for high planning performance.
A Latent Overshooting Objective: We generalize the standard training objective for latent dynamics models to train multi-step predictions, by enforcing consistency between one-step and multi-step predictions in latent space. This yields a fast and effective objective that improves long-term predictions and is compatible with any latent sequence model.
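To make the split between deterministic and stochastic components more concrete, here is a minimal sketch of a single transition in a toy, scalar state space. It is not the PlaNet implementation: the function names, the hand-picked coefficients, and the fixed noise scale are all assumptions chosen for illustration, whereas a real model would learn these quantities with neural networks.

```python
import math
import random


def toy_rssm_step(h, z, action, rng):
    """One transition of a toy model with a deterministic and a stochastic part.

    h: deterministic state that carries memory across time steps.
    z: stochastic state, resampled at every step.
    All coefficients below are hypothetical constants, not learned values.
    """
    # Deterministic path: a fixed function of the previous states and action.
    h_next = math.tanh(0.9 * h + 0.5 * z + 0.3 * action)
    # Stochastic path: sample from a Gaussian whose mean depends on h_next,
    # so the same history can lead to several plausible futures.
    z_next = rng.gauss(0.8 * h_next, 0.1)
    return h_next, z_next


def toy_rollout(actions, seed):
    """Roll the toy model forward over a sequence of actions."""
    rng = random.Random(seed)
    h, z = 0.0, 0.0
    trajectory = []
    for action in actions:
        h, z = toy_rssm_step(h, z, action, rng)
        trajectory.append((h, z))
    return trajectory


# Two rollouts with the same actions but different seeds give different futures.
future_a = toy_rollout([1.0, 1.0, 1.0], seed=0)
future_b = toy_rollout([1.0, 1.0, 1.0], seed=1)
```

In this sketch the deterministic state carries information forward through every step, while the sampled state is what lets repeated rollouts diverge.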
While predicting future images allows us to teach the model, encoding and decoding images (trapezoids in the figure above) requires significant computation, which would slow down planning. However, planning in the compact latent state space is fast since we only need to predict future rewards, and not images, to evaluate an action sequence. For example, the agent can imagine how the position of a ball and its distance to the goal will change for certain actions, without having to visualize the scenario. This allows us to compare 10,000 imagined action sequences with a large batch size every time the agent chooses an action. We then execute the first action of the best sequence found and replan at the next step.
Planning in Latent Space: For planning, we encode past images (gray trapezoid) into the current hidden state (green). From there, we efficiently predict future rewards for multiple action sequences. Note how the expensive image decoder (blue trapezoid) from the previous figure is gone. We then execute the first action of the best sequence found (red box).
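The plan, act, replan loop described above can be sketched in a few lines. This is only an illustration under strong assumptions: `toy_latent_step` and `toy_reward` are hypothetical stand-ins for a learned latent transition and reward predictor, the state is a single number, and candidates are drawn by plain random sampling, which may differ from the search procedure used in the paper. The candidate count is also far below 10,000 to keep the example fast.

```python
import random


def toy_latent_step(state, action):
    """Hypothetical stand-in for a learned latent transition (1-D position)."""
    return state + action


def toy_reward(state, goal=1.0):
    """Hypothetical stand-in for a learned reward predictor."""
    return -abs(goal - state)


def plan_first_action(state, rng, horizon=3, num_candidates=1000):
    """Score sampled action sequences by imagined reward; return the best first action."""
    best_return, best_sequence = float("-inf"), None
    for _ in range(num_candidates):
        sequence = [rng.uniform(-0.5, 0.5) for _ in range(horizon)]
        imagined, total = state, 0.0
        for action in sequence:
            # Only latent states and rewards are predicted, never images.
            imagined = toy_latent_step(imagined, action)
            total += toy_reward(imagined)
        if total > best_return:
            best_return, best_sequence = total, sequence
    return best_sequence[0]


def run_episode(steps=10, seed=0):
    """Execute the first planned action, then replan from the new state."""
    rng = random.Random(seed)
    state = 0.0
    for _ in range(steps):
        action = plan_first_action(state, rng)
        # In a real agent this would be a step in the environment.
        state = toy_latent_step(state, action)
    return state


final_state = run_episode()
```

Because only the first action of each plan is executed, errors in later imagined steps are corrected at the next replanning step rather than accumulated.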
Compared to our preceding work on world models, PlaNet works without a policy network -- it chooses actions purely by planning, so it benefits from model improvements on the spot. For the technical details, check out our online research paper or the PDF version.
PlaNet vs. Model-Free Methods We evaluate PlaNet on continuous control tasks. The agent is only given image observations and rewards. We consider tasks that pose a variety of different challenges:
A cartpole swing-up task, with a fixed camera, so the cart can move out of sight. The agent thus must absorb and remember information over multiple frames.
A finger spin task that requires predicting two separate objects, as well as the interactions between them.
A cheetah running task that includes contacts with the ground that are difficult to predict precisely, calling for a model that can predict multiple possible futures.
A cup task, which only provides a sparse reward signal once a ball is caught. This demands accurate predictions far into the future to plan a precise sequence of actions.
A walker task, in which a simulated robot starts off by lying on the ground, and must first learn to stand up and then walk.
PlaNet agents trained on a variety of image-based control tasks. The animation shows the input images as the agent is solving the tasks. The tasks pose different challenges: partial observability, contacts with the ground, sparse rewards for catching a ball, and controlling a challenging bipedal robot.
Our work constitutes one of the first examples where planning with a learned model outperforms model-free methods on image-based tasks. The table below compares PlaNet to the well-known A3C agent and the D4PG agent, which combines recent advances in model-free RL. The numbers for these baselines are taken from the DeepMind Control Suite. PlaNet clearly outperforms A3C on all tasks and reaches final performance close to D4PG while using, on average, 50 times less interaction with the environment.
One Agent for All Tasks Additionally, we train a single PlaNet agent to solve all six tasks. The agent is randomly placed into different environments without knowing the task, so it needs to infer the task from its image observations. Without changes to the hyperparameters, the multi-task agent achieves the same mean performance as individual agents. While it learns more slowly on the cartpole tasks, it learns substantially faster and reaches a higher final performance on the challenging walker task that requires exploration.
Video predictions of the PlaNet agent trained on multiple tasks. Holdout episodes collected with the trained agent are shown above and open-loop agent hallucinations below. The agent observes the first 5 frames as context to infer the task and state and accurately predicts ahead for 50 steps given a sequence of actions.
Conclusion Our results showcase the promise of learning dynamics models for building autonomous RL agents. We advocate for further research that focuses on learning accurate dynamics models on tasks of even higher difficulty, such as 3D environments and real-world robotics tasks. A possible ingredient for scaling up is the processing power of TPUs. We are excited about the possibilities that model-based reinforcement learning opens up, including multi-task learning, hierarchical planning and active exploration using uncertainty estimates.
Acknowledgements This project is a collaboration with Timothy Lillicrap, Ian Fischer, Ruben Villegas, Honglak Lee, David Ha and James Davidson. We further thank everybody who commented on our paper draft and provided feedback at any point throughout the project.
Posted by Tilman Reinhardt, Software Engineer, Google Maps
One of the consistent challenges when navigating with Google Maps is figuring out the right direction to go: sure, the app tells you to go north - but many times you're left wondering, "Where exactly am I, and which way is north?" Over the years, we've attempted to improve the accuracy of the blue dot with tools like GPS and compass, but found that both have physical limitations that make solving this challenge difficult, especially in urban environments.
We're experimenting with a way to solve this problem using a technique we call global localization, which combines Visual Positioning Service (VPS), Street View, and machine learning to more accurately identify position and orientation. Using the smartphone camera as a sensor, this technology enables a more powerful and intuitive way to help people quickly determine which way to go.
Due to limitations with accuracy and orientation, guidance via GPS alone is limited in urban environments. Using VPS, Street View and machine learning, Global Localization can provide better context on where you are relative to where you're going.
In this post, we'll discuss some of the limitations of navigation in urban environments and how global localization can help overcome them.
Where GPS Falls Short The process of identifying the position and orientation of a device relative to some reference point is referred to as localization. Various techniques approach localization in different ways. GPS relies on measuring the delay of radio signals from multiple dedicated satellites to determine a precise location. However, in dense urban environments like New York or San Francisco, it can be incredibly hard to pinpoint a geographic location due to low visibility to the sky and signals reflecting off of buildings. This can result in highly inaccurate placements on the map, meaning that your location could appear on the wrong side of the street, or even a few blocks away.
GPS signals bouncing off facades in an urban environment.
GPS has another technical shortcoming: it can only determine the location of the device, not the orientation. Sometimes, sensors in your mobile device can remedy the situation by measuring the magnetic and gravity field of the earth and the relative motion of the device in order to give rough estimates of your orientation. But these sensors are easily skewed by magnetic objects such as cars, pipes, buildings, and even electrical wires inside the phone, resulting in errors that can be inaccurate by up to 180 degrees.
A New Approach to Localization To improve the precision of the position and orientation of the blue dot on the map, a new complementary technology is necessary. When walking down the street, you orient yourself by comparing what you see with what you expect to see. Global localization uses a combination of techniques that enable the camera on your mobile device to orient itself much as you would.
VPS determines the location of a device based on imagery rather than GPS signals. VPS first creates a map by taking a series of images which have a known location and analyzing them for key visual features, such as the outline of buildings or bridges, to create a large-scale and fast searchable index of those visual features. To localize the device, VPS compares the features in imagery from the phone to those in the VPS index. However, the accuracy of localization through VPS is greatly affected by the quality of both the imagery and the location associated with it. And that poses another question: where does one find an extensive source of high-quality global imagery?
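As a rough illustration of comparing features from a phone image against an index of features with known locations, the sketch below uses nearest-neighbor matching with a ratio test. It is not how VPS is implemented: the two-dimensional descriptors, the dictionary layout of the index, the ratio threshold, and the averaging of matched locations are all simplifying assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_features(query_descriptors, index, ratio=0.8):
    """Match each query descriptor to its nearest index entry.

    A match is kept only if the nearest entry is clearly closer than the
    second nearest (a ratio test), which discards ambiguous features.
    """
    matches = []
    for query in query_descriptors:
        ranked = sorted(index, key=lambda e: squared_distance(query, e["descriptor"]))
        d_best = squared_distance(query, ranked[0]["descriptor"])
        d_second = squared_distance(query, ranked[1]["descriptor"])
        if d_best < (ratio ** 2) * d_second:
            matches.append(ranked[0])
    return matches


def estimate_position(matches):
    """Crude position estimate: the mean of the matched reference locations."""
    if not matches:
        return None
    n = len(matches)
    return (
        sum(m["location"][0] for m in matches) / n,
        sum(m["location"][1] for m in matches) / n,
    )


# Hypothetical index: descriptors extracted from imagery with known locations.
index = [
    {"descriptor": (1.0, 0.0), "location": (10.0, 20.0)},
    {"descriptor": (0.0, 1.0), "location": (12.0, 22.0)},
    {"descriptor": (-1.0, 0.0), "location": (50.0, 60.0)},
]
# Descriptors from the phone image; the last one is ambiguous and is rejected.
query = [(0.9, 0.1), (0.1, 0.9), (0.0, 0.0)]
position = estimate_position(match_features(query, index))
```

The example also shows why the quality of the reference data matters: a wrong location attached to a well-matched feature would pull the estimate away from the true position.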
Enter Street View Over 10 years ago we launched Street View in Google Maps in order to help people explore the world more deeply. In that time, Street View has continued to expand its coverage of the world, empowering people to not only preview their route, but also step inside famous landmarks and museums, no matter where they are. To deliver global localization with VPS, we connected it with Street View data, making use of information gathered and tested from over 93 countries across the globe. This rich dataset provides trillions of strong reference points to apply triangulation, helping more accurately determine the position of a device and guide people towards their destination.
Features matched from multiple images.
Although this approach works well in theory, making it work well in practice is a challenge. The problem is that the imagery from the phone at the time of localization may differ from what the scene looked like when the Street View imagery was collected, perhaps months earlier. For example, trees have lots of rich detail, but change as the seasons change and even as the wind blows. To get a good match, we need to filter out temporary parts of the scene and focus on permanent structure that doesn't change over time. That's why a core ingredient in this new approach is applying machine learning to automatically decide which features to pay attention to, prioritizing features that are likely to be permanent parts of the scene and ignoring things like trees, dynamic light movement, and construction that are likely transient. This is just one of the many ways in which we use machine learning to improve accuracy.
Combining Global Localization with Augmented Reality Global localization is an additional option that users can enable when they most need accuracy. And, this increased precision has enabled the possibility of a number of new experiences. One of the newest features we're testing is the ability to use ARCore, Google's platform for building augmented reality experiences, to overlay directions right on top of Google Maps when someone is in walking navigation mode. With this feature, a quick glance at your phone shows you exactly which direction you need to go.
Although early results are promising, there's significant work to be done. One outstanding challenge is making this technology work everywhere, in all types of conditions—think late at night, in a snowstorm, or in torrential downpour. To make sure we're building something that's truly useful, we're starting to test this feature with select Local Guides, a small group of Google Maps enthusiasts around the world who we know will offer us the feedback about how this approach can be most helpful.
Like other AI-driven camera experiences such as Google Lens (which uses the camera to let you search what you see), we believe the ability to overlay directions over the real world environment offers an exciting and useful way to use the technology that already exists in your pocket. We look forward to continuing to develop this technology, and the potential for smartphone cameras to add new types of valuable experiences.
An example image from the 2018 test set, comparing the original image to BPG, JPEG and the results from nine competing teams. All the methods are better than JPEG in color reproduction and many of them are comparable to BPG in their ability to create legible text on the sign.
This year, we are again happy to co-sponsor the second Workshop and Challenge on Learned Image Compression at CVPR 2019 in Long Beach, California. The half-day workshop will feature talks from invited guests Anne Aaron (Netflix), Aaron Van Den Oord (DeepMind) and Jyrki Alakuijala (Google), along with presentations from five top performing teams in the 2019 competition, which is currently open for submissions.
This year's competition features two tracks for participants to compete in. The first track remains the same as last year, in what we're calling the "low-rate compression" track. The goal for low-rate compression is to compress an image dataset to 0.15 bits per pixel while maintaining the highest quality metrics, as measured by PSNR, MS-SSIM and a human-evaluated rating task.
The second track incorporates feedback from last year's workshop, in which participants expressed interest in the inverse challenge of determining the amount an image could be compressed and still look good. In this "transparent compression" challenge, we set a relatively high quality threshold for the test dataset (in both PSNR and MS-SSIM) with the goal of compressing the dataset to the smallest file sizes.
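For readers less familiar with the two quantities that both tracks trade off, the sketch below computes bits per pixel from a compressed file size and PSNR from pixel values using the standard textbook definitions. The function names and the flat list-of-pixels representation are assumptions made for this example and are not taken from the competition's evaluation code.

```python
import math


def bits_per_pixel(compressed_bytes, width, height):
    """Rate: total compressed bits divided by the number of pixels."""
    return compressed_bytes * 8 / (width * height)


def psnr(original, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# A 512 x 1000 image stored in 9,600 bytes sits at the low-rate target.
rate = bits_per_pixel(9600, 512, 1000)

# A small reconstruction error on 8-bit pixels gives a high PSNR.
quality = psnr([100, 120, 140, 160], [101, 119, 141, 159])
```

The low-rate track fixes the rate and maximizes quality, while the transparent track fixes a quality threshold and minimizes the rate.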
If you're doing research in the field of learned image compression, we encourage you to participate in CLIC during CVPR 2019. For more details on the competition and dates, please refer to compression.cc.
Acknowledgements This workshop is being jointly hosted by researchers at Google, Twitter and ETH Zürich. We'd like to thank: George Toderici (Google), Michele Covell (Google), Johannes Ballé (Google), Nick Johnston (Google), Eirikur Agustsson (Google), Wenzhe Shi (Twitter), Lucas Theis (Twitter), Radu Timofte (ETH Zürich), Fabian Mentzer (ETH Zürich) for their contributions.
Posted by Sagar Savla, Product Manager, Machine Perception
The World Health Organization (WHO) estimates that there are 466 million people globally who are deaf or hard of hearing. A crucial technology in empowering communication and inclusive access to the world's information for this population is automatic speech recognition (ASR), which enables computers to detect audible languages and transcribe them into text for reading. Google's ASR is behind automated captions in YouTube, presentations in Slides and also phone calls. However, while ASR has seen multiple improvements in the past couple of years, the deaf and hard of hearing still mainly rely on manual-transcription services like CART in the US, Palantypist in the UK, or STTR in other countries. These services can be prohibitively expensive and often need to be scheduled far in advance, diminishing the opportunities for the deaf and hard of hearing to participate in impromptu conversations as well as social occasions. We believe that technology can bridge this gap and empower this community.
Today, we're announcing Live Transcribe, a free Android service that makes real-world conversations more accessible by bringing the power of automatic captioning into everyday, conversational use. Powered by Google Cloud, Live Transcribe captions conversations in real-time, supporting over 70 languages and more than 80% of the world's population. You can launch it with a single tap from within any app, directly from the accessibility icon on the system tray.
Building Live Transcribe Previous ASR-based transcription systems have generally required compute-intensive models, exhaustive user research and expensive access to connectivity, all of which hinder the adoption of automated continuous transcription. To address these issues and ensure reasonably accurate real-time transcription, Live Transcribe combines the results of extensive user experience (UX) research with seamless and sustainable connectivity to speech processing servers. Furthermore, we needed to ensure that connectivity to these servers didn't cause our users excessive data usage.
Relying on cloud ASR provides us greater accuracy, but we wanted to reduce the network data consumption that Live Transcribe requires. To do this, we implemented an on-device neural network-based speech detector, built on our previous work with AudioSet. This network is an image-like model, similar to our published VGGish model, which detects speech and automatically manages network connections to the cloud ASR engine, minimizing data usage over long periods of use.
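The idea of letting an on-device detector decide when to keep the cloud connection open can be sketched as a simple gate over per-frame speech probabilities. This is an assumption-laden illustration, not the Live Transcribe logic: the threshold, the number of quiet frames tolerated before disconnecting, and the function name are all hypothetical.

```python
def gate_connection(speech_probs, open_threshold=0.6, hangover_frames=3):
    """Decide, frame by frame, whether the cloud connection should be open.

    speech_probs: per-frame speech probabilities from a hypothetical
    on-device detector. The connection opens as soon as speech is detected
    and closes only after `hangover_frames` consecutive quiet frames, so
    short pauses do not cause reconnects.
    """
    connected = False
    quiet_run = 0
    states = []
    for prob in speech_probs:
        if prob >= open_threshold:
            connected = True
            quiet_run = 0
        elif connected:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                connected = False
                quiet_run = 0
        states.append(connected)
    return states


# Speech, a pause long enough to disconnect, then speech again.
states = gate_connection([0.1, 0.9, 0.2, 0.2, 0.2, 0.1, 0.8])
```

Audio would be streamed only for frames where the gate is open, which is where the data savings over long periods of use would come from.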
User Experience To make Live Transcribe as intuitive as possible, we partnered with Gallaudet University to kickstart user experience research collaborations that would ensure core user needs were satisfied while maximizing the potential of our technologies. We considered several different modalities, computers, tablets, smartphones, and even small projectors, iterating ways to display auditory information and captions. In the end, we decided to focus on the smartphone form factor because of the sheer ubiquity of these devices and the increasing capabilities they have.
Once this was established, we needed to address another important issue: displaying transcription confidence. Traditionally considered to be helpful to the user, our research explored whether we actually needed to show word-level or phrase-level confidence.
Displaying confidence level of the transcription. Yellow is high confidence, green is medium and blue is low confidence. White is fresh text awaiting context before finalizing. On the left, the coloring is at a per-phrase level, while on the right it is at a per-word level (see footnote 1). Research found these signals to be distracting to the user without providing conversational value.
Reinforcing previous UX research in this space, our research shows that a transcript is easiest to read when it is not layered with these signals. Instead, Live Transcribe focuses on better presentation of the text and supplementing it with other auditory signals besides speech.
Another useful UX signal is the noise level of the user's current environment. Known as the cocktail party problem, understanding a speaker in a noisy room is a major challenge for computers. To address this, we built an indicator that visualizes the volume of user speech relative to background noise. This also gives users instant feedback on how well the microphone is receiving the incoming speech from the speaker, allowing them to adjust the placement of the phone.
The loudness and noise indicator is made of two concentric circles. The inner brighter circle, indicating the noise floor, tells a deaf user how audibly noisy the current environment is. The outer circle shows how well the speaker’s voice is received. Together, the circles visually show the relative difference intuitively.
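One plausible way to derive the two levels behind such an indicator is to compare the average power of background noise and of speech on a decibel scale. The sketch below assumes audio samples normalized to the range -1 to 1 and uses hypothetical function names; it is not the app's actual signal processing.

```python
import math


def level_db(samples, eps=1e-12):
    """Average power of a block of samples, in decibels."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square + eps)  # eps avoids log(0) on silence


def indicator_levels(noise_samples, speech_samples):
    """Levels for the two circles, plus how far speech sits above the noise."""
    noise_db = level_db(noise_samples)
    speech_db = level_db(speech_samples)
    return {
        "noise_db": noise_db,
        "speech_db": speech_db,
        "margin_db": speech_db - noise_db,
    }


# Speech with ten times the amplitude of the noise is about 20 dB above it.
levels = indicator_levels([0.01] * 100, [0.1] * 100)
```

A small margin would suggest moving the phone closer to the speaker, which is the kind of feedback the indicator is meant to give.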
Future Work Potential future improvements in mobile-based automatic speech transcription include on-device recognition, speaker-separation, and speech enhancement. Relying solely on transcription can have pitfalls that can lead to miscommunication. Our research with Gallaudet University shows that combining it with other auditory signals like speech detection and a loudness indicator makes a tangibly meaningful change in communication options for our users.
Live Transcribe is now available in a staged rollout on the Play Store, and is pre-installed on all Pixel 3 devices with the latest update. Live Transcribe can then be enabled via the Accessibility Settings. You can also read more about it on The Keyword.
Acknowledgements Live Transcribe was made by researchers Chet Gnegy, Dimitri Kanevsky, and Justin S. Paul in collaboration with Android Accessibility team members Brian Kemler, Thomas Lin, Alex Huang, Jacqueline Huang, Ben Chung, Richard Chang, I-ting Huang, Jessie Lin, Ausmus Chang, Weiwei Wei, Melissa Barnhart and Bingying Xia. We'd also like to thank our close partners from Gallaudet University, Christian Vogler, Norman Williams and Paula Tucker.
Footnote 1: Eagle-eyed readers can see the phrase level confidence mode in use by Dr. Obeidat in the video above.
To correctly understand an article, sometimes one will need to refer to a word or a sentence that occurs a few thousand words back. This is an example of long-range dependence — a common phenomenon found in sequential data — that must be understood in order to handle many real-world tasks. While people do this naturally, modeling long-term dependency with neural networks remains a challenge. Gating-based RNNs and the gradient clipping technique improve the ability of modeling long-term dependency, but are still not sufficient to fully address this issue.
One way to approach this challenge is to use Transformers, which allow direct connections between data units, offering the promise of better capturing long-term dependency. However, in language modeling, Transformers are currently implemented with a fixed-length context, i.e. a long text sequence is truncated into fixed-length segments of a few hundred characters, and each segment is processed separately.
Vanilla Transformer with a fixed-length context at training time.
This introduces two critical limitations:
The algorithm is not able to model dependencies that are longer than a fixed length.
The segments usually do not respect the sentence boundaries, resulting in context fragmentation which leads to inefficient optimization. This is particularly troublesome even for short sequences, where long range dependency isn't an issue.
To address these limitations, we propose Transformer-XL, a novel architecture that enables natural language understanding beyond a fixed-length context. Transformer-XL consists of two techniques: a segment-level recurrence mechanism and a relative positional encoding scheme.
Segment-level Recurrence During training, the representations computed for the previous segment are fixed and cached to be reused as an extended context when the model processes the next new segment. This additional connection increases the largest possible dependency length by N times, where N is the depth of the network, because contextual information is now able to flow across segment boundaries. Moreover, this recurrence mechanism also resolves the context fragmentation issue, providing necessary context for tokens in the front of a new segment.
Transformer-XL with segment-level recurrence at training time.
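The effect of caching the previous segment can be illustrated with a little bookkeeping, without any attention computation. The sketch below assumes that each layer of the current segment reads the layer below for both the current segment and the cached previous segment, and tracks which segments can influence the top layer. The function names are hypothetical and the model is reduced to this dependency structure only.

```python
def split_into_segments(tokens, segment_length):
    """Cut a token list into consecutive fixed-length segments (the last may be shorter)."""
    return [tokens[i:i + segment_length] for i in range(0, len(tokens), segment_length)]


def reachable_segments(num_layers, segment_index, use_recurrence):
    """Segments whose inputs can influence the top layer of `segment_index`.

    Without recurrence, each segment is processed in isolation. With
    recurrence, every layer can also read cached states of the previous
    segment, so the reach grows by one segment per layer.
    """
    reach = {segment_index}  # the input layer sees only its own segment
    for _ in range(num_layers):
        if use_recurrence:
            reach |= {s - 1 for s in reach if s > 0}
    return sorted(reach)


segments = split_into_segments(list(range(10)), segment_length=4)
vanilla = reachable_segments(num_layers=3, segment_index=5, use_recurrence=False)
recurrent = reachable_segments(num_layers=3, segment_index=5, use_recurrence=True)
```

In this toy accounting a three-layer model reaches three segments back in addition to the current one, which is in line with the dependency length growing with network depth.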
Relative Positional Encodings Naively applying segment-level recurrence does not work, however, because the positional encodings are not coherent when we reuse the previous segments. For example, consider an old segment with contextual positions [0, 1, 2, 3]. When a new segment is processed, we have positions [0, 1, 2, 3, 0, 1, 2, 3] for the two segments combined, where the semantics of each position id is incoherent throughout the sequence. To address this, we propose a novel relative positional encoding scheme to make the recurrence mechanism possible. Moreover, different from other relative positional encoding schemes, our formulation uses fixed embeddings with learnable transformations instead of learnable embeddings, and thus is more generalizable to longer sequences at test time. When both of these approaches are combined, Transformer-XL has a much longer effective context than a vanilla Transformer model at evaluation time.
Vanilla Transformer with a fixed-length context at evaluation time.
Transformer-XL with segment-level recurrence at evaluation time.
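The position clash from the [0, 1, 2, 3, 0, 1, 2, 3] example, and how distances between tokens avoid it, can be shown directly. The sketch below only computes the position ids and the query-to-key distances; it does not implement the encoding scheme itself, and the function names are assumptions made for this example.

```python
def absolute_ids(segment_length, num_segments):
    """Position ids when each segment restarts counting from zero."""
    return list(range(segment_length)) * num_segments


def relative_distances(query_index):
    """Distance from a query to each key at or before it in the combined sequence."""
    return [query_index - key_index for key_index in range(query_index + 1)]


# Cached segment followed by the new segment: ids repeat, so id 1 is ambiguous.
ids = absolute_ids(segment_length=4, num_segments=2)

# The token at combined index 5 sees each earlier token at a unique distance.
distances = relative_distances(query_index=5)
```

Because a distance depends only on how far apart two tokens are, it stays meaningful no matter which segment each token came from.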
Furthermore, Transformer-XL is able to process the elements in a new segment all together without recomputation, leading to a significant speed increase (discussed below).
Results
Transformer-XL obtains new state-of-the-art (SoTA) results on a variety of major language modeling (LM) benchmarks, including character-level and word-level tasks on both long and short sequences. Empirically, Transformer-XL enjoys three benefits:
Transformer-XL learns dependency that is about 80% longer than RNNs and 450% longer than vanilla Transformers, which generally have better performance than RNNs, but are not the best for long-range dependency modeling due to fixed-length contexts (please see our paper for details).
Transformer-XL is up to 1,800+ times faster than a vanilla Transformer during evaluation on language modeling tasks, because no re-computation is needed (see figures above).
Transformer-XL has better performance in perplexity (more accurate at predicting a sample) on long sequences because of long-term dependency modeling, and also on short sequences by resolving the context fragmentation problem.
Transformer-XL improves the SoTA bpc/perplexity from 1.06 to 0.99 on enwik8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without fine tuning). We are the first to break through the 1.0 barrier on char-level LM benchmarks.
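For reference, the two metrics quoted above can be computed from the probabilities a model assigns to the correct next token or character, using their standard definitions. The sketch below uses made-up probabilities and hypothetical function names; it is not the evaluation code from the paper.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative log-likelihood (word-level tasks)."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


def bits_per_character(char_probs):
    """Average negative log2-likelihood per character (char-level tasks)."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


# A model that always gives the correct token probability 1/4 has perplexity 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that always gives the correct character probability 1/2 needs 1 bit.
bpc = bits_per_character([0.5] * 8)
```

Lower is better for both metrics, so a value below 1.0 bpc means that the model needs, on average, less than one bit to encode each character.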
We envision many exciting potential applications of Transformer-XL, including but not limited to improving language model pretraining methods such as BERT, generating realistic, long articles, and applications in the image and speech domains, which are also important areas in the world of long-term dependency. For more detail, please see our paper.
The code, pretrained models, and hyperparameters used in our paper are also available in both TensorFlow and PyTorch on GitHub.
Posted by Tom Kwiatkowski and Michael Collins, Research Scientists, Google AI Language
Open-domain question answering (QA) is a benchmark task in natural language understanding (NLU) that aims to emulate how people look for information, finding answers to questions by reading and understanding entire documents. Given a question expressed in natural language ("Why is the sky blue?"), a QA system should be able to read the web (such as this Wikipedia page) and return the correct answer, even if the answer is somewhat complicated and long. However, there are currently no large, publicly available sources of naturally occurring questions (i.e. questions asked by a person seeking information) and answers that can be used to train and evaluate QA models. This is because assembling a high-quality dataset for question answering requires a large source of real questions and significant human effort in finding correct answers.
To help spur research advances in QA, we are excited to announce Natural Questions (NQ), a new, large-scale corpus for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions. NQ is large, consisting of 300,000 naturally occurring questions, along with human annotated answers from Wikipedia pages, to be used in training QA systems. We have additionally included 16,000 examples where answers (to the same questions) are provided by 5 different annotators, useful for evaluating the performance of the learned QA systems. Since answering the questions in NQ requires much deeper understanding than is needed to answer trivia questions — which are already quite easy for computers to solve — we are also announcing a challenge based on this data to help advance natural language understanding in computers.
The Data NQ is the first dataset to use naturally occurring queries and focus on finding answers by reading an entire page, rather than extracting answers from a short paragraph. To create NQ, we started with real, anonymized, aggregated queries that users have posed to Google's search engine. We then ask annotators to find answers by reading through an entire Wikipedia page as they would if the question had been theirs. Annotators look for both long answers that cover all of the information required to infer the answer, and short answers that answer the question succinctly with the names of one or more entities. The quality of the annotations in the NQ corpus has been measured at 90% accuracy.
The Challenge NQ is aimed at enabling QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. Systems will need to first decide whether the question is sufficiently well defined to be answerable — many questions make false assumptions or are just too ambiguous to be answered concisely. Then they will need to decide whether there is any part of the Wikipedia page that contains all of the information needed to infer the answer. We believe that the long answer identification task — finding all of the information required to infer an answer — requires a deeper level of language understanding than finding short answers once the long answers are known.
It is our hope that the release of NQ, and the associated challenge, will help spur the development of more effective and robust QA systems. We encourage the NLU community to participate and to help close the large gap between the performance of current state-of-the-art approaches and a human upper bound. Please visit the challenge website to view the leaderboard and learn more.
Posted by Alvin Rajkomar, MD and Eyal Oren, PhD, Google AI, Healthcare
In 2018 we published a paper that showed how machine learning, when applied to medical records, can predict what might happen to patients who are hospitalized: for example, how long they would need to be in the hospital and, if discharged, how likely they would be to come back unexpectedly. Predictive models of various kinds have already been deployed in hospital settings by others, and our work aims to further improve potential clinical benefit by using new models whose predictions are faster, more accurate, and more adaptable to a broader range of clinical contexts.
Any endeavor to demonstrate the promise of machine learning requires intense collaboration between engineers, doctors, and medical researchers to make sure the work benefits patients, physicians, and health systems, and that it is equitable. Google is already fortunate to partner with some of the best academic medical centers in the world and we are now expanding this work to include Intermountain Healthcare, based in Utah.
The initial collaboration will focus on understanding how Google might adapt machine learning predictions to the various Intermountain care settings, from primary care clinics to the TeleHealth critical care unit, which remotely monitors critically ill patients in surrounding hospitals. We see potential in exploring how scalable computing platforms that include predictions might assist clinical teams in providing the best possible care.
As with our previous research, we will begin with jointly testing the performance of machine learning models on historical records, following strict policies to ensure that all data privacy and security measures are followed.
We are excited to explore how scalable computing platforms that include predictions might assist clinical teams in providing the best possible care in these settings. We additionally hope to further validate that our approach to predictions can work across health systems and improve care for patients.
Posted by Tuomas Haarnoja, Student Researcher and Sergey Levine, Faculty Advisor, Robotics at Google
Deep reinforcement learning (RL) provides the promise of fully automated learning of robotic behaviors directly from experience and interaction in the real world, due to its ability to process complex sensory input using general-purpose neural network representations. However, many existing RL algorithms require days or weeks (or more) worth of real-world data in order to converge to the desired behavior. Furthermore, such algorithms can be tough to deploy on complex robotic systems (such as legged robots), which can easily get damaged during the exploration phase; hyperparameter settings can be challenging to tune; and various safety considerations can introduce further limitations.
In collaboration with UC Berkeley, we recently released Soft Actor-Critic (SAC), a stable and efficient deep RL algorithm suitable for real-world robotic skill learning that is well-aligned with the requirements of robotic experimentation. Importantly, SAC is efficient enough to solve real-world robot tasks in only a handful of hours, and works on a variety of environments with a single set of hyperparameters. Below, we discuss some of the research behind SAC, and also describe some of our recent experiments.
Requirements for Real-World Robotic Learning Real-world robotic experimentation brings significant challenges, such as constant interruptions in the data stream due to hardware failures and manual resets, and the need for smooth exploration to avoid mechanical wear and tear on the robot. These challenges place additional restrictions on both the algorithm and its implementation, including (but not limited to):
Good sample efficiency to lower the learning time
Minimal number of hyperparameters that require tuning
Reusing already collected data on different scenarios (known as off-policy learning)
Ensuring that learning and exploration do not damage the hardware
Soft Actor-Critic Soft actor-critic is based on maximum entropy reinforcement learning, a framework that aims to both maximize the expected reward (which is the standard RL objective) and to maximize the policy's entropy. Policies with higher entropy are more random, which intuitively means that maximum entropy reinforcement learning prefers the most random policy that still achieves a high reward.
Why might this be desirable for robotic learning? The most obvious reason is that policies optimized for maximum entropy will be more robust: if the policy can tolerate highly random behavior during training, it is more likely to respond successfully to unexpected perturbations at test time. However, a more subtle reason is that training for maximum entropy can improve both the algorithm's robustness to hyperparameters and its sample efficiency (to learn more, see this BAIR blog post, and this tutorial).
Soft actor-critic maximizes the entropy augmented reward by learning a stochastic policy that maps states to actions and a Q-function that estimates the objective value of the current policy, optimizing them using approximate dynamic programming. In doing so, SAC views the objective as a grounded way to derive better reinforcement learning algorithms that perform consistently and are sample efficient enough to be applicable to real-world robotic applications. For technical details please see our technical report.
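To make the entropy-augmented objective described above concrete, here is a minimal sketch in Python. It is an illustration only, not the SAC implementation: it assumes a discrete action distribution (SAC itself targets continuous actions), and the names `policy_entropy`, `entropy_augmented_return` and the temperature `alpha` are hypothetical choices for this example.

```python
import math

def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)

def entropy_augmented_return(rewards, action_dists, alpha=0.2):
    """Sum over time steps of reward plus alpha times the policy entropy.

    `alpha` (an assumed temperature parameter) trades off reward against
    randomness; alpha = 0 recovers the standard RL return.
    """
    return sum(r + alpha * policy_entropy(d)
               for r, d in zip(rewards, action_dists))

# Two policies that collect the same reward on a two-step episode.
rewards = [1.0, 1.0]
deterministic = [[1.0, 0.0], [1.0, 0.0]]  # always picks the first action
uniform = [[0.5, 0.5], [0.5, 0.5]]        # picks either action at random

print(entropy_augmented_return(rewards, deterministic))
print(entropy_augmented_return(rewards, uniform))
```

With equal rewards, the uniform policy scores higher under this objective, which is the sense in which maximum entropy RL prefers the most random policy that still achieves a high reward.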
Performance of SAC We evaluated SAC with two tasks: 1) quadrupedal walking with the Minitaur robot from Ghost Robotics, and 2) rotating a valve with a three finger Dynamixel Claw. Learning to walk presents a substantial challenge, as the robot is underactuated, and must therefore delicately balance contact forces on the legs to make forward progress. An untrained policy can lose balance and fall, and too many falls will eventually damage the robot, making sample-efficient learning essential.
Although we trained our policy only on flat terrain, we subsequently tested it on varied terrains and obstacles. In principle, policies learned with soft actor-critic should be robust to test-time perturbations, because they are trained to maximize entropy (i.e., inject maximal noise) at training time. Indeed, we observe that the policies learned with our method are robust to these perturbations without any additional learning.
Illustration of learned walking, using SAC implemented on the Minitaur robot. A full video of the learning process can be found at our project website.
The manipulation task requires the hand to rotate a valve-like object so that the colored peg faces to the right, as shown below. This task is exceptionally challenging due to both the perception challenges and the need to control a hand with 9 degrees of freedom. In order to perceive the valve, the robot must use raw RGB images shown in the inset at the bottom right. The initial position of the valve is reset uniformly at random for each episode, forcing the policy to learn to use the raw RGB images to perceive the current valve orientation.
Soft actor-critic solves both of these tasks quickly: the Minitaur locomotion takes 2 hours, and the valve-turning task from image observations takes 20 hours. We also learned a policy for the valve-turning task without images by providing the actual valve position as an observation to the policy. Soft actor-critic can learn this easier version of the valve task in 3 hours. For comparison, prior work has used natural policy gradients to learn the same task without images in 7.4 hours.
Conclusion Our work demonstrates that deep reinforcement learning based on the maximum entropy framework can be applied to learn robot skills in challenging real-world settings. Since the policies are learned directly in the real world, they exhibit robustness to variations in the environment, which can be difficult to obtain otherwise. We also showed that we can learn directly from high-dimensional image observations, which represents a significant challenge in classical robotics. We hope that the release of SAC helps other research teams in their effort to adopt deep RL for more complex real-world tasks in the future.
Acknowledgements This research was done in collaboration between Google and UC Berkeley. We would like to thank all the people who were involved, including Sehoon Ha, Kristian Hartikainen, Jie Tan, George Tucker, Vincent Vanhoucke and Aurick Zhou.
Posted by Jeff Dean, Senior Fellow and Google AI Lead, on behalf of the entire Google Research Community
2018 was an exciting year for Google's research teams, with our work advancing technology in many ways, including fundamental computer science research results and publications, the application of our research to emerging areas new to Google (such as healthcare and robotics), open source software contributions and strong collaborations with Google product teams, all aimed at providing useful tools and services. Below, we highlight just some of our efforts from 2018, and we look forward to what will come in the new year. For a more comprehensive look, please see our publications in 2018.
Ethical Principles and AI Over the past few years, we have observed major advances in AI and the positive impact it can have on our products and the everyday lives of our billions of users. For those of us working in this field, we care deeply that AI is a force for good in the world, and that it is applied ethically, and to problems that are beneficial to society. This year we published the Google AI Principles, supported with a set of responsible AI practices outlining technical recommendations for implementation. In combination they provide a framework for us to evaluate our own development of AI, and we hope that other organizations can also use these principles to help shape their own thinking. It's important to note that because this field is evolving quite rapidly, best practices in some of the principles noted, such as "Avoid creating or reinforcing unfair bias" or "Be accountable to people", are also changing and improving as we and others conduct new research in areas like ML fairness and model interpretability. This research in turn leads to advances in our products to make them more inclusive and less biased, such as our work on reducing gender biases in Google Translate, and allows the exploration and release of more inclusive image datasets and models that enable computer vision to work for the diversity of global cultures. Furthermore, this work allows us to share best practices with the broader research community with the Fairness Module in the Machine Learning Crash Course.
AI for Social Good The potential of AI to make dramatic impacts on many areas of social and societal importance is clear. One example of how AI can be applied to real-world problems is our work on flood prediction. In collaboration with many teams across Google, this research aims to provide accurate and timely fine-grained information about the likely extent and scope of flooding, enabling those in flood-prone regions to make better decisions about how best to protect themselves and their property. A second example is our work on earthquake aftershock prediction, where we showed that a machine learning (ML) model can predict aftershock locations much more accurately than traditional physics-based models. Perhaps more importantly, because the ML model was designed to be interpretable, scientists have been able to make new discoveries about the behavior of aftershocks, leading to not only more accurate predictions, but also new levels of understanding.
Assistive Technology Much of our research centered on using ML and computer science to help our users accomplish things faster and more effectively. Often, this results in collaborations with various product teams to release the fruits of this research in various product features and settings. One example is Google Duplex, a system that requires research in natural language and dialogue understanding, speech recognition, text-to-speech, user understanding and effective UI design to all come together to enable an experience whereby a user can say "Can you book me a haircut at 4 PM today?", and a virtual agent will interact on the user's behalf over the telephone to handle the necessary details.
Other examples include Smart Compose, a tool that uses predictive models to give relevant suggestions about how to compose emails, making the process of email composition faster and easier, and Sound Search, a technology built on the Now Playing feature that enables you to discover quickly and accurately what song is playing. Additionally, Smart Linkify in Android shows how we can use an on-device ML model to make many different kinds of text that appear on the screen of your phone more useful by understanding the kind of text you're selecting (e.g. knowing that something is an address, so we can offer a shortcut to a maps or direction link).
Quantum computing Quantum computing is an emerging paradigm for computing that promises the ability to solve challenging problems that no classical computer can solve. We have been actively pursuing research in this area for the past several years, and we believe the field is on the cusp of demonstrating this capability for at least one problem (so-called quantum supremacy), which will be a watershed event for the field. Over the last year we produced a number of exciting new results, including the development of Bristlecone, a new 72-qubit quantum computing device, which scales the size of problems that can be tackled in quantum computers in the run-up towards quantum supremacy.
A Bristlecone chip being installed by Research Scientist Marissa Giustina at the Quantum AI Lab in Santa Barbara.
Natural Language Understanding Natural language research at Google had an exciting 2018, with a mix of basic research as well as product-focused collaborations. We developed improvements to our Transformer work from 2017, resulting in a new parallel-in-time version of the model called the Universal Transformer that shows strong gains across a number of natural language tasks including translation and linguistic reasoning. We also developed BERT, the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus, that can then be fine-tuned on a wide variety of natural language tasks using transfer learning. BERT shows significant improvements over previous state-of-the-art results on 11 natural language tasks.
BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks.
In the audio domain, we proposed a method for unsupervised learning of semantic audio representations as well as significant improvements to expressive and human-like speech synthesis. Multimodal perception is an increasingly important research topic. Looking to Listen combines visual and auditory cues in an input video to isolate and enhance the speech of desired speakers in a video. This technology could support a range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where multiple people are speaking.
Enabling perception on resource-constrained platforms has become increasingly important. MobileNetV2 is Google's next-generation mobile computer vision model and our MobileNets are used widely across academia and industry. MorphNet proposes an efficient method for learning the structure of deep networks that results in across-the-board performance improvements on image and audio models while respecting computational resource constraints, and more recent work on automatic generation of mobile network architectures demonstrates that even higher performance is possible.
Computational Photography The improvements in quality and versatility of cell phone cameras over the last few years have been nothing short of remarkable. A modest part of this is improvements in the actual physical sensors used in phones, but a much greater part of it is due to advances in the scientific field of computational photography. Our research teams publish their new research techniques, and work closely with the Android and Consumer Hardware teams at Google to deliver this research into your hands in the latest Pixel and Android phones and other devices. In 2014, we introduced HDR+, a technique whereby the camera captures a burst of frames, aligns the frames in software, and merges them together with computational software. Originally in the HDR+ work, this was to enable pictures to have higher dynamic range than was possible with a single exposure. However, capturing a burst of frames and then performing computational analysis of these frames is a general approach that has enabled many advances in cameras in 2018. For example, it allowed the development of Motion Photos in Pixel 2 and the Augmented Reality mode in Motion Stills.
Motion photos on the Pixel 2 in Google Photos. For more examples, check out this Google Photos album.
Augmented chicken family with Motion Stills AR mode.
This year, one of our primary efforts in computational photography research was to create a new capability called Night Sight, which enables Pixel phone cameras to "see in the dark", earning praise from both press and users. Of course, Night Sight is just one of the new software-enabled camera features our teams have developed to help you take the perfect photo, including using ML to provide better portrait mode shots, seeing better and further with Super Res Zoom and capturing special moments with Top Shot and Google Clips.
Performance comparison of ADAM and AMSGRAD on a synthetic example of a simple one dimensional convex problem inspired by our examples of non-convergence. The first two plots (left and center) are for the online setting and the last one (right) is for the stochastic setting.
Software Systems A large part of our research on software systems continues to relate to building machine-learning models and to TensorFlow in particular. For example, we published on the design and implementation of dynamic control flow for TensorFlow 1.0. Some of our newer research introduces a system that we call Mesh TensorFlow, which makes it easy to specify large-scale distributed computations with model parallelism, sometimes with billions of parameters. As another example, we released a library for scalable deep neural ranking using TensorFlow.
The TF-Ranking library supports multi-item scoring architecture, an extension of traditional single-item scoring.
We also released JAX, an accelerator-backed variant of NumPy that supports automatic differentiation of Python functions to arbitrary order. While JAX is not part of TensorFlow, it leverages some of the same underlying software infrastructure (e.g. XLA), and some of its ideas and algorithms have been helpful to our TensorFlow projects. Finally, we continued our research on the security and privacy of machine learning, and our development of open source frameworks for safety and privacy in AI systems, such as CleverHans and TensorFlow Privacy.
Another important research direction for us is the application of ML to software systems, at many levels of the stack. For instance, we continued work on placement of computations onto devices, with a hierarchical model, and we contributed to learning memory access patterns. We also continued to explore how learned indices could be used to replace traditional index structures in database systems and storage systems. As I wrote last year, we believe that we are just scratching the surface in terms of the use of machine learning in computer systems.
The Hierarchical Planner's placement of an NMT (4-layer) model. White denotes CPU and the four colors each represent one of the GPUs. Note that every step of every layer is allocated across multiple GPUs. This placement is 53.7% faster than that generated by a human expert.
In 2018 we learned about Spectre and Meltdown, new classes of serious security vulnerabilities in modern computer processors, thanks to Google's Project Zero team in collaboration with others. These and related vulnerabilities will keep computer architecture researchers quite busy. In our continuing efforts to model CPU behavior, our Compiler Research team integrated their tool for measuring machine instruction latency and port pressure into LLVM, making possible better compilation decisions.
Running a large-scale web service, such as content hosting, requires load balancing with stability in a dynamic environment. We developed a consistent hashing scheme with tight provable guarantees on the maximum load of each server, and deployed it for our cloud customers in Google Cloud Pub/Sub. After we made an earlier version of our paper available, engineers at Vimeo found the paper, implemented and open sourced it in haproxy, and used it for their load balancing project at Vimeo. The results were dramatic: applying these algorithmic ideas helped them decrease the cache bandwidth by a factor of almost 8, eliminating a scaling bottleneck.
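The idea of consistent hashing with a cap on per-server load can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the deployed implementation: the class name `BoundedLoadRing`, the slack parameter `epsilon`, the use of MD5 for ring positions, and the omission of key removal and server changes are all choices made for this example.

```python
import bisect
import hashlib
import math

def _hash(key):
    """Map a string to a position on the hash ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class BoundedLoadRing:
    """Consistent hashing where no server exceeds (1 + epsilon) times the mean load."""

    def __init__(self, servers, epsilon=0.25):
        self.epsilon = epsilon
        self.ring = sorted((_hash(s), s) for s in servers)
        self.points = [h for h, _ in self.ring]
        self.load = {s: 0 for s in servers}
        self.assignment = {}

    def _capacity(self, total_keys):
        # Per-server cap: ceil((1 + epsilon) * average load).
        return math.ceil((1 + self.epsilon) * total_keys / len(self.ring))

    def add(self, key):
        """Assign a new, previously unseen key to a server and return that server."""
        cap = self._capacity(len(self.assignment) + 1)
        start = bisect.bisect_left(self.points, _hash(key)) % len(self.ring)
        # Walk clockwise from the key's position to the first server below the cap.
        for step in range(len(self.ring)):
            _, server = self.ring[(start + step) % len(self.ring)]
            if self.load[server] < cap:
                self.load[server] += 1
                self.assignment[key] = server
                return server
        raise RuntimeError("no server has spare capacity")

ring = BoundedLoadRing([f"server-{i}" for i in range(10)], epsilon=0.25)
for i in range(1000):
    ring.add(f"key-{i}")
print(max(ring.load.values()), "<=", math.ceil(1.25 * 1000 / 10))
```

Because the cap is always at least the average load, some server always has room, and a full server simply forwards the key to the next one on the ring. A smaller `epsilon` gives a tighter load bound at the cost of more keys landing away from their first-choice server.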
TPUs Tensor Processing Units (TPUs) are Google's internally-developed ML hardware accelerators, designed from the ground up to power both training and inference at scale. TPUs have enabled Google research breakthroughs such as BERT (discussed previously), and they also allow researchers around the world to build on Google research via open source and to pursue new breakthroughs of their own. For example, anyone can fine-tune BERT on TPUs for free via Colab, and the TensorFlow Research Cloud has given thousands of researchers the opportunity to benefit from even larger amounts of free Cloud TPU computing power. We've also made multiple generations of TPU hardware commercially available as Cloud TPUs, including ML supercomputers called Cloud TPU Pods that make large-scale ML training much more accessible. Internally, in addition to enabling faster advances in ML research, TPUs have driven major improvements across Google's core products, including Search, YouTube, Gmail, Google Assistant, Google Translate, and many others. We look forward to seeing ML teams both here at Google and elsewhere achieve even more with ML via the unprecedented computing scale that TPUs provide.
An individual TPU v3 device (left) and a portion of a TPU v3 Pod (right). TPU v3 is the latest generation of Google's Tensor Processing Unit (TPU) hardware. Available to external customers as Cloud TPU v3, these systems are liquid-cooled for maximum performance (computer chips + liquid = exciting!), and a full TPU v3 Pod can apply more than 100 petaflops of computational power to the world's largest ML problems.
Open Source Software and Datasets Releasing open source software and the creation of new public datasets are two major ways that we contribute to the research and software engineering communities. One of our largest efforts in this space is TensorFlow, a widely popular system for ML computations that we released in November 2015. We celebrated TensorFlow's third birthday in 2018, and during this time, TensorFlow has been downloaded more than 30M times, with over 1700 contributors adding 45,000 commits. In 2018, TensorFlow had eight major releases and added major capabilities such as eager execution and distribution strategies. We launched public design reviews engaging the community in the development process, and we engaged contributors via special interest groups. With the launches of associated products such as TensorFlow Lite, TensorFlow.js and TensorFlow Probability, the TensorFlow ecosystem grew dramatically in 2018.
Real-time evolution of the tSNE embedding for the complete MNIST dataset. The dataset contains images of 60,000 handwritten digits. You can find a live demo here.
Public datasets are often a great source of inspiration that leads to great progress across many fields, since they give the broader community both access to interesting data and problems as well as a healthy competitive drive to achieve better results on a variety of tasks. This year we were happy to release Google Dataset Search, a new tool for finding public datasets from across the web. Over the years we have also curated and released many novel datasets, including everything from millions of general annotated images or videos, to a crowd-sourced Bengali dataset for speech recognition, to robot arm grasping datasets and more. In 2018, we added even more datasets to that list.
Visualization of the fluid annotation interface in action on image from COCO dataset. Image credit: gamene, original image.
From time to time, we also help establish new kinds of challenges for the research community, so that we can all work together on solving difficult research problems. Often these are done with the release of a new dataset, but not always. This year, we established several new challenges: the Inclusive Images Challenge, which works towards making more robust models that are free from many kinds of biases; the iNaturalist 2018 Challenge, which aims to enable computers' fine-grained discrimination of visual categories (such as species of plants in an image); a Kaggle "Quick, Draw!" Doodle Recognition Challenge to create a better classifier for the QuickDraw challenge game; and Conceptual Captions, a larger-scale image captioning dataset and challenge aimed at enabling better image captioning model research.
Applications of AI to Other Fields In 2018, we applied ML to a wide variety of problems in the physical and biological sciences. Using ML, we can supply scientists with the equivalent of hundreds or thousands of research assistants digging through data, which then frees the scientists to become more creative and productive.
A pre-trained TensorFlow model rates focus quality for a montage of microscope image patches of cells in Fiji (ImageJ). Hue and lightness of the borders denote predicted focus quality and prediction uncertainty, respectively.
Health For the past several years, we have been applying ML to health, an area that affects every one of us, and is also one where we believe ML can make a tremendous difference by augmenting the intuitions and experience of healthcare professionals. Our general approach in this space is to collaborate with healthcare organizations to tackle basic research problems (using feedback from clinical experts to make our results more robust), and then publish the results in well-respected, peer-reviewed scientific and clinical journals. Once the research has been clinically and scientifically validated, we then conduct user and HCI research to understand how we can deploy this in real-world clinical settings. In 2018, we expanded our efforts across the broad space of computer-aided diagnostics to clinical task predictions as well.
On the left is a retinal fundus image graded as having moderate DR ("Mo") by an adjudication panel of ophthalmologists (ground truth). On the top right is an illustration of the predicted scores ("N" = no DR, "Mi" = Mild DR, "Mo" = Moderate DR) from the model. On the bottom right is the set of scores given by physicians without assistance ("Unassisted") and those who saw the model's predictions ("Grades Only").
When applying ML to historically-collected data, it's important to understand the populations that have experienced human and structural biases in the past and how those biases have been codified in the data. Machine learning offers an opportunity to detect and address bias and to proactively advance health equity, which we are designing our systems to do.
Research Outreach We interact with the external research community in many different ways, including faculty engagement and student support. We are proud to host hundreds of undergraduate, M.S. and Ph.D. students as interns during the academic year, as well as providing multi-year Ph.D. fellowships to students throughout North America, Europe, and the Middle East. In addition to financial support, each of the fellowship recipients is assigned one or more Google researchers as a mentor, and we bring together all the fellows for an annual Google Ph.D. Fellowship Summit, where they are exposed to state-of-the-art research being pursued at Google and given the opportunity to network with Google's researchers as well as other Ph.D. Fellows from around the world. Complementing this fellowship program is the Google AI Residency, which allows people who want to learn to conduct deep learning research to spend a year working alongside and being mentored by researchers at Google. The program is now in its third year, and residents are embedded in various teams across Google's global offices, pursuing research in areas such as machine learning, perception, algorithms and optimization, language understanding, healthcare and much more. With applications having just closed for the fourth year of this program, we are excited to see the research the new cohort of residents will pursue in 2019.
Each year, we also support a number of faculty members and students on research projects through our Google Faculty Research Awards program. In 2018, we also continued to host workshops at Google locations for faculty and graduate students in particular areas, including a workshop on AI/ML Research and Practice hosted in our Bangalore, India office, an Algorithms & Optimization Workshop hosted in our Zürich office, a workshop on healthcare applications of ML hosted in Sunnyvale and a workshop on Fairness and Bias in ML hosted in our Cambridge, MA office.
New Places, New Faces In 2018, we were excited to welcome many new people with a wide range of backgrounds into our research organization. We announced our first AI research office in Africa, located in Accra, Ghana. We expanded our AI research presence in Paris, Tokyo and Amsterdam, and opened a research lab in Princeton. We continue to hire talented people into our offices all over the world, and you can learn more about joining our research efforts here.
Looking Forward to 2019 This blog post summarizes just a small fraction of the research performed in 2018. As we look back on 2018, we're excited by (and proud of!) the breadth and depth of what we have accomplished. In 2019, we look forward to having even more impact on Google's direction and products, as well as on the broader research and engineering community!
Posted by Li Zhang and Wei (Alex) Hong, Software Engineers
Life is full of meaningful moments — from a child’s first step to an impromptu jump for joy — that one wishes could be preserved with a picture. However, because these moments are often unpredictable, missing that perfect shot is a frustrating problem that smartphone camera users face daily. Using our experience from developing Google Clips, we wondered if we could develop new techniques for the Pixel 3 camera that would allow everyone to capture the perfect shot every time.
Top Shot is a new feature recently launched with Pixel 3 that helps you to capture precious moments precisely and automatically at the press of the shutter button. Top Shot saves and analyzes the image frames before and after the shutter press on the device in real-time using computer vision techniques, and recommends several alternative high-quality HDR+ photos.
Examples of Top Shot on Pixel 3. On the left, a better smiling shot is recommended. On the right, a better jump shot is recommended. The recommended images are high-quality HDR+ shots.
Capturing Multiple Moments When a user opens the Pixel 3 Camera app, Top Shot is enabled by default, helping to capture the perfect moment by analyzing images taken both before and after the shutter press. Each image is analyzed for some qualitative features (e.g., whether the subject is smiling or not) in real-time and entirely on-device to preserve privacy and minimize latency. Each image is also associated with additional signals, such as optical flow of the image, exposure time, and gyro sensor data to form the input features used to score the frame quality.
When you press the shutter button, Top Shot captures up to 90 images from 1.5 seconds before and after the shutter press and selects up to two alternative shots to save in high resolution. You can then review the original shutter frame alongside these high-res alternatives (other lower-res frames can also be reviewed as desired). The shutter frame is processed and saved first, and the best alternative shots are saved afterwards. Google’s Visual Core on Pixel 3 processes these top alternative shots as HDR+ images with a very small amount of extra latency, and they are then embedded into the file of the Motion Photo.
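To make the capture window concrete, here is a minimal sketch of buffering frames around a shutter press. It assumes a rate of 30 frames per second (90 frames over a 3-second window); the `ShutterWindow` class and its method names are hypothetical and are not taken from the Pixel camera code.

```python
from collections import deque

FPS = 30                     # assumed: 90 frames spread over a 3-second window
WINDOW_S = 1.5               # seconds kept on each side of the shutter press
HALF = int(FPS * WINDOW_S)   # 45 frames before the press, 45 after


class ShutterWindow:
    """Hypothetical buffer that keeps frames around a shutter press."""

    def __init__(self, half=HALF):
        self.half = half
        self.before = deque(maxlen=half)  # oldest frames drop off automatically
        self.after = []
        self.pressed = False

    def add_frame(self, frame):
        if not self.pressed:
            self.before.append(frame)
        elif len(self.after) < self.half:
            self.after.append(frame)

    def press_shutter(self):
        self.pressed = True

    def done(self):
        # True once the post-shutter half of the window is full.
        return self.pressed and len(self.after) >= self.half

    def frames(self):
        return list(self.before) + self.after


window = ShutterWindow()
for i in range(100):          # frames seen while the viewfinder is open
    window.add_frame(i)
window.press_shutter()
for i in range(100, 160):     # frames arriving after the press
    window.add_frame(i)
print(len(window.frames()))   # 90
```

A bounded deque keeps memory constant while the viewfinder is open, which matters for a feature that runs continuously in the background.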
Top-level diagram of Top Shot capture.
Given that Top Shot runs in the camera as a background process, it must have very low power consumption. As such, Top Shot uses a hardware-accelerated MobileNet-based single shot detector (SSD). The execution of such optimized models is also throttled by power and thermal limits.
Recognizing Top Moments
When we set out to understand how to enable people to capture the best moments with their camera, we focused on three key attributes: 1) functional qualities like lighting, 2) objective attributes (Are the subject's eyes open? Are they smiling?), and 3) subjective qualities like emotional expressions. We designed a computer vision model to recognize these attributes while operating in a low-latency, on-device mode.
During our development process, we started with a vanilla MobileNet model and set out to optimize it for Top Shot, arriving at a customized architecture that operated within our accuracy, latency and power tradeoff constraints. Our neural network design detects low-level visual attributes in early layers, like whether the subject is blurry, and then dedicates additional compute and parameters toward more complex objective attributes like whether the subject's eyes are open, and subjective attributes like whether there is an emotional expression of amusement or surprise. We trained our model with knowledge distillation over a large number of diverse face images, applying quantization during both training and inference.
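For readers unfamiliar with knowledge distillation, the sketch below shows one common formulation: the student is trained to match the teacher's temperature-softened output distribution through a cross-entropy loss. The post does not say which distillation variant or temperature was used, so the function names and the temperature value here are assumptions for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    The temperature of 2.0 is an illustrative assumption, not a value
    reported for Top Shot.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


teacher = [2.0, 0.0]
agreeing_student = [2.0, 0.0]
disagreeing_student = [0.0, 2.0]
# The loss is lower when the student reproduces the teacher's distribution.
print(distillation_loss(agreeing_student, teacher)
      < distillation_loss(disagreeing_student, teacher))  # True
```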
We then adopted a layered Generalized Additive Model (GAM) to provide quality scores for faces and combine them into a weighted-average “frame faces” score. This model made it easy for us to interpret and identify the exact causes of success or failure, enabling rapid iteration to improve the quality and performance of our attributes model. The number of free parameters was on the order of dozens, so we could optimize these using Google's black box optimizer, Vizier, in tandem with any other parameters that affected selection quality.
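The general shape of such a model can be sketched as follows. In an additive model, each attribute contributes through its own one-dimensional shape function and the contributions are summed, which is what makes the result easy to inspect. The attribute names, knot values, and per-face weights below are all hypothetical; the post does not publish the actual shape functions or weighting scheme.

```python
def piecewise_linear(x, knots):
    """Evaluate a 1-D shape function from sorted (x, y) knots, clamped at the ends."""
    if x <= knots[0][0]:
        return knots[0][1]
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return knots[-1][1]


# Hypothetical shape functions, one per attribute probability in [0, 1].
SHAPES = {
    "eyes_open": [(0.0, -1.0), (0.5, 0.0), (1.0, 0.5)],
    "smiling": [(0.0, 0.0), (1.0, 1.0)],
    "sharpness": [(0.0, -2.0), (0.5, 0.0), (1.0, 0.5)],
}


def face_score(attrs):
    """Additive score: the sum of each attribute's shape-function output."""
    return sum(piecewise_linear(attrs[name], knots) for name, knots in SHAPES.items())


def frame_faces_score(faces):
    """Weighted average of face scores (weights are an assumed input, e.g. face size)."""
    total_weight = sum(face["weight"] for face in faces)
    if total_weight == 0:
        return 0.0
    return sum(face["weight"] * face_score(face["attrs"]) for face in faces) / total_weight


faces = [
    {"weight": 3.0, "attrs": {"eyes_open": 1.0, "smiling": 1.0, "sharpness": 1.0}},
    {"weight": 1.0, "attrs": {"eyes_open": 0.0, "smiling": 0.0, "sharpness": 0.5}},
]
print(frame_faces_score(faces))  # 1.25
```

Because every term is a separate function of one input, a poor frame score can be traced back to the attribute that caused it, and the handful of knot values are the kind of free parameters a black box optimizer can tune.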
Frame Scoring Model
While Top Shot prioritizes face analysis, there are good moments in which faces are not the primary subject. To handle those use cases, we include the following additional scores in the overall frame quality score:
Subject motion saliency score: the low-resolution optical flow between the current frame and the previous frame is estimated in the ISP (image signal processor) to determine if there is salient object motion in the scene.
Global motion blur score: estimated from the camera motion and the exposure time. The camera motion is calculated from sensor data from the gyroscope and OIS (optical image stabilization).
“3A” scores: the status of auto exposure, auto focus, and auto white balance is also considered.
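As an illustration of how camera motion and exposure time relate to blur, the sketch below uses the small-angle approximation: the image shifts by roughly the focal length (in pixels) times the angle the camera rotates during the exposure. The function names, the assumption that the input rates are the residual motion after OIS compensation, and the mapping from blur length to a score are all hypothetical and not from the Top Shot implementation.

```python
import math


def motion_blur_pixels(pitch_rate, yaw_rate, exposure_s, focal_px):
    """Approximate blur streak length in pixels caused by camera rotation.

    Small-angle approximation: the image shifts by about focal_px * angle,
    and the angle swept during the exposure is angular speed * exposure time.
    Rates are in rad/s and are assumed to be the residual motion after OIS.
    """
    angular_speed = math.hypot(pitch_rate, yaw_rate)
    return focal_px * angular_speed * exposure_s


def blur_score(blur_px):
    """Hypothetical mapping to (0, 1]: 1.0 means no blur, lower means blurrier."""
    return 1.0 / (1.0 + blur_px)


# 0.05 rad/s of residual rotation, 20 ms exposure, 3000 px focal length.
blur = motion_blur_pixels(0.03, 0.04, 0.02, 3000.0)
print(round(blur, 3), round(blur_score(blur), 3))  # 3.0 0.25
```

The same rotation produces more blur at longer exposures, which is why both signals are needed to estimate it.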
All the individual scores are used to train a model that predicts an overall quality score matching the frame preferences of human raters, in order to maximize end-to-end product quality.
End-to-End Quality and Fairness
Most of the above components are each evaluated for accuracy independently. However, Top Shot presents requirements that are uniquely challenging since it runs in real time in the Pixel Camera. Additionally, we needed to ensure that all these signals are combined in a system with favorable results. That means we need to gauge our predictions against what our users perceive as the “top shot.”
To test this, we collected data from hundreds of volunteers, along with their opinions of which frames (out of up to 90!) looked best. This donated dataset covers many typical use cases, such as portraits, selfies, actions, and landscapes.
Many of the 3-second clips provided by Top Shot had more than one good shot, so it was important for us to engineer our quality metrics to handle this. As our objective, we used some modified versions of traditional Precision and Recall, some classic ranking metrics (such as Mean Reciprocal Rank), and a few others that were designed specifically for the Top Shot task. Beyond these metrics, we also investigated the causes of image quality issues we saw during development, leading to improvements in avoiding blur, handling multiple faces better, and more. In doing so, we were able to steer the model towards a set of selections people were likely to rate highly.
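To show how ranking metrics can accommodate clips with several acceptable frames, here is a minimal sketch of Mean Reciprocal Rank and precision and recall at k, where each clip has a set of rater-approved frames rather than a single correct answer. These are the textbook definitions; the modified and task-specific variants mentioned above are not described in this post and are not reproduced here. The frame IDs are made up.

```python
def reciprocal_rank(ranked_frames, good_frames):
    """1 / rank of the first rater-approved frame, or 0.0 if none is ranked."""
    for rank, frame in enumerate(ranked_frames, start=1):
        if frame in good_frames:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(clips):
    """Average reciprocal rank over (ranked_frames, good_frames) pairs."""
    return sum(reciprocal_rank(ranked, good) for ranked, good in clips) / len(clips)


def precision_recall_at_k(ranked_frames, good_frames, k):
    """Precision and recall of the top-k recommendations for one clip."""
    hits = sum(1 for frame in ranked_frames[:k] if frame in good_frames)
    precision = hits / k
    recall = hits / len(good_frames) if good_frames else 0.0
    return precision, recall


clips = [
    ([3, 7, 1], {7, 9}),  # first good frame at rank 2
    ([5, 2], {5}),        # first good frame at rank 1
    ([4, 6], {8}),        # no good frame recommended
]
print(mean_reciprocal_rank(clips))                  # 0.5
print(precision_recall_at_k([3, 7, 1], {7, 9}, 2))  # (0.5, 0.5)
```

Treating the approved frames as a set means the model is not penalized for picking a different good shot than the one a particular rater preferred.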
Importantly, we tested the Top Shot system for fairness to make sure that our product can offer a consistent experience to a very wide range of users. We evaluated the accuracy of each signal used in Top Shot on several different subgroups of people (based on gender, age, ethnicity, etc.) and compared the results across those subgroups.
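A per-subgroup evaluation of this kind can be sketched as follows: compute the accuracy of a signal separately for each subgroup and look at the largest gap between any two. The record layout, group labels, and the choice of the maximum gap as a summary are assumptions for illustration; the post does not describe the exact procedure or thresholds used.

```python
from collections import defaultdict


def accuracy_by_subgroup(records):
    """Accuracy per subgroup from (subgroup, predicted, actual) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, predicted, actual in records:
        total[subgroup] += 1
        correct[subgroup] += int(predicted == actual)
    return {subgroup: correct[subgroup] / total[subgroup] for subgroup in total}


def max_accuracy_gap(accuracies):
    """Largest accuracy difference between any two subgroups."""
    return max(accuracies.values()) - min(accuracies.values())


# Made-up records for a hypothetical "smiling" signal.
records = [
    ("group_a", True, True), ("group_a", False, False),
    ("group_a", True, True), ("group_a", True, False),
    ("group_b", True, True), ("group_b", False, True),
    ("group_b", False, False), ("group_b", True, False),
]
accuracies = accuracy_by_subgroup(records)
print(accuracies)                    # {'group_a': 0.75, 'group_b': 0.5}
print(max_accuracy_gap(accuracies))  # 0.25
```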
Conclusion
Top Shot is just one example of how Google leverages optimized hardware and cutting-edge machine learning to provide useful tools and services. We hope you’ll find this feature useful, and we’re committed to further improving the capabilities of mobile phone photography!
Acknowledgements
This post reflects the work of a large group of Google engineers, research scientists, and others including: Ari Gilder, Aseem Agarwala, Brendan Jou, David Karam, Eric Penner, Farooq Ahmad, Henri Astre, Hillary Strickland, Marius Renn, Matt Bridges, Maxwell Collins, Navid Shiee, Ryan Gordon, Sarah Clinckemaillie, Shu Zhang, Vivek Kesarwani, Xuhui Jia, Yukun Zhu, Yuzo Watanabe and Chris Breithaupt.