Author Archives: Google AI Blog

Improved Grading of Prostate Cancer Using Deep Learning

Approximately 1 in 9 men in the United States will develop prostate cancer in their lifetime, making it the most common cancer in males. Despite being common, prostate cancers are frequently non-aggressive, making it challenging to determine if the cancer poses a significant enough risk to the patient to warrant treatment such as surgical removal of the prostate (prostatectomy) or radiation therapy. A key factor that helps in the “risk stratification” of prostate cancer patients is the Gleason grade, which classifies the cancer cells based on how closely they resemble normal prostate glands when viewed on a slide under a microscope.

However, despite its widely recognized clinical importance, Gleason grading of prostate cancer is complex and subjective, as evidenced by studies reporting inter-pathologist disagreement rates of 30–53% [1][2]. Furthermore, there are not enough specialty-trained pathologists to meet the global demand for prostate cancer pathology, especially outside the United States. Recent guidelines also recommend that pathologists report the percentage of tumor of different Gleason patterns in their final report, which adds to the workload and is yet another subjective challenge for the pathologist [3]. Overall, these issues suggest an opportunity to improve the diagnosis and clinical management of prostate cancer using deep learning–based models, similar to how Google and others used such techniques to demonstrate the potential to improve metastatic breast cancer detection.

In “Development and Validation of a Deep Learning Algorithm for Improving Gleason Scoring of Prostate Cancer”, we explore whether deep learning could improve the accuracy and objectivity of Gleason grading of prostate cancer in prostatectomy specimens. We developed a deep learning system (DLS) that mirrors a pathologist’s workflow by first categorizing each region in a slide into a Gleason pattern, with lower patterns corresponding to tumors that more closely resemble normal prostate glands. The DLS then summarizes an overall Gleason grade group based on the two most common Gleason patterns present. The higher the grade group, the greater the risk of further cancer progression and the more likely the patient is to benefit from treatment.
Visual examples of Gleason patterns, which are used in the Gleason system for grading prostate cancer. Individual cancer patches are assigned a Gleason pattern based on how closely the cancer resembles normal prostate tissue, with lower numbers corresponding to more well differentiated tumors. Image Source: National Institutes of Health.
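As a rough illustration of that last summarization step, the sketch below maps the two most common Gleason patterns in a slide to a grade group using the standard five-tier grade group definitions. The function names, the label 0 for non-tumor regions, and the simple counting of region labels are assumptions made for illustration. The DLS described in the paper is not claimed to work this way, and clinical grading applies additional rules (for example for minor high-grade components) that are omitted here.

```python
from collections import Counter

def grade_group(primary, secondary):
    """Map a (primary, secondary) Gleason pattern pair to grade group 1-5."""
    score = primary + secondary
    if score <= 6:
        return 1
    if score == 7:
        # 3+4 is grade group 2, 4+3 is grade group 3.
        return 2 if primary == 3 else 3
    if score == 8:
        return 4
    return 5  # scores 9 and 10

def summarize_slide(region_patterns):
    """Pick the two most common tumor patterns and return the grade group.

    region_patterns: iterable of per-region labels; only 3, 4 and 5 count as
    tumor (any other label, such as the hypothetical 0, is ignored).
    Returns None when no tumor region is present.
    """
    counts = Counter(p for p in region_patterns if p in (3, 4, 5))
    if not counts:
        return None
    ranked = [pattern for pattern, _ in counts.most_common(2)]
    primary = ranked[0]
    secondary = ranked[1] if len(ranked) > 1 else primary
    return grade_group(primary, secondary)

if __name__ == "__main__":
    # Mostly pattern 3 with some pattern 4, plus two non-tumor regions.
    print(summarize_slide([3, 3, 3, 4, 0, 0]))
```

Note that the order of the two patterns matters: a slide dominated by pattern 4 with some pattern 3 lands in a higher grade group than the reverse, even though both sum to 7.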
To develop and validate the DLS, we collected de-identified images of prostatectomy samples, which contain a greater amount and diversity of prostate cancer than needle core biopsies, even though the latter is the more common clinical procedure. For the training data, a cohort of 32 pathologists provided detailed annotations of Gleason patterns (resulting in over 112 million annotated image patches) and an overall Gleason grade group for each image. To overcome the previously referenced variability in Gleason grading, each slide in the validation set was independently graded by 3 to 5 general pathologists (selected from a cohort of 29 pathologists) and had a final Gleason grade assigned by a genitourinary-specialist pathologist to obtain the ground-truth label for that slide.

In the paper, we show that our DLS achieved an overall accuracy of 70%, compared to an average accuracy of 61% achieved by US board-certified general pathologists in our study. Of 10 high-performing individual general pathologists who graded every slide in the validation set, the DLS was more accurate than 8. The DLS was also more accurate than the average pathologist at Gleason pattern quantitation. These improvements in Gleason grading translated into better clinical risk stratification: the DLS better identified patients at higher risk for disease recurrence after surgery than the average general pathologist, potentially enabling doctors to use this information to better match patients to therapy.
Comparison of scoring performance of the DLS with pathologists. a: Accuracy of the DLS (in red) compared with the mean accuracy among a cohort-of-29 pathologists (in green). Error bars indicate 95% confidence intervals. b: Comparison of risk stratification provided by the DLS, the cohort-of-29 pathologists, and the genitourinary specialist pathologists. Patients are divided into low and high risk groups based on their Gleason grade group, where a larger separation between the Kaplan-Meier curves of these risk groups indicates better stratification.
We also found that the DLS was able to characterize tissue morphology that appeared to lie at the cusp of two Gleason patterns, which is one reason for the disagreements in Gleason grading observed between pathologists, suggesting the possibility of creating finer grained “precision grading” of prostate cancer. While the clinical significance of these intermediate patterns (e.g. Gleason pattern 3.3 or 3.7) is not known, the increased precision of the DLS will enable further research into this interesting question.
Assessing the region-level classification of the DLS. a: Annotations from 3 pathologists compared to DLS predictions. The pathologists show general concordance on the location and the extent of tumor areas, but poor agreement in classifying Gleason patterns. The DLS’s precision Gleason pattern for each region is represented by interpolating between the DLS’s prediction patterns for Gleason patterns 3 (green), 4 (yellow), and 5 (red). b: DLS prediction patterns compared to the distribution of pathologists’ Gleason pattern classifications on 41 million annotated image patches from the test dataset. On patches where pathologists are discordant, where the tissue is more likely to be on the cusp of two patterns, the DLS reflects this ambiguity in its prediction scores.
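One simple way to picture an intermediate pattern such as 3.3 is as a probability-weighted average of the discrete patterns. The sketch below assumes the model outputs a score for each of patterns 3, 4 and 5 in a tumor region. The function name and this particular interpolation are illustrative assumptions, not necessarily the computation used in the paper.

```python
def precision_pattern(scores):
    """Score-weighted mean of Gleason patterns.

    scores: dict mapping pattern (3, 4, 5) to a non-negative score.
    Scores are normalized so they sum to 1 before averaging.
    """
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one pattern needs a positive score")
    return sum(pattern * score for pattern, score in scores.items()) / total

if __name__ == "__main__":
    # A region the model considers mostly pattern 3 with some pattern 4.
    print(round(precision_pattern({3: 0.7, 4: 0.3, 5: 0.0}), 2))
```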
While these initial results are encouraging, there is much more work to be done before systems like our DLS can be used to improve the care of prostate cancer patients. First, the accuracy of the model can be further improved with additional training data and should be validated on independent cohorts containing a larger number and more diverse group of patients. In addition, we are actively working on refining our DLS to work on diagnostic needle core biopsies, which occur prior to the decision to undergo surgery and where Gleason grading therefore has a significantly greater impact on clinical decision-making. Further work will be needed to assess how to best integrate our DLS into the pathologist’s diagnostic workflow and the impact of such artificial intelligence-based assistance on the overall efficiency, accuracy, and prognostic ability of Gleason grading in clinical practice. Nonetheless, we are excited about the potential of technologies like this to significantly improve cancer diagnostics and patient care.

This work involved the efforts of a multidisciplinary team of software engineers, researchers, clinicians and logistics support staff. Key contributors to this project include Kunal Nagpal, Davis Foote, Yun Liu, Po-Hsuan (Cameron) Chen, Ellery Wulczyn, Fraser Tan, Niels Olson, Jenny L. Smith, Arash Mohtashamian, James H. Wren, Greg S. Corrado, Robert MacDonald, Lily H. Peng, Mahul B. Amin, Andrew J. Evans, Ankur R. Sangoi, Craig H. Mermel, Jason D. Hipp and Martin C. Stumpe. We would also like to thank Tim Hesterberg, Michael Howell, David Miller, Alvin Rajkomar, Benny Ayalew, Robert Nagle, Melissa Moran, Krishna Gadepalli, Aleksey Boyko, and Christopher Gammage. Lastly, this work would not have been possible without the aid of the pathologists who annotated data for this study.

  1. Interobserver Variability in Histologic Evaluation of Radical Prostatectomy Between Central and Local Pathologists: Findings of TAX 3501 Multinational Clinical Trial, Netto, G. J., Eisenberger, M., Epstein, J. I. & TAX 3501 Trial Investigators, Urology 77, 1155–1160 (2011).
  2. Phase 3 Study of Adjuvant Radiotherapy Versus Wait and See in pT3 Prostate Cancer: Impact of Pathology Review on Analysis, Bottke, D., Golz, R., Störkel, S., Hinke, A., Siegmann, A., Hertle, L., Miller, K., Hinkelbein, W., Wiegel, T., Eur. Urol. 64, 193–198 (2013).
  3. Clinical Utility of Quantitative Gleason Grading in Prostate Biopsies and Prostatectomy Specimens, Sauter, G., Steurer, S., Clauditz, T. S., Krech, T., Wittmer, C., Lutz, F., Lennartz, M., Janssen, T., Hakimi, N., Simon, R., von Petersdorff-Campen, M., Jacobsen, F., von Loga, K., Wilczak, W., Minner, S., Tsourlakis, M. C., Chirico, V., Haese, A., Heinzer, H., Beyer, B., Graefen, M., Michl, U., Salomon, G., Steuber, T., Budäus, L. H., Hekeler, E., Malsy-Mink, J., Kutzera, S., Fraune, C., Göbel, C., Huland, H., Schlomm, T., Eur. Urol. 69, 592–598 (2016).

Source: Google AI Blog

Night Sight: Seeing in the Dark on Pixel Phones

Night Sight is a new feature of the Pixel Camera app that lets you take sharp, clean photographs in very low light, even in light so dim you can't see much with your own eyes. It works on the main and selfie cameras of all three generations of Pixel phones, and does not require a tripod or flash. In this article we'll talk about why taking pictures in low light is challenging, and we'll discuss the computational photography and machine learning techniques, much of it built on top of HDR+, that make Night Sight work.
Left: iPhone XS (full resolution image here). Right: Pixel 3 Night Sight (full resolution image here).
Why is Low-light Photography Hard?
Anybody who has photographed a dimly lit scene will be familiar with image noise, which looks like random variations in brightness from pixel to pixel. For smartphone cameras, which have small lenses and sensors, a major source of noise is the natural variation of the number of photons entering the lens, called shot noise. Every camera suffers from it, and it would be present even if the sensor electronics were perfect. However, they are not, so a second source of noise is the random errors introduced when converting the electronic charge resulting from light hitting each pixel to a number, called read noise. These and other sources of randomness contribute to the overall signal-to-noise ratio (SNR), a measure of how much the image stands out from these variations in brightness. Fortunately, SNR rises with the square root of exposure time (or faster), so taking a longer exposure produces a cleaner picture. But it’s hard to hold still long enough to take a good picture in dim light, and whatever you're photographing probably won't hold still either.
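To make the square-root relationship concrete, here is a minimal sketch of a textbook noise model in which shot noise is Poisson (variance equal to the mean photon count) and read noise adds a fixed variance per exposure. The photon rate and read-noise values are made-up numbers for illustration, not measurements from any Pixel sensor.

```python
import math

def snr(photon_rate, exposure_s, read_noise=0.0):
    """SNR of one pixel for a single exposure.

    photon_rate: mean photons per second reaching the pixel (assumed value).
    read_noise:  standard deviation of read noise, in photon-equivalent units.
    """
    signal = photon_rate * exposure_s
    if signal == 0:
        return 0.0
    noise = math.sqrt(signal + read_noise ** 2)  # shot variance equals the signal
    return signal / noise

if __name__ == "__main__":
    for t in (1 / 15, 1 / 3, 1.0):
        print(f"{t:.3f} s: shot-only SNR {snr(300, t):.1f}, "
              f"with read noise {snr(300, t, read_noise=5):.1f}")
```

With shot noise alone, quadrupling the exposure time doubles the SNR. When read noise dominates, as it does at very low light, the SNR in this model grows faster than the square root, which is the "or faster" case.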

In 2014 we introduced HDR+, a computational photography technology that improves this situation by capturing a burst of frames, aligning the frames in software, and merging them together. The main purpose of HDR+ is to improve dynamic range, meaning the ability to photograph scenes that exhibit a wide range of brightnesses (like sunsets or backlit portraits). All generations of Pixel phones use HDR+. As it turns out, merging multiple pictures also reduces the impact of shot noise and read noise, so it improves SNR in dim lighting. To keep these photographs sharp even if your hand shakes and the subject moves, we use short exposures. We also reject pieces of frames for which we can't find a good alignment. This allows HDR+ to produce sharp images even while collecting more light.
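The same kind of textbook model extends to a burst: averaging N aligned frames adds up the signal N times while each frame contributes its own shot and read noise. This sketch is an idealization under those assumptions (perfect alignment, no rejected pieces, made-up numbers), not a description of the HDR+ merge itself.

```python
import math

def burst_snr(n_frames, photons_per_frame, read_noise=0.0):
    """SNR after averaging n_frames perfectly aligned frames of one pixel."""
    signal = n_frames * photons_per_frame
    # Each frame adds its own shot-noise and read-noise variance.
    variance = n_frames * (photons_per_frame + read_noise ** 2)
    return signal / math.sqrt(variance) if variance > 0 else 0.0

if __name__ == "__main__":
    single = burst_snr(1, 20, read_noise=5)
    merged = burst_snr(9, 20, read_noise=5)
    one_long = burst_snr(1, 9 * 20, read_noise=5)
    print(f"1 frame: {single:.2f}, 9 frames merged: {merged:.2f}, "
          f"one 9x longer exposure: {one_long:.2f}")
```

In this model, merging N frames improves SNR by a factor of the square root of N. A single exposure of the same total duration would do better still, because it pays the read-noise penalty only once, but it would also be more prone to motion blur.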

How Dark is Dark?
But if capturing and merging multiple frames produces cleaner pictures in low light, why not use HDR+ to merge dozens of frames so we can effectively see in the dark? Well, let's begin by defining what we mean by "dark". When photographers talk about the light level of a scene, they often measure it in lux. Technically, lux is the amount of light arriving at a surface per unit area, measured in lumens per meter squared. To give you a feeling for different lux levels, here's a handy table:
Smartphone cameras that take a single picture begin to struggle at 30 lux. Phones that capture and merge several pictures (as HDR+ does) can do well down to 3 lux, but in dimmer scenes they don’t perform well (more on that below) and have to rely on their flash. With Night Sight, our goal was to improve picture-taking in the regime between 3 lux and 0.3 lux, using a smartphone, a single shutter press, and no LED flash. Making this feature work well involves several key elements, the most important of which is capturing more photons.

Capturing the Data
While lengthening the exposure time of each frame increases SNR and leads to cleaner pictures, it unfortunately introduces two problems. First, the default picture-taking mode on Pixel phones uses a zero-shutter-lag (ZSL) protocol, which intrinsically limits exposure time. As soon as you open the camera app, it begins capturing image frames and storing them in a circular buffer that constantly erases old frames to make room for new ones. When you press the shutter button, the camera sends the most recent 9 or 15 frames to our HDR+ or Super Res Zoom software. This means you capture exactly the moment you want — hence the name zero-shutter-lag. However, since we're displaying these same images on the screen to help you aim the camera, HDR+ limits exposures to at most 66ms no matter how dim the scene is, allowing our viewfinder to keep up a display rate of at least 15 frames per second. For dimmer scenes where longer exposures are necessary, Night Sight uses positive-shutter-lag (PSL), which waits until after you press the shutter button before it starts capturing images. Using PSL means you need to hold still for a short time after pressing the shutter, but it allows the use of longer exposures, thereby improving SNR at much lower brightness levels.
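The circular buffer behind zero-shutter-lag can be sketched with a fixed-length deque. The class and method names here are hypothetical and the frames are stand-in integers. The point is only that old frames are dropped automatically and that a shutter press returns frames captured before the press.

```python
from collections import deque

class ZslBuffer:
    """Toy zero-shutter-lag buffer that keeps only the newest frames."""

    def __init__(self, capacity=15):
        # A deque with maxlen discards the oldest item when a new one is appended.
        self._frames = deque(maxlen=capacity)

    def on_new_frame(self, frame):
        self._frames.append(frame)

    def on_shutter(self, n_frames=9):
        """Return the most recent n_frames frames (n_frames >= 1), oldest first."""
        return list(self._frames)[-n_frames:]

if __name__ == "__main__":
    buf = ZslBuffer(capacity=15)
    for frame_id in range(100):      # the viewfinder has been streaming for a while
        buf.on_new_frame(frame_id)
    print(buf.on_shutter(9))         # frames 91 through 99
    print(f"66 ms exposures allow at most {1 / 0.066:.1f} frames per second")
```

The last line shows where the 66ms limit comes from: a longer exposure would push the viewfinder below 15 frames per second.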

The second problem with increasing per-frame exposure time is motion blur, which might be due to handshake or to moving objects in the scene. Optical image stabilization (OIS), which is present on Pixel 2 and 3, reduces handshake for moderate exposure times (up to about 1/8 second), but doesn’t help with longer exposures or with moving objects. To combat motion blur that OIS can’t fix, the Pixel 3’s default picture-taking mode uses “motion metering”, which consists of using optical flow to measure recent scene motion and choosing an exposure time that minimizes this blur. Pixel 1 and 2 don’t use motion metering in their default mode, but all three phones use the technique in Night Sight mode, increasing per-frame exposure time up to 333ms if there isn't much motion. For Pixel 1, which has no OIS, we increase exposure time less (for the selfie cameras, which also don't have OIS, we increase it even less). If the camera is being stabilized (held against a wall, or using a tripod, for example), the exposure of each frame is increased to as much as one second. In addition to varying per-frame exposure, we also vary the number of frames we capture, 6 if the phone is on a tripod and up to 15 if it is handheld. These frame limits prevent user fatigue (and the need for a cancel button). Thus, depending on which Pixel phone you have, camera selection, handshake, scene motion and scene brightness, Night Sight captures 15 frames of 1/15 second (or less) each, or 6 frames of 1 second each, or anything in between.1
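The numbers in this paragraph can be collected into a small sketch. The function below is a hypothetical simplification: it takes an exposure time that some motion-metering step has already judged safe (not modeled here) and applies only the caps and frame counts quoted above. The real selection logic also depends on the phone model, the camera in use and the scene brightness, and a handheld burst may contain fewer than 15 frames.

```python
def capture_plan(motion_limited_exposure_ms, stabilized):
    """Return (per_frame_exposure_ms, n_frames, total_seconds).

    motion_limited_exposure_ms: longest exposure the (unmodeled) motion
        metering would tolerate for this scene.
    stabilized: True if the phone is on a tripod or braced.
    """
    max_exposure_ms = 1000 if stabilized else 333  # caps quoted in the text
    n_frames = 6 if stabilized else 15             # frame counts quoted in the text
    exposure_ms = min(motion_limited_exposure_ms, max_exposure_ms)
    return exposure_ms, n_frames, exposure_ms * n_frames / 1000

if __name__ == "__main__":
    print(capture_plan(500, stabilized=False))   # handheld, little motion
    print(capture_plan(48, stabilized=False))    # handheld, moving subject
    print(capture_plan(2000, stabilized=True))   # tripod
```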

Here’s a concrete example of using shorter per-frame exposures when we detect motion:
Left: 15-frame burst captured by one of two side-by-side Pixel 3 phones. Center: Night Sight shot with motion metering disabled, causing this phone to use 73ms exposures. The dog’s head is motion blurred in this crop. Right: Night Sight shot with motion metering enabled, causing this phone to notice the motion and use shorter 48ms exposures. This shot has less motion blur. (Mike Milne)
And here’s an example of using longer exposure times when we detect that the phone is on a tripod:
Left: Crop from a handheld Night Sight shot of the sky (full resolution image here). There was slight handshake, so Night Sight chose 333ms x 15 frames = 5.0 seconds of capture. Right: Tripod shot (full resolution image here). No handshake was detected, so Night Sight used 1.0 second x 6 frames = 6.0 seconds. The sky is cleaner (less noise), and you can see more stars. (Florian Kainz)
Alignment and Merging
The idea of averaging frames to reduce imaging noise is as old as digital imaging. In astrophotography it's called exposure stacking. While the technique itself is straightforward, the hard part is getting the alignment right when the camera is handheld. Our efforts in this area began with an app from 2010 called Synthcam. This app captured pictures continuously, aligned and merged them in real time at low resolution, and displayed the merged result, which steadily became cleaner as you watched.

Night Sight uses a similar principle, although at full sensor resolution and not in real time. On Pixel 1 and 2 we use HDR+'s merging algorithm, modified and re-tuned to strengthen its ability to detect and reject misaligned pieces of frames, even in very noisy scenes. On Pixel 3 we use Super Res Zoom, similarly re-tuned, whether you zoom or not. While the latter was developed for super-resolution, it also works to reduce noise, since it averages multiple images together. Super Res Zoom produces better results for some nighttime scenes than HDR+, but it requires the faster processor of the Pixel 3.
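As a toy version of "average the frames but reject the misaligned pieces", the sketch below merges already-aligned grayscale frames pixel by pixel and ignores samples that differ too much from a reference frame. The hard-threshold rule and all names are assumptions for illustration. The HDR+ and Super Res Zoom merges are considerably more sophisticated than this.

```python
def robust_merge(reference, alternates, threshold=0.1):
    """Average frames per pixel, skipping samples that disagree with the reference.

    reference:  list of pixel values in [0, 1] for the reference frame.
    alternates: list of frames (same length as reference), assumed pre-aligned.
    threshold:  maximum absolute difference still treated as the same content.
    """
    merged = []
    for i, ref_value in enumerate(reference):
        samples = [ref_value]
        for frame in alternates:
            if abs(frame[i] - ref_value) <= threshold:
                samples.append(frame[i])   # consistent: helps average out noise
            # otherwise: likely motion or misalignment, so leave it out
        merged.append(sum(samples) / len(samples))
    return merged

if __name__ == "__main__":
    ref = [0.20, 0.50, 0.80]
    alts = [[0.22, 0.48, 0.10],   # last pixel changed (e.g. a passing object)
            [0.18, 0.52, 0.82]]
    print([round(v, 3) for v in robust_merge(ref, alts)])
```

The trade-off is visible in the last pixel: rejecting the inconsistent sample avoids a ghost, but that pixel is then averaged over fewer frames and stays noisier.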

By the way, all of this happens on the phone in a few seconds. If you're quick about tapping on the icon that brings you to the filmstrip (wait until the capture is complete!), you can watch your picture "develop" as HDR+ or Super Res Zoom completes its work.

Other Challenges
Although the basic ideas described above sound simple, there are some gotchas when there isn't much light that proved challenging when developing Night Sight:

1. Auto white balancing (AWB) fails in low light.

Humans are good at color constancy — perceiving the colors of things correctly even under colored illumination (or when wearing sunglasses). But that process breaks down when we take a photograph under one kind of lighting and view it under different lighting; the photograph will look tinted to us. To correct for this perceptual effect, cameras adjust the colors of images to partially or completely compensate for the dominant color of the illumination (sometimes called color temperature), effectively shifting the colors in the image to make it seem as if the scene was illuminated by neutral (white) light. This process is called auto white balancing (AWB).

The problem is that white balancing is what mathematicians call an ill-posed problem. Is that snow really blue, as the camera recorded it? Or is it white snow illuminated by a blue sky? Probably the latter. This ambiguity makes white balancing hard. The AWB algorithm used in non-Night Sight modes is good, but in very dim or strongly colored lighting (think sodium vapor lamps), it’s hard to decide what color the illumination is.
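For readers who have not seen a white balancer before, here is the classic gray-world heuristic, which assumes the average color of a scene should be neutral gray. It is shown only as a baseline to illustrate what shifting the colors means. It is not the AWB algorithm used on Pixel phones, and it fails in exactly the ambiguous cases described above (a scene that really is mostly one color).

```python
def gray_world_gains(pixels):
    """Per-channel gains that make the mean color of the image neutral.

    pixels: non-empty list of (r, g, b) tuples with positive channel means.
    """
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3
    return [gray / m for m in means]

def apply_gains(pixels, gains):
    """Multiply each channel of each pixel by its gain."""
    return [tuple(v * g for v, g in zip(p, gains)) for p in pixels]

if __name__ == "__main__":
    # A small image with a yellowish cast: red and green high, blue low.
    image = [(0.8, 0.7, 0.3), (0.6, 0.5, 0.2), (0.4, 0.3, 0.1)]
    gains = gray_world_gains(image)
    balanced = apply_gains(image, gains)
    print("gains:", [round(g, 2) for g in gains])
    print("balanced:", [tuple(round(v, 2) for v in p) for p in balanced])
```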

To solve these problems, we developed a learning-based AWB algorithm, trained to discriminate between a well-white-balanced image and a poorly balanced one. When a captured image is poorly balanced, the algorithm can suggest how to shift its colors to make the illumination appear more neutral. Training this algorithm required photographing a diversity of scenes using Pixel phones, then hand-correcting their white balance while looking at the photo on a color-calibrated monitor. You can see how this algorithm works by comparing the same low-light scene captured in two ways using a Pixel 3:
Left: The white balancer in the Pixel’s default camera mode doesn't know how yellow the illumination was on this shack on the Vancouver waterfront (full resolution image here). Right: Our learning-based AWB algorithm does a better job (full resolution image here). (Marc Levoy)
2. Tone mapping of scenes that are too dark to see.

The goal of Night Sight is to make photographs of scenes so dark that you can't see them clearly with your own eyes — almost like a super-power! A related problem is that in very dim lighting humans stop seeing in color, because the cone cells in our retinas stop functioning, leaving only the rod cells, which can't distinguish different wavelengths of light. Scenes are still colorful at night; we just can't see their colors. We want Night Sight pictures to be colorful - that's part of the super-power, but another potential conflict. Finally, our rod cells have low spatial acuity, which is why things seem indistinct at night. We want Night Sight pictures to be sharp, with more detail than you can really see at night.

For example, if you put a DSLR camera on a tripod and take a very long exposure — several minutes, or stack several shorter exposures together — you can make nighttime look like daytime. Shadows will have details, and the scene will be colorful and sharp. Look at the photograph below, which was captured with a DSLR; it must be night, because you can see the stars, but the grass is green, the sky is blue, and the moon casts shadows from the trees that look like shadows cast by the sun. This is a nice effect, but it's not always what you want, and if you share the photograph with a friend, they'll be confused about when you captured it.
Yosemite valley at nighttime, Canon DSLR, 28mm f/4 lens, 3-minute exposure, ISO 100. It's nighttime, since you can see stars, but it looks like daytime (full resolution image here). (Jesse Levinson)
Artists have known for centuries how to make a painting look like night; look at the example below.2
A Philosopher Lecturing on the Orrery, by Joseph Wright of Derby, 1766 (image source: Wikidata). The artist uses pigments from black to white, but the scene depicted is evidently dark. How does he accomplish this? He increases contrast, surrounds the scene with darkness, and drops shadows to black, because we cannot see detail there.
We employ some of the same tricks in Night Sight, partly by throwing an S-curve into our tone mapping. But it's tricky to strike an effective balance between giving you “magical super-powers” while still reminding you when the photo was captured. The photograph below is particularly successful at doing this.
Pixel 3, 6-second Night Sight shot, with tripod (full resolution image here). (Alex Savu)
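An S-curve of the kind mentioned above can be sketched with a simple contrast function on normalized luminance. The particular formula and the `strength` parameter are illustrative assumptions, not the tone curve used in Night Sight.

```python
def s_curve(x, strength=2.0):
    """Map luminance x in [0, 1] through an S-shaped contrast curve.

    strength > 1 darkens shadows and brightens highlights while keeping
    0, 0.5 and 1 fixed; strength == 1 leaves the input unchanged.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be in [0, 1]")
    a = x ** strength
    b = (1.0 - x) ** strength
    return a / (a + b)

if __name__ == "__main__":
    for x in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
        print(f"{x:.2f} -> {s_curve(x):.3f}")
```

Values below the midpoint are pushed toward black and values above it toward white, which is the "increase contrast and drop shadows to black" effect described for the painting.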
How Dark can Night Sight Go?
Below 0.3 lux, autofocus begins to fail. If you can't find your keys on the floor, your smartphone can't focus either. To address this limitation we've added two manual focus buttons to Night Sight on Pixel 3 - the "Near" button focuses at about 4 feet, and the "Far" button focuses at about 12 feet. The latter is the hyperfocal distance of our lens, meaning that everything from half of that distance (6 feet) to infinity should be in focus. We’re also working to improve Night Sight’s ability to autofocus in low light. Below 0.3 lux you can still take amazing pictures with a smartphone, and even do astrophotography as this blog post demonstrates, but for that you'll need a tripod, manual focus, and a 3rd party or custom app written using Android's Camera2 API.
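The focus distances quoted here can be checked with the usual thin-lens depth-of-field approximations, which hold when the subject distance is much larger than the focal length. The sketch takes the 12-foot hyperfocal distance from the text as given. The formulas are the standard approximations, not anything specific to the Pixel camera.

```python
import math

def depth_of_field(focus_distance, hyperfocal):
    """Return (near_limit, far_limit) of acceptable sharpness.

    Uses near = H*s / (H + s) and far = H*s / (H - s), with far = infinity
    once the focus distance s reaches the hyperfocal distance H.
    Any length unit works as long as both arguments share it.
    """
    s, h = focus_distance, hyperfocal
    near = h * s / (h + s)
    far = math.inf if s >= h else h * s / (h - s)
    return near, far

if __name__ == "__main__":
    print("Far button: ", depth_of_field(12, 12))   # (6.0, inf)
    print("Near button:", depth_of_field(4, 12))    # (3.0, 6.0)
```

Under these approximations the Far button covers 6 feet to infinity and the Near button covers roughly 3 to 6 feet, so between them the two buttons span everything from about 3 feet outward.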

How far can we take this? Eventually one reaches a light level where read noise swamps the number of photons gathered by that pixel. There are other sources of noise, including dark current, which increases with exposure time and varies with temperature. To avoid this, biologists know to cool their cameras well below zero (Fahrenheit) when imaging weakly fluorescent specimens — something we don’t recommend doing to your Pixel phone! Super-noisy images are also hard to align reliably. Even if you could solve all these problems, the wind blows, the trees sway, and the stars and clouds move. Ultra-long exposure photography is hard.

How to Get the Most out of Night Sight
Night Sight not only takes great pictures in low light; it's also fun to use, because it takes pictures where you can barely see anything. We pop up a “chip” on the screen when the scene is dark enough that you’ll get a better picture using Night Sight, but don't limit yourself to these cases. Just after sunset, or at concerts, or in the city, Night Sight takes clean (low-noise) shots, and makes them brighter than reality. This is a "look", which seems magical if done right. Here are some examples of Night Sight pictures, and some A/B comparisons, mostly taken by our coworkers. And here are some tips on using Night Sight:

- Night Sight can't operate in complete darkness, so pick a scene with some light falling on it.
- Soft, uniform lighting works better than harsh lighting, which creates dark shadows.
- To avoid lens flare artifacts, try to keep very bright light sources out of the field of view.
- To increase exposure, tap on various objects, then move the exposure slider. Tap again to disable.
- To decrease exposure, take the shot and darken later in Google’s Photos editor; it will be less noisy.
- If it’s so dark the camera can’t focus, tap on a high-contrast edge, or the edge of a light source.
- If this won’t work for your scene, use the Near (4 feet) or Far (12 feet) focus buttons (see below).
- To maximize image sharpness, brace your phone against a wall or tree, or prop it on a table or rock.
- Night Sight works for selfies too, as in the A/B album, with optional illumination from the screen itself.
Manual focus buttons (Pixel 3 only).
Night Sight works best on Pixel 3. We’ve also brought it to Pixel 2 and the original Pixel, although on the latter we use shorter exposures because it has no optical image stabilization (OIS). Also, our learning-based white balancer is trained for Pixel 3, so it will be less accurate on older phones. By the way, we brighten the viewfinder in Night Sight to help you frame shots in low light, but the viewfinder is based on 1/15 second exposures, so it will be noisy, and isn't a fair indication of the final photograph. So take a chance — frame a shot, and press the shutter. You'll often be surprised!

Night Sight was a collaboration of several teams at Google. Key contributors to the project include: from the Gcam team Charles He, Nikhil Karnad, Orly Liba, David Jacobs, Tim Brooks, Michael Milne, Andrew Radin, Navin Sarma, Jon Barron, Yun-Ta Tsai, Jiawen Chen, Kiran Murthy, Tianfan Xue, Dillon Sharlet, Ryan Geiss, Sam Hasinoff and Alex Schiffhauer; from the Super Res Zoom team Bart Wronski, Peyman Milanfar and Ignacio Garcia Dorado; from the Google camera app team Gabriel Nava, Sushil Nath, Tim Smith, Justin Harrison, Isaac Reynolds and Michelle Chen.

1 By the way, the exposure time shown in Google Photos (if you press "i") is per-frame, not total time, which depends on the number of frames captured. You can get some idea of the number of frames by watching the animation while the camera is collecting light. Each tick around the circle is one captured frame.

2 For a wonderful analysis of these techniques, look at von Helmholtz, "On the relation of optics to painting" (1876).

Source: Google AI Blog

Accurate Online Speaker Diarization with Supervised Learning

Speaker diarization, the process of partitioning an audio stream with multiple people into homogeneous segments associated with each individual, is an important part of speech recognition systems. By solving the problem of “who spoke when”, speaker diarization has applications in many important scenarios, such as understanding medical conversations, video captioning and more. However, training these systems with supervised learning methods is challenging — unlike standard supervised classification tasks, a robust diarization model requires the ability to associate distinct speech segments with new individuals who weren't involved in training. Importantly, this limits the quality of both online and offline diarization systems. Online systems usually suffer more, since they require diarization results in real time.
Online speaker diarization on streaming audio input. Different colors in the bottom axis indicate different speakers.
In “Fully Supervised Speaker Diarization”, we describe a new model that seeks to make use of supervised speaker labels in a more effective manner. Here “fully” implies that all components in the speaker diarization system, including the estimation of the number of speakers, are trained in supervised ways, so that they can benefit from increasing the amount of labeled data available. On the NIST SRE 2000 CALLHOME benchmark, our diarization error rate (DER) is as low as 7.6%, compared to 8.8% DER from our previous clustering-based method, and 9.9% from deep neural network embedding methods. Moreover, our method achieves this lower error rate based on online decoding, making it particularly suitable for real-time applications. As such, we are open sourcing the core algorithms in our paper to accelerate more research along this direction.

Clustering versus Interleaved-state RNN
Modern speaker diarization systems are usually based on clustering algorithms such as k-means or spectral clustering. Since these clustering methods are unsupervised, they cannot make good use of the supervised speaker labels available in data. Moreover, online clustering algorithms usually have worse quality in real-time diarization applications with streaming audio inputs. The key difference between our model and common clustering algorithms is that in our method, all speakers’ embeddings are modeled by a parameter-sharing recurrent neural network (RNN), and we distinguish different speakers using different RNN states, interleaved in the time domain.

To understand how this works, consider the example below in which there are four possible speakers: blue, yellow, pink and green (this is arbitrary, and in fact there may be more — our model uses the Chinese restaurant process to accommodate the unknown number of speakers). Each speaker starts with its own RNN instance (with a common initial state shared among all speakers) and keeps updating the RNN state given the new embeddings from this speaker. In the example below, the blue speaker keeps updating its RNN state until a different speaker, yellow, comes in. If blue speaks again later, it resumes updating its RNN state. (This is just one of the possibilities for speech segment y7 in the figure below. If new speaker green enters, it will start with a new RNN instance.)
The generative process of our model. Colors indicate labels for speaker segments.
Representing speakers as RNN states enables us to learn the high-level knowledge shared across different speakers and utterances using RNN parameters, which means the model can benefit from more labeled data. In contrast, common clustering algorithms almost always work with each single utterance independently, making it difficult to benefit from a large amount of labeled data.
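The two ideas in this section, one shared update rule with a separate state per speaker and a Chinese restaurant process prior over who speaks next, can be sketched in a few lines. Everything here is a toy stand-in: the exponential-moving-average `update` replaces a real RNN cell, the prior is a plain Chinese restaurant process, the embeddings are short lists, and all names are hypothetical. It is not the released implementation.

```python
def crp_probabilities(counts, alpha=1.0):
    """Chinese restaurant process prior over the next speaker.

    counts: dict mapping each known speaker to how often it has been seen.
    alpha:  concentration parameter; larger values favor new speakers.
    Returns a dict of probabilities, with key "new" for an unseen speaker.
    """
    total = sum(counts.values()) + alpha
    probs = {speaker: n / total for speaker, n in counts.items()}
    probs["new"] = alpha / total
    return probs

def update(state, embedding, rate=0.5):
    """Shared update rule (stand-in for an RNN cell with shared parameters)."""
    return [(1 - rate) * s + rate * e for s, e in zip(state, embedding)]

def run_interleaved(segments, initial_state):
    """Keep one state per speaker, all starting from the same initial state."""
    states = {}
    for speaker, embedding in segments:
        previous = states.get(speaker, initial_state)  # resume or start fresh
        states[speaker] = update(previous, embedding)
    return states

if __name__ == "__main__":
    segments = [("blue", [1.0, 0.0]), ("blue", [1.0, 0.0]),
                ("yellow", [0.0, 1.0]), ("blue", [1.0, 0.0])]
    print(run_interleaved(segments, initial_state=[0.0, 0.0]))
    print(crp_probabilities({"blue": 3, "yellow": 1}, alpha=1.0))
```

When blue speaks again after yellow, its state resumes from where it left off rather than from the initial state, which is the interleaving described above. The prior always leaves some probability for a speaker not yet seen, so the number of speakers does not have to be fixed in advance.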

The upshot of all this is that given time-stamped speaker labels (i.e. we know who spoke when), we can train the model with standard stochastic gradient descent algorithms. A trained model can be used for speaker diarization on new utterances from unheard speakers. Furthermore, the use of online decoding makes it more suitable for latency-sensitive applications.

Future Work
Although we've already achieved impressive diarization performance with this system, there are still many exciting directions we are currently exploring. First, we are refining our model so it can easily integrate contextual information to perform offline decoding. This will likely further reduce the DER, which is more useful for latency-insensitive applications. Second, we would like to model acoustic features directly instead of using d-vectors. In this way, the entire speaker diarization system can be trained in an end-to-end way.

To learn more about this work, please see our paper. To download the core algorithm of this system, please visit the GitHub page.

This work was done as a close collaboration between Google AI and Speech & Assistant teams. Contributors include Aonan Zhang (intern), Quan Wang, Zhengyao Zhu and Chong Wang.

Source: Google AI Blog

Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing

One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.

This week, we open sourced a new technique for NLP pre-training called Bidirectional Encoder Representations from Transformers, or BERT. With this release, anyone in the world can train their own state-of-the-art question answering system (or a variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. The release includes source code built on top of TensorFlow and a number of pre-trained language representation models. In our associated paper, we demonstrate state-of-the-art results on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset (SQuAD v1.1).

What Makes BERT Different?
BERT builds upon recent work in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).

Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the ... account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
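The difference between the two kinds of contextual model can be shown with a toy helper that lists which words are available when representing a given position. The helper name visible_context is hypothetical, and it only illustrates the context each model type may use, not how any model computes a representation.

```python
def visible_context(tokens, index, bidirectional):
    """Return the (left, right) words a model may use to represent tokens[index]."""
    left = tokens[:index]
    # A left-to-right model cannot look at the words that follow.
    right = tokens[index + 1:] if bidirectional else []
    return left, right


sentence = "I accessed the bank account".split()
position = sentence.index("bank")

print(visible_context(sentence, position, bidirectional=False))
# (['I', 'accessed', 'the'], [])
print(visible_context(sentence, position, bidirectional=True))
# (['I', 'accessed', 'the'], ['account'])
```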

A visualization of BERT’s neural network architecture compared to previous state-of-the-art contextual pre-training methods is shown below. The arrows indicate the information flow from one layer to the next. The green boxes at the top indicate the final contextualized representation of each input word:
BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional.
The Strength of Bidirectionality
If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.

To solve this problem, we use the straightforward technique of masking out some of the words in the input and then conditioning each word bidirectionally to predict the masked words.
While this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network.
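As a rough illustration of the masking step, the sketch below hides a fraction of the input tokens and records the originals as prediction targets. The function name mask_tokens, the 15% masking rate, and whitespace tokenization are assumptions made for this example rather than details of the release, and the real data pipeline is more involved.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and a dict mapping each masked
    position to the original token the model should predict.
    Expects a non-empty token list.
    """
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


tokens = "the man went to the store and bought a gallon of milk".split()
masked, labels = mask_tokens(tokens)
```

Every unmasked word stays visible on both sides of each masked position, which is what lets the model condition on left and right context at once without a word seeing itself.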

BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence?
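Generating examples for this task needs nothing more than an ordered list of sentences. The sketch below is one plausible way to do it, assuming an even split between true and random successors. The function name next_sentence_pairs, the 50/50 split, and the toy corpus are all assumptions for illustration.

```python
import random


def next_sentence_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered corpus.

    Expects at least three sentences so that a random alternative exists.
    """
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: B really does follow A.
            examples.append((a, sentences[i + 1], True))
        else:
            # Negative example: any sentence except A itself and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((a, rng.choice(candidates), False))
    return examples


corpus = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless.",
    "The sky is blue.",
]
examples = next_sentence_pairs(corpus, seed=0)
```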
Training with Cloud TPUs
Everything that we’ve described so far might seem fairly straightforward, so what’s the missing piece that made it work so well? Cloud TPUs. Cloud TPUs gave us the freedom to quickly experiment, debug, and tweak our models, which was critical in allowing us to move beyond existing pre-training techniques. The Transformer model architecture, developed by researchers at Google in 2017, also gave us the foundation we needed to make BERT successful. The Transformer is implemented in our open source release, as well as the tensor2tensor library.

Results with BERT
To evaluate performance, we compared BERT to other state-of-the-art NLP systems. Importantly, BERT achieved all of its results with almost no task-specific changes to the neural network architecture. On SQuAD v1.1, BERT achieves 93.2% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and human-level score of 91.2%.
BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks. The amount of human-labeled training data in these tasks ranges from 2,500 examples to 400,000 examples, and BERT substantially improves upon the state-of-the-art accuracy on all of them.
Making BERT Work for You
The models that we are releasing can be fine-tuned on a wide variety of NLP tasks in a few hours or less. The open source release also includes code to run pre-training, although we believe the majority of NLP researchers who use BERT will never need to pre-train their own models from scratch. The BERT models that we are releasing today are English-only, but we hope to release models which have been pre-trained on a variety of languages in the near future.

The TensorFlow implementation and pointers to pre-trained BERT models can be found in the open source release. Alternatively, you can get started using BERT through Colab with the notebook “BERT FineTuning with Cloud TPUs.”

You can also read our paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" for more details.


Google at EMNLP 2018

This week, the annual conference on Empirical Methods in Natural Language Processing (EMNLP 2018) will be held in Brussels, Belgium. Google will have a strong presence at EMNLP, with several of our researchers presenting their research on a diverse set of topics, including language identification, segmentation, semantic parsing and question answering, and also serving at various levels of the conference's organization. Googlers will also be presenting their papers and participating in the co-located Conference on Computational Natural Language Learning (CoNLL 2018) shared task on multilingual parsing.

In addition to this involvement, we are sharing several new datasets with the academic community that are released with papers published at EMNLP, with the goal of accelerating progress in empirical natural language processing (NLP). These releases are designed to help account for mismatches between the datasets a machine learning model is trained and tested on, and the inputs an NLP system would be asked to handle “in the wild”. All of the datasets we are releasing include realistic, naturally occurring text, and fall into two main categories: 1) challenge sets for well-studied core NLP tasks (part-of-speech tagging, coreference) and 2) datasets to encourage new directions of research on meaning preservation under rephrasings/edits (query well-formedness, split-and-rephrase, atomic edits):
  • Noun-Verb Ambiguity in POS Tagging Dataset: English part-of-speech taggers regularly make egregious errors related to noun-verb ambiguity, despite high accuracies on standard datasets. For example: in “Mark which area you want to distress” several state-of-the-art taggers annotate “Mark” as a noun instead of a verb. We release a new dataset of over 30,000 naturally occurring non-trivial annotated examples of noun-verb ambiguity. Taggers previously indistinguishable from each other have accuracies ranging from 57% to 75% on this challenge set.
  • Query Wellformedness Dataset: Web search queries are usually “word-salad” style queries with little resemblance to natural language questions (“barack obama height” as opposed to “What is the height of Barack Obama?”). Differentiating a natural language question from a query is of importance to several applications, including dialogue. We annotate and release 25,100 queries from the open-source Paralex corpus with ratings on how close they are to well-formed natural language questions.
  • WikiSplit: Split and Rephrase Dataset Extracted from Wikipedia Edits: We extract examples of sentence splits from Wikipedia edits where one sentence gets split into two sentences that together preserve the original meaning of the sentence (e.g., “Street Rod is the first in a series of two games released for the PC and Commodore 64 in 1989.” is split into “Street Rod is the first in a series of two games.” and “It was released for the PC and Commodore 64 in 1989.”). The released corpus contains one million sentence splits with a vocabulary of more than 600,000 words.
  • WikiAtomicEdits: A Multilingual Corpus of Atomic Wikipedia Edits: Information about how people edit language in Wikipedia can be used to understand the structure of language itself. We pay particular attention to two atomic edits: insertions and deletions that consist of a single contiguous span of text. We extract around 43 million such edits in 8 languages and show that they provide valuable information about entailment and discourse. For example, insertion of “in 1949” adds a prepositional phrase to the sentence “She died there after a long illness” resulting in “She died there in 1949 after a long illness”.
These datasets join the others that Google has recently released, such as Conceptual Captions and GAP Coreference Resolution in addition to our past contributions.

Below is a full list of Google’s involvement and publications being presented at EMNLP and CoNLL. We are particularly happy to announce that the paper “Linguistically-Informed Self-Attention for Semantic Role Labeling” was awarded one of the two Best Long Paper awards. This work was done by our 2017 intern Emma Strubell, Googlers Daniel Andor, David Weiss and Google Faculty Advisor Andrew McCallum. We congratulate these authors, and all other researchers who are presenting their work at the conference.

Area Chairs Include:
Ming-Wei Chang, Marius Pasca, Slav Petrov, Emily Pitler, Meg Mitchell, Taro Watanabe

EMNLP Publications
A Challenge Set and Methods for Noun-Verb Ambiguity
Ali Elkahky, Kellie Webster, Daniel Andor, Emily Pitler

A Fast, Compact, Accurate Model for Language Identification of Codemixed Text
Yuan Zhang, Jason Riesa, Daniel Gillick, Anton Bakalov, Jason Baldridge, David Weiss

AirDialogue: An Environment for Goal-Oriented Dialogue Research
Wei Wei, Quoc Le, Andrew Dai, Jia Li

Content Explorer: Recommending Novel Entities for a Document Writer
Michal Lukasik, Richard Zens

Deep Relevance Ranking using Enhanced Document-Query Interactions
Ryan McDonald, George Brokos, Ion Androutsopoulos

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, Christopher D. Manning

Identifying Well-formed Natural Language Questions
Manaal Faruqui, Dipanjan Das

Learning To Split and Rephrase From Wikipedia Edit History
Jan A. Botha, Manaal Faruqui, John Alex, Jason Baldridge, Dipanjan Das

Linguistically-Informed Self-Attention for Semantic Role Labeling
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, Andrew McCallum

Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text
Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, William Cohen

Noise Contrastive Estimation for Conditional Models: Consistency and Statistical Efficiency
Zhuang Ma, Michael Collins

Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification
Kelsey Ball, Dan Garrette

Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension
Minjoon Seo, Tom Kwiatkowski, Ankur P. Parikh, Ali Farhadi, Hannaneh Hajishirzi

Policy Shaping and Generalized Update Equations for Semantic Parsing from Denotations
Dipendra Misra, Ming-Wei Chang, Xiaodong He, Wen-tau Yih

Revisiting Character-Based Neural Machine Translation with Capacity and Compression
Colin Cherry, George Foster, Ankur Bapna, Orhan Firat, Wolfgang Macherey

Self-governing neural networks for on-device short text classification
Sujith Ravi, Zornitsa Kozareva

Semi-Supervised Sequence Modeling with Cross-View Training
Kevin Clark, Minh-Thang Luong, Christopher D. Manning, Quoc Le

State-of-the-art Chinese Word Segmentation with Bi-LSTMs
Ji Ma, Kuzman Ganchev, David Weiss

Subgoal Discovery for Hierarchical Dialogue Policy Learning
Da Tang, Xiujun Li, Jianfeng Gao, Chong Wang, Lihong Li, Tony Jebara

SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation
Xinyi Wang, Hieu Pham, Zihang Dai, Graham Neubig

The Importance of Generation Order in Language Modeling
Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, George Dahl

Training Deeper Neural Machine Translation Models with Transparent Attention
Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, Yonghui Wu

Understanding Back-Translation at Scale
Sergey Edunov, Myle Ott, Michael Auli, David Grangier

Unsupervised Natural Language Generation with Denoising Autoencoders
Markus Freitag, Scott Roy

WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse
Manaal Faruqui, Ellie Pavlick, Ian Tenney, Dipanjan Das

WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community
Yiqing Hua, Cristian Danescu-Niculescu-Mizil, Dario Taraborelli, Nithum Thain, Jeffery Sorensen, Lucas Dixon

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo, John Richardson

Universal Sentence Encoder for English
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, Ray Kurzweil

CoNLL Shared Task
Multilingual Parsing from Raw Text to Universal Dependencies
Slav Petrov, co-organizer

Universal Dependency Parsing with Multi-Treebank Models
Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, Yan Shao, Sara Stymne
(Winner of the Universal POS Tagging and Morphological Tagging subtasks, using the open-sourced Meta-BiLSTM tagger)

CoNLL Publication
Sentence-Level Fluency Evaluation: References Help, But Can Be Spared!
Katharina Kann, Sascha Rothe, Katja Filippova


Introducing AdaNet: Fast and Flexible AutoML with Learning Guarantees

Ensemble learning, the art of combining different machine learning (ML) model predictions, is widely used with neural networks to achieve state-of-the-art performance, benefitting from a rich history and theoretical guarantees to enable success at challenges such as the Netflix Prize and various Kaggle competitions. However, ensembles aren’t used much in practice due to long training times, and selecting the ML model candidates requires its own domain expertise. But as computational power and specialized deep learning hardware such as TPUs become more readily available, machine learning models will grow larger and ensembles will become more prominent. Now, imagine a tool that automatically searches over neural architectures, and learns to combine the best ones into a high-quality model.

Today, we’re excited to share AdaNet, a lightweight TensorFlow-based framework for automatically learning high-quality models with minimal expert intervention. AdaNet builds on our recent reinforcement learning and evolutionary-based AutoML efforts to be fast and flexible while providing learning guarantees. Importantly, AdaNet provides a general framework for not only learning a neural network architecture, but also for learning to ensemble to obtain even better models.

AdaNet is easy to use and creates high-quality models, saving ML practitioners the time normally spent selecting optimal neural network architectures. It implements an adaptive algorithm for learning a neural architecture as an ensemble of subnetworks. AdaNet is capable of adding subnetworks of different depths and widths to create a diverse ensemble, and of trading off performance improvement against the number of parameters.
AdaNet adaptively growing an ensemble of neural networks. At each iteration, it measures the ensemble loss for each candidate, and selects the best one to move onto the next iteration.
Fast and Easy to Use
AdaNet implements the TensorFlow Estimator interface, which greatly simplifies machine learning programming by encapsulating training, evaluation, prediction and export for serving. It integrates with open-source tools like TensorFlow Hub modules, TensorFlow Model Analysis, and Google Cloud’s Hyperparameter Tuner. Distributed training support significantly reduces training time, and scales linearly with available CPUs and accelerators (e.g. GPUs).
AdaNet’s accuracy (y-axis) per train step (x-axis) on CIFAR-100. The blue line is accuracy on the training set, and the red line is performance on the test set. A new subnetwork begins training every million steps, and eventually improves the performance of the ensemble. The grey and green lines are the accuracies of the ensemble before adding the new subnetwork.
Because TensorBoard is one of the best TensorFlow features for visualizing model metrics during training, AdaNet integrates seamlessly with it in order to monitor subnetwork training, ensemble composition, and performance. When AdaNet is done training, it exports a SavedModel that can be deployed with TensorFlow Serving.

Learning Guarantees
Building an ensemble of neural networks has several challenges: What are the best subnetwork architectures to consider? Is it best to reuse the same architectures or encourage diversity? While complex subnetworks with more parameters will tend to perform better on the training set, they may not generalize to unseen data due to their greater complexity. These challenges stem from evaluating model performance. We could evaluate performance on a hold-out set split from the training set, but in doing so would reduce the number of examples one can use for training the neural network.

Instead, AdaNet’s approach (presented in “AdaNet: Adaptive Structural Learning of Artificial Neural Networks” at ICML 2017) is to optimize an objective that balances the trade-offs between the ensemble’s performance on the training set and its ability to generalize to unseen data. The intuition is for the ensemble to include a candidate subnetwork only when it improves the ensemble’s training loss more than it affects its ability to generalize. This guarantees that:
  1. The generalization error of the ensemble is bounded by its training error and complexity.
  2. By optimizing this objective, we are directly minimizing this bound.
A practical benefit of optimizing this objective is that it eliminates the need for a hold-out set for choosing which candidate subnetworks to add to the ensemble. This has the added benefit of enabling the use of more training data for training the subnetworks. To learn more, please walk through our tutorial about the AdaNet objective.
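The selection rule can be sketched in a few lines, assuming the objective is the ensemble's training loss plus a complexity penalty on each subnetwork, scaled by the magnitude of its mixture weight. The function names, the lam and beta parameters, and the numbers below are hypothetical. They are not part of the AdaNet API and are meant only to show how a more complex candidate can lose to a simpler one despite a lower training loss.

```python
def penalized_objective(train_loss, complexities, weights, lam=0.01, beta=0.0):
    """Training loss plus a complexity penalty, one term per subnetwork."""
    penalty = sum((lam * r + beta) * abs(w) for r, w in zip(complexities, weights))
    return train_loss + penalty


def select_candidate(current_objective, candidates, lam=0.01, beta=0.0):
    """Pick the candidate ensemble with the lowest objective.

    candidates is a list of (name, ensemble_train_loss, complexities, weights).
    Returns (None, current_objective) if no candidate improves on the
    current ensemble.
    """
    best_name, best_value = None, current_objective
    for name, loss, complexities, weights in candidates:
        value = penalized_objective(loss, complexities, weights, lam, beta)
        if value < best_value:
            best_name, best_value = name, value
    return best_name, best_value


candidates = [
    # The deeper candidate fits the training set slightly better but is more complex.
    ("shallow", 0.45, [1.0, 1.0], [0.5, 0.5]),
    ("deep", 0.44, [1.0, 4.0], [0.5, 0.5]),
]
chosen, value = select_candidate(0.50, candidates)
```

With these numbers the shallow candidate scores 0.46 and the deep one 0.465, so the shallow subnetwork is added even though the deep one has the lower training loss.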

We believe that the key to making a useful AutoML framework for both research and production use is to not only provide sensible defaults, but to also allow users to try their own subnetwork/model definitions. As a result, machine learning researchers, practitioners, and enthusiasts are invited to define their own AdaNet adanet.subnetwork.Builder using high level TensorFlow APIs like tf.layers.

Users who have already integrated a TensorFlow model in their system can easily convert their TensorFlow code into an AdaNet subnetwork, and use the adanet.Estimator to boost model performance while obtaining learning guarantees. AdaNet will explore their defined search space of candidate subnetworks and learn to ensemble the subnetworks. For instance, we took an open-source implementation of a NASNet-A CIFAR architecture, transformed it into a subnetwork, and improved upon CIFAR-10 state-of-the-art results after eight AdaNet iterations. Furthermore, our model achieves this result with fewer parameters:
Performance of a NASNet-A model as presented in Zoph et al., 2018 versus AdaNet learning to combine small NASNet-A subnetworks on CIFAR-10.
Users are also invited to use their own custom loss functions as part of the AdaNet objective via canned or custom tf.contrib.estimator.Heads in order to train regression, classification, and multi-task learning problems.

Users can also fully define the search space of candidate subnetworks to explore by extending the adanet.subnetwork.Generator class. This allows them to grow or reduce their search space based on their available hardware. The search space of subnetworks can range from duplicating the same subnetwork configuration with different random seeds to training dozens of subnetworks with different hyperparameter combinations and letting AdaNet choose the one to include in the final ensemble.

If you’re interested in trying AdaNet for yourself, please check out our GitHub repo, and walk through the tutorial notebooks. We’ve included a few working examples using dense layers and convolutions to get you started. AdaNet is an ongoing research project, and we welcome contributions. We’re excited to see how AdaNet can help the research community.

This project was only possible thanks to the members of the core team including Corinna Cortes, Mehryar Mohri, Xavi Gonzalvo, Charles Weill, Vitaly Kuznetsov, Scott Yak, and Hanna Mazzawi. We also extend a special thanks to our collaborators, residents and interns Gus Kristiansen, Galen Chuang, Ghassen Jerfel, Vladimir Macko, Ben Adlam, Scott Yang and the many others at Google who helped us test it out.


Acoustic Detection of Humpback Whales Using a Convolutional Neural Network

Over the last several years, Google AI Perception teams have developed techniques for audio event analysis that have been applied on YouTube for non-speech captions, video categorizations, and indexing. Furthermore, we have published the AudioSet evaluation set and open-sourced some model code in order to further spur research in the community. Recently, we’ve become increasingly aware that many conservation organizations were collecting large quantities of acoustic data, and wondered whether it might be possible to apply these same technologies to that data in order to assist wildlife monitoring and conservation.

As part of our AI for Social Good program, and in partnership with the Pacific Islands Fisheries Science Center of the U.S. National Oceanic and Atmospheric Administration (NOAA), we developed algorithms to identify humpback whale calls in 15 years of underwater recordings from a number of locations in the Pacific. The results of this research provide new and important information about humpback whale presence, seasonality, daily calling behavior, and population structure. This is especially important in remote, uninhabited islands, about which scientists have had no information until now. Additionally, because the dataset spans a large period of time, knowing when and where humpback whales are calling will provide information on whether or not the animals have changed their distribution over the years, especially in relation to increasing human ocean activity. That information will be a key ingredient for effective mitigation of anthropogenic impacts on humpback whales.
HARP deployment locations. Green: sites with currently active recorders. Red: previous recording sites.
Passive Acoustic Monitoring and the NOAA HARP Dataset
Passive acoustic monitoring is the process of listening to marine mammals with underwater microphones called hydrophones, which can be used to record signals so that detection, classification, and localization tasks can be done offline. This has some advantages over ship-based visual surveys, including the ability to detect submerged animals, longer detection ranges and longer monitoring periods. Since 2005, NOAA has collected recordings from ocean-bottom hydrophones at 12 sites in the Pacific Island region, a winter breeding and calving destination for certain populations of humpback whales.

The data was recorded on devices called high-frequency acoustic recording packages, or HARPs (Wiggins and Hildebrand, 2007; full text PDF). In total, NOAA provided about 15 years of audio, or 9.2 terabytes after decimation from 200 kHz to 10 kHz. (Since most of the sound energy in humpback vocalizations is in the 100 Hz to 2000 Hz range, little is lost in using the lower sample rate.)
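The claim that little is lost follows from the sampling theorem: audio sampled at a given rate can represent frequencies up to half that rate. The short check below works through the arithmetic. The constant and function names are chosen for this example only.

```python
ORIGINAL_RATE_HZ = 200_000
TARGET_RATE_HZ = 10_000
CALL_BAND_HZ = (100, 2_000)  # where most humpback sound energy lies

# Keeping one sample in every `decimation_factor` gives the target rate.
decimation_factor = ORIGINAL_RATE_HZ // TARGET_RATE_HZ

# Highest frequency that the decimated audio can still represent.
nyquist_hz = TARGET_RATE_HZ / 2


def band_is_preserved(band_hz, sample_rate_hz):
    """True if the top of the band lies below the Nyquist frequency."""
    return band_hz[1] < sample_rate_hz / 2


print(decimation_factor, nyquist_hz, band_is_preserved(CALL_BAND_HZ, TARGET_RATE_HZ))
# 20 5000.0 True
```

The 2000 Hz upper edge of the band sits well below the 5000 Hz limit, which leaves room for the anti-aliasing filter that decimation normally applies.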

From a research perspective, identifying species of interest in such large volumes of data is an important first stage that provides input for higher-level population abundance, behavioral or oceanographic analyses. However, manually marking humpback whale calls, even with the aid of currently available computer-assisted methods, is extremely time-consuming.

Supervised Learning: Optimizing an Image Model for Humpback Detection
We made the common choice of treating audio event detection as an image classification problem, where the image is a spectrogram — a histogram of sound power plotted on time-frequency axes.
Example spectrograms of audio events found in the dataset, with time on the x-axis and frequency on the y-axis. Left: a humpback whale call (in particular, a tonal unit), Center: narrow-band noise from an unknown source, Right: hard disk noise from the HARP
This is a good representation for an image classifier, whose goal is to discriminate, because the different spectra (frequency decompositions) and time variations thereof (which are characteristic of distinct sound types) are represented in the spectrogram as visually dissimilar patterns. For the image model itself, we used ResNet-50, a convolutional neural network architecture typically used for image classification that has shown success at classifying non-speech audio. This is a supervised learning setup, where only manually labeled data could be used for training (0.2% of the entire dataset). In the next section, we describe an approach that makes use of the unlabeled data.

The process of going from waveform to spectrogram involves choices of parameters and gain-scaling functions. Common default choices (one of which was logarithmic compression) were a good starting point, but some domain-specific tuning was needed to optimize the detection of whale calls. Humpback vocalizations are varied, but sustained, frequency-modulated, tonal units occur frequently in time. You can listen to an example below:

If the frequency didn't vary at all, a tonal unit would appear in the spectrogram as a horizontal bar. Since the calls are frequency-modulated, we actually see arcs instead of bars, but parts of the arcs are close to horizontal.

A challenge particular to this dataset was narrow-band noise, most often caused by nearby boats and the equipment itself. In a spectrogram it appears as horizontal lines, and early versions of the model would confuse it with humpback calls. This motivated us to try per-channel energy normalization (PCEN), which allows the suppression of stationary, narrow-band noise. This proved to be critical, providing a 24% reduction in error rate of whale call detection.
Spectrograms of the same 5-unit excerpt from humpback whale song beginning at 0:06 in the above recording. Top: PCEN. Bottom: log of squared magnitude. The dark blue horizontal bar along the bottom under log compression has become much lighter relative to the whale call when using PCEN
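For readers unfamiliar with PCEN, the sketch below shows its usual form: each frequency channel is divided by a slowly moving average of its own energy and then passed through a root compression. The parameter values are commonly quoted defaults and not necessarily the ones we tuned for whale calls. The function and variable names are assumptions for this example, and plain Python lists are used only to keep it self-contained.

```python
def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a time-major spectrogram.

    energy[t][f] is the non-negative energy at frame t and frequency bin f.
    """
    num_bins = len(energy[0])
    smoothed = list(energy[0])  # start the moving average at the first frame
    out = []
    for frame in energy:
        row = []
        for f in range(num_bins):
            # First-order smoother: tracks the slowly varying level in each channel.
            smoothed[f] = (1 - s) * smoothed[f] + s * frame[f]
            gain = (eps + smoothed[f]) ** alpha
            row.append((frame[f] / gain + delta) ** r - delta ** r)
        out.append(row)
    return out


# Bin 0: loud stationary narrow-band noise. Bin 1: quiet, with one loud transient.
frames = [[100.0, 100.0 if t == 50 else 1.0] for t in range(100)]
out = pcen(frames)
```

At frame 50 both bins carry the same raw energy, but the stationary bin has been normalized by its own history and comes out far smaller than the transient. This is the effect that lightens the horizontal noise bar in the PCEN spectrogram.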
Aside from PCEN, averaging predictions over a longer period of time led to much better precision. This same effect happens for general audio event detection, but for humpback calls the increase in precision was surprisingly large. A likely explanation is that the vocalizations in our dataset are mainly in the context of whale song, a structured sequence of units that can last over 20 minutes. At the end of one unit in a song, there is a good chance another unit begins within two seconds. The input to the image model covers a short time window, but because the song is so long, model outputs from more distant time windows give extra information useful for making the correct prediction for the current time window.
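One simple way to average predictions over time is a centered moving average of the per-window scores. The sketch below assumes that form. The function name, window length, and example scores are invented for illustration and do not describe the exact aggregation we used.

```python
def smooth_scores(scores, window):
    """Centered moving average of per-window scores, truncated at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


# An isolated high score with no support from its neighbours is damped...
isolated = smooth_scores([0.0, 0.0, 1.0, 0.0, 0.0], window=3)
# ...while a single low score in the middle of a song is pulled back up.
in_song = smooth_scores([0.9, 0.9, 0.3, 0.9, 0.9], window=3)
```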

Overall, evaluating on our test set of 75-second audio clips, the model identifies whether a clip contains humpback calls at over 90% precision and 90% recall. However, one should interpret these results with care; training and test data come from similar equipment and environmental conditions. That said, preliminary checks against some non-NOAA sources look promising.

Unsupervised Learning: Representation for Finding Similar Song Units
A different way to approach the question, "Where are all the humpback sounds in this data?", is to start with several examples of humpback sound and, for each of these, find more in the dataset that are similar to that example. The definition of similar here can be learned by the same ResNet we used when this was framed as a supervised problem. There, we used the labels to learn a classifier on top of the ResNet output. Here, we encourage a pair of ResNet output vectors to be close in Euclidean distance when the corresponding audio examples are close in time. With that distance function, we can retrieve many more examples of audio similar to a given one. In the future, this may be useful input for a classifier that distinguishes different humpback unit types from each other.

To learn the distance function, we used a method described in "Unsupervised Learning of Semantic Audio Representations", based on the idea that closeness in time is related to closeness in meaning. It randomly samples triplets, where each triplet is defined to consist of an anchor, a positive, and a negative. The positive and the anchor are sampled so that they start around the same time. An example of a triplet in our application would be a humpback unit (anchor), a probable repeat of the same unit by the same whale (positive) and background noise from some other month (negative). Passing the 3 samples through the ResNet (with tied weights) represents them as 3 vectors. Minimizing a loss that forces the anchor-negative distance to exceed the anchor-positive distance by a margin learns a distance function faithful to semantic similarity.
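The loss on a single triplet can be written as a hinge on the difference of the two distances. The sketch below uses plain Euclidean distance and a margin of 1.0, both of which are assumptions for this example, and the exact form in the cited paper may differ. The two-dimensional vectors stand in for ResNet outputs.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # e.g. a probable repeat of the same unit
negative = [3.0, 4.0]   # e.g. background noise from another month

satisfied = triplet_loss(anchor, positive, negative)  # distances 1 and 5
violated = triplet_loss(anchor, negative, positive)   # roles swapped
```

Only triplets that violate the margin produce a non-zero loss, so training concentrates on pairs the current embedding still gets wrong.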

Principal component analysis (PCA) on a sample of labeled points lets us visualize the results. Separation between humpback and non-humpback is apparent. Explore for yourself using the TensorFlow Embedding Projector. Try changing "Color by" to each of class_label and site. Also, try changing PCA to t-SNE in the projector for a visualization that prioritizes preserving relative distances rather than sample variance.
A sample of 5000 data points in the unsupervised representation. (Orange: humpback. Blue: not humpback.)
Given individual "query" units, we retrieved the nearest neighbors in the entire corpus using Euclidean distance between embedding vectors. In some cases we found hundreds more instances of the same unit with good precision.
Manually chosen query units (boxed) and nearest neighbors using the unsupervised representation.
We intend to use these in the future to build a training set for a classifier that discriminates between song units. We could also use them to expand the training set used for learning a humpback detector.
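
The retrieval step described above can be sketched as a brute-force nearest-neighbor search. This is only a sketch: `nearest_neighbors` and the toy corpus are hypothetical, and a corpus of this size would likely call for an approximate or indexed search instead.

```python
import heapq
import math

def nearest_neighbors(query, corpus, k=3):
    """Return indices of the k corpus vectors closest to `query`
    in Euclidean distance, nearest first."""
    def dist(i):
        return math.dist(query, corpus[i])
    return heapq.nsmallest(k, range(len(corpus)), key=dist)

# Hypothetical 2-D embeddings standing in for ResNet output vectors.
corpus = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0], [9.0, 9.0]]
neighbors = nearest_neighbors([0.0, 0.0], corpus, k=3)
```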

Predictions from the Supervised Classifier on the Entire Dataset
We plotted summaries of the model output grouped by time and location. Not all sites had deployments in all years. Duty cycling (example: 5 minutes on, 15 minutes off) allows longer deployments on limited battery power, but the schedule can vary. To deal with these sources of variability, we consider the time in which humpback calling was detected as a proportion of the total time actually recorded in a month:
Time density of presence on year / month axes for the Kona and Saipan sites.
The apparent seasonal variation is consistent with a known pattern in which humpback populations spend summers feeding near Alaska and then migrate to the vicinity of the Hawaiian Islands to breed and give birth. This is a nice sanity check for the model.
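
The normalization behind the plot can be sketched as follows. The record layout (year, month, recorded seconds, detected seconds) is an assumption made for illustration, not the format of the actual dataset.

```python
from collections import defaultdict

def monthly_presence(windows):
    """windows: iterable of (year, month, recorded_seconds, detected_seconds).
    Returns {(year, month): detected / recorded}, counting sampled time only,
    so that months with different duty cycles remain comparable."""
    recorded = defaultdict(float)
    detected = defaultdict(float)
    for year, month, rec, det in windows:
        recorded[(year, month)] += rec
        detected[(year, month)] += det
    return {key: detected[key] / recorded[key]
            for key in recorded if recorded[key] > 0}

# Hypothetical 5-minute recording windows.
presence = monthly_presence([
    (2015, 1, 300, 150),
    (2015, 1, 300, 0),
    (2015, 7, 300, 0),
])
```

Dividing by recorded time rather than calendar time keeps a sparsely sampled month from looking quieter than it was.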

We hope the predictions for the full dataset will equip experts at NOAA to reach deeper insights into the status of these populations and into the degree of any anthropogenic impacts on them. We also hope this is just one of the first few in a series of successes as Google works to accelerate the application of machine learning to the world's biggest humanitarian and environmental challenges.

We would like to thank Ann Allen (NOAA Fisheries) for providing the bulk of the ground truth data, for many useful rounds of feedback, and for some of the words in this post. Karlina Merkens (NOAA affiliate) provided further useful guidance. We also thank the NOAA Pacific Islands Fisheries Science Center as a whole for collecting and sharing the acoustic data.

Within Google, Jiayang Liu, Julie Cattiau, Aren Jansen, Rif A. Saurous, and Lauren Harrell contributed to this work. Special thanks go to Lauren, who designed the plots in the analysis section and implemented them using ggplot.

Source: Google AI Blog

Curiosity and Procrastination in Reinforcement Learning

Reinforcement learning (RL) is one of the most actively pursued research techniques of machine learning, in which an artificial agent receives a positive reward when it does something right, and negative reward otherwise. This carrot-and-stick approach is simple and universal, and allowed DeepMind to teach the DQN algorithm to play vintage Atari games and AlphaGoZero to play the ancient game of Go. This is also how OpenAI taught its OpenAI-Five algorithm to play the modern video game Dota, and how Google taught robotic arms to grasp new objects. However, despite the successes of RL, there are many challenges to making it an effective technique.

Standard RL algorithms struggle with environments where feedback to the agent is sparse; crucially, such environments are common in the real world. As an example, imagine trying to learn how to find your favorite cheese in a large maze-like supermarket. You search and search but the cheese section is nowhere to be found. If at every step you receive no “carrot” and no “stick”, there’s no way to tell if you are headed in the right direction or not. In the absence of rewards, what is to stop you from wandering around in circles? Nothing, except perhaps your curiosity, which motivates you to go into a product section that looks unfamiliar to you in pursuit of your sought-after cheese.

In “Episodic Curiosity through Reachability” — the result of a collaboration between the Google Brain team, DeepMind and ETH Zürich — we propose a novel episodic memory-based model of granting RL rewards, akin to curiosity, which leads to exploring the environment. Since we want the agent not only to explore the environment but also to solve the original task, we add a reward bonus provided by our model to the original sparse task reward. The combined reward is not sparse anymore which allows standard RL algorithms to learn from it. Thus, our curiosity method expands the set of tasks which are solvable with RL.
Episodic Curiosity through Reachability: Observations are added to memory, reward is computed based on how far the current observation is from the most similar observation in memory. The agent receives more reward for seeing observations which are not yet represented in memory.
The key idea of our method is to store the agent's observations of the environment in an episodic memory, while also rewarding the agent for reaching observations not yet represented in memory. Being “not in memory” is the definition of novelty in our method: seeking such observations means seeking the unfamiliar. Such a drive to seek the unfamiliar will lead the artificial agent to new locations, thus keeping it from wandering in circles and ultimately helping it stumble on the goal. As we will discuss later, our formulation can save the agent from undesired behaviours which some other formulations are prone to. Much to our surprise, those behaviours bear some similarity to what a layperson would call “procrastination”.
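
To make the memory-plus-bonus idea concrete, here is a toy sketch. It assumes observations are points in the plane and treats "not in memory" as "farther than a threshold from every stored point"; the actual method uses a learned similarity instead, and the class name, threshold, and bonus scale are all hypothetical.

```python
import math

class EpisodicMemory:
    """Toy episodic memory: an observation is novel if it is farther than
    `threshold` from everything stored so far (an assumed stand-in for a
    learned similarity measure)."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.items = []

    def bonus(self, observation):
        """Return 1.0 and store the observation if it is novel, else 0.0."""
        if all(math.dist(observation, m) > self.threshold for m in self.items):
            self.items.append(observation)
            return 1.0
        return 0.0

def combined_reward(task_reward, bonus, scale=0.5):
    """Sparse task reward plus a scaled curiosity bonus."""
    return task_reward + scale * bonus

memory = EpisodicMemory(threshold=1.0)
path = [(0, 0), (2, 0), (4, 0), (2, 0), (0, 0)]  # walk out, then retrace steps
rewards = [combined_reward(0.0, memory.bonus(p)) for p in path]
```

In this sketch the agent is paid for each new location on the way out and receives nothing for retracing its steps, even though the task reward is zero throughout.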

Previous Curiosity Formulations
While there have been many attempts to formulate curiosity in the past[1][2][3][4], in this post we focus on one natural and very popular approach: curiosity through prediction-based surprise, explored in the recent paper “Curiosity-driven Exploration by Self-supervised Prediction” (commonly referred to as the ICM method). To illustrate how surprise leads to curiosity, again consider our analogy of looking for cheese in a supermarket.
Illustration © Indira Pasko, used under CC BY-NC-ND 4.0 license.
As you wander throughout the market, you try to predict the future (“Now I’m in the meat section, so I think the section around the corner is the fish section — those are usually adjacent in this supermarket chain”). If your prediction is wrong, you are surprised (“No, it’s actually the vegetables section. I didn’t expect that!”) and thus rewarded. This makes you more motivated to look around the corner in the future, exploring new locations just to see if your expectations about them meet the reality (and, hopefully, stumble upon the cheese).

Similarly, the ICM method builds a predictive model of the dynamics of the world and gives the agent rewards when the model fails to make good predictions — a marker of surprise or novelty. Note that exploring unvisited locations is not directly a part of the ICM curiosity formulation. For the ICM method, visiting them is only a way to obtain more “surprise” and thus maximize overall rewards. As it turns out, in some environments there could be other ways to inflict self-surprise, leading to unforeseen results.
Agent imbued with surprise-based curiosity gets stuck when it encounters a TV. GIF adapted from a video by © Deepak Pathak, used under CC BY 2.0 license.
The Dangers of “Procrastination”
In "Large-Scale Study of Curiosity-Driven Learning", the authors of the ICM method along with researchers from OpenAI show a hidden danger of surprise maximization: agents can learn to indulge procrastination-like behaviour instead of doing something useful for the task at hand. To see why, consider a common thought experiment the authors call the “noisy TV problem”, in which an agent is put into a maze and tasked with finding a highly rewarding item (akin to “cheese” in our previous supermarket example). The environment also contains a TV for which the agent has the remote control. There is a limited number of channels (each with a distinct show) and every press on the remote control switches to a random channel. How would an agent perform in such an environment?

For the surprise-based curiosity formulation, changing channels would result in a large reward, as each change is unpredictable and surprising. Crucially, even after cycling through all the available channels, the random channel selection ensures every new change will still be surprising — the agent is making predictions about what will be on the TV after a channel change, and will very likely be wrong, leading to surprise. Importantly, even if the agent has already seen every show on every channel, the change is still unpredictable. Because of this, the agent imbued with surprise-based curiosity would eventually stay in front of the TV forever instead of searching for a highly rewarding item — akin to procrastination. So, what would be a definition of curiosity which does not lead to such behaviour?

Episodic Curiosity
In “Episodic Curiosity through Reachability”, we explore an episodic memory-based curiosity model that turns out to be less prone to “self-indulging” instant gratification. Why so? Using our example above, after changing channels for a while, all of the shows will end up in memory. Thus, the TV won’t be so attractive anymore: even if the order of shows appearing on the screen is random and unpredictable, all those shows are already in memory! This is the main difference to the surprise-based methods: our method doesn’t even try to make bets about the future which could be hard (or even impossible) to predict. Instead, the agent examines the past to know if it has seen observations similar to the current one. Thus our agent won’t be drawn that much to the instant gratification provided by the noisy TV. It will have to go and explore the world outside of the TV to get more reward.
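
The contrast between the two reward schemes in the noisy TV setting can be simulated in a few lines. This is a deliberately simplified sketch: the "predictor" just guesses the most frequently seen channel, memory stores channel identities exactly, and the function name and parameters are hypothetical.

```python
import random
from collections import Counter

def noisy_tv(num_channels=10, presses=1000, seed=0):
    """Press the remote `presses` times and total up two kinds of reward."""
    rng = random.Random(seed)
    counts = Counter()
    surprise_total = 0  # reward when the prediction of the next channel is wrong
    memory_total = 0    # reward when the show is not yet in memory
    for _ in range(presses):
        guess = counts.most_common(1)[0][0] if counts else 0
        channel = rng.randrange(num_channels)
        surprise_total += int(guess != channel)
        memory_total += int(channel not in counts)
        counts[channel] += 1
    return surprise_total, memory_total

surprise_total, memory_total = noisy_tv()
```

The memory-based total can never exceed the number of channels, while the surprise-based total keeps growing with every press because a random channel change stays unpredictable.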

But how do we decide whether the agent is seeing the same thing as an existing memory? Checking for an exact match could be meaningless: in a realistic environment, the agent rarely sees exactly the same thing twice. For example, even if the agent returned to exactly the same room, it would still see this room from a different angle compared to its memories.

Instead of checking for an exact match in memory, we use a deep neural network that is trained to measure how similar two experiences are. To train this network, we have it guess whether two observations were experienced close together in time, or far apart in time. Temporal proximity is a good proxy for whether two experiences should be judged to be part of the same experience. This training leads to a general concept of novelty via reachability which is illustrated below.
Graph of reachabilities would determine novelty. In practice, this graph is not available — so we train a neural network approximator to estimate a number of steps between observations.
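
One way to build training labels for such a network is to pair up time steps from a single trajectory and label each pair by how far apart the steps are. The sketch below assumes a closeness threshold `k` and a `gap` factor that leaves out ambiguous middle-distance pairs; both names and values are illustrative assumptions.

```python
def label_pairs(trajectory_length, k=5, gap=2):
    """Label index pairs from one trajectory: 1 if at most k steps apart
    (close in time), 0 if more than gap * k steps apart (far in time).
    Pairs in between are skipped as ambiguous."""
    pairs = []
    for i in range(trajectory_length):
        for j in range(i + 1, trajectory_length):
            steps = j - i
            if steps <= k:
                pairs.append((i, j, 1))
            elif steps > gap * k:
                pairs.append((i, j, 0))
    return pairs

# Tiny example: 4 time steps, "close" means adjacent, "far" means 3+ steps.
example_pairs = label_pairs(4, k=1, gap=2)
```

No manual labels are needed here: the supervision comes entirely from the order in which the agent experienced its observations.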
Experimental Results
To compare the performance of different approaches to curiosity, we tested them in two visually rich 3D environments: ViZDoom and DMLab. In those environments, the agent was tasked with various problems like searching for a goal in a maze or collecting good and avoiding bad objects. The DMLab environment happens to provide the agent with a laser-like science fiction gadget. The standard setting in the previous work on DMLab was to equip the agent with this gadget for all tasks, and if the agent does not need a gadget for a particular task, it is free not to use it. Interestingly, similar to the noisy TV experiment described above, the surprise-based ICM method actually uses this gadget a lot even when it is useless for the task at hand! When tasked with searching for a high-rewarding item in the maze, it instead prefers to spend time tagging walls because this yields a lot of “surprise” reward. Theoretically, predicting the result of tagging should be possible, but in practice is too hard as it apparently requires a deeper knowledge of physics than is available to a standard agent.
Surprise-based ICM method is persistently tagging the wall instead of exploring the maze.
Our method instead learns reasonable exploration behaviour under the same conditions. This is because it does not try to predict the result of its actions, but rather seeks observations which are “harder” to achieve from those already in the episodic memory. In other words, the agent implicitly pursues goals which require more effort to reach from memory than just a single tagging action.
Our method shows reasonable exploration.
It is interesting to see that our approach to granting reward penalizes an agent running in circles. This is because after completing the first circle the agent does not encounter new observations other than those in memory, and thus receives no reward:
Our reward visualization: red means negative reward, green means positive reward. Left to right: map with rewards, map with locations currently in memory, first-person view.
At the same time, our method favors good exploration behavior:
Our reward visualization: red means negative reward, green means positive reward. Left to right: map with rewards, map with locations currently in memory, first-person view.
We hope that our work will help lead to a new wave of exploration methods, going beyond surprise and learning more intelligent exploration behaviours. For an in-depth analysis of our method, please take a look at the preprint of our research paper.

This project is a result of a collaboration between the Google Brain team, DeepMind and ETH Zürich. The core team includes Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap and Sylvain Gelly. We would like to thank Olivier Pietquin, Carlos Riquelme, Charles Blundell and Sergey Levine for the discussions about the paper. We are grateful to Indira Pasko for the help with illustrations.

[1] "Count-Based Exploration with Neural Density Models", Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, Remi Munos
[2] "#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning", Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel
[3] "Unsupervised Learning of Goal Spaces for Intrinsically Motivated Goal Exploration", Alexandre Péré, Sébastien Forestier, Olivier Sigaud, Pierre-Yves Oudeyer
[4] "VIME: Variational Information Maximizing Exploration", Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel


Fluid Annotation: An Exploratory Machine Learning–Powered Interface for Faster Image Annotation

The performance of modern deep learning–based computer vision models, such as those implemented by the TensorFlow Object Detection API, depends on the availability of increasingly large, labeled training datasets, such as Open Images. However, obtaining high-quality training data is quickly becoming a major bottleneck in computer vision. This is especially the case for pixel-wise prediction tasks such as semantic segmentation, used in applications such as autonomous driving, robotics, and image search. Indeed, traditional manual labeling tools require an annotator to carefully click on the boundaries to outline each object in the image, which is tedious: labeling a single image in the COCO+Stuff dataset takes 19 minutes, while labeling the whole dataset would take over 53k hours!
Example of image in the COCO dataset (left) and its pixel-wise semantic labeling (right). Image credit: Florida Memory, original image.
In “Fluid Annotation: A Human-Machine Collaboration Interface for Full Image Annotation”, to be presented at the Brave New Ideas track of the 2018 ACM Multimedia Conference, we explore a machine learning–powered interface for annotating the class label and outline of every object and background region in an image, accelerating the creation of labeled datasets by a factor of 3x.

Fluid Annotation starts from the output of a strong semantic segmentation model, which a human annotator can modify through machine-assisted edit operations using a natural user interface. Our interface empowers annotators to choose what to correct and in which order, allowing them to effectively focus their efforts on what the machine does not already know.
Visualization of the fluid annotation interface in action on image from COCO dataset. Image credit: gamene, original image.
More precisely, to annotate an image we first run it through a pre-trained semantic segmentation model (Mask-RCNN). This generates around 1000 image segments with their class labels and confidence scores. The segments with the highest confidences are used to initialize the labeling which is presented to the annotator. Afterwards, the annotator can: (1) Change the label of an existing segment choosing from a shortlist generated by the machine. (2) Add a segment to cover a missing object. The machine identifies the most likely pre-generated segments, through which the annotator can scroll and select the best one. (3) Remove an existing segment. (4) Change the depth-order of overlapping segments. To get a better feeling for this interface, try out the demo (desktop only).
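
The initialization and the "add a segment" operation can be sketched with a small data structure. Everything here is an illustrative assumption: the `Segment` fields, the `top_n` cutoff, and ranking candidates purely by confidence are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    label: str
    confidence: float
    pixels: frozenset  # set of (row, col) positions covered by the segment

def initial_labeling(segments, top_n=2):
    """Start from the top_n most confident segments."""
    return sorted(segments, key=lambda s: s.confidence, reverse=True)[:top_n]

def candidates_at(segments, labeling, pixel):
    """For the 'add' operation: unused segments covering the clicked pixel,
    most confident first, for the annotator to scroll through."""
    unused = [s for s in segments if s not in labeling and pixel in s.pixels]
    return sorted(unused, key=lambda s: s.confidence, reverse=True)

# Hypothetical pre-generated segments.
segments = [
    Segment("dog", 0.9, frozenset({(0, 0)})),
    Segment("grass", 0.8, frozenset({(1, 1)})),
    Segment("cat", 0.3, frozenset({(2, 2)})),
    Segment("ball", 0.5, frozenset({(2, 2), (2, 3)})),
]
labeling = initial_labeling(segments)
options = candidates_at(segments, labeling, (2, 2))
```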
Comparison of annotations using traditional manual labeling tools (middle column) and fluid annotation (right) on three COCO images. While object boundaries are often more accurate when using manual labeling tools, the biggest source of annotation differences is that human annotators often disagree on the exact object class. Image Credits: sneaka, original image (top), Dan Hurt, original image (middle), Melodie Mesiano, original image (bottom).
Fluid Annotation is a first exploratory step towards making image annotation faster and easier. In future work we aim to improve the annotation of object boundaries, make the interface faster by including more machine intelligence, and finally extend the interface to handle previously unseen classes for which efficient data collection is needed the most.

This work was done in collaboration with Misha Andriluka. Special thanks to Christine Sugrue for creating the fluid annotation demo. We also thank Anna Ukhanova and Damien Henry for their valuable input.


See Better and Further with Super Res Zoom on the Pixel 3

Digital zoom using algorithms (rather than lenses) has long been the “ugly duckling” of mobile device cameras. As compared to the optical zoom capabilities of DSLR cameras, the quality of digitally zoomed images has not been competitive, and conventional wisdom is that the complex optics and mechanisms of larger cameras can't be replaced with much more compact mobile device cameras and clever algorithms.

With the new Super Res Zoom feature on the Pixel 3, we are challenging that notion.

The Super Res Zoom technology in Pixel 3 is different and better than any previous digital zoom technique based on upscaling a crop of a single image, because we merge many frames directly onto a higher resolution picture. This results in greatly improved detail that is roughly competitive with the 2x optical zoom lenses on many other smartphones. Super Res Zoom means that if you pinch-zoom before pressing the shutter, you’ll get a lot more details in your picture than if you crop afterwards.
Crops of 2x Zoom: Pixel 2, 2017 vs. Super Res Zoom on the Pixel 3, 2018.
The Challenges of Digital Zoom
Digital zoom is tough because a good algorithm is expected to start with a lower resolution image and "reconstruct" missing details reliably. With typical digital zoom, a small crop of a single image is scaled up to produce a much larger image. Traditionally, this is done by linear interpolation methods, which attempt to recreate information that is not available in the original image, but introduce a blurry or “plasticky” look that lacks texture and details. In contrast, most modern single-image upscalers use machine learning (including our own earlier work, RAISR). These magnify some specific image features such as straight edges and can even synthesize certain textures, but they cannot recover natural high-resolution details. While we still use RAISR to enhance the visual quality of images, most of the improved resolution provided by Super Res Zoom (at least for modest zoom factors like 2-3x) comes from our multi-frame approach.

Color Filter Arrays and Demosaicing
Reconstructing fine details is especially difficult because digital photographs are already incomplete — they’ve been reconstructed from partial color information through a process called demosaicing. In typical consumer cameras, the camera sensor elements are meant to measure only the intensity of the light, not directly its color. To capture real colors present in the scene, cameras use a color filter array placed in front of the sensor so that each pixel measures only a single color (red, green, or blue). These are arranged in a Bayer pattern as shown in the diagram below.
A Bayer mosaic color filter. Every 2x2 group of pixels captures light filtered by a specific color — two green pixels (because our eyes are more sensitive to green), one red, and one blue. This pattern is repeated across the whole image.
A camera processing pipeline then has to reconstruct the real colors and all the details at all pixels, given this partial information.* Demosaicing starts by making a best guess at the missing color information, typically by interpolating from the colors in nearby pixels, meaning that two-thirds of an RGB digital picture is actually a reconstruction!
Demosaicing reconstructs missing color information by using neighboring pixels.
In its simplest form, this could be achieved by averaging from neighboring values. Most real demosaicing algorithms are more complicated than this, but they still lead to imperfect results and artifacts - as we are limited to only partial information. While this situation exists even for large-format DSLR cameras, their bigger sensors and larger lenses allow for more detail to be captured than is typical in a mobile camera.
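
The simplest form mentioned above, averaging from neighboring values, can be sketched as follows. The sketch assumes an RGGB layout for the Bayer pattern and a mosaic of at least 2x2 pixels; it is not the demosaicing used in any real camera pipeline.

```python
def bayer_channel(row, col):
    """Color measured at (row, col) in an (assumed) RGGB Bayer pattern."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"

def demosaic_average(mosaic):
    """Fill each pixel's R, G, B by averaging the same-color samples in its
    3x3 neighborhood; the pixel's own measurement is kept as-is."""
    height, width = len(mosaic), len(mosaic[0])
    output = [[None] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            rgb = {}
            for channel in "RGB":
                if bayer_channel(r, c) == channel:
                    rgb[channel] = float(mosaic[r][c])
                    continue
                samples = [mosaic[rr][cc]
                           for rr in range(max(0, r - 1), min(height, r + 2))
                           for cc in range(max(0, c - 1), min(width, c + 2))
                           if bayer_channel(rr, cc) == channel]
                rgb[channel] = sum(samples) / len(samples)
            output[r][c] = (rgb["R"], rgb["G"], rgb["B"])
    return output

# One 2x2 Bayer cell: R=100, G=50 and 60, B=20.
result = demosaic_average([[100, 50], [60, 20]])
```

At the red pixel only the value 100 was measured; the green and blue values are estimates, which is the "two-thirds reconstructed" situation described above.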

The situation gets worse if you pinch-zoom on a mobile device; then algorithms are forced to make up even more information, again by interpolation from the nearby pixels. However, not all is lost. This is where burst photography and the fusion of multiple images can be used to allow for super-resolution, even when limited by mobile device optics.

From Burst Photography to Multi-frame Super-resolution

While a single frame doesn't provide enough information to fill in the missing colors, we can get some of this missing information from multiple images taken successively. The process of capturing and combining multiple sequential photographs is known as burst photography. Google’s HDR+ algorithm, successfully used in Nexus and Pixel phones, already uses information from multiple frames to make photos from mobile phones reach the level of quality expected from a much larger sensor; could a similar approach be used to increase image resolution?

It has been known for more than a decade, including in astronomy where the basic concept is known as “drizzle”, that capturing and combining multiple images taken from slightly different positions can yield resolution equivalent to optical zoom, at least at low magnifications like 2x or 3x and in good lighting conditions. In this process, called multi-frame super-resolution, the general idea is to align and merge low-resolution bursts directly onto a grid of the desired (higher) resolution. Here's an example of how an idealized multi-frame super-resolution algorithm might work:
As compared to the standard demosaicing pipeline that needs to interpolate the missing colors (top), ideally, one could fill some holes from multiple images, each shifted by one pixel horizontally or vertically.
In the example above, we capture 4 frames, three of them shifted by exactly one pixel: in the horizontal, vertical, and both horizontal and vertical directions. All the holes would get filled, and there would be no need for any demosaicing at all! Indeed, some DSLR cameras support this operation, but only if the camera is on a tripod, and the sensor/optics are actively moved to different positions. This is sometimes called "microstepping".
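
The claim that four one-pixel-shifted frames fill every hole can be checked directly. The sketch below assumes an RGGB Bayer layout and idealized, exact one-pixel shifts.

```python
def bayer_channel(row, col):
    """Color measured at (row, col) in an (assumed) RGGB Bayer pattern."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"

def channels_seen(row, col, shifts):
    """Colors measured at scene position (row, col) across frames whose
    sensor is displaced by each (dy, dx) in `shifts`."""
    return {bayer_channel(row + dy, col + dx) for dy, dx in shifts}

# Base frame plus horizontal, vertical, and diagonal one-pixel shifts.
shifts = [(0, 0), (0, 1), (1, 0), (1, 1)]
complete = all(channels_seen(r, c, shifts) == {"R", "G", "B"}
               for r in range(4) for c in range(4))
single_frame = channels_seen(0, 0, [(0, 0)])
```

With all four frames, every scene position has a direct red, green, and blue measurement, so nothing needs to be interpolated.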

Over the years, the practical usage of this “super-res” approach to higher resolution imaging remained confined largely to the laboratory, or otherwise controlled settings where the sensor and the subject were aligned and the movement between them was either deliberately controlled or tightly constrained. For instance, in astronomical imaging, a stationary telescope sees a predictably moving sky. But in widely used imaging devices like the modern-day smartphone, the practical usage of super-res for zoom has remained mostly out of reach.

This is in part due to the fact that in order for this to work properly, certain conditions need to be satisfied. First, and most important, is that the lens needs to resolve detail better than the sensor used (in contrast, you can imagine a case where the lens is so poorly-designed that adding a better sensor provides no benefit). This property is often observed as an unwanted artifact of digital cameras called aliasing.

Image Aliasing
Aliasing occurs when a camera sensor is unable to faithfully represent all patterns and details present in a scene. A good example of aliasing is the Moiré pattern, sometimes seen on TV as a result of an unfortunate choice of wardrobe. Furthermore, the aliasing effect on a physical feature (such as an edge of a table) changes when things move in a scene. You can observe this in the following burst sequence, where slight motions of the camera during the burst sequence create time-varying alias effects:
Left: High-resolution, single image of a table edge against a high frequency patterned background, Right: Different frames from a burst. Aliasing and Moiré effects are visible between different frames — pixels seem to jump around and produce different colored patterns.
However, this behavior is a blessing in disguise: if one analyzes the patterns produced, it gives us the variety of color and brightness values, as discussed in the previous section, to achieve super-resolution. That said, many challenges remain, as practical super-resolution needs to work with a handheld mobile phone and on any burst sequence.
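
A one-dimensional example shows both the problem and why it helps. Sampled at integer pixel positions, a pattern at 0.9 cycles per pixel is indistinguishable from one at 0.1 cycles per pixel; shift the sampling grid by half a pixel and the two no longer agree. The frequencies and offsets here are chosen purely for illustration.

```python
import math

def sample(freq, offset, count=10):
    """Sample cos(2*pi*freq*x) at pixel centers x = n + offset."""
    return [math.cos(2 * math.pi * freq * (n + offset)) for n in range(count)]

def same(a, b, tol=1e-9):
    """True if two sample lists agree element-wise within tol."""
    return all(abs(x - y) < tol for x, y in zip(a, b))

fine, coarse = 0.9, 0.1  # cycles per pixel; 0.9 exceeds the 0.5 Nyquist limit
aligned_match = same(sample(fine, 0.0), sample(coarse, 0.0))
shifted_match = same(sample(fine, 0.5), sample(coarse, 0.5))
```

A single frame cannot tell the fine pattern from the coarse one, but a second frame at a different sub-pixel offset can, which is the extra information a burst provides.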

Practical Super-resolution Using Hand Motion

As noted earlier, some DSLR cameras offer special tripod super-resolution modes that work in a way similar to what we described so far. These approaches rely on the physical movement of the sensors and optics inside the camera, but require a complete stabilization of the camera otherwise, which is impractical in mobile devices, since they are nearly always handheld. This would seem to create a catch-22 for super-resolution imaging on mobile platforms.

However, we turn this difficulty on its head, by using the hand-motion to our advantage. When we capture a burst of photos with a handheld camera or phone, there is always some movement present between the frames. Optical Image Stabilization (OIS) systems compensate for large camera motions - typically 5-20 pixels between successive frames spaced 1/30 second apart - but are unable to completely eliminate faster, lower magnitude, natural hand tremor, which occurs for everyone (even those with “steady hands”). When taking photos using mobile phones with a high resolution sensor, this hand tremor has a magnitude of just a few pixels.
Effect of hand tremor as seen in a cropped burst, after global alignment.
To take advantage of hand tremor, we first need to align the pictures in a burst together. We choose a single image in the burst as the “base” or reference frame, and align every other frame relative to it. After alignment, the images are combined together roughly as in the diagram shown earlier in this post. Of course, handshake is unlikely to move the image by exactly single pixels, so we need to interpolate between adjacent pixels in each newly captured frame before injecting the colors into the pixel grid of our base frame.
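
The merge step can be sketched in one dimension. This sketch assumes the sub-pixel offset of each frame relative to the base frame is already known from alignment, and it spreads each sample linearly over the two nearest cells of a finer grid; the function name and the weighting scheme are illustrative assumptions rather than the method used on the Pixel 3.

```python
import math

def merge_frames(frames, offsets, scale=2):
    """Merge 1-D frames onto a grid `scale` times finer. Each sample lands at
    a fractional fine-grid position and is split linearly between the two
    nearest cells; cells are then normalized by their total weight."""
    size = len(frames[0]) * scale
    total = [0.0] * size
    weight = [0.0] * size
    for frame, offset in zip(frames, offsets):
        for n, value in enumerate(frame):
            position = (n + offset) * scale
            left = math.floor(position)
            frac = position - left
            for cell, w in ((left, 1.0 - frac), (left + 1, frac)):
                if 0 <= cell < size and w > 0:
                    total[cell] += w * value
                    weight[cell] += w
    # Cells that received no samples stay empty (None).
    return [t / w if w > 0 else None for t, w in zip(total, weight)]

# Two frames of a linear ramp, the second shifted by half a pixel.
merged = merge_frames([[0, 2, 4], [1, 3, 5]], offsets=[0.0, 0.5])
```

With an exact half-pixel shift the second frame lands precisely on the in-between cells; with arbitrary hand motion each sample would be shared between neighbors instead.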

When hand motion is not present because the device is completely stabilized (e.g. placed on a tripod), we can still achieve our goal of simulating natural hand motion by intentionally “jiggling” the camera, by forcing the OIS module to move slightly between the shots. This movement is extremely small and chosen such that it doesn’t interfere with normal photos - but you can observe it yourself on Pixel 3 by holding the phone perfectly still, such as by pressing it against a window, and maximally pinch-zooming the viewfinder. Look for a tiny but continuous elliptical motion in distant objects, like that shown below.
Overcoming the Challenges of Super-resolution
The description of the ideal process we gave above sounds simple, but super-resolution is not that easy. There are many reasons why it hasn’t been widely used in consumer products like mobile phones, and making it practical requires significant algorithmic innovations. Challenges can include:
  • A single image from a burst is noisy, even in good lighting. A practical super-resolution algorithm needs to be aware of this noise and work correctly despite it. We don’t want to get just a higher resolution noisy image - our goal is to both increase the resolution and produce a much less noisy result.
    Left: Single frame from a burst taken in good light conditions can still contain a substantial amount of noise due to underexposure. Right: Result of merging multiple frames after burst processing.
  • Motion between images in a burst is not limited to just the movement of the camera. There can be complex motions in the scene such as wind-blown leaves, ripples moving across the surface of water, cars, people moving or changing their facial expressions, or the flicker of a flame — even some movements that cannot be assigned a single, unique motion estimate because they are transparent or multi-layered, such as smoke or glass. Completely reliable and localized alignment is generally not possible, and therefore a good super-resolution algorithm needs to work even if motion estimation is imperfect.
  • Because much of the motion is random, even if there is good alignment, the data may be dense in some areas of the image and sparse in others. The crux of super-resolution is a complex interpolation problem, so the irregular spread of data makes it challenging to produce a higher-resolution image in all parts of the grid.
All the above challenges would seem to make real-world super-resolution either infeasible in practice, or at best limited to only static scenes and a camera placed on a tripod. With Super Res Zoom on Pixel 3, we’ve developed a stable and accurate burst resolution enhancement method that uses natural hand motion, and is robust and efficient enough to deploy on a mobile phone.

Here’s how we’ve addressed some of these challenges:
  • To effectively merge frames in a burst, and to produce a red, green, and blue value for every pixel without the need for demosaicing, we developed a method of integrating information across the frames that takes into account the edges of the image, and adapts accordingly. Specifically, we analyze the input frames and adjust how we combine them together, trading off increase in detail and resolution vs. noise suppression and smoothing. We accomplish this by merging pixels along the direction of apparent edges, rather than across them. The net effect is that our multi-frame method provides the best practical balance between noise reduction and enhancement of details.
    Left: Merged image with sub-optimal tradeoff of noise reduction and enhanced resolution. Right: The same merged image with a better tradeoff.
  • To make the algorithm handle scenes with complex local motion (people, cars, water or tree leaves moving) reliably, we developed a robustness model that detects and mitigates alignment errors. We select one frame as a “reference image”, and merge information from other frames into it only if we’re sure that we have found the correct corresponding feature. In this way, we can avoid artifacts like “ghosting” or motion blur, and wrongly merged parts of the image.
    A fast moving bus in a burst of images. Left: Merge without robustness model. Right: Merge with robustness model.
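
The idea of merging only where the frames agree with the reference can be illustrated with a very simple per-pixel rule. The post does not describe the actual robustness model, so the fixed `tolerance` test below is purely an assumed stand-in.

```python
def robust_merge(reference, aligned_frames, tolerance=10.0):
    """Average each reference pixel with the aligned frames' values, but only
    those within `tolerance` of the reference; others are treated as
    misaligned (for example, a moving object) and ignored."""
    merged = []
    for i, ref_value in enumerate(reference):
        accepted = [ref_value]
        for frame in aligned_frames:
            if abs(frame[i] - ref_value) <= tolerance:
                accepted.append(frame[i])
        merged.append(sum(accepted) / len(accepted))
    return merged

# Second pixel of the first frame is corrupted by a moving object.
reference = [100, 100]
frames = [[104, 200], [96, 100]]
merged = robust_merge(reference, frames)
naive = [(reference[i] + frames[0][i] + frames[1][i]) / 3 for i in range(2)]
```

The naive average lets the outlier bleed into the result as a ghost, while the guarded merge falls back to fewer frames at that pixel and accepts a little more noise there.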
Pushing the State of the Art in Mobile Photography
The Portrait mode last year, and the HDR+ pipeline before it, showed how good mobile photography can be. This year, we set out to do the same for zoom. That’s another step in advancing the state of the art in computational photography, while shrinking the quality gap between mobile photography and DSLRs. Here is an album containing full FOV images, followed by Super Res Zoom images. Note that the Super Res Zoom images in this album are not cropped — they are captured directly on-device using pinch-zoom.
Left: Crop of 7x zoomed image on Pixel 2. Right: Same crop from Super Res Zoom on Pixel 3.
The idea of super-resolution predates the advent of smartphones by at least a decade. For nearly as long, it has also lived in the public imagination through films and television. It’s been the subject of thousands of papers in academic journals and conferences. Now, it is real, in the palm of your hand, in Pixel 3.
An illustrative animation of Super Res Zoom. When the user takes a zoomed photo, the Pixel 3 takes advantage of the user’s natural hand motion and captures a burst of images at subtly different positions. These are then merged together to add detail to the final image.
Super Res Zoom is the result of a collaboration across several teams at Google. The project would not have been possible without the joint efforts of teams managed by Peyman Milanfar, Marc Levoy, and Bill Freeman. The authors would like to thank Marc Levoy and Isaac Reynolds in particular for their assistance in the writing of this blog.

The authors wish to especially acknowledge the following key contributors to the Super Res Zoom project: Ignacio Garcia-Dorado, Haomiao Jiang, Manfred Ernst, Michael Krainin, Daniel Vlasic, Jiawen Chen, Pascal Getreuer, and Chia-Kai Liang. The project also benefited greatly from contributions and feedback by Ce Liu, Damien Kelly, and Dillon Sharlet.

How to get the most out of Super Res Zoom?
Here are some tips on getting the best out of Super Res Zoom on a Pixel 3 phone:
  • Pinch and zoom, or use the + button to increase zoom by discrete steps.
  • Double-tap the preview to quickly toggle between zoomed in and zoomed out.
  • Super Res Zoom works well at all zoom factors, though for performance reasons it activates only above 1.2x. That’s about halfway between no zoom and the first “click” in the zoom UI.
  • There are fundamental limits to the optical resolution of a wide-angle camera. So to get the most out of (any) zoom, keep the magnification factor modest.
  • Avoid fast-moving objects. Super Res Zoom will capture them correctly, but you are unlikely to get increased resolution.

* It’s worth noting that the situation is similar in some ways to how we see: in human (and other mammalian) eyes, different cone cells are sensitive to specific colors, with the brain filling in the details to reconstruct the full image.

Source: Google AI Blog