Tag Archives: Computational Photography

HDR+ with Bracketing on Pixel Phones

We're continuously working to improve the Pixel — making it more helpful, more capable, and more fun — with regular updates, such as the recent V8.2 update to the Camera app. One such improvement (launched on Pixel 5 and Pixel 4a (5G) in October) is a feature that operates “under the hood”: HDR+ with Bracketing. This feature works by merging images taken with different exposure times to improve image quality (especially in shadows), resulting in more natural colors, improved details and texture, and reduced noise.

Why Are HDR Scenes Hard to Capture?
The original HDR+ burst photography system, the engine behind high-quality mobile photography, captures a rapid series of deliberately underexposed images, then combines and renders them in a way that preserves detail across the range of tones. But this system had one limitation: scenes with high dynamic range (HDR), like the one below, were noisy in the shadows because all of the captured images were underexposed.

The same photo using HDR+ (red outline) and HDR+ with Bracketing (green outline). While the characteristic HDR+ look remains the same, bracketing improves image quality, especially in shadows, with more natural colors, improved details and texture, and reduced noise.

Capturing HDR scenes is difficult because of the physical constraints of image sensors combined with limited signal in the shadows. We can correctly expose either the shadows or the highlights, but not both at the same time.

The same scene shot with different exposure settings and tonemapped to similar overall brightness. Left/Top: Exposure set for the highlights. The bright blue sky is preserved, but the shadows are very noisy. Right/Bottom: Exposure set for the shadows. Noise in the shadows is reduced, but the sky is clipped (white).

Photographers sometimes work around these limitations by taking two different exposures and combining them. This approach, known as exposure bracketing, can deliver the best of both worlds, but it is time-consuming to do by hand. It is also challenging in computational photography because it requires:

  1. Capturing additional long exposure frames while maintaining the fast, predictable capture experience of the Pixel camera.
  2. Taking advantage of long exposure frames while avoiding ghosting artifacts caused by motion between frames.

To avoid these challenges, the original HDR+ system used a different approach to handle high dynamic range scenes.

The Limits of HDR+
The capture strategy used by HDR+ is based on underexposure, which avoids loss of detail in the highlights. While this strategy comes at the expense of noise in the shadows, HDR+ offsets the increased noise through the use of burst photography.

Using bursts to improve image quality. HDR+ starts from a burst of full-resolution raw images (left). Depending on conditions, between 2 and 15 images are aligned and merged into a computational raw image (middle). The merged image has reduced noise and increased dynamic range, leading to a higher quality final result (right).

This approach works well for scenes with moderate dynamic range, but breaks down for HDR scenes. To understand why, we need to take a closer look at how two types of noise get into an image.

Noise in Burst Photography
One important type of noise is called shot noise, which depends only on the total amount of light captured — the sum of N frames, each with E seconds of exposure time, has the same amount of shot noise as a single frame exposed for N × E seconds. If this were the only type of noise present in captured images, burst photography would be as efficient as taking longer exposures. Unfortunately, a second type of noise, read noise, is introduced by the sensor every time a frame is captured. Read noise doesn’t depend on the amount of light captured, but instead depends on the number of frames taken — that is, with each frame taken, an additional fixed amount of read noise is added.

This is why using burst photography to reduce total noise isn’t as efficient as simply taking longer exposures: taking multiple frames can reduce the effect of shot noise, but will also increase read noise. Even though read noise increases with the number of frames, it is still possible to reduce the overall noisiness with burst photography, but it becomes less efficient. If one were to break a long exposure into N shorter exposures, the ratio of signal to noise in the final image would be lower because of the additional read noise. In this case, to get back to the signal-to-noise ratio of the single long exposure, one would need to merge N² short-exposure frames. In the example below, if a long exposure were divided into 12 short exposures, we'd have to capture 144 (12 × 12) short frames to match the signal-to-noise ratio in the shadows! Capturing and processing this many frames would be much more time consuming — burst capture and processing could take over a minute and result in a poor user experience. Instead, with bracketing one can capture both short and long exposures — combining highlight protection and noise reduction.

Left: The result of merging 12 short-exposure frames in Night Sight mode. Right: A single frame whose exposure time is 12 times longer than an individual short exposure. The longer exposure has significantly less noise in the shadows but sacrifices the highlights.
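
To make the arithmetic above concrete, here is a minimal Python sketch of this simple noise model. The electron counts and read-noise level are made-up, deliberately read-noise-dominated numbers chosen only to illustrate the N² relationship; they are not measurements from any Pixel sensor.

```python
import math

def merged_snr(signal_per_frame_e, read_noise_e, num_frames):
    """SNR of a merge of num_frames frames under a simple noise model:
    shot-noise variance equals the collected signal (in electrons), and a
    fixed read-noise variance is added once per captured frame."""
    total_signal = num_frames * signal_per_frame_e
    shot_var = total_signal
    read_var = num_frames * read_noise_e ** 2
    return total_signal / math.sqrt(shot_var + read_var)

# Illustrative, read-noise-dominated shadow region: 1 e- of signal per short
# frame, 20 e- of read noise per frame (numbers are invented for clarity).
s, r = 1.0, 20.0

print("1 long frame (12x exposure):", round(merged_snr(12 * s, r, 1), 2))
print("12 short frames:            ", round(merged_snr(s, r, 12), 2))
print("144 short frames:           ", round(merged_snr(s, r, 144), 2))
# In this read-noise-limited regime, 144 = 12^2 short frames are needed to
# roughly match the SNR of the single long exposure.
```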

Solving with Bracketing
While the challenges of bracketing prevented the original HDR+ system from using it, incremental improvements since then, plus a recent concentrated effort, have made it possible in the Camera app. To start, adding bracketing to HDR+ required redesigning the capture strategy. Capturing is complicated by zero shutter lag (ZSL), which underpins the fast capture experience on Pixel. With ZSL, the frames displayed in the viewfinder before the shutter press are the frames we use for HDR+ burst merging. For bracketing, we capture an additional long exposure frame after the shutter press, which is not shown in the viewfinder. Note that holding the camera still for half a second after the shutter press to accommodate the long exposure can help improve image quality, even with a typical amount of handshake.

Capture strategy. Top: The original HDR+ method captures short exposures before the shutter press, six in this example. Bottom: HDR+ with Bracketing captures five short exposures before the shutter press and one long exposure after the shutter press.

For Night Sight, the capture strategy isn't constrained by the viewfinder — because all frames are captured after the shutter press while the viewfinder is stopped, this mode easily accommodates capturing longer exposure frames. In this case, we capture three long exposures to further reduce noise.

Capture strategy for Night Sight. Top: The original Night Sight captured 15 short exposure frames. Bottom: Night Sight with bracketing captures 12 short and 3 long exposures.

The Merging Algorithm
When merging bracketed shots, we choose one of the short frames as the reference frame to avoid potentially clipped highlights and motion blur. All other frames are aligned to this frame before they are merged. This introduces a challenge — for complex scene motion or occluded regions, it is impossible to find exactly matching regions and a naïve merge algorithm would produce ghosting artifacts in these cases.

Left: Ghosting artifacts are visible around the silhouette of a moving person, when deghosting is disabled.
Right: Robust merging produces a clean image.

To address this, we designed a new spatial merge algorithm, similar to the one used for Super Res Zoom, that decides per pixel whether image content should be merged or not. This deghosting is more complicated for frames with different exposures. Long exposure frames have different noise characteristics, clipped highlights, and different amounts of motion blur, which makes comparisons with the short exposure reference frame more difficult. In addition, ghosting artifacts are more visible in bracketed shots, because noise that would otherwise mask these errors is reduced. Despite those challenges, our algorithm is as robust to these issues as the original HDR+ and Super Res Zoom and doesn’t produce ghosting artifacts. At the same time, it merges images 40% faster than its predecessors. Because it merges RAW images early in the photographic pipeline, we were able to achieve all of those benefits while keeping the rest of processing and the signature HDR+ look unchanged. Furthermore, users who prefer to use computational RAW images can take advantage of those image quality and performance improvements.
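
As a rough illustration of the idea of deciding per pixel whether content should be merged, here is a toy robust merge in Python. It is not Google's algorithm: the Gaussian weight falloff and the noise parameter are assumptions, and bracketed frames would additionally need their exposures normalized before any comparison with the short-exposure reference.

```python
import numpy as np

def robust_merge(reference, aligned_frames, noise_sigma=0.02):
    """Toy per-pixel robust merge: each aligned frame contributes to a pixel
    only to the extent that it agrees with the reference, relative to the
    expected noise level. Pixels that disagree (e.g. due to motion or
    occlusion) fall back to the reference, avoiding ghosting at the cost of
    less noise reduction there."""
    accum = reference.astype(np.float32).copy()
    weight = np.ones_like(accum)
    for frame in aligned_frames:
        frame = frame.astype(np.float32)
        diff = np.abs(frame - reference)
        # Weight decays smoothly as the difference exceeds the noise level.
        w = np.exp(-(diff / (2.0 * noise_sigma)) ** 2)
        accum += w * frame
        weight += w
    return accum / weight

# Usage with synthetic data: 1 reference + 3 aligned frames, values in [0, 1].
rng = np.random.default_rng(0)
ref = rng.random((8, 8)).astype(np.float32)
frames = [np.clip(ref + rng.normal(0, 0.02, ref.shape), 0, 1) for _ in range(3)]
print(robust_merge(ref, frames).shape)
```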

Bracketing on Pixel
HDR+ with Bracketing is available to users of Pixel 4a (5G) and 5 in the default camera, as well as in Night Sight and Portrait modes. For users of Pixel 4 and 4a, the Google Camera app supports bracketing in Night Sight mode. No user interaction is needed to activate HDR+ with Bracketing — depending on the dynamic range of the scene and the presence of motion, HDR+ with Bracketing chooses the best exposures to maximize image quality (examples).

Acknowledgements
HDR+ with Bracketing is the result of a collaboration across several teams at Google. The project would not have been possible without the joint efforts of Sam Hasinoff, Dillon Sharlet, Kiran Murthy, Mike Milne, Andy Radin, Nicholas Wilson, Navin Sarma‎, Gabriel Nava, Emily To, Sushil Nath, Alexander Schiffhauer, Isaac Reynolds, Bill Strathearn, Marius Renn, Alex Hong, Jose Ricardo Lima, Bob Hung, Ying Chen Lou, Joy Hsu, Blade Chiu, David Massoud, Jean Hsu, Ellie Yang, and Marc Levoy.

Source: Google AI Blog


Portrait Light: Enhancing Portrait Lighting with Machine Learning

Professional portrait photographers are able to create compelling photographs by using specialized equipment, such as off-camera flashes and reflectors, and expert knowledge to capture just the right illumination of their subjects. In order to allow users to better emulate professional-looking portraits, we recently released Portrait Light, a new post-capture feature for the Pixel Camera and Google Photos apps that adds a simulated directional light source to portraits, with the directionality and intensity set to complement the lighting from the original photograph.

Example image with and without Portrait Light applied. Note how Portrait Light contours the face, adding dimensionality, volume, and visual interest.

In the Pixel Camera on Pixel 4, Pixel 4a, Pixel 4a (5G), and Pixel 5, Portrait Light is automatically applied post-capture to images in the default mode and to Night Sight photos that include people — just one person or even a small group. In Portrait Mode photographs, Portrait Light provides more dramatic lighting to accompany the shallow depth-of-field effect already applied, resulting in a studio-quality look. But because lighting can be a personal choice, Pixel users who shoot in Portrait Mode can manually re-position and adjust the brightness of the applied lighting within Google Photos to match their preference. For those running Google Photos on Pixel 2 or newer, this relighting capability is also available for many pre-existing portrait photographs.

Pixel users can adjust a portrait’s lighting as they like in Google Photos, after capture.

Today we present the technology behind Portrait Light. Inspired by the off-camera lights used by portrait photographers, Portrait Light models a repositionable light source that can be added into the scene, with the initial lighting direction and intensity automatically selected to complement the existing lighting in the photo. We accomplish this by leveraging novel machine learning models, each trained using a diverse dataset of photographs captured in the Light Stage computational illumination system. These models enabled two new algorithmic capabilities:

  1. Automatic directional light placement: For a given portrait, the algorithm places a synthetic directional light in the scene consistent with how a photographer would have placed an off-camera light source in the real world.
  2. Synthetic post-capture relighting: For a given lighting direction and portrait, synthetic light is added in a way that looks realistic and natural.

These innovations enable Portrait Light to help create attractive lighting at any moment for every portrait — all on your mobile device.

Automatic Light Placement
Photographers usually rely on perceptual cues when deciding how to augment environmental illumination with off-camera light sources. They assess the intensity and directionality of the light falling on the face, and also adjust their subject’s head pose to complement it. To inform Portrait Light’s automatic light placement, we developed computational equivalents to these two perceptual signals.

First, we trained a novel machine learning model to estimate a high dynamic range, omnidirectional illumination profile for a scene based on an input portrait. This new lighting estimation model infers the direction, relative intensity, and color of all light sources in the scene coming from all directions, considering the face as a light probe. We also estimate the head pose of the portrait’s subject using MediaPipe Face Mesh.

Estimating the high dynamic range, omnidirectional illumination profile from an input portrait. The three spheres at the right of each image, diffuse (top), matte silver (middle), and mirror (bottom), are rendered using the estimated illumination, each reflecting the color, intensity, and directionality of the environmental lighting.

Using these clues, we determine the direction from which the synthetic lighting should originate. In studio portrait photography, the main off-camera light source, or key light, is placed about 30° above the eyeline and between 30° and 60° off the camera axis, when viewing the scene from overhead. We follow this guideline for a classic portrait look, enhancing any pre-existing lighting directionality in the scene while targeting a balanced, subtle key-to-fill lighting ratio of about 2:1.
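
For illustration, the sketch below converts that guideline into a light direction vector. The camera-centric coordinate frame and the default angles are assumptions made for this example, not the actual Portrait Light implementation.

```python
import math

def key_light_direction(elevation_deg=30.0, azimuth_deg=45.0):
    """Direction *toward* a synthetic key light in a simple camera-centric
    frame (+x right, +y up, +z from the subject toward the camera).
    Elevation is measured up from the eyeline and azimuth off the camera
    axis; the defaults follow the classic studio guideline (azimuth anywhere
    in roughly 30-60 degrees works)."""
    el = math.radians(elevation_deg)
    az = math.radians(azimuth_deg)
    x = math.cos(el) * math.sin(az)
    y = math.sin(el)
    z = math.cos(el) * math.cos(az)
    return (x, y, z)

# The ~2:1 key-to-fill guideline governs the light's intensity, not its
# direction, so it is not part of this sketch.
print(key_light_direction())            # classic key position
print(key_light_direction(30.0, 60.0))  # wider key, still within the guideline
```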

Data-Driven Portrait Relighting
Given a desired lighting direction and portrait, we next trained a new machine learning model to add the illumination from a directional light source to the original photograph. Training the model required millions of pairs of portraits both with and without extra light. Photographing such a dataset in normal settings would have been impossible because it requires near-perfect registration of portraits captured across different lighting conditions.

Instead, we generated training data by photographing seventy different people using the Light Stage computational illumination system. This spherical lighting rig includes 64 cameras with different viewpoints and 331 individually-programmable LED light sources. We photographed each individual illuminated one-light-at-a-time (OLAT) by each light, which generates their reflectance field — or their appearance as illuminated by the discrete sections of the spherical environment. The reflectance field encodes the unique color and light-reflecting properties of the subject’s skin, hair, and clothing — how shiny or dull each material appears. Due to the superposition principle for light, these OLAT images can then be linearly added together to render realistic images of the subject as they would appear in any image-based lighting environment, with complex light transport phenomena like subsurface scattering correctly represented.
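
The superposition principle translates directly into code: a relit image is a weighted sum of the OLAT images. The sketch below is a minimal version of that idea; the array shapes and the `relight_from_olat` helper are illustrative, not the production renderer.

```python
import numpy as np

def relight_from_olat(olat_images, light_weights):
    """Relight a subject from its one-light-at-a-time (OLAT) captures.

    olat_images:   array of shape (num_lights, H, W, 3), linear RGB, the
                   subject lit by each Light Stage LED in turn.
    light_weights: array of shape (num_lights, 3), the RGB intensity of the
                   corresponding direction in the target lighting environment.

    Because light adds linearly (superposition), the relit image is simply a
    weighted sum of the OLAT images.
    """
    olat = np.asarray(olat_images, dtype=np.float32)
    w = np.asarray(light_weights, dtype=np.float32)
    return np.einsum('nhwc,nc->hwc', olat, w)

# Toy usage: 331 lights (as in the Light Stage), tiny 4x4 "images".
num_lights = 331
olat = np.random.rand(num_lights, 4, 4, 3).astype(np.float32)
env = np.random.rand(num_lights, 3).astype(np.float32) / num_lights
print(relight_from_olat(olat, env).shape)  # (4, 4, 3)
```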

Using the Light Stage, we photographed many individuals with different face shapes, genders, skin tones, hairstyles, and clothing/accessories. For each person, we generated synthetic portraits in many different lighting environments, both with and without the added directional light, rendering millions of pairs of images. This dataset encouraged model performance across diverse lighting environments and individuals.

Photographing an individual as illuminated one-light-at-a-time in the Google Light Stage, a 360° computational illumination rig.
Left: Example images from an individual’s photographed reflectance field, their appearance in the Light Stage as illuminated one-light-at-a-time. Right: The images can be added together to form the appearance of the subject in any novel lighting environment.

Learning Detail-Preserving Relighting Using the Quotient Image
Rather than trying to directly predict the output relit image, we trained the relighting model to output a low-resolution quotient image, i.e., a per-pixel multiplier that when upsampled can be applied to the original input image to produce the desired output image with the contribution of the extra light source added. This technique is computationally efficient and encourages only low-frequency lighting changes, without impacting high-frequency image details, which are directly transferred from the input to maintain image quality.
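
A minimal sketch of how such a quotient image would be applied is shown below, assuming a low-resolution multiplier predicted by the model; the nearest-neighbor upsampling and the function name are placeholders for whatever the real pipeline uses.

```python
import numpy as np

def apply_quotient_image(portrait, quotient_lowres):
    """Apply a predicted low-resolution quotient image to a full-resolution
    portrait: upsample the per-pixel multiplier, then multiply. High-frequency
    detail comes straight from the input image; only the low-frequency
    lighting changes. (Sketch only: nearest-neighbor upsampling stands in for
    the smoother upsampler a real pipeline would use.)"""
    h, w = portrait.shape[:2]
    qh, qw = quotient_lowres.shape[:2]
    assert h % qh == 0 and w % qw == 0, "toy version needs integer scale factors"
    q = np.repeat(np.repeat(quotient_lowres, h // qh, axis=0), w // qw, axis=1)
    return portrait * q

portrait = np.random.rand(256, 192, 3).astype(np.float32)            # full-res input
quotient = 1.0 + 0.3 * np.random.rand(32, 24, 3).astype(np.float32)  # model output
print(apply_quotient_image(portrait, quotient).shape)                # (256, 192, 3)
```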

Supervising Relighting with Geometry Estimation
When photographers add an extra light source into a scene, its orientation relative to the subject’s facial geometry determines how much brighter each part of the face appears. To model the optical behavior of light sources reflecting off relatively matte surfaces, we first trained a machine learning model to estimate surface normals given the input photograph, and then applied Lambert’s law to compute a “light visibility map” for the desired lighting direction. We provided this light visibility map as input to the quotient image predictor, ensuring that the model is trained using physics-based insights.

The pipeline of our relighting network. Given an input portrait, we estimate per-pixel surface normals, which we then use to compute a light visibility map. The model is trained to produce a low-resolution quotient image that, when upsampled and applied as a multiplier to the original image, produces the original portrait with an extra light source added synthetically into the scene.
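
The light visibility map described above amounts to a clamped dot product between each estimated surface normal and the desired light direction. Below is a small sketch of that computation, assuming per-pixel unit normals are available; the function name and array layout are illustrative.

```python
import numpy as np

def light_visibility_map(normals, light_dir):
    """Lambertian 'light visibility map': for each pixel, how directly the
    desired light direction hits the estimated surface normal. normals has
    shape (H, W, 3) (unit vectors); light_dir is a 3-vector pointing from the
    surface toward the light. Values are clamped at zero where the surface
    faces away from the light, per Lambert's law."""
    l = np.asarray(light_dir, dtype=np.float32)
    l = l / np.linalg.norm(l)
    return np.clip(np.einsum('hwc,c->hw', normals.astype(np.float32), l), 0.0, None)

# Toy usage: normals all facing the camera (+z), light up and to the right.
normals = np.zeros((4, 4, 3), dtype=np.float32)
normals[..., 2] = 1.0
print(light_visibility_map(normals, [0.5, 0.5, 0.7]))
```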

We optimized the full pipeline to run at interactive frame-rates on mobile devices, with total model size under 10 MB. Here are a few examples of Portrait Light in action.

Portrait Light in action.

Getting the Most Out of Portrait Light
You can try Portrait Light in the Pixel Camera and change the light position and brightness to your liking in Google Photos. For those who use Dual Exposure Controls, Portrait Light can be applied post-capture for additional creative flexibility to find just the right balance between light and shadow. On existing images from your Google Photos library, try it on photos where faces are slightly underexposed; Portrait Light can illuminate and highlight your subject. It will especially benefit images with a single individual posed directly at the camera.

We see Portrait Light as the first step on the journey towards creative post-capture lighting controls for mobile cameras, powered by machine learning.

Acknowledgements
Portrait Light is the result of a collaboration between Google Research, Google Daydream, Pixel, and Google Photos teams. Key contributors include: Yun-Ta Tsai, Rohit Pandey, Sean Fanello, Chloe LeGendre, Michael Milne, Ryan Geiss, Sam Hasinoff, Dillon Sharlet, Christoph Rhemann, Peter Denny, Kaiwen Guo, Philip Davidson, Jonathan Taylor, Mingsong Dou, Pavel Pidlypenskyi, Peter Lincoln, Jay Busch, Matt Whalen, Jason Dourgarian, Geoff Harvey, Cynthia Herrera, Sergio Orts Escolano, Paul Debevec, Jonathan Barron, Sofien Bouaziz, Clement Ng, Rachit Gupta, Jesse Evans, Ryan Campbell, Sonya Mollinger, Emily To, Yichang Shih, Jana Ehmann, Wan-Chun Alex Ma, Christina Tong, Tim Smith, Tim Ruddick, Bill Strathearn, Jose Lima, Chia-Kai Liang, David Salesin, Shahram Izadi, Navin Sarma, Nisha Masharani, Zachary Senzer.


1  Work conducted while at Google. 

Source: Google AI Blog


Live HDR+ and Dual Exposure Controls on Pixel 4 and 4a



High dynamic range (HDR) imaging is a method for capturing scenes with a wide range of brightness, from deep shadows to bright highlights. On Pixel phones, the engine behind HDR imaging is HDR+ burst photography, which involves capturing a rapid burst of deliberately underexposed images, combining them, and rendering them in a way that preserves detail across the range of tones. Until recently, one challenge with HDR+ was that it could not be computed in real time (i.e., at 30 frames per second), which prevented the viewfinder from matching the final result. For example, bright white skies in the viewfinder might appear blue in the HDR+ result.

Starting with Pixel 4 and 4a, we have improved the viewfinder using a machine-learning-based approximation to HDR+, which we call Live HDR+. This provides a real-time preview of the final result, making HDR imaging more predictable. We also created dual exposure controls, which generalize the classic “exposure compensation” slider into two controls for separately adjusting the rendition of shadows and highlights. Together, Live HDR+ and dual exposure controls provide HDR imaging with real-time creative control.
Live HDR+ on Pixel 4 and 4a helps the user compose their shot with a WYSIWYG viewfinder that closely resembles the final result. You can see individual images here. Photos courtesy of Florian Kainz.
The HDR+ Look
When the user presses the shutter in the Pixel camera app, it captures 3-15 underexposed images. These images are aligned and merged to reduce noise in the shadows, producing a 14-bit intermediate “linear RGB image” with pixel values proportional to the scene brightness. What gives HDR+ images their signature look is the "tone mapping" of this image, reducing the range to 8 bits and making it suitable for display.

Consider the backlit photo of a motorcyclist, below. While the linear RGB image contains detail in both the dark motorcycle and bright sky, the dynamic range is too high to see it. The simplest method to reveal more detail is to apply a “global curve”, remapping all pixels with a particular brightness to some new value. However, for an HDR scene with details in both shadows and highlights, no single curve is satisfactory.
Different ways to tone-map a linear RGB image. (a) The original, “un-tone-mapped” image. (b) Global curve optimizing for the sky. (c) Global curve optimizing for the subject. (d) HDR+, which preserves details everywhere. In the 2D histogram, brighter areas indicate where more pixels of a given input brightness are mapped to the same output. The overlapping shapes show that the relationship cannot be modeled using a single curve. Photo courtesy of Nicholas Wilson.
In contrast to applying a single curve, HDR+ uses a local tone mapping algorithm to ensure that the final result contains detail everywhere, while keeping edges and textures looking natural. Effectively, this applies a different curve to different regions, depending on factors such as overall brightness, local texture, and amount of noise. Unfortunately, HDR+ is too slow to run live in the viewfinder, requiring an alternative approach for Live HDR+.

Local Curve Approximation for Live HDR+
Using a single tone curve does not produce a satisfying result for the entire image — but how about for a small region? Consider the small red patch in the figure below. Although the patch includes both shadows and highlights, the relationship between input and output brightness follows a smooth curve. Furthermore, the curve varies gradually. For the blue patch, shifted ten pixels to the right, both the image content and curve are similar. But while the curve approximation works well for small patches, it breaks down for larger patches. For the larger yellow patch, the input/output relationship is more complicated, and not well approximated by a single curve.
(a) Input and HDR+ result. (b) The effect of HDR+ on a small patch (red) is approximately a smooth curve. (c) The relationship is nearly identical for the nearby blue patch. (d) However, if the patch is too big, a single curve will no longer provide a good fit.
To address this challenge, we divide the input image into “tiles” of size roughly equal to the red patch in the figure above, and approximate HDR+ using a curve for each tile. Since these curves vary gradually, blending between curves is a good way to approximate the optimal curve at any pixel. To render a pixel we apply the curves from each of the four nearest tiles, then blend the results according to the distances to the respective tile centers.
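
The sketch below illustrates this tile-and-blend scheme with per-tile lookup tables; the tile size, the curve parameterization, and the curves themselves are made up for the example and are not the curves HDRnet predicts.

```python
import numpy as np

def apply_tiled_curves(image, curves, tile_size):
    """Toy per-tile tone mapping with bilinear blending.

    image:  (H, W) luminance in [0, 1].
    curves: (TY, TX, 256) lookup tables, one tone curve per tile.
    For each pixel, the curves of the four nearest tile centers are applied
    and the results are blended according to distance to those centers."""
    h, w = image.shape
    ty, tx = curves.shape[:2]
    # Continuous tile coordinates of every pixel (tile centers at .5 offsets).
    gy = np.clip((np.arange(h) + 0.5) / tile_size - 0.5, 0, ty - 1)
    gx = np.clip((np.arange(w) + 0.5) / tile_size - 0.5, 0, tx - 1)
    y0 = np.floor(gy).astype(int); y1 = np.minimum(y0 + 1, ty - 1)
    x0 = np.floor(gx).astype(int); x1 = np.minimum(x0 + 1, tx - 1)
    fy = (gy - y0)[:, None]; fx = (gx - x0)[None, :]
    idx = np.clip((image * 255).astype(int), 0, 255)   # LUT index per pixel

    def lut(yi, xi):
        # Look up each pixel's value in the curve of tile (yi, xi).
        return curves[yi[:, None], xi[None, :], idx]

    return ((1 - fy) * (1 - fx) * lut(y0, x0) + (1 - fy) * fx * lut(y0, x1)
            + fy * (1 - fx) * lut(y1, x0) + fy * fx * lut(y1, x1))

# Usage: 256x256 image, 16x16 grid of tiles, gamma-like curves per tile.
img = np.random.rand(256, 256).astype(np.float32)
x = np.linspace(0, 1, 256, dtype=np.float32)
curves = np.stack([np.stack([x ** (0.4 + 0.02 * (i + j)) for j in range(16)])
                   for i in range(16)])
print(apply_tiled_curves(img, curves, tile_size=16).shape)  # (256, 256)
```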

Compared to HDR+, this algorithm is particularly well suited for GPUs. Since the tone mapping of each pixel can be computed independently, the algorithm can also be parallelized. Moreover, the representation is memory-efficient: only a small number of tiles is enough to represent HDR+ local tone mapping for the viewfinder.

To compute local curves, we use a machine learning algorithm called HDRnet, a deep neural network that predicts, from a linear image, per-tile curves that approximate the HDR+ look of that image. It's also fast, due to its compact architecture and the way that low-resolution input images can be used to predict the curves for the high-resolution viewfinder. We train HDRnet on thousands of images to ensure it works well on all kinds of scenes.
HDRnet vs. HDR+ on a challenging scene with extreme brights and darks. The results are very similar at viewfinder resolution. Photo courtesy of Nicholas Wilson.
Dual Exposure Controls
HDR+ is designed to produce pleasing HDR images automatically, without the need for manual controls or post-processing. But sometimes the HDR+ rendition may not match the photographer’s artistic vision. While image editing tools are a partial remedy, HDR images can be challenging to edit, because some decisions are effectively baked into the final JPG. To maximize latitude for editing, it’s possible to save RAW images for each shot (an option in the app). However, this process takes the photographer out of the moment and requires expertise with RAW editing tools as well as additional storage.

Another approach to artistic control is to provide it live in the viewfinder. Many photographers are familiar with the exposure compensation slider, which brightens or darkens the image. But overall brightness is not expressive enough for HDR photography. At a minimum two controls are needed in order to control the highlights and shadows separately.

To address this, we introduce dual exposure controls. When the user taps on the Live HDR+ viewfinder, two sliders appear. The "Brightness" slider works like traditional exposure compensation, changing the overall exposure. This slider is used to recover more detail in bright skies, or intentionally blow out the background and make the subject more visible. The "Shadows" slider affects only dark areas — it operates by changing the tone mapping, not the exposure. This slider is most useful for high-contrast scenes, letting the user boost shadows to reveal details, or suppress them to create a silhouette.
Screen capture of dual exposure controls in action on an outdoor HDR scene with HDR+ results below. You can see individual images here. Photos courtesy of Florian Kainz.
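
The toy sketch below is one way to picture the difference between the two sliders: Brightness scales the linear image before tone mapping, while Shadows modifies only the dark end of the tone curve. The gamma curve and the shadow term are illustrative assumptions, not the actual Pixel controls.

```python
import numpy as np

def dual_exposure_preview(linear, brightness_ev=0.0, shadows=0.0):
    """Toy illustration of the two controls (not the actual Pixel code).

    brightness_ev: like exposure compensation, scales the linear image by
                   2**EV before tone mapping, affecting the whole picture.
    shadows:       changes only the tone mapping; the adjustment is strongest
                   for dark pixels and fades toward the highlights, so
                   positive values lift shadows and negative values crush
                   them toward a silhouette."""
    img = np.clip(linear * (2.0 ** brightness_ev), 0.0, 1.0)
    # A simple global gamma stands in for the local tone mapping.
    gamma = 1.0 / 2.2
    shadow_gain = 2.0 ** (shadows * (1.0 - img) ** 2)  # ~1 for bright pixels
    return np.clip((img ** gamma) * shadow_gain, 0.0, 1.0)

scene = np.random.rand(4, 4).astype(np.float32)
print(dual_exposure_preview(scene, brightness_ev=-1.0, shadows=+0.5))
```
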
Here are some of the dramatic renditions we were able to achieve using dual exposure controls.
Different renditions using Dual Exposure Controls. You can see individual images here. Photo credits: Jiawen Chen, Florian Kainz, Alexander Schiffhauer.
Dual Exposure Controls give you the flexibility to capture dramatically different versions of the same subject. They are not limited to tough HDR scenes, so don’t be afraid to experiment with different subjects and lighting. You may be surprised at how much these sliders will change how you shoot!

Acknowledgements
Live HDR+ and Dual Exposure Controls is the result of a collaboration between Google Research, Android, Hardware, and UX Design teams. Key contributors include: Francois Bleibel, Sean Callanan, Yulun Chang, Eric Chen, Michelle Chen, Kourosh Derakshan, Ryan Geiss, Zhijun He, Joy Hsu, Liz Koh, Marc Levoy, Chia-Kai Liang, Diane Liang, Timothy Lin, Gaurav Malik, Hossein Mohtasham, Nandini Mukherjee, Sushil Nath, Gabriel Nava, Karl Rasche, YiChang Shih, Daniel Solomon, Gary Sun, Kelly Tsai, Sung-fang Tsai, Ted Tsai, Ruben Velarde, Lida Wang, Tianfan Xue, Junlan Yang.

Source: Google AI Blog


Improvements to Portrait Mode on the Google Pixel 4 and Pixel 4 XL



Portrait Mode on Pixel phones is a camera feature that allows anyone to take professional-looking shallow depth of field images. Launched on the Pixel 2 and then improved on the Pixel 3 by using machine learning to estimate depth from the camera’s dual-pixel auto-focus system, Portrait Mode draws the viewer’s attention to the subject by blurring out the background. A critical component of this process is knowing how far objects are from the camera, i.e., the depth, so that we know what to keep sharp and what to blur.

With the Pixel 4, we have made two more big improvements to this feature, leveraging both the Pixel 4’s dual cameras and dual-pixel auto-focus system to improve depth estimation, allowing users to take great-looking Portrait Mode shots at near and far distances. We have also improved our bokeh, making it more closely match that of a professional SLR camera.
Pixel 4’s Portrait Mode allows for Portrait Shots at both near and far distances and has SLR-like background blur. (Photos Credit: Alain Saal-Dalma and Mike Milne)
A Short Recap
The Pixel 2 and 3 used the camera’s dual-pixel auto-focus system to estimate depth. Dual-pixels work by splitting every pixel in half, such that each half pixel sees a different half of the main lens’ aperture. By reading out each of these half-pixel images separately, you get two slightly different views of the scene. While these views come from a single camera with one lens, it is as if they originate from a virtual pair of cameras placed on either side of the main lens’ aperture. When alternating between these views, the subject stays in the same place while the background appears to move vertically.
The dual-pixel views of the bulb have much more parallax than the views of the man because the bulb is much closer to the camera.
This motion is called parallax and its magnitude depends on depth. One can estimate parallax and thus depth by finding corresponding pixels between the views. Because parallax decreases with object distance, it is easier to estimate depth for near objects like the bulb. Parallax also depends on the length of the stereo baseline, that is, the distance between the cameras (or the virtual cameras in the case of dual-pixels). The dual-pixels’ viewpoints have a baseline of less than 1 mm, because they are contained inside a single camera’s lens, which is why it’s hard to estimate the depth of far scenes with them and why the two views of the man look almost identical.
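
The standard stereo relation below makes the baseline argument concrete; the focal length and object distance are illustrative numbers (only the ~1 mm and 13 mm baseline figures come from the text above), not Pixel calibration data.

```python
def parallax_px(depth_mm, baseline_mm, focal_length_px):
    """Classic stereo relation: parallax (disparity) is proportional to the
    baseline and inversely proportional to depth."""
    return focal_length_px * baseline_mm / depth_mm

# Illustrative focal length in pixels (an assumption, not a Pixel spec).
f_px = 2500.0
for name, baseline_mm in [("dual-pixel (~1 mm baseline)", 1.0),
                          ("dual-camera (13 mm baseline)", 13.0)]:
    d = parallax_px(depth_mm=3000.0, baseline_mm=baseline_mm, focal_length_px=f_px)
    print(f"{name}: an object 3 m away produces {d:.2f} px of parallax")
# Far objects produce sub-pixel parallax for dual-pixels, but easily
# measurable parallax for the much wider dual-camera baseline.
```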

Dual Cameras are Complementary to Dual-Pixels
The Pixel 4’s wide and telephoto cameras are 13 mm apart, much greater than the dual-pixel baseline, and so the larger parallax makes it easier to estimate the depth of far objects. In the images below, the parallax between the dual-pixel views is barely visible, while it is obvious between the dual-camera views.
Left: Dual-pixel views. Right: Dual-camera views. The dual-pixel views have only a subtle vertical parallax in the background, while the dual-camera views have much greater horizontal parallax. While this makes it easier to estimate depth in the background, some pixels to the man’s right are visible in only the primary camera’s view making it difficult to estimate depth there.
Even with dual cameras, information gathered by the dual pixels is still useful. The larger the baseline, the more pixels that are visible in one view without a corresponding pixel in the other. For example, the background pixels immediately to the man’s right in the primary camera’s image have no corresponding pixel in the secondary camera’s image. Thus, it is not possible to measure the parallax to estimate the depth for these pixels when using only dual cameras. However, these pixels can still be seen by the dual pixel views, enabling a better estimate of depth in these regions.

Another reason to use both inputs is the aperture problem, described in our previous blog post, which makes it hard to estimate the depth of vertical lines when the stereo baseline is also vertical (or when both are horizontal). On the Pixel 4, the dual-pixel and dual-camera baselines are perpendicular, allowing us to estimate depth for lines of any orientation.

Having this complementary information allows us to estimate the depth of far objects and reduce depth errors for all scenes.

Depth from Dual Cameras and Dual-Pixels
We showed last year how machine learning can be used to estimate depth from dual-pixels. With Portrait Mode on the Pixel 4, we extended this approach to estimate depth from both dual-pixels and dual cameras, using TensorFlow to train a convolutional neural network. The network first separately processes the dual-pixel and dual-camera inputs using two different encoders, a type of neural network that encodes the input into an intermediate representation. Then, a single decoder uses both intermediate representations to compute depth.
Our network to predict depth from dual-pixels and dual-cameras. The network uses two encoders, one for each input and a shared decoder with skip connections and residual blocks.
To force the model to use both inputs, we applied a drop-out technique, where one input is randomly set to zero during training. This teaches the model to work well if one input is unavailable, which could happen if, for example, the subject is too close for the secondary telephoto camera to focus on.
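
A compact Keras sketch of the two-encoder, shared-decoder arrangement with random input dropout is shown below. The layer sizes, input shapes, and 25% drop probabilities are assumptions made for illustration; the actual network (with skip connections and residual blocks) is far larger.

```python
import tensorflow as tf

def encoder(name):
    """A tiny stand-in encoder; the real network is much more elaborate."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, strides=2, padding='same', activation='relu'),
        tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu'),
    ], name=name)

# Toy stand-ins for the two inputs: two half-pixel views, and two camera views.
dp_in = tf.keras.Input(shape=(128, 128, 2), name='dual_pixel')
dc_in = tf.keras.Input(shape=(128, 128, 2), name='dual_camera')

# One encoder per input, as described above.
dp_feat = encoder('dp_encoder')(dp_in)
dc_feat = encoder('dc_encoder')(dc_in)

# A single shared decoder consumes both intermediate representations.
x = tf.keras.layers.Concatenate()([dp_feat, dc_feat])
x = tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding='same', activation='relu')(x)
x = tf.keras.layers.Conv2DTranspose(8, 3, strides=2, padding='same', activation='relu')(x)
depth = tf.keras.layers.Conv2D(1, 3, padding='same', name='depth')(x)

model = tf.keras.Model([dp_in, dc_in], depth)

def random_input_dropout(dp, dc):
    """Training-time trick from the post: randomly zero out one of the two
    inputs so the model learns to cope when one signal is missing."""
    drop_dp = tf.random.uniform([]) < 0.25
    drop_dc = tf.logical_and(tf.random.uniform([]) < 0.25, tf.logical_not(drop_dp))
    dp = tf.where(drop_dp, tf.zeros_like(dp), dp)
    dc = tf.where(drop_dc, tf.zeros_like(dc), dc)
    return dp, dc
```
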
Depth maps from our network where either only one input is provided or both are provided. Top: The two inputs provide depth information for lines in different directions. Bottom: Dual-pixels provide better depth in the regions visible in only one camera, emphasized in the insets. Dual-cameras provide better depth in the background and ground. (Photo Credit: Mike Milne)
The lantern image above shows how having both signals solves the aperture problem. Having one input only allows us to predict depth accurately for lines in one direction (horizontal for dual-pixels and vertical for dual-cameras). With both signals, we can recover the depth on lines in all directions.

With the image of the person, dual-pixels provide better depth information in the occluded regions between the arm and torso, while the large baseline dual cameras provide better depth information in the background and on the ground. This is most noticeable in the upper-left and lower-right corner of depth from dual-pixels. You can find more examples here.

SLR-Like Bokeh
Photographers obsess over the look of the blurred background or bokeh of shallow depth of field images. One of the most noticeable things about high-quality SLR bokeh is that small background highlights turn into bright disks when defocused. Defocusing spreads the light from these highlights into a disk. However, the original highlight is so bright that even when its light is spread into a disk, the disk remains at the bright end of the SLR’s tonal range.
Left: SLRs produce high contrast bokeh disks. Middle: It is hard to make out the disks in our old background blur. Right: Our new bokeh is closer to that of an SLR.
To reproduce this bokeh effect, we replaced each pixel in the original image with a translucent disk whose size is based on depth. In the past, this blurring process was performed after tone mapping, the process by which raw sensor data is converted to an image viewable on a phone screen. Tone mapping compresses the dynamic range of the data, making shadows brighter relative to highlights. Unfortunately, this also results in a loss of information about how bright objects actually were in the scene, making it difficult to produce nice high-contrast bokeh disks. Instead, the bokeh blends in with the background, and does not appear as natural as that from an SLR.

The solution to this problem is to blur the merged raw image produced by HDR+ and then apply tone mapping. In addition to the brighter and more obvious bokeh disks, the background is saturated in the same way as the foreground. Here’s an album showcasing the better blur, which is available on the Pixel 4 and the rear camera of the Pixel 3 and 3a (assuming you have upgraded to version 7.2 of the Google Camera app).
Blurring before tone mapping improves the look of the background by making it more saturated and by making the bokeh disks higher contrast.
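
As a toy illustration of why the order of operations matters, the sketch below spreads each pixel into a neighborhood sized by its distance from the focal plane while still in linear space, and only then applies a simple gamma as a stand-in for tone mapping. The square neighborhoods, the gamma curve, and the radius scaling are simplifications, not the production bokeh renderer.

```python
import numpy as np

def toy_bokeh(linear_img, depth, focal_depth, max_radius=6):
    """Toy synthetic defocus: every pixel is spread into a translucent
    neighborhood whose radius grows with its distance from the focal plane,
    accumulated in *linear* space. Tone mapping afterwards keeps bright
    highlights saturated, so they remain bright, high-contrast disks.
    (Square neighborhoods stand in for circular disks to keep this short.)"""
    h, w, _ = linear_img.shape
    accum = np.zeros_like(linear_img)
    weight = np.zeros((h, w, 1), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            r = int(round(max_radius * abs(depth[y, x] - focal_depth)))
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            area = (y1 - y0) * (x1 - x0)
            accum[y0:y1, x0:x1] += linear_img[y, x] / area
            weight[y0:y1, x0:x1] += 1.0 / area
    blurred = accum / np.maximum(weight, 1e-6)
    return np.clip(blurred, 0, 1) ** (1 / 2.2)   # tone map *after* blurring

img = np.random.rand(32, 32, 3).astype(np.float32) ** 3   # linear-ish test image
depth = np.random.rand(32, 32).astype(np.float32)
print(toy_bokeh(img, depth, focal_depth=0.2).shape)
```
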
Try it Yourself
We have made Portrait Mode on the Pixel 4 better by improving depth quality, resulting in fewer errors in the final image and by improving the look of the blurred background. Depth from dual-cameras and dual-pixels only kicks in when the camera is at least 20 cm from the subject, i.e. the minimum focus distance of the secondary telephoto camera. So consider keeping your phone at least that far from the subject to get better quality portrait shots.

Acknowledgments
This work wouldn’t have been possible without Rahul Garg, Sergio Orts Escolano, Sean Fanello, Christian Haene, Shahram Izadi, David Jacobs, Alexander Schiffhauer, Yael Pritch Knaan and Marc Levoy. We would also like to thank the Google Camera team for helping to integrate these algorithms into the Pixel 4. Special thanks to our photographers Mike Milne, Andy Radin, Alain Saal-Dalma, and Alvin Li who took numerous test photographs for us.

Source: Google AI Blog


Astrophotography with Night Sight on Pixel Phones



Taking pictures of outdoor scenes at night has so far been the domain of large cameras, such as DSLRs, which are able to achieve excellent image quality, provided photographers are willing to put up with bulky equipment and sometimes tricky postprocessing. A few years ago experiments with phone camera nighttime photography produced pleasing results, but the methods employed were impractical for all but the most dedicated users.

Night Sight, introduced last year as part of the Google Camera App for the Pixel 3, allows phone photographers to take good-looking handheld shots in environments so dark that the normal camera mode would produce grainy, severely underexposed images. In a previous blog post our team described how Night Sight is able to do this, with a technical discussion presented at SIGGRAPH Asia 2019.

This year’s version of Night Sight pushes the boundaries of low-light photography with phone cameras. By allowing exposures up to 4 minutes on Pixel 4, and 1 minute on Pixel 3 and 3a, the latest version makes it possible to take sharp and clear pictures of the stars in the night sky or of nighttime landscapes without any artificial light.
The Milky Way as seen from the summit of Haleakala volcano on a cloudless and moonless September night, captured using the Google Camera App running on a Pixel 4 XL phone. The image has not been retouched or post-processed in any way. It shows significantly more detail than a person can see with the unaided eye on a night this dark. The dust clouds along the Milky Way are clearly visible, the sky is covered with thousands of stars, and unlike human night vision, the picture is colorful.
A Brief Overview of Night Sight
The amount of light detected by the camera’s image sensor inherently has some uncertainty, called “shot noise,” which causes images to look grainy. The visibility of shot noise decreases as the amount of light increases; therefore, it is best for the camera to gather as much light as possible to produce a high-quality image.

How much light reaches the image sensor in a given amount of time is limited by the aperture of the camera lens. Extending the exposure time for a photo increases the total amount of light captured, but if the exposure is long, motion in the scene being photographed and unsteadiness of the handheld camera can cause blur. To overcome this, Night Sight splits the exposure into a sequence of multiple frames with shorter exposure times and correspondingly less motion blur. The frames are first aligned, compensating for both camera shake and in-scene motion, and then averaged, with careful treatment of cases where perfect alignment is not possible. While individual frames may be fairly grainy, the combined, averaged image looks much cleaner.
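As a rough illustration of why averaging helps, here is a toy sketch. It assumes the frames are already aligned; the per-pixel rejection step is a crude stand-in for the careful treatment of misalignment mentioned above, and the function name and thresholds are hypothetical.

```python
import numpy as np

def merge_burst(frames, reject_sigma=3.0):
    """Average a stack of (already aligned) frames with simple outlier rejection.

    frames : list of HxW float frames of the same scene.
    Pixels that deviate strongly from the first (reference) frame -- e.g. due to
    motion that alignment could not compensate for -- fall back to the reference
    value instead of being averaged in.
    """
    stack = np.stack(frames, axis=0).astype(np.float64)
    ref = stack[0]
    noise = np.std(stack - ref, axis=0) + 1e-6
    ok = np.abs(stack - ref) < reject_sigma * noise   # per-pixel, per-frame mask
    return np.where(ok, stack, ref).mean(axis=0)

# Toy check: averaging N frames cuts random per-frame noise by roughly sqrt(N).
rng = np.random.default_rng(0)
clean = np.full((64, 64), 0.2)
frames = [clean + rng.normal(0, 0.05, clean.shape) for _ in range(15)]
print(np.std(frames[0] - clean), np.std(merge_burst(frames) - clean))  # ~0.05 vs ~0.013
```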

Experimenting with Exposure Time
Soon after the original Night Sight was released, we started to investigate taking photos in very dark outdoor environments with the goal of capturing the stars. We realized that, just as with our previous experiments, high quality pictures would require exposure times of several minutes. Clearly, this cannot work with a handheld camera; the phone would have to be placed on a tripod, a rock, or whatever else might be available to hold the camera steady.

Just as with handheld Night Sight photos, nighttime landscape shots must take motion in the scene into account — trees sway in the wind, clouds drift across the sky, and the moon and the stars rise in the east and set in the west. Viewers will tolerate motion-blurred clouds and tree branches in a photo that is otherwise sharp, but motion-blurred stars that look like short line segments look wrong. To mitigate this, we split the exposure into frames with exposure times short enough to make the stars look like points of light. Taking pictures of real night skies, we found that the per-frame exposure time should not exceed 16 seconds.
Motion-blurred stars in a single-frame two-minute exposure.
While the number of frames we can capture for a single photo, and therefore the total exposure time, is limited by technical considerations, we found that it is more tightly constrained by the photographer’s patience. Few are willing to wait more than four minutes for a picture, so we limited a single Night Sight image to at most 15 frames with up to 16 seconds per frame.
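For intuition about where a per-frame limit like this comes from, here is a back-of-the-envelope sketch of star trailing due to Earth's rotation. The focal length and pixel pitch are assumed values chosen purely for illustration, not the actual camera specifications.

```python
import math

# Back-of-envelope: how far does a star drift across the sensor during one frame?
SIDEREAL_RATE_ARCSEC_PER_S = 15.04   # Earth's rotation, ~15 arcseconds per second
ARCSEC_PER_RAD = 206265.0

def star_drift_pixels(exposure_s, focal_length_mm=4.4, pixel_pitch_um=1.4,
                      declination_deg=0.0):
    """Star trail length in pixels for an assumed (hypothetical) lens and sensor."""
    rate_rad = (SIDEREAL_RATE_ARCSEC_PER_S / ARCSEC_PER_RAD
                * math.cos(math.radians(declination_deg)))
    drift_mm = rate_rad * exposure_s * focal_length_mm
    return drift_mm * 1000.0 / pixel_pitch_um

for t in (4, 16, 64, 120):
    print(f"{t:4d} s exposure -> ~{star_drift_pixels(t):.1f} px of star trailing")
# ~1 px at 4 s, ~4 px at 16 s, ~15 px at 64 s, ~27 px at 120 s for the assumed optics,
# which is why long single exposures turn stars into visible streaks.
```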

Sixteen-second exposures allow us to capture enough light to produce recognizable images, but a usable camera app capable of taking pictures that look great must deal with additional issues that are unique to low-light photography.

Dark Current and Hot Pixels
Dark current causes CMOS image sensors to record a spurious signal, as if the pixels were exposed to a small amount of light, even when no actual light is present. The effect is negligible when exposure times are short, but it becomes significant with multi-second captures. Due to unavoidable imperfections in the sensor’s silicon substrate, some pixels exhibit higher dark current than their neighbors. In a recorded frame these “warm pixels,” as well as defective “hot pixels,” are visible as tiny bright dots.

Warm and hot pixels can be identified by comparing the values of neighboring pixels within the same frame and across the sequence of frames recorded for a photo, and looking for outliers. Once an outlier has been detected, it is concealed by replacing its value with the average of its neighbors. Since the original pixel value is discarded, there is a loss of image information, but in practice this does not noticeably affect image quality.
Left: A small region of a long-exposure image with hot pixels, and warm pixels caused by dark current nonuniformity. Right: The same image after outliers have been removed. Fine details in the landscape, including small points of light, are preserved.
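A minimal single-frame sketch of this kind of outlier concealment is shown below. It uses a local median rather than the neighbor average described above and omits the cross-frame comparison, purely for brevity; the threshold is an illustrative assumption.

```python
import numpy as np
from scipy.ndimage import median_filter

def remove_hot_pixels(frame, threshold=6.0):
    """Conceal outlier ("hot"/"warm") pixels in a single raw frame.

    A pixel is flagged when it exceeds the median of its 3x3 neighborhood by much
    more than the local noise level, and is then replaced by that median.
    """
    med = median_filter(frame, size=3)
    resid = frame - med
    sigma = 1.4826 * np.median(np.abs(resid))   # robust noise estimate (MAD)
    outliers = resid > threshold * sigma
    cleaned = np.where(outliers, med, frame)
    return cleaned, outliers
```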
Scene Composition
Mobile phones use their screens as electronic viewfinders — the camera captures a continuous stream of frames that is displayed as a live video in order to aid with shot composition. The frames are simultaneously used by the camera’s autofocus, auto exposure, and auto white balance systems.

To feel responsive to the photographer, the viewfinder is updated at least 15 times per second, which limits the viewfinder frame exposure time to 66 milliseconds. This makes it challenging to display a detailed image in low-light environments. At light levels below roughly that of a full moon, the viewfinder becomes mostly gray — perhaps showing a few bright stars, but none of the landscape — and composing a shot becomes difficult.

To assist in framing the scene in extremely low light, Night Sight displays a “post-shutter viewfinder”. After the shutter button has been pressed, each long-exposure frame is displayed on the screen as soon as it has been captured. With exposure times up to 16 seconds, these frames have collected almost 250 times more light than the regular viewfinder frames, allowing the photographer to easily see image details as soon as the first frame has been captured. The composition can then be adjusted by moving the phone while the exposure continues. Once the composition is correct, the initial shot can be stopped, and a second shot can be captured where all frames have the desired composition.
Left: The live Night Sight viewfinder in a very dark outdoor environment. Except for a few points of light from distant buildings, the landscape and the sky are largely invisible. Right: The post-shutter viewfinder during a long exposure shot. The image is much clearer; it updates after every long-exposure frame.
Autofocus
Autofocus ensures that the image captured by the camera is sharp. In normal operation, the incoming viewfinder frames are analyzed to determine how far the lens must be from the sensor to produce an in-focus image, but in very low light the viewfinder frames can be so dark and grainy that autofocus fails due to lack of detectable image detail. When this happens, Night Sight on Pixel 4 switches to “post-shutter autofocus.” After the user presses the shutter button, the camera captures two autofocus frames with exposure times up to one second, long enough to detect image details even in low light. These frames are used only to focus the lens and do not contribute directly to the final image.

Even though using long-exposure frames for autofocus leads to consistently sharp images at light levels low enough that the human visual system cannot clearly distinguish objects, sometimes it gets too dark even for post-shutter autofocus. In this case the camera instead focuses at infinity. In addition, Night Sight includes manual focus buttons, allowing the user to focus on nearby objects in very dark conditions.

Sky Processing
When images of very dark environments are viewed on a screen, they are displayed much brighter than the original scenes. This can change the viewer’s perception of the time of day when the photos were captured. At night we expect the sky to be dark. If a picture taken at night shows a bright sky, then we see it as a daytime scene, perhaps with slightly unusual lighting.

This effect is countered in Night Sight by selectively darkening the sky in photos of low-light scenes. To do this, we use machine learning to detect which regions of an image represent sky. An on-device convolutional neural network, trained on over 100,000 images that were manually labeled by tracing the outlines of sky regions, identifies each pixel in a photograph as “sky” or “not sky.”
A landscape picture taken on a bright full-moon night, without sky processing (left half), and with sky darkening (right half). Note that the landscape is not darkened.
Sky detection also makes it possible to perform sky-specific noise reduction, and to selectively increase contrast to make features like clouds, color gradients, or the Milky Way more prominent.
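As a rough sketch of how a per-pixel sky mask could be used to darken the sky, consider the following; the blending scheme, darkening strength, and gamma handling are illustrative assumptions, not the actual Night Sight processing.

```python
import numpy as np

def darken_sky(image, sky_prob, strength=0.6, gamma=2.2):
    """Selectively darken sky pixels using a per-pixel sky probability mask.

    image    : HxWx3 float array in [0, 1], display-referred (already tone mapped)
    sky_prob : HxW float array in [0, 1] from a sky-segmentation network
    """
    linear = image ** gamma                    # work roughly in linear light
    darkened = linear * (1.0 - strength)       # fully darkened version of the frame
    # Blend per pixel: landscape (sky_prob ~ 0) is untouched, sky is darkened.
    w = sky_prob[..., None]
    out = (1.0 - w) * linear + w * darkened
    return np.clip(out, 0, 1) ** (1.0 / gamma)
```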

Results
With the phone on a tripod, Night Sight produces sharp pictures of star-filled skies, and as long as there is at least a small amount of moonlight, landscapes will be clear and colorful.

Of course, the phone’s capabilities are not limitless, and there is always room for improvement. Although nighttime scenes are dark overall, they often contain bright light sources such as the moon, distant street lamps, or prominent stars. While we can capture a moonlit landscape, or details on the surface of the moon, the extremely large brightness range, which can exceed 500,000:1, so far prevents us from capturing both in the same image. Also, when the stars are the only source of illumination, we can take clear pictures of the sky, but the landscape is only visible as a silhouette.

For Pixel 4 we have been using the brightest part of the Milky Way, near the constellation Sagittarius, as a benchmark for the quality of images of a moonless sky. By that standard Night Sight is doing very well. Although Milky Way photos exhibit some residual noise, they are pleasing to look at, showing more stars and more detail than a person can see looking at the real night sky.
Examples of photos taken with the Google Camera App on Pixel 4. An album with more pictures can be found here.
Tips and Tricks
In the course of developing and testing Night Sight astrophotography we gained some experience taking outdoor nighttime pictures with Pixel phones, and we’d like to share a list of tips and tricks that have worked for us. You can find it here.

Acknowledgements
Night Sight is an ongoing collaboration between several teams at Google. Key contributors to the project include from the Gcam team, Orly Liba, Nikhil Karnad, Charles He, Manfred Ernst, Michael Milne, Andrew Radin, Navin Sarma, Jon Barron, Yun-Ta Tsai, Tianfan Xue, Jiawen Chen, Dillon Sharlet, Ryan Geiss, Sam Hasinoff, Alex Schiffhauer, Yael Pritch Knaan and Marc Levoy; from the Super Res Zoom team, Bart Wronski, Peyman Milanfar, and Ignacio Garcia Dorado; from the Google camera app team, Emily To, Gabriel Nava, Sushil Nath, Isaac Reynolds, and Michelle Chen; from the Android platform team, Ryan Chan, Ying Chen Lou, and Bob Hung; from the Mobile Vision team, Longqi (Rocky) Cai, Huizhong Chen, Emily Manoogian, Nicole Maffeo, and Tomer Meron; from Machine Perception, Elad Eban and Yair Movshovitz-Attias.

Source: Google AI Blog


Take Your Best Selfie Automatically, with Photobooth on Pixel 3



Taking a good group selfie can be tricky—you need to hover your finger above the shutter, keep everyone’s faces in the frame, look at the camera, make good expressions, try not to shake the camera and hope no one blinks when you finally press the shutter! After building the technology behind automatic photography with Google Clips, we asked ourselves: can we bring some of the magic of this automatic picture experience to the Pixel phone?

With Photobooth, a new shutter-free mode in the Pixel 3 Camera app, it’s now easier to shoot selfies—solo, couples, or even groups—that capture you at your best. Once you enter Photobooth mode and click the shutter button, it will automatically take a photo when the camera is steady and it sees that the subjects have good expressions with their eyes open. And in the newest release of Pixel Camera, we’ve added kiss detection to Photobooth! Kiss a loved one, and the camera will automatically capture it.

Photobooth automatically captures group shots, when everyone in the photo looks their best.
Photobooth joins Top Shot and Portrait mode in a suite of exciting Pixel camera features that enable you to take the best pictures possible. However, unlike Portrait mode, which takes advantage of specialized hardware in the back-facing camera to provide its most accurate results, Photobooth is optimized for the front-facing camera. To build Photobooth, we had to solve for three challenges: how to identify good content for a wide range of user groups; how to time the shutter to capture the best moment; and how to animate a visual element that helps users understand what Photobooth sees and captures.

Models for Understanding Good Content
In developing Photobooth, a main challenge was to determine when there was good content in either a typical selfie, in which the subjects are all looking at the camera, or in a shot that includes people kissing and not necessarily facing the camera. To accomplish this, Photobooth relies on two distinct models to capture good selfies—a model for facial expressions and a model to detect when people kiss.

We worked with photographers to identify five key expressions that should trigger capture: smiles, tongue-out, kissy/duck face, puffy-cheeks, and surprise. We then trained a neural network to classify these expressions. The kiss detection model used by Photobooth is a variation of the Image Content Model (ICM) trained for Google Clips, fine tuned specifically to focus on kissing. Both of these models use MobileNets in order to run efficiently on-device while continuously processing the images at high frame rate. The outputs of the models are used to evaluate the quality of each frame for the shutter control algorithm.

Shutter Control
Once you click the shutter button in Photobooth mode, a basic quality assessment based on the content scores from the models above is performed. This first stage acts as a filter, rejecting moments that contain closed eyes, talking, or motion blur, or in which the facial expressions or kissing actions learned by the models are not detected. Photobooth temporally analyzes the expression confidence values to detect their presence in the photo, making it robust to variations in the output of machine learning (ML) models. Once the first stage is successfully passed, each frame is subjected to a more fine-grained analysis, which outputs an overall frame score.

The frame score considers both facial expression quality and the kiss score. As the kiss detection model operates on the entire frame, its output can be used directly as a full-frame score value for kissing. The facial expression model outputs a score for each identified expression. Since a variable number of faces may be present in each frame, Photobooth applies an attention model that uses the detected expressions to iteratively compute an expression quality representation and a weight for each face. The weighting is important, for example, to emphasize the expressions in the foreground rather than the background. The model then calculates a single, global score for the quality of expressions in the frame.
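The exact attention model is not spelled out in the post, but a minimal stand-in might softmax-weight per-face scores by face size and confidence, along these lines (the function name, inputs, weights, and temperature are all hypothetical):

```python
import numpy as np

def frame_expression_score(face_scores, face_areas, temperature=0.25):
    """Combine per-face expression scores into one frame-level score.

    face_scores : best expression confidence per detected face, in [0, 1]
    face_areas  : normalized face box areas, so larger (foreground) faces get more weight
    """
    if len(face_scores) == 0:
        return 0.0
    s = np.asarray(face_scores, dtype=np.float64)
    a = np.asarray(face_areas, dtype=np.float64)
    logits = (s + a) / temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()                                # softmax attention weights
    return float(np.dot(w, s))                  # weighted average expression quality

print(frame_expression_score([0.9, 0.3], [0.20, 0.02]))  # ~0.87: the foreground smile dominates
```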

The final image quality score used for triggering the shutter is computed by a weighted combination of the attention based facial expression score and the kiss score. In order to detect the peak quality, the shutter control algorithm maintains a short buffer of observed frames and only saves a shot if its frame score is higher than the frames that come after it in the buffer. The length of the buffer is short enough to give users a sense of real time feedback.
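A minimal sketch of the peak-detection idea, with a hypothetical look-ahead buffer length, might look like this: a candidate frame is saved only once several newer frames have arrived and none of them scored higher.

```python
from collections import deque

class ShutterController:
    """Save a frame only if no later frame in a short look-ahead buffer beats it."""

    def __init__(self, lookahead=5):
        self.lookahead = lookahead
        self.buffer = deque()               # (frame_id, score), oldest first

    def push(self, frame_id, score):
        """Feed one scored frame; returns the id of a frame to save, or None."""
        self.buffer.append((frame_id, score))
        if len(self.buffer) <= self.lookahead:
            return None
        cand_id, cand_score = self.buffer.popleft()
        later_scores = [s for _, s in self.buffer]
        if cand_score > max(later_scores):
            return cand_id                  # local peak in quality: capture this moment
        return None

ctrl = ShutterController(lookahead=3)
for i, s in enumerate([0.2, 0.8, 0.5, 0.4, 0.3, 0.9, 0.1, 0.1, 0.1]):
    saved = ctrl.push(i, s)
    if saved is not None:
        print("save frame", saved)          # saves frames 1 and 5, the two local peaks
```

Keeping the look-ahead short is what gives the sense of real-time feedback described above.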

Intelligence Indicator
Since Photobooth uses the front-facing camera, the user can see and interact with the display while taking a photo. Photobooth mode includes a visual indicator, a bar at the top of the screen that grows in size when photo quality scores increase, to help users understand what the ML algorithms see and capture. The length of the bar is divided into four distinct ranges: (1) no faces clearly seen, (2) faces seen but not paying attention to the camera, (3) faces paying attention but not making key expressions, and (4) faces paying attention with key expressions.

To make this indicator more interpretable, we forced the bar into these ranges, which prevents its length from changing too abruptly. The result is smooth variation of the bar length as the quality score changes, which improves its utility. When the indicator bar reaches a length representative of a high quality score, the screen flashes to signify that a photo was captured and saved.
Using ML outputs directly as intelligence feedback results in rapid variation (left), whereas specifying explicit ranges creates a smooth signal (right).
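A toy sketch of mapping a raw score into one of the four ranges is shown below; the range boundaries and the state encoding are illustrative assumptions, not the shipped values.

```python
def indicator_bar_fraction(raw_score, state):
    """Map a raw ML quality score to a bar length inside one of four discrete ranges.

    state: 0 = no faces clearly seen, 1 = faces not attending, 2 = attending without
    a key expression, 3 = attending with a key expression. Within a range the bar
    only moves inside that range's slice of the full length, which keeps it from
    jumping around as the raw score fluctuates frame to frame.
    """
    ranges = [(0.00, 0.25), (0.25, 0.50), (0.50, 0.75), (0.75, 1.00)]
    lo, hi = ranges[state]
    return lo + max(0.0, min(1.0, raw_score)) * (hi - lo)

print(indicator_bar_fraction(0.9, state=3))   # near-full bar -> capture is imminent
```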
Conclusion
We’re excited by the possibilities of automatic photography on camera phones. As computer vision continues to improve, in the future we may generally trust smart cameras to select a great moment to capture. Photobooth is an example of how we can carve out a useful corner of this space—selfies and group selfies of smiles, funny faces, and kisses—and deliver a fun and useful experience.

Acknowledgments
Photobooth was a collaboration of several teams at Google. Key contributors to the project include: Kojo Acquah, Chris Breithaupt, Chun-Te Chu, Geoff Clark, Laura Culp, Aaron Donsbach, Relja Ivanovic, Pooja Jhunjhunwala, Xuhui Jia, Ting Liu, Arjun Narayanan, Eric Penner, Arushan Raj, Divya Tyam, Raviteja Vemulapalli, Julian Walker, Jun Xie, Li Zhang, Andrey Zhmoginov, Yukun Zhu.

Source: Google AI Blog


Top Shot on Pixel 3



Life is full of meaningful moments — from a child’s first step to an impromptu jump for joy — that one wishes could be preserved with a picture. However, because these moments are often unpredictable, missing that perfect shot is a frustrating problem that smartphone camera users face daily. Using our experience from developing Google Clips, we wondered if we could develop new techniques for the Pixel 3 camera that would allow everyone to capture the perfect shot every time.

Top Shot is a new feature recently launched with Pixel 3 that helps you to capture precious moments precisely and automatically at the press of the shutter button. Top Shot saves and analyzes the image frames before and after the shutter press on the device in real-time using computer vision techniques, and recommends several alternative high-quality HDR+ photos.
Examples of Top Shot on Pixel 3. On the left, a better smiling shot is recommended. On the right, a better jump shot is recommended. The recommended images are high-quality HDR+ shots.
Capturing Multiple Moments
When a user opens the Pixel 3 Camera app, Top Shot is enabled by default, helping to capture the perfect moment by analyzing images taken both before and after the shutter press. Each image is analyzed for some qualitative features (e.g., whether the subject is smiling or not) in real-time and entirely on-device to preserve privacy and minimize latency. Each image is also associated with additional signals, such as optical flow of the image, exposure time, and gyro sensor data to form the input features used to score the frame quality.

When you press the shutter button, Top Shot captures up to 90 images from 1.5 seconds before and after the shutter press, and selects up to two alternative shots to save in high resolution — the original shutter frame plus high-res alternatives for you to review (other lower-res frames can also be reviewed as desired). The shutter frame is processed and saved first, and the best alternative shots afterwards. Google’s Visual Core on Pixel 3 is used to process these top alternative shots as HDR+ images with a very small amount of extra latency, and the results are embedded into the file of the Motion Photo.
Top-level diagram of Top Shot capture.
Because Top Shot runs in the camera as a background process, it must have very low power consumption. As such, Top Shot uses a hardware-accelerated MobileNet-based single shot detector (SSD). The execution of such optimized models is also throttled by power and thermal limits.

Recognizing Top Moments
When we set out to understand how to enable people to capture the best moments with their camera, we focused on three key attributes: 1) functional qualities like lighting, 2) objective attributes (are the subject's eyes open? Are they smiling?), and 3) subjective qualities like emotional expressions. We designed a computer vision model to recognize these attributes while operating in a low-latency, on-device mode.

During our development process, we started with a vanilla MobileNet model and set out to optimize for Top Shot, arriving at a customized architecture that operated within our accuracy, latency, and power tradeoff constraints. Our neural network design detects low-level visual attributes in early layers, like whether the subject is blurry, and then dedicates additional compute and parameters toward more complex objective attributes, like whether the subject's eyes are open, and subjective attributes, like whether there is an emotional expression of amusement or surprise. We trained our model using knowledge distillation over a large number of diverse face images, with quantization applied during both training and inference.

We then adopted a layered Generalized Additive Model (GAM) to provide quality scores for faces and combine them into a weighted-average “frame faces” score. This model made it easy for us to interpret and identify the exact causes of success or failure, enabling rapid iteration to improve the quality and performance of our attributes model. The number of free parameters was on the order of dozens, so we could optimize these using Google's black box optimizer, Vizier, in tandem with any other parameters that affected selection quality.

Frame Scoring Model
While Top Shot prioritizes for face analysis, there are good moments in which faces are not the primary subject. To handle those use cases, we include the following additional scores in the overall frame quality score:
  • Subject motion saliency score — the low-resolution optical flow between the current frame and the previous frame is estimated in the ISP to determine whether there is salient object motion in the scene.
  • Global motion blur score — estimated from the camera motion and the exposure time. The camera motion is calculated from sensor data from the gyroscope and OIS (optical image stabilization).
  • “3A” scores — the status of auto exposure, auto focus, and auto white balance is also considered.
All the individual scores are used to train a model predicting an overall quality score, which matches the frame preference of human raters, to maximize end-to-end product quality.
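Conceptually, the combination looks something like the sketch below. In reality the weights are learned from human preference data, so the hand-picked values and signal ranges here are placeholders only, meant to show the shape of the computation rather than the actual model.

```python
def frame_quality_score(face_score, motion_saliency, motion_blur, scores_3a,
                        w_face=0.6, w_saliency=0.2, w_blur=0.15, w_3a=0.05):
    """Combine Top Shot's per-frame signals into one overall quality score.

    face_score, motion_saliency, scores_3a are assumed to be in [0, 1] (higher is
    better); motion_blur is in [0, 1] with higher meaning more blur, so it enters
    as a penalty. The weights are hypothetical placeholders.
    """
    return (w_face * face_score
            + w_saliency * motion_saliency
            - w_blur * motion_blur
            + w_3a * scores_3a)

# A sharp, smiling frame with some subject motion scores well:
print(frame_quality_score(face_score=0.9, motion_saliency=0.4, motion_blur=0.1, scores_3a=1.0))
```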

End-to-End Quality and Fairness
Most of the above components are evaluated for accuracy independently. However, Top Shot presents requirements that are uniquely challenging, since it runs in real time in the Pixel Camera. Additionally, we needed to ensure that all these signals are combined in a system with favorable results. That means we need to gauge our predictions against what our users perceive as the “top shot.”

To test this, we collected data from hundreds of volunteers, along with their opinions of which frames (out of up to 90!) looked best. This donated dataset covers many typical use cases, such as portraits, selfies, action shots, and landscapes.

Many of the 3-second clips provided by Top Shot had more than one good shot, so it was important for us to engineer our quality metrics to handle this. We used some modified versions of traditional Precision and Recall, some classic ranking metrics (such as Mean Reciprocal Rank), and a few others designed specifically for the Top Shot task as our objective. Beyond these metrics, we also investigated causes of image quality issues we saw during development, leading to improvements in avoiding blur, handling multiple faces better, and more. In doing so, we were able to steer the model toward a set of selections people were likely to rate highly.

Importantly, we tested the Top Shot system for fairness to make sure that our product can offer a consistent experience to a very wide range of users. We evaluated the accuracy of each signal used in Top Shot on several different subgroups of people (based on gender, age, ethnicity, etc), testing for accuracy of each signal across those subgroups.

Conclusion
Top Shot is just one example of how Google leverages optimized hardware and cutting-edge machine learning to provide useful tools and services. We hope you’ll find this feature useful, and we’re committed to further improving the capabilities of mobile phone photography!

Acknowledgements
This post reflects the work of a large group of Google engineers, research scientists, and others including: Ari Gilder, Aseem Agarwala, Brendan Jou, David Karam, Eric Penner, Farooq Ahmad, Henri Astre, Hillary Strickland, Marius Renn, Matt Bridges, Maxwell Collins, Navid Shiee, Ryan Gordon, Sarah Clinckemaillie, Shu Zhang, Vivek Kesarwani, Xuhui Jia, Yukun Zhu, Yuzo Watanabe and Chris Breithaupt.

Source: Google AI Blog


Learning to Predict Depth on the Pixel 3 Phones



Portrait Mode on the Pixel smartphones lets you take professional-looking images that draw attention to a subject by blurring the background behind it. Last year, we described, among other things, how we compute depth with a single camera using its Phase-Detection Autofocus (PDAF) pixels (also known as dual-pixel autofocus) using a traditional non-learned stereo algorithm. This year, on the Pixel 3, we turn to machine learning to improve depth estimation to produce even better Portrait Mode results.
Left: The original HDR+ image. Right: A comparison of Portrait Mode results using depth from traditional stereo and depth from machine learning. The learned depth result has fewer errors. Notably, in the traditional stereo result, many of the horizontal lines behind the man are incorrectly estimated to be at the same depth as the man and are kept sharp.
(Mike Milne)
A Short Recap
As described in last year’s blog post, Portrait Mode uses a neural network to determine which pixels correspond to people versus the background, and augments this two-layer person segmentation mask with depth information derived from the PDAF pixels. This enables a depth-dependent blur, which is closer to what a professional camera does.

PDAF pixels work by capturing two slightly different views of a scene, shown below. Flipping between the two views, we see that the person is stationary, while the background moves horizontally, an effect referred to as parallax. Because parallax is a function of the point’s distance from the camera and the distance between the two viewpoints, we can estimate depth by matching each point in one view with its corresponding point in the other view.
The two PDAF images on the left and center look very similar, but in the crop on the right you can see the parallax between them. It is most noticeable on the circular structure in the middle of the crop.
However, finding these correspondences in PDAF images (a method called depth from stereo) is extremely challenging because scene points barely move between the views. Furthermore, all stereo techniques suffer from the aperture problem. That is, if you look at the scene through a small aperture, it is impossible to find correspondence for lines parallel to the stereo baseline, i.e., the line connecting the two cameras. In other words, when looking at the horizontal lines in the figure above (or vertical lines in portrait orientation shots), any proposed shift of these lines in one view with respect to the other view looks about the same. In last year’s Portrait Mode, all these factors could result in errors in depth estimation and cause unpleasant artifacts.

Improving Depth Estimation
With Portrait Mode on the Pixel 3, we fix these errors by utilizing the fact that the parallax used by depth from stereo algorithms is only one of many depth cues present in images. For example, points that are far away from the in-focus plane appear less sharp than ones that are closer, giving us a defocus depth cue. In addition, even when viewing an image on a flat screen, we can accurately tell how far things are because we know the rough size of everyday objects (e.g. one can use the number of pixels in a photograph of a person’s face to estimate how far away it is). This is called a semantic cue.

Designing a hand-crafted algorithm to combine these different cues is extremely difficult, but by using machine learning, we can do so while also better exploiting the PDAF parallax cue. Specifically, we train a convolutional neural network, written in TensorFlow, that takes as input the PDAF pixels and learns to predict depth. This new and improved ML-based method of depth estimation is what powers Portrait Mode on the Pixel 3.
Our convolutional neural network takes as input the PDAF images and outputs a depth map. The network uses an encoder-decoder style architecture with skip connections and residual blocks.
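The exact architecture is not published in this post, but a toy encoder-decoder with skip connections and residual blocks, written with the Keras API, might be sketched as follows. The layer counts, filter sizes, and two-channel PDAF input are assumptions for illustration, not the production network.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two 3x3 convolutions with a shortcut connection."""
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if x.shape[-1] != filters:
        x = layers.Conv2D(filters, 1, padding="same")(x)   # match channel counts
    return layers.Activation("relu")(layers.Add()([x, y]))

def build_depth_net(input_shape=(None, None, 2)):   # 2 channels: the two PDAF views
    inp = tf.keras.Input(shape=input_shape)
    # Encoder: downsample while increasing feature depth.
    e1 = residual_block(inp, 32)
    e2 = residual_block(layers.MaxPool2D()(e1), 64)
    e3 = residual_block(layers.MaxPool2D()(e2), 128)
    # Decoder: upsample back to full resolution, with skip connections to the encoder.
    d2 = residual_block(layers.Concatenate()([layers.UpSampling2D()(e3), e2]), 64)
    d1 = residual_block(layers.Concatenate()([layers.UpSampling2D()(d2), e1]), 32)
    depth = layers.Conv2D(1, 3, padding="same")(d1)   # one-channel (relative) depth map
    return tf.keras.Model(inp, depth)

model = build_depth_net()
model.summary()
```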
Training the Neural Network
In order to train the network, we need lots of PDAF images and corresponding high-quality depth maps. And since we want our predicted depth to be useful for Portrait Mode, we also need the training data to be similar to pictures that users take with their smartphones.

To accomplish this, we built our own custom “Frankenphone” rig that contains five Pixel 3 phones, along with a Wi-Fi-based solution that allowed us to simultaneously capture pictures from all of the phones (within a tolerance of ~2 milliseconds). With this rig, we computed high-quality depth from photos by using structure from motion and multi-view stereo.
Left: Custom rig used to collect training data. Middle: An example capture flipping between the five images. Synchronization between the cameras ensures that we can calculate depth for dynamic scenes, such as this one. Right: Ground truth depth. Low confidence points, i.e., points where stereo matches are not reliable due to weak texture, are colored in black and are not used during training. (Sam Ansari and Mike Milne)
The data captured by this rig is ideal for training a network for the following main reasons:
  • Five viewpoints ensure that there is parallax in multiple directions and hence no aperture problem.
  • The arrangement of the cameras ensures that a point in an image is usually visible in at least one other image, resulting in fewer points with no correspondences.
  • The baseline, i.e., the distance between the cameras, is much larger than our PDAF baseline, resulting in more accurate depth estimation.
  • Synchronization between the cameras ensures that we can calculate depth for dynamic scenes like the one above.
  • Portability of the rig ensures that we can capture photos in the wild, simulating the photos users take with their smartphones.
However, even though the data captured from this rig is ideal, it is still extremely challenging to predict the absolute depth of objects in a scene — a given PDAF pair can correspond to a range of different depth maps (depending on lens characteristics, focus distance, etc). To account for this, we instead predict the relative depths of objects in the scene, which is sufficient for producing pleasing Portrait Mode results.
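One standard way to supervise relative rather than absolute depth is a scale-invariant loss computed in log-depth (after Eigen et al.); the sketch below is illustrative and not necessarily the loss used for Portrait Mode. Adding a constant to every predicted log-depth (i.e., scaling all depths by a global factor) leaves this loss unchanged, so the network is only asked to get depth ordering and relative magnitudes right.

```python
import numpy as np

def scale_invariant_depth_loss(pred_log_depth, gt_log_depth, valid_mask):
    """Scale-invariant loss in log-depth: invariant to a global offset in log space."""
    d = (pred_log_depth - gt_log_depth)[valid_mask]
    n = d.size
    return np.mean(d ** 2) - (np.sum(d) ** 2) / (n ** 2)

# A prediction that is off by a single global scale factor incurs ~zero loss:
rng = np.random.default_rng(0)
gt = rng.uniform(0.1, 2.0, (8, 8))
pred = np.log(gt) + 0.7                      # all depths scaled by exp(0.7)
print(scale_invariant_depth_loss(pred, np.log(gt), np.ones_like(gt, dtype=bool)))  # ~0.0
```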

Putting it All Together
This ML-based depth estimation needs to run fast on the Pixel 3, so that users don’t have to wait too long for their Portrait Mode shots. However, to get good depth estimates that make use of subtle defocus and parallax cues, we have to feed full-resolution, multi-megapixel PDAF images into the network. To ensure fast results, we use TensorFlow Lite, a cross-platform solution for running machine learning models on mobile and embedded devices, together with the Pixel 3’s powerful GPU, to compute depth quickly despite our abnormally large inputs. We then combine the resulting depth estimates with masks from our person segmentation neural network to produce beautiful Portrait Mode results.
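For a flavor of what inference with TensorFlow Lite looks like, here is a desktop-style sketch using the Python interpreter API. The model file name and input layout are placeholders, and on the phone the model runs through the mobile runtime with GPU acceleration rather than this plain Python path.

```python
import numpy as np
import tensorflow as tf

# Load a (hypothetical) converted depth model and run one inference.
interpreter = tf.lite.Interpreter(model_path="pdaf_depth.tflite")  # placeholder file name
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

pdaf_pair = np.zeros(inp["shape"], dtype=inp["dtype"])   # stand-in for the two PDAF views
interpreter.set_tensor(inp["index"], pdaf_pair)
interpreter.invoke()
depth_map = interpreter.get_tensor(out["index"])
print(depth_map.shape)
```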

Try it Yourself
In Google Camera App version 6.1 and later, our depth maps are embedded in Portrait Mode images. This means you can use the Google Photos depth editor to change the amount of blur and the focus point after capture. You can also use third-party depth extractors to extract the depth map from a jpeg and take a look at it yourself. Also, here is an album showing the relative depth maps and the corresponding Portrait Mode images for traditional stereo and the learning-based approaches.

Acknowledgments
This work wouldn’t have been possible without Sam Ansari, Yael Pritch Knaan, David Jacobs, Jiawen Chen, Juhyun Lee and Andrei Kulik. Special thanks to Mike Milne and Andy Radin who captured data with the five-camera rig.

Source: Google AI Blog