Tag Archives: Computational Photography

Behind the Motion Photos Technology in Pixel 2


One of the most compelling things about smartphones today is the ability to capture a moment on the fly. With motion photos, a new camera feature available on the Pixel 2 and Pixel 2 XL phones, you no longer have to choose between a photo and a video; every photo you take captures more of the moment. When you take a photo with motion enabled, your phone also records and trims up to 3 seconds of video. Using advanced stabilization built upon technology we pioneered in Motion Stills for Android, these pictures come to life in Google Photos. Let’s take a look behind the technology that makes this possible!
Motion photos on the Pixel 2 in Google Photos. With the camera frozen in place the focus is put directly on the subjects. For more examples, check out this Google Photos album.
Camera Motion Estimation by Combining Hardware and Software
The image and video pair that is captured every time you hit the shutter button is a full resolution JPEG with an embedded 3 second video clip. On the Pixel 2, the video portion also contains motion metadata that is derived from the gyroscope and optical image stabilization (OIS) sensors to aid the trimming and stabilization of the motion photo. By combining software based visual tracking with the motion metadata from the hardware sensors, we built a new hybrid motion estimation for motion photos on the Pixel 2.
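As a rough illustration of this pairing, a motion photo can be thought of as a JPEG with a short video clip appended to it. The sketch below is not the official container format; it simply locates an embedded MP4 by searching for the 'ftyp' box that begins an MP4 stream:

```python
def split_motion_photo(data: bytes):
    """Split a motion-photo byte stream into its JPEG and MP4 parts.

    Illustrative sketch, not the official container spec: it assumes the
    MP4 clip is appended after the JPEG data and locates it by searching
    for the 'ftyp' box that starts an MP4 file.
    """
    idx = data.find(b"ftyp")
    if idx < 4:
        return data, None  # no embedded video found
    # The 4 bytes before 'ftyp' hold the box size, so the MP4 starts there.
    start = idx - 4
    return data[:start], data[start:]

# Tiny synthetic example: fake JPEG bytes followed by a fake MP4 header.
blob = b"\xff\xd8fake-jpeg\xff\xd9" + b"\x00\x00\x00\x18ftypmp42"
jpeg, mp4 = split_motion_photo(blob)
```

In practice the video offset is recorded in the photo's metadata, so a byte scan like this would only be a fallback.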

Our approach aligns the background more precisely than the technique used in Motion Stills or the purely hardware sensor based approach. Based on Fused Video Stabilization technology, it reduces the artifacts from the visual analysis due to a complex scene with many depth layers or when a foreground object occupies a large portion of the field of view. It also improves the hardware sensor based approach by refining the motion estimation to be more accurate, especially at close distances.
Motion photo as captured (left) and after freezing the camera by combining hardware and software. For more comparisons, check out this Google Photos album.
The purely software-based technique we introduced in Motion Stills uses the visual data from the video frames, detecting and tracking features over consecutive frames yielding motion vectors. It then classifies the motion vectors into foreground and background using motion models such as an affine transformation or a homography. However, this classification is not perfect and can be misled, e.g. by a complex scene or dominant foreground.
Feature classification into background (green) and foreground (orange) by using the motion metadata from the hardware sensors of the Pixel 2. Notice how the new approach not only labels the skateboarder accurately as foreground but also the half-pipe that is at roughly the same depth.
For motion photos on Pixel 2 we improved this classification by using the motion metadata derived from the gyroscope and the OIS. This accurately captures the camera motion with respect to the scene at infinity, which one can think of as the background in the distance. However, for pictures taken at closer range, parallax is introduced for scene elements at different depth layers, which is not accounted for by the gyroscope and OIS. To handle this, we combine the motion metadata with the visual tracking: motion vectors that deviate too much from the motion metadata are marked as foreground. This results in a significantly more accurate classification of foreground and background, which also enables us to use a more complex motion model known as mixture homographies that can account for rolling shutter and undo the distortions it causes.
Background motion estimation in motion photos. By using the motion metadata from Gyro and OIS we are able to accurately classify features from the visual analysis into foreground and background.
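The classification idea can be sketched in a few lines. This is an illustrative toy, not the production logic; the per-feature sensor-predicted displacement and the deviation threshold are assumptions:

```python
import numpy as np

def classify_features(motion_vectors, sensor_motion, threshold=2.0):
    """Label each visual motion vector as background or foreground.

    Sketch of the hybrid idea: `sensor_motion` stands in for the
    per-feature displacement predicted from the gyroscope/OIS metadata.
    Vectors deviating from it by more than `threshold` pixels are treated
    as foreground; the L2 test and threshold value are illustrative.
    """
    deviation = np.linalg.norm(motion_vectors - sensor_motion, axis=1)
    return np.where(deviation > threshold, "foreground", "background")

# Camera pans right by ~5 px per frame; the subject moves differently.
predicted = np.array([[5.0, 0.0]] * 3)
observed = np.array([[5.1, 0.2],    # static scene feature
                     [4.8, -0.1],   # static scene feature
                     [-3.0, 6.0]])  # fast-moving subject
labels = classify_features(observed, predicted)
```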
Motion Photo Stabilization and Playback
Once we have accurately estimated the background motion for the video, we determine an optimally stable camera path to align the background using linear programming techniques outlined in our earlier posts. Further, we automatically trim the video to remove any accidental motion caused by putting the phone away. All of this processing happens on your phone and produces a small amount of metadata per frame that is used to render the stabilized video in real-time using a GPU shader when you tap the Motion button in Google Photos. In addition, we play the video starting at the same timestamp as the HDR+ photo, producing a seamless transition from still image to video.
Motion photos stabilize even complex scenes with large foreground motions.
Motion Photo Sharing
Using Google Photos, you can share motion photos with your friends and as videos and GIFs, watch them on the web, or view them on any phone. This is another example of combining hardware, software, and machine learning to create new features for Pixel 2.

Acknowledgements
Motion photos is a result of a collaboration across several Google Research teams, Google Pixel and Google Photos. We especially want to acknowledge the work of Karthik Raveendran, Suril Shah, Marius Renn, Alex Hong, Radford Juang, Fares Alhassen, Emily Chang, Isaac Reynolds, and Dave Loxton.


Source: Google AI Blog


Mobile Real-time Video Segmentation



Video segmentation is a widely used technique that enables movie directors and video content creators to separate the foreground of a scene from the background, and treat them as two different visual layers. By modifying or replacing the background, creators can convey a particular mood, transport themselves to a fun location or enhance the impact of the message. However, this operation has traditionally been performed as a time-consuming manual process (e.g. an artist rotoscoping every frame) or requires a studio environment with a green screen for real-time background removal (a technique referred to as chroma keying). In order to enable users to create this effect live in the viewfinder, we designed a new technique that is suitable for mobile phones.

Today, we are excited to bring precise, real-time, on-device mobile video segmentation to the YouTube app by integrating this technology into stories. Currently in limited beta, stories is YouTube’s new lightweight video format, designed specifically for YouTube creators. Our new segmentation technology allows creators to replace and modify the background, effortlessly increasing videos’ production value without specialized equipment.
Neural network video segmentation in YouTube stories.
To achieve this, we leverage machine learning to solve a semantic segmentation task using convolutional neural networks. In particular, we designed a network architecture and training procedure suitable for mobile phones focusing on the following requirements and constraints:
  • A mobile solution should be lightweight and run at least 10-30 times faster than existing state-of-the-art photo segmentation models. For real time inference, such a model needs to provide results at 30 frames per second.
  • A video model should leverage temporal redundancy (neighboring frames look similar) and exhibit temporal consistency (neighboring results should be similar).
  • High quality segmentation results require high quality annotations.
The Dataset
To provide high quality data for our machine learning pipeline, we annotated tens of thousands of images that captured a wide spectrum of foreground poses and background settings. Annotations consisted of pixel-accurate locations of foreground elements such as hair, glasses, neck, skin, and lips, plus a general background label, achieving a cross-validation result of 98% Intersection-Over-Union (IOU) against human annotator quality.
An example image from our dataset carefully annotated with nine labels - foreground elements are overlaid over the image.
Network Input
Our specific segmentation task is to compute a binary mask separating foreground from background for every input frame (three channels, RGB) of the video. Achieving temporal consistency of the computed masks across frames is key. Current methods that utilize LSTMs or GRUs to realize this are too computationally expensive for real-time applications on mobile phones. Instead, to achieve temporal consistency we pass the computed mask from the previous frame as a prior, concatenating it as a fourth channel to the current RGB input frame, as shown below:
The original frame (left) is separated in its three color channels and concatenated with the previous mask (middle). This is used as input to our neural network to predict the mask for the current frame (right).
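Constructing the four-channel network input is straightforward; a minimal sketch with illustrative shapes:

```python
import numpy as np

def make_network_input(rgb_frame, prev_mask):
    """Concatenate the previous frame's mask as a fourth input channel.

    Sketch of the input layout described above; the frame dimensions are
    illustrative, not the model's actual input resolution.
    """
    assert rgb_frame.shape[:2] == prev_mask.shape
    return np.concatenate([rgb_frame, prev_mask[..., None]], axis=-1)

frame = np.zeros((192, 128, 3), dtype=np.float32)  # current RGB frame
mask = np.ones((192, 128), dtype=np.float32)       # mask from frame t-1
x = make_network_input(frame, mask)                # shape (192, 128, 4)
```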
Training Procedure
In video segmentation we need to achieve frame-to-frame temporal continuity, while also accounting for temporal discontinuities such as people suddenly appearing in the field of view of the camera. To train our model to robustly handle those use cases, we transform the annotated ground truth of each photo in several ways and use it as a previous frame mask:
  • Empty previous mask - Trains the network to work correctly for the first frame and new objects in scene. This emulates the case of someone appearing in the camera's frame.
  • Affine transformed ground truth mask - Minor transformations train the network to propagate and adjust to the previous frame mask. Major transformations train the network to understand inadequate masks and discard them.
  • Transformed image - We implement thin plate spline smoothing of the original image to emulate fast camera movements and rotations.
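The first two mask variants above can be sketched as follows. This is a simplified stand-in (an empty mask, and np.roll as a small translation) rather than the actual training pipeline, which also uses larger affine warps and thin plate spline image deformations:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_prev_mask(gt_mask, mode):
    """Produce a synthetic 'previous frame' mask from the ground truth.

    'empty' emulates a first frame or a newly appearing subject;
    'affine' applies a small random translation as a stand-in for a
    minor affine transformation of the ground truth mask.
    """
    if mode == "empty":
        return np.zeros_like(gt_mask)
    if mode == "affine":
        dy, dx = rng.integers(-3, 4, size=2)
        return np.roll(gt_mask, shift=(dy, dx), axis=(0, 1))
    raise ValueError(mode)

gt = np.zeros((32, 32))
gt[8:24, 8:24] = 1.0                       # ground truth person mask
empty = augment_prev_mask(gt, "empty")     # all zeros
shifted = augment_prev_mask(gt, "affine")  # same area, translated
```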
Our real-time video segmentation in action.
Network Architecture
With that modified input/output, we build on a standard hourglass segmentation network architecture by adding the following improvements:
  • We use big convolution kernels with large strides of four and above to detect object features on the high-resolution RGB input frame. Convolutions for layers with a small number of channels (as is the case for the RGB input) are comparably cheap, so using big kernels here has almost no effect on the computational costs.
  • For speed gains, we aggressively downsample using large strides combined with skip connections like U-Net to restore low-level features during upsampling. For our segmentation model this technique results in a significant improvement of 5% IOU compared to using no skip connections.
    Hourglass segmentation network w/ skip connections.
  • For even further speed gains, we optimized default ResNet bottlenecks. In the literature authors tend to squeeze channels in the middle of the network by a factor of four (e.g. reducing 256 channels to 64 by using 64 different convolution kernels). However, we noticed that one can squeeze much more aggressively by a factor of 16 or 32 without significant quality degradation.
    ResNet bottleneck with large squeeze factor.
  • To refine and improve the accuracy of edges, we add several DenseNet layers on top of our network in full resolution similar to neural matting. This technique improves overall model quality by a slight 0.5% IOU, however perceptual quality of segmentation improves significantly.
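A back-of-the-envelope parameter count shows why the aggressive squeeze helps. The sketch below counts only the convolution weights of a 1x1 → 3x3 → 1x1 bottleneck, ignoring biases and batch norm:

```python
def bottleneck_params(channels, squeeze):
    """Approximate weight count of a ResNet bottleneck (1x1 -> 3x3 -> 1x1).

    Rough illustration only: counts convolution weights and ignores
    biases and batch-norm parameters.
    """
    mid = channels // squeeze
    return (channels * mid          # 1x1 reduce
            + 3 * 3 * mid * mid     # 3x3 conv in the squeezed space
            + mid * channels)       # 1x1 expand

standard = bottleneck_params(256, squeeze=4)    # squeeze 256 -> 64 channels
aggressive = bottleneck_params(256, squeeze=16) # squeeze 256 -> 16 channels
ratio = standard / aggressive
```

With 256 channels, squeezing to 16 channels instead of 64 cuts the bottleneck's weight count by roughly 6.6x, most of the saving coming from the 3x3 convolution.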
The end result of these modifications is that our network runs remarkably fast on mobile devices, achieving 100+ FPS on iPhone 7 and 40+ FPS on Pixel 2 with high accuracy (realizing 94.8% IOU on our validation dataset), delivering a variety of smooth running and responsive effects in YouTube stories.
Our immediate goal is to use the limited rollout in YouTube stories to test our technology on this first set of effects. As we improve and expand our segmentation technology to more labels, we plan to integrate it into Google's broader Augmented Reality services.

Acknowledgements
A thank you to our team members who worked on the tech and this launch with us: Andrey Vakunov, Yury Kartynnik, Artsiom Ablavatski, Ivan Grishchenko, Matsvei Zhdanovich, Andrei Kulik, Camillo Lugaresi, John Kim, Ryan Bolyard, Wendy Huang, Michael Chang, Aaron La Lau, Willi Geiger, Tomer Margolin, John Nack and Matthias Grundmann.




Introducing Appsperiments: Exploring the Potentials of Mobile Photography



Each of the world's approximately two billion smartphone owners is carrying a camera capable of capturing photos and video of a tonal richness and quality unimaginable even five years ago. Until recently, those cameras behaved mostly as optical sensors, capturing light and operating on the resulting image's pixels. The next generation of cameras, however, will have the capability to blend hardware and computer vision algorithms that operate as well on an image's semantic content, enabling radically new creative mobile photo and video applications.

Today, we're launching the first installment of a series of photography appsperiments: usable and useful mobile photography experiences built on experimental technology. Our "appsperimental" approach was inspired in part by Motion Stills, an app developed by researchers at Google that converts short videos into cinemagraphs and time lapses using experimental stabilization and rendering technologies. Our appsperiments replicate this approach by building on other technologies in development at Google. They rely on object recognition, person segmentation, stylization algorithms, efficient image encoding and decoding technologies, and perhaps most importantly, fun!

Storyboard
Storyboard (Android) transforms your videos into single-page comic layouts, entirely on device. Simply shoot a video and load it in Storyboard. The app automatically selects interesting video frames, lays them out, and applies one of six visual styles. Save the comic or pull down to refresh and instantly produce a new one. There are approximately 1.6 trillion different possibilities!

Selfissimo!
Selfissimo! (iOS, Android) is an automated selfie photographer that snaps a stylish black and white photo each time you pose. Tap the screen to start a photoshoot. The app encourages you to pose and captures a photo whenever you stop moving. Tap the screen to end the session and review the resulting contact sheet, saving individual images or the entire shoot.

Scrubbies
Scrubbies (iOS) lets you easily manipulate the speed and direction of video playback to produce delightful video loops that highlight actions, capture funny faces, and replay moments. Shoot a video in the app and then remix it by scratching it like a DJ. Scrubbing with one finger plays the video. Scrubbing with two fingers captures the playback so you can save or share it.

Try them out and tell us what you think using the in-app feedback links. The feedback and ideas we get from the new and creative ways people use our appsperiments will help guide some of the technology we develop next.

Acknowledgements
These appsperiments represent a collaboration across many teams at Google. We would like to thank the core contributors Andy Dahley, Ashley Ma, Dexter Allen, Ignacio Garcia Dorado, Madison Le, Mark Bowers, Pascal Getreuer, Robin Debreuil, Suhong Jin, and William Lindmeier. We also wish to give special thanks to Buck Bourdon, Hossein Talebi, Kanstantsin Sokal, Karthik Raveendran, Matthias Grundmann, Peyman Milanfar, Suril Shah, Tomas Izo, Tyler Mullen, and Zheng Sun.

Fused Video Stabilization on the Pixel 2 and Pixel 2 XL



One of the most important aspects of current smartphones is easily capturing and sharing videos. With the Pixel 2 and Pixel 2 XL smartphones, the videos you capture are smoother and clearer than ever before, thanks to our Fused Video Stabilization technique based on both optical image stabilization (OIS) and electronic image stabilization (EIS). Fused Video Stabilization delivers highly stable footage with minimal artifacts, and the Pixel 2 is currently rated as the leader in DxO's video ranking (also earning the highest overall rating for a smartphone camera). But how does it work?

A key principle in videography is keeping the camera motion smooth and steady. A stable video is free of distraction, so the viewer can focus on the subject of interest. But videos taken with smartphones are subject to many conditions that make taking a high-quality video a significant challenge:

Camera Shake
Most people hold their mobile phones in their hands to record videos: you pull the phone from your pocket, record the video, and the video is ready to share right after recording. However, that means your videos shake as much as your hands do, and they shake a lot! Moreover, if you are walking or running while recording, the camera motion can make videos almost unwatchable:
Motion Blur
If the camera or the subject moves during exposure, the resulting photo or video will appear blurry. Even if we stabilize the motion in between consecutive frames, the motion blur in each individual frame cannot be easily restored in practice, especially on a mobile device. One typical video artifact due to motion blur is sharpness inconsistency: the video may rapidly alternate between blurry and sharp, which is very distracting even after the video is stabilized:
Rolling Shutter
The CMOS image sensor collects one row of pixels, or “scanline”, at a time, and it takes tens of milliseconds to go from the top scanline to the bottom. Therefore, anything moving during this period can appear distorted. This is called rolling shutter distortion. Even if you have a steady hand, rolling shutter distortion will appear when you move quickly:
A simulated rendering of a video with global (left) and rolling (right) shutter.
Focus Breathing
When there are objects of varying distance in a video, the angle of view can change significantly due to objects “jumping” in and out of the foreground. As a result, everything shrinks or expands like the video below, which professionals call “breathing”:
A good stabilization system should address all of these issues: the video should look sharp, the motion should be smooth, and the rolling shutter and focus breathing should be corrected.

Many professionals mount the camera on a mechanical stabilizer to entirely isolate hand motion. These devices actively sense and compensate for the camera’s movement to remove all unwanted motions. However, they are usually expensive and cumbersome; you wouldn’t want to carry one every day. There are also handheld gimbal mounts available for mobile phones. However, they are usually larger than the phone itself, and you have to put the phone on it before you start recording. You’d need to do it fast before the interesting moment vanishes.

Optical Image Stabilization (OIS) is the most well-known method for suppression of handshake artifacts. Typically, in mobile camera modules with OIS, the lens is suspended in the middle of the module by a number of springs, and electromagnets are used to move the lens within its enclosure. The lens module actively senses and compensates for handshake motion at very high speeds. Because OIS responds to motion rapidly, it can greatly suppress the handshake blur. However, the range of correctable motion is fairly limited (usually around 1-2 degrees), which is not enough to correct the unwanted motions between consecutive video frames, or to correct excessive motion blur during walking. Moreover, OIS cannot correct some kinds of motions, such as in-plane rotation. Sometimes it can even introduce a “jello” artifact:
The video is taken by Pixel 2 with only OIS enabled. You can see the frame center is stabilized, but the boundaries have some jello-like artifacts.
Electronic Image Stabilization (EIS) analyzes the camera motion, filters out the unwanted parts, and synthesizes a new video by transforming each frame. The final stabilization quality depends on the algorithm design and implementation optimization of these stages. In general, software-based EIS is more flexible than OIS so it can correct larger and more kinds of motions. However, EIS has some common limitations. First, to prevent undefined regions in the synthesized frame, it needs to reduce the field of view or resolution. Second, compared to OIS or an external stabilizer, EIS requires more computation, which is a limited resource on mobile phones.

Making a Better Video: Fused Video Stabilization
With Fused Video Stabilization, both OIS and EIS are enabled simultaneously during video recording to address all the issues mentioned above. Our solution has three processing stages as shown in the system diagram below. The first processing stage, motion analysis, extracts the gyroscope signal, the OIS motion, and other properties to estimate the camera motion precisely. Then, the motion filtering stage combines machine learning and signal processing to predict a person’s intention in moving the camera. Finally, in the frame synthesis stage, we model and remove the rolling shutter and focus breathing distortion. With Fused Video Stabilization, the videos from Pixel 2 have less motion blur and look more natural. The solution is efficient enough to run in all video modes, such as 60fps or 4K recording.
Motion Analysis
In the motion analysis stage, we use the phone’s high-speed gyroscope to estimate the rotational component of the hand motion (roll, pitch, and yaw). By sensing the motion at 200 Hz, we have dense motion vectors for each scanline, enough to model the rolling shutter distortion. We also measure lens motions that are not sensed by the gyroscope, including both the focus adjustment (z) and the OIS movement (x and y) at high speed. Because we need high temporal precision to model the rolling shutter effect, we carefully optimize the system to ensure perfect timestamp alignment between the CMOS image sensor, the gyroscope, and the lens motion readouts. A misalignment of merely a few milliseconds can introduce noticeable jittering artifact:
Left: The stabilized video of a “running” motion with a 3ms timing error. Note the occasional jittering. Right: The stabilized video with correct timestamps. The bottom right corner shows the original shaky video.
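The per-scanline motion estimate can be illustrated with a simplified 1-D sketch: integrate the gyroscope rate signal into an angle, then sample it at each scanline's readout time. The real pipeline works with 3-axis rotations, OIS and focus readouts, and carefully aligned hardware timestamps:

```python
import numpy as np

def scanline_rotations(gyro_t, gyro_rate, frame_start, readout_ms, n_rows):
    """Estimate a rotation angle for every scanline of one frame.

    Simplified single-axis sketch: integrate the gyro rate into an angle,
    then interpolate it at each scanline's timestamp (rows are read out
    sequentially over `readout_ms`). All values are illustrative.
    """
    angle = np.concatenate([[0.0], np.cumsum(np.diff(gyro_t) * gyro_rate[1:])])
    row_t = frame_start + np.linspace(0.0, readout_ms, n_rows)
    return np.interp(row_t, gyro_t, angle)

t = np.arange(0.0, 100.0, 5.0)   # gyro timestamps in ms (200 Hz)
rate = np.full_like(t, 0.01)     # constant yaw rate, deg/ms
rows = scanline_rotations(t, rate, frame_start=20.0, readout_ms=30.0, n_rows=4)
```

With a constant rate, each later scanline sees a proportionally larger rotation, which is exactly the gradient the mesh warp must undo.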
Motion Filtering
The motion filtering stage takes the real camera motion from motion analysis and creates the stabilized virtual camera motion. Note that we push the incoming frames into a queue to defer the processing. This enables us to look ahead at future camera motions, using machine learning to accurately predict the user’s intention. Lookahead filtering is not feasible for OIS or any mechanical stabilizers, which can only react to previous or present motions. We will discuss more about this below.

Frame Synthesis
At the final stage, we derive how the frame is transformed based on the real and virtual camera motions. To handle the rolling shutter distortion, we use multiple transformations for each frame. We split the input frame into a mesh and warp each part separately:
Left: The input video with mesh overlay. Right: The warped frame, and the red rectangle is the final stabilized output. Note how the non-rigid warping corrects the rolling shutter distortion.
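A toy version of the per-scanline correction: give each row band of the frame its own shift instead of one global transform. Real mesh warps use sub-pixel 2-D transforms per mesh cell; integer row shifts are used here only to keep the sketch short:

```python
import numpy as np

def rowwise_warp(frame, row_shifts):
    """Warp each horizontal band of a frame by its own shift.

    Toy stand-in for the mesh warp: every row band gets its own
    horizontal shift (an integer np.roll), which is what lets a
    non-rigid warp undo rolling shutter distortion.
    """
    out = frame.copy()
    band = frame.shape[0] // len(row_shifts)
    for i, shift in enumerate(row_shifts):
        sl = slice(i * band, (i + 1) * band)
        out[sl] = np.roll(frame[sl], shift, axis=1)
    return out

img = np.zeros((4, 8), dtype=int)
img[:, 3] = 1                             # a vertical line
warped = rowwise_warp(img, [0, 1, 2, 3])  # increasing shift per row
```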
Lookahead Motion Filtering
One key feature of Fused Video Stabilization is our new lookahead filtering algorithm. It analyzes future motions to recognize the user-intended motion patterns, and creates a smooth virtual camera motion. The lookahead filtering has multiple stages to incrementally improve the virtual camera motion for each frame. In the first step, Gaussian filtering is applied to the real camera motions of both past and future frames to obtain a smoothed camera motion:
Left: The input unstabilized video. Right: The smoothed result after Gaussian filtering.
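This first filtering step can be sketched as a Gaussian blur over the camera path that looks both backward and forward in time; the kernel size and sigma below are illustrative choices:

```python
import numpy as np

def gaussian_smooth_path(path, sigma=3.0, radius=8):
    """Smooth a 1-D camera path with a Gaussian over past AND future frames.

    Because frames are queued, the filter window can extend `radius`
    frames into the future, unlike OIS or mechanical stabilizers that
    only see the past. Edge frames reuse the nearest value.
    """
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(path, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

rng = np.random.default_rng(1)
shaky = np.cumsum(rng.normal(0.0, 1.0, 120))  # random-walk "hand shake"
smooth = gaussian_smooth_path(shaky)
```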
You’ll notice that it’s still not very stable. To further improve the quality, we trained a model to extract intentional motions from the noisy real camera motions. We then apply additional filters given the predicted motion. For example, if we predict the camera is panning horizontally, we would reject more vertical motions. The result is shown below.
Left: The Gaussian filtered result. Right: Our lookahead result. We predict that the user is panning to the right, and suppress more vertical motions.
In practice, the process above does not guarantee that there are no undefined “bad” regions, which can appear when the virtual camera is too stabilized and the warped frame falls outside the original field of view. We predict the likelihood of this issue in the next couple of frames and adjust the virtual camera motion to get the final result.
Left: Our lookahead result. The undefined area at the bottom-left are shown in cyan. Right: The final result with the bad region removed.
As we mentioned earlier, even with OIS enabled, sometimes the motions are too large and cause motion blur in a single frame. When EIS is applied to further smooth the camera motion, the motion blur leads to distracting sharpness variations:
Left: Pixel 2 with OIS only. Right: Pixel 2 with the basic Fused Video Stabilization. Note the sharpness variation around the “Exit” label.
This is a very common problem in EIS solutions. To address this issue, we exploit the “masking” property in the human visual system. Motion blur usually blurs the frame along a specific direction, and if the overall frame motion follows that direction, the human eye will not notice it. Instead, our brain treats the blur as a natural part of the motion, and masks it away from our perception.

With the high-frequency gyroscope and OIS signals, we can accurately estimate the motion blur for each frame. We compute where the camera pointed to at both the beginning and end of exposure, and the movement in-between is the motion blur. After that, we apply a machine learning algorithm (trained on a set of videos with and without motion blur) to map the motion blurs in past and future frames to the amount of real camera motion we want to keep, and blend the weighted real camera motion with the virtual one. As you can see below, with the motion blur masking, the distracting sharpness variation is greatly reduced and the camera motion is still stabilized.
Left: Pixel 2 with the basic Fused Video Stabilization. Right: The full Fused Video Stabilization solution with motion blur masking.
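The blending step can be sketched numerically. Here the learned mapping from blur to blend weight is replaced by a simple linear ramp, which is an assumption for illustration only:

```python
import numpy as np

def blend_camera_motion(real, virtual, blur_magnitude, max_blur=4.0):
    """Blend real and virtual camera motion based on motion blur.

    Simplified sketch of motion blur masking: with no blur we fully
    trust the stabilized virtual motion; with heavy blur we keep more
    of the real motion so the blur reads as natural camera movement.
    The linear ramp and `max_blur` value are illustrative assumptions.
    """
    w = np.clip(blur_magnitude / max_blur, 0.0, 1.0)  # weight on real motion
    return w * real + (1.0 - w) * virtual

still = blend_camera_motion(real=2.0, virtual=0.0, blur_magnitude=0.0)
blurry = blend_camera_motion(real=2.0, virtual=0.0, blur_magnitude=4.0)
```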
Results
We have seen many amazing videos from Pixel 2 with Fused Video Stabilization. Here are some for you to check out:
Videos taken by two Pixel 2 phones mounted on a single hand grip. Fused Video Stabilization is disabled in the left one.
Videos taken by two Pixel 2 phones mounted on a single hand grip. Fused Video Stabilization is disabled in the left one. Note that the videographer jumped together with the subject.
Fused Video Stabilization combines the best of OIS and EIS, shows great results in camera motion smoothing and motion blur reduction, and corrects both rolling shutter and focus breathing. With Fused Video Stabilization on the Pixel 2 and Pixel 2 XL, you no longer have to carefully place the phone before recording, hold it firmly over the entire recording session, or carry a gimbal mount everywhere. The recorded video will always be stable, sharp, and ready to share.

Acknowledgements
Fused Video Stabilization is a large-scale effort across multiple teams in Google, including the camera algorithm team, sensor algorithm team, camera hardware team, and sensor hardware team.

Portrait mode on the Pixel 2 and Pixel 2 XL smartphones



Portrait mode, a major feature of the new Pixel 2 and Pixel 2 XL smartphones, allows anyone to take professional-looking shallow depth-of-field images. This feature helped both devices earn DxO's highest mobile camera ranking, and works with both the rear-facing and front-facing cameras, even though neither is a dual camera (normally required to obtain this effect). Today we discuss the machine learning and computational photography techniques behind this feature.
HDR+ picture without (left) and with (right) portrait mode. Note how portrait mode’s synthetic shallow depth of field helps suppress the cluttered background and focus attention on the main subject. Click on these links in the caption to see full resolution versions. Photo by Matt Jones
What is a shallow depth-of-field image?
A single-lens reflex (SLR) camera with a big lens has a shallow depth of field, meaning that objects at one distance from the camera are sharp, while objects in front of or behind that "in-focus plane" are blurry. Shallow depth of field is a good way to draw the viewer's attention to a subject, or to suppress a cluttered background. Shallow depth of field is what gives portraits captured using SLRs their characteristic artistic look.

The amount of blur in a shallow depth-of-field image depends on depth; the farther objects are from the in-focus plane, the blurrier they appear. The amount of blur also depends on the size of the lens opening. A 50mm lens with an f/2.0 aperture has an opening 50mm/2 = 25mm in diameter. With such a lens, objects that are even a few inches away from the in-focus plane will appear soft.

One other parameter worth knowing about depth of field is the shape taken on by blurred points of light. This shape is called bokeh, and it depends on the physical structure of the lens's aperture. Is the bokeh circular? Or is it a hexagon, due to the six metal leaves that form the aperture inside some lenses? Photographers debate tirelessly about what constitutes good or bad bokeh.

Synthetic shallow depth of field images
Unlike SLR cameras, mobile phone cameras have a small, fixed-size aperture, which produces pictures with everything more or less in focus. But if we knew the distance from the camera to points in the scene, we could replace each pixel in the picture with a blur. This blur would be an average of the pixel's color with its neighbors, where the amount of blur depends on distance of that scene point from the in-focus plane. We could also control the shape of this blur, meaning the bokeh.

How can a cell phone estimate the distance to every point in the scene? The most common method is to place two cameras close to one another – so-called dual-camera phones. Then, for each patch in the left camera's image, we look for a matching patch in the right camera's image. The position in the two images where this match is found gives the depth of that scene feature through a process of triangulation. This search for matching features is called a stereo algorithm, and it works pretty much the same way our two eyes do.
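The patch-matching idea can be sketched with a toy block-matching stereo algorithm. This is a deliberately naive illustration (brute-force SSD over integer shifts on a synthetic pair), not the production algorithm, which uses subpixel tile alignment and a bilateral solver:

```python
import numpy as np

def disparity_map(left, right, patch=5, max_disp=8):
    """Brute-force block matching: for each pixel in `left`, find the
    horizontal shift into `right` whose patch differs least (sum of
    squared differences). Larger disparity means a closer object."""
    h, w = left.shape
    r = patch // 2
    disp = np.zeros((h, w), dtype=np.int64)
    for y in range(r, h - r):
        for x in range(r, w - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1]
            best, best_d = np.inf, 0
            for d in range(0, min(max_disp, x - r) + 1):
                cand = right[y - r:y + r + 1, x - d - r:x - d + r + 1]
                ssd = np.sum((ref - cand) ** 2)
                if ssd < best:
                    best, best_d = ssd, d
            disp[y, x] = best_d
    return disp

# Synthetic stereo pair: the whole textured scene shifted 3 px between views.
rng = np.random.default_rng(0)
right = rng.normal(size=(30, 40))
left = np.roll(right, 3, axis=1)   # left view sees the scene shifted by 3 px
disp = disparity_map(left, right)  # interior disparities come out as 3
```

Triangulation then converts each disparity into a metric depth using the known baseline between the two cameras and the focal length.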

A simpler version of this idea, used by some single-camera smartphone apps, involves separating the image into two layers – pixels that are part of the foreground (typically a person) and pixels that are part of the background. This separation, sometimes called semantic segmentation, lets you blur the background, but it has no notion of depth, so it can't tell you how much to blur it. Also, if there is an object in front of the person, i.e. very close to the camera, it won't be blurred out, even though a real camera would do this.

Whether done using stereo or segmentation, artificially blurring pixels that belong to the background is called synthetic shallow depth of field or synthetic background defocusing. Synthetic defocus is not the same as the optical blur you would get from an SLR, but it looks similar to most people.
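As a rough sketch of depth-dependent synthetic defocus: each pixel is averaged with a neighborhood whose radius grows with distance from the in-focus plane. This gather-style box blur is a simplification (the real renderer composites disks in depth order, as described later), and the strength and radius values are arbitrary:

```python
import numpy as np

def synthetic_defocus(image, depth, focus_depth, strength=3.0, max_radius=5):
    """Blur each pixel by a box average whose radius grows with the
    pixel's distance from the in-focus plane; in-focus pixels stay sharp."""
    h, w = depth.shape
    out = np.empty_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            r = int(min(max_radius, strength * abs(depth[y, x] - focus_depth)))
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            out[y, x] = image[y0:y1, x0:x1].mean()
    return out

# In-focus pixels (depth == focus_depth) are untouched; far pixels blur.
img = np.zeros((20, 20)); img[::2] = 1.0             # high-frequency stripes
depth = np.full((20, 20), 2.0); depth[:, 10:] = 5.0  # right half is far away
out = synthetic_defocus(img, depth, focus_depth=2.0)
```

Swapping the box neighborhood for a disk-shaped kernel is what gives the blur the circular bokeh discussed above.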

How portrait mode works on the Pixel 2
The Google Pixel 2 offers portrait mode on both its rear-facing and front-facing cameras. For the front-facing (selfie) camera, it uses only segmentation. For the rear-facing camera it uses both stereo and segmentation. But wait, the Pixel 2 has only one rear facing camera; how can it see in stereo? Let's go through the process step by step.
Step 1: Generate an HDR+ image.
Portrait mode starts with a picture where everything is sharp. For this we use HDR+, Google's computational photography technique for improving the quality of captured photographs, which runs on all recent Nexus/Pixel phones. It operates by capturing a burst of images that are underexposed to avoid blowing out highlights, aligning and averaging these frames to reduce noise in the shadows, and boosting these shadows in a way that preserves local contrast while judiciously reducing global contrast. The result is a picture with high dynamic range, low noise, and sharp details, even in dim lighting.
The idea of aligning and averaging frames to reduce noise has been known in astrophotography for decades. Google's implementation is a bit different, because we do it on bursts captured by a handheld camera, and we need to be careful not to produce ghosts (double images) if the photographer is not steady or if objects in the scene move. Below is an example of a scene with high dynamic range, captured using HDR+.
Photographs from the Pixel 2 without (left) and with (right) HDR+ enabled.
Notice how HDR+ avoids blowing out the sky and courtyard while retaining detail in the dark arcade ceiling.
Photo by Marc Levoy
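The align-and-average idea at the heart of burst merging can be sketched as follows. This toy version aligns each frame by brute-force integer translation and ignores everything that makes the real problem hard (rolling shutter, moving objects, ghost rejection, subpixel alignment); it only shows why averaging N aligned frames cuts noise by roughly the square root of N:

```python
import numpy as np

def align_and_average(burst, max_shift=4):
    """Align each frame to the first by the integer translation that
    minimizes SSD, then average the aligned frames to reduce noise."""
    ref = burst[0]
    acc = ref.astype(float).copy()
    for frame in burst[1:]:
        best, best_dx, best_dy = np.inf, 0, 0
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                shifted = np.roll(np.roll(frame, dy, axis=0), dx, axis=1)
                ssd = np.sum((shifted - ref) ** 2)
                if ssd < best:
                    best, best_dx, best_dy = ssd, dx, dy
        acc += np.roll(np.roll(frame, best_dy, axis=0), best_dx, axis=1)
    return acc / len(burst)

rng = np.random.default_rng(1)
clean = rng.uniform(size=(32, 32))
# Eight noisy frames, each offset by simulated hand shake.
burst = [np.roll(clean, (s, -s), axis=(0, 1)) + rng.normal(0, 0.05, clean.shape)
         for s in [0, 1, 2, 3, -1, -2, -3, 2]]
merged = align_and_average(np.array(burst))  # noise drops by ~sqrt(8)
```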
Step 2:  Machine learning-based foreground-background segmentation.
Starting from an HDR+ picture, we next decide which pixels belong to the foreground (typically a person) and which belong to the background. This is a tricky problem, because unlike chroma keying (a.k.a. green-screening) in the movie industry, we can't assume that the background is green (or blue, or any other color). Instead, we apply machine learning.
In particular, we have trained a neural network, written in TensorFlow, that looks at the picture, and produces an estimate of which pixels are people and which aren't. The specific network we use is a convolutional neural network (CNN) with skip connections. "Convolutional" means that the learned components of the network are in the form of filters (a weighted sum of the neighbors around each pixel), so you can think of the network as just filtering the image, then filtering the filtered image, etc. The "skip connections" allow information to easily flow from the early stages in the network where it reasons about low-level features (color and edges) up to later stages of the network where it reasons about high-level features (faces and body parts). Combining stages like this is important when you need to not just determine if a photo has a person in it, but to identify exactly which pixels belong to that person. Our CNN was trained on almost a million pictures of people (and their hats, sunglasses, and ice cream cones). Inference to produce the mask runs on the phone using TensorFlow Mobile. Here’s an example:
At left is a picture produced by our HDR+ pipeline, and at right is the smoothed output of our neural network. White parts of this mask are thought by the network to be part of the foreground, and black parts are thought to be background.
Photo by Sam Kweskin
How good is this mask? Not too bad; our neural network recognizes the woman's hair and her teacup as being part of the foreground, so it can keep them sharp. If we blur the photograph based on this mask, we would produce this image:
Synthetic shallow depth-of-field image generated using a mask.
There are several things to notice about this result. First, the amount of blur is uniform, even though the background contains objects at varying depths. Second, an SLR would also blur out the pastry on her plate (and the plate itself), since it's close to the camera. Our neural network knows the pastry isn't part of her (note that it's black in the mask image), but being below her it’s not likely to be part of the background. We explicitly detect this situation and keep these pixels relatively sharp. Unfortunately, this solution isn’t always correct, and in this situation we should have blurred these pixels more.
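The flow of information through a skip-connection CNN can be sketched in a few lines. The model below is a toy with random, untrained weights and hypothetical channel counts; it only illustrates the architecture (filter, filter again, then re-inject the early low-level features before the final per-pixel prediction), not the actual network:

```python
import numpy as np

def conv3x3(x, kernels):
    """'Same' 3x3 convolution: each output channel is a weighted sum of
    the 3x3 neighborhood across all input channels."""
    h, w, cin = x.shape
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, kernels.shape[0]))  # kernels: (cout, 3, 3, cin)
    for i in range(3):
        for j in range(3):
            out += np.einsum('hwc,kc->hwk', pad[i:i + h, j:j + w], kernels[:, i, j])
    return out

def segnet_forward(img, k1, k2, k3):
    """Tiny network with one skip connection: low-level features from the
    first stage are concatenated back in before the final one-channel
    'person probability' prediction."""
    f1 = np.maximum(conv3x3(img, k1), 0)       # early, low-level features
    f2 = np.maximum(conv3x3(f1, k2), 0)        # deeper features
    fused = np.concatenate([f1, f2], axis=-1)  # the skip connection
    logits = conv3x3(fused, k3)[..., 0]
    return 1 / (1 + np.exp(-logits))           # per-pixel mask in (0, 1)

rng = np.random.default_rng(0)
img = rng.uniform(size=(16, 16, 3))
k1 = rng.normal(0, 0.1, (8, 3, 3, 3))
k2 = rng.normal(0, 0.1, (8, 3, 3, 8))
k3 = rng.normal(0, 0.1, (1, 3, 3, 16))
mask = segnet_forward(img, k1, k2, k3)
```

The real model is, of course, far deeper, trained on the million-image dataset mentioned above, and runs via TensorFlow Mobile rather than numpy.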
Step 3. From dual pixels to a depth map
To improve on this result, it helps to know the depth at each point in the scene. To compute depth we can use a stereo algorithm. The Pixel 2 doesn't have dual cameras, but it does have a technology called Phase-Detect Auto-Focus (PDAF) pixels, sometimes called dual-pixel autofocus (DPAF). That's a mouthful, but the idea is pretty simple. If one imagines splitting the (tiny) lens of the phone's rear-facing camera into two halves, the view of the world as seen through the left side of the lens and the view through the right side are slightly different. These two viewpoints are less than 1mm apart (roughly the diameter of the lens), but they're different enough to compute stereo and produce a depth map. The way the optics of the camera works, this is equivalent to splitting every pixel on the image sensor chip into two smaller side-by-side pixels and reading them from the chip separately, as shown here:
On the rear-facing camera of the Pixel 2, the right side of every pixel looks at the world through the left side of the lens, and the left side of every pixel looks at the world through the right side of the lens.
Figure by Markus Kohlpaintner, reproduced with permission.
As the diagram shows, PDAF pixels give you views through the left and right sides of the lens in a single snapshot. Or, if you're holding your phone in portrait orientation, then it's the upper and lower halves of the lens. Here's what the upper image and lower image look like for our example scene (below). These images are monochrome because we only use the green pixels of our Bayer color filter sensor in our stereo algorithm, not the red or blue pixels. Having trouble telling the two images apart? Maybe the animated gif at right (below) will help. Look closely; the differences are very small indeed!
Views of our test scene through the upper half and lower half of the lens of a Pixel 2. In the animated gif at right,
notice that she holds nearly still, because the camera is focused on her, while the background moves up and down.
Objects in front of her, if we could see any, would move down when the background moves up (and vice versa).
PDAF technology can be found in many cameras, including SLRs, where it helps them focus faster when recording video. In our application, this technology is instead used to compute a depth map. Specifically, we use our left-side and right-side images (or top and bottom) as input to a stereo algorithm similar to that used in Google's Jump system panorama stitcher (called the Jump Assembler). This algorithm first performs subpixel-accurate tile-based alignment to produce a low-resolution depth map, then interpolates it to high resolution using a bilateral solver. This is similar to the technology formerly used in Google's Lens Blur feature.
One more detail: because the left-side and right-side views captured by the Pixel 2 camera are so close together, the depth information we get is inaccurate, especially in low light, due to the high noise in the images. To reduce this noise and improve depth accuracy we capture a burst of left-side and right-side images, then align and average them before applying our stereo algorithm. Of course we need to be careful during this step to avoid wrong matches, just as in HDR+, or we'll get ghosts in our depth map (but that's the subject of another blog post). On the left below is a depth map generated from the example shown above using our stereo algorithm.
Left: depth map computed using stereo from the foregoing upper-half-of-lens and lower-half-of-lens images. Lighter means closer to the camera. 
Right: visualization of how much blur we apply to each pixel in the original. Black means don't blur at all, red denotes scene features behind the in-focus plane (which is her face), the brighter the red the more we blur, and blue denotes features in front of the in-focus plane (the pastry).
Step 4. Putting it all together to render the final image
The last step is to combine the segmentation mask we computed in step 2 with the depth map we computed in step 3 to decide how much to blur each pixel in the HDR+ picture from step 1.  The way we combine the depth and mask is a bit of secret sauce, but the rough idea is that we want scene features we think belong to a person (white parts of the mask) to stay sharp, and features we think belong to the background (black parts of the mask) to be blurred in proportion to how far they are from the in-focus plane, where these distances are taken from the depth map. The red-colored image above is a visualization of how much to blur each pixel.
Actually applying the blur is conceptually the simplest part; each pixel is replaced with a translucent disk of the same color but varying size. If we composite all these disks in depth order, it's like the averaging we described earlier, and we get a nice approximation to real optical blur. One of the benefits of defocusing synthetically is that because we're using software, we can get a perfect disk-shaped bokeh without lugging around several pounds of glass camera lenses. Interestingly, in software there's no particular reason we need to stick to realism; we could make the bokeh shape anything we want! For our example scene, here is the final portrait mode output. If you compare this result to the rightmost result in step 2, you'll see that the pastry is now slightly blurred, much as you would expect from an SLR.
Final synthetic shallow depth-of-field image, generated by combining our HDR+
picture, segmentation mask, and depth map. Click for a full-resolution image.
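Steps 4's combination-and-render idea can be sketched as below. This is a simplification of the actual "secret sauce": the mask-weighting formula, the blur strengths, and the uniform (rather than depth-ordered, translucent) disk compositing are all illustrative stand-ins:

```python
import numpy as np

def blur_radius_map(mask, depth, focus_depth, strength=2.0, max_r=6):
    """Pixels the network calls 'person' (mask ~ 1) stay sharp; everything
    else is blurred in proportion to its distance from the in-focus plane."""
    r = strength * np.abs(depth - focus_depth)
    return np.clip(r * (1.0 - mask), 0, max_r)

def render_disks(image, radii):
    """Scatter-style defocus: each pixel contributes a uniform disk of its
    own color and radius; overlapping contributions are averaged. (The
    real renderer composites translucent disks in depth order.)"""
    h, w = image.shape
    acc = np.zeros((h, w)); weight = np.zeros((h, w))
    ys, xs = np.mgrid[0:h, 0:w]
    for y in range(h):
        for x in range(w):
            disk = (ys - y) ** 2 + (xs - x) ** 2 <= radii[y, x] ** 2
            acc[disk] += image[y, x]
            weight[disk] += 1.0
    return acc / np.maximum(weight, 1e-9)

depth = np.full((24, 24), 4.0); depth[:, :12] = 1.0  # left half: the subject
mask = (depth == 1.0).astype(float)                  # pretend-perfect person mask
img = np.zeros((24, 24)); img[::2] = 1.0             # striped test pattern
radii = blur_radius_map(mask, depth, focus_depth=1.0)
out = render_disks(img, radii)                       # subject sharp, background soft
```

Even in this toy version you can see a real artifact of scatter-style rendering: background disks near the subject's silhouette bleed slightly over the edge, which is one reason the production renderer composites in depth order.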
Ways to use portrait mode
Portrait mode on the Pixel 2 runs in 4 seconds, is fully automatic (as opposed to Lens Blur mode on previous devices, which required a special up-down motion of the phone), and is robust enough to be used by non-experts. Here is an album of examples, including some hard cases, like people with frizzy hair, people holding flower bouquets, etc. Below is a list of a few ways you can use portrait mode on the new Pixel 2.
Taking macro shots
If you're in portrait mode and you point the camera at a small object instead of a person (like a flower or food), then our neural network can't find a face and won't produce a useful segmentation mask. In other words, step 2 of our pipeline doesn't apply. Fortunately, we still have a depth map from PDAF data (step 3), so we can compute a shallow depth-of-field image based on the depth map alone. Because the baseline between the left and right sides of the lens is so small, this works well only for objects that are roughly less than a meter away. But for such scenes it produces nice pictures. You can think of this as a synthetic macro mode. Below are example straight and portrait mode shots of a macro-sized object, and here's an album with more macro shots, including more hard cases, like a water fountain with a thin wire fence behind it. Just be careful not to get too close; the Pixel 2 can’t focus sharply on objects closer than about 10cm from the camera.
Macro picture without (left) and with (right) portrait mode. There’s no person here, so background pixels are identified solely using the depth map. Photo by Marc Levoy
The selfie camera
The Pixel 2 offers portrait mode on the front-facing (selfie) as well as rear-facing camera. This camera is 8Mpix instead of 12Mpix, and it doesn't have PDAF pixels, meaning that its pixels aren't split into left and right halves. In this case, step 3 of our pipeline doesn't apply, but if we can find a face, then we can still use our neural network (step 2) to produce a segmentation mask. This allows us to still generate a shallow depth-of-field image, but because we don't know how far away objects are, we can't vary the amount of blur with depth. Nevertheless, the effect looks pretty good, especially for selfies shot against a cluttered background, where blurring helps suppress the clutter. Here are example straight and portrait mode selfies taken with the Pixel 2's selfie camera:
Selfie without (left) and with (right) portrait mode. The front-facing camera lacks PDAF pixels, so background pixels are identified using only machine learning. Photo by Marc Levoy

How To Get the Most Out of Portrait Mode
The portraits produced by the Pixel 2 depend on the underlying HDR+ image, segmentation mask, and depth map; problems in these inputs can produce artifacts in the result. For example, if a feature is overexposed in the HDR+ image (blown out to white), then it's unlikely the left-half and right-half images will have useful information in them, leading to errors in the depth map. What can go wrong with segmentation? It's a neural network, which has been trained on nearly a million images, but we bet it has never seen a photograph of a person kissing a crocodile, so it will probably omit the crocodile from the mask, causing it to be blurred out. How about the depth map? Our stereo algorithm may fail on textureless regions (like blank walls), because there are no features to latch onto, or on repeating textures (like plaid shirts) and horizontal or vertical lines, because the algorithm might match the wrong part of the image, triangulating to the wrong depth.

While any complex technology includes tradeoffs, here are some tips for producing great portrait mode shots:
  • Stand close enough to your subjects that their head (or head and shoulders) fill the frame.
  • For a group shot where you want everyone sharp, place them at the same distance from the camera.
  • For a more pleasing blur, put some distance between your subjects and the background.
  • Remove dark sunglasses, floppy hats, giant scarves, and crocodiles.
  • For macro shots, tap to focus to ensure that the object you care about stays sharp.
By the way, you'll notice that in portrait mode the camera zooms a bit (1.5x for the rear-facing camera, and 1.2x for the selfie camera). This is deliberate, because narrower fields of view encourage you to stand back further, which in turn reduces perspective distortion, leading to better portraits.

Is it time to put aside your SLR (forever)?
When we started working at Google 5 years ago, the number of pixels in a cell phone picture hadn't caught up to SLRs, but it was high enough for most people's needs. Even on a big home computer screen, you couldn't see the individual pixels in pictures you took using your cell phone. Nevertheless, mobile phone cameras weren't as powerful as SLRs, in four ways:
  1. Dynamic range in bright scenes (blown-out skies)
  2. Signal-to-noise ratio (SNR) in low light (noisy pictures, loss of detail)
  3. Zoom (for those wildlife shots)
  4. Shallow depth of field
Google's HDR+ and similar technologies by our competitors have made great strides on #1 and #2. In fact, in challenging lighting we'll often put away our SLRs, because we can get a better picture from a phone without painful bracketing and post-processing. For zoom, the modest telephoto lenses being added to some smartphones (typically 2x) help, but for that grizzly bear in the streambed there's no substitute for a 400mm lens (much safer too!). For shallow depth-of-field, synthetic defocusing is not the same as real optical defocusing, but the visual effect is similar enough to achieve the same goal, of directing your attention towards the main subject.

Will SLRs (or their mirrorless interchangeable lens (MIL) cousins) with big sensors and big lenses disappear? Doubtful, but they will occupy a smaller niche in the market. Both of us travel with a big camera and a Pixel 2. At the beginning of our trips we dutifully take out our SLRs, but by the end, they mostly stay in our luggage. Welcome to the new world of software-defined cameras and computational photography!
For more about portrait mode on the Pixel 2, check out this video by Nat & Friends.
Here is another album of pictures (portrait and not) and videos taken by the Pixel 2.

Making Visible Watermarks More Effective



Whether you are a photographer, a marketing manager, or a regular Internet user, chances are you have encountered visible watermarks many times. Visible watermarks are those logos and patterns that are often overlaid on digital images provided by stock photography websites, marking the image owners while allowing viewers to perceive the underlying content so that they can license the images that fit their needs. They are the most common mechanism for protecting the copyrights of the hundreds of millions of photographs and stock images that are offered online daily.

It’s standard practice to use watermarks on the assumption that they prevent consumers from accessing the clean images, ensuring there will be no unauthorized or unlicensed use. However, in “On The Effectiveness Of Visible Watermarks” recently presented at the 2017 Computer Vision and Pattern Recognition Conference (CVPR 2017), we show that a computer algorithm can get past this protection and remove watermarks automatically, giving users unobstructed access to the clean images the watermarks are intended to protect.
Left: example watermarked images from popular stock photography websites. Right: watermark-free version of the images on the left, produced automatically by a computer algorithm. More results are available below and on our project page. Image sources: Adobe Stock, 123RF.
As often done with vulnerabilities discovered in operating systems, applications or protocols, we want to disclose this vulnerability and propose solutions in order to help the photography and stock image communities adapt and better protect their copyrighted content and creations. From our experiments, much of the world's stock imagery is currently susceptible to this circumvention. As such, in our paper we also propose ways to make visible watermarks more robust to such manipulations.
The Vulnerability of Visible Watermarks
Visible watermarks are often designed to contain complex structures such as thin lines and shadows in order to make them harder to remove. Indeed, given a single image, for a computer to detect automatically which visual structures belong to the watermark and which structures belong to the underlying image is extremely difficult. Manually, the task of removing a watermark from an image is tedious, and even with state-of-the-art editing tools it may take a Photoshop expert several minutes to remove a watermark from one image.

However, a fact that has been overlooked so far is that watermarks are typically added in a consistent manner to many images. We show that this consistency can be used to invert the watermarking process — that is, estimate the watermark image and its opacity, and recover the original, watermark-free image underneath. This can all be done automatically, without any user intervention, and by only observing watermarked image collections publicly available online.
The consistency of a watermark across many images makes it possible to remove it automatically at scale. Left: input collection marked by the same watermark; middle: computed watermark and its opacity; right: recovered, watermark-free images. Image sources: COCO dataset, Copyright logo.
The first step of this process is identifying which image structures are repeating in the collection. If a similar watermark is embedded in many images, the watermark becomes the signal in the collection and the images become the noise, and simple image operations can be used to pull out a rough estimation of the watermark pattern.
Watermark extraction with increasing number of images. Left: watermarked input images, Middle: median intensities over the input images (up to the input image shown), Right: the corresponding estimated (matted) watermark. All images licensed from 123RF.
This provides a rough (noisy) estimate of the matted watermark (the watermark image times its spatially varying opacity, i.e., alpha matte). To actually recover the image underneath the watermark, we need to know the watermark’s decomposition into its image and alpha matte components. For this, a multi-image optimization problem can be formed, which we call “multi-image matting” (an extension of the traditional, single image matting problem), where the watermark (“foreground”) is separated into its image and opacity components while reconstructing a subset of clean (“background”) images. This optimization produces very accurate estimates of the watermark components from just a few hundred images, and can deal with most watermarks used in practice, including ones containing thin structures, shadows or color gradients (as long as the watermarks are semi-transparent). Once the watermark pattern is recovered, it can be efficiently removed from any image marked by it.
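The "watermark is the signal, images are the noise" step can be demonstrated with a per-pixel median over a simulated collection. Everything below is synthetic and simplified (a constant-color logo, a fixed opacity, random images as "photos"); the real pipeline works on image gradients and then solves the multi-image matting problem, but the intuition is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one semi-transparent logo, alpha-blended at the
# same place into 500 unrelated "photos" (random images here).
wm_region = np.zeros((12, 12), dtype=bool)
wm_region[4:8, 2:10] = True
alpha_map = 0.5 * wm_region          # spatially varying opacity
wm_color = 0.9                       # the logo's intensity
photos = rng.uniform(size=(500, 12, 12))
marked = (1 - alpha_map) * photos + alpha_map * wm_color

# The watermark is the only consistent signal across the collection, so
# the per-pixel median is ~0.70 inside the mark and ~0.50 elsewhere;
# thresholding the deviation from the global median exposes the mark.
median_img = np.median(marked, axis=0)
recovered = np.abs(median_img - np.median(median_img)) > 0.1
```

With only a handful of images the median estimate is noisy; as the figure above shows, it sharpens steadily as more watermarked images are added.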

Here are some more results, showing the estimated watermarks and example watermark-free results generated for several popular stock image services. We show many more results in our supplementary material on the project page.
Left column: Watermark estimated automatically from watermarked images online (rendered on a gray background). Middle column: Input watermarked image. Right column: Automatically removed watermark. Image sources: Adobe Stock, Can Stock Photo, 123RF, Fotolia.
Making Watermarks More Effective
The vulnerability of current watermarking techniques lies in the consistency in watermarks across image collections. Therefore, to counter it, we need to introduce inconsistencies when embedding the watermark in each image. In our paper we looked at several types of inconsistencies and how they affect the techniques described above. We found for example that simply changing the watermark’s position randomly per image does not prevent removing the watermark, nor do small random changes in the watermark’s opacity. But we found that introducing random geometric perturbations to the watermark — warping it when embedding it in each image — improves its robustness. Interestingly, very subtle warping is already enough to generate watermarks that this technique cannot fully defeat.
Flipping between the original watermark and a slightly, randomly warped version of it. Even such subtle warping improves the watermark's robustness.
This warping produces a watermarked image that is very similar to the original (top right in the following figure), yet now if an attempt is made to remove it, it leaves very visible artifacts (bottom right):
In a nutshell, the reason this works is that removing the randomly-warped watermark from any single image requires additionally estimating the warp field that was applied to the watermark for that image — a task that is inherently more difficult. Therefore, even if the watermark pattern can be estimated in the presence of these random perturbations (which by itself is nontrivial), accurately removing it without any visible artifact is far more challenging.
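A minimal sketch of such a random warp is below. It is a crude approximation of the perturbations in the paper: a piecewise-constant displacement field (one random vector per 4x4 cell) with nearest-neighbor resampling, where a real implementation would use a smooth field and bilinear interpolation. Averaging many independently warped copies shows why the consistency attack degrades: the mark's edges smear into fractional values instead of reinforcing:

```python
import numpy as np

def random_warp(img, max_shift=1.5, rng=None):
    """Apply a small random displacement field to `img` (piecewise-constant
    over 4x4 cells, nearest-neighbor resampling, to keep the sketch short)."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = img.shape
    gy = rng.uniform(-max_shift, max_shift, (h // 4 + 2, w // 4 + 2))
    gx = rng.uniform(-max_shift, max_shift, (h // 4 + 2, w // 4 + 2))
    dy = np.repeat(np.repeat(gy, 4, 0), 4, 1)[:h, :w]
    dx = np.repeat(np.repeat(gx, 4, 0), 4, 1)[:h, :w]
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.round(ys + dy).astype(int), 0, h - 1)
    sx = np.clip(np.round(xs + dx).astype(int), 0, w - 1)
    return img[sy, sx]

wm = np.zeros((16, 16)); wm[6:10, 4:12] = 1.0   # a toy rectangular "logo"
warped = [random_warp(wm, rng=np.random.default_rng(s)) for s in range(50)]
# Averaging the warped copies: the interior stays solid, but the edges
# average to fractional values, so the recovered mark is no longer crisp.
edge_softness = np.mean(np.array(warped), axis=0)
```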

Here are some more results on the images from above when using subtle, randomly warped versions of the watermarks. Notice again how visible artifacts remain when trying to remove the watermark in this case, compared to the accurate reconstructions that are achievable with current, consistent watermarks. More results and a detailed analysis can be found in our paper and project page.
Left column: Watermarked image, using subtle, random warping of the watermark. Right Column: Watermark removal result.
This subtle random warping is only one type of randomization that can be introduced to make watermarks more effective. A nice feature of this solution is that it is simple to implement and already improves the robustness of the watermark to image-collection attacks, while at the same time being mostly imperceptible. If more visible changes to the watermark across the images are acceptable — for example, introducing larger shifts in the watermark or incorporating other random elements in it — they may lead to even better protection.

While we cannot guarantee that there will not be a way to break such randomized watermarking schemes in the future, we believe (and our experiments show) that randomization will make watermarked collection attacks fundamentally more difficult. We hope that these findings will be helpful for the photography and stock image communities.

Acknowledgements
The research described in this post was performed by Tali Dekel, Michael Rubinstein, Ce Liu and Bill Freeman. We thank Aaron Maschinot for narrating our video.

Motion Stills — Now on Android



Last year, we launched Motion Stills, an iOS app that stabilizes your Live Photos and lets you view and share them as looping GIFs and videos. Since then, Motion Stills has been well received, being listed as one of the top apps of 2016 by The Verge and Mashable. However, from its initial release, the community has been asking us to also make Motion Stills available for Android. We listened to your feedback and today, we're excited to announce that we’re bringing this technology, and more, to devices running Android 5.1 and later!
Motion Stills on Android: Instant stabilization on your device.
With Motion Stills on Android we built a new recording experience where everything you capture is instantly transformed into delightful short clips that are easy to watch and share. You can capture a short Motion Still with a single tap like a photo, or condense a longer recording into a new feature we call Fast Forward. In addition to stabilizing your recordings, Motion Stills on Android comes with an improved trimming algorithm that guards against pocket shots and accidental camera shakes. All of this is done during capture on your Android device, no internet connection required!

New streaming pipeline
For this release, we redesigned our existing iOS video processing pipeline to use a streaming approach that processes each frame of a video as it is being recorded. By computing intermediate motion metadata, we are able to immediately stabilize the recording while still performing loop optimization over the full sequence. All this leads to instant results after recording — no waiting required to share your new GIF.
Capture using our streaming pipeline gives you instant results.
In order to display your Motion Stills stream immediately, our algorithm computes and stores the necessary stabilizing transformation as a low resolution texture map. We leverage this texture to apply the stabilization transform using the GPU in real-time during playback, instead of writing a new, stabilized video that would tax your mobile hardware and battery.
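The texture-map trick can be sketched in numpy. This is a CPU stand-in for what the GPU does: the per-frame stabilizing transform is stored as a tiny grid of displacement vectors (the "low resolution texture"), bilinearly upsampled at playback time, and used to warp the frame. Nearest-neighbor resampling of the frame itself keeps the sketch short; the shapes and values are illustrative only:

```python
import numpy as np

def upsample_bilinear(grid, h, w):
    """Bilinearly upsample a coarse value grid to (h, w) — the part the
    GPU's texture sampler performs for free during playback."""
    gh, gw = grid.shape
    ys = np.linspace(0, gh - 1, h)
    xs = np.linspace(0, gw - 1, w)
    y0 = np.clip(ys.astype(int), 0, gh - 2); x0 = np.clip(xs.astype(int), 0, gw - 2)
    fy = (ys - y0)[:, None]; fx = (xs - x0)[None, :]
    tl = grid[y0][:, x0]; tr = grid[y0][:, x0 + 1]
    bl = grid[y0 + 1][:, x0]; br = grid[y0 + 1][:, x0 + 1]
    return (1 - fy) * ((1 - fx) * tl + fx * tr) + fy * ((1 - fx) * bl + fx * br)

def stabilize_frame(frame, coarse_dx, coarse_dy):
    """Warp `frame` by a displacement field stored at low resolution."""
    h, w = frame.shape
    dx = upsample_bilinear(coarse_dx, h, w)
    dy = upsample_bilinear(coarse_dy, h, w)
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.round(ys + dy).astype(int), 0, h - 1)
    sx = np.clip(np.round(xs + dx).astype(int), 0, w - 1)
    return frame[sy, sx]

# A uniform +2 px horizontal correction stored as a tiny 4x4 "texture".
frame = np.arange(64.0).reshape(8, 8)
out = stabilize_frame(frame, np.full((4, 4), 2.0), np.zeros((4, 4)))
```

Because only the tiny displacement grid is stored per frame, the original video bytes are left untouched and the warp costs almost nothing at playback time.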

Fast Forward
Fast Forward allows you to speed up and condense a longer recording into a short, easy to share clip. The same pipeline described above allows Fast Forward to process up to a full minute of video, right on your phone. You can even change the speed of playback (from 1x to 8x) after recording. To make this possible, we encode videos with a denser I-frame spacing to enable efficient seeking and playback. We also employ additional optimizations in the Fast Forward mode. For instance, we apply adaptive temporal downsampling in the linear solver and long-range stabilization for smooth results over the whole sequence.
Fast Forward condenses your recordings into easy to share clips.
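The re-timing side of Fast Forward can be sketched as a frame-selection pass over the recorded timestamps. This is a hypothetical simplification (a fixed 30 fps output assumption, no long-range stabilization or adaptive solver), but it shows why changing the speed after recording is cheap: only which frames are shown changes, not the encoded video:

```python
def fast_forward_frames(timestamps, speed, fps=30.0):
    """Temporal downsampling sketch: for an Nx speed-up, keep the next
    source frame each time 1/fps of *output* time has elapsed."""
    kept, next_t = [], timestamps[0]
    for i, t in enumerate(timestamps):
        if t >= next_t - 1e-9:   # small epsilon guards float drift
            kept.append(i)
            next_t = t + speed / fps
    return kept

ts = [i / 30.0 for i in range(120)]         # 4 s of 30 fps video
clip_4x = fast_forward_frames(ts, speed=4)  # ~1 s worth of frames at 4x
```

The denser I-frame spacing mentioned above is what makes seeking to these scattered frame indices efficient during playback.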
Try out Motion Stills
Motion Stills is an app for us to experiment and iterate quickly with short-form video technology, gathering valuable feedback along the way. The tools our users find most fun and useful may be integrated later on into existing products like Google Photos. Download Motion Stills for Android from the Google Play store—available for mobile phones running Android 5.1 and later—and share your favorite clips on social media with hashtag #motionstills.

Acknowledgements
Motion Stills would not have been possible without the help of many Googlers. We want to especially acknowledge the work of Matthias Grundmann in advancing our stabilization technology, as well as our UX and interaction designers Jacob Zukerman, Ashley Ma and Mark Bowers.

Experimental Nighttime Photography with Nexus and Pixel



On a full moon night last year I carried a professional DSLR camera, a heavy lens and a tripod up to a hilltop in the Marin Headlands just north of San Francisco to take a picture of the Golden Gate Bridge and the lights of the city behind it.
A view of the Golden Gate Bridge from the Marin Headlands, taken with a DSLR camera (Canon 1DX, Zeiss Otus 28mm f/1.4 ZE). Click here for the full resolution image.
I thought the photo of the moonlit landscape came out well so I showed it to my (then) teammates in Gcam, a Google Research team that focuses on computational photography - developing algorithms that assist in taking pictures, usually with smartphones and similar small cameras. Seeing my nighttime photo, one of the Gcam team members challenged me to re-take it, but with a phone camera instead of a DSLR. Even though cameras on cellphones have come a long way, I wasn’t sure whether it would be possible to come close to the DSLR shot.

Probably the most successful Gcam project to date is the image processing pipeline that enables the HDR+ mode in the camera app on Nexus and Pixel phones. HDR+ allows you to take photos at low light levels by rapidly shooting a burst of up to ten short exposures and averaging them into a single image, reducing blur due to camera shake while collecting enough total light to yield surprisingly good pictures. Of course, there are limits to what HDR+ can do. Once it gets dark enough, the camera simply cannot gather enough light, and challenging shots like nighttime landscapes remain beyond reach.

The Challenges
To learn what was possible with a cellphone camera in extremely low-light conditions, I looked to the experimental SeeInTheDark app, written by Marc Levoy and presented at the ICCV 2015 Extreme Imaging Workshop, which can produce pictures with even less light than HDR+. It does this by accumulating more exposures, and merging them under the assumption that the scene is static and any differences between successive exposures must be due to camera motion or sensor noise. The app reduces noise further by dropping image resolution to about 1 MPixel. With SeeInTheDark it is just possible to take pictures, albeit fairly grainy ones, by the light of the full moon.

However, in order to keep motion blur due to camera shake and moving objects in the scene at acceptable levels, both HDR+ and SeeInTheDark must keep the exposure times for individual frames below roughly one tenth of a second. Since the user can’t hold the camera perfectly still for extended periods, it doesn’t make sense to attempt to merge a large number of frames into a single picture. Therefore, HDR+ merges at most ten frames, while SeeInTheDark progressively discounts older frames as new ones are captured. This limits how much light the camera can gather and thus affects the quality of the final pictures at very low light levels.

Of course, if we want to take high-quality pictures of low-light scenes (such as a landscape illuminated only by the moon), increasing the exposure time to more than one second and mounting the phone on a tripod or placing it on some other solid support makes the task a lot easier. Google’s Nexus 6P and Pixel phones support exposure times of 4 and 2 seconds respectively. As long as the scene is static, we should be able to record and merge dozens of frames to produce a single final image, even if shooting those frames takes several minutes.

Even with the use of a tripod, a sharp picture requires the camera’s lens to be focused on the subject, and this can be tricky in scenes with very low light levels. The two autofocus mechanisms employed by cellphone cameras — contrast detection and phase detection — fail when it’s dark enough that the camera's image sensor returns mostly noise. Fortunately, the interesting parts of outdoor scenes tend to be far enough away that simply setting the focus distance to infinity produces sharp images.

Experiments & Results
Taking all this into account, I wrote a simple Android camera app with manual control over exposure time, ISO and focus distance. When the shutter button is pressed the app waits a few seconds and then records up to 64 frames with the selected settings. The app saves the raw frames captured from the sensor as DNG files, which can later be downloaded onto a PC for processing.

To test my app, I visited the Point Reyes lighthouse on the California coast some thirty miles northwest of San Francisco on a full moon night. I pointed a Nexus 6P phone at the building and shot a burst of 32 four-second frames at ISO 1600. After covering the camera lens with opaque adhesive tape I shot an additional 32 black frames. Back at the office I loaded the raw files into Photoshop. The individual frames were very grainy, as one would expect given the tiny sensor in a cellphone camera, but computing the mean of all 32 frames cleaned up most of the grain, and subtracting the mean of the 32 black frames removed faint grid-like patterns caused by local variations in the sensor's black level. The resulting image, shown below, looks surprisingly good.
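The merging arithmetic is simple enough to sketch in a few lines of NumPy, assuming the raw DNG frames have already been decoded into float arrays (e.g. with a raw decoding library); this illustrates the averaging and black-frame subtraction described above, not the exact Photoshop workflow:

```python
import numpy as np

def merge_burst(light_frames, black_frames):
    """Average a burst of identically exposed frames, then subtract the
    average of the lens-capped 'black' frames to remove the sensor's
    fixed-pattern black-level variations.

    light_frames, black_frames: sequences of (H, W) float arrays.
    """
    light_mean = np.mean(light_frames, axis=0)  # random grain shrinks ~ 1/sqrt(N)
    black_mean = np.mean(black_frames, axis=0)  # fixed pattern survives averaging
    return np.clip(light_mean - black_mean, 0.0, None)
```

Averaging N frames reduces random noise by roughly a factor of sqrt(N), while the grid-like fixed-pattern offsets are identical in every frame and so survive the averaging, which is why the separate black-frame subtraction is needed.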
Point Reyes lighthouse at night, photographed with Google Nexus 6P (full resolution image here).
The lantern in the lighthouse is overexposed, but the rest of the scene is sharp, not too grainy, and has pleasing, natural looking colors. For comparison, a hand-held HDR+ shot of the same scene looks like this:
Point Reyes Lighthouse at night, hand-held HDR+ shot (full resolution image here). The inset rectangle has been brightened in Photoshop to roughly match the previous picture.
Satisfied with these results, I wanted to see if I could capture a nighttime landscape as well as the stars in the clear sky above it, all in one picture. When I took the photo of the lighthouse a thin layer of clouds conspired with the bright moonlight to make the stars nearly invisible, but on a clear night a two or four second exposure can easily capture the brighter stars. The stars are not stationary, though; they appear to rotate around the celestial poles, completing a full turn every 24 hours. The motion is slow enough to be invisible in exposures of only a few seconds, but over the minutes it takes to record a few dozen frames the stars move enough to turn into streaks when the frames are merged. Here is an example:
The North Star above Mount Burdell, single 2-second exposure. (full resolution image here).
Mean of 32 2-second exposures (full resolution image here).
Seeing streaks instead of pinpoint stars in the sky can be avoided by shifting and rotating the original frames such that the stars align. Merging the aligned frames produces an image with a clean-looking sky, and many faint stars that were hidden by noise in the individual frames become visible. Of course, the ground is now motion-blurred as if the camera had followed the rotation of the sky.
Mean of 32 2-second exposures, stars aligned (full resolution image here).
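A minimal sketch of the star alignment, assuming the celestial pole's pixel position and each frame's capture time are known; the sign convention of the rotation and the nearest-neighbor resampling are simplifications for illustration, not the actual processing used for these images:

```python
import numpy as np

SIDEREAL_DAY_S = 86164.1  # stars complete one full turn in ~23 h 56 min

def align_and_average(frames, times_s, pole_yx):
    """Rotate each grayscale frame about the celestial pole's pixel
    position by the sky's rotation since the first frame, then average.

    frames:  sequence of (H, W) float arrays
    times_s: capture time of each frame, seconds relative to the first
    pole_yx: (row, col) pixel coordinates of the celestial pole
    """
    H, W = frames[0].shape
    yy, xx = np.mgrid[0:H, 0:W].astype(float)
    py, px = pole_yx
    out = np.zeros((H, W), dtype=np.float64)
    for img, t in zip(frames, times_s):
        theta = 2 * np.pi * t / SIDEREAL_DAY_S  # sky rotation since t = 0
        c, s = np.cos(theta), np.sin(theta)
        # Inverse-rotate each output coordinate about the pole to find
        # the source pixel in this frame (nearest-neighbor fetch).
        sy = py + c * (yy - py) - s * (xx - px)
        sx = px + s * (yy - py) + c * (xx - px)
        ys = np.clip(np.round(sy).astype(int), 0, H - 1)
        xs = np.clip(np.round(sx).astype(int), 0, W - 1)
        out += img[ys, xs]
    return out / len(frames)
```

As the text notes, aligning on the stars necessarily rotates the ground, which is why the final composite combines a star-aligned average with a separate ground-aligned one.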
We now have two images: one where the ground is sharp, and one where the sky is sharp. We can combine them into a single picture that is sharp everywhere. In Photoshop the easiest way to do that is with a hand-painted layer mask. After adjusting brightness and colors to taste, cropping slightly, and removing an ugly "No Trespassing" sign, we get a presentable picture:
The North Star above Mount Burdell, shot with Google Pixel, final image (full resolution image here).
Using Even Less Light
The pictures I've shown so far were shot on nights with a full moon, when it was bright enough that one could easily walk outside without a lantern or a flashlight. I wanted to find out if it was possible to take cellphone photos in even less light. Using a Pixel phone, I tried a scene illuminated by a three-quarter moon low in the sky, and another one with no moon at all. Anticipating more noise in the individual exposures, I shot 64-frame bursts. The processed final images still look fine:
Wrecked fishing boat in Inverness and the Big Dipper, 64 2-second exposures, shot with Google Pixel (full resolution image here).
Stars above Pierce Point Ranch, 64 2-second exposures, shot with Google Pixel (full resolution image here).
In the second image the distant lights of the cities around the San Francisco Bay caused the sky near the horizon to glow, but without moonlight the night was still dark enough to make the Milky Way visible. The picture looks noticeably grainier than my earlier moonlight shots, but it's not too bad.

Pushing the Limits
How far can we go? Can we take a cellphone photo with only starlight - no moon, no artificial light sources nearby, and no background glow from a distant city?

To test this I drove to a point on the California coast a little north of the mouth of the Russian River, where nights can get really dark, and pointed my Pixel phone at the summer sky above the ocean. Combining 64 two-second exposures taken at ISO 12800 with 64 corresponding black frames did produce a recognizable image of the Milky Way. The constellations Scorpius and Sagittarius are clearly visible, and by squinting hard enough one can just barely make out the horizon and one or two rocks in the ocean, but overall, this is not a picture you'd want to print out and frame. Still, this may be the lowest-light cellphone photo ever taken.
Only starlight, shot with Google Pixel (full resolution image here).
Here we are approaching the limits of what the Pixel camera can do. The camera cannot handle exposure times longer than two seconds. If this restriction were removed, we could expose individual frames for eight to ten seconds, and the stars still would not show noticeable motion blur. With longer exposures we could lower the ISO setting, which would significantly reduce noise in the individual frames, and we would get a correspondingly cleaner and more detailed final picture.

Getting back to the original challenge - using a cellphone to reproduce a nighttime DSLR shot of the Golden Gate - I did just that. Here is what I got:
Golden Gate Bridge at night, shot with Google Nexus 6P (full resolution image here).
The Moon above San Francisco, shot with Google Nexus 6P (full resolution image here).
At 9 to 10 MPixels the resolution of these pictures is not as high as what a DSLR camera might produce, but otherwise image quality is surprisingly good: the photos are sharp all the way into the corners, there is not much visible noise, the captured dynamic range is sufficient to avoid saturating all but the brightest highlights, and the colors are pleasing.

Trying to find out whether phone cameras might be suitable for outdoor nighttime photography was a fun experiment, and clearly the answer is yes, they are. However, arriving at the final images required a lot of careful post-processing on a desktop computer, and the procedure is too cumbersome for all but the most dedicated cellphone photographers. With the right software, though, a phone should be able to process the images internally, and if steps such as painting layer masks by hand can be eliminated, it might be possible to do point-and-shoot photography in very low light conditions. Almost - the cellphone would still have to rest on the ground or be mounted on a tripod.

Here’s a Google Photos album with more examples of photos that were created with the technique described above.

PhotoScan: Taking Glare-Free Pictures of Pictures



Yesterday, we released an update to PhotoScan, an app for iOS and Android that allows you to digitize photo prints with just a smartphone. One of the key features of PhotoScan is the ability to remove glare from prints, which are often glossy and reflective, as are the plastic album pages or glass-covered picture frames that host them. To create this feature, we developed a unique blend of computer vision and image processing techniques that can carefully align and combine several slightly different pictures of a print to separate the glare from the image underneath.
Left: A regular digital picture of a physical print. Right: Glare-free digital output from PhotoScan
When taking a single picture of a photo, determining which regions of the picture are the actual photo and which regions are glare is challenging to do automatically. Moreover, the glare may often saturate regions in the picture, rendering it impossible to see or recover the parts of the photo underneath it. But if we take several pictures of the photo while moving the camera, the position of the glare tends to change, covering different regions of the photo. In most cases we found that every pixel of the photo is likely to be free of glare in at least one of the pictures. While no single view may be glare-free, we can combine multiple pictures of the printed photo taken at different angles to remove the glare. The challenge is that the images need to be aligned very accurately in order to combine them properly, and this processing needs to run very quickly on the phone to provide a near-instant experience.
Left: The captured, input images (5 in total). Right: If we stabilize the images on the photo, we can see just the glare moving, covering different parts of the photo. Notice no single image is glare-free.
Our technique is inspired by our earlier work published at SIGGRAPH 2015, which we dubbed “obstruction-free photography”. It uses similar principles to remove various types of obstructions from the field of view. However, the algorithm we originally proposed was based on a generative model where the motion and appearance of both the main scene and the obstruction layer are estimated. While that model is quite powerful and can remove a variety of obstructions, it is too computationally expensive to be run on smartphones. We therefore developed a simpler model that treats glare as an outlier, and only attempts to register the underlying, glare-free photo. While this model is simpler, the task is still quite challenging as the registration needs to be highly accurate and robust.

How it Works
We start from a series of pictures of the print taken by the user while moving the camera. The first picture - the “reference frame” - defines the desired output viewpoint. The user is then instructed to take four additional frames. In each additional frame, we detect sparse feature points (we compute ORB features on Harris corners) and use them to establish homographies mapping each frame to the reference frame.
Detected feature matches between the reference frame and each other frame (left), and the warped frames according to the estimated homographies (right).
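The homography estimation from matched feature points can be illustrated with a plain direct linear transform (DLT); the real pipeline would additionally normalize the coordinates and use a robust estimator such as RANSAC to reject bad ORB matches, both omitted here for brevity:

```python
import numpy as np

def estimate_homography(src_pts, dst_pts):
    """Least-squares DLT estimate of the 3x3 homography H mapping
    src_pts -> dst_pts, given N >= 4 matched feature points (as would
    come from ORB matching against the reference frame).

    A point (x, y) maps to (u, v) with u = (H @ [x, y, 1]) ratios; each
    match contributes two linear constraints on the 9 entries of H.
    """
    A = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector for the smallest
    # singular value of A (the null space for exact correspondences).
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=np.float64))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the arbitrary scale
```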
While the technique may sound straightforward, there is a catch - homographies are only able to align flat images. But printed photos are often not entirely flat (as is the case with the example shown above). Therefore, we use optical flow — a fundamental computer vision representation for motion, which establishes a pixel-wise mapping between two images — to correct the non-planarities. We start from the homography-aligned frames and compute “flow fields” to warp the images and further refine the registration. In the example below, notice how the corners of the photo on the left slightly “move” after registering the frames using only homographies. The right-hand side shows how the photo is better aligned after refining the registration using optical flow.
Comparison between the warped frames using homographies (left) and after the additional warp refinement using optical flow (right).
The difference in the registration is subtle, but has a big impact on the end result. Notice how small misalignments manifest themselves as duplicated image structures in the result, and how these artifacts are alleviated with the additional flow refinement.
Comparison between the glare removal result with (right) and without (left) optical flow refinement. In the result using homographies only (left), notice artifacts around the eye, nose and teeth of the person, and duplicated stems and flower petals on the fabric.
Here too, the challenge was to make optical flow, a naturally slow algorithm, work very quickly on the phone. Instead of computing optical flow at each pixel as done traditionally (the number of flow vectors computed is equal to the number of input pixels), we represent a flow field by a smaller number of control points, and express the motion at each pixel in the image as a function of the motion at the control points. Specifically, we divide each image into tiled, non-overlapping cells to form a coarse grid, and represent the flow of a pixel in a cell as the bilinear combination of the flow at the four corners of the cell that contains it.

The grid setup for grid optical flow. A point p is represented as the bilinear interpolation of the four corner points of the cell that encapsulates it.
Left: Illustration of the computed flow field on one of the frames. Right: The flow color coding: orientation and magnitude represented by hue and saturation, respectively.
This results in a much smaller problem to solve, since the number of flow vectors to compute now equals the number of grid points, which is typically much smaller than the number of pixels. This process is similar in nature to the spline-based image registration described in Szeliski and Coughlan (1997). With this algorithm, we were able to reduce the optical flow computation time by a factor of ~40 on a Pixel phone!
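The grid representation can be sketched in a few lines of NumPy: given the flow vectors at the cell corners, the dense per-pixel field is recovered by bilinear interpolation (the grid geometry and names here are assumptions for illustration, not the production code):

```python
import numpy as np

def dense_flow_from_grid(grid_flow, cell_size, out_shape):
    """Upsample a coarse grid of flow vectors to a dense per-pixel field.

    grid_flow: (gh, gw, 2) flow vectors at the cell corner points
    cell_size: edge length of each square cell, in pixels
    out_shape: (H, W) of the dense output field

    Each pixel's flow is the bilinear combination of the flow at the
    four corners of the cell containing it.
    """
    H, W = out_shape
    ys = np.arange(H) / cell_size  # fractional grid coordinates
    xs = np.arange(W) / cell_size
    y0 = np.clip(ys.astype(int), 0, grid_flow.shape[0] - 2)
    x0 = np.clip(xs.astype(int), 0, grid_flow.shape[1] - 2)
    wy = (ys - y0)[:, None, None]  # bilinear weights per row / column
    wx = (xs - x0)[None, :, None]
    f00 = grid_flow[y0][:, x0]      # flow at the four surrounding corners
    f01 = grid_flow[y0][:, x0 + 1]
    f10 = grid_flow[y0 + 1][:, x0]
    f11 = grid_flow[y0 + 1][:, x0 + 1]
    return ((1 - wy) * (1 - wx) * f00 + (1 - wy) * wx * f01
            + wy * (1 - wx) * f10 + wy * wx * f11)
```

Only the corner flows are unknowns in the solver; the dense field is a cheap, deterministic function of them, which is where the large speedup comes from.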
Flipping between the homography-registered frame and the flow-refined warped frame (using the above flow field), superimposed on the (clean) reference frame, shows how the computed flow field “snaps” image parts to their corresponding parts in the reference frame, improving the registration.
Finally, to compose the glare-free output, for any given location in the registered frames we examine the pixel values and use a soft minimum algorithm to obtain the darkest observed value. More specifically, we compute the expectation of the minimum brightness over the registered frames, assigning less weight to pixels close to the (warped) image boundaries. We use this method rather than computing the minimum directly across the frames because corresponding pixels in each frame may have slightly different brightness, so a per-pixel minimum can produce visible seams due to sudden intensity changes at boundaries between overlaid images.
Regular minimum (left) versus soft minimum (right) over the registered frames.
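One way to realize such a soft minimum is to weight each frame's pixel value by a decaying exponential of its brightness, so darker observations dominate smoothly rather than being picked winner-take-all; the exponential form and the temperature value below are assumptions for illustration, not the exact weighting used by PhotoScan:

```python
import numpy as np

def soft_minimum(frames, boundary_weights=None, temperature=10.0):
    """Soft minimum across registered frames: an expectation over frames
    in which darker pixels receive exponentially larger weight.

    frames: (N, H, W) float array of brightness values, all registered
            to the reference viewpoint
    boundary_weights: optional (N, H, W) weights, e.g. downweighting
            pixels near the warped image boundaries
    temperature: smaller values approach a hard per-pixel minimum
    """
    frames = np.asarray(frames, dtype=np.float64)
    w = np.exp(-frames / temperature)  # darker -> larger weight
    if boundary_weights is not None:
        w = w * boundary_weights
    return (w * frames).sum(axis=0) / w.sum(axis=0)
```

Because every frame contributes a little everywhere, small brightness differences between frames blend smoothly instead of producing the seams a hard minimum would create.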
The algorithm can support a variety of scanning conditions — matte and gloss prints, photos inside or outside albums, magazine covers.

Input     Registered     Glare-free
To get the final result, the Photos team developed a method that automatically detects and crops the photo area and rectifies it to a frontal view. Because of perspective distortion, the rectangular photo usually appears as a quadrangle in the scanned image. The method analyzes image signals, like color and edges, to find the exact boundary of the original photo in the scanned image, then applies a geometric transformation to rectify the quadrangle back to its original rectangular shape, yielding a high-quality, glare-free digital version of the photo.
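The rectification step amounts to solving for the homography that maps the detected quadrangle's four corners onto an upright rectangle; a minimal sketch (the corner ordering and output size are assumptions of this example):

```python
import numpy as np

def rectify_quad(quad, width, height):
    """Homography mapping a detected photo quadrangle to an upright
    width x height rectangle.

    quad: four (x, y) corner points, ordered TL, TR, BR, BL.
    Returns the 3x3 matrix H; warping the scan with H undoes the
    perspective distortion. With exactly 4 correspondences the 8
    unknowns of H (its scale is fixed) solve a linear system exactly.
    """
    dst = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    A, b = [], []
    for (x, y), (u, v) in zip(quad, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.asarray(A, dtype=np.float64),
                        np.asarray(b, dtype=np.float64))
    return np.append(h, 1.0).reshape(3, 3)  # H[2, 2] fixed to 1
```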
So overall, quite a lot going on under the hood, and all done almost instantaneously on your phone! To give PhotoScan a try, download the app on Android or iOS.