Tag Archives: Computational Photography

Learning to Predict Depth on the Pixel 3 Phones



Portrait Mode on the Pixel smartphones lets you take professional-looking images that draw attention to a subject by blurring the background behind it. Last year, we described, among other things, how we compute depth with a single camera using its Phase-Detection Autofocus (PDAF) pixels (also known as dual-pixel autofocus) using a traditional non-learned stereo algorithm. This year, on the Pixel 3, we turn to machine learning to improve depth estimation to produce even better Portrait Mode results.
Left: The original HDR+ image. Right: A comparison of Portrait Mode results using depth from traditional stereo and depth from machine learning. The learned depth result has fewer errors. Notably, in the traditional stereo result, many of the horizontal lines behind the man are incorrectly estimated to be at the same depth as the man and are kept sharp.
(Mike Milne)
A Short Recap
As described in last year’s blog post, Portrait Mode uses a neural network to determine which pixels correspond to people versus the background, and augments this two-layer person segmentation mask with depth information derived from the PDAF pixels. This enables a depth-dependent blur, which is closer to what a professional camera does.

PDAF pixels work by capturing two slightly different views of a scene, shown below. Flipping between the two views, we see that the person is stationary, while the background moves horizontally, an effect referred to as parallax. Because parallax is a function of the point’s distance from the camera and the distance between the two viewpoints, we can estimate depth by matching each point in one view with its corresponding point in the other view.
The two PDAF images on the left and center look very similar, but in the crop on the right you can see the parallax between them. It is most noticeable on the circular structure in the middle of the crop.
However, finding these correspondences in PDAF images (a method called depth from stereo) is extremely challenging because scene points barely move between the views. Furthermore, all stereo techniques suffer from the aperture problem. That is, if you look at the scene through a small aperture, it is impossible to find correspondence for lines parallel to the stereo baseline, i.e., the line connecting the two cameras. In other words, when looking at the horizontal lines in the figure above (or vertical lines in portrait orientation shots), any proposed shift of these lines in one view with respect to the other view looks about the same. In last year’s Portrait Mode, all these factors could result in errors in depth estimation and cause unpleasant artifacts.
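To make the matching step concrete, here is a minimal, illustrative sketch of depth from stereo as one-dimensional block matching along the baseline. Everything in it is a toy assumption: real PDAF disparities are sub-pixel and the production algorithm is far more sophisticated, but the sketch shows both the search and why lines parallel to the baseline are ambiguous.

```python
import numpy as np

def disparity_1d(left, right, patch=8, max_shift=4):
    """Toy depth from stereo: for each patch in `left`, find the horizontal
    shift of the best-matching patch in `right` using the sum of absolute
    differences. Real PDAF disparities are sub-pixel and handled far more
    carefully; this integer search only illustrates the idea."""
    h, w = left.shape
    disp = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch, patch):
        for j in range(max_shift, w - patch - max_shift, patch):
            ref = left[i:i + patch, j:j + patch]
            costs = [np.abs(ref - right[i:i + patch, j + s:j + s + patch]).sum()
                     for s in range(-max_shift, max_shift + 1)]
            disp[i // patch, j // patch] = np.argmin(costs) - max_shift
    return disp  # for an ordinary stereo pair, disparity is inversely related to depth

# For a patch containing only lines parallel to the baseline, every candidate
# shift yields nearly the same cost: the aperture problem described above.
```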

Improving Depth Estimation
With Portrait Mode on the Pixel 3, we fix these errors by utilizing the fact that the parallax used by depth from stereo algorithms is only one of many depth cues present in images. For example, points that are far away from the in-focus plane appear less sharp than ones that are closer, giving us a defocus depth cue. In addition, even when viewing an image on a flat screen, we can accurately tell how far things are because we know the rough size of everyday objects (e.g. one can use the number of pixels in a photograph of a person’s face to estimate how far away it is). This is called a semantic cue.

Designing a hand-crafted algorithm to combine these different cues is extremely difficult, but by using machine learning, we can do so while also better exploiting the PDAF parallax cue. Specifically, we train a convolutional neural network, written in TensorFlow, that takes as input the PDAF pixels and learns to predict depth. This new and improved ML-based method of depth estimation is what powers Portrait Mode on the Pixel 3.
Our convolutional neural network takes as input the PDAF images and outputs a depth map. The network uses an encoder-decoder style architecture with skip connections and residual blocks.
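As a rough illustration of this kind of architecture, here is a minimal tf.keras sketch of an encoder-decoder with skip connections and a residual block that maps a two-channel PDAF pair to a one-channel depth map. All layer sizes and the input resolution are invented for illustration; this is not the production network.

```python
import tensorflow as tf
from tensorflow.keras import layers

def toy_depth_net(h=128, w=128):
    """Minimal encoder-decoder with skip connections and one residual block:
    a 2-channel PDAF pair goes in, a 1-channel (relative) depth map comes out.
    All sizes are invented for illustration; this is not the production net."""
    inp = layers.Input(shape=(h, w, 2))                  # the two PDAF views
    e1 = layers.Conv2D(16, 3, padding='same', activation='relu')(inp)
    e2 = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(e1)
    e3 = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(e2)
    b = layers.Conv2D(64, 3, padding='same', activation='relu')(e3)
    b = layers.Add()([b, e3])                            # residual block
    d2 = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(b)
    d2 = layers.Concatenate()([d2, e2])                  # skip connection
    d1 = layers.Conv2DTranspose(16, 3, strides=2, padding='same', activation='relu')(d2)
    d1 = layers.Concatenate()([d1, e1])                  # skip connection
    return tf.keras.Model(inp, layers.Conv2D(1, 3, padding='same')(d1))
```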
Training the Neural Network
In order to train the network, we need lots of PDAF images and corresponding high-quality depth maps. And since we want our predicted depth to be useful for Portrait Mode, we also need the training data to be similar to pictures that users take with their smartphones.

To accomplish this, we built our own custom “Frankenphone” rig that contains five Pixel 3 phones, along with a Wi-Fi-based solution that allowed us to simultaneously capture pictures from all of the phones (within a tolerance of ~2 milliseconds). With this rig, we computed high-quality depth from photos by using structure from motion and multi-view stereo.
Left: Custom rig used to collect training data. Middle: An example capture flipping between the five images. Synchronization between the cameras ensures that we can calculate depth for dynamic scenes, such as this one. Right: Ground truth depth. Low confidence points, i.e., points where stereo matches are not reliable due to weak texture, are colored in black and are not used during training. (Sam Ansari and Mike Milne)
The data captured by this rig is ideal for training a network for the following main reasons:
  • Five viewpoints ensure that there is parallax in multiple directions and hence no aperture problem.
  • The arrangement of the cameras ensures that a point in an image is usually visible in at least one other image, resulting in fewer points with no correspondences.
  • The baseline, i.e., the distance between the cameras, is much larger than our PDAF baseline, resulting in more accurate depth estimation.
  • Synchronization between the cameras ensures that we can calculate depth for dynamic scenes like the one above.
  • Portability of the rig ensures that we can capture photos in the wild simulating the photos users take with their smartphones.
However, even though the data captured from this rig is ideal, it is still extremely challenging to predict the absolute depth of objects in a scene — a given PDAF pair can correspond to a range of different depth maps (depending on lens characteristics, focus distance, etc.). To account for this, we instead predict the relative depths of objects in the scene, which is sufficient for producing pleasing Portrait Mode results.
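The post does not specify the training loss, but one common way to supervise relative rather than absolute depth is a scale-and-shift-invariant loss that aligns the prediction to the ground truth with least squares before measuring error. The sketch below is an assumption in that spirit, not Google's actual loss.

```python
import tensorflow as tf

def scale_shift_invariant_loss(pred, gt, confidence_mask):
    """Compare depth maps only up to an unknown scale and shift, so the network
    is supervised on relative rather than absolute depth. Low-confidence
    ground-truth pixels (black in the figure above) are ignored. This is one
    common formulation, not necessarily the loss used for Portrait Mode."""
    p = tf.boolean_mask(pred, confidence_mask > 0)
    g = tf.boolean_mask(gt, confidence_mask > 0)
    A = tf.stack([p, tf.ones_like(p)], axis=1)        # [N, 2]
    ab = tf.linalg.lstsq(A, g[:, None])                # best-fit scale and shift
    aligned = tf.squeeze(A @ ab, axis=1)
    return tf.reduce_mean(tf.abs(aligned - g))
```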

Putting it All Together
This ML-based depth estimation needs to run fast on the Pixel 3, so that users don’t have to wait too long for their Portrait Mode shots. However, to get good depth estimates that make use of subtle defocus and parallax cues, we have to feed full-resolution, multi-megapixel PDAF images into the network. To ensure fast results, we use TensorFlow Lite, a cross-platform solution for running machine learning models on mobile and embedded devices, together with the Pixel 3’s powerful GPU to compute depth quickly despite our unusually large inputs. We then combine the resulting depth estimates with masks from our person segmentation neural network to produce beautiful Portrait Mode results.
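For readers curious what running a converted model with TensorFlow Lite looks like, here is a minimal sketch using the standard Python interpreter API. The model file name and input are placeholders; the real Portrait Mode model is not published, and on the Pixel 3 it runs on-device through the GPU delegate rather than this desktop API.

```python
import numpy as np
import tensorflow as tf

# The model path and input below are placeholders, purely for illustration.
interpreter = tf.lite.Interpreter(model_path="pdaf_depth.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

pdaf_pair = np.zeros(inp["shape"], dtype=np.float32)   # stand-in PDAF input
interpreter.set_tensor(inp["index"], pdaf_pair)
interpreter.invoke()
depth_map = interpreter.get_tensor(out["index"])
```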

Try it Yourself
In Google Camera App version 6.1 and later, our depth maps are embedded in Portrait Mode images. This means you can use the Google Photos depth editor to change the amount of blur and the focus point after capture. You can also use third-party depth extractors to extract the depth map from a jpeg and take a look at it yourself. Also, here is an album showing the relative depth maps and the corresponding Portrait Mode images for traditional stereo and the learning-based approaches.

Acknowledgments
This work wouldn’t have been possible without Sam Ansari, Yael Pritch Knaan, David Jacobs, Jiawen Chen, Juhyun Lee and Andrei Kulik. Special thanks to Mike Milne and Andy Radin who captured data with the five-camera rig.

Source: Google AI Blog


Night Sight: Seeing in the Dark on Pixel Phones



Night Sight is a new feature of the Pixel Camera app that lets you take sharp, clean photographs in very low light, even in light so dim you can't see much with your own eyes. It works on the main and selfie cameras of all three generations of Pixel phones, and does not require a tripod or flash. In this article we'll talk about why taking pictures in low light is challenging, and we'll discuss the computational photography and machine learning techniques, much of it built on top of HDR+, that make Night Sight work.
Left: iPhone XS (full resolution image here). Right: Pixel 3 Night Sight (full resolution image here).
Why is Low-light Photography Hard?
Anybody who has photographed a dimly lit scene will be familiar with image noise, which looks like random variations in brightness from pixel to pixel. For smartphone cameras, which have small lenses and sensors, a major source of noise is the natural variation of the number of photons entering the lens, called shot noise. Every camera suffers from it, and it would be present even if the sensor electronics were perfect. However, they are not, so a second source of noise is the random error introduced when converting the electronic charge resulting from light hitting each pixel into a number, called read noise. These and other sources of randomness contribute to the overall signal-to-noise ratio (SNR), a measure of how much the image stands out from these variations in brightness. Fortunately, SNR rises with the square root of exposure time (or faster), so taking a longer exposure produces a cleaner picture. But it’s hard to hold still long enough to take a good picture in dim light, and whatever you're photographing probably won't hold still either.
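A tiny numeric sketch, under a simplified noise model (Poisson shot noise plus constant read noise), shows why SNR grows at least as fast as the square root of exposure time. The photon rate and read-noise figures are made-up illustrative values, not measurements of any Pixel sensor.

```python
import numpy as np

photon_rate = 100.0   # photons per pixel per second (illustrative, not measured)
read_noise = 5.0      # electrons RMS per readout (illustrative)

for t in (0.01, 0.1, 1.0):
    signal = photon_rate * t
    noise = np.sqrt(signal + read_noise**2)   # shot-noise variance equals the signal
    print(f"exposure {t:5.2f} s  ->  SNR {signal / noise:5.2f}")
# SNR grows roughly linearly with exposure while read noise dominates, and like
# the square root of exposure once shot noise dominates: "square root or faster".
```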

In 2014 we introduced HDR+, a computational photography technology that improves this situation by capturing a burst of frames, aligning the frames in software, and merging them together. The main purpose of HDR+ is to improve dynamic range, meaning the ability to photograph scenes that exhibit a wide range of brightnesses (like sunsets or backlit portraits). All generations of Pixel phones use HDR+. As it turns out, merging multiple pictures also reduces the impact of shot noise and read noise, so it improves SNR in dim lighting. To keep these photographs sharp even if your hand shakes and the subject moves, we use short exposures. We also reject pieces of frames for which we can't find a good alignment. This allows HDR+ to produce sharp images even while collecting more light.

How Dark is Dark?
But if capturing and merging multiple frames produces cleaner pictures in low light, why not use HDR+ to merge dozens of frames so we can effectively see in the dark? Well, let's begin by defining what we mean by "dark". When photographers talk about the light level of a scene, they often measure it in lux. Technically, lux is the amount of light arriving at a surface per unit area, measured in lumens per square meter. To give you a feeling for different lux levels, here's a handy table:
Smartphone cameras that take a single picture begin to struggle at 30 lux. Phones that capture and merge several pictures (as HDR+ does) can do well down to 3 lux, but in dimmer scenes they don’t perform well (more on that below) and rely on their flash. With Night Sight, our goal was to improve picture-taking in the regime between 3 lux and 0.3 lux, using a smartphone, a single shutter press, and no LED flash. Making this feature work well involves several key elements, the most important of which is capturing more photons.

Capturing the Data
While lengthening the exposure time of each frame increases SNR and leads to cleaner pictures, it unfortunately introduces two problems. First, the default picture-taking mode on Pixel phones uses a zero-shutter-lag (ZSL) protocol, which intrinsically limits exposure time. As soon as you open the camera app, it begins capturing image frames and storing them in a circular buffer that constantly erases old frames to make room for new ones. When you press the shutter button, the camera sends the most recent 9 or 15 frames to our HDR+ or Super Res Zoom software. This means you capture exactly the moment you want — hence the name zero-shutter-lag. However, since we're displaying these same images on the screen to help you aim the camera, HDR+ limits exposures to at most 66ms no matter how dim the scene is, allowing our viewfinder to keep up a display rate of at least 15 frames per second. For dimmer scenes where longer exposures are necessary, Night Sight uses positive-shutter-lag (PSL), which waits until after you press the shutter button before it starts capturing images. Using PSL means you need to hold still for a short time after pressing the shutter, but it allows the use of longer exposures, thereby improving SNR at much lower brightness levels.
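The difference between the two protocols can be sketched with a toy ring buffer. The buffer size and frame counts below simply echo the numbers in the paragraph above; this is not the actual camera stack.

```python
from collections import deque

class ToyCapture:
    """Contrasts the two capture protocols. Buffer size and frame counts echo
    the numbers in the paragraph above; the real camera stack is far more
    involved than this toy."""

    def __init__(self, buffer_size=15):
        self.ring = deque(maxlen=buffer_size)   # old frames fall off automatically

    def on_viewfinder_frame(self, frame):
        self.ring.append(frame)                 # ZSL: always recording, <= 66 ms each

    def shutter_zsl(self, n=9):
        return list(self.ring)[-n:]             # reuse frames already captured

    def shutter_psl(self, capture_frame, n=6):
        # PSL: capture starts after the shutter press, so each frame may use a
        # longer exposure than the viewfinder allows; hold still while it runs.
        return [capture_frame() for _ in range(n)]
```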

The second problem with increasing per-frame exposure time is motion blur, which might be due to handshake or to moving objects in the scene. Optical image stabilization (OIS), which is present on Pixel 2 and 3, reduces handshake for moderate exposure times (up to about 1/8 second), but doesn’t help with longer exposures or with moving objects. To combat motion blur that OIS can’t fix, the Pixel 3’s default picture-taking mode uses “motion metering”, which consists of using optical flow to measure recent scene motion and choosing an exposure time that minimizes this blur. Pixel 1 and 2 don’t use motion metering in their default mode, but all three phones use the technique in Night Sight mode, increasing per-frame exposure time up to 333ms if there isn't much motion. For Pixel 1, which has no OIS, we increase exposure time less (for the selfie cameras, which also don't have OIS, we increase it even less). If the camera is being stabilized (held against a wall, or using a tripod, for example), the exposure of each frame is increased to as much as one second. In addition to varying per-frame exposure, we also vary the number of frames we capture, 6 if the phone is on a tripod and up to 15 if it is handheld. These frame limits prevent user fatigue (and the need for a cancel button). Thus, depending on which Pixel phone you have, camera selection, handshake, scene motion and scene brightness, Night Sight captures 15 frames of 1/15 second (or less) each, or 6 frames of 1 second each, or anything in between.1
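Here is a hedged sketch of what such a motion-metering decision could look like: pick a per-frame exposure that keeps expected blur under a small pixel budget, clamped to the caps mentioned above, and pick a frame count based on whether the phone is stabilized. The caps echo the numbers in this paragraph, but the decision logic itself is a simplified guess, not the shipped algorithm.

```python
def choose_capture_plan(motion_px_per_s, is_stabilized, blur_budget_px=1.0):
    """Toy 'motion metering': keep expected motion blur under a small pixel
    budget, clamped to the per-frame caps quoted above (333 ms handheld,
    1 s when stabilized), and pick a frame count accordingly. The decision
    logic is a simplified guess, not the shipped algorithm."""
    max_exposure = 1.0 if is_stabilized else 0.333            # seconds
    if motion_px_per_s > 0:
        exposure = min(max_exposure, blur_budget_px / motion_px_per_s)
    else:
        exposure = max_exposure
    num_frames = 6 if is_stabilized else 15
    return exposure, num_frames

print(choose_capture_plan(motion_px_per_s=20.0, is_stabilized=False))  # (0.05, 15)
print(choose_capture_plan(motion_px_per_s=0.0, is_stabilized=True))    # (1.0, 6)
```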

Here’s a concrete example of using shorter per-frame exposures when we detect motion:
Left: 15-frame burst captured by one of two side-by-side Pixel 3 phones. Center: Night Sight shot with motion metering disabled, causing this phone to use 73ms exposures. The dog’s head is motion blurred in this crop. Right: Night Sight shot with motion metering enabled, causing this phone to notice the motion and use shorter 48ms exposures. This shot has less motion blur. (Mike Milne)
And here’s an example of using longer exposure times when we detect that the phone is on a tripod:
Left: Crop from a handheld Night Sight shot of the sky (full resolution image here). There was slight handshake, so Night Sight chose 333ms x 15 frames = 5.0 seconds of capture. Right: Tripod shot (full resolution image here). No handshake was detected, so Night Sight used 1.0 second x 6 frames = 6.0 seconds. The sky is cleaner (less noise), and you can see more stars. (Florian Kainz)
Alignment and Merging
The idea of averaging frames to reduce imaging noise is as old as digital imaging. In astrophotography it's called exposure stacking. While the technique itself is straightforward, the hard part is getting the alignment right when the camera is handheld. Our efforts in this area began with an app from 2010 called SynthCam. This app captured pictures continuously, aligned and merged them in real time at low resolution, and displayed the merged result, which steadily became cleaner as you watched.

Night Sight uses a similar principle, although at full sensor resolution and not in real time. On Pixel 1 and 2 we use HDR+'s merging algorithm, modified and re-tuned to strengthen its ability to detect and reject misaligned pieces of frames, even in very noisy scenes. On Pixel 3 we use Super Res Zoom, similarly re-tuned, whether you zoom or not. While the latter was developed for super-resolution, it also works to reduce noise, since it averages multiple images together. Super Res Zoom produces better results for some nighttime scenes than HDR+, but it requires the faster processor of the Pixel 3.
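As a minimal illustration of exposure stacking, the sketch below aligns each grayscale frame to the first one with a single global shift estimated by phase correlation, then averages. The real HDR+ and Super Res Zoom merges align per tile and reject regions they cannot align; none of that robustness is modeled here.

```python
import numpy as np

def global_shift(ref, frame):
    """Estimate an integer (dy, dx) translation between two grayscale frames
    by phase correlation."""
    F = np.fft.fft2(ref) * np.conj(np.fft.fft2(frame))
    corr = np.fft.ifft2(F / (np.abs(F) + 1e-9)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    if dy > ref.shape[0] // 2: dy -= ref.shape[0]
    if dx > ref.shape[1] // 2: dx -= ref.shape[1]
    return dy, dx

def stack(frames):
    """Toy exposure stacking: align each frame to the first with one global
    shift, then average. Noise drops roughly as 1/sqrt(number of frames);
    the production merge is tile-based and rejects what it cannot align."""
    ref = frames[0].astype(np.float64)
    acc = ref.copy()
    for f in frames[1:]:
        dy, dx = global_shift(ref, f)
        acc += np.roll(np.roll(f.astype(np.float64), dy, axis=0), dx, axis=1)
    return acc / len(frames)
```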

By the way, all of this happens on the phone in a few seconds. If you're quick about tapping on the icon that brings you to the filmstrip (wait until the capture is complete!), you can watch your picture "develop" as HDR+ or Super Res Zoom completes its work.

Other Challenges
Although the basic ideas described above sound simple, there are some gotchas when there isn't much light that proved challenging when developing Night Sight:

1. Auto white balancing (AWB) fails in low light.

Humans are good at color constancy — perceiving the colors of things correctly even under colored illumination (or when wearing sunglasses). But that process breaks down when we take a photograph under one kind of lighting and view it under different lighting; the photograph will look tinted to us. To correct for this perceptual effect, cameras adjust the colors of images to partially or completely compensate for the dominant color of the illumination (sometimes called color temperature), effectively shifting the colors in the image to make it seem as if the scene was illuminated by neutral (white) light. This process is called auto white balancing (AWB).

The problem is that white balancing is what mathematicians call an ill-posed problem. Is that snow really blue, as the camera recorded it? Or is it white snow illuminated by a blue sky? Probably the latter. This ambiguity makes white balancing hard. The AWB algorithm used in non-Night Sight modes is good, but in very dim or strongly colored lighting (think sodium vapor lamps), it’s hard to decide what color the illumination is.

To solve these problems, we developed a learning-based AWB algorithm, trained to discriminate between a well-white-balanced image and a poorly balanced one. When a captured image is poorly balanced, the algorithm can suggest how to shift its colors to make the illumination appear more neutral. Training this algorithm required photographing a diversity of scenes using Pixel phones, then hand-correcting their white balance while looking at the photo on a color-calibrated monitor. You can see how this algorithm works by comparing the same low-light scene captured in two ways on a Pixel 3:
Left: The white balancer in the Pixel’s default camera mode doesn't know how yellow the illumination was on this shack on the Vancouver waterfront (full resolution image here). Right: Our learning-based AWB algorithm does a better job (full resolution image here). (Marc Levoy)
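Whatever model estimates the illuminant, applying the correction typically comes down to per-channel gains on linear RGB. The sketch below assumes an illuminant estimate is already available (the values are invented); it does not attempt to reproduce the learned estimator itself.

```python
import numpy as np

def apply_white_balance(linear_rgb, illuminant_rgb):
    """Divide out an estimated illuminant color so a neutral surface maps to
    gray. The estimate itself would come from the learned AWB model; the
    value used below is invented purely for illustration."""
    gains = illuminant_rgb[1] / np.asarray(illuminant_rgb)   # normalize to green
    return np.clip(linear_rgb * gains, 0.0, 1.0)

# A warm, sodium-vapor-like illuminant estimate (made-up numbers):
balanced = apply_white_balance(np.random.rand(4, 4, 3), [0.9, 0.6, 0.25])
```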
2. Tone mapping of scenes that are too dark to see.

The goal of Night Sight is to make photographs of scenes so dark that you can't see them clearly with your own eyes — almost like a super-power! A related problem is that in very dim lighting humans stop seeing in color, because the cone cells in our retinas stop functioning, leaving only the rod cells, which can't distinguish different wavelengths of light. Scenes are still colorful at night; we just can't see their colors. We want Night Sight pictures to be colorful (that's part of the super-power), but this is another potential conflict. Finally, our rod cells have low spatial acuity, which is why things seem indistinct at night. We want Night Sight pictures to be sharp, with more detail than you can really see at night.

For example, if you put a DSLR camera on a tripod and take a very long exposure — several minutes, or stack several shorter exposures together — you can make nighttime look like daytime. Shadows will have details, and the scene will be colorful and sharp. Look at the photograph below, which was captured with a DSLR; it must be night, because you can see the stars, but the grass is green, the sky is blue, and the moon casts shadows from the trees that look like shadows cast by the sun. This is a nice effect, but it's not always what you want, and if you share the photograph with a friend, they'll be confused about when you captured it.
Yosemite valley at nighttime, Canon DSLR, 28mm f/4 lens, 3-minute exposure, ISO 100. It's nighttime, since you can see stars, but it looks like daytime (full resolution image here). (Jesse Levinson)
Artists have known for centuries how to make a painting look like night; look at the example below.2
A Philosopher Lecturing on the Orrery, by Joseph Wright of Derby, 1766 (image source: Wikidata). The artist uses pigments from black to white, but the scene depicted is evidently dark. How does he accomplish this? He increases contrast, surrounds the scene with darkness, and drops shadows to black, because we cannot see detail there.
We employ some of the same tricks in Night Sight, partly by throwing an S-curve into our tone mapping. But it's tricky to strike an effective balance between giving you “magical super-powers” and still reminding you when the photo was captured. The photograph below is particularly successful at doing this.
Pixel 3, 6-second Night Sight shot, with tripod (full resolution image here). (Alex Savu)
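For readers who want to see the S-curve idea in isolation, here is a toy global tone curve on values in [0, 1]. The shipped tone mapping is local and HDR-aware; this one-liner only illustrates steepening midtones while keeping deep shadows dark.

```python
import numpy as np

def s_curve(x, strength=12.0):
    """Toy global S-curve on values in [0, 1]: midtone contrast goes up and
    deep shadows are pushed toward black, part of what keeps a brightened
    image reading as 'night'. The shipped tone mapping is local and HDR-aware."""
    lo = 1.0 / (1.0 + np.exp(strength * 0.5))
    hi = 1.0 / (1.0 + np.exp(-strength * 0.5))
    y = 1.0 / (1.0 + np.exp(-strength * (x - 0.5)))
    return (y - lo) / (hi - lo)

print(np.round(s_curve(np.linspace(0, 1, 5)), 3))   # [0. 0.045 0.5 0.955 1.]
```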
How Dark can Night Sight Go?
Below 0.3 lux, autofocus begins to fail. If you can't find your keys on the floor, your smartphone can't focus either. To address this limitation, we've added two manual focus buttons to Night Sight on Pixel 3 - the "Near" button focuses at about 4 feet, and the "Far" button focuses at about 12 feet. The latter is the hyperfocal distance of our lens, meaning that everything from half of that distance (6 feet) to infinity should be in focus. We’re also working to improve Night Sight’s ability to autofocus in low light. Below 0.3 lux you can still take amazing pictures with a smartphone, and even do astrophotography as this blog post demonstrates, but for that you'll need a tripod, manual focus, and a third-party or custom app written using Android's Camera2 API.
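The "everything from half that distance to infinity" behaviour follows from the standard hyperfocal-distance formula. The sketch below plugs in illustrative lens values (assumed, not official Pixel 3 specifications) that happen to land near the 12-foot figure quoted above.

```python
def hyperfocal_mm(focal_mm, f_number, coc_mm):
    """Standard hyperfocal distance: focus here and everything from half this
    distance to infinity is acceptably sharp."""
    return focal_mm**2 / (f_number * coc_mm) + focal_mm

# Assumed, illustrative values only (not official Pixel 3 lens specifications):
H = hyperfocal_mm(focal_mm=4.4, f_number=1.8, coc_mm=0.003)
print(f"hyperfocal ~ {H / 304.8:.1f} ft, near limit ~ {H / 2 / 304.8:.1f} ft")
# ~11.8 ft and ~5.9 ft, consistent with the "Far" (12 ft) button described above.
```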

How far can we take this? Eventually one reaches a light level where read noise swamps the number of photons gathered by each pixel. There are other sources of noise, including dark current, which increases with exposure time and varies with temperature. To avoid this, biologists cool their cameras well below zero (Fahrenheit) when imaging weakly fluorescent specimens — something we don’t recommend doing to your Pixel phone! Super-noisy images are also hard to align reliably. Even if you could solve all these problems, the wind blows, the trees sway, and the stars and clouds move. Ultra-long exposure photography is hard.

How to Get the Most out of Night Sight
Night Sight not only takes great pictures in low light; it's also fun to use, because it takes pictures where you can barely see anything. We pop up a “chip” on the screen when the scene is dark enough that you’ll get a better picture using Night Sight, but don't limit yourself to these cases. Just after sunset, or at concerts, or in the city, Night Sight takes clean (low-noise) shots, and makes them brighter than reality. This is a "look", which seems magical if done right. Here are some examples of Night Sight pictures, and some A/B comparisons, mostly taken by our coworkers. And here are some tips on using Night Sight:

- Night Sight can't operate in complete darkness, so pick a scene with some light falling on it.
- Soft, uniform lighting works better than harsh lighting, which creates dark shadows.
- To avoid lens flare artifacts, try to keep very bright light sources out of the field of view.
- To increase exposure, tap on various objects, then move the exposure slider. Tap again to disable.
- To decrease exposure, take the shot and darken later in Google’s Photos editor; it will be less noisy.
- If it’s so dark the camera can’t focus, tap on a high-contrast edge, or the edge of a light source.
- If this won’t work for your scene, use the Near (4 feet) or Far (12 feet) focus buttons (see below).
- To maximize image sharpness, brace your phone against a wall or tree, or prop it on a table or rock.
- Night Sight works for selfies too, as in the A/B album, with optional illumination from the screen itself.
Manual focus buttons (Pixel 3 only).
Night Sight works best on Pixel 3. We’ve also brought it to Pixel 2 and the original Pixel, although on the latter we use shorter exposures because it has no optical image stabilization (OIS). Also, our learning-based white balancer is trained for Pixel 3, so it will be less accurate on older phones. By the way, we brighten the viewfinder in Night Sight to help you frame shots in low light, but the viewfinder is based on 1/15 second exposures, so it will be noisy, and isn't a fair indication of the final photograph. So take a chance — frame a shot, and press the shutter. You'll often be surprised!

Acknowledgements
Night Sight was a collaboration of several teams at Google. Key contributors to the project include: from the Gcam team Charles He, Nikhil Karnad, Orly Liba, David Jacobs, Tim Brooks, Michael Milne, Andrew Radin, Navin Sarma, Jon Barron, Yun-Ta Tsai, Jiawen Chen, Kiran Murthy, Tianfan Xue, Dillon Sharlet, Ryan Geiss, Sam Hasinoff and Alex Schiffhauer; from the Super Res Zoom team Bart Wronski, Peyman Milanfar and Ignacio Garcia Dorado; from the Google camera app team Gabriel Nava, Sushil Nath, Tim Smith, Justin Harrison, Isaac Reynolds and Michelle Chen.



1 By the way, the exposure time shown in Google Photos (if you press "i") is per-frame, not total time, which depends on the number of frames captured. You can get some idea of the number of frames by watching the animation while the camera is collecting light. Each tick around the circle is one captured frame.

2 For a wonderful analysis of these techniques, look at von Helmholtz, "On the relation of optics to painting" (1876).

Source: Google AI Blog


See Better and Further with Super Res Zoom on the Pixel 3



Digital zoom using algorithms (rather than lenses) has long been the “ugly duckling” of mobile device cameras. As compared to the optical zoom capabilities of DSLR cameras, the quality of digitally zoomed images has not been competitive, and conventional wisdom is that the complex optics and mechanisms of larger cameras can't be replaced with the much more compact mobile device cameras and clever algorithms.

With the new Super Res Zoom feature on the Pixel 3, we are challenging that notion.

The Super Res Zoom technology in Pixel 3 is different and better than any previous digital zoom technique based on upscaling a crop of a single image, because we merge many frames directly onto a higher resolution picture. This results in greatly improved detail that is roughly competitive with the 2x optical zoom lenses on many other smartphones. Super Res Zoom means that if you pinch-zoom before pressing the shutter, you’ll get a lot more details in your picture than if you crop afterwards.
Crops of 2x Zoom: Pixel 2, 2017 vs. Super Res Zoom on the Pixel 3, 2018.
The Challenges of Digital Zoom
Digital zoom is tough because a good algorithm is expected to start with a lower resolution image and "reconstruct" missing details reliably — with typical digital zoom a small crop of a single image is scaled up to produce a much larger image. Traditionally, this is done by linear interpolation methods, which attempt to recreate information that is not available in the original image, but introduce a blurry or “plasticky” look that lacks texture and detail. In contrast, most modern single-image upscalers use machine learning (including our own earlier work, RAISR). These magnify some specific image features such as straight edges and can even synthesize certain textures, but they cannot recover natural high-resolution details. While we still use RAISR to enhance the visual quality of images, most of the improved resolution provided by Super Res Zoom (at least for modest zoom factors like 2-3x) comes from our multi-frame approach.

Color Filter Arrays and Demosaicing
Reconstructing fine details is especially difficult because digital photographs are already incomplete — they’ve been reconstructed from partial color information through a process called demosaicing. In typical consumer cameras, the camera sensor elements are meant to measure only the intensity of the light, not directly its color. To capture real colors present in the scene, cameras use a color filter array placed in front of the sensor so that each pixel measures only a single color (red, green, or blue). These are arranged in a Bayer pattern as shown in the diagram below.
A Bayer mosaic color filter. Every 2x2 group of pixels captures light filtered by a specific color — two green pixels (because our eyes are more sensitive to green), one red, and one blue. This pattern is repeated across the whole image.
A camera processing pipeline then has to reconstruct the real colors and all the details at all pixels, given this partial information.* Demosaicing starts by making a best guess at the missing color information, typically by interpolating from the colors in nearby pixels, meaning that two-thirds of an RGB digital picture is actually a reconstruction!
Demosaicing reconstructs missing color information by using neighboring pixels.
In its simplest form, this could be achieved by averaging from neighboring values. Most real demosaicing algorithms are more complicated than this, but they still lead to imperfect results and artifacts - as we are limited to only partial information. While this situation exists even for large-format DSLR cameras, their bigger sensors and larger lenses allow for more detail to be captured than is typical in a mobile camera.
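Here is a minimal sketch of that simplest form: bilinear demosaicing of an RGGB mosaic, filling each missing color by averaging its measured neighbors. Real camera pipelines are far more sophisticated (edge- and noise-aware); the point here is only that two-thirds of the output is interpolated.

```python
import numpy as np
from scipy.ndimage import convolve

def bilinear_demosaic(bayer):
    """Simplest-form demosaicing of an RGGB mosaic: fill each missing color by
    averaging the measured neighbors of that color. Real pipelines are edge-
    and noise-aware, but the output is just as much a reconstruction."""
    h, w = bayer.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r_mask = ((yy % 2 == 0) & (xx % 2 == 0)).astype(float)
    b_mask = ((yy % 2 == 1) & (xx % 2 == 1)).astype(float)
    g_mask = 1.0 - r_mask - b_mask

    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]], dtype=float)
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float)

    def interp(mask, kernel):
        # Normalize by how many measured samples actually fell under the kernel.
        num = convolve(bayer * mask, kernel, mode='mirror')
        den = convolve(mask, kernel, mode='mirror')
        return num / den

    return np.dstack([interp(r_mask, k_rb), interp(g_mask, k_g), interp(b_mask, k_rb)])
```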

The situation gets worse if you pinch-zoom on a mobile device; then algorithms are forced to make up even more information, again by interpolation from the nearby pixels. However, not all is lost. This is where burst photography and the fusion of multiple images can be used to allow for super-resolution, even when limited by mobile device optics.

From Burst Photography to Multi-frame Super-resolution

While a single frame doesn't provide enough information to fill in the missing colors, we can get some of this missing information from multiple images taken successively. The process of capturing and combining multiple sequential photographs is known as burst photography. Google’s HDR+ algorithm, successfully used in Nexus and Pixel phones, already uses information from multiple frames to make photos from mobile phones reach the level of quality expected from a much larger sensor; could a similar approach be used to increase image resolution?

It has been known for more than a decade, including in astronomy where the basic concept is known as “drizzle”, that capturing and combining multiple images taken from slightly different positions can yield resolution equivalent to optical zoom, at least at low magnifications like 2x or 3x and in good lighting conditions. In this process, called multi-frame super-resolution, the general idea is to align and merge low-resolution bursts directly onto a grid of the desired (higher) resolution. Here's an example of how an idealized multi-frame super-resolution algorithm might work:
As compared to the standard demosaicing pipeline that needs to interpolate the missing colors (top), ideally, one could fill some holes from multiple images, each shifted by one pixel horizontally or vertically.
In the example above, we capture 4 frames, three of them shifted by exactly one pixel: in the horizontal, vertical, and both horizontal and vertical directions. All the holes would get filled, and there would be no need for any demosaicing at all! Indeed, some DSLR cameras support this operation, but only if the camera is on a tripod, and the sensor/optics are actively moved to different positions. This is sometimes called "microstepping".
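The idealized diagram above can be simulated in a few lines: with an RGGB mosaic and four frames shifted by exactly one pixel, every output pixel is observed through all four CFA positions, so red, green and blue are all measured directly and no demosaicing is needed. This toy simulation assumes perfect whole-pixel shifts, which real hand motion never provides.

```python
import numpy as np

def cfa_color(y, x):
    """RGGB Bayer color index at sensor site (y, x): 0=R, 1=G, 2=B."""
    return [[0, 1], [1, 2]][y % 2][x % 2]

def ideal_pixel_shift(scene_rgb):
    """Idealized capture from the diagram above: four frames shifted by exactly
    one pixel expose every scene point through all four CFA sites, so R, G and
    B are all measured directly and no demosaicing is needed. Real hand motion
    is never this tidy; this is purely illustrative."""
    h, w, _ = scene_rgb.shape
    recon = np.zeros_like(scene_rgb)
    counts = np.zeros((h, w, 3))
    for dy, dx in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        for y in range(h - 1):
            for x in range(w - 1):
                c = cfa_color(y + dy, x + dx)     # the color this frame measures here
                recon[y, x, c] += scene_rgb[y, x, c]
                counts[y, x, c] += 1
    return recon / np.maximum(counts, 1)          # green is measured twice

scene = np.random.rand(8, 8, 3)
assert np.allclose(ideal_pixel_shift(scene)[:7, :7], scene[:7, :7])
```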

Over the years, the practical usage of this “super-res” approach to higher resolution imaging remained confined largely to the laboratory, or otherwise controlled settings where the sensor and the subject were aligned and the movement between them was either deliberately controlled or tightly constrained. For instance, in astronomical imaging, a stationary telescope sees a predictably moving sky. But in widely used imaging devices like the modern-day smartphone, the practical use of super-res for zoom has remained mostly out of reach.

This is in part due to the fact that in order for this to work properly, certain conditions need to be satisfied. First, and most important, is that the lens needs to resolve detail better than the sensor used (in contrast, you can imagine a case where the lens is so poorly-designed that adding a better sensor provides no benefit). This property is often observed as an unwanted artifact of digital cameras called aliasing.

Image Aliasing
Aliasing occurs when a camera sensor is unable to faithfully represent all patterns and details present in a scene. A good example of aliasing is the Moiré pattern, sometimes seen on TV as a result of an unfortunate choice of wardrobe. Furthermore, the aliasing effect on a physical feature (such as an edge of a table) changes when things move in a scene. You can observe this in the following burst sequence, where slight motions of the camera during the burst create time-varying alias effects:
Left: High-resolution single image of a table edge against a high-frequency patterned background. Right: Different frames from a burst. Aliasing and Moiré effects are visible between different frames — pixels seem to jump around and produce different colored patterns.
However, this behavior is a blessing in disguise: if one analyzes the patterns produced, it gives us the variety of color and brightness values, as discussed in the previous section, to achieve super-resolution. That said, many challenges remain, as practical super-resolution needs to work with a handheld mobile phone and on any burst sequence.

Practical Super-resolution Using Hand Motion

As noted earlier, some DSLR cameras offer special tripod super-resolution modes that work in a way similar to what we described so far. These approaches rely on the physical movement of the sensors and optics inside the camera, but require a complete stabilization of the camera otherwise, which is impractical in mobile devices, since they are nearly always handheld. This would seem to create a catch-22 for super-resolution imaging on mobile platforms.

However, we turn this difficulty on its head, by using the hand-motion to our advantage. When we capture a burst of photos with a handheld camera or phone, there is always some movement present between the frames. Optical Image Stabilization (OIS) systems compensate for large camera motions - typically 5-20 pixels between successive frames spaced 1/30 second apart - but are unable to completely eliminate faster, lower magnitude, natural hand tremor, which occurs for everyone (even those with “steady hands”). When taking photos using mobile phones with a high resolution sensor, this hand tremor has a magnitude of just a few pixels.
Effect of hand tremor as seen in a cropped burst, after global alignment.
To take advantage of hand tremor, we first need to align the pictures in a burst together. We choose a single image in the burst as the “base” or reference frame, and align every other frame relative to it. After alignment, the images are combined together roughly as in the diagram shown earlier in this post. Of course, handshake is unlikely to move the image by exact whole-pixel amounts, so we need to interpolate between adjacent pixels in each newly captured frame before injecting the colors into the pixel grid of our base frame.
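Here is a small sketch of that injection step: a frame known to be offset from the base frame by a sub-pixel shift is splatted onto the base grid with bilinear weights, and a weight image is accumulated for later normalization. Super Res Zoom estimates local alignment and anisotropic weights; this global-shift version only illustrates the bookkeeping.

```python
import numpy as np

def splat_shifted_frame(acc, weight, frame, dy, dx):
    """Accumulate a frame that is offset from the base frame by a sub-pixel
    shift (dy, dx) onto the base grid, splitting each sample bilinearly among
    the four nearest base-grid pixels. Super Res Zoom estimates local (not
    global) alignment and anisotropic weights; only the bookkeeping is shown."""
    h, w = frame.shape
    iy, ix = int(np.floor(dy)), int(np.floor(dx))
    fy, fx = dy - iy, dx - ix
    taps = [(0, 0, (1 - fy) * (1 - fx)), (0, 1, (1 - fy) * fx),
            (1, 0, fy * (1 - fx)), (1, 1, fy * fx)]
    for oy, ox, wgt in taps:
        sy, sx = iy + oy, ix + ox
        dst_y, dst_x = slice(max(0, sy), min(h, h + sy)), slice(max(0, sx), min(w, w + sx))
        src_y, src_x = slice(max(0, -sy), min(h, h - sy)), slice(max(0, -sx), min(w, w - sx))
        acc[dst_y, dst_x] += wgt * frame[src_y, src_x]
        weight[dst_y, dst_x] += wgt
    return acc, weight

# After splatting every frame in the burst: merged = acc / np.maximum(weight, 1e-6)
```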

When hand motion is not present because the device is completely stabilized (e.g. placed on a tripod), we can still achieve our goal of simulating natural hand motion by intentionally “jiggling” the camera, by forcing the OIS module to move slightly between the shots. This movement is extremely small and chosen such that it doesn’t interfere with normal photos - but you can observe it yourself on Pixel 3 by holding the phone perfectly still, such as by pressing it against a window, and maximally pinch-zooming the viewfinder. Look for a tiny but continuous elliptical motion in distant objects, like that shown below.
Overcoming the Challenges of Super-resolution
The description of the ideal process we gave above sounds simple, but super-resolution is not that easy — there are many reasons why it hasn’t been widely used in consumer products like mobile phones, and making it practical required significant algorithmic innovation. Challenges can include:
  • A single image from a burst is noisy, even in good lighting. A practical super-resolution algorithm needs to be aware of this noise and work correctly despite it. We don’t want just a higher resolution noisy image - our goal is to both increase the resolution and produce a much less noisy result.
    Left: A single frame from a burst taken in good light conditions can still contain a substantial amount of noise due to underexposure. Right: Result of merging multiple frames after burst processing.
  • Motion between images in a burst is not limited to just the movement of the camera. There can be complex motions in the scene such as wind-blown leaves, ripples moving across the surface of water, cars, people moving or changing their facial expressions, or the flicker of a flame — even some movements that cannot be assigned a single, unique motion estimate because they are transparent or multi-layered, such as smoke or glass. Completely reliable and localized alignment is generally not possible, and therefore a good super-resolution algorithm needs to work even if motion estimation is imperfect.
  • Because much of this motion is random, even if there is good alignment, the data may be dense in some areas of the image and sparse in others. The crux of super-resolution is a complex interpolation problem, so the irregular spread of data makes it challenging to produce a higher-resolution image in all parts of the grid.
All the above challenges would seem to make real-world super-resolution either infeasible in practice, or at best limited to only static scenes and a camera placed on a tripod. With Super Res Zoom on Pixel 3, we’ve developed a stable and accurate burst resolution enhancement method that uses natural hand motion, and is robust and efficient enough to deploy on a mobile phone.

Here’s how we’ve addressed some of these challenges:
  • To effectively merge frames in a burst, and to produce a red, green, and blue value for every pixel without the need for demosaicing, we developed a method of integrating information across the frames that takes into account the edges of the image, and adapts accordingly. Specifically, we analyze the input frames and adjust how we combine them together, trading off increase in detail and resolution vs. noise suppression and smoothing. We accomplish this by merging pixels along the direction of apparent edges, rather than across them. The net effect is that our multi-frame method provides the best practical balance between noise reduction and enhancement of details.
    Left: Merged image with sub-optimal tradeoff of noise reduction and enhanced resolution. Right: The same merged image with a better tradeoff.
  • To make the algorithm handle scenes with complex local motion (people, cars, water or tree leaves moving) reliably, we developed a robustness model that detects and mitigates alignment errors (see the sketch after this list). We select one frame as a “reference image”, and merge information from other frames into it only if we’re sure that we have found the correct corresponding feature. In this way, we can avoid artifacts like “ghosting” or motion blur, and wrongly merged parts of the image.
    A fast moving bus in a burst of images. Left: Merge without robustness model. Right: Merge with robustness model.
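One simple way to realize the robustness idea, sketched below purely for illustration, is to down-weight an aligned frame's contribution wherever it disagrees with the reference frame by more than the expected noise; where alignment failed, the output then falls back to the reference. This is not the production model, only the principle.

```python
import numpy as np

def robust_merge(reference, aligned_frames, noise_sigma):
    """Toy robustness model: each aligned frame contributes at a pixel only to
    the extent it agrees with the reference there, relative to the expected
    noise. Where alignment failed (a moving bus, waving leaves), the weight
    collapses and the result falls back to the reference, avoiding ghosting."""
    acc = reference.astype(np.float64).copy()
    weight = np.ones_like(acc)
    for frame in aligned_frames:
        diff = frame.astype(np.float64) - reference
        w = np.exp(-(diff / (3.0 * noise_sigma)) ** 2)   # ~1 if consistent, ~0 if not
        acc += w * frame
        weight += w
    return acc / weight
```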
Pushing the State of the Art in Mobile Photography
The Portrait mode last year, and the HDR+ pipeline before it, showed how good mobile photography can be. This year, we set out to do the same for zoom. That’s another step in advancing the state of the art in computational photography, while shrinking the quality gap between mobile photography and DSLRs. Here is an album containing full FOV images, followed by Super Res Zoom images. Note that the Super Res Zoom images in this album are not cropped — they are captured directly on-device using pinch-zoom.
Left: Crop of 7x zoomed image on Pixel 2. Right: Same crop from Super Res Zoom on Pixel 3.
The idea of super-resolution predates the advent of smartphones by at least a decade. For nearly as long, it has also lived in the public imagination through films and television. It’s been the subject of thousands of papers in academic journals and conferences. Now, it is real — in the palm of your hands, in Pixel 3.
An illustrative animation of Super Res Zoom. When the user takes a zoomed photo, the Pixel 3 takes advantage of the user’s natural hand motion and captures a burst of images at subtly different positions. These are then merged together to add detail to the final image.
Acknowledgements
Super Res Zoom is the result of a collaboration across several teams at Google. The project would not have been possible without the joint efforts of teams managed by Peyman Milanfar, Marc Levoy, and Bill Freeman. The authors would like to thank Marc Levoy and Isaac Reynolds in particular for their assistance in the writing of this blog.

The authors wish to especially acknowledge the following key contributors to the Super Res Zoom project: Ignacio Garcia-Dorado, Haomiao Jiang, Manfred Ernst, Michael Krainin, Daniel Vlasic, Jiawen Chen, Pascal Getreuer, and Chia-Kai Liang. The project also benefited greatly from contributions and feedback by Ce Liu, Damien Kelly, and Dillon Sharlet.



How to get the most out of Super Res Zoom?
Here are some tips on getting the best of Super Res Zoom on a Pixel 3 phone:
  • Pinch and zoom, or use the + button to increase zoom by discrete steps.
  • Double-tap the preview to quickly toggle between zoomed in and zoomed out.
  • Super Res Zoom works well at all zoom factors, though for performance reasons it activates only above 1.2x. That’s about halfway between no zoom and the first “click” in the zoom UI.
  • There are fundamental limits to the optical resolution of a wide-angle camera. So to get the most out of (any) zoom, keep the magnification factor modest.
  • Avoid fast-moving objects. Super Res Zoom will capture them correctly, but you are unlikely to get increased resolution.


* It’s worth noting that the situation is similar in some ways to how we see — in human (and other mammalian) eyes, different eye cone cells are sensitive to some specific colors, with the brain filling in the details to reconstruct the full image.

Source: Google AI Blog


Behind the Motion Photos Technology in Pixel 2


One of the most compelling things about smartphones today is the ability to capture a moment on the fly. With motion photos, a new camera feature available on the Pixel 2 and Pixel 2 XL phones, you no longer have to choose between a photo and a video, so every photo you take captures more of the moment. When you take a photo with motion enabled, your phone also records and trims up to 3 seconds of video. Using advanced stabilization built upon technology we pioneered in Motion Stills for Android, these pictures come to life in Google Photos. Let’s take a look behind the technology that makes this possible!
Motion photos on the Pixel 2 in Google Photos. With the camera frozen in place the focus is put directly on the subjects. For more examples, check out this Google Photos album.
Camera Motion Estimation by Combining Hardware and Software
The image and video pair that is captured every time you hit the shutter button is a full resolution JPEG with an embedded 3 second video clip. On the Pixel 2, the video portion also contains motion metadata that is derived from the gyroscope and optical image stabilization (OIS) sensors to aid the trimming and stabilization of the motion photo. By combining software based visual tracking with the motion metadata from the hardware sensors, we built a new hybrid motion estimation for motion photos on the Pixel 2.

Our approach aligns the background more precisely than the technique used in Motion Stills or the purely hardware sensor based approach. Based on Fused Video Stabilization technology, it reduces the artifacts from the visual analysis due to a complex scene with many depth layers or when a foreground object occupies a large portion of the field of view. It also improves the hardware sensor based approach by refining the motion estimation to be more accurate, especially at close distances.
Motion photo as captured (left) and after freezing the camera by combining hardware and software. For more comparisons, check out this Google Photos album.
The purely software-based technique we introduced in Motion Stills uses the visual data from the video frames, detecting and tracking features over consecutive frames, yielding motion vectors. It then classifies the motion vectors into foreground and background using motion models such as an affine transformation or a homography. However, this classification is not perfect and can be misled, e.g. by a complex scene or dominant foreground.
Feature classification into background (green) and foreground (orange) by using the motion metadata from the hardware sensors of the Pixel 2. Notice how the new approach not only labels the skateboarder accurately as foreground but also the half-pipe that is at roughly the same depth.
For motion photos on Pixel 2 we improved this classification by using the motion metadata derived from the gyroscope and the OIS. This accurately captures the camera motion with respect to the scene at infinity, which one can think of as the background in the distance. However, for pictures taken at closer range, parallax is introduced for scene elements at different depth layers, which is not accounted for by the gyroscope and OIS. Specifically, we mark motion vectors that deviate too much from the motion metadata as foreground. This results in a significantly more accurate classification of foreground and background, which also enables us to use a more complex motion model known as mixture homographies that can account for rolling shutter and undo the distortions it causes.
Background motion estimation in motion photos. By using the motion metadata from Gyro and OIS we are able to accurately classify features from the visual analysis into foreground and background.
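A minimal sketch of that classification step: compare each visually tracked motion vector with the motion predicted from the gyro/OIS metadata and label it foreground if the deviation is too large. The threshold and the sensor-predicted vectors below are stand-ins, not the shipped values.

```python
import numpy as np

def classify_features(visual_vectors, sensor_predicted_vectors, threshold_px=2.0):
    """Label a tracked feature as background if its visually tracked motion
    agrees with the camera motion predicted from the gyroscope/OIS metadata,
    and as foreground otherwise. The threshold and the predicted vectors here
    are stand-ins, not the shipped values."""
    deviation = np.linalg.norm(visual_vectors - sensor_predicted_vectors, axis=1)
    return np.where(deviation > threshold_px, 'foreground', 'background')

visual = np.array([[5.1, 0.2], [5.0, -0.1], [12.0, 3.0]])    # tracked motion (px)
predicted = np.array([[5.0, 0.0], [5.0, 0.0], [5.0, 0.0]])   # from gyro + OIS
print(classify_features(visual, predicted))  # ['background' 'background' 'foreground']
```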
Motion Photo Stabilization and Playback
Once we have accurately estimated the background motion for the video, we determine an optimally stable camera path to align the background using linear programming techniques outlined in our earlier posts. Further, we automatically trim the video to remove any accidental motion caused by putting the phone away. All of this processing happens on your phone and produces a small amount of metadata per frame that is used to render the stabilized video in real-time using a GPU shader when you tap the Motion button in Google Photos. In addition, we play the video starting at the exact timestamp as the HDR+ photo, producing a seamless transition from still image to video.
Motion photos stabilize even complex scenes with large foreground motions.
Motion Photo Sharing
Using Google Photos, you can share motion photos with your friends as videos and GIFs, watch them on the web, or view them on any phone. This is another example of combining hardware, software and machine learning to create new features for Pixel 2.

Acknowledgements
Motion photos is a result of a collaboration across several Google Research teams, Google Pixel and Google Photos. We especially want to acknowledge the work of Karthik Raveendran, Suril Shah, Marius Renn, Alex Hong, Radford Juang, Fares Alhassen, Emily Chang, Isaac Reynolds, and Dave Loxton.

Source: Google AI Blog


Mobile Real-time Video Segmentation



Video segmentation is a widely used technique that enables movie directors and video content creators to separate the foreground of a scene from the background, and treat them as two different visual layers. By modifying or replacing the background, creators can convey a particular mood, transport themselves to a fun location or enhance the impact of the message. However, this operation has traditionally been performed as a time-consuming manual process (e.g. an artist rotoscoping every frame) or requires a studio environment with a green screen for real-time background removal (a technique referred to as chroma keying). In order to enable users to create this effect live in the viewfinder, we designed a new technique that is suitable for mobile phones.

Today, we are excited to bring precise, real-time, on-device mobile video segmentation to the YouTube app by integrating this technology into stories. Currently in limited beta, stories is YouTube’s new lightweight video format, designed specifically for YouTube creators. Our new segmentation technology allows creators to replace and modify the background, effortlessly increasing videos’ production value without specialized equipment.
Neural network video segmentation in YouTube stories.
To achieve this, we leverage machine learning to solve a semantic segmentation task using convolutional neural networks. In particular, we designed a network architecture and training procedure suitable for mobile phones focusing on the following requirements and constraints:
  • A mobile solution should be lightweight and run at least 10-30 times faster than existing state-of-the-art photo segmentation models. For real-time inference, such a model needs to provide results at 30 frames per second.
  • A video model should leverage temporal redundancy (neighboring frames look similar) and exhibit temporal consistency (neighboring results should be similar).
  • High quality segmentation results require high quality annotations.
The Dataset
To provide high-quality data for our machine learning pipeline, we annotated tens of thousands of images that captured a wide spectrum of foreground poses and background settings. Annotations consisted of pixel-accurate locations of foreground elements such as hair, glasses, neck, skin, lips, etc. and a general background label, achieving a cross-validation result of 98% Intersection-Over-Union (IOU) with human annotator quality.
An example image from our dataset carefully annotated with nine labels - foreground elements are overlaid over the image.
Network Input
Our specific segmentation task is to compute a binary mask separating foreground from background for every input frame (three channels, RGB) of the video. Achieving temporal consistency of the computed masks across frames is key. Current methods that utilize LSTMs or GRUs to realize this are too computationally expensive for real-time applications on mobile phones. Instead we first pass the computed mask from the previous frame as a prior by concatenating it as a fourth channel to the current RGB input frame to achieve temporal consistency, as shown below:
The original frame (left) is separated in its three color channels and concatenated with the previous mask (middle). This is used as input to our neural network to predict the mask for the current frame (right).
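Assembling that input is straightforward; here is a small sketch in which the previous mask defaults to all zeros when there is no prior. The streaming-loop comment uses a placeholder model name, not the real network.

```python
import numpy as np

def make_network_input(rgb_frame, previous_mask=None):
    """Concatenate the current RGB frame (H, W, 3) with the previous frame's
    predicted mask (H, W, 1) as a fourth channel; with no prior, use zeros."""
    h, w, _ = rgb_frame.shape
    if previous_mask is None:
        previous_mask = np.zeros((h, w, 1), dtype=rgb_frame.dtype)
    return np.concatenate([rgb_frame, previous_mask], axis=-1)   # (H, W, 4)

# Streaming loop (the model name is a placeholder, not the real network):
# mask = None
# for frame in video_frames:
#     x = make_network_input(frame, mask)
#     mask = segmentation_model.predict(x[None])[0]   # fed back on the next frame
```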
Training Procedure
In video segmentation we need to achieve frame-to-frame temporal continuity, while also accounting for temporal discontinuities such as people suddenly appearing in the field of view of the camera. To train our model to robustly handle those use cases, we transform the annotated ground truth of each photo in several ways and use it as a previous frame mask (a sketch of these augmentations follows the list):
  • Empty previous mask - Trains the network to work correctly for the first frame and new objects in scene. This emulates the case of someone appearing in the camera's frame.
  • Affine transformed ground truth mask - Minor transformations train the network to propagate and adjust to the previous frame mask. Major transformations train the network to understand inadequate masks and discard them.
  • Transformed image - We implement thin plate spline smoothing of the original image to emulate fast camera movements and rotations.
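Below is a minimal sketch of how such previous-mask training priors could be generated from a single annotated photo. The OpenCV calls and jitter magnitudes are illustrative assumptions, and the thin plate spline image warp from the last bullet is omitted.

    import numpy as np
    import cv2  # assumed available for the affine warp

    def previous_mask_variants(gt_mask, rng=None):
        """Yield synthetic 'previous frame' masks from one ground-truth mask.

        gt_mask: float32 array of shape (H, W), values in [0, 1].
        """
        rng = rng or np.random.default_rng()
        h, w = gt_mask.shape

        # 1. Empty prior: emulates the first frame or a person entering the scene.
        yield np.zeros_like(gt_mask)

        # 2. Minor affine jitter: teaches the network to propagate/adjust the prior.
        # 3. Major affine jitter: teaches the network to distrust a bad prior.
        for max_shift, max_angle in [(0.02, 2.0), (0.25, 25.0)]:
            tx, ty = rng.uniform(-max_shift, max_shift, size=2) * (w, h)
            angle = rng.uniform(-max_angle, max_angle)
            m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
            m[:, 2] += (tx, ty)
            yield cv2.warpAffine(gt_mask, m, (w, h))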
Our real-time video segmentation in action.
Network Architecture
With that modified input/output, we build on a standard hourglass segmentation network architecture by adding the following improvements:
  • We use big convolution kernels with large strides of four and above to detect object features on the high-resolution RGB input frame. Convolutions for layers with a small number of channels (as is the case for the RGB input) are comparatively cheap, so using big kernels here has almost no effect on the computational costs.
  • For speed gains, we aggressively downsample using large strides combined with skip connections like U-Net to restore low-level features during upsampling. For our segmentation model this technique results in a significant improvement of 5% IOU compared to using no skip connections.
    Hourglass segmentation network w/ skip connections.
  • For even further speed gains, we optimized the default ResNet bottlenecks. In the literature, authors tend to squeeze channels in the middle of the network by a factor of four (e.g. reducing 256 channels to 64 by using 64 different convolution kernels). However, we noticed that one can squeeze much more aggressively, by a factor of 16 or 32, without significant quality degradation (a minimal sketch follows this list).
    ResNet bottleneck with large squeeze factor.
  • To refine and improve the accuracy of edges, we add several DenseNet layers on top of our network in full resolution, similar to neural matting. This technique improves overall model quality by a modest 0.5% IOU, but the perceptual quality of the segmentation improves significantly.
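Below is a toy Keras sketch of a residual bottleneck with an aggressive channel squeeze, as described in the bottleneck bullet above; the layer choices and kernel sizes are illustrative, not the production architecture.

    import tensorflow as tf
    from tensorflow.keras import layers

    def bottleneck(x, channels=256, squeeze_factor=16):
        """Residual bottleneck with an aggressive channel squeeze (sketch).

        A standard ResNet bottleneck squeezes `channels` by 4; here we squeeze
        by 16-32 as described above. The identity shortcut and kernel sizes are
        illustrative choices.
        """
        mid = max(channels // squeeze_factor, 8)
        y = layers.Conv2D(mid, 1, padding="same", activation="relu")(x)
        y = layers.Conv2D(mid, 3, padding="same", activation="relu")(y)
        y = layers.Conv2D(channels, 1, padding="same")(y)
        return layers.ReLU()(layers.Add()([x, y]))

    # Usage: x must already have `channels` feature maps for the identity shortcut.
    inputs = layers.Input(shape=(64, 64, 256))
    outputs = bottleneck(inputs, channels=256, squeeze_factor=16)
    model = tf.keras.Model(inputs, outputs)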
The end result of these modifications is that our network runs remarkably fast on mobile devices, achieving 100+ FPS on iPhone 7 and 40+ FPS on Pixel 2 with high accuracy (realizing 94.8% IOU on our validation dataset), delivering a variety of smooth-running and responsive effects in YouTube stories.
Our immediate goal is to use the limited rollout in YouTube stories to test our technology on this first set of effects. As we improve and expand our segmentation technology to more labels, we plan to integrate it into Google's broader Augmented Reality services.

Acknowledgements
A thank you to our team members who worked on the tech and this launch with us: Andrey Vakunov, Yury Kartynnik, Artsiom Ablavatski, Ivan Grishchenko, Matsvei Zhdanovich, Andrei Kulik, Camillo Lugaresi, John Kim, Ryan Bolyard, Wendy Huang, Michael Chang, Aaron La Lau, Willi Geiger, Tomer Margolin, John Nack and Matthias Grundmann.




Introducing Appsperiments: Exploring the Potentials of Mobile Photography



Each of the world's approximately two billion smartphone owners is carrying a camera capable of capturing photos and video of a tonal richness and quality unimaginable even five years ago. Until recently, those cameras behaved mostly as optical sensors, capturing light and operating on the resulting image's pixels. The next generation of cameras, however, will have the capability to blend hardware and computer vision algorithms that operate as well on an image's semantic content, enabling radically new creative mobile photo and video applications.

Today, we're launching the first installment of a series of photography appsperiments: usable and useful mobile photography experiences built on experimental technology. Our "appsperimental" approach was inspired in part by Motion Stills, an app developed by researchers at Google that converts short videos into cinemagraphs and time lapses using experimental stabilization and rendering technologies. Our appsperiments replicate this approach by building on other technologies in development at Google. They rely on object recognition, person segmentation, stylization algorithms, efficient image encoding and decoding technologies, and perhaps most importantly, fun!

Storyboard
Storyboard (Android) transforms your videos into single-page comic layouts, entirely on device. Simply shoot a video and load it in Storyboard. The app automatically selects interesting video frames, lays them out, and applies one of six visual styles. Save the comic or pull down to refresh and instantly produce a new one. There are approximately 1.6 trillion different possibilities!

Selfissimo!
Selfissimo! (iOS, Android) is an automated selfie photographer that snaps a stylish black and white photo each time you pose. Tap the screen to start a photoshoot. The app encourages you to pose and captures a photo whenever you stop moving. Tap the screen to end the session and review the resulting contact sheet, saving individual images or the entire shoot.

Scrubbies
Scrubbies (iOS) lets you easily manipulate the speed and direction of video playback to produce delightful video loops that highlight actions, capture funny faces, and replay moments. Shoot a video in the app and then remix it by scratching it like a DJ. Scrubbing with one finger plays the video. Scrubbing with two fingers captures the playback so you can save or share it.

Try them out and tell us what you think using the in-app feedback links. The feedback and ideas we get from the new and creative ways people use our appsperiments will help guide some of the technology we develop next.

Acknowledgements
These appsperiments represent a collaboration across many teams at Google. We would like to thank the core contributors Andy Dahley, Ashley Ma, Dexter Allen, Ignacio Garcia Dorado, Madison Le, Mark Bowers, Pascal Getreuer, Robin Debreuil, Suhong Jin, and William Lindmeier. We also wish to give special thanks to Buck Bourdon, Hossein Talebi, Kanstantsin Sokal, Karthik Raveendran, Matthias Grundmann, Peyman Milanfar, Suril Shah, Tomas Izo, Tyler Mullen, and Zheng Sun.

Fused Video Stabilization on the Pixel 2 and Pixel 2 XL



One of the most important aspects of current smartphones is easily capturing and sharing videos. With the Pixel 2 and Pixel 2 XL smartphones, the videos you capture are smoother and clearer than ever before, thanks to our Fused Video Stabilization technique based on both optical image stabilization (OIS) and electronic image stabilization (EIS). Fused Video Stabilization delivers highly stable footage with minimal artifacts, and the Pixel 2 is currently rated as the leader in DxO's video ranking (also earning the highest overall rating for a smartphone camera). But how does it work?

A key principle in videography is keeping the camera motion smooth and steady. A stable video is free of distraction, so the viewer can focus on the subject of interest. But videos taken with smartphones are subject to many conditions that make taking a high-quality video a significant challenge:

Camera Shake
Most people hold their mobile phones in their hands to record videos - you pull the phone from your pocket, record the video, and the video is ready to share right after recording. However, that means your videos shake as much as your hands do -- and they shake a lot! Moreover, if you are walking or running while recording, the camera motion can make videos almost unwatchable:
Motion Blur
If the camera or the subject moves during exposure, the resulting photo or video will appear blurry. Even if we stabilize the motion in between consecutive frames, the motion blur in each individual frame cannot be easily restored in practice, especially on a mobile device. One typical video artifact due to motion blur is sharpness inconsistency: the video may rapidly alternate between blurry and sharp, which is very distracting even after the video is stabilized:
Rolling Shutter
The CMOS image sensor collects one row of pixels, or “scanline”, at a time, and it takes tens of milliseconds to go from the top scanline to the bottom. Therefore, anything moving during this period can appear distorted. This is called rolling shutter distortion. Even if you have a steady hand, rolling shutter distortion will appear when you move quickly:
A simulated rendering of a video with global (left) and rolling (right) shutter.
Focus Breathing
When there are objects of varying distance in a video, the angle of view can change significantly due to objects “jumping” in and out of the foreground. As a result, everything shrinks or expands like the video below, an effect professionals call “breathing”:
A good stabilization system should address all of these issues: the video should look sharp, the motion should be smooth, and the rolling shutter and focus breathing should be corrected.

Many professionals mount the camera on a mechanical stabilizer to entirely isolate hand motion. These devices actively sense and compensate for the camera’s movement to remove all unwanted motions. However, they are usually expensive and cumbersome; you wouldn’t want to carry one every day. There are also handheld gimbal mounts available for mobile phones. However, they are usually larger than the phone itself, and you have to put the phone on them before you start recording. You’d need to do that fast before the interesting moment vanishes.

Optical Image Stabilization (OIS) is the most well-known method for suppressing handshake artifacts. Typically, in mobile camera modules with OIS, the lens is suspended in the middle of the module by a number of springs, and electromagnets are used to move the lens within its enclosure. The lens module actively senses and compensates for handshake motion at very high speeds. Because OIS responds to motion rapidly, it can greatly suppress handshake blur. However, the range of correctable motion is fairly limited (usually around 1-2 degrees), which is not enough to correct the unwanted motions between consecutive video frames, or to correct excessive motion blur during walking. Moreover, OIS cannot correct some kinds of motions, such as in-plane rotation. Sometimes it can even introduce a “jello” artifact:
The video is taken by Pixel 2 with only OIS enabled. You can see the frame center is stabilized, but the boundaries have some jello-like artifacts.
Electronic Image Stabilization (EIS) analyzes the camera motion, filters out the unwanted parts, and synthesizes a new video by transforming each frame. The final stabilization quality depends on the algorithm design and implementation optimization of these stages. In general, software-based EIS is more flexible than OIS so it can correct larger and more kinds of motions. However, EIS has some common limitations. First, to prevent undefined regions in the synthesized frame, it needs to reduce the field of view or resolution. Second, compared to OIS or an external stabilizer, EIS requires more computation, which is a limited resource on mobile phones.

Making a Better Video: Fused Video Stabilization
With Fused Video Stabilization, both OIS and EIS are enabled simultaneously during video recording to address all the issues mentioned above. Our solution has three processing stages as shown in the system diagram below. The first processing stage, motion analysis, extracts the gyroscope signal, the OIS motion, and other properties to estimate the camera motion precisely. Then, the motion filtering stage combines machine learning and signal processing to predict a person’s intention in moving the camera. Finally, in the frame synthesis stage, we model and remove the rolling shutter and focus breathing distortion. With Fused Video Stabilization, the videos from Pixel 2 have less motion blur and look more natural. The solution is efficient enough to run in all video modes, such as 60fps or 4K recording.
Motion Analysis
In the motion analysis stage, we use the phone’s high-speed gyroscope to estimate the rotational component of the hand motion (roll, pitch, and yaw). By sensing the motion at 200 Hz, we have dense motion vectors for each scanline, enough to model the rolling shutter distortion. We also measure lens motions that are not sensed by the gyroscope, including both the focus adjustment (z) and the OIS movement (x and y) at high speed. Because we need high temporal precision to model the rolling shutter effect, we carefully optimize the system to ensure perfect timestamp alignment between the CMOS image sensor, the gyroscope, and the lens motion readouts. A misalignment of merely a few milliseconds can introduce noticeable jittering artifact:
Left: The stabilized video of a “running” motion with a 3ms timing error. Note the occasional jittering. Right: The stabilized video with correct timestamps. The bottom right corner shows the original shaky video.
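To illustrate why precise timestamps matter, here is a minimal sketch of turning high-rate gyroscope samples into a per-scanline rotation estimate. It uses simple small-angle integration and linear interpolation; the real system works with carefully aligned sensor timestamps and full rotations, and the function and parameter names here are illustrative.

    import numpy as np

    def scanline_rotations(gyro_t, gyro_w, frame_start_t, readout_ms, num_rows):
        """Estimate per-scanline camera rotation from high-rate gyro samples (sketch).

        gyro_t: (N,) gyro timestamps in seconds (e.g. ~200 Hz).
        gyro_w: (N, 3) angular velocities (roll, pitch, yaw) in rad/s.
        frame_start_t: timestamp of the top scanline's exposure, in seconds.
        readout_ms: rolling-shutter readout time from top to bottom scanline.
        num_rows: number of scanlines in the frame.

        Returns (num_rows, 3) integrated rotation angles, one per scanline.
        """
        dt = np.diff(gyro_t, prepend=gyro_t[0])
        angles = np.cumsum(gyro_w * dt[:, None], axis=0)          # (N, 3)
        row_t = frame_start_t + np.linspace(0.0, readout_ms * 1e-3, num_rows)
        return np.stack([np.interp(row_t, gyro_t, angles[:, k]) for k in range(3)],
                        axis=1)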
Motion Filtering
The motion filtering stage takes the real camera motion from motion analysis and creates the stabilized virtual camera motion. Note that we push the incoming frames into a queue to defer the processing. This enables us to look ahead at future camera motions, using machine learning to accurately predict the user’s intention. Lookahead filtering is not feasible for OIS or any mechanical stabilizer, which can only react to past or present motions. We will discuss this more below.

Frame Synthesis
At the final stage, we derive how the frame is transformed based on the real and virtual camera motions. To handle the rolling shutter distortion, we use multiple transformations for each frame. We split the input frame into a mesh and warp each part separately:
Left: The input video with mesh overlay. Right: The warped frame, and the red rectangle is the final stabilized output. Note how the non-rigid warping corrects the rolling shutter distortion.
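The sketch below illustrates the idea of applying a different transform per scanline; it only shifts whole rows horizontally, whereas the real pipeline warps a full 2D mesh, so treat it purely as an illustration.

    import numpy as np

    def warp_rows(frame, row_shift_px):
        """Apply a different horizontal shift to each scanline (sketch).

        frame: (H, W, 3) image.
        row_shift_px: (H,) horizontal correction per row, derived from the real
                      vs. virtual camera motion at that scanline's timestamp.
        """
        h, w, _ = frame.shape
        out = np.zeros_like(frame)
        cols = np.arange(w)
        for r in range(h):
            src = np.clip(cols + int(round(row_shift_px[r])), 0, w - 1)
            out[r] = frame[r, src]
        return out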
Lookahead Motion Filtering
One key feature in the Fused Video Stabilization is our new lookahead filtering algorithm. It analyzes future motions to recognize the user-intended motion patterns, and creates a smooth virtual camera motion. The lookahead filtering has multiple stages to incrementally improve the virtual camera motion for each frame. In the first step, a Gaussian filtering is applied on the real camera motions of both past and future to obtain a smoothed camera motion:
Left: The input unstabilized video. Right: The smoothed result after Gaussian filtering.
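For intuition, here is a minimal sketch of this first pass: a centered Gaussian filter over a buffer of past and future (lookahead) camera rotations. The window size and sigma are illustrative, and the later prediction and refinement stages are not shown.

    import numpy as np

    def gaussian_lookahead(real_angles, frame_idx, lookahead=15, sigma=5.0):
        """First-pass virtual camera angle via centered Gaussian filtering (sketch).

        real_angles: (T, 3) real camera rotations for the frames buffered so far,
                     including up to `lookahead` future frames queued after
                     `frame_idx`.
        Returns the smoothed (3,) rotation for `frame_idx`.
        """
        lo = max(frame_idx - lookahead, 0)
        hi = min(frame_idx + lookahead + 1, len(real_angles))
        offsets = np.arange(lo, hi) - frame_idx
        weights = np.exp(-0.5 * (offsets / sigma) ** 2)
        weights /= weights.sum()
        return (weights[:, None] * real_angles[lo:hi]).sum(axis=0)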
You’ll notice that it’s still not very stable. To further improve the quality, we trained a model to extract intentional motions from the noisy real camera motions. We then apply additional filters given the predicted motion. For example, if we predict the camera is panning horizontally, we would reject more vertical motions. The result is shown below.
Left: The Gaussian filtered result. Right: Our lookahead result. We predict that the user is panning to the right, and suppress more vertical motions.
In practice, the process above does not guarantee that there are no undefined “bad” regions, which can appear when the virtual camera is too stabilized and the warped frame falls outside the original field of view. We predict the likelihood of this issue in the next couple of frames and adjust the virtual camera motion to get the final result.
Left: Our lookahead result. The undefined area at the bottom-left are shown in cyan. Right: The final result with the bad region removed.
As we mentioned earlier, even with OIS enabled, the motions are sometimes too large and cause motion blur within a single frame. When EIS is then applied to further smooth the camera motion, the motion blur leads to distracting sharpness variations:
Left: Pixel 2 with OIS only. Right: Pixel 2 with the basic Fused Video Stabilization. Note the sharpness variation around the “Exit” label.
This is a very common problem in EIS solutions. To address this issue, we exploit the “masking” property in the human visual system. Motion blur usually blurs the frame along a specific direction, and if the overall frame motion follows that direction, the human eye will not notice it. Instead, our brain treats the blur as a natural part of the motion, and masks it away from our perception.

With the high-frequency gyroscope and OIS signals, we can accurately estimate the motion blur for each frame. We compute where the camera pointed to at both the beginning and end of exposure, and the movement in-between is the motion blur. After that, we apply a machine learning algorithm (trained on a set of videos with and without motion blur) to map the motion blurs in past and future frames to the amount of real camera motion we want to keep, and blend the weighted real camera motion with the virtual one. As you can see below, with the motion blur masking, the distracting sharpness variation is greatly reduced and the camera motion is still stabilized.
Left: Pixel 2 with the basic Fused Video Stabilization. Right: The full Fused Video Stabilization solution with motion blur masking.
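Here is a rough sketch of the masking idea: estimate per-frame blur from the rotation during exposure and map it to how much of the real camera motion to keep. The hand-tuned ramp stands in for the learned mapping described above, and all thresholds and names are illustrative.

    import numpy as np

    def estimate_blur_px(angle_start, angle_end, focal_px):
        """Approximate image-space motion blur as the rotation during exposure
        times the focal length in pixels (small-angle assumption)."""
        return focal_px * np.linalg.norm(np.asarray(angle_end) - np.asarray(angle_start))

    def blur_masking_weight(blur_px, low=2.0, high=12.0):
        """Map estimated blur (pixels) to how much real motion to keep.

        Little blur -> fully stabilized (weight 0); heavy blur -> follow the real
        motion so the blur is 'masked' by matching frame motion (weight -> 1).
        """
        return float(np.clip((blur_px - low) / (high - low), 0.0, 1.0))

    # Blend: the virtual camera motion is pulled toward the real one when blur is
    # large (a0/a1 are orientations at exposure start/end; all names hypothetical).
    # w = blur_masking_weight(estimate_blur_px(a0, a1, focal_px))
    # blended_angle = (1 - w) * virtual_angle + w * real_angle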
Results
We have seen many amazing videos from Pixel 2 with Fused Video Stabilization. Here are some for you to check out:
Videos taken by two Pixel 2 phones mounted on a single hand grip. Fused Video Stabilization is disabled in the left one.
Videos taken by two Pixel 2 phones mounted on a single hand grip. Fused Video Stabilization is disabled in the left one. Note that the videographer jumped together with the subject.
Fused Video Stabilization combines the best of OIS and EIS, shows great results in camera motion smoothing and motion blur reduction, and corrects both rolling shutter and focus breathing. With Fused Video Stabilization on the Pixel 2 and Pixel 2 XL, you no longer have to carefully place the phone before recording, hold it firmly over the entire recording session, or carry a gimbal mount everywhere. The recorded video will always be stable, sharp, and ready to share.

Acknowledgements
Fused Video Stabilization is a large-scale effort across multiple teams in Google, including the camera algorithm team, sensor algorithm team, camera hardware team, and sensor hardware team.

Portrait mode on the Pixel 2 and Pixel 2 XL smartphones



Portrait mode, a major feature of the new Pixel 2 and Pixel 2 XL smartphones, allows anyone to take professional-looking shallow depth-of-field images. This feature helped both devices earn DxO's highest mobile camera ranking, and works with both the rear-facing and front-facing cameras, even though neither is dual-camera (normally required to obtain this effect). Today we discuss the machine learning and computational photography techniques behind this feature.
HDR+ picture without (left) and with (right) portrait mode. Note how portrait mode’s synthetic shallow depth of field helps suppress the cluttered background and focus attention on the main subject. Photo by Matt Jones
What is a shallow depth-of-field image?
A single-lens reflex (SLR) camera with a big lens has a shallow depth of field, meaning that objects at one distance from the camera are sharp, while objects in front of or behind that "in-focus plane" are blurry. Shallow depth of field is a good way to draw the viewer's attention to a subject, or to suppress a cluttered background. Shallow depth of field is what gives portraits captured using SLRs their characteristic artistic look.

The amount of blur in a shallow depth-of-field image depends on depth; the farther objects are from the in-focus plane, the blurrier they appear. The amount of blur also depends on the size of the lens opening. A 50mm lens with an f/2.0 aperture has an opening 50mm/2 = 25mm in diameter. With such a lens, objects that are even a few inches away from the in-focus plane will appear soft.
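For intuition, the standard thin-lens approximation lets you put numbers on this; the snippet below is a generic optics calculation, not anything specific to the Pixel pipeline.

    # Thin-lens defocus, for intuition only: diameter of the blur circle on the
    # sensor for a point at distance s2 when the lens is focused at distance s1.
    def blur_circle_mm(f_mm, f_number, s1_mm, s2_mm):
        aperture_mm = f_mm / f_number                     # e.g. 50 mm / 2.0 = 25 mm
        return aperture_mm * f_mm * abs(s2_mm - s1_mm) / (s2_mm * (s1_mm - f_mm))

    # 50 mm f/2 lens focused at 1.5 m; a point 10 cm behind the subject:
    print(blur_circle_mm(50.0, 2.0, 1500.0, 1600.0))      # ~0.054 mm, visibly soft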

One other parameter worth knowing about depth of field is the shape taken on by blurred points of light. This shape is called bokeh, and it depends on the physical structure of the lens's aperture. Is the bokeh circular? Or is it a hexagon, due to the six metal leaves that form the aperture inside some lenses? Photographers debate tirelessly about what constitutes good or bad bokeh.

Synthetic shallow depth of field images
Unlike SLR cameras, mobile phone cameras have a small, fixed-size aperture, which produces pictures with everything more or less in focus. But if we knew the distance from the camera to points in the scene, we could replace each pixel in the picture with a blur. This blur would be an average of the pixel's color with its neighbors, where the amount of blur depends on distance of that scene point from the in-focus plane. We could also control the shape of this blur, meaning the bokeh.
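Here is a minimal sketch of this idea, using a quantized depth map and disk-shaped kernels; a production renderer composites in depth order and treats occlusion boundaries far more carefully, and the constants below are invented for illustration.

    import numpy as np
    from scipy.ndimage import convolve

    def disk_kernel(radius):
        """Disk-shaped kernel, which produces circular bokeh when used for blurring."""
        r = max(int(radius), 1)
        y, x = np.mgrid[-r:r + 1, -r:r + 1]
        k = (x * x + y * y <= r * r).astype(np.float32)
        return k / k.sum()

    def synthetic_defocus(img, depth, focus_depth, blur_per_meter=8.0, num_layers=6):
        """Depth-dependent blur (sketch): quantize depth into layers, blur each
        layer with a disk whose radius grows with distance from the in-focus
        plane, and copy the blurred pixels back into place.

        img:   (H, W, 3) float32 image in [0, 1].
        depth: (H, W) float32 depth in meters.
        """
        out = np.zeros_like(img)
        edges = np.linspace(depth.min(), depth.max(), num_layers + 1)
        for i in range(num_layers):
            layer = (depth >= edges[i]) & (depth <= edges[i + 1])
            if not layer.any():
                continue
            mid = 0.5 * (edges[i] + edges[i + 1])
            radius = blur_per_meter * abs(mid - focus_depth)
            if radius < 1.0:
                out[layer] = img[layer]          # in-focus layer stays sharp
                continue
            blurred = convolve(img, disk_kernel(radius)[..., None], mode="nearest")
            out[layer] = blurred[layer]
        return out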

How can a cell phone estimate the distance to every point in the scene? The most common method is to place two cameras close to one another – so-called dual-camera phones. Then, for each patch in the left camera's image, we look for a matching patch in the right camera's image. The position in the two images where this match is found gives the depth of that scene feature through a process of triangulation. This search for matching features is called a stereo algorithm, and it works pretty much the same way our two eyes do.
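As a toy illustration of such a stereo search (not the algorithm used on the phone), here is brute-force SSD matching of a single patch along a scanline, followed by the usual disparity-to-depth relation; patch size and search range are arbitrary.

    import numpy as np

    def patch_disparity(left, right, y, x, patch=7, max_disp=32):
        """Find the horizontal disparity of one patch by brute-force SSD matching
        (sketch). left/right are (H, W) grayscale images rectified so matches lie
        on the same row; y, x index the patch center in the left image."""
        h = patch // 2
        ref = left[y - h:y + h + 1, x - h:x + h + 1].astype(np.float32)
        best_d, best_cost = 0, np.inf
        for d in range(max_disp + 1):
            if x - h - d < 0:
                break
            cand = right[y - h:y + h + 1, x - h - d:x + h + 1 - d].astype(np.float32)
            cost = np.sum((ref - cand) ** 2)
            if cost < best_cost:
                best_d, best_cost = d, cost
        return best_d

    # Triangulation: depth is inversely proportional to disparity.
    # depth_m = focal_length_px * baseline_m / max(disparity_px, 1e-6)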

A simpler version of this idea, used by some single-camera smartphone apps, involves separating the image into two layers – pixels that are part of the foreground (typically a person) and pixels that are part of the background. This separation, sometimes called semantic segmentation, lets you blur the background, but it has no notion of depth, so it can't tell you how much to blur it. Also, if there is an object in front of the person, i.e. very close to the camera, it won't be blurred out, even though a real camera would do this.

Whether done using stereo or segmentation, artificially blurring pixels that belong to the background is called synthetic shallow depth of field or synthetic background defocusing. Synthetic defocus is not the same as the optical blur you would get from an SLR, but it looks similar to most people.

How portrait mode works on the Pixel 2
The Google Pixel 2 offers portrait mode on both its rear-facing and front-facing cameras. For the front-facing (selfie) camera, it uses only segmentation. For the rear-facing camera it uses both stereo and segmentation. But wait, the Pixel 2 has only one rear facing camera; how can it see in stereo? Let's go through the process step by step.
Step 1: Generate an HDR+ image.
Portrait mode starts with a picture where everything is sharp. For this we use HDR+, Google's computational photography technique for improving the quality of captured photographs, which runs on all recent Nexus/Pixel phones. It operates by capturing a burst of images that are underexposed to avoid blowing out highlights, aligning and averaging these frames to reduce noise in the shadows, and boosting these shadows in a way that preserves local contrast while judiciously reducing global contrast. The result is a picture with high dynamic range, low noise, and sharp details, even in dim lighting.
The idea of aligning and averaging frames to reduce noise has been known in astrophotography for decades. Google's implementation is a bit different, because we do it on bursts captured by a handheld camera, and we need to be careful not to produce ghosts (double images) if the photographer is not steady or if objects in the scene move. Below is an example of a scene with high dynamic range, captured using HDR+.
Photographs from the Pixel 2 without (left) and with (right) HDR+ enabled.
Notice how HDR+ avoids blowing out the sky and courtyard while retaining detail in the dark arcade ceiling.
Photo by Marc Levoy
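To make the align-and-average idea above concrete, here is a toy sketch that aligns each frame of a burst to the first with a single global shift and then averages. HDR+ itself aligns tiles on raw frames and rejects ghosts, which this sketch does not attempt; it only illustrates why averaging N aligned frames reduces noise by roughly the square root of N.

    import numpy as np

    def align_and_average(burst, search=8):
        """Merge a burst by aligning each frame to the first with a global integer
        shift, then averaging (sketch).

        burst: list of (H, W) float32 grayscale frames.
        """
        ref = burst[0]
        acc, count = ref.copy(), 1
        for frame in burst[1:]:
            best, best_err = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    shifted = np.roll(np.roll(frame, dy, axis=0), dx, axis=1)
                    err = np.mean((shifted[search:-search, search:-search] -
                                   ref[search:-search, search:-search]) ** 2)
                    if err < best_err:
                        best, best_err = (dy, dx), err
            acc += np.roll(np.roll(frame, best[0], axis=0), best[1], axis=1)
            count += 1
        return acc / count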
Step 2:  Machine learning-based foreground-background segmentation.
Starting from an HDR+ picture, we next decide which pixels belong to the foreground (typically a person) and which belong to the background. This is a tricky problem, because unlike chroma keying (a.k.a. green-screening) in the movie industry, we can't assume that the background is green (or blue, or any other color). Instead, we apply machine learning.
In particular, we have trained a neural network, written in TensorFlow, that looks at the picture, and produces an estimate of which pixels are people and which aren't. The specific network we use is a convolutional neural network (CNN) with skip connections. "Convolutional" means that the learned components of the network are in the form of filters (a weighted sum of the neighbors around each pixel), so you can think of the network as just filtering the image, then filtering the filtered image, etc. The "skip connections" allow information to easily flow from the early stages in the network where it reasons about low-level features (color and edges) up to later stages of the network where it reasons about high-level features (faces and body parts). Combining stages like this is important when you need to not just determine if a photo has a person in it, but to identify exactly which pixels belong to that person. Our CNN was trained on almost a million pictures of people (and their hats, sunglasses, and ice cream cones). Inference to produce the mask runs on the phone using TensorFlow Mobile. Here’s an example:
At left is a picture produced by our HDR+ pipeline, and at right is the smoothed output of our neural network. White parts of this mask are thought by the network to be part of the foreground, and black parts are thought to be background.
Photo by Sam Kweskin
How good is this mask? Not too bad; our neural network recognizes the woman's hair and her teacup as being part of the foreground, so it can keep them sharp. If we blur the photograph based on this mask, we would produce this image:
Synthetic shallow depth-of-field image generated using a mask.
There are several things to notice about this result. First, the amount of blur is uniform, even though the background contains objects at varying depths. Second, an SLR would also blur out the pastry on her plate (and the plate itself), since it's close to the camera. Our neural network knows the pastry isn't part of her (note that it's black in the mask image), but being below her it’s not likely to be part of the background. We explicitly detect this situation and keep these pixels relatively sharp. Unfortunately, this solution isn’t always correct, and in this situation we should have blurred these pixels more.
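For readers who want a concrete picture of "a CNN with skip connections", here is a toy encoder-decoder in Keras. The depth, widths, and training of the real network are different, so treat this purely as an illustration of the skip-connection pattern described above.

    import tensorflow as tf
    from tensorflow.keras import layers

    def toy_person_segmenter(input_shape=(256, 256, 3)):
        """Toy encoder-decoder with skip connections (for illustration only)."""
        inp = layers.Input(shape=input_shape)
        e1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)   # low-level: color, edges
        p1 = layers.MaxPooling2D()(e1)
        e2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
        p2 = layers.MaxPooling2D()(e2)
        b = layers.Conv2D(64, 3, padding="same", activation="relu")(p2)     # high-level features
        u2 = layers.UpSampling2D()(b)
        d2 = layers.Conv2D(32, 3, padding="same", activation="relu")(
            layers.Concatenate()([u2, e2]))                                  # skip connection
        u1 = layers.UpSampling2D()(d2)
        d1 = layers.Conv2D(16, 3, padding="same", activation="relu")(
            layers.Concatenate()([u1, e1]))                                  # skip connection
        out = layers.Conv2D(1, 1, activation="sigmoid")(d1)                  # per-pixel person probability
        return tf.keras.Model(inp, out)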
Step 3. From dual pixels to a depth map
To improve on this result, it helps to know the depth at each point in the scene. To compute depth we can use a stereo algorithm. The Pixel 2 doesn't have dual cameras, but it does have a technology called Phase-Detect Auto-Focus (PDAF) pixels, sometimes called dual-pixel autofocus (DPAF). That's a mouthful, but the idea is pretty simple. If one imagines splitting the (tiny) lens of the phone's rear-facing camera into two halves, the view of the world as seen through the left side of the lens and the view through the right side are slightly different. These two viewpoints are less than 1mm apart (roughly the diameter of the lens), but they're different enough to compute stereo and produce a depth map. The way the optics of the camera works, this is equivalent to splitting every pixel on the image sensor chip into two smaller side-by-side pixels and reading them from the chip separately, as shown here:
On the rear-facing camera of the Pixel 2, the right side of every pixel looks at the world through the left side of the lens, and the left side of every pixel looks at the world through the right side of the lens.
Figure by Markus Kohlpaintner, reproduced with permission.
As the diagram shows, PDAF pixels give you views through the left and right sides of the lens in a single snapshot. Or, if you're holding your phone in portrait orientation, then it's the upper and lower halves of the lens. Here's what the upper image and lower image look like for our example scene (below). These images are monochrome because we only use the green pixels of our Bayer color filter sensor in our stereo algorithm, not the red or blue pixels. Having trouble telling the two images apart? Maybe the animated gif at right (below) will help. Look closely; the differences are very small indeed!
Views of our test scene through the upper half and lower half of the lens of a Pixel 2. In the animated gif at right,
notice that she holds nearly still, because the camera is focused on her, while the background moves up and down.
Objects in front of her, if we could see any, would move down when the background moves up (and vice versa).
PDAF technology can be found in many cameras, including SLRs to help them focus faster when recording video. In our application, this technology is being used instead to compute a depth map.  Specifically, we use our left-side and right-side images (or top and bottom) as input to a stereo algorithm similar to that used in Google's Jump system panorama stitcher (called the Jump Assembler). This algorithm first performs subpixel-accurate tile-based alignment to produce a low-resolution depth map, then interpolates it to high resolution using a bilateral solver. This is similar to the technology formerly used in Google's Lens Blur feature.    
One more detail: because the left-side and right-side views captured by the Pixel 2 camera are so close together, the depth information we get is inaccurate, especially in low light, due to the high noise in the images. To reduce this noise and improve depth accuracy we capture a burst of left-side and right-side images, then align and average them before applying our stereo algorithm. Of course we need to be careful during this step to avoid wrong matches, just as in HDR+, or we'll get ghosts in our depth map (but that's the subject of another blog post). On the left below is a depth map generated from the example shown above using our stereo algorithm.
Left: depth map computed using stereo from the foregoing upper-half-of-lens and lower-half-of-lens images. Lighter means closer to the camera. 
Right: visualization of how much blur we apply to each pixel in the original. Black means don't blur at all, red denotes scene features behind the in-focus plane (which is her face), the brighter the red the more we blur, and blue denotes features in front of the in-focus plane (the pastry).
Step 4. Putting it all together to render the final image
The last step is to combine the segmentation mask we computed in step 2 with the depth map we computed in step 3 to decide how much to blur each pixel in the HDR+ picture from step 1.  The way we combine the depth and mask is a bit of secret sauce, but the rough idea is that we want scene features we think belong to a person (white parts of the mask) to stay sharp, and features we think belong to the background (black parts of the mask) to be blurred in proportion to how far they are from the in-focus plane, where these distances are taken from the depth map. The red-colored image above is a visualization of how much to blur each pixel.
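Since the exact blend is not public, here is one plausible sketch of combining the mask and the depth map into a per-pixel blur radius; the gain and clipping values are invented for illustration.

    import numpy as np

    def blur_radius_map(mask, depth, focus_depth, gain=6.0, max_radius=20.0):
        """One plausible way to combine the person mask and the depth map into a
        per-pixel blur radius (the blend used on the phone is not public).

        mask:  (H, W) in [0, 1], 1 = person (keep sharp).
        depth: (H, W) metric or relative depth.
        """
        defocus = gain * np.abs(depth - focus_depth)   # blur grows away from the in-focus plane
        radius = (1.0 - mask) * defocus                # person pixels stay sharp
        return np.clip(radius, 0.0, max_radius)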
Actually applying the blur is conceptually the simplest part; each pixel is replaced with a translucent disk of the same color but varying size. If we composite all these disks in depth order, it's like the averaging we described earlier, and we get a nice approximation to real optical blur. One of the benefits of defocusing synthetically is that because we're using software, we can get a perfect disk-shaped bokeh without lugging around several pounds of glass camera lenses. Interestingly, in software there's no particular reason we need to stick to realism; we could make the bokeh shape anything we want! For our example scene, here is the final portrait mode output. If you compare this result to the rightmost result in step 2, you'll see that the pastry is now slightly blurred, much as you would expect from an SLR.
Final synthetic shallow depth-of-field image, generated by combining our HDR+ picture, segmentation mask, and depth map.
Ways to use portrait mode
Portrait mode on the Pixel 2 runs in 4 seconds, is fully automatic (as opposed to Lens Blur mode on previous devices, which required a special up-down motion of the phone), and is robust enough to be used by non-experts. Here is an album of examples, including some hard cases, like people with frizzy hair, people holding flower bouquets, etc. Below is a list of a few ways you can use Portrait Mode on the new Pixel 2.
Taking macro shots
If you're in portrait mode and you point the camera at a small object instead of a person (like a flower or food), then our neural network can't find a face and won't produce a useful segmentation mask. In other words, step 2 of our pipeline doesn't apply. Fortunately, we still have a depth map from PDAF data (step 3), so we can compute a shallow depth-of-field image based on the depth map alone. Because the baseline between the left and right sides of the lens is so small, this works well only for objects that are roughly less than a meter away. But for such scenes it produces nice pictures. You can think of this as a synthetic macro mode. Below are example straight and portrait mode shots of a macro-sized object, and here's an album with more macro shots, including more hard cases, like a water fountain with a thin wire fence behind it. Just be careful not to get too close; the Pixel 2 can’t focus sharply on objects closer than about 10cm from the camera.
Macro picture without (left) and with (right) portrait mode. There’s no person here, so background pixels are identified solely using the depth map. Photo by Marc Levoy
The selfie camera
The Pixel 2 offers portrait mode on the front-facing (selfie) as well as rear-facing camera. This camera is 8Mpix instead of 12Mpix, and it doesn't have PDAF pixels, meaning that its pixels aren't split into left and right halves. In this case, step 3 of our pipeline doesn't apply, but if we can find a face, then we can still use our neural network (step 2) to produce a segmentation mask. This allows us to still generate a shallow depth-of-field image, but because we don't know how far away objects are, we can't vary the amount of blur with depth. Nevertheless, the effect looks pretty good, especially for selfies shot against a cluttered background, where blurring helps suppress the clutter. Here are example straight and portrait mode selfies taken with the Pixel 2's selfie camera:
Selfie without (left) and with (right) portrait mode. The front-facing camera lacks PDAF pixels, so background pixels are identified using only machine learning. Photo by Marc Levoy

How To Get the Most Out of Portrait Mode
The portraits produced by the Pixel 2 depend on the underlying HDR+ image, segmentation mask, and depth map; problems in these inputs can produce artifacts in the result. For example, if a feature is overexposed in the HDR+ image (blown out to white), then it's unlikely the left-half and right-half images will have useful information in them, leading to errors in the depth map. What can go wrong with segmentation? It's a neural network, which has been trained on nearly a million images, but we bet it has never seen a photograph of a person kissing a crocodile, so it will probably omit the crocodile from the mask, causing it to be blurred out. How about the depth map? Our stereo algorithm may fail on textureless regions (like blank walls), because there are no features for it to latch onto, and on repeating textures (like plaid shirts) or horizontal and vertical lines, because it might match the wrong part of the image, triangulating to the wrong depth.

While any complex technology includes tradeoffs, here are some tips for producing great portrait mode shots:
  • Stand close enough to your subjects that their head (or head and shoulders) fill the frame.
  • For a group shot where you want everyone sharp, place them at the same distance from the camera.
  • For a more pleasing blur, put some distance between your subjects and the background.
  • Remove dark sunglasses, floppy hats, giant scarves, and crocodiles.
  • For macro shots, tap to focus to ensure that the object you care about stays sharp.
By the way, you'll notice that in portrait mode the camera zooms a bit (1.5x for the rear-facing camera, and 1.2x for the selfie camera). This is deliberate, because narrower fields of view encourage you to stand back further, which in turn reduces perspective distortion, leading to better portraits.

Is it time to put aside your SLR (forever)?
When we started working at Google 5 years ago, the number of pixels in a cell phone picture hadn't caught up to SLRs, but it was high enough for most people's needs. Even on a big home computer screen, you couldn't see the individual pixels in pictures you took using your cell phone. Nevertheless, mobile phone cameras weren't as powerful as SLRs, in four ways:
  1. Dynamic range in bright scenes (blown-out skies)
  2. Signal-to-noise ratio (SNR) in low light (noisy pictures, loss of detail)
  3. Zoom (for those wildlife shots)
  4. Shallow depth of field
Google's HDR+ and similar technologies by our competitors have made great strides on #1 and #2. In fact, in challenging lighting we'll often put away our SLRs, because we can get a better picture from a phone without painful bracketing and post-processing. For zoom, the modest telephoto lenses being added to some smartphones (typically 2x) help, but for that grizzly bear in the streambed there's no substitute for a 400mm lens (much safer too!). For shallow depth-of-field, synthetic defocusing is not the same as real optical defocusing, but the visual effect is similar enough to achieve the same goal, of directing your attention towards the main subject.

Will SLRs (or their mirrorless interchangeable lens (MIL) cousins) with big sensors and big lenses disappear? Doubtful, but they will occupy a smaller niche in the market. Both of us travel with a big camera and a Pixel 2. At the beginning of our trips we dutifully take out our SLRs, but by the end they mostly stay in our luggage. Welcome to the new world of software-defined cameras and computational photography!
For more about portrait mode on the Pixel 2, check out this video by Nat & Friends.
Here is another album of pictures (portrait and not) and videos taken by the Pixel 2.

Making Visible Watermarks More Effective



Whether you are a photographer, a marketing manager, or a regular Internet user, chances are you have encountered visible watermarks many times. Visible watermarks are those logos and patterns that are often overlaid on digital images provided by stock photography websites, marking the image owners while allowing viewers to perceive the underlying content so that they could license the images that fit their needs. It is the most common mechanism for protecting the copyrights of hundreds of millions of photographs and stock images that are offered online daily.

It’s standard practice to use watermarks on the assumption that they prevent consumers from accessing the clean images, ensuring there will be no unauthorized or unlicensed use. However, in “On The Effectiveness Of Visible Watermarks” recently presented at the 2017 Computer Vision and Pattern Recognition Conference (CVPR 2017), we show that a computer algorithm can get past this protection and remove watermarks automatically, giving users unobstructed access to the clean images the watermarks are intended to protect.
Left: example watermarked images from popular stock photography websites. Right: watermark-free version of the images on the left, produced automatically by a computer algorithm. More results are available below and on our project page. Image sources: Adobe Stock, 123RF.
As is often done with vulnerabilities discovered in operating systems, applications or protocols, we want to disclose this vulnerability and propose solutions in order to help the photography and stock image communities adapt and better protect their copyrighted content and creations. From our experiments, much of the world's stock imagery is currently susceptible to this circumvention. As such, in our paper we also propose ways to make visible watermarks more robust to such manipulations.
The Vulnerability of Visible Watermarks
Visible watermarks are often designed to contain complex structures such as thin lines and shadows in order to make them harder to remove. Indeed, given a single image, for a computer to detect automatically which visual structures belong to the watermark and which structures belong to the underlying image is extremely difficult. Manually, the task of removing a watermark from an image is tedious, and even with state-of-the-art editing tools it may take a Photoshop expert several minutes to remove a watermark from one image.

However, a fact that has been overlooked so far is that watermarks are typically added in a consistent manner to many images. We show that this consistency can be used to invert the watermarking process — that is, estimate the watermark image and its opacity, and recover the original, watermark-free image underneath. This can all be done automatically, without any user intervention, and by only observing watermarked image collections publicly available online.
The consistency of a watermark over many images makes it possible to remove it automatically at scale. Left: input collection marked by the same watermark, middle: computed watermark and its opacity, right: recovered, watermark-free images. Image sources: COCO dataset, Copyright logo.
The first step of this process is identifying which image structures are repeating in the collection. If a similar watermark is embedded in many images, the watermark becomes the signal in the collection and the images become the noise, and simple image operations can be used to pull out a rough estimation of the watermark pattern.
Watermark extraction with increasing number of images. Left: watermarked input images, Middle: median intensities over the input images (up to the input image shown), Right: the corresponding estimated (matted) watermark. All images licensed from 123RF.
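As a minimal illustration of such a simple operation, a per-pixel median over the collection already reveals the shared watermark; the full method goes further and jointly solves for the watermark image and its opacity, as described next.

    import numpy as np

    def rough_watermark_estimate(images):
        """Per-pixel median over many images carrying the same watermark (sketch).

        With enough images, the shared watermark dominates the median while the
        differing image content averages out.

        images: list of (H, W, 3) float32 images, watermark in the same position.
        """
        stack = np.stack(images, axis=0)      # (N, H, W, 3)
        return np.median(stack, axis=0)       # rough matted-watermark estimate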
This provides a rough (noisy) estimate of the matted watermark (the watermark image times its spatially varying opacity, i.e., alpha matte). To actually recover the image underneath the watermark, we need to know the watermark’s decomposition into its image and alpha matte components. For this, a multi-image optimization problem can be formed, which we call “multi-image matting” (an extension of the traditional, single image matting problem), where the watermark (“foreground”) is separated into its image and opacity components while reconstructing a subset of clean (“background”) images. This optimization is able to produce very accurate estimations of the watermark components already from hundreds of images, and can deal with most watermarks used in practice, including ones containing thin structures, shadows or color gradients (as long as the watermarks are semi-transparent). Once the watermark pattern is recovered, it can be efficiently removed from any image marked by it.

Here are some more results, showing the estimated watermarks and example watermark-free results generated for several popular stock image services. We show many more results in our supplementary material on the project page.
Left column: Watermark estimated automatically from watermarked images online (rendered on a gray background). Middle column: Input watermarked image. Right column: Result after automatic watermark removal. Image sources: Adobe Stock, Can Stock Photo, 123RF, Fotolia.
Making Watermarks More Effective
The vulnerability of current watermarking techniques lies in the consistency in watermarks across image collections. Therefore, to counter it, we need to introduce inconsistencies when embedding the watermark in each image. In our paper we looked at several types of inconsistencies and how they affect the techniques described above. We found for example that simply changing the watermark’s position randomly per image does not prevent removing the watermark, nor do small random changes in the watermark’s opacity. But we found that introducing random geometric perturbations to the watermark — warping it when embedding it in each image — improves its robustness. Interestingly, very subtle warping is already enough to generate watermarks that this technique cannot fully defeat.
Flipping between the original watermark and a slightly, randomly warped version of it that improves its robustness.
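Here is a rough sketch of embedding a watermark with a subtle, smooth random warp; the displacement magnitudes, smoothing parameters, and function names are illustrative, not the values studied in the paper.

    import numpy as np
    from scipy.ndimage import gaussian_filter, map_coordinates

    def embed_warped_watermark(image, wm_rgb, wm_alpha, max_shift=1.0, rng=None):
        """Embed a watermark after a subtle, smooth random warp (sketch).

        image, wm_rgb: (H, W, 3) float32 in [0, 1]; wm_alpha: (H, W) in [0, 1].
        max_shift: maximum displacement in pixels (sub-pixel warps already help).
        """
        rng = rng or np.random.default_rng()
        h, w = wm_alpha.shape
        # Smooth random displacement field, at most `max_shift` pixels in magnitude.
        dx = gaussian_filter(rng.standard_normal((h, w)), sigma=10)
        dy = gaussian_filter(rng.standard_normal((h, w)), sigma=10)
        dx *= max_shift / (np.abs(dx).max() + 1e-8)
        dy *= max_shift / (np.abs(dy).max() + 1e-8)
        yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
        coords = [yy + dy, xx + dx]
        alpha_w = map_coordinates(wm_alpha, coords, order=1, mode="nearest")
        rgb_w = np.stack([map_coordinates(wm_rgb[..., c], coords, order=1, mode="nearest")
                          for c in range(3)], axis=-1)
        # Standard alpha blend of the warped watermark onto the image.
        return (1.0 - alpha_w[..., None]) * image + alpha_w[..., None] * rgb_w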
This warping produces a watermarked image that is very similar to the original (top right in the following figure), yet now if an attempt is made to remove it, it leaves very visible artifacts (bottom right):
In a nutshell, the reason this works is that removing the randomly warped watermark from any single image requires additionally estimating the warp field that was applied to the watermark for that image — a task that is inherently more difficult. Therefore, even if the watermark pattern can be estimated in the presence of these random perturbations (which by itself is nontrivial), accurately removing it without any visible artifacts is far more challenging.

Here are some more results on the images from above when using subtle, randomly warped versions of the watermarks. Notice again how visible artifacts remain when trying to remove the watermark in this case, compared to the accurate reconstructions that are achievable with current, consistent watermarks. More results and a detailed analysis can be found in our paper and project page.
Left column: Watermarked image, using subtle, random warping of the watermark. Right Column: Watermark removal result.
This subtle random warping is only one type of randomization that can be introduced to make watermarks more effective. A nice feature of this solution is that it is simple to implement and already improves the robustness of the watermark to image-collection attacks, while at the same time being mostly imperceptible. If more visible changes to the watermark across the images are acceptable — for example, introducing larger shifts in the watermark or incorporating other random elements in it — they may lead to even better protection.

While we cannot guarantee that there will not be a way to break such randomized watermarking schemes in the future, we believe (and our experiments show) that randomization will make watermarked collection attacks fundamentally more difficult. We hope that these findings will be helpful for the photography and stock image communities.

Acknowledgements
The research described in this post was performed by Tali Dekel, Michael Rubinstein, Ce Liu and Bill Freeman. We thank Aaron Maschinot for narrating our video.