Tag Archives: Google Photos

Save photos from Gmail messages directly to Google Photos with a new “Save to Photos” button

What’s changing

Now when you get a photo attachment in a Gmail message, you can save it directly to Google Photos with a new "Save to Photos" button. You’ll see it next to the existing “Add to Drive” button on the attachment and while previewing the image attachment. Currently, this is only available for JPEG images.

Save to Photos Button from a Gmail attachment

Use the Save to Photos option while previewing an image in Gmail

Who’s impacted

End users

Why you’d use it

This new feature frees you from having to download photo attachments from Gmail messages in order to then manually back them up to Google Photos.

Getting started

  • Admins: There is no admin control for this feature.
  • End users: This feature will be ON by default. For an eligible photo, you can choose the "Save to Photos" button which is alongside a similar option to "Add to Drive." Visit the Help Center to learn more about how to upload files and folders to Google Drive.

Rollout pace

Rapid Release and Scheduled Release domains: Gradual rollout (up to 15 days for feature visibility) starting on May 26, 2021


Available to all Google Workspace customers, as well as G Suite Basic and Business customers. Also available to users with personal Google Accounts.


The Technology Behind Cinematic Photos

Looking at photos from the past can help people relive some of their most treasured moments. Last December we launched Cinematic photos, a new feature in Google Photos that aims to recapture the sense of immersion felt the moment a photo was taken, simulating camera motion and parallax by inferring 3D representations in an image. In this post, we take a look at the technology behind this process, and demonstrate how Cinematic photos can turn a single 2D photo from the past into a more immersive 3D animation.

Camera 3D model courtesy of Rick Reitano.
Depth Estimation
Like many recent computational photography features such as Portrait Mode and Augmented Reality (AR), Cinematic photos requires a depth map to provide information about the 3D structure of a scene. Typical techniques for computing depth on a smartphone rely on multi-view stereo, a geometric method that solves for the depth of objects in a scene by simultaneously capturing multiple photos from different viewpoints, where the distance between the cameras is known. In the Pixel phones, the views come from two cameras or dual-pixel sensors.

To enable Cinematic photos on existing pictures that were not taken in multi-view stereo, we trained a convolutional neural network with encoder-decoder architecture to predict a depth map from just a single RGB image. Using only one view, the model learned to estimate depth using monocular cues, such as the relative sizes of objects, linear perspective, defocus blur, etc.

Because monocular depth estimation datasets are typically designed for domains such as AR, robotics, and self-driving, they tend to emphasize street scenes or indoor room scenes instead of features more common in casual photography, like people, pets, and objects, which have different composition and framing. So, we created our own dataset for training the monocular depth model using photos captured on a custom 5-camera rig as well as another dataset of Portrait photos captured on Pixel 4. Both datasets included ground-truth depth from multi-view stereo that is critical for training a model.

Mixing several datasets in this way exposes the model to a larger variety of scenes and camera hardware, improving its predictions on photos in the wild. However, it also introduces new challenges, because the ground-truth depth from different datasets may differ from each other by an unknown scaling factor and shift. Fortunately, the Cinematic photo effect only needs the relative depths of objects in the scene, not the absolute depths. Thus we can combine datasets by using a scale-and-shift-invariant loss during training and then normalize the output of the model at inference.
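To make the idea of a scale-and-shift-invariant loss concrete, here is a minimal NumPy sketch. It is not the training code used for Cinematic photos: it assumes a simple least-squares alignment followed by a mean squared error, and the function names (`align_scale_shift`, `scale_shift_invariant_loss`, `normalize_depth`) are hypothetical.

```python
import numpy as np

def align_scale_shift(pred, target):
    """Least-squares scale s and shift t so that s * pred + t best matches target."""
    p = pred.ravel()
    g = target.ravel()
    design = np.stack([p, np.ones_like(p)], axis=1)  # columns: prediction, constant
    (s, t), *_ = np.linalg.lstsq(design, g, rcond=None)
    return float(s), float(t)

def scale_shift_invariant_loss(pred, target):
    """Mean squared error after removing the unknown scale and shift (assumed form)."""
    s, t = align_scale_shift(pred, target)
    return float(np.mean((s * pred + t - target) ** 2))

def normalize_depth(depth):
    """Rescale a predicted depth map to [0, 1], since only relative depth is needed."""
    lo, hi = depth.min(), depth.max()
    if hi == lo:
        return np.zeros_like(depth, dtype=float)
    return (depth - lo) / (hi - lo)
```

With this form, a prediction that differs from the ground truth only by a scale and an offset incurs essentially zero loss, which is what lets datasets with incompatible depth units be mixed.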

The Cinematic photo effect is particularly sensitive to the depth map’s accuracy at person boundaries. An error in the depth map can result in jarring artifacts in the final rendered effect. To mitigate this, we apply median filtering to improve the edges, and also infer segmentation masks of any people in the photo using a DeepLab segmentation model trained on the Open Images dataset. The masks are used to pull forward pixels of the depth map that were incorrectly predicted to be in the background.
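The two clean-up steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production pipeline: it assumes larger depth values mean farther away, uses a plain 3x3 median filter, and "pulls forward" masked pixels by clamping them to the person's median depth. Both function names are hypothetical.

```python
import numpy as np

def median_filter3(depth):
    """3x3 median filter with edge replication."""
    padded = np.pad(depth, 1, mode="edge")
    h, w = depth.shape
    # Collect the nine shifted views of the image and take the per-pixel median.
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.median(np.stack(windows, axis=0), axis=0)

def pull_person_forward(depth, person_mask):
    """Clamp masked pixels so none lies behind the person's median depth.

    Assumes larger depth values are farther from the camera.
    """
    if not person_mask.any():
        return depth.copy()
    person_depth = np.median(depth[person_mask])
    out = depth.copy()
    out[person_mask] = np.minimum(out[person_mask], person_depth)
    return out
```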

Camera Trajectory
There can be many degrees of freedom when animating a camera in a 3D scene, and our virtual camera setup is inspired by professional video camera rigs to create cinematic motion. Part of this is identifying the optimal pivot point for the virtual camera’s rotation in order to yield the best results by drawing one’s eye to the subject.

The first step in 3D scene reconstruction is to create a mesh by extruding the RGB image onto the depth map. By doing so, neighboring points in the mesh can have large depth differences. While this is not noticeable from the “face-on” view, the more the virtual camera is moved, the more likely it is to see polygons spanning large changes in depth. In the rendered output video, this will look like the input texture is stretched. The biggest challenge when animating the virtual camera is to find a trajectory that introduces parallax while minimizing these “stretchy” artifacts.

The parts of the mesh with large depth differences become more visible (red visualization) once the camera is away from the “face-on” view. In these areas, the photo appears to be stretched, which we call “stretchy artifacts”.
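One simple way to find where such stretching could appear is to flag neighboring pixels whose depths differ by more than some threshold. The sketch below does exactly that; the threshold and the function name `stretch_candidates` are assumptions for illustration, and the real system evaluates artifacts in the rendered animation rather than on the depth map alone.

```python
import numpy as np

def stretch_candidates(depth, threshold):
    """Boolean map of pixels that share a mesh edge with a large depth jump."""
    flagged = np.zeros(depth.shape, dtype=bool)
    dx = np.abs(np.diff(depth, axis=1)) > threshold  # horizontal neighbors
    dy = np.abs(np.diff(depth, axis=0)) > threshold  # vertical neighbors
    # Mark both endpoints of every edge that spans a large depth difference.
    flagged[:, :-1] |= dx
    flagged[:, 1:] |= dx
    flagged[:-1, :] |= dy
    flagged[1:, :] |= dy
    return flagged
```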

Because of the wide spectrum in user photos and their corresponding 3D reconstructions, it is not possible to share one trajectory across all animations. Instead, we define a loss function that captures how much of the stretchiness can be seen in the final animation, which allows us to optimize the camera parameters for each unique photo. Rather than counting the total number of pixels identified as artifacts, the loss function triggers more heavily in areas with a greater number of connected artifact pixels, which reflects a viewer’s tendency to more easily notice artifacts in these connected areas.
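The post does not give the exact form of this loss. As a hedged illustration, the sketch below penalizes each 4-connected group of artifact pixels by its size raised to a power greater than one, so that one large connected region costs more than the same number of scattered pixels. The exponent and the function names are assumptions.

```python
from collections import deque

def component_sizes(mask):
    """Sizes of 4-connected components of truthy cells in a 2D list of lists."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    sizes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited artifact pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            size = 0
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes

def connected_artifact_loss(mask, exponent=2.0):
    """Superlinear penalty per connected artifact region (assumed form)."""
    return sum(size ** exponent for size in component_sizes(mask))
```

For example, four isolated artifact pixels score 4 with the default exponent, while a single 2x2 block of four pixels scores 16.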

We use padded segmentation masks from a human pose network to divide the image into three different regions: head, body and background. The loss function is normalized inside each region before computing the final loss as a weighted sum of the normalized losses. Ideally, the generated output video would be free from artifacts, but in practice this is rare. Weighting the regions differently biases the optimization process toward trajectories that place artifacts in the background regions rather than near the image subject.

During the camera trajectory optimization, the goal is to select a path for the camera with the fewest noticeable artifacts. In these preview images, artifacts in the output are colored red, while the green and blue overlays visualize the different body regions.
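The region weighting can be sketched as a weighted sum of per-region averages. The weights used in the usage note are made-up values chosen only to show the effect, and `region_weighted_loss` is a hypothetical name; the actual per-region loss and weights are not given in the post.

```python
import numpy as np

def region_weighted_loss(artifact_map, region_masks, weights):
    """Weighted sum of per-region artifact losses, each normalized by region area.

    artifact_map: (H, W) array of per-pixel artifact scores.
    region_masks: dict of region name -> (H, W) boolean mask.
    weights: dict of region name -> weight (illustrative values only).
    """
    total = 0.0
    for name, mask in region_masks.items():
        area = mask.sum()
        if area == 0:
            continue  # skip regions that are absent from this photo
        total += weights[name] * artifact_map[mask].sum() / area
    return float(total)
```

With a head weight of 5 and a background weight of 1, for instance, the same artifact costs far more on the head than in the background, so an optimizer comparing candidate trajectories would favor the one that keeps the head clean.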

Framing the Scene
Generally, the reprojected 3D scene does not neatly fit into a rectangle with portrait orientation, so it was also necessary to frame the output with the right aspect ratio while still retaining the key parts of the input image. To accomplish this, we use a deep neural network that predicts per-pixel saliency of the full image. When framing the virtual camera in 3D, the model identifies and captures as many salient regions as possible while ensuring that the rendered mesh fully occupies every output video frame. This sometimes requires the model to shrink the camera's field of view.

Heatmap of the predicted per-pixel saliency. We want the creation to include as much of the salient regions as possible when framing the virtual camera.
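As a simplified 2D analogy for this framing step, the sketch below slides a fixed-size window over a saliency map and keeps the position that covers the most saliency. The real system frames a virtual camera in 3D, so this is only an illustration; `best_crop` is a hypothetical helper.

```python
import numpy as np

def best_crop(saliency, crop_h, crop_w):
    """Return (y, x, total) for the crop_h x crop_w window with the largest saliency sum."""
    h, w = saliency.shape
    if crop_h > h or crop_w > w:
        raise ValueError("crop is larger than the saliency map")
    # Integral image: integral[i, j] is the sum of saliency[:i, :j].
    integral = np.zeros((h + 1, w + 1))
    integral[1:, 1:] = saliency.cumsum(axis=0).cumsum(axis=1)
    sums = (integral[crop_h:, crop_w:] - integral[:-crop_h, crop_w:]
            - integral[crop_h:, :-crop_w] + integral[:-crop_h, :-crop_w])
    y, x = np.unravel_index(np.argmax(sums), sums.shape)
    return int(y), int(x), float(sums[y, x])
```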

Through Cinematic photos, we implemented a system of algorithms – with each ML model evaluated for fairness – that work together to allow users to relive their memories in a new way, and we are excited about future research and feature improvements. Now that you know how they are created, keep an eye open for automatically created Cinematic photos that may appear in your recent memories within the Google Photos app!

Cinematic Photos is the result of a collaboration between Google Research and Google Photos teams. Key contributors also include: Andre Le, Brian Curless, Cassidy Curtis, Ce Liu, Chun-po Wang, Daniel Jenstad, David Salesin, Dominik Kaeser, Gina Reynolds, Hao Xu, Huiwen Chang, Huizhong Chen, Jamie Aspinall, Janne Kontkanen, Matthew DuVall, Michael Kucera, Michael Milne, Mike Krainin, Mike Liu, Navin Sarma, Orly Liba, Peter Hedman, Rocky Cai, Ruirui Jiang, Steven Hickson, Tracy Gu, Tyler Zhu, Varun Jampani, Yuan Hao, Zhongli Ding.

Source: Google AI Blog

Portrait Light: Enhancing Portrait Lighting with Machine Learning

Professional portrait photographers are able to create compelling photographs by using specialized equipment, such as off-camera flashes and reflectors, and expert knowledge to capture just the right illumination of their subjects. In order to allow users to better emulate professional-looking portraits, we recently released Portrait Light, a new post-capture feature for the Pixel Camera and Google Photos apps that adds a simulated directional light source to portraits, with the directionality and intensity set to complement the lighting from the original photograph.

Example image with and without Portrait Light applied. Note how Portrait Light contours the face, adding dimensionality, volume, and visual interest.

In the Pixel Camera on Pixel 4, Pixel 4a, Pixel 4a (5G), and Pixel 5, Portrait Light is automatically applied post-capture to images in the default mode and to Night Sight photos that include people — just one person or even a small group. In Portrait Mode photographs, Portrait Light provides more dramatic lighting to accompany the shallow depth-of-field effect already applied, resulting in a studio-quality look. But because lighting can be a personal choice, Pixel users who shoot in Portrait Mode can manually re-position and adjust the brightness of the applied lighting within Google Photos to match their preference. For those running Google Photos on Pixel 2 or newer, this relighting capability is also available for many pre-existing portrait photographs.

Pixel users can adjust a portrait’s lighting as they like in Google Photos, after capture.

Today we present the technology behind Portrait Light. Inspired by the off-camera lights used by portrait photographers, Portrait Light models a repositionable light source that can be added into the scene, with the initial lighting direction and intensity automatically selected to complement the existing lighting in the photo. We accomplish this by leveraging novel machine learning models, each trained using a diverse dataset of photographs captured in the Light Stage computational illumination system. These models enabled two new algorithmic capabilities:

  1. Automatic directional light placement: For a given portrait, the algorithm places a synthetic directional light in the scene consistent with how a photographer would have placed an off-camera light source in the real world.
  2. Synthetic post-capture relighting: For a given lighting direction and portrait, synthetic light is added in a way that looks realistic and natural.

These innovations enable Portrait Light to help create attractive lighting at any moment for every portrait — all on your mobile device.

Automatic Light Placement
Photographers usually rely on perceptual cues when deciding how to augment environmental illumination with off-camera light sources. They assess the intensity and directionality of the light falling on the face, and also adjust their subject’s head pose to complement it. To inform Portrait Light’s automatic light placement, we developed computational equivalents to these two perceptual signals.

First, we trained a novel machine learning model to estimate a high dynamic range, omnidirectional illumination profile for a scene based on an input portrait. This new lighting estimation model infers the direction, relative intensity, and color of all light sources in the scene coming from all directions, considering the face as a light probe. We also estimate the head pose of the portrait’s subject using MediaPipe Face Mesh.

Estimating the high dynamic range, omnidirectional illumination profile from an input portrait. The three spheres at the right of each image, diffuse (top), matte silver (middle), and mirror (bottom), are rendered using the estimated illumination, each reflecting the color, intensity, and directionality of the environmental lighting.

Using these clues, we determine the direction from which the synthetic lighting should originate. In studio portrait photography, the main off-camera light source, or key light, is placed about 30° above the eyeline and between 30° and 60° off the camera axis, when looking overhead at the scene. We follow this guideline for a classic portrait look, enhancing any pre-existing lighting directionality in the scene while targeting a balanced, subtle key-to-fill lighting ratio of about 2:1.
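The studio guideline above translates directly into a direction vector. The sketch below assumes a coordinate frame that is not specified in the post: the subject sits at the origin facing the camera along +z, with x to the right and y up. The function name and default azimuth are illustrative choices.

```python
import math

def key_light_direction(elevation_deg=30.0, azimuth_deg=45.0):
    """Unit vector from the subject toward the key light.

    elevation_deg: angle above the eyeline (about 30 degrees in the guideline).
    azimuth_deg: angle off the camera axis seen from overhead (30 to 60 degrees).
    Assumed frame: x right, y up, z from the subject toward the camera.
    """
    e = math.radians(elevation_deg)
    a = math.radians(azimuth_deg)
    return (math.cos(e) * math.sin(a), math.sin(e), math.cos(e) * math.cos(a))
```

A negative azimuth places the light on the other side of the camera, which is how one might follow whichever side the existing lighting already favors.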

Data-Driven Portrait Relighting
Given a desired lighting direction and portrait, we next trained a new machine learning model to add the illumination from a directional light source to the original photograph. Training the model required millions of pairs of portraits both with and without extra light. Photographing such a dataset in normal settings would have been impossible because it requires near-perfect registration of portraits captured across different lighting conditions.

Instead, we generated training data by photographing seventy different people using the Light Stage computational illumination system. This spherical lighting rig includes 64 cameras with different viewpoints and 331 individually-programmable LED light sources. We photographed each individual illuminated one-light-at-a-time (OLAT) by each light, which generates their reflectance field — or their appearance as illuminated by the discrete sections of the spherical environment. The reflectance field encodes the unique color and light-reflecting properties of the subject’s skin, hair, and clothing — how shiny or dull each material appears. Due to the superposition principle for light, these OLAT images can then be linearly added together to render realistic images of the subject as they would appear in any image-based lighting environment, with complex light transport phenomena like subsurface scattering correctly represented.

Using the Light Stage, we photographed many individuals with different face shapes, genders, skin tones, hairstyles, and clothing/accessories. For each person, we generated synthetic portraits in many different lighting environments, both with and without the added directional light, rendering millions of pairs of images. This dataset encouraged model performance across diverse lighting environments and individuals.

Photographing an individual as illuminated one-light-at-a-time in the Google Light Stage, a 360° computational illumination rig.
Left: Example images from an individual’s photographed reflectance field, their appearance in the Light Stage as illuminated one-light-at-a-time. Right: The images can be added together to form the appearance of the subject in any novel lighting environment.
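Because light adds linearly, relighting from OLAT images reduces to a weighted sum. The sketch below assumes the OLAT captures are stacked into one array and that the target lighting environment has already been sampled into one RGB weight per light; the array layout and the name `relight_from_olat` are assumptions.

```python
import numpy as np

def relight_from_olat(olat_images, light_weights):
    """Render a subject under new lighting as a linear combination of OLAT images.

    olat_images: (num_lights, H, W, 3) array, one image per individually lit LED.
    light_weights: (num_lights, 3) array, RGB intensity of the target environment
        in the direction of each light.
    """
    # Sum over lights, scaling each color channel by that light's weight.
    return np.einsum("lhwc,lc->hwc", olat_images, light_weights)
```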

Learning Detail-Preserving Relighting Using the Quotient Image
Rather than trying to directly predict the output relit image, we trained the relighting model to output a low-resolution quotient image, i.e., a per-pixel multiplier that when upsampled can be applied to the original input image to produce the desired output image with the contribution of the extra light source added. This technique is computationally efficient and encourages only low-frequency lighting changes, without impacting high-frequency image details, which are directly transferred from the input to maintain image quality.
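Applying a quotient image is a per-pixel multiplication after upsampling. The sketch below uses nearest-neighbor upsampling purely to stay dependency-free; a real pipeline would more likely use a smoother interpolation. It also assumes image values in [0, 1] and a full-resolution size that is an integer multiple of the quotient size.

```python
import numpy as np

def apply_quotient(image, quotient_lowres):
    """Multiply an image by an upsampled low-resolution quotient image.

    image: (H, W, 3) array with values in [0, 1].
    quotient_lowres: (h, w, 3) array of per-pixel multipliers, where H and W
        are integer multiples of h and w (an assumption of this sketch).
    """
    H, W = image.shape[:2]
    h, w = quotient_lowres.shape[:2]
    if H % h or W % w:
        raise ValueError("image size must be an integer multiple of the quotient size")
    # Nearest-neighbor upsampling: repeat each low-resolution cell.
    upsampled = np.repeat(np.repeat(quotient_lowres, H // h, axis=0), W // w, axis=1)
    return np.clip(image * upsampled, 0.0, 1.0)
```

Because the multiplier varies slowly across the image, fine detail such as hair and skin texture comes straight from the input photo.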

Supervising Relighting with Geometry Estimation
When photographers add an extra light source into a scene, its orientation relative to the subject’s facial geometry determines how much brighter each part of the face appears. To model the optical behavior of light sources reflecting off relatively matte surfaces, we first trained a machine learning model to estimate surface normals given the input photograph, and then applied Lambert’s law to compute a “light visibility map” for the desired lighting direction. We provided this light visibility map as input to the quotient image predictor, ensuring that the model is trained using physics-based insights.

The pipeline of our relighting network. Given an input portrait, we estimate per-pixel surface normals, which we then use to compute a light visibility map. The model is trained to produce a low-resolution quotient image that, when upsampled and applied as a multiplier to the original image, produces the original portrait with an extra light source added synthetically into the scene.
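Under Lambert's law, the brightness contributed by a directional light is proportional to the cosine of the angle between the surface normal and the light direction, clamped at zero for surfaces facing away. A minimal sketch of such a light visibility map, assuming unit-length normals and a hypothetical function name:

```python
import numpy as np

def light_visibility_map(normals, light_dir):
    """Per-pixel Lambertian term max(0, n . l).

    normals: (H, W, 3) array of unit surface normals.
    light_dir: length-3 vector pointing from the surface toward the light.
    """
    light = np.asarray(light_dir, dtype=float)
    light = light / np.linalg.norm(light)  # normalize so the result is a cosine
    return np.clip(normals @ light, 0.0, None)
```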

We optimized the full pipeline to run at interactive frame-rates on mobile devices, with total model size under 10 MB. Here are a few examples of Portrait Light in action.

Portrait Light in action.

Getting the Most Out of Portrait Light
You can try Portrait Light in the Pixel Camera and change the light position and brightness to your liking in Google Photos. For those who use Dual Exposure Controls, Portrait Light can be applied post-capture for additional creative flexibility to find just the right balance between light and shadow. On existing images from your Google Photos library, try it where faces are slightly underexposed, where Portrait Light can illuminate and highlight your subject. It will especially benefit images with a single individual posed directly at the camera.

We see Portrait Light as the first step on the journey towards creative post-capture lighting controls for mobile cameras, powered by machine learning.

Portrait Light is the result of a collaboration between Google Research, Google Daydream, Pixel, and Google Photos teams. Key contributors include: Yun-Ta Tsai, Rohit Pandey, Sean Fanello, Chloe LeGendre, Michael Milne, Ryan Geiss, Sam Hasinoff, Dillon Sharlet, Christoph Rhemann, Peter Denny, Kaiwen Guo, Philip Davidson, Jonathan Taylor, Mingsong Dou, Pavel Pidlypenskyi, Peter Lincoln, Jay Busch, Matt Whalen, Jason Dourgarian, Geoff Harvey, Cynthia Herrera, Sergio Orts Escolano, Paul Debevec, Jonathan Barron, Sofien Bouaziz, Clement Ng, Rachit Gupta, Jesse Evans, Ryan Campbell, Sonya Mollinger, Emily To, Yichang Shih, Jana Ehmann, Wan-Chun Alex Ma, Christina Tong, Tim Smith, Tim Ruddick, Bill Strathearn, Jose Lima, Chia-Kai Liang, David Salesin, Shahram Izadi, Navin Sarma, Nisha Masharani, Zachary Senzer.

1  Work conducted while at Google. 

Source: Google AI Blog

Updating Google Photos’ storage policy to build for the future

We launched Google Photos more than five years ago with the mission of being the home for your memories. What started as an app to manage your photos and videos has evolved into a place to reflect on meaningful moments in your life. Today, more than 4 trillion photos are stored in Google Photos, and every week 28 billion new photos and videos are uploaded. 

Since so many of you rely on Google Photos to store your memories, it’s important that it’s not just a great product, but also continues to meet your needs over the long haul. In order to welcome even more of your memories and build Google Photos for the future, we are changing our unlimited High quality storage policy. 

Starting June 1, 2021, any new photos and videos you upload will count toward the free 15 GB of storage that comes with every Google Account or the additional storage you’ve purchased as a Google One member. Your Google Account storage is shared across Drive, Gmail and Photos. This change also allows us to keep pace with the growing demand for storage. And, as always, we uphold our commitment to not use information in Google Photos for advertising purposes. We know this is a big shift and may come as a surprise, so we wanted to let you know well in advance and give you resources to make this easier. 

Existing High quality photos and videos are exempt from this change 

Any photos or videos you’ve uploaded in High quality before June 1, 2021 will not count toward your 15 GB of free storage. This means that photos and videos backed up before June 1, 2021 will still be considered free and exempt from the storage limit. You can verify your backup quality at any time in the Photos app by going to Back up & sync in Settings.

If you back up your photos and videos in Original quality, these changes do not affect you. As always, your Original quality photos and videos will continue to count toward your 15 GB of free storage across your Google Account. 

If you have a Pixel 1-5, photos uploaded from that device won’t be impacted. Photos and videos uploaded in High quality from that device will continue to be exempt from this change, even after June 1, 2021. 

There’s no action you need to take today

This change does not take effect for another six months, so you don’t need to do anything right now. And once this change does take effect on June 1, 2021, over 80 percent of you should still be able to store roughly three more years’ worth of memories with your free 15 GB of storage. As your storage nears 15 GB, we will notify you in the app and follow up by email. 

Understand and manage your quota

To understand how this impacts you, you can see a personalized estimate for how long your storage may last. This estimate takes into account how frequently you back up photos, videos and other content to your Google Account.
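The personalized estimate itself is not described beyond the inputs mentioned above. As a back-of-the-envelope illustration only, one could divide the remaining quota by an average monthly upload volume; the function name and the constant-rate assumption are hypothetical.

```python
def years_until_full(used_gb, upload_gb_per_month, quota_gb=15.0):
    """Rough years of storage left, assuming a constant monthly upload rate."""
    if upload_gb_per_month <= 0:
        return float("inf")  # nothing is being added, so the quota never fills
    remaining_gb = max(quota_gb - used_gb, 0.0)
    return remaining_gb / upload_gb_per_month / 12.0
```

For example, an account using 3 GB that adds 0.25 GB per month would have 12 GB left, or about four years at that pace.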

And in June 2021, you’ll be able to access a new free tool in the Photos app to easily manage your backed up photos and videos. This tool will help you review the memories you want to keep while also surfacing shots you might prefer to delete, like dark or blurry photos or large videos.

If you decide you want more space, you can always expand your storage through Google One. Plans start at $1.99 per month in the U.S. for 100 GB of storage and include additional member benefits like access to Google experts, shared family plans and more.


Thank you for using Google Photos and we hope to continue to be the home for your memories. You can learn more about this change in our Help Center.

Posted by Shimrit Ben-Yair, Vice President, Google Photos

An update to storage policies across your Google Account

Over the past decade, Gmail, Google Drive and Google Photos have helped billions of people securely store and manage their emails, documents, photos, videos and more. Today, people are uploading more content than ever before—in fact, more than 4.3 million GB are added across Gmail, Drive and Photos every day. 

To continue providing everyone with a great storage experience and to keep pace with the growing demand, we're announcing important upcoming storage changes to your Google Account. These changes will apply to Photos and Drive (specifically Google Docs, Sheets, Slides, Drawings, Forms and Jamboard files) and will enable us to continue investing in these products for the future. We're also introducing new policies for consumer Google Accounts that are either inactive or over their storage limit across Gmail, Drive (including Google Docs, Sheets, Slides, Drawings, Forms and Jamboard files) and Photos, to bring our policies more in line with industry standards. 

These storage policy changes won’t take effect until June 1, 2021. However, we wanted to let you know well in advance and give you the resources to navigate these changes. Google Workspace subscribers, and G Suite for Education and G Suite for Nonprofits customers should refer to our Google Workspace Updates post to understand how these changes may affect them.

As always, every Google Account will continue to come with 15 GB of free storage across Gmail, Drive and Photos, which we estimate should last the majority of our users several years.  Because the content you store with these apps is primarily personal, it’s not used for advertising purposes. We’ll also continue to give you visibility and control over your storage, and provide tools to help you easily manage it. 

New content that will count toward your Google Account storage

Beginning June 1, any new photo or video uploaded in High quality in Google Photos will count toward your free 15 GB storage quota or any additional storage you’ve purchased as a Google One member. To make this transition easier, we’ll exempt all High quality photos and videos you back up before June 1. This includes all of the High quality photos and videos you currently store with Google Photos. Most people who back up in High quality should have years before they need to take action—in fact, we estimate that 80 percent of you should have at least three years before you reach 15 GB. You can learn more about this change in our Google Photos post.

Also starting June 1, any new Docs, Sheets, Slides, Drawings, Forms or Jamboard file will begin counting toward your free 15 GB of allotted storage or any additional storage provided through Google One. Existing files within these products will not count toward storage, unless they’re modified on or after June 1. You can learn more in our Help Center.

A new policy for accounts that are inactive or over storage limit

We’re introducing new policies for consumer accounts that are either inactive or over their storage limit across Gmail, Drive (including Google Docs, Sheets, Slides, Drawings, Forms and Jamboard files) and/or Photos to better align with common practices across the industry. After June 1: 

  • If you're inactive in one or more of these services for two years (24 months), Google may delete the content in the product(s) in which you're inactive. 

  • Similarly, if you're over your storage limit for two years, Google may delete your content across Gmail, Drive and Photos.

We will notify you multiple times before we attempt to remove any content so you have ample opportunities to take action. The simplest way to keep your account active is to periodically visit Gmail, Drive or Photos on the web or mobile, while signed in and connected to the internet. 

The Inactive Account Manager can help you manage specific content and notify a trusted contact if you stop using your Google Account for a certain period of time (between 3 and 18 months). Note that the new two-year inactivity policy will apply regardless of your Inactive Account Manager settings. 

You can learn more about these changes in our Help Center.

How to manage your storage

To help you manage your Google Account storage, anyone can use the free storage manager in the Google One app and on the web, which gives you an easy way to see how you’re using your storage across Gmail, Drive and Photos. You can keep the files you want, delete the ones you no longer need and make room for more—all in one place.

In addition to helping us meet the growing demand for storage, these changes align our storage policies across products. As always, we remain committed to providing you a great experience and hope to continue to serve you in the future. You can learn more about this change in our Help Center.

Posted by Jose Pastor, Vice President, Google Workspace, and Shimrit Ben-Yair, Vice President, Google Photos

Changes to Google Workspace storage policies starting June 1, 2021

What’s changing

In 2021, we’ll make some changes to the way we store Google Photos, Docs, Sheets, Slides, Drawings, Forms, and Jamboard content that may impact your domain. Please see below for more details.

Google Photos
Starting June 1, 2021, any new photos or videos uploaded to Google Photos or Google Drive in High quality will count toward the storage limits for users in your domain. Currently, only photos and videos uploaded in Original quality count toward storage quotas. Please note that any photos or videos uploaded in High quality prior to June 1, 2021, will not be impacted by this change and will not count toward storage limits.

Google Docs, Sheets, Slides, Drawings, Forms, and Jamboard
Starting June 1, 2021, any newly created Google Docs, Sheets, Slides, Drawings, Forms, and Jamboard files will also count toward the storage limits for users in your domain. Existing files within these products will not count toward storage, unless they’re modified on or after June 1, 2021.

Who’s impacted

Admins and end users. Storage limits differ across Google Workspace and G Suite editions, but we estimate that the majority of users will not be affected by these changes. See “Getting Started” below for more information on determining how much storage each user in your organization is allotted.

Why it’s important

Over the past decade, Gmail, Google Drive, and Google Photos have helped billions of people securely store and manage their emails, documents, photos, videos and more. Today, people are uploading more content than ever before—in fact, more than 4.3 million GB are added across Gmail, Drive, and Photos every day. These changes to our storage policy are necessary to provide our users with a great experience and to keep pace with the growing demand.

Getting started

Rollout pace


  • These changes will apply to all customers with Google Workspace and G Suite licenses. 



Capturing Special Video Moments with Google Photos

Recording video of memorable moments to share with friends and loved ones has become commonplace. But as anyone with a sizable video library can tell you, it's a time-consuming task to go through all that raw footage searching for the perfect clips to relive or share. Google Photos makes this easier by automatically finding magical moments in your videos, like when your child blows out the candle or when your friend jumps into a pool, and creating animations from them that you can easily share with friends and family.

In "Rethinking the Faster R-CNN Architecture for Temporal Action Localization", we address some of the challenges behind automating this task, which stem from the complexity of identifying and categorizing actions in highly variable input data. We introduce an improved method to identify the exact location within a video where a given action occurs. Our temporal action localization network (TALNet) draws inspiration from advances in region-based object detection methods such as the Faster R-CNN network. TALNet can identify moments that vary widely in duration and achieves state-of-the-art performance compared to other methods, allowing Google Photos to recommend the best part of a video for you to share with friends and family.
An example of the detected action "blowing out candles"
Identifying Actions for Model Training
The first step in identifying magic moments in videos is to assemble a list of actions that people might wish to highlight. Some examples of actions include "blow out birthday candles", "strike (bowling)", "cat wags tail", etc. We then crowdsourced the annotation of segments within a collection of public videos where these specific actions occurred, in order to create a large training dataset. We asked the raters to find and label all moments, accommodating videos that might have several moments. This final annotated dataset was then used to train our model so that it could identify the desired actions in new, unknown videos.

Comparison to Object Detection
The challenge of recognizing these actions belongs to the field of computer vision known as temporal action localization, which, like the more familiar object detection, falls under the umbrella of visual detection problems. Given a long, untrimmed video as input, temporal action localization aims to identify the start and end times, as well as the action label (like "blowing out candles"), for each action instance in the full video. While object detection aims to produce spatial bounding boxes around an object in a 2D image, temporal action localization aims to produce temporal segments including an action in a 1D sequence of video frames.
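The analogy with object detection carries over to how predictions are compared with annotations: where 2D detection uses the intersection-over-union of boxes, temporal localization can use the same ratio on time intervals. A minimal sketch, with segments represented as hypothetical `(start, end)` tuples in seconds:

```python
def temporal_iou(a, b):
    """Intersection-over-union of two time segments given as (start, end)."""
    (a_start, a_end), (b_start, b_end) = a, b
    intersection = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - intersection
    return intersection / union if union > 0 else 0.0
```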

Our approach to TALNet is inspired by the Faster R-CNN object detection framework for 2D images. So, to understand TALNet, it is useful to first understand Faster R-CNN. The figure below demonstrates how the Faster R-CNN architecture is used for object detection. The first step is to generate a set of object proposals, regions of the image that can be used for classification. To do this, an input image is first converted into a 2D feature map by a convolutional neural network (CNN). The region proposal network then generates bounding boxes around candidate objects. These boxes are generated at multiple scales in order to capture the large variability in objects' sizes in natural images. With the object proposals now defined, the subjects in the bounding boxes are then classified by a deep neural network (DNN) into specific objects, such as "person", "bike", etc.
Faster R-CNN architecture for object detection
Temporal Action Localization
Temporal action localization is accomplished in a fashion similar to that used by Faster R-CNN. A sequence of input frames from a video is first converted into a 1D feature map that encodes scene context. This map is passed to a segment proposal network that generates candidate segments, each defined by start and end times. A DNN then applies the representations learned from the training dataset to classify the actions in the proposed video segments (e.g., "slam dunk", "pass", etc.). The actions identified in each segment are given weights according to their learned representations, with the top scoring moment selected to share with the user.
Architecture for temporal action localization
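As a rough illustration of the segment proposal step, the sketch below enumerates candidate segments of several lengths centered on each position of a 1D feature map, which is the 1D counterpart of multi-scale anchor boxes. The function name, the stride, and the scales are assumptions chosen for the example. A real proposal network would also score and refine these candidates.

```python
def anchor_segments(num_positions, stride_s, scales_s, video_length_s):
    """Return (start, end) candidates, in seconds, centered on every feature-map position."""
    anchors = []
    for i in range(num_positions):
        center = (i + 0.5) * stride_s  # time at the middle of position i
        for scale in scales_s:
            # Clip each candidate to the bounds of the video.
            start = max(0.0, center - scale / 2)
            end = min(video_length_s, center + scale / 2)
            anchors.append((start, end))
    return anchors


# Hypothetical 4-second video with one feature vector per second.
anchors = anchor_segments(
    num_positions=4, stride_s=1.0, scales_s=[1.0, 2.0, 4.0], video_length_s=4.0
)
print(len(anchors))   # 12
print(anchors[:3])    # [(0.0, 1.0), (0.0, 1.5), (0.0, 2.5)]
```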
Special Considerations for Temporal Action Localization
While temporal action localization can be viewed as the 1D counterpart of the object detection problem, care must be taken to address a number of issues unique to action localization. In particular, we address three specific issues in order to apply the Faster R-CNN approach to the action localization domain, and redesign the architecture to specifically address them.
  1. Actions have much larger variations in durations
    The temporal extent of actions varies dramatically, from a fraction of a second to minutes. For long actions, it is not important to understand each and every frame of the action. Instead, we can get a better handle on the action by skimming quickly through the video, using dilated temporal convolutions. This approach allows TALNet to search the video for temporal patterns while skipping over frames according to a given dilation rate. Analyzing the video with several different rates, selected automatically according to the anchor segment's length, enables efficient identification of actions as long as the entire video or as short as a second.
  2. The context before and after an action is important
    The moments preceding and following an action instance contain critical information for localization and classification, arguably more so than the spatial context of an object. Therefore, we explicitly encode the temporal context by extending the length of proposal segments on both the left and right by a fixed percentage of the segment's length in both the proposal generation stage and the classification stage.
  3. Actions require multi-modal input
    Actions are defined by appearance, motion and sometimes even audio information. Therefore, it is important to consider multiple modalities of features for the best results. We use a late fusion scheme for both the proposal generation network and the classification network: each modality has a separate proposal generation network, and their outputs are combined to obtain the final set of proposals. These proposals are then classified by a separate classification network for each modality, and the per-modality predictions are averaged to obtain the final predictions.
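The three adjustments above can be sketched in a few lines of plain Python. This is a simplified illustration, not the TALNet implementation: the kernel weights, the 50% context ratio, and the modality names `rgb` and `flow` are all assumptions made for the example.

```python
def dilated_conv1d(signal, kernel, dilation):
    """1D correlation without padding; kernel taps are `dilation` frames apart."""
    span = (len(kernel) - 1) * dilation  # frames covered beyond the first tap
    return [
        sum(w * signal[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(signal) - span)
    ]


def extend_with_context(start, end, ratio, video_length):
    """Grow a segment on both sides by `ratio` times its own length."""
    pad = ratio * (end - start)
    return max(0.0, start - pad), min(video_length, end + pad)


def late_fusion(scores_by_modality):
    """Average the per-class scores that each modality produced independently."""
    n = len(scores_by_modality)
    classes = next(iter(scores_by_modality.values())).keys()
    return {c: sum(s[c] for s in scores_by_modality.values()) / n for c in classes}


# 1. A 3-tap kernel with dilation 2 covers 5 frames while reading only 3.
print(dilated_conv1d([0, 1, 2, 3, 4, 5, 6, 7], [1, 1, 1], dilation=2))  # [6, 9, 12, 15]

# 2. A 4-second proposal gains 2 seconds of context on each side.
print(extend_with_context(10.0, 14.0, ratio=0.5, video_length=60.0))  # (8.0, 16.0)

# 3. Hypothetical scores from two modalities are averaged class by class.
fused = late_fusion({
    "rgb": {"slam dunk": 0.8, "pass": 0.2},
    "flow": {"slam dunk": 0.6, "pass": 0.4},
})
print(max(fused, key=fused.get))  # slam dunk
```

Raising the dilation rate widens the temporal window without adding kernel weights, which is why longer anchor segments can be handled at roughly the same cost as short ones.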
TALNet in Action
As a consequence of these improvements, TALNet achieves state-of-the-art performance for both action proposal and action localization tasks on the THUMOS'14 detection benchmark and competitive performance on the ActivityNet challenge. Now, whenever people save videos to Google Photos, our model identifies these moments and creates animations to share. Here are a few examples shared by our initial testers.
An example of the detected action "sliding down a slide"
An example of the detected actions "jump into the pool" (left), "twirl in a dress" (center) and "feed baby a spoonful" (right).
Next steps
We are continuing work to improve the precision and recall of action localization using more data, features and models. Improvements in temporal action localization can drive progress on a large number of important topics, including video highlights, video summarization, and search. We hope to continue improving the state-of-the-art in this domain and at the same time provide more ways for people to reminisce about their memories, big and small.

Special thanks to Tim Novikoff and Yu-Wei Chao, as well as Bryan Seybold, Lily Kharevych, Siyu Gu, Tracy Gu, Tracy Utley, Yael Marzan, Jingyu Cui, Balakrishnan Varadarajan, and Paul Natsev for their critical contributions to this project.

Source: Google AI Blog

Coming to India: Express, a faster way to back up with Google Photos

Since introducing Google Photos, we’ve aspired to be the home for all of your photos, helping you bring together a lifetime of memories in one place. To safely store your memories, we’ve offered two backup options: Original Quality and High Quality. However, in India specifically, we heard from people using the app that backup sometimes took longer or stalled because they did not always have frequent access to WiFi. In fact, we learned that over a third of people using Google Photos in India had some photos that hadn’t been backed up in over a month.

We want to make sure we’re building experiences in our app that meet the unique needs of people no matter where they are, so last December, we began offering a new backup option in Google Photos called Express backup to a small percentage of people using Google Photos on Android in India. Express provides faster backup at a reduced resolution, making it easier to ensure memories are saved even when you might have poor or infrequent WiFi connectivity.

Over the past week, we’ve started rolling out Express backup to more users in India, and by the end of the week, Android users on the latest version of Google Photos should start seeing it as an option for backup. In addition to Express, you will still have the option to choose from the existing backup options: Original Quality and High Quality. Alongside Express, we’re also introducing a new Data Cap option for backup in India, which gives users more granular daily control over how much cellular data is used to back up. People can select from a range of daily caps, starting at 5MB.

We’re starting to bring Express backup to dozens of other countries, rolling out slowly so we can listen to feedback and continue to improve the backup experience around the world.

Posted by Raja Ayyagari, Product Manager, Google Photos

Top Shot on Pixel 3

Life is full of meaningful moments — from a child’s first step to an impromptu jump for joy — that one wishes could be preserved with a picture. However, because these moments are often unpredictable, missing that perfect shot is a frustrating problem that smartphone camera users face daily. Using our experience from developing Google Clips, we wondered if we could develop new techniques for the Pixel 3 camera that would allow everyone to capture the perfect shot every time.

Top Shot is a new feature recently launched with Pixel 3 that helps you to capture precious moments precisely and automatically at the press of the shutter button. Top Shot saves and analyzes the image frames before and after the shutter press on the device in real-time using computer vision techniques, and recommends several alternative high-quality HDR+ photos.
Examples of Top Shot on Pixel 3. On the left, a better smiling shot is recommended. On the right, a better jump shot is recommended. The recommended images are high-quality HDR+ shots.
Capturing Multiple Moments
When a user opens the Pixel 3 Camera app, Top Shot is enabled by default, helping to capture the perfect moment by analyzing images taken both before and after the shutter press. Each image is analyzed for some qualitative features (e.g., whether the subject is smiling or not) in real-time and entirely on-device to preserve privacy and minimize latency. Each image is also associated with additional signals, such as optical flow of the image, exposure time, and gyro sensor data to form the input features used to score the frame quality.

When you press the shutter button, Top Shot captures up to 90 images from the 1.5 seconds before and the 1.5 seconds after the shutter press, and selects up to two alternative shots to save in high resolution. You can review the original shutter frame alongside these high-res alternatives (other lower-res frames can also be reviewed as desired). The shutter frame is processed and saved first. The best alternative shots are saved afterwards. Google’s Visual Core on Pixel 3 processes these top alternative shots as HDR+ images with a very small amount of extra latency, and they are embedded into the file of the Motion Photo.
Top-level diagram of Top Shot capture.
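One simple way to keep frames from before the shutter press is a fixed-size ring buffer that is frozen when the shutter fires. The sketch below is an illustration only: the 30 frames-per-second rate is an assumption inferred from 90 frames over 3 seconds, and `CaptureWindow` is a hypothetical class, not part of the Pixel camera software.

```python
from collections import deque

FPS = 30                              # assumed: 90 frames over 3 seconds
PRE_SHUTTER_FRAMES = int(1.5 * FPS)   # 45 frames kept from before the press
POST_SHUTTER_FRAMES = int(1.5 * FPS)  # 45 frames collected after the press


class CaptureWindow:
    """Keeps the most recent pre-shutter frames, then collects post-shutter frames."""

    def __init__(self):
        self._before = deque(maxlen=PRE_SHUTTER_FRAMES)
        self._after = []
        self._shutter_pressed = False

    def press_shutter(self):
        self._shutter_pressed = True

    def add_frame(self, frame):
        if not self._shutter_pressed:
            self._before.append(frame)  # the oldest frame drops off automatically
        elif len(self._after) < POST_SHUTTER_FRAMES:
            self._after.append(frame)   # frames beyond the window are ignored

    def frames(self):
        return list(self._before) + self._after


# Frames are represented by their index for the purpose of the example.
window = CaptureWindow()
for i in range(100):
    window.add_frame(i)
window.press_shutter()
for i in range(100, 160):
    window.add_frame(i)

captured = window.frames()
print(len(captured), captured[0], captured[-1])  # 90 55 144
```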
Because Top Shot runs in the camera as a background process, it must have very low power consumption. As such, Top Shot uses a hardware-accelerated MobileNet-based single shot detector (SSD). The execution of such optimized models is also throttled by power and thermal limits.

Recognizing Top Moments
When we set out to understand how to enable people to capture the best moments with their camera, we focused on three key attributes: 1) functional qualities like lighting, 2) objective attributes (are the subject's eyes open? Are they smiling?), and 3) subjective qualities like emotional expressions. We designed a computer vision model to recognize these attributes while operating in a low-latency, on-device mode.

During our development process, we started with a vanilla MobileNet model and set out to optimize for Top Shot, arriving at a customized architecture that operated within our accuracy, latency and power tradeoff constraints. Our neural network design detects low-level visual attributes in early layers, like whether the subject is blurry, and then dedicates additional compute and parameters toward more complex objective attributes like whether the subject's eyes are open, and subjective attributes like whether there is an emotional expression of amusement or surprise. We trained our model using knowledge distillation over a large number of diverse face images, with quantization applied during both training and inference.

We then adopted a layered Generalized Additive Model (GAM) to provide quality scores for faces and combine them into a weighted-average “frame faces” score. This model made it easy for us to interpret and identify the exact causes of success or failure, enabling rapid iteration to improve the quality and performance of our attributes model. The number of free parameters was on the order of dozens, so we could optimize these using Google's black box optimizer, Vizier, in tandem with any other parameters that affected selection quality.
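To illustrate why an additive model is easy to interpret, here is a minimal sketch in which each face attribute contributes its own term to a per-face quality score, and the per-face scores are combined by a weighted average. The attribute names, the shape functions, and the per-face weights are hypothetical. The post does not describe the actual terms or how the layers of the GAM are arranged.

```python
# Hypothetical shape functions: one univariate term per attribute.
SHAPE_FUNCTIONS = {
    "eyes_open": lambda p: 0.5 * p,
    "smiling": lambda p: 0.3 * p,
    "blur": lambda b: -0.4 * b,
}


def face_quality(attributes, shape_functions=SHAPE_FUNCTIONS):
    """Additive model: the score is a sum of independent per-attribute terms."""
    return sum(f(attributes[name]) for name, f in shape_functions.items())


def frame_faces_score(faces, shape_functions=SHAPE_FUNCTIONS):
    """Weighted average of per-face quality (weights are assumed, e.g. face size)."""
    total_weight = sum(face["weight"] for face in faces)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        face["weight"] * face_quality(face["attributes"], shape_functions)
        for face in faces
    )
    return weighted / total_weight


faces = [
    {"weight": 3.0, "attributes": {"eyes_open": 1.0, "smiling": 1.0, "blur": 0.0}},
    {"weight": 1.0, "attributes": {"eyes_open": 0.0, "smiling": 0.0, "blur": 1.0}},
]
print(frame_faces_score(faces))
```

Because every term can be inspected on its own, a poor frame score can be traced back to a specific attribute of a specific face, and the small number of coefficients is what makes black-box tuning practical.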

Frame Scoring Model
While Top Shot prioritizes face analysis, there are good moments in which faces are not the primary subject. To handle those use cases, we include the following additional scores in the overall frame quality score:
  • Subject motion saliency score: the low-resolution optical flow between the current frame and the previous frame is estimated in the ISP to determine if there is salient object motion in the scene.
  • Global motion blur score: estimated from the camera motion and the exposure time. The camera motion is calculated from sensor data from the gyroscope and OIS (optical image stabilization).
  • “3A” scores: the status of auto exposure, auto focus, and auto white balance is also considered.
All the individual scores are used to train a model that predicts an overall quality score matching the frame preferences of human raters, in order to maximize end-to-end product quality.
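As an example of how one of these signals could be derived, the sketch below estimates global motion blur from camera rotation and exposure time using a small-angle approximation: the image shifts by roughly the focal length (in pixels) times the angle rotated during the exposure. The function names, the focal length, and the mapping from blur to a score are assumptions for illustration, not the formula used in Top Shot.

```python
import math


def motion_blur_pixels(angular_velocity_rad_s, exposure_time_s, focal_length_px):
    """Small-angle estimate of how far the image shifts during the exposure."""
    # Combine pitch and yaw rates into one rotation speed, then into an angle.
    rotation_rad = math.hypot(*angular_velocity_rad_s) * exposure_time_s
    return focal_length_px * rotation_rad


def blur_score(blur_px, tolerance_px=2.0):
    """Hypothetical mapping to (0, 1]: 1.0 means no blur, lower means more blur."""
    return 1.0 / (1.0 + blur_px / tolerance_px)


# Assumed gyroscope reading (pitch rate, yaw rate), exposure, and focal length.
blur = motion_blur_pixels((0.03, 0.04), exposure_time_s=0.02, focal_length_px=3000.0)
print(round(blur, 3), round(blur_score(blur), 3))
```

With these inputs the estimated shift is about 3 pixels. Halving the exposure time halves the estimate, which matches the intuition that short exposures tolerate more camera shake.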

End-to-End Quality and Fairness
Most of the above components are each evaluated for accuracy independently. However, Top Shot presents requirements that are uniquely challenging since it runs in real time in the Pixel Camera. Additionally, we needed to ensure that all these signals are combined in a system with favorable results. That means we need to gauge our predictions against what our users perceive as the “top shot.”

To test this, we collected data from hundreds of volunteers, along with their opinions of which frames (out of up to 90!) looked best. This donated dataset covers many typical use cases, e.g. portraits, selfies, actions, landscapes, etc.

Many of the 3-second clips provided by Top Shot had more than one good shot, so it was important for us to engineer our quality metrics to handle this. As our objective, we used some modified versions of traditional Precision and Recall, some classic ranking metrics (such as Mean Reciprocal Rank), and a few others that were designed specifically for the Top Shot task. Beyond these metrics, we also investigated the causes of image quality issues we saw during development, leading to improvements in avoiding blur, handling multiple faces better, and more. In doing so, we were able to steer the model towards a set of selections people were likely to rate highly.
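For reference, Mean Reciprocal Rank extends naturally to clips with several acceptable frames: each clip is scored by the rank of the first recommended frame that raters considered good. The sketch below shows the standard metric on made-up data. The modified metrics mentioned above are not described in the post and are not reproduced here.

```python
def reciprocal_rank(ranked_frames, good_frames):
    """1 / rank of the first recommended frame that is in the set of good frames."""
    for rank, frame in enumerate(ranked_frames, start=1):
        if frame in good_frames:
            return 1.0 / rank
    return 0.0  # no good frame was recommended


def mean_reciprocal_rank(clips):
    """Average reciprocal rank over (ranked_frames, good_frames) pairs."""
    return sum(reciprocal_rank(ranked, good) for ranked, good in clips) / len(clips)


# Hypothetical clips: model ranking of frame indices, and frames raters liked.
clips = [
    ([17, 42, 3], {42, 60}),  # first good frame at rank 2 -> 0.5
    ([8, 9, 10], {8}),        # first good frame at rank 1 -> 1.0
    ([5, 6, 7], {30}),        # no good frame recommended  -> 0.0
]
print(mean_reciprocal_rank(clips))  # 0.5
```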

Importantly, we tested the Top Shot system for fairness to make sure that our product can offer a consistent experience to a very wide range of users. We evaluated the accuracy of each signal used in Top Shot on several different subgroups of people (based on gender, age, ethnicity, etc.) and compared the results across those subgroups.
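A per-subgroup evaluation of this kind can be sketched as follows: compute the accuracy of a single signal separately for each subgroup, then look at the largest gap between subgroups. The subgroup labels and records are invented for the example, and the gap is just one possible summary. The post does not state which criteria were applied.

```python
from collections import defaultdict


def accuracy_by_subgroup(records):
    """records: iterable of (subgroup, predicted, actual) for one signal."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, predicted, actual in records:
        total[subgroup] += 1
        correct[subgroup] += int(predicted == actual)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical "eyes open" predictions for two subgroups.
records = [
    ("group_a", True, True), ("group_a", False, False),
    ("group_a", True, True), ("group_a", True, False),
    ("group_b", True, True), ("group_b", False, True),
    ("group_b", False, False), ("group_b", True, False),
]
accuracy = accuracy_by_subgroup(records)
gap = max(accuracy.values()) - min(accuracy.values())
print(accuracy)  # {'group_a': 0.75, 'group_b': 0.5}
print(gap)       # 0.25
```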

Top Shot is just one example of how Google leverages optimized hardware and cutting-edge machine learning to provide useful tools and services. We hope you’ll find this feature useful, and we’re committed to further improving the capabilities of mobile phone photography!

This post reflects the work of a large group of Google engineers, research scientists, and others including: Ari Gilder, Aseem Agarwala, Brendan Jou, David Karam, Eric Penner, Farooq Ahmad, Henri Astre, Hillary Strickland, Marius Renn, Matt Bridges, Maxwell Collins, Navid Shiee, Ryan Gordon, Sarah Clinckemaillie, Shu Zhang, Vivek Kesarwani, Xuhui Jia, Yukun Zhu, Yuzo Watanabe and Chris Breithaupt.

Source: Google AI Blog