Tag Archives: Google Photos

Capturing Special Video Moments with Google Photos



Recording video of memorable moments to share with friends and loved ones has become commonplace. But as anyone with a sizable video library can tell you, it's a time-consuming task to go through all that raw footage searching for the perfect clips to relive or share. Google Photos makes this easier by automatically finding magical moments in your videos (like when your child blows out the candle or when your friend jumps into a pool) and creating animations from them that you can easily share with friends and family.

In "Rethinking the Faster R-CNN Architecture for Temporal Action Localization", we address some of the challenges behind automating this task, which are due to the complexity of identifying and categorizing actions from a highly variable array of input data, by introducing an improved method to identify the exact location within a video where a given action occurs. Our temporal action localization network (TALNet) draws inspiration from advances in region-based object detection methods such as the Faster R-CNN network. TALNet enables identification of moments with large variation in duration, achieving state-of-the-art performance compared to other methods, allowing Google Photos to recommend the best part of a video for you to share with friends and family.
An example of the detected action "blowing out candles"
Identifying Actions for Model Training
The first step in identifying magic moments in videos is to assemble a list of actions that people might wish to highlight. Some examples of actions include "blow out birthday candles", "strike (bowling)", "cat wags tail", etc. We then crowdsourced the annotation of segments within a collection of public videos where these specific actions occurred, in order to create a large training dataset. We asked the raters to find and label all moments, accommodating videos that might have several moments. This final annotated dataset was then used to train our model so that it could identify the desired actions in new, unknown videos.

Comparison to Object Detection
The challenge of recognizing these actions belongs to the field of computer vision known as temporal action localization, which, like the more familiar object detection, falls under the umbrella of visual detection problems. Given a long, untrimmed video as input, temporal action localization aims to identify the start and end times, as well as the action label (like "blowing out candles"), for each action instance in the full video. While object detection aims to produce spatial bounding boxes around an object in a 2D image, temporal action localization aims to produce temporal segments including an action in a 1D sequence of video frames.

Our approach to TALNet is inspired by the Faster R-CNN object detection framework for 2D images. So, to understand TALNet, it is useful to first understand Faster R-CNN. The figure below demonstrates how the Faster R-CNN architecture is used for object detection. The first step is to generate a set of object proposals, regions of the image that can be used for classification. To do this, an input image is first converted into a 2D feature map by a convolutional neural network (CNN). The region proposal network then generates bounding boxes around candidate objects. These boxes are generated at multiple scales in order to capture the large variability in objects' sizes in natural images. With the object proposals now defined, the subjects in the bounding boxes are then classified by a deep neural network (DNN) into specific objects, such as "person", "bike", etc.
Faster R-CNN architecture for object detection
Temporal Action Localization
Temporal action localization is accomplished in a fashion similar to that used by Faster R-CNN. A sequence of input frames from a video is first converted into a sequence of 1D feature maps that encode scene context. This sequence is passed to a segment proposal network that generates candidate segments, each defined by start and end times. A DNN then applies the representations learned from the training dataset to classify the actions in the proposed video segments (e.g., "slam dunk", "pass", etc.). The actions identified in each segment are given weights according to their learned representations, with the top scoring moment selected to share with the user.
Architecture for temporal action localization
Special Considerations for Temporal Action Localization
While temporal action localization can be viewed as the 1D counterpart of the object detection problem, care must be taken to address a number of issues unique to action localization. In particular, we address three specific issues in order to apply the Faster R-CNN approach to the action localization domain, and redesign the architecture to specifically address them.
  1. Actions have much larger variations in durations
    The temporal extent of actions varies dramatically, from a fraction of a second to minutes. For long actions, it is not important to understand each and every frame of the action. Instead, we can get a better handle on the action by skimming quickly through the video, using dilated temporal convolutions. This approach allows TALNet to search the video for temporal patterns, while skipping over alternate frames based on a given dilation rate. Analyzing the video with several different rates that are selected automatically according to the anchor segment's length enables efficient identification of actions as long as the entire video or as short as a second.
  2. The context before and after an action is important
    The moments preceding and following an action instance contain critical information for localization and classification, arguably more so than the spatial context of an object. Therefore, we explicitly encode the temporal context by extending the length of proposal segments on both the left and right by a fixed percentage of the segment's length in both the proposal generation stage and the classification stage.
  3. Actions require multi-modal input
    Actions are defined by appearance, motion and sometimes even audio information. Therefore, it is important to consider multiple modalities of features for the best results. We use a late fusion scheme for both the proposal generation network and the classification network, in which each modality has a separate proposal generation network whose outputs are combined together to obtain the final set of proposals. These proposals are classified using separate classification networks for each modality, which are then averaged to obtain the final predictions.
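Two of the adjustments listed above, widening each proposal to take in temporal context and averaging per-modality predictions, can be illustrated with a minimal sketch. This is not TALNet's code: the function names, the 0.5 extension ratio, and the "rgb" and "flow" modality labels are assumptions chosen only for illustration.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def extend_segment(segment: Segment, ratio: float, video_length: float) -> Segment:
    """Widen a proposal on both sides by a fixed fraction of its own length."""
    start, end = segment
    pad = ratio * (end - start)
    # Clamp so the widened segment stays inside the video.
    return (max(0.0, start - pad), min(video_length, end + pad))


def late_fusion(per_modality_scores: Dict[str, List[float]]) -> List[float]:
    """Average the class scores produced independently by each modality."""
    score_lists = list(per_modality_scores.values())
    num_classes = len(score_lists[0])
    return [
        sum(scores[c] for scores in score_lists) / len(score_lists)
        for c in range(num_classes)
    ]


# Hypothetical values: a 4-second proposal in a 60-second video, and two
# modalities that disagree about which of two action classes is present.
widened = extend_segment((10.0, 14.0), ratio=0.5, video_length=60.0)
fused = late_fusion({"rgb": [0.75, 0.25], "flow": [0.25, 0.75]})
```

Because the padding scales with segment length, a short action gets a small amount of context and a long action gets proportionally more.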
TALNet in Action
As a consequence of these improvements, TALNet achieves state-of-the-art performance for both action proposal and action localization tasks on the THUMOS'14 detection benchmark and competitive performance on the ActivityNet challenge. Now, whenever people save videos to Google Photos, our model identifies these moments and creates animations to share. Here are a few examples shared by our initial testers.
An example of the detected action "sliding down a slide"
An example of the detected actions "jump into the pool" (left), "twirl in a dress" (center) and "feed baby a spoonful" (right).
Next steps
We are continuing work to improve the precision and recall of action localization using more data, features and models. Improvements in temporal action localization can drive progress on a large number of important topics ranging from video highlights, video summarization, search and more. We hope to continue improving the state-of-the-art in this domain and at the same time provide more ways for people to reminisce on their memories, big and small.

Acknowledgements
Special thanks to Tim Novikoff and Yu-Wei Chao, as well as Bryan Seybold, Lily Kharevych, Siyu Gu, Tracy Gu, Tracy Utley, Yael Marzan, Jingyu Cui, Balakrishnan Varadarajan, and Paul Natsev, for their critical contributions to this project.

Source: Google AI Blog


Coming to India: Express, a faster way to back up with Google Photos

Since introducing Google Photos, we’ve aspired to be the home for all of your photos, helping you bring together a lifetime of memories in one place. To safely store your memories, we’ve offered two backup options: Original Quality and High Quality. However, in India specifically, we heard from people using the app that their backups sometimes took longer or stalled because they might not always have frequent access to WiFi. In fact, we learned that over a third of people using Google Photos in India had some photos that hadn’t been backed up in over a month.


We want to make sure we’re building experiences in our app that meet the unique needs of people no matter where they are, so last December, we began offering a new backup option in Google Photos called Express backup to a small percentage of people using Google Photos on Android in India. Express provides faster backup at a reduced resolution, making it easier to ensure memories are saved even when you might have poor or infrequent WiFi connectivity.



Over the past week, we’ve started rolling out Express backup to more users in India and by the end of the week, Android users on the latest version of Google Photos should start seeing it as an option for backup. In addition to Express, you will still have the option to choose from the existing backup options: Original Quality and High Quality. And, in addition to rolling out Express as an additional backup option in India, we’re also introducing a new Data Cap option for backup. This gives users more granular daily controls for using cellular data to back up. People can select from a range of daily caps, starting at 5MB.


We’re starting to bring Express backup to dozens of other countries, rolling out slowly so we can listen to feedback and continue to improve the backup experience around the world.

Posted by Raja Ayyagari, Product Manager, Google Photos

Top Shot on Pixel 3



Life is full of meaningful moments — from a child’s first step to an impromptu jump for joy — that one wishes could be preserved with a picture. However, because these moments are often unpredictable, missing that perfect shot is a frustrating problem that smartphone camera users face daily. Using our experience from developing Google Clips, we wondered if we could develop new techniques for the Pixel 3 camera that would allow everyone to capture the perfect shot every time.

Top Shot is a new feature recently launched with Pixel 3 that helps you to capture precious moments precisely and automatically at the press of the shutter button. Top Shot saves and analyzes the image frames before and after the shutter press on the device in real-time using computer vision techniques, and recommends several alternative high-quality HDR+ photos.
Examples of Top Shot on Pixel 3. On the left, a better smiling shot is recommended. On the right, a better jump shot is recommended. The recommended images are high-quality HDR+ shots.
Capturing Multiple Moments
When a user opens the Pixel 3 Camera app, Top Shot is enabled by default, helping to capture the perfect moment by analyzing images taken both before and after the shutter press. Each image is analyzed for some qualitative features (e.g., whether the subject is smiling or not) in real-time and entirely on-device to preserve privacy and minimize latency. Each image is also associated with additional signals, such as optical flow of the image, exposure time, and gyro sensor data to form the input features used to score the frame quality.

When you press the shutter button, Top Shot captures up to 90 images from 1.5 seconds before and after the shutter press and selects up to two alternative shots to save in high resolution, giving you the original shutter frame plus high-res alternatives to review (other lower-res frames can also be reviewed as desired). The shutter frame is processed and saved first. The best alternative shots are saved afterwards. Google’s Visual Core on Pixel 3 is used to process these top alternative shots as HDR+ images with a very small amount of extra latency, and the results are embedded into the file of the Motion Photo.
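The capture window can be made concrete with a small sketch. It assumes a steady 30 frames per second (inferred from 90 frames over 3 seconds; the post does not state a frame rate), and the rolling buffer and the selection helper are hypothetical illustrations rather than the Pixel camera's implementation.

```python
from collections import deque
from typing import List, Tuple

FRAMES_PER_SECOND = 30   # assumption: 90 frames spread over 3 seconds
WINDOW_SECONDS = 1.5     # from the post: 1.5 s before and after the press
HALF_WINDOW = int(FRAMES_PER_SECOND * WINDOW_SECONDS)
MAX_ALTERNATIVES = 2     # from the post: up to two alternative shots


def pick_alternatives(scored_frames: List[Tuple[int, float]], shutter_index: int) -> List[int]:
    """Return indices of the best-scoring frames other than the shutter frame."""
    candidates = [(i, s) for i, s in scored_frames if i != shutter_index]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [i for i, _ in candidates[:MAX_ALTERNATIVES]]


# A rolling buffer keeps only the most recent pre-shutter frames, so older
# frames are discarded automatically while the camera is open.
pre_shutter = deque(maxlen=HALF_WINDOW)
for frame_index in range(200):  # frames streaming in before the press
    pre_shutter.append(frame_index)

total_frames = 2 * HALF_WINDOW  # frames before plus frames after the press
alternatives = pick_alternatives([(0, 0.1), (1, 0.9), (2, 0.5), (3, 0.7)], shutter_index=1)
```

The shutter frame is excluded from the candidates because it is always saved, so only the alternatives compete on score.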
Top-level diagram of Top Shot capture.
Because Top Shot runs in the camera as a background process, it must have very low power consumption. As such, Top Shot uses a hardware-accelerated MobileNet-based single shot detector (SSD). The execution of such optimized models is also throttled by power and thermal limits.

Recognizing Top Moments
When we set out to understand how to enable people to capture the best moments with their camera, we focused on three key attributes: 1) functional qualities like lighting, 2) objective attributes (are the subject's eyes open? Are they smiling?), and 3) subjective qualities like emotional expressions. We designed a computer vision model to recognize these attributes while operating in a low-latency, on-device mode.

During our development process, we started with a vanilla MobileNet model and set out to optimize for Top Shot, arriving at a customized architecture that operated within our accuracy, latency and power tradeoff constraints. Our neural network design detects low-level visual attributes in early layers, like whether the subject is blurry, and then dedicates additional compute and parameters toward more complex objective attributes like whether the subject's eyes are open, and subjective attributes like whether there is an emotional expression of amusement or surprise. We trained our model using knowledge distillation over a large number of diverse face images using quantization during both training and inference.

We then adopted a layered Generalized Additive Model (GAM) to provide quality scores for faces and combine them into a weighted-average “frame faces” score. This model made it easy for us to interpret and identify the exact causes of success or failure, enabling rapid iteration to improve the quality and performance of our attributes model. The number of free parameters was on the order of dozens, so we could optimize these using Google's black box optimizer, Vizier, in tandem with any other parameters that affected selection quality.
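To show what a weighted-average "frame faces" score could look like, here is a minimal sketch. The post does not say how the per-face weights are chosen, so the weight is treated as a hypothetical input (for example, a face's relative size in the frame).

```python
from typing import List, Tuple


def frame_faces_score(faces: List[Tuple[float, float]]) -> float:
    """Weighted average of per-face quality scores.

    Each face is a (quality_score, weight) pair. The weight is a
    hypothetical quantity, such as the face's relative size.
    """
    total_weight = sum(weight for _, weight in faces)
    if total_weight == 0:
        return 0.0  # no faces detected, or every weight is zero
    return sum(score * weight for score, weight in faces) / total_weight


# A large, sharp face and a small, blurry one: the large face dominates.
score = frame_faces_score([(0.8, 3.0), (0.4, 1.0)])
```

A plain weighted average keeps each face's contribution easy to inspect, which matches the interpretability goal described above.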

Frame Scoring Model
While Top Shot prioritizes face analysis, there are good moments in which faces are not the primary subject. To handle those use cases, we include the following additional scores in the overall frame quality score:
  • Subject motion saliency score: the low-resolution optical flow between the current frame and the previous frame is estimated in ISP to determine if there is salient object motion in the scene.
  • Global motion blur score: estimated from the camera motion and the exposure time. The camera motion is calculated from sensor data from the gyroscope and OIS (optical image stabilization).
  • “3A” scores: the status of auto exposure, auto focus, and auto white balance is also considered.
All the individual scores are used to train a model predicting an overall quality score, which matches the frame preference of human raters, to maximize end-to-end product quality.
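As a rough illustration of how camera motion and exposure time relate to blur, the sketch below uses the small-angle approximation that a rotation shifts the image by roughly the focal length (in pixels) times the rotation angle. The linear combination of scores is only a placeholder for the trained model mentioned above; the score names and weights are hypothetical.

```python
from typing import Dict


def motion_blur_pixels(angular_speed_rad_s: float, exposure_s: float,
                       focal_length_px: float) -> float:
    """Approximate blur streak length caused by camera rotation.

    Small-angle approximation: shift ~= focal length * rotation angle,
    where the angle is angular speed multiplied by exposure time.
    """
    return focal_length_px * angular_speed_rad_s * exposure_s


def overall_quality(scores: Dict[str, float], weights: Dict[str, float],
                    bias: float = 0.0) -> float:
    """Placeholder linear combination standing in for the learned model."""
    return bias + sum(weights[name] * scores[name] for name in weights)


# Hypothetical numbers: 0.1 rad/s of hand shake, 10 ms exposure, 3000 px focal length.
blur = motion_blur_pixels(0.1, 0.01, 3000.0)
quality = overall_quality(
    {"faces": 0.5, "motion_saliency": 1.0, "sharpness": 0.0},
    {"faces": 2.0, "motion_saliency": 1.0, "sharpness": 4.0},
)
```

The blur estimate shows why exposure time matters: the same hand shake produces a longer streak when the shutter stays open longer.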

End-to-End Quality and Fairness
Most of the above components are each evaluated for accuracy independently. However, Top Shot presents requirements that are uniquely challenging since it’s running real-time in the Pixel Camera. Additionally, we needed to ensure that all these signals are combined in a system with favorable results. That means we need to gauge our predictions against what our users perceive as the “top shot.”

To test this, we collected data from hundreds of volunteers, along with their opinions of which frames (out of up to 90!) looked best. This donated dataset covers many typical use cases, e.g. portraits, selfies, actions, landscapes, etc.

Many of the 3-second clips provided by Top Shot had more than one good shot, so it was important for us to engineer our quality metrics to handle this. We used some modified versions of traditional Precision and Recall, some classic ranking metrics (such as Mean Reciprocal Rank), and a few others that were designed specifically for the Top Shot task as our objective. Alongside these metrics, we also investigated causes of image quality issues we saw during development, leading to improvements in avoiding blur, handling multiple faces better, and more. In doing so, we were able to steer the model towards a set of selections people were likely to rate highly.
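For readers unfamiliar with these ranking metrics, here is a minimal sketch of the standard (unmodified) Mean Reciprocal Rank and precision at k, written so that a clip may have several acceptable frames. The modified variants used for Top Shot are not described in the post, so this shows only the textbook definitions; the frame indices are made up.

```python
from typing import List, Set, Tuple


def reciprocal_rank(ranked_frames: List[int], good_frames: Set[int]) -> float:
    """1 / rank of the first acceptable frame, or 0.0 if none is ranked."""
    for rank, frame in enumerate(ranked_frames, start=1):
        if frame in good_frames:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(clips: List[Tuple[List[int], Set[int]]]) -> float:
    """Average reciprocal rank over (ranked_frames, good_frames) pairs."""
    return sum(reciprocal_rank(ranked, good) for ranked, good in clips) / len(clips)


def precision_at_k(ranked_frames: List[int], good_frames: Set[int], k: int) -> float:
    """Fraction of the top k ranked frames that raters considered good."""
    return sum(1 for frame in ranked_frames[:k] if frame in good_frames) / k


# Two hypothetical clips: the model's ranking and the frames raters liked.
clips = [([12, 40, 7], {40, 41}), ([3, 5, 9], {3})]
mrr = mean_reciprocal_rank(clips)
p_at_2 = precision_at_k([12, 40, 7], {40, 41}, k=2)
```

Treating the rater-approved frames as a set is what lets a clip with more than one good shot be scored fairly: any of them counts as a hit.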

Importantly, we tested the Top Shot system for fairness to make sure that our product can offer a consistent experience to a very wide range of users. We evaluated the accuracy of each signal used in Top Shot on several different subgroups of people (based on gender, age, ethnicity, etc), testing for accuracy of each signal across those subgroups.

Conclusion
Top Shot is just one example of how Google leverages optimized hardware and cutting-edge machine learning to provide useful tools and services. We hope you’ll find this feature useful, and we’re committed to further improving the capabilities of mobile phone photography!

Acknowledgements
This post reflects the work of a large group of Google engineers, research scientists, and others including: Ari Gilder, Aseem Agarwala, Brendan Jou, David Karam, Eric Penner, Farooq Ahmad, Henri Astre, Hillary Strickland, Marius Renn, Matt Bridges, Maxwell Collins, Navid Shiee, Ryan Gordon, Sarah Clinckemaillie, Shu Zhang, Vivek Kesarwani, Xuhui Jia, Yukun Zhu, Yuzo Watanabe and Chris Breithaupt.

Source: Google AI Blog


Build new experiences with the Google Photos Library API

Posted by Jan-Felix Schmakeit, Google Photos Developer Lead

As we shared in May, people create and consume photos and videos in many different ways, and we think it should be easier to do more with the photos people take, across more of the apps and devices we all use. That's why we created the Google Photos Library API: to give you the ability to build photo and video experiences in your products that are smarter, faster, and more helpful.

After a successful developer preview over the past few months, the Google Photos Library API is now generally available. If you want to build and test your own experience, you can visit our developer documentation to get started. You can also express your interest in joining the Google Photos partner program if you are planning a larger integration.

Here's a quick overview of the Google Photos Library API and what you can do:

Whether you're a mobile, web, or backend developer, you can use this REST API to utilize the best of Google Photos and help people connect, upload, and share from inside your app. We are also launching client libraries in multiple languages that will help you get started quicker.

Users have to authorize requests through the API, so they are always in the driver's seat. Here are a few things you can help your users do:

  • Easily find photos, based on
    • what's in the photo
    • when it was taken
    • attributes like media format
  • Upload directly to their photo library or an album
  • Organize albums and add titles and locations
  • Use shared albums to easily transfer and collaborate

Putting machine learning to work in your app is simple too. You can use smart filters, like content categories, to narrow down or exclude certain types of photos and videos and make it easier for your users to find the ones they're looking for.
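As an illustration of the filters described above, the following sketch builds (but does not send) a search request. The endpoint and field names follow the Library API reference as best recalled here and should be checked against the current documentation; the token, dates, and chosen categories are placeholders.

```python
import json
import urllib.request

SEARCH_URL = "https://photoslibrary.googleapis.com/v1/mediaItems:search"


def build_search_request(access_token: str) -> urllib.request.Request:
    """Build a POST request that filters by content, date, and media type."""
    body = {
        "pageSize": 25,
        "filters": {
            # Content categories: include landscapes, leave out screenshots.
            "contentFilter": {
                "includedContentCategories": ["LANDSCAPES"],
                "excludedContentCategories": ["SCREENSHOTS"],
            },
            # When it was taken: an example date range.
            "dateFilter": {
                "ranges": [{
                    "startDate": {"year": 2018, "month": 1, "day": 1},
                    "endDate": {"year": 2018, "month": 6, "day": 30},
                }]
            },
            # Attributes like media format: photos only.
            "mediaTypeFilter": {"mediaTypes": ["PHOTO"]},
        },
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",  # OAuth 2.0 token from the user
            "Content-Type": "application/json",
        },
        method="POST",
    )


# The token below is a placeholder; a real one comes from the user's OAuth consent.
request = build_search_request("PLACEHOLDER_TOKEN")
```

Sending the request with urllib.request.urlopen(request) would require a valid token that the user has granted, which is why the sketch stops at construction.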

Thanks to everyone who provided feedback throughout our developer preview; your contributions helped make the API better. You can read our release notes to follow along with any new releases of our API. And, if you've been using the Picasa Web Albums API, here's a migration guide that will help you move to the Google Photos Library API.

Introducing the Google Photos partner program

Posted by Jan-Felix Schmakeit, Google Photos Developer Lead

People create and consume photos and videos in many different ways, and we think it should be easier to do more with the photos you've taken, across all the apps and devices you use.

That's why we're introducing a new Google Photos partner program that gives you the tools and APIs to build photo and video experiences in your products that are smarter, faster and more helpful.

Building with the Google Photos Library API

With the Google Photos Library API, your users can seamlessly access their photos whenever they need them.

Whether you're a mobile, web, or backend developer, you can use this REST API to utilize the best of Google Photos and help people connect, upload, and share from inside your app.

Your user is always in the driver's seat. Here are a few things you can help them to do:

  • Easily find photos, based on
    • what's in the photo
    • when it was taken
    • attributes like description and media format
  • Upload directly to their photo library
  • Organize albums and add titles and locations
  • Use shared albums to easily transfer and collaborate

With the Library API, you don't have to worry about maintaining your own storage and infrastructure, as photos and videos remain safely backed up in Google Photos.

Putting machine intelligence to work in your app is simple too. You can use smart filters, like content categories, to narrow down or exclude certain types of photos and videos and make it easier for your users to find the ones they're looking for.

We've also aimed to take the hassle out of building a smooth user experience. Features like thumbnailing and cross-platform deep-links mean you can offload common tasks and focus on what makes your product unique.

Getting started

Today, we're launching a developer preview of the Google Photos Library API. You can start building and testing it in your own projects right now.

Get started by visiting our developer documentation where you can also express your interest in joining the Google Photos partner program. Some of our early partners, including HP, Legacy Republic, NixPlay, Xero and TimeHop are already building better experiences using the API.

If you are following Google I/O, you can also join us for our session to learn more.

We're excited for the road ahead and look forward to working with you to develop new apps that work with Google Photos.

PhotoScan: Taking Glare-Free Pictures of Pictures



Yesterday, we released an update to PhotoScan, an app for iOS and Android that allows you to digitize photo prints with just a smartphone. One of the key features of PhotoScan is the ability to remove glare from prints, which are often glossy and reflective, as are the plastic album pages or glass-covered picture frames that host them. To create this feature, we developed a unique blend of computer vision and image processing techniques that can carefully align and combine several slightly different pictures of a print to separate the glare from the image underneath.
Left: A regular digital picture of a physical print. Right: Glare-free digital output from PhotoScan
When taking a single picture of a photo, determining which regions of the picture are the actual photo and which regions are glare is challenging to do automatically. Moreover, the glare may often saturate regions in the picture, rendering it impossible to see or recover the parts of the photo underneath it. But if we take several pictures of the photo while moving the camera, the position of the glare tends to change, covering different regions of the photo. In most cases we found that every pixel of the photo is likely to be free of glare in at least one of the pictures. While no single view may be glare-free, we can combine multiple pictures of the printed photo taken at different angles to remove the glare. The challenge is that the images need to be aligned very accurately in order to combine them properly, and this processing needs to run very quickly on the phone to provide a near instant experience.
Left: The captured, input images (5 in total). Right: If we stabilize the images on the photo, we can see just the glare moving, covering different parts of the photo. Notice no single image is glare-free.
Our technique is inspired by our earlier work published at SIGGRAPH 2015, which we dubbed “obstruction-free photography”. It uses similar principles to remove various types of obstructions from the field of view. However, the algorithm we originally proposed was based on a generative model where the motion and appearance of both the main scene and the obstruction layer are estimated. While that model is quite powerful and can remove a variety of obstructions, it is too computationally expensive to be run on smartphones. We therefore developed a simpler model that treats glare as an outlier, and only attempts to register the underlying, glare-free photo. While this model is simpler, the task is still quite challenging as the registration needs to be highly accurate and robust.

How it Works
We start from a series of pictures of the print taken by the user while moving the camera. The first picture - the “reference frame” - defines the desired output viewpoint. The user is then instructed to take four additional frames. In each additional frame, we detect sparse feature points (we compute ORB features on Harris corners) and use them to establish homographies mapping each frame to the reference frame.
Detected feature matches between the reference frame and each other frame (left), and the warped frames according to the estimated homographies (right).
While the technique may sound straightforward, there is a catch: homographies are only able to align flat images. But printed photos are often not entirely flat (as is the case with the example shown above). Therefore, we use optical flow (a fundamental computer vision representation for motion, which establishes a pixel-wise mapping between two images) to correct the non-planarities. We start from the homography-aligned frames, and compute “flow fields” to warp the images and further refine the registration. In the example below, notice how the corners of the photo on the left slightly “move” after registering the frames using only homographies. The right hand side shows how the photo is better aligned after refining the registration using optical flow.
Comparison between the warped frames using homographies (left) and after the additional warp refinement using optical flow (right).
The difference in the registration is subtle, but has a big impact on the end result. Notice how small misalignments manifest themselves as duplicated image structures in the result, and how these artifacts are alleviated with the additional flow refinement.
Comparison between the glare removal result with (right) and without (left) optical flow refinement. In the result using homographies only (left), notice artifacts around the eye, nose and teeth of the person, and duplicated stems and flower petals on the fabric.
Here too, the challenge was to make optical flow, a naturally slow algorithm, work very quickly on the phone. Instead of computing optical flow at each pixel as done traditionally (the number of flow vectors computed is equal to the number of input pixels), we represent a flow field by a smaller number of control points, and express the motion at each pixel in the image as a function of the motion at the control points. Specifically, we divide each image into tiled, non-overlapping cells to form a coarse grid, and represent the flow of a pixel in a cell as the bilinear combination of the flow at the four corners of the cell that contains it.

The grid setup for grid optical flow. A point p is represented as the bilinear interpolation of the four corner points of the cell that encapsulates it.
Left: Illustration of the computed flow field on one of the frames. Right: The flow color coding: orientation and magnitude represented by hue and saturation, respectively.
This results in a much smaller problem to solve, since the number of flow vectors to compute now equals the number of grid points, which is typically much smaller than the number of pixels. This process is similar in nature to the spline-based image registration described in Szeliski and Coughlan (1997). With this algorithm, we were able to reduce the optical flow computation time by a factor of ~40 on a Pixel phone!
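The grid representation described above can be sketched in a few lines. This shows only how a pixel's flow is read back from the four corners of its cell by bilinear interpolation, not how the grid flow itself is estimated; the 16-pixel cell size and the 640x480 image are hypothetical.

```python
from typing import List, Tuple

Flow = Tuple[float, float]  # (dx, dy) displacement in pixels


def flow_at_pixel(grid: List[List[Flow]], cell_size: int, x: float, y: float) -> Flow:
    """Bilinearly interpolate flow from the corners of the enclosing cell.

    grid[row][col] is the flow at grid point (col * cell_size, row * cell_size).
    """
    # Index of the cell's top-left corner, clamped so that pixels on the
    # last row or column still fall inside a valid cell.
    col = min(int(x // cell_size), len(grid[0]) - 2)
    row = min(int(y // cell_size), len(grid) - 2)
    # Fractional position of the pixel inside the cell, each in [0, 1].
    tx = x / cell_size - col
    ty = y / cell_size - row
    f00, f10 = grid[row][col], grid[row][col + 1]
    f01, f11 = grid[row + 1][col], grid[row + 1][col + 1]
    return tuple(
        (1 - tx) * (1 - ty) * f00[i] + tx * (1 - ty) * f10[i]
        + (1 - tx) * ty * f01[i] + tx * ty * f11[i]
        for i in range(2)
    )


# One 16x16 cell with different flow at each of its four corners.
grid = [[(0.0, 0.0), (4.0, 0.0)],
        [(0.0, 4.0), (4.0, 4.0)]]
center_flow = flow_at_pixel(grid, cell_size=16, x=8.0, y=8.0)

# Unknowns to solve for on a hypothetical 640x480 image with 16-pixel cells.
grid_points = (640 // 16 + 1) * (480 // 16 + 1)
pixels = 640 * 480
```

With these example numbers, only 1,271 flow vectors need to be computed instead of 307,200, which is the source of the reduction in problem size.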
Flipping between the homography-registered frame and the flow-refined warped frame (using the above flow field), superimposed on the (clean) reference frame, shows how the computed flow field “snaps” image parts to their corresponding parts in the reference frame, improving the registration.
Finally, in order to compose the glare-free output, for any given location in the registered frames, we examine the pixel values, and use a soft minimum algorithm to obtain the darkest observed value. More specifically, we compute the expectation of the minimum brightness over the registered frames, assigning less weight to pixels close to the (warped) image boundaries. We use this method rather than computing the minimum directly across the frames due to the fact that corresponding pixels at each frame may have slightly different brightness. Therefore, per-pixel minimum can produce visible seams due to sudden intensity changes at boundaries between overlaid images.
Regular minimum (left) versus soft minimum (right) over the registered frames.
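One common way to build a soft minimum is an exponentially weighted average that favors darker values. The sketch below uses that formulation as a stand-in; it is not necessarily the exact expectation PhotoScan computes, and the sharpness value and per-sample weights are hypothetical.

```python
import math
from typing import List


def soft_min(values: List[float], weights: List[float], sharpness: float = 20.0) -> float:
    """Smooth approximation of the minimum of brightness values in [0, 1].

    Darker values get exponentially more influence. `weights` can reduce
    the influence of samples near a warped frame's boundary. At least one
    weight must be positive.
    """
    lowest = min(values)  # subtracted for numerical stability
    factors = [w * math.exp(-sharpness * (v - lowest))
               for v, w in zip(values, weights)]
    return sum(f * v for f, v in zip(factors, values)) / sum(factors)


# One pixel location seen in five registered frames; three frames show glare.
observed = [0.2, 0.9, 0.95, 0.25, 1.0]
blended = soft_min(observed, weights=[1.0] * 5)
```

Unlike a hard minimum, the result blends the two glare-free observations (0.2 and 0.25), so small brightness differences between frames change the output gradually instead of producing seams.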
The algorithm can support a variety of scanning conditions — matte and gloss prints, photos inside or outside albums, magazine covers.

Columns, left to right: input, registered, glare-free.
To get the final result, the Photos team has developed a method that automatically detects and crops the photo area, and rectifies it to a frontal view. Because of perspective distortion, the scanned rectangular photo usually appears to be a quadrangle on the image. The method analyzes image signals, like color and edges, to figure out the exact boundary of the original photo on the scanned image, then applies a geometric transformation to rectify the quadrangle area back to its original rectangular shape, yielding a high-quality, glare-free digital version of the photo.
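The rectification step can be illustrated with the standard four-point homography: given the four detected corners of the quadrangle and the four corners of the target rectangle, solve for the 3x3 transform that maps one onto the other. This is a textbook sketch, not the Photos team's method; the corner coordinates and the 400x300 output size are made up.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def homography_from_corners(src: List[Point], dst: List[Point]) -> List[List[float]]:
    """3x3 homography mapping four src corners to four dst corners.

    The bottom-right entry is fixed to 1. Fails with a division by zero
    if the corners are degenerate (for example, three on one line).
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def warp(h: List[List[float]], point: Point) -> Point:
    """Apply the homography to a point, including the projective division."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical detected photo corners, and the upright rectangle to map them to.
quad = [(12.0, 8.0), (410.0, 30.0), (395.0, 290.0), (5.0, 270.0)]
rect = [(0.0, 0.0), (400.0, 0.0), (400.0, 300.0), (0.0, 300.0)]
H = homography_from_corners(quad, rect)
```

In a full pipeline, every output pixel would be filled by sampling the scanned image through the inverse of this transform; the sketch stops at computing and applying the mapping to points.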
So overall, quite a lot going on under the hood, and all done almost instantaneously on your phone! To give PhotoScan a try, download the app on Android or iOS.

Now your photos look better than ever, even those dusty old prints

Photos from the past, meet scanner from the future

Google Photos is a home for all your photos and videos, but what about those old prints that are some of your most treasured memories, such as photos of grandma when she was young, your childhood pet, and that hairstyle you wish you could forget?

We all have those old albums and boxes of photos, but we don’t take the time to digitize them because it’s just too hard to get it right. We don’t want to mail away our original copy, buying a scanner is costly and time consuming, and if you try to take a photo of a photo, you end up with crooked edges and glare.

We knew there had to be a better way, so we’re introducing PhotoScan, a brand new, standalone app from Google Photos that easily scans just about any photo, free, from anywhere. Get it today for Android and iOS.


PhotoScan gets you great looking digital copies in seconds - it detects edges, straightens the image, rotates it to the correct orientation, and removes glare. Scanned photos can be saved in one tap to Google Photos to be organized, searchable, shared, and safely backed up at high quality -- for free.  

See how the PhotoScan technology works behind the scenes by watching this video from our friends Nat & Lo.

Pro edits, no pro needed

After all that time in the attic, your photos might need a few polishes. Or you might even want to edit that selfie from this morning. Getting the right look can take a lot of time and with so many editing tools it’s tough to know where to begin.

Today we’re rolling out three easy ways to get great looking photos in Google Photos: a new and improved auto enhance, unique new looks, and advanced editing tools. Open a photo and then tap the pencil icon to start editing. First, for auto enhance, just select Auto, and see instant enhancements a pro editor might make - like balancing exposure and saturation to bring out the details.

Second, our 12 new looks take style to the next level. These unique looks make edits based on the individual photo and its brightness, darkness, warmth, or saturation, before applying the style. All looks use machine intelligence to complement the content of your photo, and choosing one is just a matter of taste.
Third, our advanced editing controls for Light and Color allow you to fine tune your photos, including highlights, shadows, and warmth. Deep Blue is particularly good for images of sea and sky where the color blue is the focal point.
The Google Photos app with the new photo editor will begin rolling out today across Android, iOS and the web. Just in time for your next set of holiday memories.
Posted by Jingyu Cui, Software Engineer, Google Photos