Tag Archives: Video

Video Understanding Using Temporal Cycle-Consistency Learning



In the last few years there has been great progress in the field of video understanding. For example, supervised learning and powerful deep learning models can be used to classify a number of possible actions in videos, summarizing the entire clip with a single label. However, there exist many scenarios in which we need more than just one label for the entire clip. For example, if a robot is pouring water into a cup, simply recognizing the action of “pouring a liquid” is insufficient to predict when the water will overflow. For that, it is necessary to track frame-by-frame the amount of water in the cup as it is being filled. Similarly, a baseball coach who is comparing stances of pitchers may want to retrieve video frames from the precise moment that the ball leaves the pitchers’ hands. Such applications require models to understand each frame of a video.

However, applying supervised learning to understand each individual frame in a video is expensive, since per-frame labels in videos of the action of interest are needed. This requires that annotators apply fine-grained labels to videos by manually adding unambiguous labels to every frame in each video. Only then can the model be trained, and only on a single action. Training on new actions requires the process to be repeated. With the increasing demand for fine-grained labeling, necessary for applications ranging from robotics to sports analytics, this makes the need for scalable learning algorithms that can understand videos without the tedious labeling process increasingly pertinent.

We propose a potential solution using a self-supervised learning method called Temporal Cycle-Consistency Learning (TCC). This novel approach uses correspondences between examples of similar sequential processes to learn representations particularly well-suited for fine-grained temporal understanding of videos. We are also releasing our TCC codebase to enable end-users to apply our self-supervised learning algorithm to new and novel applications.

Representation Learning Using TCC
A plant growing from a seedling to a tree; the daily routine of getting up, going to work and coming back home; or a person pouring themselves a glass of water are all examples of events that happen in a particular order. Videos capturing such processes provide temporal correspondences across multiple instances of the same process. For example, when pouring a drink one could be reaching for a teapot, a bottle of wine, or a glass of water to pour from. Key moments are common to all pouring videos (e.g., the first touch to the container or the container being lifted from the ground) and exist independent of many varying factors, such as visual changes in viewpoint, scale, container style, or the speed of the event. TCC attempts to find such correspondences across videos of the same action by leveraging the principle of cycle-consistency, which has been applied successfully in many problems in computer vision, to learn useful visual representations by aligning videos.

The objective of this training algorithm is to learn a frame encoder, using any network architecture that processes images, such as ResNet. To do so, we pass all frames of the videos to be aligned through the encoder to produce their corresponding embeddings. We then select two videos for TCC learning, say video 1 (the reference video) and video 2. A reference frame is chosen from video 1 and its nearest neighbor frame (NN2) from video 2 is found in the embedding space (not pixel space). We then cycle back by finding the nearest neighbor of NN2 in video 1, which we call NN1. If the representations are cycle-consistent, then the nearest neighbor frame in video 1 (NN1) should refer back to the starting reference frame.
We train the embedder using the distance between the starting reference frame and NN1 as the training signal. As training proceeds, the embeddings improve and reduce the cycle-consistency loss by developing a semantic understanding of each video frame in the context of the action being performed.
Using TCC, we learn embeddings with temporally fine-grained understanding of an action by aligning related videos.
What Does TCC Learn?
In the following figure, we show a model trained using TCC on videos from the Penn Action Dataset of people performing squat exercises. Each point on the left corresponds to frame embeddings, with the highlighted points tracking the embedding of the current video frame. Notice how the embeddings move collectively in spite of many differences in pose, lighting, body and object type. TCC embeddings encode the different phases of squatting without being provided explicit labels.
Right: Input videos of people performing a squat exercise. The video on the top left is the reference. The other videos show nearest neighbor frames (in the TCC embedding space) from other videos of people doing squats. Left: The corresponding frame embeddings move as the action is performed.
Applications of TCC
The learned per-frame embeddings enable an array of interesting applications:
  • Few-shot action phase classification
    When few labeled videos are available for training, the few-shot scenario, TCC performs very well. In fact, TCC can classify the phases of different actions with as few as a single labeled video. In the next figure we compare to other supervised and self-supervised learning approaches in the few-shot setting. We find that supervised learning requires about 50 videos with each frame labeled to achieve the same accuracy that self-supervised methods achieve with just one fully labeled video.
    Comparison of self-supervised and supervised learning for few-shot action phase classification.
  • Unsupervised video alignment
    Aligning or synchronizing videos manually becomes prohibitively difficult as the number of videos increases. Using TCC, many videos can be aligned by selecting the nearest neighbor to each frame in a reference video, without the need for additional labels, as demonstrated in the figure below.
    Results of unsupervised video alignment on videos of people pitching baseball using the distance between frames in the TCC space. The reference video used for alignment is shown in the upper left panel.
  • Label/modality transfer between videos
    Just as TCC finds similar frames by using a nearest neighbor search in the embedding space, it can transfer metadata associated with any frame in one video to its matching frame in another video. This metadata can be in the form of temporal semantic labels or other modalities, such as sound or text. In the video below we show two examples where we can transfer the sound of liquid being poured into a cup from one video to another.
  • Per-frame Retrieval
    With TCC, each frame in a video can be used as a query for retrieval of similar frames by looking up the nearest neighbors in the learned embedding space. The embeddings are powerful enough to differentiate between frames that look quite similar, such as frames just before or after the release of a bowling ball.
    We can perform retrieval from videos on a per-frame basis, i.e., any frame can be used to look up similar frames in a large collection of videos. The retrieved nearest neighbors show that the model captures fine-grained differences in the scene.
Release
We are releasing our codebase, which includes implementations of a number of state-of-the-art self-supervised learning methods, including TCC. This codebase will be useful for researchers working on video understanding, as well as artists looking to use machine learning to align videos to create mosaics of people, animals, and objects moving synchronously.

Acknowledgements
This is joint work with Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. The authors would like to thank Alexandre Passos, Allen Lavoie, Anelia Angelova, Bryan Seybold, Priya Gupta, Relja Arandjelović, Sergio Guadarrama, Sourish Chaudhuri, and Vincent Vanhoucke for their help with this project. The videos used in this project come from the PennAction dataset. We thank the creators of PennAction for curating such an interesting dataset.

Source: Google AI Blog


Introducing the Indexing API and structured data for livestreams

Over the past few years, it's become easier than ever to stream live videos online, from celebrity updates to special events. But it's not always easy for people to determine which videos are live and know when to tune in.
Today, we're introducing new tools to help more people discover your livestreams in Search and Assistant. With livestream structured data and the Indexing API, you can let Google know when your video is live, so it will be eligible to appear with a red "live" badge:

Add livestream structured data to your page

If your website streams live videos, use the livestream developer documentation to flag your video as a live broadcast and mark the start and end times. In addition, VideoObject structured data is required to tell Google that there's a video on your page.

Update Google quickly with the Indexing API

The Indexing API now supports pages with livestream structured data. We encourage you to call the Indexing API to request that your site is crawled in time for the livestream. We recommend calling the Indexing API when your livestream begins and ends, and if the structured data changes.
For more information, visit our developer documentation. If you have any questions, ask us in the Webmaster Help Forum. We look forward to seeing your live videos on Google!

Changes to the URL Performance Report for YouTube video placements

What's changing?
The URL_PERFORMANCE_REPORT in the AdWords API will exclude information for YouTube video placements starting October 30, 2018, in keeping with our data retention policies. As a result, placements where the Url field has a domain of www.youtube.com will no longer appear in the report. New and improved placement reports will be available in one of the upcoming releases of the new Google Ads API.

What you should do
Review your application and workflows and make the necessary changes to ensure that the exclusion of video placements in this report will not cause problems. Watch this blog for updates regarding new placement reports in the Google Ads API.

If you have any questions or need help, please contact us via the forum.

Verifying your Google Assistant media action integrations on Android

Posted by Nevin Mital, Partner Developer Relations

The Media Controller Test (MCT) app is a powerful tool that allows you to test the intricacies of media playback on Android, and it's just gotten even more useful. Media experiences including voice interactions via the Google Assistant on Android phones, cars, TVs, and headphones, are powered by Android MediaSession APIs. This tool will help you verify your integrations. We've now added a new verification testing framework that can be used to help automate your QA testing.

The MCT is meant to be used in conjunction with an app that implements media APIs, such as the Universal Android Music Player. The MCT surfaces information about the media app's MediaController, such as the PlaybackState and Metadata, and can be used to test inter-app media controls.

The Media Action Lifecycle can be complex to follow; even in a simple Play From Search request, there are many intermediate steps (simplified timeline depicted below) where something could go wrong. The MCT can be used to help highlight any inconsistencies in how your music app handles MediaController TransportControl requests.

Timeline of the interaction between the User, the Google Assistant, and the third party Android App for a Play From Search request.

Previously, using the MCT required a lot of manual interaction and monitoring. The new verification testing framework offers one-click tests that you can run to ensure that your media app responds correctly to a playback request.

Running a verification test

To access the new verification tests in the MCT, click the Test button next to your desired media app.

MCT Screenshot of launch screen; contains a list of installed media apps, with an option to go to either the Control or Test view for each.

The next screen shows you detailed information about the MediaController, for example the PlaybackState, Metadata, and Queue. There are two buttons on the toolbar in the top right: the button on the left toggles between parsable and formatted logs, and the button on the right refreshes this view to display the most current information.

MCT Screenshot of the left screen in the Testing view for UAMP; contains information about the Media Controller's Playback State, Metadata, Repeat Mode, Shuffle Mode, and Queue.

By swiping to the left, you arrive at the verification tests view, where you can see a scrollable list of defined tests, a text field to enter a query for tests that require one, and a section to display the results of the test.

MCT Screenshot of the right screen in the Testing view for UAMP; contains a list of tests, a query text field, and a results display section.

As an example, to run the Play From Search Test, you can enter a search query into the text field then hit the Run Test button. Looks like the test succeeded!

MCT Screenshot of the right screen in the Testing view for UAMP; the Play From Search test was run with the query 'Memories' and ended successfully.

Below are examples of the Pause Test (left) and Seek To test (right).

MCT Screenshot of the right screen in the Testing view for UAMP; a Pause test was run successfully. MCT Screenshot of the right screen in the Testing view for UAMP; a Seek To test was run successfully.

Android TV

The MCT now also works on Android TV! For your media app to work with the Android TV version of the MCT, your media app must have a MediaBrowserService implementation. Please see here for more details on how to do this.

On launching the MCT on Android TV, you will see a list of installed media apps. Note that an app will only appear in this list if it implements the MediaBrowserService.

Android TV MCT Screenshot of the launch screen; contains a list of installed media apps that implement the MediaBrowserService.

Selecting an app will take you to the testing screen, which will display a list of verification tests on the right.

Android TV MCT Screenshot of the testing screen; contains a list of tests on the right side.

Running a test will populate the left side of the screen with selected MediaController information. For more details, please check the MCT logs in Logcat.

Android TV MCT Screenshot of the testing screen; the Pause test was run successfully and the left side of the screen now displays selected MediaController information.

Tests that require a query are marked with a keyboard icon. Clicking on one of these tests will open an input field for the query. Upon hitting Enter, the test will run.

Android TV MCT Screenshot of the testing screen; clicking on the Seek To test opened an input field for the query.

To make text input easier, you can also use the ADB command:

adb shell input text [query]

Note that '%s' will add a space between words. For example, the command adb shell input text hello%sworld will add the text "hello world" to the input field.

What's next

The MCT currently includes simple single-media-action tests for the following requests:

  • Play
  • Play From Search
  • Play From Media ID
  • Play From URI
  • Pause
  • Stop
  • Skip To Next
  • Skip To Previous
  • Skip To Queue Item
  • Seek To

For a technical deep dive on how the tests are structured and how to add more tests, visit the MCT GitHub Wiki. We'd love for you to submit pull requests with more tests that you think are useful to have and for any bug fixes. Please make sure to review the contributions process for more information.

Check out the latest updates on GitHub!

Automatic Photography with Google Clips



To me, photography is the simultaneous recognition, in a fraction of a second, of the significance of an event as well as of a precise organization of forms which give that event its proper expression.
Henri Cartier-Bresson

The last few years have witnessed a Cambrian-like explosion in AI, with deep learning methods enabling computer vision algorithms to recognize many of the elements of a good photograph: people, smiles, pets, sunsets, famous landmarks and more. But, despite these recent advancements, automatic photography remains a very challenging problem. Can a camera capture a great moment automatically?

Recently, we released Google Clips, a new, hands-free camera that automatically captures interesting moments in your life. We designed Google Clips around three important principles:
  • We wanted all computations to be performed on-device. In addition to extending battery life and reducing latency, on-device processing means that none of your clips leave the device unless you decide to save or share them, which is a key privacy control.
  • We wanted the device to capture short videos, rather than single photographs. Moments with motion can be more poignant and true-to-memory, and it is often easier to shoot a video around a compelling moment than it is to capture a perfect, single instant in time.
  • We wanted to focus on capturing candid moments of people and pets, rather than the more abstract and subjective problem of capturing artistic images. That is, we did not attempt to teach Clips to think about composition, color balance, light, etc.; instead, Clips focuses on selecting ranges of time containing people and animals doing interesting activities.
Learning to Recognize Great Moments
How could we train an algorithm to recognize interesting moments? As with most machine learning problems, we started with a dataset. We created a dataset of thousands of videos in diverse scenarios where we imagined Clips being used. We also made sure our dataset represented a wide range of ethnicities, genders, and ages. We then hired expert photographers and video editors to pore over this footage to select the best short video segments. These early curations gave us examples for our algorithms to emulate. However, it is challenging to train an algorithm solely from the subjective selection of the curators — one needs a smooth gradient of labels to teach an algorithm to recognize the quality of content, ranging from "perfect" to "terrible."

To address this problem, we took a second data-collection approach, with the goal of creating a continuous quality score across the length of a video. We split each video into short segments (similar to the content Clips captures), randomly selected pairs of segments, and asked human raters to select the one they prefer.
We took this pairwise comparison approach, instead of having raters score videos directly, because it is much easier to choose the better of a pair than it is to specify a number. We found that raters were very consistent in pairwise comparisons, and less so when scoring directly. Given enough pairwise comparisons for any given video, we were able to compute a continuous quality score over the entire length. In this process, we collected over 50,000,000 pairwise comparisons on clips sampled from over 1,000 videos. That’s a lot of human effort!
Training a Clips Quality Model
Given this quality score training data, our next step was to train a neural network model to estimate the quality of any photograph captured by the device. We started with the basic assumption that knowing what’s in the photograph (e.g., people, dogs, trees, etc.) will help determine “interestingness”. If this assumption is correct, we could learn a function that uses the recognized content of the photograph to predict its quality score derived above from human comparisons.

To identify content labels in our training data, we leveraged the same Google machine learning technology that powers Google image search and Google Photos, which can recognize over 27,000 different labels describing objects, concepts, and actions. We certainly didn’t need all these labels, nor could we compute them all on device, so our expert photographers selected the few hundred labels they felt were most relevant to predicting the “interestingness” of a photograph. We also added the labels most highly correlated with the rater-derived quality scores.

Once we had this subset of labels, we then needed to design a compact, efficient model that could predict them for any given image, on-device, within strict power and thermal limits. This presented a challenge, as the deep learning techniques behind computer vision typically require strong desktop GPUs, and algorithms adapted to run on mobile devices lag far behind state-of-the-art techniques on desktop or cloud. To train this on-device model, we first took a large set of photographs and again used Google’s powerful, server-based recognition models to predict label confidence for each of the “interesting” labels described above. We then trained a MobileNet Image Content Model (ICM) to mimic the predictions of the server-based model. This compact model is capable of recognizing the most interesting elements of photographs, while ignoring non-relevant content.

The final step was to predict a single quality score for an input photograph from its content predicted by the ICM, using the 50M pairwise comparisons as training data. This score is computed with a piecewise linear regression model that combines the output of the ICM into a frame quality score. This frame quality score is averaged across the video segment to form a moment score. Given a pairwise comparison, our model should compute a moment score that is higher for the video segment preferred by humans. The model is trained so that its predictions match the human pairwise comparisons as well as possible.
Diagram of the training process for generating frame quality scores. Piecewise linear regression maps from an ICM embedding to a score which, when averaged across a video segment, yields a moment score. The moment score of the preferred segment should be higher.
This process allowed us to train a model that combines the power of Google image recognition technology with the wisdom of human raters–represented by 50 million opinions on what makes interesting content!

While this data-driven score does a great job of identifying interesting (and non-interesting) moments, we also added some bonuses to our overall quality score for phenomena that we know we want Clips to capture, including faces (especially recurring and thus “familiar” ones), smiles, and pets. In our most recent release, we added bonuses for certain activities that customers particularly want to capture, such as hugs, kisses, jumping, and dancing. Recognizing these activities required extensions to the ICM model.

Shot Control
Given this powerful model for predicting the “interestingness” of a scene, the Clips camera can decide which moments to capture in real-time. Its shot control algorithms follow three main principles:
  1. Respect Power & Thermals: We want the Clips battery to last roughly three hours, and we don’t want the device to overheat — the device can’t run at full throttle all the time. Clips spends much of its time in a low-power mode that captures one frame per second. If the quality of that frame exceeds a threshold set by how much Clips has recently shot, it moves into a high-power mode, capturing at 15 fps. Clips then saves a clip at the first quality peak encountered.
  2. Avoid Redundancy: We don’t want Clips to capture all of its moments at once, and ignore the rest of a session. Our algorithms therefore cluster moments into visually similar groups, and limit the number of clips in each cluster.
  3. The Benefit of Hindsight: It’s much easier to determine which clips are the best when you can examine the totality of clips captured. Clips therefore captures more moments than it intends to show to the user. When clips are ready to be transferred to the phone, the Clips device takes a second look at what it has shot, and only transfers the best and least redundant content.
Machine Learning Fairness
In addition to making sure our video dataset represented a diverse population, we also constructed several other tests to assess the fairness of our algorithms. We created controlled datasets by sampling subjects from different genders and skin tones in a balanced manner, while keeping variables like content type, duration, and environmental conditions constant. We then used this dataset to test that our algorithms had similar performance when applied to different groups. To help detect any regressions in fairness that might occur as we improved our moment quality models, we added fairness tests to our automated system. Any change to our software was run across this battery of tests, and was required to pass. It is important to note that this methodology can’t guarantee fairness, as we can’t test for every possible scenario and outcome. However, we believe that these steps are an important part of our long-term work to achieve fairness in ML algorithms.

Conclusion
Most machine learning algorithms are designed to estimate objective qualities – a photo contains a cat, or it doesn’t. In our case, we aim to capture a more elusive and subjective quality – whether a personal photograph is interesting, or not. We therefore combine the objective, semantic content of photographs with subjective human preferences to build the AI behind Google Clips. Also, Clips is designed to work alongside a person, rather than autonomously; to get good results, a person still needs to be conscious of framing, and make sure the camera is pointed at interesting content. We’re happy with how well Google Clips performs, and are excited to continue to improve our algorithms to capture that “perfect” moment!

Acknowledgements
The algorithms described here were conceived and implemented by a large group of Google engineers, research scientists, and others. Figures were made by Lior Shapira. Thanks to Lior and Juston Payne for video content.

Source: Google AI Blog


Introducing the Webmaster Video Series, now in Hindi

Google offers a broad range of resources, in multiple languages, to help you better understand your website and improve its performance. The recently released Search Engine Optimization (SEO) Starter Guide, the Help Center, the Webmaster forums (which are available in 16 languages), and the various Webmaster blogs are just a few of them.
A few months ago, we launched the SEO Snippets video series, where the Google team answered some of the webmaster and SEO questions that we regularly see on the Webmaster Central Help Forum. We are now launching a similar series in Hindi, called the SEO Snippets in Hindi.

IFrom deciding what language to create content in (Hindi vs. Hinglish) to duplicate content, we’re answering the most frequently asked questions on the Hindi Webmaster forum and the India Webmaster community on Google+, in Hindi.
Check out the links shared in the videos to get more helpful webmaster information, drop by our help forum and subscribe to our YouTube channel for more tips and insights!

Update to Engagement Reporting for Bumper Ads

Historically, AdWords API reporting has not included engagements for bumper ads. Bumper ads are video ads that are 6 seconds or shorter, appear at the beginning of a YouTube video, and can't be skipped.

Bumper ads support “drawer open” engagements, where a user can mouse over the ad to expand a widget with more information. These engagements were previously not included in the Engagements and EngagementRate fields in reports. Starting in mid-February 2018, we are going to be changing this behavior for all historical and future bumper ad reporting to include these engagements. This brings bumper ads in line with other types of video ads, which already reported these engagements.

This means that your historical reporting data, starting up to two years ago in January 2016, will be updated to include this statistic to bring it inline with future data.

If you have any questions about this migration, please contact us via the forum.

Introducing the new Webmaster Video Series

Google has a broad range of resources to help you better understand your website and improve its performance. This Webmaster Central Blog, the Help Center, the Webmaster forum, and the recently released Search Engine Optimization (SEO) Starter Guide are just a few.

We also have a YouTube channel, for answers to your questions in video format. To help with short & to the point answers to specific questions, we've just launched a new series, which we call SEO Snippets.

In this series of short videos, the Google team will be answering some of the webmaster and SEO questions that we regularly see on the Webmaster Central Help Forum. From 404 errors, how and when crawling works, a site's URL structure, to duplicate content, we'll have something here for you.

Check out the links shared in the videos to get more helpful webmaster information, drop by our help forum and subscribe to our YouTube channel for more tips and insights!


Gmail Add-ons framework now available to all developers

Originally posted by Wesley Chun, G Suite Developer Advocate on the G Suite Blog

Email remains at the heart of how companies operate. That's why earlier this year, we previewed Gmail Add-ons—a way to help businesses speed up workflows. Since then, we've seen partners build awesome applications, and beginning today, we're extending the Gmail add-on preview to include all developers. Now anyone can start building a Gmail add-on.

Gmail Add-ons let you integrate your app into Gmail and extend Gmail to handle quick actions.

They are built using native UI context cards that can include simple text dialogs, images, links, buttons and forms. The add-on appears when relevant, and the user is just a click away from your app's rich and integrated functionality.

Gmail Add-ons are easy to create. You only have to write code once for your add-on to work on both web and mobile, and you can choose from a rich palette of widgets to craft a custom UI. Create an add-on that contextually surfaces cards based on the content of a message. Check out this video to see how we created an add-on to collate email receipts and expedite expense reporting.

Per the video, you can see that there are three components to the app's core functionality. The first component is getContextualAddOn()—this is the entry point for all Gmail Add-ons where data is compiled to build the card and render it within the Gmail UI. Since the add-on is processing expense reports from email receipts in your inbox, the createExpensesCard()parses the relevant data from the message and presents them in a form so your users can confirm or update values before submitting. Finally, submitForm()takes the data and writes a new row in an "expenses" spreadsheet in Google Sheets, which you can edit and tweak, and submit for approval to your boss.

Check out the documentation to get started with Gmail Add-ons, or if you want to see what it's like to build an add-on, go to the codelab to build ExpenseItstep-by-step. While you can't publish your add-on just yet, you can fill out this form to get notified when publishing is opened. We can't wait to see what Gmail Add-ons you build!

Gmail add-ons framework now available to all developers



Email remains at the heart of how companies operate. That’s why earlier this year, we previewed Gmail Add-ons—a way to help businesses speed up workflows. Since then, we’ve seen partners build awesome applications, and beginning today, we’re extending the Gmail add-on preview to include all developers. Now anyone can start building a Gmail add-on.

Gmail Add-ons let you integrate your app into Gmail and extend Gmail to handle quick actions. They are built using native UI context cards that can include simple text dialogs, images, links, buttons and forms. The add-on appears when relevant, and the user is just a click away from your app's rich and integrated functionality.

Gmail Add-ons are easy to create. You only have to write code once for your add-on to work on both web and mobile, and you can choose from a rich palette of widgets to craft a custom UI. Create an add-on that contextually surfaces cards based on the content of a message. Check out this video to see how we created an add-on to collate email receipts and expedite expense reporting.

Per the video, you can see that there are three components to the app’s core functionality. The first component is getContextualAddOn()—this is the entry point for all Gmail Add-ons where data is compiled to build the card and render it within the Gmail UI. Since the add-on is processing expense reports from email receipts in your inbox, the createExpensesCard()parses the relevant data from the message and presents them in a form so your users can confirm or update values before submitting. Finally, submitForm() takes the data and writes a new row in an “expenses” spreadsheet in Google Sheets, which you can edit and tweak, and submit for approval to your boss.

Check out the documentation to get started with Gmail Add-ons, or if you want to see what it's like to build an add-on, go to the codelab to build ExpenseIt step-by-step. While you can't publish your add-on just yet, you can fill out this form to get notified when publishing is opened. We can’t wait to see what Gmail Add-ons you build!