Tag Archives: Video

Celebrating 10 years of WebM and WebRTC

Originally posted on the Chromium Blog

Ten years ago, Google planted the seeds for two foundational web media technologies, hoping they would provide the roots for a more vibrant internet. Two acquisitions, On2 Technologies and Global IP Solutions, led to a pair of open source projects: the WebM Project, a family of cutting-edge video compression technologies (codecs) offered by Google royalty-free, and the WebRTC Project, which builds APIs for real-time voice and video communication on the web.

These initiatives were major technical endeavors, essential infrastructure for enabling the promise of HTML5 with support for video conferencing and streaming. But this was also a philosophical evolution for media as Product Manager Mike Jazayeri noted in his blog post hailing the launch of the WebM Project:
“A key factor in the web’s success is that its core technologies such as HTML, HTTP, TCP/IP, etc. are open and freely implementable.”
As emerging first-class participants in the web experience, media and communication components also had to be free and open.

A decade later, these principles have ensured compression and communication technologies capable of keeping pace with a web ecosystem characterized by exponential growth of media consumption, devices, and demand. Starting from VP8 in 2010, the WebM Project has delivered up to 50% video bitrate savings with VP9 in 2013 and an additional 30% with AV1 in 2018—with adoption by YouTube, Facebook, Netflix, Twitch, and more. Equally importantly, the WebM team co-founded the Alliance for Open Media, which has brought together the IP of more than 40 major tech companies in support of open and free codecs. With Chrome, Edge, Firefox, and Safari supporting WebRTC, more than 85% of all installed browsers globally have become a client for real-time communications on the Internet. WebRTC has become a stable standard, and it is now the default solution for video calling on the Web. These technologies have succeeded together: today, over 90% of encoded WebRTC video in Chrome uses VP8 or VP9.

The need for these technologies has been highlighted by COVID-19, as people across the globe have found new ways to work, educate, and connect with loved ones via video chat. The compression of open codecs has been essential to keeping services running on limited bandwidth, with over a billion hours of VP9 and AV1 content viewed every day. WebRTC has allowed for an ecosystem of interoperable communications apps to flourish: since the beginning of March 2020, we have seen in Chrome a 13X increase in received video streams via WebRTC.

These successes would not have been possible without all the supporters that make an open source community. Thank you to all the code contributors, testers, bug filers, and corporate partners who helped make this ecosystem a reality. A decade in, Google remains as committed as ever to open media on the web. We look forward to continuing that work with all of you in the next decade and beyond.

By Matt Frost, Product Director Chrome Media and Niklas Blum, Senior Product Manager WebRTC

AutoFlip: An Open Source Framework for Intelligent Video Reframing

Originally posted on the AI Blog

Videos filmed and edited for television and desktop are typically created and viewed in landscape aspect ratios (16:9 or 4:3). However, with an increasing number of users creating and consuming content on mobile devices, these historical aspect ratios don’t always fit the display being used for viewing. Traditional approaches for reframing video to a different aspect ratio usually involve static cropping, i.e., specifying a camera viewport and then cropping out the visual content that falls outside it. Unfortunately, these static cropping approaches often lead to unsatisfactory results due to the variety of composition and camera motion styles. More bespoke approaches, however, typically require video curators to manually identify salient content in each frame, track its transitions from frame to frame, and adjust crop regions accordingly throughout the video. This process is often tedious, time-consuming, and error-prone.

To address this problem, we are happy to announce AutoFlip, an open source framework for intelligent video reframing. AutoFlip is built on top of the MediaPipe framework that enables the development of pipelines for processing time-series multimodal data. Taking a video (casually shot or professionally edited) and a target dimension (landscape, square, portrait, etc.) as inputs, AutoFlip analyzes the video content, develops optimal tracking and cropping strategies, and produces an output video with the same duration in the desired aspect ratio.
Left: Original video (16:9). Middle: Reframed using a standard central crop (9:16). Right: Reframed with AutoFlip (9:16). By detecting the subjects of interest, AutoFlip is able to avoid cropping off important visual content.

AutoFlip Overview

AutoFlip provides a fully automatic solution to smart video reframing, making use of state-of-the-art ML-enabled object detection and tracking technologies to intelligently understand video content. AutoFlip detects changes in the composition that signify scene changes in order to isolate scenes for processing. Within each shot, video analysis is used to identify salient content before the scene is reframed by selecting a camera mode and path optimized for the contents.

Shot (Scene) Detection

A scene or shot is a continuous sequence of video without cuts (or jumps). To detect the occurrence of a shot change, AutoFlip computes the color histogram of each frame and compares this with prior frames. If the distribution of frame colors changes at a different rate than a sliding historical window, a shot change is signaled. AutoFlip buffers the video until the scene is complete before making reframing decisions, in order to optimize the reframing for the entire scene.
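To make the mechanism concrete, here is a minimal sketch of histogram-based shot-change detection using OpenCV and NumPy. It is not the AutoFlip implementation; the histogram size, distance measure, and threshold rule are illustrative assumptions.

import cv2
import numpy as np
from collections import deque

def detect_shot_changes(video_path, window=30, base_threshold=0.3):
    """Signal a shot change when a frame's color histogram is much farther
    from the previous frame than the recent (sliding-window) history suggests."""
    cap = cv2.VideoCapture(video_path)
    history = deque(maxlen=window)  # recent frame-to-frame histogram distances
    prev_hist, shot_changes, frame_idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if history and dist > base_threshold + np.mean(history) + 3 * np.std(history):
                shot_changes.append(frame_idx)  # color distribution jumped: new shot
            history.append(dist)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return shot_changes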

Video Content Analysis

We utilize deep learning-based object detection models to find interesting, salient content in the frame. This content typically includes people and animals, but other elements may be identified, depending on the application, including text overlays and logos for commercials, or motion and ball detection for sports.

The face and object detection models are integrated into AutoFlip through MediaPipe, which uses TensorFlow Lite on CPU. This structure allows AutoFlip to be extensible, so developers may conveniently add new detection algorithms for different use cases and video content. Each object type is associated with a weight value, which defines its relative importance — the higher the weight, the more influence the feature will have when computing the camera path.
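As a rough illustration of how per-type weights might shape the camera path, the sketch below collapses a frame's detections into a single weighted focus point. The detection types and weight values are hypothetical, not AutoFlip's configuration.

from dataclasses import dataclass

# Illustrative weights; AutoFlip's real defaults live in its configuration files.
FEATURE_WEIGHTS = {"face_core": 1.0, "face_all": 0.9, "human": 0.8,
                   "pet": 0.7, "text": 0.5, "logo": 0.3}

@dataclass
class Detection:
    kind: str        # e.g. "face_core", "human", "pet"
    x_center: float  # normalized [0, 1] horizontal center of the bounding box
    y_center: float  # normalized [0, 1] vertical center of the bounding box

def weighted_focus_point(detections):
    """Collapse a frame's detections into one weighted focus point; higher-weight
    object types pull the point (and hence the camera path) toward themselves."""
    total = sum(FEATURE_WEIGHTS.get(d.kind, 0.1) for d in detections)
    if total == 0:
        return 0.5, 0.5  # nothing detected: fall back to the frame center
    x = sum(FEATURE_WEIGHTS.get(d.kind, 0.1) * d.x_center for d in detections) / total
    y = sum(FEATURE_WEIGHTS.get(d.kind, 0.1) * d.y_center for d in detections) / total
    return x, y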


Left: People detection on sports footage. Right: Two face boxes (‘core’ and ‘all’ face landmarks). In narrow portrait crop cases, often only the core landmark box can fit.

Reframing

After identifying the subjects of interest on each frame, logical decisions about how to reframe the content for a new view can be made. AutoFlip automatically chooses an optimal reframing strategy — stationary, panning or tracking — depending on the way objects behave during the scene (e.g., moving around or stationary). In stationary mode, the reframed camera viewport is fixed in a position where important content can be viewed throughout the majority of the scene. This mode can effectively mimic professional cinematography in which a camera is mounted on a stationary tripod or where post-processing stabilization is applied. In other cases, it is best to pan the camera, moving the viewport at a constant velocity. The tracking mode provides continuous and steady tracking of interesting objects as they move around within the frame.

Based on which of these three reframing strategies the algorithm selects, AutoFlip then determines an optimal cropping window for each frame, while best preserving the content of interest. While the bounding boxes track the objects of focus in the scene, they typically exhibit considerable jitter from frame-to-frame and, consequently, are not sufficient to define the cropping window. Instead, we adjust the viewport on each frame through the process of Euclidean-norm optimization, in which we minimize the residuals between a smooth (low-degree polynomial) camera path and the bounding boxes.
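A minimal NumPy sketch of this idea: fit a low-degree polynomial to the noisy per-frame box centers by least squares (i.e., minimizing the Euclidean-norm residuals) and use the fitted curve as the camera path. AutoFlip's actual solver is more involved; this only illustrates the smoothing step.

import numpy as np

def smooth_camera_path(box_centers, degree=2):
    """Fit a low-degree polynomial to noisy per-frame focus positions by least
    squares, minimizing the Euclidean-norm residuals between the smooth path
    and the raw bounding-box centers (one normalized value per frame)."""
    t = np.arange(len(box_centers))
    coeffs = np.polyfit(t, box_centers, deg=degree)  # least-squares polynomial fit
    return np.polyval(coeffs, t)                     # smoothed viewport position per frame

# Example: jittery tracking data becomes a gentle, constant-velocity pan.
raw = 0.4 + 0.2 * np.linspace(0, 1, 90) + 0.02 * np.random.randn(90)
smooth = smooth_camera_path(raw, degree=1)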

Top: Camera paths resulting from following the bounding boxes from frame-to-frame. Bottom: Final smoothed camera paths generated using Euclidean-norm path formation. Left: Scene in which objects are moving around, requiring a tracking camera path. Right: Scene where objects stay close to the same position; a stationary camera covers the content for the full duration of the scene.

AutoFlip’s configuration graph provides settings for either best-effort or required reframing. If it becomes infeasible to cover all the required regions (for example, when they are too spread out on the frame), the pipeline will automatically switch to a less aggressive strategy by applying a letterbox effect, padding the image to fill the frame. For cases where the background is detected as being a solid color, this color is used to create seamless padding; otherwise a blurred version of the original frame is used.
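The fallback can be sketched as follows, assuming OpenCV: if the border pixels look like a solid color, pad with that color; otherwise pad with a blurred, stretched copy of the frame. The uniformity test and blur size are illustrative choices, not AutoFlip's.

import cv2
import numpy as np

def letterbox(frame, target_w, target_h):
    """Fit `frame` inside a target_w x target_h canvas, padding with the border
    color when the background looks solid, and with a blurred, stretched copy
    of the frame otherwise."""
    h, w = frame.shape[:2]
    scale = min(target_w / w, target_h / h)
    resized = cv2.resize(frame, (int(w * scale), int(h * scale)))
    border = np.concatenate([frame[0], frame[-1], frame[:, 0], frame[:, -1]])
    if border.std(axis=0).max() < 10:  # border pixels are nearly uniform
        canvas = np.full((target_h, target_w, 3), border.mean(axis=0), dtype=np.uint8)
    else:                              # otherwise use a blurred backdrop
        canvas = cv2.GaussianBlur(cv2.resize(frame, (target_w, target_h)), (51, 51), 0)
    y0 = (target_h - resized.shape[0]) // 2
    x0 = (target_w - resized.shape[1]) // 2
    canvas[y0:y0 + resized.shape[0], x0:x0 + resized.shape[1]] = resized
    return canvas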

AutoFlip Use Cases

We are excited to release this tool directly to developers and filmmakers, reducing the barriers to their design creativity and reach through the automation of video editing. The ability to adapt any video format to various aspect ratios is becoming increasingly important as the diversity of devices for video content consumption continues to rapidly increase. Whether your use case is portrait to landscape, landscape to portrait, or even small adjustments like 4:3 to 16:9, AutoFlip provides a solution for intelligent, automated and adaptive video reframing.


What’s Next?

Like any machine learning algorithm, AutoFlip can benefit from an improved ability to detect objects relevant to the intent of the video, such as speaker detection for interviews or animated face detection on cartoons. Additionally, a common issue arises when input video has important overlays on the edges of the screen (such as text or logos) as they will often be cropped from the view. By combining text/logo detection and image inpainting technology, we hope that future versions of AutoFlip can reposition foreground objects to better fit the new aspect ratios. Lastly, in situations where padding is required, deep uncrop technology could provide improved ability to expand beyond the original viewable area.

While we work to improve AutoFlip internally at Google, we encourage contributions from developers and filmmakers in the open source communities.

Acknowledgments

We would like to thank our colleagues who contributed to AutoFlip: Alexander Panagopoulos, Jenny Jin, Brian Mulford, Yuan Zhang, Alex Chen, Xue Yang, Mickey Wang, Justin Parra, Hartwig Adam, Jingbin Wang, and Weilong Yang; and the MediaPipe team who helped with open sourcing: Jiuqiang Tang, Tyler Mullen, Mogan Shieh, Ming Guang Yong, and Chuo-Ling Chang.

By Nathan Frey, Senior Software Engineer, Google Research, Los Angeles and Zheng Sun, Senior Software Engineer, Google Research, Mountain View

Video Series for New Webmasters: Search for Beginners!

We are excited to introduce our newest video series: “Search For Beginners”! The series was created primarily to help new webmasters. It is also for anyone with an interest in Search or anyone who is still learning about the Web and how to manage their online presence.

We love to see the webmaster community grow! Every day, countless new webmasters take their first steps in learning how Search works and how to make their websites perform well and be discoverable on Search. We understand that it can sometimes be challenging, or even overwhelming, to start with our existing content without prior knowledge or a basic understanding of the Web. We find that the basic videos on our YouTube channels are the ones with the most views. At the same time, advanced webmasters also see the need for content that can be sent to clients or stakeholders to help explain important concepts in managing an online presence.

We want to help all webmasters succeed, regardless of whether you have been managing websites for many years or you just started out yesterday. We want to do more to help new webmasters, and this video series will hopefully help us achieve that.

Introduction to the series:

Episode 1:

The “Search For Beginners” video series covers basic online presence topics, ranging from ‘Do you need a website?’ and ‘What are the goals for your website?’ to more organic search-related topics such as ‘How does Google Search work?’, ‘How to change your description line’, or ‘How to fix wrong address information on Google’. We get asked these questions frequently in forums, social channels, and at events around the world! The videos are fully animated and presented in English, with subtitles available in Spanish, Portuguese, Korean, Chinese, Indonesian, Italian, Japanese, and English. We are working on more, so please stay tuned!

And if you consider yourself a more experienced user, please feel free to use these videos to support your pitches or to explain things to your clients. If you want to share any ideas or learnings, please leave them in the comment section of each video so that others can benefit from your knowledge and experience.

Follow us on Twitter and subscribe on YouTube for the upcoming videos! We will be adding new videos in this series to this playlist about every two weeks!

Audio and Visual Quality Measurement using Fréchet Distance



"I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.”
    William Thomson (Lord Kelvin), Lecture on "Electrical Units of Measurement" (3 May 1883), published in Popular Lectures Vol. I, p. 73
The rate of scientific progress in machine learning has often been determined by the availability of good datasets, and metrics. In deep learning, benchmark datasets such as ImageNet or Penn Treebank were among the driving forces that established deep artificial neural networks for image recognition and language modeling. Yet, while the available “ground-truth” datasets lend themselves nicely as measures of performance on these prediction tasks, determining the “ground-truth” for comparison to generative models is not so straightforward. Imagine a model that generates videos of StarCraft video game sequences — how does one determine which model is best? Clearly some of the videos shown below look more realistic than others, but can the differences between them be quantified? Access to robust metrics for evaluation of generative models is crucial for measuring (and making) progress in the fields of audio and video understanding, but currently no such metrics exist.
Videos generated from various models trained on sequences from the StarCraft Video (SCV) dataset.
In “Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms” and “Towards Accurate Generative Models of Video: A New Metric & Challenges”, we present two such metrics — the Fréchet Audio Distance (FAD) and Fréchet Video Distance (FVD). We document our large-scale human evaluations using 10k video and 69k audio clip pairwise comparisons that demonstrate high correlations between our metrics and human perception. We are also releasing the source code for both Fréchet Video Distance and Fréchet Audio Distance on GitHub (FVD; FAD).

General Description of Fréchet Distance
The goal of a generative model is to learn to produce samples that look similar to the ones on which it has been trained, such that it knows what properties and features are likely to appear in the data, and which ones are unlikely. In other words, a generative model must learn the probability distribution of the training data. In many cases, the target distributions for generative models are very high-dimensional. For example, a single image of 128x128 pixels with 3 color channels has almost 50k dimensions, while a second-long video clip might consist of dozens (or hundreds) of such frames with audio that may have 16,000 samples. Calculating distances between such high dimensional distributions in order to quantify how well a given model succeeds at a task is very difficult. In the case of pictures, one could look at a few samples to gauge visual quality, but doing so for every model trained is not feasible.

In addition, generative adversarial networks (GANs) tend to focus on a few modes of the overall target distribution, while completely ignoring others. For example, they may learn to generate only one type of object or only a select few viewing angles. As a consequence, looking only at a limited number of samples from the model may not indicate whether the network learned the entire distribution successfully. To remedy this, a metric is needed that aligns well with human judgement of quality, while also taking the properties of the target distribution into account.

One common solution for this problem is the so-called Fréchet Inception Distance (FID) metric, which was specifically designed for images. The FID takes a large number of images from both the target distribution and the generative model, and uses the Inception object-recognition network to embed each image into a lower-dimensional space that captures the important features. Then it computes the so-called Fréchet distance between these samples, which is a common way of calculating distances between distributions that provides a quantitative measure of how similar the two distributions actually are.
A key component for both metrics is a pre-trained model that converts the video or audio clip into an N-dimensional embedding.
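For two sets of embeddings, the Fréchet distance between the Gaussians fitted to them has a closed form: ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2(C_r C_g)^(1/2)). Below is a minimal NumPy/SciPy sketch of that computation, assuming you already have embeddings from such a pre-trained model; it illustrates the formula rather than reproducing the released evaluation code.

import numpy as np
from scipy import linalg

def frechet_distance(real_embeddings, generated_embeddings):
    """Fréchet distance between Gaussians fitted to two (num_samples, dim)
    sets of embeddings: ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2))."""
    mu_r, mu_g = real_embeddings.mean(axis=0), generated_embeddings.mean(axis=0)
    cov_r = np.cov(real_embeddings, rowvar=False)
    cov_g = np.cov(generated_embeddings, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from the matrix square root
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))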
Fréchet Audio Distance and Fréchet Video Distance
Building on the principles of FID that have been successfully applied to the image domain, we propose both Fréchet Video Distance (FVD) and Fréchet Audio Distance (FAD). Unlike popular metrics such as peak signal-to-noise ratio or the structural similarity index, FVD looks at videos in their entirety, and thereby avoids the drawbacks of framewise metrics.
Examples of videos of a robot arm, judged by the new FVD metric. FVD values were found to be approximately 2000, 1000, 600, 400, 300 and 150 (left-to-right; top-to-bottom). A lower FVD clearly correlates with higher video quality.
In the audio domain, existing metrics either require a time-aligned ground truth signal, such as source-to-distortion ratio (SDR), or only target a specific domain, like speech quality. FAD on the other hand is reference-free and can be used on any type of audio.

Below is a 2-D visualization of the audio embedding vectors from which we compute the FAD. Each point corresponds to the embedding of a 5-second audio clip, where the blue points are from clean music and the other points represent audio that has been distorted in some way. The estimated multivariate Gaussian distributions are presented as concentric ellipses. As the magnitude of the distortions increases, the overlap between their distributions and that of the clean audio decreases. The separation between these distributions is what the Fréchet distance measures.
In the animation, we can see that as the magnitude of the distortions increases, the Gaussian distributions of the distorted audio overlap less and less with the clean distribution. The magnitude of this separation is what the Fréchet distance is measuring.
Evaluation
It is important for FAD and FVD to closely track human judgement, since that is the gold standard for what looks and sounds “realistic”. So, we performed a large-scale human study to determine how well our new metrics align with qualitative human judgment of generated audio and video. For the study, human raters examined 10,000 video pairs and 69,000 5-second audio clips. For the FAD we asked human raters to compare the effect of two different distortions on the same audio segment, randomizing both the pair of distortions that they compared and the order in which they appeared. The raters were asked “Which audio clip sounds most like a studio-produced recording?” The collected set of pairwise evaluations was then ranked using a Plackett-Luce model, which estimates a worth value for each parameter configuration. Comparison of the worth values to the FAD demonstrates that the FAD correlates quite well with human judgement.
This figure compares the FAD calculated between clean background music and music distorted by a variety of methods (e.g., pitch down, Gaussian noise, etc.) to the associated worth values from human evaluation. Each type of distortion has two data points representing high and low extremes in the distortion applied. The quantization distortion (purple circles), for example, limits the audio to a specific number of bits per sample, where the two data points represent two different bitrates. Both human raters and the FAD assigned higher values (i.e., “less realistic”) to the lower bitrate quantization. Overall log FAD correlates well with human judgement — a perfect correlation between the log FAD and human perception would result in a straight line.
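For pairwise data, the Plackett-Luce model reduces to the Bradley-Terry model, and the worth values can be estimated with a simple majorization-minimization update. The sketch below illustrates the idea on (winner, loser) index pairs; it is an illustration of the estimation principle, not the evaluation code used in the papers.

import numpy as np

def fit_worth_values(num_items, comparisons, iterations=200):
    """Estimate a relative 'worth' per configuration from (winner, loser) index
    pairs with the standard Bradley-Terry MM update, the pairwise special case
    of Plackett-Luce."""
    worth = np.ones(num_items)
    wins = np.zeros(num_items)
    for winner, _ in comparisons:
        wins[winner] += 1
    for _ in range(iterations):
        denom = np.zeros(num_items)
        for winner, loser in comparisons:
            inv_pair = 1.0 / (worth[winner] + worth[loser])
            denom[winner] += inv_pair
            denom[loser] += inv_pair
        worth = wins / np.maximum(denom, 1e-12)
        worth /= worth.sum()  # worths are only defined up to a common scale
    return worth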
Conclusion
We are currently making great strides in generative models. FAD and FVD will help us keep this progress measurable, and will hopefully lead us to improved models for audio and video generation.

Acknowledgements
There are many people who contributed to this large effort, and we’d like to highlight some of the key contributors: Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, Sylvain Gelly, Mauricio Zuluaga, Dominik Roblek, Matthew Sharifi as well as the extended Google Brain team in Zurich.

Source: Google AI Blog


Video Understanding Using Temporal Cycle-Consistency Learning



In the last few years there has been great progress in the field of video understanding. For example, supervised learning and powerful deep learning models can be used to classify a number of possible actions in videos, summarizing the entire clip with a single label. However, there exist many scenarios in which we need more than just one label for the entire clip. For example, if a robot is pouring water into a cup, simply recognizing the action of “pouring a liquid” is insufficient to predict when the water will overflow. For that, it is necessary to track frame-by-frame the amount of water in the cup as it is being filled. Similarly, a baseball coach who is comparing stances of pitchers may want to retrieve video frames from the precise moment that the ball leaves the pitchers’ hands. Such applications require models to understand each frame of a video.

However, applying supervised learning to understand each individual frame in a video is expensive, since per-frame labels in videos of the action of interest are needed. This requires that annotators apply fine-grained labels to videos by manually adding unambiguous labels to every frame in each video. Only then can the model be trained, and only on a single action; training on new actions requires the process to be repeated. With the increasing demand for fine-grained labeling in applications ranging from robotics to sports analytics, the need for scalable learning algorithms that can understand videos without this tedious labeling process becomes ever more pertinent.

We propose a potential solution using a self-supervised learning method called Temporal Cycle-Consistency Learning (TCC). This novel approach uses correspondences between examples of similar sequential processes to learn representations particularly well-suited for fine-grained temporal understanding of videos. We are also releasing our TCC codebase to enable end-users to apply our self-supervised learning algorithm to new and novel applications.

Representation Learning Using TCC
A plant growing from a seedling to a tree; the daily routine of getting up, going to work and coming back home; or a person pouring themselves a glass of water are all examples of events that happen in a particular order. Videos capturing such processes provide temporal correspondences across multiple instances of the same process. For example, when pouring a drink one could be reaching for a teapot, a bottle of wine, or a glass of water to pour from. Key moments are common to all pouring videos (e.g., the first touch to the container or the container being lifted from the ground) and exist independent of many varying factors, such as visual changes in viewpoint, scale, container style, or the speed of the event. TCC attempts to find such correspondences across videos of the same action by leveraging the principle of cycle-consistency, which has been applied successfully in many problems in computer vision, to learn useful visual representations by aligning videos.

The objective of this training algorithm is to learn a frame encoder, using any network architecture that processes images, such as ResNet. To do so, we pass all frames of the videos to be aligned through the encoder to produce their corresponding embeddings. We then select two videos for TCC learning, say video 1 (the reference video) and video 2. A reference frame is chosen from video 1 and its nearest neighbor frame (NN2) from video 2 is found in the embedding space (not pixel space). We then cycle back by finding the nearest neighbor of NN2 in video 1, which we call NN1. If the representations are cycle-consistent, then the nearest neighbor frame in video 1 (NN1) should refer back to the starting reference frame.
We train the embedder using the distance between the starting reference frame and NN1 as the training signal. As training proceeds, the embeddings improve and reduce the cycle-consistency loss by developing a semantic understanding of each video frame in the context of the action being performed.
Using TCC, we learn embeddings with temporally fine-grained understanding of an action by aligning related videos.
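The cycle itself is easy to sketch with hard nearest neighbors, assuming the frame embeddings are already computed. The released TCC code uses a soft, differentiable version of this lookup so the error can be backpropagated into the encoder; the hard-neighbor version below is only for illustration.

import numpy as np

def cycle_consistency_error(emb1, emb2):
    """For each frame of video 1, find its nearest neighbor in video 2 (NN2),
    cycle back to video 1 (NN1), and measure how far NN1 lands from the start.
    emb1, emb2 are (num_frames, dim) arrays of frame embeddings."""
    errors = []
    for e in emb1:
        nn2 = np.argmin(np.linalg.norm(emb2 - e, axis=1))          # nearest frame in video 2
        nn1 = np.argmin(np.linalg.norm(emb1 - emb2[nn2], axis=1))  # cycle back to video 1
        errors.append(np.linalg.norm(emb1[nn1] - e))               # distance used as the training signal
    return float(np.mean(errors))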
What Does TCC Learn?
In the following figure, we show a model trained using TCC on videos from the Penn Action Dataset of people performing squat exercises. Each point on the left corresponds to frame embeddings, with the highlighted points tracking the embedding of the current video frame. Notice how the embeddings move collectively in spite of many differences in pose, lighting, body and object type. TCC embeddings encode the different phases of squatting without being provided explicit labels.
Right: Input videos of people performing a squat exercise. The video on the top left is the reference. The other videos show nearest neighbor frames (in the TCC embedding space) from other videos of people doing squats. Left: The corresponding frame embeddings move as the action is performed.
Applications of TCC
The learned per-frame embeddings enable an array of interesting applications:
  • Few-shot action phase classification
    When only a few labeled videos are available for training (the few-shot scenario), TCC performs very well. In fact, TCC can classify the phases of different actions with as few as a single labeled video. In the figure below, we compare TCC to other supervised and self-supervised learning approaches in the few-shot setting. We find that supervised learning requires about 50 videos with every frame labeled to achieve the same accuracy that self-supervised methods achieve with just one fully labeled video.
    Comparison of self-supervised and supervised learning for few-shot action phase classification.
  • Unsupervised video alignment
    Aligning or synchronizing videos manually becomes prohibitively difficult as the number of videos increases. Using TCC, many videos can be aligned by selecting the nearest neighbor to each frame in a reference video, without the need for additional labels, as demonstrated in the figure below (a minimal lookup sketch also follows this list).
    Results of unsupervised video alignment on videos of people pitching a baseball, using the distance between frames in the TCC embedding space. The reference video used for alignment is shown in the upper left panel.
  • Label/modality transfer between videos
    Just as TCC finds similar frames by using a nearest neighbor search in the embedding space, it can transfer metadata associated with any frame in one video to its matching frame in another video. This metadata can be in the form of temporal semantic labels or other modalities, such as sound or text. In the video below we show two examples where we can transfer the sound of liquid being poured into a cup from one video to another.
  • Per-frame Retrieval
    With TCC, each frame in a video can be used as a query for retrieval of similar frames by looking up the nearest neighbors in the learned embedding space. The embeddings are powerful enough to differentiate between frames that look quite similar, such as frames just before or after the release of a bowling ball.
    We can perform retrieval from videos on a per-frame basis, i.e., any frame can be used to look up similar frames in a large collection of videos. The retrieved nearest neighbors show that the model captures fine-grained differences in the scene.
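Alignment and per-frame retrieval both reduce to nearest-neighbor lookups in the embedding space. Below is a minimal NumPy sketch of the lookup referenced in the list above; it is a hypothetical helper, not part of the released codebase.

import numpy as np

def align_to_reference(reference_emb, other_emb):
    """For each frame of the reference video, return the index of its nearest
    frame (in TCC embedding space) in another video.
    reference_emb: (n_ref, dim); other_emb: (n_other, dim)."""
    dists = ((reference_emb[:, None, :] - other_emb[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # aligned frame index in the other video, per reference frame

# The same lookup doubles as per-frame retrieval: a single query frame's
# embedding can be matched against the frame embeddings of a whole collection.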
Release
We are releasing our codebase, which includes implementations of a number of state-of-the-art self-supervised learning methods, including TCC. This codebase will be useful for researchers working on video understanding, as well as artists looking to use machine learning to align videos to create mosaics of people, animals, and objects moving synchronously.

Acknowledgements
This is joint work with Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. The authors would like to thank Alexandre Passos, Allen Lavoie, Anelia Angelova, Bryan Seybold, Priya Gupta, Relja Arandjelović, Sergio Guadarrama, Sourish Chaudhuri, and Vincent Vanhoucke for their help with this project. The videos used in this project come from the PennAction dataset. We thank the creators of PennAction for curating such an interesting dataset.

Source: Google AI Blog


Introducing the Indexing API and structured data for livestreams

Over the past few years, it's become easier than ever to stream live videos online, from celebrity updates to special events. But it's not always easy for people to determine which videos are live and know when to tune in.
Today, we're introducing new tools to help more people discover your livestreams in Search and Assistant. With livestream structured data and the Indexing API, you can let Google know when your video is live, so it will be eligible to appear with a red "live" badge:

Add livestream structured data to your page

If your website streams live videos, use the livestream developer documentation to flag your video as a live broadcast and mark the start and end times. In addition, VideoObject structured data is required to tell Google that there's a video on your page.
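For server-rendered pages, the markup can be generated alongside the page. The sketch below builds JSON-LD for a VideoObject whose publication is a BroadcastEvent flagged as live; the field names follow the livestream developer documentation as we understand it, so verify them against the current docs, and the values shown are placeholders.

import json
from datetime import datetime, timezone

def livestream_jsonld(name, description, thumbnail_url, start_iso, end_iso):
    """Build JSON-LD for a livestream page: a VideoObject whose `publication`
    is a BroadcastEvent flagged as a live broadcast. Check field names against
    the current livestream developer documentation."""
    markup = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "thumbnailUrl": [thumbnail_url],
        "uploadDate": datetime.now(timezone.utc).isoformat(),
        "publication": {
            "@type": "BroadcastEvent",
            "isLiveBroadcast": True,
            "startDate": start_iso,  # e.g. "2019-10-30T14:00:00+00:00"
            "endDate": end_iso,
        },
    }
    return '<script type="application/ld+json">%s</script>' % json.dumps(markup)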

Update Google quickly with the Indexing API

The Indexing API now supports pages with livestream structured data. We encourage you to call the Indexing API to request that your site is crawled in time for the livestream. We recommend calling the Indexing API when your livestream begins and ends, and if the structured data changes.
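A minimal sketch of such a call using a service-account credential is shown below. The endpoint, scope, and request body follow the Indexing API documentation as we understand it, but treat the details as assumptions to verify; the key file path is a placeholder.

import requests
from google.oauth2 import service_account
from google.auth.transport.requests import Request

SCOPES = ["https://www.googleapis.com/auth/indexing"]
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

def notify_indexing_api(page_url, key_file="service_account.json"):
    """Tell Google the livestream page was updated; call this when the stream
    starts, when it ends, and whenever the structured data changes."""
    creds = service_account.Credentials.from_service_account_file(key_file, scopes=SCOPES)
    creds.refresh(Request())  # obtain an access token for the service account
    resp = requests.post(ENDPOINT,
                         json={"url": page_url, "type": "URL_UPDATED"},
                         headers={"Authorization": "Bearer " + creds.token})
    resp.raise_for_status()
    return resp.json()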
For more information, visit our developer documentation. If you have any questions, ask us in the Webmaster Help Forum. We look forward to seeing your live videos on Google!

Changes to the URL Performance Report for YouTube video placements

What's changing?
The URL_PERFORMANCE_REPORT in the AdWords API will exclude information for YouTube video placements starting October 30, 2018, in keeping with our data retention policies. As a result, placements where the Url field has a domain of www.youtube.com will no longer appear in the report. New and improved placement reports will be available in one of the upcoming releases of the new Google Ads API.

What you should do
Review your application and workflows and make the necessary changes to ensure that the exclusion of video placements in this report will not cause problems. Watch this blog for updates regarding new placement reports in the Google Ads API.
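If your pipeline post-processes a downloaded URL performance report, a quick check like the hypothetical helper below can flag the rows that will stop appearing (it assumes a CSV download that retains the Url column header):

import csv
from urllib.parse import urlparse

def youtube_placement_rows(report_csv_path):
    """Return rows of a downloaded URL performance report whose Url column points
    at www.youtube.com; these are the rows that will no longer be reported."""
    flagged = []
    with open(report_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = row.get("Url", "")
            host = urlparse(url if "//" in url else "//" + url).hostname
            if host == "www.youtube.com":
                flagged.append(row)
    return flagged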

If you have any questions or need help, please contact us via the forum.

Verifying your Google Assistant media action integrations on Android

Posted by Nevin Mital, Partner Developer Relations

The Media Controller Test (MCT) app is a powerful tool that allows you to test the intricacies of media playback on Android, and it's just gotten even more useful. Media experiences, including voice interactions via the Google Assistant on Android phones, cars, TVs, and headphones, are powered by the Android MediaSession APIs. This tool will help you verify your integrations. We've now added a new verification testing framework that can be used to help automate your QA testing.

The MCT is meant to be used in conjunction with an app that implements media APIs, such as the Universal Android Music Player. The MCT surfaces information about the media app's MediaController, such as the PlaybackState and Metadata, and can be used to test inter-app media controls.

The Media Action Lifecycle can be complex to follow; even in a simple Play From Search request, there are many intermediate steps (simplified timeline depicted below) where something could go wrong. The MCT can be used to help highlight any inconsistencies in how your music app handles MediaController TransportControl requests.

Timeline of the interaction between the User, the Google Assistant, and the third party Android App for a Play From Search request.

Previously, using the MCT required a lot of manual interaction and monitoring. The new verification testing framework offers one-click tests that you can run to ensure that your media app responds correctly to a playback request.

Running a verification test

To access the new verification tests in the MCT, click the Test button next to your desired media app.

MCT Screenshot of launch screen; contains a list of installed media apps, with an option to go to either the Control or Test view for each.

The next screen shows you detailed information about the MediaController, for example the PlaybackState, Metadata, and Queue. There are two buttons on the toolbar in the top right: the button on the left toggles between parsable and formatted logs, and the button on the right refreshes this view to display the most current information.

MCT Screenshot of the left screen in the Testing view for UAMP; contains information about the Media Controller's Playback State, Metadata, Repeat Mode, Shuffle Mode, and Queue.

By swiping to the left, you arrive at the verification tests view, where you can see a scrollable list of defined tests, a text field to enter a query for tests that require one, and a section to display the results of the test.

MCT Screenshot of the right screen in the Testing view for UAMP; contains a list of tests, a query text field, and a results display section.

As an example, to run the Play From Search Test, you can enter a search query into the text field then hit the Run Test button. Looks like the test succeeded!

MCT Screenshot of the right screen in the Testing view for UAMP; the Play From Search test was run with the query 'Memories' and ended successfully.

Below are examples of the Pause Test (left) and Seek To test (right).

MCT Screenshot of the right screen in the Testing view for UAMP; a Pause test was run successfully. MCT Screenshot of the right screen in the Testing view for UAMP; a Seek To test was run successfully.

Android TV

The MCT now also works on Android TV! For your media app to work with the Android TV version of the MCT, your media app must have a MediaBrowserService implementation. Please see here for more details on how to do this.

On launching the MCT on Android TV, you will see a list of installed media apps. Note that an app will only appear in this list if it implements the MediaBrowserService.

Android TV MCT Screenshot of the launch screen; contains a list of installed media apps that implement the MediaBrowserService.

Selecting an app will take you to the testing screen, which will display a list of verification tests on the right.

Android TV MCT Screenshot of the testing screen; contains a list of tests on the right side.

Running a test will populate the left side of the screen with selected MediaController information. For more details, please check the MCT logs in Logcat.

Android TV MCT Screenshot of the testing screen; the Pause test was run successfully and the left side of the screen now displays selected MediaController information.

Tests that require a query are marked with a keyboard icon. Clicking on one of these tests will open an input field for the query. Upon hitting Enter, the test will run.

Android TV MCT Screenshot of the testing screen; clicking on the Seek To test opened an input field for the query.

To make text input easier, you can also use the ADB command:

adb shell input text [query]

Note that '%s' will add a space between words. For example, the command adb shell input text hello%sworld will add the text "hello world" to the input field.

What's next

The MCT currently includes simple single-media-action tests for the following requests:

  • Play
  • Play From Search
  • Play From Media ID
  • Play From URI
  • Pause
  • Stop
  • Skip To Next
  • Skip To Previous
  • Skip To Queue Item
  • Seek To

For a technical deep dive on how the tests are structured and how to add more tests, visit the MCT GitHub Wiki. We'd love for you to submit pull requests with more tests that you think are useful to have and for any bug fixes. Please make sure to review the contributions process for more information.

Check out the latest updates on GitHub!

Automatic Photography with Google Clips



To me, photography is the simultaneous recognition, in a fraction of a second, of the significance of an event as well as of a precise organization of forms which give that event its proper expression.
Henri Cartier-Bresson

The last few years have witnessed a Cambrian-like explosion in AI, with deep learning methods enabling computer vision algorithms to recognize many of the elements of a good photograph: people, smiles, pets, sunsets, famous landmarks and more. But, despite these recent advancements, automatic photography remains a very challenging problem. Can a camera capture a great moment automatically?

Recently, we released Google Clips, a new, hands-free camera that automatically captures interesting moments in your life. We designed Google Clips around three important principles:
  • We wanted all computations to be performed on-device. In addition to extending battery life and reducing latency, on-device processing means that none of your clips leave the device unless you decide to save or share them, which is a key privacy control.
  • We wanted the device to capture short videos, rather than single photographs. Moments with motion can be more poignant and true-to-memory, and it is often easier to shoot a video around a compelling moment than it is to capture a perfect, single instant in time.
  • We wanted to focus on capturing candid moments of people and pets, rather than the more abstract and subjective problem of capturing artistic images. That is, we did not attempt to teach Clips to think about composition, color balance, light, etc.; instead, Clips focuses on selecting ranges of time containing people and animals doing interesting activities.
Learning to Recognize Great Moments
How could we train an algorithm to recognize interesting moments? As with most machine learning problems, we started with a dataset. We created a dataset of thousands of videos in diverse scenarios where we imagined Clips being used. We also made sure our dataset represented a wide range of ethnicities, genders, and ages. We then hired expert photographers and video editors to pore over this footage to select the best short video segments. These early curations gave us examples for our algorithms to emulate. However, it is challenging to train an algorithm solely from the subjective selection of the curators — one needs a smooth gradient of labels to teach an algorithm to recognize the quality of content, ranging from "perfect" to "terrible."

To address this problem, we took a second data-collection approach, with the goal of creating a continuous quality score across the length of a video. We split each video into short segments (similar to the content Clips captures), randomly selected pairs of segments, and asked human raters to select the one they prefer.
We took this pairwise comparison approach, instead of having raters score videos directly, because it is much easier to choose the better of a pair than it is to specify a number. We found that raters were very consistent in pairwise comparisons, and less so when scoring directly. Given enough pairwise comparisons for any given video, we were able to compute a continuous quality score over the entire length. In this process, we collected over 50,000,000 pairwise comparisons on clips sampled from over 1,000 videos. That’s a lot of human effort!
Training a Clips Quality Model
Given this quality score training data, our next step was to train a neural network model to estimate the quality of any photograph captured by the device. We started with the basic assumption that knowing what’s in the photograph (e.g., people, dogs, trees, etc.) will help determine “interestingness”. If this assumption is correct, we could learn a function that uses the recognized content of the photograph to predict its quality score derived above from human comparisons.

To identify content labels in our training data, we leveraged the same Google machine learning technology that powers Google image search and Google Photos, which can recognize over 27,000 different labels describing objects, concepts, and actions. We certainly didn’t need all these labels, nor could we compute them all on device, so our expert photographers selected the few hundred labels they felt were most relevant to predicting the “interestingness” of a photograph. We also added the labels most highly correlated with the rater-derived quality scores.

Once we had this subset of labels, we then needed to design a compact, efficient model that could predict them for any given image, on-device, within strict power and thermal limits. This presented a challenge, as the deep learning techniques behind computer vision typically require strong desktop GPUs, and algorithms adapted to run on mobile devices lag far behind state-of-the-art techniques on desktop or cloud. To train this on-device model, we first took a large set of photographs and again used Google’s powerful, server-based recognition models to predict label confidence for each of the “interesting” labels described above. We then trained a MobileNet Image Content Model (ICM) to mimic the predictions of the server-based model. This compact model is capable of recognizing the most interesting elements of photographs, while ignoring non-relevant content.

The final step was to predict a single quality score for an input photograph from its content predicted by the ICM, using the 50M pairwise comparisons as training data. This score is computed with a piecewise linear regression model that combines the output of the ICM into a frame quality score. This frame quality score is averaged across the video segment to form a moment score. Given a pairwise comparison, our model should compute a moment score that is higher for the video segment preferred by humans. The model is trained so that its predictions match the human pairwise comparisons as well as possible.
Diagram of the training process for generating frame quality scores. Piecewise linear regression maps from an ICM embedding to a score which, when averaged across a video segment, yields a moment score. The moment score of the preferred segment should be higher.
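A toy sketch of this training setup follows: a linear frame scorer (standing in for the piecewise linear regression) maps ICM outputs to frame scores, the moment score is the average over the segment, and a logistic pairwise loss pushes the preferred segment's score higher. Everything here is illustrative rather than the production model.

import numpy as np

def train_moment_scorer(pairs, dim, lr=0.01, epochs=20):
    """Learn a linear frame scorer from pairwise segment preferences.
    Each element of `pairs` is (preferred_segment, other_segment), where a
    segment is a (num_frames, dim) array of ICM label confidences. A segment's
    moment score is the mean of its per-frame scores; a logistic pairwise loss
    pushes the preferred segment's moment score above the other's."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for preferred, other in pairs:
            diff = preferred.mean(axis=0) - other.mean(axis=0)
            margin = diff @ w
            # Gradient step on -log(sigmoid(margin)) with respect to w.
            w += lr * (1.0 - 1.0 / (1.0 + np.exp(-margin))) * diff
    return w

def moment_score(segment, w):
    return float(segment.mean(axis=0) @ w)  # average frame score over the segment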
This process allowed us to train a model that combines the power of Google image recognition technology with the wisdom of human raters, represented by 50 million opinions on what makes interesting content!

While this data-driven score does a great job of identifying interesting (and non-interesting) moments, we also added some bonuses to our overall quality score for phenomena that we know we want Clips to capture, including faces (especially recurring and thus “familiar” ones), smiles, and pets. In our most recent release, we added bonuses for certain activities that customers particularly want to capture, such as hugs, kisses, jumping, and dancing. Recognizing these activities required extensions to the ICM model.

Shot Control
Given this powerful model for predicting the “interestingness” of a scene, the Clips camera can decide which moments to capture in real-time. Its shot control algorithms follow three main principles:
  1. Respect Power & Thermals: We want the Clips battery to last roughly three hours, and we don’t want the device to overheat — the device can’t run at full throttle all the time. Clips spends much of its time in a low-power mode that captures one frame per second. If the quality of that frame exceeds a threshold set by how much Clips has recently shot, it moves into a high-power mode, capturing at 15 fps. Clips then saves a clip at the first quality peak encountered.
  2. Avoid Redundancy: We don’t want Clips to capture all of its moments at once, and ignore the rest of a session. Our algorithms therefore cluster moments into visually similar groups, and limit the number of clips in each cluster.
  3. The Benefit of Hindsight: It’s much easier to determine which clips are the best when you can examine the totality of clips captured. Clips therefore captures more moments than it intends to show to the user. When clips are ready to be transferred to the phone, the Clips device takes a second look at what it has shot, and only transfers the best and least redundant content.
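The mode switch mentioned in the first principle can be sketched as a small state machine; the thresholds, frame rates, and "recently shot" adjustment below are illustrative values, not the device's.

class ShotController:
    """Toy sketch of the low-power / high-power switch: sample slowly, jump to
    high-rate capture when a frame's quality clears an adaptive threshold, and
    save a clip at the first quality peak. Thresholds and the 'recently shot'
    adjustment are illustrative, not the device's values."""
    LOW_POWER_FPS, HIGH_POWER_FPS = 1, 15

    def __init__(self, base_threshold=0.6):
        self.base_threshold = base_threshold
        self.recent_clip_count = 0
        self.high_power = False
        self.prev_quality = 0.0

    def threshold(self):
        # The more Clips has shot recently, the pickier it becomes.
        return self.base_threshold + 0.05 * self.recent_clip_count

    def on_frame(self, quality):
        """Return 'save_clip', 'capture_fast', or 'idle' for this frame."""
        if not self.high_power:
            if quality > self.threshold():
                self.high_power, self.prev_quality = True, quality
                return "capture_fast"
            return "idle"
        if quality < self.prev_quality:  # the first quality peak has just passed
            self.high_power = False
            self.recent_clip_count += 1
            return "save_clip"
        self.prev_quality = quality
        return "capture_fast"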
Machine Learning Fairness
In addition to making sure our video dataset represented a diverse population, we also constructed several other tests to assess the fairness of our algorithms. We created controlled datasets by sampling subjects from different genders and skin tones in a balanced manner, while keeping variables like content type, duration, and environmental conditions constant. We then used this dataset to test that our algorithms had similar performance when applied to different groups. To help detect any regressions in fairness that might occur as we improved our moment quality models, we added fairness tests to our automated system. Any change to our software was run across this battery of tests, and was required to pass. It is important to note that this methodology can’t guarantee fairness, as we can’t test for every possible scenario and outcome. However, we believe that these steps are an important part of our long-term work to achieve fairness in ML algorithms.

Conclusion
Most machine learning algorithms are designed to estimate objective qualities – a photo contains a cat, or it doesn’t. In our case, we aim to capture a more elusive and subjective quality – whether a personal photograph is interesting, or not. We therefore combine the objective, semantic content of photographs with subjective human preferences to build the AI behind Google Clips. Also, Clips is designed to work alongside a person, rather than autonomously; to get good results, a person still needs to be conscious of framing, and make sure the camera is pointed at interesting content. We’re happy with how well Google Clips performs, and are excited to continue to improve our algorithms to capture that “perfect” moment!

Acknowledgements
The algorithms described here were conceived and implemented by a large group of Google engineers, research scientists, and others. Figures were made by Lior Shapira. Thanks to Lior and Juston Payne for video content.

Source: Google AI Blog