Tag Archives: Video

End-to-end Generative Pre-training for Multimodal Video Captioning

Multimodal video captioning systems utilize both the video frames and speech to generate natural language descriptions (captions) of videos. Such systems are stepping stones towards the longstanding goal of building multimodal conversational systems that effortlessly communicate with users while perceiving environments through multimodal input streams.

Unlike video understanding tasks (e.g., video classification and retrieval) where the key challenge lies in processing and understanding multimodal input videos, the task of multimodal video captioning includes the additional challenge of generating grounded captions. The most widely adopted approach for this task is to train an encoder-decoder network jointly using manually annotated data. However, annotating grounded captions for videos is labor intensive and, in many cases, impractical, so large-scale manually annotated data is scarce. Previous research, such as VideoBERT and CoMVT, pre-trains models on unlabelled videos by leveraging automatic speech recognition (ASR). However, such models often cannot generate natural language sentences because they lack a decoder, and thus only the video encoder is transferred to the downstream tasks.

In “End-to-End Generative Pre-training for Multimodal Video Captioning”, published at CVPR 2022, we introduce a novel pre-training framework for multimodal video captioning. This framework, which we call multimodal video generative pre-training or MV-GPT, jointly trains a multimodal video encoder and a sentence decoder from unlabelled videos by leveraging a future utterance as the target text and formulating a novel bi-directional generation task. We demonstrate that MV-GPT effectively transfers to multimodal video captioning, achieving state-of-the-art results on various benchmarks. Additionally, the multimodal video encoder is competitive for multiple video understanding tasks, such as VideoQA, text-video retrieval, and action recognition.

Future Utterance as an Additional Text Signal
Typically, each training video clip for multimodal video captioning is associated with two different texts: (1) a speech transcript that is aligned with the clip as a part of the multimodal input stream, and (2) a target caption, which is often manually annotated. The encoder learns to fuse information from the transcript with visual contents, and the target caption is used to train the decoder for generation. However, in the case of unlabelled videos, each video clip comes only with a transcript from ASR, without a manually annotated target caption. Moreover, we cannot use the same text (the ASR transcript) for the encoder input and decoder target, since the generation of the target would then be trivial.

MV-GPT circumvents this challenge by leveraging a future utterance as an additional text signal and enabling joint pre-training of the encoder and decoder. However, training a model to generate future utterances that are often not grounded in the input content is not ideal. So we apply a novel bi-directional generation loss to reinforce the connection to the input.

Bi-directional Generation Loss
The issue of non-grounded text generation is mitigated by formulating a bi-directional generation loss that includes forward and backward generation. Forward generation produces future utterances given visual frames and their corresponding transcripts and allows the model to learn to fuse the visual content with its corresponding transcript. Backward generation takes the visual frames and future utterances to train the model to generate a transcript that contains more grounded text of the video clip. Bi-directional generation loss in MV-GPT allows the encoder and the decoder to be trained to handle visually grounded texts.

Bi-directional generation in MV-GPT. A model is trained with two generation losses. In forward generation, the model generates a future utterance (blue boxes) given the frames and the present utterance (red boxes), whereas the present is generated from the future utterance in backward generation. Two special beginning-of-sentence tokens ([BOS-F] and [BOS-B]) initiate forward and backward generation for the decoder.
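As a rough illustration of the data side of this objective, the sketch below builds the forward and backward training examples for one clip. The function name `make_bidirectional_examples` and the dictionary keys are hypothetical, chosen only for illustration. Only the two special tokens come from the description above, and this is not the authors' implementation.

```python
BOS_F = "[BOS-F]"  # starts forward generation (present -> future)
BOS_B = "[BOS-B]"  # starts backward generation (future -> present)


def make_bidirectional_examples(frames, present_utterance, future_utterance):
    """Build one forward and one backward example for a single clip.

    The dictionary layout is an assumption made for this sketch.
    """
    forward = {
        "frames": frames,
        "encoder_text": present_utterance,  # transcript aligned with the clip
        "decoder_start": BOS_F,
        "target_text": future_utterance,    # decoder learns to predict the future
    }
    backward = {
        "frames": frames,
        "encoder_text": future_utterance,
        "decoder_start": BOS_B,
        "target_text": present_utterance,   # decoder recovers the grounded text
    }
    return forward, backward


forward, backward = make_bidirectional_examples(
    frames=["frame_0", "frame_1"],
    present_utterance="now add the chopped onions",
    future_utterance="stir until they turn golden",
)
```

Both examples share the same frames; only the roles of the two utterances are swapped, which is what lets the same encoder and decoder be trained in both directions.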

Results on Multimodal Video Captioning
We compare MV-GPT to existing pre-training losses using the same model architecture, on YouCook2 with standard evaluation metrics (Bleu-4, Cider, Meteor and Rouge-L). While all pre-training techniques improve captioning performance, it is critical to pre-train the decoder jointly to improve model performance. We demonstrate that MV-GPT outperforms the previous state-of-the-art joint pre-training method, with relative gains of over 3.5% across all four metrics.

| Pre-training Loss | Pre-trained Parts | Bleu-4 | Cider | Meteor | Rouge-L |
|---|---|---|---|---|---|
| No Pre-training | N/A | 13.25 | 1.03 | 17.56 | 35.48 |
| CoMVT | Encoder | 14.46 | 1.24 | 18.46 | 37.17 |
| UniVL | Encoder + Decoder | 19.95 | 1.98 | 25.27 | 46.81 |
| MV-GPT (ours) | Encoder + Decoder | 21.26 | 2.14 | 26.36 | 48.58 |

Captioning performance of different pre-training losses on YouCook2 across four metrics (Bleu-4, Cider, Meteor and Rouge-L). “Pre-trained parts” indicates which parts of the model are pre-trained: only the encoder, or both the encoder and decoder. We reimplement the loss functions of existing methods but use our model and training strategies for a fair comparison.
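The relative gains quoted above can be checked directly from the UniVL and MV-GPT rows of the table. The short sketch below does that arithmetic; the variable and function names are ours, and the scores are copied from the table.

```python
# Scores copied from the YouCook2 table above.
univl = {"Bleu-4": 19.95, "Cider": 1.98, "Meteor": 25.27, "Rouge-L": 46.81}
mv_gpt = {"Bleu-4": 21.26, "Cider": 2.14, "Meteor": 26.36, "Rouge-L": 48.58}


def relative_gain_percent(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


gains = {m: relative_gain_percent(mv_gpt[m], univl[m]) for m in univl}
smallest_metric = min(gains, key=gains.get)

for metric, gain in gains.items():
    print(f"{metric}: {gain:.1f}% relative gain")
```

The smallest relative gain is on Rouge-L, which is the metric that sets the "over 3.5%" bound.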

We transfer a model pre-trained by MV-GPT to four different captioning benchmarks: YouCook2, MSR-VTT, ViTT and ActivityNet-Captions. Our model achieves state-of-the-art performance on all four benchmarks by significant margins. For instance, on the Meteor metric, MV-GPT shows relative improvements of over 12% on all four benchmarks.

| Method | YouCook2 | MSR-VTT | ViTT | ActivityNet-Captions |
|---|---|---|---|---|
| Best Baseline | 22.35 | 29.90 | 11.00 | 10.90 |
| MV-GPT (ours) | 27.09 | 38.66 | 26.75 | 12.31 |

Meteor metric scores of the best baseline methods and MV-GPT on four benchmarks.

Results on Non-generative Video Understanding Tasks
Although MV-GPT is designed to train a generative model for multimodal video captioning, we also find that our pre-training technique learns a powerful multimodal video encoder that can be applied to multiple video understanding tasks, including VideoQA, text-video retrieval and action classification. When compared to the best comparable baseline models, the model transferred from MV-GPT shows superior performance in five video understanding benchmarks on their primary metrics — i.e., top-1 accuracy for VideoQA and action classification benchmarks, and recall at 1 for the retrieval benchmark.

| Task | Benchmark | Best Comparable Baseline | MV-GPT |
|---|---|---|---|
| VideoQA | MSRVTT-QA | 41.5 | 41.7 |
| VideoQA | ActivityNet-QA | 38.9 | 39.1 |
| Text-Video Retrieval | MSR-VTT | 33.7 | 37.3 |
| Action Recognition | Kinetics-400 | 78.9 | 80.4 |
| Action Recognition | Kinetics-600 | 80.6 | 82.4 |

Comparisons of MV-GPT to best comparable baseline models on five video understanding benchmarks. For each dataset we report the widely used primary metric, i.e., MSRVTT-QA and ActivityNet-QA: Top-1 answer accuracy; MSR-VTT: Recall at 1; and Kinetics: Top-1 classification accuracy.

We introduce MV-GPT, a new generative pre-training framework for multimodal video captioning. Our bi-directional generative objective jointly pre-trains a multimodal encoder and a caption decoder by using utterances sampled at different times in unlabelled videos. Our pre-trained model achieves state-of-the-art results on multiple video captioning benchmarks and other video understanding tasks, namely VideoQA, video retrieval and action classification.

This research was conducted by Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab and Cordelia Schmid.

Source: Google AI Blog

Multimodal Bottleneck Transformer (MBT): A New Model for Modality Fusion

People interact with the world through multiple sensory streams (e.g., we see objects, hear sounds, read words, feel textures and taste flavors), combining information and forming associations between senses. As real-world data consists of various signals that co-occur, such as video frames and audio tracks, web images and their captions and instructional videos and speech transcripts, it is natural to apply a similar logic when building and designing multimodal machine learning (ML) models.

Effective multimodal models have wide applications, such as multilingual image retrieval, future action prediction, and vision-language navigation, and are important for several reasons: robustness, which is the ability to perform even when one or more modalities are missing or corrupted, and complementarity between modalities, which is the idea that some information may be present only in one modality (e.g., audio stream) and not in the other (e.g., video frames). The dominant paradigm for multimodal fusion, called late fusion, consists of using separate models to encode each modality and then simply combining their output representations at the final step. How to combine information from different modalities effectively and efficiently, however, remains understudied.

In “Attention Bottlenecks for Multimodal Fusion”, published at NeurIPS 2021, we introduce a novel transformer-based model for multimodal fusion in video called Multimodal Bottleneck Transformer (MBT). Our model restricts cross-modal attention flow between latent units in two ways: (1) through tight fusion bottlenecks, which force the model to collect and condense the most relevant inputs in each modality (sharing only necessary information with other modalities), and (2) to later layers of the model, allowing early layers to specialize to information from individual modalities. We demonstrate that this approach achieves state-of-the-art results on video classification tasks, with a 50% reduction in FLOPs compared to a vanilla multimodal transformer model. We have also released our code as a tool for researchers to leverage as they expand on multimodal fusion work.

A Vanilla Multimodal Transformer Model
Transformer models consistently obtain state-of-the-art results in ML tasks, including video (ViViT) and audio classification (AST). Both ViViT and AST are built on the Vision Transformer (ViT); in contrast to standard convolutional approaches that process images pixel-by-pixel, ViT treats an image as a sequence of patch tokens (i.e., tokens from a smaller part, or patch, of an image that is made up of multiple pixels). These models then perform self-attention operations across all pairs of patch tokens. However, using transformers for multimodal fusion is challenging because of their high computational cost, with complexity scaling quadratically with input sequence length.

Because transformers effectively process variable length sequences, the simplest way to extend a unimodal transformer, such as ViT, to the multimodal case is to feed the model a sequence of both visual and auditory tokens, with minimal changes to the transformer architecture. We call this a vanilla multimodal transformer model, which allows free attention flow (called vanilla cross-attention) between different spatial and temporal regions in an image, and across frequency and time in audio inputs, represented by spectrograms. However, while this is easy to implement by concatenating audio and video input tokens, applying vanilla cross-attention at all layers of the transformer model is unnecessary: audio and visual inputs contain dense, fine-grained information, much of which may be redundant for the task, so attending over all of it only increases complexity.

Restricting Attention Flow
The issue of growing complexity for long sequences in multimodal models can be mitigated by reducing the attention flow. We restrict attention flow using two methods: specifying the fusion layer and adding attention bottlenecks.

  • Fusion layer (early, mid or late fusion): In multimodal models, the layer where cross-modal interactions are introduced is called the fusion layer. The two extreme versions are early fusion (where all layers in the transformer are cross-modal) and late fusion (where all layers are unimodal and no cross-modal information is exchanged in the transformer encoder). Specifying a fusion layer in between leads to mid fusion. This technique builds on a common paradigm in multimodal learning, which is to restrict cross-modal flow to later layers of the network, allowing early layers to specialize in learning and extracting unimodal patterns.
  • Attention bottlenecks: We also introduce a small set of latent units that form an attention bottleneck (shown below in purple), which forces the model, within a given layer, to collate and condense information from each modality before sharing it with the other, while still allowing free attention flow within a modality. We demonstrate that this bottlenecked version (MBT) outperforms or matches its unrestricted counterpart with lower computational cost.

The different attention configurations in our model. Unlike late fusion (top left), where no cross-modal information is exchanged in the transformer encoder, we investigate two pathways for the exchange of cross-modal information. Early and mid fusion (top middle, top right) are done via standard pairwise self-attention across all hidden units in a layer. For mid fusion, cross-modal attention is applied only to later layers in the model. Bottleneck fusion (bottom left) restricts attention flow within a layer through tight latent units called attention bottlenecks. Bottleneck mid fusion (bottom right) applies both forms of restriction in conjunction for optimal performance.
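One way to picture the bottleneck restriction within a single layer is as a table of which tokens may exchange information directly. The sketch below is a simplified, single-mask view under our own assumptions: the function name and token counts are hypothetical, and the released MBT code may organize the per-modality updates differently.

```python
def bottleneck_attention_mask(num_video, num_audio, num_bottleneck):
    """Return mask[q][k] == True if query token q may attend to key token k.

    Tokens are laid out as: video tokens, audio tokens, bottleneck tokens.
    """
    kinds = (
        ["video"] * num_video
        + ["audio"] * num_audio
        + ["bottleneck"] * num_bottleneck
    )

    def may_attend(query, key):
        # Bottleneck tokens exchange information with every token; all other
        # tokens see only their own modality plus the bottleneck tokens.
        return query == "bottleneck" or key == "bottleneck" or query == key

    return [[may_attend(q, k) for k in kinds] for q in kinds]


# Toy layout: 3 video tokens, 2 audio tokens, 1 bottleneck token.
mask = bottleneck_attention_mask(num_video=3, num_audio=2, num_bottleneck=1)
```

In this view, video and audio tokens never attend to each other directly, so any cross-modal information has to pass through the bottleneck token.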

Bottlenecks and Computation Cost
We apply MBT to the task of sound classification using the AudioSet dataset and investigate its performance for two approaches: (1) vanilla cross-attention, and (2) bottleneck fusion. For both approaches, mid fusion (shown by the middle values of the x-axis below) outperforms both early (fusion layer = 0) and late fusion (fusion layer = 12). This suggests that the model benefits from restricting cross-modal connections to later layers, allowing earlier layers to specialize in learning unimodal features; however, it still benefits from multiple layers of cross-modal information flow. We find that adding attention bottlenecks (bottleneck fusion) outperforms or maintains performance with vanilla cross-attention for all fusion layers, with more prominent improvements at lower fusion layers.

The impact of using attention bottlenecks for fusion on mAP performance (left) and compute (right) at different fusion layers on AudioSet. Attention bottlenecks (red) improve performance over vanilla cross-attention (blue) at lower computational cost. Mid fusion, which is in fusion layers 4-10, outperforms both early (fusion layer = 0) and late (fusion layer = 12) fusion, with best performance at fusion layer 8.

We compare the amount of computation, measured in GFLOPs, for both vanilla cross-attention and bottleneck fusion. Using a small number of attention bottlenecks (four bottleneck tokens used in our experiments) adds negligible extra computation over a late fusion model, with computation remaining largely constant with varying fusion layers. This is in contrast to vanilla cross-attention, which has a non-negligible computational cost for every layer it is applied to. We note that for early fusion, bottleneck fusion outperforms vanilla cross-attention by over 2 mean average precision points (mAP) on audiovisual sound classification, with less than half the computational cost.
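To see why a handful of bottleneck tokens adds so little cost, it helps to count query-key pairs per layer, since attention cost grows quadratically with sequence length. The sketch below does only that counting: it ignores projections and MLP blocks, so it is not a GFLOPs estimate, and the token counts are hypothetical rather than taken from our experiments.

```python
def attention_pairs(num_video, num_audio, num_bottleneck=0, mode="late"):
    """Count query-key pairs scored in one layer under each fusion scheme."""
    if mode == "vanilla":
        # One joint sequence: every token attends to every other token.
        return (num_video + num_audio) ** 2
    if mode == "late":
        # Two independent unimodal sequences.
        return num_video ** 2 + num_audio ** 2
    if mode == "bottleneck":
        # Each modality attends over its own tokens plus the bottleneck tokens.
        return (num_video + num_bottleneck) ** 2 + (num_audio + num_bottleneck) ** 2
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical token counts, with four bottleneck tokens as in the text above.
late = attention_pairs(1000, 500, mode="late")
bottleneck = attention_pairs(1000, 500, num_bottleneck=4, mode="bottleneck")
vanilla = attention_pairs(1000, 500, mode="vanilla")
```

With these counts, the bottleneck layer scores about 1% more pairs than the late fusion layer, while the vanilla layer scores nearly twice as many.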

Results on Sound Classification and Action Recognition
MBT outperforms previous research on popular video classification tasks — sound classification (AudioSet and VGGSound) and action recognition (Kinetics and Epic-Kitchens). For multiple datasets, late fusion and MBT with mid fusion (both fusing audio and vision) outperform the best single modality baseline, and MBT with mid fusion outperforms late fusion.

Across multiple datasets, fusing audio and vision outperforms the best single modality baseline, and MBT with mid fusion outperforms late fusion. For each dataset we report the widely used primary metric, i.e., Audioset: mAP, Epic-Kitchens: Top-1 action accuracy, VGGSound, Moments-in-Time and Kinetics: Top-1 classification accuracy.

Visualization of Attention Heatmaps
To understand the behavior of MBT, we visualize the attention computed by our network following the attention rollout technique. We compute heat maps of the attention from the output classification tokens to the image input space for a vanilla cross-attention model and MBT on the AudioSet test set. For each video clip, we show the original middle frame on the left with the ground truth labels overlaid at the bottom. We demonstrate that the attention is particularly focused on regions in the images that contain motion and create sound, e.g., the fingertips on the piano, the sewing machine, and the face of the dog. The fusion bottlenecks in MBT further force the attention to be localized to smaller regions of the images, e.g., the mouth of the dog in the top left and the woman singing in the middle right. This provides some evidence that the tight bottlenecks force MBT to focus only on the image patches that are relevant for an audio classification task and that benefit from mid fusion with audio.

We introduce MBT, a new transformer-based architecture for multimodal fusion, and explore various fusion approaches using cross-attention between bottleneck tokens. We demonstrate that restricting cross-modal attention via a small set of fusion bottlenecks achieves state-of-the-art results on a number of video classification benchmarks while also reducing computational costs compared to vanilla cross-attention models.

This research was conducted by Arsha Nagrani, Anurag Arnab, Shan Yang, Aren Jansen, Cordelia Schmid and Chen Sun. The blog post was written by Arsha Nagrani, Anurag Arnab and Chen Sun. Animations were created by Tom Small.

Source: Google AI Blog

Filtering Non-video Resources from Video Reports in the Google Ads API

Beginning on August 4, 2021, all video-related performance reports will filter out campaigns, ad groups, and ad group ads that are not part of video campaigns (advertising_channel_type = VIDEO). As a result, performance metrics retrieved from the video resource will change across all supported Google Ads API versions.

If you would like to generate reports that combine metrics from video and non-video resources, you should use the respective resource-specific performance report, for instance: campaign, ad_group, or ad_group_ad.
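As a minimal sketch of what such resource-specific reports might look like, the snippet below builds two query strings against the campaign resource, one unfiltered and one restricted to video campaigns. The helper `build_report_query` is hypothetical, the field list is only illustrative and should be checked against the API's field reference, and sending the query to the API is not shown.

```python
# Filter matching the condition described above (video campaigns only).
VIDEO_FILTER = "campaign.advertising_channel_type = 'VIDEO'"


def build_report_query(resource, fields, video_only=False):
    """Build a report query string for one resource (hypothetical helper)."""
    query = "SELECT " + ", ".join(fields) + " FROM " + resource
    if video_only:
        query += " WHERE " + VIDEO_FILTER
    return query


# Illustrative field selection; adjust to the metrics you actually need.
fields = ["campaign.id", "campaign.name", "metrics.impressions"]

all_campaigns_query = build_report_query("campaign", fields)
video_campaigns_query = build_report_query("campaign", fields, video_only=True)
```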

If you have any questions or need additional help, contact us through the forum or at [email protected]

Experimenting with Automatic Video Creation From a Web Page

At Google, we're actively exploring how people can use creativity tools powered by machine learning and computational methods when producing multimedia content, from creating music and reframing videos, to drawing and more. One creative process in particular, video production, can especially benefit from such tools, as it requires a series of decisions about what content is best suited to a target audience, how to position the available assets within the field of view, and what temporal arrangement will yield the most compelling narrative. But what if one could leverage existing assets, such as a website, to get a jump-start on video creation? Businesses commonly host websites that contain rich visual representations about their services or products, all of which could be repurposed for other multimedia formats, such as videos, potentially enabling those without extensive resources the ability to reach a broader audience.

In “Automatic Video Creation From a Web Page”, published at UIST 2020, we introduce URL2Video, a research prototype pipeline to automatically convert a web page into a short video, given temporal and visual constraints provided by the content owner. URL2Video extracts assets (text, images, or videos) and their design styles (including fonts, colors, graphical layouts, and hierarchy) from HTML sources and organizes the visual assets into a sequence of shots, while maintaining a look-and-feel similar to the source page. Given a user-specified aspect ratio and duration, it then renders the repurposed materials into a video that is ideal for product and service advertising.

URL2Video Overview
Assume a user provides a URL to a web page that illustrates their business. The URL2Video pipeline automatically selects key content from the page and decides the temporal and visual presentation of each asset, based on a set of heuristics derived from an interview study with designers who were familiar with web design and video ad creation. These designer-informed heuristics capture common video editing styles, including content hierarchy, constraining the amount of information in a shot and its time duration, providing consistent color and style for branding, and more. Using this information, the URL2Video pipeline parses a web page, analyzing the content and selecting visually salient text or images while preserving their design styles, which it organizes according to the video specifications provided by the user.

By extracting the structural content and design from the input web page, URL2Video makes automatic editing decisions to present key messages in a video. It considers the temporal (e.g., the duration in seconds) and spatial (e.g., the aspect ratio) constraints of the output video defined by users.

Webpage Analysis
Given a webpage URL, URL2Video extracts document object model (DOM) information and multimedia materials. For the purposes of our research prototype, we limited the domain to static web pages that contain salient assets and headings preserved in an HTML hierarchy that follows recent web design principles, which encourage the use of prominent elements, distinct sections, and an order of visual focus that guides readers in perceiving information. URL2Video identifies such visually-distinguishable elements as a candidate list of asset groups, each of which may contain a heading, a product image, detailed descriptions, and call-to-action buttons, and captures both the raw assets (text and multimedia files) and detailed design specifications (HTML tags, CSS styles, and rendered locations) for each element. It then ranks the asset groups by assigning each a priority score based on their visual appearance and annotations, including their HTML tags, rendered sizes, and ordering shown on the page. In this way, an asset group that occupies a larger area at the top of the page receives a higher score.
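To make the ranking idea concrete, here is a toy scoring sketch. The formula, the tag weights, and the `AssetGroup` fields are all assumptions made for illustration; the text above only says that the score depends on HTML tags, rendered sizes, and ordering on the page, not how they are combined.

```python
from dataclasses import dataclass

# Hypothetical weights: more prominent heading tags count for more.
TAG_WEIGHTS = {"h1": 3.0, "h2": 2.0, "h3": 1.5}


@dataclass
class AssetGroup:
    heading: str
    tag: str     # HTML tag of the group's heading
    width: int   # rendered width in pixels
    height: int  # rendered height in pixels
    top: int     # vertical offset of the group on the rendered page


def priority_score(group, page_height):
    """Toy score: larger groups nearer the top of the page score higher."""
    area = group.width * group.height
    position = 1.0 - group.top / page_height  # 1.0 at the top, toward 0 below
    return TAG_WEIGHTS.get(group.tag, 1.0) * area * position


def rank_asset_groups(groups, page_height):
    return sorted(groups, key=lambda g: priority_score(g, page_height), reverse=True)


groups = [
    AssetGroup("Footer links", "h3", 400, 200, top=2400),
    AssetGroup("Hero banner", "h1", 1200, 600, top=0),
]
ranked = rank_asset_groups(groups, page_height=3000)
```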

Constraints-Based Asset Selection
We consider two goals when composing a video: (1) each video shot should provide concise information, and (2) the visual design should be consistent with the source page. Based on these goals and the video constraints provided by the user, including the intended video duration (in seconds) and aspect ratio (commonly 16:9, 4:3, 1:1, etc.), URL2Video automatically selects and orders the asset groups to optimize the total priority score. To make the content concise, it presents only dominant elements from a page, such as a headline and a few multimedia assets. It constrains the duration of each visual element for viewers to perceive the content. In this way, a short video highlights the most salient information from the top of the page, and a longer video contains more campaigns or products.
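The selection step can be pictured as fitting the highest-priority asset groups into the requested duration. The sketch below uses a simple greedy pass as a stand-in for the optimization described above; the function name, the per-shot durations, and the scores are hypothetical.

```python
def select_shots(groups, max_seconds):
    """Pick asset groups for a video of at most `max_seconds`.

    `groups` is a list of (name, priority_score, shot_seconds) in page order.
    Greedy stand-in: take groups by descending priority while they still fit,
    then present the chosen groups in their original page order.
    """
    by_priority = sorted(
        range(len(groups)), key=lambda i: groups[i][1], reverse=True
    )
    chosen, used = [], 0.0
    for i in by_priority:
        seconds = groups[i][2]
        if used + seconds <= max_seconds:
            chosen.append(i)
            used += seconds
    return [groups[i][0] for i in sorted(chosen)]


page_groups = [
    ("headline", 9.0, 3.0),
    ("product A", 7.0, 4.0),
    ("product B", 5.0, 4.0),
    ("footer", 1.0, 2.0),
]
short_video = select_shots(page_groups, max_seconds=8)
long_video = select_shots(page_groups, max_seconds=14)
```

With these toy numbers, the short video keeps only the headline and the first product, while the longer one has room for every group.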

Scene Composition & Video Rendering
Given an ordered list of assets based on the DOM hierarchy, URL2Video follows the design heuristics obtained from interview studies to make decisions about both the temporal and spatial arrangement to present the assets in individual shots. It transfers the graphical layout of elements into the video’s aspect ratio, and applies the style choices including fonts and colors. To make a video more dynamic and engaging, it adjusts the presentation timing of assets. Finally, it renders the content into a video in the MPEG-4 container format.

User Control
The interface to the research prototype allows the user to review the design attributes in each video shot extracted from the source page, reorder the materials, change the detailed design, such as colors and fonts, and adjust the constraints to generate a new video.

In URL2Video's authoring interface (left), users specify the input URL to a source page, size of the target page view, and the output video parameters. URL2Video analyzes the web page and extracts major visual components. It composes a series of scenes and visualizes the key frames as a storyboard. These components are rendered into an output video that satisfies the input temporal and spatial constraints. Users can play back the video, examine the design attributes (bottom-right), and make adjustments to generate video variations, such as reordering the scenes (top-right).

URL2Video Use Cases
We demonstrate the performance of the end-to-end URL2Video pipeline on a variety of existing web pages. Below we highlight an example result where URL2Video converts a page that embeds multiple short video clips into a 12-second output video. Note how the pipeline makes automatic editing decisions on font and color choices, timing, and content ordering in a video captured from the source page.

URL2Video identifies key content from our Google Search introduction page (top), including headings and video assets. It converts them into a video by considering the presentation flow, the source design and the output constraints (a 12-second landscape video; bottom).

The video below provides further demonstration:

To evaluate the automatically-generated videos, we conducted a user study with designers at Google. Our results show that URL2Video effectively extracted design elements from a web page and supported designers by bootstrapping the video creation process.

Next steps
While this current research focuses on the visual presentation, we are developing new techniques that support the audio track and a voiceover in video editing. All in all, we envision a future where creators focus on making high-level decisions and an ML model interactively suggests detailed temporal and graphical edits for a final video creation on multiple platforms.

We greatly thank our paper co-authors, Zheng Sun (Research) and Katrina Panovich (YouTube). We would also like to thank our colleagues who contributed to URL2Video (in alphabetical order of last name): Jordan Canedy, Brian Curless, Nathan Frey, Madison Le, Alireza Mahdian, Justin Parra, Emily Ryan, Mogan Shieh, Sandor Szego, and Weilong Yang. We are grateful for the support of our leadership, Tomas Izo, Rahul Sukthankar, and Jay Yagnik.

Source: Google AI Blog

RepNet: Counting Repetitions in Videos

Repeating processes ranging from natural cycles, such as phases of the moon or heartbeats and breathing, to artificial repetitive processes, like those found on manufacturing lines or in traffic patterns, are commonplace in our daily lives. Beyond just their prevalence, repeating processes are of interest to researchers for the variety of insights one can tease out of them. It may be that there is an underlying cause behind something that happens multiple times, or there may be gradual changes in a scene that may be useful for understanding. Sometimes, repeating processes provide us with unambiguous “action units”, semantically meaningful segments that make up an action. For example, if a person is chopping an onion, the action unit is the manipulation action that is repeated to produce additional slices. These units may be indicative of more complex activity and may allow us to analyze more such actions automatically at a finer time-scale without having a person annotate these units. For the above reasons, perceptual systems that aim to observe and understand our world for an extended period of time will benefit from a system that understands general repetitions.

In “Counting Out Time: Class Agnostic Video Repetition Counting in the Wild”, we present RepNet, a single model that can understand a broad range of repeating processes, ranging from people exercising or using tools, to animals running and birds flapping their wings, pendulums swinging, and a wide variety of others. In contrast to our previous work, which used cycle-consistency constraints across different videos of the same action to understand them at a fine-grained level, in this work we present a system that can recognize repetitions within a single video. Along with this model, we are releasing a dataset to benchmark class-agnostic counting in videos and a Colab notebook to run RepNet.

RepNet is a model that takes as input a video that contains periodic action of a variety of classes (including those unseen during training) and returns the period of repetitions found therein. In the past, the problem of repetition counting has been addressed by directly comparing pixel intensities in frames, but real world videos have camera motion, occlusion by objects in the field, drastic scale difference and changes in form, which necessitate learning features invariant to such noise. To accomplish this, we train a machine learning model in an end-to-end manner to directly estimate the period of the repetitions. The model consists of three parts: a frame encoder, an intermediate representation, called a temporal self-similarity matrix (which we will describe below), and a period predictor.

First, the frame encoder uses the ResNet architecture as a per-frame model to generate embeddings of each frame of the video. The ResNet architecture was chosen since it has been successful for a number of image and video tasks. Passing each frame of a video through a ResNet-based encoder yields a sequence of embeddings.

At this point we calculate a temporal self-similarity matrix (TSM) by comparing each frame’s embedding with every other frame in the video, returning a matrix that is easy for subsequent modules to analyze for counting repetitions. This process surfaces self-similarities in the stream of video frames that enable period estimation, as demonstrated in the video below.
Demonstration of how the TSM processes images of the Earth’s day-night cycle.
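A minimal sketch of the self-similarity computation, under assumptions of ours: the text above only says that each frame's embedding is compared with every other frame, so the specific similarity used here (negative squared Euclidean distance, normalized per row with a softmax) should be read as one plausible choice. The toy two-dimensional vectors stand in for the ResNet embeddings.

```python
import math


def temporal_self_similarity(embeddings):
    """Compare every frame embedding with every other one.

    Returns a square matrix in which each row is a softmax over that frame's
    similarities to all frames (assumed similarity: negative squared distance).
    """
    tsm = []
    for a in embeddings:
        sims = [-sum((x - y) ** 2 for x, y in zip(a, b)) for b in embeddings]
        peak = max(sims)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in sims]
        total = sum(exps)
        tsm.append([e / total for e in exps])
    return tsm


# Toy embeddings that alternate between two states, i.e. a period of 2 frames.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
tsm = temporal_self_similarity(toy_embeddings)
```

For this toy input, frame 0 is as similar to frame 2 as it is to itself and less similar to frames 1 and 3, which is the kind of regular pattern a period predictor can pick up.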
For each frame, we then use Transformers to predict the period of repetition and the periodicity (i.e., whether or not a frame is part of the periodic process) directly from the sequence of similarities in the TSM. Once we have the period, we obtain the per-frame count by dividing the number of frames captured in a periodic segment by the period length. We sum this up to predict the number of repetitions in the video.
Overview of the RepNet model.
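The counting step described above can be written as a short sum: each frame that belongs to a periodic segment contributes one over its predicted period, so a segment of n frames with period p contributes n / p repetitions. The sketch below assumes the per-frame predictions are already available as plain lists; the function name and the toy values are ours.

```python
def count_repetitions(periods, is_periodic):
    """Sum per-frame counts into a repetition count for the whole video.

    periods[i] is the predicted period length (in frames) at frame i, and
    is_periodic[i] says whether frame i belongs to a repeating segment.
    """
    total = 0.0
    for period, periodic in zip(periods, is_periodic):
        if periodic and period > 0:
            total += 1.0 / period  # this frame covers 1/period of a repetition
    return total


# Toy predictions: 12 periodic frames with period 4, then 4 aperiodic frames.
periods = [4] * 12 + [0] * 4
is_periodic = [True] * 12 + [False] * 4
count = count_repetitions(periods, is_periodic)
```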
Temporal Self-Similarity Matrix
The example of the TSM from the day-night cycle, shown above, is derived from an idealized scenario with fixed period repetitions. TSMs from real videos often reveal fascinating structures in the world, as demonstrated in the three examples below. Jumping jacks are close to the ideal periodic action with a fixed period, while in contrast, the period of a bouncing ball declines as the ball loses energy through repeated bounces. The video of someone mixing concrete demonstrates repetitive action that is preceded and followed by a period without motion. These three behaviors are clearly distinguished in the learned TSM, which requires that the model pay attention to fine changes in the scene.
Jumping Jacks (constant period; video from Kinetics), Bouncing ball (decreasing period; Kinetics), Mixing concrete (aperiodic segments present in video; PERTUBE dataset).
One advantage of using the TSM as an intermediate layer in RepNet is that the subsequent processing by the transformers is done in the self-similarity space and not in the feature space. This encourages generalization to unseen classes. For example, the TSMs produced by actions as different as jumping jacks or swimming are similar as long as the action was repeated at a similar pace. This allows us to train on some classes and yet expect generalization to unseen classes.

One way to train the above model would be to collect a large dataset of videos that capture repetitive activities and label them with the repetition count. The challenge in this is two-fold. First, it requires one to examine a large number of videos to identify those with repeated actions. Following that, each video must be annotated with the number of times an action was repeated. While for certain tasks annotators can skip frames (for example, to classify a video as showing jumping jacks), they still need to see the entire video in order to count how many jumping jacks were performed.

We overcome this challenge by introducing a process for synthetic data generation that produces videos with repetitions using videos that may not contain repeating actions at all. This is accomplished by randomly selecting a segment of the video to repeat an arbitrary number of times, bookended by the original video context.
Our synthetic data generation pipeline that produces videos with repetitions from any video.
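As a rough sketch of this idea, the function below picks a random segment, repeats it a random number of times, and keeps the frames before and after it as context. The function name, the parameter ranges, and the use of frame indices as stand-ins for real frames are assumptions for illustration only.

```python
import numpy as np

def make_synthetic_repetition(frames, rng, min_len=4, max_reps=6):
    """Build a repeating video from a video with no repetitions.

    frames: array whose first axis is time (needs at least 2 * min_len frames).
    Returns the new video plus its labels: repetition count, period, and start.
    """
    frames = np.asarray(frames)
    n = len(frames)
    seg_len = int(rng.integers(min_len, n // 2 + 1))   # length of one repetition
    start = int(rng.integers(0, n - seg_len + 1))      # where the segment begins
    reps = int(rng.integers(2, max_reps + 1))          # how many times to repeat
    segment = frames[start:start + seg_len]
    repeated = np.concatenate([segment] * reps, axis=0)
    # Bookend the repeated segment with the original context.
    video = np.concatenate(
        [frames[:start], repeated, frames[start + seg_len:]], axis=0
    )
    return video, reps, seg_len, start

rng = np.random.default_rng(0)
source = np.arange(20)  # stand-in for 20 video frames
video, reps, seg_len, start = make_synthetic_repetition(source, rng)
```

Because the count and period are chosen by the generator, every synthetic video comes with its labels for free.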
While this process generates a video that resembles a natural-looking video with repeating processes, it is still too simple for deep learning methods, which can learn to cheat by looking for artifacts, instead of learning to recognize repetitions. To address this, we perform extreme data augmentation, which we call camera motion augmentation. In this method, we modify the video to simulate a camera that smoothly moves around using 2D affine motion as the video progresses.
Left: An example of a synthetic repeating video generated from a random video. Right: An example of a video with camera motion augmentation, which is tougher for the model, but results in better generalization to real repeating videos (both from Kinetics).
Even though we can train a model on synthetic repeating videos, the resulting models must be able to generalize to real videos of repeating processes. In order to evaluate the performance of the trained models on real videos, we collect a dataset of ~9000 videos from the Kinetics dataset. These videos span many action classes and capture diverse scenes, arising from the diversity of data seen on YouTube. We annotate these videos with the count of the action being repeated in the video. To encourage further research in this field, we are releasing the count annotations for this dataset, which we call Countix.

A class-agnostic counting model has many useful applications. RepNet serves as a single model that can count repetitions from many different domains:
RepNet can count repeated activities from a range of domains, such as slicing onions (left; video from Kinetics dataset), Earth’s diurnal cycle (middle; Himawari satellite data), or even a cheetah in motion (right; video from imgur.com).
RepNet could be used to estimate heartbeat rates from echocardiogram videos even though it has not seen such videos in training:
Predicted heart rates: 45 bpm (left) and 75 bpm (right). True heart rates 46-50 bpm and 78-79 bpm, respectively. RepNet’s prediction of the heart rate across different devices is encouragingly close to the rate measured by the device. (Source for left and right)
RepNet can also be used to monitor repeating activities for any changes in speed. Such changes in speed can also be used in other settings for quality or process control.
In this video, we see RepNet counting accelerating cellular oscillations observed under a laser microscope even though it has never seen such a video during training (from Nature article).
Left: Person performing a “mountain climber” exercise. Right: The 1D projection of the RepNet embeddings using principal component analysis, capturing the moment that the person changes their speed during the exercise. (Video from Kinetics)
We are releasing the Countix annotations for the community to work on the problem of repetition counting. We are also releasing a Colab notebook for running RepNet. With it, you can run RepNet on your own videos, or even on your webcam feed, to detect periodic activities and count repetitions automatically.

This is joint work with Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Special thanks to Tom Small for designing the visual explanation of TSM. The authors thank Anelia Angelova, Relja Arandjelović, Sourish Chaudhuri, Aishwarya Gomatam, Meghana Thotakuri, and Vincent Vanhoucke for their help with this project.

Source: Google AI Blog

Celebrating 10 years of WebM and WebRTC

Originally posted on the Chromium Blog

Ten years ago, Google planted the seeds for two foundational web media technologies, hoping they would provide the roots for a more vibrant internet. Two acquisitions, On2 Technologies and Global IP Solutions, led to a pair of open source projects: the WebM Project, a family of cutting edge video compression technologies (codecs) offered by Google royalty-free, and the WebRTC Project building APIs for real-time voice and video communication on the web.

These initiatives were major technical endeavors, essential infrastructure for enabling the promise of HTML5 with support for video conferencing and streaming. But this was also a philosophical evolution for media as Product Manager Mike Jazayeri noted in his blog post hailing the launch of the WebM Project:
“A key factor in the web’s success is that its core technologies such as HTML, HTTP, TCP/IP, etc. are open and freely implementable.”
As emerging first-class participants in the web experience, media and communication components also had to be free and open.

A decade later, these principles have ensured compression and communication technologies capable of keeping pace with a web ecosystem characterized by exponential growth of media consumption, devices, and demand. Starting from VP8 in 2010, the WebM Project has delivered up to 50% video bitrate savings with VP9 in 2013 and an additional 30% with AV1 in 2018—with adoption by YouTube, Facebook, Netflix, Twitch, and more. Equally importantly, the WebM team co-founded the Alliance for Open Media which has brought the IP of over 40 major tech companies in support of open and free codecs. With Chrome, Edge, Firefox and Safari supporting WebRTC, more than 85% of all installed browsers globally have become a client for real-time communications on the Internet. WebRTC has become a stable standard and it is now the default solution for video calling on the Web. These technologies have succeeded together, as today over 90% of encoded WebRTC video in Chrome uses VP8 or VP9.

The need for these technologies has been highlighted by COVID-19, as people across the globe have found new ways to work, educate, and connect with loved ones via video chat. The compression of open codecs has been essential to keeping services running on limited bandwidth, with over a billion hours of VP9 and AV1 content viewed every day. WebRTC has allowed for an ecosystem of interoperable communications apps to flourish: since the beginning of March 2020, we have seen in Chrome a 13X increase in received video streams via WebRTC.

These successes would not have been possible without all the supporters that make an open source community. Thank you to all the code contributors, testers, bug filers, and corporate partners who helped make this ecosystem a reality. A decade in, Google remains as committed as ever to open media on the web. We look forward to continuing that work with all of you in the next decade and beyond.

By Matt Frost, Product Director Chrome Media and Niklas Blum, Senior Product Manager WebRTC

AutoFlip: An Open Source Framework for Intelligent Video Reframing

Originally posted on the AI Blog

Videos filmed and edited for television and desktop are typically created and viewed in landscape aspect ratios (16:9 or 4:3). However, with an increasing number of users creating and consuming content on mobile devices, historical aspect ratios don’t always fit the display being used for viewing. Traditional approaches for reframing video to different aspect ratios usually involve static cropping, i.e., specifying a camera viewport, then cropping visual contents that are outside. Unfortunately, these static cropping approaches often lead to unsatisfactory results due to the variety of composition and camera motion styles. More bespoke approaches, however, typically require video curators to manually identify salient contents on each frame, track their transitions from frame-to-frame, and adjust crop regions accordingly throughout the video. This process is often tedious, time-consuming, and error-prone.

To address this problem, we are happy to announce AutoFlip, an open source framework for intelligent video reframing. AutoFlip is built on top of the MediaPipe framework that enables the development of pipelines for processing time-series multimodal data. Taking a video (casually shot or professionally edited) and a target dimension (landscape, square, portrait, etc.) as inputs, AutoFlip analyzes the video content, develops optimal tracking and cropping strategies, and produces an output video with the same duration in the desired aspect ratio.
Left: Original video (16:9). Middle: Reframed using a standard central crop (9:16). Right: Reframed with AutoFlip (9:16). By detecting the subjects of interest, AutoFlip is able to avoid cropping off important visual content.

AutoFlip Overview

AutoFlip provides a fully automatic solution to smart video reframing, making use of state-of-the-art ML-enabled object detection and tracking technologies to intelligently understand video content. AutoFlip detects changes in the composition that signify scene changes in order to isolate scenes for processing. Within each shot, video analysis is used to identify salient content before the scene is reframed by selecting a camera mode and path optimized for the contents.

Shot (Scene) Detection

A scene or shot is a continuous sequence of video without cuts (or jumps). To detect the occurrence of a shot change, AutoFlip computes the color histogram of each frame and compares this with prior frames. If the distribution of frame colors changes at a different rate than a sliding historical window, a shot change is signaled. AutoFlip buffers the video until the scene is complete before making reframing decisions, in order to optimize the reframing for the entire scene.
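The histogram comparison can be sketched as follows. This is an illustrative approximation, not AutoFlip's actual detector: the bin count, the L1-based change measure, the window size, and the two thresholds (`factor`, `min_change`) are all assumptions chosen for the example.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Normalized joint RGB histogram of an (H, W, 3) uint8 frame."""
    pixels = frame.reshape(-1, 3).astype(np.float64)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def detect_shot_changes(frames, window=10, factor=5.0, min_change=0.2):
    """Return indices of frames that start a new shot.

    A cut is signaled when the frame-to-frame histogram change is large in
    absolute terms and much larger than the recent average change.
    """
    cuts, history = [], []
    prev = color_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = color_histogram(frames[i])
        change = 0.5 * np.abs(cur - prev).sum()  # 0 = identical, 1 = disjoint
        baseline = np.mean(history) if history else 0.0
        if change > min_change and change > factor * baseline:
            cuts.append(i)
            history = []  # start a fresh window for the new shot
        else:
            history = (history + [change])[-window:]
        prev = cur
    return cuts

# Toy clip: five reddish frames followed by five bluish frames.
red = np.zeros((4, 4, 3), dtype=np.uint8)
red[..., 0] = 200
blue = np.zeros((4, 4, 3), dtype=np.uint8)
blue[..., 2] = 200
cuts = detect_shot_changes([red] * 5 + [blue] * 5)
```

In the toy clip, the only frame whose color distribution differs sharply from its recent history is the first bluish one, so a single cut is reported there.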

Video Content Analysis

We utilize deep learning-based object detection models to find interesting, salient content in the frame. This content typically includes people and animals, but other elements may be identified, depending on the application, including text overlays and logos for commercials, or motion and ball detection for sports.

The face and object detection models are integrated into AutoFlip through MediaPipe, which uses TensorFlow Lite on CPU. This structure allows AutoFlip to be extensible, so developers may conveniently add new detection algorithms for different use cases and video content. Each object type is associated with a weight value, which defines its relative importance — the higher the weight, the more influence the feature will have when computing the camera path.

Left: People detection on sports footage. Right: Two face boxes (‘core’ and ‘all’ face landmarks). In narrow portrait crop cases, often only the core landmark box can fit.


After identifying the subjects of interest on each frame, logical decisions about how to reframe the content for a new view can be made. AutoFlip automatically chooses an optimal reframing strategy — stationary, panning or tracking — depending on the way objects behave during the scene (e.g., moving around or stationary). In stationary mode, the reframed camera viewport is fixed in a position where important content can be viewed throughout the majority of the scene. This mode can effectively mimic professional cinematography in which a camera is mounted on a stationary tripod or where post-processing stabilization is applied. In other cases, it is best to pan the camera, moving the viewport at a constant velocity. The tracking mode provides continuous and steady tracking of interesting objects as they move around within the frame.

Based on which of these three reframing strategies the algorithm selects, AutoFlip then determines an optimal cropping window for each frame, while best preserving the content of interest. While the bounding boxes track the objects of focus in the scene, they typically exhibit considerable jitter from frame-to-frame and, consequently, are not sufficient to define the cropping window. Instead, we adjust the viewport on each frame through the process of Euclidean-norm optimization, in which we minimize the residuals between a smooth (low-degree polynomial) camera path and the bounding boxes.

Top: Camera paths resulting from following the bounding boxes from frame-to-frame. Bottom: Final smoothed camera paths generated using Euclidean-norm path formation. Left: Scene in which objects are moving around, requiring a tracking camera path. Right: Scene where objects stay close to the same position; a stationary camera covers the content for the full duration of the scene.
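The smoothing idea can be illustrated with an ordinary least-squares polynomial fit, which minimizes the Euclidean norm of the residuals between the path and the observed box positions. The sketch below handles a single coordinate (for example, the horizontal box center) and is only an approximation of the approach described above; the function name, the chosen degrees, and the toy jitter are assumptions.

```python
import numpy as np

def smooth_camera_path(centers, degree=2):
    """Fit a low-degree polynomial to per-frame box centers.

    centers: 1D sequence with one coordinate of the box center per frame.
    degree: 0 gives a stationary camera, 1 a constant-velocity pan,
            higher degrees a smoothly varying tracking path.
    """
    centers = np.asarray(centers, dtype=np.float64)
    t = np.linspace(-1.0, 1.0, len(centers))      # normalized time axis
    coeffs = np.polyfit(t, centers, deg=degree)   # least-squares fit
    return np.polyval(coeffs, t)

# Toy data: an object drifting to the right with frame-to-frame jitter.
t = np.linspace(-1.0, 1.0, 60)
true_path = 100.0 + 50.0 * t
jitter = 5.0 * (-1.0) ** np.arange(60)            # alternating +/- 5 pixels
noisy_centers = true_path + jitter

panning_path = smooth_camera_path(noisy_centers, degree=1)
stationary_path = smooth_camera_path(noisy_centers, degree=0)
```

The fitted panning path stays much closer to the underlying motion than the raw box centers do, and the degree-0 fit collapses to a fixed viewport at the mean position.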

AutoFlip’s configuration graph provides settings for either best-effort or required reframing. If it becomes infeasible to cover all the required regions (for example, when they are too spread out on the frame), the pipeline will automatically switch to a less aggressive strategy by applying a letterbox effect, padding the image to fill the frame. For cases where the background is detected as being a solid color, this color is used to create seamless padding; otherwise a blurred version of the original frame is used.

AutoFlip Use Cases

We are excited to release this tool directly to developers and filmmakers, reducing the barriers to their design creativity and reach through the automation of video editing. The ability to adapt any video format to various aspect ratios is becoming increasingly important as the diversity of devices for video content consumption continues to rapidly increase. Whether your use case is portrait to landscape, landscape to portrait, or even small adjustments like 4:3 to 16:9, AutoFlip provides a solution for intelligent, automated and adaptive video reframing.

What’s Next?

Like any machine learning algorithm, AutoFlip can benefit from an improved ability to detect objects relevant to the intent of the video, such as speaker detection for interviews or animated face detection on cartoons. Additionally, a common issue arises when input video has important overlays on the edges of the screen (such as text or logos) as they will often be cropped from the view. By combining text/logo detection and image inpainting technology, we hope that future versions of AutoFlip can reposition foreground objects to better fit the new aspect ratios. Lastly, in situations where padding is required, deep uncrop technology could provide improved ability to expand beyond the original viewable area.

While we work to improve AutoFlip internally at Google, we encourage contributions from developers and filmmakers in the open source communities.


We would like to thank our colleagues who contributed to AutoFlip: Alexander Panagopoulos, Jenny Jin, Brian Mulford, Yuan Zhang, Alex Chen, Xue Yang, Mickey Wang, Justin Parra, Hartwig Adam, Jingbin Wang, and Weilong Yang; and the MediaPipe team who helped with open sourcing: Jiuqiang Tang, Tyler Mullen, Mogan Shieh, Ming Guang Yong, and Chuo-Ling Chang.

By Nathan Frey, Senior Software Engineer, Google Research, Los Angeles and Zheng Sun, Senior Software Engineer, Google Research, Mountain View

Video Series for New Webmasters: Search for Beginners!

We are excited to introduce our newest video series: “Search For Beginners”! The series was created primarily to help new webmasters. It is also for anyone with an interest in Search or anyone who is still learning about the Web and how to manage their online presence.

We love to see the webmaster community grow! Every day, there are countless new webmasters who are taking the first steps in learning how Search works, and how to make their websites perform well and be discoverable on Search. We understand that it can sometimes be challenging or even overwhelming to start with our existing content without some prior knowledge or basic understanding of the Web. We find our basic videos in our YouTube channels to be the ones with the most views. At the same time, advanced webmasters also see the need for content that can be sent to clients or stakeholders to help explain important concepts in managing an online presence.

We want to help all webmasters succeed, regardless of whether you have been managing websites for many years or you’ve just started out yesterday. We want to do more to help the new webmasters and this video series will hopefully help us achieve that.

Introduction to the series:

Episode 1:

The “Search For Beginners” video series covers basic online presence topics ranging from ‘Do you need a website?’, ‘What are the goals for your website?’ to more organic search-related topics such as ‘How does Google Search work?’, ‘How to change description line’, or ‘How to change wrong address information on Google’. Actually, we get asked these questions frequently in forums, social channels and at events around the world! The videos are fully animated. The videos are in English with subtitles available in Spanish, Portuguese, Korean, Chinese, Indonesian, Italian, Japanese, and English. We are working on more, so please stay tuned!

And if you consider yourself a more experienced user, please feel free to use these videos to support your pitches or to explain things to your clients. If you want to share any ideas or learnings, please leave them in the comment section of each video so that others can benefit from your knowledge and experience.

Follow us on Twitter and subscribe on YouTube for the upcoming videos! We will be adding new videos in this series to this playlist about every two weeks!

Audio and Visual Quality Measurement using Fréchet Distance

“I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.”
    William Thomson (Lord Kelvin), Lecture on "Electrical Units of Measurement" (3 May 1883), published in Popular Lectures Vol. I, p. 73
The rate of scientific progress in machine learning has often been determined by the availability of good datasets and metrics. In deep learning, benchmark datasets such as ImageNet or Penn Treebank were among the driving forces that established deep artificial neural networks for image recognition and language modeling. Yet, while the available “ground-truth” datasets lend themselves nicely as measures of performance on these prediction tasks, determining the “ground-truth” for comparison to generative models is not so straightforward. Imagine several models that generate videos of StarCraft video game sequences: how does one determine which model is best? Clearly some of the videos shown below look more realistic than others, but can the differences between them be quantified? Access to robust metrics for evaluation of generative models is crucial for measuring (and making) progress in the fields of audio and video understanding, but currently no such metrics exist.
Videos generated from various models trained on sequences from the StarCraft Video (SCV) dataset.
In “Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms” and “Towards Accurate Generative Models of Video: A New Metric & Challenges”, we present two such metrics: the Fréchet Audio Distance (FAD) and Fréchet Video Distance (FVD). We document our large-scale human evaluations using 10k video and 69k audio clip pairwise comparisons that demonstrate high correlations between our metrics and human perception. We are also releasing the source code for both Fréchet Video Distance and Fréchet Audio Distance on GitHub (FVD; FAD).

General Description of Fréchet Distance
The goal of a generative model is to learn to produce samples that look similar to the ones on which it has been trained, such that it knows what properties and features are likely to appear in the data, and which ones are unlikely. In other words, a generative model must learn the probability distribution of the training data. In many cases, the target distributions for generative models are very high-dimensional. For example, a single image of 128x128 pixels with 3 color channels has almost 50k dimensions, while a second-long video clip might consist of dozens (or hundreds) of such frames with audio that may have 16,000 samples. Calculating distances between such high dimensional distributions in order to quantify how well a given model succeeds at a task is very difficult. In the case of pictures, one could look at a few samples to gauge visual quality, but doing so for every model trained is not feasible.

In addition, generative adversarial networks (GANs) tend to focus on a few modes of the overall target distribution, while completely ignoring others. For example, they may learn to generate only one type of object or only a select few viewing angles. As a consequence, looking only at a limited number of samples from the model may not indicate whether the network learned the entire distribution successfully. To remedy this, a metric is needed that aligns well with human judgement of quality, while also taking the properties of the target distribution into account.

One common solution for this problem is the so-called Fréchet Inception Distance (FID) metric, which was specifically designed for images. The FID takes a large number of images from both the target distribution and the generative model, and uses the Inception object-recognition network to embed each image into a lower-dimensional space that captures the important features. Then it computes the so-called Fréchet distance between these samples, which is a common way of calculating distances between distributions that provides a quantitative measure of how similar the two distributions actually are.
A key component for both metrics is a pre-trained model that converts the video or audio clip into an N-dimensional embedding.
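When each set of embeddings is summarized by a multivariate Gaussian with mean μ and covariance Σ, the Fréchet distance has the closed form ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). The sketch below computes it with NumPy only, under the assumption that the embeddings have already been produced by some pre-trained model; the function name and the random toy data are illustrative and are not taken from the released FVD or FAD code.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (num_samples, dim).
    """
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))   # stand-in for embeddings of real clips
shifted = reference + 3.0               # same spread, mean moved by 3 per dim

same_score = frechet_distance(reference, reference)
shifted_score = frechet_distance(reference, shifted)
```

Identical sets score approximately zero, while shifting every embedding by 3 in each of the 4 dimensions leaves the covariance unchanged and yields a distance of about 4 × 3² = 36, coming entirely from the difference in means.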
Fréchet Audio Distance and Fréchet Video Distance
Building on the principles of FID that have been successfully applied to the image domain, we propose both Fréchet Video Distance (FVD) and Fréchet Audio Distance (FAD). Unlike popular metrics such as peak signal-to-noise ratio or the structural similarity index, FVD looks at videos in their entirety, and thereby avoids the drawbacks of framewise metrics.
Examples of videos of a robot arm, judged by the new FVD metric. FVD values were found to be approximately 2000, 1000, 600, 400, 300 and 150 (left-to-right; top-to-bottom). A lower FVD clearly correlates with higher video quality.
In the audio domain, existing metrics either require a time-aligned ground truth signal, such as source-to-distortion ratio (SDR), or only target a specific domain, like speech quality. FAD on the other hand is reference-free and can be used on any type of audio.

Below is a 2-D visualization of the audio embedding vectors from which we compute the FAD. Each point corresponds to the embedding of a 5-second audio clip, where the blue points are from clean music and other points represent audio that has been distorted in some way. The estimated multivariate Gaussian distributions are presented as concentric ellipses. As the magnitude of the distortions increases, the overlap between their distributions and that of the clean audio decreases. The separation between these distributions is what the Fréchet distance is measuring.
In the animation, we can see that as the magnitude of the distortions increases, the Gaussian distributions of the distorted audio overlap less with the clean distribution. The magnitude of this separation is what the Fréchet distance is measuring.
It is important for FAD and FVD to closely track human judgement, since that is the gold standard for what looks and sounds “realistic”. So, we performed a large-scale human study to determine how well our new metrics align with qualitative human judgment of generated audio and video. For the study, human raters examined 10,000 video pairs and 69,000 5-second audio clips. For the FAD we asked human raters to compare the effect of two different distortions on the same audio segment, randomizing both the pair of distortions that they compared and the order in which they appeared. The raters were asked “Which audio clip sounds most like a studio-produced recording?” The collected set of pairwise evaluations was then ranked using a Plackett-Luce model, which estimates a worth value for each parameter configuration. Comparison of the worth values to the FAD demonstrates that the FAD correlates quite well with human judgement.
This figure compares the FAD calculated between clean background music and music distorted by a variety of methods (e.g., pitch down, Gaussian noise, etc.) to the associated worth values from human evaluation. Each type of distortion has two data points representing high and low extremes in the distortion applied. The quantization distortion (purple circles), for example, limits the audio to a specific number of bits per sample, where the two data points represent two different bitrates. Both human raters and the FAD assigned higher values (i.e., “less realistic”) to the lower bitrate quantization. Overall log FAD correlates well with human judgement — a perfect correlation between the log FAD and human perception would result in a straight line.
We are currently making great strides in generative models. FAD and FVD will help us keep this progress measurable, and will hopefully lead us to improve our models for audio and video generation.

There are many people who contributed to this large effort, and we’d like to highlight some of the key contributors: Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, Sylvain Gelly, Mauricio Zuluaga, Dominik Roblek, Matthew Sharifi as well as the extended Google Brain team in Zurich.

Source: Google AI Blog