Tag Archives: Computer Vision

Conceptual Captions: A New Dataset and Challenge for Image Captioning



The web is filled with billions of images, helping to entertain and inform the world on a countless variety of subjects. However, much of that visual information is not accessible to those with visual impairments, or with slow internet speeds that prohibit the loading of images. Image captions, manually added by website authors using Alt-text HTML, are one way to make this content more accessible, because they provide a natural-language description of an image that can be presented using text-to-speech systems. However, existing human-curated Alt-text HTML fields are added for only a very small fraction of web images. And while automatic image captioning can help solve this problem, accurate image captioning is a challenging task that requires advancing the state of the art of both computer vision and natural language processing.
Image captioning can help millions with visual impairments by converting images to text captions. Image by Francis Vallance (Heritage Warrior), used under CC BY 2.0 license.
Today we introduce Conceptual Captions, a new dataset consisting of ~3.3 million image/caption pairs that are created by automatically extracting and filtering image caption annotations from billions of web pages. Introduced in a paper presented at ACL 2018, Conceptual Captions represents an order of magnitude increase of captioned images over the human-curated MS-COCO dataset. As measured by human raters, the machine-curated Conceptual Captions has an accuracy of ~90%. Furthermore, because images in Conceptual Captions are pulled from across the web, it represents a wider variety of image-caption styles than previous datasets, allowing for better training of image captioning models. To track progress on image captioning, we are also announcing the Conceptual Captions Challenge for the machine learning community to train and evaluate their own image captioning models on the Conceptual Captions test bed.
Illustration of images and captions in the Conceptual Captions dataset.
Clockwise from top left, images by Jonny Hunter, SigNote Cloud, Tony Hisgett, ResoluteSupportMedia. All images used under CC BY 2.0 license.
Generating the Dataset
To generate the Conceptual Captions dataset, we start by sourcing images from the web that have Alt-text HTML attributes. We automatically screen these for certain properties to ensure image quality while also avoiding undesirable content such as adult themes. We then apply text-based filtering, removing captions with non-descriptive text (such as hashtags, poor grammar or added language that does not relate to the image); we also discard texts with high sentiment polarity or adult content (for more details on the filtering criteria, please see our paper). We use existing image classification models to make sure that, for any given image, there is overlap between its Alt-text (allowing for word variations) and the labels that the image classifier outputs for that image.
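The final check in this pipeline, requiring overlap between the Alt-text and the classifier's labels, can be illustrated with a minimal sketch. The function names and the crude plural-stripping used to "allow for word variations" are assumptions made for illustration. The actual filtering criteria are described in the paper and are not reproduced here.

```python
import re

def normalize(word):
    """Crude normalization (an assumption): lowercase and strip a trailing 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    return word

def tokens(text):
    """Split text into a set of normalized alphabetic tokens."""
    return {normalize(w) for w in re.findall(r"[A-Za-z]+", text)}

def alt_text_matches_labels(alt_text, classifier_labels):
    """Return True if the Alt-text shares at least one token with the labels."""
    label_tokens = set()
    for label in classifier_labels:
        label_tokens |= tokens(label)
    return bool(tokens(alt_text) & label_tokens)

# Hypothetical candidates: the first is kept, the second is discarded.
print(alt_text_matches_labels("Two dogs playing in the snow", ["dog", "winter"]))
print(alt_text_matches_labels("IMG_0042 final version", ["cat"]))
```

A production system would likely use a proper lemmatizer and a confidence threshold on the classifier labels; both are omitted to keep the sketch short.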

From Specific Names to General Concepts
While candidates passing the above filters tend to be good Alt-text image descriptions, a large majority use proper names (for people, venues, locations, organizations etc.). This is problematic because it is very difficult for an image captioning model to learn such fine-grained proper name inference from input image pixels, and also generate natural-language descriptions simultaneously (see footnote 1).

To address the above problems, we wrote software that automatically replaces proper names with words representing the same general notion, i.e., with their concept. In some cases, the proper names are removed to simplify the text. For example, we substitute people's names (e.g., “Former Miss World Priyanka Chopra on the red carpet” becomes “actor on the red carpet”), remove location names (“Crowd at a concert in Los Angeles” becomes “Crowd at a concert”), remove named modifiers (e.g., “Italian cuisine” becomes just “cuisine”) and correct newly formed noun phrases if needed (e.g., “artist and artist” becomes “artists”, see the example illustration below).
Illustration of text modification. Image by Rockoleando used under CC BY 2.0 license.
Finally, we cluster all resolved entities (e.g., “artist”, “dog”, “neighborhood”, etc.) and keep only the candidate types which have a count of over 100 mentions, a quantity sufficient to support representation learning for these entities. This retained around 16K entity concepts such as: “person”, “actor”, “artist”, “player” and “illustration”. Less frequent ones that we retained include “baguette”, “bridle”, “deadline”, “ministry” and “funnel”.
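The frequency cut-off described above can be sketched in a few lines. This is only an illustration of the thresholding step: the clustering of resolved entities is not shown, and the function name and toy data are hypothetical.

```python
from collections import Counter

def frequent_concepts(concept_mentions, min_count=100):
    """Keep concepts mentioned more than `min_count` times."""
    counts = Counter(concept_mentions)
    return {concept for concept, n in counts.items() if n > min_count}

# Toy data (hypothetical counts): "zeppelin" has exactly 100 mentions,
# so it does not pass the "over 100 mentions" cut-off.
mentions = ["actor"] * 150 + ["baguette"] * 101 + ["zeppelin"] * 100
print(sorted(frequent_concepts(mentions)))
```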

In the end, it required roughly one billion (English) webpages containing over 5 billion candidate images to obtain a clean and learnable image caption dataset of over 3M samples (a rejection rate of 99.94%). Our control parameters were biased towards high precision, although these can be tuned to generate an order of magnitude more examples with lower precision.
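As a quick sanity check of the rejection rate quoted above, the arithmetic can be reproduced from the rounded figures in the text. The inputs of 5 billion candidates and 3 million kept samples are the rounded lower bounds stated in the post, not exact counts.

```python
def rejection_rate(candidates, kept):
    """Fraction of candidate images that were filtered out."""
    return 1.0 - kept / candidates

# Rounded figures from the text: "over 5 billion" candidates, "over 3M" kept.
print(f"{rejection_rate(5_000_000_000, 3_000_000):.2%}")
```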

Dataset Impact
To test the usefulness of our dataset, we independently trained both RNN-based and Transformer-based image captioning models implemented in Tensor2Tensor (T2T), using the MS-COCO dataset (using 120K images with 5 human-annotated captions per image) and the new Conceptual Captions dataset (using over 3.3M images with 1 caption per image). See our paper for more details on model architectures.

These models were tested using images from the Flickr30K dataset (which are out-of-domain for both MS-COCO and Conceptual Captions), and the resulting captions were evaluated by 3 human raters per test case. The results are reported in the table below.
From these results we conclude that models trained on Conceptual Captions generalized better than competing approaches irrespective of the architecture (i.e., RNN or Transformer). In addition, we found that Transformer models did better than RNN when trained on either dataset. The conclusion from these findings is that Conceptual Captions provides the ability to train image captioning models that perform better on a wide variety of images.

Get Involved
It is our hope that this dataset will help the machine learning community advance the state of the art in image captioning models. Importantly, since no human annotators were involved in its creation, this dataset is highly scalable, potentially allowing the expansion of the dataset to enable automatic creation of Alt-text-HTML-like descriptions for an even wider variety of images. We encourage all those interested to partake in the Conceptual Captions Challenge, and we look forward to seeing what the community can do! For more details and the latest results please visit the challenge website.

Acknowledgements
Thanks to Nan Ding, Sebastian Goodman and Bo Pang for training models with the Conceptual Captions dataset, and to Amol Wankhede for driving the public release efforts for the dataset.


1 In our paper, we posit that if automatic determination of names, locations, brands, etc. from the image is needed, it should be done as a separate task that may leverage image meta-information (e.g. GPS info), or complementary techniques such as OCR.

Source: Google AI Blog


MnasNet: Towards Automating the Design of Mobile Machine Learning Models



Convolutional neural networks (CNNs) have been widely used in image classification, face recognition, object detection and many other domains. Unfortunately, designing CNNs for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant effort has been made to design and improve mobile models, such as MobileNet and MobileNetV2, manually creating efficient models remains challenging when there are so many possibilities to consider. Inspired by recent progress in AutoML neural architecture search, we wondered if the design of mobile CNN models could also benefit from an AutoML approach.

In “MnasNet: Platform-Aware Neural Architecture Search for Mobile”, we explore an automated neural architecture search approach for designing mobile models using reinforcement learning. To deal with mobile speed constraints, we explicitly incorporate the speed information into the main reward function of the search algorithm, so that the search can identify a model that achieves a good trade-off between accuracy and speed. In doing so, MnasNet is able to find models that run 1.5x faster than state-of-the-art hand-crafted MobileNetV2 and 2.4x faster than NASNet, while reaching the same ImageNet top-1 accuracy.
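To make the idea of folding speed into the reward concrete, here is a minimal sketch of one possible latency-aware reward: accuracy scaled by a power of the ratio between measured latency and a target latency. The function name, the target of 80 ms and the exponent are illustrative assumptions, not values taken from this post; see the paper for the reward actually used.

```python
def latency_aware_reward(accuracy, latency_ms, target_ms=80.0, exponent=-0.07):
    """Scale accuracy by (latency / target) ** exponent.

    With a negative exponent (an assumed value), models slower than the
    target are penalized and faster models receive a small bonus.
    """
    return accuracy * (latency_ms / target_ms) ** exponent

# Three hypothetical models with the same accuracy but different speeds.
for latency in (40.0, 80.0, 160.0):
    print(latency, round(latency_aware_reward(0.75, latency), 4))
```

A small exponent keeps the reward dominated by accuracy while still steering the search away from slow models.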

Unlike in previous architecture search approaches, where model speed is considered via another proxy (e.g., FLOPS), our approach directly measures model speed by executing the model on a particular platform, e.g., Pixel phones which were used in this research study. In this way, we can directly measure what is achievable in real-world practice, given that each type of mobile device has its own software and hardware idiosyncrasies and may require different architectures for the best trade-offs between accuracy and speed.

The overall flow of our approach consists mainly of three components: an RNN-based controller for learning and sampling model architectures, a trainer that builds and trains models to obtain the accuracy, and an inference engine for measuring the model speed on real mobile phones using TensorFlow Lite. We formulate a multi-objective optimization problem that aims to achieve both high accuracy and high speed, and utilize a reinforcement learning algorithm with a customized reward function to find Pareto optimal solutions (e.g., models that have the highest accuracy without worsening speed).
Overall flow of our automated neural architecture search approach for Mobile.
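The notion of Pareto optimality used above can be illustrated with a short sketch that filters a set of candidate models down to the non-dominated ones. The model names, accuracies and latencies are made up for the example, and this brute-force filter is not the search procedure itself.

```python
def pareto_front(models):
    """Return names of models not dominated in (accuracy, latency).

    `models` is a list of (name, accuracy, latency_ms) tuples. A model is
    dominated if another model is at least as accurate and at least as fast,
    and strictly better in one of the two.
    """
    front = []
    for name, acc, lat in models:
        dominated = any(
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for _, a, l in models
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical candidates: "c" is both less accurate and slower than "b".
candidates = [("a", 0.70, 50), ("b", 0.75, 80), ("c", 0.72, 90), ("d", 0.76, 120)]
print(pareto_front(candidates))
```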
In order to strike the right balance between search flexibility and search space size, we propose a novel factorized hierarchical search space, which factorizes a convolutional neural network into a sequence of blocks, and then uses a hierarchical search space to determine the layer architecture for each block. In this way, our approach allows different layers to use different operations and connections; meanwhile, we force all layers in each block to share the same structure, thus significantly reducing the search space size by orders of magnitude compared to a flat per-layer search space.
Our MnasNet network, sampled from the novel factorized hierarchical search space, illustrating the layer diversity throughout the network architecture.
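A back-of-the-envelope count shows why sharing one structure per block shrinks the search space. The sketch below assumes every layer has the same number of candidate configurations; the specific numbers (10 choices, 5 blocks, 3 layers per block) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def flat_space_size(choices_per_layer, num_blocks, layers_per_block):
    """Every layer is chosen independently."""
    return choices_per_layer ** (num_blocks * layers_per_block)

def factorized_space_size(choices_per_layer, num_blocks):
    """All layers within a block share one choice, so one choice per block."""
    return choices_per_layer ** num_blocks

# Hypothetical sizes: 10 choices per layer, 5 blocks, 3 layers per block.
flat = flat_space_size(10, 5, 3)
factorized = factorized_space_size(10, 5)
print(flat, factorized, flat // factorized)
```

Under these assumptions the flat space is larger by a factor of choices ** (blocks * (layers_per_block - 1)), which grows quickly with the number of layers per block.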
We tested the effectiveness of our approach on ImageNet classification and COCO object detection. Our experiments achieve a new state-of-the-art accuracy under typical mobile speed constraints. In particular, the figure below shows the results on ImageNet.
ImageNet Accuracy and Inference Latency comparison. MnasNets are our models.
With the same accuracy, our MnasNet model runs 1.5x faster than the hand-crafted state-of-the-art MobileNetV2, and 2.4x faster than NASNet, which also used architecture search. After applying the squeeze-and-excitation optimization, our MnasNet+SE models achieve ResNet-50 level top-1 accuracy at 76.1%, with 19x fewer parameters and 10x fewer multiply-add operations. On COCO object detection, our model family achieves both higher accuracy and higher speed than MobileNet, and achieves comparable accuracy to the SSD300 model with 35x less computation cost.

We are pleased to see that our automated approach can achieve state-of-the-art performance on multiple complex mobile vision tasks. In future, we plan to incorporate more operations and optimizations into our search space, and apply it to more mobile vision tasks such as semantic segmentation.

Acknowledgements
Special thanks to the co-authors of the paper Bo Chen, Quoc V. Le, Ruoming Pang and Vijay Vasudevan. We’d also like to thank Andrew Howard, Barret Zoph, Dmitry Kalenichenko, Guiheng Zhou, Jeff Dean, Mark Sandler, Megan Kacholia, Sheng Li, Vishy Tirumalashetty, Wen Wang, Xiaoqiang Zheng and Yifeng Lu for their help, and the TensorFlow Lite and Google Brain teams.

Source: Google AI Blog


Improving Connectomics by an Order of Magnitude



The field of connectomics aims to comprehensively map the structure of the neuronal networks that are found in the nervous system, in order to better understand how the brain works. This process requires imaging brain tissue in 3D at nanometer resolution (typically using electron microscopy), and then analyzing the resulting image data to trace the brain’s neurites and identify individual synaptic connections. Due to the high resolution of the imaging, even a cubic millimeter of brain tissue can generate over 1,000 terabytes of data! When combined with the fact that the structures in these images can be extraordinarily subtle and complex, the primary bottleneck in brain mapping has been automating the interpretation of these data, rather than acquisition of the data itself.

Today, in collaboration with colleagues at the Max Planck Institute of Neurobiology, we published “High-Precision Automated Reconstruction of Neurons with Flood-Filling Networks” in Nature Methods, which shows how a new type of recurrent neural network can improve the accuracy of automated interpretation of connectomics data by an order of magnitude over previous deep learning techniques. An open-access version of this work is also available from bioRxiv (2017).

3D Image Segmentation with Flood-Filling Networks
Tracing neurites in large-scale electron microscopy data is an example of an image segmentation problem. Traditional algorithms have divided the process into at least two steps: finding boundaries between neurites using an edge detector or a machine-learning classifier, and then grouping together image pixels that are not separated by a boundary using an algorithm like watershed or graph cut. In 2015, we began experimenting with an alternative approach based on recurrent neural networks that unifies these two steps. The algorithm is seeded at a specific pixel location and then iteratively “fills” a region using a recurrent convolutional neural network that predicts which pixels are part of the same object as the seed. Since 2015, we have been working to apply this new approach to large-scale connectomics datasets and rigorously quantify its accuracy.
A flood-filling network segmenting an object in 2D. The yellow dot is the center of the current area of focus; the algorithm expands the segmented region (blue) as it iteratively examines more of the overall image.
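The seed-and-grow control flow can be illustrated with a classical flood fill. In the sketch below, a hand-written predicate `same_object` stands in for the learned recurrent convolutional network that decides which pixels belong with the seed. That substitution is an assumption made for illustration, so this is ordinary region growing, not a flood-filling network.

```python
from collections import deque

def flood_fill(image, seed, same_object):
    """Grow a region from `seed` over 4-connected pixels.

    `image` is a 2D list, `seed` is a (row, col) tuple, and `same_object`
    is a placeholder predicate comparing the seed value to a pixel value.
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    filled = {seed}
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in filled:
                if same_object(seed_value, image[nr][nc]):
                    filled.add((nr, nc))
                    frontier.append((nr, nc))
    return filled

# Toy image: the 1 at (2, 0) matches the seed value but is not connected to it.
toy = [[1, 1, 0],
       [0, 1, 0],
       [1, 0, 0]]
print(sorted(flood_fill(toy, (0, 0), lambda s, v: s == v)))
```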
Measuring Accuracy via Expected Run Length
Working with our partners at the Max Planck Institute, we devised a metric we call “expected run length” (ERL) that measures the following: given a random point within a random neuron in a 3D image of a brain, how far can we trace the neuron before making some kind of mistake? This is an example of a mean-time-between-failure metric, except that in this case we measure the amount of space between failures rather than the amount of time. For engineers, the appeal of ERL is that it relates a linear, physical path length to the frequency of individual mistakes that are made by an algorithm, and that it can be computed in a straightforward way. For biologists, the appeal is that a particular numerical value of ERL can be related to biologically relevant quantities, such as the average path length of neurons in different parts of the nervous system.
Progress in expected run length (blue line) leading up to the results shared today in Nature Methods. The red line shows progress in the “merge rate,” which measures the frequency with which two separate neurites were erroneously traced as a single object; achieving a very low merge rate is important for enabling efficient strategies for manual identification and correction of the remaining errors in the reconstruction.
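One simple way to read the ERL definition is as a length-weighted average: if a starting point is drawn uniformly along the total traced path, it lands in an error-free stretch with probability proportional to that stretch's length, and the run obtained is the length of that stretch. The sketch below computes this simplified quantity. It assumes the error-free stretch lengths are already known and ignores details of the published metric, such as how merge errors are scored.

```python
def expected_run_length(error_free_lengths):
    """Length-weighted mean of error-free path lengths.

    A stretch of length L is hit with probability L / total and yields a
    run of L, so the expectation is sum(L * L) / sum(L).
    """
    total = sum(error_free_lengths)
    if total == 0:
        return 0.0
    return sum(length * length for length in error_free_lengths) / total

# The same 20 units of neurite, traced with zero, one centered, or one
# off-center mistake (hypothetical lengths in arbitrary units).
print(expected_run_length([20]))
print(expected_run_length([10, 10]))
print(expected_run_length([15, 5]))
```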
Songbird Connectomics
We used ERL to measure our progress on a ground-truth set of neurons within a 1-million-cubic-micron volume of zebra finch songbird brain, imaged by our collaborators using serial block-face scanning electron microscopy, and found that our approach performed much better than previous deep learning pipelines applied to the same dataset.
Our algorithm in action as it traces a single neurite in 3D in a songbird brain.
We segmented every neuron in a small portion of a zebra finch songbird brain using the new flood-filling network approach, as depicted here:
Reconstruction of a portion of zebra finch brain. Colors denote distinct objects in the segmentation that was automatically generated using a flood-filling network. Gold spheres represent synaptic locations automatically identified using a previously published approach.
By combining these automated results with a small amount of additional human effort required to fix the remaining errors, our collaborators at the Max Planck Institute are now able to study the songbird connectome to derive new insights into how zebra finch birds sing their song and test theories related to how they learn their song.

Next Steps
We will continue to improve connectomics reconstruction technology, with the aim of fully automating synapse-resolution connectomics and contributing to ongoing connectomics projects at the Max Planck Institute and elsewhere. In order to help support the larger research community in developing connectomics techniques, we have also open-sourced the TensorFlow code for the flood-filling network approach, along with WebGL visualization software for 3D datasets that we developed to help us understand and improve our reconstruction results.

Acknowledgements
We would like to acknowledge core contributions from Tim Blakely, Peter Li, Larry Lindsey, Jeremy Maitin-Shepard, Art Pope and Mike Tyka (Google), as well as Joergen Kornfeld and Winfried Denk (Max Planck Institute).

Source: Google AI Blog


Accelerated Training and Inference with the TensorFlow Object Detection API



Last year we announced the TensorFlow Object Detection API, and since then we’ve released a number of new features, such as models learned via Neural Architecture Search, instance segmentation support and models trained on new datasets such as Open Images. We have been amazed at how it is being used – from finding scofflaws on the streets of NYC to diagnosing diseases on cassava plants in Tanzania.
Today, as part of Google’s commitment to democratizing computer vision, and using feedback from the research community on how to make this codebase even more useful, we’re excited to announce a number of additions to our API. Highlights of this release include:
  • Support for accelerated training of object detection models via Cloud TPUs
  • Improving the mobile deployment process by accelerating inference and making it easy to export a model to mobile with the TensorFlow Lite format
  • Several new model architecture definitions including:
Additionally, we are releasing pre-trained weights for each of the above models based on the COCO dataset.

Accelerated Training via Cloud TPUs
Users spend a great deal of time optimizing hyperparameters and retraining object detection models, so fast turnaround times on experiments are critical. The models released today belong to the single shot detector (SSD) class of architectures that are optimized for training on Cloud TPUs. For example, we can now train a ResNet-50 based RetinaNet model to achieve 35% mean Average Precision (mAP) on the COCO dataset in < 3.5 hrs.
Accelerated Inference via Quantization and TensorFlow Lite 
To better support low-latency requirements on mobile and embedded devices, the models we are providing are now natively compatible with TensorFlow Lite, which enables on-device machine learning inference with low latency and a small binary size. As part of this, we have implemented: (1) model quantization and (2) detection-specific operations natively in TensorFlow Lite. Our model quantization follows the strategy outlined in Jacob et al. (2018) and the whitepaper by Krishnamoorthi (2018) which applies quantization to both model weights and activations at training and inference time, yielding smaller models that run faster.
Quantized detection models are faster and smaller (e.g., a quantized 75% depth-reduced SSD Mobilenet model runs at >15 fps on a Pixel 2 CPU with a 4.2 Mb footprint) with minimal loss in detection accuracy compared to the full floating point model.
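To give a feel for what quantizing weights and activations means, the sketch below maps floating-point values to 8-bit integers with a scale and zero point, then maps them back. It illustrates only the affine value mapping on a plain list of numbers. It is not the TensorFlow Lite API and does not model quantization during training, and the function names are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers using a scale and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The represented range is widened to include 0.0 so that zero is exact.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [scale * (q - zero_point) for q in quantized]

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical weights
q, scale, zero_point = quantize(weights)
print(q)
print(dequantize(q, scale, zero_point))
```

Each value is stored in one byte instead of four, at the cost of a reconstruction error on the order of `scale`.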
Try it Yourself with a New Tutorial!
To get started training your own model on Cloud TPUs, check out our new tutorial! This walkthrough will take you through the process of training a quantized pet face detector on Cloud TPU then exporting it to an Android phone for inference via TensorFlow Lite conversion.

We hope that these new additions will help make high-quality computer vision models accessible to anyone wishing to solve an object detection problem, and provide a more seamless user experience, from training a model with quantization to exporting to a TensorFlow Lite model ready for on-device deployment. We would like to thank everyone in the community who has contributed features and bug fixes. As always, contributions to the codebase are welcome, and please stay tuned for more updates!

Acknowledgements
This post reflects the work of the following group of core contributors: Derek Chow, Aakanksha Chowdhery, Jonathan Huang, Pengchong Jin, Zhichao Lu, Vivek Rathod, Ronny Votel and Xiangxin Zhu. We would also like to thank the following colleagues: Vasu Agrawal, Sourabh Bajaj, Chiachen Chou, Tom Jablin, Wenzhe Li, Tsung-Yi Lin, Hernan Moraldo, Kevin Murphy, Sara Robinson, Andrew Selle, Shashi Shekhar, Yash Sonthalia, Zak Stone, Pete Warden and Menglong Zhu.

Source: Google AI Blog


Automating Drug Discoveries Using Computer Vision



“Every time you miss a protein crystal, because they are so rare, you risk missing on an important biomedical discovery.”
- Patrick Charbonneau, Duke University Dept. of Chemistry and Lead Researcher, MARCO initiative.

Protein crystallization is a key step in biomedical research concerned with discovering the structure of complex biomolecules. Because that structure determines the molecule’s function, it helps scientists design new drugs that are specifically targeted to that function. However, protein crystals are rare and difficult to find. Hundreds of experiments are typically run for each protein, and while the setup and imaging are mostly automated, finding individual protein crystals is still largely done by visual inspection and is thus prone to human error. Critically, missing these structures can result in lost opportunities for important biomedical discoveries that advance the state of medicine.

In collaboration with researchers from the MAchine Recognition of Crystallization Outcomes (MARCO) initiative, we have published “Classification of Crystallization Outcomes using Deep Convolutional Neural Networks” in PLOS One (ArXiv preprint), in which we discuss how we used some of the most recent architectures of deep convolutional networks and customized them to achieve an accuracy of more than 94% on the visual recognition task of identifying protein crystals. In order to spur further research in this area, we have made the data freely accessible, open-sourced our model as part of the TensorFlow research model repository, and made it available to researchers as a Cloud ML Engine endpoint.
Image of protein crystal, courtesy of the MARCO repository (CC-BY-4.0 license)
The MARCO initiative is a joint project between several pharmaceutical companies and academic research centers to pool and host a large repository of curated crystallography images, and make them available to the community to help develop better image analysis tools. When a member of the initiative reached out to Google with a well-defined problem, and half a million labelled images, we embraced the challenge of trying to apply the recent advances in deep learning to the problem.

Due to the large variability between imaging technologies and data acquisition approaches, coming up with a single approach to the visual recognition problem may appear daunting. Crystals can be very small, which makes them rare structures in a large image containing otherwise undifferentiated visual clutter.
Samples from the MARCO repository, illustrating the degree of variability between data sources.
Fortunately, given sufficient training data, modern deep convolutional networks are well suited to handle extreme variability in visual appearance. We modified the basic Inception V3 model to handle larger images while still being able to be trained quickly. The model achieves a level of precision and recall that makes its use practical in automated assessment pipelines.

This work is a great example of the effectiveness of multi-institutional collaborations aimed at solving problems that require data in amounts and levels of diversity that no single collaborator has access to. We invite researchers to take advantage of the resources that resulted from this work and share what they learn. This research was conducted as a personal 20% project by the author. To learn more about this work, please see our paper and read the recent Duke Research Blog post.

Source: Google AI Blog


Self-Supervised Tracking via Video Colorization



Tracking objects in video is a fundamental problem in computer vision, essential to applications such as activity recognition, object interaction, or video stylization. However, teaching a machine to visually track objects is challenging partly because it requires large, labeled tracking datasets for training, which are impractical to annotate at scale.

In “Tracking Emerges by Colorizing Videos”, we introduce a convolutional network that colorizes grayscale videos, but is constrained to copy colors from a single reference frame. In doing so, the network learns to visually track objects automatically without supervision. Importantly, although the model was never trained explicitly for tracking, it can follow multiple objects, track through occlusions, and remain robust over deformations without requiring any labeled training data.
Example tracking predictions on the publicly-available, academic dataset DAVIS 2017. After learning to colorize videos, a mechanism for tracking automatically emerges without supervision. We specify regions of interest (indicated by different colors) in the first frame, and our model propagates them forward without any additional learning or supervision.

Learning to Recolorize Video
Our hypothesis is that the temporal coherency of color provides excellent large-scale training data for teaching machines to track regions in video. Clearly, there are exceptions when color is not temporally coherent (such as lights turning on suddenly), but in general color is stable over time. Furthermore, most videos contain color, providing a scalable self-supervised learning signal. We remove the color from videos and then train a model to restore it: although there may be multiple objects with the same color, learning to colorize teaches machines to track specific objects or regions.

In order to train our system, we use videos from the Kinetics dataset, which is a large public collection of videos depicting everyday activities. We convert all video frames except the first frame into gray-scale, and train a convolutional network to predict the original colors in the subsequent frames. We expect the model to learn to follow regions in order to accurately recover the original colors. Our main observation is that the need to follow objects for colorization will cause a model for object tracking to be automatically learned.
We illustrate the video recolorization task using video from the DAVIS 2017 dataset. The model receives as input one color frame and a gray-scale video, and predicts the colors for the rest of the video. The model learns to copy colors from the reference frame, which enables a mechanism for tracking to be learned without human supervision.
Learning to copy colors from the single reference frame requires the model to learn to internally point to the right region in order to copy the right colors. This forces the model to learn an explicit mechanism that we can use for tracking. To see how the video colorization model works, we show some predicted colorizations from videos in the Kinetics dataset below.

Examples of predicted colors from colorized reference frame applied to input video using the publicly-available Kinetics dataset.

Although the network is trained without ground-truth identities, our model learns to track any visual region specified in the first frame of a video. We can track outlined objects or a single point in the video. The only change we make is that, instead of propagating colors throughout the video, we now propagate labels representing the regions of interest.
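One way such a pointing mechanism can be realized is soft attention: each target pixel compares its embedding with the embeddings of the reference-frame pixels and copies a similarity-weighted mix of their values. The sketch below assumes dot-product similarity followed by a softmax and uses tiny hand-made embeddings in place of the outputs of the convolutional network. Because the copy operation does not care what the values are, the same function propagates colors during training and region labels at tracking time.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def copy_from_reference(ref_embeddings, ref_values, target_embedding):
    """Copy a similarity-weighted mix of reference values to a target pixel."""
    scores = [
        sum(a * b for a, b in zip(emb, target_embedding))
        for emb in ref_embeddings
    ]
    weights = softmax(scores)
    dim = len(ref_values[0])
    return [
        sum(w * value[k] for w, value in zip(weights, ref_values))
        for k in range(dim)
    ]

# Two reference pixels with hypothetical embeddings.
ref_embeddings = [[10.0, 0.0], [0.0, 10.0]]
ref_colors = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # red and blue
ref_labels = [[1.0, 0.0], [0.0, 1.0]]            # one-hot region labels

target = [10.0, 0.0]  # resembles the first reference pixel
print(copy_from_reference(ref_embeddings, ref_colors, target))
print(copy_from_reference(ref_embeddings, ref_labels, target))
```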

Analyzing the Tracker
Since the model is trained on large amounts of unlabeled video, we want to gain insight into what the model learns. The videos below show a standard trick to visualize the embeddings learned by our model by projecting them down to three dimensions using Principal Component Analysis (PCA) and plotting them as an RGB movie. The results show that nearest neighbors in the learned embedding space tend to correspond to object identity, even over deformations and viewpoint changes.
Top Row: We show videos from the DAVIS 2017 dataset. Bottom Row: We visualize the internal embeddings from the colorization model. Similar embeddings will have a similar color in this visualization. This suggests the learned embedding is grouping pixels by object identity.

Tracking Pose
We found the model can also track human poses given key-points in an initial frame. We show results on the publicly-available, academic dataset JHMDB where we track a human joint skeleton.
Examples of using the model to track movements of the human skeleton. In this case the input was a human pose for the first frame and subsequent movement is automatically tracked. The model can track human poses even though it was never explicitly trained for this task.

While we do not yet outperform heavily supervised models, the colorization model learns to track video segments and human pose well enough to outperform the latest methods based on optical flow. Breaking down performance by motion type suggests that our model is a more robust tracker than optical flow for many natural complexities, such as dynamic backgrounds, fast motion, and occlusions. Please see the paper for details.

Future Work
Our results show that video colorization provides a signal that can be used for learning to track objects in videos without supervision. Moreover, we found that the failures from our system are correlated with failures to colorize the video, which suggests that further improving the video colorization model can advance progress in self-supervised tracking.

Acknowledgements
This project was only possible thanks to several collaborations at Google. The core team includes Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama and Kevin Murphy. We also thank David Ross, Bryan Seybold, Chen Sun and Rahul Sukthankar.

Source: Google AI Blog


Google at CVPR 2018

Posted by Christian Howard, Editor-in-Chief, Google AI Communications

This week, Salt Lake City hosts the 2018 Conference on Computer Vision and Pattern Recognition (CVPR 2018), the premier annual computer vision event comprising the main conference and several co-located workshops and tutorials. As a leader in computer vision research and a Diamond Sponsor, Google will have a strong presence at CVPR 2018 — over 200 Googlers will be in attendance to present papers and invited talks at the conference, and to organize and participate in multiple workshops.

If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively pursuing the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind portrait mode on the Pixel 2 and Pixel 2 XL smartphones, the Open Images V4 dataset and much more.

You can learn more about our research being presented at CVPR 2018 in the list below.

Organization
Finance Chair: Ramin Zabih

Area Chairs include: Sameer Agarwal, Aseem Agarwala, Jon Barron, Abhinav Shrivastava, Carl Vondrick, Ming-Hsuan Yang

Orals/Spotlights
Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee

DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, Yebin Liu

Neural Kinematic Networks for Unsupervised Motion Retargetting
Ruben Villegas, Jimei Yang, Duygu Ceylan, Honglak Lee

Burst Denoising with Kernel Prediction Networks
Ben Mildenhall, Jiawen Chen, Jonathan Barron, Robert Carroll, Dillon Sharlet, Ren Ng

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Matthew Tang, Menglong Zhu, Andrew Howard, Dmitry Kalenichenko, Hartwig Adam

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Focal Visual-Text Attention for Visual Question Answering
Junwei Liang, Lu Jiang, Liangliang Cao, Li-Jia Li, Alexander G. Hauptmann

Inferring Light Fields from Shadows
Manel Baradad, Vickie Ye, Adam Yedidia, Fredo Durand, William Freeman, Gregory Wornell, Antonio Torralba

Modifying Non-Local Variations Across Multiple Views
Tal Tlusty, Tomer Michaeli, Tali Dekel, Lihi Zelnik-Manor

Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen, Li-jia Li, Fei-Fei Li, Abhinav Gupta

Unsupervised Training for 3D Morphable Model Regression
Kyle Genova, Forrester Cole, Aaron Maschinot, Daniel Vlasic, Aaron Sarna, William Freeman

Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc Le

The iNaturalist Species Classification and Detection Dataset
Grant van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie

Learning Intrinsic Image Decomposition from Watching the World
Zhengqi Li, Noah Snavely

Learning Intelligent Dialogs for Bounding Box Annotation
Ksenia Konyushkova, Jasper Uijlings, Christoph Lampert, Vittorio Ferrari

Posters
Revisiting Knowledge Transfer for Training Object Class Detectors
Jasper Uijlings, Stefan Popov, Vittorio Ferrari

Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David Ross, Jia Deng, Rahul Sukthankar

Hierarchical Novelty Detection for Visual Object Recognition
Kibok Lee, Kimin Lee, Kyle Min, Yuting Zhang, Jinwoo Shin, Honglak Lee

COCO-Stuff: Thing and Stuff Classes in Context
Holger Caesar, Jasper Uijlings, Vittorio Ferrari

Appearance-and-Relation Networks for Video Classification
Limin Wang, Wei Li, Wen Li, Luc Van Gool

MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks
Ariel Gordon, Elad Eban, Bo Chen, Ofir Nachum, Tien-Ju Yang, Edward Choi

Deformable Shape Completion with Graph Convolutional Autoencoders
Or Litany, Alex Bronstein, Michael Bronstein, Ameesh Makadia

MegaDepth: Learning Single-View Depth Prediction from Internet Photos
Zhengqi Li, Noah Snavely

Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee

Burst Denoising with Kernel Prediction Networks
Ben Mildenhall, Jiawen Chen, Jonathan Barron, Robert Carroll, Dillon Sharlet, Ren Ng

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Matthew Tang, Menglong Zhu, Andrew Howard, Dmitry Kalenichenko, Hartwig Adam

Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling
Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Tianfan Xue, Joshua Tenenbaum, William Freeman

Sparse, Smart Contours to Represent and Edit Images
Tali Dekel, Dilip Krishnan, Chuang Gan, Ce Liu, William Freeman

MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features
Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, Hartwig Adam

Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning
Yin Cui, Yang Song, Chen Sun, Andrew Howard, Serge Belongie

Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks
Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Sung Jin Hwang, George Toderici, Troy Chinen, Joel Shor

MobileNetV2: Inverted Residuals and Linear Bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen

ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans 
Angela Dai, Daniel Ritchie, Martin Bokeloh, Scott Reed, Juergen Sturm, Matthias Nießner

Sim2Real View Invariant Visual Servoing by Recurrent Control
Fereshteh Sadeghi, Alexander Toshev, Eric Jang, Sergey Levine

Alternating-Stereo VINS: Observability Analysis and Performance Evaluation
Mrinal Kanti Paul, Stergios Roumeliotis

Soccer on Your Tabletop
Konstantinos Rematas, Ira Kemelmacher, Brian Curless, Steve Seitz

Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints
Reza Mahjourian, Martin Wicke, Anelia Angelova

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Inferring Light Fields from Shadows
Manel Baradad, Vickie Ye, Adam Yedidia, Fredo Durand, William Freeman, Gregory Wornell, Antonio Torralba

Modifying Non-Local Variations Across Multiple Views
Tal Tlusty, Tomer Michaeli, Tali Dekel, Lihi Zelnik-Manor

Aperture Supervision for Monocular Depth Estimation
Pratul Srinivasan, Rahul Garg, Neal Wadhwa, Ren Ng, Jonathan Barron

Instance Embedding Transfer to Unsupervised Video Object Segmentation
Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, C.-C. Jay Kuo

Frame-Recurrent Video Super-Resolution
Mehdi S. M. Sajjadi, Raviteja Vemulapalli, Matthew Brown

Weakly Supervised Action Localization by Sparse Temporal Pooling Network
Phuc Nguyen, Ting Liu, Gautam Prasad, Bohyung Han

Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen, Li-jia Li, Fei-Fei Li, Abhinav Gupta

Learning and Using the Arrow of Time
Donglai Wei, Andrew Zisserman, William Freeman, Joseph Lim

HydraNets: Specialized Dynamic Architectures for Efficient Inference
Ravi Teja Mullapudi, Noam Shazeer, William Mark, Kayvon Fatahalian

Thoracic Disease Identification and Localization with Limited Supervision
Zhe Li, Chong Wang, Mei Han, Yuan Xue, Wei Wei, Li-jia Li, Fei-Fei Li

Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis
Seunghoon Hong, Dingdong Yang, Jongwook Choi, Honglak Lee

Deep Semantic Face Deblurring
Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, Ming-Hsuan Yang

Unsupervised Training for 3D Morphable Model Regression
Kyle Genova, Forrester Cole, Aaron Maschinot, Daniel Vlasic, Aaron Sarna, William Freeman

Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc Le

Learning Intrinsic Image Decomposition from Watching the World
Zhengqi Li, Noah Snavely

PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection
Nian Liu, Junwei Han, Ming-Hsuan Yang

Tutorials
Computer Vision for Robotics and Driving
Anelia Angelova, Sanja Fidler

Unsupervised Visual Learning
Pierre Sermanet, Anelia Angelova

UltraFast 3D Sensing, Reconstruction and Understanding of People, Objects and Environments
Sean Fanello, Julien Valentin, Jonathan Taylor, Christoph Rhemann, Adarsh Kowdle, Jürgen Sturm, Christine Kaeser-Chen, Pavel Pidlypenskyi, Rohit Pandey, Andrea Tagliasacchi, Sameh Khamis, David Kim, Mingsong Dou, Kaiwen Guo, Danhang Tang, Shahram Izadi

Generative Adversarial Networks
Jun-Yan Zhu, Taesung Park, Mihaela Rosca, Phillip Isola, Ian Goodfellow

Source: Google AI Blog


Announcing an updated YouTube-8M, and the 2nd YouTube-8M Large-Scale Video Understanding Challenge and Workshop



Last year, we organized the first YouTube-8M Large-Scale Video Understanding Challenge with Kaggle, in which 742 teams consisting of 946 individuals from 60 countries used the YouTube-8M dataset (2017 edition) to develop classification algorithms which accurately assign video-level labels. The purpose of the competition was to accelerate improvements in large-scale video understanding, representation learning, noisy data modeling, transfer learning and domain adaptation approaches that can help improve the machine-learning models that classify video. In addition to the competition, we hosted an affiliated workshop at CVPR’17, inviting competition top-performers and researchers to share their ideas on how to advance the state of the art in video understanding.

As a continuation of these efforts to accelerate video understanding, we are excited to announce another update to the YouTube-8M dataset, a new Kaggle video understanding challenge and an affiliated 2nd Workshop on YouTube-8M Large-Scale Video Understanding, to be held at the 2018 European Conference on Computer Vision (ECCV'18).
An Updated YouTube-8M Dataset (2018 Edition)
Our YouTube-8M (2018 edition) features a major improvement in the quality of annotations, obtained using a machine learning system that combines audio-visual content with title, description and other metadata to provide more accurate ground truth annotations. The updated version contains 6.1 million URLs, labeled with a vocabulary of 3,862 visual entities, with each video annotated with one or more labels and an average of 3 labels per video. We have also updated the starter code, with updated instructions for downloading and training TensorFlow video annotation models on the dataset.

The 2nd YouTube-8M Video Understanding Challenge
The 2nd YouTube-8M Video Understanding Challenge invites participants to build audio-visual content classification models using YouTube-8M as training data, and then to label an unknown subset of test videos. Unlike last year, we strictly impose a hard limit on model size, encouraging participants to advance a single model within a tight budget rather than assembling as many models as possible. Each of the top 5 teams will be awarded $5,000 to support their travel to Munich to attend ECCV’18. For details, please visit the Kaggle competition page.

The 2nd Workshop on YouTube-8M Large-Scale Video Understanding
To be held at ECCV’18, the workshop will consist of invited talks by distinguished researchers, as well as presentations by top-performing challenge participants in order to facilitate the exchange of ideas. We encourage those who wish to attend to submit papers describing their research, experiments, or applications based on the YouTube-8M dataset, including papers summarizing their participation in the challenge above. Please refer to the workshop page for more details.

It is our hope that this update to the dataset, along with the new challenge and workshop, will continue to advance the research in large-scale video understanding. We hope you will join us again!

Acknowledgements
This post reflects the work of many machine perception researchers including Sami Abu-El-Haija, Ke Chen, Nisarg Kothari, Joonseok Lee, Hanhan Li, Paul Natsev, Sobhan Naderi Parizi, Rahul Sukthankar, George Toderici, Balakrishnan Varadarajan, as well as Sohier Dane, Julia Elliott, Wendy Kan and Walter Reade from Kaggle. We are also grateful for the support and advice from our partners at YouTube.

Source: Google AI Blog


Improving Deep Learning Performance with AutoAugment



The success of deep learning in computer vision can be partially attributed to the availability of large amounts of labeled training data: a model’s performance typically improves as you increase the quality, diversity and amount of training data. However, collecting enough quality data to train a model to perform well is often prohibitively difficult. One way around this is to hardcode image symmetries into neural network architectures so they perform better. Another is to have experts manually design data augmentation methods, like rotation and flipping, that are commonly used to train well-performing vision models. However, until recently, less attention has been paid to finding ways to automatically augment existing data using machine learning. Inspired by the results of our AutoML efforts to design neural network architectures and optimizers to replace components of systems that were previously human designed, we asked ourselves: can we also automate the procedure of data augmentation?

In “AutoAugment: Learning Augmentation Policies from Data”, we explore a reinforcement learning algorithm which increases both the amount and diversity of data in an existing training dataset. Intuitively, data augmentation is used to teach a model about image invariances in the data domain in a way that makes a neural network invariant to these important symmetries, thus improving its performance. Unlike previous state-of-the-art deep learning models that used hand-designed data augmentation policies, we used reinforcement learning to find the optimal image transformation policies from the data itself. The result improved performance of computer vision models without relying on the production of new and ever expanding datasets.

Augmenting Training Data
The idea behind data augmentation is simple: images have many symmetries that don’t change the information present in the image. For example, the mirror reflection of a dog is still a dog. While some of these “invariances” are obvious to humans, many are not. For example, the mixup method augments data by placing images on top of each other during training, resulting in data which improves neural network performance.
Left: An original image from the ImageNet dataset. Right: The same image transformed by a commonly used data augmentation transformation, a horizontal flip about the center.
AutoAugment is an automatic way to design custom data augmentation policies for computer vision datasets, e.g., guiding the selection of basic image transformation operations, such as flipping an image horizontally/vertically, rotating an image, changing the color of an image, etc. AutoAugment not only predicts what image transformations to combine, but also the per-image probability and magnitude of the transformation used, so that the image is not always manipulated in the same way. AutoAugment is able to select an optimal policy from a search space of 2.9 x 10^32 image transformation possibilities.
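To make the idea of a policy with per-image probability and magnitude concrete, here is a minimal Python sketch. It is not the AutoAugment implementation: the operation names, the `(name, probability, magnitude)` tuple layout, and the choice to pick one sub-policy at random per image are illustrative assumptions, and only two toy operations on NumPy arrays are included.

```python
import random
import numpy as np

# Hypothetical operations on an H x W x C uint8 image array.
# Both ignore `magnitude`; operations such as rotation or shearing would use it.
def flip_horizontal(img, magnitude):
    return img[:, ::-1, :]

def invert(img, magnitude):
    return 255 - img

OPS = {"flip": flip_horizontal, "invert": invert}

def apply_sub_policy(img, sub_policy, rng):
    """Apply each (op_name, probability, magnitude) step in order.

    Each step fires independently with its own probability, so the same
    image is not always transformed in the same way.
    """
    for name, prob, magnitude in sub_policy:
        if rng.random() < prob:
            img = OPS[name](img, magnitude)
    return img

def augment(img, policy, rng):
    """Assumption: a policy is a list of sub-policies, one chosen per image."""
    return apply_sub_policy(img, rng.choice(policy), rng)

# Example policy with two sub-policies (values are made up for illustration).
EXAMPLE_POLICY = [
    [("invert", 0.3, 0), ("flip", 0.5, 0)],
    [("flip", 0.8, 0)],
]
```

Calling `augment(image, EXAMPLE_POLICY, random.Random(0))` once per training image would yield a differently transformed copy on each call. The search procedure that picks the probabilities and magnitudes is the part AutoAugment learns, and it is not shown here.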

AutoAugment learns different transformations depending on what dataset it is run on. For example, for the Street View House Numbers (SVHN) dataset, which consists of natural scene images of digits, AutoAugment focuses on geometric transforms like shearing and translation, which represent distortions commonly observed in this dataset. In addition, AutoAugment has learned to completely invert colors, a variation that occurs naturally in the original SVHN dataset given the diversity of building and house number materials in the world.
Left: An original image from the SVHN dataset. Right: The same image transformed by AutoAugment. In this case, the optimal transformation was a result of shearing the image and inverting the colors of the pixels.
On CIFAR-10 and ImageNet, AutoAugment does not use shearing because these datasets generally do not include images of sheared objects, nor does it invert colors completely as these transformations would lead to unrealistic images. Instead, AutoAugment focuses on slightly adjusting the color and hue distribution, while preserving the general color properties. This suggests that the actual colors of objects in CIFAR-10 and ImageNet are important, whereas on SVHN only the relative colors are important.


Left: An original image from the ImageNet dataset. Right: The same image transformed by the AutoAugment policy. First, the image contrast is maximized, after which the image is rotated.
Results
Our AutoAugment algorithm found augmentation policies for some of the most well-known computer vision datasets that, when incorporated into the training of the neural network, led to state-of-the-art accuracies. By augmenting ImageNet data we obtain a new state-of-the-art top-1 accuracy of 83.54%, and on CIFAR-10 we achieve an error rate of 1.48%, which is a 0.83% improvement over the default data augmentation designed by scientists. On SVHN, we improved the state-of-the-art error from 1.30% to 1.02%. Importantly, AutoAugment policies are found to be transferable: the policy found for the ImageNet dataset could also be applied to other vision datasets (Stanford Cars, FGVC-Aircraft, etc.), which in turn improves neural network performance.

We are pleased to see that our AutoAugment algorithm achieved this level of performance on many different competitive computer vision datasets and look forward to seeing future applications of this technology across more computer vision tasks and even in other domains such as audio processing or language models. The policies with the best performance are included in the appendix of the paper, so that researchers can use them to improve their models on relevant vision tasks.

Acknowledgements
Special thanks to the co-authors of the paper Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. We’d also like to thank Alok Aggarwal, Gabriel Bender, Yanping Huang, Pieter-Jan Kindermans, Simon Kornblith, Augustus Odena, Avital Oliver, and Colin Raffel for their help with this project.

Source: Google AI Blog


Automatic Photography with Google Clips



To me, photography is the simultaneous recognition, in a fraction of a second, of the significance of an event as well as of a precise organization of forms which give that event its proper expression.
Henri Cartier-Bresson

The last few years have witnessed a Cambrian-like explosion in AI, with deep learning methods enabling computer vision algorithms to recognize many of the elements of a good photograph: people, smiles, pets, sunsets, famous landmarks and more. But, despite these recent advancements, automatic photography remains a very challenging problem. Can a camera capture a great moment automatically?

Recently, we released Google Clips, a new, hands-free camera that automatically captures interesting moments in your life. We designed Google Clips around three important principles:
  • We wanted all computations to be performed on-device. In addition to extending battery life and reducing latency, on-device processing means that none of your clips leave the device unless you decide to save or share them, which is a key privacy control.
  • We wanted the device to capture short videos, rather than single photographs. Moments with motion can be more poignant and true-to-memory, and it is often easier to shoot a video around a compelling moment than it is to capture a perfect, single instant in time.
  • We wanted to focus on capturing candid moments of people and pets, rather than the more abstract and subjective problem of capturing artistic images. That is, we did not attempt to teach Clips to think about composition, color balance, light, etc.; instead, Clips focuses on selecting ranges of time containing people and animals doing interesting activities.
Learning to Recognize Great Moments
How could we train an algorithm to recognize interesting moments? As with most machine learning problems, we started with a dataset. We created a dataset of thousands of videos in diverse scenarios where we imagined Clips being used. We also made sure our dataset represented a wide range of ethnicities, genders, and ages. We then hired expert photographers and video editors to pore over this footage to select the best short video segments. These early curations gave us examples for our algorithms to emulate. However, it is challenging to train an algorithm solely from the subjective selection of the curators — one needs a smooth gradient of labels to teach an algorithm to recognize the quality of content, ranging from "perfect" to "terrible."

To address this problem, we took a second data-collection approach, with the goal of creating a continuous quality score across the length of a video. We split each video into short segments (similar to the content Clips captures), randomly selected pairs of segments, and asked human raters to select the one they prefer.
We took this pairwise comparison approach, instead of having raters score videos directly, because it is much easier to choose the better of a pair than it is to specify a number. We found that raters were very consistent in pairwise comparisons, and less so when scoring directly. Given enough pairwise comparisons for any given video, we were able to compute a continuous quality score over the entire length. In this process, we collected over 50,000,000 pairwise comparisons on clips sampled from over 1,000 videos. That’s a lot of human effort!
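As a rough illustration of how pairwise preferences can be turned into a per-segment score, the sketch below uses a simple win rate. This is an assumption made for the example, not the method used for Clips, which the post does not describe; the function name and the `(winner, loser)` input format are hypothetical.

```python
from collections import defaultdict

def segment_scores(comparisons):
    """Turn pairwise preferences into a score per segment.

    comparisons: iterable of (winner_index, loser_index) pairs for one video.
    Returns {segment_index: fraction of its comparisons that it won}.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    # Segments that never won get 0.0 because defaultdict returns 0 for them.
    return {seg: wins[seg] / total[seg] for seg in total}
```

Ordering the segments by their position in the video then gives a score curve over its length. With few comparisons per segment a win rate is noisy, which is one reason a large number of comparisons helps.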
Training a Clips Quality Model
Given this quality score training data, our next step was to train a neural network model to estimate the quality of any photograph captured by the device. We started with the basic assumption that knowing what’s in the photograph (e.g., people, dogs, trees, etc.) will help determine “interestingness”. If this assumption is correct, we could learn a function that uses the recognized content of the photograph to predict its quality score derived above from human comparisons.

To identify content labels in our training data, we leveraged the same Google machine learning technology that powers Google image search and Google Photos, which can recognize over 27,000 different labels describing objects, concepts, and actions. We certainly didn’t need all these labels, nor could we compute them all on device, so our expert photographers selected the few hundred labels they felt were most relevant to predicting the “interestingness” of a photograph. We also added the labels most highly correlated with the rater-derived quality scores.

Once we had this subset of labels, we then needed to design a compact, efficient model that could predict them for any given image, on-device, within strict power and thermal limits. This presented a challenge, as the deep learning techniques behind computer vision typically require strong desktop GPUs, and algorithms adapted to run on mobile devices lag far behind state-of-the-art techniques on desktop or cloud. To train this on-device model, we first took a large set of photographs and again used Google’s powerful, server-based recognition models to predict label confidence for each of the “interesting” labels described above. We then trained a MobileNet Image Content Model (ICM) to mimic the predictions of the server-based model. This compact model is capable of recognizing the most interesting elements of photographs, while ignoring non-relevant content.

The final step was to predict a single quality score for an input photograph from its content predicted by the ICM, using the 50M pairwise comparisons as training data. This score is computed with a piecewise linear regression model that combines the output of the ICM into a frame quality score. This frame quality score is averaged across the video segment to form a moment score. Given a pairwise comparison, our model should compute a moment score that is higher for the video segment preferred by humans. The model is trained so that its predictions match the human pairwise comparisons as well as possible.
Diagram of the training process for generating frame quality scores. Piecewise linear regression maps from an ICM embedding to a score which, when averaged across a video segment, yields a moment score. The moment score of the preferred segment should be higher.
This process allowed us to train a model that combines the power of Google image recognition technology with the wisdom of human raters, represented by 50 million opinions on what makes interesting content!
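The scoring pipeline described above can be sketched in a few lines of Python with NumPy. Everything specific here is an assumption made for illustration: the additive form (one piecewise-linear function per label), the fixed breakpoints in `KNOTS`, and the logistic pairwise loss are not taken from the Clips implementation.

```python
import numpy as np

# Assumed breakpoints on label confidence; the real model's breakpoints are unknown.
KNOTS = np.linspace(0.0, 1.0, 5)

def frame_score(confidences, knot_values):
    """Score one frame from its label confidences.

    confidences: (num_labels,) ICM-style outputs in [0, 1].
    knot_values: (num_labels, len(KNOTS)) learnable values at each breakpoint.
    Each label contributes a piecewise-linear function of its confidence.
    """
    return sum(np.interp(c, KNOTS, v) for c, v in zip(confidences, knot_values))

def moment_score(segment, knot_values):
    """segment: (num_frames, num_labels). Average the frame scores."""
    return float(np.mean([frame_score(f, knot_values) for f in segment]))

def pairwise_loss(preferred, other, knot_values):
    """Logistic loss that is small when the preferred segment scores higher."""
    margin = moment_score(preferred, knot_values) - moment_score(other, knot_values)
    return float(np.log1p(np.exp(-margin)))
```

Training would adjust `knot_values` to minimize this loss summed over all rated pairs, for example by gradient descent. The optimizer is omitted to keep the sketch short.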

While this data-driven score does a great job of identifying interesting (and non-interesting) moments, we also added some bonuses to our overall quality score for phenomena that we know we want Clips to capture, including faces (especially recurring and thus “familiar” ones), smiles, and pets. In our most recent release, we added bonuses for certain activities that customers particularly want to capture, such as hugs, kisses, jumping, and dancing. Recognizing these activities required extensions to the ICM model.

Shot Control
Given this powerful model for predicting the “interestingness” of a scene, the Clips camera can decide which moments to capture in real-time. Its shot control algorithms follow three main principles:
  1. Respect Power & Thermals: We want the Clips battery to last roughly three hours, and we don’t want the device to overheat — the device can’t run at full throttle all the time. Clips spends much of its time in a low-power mode that captures one frame per second. If the quality of that frame exceeds a threshold set by how much Clips has recently shot, it moves into a high-power mode, capturing at 15 fps. Clips then saves a clip at the first quality peak encountered.
  2. Avoid Redundancy: We don’t want Clips to capture all of its moments at once, and ignore the rest of a session. Our algorithms therefore cluster moments into visually similar groups, and limit the number of clips in each cluster.
  3. The Benefit of Hindsight: It’s much easier to determine which clips are the best when you can examine the totality of clips captured. Clips therefore captures more moments than it intends to show to the user. When clips are ready to be transferred to the phone, the Clips device takes a second look at what it has shot, and only transfers the best and least redundant content.
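The first principle can be sketched as a small state machine. This is a simplified illustration, not the Clips firmware: the function name, the threshold values, and the rule that the threshold rises with every saved clip (rather than with recent shooting only) are assumptions. Redundancy clustering and the final hindsight pass are left out.

```python
def select_clips(qualities, base_threshold=0.5, per_clip_penalty=0.1, high_fps=15):
    """Return the frame indices at which a clip would be saved.

    qualities: quality score of every frame in a hypothetical 15 fps stream.
    Low-power mode inspects one frame per second (every `high_fps`-th frame).
    High-power mode inspects every frame and saves at the first quality peak.
    """
    saved = []
    i = 0
    while i < len(qualities):
        # Assumption: the bar rises with each clip already saved.
        threshold = base_threshold + per_clip_penalty * len(saved)
        if qualities[i] <= threshold:
            i += high_fps  # stay in low-power mode: skip to the next second
            continue
        # High-power mode: walk forward frame by frame while quality keeps rising.
        j = i
        while j + 1 < len(qualities) and qualities[j + 1] > qualities[j]:
            j += 1
        saved.append(j)   # first local peak
        i = j + high_fps  # drop back to low-power mode
    return saved
```

Raising the threshold after each save is one simple way to make the camera more selective as it shoots more, in the spirit of the threshold "set by how much Clips has recently shot".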
Machine Learning Fairness
In addition to making sure our video dataset represented a diverse population, we also constructed several other tests to assess the fairness of our algorithms. We created controlled datasets by sampling subjects from different genders and skin tones in a balanced manner, while keeping variables like content type, duration, and environmental conditions constant. We then used this dataset to test that our algorithms had similar performance when applied to different groups. To help detect any regressions in fairness that might occur as we improved our moment quality models, we added fairness tests to our automated system. Any change to our software was run across this battery of tests, and was required to pass. It is important to note that this methodology can’t guarantee fairness, as we can’t test for every possible scenario and outcome. However, we believe that these steps are an important part of our long-term work to achieve fairness in ML algorithms.
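An automated fairness check of the kind described above might look like the following sketch. The metric (mean score per group), the group names, and the tolerance are all hypothetical; the post does not say which metrics or thresholds were used.

```python
def group_means(scores_by_group):
    """scores_by_group: {group_name: list of per-sample performance scores}."""
    return {g: sum(v) / len(v) for g, v in scores_by_group.items()}

def fairness_gap(scores_by_group):
    """Largest difference in mean performance between any two groups."""
    means = group_means(scores_by_group)
    return max(means.values()) - min(means.values())

def passes_fairness_check(scores_by_group, tolerance=0.05):
    """Assumed pass criterion: the gap stays within a fixed tolerance."""
    return fairness_gap(scores_by_group) <= tolerance
```

Running such a check on the controlled datasets for every software change would flag a regression whenever one group's performance drifts away from the others. As the post notes, passing these checks does not guarantee fairness.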

Conclusion
Most machine learning algorithms are designed to estimate objective qualities – a photo contains a cat, or it doesn’t. In our case, we aim to capture a more elusive and subjective quality – whether a personal photograph is interesting, or not. We therefore combine the objective, semantic content of photographs with subjective human preferences to build the AI behind Google Clips. Also, Clips is designed to work alongside a person, rather than autonomously; to get good results, a person still needs to be conscious of framing, and make sure the camera is pointed at interesting content. We’re happy with how well Google Clips performs, and are excited to continue to improve our algorithms to capture that “perfect” moment!

Acknowledgements
The algorithms described here were conceived and implemented by a large group of Google engineers, research scientists, and others. Figures were made by Lior Shapira. Thanks to Lior and Juston Payne for video content.

Source: Google AI Blog