Tag Archives: Machine Perception

Announcing the Objectron Dataset

The state of the art in machine learning (ML) has achieved exceptional accuracy on many computer vision tasks solely by training models on photos. Building upon these successes and advancing 3D object understanding has great potential to power a wider range of applications, such as augmented reality, robotics, autonomy, and image retrieval. For example, earlier this year we released MediaPipe Objectron, a set of real-time 3D object detection models designed for mobile devices. Trained on a fully annotated, real-world 3D dataset, these models can predict objects’ 3D bounding boxes.

Yet, understanding objects in 3D remains a challenging task due to the lack of large real-world datasets compared to 2D tasks (e.g., ImageNet, COCO, and Open Images). To empower the research community for continued advancement in 3D object understanding, there is a strong need for the release of object-centric video datasets, which capture more of the 3D structure of an object, while matching the data format used for many vision tasks (i.e., video or camera streams), to aid in the training and benchmarking of machine learning models.

Today, we are excited to release the Objectron dataset, a collection of short, object-centric video clips capturing a larger set of common objects from different angles. Each video clip is accompanied by AR session metadata that includes camera poses and sparse point-clouds. The data also contain manually annotated 3D bounding boxes for each object, which describe the object’s position, orientation, and dimensions. The dataset consists of 15K annotated video clips supplemented with over 4M annotated images collected from a geo-diverse sample (covering 10 countries across five continents).

Example videos in the Objectron dataset.

A 3D Object Detection Solution
Along with the dataset, we are also sharing a 3D object detection solution for four categories of objects — shoes, chairs, mugs, and cameras. These models are released in MediaPipe, Google's open source framework for cross-platform customizable ML solutions for live and streaming media, which also powers ML solutions like on-device real-time hand, iris and body pose tracking.

Sample results of 3D object detection solution running on mobile.

In contrast to the previously released single-stage Objectron model, these newest versions utilize a two-stage architecture. The first stage employs the TensorFlow Object Detection model to find the 2D crop of the object. The second stage then uses the image crop to estimate the 3D bounding box while simultaneously computing the 2D crop of the object for the next frame, so that the object detector does not need to run every frame. The second stage 3D bounding box predictor runs at 83 FPS on Adreno 650 mobile GPU.

Diagram of a reference 3D object detection solution.

Evaluation Metric for 3D Object Detection
With ground truth annotations, we evaluate the performance of 3D object detection models using 3D intersection over union (IoU) similarity statistics, a commonly used metric for computer vision tasks, which measures how close the bounding boxes are to the ground truth.

We propose an algorithm for computing accurate 3D IoU values for general 3D-oriented boxes. First, we compute the intersection points between faces of the two boxes using the Sutherland-Hodgman polygon clipping algorithm. This is similar to frustum culling, a technique used in computer graphics. The volume of the intersection is then computed from the convex hull of all the clipped polygons. Finally, the IoU is computed from the volume of the intersection and the volume of the union of the two boxes. We are releasing the evaluation metrics source code along with the dataset.

Compute the 3D intersection over union using the polygon clipping algorithm. Left: Compute the intersection points of each face by clipping the polygon against the box. Right: Compute the volume of intersection by computing the convex hull of all intersection points (green).
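To make the clipping step more concrete, here is a minimal sketch of the same idea in 2D, where polygons stand in for box faces and area stands in for volume. It is an illustration under simplifying assumptions (convex, counter-clockwise polygons), not the released evaluation code; the function names are our own.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def clip_polygon(subject: List[Point], clip: List[Point]) -> List[Point]:
    """Sutherland-Hodgman: clip `subject` against a convex, counter-clockwise `clip`."""

    def inside(p: Point, a: Point, b: Point) -> bool:
        # True when p lies on or to the left of the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersection(p: Point, q: Point, a: Point, b: Point) -> Point:
        # Point where segment p-q crosses the infinite line through a and b.
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        denom = dx1 * dy2 - dy1 * dx2
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / denom
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        polygon, output = output, []
        for j in range(len(polygon)):
            p, q = polygon[j - 1], polygon[j]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersection(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersection(p, q, a, b))
        if not output:
            break  # Nothing left: the polygons do not overlap.
    return output


def polygon_area(poly: List[Point]) -> float:
    """Shoelace formula; returns 0.0 for an empty polygon."""
    twice_area = sum(
        poly[i - 1][0] * poly[i][1] - poly[i][0] * poly[i - 1][1]
        for i in range(len(poly))
    )
    return 0.5 * abs(twice_area)


def polygon_iou(a: List[Point], b: List[Point]) -> float:
    inter = polygon_area(clip_polygon(a, b))
    union = polygon_area(a) + polygon_area(b) - inter
    return inter / union if union > 0 else 0.0


# Two unit squares that overlap by half: intersection 0.5, union 1.5.
square_a = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
square_b = [(0.5, 0.0), (1.5, 0.0), (1.5, 1.0), (0.5, 1.0)]
print(polygon_iou(square_a, square_b))
```

In 3D, the same clipping is applied to each face of one box against the other box, and the intersection volume is taken from the convex hull of the clipped points rather than from a shoelace area.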

Dataset Format
The technical details of the Objectron dataset, including usage and tutorials, are available on the dataset website. The dataset includes bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes, and is stored in the objectron bucket on Google Cloud storage with the following assets:

  • The video sequences
  • The annotation labels (3D bounding boxes for objects)
  • AR metadata (such as camera poses, point clouds, and planar surfaces)
  • Processed dataset: shuffled version of the annotated frames, in tf.example format for images and SequenceExample format for videos.
  • Supporting scripts to run evaluation based on the metric described above
  • Supporting scripts to load the data into TensorFlow, PyTorch, and JAX and to visualize the dataset, including “Hello World” examples

With the dataset, we are also open-sourcing a data pipeline to parse the dataset in the popular TensorFlow, PyTorch, and JAX frameworks. Example Colab notebooks are also provided.

By releasing this Objectron dataset, we hope to enable the research community to push the limits of 3D object geometry understanding. We also hope to foster new research and applications, such as view synthesis, improved 3D representation, and unsupervised learning. Stay tuned for future activities and developments by joining our mailing list and visiting our GitHub page.

Acknowledgements
The research described in this post was done by Adel Ahmadyan, Liangkai Zhang, Jianing Wei, Artsiom Ablavatski, Mogan Shieh, Ryan Hickman, Buck Bourdon, Alexander Kanaukou, Chuo-Ling Chang, Matthias Grundmann, and Tom Funkhouser. We thank Aliaksandr Shyrokau, Sviatlana Mialik, Anna Eliseeva, and the annotation team for their high quality annotations. We also would like to thank Jonathan Huang and Vivek Rathod for their guidance on the TensorFlow Object Detection API.

Source: Google AI Blog


Background Features in Google Meet, Powered by Web ML

Video conferencing is becoming ever more critical in people's work and personal lives. Improving that experience with privacy enhancements or fun visual touches can help center our focus on the meeting itself. As part of this goal, we recently announced ways to blur and replace your background in Google Meet, which use machine learning (ML) to better highlight participants regardless of their surroundings. Whereas other solutions require installing additional software, Meet’s features are powered by cutting-edge web ML technologies built with MediaPipe that work directly in your browser — no extra steps necessary. One key goal in developing these features was to provide real-time, in-browser performance on almost all modern devices, which we accomplished by combining efficient on-device ML models, WebGL-based rendering, and web-based ML inference via XNNPACK and TFLite.

Background blur and background replacement, powered by MediaPipe on the web.

Overview of Our Web ML Solution
The new features in Meet are developed with MediaPipe, Google's open source framework for cross-platform customizable ML solutions for live and streaming media, which also powers ML solutions like on-device real-time hand, iris and body pose tracking.

A core need for any on-device solution is to achieve high performance. To accomplish this, MediaPipe’s web pipeline leverages WebAssembly, a low-level binary code format designed specifically for web browsers that improves speed for compute-heavy tasks. At runtime, the browser converts WebAssembly instructions into native machine code that executes much faster than traditional JavaScript code. In addition, Chrome 84 recently introduced support for WebAssembly SIMD, which processes multiple data points with each instruction, resulting in a performance boost of more than 2x.

Our solution first processes each video frame by segmenting a user from their background (more about our segmentation model later in the post) utilizing ML inference to compute a low resolution mask. Optionally, we further refine the mask to align it with the image boundaries. The mask is then used to render the video output via WebGL2, with the background blurred or replaced.

WebML Pipeline: All compute-heavy operations are implemented in C++/OpenGL and run within the browser via WebAssembly.

In the current version, model inference is executed on the client’s CPU for low power consumption and widest device coverage. To achieve real-time performance, we designed efficient ML models with inference accelerated by the XNNPACK library, the first inference engine specifically designed for the novel WebAssembly SIMD specification. Accelerated by XNNPACK and SIMD, the segmentation model can run in real-time on the web.

Enabled by MediaPipe's flexible configuration, the background blur/replace solution adapts its processing based on device capability. On high-end devices it runs the full pipeline to deliver the highest visual quality, whereas on low-end devices it continues to perform at speed by switching to compute-light ML models and bypassing the mask refinement.

Segmentation Model
On-device ML models need to be ultra lightweight for fast inference, low power consumption, and small download size. For models running in the browser, the input resolution greatly affects the number of floating-point operations (FLOPs) necessary to process each frame, and therefore needs to be small as well. We downsample the image to a smaller size before feeding it to the model. Recovering a segmentation mask as fine as possible from a low-resolution image adds to the challenges of model design.

The overall segmentation network has a symmetric structure with respect to encoding and decoding, while the decoder blocks (light green) also share a symmetric layer structure with the encoder blocks (light blue). Specifically, channel-wise attention with global average pooling is applied in both encoder and decoder blocks, which is friendly to efficient CPU inference.

Model architecture with MobileNetV3 encoder (light blue), and a symmetric decoder (light green).

We modified MobileNetV3-small as the encoder, which has been tuned by network architecture search for the best performance with low resource requirements. To reduce the model size by 50%, we exported our model to TFLite using float16 quantization, resulting in a slight loss in weight precision but with no noticeable effect on quality. The resulting model has 193K parameters and is only 400KB in size.
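The size figures above can be checked with simple arithmetic. The sketch below counts weight storage only and treats "193K" as exactly 193,000 parameters, which is an assumption; a real TFLite file also carries graph metadata, so the on-disk size is slightly larger.

```python
def weights_size_kb(num_params: int, bytes_per_weight: int) -> float:
    """Storage needed for the weights alone, in kilobytes (1 KB = 1000 bytes)."""
    return num_params * bytes_per_weight / 1000


NUM_PARAMS = 193_000  # assumed exact value behind the "193K" in the text

float32_kb = weights_size_kb(NUM_PARAMS, 4)  # 32-bit floats use 4 bytes each
float16_kb = weights_size_kb(NUM_PARAMS, 2)  # 16-bit floats use 2 bytes each

print(float32_kb, float16_kb, float16_kb / float32_kb)
```

Halving the bytes per weight halves the weight storage, which is consistent with the reported 50% reduction and a final size of roughly 400KB.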

Rendering Effects
Once segmentation is complete, we use OpenGL shaders for video processing and effect rendering, where the challenge is to render efficiently without introducing artifacts. In the refinement stage, we apply a joint bilateral filter to smooth the low resolution mask.

Rendering effects with artifacts reduced. Left: Joint bilateral filter smooths the segmentation mask. Middle: Separable filters remove halo artifacts in background blur. Right: Light wrapping in background replace.

The blur shader simulates a bokeh effect by adjusting the blur strength at each pixel proportionally to the segmentation mask values, similar to the circle-of-confusion (CoC) in optics. Pixels are weighted by their CoC radii, so that foreground pixels will not bleed into the background. We implemented separable filters for the weighted blur, instead of the popular Gaussian pyramid, as it removes halo artifacts surrounding the person. The blur is performed at a low resolution for efficiency, and blended with the input frame at the original resolution.
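The idea of a weighted, separable blur can be sketched on the CPU in a few lines. This is a simplified illustration rather than the shader used in Meet: it uses a fixed-radius box filter instead of a per-pixel CoC radius, works on a single-channel image stored as nested lists, and all function names are our own.

```python
from typing import List

Image = List[List[float]]


def box_sum_rows(img: Image, radius: int) -> Image:
    """Sum each pixel's horizontal neighbourhood of the given radius."""
    out = []
    for row in img:
        n = len(row)
        out.append([sum(row[max(0, x - radius):min(n, x + radius + 1)]) for x in range(n)])
    return out


def transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]


def separable_box_sum(img: Image, radius: int) -> Image:
    """2D box sum computed as a horizontal pass followed by a vertical pass."""
    horizontal = box_sum_rows(img, radius)
    return transpose(box_sum_rows(transpose(horizontal), radius))


def blur_background(image: Image, mask: Image, radius: int) -> Image:
    """Blur the background only; mask is 1.0 on the person and 0.0 on the background."""
    # Each pixel contributes to the blur in proportion to how much it is background.
    weights = [[1.0 - m for m in row] for row in mask]
    weighted = [[p * w for p, w in zip(prow, wrow)] for prow, wrow in zip(image, weights)]
    num = separable_box_sum(weighted, radius)
    den = separable_box_sum(weights, radius)
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x, pixel in enumerate(row):
            blurred = num[y][x] / den[y][x] if den[y][x] > 0 else pixel
            m = mask[y][x]
            # Blend: keep the person sharp, show the blurred result elsewhere.
            out_row.append(m * pixel + (1.0 - m) * blurred)
        out.append(out_row)
    return out


# A bright person (100) next to a dark background (0): no halo appears.
image = [[100.0, 100.0, 0.0, 0.0]]
mask = [[1.0, 1.0, 0.0, 0.0]]
print(blur_background(image, mask, radius=1))
```

Because foreground pixels carry zero weight, their colour never leaks into the blurred background, which is the property that removes the halo around the person.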

Background blur examples.

For background replacement, we adopt a compositing technique, known as light wrapping, for blending segmented persons and customized background images. Light wrapping helps soften segmentation edges by allowing background light to spill over onto foreground elements, making the compositing more immersive. It also helps minimize halo artifacts when there is a large contrast between the foreground and the replaced background.

Background replacement examples.

Performance
To optimize the experience for different devices, we provide model variants at multiple input sizes (i.e., 256x144 and 160x96 in the current release), automatically selecting the best according to available hardware resources.

We evaluated the speed of model inference and the end-to-end pipeline on two common devices: MacBook Pro 2018 with 2.2 GHz 6-Core Intel Core i7, and Acer Chromebook 11 with Intel Celeron N3060. For 720p input, the MacBook Pro can run the higher-quality model at 120 FPS and the end-to-end pipeline at 70 FPS, while the Chromebook runs inference at 62 FPS with the lower-quality model and 33 FPS end-to-end.

Model     FLOPs   Device               Model Inference    Pipeline
256x144   64M     MacBook Pro 18       8.3ms (120 FPS)    14.3ms (70 FPS)
160x96    27M     Acer Chromebook 11   16.1ms (62 FPS)    30ms (33 FPS)
Model inference speed and end-to-end pipeline on high-end (MacBook Pro) and low-end (Chromebook) laptops.
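The FPS figures follow directly from the per-frame latencies. A small sketch of the conversion (the dictionary layout is our own; the numbers are those reported above):

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second achievable when each frame takes `latency_ms` milliseconds."""
    return 1000.0 / latency_ms


latencies_ms = {
    "MacBook Pro, inference": 8.3,
    "MacBook Pro, pipeline": 14.3,
    "Chromebook, inference": 16.1,
    "Chromebook, pipeline": 30.0,
}

for name, latency in latencies_ms.items():
    print(f"{name}: {round(fps_from_latency_ms(latency))} FPS")
```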

For quantitative evaluation of model accuracy, we adopt the popular metrics of intersection-over-union (IOU) and boundary F-measure. Both models achieve high quality, especially for having such a lightweight network:

Model     IOU      Boundary F-measure
256x144   93.58%   0.9024
160x96    90.79%   0.8542
Evaluation of model accuracy, measured by IOU and boundary F-score.
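For reference, the IOU of two binary segmentation masks is the number of pixels marked in both masks divided by the number marked in either. A minimal sketch, assuming masks are given as nested lists of 0/1 values (the boundary F-measure is not covered here):

```python
from typing import List

Mask = List[List[int]]


def mask_iou(predicted: Mask, ground_truth: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(predicted, ground_truth):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly, so report 1.0 rather than dividing by zero.
    return intersection / union if union > 0 else 1.0


predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [0, 0]]
print(mask_iou(predicted, ground_truth))
```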

We also release the accompanying Model Card for our segmentation models, which details our fairness evaluations. Our evaluation data contains images from 17 geographical subregions of the globe, with annotations for skin tone and gender. Our analysis shows that the model is consistent in its performance across the various regions, skin-tones, and genders, with only small deviations in IOU metrics.

Conclusion
We introduced a new in-browser ML solution for blurring and replacing your background in Google Meet. With this, ML models and OpenGL shaders can run efficiently on the web. The developed features achieve real-time performance with low power consumption, even on low-power devices.

Acknowledgments
Special thanks to those on the Meet team and others who worked on this project, in particular Sebastian Jansson, Rikard Lundmark, Stephan Reiter, Fabian Bergmark, Ben Wagner, Stefan Holmer, Dan Gunnarson, Stéphane Hulaud and to all our team members who worked on the technology with us: Siargey Pisarchyk, Karthik Raveendran, Chris McClanahan, Marat Dukhan, Frank Barchard, Ming Guang Yong, Chuo-Ling Chang, Michael Hays, Camillo Lugaresi, Gregory Karpiak, Siarhei Kazakou, Matsvei Zhdanovich, and Matthias Grundmann.

Source: Google AI Blog


Experimenting with Automatic Video Creation From a Web Page

At Google, we're actively exploring how people can use creativity tools powered by machine learning and computational methods when producing multimedia content, from creating music and reframing videos, to drawing and more. One creative process in particular, video production, can especially benefit from such tools, as it requires a series of decisions about what content is best suited to a target audience, how to position the available assets within the field of view, and what temporal arrangement will yield the most compelling narrative. But what if one could leverage existing assets, such as a website, to get a jump-start on video creation? Businesses commonly host websites that contain rich visual representations about their services or products, all of which could be repurposed for other multimedia formats, such as videos, potentially enabling those without extensive resources the ability to reach a broader audience.

In “Automatic Video Creation From a Web Page”, published at UIST 2020, we introduce URL2Video, a research prototype pipeline to automatically convert a web page into a short video, given temporal and visual constraints provided by the content owner. URL2Video extracts assets (text, images, or videos) and their design styles (including fonts, colors, graphical layouts, and hierarchy) from HTML sources and organizes the visual assets into a sequence of shots, while maintaining a look-and-feel similar to the source page. Given a user-specified aspect ratio and duration, it then renders the repurposed materials into a video that is ideal for product and service advertising.

URL2Video Overview
Assume a user provides a URL to a web page that illustrates their business. The URL2Video pipeline automatically selects key content from the page and decides the temporal and visual presentation of each asset, based on a set of heuristics derived from an interview study with designers who were familiar with web design and video ad creation. These designer-informed heuristics capture common video editing styles, including content hierarchy, constraining the amount of information in a shot and its time duration, providing consistent color and style for branding, and more. Using this information, the URL2Video pipeline parses a web page, analyzing the content and selecting visually salient text or images while preserving their design styles, which it organizes according to the video specifications provided by the user.

By extracting the structural content and design from the input web page, URL2Video makes automatic editing decisions to present key messages in a video. It considers the temporal (e.g., the duration in seconds) and spatial (e.g., the aspect ratio) constraints of the output video defined by users.

Webpage Analysis
Given a webpage URL, URL2Video extracts document object model (DOM) information and multimedia materials. For the purposes of our research prototype, we limited the domain to static web pages that contain salient assets and headings preserved in an HTML hierarchy that follows recent web design principles, which encourage the use of prominent elements, distinct sections, and an order of visual focus that guides readers in perceiving information. URL2Video identifies such visually-distinguishable elements as a candidate list of asset groups, each of which may contain a heading, a product image, detailed descriptions, and call-to-action buttons, and captures both the raw assets (text and multimedia files) and detailed design specifications (HTML tags, CSS styles, and rendered locations) for each element. It then ranks the asset groups by assigning each a priority score based on their visual appearance and annotations, including their HTML tags, rendered sizes, and ordering shown on the page. In this way, an asset group that occupies a larger area at the top of the page receives a higher score.
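As a rough illustration of this kind of ranking, the sketch below scores asset groups by rendered area, vertical position, and heading tag. The formula, the tag weights, and every name in it are hypothetical; the post does not specify how URL2Video computes its priority score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AssetGroup:
    name: str
    tag: str       # HTML tag of the group's heading, e.g. "h1"
    width: float   # rendered width in pixels
    height: float  # rendered height in pixels
    top: float     # distance from the top of the page in pixels


# Hypothetical weights: more prominent heading levels count for more.
TAG_WEIGHT = {"h1": 1.0, "h2": 0.8, "h3": 0.6}


def priority_score(group: AssetGroup, page_height: float) -> float:
    area = group.width * group.height
    position = 1.0 - group.top / page_height  # 1.0 at the top, 0.0 at the bottom
    return area * position * TAG_WEIGHT.get(group.tag, 0.5)


def rank_asset_groups(groups: List[AssetGroup], page_height: float) -> List[AssetGroup]:
    return sorted(groups, key=lambda g: priority_score(g, page_height), reverse=True)


groups = [
    AssetGroup("footer", "h3", width=400, height=100, top=1800),
    AssetGroup("hero", "h1", width=1200, height=600, top=0),
]
ranked = rank_asset_groups(groups, page_height=2000)
print([g.name for g in ranked])
```

With these made-up numbers, the large hero section at the top of the page outranks the small footer block, matching the behavior described above.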

Constraints-Based Asset Selection
We consider two goals when composing a video: (1) each video shot should provide concise information, and (2) the visual design should be consistent with the source page. Based on these goals and the video constraints provided by the user, including the intended video duration (in seconds) and aspect ratio (commonly 16:9, 4:3, 1:1, etc.), URL2Video automatically selects and orders the asset groups to optimize the total priority score. To make the content concise, it presents only dominant elements from a page, such as a headline and a few multimedia assets. It constrains the duration of each visual element for viewers to perceive the content. In this way, a short video highlights the most salient information from the top of the page, and a longer video contains more campaigns or products.
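One way to picture selection under a duration constraint is as a small knapsack problem: pick the asset groups whose total on-screen time fits the requested duration while maximizing the summed priority. The sketch below is our own illustration of that framing, with made-up scores and per-asset durations; it is not necessarily how URL2Video performs the optimization.

```python
from typing import List, Tuple

# (name, priority score, seconds on screen), listed in page order
Asset = Tuple[str, float, int]


def select_assets(assets: List[Asset], max_seconds: int) -> List[str]:
    """0/1 knapsack over whole seconds; returns chosen names in page order."""
    # best[t] holds (total score, chosen indices) achievable within t seconds.
    best = [(0.0, [])] * (max_seconds + 1)
    for idx, (_, priority, seconds) in enumerate(assets):
        # Iterate downwards so each asset is used at most once.
        for t in range(max_seconds, seconds - 1, -1):
            score, chosen = best[t - seconds]
            if score + priority > best[t][0]:
                best[t] = (score + priority, chosen + [idx])
    return [assets[i][0] for i in best[max_seconds][1]]


assets = [
    ("headline", 10.0, 3),
    ("product_a", 6.0, 4),
    ("product_b", 5.0, 4),
    ("footer", 1.0, 2),
]
print(select_assets(assets, max_seconds=6))
print(select_assets(assets, max_seconds=12))
```

In this toy example the short video keeps the headline and only what else fits, while the longer video has room for both products.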

Scene Composition & Video Rendering
Given an ordered list of assets based on the DOM hierarchy, URL2Video follows the design heuristics obtained from interview studies to make decisions about both the temporal and spatial arrangement to present the assets in individual shots. It transfers the graphical layout of elements into the video’s aspect ratio, and applies the style choices including fonts and colors. To make a video more dynamic and engaging, it adjusts the presentation timing of assets. Finally, it renders the content into a video in the MPEG-4 container format.

User Control
The interface to the research prototype allows the user to review the design attributes in each video shot extracted from the source page, reorder the materials, change the detailed design, such as colors and fonts, and adjust the constraints to generate a new video.

In URL2Video's authoring interface (left), users specify the input URL to a source page, size of the target page view, and the output video parameters. URL2Video analyzes the web page and extracts major visual components. It composes a series of scenes and visualizes the key frames as a storyboard. These components are rendered into an output video that satisfies the input temporal and spatial constraints. Users can play back the video, examine the design attributes (bottom-right), and make adjustments to generate video variations, such as reordering the scenes (top-right).

URL2Video Use Cases
We demonstrate the performance of the end-to-end URL2Video pipeline on a variety of existing web pages. Below we highlight an example result where URL2Video converts a page that embeds multiple short video clips into a 12-second output video. Note how the pipeline makes automatic editing decisions on font and color choices, timing, and content ordering in a video captured from the source page.

URL2Video identifies key content from our Google Search introduction page (top), including headings and video assets. It converts them into a video by considering the presentation flow, the source design and the output constraints (a 12-second landscape video; bottom).

The video below provides further demonstration:

To evaluate the automatically-generated videos, we conducted a user study with designers at Google. Our results show that URL2Video effectively extracted design elements from a web page and supported designers by bootstrapping the video creation process.

Next steps
While this current research focuses on the visual presentation, we are developing new techniques that support the audio track and a voiceover in video editing. All in all, we envision a future where creators focus on making high-level decisions and an ML model interactively suggests detailed temporal and graphical edits for a final video creation on multiple platforms.

Acknowledgments
We greatly thank our paper co-authors, Zheng Sun (Research) and Katrina Panovich (YouTube). We would also like to thank our colleagues who contributed to URL2Video, (in alphabetical order of last name) Jordan Canedy, Brian Curless, Nathan Frey, Madison Le, Alireza Mahdian, Justin Parra, Emily Ryan, Mogan Shieh, Sandor Szego, and Weilong Yang. We are grateful to receive the support from our leadership, Tomas Izo, Rahul Sukthankar, and Jay Yagnik.

Source: Google AI Blog


Audiovisual Speech Enhancement in YouTube Stories

While tremendous efforts are invested in improving the quality of videos taken with smartphone cameras, the quality of audio in videos is often overlooked. For example, the speech of a subject in a video where there are multiple people speaking or where there is high background noise might be muddled, distorted, or difficult to understand. In an effort to address this, two years ago we introduced Looking to Listen, a machine learning (ML) technology that uses both visual and audio cues to isolate the speech of a video’s subject. By training the model on a large-scale collection of online videos, we are able to capture correlations between speech and visual signals such as mouth movements and facial expressions, which can then be used to separate the speech of one person in a video from another, or to separate speech from background sounds. We showed that this technology not only achieves state-of-the-art results in speech separation and enhancement (a noticeable 1.5dB improvement over audio-only models), but in particular can improve the results over audio-only processing when there are multiple people speaking, as the visual cues in the video help determine who is saying what.

We are now happy to make the Looking to Listen technology available to users through a new audiovisual Speech Enhancement feature in YouTube Stories (on iOS), allowing creators to take better selfie videos by automatically enhancing their voices and reducing background noise. Getting this technology into users’ hands was no easy feat. Over the past year, we worked closely with users to learn how they would like to use such a feature, in what scenarios, and what balance of speech and background sounds they would like to have in their videos. We heavily optimized the Looking to Listen model to make it run efficiently on mobile devices, overall reducing the running time from 10x real-time on a desktop when our paper came out, to 0.5x real-time performance on the phone. We also put the technology through extensive testing to verify that it performs consistently across different recording conditions and for people with different appearances and voices.

From Research to Product
Optimizing Looking to Listen to allow fast and robust operation on mobile devices required us to overcome a number of challenges. First, all processing needed to be done on-device within the client app in order to minimize processing time and to preserve the user’s privacy; no audio or video information would be sent to servers for processing. Further, the model needed to co-exist alongside other ML algorithms used in the YouTube app in addition to the resource-consuming video recording itself. Finally, the algorithm needed to run quickly and efficiently on-device while minimizing battery consumption.

The first step in the Looking to Listen pipeline is to isolate thumbnail images that contain the faces of the speakers from the video stream. By leveraging MediaPipe BlazeFace with GPU accelerated inference, this step is now able to be executed in just a few milliseconds. We then switched the model part that processes each thumbnail separately to a lighter weight MobileNet (v2) architecture, which outputs visual features learned for the purpose of speech enhancement, extracted from the face thumbnails in 10 ms per frame. Because the compute time to embed the visual features is short, it can be done while the video is still being recorded. This avoids the need to keep the frames in memory for further processing, thereby reducing the overall memory footprint. Then, after the video finishes recording, the audio and the computed visual features are streamed to the audio-visual speech separation model which produces the isolated and enhanced speech.

We reduced the total number of parameters in the audio-visual model by replacing “regular” 2D convolutions with separable ones (1D in the frequency dimension, followed by 1D in the time dimension) with fewer filters. We then optimized the model further using TensorFlow Lite — a set of tools that enable running TensorFlow models on mobile devices with low latency and a small binary size. Finally, we reimplemented the model within the Learn2Compress framework in order to take advantage of built-in quantized training and QRNN support.
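The saving from separable convolutions can be estimated by counting weights. The kernel size and channel counts below are hypothetical, chosen only to illustrate the arithmetic (biases are ignored); they are not the dimensions of the Looking to Listen model.

```python
def conv2d_params(k_freq: int, k_time: int, c_in: int, c_out: int) -> int:
    """Weights in a regular 2D convolution with a k_freq x k_time kernel."""
    return k_freq * k_time * c_in * c_out


def separable_params(k_freq: int, k_time: int, c_in: int, c_mid: int, c_out: int) -> int:
    """Weights in a 1D convolution over frequency followed by a 1D convolution over time."""
    return k_freq * c_in * c_mid + k_time * c_mid * c_out


regular = conv2d_params(5, 5, c_in=64, c_out=64)
separable = separable_params(5, 5, c_in=64, c_mid=64, c_out=64)
print(regular, separable, separable / regular)
```

For a 5x5 kernel the separable form needs 10 weights per input/output channel pair instead of 25, and using fewer filters reduces the count further.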

Our Looking to Listen on-device pipeline for audiovisual speech enhancement

These optimizations and improvements reduced the running time from 10x real-time on a desktop using the original formulation of Looking to Listen, to 0.5x real-time performance using only an iPhone CPU. They also brought the model size down from 120MB to 6MB, which makes it easier to deploy. Since YouTube Stories videos are short (limited to 15 seconds), the result of the video processing is available within a couple of seconds after the recording is finished.

Finally, to avoid processing videos with clean speech (so as to avoid unnecessary computation), we first run our model only on the first two seconds of the video, then compare the speech-enhanced output to the original input audio. If there is sufficient difference (meaning the model cleaned up the speech), then we enhance the speech throughout the rest of the video.

Researching User Needs
Early versions of Looking to Listen were designed to entirely isolate speech from the background noise. In a user study conducted together with YouTube, we found that users prefer to leave in some of the background sounds to give context and to retain some of the general ambiance of the scene. Based on this user study, we take a linear combination of the original audio and our produced clean speech channel: output_audio = 0.1 x original_audio + 0.9 x speech. The following video presents clean speech combined with different levels of the background sounds in the scene (10% background is the balance we use in practice).
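The mixing formula is a per-sample weighted sum. A minimal sketch, assuming both signals are equal-length lists of float samples (the function and parameter names are our own):

```python
from typing import List


def mix_audio(original: List[float], speech: List[float], background_level: float = 0.1) -> List[float]:
    """Blend the original recording with the enhanced speech, sample by sample."""
    if len(original) != len(speech):
        raise ValueError("original and speech must have the same number of samples")
    speech_level = 1.0 - background_level
    return [background_level * o + speech_level * s for o, s in zip(original, speech)]


# With the default 10% background: 0.1 * original + 0.9 * speech.
mixed = mix_audio([1.0, 0.0, 0.5], [0.0, 1.0, 0.5])
print(mixed)
```

Setting `background_level` to 0.0 reproduces the fully isolated speech of the early versions, and 1.0 returns the original audio unchanged.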

Below are additional examples of the enhanced speech results from the new Speech Enhancement feature in YouTube Stories. We recommend watching the videos with good speakers or headphones.

Fairness Analysis
Another important requirement is that the model be fair and inclusive. It must be able to handle different types of voices, languages and accents, as well as different visual appearances. To this end, we conducted a series of tests exploring the performance of the model with respect to various visual and speech/auditory attributes: the speaker’s age, skin tone, spoken language, voice pitch, visibility of the speaker’s face (% of video in which the speaker is in frame), head pose throughout the video, facial hair, presence of glasses, and the level of background noise in the (input) video.

For each of the above visual/auditory attributes, we ran our model on segments from our evaluation set (separate from the training set) and measured the speech enhancement accuracy, broken down according to the different attribute values. Results for some of the attributes are summarized in the following plots. Each data point in the plots represents hundreds (in most cases thousands) of videos fitting the criteria.

Speech enhancement quality (signal-to-distortion ratio, SDR, in dB) for different spoken languages, sorted alphabetically. The average SDR was 7.89 dB with a standard deviation of 0.42 dB, a deviation that is considered hard for human listeners to notice.

Left: Speech enhancement quality as a function of the speaker’s voice pitch. The fundamental voice frequency (pitch) of an adult male typically ranges from 85 to 180 Hz, and that of an adult female ranges from 165 to 255 Hz. Right: Speech enhancement quality as a function of the speaker’s predicted age.

As our method utilizes facial cues and mouth movements to isolate the speech, we tested whether facial hair (e.g., a moustache or beard) may obstruct those visual cues and affect the method’s performance. Our evaluations show that the quality of speech enhancement is also maintained well in the presence of facial hair.

Using the Feature
YouTube creators who are eligible for YouTube Stories creation may record a video on iOS, and select “Enhance speech” from the volume controls editing tool. This will immediately apply speech enhancement to the audio track and will play back the enhanced speech in a loop. It is then possible to toggle the feature on and off multiple times to compare the enhanced speech with the original audio.

In parallel to this new feature in YouTube, we are also exploring additional venues for this technology. More to come later this year — stay tuned!

Acknowledgements
This feature is a collaboration across multiple teams at Google. Key contributors include: from Research-IL: Oran Lang; from VisCAM: Ariel Ephrat, Mike Krainin, JD Velasquez, Inbar Mosseri, Michael Rubinstein; from Learn2Compress: Arun Kandoor; from MediaPipe: Buck Bourdon, Matsvei Zhdanovich, Matthias Grundmann; from YouTube: Andy Poes, Vadim Lavrusik, Aaron La Lau, Willi Geiger, Simona De Rosa, and Tomer Margolin.

Source: Google AI Blog


On-device, Real-time Body Pose Tracking with MediaPipe BlazePose

Pose estimation from video plays a critical role in enabling the overlay of digital content and information on top of the physical world in augmented reality, sign language recognition, full-body gesture control, and even quantifying physical exercises, where it can form the basis for yoga, dance, and fitness applications. Pose estimation for fitness applications is particularly challenging due to the wide variety of possible poses (e.g., hundreds of yoga asanas), numerous degrees of freedom, occlusions (e.g., the body or other objects occlude limbs as seen from the camera), and a variety of appearances or outfits.

BlazePose results on fitness and dance use-cases.

Today we are announcing the release of a new approach to human body pose perception, BlazePose, which we presented at the CV4ARVR workshop at CVPR 2020. Our approach provides human pose tracking by employing machine learning (ML) to infer 33 2D landmarks of a body from a single frame. In contrast to current pose models based on the standard COCO topology, BlazePose accurately localizes more keypoints, making it uniquely suited for fitness applications. In addition, current state-of-the-art approaches rely primarily on powerful desktop environments for inference, whereas our method achieves real-time performance on mobile phones with CPU inference. If one leverages GPU inference, BlazePose achieves super-real-time performance, enabling it to run subsequent ML models, like face or hand tracking.

Upper-body BlazePose model in MediaPipe

Topology
The current standard for human body pose is the COCO topology, which consists of 17 landmarks across the torso, arms, legs, and face. However, the COCO keypoints only localize to the ankle and wrist points, lacking scale and orientation information for hands and feet, which is vital for practical applications like fitness and dance. The inclusion of more keypoints is crucial for the subsequent application of domain-specific pose estimation models, like those for hands, face, or feet.

With BlazePose, we present a new topology of 33 human body keypoints, which is a superset of the COCO, BlazeFace and BlazePalm topologies. This allows us to determine body semantics from the pose prediction alone, in a way that is consistent with the face and hand models.

BlazePose 33 keypoint topology as COCO (colored with green) superset

Overview: An ML Pipeline for Pose Tracking
For pose estimation, we utilize our proven two-step detector-tracker ML pipeline. Using a detector, this pipeline first locates the pose region-of-interest (ROI) within the frame. The tracker subsequently predicts all 33 pose keypoints from this ROI. Note that for video use cases, the detector is run only on the first frame. For subsequent frames we derive the ROI from the previous frame’s pose keypoints as discussed below.

Human pose estimation pipeline overview.
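The control flow of the detector-tracker pipeline described above can be sketched as follows. This is a minimal illustration, not the MediaPipe implementation: `detect` and `track` are hypothetical callables standing in for the two models, the ROI is simplified to an axis-aligned box with an assumed margin, and re-running the detector when the tracker returns nothing is an assumption that the text does not state.

```python
import numpy as np

def roi_from_keypoints(keypoints, margin=0.25):
    """Axis-aligned box around the keypoints, padded by a relative margin.

    The box format (x_min, y_min, x_max, y_max) and the margin value are
    assumptions made for this sketch.
    """
    xy = np.asarray(keypoints, dtype=float)[:, :2]
    x_min, y_min = xy.min(axis=0)
    x_max, y_max = xy.max(axis=0)
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    return (x_min - pad_x, y_min - pad_y, x_max + pad_x, y_max + pad_y)

def track_poses(frames, detect, track):
    """Yield keypoints per frame, running the detector only when no ROI exists.

    detect(frame) -> ROI or None        (hypothetical pose detector)
    track(frame, roi) -> keypoints/None (hypothetical pose tracker)
    """
    roi = None
    for frame in frames:
        if roi is None:
            roi = detect(frame)
            if roi is None:          # nobody found in this frame
                yield None
                continue
        keypoints = track(frame, roi)
        if keypoints is None:        # assumed fallback: detect again next frame
            roi = None
            yield None
            continue
        # The next frame's ROI comes from this frame's keypoints.
        roi = roi_from_keypoints(keypoints)
        yield keypoints
```

With stub models, the detector is invoked once for a whole clip as long as the tracker keeps returning keypoints.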

Pose Detection by extending BlazeFace
For real-time performance of the full ML pipeline consisting of pose detection and tracking models, each component must be very fast, using only a few milliseconds per frame. To accomplish this, we observe that the strongest signal to the neural network about the position of the torso is the person's face (due to its high-contrast features and comparably small variations in appearance). Therefore, we achieve a fast and lightweight pose detector by making the strong (yet for many mobile and web applications valid) assumption that the head should be visible for our single-person use case.

Consequently, we trained a face detector, inspired by our sub-millisecond BlazeFace model, as a proxy for a pose detector. Note, this model only detects the location of a person within the frame and cannot be used to identify individuals. In contrast to the Face Mesh and MediaPipe Hand tracking pipelines, where we derive the ROI from predicted keypoints, for the human pose tracking we explicitly predict two additional virtual keypoints that firmly describe the human body center, rotation and scale as a circle. Inspired by Leonardo’s Vitruvian man, we predict the midpoint of a person's hips, the radius of a circle circumscribing the whole person, and the incline angle of the line connecting the shoulder and hip midpoints. This results in consistent tracking even for very complicated cases, like specific yoga asanas. The figure below illustrates the approach.

Vitruvian man aligned via two virtual keypoints predicted by our BlazePose detector in addition to the face bounding box
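As a rough illustration of how two virtual keypoints can encode center, scale and rotation, the sketch below assumes one particular encoding: the first keypoint is the hip midpoint, and the second lies on the circumscribing circle in the hip-to-shoulder direction. That encoding, the function name, and the image-coordinate convention (y grows downwards) are assumptions for this example and are not taken from the post.

```python
import math

def alignment_from_virtual_keypoints(center, edge):
    """Recover (center, radius, rotation) from two virtual keypoints.

    center: (x, y) of the hip midpoint.
    edge:   (x, y) of a point on the circumscribing circle, assumed to lie
            along the hip-to-shoulder direction.
    Rotation is in radians, 0 for an upright person, positive when the
    body leans towards +x (image y is assumed to grow downwards).
    """
    dx = edge[0] - center[0]
    dy = edge[1] - center[1]
    radius = math.hypot(dx, dy)
    rotation = math.atan2(dx, -dy)
    return center, radius, rotation

# Upright person: the second keypoint sits directly above the hips.
_, radius, rotation = alignment_from_virtual_keypoints((100.0, 200.0), (100.0, 50.0))
print(radius, rotation)
```

The resulting circle and angle are enough to crop and rotate the input so that the tracker always sees a person in a canonical orientation.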

Tracking Model
The pose estimation component of the pipeline predicts the location of all 33 person keypoints with three degrees of freedom each (x, y location and visibility) plus the two virtual alignment keypoints described above. Unlike current approaches that employ compute-intensive heatmap prediction, our model uses a regression approach that is supervised by a combined heat map/offset prediction of all keypoints, as shown below.

Tracking network architecture: regression with heatmap supervision

Specifically, during training we first employ a heatmap and offset loss to train the center and left tower of the network. We then remove the heatmap output and train the regression encoder (right tower), thus effectively using the heatmap to supervise a lightweight embedding.

The table below shows an ablation study of the model quality resulting from different training strategies. As an evaluation metric, we use the Percent of Correct Points with 20% tolerance (PCK@0.2) (where we assume the point to be detected correctly if the 2D Euclidean error is smaller than 20% of the corresponding person’s torso size). To obtain a human baseline, we asked annotators to annotate several samples redundantly and obtained an average PCK@0.2 of 97.2. The training and validation have been done on a geo-diverse dataset of various poses, sampled uniformly.
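The PCK@0.2 metric defined above is straightforward to compute. The following sketch assumes that predictions and ground truth are given as arrays of 2D points and that the torso size of each person is supplied by the caller, since the post does not specify how torso size is measured; the function name and array layout are illustrative.

```python
import numpy as np

def pck(pred, gt, torso_size, tolerance=0.2):
    """Percent of keypoints with 2D error below tolerance * torso size.

    pred, gt:   arrays of shape (num_people, num_keypoints, 2)
    torso_size: array of shape (num_people,), one value per person
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    torso_size = np.asarray(torso_size, dtype=float)
    errors = np.linalg.norm(pred - gt, axis=-1)      # (people, keypoints)
    thresholds = tolerance * torso_size[:, None]     # (people, 1)
    return 100.0 * float(np.mean(errors < thresholds))

# One person with torso size 10, so the threshold is 2 pixels.
gt = np.zeros((1, 4, 2))
pred = np.array([[[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 0.0]]])
score = pck(pred, gt, torso_size=[10.0])
print(score)  # 50.0: two of the four keypoints are within 2 pixels
```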

To cover a wide range of customer hardware, we present two pose tracking models: lite and full, which are differentiated in the balance of speed versus quality. For performance evaluation on CPU, we use XNNPACK; for mobile GPUs, we use the TFLite GPU backend.

Applications
Based on human pose, we can build a variety of applications, like fitness or yoga trackers. As an example, we present squat and push-up counters, which can automatically count user statistics, or verify the quality of performed exercises. Such use cases can be implemented either using an additional classifier network or even with a simple joint pairwise distance lookup algorithm, which matches the closest pose in normalized pose space.

Exercise repetition counters based on detected body pose. Left: Squats; Right: Push-Ups
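A joint pairwise distance lookup of the kind mentioned above might look like the sketch below. The normalization (centering on the hip midpoint and scaling by torso length), the toy six-keypoint skeleton, and all function names are assumptions chosen for illustration; the post does not describe the exact normalization or matching rule.

```python
import numpy as np

def normalize_pose(keypoints, hip_idx, shoulder_idx):
    """Center the pose on the hip midpoint and scale by torso length."""
    kp = np.asarray(keypoints, dtype=float)
    hip_center = kp[list(hip_idx)].mean(axis=0)
    shoulder_center = kp[list(shoulder_idx)].mean(axis=0)
    torso = np.linalg.norm(shoulder_center - hip_center)
    return (kp - hip_center) / torso

def pairwise_distances(kp):
    """Vector of distances between every pair of joints."""
    dists = np.linalg.norm(kp[:, None, :] - kp[None, :, :], axis=-1)
    return dists[np.triu_indices(len(kp), k=1)]

def closest_pose(query, references, hip_idx, shoulder_idx):
    """Return the label of the reference pose nearest to the query."""
    q = pairwise_distances(normalize_pose(query, hip_idx, shoulder_idx))
    best_label, best_dist = None, float("inf")
    for label, ref in references.items():
        r = pairwise_distances(normalize_pose(ref, hip_idx, shoulder_idx))
        dist = float(np.linalg.norm(q - r))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy skeleton: shoulders (0, 1), hips (2, 3), knees (4, 5).
references = {
    "up":   [(0, 0.0), (1, 0.0), (0, 1.0), (1, 1.0), (0, 2), (1, 2)],
    "down": [(0, 0.8), (1, 0.8), (0, 1.8), (1, 1.8), (0, 2), (1, 2)],
}
# A larger, shifted person near the bottom of a squat.
query = [(10, 11.6), (12, 11.6), (10, 13.6), (12, 13.6), (10, 14.1), (12, 14.1)]
label = closest_pose(query, references, hip_idx=(2, 3), shoulder_idx=(0, 1))
print(label)  # down
```

A repetition counter could then increment whenever the matched label changes from "down" back to "up".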

Conclusion
BlazePose will be available to the broader mobile developer community via the Pose detection API in the upcoming release of ML Kit, and we are also releasing a version targeting upper body use cases in MediaPipe running in Android, iOS and Python. Apart from the mobile domain, we preview our web-based in-browser version as well. We hope that providing this human pose perception functionality to the broader research and development community will result in an emergence of creative use cases, stimulating new applications, and new research avenues.

We plan to extend this technology with more robust and stable tracking to an even larger variety of human poses and activities. In the accompanying Model Card, we detail the intended uses, limitations and model fairness to ensure that use of these models aligns with Google’s AI Principles. We believe that publishing this technology can provide an impulse to new creative ideas and applications by the members of the research and developer community at large. We are excited to see what you can build with it!

BlazePose results on yoga use-cases

Acknowledgments
Special thanks to all our team members who worked on the tech with us: Fan Zhang, Artsiom Ablavatski, Yury Kartynnik, Tyler Zhu, Karthik Raveendran, Andrei Vakunov, Andrei Tkachenka, Marat Dukhan, Tyler Mullen, Gregory Karpiak, Suril Shah, Buck Bourdon, Jiuqiang Tang, Ming Guang Yong, Chuo-Ling Chang, Esha Uboweja, Siarhei Kazakou, Andrei Kulik, Matsvei Zhdanovich, and Matthias Grundmann.

Source: Google AI Blog


MediaPipe Iris: Real-time Iris Tracking & Depth Estimation

A wide range of real-world applications, including computational photography (e.g., portrait mode and glint reflections) and augmented reality effects (e.g., virtual avatars) rely on estimating eye position by tracking the iris. Once accurate iris tracking is available, we show that it is possible to determine the metric distance from the camera to the user, without the use of a dedicated depth sensor. This, in turn, can improve a variety of use cases, ranging from computational photography, over virtual try-on of properly sized glasses and hats to usability enhancements that adapt the font size depending on the viewer’s distance.

Iris tracking is a challenging task to solve on mobile devices, due to limited computing resources, variable light conditions and the presence of occlusions, such as hair or people squinting. Often, sophisticated specialized hardware is employed, limiting the range of devices on which the solution could be applied.

FaceMesh can be adopted to drive virtual avatars (middle). By additionally employing iris tracking (right), the avatar’s liveliness is significantly enhanced.
An example of eye re-coloring enabled by MediaPipe Iris.

Today, we announce the release of MediaPipe Iris, a new machine learning model for accurate iris estimation. Building on our work on MediaPipe Face Mesh, this model is able to track landmarks involving the iris, pupil and the eye contours using a single RGB camera, in real-time, without the need for specialized hardware. Through use of iris landmarks, the model is also able to determine the metric distance between the subject and the camera with relative error less than 10% without the use of a depth sensor. Note that iris tracking does not infer the location at which people are looking, nor does it provide any form of identity recognition. Because this system is implemented in MediaPipe, an open source cross-platform framework for researchers and developers to build world-class ML solutions and applications, it can run on most modern mobile phones, desktops, laptops and even on the web.

Usability prototype for far-sighted individuals: observed font size remains constant independent of the device distance from the user.

An ML Pipeline for Iris Tracking
The first step in the pipeline leverages our previous work on 3D Face Meshes, which uses high-fidelity facial landmarks to generate a mesh of the approximate face geometry. From this mesh, we isolate the eye region in the original image for use in the iris tracking model. The problem is then divided into two parts: eye contour estimation and iris location. We designed a multi-task model consisting of a unified encoder with a separate component for each task, which allowed us to use task-specific training data.

Examples of iris (blue) and eyelid (red) tracking.

To train the model from the cropped eye region, we manually annotated ~50k images, representing a variety of illumination conditions and head poses from geographically diverse regions, as shown below.

Eye region annotated with eyelid (red) and iris (blue) contours.
Cropped eye regions form the input to the model, which predicts landmarks via separate components.

Depth-from-Iris: Depth Estimation from a Single Image
Our iris-tracking model is able to determine the metric distance of a subject to the camera with less than 10% error, without requiring any specialized hardware. This is done by relying on the fact that the horizontal iris diameter of the human eye remains roughly constant at 11.7±0.5 mm across a wide population [1, 2, 3, 4], along with some simple geometric arguments. For illustration, consider a pinhole camera model projecting onto a sensor of square pixels. The distance to a subject can be estimated from facial landmarks by using the focal length of the camera, which can be obtained using camera capture APIs or directly from the EXIF metadata of a captured image, along with other camera intrinsic parameters. Given the focal length, the distance from the subject to the camera is directly proportional to the physical size of the subject’s eye, as visualized below.

The distance of the subject (d) can be computed from the focal length (f) and the size of the iris using similar triangles.
Left: MediaPipe Iris predicting metric distance in cm on a Pixel 2 from iris tracking alone, without the use of a depth sensor. Right: Ground-truth depth.
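The similar-triangles argument can be written out in a few lines. Under the pinhole model with square pixels, distance = focal length (in pixels) × physical iris diameter / iris diameter in the image (in pixels). The sketch below uses the 11.7 mm figure from the text; the function names and the example camera parameters are made up for illustration, and converting a focal length in millimetres to pixels via the sensor width is an assumption about which intrinsics are available.

```python
HUMAN_IRIS_DIAMETER_MM = 11.7  # roughly constant across people (11.7 +/- 0.5 mm)

def focal_length_in_pixels(focal_length_mm, sensor_width_mm, image_width_px):
    """Convert a focal length in millimetres to pixels (square pixels assumed)."""
    return focal_length_mm * image_width_px / sensor_width_mm

def distance_from_iris_mm(iris_diameter_px, focal_length_px,
                          iris_diameter_mm=HUMAN_IRIS_DIAMETER_MM):
    """Subject-to-camera distance from the iris size in the image."""
    if iris_diameter_px <= 0:
        raise ValueError("iris diameter in pixels must be positive")
    return focal_length_px * iris_diameter_mm / iris_diameter_px

# Hypothetical camera: 4.2 mm lens, 6.0 mm wide sensor, 2000 px wide image.
f_px = focal_length_in_pixels(4.2, 6.0, 2000)      # about 1400 px
distance = distance_from_iris_mm(32.76, f_px)      # about 500 mm
print(round(distance))
```

Because the estimate scales linearly with the assumed iris diameter, the ±0.5 mm population spread alone corresponds to roughly ±4% in distance.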

In order to quantify the accuracy of the method, we compared it to the depth sensor on an iPhone 11 by collecting front-facing, synchronized video and depth images on over 200 participants. We experimentally verified the error of the iPhone 11 depth sensor to be < 2% for distances up to 2 meters, using a laser ranging device. Our evaluation shows that our approach for depth estimation using iris size has a mean relative error of 4.3% and standard deviation of 2.4%. We tested our approach on participants with and without eyeglasses (not accounting for contact lenses on participants) and found that eyeglasses increase the mean relative error slightly to 4.8% (standard deviation 3.1%). We did not test this approach on participants with any eye diseases (like arcus senilis or pannus). Considering MediaPipe Iris requires no specialized hardware, these results suggest it may be possible to obtain metric depth from a single image on devices with a wide range of cost-points.

Histogram of estimation errors (left) and comparison of actual to estimated distance by iris (right).

Release of MediaPipe Iris
We are releasing the iris and depth estimation models as a cross-platform MediaPipe pipeline that can run on desktop, mobile and the web. As described in our recent Google Developer Blog post on MediaPipe on the web, we leverage WebAssembly and XNNPACK to run our Iris ML pipeline locally in the browser, without any data being sent to the cloud.

Using MediaPipe’s WASM stack, you can run the models locally in your browser! Left: Iris tracking. Right: Depth from Iris computed just from a photo with EXIF data. Iris tracking can be tried out here and iris depth measurements here.

Future Directions
We plan to extend our MediaPipe Iris model with even more stable tracking for lower error and deploy it for accessibility use cases. We strongly believe in sharing code that enables reproducible research, rapid experimentation, and development of new ideas in different areas. In our documentation and the accompanying Model Card, we detail the intended uses, limitations and model fairness to ensure that use of these models aligns with Google’s AI Principles. Note that any form of surveillance or identification is explicitly out of scope and not enabled by this technology. We hope that providing this iris perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating responsible new applications and new research avenues.

For more ML solutions from MediaPipe, please see our solutions page and Google Developer blog for the latest updates.

Acknowledgements
We would like to thank Artsiom Ablavatski, Andrei Tkachenka, Buck Bourdon, Ivan Grishchenko and Gregory Karpiak for support in model evaluation and data collection; Yury Kartynnik, Valentin Bazarevsky, Artsiom Ablavatski for developing the mesh technology; Aliaksandr Shyrokau and the annotation team for their diligence in data preparation; Vidhya Navalpakkam, Tomer, Tomer Shekel, Kai Kohlhoff for their domain expertise; Fan Zhang, Esha Uboweja, Tyler Mullen, Michael Hays and Chuo-Ling Chang for help integrating the model into MediaPipe; and Matthias Grundmann, Florian Schroff and Ming Guang Yong for continuous help in building this technology.

Source: Google AI Blog


Sensing Force-Based Gestures on the Pixel 4



Touch input has traditionally focussed on two-dimensional finger pointing. Beyond tapping and swiping gestures, long pressing has been the main alternative path for interaction. However, a long press is sensed with a time-based threshold where a user’s finger must remain stationary for 400–500 ms. By its nature, a time-based threshold has negative effects for usability and discoverability as the lack of immediate feedback disconnects the user’s action from the system’s response. Fortunately, fingers are dynamic input devices that can express more than just location: when a user touches a surface, their finger can also express some level of force, which can be used as an alternative to a time-based threshold.

While a variety of force-based interactions have been pursued, sensing touch force requires dedicated hardware sensors that are expensive to design and integrate. Further, research indicates that touch force is difficult for people to control, and so most practical force-based interactions focus on discrete levels of force (e.g., a soft vs. firm touch) — which do not require the full capabilities of a hardware force sensor.

For a recent update to the Pixel 4, we developed a method for sensing force gestures that allowed us to deliver a more expressive touch interaction experience. By studying how the human finger interacts with touch sensors, we designed the experience to complement and support the long-press interactions that apps already have, but with a more natural gesture. In this post we describe the core principles of touch sensing and finger interaction, how we designed a machine learning algorithm to recognise press gestures from touch sensor data, and how we integrated it into the user experience for Pixel devices.

Touch Sensor Technology and Finger Biomechanics
A capacitive touch sensor is constructed from two conductive electrodes (a drive electrode and a sense electrode) that are separated by a non-conductive dielectric (e.g., glass). The two electrodes form a tiny capacitor (a cell) that can hold some charge. When a finger (or another conductive object) approaches this cell, it ‘steals’ some of the charge, which can be measured as a drop in capacitance. Importantly, the finger doesn’t have to come into contact with the electrodes (which are protected under another layer of glass) as the amount of charge stolen is inversely proportional to the distance between the finger and the electrodes.
Left: A finger interacts with a touch sensor cell by ‘stealing’ charge from the projected field around two electrodes. Right: A capacitive touch sensor is constructed from rows and columns of electrodes, separated by a dielectric. The electrodes overlap at cells, where capacitance is measured.
The cells are arranged as a matrix over the display of a device, but with a much lower density than the display pixels. For instance, the Pixel 4 has a 2280 × 1080 pixel display, but a 32 × 15 cell touch sensor. When scanned at a high rate (at least 120 Hz), readings from these cells form a video of the finger’s interaction.
Slowed touch sensor recordings of a user tapping (left), pressing (middle), and scrolling (right).
Capacitive touch sensors don’t respond to changes in force per se, but are tuned to be highly sensitive to changes in distance within a couple of millimeters above the display. That is, a finger contact on the display glass should saturate the sensor near its centre, but will retain a high dynamic range around the perimeter of the finger’s contact (where the finger curls up).

When a user’s finger presses against a surface, its soft tissue deforms and spreads out. The nature of this spread depends on the size and shape of the user’s finger, and its angle to the screen. At a high level, we can observe a couple of key features in this spread (shown in the figures): it is asymmetric around the initial contact point, and the overall centre of mass shifts along the axis of the finger. This is also a dynamic change that occurs over some period of time, which differentiates it from contacts that have a long duration or a large area.
Touch sensor signals are saturated around the centre of the finger’s contact, but fall off at the edges. This allows us to sense small deformations in the finger’s contact shape caused by changes in the finger’s force.
However, the differences between users (and fingers) make it difficult to encode these observations with heuristic rules. We therefore designed a machine learning solution that would allow us to learn these features and their variances directly from user interaction samples.
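To make the centre-of-mass observation concrete, the sketch below computes the signal-weighted centroid of a touch frame and its shift between two frames. It is only an illustration of the feature itself, not of the learned model: the 32 × 15 frame shape comes from the text, while the function names and the synthetic frames are assumptions.

```python
import numpy as np

def centre_of_mass(frame):
    """Signal-weighted (row, col) centroid of a touch frame, or None if empty."""
    frame = np.asarray(frame, dtype=float)
    total = frame.sum()
    if total == 0:
        return None
    rows, cols = np.indices(frame.shape)
    return (float((rows * frame).sum() / total),
            float((cols * frame).sum() / total))

def centroid_shift(first_frame, last_frame):
    """Displacement of the centroid between two frames of one contact."""
    r0, c0 = centre_of_mass(first_frame)
    r1, c1 = centre_of_mass(last_frame)
    return (r1 - r0, c1 - c0)

# Synthetic contact: the finger first touches one cell, then spreads
# asymmetrically along the row axis as it presses harder.
touch_down = np.zeros((32, 15))
touch_down[10, 7] = 1.0
pressed = np.zeros((32, 15))
pressed[10:13, 7] = 1.0
shift = centroid_shift(touch_down, pressed)
print(shift)  # (1.0, 0.0)
```

A fixed threshold on such a shift is exactly the kind of heuristic rule that varies too much between users and fingers, which motivates learning the features instead.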

Machine Learning for Touch Interaction
We approached the analysis of these touch signals as a gesture classification problem. That is, rather than trying to predict an abstract parameter, such as force or contact spread, we wanted to sense a press gesture — as if engaging a button or a switch. This allowed us to connect the classification to a well-defined user experience, and allowed users to perform the gesture during training at a comfortable force and posture.

Any classification model we designed had to operate within users’ high expectations for touch experiences. In particular, touch interaction is extremely latency-sensitive and demands real-time feedback. Users expect applications to be responsive to their finger movements as they make them, and application developers expect the system to deliver timely information about the gestures a user is performing. This means that classification of a press gesture needs to occur in real-time, and be able to trigger an interaction at the moment the finger’s force reaches its apex.

We therefore designed a neural network that combined convolutional (CNN) and recurrent (RNN) components. The CNN could attend to the spatial features we observed in the signal, while the RNN could attend to their temporal development. The RNN also helps provide a consistent runtime experience: each frame is processed by the network as it is received from the touch sensor, and the RNN state vectors are preserved between frames (rather than processing them in batches). The network was intentionally kept simple to minimise on-device inference costs when running concurrently with other applications (taking approximately 50 µs of processing per frame and less than 1 MB of memory using TensorFlow Lite).
An overview of the classification model’s architecture.
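The frame-by-frame processing with preserved recurrent state can be illustrated with a toy NumPy model. Everything about this sketch is an assumption for illustration: the layer sizes, the single convolution with global max pooling, the plain tanh recurrent cell, and the random untrained weights bear no relation to the actual network, which the post does not specify in detail.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

class StreamingPressClassifier:
    """Toy CNN + RNN with random, untrained weights (illustration only)."""

    def __init__(self, num_kernels=4, hidden_size=8, seed=0):
        rng = np.random.default_rng(seed)
        self.kernels = rng.normal(scale=0.5, size=(num_kernels, 3, 3))
        self.w_in = rng.normal(scale=0.5, size=(hidden_size, num_kernels))
        self.w_rec = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
        self.w_out = rng.normal(scale=0.5, size=hidden_size)
        self.state = np.zeros(hidden_size)

    def reset(self):
        """Clear the recurrent state, e.g. when a new contact begins."""
        self.state = np.zeros_like(self.state)

    def _spatial_features(self, frame):
        # 3x3 convolution over the frame, ReLU, then global max pooling.
        windows = sliding_window_view(frame, (3, 3))
        maps = np.einsum("ijkl,nkl->nij", windows, self.kernels)
        return np.maximum(maps, 0.0).max(axis=(1, 2))

    def step(self, frame):
        """Process one frame and return a press probability in (0, 1)."""
        features = self._spatial_features(np.asarray(frame, dtype=float))
        # The state persists between calls, so no batching is needed.
        self.state = np.tanh(self.w_in @ features + self.w_rec @ self.state)
        return float(1.0 / (1.0 + np.exp(-self.w_out @ self.state)))

classifier = StreamingPressClassifier()
frame = np.zeros((32, 15))
frame[10:13, 6:9] = 1.0
probabilities = [classifier.step(frame) for _ in range(5)]
```

Each call to `step` costs the same amount of work regardless of how long the contact has lasted, which is what makes this structure suitable for a per-frame latency budget.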
The model was trained on a dataset of press gestures and other common touch interactions (tapping, scrolling, dragging, and long-pressing without force). As the model would be evaluated after each frame, we designed a loss function that temporally shaped the label probability distribution of each sample, and applied a time-increasing weight to errors. This ensured that the output probabilities were temporally smooth and converged towards the correct gesture classification.
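The idea of weighting errors more heavily as a gesture progresses can be sketched as a time-weighted cross-entropy. The post does not give the exact form of the loss, so the linear weight ramp, its starting value, and the function name here are assumptions, and the temporal shaping of the label distribution is not shown.

```python
import numpy as np

def time_weighted_cross_entropy(probs, label, start_weight=0.1):
    """Cross-entropy over a sequence with linearly increasing frame weights.

    probs: array of shape (num_frames, num_classes) with per-frame
           predicted class probabilities.
    label: index of the correct gesture class for the whole sequence.
    """
    probs = np.asarray(probs, dtype=float)
    num_frames = probs.shape[0]
    weights = np.linspace(start_weight, 1.0, num_frames)
    per_frame = -np.log(np.clip(probs[:, label], 1e-12, 1.0))
    return float(np.sum(weights * per_frame) / np.sum(weights))

# The same single mistake costs more when it happens late in the gesture.
early_mistake = [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
late_mistake = [[0.1, 0.9], [0.1, 0.9], [0.9, 0.1]]
loss_early = time_weighted_cross_entropy(early_mistake, label=1)
loss_late = time_weighted_cross_entropy(late_mistake, label=1)
print(loss_early < loss_late)  # True
```

Tolerating uncertainty in the first frames while penalizing late errors encourages outputs that converge towards the correct class as more of the gesture is observed.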

User Experience Integration
Our UX research found that it was hard for users to discover force-based interactions, and that users frequently confused a force press with a long press because of the difficulty in coordinating the amount of force they were applying with the duration of their contact. Rather than creating a new interaction modality based on force, we therefore focussed on improving the user experience of long press interactions by accelerating them with force in a unified press gesture. A press gesture has the same outcome as a long press gesture, whose time threshold remains effective, but provides a stronger connection between the outcome and the user’s action when force is used.
A user long pressing (left) and firmly pressing (right) on a launcher icon.
This also means that users can take advantage of this gesture without developers needing to update their apps. Applications that use Android’s GestureDetector or View APIs will automatically get these press signals through their existing long-press handlers. Developers that implement custom long-press detection logic can receive these press signals through the MotionEvent classification API introduced in Android Q.

Through this integration of machine-learning algorithms and careful interaction design, we were able to deliver a more expressive touch experience for Pixel users. We plan to continue researching and developing these capabilities to refine the touch experience on Pixel, and explore new forms of touch interaction.

Acknowledgements
This project is a collaborative effort between the Android UX, Pixel software, and Android framework teams.

Source: Google AI Blog


Google at CVPR 2020



This week marks the start of the fully virtual 2020 Conference on Computer Vision and Pattern Recognition (CVPR 2020), the premier annual computer vision event consisting of the main conference, workshops and tutorials. As a leader in computer vision research and a Supporter Level Virtual Sponsor, Google will have a strong presence at CVPR 2020, with nearly 70 publications accepted, along with the organization of, and participation in, multiple workshops/tutorials.

If you are participating in CVPR this year, please visit our virtual booth to learn about what Google is actively pursuing for the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception.

You can also learn more about our research being presented at CVPR 2020 in the list below (Google affiliations are bolded).

Organizing Committee

General Chairs: Terry Boult, Gerard Medioni, Ramin Zabih
Program Chairs: Ce Liu, Greg Mori, Kate Saenko, Silvio Savarese
Workshop Chairs: Tal Hassner, Tali Dekel
Website Chairs: Tianfan Xue, Tian Lan
Technical Chair: Daniel Vlasic
Area Chairs include: Alexander Toshev, Alexey Dosovitskiy, Boqing Gong, Caroline Pantofaru, Chen Sun, Deqing Sun, Dilip Krishnan, Feng Yang, Liang-Chieh Chen, Michael Rubinstein, Rodrigo Benenson, Timnit Gebru, Thomas Funkhouser, Varun Jampani, Vittorio Ferrari, William Freeman

Oral Presentations

Evolving Losses for Unsupervised Video Representation Learning
AJ Piergiovanni, Anelia Angelova, Michael Ryoo

CvxNet: Learnable Convex Decomposition
Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, Andrea Tagliasacchi

Neural SDE: Stabilizing Neural ODE Networks with Stochastic Noise
Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, Cho-Jui Hsieh

Scalability in Perception for Autonomous Driving: Waymo Open Dataset
Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurélien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Sheng Zhao, Shuyang Chen, Yu Zhang, Jon Shlens, Zhifeng Chen, Dragomir Anguelov

Deep Implicit Volume Compression
Saurabh Singh, Danhang Tang, Cem Keskin, Philip Chou, Christian Haene, Mingsong Dou, Sean Fanello, Jonathan Taylor, Andrea Tagliasacchi, Philip Davidson, Yinda Zhang, Onur Guleryuz, Shahram Izadi, Sofien Bouaziz

Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model
Dongdong Wang, Yandong Li, Liqiang Wang, Boqing Gong

Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval (see the blog post)
Tobias Weyand, Andre Araujo, Jack Sim, Bingyi Cao

CycleISP: Real Image Restoration via Improved Data Synthesis
Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, Ling Shao

Dynamic Graph Message Passing Networks
Li Zhang, Dan Xu, Anurag Arnab, Philip Torr

Local Deep Implicit Functions for 3D Shape
Kyle Genova, Forrester Cole, Avneesh Sud, Aaron Sarna, Thomas Funkhouser

GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models
Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, William Freeman, Rahul Sukthankar, Cristian Sminchisescu

Search to Distill: Pearls are Everywhere but not the Eyes
Yu Liu, Xuhui Jia, Mingxing Tan, Raviteja Vemulapalli, Yukun Zhu, Bradley Green, Xiaogang Wang

Semantic Pyramid for Image Generation
Assaf Shocher, Yossi Gandelsman, Inbar Mosseri, Michal Yarom, Michal Irani, William Freeman, Tali Dekel

Flow Contrastive Estimation of Energy-Based Models
Ruiqi Gao, Erik Nijkamp, Diederik Kingma, Zhen Xu, Andrew Dai, Ying Nian Wu

Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition from A Domain Adaptation Perspective
Muhammad Abdullah Jamal, Matthew Brown, Ming-Hsuan Yang, Liqiang Wang, Boqing Gong

Category-Level Articulated Object Pose Estimation
Xiaolong Li, He Wang, Li Yi, Leonidas Guibas, Amos Abbott, Shuran Song

AdaCoSeg: Adaptive Shape Co-Segmentation with Group Consistency Loss
Chenyang Zhu, Kai Xu, Siddhartha Chaudhuri, Li Yi, Leonidas Guibas, Hao Zhang

SpeedNet: Learning the Speediness in Videos
Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William Freeman, Michael Rubinstein, Michal Irani, Tali Dekel

BSP-Net: Generating Compact Meshes via Binary Space Partitioning
Zhiqin Chen, Andrea Tagliasacchi, Hao Zhang

SAPIEN: A SimulAted Part-based Interactive ENvironment
Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel Chang, Leonidas Guibas, Hao Su

SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving
Zhenpei Yang, Yuning Chai, Dragomir Anguelov, Yin Zhou, Pei Sun, Dumitru Erhan, Sean Rafferty, Henrik Kretzschmar

Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks
Saurabh Singh, Shankar Krishnan

RL-CycleGAN: Reinforcement Learning Aware Simulation-To-Real
Kanishka Rao, Chris Harris, Alex Irpan, Sergey Levine, Julian Ibarz, Mohi Khansari

Open Compound Domain Adaptation
Ziwei Liu, Zhongqi Miao, Xingang Pan, Xiaohang Zhan, Dahua Lin, Stella X. Yu, Boqing Gong

Posters

Single-view view synthesis with multiplane images
Richard Tucker, Noah Snavely

Adversarial Examples Improve Image Recognition
Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan Yuille, Quoc V. Le

Adversarial Texture Optimization from RGB-D Scans
Jingwei Huang, Justus Thies, Angela Dai, Abhijit Kundu, Chiyu “Max” Jiang, Leonidas Guibas, Matthias Niessner, Thomas Funkhouser

Single-Image HDR Reconstruction by Learning to Reverse the Camera Pipeline
Yu-Lun Liu, Wei-Sheng Lai, Yu-Sheng Chen, Yi-Lung Kao, Ming-Hsuan Yang, Yung-Yu Chuang, Jia-Bin Huang

Collaborative Distillation for Ultra-Resolution Universal Style Transfer
Huan Wang, Yijun Li, Yuehai Wang, Haoji Hu, Ming-Hsuan Yang

Learning to Autofocus
Charles Herrmann, Richard Strong Bowen, Neal Wadhwa, Rahul Garg, Qiurui He, Jonathan T. Barron, Ramin Zabih

Multi-Scale Boosted Dehazing Network with Dense Feature Fusion
Hang Dong, Jinshan Pan, Lei Xiang, Zhe Hu, Xinyi Zhang, Fei Wang, Ming-Hsuan Yang

Composing Good Shots by Exploiting Mutual Relations
Debang Li, Junge Zhang, Kaiqi Huang, Ming-Hsuan Yang

PatchVAE: Learning Local Latent Codes for Recognition
Kamal Gupta, Saurabh Singh, Abhinav Shrivastava

Neural Voxel Renderer: Learning an Accurate and Controllable Rendering Tool
Konstantinos Rematas, Vittorio Ferrari

Local Implicit Grid Representations for 3D Scenes
Chiyu “Max” Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Niessner, Thomas Funkhouser

Large Scale Video Representation Learning via Relational Graph Clustering
Hyodong Lee, Joonseok Lee, Joe Yue-Hei Ng, Apostol (Paul) Natsev

Deep Homography Estimation for Dynamic Scenes
Hoang Le, Feng Liu, Shu Zhang, Aseem Agarwala

C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds
Albert Pumarola, Stefan Popov, Francesc Moreno-Noguer, Vittorio Ferrari

Lighthouse: Predicting Lighting Volumes for Spatially-Coherent Illumination
Pratul Srinivasan, Ben Mildenhall, Matthew Tancik, Jonathan T. Barron, Richard Tucker, Noah Snavely

Scale-space flow for end-to-end optimized video compression
Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Ballé, Sung Jin Hwang, George Toderici

StructEdit: Learning Structural Shape Variations
Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy Mitra, Leonidas Guibas

3D-MPA: Multi Proposal Aggregation for 3D Semantic Instance Segmentation
Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, Matthias Niessner

Sequential mastery of multiple tasks: Networks naturally learn to learn and forget to forget
Guy Davidson, Michael C. Mozer

Distilling Effective Supervision from Severe Label Noise
Zizhao Zhang, Han Zhang, Sercan Ö. Arik, Honglak Lee, Tomas Pfister

ViewAL: Active Learning With Viewpoint Entropy for Semantic Segmentation
Yawar Siddiqui, Julien Valentin, Matthias Niessner

Attribution in Scale and Space
Shawn Xu, Subhashini Venugopalan, Mukund Sundararajan

Weakly-Supervised Semantic Segmentation via Sub-category Exploration
Yu-Ting Chang, Qiaosong Wang, Wei-Chih Hung, Robinson Piramuthu, Yi-Hsuan Tsai, Ming-Hsuan Yang

Speech2Action: Cross-modal Supervision for Action Recognition
Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

Counting Out Time: Class Agnostic Video Repetition Counting in the Wild
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
Junwei Liang, Lu Jiang, Kevin Murphy, Ting Yu, Alexander Hauptmann

Self-training with Noisy Student improves ImageNet classification
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le

EfficientDet: Scalable and Efficient Object Detection (see the blog post)
Mingxing Tan, Ruoming Pang, Quoc Le

ACNe: Attentive Context Normalization for Robust Permutation-Equivariant Learning
Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, Kwang Moo Yi

VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation
Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cordelia Schmid, Congcong Li

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization
Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc Le, Xiaodan Song

KeyPose: Multi-View 3D Labeling and Keypoint Estimation for Transparent Objects
Xingyu Liu, Rico Jonschkowski, Anelia Angelova, Kurt Konolige

Structured Multi-Hashing for Model Compression
Elad Eban, Yair Movshovitz-Attias, Hao Wu, Mark Sandler, Andrew Poon, Yerlan Idelbayev, Miguel A. Carreira-Perpinan

DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes
Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Tom Funkhouser, Caroline Pantofaru, David Ross, Larry Davis, Alireza Fathi

Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation
Bowen Cheng, Maxwell Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, Liang-Chieh Chen

Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection
Sara Beery, Guanhang Wu, Vivek Rathod, Ronny Votel, Jonathan Huang

Distortion Agnostic Deep Watermarking
Xiyang Luo, Ruohan Zhan, Huiwen Chang, Feng Yang, Peyman Milanfar

Can weight sharing outperform random architecture search? An investigation with TuNAS
Gabriel Bender, Hanxiao Liu, Bo Chen, Grace Chu, Shuyang Cheng, Pieter-Jan Kindermans, Quoc Le

GIFnets: Differentiable GIF Encoding Framework
Innfarn Yoo, Xiyang Luo, Yilin Wang, Feng Yang, Peyman Milanfar

Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models
Giannis Daras, Augustus Odena, Han Zhang, Alex Dimakis

Fast Sparse ConvNets
Erich Elsen, Marat Dukhan, Trevor Gale, Karen Simonyan

RetinaTrack: Online Single Stage Joint Detection and Tracking
Zhichao Lu, Vivek Rathod, Ronny Votel, Jonathan Huang

Learning to See Through Obstructions
Yu-Lun Liu, Wei-Sheng Lai, Ming-Hsuan Yang, Yung-Yu Chuang, Jia-Bin Huang

Self-Supervised Learning of Video-Induced Visual Invariances
Michael Tschannen, Josip Djolonga, Marvin Ritter, Aravindh Mahendran, Neil Houlsby, Sylvain Gelly, Mario Lucic

Workshops

3rd Workshop and Challenge on Learned Image Compression
Organizers include: George Toderici, Eirikur Agustsson, Lucas Theis, Johannes Ballé, Nick Johnston

CLVISION 1st Workshop on Continual Learning in Computer Vision
Organizers include: Zhiyuan (Brett) Chen, Marc Pickett

Embodied AI
Organizers include: Alexander Toshev, Jie Tan, Aleksandra Faust, Anelia Angelova

The 1st International Workshop and Prize Challenge on Agriculture-Vision: Challenges & Opportunities for Computer Vision in Agriculture
Organizers include: Zhen Li, Jim Yuan

New Trends in Image Restoration and Enhancement workshop and challenges on image and video restoration and enhancement (NTIRE)
Talk: “Sky Optimization: Semantically aware image processing of skies in low-light photography”
Orly Liba, Longqi Cai, Yun-Ta Tsai, Elad Eban, Yair Movshovitz-Attias, Yael Pritch, Huizhong Chen, Jonathan Barron

The End-of-End-to-End: A Video Understanding Pentathlon
Organizers include: Rahul Sukthankar

4th Workshop on Media Forensics
Organizers include: Christoph Bregler

4th Workshop on Visual Understanding by Learning from Web Data
Organizers include: Jesse Berent, Rahul Sukthankar

AI for Content Creation
Organizers include: Deqing Sun, Lu Jiang, Weilong Yang

Fourth Workshop on Computer Vision for AR/VR
Organizers include: Sofien Bouaziz

Low-Power Computer Vision Competition (LPCVC)
Organizers include: Bo Chen, Andrew Howard, Jaeyoun Kim

Sight and Sound
Organizers include: William Freeman

Workshop on Efficient Deep Learning for Computer Vision
Organizers include: Pete Warden

Extreme classification in computer vision
Organizers include: Ramin Zabih, Zhen Li

Image Matching: Local Features and Beyond (see the blog post)
Organizers include: Eduard Trulls

The DAVIS Challenge on Video Object Segmentation
Organizers include: Alberto Montes, Jordi Pont-Tuset, Kevis-Kokitsi Maninis

2nd Workshop on Precognition: Seeing through the Future
Organizers include: Utsav Prabhu

Computational Cameras and Displays (CCD)
Talk: Orly Liba

2nd Workshop on Learning from Unlabeled Videos (LUV)
Organizers include: Honglak Lee, Rahul Sukthankar

7th Workshop on Fine Grained Visual Categorization (FGVC7) (see the blog post)
Organizers include: Christine Kaeser-Chen, Serge Belongie

Language & Vision with applications to Video Understanding
Organizers include: Lu Jiang

Neural Architecture Search and Beyond for Representation Learning
Organizers include: Barret Zoph

Tutorials

Disentangled 3D Representations for Relightable Performance Capture of Humans
Organizers include: Sean Fanello, Christoph Rhemann, Jonathan Taylor, Sofien Bouaziz, Adarsh Kowdle, Rohit Pandey, Sergio Orts-Escolano, Paul Debevec, Shahram Izadi

Learning Representations via Graph-Structured Networks
Organizers include: Chen Sun, Ming-Hsuan Yang

Novel View Synthesis: From Depth-Based Warping to Multi-Plane Images and Beyond
Organizers include: Varun Jampani

How to Write a Good Review
Talks by: Vittorio Ferrari, Bill Freeman, Jordi Pont-Tuset

Neural Rendering
Organizers include: Ricardo Martin-Brualla, Rohit K. Pandey, Sean Fanello, Maneesh Agrawala, Dan B. Goldman

Fairness Accountability Transparency and Ethics and Computer Vision
Organizers: Timnit Gebru, Emily Denton

Source: Google AI Blog


Soli Radar-Based Perception and Interaction in Pixel 4



The Pixel 4 and Pixel 4 XL are optimized for ease of use, and a key feature helping to realize this goal is Motion Sense, which enables users to interact with their Pixel in numerous ways without touching the device. For example, with Motion Sense you can use specific gestures to change music tracks or instantly silence an incoming call. Motion Sense additionally detects when you're near your phone and when you reach for it, allowing your Pixel to be more helpful by anticipating your actions, such as by priming the camera to provide a seamless face unlock experience, politely lowering the volume of a ringing alarm as you reach to dismiss it, or turning off the display to save power when you’re no longer near the device.

The technology behind Motion Sense is Soli, the first integrated short-range radar sensor in a consumer smartphone, which facilitates close-proximity interaction with the phone without contact. Below, we discuss Soli’s core radar sensing principles, design of the signal processing and machine learning (ML) algorithms used to recognize human activity from radar data, and how we resolved some of the integration challenges to prepare Soli for use in consumer devices.

Designing the Soli Radar System for Motion Sense
The basic function of radar is to detect and measure properties of remote objects based on their interactions with radio waves. A classic radar system includes a transmitter that emits radio waves, which are then scattered, or redirected, by objects within their paths, with some portion of energy reflected back and intercepted by the radar receiver. Based on the received waveforms, the radar system can detect the presence of objects as well as estimate certain properties of these objects, such as distance and size.

Radar has been under active development as a detection and ranging technology for almost a century. Traditional radar approaches are designed for detecting large, rigid, distant objects, such as planes and cars; therefore, they lack the sensitivity and resolution for sensing complex motions within the requirements of a consumer handheld device. Thus, to enable Motion Sense, the Soli team developed a new, small-scale radar system, novel sensing paradigms, and algorithms from the ground up specifically for fine-grained perception of human interactions.

Classic radar designs rely on fine spatial resolution relative to target size in order to resolve different objects and distinguish their spatial structures. Such spatial resolution typically requires broad transmission bandwidth, narrow antenna beamwidth, and large antenna arrays. Soli, on the other hand, employs a fundamentally different sensing paradigm based on motion, rather than spatial structure. Because of this novel paradigm, we were able to fit Soli’s entire antenna array for Pixel 4 on a 5 mm x 6.5 mm x 0.873 mm chip package, allowing the radar to be integrated in the top of the phone. Remarkably, we developed algorithms that specifically do not require forming a well-defined image of a target’s spatial structure, in contrast to an optical imaging sensor, for example. Therefore, no distinguishable images of a person’s body or face are generated or used for Motion Sense presence or gesture detection.
Soli’s location in Pixel 4.
Soli relies on processing temporal changes in the received signal in order to detect and resolve subtle motions. The Soli radar transmits a 60 GHz frequency-modulated signal and receives a superposition of reflections off of nearby objects or people. A sub-millimeter-scale displacement in a target’s position from one transmission to the next induces a distinguishable timing shift in the received signal. Over a window of multiple transmissions, these shifts manifest as a Doppler frequency that is proportional to the object’s velocity. By resolving different Doppler frequencies, the Soli signal processing pipeline can distinguish objects moving with different motion patterns.
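As a rough illustration of why sub-millimeter motion is detectable at 60 GHz, the sketch below applies the textbook radar relations for round-trip phase shift (4*pi*d/wavelength) and Doppler frequency (2*v/wavelength). These are standard formulas, not Soli's actual signal processing, and the function names and the sign convention (Doppler sign follows the velocity sign) are assumptions made for this example. At a wavelength of about 5 mm, a 0.1 mm displacement already produces a phase change of roughly 0.25 rad.

```python
import math

C = 299_792_458.0   # speed of light in m/s
CARRIER_HZ = 60e9   # carrier frequency stated in the post


def wavelength(carrier_hz=CARRIER_HZ):
    """Free-space wavelength in meters (about 5 mm at 60 GHz)."""
    return C / carrier_hz


def phase_shift(displacement_m, carrier_hz=CARRIER_HZ):
    """Round-trip phase change in radians for a radial displacement.

    The wave travels to the target and back, so the path length changes
    by twice the displacement.
    """
    return 4.0 * math.pi * displacement_m / wavelength(carrier_hz)


def doppler_hz(radial_velocity_mps, carrier_hz=CARRIER_HZ):
    """Doppler frequency in Hz, proportional to radial velocity.

    Assumption: the sign simply follows the sign of the velocity;
    real systems may use the opposite convention.
    """
    return 2.0 * radial_velocity_mps / wavelength(carrier_hz)


if __name__ == "__main__":
    print(f"wavelength: {wavelength() * 1e3:.2f} mm")
    print(f"phase shift for 0.1 mm: {phase_shift(1e-4):.3f} rad")
    print(f"Doppler at 1 m/s: {doppler_hz(1.0):.0f} Hz")
```

Because the Doppler frequency is linear in velocity, targets moving at different speeds fall into different frequency bins, which is what lets a pipeline separate them.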

The animations below demonstrate how different actions exhibit distinctive motion features in the processed Soli signal. The vertical axis of each image represents range, or radial distance, from the sensor, increasing from top to bottom. The horizontal axis represents velocity toward or away from the sensor, with zero at the center, negative velocities corresponding to approaching targets on the left, and positive velocities corresponding to receding targets on the right. Energy received by the radar is mapped into these range-velocity dimensions and represented by the intensity of each pixel. Thus, strongly reflective targets tend to be brighter relative to the surrounding noise floor compared to weakly reflective targets. The distribution and trajectory of energy within these range-velocity mappings show clear differences for a person walking, reaching, and swiping over the device.

In the left image, we see reflections from multiple body parts appearing on the negative side of the velocity axis as the person approaches the device, then converging at zero velocity at the top of the image as the person stops close to the device. In the middle image depicting a reach, a hand starts from a stationary position 20 cm from the sensor, then accelerates with negative velocity toward the device, and finally decelerates to a stop as it reaches the device. The reflection corresponding to the hand moves from the middle to the top of the image, corresponding to the hand’s decreasing range from the sensor over the course of the gesture. Finally, the third image shows a hand swiping over the device, moving with negative velocity toward the sensor on the left half of the velocity axis, passing directly over the sensor where its radial velocity is zero, and then away from the sensor on the right half of the velocity axis, before reaching a stop on the opposite side of the device.

Left: Presence - Person walking towards the device. Middle: Reach - Person reaching towards the device. Right: Swipe - Person swiping over the device.
The 3D position of each resolvable reflection can also be estimated by processing the signal received at each of Soli’s three receivers; this positional information can be used in addition to range and velocity for target differentiation.

The signal processing pipeline we designed for Soli includes a combination of custom filters and coherent integration steps that boost signal-to-noise ratio, attenuate unwanted interference, and differentiate reflections off a person from noise and clutter. These signal processing features enable Soli to operate at low power within the constraints of a consumer smartphone.

Designing Machine Learning Algorithms for Radar
After using Soli’s signal processing pipeline to filter and boost the original radar signal, the resulting signal transformations are fed to Soli’s ML models for gesture classification. These models have been trained to accurately detect and recognize the Motion Sense gestures with low latency.

There are two major research challenges to robustly classifying in-air gestures that are common to any motion sensing technology. The first is that every user is unique and performs even simple motions, such as a swipe, in a myriad of ways. The second is that throughout the day, there may be numerous extraneous motions within the range of the sensor that may appear similar to target gestures. Furthermore, when the phone moves, the whole world looks like it’s moving from the point of view of the motion sensor in the phone.

Solving these challenges required designing custom ML algorithms optimized for low-latency detection of in-air gestures from radar signals. Soli’s ML models consist of neural networks trained using millions of gestures recorded from thousands of Google volunteers. These radar recordings were mixed with hundreds of hours of background radar recordings from other Google volunteers containing generic motions made near the device. Soli’s ML models were trained using TensorFlow and optimized to run directly on Pixel’s low-power digital signal processor (DSP). This allows us to run the models at low power, even when the main application processor is powered down.

Taking Soli from Concept to Product
Soli’s integration into the Pixel smartphone was possible because the end-to-end radar system — including hardware, software, and algorithms — was carefully designed to enable touchless interaction within the size and power constraints of consumer devices. Soli’s miniature hardware allowed the full radar system to fit into the limited space in Pixel’s upper bezel, which was a significant team accomplishment. Indeed, the first Soli prototype in 2014 was the size of a desktop computer. We combined hardware innovations with our novel temporal sensing paradigm described earlier in order to shrink the entire radar system down to a single 5.0 mm x 6.5 mm RFIC, including antennas on package. The Soli team also introduced several innovative hardware power management schemes and optimized Soli’s compute cycles, enabling Motion Sense to fit within the power budget of the smartphone.

Hardware innovations included iteratively shrinking the radar system from a desktop-sized prototype to a single 5.0 mm x 6.5 mm RFIC, including antennas on package.
For integration into Pixel, the radar system team collaborated closely with product design engineers to preserve Soli signal quality. The chip placement within the phone and the z-stack of materials above the chip were optimized to maximize signal transmission through the glass and minimize reflections and occlusions from surrounding components. The team also invented custom signal processing techniques to enable coexistence with surrounding phone components. For example, a novel filter was developed to reduce the impact of audio vibration on the radar signal, enabling gesture detection while music is playing. Such algorithmic innovations enabled Motion Sense features across a variety of common user scenarios.

Vibration due to audio on Pixel 4 appearing as an artifact in Soli’s range-doppler signal representation.
Future Directions
The successful integration of Soli into Pixel 4 and Pixel 4 XL devices demonstrates for the first time the feasibility of radar-based machine perception in an everyday mobile consumer device. Motion Sense in Pixel devices shows Soli’s potential to bring seamless context awareness and gesture recognition for explicit and implicit interaction. We are excited to continue researching and developing Soli to enable new radar-based sensing and perception capabilities.

Acknowledgments
The work described above was a collaborative effort between Google Advanced Technology and Projects (ATAP) and the Pixel and Android product teams. We particularly thank Patrick Amihood for major contributions to this blog post.

Source: Google AI Blog


Real-Time 3D Object Detection on Mobile Devices with MediaPipe



Object detection is an extensively studied computer vision problem, but most of the research has focused on 2D object prediction. While 2D prediction only provides 2D bounding boxes, by extending prediction to 3D, one can capture an object’s size, position and orientation in the world, leading to a variety of applications in robotics, self-driving vehicles, image retrieval, and augmented reality. Although 2D object detection is relatively mature and has been widely used in the industry, 3D object detection from 2D imagery is a challenging problem, due to the lack of data and diversity of appearances and shapes of objects within a category.

Today, we are announcing the release of MediaPipe Objectron, a mobile real-time 3D object detection pipeline for everyday objects. This pipeline detects objects in 2D images, and estimates their poses and sizes through a machine learning (ML) model, trained on a newly created 3D dataset. Implemented in MediaPipe, an open-source cross-platform framework for building pipelines to process perceptual data of different modalities, Objectron computes oriented 3D bounding boxes of objects in real-time on mobile devices.
 
3D Object Detection from a single image. MediaPipe Objectron determines the position, orientation and size of everyday objects in real-time on mobile devices.
Obtaining Real-World 3D Training Data
While there are ample amounts of 3D data for street scenes, due to the popularity of research into self-driving cars that rely on 3D capture sensors like LIDAR, datasets with ground truth 3D annotations for more granular everyday objects are extremely limited. To overcome this problem, we developed a novel data pipeline using mobile augmented reality (AR) session data. With the arrival of ARCore and ARKit, hundreds of millions of smartphones now have AR capabilities and the ability to capture additional information during an AR session, including the camera pose, sparse 3D point clouds, estimated lighting, and planar surfaces.

In order to label ground truth data, we built a novel annotation tool for use with AR session data, which allows annotators to quickly label 3D bounding boxes for objects. This tool uses a split-screen view: on the left, 2D video frames overlaid with 3D bounding boxes; on the right, a view showing 3D point clouds, camera positions, and detected planes. Annotators draw 3D bounding boxes in the 3D view and verify their location by reviewing the projections in the 2D video frames. For static objects, we only need to annotate an object in a single frame and propagate its location to all frames using the ground truth camera pose information from the AR session data, which makes the procedure highly efficient.
Real-world data annotation for 3D object detection. Right: 3D bounding boxes are annotated in the 3D world with detected surfaces and point clouds. Left: Projections of annotated 3D bounding boxes are overlaid on top of video frames making it easy to validate the annotation.
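The propagation step for static objects can be sketched as follows: once a box is fixed in world coordinates, projecting its eight corners with each frame's camera pose yields the 2D overlay for that frame. This is a minimal illustration, not the annotation tool's code. It assumes a simple pinhole camera, a world-to-camera pose given as a rotation matrix and translation, and an axis-aligned box for brevity (the dataset's boxes are oriented). All function names and parameters are hypothetical.

```python
def box_corners(center, size):
    """Eight corners of an axis-aligned box in world coordinates."""
    cx, cy, cz = center
    hx, hy, hz = (s / 2.0 for s in size)
    return [(cx + sx * hx, cy + sy * hy, cz + sz * hz)
            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]


def project(point, rotation, translation, fx, fy, px, py):
    """Project a world point to pixels with a world-to-camera pose.

    rotation is a 3x3 nested list, translation a 3-tuple; fx, fy, px, py
    are assumed pinhole intrinsics (focal lengths and principal point).
    """
    xc, yc, zc = (sum(r * p for r, p in zip(row, point)) + t
                  for row, t in zip(rotation, translation))
    if zc <= 0:
        return None  # point is behind the camera
    return (fx * xc / zc + px, fy * yc / zc + py)


def propagate(corners, poses, intrinsics):
    """Project one annotated box into every frame's image."""
    return [[project(c, rot, trans, *intrinsics) for c in corners]
            for rot, trans in poses]


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    corners = box_corners(center=(0.0, 0.0, 2.0), size=(1.0, 1.0, 1.0))
    # Two frames: the second camera pose is shifted sideways.
    poses = [(identity, (0.0, 0.0, 0.0)), (identity, (0.5, 0.0, 0.0))]
    for frame in propagate(corners, poses, (100.0, 100.0, 50.0, 50.0)):
        print(frame[0])
```

The box is annotated once; only the per-frame pose changes, which is why a single annotation can cover a whole clip.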
AR Synthetic Data Generation
A popular approach is to complement real-world data with synthetic data in order to increase the accuracy of prediction. However, attempts to do so often yield poor, unrealistic data or, in the case of photorealistic rendering, require significant effort and compute. Our novel approach, called AR Synthetic Data Generation, places virtual objects into scenes that have AR session data, which allows us to leverage camera poses, detected planar surfaces, and estimated lighting to generate placements that are physically probable and with lighting that matches the scene. This approach results in high-quality synthetic data with rendered objects that respect the scene geometry and fit seamlessly into real backgrounds. By combining real-world data and AR synthetic data, we are able to increase the accuracy by about 10%.
An example of AR synthetic data generation. The virtual white-brown cereal box is rendered into the real scene, next to the real blue book.
An ML Pipeline for 3D Object Detection
We built a single-stage model to predict the pose and physical size of an object from a single RGB image. The model backbone has an encoder-decoder architecture, built upon MobileNetV2. We employ a multi-task learning approach, jointly predicting an object's shape with detection and regression. The shape task predicts the object's shape signals depending on what ground truth annotation is available, e.g. segmentation. This task is optional and can be skipped if the training data has no shape annotation. For the detection task, we use the annotated bounding boxes and fit a Gaussian to the box, with center at the box centroid, and standard deviations proportional to the box size. The goal for detection is then to predict this distribution with its peak representing the object’s center location. The regression task estimates the 2D projections of the eight bounding box vertices. To obtain the final 3D coordinates for the bounding box, we leverage a well-established pose estimation algorithm (EPnP). It can recover the 3D bounding box of an object, without a priori knowledge of the object dimensions. Given the 3D bounding box, we can easily compute the pose and size of the object. The diagram below shows our network architecture and post-processing. The model is light enough to run in real time on mobile devices (at 26 FPS on an Adreno 650 mobile GPU).
Network architecture and post-processing for 3D object detection.
Sample results of our network — [left] original 2D image with estimated bounding boxes, [middle] object detection by Gaussian distribution, [right] predicted segmentation mask.
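The detection target described above (a Gaussian centered on the box centroid, with standard deviations proportional to the box size) can be sketched as below. This is an illustrative reconstruction, not the released training code: it assumes a 2D pixel-space box, and the proportionality constant sigma_scale is a hypothetical parameter whose actual value is not given in the post.

```python
import math


def gaussian_heatmap(height, width, box, sigma_scale=0.1):
    """Build a detection target for one box.

    box is (x_min, y_min, x_max, y_max) in pixels. sigma_scale is an
    assumed constant tying the standard deviations to the box size.
    """
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    sx = max(sigma_scale * (x_max - x_min), 1e-6)
    sy = max(sigma_scale * (y_max - y_min), 1e-6)
    return [[math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
             for x in range(width)]
            for y in range(height)]


def peak(heatmap):
    """Return (row, col) of the strongest response, i.e. the object center."""
    _, row, col = max((value, y, x)
                      for y, line in enumerate(heatmap)
                      for x, value in enumerate(line))
    return row, col


if __name__ == "__main__":
    target = gaussian_heatmap(32, 32, box=(8, 4, 24, 20))
    print(peak(target))
```

At inference time the same peak search, applied to the predicted map instead of the target, would give the estimated center location.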
Detection and Tracking in MediaPipe
When the model is applied to every frame captured by the mobile device, it can suffer from jitter due to the ambiguity of the 3D bounding box estimated in each frame. To mitigate this, we adopt the detection+tracking framework recently released in our 2D object detection and tracking solution. This framework reduces the need to run the network on every frame, allowing the use of heavier and therefore more accurate models, while keeping the pipeline real-time on mobile devices. It also retains object identity across frames and ensures that the prediction is temporally consistent, reducing the jitter.

For further efficiency in our mobile pipeline, we run our model inference only once every few frames. Next, we take the prediction and track it over time using the approach described in our previous blogs for instant motion tracking and Motion Stills. When a new prediction is made, we consolidate the detection result with the tracking result based on the area of overlap.
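One common way to consolidate a fresh detection with existing tracks "based on the area of overlap" is intersection-over-union (IoU) matching. The sketch below is an assumption about how such a step might look, not MediaPipe's actual logic: it uses 2D boxes, a hypothetical threshold of 0.5, and greedy matching in which a detection either refreshes the best-overlapping track or starts a new one.

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def consolidate(tracks, detections, threshold=0.5):
    """Merge new detections into tracked boxes.

    tracks maps an integer track id to its current box. A detection that
    overlaps an existing track by at least `threshold` replaces that
    track's box and keeps its id; otherwise it gets a new id.
    """
    merged = dict(tracks)
    next_id = max(tracks, default=-1) + 1
    for det in detections:
        best_id = max(tracks, key=lambda k: iou(tracks[k], det), default=None)
        if best_id is not None and iou(tracks[best_id], det) >= threshold:
            merged[best_id] = det  # same object: keep identity
        else:
            merged[next_id] = det  # unseen object: start a new track
            next_id += 1
    return merged


if __name__ == "__main__":
    tracks = {0: (0, 0, 10, 10)}
    detections = [(1, 1, 11, 11), (50, 50, 60, 60)]
    print(consolidate(tracks, detections))
```

Keeping the track id when the overlap is high is what preserves object identity between the sparse model runs.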

To encourage researchers and developers to experiment and prototype based on our pipeline, we are releasing our on-device ML pipeline in MediaPipe, including an end-to-end demo mobile application and our trained models for two categories: shoes and chairs. We hope that sharing our solution with the wide research and development community will stimulate new use cases, new applications, and new research efforts. In the future, we plan to scale our model to many more categories, and further improve our on-device performance.
   
Examples of our 3D object detection in the wild.
Acknowledgements
The research described in this post was done by Adel Ahmadyan, Tingbo Hou, Jianing Wei, Matthias Grundmann, Liangkai Zhang, Jiuqiang Tang, Chris McClanahan, Tyler Mullen, Buck Bourdon, Esha Uboweja, Mogan Shieh, Siarhei Kazakou, Ming Guang Yong, Chuo-Ling Chang, and James Bruce. We thank Aliaksandr Shyrokau and the annotation team for their diligence in producing high-quality annotations.

Source: Google AI Blog