Tag Archives: deep learning

AutoFlip: An Open Source Framework for Intelligent Video Reframing

Originally posted on the AI Blog

Videos filmed and edited for television and desktop are typically created and viewed in landscape aspect ratios (16:9 or 4:3). However, with an increasing number of users creating and consuming content on mobile devices, historical aspect ratios don’t always fit the display being used for viewing. Traditional approaches for reframing video to different aspect ratios usually involve static cropping, i.e., specifying a camera viewport, then cropping visual contents that are outside. Unfortunately, these static cropping approaches often lead to unsatisfactory results due to the variety of composition and camera motion styles. More bespoke approaches, however, typically require video curators to manually identify salient contents on each frame, track their transitions from frame-to-frame, and adjust crop regions accordingly throughout the video. This process is often tedious, time-consuming, and error-prone.

To address this problem, we are happy to announce AutoFlip, an open source framework for intelligent video reframing. AutoFlip is built on top of the MediaPipe framework that enables the development of pipelines for processing time-series multimodal data. Taking a video (casually shot or professionally edited) and a target dimension (landscape, square, portrait, etc.) as inputs, AutoFlip analyzes the video content, develops optimal tracking and cropping strategies, and produces an output video with the same duration in the desired aspect ratio.
Left: Original video (16:9). Middle: Reframed using a standard central crop (9:16). Right: Reframed with AutoFlip (9:16). By detecting the subjects of interest, AutoFlip is able to avoid cropping off important visual content.

AutoFlip Overview

AutoFlip provides a fully automatic solution to smart video reframing, making use of state-of-the-art ML-enabled object detection and tracking technologies to intelligently understand video content. AutoFlip detects changes in the composition that signify scene changes in order to isolate scenes for processing. Within each shot, video analysis is used to identify salient content before the scene is reframed by selecting a camera mode and path optimized for the contents.

Shot (Scene) Detection

A scene or shot is a continuous sequence of video without cuts (or jumps). To detect the occurrence of a shot change, AutoFlip computes the color histogram of each frame and compares this with prior frames. If the distribution of frame colors changes at a different rate than a sliding historical window, a shot change is signaled. AutoFlip buffers the video until the scene is complete before making reframing decisions, in order to optimize the reframing for the entire scene.
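To make this concrete, the sketch below shows one way a histogram-based cut detector can be written. It is a simplified illustration of the approach described above, not AutoFlip's implementation; the bin counts, distance metric, and threshold are illustrative assumptions.

```python
# Hypothetical sketch of histogram-based shot detection; names and thresholds are illustrative.
import cv2
import numpy as np
from collections import deque

def detect_shot_changes(video_path, window=30, threshold=0.25):
    """Yield frame indices where the color distribution jumps relative to a sliding window."""
    cap = cv2.VideoCapture(video_path)
    history = deque(maxlen=window)   # recent frame-to-frame histogram distances
    prev_hist = None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            baseline = np.mean(history) if history else 0.0
            # A distance far above the recent baseline suggests a cut.
            if history and dist > baseline + threshold:
                yield idx
            history.append(dist)
        prev_hist = hist
        idx += 1
    cap.release()
```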

Video Content Analysis

We utilize deep learning-based object detection models to find interesting, salient content in the frame. This content typically includes people and animals, but other elements may be identified, depending on the application, including text overlays and logos for commercials, or motion and ball detection for sports.

The face and object detection models are integrated into AutoFlip through MediaPipe, which uses TensorFlow Lite on CPU. This structure allows AutoFlip to be extensible, so developers may conveniently add new detection algorithms for different use cases and video content. Each object type is associated with a weight value, which defines its relative importance — the higher the weight, the more influence the feature will have when computing the camera path.


Left: People detection on sports footage. Right: Two face boxes (‘core’ and ‘all’ face landmarks). In narrow portrait crop cases, often only the core landmark box can fit.

Reframing

After identifying the subjects of interest on each frame, logical decisions about how to reframe the content for a new view can be made. AutoFlip automatically chooses an optimal reframing strategy — stationary, panning or tracking — depending on the way objects behave during the scene (e.g., moving around or stationary). In stationary mode, the reframed camera viewport is fixed in a position where important content can be viewed throughout the majority of the scene. This mode can effectively mimic professional cinematography in which a camera is mounted on a stationary tripod or where post-processing stabilization is applied. In other cases, it is best to pan the camera, moving the viewport at a constant velocity. The tracking mode provides continuous and steady tracking of interesting objects as they move around within the frame.

Based on which of these three reframing strategies the algorithm selects, AutoFlip then determines an optimal cropping window for each frame, while best preserving the content of interest. While the bounding boxes track the objects of focus in the scene, they typically exhibit considerable jitter from frame-to-frame and, consequently, are not sufficient to define the cropping window. Instead, we adjust the viewport on each frame through the process of Euclidean-norm optimization, in which we minimize the residuals between a smooth (low-degree polynomial) camera path and the bounding boxes.

Top: Camera paths resulting from following the bounding boxes from frame-to-frame. Bottom: Final smoothed camera paths generated using Euclidean-norm path formation. Left: Scene in which objects are moving around, requiring a tracking camera path. Right: Scene where objects stay close to the same position; a stationary camera covers the content for the full duration of the scene.
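As a rough illustration of this least-squares smoothing step, the snippet below fits a low-degree polynomial to the per-frame positions of the detected content and evaluates it as a smooth viewport trajectory. AutoFlip's actual optimization handles both axes, constraints, and the choice of camera mode, so treat this purely as a sketch of the Euclidean-norm idea.

```python
# Minimal sketch: fit a low-degree polynomial to jittery per-frame bounding-box centers
# (least squares, i.e., Euclidean-norm minimization) to obtain a smooth camera path.
import numpy as np

def smooth_camera_path(box_centers_x, degree=2):
    """box_centers_x: per-frame x-coordinate of the detected content's center."""
    t = np.arange(len(box_centers_x))
    coeffs = np.polyfit(t, box_centers_x, deg=degree)  # minimize sum of squared residuals
    return np.polyval(coeffs, t)                       # smooth viewport center per frame

# Example: jittery detections around a slow pan from x=100 to x=160.
jittery = 100 + 0.5 * np.arange(120) + np.random.normal(0, 4, 120)
smooth = smooth_camera_path(jittery, degree=1)         # near-constant-velocity pan
```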

AutoFlip’s configuration graph provides settings for either best-effort or required reframing. If it becomes infeasible to cover all the required regions (for example, when they are too spread out on the frame), the pipeline will automatically switch to a less aggressive strategy by applying a letterbox effect, padding the image to fill the frame. For cases where the background is detected as being a solid color, this color is used to create seamless padding; otherwise a blurred version of the original frame is used.
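The padding fallback can be pictured with a short sketch like the one below, which pads with the detected solid background color when the frame border is nearly uniform and otherwise uses a blurred, stretched copy of the frame. Thresholds and helper names are assumptions for illustration, not AutoFlip code.

```python
# Illustrative padding fallback: solid-color padding when the background looks uniform,
# otherwise a blurred, stretched copy of the frame.
import cv2
import numpy as np

def pad_to_aspect(frame, target_w, target_h, solid_std_threshold=8.0):
    h, w = frame.shape[:2]
    scale = min(target_w / w, target_h / h)
    resized = cv2.resize(frame, (int(w * scale), int(h * scale)))
    # Sample the frame border to decide whether the background is (nearly) a solid color.
    border = np.concatenate([frame[0], frame[-1], frame[:, 0], frame[:, -1]])
    if border.std(axis=0).mean() < solid_std_threshold:
        canvas = np.full((target_h, target_w, 3), border.mean(axis=0), dtype=np.uint8)
    else:
        canvas = cv2.GaussianBlur(cv2.resize(frame, (target_w, target_h)), (51, 51), 0)
    y = (target_h - resized.shape[0]) // 2
    x = (target_w - resized.shape[1]) // 2
    canvas[y:y + resized.shape[0], x:x + resized.shape[1]] = resized
    return canvas
```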

AutoFlip Use Cases

We are excited to release this tool directly to developers and filmmakers, reducing the barriers to their design creativity and reach through the automation of video editing. The ability to adapt any video format to various aspect ratios is becoming increasingly important as the diversity of devices for video content consumption continues to rapidly increase. Whether your use case is portrait to landscape, landscape to portrait, or even small adjustments like 4:3 to 16:9, AutoFlip provides a solution for intelligent, automated and adaptive video reframing.


What’s Next?

Like any machine learning algorithm, AutoFlip can benefit from an improved ability to detect objects relevant to the intent of the video, such as speaker detection for interviews or animated face detection on cartoons. Additionally, a common issue arises when input video has important overlays on the edges of the screen (such as text or logos) as they will often be cropped from the view. By combining text/logo detection and image inpainting technology, we hope that future versions of AutoFlip can reposition foreground objects to better fit the new aspect ratios. Lastly, in situations where padding is required, deep uncrop technology could provide improved ability to expand beyond the original viewable area.

While we work to improve AutoFlip internally at Google, we encourage contributions from developers and filmmakers in the open source communities.

Acknowledgments

We would like to thank our colleagues who contributed to AutoFlip: Alexander Panagopoulos, Jenny Jin, Brian Mulford, Yuan Zhang, Alex Chen, Xue Yang, Mickey Wang, Justin Parra, Hartwig Adam, Jingbin Wang, and Weilong Yang; and the MediaPipe team, who helped with open sourcing: Jiuqiang Tang, Tyler Mullen, Mogan Shieh, Ming Guang Yong, and Chuo-Ling Chang.

By Nathan Frey, Senior Software Engineer, Google Research, Los Angeles and Zheng Sun, Senior Software Engineer, Google Research, Mountain View



Learning to See Transparent Objects



Optical 3D range sensors, like RGB-D cameras and LIDAR, have found widespread use in robotics to generate rich and accurate 3D maps of the environment, from self-driving cars to autonomous manipulators. However, despite the ubiquity of these complex robotic systems, transparent objects (like a glass container) can confound even a suite of expensive sensors that are commonly used. This is because optical 3D sensors are driven by algorithms that assume all surfaces are Lambertian, i.e., they reflect light evenly in all directions, resulting in a uniform surface brightness from all viewing angles. However, transparent objects violate this assumption, since their surfaces both refract and reflect light. Hence, most of the depth data from transparent objects are invalid or contain unpredictable noise.
Transparent objects often fail to be detected by optical 3D sensors. Top, Right: For instance, glass bottles do not show up in the 3D depth imagery captured from an Intel® RealSense™ D415 RGB-D camera. Bottom: A 3D visualization via point clouds constructed from the depth image.
Enabling machines to better sense transparent surfaces would not only improve safety, but could also open up a range of new interactions in unstructured applications — from robots handling kitchenware or sorting plastics for recycling, to navigating indoor environments or generating AR visualizations on glass tabletops.

To address this problem, we teamed up with researchers from Synthesis AI and Columbia University to develop ClearGrasp, a machine learning algorithm that is capable of estimating accurate 3D data of transparent objects from RGB-D images. This is made possible by a large-scale synthetic dataset that we are also releasing publicly today. ClearGrasp can work with inputs from any standard RGB-D camera, using deep learning to accurately reconstruct the depth of transparent objects and generalize to completely new objects unseen during training. This is in contrast to previous methods, which required prior knowledge of the transparent objects (e.g., their 3D models), often combined with maps of background lighting and camera positions. In this work, we also demonstrate that ClearGrasp can benefit robotic manipulation by incorporating it into our pick and place robot’s control system, where we observe significant improvements in the grasping success rate of transparent plastic objects.
ClearGrasp uses deep learning to recover accurate 3D depth data of transparent surfaces.
A Visual Dataset of Transparent Objects
Massive quantities of data are required to train any effective deep learning model (e.g., ImageNet for vision or Wikipedia for BERT), and ClearGrasp is no exception. Unfortunately, no datasets are available with 3D data of transparent objects. Existing 3D datasets like Matterport3D or ScanNet overlook transparent surfaces, because they require expensive and time-consuming labeling processes.

To overcome this issue, we created our own large-scale dataset of transparent objects that contains more than 50,000 photorealistic renders with corresponding surface normals (representing the surface curvature), segmentation masks, edges, and depth, useful for training a variety of 2D and 3D detection tasks. Each image contains up to five transparent objects, either on a flat ground plane or inside a tote, with various backgrounds and lighting.

Some example data of transparent objects from the ClearGrasp synthetic dataset.
We also include a test set of 286 real-world images with corresponding ground truth depth. The real-world images were taken by a painstaking process of replacing each transparent object in the scene with a painted one in the same pose. The images are captured under a number of different indoor lighting conditions, using various cloth and veneer backgrounds and containing random opaque objects scattered around the scene. They contain both known objects, present in the synthetic training set, and novel objects.
Left: The real-world image capture setup. Middle: A custom user interface enables precisely replacing each transparent object with a spray-painted duplicate. Right: Example of captured data.
The Challenge
While the distorted view of the background seen through transparent objects confounds typical depth estimation approaches, there are clues that hint at the objects’ shape. Transparent surfaces exhibit specular reflections, which are mirror-like reflections that show up as bright spots in a well-lit environment. Since these visual cues are prominent in RGB images and are influenced primarily by the shape of the objects, convolutional neural networks can use these reflections to infer accurate surface normals, which then can be used for depth estimation.
Specular reflections on transparent objects create distinct features that vary based on the object shape and provide strong visual cues for estimating surface normals.
Most machine learning algorithms try to directly estimate depth from a monocular RGB image. However, monocular depth estimation is an ill-posed task, even for humans. We observed large errors in estimating the depth of flat background surfaces, which compounds the error in depth estimates for the transparent objects resting atop them. Therefore, rather than directly estimating the depth of all geometry, we conjectured that correcting the initial depth estimates from an RGB-D 3D camera is more practical — it would enable us to use the depth from the non-transparent surfaces to inform the depth of transparent surfaces.

The ClearGrasp Algorithm
ClearGrasp uses 3 neural networks: a network to estimate surface normals, one for occlusion boundaries (depth discontinuities), and one that masks transparent objects. The mask is used to remove all pixels belonging to transparent objects, so that the correct depths can be filled in. We then use a global optimization module that starts extending the depth from known surfaces, using the predicted surface normals to guide the shape of the reconstruction, and the predicted occlusion boundaries to maintain the separation between distinct objects.
Overview of our method. The point cloud was generated using the output depth and is colored with its surface normals.
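The data flow through the three networks and the optimization module can be summarized in a few lines of Python. The network and solver callables below are placeholders standing in for ClearGrasp's trained models and global optimizer, so only the structure (mask out transparent depth, then complete it guided by normals and boundaries) is meant to match the description above.

```python
# Structural sketch of the ClearGrasp-style pipeline; all callables are placeholders.
import numpy as np

def cleargrasp_like_depth(rgb, raw_depth, normal_net, boundary_net, mask_net, solver):
    normals = normal_net(rgb)            # (H, W, 3) predicted surface normals
    boundaries = boundary_net(rgb)       # (H, W) occlusion-boundary probabilities
    transparent = mask_net(rgb) > 0.5    # (H, W) boolean transparent-object mask

    # Remove unreliable sensor depth on transparent pixels, keep the rest as anchors.
    depth = raw_depth.copy()
    depth[transparent] = 0.0

    # Global optimization: extend depth from the known (opaque) pixels, letting the
    # normals constrain local surface orientation and the boundaries stop propagation
    # across distinct objects.
    return solver(depth, normals, boundaries, known_mask=~transparent)
```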
Each of the neural networks was trained on our synthetic dataset and they performed well on real-world transparent objects. However, the surface normal estimations for other surfaces, like walls or fruits, were poor. This is because of the limitations of our synthetic dataset, which contains only transparent objects on a ground plane. To alleviate this issue, we included some real indoor scenes from the Matterport3D and ScanNet datasets in the surface normals training loop. By training on both the in-domain synthetic dataset and the out-of-domain real-world data, the model performed well on all surfaces in our test set.
Surface Normal estimation on real images when trained on a) Matterport3D and ScanNet only (MP+SN), b) our synthetic dataset only, and c) MP+SN as well as our synthetic dataset. Note how the model trained on MP+SN fails to detect the transparent objects. The model trained on only synthetic data picks up the real plastic bottles remarkably well, but fails for other objects and surfaces. When trained on both, our model gets the best of both worlds.
Results
Overall, our quantitative experiments show that ClearGrasp is able to reconstruct depth for transparent objects with much higher fidelity than alternative methods. Despite being trained on only synthetic transparent objects, we find our models are able to adapt well to the real-world domain — achieving very similar quantitative reconstruction performance on known objects across domains. Our models also generalize well to novel objects with complex shapes never seen before.

To check the qualitative performance of ClearGrasp, we construct 3D point clouds from the input and output depth images, as shown below (additional examples available on the project webpage). The resulting estimated 3D surfaces have clean and coherent reconstructed shapes — important for applications, such as 3D mapping and 3D object detection — without the jagged noise seen in monocular depth estimation methods. Our models are robust and perform well in challenging conditions, such as identifying transparent objects situated in a patterned background or differentiating between transparent objects partially occluding one another.
Qualitative results on real images. Top two rows: results on known objects. Bottom two rows: results on novel objects. The point clouds, colored with their surface normals, are generated from the corresponding depth images.
Most importantly, the output depth from ClearGrasp can be directly used as input to state-of-the-art manipulation algorithms that use RGB-D images. By using ClearGrasp’s output depth estimates instead of the raw sensor data, our grasping algorithm on a UR5 robot arm saw significant improvements in the grasping success rates of transparent objects. When using the parallel-jaw gripper, the success rate improved from a baseline of 12% to 74%, and from 64% to 86% with suction.
Manipulation of novel transparent objects using ClearGrasp. Note the challenging conditions: textureless background, complex object shapes and the directional light causing confusing shadows and caustics (the patterns of light that occur when light rays are reflected or refracted from a surface).
Limitations & Future Work
A limitation of our synthetic dataset is that it does not represent accurate caustics, due to the limitations of rendering with traditional path-tracing algorithms. As a result, our models confuse bright caustics coupled with shadows to be independent transparent objects. Despite these drawbacks, our work with ClearGrasp shows that synthetic data remains a viable approach to achieve competent results for learning-based depth reconstruction methods. A promising direction for future work is improving the domain transfer to real-world images by generating renders with physically-correct caustics and surface imperfections such as fingerprints.

With ClearGrasp, we demonstrate that high-quality renders can be used to successfully train models that perform well in the real world. We hope that our dataset will drive further research on data-driven perception algorithms for transparent objects. Download links and more example images can be found on our project website and our GitHub repository.

Acknowledgements
This research was done by Shreeyak Sajjan (Synthesis.ai), Matthew Moore (Synthesis.ai), Mike Pan (Synthesis.ai), Ganesh Nagaraja (Synthesis.ai), Johnny Lee, Andy Zeng, and Shuran Song (Columbia University). We would like to thank Ryan Hickman for managerial support, Ivan Krasin and Stefan Welker for fruitful technical discussions, Cameron (@camfoxmusic) for sharing 3D models of his potion bottles and Sharat Sajjan for helping with web design.

Source: Google AI Blog


Releasing the Drosophila Hemibrain Connectome — The Largest Synapse-Resolution Map of Brain Connectivity



A fundamental way to describe a complex system is to measure its “network” — the way individual parts connect and communicate with each other. For example, biologists study gene networks, social scientists study social networks and even search engines rely, in part, on analyzing the way web pages form a network by linking to one another.

In neuroscience, a long-standing hypothesis is that the connectivity between brain cells plays a major role in the function of the brain. While technical difficulties have historically been a barrier for neuroscientists trying to study brain networks in detail, this is beginning to change. Last year, we announced the first nanometer-resolution automated reconstruction of an entire fruit fly brain, which focused on the individual shape of the cells. However, this accomplishment didn't reveal information about their connectivity.

Today, in collaboration with the FlyEM team at HHMI’s Janelia Research Campus and several other research partners, we are releasing the “hemibrain” connectome, a highly detailed map of neuronal connectivity in the fly brain, along with a suite of tools for visualization and analysis. The hemibrain is derived from a 3D image of roughly half the fly brain, and contains verified connectivity between ~25,000 neurons that form more than twenty million connections. To date, this is the largest synapse-resolution map of brain connectivity that has ever been produced, in any species. The goal of this project has been to produce a public resource that any scientist can use to advance their own work, similar to the fly genome, which was released twenty years ago and has become a fundamental tool in biology.
Fly brain regions contained within the hemibrain connectome. Also available: interactive version (example region: mushroom body).
Imaging, Reconstructing, and Proofreading the Hemibrain Connectome
Over a decade of research and development from numerous research partners was required to overcome the challenges in producing the hemibrain connectome. At Janelia, new methods were developed to stain the fly brain and then divide the tissue into separate 20-micron-thick slabs. Each slab was then imaged at 8×8×8 nm³ voxel resolution using focused ion beam scanning electron microscopes customized for months-long continuous operation. Computational methods were developed to stitch and align the raw data into a coherent 26-trillion-pixel 3D volume.

However, without an accurate 3D reconstruction of the neurons in a fly brain, producing a connectome from this type of imaging data is impossible. Forming a collaboration with Janelia in 2014, Google began working on the fly brain data, focused on automating 3D reconstruction to jointly work towards producing a connectome. After several iterations of technological development we devised a method called flood-filling networks (FFNs) and applied it towards reconstructing the entire hemibrain dataset. In the current project, we worked closely with our collaborators to optimize the reconstruction results to be more useful for generating a connectome (i.e., embedded within a proofreading and synaptic detection pipeline), rather than just showing the shape of the neurons.
A flood-filling network segmenting (tracing) part of a neuron in the fly hemibrain data.
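For intuition only, the toy sketch below captures the flood-filling flavor of the approach: grow an object mask outward from a seed, accepting voxels only where a model is confident they belong to the same object. The real flood-filling networks use a 3D convolutional network over a moving field of view on EM volumes; the score_fov function here is a stand-in introduced for illustration.

```python
# Toy flood-fill segmentation loop; `score_fov` is a placeholder for a learned model.
import numpy as np
from collections import deque

def flood_fill_segment(volume, seed, score_fov, threshold=0.9):
    """Greedily grow a binary object mask outward from `seed` (z, y, x)."""
    mask = np.zeros(volume.shape, dtype=bool)
    queue = deque([seed])
    while queue:
        z, y, x = queue.popleft()
        if mask[z, y, x]:
            continue
        # score_fov inspects the raw data (and current mask) around this location and
        # returns the probability that it belongs to the same object as the seed.
        if score_fov(volume, mask, (z, y, x)) < threshold:
            continue
        mask[z, y, x] = True
        for dz, dy, dx in [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]:
            nz, ny, nx = z + dz, y + dy, x + dx
            if 0 <= nz < volume.shape[0] and 0 <= ny < volume.shape[1] and 0 <= nx < volume.shape[2]:
                queue.append((nz, ny, nx))
    return mask
```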
FFNs were the first automated segmentation technology to yield reconstructions that were sufficiently accurate to enable the overall hemibrain project to proceed. This is because errors in automated reconstruction require correction by expert human “proofreaders,” and previous approaches were estimated to require tens of millions of hours of human effort. With FFNs, the hemibrain was proofread using hundreds of thousands of human hours: a two order-of-magnitude improvement. This (still substantial) proofreading effort was performed over two years by a team of highly skilled and dedicated annotators, using tools and workflows pioneered at Janelia for this purpose. For example, annotators used VR headsets and custom 3D object-editing tools to examine neuron shapes and fix errors in the automated reconstruction. These revisions were then used to retrain the FFN models, leading to revised and more accurate machine output.

Finally, after proofreading, the reconstruction was combined with automated synaptic detection in order to produce the hemibrain connectome. Janelia scientists manually labeled individual synapses and then trained neural network classifiers to automate the task. Generalization was improved through multiple rounds of labeling, and the results from two different network architectures were merged to produce robust classifications across the hemibrain.

Further details about producing the hemibrain can be found in HHMI’s press release.

What Is Being Released?
The focus of today’s announcement is a set of inter-related datasets and tools that enable any interested person to visually and programmatically study the fly connectome. Specifically, the following resources are available:
  • Terabytes of raw data, proofread 3D reconstruction, and synaptic annotations can be interactively visualized or downloaded in bulk.
  • A web-based tool, neuPrint, which can be used to query the connectivity, partners, connection strengths, and morphologies of any specified neurons.
  • A downloadable, compact representation of the connectome that is roughly a million-times smaller in bytes than the raw data from which it was derived.
  • Documentation and video tutorials explaining the use of these resources.
  • A pre-print with further details related to the production and analysis of the hemibrain connectome.
Next Steps
Researchers have begun using the hemibrain connectome to develop a more robust understanding of the Drosophila nervous system. For example, a major brain circuit of interest is the “central complex,” which integrates sensory information and is involved in navigation, motor control, and sleep:
A detailed view of “ring neurons” in the central complex of the fly brain, one of many neural circuits that can be studied using the hemibrain reconstruction and connectome. Interactive version: ring neurons and ellipsoid body.
Another circuit that is being intensely studied is the “mushroom body,” a primary site of learning and memory in the Drosophila brain whose detailed structure is contained within the hemibrain connectome (interactive visualization).

Acknowledgements
We would like to acknowledge core contributions from Tim Blakely, Laramie Leavitt, Peter Li, Larry Lindsey, Jeremy Maitin-Shepard (Google), Stuart Berg, Gary Huang, Bill Katz, Chris Ordish, Stephen Plaza, Pat Rivlin, Shin-ya Takemura (Janelia collaborators who worked closely with Google’s team), and other amazing collaborators at Janelia and elsewhere who were involved in the hemibrain project.

Source: Google AI Blog


Can You Trust Your Model’s Uncertainty?



In an ideal world, machine learning (ML) methods like deep learning are deployed to make predictions on data from the same distribution as that on which they were trained. But the practical reality can be quite different: camera lenses becoming blurry, sensors degrading, and changes to popular online topics can result in differences between the distribution of data on which the model was trained and to which a model is applied, leading to what is known as covariate shift. For example, it was recently observed that deep learning models trained to detect pneumonia in chest x-rays would achieve very different levels of accuracy when evaluated on previously unseen hospitals’ data, due in part to subtle differences in image acquisition and processing.

In “Can you trust your model’s uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift,” presented at NeurIPS 2019, we benchmark the uncertainty of state-of-the-art deep learning models as they are exposed to both shifting data distributions and out-of-distribution data. In this work we consider a variety of input modalities, including images, text, and online advertising data, exposing these deep learning models to increasingly shifted test data while carefully analyzing the behavior of their predictive probabilities. We also compare a variety of different methods for improving model uncertainty to see which strategies perform best under distribution shift.

What is Out-of-Distribution Data?
Deep learning models provide a probability with each prediction, representing the model confidence or uncertainty. As such, they can express what they don’t know and, correspondingly, abstain from prediction when the data is outside the realm of the original training dataset. In the case of covariate shift, uncertainty would ideally increase proportionally to any decrease in accuracy. A more extreme case is when data are not at all represented in the training set, i.e., when the data are out-of-distribution (OOD). For example, consider what happens when a cat-versus-dog image classifier is shown an image of an airplane. Would the model confidently predict incorrectly or would it assign a low probability to each class? In a related post we recently discussed methods we developed to identify such OOD examples. In this work we instead analyze the predictive uncertainty of models given out-of-distribution and shifted examples to see if the model probabilities reflect their ability to predict on such data.

Quantifying the Quality of Uncertainty
What does it mean for one model to have better representation of its uncertainty than another? While this can be a nuanced question that often is defined by a downstream task, there are ways to quantitatively assess the general quality of probabilistic predictions. For example, the meteorological community has carefully considered this question and developed a set of proper scoring rules that a comparison function for probabilistic weather forecasts should satisfy in order to be well-calibrated, while still rewarding accuracy. We applied several of these proper scoring rules, such as the Brier Score and Negative Log Likelihood (NLL), along with more intuitive heuristics, such as the expected calibration error (ECE), to understand how different ML models dealt with uncertainty under dataset shift.
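For readers who want to compute these quantities themselves, the NumPy sketch below implements one common convention for the Brier score, NLL, and ECE on multi-class probabilistic predictions; exact definitions (especially the binning choices for ECE) vary across papers.

```python
# Minimal metric sketches; `probs` is (N, K) predicted probabilities, `labels` is (N,) int labels.
import numpy as np

def brier_score(probs, labels):
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def negative_log_likelihood(probs, labels, eps=1e-12):
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def expected_calibration_error(probs, labels, n_bins=10):
    confidences = probs.max(axis=1)
    accuracies = (probs.argmax(axis=1) == labels).astype(float)
    ece = 0.0
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between average confidence and accuracy, weighted by bin occupancy.
            ece += in_bin.mean() * abs(confidences[in_bin].mean() - accuracies[in_bin].mean())
    return ece
```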

Experiments
We analyze the effect of dataset shift on uncertainty across a variety of data modalities, including images, text, online advertising data and genomics. As an example, we illustrate the effect of dataset shift on the ImageNet dataset, a popular image understanding benchmark. ImageNet involves classifying over a million images into 1000 different categories. Some now consider this challenge mostly solved, and have developed harder variants, such as Corrupted ImageNet (or ImageNet-C), in which the data are augmented according to 16 different realistic corruptions, each at 5 different intensities.
We explore how model uncertainty behaves under changes to the data distribution, such as increasing intensities of the image perturbations used in Corrupted ImageNet. Shown here are examples of each type of image corruption, at intensity level 3 (of 5).
We used these corrupted images as examples of shifted data and examined the predictive probabilities of deep learning models as they were exposed to shifts of increasing intensity. Below we show box plots of the resulting accuracy and the ECE for each level of corruption (including uncorrupted test data), where each box aggregates across all corruption types in ImageNet-C. Each color represents a different type of model — a “vanilla” deep neural network used as a baseline, four uncertainty methods (dropout, temperature scaling and our last layer approaches), and an ensemble approach.
Accuracy (top) and expected calibration error (bottom; lower is better) for increasing intensities of dataset shift on ImageNet-C. We observe that the decrease in accuracy is not reflected by an increase in uncertainty of the model, indicated by both accuracy and ECE getting worse.
As the shift intensity increases, the deviation in accuracy across corruption methods for each model increases (increasing box size), as expected, and the accuracy on the whole decreases. Ideally this would be reflected in increasing uncertainty of the model, thus leaving the expected calibration error (ECE) unchanged. However, looking at the lower plot of the ECE, one sees that this is not the case and that calibration generally suffers as well. We observed similar worsening trends for Brier score and NLL indicating that the models are not becoming increasingly unsure with shift, but instead are becoming confidently wrong.

One popular method to improve calibration is known as temperature scaling, a variant of Platt scaling, which involves smoothing the predictions after training, using performance on a held-out validation set. We observed that while this improved calibration on the standard test data, it often made things worse on shifted data! Thus, practitioners applying this technique should be wary of distributional shift.
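A minimal sketch of temperature scaling, assuming you already have held-out validation logits and labels, is shown below: it learns a single scalar temperature by minimizing NLL on the validation set and then rescales test-time logits. Function names and optimizer settings are illustrative.

```python
# Post-hoc temperature scaling sketch using NumPy and SciPy.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels):
    def nll(t):
        probs = softmax(val_logits / t)
        return -np.mean(np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12))
    # Search for the temperature that minimizes validation NLL.
    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return result.x

# Usage: T = fit_temperature(val_logits, val_labels)
#        calibrated_probs = softmax(test_logits / T)
```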

Fortunately, one method degrades in uncertainty much more gracefully than others. Deep ensembles (green), which average the predictions of a selection of models, each of which have different initializations, is a simple strategy that significantly improves robustness to shift and outperforms all other methods tested.
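The recipe itself is simple enough to sketch: train several copies of the same model from different random initializations and average their predicted probabilities. The snippet below assumes scikit-learn-style model objects with fit and predict_proba methods, purely for illustration.

```python
# Deep-ensemble sketch; `build_model` and the training data are placeholders.
import numpy as np

def train_deep_ensemble(build_model, x_train, y_train, n_members=5):
    members = []
    for seed in range(n_members):
        model = build_model(seed=seed)   # different random initialization per member
        model.fit(x_train, y_train)
        members.append(model)
    return members

def ensemble_predict_proba(members, x):
    # Average the per-member class probabilities; the spread across members is an
    # informal signal of predictive uncertainty.
    probs = np.stack([m.predict_proba(x) for m in members], axis=0)
    return probs.mean(axis=0)
```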

Summary and Recommended Best Practices
In our paper, we explored the behavior of state-of-the-art models under dataset shift across images, text, online advertising data and genomics. Our findings were mostly consistent across these different kinds of data. The quality of uncertainty degrades under dataset shift, but there are promising avenues of research to mitigate this. We hope that deep learning users take home the following messages from our study:
  1. Uncertainty under dataset shift is a real concern that needs to be considered when training models.
  2. Improving calibration and accuracy on an in-distribution test set often does not translate to improved calibration on shifted data.
  3. Out of all the methods we considered, deep ensembles are the most robust to dataset shift, and a relatively small ensemble size (e.g., 5) is sufficient. The effectiveness of ensembles presents interesting avenues for improving other approaches.
Improving the predictive uncertainty of deep learning models remains an active area of research in ML. We have released all of the code and model predictions from this benchmark in the hope that it will be useful to the community to drive and evaluate future work on this important topic.

Source: Google AI Blog


ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations



Ever since the advent of BERT a year ago, natural language research has embraced a new paradigm, leveraging large amounts of existing text to pretrain a model’s parameters using self-supervision, with no data annotation required. So, rather than needing to train a machine-learning model for natural language processing (NLP) from scratch, one can start from a model primed with knowledge of a language. But, in order to improve upon this new approach to NLP, one must develop an understanding of what, exactly, is contributing to language-understanding performance — the network’s height (i.e., number of layers), its width (size of the hidden layer representations), the learning criteria for self-supervision, or something else entirely?

In “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”, accepted at ICLR 2020, we present an upgrade to BERT that advances the state-of-the-art performance on 12 NLP tasks, including the competitive Stanford Question Answering Dataset (SQuAD v2.0) and the SAT-style reading comprehension RACE benchmark. ALBERT is being released as an open-source implementation on top of TensorFlow, and includes a number of ready-to-use ALBERT pre-trained language representation models.

What Contributes to NLP Performance?
Identifying the dominant driver of NLP performance is complex — some settings are more important than others, and, as our study reveals, a simple, one-at-a-time exploration of these settings would not yield the correct answers.

The key to optimizing performance, captured in the design of ALBERT, is to allocate the model’s capacity more efficiently. Input-level embeddings (words, sub-tokens, etc.) need to learn context-independent representations, a representation for the word “bank”, for example. In contrast, hidden-layer embeddings need to refine that into context-dependent representations, e.g., a representation for “bank” in the context of financial transactions, and a different representation for “bank” in the context of river-flow management.

This is achieved by factorization of the embedding parametrization — the embedding matrix is split between input-level embeddings with a relatively-low dimension (e.g., 128), while the hidden-layer embeddings use higher dimensionalities (768 as in the BERT case, or more). With this step alone, ALBERT achieves an 80% reduction in the parameters of the projection block, at the expense of only a minor drop in performance — 80.3 SQuAD2.0 score, down from 80.4; or 67.9 on RACE, down from 68.2 — with all other conditions the same as for BERT.
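A quick back-of-the-envelope calculation makes the saving concrete. Assuming a 30k WordPiece vocabulary (an assumption for illustration) with E = 128 and H = 768 as in the text:

```python
# Parameter counts for a full V x H embedding versus the factorized V x E + E x H form.
V, E, H = 30000, 128, 768

bert_style = V * H              # one V x H embedding matrix: ~23.0M parameters
albert_style = V * E + E * H    # V x E lookup followed by an E x H projection: ~3.9M

print(bert_style, albert_style, 1 - albert_style / bert_style)  # roughly an 80%+ reduction
```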

Another critical design decision for ALBERT stems from a different observation that examines redundancy. Transformer-based neural network architectures (such as BERT, XLNet, and RoBERTa) rely on independent layers stacked on top of each other. However, we observed that the network often learned to perform similar operations at various layers, using different parameters of the network. This possible redundancy is eliminated in ALBERT by parameter-sharing across the layers, i.e., the same layer is applied on top of each other. This approach slightly diminishes the accuracy, but the more compact size is well worth the tradeoff. Parameter sharing achieves a 90% parameter reduction for the attention-feedforward block (a 70% reduction overall), which, when applied in addition to the factorization of the embedding parameterization, incurs a slight performance drop of -0.3 on SQuAD2.0 to 80.0, and a larger drop of -3.9 on RACE score to 64.0.
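Conceptually, cross-layer parameter sharing amounts to reusing one block's weights at every depth instead of stacking independently parameterized blocks. The Keras sketch below illustrates the idea with a deliberately simplified block (layer normalization and other details are omitted); it is not the ALBERT implementation.

```python
# Cross-layer parameter sharing sketch: one attention block and one feed-forward block
# are applied repeatedly, so parameter count no longer grows with depth.
import tensorflow as tf

num_layers, hidden, heads = 12, 768, 12

shared_attn = tf.keras.layers.MultiHeadAttention(num_heads=heads, key_dim=hidden // heads)
shared_ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(4 * hidden, activation="gelu"),
    tf.keras.layers.Dense(hidden),
])

def shared_encoder(x):
    # The same weights are reused at every depth of the encoder.
    for _ in range(num_layers):
        x = x + shared_attn(x, x)
        x = x + shared_ffn(x)
    return x
```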

Implementing these two design changes together yields an ALBERT-base model that has only 12M parameters, an 89% parameter reduction compared to the BERT-base model, yet still achieves respectable performance across the benchmarks considered. But this parameter-size reduction provides the opportunity to scale up the model again. Assuming that memory size allows, one can scale up the size of the hidden-layer embeddings by 10-20x. With a hidden-size of 4096, the ALBERT-xxlarge configuration achieves both an overall 30% parameter reduction compared to the BERT-large model, and, more importantly, significant performance gains: +4.2 on SQuAD2.0 (88.1, up from 83.9), and +8.5 on RACE (82.3, up from 73.8).

These results indicate that accurate language understanding depends on developing robust, high-capacity contextual representations. The context, modeled in the hidden-layer embeddings, captures the meaning of the words, which in turn drives the overall understanding, as directly measured by model performance on standard benchmarks.

Optimized Model Performance with the RACE Dataset
To evaluate the language understanding capability of a model, one can administer a reading comprehension test (e.g., similar to the SAT Reading Test). This can be done with the RACE dataset (2017), the largest publicly available resource for this purpose. Computer performance on this reading comprehension challenge mirrors well the language modeling advances of the last few years: a model pre-trained with only context-independent word representations scores poorly on this test (45.9; left-most bar), while BERT, with context-dependent language knowledge, scores relatively well with a 72.0. Refined BERT models, such as XLNet and RoBERTa, set the bar even higher, in the 82-83 score range. The ALBERT-xxlarge configuration mentioned above yields a RACE score in the same range (82.3), when trained on the base BERT dataset (Wikipedia and Books). However, when trained on the same larger dataset as XLNet and RoBERTa, it significantly outperforms all other approaches to date, and establishes a new state-of-the-art score at 89.4.
Machine performance on the RACE challenge (SAT-like reading comprehension). A random-guess baseline score is 25.0. The maximum possible score is 95.0.
The success of ALBERT demonstrates the importance of identifying the aspects of a model that give rise to powerful contextual representations. By focusing improvement efforts on these aspects of the model architecture, it is possible to greatly improve both the model efficiency and performance on a wide range of NLP tasks. To facilitate further advances in the field of NLP, we are open-sourcing ALBERT to the research community.

Source: Google AI Blog


Understanding Transfer Learning for Medical Imaging



As deep neural networks are applied to an increasingly diverse set of domains, transfer learning has emerged as a highly popular technique in developing deep learning models. In transfer learning, the neural network is trained in two stages: 1) pretraining, where the network is generally trained on a large-scale benchmark dataset representing a wide diversity of labels/categories (e.g., ImageNet); and 2) fine-tuning, where the pretrained network is further trained on the specific target task of interest, which may have fewer labeled examples than the pretraining dataset. The pretraining step helps the network learn general features that can be reused on the target task.
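In code, the two-stage paradigm often looks something like the hedged Keras sketch below: load ImageNet-pretrained weights, attach a task-specific head, and fine-tune on the target data. The dataset objects, class count, and hyperparameters are placeholders, not the paper's setup.

```python
# Two-stage transfer learning sketch: ImageNet-pretrained backbone plus a new task head.
import tensorflow as tf

def build_finetune_model(num_classes=5, image_size=(224, 224, 3)):
    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=image_size, pooling="avg")
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Fine-tuning: model.fit(medical_train_ds, validation_data=medical_val_ds, epochs=10)
# The random-initialization baseline simply passes weights=None above.
```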

This kind of two-stage paradigm has become extremely popular in many settings, and particularly so in medical imaging. In the context of transfer learning, standard architectures designed for ImageNet with corresponding pretrained weights are fine-tuned on medical tasks ranging from interpreting chest x-rays and identifying eye diseases, to early detection of Alzheimer’s disease. Despite its widespread use, however, the precise effects of transfer learning are not yet well understood. While recent work challenges many common assumptions, including the effects on performance improvement, contribution of the underlying architecture and impact of pretraining dataset type and size, these results are all in the natural image setting, and leave many questions open for specialized domains, such as medical images.

In our NeurIPS 2019 paper, “Transfusion: Understanding Transfer Learning for Medical Imaging,” we investigate these central questions for transfer learning in medical imaging tasks. Through both a detailed performance evaluation and analysis of neural network hidden representations, we uncover many surprising conclusions, such as the limited benefits of transfer learning for performance on the tested medical imaging tasks, a detailed characterization of how representations evolve through the training process across different models and hidden layers, and feature-independent benefits of transfer learning for convergence speed.

Performance Evaluation
We first performed a thorough study on the effect of transfer learning on model performance. We compared models trained from random initialization and applied directly on tasks to those pretrained on ImageNet that leverage transfer learning for the same tasks. We looked at two large-scale medical imaging tasks — diagnosing diabetic retinopathy from fundus photographs and identifying five different diseases from chest x-rays. We evaluated various neural network architectures, including both standard architectures popularly used for medical imaging (ResNet50, Inception-v3) as well as a family of simple, lightweight convolutional neural networks that consist of four or five layers of the standard convolution-batchnorm-ReLU progression, or CBRs.
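For reference, a CBR network of the kind described above might look like the following sketch; the filter counts and input size are illustrative choices, not the exact configurations evaluated in the paper.

```python
import tensorflow as tf

def cbr_model(input_shape=(224, 224, 3), num_classes=5,
              filters=(32, 64, 128, 256)):
    """A small convolution-batchnorm-ReLU (CBR) stack, one block per entry
    in `filters`, followed by global pooling and a classification head."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for f in filters:
        x = tf.keras.layers.Conv2D(f, kernel_size=3, strides=2, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```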

The results from evaluating all of these models on the different tasks with and without transfer learning give us four main takeaways:
  • Surprisingly, transfer learning does not significantly affect performance on medical imaging tasks, with models trained from scratch performing nearly as well as standard ImageNet transferred models.
  • On the medical imaging tasks, the much smaller CBR models perform at a level comparable to the standard ImageNet architectures.
  • As the CBR models are much smaller and shallower than the standard ImageNet models, they perform much worse on ImageNet classification, highlighting that ImageNet performance is not indicative of performance on medical tasks.
  • The two medical tasks are much smaller in size than ImageNet (~200k vs ~1.2m training images), but in the very small data regime, there may only be a few thousand training examples. We evaluated transfer learning in this very small data regime, finding that while there was a larger gap in performance between transfer learning and training from scratch for large models (ResNet), this was not true for smaller models (CBRs), suggesting that the large models designed for ImageNet might be too overparameterized for the very small data regime.
Representation Analysis
We next study the degree to which transfer learning affects the kinds of features and representations learned by the neural networks. Given the similar performance, does transfer learning result in different representations from random initialization? Is knowledge from the pretraining step reused, and if so, where? To answer these questions, we analyze and compare the hidden representations (i.e., the representations learned in the latent layers of the network) of the different neural networks trained to solve these tasks. This quantitative analysis can be challenging, due to the complexity of, and lack of alignment between, different hidden layers. But a recent method, singular vector canonical correlation analysis (SVCCA; code and tutorials), based on canonical correlation analysis (CCA), helps overcome these challenges and can be used to calculate a similarity score between a pair of hidden representations.
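To make the idea of a similarity score concrete, the sketch below computes a plain CCA-based score between two activation matrices; full SVCCA additionally applies an SVD step to each set of activations before the CCA, which is omitted here for brevity.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_similarity(acts_a, acts_b, n_components=20):
    """Rough similarity between two activation matrices of shape
    (num_datapoints, num_neurons): the mean correlation of the canonical
    components. Values closer to 1 indicate more similar representations."""
    cca = CCA(n_components=n_components, max_iter=1000)
    a, b = cca.fit_transform(acts_a, acts_b)
    corrs = [np.corrcoef(a[:, i], b[:, i])[0, 1] for i in range(n_components)]
    return float(np.mean(corrs))
```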

Similarity scores are computed for some of the hidden representations from the top latent layers of the networks (closer to the output) between networks trained from random initialization and networks trained from pretrained ImageNet weights. As a baseline, we also compute similarity scores of representations learned from different random initializations. For large models, representations learned from random initialization are much more similar to each other than those learned from transfer learning. For smaller models, there is greater overlap between representation similarity scores.
Representation similarity scores between networks trained from random initialization and networks trained from pretrained ImageNet weights (orange), and baseline similarity scores of representations trained from two different random initializations (blue). Higher values indicate greater similarity. For larger models, representations learned from random initialization are much more similar to each other than those learned through transfer. This is not the case for smaller models.
The reason for this difference between large and small models becomes clear with further investigation into the hidden representations. Large models change less through training, even from random initialization. We perform multiple experiments that illustrate this, from simple filter visualizations to tracking changes between different layers through fine-tuning.

When we combine the results of all the experiments from the paper, we can assemble a table summarizing how much representations change through training on the medical task across (i) transfer learning, (ii) model size and (iii) lower/higher layers.
Effects on Convergence: Feature Independent Benefits and Hybrid Approaches
One consistent effect of transfer learning was a significant speedup in the time taken for the model to converge. But having seen the mixed results for feature reuse from our representational study, we looked into whether there were other properties of the pretrained weights that might contribute to this speedup. Surprisingly, we found a feature-independent benefit of pretraining — the weight scaling.

We initialized the weights of the neural network as independent and identically distributed (iid), just like random initialization, but using the mean and variance of the pretrained weights. We called this initialization the Mean Var Init, which keeps the pretrained weight scaling but destroys all the features. This Mean Var Init offered significant speedups over random initialization across model architectures and tasks, suggesting that the pretraining process of transfer learning also helps with good weight conditioning.
Filter visualization of weights initialized according to pretrained ImageNet weights (ImageNet Init), Random Init, and Mean Var Init. Only the ImageNet Init filters have pretrained (Gabor-like) structure, as the Random Init and Mean Var Init weights are iid.
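A minimal sketch of this initialization is shown below, assuming the weights come from a Keras model; each tensor is re-sampled iid from a distribution with the mean and standard deviation of the corresponding pretrained tensor, which preserves the scaling but destroys the learned features (the choice of a normal distribution here is an illustrative assumption).

```python
import numpy as np

def mean_var_init(pretrained_weights, seed=0):
    """Replace each pretrained weight tensor with an iid sample matching its
    mean and standard deviation (a sketch of the Mean Var Init idea)."""
    rng = np.random.default_rng(seed)
    return [rng.normal(loc=w.mean(), scale=w.std(), size=w.shape).astype(w.dtype)
            for w in pretrained_weights]

# Usage with a model whose weights were loaded from ImageNet pretraining:
# model.set_weights(mean_var_init(model.get_weights()))
```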
Recall that our earlier experiments suggested that feature reuse primarily occurs in the lowest layers. To understand this, we performed weight transfusion experiments, where only a subset of the pretrained weights (corresponding to a contiguous set of layers) are transferred, with the remainder of weights being randomly initialized. Comparing convergence speeds of these transfused networks with full transfer learning further supports the conclusion that feature reuse is primarily happening in the lowest layers.
Learning curves comparing the convergence speed with AUC on the test set. Using only the scaling of the pretrained weights (Mean Var Init) helps with convergence speed. The figures compare the standard transfer learning and the Mean Var initialization scheme to training from random initialization.
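A weight transfusion experiment can be sketched as follows, assuming the pretrained and randomly initialized models share the same architecture so their layers line up one-to-one; only the first `num_layers` layers receive pretrained weights, and the rest keep their random initialization.

```python
def transfuse(pretrained_model, target_model, num_layers):
    """Copy pretrained weights into the first `num_layers` layers of
    `target_model` (a contiguous lower block of a Keras model); the
    remaining layers stay randomly initialized."""
    for src, dst in zip(pretrained_model.layers[:num_layers],
                        target_model.layers[:num_layers]):
        dst.set_weights(src.get_weights())
    return target_model
```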
This suggests hybrid approaches to transfer learning, where instead of reusing the full neural network architecture, we can recycle its lowest layers and redesign the upper layers to better suit the target task. This gives us most of the benefits of transfer learning while further enabling flexible model design. In the Figure below, we show the effect of reusing pretrained weights up to Block2 in Resnet50, halving the remainder of the channels, initializing those layers randomly, and then training end-to-end. This matches the performance and convergence of full transfer learning.
Hybrid approaches to transfer learning on Resnet50 (left) and CBR models (right) — reusing a subset of the weights and slimming the remainder of the network (Slim), and using mathematically synthesized Gabors for conv1 (Synthetic Gabor).
The figure above also shows the results of an extreme version of this partial reuse, transferring only the very first convolutional layer with mathematically synthesized Gabor filters (pictured below). Using just these (synthetic) weights offers significant speedups, and hints at many other creative hybrid approaches.
Synthetic Gabor filters used to initialize the first layer of neural networks in some of the experiments in this paper. The Gabor filters are generated as grayscale images and repeated across the RGB channels. Left: Low frequencies. Right: High frequencies.
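As a sketch of how such filters can be produced (the exact parameters used in the paper are not reproduced here), a single grayscale Gabor filter can be synthesized with NumPy and tiled across the RGB channels:

```python
import numpy as np

def gabor_filter(size=7, theta=0.0, sigma=2.0, wavelength=4.0, psi=0.0, gamma=1.0):
    """Generate a single grayscale Gabor filter of shape (size, size):
    a Gaussian envelope modulated by an oriented sinusoidal carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float32)
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength + psi)
    return envelope * carrier

# Repeat the grayscale filter across the three RGB input channels, as described
# above (filter size, orientation, and frequency are illustrative).
kernel = np.repeat(gabor_filter(theta=np.pi / 4)[:, :, None], 3, axis=2)
```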
Conclusion and Open Questions
Transfer learning is a central technique for many domains. In this paper we provide insights on some of its fundamental properties in the medical imaging context, studying performance, feature reuse, the effect of different architectures, convergence and hybrid approaches. Many interesting open questions remain: How much of the original task has the model forgotten? Why do large models change less? Can we get further gains matching higher order moments of pretrained weight statistics? Are the results similar for other tasks, such as segmentation? We look forward to tackling these questions in future work!

Acknowledgements
Special thanks to Samy Bengio and Jon Kleinberg, who are co-authors on this work. Thanks also to Geoffrey Hinton for helpful feedback.

Source: Google AI Blog


The Visual Task Adaptation Benchmark



Deep learning has revolutionized computer vision, with state-of-the-art deep networks learning useful representations directly from raw pixels, leading to unprecedented performance on many vision tasks. However, learning these representations from scratch typically requires hundreds of thousands of training examples. This burden can be reduced by using pre-trained representations, which have become widely available through services such as TensorFlow Hub (TF Hub) and PyTorch Hub. But their ubiquity can itself be a hindrance. For example, for the task of extracting features from images, there can be over 100 models from which to choose. It is hard to know which methods provide the best representations, since different sub-fields use different evaluation protocols, which do not always reflect the final performance on new tasks.

The overarching goal of representation research is to learn representations a single time on large amounts of generic data without the need to train them from scratch for each task, thus reducing data requirements across all vision tasks. But in order to reach that goal, the research community must have a uniform benchmark against which existing and future methods can be evaluated.

To address this problem, we are releasing "The Visual Task Adaptation Benchmark" (VTAB, available on GitHub), a diverse, realistic, and challenging representation benchmark based on one principle — a better representation is one that yields better performance on unseen tasks, with limited in-domain data. Inspired by benchmarks that have driven progress in other fields of machine learning (ML), such as ImageNet for natural image classification, GLUE for Natural Language Processing, and Atari for reinforcement learning, VTAB follows similar guidelines: (i) minimal constraints on solutions to encourage creativity; (ii) a focus on practical considerations; and (iii) challenging tasks for evaluation.

The Benchmark
VTAB is an evaluation protocol designed to measure progress towards general and useful visual representations, and consists of a suite of evaluation vision tasks that a learning algorithm must solve. These algorithms may use pre-trained visual representations to assist them and must satisfy only two requirements:
    i) They must not be pre-trained on any of the data (labels or input images) used in the downstream evaluation tasks.
    ii) They must not contain hardcoded, task-specific logic. Alternatively put, the evaluation tasks must be treated like a test set — unseen.
These constraints ensure that solutions that are successful when applied to VTAB will be able to generalize to future tasks.

The VTAB protocol begins with the application of an algorithm (A) to a number of independent tasks drawn from a broad distribution of vision problems. The algorithm may be pre-trained on upstream data to yield a model that contains visual representations, but it must also define an adaptation strategy that consumes a small training set for each downstream task and returns a model that makes task-specific predictions. The algorithm’s final score is its average test score across tasks.
The VTAB protocol. Algorithm A is applied to many tasks T, drawn from a broad distribution of vision problems PT. In the example, pet classification, remote sensing, and maze localization are shown.
VTAB includes 19 evaluation tasks that span a variety of domains, divided into three groups — natural, specialized, and structured. Natural image tasks include images of the natural world captured through standard cameras, representing generic objects, fine-grained classes, or abstract concepts. Specialized tasks utilize images captured using specialist equipment, such as medical images or remote sensing. The structured tasks often derive from artificial environments that target understanding of specific changes between images, such as predicting the distance to an object in a 3D scene (e.g., DeepMind Lab), counting objects (e.g., CLEVR), or detecting orientation (e.g., dSprites for disentangled representations).

While highly diverse, all of the tasks in VTAB share one common feature — people can solve them relatively easily after training on just a few examples. To assess algorithmic generalization to new tasks with limited data, performance is evaluated using only 1000 examples per task. Evaluation using the full dataset can be performed for comparison with previous publications.
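Schematically, the protocol reduces to the loop below: adapt the same algorithm on at most 1000 in-domain examples for each task and report the mean test score. The `adapt` and `evaluate` callables and the `task` interface are placeholders for illustration, not the released VTAB API.

```python
def vtab_score(adapt, evaluate, tasks, examples_per_task=1000):
    """Average test score of one adaptation algorithm across all tasks,
    with each task adapted on a limited in-domain training subset."""
    scores = []
    for task in tasks:
        train_subset = task.train[:examples_per_task]   # 1000-example regime
        model = adapt(train_subset)                     # e.g., fine-tune a pre-trained model
        scores.append(evaluate(model, task.test))
    return sum(scores) / len(scores)                    # final score: mean across tasks
```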

Findings Using VTAB
We performed a large scale study testing a number of popular visual representation learning algorithms against VTAB. The study included generative models (GANs and VAEs), self-supervised models, semi-supervised models and supervised models. All of the algorithms were pre-trained on the ImageNet dataset. We also compared each of these approaches using no pre-trained representations, i.e., training “from-scratch”. The figure below summarizes the main pattern of results.
Performance of different classes of representation learning algorithms across different task groups: natural, specialized and structured. Each bar shows the average performance of all methods in that class across all tasks in the group.
Overall we find that generative models do not perform as well as the other methods, even worse than from-scratch training. However, self-supervised models perform much better, significantly outperforming from-scratch training. Better still is supervised learning using the ImageNet labels. Interestingly, while supervised learning is significantly better on the Natural group of tasks, self-supervised learning is close on the other two groups whose domains are more dissimilar to ImageNet.

The best performing representation learning algorithm, of those we tested, is S4L, which combines both supervised and self-supervised pre-training losses. The figure below contrasts S4L with standard supervised ImageNet pre-training. S4L appears to improve performance particularly on the Structured tasks. However, representation learning yields a much smaller benefit over training from scratch on the groups other than the Natural tasks, indicating that there is much progress required to attain a universal visual representation.
Top: Performance of S4L versus from-scratch training. Each bar corresponds to a task. Positive-valued bars indicate tasks where S4L outperforms from-scratch. Negative bars indicate that from-scratch performed better. Bottom: S4L versus Supervised training on ImageNet. Positive bars indicate that S4L performs better. The bar colour indicates the task group: Red=Natural, Green=Specialized, Blue=Structured. We can see that additional self-supervision tends to help on structured tasks beyond just using ImageNet labels.
Summary
The code to run VTAB is available on GitHub, including the 19 evaluation datasets and exact data splits. Having a publicly available set of benchmarks ensures the reproducibility of results. Progress is tracked with the public leaderboard, and the models evaluated are uploaded to TF Hub for public use and reproduction. A shell script is provided to perform adaptation and evaluation on all the tasks, with a standardized evaluation protocol making VTAB readily accessible across the industry. VTAB can be executed on both TPU and GPU, and it is efficient to run: one can obtain comparable results with a single NVIDIA Tesla P100 accelerator in a few hours.

The Visual Task Adaptation Benchmark has helped us better understand which visual representations generalize to a broad spectrum of vision tasks, and it provides direction for future research. We hope these resources are useful in driving progress toward general and practical visual representations and, as a result, bring deep learning to the long tail of vision problems with limited labelled data.

Acknowledgements
The core team behind this work includes Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, and Sylvain Gelly.

Source: Google AI Blog


Learning to Assemble and to Generalize from Self-Supervised Disassembly



Our physical world is full of different shapes, and learning how they are all interconnected is a natural part of interacting with our surroundings — for example, we understand that coat hangers hook onto clothing racks, power plugs insert into wall outlets, and USB cables fit into USB sockets. This general concept of “how things fit together” based on their shapes is something that people acquire over time and experience, and it helps to increase the efficiency with which we perform tasks, like assembling DIY furniture kits or packing gifts into a box. If robots could learn “how things fit together,” then perhaps they could become more adaptable to new manipulation tasks involving objects they have never seen before, like reconnecting severed pipes, or building makeshift shelters by piecing together debris during disaster response scenarios.

To explore this idea, we worked with researchers from Stanford and Columbia Universities to develop Form2Fit, a robotic manipulation algorithm that uses deep neural networks to learn to visually recognize how objects correspond (or “fit”) to each other. To test this algorithm, we tasked a real robot to perform kit assembly, where it needed to accurately assemble objects into a blister pack or corrugated display to form a single unit. Previous systems built for this task required extensive manual tuning to assemble a single kit unit at a time. However, we demonstrate that by learning the general concept of “how things fit together,” Form2Fit enables our robot to assemble various types of kits with a 94% success rate. Furthermore, Form2Fit is one of the first systems capable of generalizing to new objects and kitting tasks not seen during training.
Form2Fit learns to assemble a wide variety of kits by finding geometric correspondences between object surfaces and their target placement locations. By leveraging geometric information learned from multiple kits during training, the system generalizes to new objects and kits.
While often overlooked, shape analysis plays an important role in manipulation, especially for tasks like kit assembly. In fact, the shape of an object often matches the shape of its corresponding space in the packaging, and understanding this relationship is what allows people to do this task with minimal guesswork. At its core, Form2Fit aims to learn this relationship by training over numerous pairs of objects and their corresponding placing locations across multiple different kitting tasks – with the goal of acquiring a broader understanding of how shapes and surfaces fit together. Form2Fit improves itself over time with minimal human supervision, gathering its own training data by repeatedly disassembling completed kits through trial and error, then time-reversing the disassembly sequences to get assembly trajectories. After training overnight for 12 hours, our robot learns effective pick and place policies for a variety of kits, achieving 94% assembly success rates with objects and kits in varying configurations, and over 86% assembly success rates when handling completely new objects and kits.

Data-Driven Shape Descriptors For Generalizable Assembly
The core component of Form2Fit is a two-stream matching network that learns to infer orientation-sensitive geometric pixel-wise descriptors for objects and their target placement locations from visual data. These descriptors can be understood as compressed 3D point representations that encode object geometry, textures, and contextual task-level knowledge. Form2Fit uses these descriptors to establish correspondences between objects and their target locations (i.e., where they should be placed). Since these descriptors are orientation-sensitive, they allow Form2Fit to infer how the picked object should be rotated before it is placed in its target location.
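In effect, the correspondence step amounts to a nearest-neighbor lookup in descriptor space across candidate rotations, roughly as sketched below; the array shapes and conventions are assumptions made for illustration.

```python
import numpy as np

def match_placement(pick_descriptor, place_descriptor_maps):
    """Find the placement pixel and rotation whose descriptor is closest to
    the descriptor at the chosen pick pixel.
    pick_descriptor: shape (d,); place_descriptor_maps: shape (num_rotations, H, W, d).
    Returns (rotation_index, row, col)."""
    dists = np.linalg.norm(place_descriptor_maps - pick_descriptor, axis=-1)
    return np.unravel_index(np.argmin(dists), dists.shape)
```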

Form2Fit uses two additional networks to generate valid pick and place candidates. A suction network is fed a 3D image of the objects and generates pixel-wise predictions of suction success. The suction probability map is visualized as a heatmap, where hotter pixels indicate better locations to grasp the object at the 3D location of the corresponding pixel. In parallel, a place network is fed a 3D image of the target kit and outputs pixel-wise predictions of placement success. These, too, are visualized as a heatmap, where higher confidence values indicate better locations for the robot arm to approach from a top-down angle to place the object. Finally, the planner integrates the output of all three modules to produce the final pick location, place location, and rotation angle.
Overview of Form2Fit. The suction and place networks infer candidate picking and placing locations in the scene respectively. The matching network generates pixel-wise orientation-sensitive descriptors to match picking locations to their corresponding placing locations. The planner then integrates it all to control the robot to execute the next best pick and place action.
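The candidate-selection step of the planner can be sketched as taking the highest-confidence pixel in each heatmap; the full planner then resolves the rotation angle with the descriptor matching sketched above. This is a schematic of the idea, not the released code.

```python
import numpy as np

def select_candidates(suction_heatmap, place_heatmap):
    """Return the (row, col) of the best grasp and placement pixels from the
    two (H, W) confidence maps produced by the suction and place networks."""
    pick = np.unravel_index(np.argmax(suction_heatmap), suction_heatmap.shape)
    place = np.unravel_index(np.argmax(place_heatmap), place_heatmap.shape)
    return pick, place
```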
Learning Assembly from Disassembly
Neural networks require large amounts of training data, which can be difficult to collect for tasks like assembly. Precisely inserting objects into tight spaces with the correct orientation (e.g., in kits) is challenging to learn through trial and error, because the chances of success from random exploration can be slim. In contrast, disassembling completed units is often easier to learn through trial and error, since there are fewer incorrect ways to remove an object than there are to correctly insert it. We leveraged this difference in order to amass training data for Form2Fit.
An example of self-supervision through time-reversal: rewinding a disassembly sequence of a deodorant kit over time generates a valid assembly sequence.
Our key observation is that in many cases of kit assembly, a disassembly sequence – when reversed over time – becomes a valid assembly sequence. This concept, called time-reversed disassembly, enables Form2Fit to train entirely through self-supervision by randomly picking with trial and error to disassemble a fully-assembled kit, then reversing that disassembly sequence to learn how the kit should be put together.
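In code, time-reversed disassembly is simply a matter of reversing the recorded sequence and swapping the roles of pick and place; the tuple representation below is an assumption for illustration.

```python
def reverse_disassembly(disassembly_steps):
    """Convert a recorded disassembly sequence into an assembly sequence:
    reverse the order of steps and swap each pick/place pair.
    Each step is assumed to be a (pick_pose, place_pose) tuple."""
    return [(place, pick) for pick, place in reversed(disassembly_steps)]
```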

Generalization Results
The results of our experiments show great potential for learning generalizable policies for assembly. For instance, when a policy is trained to assemble a kit in only one specific position and orientation, it can still robustly assemble random rotations and translations of the kit 90% of the time.
Form2Fit policies are robust to a wide range of rotations and translations of the kits.
We also find that Form2Fit is capable of tackling novel configurations it has not been exposed to during training. For example, when training a policy on two single-object kits (floss and tape), we find that it can successfully assemble new combinations and mixtures of those kits, even though it has never seen such configurations before.
Form2Fit policies can generalize to novel kit configurations such as multiple versions of the same kit and mixtures of different kits.
Furthermore, when given completely novel kits on which it has not been trained, Form2Fit can generalize using its learned shape priors to assemble those kits with over 86% assembly accuracy.
Form2Fit policies can generalize to never-before-seen single and multi-object kits.
What Have the Descriptors Learned?
To explore what the descriptors of Form2Fit’s matching network have learned to encode, we visualize the pixel-wise descriptors of various objects in RGB colorspace using an embedding technique called t-SNE.
The t-SNE embedding of the learned object descriptors. Similarly oriented objects of the same category display identical colors (e.g., A, B or F, G), while different objects (e.g., C, H) and the same objects in different orientations (e.g., A, C, D or H, F) exhibit different colors.
We observe that the descriptors have learned to encode (a) rotation — objects oriented differently have different descriptors (A, C, D, E) and (H, F); (b) spatial correspondence — same points on the same oriented objects share similar descriptors (A, B) and (F, G); and (c) object identity — zoo animals and fruits exhibit unique descriptors (columns 3 and 4).
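A sketch of this visualization, assuming the descriptors are stacked into a (num_pixels, d) matrix, is to run t-SNE down to three dimensions and rescale the result into RGB values:

```python
import numpy as np
from sklearn.manifold import TSNE

def descriptors_to_rgb(descriptors, perplexity=30):
    """Embed pixel-wise descriptors (num_pixels, d) into 3-D with t-SNE and
    rescale each dimension to [0, 1] so the embedding can be rendered as colors."""
    emb = TSNE(n_components=3, init="random", perplexity=perplexity).fit_transform(
        np.asarray(descriptors, dtype=np.float32))
    emb -= emb.min(axis=0)
    emb /= emb.max(axis=0) + 1e-8
    return emb  # one RGB triple per pixel
```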

Limitations & Future Work
While Form2Fit’s results are promising, its limitations suggest directions for future work. In our experiments, we assume a 2D planar workspace to constrain the kit assembly task so that it can be solved by sequencing top-down picking and placing actions. This may not work for all cases of assembly – for example, when a peg needs to be precisely inserted at a 45 degree angle. It would be interesting to expand Form2Fit to more complex action representations for 3D assembly.

You can learn more about this work and download the code from our GitHub repository.


Acknowledgments
This research was done by Kevin Zakka, Andy Zeng, Johnny Lee, and Shuran Song (faculty at Columbia University), with special thanks to Nick Hynes, Alex Nichol, and Ivan Krasin for fruitful technical discussions; Adrian Wong, Brandon Hurd, Julian Salazar, and Sean Snyder for hardware support; Ryan Hickman for valuable managerial support; and Chad Richards for helpful feedback on writing.

Source: Google AI Blog