
Self-Supervised Tracking via Video Colorization



Tracking objects in video is a fundamental problem in computer vision, essential to applications such as activity recognition, object interaction, or video stylization. However, teaching a machine to visually track objects is challenging partly because it requires large, labeled tracking datasets for training, which are impractical to annotate at scale.

In “Tracking Emerges by Colorizing Videos”, we introduce a convolutional network that colorizes grayscale videos, but is constrained to copy colors from a single reference frame. In doing so, the network learns to visually track objects automatically without supervision. Importantly, although the model was never trained explicitly for tracking, it can follow multiple objects, track through occlusions, and remain robust over deformations without requiring any labeled training data.
Example tracking predictions on the publicly available academic dataset DAVIS 2017. After learning to colorize videos, a mechanism for tracking automatically emerges without supervision. We specify regions of interest (indicated by different colors) in the first frame, and our model propagates them forward without any additional learning or supervision.

Learning to Recolorize Video
Our hypothesis is that the temporal coherency of color provides excellent large-scale training data for teaching machines to track regions in video. Clearly, there are exceptions when color is not temporally coherent (such as lights turning on suddenly), but in general color is stable over time. Furthermore, most videos contain color, providing a scalable self-supervised learning signal. We decolorize videos and then add a colorization step: although several objects in a scene may share the same color, learning to colorize teaches the machine to track specific objects or regions.

In order to train our system, we use videos from the Kinetics dataset, which is a large public collection of videos depicting everyday activities. We convert all video frames except the first frame into grayscale, and train a convolutional network to predict the original colors in the subsequent frames. We expect the model to learn to follow regions in order to accurately recover the original colors. Our main observation is that the need to follow objects for colorization causes a model for object tracking to be learned automatically.
We illustrate the video recolorization task using video from the DAVIS 2017 dataset. The model receives as input one color frame and a grayscale video, and predicts the colors for the rest of the video. The model learns to copy colors from the reference frame, which enables a mechanism for tracking to be learned without human supervision.
Learning to copy colors from the single reference frame requires the model to learn to internally point to the right region in order to copy the right colors. This forces the model to learn an explicit mechanism that we can use for tracking. To see how the video colorization model works, we show some predicted colorizations from videos in the Kinetics dataset below.

Examples of colors predicted by copying from a color reference frame onto the input video, using the publicly available Kinetics dataset.

Although the network is trained without ground-truth identities, our model learns to track any visual region specified in the first frame of a video. We can track outlined objects or a single point in the video. The only change we make is that, instead of propagating colors throughout the video, we now propagate labels representing the regions of interest.
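The pointing mechanism described above can be sketched in a few lines. This is a minimal illustration, not the model from the paper: it assumes every pixel already has an embedding vector, and it assumes the pointer is a softmax over dot-product similarities between target and reference embeddings. The function name `propagate` and the `temperature` parameter are hypothetical. The same function copies colors during training and labels during tracking, which mirrors the single change described in the text.

```python
import numpy as np

def propagate(ref_embed, tgt_embed, ref_values, temperature=1.0):
    """Copy per-pixel values from a reference frame to a target frame.

    ref_embed:  (N_ref, D) embeddings of reference-frame pixels
    tgt_embed:  (N_tgt, D) embeddings of target-frame pixels
    ref_values: (N_ref, C) values to copy (colors in training, labels in tracking)
    """
    # Similarity between each target pixel and every reference pixel
    # (dot product is an assumption made for this sketch).
    logits = tgt_embed @ ref_embed.T / temperature
    # Softmax over reference pixels turns similarities into a soft pointer.
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each target value is a weighted average of the reference values.
    return weights @ ref_values

# Toy example: three reference pixels with well-separated embeddings,
# each carrying a one-hot region label.
ref_embed = np.eye(3) * 5.0
ref_labels = np.eye(3)
# Target pixels whose embeddings match reference pixels 2, 0, 1 and 2.
tgt_embed = ref_embed[[2, 0, 1, 2]]
tracked = propagate(ref_embed, tgt_embed, ref_labels).argmax(axis=1)
print(tracked.tolist())  # -> [2, 0, 1, 2]
```

Because the output is a weighted average of reference values, the only way to reduce colorization error is to point at the right reference pixels, which is what makes the learned pointer reusable for tracking.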

Analyzing the Tracker
Since the model is trained on large amounts of unlabeled video, we want to gain insight into what the model learns. The videos below show a standard trick to visualize the embeddings learned by our model: projecting them down to three dimensions using Principal Component Analysis (PCA) and plotting them as an RGB movie. The results show that nearest neighbors in the learned embedding space tend to correspond to object identity, even over deformations and viewpoint changes.
Top Row: We show videos from the DAVIS 2017 dataset. Bottom Row: We visualize the internal embeddings from the colorization model. Similar embeddings will have a similar color in this visualization. This suggests the learned embedding is grouping pixels by object identity.
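The PCA visualization trick can be sketched as follows. This is an illustrative version only: the function name `embeddings_to_rgb` and the array layout (frames, height, width, embedding dimension) are assumptions, and the per-channel rescaling to [0, 1] is one reasonable choice for display rather than the procedure used for the figures. PCA is fit once over all frames so that a given color means the same thing throughout the movie.

```python
import numpy as np

def embeddings_to_rgb(embeddings):
    """Project embeddings of shape (..., D) to RGB values of shape (..., 3)."""
    d = embeddings.shape[-1]
    flat = embeddings.reshape(-1, d)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T
    # Rescale each component to [0, 1] so it can be shown as a color channel.
    lo = projected.min(axis=0)
    hi = projected.max(axis=0)
    rgb = (projected - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(embeddings.shape[:-1] + (3,))

# Random stand-in for model embeddings: 2 frames of 4x5 pixels, 16 dimensions.
rng = np.random.default_rng(0)
video_embeddings = rng.normal(size=(2, 4, 5, 16))
movie = embeddings_to_rgb(video_embeddings)
print(movie.shape)  # -> (2, 4, 5, 3)
```

Pixels with similar embeddings land close together in the three-dimensional projection and therefore receive similar colors.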

Tracking Pose
We found the model can also track human poses given keypoints in an initial frame. We show results on the publicly available academic dataset JHMDB, where we track a human joint skeleton.
Examples of using the model to track movements of the human skeleton. In this case the input was a human pose for the first frame and subsequent movement is automatically tracked. The model can track human poses even though it was never explicitly trained for this task.

While we do not yet outperform heavily supervised models, the colorization model learns to track video segments and human pose well enough to outperform the latest methods based on optical flow. Breaking down performance by motion type suggests that our model is a more robust tracker than optical flow for many natural complexities, such as dynamic backgrounds, fast motion, and occlusions. Please see the paper for details.

Future Work
Our results show that video colorization provides a signal that can be used for learning to track objects in videos without supervision. Moreover, we found that the failures from our system are correlated with failures to colorize the video, which suggests that further improving the video colorization model can advance progress in self-supervised tracking.

Acknowledgements
This project was only possible thanks to several collaborations at Google. The core team includes Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama and Kevin Murphy. We also thank David Ross, Bryan Seybold, Chen Sun and Rahul Sukthankar.

Source: Google AI Blog


Teaching Uncalibrated Robots to Visually Self-Adapt



People are remarkably proficient at manipulating objects without needing to adjust their viewpoint to a fixed or specific pose. This capability (referred to as visual motor integration) is learned during childhood from manipulating objects in various situations, and is governed by a self-adaptation and mistake correction mechanism that uses rich sensory cues and vision as feedback. However, this capability is quite difficult for vision-based controllers in robotics, which until now have been built on a rigid setup for reading visual input data from a fixed, mounted camera that must not be moved or repositioned at training or test time. The ability to quickly acquire visual motor control skills under large viewpoint variation would have substantial implications for autonomous robotic systems; for example, this capability would be particularly desirable for robots that can help rescue efforts in emergency or disaster zones.

In “Sim2Real Viewpoint Invariant Visual Servoing by Recurrent Control” presented at CVPR 2018 this week, we study a novel deep network architecture (consisting of two fully convolutional networks and a long short-term memory unit) that learns from a past history of actions and observations to self-calibrate. Using diverse simulated data consisting of demonstrated trajectories and reinforcement learning objectives, our visually-adaptive network is able to control a robotic arm to reach a diverse set of visually-indicated goals, from various viewpoints and independent of camera calibration.
Viewpoint invariant manipulation for visually indicated goal reaching with a physical robotic arm. We learn a single policy that can reach diverse goals from sensory input captured from drastically different camera viewpoints. First row shows the visually indicated goals.

The Challenge
Discovering how the controllable degrees of freedom (DoF) affect visual motion can be ambiguous and underspecified from a single image captured from an unknown viewpoint. Identifying the effect of actions on image-space motion and successfully performing the desired task requires a robust perception system augmented with the ability to maintain a memory of past actions. To be able to tackle this challenging problem, we had to address the following essential questions:
  • How can we make it feasible to provide the right amount of experience for the robot to learn the self-adaptation behavior based on pure visual observations that simulate a lifelong learning paradigm?
  • How can we design a model that integrates robust perception and self-adaptive control such that it can quickly transfer to unseen environments?
To do so, we devised a new manipulation task where a seven-DoF robot arm is provided with an image of an object and is directed to reach that particular goal amongst a set of distractor objects, while viewpoints change drastically from one trial to another. In doing so, we were able to simulate both the learning of complex behaviors and the transfer to unseen environments.
Visually indicated goal reaching task with a physical robotic arm and diverse camera viewpoints.
Harnessing Simulation to Learn Complex Behaviors
Collecting robot experience data is difficult and time-consuming. In a previous post, we showed how to scale up learning skills by distributing the data collection and trials to multiple robots. Although this approach expedited learning, it does not feasibly extend to learning complex behaviors such as visual self-calibration, where we need to expose robots to a huge space of viewpoints. Instead, we opt to learn such complex behavior in simulation, where we can collect unlimited robot trials and easily move the camera to various random viewpoints. In addition to fast data collection, simulation also lets us sidestep the hardware limitation of having to install multiple cameras around a robot.
We use the domain randomization technique to learn generalizable policies in simulation.
To learn visually robust features that transfer to unseen environments, we used a technique known as domain randomization (a.k.a. simulation randomization), introduced by Sadeghi & Levine (2017), that enables robots to learn vision-based policies entirely in simulation such that they can generalize to the real world. This technique was shown to work well for various robotic tasks such as indoor navigation, object localization, and pick-and-place. In addition, to learn complex behaviors like self-calibration, we harnessed the simulation capabilities to generate synthetic demonstrations and combined them with reinforcement learning objectives to learn a robust controller for the robotic arm.
Viewpoint invariant manipulation for visually indicated goal reaching with a simulated seven-DoF robotic arm. We learn a single policy that can reach diverse goals from sensory input captured from dramatically different camera viewpoints.
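To make the idea of randomizing a simulated scene concrete, here is a minimal sketch of a per-episode sampler. Everything specific in it is an assumption made for illustration: the function name `sample_scene`, the list of textures, the parameter ranges, and the choice of which properties to randomize are hypothetical and are not taken from the paper. The only points it borrows from the text are that the camera is moved to random viewpoints and that visual appearance is varied between trials.

```python
import math
import random

# Hypothetical texture names; a real setup would load image files.
TEXTURES = ["wood", "marble", "checker", "noise"]

def sample_scene(rng):
    """Draw one randomized scene configuration (all ranges are illustrative)."""
    # Random camera viewpoint on a partial sphere around the workspace.
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    elevation = rng.uniform(math.radians(15), math.radians(75))
    distance = rng.uniform(0.8, 2.0)
    camera_position = (
        distance * math.cos(elevation) * math.cos(azimuth),
        distance * math.cos(elevation) * math.sin(azimuth),
        distance * math.sin(elevation),
    )
    # Random appearance so the policy cannot rely on any one look.
    return {
        "camera_position": camera_position,
        "table_texture": rng.choice(TEXTURES),
        "object_color": tuple(rng.random() for _ in range(3)),
        "light_intensity": rng.uniform(0.5, 1.5),
    }

rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(100)]
```

Each training episode would be rendered with a freshly sampled configuration, so a policy that succeeds across all of them has to rely on cues that do not depend on a particular viewpoint or appearance.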

Disentangling Perception from Control
To enable fast transfer to unseen environments, we devised a deep neural network that combines perception and control trained end-to-end simultaneously, while also allowing each to be learned independently if needed. This disentanglement between perception and control eases transfer to unseen environments, and makes the model both flexible and efficient in that each of its parts (i.e., perception or control) can be independently adapted to new environments with small amounts of data. Additionally, while the control portion of the network was trained entirely on simulated data, the perception part of our network was complemented by a small number of static images with object bounding boxes, without needing to collect whole action sequence trajectories with a physical robot. In practice, we fine-tuned the perception part of our network with only 76 object bounding boxes coming from 22 images.
Real-world robot and moving camera setup. First row shows the scene arrangements and the second row shows the visual sensory input to the robot.
Early Results
We tested the visually-adapted version of our network on a physical robot and on real objects with drastically different appearances from the ones used in simulation. Experiments were performed with either one or two objects on a table. “Seen objects” (as labeled in the figure below) were used for visual adaptation using a small collection of real static images, while “unseen objects” had not been seen during visual adaptation. During the test, the robot arm was directed to reach a visually indicated object from various viewpoints. In the two-object experiments, the second object served as a distractor to "fool" the robotic arm. While the simulation-only network has good generalization capability (due to being trained with the domain randomization technique), the very small amount of static visual data used to visually adapt the controller boosted performance, thanks to the flexible architecture of our network.
After adapting the visual features with the small number of real images, performance was boosted by more than 10%. All of the real objects used are drastically different from the objects seen in simulation.
We believe that learning online visual self-adaptation is an important and yet challenging problem with the goal of learning generalizable policies for robots that can act in diverse and unstructured real-world settings. Our approach can be extended to any sort of automatic self-calibration. See the video below for more information on this work.
Acknowledgements
This research was conducted by Fereshteh Sadeghi, Alexander Toshev, Eric Jang and Sergey Levine. We would also like to thank Erwin Coumans and Yunfei Bai for providing PyBullet, and Vincent Vanhoucke for insightful discussions.




Source: Google AI Blog


How Can Neural Network Similarity Help Us Understand Training and Generalization?


In order to solve tasks, deep neural networks (DNNs) progressively transform input data into a sequence of complex representations (i.e., patterns of activations across individual neurons). Understanding these representations is critically important, not only for interpretability, but also so that we can more intelligently design machine learning systems. However, understanding these representations has proven quite difficult, especially when comparing representations across networks. In a previous post, we outlined the benefits of Canonical Correlation Analysis (CCA) as a tool for understanding and comparing the representations of convolutional neural networks (CNNs), showing that they converge in a bottom-up pattern, with early layers converging to their final representations before later layers over the course of training.

In “Insights on Representational Similarity in Neural Networks with Canonical Correlation” we develop this work further to provide new insights into the representational similarity of CNNs, including differences between networks which memorize (e.g., networks which can only classify images they have seen before) and those which generalize (e.g., networks which can correctly classify previously unseen images). Importantly, we also extend this method to provide insights into the dynamics of recurrent neural networks (RNNs), a class of models that are particularly useful for sequential data, such as language. Comparing RNNs is difficult in many of the same ways as CNNs, but RNNs present the additional challenge that their representations change over the course of a sequence. This makes CCA, with its helpful invariances, an ideal tool for studying RNNs in addition to CNNs. As such, we have additionally open sourced the code used for applying CCA on neural networks, with the hope that it will help the research community better understand network dynamics.

Representational Similarity of Memorizing and Generalizing CNNs
Ultimately, a machine learning system is only useful if it can generalize to new situations it has never seen before. Understanding the factors which differentiate between networks that generalize and those that don’t is therefore essential, and may lead to new methods to improve generalization performance. To investigate whether representational similarity is predictive of generalization, we studied two types of CNNs:
  • generalizing networks: CNNs trained on data with unmodified, accurate labels and which learn solutions which generalize to novel data.
  • memorizing networks: CNNs trained on datasets with randomized labels such that they must memorize the training data and cannot, by definition, generalize (as in Zhang et al., 2017).
We trained multiple instances of each network, differing only in the initial randomized values of the network weights and the order of the training data, and used a new weighted approach to calculate the CCA distance measure (see our paper for details) to compare the representations within each group of networks and between memorizing and generalizing networks.
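For readers unfamiliar with CCA as a distance between representations, the following is a minimal sketch. It computes plain (unweighted) canonical correlations with a standard QR-plus-SVD recipe and defines the distance as one minus their mean; it does not reproduce the weighted approach from the paper, and it is not the open-sourced code. The function names are hypothetical, and the recipe assumes more examples than neurons and activation matrices of full column rank.

```python
import numpy as np

def cca_correlations(x, y):
    """Canonical correlations between activations x (n, d1) and y (n, d2)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases for the column spaces of the two activation matrices.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of qx^T qy are the canonical correlations.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)

def cca_distance(x, y):
    """Unweighted CCA distance: 1 minus the mean canonical correlation."""
    return 1.0 - float(cca_correlations(x, y).mean())

rng = np.random.default_rng(0)
acts_a = rng.normal(size=(200, 5))            # 200 examples, 5 neurons
mixing = rng.normal(size=(5, 5))              # arbitrary linear re-mixing
same_subspace = cca_distance(acts_a, acts_a @ mixing)
unrelated = cca_distance(acts_a, rng.normal(size=(200, 5)))
```

Here `same_subspace` is close to zero even though no neuron in one matrix matches a neuron in the other, while `unrelated` is large. This invariance to linear re-mixing is what allows two independently initialized networks to be compared at all.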

We found that groups of different generalizing networks consistently converged to more similar representations (especially in later layers) than groups of memorizing networks (see figure below). At the softmax, which denotes the network’s ultimate prediction, the CCA distance for each group of generalizing and memorizing networks decreases substantially, as the networks in each separate group make similar predictions.
Groups of generalizing networks (blue) converge to more similar solutions than groups of memorizing networks (red). CCA distance was calculated between groups of networks trained on real CIFAR-10 labels (“Generalizing”) or randomized CIFAR-10 labels (“Memorizing”) and between pairs of memorizing and generalizing networks (“Inter”).
Perhaps most surprisingly, in later hidden layers, the representational distance between any given pair of memorizing networks was about the same as the representational distance between a memorizing and generalizing network (“Inter” in the plot above), despite the fact that these networks were trained on data with entirely different labels. Intuitively, this result suggests that while there are many different ways to memorize the training data (resulting in greater CCA distances), there are fewer ways to learn generalizable solutions. In future work, we plan to explore whether this insight can be used to regularize networks to learn more generalizable solutions.

Understanding the Training Dynamics of Recurrent Neural Networks
So far, we have only applied CCA to CNNs trained on image data. However, CCA can also be applied to calculate representational similarity in RNNs, both over the course of training and over the course of a sequence. Applying CCA to RNNs, we first asked whether the RNNs exhibit the same bottom-up convergence pattern we observed in our previous work for CNNs. To test this, we measured the CCA distance between the representation at each layer of the RNN over the course of training and its final representation at the end of training. We found that the CCA distance for layers closer to the input dropped earlier in training than for deeper layers, demonstrating that, like CNNs, RNNs also converge in a bottom-up pattern (see figure below).
Convergence dynamics for RNNs over the course of training exhibit bottom-up convergence, as layers closer to the input converge to their final representations earlier in training than later layers. For example, layer 1 converges to its final representation earlier in training than layer 2, which in turn converges earlier than layer 3, and so on. Epoch designates the number of times the model has seen the entire training set, while different colors represent the convergence dynamics of different layers.
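The convergence measurement can be sketched as a loop over saved checkpoints. This is an illustration under stated assumptions, not the experiment from the paper: the activations are synthetic, the decay rates that make "layer 1" settle faster than "layer 2" are chosen by hand, and the distance is plain unweighted CCA (one minus the mean canonical correlation). The names `cca_distance` and `convergence_curve` are hypothetical.

```python
import numpy as np

def cca_distance(x, y):
    """1 minus the mean canonical correlation between (n, d) activations."""
    qx, _ = np.linalg.qr(x - x.mean(axis=0))
    qy, _ = np.linalg.qr(y - y.mean(axis=0))
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return 1.0 - float(np.clip(rho, 0.0, 1.0).mean())

def convergence_curve(checkpoints):
    """Distance from each saved checkpoint of one layer to its final one."""
    final = checkpoints[-1]
    return [cca_distance(acts, final) for acts in checkpoints]

# Synthetic stand-in for saved activations: each layer approaches its final
# representation, and the layer nearer the input settles faster (assumed).
rng = np.random.default_rng(0)
n_examples, width, n_epochs = 500, 4, 6
curves = {}
for layer, rate in [("layer 1", 2.0), ("layer 2", 0.5)]:
    final = rng.normal(size=(n_examples, width))
    checkpoints = [
        final + 3.0 * np.exp(-rate * epoch) * rng.normal(size=(n_examples, width))
        for epoch in range(n_epochs - 1)
    ] + [final]
    curves[layer] = convergence_curve(checkpoints)

for layer, curve in curves.items():
    print(layer, [round(d, 3) for d in curve])
```

In a real experiment the checkpoints would be activations recorded on a fixed set of inputs at the end of each epoch; a bottom-up pattern shows up as the curve for the earlier layer dropping toward zero sooner.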
Additional findings in our paper show that wider networks (e.g., networks with more neurons at each layer) converge to more similar solutions than narrow networks. We also found that trained networks with identical structures but different learning rates converge to distinct clusters with similar performance, but highly dissimilar representations. We also apply CCA to RNN dynamics over the course of a single sequence, rather than simply over the course of training, providing some initial insights into the various factors which influence RNN representations over time.

Conclusions
These findings reinforce the utility of analyzing and comparing DNN representations in order to provide insights into network function, generalization, and convergence. However, there are still many open questions: in future work, we hope to uncover which aspects of the representation are conserved across networks, both in CNNs and RNNs, and whether these insights can be used to improve network performance. We encourage others to try out the code used for the paper to investigate what CCA can tell us about other neural networks!

Acknowledgements
Special thanks to Samy Bengio, who is a co-author on this work. We also thank Martin Wattenberg, Jascha Sohl-Dickstein and Jon Kleinberg for helpful comments.

Source: Google AI Blog


Google at CVPR 2018

Posted by Christian Howard, Editor-in-Chief, Google AI Communications

This week, Salt Lake City hosts the 2018 Conference on Computer Vision and Pattern Recognition (CVPR 2018), the premier annual computer vision event comprising the main conference and several co-located workshops and tutorials. As a leader in computer vision research and a Diamond Sponsor, Google will have a strong presence at CVPR 2018 — over 200 Googlers will be in attendance to present papers and invited talks at the conference, and to organize and participate in multiple workshops.

If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively pursuing the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind portrait mode on the Pixel 2 and Pixel 2 XL smartphones, the Open Images V4 dataset and much more.

You can learn more about our research being presented at CVPR 2018 in the list below.

Organization
Finance Chair: Ramin Zabih

Area Chairs include: Sameer Agarwal, Aseem Agarwala, Jon Barron, Abhinav Shrivastava, Carl Vondrick, Ming-Hsuan Yang

Orals/Spotlights
Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee

DoubleFusion: Real-time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor
Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, Yebin Liu

Neural Kinematic Networks for Unsupervised Motion Retargetting
Ruben Villegas, Jimei Yang, Duygu Ceylan, Honglak Lee

Burst Denoising with Kernel Prediction Networks
Ben Mildenhall, Jiawen Chen, Jonathan Barron, Robert Carroll, Dillon Sharlet, Ren Ng

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Matthew Tang, Menglong Zhu, Andrew Howard, Dmitry Kalenichenko, Hartwig Adam

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Focal Visual-Text Attention for Visual Question Answering
Junwei Liang, Lu Jiang, Liangliang Cao, Li-Jia Li, Alexander G. Hauptmann

Inferring Light Fields from Shadows
Manel Baradad, Vickie Ye, Adam Yedidia, Fredo Durand, William Freeman, Gregory Wornell, Antonio Torralba

Modifying Non-Local Variations Across Multiple Views
Tal Tlusty, Tomer Michaeli, Tali Dekel, Lihi Zelnik-Manor

Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen, Li-Jia Li, Fei-Fei Li, Abhinav Gupta

Unsupervised Training for 3D Morphable Model Regression
Kyle Genova, Forrester Cole, Aaron Maschinot, Daniel Vlasic, Aaron Sarna, William Freeman

Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc Le

The iNaturalist Species Classification and Detection Dataset
Grant van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie

Learning Intrinsic Image Decomposition from Watching the World
Zhengqi Li, Noah Snavely

Learning Intelligent Dialogs for Bounding Box Annotation
Ksenia Konyushkova, Jasper Uijlings, Christoph Lampert, Vittorio Ferrari

Posters
Revisiting Knowledge Transfer for Training Object Class Detectors
Jasper Uijlings, Stefan Popov, Vittorio Ferrari

Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David Ross, Jia Deng, Rahul Sukthankar

Hierarchical Novelty Detection for Visual Object Recognition
Kibok Lee, Kimin Lee, Kyle Min, Yuting Zhang, Jinwoo Shin, Honglak Lee

COCO-Stuff: Thing and Stuff Classes in Context
Holger Caesar, Jasper Uijlings, Vittorio Ferrari

Appearance-and-Relation Networks for Video Classification
Limin Wang, Wei Li, Wen Li, Luc Van Gool

MorphNet: Fast & Simple Resource-Constrained Structure Learning of Deep Networks
Ariel Gordon, Elad Eban, Bo Chen, Ofir Nachum, Tien-Ju Yang, Edward Choi

Deformable Shape Completion with Graph Convolutional Autoencoders
Or Litany, Alex Bronstein, Michael Bronstein, Ameesh Makadia

MegaDepth: Learning Single-View Depth Prediction from Internet Photos
Zhengqi Li, Noah Snavely

Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee

Burst Denoising with Kernel Prediction Networks
Ben Mildenhall, Jiawen Chen, Jonathan Barron, Robert Carroll, Dillon Sharlet, Ren Ng

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Matthew Tang, Menglong Zhu, Andrew Howard, Dmitry Kalenichenko, Hartwig Adam

Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling
Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Tianfan Xue, Joshua Tenenbaum, William Freeman

Sparse, Smart Contours to Represent and Edit Images
Tali Dekel, Dilip Krishnan, Chuang Gan, Ce Liu, William Freeman

MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features
Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, Hartwig Adam

Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning
Yin Cui, Yang Song, Chen Sun, Andrew Howard, Serge Belongie

Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks
Nick Johnston, Damien Vincent, David Minnen, Michele Covell, Saurabh Singh, Sung Jin Hwang, George Toderici, Troy Chinen, Joel Shor

MobileNetV2: Inverted Residuals and Linear Bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen

ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans 
Angela Dai, Daniel Ritchie, Martin Bokeloh, Scott Reed, Juergen Sturm, Matthias Nießner

Sim2Real View Invariant Visual Servoing by Recurrent Control
Fereshteh Sadeghi, Alexander Toshev, Eric Jang, Sergey Levine

Alternating-Stereo VINS: Observability Analysis and Performance Evaluation
Mrinal Kanti Paul, Stergios Roumeliotis

Soccer on Your Tabletop
Konstantinos Rematas, Ira Kemelmacher, Brian Curless, Steve Seitz

Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints
Reza Mahjourian, Martin Wicke, Anelia Angelova

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Chunhui Gu, Chen Sun, David Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Inferring Light Fields from Shadows
Manel Baradad, Vickie Ye, Adam Yedida, Fredo Durand, William Freeman, Gregory Wornell, Antonio Torralba

Modifying Non-Local Variations Across Multiple Views
Tal Tlusty, Tomer Michaeli, Tali Dekel, Lihi Zelnik-Manor

Aperture Supervision for Monocular Depth Estimation
Pratul Srinivasan, Rahul Garg, Neal Wadhwa, Ren Ng, Jonathan Barron

Instance Embedding Transfer to Unsupervised Video Object Segmentation
Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, C.-C. Jay Kuo

Frame-Recurrent Video Super-Resolution
Mehdi S. M. Sajjadi, Raviteja Vemulapalli, Matthew Brown

Weakly Supervised Action Localization by Sparse Temporal Pooling Network
Phuc Nguyen, Ting Liu, Gautam Prasad, Bohyung Han

Iterative Visual Reasoning Beyond Convolutions
Xinlei Chen, Li-jia Li, Fei-Fei Li, Abhinav Gupta

Learning and Using the Arrow of Time
Donglai Wei, Andrew Zisserman, William Freeman, Joseph Lim

HydraNets: Specialized Dynamic Architectures for Efficient Inference
Ravi Teja Mullapudi, Noam Shazeer, William Mark, Kayvon Fatahalian

Thoracic Disease Identification and Localization with Limited Supervision
Zhe Li, Chong Wang, Mei Han, Yuan Xue, Wei Wei, Li-jia Li, Fei-Fei Li

Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis
Seunghoon Hong, Dingdong Yang, Jongwook Choi, Honglak Lee

Deep Semantic Face Deblurring
Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, Ming-Hsuan Yang

Unsupervised Training for 3D Morphable Model Regression
Kyle Genova, Forrester Cole, Aaron Maschinot, Daniel Vlasic, Aaron Sarna, William Freeman

Learning Transferable Architectures for Scalable Image Recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc Le

Learning Intrinsic Image Decomposition from Watching the World
Zhengqi Li, Noah Snavely

PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection
Nian Liu, Junwei Han, Ming-Hsuan Yang

Tutorials
Computer Vision for Robotics and Driving
Anelia Angelova, Sanja Fidler

Unsupervised Visual Learning
Pierre Sermanet, Anelia Angelova

UltraFast 3D Sensing, Reconstruction and Understanding of People, Objects and Environments
Sean Fanello, Julien Valentin, Jonathan Taylor, Christoph Rhemann, Adarsh Kowdle, Jürgen Sturm, Christine Kaeser-Chen, Pavel Pidlypenskyi, Rohit Pandey, Andrea Tagliasacchi, Sameh Khamis, David Kim, Mingsong Dou, Kaiwen Guo, Danhang Tang, Shahram Izadi

Generative Adversarial Networks
Jun-Yan Zhu, Taesung Park, Mihaela Rosca, Phillip Isola, Ian Goodfellow

Source: Google AI Blog


Google at NAACL



This week, New Orleans, LA hosted the conference of the North American Chapter of the Association for Computational Linguistics (NAACL), a venue for the latest research on computational approaches to understanding natural language. Google once again had a strong presence, presenting our research on a diverse set of topics, including dialog, summarization, machine translation, and linguistic analysis. In addition to contributing publications, Googlers were also involved as committee members, workshop organizers and panelists, and presented one of the conference keynotes. We also provided telepresence robots, which enabled researchers who couldn’t attend in person to present their work remotely at the Widening Natural Language Processing Workshop (WiNLP).
Googler Margaret Mitchell and a researcher using our telepresence robots to remotely present their work at the WiNLP workshop.
This year NAACL also introduced a new Test of Time Award recognizing influential papers published between 2002 and 2012. We are happy and honored to recognize that all three papers receiving the award (listed below with a short summary) were co-authored by researchers who are now at Google (in blue):

BLEU: a Method for Automatic Evaluation of Machine Translation (2002)
Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
Before the introduction of the BLEU metric, comparing Machine Translation (MT) models required expensive human evaluation. While human evaluation is still the gold standard, the strong correlation of BLEU with human judgment has permitted much faster experiment cycles. BLEU has been a reliable measure of progress, persisting through multiple paradigm shifts in MT.

Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms (2002)
Michael Collins
The structured perceptron is a generalization of the classical perceptron to structured prediction problems, where the number of possible "labels" for each input is a very large set, and each label has rich internal structure. Canonical examples are speech recognition, machine translation, and syntactic parsing. The structured perceptron was one of the first algorithms proposed for structured prediction, and has been shown to be effective in spite of its simplicity.

Thumbs up?: Sentiment Classification using Machine Learning Techniques (2002)
Bo Pang, Lillian Lee, Shivakumar Vaithyanathan
This paper is amongst the first works in sentiment analysis and helped define the subfield of sentiment and opinion analysis and review mining. The paper introduced a new way to look at document classification, developed the first solutions to it using supervised machine learning methods, and discussed insights and challenges. This paper also had significant data impact -- the movie review dataset has supported much of the early work in this area and is still one of the commonly used benchmark evaluation datasets.

If you attended NAACL 2018, we hope that you stopped by the booth to check out some demos, meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. You can learn more about Google research presented at NAACL 2018 below (Googlers highlighted in blue), and visit the Google AI Language Team page.

Keynote
Google Assistant or My Assistant? Towards Personalized Situated Conversational Agents
Dilek Hakkani-Tür

Publications
Bootstrapping a Neural Conversational Agent with Dialogue Self-Play, Crowdsourcing and On-Line Reinforcement Learning
Pararth Shah, Dilek Hakkani-Tür, Bing Liu, Gokhan Tür

SHAPED: Shared-Private Encoder-Decoder for Text Style Adaptation
Ye Zhang, Nan Ding, Radu Soricut

Olive Oil is Made of Olives, Baby Oil is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model
Vered Schwartz, Chris Waterson

Are All Languages Equally Hard to Language-Model?
Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, Brian Roark

Self-Attention with Relative Position Representations
Peter Shaw, Jakob Uszkoreit, Ashish Vaswani

Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems
Bing Liu, Gokhan Tür, Dilek Hakkani-Tür, Pararth Shah, Larry Heck

Workshops
Subword & Character Level Models in NLP
Organizers: Manaal Faruqui, Hinrich Schütze, Isabel Trancoso, Yulia Tsvetkov, Yadollah Yaghoobzadeh

Storytelling Workshop
Organizers: Margaret Mitchell, Ishan Misra, Ting-Hao 'Kenneth' Huang, Frank Ferraro

Ethics in NLP
Organizers: Michael Strube, Dirk Hovy, Margaret Mitchell, Mark Alfano

NAACL HLT Panels
Careers in Industry
Participants: Philip Resnik (moderator), Jason Baldridge, Laura Chiticariu, Marie Mateer, Dan Roth

Ethics in NLP
Participants: Dirk Hovy (moderator), Margaret Mitchell, Vinodkumar Prabhakaran, Mark Yatskar, Barbara Plank

Source: Google AI Blog


Realtime tSNE Visualizations with TensorFlow.js



In recent years, the t-distributed Stochastic Neighbor Embedding (tSNE) algorithm has become one of the most used and insightful techniques for exploratory data analysis of high-dimensional data. Used to interpret deep neural network outputs in tools such as the TensorFlow Embedding Projector and TensorBoard, a powerful feature of tSNE is that it reveals clusters of high-dimensional data points at different scales while requiring only minimal tuning of its parameters. Despite these advantages, the computational complexity of the tSNE algorithm limits its application to relatively small datasets. While several evolutions of tSNE have been developed to address this issue (mainly focusing on the scalability of the similarity computations between data points), they have so far not been enough to provide a truly interactive experience when visualizing the evolution of the tSNE embedding for large datasets.

In “Linear tSNE Optimization for the Web”, we present a novel approach to tSNE that heavily relies on modern graphics hardware. Given the linear complexity of the new approach, our method generates embeddings faster than comparable techniques and can even be executed on the client side in a web browser by leveraging GPU capabilities through WebGL. The combination of these two factors allows for real-time interactive visualization of large, high-dimensional datasets. Furthermore, we are releasing this work as an open source library in the TensorFlow.js family in the hopes that the broader research community finds it useful.
Real-time evolution of the tSNE embedding for the complete MNIST dataset with our technique. The dataset contains images of 60,000 handwritten digits. You can find a live demo here.
The aim of tSNE is to cluster small “neighborhoods” of similar data points while also reducing the overall dimensionality of the data so it is more easily visualized. In other words, the tSNE objective function measures how well these neighborhoods of similar data are preserved in the 2 or 3-dimensional space, and arranges them into clusters accordingly.

In previous work, the minimization of the tSNE objective was performed as an N-body simulation problem, in which points are randomly placed in the embedding space and two different types of forces are applied on each point. Attractive forces bring the points closer to the points that are most similar in the high-dimensional space, while repulsive forces push them away from all the neighbors in the embedding.

While the attractive forces act on a small subset of points (i.e., similar neighbors), repulsive forces are in effect between all pairs of points. Because of this, tSNE requires significant computation and many iterations to minimize the objective function, which limits the possible dataset size to just a few hundred data points. To improve over a brute force solution, the Barnes-Hut algorithm was used to approximate the repulsive forces and the gradient of the objective function. This allows scaling of the computation to tens of thousands of data points, but it requires more than 15 minutes to compute the MNIST embedding in a C++ implementation.
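To make the cost of the brute force solution concrete, here is a minimal sketch of the two force terms in plain Python. It assumes the standard tSNE formulation with a Student-t kernel in the embedding; the function names `tsne_forces` and `tsne_gradient` are hypothetical and are not taken from any released library.

```python
def tsne_forces(P, Y):
    """Brute-force attractive and repulsive terms for a 2D embedding.

    P: n x n symmetric list of lists of high-dimensional similarities.
    Y: list of n (x, y) embedding positions.
    The cost is O(n^2) because every pair of points contributes to the repulsion.
    """
    n = len(Y)
    # Student-t kernel weights w_ij = 1 / (1 + ||y_i - y_j||^2)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dx = Y[i][0] - Y[j][0]
                dy = Y[i][1] - Y[j][1]
                w[i][j] = 1.0 / (1.0 + dx * dx + dy * dy)
    Z = sum(sum(row) for row in w)  # normalization over all pairs

    attractive = [[0.0, 0.0] for _ in range(n)]
    repulsive = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = Y[i][0] - Y[j][0]
            dy = Y[i][1] - Y[j][1]
            q_ij = w[i][j] / Z  # low-dimensional similarity
            attractive[i][0] += P[i][j] * w[i][j] * dx
            attractive[i][1] += P[i][j] * w[i][j] * dy
            repulsive[i][0] += q_ij * w[i][j] * dx
            repulsive[i][1] += q_ij * w[i][j] * dy
    return attractive, repulsive


def tsne_gradient(P, Y):
    """Gradient of the tSNE objective: 4 * (attractive - repulsive)."""
    attractive, repulsive = tsne_forces(P, Y)
    return [[4.0 * (a[0] - r[0]), 4.0 * (a[1] - r[1])]
            for a, r in zip(attractive, repulsive)]
```

In practice P is sparse (only similar neighbors have non-zero entries), so the attractive term is cheap; it is the repulsive double loop that dominates the running time.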

In our paper, we propose a solution to this scaling problem by approximating the gradient of the objective function using textures that are generated in WebGL. Our technique draws a “repulsive field” at every minimization iteration using a three-channel texture, with the three components treated as colors and drawn in the RGB channels. For every point, the field encodes the horizontal and vertical repulsive force created by the point, plus a third component used for normalization. Intuitively, the normalization term ensures that the magnitude of the shifts matches the similarity measure in the high-dimensional space. In addition, the resolution of the texture is adaptively changed to keep the number of pixels drawn constant.
Rendering of the three functions used to approximate the repulsive effect created by a single point. In the above figure, a point in a blue area is pushed to the left/bottom, a point in a red area is pushed to the right/top, and a point in a white region does not move.
The contribution of every point is then added on the GPU, resulting in a texture, similar to those presented in the GIF below, that approximates the repulsive fields. This innovative repulsive field approach turns out to be much more GPU friendly than the more commonly used calculation of point-to-point interactions, because the repulsion for multiple points can be computed at once and very quickly on the GPU. In addition, we implemented the computation of the attraction between points on the GPU.
This animation shows the evolution of the tSNE embedding (upper left) and of the scalar fields used to approximate its gradient with normalization term (upper right), horizontal shift (bottom left) and vertical shift (bottom right).
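The field idea can be sketched on the CPU with a plain grid standing in for the WebGL texture. This is an illustration under stated assumptions, not the released implementation: the per-point kernels (a Student-t term for normalization and its squared form scaled by the offset for the two shifts) follow the standard tSNE gradient, the bilinear lookup stands in for GPU texture sampling, and the names `draw_fields`, `sample` and `repulsive_forces` are hypothetical.

```python
def draw_fields(Y, lo, hi, res):
    """Accumulate every point's contribution into a res x res, 3-channel grid.

    Channels per texel at position p, for each embedding point y:
      S  += 1 / (1 + |p - y|^2)            (normalization channel)
      V  += (p - y) / (1 + |p - y|^2)^2    (horizontal and vertical shift)
    """
    step = (hi - lo) / (res - 1)
    grid = [[[0.0, 0.0, 0.0] for _ in range(res)] for _ in range(res)]
    for yx, yy in Y:
        for r in range(res):
            py = lo + r * step
            for c in range(res):
                px = lo + c * step
                dx, dy = px - yx, py - yy
                t = 1.0 / (1.0 + dx * dx + dy * dy)
                texel = grid[r][c]
                texel[0] += t
                texel[1] += t * t * dx
                texel[2] += t * t * dy
    return grid


def sample(grid, lo, hi, x, y):
    """Bilinear lookup of the three channels at position (x, y)."""
    res = len(grid)
    step = (hi - lo) / (res - 1)
    fx, fy = (x - lo) / step, (y - lo) / step
    c0 = min(max(int(fx), 0), res - 2)
    r0 = min(max(int(fy), 0), res - 2)
    tx, ty = fx - c0, fy - r0
    out = []
    for k in range(3):
        row0 = grid[r0][c0][k] * (1 - tx) + grid[r0][c0 + 1][k] * tx
        row1 = grid[r0 + 1][c0][k] * (1 - tx) + grid[r0 + 1][c0 + 1][k] * tx
        out.append(row0 * (1 - ty) + row1 * ty)
    return out


def repulsive_forces(Y, res=64, margin=1.0):
    """Approximate the repulsive term for every point from the sampled fields."""
    lo = min(min(p) for p in Y) - margin
    hi = max(max(p) for p in Y) + margin
    grid = draw_fields(Y, lo, hi, res)
    samples = [sample(grid, lo, hi, x, y) for x, y in Y]
    # Each point contributes exactly 1 to S at its own position, so remove it.
    Z = sum(s[0] - 1.0 for s in samples)
    return [(s[1] / Z, s[2] / Z) for s in samples]
```

Once the fields are drawn, each point needs only a single lookup, regardless of how many other points there are. The nested loops in `draw_fields` are the part that a GPU would execute in parallel when rendering the texture.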
We additionally revised the update of the embedding from an ad-hoc implementation to a series of standard tensor operations that are computed in TensorFlow.js, a JavaScript library to perform tensor computations in the web browser. Our approach, which is released as an open source library in the TensorFlow.js family, allows us to compute the evolution of the tSNE embedding entirely on the GPU while having better computational complexity.

With this implementation, what used to take 15 minutes to calculate (on the MNIST dataset) can now be visualized in real-time and in the web browser. Furthermore, this allows real-time visualizations of much larger datasets, a feature that is particularly useful when deep neural network output is analyzed. One main limitation of our work is that this technique currently only works for 2D embeddings. However, 2D visualizations are often preferred over 3D ones, as 3D visualizations require more interaction to effectively understand cluster results.

Future Work
We believe that having a fast and interactive tSNE implementation that runs in the browser will empower developers of data analytics systems. We are particularly interested in exploring how our implementation can be used for the interpretation of deep neural networks. Additionally, our implementation shows how lateral thinking in using GPU computations (approximating the gradient using RGB texture) can be used to significantly speed up algorithmic computations. In the future we will be exploring how this kind of gradient approximation can be applied not only to speed up other dimensionality reduction algorithms, but also to implement other N-body simulations in the web browser using TensorFlow.js.

Acknowledgements
We would like to thank Alexander Mordvintsev, Yannick Assogba, Matt Sharifi, Anna Vilanova, Elmar Eisemann, Nikhil Thorat, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Alessio Bazzica, Boudewijn Lelieveldt, Thomas Höllt, Baldur van Lew, Julian Thijssen and Marvin Ritter.

Source: Google AI Blog


Announcing an updated YouTube-8M, and the 2nd YouTube-8M Large-Scale Video Understanding Challenge and Workshop



Last year, we organized the first YouTube-8M Large-Scale Video Understanding Challenge with Kaggle, in which 742 teams consisting of 946 individuals from 60 countries used the YouTube-8M dataset (2017 edition) to develop classification algorithms which accurately assign video-level labels. The purpose of the competition was to accelerate improvements in large-scale video understanding, representation learning, noisy data modeling, transfer learning and domain adaptation approaches that can help improve the machine-learning models that classify video. In addition to the competition, we hosted an affiliated workshop at CVPR’17, inviting competition top-performers and researchers to share their ideas on how to advance the state-of-the-art in video understanding.

As a continuation of these efforts to accelerate video understanding, we are excited to announce another update to the YouTube-8M dataset, a new Kaggle video understanding challenge and an affiliated 2nd Workshop on YouTube-8M Large-Scale Video Understanding, to be held at the 2018 European Conference on Computer Vision (ECCV'18).
An Updated YouTube-8M Dataset (2018 Edition)
Our YouTube-8M (2018 edition) features a major improvement in the quality of annotations, obtained using a machine learning system that combines audio-visual content with title, description and other metadata to provide more accurate ground truth annotations. The updated version contains 6.1 million URLs, labeled with a vocabulary of 3,862 visual entities, with each video annotated with one or more labels and an average of 3 labels per video. We have also updated the starter code, with updated instructions for downloading and training TensorFlow video annotation models on the dataset.

The 2nd YouTube-8M Video Understanding Challenge
The 2nd YouTube-8M Video Understanding Challenge invites participants to build audio-visual content classification models using YouTube-8M as training data, and then to label an unknown subset of test videos. Unlike last year, we strictly impose a hard limit on model size, encouraging participants to advance a single model within a tight budget rather than assembling as many models as possible. Each of the top 5 teams will be awarded $5,000 to support their travel to Munich to attend ECCV’18. For details, please visit the Kaggle competition page.

The 2nd Workshop on YouTube-8M Large-Scale Video Understanding
To be held at ECCV’18, the workshop will consist of invited talks by distinguished researchers, as well as presentations by top-performing challenge participants in order to facilitate the exchange of ideas. We encourage those who wish to attend to submit papers describing their research, experiments, or applications based on the YouTube-8M dataset, including papers summarizing their participation in the challenge above. Please refer to the workshop page for more details.

It is our hope that this update to the dataset, along with the new challenge and workshop, will continue to advance the research in large-scale video understanding. We hope you will join us again!

Acknowledgements
This post reflects the work of many machine perception researchers including Sami Abu-El-Haija, Ke Chen, Nisarg Kothari, Joonseok Lee, Hanhan Li, Paul Natsev, Sobhan Naderi Parizi, Rahul Sukthankar, George Toderici, Balakrishnan Varadarajan, as well as Sohier Dane, Julia Elliott, Wendy Kan and Walter Reade from Kaggle. We are also grateful for the support and advice from our partners at YouTube.

Source: Google AI Blog


Improving Deep Learning Performance with AutoAugment



The success of deep learning in computer vision can be partially attributed to the availability of large amounts of labeled training data: a model’s performance typically improves as you increase the quality, diversity and the amount of training data. However, collecting enough quality data to train a model to perform well is often prohibitively difficult. One way around this is to hardcode image symmetries into neural network architectures so they perform better, or to have experts manually design data augmentation methods, like rotation and flipping, that are commonly used to train well-performing vision models. However, until recently, less attention has been paid to finding ways to automatically augment existing data using machine learning. Inspired by the results of our AutoML efforts to design neural network architectures and optimizers to replace components of systems that were previously human designed, we asked ourselves: can we also automate the procedure of data augmentation?

In “AutoAugment: Learning Augmentation Policies from Data”, we explore a reinforcement learning algorithm which increases both the amount and diversity of data in an existing training dataset. Intuitively, data augmentation is used to teach a model about image invariances in the data domain in a way that makes a neural network invariant to these important symmetries, thus improving its performance. Unlike previous state-of-the-art deep learning models that used hand-designed data augmentation policies, we used reinforcement learning to find the optimal image transformation policies from the data itself. The result is improved performance of computer vision models without relying on the production of new and ever expanding datasets.

Augmenting Training Data
The idea behind data augmentation is simple: images have many symmetries that don’t change the information present in the image. For example, the mirror reflection of a dog is still a dog. While some of these “invariances” are obvious to humans, many are not. For example, the mixup method augments data by placing images on top of each other during training, resulting in data which improves neural network performance.
Left: An original image from the ImageNet dataset. Right: The same image transformed by a commonly used data augmentation transformation, a horizontal flip about the center.
AutoAugment is an automatic way to design custom data augmentation policies for computer vision datasets, e.g., guiding the selection of basic image transformation operations, such as flipping an image horizontally/vertically, rotating an image, changing the color of an image, etc. AutoAugment not only predicts what image transformations to combine, but also the per-image probability and magnitude of the transformation used, so that the image is not always manipulated in the same way. AutoAugment is able to select an optimal policy from a search space of 2.9 x 10^32 image transformation possibilities.
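As a rough illustration of what such a policy looks like when it is applied, the sketch below represents each transformation as an (operation, probability, magnitude) triple. The grouping of triples into sub-policies, the three toy operations on grayscale pixel lists, and all names (`OPS`, `apply_sub_policy`, `augment`) are assumptions made for this example; they are not the operations or the code used in the paper.

```python
import random


def flip_horizontal(img, magnitude):
    """Mirror each row; magnitude is unused for this operation."""
    return [row[::-1] for row in img]


def invert(img, magnitude):
    """Invert 8-bit pixel values; magnitude is unused for this operation."""
    return [[255 - v for v in row] for row in img]


def translate_x(img, magnitude):
    """Shift right by `magnitude` pixels, filling vacated pixels with 0."""
    width = len(img[0])
    shift = min(int(magnitude), width)
    return [[0] * shift + row[:width - shift] for row in img]


# Hypothetical operation table; a real search space would contain more operations.
OPS = {
    "flip_horizontal": flip_horizontal,
    "invert": invert,
    "translate_x": translate_x,
}


def apply_sub_policy(img, sub_policy, rng):
    """Apply each (name, probability, magnitude) triple in order.

    Every operation fires independently with its own probability, so the same
    image is not always manipulated in the same way.
    """
    for name, prob, magnitude in sub_policy:
        if rng.random() < prob:
            img = OPS[name](img, magnitude)
    return img


def augment(img, policy, rng):
    """Pick one sub-policy at random for this image and apply it."""
    return apply_sub_policy(img, rng.choice(policy), rng)
```

A call such as `augment(image, [[("translate_x", 0.8, 1), ("invert", 0.3, 0)]], random.Random(0))` would then produce a differently transformed copy of the image on different draws.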

AutoAugment learns different transformations depending on what dataset it is run on. For example, for images involving street view of house numbers (SVHN) which include natural scene images of digits, AutoAugment focuses on geometric transforms like shearing and translation, which represent distortions commonly observed in this dataset. In addition, AutoAugment has learned to completely invert colors which naturally occur in the original SVHN dataset, given the diversity of different building and house number materials in the world.
Left: An original image from the SVHN dataset. Right: The same image transformed by AutoAugment. In this case, the optimal transformation was a result of shearing the image and inverting the colors of the pixels.
On CIFAR-10 and ImageNet, AutoAugment does not use shearing because these datasets generally do not include images of sheared objects, nor does it invert colors completely as these transformations would lead to unrealistic images. Instead, AutoAugment focuses on slightly adjusting the color and hue distribution, while preserving the general color properties. This suggests that the actual colors of objects in CIFAR-10 and ImageNet are important, whereas on SVHN only the relative colors are important.


Left: An original image from the ImageNet dataset. Right: The same image transformed by the AutoAugment policy. First, the image contrast is maximized, after which the image is rotated.
Results
Our AutoAugment algorithm found augmentation policies for some of the most well-known computer vision datasets that, when incorporated into the training of the neural network, led to state-of-the-art accuracies. By augmenting ImageNet data we obtain a new state-of-the-art top-1 accuracy of 83.54%, and on CIFAR-10 we achieve an error rate of 1.48%, which is a 0.83% improvement over the default data augmentation designed by scientists. On SVHN, we improved the state-of-the-art error from 1.30% to 1.02%. Importantly, AutoAugment policies are found to be transferable: the policy found for the ImageNet dataset could also be applied to other vision datasets (Stanford Cars, FGVC-Aircraft, etc.), which in turn improves neural network performance.

We are pleased to see that our AutoAugment algorithm achieved this level of performance on many different competitive computer vision datasets and look forward to seeing future applications of this technology across more computer vision tasks and even in other domains such as audio processing or language models. The policies with the best performance are included in the appendix of the paper, so that researchers can use them to improve their models on relevant vision tasks.

Acknowledgements
Special thanks to the co-authors of the paper Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. We’d also like to thank Alok Aggarwal, Gabriel Bender, Yanping Huang, Pieter-Jan Kindermans, Simon Kornblith, Augustus Odena, Avital Oliver, and Colin Raffel for their help with this project.

Source: Google AI Blog


Advances in Semantic Textual Similarity



The recent rapid progress of neural network-based natural language understanding research, especially on learning semantic text representations, can enable truly novel products such as Smart Compose and Talk to Books. It can also help improve performance on a variety of natural language tasks which have limited amounts of training data, such as building strong text classifiers from as few as 100 labeled examples.

Below, we discuss two papers reporting recent progress on semantic representation research at Google, as well as two new models available for download on TensorFlow Hub that we hope developers will use to build new and exciting applications.

Semantic Textual Similarity
In “Learning Semantic Textual Similarity from Conversations”, we introduce a new way to learn sentence representations for semantic textual similarity. The intuition is that sentences are semantically similar if they have a similar distribution of responses. For example, “How old are you?” and “What is your age?” are both questions about age, which can be answered by similar responses such as “I am 20 years old”. In contrast, while “How are you?” and “How old are you?” contain almost identical words, they have very different meanings and lead to different responses.
Sentences are semantically similar if they can be answered by the same responses. Otherwise, they are semantically different.
In this work, we aim to learn semantic similarity by way of a response classification task: given a conversational input, we wish to classify the correct response from a batch of randomly selected responses. However, the ultimate goal is to learn a model that can return encodings representing a variety of natural language relationships, including similarity and relatedness. By adding another prediction task (in this case, the SNLI entailment dataset) and forcing both through shared encoding layers, we get even better performance on similarity measures such as the STSBenchmark (a sentence similarity benchmark) and CQA task B (a question/question similarity task). This is because logical entailment is quite different from simple equivalence and provides more signal for learning complex semantic representations.
For a given input, classification is considered a ranking problem against potential candidates.
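The response classification task can be sketched as a softmax over the other responses in a batch. The sketch below assumes that inputs and responses have already been encoded as vectors and that candidates are scored by a dot product; both choices, and the names `response_ranking_loss` and `rank_responses`, are assumptions for illustration rather than details taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def response_ranking_loss(input_vecs, response_vecs):
    """Mean cross-entropy of picking the true response within the batch.

    Row i of response_vecs is the true response to row i of input_vecs;
    every other row in the batch acts as a randomly selected negative.
    """
    losses = []
    for i, u in enumerate(input_vecs):
        scores = [dot(u, v) for v in response_vecs]
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


def rank_responses(input_vec, response_vecs):
    """Return candidate indices ordered from best to worst score."""
    scores = [dot(input_vec, v) for v in response_vecs]
    return sorted(range(len(response_vecs)), key=lambda j: -scores[j])
```

Minimizing this loss pushes the encoding of an input towards the encoding of its own response and away from the other responses in the batch, which is what lets inputs with similar response distributions end up with similar encodings.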
Universal Sentence Encoder
In “Universal Sentence Encoder”, we introduce a model that extends the multitask training described above by adding more tasks, jointly training them with a skip-thought-like model that predicts sentences surrounding a given selection of text. However, instead of the encoder-decoder architecture in the original skip-thought model, we make use of an encode-only architecture by way of a shared encoder to drive the prediction tasks. In this way, training time is greatly reduced while preserving the performance on a variety of transfer tasks including sentiment and semantic similarity classification. The aim is to provide a single encoder that can support as wide a variety of applications as possible, including paraphrase detection, relatedness, clustering and custom text classification.
Pairwise semantic similarity comparison via outputs from TensorFlow Hub Universal Sentence Encoder.
As described in our paper, one version of the Universal Sentence Encoder model uses a deep averaging network (DAN) encoder, while a second version uses a more complicated self-attention network architecture, Transformer.
Multi-task training as described in “Universal Sentence Encoder”. A variety of tasks and task structures are joined by shared encoder layers/parameters (grey boxes).
With the more complicated architecture, the model performs better than the simpler DAN model on a variety of sentiment and similarity classification tasks, and for short sentences is only moderately slower. However, compute time for the model using Transformer increases noticeably as sentence length increases, whereas the compute time for the DAN model stays nearly constant as sentence length is increased.
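For intuition about the simpler of the two encoders, here is a toy sketch of a DAN-style encoder: word embeddings are averaged and the average is passed through feedforward layers, after which two sentences can be compared by cosine similarity. The embedding table, the layer weights, the tanh activation and all function names are made up for this example; the released models are trained networks and are not reproduced here.

```python
import math


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dense_tanh(x, weights, bias):
    """One feedforward layer: tanh(W x + b)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def dan_encode(tokens, embeddings, layers):
    """Average the word embeddings, then apply the feedforward layers."""
    x = average([embeddings[t] for t in tokens])
    for weights, bias in layers:
        x = dense_tanh(x, weights, bias)
    return x


def cosine_similarity(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den
```

Because the words are averaged before any further processing, the encoder ignores word order, and the feedforward layers operate on a single fixed-size vector however long the sentence is.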

New Models
In addition to the Universal Sentence Encoder model described above, we are also sharing two new models on TensorFlow Hub: the Universal Sentence Encoder - Large and Universal Sentence Encoder - Lite. These are pretrained TensorFlow models that return a semantic encoding for variable-length text inputs. The encodings can be used for semantic similarity measurement, relatedness, classification, or clustering of natural language text.
  • The Large model is trained with the Transformer encoder described in our second paper. It targets scenarios requiring high precision semantic representations and the best model performance at the cost of speed & size.
  • The Lite model is trained on a SentencePiece vocabulary instead of words in order to significantly reduce the vocabulary size, which is a major contributor to model size. It targets scenarios where resources like memory and CPU are limited, such as on-device or browser-based implementations.
We're excited to share this research, and these models, with the community. We believe that what we're showing here is just the beginning, and that there remain important research problems to be addressed, such as extending the techniques to more languages (the models discussed above currently support English). We also hope to further develop this technology so it can understand text at the paragraph or even document level. In achieving these tasks, it may be possible to make an encoder that is truly “universal”.

Acknowledgements
We thank Daniel Cer, Mario Guajardo-Cespedes, Sheng-Yi Kong and Noah Constant for training the models; Nan Hua, Nicole Limtiaco and Rhomni St. John for transferring tasks; and Steve Yuan, Yunhsuan Sung, Brian Strope and Ray Kurzweil for discussion of the model architecture. Special thanks to Sheng-Yi Kong and Noah Constant for training the Lite model.

Source: Google AI Blog