
Introducing TensorNetwork, an Open Source Library for Efficient Tensor Calculations



Many of the world's toughest scientific challenges, like developing high-temperature superconductors and understanding the true nature of space and time, involve dealing with the complexity of quantum systems. What makes these challenges difficult is that the number of quantum states in these systems is exponentially large, making brute-force computation infeasible. To deal with this, data structures called tensor networks are used. Tensor networks let one focus on the quantum states that are most relevant for real-world problems—the states of low energy, say—while ignoring other states that aren't relevant. Tensor networks are also increasingly finding applications in machine learning (ML). However, there remain difficulties that prohibit them from widespread use in the ML community: 1) a production-level tensor network library for accelerated hardware has not been available to run tensor network algorithms at scale, and 2) most of the tensor network literature is geared toward physics applications and creates the false impression that expertise in quantum mechanics is required to understand the algorithms.

In order to address these issues, we are releasing TensorNetwork, a brand new open source library to improve the efficiency of tensor calculations, developed in collaboration with the Perimeter Institute for Theoretical Physics and X. TensorNetwork uses TensorFlow as a backend and is optimized for GPU processing, which can enable speedups of up to 100x when compared to work on a CPU. We introduce TensorNetwork in a series of papers, the first of which presents the new library and its API, and provides an overview of tensor networks for a non-physics audience. In our second paper we focus on a particular use case in physics, demonstrating the speedup that one gets using GPUs.

How are Tensor Networks Useful?
Tensors are multidimensional arrays, categorized in a hierarchy according to their order: e.g., an ordinary number is a tensor of order zero (also known as a scalar), a vector is an order-one tensor, a matrix is an order-two tensor, and so on. While low-order tensors can easily be represented by an explicit array of numbers or with a mathematical symbol such as T_{ijnklm} (where the number of indices represents the order of the tensor), that notation becomes very cumbersome once we start talking about high-order tensors. At that point it’s useful to start using diagrammatic notation, where one simply draws a circle (or some other shape) with a number of lines, or legs, coming out of it—the number of legs being the same as the order of the tensor. In this notation, a scalar is just a circle, a vector has a single leg, a matrix has two legs, etc. Each leg of the tensor also has a dimension, which is the size of that leg. For example, a vector representing an object’s velocity through space would be a three-dimensional, order-one tensor.


Diagrammatic notation for tensors.
The benefit of representing tensors in this way is to succinctly encode mathematical operations, e.g., multiplying a matrix by a vector to produce another vector, or multiplying two vectors to make a scalar. These are all examples of a more general concept called tensor contraction.

Diagrammatic notation for tensor contraction. Vector and matrix multiplication, as well as the matrix trace (i.e., the sum of the diagonal elements of a matrix), are all examples.
These are also simple examples of tensor networks, which are graphical ways of encoding the pattern of tensor contractions of several constituent tensors to form a new one. Each constituent tensor has an order determined by its own number of legs. Legs that are connected, forming an edge in the diagram, represent contraction, while the number of remaining dangling legs determines the order of the resultant tensor.
Left: The trace of the product of four matrices, tr(ABCD), which is a scalar. You can see that it has no dangling legs. Right: Three order-three tensors being contracted with three legs dangling, resulting in a new order-three tensor.
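To make the diagrams concrete, the same contractions can be written with NumPy's einsum, where each repeated index is summed over, mirroring a connected pair of legs in the diagrams; the tensor shapes below are arbitrary choices for this sketch.

```python
import numpy as np

# Matrix-vector multiplication: contract the second leg of M with the leg of v.
M = np.random.rand(4, 5)
v = np.random.rand(5)
w = np.einsum('ij,j->i', M, v)                  # one dangling leg -> order-1 tensor

# Trace of a product of four matrices, tr(ABCD): no dangling legs -> a scalar.
A, B, C, D = (np.random.rand(4, 4) for _ in range(4))
s = np.einsum('ij,jk,kl,li->', A, B, C, D)

# Three order-3 tensors contracted together, leaving three dangling legs,
# so the result is again an order-3 tensor.
T1 = np.random.rand(2, 3, 4)                    # legs (a, i, j)
T2 = np.random.rand(4, 3, 5)                    # legs (j, k, b)
T3 = np.random.rand(3, 3, 6)                    # legs (i, k, c)
R = np.einsum('aij,jkb,ikc->abc', T1, T2, T3)

print(w.shape, s.shape, R.shape)                # (4,) () (2, 5, 6)
```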
While these examples are very simple, the tensor networks of interest often represent hundreds of tensors contracted in a variety of ways. Describing such a thing would be very obscure using traditional notation, which is why the diagrammatic notation was invented by Roger Penrose in 1971.

Tensor Networks in Practice
Consider a collection of black-and-white images, each of which can be thought of as a list of N pixel values. A single pixel of a single image can be one-hot-encoded into a two-dimensional vector, and by combining these pixel encodings together we can make a 2^N-dimensional one-hot encoding of the entire image. We can reshape that high-dimensional vector into an order-N tensor, and then add up all of the tensors in our collection of images to get a total tensor T_{i1,i2,...,iN} encapsulating the collection.
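As a rough sketch of this construction for a tiny image (small enough that the full tensor still fits in memory), each binary pixel becomes a two-dimensional one-hot vector and the pixel encodings are combined with an outer (Kronecker) product; the image size and the use of strictly binary pixels are simplifying assumptions.

```python
import numpy as np

def image_to_order_n_tensor(pixels):
    """One-hot encode each binary pixel into a 2-vector, combine the pixel
    encodings with a Kronecker product into a 2^N vector, and reshape it
    into an order-N tensor with one leg per pixel."""
    encoding = np.ones(1)
    for p in pixels:
        one_hot = np.array([1.0, 0.0]) if p == 0 else np.array([0.0, 1.0])
        encoding = np.kron(encoding, one_hot)
    return encoding.reshape((2,) * len(pixels))

# A "collection" of tiny 3x3 binary images (N = 9 pixels), summed into a
# single order-9 tensor T encapsulating the collection.
images = [np.random.randint(0, 2, size=9) for _ in range(5)]
T = sum(image_to_order_n_tensor(img) for img in images)
print(T.shape)   # (2, 2, 2, 2, 2, 2, 2, 2, 2)
```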

This sounds like a very wasteful thing to do: encoding images with about 50 pixels in this way would already take petabytes of memory. That’s where tensor networks come in. Rather than storing or manipulating the tensor T directly, we instead represent T as the contraction of many smaller constituent tensors in the shape of a tensor network. That turns out to be much more efficient. For instance, the popular matrix product state (MPS) network would write T in terms of N much smaller tensors, so that the total number of parameters is only linear in N, rather than exponential.
The high-order tensor T is represented in terms of many low-order tensors in a matrix product state tensor network.
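One standard way to obtain such a representation is to split the full tensor into a chain of low-order cores with successive truncated SVDs; the minimal NumPy sketch below illustrates the idea (it does not use the TensorNetwork API, and the maximum bond dimension is an arbitrary choice).

```python
import numpy as np

def mps_decompose(T, max_bond=8):
    """Split an order-N tensor (each leg of dimension d) into a list of
    order-3 MPS cores of shape (left_bond, physical, right_bond) via
    successive truncated SVDs."""
    d = T.shape[0]
    n = T.ndim
    cores, bond = [], 1
    remainder = T.reshape(bond * d, -1)
    for _ in range(n - 1):
        U, S, Vh = np.linalg.svd(remainder, full_matrices=False)
        chi = min(max_bond, len(S))                      # truncate the bond
        cores.append(U[:, :chi].reshape(bond, d, chi))
        remainder = (S[:chi, None] * Vh[:chi]).reshape(chi * d, -1)
        bond = chi
    cores.append(remainder.reshape(bond, d, 1))
    return cores

T = np.random.rand(*([2] * 10))        # order-10 tensor with 2^10 entries
cores = mps_decompose(T, max_bond=16)
print([c.shape for c in cores])        # N small cores instead of one big tensor
```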
It’s not obvious that large tensor networks can be efficiently created or manipulated while consistently avoiding the need for a huge amount of memory. But it turns out that this is possible in many cases, which is why tensor networks have been used extensively in quantum physics and, now, in machine learning. Stoudenmire and Schwab used the encoding just described to make an image classification model, demonstrating a new use for tensor networks. The TensorNetwork library is designed to facilitate exactly that kind of work, and our first paper describes how the library functions for general tensor network manipulations.

Performance in Physics Use-Cases
TensorNetwork is a general-purpose library for tensor network algorithms, and so it should prove useful for physicists as well. Approximating quantum states is a typical use-case for tensor networks in physics, and is well-suited to illustrate the capabilities of the TensorNetwork library. In our second paper, we describe a tree tensor network (TTN) algorithm for approximating the ground state of either a periodic quantum spin chain (1D) or a lattice model on a thin torus (2D), and implement the algorithm using TensorNetwork. We compare the use of CPUs with GPUs and observe significant computational speed-ups, up to a factor of 100, when using a GPU and the TensorNetwork library.
Computational time as a function of the bond dimension, χ. The bond dimension determines the size of the constituent tensors of the tensor network. A larger bond dimension means the tensor network is more powerful, but requires more computational resources to manipulate.
Conclusion and Future Work
These are the first in a series of planned papers to illustrate the power of TensorNetwork in real-world applications. In our next paper we will use TensorNetwork to classify images in the MNIST and Fashion-MNIST datasets. Future plans include time series analysis on the ML side, and quantum circuit simulation on the physics side. With the open source community, we are also always adding new features to TensorNetwork itself. We hope that TensorNetwork will become a valuable tool for physicists and machine learning practitioners.

Acknowledgements
The TensorNetwork library was developed by Chase Roberts, Adam Zalcman, and Bruce Fontaine of Google AI; Ashley Milsted, Martin Ganahl, and Guifre Vidal of the Perimeter Institute; and Jack Hidary and Stefan Leichenauer of X. We’d also like to thank Stavros Efthymiou at X for valuable contributions.

Source: Google AI Blog


Coral updates: Project tutorials, a downloadable compiler, and a new distributor

Posted by Vikram Tank (Product Manager), Coral Team


We’re committed to evolving Coral to make it even easier to build systems with on-device AI. Our team is constantly working on new product features and content that help ML practitioners, engineers, and prototypers create the next generation of hardware.

To improve our toolchain, we're making the Edge TPU Compiler available to users as a downloadable binary. The binary works on Debian-based Linux systems, allowing for better integration into custom workflows. Instructions on downloading and using the binary are on the Coral site.

We’re also adding a new section to the Coral site that showcases example projects you can build with your Coral board. For instance, Teachable Machine is a project that guides you through building a machine that can quickly learn to recognize new objects by re-training a vision classification model directly on your device. Minigo shows you how to create an implementation of AlphaGo Zero and run it on the Coral Dev Board or USB Accelerator.

Our distributor network is growing as well: Arrow will soon sell Coral products.

Announcing Open Images V5 and the ICCV 2019 Open Images Challenge



In 2016, we introduced Open Images, a collaborative release of ~9 million images annotated with labels spanning thousands of object categories. Since then we have rolled out several updates, culminating with Open Images V4 in 2018. In total, that release included 15.4M bounding-boxes for 600 object categories, making it the largest existing dataset with object location annotations, as well as over 300k visual relationship annotations.

Today we are happy to announce Open Images V5, which adds segmentation masks to the set of annotations, along with the second Open Images Challenge, which will feature a new instance segmentation track based on this data.

Open Images V5
Open Images V5 features segmentation masks for 2.8 million object instances in 350 categories. Unlike bounding-boxes, which only identify regions in which an object is located, segmentation masks mark the outline of objects, characterizing their spatial extent to a much higher level of detail. We have put particular effort into ensuring consistent annotations across different objects (e.g., all cat masks include their tail; bags carried by camels or persons are included in their mask). Importantly, these masks cover a broader range of object categories and a larger total number of instances than any previous dataset.

Example masks on the training set of Open Images V5. These have been produced by our interactive segmentation process. The first example also shows a bounding box, for comparison. From left to right, top to bottom: Tea and cake at the Fitzwilliam Museum by Tim Regan, Pilota II by Euskal kultur erakundea Institut culturel basque, Rheas by Dag Peak, Wuxi science park, 1995 by Gary Stevens, Cat Cafe Shinjuku calico by Ari Helminen, and Untitled by Todd Huffman. All images used under CC BY 2.0 license.
The segmentation masks on the training set (2.68M) have been produced by our state-of-the-art interactive segmentation process, where professional human annotators iteratively correct the output of a segmentation neural network. This is more efficient than manual drawing alone, while at the same time delivering accurate masks (intersection-over-union 84%). Additionally, we release 99k masks on the validation and test sets, which have been annotated manually with a strong focus on quality. These are near-perfect and capture even fine details of complex object boundaries (e.g. spiky flowers and thin structures in man-made objects). Both our training and validation+test annotations offer more accurate object boundaries than the polygon annotations provided by most existing datasets.
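For reference, the intersection-over-union figure quoted above is the standard overlap measure between a predicted and a ground-truth mask; a minimal sketch for boolean masks:

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection-over-union between two boolean segmentation masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union > 0 else 0.0

# Toy example: two overlapping square masks on a 100x100 canvas.
a = np.zeros((100, 100), dtype=bool)
b = np.zeros((100, 100), dtype=bool)
a[10:60, 10:60] = True
b[30:80, 30:80] = True
print(mask_iou(a, b))   # ~0.22 for these two squares
```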

Example masks on the validation and test sets of Open Images V5, drawn completely manually. From left to right: thistle flowers by sophie, still life with ax by liz west, Fischkutter KOŁ-180 in Kolobrzeg (PL) by zeesenboot. All images used under CC BY 2.0 license.
In addition to the masks, we also added 6.4M new human-verified image-level labels, reaching a total of 36.5M over nearly 20,000 categories. Finally, we improved annotation density for 600 object categories on the validation and test sets, adding more than 400k bounding boxes to match the density in the training set. This ensures more precise evaluation of object detection models.

Open Images Challenge 2019
In conjunction with this release, we are also introducing the second Open Images Challenge, to be held at the 2019 International Conference on Computer Vision (ICCV 2019). This Challenge will have a new instance segmentation track based on the data above. Moreover, as in the 2018 edition, it will also feature a large-scale object detection track (500 categories with 12.2M training bounding-boxes), and a visual relationship detection track for detecting pairs of objects in particular relations (329 relationship triplets with 375k training samples, e.g., “woman playing guitar” or “beer on table”).

The training set with all annotations is available now. The test set has the same 100k images as the 2018 Challenge and will be launched again on June 3rd, 2019 by Kaggle. The evaluation servers will open on June 3rd for the object detection and visual relationship tracks, and on July 1st for the instance segmentation track. The deadline for submission of results is October 1st, 2019.

We hope that the exceptionally large and diverse training set will inspire research into more advanced instance segmentation models. The extremely accurate ground-truth masks we provide reward subtle improvements in the output segmentations, and thus will encourage the development of higher-quality models that deliver precise boundaries. Finally, having a single dataset with unified annotations for image classification, object detection, visual relationship detection, and instance segmentation will enable researchers to study these tasks jointly and stimulate progress towards genuine scene understanding.

Source: Google AI Blog


Announcing Google-Landmarks-v2: An Improved Dataset for Landmark Recognition & Retrieval



Last year we released Google-Landmarks, the largest world-wide landmark recognition dataset available at that time. In order to foster advancements in research on instance-level recognition (recognizing specific instances of objects, e.g. distinguishing Niagara Falls from just any waterfall) and image retrieval (matching a specific object in an input image to all other instances of that object in a catalog of reference images), we also hosted two Kaggle challenges, Landmark Recognition 2018 and Landmark Retrieval 2018, in which more than 500 teams of researchers and machine learning (ML) enthusiasts participated. However, both instance recognition and image retrieval methods require ever larger datasets in both the number of images and the variety of landmarks in order to train better and more robust systems.

In support of this goal, this year we are releasing Google-Landmarks-v2, a completely new, even larger landmark recognition dataset that includes over 5 million images (2x that of the first release) of more than 200 thousand different landmarks (an increase of 7x). Due to the difference in scale, this dataset is much more diverse and creates even greater challenges for state-of-the-art instance recognition approaches. Based on this new dataset, we are also announcing two new Kaggle challenges—Landmark Recognition 2019 and Landmark Retrieval 2019—and releasing the source code and model for Detect-to-Retrieve, a novel image representation suitable for retrieval of specific object instances.
Heatmap of the landmark locations in Google-Landmarks-v2, which demonstrates the increase in the scale of the dataset and the improved geographic coverage compared to last year’s dataset.
Creating the Dataset
A particular problem in preparing Google-Landmarks-v2 was the generation of instance labels for the landmarks represented, since it is virtually impossible for annotators to recognize all of the hundreds of thousands of landmarks that could potentially be present in a given photo. Our solution to this problem was to crowdsource the landmark labeling through the efforts of a world-spanning community of hobby photographers, each familiar with the landmarks in their region.
Selection of images from Google-Landmarks-v2. Landmarks include (left to right, top to bottom) Neuschwanstein Castle, Golden Gate Bridge, Kiyomizu-dera, Burj Khalifa, Great Sphinx of Giza, and Machu Picchu.
Another issue for research datasets is the requirement that images be shared freely and stored indefinitely, so that the dataset can be used to track the progress of research over a long period of time. As such, we sourced the Google-Landmarks-v2 images through Wikimedia Commons, capturing both world-famous and lesser-known, local landmarks while ensuring broad geographic coverage (thanks in part to Wiki Loves Monuments) and photos sourced from public institutions, including historical photographs that are valuable to test instance recognition over time.

The Kaggle Challenges
The goal of the Landmark Recognition 2019 challenge is to recognize a landmark presented in a query image, while the goal of Landmark Retrieval 2019 is to find all images showing that landmark. The challenges include cash prizes totaling $50,000 and the winning teams will be invited to present their methods at the Second Landmark Recognition Workshop at CVPR 2019.

Open Sourcing our Model
To foster research reproducibility and help push the field of instance recognition forward, we are also releasing open-source code for our new technique, called Detect-to-Retrieve (which will be presented as a paper in CVPR 2019). This new method leverages bounding boxes from an object detection model to give extra weight to image regions containing the class of interest, which significantly improves accuracy. The model we are releasing is trained on a subset of 86k images from the original Google-Landmarks dataset that were annotated with landmark bounding boxes. We are making these annotations available along with the original dataset here.
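The precise aggregation procedure is defined in the paper and the released code; purely to illustrate the general idea described above (up-weighting image regions covered by detected landmark boxes when pooling local features into an image descriptor), a simplified sketch might look like the following, with all names, shapes, and the weight value being hypothetical.

```python
import numpy as np

def aggregate_descriptor(local_features, keypoints, boxes, box_weight=2.0):
    """Build a single image descriptor from local features, giving extra
    weight to features whose keypoints fall inside detected landmark boxes.
    local_features: (N, D) array; keypoints: (N, 2) array of (x, y);
    boxes: list of (xmin, ymin, xmax, ymax). All inputs are hypothetical."""
    weights = np.ones(len(local_features))
    for xmin, ymin, xmax, ymax in boxes:
        inside = ((keypoints[:, 0] >= xmin) & (keypoints[:, 0] <= xmax) &
                  (keypoints[:, 1] >= ymin) & (keypoints[:, 1] <= ymax))
        weights[inside] = box_weight
    descriptor = (weights[:, None] * local_features).sum(axis=0)
    return descriptor / (np.linalg.norm(descriptor) + 1e-9)
```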

We invite researchers and ML enthusiasts to participate in the Landmark Recognition 2019 and Landmark Retrieval 2019 Kaggle challenges and to join the Second Landmark Recognition Workshop at CVPR 2019. We hope that this dataset will help advance the state-of-the-art in instance recognition and image retrieval. The data is being made available via the Common Visual Data Foundation.

Acknowledgments
The core contributors to this project are Andre Araujo, Bingyi Cao, Jack Sim and Tobias Weyand. We would like to thank our team members Daniel Kim, Emily Manoogian, Nicole Maffeo, and Hartwig Adam for their kind help. Thanks also to Marvin Teichmann and Menglong Zhu for their contribution to collecting the landmark bounding boxes and developing the Detect-to-Retrieve technique. We would like to thank Will Cukierski and Maggie Demkin for their help organizing the Kaggle challenge, Elan Hourticolon-Retzler, Yuan Gao, Qin Guo, Gang Huang, Yan Wang, Zhicheng Zheng for their help with data collection, Tsung-Yi Lin for his support with CVDF hosting, as well as our CVPR workshop co-organizers Bohyung Han, Shih-Fu Chang, Ondrej Chum, Torsten Sattler, Giorgos Tolias, and Xu Zhang. We have great appreciation for the Wikimedia Commons Community and their volunteer contributions to an invaluable photographic archive of the world’s cultural heritage. And finally, we’d like to thank the Common Visual Data Foundation for hosting the dataset.

Source: Google AI Blog


Evaluating the Unsupervised Learning of Disentangled Representations



The ability to understand high-dimensional data, and to distill that knowledge into useful representations in an unsupervised manner, remains a key challenge in deep learning. One approach to solving these challenges is through disentangled representations, models that capture the independent features of a given scene in such a way that if one feature changes, the others remain unaffected. If done successfully, a machine learning system that is designed to navigate the real world, such as a self-driving car or a robot, can disentangle the different factors and properties of objects and their surroundings, enabling the generalization of knowledge to previously unobserved situations. While unsupervised disentanglement methods have already been used for curiosity-driven exploration, abstract reasoning, visual concept learning and domain adaptation for reinforcement learning, recent progress in the field makes it difficult to know how well different approaches work and the extent of their limitations.

In "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations" (to appear at ICML 2019), we perform a large-scale evaluation on recent unsupervised disentanglement methods, challenging some common assumptions in order to suggest several improvements to future work on disentanglement learning. This evaluation is the result of training more than 12,000 models covering most prominent methods and evaluation metrics in a reproducible large-scale experimental study on seven different data sets. Importantly, we have also released both the code used in this study as well as more than 10,000 pretrained disentanglement models. The resulting library, disentanglement_lib, allows researchers to bootstrap their own research in this field and to easily replicate and verify our empirical results.

Understanding Disentanglement
To better understand the ground-truth properties of an image that can be encoded in a disentangled representation, first consider the ground-truth factors of the data set Shapes3D. In this toy model, shown in the figure below, each panel represents one factor that could be encoded into a vector representation of the image. The model shown is defined by the shape of the object in the middle of the image, its size, the rotation of the camera and the color of the floor, the wall and the object.
Visualization of the ground-truth factors of the Shapes3D data set: Floor color (upper left), wall color (upper middle), object color (upper right), object size (bottom left), object shape (bottom middle), and camera angle (bottom right).
The goal of disentangled representations is to build models that can capture these explanatory factors in a vector. The figure below presents a model with a 10-dimensional representation vector. Each of the 10 panels visualizes what information is captured in one of the 10 different coordinates of the representation. From the top right and the top middle panel we see that the model has successfully disentangled floor color, while the two bottom left panels indicate that object color and size are still entangled.
Visualization of the latent dimensions learned by a FactorVAE model (see below). The ground-truth factors wall and floor color as well as rotation of the camera are disentangled (see top right, top center and bottom center panels), while the ground-truth factors object shape, size and color are entangled (see top left and the two bottom left images).
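Panels like these are typically produced by a latent traversal: one coordinate of the representation is swept over a range of values while the others stay fixed, and each resulting code is decoded back into an image. A minimal sketch, assuming an already-trained decoder function is available:

```python
import numpy as np

def latent_traversal(decode, base_code, dim, values):
    """Decode images obtained by sweeping latent coordinate `dim` over
    `values` while keeping the other coordinates of `base_code` fixed.
    `decode` is assumed to map a latent vector to an image array."""
    images = []
    for v in values:
        code = base_code.copy()
        code[dim] = v
        images.append(decode(code))
    return images

# Hypothetical usage with a 10-dimensional representation:
# panels = [latent_traversal(decode, z0, d, np.linspace(-2, 2, 9))
#           for d in range(10)]
```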
Key Results of this Reproducible Large-scale Study
While the research community has proposed a variety of unsupervised approaches to learn disentangled representations based on variational autoencoders and has devised different metrics to quantify their level of disentanglement, to our knowledge no large-scale empirical study has evaluated these approaches in a unified manner. We propose a fair, reproducible experimental protocol to benchmark the state of unsupervised disentanglement learning by implementing six different state-of-the-art models (BetaVAE, AnnealedVAE, FactorVAE, DIP-VAE I/II and Beta-TCVAE) and six disentanglement metrics (BetaVAE score, FactorVAE score, MIG, SAP, Modularity and DCI Disentanglement). In total, we train and evaluate 12,800 such models on seven data sets. Key findings of our study include:
  • We do not find any empirical evidence that the considered models can be used to reliably learn disentangled representations in an unsupervised way, since random seeds and hyperparameters seem to matter more than the model choice. In other words, even if one trains a large number of models and some of them are disentangled, these disentangled representations seemingly cannot be identified without access to ground-truth labels. Furthermore, good hyperparameter values do not appear to consistently transfer across the data sets in our study. These results are consistent with the theorem we present in the paper, which states that the unsupervised learning of disentangled representations is impossible without inductive biases on both the data set and the models (i.e., one has to make assumptions about the data set and incorporate those assumptions into the model).
  • For the considered models and data sets, we cannot validate the assumption that disentanglement is useful for downstream tasks, e.g., that with disentangled representations it is possible to learn with fewer labeled observations.
The figure below demonstrates some of these findings. The choice of random seed across different runs has a larger impact on disentanglement scores than the model choice and the strength of regularization (while naively one might expect that more regularization should always lead to more disentanglement). A good run with a bad hyperparameter can easily beat a bad run with a good hyperparameter.
The violin plots show the distribution of FactorVAE scores attained by different models on the Cars3D data set. The left plot shows how the distribution changes as different disentanglement models are considered while the right plot displays the different distributions as the regularization strength in a FactorVAE model is varied. The key observation is that the violin plots substantially overlap which indicates that all methods strongly depend on the random seed.
Based on these results, we make four observations relevant to future research:
  1. Given the theoretical result that the unsupervised learning of disentangled representations without inductive biases is impossible, future work should clearly describe the imposed inductive biases and the role of both implicit and explicit supervision.
  2. Finding good inductive biases for unsupervised model selection that work across multiple data sets persists as a key open problem.
  3. The concrete practical benefits of enforcing a specific notion of disentanglement of the learned representations should be demonstrated. Promising directions include robotics, abstract reasoning and fairness.
  4. Experiments should be conducted in a reproducible experimental setup on a diverse selection of data sets.
Open Sourcing disentanglement_lib
In order for others to verify our results, we have released disentanglement_lib, the library we used to create the experimental study. It contains open-source implementations of the considered disentanglement methods and metrics, a standardized training and evaluation protocol, as well as visualization tools to better understand trained models.

The advantages of this library are three-fold. First, with less than four shell commands disentanglement_lib can be used to reproduce any of the models in our study. Second, researchers may easily modify our study to test additional hypotheses. Third, disentanglement_lib is easily extendible and can be used to bootstrap research into the learning of disentangled representations—it is easy to implement new models and compare them to our reference implementation using a fair, reproducible experimental setup.

Reproducing all the models in our study requires a computational effort of approximately 2.5 GPU years, which can be prohibitive. So, we have also released >10,000 pretrained disentanglement_lib models from our study that can be used together with disentanglement_lib.

We hope that this will accelerate research in this field by allowing other researchers to benchmark their new models against our pretrained models and to test new disentanglement metrics and visualization approaches on a diverse set of models.

Acknowledgments
This research was done in collaboration with Francesco Locatello, Mario Lucic, Stefan Bauer, Gunnar Rätsch, Sylvain Gelly and Bernhard Schölkopf at Google AI Zürich, ETH Zürich and the Max-Planck Institute for Intelligent Systems. We also wish to thank Josip Djolonga, Ilya Tolstikhin, Michael Tschannen, Sjoerd van Steenkiste, Joan Puigcerver, Marcin Michalski, Marvin Ritter, Irina Higgins and the rest of the Google Brain team for helpful discussions, comments, technical help and code contributions.

Source: Google AI Blog


Updates from Coral: A new compiler and much more

Posted by Vikram Tank (Product Manager), Coral Team

Coral has been public for about a month now, and we’ve heard some great feedback about our products. As we evolve the Coral platform, we’re making our products easier to use and exposing more powerful tools for building devices with on-device AI.

Today, we're updating the Edge TPU model compiler to remove the restrictions around specific architectures, allowing you to submit any model architecture that you want. This greatly increases the variety of models that you can run on the Coral platform. Just be sure to review the TensorFlow ops supported on Edge TPU and model design requirements to take full advantage of the Edge TPU at runtime.

We're also releasing a new version of Mendel OS (3.0 Chef) for the Dev Board with a new board management tool called Mendel Development Tool (MDT).

To help with the developer workflow, our new C++ API works with the TensorFlow Lite C++ API so you can execute inferences on an Edge TPU. In addition, both the Python and C++ APIs now allow you to run multiple models in parallel, using multiple Edge TPU devices.

In addition to these updates, we’re adding new capabilities to Coral with the release of the Environmental Sensor Board. It’s an accessory board for the Coral Dev Platform (and Raspberry Pi) that brings sensor input to your models. It has integrated light, temperature, humidity, and barometric sensors, and the ability to add more sensors via its four Grove connectors. The on-board secure element also allows for easy communication with Google Cloud IoT Core.

The team has also been working with partners to help them evaluate whether Coral is the right fit for their products. We’re excited that Oivi has chosen us to be the base platform of their new handheld AI-camera. This product will help prevent blindness among diabetes patients by providing early, automated detection of diabetic retinopathy. Anders Eikenes, CEO of Oivi, says “Oivi is dedicated towards providing patient-centric eye care for everyone - including emerging markets. We were honoured to be selected by Google to participate in their Coral alpha program, and are looking forward to our continued cooperation. The Coral platform gives us the ability to run our screening ML models inside a handheld device; greatly expanding the access and ease of diabetic retinopathy screening.”

Finally, we’re expanding our distributor network to make it easier to get Coral boards into your hands around the world. This month, Seeed and NXP will begin to sell Coral products, in addition to Mouser.

We're excited to keep evolving the Coral platform. Please keep sending us feedback at [email protected].

You can see the full release notes on the Coral site.

Reducing the Need for Labeled Data in Generative Adversarial Networks



Generative adversarial networks (GANs) are a powerful class of deep generative models. The main idea behind GANs is to train two neural networks: the generator, which learns how to synthesize data (such as an image), and the discriminator, which learns how to distinguish real data from the data synthesized by the generator. This approach has been successfully used for high-fidelity natural image synthesis, improving learned image compression, data augmentation, and more.
Evolution of the generated samples as training progresses on ImageNet. The generator network is conditioned on the class (e.g., "great gray owl" or "golden retriever").
For natural image synthesis, state-of-the-art results are achieved by conditional GANs that, unlike unconditional GANs, use labels (e.g. car, dog, etc.) during training. While this makes the task easier and leads to significant improvements, this approach requires a large amount of labeled data that is rarely available in practice.

In "High-Fidelity Image Generation With Fewer Labels", we propose a new approach to reduce the amount of labeled data required to train state-of-the-art conditional GANs. When combined with recent advancements on large-scale GANs, we match the state-of-the-art in high-fidelity natural image synthesis using 10x fewer labels. Based on this research, we are also releasing a major update to the Compare GAN library, which contains all the components necessary to train and evaluate modern GANs.

Improvements via Semi-supervision and Self-supervision
In conditional GANs, both the generator and discriminator are typically conditioned on class labels. In this work, we propose to replace the hand-annotated ground truth labels with inferred ones. To infer high-quality labels for a large dataset of mostly unlabeled data, we take a two-step approach: First, we learn a feature representation using only the unlabeled portion of the dataset. To learn the feature representations we make use of self-supervision in the form of a recently introduced approach, in which the unlabeled images are randomly rotated and a deep convolutional neural network is tasked with predicting the rotation angle. The idea is that the models need to be able to recognize the main objects and their shapes in order to be successful on this task.
An unlabeled image is randomly rotated and the network is tasked with predicting the rotation angle. Successful models need to capture semantically meaningful image features which can then be used for other vision tasks.
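A minimal TensorFlow/Keras sketch of this rotation pretext task is shown below; the network architecture, image size, and training settings are placeholder choices for illustration, not those used in the paper.

```python
import tensorflow as tf

def make_rotation_dataset(images):
    """Build (rotated image, rotation label) pairs, with labels in {0, 1, 2, 3}
    corresponding to rotations of 0, 90, 180 and 270 degrees."""
    rotated, labels = [], []
    for img in images:
        for k in range(4):
            rotated.append(tf.image.rot90(img, k=k))
            labels.append(k)
    return tf.stack(rotated), tf.constant(labels)

# Placeholder convolutional network for 4-way rotation prediction.
rotation_net = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(name='features'),
    tf.keras.layers.Dense(4),
])
rotation_net.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

unlabeled = tf.random.uniform((128, 32, 32, 3))   # stand-in for real unlabeled images
x, y = make_rotation_dataset(unlabeled)
rotation_net.fit(x, y, batch_size=64, epochs=1)
```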
We then consider the activation pattern of one of the intermediate layers of the trained network as the new feature representation of the input, and train a classifier to recognize the label of that input using the labeled portion of the original data set. As the network was pre-trained to extract semantically meaningful features from the data (on the rotation prediction task), training this classifier is more sample-efficient than training the entire network from scratch. Finally, we use this classifier to label the unlabeled data.
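Continuing the sketch above, an intermediate layer of the rotation network can serve as a frozen feature extractor; a small classifier is then fit on the labeled subset and used to infer labels for the rest. Layer names, sizes, and the number of classes are again illustrative assumptions.

```python
import tensorflow as tf

# Frozen feature extractor built from an intermediate layer of the
# rotation network trained in the previous sketch.
feature_extractor = tf.keras.Model(
    inputs=rotation_net.input,
    outputs=rotation_net.get_layer('features').output)
feature_extractor.trainable = False

# Small classifier trained only on the labeled portion of the data set.
num_classes = 1000                                 # e.g., ImageNet-style labels
classifier = tf.keras.Sequential([tf.keras.layers.Dense(num_classes)])
classifier.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

labeled_images = tf.random.uniform((64, 32, 32, 3))            # stand-ins
labeled_targets = tf.random.uniform((64,), maxval=num_classes, dtype=tf.int32)
classifier.fit(feature_extractor(labeled_images), labeled_targets, epochs=1)

# Infer labels for the unlabeled images; these inferred labels then play the
# role of the hand-annotated class labels when training the conditional GAN.
pseudo_labels = tf.argmax(classifier(feature_extractor(unlabeled)), axis=-1)
```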

To further improve the model quality and training stability, we encourage the discriminator network to learn meaningful feature representations that are not forgotten during training, by means of an auxiliary loss we introduced previously. These two advancements, combined with large-scale training, lead to state-of-the-art conditional GANs for the task of ImageNet synthesis, as measured by the Fréchet Inception Distance.
Given a latent vector, the generator network produces an image. In each row, linear interpolation between the latent codes of the leftmost and the rightmost image results in a semantic interpolation in the image space.
Compare GAN: A Library for Training and Evaluating GANs
Cutting-edge research on GANs is heavily dependent on a well-engineered and well-tested codebase, since even replicating prior results and techniques requires a significant effort. In order to foster open science and allow the research community to benefit from recent advancements, we are releasing a major update of the Compare GAN library. The library includes loss functions, regularization and normalization schemes, neural architectures, and quantitative metrics commonly used in modern GANs.
Conclusions and Future Work
Given the growing gap between labeled and unlabeled data sources, it is becoming increasingly important to be able to learn from only partially labeled data. We have shown that a simple yet powerful combination of self-supervision and semi-supervision can help to close this gap for GANs. We believe that self-supervision is a powerful idea that should be investigated for other generative modeling tasks.

Acknowledgments
Work conducted in collaboration with colleagues on the Google Brain team in Zürich, ETH Zürich and UCLA. We would like to thank our paper co-authors Michael Tschannen, Xiaohua Zhai, Olivier Bachem and Sylvain Gelly for their input and feedback. We would like to thank Alexander Kolesnikov, Lucas Beyer and Avital Oliver for helpful discussion on self-supervised learning and semi-supervised learning. We would like to thank Karol Kurach and Marcin Michalski for their major contributions to the Compare GAN library. We would also like to thank Andy Brock, Jeff Donahue and Karen Simonyan for their insights into training GANs on TPUs. The work described in this post also builds upon our work on “Self-Supervised Generative Adversarial Networks” with Ting Chen and Neil Houlsby.

Source: Google AI Blog


This is the Future of Finance

Posted by Roy Glasberg, Head of Launchpad

Launchpad's mission is to accelerate innovation and to help startups build world-class technologies by leveraging the best of Google - its people, network, research, and technology.

In September 2018, the Launchpad team welcomed ten of the world's leading FinTech startups to join their accelerator program, helping them fast-track their application of advanced technology. Today, March 15th, we will see this cohort graduate from the program at the Launchpad team's inaugural event - The Future of Finance - a global discussion on the impact of applied ML/AI on the finance industry. These startups are ensuring that everyone has relevant insights at their fingertips and that all people, no matter where they are, have access to equitable money, banking, loans, and marketplaces.

Tune into the event from wherever you are via the livestream link

The Graduating Class of Launchpad FinTech Accelerator San Francisco'19

  • Alchemy (USA), bridging blockchain and the real world
  • Axinan (Singapore), providing smart insurance for the digital economy
  • Aye Finance (India), transforming financing in India
  • Celo (USA), increasing financial inclusion through a mobile-first cryptocurrency
  • Frontier Car Group (Germany), investing in the transformation of used-car marketplaces
  • GO-JEK (Indonesia), improving the welfare and livelihoods of informal sectors
  • GuiaBolso (Brazil), improving the financial lives of Brazilians
  • JUMO (South Africa), creating a transparent, fair money marketplace for mobile users to access loans
  • m.Paani (India), (em)powering local retailers and the next billion users in India
  • Starling Bank (UK), improving financial health with a 100% mobile-only bank

Since joining the accelerator, these startups have made great strides and are going from strength to strength. Some recent announcements from this cohort include:

  • JUMO has announced the launch of Opportunity Co, a 500M fund for credit where all the profits go back to the customers.
  • The team at Aye Finance has just closed a $30m Series D equity round.
  • Starling Bank has provided 150 new jobs in Southampton, received a £100m grant from a fund aimed at increasing competition and innovation in the British banking sector, and completed a £75m fundraise.
  • GuiaBolso ran a campaign to pay the bills of some of its users (the beginning of the year in Brazil is a time of high expenses and debts) and is having a significant impact on credit, with interest rates on loans cheaper than those of traditional banks in 80% of cases.

We look forward to following the success of all our participating founders as they continue to make a significant impact on the global economy.

Want to know more about the Launchpad Accelerator? Visit our site, stay updated on developments and future opportunities by subscribing to the Google Developers newsletter and visit The Launchpad Blog.

Harnessing Organizational Knowledge for Machine Learning



One of the biggest bottlenecks in developing machine learning (ML) applications is the need for the large, labeled datasets used to train modern ML models. Creating these datasets involves the investment of significant time and expense, requiring annotators with the right expertise. Moreover, due to the evolution of real-world applications, labeled datasets often need to be thrown out or re-labeled.

In collaboration with Stanford and Brown University, we present "Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale," which explores how existing knowledge in an organization can be used as noisier, higher-level supervision—or, as it is often termed, weak supervision—to quickly label large training datasets. In this study, we use an experimental internal system, Snorkel DryBell, which adapts the open-source Snorkel framework to use diverse organizational knowledge resources—like internal models, ontologies, legacy rules, knowledge graphs and more—in order to generate training data for machine learning models at web scale. We find that this approach can match the efficacy of hand-labeling tens of thousands of data points, and reveals some core lessons about how training datasets for modern machine learning models can be created in practice.

Rather than labeling training data by hand, Snorkel DryBell enables writing labeling functions that label training data programmatically. In this work, we explored how these labeling functions can capture engineers' knowledge about how to use existing resources as heuristics for weak supervision. As an example, suppose our goal is to identify content related to celebrities. One can leverage an existing named-entity recognition (NER) model for this task by labeling any content that does not contain a person as not related to celebrities. This illustrates how existing knowledge resources (in this case, a trained model) can be combined with simple programmatic logic to label training data for a new model. Note also, importantly, that this labeling function returns None (i.e., abstains) in many cases, and thus only labels some small part of the data; our overall goal is to use these labels to train a modern machine learning model that can generalize to new data.
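A minimal sketch of such a labeling function, assuming a hypothetical ner_model object whose people(content) call returns the person entities it finds (the interfaces in the real system differ):

```python
NOT_CELEBRITY = 0
ABSTAIN = None

def lf_no_person_not_celebrity(content, ner_model):
    """Heuristic labeling function: if an existing NER model finds no person
    in the content, label it as not celebrity-related; otherwise abstain."""
    if len(ner_model.people(content)) == 0:
        return NOT_CELEBRITY
    return ABSTAIN
```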

In our example of a labeling function, rather than hand-labeling a data point (1), one utilizes an existing knowledge resource—in this case, a NER model (2)—together with some simple logic expressed in code (3) to heuristically label data.
This programmatic interface for labeling training data is much faster and more flexible than hand-labeling individual data points, but the resulting labels are obviously of much lower quality than manually-specified labels. The labels generated by these labeling functions will often overlap and disagree, as the labeling functions may not only have arbitrary unknown accuracies, but may also be correlated in arbitrary ways (for example, from sharing a common data source or heuristic).

To solve the problem of noisy and correlated labels, Snorkel DryBell uses a generative modeling technique to automatically estimate the accuracies and correlations of the labeling functions in a provably consistent way—without any ground truth training labels—then uses this to re-weight and combine their outputs into a single probabilistic label per data point. At a high level, we rely on the observed agreements and disagreements between the labeling functions (the covariance matrix), and learn the labeling function accuracy and correlation parameters that best explain this observed output using a new matrix completion-style approach. The resulting labels can then be used to train an arbitrary model (e.g. in TensorFlow), as shown in the system diagram below.

Using Diverse Knowledge Sources as Weak Supervision
To study the efficacy of Snorkel DryBell, we used three production tasks and corresponding datasets, aimed at classifying topics in web content, identifying mentions of certain products, and detecting certain real-time events. Using Snorkel DryBell, we were able to make use of various existing or quickly specified sources of information such as:
  • Heuristics and rules: e.g. existing human-authored rules about the target domain.
  • Topic models, taggers, and classifiers: e.g. machine learning models about the target domain or a related domain.
  • Aggregate statistics: e.g. tracked metrics about the target domain.
  • Knowledge or entity graphs: e.g. databases of facts about the target domain.
In Snorkel DryBell, the goal is to train a machine learning model (C), for example to do content or event classification over web data. Rather than hand-labeling training data to do this, in Snorkel DryBell users write labeling functions that express various organizational knowledge resources (A), which are then automatically reweighted and combined (B).
We used these organizational knowledge resources to write labeling functions in a MapReduce template-based pipeline. Each labeling function takes in a data point and either abstains, or outputs a label. The result is a large set of programmatically-generated training labels. However, many of these labels were very noisy (e.g. from the heuristics), conflicted with each other, or were far too coarse-grained (e.g. the topic models) for our task, leading to the next stage of Snorkel DryBell, aimed at automatically cleaning and integrating the labels into a final training set.

Modeling the Accuracies to Combine & Repurpose Existing Sources
To handle these noisy labels, the next stage of Snorkel DryBell combines the outputs from the labeling functions into a single, confidence-weighted training label for each data point. The challenging technical aspect is that this must be done without any ground-truth labels. We use a generative modeling technique that learns the accuracy of each labeling function using only unlabeled data. This technique learns by observing the matrix of agreements and disagreements between the labeling functions' outputs, taking into account known (or statistically estimated) correlation structures between them. In Snorkel DryBell, we also implement a new faster, sampling-free version of this modeling approach, implemented in TensorFlow, in order to handle web-scale data.
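The generative model itself estimates labeling-function accuracies and correlations from their observed agreements alone, which is beyond a short snippet; as a greatly simplified stand-in, the final re-weighting and combination step can be sketched as a log-odds-weighted vote over the non-abstaining outputs, with the accuracies taken as given:

```python
import numpy as np

def combine_labels(lf_outputs, lf_accuracies):
    """Combine noisy binary labels (+1 / -1 / None for abstain) from several
    labeling functions into one probabilistic label per data point, weighting
    each labeling function by the log-odds of its (given) accuracy. This is a
    simplified stand-in for the learned generative label model."""
    probs = []
    for votes in lf_outputs:                         # one row per data point
        score = 0.0
        for vote, acc in zip(votes, lf_accuracies):
            if vote is None:                         # abstain
                continue
            score += vote * np.log(acc / (1.0 - acc))
        probs.append(1.0 / (1.0 + np.exp(-score)))   # P(label = +1)
    return np.array(probs)

# Three labeling functions applied to four data points (hypothetical outputs).
L = [[+1, +1, None], [-1, None, -1], [+1, -1, +1], [None, None, None]]
print(combine_labels(L, lf_accuracies=[0.9, 0.7, 0.6]))
```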

By combining and modeling the output of the labeling functions using this procedure in Snorkel DryBell, we were able to generate high-quality training labels. In fact, on the two applications where hand-labeled training data was available for comparison, we achieved the same predictive accuracy training a model with Snorkel DryBell's labels as we did when training that same model with 12,000 and 80,000 hand-labeled training data points.

Transferring Non-Servable Knowledge to Servable Models
In many settings, there is also an important distinction between servable features—which can be used in production—and non-servable features that are too slow or expensive to be used in production. These non-servable features may have very rich signal, but a general question is how to use them to train or otherwise help servable models that can be deployed in production.


In many settings, users write labeling functions that leverage organizational knowledge resources that are not servable in production (a)—e.g. aggregate statistics, internal models, or knowledge graphs that are too slow or expensive to use in production—in order to train models that are only defined over production-servable features (b), e.g. cheap, real-time web signals.
In Snorkel DryBell, we found that users could write the labeling functions—i.e. express their organizational knowledge—over one feature set that was not servable, and then use the resulting training labels output by Snorkel DryBell to train a model defined over a different, servable feature set. This cross-feature transfer boosted our performance by an average 52% on the benchmark datasets we created. More broadly, it represents a simple but powerful way to use resources that are too slow (e.g. expensive models or aggregate statistics), private (e.g. entity or knowledge graphs), or otherwise unsuitable for deployment, to train servable models over cheap, real-time features. This approach can be viewed as a new type of transfer learning, where instead of transferring a model between different datasets, we're transferring domain knowledge between different feature sets, an approach which has potential use cases not just in industry, but in medical settings and beyond.

Next Steps
Moving forward, we're excited to see what other types of organizational knowledge can be used as weak supervision, and how the approach used by Snorkel DryBell can enable new modes of information reuse and sharing across organizations. For more details, check out our paper, and for further technical details, blog posts, and tutorials, check out the open-source Snorkel implementation at snorkel.stanford.edu.

Acknowledgments
This research was done in collaboration between Google, Stanford, and Brown. We would like to thank all the people who were involved, including Stephen Bach (Brown), Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Souvik Sen, Braden Hancock (Stanford), Houman Alborzi, Rahul Kuchhal, Christopher Ré (Stanford), Rob Malkin.

Source: Google AI Blog


An All-Neural On-Device Speech Recognizer



In 2012, speech recognition research showed significant accuracy improvements with deep learning, leading to early adoption in products such as Google's Voice Search. It was the beginning of a revolution in the field: each year, new architectures were developed that further increased quality, from deep neural networks (DNNs) to recurrent neural networks (RNNs), long short-term memory networks (LSTMs), convolutional networks (CNNs), and more. During this time, latency remained a prime focus — an automated assistant feels a lot more helpful when it responds quickly to requests.

Today, we're happy to announce the rollout of an end-to-end, all-neural, on-device speech recognizer to power speech input in Gboard. In our recent paper, "Streaming End-to-End Speech Recognition for Mobile Devices", we present a model trained using RNN transducer (RNN-T) technology that is compact enough to reside on a phone. This means no more network latency or spottiness — the new recognizer is always available, even when you are offline. The model works at the character level, so that as you speak, it outputs words character-by-character, just as if someone was typing out what you say in real-time, and exactly as you'd expect from a keyboard dictation system.
This video compares the production, server-side speech recognizer (left panel) to the new on-device recognizer (right panel) when recognizing the same spoken sentence. Video credit: Akshay Kannan and Elnaz Sarbar
A Bit of History
Traditionally, speech recognition systems consisted of several components - an acoustic model that maps segments of audio (typically 10 millisecond frames) to phonemes, a pronunciation model that connects phonemes together to form words, and a language model that expresses the likelihood of given phrases. In early systems, these components were optimized independently.

Around 2014, researchers began to focus on training a single neural network to directly map an input audio waveform to an output sentence. This sequence-to-sequence approach to learning a model by generating a sequence of words or graphemes given a sequence of audio features led to the development of "attention-based" and "listen-attend-spell" models. While these models showed great promise in terms of accuracy, they typically work by reviewing the entire input sequence, and do not allow streaming outputs as the input comes in, a necessary feature for real-time voice transcription.

Meanwhile, an independent technique called connectionist temporal classification (CTC) had helped halve the latency of the production recognizer at that time. This proved to be an important step in creating the RNN-T architecture adopted in this latest release, which can be seen as a generalization of CTC.

Recurrent Neural Network Transducers
RNN-Ts are a form of sequence-to-sequence models that do not employ attention mechanisms. Unlike most sequence-to-sequence models, which typically need to process the entire input sequence (the waveform in our case) to produce an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a property that is welcome for speech dictation. In our implementation, the output symbols are the characters of the alphabet. The RNN-T recognizer outputs characters one-by-one, as you speak, with white spaces where appropriate. It does this with a feedback loop that feeds symbols predicted by the model back into it to predict the next symbols, as described in the figure below.
Representation of an RNN-T, with the input audio samples, x, and the predicted symbols y. The predicted symbols (outputs of the Softmax layer) are fed back into the model through the Prediction network, as y_{u-1}, ensuring that the predictions are conditioned both on the audio samples so far and on past outputs. The Prediction and Encoder Networks are LSTM RNNs, the Joint model is a feedforward network (paper). The Prediction Network comprises 2 layers of 2048 units, with a 640-dimensional projection layer. The Encoder Network comprises 8 such layers. Image credit: Chris Thornton
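The following sketch shows the greedy decoding loop implied by this feedback structure, with random stand-ins for the trained encoder, prediction, and joint networks; the production recognizer uses trained LSTMs and beam search rather than this minimal greedy version.

```python
import numpy as np

BLANK = 0   # special symbol meaning "emit nothing, move to the next frame"

def rnnt_greedy_decode(encoder_outputs, prediction_net, joint_net, max_symbols=10):
    """Greedy RNN-T decoding: for each encoder frame, repeatedly combine the
    frame with the prediction network's output, pick the most likely symbol,
    and feed every non-blank symbol back into the prediction network."""
    hypothesis = []
    pred_state = prediction_net.initial_state()
    pred_out, pred_state = prediction_net.step(BLANK, pred_state)
    for enc_t in encoder_outputs:                 # one encoder vector per frame
        for _ in range(max_symbols):              # cap emitted symbols per frame
            symbol = int(np.argmax(joint_net(enc_t, pred_out)))
            if symbol == BLANK:
                break                             # advance to the next frame
            hypothesis.append(symbol)
            # Feedback loop: condition the next prediction on this output.
            pred_out, pred_state = prediction_net.step(symbol, pred_state)
    return hypothesis

class RandomPredictionNet:
    """Random stand-in for the trained LSTM prediction network."""
    def __init__(self, vocab_size=30, dim=8, seed=0):
        self.embed = np.random.default_rng(seed).normal(size=(vocab_size, dim))
    def initial_state(self):
        return None
    def step(self, symbol, state):
        return self.embed[symbol], state

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(5, 8))         # 5 audio frames, dimension 8
W = rng.normal(size=(16, 30))                     # random stand-in joint network
joint_net = lambda enc, pred: np.concatenate([enc, pred]) @ W
print(rnnt_greedy_decode(encoder_outputs, RandomPredictionNet(), joint_net))
```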
Training such models efficiently was already difficult, but with our development of a new training technique that further reduced the word error rate by 5%, it became even more computationally intensive. To deal with this, we developed a parallel implementation so the RNN-T loss function could run efficiently in large batches on Google's high-performance Cloud TPU v2 hardware. This yielded an approximate 3x speedup in training.

Offline Recognition
In a traditional speech recognition engine, the acoustic, pronunciation, and language models we described above are "composed" together into a large search graph whose edges are labeled with the speech units and their probabilities. When a speech waveform is presented to the recognizer, a "decoder" searches this graph for the path of highest likelihood, given the input signal, and reads out the word sequence that path takes. Typically, the decoder assumes a Finite State Transducer (FST) representation of the underlying models. Yet, despite sophisticated decoding techniques, the search graph remains quite large, almost 2GB for our production models. Since this is not something that could be hosted easily on a mobile phone, this method requires online connectivity to work properly.

To improve the usefulness of speech recognition, we sought to avoid the latency and inherent unreliability of communication networks by hosting the new models directly on device. As such, our end-to-end approach does not need a search over a large decoder graph. Instead, decoding consists of a beam search through a single neural network. The RNN-T we trained offers the same accuracy as the traditional server-based models but is only 450MB, essentially making a smarter use of parameters and packing information more densely. However, even on today's smartphones, 450MB is a lot, and propagating signals through such a large network can be slow.

We further reduced the model size by using the parameter quantization and hybrid kernel techniques we developed in 2016 and made publicly available through the model optimization toolkit in the TensorFlow Lite library. Model quantization delivered a 4x compression with respect to the trained floating point models and a 4x speedup at run-time, enabling our RNN-T to run faster than real time speech on a single core. After compression, the final model is 80MB.
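For reference, post-training dynamic-range ("hybrid") quantization of the kind described above can be applied with the TensorFlow Lite converter roughly as follows; this is a generic sketch using current TensorFlow 2.x tooling, not the exact pipeline used for the Gboard recognizer, and the paths are placeholders.

```python
import tensorflow as tf

# Convert a trained SavedModel to TensorFlow Lite with post-training
# dynamic-range ("hybrid") quantization: weights are stored as 8-bit integers,
# shrinking the model roughly 4x while keeping float activations.
converter = tf.lite.TFLiteConverter.from_saved_model('/path/to/saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model_quantized.tflite', 'wb') as f:
    f.write(tflite_model)
```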

Our new all-neural, on-device Gboard speech recognizer is initially being launched to all Pixel phones in American English only. Given the trends in the industry, with the convergence of specialized hardware and algorithmic improvements, we are hopeful that the techniques presented here can soon be adopted in more languages and across broader domains of application.

Acknowledgements:
Raziel Alvarez, Michiel Bacchiani, Tom Bagby, Françoise Beaufays, Deepti Bhatia, Shuo-yiin Chang, Yanzhang He, Alex Gruenstein, Anjuli Kannan, Bo Li, Qiao Liang, Ian McGraw, Ruoming Pang, Rohit Prabhavalkar, Golan Pundak, Kanishka Rao, David Rybach, Tara Sainath, Haşim Sak, June Yuan Shangguan, Matt Shannon, Mohammadinamul Sheik, Khe Chai Sim, Gabor Simko, Trevor Strohman, Mirkó Visontai, Yonghui Wu, Ding Zhao, Dan Zivkovic.

Source: Google AI Blog