Tag Archives: machine learning

Announcing Open Images V5 and the ICCV 2019 Open Images Challenge



In 2016, we introduced Open Images, a collaborative release of ~9 million images annotated with labels spanning thousands of object categories. Since then we have rolled out several updates, culminating with Open Images V4 in 2018. In total, that release included 15.4M bounding-boxes for 600 object categories, making it the largest existing dataset with object location annotations, as well as over 300k visual relationship annotations.

Today we are happy to announce Open Images V5, which adds segmentation masks to the set of annotations, along with the second Open Images Challenge, which will feature a new instance segmentation track based on this data.

Open Images V5
Open Images V5 features segmentation masks for 2.8 million object instances in 350 categories. Unlike bounding-boxes, which only identify regions in which an object is located, segmentation masks mark the outline of objects, characterizing their spatial extent to a much higher level of detail. We have put particular effort into ensuring consistent annotations across different objects (e.g., all cat masks include their tail; bags carried by camels or persons are included in their mask). Importantly, these masks cover a broader range of object categories and a larger total number of instances than any previous dataset.

Example masks on the training set of Open Images V5. These have been produced by our interactive segmentation process. The first example also shows a bounding box, for comparison. From left to right, top to bottom: Tea and cake at the Fitzwilliam Museum by Tim Regan, Pilota II by Euskal kultur erakundea Institut culturel basque, Rheas by Dag Peak, Wuxi science park, 1995 by Gary Stevens, Cat Cafe Shinjuku calico by Ari Helminen, and Untitled by Todd Huffman. All images used under CC BY 2.0 license.
The segmentation masks on the training set (2.68M) have been produced by our state-of-the-art interactive segmentation process, where professional human annotators iteratively correct the output of a segmentation neural network. This is more efficient than manual drawing alone, while at the same time delivering accurate masks (intersection-over-union 84%). Additionally, we release 99k masks on the validation and test sets, which have been annotated manually with a strong focus on quality. These are near-perfect and capture even fine details of complex object boundaries (e.g. spiky flowers and thin structures in man-made objects). Both our training and validation+test annotations offer more accurate object boundaries than the polygon annotations provided by most existing datasets.

Example masks on the validation and test sets of Open Images V5, drawn completely manually. From left to right: thistle flowers by sophie, still life with ax by liz west, Fischkutter KOŁ-180 in Kolobrzeg (PL) by zeesenboot. All images used under CC BY 2.0 license.
In addition to the masks, we also added 6.4M new human-verified image-level labels, reaching a total of 36.5M over nearly 20,000 categories. Finally, we improved annotation density for 600 object categories on the validation and test sets, adding more than 400k bounding boxes to match the density in the training set. This ensures more precise evaluation of object detection models.

Open Images Challenge 2019
In conjunction with this release, we are also introducing the second Open Images Challenge, to be held at the 2019 International Conference on Computer Vision (ICCV 2019). This Challenge will have a new instance segmentation track based on the data above. Moreover, as in the 2018 edition, it will also feature a large-scale object detection track (500 categories with 12.2M training bounding-boxes), and a visual relationship detection track for detecting pairs of objects in particular relations (329 relationship triplets with 375k training samples, e.g., “woman playing guitar” or “beer on table”).

The training set with all annotations is available now. The test set has the same 100k images as the 2018 Challenge and will be launched again on June 3rd, 2019 by Kaggle. The evaluation servers will open on June 3rd for the object detection and visual relationship tracks, and on July 1st for the instance segmentation track. The deadline for submission of results is October 1st, 2019.

We hope that the exceptionally large and diverse training set will inspire research into more advanced instance segmentation models. The extremely accurate ground-truth masks we provide rewards subtle improvements in the output segmentations, and thus will encourage the development of higher-quality models that deliver precise boundaries. Finally, having a single dataset with unified annotations for image classification, object detection, visual relationship detection, and instance segmentation will enable researchers to study these tasks jointly and stimulate progress towards genuine scene understanding.

Source: Google AI Blog


Announcing Google-Landmarks-v2: An Improved Dataset for Landmark Recognition & Retrieval



Last year we released Google-Landmarks, the largest world-wide landmark recognition dataset available at that time. In order to foster advancements in research on instance-level recognition (recognizing specific instances of objects, e.g. distinguishing Niagara Falls from just any waterfall) and image retrieval (matching a specific object in an input image to all other instances of that object in a catalog of reference images), we also hosted two Kaggle challenges, Landmark Recognition 2018 and Landmark Retrieval 2018, in which more than 500 teams of researchers and machine learning (ML) enthusiasts participated. However, both instance recognition and image retrieval methods require ever larger datasets in both the number of images and the variety of landmarks in order to train better and more robust systems.

In support of this goal, this year we are releasing Google-Landmarks-v2, a completely new, even larger landmark recognition dataset that includes over 5 million images (2x that of the first release) of more than 200 thousand different landmarks (an increase of 7x). Due to the difference in scale, this dataset is much more diverse and creates even greater challenges for state-of-the-art instance recognition approaches. Based on this new dataset, we are also announcing two new Kaggle challenges—Landmark Recognition 2019 and Landmark Retrieval 2019—and releasing the source code and model for Detect-to-Retrieve, a novel image representation suitable for retrieval of specific object instances.
Heatmap of the landmark locations in Google-Landmarks-v2, which demonstrates the increase in the scale of the dataset and the improved geographic coverage compared to last year’s dataset.
Creating the Dataset
A particular problem in preparing Google-Landmarks-v2 was the generation of instance labels for the landmarks represented, since it is virtually impossible for annotators to recognize all of the hundreds of thousands of landmarks that could potentially be present in a given photo. Our solution to this problem was to crowdsource the landmark labeling through the efforts of a world-spanning community of hobby photographers, each familiar with the landmarks in their region.
Selection of images from Google-Landmarks-v2. Landmarks include (left to right, top to bottom) Neuschwanstein Castle, Golden Gate Bridge, Kiyomizu-dera, Burj khalifa, Great Sphinx of Giza, and Machu Picchu.
Another issue for research datasets is the requirement that images be shared freely and stored indefinitely, so that the dataset can be used to track the progress of research over a long period of time. As such, we sourced the Google-Landmarks-v2 images through Wikimedia Commons, capturing both world-famous and lesser-known, local landmarks while ensuring broad geographic coverage (thanks in part to Wiki Loves Monuments) and photos sourced from public institutions, including historical photographs that are valuable to test instance recognition over time.

The Kaggle Challenges
The goal of the Landmark Recognition 2019 challenge is to recognize a landmark presented in a query image, while the goal of Landmark Retrieval 2019 is to find all images showing that landmark. The challenges include cash prizes totaling $50,000 and the winning teams will be invited to present their methods at the Second Landmark Recognition Workshop at CVPR 2019.

Open Sourcing our Model
To foster research reproducibility and help push the field of instance recognition forward, we are also releasing open-source code for our new technique, called Detect-to-Retrieve (which will be presented as a paper in CVPR 2019). This new method leverages bounding boxes from an object detection model to give extra weight to image regions containing the class of interest, which significantly improves accuracy. The model we are releasing is trained on a subset of 86k images from the original Google-Landmarks dataset that were annotated with landmark bounding boxes. We are making these annotations available along with the original dataset here.

We invite researchers and ML enthusiasts to participate in the Landmark Recognition 2019 and Landmark Retrieval 2019 Kaggle challenges and to join the Second Landmark Recognition Workshop at CVPR 2019. We hope that this dataset will help advance the state-of-the-art in instance recognition and image retrieval. The data is being made available via the Common Visual Data Foundation.

Acknowledgments
The core contributors to this project are Andre Araujo, Bingyi Cao, Jack Sim and Tobias Weyand. We would like to thank our team members Daniel Kim, Emily Manoogian, Nicole Maffeo, and Hartwig Adam for their kind help. Thanks also to Marvin Teichmann and Menglong Zhu for their contribution to collecting the landmark bounding boxes and developing the Detect-to-Retrieve technique. We would like to thank Will Cukierski and Maggie Demkin for their help organizing the Kaggle challenge, Elan Hourticolon-Retzler, Yuan Gao, Qin Guo, Gang Huang, Yan Wang, Zhicheng Zheng for their help with data collection, Tsung-Yi Lin for his support with CVDF hosting, as well as our CVPR workshop co-organizers Bohyung Han, Shih-Fu Chang, Ondrej Chum, Torsten Sattler, Giorgos Tolias, and Xu Zhang. We have great appreciation for the Wikimedia Commons Community and their volunteer contributions to an invaluable photographic archive of the world’s cultural heritage. And finally, we’d like to thank the Common Visual Data Foundation for hosting the dataset.

Source: Google AI Blog


Evaluating the Unsupervised Learning of Disentangled Representations



The ability to understand high-dimensional data, and to distill that knowledge into useful representations in an unsupervised manner, remains a key challenge in deep learning. One approach to solving these challenges is through disentangled representations, models that capture the independent features of a given scene in such a way that if one feature changes, the others remain unaffected. If done successfully, a machine learning system that is designed to navigate the real world, such as a self driving car or a robot, can disentangle the different factors and properties of objects and their surroundings, enabling the generalization of knowledge to previously unobserved situations. While, unsupervised disentanglement methods have already been used for curiosity driven exploration, abstract reasoning, visual concept learning and domain adaptation for reinforcement learning, recent progress in the field makes it difficult to know how well different approaches work and the extent of their limitations.

In "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations" (to appear at ICML 2019), we perform a large-scale evaluation on recent unsupervised disentanglement methods, challenging some common assumptions in order to suggest several improvements to future work on disentanglement learning. This evaluation is the result of training more than 12,000 models covering most prominent methods and evaluation metrics in a reproducible large-scale experimental study on seven different data sets. Importantly, we have also released both the code used in this study as well as more than 10,000 pretrained disentanglement models. The resulting library, disentanglement_lib, allows researchers to bootstrap their own research in this field and to easily replicate and verify our empirical results.

Understanding Disentanglement
To better understand the ground-truth properties of an image that can be encoded in a disentangled representation, first consider the ground-truth factors of the data set Shapes3D. In this toy model, shown in the figure below, each panel represents one factor that could be encoded into a vector representation of the image. The model shown is defined by the shape of the object in the middle of the image, its size, the rotation of the camera and the color of the floor, the wall and the object.
Visualization of the ground-truth factors of the Shapes3D data set: Floor color (upper left), wall color (upper middle), object color (upper right), object size (bottom left), object shape (bottom middle), and camera angle (bottom right).
The goal of disentangled representations is to build models that can capture these explanatory factors in a vector. The figure below presents a model with a 10-dimensional representation vector. Each of the 10 panels visualizes what information is captured in one of the 10 different coordinates of the representation. From the top right and the top middle panel we see that the model has successfully disentangled floor color, while the two bottom left panels indicate that object color and size are still entangled.
Visualization of the latent dimensions learned by a FactorVAE model (see below). The ground-truth factors wall and floor color as well as rotation of the camera are disentangled (see top right, top center and bottom center panels), while the ground-truth factors object shape, size and color are entangled (see top left and the two bottom left images).
Key Results of this Reproducible Large-scale Study
While the research community has proposed a variety of unsupervised approaches to learn disentangled representations based on variational autoencoders and has devised different metrics to quantify their level of disentanglement, to our knowledge no large-scale empirical study has evaluated these approaches in a unified manner. We propose a fair, reproducible experimental protocol to benchmark the state of unsupervised disentanglement learning by implementing six different state-of-the-art models (BetaVAE, AnnealedVAE, FactorVAE, DIP-VAE I/II and Beta-TCVAE) and six disentanglement metrics (BetaVAE score, FactorVAE score, MIG, SAP, Modularity and DCI Disentanglement). In total, we train and evaluate 12,800 such models on seven data sets. Key findings of our study include:
  • We do not find any empirical evidence that the considered models can be used to reliably learn disentangled representations in an unsupervised way, since random seeds and hyperparameters seem to matter more than the model choice. In other words, even if one trains a large number of models and some of them are disentangled, these disentangled representations seemingly cannot be identified without access to ground-truth labels. Furthermore, good hyperparameter values do not appear to consistently transfer across the data sets in our study. These results are consistent with the theorem we present in the paper, which states that the unsupervised learning of disentangled representations is impossible without inductive biases on both the data set and the models (i.e., one has to make assumptions about the data set and incorporate those assumptions into the model).
  • For the considered models and data sets, we cannot validate the assumption that disentanglement is useful for downstream tasks, e.g., that with disentangled representations it is possible to learn with fewer labeled observations.
The figure below demonstrates some of these findings. The choice of random seed across different runs has a larger impact on disentanglement scores than the model choice and the strength of regularization (while naively one might expect that more regularization should always lead to more disentanglement). A good run with a bad hyperparameter can easily beat a bad run with a good hyperparameter.
The violin plots show the distribution of FactorVAE scores attained by different models on the Cars3D data set. The left plot shows how the distribution changes as different disentanglement models are considered while the right plot displays the different distributions as the regularization strength in a FactorVAE model is varied. The key observation is that the violin plots substantially overlap which indicates that all methods strongly depend on the random seed.
Based on these results, we make four observations relevant to future research:
  1. Given the theoretical result that the unsupervised learning of disentangled representations without inductive biases is impossible, future work should clearly describe the imposed inductive biases and the role of both implicit and explicit supervision.
  2. Finding good inductive biases for unsupervised model selection that work across multiple data sets persists as a key open problem.
  3. The concrete practical benefits of enforcing a specific notion of disentanglement of the learned representations should be demonstrated. Promising directions include robotics, abstract reasoning and fairness.
  4. Experiments should be conducted in a reproducible experimental setup on a diverse selection of data sets.
Open Sourcing disentanglement_lib
In order for others to verify our results, we have released disentanglement_lib, the library we used to create the experimental study. It contains open-source implementations of the considered disentanglement methods and metrics, a standardized training and evaluation protocol, as well as visualization tools to better understand trained models.

The advantages of this library are three-fold. First, with less than four shell commands disentanglement_lib can be used to reproduce any of the models in our study. Second, researchers may easily modify our study to test additional hypotheses. Third, disentanglement_lib is easily extendible and can be used to bootstrap research into the learning of disentangled representations—it is easy to implement new models and compare them to our reference implementation using a fair, reproducible experimental setup.

Reproducing all the models in our study requires a computational effort of approximately 2.5 GPU years, which can be prohibitive. So, we have also released >10,000 pretrained disentanglement_lib models from our study that can be used together with disentanglement_lib.

We hope that this will accelerate research in this field by allowing other researchers to benchmark their new models against our pretrained models and to test new disentanglement metrics and visualization approaches on a diverse set of models.

Acknowledgments
This research was done in collaboration with Francesco Locatello, Mario Lucic, Stefan Bauer, Gunnar Rätsch, Sylvain Gelly and Bernhard Schölkopf at Google AI Zürich, ETH Zürich and the Max-Planck Institute for Intelligent Systems. We also wish to thank Josip Djolonga, Ilya Tolstikhin, Michael Tschannen, Sjoerd van Steenkiste, Joan Puigcerver, Marcin Michalski, Marvin Ritter, Irina Higgins and the rest of the Google Brain team for helpful discussions, comments, technical help and code contributions.

Source: Google AI Blog


Updates from Coral: A new compiler and much more

Posted by Vikram Tank (Product Manager), Coral Team

Coral has been public for about a month now, and we’ve heard some great feedback about our products. As we evolve the Coral platform, we’re making our products easier to use and exposing more powerful tools for building devices with on-device AI.

Today, we're updating the Edge TPU model compiler to remove the restrictions around specific architectures, allowing you to submit any model architecture that you want. This greatly increases the variety of models that you can run on the Coral platform. Just be sure to review the TensorFlow ops supported on Edge TPU and model design requirements to take full advantage of the Edge TPU at runtime.

We're also releasing a new version of Mendel OS (3.0 Chef) for the Dev Board with a new board management tool called Mendel Development Tool (MDT).

To help with the developer workflow, our new C++ API works with the TensorFlow Lite C++ API so you can execute inferences on an Edge TPU. In addition, both the Python and C++ APIs now allow you to run multiple models in parallel, using multiple Edge TPU devices.

In addition to these updates, we’re adding new capabilities to Coral with the release of the Environmental Sensor Board. It’s an accessory board for the Coral Dev Platform (and Raspberry Pi) that brings sensor input to your models. It has integrated light, temperature, humidity, and barometric sensors, and the ability to add more sensors via it's four Grove connectors. The secure element on-board also allows for easy communication with the Google Cloud IOT Core.

The team has also been working with partners to help them evaluate whether Coral is the right fit for their products. We’re excited that Oivi has chosen us to be the base platform of their new handheld AI-camera. This product will help prevent blindness among diabetes patients by providing early, automated detection of diabetic retinopathy. Anders Eikenes, CEO of Oivi, says “Oivi is dedicated towards providing patient-centric eye care for everyone - including emerging markets. We were honoured to be selected by Google to participate in their Coral alpha program, and are looking forward to our continued cooperation. The Coral platform gives us the ability to run our screening ML models inside a handheld device; greatly expanding the access and ease of diabetic retinopathy screening.”

Finally, we’re expanding our distributor network to make it easier to get Coral boards into your hands around the world. This month, Seeed and NXP will begin to sell Coral products, in addition to Mouser.

We're excited to keep evolving the Coral platform, please keep sending us feedback at [email protected].

You can see the full release notes on Coral site.

Reducing the Need for Labeled Data in Generative Adversarial Networks



Generative adversarial networks (GANs) are a powerful class of deep generative models.The main idea behind GANs is to train two neural networks: the generator, which learns how to synthesise data (such as an image), and the discriminator, which learns how to distinguish real data from the ones synthesised by the generator. This approach has been successfully used for high-fidelity natural image synthesis, improving learned image compression, data augmentation, and more.
Evolution of the generated samples as training progresses on ImageNet. The generator network is conditioned on the class (e.g., "great gray owl" or "golden retriever").
For natural image synthesis, state-of-the-art results are achieved by conditional GANs that, unlike unconditional GANs, use labels (e.g. car, dog, etc.) during training. While this makes the task easier and leads to significant improvements, this approach requires a large amount of labeled data that is rarely available in practice.

In "High-Fidelity Image Generation With Fewer Labels", we propose a new approach to reduce the amount of labeled data required to train state-of-the-art conditional GANs. When combined with recent advancements on large-scale GANs, we match the state-of-the-art in high-fidelity natural image synthesis using 10x fewer labels. Based on this research, we are also releasing a major update to the Compare GAN library, which contains all the components necessary to train and evaluate modern GANs.

Improvements via Semi-supervision and Self-supervision
In conditional GANs, both the generator and discriminator are typically conditioned on class labels. In this work, we propose to replace the hand-annotated ground truth labels with inferred ones. To infer high-quality labels for a large dataset of mostly unlabeled data, we take a two-step approach: First, we learn a feature representation using only the unlabeled portion of the dataset. To learn the feature representations we make use of self-supervision in the form of a recently introduced approach, in which the unlabeled images are randomly rotated and a deep convolutional neural network is tasked with predicting the rotation angle. The idea is that the models need to be able to recognize the main objects and their shapes in order to be successful on this task.
An unlabeled image is randomly rotated and the network is tasked with predicting the rotation angle. Successful models need to capture semantically meaningful image features which can then be used for other vision tasks.
We then consider the activation pattern of one of the intermediate layers of the trained network as the new feature representation of the input, and train a classifier to recognize the label of that input using the labeled portion of the original data set. As the network was pre-trained to extract semantically meaningful features from the data (on the rotation prediction task), training this classifier is more sample-efficient than training the entire network from scratch. Finally, we use this classifier to label the unlabeled data.

To further improve the model quality and training stability we encourage the discriminator network to learn meaningful feature representations which are not forgotten during training by means of an auxiliary loss we introduced previously. These two advancements, combined with large-scale training lead to state-of-the-art conditional GANs for the task of ImageNet synthesis as measured by the Fréchet Inception Distance.
Given a latent vector the generator network produces an image. In each row, linear interpolation between the latent codes of the leftmost and the rightmost image results in a semantic interpolation in the image space.
Compare GAN: A Library for Training and Evaluating GANs
Cutting-edge research on GANs is heavily dependent on a well-engineered and well-tested codebase, since even replicating prior results and techniques requires a significant effort. In order to foster open science and allow the research community benefit from recent advancements, we are releasing a major update of the Compare GAN library. The library includes loss functions, regularization and normalization schemes, neural architectures, and quantitative metrics commonly used in modern GANs, and now supports:
Conclusions and Future Work
Given the growing gap between labeled and unlabeled data sources, it is becoming increasingly important to be able to learn from only partially labeled data. We have shown that a simple yet powerful combination of self-supervision and semi-supervision can help to close this gap for GANs. We believe that self-supervision is a powerful idea that should be investigated for other generative modeling tasks.

Acknowledgments
Work conducted in collaboration with colleagues on the Google Brain team in Zürich, ETH Zürich and UCLA. We would like to thank our paper co-authors Michael Tschannen, Xiaohua Zhai, Olivier Bachem and Sylvain Gelly for their input and feedback. We would like to thank Alexander Kolesnikov, Lucas Beyer and Avital Oliver for helpful discussion on self-supervised learning and semi-supervised learning. We would like to thank Karol Kurach and Marcin Michalski for their major contributions to the Compare GAN library. We would also like to thank Andy Brock, Jeff Donahue and Karen Simonyan for their insights into training GANs on TPUs. The work described in this post also builds upon our work on “Self-Supervised Generative Adversarial Networks” with Ting Chen and Neil Houlsby.

Source: Google AI Blog


This is the Future of Finance

Posted by Roy Glasberg, Head of Launchpad

Launchpad's mission is to accelerate innovation and to help startups build world-class technologies by leveraging the best of Google - its people, network, research, and technology.

In September 2018, the Launchpad team welcomed ten of the world's leading FinTech startups to join their accelerator program, helping them fast-track their application of advanced technology. Today, March 15th, we will see this cohort graduate from the program at the Launchpad team's inaugural event - The Future of Finance - a global discussion on the impact of applied ML/AI on the finance industry. These startups are ensuring that everyone has relevant insights at their fingertips and that all people, no matter where they are, have access to equitable money, banking, loans, and marketplaces.

Tune into the event from wherever you are via the livestream link

The Graduating Class of Launchpad FinTech Accelerator San Francisco'19

  • Alchemy (USA), bridging blockchain and the real world
  • Axinan (Singapore), providing smart insurance for the digital economy
  • Aye Finance (India), transforming financing in India
  • Celo (USA), increasing financial inclusion through a mobile-first cryptocurrency
  • Frontier Car Group (Germany), investing in the transformation of used-car marketplaces
  • GO-JEK (Indonesia), improving the welfare and livelihoods of informal sectors
  • GuiaBolso (Brazil), improving the financial lives of Brazilians
  • JUMO (South Africa), creating a transparent, fair money marketplace for mobile users to access loans
  • m.Paani (India), (em)powering local retailers and the next billion users in India
  • Starling Bank (UK), improving financial health with a 100% mobile-only bank

Since joining the accelerator, these startups have made great strides and are going from strength to strength. Some recent announcements from this cohort include:

  • JUMO have announced the launch of Opportunity Co, a 500M fund for credit where all the profits go back to the customers.
  • The team at Aye Finance have just closed $30m in Series D equity round.
  • Starling Bank has provided 150 new jobs in Southampton and have received a £100m grant from a fund aimed at increasing competition and innovation in the British banking sector, and also a £75m fundraise.
  • GuiaBolso ran a campaign to pay the bills of some its users (the beginning of the year in Brazil is a time of high expenses and debts) and is having a significant impact on credit with 80% of cases seeing interest rates on loans being cheaper than traditional banks.

We look forward to following the success of all our participating founders as they continue to make a significant impact on the global economy.

Want to know more about the Launchpad Accelerator? Visit our site, stay updated on developments and future opportunities by subscribing to the Google Developers newsletter and visit The Launchpad Blog.

Harnessing Organizational Knowledge for Machine Learning



One of the biggest bottlenecks in developing machine learning (ML) applications is the need for the large, labeled datasets used to train modern ML models. Creating these datasets involves the investment of significant time and expense, requiring annotators with the right expertise. Moreover, due to the evolution of real-world applications, labeled datasets often need to be thrown out or re-labeled.

In collaboration with Stanford and Brown University, we present "Snorkel Drybell: A Case Study in Deploying Weak Supervision at Industrial Scale," which explores how existing knowledge in an organization can be used as noisier, higher-level supervision—or, as it is often termed, weak supervision—to quickly label large training datasets. In this study, we use an experimental internal system, Snorkel Drybell, which adapts the open-source Snorkel framework to use diverse organizational knowledge resources—like internal models, ontologies, legacy rules, knowledge graphs and more—in order to generate training data for machine learning models at web scale. We find that this approach can match the efficacy of hand-labeling tens of thousands of data points, and reveals some core lessons about how training datasets for modern machine learning models can be created in practice.

Rather than labeling training data by hand, Snorkel DryBell enables writing labeling functions that label training data programmatically. In this work, we explored how these labeling functions can capture engineers' knowledge about how to use existing resources as heuristics for weak supervision. As an example, suppose our goal is to identify content related to celebrities. One can leverage an existing named-entity recognition (NER) model for this task by labeling any content that does not contain a person as not related to celebrities. This illustrates how existing knowledge resources (in this case, a trained model) can be combined with simple programmatic logic to label training data for a new model. Note also, importantly, that this labeling function returns None---i.e. abstains---in many cases, and thus only labels some small part of the data; our overall goal is to use these labels to train a modern machine learning model that can generalize to new data.

In our example of a labeling function, rather than hand-labeling a data point (1), one utilizes an existing knowledge resource—in this case, a NER model (2)—together with some simple logic expressed in code (3) to heuristically label data.
This programmatic interface for labeling training data is much faster and more flexible than hand-labeling individual data points, but the resulting labels are obviously of much lower quality than manually-specified labels. The labels generated by these labeling functions will often overlap and disagree, as the labeling functions may not only have arbitrary unknown accuracies, but may also be correlated in arbitrary ways (for example, from sharing a common data source or heuristic).

To solve the problem of noisy and correlated labels, Snorkel DryBell uses a generative modeling technique to automatically estimate the accuracies and correlations of the labeling functions in a provably consistent way—without any ground truth training labels—then uses this to re-weight and combine their outputs into a single probabilistic label per data point. At a high level, we rely on the observed agreements and disagreements between the labeling functions (the covariance matrix), and learn the labeling function accuracy and correlation parameters that best explain this observed output using a new matrix completion-style approach. The resulting labels can then be used to train an arbitrary model (e.g. in TensorFlow), as shown in the system diagram below.

Using Diverse Knowledge Sources as Weak Supervision
To study the efficacy of Snorkel Drybell, we used three production tasks and corresponding datasets, aimed at classifying topics in web content, identifying mentions of certain products, and detecting certain real-time events. Using Snorkel DryBell, we were able to make use of various existing or quickly specified sources of information such as:
  • Heuristics and rules: e.g. existing human-authored rules about the target domain.
  • Topic models, taggers, and classifiers: e.g. machine learning models about the target domain or a related domain.
  • Aggregate statistics: e.g. tracked metrics about the target domain.
  • Knowledge or entity graphs: e.g. databases of facts about the target domain.
In Snorkel DryBell, the goal is to train a machine learning model (C), for example to do content or event classification over web data. Rather than hand-labeling training data to do this, in Snorkel DryBell users write labeling functions that express various organizational knowledge resources (A), which are then automatically reweighted and combined (B).
We used these organizational knowledge resources to write labeling functions in a MapReduce template-based pipeline. Each labeling function takes in a data point and either abstains, or outputs a label. The result is a large set of programmatically-generated training labels. However, many of these labels were very noisy (e.g. from the heuristics), conflicted with each other, or were far too coarse-grained (e.g. the topic models) for our task, leading to the next stage of Snorkel DryBell, aimed at automatically cleaning and integrating the labels into a final training set.

Modeling the Accuracies to Combine & Repurpose Existing Sources
To handle these noisy labels, the next stage of Snorkel DryBell combines the outputs from the labeling functions into a single, confidence-weighted training label for each data point. The challenging technical aspect is that this must be done without any ground-truth labels. We use a generative modeling technique that learns the accuracy of each labeling function using only unlabeled data. This technique learns by observing the matrix of agreements and disagreements between the labeling functions' outputs, taking into account known (or statistically estimated) correlation structures between them. In Snorkel DryBell, we also implement a new faster, sampling-free version of this modeling approach, implemented in TensorFlow, in order to handle web-scale data.

By combining and modeling the output of the labeling functions using this procedure in Snorkel DryBell, we were able to generate high-quality training labels. In fact, on the two applications where hand-labeled training data was available for comparison, we achieved the same predictive accuracy training a model with Snorkel DryBell's labels as we did when training that same model with 12,000 and 80,000 hand-labeled training data points.

Transferring Non-Servable Knowledge to Servable Models
In many settings, there is also an important distinction between servable features—which can be used in production—and non-servable features, that are too slow or expensive to be used in production. These non-servable features may have very rich signal, but a general question is how to use them to train or otherwise help servable models that can be deployed in production?


In many settings, users write labeling functions that leverage organizational knowledge resources that are not servable in production (a)—e.g. aggregate statistics, internal models, or knowledge graphs that are too slow or expensive to use in production—in order to train models that are only defined over production-servable features (b), e.g. cheap, real-time web signals.
In Snorkel DryBell, we found that users could write the labeling functions—i.e. express their organizational knowledge—over one feature set that was not servable, and then use the resulting training labels output by Snorkel DryBell to train a model defined over a different, servable feature set. This cross-feature transfer boosted our performance by an average 52% on the benchmark datasets we created. More broadly, it represents a simple but powerful way to use resources that are too slow (e.g. expensive models or aggregate statistics), private (e.g. entity or knowledge graphs), or otherwise unsuitable for deployment, to train servable models over cheap, real-time features. This approach can be viewed as a new type of transfer learning, where instead of transferring a model between different datasets, we're transferring domain knowledge between different feature sets- an approach which has potential use cases not just in industry, but in medical settings and beyond.

Next Steps
Moving forward, we're excited to see what other types of organizational knowledge can be used as weak supervision, and how the approach used by Snorkel DryBell can enable new modes of information reuse and sharing across organizations. For more details, check out our paper, and for further technical details, blog posts, and tutorials, check out the open-source Snorkel implementation at snorkel.stanford.edu.

Acknowledgments
This research was done in collaboration between Google, Stanford, and Brown. We would like to thank all the people who were involved, including Stephen Bach (Brown), Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Souvik Sen, Braden Hancock (Stanford), Houman Alborzi, Rahul Kuchhal, Christopher Ré (Stanford), Rob Malkin.

Source: Google AI Blog


An All-Neural On-Device Speech Recognizer



In 2012, speech recognition research showed significant accuracy improvements with deep learning, leading to early adoption in products such as Google's Voice Search. It was the beginning of a revolution in the field: each year, new architectures were developed that further increased quality, from deep neural networks (DNNs) to recurrent neural networks (RNNs), long short-term memory networks (LSTMs), convolutional networks (CNNs), and more. During this time, latency remained a prime focus — an automated assistant feels a lot more helpful when it responds quickly to requests.

Today, we're happy to announce the rollout of an end-to-end, all-neural, on-device speech recognizer to power speech input in Gboard. In our recent paper, "Streaming End-to-End Speech Recognition for Mobile Devices", we present a model trained using RNN transducer (RNN-T) technology that is compact enough to reside on a phone. This means no more network latency or spottiness — the new recognizer is always available, even when you are offline. The model works at the character level, so that as you speak, it outputs words character-by-character, just as if someone was typing out what you say in real-time, and exactly as you'd expect from a keyboard dictation system.
This video compares the production, server-side speech recognizer (left panel) to the new on-device recognizer (right panel) when recognizing the same spoken sentence. Video credit: Akshay Kannan and Elnaz Sarbar
A Bit of History
Traditionally, speech recognition systems consisted of several components - an acoustic model that maps segments of audio (typically 10 millisecond frames) to phonemes, a pronunciation model that connects phonemes together to form words, and a language model that expresses the likelihood of given phrases. In early systems, these components remained independently-optimized.

Around 2014, researchers began to focus on training a single neural network to directly map an input audio waveform to an output sentence. This sequence-to-sequence approach to learning a model by generating a sequence of words or graphemes given a sequence of audio features led to the development of "attention-based" and "listen-attend-spell" models. While these models showed great promise in terms of accuracy, they typically work by reviewing the entire input sequence, and do not allow streaming outputs as the input comes in, a necessary feature for real-time voice transcription.

Meanwhile, an independent technique called connectionist temporal classification (CTC) had helped halve the latency of the production recognizer at that time. This proved to be an important step in creating the RNN-T architecture adopted in this latest release, which can be seen as a generalization of CTC.

Recurrent Neural Network Transducers
RNN-Ts are a form of sequence-to-sequence models that do not employ attention mechanisms. Unlike most sequence-to-sequence models, which typically need to process the entire input sequence (the waveform in our case) to produce an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a property that is welcome for speech dictation. In our implementation, the output symbols are the characters of the alphabet. The RNN-T recognizer outputs characters one-by-one, as you speak, with white spaces where appropriate. It does this with a feedback loop that feeds symbols predicted by the model back into it to predict the next symbols, as described in the figure below.
Representation of an RNN-T, with the input audio samples, x, and the predicted symbols y. The predicted symbols (outputs of the Softmax layer) are fed back into the model through the Prediction network, as yu-1, ensuring that the predictions are conditioned both on the audio samples so far and on past outputs. The Prediction and Encoder Networks are LSTM RNNs, the Joint model is a feedforward network (paper). The Prediction Network comprises 2 layers of 2048 units, with a 640-dimensional projection layer. The Encoder Network comprises 8 such layers. Image credit: Chris Thornton
Training such models efficiently was already difficult, but with our development of a new training technique that further reduced the word error rate by 5%, it became even more computationally intensive. To deal with this, we developed a parallel implementation so the RNN-T loss function could run efficiently in large batches on Google's high-performance Cloud TPU v2 hardware. This yielded an approximate 3x speedup in training.

Offline Recognition
In a traditional speech recognition engine, the acoustic, pronunciation, and language models we described above are "composed" together into a large search graph whose edges are labeled with the speech units and their probabilities. When a speech waveform is presented to the recognizer, a "decoder" searches this graph for the path of highest likelihood, given the input signal, and reads out the word sequence that path takes. Typically, the decoder assumes a Finite State Transducer (FST) representation of the underlying models. Yet, despite sophisticated decoding techniques, the search graph remains quite large, almost 2GB for our production models. Since this is not something that could be hosted easily on a mobile phone, this method requires online connectivity to work properly.

To improve the usefulness of speech recognition, we sought to avoid the latency and inherent unreliability of communication networks by hosting the new models directly on device. As such, our end-to-end approach does not need a search over a large decoder graph. Instead, decoding consists of a beam search through a single neural network. The RNN-T we trained offers the same accuracy as the traditional server-based models but is only 450MB, essentially making a smarter use of parameters and packing information more densely. However, even on today's smartphones, 450MB is a lot, and propagating signals through such a large network can be slow.

We further reduced the model size by using the parameter quantization and hybrid kernel techniques we developed in 2016 and made publicly available through the model optimization toolkit in the TensorFlow Lite library. Model quantization delivered a 4x compression with respect to the trained floating point models and a 4x speedup at run-time, enabling our RNN-T to run faster than real time speech on a single core. After compression, the final model is 80MB.

Our new all-neural, on-device Gboard speech recognizer is initially being launched to all Pixel phones in American English only. Given the trends in the industry, with the convergence of specialized hardware and algorithmic improvements, we are hopeful that the techniques presented here can soon be adopted in more languages and across broader domains of application.

Acknowledgements:
Raziel Alvarez, Michiel Bacchiani, Tom Bagby, Françoise Beaufays, Deepti Bhatia, Shuo-yiin Chang, Yanzhang He, Alex Gruenstein, Anjuli Kannan, Bo Li, Qiao Liang, Ian McGraw, Ruoming Pang, Rohit Prabhavalkar, Golan Pundak, Kanishka Rao, David Rybach, Tara Sainath, Haşim Sak, June Yuan Shangguan, Matt Shannon, Mohammadinamul Sheik, Khe Chai Sim, Gabor Simko, Trevor Strohman, Mirkó Visontai, Yonghui Wu, Ding Zhao, Dan Zivkovic.

Source: Google AI Blog


Introducing Coral: Our platform for development with local AI

Posted by Billy Rutledge (Director) and Vikram Tank (Product Mgr), Coral Team

AI can be beneficial for everyone, especially when we all explore, learn, and build together. To that end, Google's been developing tools like TensorFlow and AutoML to ensure that everyone has access to build with AI. Today, we're expanding the ways that people can build out their ideas and products by introducing Coral into public beta.

Coral is a platform for building intelligent devices with local AI.

Coral offers a complete local AI toolkit that makes it easy to grow your ideas from prototype to production. It includes hardware components, software tools, and content that help you create, train and run neural networks (NNs) locally, on your device. Because we focus on accelerating NN's locally, our products offer speedy neural network performance and increased privacy — all in power-efficient packages. To help you bring your ideas to market, Coral components are designed for fast prototyping and easy scaling to production lines.

Our first hardware components feature the new Edge TPU, a small ASIC designed by Google that provides high-performance ML inferencing for low-power devices. For example, it can execute state-of-the-art mobile vision models such as MobileNet V2 at 100+ fps, in a power efficient manner.

Coral Camera Module, Dev Board and USB Accelerator

For new product development, the Coral Dev Board is a fully integrated system designed as a system on module (SoM) attached to a carrier board. The SoM brings the powerful NXP iMX8M SoC together with our Edge TPU coprocessor (as well as Wi-Fi, Bluetooth, RAM, and eMMC memory). To make prototyping computer vision applications easier, we also offer a Camera that connects to the Dev Board over a MIPI interface.

To add the Edge TPU to an existing design, the Coral USB Accelerator allows for easy integration into any Linux system (including Raspberry Pi boards) over USB 2.0 and 3.0. PCIe versions are coming soon, and will snap into M.2 or mini-PCIe expansion slots.

When you're ready to scale to production we offer the SOM from the Dev Board and PCIe versions of the Accelerator for volume purchase. To further support your integrations, we'll be releasing the baseboard schematics for those who want to build custom carrier boards.

Our software tools are based around TensorFlow and TensorFlow Lite. TF Lite models must be quantized and then compiled with our toolchain to run directly on the Edge TPU. To help get you started, we're sharing over a dozen pre-trained, pre-compiled models that work with Coral boards out of the box, as well as software tools to let you re-train them.

For those building connected devices with Coral, our products can be used with Google Cloud IoT. Google Cloud IoT combines cloud services with an on-device software stack to allow for managed edge computing with machine learning capabilities.

Coral products are available today, along with product documentation, datasheets and sample code at g.co/coral. We hope you try our products during this public beta, and look forward to sharing more with you at our official launch.

Long-Range Robotic Navigation via Automated Reinforcement Learning



In the United States alone, there are 3 million people with a mobility impairment that prevents them from ever leaving their homes. Service robots that can autonomously navigate long distances can improve the independence of people with limited mobility, for example, by bringing them groceries, medicine, and packages. Research has demonstrated that deep reinforcement learning (RL) is good at mapping raw sensory input to actions, e.g. learning to grasp objects and for robot locomotion, but RL agents usually lack the understanding of large physical spaces needed to safely navigate long distances without human help and to easily adapt to new spaces.

In three recent papers, “Learning Navigation Behaviors End-to-End with AutoRL,” “PRM-RL: Long-Range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning”, and “Long-Range Indoor Navigation with PRM-RL”, we investigate easy-to-adapt robotic autonomy by combining deep RL with long-range planning. We train local planner agents to perform basic navigation behaviors, traversing short distances safely without collisions with moving obstacles. The local planners take noisy sensor observations, such as a 1D lidar that provides distances to obstacles, and output linear and angular velocities for robot control. We train the local planner in simulation with AutoRL, a method that automates the search for RL reward and neural network architecture. Despite their limited range of 10 - 15 meters, the local planners transfer well to both real robots and to new, previously unseen environments. This enables us to use them as building blocks for navigation in large spaces. We then build a roadmap, a graph where nodes are locations and edges connect the nodes only if local planners, which mimic real robots well with their noisy sensors and control, can traverse between them reliably.

Automating Reinforcement Learning (AutoRL)
In our first paper, we train the local planners in small, static environments. However, training with standard deep RL algorithms, such as Deep Deterministic Policy Gradient (DDPG), poses several challenges. For example, the true objective of the local planners is to reach the goal, which represents a sparse reward. In practice, this requires researchers to spend significant time iterating and hand-tuning the rewards. Researchers must also make decisions about the neural network architecture, without clear accepted best practices. And finally, algorithms like DDPG are unstable learners and often exhibit catastrophic forgetfulness.

To overcome those challenges, we automate the deep Reinforcement Learning (RL) training. AutoRL is an evolutionary automation layer around deep RL that searches for a reward and neural network architecture using large-scale hyperparameter optimization. It works in two phases, reward search and neural network architecture search. During the reward search, AutoRL trains a population of DDPG agents concurrently over several generations, each with a slightly different reward function optimizing for the local planner’s true objective: reaching the destination. At the end of the reward search phase, we select the reward that leads the agents to its destination most often. In the neural network architecture search phase, we repeat the process, this time using the selected reward and tuning the network layers, optimizing for the cumulative reward.
Automating reinforcement learning with reward and neural network architecture search.
However, this iterative process means AutoRL is not sample efficient. Training one agent takes 5 million samples; AutoRL training over 10 generations of 100 agents requires 5 billion samples - equivalent to 32 years of training! The benefit is that after AutoRL the manual training process is automated, and DDPG does not experience catastrophic forgetfulness. Most importantly, the resulting policies are higher quality — AutoRL policies are robust to sensor, actuator and localization noise, and generalize well to new environments. Our best policy is 26% more successful than other navigation methods across our test environments.
AutoRL (red) success over short distances (up to 10 meters) in several unseen buildings. Compared to hand-tuned DDPG (dark-red), artificial potential fields (light blue), dynamic window approach (blue), and behavior cloning (green).
AutoRL local planner policy transfer to robots in real, unstructured environments
While these policies only perform local navigation, they are robust to moving obstacles and transfer well to real robots, even in unstructured environments. Though they were trained in simulation with only static obstacles, they can also handle moving objects effectively. The next step is to combine the AutoRL policies with sampling-based planning to extend their reach and enable long-range navigation.

Achieving Long Range Navigation with PRM-RL
Sampling-based planners tackle long-range navigation by approximating robot motions. For example, probabilistic roadmaps (PRMs) sample robot poses and connect them with feasible transitions, creating roadmaps that capture valid movements of a robot across large spaces. In our second paper, which won Best Paper in Service Robotics at ICRA 2018, we combine PRMs with hand-tuned RL-based local planners (without AutoRL) to train robots once locally and then adapt them to different environments.

First, for each robot we train a local planner policy in a generic simulated training environment. Next, we build a PRM with respect to that policy, called a PRM-RL, over a floor plan for the deployment environment. The same floor plan can be used for any robot we wish to deploy in the building in a one time per robot+environment setup.

To build a PRM-RL we connect sampled nodes only if the RL-based local planner, which represents robot noise well, can reliably and consistently navigate between them. This is done via Monte Carlo simulation. The resulting roadmap is tuned to both the abilities and geometry of the particular robot. Roadmaps for robots with the same geometry but different sensors and actuators will have different connectivity. Since the agent can navigate around corners, nodes without clear line of sight can be included. Whereas nodes near walls and obstacles are less likely to be connected into the roadmap because of sensor noise. At execution time, the RL agent navigates from roadmap waypoint to waypoint.
Roadmap being built with 3 Monte Carlo simulations per randomly selected node pair.
The largest map was 288 meters by 163 meters and contains almost 700,000 edges, collected over 4 days using 300 workers in a cluster requiring 1.1 billion collision checks.
The third paper makes several improvements over the original PRM-RL. First, we replace the hand-tuned DDPG with AutoRL-trained local planners, which results in improved long-range navigation. Second, it adds Simultaneous Localization and Mapping (SLAM) maps, which robots use at execution time, as a source for building the roadmaps. Because SLAM maps are noisy, this change closes the “sim2real gap”, a phonomena in robotics where simulation-trained agents significantly underperform when transferred to real-robots. Our simulated success rates are the same as in on-robot experiments. Last, we added distributed roadmap building, resulting in very large scale roadmaps containing up to 700,000 nodes.

We evaluated the method using our AutoRL agent, building roadmaps using the floor maps of offices up to 200x larger than the training environments, accepting edges with at least 90% success over 20 trials. We compared PRM-RL to a variety of different methods over distances up to 100m, well beyond the local planner range. PRM-RL had 2 to 3 times the rate of success over baseline because the nodes were connected appropriately for the robot’s capabilities.
Navigation over 100 meters success rates in several buildings. First paper -AutoRL local planner only (blue); original PRMs (red); path-guided artificial potential fields (yellow); second paper (green); third paper - PRMs with AutoRL (orange).
We tested PRM-RL on multiple real robots and real building sites. One set of tests are shown below; the robot is very robust except near cluttered areas and off the edge of the SLAM map.
On-robot experiments
Conclusion
Autonomous robot navigation can significantly improve independence of people with limited mobility. We can achieve this by development of easy-to-adapt robotic autonomy, including methods that can be deployed in new environments using information that it is already available. This is done by automating the learning of basic, short-range navigation behaviors with AutoRL and using these learned policies in conjunction with SLAM maps to build roadmaps. These roadmaps consist of nodes connected by edges that robots can traverse consistently. The result is a policy that once trained can be used across different environments and can produce a roadmap custom-tailored to the particular robot.

Acknowledgements
The research was done by, in alphabetical order, Hao-Tien Lewis Chiang, James Davidson, Aleksandra Faust, Marek Fiser, Anthony Francis, Jasmine Hsu, J. Chase Kew, Tsang-Wei Edward Lee, Ken Oslund, Oscar Ramirez from Robotics at Google and Lydia Tapia from University of New Mexico. We thank Alexander Toshev, Brian Ichter, Chris Harris, and Vincent Vanhoucke for helpful discussions.

Source: Google AI Blog