Tag Archives: #AutoML

Updates from Coral: Mendel Linux 4.0 and much more!

Posted by Carlos Mendonça (Product Manager), Coral TeamIllustration of the Coral Dev Board placed next to Fall foliage

Last month, we announced that Coral graduated out of beta, into a wider, global release. Today, we're announcing the next version of Mendel Linux (4.0 release Day) for the Coral Dev Board and SoM, as well as a number of other exciting updates.

We have made significant updates to improve performance and stability. Mendel Linux 4.0 release Day is based on Debian 10 Buster and includes upgraded GStreamer pipelines and support for Python 3.7, OpenCV, and OpenCL. The Linux kernel has also been updated to version 4.14 and U-Boot to version 2017.03.3.

We’ve also made it possible to use the Dev Board's GPU to convert YUV to RGB pixel data at up to 130 frames per second on 1080p resolution, which is one to two orders of magnitude faster than on Mendel Linux 3.0 release Chef. These changes make it possible to run inferences with YUV-producing sources such as cameras and hardware video decoders.

To upgrade your Dev Board or SoM, follow our guide to flash a new system image.

MediaPipe on Coral

MediaPipe is an open-source, cross-platform framework for building multi-modal machine learning perception pipelines that can process streaming data like video and audio. For example, you can use MediaPipe to run on-device machine learning models and process video from a camera to detect, track and visualize hand landmarks in real-time.

Developers and researchers can prototype their real-time perception use cases starting with the creation of the MediaPipe graph on desktop. Then they can quickly convert and deploy that same graph to the Coral Dev Board, where the quantized TensorFlow Lite model will be accelerated by the Edge TPU.

As part of this first release, MediaPipe is making available new experimental samples for both object and face detection, with support for the Coral Dev Board and SoM. The source code and instructions for compiling and running each sample are available on GitHub and on the MediaPipe documentation site.

New Teachable Sorter project tutorial

New Teachable Sorter project tutorial

A new Teachable Sorter tutorial is now available. The Teachable Sorter is a physical sorting machine that combines the Coral USB Accelerator's ability to perform very low latency inference with an ML model that can be trained to rapidly recognize and sort different objects as they fall through the air. It leverages Google’s new Teachable Machine 2.0, a web application that makes it easy for anyone to quickly train a model in a fun, hands-on way.

The tutorial walks through how to build the free-fall sorter, which separates marshmallows from cereal and can be trained using Teachable Machine.

Coral is now on TensorFlow Hub

Earlier this month, the TensorFlow team announced a new version of TensorFlow Hub, a central repository of pre-trained models. With this update, the interface has been improved with a fresh landing page and search experience. Pre-trained Coral models compiled for the Edge TPU continue to be available on our Coral site, but a select few are also now available from the TensorFlow Hub. On the site, you can find models featuring an Overlay interface, allowing you to test the model's performance against a custom set of images right from the browser. Check out the experience for MobileNet v1 and MobileNet v2.

We are excited to share all that Coral has to offer as we continue to evolve our platform. For a list of worldwide distributors, system integrators and partners, visit the new Coral partnerships page. We hope you’ll use the new features offered on Coral.ai as a resource and encourage you to keep sending us feedback at coral-support@google.com.

Introducing the Next Generation of On-Device Vision Models: MobileNetV3 and MobileNetEdgeTPU



On-device machine learning (ML) is an essential component in enabling privacy-preserving, always-available and responsive intelligence. This need to bring on-device machine learning to compute and power-limited devices has spurred the development of algorithmically-efficient neural network models and hardware capable of performing billions of math operations per second, while consuming only a few milliwatts of power. The recently launched Google Pixel 4 exemplifies this trend, and ships with the Pixel Neural Core that contains an instantiation of the Edge TPU architecture, Google’s machine learning accelerator for edge computing devices, and powers Pixel 4 experiences such as face unlock, a faster Google Assistant and unique camera features. Similarly, algorithms, such as MobileNets, have been critical for the success of on-device ML by providing compact and efficient neural network models for mobile vision applications.

Today we are pleased to announce the release of source code and checkpoints for MobileNetV3 and the Pixel 4 Edge TPU-optimized counterpart MobileNetEdgeTPU model. These models are the culmination of the latest advances in hardware-aware AutoML techniques as well as several advances in architecture design. On mobile CPUs, MobileNetV3 is twice as fast as MobileNetV2 with equivalent accuracy, and advances the state-of-the-art for mobile computer vision networks. On the Pixel 4 Edge TPU hardware accelerator, the MobileNetEdgeTPU model pushes the boundary further by improving model accuracy while simultaneously reducing the runtime and power consumption.

Building MobileNetV3
In contrast with the hand-designed previous version of MobileNet, MobileNetV3 relies on AutoML to find the best possible architecture in a search space friendly to mobile computer vision tasks. To most effectively exploit the search space we deploy two techniques in sequence — MnasNet and NetAdapt. First, we search for a coarse architecture using MnasNet, which uses reinforcement learning to select the optimal configuration from a discrete set of choices. Then we fine-tune the architecture using NetAdapt, a complementary technique that trims under-utilized activation channels in small decrements. To provide the best possible performance under different conditions we have produced both large and small models.
Comparison of accuracy vs. latency for mobile models on the ImageNet classification task using the Google Pixel 4 CPU.
MobileNetV3 Search Space
The MobileNetV3 search space builds on multiple recent advances in architecture design that we adapt for the mobile environment. First, we introduce a new activation function called hard-swish (h-swish) which is based on the Swish nonlinearity function. The critical drawback of the Swish function is that it is very inefficient to compute on mobile hardware. So, instead we use an approximation that can be efficiently expressed as a product of two piecewise linear functions.
Next we introduce the mobile-friendly squeeze-and-excitation block, which replaces the classical sigmoid function with a piecewise linear approximation.

Combining h-swish plus mobile-friendly squeeze-and-excitation with a modified version of the inverted bottleneck structure introduced in MobileNetV2 yielded a new building block for MobileNetV3.
MobileNetV3 extends the MobileNetV2 inverted bottleneck structure by adding h-swish and mobile friendly squeeze-and-excitation as searchable options.
These parameters defined the search space used in constructing MobileNetV3:
  • Size of expansion layer
  • Degree of squeeze-excite compression
  • Choice of activation function: h-swish or ReLU
  • Number of layers for each resolution block
We also introduced a new efficient last stage at the end of the network that further reduced latency by 15%.
MobileNetV3 Object Detection and Semantic Segmentation
In addition to classification models, we also introduced MobileNetV3 object detection models, which reduced detection latency by 25% relative to MobileNetV2 at the same accuracy for the COCO dataset.

In order to optimize MobileNetV3 for efficient semantic segmentation, we introduced a low latency segmentation decoder called Lite Reduced Atrous Spatial Pyramid Pooling (LR-SPP). This new decoder contains three branches, one for low resolution semantic features, one for higher resolution details, and one for light-weight attention. The combination of LR-SPP and MobileNetV3 reduces the latency by over 35% on the high resolution Cityscapes Dataset.

MobileNet for Edge TPUs
The Edge TPU in Pixel 4 is similar in architecture to the Edge TPU in the Coral line of products, but customized to meet the requirements of key camera features in Pixel 4. The accelerator-aware AutoML approach substantially reduces the manual process involved in designing and optimizing neural networks for hardware accelerators. Crafting the neural architecture search space is an important part of this approach and centers around the inclusion of neural network operations that are known to improve hardware utilization. While operations such as squeeze-and-excite and swish non-linearity have been shown to be essential in building compact and fast CPU models, these operations tend to perform suboptimally on Edge TPU and hence are excluded from the search space. The minimalistic variants of MobileNetV3 also forgo the use of these operations (i.e., squeeze-and-excite, swish, and 5x5 convolutions) to allow easier portability to a variety of other hardware accelerators such as DSPs and GPUs.

The neural network architecture search, incentivized to jointly optimize the model accuracy and Edge TPU latency, produces the MobileNetEdgeTPU model that achieves lower latency for a fixed accuracy (or higher accuracy for a fixed latency) than existing mobile models such as MobileNetV2 and minimalistic MobileNetV3. Compared with the EfficientNet-EdgeTPU model (optimized for the Edge TPU in Coral), these models are designed to run at a much lower latency on Pixel 4, albeit at the cost of some loss in accuracy.

Although reducing the model’s power consumption was not a part of the search objective, the lower latency of the MobileNetEdgeTPU models also helps reduce the average Edge TPU power use. The MobileNetEdgeTPU model consumes less than 50% the power of the minimalistic MobileNetV3 model at comparable accuracy.
Left: Comparison of the accuracy on the ImageNet classification task between MobileNetEdgeTPU and other image classification networks designed for mobile when running on Pixel4 Edge TPU. MobileNetEdgeTPU achieves higher accuracy and lower latency compared with other models. Right: Average Edge TPU power in Watts for different classification models running at 30 frames per second (fps).
Objection Detection Using MobileNetEdgeTPU
The MobileNetEdgeTPU classification model also serves as an effective feature extractor for object detection tasks. Compared with MobileNetV2 based detection models, MobileNetEdgeTPU models offer a significant improvement in model quality (measured as the mean average precision; mAP) on the COCO14 minival dataset at comparable runtimes on the Edge TPU. The MobileNetEdgeTPU detection model has a latency of 6.6ms and achieves mAP score of 24.3, while MobileNetV2-based detection models achieve an mAP of 22 and takes 6.8ms per inference.

The Need for Hardware-Aware Models
While the results shown above highlight the power, performance, and quality benefits of MobileNetEdgeTPU models, it is important to note that the improvements arise due to the fact that these models have been customized to run on the Edge TPU accelerator.
MobileNetEdgeTPU when running on a mobile CPU delivers inferior performance compared with the models that have been tuned specifically for mobile CPUs (MobileNetV3). MobileNetEdgeTPU models perform a much greater number of operations, and so, it is not surprising that they run slower on mobile CPUs, which exhibit a more linear relationship between a model’s compute requirements and the runtime.
MobileNetV3 is still the best performing network when using mobile CPU as the deployment target.
For Researchers and Developers
The MobileNetV3 and MobileNetEdgeTPU code, as well as both floating point and quantized checkpoints for ImageNet classification, are available at the MobileNet github page. Open source implementation for MobileNetV3 and MobileNetEdgeTPU object detection is available in the Tensorflow Object Detection API. Open source implementation for MobileNetV3 semantic segmentation is available in TensorFlow through DeepLab.

Acknowledgements:
This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Berkin Akin, Okan Arikan, Gabriel Bender, Bo Chen, Liang-Chieh Chen, Grace Chu, Eddy Hsu, John Joseph, Pieter-jan Kindermans, Quoc Le, Owen Lin, Hanxiao Liu, Yun Long, Ravi Narayanaswami, Ruoming Pang, Mark Sandler, Mingxing Tan, Vijay Vasudevan, Weijun Wang, Dong Hyuk Woo, Dmitry Kalenichenko, Yunyang Xiong, Yukun Zhu and support from Hartwig Adam, Blaise Agüera y Arcas, Chidu Krishnan and Steve Molloy.

Source: Google AI Blog


Coral moves out of beta

Posted by Vikram Tank (Product Manager), Coral Team

microchips on coral colored background

Last March, we launched Coral beta from Google Research. Coral helps engineers and researchers bring new models out of the data center and onto devices, running TensorFlow models efficiently at the edge. Coral is also at the core of new applications of local AI in industries ranging from agriculture to healthcare to manufacturing. We've received a lot of feedback over the past six months and used it to improve our platform. Today we’re thrilled to graduate Coral out of beta, into a wider, global release.

Coral is already delivering impact across industries, and several of our partners are including Coral in products that require fast ML inferencing at the edge.

In healthcare, Care.ai is using Coral to build a device that enables hospitals and care centers to respond quickly to falls, prevent bed sores, improve patient care, and reduce costs. Virgo SVS is also using Coral as the basis of a polyp detection system that helps doctors improve the accuracy of endoscopies.

In a very different use case, Olea Edge employs Coral to help municipal water utilities accurately measure the amount of water used by their commercial customers. Their Meter Health Analytics solution uses local AI to reduce waste and predict equipment failure in industrial water meters.

Nexcom is using Coral to build gateways with local AI and provide a platform for next-gen, AI-enabled IoT applications. By moving AI processing to the gateway, existing sensor networks can stay in service without the need to add AI processing to each node.

From prototype to production

Microchips on white background

Coral’s Dev Board is designed as an integrated prototyping solution for new product development. Under the heatsink is the detachable Coral SoM, which combines Google’s Edge TPU with the NXP IMX8M SoC, Wi-Fi and Bluetooth connectivity, memory, and storage. We’re happy to announce that you can now purchase the Coral SoM standalone. We’ve also created a baseboard developer guide to help integrate it into your own production design.

Our Coral USB Accelerator allows users with existing system designs to add local AI inferencing via USB 2/3. For production workloads, we now offer three new Accelerators that feature the Edge TPU and connect via PCIe interfaces: Mini PCIe, M.2 A+E key, and M.2 B+M key. You can easily integrate these Accelerators into new products or upgrade existing devices that have an available PCIe slot.

The new Coral products are available globally and for sale at Mouser; for large volume sales, contact our sales team. By the end of 2019, we'll continue to expand our distribution of the Coral Dev Board and SoM into new markets including: Taiwan, Australia, New Zealand, India, Thailand, Singapore, Oman, Ghana and the Philippines.

Better resources

We’ve also revamped the Coral site with better organization for our docs and tools, a set of success stories, and industry focused pages. All of it can be found at a new, easier to remember URL Coral.ai.

To help you get the most out of the hardware, we’re also publishing a new set of examples. The included models and code can provide solutions to the most common on-device ML problems, such as image classification, object detection, pose estimation, and keyword spotting.

For those looking for a more in-depth application—and a way to solve the eternal problem of squirrels plundering your bird feeder—the Smart Bird Feeder project shows you how to perform classification with a custom dataset on the Coral Dev board.

Finally, we’ll soon release a new version of the Mendel OS that updates the system to Debian Buster, and we're hard at work on more improvements to the Edge TPU compiler and runtime that will improve the model development workflow.

The official launch of Coral is, of course, just the beginning, and we’ll continue to evolve the platform. Please keep sending us feedback at coral-support@google.com.

Coral summer updates: Post-training quant support, TF Lite delegate, and new models!

Posted by Vikram Tank (Product Manager), Coral Team

Summer updates cartoon

Coral’s had a busy summer working with customers, expanding distribution, and building new features — and of course taking some time for R&R. We’re excited to share updates, early work, and new models for our platform for local AI with you.

The compiler has been updated to version 2.0, adding support for models built using post-training quantization—only when using full integer quantization (previously, we required quantization-aware training)—and fixing a few bugs. As the Tensorflow team mentions in their Medium post “post-training integer quantization enables users to take an already-trained floating-point model and fully quantize it to only use 8-bit signed integers (i.e. `int8`).” In addition to reducing the model size, models that are quantized with this method can now be accelerated by the Edge TPU found in Coral products.

We've also updated the Edge TPU Python library to version 2.11.1 to include new APIs for transfer learning on Coral products. The new on-device back propagation API allows you to perform transfer learning on the last layer of an image classification model. The last layer of a model is removed before compilation and implemented on-device to run on the CPU. It allows for near-real time transfer learning and doesn’t require you to recompile the model. Our previously released imprinting API, has been updated to allow you to quickly retrain existing classes or add new ones while leaving other classes alone. You can now even keep the classes from the pre-trained base model. Learn more about both options for on-device transfer learning.

Until now, accelerating your model with the Edge TPU required that you write code using either our Edge TPU Python API or in C++. But now you can accelerate your model on the Edge TPU when using the TensorFlow Lite interpreter API, because we've released a TensorFlow Lite delegate for the Edge TPU. The TensorFlow Lite Delegate API is an experimental feature in TensorFlow Lite that allows for the TensorFlow Lite interpreter to delegate part or all of graph execution to another executor—in this case, the other executor is the Edge TPU. Learn more about the TensorFlow Lite delegate for Edge TPU.

Coral has also been working with Edge TPU and AutoML teams to release EfficientNet-EdgeTPU: a family of image classification models customized to run efficiently on the Edge TPU. The models are based upon the EfficientNet architecture to achieve the image classification accuracy of a server-side model in a compact size that's optimized for low latency on the Edge TPU. You can read more about the models’ development and performance on the Google AI Blog, and download trained and compiled versions on the Coral Models page.

And, as summer comes to an end we also want to share that Arrow offers a student teacher discount for those looking to experiment with the boards in class or the lab this year.

We're excited to keep evolving the Coral platform, please keep sending us feedback at coral-support@google.com.

EfficientNet-EdgeTPU: Creating Accelerator-Optimized Neural Networks with AutoML



For several decades, computer processors have doubled their performance every couple of years by reducing the size of the transistors inside each chip, as described by Moore’s Law. As reducing transistor size becomes more and more difficult, there is a renewed focus in the industry on developing domain-specific architectures — such as hardware accelerators — to continue advancing computational power. This is especially true for machine learning, where efforts are aimed at building specialized architectures for neural network (NN) acceleration. Ironically, while there has been a steady proliferation of these architectures in data centers and on edge computing platforms, the NNs that run on them are rarely customized to take advantage of the underlying hardware.

Today, we are happy to announce the release of EfficientNet-EdgeTPU, a family of image classification models derived from EfficientNets, but customized to run optimally on Google’s Edge TPU, a power-efficient hardware accelerator available to developers through the Coral Dev Board and a USB Accelerator. Through such model customizations, the Edge TPU is able to provide real-time image classification performance while simultaneously achieving accuracies typically seen only when running much larger, compute-heavy models in data centers.

Using AutoML to customize EfficientNets for Edge TPU
EfficientNets have been shown to achieve state-of-the-art accuracy in image classification tasks while significantly reducing the model size and computational complexity. To build EfficientNets designed to leverage the Edge TPU’s accelerator architecture, we invoked the AutoML MNAS framework and augmented the original EfficientNet’s neural network architecture search space with building blocks that execute efficiently on the Edge TPU (discussed below). We also built and integrated a “latency predictor” module that provides an estimate of the model latency when executing on the Edge TPU, by running the models on a cycle-accurate architectural simulator. The AutoML MNAS controller implements a reinforcement learning algorithm to search this space while attempting to maximize the reward, which is a joint function of the predicted latency and model accuracy. From past experience, we know that Edge TPU’s power efficiency and performance tend to be maximized when the model fits within its on-chip memory. Hence we also modified the reward function to generate a higher reward for models that satisfy this constraint.
Overall AutoML flow for designing customized EfficientNet-EdgeTPU models.
Search Space Design
When performing the architecture search described above, one must consider that EfficientNets rely primarily on depthwise-separable convolutions, a type of neural network block that factorizes a regular convolution to reduce the number of parameters as well as the amount of computations. However, for certain configurations, a regular convolution utilizes the Edge TPU architecture more efficiently and executes faster, despite the much larger amount of compute. While it is possible, albeit tedious, to manually craft a network that uses an optimal combination of the different building blocks, augmenting the AutoML search space with these accelerator-optimal blocks is a more scalable approach.
A regular 3x3 convolution (right) has more compute (multiply-and-accumulate (mac) operations) than an depthwise-separable convolution (left), but for certain input/output shapes, executes faster on Edge TPU due to ~3x more effective hardware utilization.
In addition, removing certain operations from the search space that require modifications to the Edge TPU compiler to fully support, such swish non-linearity and squeeze-and-excitation block, naturally leads to models that are readily ported to the Edge TPU hardware. These operations tend to improve model quality slightly, so by eliminating them from the search space, we have effectively instructed AutoML to discover alternate network architectures that may compensate for any potential loss in quality.

Model Performance
The neural architecture search (NAS) described above produced a baseline model, EfficientNet-EdgeTPU-S, which is subsequently scaled up using EfficientNet's compound scaling method to produce the -M and -L models. The compound scaling approach selects an optimal combination of input image resolution scaling, network width, and depth scaling to construct larger, more accurate models. The -M, and -L models achieve higher accuracy at the cost of increased latency as shown in the figure below.
EfficientNet-EdgeTPU-S/M/L models achieve better latency and accuracy than existing EfficientNets (B1), ResNet, and Inception by specializing the network architecture for Edge TPU hardware. In particular, our EfficientNet-EdgeTPU-S achieves higher accuracy, yet runs 10x faster than ResNet-50.
Interestingly, the NAS-generated model employs the regular convolution quite extensively in the initial part of the network where the depthwise-separable convolution tends to be less effective than the regular convolution when executed on the accelerator. This clearly highlights the fact that trade-offs usually made while optimizing models for general purpose CPUs (reducing the total number of operations, for example) are not necessarily optimal for hardware accelerators. Also, these models achieve high accuracy even without the use of esoteric operations. Comparing with the other image classification models such as Inception-resnet-v2 and Resnet50, EfficientNet-EdgeTPU models are not only more accurate, but also run faster on Edge TPUs.

This work represents a first experiment in building accelerator-optimized models using AutoML. The AutoML-based model customization can be extended to not only a wide range of hardware accelerators, but also to several different applications that rely on neural networks.

From Cloud TPU training to Edge TPU deployment
We have released the training code and pretrained models for EfficientNet-EdgeTPU on our github repository. We employ tensorflow’s post-training quantization tool to convert a floating-point trained model to an Edge TPU-compatible integer-quantized model. For these models, the post-training quantization works remarkably well and produces only a very slight loss in accuracy (~0.5%). The script for exporting the quantized model from a training checkpoint can be found here. For an update on the Coral platform, see this post on the Google Developer’s Blog, and for full reference materials and detailed instructions, please refer to the Coral website.

Acknowledgements
Special thanks to Quoc Le, Hongkun Yu, Yunlu Li, Ruoming Pang, and Vijay Vasudevan from the Google Brain team; Bo Wu, Vikram Tank, and Ajay Nair from the Google Coral team; Han Vanholder, Ravi Narayanaswami, John Joseph, Dong Hyuk Woo, Raksit Ashok, Jason Jong Kyu Park, Jack Liu, Mohammadali Ghodrat, Cao Gao, Berkin Akin, Liang-Yun Wang, Chirag Gandhi, and Dongdong Li from the Google Edge TPU team.

Source: Google AI Blog


Applying AutoML to Transformer Architectures



Since it was introduced a few years ago, Google’s Transformer architecture has been applied to challenges ranging from generating fantasy fiction to writing musical harmonies. Importantly, the Transformer’s high performance has demonstrated that feed forward neural networks can be as effective as recurrent neural networks when applied to sequence tasks, such as language modeling and translation. While the Transformer and other feed forward models used for sequence problems are rising in popularity, their architectures are almost exclusively manually designed, in contrast to the computer vision domain where AutoML approaches have found state-of-the-art models that outperform those that are designed by hand. Naturally, we wondered if the application of AutoML in the sequence domain could be equally successful.

After conducting an evolution-based neural architecture search (NAS), using translation as a proxy for sequence tasks in general, we found the Evolved Transformer, a new Transformer architecture that demonstrates promising improvements on a variety of natural language processing (NLP) tasks. Not only does the Evolved Transformer achieve state-of-the-art translation results, but it also demonstrates improved performance on language modeling when compared to the original Transformer. We are releasing this new model as part of Tensor2Tensor, where it can be used for any sequence problem.

Developing the Techniques
To begin the evolutionary NAS, it was necessary for us to develop new techniques, due to the fact that the task used to evaluate the “fitness” of each architecture, WMT’14 English-German translation, is computationally expensive. This makes the searches more expensive than similar searches executed in the vision domain, which can leverage smaller datasets, like CIFAR-10. The first of these techniques is warm starting—seeding the initial evolution population with the Transformer architecture instead of random models. This helps ground the search in an area of the search space we know is strong, thereby allowing it to find better models faster.

The second technique is a new method we developed called Progressive Dynamic Hurdles (PDH), an algorithm that augments the evolutionary search to allocate more resources to the strongest candidates, in contrast to previous works, where each candidate model of the NAS is allocated the same amount of resources when it is being evaluated. PDH allows us to terminate the evaluation of a model early if it is flagrantly bad, allowing promising architectures to be awarded more resources.

The Evolved Transformer
Using these methods, we conducted a large-scale NAS on our translation task and discovered the Evolved Transformer (ET). Like most sequence to sequence (seq2seq) neural network architectures, it has an encoder that encodes the input sequence into embeddings and a decoder that uses those embeddings to construct an output sequence; in the case of translation, the input sequence is the sentence to be translated and the output sequence is the translation.

The most interesting feature of the Evolved Transformer is the convolutional layers at the bottom of both its encoder and decoder modules that were added in a similar branching pattern in both places (i.e. the inputs run through two separate convolutional layers before being added together).
A comparison between the Evolved Transformer and the original Transformer encoder architectures. Notice the branched convolution structure at the bottom of the module, which formed in both the encoder and decoder independently. See our paper for a description of the decoder.
This is particularly interesting because the encoder and decoder architectures are not shared during the NAS, so this architecture was independently discovered as being useful in both the encoder and decoder, speaking to the strength of this design. Whereas the original Transformer relied solely on self-attention, the Evolved Transformer is a hybrid, leveraging the strengths of both self-attention and wide convolution.

Evaluation of the Evolved Transformer
To test the effectiveness of this new architecture, we first compared it to the original Transformer on the English-German translation task we used during the search. We found that the Evolved Transformer had better BLEU and perplexity performance at all parameter sizes, with the biggest gain at the size compatible with mobile devices (~7 million parameters), demonstrating an efficient use of parameters. At a larger size, the Evolved Transformer reaches state-of-the-art performance on WMT’ 14 En-De with a BLEU score of 29.8 and a SacreBLEU score of 29.2.
Comparison between the Evolved Transformer and the original Transformer on WMT’14 En-De at varying sizes. The biggest gains in performance occur at smaller sizes, while ET also shows strength at larger sizes, outperforming the largest Transformer with 37.6% less parameters (models to compare are circled in green). See Table 3 in our paper for the exact numbers.
To test generalizability, we also compared ET to the Transformer on additional NLP tasks. First, we looked at translation using different language pairs, and found ET demonstrated improved performance, with margins similar to those seen on English-German; again, due to its efficient use of parameters, the biggest improvements were observed for medium sized models. We also compared the decoders of both models on language modeling using LM1B, and saw a performance improvement of nearly 2 perplexity.
Future Work
These results are the first step in exploring the application of architecture search to feed forward sequence models. The Evolved Transformer is being open sourced as part of Tensor2Tensor, where it can be used for any sequence problem. To promote reproducibility, we are also open sourcing the search space we used for our search and a Colab with an implementation of Progressive Dynamic Hurdles. We look forward to seeing what the research community does with the new model and hope that others are able to build off of these new search techniques!

Source: Google AI Blog


EfficientNet: Improving Accuracy and Efficiency through AutoML and Model Scaling



Convolutional neural networks (CNNs) are commonly developed at a fixed resource cost, and then scaled up in order to achieve better accuracy when more resources are made available. For example, ResNet can be scaled up from ResNet-18 to ResNet-200 by increasing the number of layers, and recently, GPipe achieved 84.3% ImageNet top-1 accuracy by scaling up a baseline CNN by a factor of four. The conventional practice for model scaling is to arbitrarily increase the CNN depth or width, or to use larger input image resolution for training and evaluation. While these methods do improve accuracy, they usually require tedious manual tuning, and still often yield suboptimal performance. What if, instead, we could find a more principled method to scale up a CNN to obtain better accuracy and efficiency?

In our ICML 2019 paper, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, we propose a novel model scaling method that uses a simple yet highly effective compound coefficient to scale up CNNs in a more structured manner. Unlike conventional approaches that arbitrarily scale network dimensions, such as width, depth and resolution, our method uniformly scales each dimension with a fixed set of scaling coefficients. Powered by this novel scaling method and recent progress on AutoML, we have developed a family of models, called EfficientNets, which superpass state-of-the-art accuracy with up to 10x better efficiency (smaller and faster).

Compound Model Scaling: A Better Way to Scale Up CNNs
In order to understand the effect of scaling the network, we systematically studied the impact of scaling different dimensions of the model. While scaling individual dimensions improves model performance, we observed that balancing all dimensions of the network—width, depth, and image resolution—against the available resources would best improve overall performance.

The first step in the compound scaling method is to perform a grid search to find the relationship between different scaling dimensions of the baseline network under a fixed resource constraint (e.g., 2x more FLOPS).This determines the appropriate scaling coefficient for each of the dimensions mentioned above. We then apply those coefficients to scale up the baseline network to the desired target model size or computational budget.

Comparison of different scaling methods. Unlike conventional scaling methods (b)-(d) that arbitrary scale a single dimension of the network, our compound scaling method uniformly scales up all dimensions in a principled way.
This compound scaling method consistently improves model accuracy and efficiency for scaling up existing models such as MobileNet (+1.4% imagenet accuracy), and ResNet (+0.7%), compared to conventional scaling methods.

EfficientNet Architecture
The effectiveness of model scaling also relies heavily on the baseline network. So, to further improve performance, we have also developed a new baseline network by performing a neural architecture search using the AutoML MNAS framework, which optimizes both accuracy and efficiency (FLOPS). The resulting architecture uses mobile inverted bottleneck convolution (MBConv), similar to MobileNetV2 and MnasNet, but is slightly larger due to an increased FLOP budget. We then scale up the baseline network to obtain a family of models, called EfficientNets.
The architecture for our baseline network EfficientNet-B0 is simple and clean, making it easier to scale and generalize.
EfficientNet Performance
We have compared our EfficientNets with other existing CNNs on ImageNet. In general, the EfficientNet models achieve both higher accuracy and better efficiency over existing CNNs, reducing parameter size and FLOPS by an order of magnitude. For example, in the high-accuracy regime, our EfficientNet-B7 reaches state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on CPU inference than the previous Gpipe. Compared with the widely used ResNet-50, our EfficientNet-B4 uses similar FLOPS, while improving the top-1 accuracy from 76.3% of ResNet-50 to 82.6% (+6.3%).
Model Size vs. Accuracy Comparison. EfficientNet-B0 is the baseline network developed by AutoML MNAS, while Efficient-B1 to B7 are obtained by scaling up the baseline network. In particular, our EfficientNet-B7 achieves new state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy, while being 8.4x smaller than the best existing CNN.
Though EfficientNets perform well on ImageNet, to be most useful, they should also transfer to other datasets. To evaluate this, we tested EfficientNets on eight widely used transfer learning datasets. EfficientNets achieved state-of-the-art accuracy in 5 out of the 8 datasets, such as CIFAR-100 (91.7%) and Flowers (98.8%), with an order of magnitude fewer parameters (up to 21x parameter reduction), suggesting that our EfficientNets also transfer well.

By providing significant improvements to model efficiency, we expect EfficientNets could potentially serve as a new foundation for future computer vision tasks. Therefore, we have open-sourced all EfficientNet models, which we hope can benefit the larger machine learning community. You can find the EfficientNet source code and TPU training scripts here.

Acknowledgements:
Special thanks to Hongkun Yu, Ruoming Pang, Vijay Vasudevan, Alok Aggarwal, Barret Zoph, Xianzhi Du, Xiaodan Song, Samy Bengio, Jeff Dean, and the Google Brain team.

Source: Google AI Blog


An End-to-End AutoML Solution for Tabular Data at KaggleDays



Machine learning (ML) for tabular data (e.g. spreadsheet data) is one of the most active research areas in both ML research and business applications. Solutions to tabular data problems, such as fraud detection and inventory prediction, are critical for many business sectors, including retail, supply chain, finance, manufacturing, marketing and others. Current ML-based solutions to these problems can be achieved by those with significant ML expertise, including manual feature engineering and hyper-parameter tuning, to create a good model. However, the lack of broad availability of these skills limits the efficiency of business improvements through ML.

Google’s AutoML efforts aim to make ML more scalable and accelerate both research and industry applications. Our initial efforts of neural architecture search have enabled breakthroughs in computer vision with NasNet, and evolutionary methods such as AmoebaNet and hardware-aware mobile vision architecture MNasNet further show the benefit of these learning-to-learn methods. Recently, we applied a learning-based approach to tabular data, creating a scalable end-to-end AutoML solution that meets three key criteria:
  • Full automation: Data and computation resources are the only inputs, while a servable TensorFlow model is the output. The whole process requires no human intervention.
  • Extensive coverage: The solution is applicable to the majority of arbitrary tasks in the tabular data domain.
  • High quality: Models generated by AutoML has comparable quality to models manually crafted by top ML experts.
To benchmark our solution, we entered our algorithm in the KaggleDays SF Hackathon, an 8.5 hour competition of 74 teams with up to 3 members per team, as part of the KaggleDays event. The first time that AutoML has competed against Kaggle participants, the competition involved predicting manufacturing defects given information about the material properties and testing results for batches of automotive parts. Despite competing against participants thats were at the Kaggle progression system Master level, including many who were at the GrandMaster level, our team (“Google AutoML”) led for most of the day and ended up finishing second place by a narrow margin, as seen in the final leaderboard.

Our team’s AutoML solution was a multistage TensorFlow pipeline. The first stage is responsible for automatic feature engineering, architecture search, and hyperparameter tuning through search. The promising models from the first stage are fed into the second stage, where cross validation and bootstrap aggregating are applied for better model selection. The best models from the second stage are then combined in the final model.
The workflow for the “Google AutoML” team was quite different from that of other Kaggle competitors. While they were busy with analyzing data and experimenting with various feature engineering ideas, our team spent most of time monitoring jobs and and waiting for them to finish. Our solution for second place on the final leaderboard required 1 hour on 2500 CPUs to finish end-to-end.

After the competition, Kaggle published a public kernel to investigate winning solutions and found that augmenting the top hand-designed models with AutoML models, such as ours, could be a useful way for ML experts to create even better performing systems. As can be seen in the plot below, AutoML has the potential to enhance the efforts of human developers and address a broad range of ML problems.
Potential model quality improvement on final leaderboard if AutoML models were merged with other Kagglers’ models. “Erkut & Mark, Google AutoML”, includes the top winner “Erkut & Mark” and the second place “Google AutoML” models. Erkut Aykutlug and Mark Peng used XGBoost with creative feature engineering whereas AutoML uses both neural network and gradient boosting tree (TFBT) with automatic feature engineering and hyperparameter tuning.
Google Cloud AutoML Tables
The solution we presented at the competitions is the main algorithm in Google Cloud AutoML Tables, which was recently launched (beta) at Google Cloud Next ‘19. The AutoML Tables implementation regularly performs well in benchmark tests against Kaggle competitions as shown in the plot below, demonstrating state-of-the-art performance across the industry.
Third party benchmark of AutoML Tables on multiple Kaggle competitions
We are excited about the potential application of AutoML methods across a wide range of real business problems. Customers have already been leveraging their tabular enterprise data to tackle mission-critical tasks like supply chain management and lead conversion optimization using AutoML Tables, and we are excited to be providing our state-of-the-art models to solve tabular data problems.

Acknowledgements
This project was only possible thanks to Google Brain team members Ming Chen, Da Huang, Yifeng Lu, Quoc V. Le and Vishy Tirumalashetty. We also thank Dawei Jia, Chenyu Zhao and Tin-yun Ho from the Cloud AutoML Tables team for great infrastructure and product landing collaboration. Thanks to Walter Reade, Julia Elliott and Kaggle for organizing such an engaging competition.

Source: Google AI Blog


Cloud AutoML: Making AI accessible to every business

https://img.youtube.com/vi/GbLQE2C181U/maxresdefault.jpg
When we both joined Google Cloud just over a year ago, we embarked on a mission to democratize AI. Our goal was to lower the barrier of entry and make AI available to the largest possible community of developers, researchers and businesses.


Our Google Cloud AI team has been making good progress towards this goal. In 2017, we introduced Google Cloud Machine Learning Engine, to help developers with machine learning expertise easily build ML models that work on any type of data, of any size. We showed how modern machine learning services, i.e., APIs—including Vision, Speech, NLP, Translation and Dialogflow—could be built upon pre-trained models to bring unmatched scale and speed to business applications. Kaggle, our community of data scientists and ML researchers, has grown to more than 1 million members. And today, more than 10,000 businesses are using Google Cloud AI services, including companies like Box, Rolls Royce Marine, Kewpie, and Ocado.


But there’s much more we can do. Currently, only a handful of businesses in the world have access to the talent and budgets needed to fully appreciate the advancements of ML and AI. There’s a very limited number of people that can create advanced machine learning models. And if you’re one of the companies that has access to ML/AI engineers, you still have to manage the time-intensive and complicated process of building your own custom ML model. While Google has offered pre-trained machine learning models via APIs that perform specific tasks, there's still a long road ahead if we want to bring AI to everyone.


To close this gap, and to make AI accessible to every business, we’re introducing Cloud AutoML. Cloud AutoML helps businesses with limited ML expertise start building their own high-quality custom models by using advanced techniques like learning2learn and transfer learning from Google. We believe Cloud AutoML will make AI experts even more productive, advance new fields in AI, and help less-skilled engineers build powerful AI systems they previously only dreamed of.


Our first Cloud AutoML release will be Cloud AutoML Vision, a service that makes it faster and easier to create custom ML models for image recognition. Its drag-and-drop interface lets you easily upload images, train and manage models, and then deploy those trained models directly on Google Cloud. Early results using Cloud AutoML Vision to classify popular public datasets like ImageNet and CIFAR have shown more accurate results with fewer misclassifications than generic ML APIs.


Here’s a little more on what Cloud AutoML Vision has to offer:
  • Increased accuracy: Cloud AutoML Vision is built on Google’s leading image recognition approaches, including transfer learning and neural architecture search technologies. This means you’ll get a more accurate model even if your business has limited machine learning expertise.
  • Faster turnaround time to production-ready models: With Cloud AutoML, you can create a simple model in minutes to pilot your AI-enabled application, or build out a full, production-ready model in as little as a day.
  • Easy to use: AutoML Vision provides a simple graphical user interface that lets you specify data, then turns that data into a high quality model customized for your specific needs.



    Urban Outfitters is constantly looking for new ways to enhance our customers’ shopping experience," says Alan Rosenwinkel, Data Scientist at URBN. "Creating and maintaining a comprehensive set of product attributes is critical to providing our customers relevant product recommendations, accurate search results, and helpful product filters; however, manually creating product attributes is arduous and time-consuming. To address this, our team has been evaluating Cloud AutoML to automate the product attribution process by recognizing nuanced product characteristics like patterns and necklines styles. Cloud AutoML has great promise to help our customers with better discovery, recommendation, and search experiences."


    Mike White, CTO and SVP, for Disney Consumer Products and Interactive Media, says: “Cloud AutoML’s technology is helping us build vision models to annotate our products with Disney characters, product categories, and colors. These annotations are being integrated into our search engine to enhance the impact on Guest experience through more relevant search results, expedited discovery, and product recommendations on shopDisney.”

    And Sophie Maxwell, Conservation Technology Lead at the Zoological Society of London, tells us: "ZSL is an international conservation charity devoted to the worldwide conservation of animals and their habitats. A key requirement to deliver on this mission is to track wildlife populations to learn more about their distribution and better understand the impact humans are having on these species. In order to achieve this, ZSL has deployed a series of camera traps in the wild that take pictures of passing animals when triggered by heat or motion. The millions of images captured by these devices are then manually analysed and annotated and with the relevant species such as elephants, lions, and giraffes, etc., which is a labour-intensive and expensive process. ZSL’s dedicated Conservation Technology Unit has been collaborating closely with Google’s CloudML team to help shape the development of this exciting technology, which ZSL aims to use to automate the tagging of these images—cutting costs, enabling wider-scale deployments, and gaining a deeper understanding of how to conserve the world’s wildlife effectively."


    If you’re interested in trying out AutoML Vision, you can request access via this form.

    AutoML Vision is the result of our close collaboration with Google Brain and other Google AI teams, and is the first of several Cloud AutoML products in development. While we’re still at the beginning of our journey to make AI more accessible, we’ve been deeply inspired by what our 10,000+ customers using Cloud AI products have been able to achieve. We hope the release of Cloud AutoML will help even more businesses discover what’s possible through AI.

    By Jia Li, Head of R&D, Cloud AI, and Fei-Fei Li, Chief Scientist, Cloud AI