Tag Archives: TPU

Chip Design with Deep Reinforcement Learning



The revolution of modern computing has been largely enabled by remarkable advances in computer systems and hardware. With the slowing of Moore’s Law and Dennard scaling, the world is moving toward specialized hardware to meet the exponentially growing demand for compute. However, today’s chips take years to design, resulting in the need to speculate about how to optimize the next generation of chips for the machine learning (ML) models of 2-5 years from now. Dramatically shortening the chip design cycle would allow hardware to adapt to the rapidly advancing field of ML. What if ML itself could provide the means to shorten the chip design cycle, creating a more integrated relationship between hardware and ML, with each fueling advances in the other?

In “Chip Placement with Deep Reinforcement Learning”, we pose chip placement as a reinforcement learning (RL) problem, where we train an agent (i.e, an RL policy) to optimize the quality of chip placements. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapidly generating optimized placements for previously unseen chip blocks. Whereas existing baselines require human experts in the loop and take several weeks to generate, our method can generate placements in under six hours that outperform or match their manually designed counterparts. While we show that we can generate optimized placements for Google accelerator chips (TPUs), our methods are applicable to any kind of chip (ASIC).

The Chip Floorplanning Problem
A computer chip is divided into dozens of blocks, each of which is an individual module, such as a memory subsystem, compute unit, or control logic system. These blocks can be described by a netlist, a graph of circuit components, such as macros (memory components) and standard cells (logic gates like NAND, NOR, and XOR), all of which are connected by wires. Determining the layout of a chip block, a process called chip floorplanning, is one of the most complex and time-consuming stages of the chip design process and involves placing the netlist onto a chip canvas (a 2D grid), such that power, performance, and area (PPA) are minimized, while adhering to constraints on density and routing congestion. Despite decades of research on this topic, it is still necessary for human experts to iterate for weeks to produce solutions that meet multi-faceted design criteria. This problem’s complexity arises from the size of the netlist graph (millions to billions of nodes), the granularity of the grid onto which that graph must be placed, and the exorbitant cost of computing the true target metrics, which can take many hours (sometimes over a day) using industry-standard electronic design automation tools.

The Deep Reinforcement Learning Model
The input to our model is the chip netlist (node types and graph adjacency information), the ID of the current node to be placed, and some netlist metadata, such as the total number of wires, macros, and standard cell clusters. The netlist graph and the current node are passed through an edge-based graph neural network that we developed to encode the input state. This generates embeddings of the partially placed graph and the candidate node.
A graph neural network generates embeddings that are concatenated with the metadata embeddings to form the input to the policy and value networks.
The edge, macro and netlist metadata embeddings are then concatenated to form a single state embedding, which is passed to a feedforward neural network. The output of the feedforward network is a learned representation that captures the useful features and serves as input to the policy and value networks. The policy network generates a probability distribution over all possible grid cells onto which the current node could be placed.

In each iteration of training, the macros are sequentially placed by the RL agent, after which the standard cell clusters are placed by a force-directed method, which models the circuit as a system of springs to minimize wirelength. RL training is guided by a fast-but-approximate reward signal calculated for each of the agent’s chip placements using the weighted average of approximate wirelength (i.e., the half-perimeter wirelength, HPWL) and approximate congestion (the fraction of routing resources consumed by the placed netlist).
During each training iteration, the macros are placed by the policy one at a time and the standard cell clusters are placed by a force-directed method. The reward is calculated from the weighted combination of approximate wirelength and congestion.
Results
To our knowledge, this method is the first chip placement approach that has the ability to generalize, meaning that it can leverage what it has learned while placing previous netlists to generate better placements for new unseen netlists. We show that as we increase the number of chip netlists on which we perform pre-training (i.e., as our method becomes more experienced in placement optimization), our policy better generalizes to new netlists.

For example, the pre-trained policy organically identifies an arrangement that places the macros near the edges of the chip with a convex space in the center in which to place the standard cells. This results in lower wirelength between the macros and standard cells without introducing excessive routing congestion. In contrast, the policy trained from scratch starts with random placements and takes much longer to converge to a high-quality solution, rediscovering the need to leave an opening in the center of the chip canvas. This is demonstrated in the animation below.
Macro placements of Ariane, an open-source RISC-V processor, as training progresses. On the left, the policy is being trained from scratch, and on the right, a pre-trained policy is being fine-tuned for this chip. Each rectangle represents an individual macro placement. Notice how the cavity discovered by the from-scratch policy is already present from the outset in the pre-trained policy’s placement.
We observe that pre-training improves sample efficiency and placement quality. We compare the quality of placements generated using pre-trained policies to those generated by training the policy from scratch. To generate placements for previously unseen chip blocks, we use a zero-shot method, meaning that we simply use a pre-trained policy (with no fine-tuning) to place a new block, yielding a placement in less than a second. The results can be further improved by fine-tuning the policy on the new block. The policy trained from scratch takes much longer to converge, and even after 24 hours, its chip placements are worse than what the fine-tuned policy achieves after 12 hours.
Convergence plots for two policies on Ariane blocks. One is training from scratch and the other is finetuning a pre-trained policy.
The performance of our approach improves as we train on a larger dataset. We observed that as we increase the training set from two blocks to five blocks, and then to 20 blocks, the policy generates better placements, both at zero-shot and after being fine-tuned for the same training wall-clock time.
Training data size vs. fine-tuning performance.
The ability of our approach to learn from experience and improve over time unlocks new possibilities for chip designers. As the agent is exposed to a greater volume and variety of chips, it becomes both faster and better at generating optimized placements for new chip blocks. A fast, high-quality, automatic chip placement method could greatly accelerate chip design and enable co-optimization with earlier stages of the chip design process. Although we evaluate primarily on accelerator chips, our proposed method is broadly applicable to any chip placement problem. After all that hardware has done for machine learning, we believe that it is time for machine learning to return the favor.

Acknowledgements
This project was a collaboration between Google Research and Google Hardware and Architecture teams. We would like to thank our coauthors: Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, Azade Nazi, Jiwoo Pak, Andy Tong, Kavya Srinivasa, William Hang, Emre Tuncer, Anand Babu, Quoc Le, James Laudon, Roger Carpenter, Richard Ho, and Jeff Dean for their support and contributions to this work.

Source: Google AI Blog


A Neural Weather Model for Eight-Hour Precipitation Forecasting



Predicting weather from minutes to weeks ahead with high accuracy is a fundamental scientific challenge that can have a wide ranging impact on many aspects of society. Current forecasts employed by many meteorological agencies are based on physical models of the atmosphere that, despite improving substantially over the preceding decades, are inherently constrained by their computational requirements and are sensitive to approximations of the physical laws that govern them. An alternative approach to weather prediction that is able to overcome some of these constraints uses deep neural networks (DNNs): instead of encoding explicit physical laws, DNNs discover patterns in the data and learn complex transformations from inputs to the desired outputs using parallel computation on powerful specialized hardware such as GPUs and TPUs.

Building on our previous research into precipitation nowcasting, we present “MetNet: A Neural Weather Model for Precipitation Forecasting,” a DNN capable of predicting future precipitation at 1 km resolution over 2 minute intervals at timescales up to 8 hours into the future. MetNet outperforms the current state-of-the-art physics-based model in use by NOAA for prediction times up to 7-8 hours ahead and makes a prediction over the entire US in a matter of seconds as opposed to an hour. The inputs to the network are sourced automatically from radar stations and satellite networks without the need for human annotation. The model output is a probability distribution that we use to infer the most likely precipitation rates with associated uncertainties at each geographical region. The figure below provides an example of the network’s predictions over the continental United States.
MetNet model predictions compared to the ground truth as measured by the NOAA multi-radar/multi-sensor system (MRMS). The MetNet model (top) displays the probabilities for 1 mm/hr precipitation predicted from 2 minutes to 480 minutes ahead, whereas the MRMS data (bottom) shows the areas receiving at least 1 mm/hr of precipitation over that same time period.
Neural Weather Model
MetNet does not rely on explicit physical laws describing the dynamics of the atmosphere, but instead learns by backpropagation to forecast the weather directly from observed data. The network uses precipitation estimates derived from ground based radar stations comprising the multi-radar/multi-sensor system (MRMS) and measurements from NOAA’s Geostationary Operational Environmental Satellite system that provides a top down view of clouds in the atmosphere. Both data sources cover the continental US and provide image-like inputs that can be efficiently processed by the network.

The model is executed for every 64 km x 64 km square covering the entire US at 1 km resolution. However, the actual physical coverage of the input data corresponding to each of these output regions is much larger, since it must take into account the possible motion of the clouds and precipitation fields over the time period for which the prediction is made. For example, assuming that clouds move up to 60 km/h, in order to make informed predictions that capture the temporal dynamics of the atmosphere up to 8 hours ahead, the model needs 60 x 8 = 480 km of spatial context in all directions. So, to achieve this level of context, information from a 1024 km x 1024 km area is required for predictions being made on the center 64 km x 64 km patch.
Size of the input patch containing satellite and radar images (large, 1024 x 1024 km square) and of the output predicted radar image (small, 64 x 64 km square).
Because processing a 1024 km x 1024 km area at full resolution requires a significant amount of memory, we use a spatial downsampler that decreases memory consumption by reducing the spatial dimension of the input patch, while also finding and keeping the relevant weather patterns in the input. A temporal encoder (implemented with a convolutional LSTM that is especially well suited for sequences of images) is then applied along the time dimension of the downsampled input data, encoding seven snapshots from the previous 90 minutes of input data, in 15-minute segments. The output of the temporal encoder is then passed to a spatial aggregator, which uses axial self-attention to efficiently capture long range spatial dependencies in the data, with a variable amount of context based on the input target time, to make predictions over the 64 km x 64 km output.

The output of this architecture is a discrete probability distribution estimating the probability of a given rate of precipitation for each square kilometer in the continental United States.
The architecture of the neural weather model, MetNet. The input satellite and radar images first pass through a spatial downsampler to reduce memory consumption. They are then processed by a convolutional LSTM at 15 minute intervals over the 90 minutes of input data. Then axial attention layers are used to make the network see the entirety of the input images.
Results
We evaluate MetNet on a precipitation rate forecasting benchmark and compare the results with two baselines — the NOAA High Resolution Rapid Refresh (HRRR) system, which is the physical weather forecasting model currently operational in the US, and a baseline model that estimates the motion of the precipitation field (i.e., optical flow), a method known to perform well for prediction times less than 2 hours.

A significant advantage of our neural weather model is that it is optimized for dense and parallel computation and well suited for running on specialty hardware (e.g., TPUs). This allows predictions to be made in parallel in a matter of seconds, whether it is for a specific location like New York City or for the entire US, whereas physical models such as HRRR have a runtime of about an hour on a supercomputer.

We quantify the difference in performance between MetNet, HRRR, and the optical flow baseline model in the plot below. Here, we show the performance achieved by the three models, evaluated using the F1-score at a precipitation rate threshold of 1.0 mm/h, which corresponds to light rain. The MetNet neural weather model is able to outperform the NOAA HRRR system at timelines less than 8 hours and is consistently better than the flow-based model.
Performance evaluated in terms of F1-score at 1.0 mm/h precipitation rate (higher is better). The neural weather model (MetNet) outperforms the physics-based model (HRRR) currently operational in the US for timescales up to 8 hours ahead.
Due to the stochastic nature of the atmosphere, the uncertainty about the exact future weather conditions increases with longer prediction times. Because MetNet is a probabilistic model, the uncertainty in the predictions is seen in the visualizations by the growing smoothness of the predictions as the forecast time is extended. In contrast, HRRR does not directly make probabilistic predictions, but instead predicts a single potential future. The figure below compares the output of the MetNet model to that of the HRRR model.
Comparison between the output from MetNet (top) and HRRR (bottom) to ground-truth (middle) as retrieved from the NOAA MRMS system. Notice that while the HRRR model predicts structure that appears to be more similar to that of the ground-truth, the structure predicted may be grossly incorrect.
The predictions from the HRRR physical model look sharper and more structured than that of the MetNet model, but the structure, specifically the exact time and location where the structure is predicted, is less accurate due to uncertainty in the initial conditions and the parameters of the model.
HRRR (left) predicts a single potential future outcome (in red) out of the many possible outcomes, whereas MetNet (right) directly accounts for uncertainty by assigning probabilities over the future outcomes.
A more thorough comparison between the HRRR and MetNet models can be found in the video below:
Future Directions
We are actively researching how to improve global weather forecasting, especially in regions where the impacts of rapid climate change are most profound. While we demonstrate the present MetNet model for the continental US, it could be extended to cover any region for which adequate radar and optical satellite data are available. The work presented here is a small stepping stone in this effort that we hope leads to even greater improvements through future collaboration with the meteorological community.

Acknowledgements
This project was done in collaboration with Lasse Espeholt, Jonathan Heek, Mostafa Dehghani, Avital Oliver, Tim Salimans, Shreya Agrawal and Jason Hickey. We would also like to thank Manoj Kumar, Wendy Shang, Dick Weissenborn, Cenk Gazen, John Burge, Stephen Hoyer, Lak Lakshmanan, Rob Carver, Carla Bromberg and Aaron Bell for useful discussions and Tom Small for help with the visualizations.

Source: Google AI Blog


Massively Scaling Reinforcement Learning with SEED RL



Reinforcement learning (RL) has seen impressive advances over the last few years as demonstrated by the recent success in solving games such as Go and Dota 2. Models, or agents, learn by exploring an environment, such as a game, while optimizing for specified goals. However, current RL techniques require increasingly large amounts of training to successfully learn even simple games, which makes iterating research and product ideas computationally expensive and time consuming.

In “SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference”, we present an RL agent that scales to thousands of machines, which enables training at millions of frames per second, and significantly improves computational efficiency. This is achieved with a novel architecture that takes advantage of accelerators (GPUs or TPUs) at scale by centralizing model inference and introducing a fast communication layer. We demonstrate the performance of SEED RL on popular RL benchmarks, such as Google Research Football, Arcade Learning Environment and DeepMind Lab, and show that by using larger models, data efficiency can be increased. The code has been open sourced on Github together with examples for running on Google Cloud with GPUs.

Current Distributed Architectures
The previous generation of distributed reinforcement learning agents, such as IMPALA, made use of accelerators specialized for numerical calculations, taking advantage of the speed and efficiency from which (un)supervised learning has benefited for years. The architecture of an RL agent is usually separated into actors and learners. The actors typically run on CPUs and iterate between taking steps in the environment and running inference on the model to predict the next action. Frequently the actor will update the parameters of the inference model, and after collecting a sufficient amount of observations, will send a trajectory of observations and actions to the learner, which then optimizes the model. In this architecture, the learner trains the model on GPUs using input from distributed inference on hundreds of machines.

Example architecture for an earlier generation RL agent, IMPALA. Inference is done on the actors, usually using inefficient CPUs. Updated model parameters are frequently sent from the learner to the actors increasing bandwidth requirements.
The architecture of RL agents (such as IMPALA) have a number of drawbacks:
  1. Using CPUs for neural network inference is much less efficient and slower than using accelerators and becomes problematic as models become larger and more computationally expensive.
  2. The bandwidth required for sending parameters and intermediate model states between the actors and learner can be a bottleneck.
  3. Handling two completely different tasks on one machine (i.e., environment rendering and inference) is unlikely to utilize machine resources optimally.
SEED RL Architecture
The SEED RL architecture is designed to solve these drawbacks. With this approach, neural network inference is done centrally by the learner on specialized hardware (GPUs or TPUs), enabling accelerated inference and avoiding the data transfer bottleneck by ensuring that the model parameters and state are kept local. While observations are sent to the learner at every environment step, latency is kept low due to a very efficient network library based on the gRPC framework with asynchronous streaming RPCs. This makes it possible to achieve up to a million queries per second on a single machine. The learner can be scaled to thousands of cores (e.g., up to 2048 on Cloud TPUs) and the number of actors can be scaled to thousands of machines to fully utilize the learner, making it possible to train at millions of frames per second. SEED RL is based on the TensorFlow 2 API and, in our experiments, was accelerated by TPUs.
Overview of the architecture of SEED RL. In contrast to the IMPALA architecture, the actors only take actions in environments. Inference is executed centrally by the learner on accelerators using batches of data from multiple actors.
In order for this architecture to be successful, two state-of-the-art algorithms are integrated into SEED RL. The first is V-trace, a policy gradient-based method, first introduced with IMPALA. In general, policy gradient-based methods predict an action distribution from which an action can be sampled. However, because the actors and the learner execute asynchronously in SEED RL, the policy of actors is slightly behind the policy of the learner, i.e., they become off-policy. The usual policy gradient-based methods are on-policy, meaning that they have the same policy for actors and learner, and suffer from convergence and numerical issues in off-policy settings. V-trace is an off-policy method and thus works well in the asynchronous SEED RL architecture.

The second algorithm is R2D2, a Q-learning method that selects an action based on the predicted future value of that action using recurrent distributed replay. This approach allows the Q-learning algorithm to be run at scale, while still allowing the use of recurrent neural networks that can predict future values based on the information of all past frames in an episode.

Experiments
SEED RL is benchmarked on the commonly used Arcade Learning Environment, DeepMind Lab environments, and on the recently released Google Research Football environment.
Frames per second comparing IMPALA and various configurations of SEED RL on DeepMind Lab. SEED RL achieves 2.4M frames per second using 4,160 CPUs. Assuming the same speed, IMPALA would need 14,000 CPUs.
On DeepMind Lab, we achieve 2.4 million frames per second with 64 Cloud TPU cores, which represents an improvement of 80x over the previous state-of-the-art distributed agent, IMPALA. This results in a significant speed-up in wall-clock time and computational efficiency. IMPALA requires 3-4x as many CPUs as SEED RL for the same speed.
Episode return (i.e., the sum of rewards) over time on the DeepMind Lab game “explore_goal_locations_small” using IMPALA and SEED RL. With SEED RL, the time to train is significantly reduced.
With an architecture optimized for use on modern accelerators, it’s natural to increase the model size in an attempt to increase data efficiency. We show that by increasing the size of the model and the input resolution, we are able to solve a previously unsolved Google Research Football task, “Hard”.
The score of different architectures on the Google Research Football “Hard” task. We show that by using an input resolution and a larger model, the score is improved, and with more training, the model can significantly outperform the builtin AI.
Additional details are provided in the paper, including our results on the Arcade Learning Environment. We believe SEED RL and the results presented, demonstrate that reinforcement learning has once again caught up with the rest of the deep learning field in terms of taking advantage of accelerators.

Acknowledgements
This project was done in collaboration with Raphaël Marinier, Piotr Stanczyk, Ke Wang, Marcin Andrychowicz and Marcin Michalski. We would also like to thank Tom Small for the visualizations.

Source: Google AI Blog


Introducing the Next Generation of On-Device Vision Models: MobileNetV3 and MobileNetEdgeTPU



On-device machine learning (ML) is an essential component in enabling privacy-preserving, always-available and responsive intelligence. This need to bring on-device machine learning to compute and power-limited devices has spurred the development of algorithmically-efficient neural network models and hardware capable of performing billions of math operations per second, while consuming only a few milliwatts of power. The recently launched Google Pixel 4 exemplifies this trend, and ships with the Pixel Neural Core that contains an instantiation of the Edge TPU architecture, Google’s machine learning accelerator for edge computing devices, and powers Pixel 4 experiences such as face unlock, a faster Google Assistant and unique camera features. Similarly, algorithms, such as MobileNets, have been critical for the success of on-device ML by providing compact and efficient neural network models for mobile vision applications.

Today we are pleased to announce the release of source code and checkpoints for MobileNetV3 and the Pixel 4 Edge TPU-optimized counterpart MobileNetEdgeTPU model. These models are the culmination of the latest advances in hardware-aware AutoML techniques as well as several advances in architecture design. On mobile CPUs, MobileNetV3 is twice as fast as MobileNetV2 with equivalent accuracy, and advances the state-of-the-art for mobile computer vision networks. On the Pixel 4 Edge TPU hardware accelerator, the MobileNetEdgeTPU model pushes the boundary further by improving model accuracy while simultaneously reducing the runtime and power consumption.

Building MobileNetV3
In contrast with the hand-designed previous version of MobileNet, MobileNetV3 relies on AutoML to find the best possible architecture in a search space friendly to mobile computer vision tasks. To most effectively exploit the search space we deploy two techniques in sequence — MnasNet and NetAdapt. First, we search for a coarse architecture using MnasNet, which uses reinforcement learning to select the optimal configuration from a discrete set of choices. Then we fine-tune the architecture using NetAdapt, a complementary technique that trims under-utilized activation channels in small decrements. To provide the best possible performance under different conditions we have produced both large and small models.
Comparison of accuracy vs. latency for mobile models on the ImageNet classification task using the Google Pixel 4 CPU.
MobileNetV3 Search Space
The MobileNetV3 search space builds on multiple recent advances in architecture design that we adapt for the mobile environment. First, we introduce a new activation function called hard-swish (h-swish) which is based on the Swish nonlinearity function. The critical drawback of the Swish function is that it is very inefficient to compute on mobile hardware. So, instead we use an approximation that can be efficiently expressed as a product of two piecewise linear functions.
Next we introduce the mobile-friendly squeeze-and-excitation block, which replaces the classical sigmoid function with a piecewise linear approximation.

Combining h-swish plus mobile-friendly squeeze-and-excitation with a modified version of the inverted bottleneck structure introduced in MobileNetV2 yielded a new building block for MobileNetV3.
MobileNetV3 extends the MobileNetV2 inverted bottleneck structure by adding h-swish and mobile friendly squeeze-and-excitation as searchable options.
These parameters defined the search space used in constructing MobileNetV3:
  • Size of expansion layer
  • Degree of squeeze-excite compression
  • Choice of activation function: h-swish or ReLU
  • Number of layers for each resolution block
We also introduced a new efficient last stage at the end of the network that further reduced latency by 15%.
MobileNetV3 Object Detection and Semantic Segmentation
In addition to classification models, we also introduced MobileNetV3 object detection models, which reduced detection latency by 25% relative to MobileNetV2 at the same accuracy for the COCO dataset.

In order to optimize MobileNetV3 for efficient semantic segmentation, we introduced a low latency segmentation decoder called Lite Reduced Atrous Spatial Pyramid Pooling (LR-SPP). This new decoder contains three branches, one for low resolution semantic features, one for higher resolution details, and one for light-weight attention. The combination of LR-SPP and MobileNetV3 reduces the latency by over 35% on the high resolution Cityscapes Dataset.

MobileNet for Edge TPUs
The Edge TPU in Pixel 4 is similar in architecture to the Edge TPU in the Coral line of products, but customized to meet the requirements of key camera features in Pixel 4. The accelerator-aware AutoML approach substantially reduces the manual process involved in designing and optimizing neural networks for hardware accelerators. Crafting the neural architecture search space is an important part of this approach and centers around the inclusion of neural network operations that are known to improve hardware utilization. While operations such as squeeze-and-excite and swish non-linearity have been shown to be essential in building compact and fast CPU models, these operations tend to perform suboptimally on Edge TPU and hence are excluded from the search space. The minimalistic variants of MobileNetV3 also forgo the use of these operations (i.e., squeeze-and-excite, swish, and 5x5 convolutions) to allow easier portability to a variety of other hardware accelerators such as DSPs and GPUs.

The neural network architecture search, incentivized to jointly optimize the model accuracy and Edge TPU latency, produces the MobileNetEdgeTPU model that achieves lower latency for a fixed accuracy (or higher accuracy for a fixed latency) than existing mobile models such as MobileNetV2 and minimalistic MobileNetV3. Compared with the EfficientNet-EdgeTPU model (optimized for the Edge TPU in Coral), these models are designed to run at a much lower latency on Pixel 4, albeit at the cost of some loss in accuracy.

Although reducing the model’s power consumption was not a part of the search objective, the lower latency of the MobileNetEdgeTPU models also helps reduce the average Edge TPU power use. The MobileNetEdgeTPU model consumes less than 50% the power of the minimalistic MobileNetV3 model at comparable accuracy.
Left: Comparison of the accuracy on the ImageNet classification task between MobileNetEdgeTPU and other image classification networks designed for mobile when running on Pixel4 Edge TPU. MobileNetEdgeTPU achieves higher accuracy and lower latency compared with other models. Right: Average Edge TPU power in Watts for different classification models running at 30 frames per second (fps).
Objection Detection Using MobileNetEdgeTPU
The MobileNetEdgeTPU classification model also serves as an effective feature extractor for object detection tasks. Compared with MobileNetV2 based detection models, MobileNetEdgeTPU models offer a significant improvement in model quality (measured as the mean average precision; mAP) on the COCO14 minival dataset at comparable runtimes on the Edge TPU. The MobileNetEdgeTPU detection model has a latency of 6.6ms and achieves mAP score of 24.3, while MobileNetV2-based detection models achieve an mAP of 22 and takes 6.8ms per inference.

The Need for Hardware-Aware Models
While the results shown above highlight the power, performance, and quality benefits of MobileNetEdgeTPU models, it is important to note that the improvements arise due to the fact that these models have been customized to run on the Edge TPU accelerator.
MobileNetEdgeTPU when running on a mobile CPU delivers inferior performance compared with the models that have been tuned specifically for mobile CPUs (MobileNetV3). MobileNetEdgeTPU models perform a much greater number of operations, and so, it is not surprising that they run slower on mobile CPUs, which exhibit a more linear relationship between a model’s compute requirements and the runtime.
MobileNetV3 is still the best performing network when using mobile CPU as the deployment target.
For Researchers and Developers
The MobileNetV3 and MobileNetEdgeTPU code, as well as both floating point and quantized checkpoints for ImageNet classification, are available at the MobileNet github page. Open source implementation for MobileNetV3 and MobileNetEdgeTPU object detection is available in the Tensorflow Object Detection API. Open source implementation for MobileNetV3 semantic segmentation is available in TensorFlow through DeepLab.

Acknowledgements:
This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Berkin Akin, Okan Arikan, Gabriel Bender, Bo Chen, Liang-Chieh Chen, Grace Chu, Eddy Hsu, John Joseph, Pieter-jan Kindermans, Quoc Le, Owen Lin, Hanxiao Liu, Yun Long, Ravi Narayanaswami, Ruoming Pang, Mark Sandler, Mingxing Tan, Vijay Vasudevan, Weijun Wang, Dong Hyuk Woo, Dmitry Kalenichenko, Yunyang Xiong, Yukun Zhu and support from Hartwig Adam, Blaise Agüera y Arcas, Chidu Krishnan and Steve Molloy.

Source: Google AI Blog


EfficientNet-EdgeTPU: Creating Accelerator-Optimized Neural Networks with AutoML



For several decades, computer processors have doubled their performance every couple of years by reducing the size of the transistors inside each chip, as described by Moore’s Law. As reducing transistor size becomes more and more difficult, there is a renewed focus in the industry on developing domain-specific architectures — such as hardware accelerators — to continue advancing computational power. This is especially true for machine learning, where efforts are aimed at building specialized architectures for neural network (NN) acceleration. Ironically, while there has been a steady proliferation of these architectures in data centers and on edge computing platforms, the NNs that run on them are rarely customized to take advantage of the underlying hardware.

Today, we are happy to announce the release of EfficientNet-EdgeTPU, a family of image classification models derived from EfficientNets, but customized to run optimally on Google’s Edge TPU, a power-efficient hardware accelerator available to developers through the Coral Dev Board and a USB Accelerator. Through such model customizations, the Edge TPU is able to provide real-time image classification performance while simultaneously achieving accuracies typically seen only when running much larger, compute-heavy models in data centers.

Using AutoML to customize EfficientNets for Edge TPU
EfficientNets have been shown to achieve state-of-the-art accuracy in image classification tasks while significantly reducing the model size and computational complexity. To build EfficientNets designed to leverage the Edge TPU’s accelerator architecture, we invoked the AutoML MNAS framework and augmented the original EfficientNet’s neural network architecture search space with building blocks that execute efficiently on the Edge TPU (discussed below). We also built and integrated a “latency predictor” module that provides an estimate of the model latency when executing on the Edge TPU, by running the models on a cycle-accurate architectural simulator. The AutoML MNAS controller implements a reinforcement learning algorithm to search this space while attempting to maximize the reward, which is a joint function of the predicted latency and model accuracy. From past experience, we know that Edge TPU’s power efficiency and performance tend to be maximized when the model fits within its on-chip memory. Hence we also modified the reward function to generate a higher reward for models that satisfy this constraint.
Overall AutoML flow for designing customized EfficientNet-EdgeTPU models.
Search Space Design
When performing the architecture search described above, one must consider that EfficientNets rely primarily on depthwise-separable convolutions, a type of neural network block that factorizes a regular convolution to reduce the number of parameters as well as the amount of computations. However, for certain configurations, a regular convolution utilizes the Edge TPU architecture more efficiently and executes faster, despite the much larger amount of compute. While it is possible, albeit tedious, to manually craft a network that uses an optimal combination of the different building blocks, augmenting the AutoML search space with these accelerator-optimal blocks is a more scalable approach.
A regular 3x3 convolution (right) has more compute (multiply-and-accumulate (mac) operations) than an depthwise-separable convolution (left), but for certain input/output shapes, executes faster on Edge TPU due to ~3x more effective hardware utilization.
In addition, removing certain operations from the search space that require modifications to the Edge TPU compiler to fully support, such swish non-linearity and squeeze-and-excitation block, naturally leads to models that are readily ported to the Edge TPU hardware. These operations tend to improve model quality slightly, so by eliminating them from the search space, we have effectively instructed AutoML to discover alternate network architectures that may compensate for any potential loss in quality.

Model Performance
The neural architecture search (NAS) described above produced a baseline model, EfficientNet-EdgeTPU-S, which is subsequently scaled up using EfficientNet's compound scaling method to produce the -M and -L models. The compound scaling approach selects an optimal combination of input image resolution scaling, network width, and depth scaling to construct larger, more accurate models. The -M, and -L models achieve higher accuracy at the cost of increased latency as shown in the figure below.
EfficientNet-EdgeTPU-S/M/L models achieve better latency and accuracy than existing EfficientNets (B1), ResNet, and Inception by specializing the network architecture for Edge TPU hardware. In particular, our EfficientNet-EdgeTPU-S achieves higher accuracy, yet runs 10x faster than ResNet-50.
Interestingly, the NAS-generated model employs the regular convolution quite extensively in the initial part of the network where the depthwise-separable convolution tends to be less effective than the regular convolution when executed on the accelerator. This clearly highlights the fact that trade-offs usually made while optimizing models for general purpose CPUs (reducing the total number of operations, for example) are not necessarily optimal for hardware accelerators. Also, these models achieve high accuracy even without the use of esoteric operations. Comparing with the other image classification models such as Inception-resnet-v2 and Resnet50, EfficientNet-EdgeTPU models are not only more accurate, but also run faster on Edge TPUs.

This work represents a first experiment in building accelerator-optimized models using AutoML. The AutoML-based model customization can be extended to not only a wide range of hardware accelerators, but also to several different applications that rely on neural networks.

From Cloud TPU training to Edge TPU deployment
We have released the training code and pretrained models for EfficientNet-EdgeTPU on our github repository. We employ tensorflow’s post-training quantization tool to convert a floating-point trained model to an Edge TPU-compatible integer-quantized model. For these models, the post-training quantization works remarkably well and produces only a very slight loss in accuracy (~0.5%). The script for exporting the quantized model from a training checkpoint can be found here. For an update on the Coral platform, see this post on the Google Developer’s Blog, and for full reference materials and detailed instructions, please refer to the Coral website.

Acknowledgements
Special thanks to Quoc Le, Hongkun Yu, Yunlu Li, Ruoming Pang, and Vijay Vasudevan from the Google Brain team; Bo Wu, Vikram Tank, and Ajay Nair from the Google Coral team; Han Vanholder, Ravi Narayanaswami, John Joseph, Dong Hyuk Woo, Raksit Ashok, Jason Jong Kyu Park, Jack Liu, Mohammadali Ghodrat, Cao Gao, Berkin Akin, Liang-Yun Wang, Chirag Gandhi, and Dongdong Li from the Google Edge TPU team.

Source: Google AI Blog


Coral updates: Project tutorials, a downloadable compiler, and a new distributor

Posted by Vikram Tank (Product Manager), Coral Team

coral hardware

We’re committed to evolving Coral to make it even easier to build systems with on-device AI. Our team is constantly working on new product features, and content that helps ML practitioners, engineers, and prototypers create the next generation of hardware.

To improve our toolchain, we're making the Edge TPU Compiler available to users as a downloadable binary. The binary works on Debian-based Linux systems, allowing for better integration into custom workflows. Instructions on downloading and using the binary are on the Coral site.

We’re also adding a new section to the Coral site that showcases example projects you can build with your Coral board. For instance, Teachable Machine is a project that guides you through building a machine that can quickly learn to recognize new objects by re-training a vision classification model directly on your device. Minigo shows you how to create an implementation of AlphaGo Zero and run it on the Coral Dev Board or USB Accelerator.

Our distributor network is growing as well: Arrow will soon sell Coral products.

Reducing the Need for Labeled Data in Generative Adversarial Networks



Generative adversarial networks (GANs) are a powerful class of deep generative models.The main idea behind GANs is to train two neural networks: the generator, which learns how to synthesise data (such as an image), and the discriminator, which learns how to distinguish real data from the ones synthesised by the generator. This approach has been successfully used for high-fidelity natural image synthesis, improving learned image compression, data augmentation, and more.
Evolution of the generated samples as training progresses on ImageNet. The generator network is conditioned on the class (e.g., "great gray owl" or "golden retriever").
For natural image synthesis, state-of-the-art results are achieved by conditional GANs that, unlike unconditional GANs, use labels (e.g. car, dog, etc.) during training. While this makes the task easier and leads to significant improvements, this approach requires a large amount of labeled data that is rarely available in practice.

In "High-Fidelity Image Generation With Fewer Labels", we propose a new approach to reduce the amount of labeled data required to train state-of-the-art conditional GANs. When combined with recent advancements on large-scale GANs, we match the state-of-the-art in high-fidelity natural image synthesis using 10x fewer labels. Based on this research, we are also releasing a major update to the Compare GAN library, which contains all the components necessary to train and evaluate modern GANs.

Improvements via Semi-supervision and Self-supervision
In conditional GANs, both the generator and discriminator are typically conditioned on class labels. In this work, we propose to replace the hand-annotated ground truth labels with inferred ones. To infer high-quality labels for a large dataset of mostly unlabeled data, we take a two-step approach: First, we learn a feature representation using only the unlabeled portion of the dataset. To learn the feature representations we make use of self-supervision in the form of a recently introduced approach, in which the unlabeled images are randomly rotated and a deep convolutional neural network is tasked with predicting the rotation angle. The idea is that the models need to be able to recognize the main objects and their shapes in order to be successful on this task.
An unlabeled image is randomly rotated and the network is tasked with predicting the rotation angle. Successful models need to capture semantically meaningful image features which can then be used for other vision tasks.
We then consider the activation pattern of one of the intermediate layers of the trained network as the new feature representation of the input, and train a classifier to recognize the label of that input using the labeled portion of the original data set. As the network was pre-trained to extract semantically meaningful features from the data (on the rotation prediction task), training this classifier is more sample-efficient than training the entire network from scratch. Finally, we use this classifier to label the unlabeled data.

To further improve the model quality and training stability we encourage the discriminator network to learn meaningful feature representations which are not forgotten during training by means of an auxiliary loss we introduced previously. These two advancements, combined with large-scale training lead to state-of-the-art conditional GANs for the task of ImageNet synthesis as measured by the Fréchet Inception Distance.
Given a latent vector the generator network produces an image. In each row, linear interpolation between the latent codes of the leftmost and the rightmost image results in a semantic interpolation in the image space.
Compare GAN: A Library for Training and Evaluating GANs
Cutting-edge research on GANs is heavily dependent on a well-engineered and well-tested codebase, since even replicating prior results and techniques requires a significant effort. In order to foster open science and allow the research community benefit from recent advancements, we are releasing a major update of the Compare GAN library. The library includes loss functions, regularization and normalization schemes, neural architectures, and quantitative metrics commonly used in modern GANs, and now supports:
Conclusions and Future Work
Given the growing gap between labeled and unlabeled data sources, it is becoming increasingly important to be able to learn from only partially labeled data. We have shown that a simple yet powerful combination of self-supervision and semi-supervision can help to close this gap for GANs. We believe that self-supervision is a powerful idea that should be investigated for other generative modeling tasks.

Acknowledgments
Work conducted in collaboration with colleagues on the Google Brain team in Zürich, ETH Zürich and UCLA. We would like to thank our paper co-authors Michael Tschannen, Xiaohua Zhai, Olivier Bachem and Sylvain Gelly for their input and feedback. We would like to thank Alexander Kolesnikov, Lucas Beyer and Avital Oliver for helpful discussion on self-supervised learning and semi-supervised learning. We would like to thank Karol Kurach and Marcin Michalski for their major contributions to the Compare GAN library. We would also like to thank Andy Brock, Jeff Donahue and Karen Simonyan for their insights into training GANs on TPUs. The work described in this post also builds upon our work on “Self-Supervised Generative Adversarial Networks” with Ting Chen and Neil Houlsby.

Source: Google AI Blog


Measuring the Limits of Data Parallel Training for Neural Networks



Over the past decade, neural networks have achieved state-of-the-art results in a wide variety of prediction tasks, including image classification, machine translation, and speech recognition. These successes have been driven, at least in part, by hardware and software improvements that have significantly accelerated neural network training. Faster training has directly resulted in dramatic improvements to model quality, both by allowing more training data to be processed and by allowing researchers to try new ideas and configurations more rapidly. Today, hardware developments like Cloud TPU Pods are rapidly increasing the amount of computation available for neural network training, which raises the possibility of harnessing additional computation to make neural networks train even faster and facilitate even greater improvements to model quality. But how exactly should we harness this unprecedented amount of computation, and should we always expect more computation to facilitate faster training?

The most common way to utilize massive compute power is to distribute computations between different processors and perform those computations simultaneously. When training neural networks, the primary ways to achieve this are model parallelism, which involves distributing the neural network across different processors, and data parallelism, which involves distributing training examples across different processors and computing updates to the neural network in parallel. While model parallelism makes it possible to train neural networks that are larger than a single processor can support, it usually requires tailoring the model architecture to the available hardware. In contrast, data parallelism is model agnostic and applicable to any neural network architecture – it is the simplest and most widely used technique for parallelizing neural network training. For the most common neural network training algorithms (synchronous stochastic gradient descent and its variants), the scale of data parallelism corresponds to the batch size, the number of training examples used to compute each update to the neural network. But what are the limits of this type of parallelization, and when should we expect to see large speedups?

In "Measuring the Effects of Data Parallelism in Neural Network Training", we investigate the relationship between batch size and training time by running experiments on six different types of neural networks across seven different datasets using three different optimization algorithms ("optimizers"). In total, we trained over 100K individual models across ~450 workloads, and observed a seemingly universal relationship between batch size and training time across all workloads we tested. We also study how this relationship varies with the dataset, neural network architecture, and optimizer, and found extremely large variation between workloads. Additionally, we are excited to share our raw data for further analysis by the research community. The data includes over 71M model evaluations to make up the training curves of all 100K+ individual models we trained, and can be used to reproduce all 24 plots in our paper.

Universal Relationship Between Batch Size and Training Time
In an idealized data parallel system that spends negligible time synchronizing between processors, training time can be measured in the number of training steps (updates to the neural network's parameters). Under this assumption, we observed three distinct scaling regimes in the relationship between batch size and training time: a "perfect scaling" regime where doubling the batch size halves the number of training steps required to reach a target out-of-sample error, followed by a regime of "diminishing returns", and finally a "maximal data parallelism" regime where further increasing the batch size does not reduce training time, even assuming idealized hardware.

For all workloads we tested, we observed a universal relationship between batch size and training speed with three distinct regimes: perfect scaling (following the dashed line), diminishing returns (diverging from the dashed line), and maximal data parallelism (where the trend plateaus). The transition points between the regimes vary dramatically between different workloads.
Although the basic relationship between batch size and training time appears to be universal, we found that the transition points between the different scaling regimes vary dramatically across neural network architectures and datasets. This means that while simple data parallelism can provide large speedups for some workloads at the limits of today's hardware (e.g. Cloud TPU Pods), and perhaps beyond, some workloads require moving beyond simple data parallelism in order to benefit from the largest scale hardware that exists today, let alone hardware that has yet to be built. For example, in the plot above, ResNet-8 on CIFAR-10 cannot benefit from batch sizes larger than 1,024, whereas ResNet-50 on ImageNet continues to benefit from increasing the batch size up to at least 65,536.

Optimizing Workloads
If one could predict which workloads benefit most from data parallel training, then one could tailor their workloads to make maximal use of the available hardware. However, our results suggest that this will often not be straightforward, because the maximum useful batch size depends, at least somewhat, on every aspect of the workload: the neural network architecture, the dataset, and the optimizer. For example, some neural network architectures can benefit from much larger batch sizes than others, even when trained on the same dataset with the same optimizer. Although this effect sometimes depends on the width and depth of the network, it is inconsistent between different types of network and some networks do not even have obvious notions of "width" and "depth". And while we found that some datasets can benefit from much larger batch sizes than others, these differences are not always explained by the size of the dataset—sometimes smaller datasets benefit more from larger batch sizes than larger datasets.

Left: A transformer neural network scales to much larger batch sizes than an LSTM neural network on the LM1B dataset. Right: The Common Crawl dataset does not benefit from larger batch sizes than the LM1B dataset, even though it is 1,000 times the size.
Perhaps our most promising finding is that even small changes to the optimization algorithm, such as allowing momentum in stochastic gradient descent, can dramatically improve how well training scales with increasing batch size. This raises the possibility of designing new optimizers, or testing the scaling properties of optimizers that we did not consider, to find optimizers that can make maximal use of increased data parallelism.

Future Work
Utilizing additional data parallelism by increasing the batch size is a simple way to produce valuable speedups across a range of workloads, but, for all the workloads we tried, the benefits diminished within the limits of state-of-the-art hardware. However, our results suggest that some optimization algorithms may be able to consistently extend the perfect scaling regime across many models and data sets. Future work could perform the same measurements with other optimizers, beyond the few closely-related ones we tried, to see if any existing optimizer extends perfect scaling across many problems.

Acknowledgements
The authors of this study were Chris Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig and George Dahl (Chris and Jaehoon contributed equally). Many researchers have done work in this area that we have built on, so please see our paper for a full discussion of related work.

Source: Google AI Blog


New AIY Edge TPU Boards

Posted by Billy Rutledge, Director of AIY Projects

Over the past year and a half, we've seen more than 200K people build, modify, and create with our Voice Kit and Vision Kit products. Today at Cloud Next we announced two new devices to help professional engineers build new products with on-device machine learning(ML) at their core: the AIY Edge TPU Dev Board and the AIY Edge TPU Accelerator. Both are powered by Google's Edge TPU and represent our first steps towards expanding AIY into a platform for experimentation with on-device ML.

The Edge TPU is Google's purpose-built ASIC chip designed to run TensorFlow Lite ML models on your device. We've learned that performance-per-watt and performance-per-dollar are critical benchmarks when processing neural networks within a small footprint. The Edge TPU delivers both in a package that's smaller than the head of a penny. It can accelerate ML inferencing on device, or can pair with Google Cloud to create a full cloud-to-edge ML stack. In either configuration, by processing data directly on-device, a local ML accelerator increases privacy, removes the need for persistent connections, reduces latency, and allows for high performance using less power.

The AIY Edge TPU Dev Board is an all-in-one development board that allows you to prototype embedded systems that demand fast ML inferencing. The baseboard provides all the peripheral connections you need to effectively prototype your device — including a 40-pin GPIO header to integrate with various electrical components. The board also features a removable System-on-module (SOM) daughter board can be directly integrated into your own hardware once you're ready to scale.

The AIY Edge TPU Accelerator is a neural network coprocessor for your existing system. This small USB-C stick can connect to any Linux-based system to perform accelerated ML inferencing. The casing includes mounting holes for attachment to host boards such as a Raspberry Pi Zero or your custom device.

On-device ML is still in its early days, and we're excited to see how these two products can be applied to solve real world problems — such as increasing manufacturing equipment reliability, detecting quality control issues in products, tracking retail foot-traffic, building adaptive automotive sensing systems, and more applications that haven't been imagined yet.

Both devices will be available online this fall in the US with other countries to follow shortly.

For more product information visit g.co/aiy and sign up to be notified as products become available.

Accelerated Training and Inference with the Tensorflow Object Detection API



Last year we announced the TensorFlow Object Detection API, and since then we’ve released a number of new features, such as models learned via Neural Architecture Search, instance segmentation support and models trained on new datasets such as Open Images. We have been amazed at how it is being used – from finding scofflaws on the streets of NYC to diagnosing diseases on cassava plants in Tanzania.
Today, as part of Google’s commitment to democratizing computer vision, and using feedback from the research community on how to make this codebase even more useful, we’re excited to announce a number of additions to our API. Highlights of this release include:
  • Support for accelerated training of object detection models via Cloud TPUs
  • Improving the mobile deployment process by accelerating inference and making it easy to export a model to mobile with the TensorFlow Lite format
  • Several new model architecture definitions including:
Additionally, we are releasing pre-trained weights for each of the above models based on the COCO dataset.

Accelerated Training via Cloud TPUs
Users spend a great deal of time on optimizing hyperparameters and retraining object detection models, therefore having fast turnaround times on experiments is critical. The models released today belong to the single shot detector (SSD) class of architectures that are optimized for training on Cloud TPUs. For example, we can now train a ResNet-50 based RetinaNet model to achieve 35% mean Average Precision (mAP) on the COCO dataset in < 3.5 hrs.
Accelerated Inference via Quantization and TensorFlow Lite 
To better support low-latency requirements on mobile and embedded devices, the models we are providing are now natively compatible with TensorFlow Lite, which enables on-device machine learning inference with low latency and a small binary size. As part of this, we have implemented: (1) model quantization and (2) detection-specific operations natively in TensorFlow Lite. Our model quantization follows the strategy outlined in Jacob et al. (2018) and the whitepaper by Krishnamoorthi (2018) which applies quantization to both model weights and activations at training and inference time, yielding smaller models that run faster.
Quantized detection models are faster and smaller (e.g., a quantized 75% depth-reduced SSD Mobilenet model runs at >15 fps on a Pixel 2 CPU with a 4.2 Mb footprint) with minimal loss in detection accuracy compared to the full floating point model.
Try it Yourself with a New Tutorial!
To get started training your own model on Cloud TPUs, check out our new tutorial! This walkthrough will take you through the process of training a quantized pet face detector on Cloud TPU then exporting it to an Android phone for inference via TensorFlow Lite conversion.

We hope that these new additions will help make high-quality computer vision models accessible to anyone wishing to solve an object detection problem, and provide a more seamless user experience, from training a model with quantization to exporting to a TensorFlow Lite model ready for on-device deployment. We would like to thank everyone in the community who have contributed features and bug fixes. As always, contributions to the codebase are welcome, and please stay tuned for more updates!

Acknowledgements
This post reflects the work of the following group of core contributors: Derek Chow, Aakanksha Chowdhery, Jonathan Huang, Pengchong Jin, Zhichao Lu, Vivek Rathod, Ronny Votel and Xiangxin Zhu. We would also like to thank the following colleagues: Vasu Agrawal, Sourabh Bajaj, Chiachen Chou, Tom Jablin, Wenzhe Li, Tsung-Yi Lin, Hernan Moraldo, Kevin Murphy, Sara Robinson, Andrew Selle, Shashi Shekhar, Yash Sonthalia, Zak Stone, Pete Warden and Menglong Zhu.

Source: Google AI Blog