RLiable: Towards Reliable Evaluation & Reporting in Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning that focuses on learning from experiences to solve decision making tasks. While the field of RL has made great progress, resulting in impressive empirical results on complex tasks, such as playing video games, flying stratospheric balloons and designing hardware chips, it is becoming increasingly apparent that the current standards for empirical evaluation might give a false sense of fast scientific progress while slowing it down.

To that end, in “Deep RL at the Edge of the Statistical Precipice”, accepted as an oral presentation at NeurIPS 2021, we discuss how statistical uncertainty of results needs to be considered, especially when using only a few training runs, in order for evaluation in deep RL to be reliable. Specifically, the predominant practice of reporting point estimates ignores this uncertainty and hinders reproducibility of results. Related to this, tables with per-task scores, as are commonly reported, can be overwhelming beyond a few tasks and often omit standard deviations. Furthermore, simple performance metrics like the mean can be dominated by a few outlier tasks, while the median score would remain unaffected even if up to half of the tasks had performance scores of zero. Thus, to increase the field's confidence in reported results with a handful of runs, we propose various statistical tools, including stratified bootstrap confidence intervals, performance profiles, and better metrics, such as interquartile mean and probability of improvement. To help researchers incorporate these tools, we also release an easy-to-use Python library RLiable with a quickstart colab.

Statistical Uncertainty in RL Evaluation
Empirical research in RL relies on evaluating performance on a diverse suite of tasks, such as Atari 2600 video games, to assess progress. Published results on deep RL benchmarks typically compare point estimates of the mean and median scores aggregated across tasks. These scores are typically relative to some defined baseline and optimal performance (e.g., random agent and “average” human performance on Atari games, respectively) so as to make scores comparable across different tasks.
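
As a rough illustration of this normalization, the sketch below rescales a raw score so that the baseline maps to 0 and the reference performance maps to 1. The function name and the example numbers are hypothetical and are not taken from any benchmark.

```python
def normalized_score(raw, baseline, reference):
    """Rescale a raw score so that baseline -> 0.0 and reference -> 1.0."""
    if reference == baseline:
        raise ValueError("reference and baseline scores must differ")
    return (raw - baseline) / (reference - baseline)


# Made-up numbers for illustration only (not real Atari scores).
print(normalized_score(raw=550.0, baseline=100.0, reference=1000.0))  # 0.5
```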

In most RL experiments, there is randomness in the scores obtained from different training runs, so reporting only point estimates does not reveal whether similar results would be obtained with new independent runs. A small number of training runs, coupled with the high variability in performance of deep RL algorithms, often leads to large statistical uncertainty in such point estimates.

The distribution of median human normalized scores on the Atari 100k benchmark, which contains 26 games, for five recently published algorithms, DER, OTR, CURL, two variants of DrQ, and SPR. The reported point estimates of median scores based on a few runs in publications, as shown by dashed lines, do not provide information about the variability in median scores and typically overestimate (e.g., CURL, SPR, DrQ) or underestimate (e.g., DER) the expected median, which can result in erroneous conclusions.

As benchmarks become increasingly complex, evaluating more than a few runs will be increasingly demanding due to the increased compute and data needed to solve such tasks. For example, five runs on 50 Atari games for 200 million frames take 1000+ GPU days. Thus, evaluating more runs is not a feasible solution for reducing statistical uncertainty on computationally demanding benchmarks. While prior work has recommended statistical significance tests as a solution, such tests are dichotomous in nature (either “significant” or “not significant”), so they often lack the granularity needed to yield meaningful insights and are widely misinterpreted.

Number of runs in RL papers over the years. Beginning with the Arcade Learning Environment (ALE), the shift toward computationally-demanding benchmarks has led to the practice of evaluating only a handful of runs per task, increasing the statistical uncertainty in point estimates.

Tools for Reliable Evaluation
Any aggregate metric based on a finite number of runs is a random variable, so to take this into account, we advocate for reporting stratified bootstrap confidence intervals (CIs), which predict the likely values of aggregate metrics if the same experiment were repeated with different runs. These CIs allow us to understand the statistical uncertainty and reproducibility of results. Such CIs use the scores on combined runs across tasks. For example, evaluating 3 runs each on Atari 100k, which contains 26 tasks, results in 78 sample scores for uncertainty estimation.

In each task, colored balls denote scores on different runs. To compute stratified bootstrap CIs using the percentile method, bootstrap samples are created by randomly sampling scores with replacement proportionately from each task. Then, the distribution of aggregate scores on these samples is the bootstrapping distribution, whose spread around the center gives us the confidence interval.
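
The procedure described above can be sketched in plain Python as follows. This is a minimal illustration rather than the RLiable implementation: the function names, the number of bootstrap repetitions, and the toy scores are all assumptions made for the example.

```python
import random
import statistics


def stratified_bootstrap_ci(scores, aggregate, reps=2000, alpha=0.05, seed=0):
    """Percentile-method CI for `aggregate` under a stratified bootstrap.

    scores: dict mapping task name -> list of per-run scores.
    aggregate: function that takes such a dict and returns a float.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # Resample runs with replacement within each task (the strata),
        # keeping the number of runs per task unchanged.
        resample = {
            task: rng.choices(runs, k=len(runs)) for task, runs in scores.items()
        }
        estimates.append(aggregate(resample))
    estimates.sort()
    # Read the interval endpoints off the sorted bootstrap distribution.
    lower = estimates[int((alpha / 2) * reps)]
    upper = estimates[min(reps - 1, int((1 - alpha / 2) * reps))]
    return lower, upper


def mean_of_task_means(scores):
    """Example aggregate: average the per-task mean scores."""
    return statistics.fmean(statistics.fmean(runs) for runs in scores.values())


# Toy data: 3 tasks with 3 runs each (made-up normalized scores).
toy_scores = {
    "task_a": [0.2, 0.5, 0.3],
    "task_b": [1.1, 0.9, 1.4],
    "task_c": [0.0, 0.1, 0.6],
}
low, high = stratified_bootstrap_ci(toy_scores, mean_of_task_means)
print(f"95% CI for the mean score: [{low:.3f}, {high:.3f}]")
```

Any aggregate (median, IQM, and so on) can be passed in place of `mean_of_task_means`, since the resampling step does not depend on the metric.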

Most deep RL algorithms perform better on some tasks and training runs than on others, but aggregate performance metrics can conceal this variability, as shown below.

Data with varied appearance but identical aggregate statistics. Source: Same Stats, Different Graphs.

Instead, we recommend performance profiles, which are typically used for comparing solve times of optimization software. These profiles plot the score distribution across all runs and tasks with uncertainty estimates using stratified bootstrap confidence bands. These plots show the fraction of runs across all tasks that obtain a score above a threshold (𝝉) as a function of the threshold.

Performance profiles correspond to the empirical tail distribution of scores on runs combined across all tasks. Shaded regions show 95% stratified bootstrap confidence bands.

Such profiles allow for qualitative comparisons at a glance. For example, if the curve for one algorithm lies above that of another across all thresholds, the first algorithm is better than the second. We can also read off any score percentile, e.g., the profiles intersect y = 0.5 (dotted line above) at the median score. Furthermore, the area under the profile corresponds to the mean score.
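
A performance profile can be computed in a few lines. The sketch below pools runs across tasks and reports, for each threshold, the fraction of runs scoring above it. It assumes every task has the same number of runs; the function name and toy scores are hypothetical.

```python
def performance_profile(scores, thresholds):
    """Fraction of all runs (pooled across tasks) that score above each threshold."""
    pooled = [s for runs in scores.values() for s in runs]
    return [sum(s > tau for s in pooled) / len(pooled) for tau in thresholds]


# Toy data: 2 tasks with 3 runs each (made-up normalized scores).
toy_scores = {
    "task_a": [0.2, 0.5, 0.3],
    "task_b": [1.1, 0.9, 1.4],
}
taus = [0.0, 0.25, 0.5, 1.0]
profile = performance_profile(toy_scores, taus)
for tau, frac in zip(taus, profile):
    print(f"tau={tau:.2f}: {frac:.3f} of runs score above tau")
```

Confidence bands like those in the figure could be obtained by recomputing the profile on stratified bootstrap resamples of the scores.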

While performance profiles are useful for qualitative comparisons, algorithms rarely outperform other algorithms on all tasks and thus their profiles often intersect, so finer quantitative comparisons require aggregate performance metrics. However, existing metrics have limitations: (1) a single high performing task may dominate the task mean score, while (2) the task median is unaffected by zero scores on nearly half of the tasks and requires a large number of training runs for small statistical uncertainty. To address the above limitations, we recommend two alternatives based on robust statistics: the interquartile mean (IQM) and the optimality gap, both of which can be read as areas under the performance profile, below.

IQM (red) corresponds to the area under the performance profile, shown in blue, between the 25 and 75 percentile scores on the x-axis. Optimality gap (yellow) corresponds to the area between the profile and horizontal line at y = 1 (human performance), for scores less than 1.

As an alternative to median and mean, IQM corresponds to the mean score of the middle 50% of the runs combined across all tasks. It is more robust to outliers than the mean, a better indicator of overall performance than the median, and results in smaller CIs, and so requires fewer runs to claim improvements. Another alternative to the mean, the optimality gap measures how far an algorithm is from optimal performance.

IQM discards the lowest 25% and highest 25% of the combined scores (colored balls) and computes the mean of the remaining 50% scores.
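
Both metrics are simple to compute from pooled scores. The following is a minimal sketch, not the RLiable implementation. It assumes a target score of 1.0 for the optimality gap and, when the number of scores is not a multiple of four, simply truncates the trim count (other implementations may handle the boundary differently). The toy scores are made up.

```python
import statistics


def interquartile_mean(scores):
    """Mean of the middle 50% of scores pooled across all tasks and runs."""
    pooled = sorted(s for runs in scores.values() for s in runs)
    cut = len(pooled) // 4  # number of scores dropped at each end
    return statistics.fmean(pooled[cut:len(pooled) - cut])


def optimality_gap(scores, target=1.0):
    """Average amount by which runs fall short of `target` (0 if at or above it)."""
    pooled = [s for runs in scores.values() for s in runs]
    return statistics.fmean(max(target - s, 0.0) for s in pooled)


# Toy data with one outlier run (9.0) that inflates the plain mean.
toy_scores = {
    "task_a": [0.0, 0.2, 0.4, 0.6],
    "task_b": [0.8, 1.0, 1.2, 9.0],
}
all_scores = [s for runs in toy_scores.values() for s in runs]
print("mean:", statistics.fmean(all_scores))
print("IQM:", interquartile_mean(toy_scores))
print("optimality gap:", optimality_gap(toy_scores))
```

In this toy example the single outlier pulls the mean well above the IQM, which is the kind of sensitivity the IQM is meant to reduce.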

For directly comparing two algorithms, another metric to consider is the average probability of improvement, which describes how likely an improvement over baseline is, regardless of its size. This metric is computed using the Mann-Whitney U-statistic, averaged across tasks.
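
The following sketch shows one way to compute this quantity. For each task it counts the fraction of run pairs in which algorithm X beats algorithm Y (ties count as half), which equals the Mann-Whitney U-statistic divided by the number of pairs, and then averages over tasks. The function name and the toy scores are hypothetical.

```python
def probability_of_improvement(scores_x, scores_y):
    """Average over tasks of P(X > Y), estimated from all pairs of runs."""
    per_task = []
    for task, xs in scores_x.items():
        ys = scores_y[task]
        # U-statistic: 1 for each pair where X wins, 0.5 for each tie.
        u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
        per_task.append(u / (len(xs) * len(ys)))
    return sum(per_task) / len(per_task)


# Toy data: X always wins on task_a but mostly loses on task_b.
x_scores = {"task_a": [1.0, 2.0, 3.0], "task_b": [5.0, 5.0, 5.0]}
y_scores = {"task_a": [0.0, 0.0, 0.0], "task_b": [5.0, 6.0, 7.0]}
print(probability_of_improvement(x_scores, y_scores))
```

A value near 0.5 indicates that neither algorithm is reliably better, regardless of how large the score differences are.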

Re-evaluating Evaluation
Using the above tools for evaluation, we revisit performance evaluations of existing algorithms on widely used RL benchmarks, revealing inconsistencies in prior evaluation. For example, in the Arcade Learning Environment (ALE), a widely recognized RL benchmark, the performance ranking of algorithms changes depending on the choice of aggregate metric. Since performance profiles capture the full picture, they often illustrate why such inconsistencies exist.

Median (left) and IQM (right) human normalized scores on the ALE as a function of the number of environment frames seen during training. IQM results in significantly smaller CIs than median scores.

On DM Control, a popular continuous control benchmark, there are large overlaps in 95% CIs of mean normalized scores for most algorithms.

DM Control Suite results, averaged across six tasks, on the 100k and 500k step benchmark. Since scores are normalized using maximum performance, mean scores correspond to one minus the optimality gap. The ordering of the algorithms is based on their claimed relative performance — all algorithms except Dreamer claimed improvement over at least one algorithm placed below them. Shaded regions show 95% CIs.

Finally, on Procgen, a benchmark for evaluating generalization in RL, the average probability of improvement shows that some claimed improvements are only 50-70% likely, suggesting that some reported improvements could be spurious.

Each row shows the probability that the algorithm X on the left outperforms algorithm Y on the right, given that X was claimed to be better than Y. Shaded region denotes 95% stratified bootstrap CIs.

Our findings on widely-used deep RL benchmarks show that statistical issues can have a large influence on previously reported results. In this work, we take a fresh look at evaluation to improve the interpretation of reported results and standardize experimental reporting. We’d like to emphasize the importance of published papers providing results for all runs to allow for future statistical analyses. To build confidence in your results, please check out our open-source library RLiable and the quickstart colab.

This work was done in collaboration with Max Schwarzer, Aaron Courville and Marc G. Bellemare. We’d like to thank Tom Small for an animated figure used in this post. We are also grateful for feedback by several members of the Google Research, Brain Team and DeepMind.

Source: Google AI Blog

Baselines for Uncertainty and Robustness in Deep Learning

Machine learning (ML) is increasingly being used in real-world applications, so understanding the uncertainty and robustness of a model is necessary to ensure performance in practice. For example, how do models behave when deployed on data that differs from the data on which they were trained? How do models signal when they are likely to make a mistake?

To get a handle on an ML model's behavior, its performance is often measured against a baseline for the task of interest. With each baseline, researchers must try to reproduce results only using descriptions from the corresponding papers, which results in serious challenges for replication. Having access to the code for experiments may be more useful, assuming it is well-documented and maintained. But even this is not enough, because the baselines must be rigorously validated. For example, in retrospective analyses over a collection of works [1, 2, 3], authors often find that a simple well-tuned baseline outperforms more sophisticated methods. In order to truly understand how models perform relative to each other, and enable researchers to measure whether new ideas in fact yield meaningful progress, models of interest must be compared to a common baseline.

In “Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning”, we introduce Uncertainty Baselines, a collection of high-quality implementations of standard and state-of-the-art deep learning methods for a variety of tasks, with the goal of making research on uncertainty and robustness more reproducible. The collection spans 19 methods across nine tasks, each with at least five metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components and with minimal dependencies outside of the framework in which it is written. The included pipelines are implemented in TensorFlow, PyTorch, and JAX. Additionally, the hyperparameters for each baseline have been extensively tuned over numerous iterations so as to provide even stronger results.

Uncertainty Baselines
As of this writing, Uncertainty Baselines provides a total of 83 baselines, comprising 19 methods encompassing standard and more recent strategies over nine datasets. Example methods include BatchEnsemble, Deep Ensembles, Rank-1 Bayesian Neural Nets, Monte Carlo Dropout, and Spectral-normalized Neural Gaussian Processes. It acts as a successor in merging several popular benchmarks in the community: Can You Trust Your Model's Uncertainty?, BDL benchmarks, and Edward2's baselines.

Dataset                                 | Inputs                   | Output                                 | Train Examples | Test Datasets
CIFAR                                   | RGB images               | 10-class distribution                  | 50,000         | 3
ImageNet                                | RGB images               | 1000-class distribution                | 1,281,167      | 6
CLINC Intent Detection                  | Dialog system query text | 150-class distribution (in 10 domains) | 15,000         | 2
Kaggle's Diabetic Retinopathy Detection | RGB images               | Probability of Diabetic Retinopathy    | 35,126         | 1
Wikipedia Toxicity                      | Wikipedia comment text   | Probability of toxicity                | 159,571        | 3

A subset of 5 out of 9 available datasets for which baselines are provided. The datasets span tabular, text, and image modalities.

Uncertainty Baselines sets up each baseline under a choice of base model, training dataset, and a suite of evaluation metrics. Each is then tuned over its hyperparameters to maximize performance on such metrics. The available baselines vary along these three axes.

Modularity and Reusability
In order for researchers to use and build on the baselines, we deliberately optimized them to be as modular and minimal as possible. As seen in the workflow figure below, Uncertainty Baselines introduces no new class abstractions, instead reusing classes that pre-exist in the ecosystem (e.g., TensorFlow’s tf.data.Dataset). The train/evaluation pipeline for each of the baselines is contained in a standalone Python file for that experiment, which can run on CPU, GPU, or Google Cloud TPUs. Because of this independence between baselines, we are able to develop baselines in any of TensorFlow, PyTorch or JAX.

Workflow diagram for how the different components of Uncertainty Baselines are structured. All datasets are subclasses of the BaseDataset class, which provides a simple API for use in baselines written with any of the supported frameworks. The outputs from any of the baselines can then be analyzed with the Robustness Metrics library.

One area of debate among research engineers is how to manage hyperparameters and other experiment configuration values, which can easily number in the dozens. Instead of using one of the many frameworks built for this, and risk users having to learn yet another library, we opted to simply use Python flags, i.e., flags defined using Abseil that follow Python conventions. This should be a familiar technique to most researchers, and is easy to extend and plug into other pipelines.

In addition to being able to run each of our baselines using the documented commands and get the same reported results, we also aim to release hyperparameter tuning results and final model checkpoints for further reproducibility. Right now we only have these fully open-sourced for the Diabetic Retinopathy baselines, but we will continue to upload more results as we run them. Additionally, we have examples of baselines that are exactly reproducible up to hardware determinism.

Practical Impact
Each of the baselines included in our repository has gone through extensive hyperparameter tuning, and we hope that researchers can readily reuse this effort without the need for expensive retraining or retuning. Additionally, we hope to avoid minor differences in the pipeline implementations affecting baseline comparisons.

Uncertainty Baselines has already been used in numerous research projects. If you are a researcher with other methods or datasets you would like to contribute, please open a GitHub issue to start a discussion!

We would like to thank a number of folks who are codevelopers, provided guidance, and/or helped review this post: Neil Band, Mark Collier, Josip Djolonga, Michael W. Dusenberry, Sebastian Farquhar, Angelos Filos, Marton Havasi, Rodolphe Jenatton, Ghassen Jerfel, Jeremiah Liu, Zelda Mariet, Jeremy Nixon, Shreyas Padhy, Jie Ren, Tim G. J. Rudner, Yeming Wen, Florian Wenzel, Kevin Murphy, D. Sculley, Balaji Lakshminarayanan, Jasper Snoek, Yarin Gal.

Source: Google AI Blog

FedJAX: Federated Learning Simulation with JAX

Federated learning is a machine learning setting where many clients (e.g., mobile devices or whole organizations, depending on the task at hand) collaboratively train a model under the orchestration of a central server, while keeping the training data decentralized. For example, federated learning makes it possible to train virtual keyboard language models based on user data that never leaves a mobile device.

Federated learning algorithms accomplish this by first initializing the model at the server and completing three key steps for each round of training:

  1. The server sends the model to a set of sampled clients.
  2. These sampled clients train the model on local data.
  3. After training, the clients send the updated models to the server and the server aggregates them together.
An example federated learning algorithm with four clients.
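
The three steps above can be sketched as a small simulation. This is an illustrative toy in plain Python, not FedJAX code: the model is a single weight fitted by least squares, and the function names, learning rate, and client data are all assumptions made for the example.

```python
import random


def local_train(w, data, lr=0.1, steps=5):
    """Step 2: a client refines the model on its own (x, y) pairs."""
    for _ in range(steps):
        # Gradient of the mean squared error for the 1-D model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_averaging(clients, rounds=20, clients_per_round=2, seed=0):
    rng = random.Random(seed)
    w = 0.0  # the server initializes the model
    for _ in range(rounds):
        # Step 1: the server sends the current model to sampled clients.
        sampled = rng.sample(clients, clients_per_round)
        # Step 2: each sampled client trains locally, starting from w.
        updates = [(local_train(w, data), len(data)) for data in sampled]
        # Step 3: the server averages the client models, weighted by the
        # number of local examples.
        total = sum(n for _, n in updates)
        w = sum(w_k * n for w_k, n in updates) / total
    return w


# Four clients whose local data all follow y = 3 * x (made-up data).
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5)],
    [(1.5, 4.5), (1.0, 3.0), (0.5, 1.5)],
    [(2.0, 6.0)],
]
w_final = federated_averaging(clients)
print(f"learned weight: {w_final:.4f}")
```

Note that the raw data never leaves the `local_train` call; only model weights and example counts are passed back to the server loop.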

Federated learning has become a particularly active area of research due to an increased focus on privacy and security. Being able to easily translate ideas into code, iterate quickly, and compare and reproduce existing baselines is important for such a fast growing field.

In light of this, we are excited to introduce FedJAX, a JAX-based open source library for federated learning simulations that emphasizes ease-of-use in research. With its simple building blocks for implementing federated algorithms, prepackaged datasets, models and algorithms, and fast simulation speed, FedJAX aims to make developing and evaluating federated algorithms faster and easier for researchers. In this post we discuss the library structure and contents of FedJAX. We demonstrate that on TPUs FedJAX can be used to train models with federated averaging on the EMNIST dataset in a few minutes, and the Stack Overflow dataset in roughly an hour with standard hyperparameters.

Library Structure
Keeping ease of use in mind, FedJAX introduces only a few new concepts. Code written with FedJAX resembles the pseudo-code used to describe novel algorithms in academic papers, making it easy to get started. Additionally, while FedJAX provides building blocks for federated learning, users can replace these with the most basic implementations using just NumPy and JAX while still keeping the overall training reasonably fast.

Included Datasets and Models
In the current landscape of federated learning research, there are a variety of commonly used datasets and models for tasks such as image recognition, language modeling, and more. A growing number of these datasets and models can be used straight out of the box in FedJAX, so the preprocessed datasets and models do not have to be written from scratch. This not only encourages valid comparisons between different federated algorithms but also accelerates the development of new algorithms.

At present, FedJAX comes packaged with several datasets and sample models, including the federated EMNIST-62 and Stack Overflow tasks benchmarked below.

In addition to these standard setups, FedJAX provides tools to create new datasets and models that can be used with the rest of the library. Finally, FedJAX comes with standard implementations of federated averaging and other federated algorithms for training a shared model on decentralized examples, such as adaptive federated optimizers, agnostic federated averaging, and Mime, to make comparing and evaluating against existing algorithms easier.

Performance Evaluation
We benchmarked a standard FedJAX implementation of adaptive federated averaging on two tasks: the image recognition task for the federated EMNIST-62 dataset and the next word prediction task for the Stack Overflow dataset. Federated EMNIST-62 is a smaller dataset that consists of 3400 users and their writing samples, each of which is one of 62 alphanumeric characters, while the Stack Overflow dataset is much larger and consists of millions of questions and answers from the Stack Overflow forum for hundreds of thousands of users.

We measured performance on various hardware specialized for machine learning. For federated EMNIST-62, we trained a model for 1500 rounds with 10 clients per round on GPU (NVIDIA V100) and TPU (1 TensorCore on a Google TPU v2) accelerators.

For Stack Overflow, we trained a model for 1500 rounds with 50 clients per round on GPU (NVIDIA V100) using jax.jit, TPU (1 TensorCore on a Google TPU v2) using only jax.jit, and multi-core TPU (8 TensorCores on a Google TPU v2) using jax.pmap. In the charts below, we’ve recorded the average training round completion time, time taken for full evaluation on test data, and time for the overall execution, which includes both training and full evaluation.

Benchmark results for federated EMNIST-62.
Benchmark results for Stack Overflow.

With standard hyperparameters and TPUs, the full experiments for federated EMNIST-62 can be completed in a few minutes, and those for Stack Overflow in roughly an hour.

Stack Overflow average training round duration as the number of clients per round increases.

We also evaluate the Stack Overflow average training round duration as the number of clients per round increases. By comparing the average training round duration between TPU (8 cores) and TPU (1 core) in the figure, it is evident that using multiple TPU cores results in considerable runtime improvement if the number of clients participating per round is large (useful for applications like differentially private learning).

Conclusions and Future Work
In this post, we introduced FedJAX, a fast and easy-to-use federated learning simulation library for research. We hope that FedJAX will foster even more investigation and interest in federated learning. Moving forward, we plan to continually grow our existing collection of algorithms, aggregation mechanisms, datasets, and models.

Feel free to take a look at some of our tutorial notebooks, or try out FedJAX yourself! For more information about the library and its relationship to platforms such as TensorFlow Federated, see our paper, README, or FAQs.

We would like to thank Ke Wu and Sai Praneeth Kamireddy for contributing to the library and various discussions during development.

We would also like to thank Ehsan Amid, Theresa Breiner, Mingqing Chen, Fabio Costa, Roy Frostig, Zachary Garrett, Alex Ingerman, Satyen Kale, Rajiv Mathews, Lara Mcconnaughey, Brendan McMahan, Mehryar Mohri, Krzysztof Ostrowski, Max Rabinovich, Michael Riley, Vlad Schogol, Jane Shapiro, Gary Sivek, Luciana Toledo-Lopez, and Michael Wunder for helpful comments and contributions.

Source: Google AI Blog

Our best Chromecast yet, now with Google TV

Chromecast changed the way we enjoy our favourite movies, TV shows and YouTube videos by making it easy and inexpensive to bring your online entertainment to your TV—a revolutionary idea in 2013. Today, we have more content choices than ever, sprinkled across an ever-expanding variety of apps, which can make it difficult to find what to watch. This inspired us to rethink what simple and easy content discovery on your TV should look like. So today, we're making our biggest leap yet to help you navigate your entertainment choices, bringing together the best of local and global content into one convenient location, with the all-new Chromecast with Google TV. 
Best Chromecast yet 
Chromecast with Google TV has your favourite Chromecast features and now comes with the all-new Google TV entertainment experience. The Google TV experience brings together movies, shows and more from across your apps and subscriptions and organises them just for you. We're also bringing our most requested feature, a remote, to Chromecast.

A new look, inside and out 
The new Chromecast with Google TV comes in a compact and thin design and is packed with the latest technology to give you the best viewing experience. It neatly plugs into your TV's HDMI port and tucks behind your screen. Power it on and you'll be streaming crystal clear video in up to 4K HDR at up to 60 frames per second in no time. With Dolby Vision, you’ll get extraordinary colour, contrast and brightness on your TV. We also support HDMI pass-through of Dolby audio content. 

More power in your hand 
The new Chromecast voice remote is comfortable to hold, easy to use and full of new features. It has a dedicated Google Assistant button that can help you find something to watch, answer everyday questions like “how's the weather?” or play your favourite artist on YouTube Music all with just your voice. And when it's time to cozy up on the couch for movie night, you can control your smart home lights to set the mood or check your front door with Nest Camera to keep tabs on your pizza delivery. We also have dedicated buttons for popular streaming services, YouTube and Netflix, to give you instant access to the content you love. Best of all, you won't have to juggle multiple remotes thanks to our programmable TV controls for power, volume and input. 

TV just for you 
In need of some good movie or TV recommendations? Google TV's For You tab gives you personalised watch suggestions from across your subscriptions organised based on what you like to watch—even your guilty pleasure reality dramas. Google TV’s Watchlist lets you bookmark movies and shows you want to save for later. You can add to your Watchlist from your phone or laptop, and it will be waiting on your TV when you get home. 
Best of all, you'll also have access to thousands of apps and the ability to browse 400,000+ movies and TV shows sorted and optimised for what you like—ask Google Assistant to see results from across your favourite apps, like YouTube, Netflix, Disney+, Stan, 9Now and ABC iview, among others. 

Starting today, Chromecast with Google TV is available for pre-order in Australia for $99 in three fun colours to match your decor or personality: Snow, Sunrise and Sky. It will be available from the Google Store as well as other retailers like JB Hi-Fi, Harvey Norman, Officeworks, and The Good Guys starting from October 15. Sunrise and Sky will be exclusively available on Google Store.

Made for music, the new Nest Audio is here

This year, we’ve all spent a lot of time exploring things to do at home. Some of us gardened, and others baked. We tried at-home workouts, redecorated the house or took up art projects. But one thing that many of us, maybe all of us, did? Enjoy a lot of music at home. Personally, I have spent so much more time listening to music during quarantine: bossa nova is my go-to soundtrack for doing the dishes and Lil Baby has become one of my favourite artists.
So, in a time when we’re all listening to more music than ever, we’re especially excited to introduce Nest Audio, our latest smart speaker that is made for music lovers. 

A music machine 
Nest Audio is 75 percent louder and has 50 percent stronger bass than the original Google Home—measurements of both devices were taken in an anechoic chamber at maximum volume, on-axis. With a 19mm tweeter for consistent high frequency coverage and clear vocals and a 75mm mid-woofer that really brings the bass, this smart speaker is a music lover’s dream. 
Nest Audio’s sound is full, clear and natural. We completed more than 500 hours of tuning to ensure balanced lows, mids and highs so that nothing is lacking or overbearing. The bass is significant and the vocals have depth, which makes Nest Audio sound great across genres: classical, R&B, pop and more. The custom-designed tweeter allows each musical detail to come through, and we optimised the grill, fabric and materials so that you can enjoy the audio without distortion. 
Our goal was to ensure that Nest Audio stayed faithful to what the artist intended when they were in the recording studio. We minimised the use of compressors to preserve dynamic range, so that the auditory contrast in the original production is preserved—the quiet parts are delicate and subtle, and the loud parts are more dramatic and powerful. 
Nest Audio also adapts to your home. Our Media EQ feature enables Nest Audio to automatically tune itself to whatever you’re listening to: music, podcasts, audiobooks or a response from Google Assistant. And Ambient IQ lets Nest Audio adjust the volume of Assistant, news, podcasts, and audiobooks based on the background noise in the home, so you can hear the weather forecast over a noisy dishwasher.

Whole home audio 
If you have a Google Home, Nest Mini or even a Nest Hub, you can easily make Nest Audio the centre of your whole home sound system. In my living room, I’ve connected two Nest Audio speakers as a stereo pair for left and right channel separation. I also have a Nest Hub Max in my kitchen, a Nest Mini in my bedroom and a Nest Hub in the entryway. These devices are grouped so that I can blast the same song on all of them when I have my daily dance party. 
With our stream transfer feature, I can move music from one device to the other with just my voice. Just last month, we launched multi-room control, which allows you to dynamically group multiple cast-enabled Nest devices in real-time. 

An even faster Assistant 
When we launched Nest Mini last year, we embedded a dedicated machine learning chip with up to one TeraOPS of processing power, which let us move some Google Assistant experiences from our data centres directly onto the device. We’ve leveraged the same ML chip in Nest Audio too.
Google Assistant helps you tackle your day, enjoy your entertainment and control compatible smart home brands like Philips Hue, TP-Link and more. In fact, our users have already set up more than 100 million devices to work with Google Assistant. Plus, if you’re a YouTube Music or Spotify Premium subscriber, you can say, “Hey Google, recommend some music” and Google Assistant will offer a variety of choices from artists and genres that you like, and others like them to choose from.

Differentiated by design 
Typically, a bigger speaker equals bigger sound, but Nest Audio has a really slim profile—so it fits anywhere in the home. In order to maximise audio output, we custom-designed quality drivers and housed them in an enclosure that helps it squeeze out every bit of sound possible. 
Nest Audio is available in two colours in Australia: Chalk and Charcoal. Its soft, rounded edges blend in with your home’s decor, and its minimal footprint doesn't take up too much space on your shelf or countertop. 
We’re continuing our commitment to sustainability with Nest Audio. It’s covered in the same sustainable fabric that we first introduced with Nest Mini last year, and the enclosure (meaning the fabric, housing, foot, and a few smaller parts) is made from 70 percent recycled plastic. 

Starting today, Nest Audio is available for pre-order in Australia for $149 at the Google Store and other retailers, including JB Hi-Fi, Harvey Norman, and The Good Guys. It will be on sale from October 15 through these same retailers, as well as Officeworks and Vodafone.

Pixel 4a (5G) and Pixel 5 pack 5G speeds and so much more

Today, we hosted Launch Night In, a virtual event introducing new products from across Google that will offer a little joy, entertainment and connection for people. These products bring together the best of Google’s hardware, software and AI to deliver helpful experiences built around you. Not only are these products more helpful; they’re more affordable too. 
Our new smartphones, Pixel 4a with 5G and Pixel 5, offer more helpful Google features backed by the power and speeds of 5G.1 From Google’s latest AI and Assistant features, to the biggest ever batteries we’ve put in a Pixel, to industry-leading camera features, Pixel 4a with 5G and Pixel 5 join our much-loved Pixel 4a in providing more help at a more helpful price. 

5G speeds at affordable prices 
5G is the latest in mobile technology, bringing fast download and streaming speeds to users around the world. Whether you’re downloading the latest movie2, listening to your favourite music on YouTube Music, catching up on podcasts with Google Podcasts or downloading a game, Pixel 4a with 5G and Pixel 5 can provide you with fast speeds at a helpful price,1 starting at just $799 for Pixel 4a with 5G.

New camera, new lenses—same great photos 
Ask any Pixel owner and they’ll tell you: Pixels take great photos. Pixel 4a with 5G and Pixel 5 are no exception. These phones bring Pixel’s industry-leading photography features to the next level. 
  • Better videos with Cinematic Pan: Pixel 4a with 5G and Pixel 5 come with Cinematic Pan, which gives your videos a professional look with ultrasmooth panning that’s inspired by the equipment Hollywood directors use. 
  • Night Sight in Portrait Mode: Night Sight currently gives you the ability to capture amazing low-light photos—and even the Milky Way with astrophotography. Now, these phones bring the power of Night Sight into Portrait Mode to capture beautifully blurred backgrounds in Portraits even in extremely low light. 
Night Sight in Portrait Mode, captured on Pixel 
  • Portrait Light: Portrait Mode on the Pixel 4a with 5G and Pixel 5 lets you capture beautiful portraits that focus on your subject as the background fades into an artful blur. If the lighting isn’t right, your Pixel can drop in extra light to illuminate your subjects.
  • Ultrawide lens for ultra awesome shots: With an ultrawide lens alongside the standard rear camera, you’ll be able to capture the whole scene. And thanks to Google’s software magic, the latest Pixels still get our Super Res Zoom. So whether you’re zooming in or zooming out, you get sharp details and breathtaking images. 
Ultrawide, captured on Pixel 
  • New editor in Google Photos: Even after you’ve captured your portrait, Google Photos can help you add studio-quality light to your portraits of people with Portrait Light, in the new, more helpful Google Photos editor.
Stay connected and entertained with Duo 
To make it easier and more enjoyable to stay connected to the most important people in your life, the new HD screen sharing in Duo video calls lets you and a friend watch the same video, cheer on sports with a friend and even plan activities – no matter how far apart you are.3 And with features like Duo Family mode, you will be able to keep kids entertained and engaged with new interactive tools, like colouring over backgrounds, while you video chat. 

A smarter way to record and share audio 
Last year, Recorder made audio recording smarter, with real-time transcriptions and the power of search.4 Now, Recorder makes it even easier to share your favourite audio moments. Since Recorder automatically transcribes every recording, now you can use those transcripts to edit the audio too. Just highlight a sentence to crop or remove its corresponding audio. Once you have something you want others to hear—say a quote from an interview or a new song idea—you can generate a video clip to make sharing your audio easier and more visual than ever. 
Editing in Recorder is easy

To improve searching through your transcripts, smart scrolling will automatically mark important words in longer transcripts so you can quickly jump to the sections you’re looking for as you scroll. But most helpful of all? Recorder still works without an internet connection, so you can transcribe, search and edit from anywhere, anytime. 

The biggest Pixel batteries ever 
Pixel 4a with 5G and Pixel 5 also have all-day batteries that can last up to 48 hours with Extreme Battery Saver.5 This mode automatically limits active apps to just the essentials and lets you choose additional apps you want to keep on. 

And now, the specs 
As with all Pixel devices, security and safety are paramount in Pixel 4a with 5G and Pixel 5. Both devices come with our Titan™ M security chip to help keep your on-device data safe and secure, and both phones will get three years of software and security updates. Your Pixel also has built-in safety features like car crash detection6 and Safety Check.7
Plus, Pixel 5 is designed with the environment in mind; we used 100% recycled aluminium in the back housing enclosure to reduce its carbon footprint. You can charge your Pixel 5 wirelessly8 and even use it to wirelessly charge other Qi-certified devices using Battery Share.9 Pixel 5 also doesn’t mind a little water or dust. The metal unibody can handle being submerged in 1.5 metres of fresh water for 30 minutes.10
When you buy the Google phone, you get more from Google. Pixel 5 and Pixel 4a with 5G come with trial subscriptions to Google’s entertainment, security and storage services for new users.11 If you’re a new user you’ll get a YouTube Premium trial for 3 months, 100 GB of storage with Google One for 3 months and 3 months of Google Play Pass and Gold/Silver Status on Play Points. See g.co/pixel/4a5Goffers or g.co/pixel/5offers, as applicable, for more details.11 
In Australia, Pixel 5 will come in two colours, Just Black and Sorta Sage (selected retailers). It will retail for $999 and can be pre-ordered today from Google Store, Telstra, Optus, Vodafone, JB Hi-Fi, Officeworks and Harvey Norman, and will be available starting October 15. Pixel 4a with 5G will retail for $799 in Just Black and can be pre-ordered today from JB Hi-Fi, Officeworks and Harvey Norman. It will be available in November from these retailers, as well as Google Store and Telstra. 

Looking for the Pixel that’s right for you? Head to the Google Store now. 

1 Requires a 5G data plan (sold separately). 5G service and roaming not available on all carrier networks or in all areas. Contact carrier for details about current 5G network performance, compatibility, and availability. Phone connects to 5G networks, but 5G service, speed and performance depend on many factors including, but not limited to, carrier network capabilities, device configuration and capabilities, network traffic, location, signal strength and signal obstruction. Actual results may vary. Some features not available in all areas. Data rates may apply. See g.co/pixel/networkinfo for info. 
2 Download speed claims based on testing videos from three streaming platforms. Average download time was less than sixty seconds. File sizes varied between 449MB and 1.3GB. Download speed depends upon many factors, such as file size, content provider and network connection. Testing conducted in an internal 5G network lab and on pre-production hardware in California in July/August 2020. Actual download speeds may be slower. Australian results may vary. 
3 Screen sharing not available on group calls. Requires Wi-Fi or 5G internet connection. Not available on all apps and content. Data rates may apply. 5G service, speed and performance depend on many factors including, but not limited to, carrier network capabilities, device configuration and capabilities, network traffic, location, signal strength, and signal obstruction. 
4 Transcription and search are available in English only. 
5 For “all day”: Maximum battery life based on testing using a mix of talk, data, standby, and use of other features. Testing conducted on two major US carrier networks using Sub-6 GHz non-standalone 5G (ENDC) connectivity. For “Up to 48 hours”: Maximum battery life based on testing using a mix of talk, data, standby, and use of limited other features that are default in Extreme Battery Saver mode (which disables various features including 5G connectivity). Testing conducted on two major US carrier networks. For both claims: Pixel 4a (5G) and Pixel 5 battery testing conducted by a third party in California in mid 2020 on pre-production hardware and software using default settings, except that, for the “up to 48 hour claim” only, Extreme Battery Saver mode was enabled. Battery life depends upon many factors and usage of certain features will decrease battery life. Actual battery life may be lower.
6 Not available in all languages or countries. Car crash detection may not detect all accidents. High-impact activities may trigger calls to emergency services. This feature is dependent upon network connectivity and other factors and may not be reliable for emergency communications or available in all areas. For country and language availability and more information see g.co/pixel/carcrashdetection. 
7 Personal Safety app features are dependent upon network connectivity and other factors and may not be reliable for emergency communications or available in all areas. For more information, see g.co/pixel/personalsafety. 
8 Qi-compatible. Wireless charger sold separately. 
9 Designed to charge Qi-certified devices. Use of Battery Share significantly reduces Pixel battery life. Cases may interfere with charging and will reduce charging speed. Charge speeds may vary. See g.co/pixel/wirelesscharging for more information. 
10 Pixel 5 has a dust and water protection rating of IP68 under IEC standard 60529. Charger and accessories are not water-resistant or dust-resistant. Water and dust resistance are not permanent conditions and may be compromised due to normal wear and tear, repair, disassembly or damage. 
11 The Google One, Google Play Pass, Google Play Points, and YouTube Premium offers are available to eligible new users with the purchase of Pixel 4a (5G) or Pixel 5. Offer expires April 30, 2021 at 11:59pm PT. See g.co/pixel/4a5Goffers or g.co/pixel/5offers, as applicable, for more details.

Announcing Cirq: An Open Source Framework for NISQ Algorithms

Over the past few years, quantum computing has experienced a growth not only in the construction of quantum hardware, but also in the development of quantum algorithms. With the availability of Noisy Intermediate Scale Quantum (NISQ) computers (devices with ~50 - 100 qubits and high fidelity quantum gates), the development of algorithms to understand the power of these machines is of increasing importance. However, a common problem when designing a quantum algorithm on a NISQ processor is how to take full advantage of these limited quantum devices, spending resources on the hardest part of the problem rather than on overheads from poor mappings between the algorithm and hardware. Furthermore, some quantum processors have complex geometric constraints and other nuances, and ignoring these will result in either a faulty quantum computation or a computation that is modified and sub-optimal.*

Today at the First International Workshop on Quantum Software and Quantum Machine Learning (QSML), the Google AI Quantum team announced the public alpha of Cirq, an open source framework for NISQ computers. Cirq is focused on near-term questions and helping researchers understand whether NISQ quantum computers are capable of solving computational problems of practical importance. Cirq is licensed under Apache 2, and is free to be modified or embedded in any commercial or open source package.
Once installed, Cirq enables researchers to write quantum algorithms for specific quantum processors. Cirq gives users fine-tuned control over quantum circuits, specifying gate behavior using native gates, placing these gates appropriately on the device, and scheduling the timing of these gates within the constraints of the quantum hardware. Data structures are optimized for writing and compiling these quantum circuits to allow users to get the most out of NISQ architectures. Cirq supports running these algorithms locally on a simulator, and is designed to easily integrate with future quantum hardware or larger simulators via the cloud.
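To make the idea of hardware constraints concrete, here is a minimal plain-Python sketch of one such check: verifying that every two-qubit gate acts on neighbouring qubits of a grid-shaped device. It does not use Cirq's API. The grid layout, the nearest-neighbour rule, and the names are_adjacent and invalid_gates are all assumptions made for illustration.

```python
from typing import List, Tuple

Qubit = Tuple[int, int]               # (row, col) on a hypothetical grid device
Gate = Tuple[str, Tuple[Qubit, ...]]  # (gate name, qubits the gate acts on)


def are_adjacent(a: Qubit, b: Qubit) -> bool:
    """True when two grid qubits are nearest neighbours (Manhattan distance 1)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1


def invalid_gates(circuit: List[Gate]) -> List[Gate]:
    """Return the two-qubit gates the assumed device could not run directly."""
    return [
        gate for gate in circuit
        if len(gate[1]) == 2 and not are_adjacent(*gate[1])
    ]


circuit = [
    ("H", ((0, 0),)),
    ("CZ", ((0, 0), (0, 1))),  # neighbours: allowed on this toy device
    ("CZ", ((0, 0), (1, 1))),  # diagonal: violates the assumed constraint
]
bad = invalid_gates(circuit)
print(bad)
```

A framework that knows the device layout can reject or remap the flagged gate before the circuit reaches hardware, rather than letting it run incorrectly.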
We are also announcing the release of OpenFermion-Cirq, an example of a Cirq-based application enabling near-term algorithms. OpenFermion is a platform for developing quantum algorithms for chemistry problems, and OpenFermion-Cirq is an open source library which compiles quantum simulation algorithms to Cirq. The new library uses the latest advances in building low-depth quantum algorithms for quantum chemistry problems to enable users to go from the details of a chemical problem to highly optimized quantum circuits customized to run on particular hardware. For example, this library can be used to easily build quantum variational algorithms for simulating properties of molecules and complex materials.

Quantum computing will require strong cross-industry and academic collaborations if it is going to realize its full potential. In building Cirq, we worked with early testers to gain feedback and insight into algorithm design for NISQ computers. Below are some examples of Cirq work resulting from these early adopters:
To learn more about how Cirq is helping enable NISQ algorithms, please visit the links above where many of the adopters have provided example source code for their implementations.

Today, the Google AI Quantum team is using Cirq to create circuits that run on Google’s Bristlecone processor. In the future, we plan to make this processor available in the cloud, and Cirq will be the interface in which users write programs for this processor. In the meantime, we hope Cirq will improve the productivity of NISQ algorithm developers and researchers everywhere. Please check out the GitHub repositories for Cirq and OpenFermion-Cirq — pull requests welcome!

We would like to thank Craig Gidney for leading the development of Cirq, Ryan Babbush and Kevin Sung for building OpenFermion-Cirq, and a whole host of code contributors to both frameworks.

* An analogous situation is how early classical programmers needed to run complex programs in very small memory spaces by paying careful attention to the lowest level details of the hardware.

Source: Google AI Blog

TFGAN: A Lightweight Library for Generative Adversarial Networks

(Crossposted on the Google Open Source Blog)

Training a neural network usually involves defining a loss function, which tells the network how close or far it is from its objective. For example, image classification networks are often given a loss function that penalizes them for giving wrong classifications; a network that mislabels a dog picture as a cat will get a high loss. However, not all problems have easily-defined loss functions, especially if they involve human perception, such as image compression or text-to-speech systems. Generative Adversarial Networks (GANs), a machine learning technique that has led to improvements in a wide range of applications including generating images from text, superresolution, and helping robots learn to grasp, offer a solution. However, GANs introduce new theoretical and software engineering challenges, and it can be difficult to keep up with the rapid pace of GAN research.
A video of a generator improving over time. It begins by producing random noise, and eventually learns to generate MNIST digits.
In order to make GANs easier to experiment with, we’ve open sourced TFGAN, a lightweight library designed to make it easy to train and evaluate GANs. It provides the infrastructure to easily train a GAN, provides well-tested loss and evaluation metrics, and gives easy-to-use examples that highlight the expressiveness and flexibility of TFGAN. We’ve also released a tutorial that includes a high-level API to quickly get a model trained on your data.
This demonstrates the effect of an adversarial loss on image compression. The top row shows image patches from the ImageNet dataset. The middle row shows the results of compressing and uncompressing an image through an image compression neural network trained on a traditional loss. The bottom row shows the results from a network trained with a traditional loss and an adversarial loss. The GAN-loss images are sharper and more detailed, even if they are less like the original.
TFGAN supports experiments in a few important ways. It provides simple function calls that cover the majority of GAN use-cases so you can get a model running on your data in just a few lines of code, but is built in a modular way to cover more exotic GAN designs as well. You can just use the modules you want — loss, evaluation, features, training, etc. are all independent. TFGAN’s lightweight design also means you can use it alongside other frameworks, or with native TensorFlow code. GAN models written using TFGAN will easily benefit from future infrastructure improvements, and you can select from a large number of already-implemented losses and features without having to rewrite your own. Lastly, the code is well-tested, so you don’t have to worry about numerical or statistical mistakes that are easily made with GAN libraries.
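For readers new to adversarial losses, the sketch below writes out one common pair in plain Python: the minimax discriminator loss and the non-saturating generator loss, both computed from raw discriminator logits. It is an illustration only and does not use TFGAN's API. The function names and the choice of this particular loss pair are assumptions.

```python
import math
from typing import Sequence


def _log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def discriminator_loss(real_logits: Sequence[float],
                       fake_logits: Sequence[float]) -> float:
    """Mean -log D(x) on real samples plus mean -log(1 - D(G(z))) on generated ones."""
    real_term = -sum(_log_sigmoid(v) for v in real_logits) / len(real_logits)
    # log(1 - sigmoid(v)) equals log(sigmoid(-v)).
    fake_term = -sum(_log_sigmoid(-v) for v in fake_logits) / len(fake_logits)
    return real_term + fake_term


def generator_loss(fake_logits: Sequence[float]) -> float:
    """Non-saturating generator loss: mean -log D(G(z))."""
    return -sum(_log_sigmoid(v) for v in fake_logits) / len(fake_logits)


# An undecided discriminator (logit 0 everywhere) versus a confident one.
print(discriminator_loss([0.0], [0.0]), generator_loss([0.0]))
print(discriminator_loss([10.0], [-10.0]), generator_loss([-10.0]))
```

The two networks pull in opposite directions: the discriminator's loss falls as it separates real from generated samples, while the generator's loss falls only when its samples receive high logits.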
Most neural text-to-speech (TTS) systems produce over-smoothed spectrograms. When applied to the Tacotron TTS system, a GAN can recreate some of the realistic texture, which reduces artifacts in the resulting audio.
When you use TFGAN, you’ll be using the same infrastructure that many Google researchers use, and you’ll have access to the cutting-edge improvements that we develop with the library. Anyone can contribute to the GitHub repositories, which we hope will facilitate code-sharing among ML researchers and users.

Tangent: Source-to-Source Debuggable Derivatives

Tangent is a new, free, and open-source Python library for automatic differentiation. In contrast to existing machine learning libraries, Tangent is a source-to-source system, consuming a Python function f and emitting a new Python function that computes the gradient of f. This allows much better user visibility into gradient computations, as well as easy user-level editing and debugging of gradients. Tangent comes with many more features for debugging and designing machine learning models:
This post gives an overview of the Tangent API. It covers how to use Tangent to generate gradient code in Python that is easy to interpret, debug and modify.

Neural networks (NNs) have led to great advances in machine learning models for images, video, audio, and text. The fundamental abstraction that lets us train NNs to perform well at these tasks is a 30-year-old idea called reverse-mode automatic differentiation (also known as backpropagation), which comprises two passes through the NN. First, we run a “forward pass” to calculate the output value of each node. Then we run a “backward pass” to calculate a series of derivatives to determine how to update the weights to increase the model’s accuracy.
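The two passes can be sketched by hand for a single neuron. The example below is a minimal illustration, not code from Tangent: the one-neuron model, the squared-error loss, and the names forward and backward are assumptions chosen to keep the chain rule visible.

```python
import math


def forward(w: float, b: float, x: float, target: float):
    """Forward pass: compute each node's value and keep what the backward pass needs."""
    z = w * x + b
    a = math.tanh(z)
    loss = (a - target) ** 2
    return loss, a


def backward(x: float, target: float, a: float):
    """Backward pass: apply the chain rule from the loss back to the weights."""
    dloss = 1.0
    da = 2.0 * (a - target) * dloss  # d(loss)/d(a)
    dz = (1.0 - a * a) * da          # tanh'(z) = 1 - tanh(z)^2
    dw = x * dz                      # z = w * x + b
    db = dz
    return dw, db


w, b, x, target = 0.5, -0.1, 2.0, 0.3
loss, a = forward(w, b, x, target)
dw, db = backward(x, target, a)
print(loss, dw, db)
```

A gradient-descent step would then subtract a small multiple of dw and db from w and b, which lowers the loss.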

Training NNs, and doing research on novel architectures, requires us to compute these derivatives correctly, efficiently, and easily. We also need to be able to debug these derivatives when our model isn’t training well, or when we’re trying to build something new that we do not yet understand. Automatic differentiation, or just “autodiff,” is a technique to calculate the derivatives of computer programs that denote some mathematical function, and nearly every machine learning library implements it.

Existing libraries implement automatic differentiation by tracing a program’s execution (at runtime, like TF Eager, PyTorch and Autograd) or by building a dynamic data-flow graph and then differentiating the graph (ahead-of-time, like TensorFlow). In contrast, Tangent performs ahead-of-time autodiff on the Python source code itself, and produces Python source code as its output.

As a result, you can finally read your automatic derivative code just like the rest of your program. Tangent is useful to researchers and students who not only want to write their models in Python, but also read and debug automatically-generated derivative code without sacrificing speed and flexibility.

You can easily inspect and debug your models written in Tangent, without special tools or indirection. Tangent works on a large and growing subset of Python, provides extra autodiff features other Python ML libraries don’t have, is high-performance, and is compatible with TensorFlow and NumPy.

Automatic differentiation of Python code
How do we automatically generate derivatives of plain Python code? Math functions like tf.exp or tf.log have derivatives, which we can compose to build the backward pass. Similarly, pieces of syntax, such as subroutines, conditionals, and loops, also have backward-pass versions. Tangent contains recipes for generating derivative code for each piece of Python syntax, along with many NumPy and TensorFlow function calls.

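Tangent has a one-function API, tangent.grad: pass it a Python function f and it returns a new Python function, df, that computes the gradient of f.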
Here’s an animated graphic of what happens when we call tangent.grad on a Python function:
If you want to print out your derivatives, you can run:
Under the hood, tangent.grad first grabs the source code of the Python function you pass it. Tangent has a large library of recipes for the derivatives of Python syntax, as well as TensorFlow Eager functions. The function tangent.grad then walks your code in reverse order, looks up the matching backward-pass recipe, and adds it to the end of the derivative function. This reverse-order processing gives the technique its name: reverse-mode automatic differentiation.
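The reverse walk can be illustrated with a toy source-to-source differentiator built on Python's standard ast module. This is a sketch of the general idea, not Tangent's implementation. It assumes the input is a straight-line function whose statements all look like `out = left + right` or `out = left * right` on plain names, with each variable assigned once. The names grad_source and OPS are hypothetical.

```python
import ast

SOURCE = """
def f(x, y):
    a = x * y
    b = a + x
    return b
"""

OPS = {ast.Add: "+", ast.Mult: "*"}  # the only operators this toy supports


def grad_source(source: str) -> str:
    """Emit Python source computing d(output)/d(first argument)."""
    func = ast.parse(source).body[0]
    params = [arg.arg for arg in func.args.args]
    *assigns, ret = func.body
    # Each statement is assumed to have the form `out = left <op> right`.
    steps = [
        (s.targets[0].id, s.value.left.id, OPS[type(s.value.op)], s.value.right.id)
        for s in assigns
    ]
    lines = [f"def d{func.name}({', '.join(params)}):"]
    # Forward pass: re-emit the original statements unchanged.
    lines += [f"    {out} = {left} {op} {right}" for out, left, op, right in steps]
    # Every variable gets an adjoint starting at zero; the output's is one.
    names = params + [out for out, _, _, _ in steps]
    lines += [f"    d{name} = 0.0" for name in names]
    lines.append(f"    d{ret.value.id} = 1.0")
    # Backward pass: walk the statements in reverse, appending one recipe each.
    for out, left, op, right in reversed(steps):
        if op == "+":
            lines += [f"    d{left} += d{out}", f"    d{right} += d{out}"]
        else:  # op == "*"
            lines += [f"    d{left} += d{out} * {right}",
                      f"    d{right} += d{out} * {left}"]
    lines.append(f"    return d{params[0]}")
    return "\n".join(lines)


generated = grad_source(SOURCE)
print(generated)  # the derivative is ordinary, readable Python source

namespace = {}
exec(generated, namespace)
df = namespace["df"]
print(df(3.0, 4.0))
```

Because the output is plain source text, the derivative can be read, edited, or stepped through in a debugger like any other function.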

The function df above only works for scalar (non-array) inputs. Tangent also supports
Although we started with TensorFlow Eager support, Tangent isn’t tied to one numeric library or another—we would gladly welcome pull requests adding PyTorch or MXNet derivative recipes.

Next Steps
Tangent is open source now at github.com/google/tangent. Go check it out for download and installation instructions. Tangent is still an experiment, so expect some bugs. If you report them to us on GitHub, we will do our best to fix them quickly.

We are working to add support in Tangent for more aspects of the Python language (e.g., closures, inline function definitions, classes, more NumPy and TensorFlow functions). We also hope to add more advanced automatic differentiation and compiler functionality in the future, such as automatic trade-off between memory and compute (Griewank and Walther 2000; Gruslys et al., 2016), more aggressive optimizations, and lambda lifting.

We intend to develop Tangent together as a community. We welcome pull requests with fixes and features. Happy differentiating!

Bart van Merriënboer contributed immensely to all aspects of Tangent during his internship, and Dan Moldovan led TF Eager integration, infrastructure and benchmarking. Also, thanks to the Google Brain team for their support of this post and special thanks to Sanders Kleinfeld, Matt Johnson and Aleks Haecky for their valuable contribution for the technical aspects of the post.

ICSE 2015 and Software Engineering Research at Google

The large scale of our software engineering efforts at Google often pushes us to develop cutting-edge infrastructure. In May 2015, at the International Conference on Software Engineering (ICSE 2015), we shared some of our software engineering tools and practices and collaborated with the research community through a combination of publications, committee memberships, and workshops. Learn more about some of our research below.

Google was a Gold supporter of ICSE 2015.

Technical Research Papers:
A Flexible and Non-intrusive Approach for Computing Complex Structural Coverage Metrics
Michael W. Whalen, Suzette Person, Neha Rungta, Matt Staats, Daniela Grijincu

Automated Decomposition of Build Targets
Mohsen Vakilian, Raluca Sauciuc, David Morgenthaler, Vahab Mirrokni

Tricorder: Building a Program Analysis Ecosystem
Caitlin Sadowski, Jeffrey van Gogh, Ciera Jaspan, Emma Soederberg, Collin Winter

Software Engineering in Practice (SEIP) Papers:
Comparing Software Architecture Recovery Techniques Using Accurate Dependencies
Thibaud Lutellier, Devin Chollak, Joshua Garcia, Lin Tan, Derek Rayside, Nenad Medvidovic, Robert Kroeger

Technical Briefings:
Software Engineering for Privacy in-the-Large
Pauline Anthonysamy, Awais Rashid

Workshop Organizers:
2nd International Workshop on Requirements Engineering and Testing (RET 2015)
Elizabeth Bjarnason, Mirko Morandini, Markus Borg, Michael Unterkalmsteiner, Michael Felderer, Matthew Staats

Committee Members:
Caitlin Sadowski - Program Committee Member and Distinguished Reviewer Award Winner
James Andrews - Review Committee Member
Ray Buse - Software Engineering in Practice (SEIP) Committee Member and Demonstrations Committee Member
John Penix - Software Engineering in Practice (SEIP) Committee Member
Marija Mikic - Poster Co-chair
Daniel Popescu and Ivo Krka - Poster Committee Members