Tag Archives: machine learning

Preference learning with automated feedback for cache eviction

Caching is a ubiquitous idea in computer science that significantly improves the performance of storage and retrieval systems by storing a subset of popular items closer to the client based on request patterns. An important algorithmic piece of cache management is the decision policy used for dynamically updating the set of items being stored, which has been extensively optimized over several decades, resulting in several efficient and robust heuristics. While applying machine learning to cache policies has shown promising results in recent years (e.g., LRB, LHD, storage applications), it remains a challenge to outperform robust heuristics in a way that can generalize reliably beyond benchmarks to production settings, while maintaining competitive compute and memory overheads.

In “HALP: Heuristic Aided Learned Preference Eviction Policy for YouTube Content Delivery Network”, presented at NSDI 2023, we introduce a scalable state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. The Heuristic Aided Learned Preference (HALP) framework is a meta-algorithm that uses randomization to merge a lightweight heuristic baseline eviction rule with a learned reward model. The reward model is a lightweight neural network that is continuously trained with ongoing automated feedback on preference comparisons designed to mimic the offline oracle. We discuss how HALP has improved infrastructure efficiency and user video playback latency for YouTube’s content delivery network.


Learned preferences for cache eviction decisions

The HALP framework computes cache eviction decisions based on two components: (1) a neural reward model trained with automated feedback via preference learning, and (2) a meta-algorithm that combines a learned reward model with a fast heuristic. As the cache observes incoming requests, HALP continuously trains a small neural network that predicts a scalar reward for each item by formulating this as a preference learning method via pairwise preference feedback. This aspect of HALP is similar to reinforcement learning from human feedback (RLHF) systems, but with two important distinctions:

  • Feedback is automated and leverages well-known results about the structure of offline optimal cache eviction policies.
  • The model is learned continuously using a transient buffer of training examples constructed from the automated feedback process.

The eviction decisions rely on a filtering mechanism with two steps. First, a small subset of candidates is selected using a heuristic that is efficient, but suboptimal in terms of performance. Then, a re-ranking step optimizes from within the baseline candidates via the sparing use of a neural network scoring function to “boost” the quality of the final decision.

As a production ready cache policy implementation, HALP not only makes eviction decisions, but also subsumes the end-to-end process of sampling pairwise preference queries used to efficiently construct relevant feedback and update the model to power eviction decisions.


A neural reward model

HALP uses a light-weight two-layer multilayer perceptron (MLP) as its reward model to selectively score individual items in the cache. The features are constructed and managed as a metadata-only “ghost cache” (similar to classical policies like ARC). After any given lookup request, in addition to regular cache operations, HALP conducts the book-keeping (e.g., tracking and updating feature metadata in a capacity-constrained key-value store) needed to update the dynamic internal representation. This includes: (1) externally tagged features provided by the user as input, along with a cache lookup request, and (2) internally constructed dynamic features (e.g., time since last access, average time between accesses) constructed from lookup times observed on each item.

HALP learns its reward model fully online starting from a random weight initialization. This might seem like a bad idea, especially if the decisions are made exclusively for optimizing the reward model. However, the eviction decisions rely on both the learned reward model and a suboptimal but simple and robust heuristic like LRU. This allows for optimal performance when the reward model has fully generalized, while remaining robust to a temporarily uninformative reward model that is yet to generalize, or in the process of catching up to a changing environment.

Another advantage of online training is specialization. Each cache server runs in a potentially different environment (e.g., geographic location), which influences local network conditions and what content is locally popular, among other things. Online training automatically captures this information while reducing the burden of generalization, as opposed to a single offline training solution.


Scoring samples from a randomized priority queue

It can be impractical to optimize for the quality of eviction decisions with an exclusively learned objective for two reasons.

  1. Compute efficiency constraints: Inference with a learned network can be significantly more expensive than the computations performed in practical cache policies operating at scale. This limits not only the expressivity of the network and features, but also how often these are invoked during each eviction decision.
  2. Robustness for generalizing out-of-distribution: HALP is deployed in a setup that involves continual learning, where a quickly changing workload might generate request patterns that might be temporarily out-of-distribution with respect to previously seen data.

To address these issues, HALP first applies an inexpensive heuristic scoring rule that corresponds to an eviction priority to identify a small candidate sample. This process is based on efficient random sampling that approximates exact priority queues. The priority function for generating candidate samples is intended to be quick to compute using existing manually-tuned algorithms, e.g., LRU. However, this is configurable to approximate other cache replacement heuristics by editing a simple cost function. Unlike prior work, where the randomization was used to tradeoff approximation for efficiency, HALP also relies on the inherent randomization in the sampled candidates across time steps for providing the necessary exploratory diversity in the sampled candidates for both training and inference.

The final evicted item is chosen from among the supplied candidates, equivalent to the best-of-n reranked sample, corresponding to maximizing the predicted preference score according to the neural reward model. The same pool of candidates used for eviction decisions is also used to construct the pairwise preference queries for automated feedback, which helps minimize the training and inference skew between samples.

An overview of the two-stage process invoked for each eviction decision.


Online preference learning with automated feedback

The reward model is learned using online feedback, which is based on automatically assigned preference labels that indicate, wherever feasible, the ranked preference ordering for the time taken to receive future re-accesses, starting from a given snapshot in time among each queried sample of items. This is similar to the oracle optimal policy, which, at any given time, evicts an item with the farthest future access from all the items in the cache.

Generation of the automated feedback for learning the reward model.

To make this feedback process informative, HALP constructs pairwise preference queries that are most likely to be relevant for eviction decisions. In sync with the usual cache operations, HALP issues a small number of pairwise preference queries while making each eviction decision, and appends them to a set of pending comparisons. The labels for these pending comparisons can only be resolved at a random future time. To operate online, HALP also performs some additional book-keeping after each lookup request to process any pending comparisons that can be labeled incrementally after the current request. HALP indexes the pending comparison buffer with each element involved in the comparison, and recycles the memory consumed by stale comparisons (neither of which may ever get a re-access) to ensure that the memory overhead associated with feedback generation remains bounded over time.

Overview of all main components in HALP.


Results: Impact on the YouTube CDN

Through empirical analysis, we show that HALP compares favorably to state-of-the-art cache policies on public benchmark traces in terms of cache miss rates. However, while public benchmarks are a useful tool, they are rarely sufficient to capture all the usage patterns across the world over time, not to mention the diverse hardware configurations that we have already deployed.

Until recently, YouTube servers used an optimized LRU-variant for memory cache eviction. HALP increases YouTube’s memory egress/ingress — the ratio of the total bandwidth egress served by the CDN to that consumed for retrieval (ingress) due to cache misses — by roughly 12% and memory hit rate by 6%. This reduces latency for users, since memory reads are faster than disk reads, and also improves egressing capacity for disk-bounded machines by shielding the disks from traffic.

The figure below shows a visually compelling reduction in the byte miss ratio in the days following HALP’s final rollout on the YouTube CDN, which is now serving significantly more content from within the cache with lower latency to the end user, and without having to resort to more expensive retrieval that increases the operating costs.

Aggregate worldwide YouTube byte miss ratio before and after rollout (vertical dashed line).

An aggregated performance improvement could still hide important regressions. In addition to measuring overall impact, we also conduct an analysis in the paper to understand its impact on different racks using a machine level analysis, and find it to be overwhelmingly positive.


Conclusion

We introduced a scalable state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. Because of its design choices, HALP can be deployed in a manner similar to any other cache policy without the operational overhead of having to separately manage the labeled examples, training procedure and the model versions as additional offline pipelines common to most machine learning systems. Therefore, it incurs only a small extra overhead compared to other classical algorithms, but has the added benefit of being able to take advantage of additional features to make its eviction decisions and continuously adapt to changing access patterns.

This is the first large-scale deployment of a learned cache policy to a widely used and heavily trafficked CDN, and has significantly improved the CDN infrastructure efficiency while also delivering a better quality of experience to users.


Acknowledgements

Ramki Gummadi is now part of Google DeepMind. We would like to thank John Guilyard for help with the illustrations and Richard Schooler for feedback on this post.

Source: Google AI Blog


Speed is all you need: On-device acceleration of large diffusion models via GPU-aware optimizations

The proliferation of large diffusion models for image generation has led to a significant increase in model size and inference workloads. On-device ML inference in mobile environments requires meticulous performance optimization and consideration of trade-offs due to resource constraints. Running inference of large diffusion models (LDMs) on-device, driven by the need for cost efficiency and user privacy, presents even greater challenges due to the substantial memory requirements and computational demands of these models.

We address this challenge in our work titled “Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations” (to be presented at the CVPR 2023 workshop for Efficient Deep Learning for Computer Vision) focusing on the optimized execution of a foundational LDM model on a mobile GPU. In this blog post, we summarize the core techniques we employed to successfully execute large diffusion models like Stable Diffusion at full resolution (512x512 pixels) and 20 iterations on modern smartphones with high-performing inference speed of the original model without distillation of under 12 seconds. As discussed in our previous blog post, GPU-accelerated ML inference is often limited by memory performance, and execution of LDMs is no exception. Therefore, the central theme of our optimization is efficient memory input/output (I/O) even if it means choosing memory-efficient algorithms over those that prioritize arithmetic logic unit efficiency. Ultimately, our primary objective is to reduce the overall latency of the ML inference.

A sample output of an LDM on Mobile GPU with the prompt text: “a photo realistic and high resolution image of a cute puppy with surrounding flowers”.

Enhanced attention module for memory efficiency

An ML inference engine typically provides a variety of optimized ML operations. Despite this, achieving optimal performance can still be challenging as there is a certain amount of overhead for executing individual neural net operators on a GPU. To mitigate this overhead, ML inference engines incorporate extensive operator fusion rules that consolidate multiple operators into a single operator, thereby reducing the number of iterations across tensor elements while maximizing compute per iteration. For instance, TensorFlow Lite utilizes operator fusion to combine computationally expensive operations, like convolutions, with subsequent activation functions, like rectified linear units, into one.

A clear opportunity for optimization is the heavily used attention block adopted in the denoiser model in the LDM. The attention blocks allow the model to focus on specific parts of the input by assigning higher weights to important regions. There are multiple ways one can optimize the attention modules, and we selectively employ one of the two optimizations explained below depending on which optimization performs better.

The first optimization, which we call partially fused softmax, removes the need for extensive memory writes and reads between the softmax and the matrix multiplication in the attention module. Let the attention block be just a simple matrix multiplication of the form Y = softmax(X) * W where X and W are 2D matrices of shape a×b and b×c, respectively (shown below in the top half).

For numerical stability, T = softmax(X) is typically calculated in three passes:

  1. Determine the maximum value in the list, i.e., for each row in matrix X
  2. Sum up the differences of the exponential of each list item and the maximum value (from pass 1)
  3. Divide the exponential of the items minus the maximum value by the sum from pass 2

Carrying out these passes naïvely would result in a huge memory write for the temporary intermediate tensor T holding the output of the entire softmax function. We bypass this large memory write if we only store the results of passes 1 and 2, labeled m and s, respectively, which are small vectors, with a elements each, compared to T which has a·b elements. With this technique, we are able to reduce tens or even hundreds of megabytes of memory consumption by multiple orders of magnitude (shown below in the bottom half).

Attention modules. Top: A naïve attention block, composed of a SOFTMAX (with all three passes) and a MATMUL, requires a large memory write for the big intermediate tensor T. Bottom: Our memory-efficient attention block with partially fused softmax in MATMUL only needs to store two small intermediate tensors for m and s.

The other optimization involves employing FlashAttention, which is an I/O-aware, exact attention algorithm. This algorithm reduces the number of GPU high-bandwidth memory accesses, making it a good fit for our memory bandwidth–limited use case. However, we found this technique to only work for SRAM with certain sizes and to require a large number of registers. Therefore, we only leverage this technique for attention matrices with a certain size on a select set of GPUs.


Winograd fast convolution for 3×3 convolution layers

The backbone of common LDMs heavily relies on 3×3 convolution layers (convolutions with filter size 3×3), comprising over 90% of the layers in the decoder. Despite increased memory consumption and numerical errors, we found that Winograd fast convolution to be effective at speeding up the convolutions. Distinct from the filter size 3x3 used in convolutions, tile size refers to the size of a sub region of the input tensor that is processed at a time. Increasing the tile size enhances the efficiency of the convolution in terms of arithmetic logic unit (ALU) usage. However, this improvement comes at the expense of increased memory consumption. Our tests indicate that a tile size of 4×4 achieves the optimal trade-off between computational efficiency and memory utilization.


    Memory usage    
    Tile size         FLOPS savings         Intermediate tensors         Weights    
2×2 2.25× 4.00× 1.77×
4×4 4.00× 2.25× 4.00×
6×6 5.06× 1.80× 7.12×
8×8 5.76× 1.56× 11.1×

Impact of Winograd with varying tile sizes for 3×3 convolutions.

Specialized operator fusion for memory efficiency

We discovered that performantly inferring LDMs on a mobile GPU requires significantly larger fusion windows for commonly employed layers and units in LDMs than current off-the-shelf on-device GPU-accelerated ML inference engines provide. Consequently, we developed specialized implementations that could execute a larger range of neural operators than typical fusion rules would permit. Specifically, we focused on two specializations: the Gaussian Error Linear Unit (GELU) and the group normalization layer.

An approximation of GELU with the hyperbolic tangent function requires writing to and reading from seven auxiliary intermediate tensors (shown below as light orange rounded rectangles in the figure below), reading from the input tensor x three times, and writing to the output tensor y once across eight GPU programs implementing the labeled operation each (light blue rectangles). A custom GELU implementation that performs the eight operations in a single shader (shown below in the bottom) can bypass all the memory I/O for the intermediate tensors.

GELU implementations. Top: A naïve implementation with built-in operations would require 8 memory writes and 10 reads. Bottom: Our custom GELU only requires 1 memory read (for x) and 1 write (for y).

Results

After applying all of these optimizations, we conducted tests of Stable Diffusion 1.5 (image resolution 512x512, 20 iterations) on high-end mobile devices. Running Stable Diffusion with our GPU-accelerated ML inference model uses 2,093MB for the weights and 84MB for the intermediate tensors. With latest high-end smartphones, Stable Diffusion can be run in under 12 seconds.

Stable Diffusion runs on modern smartphones in under 12 seconds. Note that running the decoder after each iteration for displaying the intermediate output in this animated GIF results in a ~2× slowdown.

Conclusion

Performing on-device ML inference of large models has proven to be a substantial challenge, encompassing limitations in model file size, extensive runtime memory requirements, and protracted inference latency. By recognizing memory bandwidth usage as the primary bottleneck, we directed our efforts towards optimizing memory bandwidth utilization and striking a delicate balance between ALU efficiency and memory efficiency. As a result, we achieved state-of-the-art inference latency for large diffusion models. You can learn more about this work in the paper.


Acknowledgments

We'd like to thank Yu-Hui Chen, Jiuqiang Tang, Frank Barchard, Yang Zhao, Joe Zou, Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, Lu Wang, and Matthias Grundmann.

Source: Google AI Blog


Retrieval-augmented visual-language pre-training

Large-scale models, such as T5, GPT-3, PaLM, Flamingo and PaLI, have demonstrated the ability to store substantial amounts of knowledge when scaled to tens of billions of parameters and trained on large text and image datasets. These models achieve state-of-the-art results on downstream tasks, such as image captioning, visual question answering and open vocabulary recognition. Despite such achievements, these models require a massive volume of data for training and end up with a tremendous number of parameters (billions in many cases), resulting in significant computational requirements. Moreover, the data used to train these models can become outdated, requiring re-training every time the world's knowledge is updated. For example, a model trained just two years ago might yield outdated information about the current president of the United States.

In the fields of natural language processing (RETRO, REALM) and computer vision (KAT), researchers have attempted to address these challenges using retrieval-augmented models. Typically, these models use a backbone that is able to process a single modality at a time, e.g., only text or only images, to encode and retrieve information from a knowledge corpus. However, these retrieval-augmented models are unable to leverage all available modalities in a query and knowledge corpora, and may not find the information that is most helpful for generating the model’s output.

To address these issues, in “REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory”, to appear at CVPR 2023, we introduce a visual-language model that learns to utilize a multi-source multi-modal “memory” to answer knowledge-intensive queries. REVEAL employs neural representation learning to encode and convert diverse knowledge sources into a memory structure consisting of key-value pairs. The keys serve as indices for the memory items, while the corresponding values store pertinent information about those items. During training, REVEAL learns the key embeddings, value tokens, and the ability to retrieve information from this memory to address knowledge-intensive queries. This approach allows the model parameters to focus on reasoning about the query, rather than being dedicated to memorization.

We augment a visual-language model with the ability to retrieve multiple knowledge entries from a diverse set of knowledge sources, which helps generation.


Memory construction from multimodal knowledge corpora

Our approach is similar to REALM in that we precompute key and value embeddings of knowledge items from different sources and index them in a unified knowledge memory, where each knowledge item is encoded into a key-value pair. Each key is a d-dimensional embedding vector, while each value is a sequence of token embeddings representing the knowledge item in more detail. In contrast to previous work, REVEAL leverages a diverse set of multimodal knowledge corpora, including the WikiData knowledge graph, Wikipedia passages and images, web image-text pairs and visual question answering data. Each knowledge item could be text, an image, a combination of both (e.g., pages in Wikipedia) or a relationship or attribute from a knowledge graph (e.g., Barack Obama is 6’ 2” tall). During training, we continuously re-compute the memory key and value embeddings as the model parameters get updated. We update the memory asynchronously at every thousand training steps.


Scaling memory using compression

A naïve solution for encoding a memory value is to keep the whole sequence of tokens for each knowledge item. Then, the model could fuse the input query and the top-k retrieved memory values by concatenating all their tokens together and feeding them into a transformer encoder-decoder pipeline. This approach has two issues: (1) storing hundreds of millions of knowledge items in memory is impractical if each memory value consists of hundreds of tokens and (2) the transformer encoder has a quadratic complexity with respect to the total number of tokens times k for self-attention. Therefore, we propose to use the Perceiver architecture to encode and compress knowledge items. The Perceiver model uses a transformer decoder to compress the full token sequence into an arbitrary length. This lets us retrieve top-k memory entries for k as large as a hundred.

The following figure illustrates the procedure of constructing the memory key-value pairs. Each knowledge item is processed through a multi-modal visual-language encoder, resulting in a sequence of image and text tokens. The key head then transforms these tokens into a compact embedding vector. The value head (perceiver) condenses these tokens into fewer ones, retaining the pertinent information about the knowledge item within them.

We encode the knowledge entries from different corpora into unified key and value embedding pairs, where the keys are used to index the memory and values contain information about the entries.


Large-scale pre-training on image-text pairs

To train the REVEAL model, we begin with the large-scale corpus, collected from the public Web with three billion image alt-text caption pairs, introduced in LiT. Since the dataset is noisy, we add a filter to remove data points with captions shorter than 50 characters, which yields roughly 1.3 billion image caption pairs. We then take these pairs, combined with the text generation objective used in SimVLM, to train REVEAL. Given an image-text example, we randomly sample a prefix containing the first few tokens of the text. We feed the text prefix and image to the model as input with the objective of generating the rest of the text as output. The training goal is to condition the prefix and autoregressively generate the remaining text sequence.

To train all components of the REVEAL model end-to-end, we need to warm start the model to a good state (setting initial values to model parameters). Otherwise, if we were to start with random weights (cold-start), the retriever would often return irrelevant memory items that would never generate useful training signals. To avoid this cold-start problem, we construct an initial retrieval dataset with pseudo–ground-truth knowledge to give the pre-training a reasonable head start.

We create a modified version of the WIT dataset for this purpose. Each image-caption pair in WIT also comes with a corresponding Wikipedia passage (words surrounding the text). We put together the surrounding passage with the query image and use it as the pseudo ground-truth knowledge that corresponds to the input query. The passage provides rich information about the image and caption, which is useful for initializing the model.

To prevent the model from relying on low-level image features for retrieval, we apply random data augmentation to the input query image. Given this modified dataset that contains pseudo-retrieval ground-truth, we train the query and memory key embeddings to warm start the model.


REVEAL workflow

The overall workflow of REVEAL consists of four primary steps. First, REVEAL encodes a multimodal input into a sequence of token embeddings along with a condensed query embedding. Then, the model translates each multi-source knowledge entry into unified pairs of key and value embeddings, with the key being utilized for memory indexing and the value encompassing the entire information about the entry. Next, REVEAL retrieves the top-k most related knowledge pieces from multiple knowledge sources, returns the pre-processed value embeddings stored in memory, and re-encodes the values. Finally, REVEAL fuses the top-k knowledge pieces through an attentive knowledge fusion layer by injecting the retrieval score (dot product between query and key embeddings) as a prior during attention calculation. This structure is instrumental in enabling the memory, encoder, retriever and the generator to be concurrently trained in an end-to-end fashion.

Overall workflow of REVEAL.


Results

We evaluate REVEAL on knowledge-based visual question answering tasks using OK-VQA and A-OKVQA datasets. We fine-tune our pre-trained model on the VQA tasks using the same generative objective where the model takes in an image-question pair as input and generates the text answer as output. We demonstrate that REVEAL achieves better results on the A-OKVQA dataset than earlier attempts that incorporate a fixed knowledge or the works that utilize large language models (e.g., GPT-3) as an implicit source of knowledge.

Visual question answering results on A-OKVQA. REVEAL achieves higher accuracy in comparison to previous works including ViLBERT, LXMERT, ClipCap, KRISP and GPV-2.

We also evaluate REVEAL on the image captioning benchmarks using MSCOCO and NoCaps dataset. We directly fine-tune REVEAL on the MSCOCO training split via the cross-entropy generative objective. We measure our performance on the MSCOCO test split and NoCaps evaluation set using the CIDEr metric, which is based on the idea that good captions should be similar to reference captions in terms of word choice, grammar, meaning, and content. Our results on MSCOCO caption and NoCaps datasets are shown below.

Image Captioning results on MSCOCO and NoCaps using the CIDEr metric. REVEAL achieves a higher score in comparison to Flamingo, VinVL, SimVLM and CoCa.

Below we show a couple of qualitative examples of how REVEAL retrieves relevant documents to answer visual questions.

REVEAL can use knowledge from different sources to correctly answer the question.


Conclusion

We present an end-to-end retrieval-augmented visual language (REVEAL) model, which contains a knowledge retriever that learns to utilize a diverse set of knowledge sources with different modalities. We train REVEAL on a massive image-text corpus with four diverse knowledge corpora, and achieve state-of-the-art results on knowledge-intensive visual question answering and image caption tasks. In the future we would like to explore the ability of this model for attribution, and apply it to a broader class of multimodal tasks.


Acknowledgements

This research was conducted by Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross and Alireza Fathi.

Source: Google AI Blog


Large sequence models for software development activities

Software isn’t created in one dramatic step. It improves bit by bit, one little step at a time — editing, running unit tests, fixing build errors, addressing code reviews, editing some more, appeasing linters, and fixing more errors — until finally it becomes good enough to merge into a code repository. Software engineering isn’t an isolated process, but a dialogue among human developers, code reviewers, bug reporters, software architects and tools, such as compilers, unit tests, linters and static analyzers.

Today we describe DIDACT (​​Dynamic Integrated Developer ACTivity), which is a methodology for training large machine learning (ML) models for software development. The novelty of DIDACT is that it uses the process of software development as the source of training data for the model, rather than just the polished end state of that process, the finished code. By exposing the model to the contexts that developers see as they work, paired with the actions they take in response, the model learns about the dynamics of software development and is more aligned with how developers spend their time. We leverage instrumentation of Google's software development to scale up the quantity and diversity of developer-activity data beyond previous works. Results are extremely promising along two dimensions: usefulness to professional software developers, and as a potential basis for imbuing ML models with general software development skills.

DIDACT is a multi-task model trained on development activities that include editing, debugging, repair, and code review.

We built and deployed internally three DIDACT tools, Comment Resolution (which we recently announced), Build Repair, and Tip Prediction, each integrated at different stages of the development workflow. All three of these tools received enthusiastic feedback from thousands of internal developers. We see this as the ultimate test of usefulness: do professional developers, who are often experts on the code base and who have carefully honed workflows, leverage the tools to improve their productivity?

Perhaps most excitingly, we demonstrate how DIDACT is a first step towards a general-purpose developer-assistance agent. We show that the trained model can be used in a variety of surprising ways, via prompting with prefixes of developer activities, and by chaining together multiple predictions to roll out longer activity trajectories. We believe DIDACT paves a promising path towards developing agents that can generally assist across the software development process.


A treasure trove of data about the software engineering process

Google’s software engineering toolchains store every operation related to code as a log of interactions among tools and developers, and have done so for decades. In principle, one could use this record to replay in detail the key episodes in the “software engineering video” of how Google’s codebase came to be, step-by-step — one code edit, compilation, comment, variable rename, etc., at a time.

Google code lives in a monorepo, a single repository of code for all tools and systems. A software developer typically experiments with code changes in a local copy-on-write workspace managed by a system called Clients in the Cloud (CitC). When the developer is ready to package a set of code changes together for a specific purpose (e.g., fixing a bug), they create a changelist (CL) in Critique, Google’s code-review system. As with other types of code-review systems, the developer engages in a dialog with a peer reviewer about functionality and style. The developer edits their CL to address reviewer comments as the dialog progresses. Eventually, the reviewer declares “LGTM!” (“looks good to me”), and the CL is merged into the code repository.

Of course, in addition to a dialog with the code reviewer, the developer also maintains a “dialog” of sorts with a plethora of other software engineering tools, such as the compiler, the testing framework, linters, static analyzers, fuzzers, etc.

An illustration of the intricate web of activities involved in developing software: small actions by the developer, interactions with a code reviewer, and invocations of tools such as compilers.

A multi-task model for software engineering

DIDACT utilizes interactions among engineers and tools to power ML models that assist Google developers, by suggesting or enhancing actions developers take — in context — while pursuing their software-engineering tasks. To do that, we have defined a number of tasks about individual developer activities: repairing a broken build, predicting a code-review comment, addressing a code-review comment, renaming a variable, editing a file, etc. We use a common formalism for each activity: it takes some State (a code file), some Intent (annotations specific to the activity, such as code-review comments or compiler errors), and produces an Action (the operation taken to address the task). This Action is like a mini programming language, and can be extended for newly added activities. It covers things like editing, adding comments, renaming variables, marking up code with errors, etc. We call this language DevScript.

The DIDACT model is prompted with a task, code snippets, and annotations related to that task, and produces development actions, e.g., edits or comments.

This state-intent-action formalism enables us to capture many different tasks in a general way. What’s more, DevScript is a concise way to express complex actions, without the need to output the whole state (the original code) as it would be after the action takes place; this makes the model more efficient and more interpretable. For example, a rename might touch a file in dozens of places, but a model can predict a single rename action.


An ML peer programmer

DIDACT does a good job on individual assistive tasks. For example, below we show DIDACT doing code clean-up after functionality is mostly done. It looks at the code along with some final comments by the code reviewer (marked with “human” in the animation), and predicts edits to address those comments (rendered as a diff).

Given an initial snippet of code and the comments that a code reviewer attached to that snippet, the Pre-Submit Cleanup task of DIDACT produces edits (insertions and deletions of text) that address those comments.

The multimodal nature of DIDACT also gives rise to some surprising capabilities, reminiscent of behaviors emerging with scale. One such capability is history augmentation, which can be enabled via prompting. Knowing what the developer did recently enables the model to make a better guess about what the developer should do next.

An illustration of history-augmented code completion in action.

A powerful such task exemplifying this capability is history-augmented code completion. In the figure below, the developer adds a new function parameter (1), and moves the cursor into the documentation (2). Conditioned on the history of developer edits and the cursor position, the model completes the line (3) by correctly predicting the docstring entry for the new parameter.

An illustration of edit prediction, over multiple chained iterations.

In an even more powerful history-augmented task, edit prediction, the model can choose where to edit next in a fashion that is historically consistent. If the developer deletes a function parameter (1), the model can use history to correctly predict an update to the docstring (2) that removes the deleted parameter (without the human developer manually placing the cursor there) and to update a statement in the function (3) in a syntactically (and — arguably — semantically) correct way. With history, the model can unambiguously decide how to continue the “editing video” correctly. Without history, the model wouldn’t know whether the missing function parameter is intentional (because the developer is in the process of a longer edit to remove it) or accidental (in which case the model should re-add it to fix the problem).

The model can go even further. For example, we started with a blank file and asked the model to successively predict what edits would come next until it had written a full code file. The astonishing part is that the model developed code in a step-by-step way that would seem natural to a developer: It started by first creating a fully working skeleton with imports, flags, and a basic main function. It then incrementally added new functionality, like reading from a file and writing results, and added functionality to filter out some lines based on a user-provided regular expression, which required changes across the file, like adding new flags.


Conclusion

DIDACT turns Google's software development process into training demonstrations for ML developer assistants, and uses those demonstrations to train models that construct code in a step-by-step fashion, interactively with tools and code reviewers. These innovations are already powering tools enjoyed by Google developers every day. The DIDACT approach complements the great strides taken by large language models at Google and elsewhere, towards technologies that ease toil, improve productivity, and enhance the quality of work of software engineers.


Acknowledgements

This work is the result of a multi-year collaboration among Google Research, Google Core Systems and Experiences, and DeepMind. We would like to acknowledge our colleagues Jacob Austin, Pascal Lamblin, Pierre-Antoine Manzagol, and Daniel Zheng, who join us as the key drivers of this project. This work could not have happened without the significant and sustained contributions of our partners at Alphabet (Peter Choy, Henryk Michalewski, Subhodeep Moitra, Malgorzata Salawa, Vaibhav Tulsyan, and Manushree Vijayvergiya), as well as the many people who collected data, identified tasks, built products, strategized, evangelized, and helped us execute on the many facets of this agenda (Ankur Agarwal, Paige Bailey, Marc Brockschmidt, Rodrigo Damazio Bovendorp, Satish Chandra, Savinee Dancs, Matt Frazier, Alexander Frömmgen, Nimesh Ghelani, Chris Gorgolewski, Chenjie Gu, Vincent Hellendoorn, Franjo Ivančić, Marko Ivanković, Emily Johnston, Luka Kalinovcic, Lera Kharatyan, Jessica Ko, Markus Kusano, Kathy Nix, Sara Qu, Marc Rasi, Marcus Revaj, Ballie Sandhu, Michael Sloan, Tom Small, Gabriela Surita, Maxim Tabachnyk, David Tattersall, Sara Toth, Kevin Villela, Sara Wiltberger, and Donald Duo Zhao) and our extremely supportive leadership (Martín Abadi, Joelle Barral, Jeff Dean, Madhura Dudhgaonkar, Douglas Eck, Zoubin Ghahramani, Hugo Larochelle, Chandu Thekkath, and Niranjan Tulpule). Thank you!

Source: Google AI Blog


Barkour: Benchmarking animal-level agility with quadruped robots

Creating robots that exhibit robust and dynamic locomotion capabilities, similar to animals or humans, has been a long-standing goal in the robotics community. In addition to completing tasks quickly and efficiently, agility allows legged robots to move through complex environments that are otherwise difficult to traverse. Researchers at Google have been pursuing agility for multiple years and across various form factors. Yet, while researchers have enabled robots to hike or jump over some obstacles, there is still no generally accepted benchmark that comprehensively measures robot agility or mobility. In contrast, benchmarks are driving forces behind the development of machine learning, such as ImageNet for computer vision, and OpenAI Gym for reinforcement learning (RL).

In “Barkour: Benchmarking Animal-level Agility with Quadruped Robots”, we introduce the Barkour agility benchmark for quadruped robots, along with a Transformer-based generalist locomotion policy. Inspired by dog agility competitions, a legged robot must sequentially display a variety of skills, including moving in different directions, traversing uneven terrains, and jumping over obstacles within a limited timeframe to successfully complete the benchmark. By providing a diverse and challenging obstacle course, the Barkour benchmark encourages researchers to develop locomotion controllers that move fast in a controllable and versatile way. Furthermore, by tying the performance metric to real dog performance, we provide an intuitive metric to understand the robot performance with respect to their animal counterparts.



We invited a handful of dooglers to try the obstacle course to ensure that our agility objectives were realistic and challenging. Small dogs complete the obstacle course in approximately 10s, whereas our robot’s typical performance hovers around 20s.


Barkour benchmark

The Barkour scoring system uses a per obstacle and an overall course target time based on the target speed of small dogs in the novice agility competitions (about 1.7m/s). Barkour scores range from 0 to 1, with 1 corresponding to the robot successfully traversing all the obstacles along the course within the allotted time of approximately 10 seconds, the average time needed for a similar-sized dog to traverse the course. The robot receives penalties for skipping, failing obstacles, or moving too slowly.

Our standard course consists of four unique obstacles in a 5m x 5m area. This is a denser and smaller setup than a typical dog competition to allow for easy deployment in a robotics lab. Beginning at the start table, the robot needs to weave through a set of poles, climb an A-frame, clear a 0.5m broad jump and then step onto the end table. We chose this subset of obstacles because they test a diverse set of skills while keeping the setup within a small footprint. As is the case for real dog agility competitions, the Barkour benchmark can be easily adapted to a larger course area and may incorporate a variable number of obstacles and course configurations.

Overview of the Barkour benchmark’s obstacle course setup, which consists of weave poles, an A-frame, a broad jump, and pause tables. The intuitive scoring mechanism, inspired by dog agility competitions, balances speed, agility and performance and can be easily modified to incorporate other types of obstacles or course configurations.


Learning agile locomotion skills

The Barkour benchmark features a diverse set of obstacles and a delayed reward system, which pose a significant challenge when training a single policy that can complete the entire obstacle course. So in order to set a strong performance baseline and demonstrate the effectiveness of the benchmark for robotic agility research, we adopt a student-teacher framework combined with a zero-shot sim-to-real approach. First, we train individual specialist locomotion skills (teacher) for different obstacles using on-policy RL methods. In particular, we leverage recent advances in large-scale parallel simulation to equip the robot with individual skills, including walking, slope climbing, and jumping policies.

Next, we train a single policy (student) that performs all the skills and transitions in between by using a student-teacher framework, based on the specialist skills we previously trained. We use simulation rollouts to create datasets of state-action pairs for each one of the specialist skills. This dataset is then distilled into a single Transformer-based generalist locomotion policy, which can handle various terrains and adjust the robot's gait based on the perceived environment and the robot’s state.

During deployment, we pair the locomotion transformer policy that is capable of performing multiple skills with a navigation controller that provides velocity commands based on the robot's position. Our trained policy controls the robot based on the robot's surroundings represented as an elevation map, velocity commands, and on-board sensory information provided by the robot.



Deployment pipeline for the locomotion transformer architecture. At deployment time, a high-level navigation controller guides the real robot through the obstacle course by sending commands to the locomotion transformer policy.

Robustness and repeatability are difficult to achieve when we aim for peak performance and maximum speed. Sometimes, the robot might fail when overcoming an obstacle in an agile way. To handle failures we train a recovery policy that quickly gets the robot back on its feet, allowing it to continue the episode.


Evaluation

We evaluate the Transformer-based generalist locomotion policy using custom-built quadruped robots and show that by optimizing for the proposed benchmark, we obtain agile, robust, and versatile skills for our robot in the real world. We further provide analysis for various design choices in our system and their impact on the system performance.

Model of the custom-built robots used for evaluation.

We deploy both the specialist and generalist policies to hardware (zero-shot sim-to-real). The robot’s target trajectory is provided by a set of waypoints along the various obstacles. In the case of the specialist policies, we switch between specialist policies by using a hand-tuned policy switching mechanism that selects the most suitable policy given the robot’s position.



Typical performance of our agile locomotion policies on the Barkour benchmark. Our custom-built quadruped robot robustly navigates the terrain’s obstacles by leveraging various skills learned using RL in simulation.

We find that very often our policies can handle unexpected events or even hardware degradation resulting in good average performance, but failures are still possible. As illustrated in the image below, in case of failures, our recovery policy quickly gets the robot back on its feet, allowing it to continue the episode. By combining the recovery policy with a simple walk-back-to-start policy, we are able to run repeated experiments with minimal human intervention to measure the robustness.



Qualitative example of robustness and recovery behaviors. The robot trips and rolls over after heading down the A-frame. This triggers the recovery policy, which enables the robot to get back up and continue the course.

We find that across a large number of evaluations, the single generalist locomotion transformer policy and the specialist policies with the policy switching mechanism achieve similar performance. The locomotion transformer policy has a slightly lower average Barkour score, but exhibits smoother transitions between behaviors and gaits.



Measuring robustness of the different policies across a large number of runs on the Barkour benchmark.

Histogram of the agility scores for the locomotion transformer policy. The highest scores shown in blue (0.75 - 0.9) represent the runs where the robot successfully completes all obstacles.


Conclusion

We believe that developing a benchmark for legged robotics is an important first step in quantifying progress toward animal-level agility. To establish a strong baseline, we investigated a zero-shot sim-to-real approach, taking advantage of large-scale parallel simulation and recent advancements in training Transformer-based architectures. Our findings demonstrate that Barkour is a challenging benchmark that can be easily customized, and that our learning-based method for solving the benchmark provides a quadruped robot with a single low-level policy that can perform a variety of agile low-level skills.


Acknowledgments

The authors of this post are now part of Google DeepMind. We would like to thank our co-authors at Google DeepMind and our collaborators at Google Research: Wenhao Yu, J. Chase Kew, Tingnan Zhang, Daniel Freeman, Kuang-Hei Lee, Lisa Lee, Stefano Saliceti, Vincent Zhuang, Nathan Batchelor, Steven Bohez, Federico Casarini, Jose Enrique Chen, Omar Cortes, Erwin Coumans, Adil Dostmohamed, Gabriel Dulac-Arnold, Alejandro Escontrela, Erik Frey, Roland Hafner, Deepali Jain, Yuheng Kuang, Edward Lee, Linda Luu, Ofir Nachum, Ken Oslund, Jason Powell, Diego Reyes, Francesco Romano, Feresteh Sadeghi, Ron Sloat, Baruch Tabanpour, Daniel Zheng, Michael Neunert, Raia Hadsell, Nicolas Heess, Francesco Nori, Jeff Seto, Carolina Parada, Vikas Sindhwani, Vincent Vanhoucke, and Jie Tan. We would also like to thank Marissa Giustina, Ben Jyenis, Gus Kouretas, Nubby Lee, James Lubin, Sherry Moore, Thinh Nguyen, Krista Reymann, Satoshi Kataoka, Trish Blazina, and the members of the robotics team at Google DeepMind for their contributions to the project.

Source: Google AI Blog


Barkour: Benchmarking animal-level agility with quadruped robots

Creating robots that exhibit robust and dynamic locomotion capabilities, similar to animals or humans, has been a long-standing goal in the robotics community. In addition to completing tasks quickly and efficiently, agility allows legged robots to move through complex environments that are otherwise difficult to traverse. Researchers at Google have been pursuing agility for multiple years and across various form factors. Yet, while researchers have enabled robots to hike or jump over some obstacles, there is still no generally accepted benchmark that comprehensively measures robot agility or mobility. In contrast, benchmarks are driving forces behind the development of machine learning, such as ImageNet for computer vision, and OpenAI Gym for reinforcement learning (RL).

In “Barkour: Benchmarking Animal-level Agility with Quadruped Robots”, we introduce the Barkour agility benchmark for quadruped robots, along with a Transformer-based generalist locomotion policy. Inspired by dog agility competitions, a legged robot must sequentially display a variety of skills, including moving in different directions, traversing uneven terrains, and jumping over obstacles within a limited timeframe to successfully complete the benchmark. By providing a diverse and challenging obstacle course, the Barkour benchmark encourages researchers to develop locomotion controllers that move fast in a controllable and versatile way. Furthermore, by tying the performance metric to real dog performance, we provide an intuitive metric to understand the robot performance with respect to their animal counterparts.



We invited a handful of dooglers to try the obstacle course to ensure that our agility objectives were realistic and challenging. Small dogs complete the obstacle course in approximately 10s, whereas our robot’s typical performance hovers around 20s.


Barkour benchmark

The Barkour scoring system uses a per obstacle and an overall course target time based on the target speed of small dogs in the novice agility competitions (about 1.7m/s). Barkour scores range from 0 to 1, with 1 corresponding to the robot successfully traversing all the obstacles along the course within the allotted time of approximately 10 seconds, the average time needed for a similar-sized dog to traverse the course. The robot receives penalties for skipping, failing obstacles, or moving too slowly.

Our standard course consists of four unique obstacles in a 5m x 5m area. This is a denser and smaller setup than a typical dog competition to allow for easy deployment in a robotics lab. Beginning at the start table, the robot needs to weave through a set of poles, climb an A-frame, clear a 0.5m broad jump and then step onto the end table. We chose this subset of obstacles because they test a diverse set of skills while keeping the setup within a small footprint. As is the case for real dog agility competitions, the Barkour benchmark can be easily adapted to a larger course area and may incorporate a variable number of obstacles and course configurations.

Overview of the Barkour benchmark’s obstacle course setup, which consists of weave poles, an A-frame, a broad jump, and pause tables. The intuitive scoring mechanism, inspired by dog agility competitions, balances speed, agility and performance and can be easily modified to incorporate other types of obstacles or course configurations.


Learning agile locomotion skills

The Barkour benchmark features a diverse set of obstacles and a delayed reward system, which pose a significant challenge when training a single policy that can complete the entire obstacle course. So in order to set a strong performance baseline and demonstrate the effectiveness of the benchmark for robotic agility research, we adopt a student-teacher framework combined with a zero-shot sim-to-real approach. First, we train individual specialist locomotion skills (teacher) for different obstacles using on-policy RL methods. In particular, we leverage recent advances in large-scale parallel simulation to equip the robot with individual skills, including walking, slope climbing, and jumping policies.

Next, we train a single policy (student) that performs all the skills and transitions in between by using a student-teacher framework, based on the specialist skills we previously trained. We use simulation rollouts to create datasets of state-action pairs for each one of the specialist skills. This dataset is then distilled into a single Transformer-based generalist locomotion policy, which can handle various terrains and adjust the robot's gait based on the perceived environment and the robot’s state.

During deployment, we pair the locomotion transformer policy that is capable of performing multiple skills with a navigation controller that provides velocity commands based on the robot's position. Our trained policy controls the robot based on the robot's surroundings represented as an elevation map, velocity commands, and on-board sensory information provided by the robot.



Deployment pipeline for the locomotion transformer architecture. At deployment time, a high-level navigation controller guides the real robot through the obstacle course by sending commands to the locomotion transformer policy.

Robustness and repeatability are difficult to achieve when we aim for peak performance and maximum speed. Sometimes, the robot might fail when overcoming an obstacle in an agile way. To handle failures we train a recovery policy that quickly gets the robot back on its feet, allowing it to continue the episode.


Evaluation

We evaluate the Transformer-based generalist locomotion policy using custom-built quadruped robots and show that by optimizing for the proposed benchmark, we obtain agile, robust, and versatile skills for our robot in the real world. We further provide analysis for various design choices in our system and their impact on the system performance.

Model of the custom-built robots used for evaluation.

We deploy both the specialist and generalist policies to hardware (zero-shot sim-to-real). The robot’s target trajectory is provided by a set of waypoints along the various obstacles. In the case of the specialist policies, we switch between specialist policies by using a hand-tuned policy switching mechanism that selects the most suitable policy given the robot’s position.



Typical performance of our agile locomotion policies on the Barkour benchmark. Our custom-built quadruped robot robustly navigates the terrain’s obstacles by leveraging various skills learned using RL in simulation.

We find that very often our policies can handle unexpected events or even hardware degradation resulting in good average performance, but failures are still possible. As illustrated in the image below, in case of failures, our recovery policy quickly gets the robot back on its feet, allowing it to continue the episode. By combining the recovery policy with a simple walk-back-to-start policy, we are able to run repeated experiments with minimal human intervention to measure the robustness.



Qualitative example of robustness and recovery behaviors. The robot trips and rolls over after heading down the A-frame. This triggers the recovery policy, which enables the robot to get back up and continue the course.

We find that across a large number of evaluations, the single generalist locomotion transformer policy and the specialist policies with the policy switching mechanism achieve similar performance. The locomotion transformer policy has a slightly lower average Barkour score, but exhibits smoother transitions between behaviors and gaits.



Measuring robustness of the different policies across a large number of runs on the Barkour benchmark.

Histogram of the agility scores for the locomotion transformer policy. The highest scores shown in blue (0.75 - 0.9) represent the runs where the robot successfully completes all obstacles.


Conclusion

We believe that developing a benchmark for legged robotics is an important first step in quantifying progress toward animal-level agility. To establish a strong baseline, we investigated a zero-shot sim-to-real approach, taking advantage of large-scale parallel simulation and recent advancements in training Transformer-based architectures. Our findings demonstrate that Barkour is a challenging benchmark that can be easily customized, and that our learning-based method for solving the benchmark provides a quadruped robot with a single low-level policy that can perform a variety of agile low-level skills.


Acknowledgments

The authors of this post are now part of Google DeepMind. We would like to thank our co-authors at Google DeepMind and our collaborators at Google Research: Wenhao Yu, J. Chase Kew, Tingnan Zhang, Daniel Freeman, Kuang-Hei Lee, Lisa Lee, Stefano Saliceti, Vincent Zhuang, Nathan Batchelor, Steven Bohez, Federico Casarini, Jose Enrique Chen, Omar Cortes, Erwin Coumans, Adil Dostmohamed, Gabriel Dulac-Arnold, Alejandro Escontrela, Erik Frey, Roland Hafner, Deepali Jain, Yuheng Kuang, Edward Lee, Linda Luu, Ofir Nachum, Ken Oslund, Jason Powell, Diego Reyes, Francesco Romano, Feresteh Sadeghi, Ron Sloat, Baruch Tabanpour, Daniel Zheng, Michael Neunert, Raia Hadsell, Nicolas Heess, Francesco Nori, Jeff Seto, Carolina Parada, Vikas Sindhwani, Vincent Vanhoucke, and Jie Tan. We would also like to thank Marissa Giustina, Ben Jyenis, Gus Kouretas, Nubby Lee, James Lubin, Sherry Moore, Thinh Nguyen, Krista Reymann, Satoshi Kataoka, Trish Blazina, and the members of the robotics team at Google DeepMind for their contributions to the project.

Source: Google AI Blog


Google Research at I/O 2023

Wednesday, May 10th was an exciting day for the Google Research community as we watched the results of months and years of our foundational and applied work get announced on the Google I/O stage. With the quick pace of announcements on stage, it can be difficult to convey the substantial effort and unique innovations that underlie the technologies we presented. So today, we’re excited to reveal more about the research efforts behind some of the many exciting announcements at this year's I/O.


PaLM 2

Our next-generation large language model (LLM), PaLM 2, is built on advances in compute-optimal scaling, scaled instruction-fine tuning and improved dataset mixture. By fine-tuning and instruction-tuning the model for different purposes, we have been able to integrate state-of-the-art capabilities into over 25 Google products and features, where it is already helping to inform, assist and delight users. For example:

  • Bard is an early experiment that lets you collaborate with generative AI and helps to boost productivity, accelerate ideas and fuel curiosity. It builds on advances in deep learning efficiency and leverages reinforcement learning from human feedback to provide more relevant responses and increase the model’s ability to follow instructions. Bard is now available in 180 countries, where users can interact with it in English, Japanese and Korean, and thanks to the multilingual capabilities afforded by PaLM 2, support for 40 languages is coming soon.
  • With Search Generative Experience we’re taking more of the work out of searching, so you’ll be able to understand a topic faster, uncover new viewpoints and insights, and get things done more easily. As part of this experiment, you’ll see an AI-powered snapshot of key information to consider, with links to dig deeper.
  • MakerSuite is an easy-to-use prototyping environment for the PaLM API, powered by PaLM 2. In fact, internal user engagement with early prototypes of MakerSuite accelerated the development of our PaLM 2 model itself. MakerSuite grew out of research focused on prompting tools, or tools explicitly designed for customizing and controlling LLMs. This line of research includes PromptMaker (precursor to MakerSuite), and AI Chains and PromptChainer (one of the first research efforts demonstrating the utility of LLM chaining).
  • Project Tailwind also made use of early research prototypes of MakerSuite to develop features to help writers and researchers explore ideas and improve their prose; its AI-first notebook prototype used PaLM 2 to allow users to ask questions of the model grounded in documents they define.
  • Med-PaLM 2 is our state-of-the-art medical LLM, built on PaLM 2. Med-PaLM 2 achieved 86.5% performance on U.S. Medical Licensing Exam–style questions, illustrating its exciting potential for health. We’re now exploring multimodal capabilities to synthesize inputs like X-rays.
  • Codey is a version of PaLM 2 fine-tuned on source code to function as a developer assistant. It supports a broad range of Code AI features, including code completions, code explanation, bug fixing, source code migration, error explanations, and more. Codey is available through our trusted tester program via IDEs (Colab, Android Studio, Duet AI for Cloud, Firebase) and via a 3P-facing API.

Perhaps even more exciting for developers, we have opened up the PaLM APIs & MakerSuite to provide the community opportunities to innovate using this groundbreaking technology.

PaLM 2 has advanced coding capabilities that enable it to find code errors and make suggestions in a number of different languages.

Imagen

Our Imagen family of image generation and editing models builds on advances in large Transformer-based language models and diffusion models. This family of models is being incorporated into multiple Google products, including:

  • Image generation in Google Slides and Android’s Generative AI wallpaper are powered by our text-to-image generation features.
  • Google Cloud’s Vertex AI enables image generation, image editing, image upscaling and fine-tuning to help enterprise customers meet their business needs.
  • I/O Flip, a digital take on a classic card game, features Google developer mascots on cards that were entirely AI generated. This game showcased a fine-tuning technique called DreamBooth for adapting pre-trained image generation models. Using just a handful of images as inputs for fine-tuning, it allows users to generate personalized images in minutes. With DreamBooth, users can synthesize a subject in diverse scenes, poses, views, and lighting conditions that don’t appear in the reference images.
    I/O Flip presents custom card decks designed using DreamBooth.

Phenaki

Phenaki, Google’s Transformer-based text-to-video generation model was featured in the I/O pre-show. Phenaki is a model that can synthesize realistic videos from textual prompt sequences by leveraging two main components: an encoder-decoder model that compresses videos to discrete embeddings and a transformer model that translates text embeddings to video tokens.


ARCore and the Scene Semantic API

Among the new features of ARCore announced by the AR team at I/O, the Scene Semantic API can recognize pixel-wise semantics in an outdoor scene. This helps users create custom AR experiences based on the features in the surrounding area. This API is empowered by the outdoor semantic segmentation model, leveraging our recent works around the DeepLab architecture and an egocentric outdoor scene understanding dataset. The latest ARCore release also includes an improved monocular depth model that provides higher accuracy in outdoor scenes.

Scene Semantics API uses DeepLab-based semantic segmentation model to provide accurate pixel-wise labels in a scene outdoors.

Chirp

Chirp is Google's family of state-of-the-art Universal Speech Models trained on 12 million hours of speech to enable automatic speech recognition (ASR) for 100+ languages. The models can perform ASR on under-resourced languages, such as Amharic, Cebuano, and Assamese, in addition to widely spoken languages like English and Mandarin. Chirp is able to cover such a wide variety of languages by leveraging self-supervised learning on unlabeled multilingual dataset with fine-tuning on a smaller set of labeled data. Chirp is now available in the Google Cloud Speech-to-Text API, allowing users to perform inference on the model through a simple interface. You can get started with Chirp here.


MusicLM

At I/O, we launched MusicLM, a text-to-music model that generates 20 seconds of music from a text prompt. You can try it yourself on AI Test Kitchen, or see it featured during the I/O preshow, where electronic musician and composer Dan Deacon used MusicLM in his performance.

MusicLM, which consists of models powered by AudioLM and MuLAN, can make music (from text, humming, images or video) and musical accompaniments to singing. AudioLM generates high quality audio with long-term consistency. It maps audio to a sequence of discrete tokens and casts audio generation as a language modeling task. To synthesize longer outputs efficiently, it used a novel approach we’ve developed called SoundStorm.


Universal Translator dubbing

Our dubbing efforts leverage dozens of ML technologies to translate the full expressive range of video content, making videos accessible to audiences across the world. These technologies have been used to dub videos across a variety of products and content types, including educational content, advertising campaigns, and creator content, with more to come. We use deep learning technology to achieve voice preservation and lip matching and enable high-quality video translation. We’ve built this product to include human review for quality, safety checks to help prevent misuse, and we make it accessible only to authorized partners.


AI for global societal good

We are applying our AI technologies to solve some of the biggest global challenges, like mitigating climate change, adapting to a warming planet and improving human health and wellbeing. For example:

  • Traffic engineers use our Green Light recommendations to reduce stop-and-go traffic at intersections and improve the flow of traffic in cities from Bangalore to Rio de Janeiro and Hamburg. Green Light models each intersection, analyzing traffic patterns to develop recommendations that make traffic lights more efficient — for example, by better synchronizing timing between adjacent lights, or adjusting the “green time” for a given street and direction.
  • We’ve also expanded global coverage on the Flood Hub to 80 countries, as part of our efforts to predict riverine floods and alert people who are about to be impacted before disaster strikes. Our flood forecasting efforts rely on hydrological models informed by satellite observations, weather forecasts and in-situ measurements.

Technologies for inclusive and fair ML applications

With our continued investment in AI technologies, we are emphasizing responsible AI development with the goal of making our models and tools useful and impactful while also ensuring fairness, safety and alignment with our AI Principles. Some of these efforts were highlighted at I/O, including:

  • The release of the Monk Skin Tone Examples (MST-E) Dataset to help practitioners gain a deeper understanding of the MST scale and train human annotators for more consistent, inclusive, and meaningful skin tone annotations. You can read more about this and other developments on our website. This is an advancement on the open source release of the Monk Skin Tone (MST) Scale we launched last year to enable developers to build products that are more inclusive and that better represent their diverse users.
  • A new Kaggle competition (open until August 10th) in which the ML community is tasked with creating a model that can quickly and accurately identify American Sign Language (ASL) fingerspelling — where each letter of a word is spelled out in ASL rapidly using a single hand, rather than using the specific signs for entire words — and translate it into written text. Learn more about the fingerspelling Kaggle competition, which features a song from Sean Forbes, a deaf musician and rapper. We also showcased at I/O the winning algorithm from the prior year’s competition powers PopSign, an ASL learning app for parents of deaf or hard of hearing children created by Georgia Tech and Rochester Institute of Technology (RIT).

Building the future of AI together

It’s inspiring to be part of a community of so many talented individuals who are leading the way in developing state-of-the-art technologies, responsible AI approaches and exciting user experiences. We are in the midst of a period of incredible and transformative change for AI. Stay tuned for more updates about the ways in which the Google Research community is boldly exploring the frontiers of these technologies and using them responsibly to benefit people’s lives around the world. We hope you're as excited as we are about the future of AI technologies and we invite you to engage with our teams through the references, sites and tools that we’ve highlighted here.


Source: Google AI Blog


Larger language models do in-context learning differently

There have recently been tremendous advances in language models, partly because they can perform tasks with strong performance via in-context learning (ICL), a process whereby models are prompted with a few examples of input-label pairs before performing the task on an unseen evaluation example. In general, models’ success at in-context learning is enabled by:

  • Their use of semantic prior knowledge from pre-training to predict labels while following the format of in-context examples (e.g., seeing examples of movie reviews with “positive sentiment” and “negative sentiment” as labels and performing sentiment analysis using prior knowledge).
  • Learning the input-label mappings in context from the presented examples (e.g., finding a pattern that positive reviews should be mapped to one label, and negative reviews should be mapped to a different label).

In “Larger language models do in-context learning differently”, we aim to learn about how these two factors (semantic priors and input-label mappings) interact with each other in ICL settings, especially with respect to the scale of the language model that’s used. We investigate two settings to study these two factors — ICL with flipped labels (flipped-label ICL) and ICL with semantically-unrelated labels (SUL-ICL). In flipped-label ICL, labels of in-context examples are flipped so that semantic priors and input-label mappings disagree with each other. In SUL-ICL, labels of in-context examples are replaced with words that are semantically unrelated to the task presented in-context. We found that overriding prior knowledge is an emergent ability of model scale, as is the ability to learn in-context with semantically-unrelated labels. We also found that instruction tuning strengthens the use of prior knowledge more than it increases the capacity to learn input-label mappings.

An overview of flipped-label ICL and semantically-unrelated label ICL (SUL-ICL), compared with regular ICL, for a sentiment analysis task. Flipped-label ICL uses flipped labels, forcing the model to override semantic priors in order to follow the in-context examples. SUL-ICL uses labels that are not semantically related to the task, which means that models must learn input-label mappings in order to perform the task because they can no longer rely on the semantics of natural language labels.

Experiment design

For a diverse dataset mixture, we experiment on seven natural language processing (NLP) tasks that have been widely used: sentiment analysis, subjective/objective classification, question classification, duplicated-question recognition, entailment recognition, financial sentiment analysis, and hate speech detection. We test five language model families, PaLM, Flan-PaLM, GPT-3, InstructGPT, and Codex.


Flipped labels

In this experiment, labels of in-context examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override their priors. In this setting, models that are able to override prior knowledge and learn input-label mappings in-context should experience a decrease in performance (since ground-truth evaluation labels are not flipped).

The ability to override semantic priors when presented with flipped in-context example labels emerges with model scale. Smaller models cannot flip predictions to follow flipped labels (performance only decreases slightly), while larger models can do so (performance decreases to well below 50%).

We found that when no labels are flipped, larger models have better performance than smaller models (as expected). But when we flip more and more labels, the performance of small models stays relatively flat, but large models experience large performance drops to well-below random guessing (e.g., 90% → 22.5% for code-davinci-002).

These results indicate that large models can override prior knowledge from pre-training when contradicting input-label mappings are presented in-context. Small models can’t do this, making this ability an emergent phenomena of model scale.


Semantically-unrelated labels

In this experiment, we replace labels with semantically-irrelevant ones (e.g., for sentiment analysis, we use “foo/bar” instead of “negative/positive”), which means that the model can only perform ICL by learning from input-label mappings. If a model mostly relies on prior knowledge for ICL, then its performance should decrease after this change since it will no longer be able to use semantic meanings of labels to make predictions. A model that can learn input–label mappings in-context, on the other hand, would be able to learn these semantically-unrelated mappings and should not experience a major drop in performance.

Small models rely more on semantic priors than large models do, as indicated by the greater decrease in performance for small models than for large models when using semantically-unrelated labels (i.e., targets) instead of natural language labels. For each plot, models are shown in order of increasing model size (e.g., for GPT-3 models, a is smaller than b, which is smaller than c).

Indeed, we see that using semantically-unrelated labels results in a greater performance drop for small models. This suggests that smaller models primarily rely on their semantic priors for ICL rather than learning from the presented input-label mappings. Large models, on the other hand, have the ability to learn input-label mappings in-context when the semantic nature of labels is removed.

We also find that including more in-context examples (i.e., exemplars) results in a greater performance improvement for large models than it does for small models, indicating that large models are better at learning from in-context examples than small models are.

In the SUL-ICL setup, larger models benefit more from additional examples than smaller models do.

Instruction tuning

Instruction tuning is a popular technique for improving model performance, which involves tuning models on various NLP tasks that are phrased as instructions (e.g., “Question: What is the sentiment of the following sentence, ‘This movie is great.’ Answer: Positive”). Since the process uses natural language labels, however, an open question is whether it improves the ability to learn input-label mappings or whether it strengthens the ability to recognize and apply semantic prior knowledge. Both of these would lead to an improvement in performance on standard ICL tasks, so it’s unclear which of these occur.

We study this question by running the same two setups as before, only this time we focus on comparing standard language models (specifically, PaLM) with their instruction-tuned variants (Flan-PaLM).

First, we find that Flan-PaLM is better than PaLM when we use semantically-unrelated labels. This effect is very prominent in small models, as Flan-PaLM-8B outperforms PaLM-8B by 9.6% and almost catches up to PaLM-62B. This trend suggests that instruction tuning strengthens the ability to learn input-label mappings, which isn’t particularly surprising.

Instruction-tuned language models are better at learning input–label mappings than pre-training–only language models are.

More interestingly, we saw that Flan-PaLM is actually worse than PaLM at following flipped labels, meaning that the instruction tuned models were unable to override their prior knowledge (Flan-PaLM models don’t reach below random guessing with 100% flipped labels, but PaLM models without instruction tuning can reach 31% accuracy in the same setting). These results indicate that instruction tuning must increase the extent to which models rely on semantic priors when they’re available.

Instruction-tuned models are worse than pre-training–only models at learning to override semantic priors when presented with flipped labels in-context.

Combined with the previous result, we conclude that although instruction tuning improves the ability to learn input-label mappings, it strengthens the usage of semantic prior knowledge more.


Conclusion

We examined the extent to which language models learn in-context by utilizing prior knowledge learned during pre-training versus input-label mappings presented in-context.

We first showed that large language models can learn to override prior knowledge when presented with enough flipped labels, and that this ability emerges with model scale. We then found that successfully doing ICL using semantically-unrelated labels is another emergent ability of model scale. Finally, we analyzed instruction-tuned language models and saw that instruction tuning improves the capacity to learn input-label mappings but also strengthens the use of semantic prior knowledge even more.


Future work

These results underscore how the ICL behavior of language models can change depending on their scale, and that larger language models have an emergent ability to map inputs to many types of labels, a form of reasoning in which input-label mappings can potentially be learned for arbitrary symbols. Future research could help provide insights on why these phenomena occur with respect to model scale.


Acknowledgements

This work was conducted by Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. We would like to thank Sewon Min and our fellow collaborators at Google Research for their advice and helpful discussions.

Source: Google AI Blog


Introducing MediaPipe Solutions for On-Device Machine Learning

Posted by Paul Ruiz, Developer Relations Engineer & Kris Tonthat, Technical Writer

MediaPipe Solutions is available in preview today

This week at Google I/O 2023, we introduced MediaPipe Solutions, a new collection of on-device machine learning tools to simplify the developer process. This is made up of MediaPipe Studio, MediaPipe Tasks, and MediaPipe Model Maker. These tools provide no-code to low-code solutions to common on-device machine learning tasks, such as audio classification, segmentation, and text embedding, for mobile, web, desktop, and IoT developers.

image showing a 4 x 2 grid of solutions via MediaPipe Tools

New solutions

In December 2022, we launched the MediaPipe preview with five tasks: gesture recognition, hand landmarker, image classification, object detection, and text classification. Today we’re happy to announce that we have launched an additional nine tasks for Google I/O, with many more to come. Some of these new tasks include:

  • Face Landmarker, which detects facial landmarks and blendshapes to determine human facial expressions, such as smiling, raised eyebrows, and blinking. Additionally, this task is useful for applying effects to a face in three dimensions that matches the user’s actions.
moving image showing a human with a racoon face filter tracking a range of accurate movements and facial expressions
  • Image Segmenter, which lets you divide images into regions based on predefined categories. You can use this functionality to identify humans or multiple objects, then apply visual effects like background blurring.
moving image of two panels showing a person on the left and how the image of that person is segmented into rergions on the right
  • Interactive Segmenter, which takes the region of interest in an image, estimates the boundaries of an object at that location, and returns the segmentation for the object as image data.
moving image of a dog  moving around as the interactive segmenter identifies boundaries and segments

Coming soon

  • Image Generator, which enables developers to apply a diffusion model within their apps to create visual content.
moving image showing the rendering of an image of a puppy among an array of white and pink wildflowers in MediaPipe from a prompt that reads, 'a photo realistic and high resolution image of a cute puppy with surrounding flowers'
  • Face Stylizer, which lets you take an existing style reference and apply it to a user’s face.
image of a 4 x 3 grid showing varying iterations of a known female and male face acrosss four different art styles

MediaPipe Studio

Our first MediaPipe tool lets you view and test MediaPipe-compatible models on the web, rather than having to create your own custom testing applications. You can even use MediaPipe Studio in preview right now to try out the new tasks mentioned here, and all the extras, by visiting the MediaPipe Studio page.

In addition, we have plans to expand MediaPipe Studio to provide a no-code model training solution so you can create brand new models without a lot of overhead.

moving image showing Gesture Recognition in MediaPipe Studio

MediaPipe Tasks

MediaPipe Tasks simplifies on-device ML deployment for web, mobile, IoT, and desktop developers with low-code libraries. You can easily integrate on-device machine learning solutions, like the examples above, into your applications in a few lines of code without having to learn all the implementation details behind those solutions. These currently include tools for three categories: vision, audio, and text.

To give you a better idea of how to use MediaPipe Tasks, let’s take a look at an Android app that performs gesture recognition.

moving image showing Gesture Recognition across a series of hand gestures in MediaPipe Studio including closed fist, victory, thumb up, thumb down, open palm and i love you.

The following code will create a GestureRecognizer object using a built-in machine learning model, then that object can be used repeatedly to return a list of recognition results based on an input image:

// STEP 1: Create a gesture recognizer val baseOptions = BaseOptions.builder() .setModelAssetPath("gesture_recognizer.task") .build() val gestureRecognizerOptions = GestureRecognizerOptions.builder() .setBaseOptions(baseOptions) .build() val gestureRecognizer = GestureRecognizer.createFromOptions( context, gestureRecognizerOptions) // STEP 2: Prepare the image val mpImage = BitmapImageBuilder(bitmap).build() // STEP 3: Run inference val result = gestureRecognizer.recognize(mpImage)

As you can see, with just a few lines of code you can implement seemingly complex features in your applications. Combined with other Android features, like CameraX, you can provide delightful experiences for your users.

Along with simplicity, one of the other major advantages to using MediaPipe Tasks is that your code will look similar across multiple platforms, regardless of the task you’re using. This will help you develop even faster as you can reuse the same logic for each application.


MediaPipe Model Maker

While being able to recognize and use gestures in your apps is great, what if you have a situation where you need to recognize custom gestures outside of the ones provided by the built-in model? That’s where MediaPipe Model Maker comes in. With Model Maker, you can retrain the built-in model on a dataset with only a few hundred examples of new hand gestures, and quickly create a brand new model specific to your needs. For example, with just a few lines of code you can customize a model to play Rock, Paper, Scissors.

image showing 5 examples of the 'paper' hand gesture in the top row and 5 exaples of the 'rock' hand gesture on the bottom row

from mediapipe_model_maker import gesture_recognizer # STEP 1: Load the dataset. data = gesture_recognizer.Dataset.from_folder(dirname='images') train_data, validation_data = data.split(0.8) # STEP 2: Train the custom model. model = gesture_recognizer.GestureRecognizer.create( train_data=train_data, validation_data=validation_data, hparams=gesture_recognizer.HParams(export_dir=export_dir) ) # STEP 3: Evaluate using unseen data. metric = model.evaluate(test_data) # STEP 4: Export as a model asset bundle. model.export_model(model_name='rock_paper_scissor.task')

After retraining your model, you can use it in your apps with MediaPipe Tasks for an even more versatile experience.

moving image showing Gesture Recognition in MediaPipe Studio recognizing rock, paper, and scissiors hand gestures

Getting started

To learn more, watch our I/O 2023 sessions: Easy on-device ML with MediaPipe, Supercharge your web app with machine learning and MediaPipe, and What's new in machine learning, and check out the official documentation over on developers.google.com/mediapipe.


What’s next?

We will continue to improve and provide new features for MediaPipe Solutions, including new MediaPipe Tasks and no-code training through MediaPipe Studio. You can also keep up to date by joining the MediaPipe Solutions announcement group, where we send out announcements as new features are available.

We look forward to all the exciting things you make, so be sure to share them with @googledevs and your developer communities!

New Resources to Build with Google AI

Posted by Jaimie Hwang, ML Product Marketing and Danu Mbanga, ML Product Management

Today's development environment is only getting more complex and as machine learning becomes increasingly integrated with mobile, web, and cloud platforms, developers are looking for clear pathways to cut through this growing complexity. To help developers of all levels, we've unified our machine learning products, tools, and guidance on Google AI, so you can spend less time searching, and more time building AI solutions.

Whether you are looking for a pre-trained dataset, the latest in Generative AI, or tracking the latest announcements from Google I/O, you’ll be able to find it on ai.google/build/machinelearning. It’s a single destination for building AI solutions, no matter where you are in your machine learning workflow, or where your models are deployed.

cropped screenshot of Google AI landing page

Toolkits: end-to-end guidance

We're taking it one step further with new toolkits that provide you end-to-end guidance to build the latest AI solutions. These toolkits combine many of our products, many of which are open source, alongside a walkthrough so you can learn best practices and implement code. Check out how to build a text classifier using Keras or how you can take a large language model and shrink it to run on Android using Keras and TensorFlow Lite. And we are just getting started. These are our first two toolkits but have many more to come soon – so stay tuned!

moving image of finding a toolkit on Google AI to build an LLM on Android and Keras and TensorFlow Lite
Toolkit to build a LLM on Android with Keras and TensorFlow Lite

Whether you're just starting out with machine learning or you're an experienced developer looking for the latest tools and resources, Google AI has the resources you need to build AI solutions. Visit ai.google/build/machinelearning today to learn more.