Tag Archives: machine learning

Differential Privacy Accounting by Connecting the Dots

Differential privacy (DP) is an approach that enables data analytics and machine learning (ML) with a mathematical guarantee on the privacy of user data. DP quantifies the “privacy cost” of an algorithm, i.e., the level of guarantee that the algorithm’s output distribution for a given dataset will not change significantly if a single user’s data is added to or removed from it. The algorithm is characterized by two parameters, ε and δ, where smaller values of both indicate “more private”. There is a natural tension between the privacy budget (ε, δ) and the utility of the algorithm: a smaller privacy budget requires the output to be more “noisy”, often leading to less utility. Thus, a fundamental goal of DP is to attain as much utility as possible for a desired privacy budget.

A key property of DP that often plays a central role in understanding privacy costs is that of composition, which reflects the net privacy cost of a combination of DP algorithms, viewed together as a single algorithm. A notable example is the differentially-private stochastic gradient descent (DP-SGD) algorithm. This algorithm trains ML models over multiple iterations — each of which is differentially private — and therefore requires an application of the composition property of DP. A basic composition theorem in DP says that the privacy cost of a collection of algorithms is, at most, the sum of the privacy cost of each. However, in many cases, this can be a gross overestimate, and several improved composition theorems provide better estimates of the privacy cost of composition.

In 2019, we released an open-source library (on GitHub) to enable developers to use analytic techniques based on DP. Today, we announce the addition to this library of Connect-the-Dots, a new privacy accounting algorithm based on a novel approach for discretizing privacy loss distributions that is a useful tool for understanding the privacy cost of composition. This algorithm is based on the paper “Connect the Dots: Tighter Discrete Approximations of Privacy Loss Distributions”, presented at PETS 2022. The main novelty of this accounting algorithm is that it uses an indirect approach to construct more accurate discretizations of privacy loss distributions. We find that Connect-the-Dots provides significant gains over other privacy accounting methods in literature in terms of accuracy and running time. This algorithm was also recently applied for the privacy accounting of DP-SGD in training Ads prediction models.


Differential Privacy and Privacy Loss Distributions

A randomized algorithm is said to satisfy DP guarantees if its output “does not depend significantly” on any one entry in its training dataset, quantified mathematically with parameters (ε, δ). For example, consider the motivating example of DP-SGD. When trained with (non-private) SGD, a neural network could, in principle, be encoding the entire training dataset within its weights, thereby allowing one to reconstruct some training examples from a trained model. On the other hand, when trained with DP-SGD, we have a formal guarantee that if one were able to reconstruct a training example with non-trivial probability then one would also be able to reconstruct the same example even if it was not included in the training dataset.

The hockey stick divergence, parameterized by ε, is a measure of distance between two probability distributions, as illustrated in the figure below. The privacy cost of most DP algorithms is dictated by the hockey stick divergence between two associated probability distributions P and Q. The algorithm satisfies DP with parameters (ε, δ), if the value of the hockey stick divergence for ε between P and Q is at most δ. The hockey stick divergence between (P, Q), denoted δP||Q(ε) is in turn completely characterized by it associated privacy loss distribution, denoted by PLDP||Q.

Illustration of hockey stick divergence δP||Q(ε) between distributions P and Q (left), which corresponds to the probability mass of P that is above eεQ, where eεQ is an eε scaling of the probability mass of Q (right).

The main advantage of dealing with PLDs is that compositions of algorithms correspond to the convolution of the corresponding PLDs. Exploiting this fact, prior work has designed efficient algorithms to compute the PLD corresponding to the composition of individual algorithms by simply performing convolution of the individual PLDs using the fast Fourier transform algorithm.

However, one challenge when dealing with many PLDs is that they often are continuous distributions, which make the convolution operations intractable in practice. Thus, researchers often apply various discretization approaches to approximate the PLDs using equally spaced points. For example, the basic version of the Privacy Buckets algorithm assigns the probability mass of the interval between two discretization points entirely to the higher end of the interval.

Illustration of discretization by rounding up probability masses. Here a continuous PLD (in blue) is discretized to a discrete PLD (in red), by rounding up the probability mass between consecutive points.

Connect-the-Dots : A New Algorithm

Our new Connect-the-Dots algorithm provides a better way to discretize PLDs towards the goal of estimating hockey stick divergences. This approach works indirectly by first discretizing the hockey stick divergence function and then mapping it back to a discrete PLD supported on equally spaced points.

Illustration of high-level steps in the Connect-the-Dots algorithm.

This approach relies on the notion of a “dominating PLD”, namely, PLDP’||Q’ dominates over PLDP||Q if the hockey stick divergence of the former is greater or equal to the hockey stick divergence of the latter for all values of ε. The key property of dominating PLDs is that they remain dominating after compositions. Thus for purposes of privacy accounting, it suffices to work with a dominating PLD, which gives us an upper bound on the exact privacy cost.

Our main insight behind the Connect-the-Dots algorithm is a characterization of discrete PLD, namely that a PLD is supported on a given finite set of ε values if and only if the corresponding hockey stick divergence as a function of eε is linear between consecutive eε values. This allows us to discretize the hockey stick divergence by simply connecting the dots to get a piecewise linear function that precisely equals the hockey stick divergence function at the given eε values. See a more detailed explanation of the algorithm.

Comparison of the discretizations of hockey stick divergence by Connect-the-Dots vs Privacy Buckets.

Experimental Evaluation

The DP-SGD algorithm involves a noise multiplier parameter, which controls the magnitude of noise added in each gradient step, and a sampling probability, which controls how many examples are included in each mini-batch. We compare Connect-the-Dots against the algorithms listed below on the task of privacy accounting DP-SGD with a noise multiplier = 0.5, sampling probability = 0.2 x 10-4 and δ = 10-8.

We plot the value of the ε computed by each of the algorithms against the number of composition steps, and additionally, we plot the running time of the implementations. As shown in the plots below, privacy accounting using Renyi DP provides a loose estimate of the privacy loss. However, when comparing the approaches using PLD, we find that in this example, the implementation of Connect-the-Dots achieves a tighter estimate of the privacy loss, with a running time that is 5x faster than the Microsoft PRV Accountant and >200x faster than the previous approach of Privacy Buckets in the Google-DP library.

Left: Upper bounds on the privacy parameter ε for varying number of steps of DP-SGD, as returned by different algorithms (for fixed δ = 10-8). Right: Running time of the different algorithms.

Conclusion & Future Directions

This work proposes Connect-the-Dots, a new algorithm for computing optimal privacy parameters for compositions of differentially private algorithms. When evaluated on the DP-SGD task, we find that this algorithm gives tighter estimates on the privacy loss with a significantly faster running time.

So far, the library only supports the pessimistic estimate version of Connect-the-Dots algorithm, which provides an upper bound on the privacy loss of DP-algorithms. However, the paper also introduces a variant of the algorithm that provides an “optimistic” estimate of the PLD, which can be used to derive lower bounds on the privacy cost of DP-algorithms (provided those admit a “worst case” PLD). Currently, the library does support optimistic estimates as given by the Privacy Buckets algorithm, and we hope to incorporate the Connect-the-Dots version as well.


Acknowledgements

This work was carried out in collaboration with Vadym Doroshenko, Badih Ghazi, Ravi Kumar. We thank Galen Andrew, Stan Bashtavenko, Steve Chien, Christoph Dibak, Miguel Guevara, Peter Kairouz, Sasha Kulankhina, Stefan Mellem, Jodi Spacek, Yurii Sushko and Andreas Terzis for their help.

Source: Google AI Blog


Private Ads Prediction with DP-SGD

Ad technology providers widely use machine learning (ML) models to predict and present users with the most relevant ads, and to measure the effectiveness of those ads. With increasing focus on online privacy, there’s an opportunity to identify ML algorithms that have better privacy-utility trade-offs. Differential privacy (DP) has emerged as a popular framework for developing ML algorithms responsibly with provable privacy guarantees. It has been extensively studied in the privacy literature, deployed in industrial applications and employed by the U.S. Census. Intuitively, the DP framework enables ML models to learn population-wide properties, while protecting user-level information.

When training ML models, algorithms take a dataset as their input and produce a trained model as their output. Stochastic gradient descent (SGD) is a commonly used non-private training algorithm that computes the average gradient from a random subset of examples (called a mini-batch), and uses it to indicate the direction towards which the model should move to fit that mini-batch. The most widely used DP training algorithm in deep learning is an extension of SGD called DP stochastic gradient descent (DP-SGD).

DP-SGD includes two additional steps: 1) before averaging, the gradient of each example is norm-clipped if the L2 norm of the gradient exceeds a predefined threshold; and 2) Gaussian noise is added to the average gradient before updating the model. DP-SGD can be adapted to any existing deep learning pipeline with minimal changes by replacing the optimizer, such as SGD or Adam, with their DP variants. However, applying DP-SGD in practice could lead to a significant loss of model utility (i.e., accuracy) with large computational overheads. As a result, various research attempts to apply DP-SGD training on more practical, large-scale deep learning problems. Recent studies have also shown promising DP training results on computer vision and natural language processing problems.

In “Private Ad Modeling with DP-SGD”, we present a systematic study of DP-SGD training on ads modeling problems, which pose unique challenges compared to vision and language tasks. Ads datasets often have a high imbalance between data classes, and consist of categorical features with large numbers of unique values, leading to models that have large embedding layers and highly sparse gradient updates. With this study, we demonstrate that DP-SGD allows ad prediction models to be trained privately with a much smaller utility gap than previously expected, even in the high privacy regime. Moreover, we demonstrate that with proper implementation, the computation and memory overhead of DP-SGD training can be significantly reduced.


Evaluation

We evaluate private training using three ads prediction tasks: (1) predicting the click-through rate (pCTR) for an ad, (2) predicting the conversion rate (pCVR) for an ad after a click, and 3) predicting the expected number of conversions (pConvs) after an ad click. For pCTR, we use the Criteo dataset, which is a widely used public benchmark for pCTR models. We evaluate pCVR and pConvs using internal Google datasets. pCTR and pCVR are binary classification problems trained with the binary cross entropy loss and we report the test AUC loss (i.e., 1 - AUC). pConvs is a regression problem trained with Poisson log loss (PLL) and we report the test PLL.

For each task, we evaluate the privacy-utility trade-off of DP-SGD by the relative increase in the loss of privately trained models under various privacy budgets (i.e., privacy loss). The privacy budget is characterized by a scalar ε, where a lower ε indicates higher privacy. To measure the utility gap between private and non-private training, we compute the relative increase in loss compared to the non-private model (equivalent to ε = ∞). Our main observation is that on all three common ad prediction tasks, the relative loss increase could be made much smaller than previously expected, even for very high privacy (e.g., ε <= 1) regimes.

DP-SGD results on three ads prediction tasks. The relative increase in loss is computed against the non-private baseline (i.e., ε = ∞) model of each task.


Improved Privacy Accounting

Privacy accounting estimates the privacy budget (ε) for a DP-SGD trained model, given the Gaussian noise multiplier and other training hyperparameters. Rényi Differential Privacy (RDP) accounting has been the most widely used approach in DP-SGD since the original paper. We explore the latest advances in accounting methods to provide tighter estimates. Specifically, we use connect-the-dots for accounting based on the privacy loss distribution (PLD). The following figure compares this improved accounting with the classical RDP accounting and demonstrates that PLD accounting improves the AUC on the pCTR dataset for all privacy budgets (ε).



Large Batch Training

Batch size is a hyperparameter that affects different aspects of DP-SGD training. For instance, increasing the batch size could reduce the amount of noise added during training under the same privacy guarantee, which reduces the training variance. The batch size also affects the privacy guarantee via other parameters, such as the subsampling probability and training steps. There is no simple formula to quantify the impact of batch sizes. However, the relationship between batch size and the noise scale is quantified using privacy accounting, which calculates the required noise scale (measured in terms of the standard deviation) under a given privacy budget (ε) when using a particular batch size. The figure below plots such relations in two different scenarios. The first scenario uses fixed epochs, where we fix the number of passes over the training dataset. In this case, the number of training steps is reduced as the batch size increases, which could result in undertraining the model. The second, more straightforward scenario uses fixed training steps (fixed steps).

The relationship between batch size and noise scales. Privacy accounting requires a noise standard deviation, which decreases as the batch size increases, to meet a given privacy budget. As a result, by using much larger batch sizes than the non-private baseline (indicated by the vertical dotted line), the scale of Gaussian noise added by DP-SGD can be significantly reduced.

In addition to allowing a smaller noise scale, larger batch sizes also allow us to use a larger threshold of norm clipping each per-example gradient as required by DP-SGD. Since the norm clipping step introduces biases in the average gradient estimation, this relaxation mitigates such biases. The table below compares the results on the Criteo dataset for pCTR with a standard batch size (1,024 examples) and a large batch size (16,384 examples), combined with large clipping and increased training epochs. We observe that large batch training significantly improves the model utility. Note that large clipping is only possible with large batch sizes. Large batch training was also found to be essential for DP-SGD training in Language and Computer Vision domains.

The effects of large batch training. For three different privacy budgets (ε), we observe that when training the pCTR models with large batch size (16,384), the AUC is significantly higher than with regular batch size (1,024).


Fast per-example Gradient Norm Computation

The per-example gradient norm calculation used for DP-SGD often causes computational and memory overhead. This calculation removes the efficiency of standard backpropagation on accelerators (like GPUs) that compute the average gradient for a batch without materializing each per-example gradient. However, for certain neural network layer types, an efficient gradient norm computation algorithm allows the per-example gradient norm to be computed without the need to materialize the per-example gradient vector. We also note that this algorithm can efficiently handle neural network models that rely on embedding layers and fully connected layers for solving ads prediction problems. Combining the two observations, we use this algorithm to implement a fast version of the DP-SGD algorithm. We show that Fast-DP-SGD on pCTR can handle a similar number of training examples and the same maximum batch size on a single GPU core as a non-private baseline.

The computation efficiency of our fast implementation (Fast-DP-SGD) on pCTR.

Compared to the non-private baseline, the training throughput is similar, except with very small batch sizes. We also compare it with an implementation utilizing the JAX Just-in-Time (JIT) compilation, which is already much faster than vanilla DP-SGD implementations. Our implementation is not only faster, but it is also more memory efficient. The JIT-based implementation cannot handle batch sizes larger than 64, while our implementation can handle batch sizes up to 500,000. Memory efficiency is important for enabling large-batch training, which was shown above to be important for improving utility.


Conclusion

We have shown that it is possible to train private ads prediction models using DP-SGD that have a small utility gap compared to non-private baselines, with minimum overhead for both computation and memory consumption. We believe there is room for even further reduction of the utility gap through techniques such as pre-training. Please see the paper for full details of the experiments.


Acknowledgements

This work was carried out in collaboration with Carson Denison, Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, and Avinash Varadarajan. We thank Silvano Bonacina and Samuel Ieong for many useful discussions.

Source: Google AI Blog


Talking to Robots in Real Time

A grand vision in robot learning, going back to the SHRDLU experiments in the late 1960s, is that of helpful robots that inhabit human spaces and follow a wide variety of natural language commands. Over the last few years, there have been significant advances in the application of machine learning (ML) for instruction following, both in simulation and in real world systems. Recent Palm-SayCan work has produced robots that leverage language models to plan long-horizon behaviors and reason about abstract goals. Code as Policies has shown that code-generating language models combined with pre-trained perception systems can produce language conditioned policies for zero shot robot manipulation. Despite this progress, an important missing property of current "language in, actions out" robot learning systems is real time interaction with humans.

Ideally, robots of the future would react in real time to any relevant task a user could describe in natural language. Particularly in open human environments, it may be important for end users to customize robot behavior as it is happening, offering quick corrections ("stop, move your arm up a bit") or specifying constraints ("nudge that slowly to the right"). Furthermore, real-time language could make it easier for people and robots to collaborate on complex, long-horizon tasks, with people iteratively and interactively guiding robot manipulation with occasional language feedback.

The challenges of open-vocabulary language following. To be successfully guided through a long horizon task like "put all the blocks in a vertical line", a robot must respond precisely to a wide variety of commands, including small corrective behaviors like "nudge the red circle right a bit".

However, getting robots to follow open vocabulary language poses a significant challenge from a ML perspective. This is a setting with an inherently large number of tasks, including many small corrective behaviors. Existing multitask learning setups make use of curated imitation learning datasets or complex reinforcement learning (RL) reward functions to drive the learning of each task, and this significant per-task effort is difficult to scale beyond a small predefined set. Thus, a critical open question in the open vocabulary setting is: how can we scale the collection of robot data to include not dozens, but hundreds of thousands of behaviors in an environment, and how can we connect all these behaviors to the natural language an end user might actually provide?

In Interactive Language, we present a large scale imitation learning framework for producing real-time, open vocabulary language-conditionable robots. After training with our approach, we find that an individual policy is capable of addressing over 87,000 unique instructions (an order of magnitude larger than prior works), with an estimated average success rate of 93.5%. We are also excited to announce the release of Language-Table, the largest available language-annotated robot dataset, which we hope will drive further research focused on real-time language-controllable robots.




Guiding robots with real time language.

Real Time Language-Controllable Robots

Key to our approach is a scalable recipe for creating large, diverse language-conditioned robot demonstration datasets. Unlike prior setups that define all the skills up front and then collect curated demonstrations for each skill, we continuously collect data across multiple robots without scene resets or any low-level skill segmentation. All data, including failure data (e.g., knocking blocks off a table), goes through a hindsight language relabeling process to be paired with text. Here, annotators watch long robot videos to identify as many behaviors as possible, marking when each began and ended, and use freeform natural language to describe each segment. Importantly, in contrast to prior instruction following setups, all skills used for training emerge bottom up from the data itself rather than being determined upfront by researchers.

Our learning approach and architecture are intentionally straightforward. Our robot policy is a cross-attention transformer, mapping 5hz video and text to 5hz robot actions, using a standard supervised learning behavioral cloning objective with no auxiliary losses. At test time, new spoken commands can be sent to the policy (via speech-to-text) at any time up to 5hz.

Interactive Language: an imitation learning system for producing real time language-controllable robots.

Open Source Release: Language-Table Dataset and Benchmark

This annotation process allowed us to collect the Language-Table dataset, which contains over 440k real and 180k simulated demonstrations of the robot performing a language command, along with the sequence of actions the robot took during the demonstration. This is the largest language-conditioned robot demonstration dataset of its kind, by an order of magnitude. Language-Table comes with a simulated imitation learning benchmark that we use to perform model selection, which can be used to evaluate new instruction following architectures or approaches.


Dataset # Trajectories (k)     # Unique (k)     Physical Actions     Real     Available
Episodic Demonstrations
BC-Z 25 0.1
SayCan 68 0.5
Playhouse 1,097 779
Hindsight Language Labeling
BLOCKS 30 n/a
LangLFP 10 n/a
LOREL 6 1.7
CALVIN 20 0.4
Language-Table (real + sim) 623 (442+181) 206 (127+79)

We compare Language-Table to existing robot datasets, highlighting proportions of simulated (red) or real (blue) robot data, the number of trajectories collected, and the number of unique language describable tasks.

Learned Real Time Language Behaviors

Examples of short horizon instructions the robot is capable of following, sampled randomly from the full set of over 87,000.

Short-Horizon Instruction Success
(87,000 more…)
push the blue triangle to the top left corner    80.0%
separate the red star and red circle 100.0%
nudge the yellow heart a bit right 80.0%
place the red star above the blue cube 90.0%
point your arm at the blue triangle 100.0%
push the group of blocks left a bit 100.0%
Average over 87k, CI 95% 93.5% +- 3.42%

95% Confidence interval (CI) on the average success of an individual Interactive Language policy over 87,000 unique natural language instructions.

We find that interesting new capabilities arise when robots are able to follow real time language. We show that users can walk robots through complex long-horizon sequences using only natural language to solve for goals that require multiple minutes of precise, coordinated control (e.g., "make a smiley face out of the blocks with green eyes" or "place all the blocks in a vertical line"). Because the robot is trained to follow open vocabulary language, we see it can react to a diverse set of verbal corrections (e.g., "nudge the red star slightly right") that might otherwise be difficult to enumerate up front.

Examples of long horizon goals reached under real time human language guidance.

Finally, we see that real time language allows for new modes of robot data collection. For example, a single human operator can control four robots simultaneously using only spoken language. This has the potential to scale up the collection of robot data in the future without requiring undivided human attention for each robot.

One operator controlling multiple robots at once with spoken language.

Conclusion

While currently limited to a tabletop with a fixed set of objects, Interactive Language shows initial evidence that large scale imitation learning can indeed produce real time interactable robots that follow freeform end user commands. We open source Language-Table, the largest language conditioned real-world robot demonstration dataset of its kind and an associated simulated benchmark, to spur progress in real time language control of physical robots. We believe the utility of this dataset may not only be limited to robot control, but may provide an interesting starting point for studying language- and action-conditioned video prediction, robot video-conditioned language modeling, or a host of other interesting active questions in the broader ML context. See our paper and GitHub page to learn more.


Acknowledgements

We would like to thank everyone who supported this research. This includes robot teleoperators: Alex Luong, Armando Reyes, Elio Prado, Eric Tran, Gavin Gonzalez, Jodexty Therlonge, Joel Magpantay, Rochelle Dela Cruz, Samuel Wan, Sarah Nguyen, Scott Lehrer, Norine Rosales, Tran Pham, Kyle Gajadhar, Reece Mungal, and Nikauleene Andrews; robot hardware support and teleoperation coordination: Sean Snyder, Spencer Goodrich, Cameron Burns, Jorge Aldaco, Jonathan Vela; data operations and infrastructure: Muqthar Mohammad, Mitta Kumar, Arnab Bose, Wayne Gramlich; and the many who helped provide language labeling of the datasets. We would also like to thank Pierre Sermanet, Debidatta Dwibedi, Michael Ryoo, Brian Ichter and Vincent Vanhoucke for their invaluable advice and support.

Source: Google AI Blog


Conversation Summaries in Google Chat

Information overload is a significant challenge for many organizations and individuals today. It can be overwhelming to keep up with incoming chat messages and documents that arrive at our inbox everyday. This has been exacerbated by the increase in virtual work and remains a challenge as many teams transition to a hybrid work environment with a mix of those working both virtually and in an office. One solution that can address information overload is summarization — for example, to help users improve their productivity and better manage so much information, we recently introduced auto-generated summaries in Google Docs.

Today, we are excited to introduce conversation summaries in Google Chat for messages in Spaces. When these summaries are available, a card with automatically generated summaries is shown as users enter Spaces with unread messages. The card includes a list of summaries for the different topics discussed in Spaces. This feature is enabled by our state-of-the-art abstractive summarization model, Pegasus, which generates useful and concise summaries for chat conversations, and is currently available to selected premium Google Workspace business customers.

Conversation summaries provide a helpful digest of conversations in Spaces, allowing users to quickly catch-up on unread messages and navigate to the most relevant threads.

Conversation Summarization Modeling

The goal of text summarization is to provide helpful and concise summaries for different types of text, such as documents, articles, or spoken conversations. A good summary covers the key points succinctly, and is fluent and grammatically correct. One approach to summarization is to extract key parts from the text and concatenate them together into a summary (i.e., extractive summarization). Another approach is to use natural language generation (NLG) techniques to summarize using novel words and phrases not necessarily present in the original text. This is referred to as abstractive summarization and is considered closer to how a person would generally summarize text. A main challenge with abstractive summarization, however, is that it sometimes struggles to generate accurate and grammatically correct summaries, especially in real world applications.


ForumSum Dataset

The majority of abstractive summarization datasets and research focuses on single-speaker text documents, like news and scientific articles, mainly due to the abundance of human-written summaries for such documents. On the other hand, datasets of human-written summaries for other types of text, like chat or multi-speaker conversations, are very limited.

To address this we created ForumSum, a diverse and high-quality conversation summarization dataset with human-written summaries. The conversations in the dataset are collected from a wide variety of public internet forums, and are cleaned up and filtered to ensure high quality and safe content (more details in the paper).

An example from the ForumSum dataset.

Each utterance in the conversation starts on a new line, contains an author name and a message text that is separated with a colon. Human annotators are then given detailed instructions to write a 1-3 sentence summary of the conversation. These instructions went through multiple iterations to ensure annotators wrote high quality summaries. We have collected summaries for over six thousand conversations, with an average of more than 6 speakers and 10 utterances per conversation. ForumSum provides quality training data for the conversation summarization problem: it has a variety of topics, number of speakers, and number of utterances commonly encountered in a chat application.


Conversation Summarization Model Design

As we have written previously, the Transformer is a popular model architecture for sequence-to-sequence tasks, like abstractive summarization, where the inputs are the document words and the outputs are the summary words. Pegasus combined transformers with self-supervised pre-training customized for abstractive summarization, making it a great model choice for conversation summarization. First, we fine-tune Pegasus on the ForumSum dataset where the input is the conversation words and the output is the summary words. Second, we use knowledge distillation to distill the Pegasus model into a hybrid architecture of a transformer encoder and a recurrent neural network (RNN) decoder. The resulting model has lower latency and memory footprint while maintaining similar quality as the Pegasus model.


Quality and User Experience

A good summary captures the essence of the conversation while being fluent and grammatically correct. Based on human evaluation and user feedback, we learned that the summarization model generates useful and accurate summaries most of the time. But occasionally the model generates low quality summaries. After looking into issues reported by users, we found that there are two main types of low quality summaries. The first one is misattribution, when the model confuses which person or entity said or performed a certain action. The second one is misrepresentation, when the model’s generated summary misrepresents or contradicts the chat conversation.

To address low quality summaries and improve the user experience, we have made progress in several areas:

  1. Improving ForumSum: While ForumSum provides a good representation of chat conversations, we noticed certain patterns and language styles in Google Chat conversations that differ from ForumSum, e.g., how users mention other users and the use of abbreviations and special symbols. After exploring examples reported by users, we concluded that these out-of-distribution language patterns contributed to low quality summaries. To address this, we first performed data formatting and clean-ups to reduce mismatches between chat and ForumSum conversations whenever possible. Second, we added more training data to ForumSum to better represent these style mismatches. Collectively, these changes resulted in reduction of low quality summaries.
  2. Controlled triggering: To make sure summaries bring the most value to our users, we first need to make sure that the chat conversation is worthy of summarization. For example, we found that there is less value in generating a summary when the user is actively engaged in a conversation and does not have many unread messages, or when the conversation is too short.
  3. Detecting low quality summaries: While the two methods above limited low quality and low value summaries, we still developed methods to detect and abstain from showing such summaries to the user when they are generated. These are a set of heuristics and models to measure the overall quality of summaries and whether they suffer from misattribution or misrepresentation issues.

Finally, while the hybrid model provided significant performance improvements, the latency to generate summaries was still noticeable to users when they opened Spaces with unread messages. To address this issue, we instead generate and update summaries whenever there is a new message sent, edited or deleted. Then summaries are cached ephemerally to ensure they surface smoothly when users open Spaces with unread messages.


Conclusion and Future Work

We are excited to apply state-of-the-art abstractive summarization models to help our Workspace users improve their productivity in Spaces. While this is great progress, we believe there are many opportunities to further improve the experience and the overall quality of summaries. Future directions we are exploring include better modeling and summarizing entangled conversations that include multiple topics, and developing metrics that better measure the factual consistency between chat conversations and summaries.


Acknowledgements

The authors would like to thank the many people across Google that contributed to this work: Ahmed Chowdhury, Alejandro Elizondo, Anmol Tukrel, Benjamin Lee, Chao Wang, Chris Carroll, Don Kim, Jackie Tsay, Jennifer Chou, Jesse Sliter, John Sipple, Kate Montgomery, Maalika Manoharan, Mahdis Mahdieh, Mia Chen, Misha Khalman, Peter Liu, Robert Diersing, Sarah Read, Winnie Yeung, Yao Zhao, and Yonghui Wu.

Source: Google AI Blog


Mixture-of-Experts with Expert Choice Routing

The capacity of a neural network to absorb information is limited by the number of its parameters, and as a consequence, finding more effective ways to increase model parameters has become a trend in deep learning research. Mixture-of-experts (MoE), a type of conditional computation where parts of the network are activated on a per-example basis, has been proposed as a way of dramatically increasing model capacity without a proportional increase in computation. In sparsely-activated variants of MoE models (e.g., Switch Transformer, GLaM, V-MoE), a subset of experts is selected on a per-token or per-example basis, thus creating sparsity in the network. Such models have demonstrated better scaling in multiple domains and better retention capability in a continual learning setting (e.g., Expert Gate). However, a poor expert routing strategy can cause certain experts to be under-trained, leading to an expert being under or over-specialized.

In “Mixture-of-Experts with Expert Choice Routing”, presented at NeurIPS 2022, we introduce a novel MoE routing algorithm called Expert Choice (EC). We discuss how this novel approach can achieve optimal load balancing in an MoE system while allowing heterogeneity in token-to-expert mapping. Compared to token-based routing and other routing methods in traditional MoE networks, EC demonstrates very strong training efficiency and downstream task scores. Our method resonates with one of the vision for Pathways, which is to enable heterogeneous mixture-of-experts via Pathways MPMD (multi program, multi data) support.


Overview of MoE Routing

MoE operates by adopting a number of experts, each as a sub-network, and activating only one or a few experts for each input token. A gating network must be chosen and optimized in order to route each token to the most suited expert(s). Depending on how tokens are mapped to experts, MoE can be sparse or dense. Sparse MoE only selects a subset of experts when routing each token, reducing computational cost as compared to a dense MoE. For example, recent work has implemented sparse routing via k-means clustering, linear assignment to maximize token-expert affinities, or hashing. Google also recently announced GLaM and V-MoE, both of which advance the state of the art in natural language processing and computer vision via sparsely gated MoE with top-k token routing, demonstrating better performance scaling with sparsely activated MoE layers. Many of these prior works used a token choice routing strategy in which the routing algorithm picks the best one or two experts for each token.

Token Choice Routing. The routing algorithm picks the top-1 or top-2 experts with highest affinity scores for each token. The affinity scores can be trained together with model parameters.

The independent token choice approach often leads to an imbalanced load of experts and under-utilization. In order to mitigate this, previous sparsely gated networks introduced additional auxiliary losses as regularization to prevent too many tokens being routed to a single expert, but the effectiveness was limited. As a result, token choice routings need to overprovision expert capacity by a significant margin (2x–8x of the calculated capacity) to avoid dropping tokens when there is a buffer overflow.

In addition to load imbalance, most prior works allocate a fixed number of experts to each token using a top-k function, regardless of the relative importance of different tokens. We argue that different tokens should be received by a variable number of experts, conditioned on token importance or difficulty.


Expert Choice Routing

To address the above issues, we propose a heterogeneous MoE that employs the expert choice routing method illustrated below. Instead of having tokens select the top-k experts, the experts with predetermined buffer capacity are assigned to the top-k tokens. This method guarantees even load balancing, allows a variable number of experts for each token, and achieves substantial gains in training efficiency and downstream performance. EC routing speeds up training convergence by over 2x in an 8B/64E (8 billion activated parameters, 64 experts) model, compared to the top-1 and top-2 gating counterparts in Switch Transformer, GShard, and GLaM.

Expert Choice Routing. Experts with predetermined buffer capacity are assigned top-k tokens, thus guaranteeing even load balancing. Each token can be received by a variable number of experts.

In EC routing, we set expert capacity k as the average tokens per expert in a batch of input sequences multiplied by a capacity factor, which determines the average number of experts that can be received by each token. To learn the token-to-expert affinity, our method produces a token-to-expert score matrix that is used to make routing decisions. The score matrix indicates the likelihood of a given token in a batch of input sequences being routed to a given expert.

Similar to Switch Transformer and GShard, we apply an MoE and gating function in the dense feedforward (FFN) layer, as it is the most computationally expensive part of a Transformer-based network. After producing the token-to-expert score matrix, a top-k function is applied along the token dimension for each expert to pick the most relevant tokens. A permutation function is then applied based on the generated indexes of the token, to create a hidden value with an additional expert dimension. The data is split across multiple experts such that all experts can execute the same computational kernel concurrently on a subset of tokens. Because a fixed expert capacity can be determined, we no longer overprovision expert capacity due to load imbalancing, thus significantly reducing training and inference step time by around 20% compared to GLaM.


Evaluation

To illustrate the effectiveness of Expert Choice routing, we first look at training efficiency and convergence. We use EC with a capacity factor of 2 (EC-CF2) to match the activated parameter size and computational cost on a per-token basis to GShard top-2 gating and run both for a fixed number of steps. EC-CF2 reaches the same perplexity as GShard top-2 in less than half the steps and, in addition, we find that each GShard top-2 step is 20% slower than our method.

We also scale the number of experts while fixing the expert size to 100M parameters for both EC and GShard top-2 methods. We find that both work well in terms of perplexity on the evaluation dataset during pre-training — having more experts consistently improves training perplexity.

Evaluation results on training convergence: EC routing yields 2x faster convergence at 8B/64E scale compared to top-2 gating used in GShard and GLaM (top). EC training perplexity scales better with the scaling of number of experts (bottom).

To validate whether improved perplexity directly translates to better performance in downstream tasks, we perform fine-tuning on 11 selected tasks from GLUE and SuperGLUE. We compare three MoE methods including Switch Transformer top-1 gating (ST Top-1), GShard top-2 gating (GS Top-2) and a version of our method (EC-CF2) that matches the activated parameters and computational cost of GS Top-2. The EC-CF2 method consistently outperforms the related methods and yields an average accuracy increase of more than 2% in a large 8B/64E setting. Comparing our 8B/64E model against its dense counterpart, our method achieves better fine-tuning results, increasing the average score by 3.4 points.

Our empirical results indicate that capping the number of experts for each token hurts the fine-tuning score by 1 point on average. This study confirms that allowing a variable number of experts per token is indeed helpful. On the other hand, we compute statistics on token-to-expert routing, particularly on the ratio of tokens that have been routed to a certain number of experts. We find that a majority of tokens have been routed to one or two experts while 23% have been routed to three or four experts and only about 3% tokens have been routed to more than four experts, thus verifying our hypothesis that expert choice routing learns to allocate a variable number of experts to tokens.


Final Thoughts

We propose a new routing method for sparsely activated mixture-of-experts models. This method addresses load imbalance and under-utilization of experts in conventional MoE methods, and enables the selection of different numbers of experts for each token. Our model demonstrates more than 2x training efficiency improvement when compared to the state-of-the-art GShard and Switch Transformer models, and achieves strong gains when fine-tuning on 11 datasets in the GLUE and SuperGLUE benchmark.

Our approach for expert choice routing enables heterogeneous MoE with straightforward algorithmic innovations. We hope that this may lead to more advances in this space at both the application and system levels.


Acknowledgements

Many collaborators across google research supported this work. We particularly thank Nan Du, Andrew Dai, Yanping Huang, and Zhifeng Chen for the initial ground work on MoE infrastructure and Tarzan datasets. We greatly appreciate Hanxiao Liu and Quoc Le for contributing the initial ideas and discussions. Tao Lei, Vincent Zhao, Da Huang, Chang Lan, Daiyi Peng, and Yifeng Lu contributed significantly on implementations and evaluations. Claire Cui, James Laudon, Martin Abadi, and Jeff Dean provided invaluable feedback and resource support.

Source: Google AI Blog


Machine Learning Communities: Q3 ‘22 highlights and achievements

Posted by Nari Yoon, Hee Jung, DevRel Community Manager / Soonson Kwon, DevRel Program Manager

Let’s explore highlights and accomplishments of vast Google Machine Learning communities over the third quarter of the year! We are enthusiastic and grateful about all the activities by the global network of ML communities. Here are the highlights!


TensorFlow/Keras

Load-testing TensorFlow Serving’s REST Interface

Load-testing TensorFlow Serving’s REST Interface by ML GDE Sayak Paul (India) and Chansung Park (Korea) shares the lessons and findings they learned from conducting load tests for an image classification model across numerous deployment configurations.

TFUG Taipei hosted events (Python + Hugging Face-Translation+ tf.keras.losses, Python + Object detection, Python+Hugging Face-Token Classification+tf.keras.initializers) in September and helped community members learn how to use TF and Hugging face to implement machine learning model to solve problems.

Neural Machine Translation with Bahdanau’s Attention Using TensorFlow and Keras and the related video by ML GDE Aritra Roy Gosthipaty (India) explains the mathematical intuition behind neural machine translation.

Serving a TensorFlow image classification model as RESTful and gRPC based services with TFServing, Docker, and Kubernetes

Automated Deployment of TensorFlow Models with TensorFlow Serving and GitHub Actions by ML GDE Chansung Park (Korea) and Sayak Paul (India) explains how to automate TensorFlow model serving on Kubernetes with TensorFlow Serving and GitHub Action.

Deploying ? ViT on Kubernetes with TF Serving by ML GDE Sayak Paul (India) and Chansung Park (Korea) shows how to scale the deployment of a ViT model from ? Transformers using Docker and Kubernetes.

Screenshot of the TensorFlow Forum in the Chinese Language run by the tf.wiki team

Long-term TensorFlow Guidance on tf.wiki Forum by ML GDE Xihan Li (China) provides TensorFlow guidance by answering the questions from Chinese developers on the forum.

photo of a phone with the Hindi letter 'Ohm' drawn on the top half of the screen. Hinidi Character recognition shows the letter Ohm as the Predicted Result below.

Hindi Character Recognition on Android using TensorFlow Lite by ML GDE Nitin Tiwari (India) shares an end-to-end tutorial on training a custom computer vision model to recognize Hindi characters. In TFUG Pune event, he also gave a presentation titled Building Computer Vision Model using TensorFlow: Part 1.

Using TFlite Model Maker to Complete a Custom Audio Classification App by ML GDE Xiaoxing Wang (China) shows how to use TFLite Model Maker to build a custom audio classification model based on YAMNet and how to import and use the YAMNet-based custom models in Android projects.

SoTA semantic segmentation in TF with ? by ML GDE Sayak Paul (India) and Chansung Park (Korea). The SegFormer model was not available on TensorFlow.

Text Augmentation in Keras NLP by ML GDE Xiaoquan Kong (China) explains what text augmentation is and how the text augmentation feature in Keras NLP is designed.

The largest vision model checkpoint (public) in TF (10 Billion params) through ? transformers by ML GDE Sayak Paul (India) and Aritra Roy Gosthipaty (India). The underlying model is RegNet, known for its ability to scale.

A simple TensorFlow implementation of a DCGAN to generate CryptoPunks

CryptoGANs open-source repository by ML GDE Dimitre Oliveira (Brazil) shows simple model implementations following TensorFlow best practices that can be extended to more complex use-cases. It connects the usage of TensorFlow with other relevant frameworks, like HuggingFace, Gradio, and Streamlit, building an end-to-end solution.


TFX

TFX Machine Learning Pipeline from data injection in TFRecord to pushing out Vertex AI

MLOps for Vision Models from ? with TFX by ML GDE Chansung Park (Korea) and Sayak Paul (India) shows how to build a machine learning pipeline for a vision model (TensorFlow) from ? Transformers using the TF ecosystem.

First release of TFX Addons Package by ML GDE Hannes Hapke (United States). The package has been downloaded a few thousand times (source). Google and other developers maintain it through bi-weekly meetings. Google’s Open Source Peer Award has recognized the work.

TFUG São Paulo hosted TFX T1 | E4 & TFX T1 | E5. And ML GDE Vinicius Caridá (Brazil) shared how to train a model in a TFX pipeline. The fifth episode talks about Pusher: publishing your models with TFX.

Semantic Segmentation model within ML pipeline by ML GDE Chansung Park (Korea) and Sayak Paul (India) shows how to build a machine learning pipeline for semantic segmentation task with TFX and various GCP products such as Vertex Pipeline, Training, and Endpoints.


JAX/Flax

Screen shot of Tutorial 2 (JAX): Introduction to JAX+Flax with GitHub Repo and Codelab via university of Amseterdam

JAX Tutorial by ML GDE Phillip Lippe (Netherlands) is meant to briefly introduce JAX, including writing and training neural networks with Flax.


TFUG Malaysia hosted Introduction to JAX for Machine Learning (video) and Leong Lai Fong gave a talk. The attendees learned what JAX is and its fundamental yet unique features, which make it efficient to use when executing deep learning workloads. After that, they started training their first JAX-powered deep learning model.

TFUG Taipei hosted Python+ JAX + Image classification and helped people learn JAX and how to use it in Colab. They shared knowledge about the difference between JAX and Numpy, the advantages of JAX, and how to use it in Colab.

Introduction to JAX by ML GDE João Araújo (Brazil) shared the basics of JAX in Deep Learning Indaba 2022.

A comparison of the performance and overview of issues resulting from changing from NumPy to JAX

Should I change from NumPy to JAX? by ML GDE Gad Benram (Portugal) compares the performance and overview of the issues that may result from changing from NumPy to JAX.

Introduction to JAX: efficient and reproducible ML framework by ML GDE Seunghyun Lee (Korea) introduced JAX/Flax and their key features using practical examples. He explained the pure function and PRNG, which make JAX explicit and reproducible, and XLA and mapping functions which make JAX fast and easily parallelized.

Data2Vec Style pre-training in JAX by ML GDE Vasudev Gupta (India) shares a tutorial for demonstrating how to pre-train Data2Vec using the Jax/Flax version of HuggingFace Transformers.

Distributed Machine Learning with JAX by ML GDE David Cardozo (Canada) delivered what makes JAX different from TensorFlow.

Image classification with JAX & Flax by ML GDE Derrick Mwiti (Kenya) explains how to build convolutional neural networks with JAX/Flax. And he wrote several articles about JAX/Flax: What is JAX?, How to load datasets in JAX with TensorFlow, Optimizers in JAX and Flax, Flax vs. TensorFlow, etc..


Kaggle

DDPMs - Part 1 by ML GDE Aakash Nain (India) and cait-tf by ML GDE Sayak Paul (India) were announced as Kaggle ML Research Spotlight Winners.

Forward process in DDPMs from Timestep 0 to 100

Fresher on Random Variables, All you need to know about Gaussian distribution, and A deep dive into DDPMs by ML GDE Aakash Nain (India) explain the fundamentals of diffusion models.

In Grandmasters Journey on Kaggle + The Kaggle Book, ML GDE Luca Massaron (Italy) explained how Kaggle helps people in the data science industry and which skills you must focus on apart from the core technical skills.


Cloud AI

How Cohere is accelerating language model training with Google Cloud TPUs by ML GDE Joanna Yoo (Canada) explains what Cohere engineers have done to solve scaling challenges in large language models (LLMs).

ML GDE Hannes Hapke (United States) chats with Fillipo Mandella, Customer Engineering Manager at Google

In Using machine learning to transform finance with Google Cloud and Digits, ML GDE Hannes Hapke (United States) chats with Fillipo Mandella, Customer Engineering Manager at Google, about how Digits leverages Google Cloud’s machine learning tools to empower accountants and business owners with near-zero latency.

A tour of Vertex AI by TFUG Chennai for ML, cloud, and DevOps engineers who are working in MLOps. This session was about the introduction of Vertex AI, handling datasets and models in Vertex AI, deployment & prediction, and MLOps.

TFUG Abidjan hosted two events with GDG Cloud Abidjan for students and professional developers who want to prepare for a Google Cloud certification: Introduction session to certifications and Q&A, Certification Study Group.

Flow chart showing shows how to deploy a ViT B/16 model on Vertex AI

Deploying ? ViT on Vertex AI by ML GDE Sayak Paul (India) and Chansung Park (Korea) shows how to deploy a ViT B/16 model on Vertex AI. They cover some critical aspects of a deployment such as auto-scaling, authentication, endpoint consumption, and load-testing.

Photo collage of AI generated images

TFUG Singapore hosted The World of Diffusion - DALL-E 2, IMAGEN & Stable Diffusion. ML GDE Martin Andrews (Singapore) and Sam Witteveen (Singapore) gave talks named “How Diffusion Works” and “Investigating Prompt Engineering on Diffusion Models” to bring people up-to-date with what has been going on in the world of image generation.

ML GDE Martin Andrews (Singapore) have done three projects: GCP VM with Nvidia set-up and Convenience Scripts, Containers within a GCP host server, with Nvidia pass-through, Installing MineRL using Containers - with linked code.

Jupyter Services on Google Cloud by ML GDE Gad Benram (Portugal) explains the differences between Vertex AI Workbench, Colab, and Deep Learning VMs.

Google Cloud's Two Towers Recommender and TensorFlow

Train and Deploy Google Cloud's Two Towers Recommender by ML GDE Rubens de Almeida Zimbres (Brazil) explains how to implement the model and deploy it in Vertex AI.


Research & Ecosystem

WOMEN DATA SCIENCE, LA PAZ Club de lectura de papers de Machine Learning Read, Learn and Share the knowledge #MLPaperReadingClubs, Nathaly Alarcón, @WIDS_LaPaz #MLPaperReadingClubs

The first session of #MLPaperReadingClubs (video) by ML GDE Nathaly Alarcon Torrico (Bolivia) and Women in Data Science La Paz. Nathaly led the session, and the community members participated in reading the ML paper “Zero-shot learning through cross-modal transfer.”

In #MLPaperReadingClubs (video) by TFUG Lesotho, Arnold Raphael volunteered to lead the first session “Zero-shot learning through cross-modal transfer.”

Screenshot of a screenshare of Zero-shot learning through cross-modal transfer to 7 participants in a virtual call

ML Paper Reading Clubs #1: Zero Shot Learning Paper (video) by TFUG Agadir introduced a model that can recognize objects in images even if no training data is available for the objects. TFUG Agadir prepared this event to make people interested in machine learning research and provide them with a broader vision of differentiating good contributions from great ones.

Opening of the Machine Learning Paper Reading Club (video) by TFUG Dhaka introduced ML Paper Reading Club and the group’s plan.

EDA on SpaceX Falcon 9 launches dataset (Kaggle) (video) by TFUG Mysuru & TFUG Chandigarh organizer Aashi Dutt (presenter) walked through exploratory data analysis on SpaceX Falcon 9 launches dataset from Kaggle.

Screenshot of ML GDE Qinghua Duan (China) showing how to apply the MRC paradigm and BERT to solve the dialogue summarization problem.

Introduction to MRC-style dialogue summaries based on BERT by ML GDE Qinghua Duan (China) shows how to apply the MRC paradigm and BERT to solve the dialogue summarization problem.

Plant disease classification using Deep learning model by ML GDE Yannick Serge Obam Akou (Cameroon) talked on plant disease classification using deep learning model : an end to end Android app (open source project) that diagnoses plant diseases.

TensorFlow/Keras implementation of Nystromformer

Nystromformer Github repository by Rishit Dagli provides TensorFlow/Keras implementation of Nystromformer, a transformer variant that uses the Nyström method to approximate standard self-attention with O(n) complexity which allows for better scalability.

Robots That Write Their Own Code

A common approach used to control robots is to program them with code to detect objects, sequencing commands to move actuators, and feedback loops to specify how the robot should perform a task. While these programs can be expressive, re-programming policies for each new task can be time consuming, and requires domain expertise.

What if when given instructions from people, robots could autonomously write their own code to interact with the world? It turns out that the latest generation of language models, such as PaLM, are capable of complex reasoning and have also been trained on millions of lines of code. Given natural language instructions, current language models are highly proficient at writing not only generic code but, as we’ve discovered, code that can control robot actions as well. When provided with several example instructions (formatted as comments) paired with corresponding code (via in-context learning), language models can take in new instructions and autonomously generate new code that re-composes API calls, synthesizes new functions, and expresses feedback loops to assemble new behaviors at runtime. More broadly, this suggests an alternative approach to using machine learning for robots that (i) pursues generalization through modularity and (ii) leverages the abundance of open-source code and data available on the Internet.

Given code for an example task (left), language models can re-compose API calls to assemble new robot behaviors for new tasks (right) that use the same functions but in different ways.

To explore this possibility, we developed Code as Policies (CaP), a robot-centric formulation of language model-generated programs executed on physical systems. CaP extends our prior work, PaLM-SayCan, by enabling language models to complete even more complex robotic tasks with the full expression of general-purpose Python code. With CaP, we propose using language models to directly write robot code through few-shot prompting. Our experiments demonstrate that outputting code led to improved generalization and task performance over directly learning robot tasks and outputting natural language actions. CaP allows a single system to perform a variety of complex and varied robotic tasks without task-specific training.




We demonstrate, across several robot systems, including a robot from Everyday Robots, that language models can autonomously interpret language instructions to generate and execute CaPs that represent reactive low-level policies (e.g., proportional-derivative or impedance controllers) and waypoint-based policies (e.g., vision-based pick and place, trajectory-based control).

A Different Way to Think about Robot Generalization

To generate code for a new task given natural language instructions, CaP uses a code-writing language model that, when prompted with hints (i.e., import statements that inform which APIs are available) and examples (instruction-to-code pairs that present few-shot "demonstrations" of how instructions should be converted into code), writes new code for new instructions. Central to this approach is hierarchical code generation, which prompts language models to recursively define new functions, accumulate their own libraries over time, and self-architect a dynamic codebase. Hierarchical code generation improves state-of-the-art on both robotics as well as standard code-gen benchmarks in natural language processing (NLP) subfields, with 39.8% pass@1 on HumanEval, a benchmark of hand-written coding problems used to measure the functional correctness of synthesized programs.

Code-writing language models can express a variety of arithmetic operations and feedback loops grounded in language. Pythonic language model programs can use classic logic structures, e.g., sequences, selection (if/else), and loops (for/while), to assemble new behaviors at runtime. They can also use third-party libraries to interpolate points (NumPy), analyze and generate shapes (Shapely) for spatial-geometric reasoning, etc. These models not only generalize to new instructions, but they can also translate precise values (e.g., velocities) to ambiguous descriptions ("faster" and "to the left") depending on the context to elicit behavioral commonsense.

Code as Policies uses code-writing language models to map natural language instructions to robot code to complete tasks. Generated code can call existing perception action APIs, third party libraries, or write new functions at runtime.

CaP generalizes at a specific layer in the robot: interpreting natural language instructions, processing perception outputs (e.g., from off-the-shelf object detectors), and then parameterizing control primitives. This fits into systems with factorized perception and control, and imparts a degree of generalization (acquired from pre-trained language models) without the magnitude of data collection needed for end-to-end robot learning. CaP also inherits language model capabilities that are unrelated to code writing, such as supporting instructions with non-English languages and emojis.

CaP inherits the capabilities of language models, such as multilingual and emoji support.

By characterizing the types of generalization encountered in code generation problems, we can also study how hierarchical code generation improves generalization. For example, "systematicity" evaluates the ability to recombine known parts to form new sequences, "substitutivity" evaluates robustness to synonymous code snippets, while "productivity" evaluates the ability to write policy code longer than those seen in the examples (e.g., for new long horizon tasks that may require defining and nesting new functions). Our paper presents a new open-source benchmark to evaluate language models on a set of robotics-related code generation problems. Using this benchmark, we find that, in general, bigger models perform better across most metrics, and that hierarchical code generation improves "productivity" generalization the most.

Performance on our RoboCodeGen Benchmark across different generalization types. The larger model (Davinci) performs better than the smaller model (Cushman), with hierarchical code generation improving productivity the most.

We're also excited about the potential for code-writing models to express cross-embodied plans for robots with different morphologies that perform the same task differently depending on the available APIs (perception action spaces), which is an important aspect of any robotics foundation model.

Language model code-generation exhibits cross-embodiment capabilities, completing the same task in different ways depending on the available APIs (that define perception action spaces).

Limitations

Code as policies today are restricted by the scope of (i) what the perception APIs can describe (e.g., few visual-language models to date can describe whether a trajectory is "bumpy" or "more C-shaped"), and (ii) which control primitives are available. Only a handful of named primitive parameters can be adjusted without over-saturating the prompts. Our approach also assumes all given instructions are feasible, and we cannot tell if generated code will be useful a priori. CaPs also struggle to interpret instructions that are significantly more complex or operate at a different abstraction level than the few-shot examples provided to the language model prompts. Thus, for example, in the tabletop domain, it would be difficult for our specific instantiation of CaPs to "build a house with the blocks" since there are no examples of building complex 3D structures. These limitations point to avenues for future work, including extending visual language models to describe low-level robot behaviors (e.g., trajectories) or combining CaPs with exploration algorithms that can autonomously add to the set of control primitives.


Open-Source Release

We have released the code needed to reproduce our experiments and an interactive simulated robot demo on the project website, which also contains additional real-world demos with videos and generated code.


Conclusion

Code as policies is a step towards robots that can modify their behaviors and expand their capabilities accordingly. This can be enabling, but the flexibility also raises potential risks since synthesized programs (unless manually checked per runtime) may result in unintended behaviors with physical hardware. We can mitigate these risks with built-in safety checks that bound the control primitives that the system can access, but more work is needed to ensure new combinations of known primitives are equally safe. We welcome broad discussion on how to minimize these risks while maximizing the potential positive impacts towards more general-purpose robots.


Acknowledgements

This research was done by Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng. Special thanks to Vikas Sindhwani, Vincent Vanhoucke for helpful feedback on writing, Chad Boodoo for operations and hardware support. An early preprint is available on arXiv.

Source: Google AI Blog


Interview with Doug Duhaime, contributor to Google’s Dev Library

Posted by the Google Dev Library Team

Introducing the Dev Library Contributor Spotlights - a blog series highlighting developers that are supporting the thriving development ecosystem by contributing their resources and tools to Google Dev Library.

We met with Doug Duhaime, Full Stack Developer in Yale University's Digital Humanities Lab, to discuss his passion for Machine Learning, his processes and what inspired him to release his PixPlot project as an Open Source.

What led you to explore the field of machine learning?

I was an English major in undergrad and in graduate school. I have a PhD in English literature. My dissertation was exploring copyright history and the ways that changes in copyright law affected the book market. How does the institution of fixed duration copyright influence the book market? To answer this question, I had to mine an enormous collection of data - half a million books, published before 1800 - to look at different patterns. That was one of the key projects that got me inspired to further explore the world of Machine Learning.

In fact, one of my projects - the PixPlot library - uses computer vision to analyze image collections, which was also partially used in my research. Part of my research looked at plagiarism detection and how readily people are inclined to copy images once it becomes legal to copy them from other texts. Computer vision helps us to answer these questions and identify key patterns.

I’ve seen machine learning and programming as a way to ask new questions in historical contexts. And there's a whole field of us - we're called digital humanists. Yale University, where I've been for the last five years, has a fantastic digital humanities program where researchers are asking questions like this and using fun machine learning platforms like TensorFlow to answer those questions.

Screenshot from the PixPlot library showing Image Fields in the Meserve-Kunhardt Collection with the following identified hotspots: Boxers, Buildings, Buttons, Chairs, Gowns

Can you tell us more about the evolution of your PixPlot library project?

We started in Yale's digital humanities lab with a project called neural neighbors. And the idea here was to find patterns in the Meserve-Kunhardt Collection of images.

Meserve-Kunhardt is a collection of photographs largely from the 19th century that Yale recently acquired. After being acquired by the university, some curators were preparing to identify all this really rich metadata to describe these images. However, they had a backlog, and they needed help to try to make sense of what's in this collection. And so, Neural Neighbors was our initial attempt to answer this question.

As this project went on, we started running up against limitations and asking bigger questions. For example, instead of just looking at the pictures, what would it be like to look at the entire collection all at once? In order to answer this question, we needed a more performant rendering layer.

So we decided to utilize TensorFlow, which allowed us to extract vector representation of each image. We then compressed the dimensionality of those vectors down to 2D. But for PixPlot, we decided to use a different dimensionality reduction technique called umap. And that brought us to the first release of PixPlot.

The idea here was to take the whole collection, shoot it down into 2D, and then let you move through it and look at the images in the collection wherein we expect images with similar content to be placed close by one another.

And so it's just evolved from that early genesis and Neural Neighbors through to where it is today.

What inspired you to release PixPlot as an open source project?

In the case of PixPlot, I was working for Yale University, and we had a goal to make as much of our contributions to the software world as possible open and publicly accessible without any commercial terms.

It was a huge privilege to spend time with the lab and build software that others found useful. I would say even more generally, in my personal life, I really like building things that people find useful and, when possible, contributing back to the open source world because, I think, so many of us learn from open source.

Google Dev Library Quote: We look at other people's examples and get excited by tools and projects others are building. And many of those are non-commercial. They're just open and free to the world. And it's great to give back when we can. Doug Duhaime Dev Library Contributor

Find out more content contributed and authored by Doug Duhaime and discover more unique tools and resources on the Google Dev Library website!

Open Images V7 — Now Featuring Point Labels

Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. Researchers around the world use Open Images to train and evaluate computer vision models. Since the initial release of Open Images in 2016, which included image-level labels covering 6k categories, we have provided multiple updates to enrich annotations and expand the potential use cases of the dataset. Through several releases, we have added image-level labels for over 20k categories on all images and bounding box annotations, visual relations, instance segmentations, and localized narratives (synchronized voice, mouse trace, and text caption) on a subset of 1.9M images.

Today, we are happy to announce the release of Open Images V7, which expands the Open Images dataset even further with a new annotation type called point-level labels and includes a new all-in-one visualization tool that allows a better exploration of the rich data available.


Point Labels

The main strategy used to collect the new point-level label annotations leveraged suggestions from a machine learning (ML) model and human verification. First, the ML model selected points of interest and asked a yes or no question, e.g., “is this point on a pumpkin?”. Then, human annotators spent an average of 1.1 seconds answering the yes or no questions. We aggregated the answers from different annotators over the same question and assigned a final “yes”, “no”, or “unsure” label to each annotated point.

Illustration of the annotations interface.
(Image by Lenore Edman, under CC BY 2.0 license)

For each annotated image, we provide a collection of points, each with a “yes” or “no” label for a given class. These points provide sparse information that can be used for the semantic segmentation task. We collected a total of 38.6M new point annotations (12.4M with “yes” labels) that cover 5.8 thousand classes and 1.4M images.

By focusing on point labels, we expanded the number of images annotated and categories covered. We also concentrated the efforts of our annotators on efficiently collecting useful information. Compared to our instance segmentation, the new points include 16x more classes and cover more images. The new points also cover 9x more classes than our box annotations. Compared to existing segmentation datasets, like PASCAL VOC, COCO, Cityscapes, LVIS, or ADE20K, our annotations cover more classes and more images than previous work. The new point label annotations are the first type of annotation in Open Images that provides localization information for both things (countable objects, like cars, cats, and catamarans), and stuff categories (uncountable objects like grass, granite, and gravel). Overall, the newly collected data is roughly equivalent to two years of human annotation effort.

Our initial experiments show that this type of sparse data is suitable for both training and evaluating segmentation models. Training a model directly on sparse data allows us to reach comparable quality to training on dense annotations. Similarly, we show that one can directly compute the traditional semantic segmentation intersection-over-union (IoU) metric over sparse data. The ranking across different methods is preserved, and the sparse IoU values are an accurate estimate of its dense version. See our paper for more details.

Below, we show four example images with their point-level labels, illustrating the rich and diverse information these annotations provide. Circles ⭘ are “yes” labels, and squares are “no” labels.

Four example images with point-level labels.
Images by Richie Diesterheft, John AM Nueva, Sarah Ackerman, and C Thomas, all under CC BY 2.0 license.

New Visualizers

In addition to the new data release, we also expanded the available visualizations of the Open Images annotations. The Open Images website now includes dedicated visualizers to explore the localized narratives annotations, the new point-level annotations, and a new all-in-one view. This new all-in-one view is available for the subset of 1.9M densely annotated images and allows one to explore the rich annotations that Open Images has accumulated over seven releases. On average these images have annotations for 6.7 image-labels (classes), 8.3 boxes, 1.7 relations, 1.5 masks, 0.4 localized narratives and 34.8 point-labels per image.

Below, we show two example images with various annotations in the all-in-one visualizer. The figures show the image-level labels, bounding boxes, box relations, instance masks, localized narrative mouse trace and caption, and point-level labels. The + classes have positive annotations (of any kind), while classes have only negative annotations (image-level or point-level).

Two example images with various annotations in the all-in-one visualizer.
Images by Jason Paris, and Rubén Vique, all under CC BY 2.0 license.

Conclusion

We hope that this new data release will enable computer vision research to cover ever more diverse and challenging scenarios. As the quality of automated semantic segmentation models improves over common classes, we want to move towards the long tail of visual concepts, and sparse point annotations are a step in that direction. More and more works are exploring how to use such sparse annotations (e.g., as supervision for instance segmentation or semantic segmentation), and Open Images V7 contributes to this research direction. We are looking forward to seeing what you will build next.


Acknowledgements

Thanks to Vittorio Ferrari, Jordi Pont-Tuset, Alina Kuznetsova, Ashlesha Sadras, and the annotators team for their support creating this new data release.

Source: Google AI Blog


Kubeflow applies to become a CNCF incubating project

Google has pioneered AI and ML and has a history of innovative technology donations to the open source community (e.g. TensorFlow and Jax). Google is also the initial developer and largest contributor to Kubernetes, and brings with it a wealth of experience to the project and its community. Building an ML Platform on our state-of-the-art Google Kubernetes Engine (GKE), we have learned best practices from our users, and in 2017, we used that experience to create and open source the Kubeflow project.

In May 2020, with the v1.0 release, Kubeflow reached maturity across a core set of its stable applications. During that year, we also graduated Kubeflow Serving as an independent project, KServe, which is now incubating in Linux Foundation AI & Data.

Today, Kubeflow has developed into an end-to-end, extendable ML platform, with multiple distinct components to address specific stages of the ML lifecycle: model development (Kubeflow Notebooks), model training (Kubeflow Pipelines and Kubeflow Training Operator), model serving (KServe), and automated machine learning (Katib).

The Kubeflow project now has close to 200 contributors from over 30 organizations, and the Kubeflow community has hosted several summits and contributor meetups across the world. The broader Kubeflow ecosystem includes a number distributions across multiple cloud service providers and on-prem environments. Kubeflow’s powerful development experience helps data scientists build, train, and deploy their ML models, enabling enterprise ML operation teams to deploy and scale advanced workflows in a variety of infrastructures.

Google’s application for Kubeflow to become a CNCF incubating project is the next big milestone for the Kubeflow community, and we’re thrilled to see how developers will continue to build and innovate in ML using this project.

What's next? The pull request we’ve opened today to join the CNCF as an incubating project is only the first step. Google and the Kubeflow community will work with the CNCF and their Technical Oversight Committee (TOC), to meet the incubation stage requirements. While the due diligence and eventual TOC decision can take a few months, the Kubeflow project will continue developing and releasing throughout this process.

If Kubeflow is accepted into CNCF, the project’s assets will be transferred to the CNCF, including the source code, trademark, and website, and other collaboration and social media accounts. At Google, we believe that using open source comes with a responsibility to contribute, maintain, and improve those projects. In that spirit, we will continue supporting the Kubeflow project and work with the community towards the next level of innovation.

Thanks to everyone who has contributed to Kubeflow over the years! We are excited for what lays ahead for the Kubeflow community.

By Thea Lamkin, Senior Program Manager and Mark Chmarny, Senior Technical Program Manager – Google Open Source