Tag Archives: machine learning

SimPer: Simple self-supervised learning of periodic targets

Learning from periodic data (signals that repeat, such as a heartbeat or the daily temperature changes on Earth’s surface) is crucial for many real-world applications, from monitoring weather systems to detecting vital signs. For example, in the environmental remote sensing domain, periodic learning is often needed to enable nowcasting of environmental changes, such as precipitation patterns or land surface temperature. In the health domain, learning from video has been shown to extract (quasi-)periodic vital signs and to detect conditions such as atrial fibrillation and sleep apnea episodes.

Approaches like RepNet highlight the importance of these types of tasks, and present a solution that recognizes repetitive activities within a single video. However, these are supervised approaches that require a significant amount of data to capture repetitive activities, all labeled to indicate the number of times an action was repeated. Labeling such data is often challenging and resource-intensive, requiring researchers to manually capture gold-standard temporal measurements that are synchronized with the modality of interest (e.g., video or satellite imagery).

Alternatively, self-supervised learning (SSL) methods (e.g., SimCLR and MoCo v2), which leverage a large amount of unlabeled data to learn useful representations, have demonstrated success in solving classification tasks. However, they overlook the intrinsic periodicity in data (i.e., the ability to identify whether a frame is part of a periodic process) and fail to learn robust representations that capture periodic or frequency attributes. This is because periodic learning exhibits characteristics that are distinct from prevailing learning tasks.

Feature similarity is different in the context of periodic representations as compared to static features (e.g., images). For example, videos that are offset by short time delays or are reversed should be similar to the original sample, whereas videos that have been upsampled or downsampled by a factor x should differ from the original sample in frequency by that factor x.

To address these challenges, in “SimPer: Simple Self-Supervised Learning of Periodic Targets”, published at the eleventh International Conference on Learning Representations (ICLR 2023), we introduced a self-supervised contrastive framework for learning periodic information in data. Specifically, SimPer leverages the temporal properties of periodic targets using temporal self-contrastive learning, where positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. We propose periodic feature similarity that explicitly defines how to measure similarity in the context of periodic learning. Moreover, we design a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). Next, we demonstrate that SimPer effectively learns periodic feature representations compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts. Finally, we are excited to release the SimPer code repository to the research community.


The SimPer framework

SimPer introduces a temporal self-contrastive learning framework. Positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. For temporal video examples, periodicity-invariant augmentations include cropping, rotation, or flipping, whereas periodicity-variant augmentations involve increasing or decreasing the speed of a video.

To explicitly define how to measure similarity in the context of periodic learning, SimPer proposes periodic feature similarity. This construction allows us to formulate training as a contrastive learning task. A model can be trained with data without any labels and then fine-tuned if necessary to map the learned features to specific frequency values.

Given an input sequence x, we know there’s an underlying associated periodic signal. We then transform x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo speed or frequency labels for the unlabeled input x.
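As a rough illustration of this idea, the sketch below generates speed-altered views of a 1-D periodic signal and records their relative pseudo speed labels. The resampling scheme and the specific speed factors are our own illustrative assumptions, not SimPer's actual augmentation code.

import numpy as np

def change_speed(x: np.ndarray, speed: float) -> np.ndarray:
    # Resample the sequence so that, at a fixed frame rate, its underlying
    # frequency is multiplied by `speed` (>1 speeds it up, <1 slows it down).
    n = len(x)
    new_n = max(2, int(round(n / speed)))
    src_idx = np.linspace(0, n - 1, new_n)      # where each output sample reads from
    return np.interp(src_idx, np.arange(n), x)

# Toy periodic input whose true frequency is unknown to the learner.
t = np.linspace(0, 1, 300)
x = np.sin(2 * np.pi * 3 * t)

speeds = [0.5, 1.0, 1.5, 2.0]                   # illustrative augmentation factors
views = [change_speed(x, s) for s in speeds]    # periodicity-variant (negative) views
pseudo_labels = speeds                          # relative pseudo speed/frequency labels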

Conventional similarity measures such as cosine similarity emphasize strict proximity between two feature vectors, and are sensitive to index shifted features (which represent different time stamps), reversed features, and features with changed frequencies. In contrast, periodic feature similarity should be high for samples with small temporal shifts or reversed indexes, while capturing a continuous similarity change when the feature frequency varies. This can be achieved via a similarity metric in the frequency domain, such as the distance between two Fourier transforms.
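One simple way to realize such a metric, assuming the features are 1-D temporal sequences, is to compare normalized magnitude spectra instead of raw feature vectors. The sketch below is an illustrative stand-in for this idea rather than the exact similarity used in the paper.

import numpy as np

def periodic_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    # Compare normalized magnitude spectra: invariant to small temporal shifts and
    # to reversal (both leave |FFT| unchanged), but sensitive to frequency changes.
    s1 = np.abs(np.fft.rfft(f1))
    s2 = np.abs(np.fft.rfft(f2))
    s1 /= (np.linalg.norm(s1) + 1e-8)
    s2 /= (np.linalg.norm(s2) + 1e-8)
    return float(np.dot(s1, s2))

t = np.linspace(0, 1, 256, endpoint=False)
base = np.sin(2 * np.pi * 5 * t)
shifted = np.roll(base, 10)             # small temporal shift -> similarity stays high
reversed_ = base[::-1]                  # reversed indexes     -> similarity stays high
faster = np.sin(2 * np.pi * 10 * t)     # doubled frequency    -> similarity drops

print(periodic_similarity(base, shifted), periodic_similarity(base, reversed_),
      periodic_similarity(base, faster))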

To harness the intrinsic continuity of augmented samples in the frequency domain, SimPer designs a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). This makes it suitable for regression tasks, where the goal is to recover a continuous signal, such as a heart beat.
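The sketch below shows one way such a softened InfoNCE objective could look: rather than a single one-hot positive, every candidate view is weighted by how close its pseudo frequency label is to the anchor's, so the loss varies continuously with label distance. The function name, temperature values, and exponential label kernel are illustrative assumptions, not the exact formulation from the paper.

import numpy as np

def soft_infonce(sims: np.ndarray, freqs: np.ndarray, anchor: int,
                 tau: float = 0.1, label_tau: float = 0.5) -> float:
    # sims[j]:  periodic feature similarity between the anchor and augmented view j
    # freqs[j]: pseudo frequency label of view j
    logits = sims / tau
    log_p = logits - np.log(np.sum(np.exp(logits)))   # log-softmax over views

    # Soft targets: views whose labels are close to the anchor's get more weight,
    # so the loss degrades gracefully instead of treating every other view as an
    # equally hard negative.
    label_dist = np.abs(freqs - freqs[anchor])
    targets = np.exp(-label_dist / label_tau)
    targets /= targets.sum()

    return float(-np.sum(targets * log_p))            # soft cross-entropy

# Example: four augmented views with known relative frequencies.
freqs = np.array([1.0, 1.5, 2.0, 3.0])
sims = np.array([0.9, 0.6, 0.3, 0.1])   # hypothetical similarities to the anchor (view 0)
print(soft_infonce(sims, freqs, anchor=0))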

SimPer constructs negative views of data through transformations in the frequency domain. The input sequence x has an underlying associated periodic signal. SimPer transforms x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo speed or frequency labels for unlabeled input x (periodicity-variant augmentations τ). SimPer takes transformations that do not change the identity of the input and defines these as periodicity-invariant augmentations σ, thus creating different positive views of the sample. Then, it sends these augmented views to the encoder f, which extracts corresponding features.


Results

To evaluate SimPer's performance, we benchmarked it against state-of-the-art SSL schemes (e.g., SimCLR, MoCo v2, BYOL, CVRL) on a set of six diverse periodic learning datasets for common real-world tasks in human behavior analysis, environmental remote sensing, and healthcare. Specifically, below we present results on heart rate measurement and exercise repetition counting from video. The results show that SimPer outperforms the state-of-the-art SSL schemes across all six datasets, highlighting its superior performance in terms of data efficiency, robustness to spurious correlations, and generalization to unseen targets.

Here we show quantitative results on two representative datasets, comparing models pre-trained using various SSL methods and fine-tuned on the labeled data. First, we pre-train SimPer using the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) dataset, a human photoplethysmography and heart rate prediction dataset, and compare its performance to state-of-the-art SSL methods. We observe that SimPer outperforms SimCLR, MoCo v2, BYOL, and CVRL methods. The results on the human action counting dataset, Countix, further confirm the benefits of SimPer over other methods as it notably outperforms the supervised baseline. For the feature evaluation results and performance on other datasets, please refer to the paper.

Results of SimCLR, MoCo v2, BYOL, CVRL and SimPer on the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) and Countix datasets. Heart rate and repetition count performance is reported as mean absolute error (MAE).


Conclusion and applications

We present SimPer, a self-supervised contrastive framework for learning periodic information in data. We demonstrate that by combining a temporal self-contrastive learning framework, periodicity-invariant and periodicity-variant augmentations, and continuous periodic feature similarity, SimPer provides an intuitive and flexible approach for learning strong feature representations for periodic signals. Moreover, SimPer can be applied to various fields, ranging from environmental remote sensing to healthcare.


Acknowledgements

We would like to thank Yuzhe Yang, Xin Liu, Ming-Zher Poh, Jiang Wu, Silviu Borac, and Dina Katabi for their contributions to this work.

Source: Google AI Blog


Symbol tuning improves in-context learning in language models

A key feature of human intelligence is that humans can learn to perform new tasks by reasoning using only a few examples. Scaling up language models has unlocked a range of new applications and paradigms in machine learning, including the ability to perform challenging reasoning tasks via in-context learning. Language models, however, are still sensitive to the way that prompts are given, indicating that they are not reasoning in a robust manner. For instance, language models often require heavy prompt engineering or phrasing tasks as instructions, and they exhibit unexpected behaviors such as performance on tasks being unaffected even when shown incorrect labels.

In “Symbol tuning improves in-context learning in language models”, we propose a simple fine-tuning procedure that we call symbol tuning, which can improve in-context learning by emphasizing input–label mappings. We experiment with symbol tuning across Flan-PaLM models and observe benefits across various settings.

  • Symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels.
  • Symbol-tuned models are much stronger at algorithmic reasoning tasks.
  • Finally, symbol-tuned models show large improvements in following flipped labels presented in-context, meaning that they are more capable of using in-context information to override prior knowledge.
An overview of symbol tuning, in which models are fine-tuned on tasks where natural language labels are replaced with arbitrary symbols. Symbol tuning relies on the intuition that when instructions and relevant labels are not available, models must use in-context examples to learn the task.

Motivation

Instruction tuning is a common fine-tuning method that has been shown to improve performance and allow models to better follow in-context examples. One shortcoming, however, is that models are not forced to learn to use the examples because the task is redundantly defined in the evaluation example via instructions and natural language labels. For example, on the left in the figure above, although the examples can help the model understand the task (sentiment analysis), they are not strictly necessary since the model could ignore the examples and just read the instruction that indicates what the task is.

In symbol tuning, the model is fine-tuned on examples where the instructions are removed and natural language labels are replaced with semantically-unrelated labels (e.g., “Foo,” “Bar,” etc.). In this setup, the task is unclear without looking at the in-context examples. For example, on the right in the figure above, multiple in-context examples would be needed to figure out the task. Because symbol tuning teaches the model to reason over the in-context examples, symbol-tuned models should have better performance on tasks that require reasoning between in-context examples and their labels.
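The toy snippet below illustrates that transformation on a sentiment example: the instruction is dropped and the natural language labels are replaced with arbitrary symbols. The sentences and the symbols “Foo”/“Bar” are made up for illustration.

# In-context examples for a sentiment task, before the symbol-tuning transform.
examples = [
    ("This movie was fantastic.", "positive"),
    ("I regretted buying the ticket.", "negative"),
]

# Arbitrary, semantically unrelated symbols replace the natural language labels,
# and no task instruction is included in the prompt.
symbol_map = {"positive": "Foo", "negative": "Bar"}

prompt = ""
for text, label in examples:
    prompt += f"Input: {text}\nOutput: {symbol_map[label]}\n\n"
prompt += "Input: An unforgettable performance.\nOutput:"  # model must infer Foo vs. Bar
print(prompt)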

Datasets and task types used for symbol tuning.

Symbol-tuning procedure

We selected 22 publicly-available natural language processing (NLP) datasets that we use for our symbol-tuning procedure. These tasks have been widely used in the past, and we only chose classification-type tasks since our method requires discrete labels. We then remap labels to a random label from a set of ~30K arbitrary labels selected from one of three categories: integers, character combinations, and words.
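As a hypothetical sketch (the actual label pools and sampling code are not shown in this post), the remapping step might look like the following, drawing a distinct random symbol per class from the three categories.

import random
import string

def sample_symbol_pool(n_per_category: int = 3) -> list[str]:
    # Toy stand-ins for the three symbol categories.
    integers = [str(random.randint(0, 10_000)) for _ in range(n_per_category)]
    char_combos = ["".join(random.choices(string.ascii_lowercase, k=4))
                   for _ in range(n_per_category)]
    words = random.sample(["apple", "orbit", "plural", "canyon", "velvet"], n_per_category)
    return integers + char_combos + words

def remap_labels(class_names: list[str]) -> dict[str, str]:
    # Each natural language class name gets a distinct random symbol from the pool.
    pool = sample_symbol_pool()
    symbols = random.sample(pool, len(class_names))
    return dict(zip(class_names, symbols))

print(remap_labels(["entailment", "contradiction", "neutral"]))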

For our experiments, we symbol tune Flan-PaLM, the instruction-tuned variants of PaLM. We use three different sizes of Flan-PaLM models: Flan-PaLM-8B, Flan-PaLM-62B, and Flan-PaLM-540B. We also tested Flan-cont-PaLM-62B (Flan-PaLM-62B at 1.3T tokens instead of 780B tokens), which we abbreviate as 62B-c.

We use a set of ∼300K arbitrary symbols from three categories (integers, character combinations, and words). ∼30K symbols are used during tuning and the rest are held out for evaluation.

Experimental setup

We want to evaluate a model’s ability to perform unseen tasks, so we cannot evaluate on tasks used in symbol tuning (22 datasets) or used during instruction tuning (1.8K tasks). Hence, we choose 11 NLP datasets that were not used during fine-tuning.


In-context learning

In the symbol-tuning procedure, models must learn to reason with in-context examples in order to successfully perform tasks because prompts are modified to ensure that tasks cannot simply be learned from relevant labels or instructions. Symbol-tuned models should perform better in settings where tasks are unclear and require reasoning between in-context examples and their labels. To explore these settings, we define four in-context learning settings that vary the amount of reasoning required between inputs and labels in order to learn the task (based on the availability of instructions/relevant labels).

Depending on the availability of instructions and relevant natural language labels, models may need to do varying amounts of reasoning with in-context examples. When these features are not available, models must reason with the given in-context examples to successfully perform the task.

Symbol tuning improves performance across all settings for models 62B and larger, with small improvements in settings with relevant natural language labels (+0.8% to +4.2%) and substantial improvements in settings without relevant natural language labels (+5.5% to +15.5%). Strikingly, when relevant labels are unavailable, symbol-tuned Flan-PaLM-8B outperforms Flan-PaLM-62B, and symbol-tuned Flan-PaLM-62B outperforms Flan-PaLM-540B. This performance difference suggests that symbol tuning can allow much smaller models to perform as well as large models on these tasks (effectively saving ∼10X inference compute).

Large-enough symbol-tuned models are better at in-context learning than baselines, especially in settings where relevant labels are not available. Performance is shown as average model accuracy (%) across eleven tasks.

Algorithmic reasoning

We also experiment on algorithmic reasoning tasks from BIG-Bench. There are two main groups of tasks: 1) List functions — identify a transformation function (e.g., remove the last element in a list) between input and output lists containing non-negative integers; and 2) simple Turing concepts — reason with binary strings to learn the concept that maps an input to an output (e.g., swapping 0s and 1s in a string).
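For concreteness, here is a toy illustration of the two task formats (these examples are ours, not drawn from BIG-Bench itself): a few input–output pairs from which the hidden rule must be induced, together with the rules an evaluator would check against.

# List functions: infer the transformation from a few input/output pairs.
list_examples = [
    ([3, 1, 4, 1, 5], [3, 1, 4, 1]),    # hidden concept: remove the last element
    ([9, 9], [9]),
]

# Simple Turing concepts: infer a mapping over binary strings.
string_examples = [
    ("0110", "1001"),                   # hidden concept: swap 0s and 1s
    ("111", "000"),
]

# The hidden rules used to verify candidate answers in this toy setup.
remove_last = lambda xs: xs[:-1]
swap_bits = lambda s: s.translate(str.maketrans("01", "10"))

assert all(remove_last(x) == y for x, y in list_examples)
assert all(swap_bits(x) == y for x, y in string_examples)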

On the list function and simple Turing concept tasks, symbol tuning results in an average performance improvement of 18.2% and 15.3%, respectively. Additionally, Flan-cont-PaLM-62B with symbol tuning outperforms Flan-PaLM-540B on the list function tasks on average, which is equivalent to a ∼10x reduction in inference compute. These improvements suggest that symbol tuning strengthens the model’s ability to learn in-context for unseen task types, as symbol tuning did not include any algorithmic data.

Symbol-tuned models achieve higher performance on list function tasks and simple Turing concept tasks. (A–E): categories of list function tasks. (F): simple Turing concepts task.

Flipped labels

In the flipped-label experiment, labels of in-context and evaluation examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override prior knowledge. Previous work has shown that while pre-trained models (without instruction tuning) can, to some extent, follow flipped labels presented in-context, instruction tuning degraded this ability.
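A minimal sketch of how such flipped in-context examples could be constructed for a binary sentiment task; the sentences and label names are illustrative.

def flip_labels(examples, label_a="positive", label_b="negative"):
    # Swap the two labels on every in-context example so that the correct answer
    # now contradicts the model's prior knowledge about sentiment.
    flipped = {label_a: label_b, label_b: label_a}
    return [(text, flipped[label]) for text, label in examples]

in_context = [
    ("What a delightful film.", "positive"),
    ("A complete waste of time.", "negative"),
]
print(flip_labels(in_context))
# [('What a delightful film.', 'negative'), ('A complete waste of time.', 'positive')]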

We see that there is a similar trend across all model sizes — symbol-tuned models are much more capable of following flipped labels than instruction-tuned models. We found that after symbol tuning, Flan-PaLM-8B sees an average improvement across all datasets of 26.5%, Flan-PaLM-62B sees an improvement of 33.7%, and Flan-PaLM-540B sees an improvement of 34.0%. Additionally, symbol-tuned models achieve average performance similar to or better than that of pre-training–only models.

Symbol-tuned models are much better at following flipped labels presented in-context than instruction-tuned models are.

Conclusion

We presented symbol tuning, a new method of tuning models on tasks where natural language labels are remapped to arbitrary symbols. Symbol tuning is based on the intuition that when models cannot use instructions or relevant labels to determine a presented task, they must instead do so by learning from in-context examples. We tuned four language models using our symbol-tuning procedure, utilizing a tuning mixture of 22 datasets and approximately 30K arbitrary symbols as labels.

We first showed that symbol tuning improves performance on unseen in-context learning tasks, especially when prompts do not contain instructions or relevant labels. We also found that symbol-tuned models were much better at algorithmic reasoning tasks, despite the lack of numerical or algorithmic data in the symbol-tuning procedure. Finally, in an in-context learning setting where inputs have flipped labels, symbol tuning (for some datasets) restores the ability to follow flipped labels that was lost during instruction tuning.


Future work

Through symbol tuning, we aim to increase the degree to which models can examine and learn from input–label mappings during in-context learning. We hope that our results encourage further work towards improving language models’ ability to reason over symbols presented in-context.


Acknowledgements

The authors of this post are now part of Google DeepMind. This work was conducted by Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc V. Le. We would like to thank our colleagues at Google Research and Google DeepMind for their advice and helpful discussions.

Source: Google AI Blog


Google Dev Library Letters: 21st Edition

Posted by Swathi Dharshna Subbaraj, Google Dev Library

In this newsletter, we highlight the best projects developed with Google technologies that have been contributed to the Google Dev Library platform. We hope this will spark some inspiration for your next project!

Highlights of the Month

In the past two months, we asked contributors to look back, revisit, and update their older Dev Library contributions as a best practice. Most contributors took the time to revise their content and incorporate recent releases. This campaign encourages developers to update their repositories with the latest Google technologies, which is advantageous to users and the broader developer community.

Here are some of the standout up-to-date projects:

  • Sheets Compose Dialogs by Maximilian Keppeler

See how an Android library offers dialogs and views for various use cases, built with Jetpack Compose for Compose projects. All dialogs and views are easy and quick to implement.

 

  • Round Corner Progress Bar by Somkiat Khitwongwattana

Progress Bar Animation
Use this extensive “Rounded Corner progress bar” library for your own Android projects. 

During the campaign, we noticed that some new projects were submitted. Here are some of the new projects from our contributors:

  • Android TV sample projects by Ademir Queiroga

Android TV Project
See Android TV sample projects covering the main topics in Android TV development; the project follows Google's best practices with a few experience-based insights.
 

  • Storage provisioning with Cloud SQL using Workload Identity by Fermin Blanco

Learn how to create a production-ready GKE cluster in a matter of seconds.

Android


Using Android’s new Credential Manager API by Priya Sindkar
Dive into this blog on how Android's new Credential Manager API provides a seamless way for your app’s users to log in with one-click solutions.  

KStore by Isuru Rajapakse
Learn how this tiny Kotlin multiplatform library assists in saving and restoring objects to and from disk using kotlinx.coroutines, kotlinx.serialization, and okio.

DevBricksX by Nan YE
Discover how DevBricksX, a remake and extended version of DevBricks, covers various aspects of daily development, from low-level database tasks to user interface design, eliminating the need for repetitive work.

Dose app by Waseef Akhtar
Learn how Dose, a reminder app for people to take their medications on time, was built using Kotlin and Jetpack Compose with MVVM + clean architecture.  

Compose_adaptive_scaffold by Thomas Künneth
Explore how to write Jetpack Compose apps that support large screens and foldables.  

Cloud


Troubleshooting reachability with a Network Intelligence Center connectivity test by Gaurav Madan
Learn why network troubleshooting becomes crucial when time is of the essence, and how to do it efficiently.

From data chaos to data insights with Google Cloud and GitLab CI: A cutting-edge solution by Gursimar Singh
Take a look at a streamlined, effective approach to acquire important insights from data and learn how to deal with the turmoil of manual data deployment and analysis easily.  

Machine Learning


Client-side in-decent content checking
Discover a JavaScript library to help you quickly identify unseemly images; all in the client's browser.  

YoloV7 in Tensorflow.js by Hugo Zanini
Learn object detection using YoloV7 in TensorFlow.js, and how it’s trained on the MS COCO dataset to recognize up to 80 different classes.

Flutter


Exploring Inherited Widget: The powerful state management solution by Muhammad Salman
Take a deep dive into the backstory of state management in Flutter and explore one of the most important concepts in Flutter state management, the Inherited Widget.  

Control your Flutter app on the fly with Firebase Remote Config by Mangirdas Kazlauskas
Flutter Forward agenda app
Get an overview of Firebase Remote Config and learn how to use it to enable real-time features in your Flutter application.

The ultimate Flutter Navigator 2.0 series using AutoRoute by Cavin Macwan
Explore the differences between Navigator 1.0 and 2.0 and why you need Navigator 2.0. You’ll also learn how you can implement Navigator 2.0 using the Auto Route package in Flutter.  

Angular


Papanasi (UI library) by Quique Fdez Guerra
Learn to use this frontend UI library across frameworks.  

How to manage complex forms in Angular by Roland Tubongye Wabubindja
See how to save and modify data from a form containing several FormArray.  

Community Updates


🚀 Announcing Google Maps Platform added to Dev Library

Google Maps Platform in Dev Library

Google Maps Platform has now been officially added to the Dev Library! With these resources, developers can create applications that enable them to visualize geospatial data and build projects ranging from hyperlocal logistics to location-driven app development, and have access to even more resources to take their projects to the next level.

Dev Library contributors will be better able to write and create innovative and useful applications that utilize Google’s mapping, places, and routing data and features.

Visit the Google Maps Platform product page in Dev Library



Browse Dev Library | Google Developers Online on Discord | Newsletter Archives

An open-source gymnasium for machine learning assisted computer architecture design

Computer Architecture research has a long history of developing simulators and tools to evaluate and shape the design of computer systems. For example, the SimpleScalar simulator was introduced in the late 1990s and allowed researchers to explore various microarchitectural ideas. Computer architecture simulators and tools, such as gem5, DRAMSys, and many more have played a significant role in advancing computer architecture research. Since then, these shared resources and infrastructure have benefited industry and academia and have enabled researchers to systematically build on each other's work, leading to significant advances in the field.

Nonetheless, computer architecture research is evolving, with industry and academia turning towards machine learning (ML) optimization to meet stringent domain-specific requirements, such as ML for computer architecture, ML for TinyML acceleration, DNN accelerator datapath optimization, memory controllers, power consumption, security, and privacy. Although prior work has demonstrated the benefits of ML in design optimization, the lack of strong, reproducible baselines hinders fair and objective comparison across different methods and poses several challenges to their deployment. To ensure steady progress, it is imperative to understand and tackle these challenges collectively.

To alleviate these challenges, in “ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design”, accepted at ISCA 2023, we introduced ArchGym, which includes a variety of computer architecture simulators and ML algorithms. Enabled by ArchGym, our results indicate that with a sufficiently large number of samples, any of a diverse collection of ML algorithms is capable of finding the optimal set of architecture design parameters for each target problem; no one solution is necessarily better than another. These results further indicate that selecting the optimal hyperparameters for a given ML algorithm is essential for finding the optimal architecture design, but choosing them is non-trivial. We release the code and dataset across multiple computer architecture simulations and ML algorithms.


Challenges in ML-assisted architecture research

ML-assisted architecture research poses several challenges, including:

  1. For a specific ML-assisted computer architecture problem (e.g., finding an optimal solution for a DRAM controller) there is no systematic way to identify optimal ML algorithms or hyperparameters (e.g., learning rate, warm-up steps, etc.). There is a wide range of ML and heuristic methods, from random walk to reinforcement learning (RL), that can be employed for design space exploration (DSE). While these methods have shown noticeable performance improvement over their choice of baselines, it is not evident whether the improvements are because of the choice of optimization algorithms or hyperparameters.

    Thus, to ensure reproducibility and facilitate widespread adoption of ML-aided architecture DSE, it is necessary to outline a systematic benchmarking methodology.

  2. While computer architecture simulators have been the backbone of architectural innovations, there is an emerging need to address the trade-offs between accuracy, speed, and cost in architecture exploration. The accuracy and speed of performance estimation vary widely from one simulator to another, depending on the underlying modeling details (e.g., cycle-accurate vs. ML-based proxy models). While analytical or ML-based proxy models are nimble by virtue of discarding low-level details, they generally suffer from high prediction error. Also, due to commercial licensing, there can be strict limits on the number of runs collected from a simulator. Overall, these constraints exhibit distinct performance vs. sample efficiency trade-offs, affecting the choice of optimization algorithm for architecture exploration.

    It is challenging to delineate how to systematically compare the effectiveness of various ML algorithms under these constraints.

  3. Finally, the landscape of ML algorithms is rapidly evolving and some ML algorithms need data to be useful. Additionally, rendering the outcome of DSE into meaningful artifacts such as datasets is critical for drawing insights about the design space.

    In this rapidly evolving ecosystem, it is important to determine how to amortize the overhead of search algorithms for architecture exploration. It is not apparent, nor has it been systematically studied, how to leverage exploration data while remaining agnostic to the underlying search algorithm.

ArchGym design

ArchGym addresses these challenges by providing a unified framework for evaluating different ML-based search algorithms fairly. It comprises two main components: 1) the ArchGym environment and 2) the ArchGym agent. The environment is an encapsulation of the architecture cost model — which includes latency, throughput, area, energy, etc., to determine the computational cost of running the workload, given a set of architectural parameters — paired with the target workload(s). The agent is an encapsulation of the ML algorithm used for the search and consists of hyperparameters and a guiding policy. The hyperparameters are intrinsic to the algorithm for which the model is to be optimized and can significantly influence performance. The policy, on the other hand, determines how the agent selects a parameter iteratively to optimize the target objective.

Notably, ArchGym also includes a standardized interface that connects these two components, while also saving the exploration data as the ArchGym Dataset. At its core, the interface entails three main signals: hardware state, hardware parameters, and metrics. These signals are the bare minimum to establish a meaningful communication channel between the environment and the agent. Using these signals, the agent observes the state of the hardware and suggests a set of hardware parameters to iteratively optimize a (user-defined) reward. The reward is a function of hardware performance metrics, such as performance, energy consumption, etc. 
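The toy loop below sketches how these pieces could fit together. The class names, parameter names, and cost model are stand-ins we invented for illustration; ArchGym's actual API may differ.

import random

class ToyCostModelEnv:
    # Stand-in for an ArchGym environment: maps hardware parameters to metrics.
    # (A real environment would wrap a simulator such as DRAMSys.)
    def reset(self):
        return {"last_params": None}                        # hardware state signal

    def step(self, params):
        latency = (params["num_banks"] - 8) ** 2 + params["queue_depth"] * 0.1
        metrics = {"latency": latency}                      # metrics signal
        return {"last_params": params}, metrics             # new state + metrics

class RandomWalkAgent:
    # Stand-in for an ArchGym agent: a guiding policy plus hyperparameters.
    def __init__(self, hyperparameters):
        self.hyperparameters = hyperparameters
        random.seed(hyperparameters["seed"])

    def suggest(self, state):
        return {"num_banks": random.choice([2, 4, 8, 16]),  # hardware parameters signal
                "queue_depth": random.randint(1, 64)}

env, agent = ToyCostModelEnv(), RandomWalkAgent({"num_steps": 100, "seed": 0})
dataset, state = [], env.reset()
for _ in range(agent.hyperparameters["num_steps"]):
    params = agent.suggest(state)
    state, metrics = env.step(params)
    reward = -metrics["latency"]                            # user-defined reward over metrics
    dataset.append((params, metrics, reward))               # exploration log ("ArchGym Dataset")
print(max(dataset, key=lambda row: row[2]))                 # best design found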

ArchGym comprises two main components: the ArchGym environment and the ArchGym agent. The ArchGym environment encapsulates the cost model and the agent is an abstraction of a policy and hyperparameters. With a standardized interface that connects these two components, ArchGym provides a unified framework for evaluating different ML-based search algorithms fairly while also saving the exploration data as the ArchGym Dataset.

ML algorithms could be equally favorable to meet user-defined target specifications

Using ArchGym, we empirically demonstrate that across different optimization objectives and DSE problems, at least one set of hyperparameters exists that results in the same hardware performance as other ML algorithms. A poorly selected (random selection) hyperparameter for the ML algorithm or its baseline can lead to a misleading conclusion that a particular family of ML algorithms is better than another. We show that with sufficient hyperparameter tuning, different search algorithms, even random walk (RW), are able to identify the best possible reward. However, note that finding the right set of hyperparameters may require exhaustive search or even luck to make it competitive.

With a sufficient number of samples, there exists at least one set of hyperparameters that results in the same performance across a range of search algorithms. Here the dashed line represents the maximum normalized reward. Cloud-1, cloud-2, stream, and random indicate four different memory traces for DRAMSys (DRAM subsystem design space exploration framework).

Dataset construction and high-fidelity proxy model training

Creating a unified interface using ArchGym also enables the creation of datasets that can be used to design better data-driven ML-based proxy architecture cost models to improve the speed of architecture simulation. To evaluate the benefits of datasets in building an ML model to approximate architecture cost, we leverage ArchGym’s ability to log the data from each run from DRAMSys to create four dataset variants, each with a different number of data points. For each variant, we create two categories: (a) Diverse Dataset, which represents the data collected from different agents (ACO, GA, RW, and BO), and (b) ACO only, which shows the data collected exclusively from the ACO agent, both of which are released along with ArchGym. We train a proxy model on each dataset using random forest regression with the objective to predict the latency of designs for a DRAM simulator. Our results show that:

  1. As we increase the dataset size, the average normalized root mean squared error (RMSE) slightly decreases.
  2. However, as we introduce diversity in the dataset (e.g., collecting data from different agents), we observe 9× to 42× lower RMSE across different dataset sizes.

Diverse dataset collection across different agents using ArchGym interface.
The impact of a diverse dataset and dataset size on the normalized RMSE.
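A minimal sketch of that proxy-model setup using scikit-learn's random forest regressor; the synthetic features and latency response below are placeholders standing in for the released ArchGym exploration data.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder exploration data: rows of architecture parameters and a "simulated" latency.
rng = np.random.default_rng(0)
params = rng.uniform(size=(5_000, 8))                       # e.g., normalized DRAM knobs
latency = (params[:, 0] * 2 + params[:, 1] ** 2
           + 0.1 * rng.normal(size=len(params)))            # unknown simulator response

x_train, x_test, y_train, y_test = train_test_split(params, latency, random_state=0)
proxy = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)

pred = proxy.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
norm_rmse = rmse / (y_test.max() - y_test.min())            # normalized RMSE, as in the post
print(f"normalized RMSE: {norm_rmse:.4f}")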

The need for a community-driven ecosystem for ML-assisted architecture research

While ArchGym is an initial effort towards creating an open-source ecosystem that (1) connects a broad range of search algorithms to computer architecture simulators in a unified and easy-to-extend manner, (2) facilitates research in ML-assisted computer architecture, and (3) forms the scaffold to develop reproducible baselines, there are a lot of open challenges that need community-wide support. Below we outline some of the open challenges in ML-assisted architecture design. Addressing these challenges requires a well-coordinated effort and a community-driven ecosystem.

Key challenges in ML-assisted architecture design.

We call this ecosystem Architecture 2.0. We outline the key challenges and a vision for building an inclusive ecosystem of interdisciplinary researchers to tackle the long-standing open problems in applying ML for computer architecture research. If you are interested in helping shape this ecosystem, please fill out the interest survey.


Conclusion

ArchGym is an open source gymnasium for ML-assisted architecture DSE and enables a standardized interface that can be readily extended to suit different use cases. Additionally, ArchGym enables fair and reproducible comparison between different ML algorithms and helps to establish stronger baselines for computer architecture research problems.

We invite the computer architecture community as well as the ML community to actively participate in the development of ArchGym. We believe that the creation of a gymnasium-type environment for computer architecture research would be a significant step forward in the field and provide a platform for researchers to use ML to accelerate research and lead to new and innovative designs.


Acknowledgements

This blogpost is based on joint work with several co-authors at Google and Harvard University. We would like to acknowledge and highlight Srivatsan Krishnan (Harvard) who contributed several ideas to this project in collaboration with Shvetank Prakash (Harvard), Jason Jabbour (Harvard), Ikechukwu Uchendu (Harvard), Susobhan Ghosh (Harvard), Behzad Boroujerdian (Harvard), Daniel Richins (Harvard), Devashree Tripathy (Harvard), and Thierry Thambe (Harvard).  In addition, we would also like to thank James Laudon, Douglas Eck, Cliff Young, and Aleksandra Faust for their support, feedback, and motivation for this work. We would also like to thank John Guilyard for the animated figure used in this post. Amir Yazdanbakhsh is now a Research Scientist at Google DeepMind and Vijay Janapa Reddi is an Associate Professor at Harvard.




Source: Google AI Blog


MediaPipe: Enhancing Virtual Humans to be more realistic

A guest post by the XR Development team at KDDI & Alpha-U

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, KDDI.

AI generated rendering of virtual human ‘Metako’
KDDI is integrating text-to-speech & Cloud Rendering to virtual human ‘Metako’

VTubers, or virtual YouTubers, are online entertainers who use a virtual avatar generated using computer graphics. This digital trend originated in Japan in the mid-2010s, and has become an international online phenomenon. A majority of VTubers are English- and Japanese-speaking YouTubers or live streamers who use avatar designs.

KDDI, a telecommunications operator in Japan with over 40 million customers, wanted to experiment with various technologies built on its 5G network but found that getting accurate movements and human-like facial expressions in real-time was challenging.


Creating virtual humans in real-time

Announced at Google I/O 2023 in May, the MediaPipe Face Landmarker solution detects facial landmarks and outputs blendshape scores to render a 3D face model that matches the user. With the MediaPipe Face Landmarker solution, KDDI and the Google Partner Innovation team successfully brought realism to their avatars.


Technical Implementation

Using MediaPipe's powerful and efficient Python package, KDDI developers were able to detect the performer’s facial features and extract 52 blendshapes in real time.

import time

import mediapipe as mp
from mediapipe.tasks import python as mp_python

MP_TASK_FILE = "face_landmarker_with_blendshapes.task"


class FaceMeshDetector:

    def __init__(self):
        # Load the Face Landmarker model and configure it for live-stream
        # inference with blendshape output.
        with open(MP_TASK_FILE, mode="rb") as f:
            f_buffer = f.read()
        base_options = mp_python.BaseOptions(model_asset_buffer=f_buffer)
        options = mp_python.vision.FaceLandmarkerOptions(
            base_options=base_options,
            output_face_blendshapes=True,
            output_facial_transformation_matrixes=True,
            running_mode=mp.tasks.vision.RunningMode.LIVE_STREAM,
            num_faces=1,
            result_callback=self.mp_callback)
        self.model = mp_python.vision.FaceLandmarker.create_from_options(
            options)

        self.landmarks = None
        self.blendshapes = None
        self.latest_time_ms = 0

    def mp_callback(self, mp_result, output_image, timestamp_ms: int):
        # Called asynchronously with the latest detection result.
        if len(mp_result.face_landmarks) >= 1 and len(
                mp_result.face_blendshapes) >= 1:
            self.landmarks = mp_result.face_landmarks[0]
            self.blendshapes = [b.score for b in mp_result.face_blendshapes[0]]

    def update(self, frame):
        # Feed a new camera frame to the detector, using a monotonically
        # increasing timestamp as required by live-stream mode.
        t_ms = int(time.time() * 1000)
        if t_ms <= self.latest_time_ms:
            return

        frame_mp = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame)
        self.model.detect_async(frame_mp, t_ms)
        self.latest_time_ms = t_ms

    def get_results(self):
        return self.landmarks, self.blendshapes

The Firebase Realtime Database stores a collection of 52 blendshape float values. Each row corresponds to a specific blendshape, listed in order.

_neutral, browDownLeft, browDownRight, browInnerUp, browOuterUpLeft, ...

These blendshape values are continuously updated in real-time as the camera is open and the FaceMesh model is running. With each frame, the database reflects the latest blendshape values, capturing the dynamic changes in facial expressions as detected by the FaceMesh model.

Screenshot of realtime Database

After extracting the blendshapes data, the next step involves transmitting it to the Firebase Realtime Database. Leveraging this advanced database system ensures a seamless flow of real-time data to the clients, eliminating concerns about server scalability and enabling KDDI to focus on delivering a streamlined user experience.

import concurrent.futures
import time

import cv2
import firebase_admin
import mediapipe as mp
import numpy as np
from firebase_admin import credentials, db

# Thread pool for non-blocking writes to the Realtime Database.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

cred = credentials.Certificate('your-certificate.json')
firebase_admin.initialize_app(
    cred, {
        'databaseURL': 'https://your-project.firebasedatabase.app/'
    })
ref = db.reference('projects/1234/blendshapes')


def main():
    facemesh_detector = FaceMeshDetector()
    cap = cv2.VideoCapture(0)

    while True:
        ret, frame = cap.read()

        facemesh_detector.update(frame)
        landmarks, blendshapes = facemesh_detector.get_results()
        if (landmarks is None) or (blendshapes is None):
            continue

        # Push the latest 52 blendshape scores, keyed by index.
        blendshapes_dict = {k: v for k, v in enumerate(blendshapes)}
        exe = pool.submit(ref.set, blendshapes_dict)

        cv2.imshow('frame', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()
    exit()


if __name__ == "__main__":
    main()

 

From there, developers seamlessly transmit the blendshapes data from the Firebase Realtime Database to Google Cloud's Immersive Stream for XR instances in real time. Google Cloud’s Immersive Stream for XR is a managed service that runs Unreal Engine projects in the cloud, and renders and streams immersive, photorealistic 3D and augmented reality (AR) experiences to smartphones and browsers in real time.

This integration enables KDDI to drive character face animation and achieve real-time streaming of facial animation with minimal latency, ensuring an immersive user experience.

Illustrative example of how KDDI transmits data from the Firebase Realtime Database to Google Cloud Immersive Stream for XR in real time to render and stream photorealistic 3D and AR experiences like character face animation with minimal latency

On the Unreal Engine side, which runs on Immersive Stream for XR, we use the Firebase C++ SDK to seamlessly receive data from Firebase. By establishing a database listener, we can retrieve blendshape values the moment updates occur in the Firebase Realtime Database. This integration allows for real-time access to the latest blendshape data, enabling dynamic and responsive facial animation in Unreal Engine projects.
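
In production the Unreal Engine client receives these updates through the Firebase C++ SDK, but the same listener pattern can be illustrated with the Python Admin SDK used earlier. The snippet below is a minimal sketch rather than KDDI's implementation; it assumes the Firebase app has already been initialized with credentials, and it reuses the projects/1234/blendshapes path from the snippet above.

from firebase_admin import db

blendshapes_ref = db.reference('projects/1234/blendshapes')


def on_blendshapes_changed(event):
    # event.data holds the full dict of 52 scores on the initial snapshot,
    # or the changed subtree for incremental updates.
    print(event.event_type, event.path, event.data)
    # A rendering client would map these scores onto its face rig here.


# listen() invokes the callback on a background thread; close the returned
# registration to stop receiving updates.
registration = blendshapes_ref.listen(on_blendshapes_changed)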

Screenshot of Modify Curve node in use in Unreal Engine

After retrieving blendshape values from the Firebase SDK, we can drive the face animation in Unreal Engine by using the "Modify Curve" node in the animation blueprint. Each blendshape value is assigned to the character individually on every frame, allowing for precise and real-time control over the character's facial expressions.

Flowchart demonstrating how BlendshapesReceiver handles the database connection, authentication, and continuous data reception

An effective approach for implementing a realtime database listener in Unreal Engine is to utilize the GameInstance Subsystem, which serves as an alternative to the singleton pattern. This allows for the creation of a dedicated BlendshapesReceiver instance responsible for handling the database connection, authentication, and continuous data reception in the background.

By leveraging the GameInstance Subsystem, the BlendshapesReceiver instance can be instantiated and maintained throughout the lifespan of the game session. This ensures a persistent database connection while the animation blueprint reads and drives the face animation using the received blendshape data.

Using just a local PC running MediaPipe, KDDI succeeded in capturing the real performer’s facial expression and movement, and created high-quality 3D re-target animation in real time.

Flow chart showing how a real performer's facial expression and movement being captured and run through MediaPipe on a Local PC, and the high quality 3D re-target animation being rendered in real time by KDDI
      

KDDI is collaborating with developers of Metaverse anime fashion like Adastria Co., Ltd.


Getting started

To learn more, watch Google I/O 2023 sessions: Easy on-device ML with MediaPipe, Supercharge your web app with machine learning and MediaPipe, What's new in machine learning, and check out the official documentation over on developers.google.com/mediapipe.


What’s next?

This MediaPipe integration is one example of how KDDI is eliminating the boundary between the real and virtual worlds, allowing users to enjoy everyday experiences such as attending live music performances, enjoying art, having conversations with friends, and shopping―anytime, anywhere. 

KDDI’s αU provides services for the Web3 era, including the metaverse, live streaming, and virtual shopping, shaping an ecosystem where anyone can become a creator, supporting the new generation of users who effortlessly move between the real and virtual worlds.

Developers Share How They Built Their Careers: From Machine Learning to Cloud

Posted by Lyanne Alfaro, DevRel Program Manager, Google Developer Studio

Google Developer Student Club Alums Reflect On Their Journey To Google Developer Experts

Developer Journey is a monthly series highlighting diverse and global developers sharing relatable challenges, opportunities, and wins in their journey. Every month, we will spotlight developers around the world, the Google tools they leverage, and the kind of products they are building.

This month, we spoke with several Google Developer Experts to learn more about their path from being Google Developer Student Clubs leads to connoisseurs of their craft.


Suvaditya Mukherjee

Headshot of Suvaditya Mukherjee smiling
Mumbai, Maharashtra, India
Google Developer Expert, Machine Learning
Google Summer of Code Org Admin + ML Research Engineer Intern at Ivy
Research Intern at IIIT-Hyderabad

What are some key skills and knowledge you gained as a Google Developer Student Clubs Lead that helped you excel in your role as a Google Developer Expert?

Every day I spent as a lead was a learning experience, but what stood out to me was the holistic learning opportunities that the program brought. For example, as someone specializing in AI, I never found a need to learn Web Development until I had to help audit and create complex web apps for hosting competitions. Additionally, I learned how to absorb newer technical skills as quickly as possible, which proved to be incredibly valuable over time. I also learned the importance of soft skills, which helped me communicate better with my community. As an expert, it’s important to steward your community, and the leadership skills imparted by the program helped me build a deeper understanding of communication, logistics, and team-building.

What has been the impact of being part of the Google Developer Student Clubs community on your personal and professional growth?

As a Google Developer Student Clubs (GDSC) Lead, I benefited from participating in networking opportunities with like-minded folks and potential mentors who helped immensely in my journey. They helped shape my technical skills, and improve my soft skills. I also had the opportunity to speak in front of large crowds, develop content, manage teams, and closely understand what makes a community tick. As a GDE, it becomes important to have a pulse on the community's needs and requirements. The GDSC Program taught me how to measure these metrics at a grassroots level. I have had the privilege of working with the most skilled, dedicated, professional – and most importantly – humble folks as part of the GDSC Community. The program allowed me the privilege of communicating and building friendships with awesome people over time.

What Google tools have you used to build?

I have used quite a few Google tools in different projects and endeavors, including but not limited to Firebase, Flutter, and Android for hackathons. I have also made use of the Google Cloud Platform to develop and host scalable backend infrastructures during projects and internships in different places. But my most used tool is TensorFlow.

Which tool has been your favorite? Why?

As an ML Practitioner, TensorFlow and Keras have been a boon to simplify days of work into potentially hours or even minutes. The power it delivers to end-users in the most open and democratic way possible while constantly innovating for newer advances is something I have always appreciated. One of the biggest reasons I love Keras has to be the awesome community around it that welcomes everyone with open arms.

Tell us about something you've built in the past using Google tools.

I have hacked around a few projects over time. The most notable among them was an application I personally call TranscribeMate. Imagine you’re in an ongoing lecture and the professor is going quicker than usual, hindering your ability to take notes. TranscribeMate (built with Flutter, Firebase, and MLKit) lets you use OCR technology to transcribe notes from simple photos of the classroom blackboard, add new annotations as a note-taking application, and save them for later use. This was an application I developed for a college course, but I ended up tweaking it a bit more and using it on my personal device for more general tasks as well.

What will you create with Google Bard?

I have been using Bard for a while now; it has a permanent home on my browser. Bard helps me with random questions I have, and Python-related problems. Bard has helped me find solutions in seconds, compared to hours of work when done through traditional search methods. I have been using Bard's help on several projects I have been working on within my research, in projects at Ivy, and the Keras Team. Stay tuned for what comes next!

What advice would you give someone starting in their developer journey?

Seek new experiences to learn. No one can learn by working within a narrow niche. Having a working knowledge of different technologies at once allows you to have a diverse and multi-faceted approach to problem-solving. Optimizations in your systems become far more apparent, and you slowly end up learning how to write better code and design scalable systems with ease. Lastly, find a community. Find like-minded folks, talk to them, share notes on what you're building, and if you find yourself too shy to do so, then try anyway. Start by just showing up for one event near you. Then make it two. Then ask a question. The power of collaborative learning is immeasurable.


Veronica Putri Anggraini

Headshot of Veronica Putri Anggraini, smiling
Jakarta, Indonesia
Google Developer Expert, Android
GDSC Semarang State Polytechnic Lead Alumni (2017)
Google Developer Group
Women Techmakers Ambassador
Software Engineer Android, @ eWIDEPLUS

What are some key skills and knowledge you gained as a Google Developer Student Clubs Lead that helped you excel in your role as a Google Developer Expert?

Through GDSC, I learned a lot about Android technology, practiced building Android projects, and ran workshops for our members every week. This process improved my technical, writing, problem-solving, and public speaking skills at the same time. I started presenting as a student with a small group workshop of 5-10 people and grew to speaking in front of 1,000 people. This was also one of the necessary criteria to become a GDE.

Can you share some insights on the impact of being part of the Google Developer Student Clubs community on your personal and professional growth?

Exploring different resources while I was a student helped me develop sample app portfolios. I feel like I actually started my professional career as a curriculum developer and trainer for mobile development. I got an offer when I was a speaker at a tech event that discussed Android technology through the GDSC program. In fact, the CEO immediately offered the position after the event ended.

What Google tools have you used to build?

I have explored Jetpack Compose extensively. I currently work closely with CameraX, the AndroidX libraries, Google Analytics, and the Maps API.

Which tool has been your favorite? Why?

CameraX is one of my favorites, because it automatically manages camera resources and avoids unnecessary background work, so I get better performance.

Tell us about something you've built in the past using Google tools.

At my current company, we build a digital banking app natively. It lets users complete liveness-verified onboarding, use QRPay, receive personalized promo campaigns, and access other financial services that we build using Google tools.

What advice would you give someone starting in their developer journey?

Gain experience dealing with issues in the stack you focus on. Be consistent in learning, and don't give up easily when stuck. In other words, be the person that says: "Challenge Accepted".

You should know that learning together is more fun than learning alone, so join the community and learn everything you need and extend your network.


Anubhav Singh

Headshot of Anubhav Singh, smiling
Prayagraj, India
Google Developer Expert, Firebase
GDSC NSEC Kolkata Lead Alumni (2019-20)
GDG Cloud Kolkata Organizer & TFUG Kolkata Co-Organizer
Co-founder, Dynopii

What are some key skills and knowledge you gained as a Google Developer Student Clubs Lead that helped you excel in your role as a Google Developer Expert?

A major part of being a Google Developer Student Clubs Lead was to enable growth for those around me by learning together. I would often find myself guiding club members on various fronts – sometimes by taking knowledge-sharing sessions on technical topics, sometimes by diving deep into their projects’ code to help them overcome challenges they were facing and sometimes creating videos or written content for them to be able to follow along later.

Through partaking in these activities, I learned public speaking skills, mentoring, and how to be helpful to others experiencing roadblocks. These skills have proved important in my role as a Google Developer Expert.

What has been the impact of being part of the Google Developer Student Clubs community on your personal and professional growth?

Being a GDSC Lead helped me further steer teams with the same passion I have for building communities. As a GDSC Lead, you get to connect with a lot of amazing people. The community itself is highly diverse and vibrant. When I was organizing a workshop for the club during my time as a GDSC Lead, I was fortunate to meet two individuals who later became the co-founders of my startup. In that same club, three of our members became Google Developer Experts in the fields of their interest. Thus, being a GDSC Lead has had a very positive impact on both my professional and personal growth.

What Google tools have you used to build?

I’ve been working in the software development field for almost 12 years now and have used several Google tools over the years, including some that no longer exist. Some of the currently available tools that I most often work with are:

  1. Google Cloud Platform: Cloud Run, Cloud Functions, Cloud Firestore, Cloud Workflows, GKE, GCE, App Engine, Vertex AI and other AI based products, etc.
  2. Google Postmaster Tools, Search Console Tools, Analytics, Pagespeed Insights
  3. TensorFlow, Keras
  4. Google Maps API
  5. Firebase
  6. reCaptcha

Which tool has been your favorite? Why?

Firebase, hands down. As someone who loves building solutions that are useful to people, Firebase has been my go-to tool for prototyping solutions and MVPs rapidly. I’ve used it to build some simple tools which have been used by thousands of people over the years - all hosted for free and delivered with blazing speed! Even today, during my sessions as a GDE, I always use Firebase to build the UI part of the demo applications I present during the talk.

Tell us about something you've built in the past using Google tools.

I built Fireshort - a URL shortener solution running purely on Firebase. This project is completely open source and has been used by several companies as a base for their in-house URL shortening needs. I’ve been working on the next version of this project at Linkborg.

I’ve also built several real-time updating monitoring products using Firebase and Pub/Sub, mostly for enterprise clients.

As a proof of concept, I also built KolPay, a completely event-driven clone of EasyCard (an RFID-based payment wallet), using Firebase, Pub/Sub, Cloud Firestore, and Cloud Functions, along with hardware components like a Raspberry Pi and an RFID reader and cards.

What will you create with Google Bard?

Building with Google Bard is an exciting prospect. It will be fun to no longer have to write the repetitive parts of code which I need whenever I am setting up a new project or a module within an existing project. Since I spend a lot of my day coding, I will be very happy to automate parts of it and having an AI do that would be amazing!

What advice would you give someone starting in their developer journey?

Starting a developer journey can be a daunting prospect - everyone’s talking about AI and everyone wants to build the next viral thing. If you are new to this field, step back, relax and start building a solution to any problem that has irked you for a long time. While you’re at it - read a lot of tech blogs about solving that problem, become a part of developer communities, either virtual or in person, and meet people who will share their insights about building similar products.


Kartik Derasari

Headshot of Kartik Derasari, smiling
Ahmedabad, Gujarat, India
Google Developer Expert, Google Cloud
GDSC Silver Oak University Lead Alumni (2020-2021)
Google Developers Group Cloud Organizer
Full-Stack Engineer at Persistent

What are some key skills and knowledge you gained as a Google Developer Student Clubs Lead that helped you excel in your role as a Google Developer Expert?

As a GDSC Lead, I’ve had the opportunity to collaborate with Googlers, Google Developer Experts, and Google Developer Groups Community Leads on various projects, which helped me explore different technologies and choose what’s best for me. Knowledge sharing and public speaking are what I learned from the Google Developer Experts. Since then, I started my journey as a technical speaker, sharing my learnings on Machine Learning & TensorFlow, Web, Firebase, and Google Cloud. I also had the opportunity to share my learnings at conferences like DevFest, Google Cloud Community Days, and GDSC WOW. These are some of the learnings that really helped shape me as a Google Developer Expert and excel in my journey.

Can you share some insights on the impact of being part of the Google Developer Student Clubs community on your personal and professional growth?

Being a GDSC Lead created a positive impact in my personal and professional journey. I came in touch with the tech community and I learned about Google Developer Groups & Google Developer Experts programs. I started volunteering for the GDG Cloud Ahmedabad chapter during my GDSC tenure and later I became one of the Community Organizers. I also started collaborating with Google Developer Experts on Web, Firebase, and Machine Learning projects and made some open-source contributions.

Everyone from the community was so welcoming and helpful. I’d highly recommend everyone join these developer programs by Google and get the best out of it. I also received mentorship from GDG Community Leads and Google Developer Experts for my professional career. They helped me connect with the right set of people and guided me to kick-start my professional career with MediaAgility, which is part of the Google Cloud Partner ecosystem. Since then, I have been working on Web & Google Cloud in my professional capacity and in my personal capacity as well.

I was motivated by the Google Cloud ecosystem in India and I cleared six Google Cloud Certifications, which created a huge impact in my personal and professional growth.

What Google tools have you used to build?

I started using Firebase as a Web Engineer. It has been very helpful when it comes to adding Authentication, storing application data in Firestore, and hosting web-app front-end static files over a CDN using Firebase Hosting. While building a set of web apps, I started exploring Machine Learning and used TensorFlow for building ML models for different use cases. Since then, I started using Google Cloud ML APIs and Cloud Functions for adding more functionalities to my web apps.

While working on these projects, I came across the Google Cloud Partner ecosystem and joined MediaAgility (now part of Persistent Systems) as a Full-Stack Engineer. Since then, I have been working on Google Cloud with Google Cloud PSO and enterprise customers.

Which tool has been your favorite? Why?

Cloud Run is something that I really like as an Application Developer. Since it’s a serverless compute platform, I can spend more time on building my application rather than worrying about my infrastructure. Firebase Authentication, Cloud Firestore, and Cloud Storage are also tools that I really love. They help me create full-stack apps and ship faster to production.

Tell us about something you've built in the past using Google tools. What will you create with Google Bard?

Since we’re in the wave of Generative AI right now, I have been working on building a number of apps using Google Cloud Run, BigQuery, Cloud Storage, Generative AI Studio, Model Garden on Vertex AI, and PaLM models. Recently, I built a chat application interface that provides insights from a structured enterprise data warehouse and unstructured files, along with enterprise-grade data governance and security.

What advice would you give someone starting in their developer journey?

Be a consistent learner and a persistent explorer. It’s great to cultivate a learning habit, which will help you all the way in your personal and professional journey. This will not only help you explore new things, but it will also help you master something that you really love to do. As a beginner, it would be good to start with something that you find interesting, and then you can add a flavor of other things. For example, if you find building web apps interesting, try it. When you think you’re good at it, you can add a flavor of Machine Learning to it. That’s how you explore new things and experiment with what you know.

On-device diffusion plugins for conditioned text-to-image generation

In recent years, diffusion models have shown great success in text-to-image generation, achieving high image quality, improved inference performance, and expanding our creative inspiration. Nevertheless, it is still challenging to efficiently control the generation, especially with conditions that are difficult to describe with text.

Today, we announce MediaPipe diffusion plugins, which enable controllable text-to-image generation to be run on-device. Expanding upon our prior work on GPU inference for on-device large generative models, we introduce new low-cost solutions for controllable text-to-image generation that can be plugged into existing diffusion models and their Low-Rank Adaptation (LoRA) variants.

Text-to-image generation with control plugins running on-device.

Background

With diffusion models, image generation is modeled as an iterative denoising process. Starting from a noise image, at each step the diffusion model gradually denoises the image to reveal an image of the target concept. Research shows that leveraging language understanding via text prompts can greatly improve image generation. For text-to-image generation, the text embedding is connected to the model via cross-attention layers. Yet, some information is difficult to describe by text prompts, e.g., the position and pose of an object. To address this problem, researchers add additional models into the diffusion model to inject control information from a condition image.

Common approaches for controlled text-to-image generation include Plug-and-Play, ControlNet, and T2I Adapter. Plug-and-Play applies a widely used denoising diffusion implicit model (DDIM) inversion approach that reverses the generation process starting from an input image to derive an initial noise input, and then employs a copy of the diffusion model (860M parameters for Stable Diffusion 1.5) to encode the condition from an input image. Plug-and-Play extracts spatial features with self-attention from the copied diffusion model and injects them into the text-to-image diffusion. ControlNet creates a trainable copy of the encoder of a diffusion model, which connects via a convolution layer with zero-initialized parameters to encode conditioning information that is conveyed to the decoder layers. As a result, however, it is large: half the size of the diffusion model (430M parameters for Stable Diffusion 1.5). T2I Adapter is a smaller network (77M parameters) and achieves similar effects in controllable generation. T2I Adapter only takes the condition image as input, and its output is shared across all diffusion iterations. Yet, the adapter model is not designed for portable devices.


The MediaPipe diffusion plugins

To make conditioned generation efficient, customizable, and scalable, we design the MediaPipe diffusion plugin as a separate network that is:

  • Pluggable: It can be easily connected to a pre-trained base model.
  • Trained from scratch: It does not use pre-trained weights from the base model.
  • Portable: It runs outside the base model on mobile devices, with negligible cost compared to the base model inference.
Method              Parameter Size     Pluggable     From Scratch     Portable
Plug-and-Play       860M*              ✔️
ControlNet          430M*              ✔️
T2I Adapter         77M                ✔️            ✔️
MediaPipe Plugin    6M                 ✔️            ✔️               ✔️

Comparison of Plug-and-Play, ControlNet, T2I Adapter, and the MediaPipe diffusion plugin.
* The number varies depending on the particulars of the diffusion model.

The MediaPipe diffusion plugin is a portable on-device model for text-to-image generation. It extracts multiscale features from a conditioning image, which are added to the encoder of a diffusion model at corresponding levels. When connecting to a text-to-image diffusion model, the plugin model can provide an extra conditioning signal to the image generation. We design the plugin network to be a lightweight model with only 6M parameters. It uses depth-wise convolutions and inverted bottlenecks from MobileNetv2 for fast inference on mobile devices.

Overview of the MediaPipe diffusion model plugin. The plugin is a separate network, whose output can be plugged into a pre-trained text-to-image generation model. Features extracted by the plugin are applied to the associated downsampling layer of the diffusion model (blue).
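
The post does not spell out the plugin's exact architecture, but the MobileNetV2-style building block it names (a depth-wise convolution inside an inverted bottleneck) can be sketched in Keras roughly as follows. The channel counts, strides, and number of levels below are illustrative assumptions, not the plugin's actual 6M-parameter configuration.

import tensorflow as tf
from tensorflow.keras import layers


def inverted_bottleneck(x, out_channels, expansion=4, stride=1):
    # MobileNetV2-style block: 1x1 expand -> 3x3 depth-wise -> 1x1 project.
    in_channels = x.shape[-1]
    h = layers.Conv2D(in_channels * expansion, 1, use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    h = layers.Conv2D(out_channels, 1, use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])  # residual shortcut when shapes match
    return h


# Illustrative multiscale feature extractor over a conditioning image: each
# stride-2 block halves the resolution, and the per-level outputs would be
# added to the diffusion model's encoder at the matching resolutions.
cond = tf.keras.Input(shape=(512, 512, 3))
f1 = inverted_bottleneck(layers.Conv2D(16, 3, 2, "same")(cond), 16)
f2 = inverted_bottleneck(f1, 32, stride=2)
f3 = inverted_bottleneck(f2, 64, stride=2)
plugin_net = tf.keras.Model(cond, [f1, f2, f3])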

Unlike ControlNet, we inject the same control features in all diffusion iterations. That is, we only run the plugin once for one image generation, which saves computation. We illustrate some intermediate results of a diffusion process below. The control is effective at every diffusion step and enables controlled generation even at early steps. More iterations improve the alignment of the image with the text prompt and generate more detail.

Illustration of the generation process using the MediaPipe diffusion plugin.
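
To make the run-once behavior concrete, the following schematic sketch (not the released implementation) shows a sampling loop in which the plugin is evaluated a single time and its features are reused at every denoising step. The model components are placeholder functions so the control flow can execute; in the real system they would be the control plugin, the diffusion UNet, and the noise scheduler.

import numpy as np


def plugin(condition_image):
    # Placeholder for the control plugin: returns multiscale feature maps
    # extracted from the conditioning image.
    return [np.zeros((64, 64, 32)), np.zeros((32, 32, 64)), np.zeros((16, 16, 128))]


def diffusion_unet(latents, t, text_embedding, encoder_residuals):
    # Placeholder UNet step: the control features would be added to the
    # encoder activations at matching resolutions inside the model.
    return np.zeros_like(latents)


def scheduler_step(noise_pred, t, latents):
    # Placeholder denoising update (e.g., a DDIM step).
    return latents - noise_pred / (t + 1)


def generate(condition_image, text_embedding, num_steps=20):
    # The plugin runs once per generated image; its features are reused at
    # every diffusion iteration, unlike ControlNet, which runs at each step.
    control_features = plugin(condition_image)
    latents = np.random.randn(64, 64, 4)
    for t in reversed(range(num_steps)):
        noise_pred = diffusion_unet(latents, t, text_embedding, control_features)
        latents = scheduler_step(noise_pred, t, latents)
    return latents  # decoded to an image by the VAE in the real pipeline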

Examples

In this work, we developed plugins for a diffusion-based text-to-image generation model with MediaPipe Face Landmark, MediaPipe Holistic Landmark, depth maps, and Canny edge. For each task, we select about 100K images from a web-scale image-text dataset, and compute control signals using corresponding MediaPipe solutions. We use refined captions from PaLI for training the plugins.


Face Landmark

The MediaPipe Face Landmarker task computes 478 landmarks (with attention) of a human face. We use the drawing utils in MediaPipe to render a face, including face contour, mouth, eyes, eyebrows, and irises, with different colors. The following table shows randomly generated samples conditioned on face mesh and prompts. As the comparison shows, both ControlNet and the plugin can control text-to-image generation with the given conditions.

Face-landmark plugin for text-to-image generation, compared with ControlNet.
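
The exact rendering used to build the training conditions is not published here, but rasterizing face-mesh landmarks with MediaPipe's drawing utilities can be illustrated roughly as follows, using the legacy Python solutions API with the library's default styles; the input file name is an assumption for the example.

import cv2
import mediapipe as mp
import numpy as np

mp_face_mesh = mp.solutions.face_mesh
mp_drawing = mp.solutions.drawing_utils
mp_styles = mp.solutions.drawing_styles

image = cv2.imread("face.jpg")  # illustrative input
canvas = np.zeros_like(image)   # draw the condition image on a blank canvas

with mp_face_mesh.FaceMesh(static_image_mode=True, refine_landmarks=True,
                           max_num_faces=1) as face_mesh:
    results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_face_landmarks:
    for landmarks in results.multi_face_landmarks:
        # Face contour, eyes, eyebrows, and lips.
        mp_drawing.draw_landmarks(
            canvas, landmarks, mp_face_mesh.FACEMESH_CONTOURS,
            landmark_drawing_spec=None,
            connection_drawing_spec=mp_styles.get_default_face_mesh_contours_style())
        # Irises (available because refine_landmarks=True yields 478 points).
        mp_drawing.draw_landmarks(
            canvas, landmarks, mp_face_mesh.FACEMESH_IRISES,
            landmark_drawing_spec=None,
            connection_drawing_spec=mp_styles.get_default_face_mesh_iris_connections_style())

cv2.imwrite("face_condition.png", canvas)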

Holistic Landmark

The MediaPipe Holistic Landmarker task includes landmarks of body pose, hands, and face mesh. Below, we generate various stylized images by conditioning on the holistic features.

Holistic-landmark plugin for text-to-image generation.

Depth

Depth-plugin for text-to-image generation.

Canny Edge

Canny-edge plugin for text-to-image generation.

Evaluation

We conduct a quantitative study of the face landmark plugin to demonstrate the model's performance. The evaluation dataset contains 5K human images. We compare the generation quality as measured by the widely used metrics, Fréchet Inception Distance (FID) and CLIP scores. The base model is a pre-trained text-to-image diffusion model. We use Stable Diffusion v1.5 here.

As shown in the following table, both ControlNet and the MediaPipe diffusion plugin produce much better sample quality than the base model, in terms of FID and CLIP scores. Unlike ControlNet, which needs to run at every diffusion step, the MediaPipe plugin only runs once for each image generated. We measured the performance of the three models on a server machine (with Nvidia V100 GPU) and a mobile phone (Galaxy S23). On the server, we run all three models with 50 diffusion steps, and on mobile, we run 20 diffusion steps using the MediaPipe image generation app. Compared with ControlNet, the MediaPipe plugin shows a clear advantage in inference efficiency while preserving the sample quality.

Model                       FID↓      CLIP↑     Inference Time (s)
                                                Nvidia V100       Galaxy S23
Base                        10.32     0.26      5.0               11.5
Base + ControlNet           6.51      0.31      7.4 (+48%)        18.2 (+58.3%)
Base + MediaPipe Plugin     6.50      0.30      5.0 (+0.2%)       11.8 (+2.6%)

Quantitative comparison on FID, CLIP, and inference time.

We test the performance of the plugin on a wide range of mobile devices from mid-tier to high-end. We list the results on some representative devices in the following table, covering both Android and iOS.

                  Android                                           iOS
Device            Pixel 4     Pixel 6     Pixel 7     Galaxy S23    iPhone 12 Pro     iPhone 13 Pro
Time (ms)         128         68          50          48            73                63

Inference time (ms) of the plugin on different mobile devices.

Conclusion

In this work, we present the MediaPipe diffusion plugin, a portable plugin for conditioned text-to-image generation. It injects features extracted from a condition image into a diffusion model, and consequently controls the image generation. Portable plugins can be connected to pre-trained diffusion models running on servers or devices. By running text-to-image generation and plugins fully on-device, we enable more flexible applications of generative AI.


Acknowledgments

We’d like to thank all team members who contributed to this work: Raman Sarokin and Juhyun Lee for the GPU inference solution; Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, and Matthias Grundmann for leadership. Special thanks to Jiuqiang Tang, Joe Zou, and Lu Wang, who made this technology and all the demos run on-device.

Source: Google AI Blog


Champion Innovator Elyes Manai, based in Quebec City, Quebec, Canada

Posted by Max Saltonstall, Developer Relations Engineer

In this ongoing interview series we sit down with Google Cloud Champion Innovators across the world to learn more about their journeys, their technology focus, and what excites them. Today we're talking to Elyes Manai. Elyes is a Machine Learning Consultant, Educator and Mentor. He helps companies tap into the power of data science to reduce costs and increase revenue as well as build relationships with relevant target audiences through educational content and community building.


What is a Champion Innovator?

Champion Innovators are a global network of more than 500 non-Google professionals who are technical experts in Google Cloud products and services. Each Champion specializes in one of nine technical categories: cloud AI/ML, data analytics, hybrid multi-cloud, modern architecture, security and networking, serverless app development, storage, Workspace, and databases.


What tech area has you most fascinated right now, and why?

Machine Learning: There are so many new insights we can gain from applying ML and AI to problems right now. Especially in security. I'm currently pursuing my PhD in AI applied to Cybersecurity, and am eager to teach the next generation about computer science, AI and security.

I fell into ML by accident, after trying to pursue something else in university. I had hoped to study architecture, but did not do nearly well enough in high school (in Tunisia, where I'm from). I ended up at my last choice of universities, in an IT program. And then I tried to transfer to an architecture school, but my paperwork got messed up so it didn't work out.

There I was, in a field I had not chosen, and yet I liked it. It felt pretty easy to do, I got good grades, and I realized I could make a career out of it. I liked solving problems with code, and progressed to doing web development and managing a team. From there I started thinking about what I wanted to do next.

I really love teaching, so I began looking into how to become a professor. That led me to the CS50 computer science class at Harvard, where I saw many signs pointing to a big AI trend, and so I decided to pursue a master's in computer science.


How do you like to learn new services, tools, and applications?

I dive right in; learn by doing. I frequently bounce around between subjects. I keep a list of ideas that come to me, and then when I'm ready for something new, I just scan through the list and pick one. This helps me stay fresh and excited.

Whenever I'm learning new skills I remind myself to go with the flow. I start small, learn just enough to start using the technology or tool. I'll ask myself:

  • What key concepts or pillars do I need to understand this more deeply?
  • How do I branch out from there?
  • Who should I talk to?
  • What can I make?

Since I'm in the middle of a doctoral program right now, I always challenge myself to make that idea somehow connect to my research, so I can bring it back to a common theme that's pervasive through all my work.


What are some exciting projects you have in flight right now?

Explainable AI, especially applied to less frequently used spoken languages in the world. We have a wealth of research on English language AI models, but what about applying BERT (and other technologies) on some lesser used languages, to expand the benefit to a wider population?

I'm also very excited about how we (as researchers) can optimize AI models to be more secure, be more private in terms of protecting our data, and be more useful to a wide variety of use cases.


What engages you outside of the technology world?

I love biking, and whenever it's warm enough in Québec I will go bike outside.

I like to read, especially science fiction. I recently started reading autobiographies to know more about the world from different perspectives. I'm currently reading autobiographies of Scott Kelley and Sohaila Abdulali.

I also keep a big list of ideas outside of tech for me to pursue: people to meet, foods to try, places to go. I'm always working on new experiences and adventures from that list, to broaden my world and learn more about what's all around me.


What brought you into the Innovators program?

I've been a Google Developer Expert (GDE) for two years and then got an invitation to join the Innovators program, after I attended a GDE event. It's helped me gain some respect and credibility, as I have a little bit of Google's reputation behind my voice now when I share my perspective or opinion. Also they have helped me get some great swag!


What's one thing our readers should do next?

Very few things stand the test of time, as our industry is shifting so quickly. I think CS50 on YouTube still has relevance for folks new to computer science.

I also want to encourage people to create social connections, and go meet the people behind the systems you are using. There are humans out there who can help you find the next project or position, and getting to know them is so important.


Each Champion Innovator is not affiliated with Google nor do they offer services on behalf of Google.