Search Central Live is returning to Brazil
Source: Google Search Central Blog
MobileDiffusion: Rapid text-to-image generation on-device
Text-to-image diffusion models have shown exceptional capabilities in generating high-quality images from text prompts. However, leading models feature billions of parameters and are consequently expensive to run, requiring powerful desktops or servers (e.g., Stable Diffusion, DALL·E, and Imagen). While recent advancements in inference solutions on Android via MediaPipe and iOS via Core ML have been made in the past year, rapid (sub-second) text-to-image generation on mobile devices has remained out of reach.
To that end, in “MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices”, we introduce a novel approach with the potential for rapid text-to-image generation on-device. MobileDiffusion is an efficient latent diffusion model specifically designed for mobile devices. We also adopt DiffusionGAN to achieve one-step sampling during inference, which fine-tunes a pre-trained diffusion model while leveraging a GAN to model the denoising step. We have tested MobileDiffusion on iOS and Android premium devices, and it can run in half a second to generate a 512x512 high-quality image. Its comparably small model size of just 520M parameters makes it uniquely suited for mobile deployment.
![]() |
![]() |
Rapid text-to-image generation on-device. |
Background
The relative inefficiency of text-to-image diffusion models arises from two primary challenges. First, the inherent design of diffusion models requires iterative denoising to generate images, necessitating multiple evaluations of the model. Second, the complexity of the network architecture in text-to-image diffusion models involves a substantial number of parameters, regularly reaching into the billions and resulting in computationally expensive evaluations. As a result, despite the potential benefits of deploying generative models on mobile devices, such as enhancing user experience and addressing emerging privacy concerns, it remains relatively unexplored within the current literature.
The optimization of inference efficiency in text-to-image diffusion models has been an active research area. Previous studies predominantly concentrate on addressing the first challenge, seeking to reduce the number of function evaluations (NFEs). Leveraging advanced numerical solvers (e.g., DPM) or distillation techniques (e.g., progressive distillation, consistency distillation), the number of necessary sampling steps have significantly reduced from several hundreds to single digits. Some recent techniques, like DiffusionGAN and Adversarial Diffusion Distillation, even reduce to a single necessary step.
However, on mobile devices, even a small number of evaluation steps can be slow due to the complexity of model architecture. Thus far, the architectural efficiency of text-to-image diffusion models has received comparatively less attention. A handful of earlier works briefly touches upon this matter, involving the removal of redundant neural network blocks (e.g., SnapFusion). However, these efforts lack a comprehensive analysis of each component within the model architecture, thereby falling short of providing a holistic guide for designing highly efficient architectures.
MobileDiffusion
Effectively overcoming the challenges imposed by the limited computational power of mobile devices requires an in-depth and holistic exploration of the model's architectural efficiency. In pursuit of this objective, our research undertakes a detailed examination of each constituent and computational operation within Stable Diffusion’s UNet architecture. We present a comprehensive guide for crafting highly efficient text-to-image diffusion models culminating in the MobileDiffusion.
The design of MobileDiffusion follows that of latent diffusion models. It contains three components: a text encoder, a diffusion UNet, and an image decoder. For the text encoder, we use CLIP-ViT/L14, which is a small model (125M parameters) suitable for mobile. We then turn our focus to the diffusion UNet and image decoder.
Diffusion UNet
As illustrated in the figure below, diffusion UNets commonly interleave transformer blocks and convolution blocks. We conduct a comprehensive investigation of these two fundamental building blocks. Throughout the study, we control the training pipeline (e.g., data, optimizer) to study the effects of different architectures.
In classic text-to-image diffusion models, a transformer block consists of a self-attention layer (SA) for modeling long-range dependencies among visual features, a cross-attention layer (CA) to capture interactions between text conditioning and visual features, and a feed-forward layer (FF) to post-process the output of attention layers. These transformer blocks hold a pivotal role in text-to-image diffusion models, serving as the primary components responsible for text comprehension. However, they also pose a significant efficiency challenge, given the computational expense of the attention operation, which is quadratic to the sequence length. We follow the idea of UViT architecture, which places more transformer blocks at the bottleneck of the UNet. This design choice is motivated by the fact that the attention computation is less resource-intensive at the bottleneck due to its lower dimensionality.
![]() |
Our UNet architecture incorporates more transformers in the middle, and skips self-attention (SA) layers at higher resolutions. |
Convolution blocks, in particular ResNet blocks, are deployed at each level of the UNet. While these blocks are instrumental for feature extraction and information flow, the associated computational costs, especially at high-resolution levels, can be substantial. One proven approach in this context is separable convolution. We observed that replacing regular convolution layers with lightweight separable convolution layers in the deeper segments of the UNet yields similar performance.
In the figure below, we compare the UNets of several diffusion models. Our MobileDiffusion exhibits superior efficiency in terms of FLOPs (floating-point operations) and number of parameters.
![]() |
Comparison of some diffusion UNets. |
Image decoder
In addition to the UNet, we also optimized the image decoder. We trained a variational autoencoder (VAE) to encode an RGB image to an 8-channel latent variable, with 8× smaller spatial size of the image. A latent variable can be decoded to an image and gets 8× larger in size. To further enhance efficiency, we design a lightweight decoder architecture by pruning the original’s width and depth. The resulting lightweight decoder leads to a significant performance boost, with nearly 50% latency improvement and better quality. For more details, please refer to our paper.
![]() |
VAE reconstruction. Our VAE decoders have better visual quality than SD (Stable Diffusion). |
Decoder | #Params (M) | PSNR↑ | SSIM↑ | LPIPS↓ |
SD | 49.5 | 26.7 | 0.76 | 0.037 |
Ours | 39.3 | 30.0 | 0.83 | 0.032 |
Ours-Lite | 9.8 | 30.2 | 0.84 | 0.032 |
Quality evaluation of VAE decoders. Our lite decoder is much smaller than SD, with better quality metrics, including peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). |
One-step sampling
In addition to optimizing the model architecture, we adopt a DiffusionGAN hybrid to achieve one-step sampling. Training DiffusionGAN hybrid models for text-to-image generation encounters several intricacies. Notably, the discriminator, a classifier distinguishing real data and generated data, must make judgments based on both texture and semantics. Moreover, the cost of training text-to-image models can be extremely high, particularly in the case of GAN-based models, where the discriminator introduces additional parameters. Purely GAN-based text-to-image models (e.g., StyleGAN-T, GigaGAN) confront similar complexities, resulting in highly intricate and expensive training.
To overcome these challenges, we use a pre-trained diffusion UNet to initialize the generator and discriminator. This design enables seamless initialization with the pre-trained diffusion model. We postulate that the internal features within the diffusion model contain rich information of the intricate interplay between textual and visual data. This initialization strategy significantly streamlines the training.
The figure below illustrates the training procedure. After initialization, a noisy image is sent to the generator for one-step diffusion. The result is evaluated against ground truth with a reconstruction loss, similar to diffusion model training. We then add noise to the output and send it to the discriminator, whose result is evaluated with a GAN loss, effectively adopting the GAN to model a denoising step. By using pre-trained weights to initialize the generator and the discriminator, the training becomes a fine-tuning process, which converges in less than 10K iterations.
![]() |
Illustration of DiffusionGAN fine-tuning. |
Results
Below we show example images generated by our MobileDiffusion with DiffusionGAN one-step sampling. With such a compact model (520M parameters in total), MobileDiffusion can generate high-quality diverse images for various domains.
![]() |
Images generated by our MobileDiffusion |
We measured the performance of our MobileDiffusion on both iOS and Android devices, using different runtime optimizers. The latency numbers are reported below. We see that MobileDiffusion is very efficient and can run within half a second to generate a 512x512 image. This lightning speed potentially enables many interesting use cases on mobile devices.
![]() |
Latency measurements (s) on mobile devices. |
Conclusion
With superior efficiency in terms of latency and size, MobileDiffusion has the potential to be a very friendly option for mobile deployments given its capability to enable a rapid image generation experience while typing text prompts. And we will ensure any application of this technology will be in-line with Google’s responsible AI practices.
Acknowledgments
We like to thank our collaborators and contributors that helped bring MobileDiffusion to on-device: Zhisheng Xiao, Yanwu Xu, Jiuqiang Tang, Haolin Jia, Lutz Justen, Daniel Fenner, Ronald Wotzlaw, Jianing Wei, Raman Sarokin, Juhyun Lee, Andrei Kulik, Chuo-Ling Chang, and Matthias Grundmann.
Source: Google AI Blog
Chrome Beta for Android Update
Hi everyone! We've just released Chrome Beta 122 (122.0.6261.15) for Android. It's now available on Google Play.
You can see a partial list of the changes in the Git log. For details on new features, check out the Chromium blog, and for details on web platform updates, check here.
If you find a new issue, please let us know by filing a bug.
Erhu Akpobaro
Google Chrome
Source: Google Chrome Releases
Chrome Beta for Desktop Update
The Beta channel has been updated to 122.0.6261.18 for Windows, Mac and Linux.
A partial list of changes is available in the Git log. Interested in switching release channels? Find out how. If you find a new issue, please let us know by filing a bug. The community help forum is also a great place to reach out for help or learn about common issues.
Prudhvi Bommana
Google Chrome
Source: Google Chrome Releases
Pin chat messages in Google Meet
What’s changing
Getting started
- Admins: There is no admin control for this feature.
- End users:
- Meeting participants can pin and unpin their own messages. Meeting hosts can unpin anyone’s message.
- You can pin a message by hovering over it in the chat window and selecting the pin icon. Visit the Help Center to learn more about sending chat messages.
Rollout pace
- Rapid and Scheduled Release domains: Gradual rollout (up to 15 days for feature visibility) starting on January 31, 2024
Availability
- Available to all Google Workspace customers and users with personal Google accounts
Resources
Source: Google Workspace Updates
Sunsetting Display & Video 360 Entity Read Files on October 31, 2024
On October 31, 2024, Display & Video 360 Entity Read Files (ERFs) will sunset. Public and private ERFs won’t be generated after this date. Migrate to the Display & Video 360 API to continue programmatically retrieving Display & Video 360 resource configurations.
ERFs were deprecated in June 2021. Since then, we’ve made updates to help you migrate from ERFs:
- You can retrieve Floodlight Activities and YouTube & Partners resources (line items, ad groups, and ads) using the Display & Video 360 API.
- Starting with Structured Data Files v7, you can use IDs retrieved through the Display & Video 360 API in place of ERFs.
See our migration guide for more information on migrating from ERFs to the Display & Video 360 API.
If you have questions regarding these changes, please contact us using our support contact form.
Source: Google Ads Developer Blog
Shareable class templates and classwork in Google Classroom are now generally available
What’s changing
Who’s impacted
Why you’d use it
Additional details
- Student information, such as assignment submissions, comments, and grades, will not be visible when previewing a shared class.
- Imported class materials will be saved in draft mode for the selected class.
Getting started
- Admins: There is no admin control for this feature, however, Admins should make sure the following is set up for end users:
- In order to share classes, educators must have a Google Workspace for Education Plus license assigned to them.
- To preview and import classwork from shared classes, educators must be verified teachers.
- End users:
- To share a class, click the “Share classwork” button on the Classwork page.
- After receiving a class link, open it in your browser. When previewing the shared class, select the classwork items you want to export to a class.
- Visit the Help Center to learn more about sharing class templates and classwork.
Rollout
- Rapid Release and Scheduled Release domains: Gradual rollout (up to 15 days for feature visibility) starting on January 31, 2024
Availability
- Available to Education Plus
Resources
Source: Google Workspace Updates
Scaling security with AI: from detection to solution
The AI world moves fast, so we’ve been hard at work keeping security apace with recent advancements. One of our approaches, in alignment with Google’s Safer AI Framework (SAIF), is using AI itself to automate and streamline routine and manual security tasks, including fixing security bugs. Last year we wrote about our experiences using LLMs to expand vulnerability testing coverage, and we’re excited to share some updates.
Today, we’re releasing our fuzzing framework as a free, open source resource that researchers and developers can use to improve fuzzing’s bug-finding abilities. We’ll also show you how we’re using AI to speed up the bug patching process. By sharing these experiences, we hope to spark new ideas and drive innovation for a stronger ecosystem security.
Update: AI-powered vulnerability discovery
Last August, we announced our framework to automate manual aspects of fuzz testing (“fuzzing”) that often hindered open source maintainers from fuzzing their projects effectively. We used LLMs to write project-specific code to boost fuzzing coverage and find more vulnerabilities. Our initial results on a subset of projects in our free OSS-Fuzz service were very promising, with code coverage increased by 30% in one example. Since then, we’ve expanded our experiments to more than 300 OSS-Fuzz C/C++ projects, resulting in significant coverage gains across many of the project codebases. We’ve also improved our prompt generation and build pipelines, which has increased code line coverage by up to 29% in 160 projects.
How does that translate to tangible security improvements? So far, the expanded fuzzing coverage offered by LLM-generated improvements allowed OSS-Fuzz to discover two new vulnerabilities in cJSON and libplist, two widely used projects that had already been fuzzed for years. As always, we reported the vulnerabilities to the project maintainers for patching. Without the completely LLM-generated code, these two vulnerabilities could have remained undiscovered and unfixed indefinitely.
And more: AI-powered vulnerability fixing
Fuzzing is fantastic for finding bugs, but for security to improve, those bugs also need to be patched. It’s long been an industry-wide struggle to find the engineering hours needed to patch open bugs at the pace that they are uncovered, and triaging and fixing bugs is a significant manual toll on project maintainers. With continued improvements in using LLMs to find more bugs, we need to keep pace in creating similarly automated solutions to help fix those bugs. We recently announced an experiment doing exactly that: building an automated pipeline that intakes vulnerabilities (such as those caught by fuzzing), and prompts LLMs to generate fixes and test them before selecting the best for human review.
This AI-powered patching approach resolved 15% of the targeted bugs, leading to significant time savings for engineers. The potential of this technology should apply to most or all categories throughout the software development process. We’re optimistic that this research marks a promising step towards harnessing AI to help ensure more secure and reliable software.
Try it out
Since we’ve now open sourced our framework to automate manual aspects of fuzzing, any researcher or developer can experiment with their own prompts to test the effectiveness of fuzz targets generated by LLMs (including Google’s VertexAI or their own fine-tuned models) and measure the results against OSS-Fuzz C/C++ projects. We also hope to encourage research collaborations and to continue seeing other work inspired by our approach, such as Rust fuzz target generation.
If you’re interested in using LLMs to patch bugs, be sure to read our paper on building an AI-powered patching pipeline. You’ll find a summary of our own experiences, some unexpected data about LLM’s abilities to patch different types of bugs, and guidance for building pipelines in your own organizations.
Source: Google Online Security Blog
5 ways to use Circle to Search
