Tag Archives: Generative AI

A New Foundation for AI on Android

Posted by Dave Burke, VP of Engineering

Foundation Models learn from a diverse range of data sources to produce AI systems capable of adapting to a wide range of tasks, instead of being trained for a single narrow use case. Today, we announced Gemini, our most capable model yet. Gemini was designed for flexibility, so it can run on everything from data centers to mobile devices. It's been optimized for three different sizes: Ultra, Pro and Nano.

Gemini Nano, optimized for mobile

Gemini Nano, our most efficient model built for on-device tasks, runs directly on mobile silicon, opening support for a range of important use cases. Running on-device enables features where the data should not leave the device, such as suggesting replies to messages in an end-to-end encrypted messaging app. It also enables consistent experiences with deterministic latency, so features are always available even when there’s no network.

Gemini Nano is distilled down from the larger Gemini models and specifically optimized to run on mobile silicon accelerators. Gemini Nano enables powerful capabilities such as high quality text summarization, contextual smart replies, and advanced proofreading and grammar correction. For example, the enhanced language understanding of Gemini Nano enables the Pixel 8 Pro to concisely summarize content in the Recorder app, even when the phone’s network connection is offline.

Moving image of Gemini Nano being used in the Recorder app on a Pixel 8 Pro device
Pixel 8 Pro using Gemini Nano in the Recorder app to summarize meeting audio, even without a network connection.

Gemini Nano is starting to power Smart Reply in Gboard on Pixel 8 Pro, ready to be enabled in settings as a developer preview. Available now to try with WhatsApp and coming to more apps next year, the on-device AI model saves you time by suggesting high-quality responses with conversational awareness1.

Moving image of WhatsApp’s use of Smart Reply in Gboard using Gemini Nano on Pixel 8 Pro device
Smart Reply in Gboard within WhatsApp using Gemini Nano on Pixel 8 Pro.

Android AICore, a new system service for on-device foundation models

Android AICore is a new system service in Android 14 that provides easy access to Gemini Nano. AICore handles model management, runtimes, safety features and more, simplifying the work for you to incorporate AI into your apps.

AICore is private by design, following the example of Android’s Private Compute Core with isolation from the network via open-source APIs, providing transparency and auditability. As part of our efforts to build and deploy AI responsibly, we also built dedicated safety features to make it safer and more inclusive for everyone.

AICore architecture
AICore manages model, runtime and safety features.

AICore enables Low-Rank Adaptation (LoRA) fine-tuning with Gemini Nano. This powerful technique lets app developers create small LoRA adapters based on their own training data. The LoRA adapter is loaded by AICore, resulting in a powerful large language model fine-tuned for the app’s own use cases.

AICore takes advantage of new ML hardware like the latest Google Tensor TPU and NPUs in flagship Qualcomm Technologies, Samsung S.LSI and MediaTek silicon. AICore and Gemini Nano are rolling out to Pixel 8 Pro, with more devices and silicon partners to be announced in the coming months.

Build with Gemini

We're excited to bring together state-of-the-art AI research with easy-to-use tools and APIs for Android developers to build with Gemini on-device. If you are interested in building apps using Gemini Nano and AICore, please sign up for our Early Access Program.


1 Available globally, but only when using the United States English keyboard language. Read more for details.

Full-stack development in Project IDX

Posted by Kaushik Sathupadi, Prakhar Srivastav, and Kristin Bi – Software Engineers; Alex Geboff – Technical Writer

We launched Project IDX, our experimental, new browser-based development experience, to simplify the chaos of building full-stack apps and streamline the development process from (back)end to (front)end.

In our experience, most web applications are built with at least two different layers: a frontend (UI) layer and a backend layer. When you think about the kind of app you’d build in a browser-based developer workspace, you might not immediately jump to full-stack apps with robust, fully functional backends. Developing a backend in a web-based environment can get clunky and costly very quickly. Between different authentication setups for development and production environments, secure communication between backend and frontend, and the complexity of setting up a fully self-contained (hermetic) testing environment, costs and inconveniences can add up.

We know a lot of you are excited to try IDX yourselves, but in the meantime, we wanted to share this post about full-stack development in Project IDX. We’ll untangle some of the complex situations you might hit as a developer building both your frontend and backend layers in a web-based workspace — developer authentication, frontend-backend communication, and hermetic testing — and how we’ve tried to make it all just a little bit easier. And of course we want to hear from you about what else we should build that would make full-stack development easier for you!


Streamlined app previews

First and foremost, we've streamlined the process of enabling your application's frontend to communicate with its backend services in the VM, making it effortless to preview your full-stack application in the browser.

IDX workspaces are built on Google Cloud Workstations and securely access connected services through Service Accounts. Each workspace’s unique service account supports seamless, authenticated preview environments for your application's frontend. So, when you use Project IDX, application previews are built directly into your workspace, and you don’t actually have to set up a different authentication path to preview your UI. Currently, IDX only supports web previews, but Android and iOS application previews are coming soon to IDX workspaces near you.

Additionally, if you need to reach the backend API under development in IDX from outside the browser preview, we've established a few mechanisms to temporarily provide access to the ports hosting these API backends.


Simple front-to-backend communication

If you’re using a framework that serves both the backend and frontend layers from the same port, you can pass the $PORT flag to use a custom PORT environment variable in your workspace configuration file (powered by Nix and stored directly in your workspace). This is part of the basic setup flow in Project IDX, so you don’t have to do anything particularly special (outside of setting the variable in your config file). Here’s an example Nix-based configuration file:


{ pkgs, ... }: {

# NOTE: This is an excerpt of a complete Nix configuration example.

# Enable previews and customize configuration
idx.previews = {
  enable = true;
  previews = [
    {
      command = [
        "npm"
        "run"
        "start"
        "--"
        "--port"
        "$PORT"
        "--host"
        "0.0.0.0"
        "--disable-host-check"
      ];
      manager = "web";
      id = "web";
    }
  ];
};
}
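If your framework reads its port from the environment rather than from a command-line flag, the same PORT variable can be honored on the server side. Here's a minimal sketch (assuming an Express server; the fallback port and handler are illustrative, not part of the IDX setup):

import express from "express";

const app = express();

app.get("/", (req, res) => {
    res.send("Hello from the preview server");
});

// Prefer the PORT variable provided to the preview command; fall back to a
// local default when it isn't set (for example, outside IDX).
const port = process.env.PORT || 3000;
app.listen(port, () => {
    console.log(`Server is running on port ${port}`);
});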

However, if your backend server is running on a different port from your UI server, you’ll need to implement a different strategy. One method is to have the frontend proxy the backend, as you would with Vite's custom server options.
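For illustration, a proxy setup in a Vite config might look roughly like this. This is only a sketch: the /api path prefix and the backend port are placeholders for your own setup, not values prescribed by IDX.

// vite.config.js: a minimal sketch of Vite's server.proxy option.
import { defineConfig } from "vite";

export default defineConfig({
  server: {
    proxy: {
      // Requests from the UI to /api/* are forwarded to the backend server.
      "/api": {
        target: "http://localhost:6000",
        changeOrigin: true,
      },
    },
  },
});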

Another way to establish communication between ports is to set up your code so the JavaScript running on your UI can communicate with the backend server using AJAX requests.

Let’s start with some sample code that includes both a backend and a frontend. Here’s a backend server written in Express.js:


import express from "express";
import cors from "cors";


const app = express();
app.use(cors());

app.get("/", (req, res) => {
    res.send("Hello World");
});

app.listen(6000, () => {
    console.log("Server is running on port 6000");
})

The app.use(cors()); line in the sample sets up the CORS headers. Setup might be different based on the language/framework of your choice, but your backend needs to return these headers whether you’re developing locally or on IDX.

When you run the server in the IDX terminal, the backend ports show up in the IDX panel. And every port that your server runs on is automatically mapped to a URL you can call.

Moving text showing the IDX terminal and panel

Now, let's write some client code to make an AJAX call to this server.


// This URL is copied from the side panel showing the backend ports view
const WORKSPACE_URL = "https://6000-monospace-ksat-web-prod-79679-1677177068249.cluster-lknrrkkitbcdsvoir6wqg4mwt6.cloudworkstations.dev/";

async function get(url) {
  const response = await fetch(url, {
    credentials: 'include',
  });
  console.log(await response.text());
}

// Call the backend
get(WORKSPACE_URL);

We’ve also made sure that the fetch() call includes credentials. IDX URLs are authenticated, so the AJAX call must carry cookies in order to authenticate against our servers.

If you’re using XMLHttpRequest instead of fetch, you can set the “withCredentials” property, like this:


const xhr = new XMLHttpRequest();
xhr.open("GET", WORKSPACE_URL, true);
xhr.withCredentials = true;
xhr.send(null);

Your code might differ from our samples based on the client library you use to make the AJAX calls. If it does, check the documentation for your specific client library on how to make a credentialed request. Just be sure to make a credentialed request.
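For example, if you happen to use axios (purely illustrative; it isn't required by IDX), the equivalent of credentials: 'include' is the withCredentials option:

import axios from "axios";

async function getWithAxios(url) {
  // withCredentials tells axios to send cookies, like fetch's credentials: 'include'.
  const response = await axios.get(url, { withCredentials: true });
  console.log(response.data);
}

getWithAxios(WORKSPACE_URL);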


Server-side testing without a login

In some cases you might want to access your application on Project IDX without logging into your Google account, or from an environment where you can’t log in. For example, you might want to access an API you're developing in IDX using Postman or cURL from your personal laptop's command line. You can do this by using a temporary access token generated by Project IDX.

Once you have a server running in Project IDX, you can bring up the command menu to generate an access token. This access token is a short-lived token that temporarily allows you to access your workstation.

It’s extremely important to note that this access token provides access to your entire IDX workspace, including but not limited to your application in preview, so you shouldn’t share it with just anyone. We recommend that you only use it for testing.

Generate access token in Project IDX

When you run this command from IDX, your access token shows up in a dialog window. Copy the access token and use it to make a cURL request to a service running on your workstation, like this one:


$ export ACCESS_TOKEN=myaccesstoken
$ curl -H "Authorization: Bearer $ACCESS_TOKEN" https://6000-monospace-ksat-web-prod-79679-1677177068249.cluster-lknrrkkitbcdsvoir6wqg4mwt6.cloudworkstations.dev/
Hello World

And now you can run tests from an authenticated server environment!
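The same token works from any HTTP client, not just cURL. As a sketch, here's the equivalent request from a Node.js script (Node 18+, run as an ES module; the URL is the placeholder workstation URL from the example above):

// test-api.mjs: a minimal sketch using Node's built-in fetch.
const token = process.env.ACCESS_TOKEN;
const url =
  "https://6000-monospace-ksat-web-prod-79679-1677177068249.cluster-lknrrkkitbcdsvoir6wqg4mwt6.cloudworkstations.dev/";

const response = await fetch(url, {
  headers: { Authorization: `Bearer ${token}` },
});
console.log(await response.text());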


Web-based, fully hermetic testing

As we’ve highlighted, you can test your application’s frontend and backend in a fully self-contained, authenticated, secure environment using IDX. You can also run local emulators in your web-based development environment to test your application’s backend services.

For example, you can run the Firebase Local Emulator Suite directly from your IDX workspace. To install the emulator suite, you’d run firebase init emulators from the IDX Terminal tab and follow the steps to configure which emulators you want on what ports.


Once you’ve installed them, you can configure and use them the same way you would in a local development environment from the IDX terminal.


Next Steps

As you can see, Project IDX can meet many of your full-stack development needs — from frontend to backend and every emulator in between.

If you're already using Project IDX, tag us on social with #projectidx to let us know how Project IDX has helped you with your full-stack development. Or to sign up for the waitlist, visit idx.dev.

People of AI: Season 2

Posted by Ashley Oldacre

If you are joining us for the first time, you can binge listen to our amazing 8 episodes from Season 1 wherever you get your podcasts.

We are back for another season of People of AI with a new lineup of incredible guests! I am so excited to introduce my new co-host Luiz Gustavo Martins as we meet inspiring people with interesting stories in the field of Artificial Intelligence.

Last season we focused on the incredible journeys that our guests took to get into the field of AI. Through our stories, we highlighted that no matter who you are, what your interests are, or what you work on, there is a place for anyone to get into this field. We also explored how much more accessible the technology has become over the years, as well as the importance of building AI-related products responsibly and ethically. It is easier than ever to use tools, platforms and services powered by machine learning to leverage the benefits of AI, and break down the barrier of entry.

For season 2, we will feature amazing conversations, focusing on Generative AI! Specifically, we will be discussing the explosive growth of Generative AI tools and the major technology shift that has happened in recent months. We will dive into various topics to explore areas where Generative AI can contribute tremendous value, as well as boost both productivity and economic growth. We will also continue to explore the personal paths and career development of this season’s guests as they share how their interest in technology was sparked, how they worked hard to get to where they are today, and explore what it is that they are currently working on.

Starting today, we will release one new episode of season 2 per week. Listen to the first episode on the People of AI site or wherever you get your podcasts. And stay tuned for later in the season when we premiere our first video podcasts as well!

  • Episode 1: meet your hosts, Ashley and Gus, and learn about Generative AI, Bard, and the big shift that has dramatically changed the industry. 
  • Episode 2: meet Sunita Verma, a long-time Googler, as she shares her personal journey from Engineering to CS, and into Google. An early pioneer of AI and Google Ads, Sunita talks with us about the evolution of AI and how Generative AI will transform the way we work. 
  • Episode 3: meet Sayak Paul, a Google Developer Expert (GDE) as we explore what it means to be a GDE and how to leverage the power of your community through community contributions. 
  • Episode 4: meet Crispin Velez, the lead for Cloud’s Vertex AI as we dig into his experience in Cloud working with customers and partners on how to integrate and deploy AI. We also learn how he grew his AI developer community in LATAM from scratch. 
  • Episode 5: meet Joyce Shen, venture capital/private equity investor. She shares her fascinating career in AI and how she has worked with businesses to spot AI talent, incorporate AI technology into workflows and implement responsible AI into their products. 
  • Episode 6: meet Anne Simonds and Brian Gary, founders of Muse https://www.museml.com. Join us as we talk about their recent journeys into AI and their new company which uses the power of Generative AI to spark creativity. 
  • Episode 7: meet Tulsee Doshi, product lead for Google’s Responsible AI efforts as we discuss the development of Google-wide resources and best practices for developing more inclusive, diverse, and ethical algorithm driven products. 
  • Episode 8: meet Jeanine Banks, Vice President and General Manager of Google Developer X and Head of Developer Relations. Join us as we debunk AI and get down to what Generative AI really is, how it has changed over the past few months and will continue to change the developer landscape. 
  • Episode 9: meet Simon Tokumine, Director of Product Management at Google. We will talk about how AI has brought us into the era of task-oriented products and is fueling a new community of makers.

Listen now to the first episode of Season 2. We can’t wait to share the stories of these exceptional People of AI with you!

This podcast is sponsored by Google. Any remarks made by the speakers are their own and are not endorsed by Google.

MediaPipe On-Device Text-to-Image Generation Solution Now Available for Android Developers

Posted by Paul Ruiz – Senior Developer Relations Engineer, and Kris Tonthat – Technical Writer

Earlier this year, we previewed on-device text-to-image generation with diffusion models for Android via MediaPipe Solutions. Today we’re happy to announce that this is available as an early, experimental solution, Image Generator, for developers to try out on Android devices, allowing you to easily generate images entirely on-device in as little as ~15 seconds on higher-end devices. We can’t wait to see what you create!

There are three primary ways that you can use the new MediaPipe Image Generator task:

  1. Text-to-image generation based on text prompts using standard diffusion models.
  2. Controllable text-to-image generation based on text prompts and conditioning images using diffusion plugins.
  3. Customized text-to-image generation based on text prompts using Low-Rank Adaptation (LoRA) weights that allow you to create images of specific concepts that you pre-define for your unique use-cases.

Models

Before we get into all of the fun and exciting parts of this new MediaPipe task, it’s important to know that our Image Generation API supports any models that exactly match the Stable Diffusion v1.5 architecture. You can use a pretrained model, or convert your own fine-tuned model to a format supported by the MediaPipe Image Generator using our conversion script.

You can also customize a foundation model via MediaPipe Diffusion LoRA fine-tuning on Vertex AI, injecting new concepts into a foundation model without having to fine-tune the whole model. You can find more information about this process in our official documentation.

If you want to try this task out today without any customization, we also provide links to a few verified working models in that same documentation.

Image Generation through Diffusion Models

The most straightforward way to try the Image Generator task is to give it a text prompt, and then receive a result image using a diffusion model.

Like MediaPipe’s other tasks, you will start by creating an options object. In this case you will only need to define the path to your foundation model files on the device. Once you have that options object, you can create the ImageGenerator.

val options = ImageGeneratorOptions.builder()
    .setImageGeneratorModelDirectory(MODEL_PATH)
    .build()

imageGenerator = ImageGenerator.createFromOptions(context, options)

After creating your new ImageGenerator, you can create a new image by passing in the prompt, the number of iterations the generator should go through for generating, and a seed value. This will run a blocking operation to create a new image, so you will want to run it in a background thread before returning your new Bitmap result object.

val result = imageGenerator.generate(prompt_string, iterations, seed)
val bitmap = BitmapExtractor.extract(result?.generatedImage())

In addition to this simple input in/result out format, we also support a way for you to step through each iteration manually through the execute() function, receiving the intermediate result images back at different stages to show the generative progress. While getting intermediate results back isn’t recommended for most apps due to performance and complexity, it is a nice way to demonstrate what’s happening under the hood. This is a little more of an in-depth process, but you can find this demo, as well as the other examples shown in this post, in our official example app on GitHub.

Moving image of an image generating in MediaPipe from the following prompt: a colorful cartoon racoon wearing a floppy wide brimmed hat holding a stick walking through the forest, animated, three-quarter view, painting

Image Generation with Plugins

While being able to create new images from only a prompt on a device is already a huge step, we’ve taken it a little further by implementing a new plugin system which enables the diffusion model to accept a condition image along with a text prompt as its inputs.

We currently support three different ways that you can provide a foundation for your generations: facial structures, edge detection, and depth awareness. The plugins give you the ability to provide an image, extract specific structures from it, and then create new images using those structures.

Moving image of an image generating in MediaPipe from a provided image of a beige toy car, plus the following prompt: cool green race car

LoRA Weights

The third major feature we’re rolling out today is the ability to customize the Image Generator task with LoRA to teach a foundation model about a new concept, such as specific objects, people, or styles presented during training. With the new LoRA weights, the Image Generator becomes a specialized generator that is able to inject specific concepts into generated images.

LoRA weights are useful for cases where you may want every image to be in the style of an oil painting, or a particular teapot to appear in any created setting. You can find more information about LoRA weights on Vertex AI in the MediaPipe Stable Diffusion LoRA model card, and create them using this notebook. Once generated, you can deploy the LoRA weights on-device using the MediaPipe Tasks Image Generator API, or for optimized server inference through Vertex AI’s one-click deployment.

In the example below, we created LoRA weights using several images of a teapot from the Dreambooth teapot training image set. Then we use the weights to generate a new image of the teapot in different settings.

A grid of four photos of teapots generated with the training prompt 'a photo of a monadikos teapot' on the left, and a moving image showing an image being generated in MediaPipe from the prompt 'a bright purple monadikos teapot sitting on top of a green table with orange teacups'
Image generation with the LoRA weights

Next Steps

This is just the beginning of what we plan to support with on-device image generation. We’re looking forward to seeing all of the great things the developer community builds, so be sure to post them on X (formerly Twitter) with the hashtag #MediaPipeImageGen and tag @GoogleDevs. You can check out the official sample on GitHub demonstrating everything you’ve just learned about, read through our official documentation for even more details, and keep an eye on the Google for Developers YouTube channel for updates and tutorials as they’re released by the MediaPipe team.


Acknowledgements

We’d like to thank all team members who contributed to this work: Lu Wang, Yi-Chun Kuo, Sebastian Schmidt, Kris Tonthat, Jiuqiang Tang, Khanh LeViet, Paul Ruiz, Qifei Wang, Yang Zhao, Yuqi Li, Lawrence Chan, Tingbo Hou, Joe Zou, Raman Sarokin, Juhyun Lee, Geng Yan, Ekaterina Ignasheva, Shanthal Vasanth, Glenn Cameron, Mark Sherwood, Andrei Kulik, Chuo-Ling Chang, and Matthias Grundmann from the Core ML team, as well as Changyu Zhu, Genquan Duan, Bo Wu, Ting Yu, and Shengyang Dai from Google Cloud.

Build with Google AI: new video series for developers

Posted by Joe Fernandez, AI Developer Relations, and Jaimie Hwang, AI Developer Marketing

Artificial intelligence (AI) represents a new frontier for technology we are just beginning to explore. While many of you are interested in working with AI, we realize that most developers aren't ready to dive into building their own artificial intelligence models (yet). With this in mind, we've created resources to get you started building applications with this technology.

Today, we are launching a new video series called Build with Google AI. This series features practical, useful AI-powered projects that don't require deep knowledge of artificial intelligence, or huge development resources. In fact, you can get these projects working in less than a day.

From self-driving cars to medical diagnosis, AI is automating tasks, improving efficiency, and helping us make better decisions. At the center of this wave of innovation are artificial intelligence models, including large language models like Google PaLM 2 and more focused AI models for translation, object detection, and other tasks. The frontier of AI, however, is not simply building new and better AI models, but also creating high-quality experiences and helpful applications with those models.

Practical AI code projects

This series is by developers, for developers. We want to help you build with AI, and not just any code project will do. They need to be practical and extensible. We are big believers in starting small and tackling concrete problems. The open source projects featured in the series are selected so that you can get them working quickly, and then build beyond them. We want you to take these projects and make them your own. Build solutions that matter to you.

Finally, and most importantly, we want to promote the use of AI that's beneficial to users, developers, creators, and organizations. So, we are focused on solutions that follow our principles for responsible use of artificial intelligence.

For the first arc of this series, we focus on how you can leverage Google's AI language model capabilities for applications, particularly the Google PaLM API. Here's what's coming up:

  • AI Content Search with Doc Agent (10/3) We'll show you how a technical writing team at Google built an AI-powered conversation search interface for their content, and how you can take their open source project and build the same functionality for your content. 
  • AI Writing Assistant with Wordcraft (10/10) Learn how the People and AI Research team at Google built a story writing application with AI technology, and how you can extend their code to build your own custom writing app. 
  • AI Coding Assistant with Pipet Code Agent (10/17) We'll show you how the AI Developer Relations team at Google built a coding assistance agent as an extension for Visual Studio Code, and how you can take their open source project and make it work for your development workflow.

For the second arc of the series, we'll bring you a new set of projects that run artificial intelligence applications locally on devices for lower latency, higher reliability, and improved data privacy.

Insights from the development teams

As developers, we love code, and we know that understanding someone else's code project can be a daunting task. The series includes demos and tutorials on how to customize the code, and we'll talk with the people behind the code. Why did they build it? What did they learn along the way? You’ll hear insights directly from the project team, so you can take it further.

Discover AI technologies from across Google

Google provides a host of resources for developers to build solutions with artificial intelligence. Whether you are looking to develop with Google's AI language models, build new models with TensorFlow, or deploy full-stack solutions with Google Cloud Vertex AI, it's our goal to help you find the AI technology solution that works best for your development projects. To start your journey, visit Build with Google AI.

We hope you are as excited about the Build with Google AI video series as we are to share it with you. Check out Episode #1 now! Use those video comments to let us know what you think and tell us what you'd like to see in future episodes.

Keep learning! Keep building!

How it’s Made: TextFX is a suite of AI tools inspired by Lupe Fiasco’s lyrical and linguistic techniques

Posted by Aaron Wade, Creative Technologist

Google Lab Sessions is a series of experimental AI collaborations with innovators. In our latest Lab Session we wanted to explore specifically how AI could expand human creativity. So we turned to GRAMMY® Award-winning rapper and MIT Visiting Scholar Lupe Fiasco to build an AI experiment called TextFX.



The discovery process

We started by spending time with Lupe to observe and learn about his creative process. This process was invariably marked by a sort of linguistic “tinkering”—that is, deconstructing language and then reassembling it in novel and innovative ways. Some of Lupe’s techniques, such as simile and alliteration, draw from the canon of traditional literary devices. But many of his tactics are entirely unique. Among them was a clever way of creating phrases that sound identical to a given word but have different meanings, which he demonstrated for us using the word “expressway”:

express whey (speedy delivery of dairy byproduct)

express sway (to demonstrate influence)

ex-press way (path without news media)

These sorts of operations played a critical role in Lupe’s writing. In light of this, we began to wonder: How might we use AI to help Lupe explore creative possibilities with text and language?

When it comes to language-related applications, large language models (LLMs) are the obvious choice from an AI perspective. LLMs are a category of machine learning models that are specially designed to perform language-related tasks, and one of the things we can use them for is generating text. But the question still remained as to how LLMs would actually fit into Lupe’s lyric-writing workflow.

Some LLMs such as Google’s Bard are fine-tuned to function as conversational agents. Others such as the PaLM API’s Text Bison model lack this conversational element and instead generate text by extending or fulfilling a given input text. One of the great things about this latter type of LLM is their capacity for few-shot learning. In other words, they can recognize patterns that occur in a small set of training examples and then replicate those patterns for novel inputs.

As an initial experiment, we had Lupe provide more examples of his same-sounding phrase technique. We then used those examples to construct a prompt, which is a carefully crafted string of text that primes the LLM to behave in a certain way. Our initial prompt for the same-sounding phrase task looked like this:

Word: defeat
Same-sounding phrase: da feet (as in "the feet")

Word: surprise
Same-sounding phrase: Sir Prize (a knight whose name is Prize)

Word: expressway
Same-sounding phrase: express whey (speedy delivery of dairy byproduct)

(...additional examples...)

Word: [INPUT WORD]
Same-sounding phrase:


This prompt yielded passable outputs some of the time, but we felt that there was still room for improvement. We actually found that factors beyond just the content and quantity of examples could influence the output—for example, how the task is framed, how inputs and outputs are represented, etc. After several iterations, we finally arrived at the following:

A same-sounding phrase is a phrase that sounds like another word or phrase.


Here is a same-sounding phrase for the word "defeat":

da feet (as in "the feet")


Here is a same-sounding phrase for the word "surprise":

Sir Prize (a knight whose name is Prize)


Here is a same-sounding phrase for the word "expressway":

express whey (speedy delivery of dairy byproduct)


(...additional examples...)


Here is a same-sounding phrase for the word "[INPUT WORD]":

After successfully codifying the same-sounding word task into a few-shot prompt, we worked with Lupe to identify additional creative tasks that we might be able to accomplish using the same few-shot prompting strategy. In the end, we devised ten prompts, each uniquely designed to explore creative possibilities that may arise from a given word, phrase, or concept:

SIMILE - Create a simile about a thing or concept.

EXPLODE - Break a word into similar-sounding phrases.

UNEXPECT - Make a scene more unexpected and imaginative.

CHAIN - Build a chain of semantically related items.

POV - Evaluate a topic through different points of view.

ALLITERATION - Curate topic-specific words that start with a chosen letter.

ACRONYM - Create an acronym using the letters of a word.

FUSE - Find intersections between two things.

SCENE - Generate sensory details about a scene.

UNFOLD - Slot a word into other existing words or phrases.

We were able to quickly prototype each of these ideas using MakerSuite, which is a platform that lets users easily build and experiment with LLM prompts via an interactive interface.

Moving image showing a few-shot prompt in MakerSuite

How we made it: building using the PaLM API

After we finalized the few-shot prompts, we built an app to house them. We decided to call it TextFX, drawing from the idea that each tool has a different “effect” on its input text. Like a sound effect, but for text.

Moving image showing the TextFX user interface

We save our prompts as strings in the source code and send them to Google’s PaLM 2 model using the PaLM API, which serves as an entry point to Google’s large language models.

All of our prompts are designed to terminate with an incomplete input-output pair. When a user submits an input, we append that input to the prompt before sending it to the model. The model predicts the corresponding output(s) for that input, and then we parse each result from the model response and do some post-processing before finally surfacing the result in the frontend.

Diagram of information flow between TextFX and Google's PaLM 2 large language models

Users may optionally adjust the model temperature, which is a hyperparameter that roughly corresponds to the amount of creativity allowed in the model outputs.
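To make the flow concrete, here's a rough sketch of how a server could append a user's input to one of these few-shot prompts and call the PaLM API from Node.js. This is illustrative only and is not the TextFX source; the client package and request fields follow the PaLM API Node.js quickstart, and the helper names, placeholder token, and candidate count are our own assumptions.

// A minimal sketch, not the actual TextFX implementation.
import { TextServiceClient } from "@google-ai/generativelanguage";
import { GoogleAuth } from "google-auth-library";

const client = new TextServiceClient({
  authClient: new GoogleAuth().fromAPIKey(process.env.PALM_API_KEY),
});

// fewShotPrompt is one of the prompt strings described above, ending with an
// incomplete input-output pair for the model to complete.
async function runTextFxTool(fewShotPrompt, userInput, temperature) {
  const prompt = fewShotPrompt.replace("[INPUT WORD]", userInput);

  const [response] = await client.generateText({
    model: "models/text-bison-001",
    prompt: { text: prompt },
    temperature, // higher values permit more creative outputs
    candidateCount: 4,
  });

  // Parse each candidate and do light post-processing before surfacing it.
  return (response.candidates ?? []).map((c) => c.output.trim());
}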

Try it yourself

You can try TextFX for yourself at textfx.withgoogle.com.

We’ve also made all of the LLM prompts available in MakerSuite. If you have access to the public preview for the PaLM API and MakerSuite, you can create your own copies of the prompts using the links below. Otherwise, you can join the waitlist.


And in case you’d like to take a closer look at how we built TextFX, we’ve open-sourced the code here.

If you want to try building with the PaLM API and MakerSuite, join the waitlist.

A final word

TextFX is an example of how you can experiment with the PaLM API and build applications that leverage Google’s state of the art large language models. More broadly, this exploration speaks to the potential of AI to augment human creativity. TextFX targets creative writing, but what might it mean for AI to enter other creative domains as a collaborator? Creators play a crucial role in helping us imagine what these collaborations might look like. Our hope is that this Lab Session gives you a glimpse of what’s possible using the PaLM API and inspires you to use Google’s AI offerings to bring your own ideas to life, in whatever your craft may be.

If you’d like to explore more Lab Sessions like this one, head over to labs.google.com.

MediaPipe: Enhancing Virtual Humans to be more realistic

A guest post by the XR Development team at KDDI & Alpha-U

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, KDDI.

AI generated rendering of virtual human ‘Metako’
KDDI is integrating text-to-speech & Cloud Rendering to virtual human ‘Metako’

VTubers, or virtual YouTubers, are online entertainers who use a virtual avatar generated using computer graphics. This digital trend originated in Japan in the mid-2010s, and has become an international online phenomenon. A majority of VTubers are English and Japanese-speaking YouTubers or live streamers who use avatar designs.

KDDI, a telecommunications operator in Japan with over 40 million customers, wanted to experiment with various technologies built on its 5G network but found that getting accurate movements and human-like facial expressions in real-time was challenging.


Creating virtual humans in real-time

Announced at Google I/O 2023 in May, the MediaPipe Face Landmarker solution detects facial landmarks and outputs blendshape scores to render a 3D face model that matches the user. With the MediaPipe Face Landmarker solution, KDDI and the Google Partner Innovation team successfully brought realism to their avatars.


Technical Implementation

Using MediaPipe's powerful and efficient Python package, KDDI developers were able to detect the performer’s facial features and extract 52 blendshapes in real time.

import time

import mediapipe as mp
from mediapipe.tasks import python as mp_python

MP_TASK_FILE = "face_landmarker_with_blendshapes.task"

class FaceMeshDetector:

    def __init__(self):
        with open(MP_TASK_FILE, mode="rb") as f:
            f_buffer = f.read()
        base_options = mp_python.BaseOptions(model_asset_buffer=f_buffer)
        options = mp_python.vision.FaceLandmarkerOptions(
            base_options=base_options,
            output_face_blendshapes=True,
            output_facial_transformation_matrixes=True,
            running_mode=mp.tasks.vision.RunningMode.LIVE_STREAM,
            num_faces=1,
            result_callback=self.mp_callback)
        self.model = mp_python.vision.FaceLandmarker.create_from_options(options)

        self.landmarks = None
        self.blendshapes = None
        self.latest_time_ms = 0

    def mp_callback(self, mp_result, output_image, timestamp_ms: int):
        if len(mp_result.face_landmarks) >= 1 and len(mp_result.face_blendshapes) >= 1:
            self.landmarks = mp_result.face_landmarks[0]
            self.blendshapes = [b.score for b in mp_result.face_blendshapes[0]]

    def update(self, frame):
        t_ms = int(time.time() * 1000)
        if t_ms <= self.latest_time_ms:
            return
        frame_mp = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame)
        self.model.detect_async(frame_mp, t_ms)
        self.latest_time_ms = t_ms

    def get_results(self):
        return self.landmarks, self.blendshapes

The Firebase Realtime Database stores a collection of 52 blendshape float values. Each row corresponds to a specific blendshape, listed in order.

_neutral, browDownLeft, browDownRight, browInnerUp, browOuterUpLeft, ...

These blendshape values are continuously updated in real-time as the camera is open and the FaceMesh model is running. With each frame, the database reflects the latest blendshape values, capturing the dynamic changes in facial expressions as detected by the FaceMesh model.

Screenshot of realtime Database

After extracting the blendshapes data, the next step involves transmitting it to the Firebase Realtime Database. Leveraging this advanced database system ensures a seamless flow of real-time data to the clients, eliminating concerns about server scalability and enabling KDDI to focus on delivering a streamlined user experience.

import concurrent.futures
import time

import cv2
import firebase_admin
import mediapipe as mp
import numpy as np
from firebase_admin import credentials, db

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

cred = credentials.Certificate('your-certificate.json')
firebase_admin.initialize_app(
    cred, {
        'databaseURL': 'https://your-project.firebasedatabase.app/'
    })
ref = db.reference('projects/1234/blendshapes')

def main():
    facemesh_detector = FaceMeshDetector()
    cap = cv2.VideoCapture(0)

    while True:
        ret, frame = cap.read()

        facemesh_detector.update(frame)
        landmarks, blendshapes = facemesh_detector.get_results()
        if (landmarks is None) or (blendshapes is None):
            continue

        blendshapes_dict = {k: v for k, v in enumerate(blendshapes)}
        exe = pool.submit(ref.set, blendshapes_dict)

        cv2.imshow('frame', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()
    exit()

 

To continue the progress, developers seamlessly transmit the blendshapes data from the Firebase Realtime Database to Google Cloud's Immersive Stream for XR instances in real time. Google Cloud’s Immersive Stream for XR is a managed service that runs Unreal Engine projects in the cloud, and renders and streams immersive photorealistic 3D and Augmented Reality (AR) experiences to smartphones and browsers in real time.

This integration enables KDDI to drive character face animation and achieve real-time streaming of facial animation with minimal latency, ensuring an immersive user experience.

Illustrative example of how KDDI transmits data from the Firebase Realtime Database to Google Cloud Immersive Stream for XR in real time to render and stream photorealistic 3D and AR experiences like character face animation with minimal latency

On the Unreal Engine side, which runs on Immersive Stream for XR, we use the Firebase C++ SDK to seamlessly receive data from Firebase. By establishing a database listener, we can instantly retrieve blendshape values as soon as updates occur in the Firebase Realtime Database table. This integration allows for real-time access to the latest blendshape data, enabling dynamic and responsive facial animation in Unreal Engine projects.

Screenshot of Modify Curve node in use in Unreal Engine

After retrieving blendshape values from the Firebase SDK, we can drive the face animation in Unreal Engine by using the "Modify Curve" node in the animation blueprint. Each blendshape value is assigned to the character individually on every frame, allowing for precise and real-time control over the character's facial expressions.

Flowchart demonstrating how BlendshapesReceiver handles the database connection, authentication, and continuous data reception

An effective approach for implementing a realtime database listener in Unreal Engine is to utilize the GameInstance Subsystem, which serves as an alternative singleton pattern. This allows for the creation of a dedicated BlendshapesReceiver instance responsible for handling the database connection, authentication, and continuous data reception in the background.

By leveraging the GameInstance Subsystem, the BlendshapesReceiver instance can be instantiated and maintained throughout the lifespan of the game session. This ensures a persistent database connection while the animation blueprint reads and drives the face animation using the received blendshape data.

Using just a local PC running MediaPipe, KDDI succeeded in capturing the real performer’s facial expression and movement, and created high-quality 3D re-target animation in real time.

Flow chart showing how a real performer's facial expression and movement being captured and run through MediaPipe on a Local PC, and the high quality 3D re-target animation being rendered in real time by KDDI
      

KDDI is collaborating with developers of Metaverse anime fashion like Adastria Co., Ltd.


Getting started

To learn more, watch Google I/O 2023 sessions: Easy on-device ML with MediaPipe, Supercharge your web app with machine learning and MediaPipe, What's new in machine learning, and check out the official documentation over on developers.google.com/mediapipe.


What’s next?

This MediaPipe integration is one example of how KDDI is eliminating the boundary between the real and virtual worlds, allowing users to enjoy everyday experiences such as attending live music performances, enjoying art, having conversations with friends, and shopping―anytime, anywhere. 

KDDI’s αU provides services for the Web3 era, including the metaverse, live streaming, and virtual shopping, shaping an ecosystem where anyone can become a creator, supporting the new generation of users who effortlessly move between the real and virtual worlds.

A Look Back at LA #TechWeek OneGoogle Panel: Building a Startup Using Generative AI

Posted by Alexandra Dumas, Head of VC & Startup Partnerships, West Coast, Google

Earlier this month, LA TechWeek hosted an array of thought leaders and innovative minds in the tech industry. As the Head of VC & Startup Partnerships West Coast at Google, I had the privilege of curating and facilitating an insightful panel event, supported by Google Cloud for Startups, on the topic of "Building with Generative AI," with representatives from Google Cloud, X, Andreessen Horowitz, and Gradient Ventures.

Google Venice Tech Week Panel

Our conversation was as rich in depth as it was in diversity; heightening the LA community's collective excitement for the future of generative AI, and underscoring Google's vision of harnessing the power of collaboration to ignite innovation in the tech startup space. The collaborative event was a unique platform that bridged the gap between startups, venture capitalists, and major players in the tech industry. It was the embodiment of Google's commitment to driving transformative change by fostering robust partnerships with VC firms and startups: We understand that the success of startups is crucial to our communities, economies, and indeed, to Google itself.

Josh Gwyther, Generative AI Global Lead for Google Cloud, kicked things off by tracing Google's impressive journey in AI, shedding light on how we've pioneered in creating transformative AI models, a journey that started back in 2017 with the landmark Transformer whitepaper.

From X, Clarence Wooten elevated our perception of AI's potential, painting an exciting picture of AI as a startup's virtual "co-founder." He powerfully encapsulated AI's role in amplifying, not replacing, human potential, a testament to Google's commitment to AI and its impact.

Venturing into the world of gaming, Andreessen Horowitz's Andrew Chen predicted a revolution in game development driven by generative AI. He saw a future where indie game developers thrived, game types evolved, and the entire gaming landscape shifted, all propelled by generative AI's transformative power.

On the investment side of things, Darian Shirazi from Gradient Ventures shared insights on what makes an excellent AI founder, emphasizing trustworthiness, self-learning, and resilience as critical traits.

Google Venice Tech Week Panel

The panel discussion concluded with a deep dive into the intricacies of integrating AI and scalability, the challenges of GPUs/TPUs, and the delicate balance between innovation and proprietary data concerns.

Founders were also left with actionable information about the Google for Startups Cloud Program, which provides startup experts, cloud credits, and technical training so founders can begin their journey on Google Cloud cost-free, with their focus squarely on innovation and growth. We invite all eligible startups to apply as we continue this journey together.

As the curtains fell on LA TechWeek, we were left with more than just a feeling of optimism about the future of generative AI. We walked away with new connections, fresh perspectives, and a renewed conviction that Google, along with startups, investors, and partners, can lead the transformative change that the future beckons. The main takeaway: The AI revolution isn't coming; it's here. And Google, with its deep expertise and unwavering dedication to innovation, is committed to moving forward boldly, responsibly, and in partnership with others.

Google Venice Tech Week Audience

As we navigate this thrilling journey, I look forward to continuing to collaborate with startups, investors, and partners, leveraging the vast potential of AI to unlock a future where technology serves us all in unimaginable ways.

Controlling Stable Diffusion with JAX, diffusers, and Cloud TPUs

Diffusion models are state-of-the-art in generating photorealistic images from text. These models are hard to control through only text and generation parameters. To overcome this, the open source community developed ControlNet (GitHub), a neural network structure to control diffusion models by adding more conditions on top of the text prompts. These conditions include canny edge filters, segmentation maps, and pose keypoints. Thanks to the 🧨diffusers library, it is very easy to train, fine-tune or control diffusion models written in various frameworks, including JAX!

At Hugging Face, we were particularly excited to see the open source machine learning (ML) community leverage these tools to explore fun and creative diffusion models. We joined forces with Google Cloud to host a community sprint where participants explored the capabilities of controlling Stable Diffusion by building various open source applications with JAX and Diffusers, using Google Cloud TPU v4 accelerators. In this three-week sprint, participants teamed up, came up with various project ideas, trained ControlNet models, and built applications based on them. The sprint resulted in 26 projects, accessible via a leaderboard here. These demos use Stable Diffusion (v1.5 checkpoint) initialized with ControlNet models. We worked with Google Cloud to provide access to TPU v4-8 hardware with 3TB storage, as well as NVIDIA A10G GPUs to speed up inference in these applications.

Below, we showcase a few projects that stood out from the sprint and that anyone can recreate as a demo themselves. When picking projects to highlight, we considered several factors:

  • How well-described are the models produced?
  • Are the models, datasets, and other artifacts fully open sourced?
  • Are the applications easy to use? Are they well described?

The projects were voted on by a panel of experts and the top ten projects on the leaderboard won prizes.

Control with SAM

One team used the state-of-the-art Segment Anything Model (SAM) output as an additional condition to control the generated images. SAM produces zero-shot segmentation maps with fine details, which helps extract semantic information from images for control. You can see an example below and try the demo here.

Screencap of the 'Control with SAM' project

Fusing MediaPipe and ControlNet

Another team used MediaPipe to extract hand landmarks to control Stable Diffusion. This application allows you to generate images based on your hand pose and prompt. You can also use a webcam to input an image. See an example below, and try it yourself here.

Screencap of a project fusing MediaPipe and ControlNet

Make-a-Video

Top on the leaderboard is Make-a-Video, which generates video from a text prompt and a hint image. It is based on latent diffusion with temporal convolutions for video and attention. You can try the demo here.

Screencap of the 'Make-a-Video' project

Bootstrapping interior designs

The project that won the sprint is ControlNet for interior design. The application can generate interior design based on a room image and prompt. It can also perform segmentation and generations, guided by image inpainting. See the application in inpainting mode below.

Screencap of a project using ControlNet for interior design

In addition to the projects above, many applications were built to enhance images, like this application to colorize grayscale images. You can check out the leaderboard to try all the projects.

Learning more about diffusion models

To kick off the sprint, we organized a three-day series of talks by leading scientists and engineers from Google, Hugging Face, and the open source diffusion community. We'd recommend that anyone interested in learning more about diffusion models and generative AI take a look at the recorded sessions below!

Tim Salimans (Google Research) speaking on Discrete Diffusion Models
You can watch all the talks from the links below.

You can check out the sprint homepage to learn more about the sprint.

Acknowledgements

We would like to thank Google Cloud for providing TPUs and storage to help make this great sprint happen, in particular Bertrand Rondepierre and Jonathan Caton for the hard work behind the scenes to get all of the Cloud TPUs allocated so participants had cutting-edge hardware to build on and an overall great experience. We would also like to thank Andreas Steiner and Cristian Garcia for helping to answer questions in our Discord forum and for helping us make the training script example better. Their help is deeply appreciated.

By Merve Noyan and Sayak Paul – Hugging Face
