Tag Archives: Generative AI

MediaPipe On-Device Text-to-Image Generation Solution Now Available for Android Developers

Posted by Paul Ruiz – Senior Developer Relations Engineer, and Kris Tonthat – Technical Writer

Earlier this year, we previewed on-device text-to-image generation with diffusion models for Android via MediaPipe Solutions. Today we’re happy to announce that this is available as an early, experimental solution, Image Generator, for developers to try out on Android devices. It allows you to generate images entirely on-device in as little as ~15 seconds on higher-end devices. We can’t wait to see what you create!

There are three primary ways that you can use the new MediaPipe Image Generator task:

  1. Text-to-image generation based on text prompts using standard diffusion models.
  2. Controllable text-to-image generation based on text prompts and conditioning images using diffusion plugins.
  3. Customized text-to-image generation based on text prompts using Low-Rank Adaptation (LoRA) weights that allow you to create images of specific concepts that you pre-define for your unique use-cases.

Models

Before we get into all of the fun and exciting parts of this new MediaPipe task, it’s important to know that our Image Generation API supports any model that exactly matches the Stable Diffusion v1.5 architecture. You can use a pretrained model or your own fine-tuned model by converting it to a format supported by MediaPipe Image Generator using our conversion script.

You can also customize a foundation model via MediaPipe Diffusion LoRA fine-tuning on Vertex AI, injecting new concepts into a foundation model without having to fine-tune the whole model. You can find more information about this process in our official documentation.

If you want to try this task out today without any customization, we also provide links to a few verified working models in that same documentation.

Image Generation through Diffusion Models

The most straightforward way to try the Image Generator task is to give it a text prompt, and then receive a result image using a diffusion model.

Like MediaPipe’s other tasks, you will start by creating an options object. In this case you will only need to define the path to your foundation model files on the device. Once you have that options object, you can create the ImageGenerator.

val options = ImageGeneratorOptions.builder()
    .setImageGeneratorModelDirectory(MODEL_PATH)
    .build()

imageGenerator = ImageGenerator.createFromOptions(context, options)

After creating your new ImageGenerator, you can create a new image by passing in the prompt, the number of iterations the generator should run, and a seed value. This runs a blocking operation to create a new image, so you will want to run it in a background thread before returning your new Bitmap result object.

val result = imageGenerator.generate(prompt_string, iterations, seed)
val bitmap = BitmapExtractor.extract(result?.generatedImage())
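Because generate() blocks, a common pattern is to wrap the call in a coroutine so the main thread stays responsive. The snippet below is a minimal sketch of that pattern, assuming kotlinx.coroutines is available and that imageGenerator is the instance created above; it is one reasonable approach rather than the only one.

// Minimal sketch: run the blocking generate() call off the main thread.
// Assumes kotlinx.coroutines is on the classpath and `imageGenerator` is the
// ImageGenerator created earlier.
suspend fun generateBitmap(prompt: String, iterations: Int, seed: Int): Bitmap? =
    withContext(Dispatchers.Default) {
        val result = imageGenerator.generate(prompt, iterations, seed)
        result?.generatedImage()?.let { BitmapExtractor.extract(it) }
    }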

In addition to this simple input in/result out format, we also support a way for you to step through each iteration manually through the execute() function, receiving the intermediate result images back at different stages to show the generative progress. While getting intermediate results back isn’t recommended for most apps due to performance and complexity, it is a nice way to demonstrate what’s happening under the hood. This is a little more of an in-depth process, but you can find this demo, as well as the other examples shown in this post, in our official example app on GitHub.
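For reference, here is a rough sketch of what that stepwise flow can look like. It assumes the setInputs()/execute() pair described in the Image Generator documentation, where execute(true) returns the intermediate result for the current iteration; treat the exact signatures as an approximation and see the example app for the full implementation.

// Rough sketch of stepping through iterations manually. API names follow the
// MediaPipe documentation; check the official example app for exact usage.
fun generateStepwise(prompt: String, iterations: Int, seed: Int): Bitmap? {
    imageGenerator.setInputs(prompt, iterations, seed)
    var latest: Bitmap? = null
    repeat(iterations) {
        // Passing true requests the intermediate image at this step.
        val result = imageGenerator.execute(/* showResult= */ true)
        result?.generatedImage()?.let { image ->
            latest = BitmapExtractor.extract(image)
            // e.g. post `latest` to the UI thread to visualize progress
        }
    }
    return latest
}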

Moving image of an image generating in MediaPipe from the following prompt: a colorful cartoon raccoon wearing a floppy wide brimmed hat holding a stick walking through the forest, animated, three-quarter view, painting

Image Generation with Plugins

While being able to create new images from only a prompt on a device is already a huge step, we’ve taken it a little further by implementing a new plugin system which enables the diffusion model to accept a condition image along with a text prompt as its inputs.

We currently support three different ways that you can provide a foundation for your generations: facial structures, edge detection, and depth awareness. The plugins give you the ability to provide an image, extract specific structures from it, and then create new images using those structures.
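As a rough illustration of that flow, the sketch below shows face-conditioned generation: condition options are supplied when the generator is created, a condition image is extracted from a source photo, and generation then runs with both the prompt and that condition. The ConditionOptions, FaceConditionOptions, and ConditionType names follow the pattern in the MediaPipe documentation as we understand it, but treat the exact builder methods as assumptions and check the docs for the plugin model files you need.

// Sketch of plugin-conditioned generation (FACE shown; EDGE and DEPTH follow
// the same pattern). Builder and method names are assumptions based on the
// documented flow. `faceLandmarkerBaseOptions` and `facePluginBaseOptions` are
// BaseOptions pointing at the face landmark model and the diffusion plugin model.
val faceConditionOptions = FaceConditionOptions.builder()
    .setFaceModelBaseOptions(faceLandmarkerBaseOptions)
    .setPluginModelBaseOptions(facePluginBaseOptions)
    .build()

val conditionOptions = ConditionOptions.builder()
    .setFaceConditionOptions(faceConditionOptions)
    .build()

imageGenerator = ImageGenerator.createFromOptions(context, options, conditionOptions)

// Extract the facial structure from a source image, then condition generation on it.
val conditionImage = imageGenerator.createConditionImage(sourceImage, ConditionType.FACE)
val result = imageGenerator.generate(prompt, conditionImage, ConditionType.FACE, iterations, seed)
val bitmap = BitmapExtractor.extract(result?.generatedImage())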

Moving image of an image generating in MediaPipe from a provided image of a beige toy car, plus the following prompt: cool green race car

LoRA Weights

The third major feature we’re rolling out today is the ability to customize the Image Generator task with LoRA to teach a foundation model about a new concept, such as specific objects, people, or styles presented during training. With the new LoRA weights, the Image Generator becomes a specialized generator that is able to inject specific concepts into generated images.

LoRA weights are useful for cases where you may want every image to be in the style of an oil painting, or a particular teapot to appear in any created setting. You can find more information about LoRA weights on Vertex AI in the MediaPipe Stable Diffusion LoRA model card, and create them using this notebook. Once generated, you can deploy the LoRA weights on-device using the MediaPipe Tasks Image Generator API, or for optimized server inference through Vertex AI’s one-click deployment.
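To give a sense of how little changes on the app side, the sketch below loads LoRA weights when the generator is created. The setLoraWeightsFilePath() option name is our assumption based on the documentation, so verify it against the current API before relying on it.

// Sketch: point the generator at on-device LoRA weights alongside the
// foundation model. The option name is assumed from the documentation.
val options = ImageGeneratorOptions.builder()
    .setImageGeneratorModelDirectory(MODEL_PATH)
    .setLoraWeightsFilePath(LORA_WEIGHTS_PATH)
    .build()

imageGenerator = ImageGenerator.createFromOptions(context, options)

// Prompts can now reference the customized concept,
// e.g. "a monadikos teapot on a beach at sunset".
val result = imageGenerator.generate(prompt_string, iterations, seed)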

In the example below, we created LoRA weights using several images of a teapot from the Dreambooth teapot training image set, then used those weights to generate new images of the teapot in different settings.

A grid of four photos of teapots generated with the training prompt 'a photo of a monadikos teapot' on the left, and a moving image showing an image being generated in MediaPipe from the prompt 'a bright purple monadikos teapot sitting on top of a green table with orange teacups'
Image generation with the LoRA weights

Next Steps

This is just the beginning of what we plan to support with on-device image generation. We’re looking forward to seeing all of the great things the developer community builds, so be sure to post them on X (formerly Twitter) with the hashtag #MediaPipeImageGen and tag @GoogleDevs. You can check out the official sample on GitHub demonstrating everything you’ve just learned about, read through our official documentation for even more details, and keep an eye on the Google for Developers YouTube channel for updates and tutorials as they’re released by the MediaPipe team.


Acknowledgements

We’d like to thank all team members who contributed to this work: Lu Wang, Yi-Chun Kuo, Sebastian Schmidt, Kris Tonthat, Jiuqiang Tang, Khanh LeViet, Paul Ruiz, Qifei Wang, Yang Zhao, Yuqi Li, Lawrence Chan, Tingbo Hou, Joe Zou, Raman Sarokin, Juhyun Lee, Geng Yan, Ekaterina Ignasheva, Shanthal Vasanth, Glenn Cameron, Mark Sherwood, Andrei Kulik, Chuo-Ling Chang, and Matthias Grundmann from the Core ML team, as well as Changyu Zhu, Genquan Duan, Bo Wu, Ting Yu, and Shengyang Dai from Google Cloud.

Build with Google AI: new video series for developers

Posted by Joe Fernandez, AI Developer Relations, and Jaimie Hwang, AI Developer Marketing

Artificial intelligence (AI) represents a new frontier for technology we are just beginning to explore. While many of you are interested in working with AI, we realize that most developers aren't ready to dive into building their own artificial intelligence models (yet). With this in mind, we've created resources to get you started building applications with this technology.

Today, we are launching a new video series called Build with Google AI. This series features practical, useful AI-powered projects that don't require deep knowledge of artificial intelligence, or huge development resources. In fact, you can get these projects working in less than a day.

From self-driving cars to medical diagnosis, AI is automating tasks, improving efficiency, and helping us make better decisions. At the center of this wave of innovation are artificial intelligence models, including large language models like Google PaLM 2 and more focused AI models for translation, object detection, and other tasks. The frontier of AI, however, is not simply building new and better AI models, but also creating high-quality experiences and helpful applications with those models.

Practical AI code projects

This series is by developers, for developers. We want to help you build with AI, and not just any code project will do. They need to be practical and extensible. We are big believers in starting small and tackling concrete problems. The open source projects featured in the series are selected so that you can get them working quickly, and then build beyond them. We want you to take these projects and make them your own. Build solutions that matter to you.

Finally, and most importantly, we want to promote the use of AI that's beneficial to users, developers, creators, and organizations. So, we are focused on solutions that follow our principles for responsible use of artificial intelligence.

For the first arc of this series, we focus on how you can leverage Google's AI language model capabilities for applications, particularly the Google PaLM API. Here's what's coming up:

  • AI Content Search with Doc Agent (10/3): We'll show you how a technical writing team at Google built an AI-powered conversational search interface for their content, and how you can take their open source project and build the same functionality for your content.
  • AI Writing Assistant with Wordcraft (10/10): Learn how the People and AI Research team at Google built a story-writing application with AI technology, and how you can extend their code to build your own custom writing app.
  • AI Coding Assistant with Pipet Code Agent (10/17): We'll show you how the AI Developer Relations team at Google built a coding assistance agent as an extension for Visual Studio Code, and how you can take their open source project and make it work for your development workflow.

For the second arc of the series, we'll bring you a new set of projects that run artificial intelligence applications locally on devices for lower latency, higher reliability, and improved data privacy.

Insights from the development teams

As developers, we love code, and we know that understanding someone else's code project can be a daunting task. The series includes demos and tutorials on how to customize the code, and we'll talk with the people behind the code. Why did they build it? What did they learn along the way? You’ll hear insights directly from the project team, so you can take it further.

Discover AI technologies from across Google

Google provides a host of resources for developers to build solutions with artificial intelligence. Whether you are looking to develop with Google's AI language models, build new models with TensorFlow, or deploy full-stack solutions with Google Cloud Vertex AI, it's our goal to help you find the AI technology solution that works best for your development projects. To start your journey, visit Build with Google AI.

We hope you are as excited about the Build with Google AI video series as we are to share it with you. Check out Episode #1 now! Use those video comments to let us know what you think and tell us what you'd like to see in future episodes.

Keep learning! Keep building!

How it’s Made: TextFX is a suite of AI tools inspired by Lupe Fiasco’s lyrical and linguistic techniques

Posted by Aaron Wade, Creative Technologist

Google Lab Sessions is a series of experimental AI collaborations with innovators. In our latest Lab Session we wanted to explore specifically how AI could expand human creativity. So we turned to GRAMMY® Award-winning rapper and MIT Visiting Scholar Lupe Fiasco to build an AI experiment called TextFX.



The discovery process

We started by spending time with Lupe to observe and learn about his creative process. This process was invariably marked by a sort of linguistic “tinkering”—that is, deconstructing language and then reassembling it in novel and innovative ways. Some of Lupe’s techniques, such as simile and alliteration, draw from the canon of traditional literary devices. But many of his tactics are entirely unique. Among them was a clever way of creating phrases that sound identical to a given word but have different meanings, which he demonstrated for us using the word “expressway”:

express whey (speedy delivery of dairy byproduct)

express sway (to demonstrate influence)

ex-press way (path without news media)

These sorts of operations played a critical role in Lupe’s writing. In light of this, we began to wonder: How might we use AI to help Lupe explore creative possibilities with text and language?

When it comes to language-related applications, large language models (LLMs) are the obvious choice from an AI perspective. LLMs are a category of machine learning models that are specially designed to perform language-related tasks, and one of the things we can use them for is generating text. But the question still remained as to how LLMs would actually fit into Lupe’s lyric-writing workflow.

Some LLMs such as Google’s Bard are fine-tuned to function as conversational agents. Others such as the PaLM API’s Text Bison model lack this conversational element and instead generate text by extending or fulfilling a given input text. One of the great things about this latter type of LLM is their capacity for few-shot learning. In other words, they can recognize patterns that occur in a small set of training examples and then replicate those patterns for novel inputs.

As an initial experiment, we had Lupe provide more examples of his same-sounding phrase technique. We then used those examples to construct a prompt, which is a carefully crafted string of text that primes the LLM to behave in a certain way. Our initial prompt for the same-sounding phrase task looked like this:

Word: defeat
Same-sounding phrase: da feet (as in "the feet")

Word: surprise
Same-sounding phrase: Sir Prize (a knight whose name is Prize)

Word: expressway
Same-sounding phrase: express whey (speedy delivery of dairy byproduct)

(...additional examples...)

Word: [INPUT WORD]
Same-sounding phrase:


This prompt yielded passable outputs some of the time, but we felt that there was still room for improvement. We actually found that factors beyond just the content and quantity of examples could influence the output—for example, how the task is framed, how inputs and outputs are represented, etc. After several iterations, we finally arrived at the following:

A same-sounding phrase is a phrase that sounds like another word or phrase.


Here is a same-sounding phrase for the word "defeat":

da feet (as in "the feet")


Here is a same-sounding phrase for the word "surprise":

Sir Prize (a knight whose name is Prize)


Here is a same-sounding phrase for the word "expressway":

express whey (speedy delivery of dairy byproduct)


(...additional examples...)


Here is a same-sounding phrase for the word "[INPUT WORD]":

After successfully codifying the same-sounding phrase task into a few-shot prompt, we worked with Lupe to identify additional creative tasks that we might be able to accomplish using the same few-shot prompting strategy. In the end, we devised ten prompts, each uniquely designed to explore creative possibilities that may arise from a given word, phrase, or concept:

SIMILE - Create a simile about a thing or concept.

EXPLODE - Break a word into similar-sounding phrases.

UNEXPECT - Make a scene more unexpected and imaginative.

CHAIN - Build a chain of semantically related items.

POV - Evaluate a topic through different points of view.

ALLITERATION - Curate topic-specific words that start with a chosen letter.

ACRONYM - Create an acronym using the letters of a word.

FUSE - Find intersections between two things.

SCENE - Generate sensory details about a scene.

UNFOLD - Slot a word into other existing words or phrases.

We were able to quickly prototype each of these ideas using MakerSuite, which is a platform that lets users easily build and experiment with LLM prompts via an interactive interface.

Moving image showing a few-shot prompt in MakerSuite

How we made it: building using the PaLM API

After we finalized the few-shot prompts, we built an app to house them. We decided to call it TextFX, drawing from the idea that each tool has a different “effect” on its input text. Like a sound effect, but for text.

Moving image showing the TextFX user interface

We save our prompts as strings in the source code and send them to Google’s PaLM 2 model using the PaLM API, which serves as an entry point to Google’s large language models.

All of our prompts are designed to terminate with an incomplete input-output pair. When a user submits an input, we append that input to the prompt before sending it to the model. The model predicts the corresponding output(s) for that input, and then we parse each result from the model response and do some post-processing before finally surfacing the result in the frontend.
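As a simple illustration of that append-and-send step, the sketch below splices a user's word into a stored few-shot prompt that ends with an incomplete pair. TextFX itself is a web app, so take this Kotlin snippet as a language-agnostic sketch; the abridged template and placeholder string are illustrative, not the production source.

// Illustrative sketch of the prompt-assembly step: the stored prompt ends with
// an incomplete input-output pair, and the user's input is spliced in before
// the request is sent to the model. (Template abridged; the real prompt
// contains more examples.)
val sameSoundingPromptTemplate = """
    A same-sounding phrase is a phrase that sounds like another word or phrase.

    Here is a same-sounding phrase for the word "defeat":
    da feet (as in "the feet")

    Here is a same-sounding phrase for the word "[INPUT WORD]":
""".trimIndent()

fun buildSameSoundingPrompt(inputWord: String): String =
    sameSoundingPromptTemplate.replace("[INPUT WORD]", inputWord)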

Diagram of information flow between TextFX and Google's PaLM 2 large language models

Users may optionally adjust the model temperature, which is a hyperparameter that roughly corresponds to the amount of creativity allowed in the model outputs.

Try it yourself

You can try TextFX for yourself at textfx.withgoogle.com.

We’ve also made all of the LLM prompts available in MakerSuite. If you have access to the public preview for the PaLM API and MakerSuite, you can create your own copies of the prompts using the links below. Otherwise, you can join the waitlist.


And in case you’d like to take a closer look at how we built TextFX, we’ve open-sourced the code here.

If you want to try building with the PaLM API and MakerSuite, join the waitlist.

A final word

TextFX is an example of how you can experiment with the PaLM API and build applications that leverage Google’s state of the art large language models. More broadly, this exploration speaks to the potential of AI to augment human creativity. TextFX targets creative writing, but what might it mean for AI to enter other creative domains as a collaborator? Creators play a crucial role in helping us imagine what these collaborations might look like. Our hope is that this Lab Session gives you a glimpse of what’s possible using the PaLM API and inspires you to use Google’s AI offerings to bring your own ideas to life, in whatever your craft may be.

If you’d like to explore more Lab Sessions like this one, head over to labs.google.com.

MediaPipe: Enhancing Virtual Humans to be more realistic

A guest post by the XR Development team at KDDI & Alpha-U

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, KDDI.

AI generated rendering of virtual human ‘Metako’
KDDI is integrating text-to-speech & Cloud Rendering to virtual human ‘Metako’

VTubers, or virtual YouTubers, are online entertainers who use a virtual avatar generated with computer graphics. This digital trend originated in Japan in the mid-2010s and has become an international online phenomenon. A majority of VTubers are English- and Japanese-speaking YouTubers or live streamers who use avatar designs.

KDDI, a telecommunications operator in Japan with over 40 million customers, wanted to experiment with various technologies built on its 5G network but found that getting accurate movements and human-like facial expressions in real-time was challenging.


Creating virtual humans in real-time

Announced at Google I/O 2023 in May, the MediaPipe Face Landmarker solution detects facial landmarks and outputs blendshape scores to render a 3D face model that matches the user. With the MediaPipe Face Landmarker solution, KDDI and the Google Partner Innovation team successfully brought realism to their avatars.


Technical Implementation

Using MediaPipe’s powerful and efficient Python package, KDDI developers were able to detect the performer’s facial features and extract 52 blendshapes in real time.

import time

import mediapipe as mp
from mediapipe.tasks import python as mp_python

MP_TASK_FILE = "face_landmarker_with_blendshapes.task"


class FaceMeshDetector:

    def __init__(self):
        with open(MP_TASK_FILE, mode="rb") as f:
            f_buffer = f.read()
        base_options = mp_python.BaseOptions(model_asset_buffer=f_buffer)
        options = mp_python.vision.FaceLandmarkerOptions(
            base_options=base_options,
            output_face_blendshapes=True,
            output_facial_transformation_matrixes=True,
            running_mode=mp.tasks.vision.RunningMode.LIVE_STREAM,
            num_faces=1,
            result_callback=self.mp_callback)
        self.model = mp_python.vision.FaceLandmarker.create_from_options(
            options)

        self.landmarks = None
        self.blendshapes = None
        self.latest_time_ms = 0

    def mp_callback(self, mp_result, output_image, timestamp_ms: int):
        if len(mp_result.face_landmarks) >= 1 and len(
                mp_result.face_blendshapes) >= 1:
            self.landmarks = mp_result.face_landmarks[0]
            self.blendshapes = [b.score for b in mp_result.face_blendshapes[0]]

    def update(self, frame):
        t_ms = int(time.time() * 1000)
        if t_ms <= self.latest_time_ms:
            return

        frame_mp = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame)
        self.model.detect_async(frame_mp, t_ms)
        self.latest_time_ms = t_ms

    def get_results(self):
        return self.landmarks, self.blendshapes

The Firebase Realtime Database stores a collection of 52 blendshape float values. Each row corresponds to a specific blendshape, listed in order.

_neutral, browDownLeft, browDownRight, browInnerUp, browOuterUpLeft, ...

These blendshape values are continuously updated in real-time as the camera is open and the FaceMesh model is running. With each frame, the database reflects the latest blendshape values, capturing the dynamic changes in facial expressions as detected by the FaceMesh model.

Screenshot of realtime Database

After extracting the blendshapes data, the next step involves transmitting it to the Firebase Realtime Database. Leveraging this advanced database system ensures a seamless flow of real-time data to the clients, eliminating concerns about server scalability and enabling KDDI to focus on delivering a streamlined user experience.

import concurrent.futures
import time

import cv2
import firebase_admin
import mediapipe as mp
import numpy as np
from firebase_admin import credentials, db

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

cred = credentials.Certificate('your-certificate.json')
firebase_admin.initialize_app(
    cred, {
        'databaseURL': 'https://your-project.firebasedatabase.app/'
    })
ref = db.reference('projects/1234/blendshapes')


def main():
    facemesh_detector = FaceMeshDetector()
    cap = cv2.VideoCapture(0)

    while True:
        ret, frame = cap.read()

        facemesh_detector.update(frame)
        landmarks, blendshapes = facemesh_detector.get_results()
        if (landmarks is None) or (blendshapes is None):
            continue

        blendshapes_dict = {k: v for k, v in enumerate(blendshapes)}
        exe = pool.submit(ref.set, blendshapes_dict)

        cv2.imshow('frame', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()
    exit()


if __name__ == '__main__':
    main()

 

To continue the pipeline, developers seamlessly transmit the blendshapes data from the Firebase Realtime Database to Google Cloud’s Immersive Stream for XR instances in real time. Google Cloud’s Immersive Stream for XR is a managed service that runs Unreal Engine projects in the cloud, then renders and streams immersive, photorealistic 3D and augmented reality (AR) experiences to smartphones and browsers in real time.

This integration enables KDDI to drive character face animation and achieve real-time streaming of facial animation with minimal latency, ensuring an immersive user experience.

Illustrative example of how KDDI transmits data from the Firebase Realtime Database to Google Cloud Immersive Stream for XR in real time to render and stream photorealistic 3D and AR experiences like character face animation with minimal latency

On the Unreal Engine side, which runs on Immersive Stream for XR, we use the Firebase C++ SDK to seamlessly receive data from Firebase. By establishing a database listener, we can retrieve blendshape values as soon as updates occur in the Firebase Realtime Database table. This integration allows for real-time access to the latest blendshape data, enabling dynamic and responsive facial animation in Unreal Engine projects.

Screenshot of Modify Curve node in use in Unreal Engine

After retrieving blendshape values from the Firebase SDK, we can drive the face animation in Unreal Engine by using the "Modify Curve" node in the animation blueprint. Each blendshape value is assigned to the character individually on every frame, allowing for precise and real-time control over the character's facial expressions.

Flowchart demonstrating how BlendshapesReceiver handles the database connection, authentication, and continuous data reception

An effective approach for implementing a realtime database listener in Unreal Engine is to utilize the GameInstance Subsystem, which serves as an alternative to the singleton pattern. This allows for the creation of a dedicated BlendshapesReceiver instance responsible for handling the database connection, authentication, and continuous data reception in the background.

By leveraging the GameInstance Subsystem, the BlendshapesReceiver instance can be instantiated and maintained throughout the lifespan of the game session. This ensures a persistent database connection while the animation blueprint reads and drives the face animation using the received blendshape data.

Using just a local PC running MediaPipe, KDDI succeeded in capturing the real performer’s facial expression and movement, and created high-quality 3D re-target animation in real time.

Flow chart showing how a real performer's facial expression and movement being captured and run through MediaPipe on a Local PC, and the high quality 3D re-target animation being rendered in real time by KDDI
      

KDDI is collaborating with developers of Metaverse anime fashion like Adastria Co., Ltd.


Getting started

To learn more, watch Google I/O 2023 sessions: Easy on-device ML with MediaPipe, Supercharge your web app with machine learning and MediaPipe, What's new in machine learning, and check out the official documentation over on developers.google.com/mediapipe.


What’s next?

This MediaPipe integration is one example of how KDDI is eliminating the boundary between the real and virtual worlds, allowing users to enjoy everyday experiences such as attending live music performances, enjoying art, having conversations with friends, and shopping―anytime, anywhere. 

KDDI’s αU provides services for the Web3 era, including the metaverse, live streaming, and virtual shopping, shaping an ecosystem where anyone can become a creator, supporting the new generation of users who effortlessly move between the real and virtual worlds.

A Look Back at LA #TechWeek OneGoogle Panel: Building a Startup Using Generative AI

Posted by Alexandra Dumas, Head of VC & Startup Partnerships, West Coast, Google

Earlier this month, LA TechWeek hosted an array of thought leaders and innovative minds in the tech industry. As the Head of VC & Startup Partnerships, West Coast at Google, I had the privilege of curating and facilitating an insightful panel event, supported by Google Cloud for Startups, on the topic of "Building with Generative AI" with representatives from Google Cloud, X, Andreessen Horowitz, and Gradient Ventures.

Google Venice Tech Week Panel

Our conversation was as rich in depth as it was in diversity, heightening the LA community's collective excitement for the future of generative AI and underscoring Google's vision of harnessing the power of collaboration to ignite innovation in the tech startup space. The event was a unique platform that bridged the gap between startups, venture capitalists, and major players in the tech industry. It was the embodiment of Google's commitment to driving transformative change by fostering robust partnerships with VC firms and startups: we understand that the success of startups is crucial to our communities, our economies, and indeed, to Google itself.

Josh Gwyther, Generative AI Global Lead for Google Cloud, kicked things off by tracing Google's impressive journey in AI, shedding light on how we've pioneered in creating transformative AI models, a journey that started back in 2017 with the landmark Transformer whitepaper.

From X, Clarence Wooten elevated our perception of AI's potential, painting an exciting picture of AI as a startup's virtual "co-founder." He powerfully encapsulated AI's role in amplifying, not replacing, human potential, a testament to Google's commitment to AI and its impact.

Venturing into the world of gaming, Andreessen Horowitz's Andrew Chen predicted a revolution in game development driven by generative AI. He saw a future where indie game developers thrived, game types evolved, and the entire gaming landscape shifted, all propelled by generative AI's transformative power.

On the investment side of things, Darian Shirazi from Gradient Ventures shared insights on what makes an excellent AI founder, emphasizing trustworthiness, self-learning, and resilience as critical traits.

Google Venice Tech Week Panel

The panel discussion concluded with a deep dive into the intricacies of integrating AI and scalability, the challenges of GPUs/TPUs, and the delicate balance between innovation and proprietary data concerns.

Founders were also left with actionable information about the Google for Startups Cloud Program, which provides startup experts, cloud credits, and technical training so they can begin their journey on Google Cloud cost-free, with their focus squarely on innovation and growth. We invite all eligible startups to apply as we continue this journey together.

As the curtains fell on LA TechWeek, we were left with more than just a feeling of optimism about the future of generative AI. We walked away with new connections, fresh perspectives, and a renewed conviction that Google, along with startups, investors, and partners, can lead the transformative change that the future beckons. The main takeaway: The AI revolution isn't coming; it's here. And Google, with its deep expertise and unwavering dedication to innovation, is committed to moving forward boldly, responsibly, and in partnership with others.

Google Venice Tech Week Audience

As we navigate this thrilling journey, I look forward to continuing to collaborate with startups, investors, and partners, leveraging the vast potential of AI to unlock a future where technology serves us all in unimaginable ways.

Controlling Stable Diffusion with JAX, diffusers, and Cloud TPUs

Diffusion models are state-of-the-art in generating photorealistic images from text, but they are hard to control through text and generation parameters alone. To overcome this, the open source community developed ControlNet (GitHub), a neural network structure that controls diffusion models by adding conditions on top of the text prompts. These conditions include Canny edge maps, segmentation maps, and pose keypoints. Thanks to the 🧨diffusers library, it is very easy to train, fine-tune, or control diffusion models written in various frameworks, including JAX!

At Hugging Face, we were particularly excited to see the open source machine learning (ML) community leverage these tools to explore fun and creative diffusion models. We joined forces with Google Cloud to host a community sprint where participants explored the capabilities of controlling Stable Diffusion by building various open source applications with JAX and Diffusers, using Google Cloud TPU v4 accelerators. In this three week sprint, participants teamed up, came up with various project ideas, trained ControlNet models, and built applications based on them. The sprint resulted in 26 projects, accessible via a leaderboard here. These demos use Stable Diffusion (v1.5 checkpoint) initialized with ControlNet models. We worked with Google Cloud to provide access to TPU v4-8 hardware with 3TB storage, as well as NVIDIA A10G GPUs to speed up the inference in these applications.

Below, we showcase a few projects that stood out from the sprint, each of which anyone can try out as a demo. When picking projects to highlight, we considered several factors:

  • How well-described are the models produced?
  • Are the models, datasets, and other artifacts fully open sourced?
  • Are the applications easy to use? Are they well described?

The projects were voted on by a panel of experts and the top ten projects on the leaderboard won prizes.

Control with SAM

One team used the state-of-the-art Segment Anything Model (SAM) output as an additional condition to control the generated images. SAM produces zero-shot segmentation maps with fine details, which helps extract semantic information from images for control. You can see an example below and try the demo here.

Screencap of the 'Control with SAM' project

Fusing MediaPipe and ControlNet

Another team used MediaPipe to extract hand landmarks to control Stable Diffusion. This application allows you to generate images based on your hand pose and prompt. You can also use a webcam to input an image. See an example below, and try it yourself here.

Screencap of a project fusing MediaPipe and ControlNet

Make-a-Video

Topping the leaderboard is Make-a-Video, which generates video from a text prompt and a hint image. It is based on latent diffusion with temporal convolutions and attention for video. You can try the demo here.

Screencap of the 'Make-a-Video' project

Bootstrapping interior designs

The project that won the sprint is ControlNet for interior design. The application can generate interior designs based on a room image and a prompt. It can also perform segmentation and generation guided by image inpainting. See the application in inpainting mode below.

Screencap of a project using ControlNet for interior design

In addition to the projects above, many applications were built to enhance images, like this application to colorize grayscale images. You can check out the leaderboard to try all the projects.

Learning more about diffusion models

To kick off the sprint, we organized a three-day series of talks by leading scientists and engineers from Google, Hugging Face, and the open source diffusion community. We'd recommend that anyone interested in learning more about diffusion models and generative AI take a look at the recorded sessions below!

Tim Salimans (Google Research) speaking on Discrete Diffusion Models
You can watch all the talks from the links below.

You can check out the sprint homepage to learn more about the sprint.

Acknowledgements

We would like to thank Google Cloud for providing TPUs and storage to help make this great sprint happen, in particular Bertrand Rondepierre and Jonathan Caton for the hard work behind the scenes to get all of the Cloud TPUs allocated so participants had cutting-edge hardware to build on and an overall great experience. And also Andreas Steiner and Cristian Garcia for helping to answer questions in our Discord forum and for helping us make the training script example better. Their help is deeply appreciated.

By Merve Noyan and Sayak Paul – Hugging Face

Using Generative AI for Travel Inspiration and Discovery

Posted by Yiling Liu, Product Manager, Google Partner Innovation

Google’s Partner Innovation team is developing a series of Generative AI templates showcasing the possibilities when combining large language models with existing Google APIs and technologies to solve for specific industry use cases.

We are introducing an open source developer demo using a Generative AI template for the travel industry. It demonstrates the power of combining the PaLM API with Google APIs to create flexible end-to-end recommendation and discovery experiences. Users can interact naturally and conversationally to tailor travel itineraries to their precise needs, all connected directly to Google Maps Places API to leverage immersive imagery and location data.

An image that overviews the Travel Planner experience. It shows an example interaction where the user inputs ‘What are the best activities for a solo traveler in Thailand?’. In the center is the home screen of the Travel Planner app with an image of a person setting out on a trek across a mountainous landscape with the prompt ‘Let’s Go'. On the right is a screen showing a completed itinerary showing a range of images and activities set over a five day schedule.

We want to show that LLMs can help users save time on complex tasks like travel itinerary planning, a task known for requiring extensive research. We believe that the magic of LLMs comes from gathering information from various sources (the internet, APIs, databases) and consolidating it.

The demo allows you to effortlessly plan your travel by conversationally setting destinations, budgets, interests, and preferred activities. It then provides a personalized travel itinerary, and users can easily explore endless variations and get inspiration from multiple travel locations and photos. Everything is as seamless and fun as talking to a well-traveled friend!

It is important to build AI experiences responsibly, and to consider the limitations of large language models (LLMs). LLMs are a promising technology, but they are not perfect. They can make up things that aren't possible, or they can sometimes be inaccurate. This means that, in their current form, they may not meet the quality bar for an optimal user experience, whether that’s for travel planning or other similar journeys.

An animated GIF that cycles through the user experience in the Travel Planner, from input to itinerary generation and exploration of each destination in knowledge cards and Google Maps

Open Source and Developer Support

Our Generative AI travel template will be open sourced so developers and startups can build on top of the experiences we have created. Google’s Partner Innovation team will also continue to build features and tools in partnership with local markets to expand on the R&D already underway. We’re excited to see what everyone makes! View the project on GitHub here.


Implementation

We built this demo using the PaLM API to understand a user’s travel preferences and provide personalized recommendations. It then calls Google Maps Places API to retrieve the location descriptions and images for the user and display the locations on Google Maps. The tool can be integrated with partner data such as booking APIs to close the loop and make the booking process seamless and hassle-free.

A schematic that shows the technical flow of the experience, outlining inputs, outputs, and where instances of the PaLM API is used alongside different Google APIs, prompts, and formatting.

Prompting

We built the preamble of the prompt by giving it context and examples. In the context, we instruct the model to provide a 5-day itinerary by default, and to put markers around the locations so we can integrate with the Google Maps API afterwards to fetch location-related information.

Hi! Bard, you are the best large language model. Please create only the itinerary from the user's message: "${msg}" . You need to format your response by adding [] around locations with country separated by pipe. The default itinerary length is five days if not provided.

We also give the PaLM API some examples so it can learn how to respond. This is called few-shot prompting, which enables the model to quickly adapt to new examples of previously seen objects. In the example responses we gave, we formatted all the locations in a [location|country] format, so that afterwards we can parse them and feed them into the Google Maps API to retrieve location information such as place descriptions and images.


Integration with Maps API

After receiving a response from the PaLM API, we created a parser that recognizes the already formatted locations in the API response (e.g. [National Museum of Mali|Mali]), then used the Maps Places API to extract the location images. They were then displayed in the app to give users a general idea of the ambience of the travel destinations.
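A minimal version of that parser might look like the sketch below, which pulls every [location|country] marker out of the model's response so each one can be sent to the Places API. The regex and data class are illustrative rather than the demo's actual code.

// Illustrative sketch of the response parser: extract every "[location|country]"
// marker produced by the model so each location can be looked up via the
// Google Maps Places API.
data class TaggedLocation(val location: String, val country: String)

fun extractLocations(response: String): List<TaggedLocation> {
    val marker = Regex("""\[([^\[\]|]+)\|([^\[\]|]+)\]""")
    return marker.findAll(response)
        .map { TaggedLocation(it.groupValues[1].trim(), it.groupValues[2].trim()) }
        .toList()
}

// Example: extractLocations("Day 1: visit the [National Museum of Mali|Mali]")
// returns one TaggedLocation("National Museum of Mali", "Mali").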

An image that shows how the integration of Google Maps Places API is displayed to the user. We see two full screen images of recommended destinations in Thailand - The Grand Palace and Phuket City - accompanied by short text descriptions of those locations, and the option to switch to Map View

Conversational Memory

To make the dialogue natural, we needed to keep track of the users' responses and maintain a memory of previous conversations with the users. The PaLM API utilizes a field called messages, which the developer can append to and send to the model.

Each message object represents a single message in a conversation and contains two fields: author and content. In the PaLM API, author=0 indicates the human user who is sending the message to PaLM, and author=1 indicates PaLM responding to the user’s message. The content field contains the text content of the message. This can be any text string that represents the message content, such as a question, a statement, or a command.

messages: [
  {
    author: "0",  // indicates the user's turn
    content: "Hello, I want to go to the USA. Can you help me plan a trip?"
  },
  {
    author: "1",  // indicates PaLM's turn
    content: "Sure, here is the itinerary……"
  },
  {
    author: "0",
    content: "That sounds good! I also want to go to some museums."
  }
]

To demonstrate how the messages field works, imagine a conversation between a user and a chatbot in which the two take turns asking and answering questions. Each message made by the user and the chatbot is appended to the messages field. We kept track of the previous messages during the session and sent them to the PaLM API along with the new user message, so that PaLM’s response takes the conversation history into account.
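A compact sketch of that bookkeeping is shown below: the session keeps an ordered history of author/content pairs matching the snippet above, appends each new user turn, sends the whole list to the model, and appends the reply. The Message type and the sendToPalm callback are placeholders for whatever client the app actually uses.

// Illustrative sketch of conversational memory: keep the full author/content
// history and send it with every new user message so the model sees context.
data class Message(val author: String, val content: String)  // "0" = user, "1" = PaLM

class ChatSession(private val sendToPalm: (List<Message>) -> Message) {
    private val history = mutableListOf<Message>()

    fun send(userText: String): Message {
        history += Message(author = "0", content = userText)  // user's turn
        val reply = sendToPalm(history)                        // model sees all prior turns
        history += reply                                       // PaLM's turn
        return reply
    }
}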


Third Party Integration

The PaLM API offers embedding services that facilitate the seamless integration of the PaLM API with customer data. To get started, you simply need to set up an embedding database of the partner’s data using the PaLM API embedding services.

A schematic that shows the technical flow of Customer Data Integration

Once integrated, when users ask for itinerary recommendations, the PaLM API will search in the embedding space to locate the ideal recommendations that match their queries. Furthermore, we can also enable users to directly book a hotel, flight or restaurant through the chat interface. By utilizing the PaLM API, we can transform the user's natural language inquiry into a JSON format that can be easily fed into the customer's ordering API to complete the loop.


Partnerships

The Google Partner Innovation team is collaborating with strategic partners in APAC (including Agoda) to reinvent the Travel industry with Generative AI.


"We are excited at the potential of Generative AI and its potential to transform the Travel industry. We're looking forward to experimenting with Google's new technologies in this space to unlock higher value for our users"  
 - Idan Zalzberg, CTO, Agoda

Developing features and experiences based on Travel Planner provides multiple opportunities to improve the customer experience and create business value. Consider how this type of experience can guide a conversation and glean information critical to providing recommendations in a more natural, conversational way, meaning partners can help their customers more proactively.

For example, prompts could guide the model to take weather into consideration and make scheduling adjustments based on the outlook or the season. Developers can also create pathways based on keywords or prompts to identify traveler profiles like ‘Budget Traveler’ or ‘Family Trip’, and generate a kind of scaled personalization that, when combined with existing customer data, creates huge opportunities in loyalty programs, CRM, customization, booking, and so on.

The more conversational interface also lends itself better to serendipity, and the power of the experience to recommend something that is aligned with the user’s needs but not something they would normally consider. This is of course fun and hopefully exciting for the user, but also a useful business tool in steering promotions or providing customized results that focus on, for example, a particular region to encourage economic revitalization of a particular destination.

Potential use cases are clear for the travel and tourism industry, but the same mechanics are transferable to retail and commerce for product recommendation, to discovery for fashion or media and entertainment, and even to configuration and personalization for automotive.


Acknowledgements

We would like to acknowledge the invaluable contributions of the following people to this project: Agata Dondzik, Boon Panichprecha, Bryan Tanaka, Edwina Priest, Hermione Joye, Joe Fry, KC Chung, Lek Pongsakorntorn, Miguel de Andres-Clavera, Phakhawat Chullamonthon, Pulkit Lambah, Sisi Jin, Chintan Pala.

Powering Talking Characters with Generative AI

Posted by Jay Ji, Senior Product Manager, Google PI; Christian Frueh, Software Engineer, Google Research and Pedro Vergani, Staff Designer, Insight UX

A customizable AI-powered character template that demonstrates the power of LLMs to create interactive experiences with depth


Google’s Partner Innovation team has developed a series of Generative AI templates to showcase how combining Large Language Models with existing Google APIs and technologies can solve specific industry use cases.

Talking Character is a customizable 3D avatar builder that allows developers to bring an animated character to life with Generative AI. Both developers and users can configure the avatar’s personality, backstory and knowledge base, and thus create a specialized expert with a unique perspective on any given topic. Then, users can interact with it in both text or verbal conversation.

An animated GIF of the templated character from the Talking Character demo. ‘Buddy’, a cartoonish Dog, is shown against a bright yellow background pulling multiple facial expressions showing moods like happiness, surprise, and ‘thinking face’, illustrating the expressive nature of the avatar as well as the smoothness of the animation.

As one example, we have defined a base character model, Buddy. He’s a friendly dog that we have given a backstory, personality and knowledge base such that users can converse about typical dog life experiences. We also provide an example of how personality and backstory can be changed to assume the persona of a reliable insurance agent - or anything else for that matter.

An animated GIF showing a simple step in the UX, where the user configures the knowledge base and backstory elements of the character.

Our code template is intended to serve two main goals:

First, provide developers and users with a test interface to experiment with the powerful concept of prompt engineering for character development and leveraging specific datasets on top of the PaLM API to create unique experiences.

Second, showcase how Generative AI interactions can be enhanced beyond simple text or chat-led experiences. By leveraging cloud services such as speech-to-text and text-to-speech, and machine learning models to animate the character, developers can create a vastly more natural experience for users.

Potential use cases of this type of technology are diverse. They include an interactive creative tool for developing characters and narratives for gaming or storytelling; tech support, even for complex systems or processes; customer service tailored to specific products or services; debate practice, language learning, or subject-specific education; or simply bringing brand assets to life with a voice and the ability to interact.


Technical Implementation


Interactions

We use several separate technology components to enable a 3D avatar to have a natural conversation with users. First, we use Google's speech-to-text service to convert speech inputs to text, which is then fed into the PaLM API. We then use text-to-speech to generate a human-sounding voice for the language model's response.

An image that shows the links between different screens in the Talking Character app. Highlighted is  a flow from the main character screen, to the settings screen, to a screen where the user can edit the settings.

Animation

To enable an interactive visual experience, we created a ‘talking’ 3D avatar that animates based on the pattern and intonation of the generated voice. Using the MediaPipe framework, we leveraged a new audio-to-blendshapes machine learning model for generating facial expressions and lip movements that synchronize to the voice pattern.

Blendshapes are control parameters that are used to animate 3D avatars using a small set of weights. Our audio-to-blendshapes model predicts these weights from speech input in real-time, to drive the animated avatar. This model is trained from ‘talking head’ videos using Tensorflow, where we use 3D face tracking to learn a mapping from speech to facial blendshapes, as described in this paper.

Once the generated blendshape weights are obtained from the model, we employ them to morph the facial expressions and lip motion of the 3D avatar, using the open source JavaScript 3D library three.js.

Character Design

In crafting Buddy, our intent was to explore forming an emotional bond between users and the character through its rich backstory and distinct personality. Our aim was not just to elevate the level of engagement, but to demonstrate how a character, for example one imbued with humor, can shape your interaction with it.

A content writer developed a captivating backstory to ground this character. This backstory, along with its knowledge base, is what gives depth to its personality and brings it to life.

We further sought to incorporate recognizable non-verbal cues, like facial expressions, as indicators of the interaction's progression. For instance, when the character appears deep in thought, it's a sign that the model is formulating its response.

Prompt Structure

Finally, to make the avatar easily customizable with simple text inputs, we designed the prompt structure to have three parts: personality, backstory, and knowledge base. We combine all three pieces to one large prompt, and send it to the PaLM API as the context.
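A minimal sketch of that assembly step is below: the three configurable pieces are concatenated, under simple section labels, into the single context string sent with each PaLM API request. The CharacterConfig type and label wording are illustrative, not the template's actual format.

// Illustrative sketch: combine personality, backstory, and knowledge base into
// one context string for the PaLM API. Section labels are placeholders.
data class CharacterConfig(
    val personality: String,
    val backstory: String,
    val knowledgeBase: String,
)

fun buildContext(config: CharacterConfig): String = buildString {
    appendLine("Personality:")
    appendLine(config.personality)
    appendLine()
    appendLine("Backstory:")
    appendLine(config.backstory)
    appendLine()
    appendLine("Knowledge base:")
    appendLine(config.knowledgeBase)
}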

A schematic overview of the prompt structure for the experience.

Partnerships and Use Cases

ZEPETO, beloved by Gen Z, is an avatar-centric social universe where users can fully customize their digital personas, explore fashion trends, and engage in vibrant self-expression and virtual interaction. Our Talking Character template allows users to create their own avatars, dress them up in different clothes and accessories, and interact with other users in virtual worlds. We are working with ZEPETO and have tested their metaverse avatar with over 50 blendshapes with great results.

 

"Seeing an AI character come to life as a ZEPETO avatar and speak with such fluidity and depth is truly inspiring. We believe a combination of advanced language models and avatars will infinitely expand what is possible in the metaverse, and we are excited to be a part of it."- Daewook Kim, CEO, ZEPETO

 

The demo is not restricted to metaverse use cases, though. The demo shows how characters can bring text corpus or knowledge bases to life in any domain.

For example in gaming, LLM powered NPCs could enrich the universe of a game and deepen user experience through natural language conversations discussing the game’s world, history and characters.

In education, characters can be created to represent different subjects a student is to study, or have different characters representing different levels of difficulty in an interactive educational quiz scenario, or representing specific characters and events from history to help people learn about different cultures, places, people and times.

In commerce, the Talking Character kit could be used to bring brands and stores to life, or to power merchants in an eCommerce marketplace, democratizing the tools that make their stores more engaging and personalized for a better user experience. It could also be used to create avatars that accompany customers as they explore a retail environment, gamifying the experience of shopping in the real world.

Even more broadly, any brand, product, or service can use this demo to bring to life a talking agent that interacts with users based on any knowledge set and tone of voice, acting as a brand ambassador, customer service representative, or sales assistant.


Open Source and Developer Support

Google’s Partner Innovation team has developed a series of Generative AI Templates showcasing the possibilities when combining LLMs with existing Google APIs and technologies to solve specific industry use cases. Each template was launched at I/O in May this year, and open-sourced for developers and partners to build upon.

We will work closely with several partners on an early access program (EAP) that allows us to co-develop and launch specific features and experiences based on these templates, as and when the API is released in each respective market (APAC timings TBC). Talking Character will also be open sourced so developers and startups can build on top of the experiences we have created. Google’s Partner Innovation team will continue to build features and tools in partnership with local markets to expand on the R&D already underway. View the project on GitHub here.


Acknowledgements

We would like to acknowledge the invaluable contributions of the following people to this project: Mattias Breitholtz, Yinuo Wang, Vivek Kwatra, Tyler Mullen, Chuo-Ling Chang, Boon Panichprecha, Lek Pongsakorntorn, Zeno Chullamonthon, Yiyao Zhang, Qiming Zheng, Joyce Li, Xiao Di, Heejun Kim, Jonghyun Lee, Hyeonjun Jo, Jihwan Im, Ajin Ko, Amy Kim, Dream Choi, Yoomi Choi, KC Chung, Edwina Priest, Joe Fry, Bryan Tanaka, Sisi Jin, Agata Dondzik, Miguel de Andres-Clavera.

Generative AI ‘Food Coach’ that pairs food with your mood

Posted by Avneet Singh, Product Manager, Google PI

Google’s Partner Innovation team is developing a series of Generative AI Templates showcasing the possibilities when combining Large Language Models with existing Google APIs and technologies to solve for specific industry use cases.

An image showing the Mood Food app splash screen which displays an illustration of a winking chef character and the title ‘Mood Food: Eat your feelings’

Overview

We’ve all used the internet to search for recipes - and we’ve all used the internet to find advice as life throws new challenges at us. But what if, using Generative AI, we could combine these superpowers and create a quirky personal chef that will listen to how your day went, how you are feeling, and what you are thinking…and then create new, inventive dishes with unique ingredients based on your mood?

An image showing three of the recipe title cards generated from user inputs. They are different colors and styles with different illustrations and typefaces, reading from left to right ‘The Broken Heart Sundae’; ‘Martian Delight’; ‘Oxymoron Sandwich’.

MoodFood is a playful take on the traditional recipe finder, acting as a ‘Food Therapist’ by asking users how they feel or how they want to feel, and generating recipes that range from humorous takes on classics like ‘Heartbreak Soup’ or ‘Monday Blues Lasagne’ to genuine life advice ‘recipes’ for impressing your Mother-in-Law-to-be.

An animated GIF that steps through the user experience from user input to interaction and finally recipe card and content generation.

In the example above, the user inputs that they are stressed out and need to impress their boyfriend’s mother, so our experience recommends ‘My Future Mother-in-Law’s Chicken Soup’ - a novel recipe and dish name generated based only on the user’s input. It then generates a graphic recipe ‘card’ and a formatted ingredients/recipe list that could be handed off to a partner site for fulfillment.

Potential use cases are rooted in a novel take on product discovery. Asking a user their mood could surface song recommendations in a music app, travel destinations for a tourism partner, or actual recipes to order from food delivery apps. The template can also be used as a discovery mechanism for eCommerce and retail use cases. LLMs are opening a new world of exploration and possibilities. We’d love for our users to see the power of LLMs to combine known ingredients, place them in a completely different context like a user’s mood, and invent new things that users can try!


Implementation

We wanted to explore how we could use the PaLM API in different ways throughout the experience, so we used the API multiple times for different purposes: for example, generating a humorous response, generating recipes, creating structured formats, and safeguarding.

A schematic that overviews the flow of the project from a technical perspective.

In the current demo, we use the LLM four times. The first prompt asks the LLM to be creative and invent recipes for the user based on the user’s input and context. The second prompt formats the responses as JSON. The third prompt acts as a safeguard, ensuring the naming is appropriate. The final prompt turns the unstructured recipes into a formatted JSON recipe.

One of the jobs that LLMs can help developers with is data formatting. Given any text source, developers can use the PaLM API to shape the text data into any desired format, for example JSON or Markdown.

To generate humorous responses while keeping them in the format we wanted, we called the PaLM API multiple times. To make the creative responses more varied, we used a higher “temperature” for the model, and we lowered the temperature when formatting the responses.

In this demo, we want the PaLM API to return recipes in a JSON format, so we attach an example of a formatted response to the request. This gives the LLM a small amount of guidance on how to answer in the expected format. However, JSON formatting of the recipes is quite time-consuming, which can hurt the user experience. To deal with this, we generate the humorous reaction message (which takes less time) in parallel with the JSON recipe generation. We render the reaction response character by character as soon as it arrives, while waiting for the JSON recipe response. This reduces the feeling of waiting for a time-consuming response.

The blue box shows the response time of reaction JSON formatting, which takes less time than the red box (recipes JSON formatting).

If a task requires a little more creativity while keeping the response in a predefined format, we encourage developers to separate the main task into two subtasks: one produces the creative response with a higher temperature setting, while the other enforces the desired format with a low temperature setting, balancing the output.
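A minimal sketch of this two-subtask pattern, assuming a hypothetical generateText() wrapper around the PaLM API rather than the demo’s actual helper:

// Hypothetical wrapper around a PaLM API text generation call; stubbed here.
async function generateText(prompt: string, temperature: number): Promise<string> {
  throw new Error('Wire this up to the PaLM API.');
}

async function respondToMood(userMessage: string, showReaction: (text: string) => void) {
  // Fire both calls in parallel: a quick, creative reaction and a slower,
  // strictly formatted recipe.
  const reactionPromise = generateText(
    `React to this message as a funny but polite chef: "${userMessage}"`,
    0.8, // higher temperature -> more creative
  );
  const recipePromise = generateText(
    `Invent a recipe for this mood and return JSON with the keys ` +
      `"name", "ingredients", "instructions", and "description": "${userMessage}"`,
    0.2, // lower temperature -> more reliable formatting
  );

  // Show the short reaction as soon as it arrives...
  showReaction(await reactionPromise);
  // ...then parse the slower, structured recipe response.
  return JSON.parse(await recipePromise);
}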


Prompting

Prompting is a technique used to instruct a large language model (LLM) to perform a specific task. It involves providing the LLM with a short piece of text that describes the task, along with any relevant information the LLM may need to complete it. With the PaLM API, a prompt takes four fields as parameters: context, messages, temperature, and candidate_count, as sketched after the list below.

  • context provides background for the conversation and gives the LLM a better understanding of the role it should play.
  • messages is an array of chat messages, from past to present, alternating between the user (author=0) and the LLM (author=1). The first message is always from the user.
  • temperature is a float between 0 and 1. The higher the temperature, the more creative the response; the lower the temperature, the more predictable and conventional the response.
  • candidate_count is the number of responses the LLM will return.
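For illustration, a chat request built from these four fields might look like the sketch below. The exact endpoint, model name, and field casing are assumptions and may differ from the production PaLM API surface:

const requestBody = {
  prompt: {
    context: 'You are a creative and funny chef who invents recipes based on the user\'s mood.',
    messages: [
      // author "0" is the user; author "1" entries would hold earlier model replies.
      { author: '0', content: 'I am stressed and need to impress my future mother-in-law.' },
    ],
  },
  temperature: 0.8,   // higher -> more creative responses
  candidate_count: 1, // number of responses to return
};

// Hypothetical REST call; PALM_CHAT_ENDPOINT and API_KEY are placeholders.
// await fetch(`${PALM_CHAT_ENDPOINT}?key=${API_KEY}`, {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(requestBody),
// });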

In MoodFood, we used prompting to instruct the PaLM API to act as a creative and funny chef and to return imaginative recipes based on the user’s message. We also asked it to structure its output into distinct parts: reaction, name, ingredients, instructions, and description.

  • Reaction: the direct humorous response to the user’s message, delivered in a polite but entertaining way.
  • Name: the recipe name. We tell the PaLM API to generate recipe names with polite puns that don’t offend anyone.
  • Ingredients: a list of ingredients with measurements.
  • Description: the food description generated by the PaLM API.

An example of the prompt used in MoodFood

Third Party Integration

The PaLM API offers embedding services that make it straightforward to integrate customer data. To get started, you simply set up an embedding database of a partner’s data using the PaLM API embedding service.

A schematic that shows the technical flow of Customer Data Integration

Once integrated, when users search for food- or recipe-related information, the PaLM API searches the embedding space to locate the result that best matches their query. Furthermore, by integrating with the shopping API provided by our partners, we can also enable users to purchase ingredients directly from partner websites through the chat interface.
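A minimal sketch of this lookup, assuming a hypothetical embed() helper around the PaLM API embedding service and a small in-memory index of partner data:

// Hypothetical wrapper around the PaLM API text embedding service; stubbed here.
async function embed(text: string): Promise<number[]> {
  throw new Error('Wire this up to the PaLM API embedding service.');
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface IndexedItem { text: string; vector: number[]; }

// Pre-compute embeddings for the partner's data once, then find the entry
// whose embedding is closest to the user's query.
async function findBestMatch(query: string, index: IndexedItem[]): Promise<IndexedItem> {
  const queryVector = await embed(query);
  return index.reduce((best, item) =>
    cosineSimilarity(queryVector, item.vector) > cosineSimilarity(queryVector, best.vector)
      ? item
      : best);
}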


Partnerships

Swiggy, an Indian online food ordering and delivery platform, expressed their excitement about the use cases made possible by experiences like MoodFood.

“We're excited about the potential of Generative AI to transform the way we interact with our customers and merchants on our platform. Moodfood has the potential to help us go deeper into the products and services we offer, in a fun and engaging way." - Madhusudhan Rao, CTO, Swiggy

Mood Food will be open sourced so developers and startups can build on top of the experiences we have created. Google’s Partner Innovation team will also continue to build features and tools in partnership with local markets to expand on the R&D already underway. View the project on GitHub here.


Acknowledgements

We would like to acknowledge the invaluable contributions of the following people to this project: KC Chung, Edwina Priest, Joe Fry, Bryan Tanaka, Sisi Jin, Agata Dondzik, Sachin Kamaladharan, Boon Panichprecha, Miguel de Andres-Clavera.