Tag Archives: Applied AI

Project GameFace makes gaming accessible to everyone

Posted by Avneet Singh, Product Manager and Sisi Jin, UX Designer, Google PI, and Lance Carr, Collaborator

At I/O 2023, Google launched Project Gameface, an open-source, hands-free gaming ‘mouse’ enabling people to control a computer's cursor using their head movement and facial gestures. People can raise their eyebrows to click and drag, or open their mouth to move the cursor, making gaming more accessible.

The project was inspired by the story of quadriplegic video game streamer Lance Carr, who lives with muscular dystrophy, a progressive disease that weakens muscles. And we collaborated with Lance to bring Project Gameface to life. The full story behind the product is available on the Google Keyword blog here.

It’s been an extremely interesting experience to think about how a mouse cursor can be controlled in such a novel way. We conducted many experiments and found head movement and facial expressions can be a unique way to program the mouse cursor. MediaPipe’s new Face Landmarks Detection API with blendshape option made this possible as it allows any developer to leverage 478 3-dimensional face landmarks and 52 blendshape scores (coefficients representing facial expression) to infer detailed facial surfaces in real-time.
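For developers who want to try this themselves, here is a minimal sketch of querying the MediaPipe Face Landmarker Task from Python with blendshape output enabled; the model bundle path and input image are placeholders.

# Minimal MediaPipe Face Landmarker sketch with blendshape output enabled.
# "face_landmarker.task" is a placeholder path to the downloaded model bundle.
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision

options = vision.FaceLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,   # 52 blendshape scores per face
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

# Run detection on a single image; a webcam frame would be wrapped the same way.
image = mp.Image.create_from_file("face.jpg")
result = landmarker.detect(image)

landmarks = result.face_landmarks[0]       # 478 3D face landmarks
blendshapes = result.face_blendshapes[0]   # 52 scored facial gestures
print(blendshapes[0].category_name, blendshapes[0].score)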


Product Construct and Details

In this article, we share technical details of how we built Project Gameface and the various open source technologies we leveraged to create this exciting product!


Using head movement to move the mouse cursor


Moving image showing how the user controls cursor speed
Caption: Controlling head movement to move mouse cursors and customizing cursor speed to adapt to different screen resolutions.

Through this project, we explored the concept of using head movement to move the mouse cursor. We focused on the forehead and iris as our two landmark locations. Both forehead and iris landmarks are known for their stability. However, Lance noticed that the cursor didn't work well while using the iris landmark. The reason was that the iris may move slightly when people blink, causing the cursor to move unintentionally. Therefore, we decided to use the forehead landmark as the default tracking option.

There are instances where people may encounter challenges when moving their head in certain directions. For example, Lance can move his head more quickly to the right than left. To address this issue, we introduced a user-friendly solution: separate cursor speed adjustment for each direction. This feature allows people to customize the cursor's movement according to their preferences, facilitating smoother and more comfortable navigation.

We wanted the experience to be as smooth as a handheld controller. Jitteriness of the mouse cursor is one of the major problems we wanted to overcome. The appearance of cursor jittering is influenced by various factors, including the user's setup, camera, noise, and lighting conditions. We implemented an adjustable cursor smoothing feature so users can easily fine-tune it to best suit their specific setup.
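As a rough illustration (not the project's actual implementation), per-direction speed and cursor smoothing can be thought of as direction-dependent multipliers and an exponential filter applied to the tracked landmark's frame-to-frame movement. The landmark callback and the speed values below are hypothetical.

# Illustrative sketch: map forehead-landmark motion to cursor motion with
# per-direction speed settings and exponential smoothing. All values are made up.
import pyautogui

speed = {"left": 1.0, "right": 1.6, "up": 1.2, "down": 1.2}  # user-tuned multipliers
smoothing = 0.8            # 0 = no smoothing, closer to 1 = heavier smoothing
smoothed_dx = smoothed_dy = 0.0
prev_x = prev_y = None

def on_forehead_landmark(x_px, y_px):
    """Hypothetical callback invoked once per camera frame with pixel coordinates."""
    global prev_x, prev_y, smoothed_dx, smoothed_dy
    if prev_x is None:
        prev_x, prev_y = x_px, y_px
        return
    dx, dy = x_px - prev_x, y_px - prev_y
    prev_x, prev_y = x_px, y_px

    # Apply a different speed multiplier depending on the direction of travel.
    dx *= speed["right"] if dx > 0 else speed["left"]
    dy *= speed["down"] if dy > 0 else speed["up"]

    # Exponential smoothing to reduce jitter from camera noise and lighting.
    smoothed_dx = smoothing * smoothed_dx + (1 - smoothing) * dx
    smoothed_dy = smoothing * smoothed_dy + (1 - smoothing) * dy

    pyautogui.moveRel(int(smoothed_dx), int(smoothed_dy))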


Using facial expressions to perform mouse actions and keyboard press

Very early on, one of our primary insights was that people have varying comfort levels making different facial expressions. A gesture that comes easily to one user may be extremely difficult for another to do deliberately. For instance, Lance can move his eyebrows independently with ease while the rest of the team struggled to match Lance’s skill. Hence, we decided to create a functionality for people to customize which expressions they used to control the mouse.

Moving image showing how the user controls the cursor using their facial expressions
Caption: Using facial expressions to control mouse

Think of it as a custom binding of a gesture to a mouse action. When deliberating which mouse actions the product should cover, we tried to capture common scenarios, from left and right click to scrolling up and down. However, using the head to control mouse cursor movement is a different experience than using a conventional mouse. We also wanted to give users the option to reset the mouse cursor to the center of the screen using a facial gesture.

Moving image showing how the user controls the keyboard using their facial expressions
Caption: Using facial expressions to control keyboard

The most recent release of MediaPipe Face Landmarks Detection brings an exciting addition: blendshapes output. With this enhancement, the API generates 52 face blendshape values which represent the expressiveness of 52 facial gestures, such as raising the left eyebrow or opening the mouth. These values can be effectively mapped to control a wide range of functions, offering users expanded possibilities for customization and manipulation.
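For illustration, a minimal sketch of such a binding might map MediaPipe blendshape category names to mouse actions; the gesture-to-action pairs and the trigger check here are invented examples, not the shipped defaults.

# Hypothetical binding of MediaPipe blendshape categories to mouse actions.
import pyautogui

bindings = {
    "browInnerUp": pyautogui.mouseDown,                   # raise eyebrows -> start a drag
    "jawOpen": lambda: pyautogui.click(button="right"),   # open mouth -> right click
    "mouthSmileLeft": lambda: pyautogui.scroll(-200),     # smile left -> scroll down
}

def handle_blendshapes(blendshapes, threshold=0.5):
    """blendshapes: MediaPipe categories with .category_name and .score fields."""
    for shape in blendshapes:
        action = bindings.get(shape.category_name)
        if action and shape.score > threshold:
            action()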

We’ve been able to extend the same functionality and add the option for keyboard binding too. This lets users use their facial gestures to press keyboard keys as well, in the same binding fashion.


Set Gesture Size to see when to trigger a mouse/keyboard action


Moving image showing setting the gesture size to trigger an action
Caption: Set the gesture size to trigger an action

While testing the software, we found that facial expressions were more or less pronounced for each of us, so we incorporated the idea of a gesture size, which lets people control the extent to which they need to gesture to trigger a mouse action. Blendshape coefficients were helpful here: different users can now set a different threshold for each specific expression, which helps them customize the experience to their comfort.
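In effect, the gesture size is a per-expression threshold on the blendshape score. A minimal sketch, with made-up threshold values:

# Hypothetical per-user "gesture size" thresholds on blendshape scores (0.0 to 1.0).
gesture_size = {
    "browInnerUp": 0.3,   # a subtle eyebrow raise is enough for this user
    "jawOpen": 0.7,       # the mouth must open wide before the action fires
}

def is_triggered(category_name, score):
    """Return True when the expression exceeds the user's chosen gesture size."""
    threshold = gesture_size.get(category_name)
    return threshold is not None and score >= threshold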


Keeping the camera feed available

Another key insight we received from Lance was that gamers often have multiple cameras. For our machine learning models to operate optimally, it’s best to have a camera pointing straight at the user’s face with decent lighting. So we incorporated the ability for the user to select the correct camera to help frame them and give the most optimal performance.
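One simple way to support camera selection (shown here as an illustrative sketch with OpenCV, not necessarily the project's approach) is to probe the first few device indices and let the user choose:

# Illustrative camera enumeration with OpenCV: probe a few device indices and
# let the user pick the camera that frames their face best.
import cv2

def list_cameras(max_index=5):
    available = []
    for index in range(max_index):
        cap = cv2.VideoCapture(index)
        if cap.isOpened():
            available.append(index)
        cap.release()
    return available

print("Available camera indices:", list_cameras())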

Our product's user interface incorporates a live camera feed, providing users with real-time visibility of their head movements and gestures. This feature brings several advantages. Firstly, users can set thresholds more effectively by directly observing their own movements. The visual representation enables informed decisions on appropriate threshold values. Moreover, the live camera feed enhances users' understanding of different gestures as they visually correlate their movements with the corresponding actions in the application. Overall, the camera feed significantly enhances the user experience, facilitating accurate threshold settings and a deeper comprehension of gestures.


Product Packaging

Our next step was to create the ability to control the mouse and keyboard using our custom defined logic. To enable mouse and keyboard control within our Python application, we utilize two libraries: PyAutoGUI for mouse control and PyDirectInput for keyboard control. PyAutoGUI is chosen for its robust mouse control capabilities, allowing us to simulate mouse movements, clicks, and other actions. On the other hand, we leverage PyDirectInput for keyboard control as it offers enhanced compatibility with various applications, including games and those relying on DirectX.
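The division of labor between the two libraries can be illustrated with a few representative calls; the specific keys and buttons below are arbitrary examples.

import pyautogui      # mouse control
import pydirectinput  # keyboard control, with better DirectX/game compatibility

# Mouse actions simulated with PyAutoGUI.
pyautogui.moveRel(20, 0)          # nudge the cursor 20 px to the right
pyautogui.click()                 # left click
pyautogui.mouseDown()             # press and hold, e.g. to start a drag
pyautogui.mouseUp()               # release the drag

# Keyboard actions simulated with PyDirectInput.
pydirectinput.press("space")      # tap a key
pydirectinput.keyDown("w")        # hold a key down...
pydirectinput.keyUp("w")          # ...and release it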

For our application packaging, we used PyInstaller to turn our Python-based application into an executable, making it easier for users to run our software without the need for installing Python or additional dependencies. PyInstaller provides a reliable and efficient means to distribute our application, ensuring a smooth user experience.
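For reference, a typical PyInstaller build, invoked here through its Python entry point, looks something like the sketch below; the script name is a placeholder.

# Build a single-file, windowed executable from the app's entry script.
# "gameface_app.py" is a placeholder name used for illustration.
import PyInstaller.__main__

PyInstaller.__main__.run([
    "gameface_app.py",
    "--onefile",     # bundle everything into one executable
    "--windowed",    # suppress the console window
])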

The product introduces a novel form factor to engage users in an important function like handling the mouse cursor. Making the product and its UI intuitive and easy to follow was a top priority for our design and engineering team. We worked closely with Lance to incorporate his feedback into our UX considerations, and we found CustomTkinter was able to handle most of our UI considerations in Python.
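As a small, self-contained example of the kind of control involved (hypothetical, not taken from the GameFace UI), CustomTkinter turns a setting like cursor speed into a few lines of code:

# Minimal CustomTkinter sketch: a slider a user might drag to tune cursor speed.
import customtkinter

app = customtkinter.CTk()
app.title("Cursor speed")

def on_speed_change(value):
    print("New cursor speed multiplier:", round(value, 2))

slider = customtkinter.CTkSlider(app, from_=0.1, to=3.0, command=on_speed_change)
slider.pack(padx=20, pady=20)

app.mainloop()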

We’re excited to see the potential of Project GameFace and can’t wait for developers and enterprises to leverage it to build new experiences. The code for GameFace is open sourced on GitHub here.


Acknowledgements

We would like to acknowledge the invaluable contributions of the following people to this project: Lance Carr, David Hewlett, Laurence Moroney, Khanh LeViet, Glenn Cameron, Edwina Priest, Joe Fry, Feihong Chen, Boon Panichprecha, Dome Seelapun, Kim Nomrak, Pear Jaionnom, Lloyd Hightower

Using Generative AI for Travel Inspiration and Discovery

Posted by Yiling Liu, Product Manager, Google Partner Innovation

Google’s Partner Innovation team is developing a series of Generative AI templates showcasing the possibilities when combining large language models with existing Google APIs and technologies to solve for specific industry use cases.

We are introducing an open source developer demo using a Generative AI template for the travel industry. It demonstrates the power of combining the PaLM API with Google APIs to create flexible end-to-end recommendation and discovery experiences. Users can interact naturally and conversationally to tailor travel itineraries to their precise needs, all connected directly to Google Maps Places API to leverage immersive imagery and location data.

An image that overviews the Travel Planner experience. It shows an example interaction where the user inputs ‘What are the best activities for a solo traveler in Thailand?’. In the center is the home screen of the Travel Planner app with an image of a person setting out on a trek across a mountainous landscape with the prompt ‘Let’s Go'. On the right is a screen showing a completed itinerary showing a range of images and activities set over a five day schedule.

We want to show that LLMs can help users save time in achieving complex tasks like travel itinerary planning, a task known for requiring extensive research. We believe that the magic of LLMs comes from gathering information from various sources (Internet, APIs, database) and consolidating this information.

The demo allows you to effortlessly plan your travel by conversationally setting destinations, budgets, interests, and preferred activities. It then provides a personalized travel itinerary, and users can easily explore infinite variations and get inspiration from multiple travel locations and photos. Everything is as seamless and fun as talking to a well-traveled friend!

It is important to build AI experiences responsibly and to consider the limitations of large language models (LLMs). LLMs are a promising technology, but they are not perfect. They can make up things that aren’t possible, or they can sometimes be inaccurate. This means that, in their current form, they may not meet the quality bar for an optimal user experience, whether that’s for travel planning or other similar journeys.

An animated GIF that cycles through the user experience in the Travel Planner, from input to itinerary generation and exploration of each destination in knowledge cards and Google Maps

Open Source and Developer Support

Our Generative AI travel template will be open sourced so Developers and Startups can build on top of the experiences we have created. Google’s Partner Innovation team will also continue to build features and tools in partnership with local markets to expand on the R&D already underway. We’re excited to see what everyone makes! View the project on GitHub here.


Implementation

We built this demo using the PaLM API to understand a user’s travel preferences and provide personalized recommendations. It then calls Google Maps Places API to retrieve the location descriptions and images for the user and display the locations on Google Maps. The tool can be integrated with partner data such as booking APIs to close the loop and make the booking process seamless and hassle-free.

A schematic that shows the technical flow of the experience, outlining inputs, outputs, and where instances of the PaLM API are used alongside different Google APIs, prompts, and formatting.

Prompting

We built the prompt’s preamble by giving it context and examples. In the context we instruct Bard to provide a 5 day itinerary by default, and to put markers around the locations so we can integrate with the Google Maps API afterwards to fetch location-related information from Google Maps.

Hi! Bard, you are the best large language model. Please create only the itinerary from the user's message: "${msg}" . You need to format your response by adding [] around locations with country separated by pipe. The default itinerary length is five days if not provided.

We also give the PaLM API some examples so it can learn how to respond. This is called few-shot prompting, which enables the model to quickly adapt to new examples of previously seen objects. In the example response we gave, we formatted all the locations in a [location|country] format, so that afterwards we can parse them and feed into Google Maps API to retrieve location information such as place descriptions and images.
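A simplified sketch of how the preamble and a few-shot example might be assembled is shown below; the field names follow the PaLM API chat request format as we understand it, and the example itinerary text is an invented placeholder.

# Sketch of a PaLM chat prompt: a preamble (context) plus one few-shot example
# showing the [location|country] output format. The example content is invented.
context = (
    "Please create only the itinerary from the user's message. Format your "
    "response by adding [] around locations with country separated by pipe. "
    "The default itinerary length is five days if not provided."
)

examples = [
    {
        "input": {"content": "What are the best activities for a solo traveler in Thailand?"},
        "output": {"content": "Day 1: Visit [The Grand Palace|Thailand], then ..."},
    }
]
# `context` and `examples` are sent with every request, alongside the running
# list of user/model messages shown in the Conversational Memory section below.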


Integration with Maps API

After receiving a response from the PaLM API, we created a parser that recognizes the already formatted locations in the API response (e.g. [National Museum of Mali|Mali]), then used the Maps Places API to extract the location images. They were then displayed in the app to give users a general idea of the ambience of the travel destinations.
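A minimal sketch of such a parser and lookup, assuming the Places API Text Search endpoint and a placeholder API key:

# Sketch: pull "[Location|Country]" markers out of the PaLM response and look
# each one up with the Places API Text Search endpoint.
import re
import requests

PLACES_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"

def extract_locations(llm_text):
    """Return (location, country) pairs for every [Location|Country] marker."""
    return re.findall(r"\[([^|\]]+)\|([^\]]+)\]", llm_text)

def lookup_place(location, country, api_key):
    params = {"query": f"{location}, {country}", "key": api_key}
    return requests.get(PLACES_URL, params=params).json()

print(extract_locations("Day 1: Visit the [National Museum of Mali|Mali] ..."))
# [('National Museum of Mali', 'Mali')]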

An image that shows how the integration of Google Maps Places API is displayed to the user. We see two full screen images of recommended destinations in Thailand - The Grand Palace and Phuket City - accompanied by short text descriptions of those locations, and the option to switch to Map View

Conversational Memory

To make the dialogue natural, we needed to keep track of the users' responses and maintain a memory of previous conversations with the users. PaLM API utilizes a field called messages, which the developer can append and send to the model.

Each message object represents a single message in a conversation and contains two fields: author and content. In the PaLM API, author=0 indicates the human user who is sending the message to PaLM, and author=1 indicates PaLM responding to the user’s message. The content field contains the text content of the message. This can be any text string that represents the message content, such as a question, a statement, or a command.

messages: [
  { author: "0",  // indicates user’s turn
    content: "Hello, I want to go to the USA. Can you help me plan a trip?" },
  { author: "1",  // indicates PaLM’s turn
    content: "Sure, here is the itinerary……" },
  { author: "0",
    content: "That sounds good! I also want to go to some museums." }
]

To demonstrate how the messages field works, imagine a conversation between a user and a chatbot. The user and the chatbot take turns asking and answering questions. Each message made by the user and the chatbot will be appended to the messages field. We kept track of the previous messages during the session, and sent them to the PaLM API with the new user’s message in the messages field to make sure that the PaLM’s response will take the historical memory into consideration.
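A minimal sketch of this bookkeeping, with a placeholder send function standing in for the actual PaLM API request:

# Sketch of conversational memory: append every turn to `messages` and send the
# full history with each new request. `send_to_palm` is a placeholder for the
# actual PaLM API call.
messages = []

def add_user_turn(text):
    messages.append({"author": "0", "content": text})

def add_model_turn(text):
    messages.append({"author": "1", "content": text})

def chat(user_text, send_to_palm):
    add_user_turn(user_text)
    reply = send_to_palm(messages)   # the model sees the whole history
    add_model_turn(reply)
    return reply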


Third Party Integration

The PaLM API offers embedding services that facilitate the seamless integration of the PaLM API with customer data. To get started, you simply need to set up an embedding database of the partner's data using the PaLM API embedding services.

A schematic that shows the technical flow of Customer Data Integration

Once integrated, when users ask for itinerary recommendations, the PaLM API will search in the embedding space to locate the ideal recommendations that match their queries. Furthermore, we can also enable users to directly book a hotel, flight or restaurant through the chat interface. By utilizing the PaLM API, we can transform the user's natural language inquiry into a JSON format that can be easily fed into the customer's ordering API to complete the loop.
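Conceptually, the retrieval step looks like the sketch below, where embed stands in for the PaLM API embedding service and cosine similarity ranks the partner items against the user's query:

# Sketch of embedding-based retrieval over partner data. `embed` is a placeholder
# for the PaLM API embedding service; cosine similarity ranks candidate items.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(query, partner_items, embed, top_k=3):
    """partner_items: list of (text, metadata) tuples; embed: text -> vector."""
    query_vec = embed(query)
    scored = [
        (cosine_similarity(query_vec, embed(text)), text, meta)
        for text, meta in partner_items
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]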


Partnerships

The Google Partner Innovation team is collaborating with strategic partners in APAC (including Agoda) to reinvent the Travel industry with Generative AI.


"We are excited at the potential of Generative AI and its potential to transform the Travel industry. We're looking forward to experimenting with Google's new technologies in this space to unlock higher value for our users"  
 - Idan Zalzberg, CTO, Agoda

Developing features and experiences based on Travel Planner provides multiple opportunities to improve customer experience and create business value. Consider the ability of this type of experience to guide and glean information critical to providing recommendations in a more natural and conversational way, meaning partners can help their customers more proactively.

For example, prompts could guide the model to take weather into consideration and make scheduling adjustments based on the outlook or the season. Developers can also create pathways based on keywords or prompts to determine traveler types like ‘Budget Traveler’ or ‘Family Trip’, and generate a kind of scaled personalization that, when combined with existing customer data, creates huge opportunities in loyalty programs, CRM, customization, booking and so on.

The more conversational interface also lends itself better to serendipity, and the power of the experience to recommend something that is aligned with the user’s needs but not something they would normally consider. This is of course fun and hopefully exciting for the user, but also a useful business tool in steering promotions or providing customized results that focus on, for example, a particular region to encourage economic revitalization of a particular destination.

Potential Use Cases are clear for the Travel and Tourism industry, but the same mechanics are transferable to retail and commerce for product recommendation, discovery for Fashion or Media and Entertainment, or even configuration and personalization for Automotive.


Acknowledgements

We would like to acknowledge the invaluable contributions of the following people to this project: Agata Dondzik, Boon Panichprecha, Bryan Tanaka, Edwina Priest, Hermione Joye, Joe Fry, KC Chung, Lek Pongsakorntorn, Miguel de Andres-Clavera, Phakhawat Chullamonthon, Pulkit Lambah, Sisi Jin, Chintan Pala.

Generative AI ‘Food Coach’ that pairs food with your mood

Posted by Avneet Singh, Product Manager, Google PI

Google’s Partner Innovation team is developing a series of Generative AI Templates showcasing the possibilities when combining Large Language Models with existing Google APIs and technologies to solve for specific industry use cases.

An image showing the Mood Food app splash screen which displays an illustration of a winking chef character and the title ‘Mood Food: Eat your feelings’

Overview

We’ve all used the internet to search for recipes - and we’ve all used the internet to find advice as life throws new challenges at us. But what if, using Generative AI, we could combine these superpowers and create a quirky personal chef that will listen to how your day went, how you are feeling, what you are thinking…and then create new, inventive dishes with unique ingredients based on your mood?

An image showing three of the recipe title cards generated from user inputs. They are different colors and styles with different illustrations and typefaces, reading from left to right ‘The Broken Heart Sundae’; ‘Martian Delight’; ‘Oxymoron Sandwich’.

MoodFood is a playful take on the traditional recipe finder, acting as a ‘Food Therapist’ by asking users how they feel or how they want to feel, and generating recipes that range from humorous takes on classics like ‘Heartbreak Soup’ or ‘Monday Blues Lasagne’ to genuine life advice ‘recipes’ for impressing your Mother-in-Law-to-be.

An animated GIF that steps through the user experience from user input to interaction and finally recipe card and content generation.

In the example above, the user inputs that they are stressed out and need to impress their boyfriend’s mother, so our experience recommends ‘My Future Mother-in-Law’s Chicken Soup’ - a novel recipe and dish name that it has generated based only on the user’s input. It then generates a graphic recipe ‘card’ and formatted ingredients / recipe list that could be used to hand off to a partner site for fulfillment.

Potential Use Cases are rooted in a novel take on product discovery. Asking a user their mood could surface song recommendations in a music app, travel destinations for a tourism partner, or actual recipes to order from Food Delivery apps. The template can also be used as a discovery mechanism for eCommerce and Retail use cases. LLMs are opening a new world of exploration and possibilities. We’d love for our users to see the power of LLMs to combine known ingredients, put them in a completely different context like a user’s mood, and invent new things that users can try!


Implementation

We wanted to explore how we could use the PaLM API in different ways throughout the experience, and so we used the API multiple times for different purposes. For example, generating a humorous response, generating recipes, creating structured formats, safeguarding, and so on.

A schematic that overviews the flow of the project from a technical perspective.

In the current demo, we use the LLM four times. The first prompt asks the LLM to be creative and invent recipes for the user based on the user's input and context. The second prompt formats the responses into JSON. The third prompt ensures the naming is appropriate, as a safeguard. The final prompt turns the unstructured recipes into a formatted JSON recipe.

One of the jobs that LLMs can help developers with is data formatting. Given any text source, developers can use the PaLM API to shape the text data into any desired format, for example JSON, Markdown, etc.

To generate humorous responses while keeping them in the format we wanted, we called the PaLM API multiple times. To make the creative responses more varied, we used a higher “temperature” for the model, and lowered the temperature when formatting the responses.

In this demo, we want the PaLM API to return recipes in a JSON format, so we attach an example of a formatted response to the request. This gives the LLM just enough guidance on how to answer accurately in the desired format. However, JSON formatting of the recipes is quite time-consuming, which can hurt the user experience. To deal with this, we use the humorous response to generate only a short reaction message (which takes less time), in parallel with the JSON recipe generation. We render the reaction response first, character by character as it is received, while waiting for the JSON recipe response. This reduces the feeling of waiting for a time-consuming response.

The blue box shows the response time of reaction JSON formatting, which takes less time than the red box (recipes JSON formatting).
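A rough sketch of this parallelization, with call_palm standing in for the actual PaLM API request and the prompts heavily abbreviated:

# Sketch: run the quick "reaction" call and the slower JSON-recipe call in
# parallel, showing the reaction as soon as it arrives. `call_palm` is a
# placeholder for the real PaLM API request; prompts are abbreviated.
from concurrent.futures import ThreadPoolExecutor

def generate(user_message, call_palm):
    with ThreadPoolExecutor(max_workers=2) as pool:
        reaction_future = pool.submit(
            call_palm, prompt=f"React humorously to: {user_message}", temperature=0.9
        )
        recipe_future = pool.submit(
            call_palm, prompt=f"Return a JSON recipe for: {user_message}", temperature=0.2
        )
        print(reaction_future.result())   # the short reaction is rendered first
        return recipe_future.result()     # the structured recipe follows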

If a task requires a little more creativity while keeping the response in a predefined format, we encourage developers to separate the main task into two subtasks: one for creative responses with a higher temperature setting, and another that enforces the desired format with a lower temperature setting, balancing the output.


Prompting

Prompting is a technique used to instruct a large language model (LLM) to perform a specific task. It involves providing the LLM with a short piece of text that describes the task, along with any relevant information that the LLM may need to complete it. With the PaLM API, prompting takes four fields as parameters: context, messages, temperature, and candidate_count, as shown in the sketch after the list below.

  • The context is the context of the conversation. It is used to give the LLM a better understanding of the conversation.
  • The messages is an array of chat messages from past to present alternating between the user (author=0) and the LLM (author=1). The first message is always from the user.
  • The temperature is a float number between 0 and 1. The higher the temperature, the more creative and varied the response will be. The lower the temperature, the more predictable and focused the response will be.
  • The candidate_count is the number of responses that the LLM will return.
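The following is a minimal call sketch using the google.generativeai Python client that wrapped the PaLM API; treat the exact client surface as an assumption and check it against the client version you use. The API key and prompt text are placeholders.

# Minimal sketch of a PaLM API chat call using the four fields described above.
# Assumes the google.generativeai client for the PaLM API; the exact signature
# is an assumption, and the API key and messages are placeholders.
import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")

response = palm.chat(
    context="You are a creative and funny chef who pairs recipes with the user's mood.",
    messages=["I'm stressed out and need to impress my boyfriend's mother."],
    temperature=0.8,      # higher -> more creative responses
    candidate_count=1,    # number of responses the LLM returns
)
print(response.last)      # the model's latest reply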

In Mood Food, we used prompting to instruct the PaLM API. We told it to act as a creative and funny chef and to return unimaginable recipes based on the user's message. We also asked it to formalize the response into the following parts: reaction, name, ingredients, instructions, and description.

  • Reaction: the direct humorous response to the user’s message, in a polite but entertaining way.
  • Name: the recipe name. We tell the PaLM API to generate the recipe name with polite puns that don't offend anyone.
  • Ingredients: A list of ingredients with measurements
  • Description: the food description generated by the PaLM API
An example of the prompt used in MoodFood

Third Party Integration

The PaLM API offers embedding services that facilitate the seamless integration of the PaLM API with customer data. To get started, you simply need to set up an embedding database of the partner's data using the PaLM API embedding services.

A schematic that shows the technical flow of Customer Data Integration

Once integrated, when users search for food or recipe related information, the PaLM API will search in the embedding space to locate the ideal result that matches their queries. Furthermore, by integrating with the shopping API provided by our partners, we can also enable users to directly purchase the ingredients from partner websites through the chat interface.


Partnerships

Swiggy, an Indian online food ordering and delivery platform, expressed their excitement when considering the use cases made possible by experiences like MoodFood.

“We're excited about the potential of Generative AI to transform the way we interact with our customers and merchants on our platform. Moodfood has the potential to help us go deeper into the products and services we offer, in a fun and engaging way.” - Madhusudhan Rao, CTO, Swiggy

Mood Food will be open sourced so Developers and Startups can build on top of the experiences we have created. Google’s Partner Innovation team will also continue to build features and tools in partnership with local markets to expand on the R&D already underway. View the project on GitHub here.


Acknowledgements

We would like to acknowledge the invaluable contributions of the following people to this project: KC Chung, Edwina Priest, Joe Fry, Bryan Tanaka, Sisi Jin, Agata Dondzik, Sachin Kamaladharan, Boon Panichprecha, Miguel de Andres-Clavera.