Category Archives: Google Developers Blog

News and insights on Google platforms, tools and events

Gemma Family Expands with Models Tailored for Developers and Researchers

Posted by Tris Warkentin – Director, Product Management and Jane Fine - Senior Product Manager

In February we announced Gemma, our family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. The community's incredible response – including impressive fine-tuned variants, Kaggle notebooks, integration into tools and services, recipes for RAG using databases like MongoDB, and lots more – has been truly inspiring.

Today, we're excited to announce our first round of additions to the Gemma family, expanding the possibilities for ML developers to innovate responsibly: CodeGemma for code completion and generation tasks as well as instruction following, and RecurrentGemma, an efficiency-optimized architecture for research experimentation. Plus, we're sharing some updates to Gemma and our terms aimed at improvements based on invaluable feedback we've heard from the community and our partners.


Introducing the first two Gemma variants


CodeGemma: Code completion, generation, and chat for developers and businesses

Harnessing the foundation of our Gemma models, CodeGemma brings powerful yet lightweight coding capabilities to the community. CodeGemma models are available as a 7B pretrained variant that specializes in code completion and code generation tasks, a 7B instruction-tuned variant for code chat and instruction-following, and a 2B pretrained variant for fast code completion that fits on your local computer. CodeGemma models have several advantages:

  • Intelligent code completion and generation: Complete lines, functions, and even generate entire blocks of code – whether you're working locally or leveraging cloud resources. 
  • Enhanced accuracy: Trained on 500 billion tokens of primarily English language data from web documents, mathematics, and code, CodeGemma models generate code that's not only more syntactically correct but also semantically meaningful, helping reduce errors and debugging time. 
  • Multi-language proficiency: Your invaluable coding assistant for Python, JavaScript, Java, and other popular languages. 
  • Streamlined workflows: Integrate a CodeGemma model into your development environment to write less boilerplate, and focus on interesting and differentiated code that matters – faster.
image of streamlined workflows within an existing AI dev project with CodeGemma integrated
This table compares the performance of CodeGemma with other similar models on both single and multi-line code completion tasks. Learn more in the technical report.

Learn more about CodeGemma in our report or try it in this quickstart guide.
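
Code completion models of this kind are usually prompted in a fill-in-the-middle style: the application sends the code before and after the cursor, and the model generates what belongs in between. The sketch below only assembles such a prompt string. The control-token strings are an assumption based on the format described in the CodeGemma model documentation, and `build_fim_prompt` is a hypothetical helper, so check the model card before relying on the exact tokens.

```python
# Assumed fill-in-the-middle control tokens; verify them against the model card.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prompt asking the model to fill the gap between prefix and suffix."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Code before and after the cursor in a hypothetical editor buffer.
before_cursor = "def mean(values):\n    total = "
after_cursor = "\n    return total / len(values)\n"

prompt = build_fim_prompt(before_cursor, after_cursor)
print(prompt)
```

The resulting string would be passed to whichever runtime hosts the model; the text the model generates after the final token is the suggested completion for the gap.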


RecurrentGemma: Efficient, faster inference at higher batch sizes for researchers

RecurrentGemma is a technically distinct model that leverages recurrent neural networks and local attention to improve memory efficiency. While achieving similar benchmark score performance to the Gemma 2B model, RecurrentGemma's unique architecture results in several advantages:

  • Reduced memory usage: Lower memory requirements allow for the generation of longer samples on devices with limited memory, such as single GPUs or CPUs. 
  • Higher throughput: Because of its reduced memory usage, RecurrentGemma can perform inference at significantly higher batch sizes, thus generating substantially more tokens per second (especially when generating long sequences). 
  • Research innovation: RecurrentGemma showcases a non-transformer model that achieves high performance, highlighting advancements in deep learning research. 
graph showing maximum throughput when sampling from a prompt of 2k tokens on TPUv5e
This chart reveals how RecurrentGemma maintains its sampling speed regardless of sequence length, while Transformer-based models like Gemma slow down as sequences get longer.

To understand the underlying technology, check out our paper. For practical exploration, try the notebook, which demonstrates how to finetune the model.
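
The memory argument above can be made concrete with some back-of-the-envelope arithmetic. A Transformer with global attention caches keys and values for every past token, so the cache grows with sequence length, while local attention only keeps a fixed window. The sketch below models just that attention cache (not the recurrent state, which has a fixed size). The layer count, head count, head size, and window are hypothetical values chosen for illustration, not the published configuration of either model.

```python
def kv_cache_bytes(seq_len, batch_size, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, window=None):
    """Bytes needed to cache attention keys and values.

    With global attention (window=None) the cache grows with seq_len.
    With local attention it stops growing once seq_len exceeds the window.
    """
    cached_positions = seq_len if window is None else min(seq_len, window)
    # Factor of 2: one tensor for keys and one for values.
    return (2 * batch_size * num_layers * num_kv_heads * head_dim
            * cached_positions * bytes_per_value)


# Hypothetical model shape, used only to make the arithmetic concrete.
shape = dict(num_layers=18, num_kv_heads=1, head_dim=256)

for seq_len in (2_048, 8_192, 32_768):
    global_mib = kv_cache_bytes(seq_len, batch_size=8, **shape) / 2**20
    local_mib = kv_cache_bytes(seq_len, batch_size=8, window=2_048, **shape) / 2**20
    print(f"{seq_len:>6} tokens: global {global_mib:8.1f} MiB, local {local_mib:8.1f} MiB")
```

Because the windowed cache stays flat as sequences get longer, the memory that would have gone to the cache can instead hold a larger batch, which is where the throughput gain comes from.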


Built upon Gemma foundations, expanding capabilities

Guided by the same principles of the original Gemma models, the new model variants offer:

  • Open availability: Encourages innovation and collaboration with its availability to everyone and flexible terms of use. 
  • High-performance and efficient capabilities: Advances the capabilities of open models with code-specific domain expertise and optimized design for exceptionally fast completion and generation. 
  • Responsible design: Our commitment to responsible AI helps ensure the models deliver safe and reliable results. 
  • Flexibility for diverse software and hardware:  
    • Both CodeGemma and RecurrentGemma: Built with JAX and compatible with JAX, PyTorch, Hugging Face Transformers, and Gemma.cpp. Enable local experimentation and cost-effective deployment across various hardware, including laptops, desktops, NVIDIA GPUs, and Google Cloud TPUs.  
    • CodeGemma: Additionally compatible with Keras, NVIDIA NeMo, TensorRT-LLM, Optimum-NVIDIA, and MediaPipe, and available on Vertex AI. 
    • RecurrentGemma: Support for all the aforementioned products will be available in the coming weeks.

Gemma 1.1 update

Alongside the new model variants, we're releasing Gemma 1.1, which includes performance improvements. Additionally, we've listened to developer feedback, fixed bugs, and updated our terms to provide more flexibility.


Get started today

These first Gemma model variants are available in various places worldwide, starting today on Kaggle, Hugging Face, and Vertex AI Model Garden. Here's how to get started:

We invite you to try the CodeGemma and RecurrentGemma models and share your feedback on Kaggle. Together, let's shape the future of AI-powered content creation and understanding.

ML Olympiad 2024: Globally Distributed ML Competitions by Google ML Community

Posted by Bitnoori Keum – DevRel Community Manager

The ML Olympiad consists of Kaggle Community Competitions organized by ML GDE, TFUG, and other ML communities, aiming to provide developers with opportunities to learn and practice machine learning. Following successful rounds in 2022 and 2023, the third round has now launched with support from Google for Developers for each competition host. Over the last two rounds, 605 teams participated in 32 competitions, generating 105 discussions and 170 notebooks. We encourage you to join this round to gain hands-on experience with machine learning and tackle real-world challenges.


ML Olympiad Community Competitions

Over 20 ML Olympiad community competitions are currently open. Visit the ML Olympiad page to participate.

Smoking Detection in Patients

Predict smoking status with bio-signal ML models
Host: Rishiraj Acharya (AI/ML GDE) / TFUG Kolkata

TurtleVision Challenge

Develop a classification model to distinguish between jellyfish and plastic pollution in ocean imagery
Host: Anas Lahdhiri / MLAct

Detect hallucinations in LLMs

Detect which answers provided by a Mistral 7B instruct model are most likely hallucinations
Host: Luca Massaron (AI/ML GDE)

ZeroWasteEats

Find ML solutions to reduce food wastage
Host: Anushka Raj / TFUG Hajipur

Predicting Wellness

Predict the percentage of body fat in men using multiple regression methods
Host: Ankit Kumar Verma / TFUG Prayagraj

Offbeats Edition

Build a regression model to predict the age of the crab
Host: Ayush Morbar / Offbeats Byte Labs

Nashik Weather

Predict the condition of weather in Nashik, India
Host: TFUG Nashik

Predicting Earthquake Damage

Predict the level of damage to buildings caused by earthquake based on aspects of building location and construction
Host: Usha Rengaraju

Forecasting Bangladesh's Weather

Predict whether a particular day will be rainy, along with its amount of rainfall and average temperature.
Host: TFUG Bangladesh (Dhaka)

CO2 Emissions Prediction Challenge

Predict CO2 emissions per capita for 2030 using global development indicators
Host: Md Shahriar Azad Evan, Shuvro Pal / TFUG North Bengal

AI & ML Malaysia

Predict loan approval status
Host: Kuan Hoong (AI/ML GDE) / Artificial Intelligence & Machine Learning Malaysia User Group

Sustainable Urban Living

Predict the habitability score of properties
Host: Ashwin Raj / BeyondML

Toxic Language (PTBR) Detection

(in local language)
Classify Brazilian Portuguese tweets into one of two classes: toxic or non-toxic.
Host: Mikaeri Ohana, Pedro Gengo, Vinicius F. Caridá (AI/ML GDE)

Improving disaster response

Predict humanitarian aid contributions made in response to disasters around the world
Host: Yara Armel Desire / TFUG Abidjan

Urban Traffic Density

Develop predictive models to estimate the traffic density in urban areas
Host: Kartikey Rawat / TFUG Durg

Know Your Customer Opinion

Classify each customer opinion into one of several Likert scale levels
Host: TFUG Surabaya

Forecasting India's Weather

Predict the temperature for a particular month
Host: Mohammed Moinuddin / TFUG Hyderabad

Classification Champ

Develop classification models to predict tumor malignancy
Host: TFUG Bhopal

AI-Powered Job Description Generator

Build a system that employs Generative AI and a chatbot interface to automatically generate job descriptions
Host: Akaash Tripathi / TFUG Ghaziabad

Machine Translation French-Wolof

Develop robust algorithms or models capable of accurately translating French sentences into Wolof.
Host: GalsenAI

Water Mapping using Satellite Imagery

Water mapping using satellite imagery and deep learning for dam drought detection
Host: Taha Bouhsine / ML Nomads


Navigating ML Olympiad

To see all the community competitions around the ML Olympiad, search "ML Olympiad" on Kaggle and look for further related posts on social media using #MLOlympiad. Browse through the available competitions and participate in those that interest you!

#WeArePlay | Meet the founders changing women’s lives: Women’s History Month Stories

Posted by Leticia Lago – Developer Marketing

In celebration of Women’s History Month, we’re highlighting the founders behind groundbreaking apps and games from around the world, made by women or for women. Let's discover four of my favorites in this latest batch of nine #WeArePlay stories.


Múkami Kinoti Kimotho

Royelles Revolution / Royelles Revolution: Gaming For Girls (USA)


Múkami's journey began when she noticed the lack of representation for girls in the gaming industry. Determined to change this narrative, she created Royelles, a game designed to inspire girls and non-binary people to pursue careers in STEAM (science, technology, engineering, art, math) fields. The game is anchored in fierce female avatars, such as a character voiced by real-life NASA scientist Mara. Royelles is revolutionizing the gaming landscape and empowering the next generation of innovators. Múkami's excited to release more gamified stories and learning modules, and a range of extended reality and AI-powered avatars based on the game’s characters.

“If we're going to effectively educate Gen Z and Gen Alpha, we have to meet them in the metaverse and leverage gamified play as a means of driving education, awareness, inspiration and empowerment.” 

- Múkami



Leonika Sari Njoto Boedioetomo

Reblood: Blood Services App (Indonesia)


When her university friend needed an urgent blood transfusion but discovered there was none available in the blood bank, Leonika became aware of the blood donation shortage in Indonesia. Her mission to address this led her to create Reblood, an app connecting blood donors with those in need. With over 140,000 blood donations facilitated to date, Reblood is not only saving lives but also promoting healthier lifestyles with a recently added feature that allows people to find the most affordable medical checkups.

“Our goal is to save more lives by raising awareness of blood donation in Indonesia and promoting healthier lifestyles for blood donors.” 

- Leonika



Luciane Antunes dos Santos and Renato Hélio Rauber

CARSUL / Car Sul: Urban Mobility App (Brazil)


Luciane was devastated when she lost her son in a car accident. That loss led her and her husband Renato to develop Carsul, an urban mobility app prioritizing safety and security. By providing safe transportation options and partnering with government health programs to chauffeur patients long distances to larger hospitals, Carsul is not only preventing accidents but also saving lives. Luciane and Renato's dedication to protecting others from the pain they've experienced is ongoing and they plan to expand to more cities in Brazil.

“Carsul was born from this story of loss, inspiring me to protect other lives. Redefining myself in this way is very rewarding.” 

- Luciane



Diariata (Diata) N'Diaye

Resonantes / App-Elles: Safety App for Women (France)


After hearing the stories of young people who had experienced abuse that was similar to her own, spoken word artist Diata developed App-Elles – an app that allows women to send alerts when they're in danger. By connecting users with support networks and professional services, App-Elles is empowering women to reclaim their safety and seek help when needed. Diata also runs writing and recording workshops to help victims overcome their experiences with violence and has plans to expand her app with the introduction of a discreet wearable that sends out alerts.

“I realized from my work on the ground that there were victims of violence who needed help and support systems. This was my inspiration to create App-Elles.” 

- Diata


Discover more #WeArePlay stories and share your favorites.




Build with Google AI video series, Season 2: more AI patterns

Posted by Joe Fernandez – Google AI Developer Relations

We are off to another exciting year in Artificial Intelligence (AI) and it's time to build more applications with Google AI technology! The Build with Google AI video series is for developers looking to build helpful and practical applications with AI. We focus on useful code projects you can implement and extend in an afternoon to bring the power of artificial intelligence into your workflow or organization. Our first season received over 100,000 views in six weeks! We are glad to see that so many of you liked the series, and we are excited to bring you even more Google AI application projects.

Today, we are launching Season 2 of the Build with Google AI series, featuring projects built with Google's Gemini API technology. The launch of Gemini and the Gemini API has brought developers even more advanced AI capabilities, including advanced reasoning, content generation, information synthesis, and image interpretation. Our goal with this season is to help you put those capabilities to work for you and your organizations.


AI app patterns

The Build with Google AI series features practical application code projects created for you to use and customize. However, we know that you are the best judge of what you or your organization needs to solve day-to-day problems and get work done. That's why each application we feature in this series is also meant to be used as an AI pattern. You can extend the applications immediately to solve problems and provide value for your business, and these applications show you a general coding pattern for getting value out of AI technology.

For this second season of this series, we show how you can leverage Google's Gemini AI model capabilities for applications. Here's what's coming up:

  • AI Slides Reviewer with Google Workspace (3/20) - Image interpretation is one of the Gemini model's biggest new features. We show you how to make practical use of it with a presentation review app for Google Slides that you can customize with your organization's guidelines and recommendations. 
  • AI Flutter Code Agent with Gemini API (3/27) - Code generation was the most popular episode from last season, so we are digging deeper into this topic. Build a code generation extension to write Flutter code and explore user interface designs and looks with just a few words of description.
  • AI Data Agent with Google Cloud (4/3) - Why write code to extract data when you can just ask for it? Build a web application that uses Gemini API's Function Calling feature to translate questions into code calls and data into plain language answers.
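
The function-calling pattern mentioned in the last item can be sketched without any model at all: the model's job is to turn a question into a structured call, and the application's job is to run it and hand back the data. The example below shows only the application side. The `{"name": ..., "args": ...}` shape, the `get_sales` function, and the sample data are all hypothetical, and no Gemini API call is made here.

```python
import json

# Hypothetical in-memory data standing in for a real database.
SALES = {"2023": 1200, "2024": 1500}


def get_sales(year: str) -> int:
    """Look up total sales for a year."""
    return SALES[year]


# Registry of functions the application is willing to expose to the model.
TOOLS = {"get_sales": get_sales}


def dispatch(function_call: dict):
    """Run the function named in a model-produced call and return its result."""
    name = function_call["name"]
    if name not in TOOLS:
        raise ValueError(f"Model requested an unknown function: {name}")
    return TOOLS[name](**function_call["args"])


# Stand-in for what a model might return for "What were sales in 2024?".
model_output = json.loads('{"name": "get_sales", "args": {"year": "2024"}}')
result = dispatch(model_output)
print(result)
```

In a full application the result would be sent back to the model so it can phrase a plain-language answer. Keeping an explicit registry means the model can only trigger functions the developer has chosen to expose.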

Season 1 upgraded to Gemini API: We've upgraded Season 1 tutorials and code projects to use the Gemini API so you can take advantage of the latest in generative AI technology from Google. Check them out!


Learn from the developers

Just like last season, we'll go back to the studio to talk with coders who built these projects so they can share what they learned along the way. How do you make the Gemini model review an entire presentation? What's the most effective way to generate code with AI? How do you get a database to answer questions with the Gemini API? Get insights into coding with AI to jump start your own development project.


New home for AI developer content

Developers interested in Google's AI offerings now have a new home at ai.google.dev. There you'll find a wealth of resources for building with AI from Google, including the Build with Google AI tutorials. Stay tuned for much more content through the rest of the year.

We are excited to bring you the second season of Build with Google AI. Check out Season 2 right now! Use the video comments to let us know what you think and tell us what you'd like to see in future episodes.

Keep learning! Keep building!

Tune Gemini Pro in Google AI Studio or with the Gemini API

Posted by Cher Hu, Product Manager and Saravanan Ganesh, Software Engineer for Gemini API

The following post was originally published in October 2023. Today, we've updated the post to share how you can easily tune Gemini models in Google AI Studio or with the Gemini API.


Last year, we launched Gemini 1.0 Pro, our mid-sized multimodal model optimized for scaling across a wide range of tasks. And with 1.5 Pro this year, we demonstrated the possibilities of what large language models can do with an experimental 1M context window. Now, to quickly and easily customize the generally available Gemini 1.0 Pro model (text) for your specific needs, we’ve added Gemini Tuning to Google AI Studio and the Gemini API.


What is tuning?

Developers often require higher quality output for custom use cases than what can be achieved through few-shot prompting. Tuning improves on this technique by further training the base model on many more task-specific examples—so many that they can’t all fit in the prompt.


Fine-tuning vs. Parameter Efficient Tuning

You may have heard about classic “fine-tuning” of models. This is where a pre-trained model is adapted to a particular task by training it on a smaller set of task-specific labeled data. But with today’s LLMs and their huge number of parameters, fine-tuning is complex: it requires machine learning expertise, lots of data, and lots of compute.

Tuning in Google AI Studio uses a technique called Parameter Efficient Tuning (PET) to produce higher-quality customized models with lower latency compared to few-shot prompting and without the additional costs and complexity of traditional fine-tuning. In addition, PET produces high quality models with as little as a few hundred data points, reducing the burden of data collection for the developer.
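
To see why training only a small number of parameters is cheaper, consider low-rank adapters, one well-known member of the parameter-efficient family. This post does not say which PET method Google AI Studio uses, so the sketch below is purely illustrative, and the layer size and rank are hypothetical.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable values when every entry of a d_out x d_in weight matrix is updated."""
    return d_in * d_out


def low_rank_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable values when the update is factored as (d_out x rank) @ (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and adapter rank.
d_in, d_out, rank = 4096, 4096, 8

full = full_finetune_params(d_in, d_out)
adapter = low_rank_adapter_params(d_in, d_out, rank)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {full // adapter}x")
```

With far fewer values to fit, a few hundred examples can be enough to steer the model, which is consistent with the small dataset sizes described above.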


Why tuning?

Tuning enables you to customize Gemini models with your own data to perform better for niche tasks while also reducing the context size of prompts and latency of the response. Developers can use tuning for a variety of use cases including but not limited to:

  • Classification: Run natural language tasks like classifying your data into predefined categories, without needing tons of manual work or tools.
  • Information extraction: Extract structured information from unstructured data sources to support downstream tasks within your product.
  • Structured output generation: Generate structured data, such as tables, quickly and easily.
  • Critique Models: Use tuning to create critique models to evaluate output from other models.

Get started quickly with Google AI Studio


1. Create a tuned model

It’s easy to tune models in Google AI Studio. This removes any need for engineering expertise to build custom models. Start by selecting “New tuned model” in the menu bar on the left.

moving image showing how to create a tuned model in Google AI Studio by opening 'New Tuned Model' from the menu

2. Select data for tuning

You can tune your model from an existing structured prompt or import data from Google Sheets or a CSV file. You can get started with as few as 20 examples, but for the best performance we recommend providing a dataset of at least 100 examples.

moving image showing how to select data for tuning in Google AI Studio by importing data
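
If your examples live in code rather than a spreadsheet, a few lines of Python can write them to a CSV file for import and check the dataset against the sizes mentioned above. This is a minimal sketch: the `input` and `output` column names and the sample rows are hypothetical, so use whatever column layout Google AI Studio asks for when you import.

```python
import csv
import io

MIN_EXAMPLES = 20           # smallest dataset the post says you can start with
RECOMMENDED_EXAMPLES = 100  # size the post recommends for best performance


def check_dataset_size(num_examples: int) -> str:
    """Classify a dataset size against the thresholds from the post."""
    if num_examples < MIN_EXAMPLES:
        return "too few"
    if num_examples < RECOMMENDED_EXAMPLES:
        return "ok"
    return "recommended"


def to_csv(examples) -> str:
    """Serialize (input, output) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["input", "output"])  # hypothetical column names
    writer.writerows(examples)
    return buffer.getvalue()


# Hypothetical classification examples.
examples = [(f"Ticket {i}: my order arrived damaged", "complaint") for i in range(25)]
csv_text = to_csv(examples)
print(check_dataset_size(len(examples)))
```

Writing `csv_text` to a file gives you something you can upload in the import step.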

3. View your tuned model

View your tuning progress in your library. Once the model has finished tuning, you can view the details by clicking on your model. Start running your tuned model through a structured or freeform prompt.

moving image showing how to view your tuned model in Google AI Studio by importing data

4. Run your tuned model anytime

You can also access your newly tuned model by creating a new structured or freeform prompt and selecting your tuned model from the list of available models.

moving image demonstrating what it looks like to run your tuned model in Google AI Studio after importing data

Tuning with the Gemini API

Google AI Studio is the fastest and easiest way to start tuning Gemini models. You can also access the feature via the Gemini API by passing the training data in the API request when creating a tuned model. Learn more about how to get started here.
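
As a rough sketch of what "passing the training data in the API request" can look like, the snippet below assembles a JSON request body with the examples inline. The field names (`display_name`, `base_model`, `tuning_task`, `text_input`, `output`), the model name, and the `build_tuning_request` helper are assumptions modeled on the public tuning documentation at the time of writing. Verify them against the current API reference before sending anything.

```python
import json


def build_tuning_request(display_name, base_model, examples, epoch_count=5):
    """Assemble a request body that carries the training data inline.

    Field names are assumptions; check the current API reference.
    """
    return {
        "display_name": display_name,
        "base_model": base_model,
        "tuning_task": {
            "hyperparameters": {"epoch_count": epoch_count},
            "training_data": {
                "examples": {
                    "examples": [
                        {"text_input": text, "output": label}
                        for text, label in examples
                    ]
                }
            },
        },
    }


# Hypothetical examples and an assumed base model name.
examples = [("The package never arrived", "complaint"),
            ("Thanks, the support was great", "praise")]
body = build_tuning_request("ticket-classifier", "models/gemini-1.0-pro-001", examples)
print(json.dumps(body, indent=2))
```

The body would then be sent as an authenticated request to the tuned-model creation endpoint. The request itself is left out here because it depends on your credentials and client library.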

We’re excited about the possibilities that tuning opens up for developers and can’t wait to see what you build with the feature. If you’ve got some ideas or use cases brewing, share them with us on X (formerly known as Twitter) or LinkedIn.

Tune in for Google I/O on May 14

Posted by Jeanine Banks – VP & General Manager, Developer X, and Head of Developer Relations

Google I/O is arriving this year on May 14th and you’re invited to join us online! I/O offers something for everyone, whether you are developing a new application, modernizing an existing one, or transforming it into a business.

The Gemini era unlocks new possibilities for developers to build creative and productive AI-enabled applications. I/O is where you’ll hear how you can get from idea to production AI applications faster. We’re excited to share what’s new for mobile, web, and multiplatform development, and how to scale your applications in the cloud. You will be able to dive deeper into topics that interest you with over 100 sessions, workshops, codelabs, and demos.

Visit the Google I/O site and register to stay informed about I/O and other related events coming soon. The livestreamed keynotes start May 14 at 10am PT, so mark your calendar.

If you haven’t already, go try out our newest Google I/O puzzle and head to @googlefordevs on Instagram if you need a hint.

GDE Women’s History Month Feature: Gema Parreño Piqueras, AI/ML GDE

Posted by Justyna Politanska-Pyszko – Program Manager, Google Developer Experts

For Women's History Month, we're shining a spotlight on Gema Parreño Piqueras, an AI/ML Google Developer Expert (GDE) from Madrid, Spain. GDEs are recognized by Google for their outstanding technical expertise and passion for sharing knowledge.
Gema Parreño Piqueras, AI/ML GDE, Madrid, Spain

Gema's dedication to the GDE program makes her a true leader within the Google Developers community, and her work in Artificial Intelligence and Machine Learning pushes the boundaries of Google's technological capabilities.

Gema is a force to be reckoned with in the world of data science. As a data scientist at Izertis and a GDE, she's not only making significant contributions to the field of AI/ML but also blazing a trail for women in tech. Her unique background in architecture and her passion for problem-solving led her to an impressive career in AI/ML and development of her extraordinary project – helping NASA track asteroids! Learn more about her projects incorporating AI:

NASA Project: Deep Asteroid

Gema's architectural skills proved invaluable when she turned her attention to AI. In 2016, she created the program Deep Asteroid for NASA's International Space Apps Challenge. This innovative program assists scientists in detecting, tracking, and classifying asteroids, potentially protecting our planet from future threats.

Journey to AI/ML

Intrigued by the potential of AI, Gema embarked on a journey that merged her architectural background with cutting-edge technology. Her experience with 3D modeling translated seamlessly into the world of machine learning, giving her a fresh perspective. Over the past seven years, she's overcome challenges and established herself as a true expert.

As a Google Developer Expert, Gema has found a vibrant community that has fueled her growth. She has attended numerous GDE events throughout Europe and had the opportunity to collaborate with Google teams. This experience was instrumental in the development of Deep Asteroid, demonstrating the power of community and access to advanced technology.

Gema’s advice for women aspiring to enter the field is simple and powerful: "Don't be afraid to experiment, fail, and learn from those failures. Persistence and a willingness to dive into the unknown are what will set you apart." Gema encourages women to find supportive communities, like the GDE program, where they can network, learn, and grow.

You can find Gema on LinkedIn, GitHub and X (formerly known as Twitter).


The Google Developer Experts (GDE) program is a global network of highly experienced technology experts, influencers, and thought leaders who actively support developers, companies, and tech communities by speaking at events and publishing content.

Google for Games is coming to GDC 2024

Posted by Aurash Mahbod – General Manager, Games on Google Play

Google for Games is coming to GDC in San Francisco! Join us on March 19 for the Game Developers Conference (GDC) at the Moscone Center, where game developers from across the world will gather to learn, network, problem-solve, and help shape the future of the industry. From March 18 to March 22, experience our comprehensive suite of multi-platform game development tools and explore the new features from Play Pass at the West Hall, Level 2 Lobby.

This year, we’re proud to host eight sessions for developers, designers, business and marketing teams, and everyone else in the gaming community interested in growing their game business. Take a look at this year’s sessions below and if you’re interested in learning more about topics from Google Play and Android, check out key product updates from the Google for Games Developer Summit.


Scaling your game development

We’re hosting three sessions designed to help scale your game development using tools from Firebase, Android, and Google Cloud. Learn more about building high quality games with case studies from industry experts.


Beyond "Set and Forget": Advanced Debugging with Firebase Crashlytics

Tuesday, March 19, 9:30 am - 10:00 am 

Speaker: Joe Spiro (Developer Relations Engineer, Google) 

Crashlytics has added a number of features that make detecting, tracking, and understanding bugs even easier, from high-level to native code. Take your fixes to another level with native stack traces, memory debugging, issue annotation, and the ability to log uncaught exceptions as fatal.


Enhancing Game Performance: Vulkan and Android Adaptability Technology

Tuesday, March 19, 10:50 am - 11:50 am 

Speakers: Dohyun Kim (Developer Relations Engineer, Android Games, Google), Hak Matsuda (Developer Relations Engineer, Android Games, Google), Jungwoo Kim (Principal Engineer, Samsung), Syed Farhan Hassan (Software Engineer, ARM) 

Learn how to leverage the Vulkan graphics API to improve your graphics quality or performance, including performance tuning with dynamic upscaling. Find out how the Android Dynamic Performance Framework (ADPF) can enhance game performance and power in Unity and native C++, with easy integration through the Unreal Engine plugin. We're also sharing how NCSoft Lineage W improved thermal status and performance using ADPF.


Creating a global-scale game with Google Cloud

Tuesday, March 19, 4:40 pm - 5:10 pm 

Speaker: Mark Mandel (Developer Advocate, Google) 

This session will cover the best of Google Cloud's open source projects (Agones, Open Match, and more) and products (GKE, Spanner, Anthos Service Mesh, Cloud Build, Cloud Deploy, and more) to teach you how to build, deploy, and scale world-scale multiplayer games with Google Cloud.


Increasing user engagement

We’re hosting two sessions designed to help you increase engagement by creating dynamic gameplay experiences using generative AI and expanding opportunities on Google Play to grow your community of players with exclusive rewards.

Reimagine the Future of Gaming with Google AI

Tuesday, March 19, 10:50 am - 11:50 am 

Speakers: Gus Martins (Developer Advocate, Google), Dan Zaratsian (AI/ML Solutions Architect, Google), Lei Zhang (Director, Play Partnerships, Global GenAI & Greater China Play Partnerships, Google), Jack Buser (Director, Game Industry Solutions), Simon Tokumine (Director of Product Management, Google AI), Giovane Moura Jr. (App Modernization Specialist, Google), Moonlit Beshinov (Head of Google for Games Partnerships and Industry Strategy, Google) 

In our keynote session, senior executives from Google Cloud, Google Play, and Labs will share their unique perspectives on generative AI in the gaming landscape. Learn more about cutting-edge AI solutions from Google Cloud, Android, Google Play, and Labs designed to simplify game development, publishing, and business operations, plus actionable strategies to leverage AI for faster development, better player experiences, and sustainable growth.

Grow your community of loyal gamers with Google Play

Tuesday, March 19, 1:20 pm - 1:50 pm 

Speaker: Tom Grinsted (Group Product Manager, Google Play Games, Google) 

In this session, we’ll cover new features and insights from Google Play to create rewarding experiences for gamers using Play Pass, Play Points, and Play Games Services. Get a behind-the-scenes look at how Google Play rewards a growing community of passionate gamers, and how to use this to super-charge your business.


Maximizing reach across screens

These sessions, from Google Play, Android, and Flutter, introduce ways to expand your mobile games to PC. Learn about the latest tools that will help you accelerate growth across large screens.

Bringing more users to your Google Play Games on PC game

Tuesday, March 19, 2:10 pm - 2:40 pm 

Speakers: Aly Hung (Developer Relations Engineer, Android and Google Play, Google), Dara Monasch (Product Manager, Google), Justin Gardner (Partner Program Manager, App Attribution, Google) 

Join us for an overview of Google Play Games on PC, how it has grown in the past year, and a walkthrough of how to optimize and attribute your PC advertisements for your Google Play Games on PC titles. Learn how to use Google Play Games to increase your reach and acquisition of PC users for your mobile game, as well as how to effectively use the Google Play Install Referrer API to attribute and optimize your ads across mobile and PC.

Android input on desktop: How to delight your users

Tuesday, March 19, 3:00 pm - 3:30 pm 

Speakers: Shenshen Cui (Staff Developer Relations Engineer, Google), Patrick Martin (Developer Relations Engineer, Google) 

Give your players a first-class gaming experience with our best practices for handling input between mobile and PC games, including technical details on how to implement these best practices across mobile, tablets, Chromebooks and Windows PCs¹. Learn how Android handles keyboard, mouse, and controller input across different form factors, with case studies for designing for both touch and hardware input.

Building Multiplatform Games with Flutter

Tuesday, March 19, 3:50 pm - 4:20 pm 

Speakers: Zoey Fan (Senior Product Manager, Flutter, Google), Brett Morgan (Developer Relations Engineer, Google) 

Learn why game developers are choosing Flutter to build casual games for mobile, desktop, and web browsers. We’ll cover the Casual Games Toolkit, a collection of free and open-source tools, templates, and resources that make game development with Flutter more productive.

Learn more about all of our sessions coming to you on March 19 at GDC in San Francisco.


________________

¹ Windows is a trademark of the Microsoft group of companies.

Large Language Models On-Device with MediaPipe and TensorFlow Lite

Posted by Mark Sherwood – Senior Product Manager and Juhyun Lee – Staff Software Engineer

TensorFlow Lite has been a powerful tool for on-device machine learning since its release in 2017, and MediaPipe further extended that power in 2019 by supporting complete ML pipelines. While these tools initially focused on smaller on-device models, today marks a dramatic shift with the experimental MediaPipe LLM Inference API.

This release enables Large Language Models (LLMs) to run fully on-device across platforms. That is a significant step given the memory and compute demands of LLMs, which are over a hundred times larger than traditional on-device models. Optimizations across the on-device stack make this possible, including new ops, quantization, caching, and weight sharing.

The experimental cross-platform MediaPipe LLM Inference API, designed to streamline on-device LLM integration for web developers, supports Web, Android, and iOS with initial support for four openly available LLMs: Gemma, Phi 2, Falcon, and Stable LM. It gives researchers and developers the flexibility to prototype and test popular openly available LLM models on-device.

On Android, the MediaPipe LLM Inference API is intended for experimental and research use only. Production applications with LLMs can use the Gemini API or Gemini Nano on-device through Android AICore. AICore is the new system-level capability introduced in Android 14 to provide Gemini-powered solutions for high-end devices, including integrations with the latest ML accelerators, use-case optimized LoRA adapters, and safety filters. To start using Gemini Nano on-device with your app, apply to the Early Access Preview.


LLM Inference API

Starting today, you can test out the MediaPipe LLM Inference API via our web demo or by building our sample demo apps. You can experiment and integrate it into your projects via our Web, Android, or iOS SDKs.

Using the LLM Inference API allows you to bring LLMs on-device in just a few steps. These steps apply across web, iOS, and Android, though the SDK and native API will be platform specific. The following code samples show the web SDK.

1. Pick model weights compatible with one of our supported model architectures 

 

2. Convert the model weights into a TensorFlow Lite Flatbuffer using the MediaPipe Python Package

from mediapipe.tasks.python.genai import converter 

config = converter.ConversionConfig(...)
converter.convert_checkpoint(config)
 

3. Include the LLM Inference SDK in your application

import { FilesetResolver, LlmInference } from "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai";
 

4. Host the TensorFlow Lite Flatbuffer along with your application.

 

5. Use the LLM Inference API to take a text prompt and get a text response from your model.

const fileset = await FilesetResolver.forGenAiTasks("https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm");
const llmInference = await LlmInference.createFromModelPath(fileset, "model.bin");
const responseText = await llmInference.generateResponse("Hello, nice to meet you");
document.getElementById('output').textContent = responseText;


Please see our documentation and code examples for a detailed walk through of each of these steps.
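For longer answers, it can help to show text as it is generated rather than waiting for the full response. The following is a minimal sketch, not part of the official samples. It assumes that generateResponse accepts an optional second argument, a listener called with (partialText, done) for each newly generated chunk; please confirm this against the SDK reference. The helper name streamResponse and the onUpdate callback are hypothetical.

```javascript
// Hypothetical helper: streams a response and reports the text accumulated so far.
// Assumption: `llm.generateResponse(prompt, listener)` invokes
// `listener(partialText, done)` once per generated chunk.
async function streamResponse(llm, prompt, onUpdate) {
  let text = "";
  await llm.generateResponse(prompt, (partialText, done) => {
    text += partialText;   // chunks are appended in arrival order
    onUpdate(text, done);  // e.g. update the page with the text so far
  });
  return text;
}
```

In the browser sample above, this could be called as streamResponse(llmInference, "Hello, nice to meet you", (text) => { document.getElementById('output').textContent = text; }).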

Here are real-time GIFs of Gemma 2B running via the MediaPipe LLM Inference API.

Gemma 2B running on-device in browser via the MediaPipe LLM Inference API
Gemma 2B running on-device on iOS (left) and Android (right) via the MediaPipe LLM Inference API

Models

Our initial release supports the following four model architectures. Any model weights compatible with these architectures will work with the LLM Inference API. Use the base model weights, use a community fine-tuned version of the weights, or fine-tune weights using your own data.

Model | Parameter Size
Falcon 1B | 1.3 Billion
Gemma 2B | 2.5 Billion
Phi 2 | 2.7 Billion
Stable LM 3B | 2.8 Billion



Model Performance

Through significant optimizations, some of which are detailed below, the MediaPipe LLM Inference API is able to deliver state-of-the-art latency on-device, focusing on CPU and GPU to support multiple platforms. For sustained performance in a production setting on select premium phones, Android AICore can take advantage of hardware-specific neural accelerators.

When measuring latency for an LLM, there are a few terms and measurements to consider. Time to First Token and Decode Speed will be the two most meaningful as these measure how quickly you get the start of your response and how quickly the response generates once it starts.

Term | Significance | Measurement
Token | LLMs use tokens rather than words as inputs and outputs. Each model used with the LLM Inference API has a tokenizer built in which converts between words and tokens. | 100 English words ≈ 130 tokens. However the conversion is dependent on the specific LLM and the language.
Max Tokens | The maximum total tokens for the LLM prompt + response. | Configured in the LLM Inference API at runtime.
Time to First Token | Time between calling the LLM Inference API and receiving the first token of the response. | Max Tokens / Prefill Speed
Prefill Speed | How quickly a prompt is processed by an LLM. | Model and device specific. Benchmark numbers below.
Decode Speed | How quickly a response is generated by an LLM. | Model and device specific. Benchmark numbers below.


The Prefill Speed and Decode Speed are dependent on model, hardware, and max tokens. They can also change depending on the current load of the device.

The following speeds were measured on high-end devices using a max tokens value of 1280, an input prompt of 1024 tokens, and int8 weight quantization. The exception is Gemma 2B (int4), available on Kaggle, which uses mixed 4/8-bit weight quantization.
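To make the definitions above concrete, here is a small arithmetic sketch. The function names are hypothetical, and the decode speed used in the last line is a made-up figure for illustration, not one of our benchmark results.

```javascript
// Rule of thumb from the definitions above: 100 English words ≈ 130 tokens.
// The real ratio depends on the model's tokenizer and the language.
function estimateTokens(wordCount) {
  return Math.ceil((wordCount * 130) / 100);
}

// Max tokens covers prompt + response, so whatever the prompt does not use
// is the most the model can generate.
function responseBudget(maxTokens, promptTokens) {
  return Math.max(0, maxTokens - promptTokens);
}

// Rough generation time for a response at a given decode speed (tokens/second).
function decodeSeconds(responseTokens, decodeTokensPerSecond) {
  return responseTokens / decodeTokensPerSecond;
}

console.log(estimateTokens(100));         // 130
console.log(responseBudget(1280, 1024));  // 256
console.log(decodeSeconds(256, 20));      // 12.8 (20 tokens/s is illustrative only)
```

With the benchmark settings (max tokens of 1280 and a 1024-token prompt), the response can therefore be at most 256 tokens long.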


Benchmarks

Graph showing prefill performance in tokens per second across WebGPU, iOS (GPU), Android (GPU), and Android (CPU)
Graph showing decode performance in tokens per second across WebGPU, iOS (GPU), Android (GPU), and Android (CPU)
On the GPU, Falcon 1B and Phi 2 use fp32 activations, while Gemma and Stable LM 3B use fp16 activations as the latter models showed greater robustness to precision loss according to our quality eval studies. The lowest bit activation data type that maintained model quality was chosen for each. Note that Gemma 2B (int4) was the only model we could run on iOS due to memory constraints on that platform, and we are working on enabling other models on iOS as well.

Performance Optimizations

To achieve the performance numbers above, countless optimizations were made across MediaPipe, TensorFlow Lite, XNNPack (our CPU neural network operator library), and our GPU-accelerated runtime. The following are a select few that resulted in meaningful performance improvements.

Weights Sharing: The LLM inference process comprises 2 phases: a prefill phase and a decode phase. Traditionally, this setup would require 2 separate inference contexts, each independently managing resources for its corresponding ML model. Given the memory demands of LLMs, we've added a feature that allows sharing the weights and the KV cache across inference contexts. Although sharing weights might seem straightforward, it has significant performance implications when sharing between compute-bound and memory-bound operations. In typical ML inference scenarios, where weights are not shared with other operators, they are meticulously configured for each fully connected operator separately to ensure optimal performance. Sharing weights with another operator implies a loss of per-operator optimization and this mandates the authoring of new kernel implementations that can run efficiently even on sub-optimal weights.

Optimized Fully Connected Ops: XNNPack’s FULLY_CONNECTED operation has undergone two significant optimizations for LLM inference. First, dynamic range quantization seamlessly merges the computational and memory benefits of full integer quantization with the precision advantages of floating-point inference. The utilization of int8/int4 weights not only enhances memory throughput but also achieves remarkable performance, especially with the efficient, in-register decoding of 4-bit weights requiring only one additional instruction. Second, we actively leverage the I8MM instructions in ARM v9 CPUs which enable the multiplication of a 2x8 int8 matrix by an 8x2 int8 matrix in a single instruction, resulting in twice the speed of the NEON dot product-based implementation.
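To illustrate what storing and decoding 4-bit weights involves, here is a scalar sketch in JavaScript. It is not the XNNPack kernel: the real implementation decodes in SIMD registers, and the nibble order used here (low nibble first) is an assumption made for the example. The function names are hypothetical.

```javascript
// Pack signed 4-bit values (range -8..7) two per byte, low nibble first.
function packInt4(values) {
  const packed = new Uint8Array(Math.ceil(values.length / 2));
  values.forEach((v, i) => {
    const nibble = v & 0x0f;  // keep the two's-complement low 4 bits
    packed[i >> 1] |= i % 2 === 0 ? nibble : nibble << 4;
  });
  return packed;
}

// Unpack: isolate one nibble with a mask or shift, then sign-extend it.
function unpackInt4(packed, count) {
  const out = new Int8Array(count);
  for (let i = 0; i < count; i++) {
    const byte = packed[i >> 1];
    const nibble = i % 2 === 0 ? byte & 0x0f : byte >> 4;
    out[i] = (nibble << 28) >> 28;  // sign-extend 4 bits to a 32-bit integer
  }
  return out;
}

const packed = packInt4([-8, 7, 3, -1]);
console.log(Array.from(packed));                 // [ 120, 243 ]
console.log(Array.from(unpackInt4(packed, 4)));  // [ -8, 7, 3, -1 ]
```

Four weights fit in two bytes instead of four, which is where the memory throughput benefit comes from; the cost is the small amount of extra work needed to isolate each nibble.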

Balancing Compute and Memory: Upon profiling the LLM inference, we identified distinct limitations for both phases: the prefill phase faces restrictions imposed by the compute capacity, while the decode phase is constrained by memory bandwidth. Consequently, each phase employs different strategies for dequantization of the shared int8/int4 weights. In the prefill phase, each convolution operator first dequantizes the weights into floating-point values before the primary computation, ensuring optimal performance for computationally intensive convolutions. Conversely, the decode phase minimizes memory bandwidth by adding the dequantization computation to the main mathematical convolution operations.

Flowchart showing compute-intensive prefill phase and memory-intensive decode phase, highlighting difference in performance bottlenecks
During the compute-intensive prefill phase, the int4 weights are dequantized a priori for optimal CONV_2D computation. In the memory-intensive decode phase, dequantization is performed on the fly, along with CONV_2D computation, to minimize the memory bandwidth usage.
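The two dequantization strategies can be sketched with a plain dot product standing in for CONV_2D. This is an illustration only, assuming a single per-row scale factor; the function names are hypothetical and the real kernels are considerably more involved.

```javascript
// Prefill-style: dequantize the weights once, then reuse the float copy
// for every token in the prompt. Costs extra memory, saves repeated work.
function dequantizeRow(qWeights, scale) {
  return Float32Array.from(qWeights, (q) => q * scale);
}

function dotFloat(weights, x) {
  let sum = 0;
  for (let i = 0; i < weights.length; i++) sum += weights[i] * x[i];
  return sum;
}

// Decode-style: read the quantized weights directly and fold the scale into
// the multiply-accumulate, so no float copy of the weights is ever stored.
function dotFused(qWeights, scale, x) {
  let acc = 0;
  for (let i = 0; i < qWeights.length; i++) acc += qWeights[i] * x[i];
  return acc * scale;
}

const q = new Int8Array([4, -2, 6]);
const scale = 0.5;
const x = [1, 2, 3];

console.log(dotFloat(dequantizeRow(q, scale), x));  // 9
console.log(dotFused(q, scale, x));                 // 9
```

Both paths compute the same value here. With arbitrary scales they agree only up to floating-point rounding, since the first path rounds each weight to fp32 before multiplying.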

Custom Operators: For GPU-accelerated LLM inference on-device, we rely extensively on custom operations to mitigate the inefficiency caused by numerous small shaders. These custom ops allow for special operator fusions, and allow various LLM parameters, such as token ID, sequence patch size, and sampling parameters, to be packed into a specialized custom tensor used mostly within these specialized operations.

Pseudo-Dynamism: In the attention block, we encounter dynamic operations that increase over time as the context grows. Since our GPU runtime lacks support for dynamic ops/tensors, we opt for fixed operations with a predefined maximum cache size. To reduce the computational complexity, we introduce a parameter enabling the skipping of certain value calculations or the processing of reduced data.
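As a rough sketch of the idea, the snippet below allocates a key cache once at a fixed maximum size and uses the current length to skip unused slots. The constants, names, and the use of -Infinity as a mask value are assumptions made for this example, not a description of our GPU runtime.

```javascript
const MAX_CACHE = 8;  // hypothetical maximum cache size, in tokens
const DIM = 2;        // hypothetical head dimension

// The buffer is allocated once at its maximum size and never resized.
function createKeyCache() {
  return { keys: new Float32Array(MAX_CACHE * DIM), length: 0 };
}

function appendKey(cache, key) {
  if (cache.length >= MAX_CACHE) throw new Error("cache full");
  cache.keys.set(key, cache.length * DIM);
  cache.length += 1;
}

// The output always has MAX_CACHE entries. Slots past `cache.length` are
// skipped and left at -Infinity so a later softmax gives them zero weight.
function attentionScores(cache, query) {
  const scores = new Float32Array(MAX_CACHE).fill(-Infinity);
  for (let t = 0; t < cache.length; t++) {
    let dot = 0;
    for (let d = 0; d < DIM; d++) dot += cache.keys[t * DIM + d] * query[d];
    scores[t] = dot;
  }
  return scores;
}

const cache = createKeyCache();
appendKey(cache, [1, 0]);
appendKey(cache, [0, 2]);
console.log(Array.from(attentionScores(cache, [3, 4])).slice(0, 3));  // [ 3, 8, -Infinity ]
```

Tensor shapes stay constant as the context grows, which suits a runtime without dynamic tensors, while the length parameter keeps the work proportional to the tokens actually present.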

Optimized KV Cache Layout: Since the entries in the KV cache ultimately serve as weights for convolutions, employed in lieu of matrix multiplications, we store these in a specialized layout tailored for convolution weights. This strategic adjustment eliminates the necessity for extra conversions or reliance on unoptimized layouts, and therefore contributes to a more efficient and streamlined process.


What’s Next

We are thrilled with the optimizations and the performance in today’s experimental release of the MediaPipe LLM Inference API. This is just the start. Over 2024, we will expand to more platforms and models, offer broader conversion tools, complementary on-device components, high-level tasks, and more.

You can check out the official sample on GitHub demonstrating everything you’ve just learned about and read through our official documentation for even more details. Keep an eye on the Google for Developers YouTube channel for updates and tutorials.


Acknowledgements

We’d like to thank all team members who contributed to this work: T.J. Alumbaugh, Alek Andreev, Frank Ban, Jeanine Banks, Frank Barchard, Pulkit Bhuwalka, Buck Bourdon, Maxime Brénon, Chuo-Ling Chang, Yu-hui Chen, Linkun Chen, Lin Chen, Nikolai Chinaev, Clark Duvall, Rosário Fernandes, Mig Gerard, Matthias Grundmann, Ayush Gupta, Mohammadreza Heydary, Ekaterina Ignasheva, Ram Iyengar, Grant Jensen, Alex Kanaukou, Prianka Liz Kariat, Alan Kelly, Kathleen Kenealy, Ho Ko, Sachin Kotwani, Andrei Kulik, Yi-Chun Kuo, Khanh LeViet, Yang Lu, Lalit Singh Manral, Tyler Mullen, Karthik Raveendran, Raman Sarokin, Sebastian Schmidt, Kris Tonthat, Lu Wang, Tris Warkentin, and the Gemma Team

Google Cloud Next ’24 session library is now available

Posted by Max Saltonstall – Developer Relations Engineer

Google Cloud Next 2024 is coming soon, and our session library is live!

Next ’24 covers a ton of ground, so choose your adventure. There's something on the menu for everyone, not just AI.

Developer-focused

Developers, this is your time. We have a huge collection of edutainment in store for you at Next, including:

  • Thousands of Googlers on-site to connect and chat
  • Demos you can play with, try out, poke and see inside of (rather than just watching)
  • Talks from Champion Innovators about how they put cloud to use
  • Gathering spots for classes, interest groups, trainings and hanging out

This year we have more than double the number of advanced technical sessions, and recommendations for startups, small and medium businesses, and sustainability for all. Data scientists and data engineers can shard themselves out into 60+ big data sessions, including going to the cutting edge with BigQuery multi-modal data.


Artificial intelligence

If you want to build your own AI model, LLM, or chatbot, we've got sessions for that, covering ways to use Vertex AI to spin up your own large language models on cloud, search your multimedia library, and maintain equity in the data used for training.


Diversity, equity, and inclusion

Equity and inclusion go way past AI, and we’re really excited to have talks this year addressing allyship for your Muslim colleagues, growing inclusion in your org, and dialogues for change.


Security and data privacy

Don't forget security (really, who does?). Whether you are tackling security at the infrastructure, platform, machine or workload level, we've got sessions for you. Even if you're on multiple clouds, with multiple teams, you still need to get insight into the security and compliance of it all.

Speaking of all these fun chips, what about the salsa? We've got supply chain security with talks on SLSA and GUAC, plus numerous options for serverless workload security and ML data privacy.


Come join us

So, still on the fence?

Come for the magnificent shows in Vegas.

Come for the chance to sit down with expert developers and engineers.

Come for the amazing technical talks and tutorials.

Or just come for the spectacle. We've got it all at Google Cloud Next ’24.

Check out sessions and secure your spot for three days of learning, community-building, and cloud tech with experts and peers at Mandalay Bay Convention Center in Las Vegas, April 9–11.