Tag Archives: Learn

Our first Spotlight Week: diving into Android 15

Posted by Aaron Labiaga- Android Developer Relations Engineer

By now, you’ve probably heard the news: Android 15 was just released earlier today to AOSP. To celebrate, we’re kicking off a new series called “Spotlight Week” where we’ll shine a light on technical areas across Android development and equip you with the tools you need to take advantage of each area.

The Android 15 "Spotlight Week" will provide resources — blog posts, videos, sample code, and more — all designed to help you prepare your apps and take advantage of the latest features. These changes strive to improve the Android ecosystem, but updating the OS comes with potential app compatibility implications and integrations that require detailed guidance.

Here’s what we’re covering this week in our Spotlight Week on Android 15:

Android 15 goes to AOSP: big news for the Android world as the latest release - Android 15 - moves to AOSP.

Android 15 summary

Edge-to-edge (Tuesday, September 3): To kick-off the week, a change that will most likely be impacting your app, Edge-to-Edge are APIs that lays out your app to optimize for screen real estate. It will be enforced for all apps targeting Android 15, making the status bar and navigation bar transparent, for a more high-quality experience. Understand how these changes will affect your app by reading the documentations linked. Learn how to work around these changes by reading the Insets handling tips for Android 15's edge-to-edge enforcement blog post.

Building for the future of Android, an in-depth video (Wednesday, Sept 4)

Foreground services and a live Android 15 Q&A (Thursday, September 5): Foreground services changes are coming in Android 15, and we’re introducing a new foreground service type, updating the exemption scenarios that allow a foreground service to start from the background, and updating the max duration of certain foreground service types. These changes are intended to improve user experience by preventing apps from misusing foreground service that may drain a user’s battery. Plus we’ll have a live Q&A.

Passkeys and Picture-in-Picture (Friday, September 6): Passkeys enable a more streamlined and secured means of authenticating your users. Learn more about passkeys through our sample code and about the updates made to further simplify the login process in Android 15. Plus, we're highlighting a Picture-in-Picture sample code that is applicable to apps with video functionality.

That’s just a taste of what we’re covering in our Spotlight Week on Android 15. Keep checking back to this blog post for updates, where we’ll be adding links and more throughout the week. Plus, follow Android Developers on X and Android by Google at Linkedin throughout the week to hear even more about Android 15.

Source: Android Developers Blog

Our first Spotlight Week: diving into Android 15

Posted by Aaron Labiaga- Android Developer Relations Engineer

Here’s what we’re covering this week in our Spotlight Week on Android 15:

Android 15 goes to AOSP: big news for the Android world as the latest release - Android 15 - moves to AOSP.

Android 15 summary

Edge-to-edge (Tuesday, September 3): To kick-off the week, a change that will most likely be impacting your app, Edge-to-Edge are APIs that lays out your app to optimize for screen real estate. It will be enforced for all apps targeting Android 15, making the status bar and navigation bar transparent, for a more high-quality experience. Understand how these changes will affect your app by reading the documentations linked. Learn how to work around these changes by reading the Insets handling tips for Android 15's edge-to-edge enforcement blog post.

Building for the future of Android, an in-depth video (Wednesday, Sept 4)

Foreground services and a live Android 15 Q&A (Thursday, September 5): Foreground services changes are coming in Android 15, and we’re introducing a new foreground service type, updating the exemption scenarios that allow a foreground service to start from the background, and updating the max duration of certain foreground service types. These changes are intended to improve user experience by preventing apps from misusing foreground service that may drain a user’s battery. Plus we’ll have a live Q&A.

Passkeys and Picture-in-Picture (Friday, September 6): Passkeys enable a more streamlined and secured means of authenticating your users. Learn more about passkeys through our sample code and about the updates made to further simplify the login process in Android 15. Plus, we're highlighting a Picture-in-Picture sample code that is applicable to apps with video functionality.

Source: Android Developers Blog

ML Olympiad 2024: Globally Distributed ML Competitions by Google ML Community

Posted by Bitnoori Keum – DevRel Community Manager

The ML Olympiad consists of Kaggle Community Competitions organized by ML GDE, TFUG, and other ML communities, aiming to provide developers with opportunities to learn and practice machine learning. Following successful rounds in 2022 and 2023, the third round has now launched with support from Google for Developers for each competition host. Over the last two rounds, 605 teams participated in 32 competitions, generating 105 discussions and 170 notebooks. We encourage you to join this round to gain hands-on experience with machine learning and tackle real-world challenges.

ML Olympiad Community Competitions

Over 20 ML Olympiad community competitions are currently open. Visit the ML Olympiad page to participate.

Smoking Detection in Patients

Predict smoking status with bio-signal ML models

Host: Rishiraj Acharya (AI/ML GDE) / TFUG Kolkata

TurtleVision Challenge

Develop a classification model to distinguish between jellyfish and plastic pollution in ocean imagery

Host: Anas Lahdhiri / MLAct

Detect hallucinations in LLMs

Detect which answers provided by a Mistral 7B instruct model are most likely hallucinations

Host: Luca Massaron (AI/ML GDE)

ZeroWasteEats

Find ML solutions to reduce food wastage

Host: Anushka Raj / TFUG Hajipur

Predicting Wellness

Predict the percentage of body fat in men using multiple regression methods

Host: Ankit Kumar Verma / TFUG Prayagraj

Offbeats Edition

Build a regression model to predict the age of the crab

Host: Ayush Morbar / Offbeats Byte Labs

Nashik Weather

Predict the condition of weather in Nashik, India

Host: TFUG Nashik

Predicting Earthquake Damage

Predict the level of damage to buildings caused by earthquake based on aspects of building location and construction

Host: Usha Rengaraju

Forecasting Bangladesh's Weather

Predict the rainy day; amount of rainfall, and average temperature for a particular day.

Host: TFUG Bangladesh (Dhaka)

CO2 Emissions Prediction Challenge

Predict CO2 emissions per capita for 2030 using global development indicators

Host: Md Shahriar Azad Evan, Shuvro Pal / TFUG North Bengal

AI & ML Malaysia

Predict loan approval status

Host: Kuan Hoong (AI/ML GDE) / Artificial Intelligence & Machine Learning Malaysia User Group

Sustainable Urban Living

Predict the habitability score of properties

Host: Ashwin Raj / BeyondML

Toxic Language (PTBR) Detection

(in local language)

Classify Brazilian Portuguese tweets in one of the two classes: toxics or non toxics.

Host: Mikaeri Ohana, Pedro Gengo, Vinicius F. Caridá (AI/ML GDE)

Improving disaster response

Predict the humanitarian aid contributions as a response to disasters occurs in the world

Host: Yara Armel Desire / TFUG Abidjan

Urban Traffic Density

Develop predictive models to estimate the traffic density in urban areas

Host: Kartikey Rawat / TFUG Durg

Know Your Customer Opinion

Classify each customer opinion into several Likert scale

Host: TFUG Surabaya

Forecasting India's Weather

Predict the temperature of the particular month

Host: Mohammed Moinuddin / TFUG Hyderabad

Classification Champ

Develop classification models to predict tumor malignancy

Host: TFUG Bhopal

AI-Powered Job Description Generator

Build a system that employs Generative AI and a chatbot interface to automatically generate job descriptions

Host: Akaash Tripathi / TFUG Ghaziabad

Machine Translation French-Wolof

Develop robust algorithms or models capable of accurately translating French sentences into Wolof.

Host: GalsenAI

Water Mapping using Satellite Imagery

Water mapping using satellite imagery and deep learning for dam drought detection

Host: Taha Bouhsine / ML Nomads

Navigating ML Olympiad

To see all the community competitions around the ML Olympiad, search "ML Olympiad" on Kaggle and look for further related posts on social media using #MLOlympiad. Browse through the available competitions and participate in those that interest you!

Source: Google for Developers Blog - News about Web, Mobile, AI and Cloud

Build with Google AI video series, Season 2: more AI patterns

Posted by Joe Fernandez – Google AI Developer Relations

We are off to another exciting year in Artificial Intelligence (AI) and it's time to build more applications with Google AI technology! The Build with Google AI video series is for developers looking to build helpful and practical applications with AI. We focus on useful code projects you can implement and extend in an afternoon to bring the power of artificial intelligence into your workflow or organization. Our first season received over 100,000 views in six weeks! We are glad to see that so many of you liked the series, and we are excited to bring you even more Google AI application projects.

Today, we are launching Season 2 of the Build with Google AI series, featuring projects built with Google's Gemini API technology. The launch of Gemini and the Gemini API has brought developers even more advanced AI capabilities, including advanced reasoning, content generation, information synthesis, and image interpretation. Our goal with this season is to help you put those capabilities to work for you and your organizations.

AI app patterns

The Build with Google AI series features practical application code projects created for you to use and customize. However, we know that you are the best judge of what you or your organization needs to solve day-to-day problems and get work done. That's why each application we feature in this series is also meant to be used as an AI pattern. You can extend the applications immediately to solve problems and provide value for your business, and these applications show you a general coding pattern for getting value out of AI technology.

For this second season of this series, we show how you can leverage Google's Gemini AI model capabilities for applications. Here's what's coming up:

AI Slides Reviewer with Google Workspace (3/20) - Image interpretation is one of the Gemini model's biggest new features. We show you how to make practical use of it with a presentation review app for Google Slides that you can customize with your organization's guidelines and recommendations.

AI Flutter Code Agent with Gemini API (3/27) - Code generation was the most popular episode from last season, so we are digging deeper into this topic. Build a code generation extension to write Flutter code and explore user interface designs and looks with just a few words of description.

AI Data Agent with Google Cloud (4/3) - Why write code to extract data when you can just ask for it? Build a web application that uses Gemini API's Function Calling feature to translate questions into code calls and data into plain language answers.

Season 1 upgraded to Gemini API: We've upgraded Season 1 tutorials and code projects to use the Gemini API so you can take advantage of the latest in generative AI technology from Google. Check them out!

Learn from the developers

Just like last season, we'll go back to the studio to talk with coders who built these projects so they can share what they learned along the way. How do you make the Gemini model review an entire presentation? What's the most effective way to generate code with AI? How do you get a database to answer questions with the Gemini API? Get insights into coding with AI to jump start your own development project.

New home for AI developer content

Developers interested in Google's AI offerings now have a new home at ai.google.dev. There you'll find a wealth of resources for building with AI from Google, including the Build with Google AI tutorials. Stay tuned for much more content through the rest of the year.

We are excited to bring you the second season of Build with Google AI – check out Season 2 right now! Use those video comments to let us know what you think and tell us what you'd like to see in future episodes.

Keep learning! Keep building!

Source: Google for Developers Blog - News about Web, Mobile, AI and Cloud

Tune Gemini Pro in Google AI Studio or with the Gemini API

Posted by Cher Hu, Product Manager and Saravanan Ganesh, Software Engineer for Gemini API

The following post was originally published in October 2023. Today, we've updated the post to share how you can easily tune Gemini models in Google AI Studio or with the Gemini API.

Last year, we launched Gemini 1.0 Pro, our mid-sized multimodal model optimized for scaling across a wide range of tasks. And with 1.5 Pro this year, we demonstrated the possibilities of what large language models can do with an experimental 1M context window. Now, to quickly and easily customize the generally available Gemini 1.0 Pro model (text) for your specific needs, we’ve added Gemini Tuning to Google AI Studio and the Gemini API.

What is tuning?

Developers often require higher quality output for custom use cases than what can be achieved through few-shot prompting. Tuning improves on this technique by further training the base model on many more task-specific examples—so many that they can’t all fit in the prompt.

Fine-tuning vs. Parameter Efficient Tuning

You may have heard about classic “fine-tuning” of models. This is where a pre-trained model is adapted to a particular task by training it on a smaller set of task-specific labeled data. But with today’s LLMs and their huge number of parameters, fine-tuning is complex: it requires machine learning expertise, lots of data, and lots of compute.

Tuning in Google AI Studio uses a technique called Parameter Efficient Tuning (PET) to produce higher-quality customized models with lower latency compared to few-shot prompting and without the additional costs and complexity of traditional fine-tuning. In addition, PET produces high quality models with as little as a few hundred data points, reducing the burden of data collection for the developer.

Why tuning?

Tuning enables you to customize Gemini models with your own data to perform better for niche tasks while also reducing the context size of prompts and latency of the response. Developers can use tuning for a variety of use cases including but not limited to:

Classification: Run natural language tasks like classifying your data into predefined categories, without needing tons of manual work or tools.

Information extraction: Extract structured information from unstructured data sources to support downstream tasks within your product.

Structured output generation: Generate structured data, such as tables, quickly and easily.

Critique Models: Use tuning to create critique models to evaluate output from other models.

Get started quickly with Google AI Studio

1. Create a tuned model

It’s easy to tune models in Google AI Studio. This removes any need for engineering expertise to build custom models. Start by selecting “New tuned model” in the menu bar on the left.

moving image showing how to create a tuned model in Google AI Studio by opening 'New Tuned Model' from the menu

2. Select data for tuning

You can tune your model from an existing structured prompt or import data from Google Sheets or a CSV file. You can get started with as few as 20 examples and to get the best performance, we recommend providing a dataset of at least 100 examples.

moving image showing how to select data for tuning in Google AI Studio by importing data

3. View your tuned model

View your tuning progress in your library. Once the model has finished tuning, you can view the details by clicking on your model. Start running your tuned model through a structured or freeform prompt.

moving image showing how to view your tuned model in Google AI Studio by importing data

4. Run your tuned model anytime

You can also access your newly tuned model by creating a new structured or freeform prompt and selecting your tuned model from the list of available models.

moving image demonstrating what it looks like to run your tuned model in Google AI Studio after importing data

Tuning with the Gemini API

Google AI Studio is the fastest and easiest way to start tuning Gemini models. You can also access the feature via the Gemini API by passing the training data in the API request when creating a tuned model. Learn more about how to get started here.

We’re excited about the possibilities that tuning opens up for developers and can’t wait to see what you build with the feature. If you’ve got some ideas or use cases brewing, share them with us on X (formerly known as Twitter) or Linkedin.

Source: Google for Developers Blog - News about Web, Mobile, AI and Cloud

PJRT Plugin to Accelerate Machine Learning

PJRT is an open, stable interface for device runtime and compiler, which simplifies ML hardware and framework integration. With PJRT, ML frameworks become hardware-agnostic and ML hardware becomes pluggable. For the ML developer, it simplifies the adoption of new ML hardware and models become more portable. This addresses ML infrastructure fragmentation across frameworks, compilers and runtimes enhancing the industry’s ability to productionize ML-driven advancements with velocity and at scale.

This article provides an overview of what building a PJRT plugin entails, how frameworks (and models) can use this plugin, and some updates on the PJRT API. PJRT is now used by a growing spectrum of hardware: Apple silicon, Google Cloud TPU, NVIDIA GPU, and Intel Max GPU. We also share a spotlight on Apple’s adoption of PJRT with some details on the workflow and performance.

If you’re developing an ML hardware accelerator or developing your own compiler and runtime, check out the PJRT source code on GitHub and sign up for the PJRT mailing list to quickly bootstrap your work.

What’s in a PJRT Plugin

PJRT was introduced to simplify the growing complexity of ML workload execution across hardware and frameworks. PJRT (used in conjunction with StableHLO) is a stable interface for device runtime and compiler, which abstracts away device specific implementations from frameworks.

An implementation of the PJRT API is called a PJRT plugin, which is usually a Python package for seamless ML model developer experience. To build a PJRT plugin for a hardware target, the following methods need to be implemented:

Compile: compile (program) -> executable
Runtime: execute (executable, arguments) -> results
Memory management: transfer buffer from host to device, device to host, device to device, as well as buffer management such as buffer donation
Topology information such as the platform, how many accelerators and how are they attached.

ML frameworks will discover and load one or multiple PJRT plugins, and call the PJRT API to compile and execute the model. The PJRT plugins may be required to register to the ML frameworks depending on the specific discovery mechanism the framework uses.

API Updates

Versioning and ABI Compatibility

PJRT API has a major version and a minor version. If the framework is newer than the plugin, the framework provides a N-week (N=6 today) forwards compatibility window for minor version updates. The major version updates will be a coordinated update. Frameworks will not support plugins with a lower major version. If the plugin is newer than the framework, plugins will define their own backward compatibility policy.

Multi-Node

A PJRT client is per node, and the plugin may need some way to communicate among nodes in a distributed workload. The framework can pass in key-value store callbacks to the plugin. The plugin can use them to bootstrap multi-node initialization and other coordination needs. An example with the NVIDIA GPU CUDA plugin is as follows:

JAX starts a distribution service and provides key-value store callbacks.
NVIDIA GPU CUDA plugin uses these callbacks to (1) generate global PJRT device topology that includes PJRT device information from all nodes, and (2) generate NCCL ids.

DLPack

A few C APIs were added to PJRT to support DLPack.

PJRT_Client_CreateViewOfDeviceBuffer supports receiving buffers from DLPack.
Exporting buffers to DLPack requires: PJRT_Buffer_IncreaseExternalReferenceCount, PJRT_Buffer_DecreaseExternalReferenceCount to get a PJRT_Buffer_OpaqueDeviceMemoryDataPointer.

Extension

PJRT API provides an extension mechanism that the plugin can provide extensions which are optional or experimental features. These extensions can have their own compatibility guarantee and do not need to support the ABI compatibility of PJRT API.

Industry Adoption

PJRT is the only interface for JAX, the primary interface for TensorFlow and fully supported for PyTorch through PyTorch/XLA. PJRT is not tied to a specific compiler and runtime. The toolchain-independent architecture and open-source availability as part of the OpenXLA Project allows it to be leveraged by any hardware, framework or compiler, with extensibility for unique features. This has allowed PJRT to be adopted by various industry partners through close collaboration. A brief account of Apple’s adoption of PJRT follows.

JAX on Apple Silicon

Apple’s PJRT plugin for the Metal training backend accelerates JAX models on Apple silicon and AMD GPUs. This empowers any ML developers to leverage the full potential of Apple silicon and AMD GPUs on their Apple hardware to accelerate JAX models for faster experimentation. The integration and user experience to accelerate JAX on Apple silicon GPUs is similar to the existing PyTorch and TensorFlow implementations.

The Metal plug-in uses the OpenXLA compiler and PJRT runtime to optimize and accelerate JAX workloads on GPU. When a JAX program is executed, the JAX graph is lowered into StableHLO, which is then passed to PJRT for compilation and execution. The StableHLO is converted to MPSGraph executables and the Metal runtime APIs are invoked to dispatch to the GPU.

Performance

The Metal backend with PJRT plugin provides impressive performance speedup for JAX. On an Apple MacBook Pro with M2 Max, training common networks in JAX see performance speedups of up to 28x, with an average of 10x over a CPU baseline. This empowers any ML developer to leverage the full potential of Apple Silicon on their Apple hardware to accelerate JAX models for faster experimentation.

Figure 1: Performance speedups of up to 28x on Apple MacBook Pro with M2 Max over CPU for JAX training.

Getting Started

Adding Metal support to JAX is as simple as a single pip install:

python -m pip install jax-metal
python -c 'import jax; print(jax.numpy.arange(10))'

For more details on environment setup and installation of JAX on Apple hardware, please refer to the Metal Developer Resources page.

Google Cloud TPU

PJRT is the default runtime for PyTorch 2.0 on Google Cloud TPU. GitHub Readme has more details.

NVIDIA GPU

The NVIDIA GPU CUDA implementation in JAX is extracted and packaged as a PJRT plugin. The ML model developers can install the NVIDIA GPU CUDA plugin from pypi. This plugin uses the newly added features such as multi-node, DLPack, and extensions.

Intel GPU

Intel is leveraging PJRT in Intel® Extension for TensorFlow to provide the Intel GPU backend for TensorFlow, JAX and PyTorch. The example of executing a JAX program on Intel GPU demonstrates how this greatly simplifies the framework and hardware integration.

PJRT Resources

PJRT is available on GitHub: source code for the API, integration guides and issues. If you develop ML frameworks, compilers, runtimes or are interested in improving portability of workloads across hardware, we want your feedback. We encourage you to contribute code, design ideas and feature suggestions. We also invite you to join the PJRT mailing list to stay updated with the latest product and community announcements and to help shape the future of an interoperable ML infrastructure.

Acknowledgements

Chalana Bezawada, Daniel Doctor, Kulin Seth, Shuhan Ding from Apple

Penporn Koanantakool, Peter Hawkins, Skye Wanderman-Milne, Xiao Yu from Google.

By Aman Verma – Product Manager, Machine Learning Infrastructure, Google and Jieying Luo – Software Engineer, Machine Learning Infrastructure, Google

Source: Google Open Source Blog

Large Language Models On-Device with MediaPipe and TensorFlow Lite

Posted by Mark Sherwood – Senior Product Manager and Juhyun Lee – Staff Software Engineer

TensorFlow Lite has been a powerful tool for on-device machine learning since its release in 2017, and MediaPipe further extended that power in 2019 by supporting complete ML pipelines. While these tools initially focused on smaller on-device models, today marks a dramatic shift with the experimental MediaPipe LLM Inference API.

This new release enables Large Language Models (LLMs) to run fully on-device across platforms. This new capability is particularly transformative considering the memory and compute demands of LLMs, which are over a hundred times larger than traditional on-device models. Optimizations across the on-device stack make this possible, including new ops, quantization, caching, and weight sharing.

The experimental cross-platform MediaPipe LLM Inference API, designed to streamline on-device LLM integration for web developers, supports Web, Android, and iOS with initial support for four openly available LLMs: Gemma, Phi 2, Falcon, and Stable LM. It gives researchers and developers the flexibility to prototype and test popular openly available LLM models on-device.

On Android, the MediaPipe LLM Inference API is intended for experimental and research use only. Production applications with LLMs can use the Gemini API or Gemini Nano on-device through Android AICore. AICore is the new system-level capability introduced in Android 14 to provide Gemini-powered solutions for high-end devices, including integrations with the latest ML accelerators, use-case optimized LoRA adapters, and safety filters. To start using Gemini Nano on-device with your app, apply to the Early Access Preview.

LLM Inference API

Starting today, you can test out the MediaPipe LLM Inference API via our web demo or by building our sample demo apps. You can experiment and integrate it into your projects via our Web, Android, or iOS SDKs.

Using the LLM Inference API allows you to bring LLMs on-device in just a few steps. These steps apply across web, iOS, and Android, though the SDK and native API will be platform specific. The following code samples show the web SDK.

1. Pick model weights compatible with one of our supported model architectures

2. Convert the model weights into a TensorFlow Lite Flatbuffer using the MediaPipe Python Package
from mediapipe.tasks.python.genai import converter 

config = converter.ConversionConfig(...)
converter.convert_checkpoint(config)

3. Include the LLM Inference SDK in your application
import { FilesetResolver, LlmInference } from
"https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai”

4. Host the TensorFlow Lite Flatbuffer along with your application.

5. Use the LLM Inference API to take a text prompt and get a text response from your model.

const fileset  = await
FilesetResolver.forGenAiTasks("https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm");
const llmInference = await LlmInference.createFromModelPath(fileset, "model.bin");
const responseText = await llmInference.generateResponse("Hello, nice to meet you");
document.getElementById('output').textContent = responseText;

Please see our documentation and code examples for a detailed walk through of each of these steps.

Here are real time gifs of Gemma 2B running via the MediaPipe LLM Inference API.

Gemma 2B running on-device in browser via the MediaPipe LLM Inference API

Gemma 2B running on-device on iOS (left) and Android (right) via the MediaPipe LLM Inference API

Models

Our initial release supports the following four model architectures. Any model weights compatible with these architectures will work with the LLM Inference API. Use the base model weights, use a community fine-tuned version of the weights, or fine tune weights using your own data.

Model	Parameter Size
Falcon 1B	1.3 Billion
Gemma 2B	2.5 Billion
Phi 2	2.7 Billion
Stable LM 3B	2.8 Billion

Model Performance

Through significant optimizations, some of which are detailed below, the MediaPipe LLM Inference API is able to deliver state-of-the-art latency on-device, focusing on CPU and GPU to support multiple platforms. For sustained performance in a production setting on select premium phones, Android AICore can take advantage of hardware-specific neural accelerators.

When measuring latency for an LLM, there are a few terms and measurements to consider. Time to First Token and Decode Speed will be the two most meaningful as these measure how quickly you get the start of your response and how quickly the response generates once it starts.

Term	Significance	Measurement
Token	LLMs use tokens rather than words as inputs and outputs. Each model used with the LLM Inference API has a tokenizer built in which converts between words and tokens.	100 English words ≈ 130 tokens. However the conversion is dependent on the specific LLM and the language.
Max Tokens	The maximum total tokens for the LLM prompt + response.	Configured in the LLM Inference API at runtime.
Time to First Token	Time between calling the LLM Inference API and receiving the first token of the response.	Max Tokens / Prefill Speed
Prefill Speed	How quickly a prompt is processed by an LLM.	Model and device specific. Benchmark numbers below.
Decode Speed	How quickly a response is generated by an LLM.	Model and device specific. Benchmark numbers below.

The Prefill Speed and Decode Speed are dependent on model, hardware, and max tokens. They can also change depending on the current load of the device.

The following speeds were taken on high end devices using a max tokens of 1280 tokens, an input prompt of 1024 tokens, and int8 weight quantization. The exception being Gemma 2B (int4), found here on Kaggle, which uses a mixed 4/8-bit weight quantization.

Benchmarks

Graph showing prefill performance in tokens per second across WebGPU, iOS (GPU), Android (GPU), and Android (CPU)

Graph showing decode performance in tokens per second across WebGPU, iOS (GPU), Android (GPU), and Android (CPU)

On the GPU, Falcon 1B and Phi 2 use fp32 activations, while Gemma and StableLM 3B use fp16 activations as the latter models showed greater robustness to precision loss according to our quality eval studies. The lowest bit activation data type that maintained model quality was chosen for each. Note that Gemma 2B (int4) was the only model we could run on iOS due to its memory constraints, and we are working on enabling other models on iOS as well.

Performance Optimizations

To achieve the performance numbers above, countless optimizations were made across MediaPipe, TensorFlow Lite, XNNPack (our CPU neural network operator library), and our GPU-accelerated runtime. The following are a select few that resulted in meaningful performance improvements.

Weights Sharing: The LLM inference process comprises 2 phases: a prefill phase and a decode phase. Traditionally, this setup would require 2 separate inference contexts, each independently managing resources for its corresponding ML model. Given the memory demands of LLMs, we've added a feature that allows sharing the weights and the KV cache across inference contexts. Although sharing weights might seem straightforward, it has significant performance implications when sharing between compute-bound and memory-bound operations. In typical ML inference scenarios, where weights are not shared with other operators, they are meticulously configured for each fully connected operator separately to ensure optimal performance. Sharing weights with another operator implies a loss of per-operator optimization and this mandates the authoring of new kernel implementations that can run efficiently even on sub-optimal weights.

Optimized Fully Connected Ops: XNNPack’s FULLY_CONNECTED operation has undergone two significant optimizations for LLM inference. First, dynamic range quantization seamlessly merges the computational and memory benefits of full integer quantization with the precision advantages of floating-point inference. The utilization of int8/int4 weights not only enhances memory throughput but also achieves remarkable performance, especially with the efficient, in-register decoding of 4-bit weights requiring only one additional instruction. Second, we actively leverage the I8MM instructions in ARM v9 CPUs which enable the multiplication of a 2x8 int8 matrix by an 8x2 int8 matrix in a single instruction, resulting in twice the speed of the NEON dot product-based implementation.

Balancing Compute and Memory: Upon profiling the LLM inference, we identified distinct limitations for both phases: the prefill phase faces restrictions imposed by the compute capacity, while the decode phase is constrained by memory bandwidth. Consequently, each phase employs different strategies for dequantization of the shared int8/int4 weights. In the prefill phase, each convolution operator first dequantizes the weights into floating-point values before the primary computation, ensuring optimal performance for computationally intensive convolutions. Conversely, the decode phase minimizes memory bandwidth by adding the dequantization computation to the main mathematical convolution operations.

Flowchart showing compute-intensive prefill phase and memory-intensive decode phase, highlighting difference in performance bottlenecks

During the compute-intensive prefill phase, the int4 weights are dequantized a priori for optimal CONV_2D computation. In the memory-intensive decode phase, dequantization is performed on the fly, along with CONV_2D computation, to minimize the memory bandwidth usage.

Custom Operators: For GPU-accelerated LLM inference on-device, we rely extensively on custom operations to mitigate the inefficiency caused by numerous small shaders. These custom ops allow for special operator fusions and various LLM parameters such as token ID, sequence patch size, sampling parameters, to be packed into a specialized custom tensor used mostly within these specialized operations.

Pseudo-Dynamism: In the attention block, we encounter dynamic operations that increase over time as the context grows. Since our GPU runtime lacks support for dynamic ops/tensors, we opt for fixed operations with a predefined maximum cache size. To reduce the computational complexity, we introduce a parameter enabling the skipping of certain value calculations or the processing of reduced data.

Optimized KV Cache Layout: Since the entries in the KV cache ultimately serve as weights for convolutions, employed in lieu of matrix multiplications, we store these in a specialized layout tailored for convolution weights. This strategic adjustment eliminates the necessity for extra conversions or reliance on unoptimized layouts, and therefore contributes to a more efficient and streamlined process.

What’s Next

We are thrilled with the optimizations and the performance in today’s experimental release of the MediaPipe LLM Inference API. This is just the start. Over 2024, we will expand to more platforms and models, offer broader conversion tools, complimentary on-device components, high level tasks, and more.

You can check out the official sample on GitHub demonstrating everything you’ve just learned about and read through our official documentation for even more details. Keep an eye on the Google for Developers YouTube channel for updates and tutorials.

Acknowledgements

We’d like to thank all team members who contributed to this work: T.J. Alumbaugh, Alek Andreev, Frank Ban, Jeanine Banks, Frank Barchard, Pulkit Bhuwalka, Buck Bourdon, Maxime Brénon, Chuo-Ling Chang, Yu-hui Chen, Linkun Chen, Lin Chen, Nikolai Chinaev, Clark Duvall, Rosário Fernandes, Mig Gerard, Matthias Grundmann, Ayush Gupta, Mohammadreza Heydary, Ekaterina Ignasheva, Ram Iyengar, Grant Jensen, Alex Kanaukou, Prianka Liz Kariat, Alan Kelly, Kathleen Kenealy, Ho Ko, Sachin Kotwani, Andrei Kulik, Yi-Chun Kuo, Khanh LeViet, Yang Lu, Lalit Singh Manral, Tyler Mullen, Karthik Raveendran, Raman Sarokin, Sebastian Schmidt, Kris Tonthat, Lu Wang, Tris Warkentin, and the Gemma Team

Source: Google for Developers Blog - News about Web, Mobile, AI and Cloud

Easily add document scanning capability to your app with ML Kit Document Scanner API

Posted by Thomas Ezan – Sr. Developer Relations Engineer; Chengji Yan, Penny Li – ML Kit Engineers; David Miro Llopis – Product Manager

We are excited to announce the launch of the ML Kit Document Scanner API. This new API makes it easy to add advanced document scanning capabilities with a high-quality and consistent user interface to your Android app. The ML Kit Document Scanner API enables your users to quickly and easily digitize paper documents.

Like the other ML Kit APIs, the ML Kit Document Scanner API enables you to seamlessly integrate features powered by Machine Learning (ML) without any ML knowledge.

Why Document Scanner SDK?

Despite the digital revolution, paper documents and printouts are still present in our everyday life. Some of our most important documents are still physical (identity documents, receipts, etc.).

The ML Kit Document Scanner API offers a number of benefits, including:

A high-quality and consistent user interface for digitizing physical documents.
Accurate document detection with precise corner and edge detection for a seamless scanning experience and optimal scanning results.
Flexible functionality allows users to crop scanned documents, apply filters, remove fingers, remove stains and other blemishes and send digitized files in PDF and JPEG formats back to your app.
On-device processing helps preserve privacy.
A complete solution eliminating the need for camera permission.

The ML Kit Document Scanner API is already used by Google Drive Android application and the Google Pixel Camera.

ML Kit Document scanner API in action in Google Drive

Get started

The ML Kit Document Scanner API requires Android API level 21 or above. The models, scanning logic, and UI flow are dynamically downloaded via Google Play services so the ML Kit Document Scanner API has a minimal impact on your app size.

To integrate it in your app, start by configuring the scanner options and getting a scanner client:

val options = GmsDocumentScannerOptions.Builder()
    .setGalleryImportAllowed(false)
    .setPageLimit(2)
    .setResultFormats(RESULT_FORMAT_JPEG, RESULT_FORMAT_PDF)
    .setScannerMode(SCANNER_MODE_FULL)
    .build()
val scanner = GmsDocumentScanning.getClient(options)

Then register an ActivityResultCallback to receive the scanning results:

val scannerLauncher = registerForActivityResult(StartIntentSenderForResult()) {
  result -> {
    if (result.resultCode == RESULT_OK) {
      val result =
        GmsDocumentScanningResult.fromActivityResultIntent(result.data)
      result.getPages()?.let { pages ->
        for (page in pages) {
          val imageUri = page.getImageUri()
        }
      }
      result.getPdf()?.let { pdf ->
        val pdfUri = pdf.getUri()
        val pageCount = pdf.getPageCount()
      }
    }
  }
}

Finally launch the document scanner activity:

scanner.getStartScanIntent(activity)
  .addOnSuccessListener { intentSender ->   
    scannescannerrLauncher.launch(IntentSenderRequest.Builder(intentSender).build())
  }
  .addOnFailureListener { ... }

To get started with the ML Kit Document Scanner API, visit the documentation. We can’t wait to see what you’ll build with it!

Source: Android Developers Blog

Introducing Gemma models in Keras

Posted by Martin Görner – Product Manager, Keras

The Keras team is happy to announce that Gemma, a family of lightweight, state-of-the art open models built from the same research and technology that we used to create the Gemini models, is now available in the KerasNLP collection. Thanks to Keras 3, Gemma runs on JAX, PyTorch and TensorFlow. With this release, Keras is also introducing several new features specifically designed for large language models: a new LoRA API (Low Rank Adaptation) and large scale model-parallel training capabilities.

If you want to dive directly into code samples, head here:

Get started with Gemma models

Fine-tune Gemma models with LoRA

Fine-tune Gemma models on multiple GPUs/TPUs

Get started

Gemma models come in portable 2B and 7B parameter sizes, and deliver significant advances against similar open models, and even some larger ones. For example:

Gemma 7B scores a new best-in class 64.3% of correct answers in the MMLU language understanding benchmark (vs. 62.5% for Mistral-7B and 54.8% for Llama2-13B)

Gemma adds +11 percentage points to the GSM8K benchmark score for grade-school math problems (46.4% for Gemma 7B vs. Mistral-7B 35.4%, Llama2-13B 28.7%)

and +6.1 percentage points of correct answers in HumanEval, a coding challenge (32.3% for Gemma 7B, vs. Mistral 7B 26.2%, Llama2 13B 18.3%).

Gemma models are offered with a familiar KerasNLP API and a super-readable Keras implementation. You can instantiate the model with a single line of code:

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

And run it directly on a text prompt – yes, tokenization is built-in, although you can easily split it out if needed - read the Keras NLP guide to see how.

gemma_lm.generate("Keras is a", max_length=32)
> "Keras is a popular deep learning framework for neural networks..."

Try it out here: Get started with Gemma models

Fine-tuning Gemma Models with LoRA

Thanks to Keras 3, you can choose the backend on which you run the model. Here is how to switch:

os.environ["KERAS_BACKEND"] = "jax"  # Or "tensorflow" or "torch".
import keras # import keras after having selected the backend

Keras 3 comes with several new features specifically for large language models. Chief among them is a new LoRA API (Low Rank Adaptation) for parameter-efficient fine-tuning. Here is how to activate it:

gemma_lm.backbone.enable_lora(rank=4)
# Note: rank=4 replaces the weights matrix of relevant layers with the 
# product AxB of two matrices of rank 4, which reduces the number of 
# trainable parameters.

This single line drops the number of trainable parameters from 2.5 billion to 1.3 million!

Try it out here: Fine-tune Gemma models with LoRA.

Fine-tuning Gemma models on multiple GPU/TPUs

Keras 3 also supports large-scale model training and Gemma is the perfect model to try it out. The new Keras distribution API offers data-parallel and model-parallel distributed training options. The new API is meant to be multi-backend but for the time being, it is implemented for the JAX backend only, because of its proven scalability (Gemma models were trained with JAX).

To fine-tune the larger Gemma 7B, a distributed setup is useful, for example a TPUv3 with 8 TPU cores that you can get for free on Kaggle, or an 8-GPU machine from Google Cloud. Here is how to configure the model for distributed training, using model parallelism:

device_mesh = keras.distribution.DeviceMesh(
   (1, 8), # Mesh topology
   ["batch", "model"], # named mesh axes
   devices=keras.distribution.list_devices() # actual accelerators
)


# Model config
layout_map = keras.distribution.LayoutMap(device_mesh)
layout_map["token_embedding/embeddings"] = (None, "model")
layout_map["decoder_block.*attention.*(query|key|value).*kernel"] = (
   None, "model", None)
layout_map["decoder_block.*attention_output.*kernel"] = (
   None, None, "model")
layout_map["decoder_block.*ffw_gating.*kernel"] = ("model", None)
layout_map["decoder_block.*ffw_linear.*kernel"] = (None, "model")


# Set the model config and load the model
model_parallel = keras.distribution.ModelParallel(
   device_mesh, layout_map, batch_dim_name="batch")
keras.distribution.set_distribution(model_parallel)
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_7b_en")
# Ready: you can now train with model.fit() or generate text with generate()

What this code snippet does is set up the 8 accelerators into a 1 x 8 matrix where the two dimensions are called “batch” and “model”. Model weights are sharded on the “model” dimension, here split between the 8 accelerators, while data batches are not partitioned since the “batch” dimension is 1.

Try it out here: Fine-tune Gemma models on multiple GPUs/TPUs.

What’s Next

We will soon be publishing a guide showing you how to correctly partition a Transformer model and write the 6 lines of partitioning setup above. It is not very long but it would not fit in this post.

You will have noticed that layer partitionings are defined through regexes on layer names. You can check layer names with this code snippet. We ran this to construct the LayoutMap above.

# This is for the first Transformer block only,
# but they all have the same structure
tlayer = gemma_lm.backbone.get_layer('decoder_block_0')
for variable in tlayer.weights:
 print(f'{variable.path:<58}  {str(variable.shape):<16}')

Full GSPMD model parallelism works here with just a few partitioning hints because Keras passes these settings to the powerful XLA compiler which figures out all the other details of the distributed computation.

We hope you will enjoy playing with Gemma models. Here is also an instruction-tuning tutorial that you might find useful. And by the way, if you want to share your fine-tuned weights with the community, the Kaggle Model Hub now supports user-tuned weights uploads. Head to the model page for Gemma models on Kaggle and see what others have already created!

Source: Google for Developers Blog - News about Web, Mobile, AI and Cloud

Magika: AI powered fast and efficient file type identification

Today we are open-sourcing Magika, Google’s AI-powered file-type identification system, to help others accurately detect binary and textual file types. Under the hood, Magika employs a custom, highly optimized deep-learning model, enabling precise file identification within milliseconds, even when running on a CPU.

Magika command line tool used to recognize a identify the type of a diverse set of files

You can try the Magika web demo today, or install it as a Python library and standalone command line tool (output is showcased above) by using the standard command line pip install magika.

Why identifying file type is difficult

Since the early days of computing, accurately detecting file types has been crucial in determining how to process files. Linux comes equipped with libmagic and the file utility, which have served as the de facto standard for file type identification for over 50 years. Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file. For example, modern code editors use file-type detection to choose which syntax coloring scheme to use as the developer starts typing in a new file.

Accurate file-type detection is a notoriously difficult problem because each file format has a different structure, or no structure at all. This is particularly challenging for textual formats and programming languages as they have very similar constructs. So far, libmagic and most other file-type-identification software have been relying on a handcrafted collection of heuristics and custom rules to detect each file format.

This manual approach is both time consuming and error prone as it is hard for humans to create generalized rules by hand. In particular for security applications, creating dependable detection is especially challenging as attackers are constantly attempting to confuse detection with adversarially-crafted payloads.

To address this issue and provide fast and accurate file-type detection we researched and developed Magika, a new AI powered file type detector. Under the hood, Magika uses a custom, highly optimized deep-learning model designed and trained using Keras that only weighs about 1MB. At inference time Magika uses Onnx as an inference engine to ensure files are identified in a matter of milliseconds, almost as fast as a non-AI tool even on CPU.

Magika Performance

Magika detection quality compared to other tools on our 1M files benchmark

Performance wise, Magika, thanks to its AI model and large training dataset, is able to outperform other existing tools by about 20% when evaluated on a 1M files benchmark that encompasses over 100 file types. Breaking down by file type, as reported in the table below, we see even greater performance gains on textual files, including code files and configuration files that other tools can struggle with.

Table showing various file type identification tools performance for a selection of the file types included in our benchmark

Various file type identification tools performance for a selection of the file types included in our benchmark - n/a indicates the tool doesn’t detect the given file type.

Magika at Google

Internally, Magika is used at scale to help improve Google users’ safety by routing Gmail, Drive, and Safe Browsing files to the proper security and content policy scanners. Looking at a weekly average of hundreds of billions of files reveals that Magika improves file type identification accuracy by 50% compared to our previous system that relied on handcrafted rules. In particular, this increase in accuracy allows us to scan 11% more files with our specialized malicious AI document scanners and reduce the number of unidentified files to 3%.

The upcoming integration of Magika with VirusTotal will complement the platform's existing Code Insight functionality, which employs Google's generative AI to analyze and detect malicious code. Magika will act as a pre-filter before files are analyzed by Code Insight, improving the platform’s efficiency and accuracy. This integration, due to VirusTotal’s collaborative nature, directly contributes to the global cybersecurity ecosystem, fostering a safer digital environment.

Open Sourcing Magika

By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.

Magika code and model are freely available starting today in Github under the Apache2 License. Magika can also quickly be installed as a standalone utility and python library via the pypi package manager by simply typing pip install magika with no GPU required. We also have an experimental npm package if you would like to use the TFJS version.

To learn more about how to use it, please refer to Magika documentation site.

Acknowledgements

Magika would not have been possible without the help of many people including: Ange Albertini, Loua Farah, Francois Galilee, Giancarlo Metitieri, Luca Invernizzi, Young Maeng, Alex Petit-Bianco , David Tao, Kurt Thomas, Amanda Walker

By Elie Bursztein – Cybersecurity AI Technical and Research Lead and Yanick Fratantonio – Cybersecurity Research Scientist