Tag Archives: AI

Large Language Models On-Device with MediaPipe and TensorFlow Lite

Posted by Mark Sherwood – Senior Product Manager and Juhyun Lee – Staff Software Engineer

TensorFlow Lite has been a powerful tool for on-device machine learning since its release in 2017, and MediaPipe further extended that power in 2019 by supporting complete ML pipelines. While these tools initially focused on smaller on-device models, today marks a dramatic shift with the experimental MediaPipe LLM Inference API.

This new release enables Large Language Models (LLMs) to run fully on-device across platforms. This new capability is particularly transformative considering the memory and compute demands of LLMs, which are over a hundred times larger than traditional on-device models. Optimizations across the on-device stack make this possible, including new ops, quantization, caching, and weight sharing.

The experimental cross-platform MediaPipe LLM Inference API, designed to streamline on-device LLM integration for web developers, supports Web, Android, and iOS with initial support for four openly available LLMs: Gemma, Phi 2, Falcon, and Stable LM. It gives researchers and developers the flexibility to prototype and test popular openly available LLM models on-device.

On Android, the MediaPipe LLM Inference API is intended for experimental and research use only. Production applications with LLMs can use the Gemini API or Gemini Nano on-device through Android AICore. AICore is the new system-level capability introduced in Android 14 to provide Gemini-powered solutions for high-end devices, including integrations with the latest ML accelerators, use-case optimized LoRA adapters, and safety filters. To start using Gemini Nano on-device with your app, apply to the Early Access Preview.


LLM Inference API

Starting today, you can test out the MediaPipe LLM Inference API via our web demo or by building our sample demo apps. You can experiment and integrate it into your projects via our Web, Android, or iOS SDKs.

Using the LLM Inference API allows you to bring LLMs on-device in just a few steps. These steps apply across web, iOS, and Android, though the SDK and native API will be platform specific. The following code samples show the web SDK.

1. Pick model weights compatible with one of our supported model architectures 

 

2. Convert the model weights into a TensorFlow Lite Flatbuffer using the MediaPipe Python Package

from mediapipe.tasks.python.genai import converter 

config = converter.ConversionConfig(...)
converter.convert_checkpoint(config)
 

3. Include the LLM Inference SDK in your application

import { FilesetResolver, LlmInference } from
"https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai”
 

4. Host the TensorFlow Lite Flatbuffer along with your application.

 

5. Use the LLM Inference API to take a text prompt and get a text response from your model.

const fileset  = await
FilesetResolver.forGenAiTasks("https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm");
const llmInference = await LlmInference.createFromModelPath(fileset, "model.bin");
const responseText = await llmInference.generateResponse("Hello, nice to meet you");
document.getElementById('output').textContent = responseText;


Please see our documentation and code examples for a detailed walk through of each of these steps.

Here are real time gifs of Gemma 2B running via the MediaPipe LLM Inference API.

moving image of Gemma 2B running on-device in browser via the MediaPipe LLM Inference API
Gemma 2B running on-device in browser via the MediaPipe LLM Inference API
moving image of Gemma 2B running on-device on iOS (left) and Android (right) via the MediaPipe LLM Inference API
Gemma 2B running on-device on iOS (left) and Android (right) via the MediaPipe LLM Inference API

Models

Our initial release supports the following four model architectures. Any model weights compatible with these architectures will work with the LLM Inference API. Use the base model weights, use a community fine-tuned version of the weights, or fine tune weights using your own data.

 Model

 Parameter Size

 Falcon 1B

 1.3 Billion

 Gemma 2B

 2.5 Billion

 Phi 2

 2.7 Billion

 Stable LM 3B

 2.8 Billion



Model Performance

Through significant optimizations, some of which are detailed below, the MediaPipe LLM Inference API is able to deliver state-of-the-art latency on-device, focusing on CPU and GPU to support multiple platforms. For sustained performance in a production setting on select premium phones, Android AICore can take advantage of hardware-specific neural accelerators.

When measuring latency for an LLM, there are a few terms and measurements to consider. Time to First Token and Decode Speed will be the two most meaningful as these measure how quickly you get the start of your response and how quickly the response generates once it starts.

 Term

 Significance

 Measurement

 Token

LLMs use tokens rather than words as inputs and outputs. Each model used with the LLM Inference API has a tokenizer built in which converts between words and tokens.

100 English words ≈ 130 tokens. However the conversion is dependent on the specific LLM and the language.

 Max Tokens

The maximum total tokens for the LLM prompt + response.

Configured in the LLM Inference API at runtime.

 Time to First Token

Time between calling the LLM Inference API and receiving the first token of the response.

Max Tokens / Prefill Speed

 Prefill Speed

How quickly a prompt is processed by an LLM.

Model and device specific. Benchmark numbers below.

 Decode Speed

How quickly a response is generated by an LLM.

Model and device specific. Benchmark numbers below.


The Prefill Speed and Decode Speed are dependent on model, hardware, and max tokens. They can also change depending on the current load of the device.

The following speeds were taken on high end devices using a max tokens of 1280 tokens, an input prompt of 1024 tokens, and int8 weight quantization. The exception being Gemma 2B (int4), found here on Kaggle, which uses a mixed 4/8-bit weight quantization.


Benchmarks

Graph showing prefill performance in tokens per second across WebGPU, iOS (GPU), Android (GPU), and Android (CPU)
Graph showing decode performance in tokens per second across WebGPU, iOS (GPU), Android (GPU), and Android (CPU)
On the GPU, Falcon 1B and Phi 2 use fp32 activations, while Gemma and StableLM 3B use fp16 activations as the latter models showed greater robustness to precision loss according to our quality eval studies. The lowest bit activation data type that maintained model quality was chosen for each. Note that Gemma 2B (int4) was the only model we could run on iOS due to its memory constraints, and we are working on enabling other models on iOS as well.

Performance Optimizations

To achieve the performance numbers above, countless optimizations were made across MediaPipe, TensorFlow Lite, XNNPack (our CPU neural network operator library), and our GPU-accelerated runtime. The following are a select few that resulted in meaningful performance improvements.

Weights Sharing: The LLM inference process comprises 2 phases: a prefill phase and a decode phase. Traditionally, this setup would require 2 separate inference contexts, each independently managing resources for its corresponding ML model. Given the memory demands of LLMs, we've added a feature that allows sharing the weights and the KV cache across inference contexts. Although sharing weights might seem straightforward, it has significant performance implications when sharing between compute-bound and memory-bound operations. In typical ML inference scenarios, where weights are not shared with other operators, they are meticulously configured for each fully connected operator separately to ensure optimal performance. Sharing weights with another operator implies a loss of per-operator optimization and this mandates the authoring of new kernel implementations that can run efficiently even on sub-optimal weights.

Optimized Fully Connected Ops: XNNPack’s FULLY_CONNECTED operation has undergone two significant optimizations for LLM inference. First, dynamic range quantization seamlessly merges the computational and memory benefits of full integer quantization with the precision advantages of floating-point inference. The utilization of int8/int4 weights not only enhances memory throughput but also achieves remarkable performance, especially with the efficient, in-register decoding of 4-bit weights requiring only one additional instruction. Second, we actively leverage the I8MM instructions in ARM v9 CPUs which enable the multiplication of a 2x8 int8 matrix by an 8x2 int8 matrix in a single instruction, resulting in twice the speed of the NEON dot product-based implementation.

Balancing Compute and Memory: Upon profiling the LLM inference, we identified distinct limitations for both phases: the prefill phase faces restrictions imposed by the compute capacity, while the decode phase is constrained by memory bandwidth. Consequently, each phase employs different strategies for dequantization of the shared int8/int4 weights. In the prefill phase, each convolution operator first dequantizes the weights into floating-point values before the primary computation, ensuring optimal performance for computationally intensive convolutions. Conversely, the decode phase minimizes memory bandwidth by adding the dequantization computation to the main mathematical convolution operations.

Flowchart showing compute-intensive prefill phase and memory-intensive decode phase, highlighting difference in performance bottlenecks
During the compute-intensive prefill phase, the int4 weights are dequantized a priori for optimal CONV_2D computation. In the memory-intensive decode phase, dequantization is performed on the fly, along with CONV_2D computation, to minimize the memory bandwidth usage.

Custom Operators: For GPU-accelerated LLM inference on-device, we rely extensively on custom operations to mitigate the inefficiency caused by numerous small shaders. These custom ops allow for special operator fusions and various LLM parameters such as token ID, sequence patch size, sampling parameters, to be packed into a specialized custom tensor used mostly within these specialized operations.

Pseudo-Dynamism: In the attention block, we encounter dynamic operations that increase over time as the context grows. Since our GPU runtime lacks support for dynamic ops/tensors, we opt for fixed operations with a predefined maximum cache size. To reduce the computational complexity, we introduce a parameter enabling the skipping of certain value calculations or the processing of reduced data.

Optimized KV Cache Layout: Since the entries in the KV cache ultimately serve as weights for convolutions, employed in lieu of matrix multiplications, we store these in a specialized layout tailored for convolution weights. This strategic adjustment eliminates the necessity for extra conversions or reliance on unoptimized layouts, and therefore contributes to a more efficient and streamlined process.


What’s Next

We are thrilled with the optimizations and the performance in today’s experimental release of the MediaPipe LLM Inference API. This is just the start. Over 2024, we will expand to more platforms and models, offer broader conversion tools, complimentary on-device components, high level tasks, and more.

You can check out the official sample on GitHub demonstrating everything you’ve just learned about and read through our official documentation for even more details. Keep an eye on the Google for Developers YouTube channel for updates and tutorials.


Acknowledgements

We’d like to thank all team members who contributed to this work: T.J. Alumbaugh, Alek Andreev, Frank Ban, Jeanine Banks, Frank Barchard, Pulkit Bhuwalka, Buck Bourdon, Maxime Brénon, Chuo-Ling Chang, Yu-hui Chen, Linkun Chen, Lin Chen, Nikolai Chinaev, Clark Duvall, Rosário Fernandes, Mig Gerard, Matthias Grundmann, Ayush Gupta, Mohammadreza Heydary, Ekaterina Ignasheva, Ram Iyengar, Grant Jensen, Alex Kanaukou, Prianka Liz Kariat, Alan Kelly, Kathleen Kenealy, Ho Ko, Sachin Kotwani, Andrei Kulik, Yi-Chun Kuo, Khanh LeViet, Yang Lu, Lalit Singh Manral, Tyler Mullen, Karthik Raveendran, Raman Sarokin, Sebastian Schmidt, Kris Tonthat, Lu Wang, Tris Warkentin, and the Gemma Team

Easily add document scanning capability to your app with ML Kit Document Scanner API

Posted by Thomas Ezan – Sr. Developer Relations Engineer; Chengji Yan, Penny Li – ML Kit Engineers; David Miro Llopis – Product Manager

We are excited to announce the launch of the ML Kit Document Scanner API. This new API makes it easy to add advanced document scanning capabilities with a high-quality and consistent user interface to your Android app. The ML Kit Document Scanner API enables your users to quickly and easily digitize paper documents.

Like the other ML Kit APIs, the ML Kit Document Scanner API enables you to seamlessly integrate features powered by Machine Learning (ML) without any ML knowledge.

ml kit document scanner illustration

Why Document Scanner SDK?

Despite the digital revolution, paper documents and printouts are still present in our everyday life. Some of our most important documents are still physical (identity documents, receipts, etc.).

The ML Kit Document Scanner API offers a number of benefits, including:

    • A high-quality and consistent user interface for digitizing physical documents.
    • Accurate document detection with precise corner and edge detection for a seamless scanning experience and optimal scanning results.
    • Flexible functionality allows users to crop scanned documents, apply filters, remove fingers, remove stains and other blemishes and send digitized files in PDF and JPEG formats back to your app.
    • On-device processing helps preserve privacy.
    • A complete solution eliminating the need for camera permission.

The ML Kit Document Scanner API is already used by Google Drive Android application and the Google Pixel Camera.

moving image showing ML Kit Document scanner API in action in  
Google Drive
ML Kit Document scanner API in action in Google Drive

Get started

The ML Kit Document Scanner API requires Android API level 21 or above. The models, scanning logic, and UI flow are dynamically downloaded via Google Play services so the ML Kit Document Scanner API has a minimal impact on your app size.

To integrate it in your app, start by configuring the scanner options and getting a scanner client:

val options = GmsDocumentScannerOptions.Builder()
    .setGalleryImportAllowed(false)
    .setPageLimit(2)
    .setResultFormats(RESULT_FORMAT_JPEG, RESULT_FORMAT_PDF)
    .setScannerMode(SCANNER_MODE_FULL)
    .build()
val scanner = GmsDocumentScanning.getClient(options)

Then register an ActivityResultCallback to receive the scanning results:

val scannerLauncher = registerForActivityResult(StartIntentSenderForResult()) {
  result -> {
    if (result.resultCode == RESULT_OK) {
      val result =
        GmsDocumentScanningResult.fromActivityResultIntent(result.data)
      result.getPages()?.let { pages ->
        for (page in pages) {
          val imageUri = page.getImageUri()
        }
      }
      result.getPdf()?.let { pdf ->
        val pdfUri = pdf.getUri()
        val pageCount = pdf.getPageCount()
      }
    }
  }
}

Finally launch the document scanner activity:

scanner.getStartScanIntent(activity)
  .addOnSuccessListener { intentSender ->   
    scannescannerrLauncher.launch(IntentSenderRequest.Builder(intentSender).build())
  }
  .addOnFailureListener { ... }

To get started with the ML Kit Document Scanner API, visit the documentation. We can’t wait to see what you’ll build with it!

Introducing Gemma models in Keras

Posted by Martin Görner – Product Manager, Keras

The Keras team is happy to announce that Gemma, a family of lightweight, state-of-the art open models built from the same research and technology that we used to create the Gemini models, is now available in the KerasNLP collection. Thanks to Keras 3, Gemma runs on JAX, PyTorch and TensorFlow. With this release, Keras is also introducing several new features specifically designed for large language models: a new LoRA API (Low Rank Adaptation) and large scale model-parallel training capabilities.

If you want to dive directly into code samples, head here:


Get started

Gemma models come in portable 2B and 7B parameter sizes, and deliver significant advances against similar open models, and even some larger ones. For example:

  • Gemma 7B scores a new best-in class 64.3% of correct answers in the MMLU language understanding benchmark (vs. 62.5% for Mistral-7B and 54.8% for Llama2-13B)
  • Gemma adds +11 percentage points to the GSM8K benchmark score for grade-school math problems (46.4% for Gemma 7B vs. Mistral-7B 35.4%, Llama2-13B 28.7%)
  • and +6.1 percentage points of correct answers in HumanEval, a coding challenge (32.3% for Gemma 7B, vs. Mistral 7B 26.2%, Llama2 13B 18.3%).

Gemma models are offered with a familiar KerasNLP API and a super-readable Keras implementation. You can instantiate the model with a single line of code:

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

And run it directly on a text prompt – yes, tokenization is built-in, although you can easily split it out if needed - read the Keras NLP guide to see how.

gemma_lm.generate("Keras is a", max_length=32)
> "Keras is a popular deep learning framework for neural networks..."

Try it out here: Get started with Gemma models


Fine-tuning Gemma Models with LoRA

Thanks to Keras 3, you can choose the backend on which you run the model. Here is how to switch:

os.environ["KERAS_BACKEND"] = "jax"  # Or "tensorflow" or "torch".
import keras # import keras after having selected the backend

Keras 3 comes with several new features specifically for large language models. Chief among them is a new LoRA API (Low Rank Adaptation) for parameter-efficient fine-tuning. Here is how to activate it:

gemma_lm.backbone.enable_lora(rank=4)
# Note: rank=4 replaces the weights matrix of relevant layers with the 
# product AxB of two matrices of rank 4, which reduces the number of 
# trainable parameters.

This single line drops the number of trainable parameters from 2.5 billion to 1.3 million!

Try it out here: Fine-tune Gemma models with LoRA.


Fine-tuning Gemma models on multiple GPU/TPUs

Keras 3 also supports large-scale model training and Gemma is the perfect model to try it out. The new Keras distribution API offers data-parallel and model-parallel distributed training options. The new API is meant to be multi-backend but for the time being, it is implemented for the JAX backend only, because of its proven scalability (Gemma models were trained with JAX).

To fine-tune the larger Gemma 7B, a distributed setup is useful, for example a TPUv3 with 8 TPU cores that you can get for free on Kaggle, or an 8-GPU machine from Google Cloud. Here is how to configure the model for distributed training, using model parallelism:

device_mesh = keras.distribution.DeviceMesh(
   (1, 8), # Mesh topology
   ["batch", "model"], # named mesh axes
   devices=keras.distribution.list_devices() # actual accelerators
)


# Model config
layout_map = keras.distribution.LayoutMap(device_mesh)
layout_map["token_embedding/embeddings"] = (None, "model")
layout_map["decoder_block.*attention.*(query|key|value).*kernel"] = (
   None, "model", None)
layout_map["decoder_block.*attention_output.*kernel"] = (
   None, None, "model")
layout_map["decoder_block.*ffw_gating.*kernel"] = ("model", None)
layout_map["decoder_block.*ffw_linear.*kernel"] = (None, "model")


# Set the model config and load the model
model_parallel = keras.distribution.ModelParallel(
   device_mesh, layout_map, batch_dim_name="batch")
keras.distribution.set_distribution(model_parallel)
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_7b_en")
# Ready: you can now train with model.fit() or generate text with generate()

What this code snippet does is set up the 8 accelerators into a 1 x 8 matrix where the two dimensions are called “batch” and “model”. Model weights are sharded on the “model” dimension, here split between the 8 accelerators, while data batches are not partitioned since the “batch” dimension is 1.

Try it out here: Fine-tune Gemma models on multiple GPUs/TPUs.


What’s Next

We will soon be publishing a guide showing you how to correctly partition a Transformer model and write the 6 lines of partitioning setup above. It is not very long but it would not fit in this post.

You will have noticed that layer partitionings are defined through regexes on layer names. You can check layer names with this code snippet. We ran this to construct the LayoutMap above.

# This is for the first Transformer block only,
# but they all have the same structure
tlayer = gemma_lm.backbone.get_layer('decoder_block_0')
for variable in tlayer.weights:
 print(f'{variable.path:<58}  {str(variable.shape):<16}')

Full GSPMD model parallelism works here with just a few partitioning hints because Keras passes these settings to the powerful XLA compiler which figures out all the other details of the distributed computation.


We hope you will enjoy playing with Gemma models. Here is also an instruction-tuning tutorial that you might find useful. And by the way, if you want to share your fine-tuned weights with the community, the Kaggle Model Hub now supports user-tuned weights uploads. Head to the model page for Gemma models on Kaggle and see what others have already created!

Building Open Models Responsibly in the Gemini Era

Google has long believed that open technology is not only good for our company, but good for the industry, consumers, and the world. We’ve released open-source projects like Android and Chromium that transformed access to mobile and web technologies, and have done the same in AI with Transformers, TensorFlow, and AlphaFold. The release of our Gemma family of open models is a next step in how we’re deepening our commitment to open technology alongside an industry-leading safe, responsible approach. At the same time, the rapidly evolving nature of AI raises important considerations for how to enable safety-aligned open models: an approach that supports broad innovation while promoting safe uses.

A benefit of open source is that once it is released, its license gives users full creative autonomy. This is a powerful guarantee of technology access for developers and end users. Another benefit is that open-source technology can be modified to fit the unique use case of the end user, without restriction.

In the hands of a malicious actor, however, the lack of restrictions can raise risks. Computing has been through similar cycles before, addressing issues such as protecting users of the open internet, handling cryptography, and addressing open-source software security. We now face this challenge with AI. Below we share the approach we took to openly releasing Gemma models, and the advancements in open model safety we hope to accelerate.


Providing access to Gemma open models

Today, Gemma models are being released as what the industry collectively has begun to refer to as “open models.” Open models feature free access to the model weights, but terms of use, redistribution, and variant ownership vary according to a model’s specific terms of use, which may not be based on an open-source license. The Gemma models’ terms of use make them freely available for individual developers, researchers, and commercial users for access and redistribution. Users are also free to create and publish model variants. In using Gemma models, developers agree to avoid harmful uses, reflecting our commitment to developing AI responsibly while increasing access to this technology.

We’re precise about the language we’re using to describe Gemma models because we’re proud to enable responsible AI access and innovation, and we’re equally proud supporters of open source. The definition of "Open Source" has been invaluable to computing and innovation because of requirements for redistribution and derived works, and against discrimination. These requirements enable cross-industry collaboration, individual innovation and entrepreneurship, and shared research to happen with exponential effects.

However, existing open-source concepts can’t always be directly applied to AI systems, which raises questions on how to use open-source licenses with AI. It’s important that we carry forward open principles that have made the sea-change we’re experiencing with AI possible while clarifying the concept of open-source AI and addressing concepts like derived work and author attribution.


Taking a comprehensive approach to releasing Gemma safely and responsibly

Licensing and terms of use are only one part of the evaluations, technical tools, and considered decision-making that went into aligning this release with our responsible AI Principles. Our approach involved:

  • Systematic internal review in accordance with our AI Principles: Consistent with our AI Principles, we release models only when we have determined the benefits are significant, and the risks of misuse are low or can be mitigated. We take that same approach to open models, incorporating a balance of the benefits of wider access to a particular model as well as the risks of misuse and how we can mitigate them. With Gemma, we considered the increased AI research and innovation by us and many others in the community, the access to AI technology the models could bring, and what access was needed to support these use cases.
  • A high evaluation bar: Gemma models underwent thorough evaluations, and were held to a higher bar for evaluating risk of abuse or harm than our proprietary models, given the more limited mitigations currently available for open models. These evaluations cover a broad range of responsible AI areas, including safety, fairness, privacy, societal risk, as well as capabilities such as chemical, biological, radiological, nuclear (CBRN) risks, cybersecurity, and autonomous replication. As described in our technical report, the Gemma models exhibit state-of-the-art safety performance in human side-by-side evaluations.
  • Responsibility tools for developers: As we release the Gemma models, we are also releasing a Responsible Generative AI Toolkit for developers, providing guidance and tools to help them create safer AI applications.

We continue to evolve our approach. As we build these frameworks further, we will proceed thoughtfully and incorporate what we learn into future model assessments. We will continue to explore the full range of access mechanisms, with benefits and risk mitigation in mind, including API-based access and staged releases.


Advancing open model safety together

Many of today’s AI safety tools are designed for systems where the design approach assumes restricted access and redistribution, as well as auxiliary controls like query filters. Similarly, much of the AI safety research for improving mitigations takes on the design assumptions of those systems. Just as we have created unique threat models and solutions for other open technology, we are developing safety and security tools appropriate for the differences of openly available AI.

As models become more and more capable, we are conducting research and investing in rigorous safety evaluation, testing, and mitigations for open models. We are also actively participating in conversations with policymakers and open-source community leaders on how the industry should approach this technology. This challenge is multifaceted, just like AI systems themselves. Model-sharing platforms like Hugging Face and Kaggle, where developers inspire each other with novel model iterations, play a critical role in efforts to develop open models safely; there is also a role for the cybersecurity community to contribute learnings and best practices.

Building those solutions requires access to open models, sharing innovations and improvements. We believe sharing the Gemma models will not just help increase access to AI technology, but also help the industry develop new approaches to safety and responsibility.

As developers adopt Gemma models and other safety-aligned open models, we look forward to working with the open-source community to develop more solutions for responsible approaches to AI in the open ecosystem. A global diversity of experiences, perspectives, and opportunities will help build safe and responsible AI that works for everyone.

By Anne Bertucio – Sr Program Manager, Open Source Programs Office; Helen King – Sr Director of Responsibility, Google DeepMind

Magika: AI powered fast and efficient file type identification

Today we are open-sourcing Magika, Google’s AI-powered file-type identification system, to help others accurately detect binary and textual file types. Under the hood, Magika employs a custom, highly optimized deep-learning model, enabling precise file identification within milliseconds, even when running on a CPU.

Magika command line tool used to recognize a identify the type of a diverse set of files
Magika command line tool used to recognize a identify the type of a diverse set of files

You can try the Magika web demo today, or install it as a Python library and standalone command line tool (output is showcased above) by using the standard command line pip install magika.

Why identifying file type is difficult

Since the early days of computing, accurately detecting file types has been crucial in determining how to process files. Linux comes equipped with libmagic and the file utility, which have served as the de facto standard for file type identification for over 50 years. Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file. For example, modern code editors use file-type detection to choose which syntax coloring scheme to use as the developer starts typing in a new file.

Accurate file-type detection is a notoriously difficult problem because each file format has a different structure, or no structure at all. This is particularly challenging for textual formats and programming languages as they have very similar constructs. So far, libmagic and most other file-type-identification software have been relying on a handcrafted collection of heuristics and custom rules to detect each file format.

This manual approach is both time consuming and error prone as it is hard for humans to create generalized rules by hand. In particular for security applications, creating dependable detection is especially challenging as attackers are constantly attempting to confuse detection with adversarially-crafted payloads.

To address this issue and provide fast and accurate file-type detection we researched and developed Magika, a new AI powered file type detector. Under the hood, Magika uses a custom, highly optimized deep-learning model designed and trained using Keras that only weighs about 1MB. At inference time Magika uses Onnx as an inference engine to ensure files are identified in a matter of milliseconds, almost as fast as a non-AI tool even on CPU.

Magika Performance

Magika detection quality compared to other tools on our 1M files benchmark
Magika detection quality compared to other tools on our 1M files benchmark

Performance wise, Magika, thanks to its AI model and large training dataset, is able to outperform other existing tools by about 20% when evaluated on a 1M files benchmark that encompasses over 100 file types. Breaking down by file type, as reported in the table below, we see even greater performance gains on textual files, including code files and configuration files that other tools can struggle with.

Table showing various file type identification tools performance for a selection of the file types included in our benchmark
Various file type identification tools performance for a selection of the file types included in our benchmark - n/a indicates the tool doesn’t detect the given file type.

Magika at Google

Internally, Magika is used at scale to help improve Google users’ safety by routing Gmail, Drive, and Safe Browsing files to the proper security and content policy scanners. Looking at a weekly average of hundreds of billions of files reveals that Magika improves file type identification accuracy by 50% compared to our previous system that relied on handcrafted rules. In particular, this increase in accuracy allows us to scan 11% more files with our specialized malicious AI document scanners and reduce the number of unidentified files to 3%.

The upcoming integration of Magika with VirusTotal will complement the platform's existing Code Insight functionality, which employs Google's generative AI to analyze and detect malicious code. Magika will act as a pre-filter before files are analyzed by Code Insight, improving the platform’s efficiency and accuracy. This integration, due to VirusTotal’s collaborative nature, directly contributes to the global cybersecurity ecosystem, fostering a safer digital environment.

Open Sourcing Magika

By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.

Magika code and model are freely available starting today in Github under the Apache2 License. Magika can also quickly be installed as a standalone utility and python library via the pypi package manager by simply typing pip install magika with no GPU required. We also have an experimental npm package if you would like to use the TFJS version.

To learn more about how to use it, please refer to Magika documentation site.


Acknowledgements

Magika would not have been possible without the help of many people including: Ange Albertini, Loua Farah, Francois Galilee, Giancarlo Metitieri, Luca Invernizzi, Young Maeng, Alex Petit-Bianco , David Tao, Kurt Thomas, Amanda Walker

By Elie Bursztein – Cybersecurity AI Technical and Research Lead and Yanick Fratantonio – Cybersecurity Research Scientist

Magika: AI powered fast and efficient file type identification

Today we are open-sourcing Magika, Google’s AI-powered file-type identification system, to help others accurately detect binary and textual file types. Under the hood, Magika employs a custom, highly optimized deep-learning model, enabling precise file identification within milliseconds, even when running on a CPU.

Magika command line tool used to recognize a identify the type of a diverse set of files
Magika command line tool used to recognize a identify the type of a diverse set of files

You can try the Magika web demo today, or install it as a Python library and standalone command line tool (output is showcased above) by using the standard command line pip install magika.

Why identifying file type is difficult

Since the early days of computing, accurately detecting file types has been crucial in determining how to process files. Linux comes equipped with libmagic and the file utility, which have served as the de facto standard for file type identification for over 50 years. Today web browsers, code editors, and countless other software rely on file-type detection to decide how to properly render a file. For example, modern code editors use file-type detection to choose which syntax coloring scheme to use as the developer starts typing in a new file.

Accurate file-type detection is a notoriously difficult problem because each file format has a different structure, or no structure at all. This is particularly challenging for textual formats and programming languages as they have very similar constructs. So far, libmagic and most other file-type-identification software have been relying on a handcrafted collection of heuristics and custom rules to detect each file format.

This manual approach is both time consuming and error prone as it is hard for humans to create generalized rules by hand. In particular for security applications, creating dependable detection is especially challenging as attackers are constantly attempting to confuse detection with adversarially-crafted payloads.

To address this issue and provide fast and accurate file-type detection we researched and developed Magika, a new AI powered file type detector. Under the hood, Magika uses a custom, highly optimized deep-learning model designed and trained using Keras that only weighs about 1MB. At inference time Magika uses Onnx as an inference engine to ensure files are identified in a matter of milliseconds, almost as fast as a non-AI tool even on CPU.

Magika Performance

Magika detection quality compared to other tools on our 1M files benchmark
Magika detection quality compared to other tools on our 1M files benchmark

Performance wise, Magika, thanks to its AI model and large training dataset, is able to outperform other existing tools by about 20% when evaluated on a 1M files benchmark that encompasses over 100 file types. Breaking down by file type, as reported in the table below, we see even greater performance gains on textual files, including code files and configuration files that other tools can struggle with.

Table showing various file type identification tools performance for a selection of the file types included in our benchmark
Various file type identification tools performance for a selection of the file types included in our benchmark - n/a indicates the tool doesn’t detect the given file type.

Magika at Google

Internally, Magika is used at scale to help improve Google users’ safety by routing Gmail, Drive, and Safe Browsing files to the proper security and content policy scanners. Looking at a weekly average of hundreds of billions of files reveals that Magika improves file type identification accuracy by 50% compared to our previous system that relied on handcrafted rules. In particular, this increase in accuracy allows us to scan 11% more files with our specialized malicious AI document scanners and reduce the number of unidentified files to 3%.

The upcoming integration of Magika with VirusTotal will complement the platform's existing Code Insight functionality, which employs Google's generative AI to analyze and detect malicious code. Magika will act as a pre-filter before files are analyzed by Code Insight, improving the platform’s efficiency and accuracy. This integration, due to VirusTotal’s collaborative nature, directly contributes to the global cybersecurity ecosystem, fostering a safer digital environment.

Open Sourcing Magika

By open-sourcing Magika, we aim to help other software improve their file identification accuracy and offer researchers a reliable method for identifying file types at scale.

Magika code and model are freely available starting today in Github under the Apache2 License. Magika can also quickly be installed as a standalone utility and python library via the pypi package manager by simply typing pip install magika with no GPU required. We also have an experimental npm package if you would like to use the TFJS version.

To learn more about how to use it, please refer to Magika documentation site.


Acknowledgements

Magika would not have been possible without the help of many people including: Ange Albertini, Loua Farah, Francois Galilee, Giancarlo Metitieri, Luca Invernizzi, Young Maeng, Alex Petit-Bianco , David Tao, Kurt Thomas, Amanda Walker

By Elie Bursztein – Cybersecurity AI Technical and Research Lead and Yanick Fratantonio – Cybersecurity Research Scientist