
Many languages spoken worldwide cover numerous regional varieties (sometimes called dialects), such as Brazilian and European Portuguese or Mainland and Taiwan Mandarin Chinese. Although such varieties are often mutually intelligible to their speakers, there are still important differences. For example, the Brazilian Portuguese word for “bus” is ônibus, while the European Portuguese word is autocarro. Yet, today’s machine translation (MT) systems typically do not allow users to specify which variety of a language to translate into. This may lead to confusion if the system outputs the “wrong” variety or mixes varieties in an unnatural way. Also, region-unaware MT systems tend to favor whichever variety has more data available online, which disproportionately affects speakers of under-resourced language varieties.
In “FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation”, accepted for publication in Transactions of the Association for Computational Linguistics, we present an evaluation dataset used to measure MT systems’ ability to support regional varieties through a case study on Brazilian vs. European Portuguese and Mainland vs. Taiwan Mandarin Chinese. With the release of the FRMT data and accompanying evaluation code, we hope to inspire and enable the research community to discover new ways of creating MT systems that are applicable to the large number of regional language varieties spoken worldwide.
Most modern MT systems are trained on millions or billions of example translations, such as an English input sentence and its corresponding Portuguese translation. However, the vast majority of available training data doesn’t specify what regional variety the translation is in. In light of this data scarcity, we position FRMT as a benchmark for few-shot translation, measuring an MT model’s ability to translate into regional varieties when given no more than 100 labeled examples of each language variety. MT models need to use the linguistic patterns showcased in the small number of labeled examples (called “exemplars”) to identify similar patterns in their unlabeled training examples. In this way, models can generalize, producing correct translations of phenomena not explicitly shown in the exemplars.
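To make the setup concrete, here is a minimal sketch of how an FRMT-style benchmark item and the few-shot exemplar pool might be represented in code. The field and function names are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class FrmtExample:
    """One benchmark item: an English source and its region-specific references."""
    english: str
    pt_br: str   # Brazilian Portuguese reference
    pt_pt: str   # European Portuguese reference
    bucket: str  # "lexical", "entity", or "random"

# The few-shot constraint: a model may see at most 100 labeled exemplars
# per language variety and must generalize from those patterns.
MAX_EXEMPLARS = 100

def make_exemplar_pool(examples: list[FrmtExample], variety: str) -> list[tuple[str, str]]:
    """Select up to MAX_EXEMPLARS (source, target) pairs for one variety."""
    pairs = [(ex.english, ex.pt_br if variety == "pt-BR" else ex.pt_pt)
             for ex in examples]
    return pairs[:MAX_EXEMPLARS]
```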
An illustration of a few-shot MT system translating the English sentence, “The bus arrived,” into two regional varieties of Portuguese: Brazilian (🇧🇷; left) and European (🇵🇹; right).
Few-shot approaches to MT are attractive because they make it much easier to add support for additional regional varieties to an existing system. While our work is specific to regional varieties of two languages, we anticipate that methods that perform well will be readily applicable to other languages and regional varieties. In principle, those methods should also work for other language distinctions, such as formality and style.
The FRMT dataset consists of partial English Wikipedia articles, sourced from the Wiki40b dataset, that have been translated by paid, professional translators into different regional varieties of Portuguese and Mandarin. In order to highlight key region-aware translation challenges, we designed the dataset using three content buckets: (1) Lexical, focusing on region-distinct word choices such as the ônibus/autocarro example above; (2) Entity, covering people, locations, and other entities strongly associated with one of the two regions; and (3) Random, a control set of randomly sampled articles.
To verify that the translations collected for the FRMT dataset capture region-specific phenomena, we conducted a human evaluation of their quality. Expert annotators from each region used the Multi-dimensional Quality Metrics (MQM) framework to identify and categorize errors in the translations. The framework includes a category-wise weighting scheme to convert the identified errors into a single score that roughly represents the number of major errors per sentence; so a lower number indicates a better translation. For each region, we asked MQM raters to score both translations from their region and translations from their language’s other region. For example, Brazilian Portuguese raters scored both the Brazilian and European Portuguese translations. The difference between these two scores indicates the prevalence of linguistic phenomena that are acceptable in one variety but not the other. We found that in both Portuguese and Chinese, raters identified, on average, approximately two more major errors per sentence in the mismatched translations than in the matched ones. This indicates that our dataset truly does capture region-specific phenomena.
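As a rough sketch of how an MQM-style weighted score can be computed, consider the following. The severity weights are illustrative placeholders, not the exact weighting used in our evaluation.

```python
# Illustrative MQM-style severity weights: major errors are weighted much
# more heavily than minor ones. Placeholder values, not the study's weights.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

def mqm_score(annotated_errors: list[list[str]]) -> float:
    """Average weighted error count per sentence; lower is better.

    annotated_errors[i] lists the severities raters marked in sentence i.
    """
    total = sum(SEVERITY_WEIGHTS[sev] for sent in annotated_errors for sev in sent)
    return total / len(annotated_errors)

# Three sentences: one major error, no errors, one minor plus one major error.
print(mqm_score([["major"], [], ["minor", "major"]]))  # (5 + 0 + 6) / 3 ≈ 3.67
```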
While human evaluation is the best way to be sure of model quality, it is often slow and expensive. We therefore wanted to find an existing automatic metric that researchers can use to evaluate their models on our benchmark, and considered chrF, BLEU, and BLEURT. Using the translations from a few baseline models that were also evaluated by our MQM raters, we found that BLEURT has the best correlation with human judgments, and that the strength of that correlation (Pearson's ρ = 0.65) is comparable to the inter-annotator consistency (intraclass correlation of 0.70).
| Metric | Pearson's ρ |
|--------|-------------|
| chrF   | 0.48        |
| BLEU   | 0.58        |
| BLEURT | 0.65        |

Correlation between different automatic metrics and human judgments of translation quality on a subset of FRMT. Values are between -1 and 1; higher is better.
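For readers who want to reproduce this kind of analysis, here is a minimal sketch using off-the-shelf tools: sacrebleu for segment-level chrF and scipy for Pearson's ρ. The data below are toy stand-ins for real system outputs and MQM ratings.

```python
import sacrebleu
from scipy.stats import pearsonr

# Toy stand-ins: system outputs, references, and per-segment human MQM scores.
hypotheses = ["O autocarro chegou.", "Eu adoro o filme.", "Ela comprou um carro novo."]
references = ["O autocarro chegou.", "Eu gosto do filme.", "Ela comprou um carro novo."]
mqm = [0.0, 1.0, 0.0]  # lower = better translation

# Segment-level automatic scores (higher = better).
chrf = [sacrebleu.sentence_chrf(hyp, [ref]).score
        for hyp, ref in zip(hypotheses, references)]

# MQM counts errors, so a good metric should correlate negatively with it;
# compare the magnitude of rho against the table above.
rho, _ = pearsonr(chrf, mqm)
print(f"chrF vs. MQM: Pearson rho = {rho:.2f}")
```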
Our evaluation covered a handful of recent models capable of few-shot control. Based on human evaluation with MQM, the baseline methods all showed some ability to localize their output for Portuguese, but for Mandarin, they mostly failed to use knowledge of the targeted region to produce superior Mainland or Taiwan translations.
Google’s recent language model, PaLM, was rated best overall among the baselines we evaluated. In order to produce region-targeted translations with PaLM, we feed an instructive prompt into the model and then generate text from it to fill in the blank (see the example shown below).
Translate the following texts from English to European Portuguese.
English: [English example 1].
European Portuguese: [correct translation 1].
...
English: [input].
European Portuguese: _____
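A minimal sketch of how such a prompt might be assembled programmatically is shown below; the exemplar pairs are illustrative, and the call that would send the prompt to the model is omitted, since PaLM is not invoked this way through a public API.

```python
def build_prompt(exemplars: list[tuple[str, str]], source: str,
                 variety: str = "European Portuguese") -> str:
    """Assemble a few-shot, region-targeted translation prompt."""
    lines = [f"Translate the following texts from English to {variety}."]
    for english, translation in exemplars:
        lines.append(f"English: {english}")
        lines.append(f"{variety}: {translation}")
    lines.append(f"English: {source}")
    lines.append(f"{variety}:")  # the model completes this line
    return "\n".join(lines)

# One-shot example using the region-distinct "bus" vocabulary from above.
exemplars = [("The bus arrived.", "O autocarro chegou.")]
print(build_prompt(exemplars, "Where is the bus stop?"))
```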
PaLM obtained strong results using a single example, and had marginal quality gains on Portuguese when increasing to ten examples. This performance is impressive when taking into consideration that PaLM was trained in an unsupervised way. Our results also suggest language models like PaLM may be particularly adept at memorizing region-specific word choices required for fluent translation. However, there is still a significant performance gap between PaLM and human performance. See our paper for more details.
MQM performance across dataset buckets using human and PaLM translations. Thick bars represent the region-matched case, where raters from each region evaluate translations targeted at their own region. Thin, inset bars represent the region-mismatched case, where raters from each region evaluate translations targeted at the other region. Human translations exhibit regional phenomena in all cases. PaLM translations do so for all Portuguese buckets and the Mandarin lexical bucket only.
In the near future, we hope to see a world where language generation systems, especially machine translation, can support all speaker communities. We want to meet users where they are, generating language that is fluent and appropriate for their locale or region. To that end, we have released the FRMT dataset and benchmark, enabling researchers to easily compare performance of region-aware MT models. Validated via our thorough human-evaluation studies, the language varieties in FRMT exhibit significant differences that outputs from region-aware MT models should reflect. We are excited to see how researchers utilize this benchmark in the development of new MT models that better support under-represented language varieties and all speaker communities, leading to improved equitability in natural-language technologies.
We gratefully acknowledge our paper co-authors for all their contributions to this project: Timothy Dozat, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, and Noah Constant. For helpful discussion and comments on the paper, we thank Jacob Eisenstein, Noah Fiedel, Macduff Hughes and Mingfei Lau. For essential feedback around specific regional language differences, we thank Andre Araujo, Chung-Ching Chang, Andreia Cunha, Filipe Gonçalves, Nuno Guerreiro, Mandy Guo, Luis Miranda, Vitor Rodrigues and Linting Xue. For logistical support in collecting human translations and ratings, we thank the Google Translate team. We thank the professional translators and MQM raters for their role in producing the dataset. We also thank Tom Small for providing the animation in this post.
Posted by the Dev Library Team
In this newsletter, we’re highlighting the best projects developed with Google technologies that have been contributed to the Google Dev Library platform. We hope this will spark some inspiration for your next project!
See how to use a lightweight, easy-to-use image picker library with features like cropping, compression, rotation, video, and Live Photos support.
Dive into a comprehensive overview of coroutines including tips and best practices, along with a detailed explanation of the different types of coroutines available in Kotlin and how to use them effectively.
Learn about the diffusion method, a technique used to improve the stability and performance of image-to-image translation models.
Learn how state management in Jetpack Compose is implemented, how it can be used to build a responsive and dynamic UI, and how it compares to other solutions in Android development.
Find out how to switch between different environments (such as development, staging, and production) in an Android app.
Migrate an existing legacy Android application to Jetpack Compose, a modern UI toolkit for building native Android apps.
Understand the benefits of using TensorFlow for image processing, including the ability to easily parallelize computations and utilize GPUs for faster processing.
Look into the Flax implementation of the Stable Diffusion model to better understand how it works.
See the tool that allows you to quickly create a TensorFlow application by generating the necessary code and file structure.
Dive into a set of pre-built validation rules and error messages for commonly encountered use cases, making it easy to quickly implement robust form validation for your application.
Learn how to pass configuration data dynamically between modules in an Angular application.
Look at the first part of a series aimed at helping developers understand how to use the tools effectively to build applications with Dart and Flutter.
Explore the insights, code snippets, and results of an experiment that helps readers better understand the concept of Server-Driven UI and its potential in Flutter app development.
Learn about an easy-to-use API for accessing the device's photo library that performs operations like retrieving images, videos, and albums, as well as deleting, creating, and updating files in the photo library.
Here’s a simple solution for implementing Firebase Authentication with email and password, using a clean architecture with Jetpack Compose on Android.
Learn how this Grafana plugin allows users to perform operations like querying, aggregating, and visualizing data from Firestore, making it a powerful tool for monitoring and analyzing real-time data in a variety of applications. The repository provides the source code for the plugin and documentation on how to install and use it with Grafana.
See how this project aims to replicate clusters across different cloud environments and examine these varying infrastructure models.
Explore the basic features and benefits of Cloud Firestore, and learn how this document database serves as a scalable and versatile NoSQL cloud database.
Discover how Dataplex can be used to transform data to meet specific business requirements, and how it can integrate with other Google Cloud services like BigQuery for efficient data storage and analysis.
The Beta channel has been updated to 111.0.5563.33 for Windows, Mac and Linux.
A full list of changes in this build is available in the log. Interested in switching release channels? Find out how here. If you find a new issue, please let us know by filing a bug. The community help forum is also a great place to reach out for help or learn about common issues.
The dev channel has been updated to 112.0.5596.2 for Windows, Linux and Mac.
As previously announced, in the coming months, we’ll migrate Reminders from Google Calendar and Google Assistant to Google Tasks to create a single experience for managing to-dos across Google.
Users can create tasks from Calendar and by using the hands-free power of Assistant, similar to how they previously created Reminders. Additionally, unlike with Reminders, they can create tasks from other Google Workspace apps like Gmail, Docs and Chat, or directly from the Tasks app.
If you’re a Google Workspace customer with the Tasks service ON in your organization, your end users can voluntarily migrate beginning April 12, 2023. This migration prompt will appear for users with personal accounts starting on March 6, 2023.
This change impacts all Google Workspace customers, legacy G Suite Basic and Business customers, and users with personal Google accounts.
Posted by Harini Chandrasekharan, Staff Software Engineer, Google Play
The Google Play Store, launched 10 years ago in 2012, sits at the heart of Android, connecting billions of users with an equally staggering and ever-growing collection of apps and games worldwide.
Let's take a peek behind the curtain to learn what it takes to design the serving infrastructure of the world's largest Android marketplace. In the world of consumer-facing software, it's no surprise that out-of-the-box engineering solutions fail to meet the requirements that Google scale demands. Therefore, every system at Google is carefully crafted and honed with iterative enhancements to meet the unique availability, quality, and latency demands of the Google Play Store.
In the domain of consumer-facing features, users' opinions and choices, the developer ecosystem, and demand often change faster than infrastructure can. In such an environment, the biggest challenge engineers face is how to be nimble and design infrastructure that's not only future-proof but also meets the needs of the consumer space within the constraints of scalability and performance. Let's take a deeper look at some engineering challenges in such a dynamic space.
In a data-driven organization such as the Play Store, metrics are built for measuring anything and everything of importance, along dimensions that come in handy when measuring and tracking success.
Designing backend systems that scale to the requirements of the Play Store while also meeting the performance criteria required to make user interactions feel fluid and responsive is paramount. From an engineering perspective, infrastructure needs to continuously evolve to meet the needs of the business. The Play Store is no different: the store infrastructure has evolved several times in the last decade, not only to support the needs of new features available to users today, but also to modernize, eliminate tech debt, and, most of all, reduce latency.
Optimizing for time to market, i.e., getting the feature to the user and measuring how it impacts app installs and other store business metrics using A/B experiments, is of prime importance. Iterating fast based on data helps tune the final feature to the desired end state. Google has several homegrown technologies for running A/B experiments at worldwide scale, with seamless integration with metric-presentation tools that make running these experiments smooth and easy, so developers can spend more time coding and less time in analysis.
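Google's internal experiment tooling is not public, but the core statistical idea behind such an A/B readout can be illustrated with a generic two-proportion z-test on install rates. This is a toy sketch, not the Play team's actual analysis pipeline.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(installs_a: int, views_a: int,
                     installs_b: int, views_b: int) -> float:
    """Two-sided p-value for a difference in install rate between arms A and B."""
    p_a, p_b = installs_a / views_a, installs_b / views_b
    pooled = (installs_a + installs_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Toy numbers: arm B's store feature lifts installs from 5.0% to 5.3%.
p_value = two_proportion_z(5_000, 100_000, 5_300, 100_000)
print(f"p-value = {p_value:.4f}")  # a small p-value suggests a real lift
```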
Deciding what to build, whether it meets Google quality standards, understanding engineering costs, and the user needs it solves are all important questions that need to be answered before designing anything. Feature engineering is therefore often done in close collaboration with product managers. Aligning on the perfect MVP, one that can be built in a reasonable amount of engineering time and that meets the user journey, is the key to a successful product.
Frequent iterations and a fast MVP development culture often come with their own set of cons, the biggest being tech debt. In optimizing for velocity, cutting corners results in obsolete code (due to unlaunchable metrics) or discarded experiment flags. These make testing and maintenance harder and hurt future development velocity if left unfixed. Additionally, using the latest and greatest frameworks, whether to shave off the last milliseconds of latency or to make development easier, yields great dividends in the long run. Frequently modernizing the infrastructure, either via refactoring or full rewrites, may traditionally be a sign of poorly designed code, but it's one of the bigger tradeoffs that feature engineers often have to make. After all, what use is all the fancy infrastructure if users don't interact with the feature in the first place?