Tag Archives: User Experience

Top 3 updates for building excellent, adaptive apps at Google I/O ‘25

Posted by Mozart Louis – Developer Relations Engineer

Today, Android is launching a few updates across the platform! This includes the start of Android 16's rollout, with details for both developers and users, a Developer Preview for enhanced Android desktop experiences with connected displays, and updates for Android users across Google apps and more, plus the June Pixel Drop. We're also recapping all the Google I/O updates for Android developers focused on building excellent, adaptive Android apps.

Google I/O 2025 brought exciting advancements to Android, equipping you with essential knowledge and powerful tools you need to build outstanding, user-friendly applications that stand out.

If you missed any of the key #GoogleIO25 updates and just saw the release of Android 16 or you're ready to dive into building excellent adaptive apps, our playlist is for you. Learn how to craft engaging experiences with Live Updates in Android 16, capture video effortlessly with CameraX, process it efficiently using Media3's editing tools, and engage users across diverse platforms like XR, Android for Cars, Android TV, and Desktop.

Check out the Google I/O playlist for all the session details.

Here are three key announcements directly influencing how you can craft deeply engaging experiences and truly connect with your users:

#1: Build adaptively to unlock 500 million devices

In today's diverse device ecosystem, users expect their favorite applications to function seamlessly across various form factors, including phones, tablets, Chromebooks, automobiles, and emerging XR glasses and headsets. Our recommended approach for developing applications that excel on each of these surfaces is to create a single, adaptive application. This strategy avoids the need to rebuild the application for every screen size, shape, or input method, ensuring a consistent and high-quality user experience across all devices.

The talk emphasizes that you don't need to rebuild apps for each form factor. Instead, small, iterative changes can unlock an app's potential.

Here are some resources we encourage you to use in your apps:

New feature support in Jetpack Compose Adaptive Libraries

We’re continuing to make it as easy as possible to build adaptively with Jetpack Compose Adaptive Libraries. with new features in 1.1 like pane expansion and predictive back. By utilizing canonical layout patterns such as List Detail or Supporting Pane layouts and integrating your app code, your application will automatically adjust and reflow when resized.

Navigation 3

The alpha release of the Navigation 3 library now supports displaying multiple panes. This eliminates the need to alter your navigation destination setup for separate list and detail views. Instead, you can adjust the setup to concurrently render multiple destinations when sufficient screen space is available.

Updates to Window Manager Library

AndroidX.window 1.5 introduces two new window size classes for expanded widths, facilitating better layout adaptation for large tablets and desktops. A width of 1600dp or more is now categorized as "extra large," while widths between 1200dp and 1600dp are classified as "large." These subdivisions offer more granularity for developers to optimize their applications for a wider range of window sizes.

Support all orientations and be resizable

In Android 16 important changes are coming, affecting orientation, aspect ratio, and resizability. Apps targeting SDK 36 will need to support all orientations and be resizable.

Extend to Android XR

We are making it easier for you to build for XR with the Android XR SDK in developer preview 2, which features new Material XR components, a fully integrated Emulator withinAndroid Studio and spatial videos support for your Play Store listings.

Upgrade your Wear OS apps to Material 3 Design

Wear OS 6 features Material 3 Expressive, a new UI design with personalized visuals and motion for user creativity, coming to Wear, Android, and Google apps later this year. You can upgrade your app and Tiles to Material 3 Expressive by utilizing new Jetpack libraries: Wear Compose Material 3, which provides components for apps and Wear ProtoLayout Material 3 which provides components and layouts for tiles.

You should build a single, adaptive mobile app that brings the best experiences to all Android surfaces. By building adaptive apps, you meet users where they are today and in the future, enhancing user engagement and app discoverability. This approach represents a strategic business decision that optimizes an app’s long-term success.

#2: Enhance your app’s performance optimization

Get ready to take your app's performance to the next level! Google I/O 2025, brought an inside look at cutting-edge tools and techniques to boost user satisfaction, enhance technical performance metrics, and drive those all-important key performance indicators. Imagine an end-to-end workflow that streamlines performance optimization.

Redesigned UiAutomator API

To make benchmarking reliable and reproducible, there's the brand new UiAutomator API. Write robust test code and run it on your local devices or in Firebase Test Lab, ensuring consistent results every time.

Macrobenchmarks

Once your tests are in place, it's time to measure and understand. Macrobenchmarks give you the hard data, while App Startup Insights provide actionable recommendations for improvement. Plus, you can get a quick snapshot of your app's health with the App Performance Score via DAC. These tools combined give you a comprehensive view of your app's performance and where to focus your efforts.

R8, More than code shrinking and obfuscation

You might know R8 as a code shrinking tool, but it's capable of so much more! The talk dives into R8's capabilities using the "Androidify" sample app. You'll see how to apply R8, troubleshoot any issues (like crashes!), and configure it for optimal performance. It'll also be shown how library developers can include "consumer Keep rules" so that their important code is not touched when used in an application.

#3: Build Richer Image and Video Experiences

In today's digital landscape, users increasingly expect seamless content creation capabilities within their apps. To meet this demand, developers require robust tools for building excellent camera and media experiences.

Media3Effects in CameraX Preview

At Google I/O, developers delve into practical strategies for capturing high-quality video using CameraX, while simultaneously leveraging the Media3Effects on the preview.

Google Low-Light Boost

Google Low Light Boost in Google Play services enables real-time dynamic camera brightness adjustment in low light, even without device support for Low Light Boost AE Mode.

New Camera & Media Samples!

For Google I/O 2025, The Camera & Media team created new samples and demos for building excellent media and camera experiences on Android. It emphasizes future-proofing apps using Jetpack libraries like Media3 Transformer for advanced video editing and Compose for adaptive UIs, including XR. Get more information about incrementally adding premium features with CameraX, utilizing Media3 for AI-powered functionalities such as video summarization and HDR thumbnails, and employing specialized APIs like Oboe for efficient audio playback. We have also updated CameraX samples to fully use Compose instead of the View based system.

Learn more about how CameraX & Media3 can accelerate your development of camera and media related features.

Learn how to build adaptive apps

Want to learn more about building excellent, adaptive apps? Watch this playlist to learn more about all the session details.

Source: Android Developers Blog

Peacock built adaptively on Android to deliver great experiences across screens

Posted by Sa-ryong Kang and Miguel Montemayor - Developer Relations Engineers

Peacock is NBCUniversal’s streaming service app available in the US, offering culture-defining entertainment including live sports, exclusive original content, TV shows, and blockbuster movies. The app continues to evolve, becoming more than just a platform to watch content, but a hub of entertainment.

Today’s users are consuming entertainment on an increasingly wider array of device sizes and types, and in particular are moving towards mobile devices. Peacock has adopted Jetpack Compose to help with its journey in adapting to more screens and meeting users where they are.

Disclaimer: Peacock is available in the US only. This video will only be viewable to US viewers.

Adapting to more flexible form factors

The Peacock development team is focused on bringing the best experience to users, no matter what device they’re using or when they want to consume content. With an emerging trend from app users to watch more on mobile devices and large screens like foldables, the Peacock app needs to be able to adapt to different screen sizes. As more devices are introduced, the team needed to explore new solutions that make the most out of each unique display permutation.

The goal was to have the Peacock app to adapt to these new displays while continually offering high-quality entertainment without interruptions, like the stream reloading or visual errors. While thinking ahead, they also wanted to prepare and build a solution that was ready for Android XR as the entertainment landscape is shifting towards including more immersive experiences.

Building a future-proof experience with Jetpack Compose

In order to build a scalable solution that would help the Peacock app continue to evolve, the app was migrated to Jetpack Compose, Android’s toolkit for building scalable UI. One of the essential tools they used was the WindowSizeClass API, which helps developers create and test UI layouts for different size ranges. This API then allows the app to seamlessly switch between pre-set layouts as it reaches established viewport breakpoints for different window sizes.

The API was used in conjunction with Kotlin Coroutines and Flows to keep the UI state responsive as the window size changed. To test their work and fine tune edge case devices, Peacock used the Android Studio emulator to simulate a wide range of Android-based devices.

Jetpack Compose allowed the team to build adaptively, so now the Peacock app responds to a wide variety of screens while offering a seamless experience to Android users. “The app feels more native, more fluid, and more intuitive across all form factors,” said Diego Valente, Head of Mobile, Peacock and Global Streaming. “That means users can start watching on a smaller screen and continue instantly on a larger one when they unfold the device—no reloads, no friction. It just works.”

Preparing for immersive entertainment experiences

In building adaptive apps on Android, John Jelley, Senior Vice President, Product & UX, Peacock and Global Streaming, says Peacock has also laid the groundwork to quickly adapt to the Android XR platform: “Android XR builds on the same large screen principles, our investment here naturally extends to those emerging experiences with less developmental work.”

The team is excited about the prospect of features unlocked by Android XR, like Multiview for sports and TV, which enables users to watch multiple games or camera angles at once. By tailoring spatial windows to the user’s environment, the app could offer new ways for users to interact with contextual metadata like sports stats or actor information—all without ever interrupting their experience.

Build adaptive apps

Learn how to unlock your app's full potential on phones, tablets, foldables, and beyond.

Explore this announcement and all Google I/O 2025 updates on io.google starting May 22.

Source: Android Developers Blog

Performance Class helps Google Maps deliver premium experiences

Posted by Nevin Mital - Developer Relations Engineer, Android Media

The Android ecosystem features a diverse range of devices, and it can be difficult to build experiences that take advantage of new or premium hardware features while still working well for users on all devices. With Android 12, we introduced the Media Performance Class (MPC) standard to help developers better understand a device’s capabilities and identify high-performing devices. For a refresher on what MPC is, please see our last blog post, Using performance class to optimize your user experience, or check out the Performance Class documentation.

Earlier this year, we published the first stable release of the Jetpack Core Performance library as the recommended solution for more reliably obtaining a device’s MPC level. In particular, this library introduces the PlayServicesDevicePerformance class, an API that queries Google Play Services to get the most up-to-date MPC level for the current device and build. I’ll get into the technical details further down, but let’s start by taking a look at how Google Maps was able to tailor a feature launch to best fit each device with MPC.

Performance Class unblocks premium experience launch for Google Maps

Google Maps recently took advantage of the expanded device coverage enabled by the Play Services module to unblock a feature launch. Google Maps wanted to update their UI by increasing the transparency of some layers. Consequently, this meant they would need to render more of the map, and found they had to stop the rollout due to latency increases on many devices, especially towards the low-end. To resolve this, the Maps team started by slicing an existing key metric, “seconds to UI item visibility”, by MPC level, which revealed that while all devices had a small increase in this latency, devices without an MPC level had the largest increase.

A bar graph displays A/B test results for Seconds to UI item visibility, comparing control results with those using increased transparency across different Media Performance Class Levels. A green horizontal line and text indicate the updated experience shipped to devices qualifying for MPC. A vertical green dotted line separates results for devices without a specific MPC level, which kept the previous UI.

With these results in hand, Google Maps started their rollout again, but this time only launching the feature on devices that report an MPC level. As devices continue to get updated and meet the bar for MPC, the updated Google Maps UI will be available to them as well.

The new Play Services module

MPC level requirements are defined in the Android Compatibility Definition Document (CDD), then devices and Android builds are validated against these requirements by the Android Compatibility Test Suite (CTS). The Play Services module of the Jetpack Core Performance library leverages these test results to continually update a device’s reported MPC level without any additional effort on your end. This also means that you’ll immediately have access to the MPC level for new device launches without needing to acquire and test each device yourself, since it already passed CTS. If the MPC level is not available from Google Play Services, the library will fall back to the MPC level declared by the OEM as a build constant.

A flowchart depicts the process of determining Performance Class levels for Android devices, involving manufacturers, CTS tests, a Grader, the Play Services module, and the CDD.

As of writing, more than 190M in-market devices covering over 500 models across 40+ brands report an MPC level. This coverage will continue to grow over time, as older devices update to newer builds, from Android 11 and up.

Using the Core Performance library

To use Jetpack Core Performance, start by adding a dependency for the relevant modules in your Gradle configuration, and create an instance of DevicePerformance. Initializing a DevicePerformance should only happen once in your app, as early as possible - for example, in the onCreate() lifecycle event of your Application. In this example, we’ll use the Google Play services implementation of DevicePerformance.

// Implementation of Jetpack Core library.
implementation("androidx.core:core-ktx:1.12.0")
// Enable APIs to query for device-reported performance class.
implementation("androidx.core:core-performance:1.0.0")
// Enable APIs to query Google Play Services for performance class.
implementation("androidx.core:core-performance-play-services:1.0.0")

import androidx.core.performance.play.services.PlayServicesDevicePerformance

class MyApplication : Application() {
  lateinit var devicePerformance: DevicePerformance

  override fun onCreate() {
    // Use a class derived from the DevicePerformance interface
    devicePerformance = PlayServicesDevicePerformance(applicationContext)
  }
}

Then, later in your app when you want to retrieve the device’s MPC level, you can call getMediaPerformanceClass():

class MyActivity : Activity() {
  private lateinit var devicePerformance: DevicePerformance
  override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    // Note: Good app architecture is to use a dependency framework. See
    // https://developer.android.com/training/dependency-injection for more
    // information.
    devicePerformance = (application as MyApplication).devicePerformance
  }

  override fun onResume() {
    super.onResume()
    when {
      devicePerformance.mediaPerformanceClass >= Build.VERSION_CODES.UPSIDE_DOWN_CAKE -> {
        // MPC level 34 and later.
        // Provide the most premium experience for the highest performing devices.
      }
      devicePerformance.mediaPerformanceClass == Build.VERSION_CODES.TIRAMISU -> {
        // MPC level 33.
        // Provide a high quality experience.
      }
      else -> {
        // MPC level 31, 30, or undefined.
        // Remove extras to keep experience functional.
      }
    }
  }
}

Strategies for using Performance Class

MPC is intended to identify high-end devices, so you can expect to see MPC levels for the top devices from each year, which are the devices you’re likely to want to be able to support for the longest time. For example, the Pixel 9 Pro released with Android 14 and reports an MPC level of 34, the latest level definition when it launched.

You should use MPC as a complement to any existing Device Clustering solutions you already use, such as querying a device’s static specs or manually blocklisting problematic devices. An area where MPC can be a particularly helpful tool is for new device launches. New devices should be included at launch, so you can use MPC to gauge new devices’ capabilities right from the start, without needing to acquire the hardware yourself or manually test each device.

A great first step to get involved is to include MPC levels in your telemetry. This can help you identify patterns in error reports or generally get a better sense of the devices your user base uses if you segment key metrics by MPC level. From there, you might consider using MPC as a dimension in your experimentation pipeline, for example by setting up A/B testing groups based on MPC level, or by starting a feature rollout with the highest MPC level and working your way down. As discussed previously, this is the approach that Google Maps took.

You could further use MPC to tune a user-facing feature, for example by adjusting the number of concurrent video playbacks your app attempts based on the MPC level’s concurrent codec guarantees. However, make sure to still query a device’s runtime capabilities when using this approach, as they may differ depending on the environment and state the device is in.

Get in touch!

If MPC sounds like it could be useful for your app, please give it a try! You can get started by taking a look at our sample code or documentation. We welcome you to share any questions or feedback you have in this short form.

This blog post is a part of Camera and Media Spotlight Week. We're providing resources – blog posts, videos, sample code, and more – all designed to help you uplevel the media experiences in your app.

To learn more about what Spotlight Week has to offer and how it can benefit you, be sure to read our overview blog post.

Source: Android Developers Blog

Reddit improved app startup speed by over 50% using Baseline Profiles and R8

Posted by Ben Weiss – Developer Relations Engineer, and Lauren Darcey – Senior Engineering Manager, Reddit

Reddit is one of the world’s largest internet forums, bringing together countless communities looking for entertainment, answers to everyday questions, and so much more.

Recently, the team optimized its Android app to reduce startup speed and improve rendering performance using Baseline Profiles. But the team didn’t stop there. Reddit app developers also enabled Android’s R8 compiler in full mode to maximize bytecode optimization and used Jetpack Compose to rewrite legacy UI, improving both the user and developer experience.

Maximizing optimization using Baseline Profiles and R8 full mode

The Reddit Android app has undergone countless performance upgrades over the years. Reddit developers have long since cleared the list of quick and easy tasks for optimization, but the team still wants to improve the app, bringing its performance to the next level and ensuring it runs well on every Android device.

“Reddit is looking for any strategic improvement to its app performance so we can make the app experience better for new and existing users,” said Rob McWhinnie, a staff engineer at Reddit. “Baseline Profiles fit this use case well since they are based on critical user journeys.”

Reddit’s platform engineering team used screen-specific performance metrics and observability to help its feature teams improve key metrics like time to interactive and scroll performance. Baseline Profiles were a natural fit to help improve these metrics and the user experience behind them, so the team integrated them to make tracking and optimizing easier, using insights from geodata and device classes.

The team built Baseline Profiles for five critical user journeys so far, like scrolling the home feed, logging in, launching the full-screen video player, navigating between subreddits and scrolling their feeds, and using the chat feature.

Simplifying Baseline Profile management in their continuous integration processes, enabled Reddit to remove the need for manual maintenance and streamlining optimization. Now, Baseline Profiles are automatically regenerated for each release.

Enabling Android’s R8 optimization compiler in full mode was another area Reddit engineers worked on. The team had already used R8 in compatibility mode, but some of Reddit’s legacy code would’ve made implementing R8’s more aggressive features difficult. The team worked through the app’s existing technical debt first, making it easier to integrate R8's full mode capabilities and maximize Android app optimization.

Quote card with image of Catherine Chi, Senior Engineer at Reddit that reads: 'It’s now trivial to work with a team to instrument Baseline Profiles for their critical user journeys. We turn them around in a couple of hours and see results in production a week later.

Improvements with Baseline Profiles and R8 full mode

Reddit's Baseline Profiles and R8 full mode optimization led to multiple performance improvements across the app, with early benchmarks of the first Baseline Profile for feeds showing a 51% median startup time improvement. While responses from Redditors initially confirmed large startup improvements, Baseline Profile optimizations for less frequent journeys, like logging in, saw fewer user reports.

Baseline Profiles for the home feed had a 36% reduction in frozen frames' 95th percentile. Baseline Profiles for the community feed also delivered strong screen load and scroll performance improvements. At the 90th percentile, screen Time To Interactive improved by 12% and time to first draw decreased by 22%. Reddit’s scrolling performance also saw a 12% reduction in P90 slow frames.

The upgrade to R8 full mode led to an increase in Google Play average ratings. The proportion of global positive ratings (fours and fives) increased by four percent, with a notable decrease in negative reports. R8 full mode also reduced total application-not-responding errors by almost 30%.

Overall, the app saw cold start improvements of 20%, scroll performance improvements of 15%, and widespread enhancements in lower-end devices and emerging markets. Google Play vitals saw improvements in slow cold starts, a 10% reduction in excessive frozen frames, and a 30% reduction in excessive slow frames. Nearly 75% of screens, refactored using Jetpack Compose, experienced performance gains.

Quote card with image of Lauren Darcey, Senior Engineering Manager at Reddit that reads: 'When you find a feature that users love and engage with, taking the time to refine and optimize it can be the difference between a good and a great experience for your users.

Further optimizations using Jetpack Compose

Reddit adopted Jetpack Compose years ago and has since rebuilt much of its UI with the toolkit, benefitting both the app and its design system. According to the Reddit team, Google’s ongoing support for Compose’s stability and performance made it a strong fit as Reddit scaled its app, allowing for more efficient feature development and better performance.

One major example is Reddit’s feed rewrite using Compose, which resulted in more maintainable code and an improved developer experience. Compose enabled teams to focus on future work instead of being bogged down by legacy code, allowing them to fix bugs quickly and improve overall app stability.

“The R8 and Compose upgrades were important to deploy in relative isolation and stabilize,” said Drew Heavner, a staff engineer at Reddit. “We feel like we got great outcomes from this work for all teams adopting our modern tech stack and Compose.”

After upgrading to the September 2024 release of Compose, the latest iteration, Reddit saw significant performance gains across the board. Cold start times improved by 13%, excessive slow frames decreased by 25%, and frozen frames dropped by 10%. Low- and mid-tier devices saw even greater improvements where app start times improved by up to 40%, especially in markets with lower-performing devices.

Screens using Reddit’s modern Compose-powered design stack showed substantial improvements in both slow and frozen frame rates. For example, the home feed saw a 23% reduction in frozen frames, and scrolling performance visibly improved according to internal reviews. These updates were well received among users and reflected a 17% increase in the app’s Google Play average rating.

Quote card with image of the Android Bot peeking in from the right side that reads: Compose continues to deliver great new features for a more responsive user experience. It also provides stability and performance improvements we get to take advantage of.” — Eric Kuck, a Principal Engineer at Reddit

Up-leveling UX through optimization

Adding value to an app isn’t just about introducing new features—it's about refining and optimizing the ones users already love. Investing in performance improvements made Reddit’s key features faster and more reliable, enhancing the overall user experience. These optimizations not only improved app startup and runtime performance but also simplified development workflows, increasing both developer satisfaction and app stability.

The focus on high-traffic features, such as feeds, has demonstrated the power of performance tuning, with substantial gains in user engagement and satisfaction. As the app has become more efficient, both users and developers have benefitted from a cleaner codebase and faster performance.

Looking ahead, Reddit plans to extend the usage of Baseline Profiles to other critical user journeys, including Reddit’s post and comment experiences, ensuring even more users benefit from these ongoing performance improvements.

Reddit’s platform engineers also want to continue collaborating with feature teams to integrate performance improvements across the app. These efforts will ensure that as the app evolves, it remains a smooth, fast, and engaging experience for all Redditors.

“Adding new features isn’t the only way to add value to an experience for users,” said Lauren Darcey, a senior engineering manager at Reddit. “When you find a feature that users love and engage with, taking the time to refine and optimize it can be the difference between a good and a great experience for your users.”

Get started

Improve your app performance using Baseline Profiles, R8 full mode, and Jetpack Compose.

Source: Android Developers Blog

Introducing Restore Credentials: Effortless account restoration for Android apps

Posted by Neelansh Sahai - Developer Relations Engineer

Did you know that, on average, 40% of the people in the US reset or replace their smartphones every year? This frequent device turnover presents a challenge – and an opportunity – for maintaining strong user relationships. When users get a new phone, the friction of re-entering login credentials can lead to frustration, app abandonment, and churn.

To address this issue, we're introducing Restore Credentials, a new feature of Android’s Credential Manager API. With Restore Credentials, apps can seamlessly onboard users to their accounts on a new device after they restore their apps and data from their previous device. This makes the transition to a new device effortless and fosters loyalty and long term relationships.

On top of all this, there's no developer effort required for the transfer of a restore key from one device to the other, as this process is tied together with the android system’s backup and restore mechanism. However, if you want to login your users silently as soon as the restore is completed, you might want to implement BackupAgent and add your logic in the onRestore callback. The experience is delightful - users will continue being signed in as they were on their previous device, and they will be able to get notifications to easily access their content without even needing to open the app on the new device.

An illustration the process of restoring app data and keys to a new device, highlighting automated steps and user interactions. The top row shows a user signing into an app and a restore key being saved locally, while the bottom row shows the restore process on a new device.

click to enlarge

Some of the benefits of the Restore Credentials feature include:

Seamless user experience: Users can easily transition to a new Android device.

Immediate engagement: Engage users with notifications or other prompts as soon as they start using their new device.

Silent login with backup agent integration: If you're using a backup agent, users can be automatically logged back in after data restoration is complete.

Restore key checks without backup agent integration: If a backup agent isn't being used, the app can check for a restore key upon first launch and then log the user in automatically.

Easy implementation: Leverages the same server-side implementation used for passkeys.

How does Restore Credentials work?

The Restore Credentials feature enables seamless user account restoration on a new device. This process occurs automatically in the background during device setup when a user restores apps and data from a previous device. By restoring app credentials, the feature allows the app to sign the user back in without requiring any additional interaction.

The credential type that’s supported for this feature is called restore key, which is a public key compatible with passkey / FIDO2 backends.

A diagram shows the device-to-device and cloud backup restore processes for app data and restore keys between old and new devices. Steps are numbered and explained within the diagram.

Diagram that depicts restoring an app data to a new device using a restore credential, including creating the credential, initiating a restore flow, and automatic user sign-in.

User flow

On the old device:

If the current signed-in user is trusted, you can generate a restore key at any point after they've authenticated in your app. For instance, this could be immediately after login or during a routine check for an existing restore key.
The restore key is stored locally and backed up to the cloud. Apps can opt-out of backing it up to the cloud.

On the new device:

When setting up a new device, the user can select one of the two options to restore data. Either they can restore data from a cloud backup, or can locally transfer the data. If the user transfers locally, the restore key is transferred locally from the old to the new device. Otherwise, if the user restores using the cloud backup, the restore key gets downloaded along with the app data from cloud backup to the new device.
Once this restore key is available on the new device, the app can use it to log in the user on the new device silently in the background.

Note: You should delete the restore key as soon as the user signs out. You don’t want your user to get stuck in a cycle of signing out intentionally and then automatically getting logged back in.

How to implement Restore Credentials

Using the Jetpack Credential Manager let you create, get, and clear the relevant Restore Credentials:

Create a Restore Credential: When the user signs in to your app, create a Restore Credential associated with their account. This credential is stored locally and synced to the cloud if the user has enabled Google Backup and end to end encryption is available. Apps can opt out of syncing to the cloud.

Get the Restore Credential: When the user sets up a new device, your app requests the Restore Credential from Credential Manager. This allows your user to sign in automatically.

Clear the Restore Credential: When the user signs out of your app, delete the associated Restore Credential.

Restore Credentials is available through the Credential Manager Jetpack library. The minimum version of the Jetpack Library is 1.5.0-beta01, and the minimum GMS version is 242200000. For more on this, refer to the Restore Credentials DAC page. To get started, follow these steps:

1. Add the Credential Manager dependency to your project.

// build.gradle.kts
implementation("androidx.credentials:credentials:1.5.0-beta01")

2. Create a CreateRestoreCredentialRequest object.

// Fetch Registration JSON from server
// Same as the registrationJson created at the time of creating a Passkey
// See documentation for more info
val registrationJson = ... 

// Create the CreateRestoreCredentialRequest object
// Pass in the registrationJSON 
val createRequest = CreateRestoreCredentialRequest(
  registrationJson,
  /* isCloudBackupEnabled = */ true
)

NOTE: Set the isCloudBackupEnabled flag to false if you want the restoreKey to be stored locally and not in the cloud. It’s set as true by default

3. Call the createCredential() method on the CredentialManager object.

val credentialManager = CredentialManager.create(context)

// On a successful authentication create a Restore Key
// Pass in the context and CreateRestoreCredentialRequest object
val response = credentialManager.createCredential(
    context,
    createRestoreRequest
)

4. When the user sets up a new device, call the getCredential() method on the CredentialManager object.

// Fetch the Authentication JSON from server
val authenticationJson = ...

// Create the GetRestoreCredentialRequest object
val options = GetRestoreCredentialOption(authenticationJson)
val getRequest = GetCredentialRequest(Immutablelist.of(options))

// The restore key can be fetched in two scenarios to 
// 1. On the first launch of app on the device, fetch the Restore Key
// 2. In the onRestore callback (if the app implements the Backup Agent)
val response = credentialManager.getCredential(context, getRequest)

If you're using a backup agent, perform the getCredential part within the onRestore callback. This ensures that the app's credentials are restored immediately after the app data is restored.

5. When the user signs out of your app, call the clearCredentialState() method on the CredentialManager object.

// Create a ClearCredentialStateRequest object
val clearRequest = ClearCredentialStateRequest(/* requestType = */ 1)

// On user log-out, clear the restore key
val response = credentialManager.clearCredentialState(clearRequest)

Conclusion

The Restore Credentials feature provides significant benefits, ensuring users experience a smooth transition between devices, and allowing them to log in quickly and easily through backup agents or restore key checks. For developers, the feature is straightforward to integrate and leverages existing passkey server-side infrastructure. Overall, Restore Credentials is a valuable tool that delivers a practical and user-friendly authentication solution.

This blog post is a part of our series: Spotlight Week: Passkeys. We're providing you with a wealth of resources through the week. Think informative blog posts, engaging videos, practical sample code, and more—all carefully designed to help you leverage the latest advancements in seamless sign-up and sign-in experiences.

With these cutting-edge solutions, you can enhance security, reduce friction for your users, and stay ahead of the curve in the rapidly evolving landscape of digital identity. To get a complete overview of what Spotlight Week has to offer and how it can benefit you, be sure to read our overview blog post.

Source: Android Developers Blog

Computer-aided diagnosis for lung cancer screening

Posted by Atilla Kiraly, Software Engineer, and Rory Pilgrim, Product Manager, Google Research

Lung cancer is the leading cause of cancer-related deaths globally with 1.8 million deaths reported in 2020. Late diagnosis dramatically reduces the chances of survival. Lung cancer screening via computed tomography (CT), which provides a detailed 3D image of the lungs, has been shown to reduce mortality in high-risk populations by at least 20% by detecting potential signs of cancers earlier. In the US, screening involves annual scans, with some countries or cases recommending more or less frequent scans.

The United States Preventive Services Task Force recently expanded lung cancer screening recommendations by roughly 80%, which is expected to increase screening access for women and racial and ethnic minority groups. However, false positives (i.e., incorrectly reporting a potential cancer in a cancer-free patient) can cause anxiety and lead to unnecessary procedures for patients while increasing costs for the healthcare system. Moreover, efficiency in screening a large number of individuals can be challenging depending on healthcare infrastructure and radiologist availability.

At Google we have previously developed machine learning (ML) models for lung cancer detection, and have evaluated their ability to automatically detect and classify regions that show signs of potential cancer. Performance has been shown to be comparable to that of specialists in detecting possible cancer. While they have achieved high performance, effectively communicating findings in realistic environments is necessary to realize their full potential.

To that end, in “Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the US and Japan”, published in Radiology AI, we investigate how ML models can effectively communicate findings to radiologists. We also introduce a generalizable user-centric interface to help radiologists leverage such models for lung cancer screening. The system takes CT imaging as input and outputs a cancer suspicion rating using four categories (no suspicion, probably benign, suspicious, highly suspicious) along with the corresponding regions of interest. We evaluate the system’s utility in improving clinician performance through randomized reader studies in both the US and Japan, using the local cancer scoring systems (Lung-RADSs V1.1 and Sendai Score) and image viewers that mimic realistic settings. We found that reader specificity increases with model assistance in both reader studies. To accelerate progress in conducting similar studies with ML models, we have open-sourced code to process CT images and generate images compatible with the picture archiving and communication system (PACS) used by radiologists.

Developing an interface to communicate model results

Integrating ML models into radiologist workflows involves understanding the nuances and goals of their tasks to meaningfully support them. In the case of lung cancer screening, hospitals follow various country-specific guidelines that are regularly updated. For example, in the US, Lung-RADs V1.1 assigns an alpha-numeric score to indicate the lung cancer risk and follow-up recommendations. When assessing patients, radiologists load the CT in their workstation to read the case, find lung nodules or lesions, and apply set guidelines to determine follow-up decisions.

Our first step was to improve the previously developed ML models through additional training data and architectural improvements, including self-attention. Then, instead of targeting specific guidelines, we experimented with a complementary way of communicating AI results independent of guidelines or their particular versions. Specifically, the system output offers a suspicion rating and localization (regions of interest) for the user to consider in conjunction with their own specific guidelines. The interface produces output images directly associated with the CT study, requiring no changes to the user’s workstation. The radiologist only needs to review a small set of additional images. There is no other change to their system or interaction with the system.

Example of the assistive lung cancer screening system outputs. Results for the radiologist’s evaluation are visualized on the location of the CT volume where the suspicious lesion is found. The overall suspicion is displayed at the top of the CT images. Circles highlight the suspicious lesions while squares show a rendering of the same lesion from a different perspective, called a sagittal view.

The assistive lung cancer screening system comprises 13 models and has a high-level architecture similar to the end-to-end system used in prior work. The models coordinate with each other to first segment the lungs, obtain an overall assessment, locate three suspicious regions, then use the information to assign a suspicion rating to each region. The system was deployed on Google Cloud using a Google Kubernetes Engine (GKE) that pulled the images, ran the ML models, and provided results. This allows scalability and directly connects to servers where the images are stored in DICOM stores.

Outline of the Google Cloud deployment of the assistive lung cancer screening system and the directional calling flow for the individual components that serve the images and compute results. Images are served to the viewer and to the system using Google Cloud services. The system is run on a Google Kubernetes Engine that pulls the images, processes them, and writes them back into the DICOM store.

Reader studies

To evaluate the system’s utility in improving clinical performance, we conducted two reader studies (i.e., experiments designed to assess clinical performance comparing expert performance with and without the aid of a technology) with 12 radiologists using pre-existing, de-identified CT scans. We presented 627 challenging cases to 6 US-based and 6 Japan-based radiologists. In the experimental setup, readers were divided into two groups that read each case twice, with and without assistance from the model. Readers were asked to apply scoring guidelines they typically use in their clinical practice and report their overall suspicion of cancer for each case. We then compared the results of the reader’s responses to measure the impact of the model on their workflow and decisions. The score and suspicion level were judged against the actual cancer outcomes of the individuals to measure sensitivity, specificity, and area under the ROC curve (AUC) values. These were compared with and without assistance.

A multi-case multi-reader study involves each case being reviewed by each reader twice, once with ML system assistance and once without. In this visualization one reader first reviews Set A without assistance (blue) and then with assistance (orange) after a wash-out period. A second reader group follows the opposite path by reading the same set of cases Set A with assistance first. Readers are randomized to these groups to remove the effect of ordering.

The ability to conduct these studies using the same interface highlights its generalizability to completely different cancer scoring systems, and the generalization of the model and assistive capability to different patient populations. Our study results demonstrated that when radiologists used the system in their clinical evaluation, they had an increased ability to correctly identify lung images without actionable lung cancer findings (i.e., specificity) by an absolute 5–7% compared to when they didn’t use the assistive system. This potentially means that for every 15–20 patients screened, one may be able to avoid unnecessary follow-up procedures, thus reducing their anxiety and the burden on the health care system. This can, in turn, help improve the sustainability of lung cancer screening programs, particularly as more people become eligible for screening.

Reader specificity increases with ML model assistance in both the US-based and Japan-based reader studies. Specificity values were derived from reader scores from actionable findings (something suspicious was found) versus no actionable findings, compared against the true cancer outcome of the individual. Under model assistance, readers flagged fewer cancer-negative individuals for follow-up visits. Sensitivity for cancer positive individuals remained the same.

Translating this into real-world impact through partnership

The system results demonstrate the potential for fewer follow-up visits, reduced anxiety, as well lower overall costs for lung cancer screening. In an effort to translate this research into real-world clinical impact, we are working with: DeepHealth, a leading AI-powered health informatics provider; and Apollo Radiology International a leading provider of Radiology services in India to explore paths for incorporating this system into future products. In addition, we are looking to help other researchers studying how best to integrate ML model results into clinical workflows by open sourcing code used for the reader study and incorporating the insights described in this blog. We hope that this will help accelerate medical imaging researchers looking to conduct reader studies for their AI models, and catalyze translational research in the field.

Acknowledgements

Key contributors to this project include Corbin Cunningham, Zaid Nabulsi, Ryan Najafi, Jie Yang, Charles Lau, Joseph R. Ledsam, Wenxing Ye, Diego Ardila, Scott M. McKinney, Rory Pilgrim, Hiroaki Saito, Yasuteru Shimamura, Mozziyar Etemadi, Yun Liu, David Melnick, Sunny Jansen, Nadia Harhen, David P. Nadich, Mikhail Fomitchev, Ziyad Helali, Shabir Adeel, Greg S. Corrado, Lily Peng, Daniel Tse, Shravya Shetty, Shruthi Prabhakara, Neeral Beladia, and Krish Eswaran. Thanks to Arnav Agharwal and Andrew Sellergren for their open sourcing support and Vivek Natarajan and Michael D. Howell for their feedback. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study, and Jonny Wong and Carli Sampson for coordinating the reader studies.

Source: Google AI Blog

Resolving code review comments with ML

Posted by Alexander Frömmgen, Staff Software Engineer, and Lera Kharatyan, Senior Software Engineer, Core Systems & Experiences

Code-change reviews are a critical part of the software development process at scale, taking a significant amount of the code authors’ and the code reviewers’ time. As part of this process, the reviewer inspects the proposed code and asks the author for code changes through comments written in natural language. At Google, we see millions of reviewer comments per year, and authors require an average of ~60 minutes active shepherding time between sending changes for review and finally submitting the change. In our measurements, the required active work time that the code author must do to address reviewer comments grows almost linearly with the number of comments. However, with machine learning (ML), we have an opportunity to automate and streamline the code review process, e.g., by proposing code changes based on a comment’s text.

Today, we describe applying recent advances of large sequence models in a real-world setting to automatically resolve code review comments in the day-to-day development workflow at Google (publication forthcoming). As of today, code-change authors at Google address a substantial amount of reviewer comments by applying an ML-suggested edit. We expect that to reduce time spent on code reviews by hundreds of thousands of hours annually at Google scale. Unsolicited, very positive feedback highlights that the impact of ML-suggested code edits increases Googlers' productivity and allows them to focus on more creative and complex tasks.

Predicting the code edit

We started by training a model that predicts code edits needed to address reviewer comments. The model is pre-trained on various coding tasks and related developer activities (e.g., renaming a variable, repairing a broken build, editing a file). It’s then fine-tuned for this specific task with reviewed code changes, the reviewer comments, and the edits the author performed to address those comments.

An example of an ML-suggested edit of refactorings that are spread within the code.

Google uses a monorepo, a single repository for all of its software artifacts, which allows our training dataset to include all unrestricted code used to build Google's most recent software, as well as previous versions.

To improve the model quality, we iterated on the training dataset. For example, we compared the model performance for datasets with a single reviewer comment per file to datasets with multiple comments per file, and experimented with classifiers to clean up the training data based on a small, curated dataset to choose the model with the best offline precision and recall metrics.

Serving infrastructure and user experience

We designed and implemented the feature on top of the trained model, focusing on the overall user experience and developer efficiency. As part of this, we explored different user experience (UX) alternatives through a series of user studies. We then refined the feature based on insights from an internal beta (i.e., a test of the feature in development) including user feedback (e.g., a “Was this helpful?” button next to the suggested edit).

The final model was calibrated for a target precision of 50%. That is, we tuned the model and the suggestions filtering, so that 50% of suggested edits on our evaluation dataset are correct. In general, increasing the target precision reduces the number of shown suggested edits, and decreasing the target precision leads to more incorrect suggested edits. Incorrect suggested edits take the developers time and reduce the developers’ trust in the feature. We found that a target precision of 50% provides a good balance.

At a high level, for every new reviewer comment, we generate the model input in the same format that is used for training, query the model, and generate the suggested code edit. If the model is confident in the prediction and a few additional heuristics are satisfied, we send the suggested edit to downstream systems. The downstream systems, i.e., the code review frontend and the integrated development environment (IDE), expose the suggested edits to the user and log user interactions, such as preview and apply events. A dedicated pipeline collects these logs and generates aggregate insights, e.g., the overall acceptance rates as reported in this blog post.

Architecture of the ML-suggested edits infrastructure. We process code and infrastructure from multiple services, get the model predictions and surface the predictions in the code review tool and IDE.

The developer interacts with the ML-suggested edits in the code review tool and the IDE. Based on insights from the user studies, the integration into the code review tool is most suitable for a streamlined review experience. The IDE integration provides additional functionality and supports 3-way merging of the ML-suggested edits (left in the figure below) in case of conflicting local changes on top of the reviewed code state (right) into the merge result (center).

3-way-merge UX in IDE.

Results

Offline evaluations indicate that the model addresses 52% of comments with a target precision of 50%. The online metrics of the beta and the full internal launch confirm these offline metrics, i.e., we see model suggestions above our target model confidence for around 50% of all relevant reviewer comments. 40% to 50% of all previewed suggested edits are applied by code authors.

We used the “not helpful” feedback during the beta to identify recurring failure patterns of the model. We implemented serving-time heuristics to filter these and, thus, reduce the number of shown incorrect predictions. With these changes, we traded quantity for quality and observed an increased real-world acceptance rate.

Code review tool UX. The suggestion is shown as part of the comment and can be previewed, applied and rated as helpful or not helpful.

Our beta launch showed a discoverability challenge: code authors only previewed ~20% of all generated suggested edits. We modified the UX and introduced a prominent “Show ML-edit” button (see the figure above) next to the reviewer comment, leading to an overall preview rate of ~40% at launch. We additionally found that suggested edits in the code review tool are often not applicable due to conflicting changes that the author did during the review process. We addressed this with a button in the code review tool that opens the IDE in a merge view for the suggested edit. We now observe that more than 70% of these are applied in the code review tool and fewer than 30% are applied in the IDE. All these changes allowed us to increase the overall fraction of reviewer comments that are addressed with an ML-suggested edit by a factor of 2 from beta to the full internal launch. At Google scale, these results help automate the resolution of hundreds of thousands of comments each year.

Suggestions filtering funnel.

We see ML-suggested edits addressing a wide range of reviewer comments in production. This includes simple localized refactorings and refactorings that are spread within the code, as shown in the examples throughout the blog post above. The feature addresses longer and less formally-worded comments that require code generation, refactorings and imports.

Example of a suggestion for a longer and less formally worded comment that requires code generation, refactorings and imports.

The model can also respond to complex comments and produce extensive code edits (shown below). The generated test case follows the existing unit test pattern, while changing the details as described in the comment. Additionally, the edit suggests a comprehensive name for the test reflecting the test semantics.

Example of the model's ability to respond to complex comments and produce extensive code edits.

Conclusion and future work

In this post, we introduced an ML-assistance feature to reduce the time spent on code review related changes. At the moment, a substantial amount of all actionable code review comments on supported languages are addressed with applied ML-suggested edits at Google. A 12-week A/B experiment across all Google developers will further measure the impact of the feature on the overall developer productivity.

We are working on improvements throughout the whole stack. This includes increasing the quality and recall of the model and building a more streamlined experience for the developer with improved discoverability throughout the review process. As part of this, we are investigating the option of showing suggested edits to the reviewer while they draft comments and expanding the feature into the IDE to enable code-change authors to get suggested code edits for natural-language commands.

Acknowledgements

This is the work of many people in Google Core Systems & Experiences team, Google Research, and DeepMind. We'd like to specifically thank Peter Choy for bringing the collaboration together, and all of our team members for their key contributions and useful advice, including Marcus Revaj, Gabriela Surita, Maxim Tabachnyk, Jacob Austin, Nimesh Ghelani, Dan Zheng, Peter Josling, Mariana Stariolo, Chris Gorgolewski, Sascha Varkevisser, Katja Grünwedel, Alberto Elizondo, Tobias Welp, Paige Bailey, Pierre-Antoine Manzagol, Pascal Lamblin, Chenjie Gu, Petros Maniatis, Henryk Michalewski, Sara Wiltberger, Ambar Murillo, Satish Chandra, Madhura Dudhgaonkar, Niranjan Tulpule, Zoubin Ghahramani, Juanjo Carin, Danny Tarlow, Kevin Villela, Stoyan Nikolov, David Tattersall, Boris Bokowski, Kathy Nix, Mehdi Ghissassi, Luis C. Cobo, Yujia Li, David Choi, Kristóf Molnár, Vahid Meimand, Amit Patel, Brett Wiltshire, Laurent Le Brun, Mingpan Guo, Hermann Loose, Jonas Mattes, Savinee Dancs.

Source: Google AI Blog

How Project Starline improves remote communication

Posted by Greg Blascovich and Eric Gomez, User Researchers, Google

As companies settle into a new normal of hybrid and distributed work, remote communication technology remains critical for connecting and collaborating with colleagues. While this technology has improved, the core user experience often falls short: conversation can feel stilted, attention can be difficult to maintain, and usage can be fatiguing.

Project Starline renders people at natural scale on a 3D display and enables natural eye contact.

At Google I/O 2021 we announced Project Starline, a technology project that combines advances in hardware and software to create a remote communication experience that feels like you’re together, even when you’re thousands of miles apart. This perception of co-presence is created by representing users in 3D at natural scale, enabling eye contact, and providing spatially accurate audio. But to what extent do these technological innovations translate to meaningful, observable improvement in user value compared to traditional video conferencing?

In this blog we share results from a number of studies across a variety of methodologies, finding converging evidence that Project Starline outperforms traditional video conferencing in terms of conversation dynamics, video meeting fatigue, and attentiveness. Some of these results were previously published while others we are sharing for the first time as preliminary findings.

Improved conversation dynamics

In our qualitative studies, users often describe conversations in Project Starline as “more natural.” However, when asked to elaborate, many have difficulty articulating this concept in a way that fully captures their experience. Because human communication relies partly on unconscious processes like nonverbal behavior, people might have a hard time reflecting on these processes that are potentially impacted by experiencing a novel technology. To address this challenge, we conducted a series of behavioral lab experiments to shed light on what “more natural” might mean for Project Starline. These experiments employed within-subjects designs in which participants experienced multiple conditions (e.g., meeting in Project Starline vs. traditional videoconferencing) in randomized order. This allowed us to control for between-subject differences by comparing how the same individual responded to a variety of conditions, thus increasing statistical power and reducing the sample size necessary to detect statistical differences (sample sizes in our behavioral experiments range from ~ 20 to 30).

In one study, preliminary data suggest Project Starline improves conversation dynamics by increasing rates of turn-taking. We recruited pairs of participants who had never met each other to have unstructured conversations in both Project Starline and traditional video conferencing. We analyzed the audio from each conversation and found that Project Starline facilitated significantly more dynamic "back and forth" conversations compared to traditional video conferencing. Specifically, participants averaged about 2-3 more speaker hand-offs in Project Starline conversations compared to those in traditional video conferencing across a two minute subsample of their conversation (a uniform selection at the end of each conversation to help standardize for interpersonal rapport). Participants also rated their Starline conversations as significantly more natural (“smooth,” “easy,” “not awkward”), higher in quality, and easier to recognize when it was their turn to speak compared to conversations using traditional video conferencing.

In another study, participants had conversations with a confederate in both Project Starline and traditional video conferencing. We recorded these conversations to analyze select nonverbal behaviors. In Project Starline, participants were more animated, using significantly more hand gestures (+43%), head nods (+26%), and eyebrow movements (+49%). Participants also reported a significantly better ability to perceive and convey nonverbal cues in Project Starline than in traditional video conferencing. Together with the turn-taking results, these data help explain why conversations in Project Starline may feel more natural.

We recorded participants to quantify their nonverbal behaviors and found that they were more animated in Project Starline (left) compared to traditional video conferencing (right).

Reduced video meeting fatigue

A well-documented challenge of video conferencing, especially within the workplace, is video meeting fatigue. The causes of video meeting fatigue are complex, but one possibility is that video communication is cognitively taxing because it becomes more difficult to convey and interpret nonverbal behavior. Considering previous findings that suggested Project Starline might improve nonverbal communication, we examined whether video meeting fatigue might also be improved (i.e., reduced) compared to traditional video conferencing.

Our study found preliminary evidence that Project Starline indeed reduces video meeting fatigue. Participants held 30-minute mock meetings in Project Starline and traditional video conferencing. Meeting content was standardized across participants using an exercise adapted from academic literature that emulates key elements of a work meeting, such as brainstorming and persuasion. We then measured video meeting fatigue via the Zoom Exhaustion and Fatigue (ZEF) Scale. Additionally, we measured participants' reaction times on a complex cognitive task originally used in cognitive psychology. We repurposed this task as a proxy for video meeting fatigue based on the assumption that more fatigue would lead to slower reaction times. Participants reported significantly less video meeting fatigue on the ZEF Scale (-31%) and had faster reaction times (-12%) on the cognitive task after using Project Starline compared to traditional video conferencing.

Increased attentiveness

Another challenge with video conferencing is focusing attention on the meeting at hand, rather than on other browser windows or secondary devices.

In our earlier study on nonverbal behavior, we included an exploratory information-retention task. We asked participants to write as much as they could remember about each conversation (one in Project Starline, and one in traditional video conferencing). We found that participants wrote 28% more in this task (by character count) after their conversation in Project Starline. This could be because they paid closer attention when in Project Starline, or possibly that they found conversations in Project Starline to be more engaging.

To explore the concept of attentiveness further, we conducted a study in which participants wore eye-tracking glasses. This allowed us to calculate the percentage of time participants spent focusing on their conversation partner’s face, an important source of social information in human interaction. Participants had a conversation with a confederate in Project Starline, traditional video conferencing, and in person. We found that participants spent a significantly higher proportion of time looking at their conversation partner's face in Project Starline (+14%) than they did in traditional video conferencing. In fact, visual attentiveness in Project Starline mirrored that of the in-person condition: participants spent roughly the same proportion of time focusing on their meeting partner’s face in the Project Starline and in-person conditions.

The use of eye-tracking glasses and facial detection software allowed us to quantify participants' gaze patterns. The video above illustrates how a hypothetical participant's eye tracking data (red dot) correspond to their meeting partner's face (white box).

User value in real meetings

The lab-based, experimental approach used in the studies above allows for causal inference while minimizing confounding variables. However, one limitation of these studies is that they are low in external validity — that is, they took place in a lab environment, and the extent to which their results extend to the real world is unclear. Thus, we studied actual users within Google who used Project Starline for their day-to-day work meetings and collected their feedback.

An internal pilot revealed that users derive meaningful value from using Project Starline. We used post-meeting surveys to capture immediate feedback on individual meetings, longer monthly surveys to capture holistic feedback on the experience, and conducted in-depth qualitative interviews with a subset of users. We evaluated Project Starline on concepts such as presence, nonverbal behavior, attentiveness, and personal connection. We found strong evidence that Project Starline delivered across these four metrics, with over 87% of participants expressing that their meetings in Project Starline were better than their previous experiences with traditional video conferencing.

Conclusion

Together, these findings offer a compelling case for Project Starline's value to users: improved conversation dynamics, reduced video meeting fatigue, and increased attentiveness. Participants expressed that Project Starline was a significant improvement over traditional video conferencing in highly controlled lab experiments, as well as when they used Project Starline for their actual work meetings. We’re excited to see these findings converge across multiple methodologies (surveys, qualitative interviews, experiments) and measurements (self-report, behavioral, qualitative), and we’re eager to continue exploring the implications of Project Starline on human interaction.

Acknowledgments

We’d like to thank Melba Tellez, Eric Baczuk, Jinghua Zhang, Matthew DuVall, and Travis Miller for contributing to visual assets and illustrations.

Source: Google AI Blog

ML-Enhanced Code Completion Improves Developer Productivity

Posted by Maxim Tabachnyk, Staff Software Engineer and Stoyan Nikolov, Senior Engineering Manager, Google Research

The increasing complexity of code poses a key challenge to productivity in software engineering. Code completion has been an essential tool that has helped mitigate this complexity in integrated development environments (IDEs). Conventionally, code completion suggestions are implemented with rule-based semantic engines (SEs), which typically have access to the full repository and understand its semantic structure. Recent research has demonstrated that large language models (e.g., Codex and PaLM) enable longer and more complex code suggestions, and as a result, useful products have emerged (e.g., Copilot). However, the question of how code completion powered by machine learning (ML) impacts developer productivity, beyond perceived productivity and accepted suggestions, remains open.

Today we describe how we combined ML and SE to develop a novel Transformer-based hybrid semantic ML code completion, now available to internal Google developers. We discuss how ML and SEs can be combined by (1) re-ranking SE single token suggestions using ML, (2) applying single and multi-line completions using ML and checking for correctness with the SE, or (3) using single and multi-line continuation by ML of single token semantic suggestions. We compare the hybrid semantic ML code completion of 10k+ Googlers (over three months across eight programming languages) to a control group and see a 6% reduction in coding iteration time (time between builds and tests) and a 7% reduction in context switches (i.e., leaving the IDE) when exposed to single-line ML completion. These results demonstrate that the combination of ML and SEs can improve developer productivity. Currently, 3% of new code (measured in characters) is now generated from accepting ML completion suggestions.

Transformers for Completion
A common approach to code completion is to train transformer models, which use a self-attention mechanism for language understanding, to enable code understanding and completion predictions. We treat code similar to language, represented with sub-word tokens and a SentencePiece vocabulary, and use encoder-decoder transformer models running on TPUs to make completion predictions. The input is the code that is surrounding the cursor (~1000-2000 tokens) and the output is a set of suggestions to complete the current or multiple lines. Sequences are generated with a beam search (or tree exploration) on the decoder.

During training on Google’s monorepo, we mask out the remainder of a line and some follow-up lines, to mimic code that is being actively developed. We train a single model on eight languages (C++, Java, Python, Go, Typescript, Proto, Kotlin, and Dart) and observe improved or equal performance across all languages, removing the need for dedicated models. Moreover, we find that a model size of ~0.5B parameters gives a good tradeoff for high prediction accuracy with low latency and resource cost. The model strongly benefits from the quality of the monorepo, which is enforced by guidelines and reviews. For multi-line suggestions, we iteratively apply the single-line model with learned thresholds for deciding whether to start predicting completions for the following line.

Encoder-decoder transformer models are used to predict the remainder of the line or lines of code.

Re-rank Single Token Suggestions with ML
While a user is typing in the IDE, code completions are interactively requested from the ML model and the SE simultaneously in the backend. The SE typically only predicts a single token. The ML models we use predict multiple tokens until the end of the line, but we only consider the first token to match predictions from the SE. We identify the top three ML suggestions that are also contained in the SE suggestions and boost their rank to the top. The re-ranked results are then shown as suggestions for the user in the IDE.

In practice, our SEs are running in the cloud, providing language services (e.g., semantic completion, diagnostics, etc.) with which developers are familiar, and so we collocated the SEs to run on the same locations as the TPUs performing ML inference. The SEs are based on an internal library that offers compiler-like features with low latencies. Due to the design setup, where requests are done in parallel and ML is typically faster to serve (~40 ms median), we do not add any latency to completions. We observe a significant quality improvement in real usage. For 28% of accepted completions, the rank of the completion is higher due to boosting, and in 0.4% of cases it is worse. Additionally, we find that users type >10% fewer characters before accepting a completion suggestion.

Check Single / Multi-line ML Completions for Semantic Correctness
At inference time, ML models are typically unaware of code outside of their input window, and code seen during training might miss recent additions needed for completions in actively changing repositories. This leads to a common drawback of ML-powered code completion whereby the model may suggest code that looks correct, but doesn’t compile. Based on internal user experience research, this issue can lead to the erosion of user trust over time while reducing productivity gains.

We use SEs to perform fast semantic correctness checks within a given latency budget (<100ms for end-to-end completion) and use cached abstract syntax trees to enable a “full” structural understanding. Typical semantic checks include reference resolution (i.e., does this object exist), method invocation checks (e.g., confirming the method was called with a correct number of parameters), and assignability checks (to confirm the type is as expected).

For example, for the coding language Go, ~8% of suggestions contain compilation errors before semantic checks. However, the application of semantic checks filtered out 80% of uncompilable suggestions. The acceptance rate for single-line completions improved by 1.9x over the first six weeks of incorporating the feature, presumably due to increased user trust. As a comparison, for languages where we did not add semantic checking, we only saw a 1.3x increase in acceptance.

Language servers with access to source code and the ML backend are collocated on the cloud. They both perform semantic checking of ML completion suggestions.

Results
With 10k+ Google-internal developers using the completion setup in their IDE, we measured a user acceptance rate of 25-34%. We determined that the transformer-based hybrid semantic ML code completion completes >3% of code, while reducing the coding iteration time for Googlers by 6% (at a 90% confidence level). The size of the shift corresponds to typical effects observed for transformational features (e.g., key framework) that typically affect only a subpopulation, whereas ML has the potential to generalize for most major languages and engineers.

Fraction of all code added by ML	2.6%
Reduction in coding iteration duration	6%
Reduction in number of context switches	7%
Acceptance rate (for suggestions visible for >750ms)	25%
Average characters per accept	21


Key metrics for single-line code completion measured in production for 10k+ Google-internal developers using it in their daily development across eight languages.

Fraction of all code added by ML (with >1 line in suggestion)	0.6%
Average characters per accept	73
Acceptance rate (for suggestions visible for >750ms)	34%


Key metrics for multi-line code completion measured in production for 5k+ Google-internal developers using it in their daily development across eight languages.

Providing Long Completions while Exploring APIs
We also tightly integrated the semantic completion with full line completion. When the dropdown with semantic single token completions appears, we display inline the single-line completions returned from the ML model. The latter represent a continuation of the item that is the focus of the dropdown. For example, if a user looks at possible methods of an API, the inline full line completions show the full method invocation also containing all parameters of the invocation.

Integrated full line completions by ML continuing the semantic dropdown completion that is in focus.

Suggestions of multiple line completions by ML.

Conclusion and Future Work
We demonstrate how the combination of rule-based semantic engines and large language models can be used to significantly improve developer productivity with better code completion. As a next step, we want to utilize SEs further, by providing extra information to ML models at inference time. One example can be for long predictions to go back and forth between the ML and the SE, where the SE iteratively checks correctness and offers all possible continuations to the ML model. When adding new features powered by ML, we want to be mindful to go beyond just “smart” results, but ensure a positive impact on productivity.

Acknowledgements
This research is the outcome of a two-year collaboration between Google Core and Google Research, Brain Team. Special thanks to Marc Rasi, Yurun Shen, Vlad Pchelin, Charles Sutton, Varun Godbole, Jacob Austin, Danny Tarlow, Benjamin Lee, Satish Chandra, Ksenia Korovina, Stanislav Pyatykh, Cristopher Claeys, Petros Maniatis, Evgeny Gryaznov, Pavel Sychev, Chris Gorgolewski, Kristof Molnar, Alberto Elizondo, Ambar Murillo, Dominik Schulz, David Tattersall, Rishabh Singh, Manzil Zaheer, Ted Ying, Juanjo Carin, Alexander Froemmgen and Marcus Revaj for their contributions.

Source: Google AI Blog

Contextual Rephrasing in Google Assistant

Posted by Aurelien Boffy, Senior Staff Software Engineer, and Roberto Pieraccini, Engineering Director, Google Assistant

When people converse with one another, context and references play a critical role in driving their conversation more efficiently. For instance, if one asks the question “Who wrote Romeo and Juliet?” and, after receiving an answer, asks “Where was he born?”, it is clear that ‘he’ is referring to William Shakespeare without the need to explicitly mention him. Or if someone mentions “python” in a sentence, one can use the context from the conversation to determine whether they are referring to a type of snake or a computer language. If a virtual assistant cannot robustly handle context and references, users would be required to adapt to the limitation of the technology by repeating previously shared contextual information in their follow-up queries to ensure that the assistant understands their requests and can provide relevant answers.

In this post, we present a technology currently deployed on Google Assistant that allows users to speak in a natural manner when referencing context that was defined in previous queries and answers. The technology, based on the latest machine learning (ML) advances, rephrases a user’s follow-up query to explicitly mention the missing contextual information, thus enabling it to be answered as a stand-alone query. While Assistant considers many types of context for interpreting the user input, in this post we are focusing on short-term conversation history.

Context Handling by Rephrasing
One of the approaches taken by Assistant to understand contextual queries is to detect if an input utterance is referring to previous context and then rephrase it internally to explicitly include the missing information. Following on from the previous example in which the user asked who wrote Romeo and Juliet, one may ask follow-up questions like “When?”. Assistant recognizes that this question is referring to both the subject (Romeo and Juliet) and answer from the previous query (William Shakespeare) and can rephrase “When?” to “When did William Shakespeare write Romeo and Juliet?”

While there are other ways to handle context, for instance, by applying rules directly to symbolic representations of the meaning of queries, like intents and arguments, the advantage of the rephrasing approach is that it operates horizontally at the string level across any query answering, parsing, or action fulfillment module.

Conversation on a smart display device, where Assistant understands multiple contextual follow-up queries, allowing the user to have a more natural conversation. The phrases appearing at the bottom of the display are suggestions for follow-up questions that the user can select. However, the user can still ask different questions.

A Wide Variety of Contextual Queries
The natural language processing field, traditionally, has not put much emphasis on a general approach to context, focusing on the understanding of stand-alone queries that are fully specified. Accurately incorporating context is a challenging problem, especially when considering the large variety of contextual query types. The table below contains example conversations that illustrate query variability and some of the many contextual challenges that Assistant’s rephrasing method can resolve (e.g., differentiating between referential and non-referential cases or identifying what context a query is referencing). We demonstrate how Assistant is now able to rephrase follow-up queries, adding contextual information before providing an answer.

System Architecture
At a high level, the rephrasing system generates rephrasing candidates by using different types of candidate generators. Each rephrasing candidate is then scored based on a number of signals, and the one with the highest score is selected.

High level architecture of Google Assistant contextual rephraser.

Candidate Generation
To generate rephrasing candidates we use a hybrid approach that applies different techniques, which we classify into three categories:

Generators based on the analysis of the linguistic structure of the queries use grammatical and morphological rules to perform specific operations — for instance, the replacement of pronouns or other types of referential phrases with antecedents from the context.
Generators based on query statistics combine key terms from the current query and its context to create candidates that match popular queries from historical data or common query patterns.
Generators based on Transformer technologies, such as MUM, learn to generate sequences of words according to a number of training samples. LaserTagger and FELIX are technologies suitable for tasks with high overlap between the input and output texts, are very fast at inference time, and are not vulnerable to hallucination (i.e., generating text that is not related to the input texts). Once presented with a query and its context, they can generate a sequence of text edits to transform the input queries into a rephrasing candidate by indicating which portions of the context should be preserved and which words should be modified.

Candidate Scoring
We extract a number of signals for each rephrasing candidate and use an ML model to select the most promising candidate. Some of the signals depend only on the current query and its context. For example, is the topic of the current query similar to the topic of the previous query? Or, is the current query a good stand-alone query or does it look incomplete? Other signals depend on the candidate itself: How much of the information of the context does the candidate preserve? Is the candidate well-formed from a linguistic point of view? Etc.

Recently, new signals generated by BERT and MUM models have significantly improved the performance of the ranker, fixing about one-third of the recall headroom while minimizing false positives on query sequences that are not contextual (and therefore do not require a rephrasing).

Example conversation on a phone where Assistant understands a sequence of contextual queries.

Conclusion
The solution described here attempts to resolve contextual queries by rephrasing them in order to make them fully answerable in a stand-alone manner, i.e., without having to relate to other information during the fulfillment phase. The benefit of this approach is that it is agnostic to the mechanisms that would fulfill the query, thus making it usable as a horizontal layer to be deployed before any further processing.

Given the variety of contexts naturally used in human languages, we adopted a hybrid approach that combines linguistic rules, large amounts of historic data through logs, and ML models based on state-of-the-art Transformer approaches. By generating a number of rephrasing candidates for each query and its context, and then scoring and ranking them using a variety of signals, Assistant can rephrase and thus correctly interpret most contextual queries. As Assistant can handle most types of linguistic references, we are empowering users to have more natural conversations. To make such multi-turn conversations even less cumbersome, Assistant users can turn on Continued Conversation mode to enable asking follow-up queries without the need to repeat "Hey Google" between each query. We are also using this technology in other virtual assistant settings, for instance, interpreting context from something shown on a screen or playing on a speaker.

Acknowledgements
This post reflects the combined work of Aliaksei Severyn, André Farias, Cheng-Chun Lee, Florian Thöle, Gabriel Carvajal, Gyorgy Gyepesi, Julien Cretin, Liana Marinescu, Martin Bölle, Patrick Siegler, Sebastian Krause, Victor Ähdel, Victoria Fossum, Vincent Zhao. We also thank Amar Subramanya, Dave Orr, Yury Pinsky for helpful discussions and support.