
ScreenAI: A visual language model for UI and visually-situated language understanding

Screen user interfaces (UIs) and infographics, such as charts, diagrams and tables, play important roles in human communication and human-machine interaction as they facilitate rich and interactive user experiences. UIs and infographics share similar design principles and visual language (e.g., icons and layouts), that offer an opportunity to build a single model that can understand, reason, and interact with these interfaces. However, because of their complexity and varied presentation formats, infographics and UIs present a unique modeling challenge.

To that end, we introduce “ScreenAI: A Vision-Language Model for UI and Infographics Understanding”. ScreenAI improves upon the PaLI architecture with the flexible patching strategy from pix2struct. We train ScreenAI on a unique mixture of datasets and tasks, including a novel Screen Annotation task that requires the model to identify UI element information (i.e., type, location and description) on a screen. These text annotations provide large language models (LLMs) with screen descriptions, enabling them to automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. At only 5B parameters, ScreenAI achieves state-of-the-art results on UI- and infographic-based tasks (WebSRC and MoTIF), and best-in-class performance on Chart QA, DocVQA, and InfographicVQA compared to models of similar size. We are also releasing three new datasets: Screen Annotation to evaluate the layout understanding capability of the model, as well as ScreenQA Short and Complex ScreenQA for a more comprehensive evaluation of its QA capability.


ScreenAI

ScreenAI’s architecture is based on PaLI, composed of a multimodal encoder block and an autoregressive decoder. The PaLI encoder uses a vision transformer (ViT) that creates image embeddings and a multimodal encoder that takes the concatenation of the image and text embeddings as input. This flexible architecture allows ScreenAI to solve vision tasks that can be recast as text+image-to-text problems.

On top of the PaLI architecture, we employ a flexible patching strategy introduced in pix2struct. Instead of using a fixed-grid pattern, the grid dimensions are selected such that they preserve the native aspect ratio of the input image. This enables ScreenAI to work well across images of various aspect ratios.
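
To make the patching idea concrete, here is a minimal sketch of selecting grid dimensions that respect a patch budget while preserving the native aspect ratio. This is only an illustration, not the actual ScreenAI or pix2struct implementation, and the patch size and budget values are arbitrary assumptions.

import kotlin.math.floor
import kotlin.math.max
import kotlin.math.sqrt

// Hypothetical sketch of aspect-ratio-preserving patching: pick the largest
// grid whose rows * cols fits the patch budget while keeping the image's
// native width/height ratio, instead of forcing a fixed square grid.
fun flexibleGrid(
    imageWidth: Int,
    imageHeight: Int,
    patchSize: Int = 16,     // illustrative values, not the model's actual settings
    maxPatches: Int = 1024
): Pair<Int, Int> {
    val scale = sqrt(
        maxPatches.toDouble() * patchSize * patchSize /
            (imageWidth.toDouble() * imageHeight)
    )
    val rows = max(1, floor(scale * imageHeight / patchSize).toInt())
    val cols = max(1, floor(scale * imageWidth / patchSize).toInt())
    return rows to cols
}

fun main() {
    // A tall mobile screenshot and a wide desktop screenshot end up with
    // different grids, but both keep their native aspect ratio.
    println(flexibleGrid(1080, 2400)) // tall portrait screen -> more rows than columns
    println(flexibleGrid(2560, 1440)) // wide landscape screen -> more columns than rows
}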

The ScreenAI model is trained in two stages: a pre-training stage followed by a fine-tuning stage. First, self-supervised learning is applied to automatically generate data labels, which are then used to train ViT and the language model. ViT is frozen during the fine-tuning stage, where most data used is manually labeled by human raters.

ScreenAI model architecture.


Data generation

To create a pre-training dataset for ScreenAI, we first compile an extensive collection of screenshots from various devices, including desktops, mobile, and tablets. This is achieved by using publicly accessible web pages and following the programmatic exploration approach used for the RICO dataset for mobile apps. We then apply a layout annotator, based on the DETR model, that identifies and labels a wide range of UI elements (e.g., image, pictogram, button, text) and their spatial relationships. Pictograms undergo further analysis using an icon classifier capable of distinguishing 77 different icon types. This detailed classification is essential for interpreting the subtle information conveyed through icons. For icons that are not covered by the classifier, and for infographics and images, we use the PaLI image captioning model to generate descriptive captions that provide contextual information. We also apply an optical character recognition (OCR) engine to extract and annotate textual content on screen. We combine the OCR text with the previous annotations to create a detailed description of each screen.

A mobile app screenshot with generated annotations that include UI elements and their descriptions, e.g., TEXT elements also contain the text content from OCR, IMAGE elements contain image captions, LIST_ITEMs contain all their child elements.
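
As a rough illustration of how such annotations can be merged into a single textual screen description, the sketch below combines layout, OCR, icon, and caption information into one schema string. All class names and fields here are hypothetical, not the pipeline's actual representation.

data class BoundingBox(val left: Int, val top: Int, val right: Int, val bottom: Int)

data class UiElement(
    val type: String,              // e.g., "TEXT", "IMAGE", "PICTOGRAM", "BUTTON", "LIST_ITEM"
    val box: BoundingBox,
    val text: String? = null,      // OCR text for TEXT elements
    val caption: String? = null,   // generated caption for images and infographics
    val iconClass: String? = null, // icon classifier label for pictograms
    val children: List<UiElement> = emptyList()
)

// Depth-first serialization of the annotated tree into a flat textual schema
// that an LLM prompt can consume.
fun UiElement.toSchema(indent: String = ""): String = buildString {
    val detail = listOfNotNull(text, caption, iconClass).joinToString(" ")
    append("$indent$type ${box.left},${box.top},${box.right},${box.bottom} $detail".trimEnd())
    children.forEach { append("\n" + it.toSchema(indent + "  ")) }
}

fun main() {
    val listItem = UiElement(
        type = "LIST_ITEM",
        box = BoundingBox(0, 120, 1080, 300),
        children = listOf(
            UiElement("PICTOGRAM", BoundingBox(16, 140, 96, 220), iconClass = "search"),
            UiElement("TEXT", BoundingBox(112, 150, 900, 210), text = "Nearby restaurants")
        )
    )
    println(listItem.toSchema())
}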


LLM-based data generation

We enhance the pre-training data's diversity using PaLM 2 to generate input-output pairs in a two-step process. First, screen annotations are generated using the technique outlined above, then we craft a prompt around this schema for the LLM to create synthetic data. This process requires prompt engineering and iterative refinement to find an effective prompt. We assess the generated data's quality through human validation against a quality threshold.


You only speak JSON. Do not write text that isn’t JSON.
You are given the following mobile screenshot, described in words. Can you generate 5 questions regarding the content of the screenshot as well as the corresponding short answers to them? 

The answer should be as short as possible, containing only the necessary information. Your answer should be structured as follows:
questions: [
{{question: the question,
    answer: the answer
}},
 ...
]

{THE SCREEN SCHEMA}

A sample prompt for QA data generation.

By combining the natural language capabilities of LLMs with a structured schema, we simulate a wide range of user interactions and scenarios to generate synthetic, realistic tasks. In particular, we generate three categories of tasks:

  • Question answering: The model is asked to answer questions regarding the content of the screenshots, e.g., “When does the restaurant open?”
  • Screen navigation: The model is asked to convert a natural language utterance into an executable action on a screen, e.g., “Click the search button.”
  • Screen summarization: The model is asked to summarize the screen content in one or two sentences.
Block diagram of our workflow for generating data for QA, summarization and navigation tasks using existing ScreenAI models and LLMs. Each task uses a custom prompt to emphasize desired aspects, like questions related to counting, involving reasoning, etc.

LLM-generated data. Examples for screen QA, navigation and summarization. For navigation, the action bounding box is displayed in red on the screenshot.


Experiments and results

As previously mentioned, ScreenAI is trained in two stages: pre-training and fine-tuning. Pre-training data labels are obtained using self-supervised learning, while fine-tuning data labels come from human raters.

We fine-tune ScreenAI using public QA, summarization, and navigation datasets and a variety of tasks related to UIs. For QA, we use well-established benchmarks in the multimodal and document understanding field, such as ChartQA, DocVQA, Multipage DocVQA, InfographicVQA, OCR-VQA, WebSRC and ScreenQA. For navigation, datasets used include Referring Expressions, MoTIF, Mug, and Android in the Wild. Finally, we use Screen2Words for screen summarization and Widget Captioning for describing specific UI elements. Along with the fine-tuning datasets, we evaluate the fine-tuned ScreenAI model using three novel benchmarks:

  1. Screen Annotation: Enables the evaluation of the model's layout annotation and spatial understanding capabilities.
  2. ScreenQA Short: A variation of ScreenQA, where its ground truth answers have been shortened to contain only the relevant information that better aligns with other QA tasks.
  3. Complex ScreenQA: Complements ScreenQA Short with more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios.

The fine-tuned ScreenAI model achieves state-of-the-art results on various UI and infographic-based tasks (WebSRC and MoTIF) and best-in-class performance on Chart QA, DocVQA, and InfographicVQA compared to models of similar size. ScreenAI achieves competitive performance on Screen2Words and OCR-VQA. Additionally, we report results on the new benchmark datasets introduced to serve as a baseline for further research.

Comparing model performance of ScreenAI with state-of-the-art (SOTA) models of similar size.

Next, we examine ScreenAI’s scaling capabilities and observe that, across all tasks, increasing the model size improves performance, and the improvements have not saturated at the largest size.

Model performance increases with size, and the performance has not saturated even at the largest size of 5B params.


Conclusion

We introduce the ScreenAI model along with a unified representation that enables us to develop self-supervised learning tasks leveraging data from all these domains. We also illustrate the impact of data generation using LLMs and investigate improving model performance on specific aspects by modifying the training mixture. We apply all of these techniques to build multi-task trained models that perform competitively with state-of-the-art approaches on a number of public benchmarks. However, we also note that our approach still lags behind large models and further research is needed to bridge this gap.


Acknowledgements

This project is the result of joint work with Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen and Abhanshu Sharma. We thank Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li, Yang Li, Radu Soricut, and Tania Bedrax-Weiss for their insightful feedback and discussions, along with Rahul Aralikatte, Hao Cheng and Daniel Kim for their support in data preparation. We also thank Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their leadership, vision and support. We are very grateful to Tom Small for helping us create the animation in this post.

Source: Google AI Blog


Enabling conversational interaction on mobile with LLMs

Intelligent assistants on mobile devices have significantly advanced language-based interactions for performing simple daily tasks, such as setting a timer or turning on a flashlight. Despite the progress, these assistants still face limitations in supporting conversational interactions in mobile user interfaces (UIs), where many user tasks are performed. For example, they cannot answer a user's question about specific information displayed on a screen. An agent would need to have a computational understanding of graphical user interfaces (GUIs) to achieve such capabilities.

Prior research has investigated several important technical building blocks to enable conversational interaction with mobile UIs, including summarizing a mobile screen for users to quickly understand its purpose, mapping language instructions to UI actions and modeling GUIs so that they are more amenable for language-based interaction. However, each of these only addresses a limited aspect of conversational interaction and requires considerable effort in curating large-scale datasets and training dedicated models. Furthermore, there is a broad spectrum of conversational interactions that can occur on mobile UIs. Therefore, it is imperative to develop a lightweight and generalizable approach to realize conversational interaction.

In “Enabling Conversational Interaction with Mobile UI using Large Language Models”, presented at CHI 2023, we investigate the viability of utilizing large language models (LLMs) to enable diverse language-based interactions with mobile UIs. Recent pre-trained LLMs, such as PaLM, have demonstrated abilities to adapt themselves to various downstream language tasks when being prompted with a handful of examples of the target task. We present a set of prompting techniques that enable interaction designers and developers to quickly prototype and test novel language interactions with users, which saves time and resources before investing in dedicated datasets and models. Since LLMs only take text tokens as input, we contribute a novel algorithm that generates the text representation of mobile UIs. Our results show that this approach achieves competitive performance using only two data examples per task. More broadly, we demonstrate LLMs’ potential to fundamentally transform the future workflow of conversational interaction design.

Animation showing our work on enabling various conversational interactions with mobile UI using LLMs.


Prompting LLMs with UIs

LLMs support in-context few-shot learning via prompting — instead of fine-tuning or re-training models for each new task, one can prompt an LLM with a few input and output data exemplars from the target task. For many natural language processing tasks, such as question-answering or translation, few-shot prompting performs competitively with benchmark approaches that train a model specific to each task. However, language models can only take text input, while mobile UIs are multimodal, containing text, image, and structural information in their view hierarchy data (i.e., the structural data containing detailed properties of UI elements) and screenshots. Moreover, directly inputting the view hierarchy data of a mobile screen into LLMs is not feasible as it contains excessive information, such as detailed properties of each UI element, which can exceed the input length limits of LLMs.

To address these challenges, we developed a set of techniques to prompt LLMs with mobile UIs. We contribute an algorithm that generates the text representation of mobile UIs using depth-first search traversal to convert the Android UI's view hierarchy into HTML syntax. We also utilize chain of thought prompting, which involves generating intermediate results and chaining them together to arrive at the final output, to elicit the reasoning ability of the LLM.
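
The sketch below illustrates the flavor of such a depth-first conversion on a simplified view hierarchy. The node fields and the tag mapping are assumptions made for illustration, not the exact representation used in the paper.

data class ViewNode(
    val className: String,               // e.g., "TextView", "Button", "EditText"
    val text: String? = null,
    val contentDescription: String? = null,
    val resourceId: String? = null,
    val children: List<ViewNode> = emptyList()
)

private fun tagFor(className: String): String = when {
    className.contains("Button") -> "button"
    className.contains("EditText") -> "input"
    className.contains("Image") -> "img"
    else -> "p"
}

// Depth-first conversion: each node becomes an element; purely structural
// containers with no text or ID collapse into their children to keep the prompt short.
fun ViewNode.toHtml(): String {
    val childHtml = children.joinToString("") { it.toHtml() }
    val label = text ?: contentDescription
    if (label == null && resourceId == null) return childHtml
    val tag = tagFor(className)
    val idAttr = resourceId?.let { " id=\"${it.substringAfterLast('/')}\"" } ?: ""
    return "<$tag$idAttr>${label.orEmpty()}$childHtml</$tag>"
}

fun main() {
    val screen = ViewNode(
        className = "LinearLayout",
        children = listOf(
            ViewNode("TextView", text = "Sign in"),
            ViewNode("EditText", contentDescription = "email address", resourceId = "id/email"),
            ViewNode("Button", text = "Continue", resourceId = "id/continue_btn")
        )
    )
    println(screen.toHtml())
    // -> <p>Sign in</p><input id="email">email address</input><button id="continue_btn">Continue</button>
}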

Animation showing the process of few-shot prompting LLMs with mobile UIs.

Our prompt design starts with a preamble that explains the prompt’s purpose. The preamble is followed by multiple exemplars consisting of the input, a chain of thought (if applicable), and the output for each task. Each exemplar’s input is a mobile screen in the HTML syntax. Following the input, chains of thought can be provided to elicit logical reasoning from LLMs. This step is not shown in the animation above as it is optional. The task output is the desired outcome for the target tasks, e.g., a screen summary or an answer to a user question. Few-shot prompting can be achieved with more than one exemplar included in the prompt. During prediction, we feed the model the prompt with a new input screen appended at the end.
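
A minimal sketch of assembling such a prompt, assuming simple "Screen:"/"Answer:" markers (the actual prompt wording used in the paper may differ):

// Hypothetical few-shot prompt assembly: preamble, then exemplars consisting of
// input, optional chain of thought, and output, then the new screen at the end.
data class Exemplar(val screenHtml: String, val chainOfThought: String? = null, val output: String)

fun buildPrompt(preamble: String, exemplars: List<Exemplar>, newScreenHtml: String): String =
    buildString {
        appendLine(preamble)
        exemplars.forEach { ex ->
            appendLine("Screen: ${ex.screenHtml}")
            ex.chainOfThought?.let { appendLine("Reasoning: $it") }
            appendLine("Answer: ${ex.output}")
            appendLine()
        }
        appendLine("Screen: $newScreenHtml")
        append("Answer:")
    }

fun main() {
    val prompt = buildPrompt(
        preamble = "Summarize the purpose of the following mobile screen in one sentence.",
        exemplars = listOf(
            Exemplar(
                screenHtml = "<p>Sign in</p><input id=\"email\">email address</input>",
                output = "A sign-in screen asking the user for their email address."
            )
        ),
        newScreenHtml = "<p>Choose a plan</p><button>Monthly</button><button>Yearly</button>"
    )
    println(prompt)
}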


Experiments

We conducted comprehensive experiments with four pivotal modeling tasks: (1) screen question-generation, (2) screen summarization, (3) screen question-answering, and (4) mapping instruction to UI action. Experimental results show that our approach achieves competitive performance using only two data examples per task.



Task 1: Screen question generation

Given a mobile UI screen, the goal of screen question-generation is to synthesize coherent, grammatically correct natural language questions relevant to the UI elements requiring user input.

We found that LLMs can leverage the UI context to generate questions for relevant information. LLMs significantly outperformed the heuristic approach (template-based generation) regarding question quality.

Example screen questions generated by the LLM. The LLM can utilize screen contexts to generate grammatically correct questions relevant to each input field on the mobile UI, while the template approach falls short.

We also revealed LLMs' ability to combine relevant input fields into a single question for efficient communication. For example, the filters asking for the minimum and maximum price were combined into a single question: “What’s the price range?”

We observed that the LLM could use its prior knowledge to combine multiple related input fields to ask a single question.

In an evaluation, we solicited human ratings on whether the questions were grammatically correct (Grammar) and relevant to the input fields for which they were generated (Relevance). In addition to the human-labeled language quality, we automatically examined how well the LLM covered all the elements for which questions should be generated (Coverage F1). We found that the questions generated by the LLM had almost perfect grammar (4.98/5) and were highly relevant to the input fields displayed on the screen (92.8%). Additionally, the LLM performed well in terms of covering the input fields comprehensively (95.8%).


Metric            Template             2-shot LLM
Grammar           3.6 (out of 5)       4.98 (out of 5)
Relevance         84.1%                92.8%
Coverage F1       100%                 95.8%


Task 2: Screen summarization

Screen summarization is the automatic generation of descriptive language overviews that cover essential functionalities of mobile screens. The task helps users quickly understand the purpose of a mobile UI, which is particularly useful when the UI is not visually accessible.

Our results showed that LLMs can effectively summarize the essential functionalities of a mobile UI. By drawing on specific text on the screen, they can generate more accurate summaries than the Screen2Words benchmark model that we previously introduced, as highlighted in the colored text and boxes below.

Example summary generated by 2-shot LLM. We found the LLM is able to use specific text on the screen to compose more accurate summaries.

Interestingly, we observed LLMs using their prior knowledge to deduce information not presented in the UI when creating summaries. In the example below, the LLM inferred the subway stations belong to the London Tube system, while the input UI does not contain this information.

LLM uses its prior knowledge to help summarize the screens.

Human evaluation rated LLM summaries as more accurate than the benchmark, yet they scored lower on metrics like BLEU. The mismatch between perceived quality and metric scores echoes recent work showing LLMs write better summaries despite automatic metrics not reflecting it.

  

Left: Screen summarization performance on automatic metrics. Right: Screen summarization accuracy voted by human evaluators.


Task 3: Screen question-answering

Given a mobile UI and an open-ended question asking for information regarding the UI, the model should provide the correct answer. We focus on factual questions, which require answers based on information presented on the screen.

Example results from the screen QA experiment. The LLM significantly outperforms the off-the-shelf QA baseline model.

We report performance using four metrics: Exact Matches (identical predicted answer to ground truth), Contains GT (answer fully containing ground truth), Sub-String of GT (answer is a sub-string of ground truth), and the Micro-F1 score based on shared words between the predicted answer and ground truth across the entire dataset.
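
As a rough illustration, assuming lower-cased, whitespace-tokenized answers and mutually exclusive match buckets (the paper's exact normalization may differ), these metrics could be computed like this:

// Tokenization assumption: lower-cased, whitespace-split answers.
private fun tokens(s: String) = s.lowercase().split(Regex("\\s+")).filter { it.isNotEmpty() }

fun exactMatch(pred: String, gt: String) = pred.trim().equals(gt.trim(), ignoreCase = true)
fun containsGt(pred: String, gt: String) = !exactMatch(pred, gt) && pred.contains(gt, ignoreCase = true)
fun subStringOfGt(pred: String, gt: String) = !exactMatch(pred, gt) && gt.contains(pred, ignoreCase = true)

// Micro-F1: pool shared-word counts over the whole dataset, then compute F1 once.
fun microF1(pairs: List<Pair<String, String>>): Double {
    var shared = 0
    var predTotal = 0
    var gtTotal = 0
    for ((pred, gt) in pairs) {
        val p = tokens(pred)
        val g = tokens(gt).toMutableList()
        predTotal += p.size
        gtTotal += g.size
        for (t in p) if (g.remove(t)) shared++
    }
    if (shared == 0) return 0.0
    val precision = shared.toDouble() / predTotal
    val recall = shared.toDouble() / gtTotal
    return 2 * precision * recall / (precision + recall)
}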

Our results showed that LLMs can correctly answer UI-related questions, such as "what's the headline?". The LLM performed significantly better than the baseline QA model DistilBERT, achieving a 66.7% fully correct answer rate. Notably, the 0-shot LLM achieved an exact match score of 30.7%, indicating the model's intrinsic question answering capability.


Models          Exact Matches       Contains GT       Sub-String of GT       Micro-F1
0-shot LLM      30.7%               6.5%              5.6%                   31.2%
1-shot LLM      65.8%               10.0%             7.8%                   62.9%
2-shot LLM      66.7%               12.6%             5.2%                   64.8%
DistilBERT      36.0%               8.5%              9.9%                   37.2%


Task 4: Mapping instruction to UI action

Given a mobile UI screen and a natural language instruction to control the UI, the model needs to predict the ID of the object on which to perform the instructed action. For example, when instructed with "Open Gmail," the model should correctly identify the Gmail icon on the home screen. This task is useful for controlling mobile apps using language input such as voice access. We introduced this benchmark task previously.

Example using data from the PixelHelp dataset. The dataset contains interaction traces for common UI tasks such as turning on wifi. Each trace contains multiple steps and corresponding instructions.

We assessed the performance of our approach using the Partial and Complete metrics from the Seq2Act paper. Partial refers to the percentage of correctly predicted individual steps, while Complete measures the portion of accurately predicted entire interaction traces. Although our LLM-based method did not surpass the benchmark trained on massive datasets, it still achieved remarkable performance with just two prompted data examples.
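
For reference, a simple sketch of how Partial and Complete can be computed over predicted traces, with a hypothetical trace representation:

data class Trace(val predictedIds: List<Int>, val groundTruthIds: List<Int>)

// Partial: fraction of individual steps predicted correctly across all traces.
// Complete: fraction of traces where every step is correct.
fun partialAndComplete(traces: List<Trace>): Pair<Double, Double> {
    var correctSteps = 0
    var totalSteps = 0
    var completeTraces = 0
    for (t in traces) {
        val stepHits = t.predictedIds.zip(t.groundTruthIds).count { (p, g) -> p == g }
        correctSteps += stepHits
        totalSteps += t.groundTruthIds.size
        if (stepHits == t.groundTruthIds.size && t.predictedIds.size == t.groundTruthIds.size) {
            completeTraces++
        }
    }
    return correctSteps.toDouble() / totalSteps to completeTraces.toDouble() / traces.size
}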


Models       Partial       Complete      
0-shot LLM       1.29       0.00      
1-shot LLM (cross-app)       74.69       31.67      
2-shot LLM (cross-app)       75.28       34.44      
1-shot LLM (in-app)       78.35       40.00      
2-shot LLM (in-app)       80.36       45.00      
Seq2Act       89.21       70.59      


Takeaways and conclusion

Our study shows that prototyping novel language interactions on mobile UIs can be as easy as designing a data exemplar. As a result, an interaction designer can rapidly create functioning mock-ups to test new ideas with end users. Moreover, developers and researchers can explore different possibilities of a target task before investing significant efforts into developing new datasets and models.

We investigated the feasibility of prompting LLMs to enable various conversational interactions on mobile UIs. We proposed a suite of prompting techniques for adapting LLMs to mobile UIs. We conducted extensive experiments with the four important modeling tasks to evaluate the effectiveness of our approach. The results showed that compared to traditional machine learning pipelines that consist of expensive data collection and model training, one could rapidly realize novel language-based interactions using LLMs while achieving competitive performance.


Acknowledgements

We thank our paper co-author Gang Li, and appreciate the discussions and feedback from our colleagues Chin-Yi Cheng, Tao Li, Yu Hsiao, Michael Terry and Minsuk Chang. Special thanks to Muqthar Mohammad and Ashwin Kakarla for their invaluable assistance in coordinating data collection. We thank John Guilyard for helping create animations and graphics in the blog.

Source: Google AI Blog


Clue’s development speed improves 3X after rebuilding the app with Jetpack Compose

Posted by the Android team

Clue is a freemium menstrual health application founded in 2012 and was among the earliest developers in femtech. The app helps women and people who menstruate track their cycles and serves 11 million monthly active users in over 190 countries. Additional features, including tools for tracking prenatal and postpartum health, are available through the app’s subscription tier, Clue Plus.

Having access to streamlined and easily digestible menstrual health data can be an invaluable resource for people who menstruate, and Clue has supported these insights for Android users for over a decade. As with any codebase, however, the Clue app inherited technical debt. This limited the team’s ability to push changes and features quickly, scale developer efforts, and provide a modern UI to its users.

Clue previously relied heavily on custom views that made extending the existing codebase difficult and required time-consuming testing methods that slowed the development process. Clue’s codebase had additionally amassed UI inconsistencies from hard-coded theme values such as colors and sizes, and in 2022 Clue’s engineers recognized that they needed a more efficient and flexible solution. They ultimately landed on migrating to Jetpack Compose, Android’s modern toolkit for building native UI.

“We decided that a complete rewrite of the application, with a specific emphasis on the UI layer, would be the best course of action,” said Moctezuma Rojas, an Android developer at Clue. “This decision was based on the fact that it would enable us to have a more efficient and faster development cycle, quickly implement features that would have taken much longer to develop using views, and make our code more testable.”

Building a faster and more efficient codebase with Compose

The Clue team saw immediate benefits by rewriting its app with Compose. For one, a faster, more efficient testing and development cycle significantly reduced the time and effort necessary to improve the codebase while reducing bugs and errors. Compose also enabled Clue’s engineers to implement features faster than they could with Views.

Migrating the app to Compose resulted in improved testability for screens, faster development from ideation to release, and better standardization processes that aligned with the best practices recommended by Android developer resources. Compose also helped the Clue team double—and in some cases, triple—their development speed when compared to the old codebase.

“With the traditional view system, adding new features, visual representations, or user interactions was difficult due to the need for custom view creation and maintenance. However, by utilizing Jetpack Compose, we've been able to effortlessly develop and expand the Cycle View feature without any limitations in adding elements,” said Moctezuma.

Photo of Moctezuma Rojas, Android developer at Clue, smiling while snuggling a cat, with quote text which reads, '...By utilizing Jetpack Compose, we've been able to effortlessly develop and expand the Cycle View feature without any limitations in adding elements.'

Compose also helped Clue’s developers quickly overhaul several other important features within the application, including Calendar View, Analysis View, and the account management and settings screens.

More creativity made possible with Compose

Compose enabled developers to make Clue screens more intuitive, improve scrolling, and deploy a custom color system and component library that aligned with the brand—a huge win for the team. Previously, adding new features, visual representations, or user interactions was complicated because they required creating a custom view and ongoing maintenance.

Compose APIs made it much easier to test UI so Clue developers felt more confident about what they were shipping to users. As an added benefit, Clue developers now have more space for exploring UX innovation.

“The custom dynamic theming allows designers to freely explore their creativity without being limited by technological constraints,” said Moctezuma. “It provides a flexible and scalable approach to styling that can be easily adapted as our app evolves and grows, resulting in a visually appealing and cohesive user experience.”

All of these changes vastly improved the user experience for Clue subscribers, resulting in fewer error messages and bug reports. The Clue team also says that using Compose has enabled them to identify areas of improvement in the app’s code that could have potentially impacted its users.

“Compose increases developer velocity by eliminating boilerplate code, works seamlessly with the existing code base thanks to its Interoperability APIs, and improves UI testing—which has always been painful in Android development,” said Tilbe Saltan, a senior Android developer at Clue.

Continued success with Jetpack Compose

Compose has improved each subsequent app release and made preview and live editing features more reliable for Clue engineers, allowing for a more flexible development experience from start to finish. Since implementing Compose, the Clue team has also seen excitement from prospective candidates interested in working on the app so they can work with more modern development technologies.

“The future of Compose holds many potential development areas that could benefit developers and companies. As Compose continues to evolve, we can expect to see more improvements in performance, stability, tooling, and cross-platform support, which will make it an even more compelling choice for building high-quality UIs,” said Tilbe.


Get started

Optimize your UI development with Jetpack Compose.

Material Design 3 for Compose hits stable

Posted by Gurupreet Singh, Developer Advocate; Android

Today marks the first stable release of Compose Material 3. The library allows you to build Jetpack Compose UIs with Material Design 3, the next evolution of Material Design. You can start using Material Design 3 in your apps today!

Note: The terms "Material Design 3", "Material 3", and "M3" are used interchangeably. 

Material 3 includes updated theming and components, exclusive features like dynamic color, and is designed to be aligned with the latest Android visual style and system UI.
Multiple apps using Material Design 3 theming

You can start using Material Design 3 in your apps by adding the Compose Material 3 dependency to your build.gradle files:

// Add dependency in module build.gradle

implementation "androidx.compose.material3:material3:$material3_version" 


Note: See the latest M3 versions on the Compose Material 3 releases page.


Color schemes

Material 3 brings extensive, finer-grained color customization, and comes with both light and dark color scheme support out of the box. The Material Theme Builder allows you to generate a custom color scheme using core colors, and optionally export Compose theming code. You can read more about color schemes and color roles.
Material Theme Builder to export Material 3 color schemes


Dynamic color

Dynamic color derives its colors from the user’s wallpaper. These colors can be applied to apps and the system UI.

Dynamic color is available on Android 12 (API level 31) and above. If dynamic color is available, you can set up a dynamic ColorScheme. If not, you should fall back to using a custom light or dark ColorScheme.
Reply app: dynamic theming from wallpaper (left) and default app theming (right).

The ColorScheme class provides builder functions to create both dynamic and custom light and dark color schemes:

Theme.kt

// Dynamic color is available on Android 12+
val dynamicColor = Build.VERSION.SDK_INT >= Build.VERSION_CODES.S
val colorScheme = when {
  dynamicColor && darkTheme -> dynamicDarkColorScheme(LocalContext.current)
  dynamicColor && !darkTheme -> dynamicLightColorScheme(LocalContext.current)
  darkTheme -> darkColorScheme(...)
  else -> lightColorScheme(...)
}

MaterialTheme(
  colorScheme = colorScheme,
  typography = typography,
  shapes = shapes
) {
  // M3 App content
}


Material components

The Compose Material 3 APIs contain a wide range of both new and evolved Material components, with more planned for future versions. Many of the Material components, like Card, RadioButton, and Checkbox, are no longer considered experimental; their APIs are stable and they can be used without the ExperimentalMaterial3Api annotation.

The M3 Switch component has a brand new UI refresh with accessibility-compliant minimum touch target size support, color mappings, and optional icon support in the switch thumb. The touch target is bigger, and the thumb size increases on user interaction, providing feedback to the user that the thumb is being interacted with.
Material 3 Switch thumb interaction

Switch(
    checked = isChecked,
    onCheckedChange = { /*...*/ },
    thumbContent = {
        Icon(
            imageVector = Icons.Default.Check,
            contentDescription = stringResource(id = R.string.switch_check)
        )
    },
)


Navigation drawer components now provide wrapper sheets for content to change colors, shapes, and elevation independently.

Navigation drawer component       Content sheet
ModalNavigationDrawer             ModalDrawerSheet
PermanentNavigationDrawer         PermanentDrawerSheet
DismissableNavigationDrawer       DismissableDrawerSheet


ModalNavigationDrawer with content wrapped in ModalDrawerSheet

ModalNavigationDrawer(
    drawerContent = {
        ModalDrawerSheet(
            drawerShape = MaterialTheme.shapes.small,
            drawerContainerColor = MaterialTheme.colorScheme.primaryContainer,
            drawerContentColor = MaterialTheme.colorScheme.onPrimaryContainer,
            drawerTonalElevation = 4.dp,
        ) {
            DESTINATIONS.forEach { destination ->
                NavigationDrawerItem(
                    selected = selectedDestination == destination.route,
                    onClick = { ... },
                    icon = { ... },
                    label = { ... }
                )
            }
        }
    }
) {
    // Screen content
}


We have a brand new CenterAlignedTopAppBar in addition to the already existing app bars. It can be used for the main root page in an app: you can display the app name or page headline with home and action icons.


Material CenterAlignedTopAppBar with home and action items.

CenterAlignedTopAppBar(
    title = {
        Text(stringResource(R.string.top_stories))
    },
    scrollBehavior = scrollBehavior,
    navigationIcon = { /* Navigation Icon */ },
    actions = { /* App bar actions */ }
)


See the latest M3 components and layouts on the Compose Material 3 API reference overview. Keep an eye on the releases page for new and updated APIs.


Typography

Material 3 simplified the naming and grouping of typography to:
  • Display
  • Headline
  • Title
  • Body
  • Label
There are large, medium, and small sizes for each, providing a total of 15 text style variations.

The Typography constructor offers defaults for each style, so you can omit any parameters that you don’t want to customize:

val typography = Typography(
  titleLarge = TextStyle(
      fontWeight = FontWeight.SemiBold,
      fontSize = 22.sp,
      lineHeight = 28.sp,
      letterSpacing = 0.sp
  ),
  titleMedium = TextStyle(
      fontWeight = FontWeight.SemiBold,
      fontSize = 16.sp,
      lineHeight = 24.sp,
      letterSpacing = 0.15.sp
  ),
  ...
)


You can customize your typography by changing default values of TextStyle and font-related properties like fontFamily and letterSpacing.

bodyLarge = TextStyle(
  fontWeight = FontWeight.Normal,
  fontFamily = FontFamily.SansSerif,
  fontStyle = FontStyle.Italic,
  fontSize = 16.sp,
  lineHeight = 24.sp,
  letterSpacing = 0.15.sp,
  baselineShift = BaselineShift.Subscript
)


Shapes

The Material 3 shape scale defines the style of container corners, offering a range of roundedness from square to fully circular.

There are different sizes of shapes:
  • Extra small
  • Small
  • Medium
  • Large
  • Extra large

Material Design 3 shapes used in various components as default values.

Each shape has a default value but you can override it:

val shapes = Shapes(
  extraSmall = RoundedCornerShape(4.dp),
  small = RoundedCornerShape(8.dp),
  medium = RoundedCornerShape(12.dp),
  large = RoundedCornerShape(16.dp),
  extraLarge = RoundedCornerShape(28.dp)
)


You can read more about applying shape.


Window size classes

Jetpack Compose and Material 3 provide window size artifacts that can help make your apps adaptive. You can start by adding the Compose Material 3 window size class dependency to your build.gradle files:

// Add dependency in module build.gradle

implementation "androidx.compose.material3:material3-window-size-class:$material3_version"


Window size classes group sizes into standard size buckets, which are breakpoints that are designed to optimize your app for most unique cases.


WindowWidthSizeClass for grouping devices into different size buckets.

See the Reply Compose sample to learn more about adaptive apps and the window size classes implementation.
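
As a rough sketch of how the width size class can drive an adaptive layout (the branch bodies are placeholders, and the API requires an opt-in while it is marked experimental):

@OptIn(ExperimentalMaterial3WindowSizeClassApi::class)
@Composable
fun AdaptiveApp(activity: Activity) {
    val windowSizeClass = calculateWindowSizeClass(activity)
    when (windowSizeClass.widthSizeClass) {
        WindowWidthSizeClass.Compact -> { /* single-pane layout for phones */ }
        WindowWidthSizeClass.Medium -> { /* e.g., list + detail for foldables and small tablets */ }
        else -> { /* expanded layout, e.g., with a permanent navigation drawer */ }
    }
}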


Window insets support

M3 components, like top app bars, navigation drawers, navigation bars, and navigation rails, include built-in support for window insets. These components, when used independently or with Scaffold, will automatically handle insets determined by the status bar, navigation bar, and other parts of the system UI.

Scaffold now supports a contentWindowInsets parameter, which lets you specify the insets for the scaffold content.

Scaffold insets are only taken into consideration when a topBar or bottomBar is not present in Scaffold, as these components handle insets at the component level.

Scaffold(
    contentWindowInsets = WindowInsets(16.dp)
) {
    // Scaffold content
}



Resources

With Compose Material 3 reaching stable, it’s a great time to start learning all about it and get ready to adopt it in your apps. Check out the resources below to get started.

Monster Mash: A Sketch-Based Tool for Casual 3D Modeling and Animation

3D computer animation is a time-consuming and highly technical medium — to complete even a single animated scene requires numerous steps, like modeling, rigging and animating, each of which is itself a sub-discipline that can take years to master. Because of its complexity, 3D animation is generally practiced by teams of skilled specialists and is inaccessible to almost everyone else, despite decades of advances in technology and tools. With the recent development of tools that facilitate game character creation and game balance, a natural question arises: is it possible to democratize the 3D animation process so it’s accessible to everyone?

To explore this concept, we start with the observation that most forms of artistic expression have a casual mode: a classical guitarist might jam without any written music, a trained actor could ad-lib a line or two while rehearsing, and an oil painter can jot down a quick gesture drawing. What these casual modes have in common is that they allow an artist to express a complete thought quickly and intuitively without fear of making a mistake. This turns out to be essential to the creative process — when each sketch is nearly effortless, it is possible to iteratively explore the space of possibilities far more effectively.

In this post, we describe Monster Mash, an open source tool presented at SIGGRAPH Asia 2020 that allows experts and amateurs alike to create rich, expressive, deformable 3D models from scratch — and to animate them — all in a casual mode, without ever having to leave the 2D plane. With Monster Mash, the user sketches out a character, and the software automatically converts it to a soft, deformable 3D model that the user can immediately animate by grabbing parts of it and moving them around in real time. There is also an online demo, where you can try it out for yourself.



Creating a walk cycle using Monster Mash. Step 1: Draw a character. Step 2: Animate it.

Creating a 2D Sketch
The insight that makes this casual sketching approach possible is that many 3D models, particularly those of organic forms, can be described by an ordered set of overlapping 2D regions. This abstraction makes the complex task of 3D modeling much easier: the user creates 2D regions by drawing their outlines, then the algorithm creates a 3D model by stitching the regions together and inflating them. The result is a simple and intuitive user interface for sketching 3D figures.

For example, suppose the user wants to create a 3D model of an elephant. The first step is to draw the body as a closed stroke (a). Then the user adds strokes to depict other body parts such as legs (b). Drawing those additional strokes as open curves provides a hint to the system that they are meant to be smoothly connected with the regions they overlap. The user can also specify that some new parts should go behind the existing ones by drawing them with the right mouse button (c), and mark other parts as symmetrical by double-clicking on them (d). The result is an ordered list of 2D regions.

Steps in creating a 2D sketch of an elephant.

Stitching and Inflation
To understand how a 3D model is created from these 2D regions, let’s look more closely at one part of the elephant. First, the system identifies where the leg must be connected to the body (a) by finding the segment (red) that completes the open curve. The system cuts the body’s front surface along that segment, and then stitches the front of the leg together with the body (b). It then inflates the model into 3D by solving a modified form of Poisson’s equation to produce a surface with a rounded cross-section (c). The resulting model (d) is smooth and well-shaped, but because all of the 3D parts are rooted in the drawing plane, they may intersect each other, resulting in a somewhat odd-looking “elephant”. These intersections will be resolved by the deformation system.

Illustration of the details of the stitching and inflation process. The schematic illustrations (b, c) are cross-sections viewed from the elephant’s front.
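
For intuition, a generic Poisson-style inflation can be written as below; the paper uses a modified form of this equation, so treat it as a sketch of the idea rather than the exact formulation:

\begin{aligned}
\Delta f &= -C && \text{in } \Omega \text{ (the 2D region bounded by the stroke)},\\
f &= 0 && \text{on } \partial\Omega,\\
z(x, y) &= \sqrt{f(x, y)},
\end{aligned}

which lifts the flat region into a surface with a rounded cross-section.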

Layered Deformation
At this point we just have a static model — we need to give the user an easy way to pose the model, and also separate the intersecting parts somehow. Monster Mash’s layered deformation system, based on the well-known smooth deformation method as-rigid-as-possible (ARAP), solves both of these problems at once. What’s novel about our layered “ARAP-L” approach is that it combines deformation and other constraints into a single optimization framework, allowing these processes to run in parallel at interactive speed, so that the user can manipulate the model in real time.
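
For reference, the standard ARAP energy that this builds on measures how far each deformed neighborhood departs from a rigid rotation of its rest shape:

E(\mathbf{p}') = \sum_i \sum_{j \in \mathcal{N}(i)} w_{ij} \left\| (\mathbf{p}'_i - \mathbf{p}'_j) - \mathbf{R}_i (\mathbf{p}_i - \mathbf{p}_j) \right\|^2

where p and p' are the rest and deformed vertex positions, R_i is the best-fitting rotation for vertex i's neighborhood, and w_ij are per-edge weights. ARAP-L augments this objective with the layering, equality, and point constraints described below; the exact constraint formulation is given in the paper.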

The framework incorporates a set of layering and equality constraints, which move body parts along the z axis to prevent them from visibly intersecting each other. These constraints are applied only at the silhouettes of overlapping parts, and are dynamically updated each frame.

In steps (d) through (h) above, ARAP-L transforms a model from one with intersecting 3D parts to one with the depth ordering specified by the user. The layering constraints force the leg’s silhouette to stay in front of the body (green), and the body’s silhouette to stay behind the leg (yellow). Equality constraints (red) seal together the loose boundaries between the leg and the body.

Meanwhile, in a separate thread of the framework, we satisfy point constraints to make the model follow user-defined control points (described in the section below) in the xy-plane. This ARAP-L method allows us to combine modeling, rigging, deformation, and animation all into a single process that is much more approachable to the non-specialist user.

The model deforms to match the point constraints (red dots) while the layering constraints prevent the parts from visibly intersecting.

Animation
To pose the model, the user can create control points anywhere on the model’s surface and move them. The deformation system converges over multiple frames, which gives the model’s movement a soft and floppy quality, allowing the user to intuitively grasp its dynamic properties — an essential prerequisite for kinesthetic learning.

Because the effect of deformations converges over multiple frames, our system lends 3D models a soft and dynamic quality.

To create animation, the system records the user’s movements in real time. The user can animate one control point, then play back that movement while recording additional control points. In this way, the user can build up a complex action like a walk by layering animation, one body part at a time. At every stage of the animation process, the only task required of the user is to move points around in 2D, a low-risk workflow meant to encourage experimentation and play.

Conclusion
We believe this new way of creating animation is intuitive and can thus help democratize the field of computer animation, encouraging novices who would normally be unable to try it on their own as well as experts who often require fast iteration under tight deadlines. Here you can see a few of the animated characters that have been created using Monster Mash. Most of these were created in a matter of minutes.

A selection of animated characters created using Monster Mash. The original hand-drawn outline used to create each 3D model is visible as an inset above each character.

All of the code for Monster Mash is available as open source, and you can watch our presentation and read our paper from SIGGRAPH Asia 2020 to learn more. We hope this software will make creating 3D animations more broadly accessible. Try out the online demo and see for yourself!

Acknowledgements
Monster Mash is the result of a collaboration between Google Research, Czech Technical University in Prague, ETH Zürich, and the University of Washington. Key contributors include Marek Dvorožňák, Daniel Sýkora, Cassidy Curtis, Brian Curless, Olga Sorkine-Hornung, and David Salesin. We are also grateful to Hélène Leroux, Neth Nom, David Murphy, Samuel Leather, Pavla Sýkorová, and Jakub Javora for participating in the early interactive sessions.

Source: Google AI Blog


Here’s how to watch #TheAndroidShow in just under 24 hours

Posted by The Jetpack Compose Team

In less than 24 hours, we're giving you a backstage pass to Jetpack Compose, Android's modern toolkit for building native UIs, on #TheAndroidShow. Hosted by Kari Byron, you'll hear the latest on Jetpack Compose from the people who built it, plus a fireside interview with Android's Dave Burke.

The show kicks off live at 9AM PT!

Broadcasting live on February 24th at 9AM PT, you’ll be able to watch the show at goo.gle/TheAndroidShow, where you’ll also be able to find more information and links to all of the things we covered in the show. Or if you prefer, you can watch directly on YouTube or Twitter.

There’s still time to ask your Jetpack Compose questions, use #TheAndroidShow

Got a burning Jetpack Compose question? Want to learn about annotating a function type with @Composable? Or how to add a static parameter to Composable functions at the compiler level? Tweet us your Jetpack Compose questions now, using #TheAndroidShow. We’ve assembled a team of experts, ready to answer your questions live on #TheAndroidShow; tune in on February 24 to see if we cover your question!

First preview of Android 12

Android 12 logo

Posted by Dave Burke, VP of Engineering

Every day, Android apps help billions of people work, play, communicate, and create on a wide range of devices from phones and laptops to tablets, TVs, and cars. As more people come to rely on the experiences you build, their expectations can rise just as fast. It’s one of the reasons we share Android releases with you early: your feedback helps us build a better platform for your apps and all of the people who use them. Today, we’re releasing the first Developer Preview of Android 12, the next version of Android, for your testing and feedback.

With each version, we’re working to make the OS smarter, easier to use, and better performing, with privacy and security at the core. In Android 12 we’re also working to give you new tools for building great experiences for users. Starting with things like compatible media transcoding, which helps your app to work with the latest video formats if you don’t already support them, and easier copy/paste of rich content into your apps, like images and videos. We’re also adding privacy protections and optimizing performance to keep your apps responsive.

Today’s first preview is just the start for Android 12, and we’ll have lots more to share as we move through the release. Read on for a taste of what’s new in Android 12, and visit the Android 12 developer site for details on downloads for Pixel and release timeline. As always, it’s crucial to get your feedback early, to help us incorporate it into the final product, so let us know what you think!

Alongside the work we’re doing in Android 12, later this month we’ll have more to share on another important tool that helps you create great user experiences more easily: Jetpack Compose, our modern toolkit for building native UI. Join us on #TheAndroidShow for a behind-the-scenes look at Jetpack Compose, livestreamed on February 24 at 9AM PT, and tweet your Jetpack Compose questions using #TheAndroidShow to have them answered live on the show.

Trust and safety

Privacy is at the heart of everything we do, and in Android 12 we’re continuing to focus on giving users more transparency and control while keeping their devices and data secure. In today’s release we’ve added new controls over identifiers that can be used for tracking, safer defaults for app components, and more. These changes may affect your apps, so we recommend testing as soon as possible. Watch for more privacy and security features coming in later preview releases.

Modern SameSite cookie behaviors in WebView - In line with changes to Chrome and other browsers, WebView includes new SameSite cookie behaviors to provide additional security and privacy and give users more transparency and control over how cookies can be used across sites. More here.

Restricted Netlink MAC - We’re continuing to help developers migrate to privacy-protecting resettable identifiers. In a multi-release effort to ease migration of device-scoped Netlink MAC, in Android 11 we restricted access to it based on API level 30, and in Android 12 we’re applying the restriction for all apps - regardless of targetSDK level. More here.

Safer exporting of components - To prevent apps from inadvertently exporting activities, services, and receivers, we’re changing the default handling of the android:exported attribute to be more explicit. With this change, components that declare one or more intent filters must now explicitly declare an android:exported attribute. You should inspect your components in the manifest in order to avoid installation errors related to this change. More here.

Safer handling of Intents - To make handling PendingIntents more secure, Android 12 requires apps to explicitly declare a mutability flag, either FLAG_MUTABLE or the new FLAG_IMMUTABLE, for each PendingIntent. More here.
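
For example, a minimal sketch in Kotlin (the activity class and request code are placeholders):

// On Android 12, every PendingIntent must state its mutability explicitly.
val pendingIntent = PendingIntent.getActivity(
    context,
    /* requestCode = */ 0,
    Intent(context, MainActivity::class.java),   // MainActivity is a placeholder
    PendingIntent.FLAG_IMMUTABLE or PendingIntent.FLAG_UPDATE_CURRENT
)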

You can read more about these and other privacy and security changes here.

Better user experience tools

In Android 12 we’re investing in key areas to help deliver a polished experience and better performance for users. Here are some of the updates so far.

Compatible media transcoding - With the prevalence of HEVC hardware encoders on mobile devices, camera apps are increasingly capturing in HEVC format, which offers significant improvements in quality and compression over older codecs. Most apps should support HEVC, but for apps that can’t, we’re introducing compatible media transcoding.

With this feature, an app that doesn’t support HEVC can have the platform automatically transcode the file into AVC, a format that is widely compatible. The transcoding process takes time, depending on the video and hardware properties of the device. As an example, a one minute 1080p video at 30fps takes around 9 seconds to transcode on a Pixel 4. You can opt-in to use the transcoding service by just declaring the media formats that your apps don't support. For developers, we strongly recommend that your apps support HEVC, and if that’s not possible, enable compatible media transcoding. The feature will be active on all devices using HEVC format for video capture. We'd love to hear your feedback on this feature. More here.

AVIF image support - To give you higher image quality with more efficient compression, Android 12 introduces platform support for AV1 Image File Format (AVIF). AVIF is a container format for images and sequences of images encoded using AV1. Like other modern image formats, AVIF takes advantage of the intra-frame encoded content from video compression. This dramatically improves image quality for the same file size when compared to older image formats, such as JPEG.

Race car photo comparison: AVIF (18.2kB) vs. JPEG (20.7kB).

Credit: Image comparison from AVIF has landed by Jake Archibald

Foreground service optimizations - Foreground services are an important way for apps to manage certain types of user-facing tasks, but when overused they can affect performance and even lead to app kills. To ensure a better experience for users, we will be blocking foreground service starts from the background for apps that are targeting the new platform. To make it easier to transition away from this pattern, we’re introducing a new expedited job in JobScheduler that gets elevated process priority, network access, and runs immediately regardless of power constraints like Battery Saver or Doze. For back-compatibility, we’ve also built expedited jobs into the latest release of Jetpack WorkManager library. Also, to reduce distraction for users, we’re now delaying the display of some foreground service notifications by up to 10 seconds. This gives short-lived tasks a chance to complete before their notifications are shown. More here.
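
As a sketch, an expedited job with Jetpack WorkManager (2.7 or later) looks roughly like this; MyWorker is a placeholder Worker class:

// Expedited work as a back-compatible alternative to starting a foreground
// service from the background.
val request = OneTimeWorkRequestBuilder<MyWorker>()
    .setExpedited(OutOfQuotaPolicy.RUN_AS_NON_EXPEDITED_WORK_REQUEST)
    .build()
WorkManager.getInstance(context).enqueue(request)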

Rich content insertion - Users love images, videos and other expressive content, but inserting and moving this content in apps is not always easy. To make it simple for your apps to receive rich content, we’re introducing a new unified API that lets you accept content from any source: clipboard, keyboard, or drag and drop. You can attach a new interface, OnReceiveContentListener, to UI components and get a callback when content is inserted through any mechanism. This callback becomes the single place for your code to handle insertion of all content, from plain and styled text to markup, images, videos, audio files, and more. For back-compatibility, we’ve added the unified API to AndroidX. More here.
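
A minimal sketch using the AndroidX back-compat entry point in androidx.core (the editText variable and MIME types here are placeholders):

val mimeTypes = arrayOf("image/*", "video/*")
ViewCompat.setOnReceiveContentListener(editText, mimeTypes) { _, payload ->
    // Split the payload into URI-backed items (images, videos, audio files)
    // and everything else (plain and styled text).
    val split = payload.partition { item -> item.uri != null }
    val uriContent = split.first
    val remaining = split.second
    uriContent?.let { /* hand the URIs to your media-handling code */ }
    remaining // return anything unhandled so default handling can apply
}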

Haptic-coupled audio effect - In Android 12 apps can provide audio-coupled haptic feedback through the phone's vibrator. The vibration strength and frequency are derived from an audio session, allowing you to create more immersive game and audio experiences. For example, a video calling app could use custom ringtones to identify the caller through haptic feedback, or you could simulate rough terrain in a racing game. More here.
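
One platform API for this is the HapticGenerator audio effect (android.media.audiofx.HapticGenerator) added in Android 12. A minimal sketch, where mediaPlayer is a placeholder owner of the audio session:

// Attach the effect to an audio session so playback drives haptics derived
// from the audio signal, when the device supports it.
if (HapticGenerator.isAvailable()) {
    val hapticGenerator = HapticGenerator.create(mediaPlayer.audioSessionId)
    hapticGenerator.setEnabled(true)
}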

Multi-channel audio - Android 12 includes several enhancements for audio with spatial information. It adds support for MPEG-H playback in passthrough and offload mode, and the audio mixers, resamplers and effects have been optimized for up to 24 channels (the previous maximum was 8).

Immersive mode API improvements for gesture nav - We’ve simplified immersive mode so that gesture navigation is easier and more consistent, for example when watching a video, reading a book, or playing a game. We’re still protecting apps from accidental gestures when in full-screen experiences related to gaming, but in all other full-screen or immersive experiences (e.g. video viewers, reading, photo gallery), for apps targeting the new platform, we’re changing the default to allow users to navigate their phone with one swipe. More here.

Notification UI updates - We’re refreshing notification designs to make them more modern, easier to use, and more functional. In this first preview you’ll notice changes from the drawer and controls to the templates themselves. We’re also optimizing transitions and animations across the system to make them more smooth. As part of the updates, for apps targeting Android 12 we’re decorating notifications with custom content with icon and expand affordances to match all other notifications. More here.

Faster, more responsive notifications - When users tap a notification, they expect to jump immediately into the app - the faster the better. To meet that expectation, developers should make sure that notification taps trigger Activity starts directly, rather than using “trampolines” - an intermediary broadcast receiver or service - to start the Activity. Notification trampolines can cause significant delays and affect the user experience. To keep notifications responsive, Android 12 will block notification trampolines by preventing them from launching their target Activities, and we’re asking developers to migrate away from this pattern. The change applies only to apps targeting the new platform, but for all apps we’ll display a toast to make trampolines visible to you and to users. More here.

Improved Binder IPC calls - As part of our work on performance, we’ve put a focus on reducing system variability. We’ve taken a look at latency and workload distribution, and made optimizations that reduce the gap between the median experience and the tail-end (99th percentile) use cases. In doing so, we’ve targeted improvements to system binder calls, adding lightweight caching strategies and focusing on removing lock contention to improve latency distribution. This has yielded roughly a 2x performance increase on Binder calls overall, with significant improvements in specific calls, for example a 47x improvement in refContentProvider(), 15x in releaseWakeLock(), and 7.9x in JobScheduler.schedule().

App compatibility

We’re working to make updates faster and smoother by prioritizing app compatibility as we roll out new platform versions. In Android 12 we’ve made most app-facing changes opt-in to give you more time, and we’ve updated our tools and processes to help you get ready sooner. We’ve also added new functionality to Google Play system updates to give your apps a better environment on Android 12 devices.

More of Android updated through Google Play - We’re continuing to expand our investment in Google Play system updates (Project Mainline) to give apps a more consistent, secure environment across devices. In Android 12 we’ve added the Android Runtime (ART) module that lets us push updates to the core runtime and libraries on devices running Android 12. We can improve runtime performance and correctness, manage memory more efficiently, and make Kotlin operations faster - all without requiring a full system update. We’ve also expanded the functionality of existing modules - for example, we’re delivering our compatible media transcoding feature inside an updatable module.

Optimizing for tablets, foldables, and TVs - With more people than ever using apps on large-screen devices like foldables, tablets, and TVs, now is a great time to make sure your app or game is ready. Get started by optimizing for tablets and building apps for foldables. And, for the biggest screen in the home, the first Android 12 preview for Android TV is also available. In addition to bringing the latest Android features to the TV with this preview, you will also be able to test your apps on the all-new Google TV experience. Learn more on the Android TV Developers site and get started with your ADT-3 developer kit.

Updated lists of non-SDK interfaces - We’ve restricted additional non-SDK interfaces, and as always your feedback and requests for public API equivalents are welcome.

Easier testing and debugging of changes - To make it easier for you to test the opt-in changes that can affect your app, we’ve made many of them toggleable. With the toggles you can force-enable or disable the changes individually from Developer options or adb. Check out the details here.

App compatibility toggles in Developer Options.

Platform stability milestone - Like last year, we’re letting you know our Platform Stability milestone well in advance, to give you more time to plan for app compatibility work. At this milestone we’ll deliver not only final SDK/NDK APIs, but also final internal APIs and app-facing system behaviors. We’re expecting to reach Platform Stability by August 2021, and you’ll have several weeks before the official release to do your final testing. The release timeline details are here.

Get started with Android 12

The Developer Preview has everything you need to try the Android 12 features, test your apps, and give us feedback. You can get started today by flashing a device system image to a Pixel 3 / 3 XL, Pixel 3a / 3a XL, Pixel 4 / 4 XL, Pixel 4a / 4a 5G, or Pixel 5 device. If you don’t have a Pixel device, you can use the 64-bit system images with the Android Emulator in Android Studio.

When you’re set up, here are some of the things you should do:

  • Try the new features and APIs - your feedback is critical during the early part of the developer preview. Report issues in our tracker or give us direct feedback by survey for selected features from the feedback and requests page.
  • Test your current app for compatibility - the goal here is to learn whether your app is affected by default behavior changes in Android 12. Just install your current published app onto a device or emulator running Android 12 and test.
  • Test your app with opt-in changes - Android 12 has opt-in behavior changes that only affect your app when it’s targeting the new platform. It’s extremely important to understand and assess these changes early. To make it easier to test, you can toggle the changes on and off individually.

We’ll update the preview system images and SDK regularly throughout the Android 12 release cycle. This initial preview release is for developers only and not intended for daily or consumer use, so we're making it available by manual download only. You can flash a factory image to your Pixel device, or you can sideload an OTA image to a Pixel device running Android 11, in which case you won’t need to unlock your bootloader or wipe data. Either way, once you’ve manually installed a preview build, you’ll automatically get future updates over-the-air for all later previews and Betas. More here.

As we get closer to a final product, we'll be inviting consumers to try it out as well, and we'll open up enrollments through Android Beta at that time. Stay tuned for details, but for now please note that Android Beta is not currently available for Android 12.

For complete information, visit the Android 12 developer site.

Grounding Natural Language Instructions to Mobile UI Actions



Mobile devices offer a myriad of functionalities that can assist in everyday activities. However, many of these functionalities are not easily discoverable or accessible, forcing users to look up how to perform a specific task -- how to turn on the traffic mode in Maps or change notification settings in YouTube, for example. While searching the web for detailed instructions is an option, it is still up to the user to follow those instructions step-by-step and navigate UI details through a small touchscreen, which can be tedious and time-consuming, and reduces accessibility. What if one could design a computational agent to turn these language instructions into actions and automatically execute them on the user’s behalf?

In “Mapping Natural Language Instructions to Mobile UI Action Sequences”, published at ACL 2020, we present the first step towards addressing the problem of automatic action sequence mapping, creating three new datasets used to train deep learning models that ground natural language instructions to executable mobile UI actions. This work lays the technical foundation for task automation on mobile devices that would alleviate the need to maneuver through UI details, which may be especially valuable for users who are visually or situationally impaired. We have also open-sourced our model code and data pipelines through our GitHub repository, in order to spur further developments among the research community.

Constructing Language Grounding Models
People often provide one another with instructions in order to coordinate joint efforts and accomplish tasks involving complex sequences of actions, for example, following a recipe to bake a cake, or having a friend walk you through setting up a home network. Building computational agents able to help with similar interactions is an important goal that requires true language grounding in the environments in which the actions take place.

The learning task addressed here is to predict a sequence of actions for a mobile platform given a set of instructions, a sequence of screens produced as the system transitions from one screen to another, as well as the set of interactive elements on those screens. Training such a model end-to-end would require paired language-action data, which is difficult to acquire at a large scale.

Instead, we deconstruct the problem into two sequential steps: an action phrase-extraction step and a grounding step.
The workflow of grounding language instructions to executable actions.
The action phrase-extraction step identifies the operation, object and argument descriptions from multi-step instructions using a Transformer model with area attention for representing each description phrase. Area attention allows the model to attend to a group of adjacent words in the instruction (a span) as a whole for decoding a description.
The action phrase extraction model takes a word sequence of a natural language instruction and outputs a sequence of spans (denoted in red boxes) that indicate the phrases describing the operation, the object and the argument of each action in the task.
Next, the grounding step matches the extracted operation and object descriptions with a UI object on the screen. Again, we use a Transformer model, but in this case, it contextually represents UI objects and grounds object descriptions to them.
The grounding model takes the extracted spans as input and grounds them to executable actions, including the object an action is applied to, given the UI screen at each step during execution.
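
To make the two-step decomposition concrete, here is a schematic Kotlin sketch of the pipeline’s interfaces; it is not the paper’s Transformer-based implementation (that code lives in the open-source repository), and every type and function name below is hypothetical.

```kotlin
// Schematic sketch of the two-step decomposition; all names are hypothetical.

data class ActionPhrase(val operation: String, val target: String, val argument: String?)
data class UiObject(val id: Int, val text: String, val bounds: List<Int>)
data class ExecutableAction(val operation: String, val objectId: Int, val argument: String?)

// Step 1: extract operation / object / argument description phrases from the instruction.
interface PhraseExtractor {
    fun extract(instruction: String): List<ActionPhrase>
}

// Step 2: ground an extracted phrase to a UI object on the current screen.
interface Grounder {
    fun ground(phrase: ActionPhrase, screen: List<UiObject>): ExecutableAction
}

// End to end: one screen state per predicted action.
fun mapInstructionToActions(
    instruction: String,
    screens: List<List<UiObject>>,
    extractor: PhraseExtractor,
    grounder: Grounder
): List<ExecutableAction> =
    extractor.extract(instruction)
        .zip(screens)
        .map { (phrase, screen) -> grounder.ground(phrase, screen) }
```
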
Results
To investigate the feasibility of this task and the effectiveness of our approach, we construct three new datasets to train and evaluate our model. The first dataset includes 187 multi-step English instructions for operating Pixel phones, along with their corresponding action-screen sequences, and enables assessment of full task performance on naturally occurring instructions; it is used for testing end-to-end grounding quality. For action phrase extraction training and evaluation, we obtain English “how-to” instructions that can be found abundantly on the web and annotate the phrases that describe each action. To train the grounding model, we synthetically generate 295K single-step commands to UI actions, covering 178K different UI objects across 25K mobile UI screens from a public Android UI corpus.

A Transformer with area attention obtains 85.56% accuracy for predicting span sequences that completely match the ground truth. The phrase extractor and grounding model together obtain 89.21% partial and 70.59% complete accuracy for matching ground-truth action sequences on the more challenging task of mapping language instructions to executable actions end-to-end. We also evaluated alternative methods and representations of UI objects, such as using a graph convolutional network (GCN) or a feedforward network, and found that representations which capture an object’s context within the screen lead to better grounding accuracy. The new datasets, models and results provide an important first step on the challenging problem of grounding natural language instructions to mobile UI actions.

Conclusion
This research, and language grounding in general, is an important step toward translating multi-stage instructions into actions on a graphical user interface. Successful application of task automation to the UI domain has the potential to significantly improve accessibility: language interfaces might help individuals who are visually impaired perform tasks with interfaces that are predicated on sight. It also matters for situational impairment, when one cannot easily access a device while encumbered by other tasks.

By deconstructing the problem into action phrase extraction and language grounding, progress on either step can improve full task performance, and the approach alleviates the need for paired language-action datasets, which are difficult to collect at scale. For example, action span extraction is related to both semantic role labeling and the extraction of multiple facts from text, and could benefit from innovations in span identification and multitask learning. Reinforcement learning, which has been applied in previous grounding work, may help improve out-of-sample prediction for grounding in UIs and enable direct grounding from hidden state representations. Although our datasets were based on Android UIs, our approach can be applied generally to instruction grounding on other user interface platforms. Lastly, our work provides a technical foundation for investigating user experiences in language-based human-computer interaction.

Acknowledgements
Many thanks to my collaborators on this work at Google Research. Xin Zhou and Jiacong He contributed substantially to the data pipelines and the creation of the datasets. Yuan Zhang and Jason Baldridge provided much valuable advice for the project and contributed to the presentation of the work. Gang Li provided generous help for creating open-source datasets. Many thanks to Ashwin Kakarla, Muqthar Mohammad and Mohd Majeed for their help with the annotations.

Source: Google AI Blog


Expand your app beyond mobile to reach Android users at large

Posted by Sameer Samat, Vice President, Platforms & Ecosystems

From day one, we designed Android to be a flexible, adaptive platform.

Most people picture a smartphone when they think of Android, but Android also powers an amazing number of large-screen devices. In fact, there are more than 175 million Android tablets with the Google Play store[1], making Android tablets a vital form factor for Google and our OEM partners today. Android apps also run on Chrome OS laptops, and the number of monthly active users who enabled Android apps grew 250% in just the last year.[2]

Here at Google, we’re excited to see how you can take advantage of large-screen formats - including Samsung’s new Galaxy Tab S6, the upcoming Lenovo™ Smart Tab M8 with Google Assistant, the upcoming Samsung Fold, and other devices launching this week at IFA. Our OEM partners are building experiences that help users every day.

From the start, Android was designed as a platform that could handle multiple screen sizes. Over the years, we’ve continued to add functionality for developers to accommodate new devices and form factors.

  • We started with a phone. Developers could write Android apps that would work on phones of all sizes, all over the world. Part of what made this work was Android’s resource and layout system, which enabled applications to smoothly adapt to different screen sizes.
  • In Android 3.0 Honeycomb, we added support for tablets. In particular, capabilities like Fragments allow you to create applications that work across vastly different form factors.
  • Android 7 Nougat brought multi-window and multi-display capabilities, including the ability to drag-and-drop across apps. Meanwhile, Chrome OS added the capability to run Android applications on laptops. With some adjustments to handle different inputs and windowing dynamics, you could now reach app users in a desktop-style environment.
Android’s layout system helps applications smoothly resize and adjust their layout interactively.

  • Now, in Android 10, we’ve made even more enhancements for development on large screens. We’ve improved multi-window capabilities, making it easier to use multiple apps in parallel. We also continued improving multi-display support, enabling more multi-monitor use cases. And we made it easy for you to experiment with and test new form factors by adding a dedicated foldables emulator as well as publishing a foldables guide (a minimal sketch of reacting to multi-window changes follows).
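
As a small illustration of adapting to these capabilities, the Kotlin sketch below reacts when an app enters or leaves multi-window mode; the activity and layout-swapping helper are hypothetical.

```kotlin
import android.app.Activity
import android.content.res.Configuration

// Sketch (API 26+): respond when the window enters or leaves multi-window mode,
// e.g. to switch between compact and expanded layouts. MyActivity and
// useCompactLayout() are hypothetical placeholders.
class MyActivity : Activity() {
    override fun onMultiWindowModeChanged(
        isInMultiWindowMode: Boolean,
        newConfig: Configuration
    ) {
        super.onMultiWindowModeChanged(isInMultiWindowMode, newConfig)
        useCompactLayout(isInMultiWindowMode)
    }

    private fun useCompactLayout(compact: Boolean) {
        // Swap layouts or adjust UI density here based on the available space.
    }
}
```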

By optimizing your app to take advantage of different form factors, you have an opportunity to deliver richer, more engaging experiences to millions of users on larger screens. And if you don’t have access to physical devices, the Android Emulator supports all of the form factors mentioned above, from Chrome OS to phones and tablets.


The developers of apps like Mint, Evernote, and Asphalt are just a few who have seen success bringing their existing APKs to larger screens.

To learn more about optimizing your Android apps for richer experiences on tablets, Chrome OS laptops, foldables, and more, join us at the Android Developer Summit on October 23-24 — either in person or via the livestream — or check out our recap videos on YouTube.

Sources:

[1] The number of tablets only accounts for devices that have the Google Play Store installed (for example, this excludes tablets in China); the actual number of tablets capable of running Android applications is much larger.

[2] Google Internal Data, March 2018 to March 2019.

Flutter and Chrome OS: Better Together

Posted by the Flutter and Chrome OS teams

Chrome OS is the fast, simple, and secure operating system that powers Chromebooks, including the Google Pixelbook and millions of devices used by consumers and students every day. The latest Flutter release adds support for building beautiful, tailored Chrome OS applications, including rich support for keyboard and mouse, and tooling to ensure that your app runs well on a Chromebook. Furthermore, Chrome OS is a great developer workstation for building general-purpose Flutter apps, thanks to its support for developing and running Flutter apps locally on the same device.

Flutter is a great way to build Chrome OS apps

Since its inception, Flutter has shared many of the same principles as Chrome OS: productive, fast, and beautiful experiences. Flutter allows developers to build beautiful, fast UIs, while also providing a high degree of developer productivity, and a completely open-source engine, framework and tools. In short, it’s the ideal modern toolkit for building multi-platform apps, including apps for Chrome OS.

Flutter initially focused on providing a UI toolkit for building apps for mobile devices, which typically feature touch input and small screens. However, we’ve been building keyboard and mouse support into Flutter since before our 1.0 release last December. And today, we’re pleased to announce that Flutter for Chrome OS is now stronger with scroll wheel support, hover management, and better keyboard event support. In addition, Flutter has always been great at allowing you to build apps that run at any size (large screen or small), with seamless resizing, as shown here in the Chrome OS Best Practices Sample:

The Chrome OS best practices sample in action

The Chrome OS Hello World sample is an app built with Flutter that is optimized for Chrome OS. This includes a responsive UI to showcase how to reposition items and have layouts that respond well to changes in size from mobile to desktop.

Because Chrome OS runs Android apps, targeting Android is the way to build Chrome OS apps. However, while building Chrome OS apps on Android has always been possible, as described in these guidelines, it’s often difficult to know whether your Android app is going to run well on Chrome OS. To help with that problem, today we are adding a new set of lint rules to the Flutter tooling to catch violations of the most important of the Chrome OS best practice guidelines:

The Flutter Chrome OS lint rules in action

Once you put these Chrome OS lint rules in place, you’ll quickly see any problems in your Android app that would hamper it when running on Chrome OS. To learn how to take advantage of these rules, see the linting docs for Flutter Chrome OS.

But all of that is just the beginning -- the Flutter tools allow you to develop and test your apps directly on Chrome OS as well.

Chrome OS is a great developer platform to build Flutter apps

No matter what platform you're targeting, Flutter has support for rich IDEs and programming tools like Android Studio and Visual Studio Code. Over the last year, Chrome OS has been building support for running the Linux version of these tools with the beta of Linux on Chrome OS (aka Crostini). And, because Chrome OS also supports Android natively, you can configure the Flutter tooling to run your Android apps directly without an emulator involved.

The Flutter development tools running on Chrome OS

All of the great productivity of Flutter is available, including Stateful Hot Reload, seamless resizing, keyboard and mouse support, and so on. Recent improvements in Crostini, such as high DPI support, Crostini file system integration, easier adb, and so on, have made this experience even better! Of course, you don’t have to test against the Android container running on Chrome OS; you can also test against Android devices attached to your Chrome OS box. In short, Chrome OS is the ideal environment in which to develop and test your Flutter apps, especially when you’re targeting Chrome OS itself.

Customers love Flutter on Chrome OS

With its unique combination of simplicity, security, and capability, Chrome OS is an increasingly popular platform for enterprise applications. These apps often work with large quantities of data, whether it’s a chart or graph for visualization, or lists and forms for data entry. Flutter’s support for high-quality graphics, large-screen layouts, and input features (like text selection, tab order and mousewheel) makes it an ideal way to port mobile applications for the enterprise. One purveyor of such apps is AppTree, which uses Flutter and Chrome OS to solve problems for its enterprise customers.

“Creating a Chrome OS version of our app took very little effort. In 10 minutes we tweaked a few values and now our users have access to our app on a whole new class of devices. This is a huge deal for our enterprise customers who have been wanting access to our app on Desktop devices.”
--Matthew Smith, CTO, AppTree Software

By using Flutter to target Chrome OS, AppTree was able to start with their existing Flutter mobile app and easily adapt it to take advantage of the capabilities of Chrome OS.

Try Flutter on Chrome OS today!

If you’d like to target Chrome OS with Flutter, you can do so today simply by installing the latest version of Flutter. If you’d like to run the Flutter development tools on Chrome OS, you can follow these instructions to get started fast. To see a real-world app built with Flutter that has been optimized for Chrome OS, check out the Developer Quest sample that the Flutter DevRel team launched at the 2019 Google I/O conference. And finally, don’t forget to try out the Flutter Chrome OS linting rules to make sure that your Chrome OS apps are following the most important practices.

Flutter and Chrome OS go great together. What are you going to build?