
Responsible AI at Google Research: AI for Social Good

Google’s AI for Social Good team consists of researchers, engineers, volunteers, and others with a shared focus on positive social impact. Our mission is to demonstrate AI’s societal benefit by enabling real-world value, with projects spanning work in public health, accessibility, crisis response, climate and energy, and nature and society. We believe that the best way to drive positive change in underserved communities is by partnering with change-makers and the organizations they serve.

In this blog post we discuss work done by Project Euphonia, a team within AI for Social Good, that aims to improve automatic speech recognition (ASR) for people with disordered speech. For people with typical speech, an ASR model’s word error rate (WER) can be less than 10%. But for people with disordered speech patterns, such as stuttering, dysarthria and apraxia, the WER could reach 50% or even 90% depending on the etiology and severity. To help address this problem, we worked with more than 1,000 participants to collect over 1,000 hours of disordered speech samples and used the data to show that ASR personalization is a viable avenue for bridging the performance gap for users with disordered speech. We've shown that personalization can be successful with as little as 3-4 minutes of training speech using layer freezing techniques.
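For readers less familiar with the metric, WER is the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below is a minimal, illustrative implementation using one of the transcripts shown later in this post; production evaluation pipelines typically add text normalization before scoring.

```python
# Minimal WER sketch: word-level Levenshtein distance divided by the
# number of reference words. Illustrative only.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("i now have an xbox adaptive controller on my lap",
          "i now had an xbox adapter controller on my lamp"))
# 0.3 -> 3 substitutions over 10 reference words
```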

This work led to the development of Project Relate for anyone with atypical speech who could benefit from a personalized speech model. Built in partnership with Google’s Speech team, Project Relate enables people who find it hard to be understood by other people and technology to train their own models. People can use these personalized models to communicate more effectively and gain more independence. To make ASR more accessible and usable, we describe how we fine-tuned Google’s Universal Speech Model (USM) to better understand disordered speech out of the box, without personalization, for use with digital assistant technologies, dictation apps, and in conversations.


Addressing the challenges

Working closely with Project Relate users, we learned that personalized models can be very useful, but that for many users, recording dozens or hundreds of examples can be challenging. In addition, the personalized models did not always perform well in freeform conversation.

To address these challenges, Euphonia’s research efforts have focused on speaker-independent ASR (SI-ASR) to make models work better out of the box for people with disordered speech, so that no additional training is necessary.


Prompted Speech dataset for SI-ASR

The first step in building a robust SI-ASR model was to create representative dataset splits. We created the Prompted Speech dataset by splitting the Euphonia corpus into train, validation and test portions, while ensuring that each split spanned a range of speech impairment severity and underlying etiology and that no speakers or phrases appeared in multiple splits. The training portion consists of over 950k speech utterances from over 1,000 speakers with disordered speech. The test set contains around 5,700 utterances from over 350 speakers. Speech-language pathologists manually reviewed all of the utterances in the test set for transcription accuracy and audio quality.
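A speaker-disjoint split of this kind can be sketched in a few lines of Python. The snippet below is illustrative only: the record fields and split fractions are assumptions rather than the actual Euphonia corpus schema, and it omits the extra bookkeeping needed to balance severity and etiology across splits and to keep phrases disjoint.

```python
# Hypothetical sketch of a speaker-disjoint train/validation/test split:
# every speaker's utterances land in exactly one split, so no speaker
# appears in more than one of them.
import random
from collections import defaultdict

def split_by_speaker(utterances, train_frac=0.9, valid_frac=0.05, seed=0):
    by_speaker = defaultdict(list)
    for utt in utterances:                      # utt is e.g. a dict
        by_speaker[utt["speaker_id"]].append(utt)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = int(len(speakers) * train_frac)
    n_valid = int(len(speakers) * valid_frac)
    buckets = {
        "train": speakers[:n_train],
        "validation": speakers[n_train:n_train + n_valid],
        "test": speakers[n_train + n_valid:],
    }
    return {name: [u for spk in group for u in by_speaker[spk]]
            for name, group in buckets.items()}
```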


Real Conversation test set

Unprompted or conversational speech differs from prompted speech in several ways. In conversation, people speak faster and enunciate less. They repeat words, repair misspoken words, and use a more expansive vocabulary that is specific and personal to themselves and their community. To improve a model for this use case, we created the Real Conversation test set to benchmark performance.

The Real Conversation test set was created with the help of trusted testers who recorded themselves speaking during conversations. The audio was reviewed, any personally identifiable information (PII) was removed, and then that data was transcribed by speech-language pathologists. The Real Conversation test set contains over 1,500 utterances from 29 speakers.


Adapting USM to disordered speech

We then tuned USM on the training split of the Euphonia Prompted Speech set to improve its performance on disordered speech. Instead of fine-tuning the full model, our tuning was based on residual adapters, a parameter-efficient tuning approach that adds tunable bottleneck layers as residuals between the transformer layers. Only these layers are tuned, while the rest of the model weights are untouched. We have previously shown that this approach works very well to adapt ASR models to disordered speech. Residual adapters were only added to the encoder layers, and the bottleneck dimension was set to 64.
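For intuition, a residual adapter is a small bottleneck block attached to the output of a frozen layer and added back as a residual. The NumPy sketch below shows only the shape of the computation; it is not the USM implementation, and real adapters typically also include layer normalization and careful initialization.

```python
# Illustrative residual adapter: down-project, apply a nonlinearity,
# up-project, and add the result back to the frozen layer's output.
# Only these adapter weights are trained; the base model stays frozen.
import numpy as np

class ResidualAdapter:
    def __init__(self, model_dim: int, bottleneck_dim: int = 64):
        self.w_down = np.random.randn(model_dim, bottleneck_dim) * 0.01
        self.w_up = np.random.randn(bottleneck_dim, model_dim) * 0.01

    def __call__(self, hidden: np.ndarray) -> np.ndarray:
        # hidden: [time, model_dim] activations from a frozen encoder layer.
        bottleneck = np.maximum(hidden @ self.w_down, 0.0)   # ReLU
        return hidden + bottleneck @ self.w_up               # residual add
```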


Results

To evaluate the adapted USM, we compared it to older ASR models using the two test sets described above. For each test, we compare the adapted USM to the pre-USM model best suited to that task: (1) for short prompted speech, we compare to Google’s production ASR model optimized for short-form ASR; (2) for longer Real Conversation speech, we compare to a model trained for long-form ASR. USM’s improvements over pre-USM models can be explained by its relative size increase, from 120M to 2B parameters, and other improvements discussed in the USM blog post.

Model word error rates (WER) for each test set (lower is better).

We see that the USM adapted with disordered speech significantly outperforms the other models. The adapted USM’s WER on Real Conversation is 37% better than the pre-USM model, and on the Prompted Speech test set, the adapted USM performs 53% better.

These findings suggest that the adapted USM is significantly more usable for an end user with disordered speech. We can demonstrate this improvement by looking at transcripts of Real Conversation test set recordings from a trusted tester of Euphonia and Project Relate (see below).


Audio1 | Ground Truth | Pre-USM ASR | Adapted USM
(audio clip) | I now have an Xbox adaptive controller on my lap. | i now have a lot and that consultant on my mouth | i now had an xbox adapter controller on my lamp.
(audio clip) | I've been talking for quite a while now. Let's see. | quite a while now | i've been talking for quite a while now.

Example audio and transcriptions of a trusted tester’s speech from the Real Conversation test set.

A comparison of the Pre-USM and adapted USM transcripts revealed some key advantages:

  • The first example shows that the adapted USM is better at recognizing disordered speech patterns. The baseline misses key words like “Xbox” and “controller” that are important for a listener to understand what the speaker is trying to say.
  • The second example shows how deletions are a primary issue with ASR models that are not trained on disordered speech. Though the baseline model did transcribe a portion correctly, a large part of the utterance was not transcribed, losing the speaker’s intended message.

Conclusion

We believe that this work is an important step towards making speech recognition more accessible to people with disordered speech. We are continuing to work on improving the performance of our models. With the rapid advancements in ASR, we aim to ensure people with disordered speech benefit as well.


Acknowledgements

Key contributors to this project include Fadi Biadsy, Michael Brenner, Julie Cattiau, Richard Cave, Amy Chung-Yu Chou, Dotan Emanuel, Jordan Green, Rus Heywood, Pan-Pan Jiang, Anton Kast, Marilyn Ladewig, Bob MacDonald, Philip Nelson, Katie Seaver, Joel Shor, Jimmy Tobin, Katrin Tomanek, and Subhashini Venugopalan. We gratefully acknowledge the support Project Euphonia received from members of the USM research team including Yu Zhang, Wei Han, Nanxin Chen, and many others. Most importantly, we wanted to say a huge thank you to the 2,200+ participants who recorded speech samples and the many advocacy groups who helped us connect with these participants.


1Audio volume has been adjusted for ease of listening, but the original files would be more consistent with those used in training and would have pauses, silences, variable volume, etc. 

Source: Google AI Blog


Project GameFace makes gaming accessible to everyone

Posted by Avneet Singh, Product Manager and Sisi Jin, UX Designer, Google PI, and Lance Carr, Collaborator

At I/O 2023, Google launched Project Gameface, an open-source, hands-free gaming ‘mouse’ enabling people to control a computer's cursor using their head movement and facial gestures. People can raise their eyebrows to click and drag, or open their mouth to move the cursor, making gaming more accessible.

The project was inspired by the story of quadriplegic video game streamer Lance Carr, who lives with muscular dystrophy, a progressive disease that weakens muscles. And we collaborated with Lance to bring Project Gameface to life. The full story behind the product is available on the Google Keyword blog here.

It’s been an extremely interesting experience to think about how a mouse cursor can be controlled in such a novel way. We conducted many experiments and found head movement and facial expressions can be a unique way to program the mouse cursor. MediaPipe’s new Face Landmarks Detection API with blendshape option made this possible as it allows any developer to leverage 478 3-dimensional face landmarks and 52 blendshape scores (coefficients representing facial expression) to infer detailed facial surfaces in real-time.
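For developers who want to experiment with the same building block, the MediaPipe Tasks Python API exposes the landmarks and blendshapes roughly as follows. This is a minimal sketch: the model bundle path and image file are placeholders you would replace with your own.

```python
# Minimal sketch of MediaPipe's Face Landmarker task with blendshape output.
# "face_landmarker.task" is the model bundle available from the MediaPipe
# documentation; "frame.png" is a placeholder input image.
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.FaceLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,   # 52 expression coefficients per face
    num_faces=1,
)
detector = vision.FaceLandmarker.create_from_options(options)

image = mp.Image.create_from_file("frame.png")
result = detector.detect(image)

landmarks = result.face_landmarks[0]       # 478 normalized 3D landmarks
blendshapes = result.face_blendshapes[0]   # 52 scored facial gestures
print(len(landmarks), blendshapes[0].category_name, blendshapes[0].score)
```

In the real application the detector runs on a live webcam stream rather than a single image, but the per-frame outputs are the same.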


Product Construct and Details

In this article, we share technical details of how we built Project Gameface and the various open-source technologies we leveraged to create this exciting product!


Using head movement to move the mouse cursor


Caption: Controlling head movement to move mouse cursors and customizing cursor speed to adapt to different screen resolutions.

Through this project, we explored the concept of using head movement to move the mouse cursor. We focused on the forehead and iris as our two landmark locations. Both forehead and iris landmarks are known for their stability. However, Lance noticed that the cursor didn't work well when using the iris landmark: the iris may move slightly when people blink, causing the cursor to move unintentionally. Therefore, we decided to use the forehead landmark as the default tracking option.

There are instances where people may encounter challenges when moving their head in certain directions. For example, Lance can move his head more quickly to the right than to the left. To address this issue, we introduced a user-friendly solution: a separate cursor speed adjustment for each direction. This feature allows people to customize the cursor's movement according to their preferences, facilitating smoother and more comfortable navigation.
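Under the hood, logic along the lines of the following can map forehead-landmark motion to cursor motion with a separate speed for each direction. This is our own illustration rather than the exact Project Gameface code; the landmark index and speed values are placeholders that a user would tune in the UI.

```python
# Illustrative mapping from head movement to cursor movement with
# per-direction speeds. Landmark index and speed values are placeholders.
import pyautogui

SPEED = {"left": 900, "right": 1400, "up": 1100, "down": 1100}
FOREHEAD = 8  # placeholder index of a stable forehead landmark

def move_cursor(prev_landmark, cur_landmark):
    # prev_landmark / cur_landmark: the forehead landmark from two
    # consecutive frames (normalized coordinates in [0, 1]).
    dx = cur_landmark.x - prev_landmark.x
    dy = cur_landmark.y - prev_landmark.y
    gain_x = SPEED["right"] if dx > 0 else SPEED["left"]
    gain_y = SPEED["down"] if dy > 0 else SPEED["up"]
    pyautogui.moveRel(dx * gain_x, dy * gain_y)
```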

We wanted the experience to be as smooth as using a handheld controller. Jitteriness of the mouse cursor was one of the major problems we wanted to overcome. Cursor jitter is influenced by various factors, including the user's setup, camera, noise, and lighting conditions. We implemented an adjustable cursor smoothing feature so users can easily fine-tune it to best suit their specific setup.
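One simple way to implement adjustable smoothing is an exponential moving average over the raw landmark position, with the smoothing factor exposed as the user-facing setting. Again, this is an illustrative sketch rather than the project's exact filter.

```python
# Exponential moving average smoother for the tracked landmark position.
# alpha close to 1.0 follows the head tightly (more jitter); smaller
# values smooth more aggressively (more lag).
class CursorSmoother:
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.state = None  # smoothed (x, y), or None before the first frame

    def update(self, x: float, y: float) -> tuple[float, float]:
        if self.state is None:
            self.state = (x, y)
        else:
            sx, sy = self.state
            self.state = (sx + self.alpha * (x - sx),
                          sy + self.alpha * (y - sy))
        return self.state
```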


Using facial expressions to perform mouse actions and keyboard press

Very early on, one of our primary insights was that people have varying comfort levels making different facial expressions. A gesture that comes easily to one user may be extremely difficult for another to make deliberately. For instance, Lance can move his eyebrows independently with ease, while the rest of the team struggled to match his skill. Hence, we decided to build functionality that lets people customize which expressions they use to control the mouse.

Caption: Using facial expressions to control mouse

Think of it as a custom binding of a gesture to a mouse action. When deliberating about which mouse actions the product should cover, we tried to capture common scenarios, ranging from left and right clicks to scrolling up and down. However, using the head to control cursor movement is a different experience from using a conventional mouse, so we also wanted to give users the option to reset the cursor to the center of the screen with a facial gesture.

Caption: Using facial expressions to control keyboard

The most recent release of MediaPipe Face Landmarks Detection brings an exciting addition: blendshape output. With this enhancement, the API generates 52 face blendshape values that represent the expressiveness of 52 facial gestures, such as raising the left eyebrow or opening the mouth. These values can be mapped to control a wide range of functions, offering users expanded possibilities for customization and manipulation.
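Continuing the Face Landmarker sketch from earlier, the blendshape output can be flattened into a simple name-to-score dictionary, which is a convenient representation for binding gestures to actions (the binding itself is sketched further below). The function is illustrative; the category names mentioned in the comment come from the standard blendshape set the API reports.

```python
def blendshape_scores(result) -> dict[str, float]:
    # Map blendshape names (e.g. "browOuterUpLeft", "jawOpen") to their
    # 0..1 scores for the first detected face.
    if not result.face_blendshapes:
        return {}
    return {b.category_name: b.score for b in result.face_blendshapes[0]}
```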

We’ve been able to extend the same functionality and add the option for keyboard bindings too, so people can use their facial gestures to press keyboard keys in the same binding fashion.


Set Gesture Size to see when to trigger a mouse/keyboard action


Caption: Set the gesture size to trigger an action

While testing the software, we found that facial expressions are more or less pronounced for each of us, so we incorporated the idea of a gesture size, which lets people control the extent to which they need to gesture to trigger a mouse action. Blendshape coefficients were helpful here: each user can now set a different threshold for each specific expression, customizing the experience to their comfort.
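In code, the gesture size becomes a per-expression threshold that a blendshape score must exceed before its bound action fires. The expressions and values below are placeholders chosen to show the idea, not Project Gameface's actual defaults.

```python
# Per-user "gesture size": the blendshape score a gesture must exceed
# before its bound action triggers. Names and values are placeholders.
GESTURE_SIZE = {
    "browOuterUpLeft": 0.5,   # raise left eyebrow
    "jawOpen": 0.35,          # open mouth
}

def triggered_gestures(scores: dict[str, float]) -> list[str]:
    return [name for name, threshold in GESTURE_SIZE.items()
            if scores.get(name, 0.0) >= threshold]
```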


Keeping the camera feed available

Another key insight we received from Lance was that gamers often have multiple cameras. For our machine learning models to operate optimally, it’s best to have a camera pointing straight at the user’s face with decent lighting. So we incorporated the ability for users to select the camera that best frames them and gives the best performance.

Our product's user interface incorporates a live camera feed, providing users with real-time visibility of their head movements and gestures. This feature brings several advantages. Firstly, users can set thresholds more effectively by directly observing their own movements. The visual representation enables informed decisions on appropriate threshold values. Moreover, the live camera feed enhances users' understanding of different gestures as they visually correlate their movements with the corresponding actions in the application. Overall, the camera feed significantly enhances the user experience, facilitating accurate threshold settings and a deeper comprehension of gestures.


Product Packaging

Our next step was to create the ability to control the mouse and keyboard using our custom defined logic. To enable mouse and keyboard control within our Python application, we utilize two libraries: PyAutoGUI for mouse control and PyDirectInput for keyboard control. PyAutoGUI is chosen for its robust mouse control capabilities, allowing us to simulate mouse movements, clicks, and other actions. On the other hand, we leverage PyDirectInput for keyboard control as it offers enhanced compatibility with various applications, including games and those relying on DirectX.
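Putting the two libraries together, a gesture that crosses its threshold can be dispatched either as a mouse action through PyAutoGUI or as a key press through PyDirectInput. The specific gesture-to-action bindings below are placeholders; in the product they are user-configurable.

```python
# Illustrative dispatch of triggered gestures to mouse and keyboard actions.
import pyautogui
import pydirectinput

MOUSE_BINDINGS = {
    "browOuterUpLeft": lambda: pyautogui.click(button="left"),
    "browOuterUpRight": lambda: pyautogui.click(button="right"),
    "mouthPucker": lambda: pyautogui.scroll(-120),   # scroll down
}
KEY_BINDINGS = {
    "jawOpen": "space",   # placeholder key binding
}

def dispatch(gestures: list[str]) -> None:
    for gesture in gestures:
        if gesture in MOUSE_BINDINGS:
            MOUSE_BINDINGS[gesture]()
        elif gesture in KEY_BINDINGS:
            # PyDirectInput sends scan codes, which DirectX games accept.
            pydirectinput.press(KEY_BINDINGS[gesture])
```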

For our application packaging, we used PyInstaller to turn our Python-based application into an executable, making it easier for users to run our software without the need for installing Python or additional dependencies. PyInstaller provides a reliable and efficient means to distribute our application, ensuring a smooth user experience.

The product introduces a novel form factor to engage users in an important function like handling the mouse cursor. Making the product and its UI intuitive and easy to follow was a top priority for our design and engineering team. We worked closely with Lance to incorporate his feedback into our UX considerations, and we found CustomTkinter was able to handle most of our UI considerations in Python.

We’re excited to see the potential of Project GameFace and can’t wait for developers and enterprises to leverage it to build new experiences. The code for GameFace is open-sourced on GitHub here.


Acknowledgements

We would like to acknowledge the invaluable contributions of the following people to this project: Lance Carr, David Hewlett, Laurence Moroney, Khanh LeViet, Glenn Cameron, Edwina Priest, Joe Fry, Feihong Chen, Boon Panichprecha, Dome Seelapun, Kim Nomrak, Pear Jaionnom, Lloyd Hightower

Designing for Wear OS: Getting started with designing inclusive smartwatch apps

Posted by Matthew Pateman & Mallory Carroll (UX Research), and Josef Burnham (UX Design)

Smartwatches are becoming increasingly popular, with many people using them to stay connected, track their health, and control their devices. Watches enable people to get information at a glance and then take action. These quick and frequent interactions can help people get back to being present in their daily lives.

To help with the challenges of designing and building great watch experiences that work for all, we have created a series of videos. These videos cover a variety of topics starting with how to understand what people want from a smartwatch app. We cover how best to design for your target audience, and how to make the most of the watch’s form factor with a series of design principles. Lastly, we give you an introduction on how to approach product inclusion throughout the whole development lifecycle, and how this approach can help make your products better for all. If you’re interested in learning more, be sure to check out the videos below.


1. Introduction to UX Research & Product Inclusion on Wear OS

If you’re considering building a smartwatch app but don’t know how to begin, this video will help you get started. It shows how to uncover what people want from a smartwatch app, what a great Wear OS experience should look like, and how to ensure it addresses real needs of the people you are building for. Lastly, you’ll find out how to take an equity-focused approach when developing products, apps, and experiences.


2. Introduction to UX Design on Wear OS

Did you know that the average smartwatch interaction is approximately 5 seconds long? In this video you will learn how to design effective and engaging experiences for Wear OS. We’ll guide you on how to make the most out of these short watch interactions by covering key differences between mobile and smartwatch design, the importance of a glanceable user experience, and practical tips for designing for different Wear OS surfaces.


3. Introduction to Product Inclusion & Equity

We will introduce you to Product Inclusion and Equity, and how to approach it when designing for Wear OS. You will learn how to build for belonging and make products more accessible and usable by all.


4. Case Studies: Inclusion and Exclusion in Technology Design

Here you will see a series of case studies showing how product and design choices can have an impact at a personal, community, and systemic level. Designs can be affirming and inclusive, or harmful and exclusionary, to various people and communities. We’ll use some examples to highlight how important inclusion and equity considerations are when making product decisions.


5. Considerations for Community Co-Design

The last video in this series will give you an introduction to community co-design, a powerful approach that focuses on building solutions with, not for, historically marginalized communities. In community co-design, we engage with people based on identity, culture, community, and context. You’ll find out how to engage people and communities in a safe, respectful, and equity-centered way in product development.


Keep your eyes peeled for more updates from us as we continue to share and evolve our latest design thinking and practices, principles, and guidelines.

We also have many more resources to help get you started designing for Wear OS:

  • Find inspiring designs for different types of apps in our gallery
  • Interested in designing for multiple devices, from TVs to mobiles to tablets? Check out our design hub
  • Access developer documentation for Wear OS