Tag Archives: accessibility

Bringing Live Transcribe’s Speech Engine to Everyone

Earlier this year, Google launched Live Transcribe, an Android application that provides real-time automated captions for people who are deaf or hard of hearing. Through many months of user testing, we've learned that robustly delivering good captions for long-form conversations isn't so easy, and we want to make it easier for developers to build upon what we've learned. Live Transcribe's speech recognition is provided by Google's state-of-the-art Cloud Speech API, which under most conditions delivers pretty impressive transcript accuracy. However, relying on the cloud introduces several complications—most notably robustness to ever-changing network connections, data costs, and latency. Today, we are sharing our transcription engine with the world so that developers everywhere can build applications with robust transcription.

Those who have worked with our Cloud Speech API know that sending infinitely long streams of audio is currently unsupported. To work around this limit, we close and restart streaming requests before hitting the timeout, preferring to restart the session during long periods of silence or at a detected pause in the speech; closing mid-utterance would truncate a sentence or word. In between sessions, we buffer audio locally and send it upon reconnection. This reduces the amount of text lost mid-conversation, whether due to restarting speech requests or switching between wireless networks.
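As a rough illustration of that restart-and-buffer behavior (a sketch, not Live Transcribe's actual code), the following keeps a streaming session alive indefinitely. The open_stream, send_audio, and is_pause callbacks are hypothetical placeholders for a streaming ASR client and a pause detector.

```python
import collections
import time

STREAM_TIME_LIMIT_S = 240          # restart well before the server-side timeout
pending = collections.deque()      # audio buffered while waiting to (re)send


def transcribe_forever(mic_chunks, open_stream, send_audio, is_pause):
    """Keep a streaming ASR session alive indefinitely (illustrative sketch)."""
    stream = open_stream()
    started = time.monotonic()
    for chunk in mic_chunks:
        pending.append(chunk)
        expired = time.monotonic() - started > STREAM_TIME_LIMIT_S
        # Restart only at a detected pause so no word is cut mid-utterance.
        if expired and is_pause(chunk):
            stream.close()
            stream = open_stream()          # may block while reconnecting
            started = time.monotonic()
        # Flush everything buffered since the last successful send.
        while pending:
            send_audio(stream, pending.popleft())
```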



Endlessly streaming audio comes with its own challenges. In many countries, network data is quite expensive and in spots with poor internet, bandwidth may be limited. After much experimentation with audio codecs (in particular, we evaluated the FLAC, AMR-WB, and Opus codecs), we were able to achieve a 10x reduction in data usage without compromising accuracy. FLAC, a lossless codec, preserves accuracy completely, but doesn't save much data. It also has noticeable codec latency. AMR-WB, on the other hand, saves a lot of data, but delivers much worse accuracy in noisy environments. Opus was a clear winner, allowing data rates many times lower than most music streaming services while still preserving the important details of the audio signal—even in noisy environments. Beyond relying on codecs to keep data usage to a minimum, we also support using speech detection to close the network connection during extended periods of silence. That means if you accidentally leave your phone on and running Live Transcribe when nobody is around, it stops using your data.
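The speech detection mentioned above can be pictured as an energy-threshold silence watchdog like the sketch below; the threshold and timeout values are assumptions, and real voice activity detection is more sophisticated than a raw RMS check.

```python
import numpy as np

SILENCE_RMS = 500.0       # energy threshold; tune per microphone (assumption)
SILENCE_TIMEOUT_S = 10.0  # close the connection after this much continuous quiet


def silence_watchdog(pcm16_chunks, chunk_seconds):
    """Yield True once the incoming 16-bit PCM audio has been quiet long enough."""
    quiet = 0.0
    for chunk in pcm16_chunks:
        samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float32)
        rms = np.sqrt(np.mean(samples ** 2)) if samples.size else 0.0
        quiet = quiet + chunk_seconds if rms < SILENCE_RMS else 0.0
        yield quiet >= SILENCE_TIMEOUT_S   # caller closes the network connection
```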

Finally, we know that if you are relying on captions, you want them immediately, so we've worked hard to keep latency to a minimum. Though most of the credit for speed goes to the Cloud Speech API, Live Transcribe's final trick lies in our custom Opus encoder. At the cost of only a minor increase in bitrate, we see latency that is visually indistinguishable from that of sending uncompressed audio.

Today, we are excited to make all of this available to developers everywhere. We hope you'll join us in trying to build a world that is more accessible for everyone.

By Chet Gnegy, Alex Huang, and Ausmus Chang from the Live Transcribe Team

Project Euphonia’s Personalized Speech Recognition for Non-Standard Speech



The utility of technology is dependent on its accessibility. One key component of accessibility is automatic speech recognition (ASR), which can greatly improve the ability of those with speech impairments to interact with everyday smart devices. However, ASR systems are most often trained on 'typical' speech, which means that underrepresented groups, such as those with speech impairments or heavy accents, don't experience the same degree of utility. For example, amyotrophic lateral sclerosis (ALS) is a disease that can adversely affect a person’s speech; about 25% of people with ALS experience slurred speech as their first symptom. In addition, most people with ALS eventually lose the ability to walk, so being able to interact with automated devices from a distance can be very important. Yet current state-of-the-art ASR models can yield high word error rates (WER) for speakers with even a moderate speech impairment from ALS, effectively barring access to ASR-reliant technologies.

In “Personalizing ASR for Dysarthric and Accented Speech with Limited Data,” to be presented at Interspeech 2019, we describe some of the research behind Project Euphonia, an ASR platform that performs speech-to-text transcription. This work presents an approach to improve ASR for people with ALS that may also be applicable to many other types of non-standard speech. Using a two-step training approach that starts with a baseline “standard” corpus and then fine-tunes the training with a personalized speech dataset, we have demonstrated significant improvements for speakers with atypical speech over current state-of-the-art models.

A Two-Phased Approach to Training
In order to create ASR models that work on non-standard speech, one needs to overcome two challenges. The first is that within a particular class of atypical speech, be it a regional accent or a speech impairment, for example, individuals can exhibit very different ways of speaking. Our approach deals with this sub-group heterogeneity by training the ASR model in two phases. We start with a high-quality ASR model trained on thousands of hours of standard speech and then we fine-tune parts of the model to an individual with non-standard speech. This approach is similar to that of Parrotron: both systems use end-to-end neural networks to help improve communication and accessibility, but Parrotron focuses exclusively on speech-to-speech, where a person’s speech is converted directly into synthesized speech, rather than text.

The second challenge arises from the difficulty in collecting enough data to train a state-of-the-art recognizer for individuals. Typical speech recognizers are trained on thousands of hours of speech from many different speakers. Acquiring this much data from a single speaker is nearly impossible, especially if the speaker may experience exhaustion from speaking due to a medical condition. Our approach overcomes this issue by first training a base model on a large corpus of typical speech, and then training a personalized model using a much smaller dataset with the targeted non-standard speech characteristics.
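As a rough sketch of this two-phase recipe (not Euphonia's actual training code), assume a PyTorch-style base_model already trained on typical speech and a small personal_loader over the target speaker's recordings; personalization is then a short, low-learning-rate fine-tuning pass.

```python
import torch


def personalize(base_model, personal_loader, epochs=5, lr=1e-4):
    """Phase 2: adapt a model pretrained on typical speech to one speaker.

    base_model      : ASR model already trained on thousands of hours (phase 1)
    personal_loader : DataLoader over the speaker's small recorded corpus
    """
    optimizer = torch.optim.Adam(base_model.parameters(), lr=lr)
    base_model.train()
    for _ in range(epochs):
        for audio_features, target_tokens in personal_loader:
            loss = base_model(audio_features, target_tokens)  # assume the model returns its training loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return base_model
```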

The Neural Network Architecture
When developing the models used for training on atypical speech, we explored two different neural architectures. The first is the RNN-Transducer (RNN-T), a neural network architecture consisting of encoder and decoder networks that has shown good results on numerous ASR tasks. The encoder is bidirectional (i.e., it looks at the entire sentence at once in order to provide context), and thus it requires the entire audio sample to perform speech recognition.

The other architecture we explored was Listen, Attend, and Spell (LAS), an attention-based, sequence-to-sequence model that maps sequences of acoustic features to sequences of linguistic units. This model uses an encoder to convert the sequence of acoustic frames to a sequence of internal representations, and a decoder to convert the sequence of internal representations to linguistic output. The network produces “word pieces”, which are a linguistic representation between graphemes and words.
Comparison of the RNN-Transducer (left) and Listen, Attend, Spell (right) architectures. From Prabhavalkar et al. 2017.
We experimented with fine-tuning the state-of-the-art RNN-T and LAS base models on two types of non-standard speech. In partnership with the ALS Therapy Development Institute, we first collected about 36 hours of audio from 67 speakers who have ALS. The participants recorded themselves on their home computers using custom software while they read sentences from a very restricted language domain. Many phrases were single sentences with simple grammatical structure (e.g., “What time is the basketball game on tonight?”). This is in contrast with unrestricted language domains, which include domain-specific vocabulary (e.g., science talks) and complex language structure (e.g., a debate). The recordings did not include many of the filler words common in normal speech, such as “um” and “uh”.

We also tested accented speech, using the open source L2 Arctic dataset of non-native speech, which consists of 20 speakers with approximately 1 hour of speech per speaker. Each speaker recorded a set of 1150 utterances from the CMU Arctic prompts.

Euphonia Model | Standard Speech Model
Did I have anything to say about it? | Dictatorship angels to think about it
Come right back please | Cameras object
Let’s try that again | It extracts
Turn it down a little bit please | Turning down a little bit please
The audio clips are recordings of a speaker with ALS; the text transcriptions are output from the Euphonia model (left) and the Standard Speech model (right). In the original post, incorrectly transcribed text is underlined.
Results
The absolute word error rates on the language-restricted test set are shown below. There is an improvement over the baseline model for very non-standard speech (heavy accents and ALS speech below 3 on the ALS Functional Rating Scale) and moderate improvements for ALS speech that is closer to typical speech. The relative difference between the base model and the fine-tuned model demonstrates that the majority of the improvement comes from the fine-tuning process, except in the case of the RNN-T on the Arctic dataset, where the RNN-T baseline is already strong.
Test set footnotes:
1. Non-native English speech from the L2-Arctic dataset.
2. Low FRS (ALS Functional Rating Scale) speech: intelligible with repeating (FRS 2), or speech combined with non-vocal communication (FRS 1).
3. FRS 3: detectable speech disturbance.
The RNN-T model achieved 91% of the improvement by fine-tuning just two layers, most of which are close to the input. On the accented dataset, fine-tuning the same two layers achieved 86% of the relative improvement compared to fine-tuning the entire network. This is consistent with previous speech work.
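One way to read "fine-tuning just two layers" in code is to freeze every parameter except the chosen input-side layers before building the optimizer. The PyTorch-style sketch below is illustrative only; the layer name prefixes are placeholders, not the names used in Euphonia's models.

```python
import torch


def freeze_all_but(model, trainable_prefixes=("encoder.layer0", "encoder.layer1")):
    """Freeze the network except the named (input-side) layers, then build an optimizer."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    # Only the unfrozen parameters are handed to the optimizer for fine-tuning.
    return torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```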

Most of the performance gains were achieved early in training. The models we trained were tested on a relatively limited domain of vocabulary and linguistic complexity, so the performance numbers are not necessarily related to how well the models perform on more general tasks. We hope that just fine-tuning part of the network allows it to retain the acoustic and linguistic information from the general speech model, while needing minimal modifications to adapt to a single new speaker. Future work will test this hypothesis.
Low FRS corresponds to the ALS speakers with low intelligibility (FRS 2, 1), while high FRS corresponds to ALS speakers with less severely impacted speech (FRS 3).
Understanding Model Behavior
To better understand how our models improved after fine-tuning, we looked at the pattern of phoneme mistakes. We started by comparing the distribution of phoneme mistakes made by the base ASR model on standard speech to the mistakes made on ALS speech. The SAMPA phonemes with the five largest differences between the ALS data and standard speech are p, U, f, k, and Z, which account for 20% of the deletion mistakes. Similarly, the n and m phonemes together account for 17% of the insertion / substitution mistakes. The same analysis on our fine-tuned models verifies that the unrecognized phoneme distribution is more similar to that of standard speech.
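The kind of analysis described here can be reproduced with a simple tally over aligned phoneme pairs. In the hypothetical sketch below, each pair is (reference, hypothesis), with None on one side marking a deletion or insertion.

```python
from collections import Counter


def phoneme_error_profile(aligned_pairs):
    """Tally mistakes per phoneme from aligned (reference, hypothesis) pairs."""
    deletions, insertions, substitutions = Counter(), Counter(), Counter()
    for ref, hyp in aligned_pairs:
        if ref == hyp:
            continue                      # correct phoneme
        elif hyp is None:
            deletions[ref] += 1           # reference phoneme was dropped
        elif ref is None:
            insertions[hyp] += 1          # extra phoneme was produced
        else:
            substitutions[ref] += 1       # one phoneme mistaken for another
    return deletions, insertions, substitutions

# Comparing these profiles before and after fine-tuning shows whether the
# error distribution moves toward that of typical speech.
```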

Our analysis shows that there are two aspects to every mistake: which phoneme the system doesn’t understand, and which phoneme the system thinks was said. Imagine having two systems with identical accuracy: one system always thinks that the f phoneme is actually the g phoneme, while another doesn't know what the f phoneme is and randomly guesses. These two systems will have identical performance and identical distributions of phoneme mistakes, but very different distributions of the predicted phoneme when a mistake is made. Surprisingly, ASR mistakes on ALS speech are far more similar to regular speech mistakes after Euphonia fine-tuning.
Deletion / substitution mistakes per SAMPA phoneme on ALS speech before fine-tuning, ALS speech after fine-tuning, and on typical speech (Librispeech dataset).
Future Work
In the future, we intend to explore additional techniques that can be helpful in the low data regime. We also hope to use phoneme mistakes to weight certain examples during training, or to pick training sentences for people with ALS to record that contain the most common phoneme mistakes. We would like to explore pooling data from multiple speakers with similar conditions.

We hope that continued research in this area will help voice interfaces become accessible to more people, especially those who need it most. One key component to this is collecting data. Anyone 18 or older can help us build better personalized models by donating audio data. If you’re interested, you can fill out this form to allow Google to contact you.

Acknowledgements
This work would not have been possible without the extraordinary effort and support of the ALS Therapy Development Institute and the ALS community, especially Fernando Vieira, Maeve McNally, Taylor Charbonneau, Melissa Nollstadt, and the individuals with ALS who kindly and patiently volunteered their audio. This work builds on the pioneering advances in speech recognition made by Google's speech team, in particular the recent development and deployment of end-to-end speech recognition models. We are grateful to the Google speech team for advice and collaboration, particularly to Anshuman Tripathi and Hasim Sak who guided us in training the initial models. We’d also like to thank Oran Lang, Omry Tuval, Michael Brenner, Julie Cattiau, Tara Sainath, Ding Zhao, Qiao Liang, Chung-Cheng Chiu, Dan Liebling, Ron Weiss, Anjuli Kannan, Dimitri Kanevsky, Ryan He, Gabor Simko, Benjamin Lee, Françoise Beaufays, Khe Chai Sim, Jimmy Tobin, Chet Gnegy, Jacqueline Huang, Ye Jia, Yu Zhang, Yonghui Wu, Michelle Ramanovich, Rus Heywood, Katrin Tomanek, Bob MacDonald, Pan-Pan Jiang, Ronnie Maor, Rif A. Saurous, Trevor Strohman, Dick Lyon, Avinatan Hassidim, Philip Nelson, and Yossi Matias for their technical contributions and project guidance.

Source: Google AI Blog


For individuals with paralysis, Google Nest gives help at home

Editor’s note: Today's post comes from Garrison Redd, who shares how his Google Home Mini helped him regain independence, and how it can improve the lives of people living with paralysis.

It’s been nearly 20 years since my life changed—that’s two decades of learning to navigate life in a wheelchair. There are many obstacles for people living with paralysis, so I have to find creative ways to get things done. While I’m more independent than most, there have been times when I couldn’t join my friends for a drink because the bar had steep steps. Or I’ve been on a date where there wasn’t space between tables so everyone had to get up and cause a commotion. 

But some of the greatest challenges and hurdles I face are at home. When you’re paralyzed, your home goes from being a place of comfort and security to a reminder of what you’ve lost. Light switches and thermostats are usually too high up on the wall and, if my phone falls on the floor, I may not be able to call a friend or family member if I need help. These may seem like simple annoyances but, to members of the paralysis community, they reinforce the lack of control and limitations we often face.

This changed when the Christopher & Dana Reeve Foundation and Google Nest started a project to understand how technology can benefit people living with paralysis. Google Nest is providing up to 100,000 Google Home Minis to help them. I’ve been using mine for a few months, and it’s helped me control my environment, gain more independence, and have a little fun—all with my voice. 

If you’re not familiar with Mini, it’s a small and mighty smart speaker that gives you help when you need it. The first thing I did was connect Mini to my Nest Thermostat (the one that’s a tad too high). "Hey Google, turn down the thermostat" is especially useful these days in the summer heat. I’m training for the 2020 Paralympic Games as a powerlifter for Team USA, so I use my Mini to set alarms, manage my training schedule, and even make grocery lists. Music is a huge motivator for me, and with Mini, I listen to Spotify playlists and get pumped up before a workout. 

I can have fun with my Mini, too. I’ve tried my hand at trivia by saying, “Hey Google, let’s play lucky trivia.” I’ve dropped a beat with “Hey Google, beat box,” and I enjoy listening to my Google Play audiobooks. And, on a serious note, I know that if I need help but cannot reach my phone, I can use my Mini to call my mom or cousin using only my voice. 

29 years ago today, the Americans with Disabilities Act was passed, landmark legislation that made public spaces more accessible for everyone. Unfortunately, the world isn’t flat and there are still many obstacles for people living with paralysis. I'm hopeful that Google Nest can help more people make their homes that much easier to navigate, just as it has for me.

Individuals living with paralysis and their caregivers can sign up to get a little help around the home with a Google Home Mini—here’s how you can find out if you’re eligible. If you’d like to help through a donation, you can ask your Assistant, “Hey Google, donate to the Christopher and Dana Reeve Foundation.” Through your voice, you can offer a little bit of help that will go a long way. 


With Sound Amplifier, more people can hear clearly

For the 466 million people in the world who have hearing loss, the inability to hear a conversation or the sounds around you can be isolating. Without clear sound, it’s challenging to connect to the people around you and fully experience the world. And simply asking others to speak louder (or turn up the TV volume) isn’t a helpful solution because people hear more clearly at different audio frequencies.

Sound Amplifier is an Android Accessibility app that helps people hear more clearly, and it’s now available on devices running Android 6.0 Marshmallow and above. Using machine learning, we sorted through thousands of publicly available hearing studies and datasets to understand how people hear in different environments, and created a few simple controls.

Here’s how it works: When you plug in your headphones and use Sound Amplifier, you can customize frequencies to augment important sound, like the voices of the people you are with, and filter out background noise. It can help you hear conversations in noisy restaurants more clearly, amplify the sound coming from TV at personalized frequency levels without bothering others, or boost the voices of presenters at a lecture.
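To make the "customize frequencies" idea concrete, here is a toy illustration (in no way Sound Amplifier's actual signal processing) that boosts one frequency band of a mono signal; a real app would apply tuned filters to a live audio stream instead.

```python
import numpy as np


def boost_band(samples, sample_rate, low_hz, high_hz, gain_db):
    """Toy equalizer: amplify one frequency band of a mono float signal."""
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[band] *= 10 ** (gain_db / 20.0)          # dB -> linear gain
    return np.fft.irfft(spectrum, n=len(samples))


# Example: lift 1-4 kHz, a range that carries much of speech intelligibility.
# louder_voice = boost_band(mic_samples, 44100, 1000, 4000, gain_db=12)
```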

For some people, it may be hard to know when Sound Amplifier is detecting or enhancing sound. So we added an audio visualization feature that shows when sound is detected, helping you visualize the changes you’re making to it. Like the volume number on your TV, it tells you how much the sound is boosted even if you can’t hear it yet. There are a couple of new visual updates, too. You can launch the app directly from your phone’s home screen instead of tapping into Accessibility settings, and with the reorganized control settings, you can easily tap between boosting your sound and filtering out background noise.


Sound Amplifier v2

Caption: Sound Amplifier has a new look and feel with an audio visualization feature.

Sound Amplifier is the latest step in our commitment to make audio clear and accessible for everyone. And we’ll continue to improve the app through new features that enhance sound for all types of hearing.


Download the Sound Amplifier app on Google Play today on your Android device to enhance the sound around you.

Source: Android


Parrotron: New Research into Improving Verbal Communication for People with Speech Impairments



Most people take for granted that when they speak, they will be heard and understood. But for the millions who live with speech impairments caused by physical or neurological conditions, trying to communicate with others can be difficult and lead to frustration. While there have been a great number of recent advances in automatic speech recognition (ASR; a.k.a. speech-to-text) technologies, these interfaces can be inaccessible for those with speech impairments. Further, applications that rely on speech recognition as input for text-to-speech synthesis (TTS) can exhibit word substitution, deletion, and insertion errors. Critically, in today’s technological environment, limited access to speech interfaces, such as digital assistants that depend on directly understanding one's speech, means being excluded from state-of-the-art tools and experiences, widening the gap between what those with and without speech impairments can access.

Project Euphonia has demonstrated that speech recognition models can be significantly improved to better transcribe a variety of atypical and dysarthric speech. Today, we are presenting Parrotron, an ongoing research project that continues and extends our effort to build speech technologies to help those with impaired or atypical speech to be understood by both people and devices. Parrotron consists of a single end-to-end deep neural network trained to convert speech from a speaker with atypical speech patterns directly into fluent synthesized speech, without an intermediate step of generating text—skipping speech recognition altogether. Parrotron’s approach is speech-centric, looking at the problem only from the point of view of speech signals—e.g., without visual cues such as lip movements. Through this work, we show that Parrotron can help people with a variety of atypical speech patterns—including those with ALS, deafness, and muscular dystrophy—to be better understood in both human-to-human interactions and by ASR engines.
The Parrotron Speech Conversion Model
Parrotron is an attention-based sequence-to-sequence model trained in two phases using parallel corpora of input/output speech pairs. First, we build a general speech-to-speech conversion model for standard fluent speech, followed by a personalization phase that adjusts the model parameters to the atypical speech patterns from the target speaker. The primary challenge in such a configuration lies in the collection of the parallel training data needed for supervised training, which consists of utterances spoken by many speakers and mapped to the same output speech content spoken by a single speaker. Since it is impractical to have a single speaker record the many hours of training data needed to build a high quality model, Parrotron uses parallel data automatically derived with a TTS system. This allows us to make use of a pre-existing anonymized, transcribed speech recognition corpus to obtain training targets.

The first training phase uses a corpus of ~30,000 hours that consists of millions of anonymized utterance pairs. Each pair includes a natural utterance paired with an automatically synthesized speech utterance that results from running our state-of-the-art Parallel WaveNet TTS system on the transcript of the first. This dataset includes utterances from thousands of speakers spanning hundreds of dialects/accents and acoustic conditions, allowing us to model a large variety of voices, linguistic and non-linguistic contents, accents, and noise conditions with “typical” speech all in the same language. The resulting conversion model projects away all non-linguistic information, including speaker characteristics, and retains only what is being said, not who, where, or how it is said. This base model is used to seed the second personalization phase of training.
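Conceptually, the parallel corpus is built by pairing each natural utterance with synthesized speech of its own transcript. The sketch below is only an outline of that idea; tts_synthesize stands in for a single-voice, WaveNet-style TTS system.

```python
def build_parallel_corpus(asr_corpus, tts_synthesize):
    """Pair each natural utterance with TTS audio of its own transcript.

    asr_corpus     : iterable of (audio, transcript) from a transcribed ASR corpus
    tts_synthesize : placeholder for a single-voice TTS system
    """
    for natural_audio, transcript in asr_corpus:
        target_audio = tts_synthesize(transcript)   # same content, canonical voice
        yield natural_audio, target_audio           # input/output training pair
```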

The second training phase utilizes a corpus of utterance pairs generated in the same manner as the first dataset. In this case, however, the corpus is used to adapt the network to the acoustic/phonetic, phonotactic and language patterns specific to the input speaker, which might include, for example, learning how the target speaker alters, substitutes, and reduces or removes certain vowels or consonants. To model ALS speech characteristics in general, we use utterances taken from an ALS speech corpus derived from Project Euphonia. If instead we want to personalize the model for a particular speaker, then the utterances are contributed by that person. The larger this corpus is, the better the model is likely to be at correctly converting to fluent speech. Using this second smaller and personalized parallel corpus, we run the neural-training algorithm, updating the parameters of the pre-trained base model to generate the final personalized model.

We found that training the model with a multitask objective to predict the target phonemes while simultaneously generating spectrograms of the target speech led to significant quality improvements. Such a multitask trained encoder can be thought of as learning a latent representation of the input that maintains information about the underlying linguistic content.
Overview of the Parrotron model architecture. An input speech spectrogram is passed through encoder and decoder neural networks to generate an output spectrogram in a new voice.
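The multitask objective mentioned above can be written as a weighted sum of a spectrogram reconstruction loss and an auxiliary phoneme prediction loss. The PyTorch-style sketch below is illustrative; the specific loss functions and weight are assumptions, not the exact losses used by Parrotron.

```python
import torch.nn.functional as F


def multitask_loss(pred_spectrogram, target_spectrogram,
                   phoneme_logits, target_phonemes, phoneme_weight=0.1):
    """Spectrogram regression plus auxiliary phoneme prediction (weight is an assumption)."""
    spec_loss = F.l1_loss(pred_spectrogram, target_spectrogram)
    phone_loss = F.cross_entropy(
        phoneme_logits.reshape(-1, phoneme_logits.size(-1)),  # (frames, classes)
        target_phonemes.reshape(-1))                          # (frames,)
    return spec_loss + phoneme_weight * phone_loss
```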
Case Studies
To demonstrate a proof of concept, we worked with our fellow Google research scientist and mathematician Dimitri Kanevsky, who was born in Russia to Russian-speaking, normal-hearing parents but has been profoundly deaf from a very young age. He learned to speak English as a teenager by using Russian phonetic representations of English words, learning to pronounce English using transliteration into Russian (e.g., The quick brown fox jumps over the lazy dog => ЗИ КВИК БРАУН ДОГ ЖАМПС ОУВЕР ЛАЙЗИ ДОГ). As a result, Dimitri’s speech is substantially distinct from that of native English speakers, and can be challenging to comprehend for systems or listeners who are not accustomed to it.

Dimitri recorded a corpus of 15 hours of speech, which was used to adapt the base model to the nuances specific to his speech. The resulting Parrotron system helped him be better understood by people and Google’s ASR system alike. Running Google’s ASR engine on the output of Parrotron significantly reduced the word error rate from 89% to 32% on a held-out test set from Dimitri. Below is an example of Parrotron’s successful conversion of input speech from Dimitri:

Input from Dimitri Audio
Output from Parrotron Audio
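For reference, the word error rates quoted above are the standard metric: word-level edit distance divided by the length of the reference transcript. A minimal sketch of that computation:

```python
def word_error_rate(reference, hypothesis):
    """Standard WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```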

We also worked with Aubrie Lee, a Googler and advocate for disability inclusion, who has muscular dystrophy, a condition that causes progressive muscle weakness, and sometimes impacts speech production. Aubrie contributed 1.5 hours of speech, which has been instrumental in showing promising outcomes of the applicability of this speech-to-speech technology. Below is an example of Parrotron’s successful conversion of input speech from Aubrie:

Input from Aubrie Audio
Output from Parrotron Audio
Input from Aubrie Audio
Output from Parrotron Audio

We also tested Parrotron’s performance on speech from speakers with ALS by adapting the pretrained model on a group of speakers who share similar speech characteristics, rather than on a single speaker. We conducted a preliminary listening study and observed an increase in intelligibility for the majority of our test speakers when comparing natural ALS speech to the corresponding speech produced by the Parrotron model.

Cascaded Approach
Project Euphonia has built a personalized speech-to-text model that has reduced the word error rate for a deaf speaker from 89% to 25%, and ongoing research is likely to improve upon these results. One could use such a speech-to-text model to achieve a similar goal to Parrotron's by simply passing its output into a TTS system to synthesize speech from the result. In such a cascaded approach, however, the recognizer may choose an incorrect word (roughly 1 out of 4 times, in this case), yielding words or sentences with unintended meaning, so the synthesized audio would be far from the speaker’s intention. Because of Parrotron's end-to-end speech-to-speech training objective, even when errors are made the generated output speech is likely to sound acoustically similar to the input speech, so the speaker’s original intention is less likely to be significantly altered and it is often still possible to understand what was intended:

Input from Dimitri Audio
Output from Parrotron Audio
Input from Dimitri Audio
Output from Parrotron/Input to Assistant Audio
Output from Assistant Audio
Input from Aubrie Audio
Output from Parrotron Audio

Furthermore, since Parrotron is not strongly biased toward producing words from a predefined vocabulary set, input to the model may contain completely new invented words, foreign words/names, and even nonsense words. We observe that feeding Arabic and Spanish utterances into the US-English Parrotron model often results in output that echoes the original speech content with an American accent, in the target voice. Such behavior is qualitatively different from what one would obtain by simply running an ASR followed by a TTS. Finally, by replacing a combination of independently tuned neural networks with a single model, we believe there is room for substantial improvements and simplifications.

Conclusion
Parrotron makes it easier for users with atypical speech to talk to and be understood by other people and by speech interfaces, with its end-to-end speech conversion approach more likely to reproduce the user’s intended speech. More exciting applications of Parrotron are discussed in our paper and additional audio samples can be found on our github repository. If you would like to participate in this ongoing research, please fill out this short form and volunteer to record a set of phrases. We look forward to working with you!
Acknowledgements
This project was joint work between the Speech and Google Brain teams. Contributors include Fadi Biadsy, Ron Weiss, Pedro Moreno, Dimitri Kanevsky, Ye Jia, Suzan Schwartz, Landis Baker, Zelin Wu, Johan Schalkwyk, Yonghui Wu, Zhifeng Chen, Patrick Nguyen, Aubrie Lee, Andrew Rosenberg, Bhuvana Ramabhadran, Jason Pelecanos, Julie Cattiau, Michael Brenner, Dotan Emanuel and Joel Shor. Our data collection efforts have been vastly accelerated by our collaborations with ALS-TDI.

Source: Google AI Blog


Building for all learners with new apps, tools, and resources

Everyone deserves access to a quality education—no matter your background, where you live, or your abilities. We’re recognizing this on Global Accessibility Awareness Day, an effort to promote digital accessibility and inclusion for people with disabilities, by sharing new features, training, and partners, along with the many new products announced at Google I/O.

Since everyone learns in different ways, we design technology that can adapt to a broad range of needs and learning styles. For example, you can now add captions in Slides and turn on live captions in Hangouts Meet, and we’ve improved discoverability in the G Suite toolbar. By making these features available—with even more in the works—teachers can help students learn in ways that work best for them.

Working with our partners to expand access

We’re not the only ones trying to make learning more accessible, so we’ve partnered with companies who are building apps to make it easier for teachers to communicate with all students.

One of our partners, Crick Software, just launched Clicker Communicator, a child-friendly communication tool for the classroom. It bridges the gap between needs/wants and curriculum access, empowers non-verbal students to initiate and lead conversations, and enables proactive participation in the classroom. It’s one of the first augmentative and alternative communication (AAC) apps specifically created for Chromebook users.


Learn more about Clicker Communicator, an AAC app for Chromebooks.

Assessing with accessibility in mind

Teachers use locked mode, available only on managed Chromebooks, to eliminate distractions while giving Quizzes in Google Forms. Locked mode is now used millions of times per month, and many students use additional apps for accommodations when taking quizzes. We’ve been working with many developers to make sure their tools work with locked mode. One of those developers is our partner Texthelp®. Coming soon, when you enable locked mode in Quizzes in Google Forms, your students will be able to access the Read&Write for Google Chrome and EquatIO® for Google tools that they rely on daily.

Another partner, Don Johnston, supports students with their apps including Co:Writer for word prediction, translation, and speech recognition and Snap&Read for read aloud, highlighting, and note-taking. Students signed into these extensions can use them on the quiz—even in locked mode. This integration will be rolling out over the next couple of weeks.

Learn more about the accessibility features available in locked mode, including ChromeVox, select-to-speak, and visual aids including high contrast mode and magnifiers.

Tools, training, and more resources

Assistive technology has the power to transform learning for more students, but educators need training, support, and tutorials to help their students get the most from the technology.

The new Accessibility section on our Google for Education website has information on Chromebooks and G Suite for Education, a module on our Teacher Center and printable flashcards, and EDU in 90 YouTube videos on G Suite and Chromebook accessibility features. Check out our accessibility tools and find training on how to use them to create more engaging, accessible learning experiences.

EDU in 90 video of Chromebook accessibility features

Watch the EDU in 90 on Chrome accessibility features.

We love hearing stories of how technology is making learning more accessible for more learners, so please share how you're using accessibility tools to support all types of learners, and requests for how we can continue to improve to meet the needs of more learners.

Make your smart home more accessible with new tutorials

I’m legally blind, so from the moment I pop out of bed each morning, I use technology to help me go about my day. When I wake up, I ask my Google Assistant for my custom-made morning Routine which turns on my lights, reads my calendar and plays the news. I use other products as well, like screen readers and a refreshable braille display, to help me be as productive as possible.

I bring my understanding of what it's like to have a disability to work with me, where I lead accessibility for Google Search, Google News and the Google Assistant. I work with cross-functional teams to help fulfill Google’s mission of building products for everyone—including those of us in the disabled community.

The Assistant can be particularly useful for helping people with disabilities get things done. So today, Global Accessibility Awareness Day, we’re releasing a series of how-to videos with visual and audible directions, designed to help the accessibility community set up and get the most out of their Assistant-enabled smart devices.

You can find step-by-step tutorials to learn how to interact with your Assistant, from setting up your Assistant-enabled device to using your voice to control your home appliances, at our YouTube playlist which we’ll continue to update throughout the year.

Intro to Assistant Accessibility Videos

This playlist came out of conversations within the team about how we can use our products to make life a little easier. Many of us have some form of disability, or have a friend, co-worker or family member who does. For example, Stephanie Wilson, an engineer on the Google Home team, helped set up her parents’ smart home after her dad was diagnosed with Parkinson’s disease.

In addition to our own teammates, we're always listening to suggestions from the broader community on how we can make our products more accessible. Last week at I/O, we showed how we’re making the Google Assistant more accessible, using AI to improve products for people with a speech impairment, and added Live Caption in Android Q to give the Deaf community automatic captions for media that’s playing audio on your phone. All these changes were made after receiving feedback from people like you.

Head over to our Accessibility website to learn more, and if you have questions or feedback on accessibility within Google products, please share your feedback with us via our dedicated Disability Support team.

New features to make audio more accessible on your phone

Smartphones are key to helping all of us get through our days, from getting directions to translating a word. But for people with disabilities, phones have the potential to do even more to connect people to information and help them perform everyday tasks. We want Android to work for all users, no matter their abilities. And on Global Accessibility Awareness Day, we’re taking another step toward this aim with updates to Live Transcribe, coming next month.


Available on 1.8 billion Android devices, Live Transcribe helps bridge the connection between the deaf and the hearing via real-time, real-world transcriptions for everyday conversations. With this update, we’re building on our machine learning and speech recognition technology to add new capabilities.


First, Live Transcribe will now show you sound events in addition to transcribing speech. You can see, for example, when a dog is barking or when someone is knocking on your door.  Seeing sound events allows you to be more immersed in the non-conversation realm of audio and helps you understand what is happening in the world. This is important to those who may not be able to hear non-speech audio cues such as clapping, laughter, music, applause, or the sound of a speeding vehicle whizzing by.


Second, you’ll now be able to copy and save transcripts, stored locally on your device for three days. This is useful not only for those with deafness or hearing loss—it also helps those who might be using real-time transcriptions in other ways, such as those learning a language for the first time or even, secondarily, journalists capturing interviews or students taking lecture notes. We’ve also made the audio visualization indicator bigger, so that users can more easily see the background audio around them.
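The three-day local retention described above can be pictured as a small housekeeping job that deletes transcript files once they age out. This is only a sketch of the idea, written in Python rather than Android code, with a made-up storage layout.

```python
import pathlib
import time

RETENTION_SECONDS = 3 * 24 * 60 * 60   # transcripts are kept for three days


def prune_old_transcripts(transcript_dir):
    """Delete locally saved transcript files older than the retention window."""
    cutoff = time.time() - RETENTION_SECONDS
    for path in pathlib.Path(transcript_dir).glob("*.txt"):
        if path.stat().st_mtime < cutoff:   # older than three days
            path.unlink()
```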

New features of Live Transcribe

Caption: See sound events, like whistling or a dog barking, in the bottom left corner of the updated Live Transcribe.

With billions of active devices powered by Android, we’re humbled by the opportunity to build helpful tools that make the world’s information more accessible in the palm of everyone’s hand. As long as there are barriers for some people, we still have work to do. We’ll continue to release more features to enrich the lives of our accessibility community and the people around them.

How DIVA makes Google Assistant more accessible

My 21-year-old brother Giovanni loves to listen to music and movies. But because he was born with congenital cataracts, Down syndrome and West syndrome, he is non-verbal. This means he relies on our parents and friends to start or stop music or a movie.

Over the years, Giovanni has used everything from DVDs to tablets to YouTube to Chromecast to fill his entertainment needs. But as new voice-driven technologies started to emerge, they also came with a different set of challenges that required him to be able to use his voice or a touchscreen. That’s when I decided to find a way to let my brother control access to his music and movies on voice-driven devices without any help. It was a way for me to give him some independence and autonomy.

Working alongside my colleagues in the Milan Google office, I set up Project DIVA, which stands for DIVersely Assisted. The goal was to create a way to let people like Giovanni trigger commands to the Google Assistant without using their voice. We looked at many different scenarios and methodologies that people could use to trigger commands, like pressing a big button with their chin or their foot, or with a bite.  For several months we brainstormed different approaches and presented them at different accessibility and tech events to get feedback.

We had a bunch of ideas on paper that looked promising. But in order to turn those ideas into something real, we took part in an Alphabet-wide accessibility innovation challenge and built a prototype which went on to win the competition. We identified that many assistive buttons available on the market come with a 3.5mm jack, which is the kind many people have on their wired headphones. For our prototype, we created a box to connect those buttons and convert the signal coming from the button to a command sent to the Google Assistant.
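In spirit, the box's job is simple: watch for a switch closure on the 3.5mm input and forward one fixed command to the Assistant. The sketch below is hypothetical; read_button and send_assistant_command stand in for the switch input and the Google Assistant Connect transport used by the real device, and the command text is just an example.

```python
import time


def run_button_bridge(read_button, send_assistant_command,
                      command="play my music playlist", debounce_s=0.3):
    """Poll an assistive switch and map each new press to one Assistant command."""
    was_pressed = False
    while True:
        pressed = read_button()
        if pressed and not was_pressed:      # rising edge = a new press
            send_assistant_command(command)
            time.sleep(debounce_s)           # ignore mechanical switch bounce
        was_pressed = pressed
        time.sleep(0.01)                     # light polling loop
```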

Project DIVA diagram

To move from a prototype to reality, we started working with the team behind Google Assistant Connect, and today we are announcing DIVA at Google I/O 2019.


The real test, however, was giving this to Giovanni to try out. By touching the button with his hand, the signal is converted into a command sent to the Assistant. Now he can listen to music on the same devices and services our family and all his friends use,  and his smile tells the best story.


Getting this to work for Giovanni was just the start for Project DIVA. We started with single-purpose buttons, but this could be extended to more flexible and configurable scenarios. Now, we are investigating attaching RFID tags to objects and associating a command to each tag. That way, a person might have a cartoon puppet trigger a cartoon on the TV, or a physical CD trigger the music on their speaker.
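Associating a command with each RFID tag could be as simple as a lookup table keyed by tag ID, as in this illustrative sketch (the tag IDs and commands are made up).

```python
# Illustrative mapping from RFID tag IDs to Assistant commands (IDs are made up).
TAG_COMMANDS = {
    "04:A2:5F:1C": "play cartoons on the living room TV",
    "04:B7:33:9E": "play my favorite album on the kitchen speaker",
}


def handle_tag(tag_id, send_assistant_command):
    """Send the command associated with a scanned tag, if one is registered."""
    command = TAG_COMMANDS.get(tag_id)
    if command:
        send_assistant_command(command)
```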


Learn more about the idea behind the DIVA project at our publication site, and learn how to build your own device at our technical site.