Tag Archives: Interspeech

Google at Interspeech 2022

This week, the 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022) is being held in Incheon, South Korea, representing one of the world’s most extensive conferences on research and technology of spoken language understanding and processing. Over 2,000 experts in speech-related research fields gather to take part in oral presentations and poster sessions and to collaborate with streamed events across the globe.

We are excited to be a Diamond Sponsor of INTERSPEECH 2022, where we will be showcasing nearly 50 research publications and supporting a number of workshops, special sessions and tutorials. We welcome in-person attendees to drop by the Google booth to meet our researchers and participate in Q&As and demonstrations of some of our latest speech technologies, which help to improve accessibility and provide convenience in communication for billions of users. In addition, online attendees are encouraged to visit our virtual booth in GatherTown where you can get up-to-date information on research and opportunities at Google. You can also learn more about the Google research being presented at INTERSPEECH 2022 below (Google affiliations in bold).

Organizing Committee

Industry Liaisons include: Bhuvana Ramabahdran

Area Chairs include: John Hershey, Heiga Zen, Shrikanth Narayanan, Bastiaan Kleijn

ISCA Fellows

Include: Tara Sainath, Heiga Zen


Production Federated Keyword Spotting via Distillation, Filtering, and Joint Federated-Centralized Training
Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio Lopez Moreno, Rajiv Mathews, Françoise Beaufays

Leveraging Unsupervised and Weakly-Supervised Data to Improve Direct Speech-to-Speech Translation
Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

UserLibri: A Dataset for ASR Personalization Using Only Text
Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey

SNRi Target Training for Joint Speech Enhancement and Recognition
Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani

Turn-Taking Prediction for Natural Conversational Speech
Shuo-Yiin Chang, Bo Li, Tara Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He

Streaming Intended Query Detection Using E2E Modeling for Continued Conversation
Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara Sainath, Bo Li, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman

Improving Distortion Robustness of Self-Supervised Speech Processing Tasks with Domain Adaptation
Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee

XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

Extracting Targeted Training Data from ASR Models, and How to Mitigate It
Ehsan Amid, Om Thakkar, Arun Narayanan, Rajiv Mathews, Françoise Beaufays

Detecting Unintended Memorization in Language-Model-Fused ASR
W. Ronny Huang, Steve Chien, Om Thakkar, Rajiv Mathews

AVATAR: Unconstrained Audiovisual Speech Recognition
Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

End-to-End Multi-talker Audio-Visual ASR Using an Active Speaker Attention Module
Richard Rose, Olivier Siohan

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-person Video
Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

Unsupervised Data Selection via Discrete Speech Representation for ASR
Zhiyun Lu, Yongqiang Wang, Yu Zhang, Wei Han, Zhehuai Chen, Parisa Haghani

Non-parallel Voice Conversion for ASR Augmentation
Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Jesse Emond, Yinghui Huang, Pedro J. Moreno

Ultra-Low-Bitrate Speech Coding with Pre-trained Transformers
Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification
Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-Yiin Chang, Parisa Haghani

Improving Deliberation by Text-Only and Semi-supervised Training
Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

CycleGAN-Based Unpaired Speech Dereverberation
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

TRILLsson: Distilled Universal Paralinguistic Speech Representations (see blog post)
Joel Shor, Subhashini Venugopalan

Learning Neural Audio Features Without Supervision
Sarthak Yadav, Neil Zeghidour

SpeechPainter: Text-Conditioned Speech Inpainting
Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi

SpecGrad: Diffusion Probabilistic Model-Based Neural Vocoder with Adaptive Noise Spectral Shaping
Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani

Distance-Based Sound Separation
Katharine Patterson, Kevin Wilson, Scott Wisdom, John R. Hershey

Analysis of Self-Attention Head Diversity for Conformer-Based Automatic Speech Recognition
Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno

Improving Rare Word Recognition with LM-Aware MWER Training
Wang Weiran, Tongzhou Chen, Tara Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

MAESTRO: Matched Speech Text Representations Through Modality Matching
Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, Heiga Zen

Pseudo Label is Better Than Human Label
Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman

On the Optimal Interpolation Weights for Hybrid Autoregressive Transducer Model
Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran

Streaming Align-Refine for Non-autoregressive Deliberation
Wang Weiran, Ke Hu, Tara Sainath

Federated Pruning: Improving Neural Network Efficiency with Federated Learning
Rongmei Lin*, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Françoise Beaufays

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
Shaojin Ding, Weiran Wang, Ding Zhao, Tara N Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

4-Bit Conformer with Native Quantization Aware Training for Speech Recognition
Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov

Visually-Aware Acoustic Event Detection Using Heterogeneous Graphs
Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha

A Conformer-Based Waveform-Domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy
Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein

Reducing Domain Mismatch in Self-Supervised Speech Pre-training
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Nicolás Serrano

On-the-Fly ASR Corrections with Audio Exemplars
Golan Pundak, Tsendsuren Munkhdalai, Khe Chai Sim

A Language Agnostic Multilingual Streaming On-Device ASR System
Bo Li, Tara Sainath, Ruoming Pang*, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

XTREME-S: Evaluating Cross-Lingual Speech Representations
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

Towards Disentangled Speech Representations
Cal Peyser, Ronny Huang, Andrew Rosenberg, Tara Sainath, Michael Picheny, Kyunghyun Cho

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition
Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O'Malley, Ian McGraw

A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation
Tom O’Malley, Arun Narayanan, Quan Wang

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alex Petelin, Jonathan Shen*, Vincent Wan, Yu Zhang, Yonghui Wu, Robert Clark

A Scalable Model Specialization Framework for Training and Inference Using Submodels and Its Application to Speech Model Personalization
Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro Moreno

Text-Driven Separation of Arbitrary Sounds
Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi

Workshops, Tutorials & Special Sessions

The VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22)
Organizers include: Arsha Nagrani

Self-Supervised Representation Learning for Speech Processing
Organizers include: Tara Sainath

Learning from Weak Labels
Organizers include: Ankit Shah

RNN Transducers for Named Entity Recognition with Constraints on Alignment for Understanding Medical Conversations
Authors: Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey

Listening with Googlears: Low-Latency Neural Multiframe Beamforming and Equalization for Hearing Aids
Authors: Samuel Yang, Scott Wisdom, Chet Gnegy, Richard F. Lyon, Sagar Savla

Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Authors: Michael Chinen, Jan Skoglund, Chandan K. A. Reddy, Alessandro Ragano, Andrew Hines

Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device
Authors: Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays

Trustworthy Speech Processing
Organizers include: Shrikanth Narayanan

*Work done while at Google.  

Source: Google AI Blog

Personalized ASR Models from a Large and Diverse Disordered Speech Dataset

Speech impairments affect millions of people, with underlying causes ranging from neurological or genetic conditions to physical impairment, brain damage or hearing loss. Similarly, the resulting speech patterns are diverse, including stuttering, dysarthria, apraxia, etc., and can have a detrimental impact on self-expression, participation in society and access to voice-enabled technologies. Automatic speech recognition (ASR) technologies have the potential to help individuals with such speech impairments by improving access to dictation and home automation and by enhancing communication. However, while the increased computational power of deep learning systems and the availability of large training datasets has improved the accuracy of ASR systems, their performance is still insufficient for many people with speech disorders, rendering the technology unusable for many of the speakers who could benefit the most.

In 2019, we introduced Project Euphonia and discussed how we could use personalized ASR models of disordered speech to achieve accuracies on par with non-personalized ASR on typical speech. Today we share the results of two studies, presented at Interspeech 2021, that aim to expand the availability of personalized ASR models to more users. In “Disordered Speech Data Collection: Lessons Learned at 1 Million Utterances from Project Euphonia”, we present a greatly expanded collection of disordered speech data, composed of over 1 million utterances. Then, in “Automatic Speech Recognition of Disordered Speech: Personalized models outperforming human listeners on short phrases”, we discuss our efforts to generate personalized ASR models based on this corpus. This approach leads to highly accurate models that can achieve up to 85% improvement to the word error rate in select domains compared to out-of-the-box speech models trained on typical speech.

Impaired Speech Data Collection
Since 2019, speakers with speech impairments of varying degrees of severity across a variety of conditions have provided voice samples to support Project Euphonia’s research mission. This effort has grown Euphonia’s corpus to over 1 million utterances, comprising over 1400 hours from 1330 speakers (as of August 2021).

Distribution of severity of speech disorder and condition across all speakers with more than 300 utterances recorded. For conditions, only those with > 5 speakers are shown (all others aggregated into “OTHER” for k-anonymity).
ALS = amyotrophic lateral sclerosis; DS = Down syndrome; PD = Parkinson’s disease; CP = cerebral palsy; HI = hearing impaired; MD = muscular dystrophy; MS = multiple sclerosis

To simplify the data collection, participants used an at-home recording system on their personal hardware (laptop or phone, with and without headphones), instead of an idealized lab-based setting that would collect studio quality recordings.

To reduce transcription cost, while still maintaining high transcript conformity, we prioritized scripted speech. Participants read prompts shown on a browser-based recording tool. Phrase prompts covered use-cases like home automation (“Turn on the TV.”), caregiver conversations (“I am hungry.”) and informal conversations (“How are you doing? Did you have a nice day?”). Most participants received a list of 1500 phrases, which included 1100 unique phrases along with 100 phrases that were each repeated four more times.

Speech professionals conducted a comprehensive auditory-perceptual speech assessment while listening to a subset of utterances for every speaker providing the following speaker-level metadata: speech disorder type (e.g., stuttering, dysarthria, apraxia), rating of 24 features of abnormal speech (e.g., hypernasality, articulatory imprecision, dysprosody), as well as recording quality assessments of both technical (e.g., signal dropouts, segmentation problems) and acoustic (e.g., environmental noise, secondary speaker crosstalk) features.

Personalized ASR Models
This expanded impaired speech dataset is the foundation of our new approach to personalized ASR models for disordered speech. Each personalized model uses a standard end-to-end, RNN-Transducer (RNN-T) ASR model that is fine-tuned using data from the target speaker only.

Architecture of RNN-Transducer. In our case, the encoder network consists of 8 layers and the predictor network consists of 2 layers of uni-directional LSTM cells.

To accomplish this, we focus on adapting the encoder network, i.e. the part of the model dealing with the specific acoustics of a given speaker, as speech sound disorders were most common in our corpus. We found that only updating the bottom five (out of eight) encoder layers while freezing the top three encoder layers (as well as the joint layer and decoder layers) led to the best results and effectively avoided overfitting. To make these models more robust against background noise and other acoustic effects, we employ a configuration of SpecAugment specifically tuned to the prevailing characteristics of disordered speech. Further, we found that the choice of the pre-trained base model was critical. A base model trained on a large and diverse corpus of typical speech (multiple domains and acoustic conditions) proved to work best for our scenario.

We trained personalized ASR models for ~430 speakers who recorded at least 300 utterances. 10% of utterances were held out as a test set (with no phrase overlap) on which we calculated the word error rate (WER) for the personalized model and the unadapted base model.

Overall, our personalization approach yields significant improvements across all severity levels and conditions. Even for severely impaired speech, the median WER for short phrases from the home automation domain dropped from around 89% to 13%. Substantial accuracy improvements were also seen across other domains such as conversational and caregiver.

WER of unadapted and personalized ASR models on home automation phrases.

To understand when personalization does not work well, we analyzed several subgroups:

  • HighWER and LowWER: Speakers with high and low personalized model WERs based on the 1st and 5th quintiles of the WER distribution.
  • SurpHighWER: Speakers with a surprisingly high WER (participants with typical speech or mild speech impairment of the HighWER group).

Different pathologies and speech disorder presentations are expected to impact ASR non-uniformly. The distribution of speech disorder types within the HighWER group indicates that dysarthria due to cerebral palsy was particularly difficult to model. Not surprisingly, median severity was also higher in this group.

To identify the speaker-specific and technical factors that impact ASR accuracy, we examined the differences (Cohen's D) in the metadata between the participants that had poor (HighWER) and excellent (LowWER) ASR performance. As expected, overall speech severity was significantly lower in the LowWER group than in the HighWER group (p < 0.01). Intelligibility and severity were the most prominent atypical speech features in the HighWER group; however, other speech features also emerged, including abnormal prosody, articulation, and phonation. These speech features are known to degrade overall speech intelligibility.

The SurpHighWER group had fewer training utterances and lower SNR compared with the LowWER group (p < 0.01) resulting in large (negative) effect sizes, with all other factors having small effect sizes, except fastness. In contrast, the HighWER group exhibited medium to large differences across all factors.

Speech disorder and technical metadata effect sizes for the HighWER-vs-LowWER and SurpHighWER-vs-LowWER pairs. Positive effects indicated that the group values of the HighWER group were greater than LowWER groups.

We then compared personalized ASR models to human listeners. Three speech professionals independently transcribed 30 utterances per speaker. We found that WERs were, on average, lower for personalized ASR models compared to the WERs of human listeners, with gains increasing by severity.

Delta between the WERs of the personalized ASR models and the human listeners. Negative values indicate that personalized ASR performs better than human (expert) listeners.

With over 1 million utterances, Euphonia’s corpus is one of the largest and most diversely disordered speech corpora (in terms of disorder types and severities) and has enabled significant advances in ASR accuracy for these types of atypical speech. Our results demonstrate the efficacy of personalized ASR models for recognizing a wide range of speech impairments and severities, with potential for making ASR available to a wider population of users.

Key contributors to this project include Michael Brenner, Julie Cattiau, Richard Cave, Jordan Green, Rus Heywood, Pan-Pan Jiang, Anton Kast, Marilyn Ladewig, Bob MacDonald, Phil Nelson, Katie Seaver, Jimmy Tobin, and Katrin Tomanek. We gratefully acknowledge the support Project Euphonia received from members of many speech research teams across Google, including Françoise Beaufays, Fadi Biadsy, Dotan Emanuel, Khe Chai Sim, Pedro Moreno Mengibar, Arun Narayanan, Hasim Sak, Suzan Schwartz, Joel Shor, and many others. And most importantly, we wanted to say a huge thank you to the over 1300 participants who recorded speech samples and the many advocacy groups who helped us connect with these participants.

Source: Google AI Blog

Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

Google's mission is not just to organize the world's information but to make it universally accessible, which means ensuring that our products work in as many of the world's languages as possible. When it comes to understanding human speech, which is a core capability of the Google Assistant, extending to more languages poses a challenge: high-quality automatic speech recognition (ASR) systems require large amounts of audio and text data — even more so as data-hungry neural models continue to revolutionize the field. Yet many languages have little data available.

We wondered how we could keep the quality of speech recognition high for speakers of data-scarce languages. A key insight from the research community was that much of the "knowledge" a neural network learns from audio data of a data-rich language is re-usable by data-scarce languages; we don't need to learn everything from scratch. This led us to study multilingual speech recognition, in which a single model learns to transcribe multiple languages.

In “Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model”, published at Interspeech 2019, we present an end-to-end (E2E) system trained as a single model, which allows for real-time multilingual speech recognition. Using nine Indian languages, we demonstrated a dramatic improvement in the ASR quality on several data-scarce languages, while still improving performance for the data-rich languages.

India: A Land of Languages
For this study, we focused on India, an inherently multilingual society where there are more than thirty languages with at least a million native speakers. Many of these languages overlap in acoustic and lexical content due to the geographic proximity of the native speakers and shared cultural history. Additionally, many Indians are bilingual or trilingual, making the use of multiple languages within a conversation a common phenomenon, and a natural case for training a single multilingual model. In this work, we combined nine primary Indian languages, namely Hindi, Marathi, Urdu, Bengali, Tamil, Telugu, Kannada, Malayalam and Gujarati.

A Low-latency All-neural Multilingual Model
Traditional ASR systems contain separate components for acoustic, pronunciation, and language models. While there have been attempts to make some or all of the traditional ASR components multilingual [1,2,3,4], this approach can be complex and difficult to scale. E2E ASR models combine all three components into a single neural network and promise scalability and ease of parameter sharing. Recent works have extended E2E models to be multilingual [1,2], but they did not address the need for real-time speech recognition, a key requirement for applications such as the Assistant, Voice Search and GBoard dictation. For this, we turned to recent research at Google that used a Recurrent Neural Network Transducer (RNN-T) model to achieve streaming E2E ASR. The RNN-T system outputs words one character at a time, just as if someone was typing in real time, however this was not multilingual. We built upon this architecture to develop a low-latency model for multilingual speech recognition.
[Left] A traditional monolingual speech recognizer comprising of Acoustic, Pronunciation and Language Models for each language. [Middle] A traditional multilingual speech recognizer where the Acoustic and Pronunciation model is multilingual, while the Language model is language-specific. [Right] An E2E multilingual speech recognizer where the Acoustic, Pronunciation and Language Model is combined into a single multilingual model.
Large-Scale Data Challenges
Using large-scale, real-world data for training a multilingual model is complicated by data imbalance. Given the steep skew in the distribution of speakers across the languages and speech product maturity, it is not surprising to have varying amounts of transcribed data available per language. As a result, a multilingual model can tend to be more influenced by languages that are over-represented in the training set. This bias is more prominent in an E2E model, which unlike a traditional ASR system, does not have access to additional in-language text data and learns lexical characteristics of the languages solely from the audio training data.
Histogram of training data for the nine languages showing the steep skew in the data available.
We addressed this issue with a few architectural modifications. First, we provided an extra language identifier input, which is an external signal derived from the language locale of the training data; i.e. the language preference set in an individual’s phone. This signal is combined with the audio input as a one-hot feature vector. We hypothesize that the model is able to use the language vector not only to disambiguate the language but also to learn separate features for separate languages, as needed, which helped with data imbalance.

Building on the idea of language-specific representations within the global model, we further augmented the network architecture by allocating extra parameters per language in the form of residual adapter modules. Adapters helped fine-tune a global model on each language while maintaining parameter efficiency of a single global model, and in turn, improved performance.
[Left] Multilingual RNN-T architecture with a language identifier. [Middle] Residual adapters inside the encoder. For a Tamil utterance, only the Tamil adapters are applied to each activation. [Right] Architecture details of the Residual Adapter modules. For more details please see our paper.
Putting all of these elements together, our multilingual model outperforms all the single-language recognizers, with especially large improvements in data-scarce languages like Kannada and Urdu. Moreover, since it is a streaming E2E model, it simplifies training and serving, and is also usable in low-latency applications like the Assistant. Building on this result, we hope to continue our research on multilingual ASRs for other language groups, to better assist our growing body of diverse users.

We would like to thank the following for their contribution to this research: Tara N. Sainath, Eugene Weinstein, Bo Li, Shubham Toshniwal, Ron Weiss, Bhuvana Ramabhadran, Yonghui Wu, Ankur Bapna, Zhifeng Chen, Seungji Lee, Meysam Bastani, Mikaela Grace, Pedro Moreno, Yanzhang (Ryan) He, Khe Chai Sim.

Source: Google AI Blog

Google at Interspeech 2019

This week, Graz, Austria hosts the 20th Annual Conference of the International Speech Communication Association (Interspeech 2019), one of the world‘s most extensive conferences on the research and engineering for spoken language processing. Over 2,000 experts in speech-related research fields gather to take part in oral presentations and poster sessions and to collaborate with streamed events across the globe.

As a Gold Sponsor of Interspeech 2019, we are excited to present 30 research publications, and demonstrate some of the impact speech technology has made in our products, from accessible, automatic video captioning to a more robust, reliable Google Assistant. If you’re attending Interspeech 2019, we hope that you’ll stop by the Google booth to meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. Our researchers will also be on hand to discuss Google Cloud Text-to-Speech and Speech-to-text, demo Parrotron, and more. You can also learn more about the Google research being presented at Interspeech 2019 below (Google affiliations in blue).

Organizing Committee includes:
Michiel Bacchiani

Technical Program Committee includes:
Tara Sainath

Neural Machine Translation
Organizers include: Wolfgang Macherey, Yuan Cao

Accepted Publications
Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data (link to appear soon)
Manasa Prasad, Daan van Esch, Sandy Ritchie, Jonas Fromseier Mortensen

Multi-Microphone Adaptive Noise Cancellation for Robust Hotword Detection (link to appear soon)
Yiteng Huang, Turaj Shabestary, Alexander Gruenstein, Li Wan

Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model
Ye Jia, Ron Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Yonghui Wu

Improving Keyword Spotting and Language Identification via Neural Architecture Search at Scale (link to appear soon)
Hanna Mazzawi, Javier Gonzalvo, Aleks Kracun, Prashant Sridhar, Niranjan Subrahmanya, Ignacio Lopez Moreno, Hyun Jin Park, Patrick Violette

Shallow-Fusion End-to-End Contextual Biasing (link to appear soon)
Ding Zhao, Tara Sainath, David Rybach, Pat Rondon, Deepti Bhatia, Bo Li, Ruoming Pang

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif Saurous, Ron Weiss, Ye Jia, Ignacio Lopez Moreno

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
Daniel Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin Dogus Cubuk, Quoc Le

Two-Pass End-to-End Speech Recognition
Ruoming Pang, Tara Sainath, David Rybach, Yanzhang He, Rohit Prabhavalkar, Mirko Visontai, Qiao Liang, Trevor Strohman, Yonghui Wu, Ian McGraw, Chung-Cheng Chiu

On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach, Patrick Nguyen

Contextual Recovery of Out-of-Lattice Named Entities in Automatic Speech Recognition (link to appear soon)
Jack Serrino, Leonid Velikovich, Petar Aleksic, Cyril Allauzen

Joint Speech Recognition and Speaker Diarization via Sequence Transduction
Laurent El Shafey, Hagen Soltau, Izhak Shafran

Personalizing ASR for Dysarthric and Accented Speech with Limited Data
Joel Shor, Dotan Emanuel, Oran Lang, Omry Tuval, Michael Brenner, Julie Cattiau, Fernando Vieira, Maeve McNally, Taylor Charbonneau, Melissa Nollstadt, Avinatan Hassidim, Yossi Matias

An Investigation Into On-Device Personalization of End-to-End Automatic Speech Recognition Models (link to appear soon)
Khe Chai Sim, Petr Zadrazil, Francoise Beaufays

Salient Speech Representations Based on Cloned Networks
Bastiaan Kleijn, Felicia Lim, Michael Chinen, Jan Skoglund

Cross-Lingual Consistency of Phonological Features: An Empirical Study (link to appear soon)
Cibu Johny, Alexander Gutkin, Martin Jansche

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
Heiga Zen, Viet Dang, Robert Clark, Yu Zhang, Ron Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu

Improving Performance of End-to-End ASR on Numeric Sequences
Cal Peyser, Hao Zhang, Tara Sainath, Zelin Wu

Developing Pronunciation Models in New Languages Faster by Exploiting Common Grapheme-to-Phoneme Correspondences Across Languages (link to appear soon)
Harry Bleyan, Sandy Ritchie, Jonas Fromseier Mortensen, Daan van Esch

Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models
Ke Hu, Antoine Bruguier, Tara Sainath, Rohit Prabhavalkar, Golan Pundak

Fréchet Audio Distance: A Reference-free Metric for Evaluating Music Enhancement Algorithms
Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, Matthew Sharifi

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
Yu Zhang, Ron Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

Sampling from Stochastic Finite Automata with Applications to CTC Decoding
Martin Jansche, Alexander Gutkin

Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model (link to appear soon)
Anjuli Kannan, Arindrima Datta, Tara Sainath, Eugene Weinstein, Bhuvana Ramabhadran, Yonghui Wu, Ankur Bapna, Zhifeng Chen, SeungJi Lee

A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet
Jean-Marc Valin, Jan Skoglund

Low-Dimensional Bottleneck Features for On-Device Continuous Speech Recognition
David Ramsay, Kevin Kilgour, Dominik Roblek, Matthew Sharif

Unified Verbalization for Speech Recognition & Synthesis Across Languages (link to appear soon)
Sandy Ritchie, Richard Sproat, Kyle Gorman, Daan van Esch, Christian Schallhart, Nikos Bampounis, Benoit Brard, Jonas Mortensen, Amelia Holt, Eoin Mahon

Better Morphology Prediction for Better Speech Systems (link to appear soon)
Dravyansh Sharma, Melissa Wilson, Antoine Bruguier

Dual Encoder Classifier Models as Constraints in Neural Text Normalization
Ajda Gokcen, Hao Zhang, Richard Sproat

Large-Scale Visual Speech Recognition
Brendan Shillingford, Yannis Assael, Matthew Hoffman, Thomas Paine, Cían Hughes, Utsav Prabhu, Hank Liao, Hasim Sak, Kanishka Rao, Lorrayne Bennett, Marie Mulville, Ben Coppin, Ben Laurie, Andrew Senior, Nando de Freitas

Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation
Fadi Biadsy, Ron Weiss, Pedro Moreno, Dimitri Kanevsky, Ye Jia

Source: Google AI Blog

Project Euphonia’s Personalized Speech Recognition for Non-Standard Speech

The utility of technology is dependent on its accessibility. One key component of accessibility is automatic speech recognition (ASR), which can greatly improve the ability of those with speech impairments to interact with every-day smart devices. However, ASR systems are most often trained from 'typical' speech, which means that underrepresented groups, such as those with speech impairments or heavy accents, don't experience the same degree of utility. For example, amyotrophic lateral sclerosis (ALS) is a disease that can adversely affect a person’s speech—about 25% of people with ALS experiencing slurred speech as their first symptom. In addition, most people with ALS eventually lose the ability to walk, so being able to interact with automated devices from a distance can be very important. Yet current state-of-the-art ASR models can yield high word error rates (WER) for speakers with only a moderate speech impairment from ALS, effectively barring access to ASR reliant technologies.

In “Personalizing ASR for Dysarthric and Accented Speech with Limited Data,” to be presented at Interspeech 2019, we describe some of the research behind Project Euphonia, an ASR platform that performs speech-to-text transcription. This work presents an approach to improve ASR for people with ALS that may also be applicable to many other types of non-standard speech. Using a two-step training approach that starts with a baseline “standard” corpus and then fine-tunes the training with a personalized speech dataset, we have demonstrated significant improvements for speakers with atypical speech over current state-of-the-art models.

A Two-Phased Approach to Training
In order to create ASR models that work on non-standard speech, one needs to overcome two challenges. The first is that within a particular class of atypical speech, be it a regional accent or a speech impairment, for example, individuals can exhibit very different ways of speaking. Our approach deals with this sub-group heterogeneity by training the ASR model in two phases. We start with a high-quality ASR model trained on thousands of hours of standard speech and then we fine-tune parts of the model to an individual with non-standard speech. This approach is similar to that of Parrotron: both systems use end-to-end neural networks to help improve communication and accessibility, but Parrotron focuses exclusively on speech-to-speech, where a person’s speech is converted directly into synthesized speech, rather than text.

The second challenge arises from the difficulty in collecting enough data to train a state-of-the-art recognizer for individuals. Typical speech recognizers are trained on thousands of hours of speech from many different speakers. Acquiring this much data from a single speaker is nearly impossible, especially if the speaker may experience exhaustion from speaking due to a medical condition. Our approach overcomes this issue by first training a base model on a large corpus of typical speech, and then training a personalized model using a much smaller dataset with the targeted non-standard speech characteristics.

The Neural Network Architecture
When developing the models used for training data on atypical speech, we explored two different neural architectures. The first is the RNN-Transducer (RNN-T), a neural network architecture consisting of encoder and decoder networks that has shown good results on numerous ASR tasks. The encoder is bidirectional (i.e., it looks at the entire sentence at once in order to provide context), and thus it requires the entire audio sample to perform speech recognition.

The other architecture we explored was Listen, Attend, and Spell (LAS), which is an attention-based, sequence-to-sequence model that maps sequences of acoustic properties to sequences of languages. This model uses an encoder to convert the sequence of acoustic frames to a sequence of internal representations, and a decoder to convert the sequence of internal representations to linguistic output. The network produces “word pieces”, which are a linguistic representation between graphemes and words.
Comparison of the RNN-Transducer (left) and Listen, Attend, Spell (right) architectures. From Prabhavalkar et al. 2017.
We experimented with fine-tuning the state-of-the-art RNN-T and LAS base models on two types of non-standard speech. In partnership with the ALS Therapy Development Institute, we first collected about 36 hours of audio from 67 speakers who have ALS. The participants recorded themselves on their home computers using custom software while they read sentences from a very restricted language domain. Many phrases were single sentences with simple grammatical structure (e.g., “What time is the basketball game on tonight?”). This is in contrast with unrestricted language domains, which include domain-specific vocabulary (e.g., science talks) and complex language structure (e.g., a debate). The recordings did not include many of the filler words common in normal speech, such as “um” and “uh”.

We also tested accented speech, using the open source L2 Arctic dataset of non-native speech, which consists of 20 speakers with approximately 1 hour of speech per speaker. Each speaker recorded a set of 1150 utterances from the CMU Arctic prompts.

AudioEuphonia ModelStandard Speech Model
Did I have anything to say about it?Dictatorship angels to think about it
Come right back pleaseCameras object
Let’s try that againIt extracts
Turn it down a little bit pleaseTurning down a little bit please
The audio (left) are recordings of a speaker with ALS. The text transcriptions are output from the Euphonia model (center) and the Standard Speech model (right). Incorrectly transcribed text is underlined.
The absolute word error rates on the language-restricted test set is shown below. There is an improvement over the baseline model for very non-standard speech (heavy accents and ALS speech below 3 on the ALS Functional Rating Scale) and moderate improvements in ALS speech that is similar to typical speech. The relative difference between the base model and the fine-tuned model demonstrates that the majority of the improvement comes from the fine-tuning process, except in the case of the RNN-T on the Arctic dataset, where the RNN-T baseline is already strong.
1 Non-native English speech from the L2-Arctic dataset.
2 Low FRS (ALS Functional Rating Scale) speech; intelligible with repeating (FRS 2); Speech combined with non-vocal communication (FRS 1).
3 FRS 3; detectable speech disturbance.
The RNN-T model achieved 91% of the improvement by fine-tuning just two layers, most of which are close to the input. On the accented dataset, fine-tuning the same two layers achieved 86% of the relative improvement compared to fine-tuning the entire network. This is consistent with previous speech work.

Most of the performance gains were achieved early in training. The models we trained were tested on a relatively limited domain of vocabulary and linguistic complexity, so the performance numbers are not necessarily related to how well the models perform on more general tasks. We hope that just fine-tuning part of the network allows it to retain the acoustic and linguistic information from the general speech model, while needing minimal modifications to adapt to a single new speaker. Future work will test this hypothesis.
Low FRS corresponds to the ALS speakers with low intelligibility (FRS 2, 1), while high FRS corresponds to ALS speakers with less severely impacted speech (FRS 3).
Understanding Model Behavior
To better understand how our models improved after fine-tuning, we looked at the pattern of phoneme mistakes. We started by comparing the distribution of phoneme mistakes made by the base ASR model on standard speech to the mistakes made on ALS speech. The SAMPA phonemes with the five largest differences between the ALS data and standard speech are p, U, f, k, and Z, which account for 20% of the deletion mistakes. Similarly, the n and m phonemes together account for 17% of the insertion / substitution mistakes. The same analysis on our fine-tuned models verifies that the unrecognized phoneme distribution is more similar to that of standard speech.

Our analysis shows that there are two aspects to every mistake: which phoneme the system doesn’t understand, and which phoneme the system thinks was said. Imagine having two systems with identical accuracy: one system always thinks that the f phoneme is actually the g phoneme, while another doesn't know what the f phoneme is and randomly guesses. These two systems will have identical performance and identical distributions of phoneme mistakes, but very different distributions of the predicted phoneme when a mistake is made. Surprisingly, ASR mistakes on ALS speech are far more similar to regular speech mistakes after Euphonia fine-tuning.
Deletion / substitution mistakes per SAMPA phoneme on ALS speech before fine-tuning, ALS speech after fine-tuning, and on typical speech (Librispeech dataset).
Future Work
In the future, we intend to explore additional techniques that can be helpful in the low data regime. We also hope to use phoneme mistakes to weight certain examples during training, or to pick training sentences for people with ALS to record that contain the most common phoneme mistakes. We would like to explore pooling data from multiple speakers with similar conditions.

We hope that continued research in this area will help voice interfaces become accessible to more people, especially those who need it most. One key component to this is collecting data. Anyone 18 or older can help us build better personalized models by donating audio data. If you’re interested, you can fill out this form to allow Google to contact you.

This work would not have been possible without the extraordinary effort and support of the ALS Therapy Development Institute and the ALS community, especially Fernando Vieira, Maeve McNally, Taylor Charbonneau, Melissa Nollstadt, and the individuals with ALS who kindly and patiently volunteered their audio. This work builds on the pioneering advances in speech recognition made by Google's speech team, in particular the recent development and deployment of end-to-end speech recognition models. We are grateful to the Google speech team for advice and collaboration, particularly to Anshuman Tripathi and Hasim Sak who guided us in training the initial models. We’d also like to thank Oran Lang, Omry Tuval, Michael Brenner, Julie Cattiau, Tara Sainath, Ding Zhao, Qiao Liang, Chung-Cheng Chiu, Dan Liebling, Ron Weiss, Anjuli Kannan, Dimitri Kanevsky, Ryan He, Gabor Simko, Benjamin Lee, Françoise Beaufays, Khe Chai Sim, Jimmy Tobin, Chet Gnegy, Jacqueline Huang, Ye Jia, Yu Zhang, Yonghui Wu, Michelle Ramanovich, Rus Heywood, Katrin Tomanek, Bob MacDonald, Pan-Pan Jiang, Ronnie Maor, Rif A. Saurous, Trevor Strohman, Dick Lyon, Avinatan Hassidim, Philip Nelson, and Yossi Matias for their technical contributions and project guidance.

Source: Google AI Blog