Tag Archives: speech

Audiovisual Speech Enhancement in YouTube Stories

While tremendous efforts are invested in improving the quality of videos taken with smartphone cameras, the quality of audio in videos is often overlooked. For example, the speech of a subject in a video where there are multiple people speaking or where there is high background noise might be muddled, distorted, or difficult to understand. In an effort to address this, two years ago we introduced Looking to Listen, a machine learning (ML) technology that uses both visual and audio cues to isolate the speech of a video’s subject. By training the model on a large-scale collection of online videos, we are able to capture correlations between speech and visual signals such as mouth movements and facial expressions, which can then be used to separate the speech of one person in a video from another, or to separate speech from background sounds. We showed that this technology not only achieves state-of-the-art results in speech separation and enhancement (a noticeable 1.5dB improvement over audio-only models), but in particular can improve the results over audio-only processing when there are multiple people speaking, as the visual cues in the video help determine who is saying what.

We are now happy to make the Looking to Listen technology available to users through a new audiovisual Speech Enhancement feature in YouTube Stories (on iOS), allowing creators to take better selfie videos by automatically enhancing their voices and reducing background noise. Getting this technology into users’ hands was no easy feat. Over the past year, we worked closely with users to learn how they would like to use such a feature, in what scenarios, and what balance of speech and background sounds they would like to have in their videos. We heavily optimized the Looking to Listen model to make it run efficiently on mobile devices, overall reducing the running time from 10x real-time on a desktop when our paper came out, to 0.5x real-time performance on the phone. We also put the technology through extensive testing to verify that it performs consistently across different recording conditions and for people with different appearances and voices.

From Research to Product
Optimizing Looking to Listen to allow fast and robust operation on mobile devices required us to overcome a number of challenges. First, all processing needed to be done on-device within the client app in order to minimize processing time and to preserve the user’s privacy; no audio or video information would be sent to servers for processing. Further, the model needed to co-exist alongside other ML algorithms used in the YouTube app in addition to the resource-consuming video recording itself. Finally, the algorithm needed to run quickly and efficiently on-device while minimizing battery consumption.

The first step in the Looking to Listen pipeline is to isolate thumbnail images that contain the faces of the speakers from the video stream. By leveraging MediaPipe BlazeFace with GPU accelerated inference, this step is now able to be executed in just a few milliseconds. We then switched the model part that processes each thumbnail separately to a lighter weight MobileNet (v2) architecture, which outputs visual features learned for the purpose of speech enhancement, extracted from the face thumbnails in 10 ms per frame. Because the compute time to embed the visual features is short, it can be done while the video is still being recorded. This avoids the need to keep the frames in memory for further processing, thereby reducing the overall memory footprint. Then, after the video finishes recording, the audio and the computed visual features are streamed to the audio-visual speech separation model which produces the isolated and enhanced speech.

We reduced the total number of parameters in the audio-visual model by replacing “regular” 2D convolutions with separable ones (1D in the frequency dimension, followed by 1D in the time dimension) with fewer filters. We then optimized the model further using TensorFlow Lite — a set of tools that enable running TensorFlow models on mobile devices with low latency and a small binary size. Finally, we reimplemented the model within the Learn2Compress framework in order to take advantage of built-in quantized training and QRNN support.

Our Looking to Listen on-device pipeline for audiovisual speech enhancement

These optimizations and improvements reduced the running time from 10x real-time on a desktop using the original formulation of Looking to Listen, to 0.5x real-time performance using only an iPhone CPU; and brought the model size down from 120MB to 6MB now, which makes it easier to deploy. Since YouTube Stories videos are short — limited to 15 seconds — the result of the video processing is available within a couple of seconds after the recording is finished.

Finally, to avoid processing videos with clean speech (so as to avoid unnecessary computation), we first run our model only on the first two seconds of the video, then compare the speech-enhanced output to the original input audio. If there is sufficient difference (meaning the model cleaned up the speech), then we enhance the speech throughout the rest of the video.

Researching User Needs
Early versions of Looking to Listen were designed to entirely isolate speech from the background noise. In a user study conducted together with YouTube, we found that users prefer to leave in some of the background sounds to give context and to retain some the general ambiance of the scene. Based on this user study, we take a linear combination of the original audio and our produced clean speech channel: output_audio = 0.1 x original_audio + 0.9 x speech. The following video presents clean speech combined with different levels of the background sounds in the scene (10% background is the balance we use in practice).

Below are additional examples of the enhanced speech results from the new Speech Enhancement feature in YouTube Stories. We recommend watching the videos with good speakers or headphones.

Fairness Analysis
Another important requirement is that the model be fair and inclusive. It must be able to handle different types of voices, languages and accents, as well as different visual appearances. To this end, we conducted a series of tests exploring the performance of the model with respect to various visual and speech/auditory attributes: the speaker’s age, skin tone, spoken language, voice pitch, visibility of the speaker’s face (% of video in which the speaker is in frame), head pose throughout the video, facial hair, presence of glasses, and the level of background noise in the (input) video.

For each of the above visual/auditory attributes, we ran our model on segments from our evaluation set (separate from the training set) and measured the speech enhancement accuracy, broken down according to the different attribute values. Results for some of the attributes are summarized in the following plots. Each data point in the plots represents hundreds (in most cases thousands) of videos fitting the criteria.

Speech enhancement quality (signal-to-distortion ratio, SDR, in dB) for different spoken languages, sorted alphabetically. The average SDR was 7.89 dB with a standard deviation of 0.42 dB — deviation that for human listeners is considered hard to notice.
Left: Speech enhancement quality as a function of the speaker’s voice pitch. The fundamental voice frequency (pitch) of an adult male typically ranges from 85 to 180 Hz, and that of an adult female ranges from 165 to 255 Hz. Right: speech enhancement quality as a function of the speaker’s predicted age.
As our method utilizes facial cues and mouth movements to isolate the speech, we tested whether facial hair (e.g., a moustache, beard) may obstruct those visual cues and affect the method’s performance. Our evaluations show that the quality of speech enhancement is maintained well also in the presence of facial hair.

Using the Feature
YouTube creators who are eligible for YouTube Stories creation may record a video on iOS, and select “Enhance speech” from the volume controls editing tool. This will immediately apply speech enhancement to the audio track and will play back the enhanced speech in a loop. It is then possible to toggle the feature on and off multiple times to compare the enhanced speech with the original audio.

In parallel to this new feature in YouTube, we are also exploring additional venues for this technology. More to come later this year — stay tuned!

Acknowledgements
This feature is a collaboration across multiple teams at Google. Key contributors include: from Research-IL: Oran Lang; from VisCAM: Ariel Ephrat, Mike Krainin, JD Velasquez, Inbar Mosseri, Michael Rubinstein; from Learn2Compress: Arun Kandoor; from MediaPipe: Buck Bourdon, Matsvei Zhdanovich, Matthias Grundmann; from YouTube: Andy Poes, Vadim Lavrusik, Aaron La Lau, Willi Geiger, Simona De Rosa, and Tomer Margolin.

Source: Google AI Blog


Improving Speech Representations and Personalized Models Using Self-Supervision



There are many tasks within speech processing that are easier to solve by having large amounts of data. For example automatic speech recognition (ASR) translates spoken audio into text. In contrast, "non-semantic" tasks focus on the aspects of human speech other than its meaning, encompassing "paralinguistic" tasks, like speech emotion recognition, as well as other kinds of tasks, such as speaker identification, language identification, and certain kinds of voice-based medical diagnoses. In training systems to accomplish these tasks, one common approach is to utilize the largest datasets possible to help ensure good results. However, machine learning techniques that directly rely on massive datasets are often less successful when trained on small datasets.

One way to bridge the performance gap between large and small datasets is to train a representation model on a large dataset, then transfer it to a setting with less data. Representations can improve performance in two ways: they can make it possible to train small models by transforming high-dimensional data (like images and audio) to a lower dimension, and the representation model can also be used as pre-training. In addition, if the representation model is small enough to be run or trained on-device, it can improve performance in a privacy-preserving way by giving users the benefits of a personalized model where the raw data never leaves their device. While representation learning is commonly used in the text domain (e.g. BERT and ALBERT) and in the images domain (e.g. Inception layers and SimCLR), such approaches are underutilized in the speech domain.
Bottom:A large speech dataset is used to train a model, which is then rolled out to other environments. Top Left: On-device personalization — personalized, on-device models combine security and privacy. Top Middle: Small model on embeddings — general-use representations transform high-dimensional, few-example datasets to a lower dimension without sacrificing accuracy; smaller models train faster and are regularized. Top Right: Full model fine-tuning — large datasets can use the embedding model as pre-training to improve performance
Unambiguously improving generally-useful representations, for non-semantic speech tasks in particular, is difficult without a standard benchmark to compare "speech representation usefulness." While the T5 framework systematically evaluates text embeddings and the Visual Task Adaptation Benchmark (VTAB) standardizes image embedding evaluation, both leading to progress in representation learning in those respective fields, there has been no such benchmark for non-semantic speech embeddings.

In "Towards Learning a Universal Non-Semantic Representation of Speech", we make three contributions to representation learning for speech-related applications. First, we present a NOn-Semantic Speech (NOSS) benchmark for comparing speech representations, which includes diverse datasets and benchmark tasks, such as speech emotion recognition, language identification, and speaker identification. These datasets are available in the "audio" section of TensorFlow Datasets. Second, we create and open-source TRIpLet Loss network (TRILL), a new model that is small enough to be executed and fine-tuned on-device, while still outperforming other representations. Third, we perform a large-scale study comparing different representations, and open-source the code used to compute the performance on new representations.

A New Benchmark for Speech Embeddings
For a benchmark to usefully guide model development, it must contain tasks that ought to have similar solutions and exclude those that are significantly different. Previous work either dealt with the variety of possible speech-based tasks independently, or lumped semantic and non-semantic tasks together. Our work improves performance on non-semantic speech tasks, in part, by focusing on neural network architectures that perform well specifically on this subset of speech tasks.

The tasks were selected for the NOSS benchmark on the basis of their 1) diversity — they need to cover a range of use-cases; 2) complexity — they should be challenging; and 3) availability, with particular emphasis on those tasks that are open-source. We combined six datasets of different sizes and tasks.
Datasets for downstream benchmark tasks. *VoxCeleb results in our study were computed using a subset of the dataset that was filtered according to internal policy.
We also introduce three additional intra-speaker tasks to test performance in the personalization scenario. In some datasets with k speakers, we can create k different tasks consisting of training and testing on just a single speaker. Overall performance is averaged across speakers. These three additional intra-speaker tasks measure the ability of an embedding to adapt to a particular speaker, as would be necessary for personalized, on-device models, which are becoming more important as computation moves to smart phones and the internet of things.

To help enable researchers to compare speech embeddings, we’ve added the six datasets in our benchmark to TensorFlow Datasets (in the "audio" section) and open sourced the evaluation framework.

TRILL: A New State of the Art in Non-semantic Speech Classification
Learning an embedding from one dataset and applying it to other tasks is not as common in speech as in other modalities. However, transfer learning, the more general technique of using data from one task to help another (not necessarily with embeddings), has some compelling applications, such as personalizing speech recognizers and voice imitation text-to-speech from few samples. There have been many previously proposed representations of speech, but most of these have been trained on a smaller and less diverse data, have been tested primarily on speech recognition, or both.

To create a data-derived representation of speech that was useful across environments and tasks, we started with AudioSet, a large and diverse dataset that includes about 2500 hours of speech. We then trained an embedding model on a simple, self-supervised criteria derived from previous work on metric learning — embeddings from the same audio should be closer in embedding space than embeddings from different audio. Like BERT and the other text embeddings, the self-supervised loss function doesn't require labels and only relies on the structure of the data itself. This form of self-supervision is the most appropriate for non-semantic speech, since non-semantic phenomena are more stable in time than ASR and other sub-second speech characteristics. This simple, self-supervised criteria captures a large number of acoustic properties that are leveraged in downstream tasks.
TRILL loss: Embeddings from the same audio are closer in embedding space than embeddings from different audio.
TRILL architecture is based on MobileNet, making it fast enough to run on mobile devices. To achieve high accuracy on this small architecture, we distilled the embedding from a larger ResNet50 model without performance degradation.

Benchmark Results
We compared the performance of TRILL against other deep learning representations that are not focused on speech recognition and were trained on similarly diverse datasets. In addition, we compared TRILL to the popular OpenSMILE feature extractor, which uses pre-deep learning techniques (e.g., a fourier transform coefficients, "pitch tracking" using a time-series of pitch measurements, etc.), and randomly initialized networks, which have been shown to be strong baselines. To aggregate the performance across tasks that have different performance characteristics, we first train a small number of simple models, for a given task and embedding. The best result is chosen. Then, to understand the effect that a particular embedding has across all tasks, we calculate a linear regression on the observed accuracies, with both the model and task as the explanatory variables. The effect a model has on the accuracy is the coefficient associated with the model in the regression. For a given task, when changing from one model to another, the resulting change in accuracy is expected to be the difference in y-values in the figure below.
Effect of model on accuracy.
TRILL outperforms the other representations in our study. Factors that contribute to TRILL's success are the diversity of the training dataset, the large context window of the network, and the generality of the TRILL training loss that broadly preserves acoustic characteristics instead of prematurely focusing on certain aspects. Note that representations from intermediate network layers are often more generally useful. The intermediate representations are larger, have finer temporal granularity, and in the case of the classification networks they retain more general information that isn't as specific to the classes on which they were trained.

Another benefit of a generally-useful model is that it can be used to initialize a model on a new task. When the sample size of a new task is small, fine-tuning an existing model may lead to better results than training the model from scratch. We achieved a new state-of-the-art result on three out of six benchmark tasks using this technique, despite doing no dataset-specific hyperparameter tuning.

To compare our new representation, we also tested it on the mask sub-challenge of the Interspeech 2020 Computational Paralinguistics Challenge (ComParE). In this challenge, models must predict whether a speaker is wearing a mask, which would affect their speech. The mask effects are sometimes subtle, and audio clips are only one second long. A linear model on TRILL outperformed the best baseline model, which was a fusion of many models on different kinds of features including traditional spectral and deep-learned features.

Summary
The code to evaluate NOSS is available on GitHub, the datasets are on TensorFlow Datasets, and the TRILL models are available on AI Hub.

The NOn-Semantic Speech benchmark helps researchers create speech embeddings that are useful in a wide range of contexts, including for personalization and small-dataset problems. We provide the TRILL model to the research community as a baseline embedding to surpass.

Acknowledgements
The core team behind this work includes Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Felix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, and Yinnon Haviv. We'd also like to thank Avinatan Hassidim and Yossi Matias for technical guidance.

Source: Google AI Blog


Improving Speech Representations and Personalized Models Using Self-Supervision



There are many tasks within speech processing that are easier to solve by having large amounts of data. For example automatic speech recognition (ASR) translates spoken audio into text. In contrast, "non-semantic" tasks focus on the aspects of human speech other than its meaning, encompassing "paralinguistic" tasks, like speech emotion recognition, as well as other kinds of tasks, such as speaker identification, language identification, and certain kinds of voice-based medical diagnoses. In training systems to accomplish these tasks, one common approach is to utilize the largest datasets possible to help ensure good results. However, machine learning techniques that directly rely on massive datasets are often less successful when trained on small datasets.

One way to bridge the performance gap between large and small datasets is to train a representation model on a large dataset, then transfer it to a setting with less data. Representations can improve performance in two ways: they can make it possible to train small models by transforming high-dimensional data (like images and audio) to a lower dimension, and the representation model can also be used as pre-training. In addition, if the representation model is small enough to be run or trained on-device, it can improve performance in a privacy-preserving way by giving users the benefits of a personalized model where the raw data never leaves their device. While representation learning is commonly used in the text domain (e.g. BERT and ALBERT) and in the images domain (e.g. Inception layers and SimCLR), such approaches are underutilized in the speech domain.
Bottom:A large speech dataset is used to train a model, which is then rolled out to other environments. Top Left: On-device personalization — personalized, on-device models combine security and privacy. Top Middle: Small model on embeddings — general-use representations transform high-dimensional, few-example datasets to a lower dimension without sacrificing accuracy; smaller models train faster and are regularized. Top Right: Full model fine-tuning — large datasets can use the embedding model as pre-training to improve performance
Unambiguously improving generally-useful representations, for non-semantic speech tasks in particular, is difficult without a standard benchmark to compare "speech representation usefulness." While the T5 framework systematically evaluates text embeddings and the Visual Task Adaptation Benchmark (VTAB) standardizes image embedding evaluation, both leading to progress in representation learning in those respective fields, there has been no such benchmark for non-semantic speech embeddings.

In "Towards Learning a Universal Non-Semantic Representation of Speech", we make three contributions to representation learning for speech-related applications. First, we present a NOn-Semantic Speech (NOSS) benchmark for comparing speech representations, which includes diverse datasets and benchmark tasks, such as speech emotion recognition, language identification, and speaker identification. These datasets are available in the "audio" section of TensorFlow Datasets. Second, we create and open-source TRIpLet Loss network (TRILL), a new model that is small enough to be executed and fine-tuned on-device, while still outperforming other representations. Third, we perform a large-scale study comparing different representations, and open-source the code used to compute the performance on new representations.

A New Benchmark for Speech Embeddings
For a benchmark to usefully guide model development, it must contain tasks that ought to have similar solutions and exclude those that are significantly different. Previous work either dealt with the variety of possible speech-based tasks independently, or lumped semantic and non-semantic tasks together. Our work improves performance on non-semantic speech tasks, in part, by focusing on neural network architectures that perform well specifically on this subset of speech tasks.

The tasks were selected for the NOSS benchmark on the basis of their 1) diversity — they need to cover a range of use-cases; 2) complexity — they should be challenging; and 3) availability, with particular emphasis on those tasks that are open-source. We combined six datasets of different sizes and tasks.
Datasets for downstream benchmark tasks. *VoxCeleb results in our study were computed using a subset of the dataset that was filtered according to internal policy.
We also introduce three additional intra-speaker tasks to test performance in the personalization scenario. In some datasets with k speakers, we can create k different tasks consisting of training and testing on just a single speaker. Overall performance is averaged across speakers. These three additional intra-speaker tasks measure the ability of an embedding to adapt to a particular speaker, as would be necessary for personalized, on-device models, which are becoming more important as computation moves to smart phones and the internet of things.

To help enable researchers to compare speech embeddings, we’ve added the six datasets in our benchmark to TensorFlow Datasets (in the "audio" section) and open sourced the evaluation framework.

TRILL: A New State of the Art in Non-semantic Speech Classification
Learning an embedding from one dataset and applying it to other tasks is not as common in speech as in other modalities. However, transfer learning, the more general technique of using data from one task to help another (not necessarily with embeddings), has some compelling applications, such as personalizing speech recognizers and voice imitation text-to-speech from few samples. There have been many previously proposed representations of speech, but most of these have been trained on a smaller and less diverse data, have been tested primarily on speech recognition, or both.

To create a data-derived representation of speech that was useful across environments and tasks, we started with AudioSet, a large and diverse dataset that includes about 2500 hours of speech. We then trained an embedding model on a simple, self-supervised criteria derived from previous work on metric learning — embeddings from the same audio should be closer in embedding space than embeddings from different audio. Like BERT and the other text embeddings, the self-supervised loss function doesn't require labels and only relies on the structure of the data itself. This form of self-supervision is the most appropriate for non-semantic speech, since non-semantic phenomena are more stable in time than ASR and other sub-second speech characteristics. This simple, self-supervised criteria captures a large number of acoustic properties that are leveraged in downstream tasks.
TRILL loss: Embeddings from the same audio are closer in embedding space than embeddings from different audio.
TRILL architecture is based on MobileNet, making it fast enough to run on mobile devices. To achieve high accuracy on this small architecture, we distilled the embedding from a larger ResNet50 model without performance degradation.

Benchmark Results
We compared the performance of TRILL against other deep learning representations that are not focused on speech recognition and were trained on similarly diverse datasets. In addition, we compared TRILL to the popular OpenSMILE feature extractor, which uses pre-deep learning techniques (e.g., a fourier transform coefficients, "pitch tracking" using a time-series of pitch measurements, etc.), and randomly initialized networks, which have been shown to be strong baselines. To aggregate the performance across tasks that have different performance characteristics, we first train a small number of simple models, for a given task and embedding. The best result is chosen. Then, to understand the effect that a particular embedding has across all tasks, we calculate a linear regression on the observed accuracies, with both the model and task as the explanatory variables. The effect a model has on the accuracy is the coefficient associated with the model in the regression. For a given task, when changing from one model to another, the resulting change in accuracy is expected to be the difference in y-values in the figure below.
Effect of model on accuracy.
TRILL outperforms the other representations in our study. Factors that contribute to TRILL's success are the diversity of the training dataset, the large context window of the network, and the generality of the TRILL training loss that broadly preserves acoustic characteristics instead of prematurely focusing on certain aspects. Note that representations from intermediate network layers are often more generally useful. The intermediate representations are larger, have finer temporal granularity, and in the case of the classification networks they retain more general information that isn't as specific to the classes on which they were trained.

Another benefit of a generally-useful model is that it can be used to initialize a model on a new task. When the sample size of a new task is small, fine-tuning an existing model may lead to better results than training the model from scratch. We achieved a new state-of-the-art result on three out of six benchmark tasks using this technique, despite doing no dataset-specific hyperparameter tuning.

To compare our new representation, we also tested it on the mask sub-challenge of the Interspeech 2020 Computational Paralinguistics Challenge (ComParE). In this challenge, models must predict whether a speaker is wearing a mask, which would affect their speech. The mask effects are sometimes subtle, and audio clips are only one second long. A linear model on TRILL outperformed the best baseline model, which was a fusion of many models on different kinds of features including traditional spectral and deep-learned features.

Summary
The code to evaluate NOSS is available on GitHub, the datasets are on TensorFlow Datasets, and the TRILL models are available on AI Hub.

The NOn-Semantic Speech benchmark helps researchers create speech embeddings that are useful in a wide range of contexts, including for personalization and small-dataset problems. We provide the TRILL model to the research community as a baseline embedding to surpass.

Acknowledgements
The core team behind this work includes Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Felix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, and Yinnon Haviv. We'd also like to thank Avinatan Hassidim and Yossi Matias for technical guidance.

Source: Google AI Blog


Improving Audio Quality in Duo with WaveNetEQ



Online calls have become an everyday part of life for millions of people by helping to streamline their work and connect them to loved ones. To transmit a call across the internet, the data from calls are split into short chunks, called packets. These packets make their way over the network from the sender to the receiver where they are reassembled to make continuous streams of video and audio. However, packets often arrive at the other end in the wrong order or at the wrong time, an issue generally referred to as jitter, and sometimes individual packets can be lost entirely. Issues such as these lead to lower call quality, since the receiver has to try and fill in the gaps, and are a pervasive problem for both audio and video transmission. For example, 99% of Google Duo calls need to deal with packet losses, excessive jitter or network delays. Of those calls, 20% lose more than 3% of the total audio duration due to network issues, and 10% of calls lose more than 8%.
Simplified diagram of network problems leading to packet loss, which needs to be counteracted by the receiver to allow reliable real-time communication.
In order to ensure reliable real-time communication, it is necessary to deal with packets that are missing when the receiver needs them. Specifically, if new audio is not provided continuously, glitches and gaps will be audible, but repeating the same audio over and over is not an ideal solution, as it produces artifacts and reduces the overall quality of the call. The process of dealing with the missing packets is called packet loss concealment (PLC). The receiver’s PLC module is responsible for creating audio (or video) to fill in the gaps created by packet losses, excessive jitter or temporary network glitches, all three of which result in an absence of data.

To address these audio issues, we present WaveNetEQ, a new PLC system now being used in Duo. WaveNetEQ is a generative model, based on DeepMind’s WaveRNN technology, that is trained using a large corpus of speech data to realistically continue short speech segments enabling it to fully synthesize the raw waveform of missing speech. Because Duo calls are end-to-end encrypted, all processing needs to be done on-device. The WaveNetEQ model is fast enough to run on a phone, while still providing state-of-the-art audio quality and more natural sounding PLC than other systems currently in use.

A New PLC System for Duo
Like many other web-based communication systems, Duo is based on the WebRTC open source project. To conceal the effects of packet loss, WebRTC’s NetEQ component uses signal processing methods, which analyze the speech and produce a smooth continuation that works very well for small losses (20ms or less), but does not sound good when the number of missing packets leads to gaps of 60ms or more. In those latter cases the speech becomes robotic and repetitive, a characteristic sound that is unfortunately familiar to many internet voice callers.

To better manage packet loss, we replace the NetEQ PLC component with a modified version of WaveRNN, a recurrent neural network model for speech synthesis consisting of two parts, an autoregressive network and a conditioning network. The autoregressive network is responsible for the continuity of the signal and provides the short-term and mid-term structure for the speech by having each generated sample depend on the network’s previous outputs. The conditioning network influences the autoregressive network to produce audio that is consistent with the more slowly-moving input features.

However, WaveRNN, like its predecessor WaveNet, was created with the text-to-speech (TTS) application in mind. As a TTS model, WaveRNN is supplied with the information of what it is supposed to say and how to say it. The conditioning network directly receives this information as input in form of the phonemes that make up the words and additional prosody features (i.e., all non-text information like intonation or pitch). In a way, the conditioning network can “see into the future” and then steer the autoregressive network towards the right waveforms to match it. In the case of a PLC system and real-time communication, this context is not provided.

For a functional PLC system, one must both extract contextual information from the current speech (i.e., the past), and generate a plausible sound to continue it. Our solution, WaveNetEQ, does both at the same time, using the autoregressive network to provide the audio continuation during a packet loss event, and the conditioning network to model long term features, like voice characteristics. The spectrogram of the past audio signal is used as input for the conditioning network, which extracts limited information about the prosody and textual content. This condensed information is fed to the autoregressive network, which combines it with the audio of the recent past to predict the next sample in the waveform domain.

This differs slightly from the procedure that was followed during training of the WaveNetEQ model, where the autoregressive network receives the actual sample present in the training data as input for the next step, rather than using the last sample it produced. This process, called teacher forcing, assures that the model learns valuable information, even at an early stage of training when its predictions are still of low quality. Once the model is fully trained and put to use in an audio or video call, teacher forcing is only used to "warm up" the model for the first sample, and after that its own output is passed back as input for the next step.
WaveNetEQ architecture. During inference, we "warm up" the autoregressive network by teacher forcing with the most recent audio. Afterwards, the model is supplied with its own output as input for the next step. A MEL spectrogram from a longer audio part is used as input for the conditioning network.
The model is applied to the audio data in Duo's jitter buffer. Once the real audio continues after a packet loss event, we seamlessly merge the synthetic and real audio stream. In order to find the best alignment between the two signals, the model generates slightly more output than is required and then cross-fades from one to the other. This makes the transition smooth and avoids noticeable noise.
Simulation of PLC events on audio over a moving span of 60 ms. The blue line represents the real audio signal, including past and future parts of the PLC event. At each timestep the orange line represents the synthetic audio WaveNetEQ would predict if the audio were to cut out at the vertical grey line.
60 ms Packet Loss
NetEQ
WaveNetEQ
NetEQ
WaveNetEQ

120 ms Packet Loss
NetEQ
WaveNetEQ
NetEQ
WaveNetEQ
Audio clips: Comparison of WebRTC’s default PLC system, NetEQ, with our model, WaveNetEQ. Audio clips were taken from LibriTTS and 10% of the audio was dropped in 60 or 120 ms chunks and then filled in by the PLC systems.
Ensuring Robustness
One important factor during PLC is the ability of the network to adapt to variable input signals, including different speakers or changes in background noise. In order to ensure the robustness of the model across a wide range of users, we trained WaveNetEQ on a speech dataset that contains over 100 speakers in 48 different languages, which allows the model to learn the characteristics of human speech in general, instead of the properties of a specific language. To ensure WaveNetEQ is able to deal with noisy environments, such as answering your phone in the train station or in the cafeteria, we augment the data by mixing it with a wide variety of background noises.

While our model learns how to plausibly continue speech, this is only true on a short scale — it can finish a syllable but does not predict words, per se. Instead, for longer packet losses we gradually fade out until the model only produces silence after 120 milliseconds. To further ensure that the model is not generating false syllables, we evaluated samples from WaveNetEQ and NetEQ using the Google Cloud Speech-to-Text API and found no significant difference in the word error rate, i.e., how many mistakes were made transcribing the spoken text.

We have been experimenting with WaveNetEQ in Duo, where the feature has demonstrated a positive impact on call quality and user experience. WaveNetEQ is already available in all Duo calls on Pixel 4 phones and is now being rolled out to additional models.

Acknowledgements
The core team includes Alessio Bazzica, Niklas Blum, Lennart Kolmodin, Henrik Lundin, Alex Narest, Olga Sharonova from Google and Tom Walters from DeepMind. We would also like to thank Martin Bruse (Google), Norman Casagrande, Ray Smith, Chenjie Gu and Erich Elsen (DeepMind) for their contributions.

Source: Google AI Blog


Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model



Google's mission is not just to organize the world's information but to make it universally accessible, which means ensuring that our products work in as many of the world's languages as possible. When it comes to understanding human speech, which is a core capability of the Google Assistant, extending to more languages poses a challenge: high-quality automatic speech recognition (ASR) systems require large amounts of audio and text data — even more so as data-hungry neural models continue to revolutionize the field. Yet many languages have little data available.

We wondered how we could keep the quality of speech recognition high for speakers of data-scarce languages. A key insight from the research community was that much of the "knowledge" a neural network learns from audio data of a data-rich language is re-usable by data-scarce languages; we don't need to learn everything from scratch. This led us to study multilingual speech recognition, in which a single model learns to transcribe multiple languages.

In “Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model”, published at Interspeech 2019, we present an end-to-end (E2E) system trained as a single model, which allows for real-time multilingual speech recognition. Using nine Indian languages, we demonstrated a dramatic improvement in the ASR quality on several data-scarce languages, while still improving performance for the data-rich languages.

India: A Land of Languages
For this study, we focused on India, an inherently multilingual society where there are more than thirty languages with at least a million native speakers. Many of these languages overlap in acoustic and lexical content due to the geographic proximity of the native speakers and shared cultural history. Additionally, many Indians are bilingual or trilingual, making the use of multiple languages within a conversation a common phenomenon, and a natural case for training a single multilingual model. In this work, we combined nine primary Indian languages, namely Hindi, Marathi, Urdu, Bengali, Tamil, Telugu, Kannada, Malayalam and Gujarati.

A Low-latency All-neural Multilingual Model
Traditional ASR systems contain separate components for acoustic, pronunciation, and language models. While there have been attempts to make some or all of the traditional ASR components multilingual [1,2,3,4], this approach can be complex and difficult to scale. E2E ASR models combine all three components into a single neural network and promise scalability and ease of parameter sharing. Recent works have extended E2E models to be multilingual [1,2], but they did not address the need for real-time speech recognition, a key requirement for applications such as the Assistant, Voice Search and GBoard dictation. For this, we turned to recent research at Google that used a Recurrent Neural Network Transducer (RNN-T) model to achieve streaming E2E ASR. The RNN-T system outputs words one character at a time, just as if someone was typing in real time, however this was not multilingual. We built upon this architecture to develop a low-latency model for multilingual speech recognition.
[Left] A traditional monolingual speech recognizer comprising of Acoustic, Pronunciation and Language Models for each language. [Middle] A traditional multilingual speech recognizer where the Acoustic and Pronunciation model is multilingual, while the Language model is language-specific. [Right] An E2E multilingual speech recognizer where the Acoustic, Pronunciation and Language Model is combined into a single multilingual model.
Large-Scale Data Challenges
Using large-scale, real-world data for training a multilingual model is complicated by data imbalance. Given the steep skew in the distribution of speakers across the languages and speech product maturity, it is not surprising to have varying amounts of transcribed data available per language. As a result, a multilingual model can tend to be more influenced by languages that are over-represented in the training set. This bias is more prominent in an E2E model, which unlike a traditional ASR system, does not have access to additional in-language text data and learns lexical characteristics of the languages solely from the audio training data.
Histogram of training data for the nine languages showing the steep skew in the data available.
We addressed this issue with a few architectural modifications. First, we provided an extra language identifier input, which is an external signal derived from the language locale of the training data; i.e. the language preference set in an individual’s phone. This signal is combined with the audio input as a one-hot feature vector. We hypothesize that the model is able to use the language vector not only to disambiguate the language but also to learn separate features for separate languages, as needed, which helped with data imbalance.

Building on the idea of language-specific representations within the global model, we further augmented the network architecture by allocating extra parameters per language in the form of residual adapter modules. Adapters helped fine-tune a global model on each language while maintaining parameter efficiency of a single global model, and in turn, improved performance.
[Left] Multilingual RNN-T architecture with a language identifier. [Middle] Residual adapters inside the encoder. For a Tamil utterance, only the Tamil adapters are applied to each activation. [Right] Architecture details of the Residual Adapter modules. For more details please see our paper.
Putting all of these elements together, our multilingual model outperforms all the single-language recognizers, with especially large improvements in data-scarce languages like Kannada and Urdu. Moreover, since it is a streaming E2E model, it simplifies training and serving, and is also usable in low-latency applications like the Assistant. Building on this result, we hope to continue our research on multilingual ASRs for other language groups, to better assist our growing body of diverse users.

Acknowledgements
We would like to thank the following for their contribution to this research: Tara N. Sainath, Eugene Weinstein, Bo Li, Shubham Toshniwal, Ron Weiss, Bhuvana Ramabhadran, Yonghui Wu, Ankur Bapna, Zhifeng Chen, Seungji Lee, Meysam Bastani, Mikaela Grace, Pedro Moreno, Yanzhang (Ryan) He, Khe Chai Sim.

Source: Google AI Blog


Assessing the Quality of Long-Form Synthesized Speech



Automatically generated speech is everywhere, from directions being read out aloud while you are driving, to virtual assistants on your phone or smart speaker devices at home. While much research is being done to try to make synthesized speech sound as natural as possible—such as generating speech for low-resource languages and creating human-like speech with Tacotron 2—how does one evaluate the generated speech? The best way to find out is to ask people, who are very good at telling if something sounds natural or not.

In the field of speech synthesis, subjects are routinely asked to listen to samples of synthesized speech and rate their quality. Yet, until now, evaluation of synthesized speech has been done on a sentence-by-sentence basis. But often one wants to know the quality of a series of sentences that belong together, such as a paragraph in a news article or a turn in a conversation. This is where it gets interesting, as there is more than one way of evaluating sentences that naturally occur in a sequence, and, surprisingly, a rigorous comparison of these different methods has not been carried out. This in turn can hinder research progress in developing products that rely on generated speech.

To address this challenge, we present “Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs”, a publication to appear at SSW10 in which we compare several ways of evaluating synthesized speech for multi-line texts. We find that when a sentence is evaluated as part of a longer text involving several sentences, the outcome is influenced by the way in which the audio sample is presented to the people evaluating it. For example, when the sentence is presented by itself, without any context, the rating people give on average is substantially different from the rating they give when they listen to the same sentence with some context (while the context doesn't have to be rated).

Evaluating Automatically Generated Speech
To determine the quality of speech signals, it is common practice to ask several human raters to give their opinion for a particular sample, on a 1-to-5 scale. This sample can be automatically generated, but it can also be natural speech (i.e., an actual person saying a sentence out loud), which serves as a control. The scores of all reviewers rating a particular speech sample are averaged to get a Mean Opinion Score (MOS).

Until now, MOS ratings were typically collected per sentence, i.e., raters listened to sentences in isolation to form their opinion. Instead of this typical approach, we consider three different ways of presenting speech samples to raters—both with and without context—and we show that each approach yields different results. The first, presenting the sentence in isolation, is the default method commonly used in the field. An alternative method is to provide the full context for the sentence. In this case, the entire paragraph to which the sentence belongs is included and the ensemble is rated. The final approach is to provide a context-stimulus pair. Here, rather than providing full context, only some context is provided, such as the preceding sentence(s) from the original paragraph.

Interestingly, these three different approaches for presenting speech give different results even when applied to natural speech. This is demonstrated in the figure below, where the MOS scores are presented for natural speech samples rated using the three different methods of presentation. Even though the sentences being rated are identical across the three different settings, the scores are different on average, depending on the context in which they were presented.
MOS results for natural speech from a dataset consisting of news articles. Though the differences appear small, they are significant between all conditions (two-tailed t-test with α=0.05).
Examination of the figure above reveals that raters rarely give top scores (a five) even to recorded human speech, which may be surprising. However, this is a typical result seen in sentence evaluation studies and probably has to do with a more generic pattern of behavior, that people tend to avoid using the extreme ends of a scale, regardless of the task or setting.

When evaluated synthesized speech, the differences are more pronounced.
MOS results for synthesized speech on the same news article dataset used above. All lines are synthesized speech, unless indicated otherwise.
To see if the way context is presented makes a difference, we tried several different ways of providing it: one or two sentences leading up to the sentence to be evaluated, provided as generated speech or real speech. When context is added, the scores get higher (the four blue bars on the left) except when the context presented is real speech, in which case the score drops (the rightmost blue bar). Our hypothesis is that this has to do with an anchoring effect—if the context is very good (real speech) the synthesized speech, in comparison, is perceived as less natural.

Predicting Paragraph Score
When an entire paragraph of synthesized speech is played (the yellow bar), this is perceived as even less natural than in the other settings. Our original hypothesis was a weakest-link argument—the rating is probably as bad as the worst sentence in the paragraph. If that were the case, it should be easy to predict the rating of a paragraph by considering the ratings of the individual sentences in it, perhaps simply taking the minimum value to get the paragraph rating. It turns out, however, that does not work.

The failure of the weakest-link hypothesis may be due to more subtle factors that are difficult to tease out with such a simple approach. To test this, we also trained a machine learning algorithm to predict the paragraph score from the individual sentences. However, this approach, too, was unable to successfully predict paragraph scores reliably.

Conclusion
Evaluating synthesized speech is not straightforward when multiple sentences are involved. The traditional paradigm of rating sentences in isolation does not give the full picture, and one should be aware of anchoring effects when context is provided. Rating full paragraphs might be the most conservative approach. We hope our findings help advance future work in speech synthesis where long-form content is concerned, such as audio book readers and conversational agents.

Acknowledgments
Many thanks to all authors of the paper: Rob Clark, Hanna Silen, Ralph Leith.

Source: Google AI Blog


Assessing the Quality of Long-Form Synthesized Speech



Automatically generated speech is everywhere, from directions being read out aloud while you are driving, to virtual assistants on your phone or smart speaker devices at home. While much research is being done to try to make synthesized speech sound as natural as possible—such as generating speech for low-resource languages and creating human-like speech with Tacotron 2—how does one evaluate the generated speech? The best way to find out is to ask people, who are very good at telling if something sounds natural or not.

In the field of speech synthesis, subjects are routinely asked to listen to samples of synthesized speech and rate their quality. Yet, until now, evaluation of synthesized speech has been done on a sentence-by-sentence basis. But often one wants to know the quality of a series of sentences that belong together, such as a paragraph in a news article or a turn in a conversation. This is where it gets interesting, as there is more than one way of evaluating sentences that naturally occur in a sequence, and, surprisingly, a rigorous comparison of these different methods has not been carried out. This in turn can hinder research progress in developing products that rely on generated speech.

To address this challenge, we present “Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs”, a publication to appear at SSW10 in which we compare several ways of evaluating synthesized speech for multi-line texts. We find that when a sentence is evaluated as part of a longer text involving several sentences, the outcome is influenced by the way in which the audio sample is presented to the people evaluating it. For example, when the sentence is presented by itself, without any context, the rating people give on average is substantially different from the rating they give when they listen to the same sentence with some context (while the context doesn't have to be rated).

Evaluating Automatically Generated Speech
To determine the quality of speech signals, it is common practice to ask several human raters to give their opinion for a particular sample, on a 1-to-5 scale. This sample can be automatically generated, but it can also be natural speech (i.e., an actual person saying a sentence out loud), which serves as a control. The scores of all reviewers rating a particular speech sample are averaged to get a Mean Opinion Score (MOS).

Until now, MOS ratings were typically collected per sentence, i.e., raters listened to sentences in isolation to form their opinion. Instead of this typical approach, we consider three different ways of presenting speech samples to raters—both with and without context—and we show that each approach yields different results. The first, presenting the sentence in isolation, is the default method commonly used in the field. An alternative method is to provide the full context for the sentence. In this case, the entire paragraph to which the sentence belongs is included and the ensemble is rated. The final approach is to provide a context-stimulus pair. Here, rather than providing full context, only some context is provided, such as the preceding sentence(s) from the original paragraph.

Interestingly, these three different approaches for presenting speech give different results even when applied to natural speech. This is demonstrated in the figure below, where the MOS scores are presented for natural speech samples rated using the three different methods of presentation. Even though the sentences being rated are identical across the three different settings, the scores are different on average, depending on the context in which they were presented.
MOS results for natural speech from a dataset consisting of news articles. Though the differences appear small, they are significant between all conditions (two-tailed t-test with α=0.05).
Examination of the figure above reveals that raters rarely give top scores (a five) even to recorded human speech, which may be surprising. However, this is a typical result seen in sentence evaluation studies and probably has to do with a more generic pattern of behavior, that people tend to avoid using the extreme ends of a scale, regardless of the task or setting.

When evaluated synthesized speech, the differences are more pronounced.
MOS results for synthesized speech on the same news article dataset used above. All lines are synthesized speech, unless indicated otherwise.
To see if the way context is presented makes a difference, we tried several different ways of providing it: one or two sentences leading up to the sentence to be evaluated, provided as generated speech or real speech. When context is added, the scores get higher (the four blue bars on the left) except when the context presented is real speech, in which case the score drops (the rightmost blue bar). Our hypothesis is that this has to do with an anchoring effect—if the context is very good (real speech) the synthesized speech, in comparison, is perceived as less natural.

Predicting Paragraph Score
When an entire paragraph of synthesized speech is played (the yellow bar), this is perceived as even less natural than in the other settings. Our original hypothesis was a weakest-link argument—the rating is probably as bad as the worst sentence in the paragraph. If that were the case, it should be easy to predict the rating of a paragraph by considering the ratings of the individual sentences in it, perhaps simply taking the minimum value to get the paragraph rating. It turns out, however, that does not work.

The failure of the weakest-link hypothesis may be due to more subtle factors that are difficult to tease out with such a simple approach. To test this, we also trained a machine learning algorithm to predict the paragraph score from the individual sentences. However, this approach, too, was unable to successfully predict paragraph scores reliably.

Conclusion
Evaluating synthesized speech is not straightforward when multiple sentences are involved. The traditional paradigm of rating sentences in isolation does not give the full picture, and one should be aware of anchoring effects when context is provided. Rating full paragraphs might be the most conservative approach. We hope our findings help advance future work in speech synthesis where long-form content is concerned, such as audio book readers and conversational agents.

Acknowledgments
Many thanks to all authors of the paper: Rob Clark, Hanna Silen, Ralph Leith.

Source: Google AI Blog


Introducing Translatotron: An End-to-End Speech-to-Speech Translation Model



Speech-to-speech translation systems have been developed over the past several decades with the goal of helping people who speak different languages to communicate with each other. Such systems have usually been broken into three separate components: automatic speech recognition to transcribe the source speech as text, machine translation to translate the transcribed text into the target language, and text-to-speech synthesis (TTS) to generate speech in the target language from the translated text. Dividing the task into such a cascade of systems has been very successful, powering many commercial speech-to-speech translation products, including Google Translate.

In “Direct speech-to-speech translation with a sequence-to-sequence model”, we propose an experimental new system that is based on a single attentive sequence-to-sequence model for direct speech-to-speech translation without relying on intermediate text representation. Dubbed Translatotron, this system avoids dividing the task into separate stages, providing a few advantages over cascaded systems, including faster inference speed, naturally avoiding compounding errors between recognition and translation, making it straightforward to retain the voice of the original speaker after translation, and better handling of words that do not need to be translated (e.g., names and proper nouns).

Translatotron
The emergence of end-to-end models on speech translation started in 2016, when researchers demonstrated the feasibility of using a single sequence-to-sequence model for speech-to-text translation. In 2017, we demonstrated that such end-to-end models can outperform cascade models. Many approaches to further improve end-to-end speech-to-text translation models have been proposed recently, including our effort on leveraging weakly supervised data. Translatotron goes a step further by demonstrating that a single sequence-to-sequence model can directly translate speech from one language into speech in another language, without relying on an intermediate text representation in either language, as is required in cascaded systems.

Translatotron is based on a sequence-to-sequence network which takes source spectrograms as input and generates spectrograms of the translated content in the target language. It also makes use of two other separately trained components: a neural vocoder that converts output spectrograms to time-domain waveforms, and, optionally, a speaker encoder that can be used to maintain the character of the source speaker’s voice in the synthesized translated speech. During training, the sequence-to-sequence model uses a multitask objective to predict source and target transcripts at the same time as generating target spectrograms. However, no transcripts or other intermediate text representations are used during inference.

Model architecture of Translatotron.
Performance
We validated Translatotron’s translation quality by measuring the BLEU score, computed with text transcribed by a speech recognition system. Though our results lag behind a conventional cascade system, we have demonstrated the feasibility of the end-to-end direct speech-to-speech translation.

Compared in the audio clips below are the direct speech-to-speech translation output from Translatotron to that of the baseline cascade method. In this case, both systems provide a suitable translation and speak naturally using the same canonical voice.


Input (Spanish)
Reference translation (English)
Baseline cascade translation
Translatotron translation

You can listen to more audio samples here.

Preserving Vocal Characteristics
By incorporating a speaker encoder network, Translatotron is also able to retain the original speaker’s vocal characteristics in the translated speech, which makes the translated speech sound more natural and less jarring. This feature leverages previous Google research on speaker verification and speaker adaptation for TTS. The speaker encoder is pretrained on the speaker verification task, learning to encode speaker characteristics from a short example utterance. Conditioning the spectrogram decoder on this encoding makes it possible to synthesize speech with similar speaker characteristics, even though the content is in a different language.

The audio clips below demonstrate the performance of Translatotron when transferring the original speaker’s voice to the translated speech. In this example, Translatotron gives more accurate translation than the baseline cascade model, while being able to retain the original speaker’s vocal characteristics. The Translatotron output that retains the original speaker’s voice is trained with less data than the one using the canonical voice, so that they yield slightly different translations.

Input (Spanish)
Reference translation (English)
Baseline cascade translation
Translatotron translation (canonical voice)
Translatotron translation (original speaker’s voice)

More audio samples are available here.

Conclusion
To the best of our knowledge, Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language. It is also able to retain the source speaker’s voice in the translated speech. We hope that this work can serve as a starting point for future research on end-to-end speech-to-speech translation systems.

Acknowledgments
This research was a joint work between the Google Brain, Google Translate, and Google Speech teams. Contributors include Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Mengmeng Niu, Quan Wang, Jason Pelecanos, Ignacio Lopez Moreno, Tom Walters, Heiga Zen, Patrick Nguyen, Yu Zhang, Jonathan Shen, Orhan Firat, and Yonghui Wu. We also thank Jorge Pereira and Stella Laurenzo for verifying the quality of the translation from Translatotron.

Source: Google AI Blog


Accurate Online Speaker Diarization with Supervised Learning



Speaker diarization, the process of partitioning an audio stream with multiple people into homogeneous segments associated with each individual, is an important part of speech recognition systems. By solving the problem of “who spoke when”, speaker diarization has applications in many important scenarios, such as understanding medical conversations, video captioning and more. However, training these systems with supervised learning methods is challenging — unlike standard supervised classification tasks, a robust diarization model requires the ability to associate new individuals with distinct speech segments that weren't involved in training. Importantly, this limits the quality of both online and offline diarization systems. Online systems usually suffer more, since they require diarization results in real time.
Online speaker diarization on streaming audio input. Different colors in the bottom axis indicate different speakers.
In “Fully Supervised Speaker Diarization”, we describe a new model that seeks to make use of supervised speaker labels in a more effective manner. Here “fully” implies that all components in the speaker diarization system, including the estimation of the number of speakers, are trained in supervised ways, so that they can benefit from increasing the amount of labeled data available. On the NIST SRE 2000 CALLHOME benchmark, our diarization error rate (DER) is as low as 7.6%, compared to 8.8% DER from our previous clustering-based method, and 9.9% from deep neural network embedding methods. Moreover, our method achieves this lower error rate based on online decoding, making it specifically suitable for real-time applications. As such we are open sourcing the core algorithms in our paper to accelerate more research along this direction.

Clustering versus Interleaved-state RNN
Modern speaker diarization systems are usually based on clustering algorithms such as k-means or spectral clustering. Since these clustering methods are unsupervised, they could not make good use of the supervised speaker labels available in data. Moreover, online clustering algorithms usually have worse quality in real-time diarization applications with streaming audio inputs. The key difference between our model and common clustering algorithms is that in our method, all speakers’ embeddings are modeled by a parameter-sharing recurrent neural network (RNN), and we distinguish different speakers using different RNN states, interleaved in the time domain.

To understand how this works, consider the example below in which there are four possible speakers: blue, yellow, pink and green (this is arbitrary, and in fact there may be more — our model uses the Chinese restaurant process to accommodate the unknown number of speakers). Each speaker starts with its own RNN instance (with a common initial state shared among all speakers) and keeps updating the RNN state given the new embeddings from this speaker. In the example below, the blue speaker keeps updating its RNN state until a different speaker, yellow, comes in. If blue speaks again later, it resumes updating its RNN state. (This is just one of the possibilities for speech segment y7 in the figure below. If new speaker green enters, it will start with a new RNN instance.)
The generative process of our model. Colors indicate labels for speaker segments.
Representing speakers as RNN states enables us to learn the high-level knowledge shared across different speakers and utterances using RNN parameters, and this promises the usefulness of more labeled data. In contrast, common clustering algorithms almost always work with each single utterance independently, making it difficult to benefit from a large amount of labeled data.

The upshot of all this is that given time-stamped speaker labels (i.e. we know who spoke when), we can train the model with standard stochastic gradient descent algorithms. A trained model can be used for speaker diarization on new utterances from unheard speakers. Furthermore, the use of online decoding makes it more suitable for latency-sensitive applications.

Future Work
Although we've already achieved impressive diarization performance with this system, there are still many exciting directions we are currently exploring. First, we are refining our model so it can easily integrate contextual information to perform offline decoding. This will likely further reduce the DER, which is more useful for latency-insensitive applications. Second, we would like to model acoustic features directly instead of using d-vectors. In this way, the entire speaker diarization system can be trained in an end-to-end way.

To learn more about this work, please see our paper. To download the core algorithm of this system, please visit the Github page.

Acknowledgments
This work was done as a close collaboration between Google AI and Speech & Assistant teams. Contributors include Aonan Zhang (intern), Quan Wang, Zhengyao Zhu and Chong Wang.

Source: Google AI Blog


Text-to-Speech for Low-Resource Languages (Episode 4): One Down, 299 to Go



This is the fourth episode in the series of posts reporting on the work we are doing to build text-to-speech (TTS) systems for low resource languages. In the first episode, we described the crowdsourced acoustic data collection effort for Project Unison. In the second episode, we described how we built parametric voices based on that data. In the third episode, we described the compilation of a pronunciation lexicon for a TTS system. In this episode, we describe how to make a single TTS system speak many languages.

Developing TTS systems for any given language is a significant challenge, and requires large amounts of high quality acoustic recordings and linguistic annotations. Because of this, these systems are only available for a tiny fraction of the world's languages. A natural question that arises in this situation is, instead of attempting to build a high quality voice for a single language using monolingual data from multiple speakers, as we described in the previous three episodes, can we somehow combine the limited monolingual data from multiple speakers of multiple languages to build a single multilingual voice that can speak any language?

Building upon an initial investigation into creating a multilingual TTS system that can synthesize speech in multiple languages from a single model, we developed a new model that uses uniform phonological representation for all languages — the International Phonetic Alphabet (IPA). The model trained using this representation can synthesize both the languages seen in the training data as well as languages not observed in training. This has two main benefits: First, pooling training data from related languages increases phonemic coverage which results in improved synthesis quality of the languages observed in training. Finally, because the model contains many languages pooled together, there is a better chance that an “unseen” language will have a “related” language present in the model that will guide and aid the synthesis.

Exploring the Closely Related Languages of Indonesia
We applied this multilingual approach first to languages of Indonesia, where Standard Indonesian is the official national language, and is spoken natively or as a second language by more than 200 million people. Javanese, with roughly 90 million native speakers, and Sundanese, with approximately 40 million native speakers, constitute the two largest regional languages of Indonesia. Unlike Indonesian, which received a lot of attention by the computational linguists and speech scientists over the years, both Javanese and Sundanese are currently low-resourced due to the lack of openly available high-quality corpora. We collaborated with universities in Indonesia to collect crowd-sourced Javanese and Sundanese recordings.

Since our corpus of Standard Indonesian was much larger and recorded in a professional studio, our hypothesis was that combining three languages may result in significant improvements over the systems constructed using a “classical” monolingual approach. To test this, we first proceeded to analyze the similarities and crucial differences between the phonologies of these three languages (shown below) and used this information to design the phonological representation that allows maximum degree of sharing between the languages while preserving their crucial differences.
Joint phoneme inventory of Indonesian, Javanese, and Sundanese in International Phonetic Alphabet notation.
The resulting Javanese and Sundanese voices trained jointly with Standard Indonesian strongly outperformed our corresponding monolingual multispeaker voices that we used as a baseline. This allowed us to launch Javanese and Sundanese TTS in Google products, such as Google Translate and Android.

Expanding to the More Diverse Language Families of South Asia
Next, we focused on the languages of South Asia spanning two very different language families: Indo-Aryan and Dravidian. Unlike the languages of Indonesia described above, these languages are much more diverse. In particular, they have significantly smaller overlap in their phonologies. The table below shows a superset of the languages in our experiment, including the variety of orthographies used, as well as modern words related to the Sanskrit word for “culture”. These languages show considerable variation within each group, but also such similarities across groups.
Descendants of Sanskrit word for “culture” across languages.
In this work, we leveraged the unified phonological representation mentioned above to make the most of the data we have and eliminate scarcity of data for certain phonemes. This was accomplished by conflating similar phonemes into a single representative phoneme in the multilingual phoneme inventory. Where possible, we use the same inventory for phonologically close languages. For example we have an identical phoneme inventory for Telugu and Kannada, and another one for West Bengali and Odia. For other language pairs like Gujarati and Marathi, we copied over the inventory of one language to another, but made a few changes to reflect the differences in their phonemic inventories. For all languages in these experiments we retained a common underlying representation, mapping similar phonemes across different inventories, so that we could still use the data from one language in training the others.

In addition, we made sure our representation is driven by the phonology in use, rather than the orthography. For example, although there are distinct letters for long and short vowels in Marathi, they are not contrastive in a linguistic sense, so we used a single representation for them, increasing the robustness of our training data. Similarly, if two languages use one character that was historically related to the same Sanskrit letter to represent different sounds or different letters for a similar sound, our mapping reflected the phonological closeness rather than the historical or orthographic representation. Describing all the features of the unified phoneme inventory is outside the scope of this post, the details can be found in our recent paper.
Diagram illustrating our multilingual text-to-speech approach. The input text queries are processed by language-specific linguistic front-ends to generate pronunciations in a shared phonemic representation serving as input to the language-agnostic acoustic model. The model then generates audio for the respective queries.
Our experiments focused on Indian Bengali, Gujarati, Kannada, Malayalam, Marathi, Tamil, Telugu and Urdu. For most of these languages, apart from Bengali and Marathi, the recording data and the transcriptions were crowd-sourced. For each of these languages we constructed a multilingual acoustic model that used all the data available. In addition, the acoustic model included the previously crowd-sourced Nepali and Sinhala data, as well as Hindi and Bangladeshi Bengali.

The results were encouraging: for most of the languages, the multilingual voices outperformed the voices that were constructed using traditional monolingual approach. We performed a further experiment with the Odia language, for which we had no training data, by attempting to synthesize it using the South Asian multilingual model. Subjective listening tests revealed that the native speakers of Odia judged the resulting audio to be acceptable and intelligible. The resulting voices for Marathi, Tamil, Telugu and Malayalam built using our multilingual approach in collaboration with the Speech team were announced at the recent “Google for India” event and are now powering Google Translate as well as other Google products.

Using crowd-sourcing in data collections was interesting from a research point of view and rewarding in terms of establishing fruitful collaborations with the native speaker communities. Our experiments with the Malayo-Polynesian, Indo-Aryan and Dravidian language families have shown that in most instances carefully sharing the data across multiple languages in a single multilingual acoustic model using deep learning techniques alleviates some of the severe data scarcity issues plaguing the low-resource languages and results in good quality voices used in Google products.

This TTS research is a first step towards applying speech and language technology to more of the world’s many languages, and it is our hope is that others will join us in this effort. To contribute to the research community we have open sourced corpora for Nepali, Sinhala, Bengali, Khmer, Javanese and Sundanese as we return from SLTU and Interspeech conferences, where we have been discussing this work with other researchers. We are planning on continuing to release additional datasets for other languages in our projects in the future.

Source: Google AI Blog