In 2012, speech recognition research showed significant accuracy improvements with deep learning, leading to early adoption in products such as Google's Voice Search. It was the beginning of a revolution in the field: each year, new architectures were developed that further increased quality, from deep neural networks (DNNs) to recurrent neural networks (RNNs), long short-term memory networks (LSTMs), convolutional networks (CNNs), and more. During this time, latency remained a prime focus — an automated assistant feels a lot more helpful when it responds quickly to requests.
Today, we're happy to announce the rollout of an end-to-end, all-neural, on-device speech recognizer to power speech input in Gboard. In our recent paper, "Streaming End-to-End Speech Recognition for Mobile Devices", we present a model trained using RNN transducer (RNN-T) technology that is compact enough to reside on a phone. This means no more network latency or spottiness — the new recognizer is always available, even when you are offline. The model works at the character level, so that as you speak, it outputs words character-by-character, just as if someone was typing out what you say in real-time, and exactly as you'd expect from a keyboard dictation system.
|This video compares the production, server-side speech recognizer (left panel) to the new on-device recognizer (right panel) when recognizing the same spoken sentence. Video credit: Akshay Kannan and Elnaz Sarbar|
Traditionally, speech recognition systems consisted of several components - an acoustic model that maps segments of audio (typically 10 millisecond frames) to phonemes, a pronunciation model that connects phonemes together to form words, and a language model that expresses the likelihood of given phrases. In early systems, these components remained independently-optimized.
Around 2014, researchers began to focus on training a single neural network to directly map an input audio waveform to an output sentence. This sequence-to-sequence approach to learning a model by generating a sequence of words or graphemes given a sequence of audio features led to the development of "attention-based" and "listen-attend-spell" models. While these models showed great promise in terms of accuracy, they typically work by reviewing the entire input sequence, and do not allow streaming outputs as the input comes in, a necessary feature for real-time voice transcription.
Meanwhile, an independent technique called connectionist temporal classification (CTC) had helped halve the latency of the production recognizer at that time. This proved to be an important step in creating the RNN-T architecture adopted in this latest release, which can be seen as a generalization of CTC.
Recurrent Neural Network Transducers
RNN-Ts are a form of sequence-to-sequence models that do not employ attention mechanisms. Unlike most sequence-to-sequence models, which typically need to process the entire input sequence (the waveform in our case) to produce an output (the sentence), the RNN-T continuously processes input samples and streams output symbols, a property that is welcome for speech dictation. In our implementation, the output symbols are the characters of the alphabet. The RNN-T recognizer outputs characters one-by-one, as you speak, with white spaces where appropriate. It does this with a feedback loop that feeds symbols predicted by the model back into it to predict the next symbols, as described in the figure below.
|Representation of an RNN-T, with the input audio samples, x, and the predicted symbols y. The predicted symbols (outputs of the Softmax layer) are fed back into the model through the Prediction network, as yu-1, ensuring that the predictions are conditioned both on the audio samples so far and on past outputs. The Prediction and Encoder Networks are LSTM RNNs, the Joint model is a feedforward network (paper). The Prediction Network comprises 2 layers of 2048 units, with a 640-dimensional projection layer. The Encoder Network comprises 8 such layers. Image credit: Chris Thornton|
In a traditional speech recognition engine, the acoustic, pronunciation, and language models we described above are "composed" together into a large search graph whose edges are labeled with the speech units and their probabilities. When a speech waveform is presented to the recognizer, a "decoder" searches this graph for the path of highest likelihood, given the input signal, and reads out the word sequence that path takes. Typically, the decoder assumes a Finite State Transducer (FST) representation of the underlying models. Yet, despite sophisticated decoding techniques, the search graph remains quite large, almost 2GB for our production models. Since this is not something that could be hosted easily on a mobile phone, this method requires online connectivity to work properly.
To improve the usefulness of speech recognition, we sought to avoid the latency and inherent unreliability of communication networks by hosting the new models directly on device. As such, our end-to-end approach does not need a search over a large decoder graph. Instead, decoding consists of a beam search through a single neural network. The RNN-T we trained offers the same accuracy as the traditional server-based models but is only 450MB, essentially making a smarter use of parameters and packing information more densely. However, even on today's smartphones, 450MB is a lot, and propagating signals through such a large network can be slow.
We further reduced the model size by using the parameter quantization and hybrid kernel techniques we developed in 2016 and made publicly available through the model optimization toolkit in the TensorFlow Lite library. Model quantization delivered a 4x compression with respect to the trained floating point models and a 4x speedup at run-time, enabling our RNN-T to run faster than real time speech on a single core. After compression, the final model is 80MB.
Our new all-neural, on-device Gboard speech recognizer is initially being launched to all Pixel phones in American English only. Given the trends in the industry, with the convergence of specialized hardware and algorithmic improvements, we are hopeful that the techniques presented here can soon be adopted in more languages and across broader domains of application.
Raziel Alvarez, Michiel Bacchiani, Tom Bagby, Françoise Beaufays, Deepti Bhatia, Shuo-yiin Chang, Yanzhang He, Alex Gruenstein, Anjuli Kannan, Bo Li, Qiao Liang, Ian McGraw, Ruoming Pang, Rohit Prabhavalkar, Golan Pundak, Kanishka Rao, David Rybach, Tara Sainath, Haşim Sak, June Yuan Shangguan, Matt Shannon, Mohammadinamul Sheik, Khe Chai Sim, Gabor Simko, Trevor Strohman, Mirkó Visontai, Yonghui Wu, Ding Zhao, Dan Zivkovic.