Those who have worked with our Cloud Speech API know that sending infinitely long streams of audio is currently unsupported. To work around this limit, we close and restart streaming requests before they hit the timeout: we restart the session during long periods of silence and close it whenever we detect a pause in the speech. Restarting at an arbitrary moment would otherwise truncate a sentence or word. Between sessions, we buffer audio locally and send it upon reconnection. This reduces the amount of text lost mid-conversation, whether due to restarting speech requests or switching between wireless networks.
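The restart-and-buffer behavior described above can be sketched roughly as follows. This is a minimal illustration, not the Live Transcribe implementation: the class name, the `soft_limit_s` and `hard_limit_s` thresholds, and the `sent` list standing in for the network are all hypothetical, and the speech/pause flag is assumed to come from some external speech detector.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple


@dataclass
class StreamingSessionManager:
    """Hypothetical sketch of restarting sessions at pauses and buffering audio."""

    soft_limit_s: float = 240.0  # assumed: after this, restart at the next pause
    hard_limit_s: float = 290.0  # assumed: restart regardless, just before the timeout
    session_id: int = 0
    elapsed_s: float = 0.0
    connected: bool = True
    pending: Deque[bytes] = field(default_factory=deque)
    # Stand-in for the network: records (session_id, chunk) pairs in send order.
    sent: List[Tuple[int, bytes]] = field(default_factory=list)

    def push(self, chunk: bytes, duration_s: float, is_speech: bool) -> None:
        """Accept one audio chunk; send it now or buffer it locally."""
        self.pending.append(chunk)
        if not self.connected:
            return  # keep buffering until reconnect() is called
        self._flush()
        self.elapsed_s += duration_s
        at_pause = (not is_speech) and self.elapsed_s >= self.soft_limit_s
        if at_pause or self.elapsed_s >= self.hard_limit_s:
            self._new_session()

    def disconnect(self) -> None:
        """Simulate losing the connection, e.g. while switching networks."""
        self.connected = False

    def reconnect(self) -> None:
        """Open a fresh session and send everything buffered while offline."""
        self.connected = True
        self._new_session()
        self._flush()

    def _new_session(self) -> None:
        self.session_id += 1
        self.elapsed_s = 0.0

    def _flush(self) -> None:
        while self.pending:
            self.sent.append((self.session_id, self.pending.popleft()))


manager = StreamingSessionManager(soft_limit_s=2.0, hard_limit_s=4.0)
manager.push(b"a", 1.0, is_speech=True)
manager.push(b"b", 1.0, is_speech=True)
manager.push(b"c", 1.0, is_speech=False)  # pause past the soft limit: restart
manager.push(b"d", 1.0, is_speech=True)
manager.disconnect()
manager.push(b"e", 1.0, is_speech=True)   # buffered locally
manager.push(b"f", 1.0, is_speech=True)   # buffered locally
manager.reconnect()                       # buffered audio goes out in order
```

In this sketch the soft limit lets the session end at a natural pause, while the hard limit guarantees a restart even when the speaker never pauses. For simplicity, audio flushed after a reconnect is not counted toward the new session's elapsed time.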
(From the Live Transcribe launch video)
Endlessly streaming audio comes with its own challenges. In many countries, network data is quite expensive, and in areas with poor connectivity, bandwidth may be limited. After much experimentation with audio codecs (in particular, we evaluated the FLAC, AMR-WB, and Opus codecs), we were able to achieve a 10x reduction in data usage without compromising accuracy. FLAC, a lossless codec, preserves accuracy completely, but doesn't save much data. It also has noticeable codec latency. AMR-WB, on the other hand, saves a lot of data, but delivers much worse accuracy in noisy environments. Opus was a clear winner, allowing data rates many times lower than most music streaming services while still preserving the important details of the audio signal, even in noisy environments.
Beyond relying on codecs to keep data usage to a minimum, we also support using speech detection to close the network connection during extended periods of silence. That means if you accidentally leave your phone on and running Live Transcribe when nobody is around, it stops using your data.
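As a rough illustration of closing the connection during extended silence, the sketch below tracks how long the input has been silent and stops counting bytes as sent once a threshold is crossed. The `SilenceGate` name, the `max_silence_s` value, and the per-chunk `is_speech` flag are assumptions made for this example; the actual speech detector and thresholds are not described here.

```python
from dataclasses import dataclass


@dataclass
class SilenceGate:
    """Hypothetical sketch: stop sending audio after prolonged silence."""

    max_silence_s: float = 30.0  # assumed threshold for "extended" silence
    silence_s: float = 0.0
    connected: bool = True
    bytes_sent: int = 0

    def push(self, chunk: bytes, duration_s: float, is_speech: bool) -> bool:
        """Process one chunk; return True if it was sent over the network."""
        if is_speech:
            # Speech resets the silence timer and reopens the connection.
            self.silence_s = 0.0
            self.connected = True
        else:
            self.silence_s += duration_s
            if self.silence_s >= self.max_silence_s:
                self.connected = False  # close: no data used while nobody talks
        if self.connected:
            self.bytes_sent += len(chunk)
        return self.connected


gate = SilenceGate(max_silence_s=2.0)
results = [
    gate.push(b"xx", 1.0, is_speech=True),
    gate.push(b"xx", 1.0, is_speech=False),  # short silence: still connected
    gate.push(b"xx", 1.0, is_speech=False),  # threshold reached: connection closed
    gate.push(b"xx", 1.0, is_speech=False),  # stays closed, nothing sent
    gate.push(b"xx", 1.0, is_speech=True),   # speech resumes: reconnect and send
]
```

A real client would likely keep a short local buffer so that the first syllables spoken after a long silence are not lost while the connection reopens.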
Finally, we know that if you are relying on captions, you want them immediately, so we've worked hard to keep latency to a minimum. Though most of the credit for speed goes to the Cloud Speech API, Live Transcribe's final trick lies in our custom Opus encoder. At the cost of only a minor increase in bitrate, we see latency that is visually indistinguishable from that of sending uncompressed audio.
Today, we are excited to make all of this available to developers everywhere. We hope you'll join us in trying to build a world that is more accessible for everyone.
By Chet Gnegy, Alex Huang, and Ausmus Chang from the Live Transcribe Team