Demystifying ML: How machine learning is used for speech recognition


This is the second blog post in our series that looks at how machine learning can be used to solve common problems. In this article, we discuss how ML is used in speech recognition.

In our previous post, we talked about how email classification can be used to improve customer service. But email is just one way customers interact with businesses. When it comes to difficult issues, 48% of customers prefer to speak with a customer service representative over the phone rather than via text-based chat or email (source: American Express 2014 Global Customer Service Barometer). And in today’s business landscape, an increasing number of those interactions happen in real time.

Take the example of a commercial bank. For urgent matters — say, a customer reporting a stolen credit card — it doesn’t make sense to send an email. In these instances, a customer is more likely to call a customer service representative, and getting that customer to the right representative as fast as possible can be the difference between a minor inconvenience and a bigger problem. This means that speech recognition systems, with the ability to swiftly identify exact words and their context, are more important than ever.

Since speech recognition has to bridge the gap between the physical and digital worlds, many layers of engineering go into the process. In layman's terms, you start with the input: the audio waveform. This waveform is digitized and converted using a Fourier transform (which turns a signal from a function of time into a function of frequency), similar to the spectrum display on certain audio equipment. We then use machine learning to find the most likely phonemes (distinct units of sound) and the most probable sequences of words, given that sequence of frequency graphs. Finally, depending on the application, an output in the form of a textual answer or result is returned. In the case of a customer service call center, this textual output (or its binary equivalent) allows your call to be routed, typically in a matter of milliseconds.
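
To make the front end of that pipeline a little more concrete, here's a minimal sketch (ours, not a production system) that slices a digitized waveform into short overlapping frames and applies a Fourier transform to each one with NumPy. The frame sizes and the synthetic test tone are illustrative assumptions.

```python
# A minimal sketch of the first step described above: slicing a digitized
# waveform into short frames and applying a Fourier transform to each one,
# producing the time-frequency representation a recognizer works from.
# Frame and hop sizes below are common but illustrative choices.
import numpy as np

def spectrogram(waveform, sample_rate, frame_ms=25, hop_ms=10):
    """Convert a 1-D waveform into a sequence of frequency-magnitude frames."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per analysis window
    hop_len = int(sample_rate * hop_ms / 1000)       # samples between window starts
    window = np.hanning(frame_len)                   # taper to reduce spectral leakage
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, hop_len):
        frame = waveform[start:start + frame_len] * window
        # Real FFT: signal as a function of time -> magnitudes per frequency bin
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)   # shape: (num_frames, num_frequency_bins)

# Example: one second of a synthetic 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t), sr)
print(spec.shape)  # (98, 201)
```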

Building your own speech recognition system is a complex process, and each layer involves its own interesting implementations and challenges. In this post, we’ll focus on phoneme modeling, i.e., isolated word recognition.

One common approach is to use a hidden Markov model (HMM) to match the sequence of frequency frames against the phoneme sequences of candidate words. This is a simplified representation: the actual process contains all possible phonemes, and looks at matching not just discrete phonemes, but the beginning, middle and end of those waveforms. (A “k” sound, or the aspirated start of a certain vowel, for example.)
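
To illustrate the idea, here's a toy sketch of isolated word recognition with left-to-right HMMs. It assumes an acoustic model has already scored each audio frame as a probability over a tiny, made-up phoneme set, and it collapses each phoneme to a single state rather than modeling its beginning, middle and end. The words, phoneme symbols and probabilities are all illustrative.

```python
# A toy sketch of isolated word recognition with left-to-right HMMs.
# Each candidate word is its phoneme sequence with a fixed self-loop
# probability; a Viterbi pass scores how well each word explains the frames.
# Phoneme symbols, words, and probabilities are all illustrative.
import math

WORDS = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "aa", "g"],
}

def word_log_likelihood(frame_probs, phoneme_seq, self_loop=0.6):
    """Viterbi score of one word's left-to-right phoneme HMM over the frames."""
    n_states = len(phoneme_seq)
    stay, advance = math.log(self_loop), math.log(1 - self_loop)
    best = [-math.inf] * n_states
    best[0] = math.log(frame_probs[0][phoneme_seq[0]])      # must start in state 0
    for frame in frame_probs[1:]:
        new = [-math.inf] * n_states
        for s in range(n_states):
            stay_score = best[s] + stay                      # remain in same phoneme
            move_score = best[s - 1] + advance if s > 0 else -math.inf
            new[s] = max(stay_score, move_score) + math.log(frame[phoneme_seq[s]])
        best = new
    return best[-1]                                          # must end in final state

def recognize(frame_probs):
    """Return the candidate word whose HMM best explains the frame sequence."""
    return max(WORDS, key=lambda w: word_log_likelihood(frame_probs, WORDS[w]))

# Fake acoustic-model output for four frames that sound mostly like "k-ae-ae-t".
frames = [
    {"k": 0.7, "ae": 0.1, "t": 0.05, "d": 0.1, "aa": 0.03, "g": 0.02},
    {"k": 0.1, "ae": 0.7, "t": 0.05, "d": 0.05, "aa": 0.08, "g": 0.02},
    {"k": 0.05, "ae": 0.6, "t": 0.2, "d": 0.05, "aa": 0.08, "g": 0.02},
    {"k": 0.05, "ae": 0.1, "t": 0.7, "d": 0.05, "aa": 0.05, "g": 0.05},
]
print(recognize(frames))  # -> "cat"
```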

Returning to our example of the customer service call center for a moment, an HMM can construct a graph linking phonemes, and sometimes even consecutive words, into a sequence, resulting in a histogram of possible outputs corresponding to the various support teams in your company. With a large dataset of recorded customer statements and their call center destinations, you can build a robust AI-based routing system that gets customers to the right help as fast as possible.
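
Once the recognizer has produced text, the routing step itself can be quite simple. The sketch below is deliberately naive, with made-up team names and keyword lists: it tallies keyword matches in a transcript into a histogram of candidate destinations. A real system would learn this mapping from the recorded calls and destinations mentioned above.

```python
# An illustrative sketch (not a production router): tally how strongly a
# recognized transcript matches keyword lists for each support team, giving a
# histogram of candidate destinations. Team names and keywords are made up.
from collections import Counter

TEAM_KEYWORDS = {
    "fraud": {"stolen", "fraud", "unauthorized", "suspicious"},
    "cards": {"card", "credit", "debit", "replacement"},
    "accounts": {"balance", "statement", "transfer", "account"},
}

def route(transcript):
    """Return a histogram of keyword hits per team and the best destination."""
    words = set(transcript.lower().split())
    histogram = Counter({team: len(words & kw) for team, kw in TEAM_KEYWORDS.items()})
    best_team, _ = histogram.most_common(1)[0]
    return histogram, best_team

hist, team = route("Hi, my credit card was stolen last night")
print(dict(hist), "->", team)  # {'fraud': 1, 'cards': 2, 'accounts': 0} -> cards
```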

As we noted earlier, building your own speech recognition system is a major undertaking. It requires a large dataset to train your model, and that sample data must be labeled through a fairly manual, laborious process. At scale, the data is costly to store on-site, and turning all of that stored data into a working model takes multiple iterative attempts, substantial computational resources, and sometimes days or weeks of training time. If you’re interested in quickly deploying a speech-based application but want to avoid the ordeal of training your own model, you can always use a tool like the Cloud Speech API.
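
For example, here's a minimal sketch of transcribing a short recording with the Cloud Speech API using the google-cloud-speech Python client (2.x); the file name, audio encoding and sample rate are assumptions about your recording.

```python
# A minimal sketch of transcribing a short recording with the Cloud Speech API
# via the google-cloud-speech Python client (2.x). The file name, encoding,
# and sample rate below are assumptions about your audio.
from google.cloud import speech

client = speech.SpeechClient()

with open("customer_call.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Each result carries one or more alternatives; the first is the most likely.
    print(result.alternatives[0].transcript)
```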

Speech recognition can be helpful in many more ways, from closed captioning to real-time transcription. If you’re interested in learning more, you can check out our best practices and sample applications.