Speaker diarization, the process of partitioning an audio stream with multiple people into homogeneous segments associated with each individual, is an important part of speech recognition systems. By solving the problem of “who spoke when”, speaker diarization has applications in many important scenarios, such as understanding medical conversations, video captioning and more. However, training these systems with supervised learning methods is challenging — unlike standard supervised classification tasks, a robust diarization model requires the ability to associate new individuals with distinct speech segments that weren't involved in training. Importantly, this limits the quality of both online and offline diarization systems. Online systems usually suffer more, since they require diarization results in real time.
|Online speaker diarization on streaming audio input. Different colors in the bottom axis indicate different speakers.|
Clustering versus Interleaved-state RNN
Modern speaker diarization systems are usually based on clustering algorithms such as k-means or spectral clustering. Since these clustering methods are unsupervised, they could not make good use of the supervised speaker labels available in data. Moreover, online clustering algorithms usually have worse quality in real-time diarization applications with streaming audio inputs. The key difference between our model and common clustering algorithms is that in our method, all speakers’ embeddings are modeled by a parameter-sharing recurrent neural network (RNN), and we distinguish different speakers using different RNN states, interleaved in the time domain.
To understand how this works, consider the example below in which there are four possible speakers: blue, yellow, pink and green (this is arbitrary, and in fact there may be more — our model uses the Chinese restaurant process to accommodate the unknown number of speakers). Each speaker starts with its own RNN instance (with a common initial state shared among all speakers) and keeps updating the RNN state given the new embeddings from this speaker. In the example below, the blue speaker keeps updating its RNN state until a different speaker, yellow, comes in. If blue speaks again later, it resumes updating its RNN state. (This is just one of the possibilities for speech segment y7 in the figure below. If new speaker green enters, it will start with a new RNN instance.)
|The generative process of our model. Colors indicate labels for speaker segments.|
The upshot of all this is that given time-stamped speaker labels (i.e. we know who spoke when), we can train the model with standard stochastic gradient descent algorithms. A trained model can be used for speaker diarization on new utterances from unheard speakers. Furthermore, the use of online decoding makes it more suitable for latency-sensitive applications.
Although we've already achieved impressive diarization performance with this system, there are still many exciting directions we are currently exploring. First, we are refining our model so it can easily integrate contextual information to perform offline decoding. This will likely further reduce the DER, which is more useful for latency-insensitive applications. Second, we would like to model acoustic features directly instead of using d-vectors. In this way, the entire speaker diarization system can be trained in an end-to-end way.
To learn more about this work, please see our paper. To download the core algorithm of this system, please visit the Github page.
This work was done as a close collaboration between Google AI and Speech & Assistant teams. Contributors include Aonan Zhang (intern), Quan Wang, Zhengyao Zhu and Chong Wang.