Audio recordings often contain a mixture of different sound sources; Universal sound separation is the ability to separate such a mixture into its component sounds, regardless of the types of sound present. Previously, sound separation work has focused on separating mixtures of a small number of sound types, such as "speech" versus "nonspeech", or different instances of the same type of sound, such as speaker #1 versus speaker #2. Often in such work, the number of sounds in a mixture is also assumed to be known a priori. The FUSS dataset shifts focus to the more general problem of separating a variable number of arbitrary sounds from one another.
One major hurdle to training models in this domain is that even if you have high-quality recordings of sound mixtures, you can't easily annotate these recordings with ground truth. High-quality simulation is one approach to overcome this limitation. To achieve good results, you need a diverse set of sounds, a realistic room simulator, and code to mix these elements together for realistic, multi-source, multi-class audio with ground truth. With FUSS, we are releasing all three of these.
FUSS relies on Creative Commons licensed audio clips from freesound.org. We filtered these by license type, then using a pre-release of FSD50k , further filtered out sounds that aren't separable by humans when mixed together. We were left with about 23 hours of audio, consisting of 12,377 sounds useful for mixing (7,237 train, 2,883 validation, 2,257 eval). Using these clips, we created 20,000 training mixtures, 1,000 validation mixtures, and 1,000 eval mixtures.
We developed our own room simulator implemented in tensorflow, which generates the impulse response of a box shaped room with frequency-dependent reflective properties given a sound source location and a mic location. As part of the dataset release, we provide pre-calculated room impulse responses used for each audio sample along with mixing code, so the research community can simulate novel audio without running the computationally expensive room simulator. Future work may include releasing the code for our room simulator and extending the simulator capabilities to address more extensive acoustic properties of rooms, materials with different reflective properties, novel room shapes, etc.
Finally, we have released a masking-based separation model, based on an improved time-domain convolutional network (TDCN++), described in our recent publications [2, 3]. On the eval set, this model achieves 12.5 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source mixtures with 37.6 dB absolute SI-SNR.
Source audio, reverb impulse responses, reverberated mixtures and sources created by the mixing code, and a baseline model checkpoint are available for download. Code for reverberating and mixing the audio data and for training the released model is available on our github page.
The dataset will also be used in the DCASE challenge, as a component of the Sound Event Detection and Separation task. The released model will serve as a baseline for this competition, and a benchmark to demonstrate progress against in future experiments.
Our hope is this dataset will lower the barrier to new research, and particularly will allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
By John Hershey, Scott Wisdom, and Hakan Erdogan, Google Research
 Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font Corbera, Dmitry Bogdanov, Andrés Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. "Freesound Datasets: A Platform for the Creation of Open Audio Datasets." International Society for Music Information Retrieval Conference (ISMIR), pp. 486–493. Suzhou, China, 2017.
 Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R. Hershey. "Universal Sound Separation." IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 175-179. New Paltz, NY, USA, 2019.
 Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, and Daniel P. W. Ellis. "Improving Universal Sound Separation Using Sound Classification." IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2020.