Tag Archives: Audio

Audio and Visual Quality Measurement using Fréchet Distance



"I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.”
    William Thomson (Lord Kelvin), Lecture on "Electrical Units of Measurement" (3 May 1883), published in Popular Lectures Vol. I, p. 73
The rate of scientific progress in machine learning has often been determined by the availability of good datasets, and metrics. In deep learning, benchmark datasets such as ImageNet or Penn Treebank were among the driving forces that established deep artificial neural networks for image recognition and language modeling. Yet, while the available “ground-truth” datasets lend themselves nicely as measures of performance on these prediction tasks, determining the “ground-truth” for comparison to generative models is not so straightforward. Imagine a model that generates videos of StarCraft video game sequences — how does one determine which model is best? Clearly some of the videos shown below look more realistic than others, but can the differences between them be quantified? Access to robust metrics for evaluation of generative models is crucial for measuring (and making) progress in the fields of audio and video understanding, but currently no such metrics exist.
Videos generated from various models trained on sequences from the StarCraft Video (SCV) dataset.
In “Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms” and “Towards Accurate Generative Models of Video: A New Metric & Challenges”, we present two such metrics — the Fréchet Audio Distance (FAD) and Fréchet Video Distance (FVD). We document our large-scale human evaluations using 10k video and 69k audio clip pairwise comparisons that demonstrate high correlations between our metrics and human perception. We are also releasing the source code for both Fréchet Video Distance and Fréchet Audio Distance on github (FVD; FAD).

General Description of Fréchet Distance
The goal of a generative model is to learn to produce samples that look similar to the ones on which it has been trained, such that it knows what properties and features are likely to appear in the data, and which ones are unlikely. In other words, a generative model must learn the probability distribution of the training data. In many cases, the target distributions for generative models are very high-dimensional. For example, a single image of 128x128 pixels with 3 color channels has almost 50k dimensions, while a second-long video clip might consist of dozens (or hundreds) of such frames with audio that may have 16,000 samples. Calculating distances between such high dimensional distributions in order to quantify how well a given model succeeds at a task is very difficult. In the case of pictures, one could look at a few samples to gauge visual quality, but doing so for every model trained is not feasible.

In addition, generative adversarial networks (GANs) tend to focus on a few modes of the overall target distribution, while completely ignoring others. For example, they may learn to generate only one type of object or only a select few viewing angles. As a consequence, looking only at a limited number of samples from the model may not indicate whether the network learned the entire distribution successfully. To remedy this, a metric is needed that aligns well with human judgement of quality, while also taking the properties of the target distribution into account.

One common solution for this problem is the so-called Fréchet Inception Distance (FID) metric, which was specifically designed for images. The FID takes a large number of images from both the target distribution and the generative model, and uses the Inception object-recognition network to embed each image into a lower-dimensional space that captures the important features. Then it computes the so-called Fréchet distance between these samples, which is a common way of calculating distances between distributions that provides a quantitative measure of how similar the two distributions actually are.
A key component for both metrics is a pre-trained model that converts the video or audio clip into an N-dimensional embedding.
Fréchet Audio Distance and Fréchet Video Distance
Building on the principles of FID that have been successfully applied to the image domain, we propose both Fréchet Video Distance (FVD) and Fréchet Audio Distance (FAD). Unlike popular metrics such as peak signal-to-noise ratio or the structural similarity index, FVD looks at videos in their entirety, and thereby avoids the drawbacks of framewise metrics.
Examples of videos of a robot arm, judged by the new FVD metric. FVD values were found to be approximately 2000, 1000, 600, 400, 300 and 150 (left-to-right; top-to-bottom). A lower FVD clearly correlates with higher video quality.
In the audio domain, existing metrics either require a time-aligned ground truth signal, such as source-to-distortion ratio (SDR), or only target a specific domain, like speech quality. FAD on the other hand is reference-free and can be used on any type of audio.

Below is a 2-D visualization of the audio embedding vectors from which we compute the FAD. Each point corresponds to the embedding of a 5-second audio clip, where the blue points are from clean music and other points represent audio that has been distorted in some way. The estimated multivariate Gaussian distributions are presented as concentric ellipses. As the magnitude of the distortions increase, the overlap between their distributions and that of the clean audio decreases. The separation between these distributions is what the Fréchet distance is measuring.
In the animation, we can see that as the magnitude of the distortions increases, the Gaussian distributions of the distorted audio overlaps less with the clean distribution. The magnitude of this separation is what the Fréchet distance is measuring.
Evaluation
It is important for FAD and FVD to closely track human judgement, since that is the gold standard for what looks and sounds “realistic”. So, we performed a large-scale human study to determine how well our new metrics align with qualitative human judgment of generated audio and video. For the study, human raters examined 10,000 video pairs and 69,000 5-second audio clips. For the FAD we asked human raters to compare the effect of two different distortions on the same audio segment, randomizing both the pair of distortions that they compared and the order in which they appeared. The raters were asked “Which audio clip sounds most like a studio-produced recording?” The collected set of pairwise evaluations was then ranked using a Plackett-Luce model, which estimates a worth value for each parameter configuration. Comparison of the worth values to the FAD demonstrates that the FAD correlates quite well with human judgement.
This figure compares the FAD calculated between clean background music and music distorted by a variety of methods (e.g., pitch down, Gaussian noise, etc.) to the associated worth values from human evaluation. Each type of distortion has two data points representing high and low extremes in the distortion applied. The quantization distortion (purple circles), for example, limits the audio to a specific number of bits per sample, where the two data points represent two different bitrates. Both human raters and the FAD assigned higher values (i.e., “less realistic”) to the lower bitrate quantization. Overall log FAD correlates well with human judgement — a perfect correlation between the log FAD and human perception would result in a straight line.
Conclusion
We are currently making great strides in generative models. FAD and FVD will help us keeping this progress measurable, and will hopefully lead us to improve our models for audio and video generation.

Acknowledgements
There are many people who contributed to this large effort, and we’d like to highlight some of the key contributors: Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, Sylvain Gelly, Mauricio Zuluaga, Dominik Roblek, Matthew Sharifi as well as the extended Google Brain team in Zurich.

Source: Google AI Blog


What’s new with Fast Pair

Posted by Catherina Xu (Product Manager)

Last November, we released Fast Pair with the Jaybird Tarah Bluetooth headphones. Since then, we’ve engaged with dozens of OEMs, ODMs, and silicon partners to bring Fast Pair to even more devices. Last month, we held a talk at I/O announcing 10+ certified devices, support for Qualcomm’s Smart Headset Development Kit, and upcoming experiences for Fast Pair devices.

The Fast Pair team presenting at I/O 2019.

The Fast Pair team presenting at I/O 2019.

Upcoming experiences

Fast Pair makes pairing seamless across Android phones - this year, we are introducing additional features to improve Bluetooth device management.

  • True Wireless Features. As True Wireless Stereo (TWS) headphones continue to gain momentum in the market and with users, it is important to build system-wide support for TWS. Later this year, TWS headsets with Fast Pair will be able to broadcast individual battery information for the case and buds. This enables features such as case open and close battery notifications and per-component battery reporting throughout the UI.

     Detailed battery level notifications surfaced during “case open” for TWS headphones.

    Detailed battery level notifications surfaced during “case open” for TWS headphones.

    • Find My Device. Fast Pair devices will soon be surfaced in the Find My Device app and website, allowing users to easily track down lost devices. Headset owners can view the location and time of last use, as well as unpair or ring the buds to locate when they are in range.

    Connected Device Details. In Android Q, Fast Pair devices will have an enhanced Bluetooth device details page to centralize management and key settings. This includes links to Find My Device, Assistant settings (if available), and additional OEM-specified settings that will link to the OEM’s companion app.

    The updated Device details screen in Q allows easy access to key settings and the headphone’s companion app.

    The updated Device details screen in Q allows easy access to key settings and the headphone’s companion app.

    Compatible Devices

    Below is a list of devices that were showcased during our I/O talk:

    • Anker Spirit Pro GVA
    • Anker SoundCore Flare+ (Speaker)
    • JBL Live 220BT
    • JBL Live 400BT
    • JBL Live 500BT
    • JBL Live 650BT
    • Jaybird Tarah
    • 1More Dual Driver BT ANC
    • LG HBS-SL5
    • LG HBS-PL6S
    • LG HBS-SL6S
    • LG HBS-PL5
    • Cleer Ally Plus

    Interested in Fast Pair?

    If you are interested in creating Fast Pair compatible Bluetooth devices, please take a look at:

    Once you have selected devices to integrate, head to our Nearby Devices console to register your product. Reach out to us at [email protected] if you have any questions.

  • Fast Pair Update

    Posted by Seang Chau (VP Engineering)

    Last year we announced Fast Pair, a set of specs that make it easier to connect Bluetooth headsets and speakers to Android devices.

    Today, we're making it easier for people to connect Fast Pair compatible accessories to devices associated with the same Google Account. Fast Pair will connect accessories to a user's current and future Android phones (6.0+), and we're adding support for Chromebooks in 2019.

    Fast Pair provides stress-free Bluetooth pairing for your Android phone.

    We have been working closely with dozens of manufacturers, many of which are bringing new Fast Pair compatible devices to market over the coming months. This includes Jaybird, who is already selling the Tarah Wireless Sport Headphones, as well as upcoming products from prominent brands such as Anker SoundCore, Bose, and many more.

    The Jaybird Tarah, Fast Pair compatible sport headphones already available in the market.

    We also want to make it easy for manufacturers to ship compatible products with minimal additional engineering effort. We collaborated with industry leading Bluetooth audio companies such as Airoha Technology Corp., BES and Qualcomm Technologies International, Ltd. (QTIL) to add native Fast Pair support to their software development kits.

    If you are a manufacturer interested in creating Fast Pair compatible Bluetooth devices, just head to our Nearby Devices console to register your product and validate that it has correctly implemented the Fast Pair specs.

    Introducing Oboe: A C++ library for low latency audio

    Posted by Don Turner, Developer Advocate, Android Audio Framework

    This week we released the first production-ready version of Oboe - a C++ library for building real-time audio apps. Oboe provides the lowest possible audio latency across the widest range of Android devices, as well as several other benefits.

    Single API

    Oboe takes advantage of the improved performance and features of AAudio on Oreo MR1 (API 27+) whilst maintaining backward compatibility (using OpenSL ES) on API 16+. It's kind of like AndroidX for native audio.

    Diagram showing the underlying audio API which Oboe will use

    Less code to write and maintain

    Using Oboe you can create an audio stream in just 3 lines of code (vs 50+ lines in OpenSL ES):

    AudioStreamBuilder builder;
    AudioStream *stream = nullptr;
    Result result = builder.openStream(&stream);
    

    Other benefits

    • Convenient C++ API (uses the C++11 standard)
    • Fast release process: supplied as a source library, bug fixes can be rolled out in days, quite a bit faster than the Android platform release cycle
    • Less guesswork: Provides workarounds for known audio bugs and has sensible default behaviour for stream properties, such as sample rate and audio data formats
    • Open source and maintained by Google engineers (although we welcome outside contributions)

    Getting started

    Take a look at the short video introduction:

    Check out the documentation, code samples and API reference. There's even a codelab which shows you how to build a rhythm-based game.

    If you have any issues, please file them here, we'd love to hear how you get on.

    Open sourcing Resonance Audio

    Spatial audio adds to your sense of presence when you’re in VR or AR, making it feel and sound, like you’re surrounded by a virtual or augmented world. And regardless of the display hardware you’re using, spatial audio makes it possible to hear sounds coming from all around you.

    Resonance Audio, our spatial audio SDK launched last year, enables developers to create more realistic VR and AR experiences on mobile and desktop. We’ve seen a number of exciting experiences emerge across a variety of platforms using our SDK. Recent examples include apps like Pixar’s Coco VR for Gear VR, Disney’s Star Wars™: Jedi Challenges AR app for Android and iOS, and Runaway’s Flutter VR for Daydream, which all used Resonance Audio technology.

    To accelerate adoption of immersive audio technology and strengthen the developer community around it, we’re opening Resonance Audio to a community-driven development model. By creating an open source spatial audio project optimized for mobile and desktop computing, any platform or software development tool provider can easily integrate with Resonance Audio. More cross-platform and tooling support means more distribution opportunities for content creators, without the worry of investing in costly porting projects.

    What’s Included in the Open Source Project

    As part of our open source project, we’re providing a reference implementation of YouTube’s Ambisonic-based spatial audio decoder, compatible with the same Ambisonics format (Ambix ACN/SN3D) used by others in the industry. Using our reference implementation, developers can easily render Ambisonic content in their VR media and other applications, while benefiting from Ambisonics open source, royalty-free model. The project also includes encoding, sound field manipulation and decoding techniques, as well as head related transfer functions (HRTFs) that we’ve used to achieve rich spatial audio that scales across a wide spectrum of device types and platforms. Lastly, we’re making our entire library of highly optimized DSP classes and functions, open to all. This includes resamplers, convolvers, filters, delay lines and other DSP capabilities. Additionally, developers can now use Resonance Audio’s brand new Spectral Reverb, an efficient, high quality, constant complexity reverb effect, in their own projects.

    We’ve open sourced Resonance Audio as a standalone library and associated engine plugins, VST plugin, tutorials, and examples with the Apache 2.0 license. This means Resonance Audio is yours, so you’re free to use Resonance Audio in your projects, no matter where you work. And if you see something you’d like to improve, submit a GitHub pull request to be reviewed by the Resonance Audio project committers. While the engine plugins for Unity, Unreal, FMOD, and Wwise will remain open source, going forward they will be maintained by project committers from our partners, Unity, Epic, Firelight Technologies, and Audiokinetic, respectively.

    If you’re interested in learning more about Resonance Audio, check out the documentation on our developer site. If you want to get more involved, visit our GitHub to access the source code, build the project, download the latest release, or even start contributing. We’re looking forward to building the future of immersive audio with all of you.

    By Eric Mauskopf, Google AR/VR Team

    Tacotron 2: Generating Human-like Speech from Text



    Generating very natural sounding speech from text (text-to-speech, TTS) has been a research goal for decades. There has been great progress in TTS research over the last few years and many individual pieces of a complete TTS system have greatly improved. Incorporating ideas from past work such as Tacotron and WaveNet, we added more improvements to end up with our new system, Tacotron 2. Our approach does not use complex linguistic and acoustic features as input. Instead, we generate human-like speech from text using neural networks trained using only speech examples and corresponding text transcripts.

    A full description of our new system can be found in our paper “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” In a nutshell it works like this: We use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally these features are converted to a 24 kHz waveform using a WaveNet-like architecture.
    A detailed look at Tacotron 2's model architecture. The lower half of the image describes the sequence-to-sequence model that maps a sequence of letters to a spectrogram. For technical details, please refer to the paper.
    You can listen to some of the Tacotron 2 audio samples that demonstrate the results of our state-of-the-art TTS system. In an evaluation where we asked human listeners to rate the naturalness of the generated speech, we obtained a score that was comparable to that of professional recordings.

    While our samples sound great, there are still some difficult problems to be tackled. For example, our system has difficulties pronouncing complex words (such as “decorum” and “merlot”), and in extreme cases it can even randomly generate strange noises. Also, our system cannot yet generate audio in realtime. Furthermore, we cannot yet control the generated speech, such as directing it to sound happy or sad. Each of these is an interesting research problem on its own.

    Acknowledgements
    Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu, Sound Understanding team, TTS Research team, and TensorFlow team.

    Resonance Audio: Multi-platform spatial audio at scale

    Posted by Eric Mauskopf, Product Manager

    As humans, we rely on sound to guide us through our environment, help us communicate with others and connect us with what's happening around us. Whether walking along a busy city street or attending a packed music concert, we're able to hear hundreds of sounds coming from different directions. So when it comes to AR, VR, games and even 360 video, you need rich sound to create an engaging immersive experience that makes you feel like you're really there. Today, we're releasing a new spatial audio software development kit (SDK) called Resonance Audio. It's based on technology from Google's VR Audio SDK, and it works at scale across mobile and desktop platforms.

    Experience spatial audio in our Audio Factory VR app for Daydreamand SteamVR

    Performance that scales on mobile and desktop

    Bringing rich, dynamic audio environments into your VR, AR, gaming, or video experiences without affecting performance can be challenging. There are often few CPU resources allocated for audio, especially on mobile, which can limit the number of simultaneous high-fidelity 3D sound sources for complex environments. The SDK uses highly optimized digital signal processing algorithms based on higher order Ambisonics to spatialize hundreds of simultaneous 3D sound sources, without compromising audio quality, even on mobile. We're also introducing a new feature in Unity for precomputing highly realistic reverb effects that accurately match the acoustic properties of the environment, reducing CPU usage significantly during playback.

    Using geometry-based reverb by assigning acoustic materials to a cathedral in Unity

    Multi-platform support for developers and sound designers

    We know how important it is that audio solutions integrate seamlessly with your preferred audio middleware and sound design tools. With Resonance Audio, we've released cross-platform SDKs for the most popular game engines, audio engines, and digital audio workstations (DAW) to streamline workflows, so you can focus on creating more immersive audio. The SDKs run on Android, iOS, Windows, MacOS and Linux platforms and provide integrations for Unity, Unreal Engine, FMOD, Wwise and DAWs. We also provide native APIs for C/C++, Java, Objective-C and the web. This multi-platform support enables developers to implement sound designs once, and easily deploy their project with consistent sounding results across the top mobile and desktop platforms. Sound designers can save time by using our new DAW plugin for accurately monitoring spatial audio that's destined for YouTube videos or apps developed with Resonance Audio SDKs. Web developers get the open source Resonance Audio Web SDK that works in the top web browsers by using the Web Audio API.

    DAW plugin for sound designers to monitor audio destined for YouTube 360 videos or apps developed with the SDK

    Model complex Sound Environments Cutting edge features

    By providing powerful tools for accurately modeling complex sound environments, Resonance Audio goes beyond basic 3D spatialization. The SDK enables developers to control the direction acoustic waves propagate from sound sources. For example, when standing behind a guitar player, it can sound quieter than when standing in front. And when facing the direction of the guitar, it can sound louder than when your back is turned.

    Controlling sound wave directivity for an acoustic guitar using the SDK

    Another SDK feature is automatically rendering near-field effects when sound sources get close to a listener's head, providing an accurate perception of distance, even when sources are close to the ear. The SDK also enables sound source spread, by specifying the width of the source, allowing sound to be simulated from a tiny point in space up to a wall of sound. We've also released an Ambisonic recording tool to spatially capture your sound design directly within Unity, save it to a file, and use it anywhere Ambisonic soundfield playback is supported, from game engines to YouTube videos.

    If you're interested in creating rich, immersive soundscapes using cutting-edge spatial audio technology, check out the Resonance Audio documentation on our developer site, let us know what you think through GitHub, and show us what you build with #ResonanceAudio on social media; we'll be resharing our favorites.

    Bringing Real-time Spatial Audio to the Web with Songbird

    Posted by Jamieson Brettle and Drew Allen, Chrome Media Team

    For a virtual scene to be truly immersive, stunning visuals need to be accompanied by true spatial audio to create a realistic and believable experience. Spatial audio tools allow developers to include sounds that can come from any direction, and that are associated in 3D space with audio sources, thus completely enveloping the user in 360-degree sound.

    Spatial audio helps draw the user into a scene and creates the illusion of entering an entirely new world. To make this possible, the Chrome Media team has created Songbird, an open source, spatial audio encoding engine that works in any web browser by using the Web Audio API.

    The Songbird library takes in any number of mono audio streams and allows developers to programmatically place them in 3D space around the user. Songbird allows you to create immersive soundscapes, realistically reproducing reflection and reverb for the space you describe. Sounds bounce off walls and reflect off materials just as they would in real-life, capturing truly 360-degree sound. Songbird creates an ambisonic soundfield that can then be rendered in real-time for use in your application. We've partnered with the Omnitoneproject, which we blogged about last year, to add higher-order ambisonic support to Omnitone's binaural rendererto produce far more accurate sounding audio than ever before.

    Songbird encapsulates Omnitone and with it, developers can now add interactive, full-sphere audio to any web based application. Songbird can scale to any order ambisonics, thereby creating a more realistic sound and higher performance than what is achievable through standard Web Audio API.

    Songbird Audio Processing Diagram

    The implementation of Songbird is based on the Google spatial mediaspecification. It expects mono input and outputs ambisonic (multichannel) ACN channel layout with SN3D normalization. Detailed documentation may be found here.

    As the web emerges as an important VR platformfor delivering content, spatial audio will play a vital role in users' embrace of this new medium. Songbird and Omnitone are key tools in enabling spatial audio on the web platform and establishing it as a preeminent platform for compelling VR experiences. Combining these audio experiences with 3D JavaScript libraries like three.js gives a glimpseinto the future on the web.

    Demo combining spatial sound in 3D environment

    This project was made possible through close collaboration with Google's Daydream and Web Audio teams. This collaboration allowed us to deliver similar audio capabilities to the web as are available to developers creating Daydream applications.

    We look forward to seeing what people do with Songbird now that it's open source. Check out the code on GitHub and let us know what you think. Also available are a number of demoson creating full spherical audio with Songbird.

    Bringing Real-time Spatial Audio to the Web with Songbird

    For a virtual scene to be truly immersive, stunning visuals need to be accompanied by true spatial audio to create a realistic and believable experience. Spatial audio tools allow developers to include sounds that can come from any direction, and that are associated in 3D space with audio sources, thus completely enveloping the user in 360-degree sound.

    Spatial audio helps draw the user into a scene and creates the illusion of entering an entirely new world. To make this possible, the Chrome Media team has created Songbird, an open source, spatial audio encoding engine that works in any web browser by using the Web Audio API.

    The Songbird library takes in any number of mono audio streams and allows developers to programmatically place them in 3D space around the user. Songbird allows you to create immersive soundscapes, realistically reproducing reflection and reverb for the space you describe. Sounds bounce off walls and reflect off materials just as they would in real-life, capturing truly 360-degree sound. Songbird creates an ambisonic soundfield that can then be rendered in real-time for use in your application. We’ve partnered with the Omnitone project, which we blogged about last year, to add higher-order ambisonic support to Omnitone’s binaural renderer to produce far more accurate sounding audio than ever before.

    Songbird encapsulates Omnitone and with it, developers can now add interactive, full-sphere audio to any web based application. Songbird can scale to any order ambisonics, thereby creating a more realistic sound and higher performance than what is achievable through standard Web Audio API.
    Songbird Audio Processing Diagram
    The implementation of Songbird is based on the Google spatial media specification. It expects mono input and outputs ambisonic (multichannel) ACN channel layout with SN3D normalization. Detailed documentation may be found here.

    As the web emerges as an important VR platform for delivering content, spatial audio will play a vital role in users’ embrace of this new medium. Songbird and Omnitone are key tools in enabling spatial audio on the web platform and establishing it as a preeminent platform for compelling VR experiences. Combining these audio experiences with 3D JavaScript libraries like three.js gives a glimpse into the future on the web.
    Demo combining spatial sound in 3D environment
    This project was made possible through close collaboration with Google’s Daydream and Web Audio teams. This collaboration allowed us to deliver similar audio capabilities to the web as are available to developers creating Daydream applications.

    We look forward to seeing what people do with Songbird now that it's open source. Check out the code on GitHub and let us know what you think. Also available are a number of demos on creating full spherical audio with Songbird.

    By Jamieson Brettle and Drew Allen, Chrome Media Team