Tag Archives: Audio

Open sourcing Resonance Audio

Spatial audio adds to your sense of presence when you’re in VR or AR, making it feel and sound, like you’re surrounded by a virtual or augmented world. And regardless of the display hardware you’re using, spatial audio makes it possible to hear sounds coming from all around you.

Resonance Audio, our spatial audio SDK launched last year, enables developers to create more realistic VR and AR experiences on mobile and desktop. We’ve seen a number of exciting experiences emerge across a variety of platforms using our SDK. Recent examples include apps like Pixar’s Coco VR for Gear VR, Disney’s Star Wars™: Jedi Challenges AR app for Android and iOS, and Runaway’s Flutter VR for Daydream, which all used Resonance Audio technology.

To accelerate adoption of immersive audio technology and strengthen the developer community around it, we’re opening Resonance Audio to a community-driven development model. By creating an open source spatial audio project optimized for mobile and desktop computing, any platform or software development tool provider can easily integrate with Resonance Audio. More cross-platform and tooling support means more distribution opportunities for content creators, without the worry of investing in costly porting projects.

What’s Included in the Open Source Project

As part of our open source project, we’re providing a reference implementation of YouTube’s Ambisonic-based spatial audio decoder, compatible with the same Ambisonics format (Ambix ACN/SN3D) used by others in the industry. Using our reference implementation, developers can easily render Ambisonic content in their VR media and other applications, while benefiting from Ambisonics open source, royalty-free model. The project also includes encoding, sound field manipulation and decoding techniques, as well as head related transfer functions (HRTFs) that we’ve used to achieve rich spatial audio that scales across a wide spectrum of device types and platforms. Lastly, we’re making our entire library of highly optimized DSP classes and functions, open to all. This includes resamplers, convolvers, filters, delay lines and other DSP capabilities. Additionally, developers can now use Resonance Audio’s brand new Spectral Reverb, an efficient, high quality, constant complexity reverb effect, in their own projects.

We’ve open sourced Resonance Audio as a standalone library and associated engine plugins, VST plugin, tutorials, and examples with the Apache 2.0 license. This means Resonance Audio is yours, so you’re free to use Resonance Audio in your projects, no matter where you work. And if you see something you’d like to improve, submit a GitHub pull request to be reviewed by the Resonance Audio project committers. While the engine plugins for Unity, Unreal, FMOD, and Wwise will remain open source, going forward they will be maintained by project committers from our partners, Unity, Epic, Firelight Technologies, and Audiokinetic, respectively.

If you’re interested in learning more about Resonance Audio, check out the documentation on our developer site. If you want to get more involved, visit our GitHub to access the source code, build the project, download the latest release, or even start contributing. We’re looking forward to building the future of immersive audio with all of you.

By Eric Mauskopf, Google AR/VR Team

Tacotron 2: Generating Human-like Speech from Text



Generating very natural sounding speech from text (text-to-speech, TTS) has been a research goal for decades. There has been great progress in TTS research over the last few years and many individual pieces of a complete TTS system have greatly improved. Incorporating ideas from past work such as Tacotron and WaveNet, we added more improvements to end up with our new system, Tacotron 2. Our approach does not use complex linguistic and acoustic features as input. Instead, we generate human-like speech from text using neural networks trained using only speech examples and corresponding text transcripts.

A full description of our new system can be found in our paper “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” In a nutshell it works like this: We use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally these features are converted to a 24 kHz waveform using a WaveNet-like architecture.
A detailed look at Tacotron 2's model architecture. The lower half of the image describes the sequence-to-sequence model that maps a sequence of letters to a spectrogram. For technical details, please refer to the paper.
You can listen to some of the Tacotron 2 audio samples that demonstrate the results of our state-of-the-art TTS system. In an evaluation where we asked human listeners to rate the naturalness of the generated speech, we obtained a score that was comparable to that of professional recordings.

While our samples sound great, there are still some difficult problems to be tackled. For example, our system has difficulties pronouncing complex words (such as “decorum” and “merlot”), and in extreme cases it can even randomly generate strange noises. Also, our system cannot yet generate audio in realtime. Furthermore, we cannot yet control the generated speech, such as directing it to sound happy or sad. Each of these is an interesting research problem on its own.

Acknowledgements
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu, Sound Understanding team, TTS Research team, and TensorFlow team.

Resonance Audio: Multi-platform spatial audio at scale

Posted by Eric Mauskopf, Product Manager

As humans, we rely on sound to guide us through our environment, help us communicate with others and connect us with what's happening around us. Whether walking along a busy city street or attending a packed music concert, we're able to hear hundreds of sounds coming from different directions. So when it comes to AR, VR, games and even 360 video, you need rich sound to create an engaging immersive experience that makes you feel like you're really there. Today, we're releasing a new spatial audio software development kit (SDK) called Resonance Audio. It's based on technology from Google's VR Audio SDK, and it works at scale across mobile and desktop platforms.

Experience spatial audio in our Audio Factory VR app for Daydreamand SteamVR

Performance that scales on mobile and desktop

Bringing rich, dynamic audio environments into your VR, AR, gaming, or video experiences without affecting performance can be challenging. There are often few CPU resources allocated for audio, especially on mobile, which can limit the number of simultaneous high-fidelity 3D sound sources for complex environments. The SDK uses highly optimized digital signal processing algorithms based on higher order Ambisonics to spatialize hundreds of simultaneous 3D sound sources, without compromising audio quality, even on mobile. We're also introducing a new feature in Unity for precomputing highly realistic reverb effects that accurately match the acoustic properties of the environment, reducing CPU usage significantly during playback.

Using geometry-based reverb by assigning acoustic materials to a cathedral in Unity

Multi-platform support for developers and sound designers

We know how important it is that audio solutions integrate seamlessly with your preferred audio middleware and sound design tools. With Resonance Audio, we've released cross-platform SDKs for the most popular game engines, audio engines, and digital audio workstations (DAW) to streamline workflows, so you can focus on creating more immersive audio. The SDKs run on Android, iOS, Windows, MacOS and Linux platforms and provide integrations for Unity, Unreal Engine, FMOD, Wwise and DAWs. We also provide native APIs for C/C++, Java, Objective-C and the web. This multi-platform support enables developers to implement sound designs once, and easily deploy their project with consistent sounding results across the top mobile and desktop platforms. Sound designers can save time by using our new DAW plugin for accurately monitoring spatial audio that's destined for YouTube videos or apps developed with Resonance Audio SDKs. Web developers get the open source Resonance Audio Web SDK that works in the top web browsers by using the Web Audio API.

DAW plugin for sound designers to monitor audio destined for YouTube 360 videos or apps developed with the SDK

Model complex Sound Environments Cutting edge features

By providing powerful tools for accurately modeling complex sound environments, Resonance Audio goes beyond basic 3D spatialization. The SDK enables developers to control the direction acoustic waves propagate from sound sources. For example, when standing behind a guitar player, it can sound quieter than when standing in front. And when facing the direction of the guitar, it can sound louder than when your back is turned.

Controlling sound wave directivity for an acoustic guitar using the SDK

Another SDK feature is automatically rendering near-field effects when sound sources get close to a listener's head, providing an accurate perception of distance, even when sources are close to the ear. The SDK also enables sound source spread, by specifying the width of the source, allowing sound to be simulated from a tiny point in space up to a wall of sound. We've also released an Ambisonic recording tool to spatially capture your sound design directly within Unity, save it to a file, and use it anywhere Ambisonic soundfield playback is supported, from game engines to YouTube videos.

If you're interested in creating rich, immersive soundscapes using cutting-edge spatial audio technology, check out the Resonance Audio documentation on our developer site, let us know what you think through GitHub, and show us what you build with #ResonanceAudio on social media; we'll be resharing our favorites.

Bringing Real-time Spatial Audio to the Web with Songbird

Posted by Jamieson Brettle and Drew Allen, Chrome Media Team

For a virtual scene to be truly immersive, stunning visuals need to be accompanied by true spatial audio to create a realistic and believable experience. Spatial audio tools allow developers to include sounds that can come from any direction, and that are associated in 3D space with audio sources, thus completely enveloping the user in 360-degree sound.

Spatial audio helps draw the user into a scene and creates the illusion of entering an entirely new world. To make this possible, the Chrome Media team has created Songbird, an open source, spatial audio encoding engine that works in any web browser by using the Web Audio API.

The Songbird library takes in any number of mono audio streams and allows developers to programmatically place them in 3D space around the user. Songbird allows you to create immersive soundscapes, realistically reproducing reflection and reverb for the space you describe. Sounds bounce off walls and reflect off materials just as they would in real-life, capturing truly 360-degree sound. Songbird creates an ambisonic soundfield that can then be rendered in real-time for use in your application. We've partnered with the Omnitoneproject, which we blogged about last year, to add higher-order ambisonic support to Omnitone's binaural rendererto produce far more accurate sounding audio than ever before.

Songbird encapsulates Omnitone and with it, developers can now add interactive, full-sphere audio to any web based application. Songbird can scale to any order ambisonics, thereby creating a more realistic sound and higher performance than what is achievable through standard Web Audio API.

Songbird Audio Processing Diagram

The implementation of Songbird is based on the Google spatial mediaspecification. It expects mono input and outputs ambisonic (multichannel) ACN channel layout with SN3D normalization. Detailed documentation may be found here.

As the web emerges as an important VR platformfor delivering content, spatial audio will play a vital role in users' embrace of this new medium. Songbird and Omnitone are key tools in enabling spatial audio on the web platform and establishing it as a preeminent platform for compelling VR experiences. Combining these audio experiences with 3D JavaScript libraries like three.js gives a glimpseinto the future on the web.

Demo combining spatial sound in 3D environment

This project was made possible through close collaboration with Google's Daydream and Web Audio teams. This collaboration allowed us to deliver similar audio capabilities to the web as are available to developers creating Daydream applications.

We look forward to seeing what people do with Songbird now that it's open source. Check out the code on GitHub and let us know what you think. Also available are a number of demoson creating full spherical audio with Songbird.

Bringing Real-time Spatial Audio to the Web with Songbird

For a virtual scene to be truly immersive, stunning visuals need to be accompanied by true spatial audio to create a realistic and believable experience. Spatial audio tools allow developers to include sounds that can come from any direction, and that are associated in 3D space with audio sources, thus completely enveloping the user in 360-degree sound.

Spatial audio helps draw the user into a scene and creates the illusion of entering an entirely new world. To make this possible, the Chrome Media team has created Songbird, an open source, spatial audio encoding engine that works in any web browser by using the Web Audio API.

The Songbird library takes in any number of mono audio streams and allows developers to programmatically place them in 3D space around the user. Songbird allows you to create immersive soundscapes, realistically reproducing reflection and reverb for the space you describe. Sounds bounce off walls and reflect off materials just as they would in real-life, capturing truly 360-degree sound. Songbird creates an ambisonic soundfield that can then be rendered in real-time for use in your application. We’ve partnered with the Omnitone project, which we blogged about last year, to add higher-order ambisonic support to Omnitone’s binaural renderer to produce far more accurate sounding audio than ever before.

Songbird encapsulates Omnitone and with it, developers can now add interactive, full-sphere audio to any web based application. Songbird can scale to any order ambisonics, thereby creating a more realistic sound and higher performance than what is achievable through standard Web Audio API.
Songbird Audio Processing Diagram
The implementation of Songbird is based on the Google spatial media specification. It expects mono input and outputs ambisonic (multichannel) ACN channel layout with SN3D normalization. Detailed documentation may be found here.

As the web emerges as an important VR platform for delivering content, spatial audio will play a vital role in users’ embrace of this new medium. Songbird and Omnitone are key tools in enabling spatial audio on the web platform and establishing it as a preeminent platform for compelling VR experiences. Combining these audio experiences with 3D JavaScript libraries like three.js gives a glimpse into the future on the web.
Demo combining spatial sound in 3D environment
This project was made possible through close collaboration with Google’s Daydream and Web Audio teams. This collaboration allowed us to deliver similar audio capabilities to the web as are available to developers creating Daydream applications.

We look forward to seeing what people do with Songbird now that it's open source. Check out the code on GitHub and let us know what you think. Also available are a number of demos on creating full spherical audio with Songbird.

By Jamieson Brettle and Drew Allen, Chrome Media Team

Omnitone: Spatial audio on the web


Spatial audio is a key element for an immersive virtual reality (VR) experience. By bringing spatial audio to the web, the browser can be transformed into a complete VR media player with incredible reach and engagement. That’s why the Chrome WebAudio team has created and is releasing the Omnitone project, an open source spatial audio renderer with the cross-browser support.

Our challenge was to introduce the audio spatialization technique called ambisonics so the user can hear the full-sphere surround sound on the browser. In order to achieve this, we implemented the ambisonic decoding with binaural rendering using web technology. There are several paths for introducing a new feature into the web platform, but we chose to use only the Web Audio API. In doing so, we can reach a larger audience with this cross-browser technology, and we can also avoid the lengthy standardization process for introducing a new Web Audio component. This is possible because the Web Audio API provides all the necessary building blocks for this audio spatialization technique.



Omnitone Audio Processing Diagram

The AmbiX format recording, which is the target of the Omnitone decoder, contains 4 channels of audio that are encoded using ambisonics, which can then be decoded into an arbitrary speaker setup. Instead of the actual speaker array, Omnitone uses 8 virtual speakers based on an the head-related transfer function (HRTF) convolution to render the final audio stream binaurally. This binaurally-rendered audio can convey a sense of space when it is heard through headphones.

The beauty of this mechanism lies in the sound-field rotation applied to the incoming spatial audio stream. The orientation sensor of a VR headset or a smartphone can be linked to Omnitone’s decoder to seamlessly rotate the entire sound field. The rest of the spatialization process will be handled automatically by Omnitone. A live demo can be found at the project landing page.

Throughout the project, we worked closely with the Google VR team for their VR audio expertise. Not only was their knowledge on the spatial audio a tremendous help for the project, but the collaboration also ensured identical audio spatialization across all of Google’s VR applications - both on the web and Android (e.g. Google VR SDK, YouTube Android app). The Spatial Media Specification and HRTF sets are great examples of the Google VR team’s efforts, and Omnitone is built on top of this specification and HRTF sets.

With emerging web-based VR projects like WebVR, Omnitone’s audio spatialization can play a critical role in a more immersive VR experience on the web. Web-based VR applications will also benefit from high-quality streaming spatial audio, as the Chrome Media team has recently added FOA compression to the open source audio codec Opus. More exciting things like VR view integration, higher-order ambisonics and mobile web support will also be coming soon to Omnitone.

We look forward to seeing what people do with Omnitone now that it's open source. Feel free to reach out to us or leave a comment with your thoughts and feedback on the issue tracker on GitHub.

By Hongchan Choi and Raymond Toy, Chrome Team

Due to the incomplete implementation of multichannel audio decoding on various browsers, Omnitone does not support mobile web at the time of writing.

A new method to measure touch and audio latency

Posted by Mark Koudritsky, software engineer

There is a new addition in the arsenal of instruments used by Android and ChromeOS teams in the battle to measure and minimize touch and audio latency: the WALT Latency Timer.

When you use a mobile device, you expect it to respond instantly to your touch or voice: the more immediate the response, the more you feel directly connected to the device. Over the past few years, we have been trying to measure, understand, and reduce latency in our Chromebook and Android products.

Before we can reduce latency, we must first understand where it comes from. In the case of tapping a touchscreen, the time for a response includes the touch-sensing hardware and driver, the application, and the display and graphics output. For a voice command, there is time spent in sampling input audio, the application, and in audio output. Sometimes we have a mixture of these (for example, a piano app would include touch input and audio output).

Most previous work to study latency has focused on measuring a single round-trip latency number. For example, to measure audio latency, an app would measure time from app to speaker/mic and back to the app using the Dr. Rick O'Rang loopback audio dongle together with an appropriate app such as the Dr Rick O’Rang Loopback app or Superpowered Mobile Audio Latency Test App. Similarly, the TouchBot uses a fast camera to measure the round-trip delay from physical touch until a change on the screen is visible. While valuable, the problem with such a setup is that it’s very difficult to break down the latency into input vs output components.

An important innovation in WALT (a descendant of QuickStep) is that it synchronizes an external hardware clock with the Android device or Chromebook to within a millisecond. This allows it to measure input and output latencies separately as opposed to measuring a round-trip latency.

WALT is simple. The parts cost less than $50 and with some basic hobby electronics skills, you can build it yourself.

We’ve been using WALT within Google for Nexus and Chromebook development. We’re now opening this tool to app developers and anyone who wants to precisely measure real-world latencies. We hope that having readily accessible tools will help the industry as a whole improve and make all our devices more responsive to touch and voice.

Exercise or Games? Why Not Both!

Posted by Alice Ching, Google Engineer

We are pleased to announce the release of Games in Motion, an open source game sample to demonstrate how developers can make fun games using Google Fit and Android Wear. Do you ever go on a jog and feel like there is a lack of incentive to help you run better? What if you were a secret agent and had to use your speed and your nifty gadget to complete missions?

With Games in Motion, you can enhance your exercise with missions and actions on your Android Wear device, while logging your jogs to the cloud.

Games in Motion is written in Java programming language using Android Studio. It demonstrates multiple Android technologies.

  • Android Wear bridges notifications from a phone or tablet to a paired Android Wear device. The notifications are stacked so we can show multiple stats at the same time.
  • Google Fit API collects and processes fitness data and sessions. This allows us to use the fitness data to show user progress. All exercise sessions done in Games in Motion will be recorded to Google Fit as well.
  • Google Play Games Services is used to create and unlock achievements.
  • Several different Android audio APIs are integrated.
  • JUnit tests are present for the data-driven parser, which demonstrates how unit testing can be done within Android Studio.

You can download the latest open source release from GitHub. We hope to inspire similar Android games, where multiple different form factors are combined for a fun experience.

Respecting Audio Focus

Posted by Kristan Uccello, Google Developer Relations

It’s rude to talk during a presentation, it disrespects the speaker and annoys the audience. If your application doesn’t respect the rules of audio focus then it’s disrespecting other applications and annoying the user. If you have never heard of audio focus you should take a look at the Android developer training material.

With multiple apps potentially playing audio it's important to think about how they should interact. To avoid every music app playing at the same time, Android uses audio focus to moderate audio playback—your app should only play audio when it holds audio focus. This post provides some tips on how to handle changes in audio focus properly, to ensure the best possible experience for the user.

Requesting audio focus

Audio focus should not be requested when your application starts (don’t get greedy), instead delay requesting it until your application is about to do something with an audio stream. By requesting audio focus through the AudioManager system service, an application can use one of the AUDIOFOCUS_GAIN* constants (see Table 1) to indicate the desired level of focus.

Listing 1. Requesting audio focus.

1. AudioManager am = (AudioManager) mContext.getSystemService(Context.AUDIO_SERVICE);
2.     
3.  int result = am.requestAudioFocus(mOnAudioFocusChangeListener,
4.    // Hint: the music stream.
5.    AudioManager.STREAM_MUSIC,
6.    // Request permanent focus.
7.    AudioManager.AUDIOFOCUS_GAIN);
8.  if (result == AudioManager.AUDIOFOCUS_REQUEST_GRANTED) {
9.    mState.audioFocusGranted = true;
10. } else if (result == AudioManager.AUDIOFOCUS_REQUEST_FAILED) {
11.   mState.audioFocusGranted = false;
12. }

In line 7 above, you can see that we have requested permanent audio focus. An application could instead request transient focus using AUDIOFOCUS_GAIN_TRANSIENT which is appropriate when using the audio system for less than 45 seconds.

Alternatively, the app could use AUDIOFOCUS_GAIN_TRANSIENT_MAY_DUCK, which is appropriate when the use of the audio system may be shared with another application that is currently playing audio (e.g. for playing a "keep it up" prompt in a fitness application and expecting background music to duck during the prompt). The app requesting AUDIOFOCUS_GAIN_TRANSIENT_MAY_DUCK should not use the audio system for more than 15 seconds before releasing focus.

Handling audio focus changes

In order to handle audio focus change events, an application should create an instance of OnAudioFocusChangeListener. In the listener, the application will need to handle theAUDIOFOCUS_GAIN* event and AUDIOFOCUS_LOSS* events (see Table 1). It should be noted that AUDIOFOCUS_GAIN has some nuances which are highlighted in Listing 2, below.

Listing 2. Handling audio focus changes.

1. mOnAudioFocusChangeListener = new AudioManager.OnAudioFocusChangeListener() {  
2.   
3. @Override
4. public void onAudioFocusChange(int focusChange) {
5.   switch (focusChange) {
6.   case AudioManager.AUDIOFOCUS_GAIN:
7.     mState.audioFocusGranted = true;
8.        
9.     if(mState.released) {
10.   initializeMediaPlayer();
11.    }
12.
13. switch(mState.lastKnownAudioFocusState) { 14. case UNKNOWN: 15. if(mState.state == PlayState.PLAY && !mPlayer.isPlaying()) { 16. mPlayer.start(); 17. } 18. break; 19. case AudioManager.AUDIOFOCUS_LOSS_TRANSIENT: 20. if(mState.wasPlayingWhenTransientLoss) { 21. mPlayer.start(); 22. } 23. break; 24. case AudioManager.AUDIOFOCUS_LOSS_TRANSIENT_CAN_DUCK: 25. restoreVolume(); 26. break; 27. } 28.
29. break; 30. case AudioManager.AUDIOFOCUS_LOSS: 31. mState.userInitiatedState = false; 32. mState.audioFocusGranted = false; 33. teardown(); 34. break; 35. case AudioManager.AUDIOFOCUS_LOSS_TRANSIENT: 36. mState.userInitiatedState = false; 37. mState.audioFocusGranted = false; 38. mState.wasPlayingWhenTransientLoss = mPlayer.isPlaying(); 39. mPlayer.pause(); 40. break; 41. case AudioManager.AUDIOFOCUS_LOSS_TRANSIENT_CAN_DUCK: 42. mState.userInitiatedState = false; 43. mState.audioFocusGranted = false; 44. lowerVolume(); 45. break; 46. } 47. mState.lastKnownAudioFocusState = focusChange; 48. } 49.};

AUDIOFOCUS_GAIN is used in two distinct scopes of an applications code. First, it can be used when registering for audio focus as shown in Listing 1. This does NOT translate to an event for the registered OnAudioFocusChangeListener, meaning that on a successful audio focus request the listener will NOT receive an AUDIOFOCUS_GAIN event for the registration.

AUDIOFOCUS_GAIN is also used in the implementation of an OnAudioFocusChangeListener as an event condition. As stated above, the AUDIOFOCUS_GAIN event will not be triggered on audio focus requests. Instead the AUDIOFOCUS_GAIN event will occur only after an AUDIOFOCUS_LOSS* event has occurred. This is the only constant in the set shown Table 1 that is used in both scopes.

There are four cases that need to be handled by the focus change listener. When the application receives an AUDIOFOCUS_LOSS this usually means it will not be getting its focus back. In this case the app should release assets associated with the audio system and stop playback. As an example, imagine a user is playing music using an app and then launches a game which takes audio focus away from the music app. There is no predictable time for when the user will exit the game. More likely, the user will navigate to the home launcher (leaving the game in the background) and launch yet another application or return to the music app causing a resume which would then request audio focus again.

However another case exists that warrants some discussion. There is a difference between losing audio focus permanently (as described above) and temporarily. When an application receives an AUDIOFOCUS_LOSS_TRANSIENT, the behavior of the app should be that it suspends its use of the audio system until it receives an AUDIOFOCUS_GAIN event. When the AUDIOFOCUS_LOSS_TRANSIENT occurs, the application should make a note that the loss is temporary, that way on audio focus gain it can reason about what the correct behavior should be (see lines 13-27 of Listing 2).

Sometimes an app loses audio focus (receives an AUDIOFOCUS_LOSS) and the interrupting application terminates or otherwise abandons audio focus. In this case the last application that had audio focus may receive an AUDIOFOCUS_GAIN event. On the subsequent AUDIOFOCUS_GAIN event the app should check and see if it is receiving the gain after a temporary loss and can thus resume use of the audio system or if recovering from an permanent loss, setup for playback.

If an application will only be using the audio capabilities for a short time (less than 45 seconds), it should use an AUDIOFOCUS_GAIN_TRANSIENT focus request and abandon focus after it has completed its playback or capture. Audio focus is handled as a stack on the system — as such the last process to request audio focus wins.

When audio focus has been gained this is the appropriate time to create a MediaPlayer or MediaRecorder instance and allocate resources. Likewise when an app receives AUDIOFOCUS_LOSS it is good practice to clean up any resources allocated. Gaining audio focus has three possibilities that also correspond to the three audio focus loss cases in Table 1. It is a good practice to always explicitly handle all the loss cases in the OnAudioFocusChangeListener.

Table 1. Audio focus gain and loss implication.

GAIN LOSS
AUDIOFOCUS_GAIN AUDIOFOCUS_LOSS
AUDIOFOCUS_GAIN_TRANSIENT AUDIOFOCUS_LOSS_TRANSIENT
AUDIOFOCUS_GAIN_TRANSIENT_MAY_DUCK AUDIOFOCUS_LOSS_TRANSIENT_CAN_DUCK

Note: AUDIOFOCUS_GAIN is used in two places. When requesting audio focus it is passed in as a hint to the AudioManager and it is used as an event case in the OnAudioFocusChangeListener. The gain events highlighted in green are only used when requesting audio focus. The loss events are only used in the OnAudioFocusChangeListener.

Table 2. Audio stream types.

Stream Type Description
STREAM_ALARM The audio stream for alarms
STREAM_DTMF The audio stream for DTMF Tones
STREAM_MUSIC The audio stream for "media" (music, podcast, videos) playback
STREAM_NOTIFICATION The audio stream for notifications
STREAM_RING The audio stream for the phone ring
STREAM_SYSTEM The audio stream for system sounds

An app will request audio focus (see an example in the sample source code linked below) from the AudioManager (Listing 1, line 1). The three arguments it provides are an audio focus change listener object (optional), a hint as to what audio channel to use (Table 2, most apps should use STREAM_MUSIC) and the type of audio focus from Table 1, column 1. If audio focus is granted by the system (AUDIOFOCUS_REQUEST_GRANTED), only then handle any initialization (see Listing 1, line 9).

Note: The system will not grant audio focus (AUDIOFOCUS_REQUEST_FAILED) if there is a phone call currently in process and the application will not receive AUDIOFOCUS_GAIN after the call ends.

Within an implementation of OnAudioFocusChange(), understanding what to do when an application receives an onAudioFocusChange() event is summarized in Table 3.

In the cases of losing audio focus be sure to check that the loss is in fact final. If the app receives an AUDIOFOCUS_LOSS_TRANSIENT or AUDIOFOCUS_LOSS_TRANSIENT_CAN_DUCK it can hold onto the media resources it has created (don’t call release()) as there will likely be another audio focus change event very soon thereafter. The app should take note that it has received a transient loss using some sort of state flag or simple state machine.

If an application were to request permanent audio focus with AUDIOFOCUS_GAIN and then receive an AUDIOFOCUS_LOSS_TRANSIENT_CAN_DUCK an appropriate action for the application would be to lower its stream volume (make sure to store the original volume state somewhere) and then raise the volume upon receiving an AUDIOFOCUS_GAIN event (see Figure 1, below).

Table 3. Appropriate actions by focus change type.

Focus Change Type Appropriate Action
AUDIOFOCUS_GAIN Gain event after loss event: Resume playback of media unless other state flags set by the application indicate otherwise. For example, the user paused the media prior to loss event.
AUDIOFOCUS_LOSS Stop playback. Release assets.
AUDIOFOCUS_LOSS_TRANSIENT Pause playback and keep a state flag that the loss is transient so that when the AUDIOFOCUS_GAIN event occurs you can resume playback if appropriate. Do not release assets.
AUDIOFOCUS_LOSS_TRANSIENT_CAN_DUCK Lower volume or pause playback keeping track of state as with AUDIOFOCUS_LOSS_TRANSIENT. Do not release assets.

Conclusion and further reading

Understanding how to be a good audio citizen application on an Android device means respecting the system's audio focus rules and handling each case appropriately. Try to make your application behave in a consistent manner and not negatively surprise the user. There is a lot more that can be talked about within the audio system on Android and in the material below you will find some additional discussions.

Example source code is available here:

https://android.googlesource.com/platform/development/+/master/samples/RandomMusicPlayer