Add stereo music or narration to VR videos

We've introduced a new feature that lets YouTube creators mix spatial audio with stereo audio content, such as music or narration, when they work on VR experiences. Viewers can already enjoy this feature in the YouTube mobile apps as well as in desktop web browsers.

Why we need spatial audio in VR

Sound has been part of making movies great almost from the start. Live musical performances accompanied silent movies even before 1927, when "The Jazz Singer" brought talkies to the big screen. In the early days, movie sound reproduction was very primitive and typically played over a single loudspeaker. Little consideration was given to the relationship between recorded sounds and the objects or actors on the screen. As technology progressed, people realized that making sound stereo (putting some sounds to the left, some to the right, and some moving from side to side across the screen) added another dimension to the experience. It's easier to get immersed in a movie when the sound and the picture fit together!

We like it when the sound of an engine follows the car we are seeing. We get excited if our attention is suddenly drawn to another part of the screen by a door squeak, a gunshot, or an explosion. Although stereo sound creates an immersive experience, a typical loudspeaker set-up places the speakers on either side of the screen, which largely confines the movement of sound to within the screen itself. One limitation of this is that it doesn't match what we're used to in real life. We're used to hearing sounds from all around us, even when we don't see where they're coming from.

The need for more accurate real-life sound reproduction was recognized even before stereo was perfected for film, in the production of "Fantasia" and its groundbreaking multi-speaker Fantasound system. Through the use of multiple speakers, "Fantasia" pushed sound reproduction off the screen and into three dimensions, putting the audience at the center of a 3-D sound experience. Since this early work, sound technology has advanced to more complex multi-speaker surround sound systems, and also to 3-D headphone sound.

More recently, we've seen the emergence of VR, which aims to improve immersive experiences further by giving the audience not just a 3-D audio experience, but an entire 3-D video experience too. That's why the VR teams at Google and YouTube have been working to provide YouTube users with VR experiences with immersive spatial audio.

Spatial audio in VR

One of the challenges in VR production is sound design and spatial audio production. A major task for sound designers is to accurately associate sounds in 3-D space with visible objects within the 3-D video scene. As with the engine sound we mentioned before, a sound designer needs to position the audio so that it accurately follows the visible position of the car in the scene. The car in this example is what is known as a diegetic sound source, because its position is visible or implied within the video scene. In a typical cinematic production, though, there will also be sounds that don't directly correspond to positions within the video scene, like voice-overs or narration. Voiced narration (e.g. 'Red' in "The Shawshank Redemption") is typically not associated with any object within the video scene. This type of sound is known as a non-diegetic sound source. Another example would be background music, which has been present in cinematic experiences since the very beginning.

How does it work?

When you watch a VR video on a head-mounted display (HMD) like the Daydream View, the spatial audio rendering needs to accurately reproduce the intentions of the sound designer. How does it achieve this? First, a spatial audio rendering engine needs to treat non-diegetic and diegetic sounds differently.

The audio processing for diegetic sounds is conceptually quite simple: The device knows how your head moves, and hence, how all the sounds need to be filtered, so that what you hear over headphones precisely reflects what is happening around you at that very moment. It is like creating a unique headphone mix especially for you every time you watch a movie. This way you can experience all the sounds with their true depth and spatial location, coming from all around you!
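To make the idea of head-tracked rendering a little more concrete, here is a minimal Python sketch. It is not how YouTube's renderer works: a real engine filters sounds with head-related transfer functions, whereas this toy version only computes where a source sits relative to the listener's nose and turns that into constant-power left/right gains. The function names `relative_azimuth` and `pan_gains` and the angle convention (counter-clockwise, positive to the left) are assumptions made for illustration.

```python
import math

def relative_azimuth(source_az_deg, head_yaw_deg):
    """Direction of a source relative to the listener's nose, in [-180, 180).
    Angles are counter-clockwise, so positive means "to the left"."""
    return (source_az_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

def pan_gains(rel_az_deg):
    """Constant-power (left, right) gains for a source at rel_az_deg.
    Toy stand-in for HRTF filtering: ignores elevation and front/back cues."""
    pan = math.sin(math.radians(rel_az_deg))   # +1 fully left, -1 fully right
    theta = (1.0 - pan) * math.pi / 4.0        # 0 (left) .. pi/2 (right)
    return math.cos(theta), math.sin(theta)

# A car engine straight ahead in the scene; the viewer turns 90 degrees to the left.
left, right = pan_gains(relative_azimuth(0.0, 90.0))
# The engine is now heard almost entirely in the right ear.
```

The point of the sketch is only that the gains depend on the head orientation at every moment, which is why the mix is effectively recomputed for each viewer as they look around.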

When it comes to non-diegetic sounds, the situation is quite different. These should be rendered as a standard stereophonic track alongside the immersive spatial audio content, preserving the original fidelity of the music or the narrator's voice. The viewer should experience them the same way that we are used to: in left/right stereo. (This is why you may hear the phrase "head-locked stereo.")

Create spatial audio with head-locked stereo and upload to YouTube

YouTube now allows creators to join these two concepts together and augment immersive spatial audio with more traditional stereo content. When creators add two extra channels to their uploaded spatial audio soundtrack, those channels are interpreted as head-locked stereo and don't go through the processing that YouTube applies to spatial audio. In other words, they will sound exactly the same as more traditional audio uploaded to YouTube. See the YouTube Spatial Audio help page for a more detailed guide on how to prepare and upload spatial audio and head-locked stereo to YouTube. Also, make sure to check out the example video.
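The split between head-tracked and head-locked channels can be sketched in a few lines of Python. This is an illustration under stated assumptions, not YouTube's actual pipeline: it assumes the spatial part is a four-channel first-order ambisonic bed in W, Y, Z, X order followed by the two stereo channels, and it replaces real binaural rendering with a crude left/right decode. The names `split_frame`, `render_spatial`, and `render_frame` are hypothetical.

```python
import math

SPATIAL_CHANNELS = 4  # assumed: first-order ambisonics in W, Y, Z, X order

def split_frame(frame):
    """Split one multichannel sample frame into (spatial, head_locked)."""
    if len(frame) != SPATIAL_CHANNELS + 2:
        raise ValueError("expected 4 spatial channels plus 2 stereo channels")
    return frame[:SPATIAL_CHANNELS], frame[SPATIAL_CHANNELS:]

def render_spatial(spatial, head_yaw_deg):
    """Toy head-tracked decode: rotate the sound field against the head yaw,
    then derive left/right from W and the rotated Y (left-right) component."""
    w, y, _z, x = spatial
    yaw = math.radians(head_yaw_deg)
    y_rot = y * math.cos(yaw) - x * math.sin(yaw)
    return 0.5 * (w + y_rot), 0.5 * (w - y_rot)

def render_frame(frame, head_yaw_deg):
    """Head-tracked spatial audio plus head-locked stereo added unchanged."""
    spatial, (hl_left, hl_right) = split_frame(frame)
    sp_left, sp_right = render_spatial(spatial, head_yaw_deg)
    return sp_left + hl_left, sp_right + hl_right

# Music only (silent spatial bed): the output ignores head movement.
music_only = [0.0, 0.0, 0.0, 0.0, 0.3, -0.2]
facing_front = render_frame(music_only, 0.0)
turned_left = render_frame(music_only, 90.0)
```

In this sketch `facing_front` and `turned_left` carry the same stereo samples, because the last two channels bypass the rotation step entirely, while a source in the first four channels would shift between the ears as the yaw changes.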

YouTube viewers can already enjoy this new feature on YouTube Android/iOS apps, as well as Chrome, Opera and now also Mozilla Firefox web browsers. For the best experience, we recommend using YouTube VR with the Daydream View.

Marcin Gorzel and Damien Kelly, Software Engineers, recently watched "Ecuadorian Cloud Forest in 360 VR!! (2018)."