Virtual Reality (VR) enables remarkably immersive experiences, offering new ways to view the world and the ability to explore novel environments, both real and imaginary. However, compared to physical reality, sharing these experiences with others can be difficult, as VR headsets make it challenging to create a complete picture of the people participating in the experience.
Some of this disconnect is alleviated by Mixed Reality (MR), a related medium that shares the virtual context of a VR user in a two-dimensional video format, allowing other viewers to get a feel for the user’s virtual experience. Even though MR facilitates sharing, the headset continues to block facial expressions and eye gaze, presenting a significant hurdle to a fully engaging experience and a complete view of the person in VR.
Google Machine Perception researchers, in collaboration with Daydream Labs and YouTube Spaces, have been working on solutions to address this problem wherein we reveal the user’s face by virtually “removing” the headset and create a realistic see-through effect.
Our approach is best explained in the context of enhancing Mixed Reality video (also discussed in the Google-VR blog). It consists of three main components:
Dynamic face model capture
The core idea behind our technique is to use a 3D model of the user’s face as a proxy for the hidden face. This proxy is used to synthesize the face in the MR video, thereby creating an impression of the headset being removed. First, we capture a personalized 3D face model for the user with what we call gaze-dependent dynamic appearance. This initial calibration step requires the user to sit in front of a color+depth camera and a monitor, and then track a marker on the monitor with their eyes. We use this one-time calibration procedure, which typically takes less than a minute, to acquire a 3D face model of the user and to learn a database that maps appearance images (or textures) to different eye-gaze directions and blinks. This gaze database (i.e., the face model with textures indexed by eye-gaze) allows us to dynamically change the appearance of the face during synthesis and generate any desired eye-gaze, thus making the synthesized face look natural and alive.
Calibration and Alignment
Creating a Mixed Reality video requires a specialized setup consisting of an external camera, calibrated and time-synced with the headset. The camera captures a video stream of the VR user in front of a green screen, and a cutout of the user is then composited with the virtual world to create the final MR video. An important step here is to accurately estimate the calibration (the fixed 3D transformation) between the camera and headset coordinate systems. These calibration techniques typically involve significant manual intervention and are done in multiple steps. We simplify the process by adding a physical marker to the front of the headset and tracking it visually in 3D, which allows us to optimize for the calibration parameters automatically from the VR session.
For headset “removal”, we need to align the 3D face model with the visible portion of the face in the camera stream, so that the two blend seamlessly with each other. A reasonable proxy for this alignment is to place the face model just behind the headset. The calibration described above, coupled with VR headset tracking, provides sufficient information to determine this placement, allowing us to modify the camera stream by rendering the virtual face into it.
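To make the placement step concrete, here is a minimal sketch of how a fixed calibration and a live headset pose could be chained to position the face model in the camera's coordinate system. This illustrates rigid-transform composition in general, not the implementation used in this work. All names (`camera_from_tracking`, `tracking_from_headset`, `headset_from_face`) and the sign of the "just behind the headset" offset are assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # 4x4 homogeneous rigid transform, row-major


def identity() -> Matrix:
    """Return the 4x4 identity transform."""
    return [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def translation(x: float, y: float, z: float) -> Matrix:
    """Return a transform that only translates by (x, y, z)."""
    m = identity()
    m[0][3], m[1][3], m[2][3] = x, y, z
    return m


def compose(a: Matrix, b: Matrix) -> Matrix:
    """Matrix product a @ b: apply b first, then a."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def face_pose_in_camera(camera_from_tracking: Matrix,
                        tracking_from_headset: Matrix,
                        headset_from_face: Matrix) -> Matrix:
    """Chain the three transforms to express the face model in camera space.

    camera_from_tracking:  the fixed calibration (hypothetical name).
    tracking_from_headset: the per-frame headset pose from VR tracking.
    headset_from_face:     a fixed offset that puts the face model just
                           behind the headset (assumed axis convention).
    """
    return compose(camera_from_tracking,
                   compose(tracking_from_headset, headset_from_face))


def origin_of(pose: Matrix) -> Tuple[float, float, float]:
    """Return the translation part of a pose, i.e. where its origin lands."""
    return (pose[0][3], pose[1][3], pose[2][3])
```

In this sketch only `tracking_from_headset` changes from frame to frame. The other two transforms are estimated once, which is why an accurate calibration matters so much for the final alignment.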
Compositing and Rendering
Having tackled the alignment, the last step involves producing a suitable rendering of the 3D face model, consistent with the content in the camera stream. We are able to reproduce the true eye-gaze of the user by combining our dynamic gaze database with an HTC Vive headset that has been modified by SMI to incorporate eye-tracking technology. Images from these eye trackers lack sufficient detail to directly reproduce the occluded face region, but are well suited to provide fine-grained gaze information. Using the live gaze data from the tracker, we synthesize a face proxy that accurately represents the user’s attention and blinks. At run-time, the gaze database, captured in the preprocessing step, is searched for the most appropriate face image corresponding to the query gaze state, while also respecting aesthetic considerations such as temporal smoothness. Additionally, to account for lighting changes between gaze database acquisition and run-time, we apply color correction and feathering, so that the synthesized face region matches the rest of the face.
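As a rough illustration of the run-time search described above, the sketch below picks the database entry closest to the queried gaze while penalizing large jumps from the previously shown entry. The cost function, the `smoothness_weight` parameter, and the `GazeSample` fields are assumptions chosen for clarity. The source does not specify how the search or the smoothness term is actually formulated.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GazeSample:
    """One hypothetical database entry: a gaze state and its face texture."""
    yaw: float        # horizontal gaze angle, degrees
    pitch: float      # vertical gaze angle, degrees
    blink: bool       # True if the eyes are closed in this texture
    texture_id: int   # stand-in for the stored appearance image


def select_sample(database: List[GazeSample],
                  query_yaw: float,
                  query_pitch: float,
                  query_blink: bool,
                  previous: Optional[GazeSample] = None,
                  smoothness_weight: float = 0.5) -> GazeSample:
    """Return the entry that best matches the query gaze state.

    Entries with the matching blink state are preferred. Among those, the
    cost is the angular distance to the query plus a penalty (assumed form)
    for moving far away from the previously selected entry.
    """
    if not database:
        raise ValueError("gaze database is empty")
    # Fall back to the whole database if no entry has this blink state.
    candidates = [s for s in database if s.blink == query_blink] or database

    def cost(sample: GazeSample) -> float:
        gaze_cost = math.hypot(sample.yaw - query_yaw,
                               sample.pitch - query_pitch)
        if previous is None:
            return gaze_cost
        jump = math.hypot(sample.yaw - previous.yaw,
                          sample.pitch - previous.pitch)
        return gaze_cost + smoothness_weight * jump

    return min(candidates, key=cost)
```

With a nonzero `smoothness_weight`, a query that falls between two entries tends to stay on the previously shown texture rather than flicker between neighbors. Setting the weight to zero reduces the search to a plain nearest-neighbor lookup.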
Humans are highly sensitive to artifacts on faces, and even small imperfections in synthesis of the occluded face can feel unnatural and distracting, a phenomenon known as the “uncanny valley.” To mitigate this problem, we do not remove the headset completely. Instead, we have chosen a user experience that conveys a ‘scuba mask effect’ by compositing the color-corrected face proxy with a translucent headset. Reminding the viewer of the presence of the headset helps us avoid the uncanny valley, and also makes our algorithm robust to small errors in alignment and color correction.
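The following sketch shows one simple way the color correction and translucent compositing could be expressed per pixel. It is an assumption-laden illustration, not the method used in this work: the color correction here is a per-channel gain that matches mean colors, and the translucency is a constant-opacity alpha blend. The function names and the `headset_opacity` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit RGB


def channel_gains(reference: Sequence[Pixel],
                  proxy: Sequence[Pixel]) -> Tuple[float, float, float]:
    """Per-channel gains that map the proxy's mean color to the reference's.

    `reference` would be pixels from the visible part of the face and
    `proxy` pixels from the synthesized face (assumed inputs).
    """
    if not reference or not proxy:
        raise ValueError("both pixel sets must be non-empty")
    gains: List[float] = []
    for c in range(3):
        ref_mean = sum(p[c] for p in reference) / len(reference)
        proxy_mean = sum(p[c] for p in proxy) / len(proxy)
        # Leave the channel unchanged if the proxy mean is zero.
        gains.append(ref_mean / proxy_mean if proxy_mean > 0 else 1.0)
    return (gains[0], gains[1], gains[2])


def apply_gains(pixel: Pixel, gains: Tuple[float, float, float]) -> Pixel:
    """Scale each channel by its gain and clamp to the 8-bit range."""
    r, g, b = (min(255, max(0, round(v * k))) for v, k in zip(pixel, gains))
    return (r, g, b)


def blend_translucent(headset: Pixel, face: Pixel,
                      headset_opacity: float) -> Pixel:
    """Alpha-blend the headset over the face proxy.

    headset_opacity = 1.0 shows only the headset, 0.0 only the face.
    Intermediate values give the see-through, scuba-mask-like look.
    """
    if not 0.0 <= headset_opacity <= 1.0:
        raise ValueError("headset_opacity must be in [0, 1]")
    r, g, b = (round(headset_opacity * h + (1.0 - headset_opacity) * f)
               for h, f in zip(headset, face))
    return (r, g, b)
```

Keeping some headset opacity means small misalignments in the face proxy are partly masked by the headset pixels blended on top, which matches the robustness argument made above.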
This modified camera stream, displaying a see-through headset, with the user’s face revealed and their true eye-gaze recreated, is subsequently merged with the virtual environment to create the final MR video.
Results and Extensions
We have used our headset removal technology to enhance Mixed Reality, allowing the medium to not only convey a VR user’s interaction with the virtual environment but also show their face in a natural and convincing fashion. The example below demonstrates our tech applied to an artist using Google Tilt Brush in a virtual environment: