Posted by Erica Hanson, Global Program Manager, Google Developer Student Clubs
(Irene (left) and her DSC team from the Polytechnic University of Cartagena (photo prior to COVID-19)
Irene Ruiz Pozo is a former Google Developer Student Club (DSC) Lead at the Polytechnic University of Cartagena in Murcia, Spain. As one of the founding members, Irene has seen the club grow from just a few student developers at her university to hosting multiple learning events across Spain. Recently, we spoke with Irene to understand more about the unique ways in which her team helped local university students learn more about Google technologies.
Real world ML and AR learning opportunities
Irene mentioned two fascinating projects that she had the chance to work on through her DSC at the Polytechnic University of Cartagena. The first was a learning lab that helped students understand how to use 360º cameras and 3D scanners for machine learning.
(A DSC member giving a demo of a 360º camera to students at the National Museum of Underwater Archeology in Cartagena)
The second was a partnership with the National Museum of Underwater Archeology, where Irene and her team created an augmented reality game that let students explore a digital rendition of the museum’s exhibitions.
(An image from the augmented reality game created for the National Museum of Underwater Archeology)
In the above AR experience created by Irene’s team, users can create their own character and move throughout the museum and explore different virtual renditions of exhibits in a video game-like setting.
Hash Code competition and experiencing the Google work culture
(Students working on the Hash Code competition (photo taken prior to COVID-19)
To Irene, the experience felt like a live look at being a software engineer at Google. The event taught her and her DSC team that while programming skills are important, communication and collaboration skills are what really help solve problems. For Irene, the experience truly bridged the gap between theory and practice.
Expanding knowledge with a podcast for student developers
(Irene’s team working with other student developers (photo taken before COVID-19)
After the event, Irene felt that if a true mentorship network was established among other DSCs in Europe, students would feel more comfortable partnering with one another to talk about common problems they faced. Inspired, she began to build out her mentorship program which included a podcast where student developers could collaborate on projects together.
The podcast, which just released its second episode, also highlights upcoming opportunities for students. In the most recent episode, Irene and friends dive into how to apply for Google Summer of Code Scholarships and talk about other upcoming open source project opportunities. Organizing these types of learning experiences for the community was one of the most fulfilling parts of working as a DSC Lead, according to Irene. She explained that the podcast has been an exciting space that allows her and other students to get more experience presenting ideas to an audience. Through this podcast, Irene has already seen many new DSC members eager to join the conversation and collaborate on new ideas.
As Irene now looks out on her future, she is excited for all the learning and career development that awaits her from the entire Google Developer community. Having graduated from university, Irene is now a Google Developer Groups (GDG) Lead - a program similar to DSC, but created for the professional developer community. In this role, she is excited to learn new skills and make professional connections that will help her start her career.
Are you also a student with a passion for code? Then join a local Google Developer Student Club near you, here.
Posted by Eric Lai, Product Manager, Augmented Reality
Augmented reality (AR) can help you explore the world around you in new, seemingly magical ways. Whether you want to venture through the Earth’s unique habitats, explore historic cultures or even just find the shortest path to your destination, there’s no shortage of ways that AR can help you interact with the world.
That’s why we’re constantly improving ARCore — so developers can build amazing AR experiences that help us reimagine what’s possible.
In 2018, we introduced the Cloud Anchors API in ARCore, which lets people across devices view and share the same AR content in real-world spaces. Since then, we’ve been working on new ways for developers to use Cloud Anchors to make AR content persist and more easily discoverable.
Create long-lasting AR experiences
Last year, we previewed persistent Cloud Anchors, which lets people return to shared AR experiences again and again. With ARCore 1.20, this feature is now widely available to Android, iOS, and Unity mobile developers.
Developers all over the world are already using this technology to help people learn, share and engage with the world around them in new ways.
MARK, which we highlighted last year, is a social platform that lets people leave AR messages in real-world locations for friends, family and their community to discover. MARK is now available globally and will be launching the MARK Hope Campaign in the US to help people raise funds for their favorite charities and have their donations matched for a limited time.
REWILD Our Planet is an AR nature series produced by Melbourne based studio PHORIA. The experience is based on the Netflix original documentary series Our Planet. REWILD uses Ultra High Definition Video alongside AR content to let you venture into earth’s unique habitats and interact with endangered wildlife. It originally launched in museums, but can now be enjoyed on your smartphone in your living room. As episodes of the show are released, persistent Cloud Anchors allow you to return to the same spot in your own home to see how nature is changing.
Changdeok ARirang is an AR tour guide app that combines the power of SK Telecom’s 5G with persistent Cloud Anchors. Visitors at Changdeokgung Palace in South Korea are guided by the legendary Haechi to relevant locations where they can experience historical and cultural high fidelity AR content. Changdeok ARirang at Home was also launched so that this same experience can be accessed from the comfort of your couch.
In Sweden, SJ Labs, the innovation arm of Swedish Railways, together with Bontouch, their tech innovation partner, uses persistent Cloud Anchors to help passengers find their way at Central Station in Stockholm, making it easier and faster for them to make their train departures.
Coming soon, Lowe’s Persistent View will let you design your home in AR with the help of an expert. You’ll be able to add furniture and appliances to different areas of your home to see how they’d look, and return to the experience as many times as needed before making a purchase.
Lowe’s Persistent View powered by Streem
If you’re interested in building AR experiences that last over time, you can learn more about persistent Cloud Anchors in our docs.
Call for collaborators: test a new way to find AR content
As developers use Cloud Anchors to attach more AR experiences to the world, we also want to make it easier for people to discover them. That’s why we’re working on earth Cloud Anchors, a new feature that uses AR and global localization—the underlying technology that powers Live View features on Google Maps—to easily guide users to AR content. If you’re interested in early access to test this feature, you can apply here.
Earlier this year, the MediaPipe Team released the Face Mesh solution, which estimates the approximate 3D face shape via 468 landmarks in real-time on mobile devices. In this blog, we introduce a new face transform estimation module that establishes a researcher- and developer-friendly semantic API useful for determining the 3D face pose and attaching virtual objects (like glasses, hats or masks) to a face.
The new module establishes a metric 3D space and uses the landmark screen positions to estimate common 3D face primitives, including a face pose transformation matrix and a triangular face mesh. Under the hood, a lightweight statistical analysis method called Procrustes Analysis is employed to drive a robust, performant and portable logic. The analysis runs on CPU and has a minimal speed/memory footprint on top of the original Face Mesh solution.
Figure 1: An example of virtual mask and glasses effects, based on the MediaPipe Face Mesh solution.
The MediaPipe Face Landmark Model performs a single-camera face landmark detection in the screen coordinate space: the X- and Y- coordinates are normalized screen coordinates, while the Z coordinate is relative and is scaled as the X coordinate under the weak perspective projection camera model. While this format is well-suited for some applications, it does not directly enable crucial features like aligning a virtual 3D object with a detected face.
The newly introduced module moves away from the screen coordinate space towards a metric 3D space and provides the necessary primitives to handle a detected face as a regular 3D object. By design, you'll be able to use a perspective camera to project the final 3D scene back into the screen coordinate space with a guarantee that the face landmark positions are not changed.
Metric 3D Space
The Metric 3D space established within the new module is a right-handed orthonormal metric 3D coordinate space. Within the space, there is a virtual perspective camera located at the space origin and pointed in the negative direction of the Z-axis. It is assumed that the input camera frames are observed by exactly this virtual camera and therefore its parameters are later used to convert the screen landmark coordinates back into the Metric 3D space. The virtual camera parameters can be set freely, however for better results it is advised to set them as close to the real physical camera parameters as possible.
Figure 2: A visualization of multiple key elements in the metric 3D space. Created in Cinema 4D
Canonical Face Model
The Canonical Face Model is a static 3D model of a human face, which follows the 3D face landmark topology of the MediaPipe Face Landmark Model. The model bears two important functions:
Defines metric units: the scale of the canonical face model defines the metric units of the Metric 3D space. A metric unit used by the default canonical face model is a centimeter;
Bridges static and runtime spaces: the face pose transformation matrix is - in fact - a linear map from the canonical face model into the runtime face landmark set estimated on each frame. This way, virtual 3D assets modeled around the canonical face model can be aligned with a tracked face by applying the face pose transformation matrix to them.
Face Transform Estimation
The face transform estimation pipeline is a key component, responsible for estimating face transform data within the Metric 3D space. On each frame, the following steps are executed in the given order:
Face landmark screen coordinates are converted into the Metric 3D space coordinates;
Face pose transformation matrix is estimated as a rigid linear mapping from the canonical face metric landmark set into the runtime face metric landmark set in a way that minimizes a difference between the two;
A face mesh is created using the runtime face metric landmarks as the vertex positions (XYZ), while both the vertex texture coordinates (UV) and the triangular topology are inherited from the canonical face model.
The Effect Renderer is a component, which serves as a working example of a face effect renderer. It targets the OpenGL ES 2.0 API to enable a real-time performance on mobile devices and supports the following rendering modes:
3D object rendering mode: a virtual object is aligned with a detected face to emulate an object attached to the face (example: glasses);
Face mesh rendering mode: a texture is stretched on top of the face mesh surface to emulate a face painting technique.
In both rendering modes, the face mesh is first rendered as an occluder straight into the depth buffer. This step helps to create a more believable effect via hiding invisible elements behind the face surface.
Figure 3: An example of face effects rendered by the Face Effect Renderer.
Using Face Transform Module
The face transform estimation module is available as a part of the MediaPipe Face Mesh solution. It comes with face effect application examples, available as graphs and mobile apps on Android or iOS. If you wish to go beyond examples, the module contains generic calculators and subgraphs - those can be flexibly applied to solve specific use cases in any MediaPipe graph. For more information, please visit our documentation.
We look forward to publishing more blog posts related to new MediaPipe pipeline examples and features. Please follow the MediaPipe label on Google Developers Blog and Google Developers twitter account (@googledevs).
We would like to thank Chuo-Ling Chang, Ming Guang Yong, Jiuqiang Tang, Gregory Karpiak, Siarhei Kazakou, Matsvei Zhdanovich and Matthias Grundman for contributing to this blog post.
Augmented Reality (AR) technology creates fun, engaging, and immersive user experiences. The ability to perform AR tracking across devices and platforms, without initialization, remains important for powering AR applications at scale.
Today, we are excited to release the Instant Motion Tracking solution in MediaPipe. It is built upon the MediaPipe Box Tracking solution we released previously. With Instant Motion Tracking, you can easily place fun virtual 2D and 3D content on static or moving surfaces, allowing them to seamlessly interact with the real world. This technology also powered MotionStills AR. Along with the library, we are releasing an open source Android application to showcase its capabilities. In this application, a user simply taps the camera viewfinder in order to place virtual 3D objects and GIF animations, augmenting the real-world environment.
Instant Motion Tracking in MediaPipe
Instant Motion Tracking
The Instant Motion Tracking solution provides the capability to seamlessly place virtual content on static or motion surfaces in the real world. To achieve that, we provide the six degrees of freedom tracking with relative scale in the form of rotation and translation matrices. This tracking information is then used in the rendering system to overlay virtual content on camera streams to create immersive AR experiences.
The core concept behind Instant Motion Tracking is to decouple the camera’s translation and rotation estimation, treating them instead as independent optimization problems. This approach enables AR tracking across devices and platforms without initialization or calibration. We do this by first finding the 3D camera translation using only the visual signals from the camera. This involves estimating the target region's apparent 2D translation and relative scale across frames. The process can be illustrated with a simple pinhole camera model, relating translation and scale of an object in the image plane to the final 3D translation.
By finding the change in relative size of our tracked region from view position V1 to V2, we can estimate the relative change in distance from the camera.
Next, we obtain the device’s 3D rotation from its built-in IMU (Inertial Measurement Unit) sensor. By combining this translation and rotation data, we can track a target region with six degrees of freedom at relative scale. This information allows for the placement of virtual content on any system with a camera and IMU functionality, and is calibration free. For more details on Instant Motion Tracking, please refer to our paper.
A MediaPipe Pipeline for Instant Motion Tracking
A diagram of Instant Motion Tracking pipeline is shown below, consisting of four major components: a Sticker Manager module, a Region Tracking module, a Matrices Manager module, and lastly a Rendering System. Each of the components consists of MediaPipe calculators or subgraphs.
Diagram of Instant Motion Tracking Pipeline
The Sticker Manager accepts sticker data from the application and produces initial anchors (tracked region information) based on user taps, and user gesture controls for every sticker object. Initial anchors are then sent to our Region Tracking module to generate tracked anchors. The Matrices Manager combines this data with our device’s rotation matrix to produce six degrees-of-freedom poses as model matrices. After integrating any user-specified transforms like asset scaling, our final poses are forwarded to the Rendering System to render all virtual objects overlaid on the camera frame to produce the output AR frame.
Using the Instant Motion Tracking Solution
The Instant Motion Tracking solution is easy to use by leveraging the MediaPipe cross-platform framework. With camera frames, device rotation matrix, and anchor positions (screen coordinates) as input, the MediaPipe graph produces AR renderings for each frame, providing engaging experiences. If you wish to integrate this Instant Motion Tracking library with your system or application, please visit our documentation to build your own AR experiences on any device with IMU functionality and a camera sensor.
Augmenting The World with 3D Stickers and GIFs
Instant Motion Tracking solution allows bringing both 3D stickers and GIF animations into Augmented Reality experiences. GIFs are rendered on flat 3D billboards placed in the world, introducing fun and immersive experiences with animated content blended into the real environment.Try it for yourself!
Demonstration of GIF placement in 3D
MediaPipe Instant Motion Tracking is already helping PixelShift.AI, a startup applying cutting-edge vision technologies to facilitate video content creation, to track virtual characters seamlessly in the view-finder for a realistic experience. Building upon Instant Motion Tracking’s high-quality pose estimation, PixelShift.AI enables VTubers to create mixed reality experiences with web technologies. The product is going to be released to the broader VTuber community later this year.
We look forward to publishing more blog posts related to new MediaPipe pipeline examples and features. Please follow the MediaPipe label on Google Developers Blog and Google Developers twitter account (@googledevs).
We would like to thank Vikram Sharma, Jianing Wei, Tyler Mullen, Chuo-Ling Chang, Ming Guang Yong, Jiuqiang Tang, Siarhei Kazakou, Genzhi Ye, Camillo Lugaresi, Buck Bourdon, and Matthias Grundman for their contributions to this release.
Posted by Rajat Paharia, Product Lead, AR Platform
Since the launch of ARCore, our developer platform for building augmented reality (AR) experiences, we've been focused on providing APIs that help developers seamlessly blend the digital and physical worlds.
At the end of last year, we announced a preview of the ARCore Depth API, which uses our depth-from-motion algorithms to generate a depth map with a single RGB camera. Since then, we’ve been working with select collaborators to explore how depth can be used across a range of use cases to enhance AR realism.
Today, we're taking a major step forward and announcing the Depth API is available in ARCore 1.18 for Android and Unity, including AR Foundation, across hundreds of millions of compatible Android devices.
Generate a depth map without specialized hardware to unlock capabilities like occlusion
As we highlighted last year, a key capability of the Depth API is occlusion: the ability for digital objects to accurately appear behind real world objects. This makes objects feel as if they’re actually in your space, creating a more realistic AR experience.
Illumix, the game studio behind Five Nights at Freddy’s AR: Special Delivery, uses occlusion to deepen the realism of the experience by allowing certain characters to hide behind objects for more startling jump scares.
While occlusion is an important capability, the ARCore Depth API unlocks more ways to increase realism and enables new interaction types. The ARCore Depth Lab spurred more ideas on how depth can be used, including realistics physics, surface interactions, environmental traversal, and more. Developers can now build on these ideas through the open sourced GitHub project.
Snapchat Lens Creators can now download an ARCore Depth API template to create depth-based experiences for compatible Android devices. Sam Hare, Research Engineering Manager at Snap Inc, expressed his excitement, “We’re beginning to understand what kinds of depth capabilities are exciting for developers to build with. This single integration point streamlines and simplifies the development process and enables Lens Studio developers to easily take advantage of advanced depth capabilities.”
Another app that combines occlusion with other depth capabilities is Lines of Play, an Android experiment from the Google Creative Lab. Lines of Play lets users create domino art in AR, and uses depth information to showcase both occlusion and collisions. Design elaborate domino creations, topple them over and watch them collide with the furniture and walls in your room.
Watch as domino pieces topple into each other and onto your walls with Lines of Play
In addition to gaming and self-expression, depth can also be used to unlock new utility use cases. For example, the TeamViewer Pilot app, a remote assistance solution that enables AR annotations on video calls, uses depth to better understand the environment so experts around the world can more precisely apply real time 3D AR annotations for remote support and maintenance.
Later this year, you will be able to try more depth-enabled AR experiences such as SKATRIX by Reality Crisis and SPLASHAAR by ForwARdgames, that use surface interactions and environmental traversal as they make rich use of the environment around you.
Check out surface interactions and environmental traversal in SKATRIX, and SPLASHAAR
While depth sensors, such as time-of-flight (ToF) sensors, are not required for the Depth API to work, having them will further improve the quality of experiences. Dr. Soo Wan Kim, Camera Technical Product Manager at Samsung commented on the future that the Depth API and ToF unlocks saying, “Depth will enrich user's AR experience in many perspectives. It will reduce scanning time, and can detect planes fast, even low textured planes. These will bring seamless experiences to users who will be able to use AR apps more easily and frequently.” In the coming months, Samsung will update their Quick Measure app to use the ARCore Depth API on the Galaxy Note10+ and Galaxy S20 Ultra.
Posted by Eduard Trulls, Research Scientist, Google Maps
Reconstructing 3D objects and buildings from a series of images is a well-known problem in computer vision, known as Structure-from-Motion (SfM). It has diverse applications in photography and cultural heritage preservation (e.g., allowing people to explore the sculptures of Rapa Nui in a browser) and powers many services across Google Maps, such as the 3D models created from StreetView and aerial imagery. In these examples, images are usually captured by operators under controlled conditions. While this ensures homogeneous data with a uniform, high-quality appearance in the images and the final reconstruction, it also limits the diversity of sites captured and the viewpoints from which they are seen. What if, instead of using images from tightly controlled conditions, one could apply SfM techniques to better capture the richness of the world using the vast amounts of unstructured image collections freely available on the internet?
Recovering 3D Structure In the Wild Google Maps already uses images donated by users to inform visitors about popular locations or to update business hours. However, using this type of data to build 3D models is much more difficult, since donated photos have a wide variety of viewpoints, lighting and weather conditions, occlusions from people and vehicles, and the occasional user-applied filters. The examples below highlight the diversity of images for the Trevi Fountain in Rome.
Some example images sampled from the Image Matching Challenge dataset, showing different perspectives of the Trevi Fountain.
In general, the use of SfM to reconstruct 3D scenes starts by identifying which parts of the images capture the same physical points of a scene, the corners of a window, for instance. This is achieved using local features, i.e., salient locations in an image that can be reliably identified across different views. They contain short description vectors (model representations) that capture the appearance around the point of interest. By comparing these descriptors, one can establish likely correspondences between the pixel coordinates of image locations across two or more images, and recover the 3D location of the point by triangulation. Both the pose from where the images were captured as well as the 3D location of the physical points observed (for example, identifying where the corner of the window is relative to the camera location) can then be jointly estimated. Doing this over many images and points allows one to obtain very detailed reconstructions.
A 3D reconstruction generated from over 3000 images, including those from the previous figure.
The challenge for this approach is the risk of having incorrect correspondences due, for example, to repeated structure such as the windows of the building, that may be very similar to each other, or transient elements that do not persist across images, such as the crowds admiring the Trevi Fountain. One way to filter these out is by reasoning about relations between correspondences using multiple images. An additional, even more powerful approach is to design better methods for identifying and isolating local features, for instance, by ignoring points on transient elements such as people. But to better understand the shortcomings of existing local feature algorithms for SfM and to provide insight into promising directions for future research, it is necessary to have a reliable benchmark to measure performance.
A Benchmark for Evaluating Local Features for 3D Reconstruction Local features power many Google services, such as Image Search and product recognition in Google Lens, and are also used in mixed reality applications, like Google Maps' Live View, which relies on traditional, handcrafted local features. Designing better algorithms to identify and describe local features will lead to better performance overall.
Comparing the performance of local feature algorithms, however, has been difficult, because it is not obvious how to collect "ground-truth" data for this purpose. Some computer vision tasks rely on crowdsourcing: Google's OpenImages dataset labels "objects" with bounding boxes or pixel masks, by combining machine learning techniques with human annotators. This is not possible in this case, as it is not known what constitutes a "good" local feature a priori, making labelling infeasible. Additionally, existing benchmarks such as HPatches, are often small or limited to a narrow range of transformations, which can bias the evaluation.
What matters is the quality of the reconstruction, and that benchmarks reflect real-world scale and challenges in order to highlight opportunities for developing new approaches. To this end, we have created the Image Matching Benchmark, the first benchmark to include a large dataset of images for training and evaluation. The dataset includes more than 25k images (sourced from the public YFCC100m dataset), each of which has been augmented with accurate pose information (location and orientation). We obtain this "pseudo" ground-truth from large-scale SfM (100s-1000s of images, for each scene), which provides accurate and stable poses, and then run our evaluation on smaller subsets (10s of images), a much more difficult problem. This approach does not require expensive sensors or human labelling, and it provides better proxy metrics than previous benchmarks, which were restricted to small and homogenous datasets.
Visualizations from our benchmark. We show point-to-point matches generated by different local feature algorithms. Left to right: SIFT, HardNet, LogPolarDesc, R2D2. For details, please refer to our website.
We hope this benchmark, dataset and challenge helps advance the state of the art in 3D reconstruction with heterogeneous images. If you’re interested in participating in the challenge, please see the 2020 Image Matching Challenge website for more details.
Acknowledgements The benchmark is joint work by Yuhe Jin and Kwang Moo Yi (University of Victoria), Anastasiia Mishchuk and Pascal Fua (EPFL), Dmytro Mishkin and Jiří Matas (Czech Technical University), and Eduard Trulls (Google). The CVPR workshop is co-organized by Vassileios Balntas (Scape Technologies/Facebook), Vincent Lepetit (Ecole des Ponts ParisTech), Dmytro Mishkin and Jiří Matas (Czech Technical University), Johannes Schönberger (Microsoft), Eduard Trulls (Google), and Kwang Moo Yi (University of Victoria).
1 Please note that as of April 2, 2020, CVPR is currently on track, despite the COVID-19 pandemic. Challenge information will be updated as the situation develops. Please see the 2020 Image Matching Challenge website for details.↩
Posted by Shahram Izadi, Director of Research and Engineering
ARCore, our developer platform for building augmented reality (AR) experiences, allows your devices to display content immersively in the context of the world around us-- making them instantly accessible and useful. Earlier this year, we introduced Environmental HDR, which brings real world lighting to AR objects and scenes, enhancing immersion with more realistic reflections, shadows, and lighting. Today, we're opening a call for collaborators to try another tool that helps improve immersion with the new Depth API in ARCore, enabling experiences that are vastly more natural, interactive, and helpful. The ARCore Depth API allows developers to use our depth-from-motion algorithms to create a depth map using a single RGB camera. The depth map is created by taking multiple images from different angles and comparing them as you move your phone to estimate the distance to every pixel.
Example depth map, with red indicating areas that are close by, and blue representing areas that are farther away.
One important application for depth is occlusion: the ability for digital objects to accurately appear in front of or behind real world objects. Occlusion helps digital objects feel as if they are actually in your space by blending them with the scene. We will begin making occlusion available in Scene Viewer, the developer tool that powers AR in Search, to an initial set of over 200 million ARCore-enabled Android devices today.
A virtual cat with occlusion off and with occlusion on.
We’ve also been working with Houzz, a company that focuses on home renovation and design, to bring the Depth API to the “View in My Room” experience in their app. “Using the ARCore Depth API, people can see a more realistic preview of the products they’re about to buy, visualizing our 3D models right next to the existing furniture in a room,” says Sally Huang, Visual Technologies Lead at Houzz. “Doing this gives our users much more confidence in their purchasing decisions.”
The Houzz app with occlusion is available today.
In addition to enabling occlusion, having a 3D understanding of the world on your device unlocks a myriad of other possibilities. Our team has been exploring some of these, playing with realistic physics, path planning, surface interaction, and more.
Physics, path planning, and surface interaction examples.
When applications of the Depth API are combined together, you can also create experiences in which objects accurately bounce and splash across surfaces and textures, as well as new interactive game mechanics that enable players to duck and hide behind real-world objects.
A demo experience we created where you have to dodge and throw food at a robot chef.
The Depth API is not dependent on specialized cameras and sensors, and it will only get better as hardware improves. For example, the addition of depth sensors, like time-of-flight (ToF) sensors, to new devices will help create more detailed depth maps to improve existing capabilities like occlusion, and unlock new capabilities such as dynamic occlusion—the ability to occlude behind moving objects. We’ve only begun to scratch the surface of what’s possible with the Depth API and we want to see how you will innovate with this feature. If you are interested in trying the new Depth API, please fill out our call for collaborators form.
Posted by Christina Tong, Product Manager, Augmented Reality
Two years ago, we launched ARCore, our developer platform for building augmented reality (AR) experiences. Since then, we’ve seen developers create thousands of AR apps across Android and iOS that transform the way people play, shop, learn and create together. To enable even more shared cross-platform AR experiences, we’re announcing new updates to ARCore’s Augmented Faces and Cloud Anchors APIs.
Augmented Faces on iOS
Earlier this year, we announced our Augmented Faces API, which offers a high-quality, 468-point 3D mesh that lets users attach fun effects to their faces — all without a depth sensor on their smartphone. With the addition of iOS support rolling out today, developers can now create effects for more than a billion users. We’ve also made the creation process easier for both iOS and Android developers with a new face effects template.
Improvements to Cloud Anchors
Last year, we introduced the Cloud Anchors API, which lets developers create shared AR experiences across Android and iOS. Cloud Anchors let devices create a 3D feature map from visual data onto which anchors can be placed. The anchors are hosted in the cloud so multiple people can use them to enable shared real world experiences. Cloud Anchors power a wide variety of cross-platform apps, like Just a Line, PHAROS AR and Spacecraft AR.
In our latest ARCore update, we’ve made some improvements to the Cloud Anchors API that make hosting and resolving anchors more efficient and robust. This is due to improved anchor creation and visual processing in the cloud. Now, when creating an anchor, more angles across larger areas in the scene can be captured for a more robust 3D feature map. Once the map is created, the visual data used to create the map is deleted and only anchor IDs are shared with other devices to be resolved. Moreover, multiple anchors in the scene can now be resolved simultaneously, reducing the time needed to start a shared AR experience.
These updates to Cloud Anchors are available for developers today.
Persistent Cloud Anchors and Call for Collaborators
As we look to the future, we’re taking steps to expand the scale and timeline of shared AR experiences with persistent Cloud Anchors. We see this as enabling a “save button” for AR, so that digital information overlaid on top of the real world can be experienced at anytime.
Imagine working together on a redesign of your home throughout the year, leaving AR notes for your friends around an amusement park, or hiding AR objects at specific places around the world to be discovered by others.
Persistent Cloud Anchors are powering Mark AR, a social app being developed by Sybo and iDreamSky that lets people create, discover, and share their AR art with friends and followers in real-world locations. With persistent Cloud Anchors, users can continuously return back to their pieces as they create and collaborate over time.
Mark AR is an app that lets people create and discover AR art in real-world locations.
Reliably anchoring AR content for every use case—regardless of surface, distance, and time—pushes the limits of computation and computer vision because the real world is diverse and always changing. By enabling a “save button” for AR, we’re taking an important step toward bridging the digital and physical worlds to expand the ways AR can be useful in our day-to-day lives.
We’re currently looking for more developers to help us explore and test persistent Cloud Anchors in real world apps at scale, before making the feature broadly available. If you’re interested in early access, you can apply here.
Posted by Rajan Patel, Director, Augmented Reality
Around the world, millions of people are coming online for the first time, and many of them are among the 800 million adults worldwide who are unable to read or write, or those who are migrating to towns and cities where they are not able to speak the predominant language. As a smartphone camera-based tool, Google Lens has great potential for helping people who struggle with reading and other language-based challenges. Lens uses computer vision, machine learning and Google’s Knowledge Graph to let people turn the things they see in the real world into a visual search box, enabling them to identify objects like plants and animals, or to copy and paste text from the real world into their phone.
However, in order for Lens to be able to help the greatest number of people, we needed to create a special version that can work on even the most basic smartphones. So at I/O 2019, we announced a new version of Lens designed specifically for use in Google Go—our Search app for entry level devices—and we included a new set of features designed to help people who face reading and other language-based challenges. When users point their camera at text they don’t understand, Lens in Google Go can translate and read it out loud. It even highlights each word as it’s being read so users can follow along. If you want to try out these features for yourself, they are available today via Lens in Google Go. While Google Go was initially available only on Android Go devices and on the Google Play Store in select markets, recently, we made it available globally in the Google Play Store.
To make these reading features work, the Google Go version of Lens needs to be able to capture high quality images on a wide variety of devices, then identify the text, understand its structure, translate and overlay it in context, and finally, read it out loud.
Image Capture Image capture on entry-level devices, like those that run Android Go, is tricky since it must work on a wide variety of devices, many of which are more resource constrained than flagship phones. To build a universal tool that can reliably capture high-quality images with minimal lag, we made Lens in Google Go an early adopter of a new Android support library called CameraX. Available in Jetpack—a suite of libraries, tools, and guidance for Android developers—CameraX is an abstraction layer over the Android Camera2 API that resolves device compatibility issues so developers don't have to write their own device-specific code.
Using CameraX, we implemented two capture strategies to balance capture latency against performance impact. On higher-end phones, which are powerful enough to provide a constant stream of high-resolution frames from which to select an image, we’ve made capture instantaneous. On less advanced devices, streaming these frames could cause camera lag since the CPU is less powerful, so we process the frame when the user taps capture to produce a single, on-demand high-resolution image.
Text Recognition After Lens in Google Go captures an image, it needs to make sense of the shapes and letters that constitute the words, sentences and paragraphs. To do this, the image is scaled down and transferred to the Lens server, where the processing will be performed. Next, optical character recognition (OCR) is applied, which utilizes a region proposal network to detect character level bounding boxes that can be merged into lines for text recognition.
Merging these character boxes into words is a two-step, sequential process. The first step is to apply the Hough Transform, which assumes the text is distributed across parallel lines. The second step uses Text Flow, which instead traces text that may follow a curve by finding the shortest path through a graph of detected text boxes. This ensures that text with a variety of distributions, be they straight, curved or mixed, can be identified and processed.
Because the images captured by Lens in Google Go may include sources such as signage, handwriting or documents, a slew of additional challenges can arise. For example, the text can be obscured, scripts can be uniquely stylized, and images can be blurry. All of these issues can cause the OCR engine to misunderstand various characters within each word. To correct mistakes and improve word accuracy, Lens in Google Go uses the context of surrounding words to make corrections. It also utilizes the Knowledge Graph to provide contextual clues, such as whether a word is likely a proper noun and should not be spell-corrected.
Left: Image with bounding box around recognized text. The raw OCR output from this image reads, “Cise is beauti640”. Right: By applying Knowledge Graph in addition to context from nearby words, Lens in Google Go recognizes the words, “life is beautiful”.
Understanding Structure Once the individual words have been recognized, Lens must determine how to fit them together. The text that people come across in the real world is laid out in many different ways. A newspaper, for example, is laid out into columns, with headlines, article text, and advertisements. Meanwhile, a bus schedule, has one column for destinations and another with times. While understanding text structure comes very naturally to people, computers need to be taught how to comprehend it. Lens uses CNNs to detect coherent text blocks like columns, or text in a consistent style or color. And then, within each block, it uses signals like text-alignment, language, and the geometric relationship of the paragraphs to determine their final reading order.
One of the other challenges in detecting document structure is that people take pictures of text from different angles, often with a warped perspective. This means we cannot revert to off-the-shelf detectors that rely on axis aligned boxes, but must generalize our systems to be able to deal with homographic distortions.
Paragraph segmentation on the front page of a newspaper. Notice how “News Analysis”, which is embedded in the middle of a column, has been identified separately due to its distinct style features.
Translations in Context To provide users with the most helpful information, translations must be both accurate and contextual. Lens uses Google Translate’s neural machine translation (NMT) algorithms, to translate entire sentences at a time, rather than going word-by-word, in order to preserve proper grammar and diction.
For the translation to be most useful, it needs to be placed in the context of the original text. For example, when translating instructions on an ATM, it is important to know which buttons correspond to which instructions. Part of the challenge is accounting for the fact that the translated text can be much shorter or longer than the original. For example, German sentences tend to be longer than English ones. To accomplish this seamless overlay, Lens redistributes the translation into lines of similar length, and chooses an appropriate font size to match. It also matches the color of the translation and its background with the original text through the use of a heuristic that assumes the background and the text differ in luminosity, and that the background takes up the majority of the space. This allows Lens to classify whether a pixel represents background or text, and then sample the average color from these two regions to ensure the translated text matches the original text.
Reading the Text Out Loud The final challenge in delivering information in the most helpful way with Lens in Google Go is reading the text aloud. High-fidelity audio is generated using Google Text-to-Speech (TTS), a service that applies machine learning to disambiguate and detected entities such as dates, phone numbers and addresses, and uses that to generate realistic speech based on DeepMind’s WaveNet.
These reading features become more contextual and useful when they are paired with display. Lens utilizes timing annotations from the TTS service that mark the beginning of each word in order to highlight each word on screen as it’s being read, similar to a karaoke machine. Say for example, a user takes a picture of an ATM screen with different labels next to different buttons. This karaoke effect allows users to know which label applies to which button. It may also help users learn how to pronounce the words being translated. Looking Ahead Taken together, it is our hope that these features will have a positive impact on the day-to-day lives of millions of people. Moving forward, we will continue to work on further updates to these reading features to make the OCR more precise, including improvements to text structure understanding (e.g. multi-column text) and recognition of Indic scripts. As we address these text challenges, we continue to look for new ways that the combination of machine learning and the smartphone camera can help people as they go about their lives.
One of the key challenges in making these AR features possible is proper anchoring of the virtual content to the real world; a process that requires a unique set of perceptive technologies able to track the highly dynamic surface geometry across every smile, frown or smirk.
Our 3D mesh and some of the effects it enables
To make all this possible, we employ machine learning (ML) to infer approximate 3D surface geometry to enable visual effects, requiring only a single camera input without the need for a dedicated depth sensor. This approach provides the use of AR effects at realtime speeds, using TensorFlow Lite for mobile CPU inference or its new mobile GPU functionality where available. This technology is the same as what powers YouTube Stories' new creator effects, and is also available to the broader developer community via the latest ARCore SDK release and the ML Kit Face Contour Detection API.
An ML Pipeline for Selfie AR Our ML pipeline consists of two real-time deep neural network models that work together: A detector that operates on the full image and computes face locations, and a generic 3D mesh model that operates on those locations and predicts the approximate surface geometry via regression. Having the face accurately cropped drastically reduces the need for common data augmentations like affine transformations consisting of rotations, translation and scale changes. Instead it allows the network to dedicate most of its capacity towards coordinate prediction accuracy, which is critical to achieve proper anchoring of the virtual content.
Once the location of interest is cropped, the mesh network is only applied to a single frame at a time, using a windowed smoothing in order to reduce noise when the face is static while avoiding lagging during significant movement.
Our 3D mesh in action
For our 3D mesh we employed transfer learning and trained a network with several objectives: the network simultaneously predicts 3D mesh coordinates on synthetic, rendered data and 2D semantic contours on annotated, real world data similar to those MLKit provides. The resulting network provided us with reasonable 3D mesh predictions not just on synthetic but also on real world data. All models are trained on data sourced from a geographically diverse dataset and subsequently tested on a balanced, diverse testset for qualitative and quantitative performance.
The 3D mesh network receives as input a cropped video frame. It doesn't rely on additional depth input, so it can also be applied to pre-recorded videos. The model outputs the positions of the 3D points, as well as the probability of a face being present and reasonably aligned in the input. A common alternative approach is to predict a 2D heatmap for each landmark, but it is not amenable to depth prediction and has high computational costs for so many points.
We further improve the accuracy and robustness of our model by iteratively bootstrapping and refining predictions. That way we can grow our dataset to increasingly challenging cases, such as grimaces, oblique angle and occlusions. Dataset augmentation techniques also expanded the available ground truth data, developing model resilience to artifacts like camera imperfections or extreme lighting conditions.
Dataset expansion and improvement pipeline
Hardware-tailored Inference We use TensorFlow Lite for on-device neural network inference. The newly introduced GPU back-end acceleration boosts performance where available, and significantly lowers the power consumption. Furthermore, to cover a wide range of consumer hardware, we designed a variety of model architectures with different performance and efficiency characteristics. The most important differences of the lighter networks are the residual block layout and the accepted input resolution (128x128 pixels in the lightest model vs. 256x256 in the most complex). We also vary the number of layers and the subsampling rate (how fast the input resolution decreases with network depth).
Inference time per frame: CPU vs. GPU
The result of these optimizations is a substantial speedup from using lighter models, with minimal degradation in AR effect quality.
Comparison of the most complex (left) and the lightest models (right). Temporal consistency as well as lip and eye tracking is slightly degraded on light models.
The end result of these efforts empowers a user experience with convincing, realistic selfie AR effects in YouTube, ARCore, and other clients by:
Simulating light reflections via environmental mapping for realistic rendering of glasses
Natural lighting by casting virtual object shadows onto the face mesh
Modelling face occlusions to hide virtual object parts behind a face, e.g. virtual glasses, as shown below.
YouTube Stories includes Creator Effects like realistic virtual glasses, based on our 3D mesh
In addition, we achieve highly realistic makeup effects by:
Modelling Specular reflections applied on lips and
Face painting by using luminance-aware material
Case study comparing real make-up against our AR make-up on 5 subjects under different lighting conditions.
We are excited to share this new technology with creators, users and developers alike, who can use this new technology immediately by downloading the latest ARCore SDK. In the future we plan to broaden this technology to more Google products.
Acknowledgements We would like to thank Yury Kartynnik, Valentin Bazarevsky, Andrey Vakunov, Siargey Pisarchyk, Andrei Tkachenka, and Matthias Grundmann for collaboration on developing the current mesh technology; Nick Dufour, Avneesh Sud and Chris Bregler for an earlier version of the technology based on parametric models; Kanstantsin Sokal, Matsvei Zhdanovich, Gregory Karpiak, Alexander Kanaukou, Suril Shah, Buck Bourdon, Camillo Lugaresi, Siarhei Kazakou and Igor Kibalchich for building the ML pipeline to drive impressive effects; Aleksandra Volf and the annotation team for their diligence and dedication to perfection; Andrei Kulik, Juhyun Lee, Raman Sarokin, Ekaterina Ignasheva, Nikolay Chirkov, and Yury Pisarchyk for careful benchmarking and insights on mobile GPU-centric network architecture optimizations.