Tag Archives: augmented reality

MediaPipe 3D Face Transform

Posted by Kanstantsin Sokal, Software Engineer, MediaPipe team

Earlier this year, the MediaPipe Team released the Face Mesh solution, which estimates the approximate 3D face shape via 468 landmarks in real-time on mobile devices. In this blog, we introduce a new face transform estimation module that establishes a researcher- and developer-friendly semantic API useful for determining the 3D face pose and attaching virtual objects (like glasses, hats or masks) to a face.

The new module establishes a metric 3D space and uses the landmark screen positions to estimate common 3D face primitives, including a face pose transformation matrix and a triangular face mesh. Under the hood, a lightweight statistical analysis method called Procrustes Analysis is employed to drive a robust, performant and portable logic. The analysis runs on CPU and has a minimal speed/memory footprint on top of the original Face Mesh solution.

Figure 1: An example of virtual mask and glasses effects, based on the MediaPipe Face Mesh solution.

Introduction

The MediaPipe Face Landmark Model performs a single-camera face landmark detection in the screen coordinate space: the X- and Y- coordinates are normalized screen coordinates, while the Z coordinate is relative and is scaled as the X coordinate under the weak perspective projection camera model. While this format is well-suited for some applications, it does not directly enable crucial features like aligning a virtual 3D object with a detected face.
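To make the coordinate convention concrete, here is a minimal sketch that rescales one normalized landmark to pixel units. The `Landmark` type and the `to_pixel_units` function are hypothetical names introduced for illustration and are not part of the MediaPipe API. The only rule taken from the text is that Z shares the scale of X.

```python
from typing import NamedTuple, Tuple


class Landmark(NamedTuple):
    """Hypothetical container for one screen-space landmark."""
    x: float  # normalized by image width
    y: float  # normalized by image height
    z: float  # relative depth, on the same scale as x


def to_pixel_units(lm: Landmark, width: int, height: int) -> Tuple[float, float, float]:
    """Rescale a normalized landmark to pixel units.

    X and Z are both multiplied by the image width, because Z is
    scaled like X under the weak perspective camera model.
    """
    return (lm.x * width, lm.y * height, lm.z * width)
```

For a 640x480 frame, a landmark at the horizontal center has an X value of 320 pixels, and its Z value is expressed in the same pixel scale.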

The newly introduced module moves away from the screen coordinate space towards a metric 3D space and provides the necessary primitives to handle a detected face as a regular 3D object. By design, you'll be able to use a perspective camera to project the final 3D scene back into the screen coordinate space with a guarantee that the face landmark positions are not changed.

Metric 3D Space

The Metric 3D space established within the new module is a right-handed orthonormal metric 3D coordinate space. Within the space, there is a virtual perspective camera located at the space origin and pointed in the negative direction of the Z-axis. The input camera frames are assumed to be observed by exactly this virtual camera, so its parameters are later used to convert the screen landmark coordinates back into the Metric 3D space. The virtual camera parameters can be set freely; however, for better results they should be set as close to the real physical camera parameters as possible.
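The camera setup can be illustrated with a small sketch of perspective projection for a camera at the origin looking down the negative Z-axis. This is a generic pinhole projection, not the module's own code. The function name, the vertical field-of-view parameterization, and the choice of a top-left screen origin with Y pointing down are assumptions made for the example.

```python
import math
from typing import Tuple


def project_to_screen(point: Tuple[float, float, float],
                      fov_y_deg: float,
                      aspect: float) -> Tuple[float, float]:
    """Project a metric 3D point to normalized screen coordinates in [0, 1].

    The camera sits at the origin and looks along -Z, so visible
    points have a negative z value.
    """
    x, y, z = point
    if z >= 0:
        raise ValueError("point must be in front of the camera (z < 0)")
    # Focal length in normalized units, derived from the vertical field of view.
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    ndc_x = f * x / (-z) / aspect
    ndc_y = f * y / (-z)
    # Map [-1, 1] to [0, 1]; flip Y because screen Y is assumed to point down.
    return ((ndc_x + 1.0) / 2.0, (1.0 - ndc_y) / 2.0)
```

A point on the optical axis, such as `(0, 0, -50)`, lands at the screen center regardless of the field of view.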

Figure 2: A visualization of multiple key elements in the metric 3D space. Created in Cinema 4D

Canonical Face Model

The Canonical Face Model is a static 3D model of a human face, which follows the 3D face landmark topology of the MediaPipe Face Landmark Model. The model bears two important functions:

  • Defines metric units: the scale of the canonical face model defines the metric units of the Metric 3D space. A metric unit used by the default canonical face model is a centimeter;
  • Bridges static and runtime spaces: the face pose transformation matrix is - in fact - a linear map from the canonical face model into the runtime face landmark set estimated on each frame. This way, virtual 3D assets modeled around the canonical face model can be aligned with a tracked face by applying the face pose transformation matrix to them.
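As a rough illustration of the second point, the sketch below applies a 4x4 pose matrix to the vertices of a virtual asset. The row-major nested-list layout and the function names are assumptions for the example; the actual matrix layout used by the module should be checked in the MediaPipe documentation.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def transform_point(m: Sequence[Sequence[float]], p: Vec3) -> Vec3:
    """Apply a 4x4 row-major homogeneous transform to a 3D point."""
    x, y, z = p
    return (
        m[0][0] * x + m[0][1] * y + m[0][2] * z + m[0][3],
        m[1][0] * x + m[1][1] * y + m[1][2] * z + m[1][3],
        m[2][0] * x + m[2][1] * y + m[2][2] * z + m[2][3],
    )


def align_asset(pose: Sequence[Sequence[float]], vertices: List[Vec3]) -> List[Vec3]:
    """Move asset vertices modeled around the canonical face onto the tracked face."""
    return [transform_point(pose, v) for v in vertices]
```

With a pose that rotates 90 degrees about the Y-axis and translates 50 units along -Z, the vertex `(1, 2, 3)` maps to `(3, 2, -51)`.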

Face Transform Estimation

The face transform estimation pipeline is a key component, responsible for estimating face transform data within the Metric 3D space. On each frame, the following steps are executed in the given order:

  • Face landmark screen coordinates are converted into the Metric 3D space coordinates;
  • Face pose transformation matrix is estimated as a rigid linear mapping from the canonical face metric landmark set into the runtime face metric landmark set in a way that minimizes a difference between the two;
  • A face mesh is created using the runtime face metric landmarks as the vertex positions (XYZ), while both the vertex texture coordinates (UV) and the triangular topology are inherited from the canonical face model.

Effect Renderer

The Effect Renderer is a component that serves as a working example of a face effect renderer. It targets the OpenGL ES 2.0 API to enable real-time performance on mobile devices and supports the following rendering modes:

  • 3D object rendering mode: a virtual object is aligned with a detected face to emulate an object attached to the face (example: glasses);
  • Face mesh rendering mode: a texture is stretched on top of the face mesh surface to emulate a face painting technique.

In both rendering modes, the face mesh is first rendered as an occluder straight into the depth buffer. This step helps to create a more believable effect via hiding invisible elements behind the face surface.

Figure 3: An example of face effects rendered by the Face Effect Renderer.

Using Face Transform Module

The face transform estimation module is available as a part of the MediaPipe Face Mesh solution. It comes with face effect application examples, available as graphs and mobile apps on Android or iOS. If you wish to go beyond the examples, the module contains generic calculators and subgraphs that can be flexibly applied to solve specific use cases in any MediaPipe graph. For more information, please visit our documentation.

Follow MediaPipe

We look forward to publishing more blog posts related to new MediaPipe pipeline examples and features. Please follow the MediaPipe label on the Google Developers Blog and the Google Developers Twitter account (@googledevs).

Acknowledgements

We would like to thank Chuo-Ling Chang, Ming Guang Yong, Jiuqiang Tang, Gregory Karpiak, Siarhei Kazakou, Matsvei Zhdanovich and Matthias Grundmann for contributing to this blog post.

Instant Motion Tracking with MediaPipe

Posted by Vikram Sharma, Software Engineering Intern; Jianing Wei, Staff Software Engineer; Tyler Mullen, Senior Software Engineer

Augmented Reality (AR) technology creates fun, engaging, and immersive user experiences. The ability to perform AR tracking across devices and platforms, without initialization, remains important for powering AR applications at scale.

Today, we are excited to release the Instant Motion Tracking solution in MediaPipe. It is built upon the MediaPipe Box Tracking solution we released previously. With Instant Motion Tracking, you can easily place fun virtual 2D and 3D content on static or moving surfaces, allowing them to seamlessly interact with the real world. This technology also powered MotionStills AR. Along with the library, we are releasing an open source Android application to showcase its capabilities. In this application, a user simply taps the camera viewfinder in order to place virtual 3D objects and GIF animations, augmenting the real-world environment.

Instant Motion Tracking in MediaPipe

Instant Motion Tracking

The Instant Motion Tracking solution provides the capability to seamlessly place virtual content on static or moving surfaces in the real world. To achieve that, we provide six-degrees-of-freedom tracking with relative scale in the form of rotation and translation matrices. This tracking information is then used in the rendering system to overlay virtual content on camera streams to create immersive AR experiences.

The core concept behind Instant Motion Tracking is to decouple the camera’s translation and rotation estimation, treating them instead as independent optimization problems. This approach enables AR tracking across devices and platforms without initialization or calibration. We do this by first finding the 3D camera translation using only the visual signals from the camera. This involves estimating the target region's apparent 2D translation and relative scale across frames. The process can be illustrated with a simple pinhole camera model, relating translation and scale of an object in the image plane to the final 3D translation.

By finding the change in relative size of our tracked region from view position V1 to V2, we can estimate the relative change in distance from the camera.
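Under a pinhole camera model, the apparent size of a region is inversely proportional to its distance from the camera, which gives a one-line update rule. The sketch below only illustrates that relationship; the function name and the use of a single scalar "size" (for example, the tracked box width in pixels) are assumptions, and the real pipeline is described in the paper.

```python
def relative_depth(prev_depth: float, prev_size: float, new_size: float) -> float:
    """Estimate the new distance to a tracked region from its change in apparent size.

    Pinhole model: size is proportional to 1 / depth, so
    new_depth = prev_depth * prev_size / new_size.
    Depth is in relative units, because the true object size is unknown.
    """
    if prev_size <= 0 or new_size <= 0:
        raise ValueError("apparent sizes must be positive")
    return prev_depth * prev_size / new_size
```

If a region that was 100 pixels wide at relative depth 1.0 shrinks to 50 pixels, the estimate doubles to 2.0.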

Next, we obtain the device’s 3D rotation from its built-in IMU (Inertial Measurement Unit) sensor. By combining this translation and rotation data, we can track a target region with six degrees of freedom at relative scale. This information allows for the placement of virtual content on any system with a camera and IMU functionality, and is calibration free. For more details on Instant Motion Tracking, please refer to our paper.

A MediaPipe Pipeline for Instant Motion Tracking

A diagram of the Instant Motion Tracking pipeline is shown below. It consists of four major components: a Sticker Manager module, a Region Tracking module, a Matrices Manager module, and a Rendering System. Each component consists of MediaPipe calculators or subgraphs.

Diagram of Instant Motion Tracking Pipeline

The Sticker Manager accepts sticker data from the application and produces initial anchors (tracked region information) based on user taps, and user gesture controls for every sticker object. Initial anchors are then sent to our Region Tracking module to generate tracked anchors. The Matrices Manager combines this data with our device’s rotation matrix to produce six degrees-of-freedom poses as model matrices. After integrating any user-specified transforms like asset scaling, our final poses are forwarded to the Rendering System to render all virtual objects overlaid on the camera frame to produce the output AR frame.
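The pose-composition step can be sketched as building a 4x4 model matrix from a rotation, a translation, and a user scale. This is a generic construction under assumed conventions (row-major layout, uniform scale applied before rotation); `compose_model_matrix` is a hypothetical helper and not a MediaPipe calculator.

```python
from typing import List, Sequence


def compose_model_matrix(rotation: Sequence[Sequence[float]],
                         translation: Sequence[float],
                         scale: float = 1.0) -> List[List[float]]:
    """Build a row-major 4x4 model matrix equal to T * R * S.

    rotation:    3x3 rotation matrix (for example, from the device IMU)
    translation: 3-vector (for example, from region tracking)
    scale:       uniform user-specified asset scale
    """
    matrix = [
        [rotation[r][c] * scale for c in range(3)] + [translation[r]]
        for r in range(3)
    ]
    matrix.append([0.0, 0.0, 0.0, 1.0])
    return matrix
```

With an identity rotation, a translation of `(1, 2, -3)` and a scale of 2, the result has 2 on the first three diagonal entries and the translation in the last column.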

Using the Instant Motion Tracking Solution

The Instant Motion Tracking solution is easy to use because it builds on the MediaPipe cross-platform framework. With camera frames, a device rotation matrix, and anchor positions (screen coordinates) as input, the MediaPipe graph produces AR renderings for each frame. If you wish to integrate this Instant Motion Tracking library with your system or application, please visit our documentation to build your own AR experiences on any device with IMU functionality and a camera sensor.

Augmenting The World with 3D Stickers and GIFs

The Instant Motion Tracking solution brings both 3D stickers and GIF animations into Augmented Reality experiences. GIFs are rendered on flat 3D billboards placed in the world, introducing fun and immersive experiences with animated content blended into the real environment. Try it for yourself!

Demonstration of GIF placement in 3D

MediaPipe Instant Motion Tracking is already helping PixelShift.AI, a startup applying cutting-edge vision technologies to facilitate video content creation, to track virtual characters seamlessly in the view-finder for a realistic experience. Building upon Instant Motion Tracking’s high-quality pose estimation, PixelShift.AI enables VTubers to create mixed reality experiences with web technologies. The product is going to be released to the broader VTuber community later this year.

Instant Motion Tracking helps PixelShift.AI create mixed reality experiences

Follow MediaPipe

We look forward to publishing more blog posts related to new MediaPipe pipeline examples and features. Please follow the MediaPipe label on the Google Developers Blog and the Google Developers Twitter account (@googledevs).

Acknowledgement

We would like to thank Vikram Sharma, Jianing Wei, Tyler Mullen, Chuo-Ling Chang, Ming Guang Yong, Jiuqiang Tang, Siarhei Kazakou, Genzhi Ye, Camillo Lugaresi, Buck Bourdon, and Matthias Grundmann for their contributions to this release.

A new wave of AR Realism with the ARCore Depth API

Posted by Rajat Paharia, Product Lead, AR Platform

Since the launch of ARCore, our developer platform for building augmented reality (AR) experiences, we've been focused on providing APIs that help developers seamlessly blend the digital and physical worlds.

At the end of last year, we announced a preview of the ARCore Depth API, which uses our depth-from-motion algorithms to generate a depth map with a single RGB camera. Since then, we’ve been working with select collaborators to explore how depth can be used across a range of use cases to enhance AR realism.

Today, we're taking a major step forward and announcing the Depth API is available in ARCore 1.18 for Android and Unity, including AR Foundation, across hundreds of millions of compatible Android devices.

Generate a depth map without specialized hardware to unlock capabilities like occlusion

As we highlighted last year, a key capability of the Depth API is occlusion: the ability for digital objects to accurately appear behind real world objects. This makes objects feel as if they’re actually in your space, creating a more realistic AR experience.

Illumix, the game studio behind Five Nights at Freddy’s AR: Special Delivery, uses occlusion to deepen the realism of the experience by allowing certain characters to hide behind objects for more startling jump scares.

Play Five Nights at Freddy’s AR: Special Delivery

While occlusion is an important capability, the ARCore Depth API unlocks more ways to increase realism and enables new interaction types. The ARCore Depth Lab spurred more ideas on how depth can be used, including realistic physics, surface interactions, environmental traversal, and more. Developers can now build on these ideas through the open-source GitHub project.

Experiment with ARCore Depth Lab on the Google Play Store

The designers and engineers at Snap Inc. integrated several of these ideas into a set of Snapchat Lenses including the Dancing Hotdog and a new Android exclusive Undersea World Lens.

See how depth can add a layer of realism to your Snapchat experience

Snapchat Lens Creators can now download an ARCore Depth API template to create depth-based experiences for compatible Android devices. Sam Hare, Research Engineering Manager at Snap Inc, expressed his excitement, “We’re beginning to understand what kinds of depth capabilities are exciting for developers to build with. This single integration point streamlines and simplifies the development process and enables Lens Studio developers to easily take advantage of advanced depth capabilities.”

Another app that combines occlusion with other depth capabilities is Lines of Play, an Android experiment from the Google Creative Lab. Lines of Play lets users create domino art in AR, and uses depth information to showcase both occlusion and collisions. Design elaborate domino creations, topple them over and watch them collide with the furniture and walls in your room.

Watch as domino pieces topple into each other and onto your walls with Lines of Play

In addition to gaming and self-expression, depth can also be used to unlock new utility use cases. For example, the TeamViewer Pilot app, a remote assistance solution that enables AR annotations on video calls, uses depth to better understand the environment so experts around the world can more precisely apply real time 3D AR annotations for remote support and maintenance.

3D annotations help experts accurately highlight details in the TeamViewer Pilot app

Later this year, you will be able to try more depth-enabled AR experiences such as SKATRIX by Reality Crisis and SPLASHAAR by ForwARdgames, which use surface interactions and environmental traversal to make rich use of the environment around you.

Check out surface interactions and environmental traversal in SKATRIX and SPLASHAAR

While depth sensors, such as time-of-flight (ToF) sensors, are not required for the Depth API to work, having them will further improve the quality of experiences. Dr. Soo Wan Kim, Camera Technical Product Manager at Samsung, commented on the future that the Depth API and ToF unlock, saying, “Depth will enrich user's AR experience in many perspectives. It will reduce scanning time, and can detect planes fast, even low textured planes. These will bring seamless experiences to users who will be able to use AR apps more easily and frequently.” In the coming months, Samsung will update their Quick Measure app to use the ARCore Depth API on the Galaxy Note10+ and Galaxy S20 Ultra.

Accurately measure with Quick Measure

To learn more and get started with the ARCore Depth API, get the SDK and visit the ARCore developer website.

Announcing the 2020 Image Matching Benchmark and Challenge



Reconstructing 3D objects and buildings from a series of images is a well-known problem in computer vision, known as Structure-from-Motion (SfM). It has diverse applications in photography and cultural heritage preservation (e.g., allowing people to explore the sculptures of Rapa Nui in a browser) and powers many services across Google Maps, such as the 3D models created from StreetView and aerial imagery. In these examples, images are usually captured by operators under controlled conditions. While this ensures homogeneous data with a uniform, high-quality appearance in the images and the final reconstruction, it also limits the diversity of sites captured and the viewpoints from which they are seen. What if, instead of using images from tightly controlled conditions, one could apply SfM techniques to better capture the richness of the world using the vast amounts of unstructured image collections freely available on the internet?

In order to accelerate research into this topic, and into how to better leverage the volume of data already publicly available, we present “Image Matching across Wide Baselines: From Paper to Practice”, a collaboration with UVIC, CTU and EPFL that introduces a new public benchmark to evaluate methods for 3D reconstruction. Following on the results of the first Image Matching: Local Features and Beyond workshop held at CVPR 2019, this project now includes more than 25k images, each of which includes accurate pose information (location and orientation). This data is publicly available, along with the open-sourced benchmark, and is the foundation of the 2020 Image Matching Challenge to be held at CVPR 2020 [1].

Recovering 3D Structure In the Wild
Google Maps already uses images donated by users to inform visitors about popular locations or to update business hours. However, using this type of data to build 3D models is much more difficult, since donated photos have a wide variety of viewpoints, lighting and weather conditions, occlusions from people and vehicles, and the occasional user-applied filters. The examples below highlight the diversity of images for the Trevi Fountain in Rome.
Some example images sampled from the Image Matching Challenge dataset, showing different perspectives of the Trevi Fountain.
In general, the use of SfM to reconstruct 3D scenes starts by identifying which parts of the images capture the same physical points of a scene, the corners of a window, for instance. This is achieved using local features, i.e., salient locations in an image that can be reliably identified across different views. They contain short description vectors (model representations) that capture the appearance around the point of interest. By comparing these descriptors, one can establish likely correspondences between the pixel coordinates of image locations across two or more images, and recover the 3D location of the point by triangulation. Both the pose from where the images were captured as well as the 3D location of the physical points observed (for example, identifying where the corner of the window is relative to the camera location) can then be jointly estimated. Doing this over many images and points allows one to obtain very detailed reconstructions.
A 3D reconstruction generated from over 3000 images, including those from the previous figure.
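The correspondence step described above can be sketched with a simple baseline: mutual nearest-neighbor matching of descriptor vectors. This is one common strategy chosen here for illustration, not necessarily what any method in the benchmark uses, and the function names are hypothetical. Real descriptors typically have many more dimensions than the toy 2D vectors in the usage example.

```python
from typing import List, Sequence, Tuple

Descriptor = Sequence[float]


def nearest_index(query: Descriptor, candidates: List[Descriptor]) -> int:
    """Return the index of the candidate closest to query (squared L2), or -1 if empty."""
    best_index = -1
    best_dist = float("inf")
    for i, cand in enumerate(candidates):
        dist = sum((a - b) ** 2 for a, b in zip(query, cand))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def mutual_matches(descs_a: List[Descriptor],
                   descs_b: List[Descriptor]) -> List[Tuple[int, int]]:
    """Keep pairs (i, j) where i and j are each other's nearest neighbor."""
    matches = []
    for i, desc in enumerate(descs_a):
        j = nearest_index(desc, descs_b)
        if j >= 0 and nearest_index(descs_b[j], descs_a) == i:
            matches.append((i, j))
    return matches
```

For example, `mutual_matches([(0, 0), (10, 10), (5, 0)], [(10, 9), (0, 1)])` keeps the pairs `(0, 1)` and `(1, 0)` and drops the third descriptor, whose best candidate prefers a different partner.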
The challenge for this approach is the risk of having incorrect correspondences due, for example, to repeated structure such as the windows of the building, that may be very similar to each other, or transient elements that do not persist across images, such as the crowds admiring the Trevi Fountain. One way to filter these out is by reasoning about relations between correspondences using multiple images. An additional, even more powerful approach is to design better methods for identifying and isolating local features, for instance, by ignoring points on transient elements such as people. But to better understand the shortcomings of existing local feature algorithms for SfM and to provide insight into promising directions for future research, it is necessary to have a reliable benchmark to measure performance.

A Benchmark for Evaluating Local Features for 3D Reconstruction
Local features power many Google services, such as Image Search and product recognition in Google Lens, and are also used in mixed reality applications, like Google Maps' Live View, which relies on traditional, handcrafted local features. Designing better algorithms to identify and describe local features will lead to better performance overall.

Comparing the performance of local feature algorithms, however, has been difficult, because it is not obvious how to collect "ground-truth" data for this purpose. Some computer vision tasks rely on crowdsourcing: Google's OpenImages dataset labels "objects" with bounding boxes or pixel masks by combining machine learning techniques with human annotators. This is not possible in this case, as it is not known a priori what constitutes a "good" local feature, making labelling infeasible. Additionally, existing benchmarks such as HPatches are often small or limited to a narrow range of transformations, which can bias the evaluation.

What matters is the quality of the reconstruction, and that benchmarks reflect real-world scale and challenges in order to highlight opportunities for developing new approaches. To this end, we have created the Image Matching Benchmark, the first benchmark to include a large dataset of images for training and evaluation. The dataset includes more than 25k images (sourced from the public YFCC100m dataset), each of which has been augmented with accurate pose information (location and orientation). We obtain this "pseudo" ground-truth from large-scale SfM (100s-1000s of images, for each scene), which provides accurate and stable poses, and then run our evaluation on smaller subsets (10s of images), a much more difficult problem. This approach does not require expensive sensors or human labelling, and it provides better proxy metrics than previous benchmarks, which were restricted to small and homogeneous datasets.
Visualizations from our benchmark. We show point-to-point matches generated by different local feature algorithms. Left to right: SIFT, HardNet, LogPolarDesc, R2D2. For details, please refer to our website.
We hope this benchmark, dataset and challenge help advance the state of the art in 3D reconstruction with heterogeneous images. If you’re interested in participating in the challenge, please see the 2020 Image Matching Challenge website for more details.

Acknowledgements
The benchmark is joint work by Yuhe Jin and Kwang Moo Yi (University of Victoria), Anastasiia Mishchuk and Pascal Fua (EPFL), Dmytro Mishkin and Jiří Matas (Czech Technical University), and Eduard Trulls (Google). The CVPR workshop is co-organized by Vassileios Balntas (Scape Technologies/Facebook), Vincent Lepetit (Ecole des Ponts ParisTech), Dmytro Mishkin and Jiří Matas (Czech Technical University), Johannes Schönberger (Microsoft), Eduard Trulls (Google), and Kwang Moo Yi (University of Victoria).

[1] Please note that as of April 2, 2020, CVPR is currently on track, despite the COVID-19 pandemic. Challenge information will be updated as the situation develops. Please see the 2020 Image Matching Challenge website for details.

Source: Google AI Blog


Blending Realities with the ARCore Depth API

Posted by Shahram Izadi, Director of Research and Engineering

ARCore, our developer platform for building augmented reality (AR) experiences, allows your devices to display content immersively in the context of the world around us, making it instantly accessible and useful.
Earlier this year, we introduced Environmental HDR, which brings real world lighting to AR objects and scenes, enhancing immersion with more realistic reflections, shadows, and lighting. Today, we're opening a call for collaborators to try another tool that helps improve immersion with the new Depth API in ARCore, enabling experiences that are vastly more natural, interactive, and helpful.
The ARCore Depth API allows developers to use our depth-from-motion algorithms to create a depth map using a single RGB camera. The depth map is created by taking multiple images from different angles and comparing them as you move your phone to estimate the distance to every pixel.
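The geometric intuition behind estimating distance from multiple views can be shown with the textbook two-view triangulation formula. This is a deliberately simplified sketch and not ARCore's depth-from-motion algorithm: it assumes two views separated by a known sideways baseline and a known focal length, and all names are hypothetical.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Return depth in meters for a pixel seen in two horizontally shifted views.

    depth = focal_length * baseline / disparity, where disparity is how far
    the same scene point moves between the two images, in pixels.
    Nearby points shift more than distant ones, so larger disparity means
    smaller depth.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

With a 500-pixel focal length and a 0.125 m baseline, a point that shifts 25 pixels between views is 2.5 m away, while a point that shifts 50 pixels is 1.25 m away.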

Example depth map, with red indicating areas that are close by, and blue representing areas that are farther away.

One important application for depth is occlusion: the ability for digital objects to accurately appear in front of or behind real world objects. Occlusion helps digital objects feel as if they are actually in your space by blending them with the scene. We will begin making occlusion available in Scene Viewer, the developer tool that powers AR in Search, to an initial set of over 200 million ARCore-enabled Android devices today.

A virtual cat with occlusion off and with occlusion on.
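At its core, occlusion is a per-pixel depth comparison between the virtual object and the real scene. The sketch below shows that comparison on flat lists of pixels. It is an illustrative assumption about how a depth map could be used, not ARCore or Scene Viewer code, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

Color = Tuple[int, int, int]


def composite_with_occlusion(camera: List[Color],
                             real_depth: List[float],
                             virtual: List[Optional[Color]],
                             virtual_depth: List[Optional[float]]) -> List[Color]:
    """Overlay virtual pixels only where they are closer than the real scene.

    virtual[i] is None where the virtual object does not cover pixel i.
    """
    output = []
    for cam_px, real_d, virt_px, virt_d in zip(camera, real_depth, virtual, virtual_depth):
        visible = virt_px is not None and virt_d < real_d
        output.append(virt_px if visible else cam_px)
    return output
```

A virtual pixel at 1.5 m is drawn over a wall at 2.0 m but hidden behind a chair at 1.0 m.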

We’ve also been working with Houzz, a company that focuses on home renovation and design, to bring the Depth API to the “View in My Room” experience in their app. “Using the ARCore Depth API, people can see a more realistic preview of the products they’re about to buy, visualizing our 3D models right next to the existing furniture in a room,” says Sally Huang, Visual Technologies Lead at Houzz. “Doing this gives our users much more confidence in their purchasing decisions.”
The Houzz app with occlusion is available today.
In addition to enabling occlusion, having a 3D understanding of the world on your device unlocks a myriad of other possibilities. Our team has been exploring some of these, playing with realistic physics, path planning, surface interaction, and more.

Physics, path planning, and surface interaction examples.

When applications of the Depth API are combined, you can also create experiences in which objects accurately bounce and splash across surfaces and textures, as well as new interactive game mechanics that enable players to duck and hide behind real-world objects.
A demo experience we created where you have to dodge and throw food at a robot chef.
The Depth API is not dependent on specialized cameras and sensors, and it will only get better as hardware improves. For example, the addition of depth sensors, like time-of-flight (ToF) sensors, to new devices will help create more detailed depth maps to improve existing capabilities like occlusion, and unlock new capabilities such as dynamic occlusion—the ability to occlude behind moving objects.
We’ve only begun to scratch the surface of what’s possible with the Depth API and we want to see how you will innovate with this feature. If you are interested in trying the new Depth API, please fill out our call for collaborators form.

ARCore updates to Augmented Faces and Cloud Anchors enable new shared cross-platform experiences

Posted by Christina Tong, Product Manager, Augmented Reality

Two years ago, we launched ARCore, our developer platform for building augmented reality (AR) experiences. Since then, we’ve seen developers create thousands of AR apps across Android and iOS that transform the way people play, shop, learn and create together. To enable even more shared cross-platform AR experiences, we’re announcing new updates to ARCore’s Augmented Faces and Cloud Anchors APIs.

Augmented Faces on iOS

Earlier this year, we announced our Augmented Faces API, which offers a high-quality, 468-point 3D mesh that lets users attach fun effects to their faces — all without a depth sensor on their smartphone. With the addition of iOS support rolling out today, developers can now create effects for more than a billion users. We’ve also made the creation process easier for both iOS and Android developers with a new face effects template.

Improvements to Cloud Anchors

Last year, we introduced the Cloud Anchors API, which lets developers create shared AR experiences across Android and iOS. Cloud Anchors let devices create a 3D feature map from visual data onto which anchors can be placed. The anchors are hosted in the cloud so multiple people can use them to enable shared real world experiences. Cloud Anchors power a wide variety of cross-platform apps, like Just a Line, PHAROS AR and Spacecraft AR.

In our latest ARCore update, we’ve made some improvements to the Cloud Anchors API that make hosting and resolving anchors more efficient and robust. This is due to improved anchor creation and visual processing in the cloud. Now, when creating an anchor, more angles across larger areas in the scene can be captured for a more robust 3D feature map. Once the map is created, the visual data used to create the map is deleted and only anchor IDs are shared with other devices to be resolved. Moreover, multiple anchors in the scene can now be resolved simultaneously, reducing the time needed to start a shared AR experience.

These updates to Cloud Anchors are available for developers today.

Persistent Cloud Anchors and Call for Collaborators

As we look to the future, we’re taking steps to expand the scale and timeline of shared AR experiences with persistent Cloud Anchors. We see this as enabling a “save button” for AR, so that digital information overlaid on top of the real world can be experienced at any time.

Imagine working together on a redesign of your home throughout the year, leaving AR notes for your friends around an amusement park, or hiding AR objects at specific places around the world to be discovered by others.

Persistent Cloud Anchors are powering Mark AR, a social app being developed by Sybo and iDreamSky that lets people create, discover, and share their AR art with friends and followers in real-world locations. With persistent Cloud Anchors, users can keep returning to their pieces as they create and collaborate over time.

Mark AR is an app that lets people create and discover AR art in real-world locations.

Reliably anchoring AR content for every use case—regardless of surface, distance, and time—pushes the limits of computation and computer vision because the real world is diverse and always changing. By enabling a “save button” for AR, we’re taking an important step toward bridging the digital and physical worlds to expand the ways AR can be useful in our day-to-day lives.

We’re currently looking for more developers to help us explore and test persistent Cloud Anchors in real world apps at scale, before making the feature broadly available. If you’re interested in early access, you can apply here.

Giving Lens New Reading Capabilities in Google Go



Around the world, millions of people are coming online for the first time, and many of them are among the 800 million adults worldwide who are unable to read or write, or those who are migrating to towns and cities where they are not able to speak the predominant language. As a smartphone camera-based tool, Google Lens has great potential for helping people who struggle with reading and other language-based challenges. Lens uses computer vision, machine learning and Google’s Knowledge Graph to let people turn the things they see in the real world into a visual search box, enabling them to identify objects like plants and animals, or to copy and paste text from the real world into their phone.

However, in order for Lens to be able to help the greatest number of people, we needed to create a special version that can work on even the most basic smartphones. So at I/O 2019, we announced a new version of Lens designed specifically for use in Google Go—our Search app for entry level devices—and we included a new set of features designed to help people who face reading and other language-based challenges. When users point their camera at text they don’t understand, Lens in Google Go can translate and read it out loud. It even highlights each word as it’s being read so users can follow along. If you want to try out these features for yourself, they are available today via Lens in Google Go. While Google Go was initially available only on Android Go devices and on the Google Play Store in select markets, recently, we made it available globally in the Google Play Store.
To make these reading features work, the Google Go version of Lens needs to be able to capture high quality images on a wide variety of devices, then identify the text, understand its structure, translate and overlay it in context, and finally, read it out loud.

Image Capture
Image capture on entry-level devices, like those that run Android Go, is tricky since it must work on a wide variety of devices, many of which are more resource constrained than flagship phones. To build a universal tool that can reliably capture high-quality images with minimal lag, we made Lens in Google Go an early adopter of a new Android support library called CameraX. Available in Jetpack—a suite of libraries, tools, and guidance for Android developers—CameraX is an abstraction layer over the Android Camera2 API that resolves device compatibility issues so developers don't have to write their own device-specific code.

Using CameraX, we implemented two capture strategies to balance capture latency against performance impact. On higher-end phones, which are powerful enough to provide a constant stream of high-resolution frames from which to select an image, we’ve made capture instantaneous. On less advanced devices, streaming these frames could cause camera lag since the CPU is less powerful, so we process the frame when the user taps capture to produce a single, on-demand high-resolution image.

Text Recognition
After Lens in Google Go captures an image, it needs to make sense of the shapes and letters that constitute the words, sentences and paragraphs. To do this, the image is scaled down and transferred to the Lens server, where the processing will be performed. Next, optical character recognition (OCR) is applied, which utilizes a region proposal network to detect character level bounding boxes that can be merged into lines for text recognition.
Merging these character boxes into words is a two-step, sequential process. The first step is to apply the Hough Transform, which assumes the text is distributed across parallel lines. The second step uses Text Flow, which instead traces text that may follow a curve by finding the shortest path through a graph of detected text boxes. This ensures that text with a variety of distributions, be they straight, curved or mixed, can be identified and processed.
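To illustrate the first step, here is a minimal sketch of Hough-style voting over the centers of detected character boxes. It is not the Lens implementation: the class name `HoughLineSketch`, the one-degree angle steps, the `rhoBinSize` parameter and the use of box centers are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: find the dominant straight text line through character box centers. */
final class HoughLineSketch {
    /** A line in normal form: x*cos(theta) + y*sin(theta) = rho. */
    static final class Line {
        final int thetaDegrees;
        final double rho;
        final int votes;

        Line(int thetaDegrees, double rho, int votes) {
            this.thetaDegrees = thetaDegrees;
            this.rho = rho;
            this.votes = votes;
        }
    }

    /**
     * Every center votes, for each candidate angle, for the rho bin of the line
     * that would pass through it. The bin with the most votes wins.
     */
    static Line dominantLine(double[][] centers, double rhoBinSize) {
        Line best = new Line(0, 0.0, 0);
        for (int theta = 0; theta < 180; theta++) {
            double rad = Math.toRadians(theta);
            double cos = Math.cos(rad);
            double sin = Math.sin(rad);
            Map<Long, Integer> votes = new HashMap<>();
            for (double[] c : centers) {
                long bin = Math.round((c[0] * cos + c[1] * sin) / rhoBinSize);
                int count = votes.merge(bin, 1, Integer::sum);
                if (count > best.votes) {
                    best = new Line(theta, bin * rhoBinSize, count);
                }
            }
        }
        return best;
    }
}
```

Boxes whose centers fall in the winning bin would be treated as one straight line of text. Curved text gets no strong peak in this kind of vote, which is where a path-based step such as Text Flow comes in.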

Because the images captured by Lens in Google Go may include sources such as signage, handwriting or documents, a slew of additional challenges can arise. For example, the text can be obscured, scripts can be uniquely stylized, and images can be blurry. All of these issues can cause the OCR engine to misunderstand various characters within each word. To correct mistakes and improve word accuracy, Lens in Google Go uses the context of surrounding words to make corrections. It also utilizes the Knowledge Graph to provide contextual clues, such as whether a word is likely a proper noun and should not be spell-corrected.

All of these steps, from script detection and direction identification to text recognition, are performed by separable convolutional neural networks (CNNs) with an additional quantized long short-term memory (LSTM) network. And the models are trained on data from a variety of sources, ranging from ReCaptcha to scanned images from Google Books.
Left: Image with bounding box around recognized text. The raw OCR output from this image reads, “Cise is beauti640”. Right: By applying Knowledge Graph in addition to context from nearby words, Lens in Google Go recognizes the words, “life is beautiful”.
Understanding Structure
Once the individual words have been recognized, Lens must determine how to fit them together. The text that people come across in the real world is laid out in many different ways. A newspaper, for example, is laid out into columns, with headlines, article text, and advertisements. Meanwhile, a bus schedule has one column for destinations and another with times. While understanding text structure comes very naturally to people, computers need to be taught how to comprehend it. Lens uses CNNs to detect coherent text blocks like columns, or text in a consistent style or color. And then, within each block, it uses signals like text-alignment, language, and the geometric relationship of the paragraphs to determine their final reading order.

One of the other challenges in detecting document structure is that people take pictures of text from different angles, often with a warped perspective. This means we cannot revert to off-the-shelf detectors that rely on axis aligned boxes, but must generalize our systems to be able to deal with homographic distortions.
Paragraph segmentation on the front page of a newspaper. Notice how “News Analysis”, which is embedded in the middle of a column, has been identified separately due to its distinct style features.
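To make the homographic distortion mentioned above concrete, the following sketch maps the corners of an axis-aligned box through a 3x3 homography. The class and method names are hypothetical, and the matrix values used with it would come from whatever estimates the camera perspective. The point is only that the result is a general quadrilateral, which an axis-aligned detector cannot represent.

```java
/** Hypothetical sketch: warp points and boxes with a 3x3 homography. */
final class HomographySketch {
    /** Applies h to (x, y) in homogeneous coordinates, then divides by w. */
    static double[] apply(double[][] h, double x, double y) {
        double u = h[0][0] * x + h[0][1] * y + h[0][2];
        double v = h[1][0] * x + h[1][1] * y + h[1][2];
        double w = h[2][0] * x + h[2][1] * y + h[2][2];
        if (w == 0.0) {
            throw new ArithmeticException("point maps to infinity");
        }
        return new double[] {u / w, v / w};
    }

    /** Returns the warped corners in order: top-left, top-right, bottom-right, bottom-left. */
    static double[][] warpBox(double[][] h, double left, double top, double right, double bottom) {
        return new double[][] {
            apply(h, left, top),
            apply(h, right, top),
            apply(h, right, bottom),
            apply(h, left, bottom)
        };
    }
}
```

With a non-zero entry in the last row of the matrix, points farther along one axis are scaled down more strongly, so the opposite sides of the warped box are no longer parallel.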
Translations in Context
To provide users with the most helpful information, translations must be both accurate and contextual. Lens uses Google Translate’s neural machine translation (NMT) algorithms to translate entire sentences at a time, rather than going word-by-word, in order to preserve proper grammar and diction.

For the translation to be most useful, it needs to be placed in the context of the original text. For example, when translating instructions on an ATM, it is important to know which buttons correspond to which instructions. Part of the challenge is accounting for the fact that the translated text can be much shorter or longer than the original. For example, German sentences tend to be longer than English ones. To accomplish this seamless overlay, Lens redistributes the translation into lines of similar length, and chooses an appropriate font size to match. It also matches the color of the translation and its background with the original text through the use of a heuristic that assumes the background and the text differ in luminosity, and that the background takes up the majority of the space. This allows Lens to classify whether a pixel represents background or text, and then sample the average color from these two regions to ensure the translated text matches the original text.
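The color-matching heuristic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Lens code: pixels are packed 0xRRGGBB integers, the split point is the mean luminance of the region, and luminance is approximated with Rec. 709 weights applied directly to the 8-bit channel values. All names are hypothetical.

```java
/** Hypothetical sketch of the background/text color heuristic. */
final class OverlayColorSketch {
    /** Approximate luminance from 8-bit channels using Rec. 709 weights. */
    static double luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return 0.2126 * r + 0.7152 * g + 0.0722 * b;
    }

    /** Returns {backgroundRgb, textRgb} for a region of packed 0xRRGGBB pixels. */
    static int[] estimateColors(int[] pixels) {
        double mean = 0.0;
        for (int p : pixels) {
            mean += luminance(p);
        }
        mean /= pixels.length;

        // Each accumulator holds the red, green and blue sums plus a pixel count.
        long[] dark = new long[4];
        long[] light = new long[4];
        for (int p : pixels) {
            long[] acc = luminance(p) < mean ? dark : light;
            acc[0] += (p >> 16) & 0xFF;
            acc[1] += (p >> 8) & 0xFF;
            acc[2] += p & 0xFF;
            acc[3]++;
        }

        // Assumption from the text: the background covers most of the region.
        long[] background = dark[3] >= light[3] ? dark : light;
        long[] text = background == dark ? light : dark;
        return new int[] {average(background), average(text)};
    }

    private static int average(long[] acc) {
        if (acc[3] == 0) {
            return 0; // No pixels in this class; fall back to black.
        }
        int r = (int) (acc[0] / acc[3]);
        int g = (int) (acc[1] / acc[3]);
        int b = (int) (acc[2] / acc[3]);
        return (r << 16) | (g << 8) | b;
    }
}
```

The translated text would then be drawn in the second color over a patch filled with the first.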

Reading the Text Out Loud
The final challenge in delivering information in the most helpful way with Lens in Google Go is reading the text aloud. High-fidelity audio is generated using Google Text-to-Speech (TTS), a service that applies machine learning to detect and disambiguate entities such as dates, phone numbers and addresses, and uses that to generate realistic speech based on DeepMind’s WaveNet.

These reading features become more contextual and useful when they are paired with display. Lens utilizes timing annotations from the TTS service that mark the beginning of each word in order to highlight each word on screen as it’s being read, similar to a karaoke machine. Say, for example, that a user takes a picture of an ATM screen with different labels next to different buttons. This karaoke effect allows users to know which label applies to which button. It may also help users learn how to pronounce the words being translated.
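As a rough illustration of the karaoke effect, the sketch below looks up which word to highlight at a given playback position. It assumes the timing annotations are available as a sorted array of distinct word start times in milliseconds. That format and the names used here are assumptions, since the actual TTS annotation format is not described.

```java
import java.util.Arrays;

/** Hypothetical sketch: pick the word to highlight from word start times. */
final class KaraokeHighlighter {
    /**
     * Returns the index of the word being spoken at playbackMs,
     * or -1 if playback has not reached the first word yet.
     * wordStartMs must be sorted in ascending order.
     */
    static int activeWord(long[] wordStartMs, long playbackMs) {
        int pos = Arrays.binarySearch(wordStartMs, playbackMs);
        if (pos >= 0) {
            return pos; // Playback is exactly at the start of a word.
        }
        // binarySearch returns -(insertionPoint) - 1; the active word is the one before it.
        return -pos - 2;
    }
}
```

Calling this on every audio progress callback and highlighting the returned index keeps the display in step with the speech.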
Looking Ahead
Taken together, it is our hope that these features will have a positive impact on the day-to-day lives of millions of people. Moving forward, we will continue to work on further updates to these reading features to make the OCR more precise, including improvements to text structure understanding (e.g. multi-column text) and recognition of Indic scripts. As we address these text challenges, we continue to look for new ways that the combination of machine learning and the smartphone camera can help people as they go about their lives.

Source: Google AI Blog


Real-Time AR Self-Expression with Machine Learning



Augmented reality (AR) helps you do more with what you see by overlaying digital content and information on top of the physical world. For example, AR features coming to Google Maps will let you find your way with directions overlaid on top of your real world. With Playground, a creative mode in the Pixel camera, you can use AR to see the world differently. And with the latest release of YouTube Stories and ARCore's new Augmented Faces API you can add objects like animated masks, glasses, 3D hats and more to your own selfies!

One of the key challenges in making these AR features possible is proper anchoring of the virtual content to the real world, a process that requires a unique set of perceptive technologies able to track the highly dynamic surface geometry across every smile, frown or smirk.
Our 3D mesh and some of the effects it enables
To make all this possible, we employ machine learning (ML) to infer approximate 3D surface geometry to enable visual effects, requiring only a single camera input without the need for a dedicated depth sensor. This approach enables AR effects at real-time speeds, using TensorFlow Lite for mobile CPU inference or its new mobile GPU functionality where available. This technology is the same as what powers YouTube Stories' new creator effects, and is also available to the broader developer community via the latest ARCore SDK release and the ML Kit Face Contour Detection API.

An ML Pipeline for Selfie AR
Our ML pipeline consists of two real-time deep neural network models that work together: A detector that operates on the full image and computes face locations, and a generic 3D mesh model that operates on those locations and predicts the approximate surface geometry via regression. Having the face accurately cropped drastically reduces the need for common data augmentations like affine transformations consisting of rotations, translation and scale changes. Instead it allows the network to dedicate most of its capacity towards coordinate prediction accuracy, which is critical to achieve proper anchoring of the virtual content.

Once the location of interest is cropped, the mesh network is only applied to a single frame at a time, using a windowed smoothing in order to reduce noise when the face is static while avoiding lagging during significant movement.
Our 3D mesh in action
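The windowed smoothing described above can be sketched as a moving average that resets itself when the input jumps. This is a simplified illustration for a single scalar coordinate, not the production filter. The class name, window size and motion threshold are assumptions chosen for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: moving average that resets on large motion. */
final class AdaptiveWindowSmoother {
    private final int maxWindow;
    private final double motionThreshold;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    AdaptiveWindowSmoother(int maxWindow, double motionThreshold) {
        if (maxWindow < 1) {
            throw new IllegalArgumentException("maxWindow must be at least 1");
        }
        this.maxWindow = maxWindow;
        this.motionThreshold = motionThreshold;
    }

    /** Feeds one per-frame prediction and returns the smoothed value. */
    double update(double sample) {
        // A large jump means real movement: drop the history to avoid lag.
        if (!window.isEmpty() && Math.abs(sample - sum / window.size()) > motionThreshold) {
            window.clear();
            sum = 0.0;
        }
        window.addLast(sample);
        sum += sample;
        if (window.size() > maxWindow) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }
}
```

One smoother per mesh coordinate would average out jitter while the face is still, and follow the raw prediction immediately after a fast head turn.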
For our 3D mesh we employed transfer learning and trained a network with several objectives: the network simultaneously predicts 3D mesh coordinates on synthetic, rendered data and 2D semantic contours on annotated, real world data similar to those ML Kit provides. The resulting network provided us with reasonable 3D mesh predictions not just on synthetic but also on real world data. All models are trained on data sourced from a geographically diverse dataset and subsequently tested on a balanced, diverse test set for qualitative and quantitative performance.

The 3D mesh network receives as input a cropped video frame. It doesn't rely on additional depth input, so it can also be applied to pre-recorded videos. The model outputs the positions of the 3D points, as well as the probability of a face being present and reasonably aligned in the input. A common alternative approach is to predict a 2D heatmap for each landmark, but it is not amenable to depth prediction and has high computational costs for so many points.

We further improve the accuracy and robustness of our model by iteratively bootstrapping and refining predictions. That way we can grow our dataset to increasingly challenging cases, such as grimaces, oblique angles and occlusions. Dataset augmentation techniques also expanded the available ground truth data, developing model resilience to artifacts like camera imperfections or extreme lighting conditions.
Dataset expansion and improvement pipeline
Hardware-tailored Inference
We use TensorFlow Lite for on-device neural network inference. The newly introduced GPU back-end acceleration boosts performance where available, and significantly lowers the power consumption. Furthermore, to cover a wide range of consumer hardware, we designed a variety of model architectures with different performance and efficiency characteristics. The most important differences of the lighter networks are the residual block layout and the accepted input resolution (128x128 pixels in the lightest model vs. 256x256 in the most complex). We also vary the number of layers and the subsampling rate (how fast the input resolution decreases with network depth).
Inference time per frame: CPU vs. GPU
The result of these optimizations is a substantial speedup from using lighter models, with minimal degradation in AR effect quality.
Comparison of the most complex (left) and the lightest models (right). Temporal consistency as well as lip and eye tracking is slightly degraded on light models.
The end result of these efforts empowers a user experience with convincing, realistic selfie AR effects in YouTube, ARCore, and other clients by:
  • Simulating light reflections via environmental mapping for realistic rendering of glasses
  • Natural lighting by casting virtual object shadows onto the face mesh
  • Modelling face occlusions to hide virtual object parts behind a face, e.g. virtual glasses, as shown below.
YouTube Stories includes Creator Effects like realistic virtual glasses, based on our 3D mesh
In addition, we achieve highly realistic makeup effects by:
  • Modelling specular reflections applied on lips
  • Face painting by using luminance-aware material
Case study comparing real make-up against our AR make-up on 5 subjects under different lighting conditions.
We are excited to share this new technology with creators, users and developers alike, who can use it immediately by downloading the latest ARCore SDK. In the future we plan to broaden this technology to more Google products.

Acknowledgements
We would like to thank Yury Kartynnik, Valentin Bazarevsky, Andrey Vakunov, Siargey Pisarchyk, Andrei Tkachenka, and Matthias Grundmann for collaboration on developing the current mesh technology; Nick Dufour, Avneesh Sud and Chris Bregler for an earlier version of the technology based on parametric models; Kanstantsin Sokal, Matsvei Zhdanovich, Gregory Karpiak, Alexander Kanaukou, Suril Shah, Buck Bourdon, Camillo Lugaresi, Siarhei Kazakou and Igor Kibalchich for building the ML pipeline to drive impressive effects; Aleksandra Volf and the annotation team for their diligence and dedication to perfection; Andrei Kulik, Juhyun Lee, Raman Sarokin, Ekaterina Ignasheva, Nikolay Chirkov, and Yury Pisarchyk for careful benchmarking and insights on mobile GPU-centric network architecture optimizations.

Source: Google AI Blog


New UI tools and a richer creative canvas come to ARCore

Posted by Evan Hardesty Parker, Software Engineer

ARCore and Sceneform give developers simple yet powerful tools for creating augmented reality (AR) experiences. In our last update (version 1.6) we focused on making virtual objects appear more realistic within a scene. In version 1.7, we're focusing on creative elements like AR selfies and animation as well as helping you improve the core user experience in your apps.

Creating AR Selfies

Example of 3D face mesh application

ARCore's new Augmented Faces API (available on the front-facing camera) offers a high quality, 468-point 3D mesh that lets users attach fun effects to their faces. From animated masks, glasses, and virtual hats to skin retouching, the mesh provides coordinates and region specific anchors that make it possible to add these delightful effects.

You can get started in Unity or Sceneform by creating an ARCore session with the "front-facing camera" and Augmented Faces "mesh" mode enabled. Note that other AR features such as plane detection aren't currently available when using the front-facing camera. AugmentedFace extends Trackable, so faces are detected and updated just like planes, Augmented Images, and other trackables.

import android.app.Activity;
import com.google.ar.core.Config;
import com.google.ar.core.Session;
import com.google.ar.core.exceptions.UnavailableException;
import java.util.EnumSet;

// Create an ARCore session that supports Augmented Faces for use in Sceneform.
public Session createAugmentedFacesSession(Activity activity) throws UnavailableException {
    // Use the front-facing (selfie) camera.
    Session session = new Session(activity, EnumSet.of(Session.Feature.FRONT_CAMERA));
    // Enable Augmented Faces.
    Config config = session.getConfig();
    config.setAugmentedFaceMode(Config.AugmentedFaceMode.MESH3D);
    session.configure(config);
    return session;
}

Animating characters in your Sceneform AR apps

Another way version 1.7 expands the AR creative canvas is by letting your objects dance, jump, spin and move around with support for animations in Sceneform. To start an animation, initialize a ModelAnimator (an extension of the existing Android animation support) with animation data from your ModelRenderable.

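// Keep the animator in a field so it can be paused or cancelled later.
private ModelAnimator animator;

void startDancing(ModelRenderable andyRenderable) {
    // Look up the named animation stored in the renderable.
    AnimationData data = andyRenderable.getAnimationData("andy_dancing");
    animator = new ModelAnimator(data, andyRenderable);
    animator.start();
}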

Solving common AR UX challenges in Unity with new UI components

In ARCore version 1.7 we also focused on helping you improve your user experience with a simplified workflow. We've integrated "ARCore Elements" -- a set of common AR UI components that have been validated with user testing -- into the ARCore SDK for Unity. You can use ARCore Elements to insert AR interactive patterns in your apps without having to reinvent the wheel. ARCore Elements also makes it easier to follow Google's recommended AR UX guidelines.

ARCore Elements includes two AR UI components that are especially useful:

  • Plane Finding - streamlining the key steps involved in detecting a surface
  • Object Manipulation - using intuitive gestures to rotate, elevate, move, and resize virtual objects

We plan to add more to ARCore Elements over time. You can download the ARCore Elements app available in the Google Play Store to learn more.

Improving the User Experience with Shared Camera Access

ARCore version 1.7 also includes UX enhancements for the smartphone camera -- specifically, the experience of switching in and out of AR mode. Shared Camera access in the ARCore SDK for Java lets users pause an AR experience, access the camera, and jump back in. This can be particularly helpful if users want to take a picture of the action in your app.

More details are available in the Shared Camera developer documentation and Java sample.

Learn more and get started

For AR experiences to capture users' imaginations they need to be both immersive and easily accessible. With tools for adding AR selfies, animation, and UI enhancements, ARCore version 1.7 can help with both these objectives.

You can learn more about these new updates on our ARCore developer website.

Using Global Localization to Improve Navigation



One of the consistent challenges when navigating with Google Maps is figuring out the right direction to go: sure, the app tells you to go north - but many times you're left wondering, "Where exactly am I, and which way is north?" Over the years, we've attempted to improve the accuracy of the blue dot with tools like GPS and compass, but found that both have physical limitations that make solving this challenge difficult, especially in urban environments.

We're experimenting with a way to solve this problem using a technique we call global localization, which combines Visual Positioning Service (VPS), Street View, and machine learning to more accurately identify position and orientation. Using the smartphone camera as a sensor, this technology enables a more powerful and intuitive way to help people quickly determine which way to go.
Due to limitations with accuracy and orientation, guidance via GPS alone is limited in urban environments. Using VPS, Street View and machine learning, Global Localization can provide better context on where you are relative to where you're going.
In this post, we'll discuss some of the limitations of navigation in urban environments and how global localization can help overcome them.

Where GPS Falls Short
The process of identifying the position and orientation of a device relative to some reference point is referred to as localization. Various techniques approach localization in different ways. GPS relies on measuring the delay of radio signals from multiple dedicated satellites to determine a precise location. However, in dense urban environments like New York or San Francisco, it can be incredibly hard to pinpoint a geographic location due to low visibility to the sky and signals reflecting off of buildings. This can result in highly inaccurate placements on the map, meaning that your location could appear on the wrong side of the street, or even a few blocks away.
GPS signals bouncing off facades in an urban environment.
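A short worked example shows why reflections matter so much. Because a range is derived from signal travel time, any extra path length from a reflection turns directly into range error. The sketch below uses hypothetical names and is a simplification: a real receiver also has to solve for its own clock offset, which is ignored here.

```java
/** Hypothetical sketch: convert signal travel time into distance. */
final class GpsRangeSketch {
    static final double SPEED_OF_LIGHT_M_PER_S = 299_792_458.0;

    /** Distance implied by a one-way signal travel time. */
    static double rangeMeters(double travelTimeSeconds) {
        return SPEED_OF_LIGHT_M_PER_S * travelTimeSeconds;
    }

    /** Apparent extra range when a reflection delays the signal. */
    static double multipathErrorMeters(double extraDelaySeconds) {
        return rangeMeters(extraDelaySeconds);
    }
}
```

A reflection that delays the signal by only 100 nanoseconds adds about 30 meters to the measured range, which is easily enough to place the blue dot on the wrong side of a street.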
GPS has another technical shortcoming: it can only determine the location of the device, not the orientation. Sometimes, sensors in your mobile device can remedy the situation by measuring the magnetic and gravitational fields of the earth and the relative motion of the device in order to give rough estimates of your orientation. But these sensors are easily skewed by magnetic objects such as cars, pipes, buildings, and even electrical wires inside the phone, resulting in orientation errors of up to 180 degrees.

A New Approach to Localization
To improve the precision of the position and orientation of the blue dot on the map, a new complementary technology is necessary. When walking down the street, you orient yourself by comparing what you see with what you expect to see. Global localization uses a combination of techniques that enable the camera on your mobile device to orient itself much as you would.

VPS determines the location of a device based on imagery rather than GPS signals. VPS first creates a map by taking a series of images which have a known location and analyzing them for key visual features, such as the outline of buildings or bridges, to create a large scale and fast searchable index of those visual features. To localize the device, VPS compares the features in imagery from the phone to those in the VPS index. However, the accuracy of localization through VPS is greatly affected by the quality of both the imagery and the location associated with it. And that poses another question: where does one find an extensive source of high-quality global imagery?
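The comparison step can be pictured as a nearest-neighbor search over feature descriptors. The sketch below is only an illustration with hypothetical names: it assumes each visual feature is summarized as a fixed-length float vector, and it scans the index linearly, whereas a large-scale index would need a much faster search structure.

```java
/** Hypothetical sketch: match a query feature against indexed features. */
final class FeatureIndexSketch {
    /** Squared Euclidean distance between two descriptors of equal length. */
    static double squaredDistance(float[] a, float[] b) {
        double total = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            total += d * d;
        }
        return total;
    }

    /** Returns the position of the closest indexed descriptor, or -1 if the index is empty. */
    static int nearest(float[][] indexDescriptors, float[] query) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < indexDescriptors.length; i++) {
            double distance = squaredDistance(indexDescriptors[i], query);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }
}
```

Since every indexed feature comes from an image with a known location, each match ties something the phone currently sees to a known place.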

Enter Street View
Over 10 years ago we launched Street View in Google Maps in order to help people explore the world more deeply. In that time, Street View has continued to expand its coverage of the world, empowering people to not only preview their route, but also step inside famous landmarks and museums, no matter where they are. To deliver global localization with VPS, we connected it with Street View data, making use of information gathered and tested from over 93 countries across the globe. This rich dataset provides trillions of strong reference points to apply triangulation, helping more accurately determine the position of a device and guide people towards their destination.
Features matched from multiple images.
Although this approach works well in theory, making it work well in practice is a challenge. The problem is that the imagery from the phone at the time of localization may differ from what the scene looked like when the Street View imagery was collected, perhaps months earlier. For example, trees have lots of rich detail, but change as the seasons change and even as the wind blows. To get a good match, we need to filter out temporary parts of the scene and focus on permanent structure that doesn't change over time. That's why a core ingredient in this new approach is applying machine learning to automatically decide which features to pay attention to, prioritizing features that are likely to be permanent parts of the scene and ignoring things like trees, dynamic light movement, and construction that are likely transient. This is just one of the many ways in which we use machine learning to improve accuracy.

Combining Global Localization with Augmented Reality
Global localization is an additional option that users can enable when they most need accuracy. And, this increased precision has enabled the possibility of a number of new experiences. One of the newest features we're testing is the ability to use ARCore, Google's platform for building augmented reality experiences, to overlay directions right on top of Google Maps when someone is in walking navigation mode. With this feature, a quick glance at your phone shows you exactly which direction you need to go.
Although early results are promising, there's significant work to be done. One outstanding challenge is making this technology work everywhere, in all types of conditions—think late at night, in a snowstorm, or in a torrential downpour. To make sure we're building something that's truly useful, we're starting to test this feature with select Local Guides, a small group of Google Maps enthusiasts around the world who we know will offer us feedback on how this approach can be most helpful.

Like other AI-driven camera experiences such as Google Lens (which uses the camera to let you search what you see), we believe the ability to overlay directions over the real world environment offers an exciting and useful way to use the technology that already exists in your pocket. We look forward to continuing to develop this technology, and the potential for smartphone cameras to add new types of valuable experiences.

Source: Google AI Blog