Tag Archives: machine learning

Introducing Coral: Our platform for development with local AI

Posted by Billy Rutledge (Director) and Vikram Tank (Product Mgr), Coral Team

AI can be beneficial for everyone, especially when we all explore, learn, and build together. To that end, Google's been developing tools like TensorFlow and AutoML to ensure that everyone has access to build with AI. Today, we're expanding the ways that people can build out their ideas and products by introducing Coral into public beta.

Coral is a platform for building intelligent devices with local AI.

Coral offers a complete local AI toolkit that makes it easy to grow your ideas from prototype to production. It includes hardware components, software tools, and content that help you create, train and run neural networks (NNs) locally, on your device. Because we focus on accelerating NN's locally, our products offer speedy neural network performance and increased privacy — all in power-efficient packages. To help you bring your ideas to market, Coral components are designed for fast prototyping and easy scaling to production lines.

Our first hardware components feature the new Edge TPU, a small ASIC designed by Google that provides high-performance ML inferencing for low-power devices. For example, it can execute state-of-the-art mobile vision models such as MobileNet V2 at 100+ fps, in a power efficient manner.

Coral Camera Module, Dev Board and USB Accelerator

For new product development, the Coral Dev Board is a fully integrated system designed as a system on module (SoM) attached to a carrier board. The SoM brings the powerful NXP iMX8M SoC together with our Edge TPU coprocessor (as well as Wi-Fi, Bluetooth, RAM, and eMMC memory). To make prototyping computer vision applications easier, we also offer a Camera that connects to the Dev Board over a MIPI interface.

To add the Edge TPU to an existing design, the Coral USB Accelerator allows for easy integration into any Linux system (including Raspberry Pi boards) over USB 2.0 and 3.0. PCIe versions are coming soon, and will snap into M.2 or mini-PCIe expansion slots.

When you're ready to scale to production we offer the SOM from the Dev Board and PCIe versions of the Accelerator for volume purchase. To further support your integrations, we'll be releasing the baseboard schematics for those who want to build custom carrier boards.

Our software tools are based around TensorFlow and TensorFlow Lite. TF Lite models must be quantized and then compiled with our toolchain to run directly on the Edge TPU. To help get you started, we're sharing over a dozen pre-trained, pre-compiled models that work with Coral boards out of the box, as well as software tools to let you re-train them.

For those building connected devices with Coral, our products can be used with Google Cloud IoT. Google Cloud IoT combines cloud services with an on-device software stack to allow for managed edge computing with machine learning capabilities.

Coral products are available today, along with product documentation, datasheets and sample code at g.co/coral. We hope you try our products during this public beta, and look forward to sharing more with you at our official launch.

Source: Google Developers Blog

Long-Range Robotic Navigation via Automated Reinforcement Learning

Aleksandra Faust, Senior Research Scientist and Anthony Francis, Senior Software Engineer, Robotics at Google

In the United States alone, there are 3 million people with a mobility impairment that prevents them from ever leaving their homes. Service robots that can autonomously navigate long distances can improve the independence of people with limited mobility, for example, by bringing them groceries, medicine, and packages. Research has demonstrated that deep reinforcement learning (RL) is good at mapping raw sensory input to actions, e.g. learning to grasp objects and for robot locomotion, but RL agents usually lack the understanding of large physical spaces needed to safely navigate long distances without human help and to easily adapt to new spaces.

In three recent papers, “Learning Navigation Behaviors End-to-End with AutoRL,” “PRM-RL: Long-Range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning”, and “Long-Range Indoor Navigation with PRM-RL”, we investigate easy-to-adapt robotic autonomy by combining deep RL with long-range planning. We train local planner agents to perform basic navigation behaviors, traversing short distances safely without collisions with moving obstacles. The local planners take noisy sensor observations, such as a 1D lidar that provides distances to obstacles, and output linear and angular velocities for robot control. We train the local planner in simulation with AutoRL, a method that automates the search for RL reward and neural network architecture. Despite their limited range of 10 - 15 meters, the local planners transfer well to both real robots and to new, previously unseen environments. This enables us to use them as building blocks for navigation in large spaces. We then build a roadmap, a graph where nodes are locations and edges connect the nodes only if local planners, which mimic real robots well with their noisy sensors and control, can traverse between them reliably.

Automating Reinforcement Learning (AutoRL)
In our first paper, we train the local planners in small, static environments. However, training with standard deep RL algorithms, such as Deep Deterministic Policy Gradient (DDPG), poses several challenges. For example, the true objective of the local planners is to reach the goal, which represents a sparse reward. In practice, this requires researchers to spend significant time iterating and hand-tuning the rewards. Researchers must also make decisions about the neural network architecture, without clear accepted best practices. And finally, algorithms like DDPG are unstable learners and often exhibit catastrophic forgetfulness.

To overcome those challenges, we automate the deep Reinforcement Learning (RL) training. AutoRL is an evolutionary automation layer around deep RL that searches for a reward and neural network architecture using large-scale hyperparameter optimization. It works in two phases, reward search and neural network architecture search. During the reward search, AutoRL trains a population of DDPG agents concurrently over several generations, each with a slightly different reward function optimizing for the local planner’s true objective: reaching the destination. At the end of the reward search phase, we select the reward that leads the agents to its destination most often. In the neural network architecture search phase, we repeat the process, this time using the selected reward and tuning the network layers, optimizing for the cumulative reward.

Automating reinforcement learning with reward and neural network architecture search.

However, this iterative process means AutoRL is not sample efficient. Training one agent takes 5 million samples; AutoRL training over 10 generations of 100 agents requires 5 billion samples - equivalent to 32 years of training! The benefit is that after AutoRL the manual training process is automated, and DDPG does not experience catastrophic forgetfulness. Most importantly, the resulting policies are higher quality — AutoRL policies are robust to sensor, actuator and localization noise, and generalize well to new environments. Our best policy is 26% more successful than other navigation methods across our test environments.

AutoRL (red) success over short distances (up to 10 meters) in several unseen buildings. Compared to hand-tuned DDPG (dark-red), artificial potential fields (light blue), dynamic window approach (blue), and behavior cloning (green).

AutoRL local planner policy transfer to robots in real, unstructured environments

While these policies only perform local navigation, they are robust to moving obstacles and transfer well to real robots, even in unstructured environments. Though they were trained in simulation with only static obstacles, they can also handle moving objects effectively. The next step is to combine the AutoRL policies with sampling-based planning to extend their reach and enable long-range navigation.

Achieving Long Range Navigation with PRM-RL
Sampling-based planners tackle long-range navigation by approximating robot motions. For example, probabilistic roadmaps (PRMs) sample robot poses and connect them with feasible transitions, creating roadmaps that capture valid movements of a robot across large spaces. In our second paper, which won Best Paper in Service Robotics at ICRA 2018, we combine PRMs with hand-tuned RL-based local planners (without AutoRL) to train robots once locally and then adapt them to different environments.

First, for each robot we train a local planner policy in a generic simulated training environment. Next, we build a PRM with respect to that policy, called a PRM-RL, over a floor plan for the deployment environment. The same floor plan can be used for any robot we wish to deploy in the building in a one time per robot+environment setup.

To build a PRM-RL we connect sampled nodes only if the RL-based local planner, which represents robot noise well, can reliably and consistently navigate between them. This is done via Monte Carlo simulation. The resulting roadmap is tuned to both the abilities and geometry of the particular robot. Roadmaps for robots with the same geometry but different sensors and actuators will have different connectivity. Since the agent can navigate around corners, nodes without clear line of sight can be included. Whereas nodes near walls and obstacles are less likely to be connected into the roadmap because of sensor noise. At execution time, the RL agent navigates from roadmap waypoint to waypoint.

Roadmap being built with 3 Monte Carlo simulations per randomly selected node pair.

The largest map was 288 meters by 163 meters and contains almost 700,000 edges, collected over 4 days using 300 workers in a cluster requiring 1.1 billion collision checks.

The third paper makes several improvements over the original PRM-RL. First, we replace the hand-tuned DDPG with AutoRL-trained local planners, which results in improved long-range navigation. Second, it adds Simultaneous Localization and Mapping (SLAM) maps, which robots use at execution time, as a source for building the roadmaps. Because SLAM maps are noisy, this change closes the “sim2real gap”, a phonomena in robotics where simulation-trained agents significantly underperform when transferred to real-robots. Our simulated success rates are the same as in on-robot experiments. Last, we added distributed roadmap building, resulting in very large scale roadmaps containing up to 700,000 nodes.

We evaluated the method using our AutoRL agent, building roadmaps using the floor maps of offices up to 200x larger than the training environments, accepting edges with at least 90% success over 20 trials. We compared PRM-RL to a variety of different methods over distances up to 100m, well beyond the local planner range. PRM-RL had 2 to 3 times the rate of success over baseline because the nodes were connected appropriately for the robot’s capabilities.

Navigation over 100 meters success rates in several buildings. First paper -AutoRL local planner only (blue); original PRMs (red); path-guided artificial potential fields (yellow); second paper (green); third paper - PRMs with AutoRL (orange).

We tested PRM-RL on multiple real robots and real building sites. One set of tests are shown below; the robot is very robust except near cluttered areas and off the edge of the SLAM map.

On-robot experiments

Conclusion
Autonomous robot navigation can significantly improve independence of people with limited mobility. We can achieve this by development of easy-to-adapt robotic autonomy, including methods that can be deployed in new environments using information that it is already available. This is done by automating the learning of basic, short-range navigation behaviors with AutoRL and using these learned policies in conjunction with SLAM maps to build roadmaps. These roadmaps consist of nodes connected by edges that robots can traverse consistently. The result is a policy that once trained can be used across different environments and can produce a roadmap custom-tailored to the particular robot.

Acknowledgements
The research was done by, in alphabetical order, Hao-Tien Lewis Chiang, James Davidson, Aleksandra Faust, Marek Fiser, Anthony Francis, Jasmine Hsu, J. Chase Kew, Tsang-Wei Edward Lee, Ken Oslund, Oscar Ramirez from Robotics at Google and Lydia Tapia from University of New Mexico. We thank Alexander Toshev, Brian Ichter, Chris Harris, and Vincent Vanhoucke for helpful discussions.

Source: Google AI Blog

Long-Range Robotic Navigation via Automated Reinforcement Learning

Automating reinforcement learning with reward and neural network architecture search.

AutoRL local planner policy transfer to robots in real, unstructured environments

Roadmap being built with 3 Monte Carlo simulations per randomly selected node pair.

The largest map was 288 meters by 163 meters and contains almost 700,000 edges, collected over 4 days using 300 workers in a cluster requiring 1.1 billion collision checks.

We tested PRM-RL on multiple real robots and real building sites. One set of tests are shown below; the robot is very robust except near cluttered areas and off the edge of the SLAM map.

On-robot experiments

Source: Google AI Blog

Dopamine 2.0: providing more flexibility in reinforcement learning research

Reinforcement learning (RL) has become one of the most popular fields of machine learning, and has seen a number of great advances over the last few years. As a result, there is a growing need from both researchers and educators to have access to a clear and reliable framework for RL research and education.

Last August, we announced Dopamine, our framework for flexible reinforcement learning. For the initial version we decided to focus on a specific type of RL research: value-based agents evaluated on the Atari 2600 framework supported by the Arcade Learning Environment. We were thrilled to see how well it was received by the community, including a live coding session, its inclusion in a recently-announced benchmark for RL, considered as the top “Cool new open source project of 2018” by the Octoverse, and over 7K GitHub stars on our repository.

One of the most common requests we have received is support for more environments. This confirms what we have seen internally, where simpler environments, such as those supported by OpenAI’s Gym, are incredibly useful when testing out new algorithms. We are happy to announce Dopamine 2.0, which includes support for discrete-domain gym environments (e.g. discrete states and actions). The core of the framework remains unchanged, we have simply generalized the interface with the environment. For backwards compatibility, users will still be able to download version 1.0.

We include default configurations for two classic control environments: CartPole and Acrobot; on these environments one can train a Dopamine agent in minutes. When compared with the training time for a standard Atari 2600 game (around 5 days on a standard GPU), these environments allow researchers to iterate much faster on research ideas before testing them out on larger Atari games. We also include a Colaboratory that illustrates how to train an agent on Cartpole and Acrobot. Finally, our GymPreprocessing class serves as an example for how to use Dopamine with other custom environments.

We are excited by the new opportunities enabled by Dopamine 2.0, and look forward to seeing what the research community creates with it!

By Pablo Samuel Castro and Marc G. Bellemare, Dopamine Team

Source: Google Open Source Blog

Expanding the Application of Deep Learning to Electronic Health Records

Posted by Alvin Rajkomar, MD and Eyal Oren, PhD, Google AI, Healthcare

In 2018 we published a paper that showed how machine learning, when applied to medical records, can predict what might happen to patients who are hospitalized: for example, how long they would need to be in the hospital and, if discharged, how likely they would be to come back unexpectedly. Predictive models of various kinds have already been deployed in hospital settings by others, and our work aims to further improve potential clinical benefit by using new models that can make predictions faster, more accurate, and more adaptable for a broader range of clinical contexts.

Any endeavor to demonstrate the promise of machine learning requires intense collaboration between engineers, doctors, and medical researchers to make sure the work benefits patients, physicians, and health systems, and that it is equitable. Google is already fortunate to partner with some of the best academic medical centers in the world and we are now expanding this work to include Intermountain Healthcare, based in Utah.

The initial collaboration will focus on understanding how Google might adapt machine learning predictions to the various Intermountain care settings, from primary care clinics to the TeleHealth critical care unit, which remotely monitors critically ill patients in surrounding hospitals. We see potential in exploring how scalable computing platforms that include predictions might assist clinical teams in providing the best possible care.

As with our previous research, we will begin with jointly testing the performance of machine learning models on historical records, following strict policies to ensure that all data privacy and security measures are followed.

We are excited to explore how scalable computing platforms that include predictions might assist clinical teams in providing the best possible care in these settings. We additionally hope to further validate that our approach to predictions can work across health systems and improve care for patients.

Source: Google AI Blog

Soft Actor-Critic: Deep Reinforcement Learning for Robotics

Posted by Tuomas Haarnoja, Student Researcher and Sergey Levine, Faculty Advisor, Robotics at Google

Deep reinforcement learning (RL) provides the promise of fully automated learning of robotic behaviors directly from experience and interaction in the real world, due to its ability to process complex sensory input using general-purpose neural network representations. However, many existing RL algorithms require days or weeks (or more) worth of real-world data in order to converge to the desired behavior. Furthermore, such systems can be tough to deploy on complex robotic systems (such as legged robots) which can easily get damaged during the exploration phase, hyperparameter settings can be challenging to tune, and various safety considerations can introduce further limitations.

In collaboration with UC Berkeley, we recently released Soft Actor-Critic (SAC), a stable and efficient deep RL algorithm suitable for real-world robotic skill learning that is well-aligned with the requirements of robotic experimentation. Importantly, SAC is efficient enough to solve real-world robot tasks in only a handful of hours, and works on a variety of environments with a single set of hyperparameters. Below, we discuss some of the research behind SAC, and also describe some of our recent experiments.

Requirements for Real-World Robotic Learning
Real-world robotic experimentation brings significant challenges, such as constant interruptions in the data stream due to hardware failures and manual resets, and smooth exploration to avoid mechanical wear and tear on the robot, which set additional restrictions to both the algorithm and its implementation, including (but not limited to):

Good sample efficiency to lower the learning time
Minimal number of hyperparameters that require tuning
Reusing already collected data on different scenarios (known as off-policy learning)
Ensuring that learning and exploration does not damage the hardware

Soft Actor-Critic
Soft actor-critic is based on maximum entropy reinforcement learning, a framework that aims to both maximize the expected reward (which is the standard RL objective) and to maximize the policy's entropy. Policies with higher entropy are more random, which intuitively means that maximum entropy reinforcement learning prefers the most random policy that still achieves a high reward.

Why might this be desirable for robotic learning? The most obvious reason is that policies optimized for maximum entropy will be more robust: if the policy can tolerate highly random behavior during training, it is more likely to respond successfully to unexpected perturbations at test time. However, a more subtle reason is that training for maximum entropy can improve both the algorithm's robustness to hyperparameters and its sample efficiency (to learn more, see this BAIR blog post, and this tutorial).

Soft actor-critic maximizes the entropy augmented reward by learning a stochastic policy that maps states to actions and a Q-function that estimates the objective value of the current policy, optimizing them using approximate dynamic programming. In doing so, SAC views the objective as a grounded way to derive better reinforcement learning algorithms that perform consistently and are sample efficient enough to be applicable to real-world robotic applications. For technical details please see our technical report.

Performance of SAC
We evaluated SAC with two tasks: 1) quadrupedal walking with the Minitaur robot from Ghost Robotics, and 2) rotating a valve with a three finger Dynamixel Claw. Learning to walk presents a substantial challenge, as the robot is underactuated, and must therefore delicately balance contact forces on the legs to make forward progress. An untrained policy can lose balance and fall, and too many falls will eventually damage the robot, making sample-efficient learning essential.

Although we trained our policy only on flat terrain, we subsequently tested it on varied terrains and obstacles. In principle, policies learned with soft actor-critic should be robust to test-time perturbations, because they are trained to maximize entropy (i.e., inject maximal noise) at training-time. Indeed, we observe that the policies learned with our method are robust to these perturbations without any additional learning.

Illustration of learned walking, using SAC implemented on the Minitaur robot. A full video of the learning process can be found at our project website.

The manipulation task requires the hand to rotate a valve-like object so that the colored peg faces to the right, as shown below. This task is exceptionally challenging due to both the perception challenges and the need to control a hand with 9 degrees of freedom. In order to perceive the valve, the robot must use raw RGB images shown in the inset at the bottom right. The initial position of the valve is reset uniformly at random for each episode, forcing the policy to learn to use the raw RGB images to perceive the current valve orientation.

Soft actor-critic solves both of these tasks quickly: the Minitaur locomotion takes 2 hours, and the valve-turning task from image observations takes 20 hours. We also learned a policy for the valve-turning task without images by providing the actual valve position as an observation to the policy. Soft actor-critic can learn this easier version of the valve task in 3 hours. For comparison, prior work has used natural policy gradients to learn the same task without images in 7.4 hours.

Conclusion
Our work demonstrates that deep reinforcement learning based on maximum entropy framework can be applied to learn robot skills in challenging real-world settings. Since the policies are learned directly in the real world, they exhibit robustness to variations in the environment, which can be difficult to obtain otherwise. We also showed that we can learn directly from high-dimensional image observations, which represents a significant challenge in classical robotics. We hope that the release of SAC helps other research teams in their effort to adopt deep RL for more complex real-world tasks in the future.

For more technical details, please visit the BAIR blog post, or read an early preprint of the locomotion experiment and a more complete description of the algorithm. You can find the implementation on GitHub.

Acknowledgements
This research was done in collaboration between Google and UC Berkeley. We would like to thank all the people who were involved, including Sehoon Ha, Kristian Hartikainen, Jie Tan, George Tucker, Vincent Vanhoucke and Aurick Zhou.

Source: Google AI Blog

Exploring Quantum Neural Networks

Posted by Jarrod McClean, Senior Research Scientist and Hartmut Neven, Director of Engineering, Google AI Quantum Team

Since its inception, the Google AI Quantum team has pushed to understand the role of quantum computing in machine learning. The existence of algorithms with provable advantages for global optimization suggest that quantum computers may be useful for training existing models within machine learning more quickly, and we are building experimental quantum computers to investigate how intricate quantum systems can carry out these computations. While this may prove invaluable, it does not yet touch on the tantalizing idea that quantum computers might be able to provide a way to learn more about complex patterns in physical systems that conventional computers cannot in any reasonable amount of time.

Today we talk about two recent papers from the Google AI Quantum team that make progress towards understanding the power of quantum computers for learning tasks. The first constructs a quantum model of neural networks to investigate how a popular classification task might be carried out on quantum processors. In the second paper, we show how peculiar features of quantum geometry change the strategies for training these networks in comparison to their classical counterparts, and offer guidance towards more robust training of these networks.

In “Classification with Quantum Neural Networks on Near Term Processors”, we construct a model of quantum neural networks (QNNs) that is specifically designed to work on quantum processors that are expected to be available in the near term. While the current work is primarily theoretical, their structure facilitates implementation and testing on quantum computers in the immediate future. These QNNs can be adapted through supervised learning of labeled data, and we show that it is possible to train a QNN to classify images in the famous MNIST dataset. Follow up work in this area with larger quantum devices may pit the ability of quantum networks to learn patterns against popular classical networks.

Quantum Neural Network for classification. Here we depict a sample quantum neural network, where in contrast to hidden layers in classical deep neural networks, the boxes represent entangling actions, or “quantum gates”, on qubits. In a superconducting qubit setup this could be enacted through a microwave control pulse corresponding to each box.

In “Barren Plateaus in Quantum Neural Network Training Landscapes”, we focus on the training of quantum neural networks, and probe questions related to a key difficulty in classical neural networks, which is the problem of vanishing or exploding gradients. In conventional neural networks, a good unbiased initial guess for the neuron weights often involves randomization, although there can be some difficulties as well. Our paper shows that peculiar features of quantum geometry unequivocally prevent this from being a good strategy in the quantum case, instead taking you to barren plateaus. The implications of this work may guide future strategies for initializing and training quantum neural networks.

QNN vanishing gradient: concentration of measure in high dimensional spaces. In very high dimensional spaces, such as those explored by quantum computers, the vast majority of states counterintuitively sit near the equator of the hypersphere (left). This means that any smooth function on this space will tend to take a value very close to its mean with overwhelming probability when selected at random (right).

This research sets the stage for improvements in both the construction and training of quantum neural networks. In particular, experimental realizations of quantum neural networks using hardware at Google will enable rapid exploration of quantum neural networks in the near term. We hope that the insights from the geometry of these states will lead to new algorithms to train these networks that will be essential to unlocking their full potential.

Source: Google AI Blog

Grasp2Vec: Learning Object Representations from Self-Supervised Grasping

Posted by Eric Jang, Software Engineer, Robotics at Google and Coline Devin, Berkeley PhD Student and former Research Intern

From a remarkably young age, people are capable of recognizing their favorite objects and picking them up, despite never being explicitly taught how to do so. According to cognitive developmental research, the ability to interact with objects in the world plays a crucial role in the emergence of object perception and manipulation capabilities, such as targeted grasping. By interacting with the world around them, people are able to learn with self-supervision: we know what actions we took, and we learn from the outcome. In robotics, this type of self-supervised learning is actively researched because it enables robotic systems to learn without the need for large amounts of training data or manual supervision.

Inspired by the concept of object permanence, we propose Grasp2Vec, a simple yet highly effective algorithm for acquiring object representations. Grasp2Vec is based on the intuition that an attempt to pick up anything provides several pieces of information — if a robot grasps an object and holds it up, the object had to be in the scene before the grasp. Furthermore, the robot knows that the object it grasped is currently in its gripper, and therefore has been removed from the scene. By using this form of self supervision, the robot can learn to recognize the object by the visual change in the scene after the grasp.

Building on our prior collaboration with X Robotics, where a series of robots learn in parallel to grasp household objects using only monocular camera inputs, we use a robotic arm to grasp objects “unintentionally”, and that experience enables the learning of a rich representation of objects. These representations can then be used to acquire “intentional grasping” capabilities, where the robot arm can then pick up user-commanded objects.

Constructing a Perceptual Reward Function
In the framework of reinforcement learning (RL), task success is measured via a “reward function”. By maximizing that reward, robots can teach themselves diverse grasping skills from scratch. Engineering a reward function is easy when success can be measured by simple sensor measurements. A simple example of this is a button that supplies rewards directly to a robot when it is pushed.

However, engineering a reward function is much more difficult when our success criteria depends on perceptual understanding of the task at hand. Consider the task of instance grasping, where a robot is presented a picture of a desired object being held in the gripper. After the robot attempts to grasp that object, it inspects the contents of the gripper. The reward function for this task comes down to answering the question of object recognition: Do these objects match?

On the left, the gripper is holding the brush and there are some objects (yellow cup, blue plastic block) in the background. On the right, the gripper is holding the yellow cup and the brush is in the background. If the left image was the desired outcome, a good reward function should “understand” that the two images above correspond to different objects.

In order to solve this recognition problem, we need a perception system that extracts meaningful object concepts from unstructured image data (without any human annotations), learning the visual perception of objects in an unsupervised fashion. At their core, unsupervised learning algorithms work because they make structural assumptions about data. It is common to assume that images can be compressed into a low-dimensional space, and that frames in a video can be predicted from previous frames. However, without further assumptions on the content of the data, these are usually insufficient for learning disentangled object representations.

What if we used a robot to physically disentangle objects from each other during data collection? The field of robotics presents an exciting opportunity for representation learning because robots can manipulate objects, thus providing the factors of variation needed in data. Our method relies on the insight that grasping an object removes it from the scene. This yields 1) an image of the scene before grasping, 2) an image of the scene after grasping and 3) an isolated view of the grasped object itself.

Left: Objects before the grasp. Center: Objects after the grasp. Right: The Grasped object.

If we then consider an embedding function that extracts “the set of objects” from images, it should preserve the following subtractive relation:

objects_before_grasp - objects_after_grasp = grasped_object

We implement this equality relation using a fully convolutional architecture and a simple metric learning algorithm. At training time, the architecture shown below embeds the pre-grasp images and post-grasp images into a dense spatial feature map. The maps are mean-pooled into vectors and the difference between the “before grasp” and “after grasp” vectors represents a set of objects. This vector and the corresponding vector representation of the grasped object are pushed to equivalence via the N-Pairs objective.

Add caption

Once trained, two useful properties emerge naturally from our model.

1. Object Similarity
The first property is that a cosine distance between vector embeddings allows us to compare objects and determine whether they are identical. This can be used to implement reward functions for reinforcement learning, and allow robots to learn instance grasping without human-provided labels.

2. Localizing Target Objects
The second property is that we can combine scene spatial maps and object embeddings to localize a “query object” in image space. By taking the element-wise product of spatial feature maps and the vector corresponding to the query object, we can find all the pixels in the spatial map that “match” the query object.

Using Grasp2Vec embeddings to localize objects in a scene. The image on the top left shows the objects in the bin. On the bottom left is the query object we wish to grasp. By taking the dot product of the query object vector with the spatial features of the scene image, we get a per-pixel “activation map” (top right image) of how similar that region of the image is to the query. This response map can be used to approach the object for grasping.

Our method also works when there are multiple objects that match the query object, or even if the query consists of multiple objects (the average of two vectors). For example, here is a scenario where it detects multiple orange blocks in a scene.

The resulting “heatmap” can be used to plan the robot approach to the target object(s). We combine Grasp2Vec’s localization and instance recognition capabilities with our “grasp anything” policies to obtain a success rate of 80% on objects seen during data collection and 59% on novel objects the robot hasn’t encountered before.

Conclusion
In our paper, we show how robotic grasping skills can generate the data used for learning object-centric representations. We then can use representation learning to “bootstrap” more complex skills like instance grasping, all while retaining the self-supervised learning properties of our autonomous grasping system.

Besides our own work, a number of recent papers have also studied how self-supervised interaction can be used to acquire representations, by grasping, pushing, and otherwise manipulating objects in the environment. Going forward, we are excited not only for what machine learning can bring to robotics by way of better perception and control, but also what robotics can bring to machine learning in new paradigms of self-supervision.

Acknowledgements
This research was conducted by Eric Jang, Coline Devin, Vincent Vanhoucke, and Sergey Levine. We’d like to thank Adrian Li, Alex Irpan, Anthony Brohan, Chelsea Finn, Christian Howard, Corey Lynch, Dmitry Kalashnikov, Ian Wilkes, Ivonne Fajardo, Julian Ibarz, Ming Zhao, Peter Pastor, Pierre Sermanet, Stephen James, Tsung-Yi Lin, Yunfei Bai, and many others at Google, X, and the broader robotics community who contributed to improving this work.

Source: Google AI Blog

Google at NeurIPS 2018

Posted by Slav Petrov, Principal Scientist, Google

This week, Montréal hosts the 32^nd annual Conference on Neural Information Processing Systems (NeurIPS 2018), the biggest machine learning conference of the year. The conference includes invited talks, demonstrations and presentations of some of the latest in machine learning research. Google will have a strong presence at NeurIPS 2018, with more than 400 Googlers attending in order to contribute to, and learn from, the broader academic research community via talks, posters, workshops, competitions and tutorials. We will be presenting work that pushes the boundaries of what is possible in language understanding, translation, speech recognition and visual & audio perception, with Googlers co-authoring nearly 100 accepted papers (see below).

At the forefront of machine learning, Google is actively exploring virtually all aspects of the field spanning both theory and applications. This research is often inspired by real product needs but increasingly more often driven by scientific curiosity. Given the range of research projects that we pursue, we have found it useful to define a new framework which helps crystalize the goals of projects and allows us to measure progress and success in appropriate ways. Our contributions to NeurIPS and to the broader research community in general are integral to our research mission.

If you are attending NeurIPS 2018, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving the world's most challenging research problems, and to see demonstrations of some of the exciting research we pursue. You can also learn more about our work being presented in the list below (Googlers highlighted in blue).

Google is a Platinum Sponsor of NeurIPS 2018.

NeurIPS Foundation Board
Corinna Cortes, John C. Platt, Fernando Pereira

NeurIPS Organizing Committee
General Chair: Samy Bengio
Program Co-Chair: Hugo Larochelle
Party Chair: Douglas Eck
Diversity and Inclusion Co-Chair: Katherine A. Heller

NeurIPS Program Committee
Senior Area Chairs include:Angela Yu, Claudio Gentile, Cordelia Schmid, Corinna Cortes, Csaba Szepesvari, Dale Schuurmans, Elad Hazan, Mehryar Mohri, Raia Hadsell, Satyen Kale, Yishay Mansour, Afshin Rostamizadeh, Alex Kulesza

Area Chairs include: Amin Karbasi, Amir Globerson, Amit Daniely, Andras Gyorgy, Andriy Mnih, Been Kim, Branislav Kveton, Ce Liu, D Sculley, Danilo Rezende, Danny Tarlow, David Balduzzi, Denny Zhou, Dilan Gorur, Dumitru Erhan, George Dahl, Graham Taylor, Ian Goodfellow, Jasper Snoek, Jean-Philippe Vert, Jia Deng, Jon Shlens, Karen Simonyan, Kevin Swersky, Kun Zhang, Lihong Li, Marc G. Bellemare, Marco Cuturi, Maya Gupta, Michael Bowling, Michalis Titsias, Mohammad Norouzi, Mouhamadou Moustapha Cisse, Nicolas Le Roux, Remi Munos, Sanjiv Kumar, Sanmi Koyejo, Sergey Levine, Silvia Chiappa, Slav Petrov, Surya Ganguli, Timnit Gebru, Timothy Lillicrap, Viren Jain, Vitaly Feldman, Vitaly Kuznetsov

Workshops Program Committee includes: Mehryar Mohri, Sergey Levine

Accepted Papers
3D-Aware Scene Manipulation via Inverse Graphics
Shunyu Yao, Tzu Ming Harry Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, William T. Freeman, Joshua B. Tenenbaum

A Retrieve-and-Edit Framework for Predicting Structured Outputs
Tatsunori Hashimoto, Kelvin Guu, Yonatan Oren, Percy Liang

Adversarial Attacks on Stochastic Bandits
Kwang-Sung Jun, Lihong Li, Yuzhe Ma, Xiaojin Zhu

Adversarial Examples that Fool both Computer Vision and Time-Limited Humans
Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, Jascha Sohl-Dickstein

Adversarially Robust Generalization Requires More Data
Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, Aleksander Madry

Are GANs Created Equal? A Large-Scale Study
Mario Lucic, Karol Kurach, Marcin Michalski, Olivier Bousquet, Sylvain Gelly

Collaborative Learning for Deep Neural Networks
Guocong Song, Wei Chai

Completing State Representations using Spectral Learning
Nan Jiang, Alex Kulesza, Santinder Singh

Content Preserving Text Generation with Attribute Controls
Lajanugen Logeswaran, Honglak Lee, Samy Bengio

Context-aware Synthesis and Placement of Object Instances
Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz

Co-regularized Alignment for Unsupervised Domain Adaptation
Abhishek Kumar, Prasanna Sattigeri, Kahini Wadhawan, Leonid Karlinsky, Rogerlo Feris, William T. Freeman, Gregory Wornell

cpSGD: Communication-efficient and differentially-private distributed SGD
Naman Agarwal, Ananda Theertha Suresh, Felix Yu, Sanjiv Kumar, H. Brendan Mcmahan

Data Center Cooling Using Model-Predictive Control
Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, Greg Imwalle

Data-Efficient Hierarchical Reinforcement Learning
Ofir Nachum, Shixiang Gu, Honglak Lee, Sergey Levine

Deep Attentive Tracking via Reciprocative Learning
Shi Pu, Yibing Song, Chao Ma, Honggang Zhang, Ming-Hsuan Yang

Generalizing Point Embeddings Using the Wasserstein Space of Elliptical Distributions
Boris Muzellec, Marco Cuturi

GLoMo: Unsupervised Learning of Transferable Relational Graphs
Zhilin Yang, Jake (Junbo) Zhao, Bhuwan Dhingra, Kaiming He, William W. Cohen, Ruslan Salakhutdinov, Yann LeCun

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking
Patrick Chen, Si Si, Yang Li, Ciprian Chelba, Cho-Jui Hsieh

Interpreting Neural Network Judgments via Minimal, Stable, and Symbolic Corrections
Xin Zhang, Armando Solar-Lezama, Rishabh Singh

Learning Hierarchical Semantic Image Manipulation through Structured Representations
Seunghoon Hong, Xinchen Yan, Thomas Huang, Honglak Lee

Learning Temporal Point Processes via Reinforcement Learning
Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, Le Song

Learning Towards Minimum Hyperspherical Energy
Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, Le Song

Mesh-TensorFlow: Deep Learning for Supercomputers
Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman

MiME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare
Edward Choi, Cao Xiao, Walter F. Stewart, Jimeng Sun

Searching for Efficient Multi-Scale Architectures for Dense Image Prediction
Liang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, Jonathon Shlens

SplineNets: Continuous Neural Decision Graphs
Cem Keskin, Shahram Izadi

Task-Driven Convolutional Recurrent Models of the Visual System
Aran Nayebi, Daniel Bear, Jonas Kubilius, Kohitij Kar, Surya Ganguli, David Sussillo, James J. DiCarlo, Daniel L. K. Yamins

To Trust or Not to Trust a Classifier
Heinrich Jiang, Been Kim, Melody Guan, Maya Gupta

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

Algorithms and Theory for Multiple-Source Adaptation
Judy Hoffman, Mehryar Mohri, Ningshan Zhang

A Lyapunov-based Approach to Safe Reinforcement Learning
Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, Mohammad Ghavamzadeh

Adaptive Methods for Nonconvex Optimization
Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, Sanjiv Kumar

Assessing Generative Models via Precision and Recall
Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, Sylvain Gelly

A Loss Framework for Calibrated Anomaly Detection
Aditya Menon, Robert Williamson

Blockwise Parallel Decoding for Deep Autoregressive Models
Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
Qiang Liu, Lihong Li, Ziyang Tang, Dengyong Zhou

Contextual Pricing for Lipschitz Buyers
Jieming Mao, Renato Leme, Jon Schneider

Coupled Variational Bayes via Optimization Embedding
Bo Dai, Hanjun Dai, Niao He, Weiyang Liu, Zhen Liu, Jianshu Chen, Lin Xiao, Le Song

Data Amplification: A Unified and Competitive Approach to Property Estimation
Yi HAO, Alon Orlitsky, Ananda Theertha Suresh, Yihong Wu

Deep Network for the Integrated 3D Sensing of Multiple People in Natural Images
Elisabeta Marinoiu, Mihai Zanfir, Alin-Ionut Popa, Cristian Sminchisescu

Deep Non-Blind Deconvolution via Generalized Low-Rank Approximation
Wenqi Ren, Jiawei Zhang, Lin Ma, Jinshan Pan, Xiaochun Cao, Wei Liu, Ming-Hsuan Yang

Diminishing Returns Shape Constraints for Interpretability and Regularization
Maya Gupta, Dara Bahri, Andrew Cotter, Kevin Canini

DropBlock: A Regularization Method for Convolutional Networks
Golnaz Ghiasi, Tsung-Yi Lin, Quoc V. Le

Generalization Bounds for Uniformly Stable Algorithms
Vitaly Feldman, Jan Vondrak

Geometrically Coupled Monte Carlo Sampling
Mark Rowland, Krzysztof Choromanski, Francois Chalus, Aldo Pacchiano, Tamas Sarlos, Richard E. Turner, Adrian Weller

GILBO: One Metric to Measure Them All
Alexander A. Alemi, Ian Fischer

Insights on Representational Similarity in Neural Networks with Canonical Correlation
Ari S. Morcos, Maithra Raghu, Samy Bengio

Improving Online Algorithms via ML Predictions
Manish Purohit, Zoya Svitkina, Ravi Kumar

Learning to Exploit Stability for 3D Scene Parsing
Yilun Du, Zhijan Liu, Hector Basevi, Ales Leonardis, William T. Freeman, Josh Tenembaum, Jiajun Wu

Maximizing Induced Cardinality Under a Determinantal Point Process
Jennifer Gillenwater, Alex Kulesza, Sergei Vassilvitskii, Zelda Mariet

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing
Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V. Le, Ni Lao

PCA of High Dimensional Random Walks with Comparison to Neural Network Training
Joseph M. Antognini, Jascha Sohl-Dickstein

Predictive Approximate Bayesian Computation via Saddle Points
Yingxiang Yang, Bo Dai, Negar Kiyavash, Niao He

Recurrent World Models Facilitate Policy Evolution
David Ha, Jürgen Schmidhuber

Sanity Checks for Saliency Maps
Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, Been Kim

Simple, Distributed, and Accelerated Probabilistic Programming
Dustin Tran, Matthew Hoffman, Dave Moore, Christopher Suter, Srinivas Vasudevan, Alexey Radul, Matthew Johnson, Rif A. Saurous

Tangent: Automatic Differentiation Using Source-Code Transformation for Dynamically Typed Array Programming
Bart van Merriënboer, Dan Moldovan, Alex Wiltschko

The Emergence of Multiple Retinal Cell Types Through Efficient Coding of Natural Movies
Samuel A. Ocko, Jack Lindsey, Surya Ganguli, Stephane Deny

The Everlasting Database: Statistical Validity at a Fair Price
Blake Woodworth, Vitaly Feldman, Saharon Rosset, Nathan Srebro

The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network
Jeffrey Pennington, Pratik Worah

A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks
Kimin Lee, Kibok Lee, Honglak Lee, Jinwoo Shin

Autoconj: Recognizing and Exploiting Conjugacy Without a Domain-Specific Language
Matthew D. Hoffman, Matthew Johnson, Dustin Tran

A Bayesian Nonparametric View on Count-Min Sketch
Diana Cai, Michael Mitzenmacher, Ryan Adams (no longer at Google)

Automatic Differentiation in ML: Where We are and Where We Should be Going
Bart van Merriënboer, Olivier Breuleux, Arnaud Bergeron, Pascal Lamblin

Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures
Sergey Bartunov, Adam Santoro, Blake A. Richards, Geoffrey E. Hinton, Timothy P. Lillicrap

Deep Generative Models for Distribution-Preserving Lossy Compression
Michael Tschannen, Eirikur Agustsson, Mario Lucic

Deep Structured Prediction with Nonlinear Output Transformations
Colin Graber, Ofer Meshi, Alexander Schwing

Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning
Supasorn Suwajanakorn, Noah Snavely, Jonathan Tompson, Mohammad Norouzi

Transfer Learning with Neural AutoML
Catherine Wong, Neil Houlsby, Yifeng Lu, Andrea Gesmundo

Efficient Gradient Computation for Structured Output Learning with Rational and Tropical Losses
Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, Dmitry Storcheus, Scott Yang

Cooperative neural networks (CoNN): Exploiting prior independence structure for improved classification
Harsh Shrivastava, Eugene Bart, Bob Price, Hanjun Dai, Bo Dai, Srinivas Aluru

Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization
Blake Woodworth, Jialei Wang, Brendan McMahan, Nathan Srebro

Hierarchical Reinforcement Learning for Zero-shot Generalization with Subtask Dependencies
Sungryull Sohn, Junhyuk Oh, Honglak Lee

Human-in-the-Loop Interpretability Prior
Isaac Lage, Andrew Slavin Ross, Been Kim, Samuel J. Gershman, Finale Doshi-Velez

Joint Autoregressive and Hierarchical Priors for Learned Image Compression
David Minnen, Johannes Ballé, George D Toderici

Large-Scale Computation of Means and Clusters for Persistence Diagrams Using Optimal Transport
Théo Lacombe, Steve Oudot, Marco Cuturi

Learning to Reconstruct Shapes from Unseen Classes
Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Joshua B. Tenenbaum, William T. Freeman, Jiajun Wu

Large Margin Deep Networks for Classification
Gamaleldin Fathy Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, Samy Bengio

Mallows Models for Top-k Lists
Flavio Chierichetti, Anirban Dasgupta, Shahrzad Haddadan, Ravi Kumar, Silvio Lattanzi

Meta-Learning MCMC Proposals
Tongzhou Wang, YI WU, Dave Moore, Stuart Russell

Non-delusional Q-Learning and Value-Iteration
Tyler Lu, Dale Schuurmans, Craig Boutilier

Online Learning of Quantum States
Scott Aaronson, Xinyi Chen, Elad Hazan, Satyen Kale, Ashwin Nayak

Online Reciprocal Recommendation with Theoretical Performance Guarantees
Fabio Vitale, Nikos Parotsidis, Claudio Gentile

Optimal Algorithms for Continuous Non-monotone Submodular and DR-Submodular Maximization
Rad Niazadeh, Tim Roughgarden, Joshua R. Wang

Policy Regret in Repeated Games
Raman Arora, Michael Dinitz, Teodor Vanislavov Marinov, Mehryar Mohri

Provable Variational Inference for Constrained Log-Submodular Models
Josip Djolonga, Stefanie Jegelka, Andreas Krause

Realistic Evaluation of Deep Semi-Supervised Learning Algorithms
Avital Oliver, Augustus Odena, Colin Raffel, Ekin D. Cubuk, Ian J. Goodfellow

Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion
Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, Honglak Lee

Visual Object Networks: Image Generation with Disentangled 3D Representations
JunYan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum, William T. Freeman

Watch Your Step: Learning Node Embeddings via Graph Attention
Sami Abu-El-Haija, Bryan Perozzi, Rami AlRfou, Alexander Alemi

Workshops
2nd Workshop on Machine Learning on the Phone and Other Consumer Devices
Co-Chairs include: Sujith Ravi, Wei Chai, Hrishikesh Aradhye

Bayesian Deep Learning
Workshop Organizers include: Kevin Murphy

Continual Learning
Workshop Organizers include: Marc Pickett

The Second Conversational AI Workshop – Today's Practice and Tomorrow's Potential
Workshop Organizers include: Dilek Hakkani-Tur

Visually Grounded Interaction and Language
Workshop Organizers include: Olivier Pietquin

Workshop on Ethical, Social and Governance Issues in AI
Workshop Organizers include: D. Sculley

AI for Social Good
Workshop Program Committee includes: Samuel Greydanus

Black in AI
Workshop Organizers: Mouhamadou Moustapha Cisse, Timnit Gebru
Program Committee: Irwan Bello, Samy Bengio, Ian Goodfellow, Hugo Larochelle, Margaret Mitchell

Interpretability and Robustness in Audio, Speech, and Language
Workshop Organizers include: Ehsan Variani, Bhuvana Ramabhadran

LatinX in AI
Workshop Organizers includes: Pablo Samuel Castro
Program Committee includes: Sergio Guadarrama

Machine Learning for Systems
Workshop Organizers include: Anna Goldie, Azalia Mirhoseini, Kevin Swersky, Milad Hashemi
Program Committee includes: Simon Kornblith, Nicholas Frosst, Amir Yazdanbakhsh, Azade Nazi, James Bradbury, Sharan Narang, Martin Maas, Carlos Villavieja

Queer in AI
Workshop Organizers include: Raphael Gontijo Lopes

Second Workshop on Machine Learning for Creativity and Design
Workshop Organizers include: Jesse Engel, Adam Roberts

Workshop on Security in Machine Learning
Workshop Organizers include: Nicolas Papernot

Tutorial
Visualization for Machine Learning
Fernanda Viégas, Martin Wattenberg

Source: Google AI Blog

Learning to Predict Depth on the Pixel 3 Phones

Posted by Rahul Garg, Research Scientist and Neal Wadhwa, Software Engineer

Portrait Mode on the Pixel smartphones lets you take professional-looking images that draw attention to a subject by blurring the background behind it. Last year, we described, among other things, how we compute depth with a single camera using its Phase-Detection Autofocus (PDAF) pixels (also known as dual-pixel autofocus) using a traditional non-learned stereo algorithm. This year, on the Pixel 3, we turn to machine learning to improve depth estimation to produce even better Portrait Mode results.

Left: The original HDR+ image. Right: A comparison of Portrait Mode results using depth from traditional stereo and depth from machine learning. The learned depth result has fewer errors. Notably, in the traditional stereo result, many of the horizontal lines behind the man are incorrectly estimated to be at the same depth as the man and are kept sharp.
(Mike Milne)

A Short Recap
As described in last year’s blog post, Portrait Mode uses a neural network to determine what pixels correspond to people versus the background, and augments this two layer person segmentation mask with depth information derived from the PDAF pixels. This is meant to enable a depth-dependent blur, which is closer to what a professional camera does.

PDAF pixels work by capturing two slightly different views of a scene, shown below. Flipping between the two views, we see that the person is stationary, while the background moves horizontally, an effect referred to as parallax. Because parallax is a function of the point’s distance from the camera and the distance between the two viewpoints, we can estimate depth by matching each point in one view with its corresponding point in the other view.

The two PDAF images on the left and center look very similar, but in the crop on the right you can see the parallax between them. It is most noticeable on the circular structure in the middle of the crop.

However, finding these correspondences in PDAF images (a method called depth from stereo) is extremely challenging because scene points barely move between the views. Furthermore, all stereo techniques suffer from the aperture problem. That is, if you look at the scene through a small aperture, it is impossible to find correspondence for lines parallel to the stereo baseline, i.e., the line connecting the two cameras. In other words, when looking at the horizontal lines in the figure above (or vertical lines in portrait orientation shots), any proposed shift of these lines in one view with respect to the other view looks about the same. In last year’s Portrait Mode, all these factors could result in errors in depth estimation and cause unpleasant artifacts.

Improving Depth Estimation
With Portrait Mode on the Pixel 3, we fix these errors by utilizing the fact that the parallax used by depth from stereo algorithms is only one of many depth cues present in images. For example, points that are far away from the in-focus plane appear less sharp than ones that are closer, giving us a defocus depth cue. In addition, even when viewing an image on a flat screen, we can accurately tell how far things are because we know the rough size of everyday objects (e.g. one can use the number of pixels in a photograph of a person’s face to estimate how far away it is). This is called a semantic cue.

Designing a hand-crafted algorithm to combine these different cues is extremely difficult, but by using machine learning, we can do so while also better exploiting the PDAF parallax cue. Specifically, we train a convolutional neural network, written in TensorFlow, that takes as input the PDAF pixels and learns to predict depth. This new and improved ML-based method of depth estimation is what powers Portrait Mode on the Pixel 3.

Our convolutional neural network takes as input the PDAF images and outputs a depth map. The network uses an encoder-decoder style architecture with skip connections and residual blocks.

Training the Neural Network
In order to train the network, we need lots of PDAF images and corresponding high-quality depth maps. And since we want our predicted depth to be useful for Portrait Mode, we also need the training data to be similar to pictures that users take with their smartphones.

To accomplish this, we built our own custom “Frankenphone” rig that contains five Pixel 3 phones, along with a Wi-Fi-based solution that allowed us to simultaneously capture pictures from all of the phones (within a tolerance of ~2 milliseconds). With this rig, we computed high-quality depth from photos by using structure from motion and multi-view stereo.

Left: Custom rig used to collect training data. Middle: An example capture flipping between the five images. Synchronization between the cameras ensures that we can calculate depth for dynamic scenes, such as this one. Right: Ground truth depth. Low confidence points, i.e., points where stereo matches are not reliable due to weak texture, are colored in black and are not used during training. (Sam Ansari and Mike Milne)

The data captured by this rig is ideal for training a network for the following main reasons:

Five viewpoints ensure that there is parallax in multiple directions and hence no aperture problem.
The arrangement of the cameras ensures that a point in an image is usually visible in at least one other image resulting in fewer points with no correspondences.
The baseline, i.e., the distance between the cameras is much larger than our PDAF baseline resulting in more accurate depth estimation.
Synchronization between the cameras ensure that we can calculate depth for dynamic scenes like the one above.
Portability of the rig ensures that we can capture photos in the wild simulating the photos users take with their smartphones.

However, even though the data captured from this rig is ideal, it is still extremely challenging to predict the absolute depth of objects in a scene — a given PDAF pair can correspond to a range of different depth maps (depending on lens characteristics, focus distance, etc). To account for this, we instead predict the relative depths of objects in the scene, which is sufficient for producing pleasing Portrait Mode results.

Putting it All Together
This ML-based depth estimation needs to run fast on the Pixel 3, so that users don’t have to wait too long for their Portrait Mode shots. However, to get good depth estimates that makes use of subtle defocus and parallax cues, we have to feed full resolution, multi-megapixel PDAF images into the network. To ensure fast results, we use TensorFlow Lite, a cross-platform solution for running machine learning models on mobile and embedded devices and the Pixel 3’s powerful GPU to compute depth quickly despite our abnormally large inputs. We then combine the resulting depth estimates with masks from our person segmentation neural network to produce beautiful Portrait Mode results.

Try it Yourself
In Google Camera App version 6.1 and later, our depth maps are embedded in Portrait Mode images. This means you can use the Google Photos depth editor to change the amount of blur and the focus point after capture. You can also use third-party depth extractors to extract the depth map from a jpeg and take a look at it yourself. Also, here is an album showing the relative depth maps and the corresponding Portrait Mode images for traditional stereo and the learning-based approaches.

Acknowledgments
This work wouldn’t have been possible without Sam Ansari, Yael Pritch Knaan, David Jacobs, Jiawen Chen, Juhyun Lee and Andrei Kulik. Special thanks to Mike Milne and Andy Radin who captured data with the five-camera rig.

googblogs.com

All Google blogs and Press in one site

Tag Archives: machine learning

Introducing Coral: Our platform for development with local AI

Source: Google Developers Blog

Long-Range Robotic Navigation via Automated Reinforcement Learning

Source: Google AI Blog

Long-Range Robotic Navigation via Automated Reinforcement Learning

Source: Google AI Blog

Dopamine 2.0: providing more flexibility in reinforcement learning research

Source: Google Open Source Blog

Expanding the Application of Deep Learning to Electronic Health Records

Source: Google AI Blog

Soft Actor-Critic: Deep Reinforcement Learning for Robotics

Source: Google AI Blog

Exploring Quantum Neural Networks

Source: Google AI Blog

Grasp2Vec: Learning Object Representations from Self-Supervised Grasping

Source: Google AI Blog

Google at NeurIPS 2018

Source: Google AI Blog

Learning to Predict Depth on the Pixel 3 Phones

Source: Google AI Blog