People are remarkably proficient at manipulating objects without needing to adjust their viewpoint to a fixed or specific pose. This capability (referred to as visual motor integration) is learned during childhood from manipulating objects in various situations, and governed by a self-adaptation and mistake correction mechanism that uses rich sensory cues and vision as feedback. However, this capability is quite difficult for vision-based controllers in robotics, which until now have been built on a rigid setup for reading visual input data from a fixed mounted camera which should not be moved or repositioned at train and test time. The ability to quickly acquire visual motor control skills under large viewpoint variation would have substantial implications for autonomous robotic systems — for example, this capability would be particularly desirable for robots that can help rescue efforts in emergency or disaster zones.
In “Sim2Real Viewpoint Invariant Visual Servoing by Recurrent Control” presented at CVPR 2018 this week, we study a novel deep network architecture (consisting of two fully convolutional networks and a long short-term memory unit) that learns from a past history of actions and observations to self-calibrate. Using diverse simulated data consisting of demonstrated trajectories and reinforcement learning objectives, our visually-adaptive network is able to control a robotic arm to reach a diverse set of visually-indicated goals, from various viewpoints and independent of camera calibration.
Viewpoint invariant manipulation for visually indicated goal reaching with a physical robotic arm. We learn a single policy that can reach diverse goals from sensory input captured from drastically different camera viewpoints. First row shows the visually indicated goals.
Discovering how the controllable degrees of freedom (DoF) affect visual motion can be ambiguous and underspecified from a single image captured from an unknown viewpoint. Identifying the effect of actions on image-space motion and successfully performing the desired task requires a robust perception system augmented with the ability to maintain a memory of past actions. To be able to tackle this challenging problem, we had to address the following essential questions:
- How can we make it feasible to provide the right amount of experience for the robot to learn the self-adaptation behavior based on pure visual observations that simulate a lifelong learning paradigm?
- How can we design a model that integrates robust perception and self-adaptive control such that it can quickly transfer to unseen environments?
|Visually indicated goal reaching task with a physical robotic arm and diverse camera viewpoints.|
Collecting robot experience data is difficult and time-consuming. In a previous post, we showed how to scale up learning skills by distributing the data collection and trials to multiple robots. Although this approach expedited learning, it is still not feasibly extendable to learning complex behaviors such as visual self-calibration, where we need to expose robots to a huge space of various viewpoints. Instead, we opt to learn such complex behavior in simulation where we can collect unlimited robot trials and easily move the camera to various random viewpoints. In addition to fast data collection in simulation, we can also surpass hardware limitations requiring the installation of multiple cameras around a robot.
|We use domain randomization technique to learn generalizable policies in simulation.|
Viewpoint invariant manipulation for visually indicated goal reaching with a simulated seven-DoF robotic arm. We learn a single policy that can reach diverse goals from sensory input captured from dramatically different camera viewpoints.
Disentangling Perception from Control
To enable fast transfer to unseen environments, we devised a deep neural network that combines perception and control trained end-to-end simultaneously, while also allowing each to be learned independently if needed. This disentanglement between perception and control eases transfer to unseen environments, and makes the model both flexible and efficient in that each of its parts (i.e. 'perception' or 'control') can be independently adapted to new environments with small amounts of data. Additionally, while the control portion of the network was entirely trained by the simulated data, the perception part of our network was complemented by collecting a small amount of static images with object bounding boxes without needing to collect the whole action sequence trajectory with a physical robot. In practice, we fine-tuned the perception part of our network with only 76 object bounding boxes coming from 22 images.
|Real-world robot and moving camera setup. First row shows the scene arrangements and the second row shows the visual sensory input to the robot.|
We tested the visually-adapted version of our network on a physical robot and on real objects with drastically different appearances than the ones used in simulation. Experiments were performed with both one or two objects on a table — “seen objects” (as labeled in the figure below) were used for visual adaptation using small collection of real static images, while “unseen objects” had not been seen during visual adaptation. During the test, the robot arm was directed to reach a visually indicated object from various viewpoints. For the two object experiments the second object was to "fool" the robotic arm. While the simulation-only network has good generalization capability (due to being trained with domain randomization technique), the very small amount of static visual data to visually adapt the controller boosted the performance, due to the flexible architecture of our network.
|After adapting the visual features with the small amount of real images, performance was boosted by more than 10%. All used real objects are drastically different from the objects seen in simulation.|
This research was conducted by Fereshteh Sadeghi, Alexander Toshev, Eric Jang and Sergey Levine. We would also like to thank Erwin Coumans and Yunfei Bai for providing pybullet, and Vincent Vanhoucke for insightful discussions.