Reinforcement Learning (RL) is a paradigm for using trial-and-error to train agents to sequentially make decisions in complex environments, which has had great success in a number of domains, including games, robotics manipulation and chip design. Agents typically aim at maximizing the sum of the reward they collect in an environment, which can be based on a variety of parameters, including speed, curiosity, aesthetics and more. However, designing a specific RL reward function is a challenge since it can be hard to specify or too sparse. In such cases, imitation learning (IL) methods offer an alternative as they learn how to solve a task from expert demonstrations, rather than a carefully designed reward function. However, state-of-the-art IL methods rely on adversarial training, which uses min/max optimization procedures, making them algorithmically unstable and difficult to deploy.
In “Primal Wasserstein Imitation Learning” (PWIL), we introduce a new IL method, based on the primal form of the Wasserstein distance, also known as the earth mover’s distance, which does not rely on adversarial training. Using the MuJoCo suite of tasks, we demonstrate the efficacy of the PWIL method by imitating a simulated expert with a limited number of demonstrations (even a single example) and limited interactions with the environment.
|Left: Demonstration of the algorithmic Humanoid “expert”, trained on the true reward of the task (which relates to speed). Right: Agent trained using PWIL on the expert demonstration.|
Adversarial Imitation Learning
State-of-the-art adversarial IL methods operate similarly to generative adversarial networks (GANs) in which a generator (the policy) is trained to maximize the confusion of a discriminator (the reward) that itself is trained to differentiate between the agent’s state-action pairs and the expert’s. Adversarial IL methods boil down to a distribution matching problem, i.e., the problem of minimizing a distance between probability distributions in a metric space. However, just as GANs, adversarial IL methods rely on a min/max optimization problem and hence come with a number of training stability challenges.
Imitation Learning as Distribution Matching
The PWIL method is based on the formulation of IL as a distribution matching problem, in this case, the Wasserstein distance. The first step consists of inferring from the demonstrations a state-action distribution of the expert, the collection of relationships between the actions taken by the expert and the corresponding state of the environment. The goal is then to minimize the distance between the agent’s and the expert’s state-action distributions, through interactions with the environment. In contrast, PWIL is a non-adversarial method, enabling it to bypass the min/max optimization problem and directly minimize the Wasserstein distance between the agent’s and the expert’s state-action pair distributions.
Primal Wasserstein Imitation Learning
Computing the exact Wasserstein distance can be restrictive since one must wait until the end of a trajectory of the agent to calculate it, meaning that the rewards can be computed only when the agent is done interacting with the environment. To avoid this restriction, we use an upper bound on the distance instead, from which we can define a reward that we optimize using RL. We show that by doing so, we indeed recover expert behaviour and minimize the Wasserstein distance between the agent and the expert on a number of locomotion tasks of the MuJoCo simulator. While adversarial IL methods use a reward function from a neural network that must be optimized and re-estimated continuously as the agent interacts with the environment, PWIL defines a reward function offline from demonstrations, which does not change and is based on substantially fewer hyperparameters than adversarial IL approaches.
|Training curves for PWIL on Humanoid. In green, the Wasserstein distance to the state-action distribution of the expert. In blue, the return (the sum of rewards collected) by the agent.|
A Measure of Similarity for the True Imitation Learning Setting
As in numerous challenges in ML, a number of IL methods are evaluated on synthetic tasks, where one usually has access to the underlying reward function of the task and can measure similarity between the expert’s and the agent’s behaviour in terms of performance, which is the expected sum of rewards. A byproduct of PWIL is the creation of a metric that can compare expert behavior to an agent’s behavior for any IL method, without access to the true reward of the task. In this sense, we can use the Wasserstein distance in the true IL setting, not only on synthetic tasks.
In environments where interacting is costly (e.g., a real robot or a complex simulator), PWIL is a prime candidate not only because it can recover expert behaviour, but also because the reward function it defines is easy to tune and is defined without interactions with the environment. This opens multiple opportunities for future exploration, including deployment to real systems, extending PWIL to the setup where we have only access to demonstration states (rather than states and actions), and finally applying PWIL to visual based observations.
We thank our co-authors, Matthieu Geist and Olivier Pietquin; as well as Zafarali Ahmed, Adrien Ali Taïga, Gabriel Dulac-Arnold, Johan Ferret, Alexis Jacq and Saurabh Kumar for their feedback on the manuscript.