The success of deep learning in computer vision can be partially attributed to the availability of large amounts of labeled training data — a model’s performance typically improves as you increase the quality, diversity and the amount of training data. However, collecting enough quality data to train a model to perform well is often prohibitively difficult. One way around this is to hardcode image symmetries into neural network architectures so they perform better or have experts manually design data augmentation methods, like rotation and flipping, that are commonly used to train well-performing vision models. However, until recently, less attention has been paid to finding ways to automatically augment existing data using machine learning. Inspired by the results of our AutoML efforts to design neural network architectures and optimizers to replace components of systems that were previously human designed, we asked ourselves: can we also automate the procedure of data augmentation?
In “AutoAugment: Learning Augmentation Policies from Data”, we explore a reinforcement learning algorithm which increases both the amount and diversity of data in an existing training dataset. Intuitively, data augmentation is used to teach a model about image invariances in the data domain in a way that makes a neural network invariant to these important symmetries, thus improving its performance. Unlike previous state-of-the-art deep learning models that used hand-designed data augmentation policies, we used reinforcement learning to find the optimal image transformation policies from the data itself. The result improved performance of computer vision models without relying on the production of new and ever expanding datasets.
Augmenting Training Data
The idea behind data augmentation is simple: images have many symmetries that don’t change the information present in the image. For example, the mirror reflection of a dog is still a dog. While some of these “invariances” are obvious to humans, many are not. For example, the mixup method augments data by placing images on top of each other during training, resulting in data which improves neural network performance.
|Left: An original image from the ImageNet dataset. Right: The same image transformed by a commonly used data augmentation transformation, a horizontal flip about the center.|
AutoAugment learns different transformations depending on what dataset it is run on. For example, for images involving street view of house numbers (SVHN) which include natural scene images of digits, AutoAugment focuses on geometric transforms like shearing and translation, which represent distortions commonly observed in this dataset. In addition, AutoAugment has learned to completely invert colors which naturally occur in the original SVHN dataset, given the diversity of different building and house numbers materials in the world.
|Left: An original image from the SVHN dataset. Right: The same image transformed by AutoAugment. In this case, the optimal transformation was a result of shearing the image and inverting the colors of the pixels.|
|Left: An original image from the ImageNet dataset. Right: The same image transformed by the AutoAugment policy. First, the image contrast is maximized, after which the image is rotated.|
Our AutoAugment algorithm found augmentation policies for some of the most well-known computer vision datasets that, when incorporated into the training of the neural network, led to state-of-the-art accuracies. By augmenting ImageNet data we obtain a new state-of-the-art accuracy of 83.54% top1 accuracy and on CIFAR10 we achieve an error rate of 1.48%, which is a 0.83% improvement over the default data augmentation designed by scientists. On SVHN, we improved the state-of-the-art error from 1.30% to 1.02%. Importantly, AutoAugment policies are found to be transferable — the policy found for the ImageNet dataset could also be applied to other vision datasets (Stanford Cars, FGVC-Aircraft, etc.), which in turn improves neural network performance.
We are pleased to see that our AutoAugment algorithm achieved this level of performance on many different competitive computer vision datasets and look forward to seeing future applications of this technology across more computer vision tasks and even in other domains such as audio processing or language models. The policies with the best performance are included in the appendix of the paper, so that researchers can use them to improve their models on relevant vision tasks.
Special thanks to the co-authors of the paper Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. We’d also like to thank Alok Aggarwal, Gabriel Bender, Yanping Huang, Pieter-Jan Kindermans, Simon Kornblith, Augustus Odena, Avital Oliver, and Colin Raffel for their help with this project.