The performance of machine learning (ML) models depends both on the learning algorithms, as well as the data used for training and evaluation. The role of the algorithms is well studied and the focus of a multitude of challenges, such as SQuAD, GLUE, ImageNet, and many others. In addition, there have been efforts to also improve the data, including a series of workshops addressing issues for ML evaluation. In contrast, research and challenges that focus on the data used for evaluation of ML models are not commonplace. Furthermore, many evaluation datasets contain items that are easy to evaluate, e.g., photos with a subject that is easy to identify, and thus they miss the natural ambiguity of real world context. The absence of ambiguous real-world examples in evaluation undermines the ability to reliably test machine learning performance, which makes ML models prone to develop “weak spots”, i.e., classes of examples that are difficult or impossible for a model to accurately evaluate, because that class of examples is missing from the evaluation set.
To address the problem of identifying these weaknesses in ML models, we recently launched the Crowdsourcing Adverse Test Sets for Machine Learning (CATS4ML) Data Challenge at HCOMP 2020 (open until 30 April, 2021 to researchers and developers worldwide). The goal of the challenge is to raise the bar in ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process. CATS4ML relies on people’s abilities and intuition to spot new data examples about which machine learning is confident, but actually misclassifies.
What are ML “Weak Spots”?
There are two categories of weak spots: known unknowns and unknown unknowns. Known unknowns are examples for which a model is unsure about the correct classification. The research community continues to study this in a field known as active learning, and has found the solution to be, in very general terms, to interactively solicit new labels from people on uncertain examples. For example, if a model is not certain whether or not the subject of a photo is a cat, a person is asked to verify; but if the system is certain, a person is not asked. While there is room for improvement in this area, what is comforting is that the confidence of the model is correlated with its performance, i.e., one can see what the model doesn’t know.
Unknown unknowns, on the other hand, are examples where a model is confident about its answer, but is actually wrong. Efforts to proactively discover unknown unknowns (e.g., Attenberg 2015 and Crawford 2019) have helped uncover a multitude of unintended machine behaviours. In contrast to such approaches for the discovery of unknown unknowns, generative adversarial networks (GANs) generate unknown unknowns for image recognition models in the form of optical illusions for computers that cause deep learning models to make mistakes beyond human perception. While GANs uncover model exploits in the event of an intentional manipulation, real-world examples can better highlight a model’s failures in its day-to-day performance. These real-world examples are the unknown unknowns of interest to CATS4ML — the challenge aims to gather unmanipulated examples that humans can reliably interpret but on which many ML models would confidently disagree.
|Example illustrating how optical illusions for computers caused by adversarial noise help discover machine manipulated unknown unknowns for ML models (based on Brown 2018).|
First Edition of CATS4ML Data Challenge: Open Images Dataset
The CATS4ML Data Challenge focuses on visual recognition, using images and labels from the Open Images Dataset. The target images for the challenge are selected from the Open Images Dataset along with a set of 24 target labels from the same dataset. The challenge participants are invited to invent new and creative ways to explore this existing publicly available dataset and, focussed on a list of pre-selected target labels, discover examples of unknown unknowns for ML models.
|Examples from the Open Images Dataset as possible unknown unknowns for ML models.|
CATS4ML is a complementary effort to FAIR’s recently introduced DynaBench research platform for dynamic data collection. Where DynaBench tackles issues with static benchmarks using ML models with humans in the loop, CATS4ML focuses on improving evaluation datasets for ML by encouraging the exploration of existing ML benchmarks for adverse examples that can be unknown unknowns. The results will help detect and avoid future errors, and also will give insights to model explainability.
In this way, CATS4ML aims to raise greater awareness of the problem by providing dataset resources that developers can use to uncover the weak spots of their algorithms. This will also inform researchers on how to create benchmark datasets for machine learning that are more balanced, diverse and socially aware.
We invite the global community of ML researchers and practitioners to join us in the effort of discovering interesting, difficult examples from the Open Images Dataset. Register on the challenge website, download the target images and labeled data, contribute the images you discover and join the competition for the winning participant!
To score points in this competition, participants should submit a set of image-label pairs that will be confirmed by human-in-the-loop raters, whose votes should be in disagreement with the average machine score for the label over a number of machine learning models.
The release of the CATS4ML Data Challenge has been possible thanks to the hard work of a lot of people including, but not limited to, the following (in alphabetical order of last name): Osman Aka, Ken Burke, Tulsee Doshi, Mig Gerard, Victor Gomes, Shahab Kamali, Igor Karpov, Devi Krishna, Daphne Luong, Carey Radebaugh, Jamie Taylor, Nithum Thain, Kenny Wibowo, Ka Wong, and Tong Zhou.