Reconstructing 3D objects and buildings from a series of images is a well-known problem in computer vision called Structure-from-Motion (SfM). It has diverse applications in photography and cultural heritage preservation (e.g., allowing people to explore the sculptures of Rapa Nui in a browser) and powers many services across Google Maps, such as the 3D models created from Street View and aerial imagery. In these examples, images are usually captured by operators under controlled conditions. While this ensures homogeneous data with a uniform, high-quality appearance in the images and the final reconstruction, it also limits the diversity of sites captured and the viewpoints from which they are seen. What if, instead of using images from tightly controlled conditions, one could apply SfM techniques to better capture the richness of the world using the vast amounts of unstructured image collections freely available on the internet?
To accelerate research into this topic, and into how to better leverage the volume of data already publicly available, we present “Image Matching across Wide Baselines: From Paper to Practice”, a collaboration with UVIC, CTU, and EPFL that introduces a new public benchmark to evaluate methods for 3D reconstruction. Following on the results of the first Image Matching: Local Features and Beyond workshop held at CVPR 2019, this project now includes more than 25k images, each of which includes accurate pose information (location and orientation). This data is publicly available, along with the open-sourced benchmark, and is the foundation of the 2020 Image Matching Challenge to be held at CVPR 2020.¹
Recovering 3D Structure In the Wild
Google Maps already uses images donated by users to inform visitors about popular locations or to update business hours. However, using this type of data to build 3D models is much more difficult, since donated photos have a wide variety of viewpoints, lighting and weather conditions, occlusions from people and vehicles, and occasional user-applied filters. The examples below highlight the diversity of images for the Trevi Fountain in Rome.
Some example images sampled from the Image Matching Challenge dataset, showing different perspectives of the Trevi Fountain.
A 3D reconstruction generated from over 3000 images, including those from the previous figure.
A Benchmark for Evaluating Local Features for 3D Reconstruction
Local features power many Google services, such as Image Search and product recognition in Google Lens, and are also used in mixed reality applications, like Google Maps' Live View, which relies on traditional, handcrafted local features. Designing better algorithms to identify and describe local features will lead to better performance overall.
Comparing the performance of local feature algorithms, however, has been difficult, because it is not obvious how to collect "ground-truth" data for this purpose. Some computer vision tasks rely on crowdsourcing: Google's OpenImages dataset labels "objects" with bounding boxes or pixel masks by combining machine learning techniques with human annotators. That approach is not possible here: it is not known a priori what constitutes a "good" local feature, which makes labelling infeasible. Additionally, existing benchmarks, such as HPatches, are often small or limited to a narrow range of transformations, which can bias the evaluation.
What matters is the quality of the reconstruction, and that benchmarks reflect real-world scale and challenges in order to highlight opportunities for developing new approaches. To this end, we have created the Image Matching Benchmark, the first benchmark to include a large dataset of images for training and evaluation. The dataset includes more than 25k images (sourced from the public YFCC100M dataset), each of which has been augmented with accurate pose information (location and orientation). We obtain this "pseudo" ground-truth from large-scale SfM (hundreds to thousands of images per scene), which provides accurate and stable poses, and then run our evaluation on smaller subsets (tens of images), a much more difficult problem. This approach does not require expensive sensors or human labelling, and it provides better proxy metrics than previous benchmarks, which were restricted to small and homogeneous datasets.
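To make the idea of scoring against pose "pseudo" ground-truth concrete, here is a minimal sketch of how an estimated camera pose might be compared with a reference pose. The post does not spell out the benchmark's metrics, so the choice of angular error is an assumption, as are the function names `rotation_error_deg` and `translation_error_deg` and the helper `rot_z`. Translation is compared by direction only, on the assumption that reconstructions from images alone do not recover absolute scale.

```python
import math


def _clamp(value, low=-1.0, high=1.0):
    """Keep a cosine inside [-1, 1] so rounding noise cannot break acos."""
    return max(low, min(high, value))


def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices.

    Hypothetical helper: trace(R_gt^T R_est) equals the sum of the elementwise
    products of the two matrices, and the rotation angle follows from
    cos(angle) = (trace - 1) / 2.
    """
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    return math.degrees(math.acos(_clamp((trace - 1.0) / 2.0)))


def translation_error_deg(t_est, t_gt):
    """Angle in degrees between two translation directions (scale is ignored)."""
    dot = sum(a * b for a, b in zip(t_est, t_gt))
    norm = math.sqrt(sum(a * a for a in t_est)) * math.sqrt(sum(b * b for b in t_gt))
    if norm == 0.0:
        raise ValueError("translation vectors must be non-zero")
    return math.degrees(math.acos(_clamp(dot / norm)))


def rot_z(angle_deg):
    """Rotation about the z axis, used only to build toy inputs."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Toy example: the estimate is off by 10 degrees in rotation and points
# along the correct translation direction with the wrong scale.
rot_err = rotation_error_deg(rot_z(10.0), rot_z(0.0))
trans_err = translation_error_deg([2.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

In this toy case the rotation error is about 10 degrees and the translation error is 0, since only the direction is compared. A real evaluation would aggregate such errors over many image pairs or subsets.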
Visualizations from our benchmark. We show point-to-point matches generated by different local feature algorithms. Left to right: SIFT, HardNet, LogPolarDesc, R2D2. For details, please refer to our website.
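For readers unfamiliar with how point-to-point matches arise from local features, the following is a minimal sketch of one simple strategy, mutual nearest-neighbor matching, applied to toy descriptors. It is an illustration only and not the benchmark's matching procedure; the function names and the two-dimensional descriptors are hypothetical (real descriptors such as SIFT have many more dimensions).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(query, candidates):
    """Index of the candidate descriptor closest to the query."""
    return min(range(len(candidates)),
               key=lambda j: squared_distance(query, candidates[j]))


def mutual_nearest_matches(desc_a, desc_b):
    """Return (i, j) pairs where desc_a[i] and desc_b[j] are each other's nearest neighbor."""
    if not desc_a or not desc_b:
        return []
    a_to_b = [nearest_index(d, desc_b) for d in desc_a]
    b_to_a = [nearest_index(d, desc_a) for d in desc_b]
    # Keep a match only if it is the best choice in both directions.
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Toy descriptors for two images; the third descriptor in image A has no
# counterpart in image B and should be rejected by the mutual check.
descriptors_a = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
descriptors_b = [[1.1, 0.9], [0.1, 0.0]]
matches = mutual_nearest_matches(descriptors_a, descriptors_b)
# matches == [(0, 1), (1, 0)]
```

The mutual check discards the unmatched third descriptor, which would otherwise be paired with its closest (but wrong) neighbor. This brute-force version costs time proportional to the product of the two descriptor counts, so it is suitable only for small examples.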
The benchmark is joint work by Yuhe Jin and Kwang Moo Yi (University of Victoria), Anastasiia Mishchuk and Pascal Fua (EPFL), Dmytro Mishkin and Jiří Matas (Czech Technical University), and Eduard Trulls (Google). The CVPR workshop is co-organized by Vassileios Balntas (Scape Technologies/Facebook), Vincent Lepetit (Ecole des Ponts ParisTech), Dmytro Mishkin and Jiří Matas (Czech Technical University), Johannes Schönberger (Microsoft), Eduard Trulls (Google), and Kwang Moo Yi (University of Victoria).
¹ Please note that as of April 2, 2020, CVPR remains on track despite the COVID-19 pandemic. Challenge information will be updated as the situation develops. Please see the 2020 Image Matching Challenge website for details.