Tag Archives: Computer Vision

Scaling vision transformers to 22 billion parameters

Large Language Models (LLMs) like PaLM or GPT-3 showed that scaling transformers to hundreds of billions of parameters improves performance and unlocks emergent abilities. The biggest dense models for image understanding, however, have reached only 4 billion parameters, despite research indicating that promising multimodal models like PaLI continue to benefit from scaling vision models alongside their language counterparts. Motivated by this, and the results from scaling LLMs, we decided to undertake the next step in the journey of scaling the Vision Transformer.

In “Scaling Vision Transformers to 22 Billion Parameters”, we introduce the biggest dense vision model, ViT-22B. It is 5.5x larger than the previous largest vision backbone, ViT-e, which has 4 billion parameters. To enable this scaling, ViT-22B incorporates ideas from scaling text models like PaLM, with improvements to both training stability (using QK normalization) and training efficiency (with a novel approach called asynchronous parallel linear operations). As a result of its modified architecture, efficient sharding recipe, and bespoke implementation, it was able to be trained on Cloud TPUs with a high hardware utilization1. ViT-22B advances the state of the art on many vision tasks using frozen representations, or with full fine-tuning. Further, the model has also been successfully used in PaLM-e, which showed that a large model combining ViT-22B with a language model can significantly advance the state of the art in robotics tasks.


Our work builds on many advances from LLMs, such as PaLM and GPT-3. Compared to the standard Vision Transformer architecture, we use parallel layers, an approach in which attention and MLP blocks are executed in parallel, instead of sequentially as in the standard Transformer. This approach was used in PaLM and reduced training time by 15%.

Secondly, ViT-22B omits biases in the QKV projections, part of the self-attention mechanism, and in the LayerNorms, which increases utilization by 3%. The diagram below shows the modified transformer architecture used in ViT-22B:

ViT-22B transformer encoder architecture uses parallel feed-forward layers, omits biases in QKV and LayerNorm layers and normalizes Query and Key projections.

Models at this scale necessitate “sharding” — distributing the model parameters in different compute devices. Alongside this, we also shard the activations (the intermediate representations of an input). Even something as simple as a matrix multiplication necessitates extra care, as both the input and the matrix itself are distributed across devices. We develop an approach called asynchronous parallel linear operations, whereby communications of activations and weights between devices occur at the same time as computations in the matrix multiply unit (the part of the TPU holding the vast majority of the computational capacity). This asynchronous approach minimizes the time waiting on incoming communication, thus increasing device efficiency. The animation below shows an example computation and communication pattern for a matrix multiplication.

Asynchronized parallel linear operation. The goal is to compute the matrix multiplication y = Ax, but both the matrix A and activation x are distributed across different devices. Here we illustrate how it can be done with overlapping communication and computation across devices. The matrix A is column-sharded across the devices, each holding a contiguous slice, each block represented as Aij. More details are in the paper.

At first, the new model scale resulted in severe training instabilities. The normalization approach of Gilmer et al. (2023, upcoming) resolved these issues, enabling smooth and stable model training; this is illustrated below with example training progressions.

The effect of normalizing the queries and keys (QK normalization) in the self-attention layer on the training dynamics. Without QK normalization (red) gradients become unstable and the training loss diverges.


Here we highlight some results of ViT-22B. Note that in the paper we also explore several other problem domains, like video classification, depth estimation, and semantic segmentation.

To illustrate the richness of the learned representation, we train a text model to produce representations that align text and image representations (using LiT-tuning). Below we show several results for out-of-distribution images generated by Parti and Imagen:

Examples of image+text understanding for ViT-22B paired with a text model. The graph shows normalized probability distribution for each description of an image.

Human object recognition alignment

To find out how aligned ViT-22B classification decisions are with human classification decisions, we evaluated ViT-22B fine-tuned with different resolutions on out-of-distribution (OOD) datasets for which human comparison data is available via the model-vs-human toolbox. This toolbox measures three key metrics: How well do models cope with distortions (accuracy)? How different are human and model accuracies (accuracy difference)? Finally, how similar are human and model error patterns (error consistency)? While not all fine-tuning resolutions perform equally well, ViT-22B variants are state of the art for all three metrics. Furthermore, the ViT-22B models also have the highest ever recorded shape bias in vision models. This means that they mostly use object shape, rather than object texture, to inform classification decisions — a strategy known from human perception (which has a shape bias of 96%). Standard models (e.g., ResNet-50, which has aa ~20–30% shape bias) often classify images like the cat with elephant texture below according to the texture (elephant); models with a high shape bias tend to focus on the shape instead (cat). While there are still many important differences between human and model perception, ViT-22B shows increased similarities to human visual object recognition.

Cat or elephant? Car or clock? Bird or bicycle? Example images with the shape of one object and the texture of a different object, used to measure shape/texture bias.
Shape bias evaluation (higher = more shape-biased). Many vision models have a low shape / high texture bias, whereas ViT-22B fine-tuned on ImageNet (red, green, blue trained on 4B images as indicated by brackets after model names, unless trained on ImageNet only) have the highest shape bias recorded in a ML model to date, bringing them closer to a human-like shape bias.

Out-of-distribution performance

Measuring performance on OOD datasets helps assess generalization. In this experiment we construct label-maps (mappings of labels between datasets) from JFT to ImageNet and also from ImageNet to different out-of-distribution datasets like ObjectNet (results after pre-training on this data shown in the left curve below). Then the models are fully fine-tuned on ImageNet.

We observe that scaling Vision Transformers increases OOD performance: even though ImageNet accuracy saturates, we see a significant increase on ObjectNet from ViT-e to ViT-22B (shown by the three orange dots in the upper right below).

Even though ImageNet accuracy saturates, we see a significant increase in performance on ObjectNet from ViT-e/14 to ViT-22B.

Linear probe

Linear probe is a technique where a single linear layer is trained on top of a frozen model. Compared to full fine-tuning, this is much cheaper to train and easier to set up. We observed that the linear probe of ViT-22B performance approaches that of state-of-the-art full fine-tuning of smaller models using high-resolution images (training with higher resolution is generally much more expensive, but for many tasks it yields better results). Here are results of a linear probe trained on the ImageNet dataset and evaluated on the ImageNet validation dataset and other OOD ImageNet datasets.

Linear probe results trained on ImageNet, evaluated on Imagenet-ReaL, ImageNet-v2, ObjectNet, ImageNet-R and ImageNet-A datasets. High-resolution fine-tuned ViT-e/14 provided as a reference.


The knowledge of the bigger model can be transferred to a smaller model using the distillation method. This is helpful as big models are slower and more expensive to use. We found that ViT-22B knowledge can be transferred to smaller models like ViT-B/16 and ViT-L/16, achieving a new state of the art on ImageNet for those model sizes.

Model Approach (dataset) ImageNet1k Accuracy
ViT-B/16       Transformers for Image Recognition at Scale (JFT)       84.2
Scaling Vision Transformers (JFT) 86.6
DeiT III: Revenge of the ViT (INet21k) 86.7
Distilled from ViT-22B (JFT) 88.6
ViT-L/16 Transformers for Image Recognition at Scale (JFT) 87.1
Scaling Vision Transformers (JFT) 88.5
DeiT III: Revenge of the ViT (INet21k) 87.7
Distilled from ViT-22B (JFT) 89.6

Fairness and bias

ML models can be susceptible to unintended unfair biases, such as picking up spurious correlations (measured using demographic parity) or having performance gaps across subgroups. We show that scaling up the size helps in mitigating such issues.

First, scale offers a more favorable tradeoff frontier — performance improves with scale even when the model is post-processed after training to control its level of demographic parity below a prescribed, tolerable level. Importantly, this holds not only when performance is measured in terms of accuracy, but also other metrics, such as calibration, which is a statistical measure of the truthfulness of the model's estimated probabilities. Second, classification of all subgroups tends to improve with scale as demonstrated below. Third, ViT-22B reduces the performance gap across subgroups.

Top: Accuracy for each subgroup in CelebA before debiasing. Bottom: The y-axis shows the absolute difference in performance across the two specific subgroups highlighted in this example: females and males. ViT-22B has a small gap in performance compared to smaller ViT architectures.


We have presented ViT-22B, currently the largest vision transformer model at 22 billion parameters. With small but critical changes to the original architecture, we achieved excellent hardware utilization and training stability, yielding a model that advances the state of the art on several benchmarks. Great performance can be achieved using the frozen model to produce embeddings and then training thin layers on top. Our evaluations further show that ViT-22B shows increased similarities to human visual perception when it comes to shape and texture bias, and offers benefits in fairness and robustness, when compared to existing models.


This is a joint work of Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin Fathy, Elsayed Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers Jeremiah Harmsen, and Neil Houlsby

We would like to thank Jasper Uijlings, Jeremy Cohen, Arushi Goel, Radu Soricut, Xingyi Zhou, Lluis Castrejon, Adam Paszke, Joelle Barral, Federico Lebron, Blake Hechtman, and Peter Hawkins. Their expertise and unwavering support played a crucial role in the completion of this paper. We also acknowledge the collaboration and dedication of the talented researchers and engineers at Google Research.

1Note: ViT-22B has 54.9% model FLOPs utilization (MFU) while PaLM reported 46.2% MFU and we measured 44.0% MFU for ViT-e on the same hardware. 

Source: Google AI Blog

Vid2Seq: a pretrained visual language model for describing multi-event videos

Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is a challenging task as videos often contain multiple events occurring at different time scales. For example, a video of a musher hitching up dogs to a dog sled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being hitched to the sled). One way to spur research in video understanding is via the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single image captioning and standard video captioning, which consists of describing short videos with a single sentence.

Dense video captioning systems have wide applications, such as making videos accessible to people with visual or auditory impairments, automatically generating chapters for videos, or improving the search of video moments in large databases. Current dense video captioning approaches, however, have several limitations — for example, they often contain highly specialized task-specific components, which make it challenging to integrate them into powerful foundation models. Furthermore, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain and hence are not a scalable solution.

In this post, we introduce “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning”, to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. In order to pre-train this unified model, we leverage unlabeled narrated videos by reformulating sentence boundaries of transcribed speech as pseudo-event boundaries, and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model pre-trained on millions of narrated videos improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have also released the code for Vid2Seq here.

Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in a video by generating a single sequence of tokens.

A visual language model for dense video captioning

Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.

For a general overview of how we achieve this, we augment a visual language model with special time tokens (like text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video while generating a single sequence of tokens.

The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.

Vid2Seq model overview: We formulate dense event captioning as a sequence-to-sequence problem, using special time tokens to allow the model to seamlessly understand and generate sequences of tokens containing both textual semantic information and temporal localization information grounding each text sentence in the video.

Large-scale pre-training on untrimmed narrated videos

Due to the dense nature of the task, the manual collection of annotations for dense video captioning is particularly expensive. Hence we pre-train the Vid2Seq model using unlabeled narrated videos, which are easily available at scale. In particular, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.

We use transcribed speech sentences and their corresponding timestamps as supervision, which are cast as a single sequence of tokens. We pre-train Vid2Seq with a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.

Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).

Results on downstream dense video captioning benchmarks

The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given previous ground-truth tokens). After fine-tuning, Vid2Seq notably improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2 and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper we provide additional ablation studies, qualitative results, as well as results in the few-shot settings and in the video paragraph captioning task.

Comparison to state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), on the CIDEr metric (higher is better).


We introduce Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pretrained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.


This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid.

Source: Google AI Blog

PaLM-E: An embodied multimodal language model

Recent years have seen tremendous advances across machine learning domains, from models that can explain jokes or answer visual questions in a variety of languages to those that can produce images based on text descriptions. Such innovations have been possible due to the increase in availability of large scale datasets along with novel advances that enable the training of models on these data. While scaling of robotics models has seen some success, it is outpaced by other domains due to a lack of datasets available on a scale comparable to large text corpora or image datasets.

Today we introduce PaLM-E, a new generalist robotics model that overcomes these issues by transferring knowledge from varied visual and language domains to a robotics system. We began with PaLM, a powerful large language model, and “embodied” it (the “E” in PaLM-E), by complementing it with sensor data from the robotic agent. This is the key difference from prior efforts to bring large language models to robotics — rather than relying on only textual input, with PaLM-E we train the language model to directly ingest raw streams of robot sensor data. The resulting model not only enables highly effective robot learning, but is also a state-of-the-art general-purpose visual-language model, while maintaining excellent language-only task capabilities.

An embodied  language model, and also a visual-language generalist

On the one hand, PaLM-E was primarily developed to be a model for robotics, and it solves a variety of tasks on multiple types of robots and for multiple modalities (images, robot states, and neural scene representations). At the same time, PaLM-E is a generally-capable vision-and-language model. It can perform visual tasks, such as describing images, detecting objects, or classifying scenes, and is also proficient at language tasks, like quoting poetry, solving math equations or generating code.

PaLM-E combines our most recent large language model, PaLM, together with one of our most advanced vision models, ViT-22B. The largest instantiation of this approach, built on PaLM-540B, is called PaLM-E-562B and sets a new state of the art on the visual-language OK-VQA benchmark, without task-specific fine-tuning, and while retaining essentially the same general language performance as PaLM-540B.

How does PaLM-E work?

Technically, PaLM-E works by injecting observations into a pre-trained language model. This is realized by transforming sensor data, e.g., images, into a representation through a procedure that is comparable to how words of natural language are processed by a language model.

Language models rely on a mechanism to represent text mathematically in a way that neural networks can process. This is achieved by first splitting the text into so-called tokens that encode (sub)words, each of which is associated with a high-dimensional vector of numbers, the token embedding. The language model is then able to apply mathematical operations (e.g., matrix multiplication) on the resulting sequence of vectors to predict the next, most likely word token. By feeding the newly predicted word back to the input, the language model can iteratively generate a longer and longer text.

The inputs to PaLM-E are text and other modalities — images, robot states, scene embeddings, etc. — in an arbitrary order, which we call "multimodal sentences". For example, an input might look like, "What happened between <img_1> and <img_2>?", where <img_1> and <img_2> are two images. The output is text generated auto-regressively by PaLM-E, which could be an answer to a question, or a sequence of decisions in text form.

PaLM-E model architecture, showing how PaLM-E ingests different modalities (states and/or images) and addresses tasks through multimodal language modeling.

The idea of PaLM-E is to train encoders that convert a variety of inputs into the same space as the natural word token embeddings. These continuous inputs are mapped into something that resembles "words" (although they do not necessarily form discrete sets). Since both the word and image embeddings now have the same dimensionality, they can be fed into the language model.

We initialize PaLM-E for training with pre-trained models for both the language (PaLM) and vision components (Vision Transformer, a.k.a. ViT). All parameters of the model can be updated during training.

Transferring knowledge from large-scale training to robots

PaLM-E offers a new paradigm for training a generalist model, which is achieved by framing robot tasks and vision-language tasks together through a common representation: taking images and text as input, and outputting text. A key result is that PaLM-E attains significant positive knowledge transfer from both the vision and language domains, improving the effectiveness of robot learning.

Positive transfer of knowledge from general vision-language tasks results in more effective robot learning, shown for three different robot embodiments and domains.

Results show that PaLM-E can address a large set of robotics, vision and language tasks simultaneously without performance degradation compared to training individual models on individual tasks. Further, the visual-language data actually significantly improves the performance of the robot tasks. This transfer enables PaLM-E to learn robotics tasks efficiently in terms of the number of examples it requires to solve a task.


We evaluate PaLM-E on three robotic environments, two of which involve real robots, as well as general vision-language tasks such as visual question answering (VQA), image captioning, and general language tasks. When PaLM-E is tasked with making decisions on a robot, we pair it with a low-level language-to-action policy to translate text into low-level robot actions.

In the first example below, a person asks a mobile robot to bring a bag of chips to them. To successfully complete the task, PaLM-E produces a plan to find the drawer and open it and then responds to changes in the world by updating its plan as it executes the task. In the second example, the robot is asked to grab a green block. Even though the block has not been seen by that robot, PaLM-E still generates a step-by-step plan that generalizes beyond the training data of that robot.

PaLM-E controls a mobile robot operating in a kitchen environment. Left: The task is to get a chip bag. PaLM-E shows robustness against adversarial disturbances, such as putting the chip bag back into the drawer. Right: The final steps of executing a plan to retrieve a previously unseen block (green star). This capability is facilitated by transfer learning from the vision and language models.

In the second environment below, the same PaLM-E model solves very long-horizon, precise tasks, such as “sort the blocks by colors into corners,” on a different type of robot. It directly looks at the images and produces a sequence of shorter textually-represented actions — e.g., “Push the blue cube to the bottom right corner,” “Push the blue triangle there too.” — long-horizon tasks that were out of scope for autonomous completion, even in our own most recent models. We also demonstrate the ability to generalize to new tasks not seen during training time (zero-shot generalization), such as pushing red blocks to the coffee cup.

PaLM-E controlling a tabletop robot to successfully complete long-horizon tasks.

The third robot environment is inspired by the field of task and motion planning (TAMP), which studies combinatorially challenging planning tasks (rearranging objects) that confront the robot with a very high number of possible action sequences. We show that with a modest amount of training data from an expert TAMP planner, PaLM-E is not only able to also solve these tasks, but it also leverages visual and language knowledge transfer in order to more effectively do so.

PaLM-E produces plans for a task and motion planning environment.

As a visual-language generalist, PaLM-E is a competitive model, even compared with the best vision-language-only models, including Flamingo and PaLI. In particular, PaLM-E-562B achieves the highest number ever reported on the challenging OK-VQA dataset, which requires not only visual understanding but also external knowledge of the world. Further, this result is reached with a generalist model, without fine-tuning specifically on only that task.

PaLM-E exhibits capabilities like visual chain-of-thought reasoning in which the model breaks down its answering process in smaller steps, an ability that has so far only been demonstrated in the language-only domain. The model also demonstrates the ability to perform inference on multiple images although being trained on only single-image prompts. The image of the New York Knicks and Boston Celtics is under the terms CC-by-2.0 and was posted to Flickr by kowarski. The image of Kobe Bryant is in the Public Domain. The other images were taken by us.


PaLM-E pushes the boundaries of how generally-capable models can be trained to simultaneously address vision, language and robotics while also being capable of transferring knowledge from vision and language to the robotics domain. There are additional topics investigated in further detail in the paper, such as how to leverage neural scene representations with PaLM-E and also the extent to which PaLM-E, with greater model scale, experiences less catastrophic forgetting of its language capabilities.

PaLM-E not only provides a path towards building more capable robots that benefit from other data sources, but might also be a key enabler to other broader applications using multimodal learning, including the ability to unify tasks that have so far seemed separate.


This work was done in collaboration across several teams at Google, including the Robotics at Google team and the Brain team, and with TU Berlin. Co-authors: Igor Mordatch, Andy Zeng, Aakanksha Chowdhery, Klaus Greff, Mehdi S. M. Sajjadi, Daniel Duckworth, Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Fei Xia, Brian Ichter, Karol Hausman, Tianhe Yu, Quan Vuong, Yevgen Chebotar, Wenlong Huang, Pierre Sermanet, Sergey Levine, Vincent Vanhoucke, and Marc Toussiant. Danny is a PhD student advised by Marc Toussaint at TU Berlin. We also would like to thank several other colleagues for their advice and help, including Xi Chen, Etienne Pot, Sebastian Goodman, Maria Attarian, Ted Xiao, Keerthana Gopalakrishnan, Kehang Han, Henryk Michalewski, Neil Houlsby, Basil Mustafa, Justin Gilmer, Yonghui Wu, Erica Moreira, Victor Gomes, Tom Duerig, Mario Lucic, Henning Meyer, and Kendra Byrne.

Source: Google AI Blog

Announcing the ICDAR 2023 Competition on Hierarchical Text Detection and Recognition

The last few decades have witnessed the rapid development of Optical Character Recognition (OCR) technology, which has evolved from an academic benchmark task used in early breakthroughs of deep learning research to tangible products available in consumer devices and to third party developers for daily use. These OCR products digitize and democratize the valuable information that is stored in paper or image-based sources (e.g., books, magazines, newspapers, forms, street signs, restaurant menus) so that they can be indexed, searched, translated, and further processed by state-of-the-art natural language processing techniques.

Research in scene text detection and recognition (or scene text spotting) has been the major driver of this rapid development through adapting OCR to natural images that have more complex backgrounds than document images. These research efforts, however, focus on the detection and recognition of each individual word in images, without understanding how these words compose sentences and articles.

Layout analysis is another relevant line of research that takes a document image and extracts its structure, i.e., title, paragraphs, headings, figures, tables and captions. These layout analysis efforts are parallel to OCR and have been largely developed as independent techniques that are typically evaluated only on document images. As such, the synergy between OCR and layout analysis remains largely under-explored. We believe that OCR and layout analysis are mutually complementary tasks that enable machine learning to interpret text in images and, when combined, could improve the accuracy and efficiency of both tasks.

With this in mind, we announce the Competition on Hierarchical Text Detection and Recognition (the HierText Challenge), hosted as part of the 17th annual International Conference on Document Analysis and Recognition (ICDAR 2023). The competition is hosted on the Robust Reading Competition website, and represents the first major effort to unify OCR and layout analysis. In this competition, we invite researchers from around the world to build systems that can produce hierarchical annotations of text in images using words clustered into lines and paragraphs. We hope this competition will have a significant and long-term impact on image-based text understanding with the goal to consolidate the research efforts across OCR and layout analysis, and create new signals for downstream information processing tasks.

The concept of hierarchical text representation.

Constructing a hierarchical text dataset

In this competition, we use the HierText dataset that we published at CVPR 2022 with our paper "Towards End-to-End Unified Scene Text Detection and Layout Analysis". It’s the first real-image dataset that provides hierarchical annotations of text, containing word, line, and paragraph level annotations. Here, "words" are defined as sequences of textual characters not interrupted by spaces. "Lines" are then interpreted as "space"-separated clusters of "words" that are logically connected in one direction, and aligned in spatial proximity. Finally, "paragraphs" are composed of "lines" that share the same semantic topic and are geometrically coherent.

To build this dataset, we first annotated images from the Open Images dataset using the Google Cloud Platform (GCP) Text Detection API. We filtered through these annotated images, keeping only images rich in text content and layout structure. Then, we worked with our third-party partners to manually correct all transcriptions and to label words, lines and paragraph composition. As a result, we obtained 11,639 transcribed images, split into three subsets: (1) a train set with 8,281 images, (2) a validation set with 1,724 images, and (3) a test set with 1,634 images. As detailed in the paper, we also checked the overlap between our dataset, TextOCR, and Intel OCR (both of which also extracted annotated images from Open Images), making sure that the test images in the HierText dataset were not also included in the TextOCR or Intel OCR training and validation splits and vice versa. Below, we visualize examples using the HierText dataset and demonstrate the concept of hierarchical text by shading each text entity with different colors. We can see that HierText has a diversity of image domain, text layout, and high text density.

Samples from the HierText dataset. Left: Illustration of each word entity. Middle: Illustration of line clustering. Right: Illustration paragraph clustering.

Dataset with highest density of text

In addition to the novel hierarchical representation, HierText represents a new domain of text images. We note that HierText is currently the most dense publicly available OCR dataset. Below we summarize the characteristics of HierText in comparison with other OCR datasets. HierText identifies 103.8 words per image on average, which is more than 3x the density of TextOCR and 25x more dense than ICDAR-2015. This high density poses unique challenges for detection and recognition, and as a consequence HierText is used as one of the primary datasets for OCR research at Google.

Dataset       Training split       Validation split       Testing split       Words per image      
ICDAR-2015       1,000       0       500       4.4      
TextOCR       21,778       3,124       3,232       32.1      
Intel OCR       19,1059       16,731       0       10.0      
HierText       8,281       1,724       1,634       103.8

Comparing several OCR datasets to the HierText dataset.

Spatial distribution

We also find that text in the HierText dataset has a much more even spatial distribution than other OCR datasets, including TextOCR, Intel OCR, IC19 MLT, COCO-Text and IC19 LSVT. These previous datasets tend to have well-composed images, where text is placed in the middle of the images, and are thus easier to identify. On the contrary, text entities in HierText are broadly distributed across the images. It's proof that our images are from more diverse domains. This characteristic makes HierText uniquely challenging among public OCR datasets.

Spatial distribution of text instances in different datasets.

The HierText challenge

The HierText Challenge represents a novel task and with unique challenges for OCR models. We invite researchers to participate in this challenge and join us in ICDAR 2023 this year in San Jose, CA. We hope this competition will spark research community interest in OCR models with rich information representations that are useful for novel down-stream tasks.


The core contributors to this project are Shangbang Long, Siyang Qin, Dmitry Panteleev, Alessandro Bissacco, Yasuhisa Fujii and Michalis Raptis. Ashok Popat and Jake Walker provided valuable advice. We also thank Dimosthenis Karatzas and Sergi Robles from Autonomous University of Barcelona for helping us set up the competition website.

Source: Google AI Blog

A vision-language approach for foundational UI understanding

The computational understanding of user interfaces (UI) is a key step towards achieving intelligent UI behaviors. Previously, we investigated various UI modeling tasks, including widget captioning, screen summarization, and command grounding, that address diverse interaction scenarios such as automation and accessibility. We also demonstrated how machine learning can help user experience practitioners improve UI quality by diagnosing tappability confusion and providing insights for improving UI design. These works along with those developed by others in the field have showcased how deep neural networks can potentially transform end user experiences and the interaction design practice.

With these successes in addressing individual UI tasks, a natural question is whether we can obtain foundational understandings of UIs that can benefit specific UI tasks. As our first attempt to answer this question, we developed a multi-task model to address a range of UI tasks simultaneously. Although the work made some progress, a few challenges remain. Previous UI models heavily rely on UI view hierarchies — i.e., the structure or metadata of a mobile UI screen like the Document Object Model for a webpage — that allow a model to directly acquire detailed information of UI objects on the screen (e.g., their types, text content and positions). This metadata has given previous models advantages over their vision-only counterparts. However, view hierarchies are not always accessible, and are often corrupted with missing object descriptions or misaligned structure information. As a result, despite the short-term gains from using view hierarchies, it may ultimately hamper the model performance and applicability. In addition, previous models had to deal with heterogeneous information across datasets and UI tasks, which often resulted in complex model architectures that were difficult to scale or generalize across tasks.

In “Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus”, accepted for publication at ICLR 2023, we present a vision-only approach that aims to achieve general UI understanding completely from raw pixels. We introduce a unified approach to represent diverse UI tasks, the information for which can be universally represented by two core modalities: vision and language. The vision modality captures what a person would see from a UI screen, and the language modality can be natural language or any token sequences related to the task. We demonstrate that Spotlight substantially improves accuracy on a range of UI tasks, including widget captioning, screen summarization, command grounding and tappability prediction.

Spotlight Model

The Spotlight model input includes a tuple of three items: the screenshot, the region of interest on the screen, and the text description of the task. The output is a text description or response about the region of interest. This simple input and output representation of the model is expressive to capture various UI tasks and allows scalable model architectures. This model design allows a spectrum of learning strategies and setups, from task-specific fine-tuning, to multi-task learning and to few-shot learning. The Spotlight model, as illustrated in the above figure, leverages existing architecture building blocks such as ViT and T5 that are pre-trained in the high-resourced, general vision-language domain, which allows us to build on top of the success of these general domain models.

Because UI tasks are often concerned with a specific object or area on the screen, which requires a model to be able to focus on the object or area of interest, we introduce a Focus Region Extractor to a vision-language model that enables the model to concentrate on the region in light of the screen context.

In particular, we design a Region Summarizer that acquires a latent representation of a screen region based on ViT encodings by using attention queries generated from the bounding box of the region (see paper for more details). Specifically, each coordinate (a scalar value, i.e., the left, top, right or bottom) of the bounding box, denoted as a yellow box on the screenshot, is first embedded via a multilayer perceptron (MLP) as a collection of dense vectors, and then fed to a Transformer model along their coordinate-type embedding. The dense vectors and their corresponding coordinate-type embeddings are color coded to indicate their affiliation with each coordinate value. Coordinate queries then attend to screen encodings output by ViT via cross attention, and the final attention output of the Transformer is used as the region representation for the downstream decoding by T5.

A target region on the screen is summarized by using its bounding box to query into screen encodings from ViT via attentional mechanisms.


We pre-train the Spotlight model using two unlabeled datasets (an internal dataset based on C4 corpus and an internal mobile dataset) with 2.5 million mobile UI screens and 80 million web pages. We then separately fine-tune the pre-trained model for each of the four downstream tasks (captioning, summarization, grounding, and tappability). For widget captioning and screen summarization tasks, we report CIDEr scores, which measure how similar a model text description is to a set of references created by human raters. For command grounding, we report accuracy that measures the percentage of times the model successfully locates a target object in response to a user command. For tappability prediction, we report F1 scores that measure the model’s ability to tell tappable objects from untappable ones.

In this experiment, we compare Spotlight with several benchmark models. Widget Caption uses view hierarchy and the image of each UI object to generate a text description for the object. Similarly, Screen2Words uses view hierarchy and the screenshot as well as auxiliary features (e.g., app description) to generate a summary for the screen. In the same vein, VUT combines screenshots and view hierarchies for performing multiple tasks. Finally, the original Tappability model leverages object metadata from view hierarchy and the screenshot to predict object tappability. Taperception, a follow-up model of Tappability, uses a vision-only tappability prediction approach. We examine two Spotlight model variants with respect to the size of its ViT building block, including B/16 and L/16. Spotlight drastically exceeded the state-of-the-art across four UI modeling tasks.

Widget Caption      97      -      -      -     
Screen2Words      -      61.3      -      -     
VUT      99.3      65.6      82.1      -     
Taperception      -      -      -      85.5     
Tappability      -      -      -      87.9     


L/16      141.8      106.7      95.8      88.4     

We then pursue a more challenging setup where we ask the model to learn multiple tasks simultaneously because a multi-task model can substantially reduce model footprint. As shown in the table below, the experiments showed that our model still performs competitively.

Model      Captioning      Summarization      Grounding      Tappability
VUT multi-task      99.3      65.1      80.8      -     
Spotlight B/16      140      102.7      90.8      89.4     
Spotlight L/16      141.3      99.2      94.2      89.5     

To understand how the Region Summarizer enables Spotlight to focus on a target region and relevant areas on the screen, we analyze the attention weights (which indicate where the model attention is on the screenshot) for both widget captioning and screen summarization tasks. In the figure below, for the widget captioning task, the model predicts “select Chelsea team” for the checkbox on the left side, highlighted with a red bounding box. We can see from its attention heatmap (which illustrates the distribution of attention weights) on the right that the model learns to attend to not only the target region of the check box, but also the text “Chelsea" on the far left to generate the caption. For the screen summarization example, the model predicts “page displaying the tutorial of a learning app” given the screenshot on the left. In this example, the target region is the entire screen, and the model learns to attend to important parts on the screen for summarization.

For the widget captioning task, the attention heatmap shows the model attending to the checkbox, i.e., the target object, and the text label on its left when generating a caption for the object. The red bounding box in the figure is for illustration purposes.
For the screen summarization task that the target region encloses the entire screen, the attention heatmap shows the model attending to various locations on the screen that contribute to generating the summary.


We demonstrate that Spotlight outperforms previous methods that use both screenshots and view hierarchies as the input, and establishes state-of-the-art results on multiple representative UI tasks. These tasks range from accessibility, automation to interaction design and evaluation. Our vision-only approach for mobile UI understanding alleviates the need to use view hierarchy, allows the architecture to easily scale and benefits from the success of large vision-language models pre-trained for the general domain. Compared to recent large vision-language model efforts such as Flamingo and PaLI, Spotlight is relatively small and our experiments show the trend that larger models yield better performance. Spotlight can be easily applied to more UI tasks and potentially advance the fronts of many interaction and user experience tasks.


We thank Mandar Joshi and Tao Li for their help in processing the web pre-training dataset, and Chin-Yi Cheng and Forrest Huang for their feedback for proofreading the paper. Thanks to Tom Small for his help in creating animated figures in this post.

Source: Google AI Blog

Real-time tracking of wildfire boundaries using satellite imagery

As global temperatures rise, wildfires around the world are becoming more frequent and more dangerous. Their effects are felt by many communities as people evacuate their homes or suffer harm even from proximity to the fire and smoke.

As part of Google’s mission to help people access trusted information in critical moments, we use satellite imagery and machine learning (ML) to track wildfires and inform affected communities. Our wildfire tracker was recently expanded. It provides updated fire boundary information every 10–15 minutes, is more accurate than similar satellite products, and improves on our previous work. These boundaries are shown for large fires in the continental US, Mexico, and most of Canada and Australia. They are displayed, with additional information from local authorities, on Google Search and Google Maps, allowing people to keep safe and stay informed about potential dangers near them, their homes or loved ones.

Real-time boundary tracking of the 2021-2022 Wrattonbully bushfire, shown as a red polygon in Google Maps.


Wildfire boundary tracking requires balancing spatial resolution and update frequency. The most scalable method to obtain frequent boundary updates is to use geostationary satellites, i.e., satellites that orbit the earth once every 24 hours. These satellites remain at a fixed point above Earth, providing continual coverage of the area surrounding that point. Specifically, our wildfire tracker models use the GOES-16 and GOES-18 satellites to cover North America, and the Himawari-9 and GK2A satellites to cover Australia. These provide continent-scale images every 10 minutes. The spatial resolution is 2km at nadir (the point directly below the satellite), and lower as one moves away from nadir. The goal here is to provide people with warnings as soon as possible, and refer them to authoritative sources for spatially precise, on-the-ground data, as necessary.

Smoke plumes obscuring the 2018 Camp Fire in California. [Image from NASA Worldview]

Determining the precise extent of a wildfire is nontrivial, since fires emit massive smoke plumes, which can spread far from the burn area and obscure the flames. Clouds and other meteorological phenomena further obscure the underlying fire. To overcome these challenges, it is common to rely on infrared (IR) frequencies, particularly in the 3–4 μm wavelength range. This is because wildfires (and similar hot surfaces) radiate considerably at this frequency band, and these emissions diffract with relatively minor distortions through smoke and other particulates in the atmosphere. This is illustrated in the figure below, which shows a multispectral image of a wildfire in Australia. The visible channels (blue, green, and red) mostly show the triangular smoke plume, while the 3.85 μm IR channel shows the ring-shaped burn pattern of the fire itself. Even with the added information from the IR bands, however, determining the exact extent of the fire remains challenging, as the fire has variable emission strength, and multiple other phenomena emit or reflect IR radiation.

Himawari-8 hyperspectral image of a wildfire. Note the smoke plume in the visible channels (blue, green, and red), and the ring indicating the current burn area in the 3.85μm band.


Prior work on fire detection from satellite imagery is typically based on physics-based algorithms for identifying hotspots from multispectral imagery. For example, the National Oceanic and Atmospheric Administration (NOAA) fire product identifies potential wildfire pixels in each of the GOES satellites, primarily by relying on the 3.9 μm and 11.2 μm frequencies (with auxiliary information from two other frequency bands).

In our wildfire tracker, the model is trained on all satellite inputs, allowing it to learn the relative importance of different frequency bands. The model receives a sequence of the three most recent images from each band so as to compensate for temporary obstructions such as cloud cover. Additionally, the model receives inputs from two geostationary satellites, achieving a super-resolution effect whereby the detection accuracy improves upon the pixel size of either satellite. In North America, we also supply the aforementioned NOAA fire product as input. Finally, we compute the relative angles of the sun and the satellites, and provide these as additional input to the model.

All inputs are resampled to a uniform 1 km–square grid and fed into a convolutional neural network (CNN). We experimented with several architectures and settled on a CNN followed by a 1x1 convolutional layer to yield separate classification heads for fire and cloud pixels (shown below). The number of layers and their sizes are hyperparameters, which are optimized separately for Australia and North America. When a pixel is identified as a cloud, we override any fire detection since heavy clouds obscure underlying fires. Even so, separating the cloud classification task improves the performance of fire detection as we incentivize the system to better identify these edge cases.

CNN architecture for the Australia model; a similar architecture was used for North America. Adding a cloud classification head improves fire classification performance.

To train the network, we used thermal anomalies data from the MODIS and VIIRS polar-orbiting satellites as labels. MODIS and VIIRS have higher spatial accuracy (750–1000 meters) than the geostationary satellites we use as inputs. However, they cover a given location only once every few hours, which occasionally causes them to miss rapidly-advancing fires. Therefore, we use MODIS and VIIRS to construct a training set, but at inference time we rely on the high-frequency imagery from geostationary satellites.

Even when limiting attention to active fires, most pixels in an image are not currently burning. To reduce the model's bias towards non-burning pixels, we upsampled fire pixels in the training set and applied focal loss to encourage improvements in the rare misclassified fire pixels.

The progressing boundary of the 2022 McKinney fire, and a smaller nearby fire.


High-resolution fire signals from polar-orbiting satellites are a plentiful source for training data. However, such satellites use sensors that are similar to geostationary satellites, which increases the risk of systemic labeling errors (e.g., cloud-related misdetections) being incorporated into the model. To evaluate our wildfire tracker model without such bias, we compared it against fire scars (i.e., the shape of the total burnt area) measured by local authorities. Fire scars are obtained after a fire has been contained and are more reliable than real-time fire detection techniques. We compare each fire scar to the union of all fire pixels detected in real time during the wildfire to obtain an image such as the one shown below. In this image, green represents correctly identified burn areas (true positive), yellow represents unburned areas detected as burn areas (false positive), and red represents burn areas that were not detected (false negative).

Example evaluation for a single fire. Pixel size is 1km x 1km.

We compare our models to official fire scars using the precision and recall metrics. To quantify the spatial severity of classification errors, we take the maximum distance between a false positive or false negative pixel and the nearest true positive fire pixel. We then average each metric across all fires. The results of the evaluation are summarized below. Most severe misdetections were found to be a result of errors in the official data, such as a missing scar for a nearby fire.

Test set metrics comparing our models to official fire scars.

We performed two additional experiments on wildfires in the United States (see table below). First, we evaluated an earlier model that relies only on NOAA's GOES-16 and GOES-17 fire products. Our model outperforms this approach in all metrics considered, demonstrating that the raw satellite measurements can be used to enhance the existing NOAA fire product.

Next, we collected a new test set consisting of all large fires in the United States in 2022. This test set was not available during training because the model launched before the fire season began. Evaluating the performance on this test set shows performance in line with expectations from the original test set.

Comparison between models on fires in the United States.


Boundary tracking is part of Google’s wider commitment to bring accurate and up-to-date information to people in critical moments. This demonstrates how we use satellite imagery and ML to track wildfires, and provide real time support to affected people in times of crisis. In the future, we plan to keep improving the quality of our wildfire boundary tracking, to expand this service to more countries and continue our work helping fire authorities access critical information in real time.


This work is a collaboration between teams from Google Research, Google Maps and Crisis Response, with support from our partnerships and policy teams. We would also like to thank the fire authorities whom we partner with around the world.

Source: Google AI Blog

Infinite Nature: Generating 3D Flythroughs from Still Photos

We live in a world of great natural beauty — of majestic mountains, dramatic seascapes, and serene forests. Imagine seeing this beauty as a bird does, flying past richly detailed, three-dimensional landscapes. Can computers learn to synthesize this kind of visual experience? Such a capability would allow for new kinds of content for games and virtual reality experiences: for instance, relaxing within an immersive flythrough of an infinite nature scene. But existing methods that synthesize new views from images tend to allow for only limited camera motion.

In a research effort we call Infinite Nature, we show that computers can learn to generate such rich 3D experiences simply by viewing nature videos and photographs. Our latest work on this theme, InfiniteNature-Zero (presented at ECCV 2022) can produce high-resolution, high-quality flythroughs starting from a single seed image, using a system trained only on still photographs, a breakthrough capability not seen before. We call the underlying research problem perpetual view generation: given a single input view of a scene, how can we synthesize a photorealistic set of output views corresponding to an arbitrarily long, user-controlled 3D path through that scene? Perpetual view generation is very challenging because the system must generate new content on the other side of large landmarks (e.g., mountains), and render that new content with high realism and in high resolution.

Example flythrough generated with InfiniteNature-Zero. It takes a single input image of a natural scene and synthesizes a long camera path flying into that scene, generating new scene content as it goes.

Background: Learning 3D Flythroughs from Videos

To establish the basics of how such a system could work, we’ll describe our first version, “Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image” (presented at ICCV 2021). In that work we explored a “learn from video” approach, where we collected a set of online videos captured from drones flying along coastlines, with the idea that we could learn to synthesize new flythroughs that resemble these real videos. This set of online videos is called the Aerial Coastline Imagery Dataset (ACID). In order to learn how to synthesize scenes that respond dynamically to any desired 3D camera path, however, we couldn’t simply treat these videos as raw collections of pixels; we also had to compute their underlying 3D geometry, including the camera position at each frame.

The basic idea is that we learn to generate flythroughs step-by-step. Given a starting view, like the first image in the figure below, we first compute a depth map using single-image depth prediction methods. We then use that depth map to render the image forward to a new camera viewpoint, shown in the middle, resulting in a new image and depth map from that new viewpoint.

However, this intermediate image has some problems — it has holes where we can see behind objects into regions that weren’t visible in the starting image. It is also blurry, because we are now closer to objects, but are stretching the pixels from the previous frame to render these now-larger objects.

To handle these problems, we learn a neural image refinement network that takes this low-quality intermediate image and outputs a complete, high-quality image and corresponding depth map. These steps can then be repeated, with this synthesized image as the new starting point. Because we refine both the image and the depth map, this process can be iterated as many times as desired — the system automatically learns to generate new scenery, like mountains, islands, and oceans, as the camera moves further into the scene.

Our Infinite Nature methods take an input view and its corresponding depth map (left). Using this depth map, the system renders the input image to a new desired viewpoint (center). This intermediate image has problems, such as missing pixels revealed behind foreground content (shown in magenta). We learn a deep network that refines this image to produce a new high-quality image (right). This process can be repeated to produce a long trajectory of views. We thus call this approach “render-refine-repeat”.

We train this render-refine-repeat synthesis approach using the ACID dataset. In particular, we sample a video from the dataset and then a frame from that video. We then use this method to render several new views moving into the scene along the same camera trajectory as the ground truth video, as shown in the figure below, and compare these rendered frames to the corresponding ground truth video frames to derive a training signal. We also include an adversarial setup that tries to distinguish synthesized frames from real images, encouraging the generated imagery to appear more realistic.

Infinite Nature can synthesize views corresponding to any camera trajectory. During training, we run our system for T steps to generate T views along a camera trajectory calculated from a training video sequence, then compare the resulting synthesized views to the ground truth ones. In the figure, each camera viewpoint is generated from the previous one by performing a warp operation R, followed by the neural refinement operation gθ.

The resulting system can generate compelling flythroughs, as featured on the project webpage, along with a “flight simulator” Colab demo. Unlike prior methods on video synthesis, this method allows the user to interactively control the camera and can generate much longer camera paths.

InfiniteNature-Zero: Learning Flythroughs from Still Photos

One problem with this first approach is that video is difficult to work with as training data. High-quality video with the right kind of camera motion is challenging to find, and the aesthetic quality of an individual video frame generally cannot compare to that of an intentionally captured nature photograph. Therefore, in “InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images”, we build on the render-refine-repeat strategy above, but devise a way to learn perpetual view synthesis from collections of still photos — no videos needed. We call this method InfiniteNature-Zero because it learns from “zero” videos. At first, this might seem like an impossible task — how can we train a model to generate video flythroughs of scenes when all it’s ever seen are isolated photos?

To solve this problem, we had the key insight that if we take an image and render a camera path that forms a cycle — that is, where the path loops back such that the last image is from the same viewpoint as the first — then we know that the last synthesized image along this path should be the same as the input image. Such cycle consistency provides a training constraint that helps the model learn to fill in missing regions and increase image resolution during each step of view generation.

However, training with these camera cycles is insufficient for generating long and stable view sequences, so as in our original work, we include an adversarial strategy that considers long, non-cyclic camera paths, like the one shown in the figure above. In particular, if we render T frames from a starting frame, we optimize our render-refine-repeat model such that a discriminator network can’t tell which was the starting frame and which was the final synthesized frame. Finally, we add a component trained to generate high-quality sky regions to increase the perceived realism of the results.

With these insights, we trained InfiniteNature-Zero on collections of landscape photos, which are available in large quantities online. Several resulting videos are shown below — these demonstrate beautiful, diverse natural scenery that can be explored along arbitrarily long camera paths. Compared to our prior work — and to prior video synthesis methods — these results exhibit significant improvements in quality and diversity of content (details available in the paper).

Several nature flythroughs generated by InfiniteNature-Zero from single starting photos.


There are a number of exciting future directions for this work. For instance, our methods currently synthesize scene content based only on the previous frame and its depth map; there is no persistent underlying 3D representation. Our work points towards future algorithms that can generate complete, photorealistic, and consistent 3D worlds.


Infinite Nature and InfiniteNature-Zero are the result of a collaboration between researchers at Google Research, UC Berkeley, and Cornell University. The key contributors to the work represented in this post include Angjoo Kanazawa, Andrew Liu, Richard Tucker, Zhengqi Li, Noah Snavely, Qianqian Wang, Varun Jampani, and Ameesh Makadia.

Source: Google AI Blog

Open Images V7 — Now Featuring Point Labels

Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. Researchers around the world use Open Images to train and evaluate computer vision models. Since the initial release of Open Images in 2016, which included image-level labels covering 6k categories, we have provided multiple updates to enrich annotations and expand the potential use cases of the dataset. Through several releases, we have added image-level labels for over 20k categories on all images and bounding box annotations, visual relations, instance segmentations, and localized narratives (synchronized voice, mouse trace, and text caption) on a subset of 1.9M images.

Today, we are happy to announce the release of Open Images V7, which expands the Open Images dataset even further with a new annotation type called point-level labels and includes a new all-in-one visualization tool that allows a better exploration of the rich data available.

Point Labels

The main strategy used to collect the new point-level label annotations leveraged suggestions from a machine learning (ML) model and human verification. First, the ML model selected points of interest and asked a yes or no question, e.g., “is this point on a pumpkin?”. Then, human annotators spent an average of 1.1 seconds answering the yes or no questions. We aggregated the answers from different annotators over the same question and assigned a final “yes”, “no”, or “unsure” label to each annotated point.

Illustration of the annotations interface.
(Image by Lenore Edman, under CC BY 2.0 license)

For each annotated image, we provide a collection of points, each with a “yes” or “no” label for a given class. These points provide sparse information that can be used for the semantic segmentation task. We collected a total of 38.6M new point annotations (12.4M with “yes” labels) that cover 5.8 thousand classes and 1.4M images.

By focusing on point labels, we expanded the number of images annotated and categories covered. We also concentrated the efforts of our annotators on efficiently collecting useful information. Compared to our instance segmentation, the new points include 16x more classes and cover more images. The new points also cover 9x more classes than our box annotations. Compared to existing segmentation datasets, like PASCAL VOC, COCO, Cityscapes, LVIS, or ADE20K, our annotations cover more classes and more images than previous work. The new point label annotations are the first type of annotation in Open Images that provides localization information for both things (countable objects, like cars, cats, and catamarans), and stuff categories (uncountable objects like grass, granite, and gravel). Overall, the newly collected data is roughly equivalent to two years of human annotation effort.

Our initial experiments show that this type of sparse data is suitable for both training and evaluating segmentation models. Training a model directly on sparse data allows us to reach comparable quality to training on dense annotations. Similarly, we show that one can directly compute the traditional semantic segmentation intersection-over-union (IoU) metric over sparse data. The ranking across different methods is preserved, and the sparse IoU values are an accurate estimate of its dense version. See our paper for more details.

Below, we show four example images with their point-level labels, illustrating the rich and diverse information these annotations provide. Circles ⭘ are “yes” labels, and squares are “no” labels.

Four example images with point-level labels.
Images by Richie Diesterheft, John AM Nueva, Sarah Ackerman, and C Thomas, all under CC BY 2.0 license.

New Visualizers

In addition to the new data release, we also expanded the available visualizations of the Open Images annotations. The Open Images website now includes dedicated visualizers to explore the localized narratives annotations, the new point-level annotations, and a new all-in-one view. This new all-in-one view is available for the subset of 1.9M densely annotated images and allows one to explore the rich annotations that Open Images has accumulated over seven releases. On average these images have annotations for 6.7 image-labels (classes), 8.3 boxes, 1.7 relations, 1.5 masks, 0.4 localized narratives and 34.8 point-labels per image.

Below, we show two example images with various annotations in the all-in-one visualizer. The figures show the image-level labels, bounding boxes, box relations, instance masks, localized narrative mouse trace and caption, and point-level labels. The + classes have positive annotations (of any kind), while classes have only negative annotations (image-level or point-level).

Two example images with various annotations in the all-in-one visualizer.
Images by Jason Paris, and Rubén Vique, all under CC BY 2.0 license.


We hope that this new data release will enable computer vision research to cover ever more diverse and challenging scenarios. As the quality of automated semantic segmentation models improves over common classes, we want to move towards the long tail of visual concepts, and sparse point annotations are a step in that direction. More and more works are exploring how to use such sparse annotations (e.g., as supervision for instance segmentation or semantic segmentation), and Open Images V7 contributes to this research direction. We are looking forward to seeing what you will build next.


Thanks to Vittorio Ferrari, Jordi Pont-Tuset, Alina Kuznetsova, Ashlesha Sadras, and the annotators team for their support creating this new data release.

Source: Google AI Blog

Google at ECCV 2022

Google is proud to be a Platinum Sponsor of the European Conference on Computer Vision (ECCV 2022), a premier forum for the dissemination of research in computer vision and machine learning (ML). This year, ECCV 2022 will be held as a hybrid event, in person in Tel Aviv, Israel with virtual attendance as an option. Google has a strong presence at this year’s conference with over 60 accepted publications and active involvement in a number of workshops and tutorials. We look forward to sharing some of our extensive research and expanding our partnership with the broader ML research community.

Registered for ECCV 2022? We hope you’ll visit our on-site or virtual booths to learn more about the research we’re presenting at ECCV 2022, including several demos and opportunities to connect with our researchers. Learn more about Google's research being presented at ECCV 2022 below (Google affiliations in bold).

Organizing Committee

Program Chairs include: Moustapha Cissé

Awards Paper Committee: Todd Zickler

Area Chairs include: Ayan Chakrabarti, Tali Dekel, Alireza Fathi, Vittorio Ferrari, David Fleet, Dilip Krishnan, Michael Rubinstein, Cordelia Schmid, Deqing Sun, Federico Tombari, Jasper Uijlings, Ming-Hsuan Yang, Todd Zickler

Accepted Publications

NeuMesh: Learning Disentangled Neural Mesh-Based Implicit Field for Geometry and Texture Editing
Bangbang Yang, Chong Bao, Junyi Zeng, Hujun Bao, Yinda Zhang, Zhaopeng Cui, Guofeng Zhang

Anti-Neuron Watermarking: Protecting Personal Data Against Unauthorized Neural Networks
Zihang Zou, Boqing Gong, Liqiang Wang

Exploiting Unlabeled Data with Vision and Language Models for Object Detection
Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, Vijay Kumar B G, Anastasis Stathopoulos, Manmohan Chandraker, Dimitris N. Metaxas

Waymo Open Dataset: Panoramic Video Panoptic Segmentation
Jieru Mei, Alex Zhu, Xinchen Yan, Hang Yan, Siyuan Qiao, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar

PRIF: Primary Ray-Based Implicit Function
Brandon Yushan Feng, Yinda Zhang, Danhang Tang, Ruofei Du, Amitabh Varshney

LoRD: Local 4D Implicit Representation for High-Fidelity Dynamic Human Modeling
Boyan Jiang, Xinlin Ren, Mingsong Dou, Xiangyang Xue, Yanwei Fu, Yinda Zhang

k-Means Mask Transformer (see blog post)
Qihang Yu*, Siyuan Qiao, Maxwell D Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen

MaxViT: Multi-Axis Vision Transformer (see blog post)
Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

E-Graph: Minimal Solution for Rigid Rotation with Extensibility Graphs
Yanyan Li, Federico Tombari

RBP-Pose: Residual Bounding Box Projection for Category-Level Pose Estimation
Ruida Zhang, Yan Di, Zhiqiang Lou, Fabian Manhardt, Federico Tombari, Xiangyang Ji

GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning
Huseyin Coskun, Alireza Zareian, Joshua L Moore, Federico Tombari, Chen Wang

Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin*

Adaptive Transformers for Robust Few-Shot Cross-Domain Face Anti-spoofing
Hsin-Ping Huang, Deqing Sun, Yaojie Liu, Wen-Sheng Chu, Taihong Xiao, Jinwei Yuan, Hartwig Adam, Ming-Hsuan Yang

DualPrompt: Complementary Prompting for Rehearsal-Free Continual Learning
Zifeng Wang*, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, Tomas Pfister

BLT: Bidirectional Layout Transformer for Controllable Layout Generation
Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa

V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer
Runsheng Xu, Hao Xiang, Zhengzhong Tu, Xin Xia, Ming-Hsuan Yang, Jiaqi Ma

Learning Visibility for Robust Dense Human Body Estimation
Chun-Han Yao, Jimei Yang, Duygu Ceylan, Yi Zhou, Yang Zhou, Ming-Hsuan Yang

Are Vision Transformers Robust to Patch Perturbations?
Jindong Gu, Volker Tresp, Yao Qin

PseudoAugment: Learning to Use Unlabeled Data for Data Augmentation in Point Clouds
Zhaoqi Leng, Shuyang Cheng, Ben Caine, Weiyue Wang, Xiao Zhang, Jonathon Shlens, Mingxing Tan, Dragomir Anguelov

Structure and Motion from Casual Videos
Zhoutong Zhang, Forrester Cole, Zhengqi Li, Noah Snavely, Michael Rubinstein, William T. Freeman

PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map
Chenfeng Xu, Tian Li, Chen Tang, Lingfeng Sun, Kurt Keutzer, Masayoshi Tomizuka, Alireza Fathi, Wei Zhan

Novel Class Discovery Without Forgetting
Joseph K J, Sujoy Paul, Gaurav Aggarwal, Soma Biswas, Piyush Rai, Kai Han, Vineeth N Balasubramanian

Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning
Yuxiao Chen, Long Zhao, Jianbo Yuan, Yu Tian, Zhaoyang Xia, Shijie Geng, Ligong Han, Dimitris N. Metaxas

PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks
Nan Ding, Xi Chen, Tomer Levinboim, Soravit Changpinyo, Radu Soricut

InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images
Zhengqi Li, Qianqian Wang*, Noah Snavely, Angjoo Kanazawa*

Generalizable Patch-Based Neural Rendering (see blog post)
Mohammed Suhail*, Carlos Esteves, Leonid Sigal, Ameesh Makadia

LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds
Minghua Liu, Yin Zhou, Charles R. Qi, Boqing Gong, Hao Su, Dragomir Anguelov

The Missing Link: Finding Label Relations Across Datasets
Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Learning Instance-Specific Adaptation for Cross-Domain Segmentation
Yuliang Zou, Zizhao Zhang, Chun-Liang Li, Han Zhang, Tomas Pfister, Jia-Bin Huang

Learning Audio-Video Modalities from Image Captions
Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency
Medhini Narasimhan*, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid

On Label Granularity and Object Localization
Elijah Cole, Kimberly Wilber, Grant Van Horn, Xuan Yang, Marco Fornoni, Pietro Perona, Serge Belongie, Andrew Howard, Oisin Mac Aodha

Disentangling Architecture and Training for Optical Flow
Deqing Sun, Charles Herrmann, Fitsum Reda, Michael Rubinstein, David J. Fleet, William T. Freeman

NewsStories: Illustrating Articles with Visual Summaries
Reuben Tan, Bryan Plummer, Kate Saenko, J.P. Lewis, Avneesh Sud, Thomas Leung

Improving GANs for Long-Tailed Data Through Group Spectral Regularization
Harsh Rangwani, Naman Jaswani, Tejan Karmali, Varun Jampani, Venkatesh Babu Radhakrishnan

Planes vs. Chairs: Category-Guided 3D Shape Learning Without Any 3D Cues
Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James Rehg

A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, James Hays

Learned Monocular Depth Priors in Visual-Inertial Initialization
Yunwen Zhou, Abhishek Kar, Eric L. Turner, Adarsh Kowdle, Chao Guo, Ryan DuToit, Konstantine Tsotsos

How Stable are Transferability Metrics Evaluations?
Andrea Agostinelli, Michal Pandy, Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Data-Free Neural Architecture Search via Recursive Label Calibration
Zechun Liu*, Zhiqiang Shen, Yun Long, Eric Xing, Kwang-Ting Cheng, Chas H. Leichner

Fast and High Quality Image Denoising via Malleable Convolution
Yifan Jiang*, Bartlomiej Wronski, Ben Mildenhall, Jonathan T. Barron, Zhangyang Wang, Tianfan Xue

Concurrent Subsidiary Supervision for Unsupervised Source-Free Domain Adaptation
Jogendra Nath Kundu, Suvaansh Bhambri, Akshay R Kulkarni, Hiran Sarkar,
Varun Jampani, Venkatesh Babu Radhakrishnan

Learning Online Multi-Sensor Depth Fusion
Erik Sandström, Martin R. Oswald, Suryansh Kumar, Silvan Weder, Fisher Yu, Cristian Sminchisescu, Luc Van Gool

Hierarchical Semantic Regularization of Latent Spaces in StyleGANs
Tejan Karmali, Rishubh Parihar, Susmit Agrawal, Harsh Rangwani, Varun Jampani, Maneesh K Singh, Venkatesh Babu Radhakrishnan

RayTran: 3D Pose Estimation and Shape Reconstruction of Multiple Objects from Videos with Ray-Traced Transformers
Michał J Tyszkiewicz, Kevis-Kokitsi Maninis, Stefan Popov, Vittorio Ferrari

Neural Video Compression Using GANs for Detail Synthesis and Propagation
Fabian Mentzer, Eirikur Agustsson, Johannes Ballé, David Minnen, Nick Johnston, George Toderici

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset
Grant Van Horn, Rui Qian, Kimberly Wilber, Hartwig Adam, Oisin Mac Aodha, Serge Belongie

Implicit Neural Representations for Image Compression
Yannick Strümpler, Janis Postels, Ren Yang, Luc Van Gool, Federico Tombari

3D Compositional Zero-Shot Learning with DeCompositional Consensus
Muhammad Ferjad Naeem, Evin Pınar Örnek, Yongqin Xian, Luc Van Gool, Federico Tombari

FindIt: Generalized Localization with Natural Language Queries (see blog post)
Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova

A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation
Wuyang Chen*, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou

Improved Masked Image Generation with Token-Critic
Jose Lezama, Huiwen Chang, Lu Jiang, Irfan Essa

Learning Discriminative Shrinkage Deep Networks for Image Deconvolution
Pin-Hung Kuo, Jinshan Pan, Shao-Yi Chien, Ming-Hsuan Yang

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
Efthymios Tzinis*, Scott Wisdom, Tal Remez, John Hershey

Simple Open-Vocabulary Object Detection with Vision Transformers
Matthias Minderer, Alexey Gritsenko, Austin C Stone, Maxim Neumann, Dirk Weißenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby

COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality
Honglu Zhou, Asim Kadav, Aviv Shamsian, Shijie Geng, Farley Lai, Long Zhao, Ting Liu, Mubbasir Kapadia, Hans Peter Graf

Video Question Answering with Iterative Video-Text Co-tokenization (see blog post)
AJ Piergiovanni, Kairo Morton*, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Class-Agnostic Object Detection with Multi-modal Transformer
Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang

FILM: Frame Interpolation for Large Motion (see blog post)
Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, Brian Curless

Compositional Human-Scene Interaction Synthesis with Semantic Control
Kaifeng Zhao, Shaofei Wang, Yan Zhang, Thabo Beeler, Siyu Tang


LatinX in AI
Mentors include: José Lezama
Keynote Speakers include: Andre Araujo

AI for Creative Video Editing and Understanding
Keynote Speakers include: Tali Dekel, Negar Rostamzadeh

Learning With Limited and Imperfect Data (L2ID)
Invited Speakers include: Xiuye Gu
Organizing Committee includes: Sadeep Jayasumana

International Challenge on Compositional and Multimodal Perception (CAMP)
Program Committee includes: Edward Vendrow

Self-Supervised Learning: What is Next?
Invited Speakers include: Mathilde Caron, Arsha Nagrani
Organizers include: Andrew Zisserman

3rd Workshop on Adversarial Robustness In the Real World
Invited Speakers include: Ekin Dogus Cubuk
Organizers include: Xinyun Chen, Alexander Robey, Nataniel Ruiz, Yutong Bai

AV4D: Visual Learning of Sounds in Spaces
Invited Speakers include: John Hershey

Challenge on Mobile Intelligent Photography and Imaging (MIPI)
Invited Speakers include: Peyman Milanfar

Robust Vision Challenge 2022
Organizing Committee includes: Alina Kuznetsova

Computer Vision in the Wild
Challenge Organizers include: Yi-Ting Chen, Ye Xia
Invited Speakers include: Yin Cui, Yongqin Xian, Neil Houlsby

Self-Supervised Learning for Next-Generation Industry-Level Autonomous Driving (SSLAD)
Organizers include: Fisher Yu

Responsible Computer Vision
Organizing Committee includes: Been Kim
Invited Speakers include: Emily Denton

Cross-Modal Human-Robot Interaction
Invited Speakers include: Peter Anderson

ISIC Skin Image Analysis
Organizing Committee includes: Yuan Liu
Steering Committee includes: Yuan Liu, Dale Webster
Invited Speakers include: Yuan Liu

Observing and Understanding Hands in Action
Sponsored by Google

Autonomous Vehicle Vision (AVVision)
Speakers include: Fisher Yu

Visual Perception for Navigation in Human Environments: The JackRabbot Human Body Pose Dataset and Benchmark
Organizers include: Edward Vendrow

Language for 3D Scenes
Invited Speakers include: Jason Baldridge
Organizers include: Leonidas Guibas

Designing and Evaluating Computer Perception Systems (CoPe)
Organizers include: Andrew Zisserman

Learning To Generate 3D Shapes and Scenes
Panelists include: Pete Florence

Advances in Image Manipulation
Program Committee includes: George Toderici, Ming-Hsuan Yang

TiE: Text in Everything
Challenge Organizers include: Shangbang Long, Siyang Qin
Invited Speakers include: Tali Dekel, Aishwarya Agrawal

Instance-Level Recognition
Organizing Committee: Andre Araujo, Bingyi Cao, Tobias Weyand
Invited Speakers include: Mathilde Caron

What Is Motion For?
Organizing Committee: Deqing Sun, Fitsum Reda, Charles Herrmann
Invited Speakers include: Tali Dekel

Neural Geometry and Rendering: Advances and the Common Objects in 3D Challenge
Invited Speakers include: Ben Mildenhall

Visual Object-Oriented Learning Meets Interaction: Discovery, Representations, and Applications
Invited Speakers include: Klaus Greff, Thomas Kipf
Organizing Committee includes: Leonidas Guibas

Vision with Biased or Scarce Data (VBSD)
Program Committee includes: Yizhou Wang

Multiple Object Tracking and Segmentation in Complex Environments
Invited Speakers include: Xingyi Zhou, Fisher Yu

3rd Visual Inductive Priors for Data-Efficient Deep Learning Workshop
Organizing Committee includes: Ekin Dogus Cubuk

DeeperAction: Detailed Video Action Understanding and Anomaly Recognition
Advisors include: Rahul Sukthankar

Sign Language Understanding Workshop and Sign Language Recognition, Translation & Production Challenge
Organizing Committee includes: Andrew Zisserman
Speakers include: Andrew Zisserman

Ego4D: First-Person Multi-Modal Video Understanding
Invited Speakers include: Michal Irani

AI-Enabled Medical Image Analysis: Digital Pathology & Radiology/COVID19
Program Chairs include: Po-Hsuan Cameron Chen
Workshop Partner: Google Health

Visual Object Tracking Challenge (VOT 2022)
Technical Committee includes: Christoph Mayer

Assistive Computer Vision and Robotics
Technical Committee includes: Maja Mataric

Human Body, Hands, and Activities from Egocentric and Multi-View Cameras
Organizers include: Francis Engelmann

Frontiers of Monocular 3D Perception: Implicit x Explicit
Panelists include: Pete Florence


Self-Supervised Representation Learning in Computer Vision
Invited Speakers include: Ting Chen

Neural Volumetric Rendering for Computer Vision
Organizers include: Ben Mildenhall, Pratul Srinivasan, Jon Barron
Presenters include: Ben Mildenhall, Pratul Srinivasan

New Frontiers in Efficient Neural Architecture Search!
Speakers include: Ruochen Wang

*Work done while at Google.  

Source: Google AI Blog