Tag Archives: Computer Vision

LIMoE: Learning Multiple Modalities with One Sparse Mixture of Experts Model

Sparse models stand out among the most promising approaches for the future of deep learning. Instead of every part of a model processing every input (“dense” modeling), sparse models employing conditional computation learn to route individual inputs to different “experts” in a potentially huge network. This has many benefits. First, model size can increase while keeping computational cost constant — an effective and environmentally friendlier way to scale models, which is often key to high performance. Sparsity also naturally compartmentalizes neural networks. Dense models that learn many different tasks simultaneously (multitask) or sequentially (continual learning) often suffer negative interference, where too much task variety means it is better to just train one model per task, or catastrophic forgetting, where the model becomes worse at earlier tasks as new ones are added. Sparse models help avoid both these phenomena — by not applying the whole model to all inputs, “experts” in the model can specialize on different tasks or data types while still taking advantage of shared parts of the model.

Research on sparsity has long been pursued at Google Research. Pathways summarizes the research vision of building one single large model that diligently handles thousands of tasks and numerous data modalities. So far there has been considerable progress in sparse unimodal models for language (Switch, Task-MoE, GLaM) and computer vision (Vision MoE). Today, we take another important step towards the Pathways vision by studying large sparse models that simultaneously handle images and text with modality-agnostic routing. A relevant approach is multimodal contrastive learning, which requires a solid understanding of both images and text in order to align pictures with their correct text description. The strongest models that tackle this task to date rely on independent networks for each modality (a “two-tower” approach).

In “Multimodal Contrastive Learning with LIMoE: the Language Image Mixture of Experts”, we present the first large-scale multimodal architecture using a sparse mixture of experts. It simultaneously processes both images and text, but uses sparsely activated experts that naturally specialize. On zero-shot image classification, LIMoE outperforms both comparable dense multimodal models and two-tower approaches. The largest LIMoE achieves 84.1% zero-shot ImageNet accuracy, comparable to more expensive state-of-the-art models. Sparsity enables LIMoE to scale up gracefully and learn to handle very different inputs, addressing the tension between being a jack-of-all-trades generalist and a master-of-one specialist.

The LIMoE architecture contains many “experts” and routers decide which tokens (parts of an image or sentence) go to which experts. After being processed by expert layers (gray) and shared dense layers (brown), a final output layer computes a single vector representation for either an image or a text.

Sparse Mixture of Expert Models
Transformers represent data as a sequence of vectors (or tokens). Though originally developed for text, they can be applied to most things that are representable as a sequence of tokens, e.g., images, videos, and audio. Recent large-scale MoE models add expert layers to the Transformer architecture (e.g., gShard and ST-MoE in natural language processing, and Vision MoE for vision tasks).

A standard Transformer consists of many “blocks”, each containing various different layers. One of these layers is a feed-forward network (FFN). For LIMoE and the works cited above, this single FFN is replaced by an expert layer that contains many parallel FFNs, each of which is an expert. Given a sequence of tokens to process, a simple router learns to predict which experts should handle which tokens. Only a small number of experts are activated per token, meaning although the model capacity is significantly increased by virtue of having so many experts, the actual computational cost is controlled by using them sparsely. If only one expert is activated, the model's cost is roughly equivalent to the standard Transformer model.

LIMoE does precisely that, activating one expert per example, thereby matching the computational cost of the dense baselines. What’s different is that the LIMoE router might see tokens of either image or text data.

A unique failure mode of MoE models occurs when they try to send all tokens to the same expert. Typically this is addressed with auxiliary losses, extra training objectives that encourage balanced expert usage. We found that dealing with multiple modalities interacted with sparsity to cause new failure modes that existing auxiliary losses could not address. To overcome this, we developed new auxiliary losses (more details in the paper) and used routing prioritization (BPR) during training, two innovations that resulted in stable and high performance multimodal models.

The new auxiliary losses (LIMoE aux) and routing prioritization (BPR) stabilized and improved overall performance (left) and increased the success rate of routing behavior (middle and right). A low success rate means the router does not use all the experts available and drops many tokens due to individual expert capacity being reached, which usually indicates the sparse model is not learning well. The combination introduced for LIMoE ensures high routing success rates for both images and text and consequently leads to significantly better performance.

Contrastive Learning with LIMoE
In multimodal contrastive learning, models are trained on paired image-text data (e.g., a photo and its caption). Typically, an image model extracts a representation of images, and a different text model extracts a representation of text. The contrastive learning objective encourages the image and text representations to be close for the same image-text pair and far away for content from different pairs. Such models with aligned representations can be adapted to new tasks without extra training data (“zero-shot”), e.g., an image will be classified as a dog if its representation is closer to the representation of the word “dog” than the word “cat”. This idea scales to thousands of classes and is referred to as zero-shot image classification.

CLIP and ALIGN (both two-tower models) scaled this process to achieve 76.2% and 76.4% zero-shot classification accuracy on the popular ImageNet dataset. We study one-tower models which compute both image and text representations. We find this reduces performance for dense models, likely due to negative interference or insufficient capacity. However, a compute-matched LIMoE not only improves over the one-tower dense model, but also outperforms two-tower dense models. We trained a series of models in a comparable training regimen to CLIP. Our dense L/16 model achieves 73.5% zero-shot accuracy, whereas LIMoE-L/16 gets to 78.6%, even outperforming CLIP’s more expensive, two-tower L/14 model (76.2%). As shown below, LIMoE’s use of sparsity provides a remarkable performance boost over dense models with equivalent cost.

For a given compute cost (x-axis), LIMoE models (circles, solid line) are significantly better than their dense baselines (triangles, dashed line). The architecture indicates the size of the underlying transformer, increasing from left (S/32) to right (L/16). Following standard convention, S (small), B (base), and L (large) refer to model scale. The number refers to the patch size, where smaller patches imply a larger architecture.

LiT and BASIC pushed zero-shot accuracy for dense two-tower models to 84.5% and 85.6% respectively. In addition to scaling, these approaches made use of specialized pre-training methods, repurposing image models that were already of exceptionally high quality. LIMoE-H/14 does not benefit from any pre-training or modality-specific components, but still achieved a comparable 84.1% zero-shot accuracy training from scratch. The scale of these models is also interesting to compare: LiT and BASIC are 2.1B and 3B parameter models. LIMoE-H/14 has 5.6B parameters in total, but via sparsity it only applies 675M parameters per token making it significantly more lightweight.

Data seen during training
Model   Pre-training     Image-text     Total      Parameters per token     ImageNet accuracy  
CLIP - 12.8B 12.8B ~200M 76.2%
ALIGN - 19.8B 19.8B ~410M 76.4%
LiT 25.8B 18.2B 44.0B 1.1B 84.5%
BASIC 19.7B 32.8B 52.5B 1.5B 85.6%
LIMoE H/14    - 23.3B 23.3B 675M 84.1%

Understanding LIMoE’s Behavior
LIMoE was motivated by the intuition that sparse conditional computation enables a generalist multimodal model to still develop the specialization needed to excel at understanding each modality. We analyzed LIMoE’s expert layers and uncovered a few interesting phenomena.

First, we see the emergence of modality-specialized experts. In our training setup there are many more image tokens than text tokens, so all experts tend to process at least some images, but some experts process either mostly images, mostly text, or both.

Distributions for an eight expert LIMoE; percentages indicate the amount of image tokens processed by the expert. There are one or two experts clearly specialized on text (shown by the mostly blue experts), usually two to four image specialists (mostly red), and the remainder are somewhere in the middle.

There are also some clear qualitative patterns among the image experts — e.g., in most LIMoE models, there is an expert that processes all image patches that contain text. In the example below, one expert processes fauna and greenery, and another processes human hands.

LIMoE chooses an expert for each token. Here we show which image tokens go to which experts on one of the layers of LIMoE-H/14. Despite not being trained to do so, we observe the emergence of semantic experts that specialize in specific topics such as plants or wheels.

Moving Forward
Multimodal models that handle many tasks are a promising route forward, and there are two key ingredients for success: scale, and the ability to avoid interference between distinct tasks and modalities while taking advantage of synergies. Sparse conditional computation is an excellent way of doing both. It enables performant and efficient generalist models that also have the capacity and flexibility for the specialization necessary to excel at individual tasks, as demonstrated by LIMoE’s solid performance with less compute.

Acknowledgements
We thank our co-authors on this work: Joan Puigcerver, Rodolphe Jenatton and Neil Houlsby. We also thank Andreas Steiner, Xiao Wang and Xiaohua Zhai, who led early explorations into dense single-tower models for contrastive multimodal learning, and also were instrumental in providing data access. We enjoyed useful discussions with André Susano Pinto, Maxim Neumann, Barret Zoph, Liam Fedus, Wei Han and Josip Djolonga. Finally, we would also like to thank and acknowledge Tom Small for the awesome animated figure used in this post.

Source: Google AI Blog


Vector-Quantized Image Modeling with Improved VQGAN

In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished through pre-training language models on extensive unlabeled text corpora.

This pre-training formulation does not make assumptions about input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results through pre-quantizing images into discrete integer codes (represented as natural numbers), and modeling them autoregressively (i.e., predicting sequences one token at a time). In these approaches, a convolutional neural network (CNN) is trained to encode an image into discrete tokens, each corresponding to a small patch of the image. A second stage CNN or Transformer is then trained to model the distribution of encoded latent variables. The second stage can also be applied to autoregressively generate an image after the training. But while such models have achieved strong performance for image generation, few studies have evaluated the learned representation for downstream discriminative tasks (such as image classification).

In “Vector-Quantized Image Modeling with Improved VQGAN”, we propose a two-stage model that reconceives traditional image quantization techniques to yield improved performance on image generation and image understanding tasks. In the first stage, an image quantization model, called VQGAN, encodes an image into lower-dimensional discrete latent codes. Then a Transformer model is trained to model the quantized latent codes of an image. This approach, which we call Vector-quantized Image Modeling (VIM), can be used for both image generation and unsupervised image representation learning. We describe multiple improvements to the image quantizer and show that training a stronger image quantizer is a key component for improving both image generation and image understanding.

Vector-Quantized Image Modeling with ViT-VQGAN
One recent, commonly used model that quantizes images into integer tokens is the Vector-quantized Variational AutoEncoder (VQVAE), a CNN-based auto-encoder whose latent space is a matrix of discrete learnable variables, trained end-to-end. VQGAN is an improved version of this that introduces an adversarial loss to promote high quality reconstruction. VQGAN uses transformer-like elements in the form of non-local attention blocks, which allows it to capture distant interactions using fewer layers.

In our work, we propose taking this approach one step further by replacing both the CNN encoder and decoder with ViT. In addition, we introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for lookup of the integer tokens. Specifically, we reduced the encoder output from a 768-dimension vector to a 32- or 8-dimension vector per code, which we found encourages the decoder to better utilize the token outputs, improving model capacity and efficiency.

Overview of the proposed ViT-VQGAN (left) and VIM (right), which, when working together, is capable of both image generation and image understanding. In the first stage, ViT-VQGAN converts images into discrete integers, which the autoregressive Transformer (Stage 2) then learns to model. Finally, the Stage 1 decoder is applied to these tokens to enable generation of high quality images from scratch.

With our trained ViT-VQGAN, images are encoded into discrete tokens represented by integers, each of which encompasses an 8x8 patch of the input image. Using these tokens, we train a decoder-only Transformer to predict a sequence of image tokens autoregressively. This two-stage model, VIM, is able to perform unconditioned image generation by simply sampling token-by-token from the output softmax distribution of the Transformer model.

VIM is also capable of performing class-conditioned generation, such as synthesizing a specific image of a given class (e.g., a dog or a cat). We extend the unconditional generation to class-conditioned generation by prepending a class-ID token before the image tokens during both training and sampling.

Uncurated set of dog samples from class-conditioned image generation trained on ImageNet. Conditioned classes: Irish terrier, Norfolk terrier, Norwich terrier, Yorkshire terrier, wire-haired fox terrier, Lakeland terrier.

To test the image understanding capabilities of VIM, we also fine-tune a linear projection layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Similar to ImageGPT, we take a layer output at a specific block, average over the sequence of token features (frozen) and insert a softmax layer (learnable) projecting averaged features to class logits. This allows us to capture intermediate features that provide more information useful for representation learning.

Experimental Results
We train all ViT-VQGAN models with a training batch size of 256 distributed across 128 CloudTPUv4 cores. All models are trained with an input image resolution of 256x256. On top of the pre-learned ViT-VQGAN image quantizer, we train Transformer models for unconditional and class-conditioned image synthesis and compare with previous work.

We measure the performance of our proposed methods for class-conditioned image synthesis and unsupervised representation learning on the widely used ImageNet benchmark. In the table below we demonstrate the class-conditioned image synthesis performance measured by the Fréchet Inception Distance (FID). Compared to prior work, VIM improves the FID to 3.07 (lower is better), a relative improvement of 58.6% over the VQGAN model (FID 7.35). VIM also improves the capacity for image understanding, as indicated by the Inception Score (IS), which goes from 188.6 to 227.4, a 20.6% improvement relative to VQGAN.

Model Acceptance
Rate
FID IS

Validation data 1.0 1.62 235.0

DCTransformer 1.0 36.5 N/A
BigGAN 1.0 7.53 168.6
BigGAN-deep 1.0 6.84 203.6
IDDPM 1.0 12.3 N/A
ADM-G, 1.0 guid. 1.0 4.59 186.7
VQVAE-2 1.0 ~31 ~45

VQGAN 1.0 17.04 70.6
VQGAN 0.5 10.26 125.5
VQGAN 0.25 7.35 188.6
ViT-VQGAN (Ours) 1.0 4.17 175.1
ViT-VQGAN (Ours) 0.5 3.04 227.4
Fréchet Inception Distance (FID) comparison between different models for class-conditional image synthesis and Inception Score (IS) for image understanding, both on ImageNet with resolution 256x256. The acceptance rate shows results filtered by a ResNet-101 classification model, similar to the process in VQGAN.

After training a generative model, we test the learned image representations by fine-tuning a linear layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Our model outperforms previous generative models on the image understanding task, improving classification accuracy through linear probing (i.e., training a single linear classification layer, while keeping the rest of the model frozen) from 60.3% (iGPT-L) to 73.2%. These results showcase VIM’s strong generation results as well as image representation learning abilities.

Conclusion
We propose Vector-quantized Image Modeling (VIM), which pretrains a Transformer to predict image tokens autoregressively, where discrete image tokens are produced from improved ViT-VQGAN image quantizers. With our proposed improvements on image quantization, we demonstrate superior results on both image generation and understanding. We hope our results can inspire future work towards more unified approaches for image generation and understanding.

Acknowledgements
We would like to thank Xin Li, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu for the preparation of the VIM paper. We thank Wei Han, Yuan Cao, Jiquan Ngiam‎, Vijay Vasudevan, Zhifeng Chen and Claire Cui for helpful discussions and feedback, and others on the Google Research and Brain Team for support throughout this project.

Source: Google AI Blog


Pix2Seq: A New Language Interface for Object Detection

Object detection is a long-standing computer vision task that attempts to recognize and localize all objects of interest in an image. The complexity arises when trying to identify or localize all object instances while also avoiding duplication. Existing approaches, like Faster R-CNN and DETR, are carefully designed and highly customized in the choice of architecture and loss function. This specialization of existing systems has created two major barriers: (1) it adds complexity in tuning and training the different parts of the system (e.g., region proposal network, graph matching with GIOU loss, etc.), and (2), it can reduce the ability of a model to generalize, necessitating a redesign of the model for application to other tasks.

In “Pix2Seq: A Language Modeling Framework for Object Detection”, published at ICLR 2022, we present a simple and generic method that tackles object detection from a completely different perspective. Unlike existing approaches that are task-specific, we cast object detection as a language modeling task conditioned on the observed pixel inputs. We demonstrate that Pix2Seq achieves competitive results on the large-scale object detection COCO dataset compared to existing highly-specialized and well-optimized detection algorithms, and its performance can be further improved by pre-training the model on a larger object detection dataset. To encourage further research in this direction, we are also excited to release to the broader research community Pix2Seq’s code and pre-trained models along with an interactive demo.

Pix2Seq Overview
Our approach is based on the intuition that if a neural network knows where and what the objects in an image are, one could simply teach it how to read them out. By learning to “describe” objects, the model can learn to ground the descriptions on pixel observations, leading to useful object representations. Given an image, the Pix2Seq model outputs a sequence of object descriptions, where each object is described using five discrete tokens: the coordinates of the bounding box’s corners [ymin, xmin, ymax, xmax] and a class label.

Pix2Seq framework for object detection. The neural network perceives an image, and generates a sequence of tokens for each object, which correspond to bounding boxes and class labels.

With Pix2Seq, we propose a quantization and serialization scheme that converts bounding boxes and class labels into sequences of discrete tokens (similar to captions), and leverage an encoder-decoder architecture to perceive pixel inputs and generate the sequence of object descriptions. The training objective function is simply the maximum likelihood of tokens conditioned on pixel inputs and the preceding tokens.

Sequence Construction from Object Descriptions
In commonly used object detection datasets, images have variable numbers of objects, represented as sets of bounding boxes and class labels. In Pix2Seq, a single object, defined by a bounding box and class label, is represented as [ymin, xmin, ymax, xmax, class]. However, typical language models are designed to process discrete tokens (or integers) and are unable to comprehend continuous numbers. So, instead of representing image coordinates as continuous numbers, we normalize the coordinates between 0 and 1 and quantize them into one of a few hundred or thousand discrete bins. The coordinates are then converted into discrete tokens as are the object descriptions, similar to image captions, which in turn can then be interpreted by the language model. The quantization process is achieved by multiplying the normalized coordinate (e.g., ymin) by the number of bins minus one, and rounding it to the nearest integer (the detailed process can be found in our paper).

Quantization of the coordinates of the bounding boxes with different numbers of bins on a 480 × 640 image. With a small number of bins/tokens, such as 500 bins (∼1 pixel/bin), it achieves high precision even for small objects.

After quantization, the object annotations provided with each training image are ordered into a sequence of discrete tokens (shown below). Since the order of the objects does not matter for the detection task per se, we randomize the order of objects each time an image is shown during training. We also append an End of Sequence (EOS) token at the end as​​ different images often have different numbers of objects, and hence sequence lengths.

The bounding boxes and class labels for objects detected in the image on the left are represented in the sequences shown on the right. A random object ordering strategy is used in our work but other approaches to ordering could also be used.

The Model Architecture, Objective Function, and Inference
We treat the sequences that we constructed from object descriptions as a “dialect” and address the problem via a powerful and general language model with an image encoder and autoregressive language encoder. Similar to language modeling, Pix2Seq is trained to predict tokens, given an image and preceding tokens, with a maximum likelihood loss. At inference time, we sample tokens from model likelihood. The sampled sequence ends when the EOS token is generated. Once the sequence is generated, we split it into chunks of 5 tokens for extracting and de-quantizing the object descriptions (i.e., obtaining the predicted bounding boxes and class labels). It is worth noting that both the architecture and loss function are task-agnostic in that they don’t assume prior knowledge about object detection (e.g., bounding boxes). We describe how we can incorporate task-specific prior knowledge with a sequence augmentation technique in our paper.

Results
Despite its simplicity, Pix2Seq achieves impressive empirical performance on benchmark datasets. Specifically, we compare our method with well established baselines, Faster R-CNN and DETR, on the widely used COCO dataset and demonstrate that it achieves competitive average precision (AP) results.

Pix2Seq achieves competitive AP results compared to existing systems that require specialization during model design, while being significantly simpler. The best performing Pix2Seq model achieved an AP score of 45.

Since our approach incorporates minimal inductive bias or prior knowledge of the object detection task into the model design, we further explore how pre-training the model using the large-scale object detection COCO dataset can impact its performance. Our results indicate that this training strategy (along with using bigger models) can further boost performance.

The average precision of the Pix2Seq model with pre-training followed by fine-tuning. The best performing Pix2Seq model without pre-training achieved an AP score of 45. When the model is pre-trained, we see an 11% improvement with an AP score of 50.

Pix2Seq can detect objects in densely populated and complex scenes, such as those shown below.

Example complex and densely populated scenes labeled by a trained Pix2Seq model. Try it out here.

Conclusion and Future Work
With Pix2Seq, we cast object detection as a language modeling task conditioned on pixel inputs for which the model architecture and loss function are generic, and have not been engineered specifically for the detection task. One can, therefore, readily extend this framework to different domains or applications, where the output of the system can be represented by a relatively concise sequence of discrete tokens (e.g., keypoint detection, image captioning, visual question answering), or incorporate it into a perceptual system supporting general intelligence, for which it provides a language interface to a wide range of vision and language tasks. We also hope that the release of our Pix2Seq’s code, pre-trained models and interactive demo will inspire further research in this direction.

Acknowledgements
This post reflects the combined work with our co-authors: Saurabh Saxena, Lala Li, Geoffrey Hinton. We would also like to thank Tom Small for the visualization of the Pix2Seq illustration figure.

Source: Google AI Blog


FormNet: Beyond Sequential Modeling for Form-Based Document Understanding

Form-based document understanding is a growing research topic because of its practical potential for automatically converting unstructured text data into structured information to gain insight about a document’s contents. Recent sequence modeling, which is a self-attention mechanism that directly models relationships between all words in a selection of text, has demonstrated state-of-the-art performance on natural language tasks. A natural approach to handle form document understanding tasks is to first serialize the form documents (usually in a left-to-right, top-to-bottom fashion) and then apply state-of-the-art sequence models to them.

However, form documents often have more complex layouts that contain structured objects, such as tables, columns, and text blocks. Their variety of layout patterns makes serialization difficult, substantially limiting the performance of strict serialization approaches. These unique challenges in form document structural modeling have been largely underexplored in literature.

An illustration of the form document information extraction task using an example from the FUNSD dataset.

In “FormNet: Structural Encoding Beyond Sequential Modeling in Form Document Information Extraction”, presented at ACL 2022, we propose a structure-aware sequence model, called FormNet, to mitigate the sub-optimal serialization of forms for document information extraction. First, we design a Rich Attention (RichAtt) mechanism that leverages the 2D spatial relationship between word tokens for more accurate attention weight calculation. Then, we construct Super-Tokens (tokens that aggregate semantically meaningful information from neighboring tokens) for each word by embedding representations from their neighboring tokens through a graph convolutional network (GCN). Finally, we demonstrate that FormNet outperforms existing methods, while using less pre-training data, and achieves state-of-the-art performance on the CORD, FUNSD, and Payment benchmarks.

FormNet for Information Extraction
Given a form document, we first use the BERT-multilingual vocabulary and optical character recognition (OCR) engine to identify and tokenize words. We then feed the tokens and their corresponding 2D coordinates into a GCN for graph construction and message passing. Next, we use Extended Transformer Construction (ETC) layers with the proposed RichAtt mechanism to continue to process the GCN-encoded structure-aware tokens for schema learning (i.e., semantic entity extraction). Finally, we use the Viterbi algorithm, which finds a sequence that maximizes the posterior probability, to decode and obtain the final entities for output.

Extended Transformer Construction (ETC)
We adopt ETC as the FormNet model backbone. ETC scales to relatively long inputs by replacing standard attention, which has quadratic complexity, with a sparse global-local attention mechanism that distinguishes between global and long input tokens. The global tokens attend to and are attended by all tokens, but the long tokens attend only locally to other long tokens within a specified local radius, reducing the complexity so that it is more manageable for long sequences.

Rich Attention
Our novel architecture, RichAtt, avoids the deficiencies of absolute and relative embeddings by avoiding embeddings entirely. Instead, it computes the order of and log distance between pairs of tokens with respect to the x and y axes on the layout grid, and adjusts the pre-softmax attention scores of each pair as a direct function of these values.

In a traditional attention layer, each token representation is linearly transformed into a Query vector, a Key vector, and a Value vector. A token “looks” for other tokens from which it might want to absorb information (i.e., attend to) by finding the ones with Key vectors that create relatively high scores when matrix-multiplied (called Matmul) by its Query vector and then softmax-normalized. The token then sums together the Value vectors of all other tokens in the sentence, weighted by their score, and passes this up the network, where it will normally be added to the token’s original input vector.

However, other features beyond the Query and Key vectors are often relevant to the decision of how strongly a token should attend to another given token, such as the order they’re in, how many other tokens separate them, or how many pixels apart they are. In order to incorporate these features into the system, we use a trainable parametric function paired with an error network, which takes the observed feature and the output of the parametric function and returns a penalty that reduces the dot product attention score.

The network uses the Query and Key vectors to consider what value some low-level feature (e.g., distance) should take if the tokens are related, and penalizes the attention score based on the error.

At a high level, for each attention head at each layer, FormNet examines each pair of token representations, determines the ideal features the tokens should have if there is a meaningful relationship between them, and penalizes the attention score according to how different the actual features are from the ideal ones. This allows the model to learn constraints on attention using logical implication.

A visualization of how RichAtt might act on a sentence. There are three adjectives that the word “crow” might attend to. “Lazy” is to the right, so it probably does not modify “crow” and its attention edge is penalized. “Sly” is many tokens away, so its attention edge is also penalized. “Cunning” receives no significant penalties, so by process of elimination, it is the best candidate for attention.

Furthermore, if one assumes that the softmax-normalized attention scores represent a probability distribution, and the distributions for the observed features are known, then this algorithm — including the exact choice of parametric functions and error functions — falls out algebraically, meaning FormNet has a mathematical correctness to it that is lacking from many alternatives (including relative embeddings).

Super-Tokens by Graph Learning
The key to sparsifying attention mechanisms in ETC for long sequence modeling is to have every token only attend to tokens that are nearby in the serialized sequence. Although the RichAtt mechanism empowers the transformers by taking the spatial layout structures into account, poor serialization can still block significant attention weight calculation between related word tokens.

To further mitigate the issue, we construct a graph to connect nearby tokens in a form document. We design the edges of the graph based on strong inductive biases so that they have higher probabilities of belonging to the same entity type. For each token, we obtain its Super-Token embedding by applying graph convolutions along these edges to aggregate semantically relevant information from neighboring tokens. We then use these Super-Tokens as an input to the RichAtt ETC architecture. This means that even though an entity may get broken up into multiple segments due to poor serialization, the Super-Tokens learned by the GCN will have retained much of the context of the entity phrase.

An illustration of the word-level graph, with blue edges between tokens, of a FUNSD document.

Key Results
The Figure below shows model size vs. F1 score (the harmonic mean of the precision and recall) for recent approaches on the CORD benchmark. FormNet-A2 outperforms the most recent DocFormer while using a model that is 2.5x smaller. FormNet-A3 achieves state-of-the-art performance with a 97.28% F1 score. For more experimental results, please refer to the paper.

Model Size vs. Entity Extraction F1 Score on CORD benchmark. FormNet significantly outperforms other recent approaches in absolute F1 performance and parameter efficiency.

We study the importance of RichAtt and Super-Token by GCN on the large-scale masked language modeling (MLM) pre-training task across three FormNets. Both RichAtt and GCN components improve upon the ETC baseline on reconstructing the masked tokens by a large margin, showing the effectiveness of their structural encoding capability on form documents. The best performance is obtained when incorporating both RichAtt and GCN.

Performance of the Masked-Language Modeling (MLM) pre-training. Both the proposed RichAtt and Super-Token by GCN components improve upon ETC baseline by a large margin, showing the effectiveness of their structural encoding capability on large-scale form documents.

Using BertViz, we visualize the local-to-local attention scores for specific examples from the CORD dataset for the standard ETC and FormNet models. Qualitatively, we confirm that the tokens attend primarily to other tokens within the same visual block for FormNet. Moreover for that model, specific attention heads are attending to tokens aligned horizontally, which is a strong signal of meaning for form documents. No clear attention pattern emerges for the ETC model, suggesting the RichAtt and Super-Token by GCN enable the model to learn the structural cues and leverage layout information effectively.

The attention scores for ETC and FormNet (ETC+RichAtt+GCN) models. Unlike the ETC model, the FormNet model makes tokens attend to other tokens within the same visual blocks, along with tokens aligned horizontally, thus strongly leveraging structural cues.

Conclusion
We present FormNet, a novel model architecture for form-based document understanding. We determine that the novel RichAtt mechanism and Super-Token components help the ETC transformer excel at form understanding in spite of sub-optimal, noisy serialization. We demonstrate that FormNet recovers local syntactic information that may have been lost during text serialization and achieves state-of-the-art performance on three benchmarks.

Acknowledgements
This research was conducted by Chen-Yu Lee, Chun-Liang Li, Timothy Dozat, Vincent Perot, Guolong Su, Nan Hua, Joshua Ainslie, Renshen Wang, Yasuhisa Fujii, and Tomas Pfister. Thanks to Evan Huang, Shengyang Dai, and Salem Elie Haykal for their valuable feedback, and Tom Small for creating the animation in this post.

Source: Google AI Blog


Learning to Prompt for Continual Learning

Supervised learning is a common approach to machine learning (ML) in which the model is trained using data that is labeled appropriately for the task at hand. Ordinary supervised learning trains on independent and identically distributed (IID) data, where all training examples are sampled from a fixed set of classes, and the model has access to these examples throughout the entire training phase. In contrast, continual learning tackles the problem of training a single model on changing data distributions where different classification tasks are presented sequentially. This is particularly important, for example, to enable autonomous agents to process and interpret continuous streams of information in real-world scenarios.

To illustrate the difference between supervised and continual learning, consider two tasks: (1) classify cats vs. dogs and (2) classify pandas vs. koalas. In supervised learning, which uses IID, the model is given training data from both tasks and treats it as a single 4-class classification problem. However, in continual learning, these two tasks arrive sequentially, and the model only has access to the training data of the current task. As a result, such models tend to suffer from performance degradation on the previous tasks, a phenomenon called catastrophic forgetting.

Mainstream solutions try to address catastrophic forgetting by buffering past data in a “rehearsal buffer” and mixing it with current data to train the model. However, the performance of these solutions depends heavily on the size of the buffer and, in some cases, may not be possible at all due to data privacy concerns. Another branch of work designs task-specific components to avoid interference between tasks. But these methods often assume that the task at test time is known, which is not always true, and they require a large number of parameters. The limitations of these approaches raise critical questions for continual learning: (1) Is it possible to have a more effective and compact memory system that goes beyond buffering past data? (2) Can one automatically select relevant knowledge components for an arbitrary sample without knowing its task identity?

In “Learning to Prompt for Continual Learning”, presented at CVPR2022, we attempt to answer these questions. Drawing inspiration from prompting techniques in natural language processing, we propose a novel continual learning framework called Learning to Prompt (L2P). Instead of continually re-learning all the model weights for each sequential task, we instead provide learnable task-relevant “instructions'' (i.e., prompts) to guide pre-trained backbone models through sequential training via a pool of learnable prompt parameters. L2P is applicable to various challenging continual learning settings and outperforms previous state-of-the-art methods consistently on all benchmarks. It achieves competitive results against rehearsal-based methods while also being more memory efficient. Most importantly, L2P is the first to introduce the idea of prompting in the field of continual learning.

Compared with typical methods that adapt entire or partial model weights to tasks sequentially using a rehearsal buffer, L2P uses a single frozen backbone model and learns a prompt pool to conditionally instruct the model. “Model 0” indicates that the backbone model is fixed at the beginning.

Prompt Pool and Instance-Wise Query
Given a pre-trained Transformer model, “prompt-based learning” modifies the original input using a fixed template. Imagine a sentiment analysis task is given the input “I like this cat”. A prompt-based method will transform the input to “I like this cat. It looks X”, where the “X” is an empty slot to be predicted (e.g., “nice”, “cute”, etc.) and “It looks X” is the so-called prompt. By adding prompts to the input, one can condition the pre-trained models to solve many downstream tasks. While designing fixed prompts requires prior knowledge along with trial and error, prompt tuning prepends a set of learnable prompts to the input embedding to instruct the pre-trained backbone to learn a single downstream task, under the transfer learning setting.

In the continual learning scenario, L2P maintains a learnable prompt pool, where prompts can be flexibly grouped as subsets to work jointly. Specifically, each prompt is associated with a key that is learned by reducing the cosine similarity loss between matched input query features. These keys are then utilized by a query function to dynamically look up a subset of task-relevant prompts based on the input features. At test time, inputs are mapped by the query function to the top-N closest keys in the prompt pool, and the associated prompt embeddings are then fed to the rest of the model to generate the output prediction. At training, we optimize the prompt pool and the classification head via the cross-entropy loss.

Illustration of L2P at test time. First, L2P selects a subset of prompts from a key-value paired prompt pool based on our proposed instance-wise query mechanism. Then, L2P prepends the selected prompts to the input tokens. Finally, L2P feeds the extended tokens to the model for prediction.

Intuitively, similar input examples tend to choose similar sets of prompts and vice versa. Thus, prompts that are frequently shared encode more generic knowledge while other prompts encode more task-specific knowledge. Moreover, prompts store high-level instructions and keep lower-level pre-trained representations frozen, thus catastrophic forgetting is mitigated even without the necessity of a rehearsal buffer. The instance-wise query mechanism removes the necessity of knowing the task identity or boundaries, enabling this approach to address the under-investigated challenge of task-agnostic continual learning.

Effectiveness of L2P
We evaluate the effectiveness of L2P in different baseline methods using an ImageNet pre-trained Vision Transformer (ViT) on representative benchmarks. The naïve baseline, called Sequential in the graphs below, refers to training a single model sequentially on all tasks. The EWC model adds a regularization term to mitigate forgetting and the Rehearsal model saves past examples to a buffer for mixed training with current data. To measure the overall continual learning performance, we measure both the accuracy and the average difference between the best accuracy achieved during training and the final accuracy for all tasks (except the last task), which we call forgetting. We find that L2P outperforms the Sequential and EWC methods significantly in both metrics. Notably, L2P even surpasses the Rehearsal approach, which uses an additional buffer to save past data. Because the L2P approach is orthogonal to Rehearsal, its performance could be further improved if it, too, used a rehearsal buffer.

L2P outperforms baseline methods in both accuracy (top) and forgetting (bottom). Accuracy refers to the average accuracy for all tasks and forgetting is defined as the average difference between the best accuracy achieved during training and the final accuracy for all tasks (except the last task).

We also visualize the prompt selection result from our instance-wise query strategy on two different benchmarks, where one has similar tasks and the other has varied tasks. The results indicate that L2P promotes more knowledge sharing between similar tasks by having more shared prompts, and less knowledge sharing between varied tasks by having more task-specific prompts.

Prompt selection histograms for benchmarks of similar tasks (left) and varied tasks (right). The left benchmark has higher intra-task similarity, thus sharing prompts between tasks results in good performance, while the right benchmark favors more task-specific prompts.

Conclusion
In this work, we present L2P to address key challenges in continual learning from a new perspective. L2P does not require a rehearsal buffer or known task identity at test time to achieve high performance. Further, it can handle various complex continual learning scenarios, including the challenging task-agnostic setting. Because large-scale pre-trained models are widely used in the machine learning community for their robust performance on real-world problems, we believe that L2P opens a new learning paradigm towards practical continual learning applications.

Acknowledgements
We gratefully acknowledge the contributions of other co-authors, including Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, Tomas Pfister. We would also like to thank Chun-Liang Li, Jeremy Martin Kubica, Sayna Ebrahimi, Stratis Ioannidis, Nan Hua, and Emmanouil Koukoumidis, for their valuable discussions and feedback, and Tom Small for figure creation.

Source: Google AI Blog


Locked-image Tuning: Adding Language Understanding to Image Models

The ability to classify images into categories has been transformed by deep learning. It has also been significantly accelerated by transfer learning, whereby models are first pre-trained on large datasets, like ImageNet, to learn visual representations that are then transferred via fine-tuning to a new task with less data (e.g., classifying animals). Previous works such as BiT and ViT employed these methods to achieve state-of-the-art performance on a wide range of classification tasks, such as the VTAB benchmark.

However, fine-tuning has some downsides: though pre-training is done only once, fine-tuning is necessary on every new dataset for which task-specific data is needed. Multimodal contrastive learning is an alternative, recently popularized paradigm (e.g., CLIP, ALIGN) that overcomes these issues by instead learning how to match free-form text with images. These models can then solve new tasks by reformulating them as image-text matching problems, without extra data (referred to as “zero-shot” learning). Contrastive learning is flexible and easy to adapt to new tasks, but has its own limitations, namely the need for a lot of paired image-text data and weaker performance than transfer learning approaches.

With those limitations in mind, we propose “LiT: Zero-Shot Transfer with Locked-image Text Tuning”, to appear at CVPR 2022. LiT models learn to match text to an already pre-trained image encoder. This simple yet effective setup provides the best of both worlds: strong image representations from pre-training, plus flexible zero-shot transfer to new tasks via contrastive learning. LiT achieves state-of-the-art zero-shot classification accuracy, significantly closing the gap between the two styles of learning. We think the best way to understand is to try it yourself, so we’ve included a demo of LiT models at the end of this post.

Fine-tuning (left) requires task-specific data and training to adapt a pre-trained model to a new task. An LiT model (right) can be used with any task, without further data or adaptation.

Contrastive Learning on Image-Text Data
Contrastive learning models learn representations from “positive” and “negative” examples, such that representations for "positive" examples are similar to each other but different from "negative" examples.

Multimodal contrastive learning applies this to pairs of images and associated texts. An image encoder computes representations from images, and a text encoder does the same for texts. Each image representation is encouraged to be close to the representation of its associated text (“positive”), but distinct from the representation of other texts ("negatives") in the data, and vice versa. This has typically been done with randomly initialized models (“from scratch”), meaning the encoders have to simultaneously learn representations and how to match them.

Multimodal contrastive learning trains models to produce similar representations for closely matched images and texts.

This training can be done on noisy, loosely aligned pairs of image and text, which naturally occur on the web. This circumvents the need for manual labeling, and makes data scaling easy. Furthermore, the model learns much richer visual concepts — it’s not constrained to what’s defined in the classification label space. Instead of classifying an image as “coffee”, it can understand whether it’s "a small espresso in a white mug” or “a large latte in a red flask”.

Once trained, a model that aligns image and text can be used in many ways. For zero-shot classification, we compare image representations to text representations of the class names. For example, a “wombat vs jaguar” classifier can be built by computing the representations of the texts “jaguar” and “wombat”, and classifying an image as a jaguar if its representation better matches the former. This approach scales to thousands of classes and makes it very easy to solve classification tasks without the extra data necessary for fine-tuning. Another application of contrastive models is image search (a.k.a. image-text retrieval), by finding the image whose representation best matches that of a given text, or vice versa.

The Best of Both Worlds with Locked-image Tuning
As mentioned earlier, transfer learning achieves state-of-the-art accuracy, but requires per-task labels, datasets, and training. On the other hand, contrastive models are flexible, scalable, and easily adaptable to new tasks, but fall short in performance. To compare, at the time of writing, the state of the art on ImageNet classification using transfer learning is 90.94%, but the best contrastive zero-shot models achieve 76.4%.

LiT tuning bridges this gap: we contrastively train a text model to compute representations well aligned with the powerful ones available from a pre-trained image encoder. Importantly, for this to work well, the image encoder should be “locked“, that is: it should not be updated during training. This may be unintuitive since one usually expects the additional information from further training to increase performance, but we find that locking the image encoder consistently leads to better results.

LiT-tuning contrastively trains a text encoder to match a pre-trained image encoder. The text encoder learns to compute representations that align to those from the image encoder.

This can be considered an alternative to the classic fine-tuning stage, where the image encoder is separately adapted to every new classification task; instead we have one stage of LiT-tuning, after which the model can classify any data. LiT-tuned models achieve 84.5% zero-shot accuracy on ImageNet classification, showing significant improvements over previous methods that train models from scratch, and halving the performance gap between fine-tuning and contrastive learning.

Left: LiT-tuning significantly closes the gap between the best contrastive models and the best models fine-tuned with labels. Right: Using a pre-trained image encoder is always helpful, but locking it is surprisingly a key part of the recipe to success; unlocked image models (dashed) yield significantly worse performance.

An impressive benefit of contrastive models is increased robustness — they retain high accuracy on datasets that typically fool fine-tuned models, such as ObjectNet and ImageNet-C. Similarly, LiT-tuned models have high performance across various challenging versions of ImageNet, for example achieving a state-of-the-art 81.1% accuracy on ObjectNet.

LiT-tuning has other advantages. While prior contrastive works require large amounts of data and train for a very long time, the LiT approach is much less data hungry. LiT models trained on 24M publicly available image-text pairs rival the zero-shot classification performance of prior models trained on 400M image-text pairs of private data. The locked image encoder also leads to faster training with a smaller memory footprint. On larger datasets, image representations can be pre-computed; not running the image model during training further improves efficiency and also unlocks much larger batch sizes, which increases the number of “negatives” the model sees and is key to high-performance contrastive learning. The method works well with varied forms of image pre-training (e.g., including self-supervised learning), and with many publicly available image models. We hope that these benefits make LiT a great testbed for researchers.

Conclusion
We present Locked-image Tuning (LiT), which contrastively trains a text encoder to match image representations from a powerful pre-trained image encoder. This simple method is data and compute efficient, and substantially improves zero-shot classification performance compared to existing contrastive learning approaches.

Want to try it yourself?

A preview of the demo: use it to match free-form text descriptions to images and build your own zero-shot classifier!

We have prepared a small interactive demo to try some LiT-tuned models. We also provide a Colab with more advanced use cases and larger models, which are a great way to get started.

Acknowledgments
We would like to thank Xiaohua Zhai, Xiao Wang, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer who have co-authored the LiT paper and been involved in all aspects of its development, as well as the Brain team in Zürich. We also would like to thank Tom Small for creating the animations used in this blogpost.

Source: Google AI Blog


Lidar-Camera Deep Fusion for Multi-Modal 3D Detection

LiDAR and visual cameras are two types of complementary sensors used for 3D object detection in autonomous vehicles and robots. LiDAR, which is a remote sensing technique that uses light in the form of a pulsed laser to measure ranges, provides low-resolution shape and depth information, while cameras provide high-resolution shape and texture information. While the features captured by LiDAR and cameras should be merged together to provide optimal 3D object detection, it turns out that most state-of-the-art 3D object detectors use LiDAR as the only input. The main reason is that to develop robust 3D object detection models, most methods need to augment and transform the data from both modalities, making the accurate alignment of the features challenging.

Existing algorithms for fusing LiDAR and camera outputs, such as PointPainting, PointAugmenting, EPNet, 4D-Net and ContinuousFusion, generally follow two approaches — input-level fusion where the features are fused at an early stage, decorating points in the LiDAR point cloud with the corresponding camera features, or mid-level fusion where features are extracted from both sensors and then combined. Despite realizing the importance of effective alignment, these methods struggle to efficiently process the common scenario where features are enhanced and aggregated before fusion. This indicates that effectively fusing the signals from both sensors might not be straightforward and remains challenging.

In our CVPR 2022 paper, “DeepFusion: LiDAR-Camera Deep Fusion for Multi-Modal 3D Object Detection”, we introduce a fully end-to-end multi-modal 3D detection framework called DeepFusion that applies a simple yet effective deep-level feature fusion strategy to unify the signals from the two sensing modalities. Unlike conventional approaches that decorate raw LiDAR point clouds with manually selected camera features, our method fuses the deep camera and deep LiDAR features in an end-to-end framework. We begin by describing two novel techniques, InverseAug and LearnableAlign, that improve the quality of feature alignment and are applied to the development of DeepFusion. We then demonstrate state-of-the-art performance by DeepFusion on the Waymo Open Dataset, one of the largest datasets for automotive 3D object detection.

InverseAug: Accurate Alignment under Geometric Augmentation
To achieve good performance on existing 3D object detection benchmarks for autonomous cars, most methods require strong data augmentation during training to avoid overfitting. However, the necessity of data augmentation poses a non-trivial challenge in the DeepFusion pipeline. Specifically, the data from the two modalities use different augmentation strategies, e.g., rotating along the z-axis for 3D point clouds combined with random flipping for 2D camera images, often resulting in alignment that is inaccurate. Then the augmented LiDAR data has to go through a voxelization step that converts the point clouds into volume data stored in a three dimensional array of voxels. The voxelized features are quite different compared to the raw data, making the alignment even more difficult. To address the alignment issue caused by geometry-related data augmentation, we introduce Inverse Augmentation (InverseAug), a technique used to reverse the augmentation before fusion during the model’s training phase.

In the example below, we demonstrate the difficulties in aligning the augmented LiDAR data with the camera data. In this case, the LiDAR point cloud is augmented by rotation with the result that a given 3D key point, which could be any 3D coordinate, such as a LiDAR data point, cannot be easily aligned in 2D space simply through use of the original LiDAR and camera parameters. To make the localization feasible, InverseAug first stores the augmentation parameters before applying the geometry-related data augmentation. At the fusion stage, it reverses all data augmentation to get the original coordinate for the 3D key point, and then finds its corresponding 2D coordinates in the camera space.

During training, InverseAug resolves the inaccurate alignment from geometric augmentation.
Left: Alignment without InverseAug. Right: Alignment quality improvement with InverseAug.

LearnableAlign: A Cross-Modality-Attention Module to Learn Alignment
We also introduce Learnable Alignment (LearnableAlign), a cross-modality-attention–based feature-level alignment technique, to improve the alignment quality. For input-level fusion methods, such as PointPainting and PointAugmenting, given a 3D LiDAR point, only the corresponding camera pixel can be exactly located as there is a one-to-one mapping. In contrast, when fusing deep features in the DeepFusion pipeline, each LiDAR feature represents a voxel containing a subset of points, and hence, its corresponding camera pixels are in a polygon. So the alignment becomes the problem of learning the mapping between a voxel cell and a set of pixels.

A naïve approach is to average over all pixels corresponding to the given voxel. However, intuitively, and as supported by our visualized results, these pixels are not equally important because the information from the LiDAR deep feature unequally aligns with every camera pixel. For example, some pixels may contain critical information for detection (e.g., the target object), while others may be less informative (e.g., consisting of backgrounds such as roads, plants, occluders, etc.).

LearnableAlign leverages a cross-modality attention mechanism to dynamically capture the correlations between two modalities. Here, the input contains the LiDAR features in a voxel cell, and all its corresponding camera features. The output of the attention is essentially a weighted sum of the camera features, where the weights are collectively determined by a function of the LiDAR and camera features. More specifically, LearnableAlign uses three fully-connected layers to respectively transform the LiDAR features to a vector (ql), and camera features to vectors (kc) and (vc). For each vector (ql), we compute the dot products between (ql) and (kc) to obtain the attention affinity matrix that contains correlations between the LiDAR features and the corresponding camera features. Normalized by a softmax operator, the attention affinity matrix is then used to calculate weights and aggregate the vectors (vc) that contain camera information. The aggregated camera information is then processed by a fully-connected layer, and concatenated (Concat) with the original LiDAR feature. The output is then fed into any standard 3D detection framework, such as PointPillars or CenterPoint for model training.

LearnableAlign leverages the cross-attention mechanism to align LiDAR and camera features.

DeepFusion: A Better Way to Fuse Information from Different Modalities
Powered by our two novel feature alignment techniques, we develop DeepFusion, a fully end-to-end multi-modal 3D detection framework. In the DeepFusion pipeline, the LiDAR points are first fed into an existing feature extractor (e.g., pillar feature net from PointPillars) to obtain LiDAR features (e.g., pseudo-images). In the meantime, the camera images are fed into a 2D image feature extractor (e.g., ResNet) to obtain camera features. Then, InverseAug and LearnableAlign are applied in order to fuse the camera and LiDAR features together. Finally, the fused features are processed by the remaining components of the selected 3D detection model (e.g., the backbone and detection head from PointPillars) to obtain the detection results.

The pipeline of DeepFusion.

Benchmark Results
We evaluate DeepFusion on the Waymo Open Dataset, one of the largest 3D detection challenges for autonomous cars, using the Average Precision with Heading (APH) metric under difficulty level 2, the default metric to rank a model’s performance on the leaderboard. Among the 70 participating teams all over the world, the DeepFusion single and ensemble models achieve state-of-the-art performance in their corresponding categories.

The single DeepFusion model achieves new state-of-the-art performance on Waymo Open Dataset.
The Ensemble DeepFusion model outperforms all other methods on Waymo Open Dataset, ranking No. 1 on the leaderboard.

The Impact of InverseAug and LearnableAlign
We also conduct ablation studies on the effectiveness of the proposed InverseAug and LearnableAlign techniques. We demonstrate that both InverseAug and LearnableAlign individually contribute to a performance gain over the LiDAR-only model, and combining both can further yield an even more significant boost.

Ablation studies on InverseAug (IA) and LearnableAlign (LA) measured in average precision (AP) and APH. Combining both techniques contributes to the best performance gain.

Conclusion
We demonstrate that late-stage deep feature fusion can be more effective when features are aligned well, but aligning features from two different modalities can be challenging. To address this challenge, we propose two techniques, InverseAug and LearnableAlign, to improve the quality of alignment among multimodal features. By integrating these techniques into the fusion stage of our proposed DeepFusion method, we achieve state-of-the-art performance on the Waymo Open Dataset.

Acknowledgements:
Special thanks to co-authors Tianjian Meng, Ben Caine, Jiquan Ngiam, Daiyi Peng, Junyang Shen, Bo Wu, Yifeng Lu, Denny Zhou, Quoc Le, Alan Yuille, Mingxing Tan.

Source: Google AI Blog


Learning from Weakly-Labeled Videos via Sub-Concepts

Video recognition is a core task in computer vision with applications from video content analysis to action recognition. However, training models for video recognition often requires untrimmed videos to be manually annotated, which can be prohibitively time consuming. In order to reduce the effort of collecting videos with annotations, learning visual knowledge from videos with weak labels, i.e., where the annotation is auto-generated without manual intervention, has attracted growing research interest, thanks to the large volume of easily accessible video data. Untrimmed videos, for example, are often acquired by querying with keywords for classes that the video recognition model aims to classify. A keyword, which we refer to as a weak label, is then assigned to each untrimmed video obtained.

Although large-scale videos with weak labels are easier to collect, training with unverified weak labels poses another challenge in developing robust models. Recent studies have demonstrated that, in addition to the label noise (e.g., incorrect action labels on untrimmed videos), there is temporal noise due to the lack of accurate temporal action localization — i.e., an untrimmed video may include other non-targeted content or may only show the target action in a small proportion of the video.

Reducing noise effects for large-scale weakly-supervised pre-training is critical but particularly challenging in practice. Recent work indicates that querying short videos (e.g., ~1 minute in length) to obtain more accurate temporal localization of target actions or applying a teacher model to do filtering can yield improved results. However, such data pre-processing methods prevent models from fully utilizing available video data, especially longer videos with richer content.

In "Learning from Weakly-Labeled Web Videos via Exploring Sub-Concepts", we propose a solution to these issues that uses a simple learning framework to conduct effective pre-training on untrimmed videos. Instead of simply filtering the potential temporal noise, this approach converts such “noisy” data to useful supervision by creating a new set of meaningful “middle ground” pseudo-labels that expand the original weak label space, a novel concept we call Sub-Pseudo Label (SPL). The model is pre-trained on this more "fine-grained" space and then fine-tuned on a target dataset. Our experiments demonstrate that the learned representations are much better than previous approaches. Moreover, SPL has been shown to be effective in improving the action recognition model quality for Google Cloud Video AI, which enables content producers to easily search through massive libraries of their video assets to quickly source content of interest.

Sampled training clips may represent a different visual action (whisking eggs) from the query label of the whole untrimmed video (baking cookies). SPL converts the potential label noise to useful supervision signals by creating a new set of “middle ground” pseudo-classes (i.e., sub-concepts) via extrapolating two related action classes. Enriched supervision is provided for effective model pre-training.

Sub-Pseudo Label (SPL)
SPL is a simple technique that advances the teacher-student training framework, which is known to be effective for self-training and to improve semi-supervised learning. In the teacher-student framework, a teacher model is trained on high-quality labeled data and then assigns pseudo-labels to unlabeled data. The student model trains on both high-quality labeled data and the unlabeled data that has the teacher-predicted labels. While previous methods have proposed a number of ways to improve the pseudo-label quality, SPL takes a novel approach that combines knowledge from both weak labels (i.e., query text used to acquire data) and teacher-predicted labels, which results in better pseudo-labels overall. This method focuses on video recognition where temporal noise is challenging, but it can be extended easily to other domains, like image classification.

The overall pre-training framework for learning from weakly labeled videos via SPLs. Each trimmed video clip is re-labeled using SPL given the teacher-predicted labels and the weak labels used to query the corresponding untrimmed video.

The SPL method is motivated by the observation that within an untrimmed video “noisy” video clips have semantic relations with the target action (i.e., the weak label class), but may also include essential visual components of other actions, such as the teacher model–predicted class. Our approach uses the extrapolated SPLs from weak labels together with the distilled labels to capture the enriched supervision signals, encouraging learning better representations during pre-training that can be used for downstream fine-tuning tasks.

It is straightforward to determine the SPL class for each video clip. We first perform inference on each video clip using the teacher model trained from a target dataset to get a teacher prediction class. Each clip is also labeled by the class (i.e., query text) of the untrimmed source video. A 2-dimensional confusion matrix is used to summarize the alignments between the teacher model inferences and the original weak annotations. Based on this confusion matrix, we conduct label extrapolation between teacher model predictions and weak labels to obtain the raw SPL label space.

Left: The confusion matrix, which is the basis of the raw SPL label space. Middle: The resulting SPL label spaces (16 classes in this example). Right: SPL-B, another SPL version, that reduces the label space by collating agreed and disagreed entries of each row as independent SPL classes, which in this example results in only 8 classes.

Effectiveness of SPL
We evaluate the effectiveness of SPL in comparison to different pre-training methods applied to a 3D ResNet50 model that is fine-tuned on Kinetics-200 (K200). One pre-training approach simply initializes the model using ImageNet. The other pre-training methods use 670k video clips sampled from an internal dataset of 147k videos, collected following standard processes similar to those described for Kinetics-200, that cover a broad range of actions. Weak label training and teacher prediction training use either the weak labels or teacher-predicted labels on the videos, respectively. Agreement filtering uses only the training data for which the weak labels and teacher-predicted labels match. We find that SPL outperforms each of these methods. Though the dataset used to illustrate the SPL approach was constructed for this work, in principle the method we describe applies to any dataset that has weak labels.

Pre-training Method      Top-1      Top-5
ImageNet Initialized      80.6      94.7
Weak Label Train      82.8      95.6
Teacher Prediction Train      81.9      95.0
Agreement Filtering Train      82.9      95.4
SPL      84.3      95.7

We also demonstrate that sampling more video clips from a given number of untrimmed videos can help improve the model performance. With a sufficient number of video clips available, SPL consistently outperforms weak label pre-training by providing enriched supervision.

As more clips are sampled from 147K videos, the label noise is increased gradually. SPL becomes more and more effective at utilizing the weakly-labeled clips to achieve better pre-training.

We visualize the visual concepts learned from SPL with attention visualization by applying Grad-CAM on the trained model. It is interesting to observe some meaningful “middle ground” concepts that can be learned by SPL.

Examples of attention visualization for SPL classes. Some meaningful “middle ground” concepts can be learned by SPL, such as mixing up the eggs and flour (left) and using the abseiling equipment (right).

Conclusion
We demonstrate that SPLs can provide enriched supervision for pre-training. SPL does not increase training complexity and can be treated as an off-the-shelf technique to integrate with teacher-student–based training frameworks. We believe this is a promising direction for discovering meaningful visual concepts by bridging weak labels and the knowledge distilled from teacher models. SPL has also demonstrated promising generalization to the image recognition domain and we expect future extensions that apply to tasks that have noise in labels. We have successfully applied SPL for Google Cloud Video AI where it has improved the accuracy of the action recognition models, helping users to better understand, search, and monetize their video content library.

Acknowledgements
We gratefully acknowledge the contributions of other co-authors, including Kunpeng Li, Xuehan Xiong, Chen-Yu Lee, Zhichao Lu, Yun Fu, Tomas Pfister. We also thank Debidatta Dwibedi, David A Ross, Chen Sun, Jonathan C. Stroud, and Wei Hua for their valuable comments and help on this work, and Tom Small for figure creation.

Source: Google AI Blog


Co-training Transformer with Videos and Images Improves Action Recognition

Action recognition has become a major focus area for the research community because many applications can benefit from improved modeling, such as video retrieval, video captioning, video question-answering, etc. Transformer-based approaches have recently demonstrated state-of-the-art performance on several benchmarks. While Transformer models require data to learn better visual priors compared to ConvNets, action recognition datasets are relatively small in scale. Large Transformer models are typically first trained on image datasets and later fine-tuned on a target action recognition dataset.

While the current pre-training and fine-tuning action recognition paradigm is straightforward and manifests strong empirical results, it may be overly restrictive for building general-purpose action-recognition models. Compared to a dataset like ImageNet that covers a large range of object recognition classes, action recognition datasets like Kinetics and Something-Something-v2 (SSv2) pertain to limited topics. For example, Kinetics include object-centric actions like “cliff diving” and “ice climbing’ while SSv2 contains object-agnostic activities like ’pretending to put something onto something else.’ As a result, we observed poor performance adapting an action recognition model that has been fine-tuned on one dataset to another disparate dataset.

Differences in objects and video backgrounds among datasets further exacerbate learning a general-purpose action recognition classification model. Despite the fact that video datasets may be increasing in size, prior work suggests significant data augmentation and regularization is necessary to achieve strong performance. This latter finding may indicate the model quickly overfits on the target dataset, and as a result, hinders its capacity to generalize to other action recognition tasks.

In “Co-training Transformer with Videos and Images Improves Action Recognition”, we propose a training strategy, named CoVeR, that leverages both image and video data to jointly learn a single general-purpose action recognition model. Our approach is buttressed by two main findings. First, disparate video datasets cover a diverse set of activities, and training them together in a single model could lead to a model that excels at a wide range of activities. Second, video is a perfect source for learning motion information, while images are great for exploiting structural appearance. Leveraging a diverse distribution of image examples may be beneficial in building robust spatial representations in video models. Concretely, CoVeR first pre-trains the model on an image dataset, and during fine-tuning, it simultaneously trains a single model on multiple video and image datasets to build robust spatial and temporal representations for a general-purpose video understanding model.

Architecture and Training Strategy
We applied the CoVeR approach to the recently proposed spatial-temporal video transformer, called TimeSFormer, that contains 24 layers of transformer blocks. Each block contains one temporal attention, one spatial attention, and one multilayer perceptron (MLP) layer. To learn from multiple video and image datasets, we adopt a multi-task learning paradigm and equip the action recognition model with multiple classification heads. We pre-train all non-temporal parameters on the large-scale JFT dataset. During fine-tuning, a batch of videos and images are sampled from multiple video and image datasets. The sampling rate is proportional to the size of the datasets. Each sample within the batch is processed by the TimeSFormer and then distributed to the corresponding classifier to get the predictions.

Compared with the standard training strategy, CoVeR has two advantages. First, as the model is directly trained on multiple datasets, the learned video representations are more general and can be directly evaluated on those datasets without additional fine-tuning. Second, Transformer-based models may easily overfit to a smaller video distribution, thus degrading the generalization of the learned representations. Training on multiple datasets mitigates this challenge by reducing the risk of overfitting.

CoVeR adopts a multi-task learning strategy trained on multiple datasets, each with their own classifier.

Benchmark Results
We evaluate the CoVeR approach to train on Kinetics-400 (K400), Kinetics-600 (K600), Kinetics-700 (K700), SomethingSomething-V2 (SSv2), and Moments-in-Time (MiT) datasets. Compared with other approaches — such as TimeSFormer, Video SwinTransformer, TokenLearner, ViViT, MoViNet, VATT, VidTr, and OmniSource — CoVeR established the new state-of-the-art on multiple datasets (shown below). Unlike previous approaches that train a dedicated model for one single dataset, a model trained by CoVeR can be directly applied to multiple datasets without further fine-tuning.

Model Pretrain Finetune K400 Accuracy
VATT AudioSet+Videos K400 82.1
Omnisource IG-Kinetics-65M K400 83.6
ViViT JFT-300M K400 85.4
Video SwinTrans   ImageNet21K+external   K400 86.8
CoVeR JFT-3B K400+SSv2+MiT+ImNet 87.2
Accuracy comparison on Kinetics-400 (K400) dataset.
Model Pretrain Finetune SSv2 Accuracy
TimeSFormer ImageNet21k SSv2 62.4
VidTr ImageNet21k SSv2 63.0
ViViT ImageNet21k SSv2 65.9
Video SwinTrans   ImageNet21K+external   SSv2 69.6
CoVeR JFT-3B K400+SSv2+MiT+ImNet 70.9
Accuracy comparison on SomethingSomething-V2 (SSv2) dataset.
Model Pretrain Finetune MiT Accuracy
ViViT ImageNet21k MiT 38.5
VidTr ImageNet21k SSv2 41.1
CoVeR JFT-3B K400+SSv2+MiT+ImNet 46.1
Accuracy comparison on Moments-in-Time (MiT) dataset.

Transfer Learning
We use transfer learning to further verify the video action recognition performance and compare with co-training on multiple datasets, results are summarized below. Specifically, we train on the source datasets, then fine-tune and evaluate on the target dataset.

We first consider K400 as the target dataset. CoVeR co-trained on SSv2 and MiT improves the top-1 accuracy on K400→K400 (where the model is trained on K400 and then fine-tuned on K400) by 1.3%, SSv2→K400 by 1.7%, and MiT→K400 by 0.4%. Similarly, we observe that by transferring to SSv2, CoVeR achieves 2%, 1.8%, and 1.1% improvement over SSv2→SSv2, K400→SSv2, and MiT→SSv2, respectively. The 1.2% and 2% performance improvement on K400 and SSv2 indicates that CoVeR co-trained on multiple datasets could learn better visual representations than the standard training paradigm, which is useful for downstream tasks.

Comparison of transfer learning the representation learned by CoVeR and standard training paradigm. A→B means the model is trained on dataset A and then fine-tuned on dataset B.

Conclusion
In this work, we present CoVeR, a training paradigm that jointly learns action recognition and object recognition tasks in a single model for the purpose of constructing a general-purpose action recognition framework. Our analysis indicates that it may be beneficial to integrate many video datasets into one multi-task learning paradigm. We highlight the importance of continuing to learn on image data during fine-tuning to maintain robust spatial representations. Our empirical findings suggest CoVeR can learn a single general-purpose video understanding model which achieves impressive performance across many action recognition datasets without an additional stage of fine-tuning on each downstream application.

Acknowledgements
We would like to thank Christopher Fifty, Wei Han, Andrew M. Dai, Ruoming Pang, and Fei Sha for preparation of the CoVeR paper, Yue Zhao, Hexiang Hu, Zirui Wang, Zitian Chen, Qingqing Huang, Claire Cui and Yonghui Wu for helpful discussions and feedbacks, and others on the Brain Team for support throughout this project.

Source: Google AI Blog


Nested Hierarchical Transformer: Towards Accurate, Data-Efficient, and Interpretable Visual Understanding

In visual understanding, the Visual Transformer (ViT) and its variants have received significant attention recently due to their superior performance on many core visual applications, such as image classification, object detection, and video understanding. The core idea of ViT is to utilize the power of self-attention layers to learn global relationships between small patches of images. However, the number of connections between patches increases quadratically with image size. Such a design has been observed to be data inefficient — although the original ViT can perform better than convolutional networks with hundreds of millions of images for pre-training, such a data requirement is not always practical, and it still underperforms compared to convolutional networks when given less data. Many are exploring to find more suitable architectural re-designs that can learn visual representations effectively, such as by adding convolutional layers and building hierarchical structures with local self-attention.

The principle of hierarchical structure is one of the core ideas in vision models, where bottom layers learn more local object structures on the high-dimensional pixel space and top layers learn more abstracted and high-level knowledge at low-dimensional feature space. Existing ViT-based methods focus on designing a variety of modifications inside self-attention layers to achieve such a hierarchy, but while these offer promising performance improvements, they often require substantial architectural re-designs. Moreover, these approaches lack an interpretable design, so it is difficult to explain the inner-workings of trained models.

To address these challenges, in “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding”, we present a rethinking of existing hierarchical structure–driven designs, and provide a novel and orthogonal approach to significantly simplify them. The central idea of this work is to decouple feature learning and feature abstraction (pooling) components: nested transformer layers encode visual knowledge of image patches separately, and then the processed information is aggregated. This process is repeated in a hierarchical manner, resulting in a pyramid network structure. The resulting architecture achieves competitive results on ImageNet and outperforms results on data-efficient benchmarks. We have shown such a design can meaningfully improve data efficiency with faster convergence and provide valuable interpretability benefits. Moreover, we introduce GradCAT, a new technique for interpreting the decision process of a trained model at inference time.

Architecture Design
The overall architecture is simple to implement by adding just a few lines of Python code to the source code of the original ViT. The original ViT architecture divides an input image into small patches, projects pixels of each patch to a vector with predefined dimension, and then feeds the sequences of all vectors to the overall ViT architecture containing multiple stacked identical transformer layers. While every layer in ViT processes information of the whole image, with this new method, stacked transformer layers are used to process only a region (i.e., block) of the image containing a few spatially adjacent image patches. This step is independent for each block and is also where feature learning occurs. Finally, a new computational layer called block aggregation then combines all of the spatially adjacent blocks. After each block aggregation, the features corresponding to four spatially adjacent blocks are fed to another module with a stack of transformer layers, which then process those four blocks jointly. This design naturally builds a pyramid hierarchical structure of the network, where bottom layers can focus on local features (such as textures) and upper layers focus on global features (such as object shape) at reduced dimensionality thanks to the block aggregation.

A visualization of the network processing an image: Given an input image, the network first partitions images into blocks, where each block contains 4 image patches. Image patches in every block are linearly projected as vectors and processed by a stack of identical transformer layers. Then the proposed block aggregation layer aggregates information from each block and reduces its spatial size by 4 times. The number of blocks is reduced to 1 at the top hierarchy and classification is conducted after the output of it.

Interpretability
This architecture has a non-overlapping information processing mechanism, independent at every node. This design resembles a decision tree-like structure, which manifests unique interpretability capabilities because every tree node contains independent information of an image block that is being received by its parent nodes. We can trace the information flow through the nodes to understand the importance of each feature. In addition, our hierarchical structure retains the spatial structure of images throughout the network, leading to learned spatial feature maps that are effective for interpretation. Below we showcase two kinds of visual interpretability.

First, we present a method to interpret the trained model on test images, called gradient-based class-aware tree-traversal (GradCAT). GradCAT traces the feature importance of each block (a tree node) from top to bottom of the hierarchy structure. The main idea is to find the most valuable traversal from the root node at the top layer to a child node at the bottom layer that contributes the most to the classification outcomes. Since each node processes information from a certain region of the image, such traversal can be easily mapped to the image space for interpretation (as shown by the overlaid dots and lines in the image below).

The following is an example of the model's top-4 predictions and corresponding interpretability results on the left input image (containing 4 animals). As shown below, GradCAT highlights the decision path along the hierarchical structure as well as the corresponding visual cues in local image regions on the images.

Given the left input image (containing four objects), the figure visualizes the interpretability results of the top-4 prediction classes. The traversal locates the model decision path along the tree and simultaneously locates the corresponding image patch (shown by the dotted line on images) that has the highest impact to the predicted target class.

Moreover, the following figures visualize results on the ImageNet validation set and show how this approach enables some intuitive observations. For instance, the example of the lighter below (upper left panel) is particularly interesting because the ground truth class — lighter/matchstick — actually defines the bottom-right matchstick object, while the most salient visual features (with the highest node values) are actually from the upper-left red light, which conceptually shares visual cues with a lighter. This can also be seen from the overlaid red lines, which indicate the image patches with the highest impact on the prediction. Thus, although the visual cue is a mistake, the output prediction is correct. In addition, the four child nodes of the wooden spoon below have similar feature importance values (see numbers visualized in the nodes; higher indicates more importance), which is because the wooden texture of the table is similar to that of the spoon.

Visualization of the results obtained by the proposed GradCAT. Images are from the ImageNet validation dataset.

Second, different from the original ViT, our hierarchical architecture retains spatial relationships in learned representations. The top layers output low-resolution features maps of input images, enabling the model to easily perform attention-based interpretation by applying Class Attention Map (CAM) on the learned representations at the top hierarchical level. This enables high-quality weakly-supervised object localization with just image-level labels. See the following figure for examples.

Visualization of CAM-based attention results. Warmer colors indicate higher attention. Images are from the ImageNet validation dataset.

Convergence Advantages
With this design, feature learning only happens at local regions independently, and feature abstraction happens inside the aggregation function. This design and simple implementation is general enough for other types of visual understanding tasks beyond classification. It also improves the model convergence speed greatly, significantly reducing the training time to reach the desired maximum accuracy.

We validate this advantage in two ways. First, we compare the ViT structure on the ImageNet accuracy with a different number of total training epochs. The results are shown on the left side of the figure below, demonstrating much faster convergence than the original ViT, e.g., around 20% improvement in accuracy over ViT with 30 total training epochs.

Second, we modify the architecture to conduct unconditional image generation tasks, since training ViT-based models for image generation tasks is challenging due to convergence and speed issues. Creating such a generator is straightforward by transposing the proposed architecture: the input is an embedding vector, the output is a full image in RGB channels, and the block aggregation is replaced by a block de-aggregation component supported by Pixel Shuffling. Surprisingly, we find our generator is easy to train and demonstrates faster convergence speed, as well as better FID score (which measures how similar generated images are to real ones), than the capacity-comparable SAGAN.

Left: ImageNet accuracy given different number of total training epochs compared with standard ViT architecture. Right: ImageNet 64x64 image generation FID scores (lower is better) with single 1000-epoch training. On both tasks, our method shows better convergence speed.

Conclusion
In this work we demonstrate the simple idea that decoupled feature learning and feature information extraction in this nested hierarchy design leads to better feature interpretability through a new gradient-based class-aware tree traversal method. Moreover, the architecture improves convergence on not only classification tasks but also image generation tasks. The proposed idea is focusing on aggregation function and thereby is orthogonal to advanced architecture design for self-attention. We hope this new research encourages future architecture designers to explore more interpretable and data-efficient ViT-based models for visual understanding, like the adoption of this work for high-resolution image generation. We have also released the source code for the image classification portion of this work.

Acknowledgements
We gratefully acknowledge the contributions of other co-authors, including Han Zhang, Long Zhao, Ting Chen, Sercan Arik, Tomas Pfister. We also thank Xiaohua Zhai, Jeremy Kubica, Kihyuk Sohn, and Madeleine Udell for the valuable feedback of the work.

Source: Google AI Blog