Tag Archives: Natural Language Understanding

Crisscrossed Captions: Semantic Similarity for Images and Text

The past decade has seen remarkable progress on automatic image captioning, a task in which a computer algorithm creates written descriptions for images. Much of the progress has come through the use of modern deep learning methods developed for both computer vision and natural language processing, combined with large-scale datasets that pair images with descriptions created by people. In addition to supporting important practical applications, such as providing descriptions of images for visually impaired people, these datasets also enable investigations into important and exciting research questions about grounding language in visual inputs. For example, learning deep representations for a word like “car” means using both linguistic and visual contexts.

Image captioning datasets that contain pairs of textual descriptions and their corresponding images, such as MS-COCO and Flickr30k, have been widely used to learn aligned image and text representations and to build captioning models. Unfortunately, these datasets have limited cross-modal associations: images are not paired with other images, captions are only paired with other captions of the same image (also called co-captions), there are image-caption pairs that match but are not labeled as a match, and there are no labels that indicate when an image-caption pair does not match. This undermines research into how inter-modality learning (connecting captions to images, for example) impacts intra-modality tasks (connecting captions to captions or images to images). This is important to address, especially because a fair amount of work on learning from images paired with text is motivated by arguments about how visual elements should inform and improve representations of language.

To address this evaluation gap, we present "Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO", which was recently presented at EACL 2021. The Crisscrossed Captions (CxC) dataset extends the development and test splits of MS-COCO with semantic similarity ratings for image-text, text-text and image-image pairs. The rating criteria are based on Semantic Textual Similarity, an existing and widely-adopted measure of semantic relatedness between pairs of short texts, which we extend to include judgments about images as well. In all, CxC contains human-derived semantic similarity ratings for 267,095 pairs (derived from 1,335,475 independent judgments), a massive extension in scale and detail to the 50k original binary pairings in MS-COCO’s development and test splits. We have released CxC’s ratings, along with code to merge CxC with existing MS-COCO data. Anyone familiar with MS-COCO can thus easily enhance their experiments with CxC.

Crisscrossed Captions extends the MS-COCO evaluation sets by adding human-derived semantic similarity ratings for existing image-caption pairs and co-captions (solid lines), and it increases rating density by adding human ratings for new image-caption, caption-caption and image-image pairs (dashed lines).*

Creating the CxC Dataset
If a picture is worth a thousand words, it is likely because there are so many details and relationships between objects that are generally depicted in pictures. We can describe the texture of the fur on a dog, name the logo on the frisbee it is chasing, mention the expression on the face of the person who has just thrown the frisbee, or note the vibrant red on a large leaf in a tree above the person’s head, and so on.

The CxC dataset extends the MS-COCO evaluation splits with graded similarity associations within and across modalities. MS-COCO has five captions for each image, split into 410k training, 25k development, and 25k test captions (for 82k, 5k, 5k images, respectively). An ideal extension would rate every pair in the dataset (caption-caption, image-image, and image-caption), but this is infeasible as it would require obtaining human ratings for billions of pairs.

Given that randomly selected pairs of images and captions are likely to be dissimilar, we came up with a way to select items for human rating that would include at least some new pairs with high expected similarity. To reduce the dependence of the chosen pairs on the models used to find them, we introduce an indirect sampling scheme (depicted below) where we encode images and captions using different encoding methods and compute the similarity between pairs of same modality items, resulting in similarity matrices. Images are encoded using Graph-RISE embeddings, while captions are encoded using two methods — Universal Sentence Encoder (USE) and average bag-of-words (BoW) based on GloVe embeddings. Since each MS-COCO example has five co-captions, we average the co-caption encodings to create a single representation per example, ensuring all caption pairs can be mapped to image pairs (more below on how we select intermodality pairs).
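To make the sampling scheme concrete, here is a minimal sketch of building the two intramodal similarity matrices. It assumes the image and caption embeddings (Graph-RISE, USE or GloVe BoW) have already been computed and are passed in as NumPy arrays; the helper names are ours, not from the released code.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarities for a (num_examples, dim) embedding matrix."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def build_similarity_matrices(caption_embeddings, image_embeddings):
    """caption_embeddings: (num_images, 5, dim), five co-caption encodings per image.
    image_embeddings: (num_images, dim), one encoding per image."""
    # Average the five co-caption encodings so each image has a single text vector.
    avg_captions = caption_embeddings.mean(axis=1)
    text_sim = cosine_similarity_matrix(avg_captions)      # e.g., 5k x 5k on the dev split
    image_sim = cosine_similarity_matrix(image_embeddings)  # 5k x 5k
    return text_sim, image_sim
```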

Top: Text similarity matrix (each cell corresponds to a similarity score) constructed using averaged co-caption encodings, so each text entry corresponds to a single image, resulting in a 5k x 5k matrix. Two different text encoding methods were used, but only one text similarity matrix is shown for simplicity. Bottom: Image similarity matrix over the images in the dataset, also a 5k x 5k matrix.

The next step of the indirect sampling scheme is to use the computed similarities of images for a biased sampling of caption pairs for human rating (and vice versa). For example, we select two captions with high computed similarities from the text similarity matrix, then take each of their images, resulting in a new pair of images that are different in appearance but similar in what they depict based on their descriptions. For example, the captions “A dog looking bashfully to the side” and “A black dog lifts its head to the side to enjoy a breeze” would have a reasonably high model similarity, so the corresponding images of the two dogs in the figure below could be selected for image similarity rating. This step can also start with two images with high computed similarities to yield a new pair of captions. We now have indirectly sampled new intramodal pairs — at least some of which are highly similar — for which we obtain human ratings.
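Continuing the sketch above, a biased sample of image pairs can be drawn directly from the text similarity matrix: because row i of that matrix corresponds to image i, a highly similar caption pair (i, j) immediately yields the image pair (i, j) to send for human rating (the symmetric procedure yields caption pairs from the image similarity matrix). The function below is illustrative only; the top_k value and the selection rule are assumptions, not the exact sampling used for CxC.

```python
import numpy as np

def sample_image_pairs_from_caption_similarity(text_sim, top_k=1000):
    """Pick image pairs (i, j) whose averaged co-captions are most similar."""
    n = text_sim.shape[0]
    rows, cols = np.triu_indices(n, k=1)      # upper triangle: skip self-pairs and duplicates
    scores = text_sim[rows, cols]
    order = np.argsort(-scores)[:top_k]       # highest computed similarity first (biased sampling)
    return [(int(rows[k]), int(cols[k])) for k in order]
```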

Top: Pairs of images are picked based on their computed caption similarity. Bottom: Pairs of captions are picked based on the computed similarity of the images they describe.

Last, we use these new intramodal pairs and their human ratings to select new intermodal pairs for human rating. We do this by using existing image-caption pairs to link between modalities. For example, if a caption pair example ij was rated by humans as highly similar, we pick the image from example i and the caption from example j to obtain a new intermodal pair for human rating. And again, we use the intramodal pairs with the highest rated similarity for sampling because this includes at least some new pairs with high similarity. Finally, we also add human ratings for all existing intermodal pairs and a large sample of co-captions.

The following table shows examples of semantic image similarity (SIS) and semantic image-text similarity (SITS) pairs corresponding to each rating, with 5 being the most similar and 0 being completely dissimilar.

Examples for each human-derived similarity score (left: 5 to 0, 5 being very similar and 0 being completely dissimilar) of image pairs based on SIS (middle) and SITS (right) tasks. Note that these examples are for illustrative purposes and are not themselves in the CxC dataset.

Evaluation
MS-COCO supports three retrieval tasks:

  1. Given an image, find its matching captions out of all other captions in the evaluation set.
  2. Given a caption, find its corresponding image out of all other images in the evaluation set.
  3. Given a caption, find its other co-captions out of all other captions in the evaluation set.

MS-COCO’s pairs are incomplete because captions created for one image at times apply equally well to another, yet these associations are not captured in the dataset. CxC enhances these existing retrieval tasks with new positive pairs, and it also supports a new image-image retrieval task. With its graded similarity judgments, CxC also makes it possible to measure correlations between model and human rankings. Retrieval metrics in general focus only on positive pairs, while CxC’s correlation scores additionally account for the relative ordering of similarity and include low-scoring items (non-matches). Supporting these evaluations on a common set of images and captions makes them more valuable for understanding inter-modal learning compared to disjoint sets of caption-image, caption-caption, and image-image associations.
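As a concrete illustration of the correlation evaluation, the snippet below computes a Spearman rank correlation between a model’s predicted similarities and CxC’s human ratings over the same rated pairs. Loading and aligning the pairs is omitted, and the function name is ours.

```python
import numpy as np
from scipy.stats import spearmanr

def cxc_rank_correlation(model_scores, human_ratings):
    """Spearman correlation between model similarity scores and CxC human ratings.

    Both arguments are parallel arrays over the same rated pairs
    (e.g., all caption-caption pairs in the CxC dev split).
    """
    rho, _ = spearmanr(np.asarray(model_scores), np.asarray(human_ratings))
    return rho
```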

We ran a series of experiments to show the utility of CxC’s ratings. For this, we constructed three dual encoder (DE) models using BERT-base as the text encoder and EfficientNet-B4 as the image encoder:

  1. A text-text (DE_T2T) model that uses a shared text encoder for both sides.
  2. An image-text model (DE_I2T) that uses the aforementioned text and image encoders, and includes a layer above the text encoder to match the image encoder output.
  3. A multitask model (DE_I2T+T2T) trained on a weighted combination of text-text and image-text tasks.
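A minimal sketch of such a dual encoder is shown below. The text_encoder and image_encoder arguments stand in for BERT-base and EfficientNet-B4 and are assumed to return pooled feature vectors; the in-batch contrastive loss and temperature value are common choices for training dual encoders, not the exact training setup from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextDualEncoder(nn.Module):
    """Sketch of the image-text dual encoder (DE_I2T) described above."""
    def __init__(self, text_encoder, image_encoder, text_dim, image_dim):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        # Extra layer above the text encoder to match the image encoder output.
        self.text_proj = nn.Linear(text_dim, image_dim)

    def forward(self, text_inputs, images):
        t = self.text_proj(self.text_encoder(text_inputs))
        v = self.image_encoder(images)
        return F.normalize(t, dim=-1), F.normalize(v, dim=-1)

def in_batch_contrastive_loss(text_emb, image_emb, temperature=0.05):
    # Matching (caption, image) pairs in the batch are positives; every other
    # pairing in the batch acts as a negative.
    logits = text_emb @ image_emb.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```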
CxC retrieval results — a comparison of our text-text (T2T), image-text (I2T) and multitask (I2T+T2T) dual encoder models on all four retrieval tasks.

From the results on the retrieval tasks, we can see that DE_I2T+T2T (yellow bar) performs better than DE_I2T (red bar) on the image-text and text-image retrieval tasks. Thus, adding the intramodal (text-text) training task helped improve the intermodal (image-text, text-image) performance. As for the other two intramodal tasks (text-text and image-image), DE_I2T+T2T shows strong, balanced performance on both of them.

CxC correlation results for the same models shown above.

For the correlation tasks, DE_I2T performs the best on SIS and DE_I2T+T2T is the best overall. The correlation scores also show that DE_I2T performs well only on images: it has the highest SIS but has much worse STS. Adding the text-text loss to DE_I2T training (DE_I2T+T2T) produces more balanced overall performance.

The CxC dataset provides a much more complete set of relationships between and among images and captions than the raw MS-COCO image-caption pairs. The new ratings have been released and further details are in our paper. We hope to encourage the research community to push the state of the art on the tasks introduced by CxC with better models for jointly learning inter- and intra-modal representations.

Acknowledgments
The core team includes Daniel Cer, Yinfei Yang and Austin Waters. We thank Julia Hockenmaier for her input on CxC’s formulation, the Google Data Compute Team, especially Ashwin Kakarla and Mohd Majeed, for their tooling and annotation support, Yuan Zhang and Eugene Ie for their comments on the initial versions of the paper, and Daphne Luong for executive support for the data collection.


  *All the images in the article have been taken from the Open Images dataset under the CC-by 4.0 license. 

Source: Google AI Blog


ToTTo: A Controlled Table-to-Text Generation Dataset

In the last few years, research in natural language generation, used for tasks like text summarization, has made tremendous progress. Yet, despite achieving high levels of fluency, neural systems can still be prone to hallucination (i.e., generating text that is understandable, but not faithful to the source), which can prohibit these systems from being used in many applications that require high degrees of accuracy. Consider an example from the Wikibio dataset, where the neural baseline model tasked with summarizing a Wikipedia infobox entry for Belgian football player Constant Vanden Stock summarizes incorrectly that he is an American figure skater.

While the process of assessing the faithfulness of generated text to the source content can be challenging, it is often easier when the source content is structured (e.g., in tabular format). Moreover, structured data can also test a model’s ability for reasoning and numerical inference. However, existing large scale structured datasets are often noisy (i.e., the reference sentence cannot be fully inferred from the tabular data), making them unreliable for the measurement of hallucination in model development.

In “ToTTo: A Controlled Table-To-Text Generation Dataset”, we present an open domain table-to-text generation dataset created using a novel annotation process (via sentence revision) along with a controlled text generation task that can be used to assess model hallucination. ToTTo (shorthand for “Table-To-Text”) consists of 121,000 training examples, along with 7,500 examples each for development and test. Due to the accuracy of annotations, this dataset is suitable as a challenging benchmark for research in high precision text generation. The dataset and code are open-sourced on our GitHub repo.

Table-to-Text Generation
ToTTo introduces a controlled generation task in which a given Wikipedia table with a set of selected cells is used as the source material for the task of producing a single sentence description that summarizes the cell contents in the context of the table. The example below demonstrates some of the many challenges posed by the task, such as numerical reasoning, a large open-domain vocabulary, and varied table structure.
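To make the task setup concrete, here is an abbreviated, illustrative ToTTo-style example in Python. The field names reflect the released JSON format as we recall it (cell-span fields are omitted for brevity); consult the GitHub repo for the authoritative schema, and treat the values as illustration only.

```python
# Abbreviated ToTTo-style example; see the GitHub repo for the exact schema.
example = {
    "table_page_title": "Travis Kelce",            # page-level context
    "table_section_title": "College statistics",   # section-level context (illustrative)
    "table": [  # each cell records its value and whether it is a header
        [{"value": "Year", "is_header": True},
         {"value": "Receptions", "is_header": True},
         {"value": "Receiving yards", "is_header": True}],
        [{"value": "2012", "is_header": False},
         {"value": "45", "is_header": False},
         {"value": "722", "is_header": False}],
    ],
    "highlighted_cells": [[1, 1], [1, 2]],          # (row, column) indices of the selected cells
    "sentence_annotations": [
        {"final_sentence": "Travis Kelce finished the 2012 season with 45 receptions for 722 yards."}
    ],
}
```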

Example in the ToTTo dataset, where given the source table and set of highlighted cells (left), the goal is to generate a one sentence description, such as the “target sentence” (right). Note that generating the target sentence would require numerical inference (eleven NFL seasons) and understanding of the NFL domain.

Annotation Process
Designing an annotation process to obtain natural but also clean target sentences from tabular data is a significant challenge. Many datasets like Wikibio and RotoWire pair naturally occurring text heuristically with tables, a noisy process that makes it difficult to disentangle whether hallucination is primarily caused by data noise or model shortcomings. Alternatively, one can ask annotators to write new target sentences from scratch that are faithful to the table, but the resulting targets often lack variety in terms of structure and style.

In contrast, ToTTo is constructed using a novel data annotation strategy in which annotators revise existing Wikipedia sentences in stages. This results in target sentences that are clean, as well as natural, containing interesting and varied linguistic properties. The data collection and annotation process begins by collecting tables from Wikipedia, where a given table is paired with a summary sentence collected from the supporting page context according to heuristics, such as word overlap between the page text and the table and hyperlinks referencing tabular data. This summary sentence may contain information not supported by the table and may contain pronouns with antecedents found in the table only, not the sentence itself.

The annotator then highlights the cells in the table that support the sentence and deletes phrases in the sentence that are not supported by the table. They also decontextualize the sentence so that it is standalone (e.g., with correct pronoun resolution) and correct grammar, where necessary.

We show that annotators obtain high agreement on the above task: a Fleiss’ kappa of 0.856 for cell highlighting and 67.0 BLEU for the final target sentence.

Dataset Analysis
We conducted a topic analysis on the ToTTo dataset over 44 categories and found that the Sports and Countries topics, each of which consists of a range of fine-grained topics (e.g., football/olympics for sports and population/buildings for countries), together comprise 56.4% of the dataset. The other 44% is composed of a much broader set of topics, including Performing Arts, Transportation, and Entertainment.

Furthermore, we conducted a manual analysis of the different types of linguistic phenomena in the dataset over 100 randomly chosen examples. The table below summarizes the fraction of examples that require reference to the page and section titles, as well as some of the linguistic phenomena in the dataset that potentially pose new challenges to current systems.

Linguistic Phenomena                              Percentage
Require reference to page title                   82%
Require reference to section title                19%
Require reference to table description            3%
Reasoning (logical, numerical, temporal etc.)     21%
Comparison across rows/columns/cells              13%
Require background information                    12%

Baseline Results
We present some baseline results of three state-of-the-art models from the literature (BERT-to-BERT, Pointer Generator, and the Puduppully 2019 model) on two evaluation metrics, BLEU and PARENT. In addition to reporting the score on the overall test set, we also evaluate each model on a more challenging subset consisting of out-of-domain examples. As the table below shows, the BERT-to-BERT model performs best in terms of both BLEU and PARENT. Moreover, all models achieve considerably lower performance on the challenge set indicating the challenge of out-of-domain generalization.

Model                     BLEU (overall)   PARENT (overall)   BLEU (challenge)   PARENT (challenge)
BERT-to-BERT              43.9             52.6               34.8               46.7
Pointer Generator         41.6             51.6               32.2               45.2
Puduppully et al. 2019    19.2             29.2               13.9               25.8
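For reference, BLEU against a multi-reference test set can be computed with the sacrebleu library roughly as below. The official ToTTo evaluation scripts handle the data plumbing; the toy hypothesis and references here are placeholders, and PARENT additionally requires the source tables and has its own implementation.

```python
import sacrebleu

# Toy multi-reference BLEU computation: one hypothesis, three reference streams.
hypotheses = ["travis kelce finished the 2012 season with 45 receptions for 722 yards."]
references = [
    ["in travis kelce's last collegiate season, he set personal career highs in receptions (45) and receiving yards (722)."],
    ["travis kelce had 45 receptions for 722 yards in his final college season."],
    ["kelce caught 45 passes for 722 yards in 2012."],
]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```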

While automatic metrics can give some indication of performance, they are not currently sufficient for evaluating hallucination in text generation systems. To better understand hallucination, we manually evaluate the top performing baseline, to determine how faithful it is to the content in the source table, under the assumption that discrepancies indicate hallucination. To compute the “Expert” performance, for each example in our multi-reference test set, we held out one reference and asked annotators to compare it with the other references for faithfulness. As the results show, the top performing baseline appears to hallucinate information ~20% of the time.

Model           Faithfulness (overall)   Faithfulness (challenge)
Expert          93.6                     91.4
BERT-to-BERT    76.2                     74.2

Model Errors and Challenges
In the table below, we present a selection of the observed model errors to highlight some of the more challenging aspects of the ToTTo dataset. We find that state-of-the-art models struggle with hallucination, numerical reasoning, and rare topics, even when using cleaned references. The last example shows that even when the model output is correct, it is sometimes not as informative as the original reference, which contains more reasoning about the table.

Reference: in the 1939 currie cup, western province lost to transvaal by 17–6 in cape town.
Model Prediction: the first currie cup was played in 1939 in transvaal at newlands, with western province winning 17–6.

Reference: a second generation of microdrive was announced by ibm in 2000 with increased capacities at 512 mb and 1 gb.
Model Prediction: there were 512 microdrive models in 2000: 1 gigabyte.

Reference: the 1956 grand prix motorcycle racing season consisted of six grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc.
Model Prediction: the 1956 grand prix motorcycle racing season consisted of eight grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc.

Reference: in travis kelce’s last collegiate season, he set personal career highs in receptions (45), receiving yards (722), yards per receptions (16.0) and receiving touchdowns (8).
Model Prediction: travis kelce finished the 2012 season with 45 receptions for 722 yards (16.0 avg.) and eight touchdowns.

Conclusion
In this work, we presented ToTTo, a large, English table-to-text dataset that presents both a controlled generation task and a data annotation process based on iterative sentence revision. We also provided several state-of-the-art baselines, and demonstrated that ToTTo can be a useful dataset for modeling research as well as for developing evaluation metrics that can better detect model improvements.

In addition to the proposed task, we hope our dataset can also be helpful for other tasks such as table understanding and sentence revision. ToTTo is available at our GitHub repo.

Acknowledgements
The authors wish to thank Ming-Wei Chang, Jonathan H. Clark, Kenton Lee, and Jennimaria Palomaki for their insightful discussions and support. Many thanks also to Ashwin Kakarla and his team for help with the annotations.

Source: Google AI Blog


Measuring Gendered Correlations in Pre-trained NLP Models

Natural language processing (NLP) has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, e.g., Wikipedia, by repeatedly masking out words and trying to predict them (this is called masked language modeling). The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. There is then a second training stage, fine-tuning, in which the model uses task-specific training data to learn how to use the general pre-trained representations to do a concrete task, like classification. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream, to ensure the application of these models aligns with our AI Principles.

In “Measuring and Reducing Gendered Correlations in Pre-trained Models” we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models. We present experimental results over public model checkpoints and an academic task dataset to illustrate how the best practices apply, providing a foundation for exploring settings beyond the scope of this case study. We will soon release a series of checkpoints, Zari1, which reduce gendered correlations while maintaining state-of-the-art accuracy on standard NLP task metrics.

Measuring Correlations
To understand how correlations in pre-trained representations can affect downstream task performance, we apply a diverse set of evaluation metrics for studying the representation of gender. Here, we’ll discuss results from one of these tests, based on coreference resolution, which is the capability that allows models to understand the correct antecedent to a given pronoun in a sentence. For example, in the sentence that follows, the model should recognize his refers to the nurse, and not to the patient.

The standard academic formulation of the task is the OntoNotes test (Hovy et al., 2006), and we measure how accurate a model is at coreference resolution in a general setting using an F1 score over this data (as in Tenney et al. 2019). Since OntoNotes represents only one data distribution, we also consider the WinoGender benchmark that provides additional, balanced data designed to identify when model associations between gender and profession incorrectly influence coreference resolution. High values of the WinoGender metric (close to one) indicate a model is basing decisions on normative associations between gender and profession (e.g., associating nurse with the female gender and not male). When model decisions have no consistent association between gender and profession, the score is zero, which suggests that decisions are based on some other information, such as sentence structure or semantics.
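A lightweight way to build intuition for such correlations (this is an informal probe, not the WinoGender coreference metric used in the study) is to ask a pre-trained masked language model which pronoun it prefers in a templated sentence, for example via the Hugging Face fill-mask pipeline:

```python
from transformers import pipeline

# Informal probe: inspect the masked-LM head's preference for pronouns in a
# profession template. Not the WinoGender metric, just a quick sanity check.
fill = pipeline("fill-mask", model="bert-base-uncased")
for template in [
    "The nurse said that [MASK] would be back soon.",
    "The surgeon said that [MASK] would be back soon.",
]:
    print(template)
    for pred in fill(template, targets=["he", "she"]):
        print(f"  {pred['token_str']}: {pred['score']:.3f}")
```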

BERT and ALBERT metrics on OntoNotes (accuracy) and WinoGender (gendered correlations). Low values on the WinoGender metric indicate that a model does not preferentially use gendered correlations in reasoning.

In this study, we see that neither the (Large) BERT nor the ALBERT public model achieves a zero score on the WinoGender examples, despite achieving impressive accuracy on OntoNotes (close to 100%). At least some of this is due to models preferentially using gendered correlations in reasoning. This isn’t completely surprising: there are a range of cues available to understand text and it is possible for a general model to pick up on any or all of these. However, there is reason for caution, as it is undesirable for a model to make predictions primarily based on gendered correlations learned as priors rather than the evidence available in the input.

Best Practices
Given that it is possible for unintended correlations in pre-trained model representations to affect downstream task reasoning, we now ask: what can one do to mitigate any risk this poses when developing new NLP models?

  • It is important to measure for unintended correlations: Model quality may be assessed using accuracy metrics, but these only measure one dimension of performance, especially if the test data is drawn from the same distribution as the training data. For example, the BERT and ALBERT checkpoints have accuracy within 1% of each other, but differ by 26% (relative) in the degree to which they use gendered correlations for coreference resolution. This difference might be important for some tasks; selecting a model with low WinoGender score could be desirable in an application featuring texts about people in professions that may not conform to historical social norms, e.g., male nurses.
  • Be careful even when making seemingly innocuous configuration changes: Neural network model training is controlled by many hyperparameters that are usually selected to maximize some training objective. While configuration choices often seem innocuous, we find they can cause significant changes for gendered correlations, both for better and for worse. For example, dropout regularization is used to reduce overfitting by large models. When we increase the dropout rate used for pre-training BERT and ALBERT, we see a significant reduction in gendered correlations even after fine-tuning. This is promising since a simple configuration change allows us to train models with reduced risk of harm, but it also shows that we should be mindful and evaluate carefully when making any change in model configuration (a configuration sketch follows this list).
    Impact of increasing dropout regularization in BERT and ALBERT.
  • There are opportunities for general mitigations: A further corollary from the perhaps unexpected impact of dropout on gendered correlations is that it opens the possibility to use general-purpose methods for reducing unintended correlations: by increasing dropout in our study, we improve how the models reason about WinoGender examples without manually specifying anything about the task or changing the fine-tuning stage at all. Unfortunately, OntoNotes accuracy does start to decline as the dropout rate increases (which we can see in the BERT results), but we are excited about the potential to mitigate this in pre-training, where changes can lead to model improvements without the need for task-specific updates. We explore counterfactual data augmentation as another mitigation strategy with different tradeoffs in our paper.
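To make the dropout point concrete, the sketch below raises both dropout rates in a BERT configuration before pre-training with the MLM objective. The 0.2 values are placeholders for illustration, not a recommendation from the study, and the Hugging Face classes are simply a convenient stand-in for the original training setup.

```python
from transformers import BertConfig, BertForMaskedLM

# Raise dropout above the 0.1 default used by standard BERT configurations.
config = BertConfig(
    hidden_dropout_prob=0.2,            # dropout on hidden states
    attention_probs_dropout_prob=0.2,   # dropout on attention weights
)
model = BertForMaskedLM(config)         # then pre-train with the masked LM objective
```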

What’s Next
We believe these best practices provide a starting point for developing robust NLP systems that perform well across the broadest possible range of linguistic settings and applications. Of course these techniques on their own are not sufficient to capture and remove all potential issues. Any model deployed in a real-world setting should undergo rigorous testing that considers the many ways it will be used, and implement safeguards to ensure alignment with ethical norms, such as Google's AI Principles. We look forward to developments in evaluation frameworks and data that are more expansive and inclusive to cover the many uses of language models and the breadth of people they aim to serve.

Acknowledgements
This is joint work with Xuezhi Wang, Ian Tenney, Ellie Pavlick, Alex Beutel, Jilin Chen, Emily Pitler, and Slav Petrov. We benefited greatly throughout the project from discussions with Fernando Pereira, Ed Chi, Dipanjan Das, Vera Axelrod, Jacob Eisenstein, Tulsee Doshi, and James Wexler.



1 Zari is an Afghan Muppet designed to show that ‘a little girl could do as much as everybody else’.

Source: Google AI Blog


Language-Agnostic BERT Sentence Embedding

A multilingual embedding model is a powerful tool that encodes text from different languages into a shared embedding space, enabling it to be applied to a range of downstream tasks, like text classification, clustering, and others, while also leveraging semantic information for language understanding. Existing approaches for generating such embeddings, like LASER or m~USE, rely on parallel data, mapping a sentence from one language directly to another language in order to encourage consistency between the sentence embeddings. While these existing multilingual approaches yield good overall performance across a number of languages, they often underperform on high-resource languages compared to dedicated bilingual models, which can leverage approaches like translation ranking tasks with translation pairs as training data to obtain more closely aligned representations. Further, due to limited model capacity and the often poor quality of training data for low-resource languages, it can be difficult to extend multilingual models to support a larger number of languages while maintaining good performance.

Illustration of a multilingual embedding space.

Recent efforts to improve language models include the development of masked language model (MLM) pre-training, such as that used by BERT, ALBERT and RoBERTa. This approach has led to exceptional gains across a wide range of languages and a variety of natural language processing tasks since it only requires monolingual text. In addition, MLM pre-training has been extended to the multilingual setting by modifying MLM training to include concatenated translation pairs, known as translation language modeling (TLM), or by simply introducing pre-training data from multiple languages. However, while the internal model representations learned during MLM and TLM training are helpful when fine-tuning on downstream tasks, without a sentence level objective, they do not directly produce sentence embeddings, which are critical for translation tasks.

In “Language-agnostic BERT Sentence Embedding”, we present a multilingual BERT embedding model, called LaBSE, that produces language-agnostic cross-lingual sentence embeddings for 109 languages. The model is trained on 17 billion monolingual sentences and 6 billion bilingual sentence pairs using MLM and TLM pre-training, resulting in a model that is effective even on low-resource languages for which there is no data available during training. Further, the model establishes a new state of the art on multiple parallel text (a.k.a. bitext) retrieval tasks. We have released the pre-trained model to the community through tfhub, which includes modules that can be used as-is or can be fine-tuned using domain-specific data.

The collection of the training data for 109 supported languages

The Model
In previous work, we proposed the use of a translation ranking task to learn a multilingual sentence embedding space. This approach tasks the model with ranking the true translation over a collection of sentences in the target language, given a sentence in the source language. The translation ranking task is trained using a dual encoder architecture with a shared transformer encoder. The resulting bilingual models achieved state-of-the-art performance on multiple parallel text retrieval tasks (including United Nations and BUCC). However, performance suffered when these bilingual models were extended to support multiple languages (16 languages, in our test case) due to limitations in model capacity, vocabulary coverage, training data quality and more.

Translation ranking task. Given a sentence in a given source language, the task is to find the true translation over a collection of sentences in the target language.

For LaBSE, we leverage recent advances on language model pre-training, including MLM and TLM, on a BERT-like architecture and follow this with fine-tuning on a translation ranking task. To increase model capacity and vocabulary coverage, we use a 12-layer transformer with a 500k token vocabulary, pre-trained using MLM and TLM on 109 languages. The resulting LaBSE model offers extended support to 109 languages in a single model.

The dual encoder architecture, in which the source and target text are encoded using a shared transformer embedding network separately. The translation ranking task is applied, forcing the text that paraphrases each other to have similar representations. The transformer embedding network is initialized from a BERT checkpoint trained on MLM and TLM tasks.

Performance on Cross-lingual Text Retrieval
We evaluate the proposed model using the Tatoeba corpus, a dataset consisting of up to 1,000 English-aligned sentence pairs for 112 languages. For more than 30 of the languages in the dataset, the model has no training data. The model is tasked with finding the nearest neighbor translation for a given sentence, which it calculates using the cosine distance.
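As a small illustration of this nearest-neighbor retrieval, the snippet below encodes English sentences and candidate translations and picks the closest candidate by cosine similarity. It uses the community sentence-transformers port of LaBSE for brevity (the official release is on tfhub); the example sentences are made up.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")  # community port of LaBSE

english = ["A small dog is chasing a frisbee.", "The weather is nice today."]
candidates = ["Ein kleiner Hund jagt eine Frisbee.", "Das Wetter ist heute schön."]

src = model.encode(english, normalize_embeddings=True)
tgt = model.encode(candidates, normalize_embeddings=True)

# With unit-normalized embeddings, cosine similarity reduces to a dot product.
nearest = np.argmax(src @ tgt.T, axis=1)
for i, j in enumerate(nearest):
    print(english[i], "->", candidates[j])
```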

To understand the performance of the model for languages at the head or tail of the training data distribution, we divide the set of languages into several groups and compute the average accuracy for each set. The first 14-language group is selected from the languages supported by m~USE, which cover the languages from the head of the distribution (head languages). We also evaluate a second language group composed of 36 languages from the XTREME benchmark. The third 82-language group, selected from the languages covered by the LASER training data, includes many languages from the tail of the distribution (tail languages). Finally, we compute the average accuracy for all languages.

The table below presents the average accuracy achieved by LaBSE, compared to the m~USE and LASER models, for each language group. As expected, all models perform strongly on the 14-language group that covers most head languages. With more languages included, the averaged accuracy for both LASER and LaBSE declines. However, the reduction in accuracy from the LaBSE model with increasing numbers of languages is much less significant, outperforming LASER significantly, particularly when the full distribution of 112 languages is included (83.7% accuracy vs. 65.5%).

Model     14 Langs   36 Langs   82 Langs   All Langs
m~USE*    93.9       –          –          –
LASER     95.3       84.4       75.9       65.5
LaBSE     95.3       95.0       87.3       83.7
Average accuracy (%) on Tatoeba datasets. The “14 Langs” group consists of languages supported by m~USE; the “36 Langs” group includes languages selected by XTREME; and the “82 Langs” group represents languages covered by the LASER model. The “All Langs” group includes all languages supported by Tatoeba.
* The m~USE model comes in two varieties, one built on a convolutional neural network architecture and the other a Transformer-like architecture. Here, we compare only to the Transformer version.

Support for Unsupported Languages
The average performance of all languages included in Tatoeba is very promising. Interestingly, LaBSE even performs relatively well for many of the 30+ Tatoeba languages for which it has no training data (see below). For one third of these languages the LaBSE accuracy is higher than 75% and only 8 have accuracy lower than 25%, indicating very strong transfer performance to languages without training data. Such positive language transfer is only possible due to the massively multilingual nature of LaBSE.

LaBSE accuracy for the subset of Tatoeba languages (represented with ISO 639-1/639-2 codes) for which there was no training data.

Mining Parallel Text from the Web
LaBSE can be used for mining parallel text (bi-text) from web-scale data. For example, we applied LaBSE to CommonCrawl, a large-scale monolingual corpus, to process 560 million Chinese and 330 million German sentences for the extraction of parallel text. Each Chinese and German sentence pair is encoded using the LaBSE model and then the encoded embedding is used to find a potential translation from a pool of 7.7 billion English sentences pre-processed and encoded by the model. An approximate nearest neighbor search is employed to quickly search through the high-dimensional sentence embeddings. After a simple filtering, the model returns 261M and 104M potential parallel pairs for English-Chinese and English-German, respectively. An NMT model trained on the mined data reaches BLEU scores of 35.7 and 27.2 on the WMT translation tasks (WMT17 for English-to-Chinese and WMT14 for English-to-German). The performance is only a few points away from current state-of-the-art models trained on high-quality parallel data.
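The brute-force sketch below shows the core idea of this mining step on a small pool of pre-computed, unit-normalized embeddings; at the scale described above one would use an approximate nearest neighbor library instead, and the similarity threshold here is an arbitrary placeholder rather than the filtering used in the paper.

```python
import numpy as np

def mine_parallel_pairs(src_emb, tgt_emb, threshold=0.6):
    """Return (source_index, target_index) pairs whose best match clears the threshold.

    src_emb and tgt_emb are unit-normalized sentence embeddings (e.g., from LaBSE)
    for two languages, shaped (num_src, dim) and (num_tgt, dim).
    """
    sims = src_emb @ tgt_emb.T                      # cosine similarity via dot product
    best = sims.argmax(axis=1)                      # nearest target for each source sentence
    keep = sims[np.arange(len(src_emb)), best] >= threshold
    return [(int(i), int(best[i])) for i in np.flatnonzero(keep)]
```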

Conclusion
We’re excited to share this research, and the model, with the community. The pre-trained model is released at tfhub to support further research in this direction and possible downstream applications. We also believe that what we’re showing here is just the beginning, and there are more important research problems to be addressed, such as building better models to support all languages.

Acknowledgements
The core team includes Wei Wang, Naveen Arivazhagan, and Daniel Cer. We would like to thank the Google Research Language team, along with our partners in other Google groups, for their feedback and suggestions. Special thanks goes to Sidharth Mudgal and Jax Law for help with data processing, as well as Jialu Liu, Tianqi Liu, Chen Chen, and Anosh Raj for help on BERT pre-training.

Source: Google AI Blog


Grounding Natural Language Instructions to Mobile UI Actions



Mobile devices offer a myriad of functionalities that can assist in everyday activities. However, many of these functionalities are not easily discoverable or accessible, forcing users to look up how to perform a specific task, such as how to turn on traffic mode in Maps or change notification settings in YouTube. While searching the web for detailed instructions is an option, it is still up to the user to follow those instructions step-by-step and navigate UI details through a small touchscreen, which can be tedious and time-consuming and reduces accessibility. What if one could design a computational agent to turn these language instructions into actions and automatically execute them on the user’s behalf?

In “Mapping Natural Language Instructions to Mobile UI Action Sequences”, published at ACL 2020, we present the first step towards addressing the problem of automatic action sequence mapping, creating three new datasets used to train deep learning models that ground natural language instructions to executable mobile UI actions. This work lays the technical foundation for task automation on mobile devices that would alleviate the need to maneuver through UI details, which may be especially valuable for users who are visually or situationally impaired. We have also open-sourced our model code and data pipelines through our GitHub repository, in order to spur further developments among the research community.

Constructing Language Grounding Models
People often provide one another with instructions in order to coordinate joint efforts and accomplish tasks involving complex sequences of actions, for example, following a recipe to bake a cake, or having a friend walk you through setting up a home network. Building computational agents able to help with similar interactions is an important goal that requires true language grounding in the environments in which the actions take place.

The learning task addressed here is to predict a sequence of actions for a mobile platform given a set of instructions, a sequence of screens produced as the system transitions from one screen to another, as well as the set of interactive elements on those screens. Training such a model end-to-end would require paired language-action data, which is difficult to acquire at a large scale.

Instead, we deconstruct the problem into two sequential steps: an action phrase-extraction step and a grounding step.
The workflow of grounding language instructions to executable actions.
The action phrase-extraction step identifies the operation, object and argument descriptions from multi-step instructions using a Transformer model with area attention for representing each description phrase. Area attention allows the model to attend to a group of adjacent words in the instruction (a span) as a whole for decoding a description.
The action phrase extraction model takes a word sequence of a natural language instruction and outputs a sequence of spans (denoted in red boxes) that indicate the phrases describing the operation, the object and the argument of each action in the task.
Next, the grounding step matches the extracted operation and object descriptions with a UI object on the screen. Again, we use a Transformer model, but in this case, it contextually represents UI objects and grounds object descriptions to them.
The grounding model takes the extracted spans as input and grounds them to executable actions, including the object an action is applied to, given the UI screen at each step during execution.
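To make the grounding step concrete, here is a toy sketch in which a crude token-overlap score stands in for the contextual Transformer encoder: the extracted object description is matched against the objects on the current screen, and the best-scoring element becomes the target of the action. The screen representation and field names are illustrative assumptions.

```python
# A toy sketch of the grounding step: match an extracted object description to one
# of the UI objects on the current screen. A bag-of-words overlap score stands in
# for the contextual Transformer encoder described in the post.
from dataclasses import dataclass

@dataclass
class UIObject:
    object_id: str
    text: str            # visible label or content description of the element
    element_type: str    # e.g. "button", "switch", "textbox"

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between lowercase token sets (a crude similarity proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def ground(operation: str, object_description: str, screen: list[UIObject]):
    """Pick the on-screen object that best matches the extracted description."""
    best = max(screen, key=lambda obj: token_overlap(object_description,
                                                     f"{obj.text} {obj.element_type}"))
    return {"operation": operation, "target": best.object_id}

# Example: grounding the description "traffic mode switch" on a hypothetical screen.
screen = [
    UIObject("id/search_bar", "Search here", "textbox"),
    UIObject("id/traffic_toggle", "Traffic", "switch"),
    UIObject("id/settings", "Settings", "button"),
]
action = ground(operation="toggle", object_description="traffic mode switch", screen=screen)
print(action)   # {'operation': 'toggle', 'target': 'id/traffic_toggle'}
```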
Results
To investigate the feasibility of this task and the effectiveness of our approach, we construct three new datasets to train and evaluate our model. The first dataset includes 187 multi-step English instructions for operating Pixel phones, along with their corresponding action-screen sequences; it enables assessment of full task performance on naturally occurring instructions and is used for testing end-to-end grounding quality. For action phrase extraction training and evaluation, we obtain English “how-to” instructions, which are abundant on the web, and annotate the phrases that describe each action. To train the grounding model, we synthetically generate 295K single-step commands to UI actions, covering 178K different UI objects across 25K mobile UI screens from a public Android UI corpus.
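As a rough illustration of how such single-step commands might be synthesized, the hypothetical snippet below pairs template-generated instructions with their ground-truth actions for a given UI object. The templates and data format are assumptions made for this sketch; the actual data pipelines are in the open-sourced repository.

```python
# A hedged sketch of template-based synthesis of single-step commands: each generated
# instruction is paired with the action (operation, target object, argument) it describes.
# Templates and the screen-object format below are illustrative assumptions.
import random

TEMPLATES = {
    "click":  ["tap the {name} {type}", "click on {name}", "press the {name} {type}"],
    "toggle": ["turn on {name}", "switch on the {name} {type}"],
    "input":  ["type '{arg}' into the {name} {type}", "enter '{arg}' in {name}"],
}

def synthesize(ui_object, operation, argument=None, rng=random):
    """Generate one (command, action) training pair for a UI object."""
    template = rng.choice(TEMPLATES[operation])
    command = template.format(name=ui_object["name"], type=ui_object["type"],
                              arg=argument or "").strip()
    action = {"operation": operation, "object_id": ui_object["id"], "argument": argument}
    return command, action

example = {"id": "id/wifi_switch", "name": "Wi-Fi", "type": "switch"}
print(synthesize(example, "toggle"))
# e.g. ("turn on Wi-Fi", {'operation': 'toggle', 'object_id': 'id/wifi_switch', 'argument': None})
```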

A Transformer with area attention obtains 85.56% accuracy for predicting span sequences that completely match the ground truth. The phrase extractor and grounding model together obtain 89.21% partial and 70.59% complete accuracy for matching ground-truth action sequences on the more challenging task of mapping language instructions to executable actions end-to-end. We also evaluated alternative methods and representations of UI objects, such as using a graph convolutional network (GCN) or a feedforward network, and found that representations that capture an object's context within the screen lead to better grounding accuracy. The new datasets, models and results provide an important first step on the challenging problem of grounding natural language instructions to mobile UI actions.

Conclusion
This research, and language grounding in general, is an important step toward translating multi-stage instructions into actions on a graphical user interface. Successful application of task automation to the UI domain has the potential to significantly improve accessibility: language interfaces might help individuals who are visually impaired perform tasks with interfaces that are predicated on sight. Task automation also matters for situational impairment, when one cannot easily access a device while encumbered by other tasks.

By deconstructing the problem into action phrase extraction and language grounding, progress on either step can improve full task performance, and the approach alleviates the need for paired language-action datasets, which are difficult to collect at scale. For example, action span extraction is related to both semantic role labeling and the extraction of multiple facts from text, and could benefit from innovations in span identification and multitask learning. Reinforcement learning, which has been applied in previous grounding work, may help improve out-of-sample prediction for grounding in UIs and improve direct grounding from hidden state representations. Although our datasets were based on Android UIs, our approach can be applied generally to instruction grounding on other user interface platforms. Lastly, our work provides a technical foundation for investigating user experiences in language-based human-computer interaction.

Acknowledgements
Many thanks to my collaborators on this work at Google Research. Xin Zhou and Jiacong He contributed substantially to the data pipelines and the creation of the datasets. Yuan Zhang and Jason Baldridge provided much valuable advice for the project and contributed to the presentation of the work. Gang Li provided generous help for creating open-source datasets. Many thanks to Ashwin Kakarla, Muqthar Mohammad and Mohd Majeed for their help with the annotations.

Source: Google AI Blog


Presenting a Challenge and Workshop in Efficient Open-Domain Question Answering



One of the primary goals of natural language processing is to build systems that can answer a user's questions. To do this, computers need to be able to understand questions, represent world knowledge, and reason their way to answers. Traditionally, answers have been retrieved from a collection of documents or a knowledge graph. For example, to answer the question, “When was the declaration of independence officially signed?” a system might first find the most relevant article from Wikipedia, and then locate a sentence containing the answer, “August 2, 1776”. However, more recent approaches, like T5, have shown that neural models trained on large amounts of web text can answer questions directly, without retrieving documents or facts from a knowledge graph. This has led to significant debate about how knowledge should be stored for use by question answering systems: in human-readable text and structured formats, or in the learned parameters of a neural network.

Today, we are proud to announce the EfficientQA competition and workshop at NeurIPS 2020, organized in cooperation with Princeton University and the University of Washington. The goal is to develop an end-to-end question answering system that contains all of the knowledge required to answer open-domain questions. There are no constraints on how the knowledge is stored — it could be in documents, databases, the parameters of a neural network, or any other form — but entries will be evaluated based on the number of bytes used to access this knowledge, including code, corpora, and model parameters. There will also be an unconstrained track, in which the goal is to achieve the best possible question answering performance regardless of system size. To build small, yet robust systems, participants will have to explore new methods of knowledge representation and reasoning.
An illustration of how the memory budget changes as a neural network and retrieval corpus grow and shrink. It is possible that successful systems will also use other resources such as a knowledge graph.
Competition Overview
The competition will be evaluated using the open-domain variant of the Natural Questions dataset. We will also provide further human evaluation of all top-performing entries to account for the fact that there are many correct ways to answer a question, not all of which will be covered by any set of reference answers. For example, for the question “What type of car is a Jeep considered?” both “off-road vehicles” and “crossover SUVs” are valid answers.
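The snippet below sketches the style of normalized exact-match scoring commonly used for open-domain QA benchmarks such as Natural Questions, which illustrates why multiple references and additional human review are needed; it is not presented as the official EfficientQA scorer.

```python
# A sketch of normalized exact-match scoring for open-domain QA: lowercase, strip
# punctuation and English articles, then compare against every reference answer.
import re
import string

def normalize(answer: str) -> str:
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)   # drop English articles
    return " ".join(answer.split())                   # collapse whitespace

def exact_match(prediction: str, references: list[str]) -> bool:
    return normalize(prediction) in {normalize(r) for r in references}

references = ["off-road vehicles", "crossover SUVs"]
print(exact_match("The off-road vehicles", references))  # True: articles/punctuation normalized away
print(exact_match("SUVs", references))                   # False: strict matching misses this, hence human review
```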

The competition is divided into four separate tracks: best-performing system under 500 MB; best-performing system under 6 GB; smallest system to get at least 25% accuracy; and the best-performing system with no constraints. The winners of each of these tracks will be invited to present their work during the competition track at NeurIPS 2020, which will be hosted virtually. We will also put each of the winning systems up against human trivia experts (the 2017 NeurIPS Human-Computer competition featured Jeopardy! and Who Wants to Be a Millionaire champions) in a real-time contest at the virtual conference.
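Because a submission's footprint is measured in total bytes, a basic sanity check can be as simple as summing file sizes. The sketch below does such a tally against assumed decimal budgets and a hypothetical submission path; it is not the official measurement procedure.

```python
# A minimal sketch (not the official evaluation harness) of tallying a submission's
# on-disk footprint -- code, corpora, and model parameters alike -- against the
# constrained-track budgets. The path and budget values are illustrative assumptions.
import os

def directory_bytes(root: str) -> int:
    """Sum the sizes of all files under a submission directory."""
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

BUDGETS = {"500 MB track": 500 * 10**6, "6 GB track": 6 * 10**9}

size = directory_bytes("my_submission/")   # hypothetical submission directory
for track, budget in BUDGETS.items():
    status = "OK" if size <= budget else "over budget"
    print(f"{track}: {size:,} / {budget:,} bytes -> {status}")
```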

Participation
To participate, go to the competition site where you will find the data and evaluation code available for download, as well as dates and instructions on how to participate, and a sign-up form for updates. Along with our academic collaborators, we have provided some example systems to help you get started.

We believe that the field of natural language processing will benefit from greater exploration and comparison of approaches to building small question answering systems. We hope that by encouraging the development of very small systems, this competition will pave the way for on-device question answering.

Acknowledgements
Creating this challenge and workshop has been a large team effort including Adam Roberts, Colin Raffel, Chris Alberti, Jordan Boyd-Graber, Jennimaria Palomaki, Kenton Lee, Kelvin Guu, and Michael Collins from Google; as well as Sewon Min and Hannaneh Hajishirzi from the University of Washington; and Danqi Chen from Princeton University.

Source: Google AI Blog


PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization



Students are often tasked with reading a document and producing a summary (for example, a book report) to demonstrate both reading comprehension and writing ability. This abstractive text summarization is one of the most challenging tasks in natural language processing, involving understanding of long passages, information compression, and language generation. The dominant paradigm for training machine learning models to do this is sequence-to-sequence (seq2seq) learning, where a neural network learns to map input sequences to output sequences. While these seq2seq models were initially developed using recurrent neural networks, Transformer encoder-decoder models have recently become favored as they are more effective at modeling the dependencies present in the long sequences encountered in summarization.

Transformer models combined with self-supervised pre-training (e.g., BERT, GPT-2, RoBERTa, XLNet, ALBERT, T5, ELECTRA) have been shown to be a powerful framework for general language learning, achieving state-of-the-art performance when fine-tuned on a wide array of language tasks. In prior work, the self-supervised objectives used in pre-training have been somewhat agnostic to the down-stream application in favor of generality; we wondered whether better performance could be achieved if the self-supervised objective more closely mirrored the final task.

In “PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization” (to appear at the 2020 International Conference on Machine Learning), we designed a pre-training self-supervised objective (called gap-sentence generation) for Transformer encoder-decoder models to improve fine-tuning performance on abstractive summarization, achieving state-of-the-art results on 12 diverse summarization datasets. Supplementary to the paper, we are also releasing the training code and model checkpoints on GitHub.

A Self-Supervised Objective for Summarization
Our hypothesis is that the closer the pre-training self-supervised objective is to the final down-stream task, the better the fine-tuning performance. In PEGASUS pre-training, several whole sentences are removed from documents and the model is tasked with recovering them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. This is an incredibly difficult task that may seem impossible, even for people, and we don’t expect the model to solve it perfectly. However, such a challenging task encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task. The advantage of this self-supervision is that you can create as many examples as there are documents, without any human annotation, which is often the bottleneck in purely supervised systems.
A self-supervised example for PEGASUS during pre-training. The model is trained to output all the masked sentences.
We found that choosing “important” sentences to mask worked best, making the output of self-supervised examples even more similar to a summary. We automatically identified these sentences by finding those that were most similar to the rest of the document according to a metric called ROUGE. ROUGE computes the similarity of two texts by counting their overlapping n-grams, producing a score from 0 to 100 (ROUGE-1, ROUGE-2, and ROUGE-L are three common variants).
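The following sketch shows a simplified version of this selection procedure: each sentence is scored with a bare-bones ROUGE-1 F1 against the rest of the document, the top-scoring sentences are replaced with a mask token in the input, and the removed sentences, concatenated, become the target. Sentence splitting, the mask format, and the scoring details are simplified assumptions relative to the paper.

```python
# A simplified sketch of gap-sentence selection for PEGASUS-style pre-training.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two texts (a bare-bones ROUGE-1)."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def gap_sentence_example(sentences: list[str], mask_ratio: float = 0.3):
    """Build one (input, target) pre-training pair from a list of sentences."""
    n_mask = max(1, int(len(sentences) * mask_ratio))
    # Importance of a sentence = ROUGE-1 F1 against the rest of the document.
    scores = [
        rouge1_f1(s, " ".join(sentences[:i] + sentences[i + 1:]))
        for i, s in enumerate(sentences)
    ]
    masked = set(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_mask])
    inp = " ".join("[MASK1]" if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return inp, target

doc = [
    "Pegasus is a mythical winged horse.",
    "The horse appears throughout Greek mythology.",
    "It is often shown carrying heroes into battle.",
    "Many modern logos still depict a winged horse.",
]
inp, target = gap_sentence_example(doc)
print("INPUT :", inp)
print("TARGET:", target)
```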

Similar to other recent methods, such as T5, we pre-trained our model on a very large corpus of web-crawled documents and then fine-tuned it on 12 public down-stream abstractive summarization datasets, achieving new state-of-the-art results as measured by automatic metrics while using only 5% of the number of parameters of T5. The datasets were chosen to be diverse, including news articles, scientific papers, patents, short stories, e-mails, legal documents, and how-to directions, showing that the model framework is adaptable to a wide variety of topics.

Fine-Tuning with Small Numbers of Examples
While PEGASUS showed remarkable performance with large datasets, we were surprised to learn that the model didn’t require a large number of examples for fine-tuning to get near state-of-the-art performance:
ROUGE scores (three variants, higher is better) vs. the number of supervised examples across four selected summarization datasets. The dotted line shows the performance of the Transformer encoder-decoder trained with full supervision but without pre-training.
With only 1000 fine-tuning examples, we were able to perform better in most tasks than a strong baseline (Transformer encoder-decoder) that used the full supervised data, which in some cases had many orders of magnitude more examples. This “sample efficiency” greatly increases the usefulness of text summarization models as it significantly lowers the scale and cost of supervised data collection, which in the case of summarization is very expensive.

Human-Quality Summaries
While we find automatic metrics such as ROUGE to be useful proxies for measuring progress during model development, they provide only limited information and don’t tell us the whole story, such as how fluent the summaries are or how they compare to those written by people. To this end, we conducted a human evaluation in which raters were asked to compare summaries from our model with human-written ones (without knowing which was which). This has some similarities to the Turing test.
Human raters were asked to rate model and human-written summaries without knowing which was which. The document is truncated here for illustration, but raters see the full text.
We performed the experiment with 3 different datasets and found that human raters do not consistently prefer the human summaries to those from our model. Furthermore, our models trained with only 1000 examples performed nearly as well. In particular, with the much studied XSum and CNN/Dailymail datasets, the model achieves human-like performance using only 1000 examples. This suggests large datasets of supervised examples are no longer necessary for summarization, opening up many low-cost use-cases.

A Test of Comprehension: Counting Ships
Following this post is an example article from the XSum dataset and the model-generated abstractive summary. As we can see, the model correctly abstracts and paraphrases four named frigates (HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall) as “four Royal Navy frigates”, something an extractive approach could not do since “four” is not mentioned anywhere. Was this a fluke or did the model actually count? One way to find out is to add and remove ships to see if the count changes.

As can be seen below, the model successfully “counts” ships from 2 to 5. However, when we add a sixth ship, the “HMS Alphabet”, it miscounts it as “seven”. So it appears the model has learned to count small numbers of items in a list, but does not yet generalize as elegantly as we would hope. Still, we think this rudimentary counting ability is impressive as it was not explicitly programmed into the model, and it demonstrates a limited amount of “symbolic reasoning” by the model.
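A probe of this kind is easy to script: vary the number of ships mentioned in the article, re-summarize, and compare the count the model states with the true count. The sketch below uses a canned summarizer so it runs as written; swapping in a real PEGASUS summarization model would be needed to reproduce the behavior described above.

```python
# A hedged sketch of the counting probe: list a varying number of ships, summarize,
# and check whether the summary's spelled-out count tracks the true number. The
# article template is a toy stand-in, and the ship list mixes the four frigates named
# in the article with hypothetical additions for the probe.
import re

SHIPS = ["HMS Cumberland", "HMS Campbeltown", "HMS Chatham",
         "HMS Cornwall", "HMS Argyll", "HMS Alphabet"]  # last two are hypothetical additions

NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7}

def stated_count(summary: str):
    """Return the first spelled-out small number found in the summary, if any."""
    match = re.search(r"\b(two|three|four|five|six|seven)\b", summary.lower())
    return NUMBER_WORDS[match.group(1)] if match else None

def probe(summarize, max_ships: int = 6):
    """Compare the true ship count with the count the model states in its summary."""
    for n in range(2, max_ships + 1):
        article = ("The Ministry of Defence said " + ", ".join(SHIPS[:n]) +
                   " would be withdrawn from service.")
        print(f"{n} ships listed -> summary says: {stated_count(summarize(article))}")

# Canned summarizer so the sketch runs without downloading a model; replace it with a
# call to a real abstractive summarization model to carry out the actual probe.
probe(lambda article: "Four Royal Navy frigates are to be withdrawn from service.")
```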

PEGASUS Code and Model Release
To support ongoing research in this field and ensure reproducibility, we are releasing the PEGASUS code and model checkpoints on GitHub. This includes fine-tuning code, which can be used to adapt PEGASUS to other summarization datasets.

Acknowledgements
This work has been a collaborative effort involving Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. We thank the T5 and Google News teams for providing datasets for pre-training PEGASUS.

Source: Google AI Blog

