Tag Archives: Natural Language Processing

Measuring Gendered Correlations in Pre-trained NLP Models

Natural language processing (NLP) has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, e.g., Wikipedia, by repeatedly masking out words and trying to predict them (this is called masked language modeling). The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. There is then a second training stage, fine-tuning, in which the model uses task-specific training data to learn how to use the general pre-trained representations to do a concrete task, like classification. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream, to ensure the application of these models aligns with our AI Principles.
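
To make the masked language modeling setup concrete, the minimal sketch below queries a public BERT checkpoint for likely fillers of a masked word. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is specific to the work described here.

```python
# Minimal sketch of masked language modeling, assuming the Hugging Face
# "transformers" library and the public "bert-base-uncased" checkpoint.
from transformers import pipeline

# The fill-mask pipeline returns the most likely tokens for the [MASK] slot.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("The surgeon picked up the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```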

In “Measuring and Reducing Gendered Correlations in Pre-trained Models” we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models. We present experimental results over public model checkpoints and an academic task dataset to illustrate how the best practices apply, providing a foundation for exploring settings beyond the scope of this case study. We will soon release a series of checkpoints, Zari [1], which reduce gendered correlations while maintaining state-of-the-art accuracy on standard NLP task metrics.

Measuring Correlations
To understand how correlations in pre-trained representations can affect downstream task performance, we apply a diverse set of evaluation metrics for studying the representation of gender. Here, we’ll discuss results from one of these tests, based on coreference resolution, which is the capability that allows models to understand the correct antecedent to a given pronoun in a sentence. For example, in a sentence such as “The nurse notified the patient that his shift would be ending in an hour,” the model should recognize that his refers to the nurse, and not to the patient.

The standard academic formulation of the task is the OntoNotes test (Hovy et al., 2006), and we measure how accurate a model is at coreference resolution in a general setting using an F1 score over this data (as in Tenney et al. 2019). Since OntoNotes represents only one data distribution, we also consider the WinoGender benchmark that provides additional, balanced data designed to identify when model associations between gender and profession incorrectly influence coreference resolution. High values of the WinoGender metric (close to one) indicate a model is basing decisions on normative associations between gender and profession (e.g., associating nurse with the female gender and not male). When model decisions have no consistent association between gender and profession, the score is zero, which suggests that decisions are based on some other information, such as sentence structure or semantics.
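
As a simplified illustration of this kind of measurement (not the exact metric used in the paper), one can probe whether a coreference system’s decision flips when only the pronoun’s gender is changed. In the sketch below, resolve_antecedent is a hypothetical stand-in for the coreference model under evaluation, and the template is likewise illustrative.

```python
# Simplified illustration of a gendered-correlation probe; not the exact
# metric from the paper. `resolve_antecedent` is a hypothetical stand-in for
# the coreference model being evaluated: it returns which noun (e.g., "nurse"
# or "patient") the pronoun is resolved to.
def gendered_correlation_rate(templates, resolve_antecedent):
    """Fraction of templates whose resolution flips when only the pronoun's
    gender is changed; 0.0 means decisions are consistent across genders."""
    flips = 0
    for template in templates:
        male = resolve_antecedent(template.format(pronoun="he"))
        female = resolve_antecedent(template.format(pronoun="she"))
        flips += int(male != female)
    return flips / len(templates)

# Illustrative template, not taken from the benchmark itself.
templates = [
    "The nurse notified the patient that {pronoun} would be off duty soon.",
]
```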

BERT and ALBERT metrics on OntoNotes (accuracy) and WinoGender (gendered correlations). Low values on the WinoGender metric indicate that a model does not preferentially use gendered correlations in reasoning.

In this study, we see that neither the public BERT (Large) model nor the ALBERT model achieves a zero score on the WinoGender examples, despite both achieving impressive accuracy on OntoNotes (close to 100%). At least some of this is due to the models preferentially using gendered correlations in reasoning. This isn’t completely surprising: there is a range of cues available for understanding text, and it is possible for a general model to pick up on any or all of them. However, there is reason for caution, as it is undesirable for a model to make predictions primarily based on gendered correlations learned as priors rather than the evidence available in the input.

Best Practices
Given that it is possible for unintended correlations in pre-trained model representations to affect downstream task reasoning, we now ask: what can one do to mitigate any risk this poses when developing new NLP models?

  • It is important to measure for unintended correlations: Model quality may be assessed using accuracy metrics, but these only measure one dimension of performance, especially if the test data is drawn from the same distribution as the training data. For example, the BERT and ALBERT checkpoints have accuracy within 1% of each other, but differ by 26% (relative) in the degree to which they use gendered correlations for coreference resolution. This difference might be important for some tasks; selecting a model with low WinoGender score could be desirable in an application featuring texts about people in professions that may not conform to historical social norms, e.g., male nurses.
  • Be careful even when making seemingly innocuous configuration changes: Neural network model training is controlled by many hyperparameters that are usually selected to maximize some training objective. While configuration choices often seem innocuous, we find they can cause significant changes in gendered correlations, both for better and for worse. For example, dropout regularization is used to reduce overfitting by large models. When we increase the dropout rate used for pre-training BERT and ALBERT, we see a significant reduction in gendered correlations even after fine-tuning (see the configuration sketch after this list). This is promising, since a simple configuration change allows us to train models with reduced risk of harm, but it also shows that we should be mindful and evaluate carefully when making any change in model configuration.
    Impact of increasing dropout regularization in BERT and ALBERT.
  • There are opportunities for general mitigations: A further corollary from the perhaps unexpected impact of dropout on gendered correlations is that it opens the possibility to use general-purpose methods for reducing unintended correlations: by increasing dropout in our study, we improve how the models reason about WinoGender examples without manually specifying anything about the task or changing the fine-tuning stage at all. Unfortunately, OntoNotes accuracy does start to decline as the dropout rate increases (which we can see in the BERT results), but we are excited about the potential to mitigate this in pre-training, where changes can lead to model improvements without the need for task-specific updates. We explore counterfactual data augmentation as another mitigation strategy with different tradeoffs in our paper.
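
As a minimal sketch of the dropout change referenced in the second practice above, the snippet below raises the two dropout hyperparameters in a BERT pre-training configuration, assuming the Hugging Face transformers library; the values shown are placeholders rather than the settings used in the study.

```python
# Sketch of increasing dropout for BERT pre-training, assuming the Hugging
# Face "transformers" library. The dropout values below are placeholders,
# not the settings used in the study.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    hidden_dropout_prob=0.15,           # default is 0.1
    attention_probs_dropout_prob=0.15,  # default is 0.1
)
model = BertForMaskedLM(config)  # pre-train with the usual MLM objective
```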

What’s Next
We believe these best practices provide a starting point for developing robust NLP systems that perform well across the broadest possible range of linguistic settings and applications. Of course these techniques on their own are not sufficient to capture and remove all potential issues. Any model deployed in a real-world setting should undergo rigorous testing that considers the many ways it will be used, and implement safeguards to ensure alignment with ethical norms, such as Google's AI Principles. We look forward to developments in evaluation frameworks and data that are more expansive and inclusive to cover the many uses of language models and the breadth of people they aim to serve.

Acknowledgements
This is joint work with Xuezhi Wang, Ian Tenney, Ellie Pavlick, Alex Beutel, Jilin Chen, Emily Pitler, and Slav Petrov. We benefited greatly throughout the project from discussions with Fernando Pereira, Ed Chi, Dipanjan Das, Vera Axelrod, Jacob Eisenstein, Tulsee Doshi, and James Wexler.



[1] Zari is an Afghan Muppet designed to show that ‘a little girl could do as much as everybody else’.

Source: Google AI Blog


Advancing NLP with Efficient Projection-Based Model Architectures

Deep neural networks have radically transformed natural language processing (NLP) in the last decade, primarily through their application in data centers using specialized hardware. However, issues such as preserving user privacy, eliminating network latency, enabling offline functionality, and reducing operation costs have rapidly spurred the development of NLP models that can be run on-device rather than in data centers. Yet mobile devices have limited memory and processing power, which requires models running on them to be small and efficient — without compromising quality.

Last year, we published a neural architecture called PRADO, which at the time achieved state-of-the-art performance on many text classification problems, using a model with fewer than 200K parameters. While most models use a fixed number of parameters per token, the PRADO model used a network structure that required extremely few parameters to learn the most relevant or useful tokens for the task.

Today we describe a new extension to the model, called pQRNN, which advances the state of the art for NLP performance with a minimal model size. The novelty of pQRNN is in how it combines a simple projection operation with a quasi-RNN encoder for fast, parallel processing. We show that the pQRNN model is able to achieve BERT-level performance on a text classification task with orders of magnitude fewer parameters.

What Makes PRADO Work?
When it was developed a year ago, PRADO exploited NLP domain-specific knowledge about text segmentation to reduce model size and improve performance. Normally, the text input to NLP models is first processed into a form that is suitable for the neural network by segmenting it into pieces (tokens) that correspond to values in a predefined universal dictionary (a list of all possible tokens). The neural network then uniquely identifies each segment using a trainable parameter vector; collectively, these vectors comprise the embedding table. However, the way in which text is segmented has a significant impact on model performance, size, and latency. The figure below shows the spectrum of approaches used by the NLP community and their pros and cons.

Since the number of text segments is such an important parameter for model performance and compression, it raises the question of whether or not an NLP model needs to be able to distinctly identify every possible text segment. To answer this question we look at the inherent complexity of NLP tasks.

Only a few NLP tasks (e.g., language models and machine translation) need to know subtle differences between text segments and thus need to be capable of uniquely identifying all possible text segments. In contrast, the majority of other tasks can be solved by knowing a small subset of these segments. Furthermore, this subset of task-relevant segments will likely not be the most frequent, as a significant fraction of segments will undoubtedly be dedicated to articles, such as a, an, the, etc., which for many tasks are not necessarily critical. Hence, allowing the network to determine the most relevant segments for a given task results in better performance. In addition, the network does not need to be able to uniquely identify these segments, but only needs to recognize clusters of text segments. For example, a sentiment classifier just needs to know segment clusters that are strongly correlated to the sentiment in the text.

Leveraging these insights, PRADO was designed to learn clusters of text segments from words rather than word pieces or characters, which enabled it to achieve good performance on low-complexity NLP tasks. Since word units are more meaningful, yet the set of words most relevant to a given task is reasonably small, many fewer model parameters are needed to learn such a reduced subset of relevant word clusters.

Improving PRADO
Building on the success of PRADO, we developed an improved NLP model, called pQRNN. This model is composed of three building blocks: a projection operator that converts tokens in the text to a sequence of ternary vectors, a dense bottleneck layer, and a stack of QRNN encoders.

The implementation of the projection layer in pQRNN is identical to that used in PRADO and helps the model learn the most relevant tokens without a fixed set of parameters to define them. It first fingerprints each token in the text and converts it to a ternary feature vector using a simple mapping function. This results in a ternary vector sequence with a balanced symmetric distribution that uniquely represents the text. This representation is not directly useful, since it does not contain any information needed to solve the task of interest and the network has no control over it. We combine it with a dense bottleneck layer to allow the network to learn a per-word representation that is relevant for the task at hand. The representation resulting from the bottleneck layer still does not take the context of the word into account. We learn a contextual representation by using a stack of bidirectional QRNN encoders. The result is a network that is capable of learning a contextual representation from just the text input, without employing any kind of preprocessing.
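
The projection idea can be approximated in a few lines of code: hash each token into a fixed-length vector of values in {-1, 0, +1}, so that no per-token embedding table is needed. This is an illustrative sketch of the concept, not the actual PRADO/pQRNN implementation.

```python
# Illustrative sketch of a projection operator that maps tokens to ternary
# feature vectors; it approximates the idea, not the actual PRADO/pQRNN code.
import hashlib

def ternary_projection(token, dim=128):
    """Hash a token into a length-`dim` vector of values in {-1, 0, +1}."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()  # 256 bits
    bits = int.from_bytes(digest, "big")
    features = []
    for i in range(dim):
        # Use two bits per feature: 00/01 -> 0, 10 -> +1, 11 -> -1.
        two_bits = (bits >> (2 * i)) & 0b11
        features.append(0 if two_bits < 2 else (1 if two_bits == 2 else -1))
    return features

# The text becomes a sequence of ternary vectors, one per token.
sequence = [ternary_projection(tok) for tok in "this video is great".split()]
```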

Performance
We evaluated pQRNN on the civil_comments dataset and compared it with the BERT model on the same task. Because model size is proportional to the number of parameters, pQRNN is much smaller than BERT. In addition, pQRNN is quantized, further reducing the model size by a factor of 4. The public pretrained version of BERT performed poorly on the task, so the comparison is made against a BERT version that is pretrained on several relevant multilingual data sources to achieve the best possible performance.

We capture the area under the curve (AUC) for the two models. Without any kind of pre-training and just trained on the supervised data, the AUC for pQRNN is 0.963 using 1.3 million quantized (8-bit) parameters. With pre-training on several different data sources and fine-tuning on the supervised data, the BERT model gets 0.976 AUC using 110 million floating point parameters.

Conclusion
We have demonstrated how our previous-generation model, PRADO, can be used as the foundation for the next generation of state-of-the-art lightweight text classification models. We present one such model, pQRNN, and show that this new architecture can nearly achieve BERT-level performance, despite using 300x fewer parameters and being trained on only the supervised data. To stimulate further research in this area, we have open-sourced the PRADO model and encourage the community to use it as a jumping-off point for new model architectures.

Acknowledgements
We thank Yicheng Fan, Márius Šajgalík, Peter Young and Arun Kandoor for contributing to the open sourcing effort and helping improve the models. We would also like to thank Amarnag Subramanya, Ashwini Venkatesh, Benoit Jacob, Catherine Wah, Dana Movshovitz-Attias, Dang Hien, Dmitry Kalenichenko, Edgar Gonzàlez i Pellicer, Edward Li, Erik Vee, Evgeny Livshits, Gaurav Nemade, Jeffrey Soren, Jeongwoo Ko, Julia Proskurnia, Rushin Shah, Shirin Badiezadegan, Sidharth KV, Victor Cărbune and the Learn2Compress team for their support. We would like to thank Andrew Tomkins and Patrick Mcgregor for sponsoring this research project.

Source: Google AI Blog


Language-Agnostic BERT Sentence Embedding

A multilingual embedding model is a powerful tool that encodes text from different languages into a shared embedding space, enabling it to be applied to a range of downstream tasks, like text classification, clustering, and others, while also leveraging semantic information for language understanding. Existing approaches for generating such embeddings, like LASER or m~USE, rely on parallel data, mapping a sentence from one language directly to another language in order to encourage consistency between the sentence embeddings. While these existing multilingual approaches yield good overall performance across a number of languages, they often underperform on high-resource languages compared to dedicated bilingual models, which can leverage approaches like translation ranking tasks with translation pairs as training data to obtain more closely aligned representations. Further, due to limited model capacity and the often poor quality of training data for low-resource languages, it can be difficult to extend multilingual models to support a larger number of languages while maintaining good performance.

Illustration of a multilingual embedding space.

Recent efforts to improve language models include the development of masked language model (MLM) pre-training, such as that used by BERT, ALBERT and RoBERTa. This approach has led to exceptional gains across a wide range of languages and a variety of natural language processing tasks since it only requires monolingual text. In addition, MLM pre-training has been extended to the multilingual setting by modifying MLM training to include concatenated translation pairs, known as translation language modeling (TLM), or by simply introducing pre-training data from multiple languages. However, while the internal model representations learned during MLM and TLM training are helpful when fine-tuning on downstream tasks, without a sentence level objective, they do not directly produce sentence embeddings, which are critical for translation tasks.

In “Language-agnostic BERT Sentence Embedding”, we present a multilingual BERT embedding model, called LaBSE, that produces language-agnostic cross-lingual sentence embeddings for 109 languages. The model is trained on 17 billion monolingual sentences and 6 billion bilingual sentence pairs using MLM and TLM pre-training, resulting in a model that is effective even on low-resource languages for which there is no data available during training. Further, the model establishes a new state of the art on multiple parallel text (a.k.a. bitext) retrieval tasks. We have released the pre-trained model to the community through tfhub, which includes modules that can be used as-is or can be fine-tuned using domain-specific data.
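
As a quick way to experiment with the released model, the sketch below computes LaBSE sentence embeddings via the sentence-transformers port of the checkpoint (an assumption on our part; the canonical release is the TF Hub module mentioned above).

```python
# Sketch of computing LaBSE sentence embeddings, assuming the
# sentence-transformers library and its port of the released checkpoint
# ("sentence-transformers/LaBSE"); the canonical model is published on TF Hub.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
embeddings = model.encode(["Hello, world!", "Hola, mundo!", "Bonjour le monde !"])
# Embeddings of mutual translations land close together in the shared space.
```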

The collection of the training data for 109 supported languages

The Model
In previous work, we proposed the use of a translation ranking task to learn a multilingual sentence embedding space. This approach tasks the model with ranking the true translation over a collection of sentences in the target language, given a sentence in the source language. The translation ranking task is trained using a dual encoder architecture with a shared transformer encoder. The resulting bilingual models achieved state-of-the-art performance on multiple parallel text retrieval tasks (including United Nations and BUCC). However, performance suffered when these bilingual models were extended to support multiple languages (16 languages, in our test case) due to limitations in model capacity, vocabulary coverage, training data quality, and more.

Translation ranking task. Given a sentence in a given source language, the task is to find the true translation over a collection of sentences in the target language.
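
A simplified version of this ranking objective is an in-batch softmax over a dual encoder’s similarity scores: each source sentence should score its true translation higher than every other target sentence in the batch. The sketch below assumes TensorFlow and omits refinements used in practice, such as margin-based scoring.

```python
# Simplified in-batch translation-ranking loss for a dual encoder.
# `src_emb` and `tgt_emb` are [batch, dim] L2-normalized sentence embeddings
# from the shared transformer encoder; row i of each is a translation pair.
import tensorflow as tf

def translation_ranking_loss(src_emb, tgt_emb):
    scores = tf.matmul(src_emb, tgt_emb, transpose_b=True)  # [batch, batch]
    labels = tf.range(tf.shape(scores)[0])  # the true pair sits on the diagonal
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, scores, from_logits=True))
```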

For LaBSE, we leverage recent advances in language model pre-training, including MLM and TLM, on a BERT-like architecture, and follow this with fine-tuning on a translation ranking task. A 12-layer transformer with a 500k-token vocabulary, pre-trained using MLM and TLM on 109 languages, is used to increase model capacity and vocabulary coverage. The resulting LaBSE model offers extended support for 109 languages in a single model.

The dual encoder architecture, in which the source and target text are encoded separately using a shared transformer embedding network. The translation ranking task forces texts that paraphrase each other to have similar representations. The transformer embedding network is initialized from a BERT checkpoint trained on the MLM and TLM tasks.

Performance on Cross-lingual Text Retrieval
We evaluate the proposed model using the Tatoeba corpus, a dataset consisting of up to 1,000 English-aligned sentence pairs for 112 languages. For more than 30 of the languages in the dataset, the model has no training data. The model is tasked with finding the nearest neighbor translation for a given sentence, which it calculates using the cosine distance.
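
Given the sentence embeddings, nearest-neighbor translation retrieval reduces to a cosine-similarity argmax; a minimal numpy sketch (assuming the embeddings have already been computed by the model):

```python
import numpy as np

def retrieve_translations(query_emb, candidate_emb):
    """For each query embedding, return the index of the candidate with the
    highest cosine similarity. Both inputs are [n, dim] arrays."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    return np.argmax(q @ c.T, axis=1)

# Tatoeba-style accuracy: sentence i in one language should retrieve its
# aligned sentence i in the other.
# accuracy = np.mean(retrieve_translations(src, tgt) == np.arange(len(src)))
```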

To understand the performance of the model for languages at the head or tail of the training data distribution, we divide the set of languages into several groups and compute the average accuracy for each set. The first 14-language group is selected from the languages supported by m~USE, which cover the languages from the head of the distribution (head languages). We also evaluate a second language group composed of 36 languages from the XTREME benchmark. The third 82-language group, selected from the languages covered by the LASER training data, includes many languages from the tail of the distribution (tail languages). Finally, we compute the average accuracy for all languages.

The table below presents the average accuracy achieved by LaBSE, compared to the m~USE and LASER models, for each language group. As expected, all models perform strongly on the 14-language group that covers most head languages. With more languages included, the average accuracy of both LASER and LaBSE declines. However, the drop in accuracy is much less pronounced for LaBSE, which significantly outperforms LASER, particularly when the full distribution of 112 languages is included (83.7% vs. 65.5% accuracy).

Model     14 Langs   36 Langs   82 Langs   All Langs
m~USE*    93.9       —          —          —
LASER     95.3       84.4       75.9       65.5
LaBSE     95.3       95.0       87.3       83.7
Average Accuracy (%) on Tatoeba Datasets. The “14 Langs” group consists of languages supported by m~USE; the “36 Langs” group includes languages selected by XTREME; and the “82 Langs” group represents languages covered by the LASER model. The “All Langs” group includes all languages supported by Tatoeba.
* The m~USE model comes in two varieties, one built on a convolutional neural network architecture and the other a Transformer-like architecture. Here, we compare only to the Transformer version.

Support for Unsupported Languages
The average performance of all languages included in Tatoeba is very promising. Interestingly, LaBSE even performs relatively well for many of the 30+ Tatoeba languages for which it has no training data (see below). For one third of these languages the LaBSE accuracy is higher than 75% and only 8 have accuracy lower than 25%, indicating very strong transfer performance to languages without training data. Such positive language transfer is only possible due to the massively multilingual nature of LaBSE.

LaBSE accuracy for the subset of Tatoeba languages (represented with ISO 639-1/639-2 codes) for which there was no training data.

Mining Parallel Text from the Web
LaBSE can be used for mining parallel text (bitext) from web-scale data. For example, we applied LaBSE to CommonCrawl, a large-scale monolingual corpus, to process 560 million Chinese and 330 million German sentences for the extraction of parallel text. Each Chinese and German sentence is encoded using the LaBSE model, and the encoded embedding is used to find a potential translation from a pool of 7.7 billion English sentences pre-processed and encoded by the model. An approximate nearest neighbor search is employed to quickly search through the high-dimensional sentence embeddings. After simple filtering, the model returns 261M and 104M potential parallel pairs for English-Chinese and English-German, respectively. An NMT model trained on the mined data reaches BLEU scores of 35.7 and 27.2 on the WMT translation tasks (wmt17 for English-to-Chinese and wmt14 for English-to-German). The performance is only a few points away from current state-of-the-art models trained on high-quality parallel data.

Conclusion
We're excited to share this research, and the model, with the community. The pre-trained model is released at tfhub to support further research in this direction and possible downstream applications. We also believe that what we're showing here is just the beginning, and there are more important research problems to be addressed, such as building better models to support all languages.

Acknowledgements
The core team includes Wei Wang, Naveen Arivazhagan, and Daniel Cer. We would like to thank the Google Research Language team, along with our partners in other Google groups, for their feedback and suggestions. Special thanks goes to Sidharth Mudgal and Jax Law for help with data processing, as well as Jialu Liu, Tianqi Liu, Chen Chen, and Anosh Raj for help with BERT pre-training.

Source: Google AI Blog


REALM: Integrating Retrieval into Language Representation Models

Recent advances in natural language processing have largely built upon the power of unsupervised pre-training, which trains general purpose language representation models using a large amount of text, without human annotations or labels. These pre-trained models, such as BERT and RoBERTa, have been shown to memorize a surprising amount of world knowledge, such as “the birthplace of Francesco Bartolomeo Conti”, “the developer of JDK” and “the owner of Border TV”. While the ability to encode knowledge is especially important for certain natural language processing tasks such as question answering, information retrieval and text generation, these models memorize knowledge implicitly — i.e., world knowledge is captured in an abstract way in the model weights — making it difficult to determine what knowledge has been stored and where it is kept in the model. Furthermore, the storage space, and hence the accuracy of the model, is limited by the size of the network. To capture more world knowledge, the standard practice is to train ever-larger networks, which can be prohibitively slow or expensive.

Instead, what if there was a method for pre-training that could access knowledge explicitly, e.g., by referencing an additional large external text corpus, in order to achieve accurate results without increasing the model size or complexity?  For example, a sentence found in an external document collection, "Francesco Bartolomeo Conti was born in Florence," could be referenced by the model to determine the birthplace of the musician, rather than relying on the model's opaque ability to access the knowledge stored in its own parameters. The ability to retrieve text containing explicit knowledge such as this would improve the efficiency of pre-training while enabling the model to perform well on knowledge-intensive tasks without using billions of parameters.

In “REALM: Retrieval-Augmented Language Model Pre-Training”, accepted at the 2020 International Conference on Machine Learning, we share a novel paradigm for language model pre-training, which augments a language representation model with a knowledge retriever, allowing REALM models to retrieve textual world knowledge explicitly from raw text documents, instead of memorizing all the knowledge in the model parameters. We have also open sourced the REALM codebase to demonstrate how one can train the retriever and the language representation jointly.

Background: Pre-training Language Representation Models
To understand how standard language representation models memorize world knowledge, one should first review how these models are pre-trained. Since the invention of BERT, the fill-in-the-blank task, called masked language modeling, has been widely used for pre-training language representation models. Given any text with certain words masked out, the task is to fill back the missing words. An example of this task looks like:

I am so thirsty. I need to __ water.

During pre-training, a model will go over a large number of examples and adjust the parameters in order to predict the missing words (answer: drink, in the above example). Interestingly, the fill-in-the-blank task makes the model memorize certain facts about the world. For example, the knowledge of Einstein's birthplace is required to fill the missing word in the following example:

Einstein was a __-born scientist. (answer: German)

However, because the world knowledge captured by the model is stored in the model weights, it is abstract, making it difficult to understand what information is stored.

Our Proposal: Retrieval-Augmented Language Representation Model Pre-training
In contrast to standard language representation models, REALM augments the language representation model with a knowledge retriever that first retrieves another piece of text from an external document collection as the supporting knowledge — in our experiments, we use the Wikipedia text corpus — and then feeds this supporting text as well as the original text into a language representation model.

The key intuition of REALM is that a retrieval system should improve the model's ability to fill in missing words. Therefore, a retrieval that provides more context for filling the missing words should be rewarded. If the retrieved information does not help the model make its predictions, it should be discouraged, making room for better retrievals.
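
Concretely, the REALM paper formalizes this intuition by treating the retrieved document as a latent variable and marginalizing over it, so the retriever is rewarded exactly when its documents raise the probability of the correct fill-in. Here x is the masked input, y the missing words, Z the set of candidate documents, and f and g the query and document embedding functions:

```latex
% REALM objective: marginalize the MLM prediction over a latent retrieved document z.
p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid x, z)\, p(z \mid x),
\qquad
p(z \mid x) = \frac{\exp\big(f(x)^{\top} g(z)\big)}{\sum_{z'} \exp\big(f(x)^{\top} g(z')\big)}
```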

How does one train a knowledge retriever, given that only unlabeled text is available during pre-training? It turns out that one can use the task of filling words to train the knowledge retriever indirectly, without any human annotations. Assume the input of the query is:

We paid twenty __ at the Buckingham Palace gift shop.

Filling the missing word (answer: pounds) in this sentence without retrieval can be tricky, as the model would need to have implicitly stored knowledge of the country in which Buckingham Palace is located and the associated currency, as well as make the connection between the two. It would be easier for the model to fill in the missing word if it were presented with a passage that explicitly connects some of the necessary knowledge, retrieved from an external corpus.

In this example, the retriever would be rewarded for retrieving the following sentence.

Buckingham Palace is the London residence of the British monarchy.

Since the retrieval step needs to add more context, there may be multiple retrieval targets that could be helpful in filling the missing word, for example, “The official currency of the United Kingdom is the Pound.” The whole process is demonstrated in the next figure:

Computational Challenges for REALM
Scaling REALM pre-training such that models can retrieve knowledge from millions of documents is challenging. In REALM, the selection of the best document is formulated as maximum inner product search (MIPS). To perform retrieval, MIPS models need to first encode all of the documents in the collection, such that each document has a corresponding document vector. When an input arrives, it is encoded as a query vector. In MIPS, given a query, the document in the collection that has the maximum inner product value between its document vector and the query vector is retrieved, as shown in the following figure:
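
Setting aside the approximate-search machinery for a moment, exact MIPS over pre-computed document vectors is simply an inner-product argmax; the numpy sketch below is the brute-force version of what packages like ScaNN approximate efficiently at scale.

```python
import numpy as np

def mips_retrieve(query_vec, doc_vecs, k=1):
    """Exact maximum inner product search: return the indices of the k
    documents whose vectors have the largest inner product with the query.
    `doc_vecs` is [num_docs, dim]; `query_vec` is [dim]."""
    scores = doc_vecs @ query_vec          # [num_docs]
    return np.argsort(-scores)[:k]
```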

In REALM, we use the ScaNN package to conduct MIPS efficiently, which makes finding the maximum inner product value relatively cheap, given that the document vectors are pre-computed. However, when the model parameters are updated during training, it is typically necessary to re-encode the document vectors for the entire collection of documents. To address the computational challenges, we structure the retriever so that the computation performed for each document can be cached and asynchronously updated. We also found that updating document vectors every 500 training steps, instead of every step, achieves good performance and makes training tractable.

Applying REALM to Open-domain Question Answering
We evaluate the effectiveness of REALM by applying it to open-domain question answering (Open-QA), one of the most knowledge-intensive tasks in natural language processing. The goal of the task is to answer questions, such as “What is the angle of the equilateral triangle?”

In standard question answering tasks (e.g., SQuAD or Natural Questions), the supporting document is provided as part of input, so a model only needs to look up the answer in the given document. In Open-QA, there are no given documents, so that Open-QA models need to look up the knowledge by themselves — this makes Open-QA an excellent task to examine the effectiveness of REALM.

The following figure shows the results on the Open-QA version of Natural Questions. We mainly compared our results with T5, another approach that trains models without annotated supporting documents. From the figure, one can clearly see that REALM pre-training generates very powerful Open-QA models, and even outperforms the much larger T5 (11B) model by almost 4 points, using only a fraction of the parameters (300M).

Conclusion
The release of REALM has helped drive interest in developing end-to-end retrieval-augmented models, including a recent retrieval-augmented generative model. We look forward to the possibility of extending this line of work in several ways, including 1) applying REALM-like methods to new applications that require knowledge-intensive reasoning and interpretable provenance (beyond Open-QA), and 2) exploring the benefits of retrieving other forms of knowledge, such as images, knowledge graph structures, or even text in other languages. We are also excited to see what the research community does with the open source REALM codebase!

Acknowledgements
This work has been a collaborative effort involving Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.

Source: Google AI Blog


SmartReply for YouTube Creators



It has been more than 4 years since SmartReply was launched, and since then, it has expanded to more users with the Gmail launch and Android Messages and to more devices with Android Wear. Developers now use SmartReply to respond to reviews within the Play Developer Console and can set up their own versions using APIs offered within MLKit and TFLite. With each launch there has been a unique challenge in modeling and serving that required customizing SmartReply for the task requirements.

We are now excited to share an updated SmartReply built for YouTube and implemented in YouTube Studio that helps creators engage more easily with their viewers. This model learns comment and reply representation through a computationally efficient dilated self-attention network, and represents the first cross-lingual and character byte-based SmartReply model. SmartReply for YouTube is currently available for English and Spanish creators, and this approach simplifies the process of extending the SmartReply feature to many more languages in the future.

YouTube creators receive a large volume of responses to their videos. Moreover, the community of creators and viewers on YouTube is diverse, as reflected by the creativity of their comments, discussions, and videos. In comparison to emails, which tend to be long and dominated by formal language, YouTube comments reveal complex patterns of language switching, abbreviated words, slang, inconsistent usage of punctuation, and heavy utilization of emoji.
Deep Retrieval
The initial release of SmartReply for Inbox encoded input emails word-by-word with a recurrent neural network, and then decoded potential replies with yet another word-level recurrent neural network. Despite the expressivity of this approach, it was computationally expensive. Instead, we found that one can achieve the same ends by designing a system that searches through a predefined list of suggestions for the most appropriate response.

This retrieval system encoded the message and its suggestion independently. First, the text was preprocessed to extract words and short phrases. This preprocessing included, but was not limited to, language identification, tokenization, and normalization. Two neural networks then simultaneously and independently encoded the message and the suggestion. This factorization allowed one to pre-compute the suggestion encodings and then search through the set of suggestions using an efficient maximum inner product search data structure. This deep retrieval approach enabled us to expand SmartReply to Gmail and since then, it has been the foundation for several SmartReply systems including the current YouTube system.

Beyond Words
The previous SmartReply systems described above relied on word level preprocessing that is well tuned for a limited number of languages and narrow genres of writing. Such systems face significant challenges in the YouTube case, where a typical comment might include heterogeneous content, like emoji, ASCII art, language switching, etc. In light of this, and taking inspiration from our recent work on byte and character language modeling, we decided to encode the text without any preprocessing. This approach is supported by research demonstrating that a deep Transformer network is able to model words and phrases from the ground up just by feeding it text as a sequence of characters or bytes, with comparable quality to word-based models.

Although initial results were promising, especially for processing comments with emoji or typos, the inference speed was too slow for production due to the fact that character sequences are longer than word equivalents and the computational complexity of self-attention layers grows quadratically as a function of sequence length. We found that shrinking the sequence length by applying temporal reduction layers at each layer of the network, similar to the dilation technique applied in WaveNet, provides a good trade-off between computation and quality.

The figure below presents a dual encoder network that encodes both the comment and the reply to maximize the mutual information between their latent representations by training the network with a contrastive objective. The encoding starts with feeding the transformer a sequence of bytes after they have been embedded. The input for each subsequent layer will be reduced by dropping a percentage of characters at equal offsets. After applying several transformer layers the sequence length is greatly truncated, significantly reducing the computational complexity. This sequence compression scheme could be substituted by other operators such as average pooling, though we did not notice any gains from more sophisticated methods, and therefore, opted to use dilation for simplicity.
A dual encoder network that maximizes the mutual information between the comments and their replies through a contrastive objective. Each encoder is fed a sequence of bytes and is implemented as a computationally efficient dilated transformer network.
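
The temporal reduction step can be sketched as strided subsampling of the hidden-state sequence between encoder layers; the numpy snippet below is an illustrative simplification of the scheme described above, not the production implementation.

```python
import numpy as np

def temporal_reduction(hidden_states, keep_every=2):
    """Keep every `keep_every`-th position of a [batch, seq_len, dim] array,
    dropping positions at equal offsets so subsequent self-attention layers
    operate on a much shorter sequence."""
    return hidden_states[:, ::keep_every, :]

# A byte sequence of length 512 shrinks to 256, 128, 64, ... when the
# reduction is applied after successive encoder layers.
states = np.zeros((1, 512, 64))
for _ in range(3):
    states = temporal_reduction(states)
print(states.shape)  # (1, 64, 64)
```
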
A Model to Learn Them All
Instead of training a separate model for each language, we opted to train a single cross-lingual model for all supported languages. This allows the support of mixed-language usage in the comments, and enables the model to utilize the learning of common elements in one language for understanding another, such as emoji and numbers. Moreover, having a single model simplifies the logistics of maintenance and updates. While the model has been rolled out to English and Spanish, the flexibility inherent in this approach will enable it to be expanded to other languages in the future.

Inspecting the encodings of a multilingual set of suggestions produced by the model reveals that the model clusters appropriate replies, regardless of the language to which they belong. This cross-lingual capability emerged without exposing the model during training to any parallel corpus. We demonstrate in the figure below for three languages how the replies are clustered by their meaning when the model is probed with an input comment. For example, the English comment “This is a great video,” is surrounded by appropriate replies, such as “Thanks!” Moreover, inspection of the nearest replies in other languages reveal them also to be appropriate and similar in meaning to the English reply. The 2D projection also shows several other cross-lingual clusters that consist of replies of similar meaning. This clustering demonstrates how the model can support a rich cross-lingual user experience in the supported languages.
A 2D projection of the model encodings when presented with a hypothetical comment and a small list of potential replies. The neighborhood surrounding English comments (black color) consists of appropriate replies in English and their counterparts in Spanish and Arabic. Note that the network learned to align English replies with their translations without access to any parallel corpus.
When to Suggest?
Our goal is to help creators, so we have to make sure that SmartReply only makes suggestions when it is very likely to be useful. Ideally, suggestions would only be displayed when it is likely that the creator would reply to the comment and when the model has a high chance of providing a sensible and specific response. To accomplish this, we trained auxiliary models to identify which comments should trigger the SmartReply feature.

Conclusion
We’ve launched YouTube SmartReply, starting with English and Spanish comments, the first cross-lingual and character byte-based SmartReply. YouTube is a global product with a diverse user base that generates heterogeneous content. Consequently, it is important that we continuously improve comments for this global audience, and SmartReply represents a strong step in this direction.

Acknowledgements
SmartReply for YouTube creators was developed by Golnaz Farhadi, Ezequiel Baril, Cheng Lee, Claire Yuan, Coty Morrison‎, Joe Simunic‎, Rachel Bransom‎, Rajvi Mehta, Jorge Gonzalez‎, Mark Williams, Uma Roy and many more. We are grateful for the leadership support from Nikhil Dandekar, Eileen Long, Siobhan Quinn, Yun-hsuan Sung, Rachel Bernstein, and Ray Kurzweil.

Source: Google AI Blog


PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization



Students are often tasked with reading a document and producing a summary (for example, a book report) to demonstrate both reading comprehension and writing ability. This abstractive text summarization is one of the most challenging tasks in natural language processing, involving understanding of long passages, information compression, and language generation. The dominant paradigm for training machine learning models to do this is sequence-to-sequence (seq2seq) learning, where a neural network learns to map input sequences to output sequences. While these seq2seq models were initially developed using recurrent neural networks, Transformer encoder-decoder models have recently become favored as they are more effective at modeling the dependencies present in the long sequences encountered in summarization.

Transformer models combined with self-supervised pre-training (e.g., BERT, GPT-2, RoBERTa, XLNet, ALBERT, T5, ELECTRA) have been shown to be a powerful framework for producing general language learning, achieving state-of-the-art performance when fine-tuned on a wide array of language tasks. In prior work, the self-supervised objectives used in pre-training have been somewhat agnostic to the down-stream application in favor of generality; we wondered whether better performance could be achieved if the self-supervised objective more closely mirrored the final task.

In “PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization” (to appear at the 2020 International Conference on Machine Learning), we designed a pre-training self-supervised objective (called gap-sentence generation) for Transformer encoder-decoder models to improve fine-tuning performance on abstractive summarization, achieving state-of-the-art results on 12 diverse summarization datasets. Supplementary to the paper, we are also releasing the training code and model checkpoints on GitHub.

A Self-Supervised Objective for Summarization
Our hypothesis is that the closer the pre-training self-supervised objective is to the final down-stream task, the better the fine-tuning performance. In PEGASUS pre-training, several whole sentences are removed from documents and the model is tasked with recovering them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. This is an incredibly difficult task that may seem impossible, even for people, and we don’t expect the model to solve it perfectly. However, such a challenging task encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task. The advantage of this self-supervision is that you can create as many examples as there are documents, without any human annotation, which is often the bottleneck in purely supervised systems.
A self-supervised example for PEGASUS during pre-training. The model is trained to output all the masked sentences.
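As a rough illustration of how such a pre-training example could be constructed, the sketch below masks chosen sentences in the input and concatenates them into the target. The regex sentence splitter and the mask token string are simplifications rather than the exact preprocessing used for PEGASUS.

import re

def make_gsg_example(document: str, masked_idx: set,
                     mask_token: str = "<mask_sent>"):
    """Build a (source, target) pair for gap-sentence generation.

    Selected sentences are replaced by a mask token in the input and
    concatenated to form the target. The splitter and mask token are
    simplifications for illustration.
    """
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    source = " ".join(mask_token if i in masked_idx else s
                      for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked_idx))
    return source, target

doc = ("The storm closed the harbour. Ferries were cancelled for two days. "
       "Services resumed once the wind dropped.")
src, tgt = make_gsg_example(doc, masked_idx={1})
print(src)  # The storm closed the harbour. <mask_sent> Services resumed once the wind dropped.
print(tgt)  # Ferries were cancelled for two days.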
We found that choosing “important” sentences to mask worked best, making the output of self-supervised examples even more similar to a summary. We automatically identified these sentences by finding those that were most similar to the rest of the document according to a metric called ROUGE. ROUGE computes the similarity of two texts by computing n-gram overlaps using a score from 0 to 100 (ROUGE-1, ROUGE-2, and ROUGE-L are three common variants).
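The selection step could be approximated as below: score each sentence with a simple ROUGE-1 F1 against the rest of the document and keep the top-scoring ones, whose indices can then be passed to make_gsg_example from the previous sketch. The paper explores several selection variants, so this is an illustration rather than the exact recipe.

import re
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between two texts, on a 0-to-1 scale."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def important_sentence_indices(document: str, k: int = 1) -> set:
    """Pick the k sentences most similar (by ROUGE-1) to the rest of the document."""
    sents = re.split(r"(?<=[.!?])\s+", document.strip())
    scored = sorted(
        ((rouge1_f1(s, " ".join(sents[:i] + sents[i + 1:])), i)
         for i, s in enumerate(sents)),
        reverse=True)
    return {i for _, i in scored[:k]}

doc = ("The storm closed the harbour. Ferries were cancelled for two days. "
       "Road access to the harbour was also cut off by the storm.")
print(important_sentence_indices(doc, k=1))  # index of the most "summary-like" sentence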

Similar to other recent methods, such as T5, we pre-trained our model on a very large corpus of web-crawled documents and then fine-tuned it on 12 public down-stream abstractive summarization datasets, achieving new state-of-the-art results as measured by automatic metrics while using only 5% as many parameters as T5. The datasets were chosen to be diverse, including news articles, scientific papers, patents, short stories, e-mails, legal documents, and how-to directions, showing that the model framework adapts to a wide variety of topics.

Fine-Tuning with Small Numbers of Examples
While PEGASUS showed remarkable performance with large datasets, we were surprised to learn that the model didn’t require a large number of examples for fine-tuning to get near state-of-the-art performance:
ROUGE scores (three variants, higher is better) vs. the number of supervised examples across four selected summarization datasets. The dotted-line shows the Transformer encoder-decoder performance with full-supervision, but without pre-training.
With only 1000 fine-tuning examples, we were able to perform better in most tasks than a strong baseline (Transformer encoder-decoder) that used the full supervised data, which in some cases had many orders of magnitude more examples. This “sample efficiency” greatly increases the usefulness of text summarization models as it significantly lowers the scale and cost of supervised data collection, which in the case of summarization is very expensive.

Human-Quality Summaries
While we find automatic metrics such as ROUGE to be useful proxies for measuring progress during model development, they provide only limited information and do not capture the whole picture, such as fluency or how the summaries compare to human performance. To this end, we conducted a human evaluation, where raters were asked to compare summaries from our model with human-written ones (without knowing which was which). This has some similarities to the Turing test.
Human raters were asked to rate model and human-written summaries without knowing which was which. The document is truncated here for illustration, but raters see the full text.
We performed the experiment with 3 different datasets and found that human raters do not consistently prefer the human summaries to those from our model. Furthermore, our models trained with only 1000 examples performed nearly as well. In particular, with the much studied XSum and CNN/Dailymail datasets, the model achieves human-like performance using only 1000 examples. This suggests large datasets of supervised examples are no longer necessary for summarization, opening up many low-cost use-cases.

A Test of Comprehension: Counting Ships
Following this post is an example article from the XSum dataset and the model-generated abstractive summary. As we can see, the model correctly abstracts and paraphrases four named frigates (HMS Cumberland, HMS Campbeltown, HMS Chatham and HMS Cornwall) as “four Royal Navy frigates”, something an extractive approach could not do since “four” is not mentioned anywhere. Was this a fluke or did the model actually count? One way to find out is to add and remove ships to see if the count changes.

As can be seen below, the model successfully “counts” ships from 2 to 5. However, when we add a sixth ship, the “HMS Alphabet”, it miscounts it as “seven”. So it appears the model has learned to count small numbers of items in a list, but does not yet generalize as elegantly as we would hope. Still, we think this rudimentary counting ability is impressive as it was not explicitly programmed into the model, and it demonstrates a limited amount of “symbolic reasoning” by the model.

PEGASUS code and model release
To support ongoing research in this field and ensure reproducibility, we are releasing the PEGASUS code and model checkpoints on GitHub. This includes fine-tuning code, which can be used to adapt PEGASUS to other summarization datasets.

Acknowledgements
This work has been a collaborative effort involving Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. We thank the T5 and Google News teams for providing datasets for pre-training PEGASUS.

Source: Google AI Blog


Evaluating Natural Language Generation with BLEURT



In the last few years, research in natural language generation (NLG) has made tremendous progress, with models now able to translate text, summarize articles, engage in conversation, and comment on pictures with unprecedented accuracy, using increasingly sophisticated approaches. Currently, there are two methods to evaluate these NLG systems: human evaluation and automatic metrics. With human evaluation, one runs a large-scale quality survey for each new version of a model using human annotators, but that approach can be prohibitively labor intensive. In contrast, one can use popular automatic metrics (e.g., BLEU), but these are oftentimes unreliable substitutes for human interpretation and judgement. The rapid progress of NLG and the drawbacks of existing evaluation methods call for the development of novel ways to assess the quality and success of NLG systems.

In “BLEURT: Learning Robust Metrics for Text Generation” (presented during ACL 2020), we introduce a novel automatic metric that delivers ratings that are robust and reach an unprecedented level of quality, much closer to human annotation. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) builds upon recent advances in transfer learning to capture widespread linguistic phenomena, such as paraphrasing. The metric is available on GitHub.

Evaluating NLG Systems
In human evaluation, a piece of generated text is presented to annotators, who are tasked with assessing its quality with respect to its fluency and meaning. The text is typically shown side-by-side with a reference, authored by a human or mined from the Web.
An example questionnaire used for human evaluation in machine translation.
The advantage of this method is that it is accurate: people are still unrivaled when it comes to evaluating the quality of a piece of text. However, this method of evaluation can easily take days and involve dozens of people for just a few thousand examples, which disrupts the model development workflow.

In contrast, the idea behind automatic metrics is to provide a cheap, low-latency proxy for human-quality measurements. Automatic metrics often take two sentences as input, a candidate and a reference, and they return a score that indicates to what extent the former resembles the latter, typically using lexical overlap. A popular metric is BLEU, which counts the sequences of words in the candidate that also appear in the reference (the BLEU score is very similar to precision).
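The sketch below computes clipped n-gram precision, the overlap count at the heart of BLEU; full BLEU additionally combines several n-gram orders and applies a brevity penalty, which are omitted here.

from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision: the overlap-counting idea behind BLEU."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0

ref = "the cat sat on the mat"
print(ngram_precision("the cat sat on the mat", ref, n=2))      # 1.0
print(ngram_precision("a feline rested on the rug", ref, n=2))  # 0.2, despite similar meaning

The second example already hints at the limitation discussed next: a valid paraphrase scores poorly because it shares few surface n-grams with the reference.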

The advantages and weaknesses of automatic metrics are the opposite of those of human evaluation. Automatic metrics are convenient: they can be computed in real time throughout the training process (e.g., for plotting with TensorBoard). However, they are often inaccurate due to their focus on surface-level similarities, and they fail to capture the diversity of human language. Frequently, there are many perfectly valid sentences that can convey the same meaning. Overlap-based metrics that rely exclusively on lexical matches unfairly reward candidates that resemble the reference in surface form, even if they do not accurately capture meaning, and penalize other valid paraphrases.
BLEU scores for three candidate sentences. Candidate 2 is semantically close to the reference, and yet its score is lower than Candidate 3.
Ideally, an evaluation method for NLG should combine the advantages of both human evaluation and automatic metrics — it should be relatively cheap to compute, but flexible enough to cope with linguistic diversity.

Introducing BLEURT
BLEURT is a novel, machine learning-based automatic metric that can capture non-trivial semantic similarities between sentences. It is trained on a public collection of ratings (the WMT Metrics Shared Task dataset) as well as additional ratings provided by the user.
Three candidate sentences rated by BLEURT. BLEURT captures that candidate 2 is similar to the reference, even though it contains more non-reference words than candidate 3.
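For reference, scoring candidates against references with the released package looks roughly like the following. This assumes the bleurt package from the GitHub repository is installed and a released checkpoint has been downloaded locally; the checkpoint name shown is illustrative.

# A usage sketch, under the assumptions stated above.
from bleurt import score

scorer = score.BleurtScorer("bleurt-base-128")  # path to a downloaded checkpoint
references = ["The frigates returned to port on Tuesday."]
candidates = ["The warships came back to harbour on Tuesday."]
print(scorer.score(references=references, candidates=candidates))  # one rating per pair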
Creating a metric based on machine learning poses a fundamental challenge: the metric should do well consistently on a wide range of tasks and domains, and over time. However, there is only a limited amount of training data. Indeed, public data is sparse — the WMT Metrics Task dataset, the largest collection of human ratings at the time of writing, contains ~260K human ratings covering the news domain only. This is too limited to train a metric suited for the evaluation of NLG systems of the future.

To address this problem, we employ transfer learning. First, we use the contextual word representations of BERT, a state-of-the-art unsupervised representation learning method for language understanding that has already been successfully incorporated into NLG metrics (e.g., YiSi or BERTscore).

Second, we introduce a novel pre-training scheme to increase BLEURT's robustness. Our experiments reveal that training a regression model directly over publicly available human ratings is a brittle approach, since we cannot control in what domain and across what time span the metric will be used. Accuracy is likely to drop in the presence of domain drift, i.e., when the text comes from a different domain than the training sentence pairs. It may also drop when there is quality drift, i.e., when the ratings to be predicted are higher than those used during training, a situation that would normally be good news because it indicates that ML research is making progress.

The success of BLEURT relies on “warming-up” the model using millions of synthetic sentence pairs before fine-tuning on human ratings. We generated training data by applying random perturbations to sentences from Wikipedia. Instead of collecting human ratings, we use a collection of metrics and models from the literature (including BLEU), which allows the number of training examples to be scaled up at very low cost.
BLEURT's data generation process combines random perturbations and scoring with pre-existing metrics and models.
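A toy version of this data generation loop might look like the sketch below, where two simple perturbations and difflib's similarity ratio stand in for the richer perturbations and the collection of pre-existing metrics and models (including BLEU) used to label the pairs.

import difflib
import random

def perturb(sentence: str, rng: random.Random) -> str:
    """Randomly drop a word or swap two adjacent words (illustrative stand-ins
    for the richer perturbations used in the actual pipeline)."""
    toks = sentence.split()
    if len(toks) < 3:
        return sentence
    if rng.random() < 0.5:
        del toks[rng.randrange(len(toks))]
    else:
        i = rng.randrange(len(toks) - 1)
        toks[i], toks[i + 1] = toks[i + 1], toks[i]
    return " ".join(toks)

def synthetic_pairs(sentences, n_per_sentence=3, seed=0):
    """Yield (reference, candidate, pseudo-score) triples for warm-up training.

    difflib's similarity ratio stands in for the collection of pre-existing
    metrics and models used to label the pairs at low cost.
    """
    rng = random.Random(seed)
    for ref in sentences:
        for _ in range(n_per_sentence):
            cand = perturb(ref, rng)
            yield ref, cand, difflib.SequenceMatcher(None, ref, cand).ratio()

for triple in synthetic_pairs(["The metric is trained on millions of synthetic sentence pairs."]):
    print(triple)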
Experiments reveal that pre-training significantly increases BLEURT's accuracy, especially when the test data is out-of-distribution.

We pre-train BLEURT twice, first with a language modelling objective (as explained in the original BERT paper), then with a collection of NLG evaluation objectives. We then fine-tune the model on the WMT Metrics dataset, on a set of ratings provided by the user, or a combination of both. The following figure illustrates BLEURT's training procedure end-to-end.
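As a minimal illustration of the final fine-tuning stage, the sketch below fits a regression head that maps sentence-pair encodings to ratings. The encodings and ratings are random placeholders, and a closed-form least-squares fit stands in for the MSE-trained head; in the real model the BERT encoder is fine-tuned jointly with the head.

import numpy as np

# Placeholders for the sentence-pair encodings a BERT-style encoder would
# produce after the two pre-training stages, and for the human ratings they
# are fitted to; all values here are random.
rng = np.random.default_rng(0)
n_pairs, hidden = 200, 64
pair_encodings = rng.normal(size=(n_pairs, hidden))
human_ratings = rng.uniform(-1.0, 1.0, size=n_pairs)

# Closed-form least squares plays the role of the regression head.
weights, *_ = np.linalg.lstsq(pair_encodings, human_ratings, rcond=None)
predicted = pair_encodings @ weights
print(float(np.mean((predicted - human_ratings) ** 2)))  # training error of the head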

Results
We benchmark BLEURT against competing approaches and show that it offers superior performance, correlating well with human ratings on the WMT Metrics Shared Task (machine translation) and the WebNLG Challenge (data-to-text). For example, BLEURT is ~48% more accurate than BLEU on the WMT Metrics Shared Task of 2019. We also demonstrate that pre-training helps BLEURT cope with quality drift.
Correlation between different metrics and human ratings on the WMT'19 Metrics Shared Task.
Conclusion
As NLG models have gotten better over time, evaluation metrics have become an important bottleneck for research in this field. There are good reasons why overlap-based metrics are so popular: they are simple, consistent, and they do not require any training data. In use cases where multiple reference sentences are available for each candidate, they can be very accurate. While they play a critical part in our infrastructure, they are also very conservative and give only an incomplete picture of NLG systems' performance. Our view is that ML engineers should enrich their evaluation toolkits with more flexible, semantic-level metrics.

BLEURT is our attempt to capture NLG quality beyond surface overlap. Thanks to BERT's representations and a novel pre-training scheme, our metric yields SOTA performance on two academic benchmarks, and we are currently investigating how it can improve Google products. Future research includes investigating multilinguality and multimodality.

Acknowledgements
This project was co-advised by Dipanjan Das. We thank Slav Petrov, Eunsol Choi, Nicholas FitzGerald, Jacob Devlin, Madhavan Kidambi, Ming-Wei Chang, and all the members of the Google Research Language team.

Source: Google AI Blog