Tag Archives: NLP

Measuring Gendered Correlations in Pre-trained NLP Models

Natural language processing (NLP) has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, e.g., Wikipedia, by repeatedly masking out words and trying to predict them (this is called masked language modeling). The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. There is then a second training stage, fine-tuning, in which the model uses task-specific training data to learn how to use the general pre-trained representations to do a concrete task, like classification. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream, to ensure the application of these models aligns with our AI Principles.

In “Measuring and Reducing Gendered Correlations in Pre-trained Models” we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models. We present experimental results over public model checkpoints and an academic task dataset to illustrate how the best practices apply, providing a foundation for exploring settings beyond the scope of this case study. We will soon release a series of checkpoints, Zari1, which reduce gendered correlations while maintaining state-of-the-art accuracy on standard NLP task metrics.

Measuring Correlations
To understand how correlations in pre-trained representations can affect downstream task performance, we apply a diverse set of evaluation metrics for studying the representation of gender. Here, we’ll discuss results from one of these tests, based on coreference resolution, which is the capability that allows models to understand the correct antecedent to a given pronoun in a sentence. For example, in the sentence that follows, the model should recognize his refers to the nurse, and not to the patient.

The standard academic formulation of the task is the OntoNotes test (Hovy et al., 2006), and we measure how accurate a model is at coreference resolution in a general setting using an F1 score over this data (as in Tenney et al. 2019). Since OntoNotes represents only one data distribution, we also consider the WinoGender benchmark that provides additional, balanced data designed to identify when model associations between gender and profession incorrectly influence coreference resolution. High values of the WinoGender metric (close to one) indicate a model is basing decisions on normative associations between gender and profession (e.g., associating nurse with the female gender and not male). When model decisions have no consistent association between gender and profession, the score is zero, which suggests that decisions are based on some other information, such as sentence structure or semantics.

BERT and ALBERT metrics on OntoNotes (accuracy) and WinoGender (gendered correlations). Low values on the WinoGender metric indicate that a model does not preferentially use gendered correlations in reasoning.

In this study, we see that neither the (Large) BERT or ALBERT public model achieves zero score on the WinoGender examples, despite achieving impressive accuracy on OntoNotes (close to 100%). At least some of this is due to models preferentially using gendered correlations in reasoning. This isn’t completely surprising: there are a range of cues available to understand text and it is possible for a general model to pick up on any or all of these. However, there is reason for caution, as it is undesirable for a model to make predictions primarily based on gendered correlations learned as priors rather than the evidence available in the input.

Best Practices
Given that it is possible for unintended correlations in pre-trained model representations to affect downstream task reasoning, we now ask: what can one do to mitigate any risk this poses when developing new NLP models?

  • It is important to measure for unintended correlations: Model quality may be assessed using accuracy metrics, but these only measure one dimension of performance, especially if the test data is drawn from the same distribution as the training data. For example, the BERT and ALBERT checkpoints have accuracy within 1% of each other, but differ by 26% (relative) in the degree to which they use gendered correlations for coreference resolution. This difference might be important for some tasks; selecting a model with low WinoGender score could be desirable in an application featuring texts about people in professions that may not conform to historical social norms, e.g., male nurses.
  • Be careful even when making seemingly innocuous configuration changes: Neural network model training is controlled by many hyperparameters that are usually selected to maximize some training objective. While configuration choices often seem innocuous, we find they can cause significant changes for gendered correlations, both for better and for worse. For example, dropout regularization is used to reduce overfitting by large models. When we increase the dropout rate used for pre-training BERT and ALBERT, we see a significant reduction in gendered correlations even after fine-tuning. This is promising since a simple configuration change allows us to train models with reduced risk of harm, but it also shows that we should be mindful and evaluate carefully when making any change in model configuration.
    Impact of increasing dropout regularization in BERT and ALBERT.
  • There are opportunities for general mitigations: A further corollary from the perhaps unexpected impact of dropout on gendered correlations is that it opens the possibility to use general-purpose methods for reducing unintended correlations: by increasing dropout in our study, we improve how the models reason about WinoGender examples without manually specifying anything about the task or changing the fine-tuning stage at all. Unfortunately, OntoNotes accuracy does start to decline as the dropout rate increases (which we can see in the BERT results), but we are excited about the potential to mitigate this in pre-training, where changes can lead to model improvements without the need for task-specific updates. We explore counterfactual data augmentation as another mitigation strategy with different tradeoffs in our paper.

What’s Next
We believe these best practices provide a starting point for developing robust NLP systems that perform well across the broadest possible range of linguistic settings and applications. Of course these techniques on their own are not sufficient to capture and remove all potential issues. Any model deployed in a real-world setting should undergo rigorous testing that considers the many ways it will be used, and implement safeguards to ensure alignment with ethical norms, such as Google's AI Principles. We look forward to developments in evaluation frameworks and data that are more expansive and inclusive to cover the many uses of language models and the breadth of people they aim to serve.

Acknowledgements
This is joint work with Xuezhi Wang, Ian Tenney, Ellie Pavlick, Alex Beutel, Jilin Chen, Emily Pitler, and Slav Petrov. We benefited greatly throughout the project from discussions with Fernando Pereira, Ed Chi, Dipanjan Das, Vera Axelrod, Jacob Eisenstein, Tulsee Doshi, and James Wexler.



1 Zari is an Afghan Muppet designed to show that ‘a little girl could do as much as everybody else’.

Source: Google AI Blog


Measuring Gendered Correlations in Pre-trained NLP Models

Natural language processing (NLP) has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, e.g., Wikipedia, by repeatedly masking out words and trying to predict them (this is called masked language modeling). The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. There is then a second training stage, fine-tuning, in which the model uses task-specific training data to learn how to use the general pre-trained representations to do a concrete task, like classification. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream, to ensure the application of these models aligns with our AI Principles.

In “Measuring and Reducing Gendered Correlations in Pre-trained Models” we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models. We present experimental results over public model checkpoints and an academic task dataset to illustrate how the best practices apply, providing a foundation for exploring settings beyond the scope of this case study. We will soon release a series of checkpoints, Zari1, which reduce gendered correlations while maintaining state-of-the-art accuracy on standard NLP task metrics.

Measuring Correlations
To understand how correlations in pre-trained representations can affect downstream task performance, we apply a diverse set of evaluation metrics for studying the representation of gender. Here, we’ll discuss results from one of these tests, based on coreference resolution, which is the capability that allows models to understand the correct antecedent to a given pronoun in a sentence. For example, in the sentence that follows, the model should recognize his refers to the nurse, and not to the patient.

The standard academic formulation of the task is the OntoNotes test (Hovy et al., 2006), and we measure how accurate a model is at coreference resolution in a general setting using an F1 score over this data (as in Tenney et al. 2019). Since OntoNotes represents only one data distribution, we also consider the WinoGender benchmark that provides additional, balanced data designed to identify when model associations between gender and profession incorrectly influence coreference resolution. High values of the WinoGender metric (close to one) indicate a model is basing decisions on normative associations between gender and profession (e.g., associating nurse with the female gender and not male). When model decisions have no consistent association between gender and profession, the score is zero, which suggests that decisions are based on some other information, such as sentence structure or semantics.

BERT and ALBERT metrics on OntoNotes (accuracy) and WinoGender (gendered correlations). Low values on the WinoGender metric indicate that a model does not preferentially use gendered correlations in reasoning.

In this study, we see that neither the (Large) BERT or ALBERT public model achieves zero score on the WinoGender examples, despite achieving impressive accuracy on OntoNotes (close to 100%). At least some of this is due to models preferentially using gendered correlations in reasoning. This isn’t completely surprising: there are a range of cues available to understand text and it is possible for a general model to pick up on any or all of these. However, there is reason for caution, as it is undesirable for a model to make predictions primarily based on gendered correlations learned as priors rather than the evidence available in the input.

Best Practices
Given that it is possible for unintended correlations in pre-trained model representations to affect downstream task reasoning, we now ask: what can one do to mitigate any risk this poses when developing new NLP models?

  • It is important to measure for unintended correlations: Model quality may be assessed using accuracy metrics, but these only measure one dimension of performance, especially if the test data is drawn from the same distribution as the training data. For example, the BERT and ALBERT checkpoints have accuracy within 1% of each other, but differ by 26% (relative) in the degree to which they use gendered correlations for coreference resolution. This difference might be important for some tasks; selecting a model with low WinoGender score could be desirable in an application featuring texts about people in professions that may not conform to historical social norms, e.g., male nurses.
  • Be careful even when making seemingly innocuous configuration changes: Neural network model training is controlled by many hyperparameters that are usually selected to maximize some training objective. While configuration choices often seem innocuous, we find they can cause significant changes for gendered correlations, both for better and for worse. For example, dropout regularization is used to reduce overfitting by large models. When we increase the dropout rate used for pre-training BERT and ALBERT, we see a significant reduction in gendered correlations even after fine-tuning. This is promising since a simple configuration change allows us to train models with reduced risk of harm, but it also shows that we should be mindful and evaluate carefully when making any change in model configuration.
    Impact of increasing dropout regularization in BERT and ALBERT.
  • There are opportunities for general mitigations: A further corollary from the perhaps unexpected impact of dropout on gendered correlations is that it opens the possibility to use general-purpose methods for reducing unintended correlations: by increasing dropout in our study, we improve how the models reason about WinoGender examples without manually specifying anything about the task or changing the fine-tuning stage at all. Unfortunately, OntoNotes accuracy does start to decline as the dropout rate increases (which we can see in the BERT results), but we are excited about the potential to mitigate this in pre-training, where changes can lead to model improvements without the need for task-specific updates. We explore counterfactual data augmentation as another mitigation strategy with different tradeoffs in our paper.

What’s Next
We believe these best practices provide a starting point for developing robust NLP systems that perform well across the broadest possible range of linguistic settings and applications. Of course these techniques on their own are not sufficient to capture and remove all potential issues. Any model deployed in a real-world setting should undergo rigorous testing that considers the many ways it will be used, and implement safeguards to ensure alignment with ethical norms, such as Google's AI Principles. We look forward to developments in evaluation frameworks and data that are more expansive and inclusive to cover the many uses of language models and the breadth of people they aim to serve.

Acknowledgements
This is joint work with Xuezhi Wang, Ian Tenney, Ellie Pavlick, Alex Beutel, Jilin Chen, Emily Pitler, and Slav Petrov. We benefited greatly throughout the project from discussions with Fernando Pereira, Ed Chi, Dipanjan Das, Vera Axelrod, Jacob Eisenstein, Tulsee Doshi, and James Wexler.



1 Zari is an Afghan Muppet designed to show that ‘a little girl could do as much as everybody else’.

Source: Google AI Blog


Measuring Gendered Correlations in Pre-trained NLP Models

Natural language processing (NLP) has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, e.g., Wikipedia, by repeatedly masking out words and trying to predict them (this is called masked language modeling). The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. There is then a second training stage, fine-tuning, in which the model uses task-specific training data to learn how to use the general pre-trained representations to do a concrete task, like classification. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream, to ensure the application of these models aligns with our AI Principles.

In “Measuring and Reducing Gendered Correlations in Pre-trained Models” we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models. We present experimental results over public model checkpoints and an academic task dataset to illustrate how the best practices apply, providing a foundation for exploring settings beyond the scope of this case study. We will soon release a series of checkpoints, Zari1, which reduce gendered correlations while maintaining state-of-the-art accuracy on standard NLP task metrics.

Measuring Correlations
To understand how correlations in pre-trained representations can affect downstream task performance, we apply a diverse set of evaluation metrics for studying the representation of gender. Here, we’ll discuss results from one of these tests, based on coreference resolution, which is the capability that allows models to understand the correct antecedent to a given pronoun in a sentence. For example, in the sentence that follows, the model should recognize his refers to the nurse, and not to the patient.

The standard academic formulation of the task is the OntoNotes test (Hovy et al., 2006), and we measure how accurate a model is at coreference resolution in a general setting using an F1 score over this data (as in Tenney et al. 2019). Since OntoNotes represents only one data distribution, we also consider the WinoGender benchmark that provides additional, balanced data designed to identify when model associations between gender and profession incorrectly influence coreference resolution. High values of the WinoGender metric (close to one) indicate a model is basing decisions on normative associations between gender and profession (e.g., associating nurse with the female gender and not male). When model decisions have no consistent association between gender and profession, the score is zero, which suggests that decisions are based on some other information, such as sentence structure or semantics.

BERT and ALBERT metrics on OntoNotes (accuracy) and WinoGender (gendered correlations). Low values on the WinoGender metric indicate that a model does not preferentially use gendered correlations in reasoning.

In this study, we see that neither the (Large) BERT or ALBERT public model achieves zero score on the WinoGender examples, despite achieving impressive accuracy on OntoNotes (close to 100%). At least some of this is due to models preferentially using gendered correlations in reasoning. This isn’t completely surprising: there are a range of cues available to understand text and it is possible for a general model to pick up on any or all of these. However, there is reason for caution, as it is undesirable for a model to make predictions primarily based on gendered correlations learned as priors rather than the evidence available in the input.

Best Practices
Given that it is possible for unintended correlations in pre-trained model representations to affect downstream task reasoning, we now ask: what can one do to mitigate any risk this poses when developing new NLP models?

  • It is important to measure for unintended correlations: Model quality may be assessed using accuracy metrics, but these only measure one dimension of performance, especially if the test data is drawn from the same distribution as the training data. For example, the BERT and ALBERT checkpoints have accuracy within 1% of each other, but differ by 26% (relative) in the degree to which they use gendered correlations for coreference resolution. This difference might be important for some tasks; selecting a model with low WinoGender score could be desirable in an application featuring texts about people in professions that may not conform to historical social norms, e.g., male nurses.
  • Be careful even when making seemingly innocuous configuration changes: Neural network model training is controlled by many hyperparameters that are usually selected to maximize some training objective. While configuration choices often seem innocuous, we find they can cause significant changes for gendered correlations, both for better and for worse. For example, dropout regularization is used to reduce overfitting by large models. When we increase the dropout rate used for pre-training BERT and ALBERT, we see a significant reduction in gendered correlations even after fine-tuning. This is promising since a simple configuration change allows us to train models with reduced risk of harm, but it also shows that we should be mindful and evaluate carefully when making any change in model configuration.
    Impact of increasing dropout regularization in BERT and ALBERT.
  • There are opportunities for general mitigations: A further corollary from the perhaps unexpected impact of dropout on gendered correlations is that it opens the possibility to use general-purpose methods for reducing unintended correlations: by increasing dropout in our study, we improve how the models reason about WinoGender examples without manually specifying anything about the task or changing the fine-tuning stage at all. Unfortunately, OntoNotes accuracy does start to decline as the dropout rate increases (which we can see in the BERT results), but we are excited about the potential to mitigate this in pre-training, where changes can lead to model improvements without the need for task-specific updates. We explore counterfactual data augmentation as another mitigation strategy with different tradeoffs in our paper.

What’s Next
We believe these best practices provide a starting point for developing robust NLP systems that perform well across the broadest possible range of linguistic settings and applications. Of course these techniques on their own are not sufficient to capture and remove all potential issues. Any model deployed in a real-world setting should undergo rigorous testing that considers the many ways it will be used, and implement safeguards to ensure alignment with ethical norms, such as Google's AI Principles. We look forward to developments in evaluation frameworks and data that are more expansive and inclusive to cover the many uses of language models and the breadth of people they aim to serve.

Acknowledgements
This is joint work with Xuezhi Wang, Ian Tenney, Ellie Pavlick, Alex Beutel, Jilin Chen, Emily Pitler, and Slav Petrov. We benefited greatly throughout the project from discussions with Fernando Pereira, Ed Chi, Dipanjan Das, Vera Axelrod, Jacob Eisenstein, Tulsee Doshi, and James Wexler.



1 Zari is an Afghan Muppet designed to show that ‘a little girl could do as much as everybody else’.

Source: Google AI Blog


Language-Agnostic BERT Sentence Embedding

A multilingual embedding model is a powerful tool that encodes text from different languages into a shared embedding space, enabling it to be applied to a range of downstream tasks, like text classification, clustering, and others, while also leveraging semantic information for language understanding. Existing approaches for generating such embeddings, like LASER or m~USE, rely on parallel data, mapping a sentence from one language directly to another language in order to encourage consistency between the sentence embeddings. While these existing multilingual approaches yield good overall performance across a number of languages, they often underperform on high-resource languages compared to dedicated bilingual models, which can leverage approaches like translation ranking tasks with translation pairs as training data to obtain more closely aligned representations. Further, due to limited model capacity and the often poor quality of training data for low-resource languages, it can be difficult to extend multilingual models to support a larger number of languages while maintaining good performance.

Illustration of a multilingual embedding space.

Recent efforts to improve language models include the development of masked language model (MLM) pre-training, such as that used by BERT, ALBERT and RoBERTa. This approach has led to exceptional gains across a wide range of languages and a variety of natural language processing tasks since it only requires monolingual text. In addition, MLM pre-training has been extended to the multilingual setting by modifying MLM training to include concatenated translation pairs, known as translation language modeling (TLM), or by simply introducing pre-training data from multiple languages. However, while the internal model representations learned during MLM and TLM training are helpful when fine-tuning on downstream tasks, without a sentence level objective, they do not directly produce sentence embeddings, which are critical for translation tasks.

In “Language-agnostic BERT Sentence Embedding”, we present a multilingual BERT embedding model, called LaBSE, that produces language-agnostic cross-lingual sentence embeddings for 109 languages. The model is trained on 17 billion monolingual sentences and 6 billion bilingual sentence pairs using MLM and TLM pre-training, resulting in a model that is effective even on low-resource languages for which there is no data available during training. Further, the model establishes a new state of the art on multiple parallel text (a.k.a. bitext) retrieval tasks. We have released the pre-trained model to the community through tfhub, which includes modules that can be used as-is or can be fine-tuned using domain-specific data.

The collection of the training data for 109 supported languages

The Model
In previous work, we proposed the use of a translation ranking task to learn a multilingual sentence embedding space. This approach tasks the model with ranking the true translation over a collection of sentences in the target language, given a sentence in the source language. The translation ranking task is trained using a dual encoder architecture with a shared transformer encoder. The resulting bilingual models achieved state-of-the-art performance on multiple parallel text retrieval tasks (including United Nations and BUCC). However, the model suffered when the bi-lingual models were extended to support multiple languages (16 languages, in our test case) due to limitations in model capacity, vocabulary coverage, training data quality and more.

Translation ranking task. Given a sentence in a given source language, the task is to find the true translation over a collection of sentences in the target language.

For LaBSE, we leverage recent advances on language model pre-training, including MLM and TLM, on a BERT-like architecture and follow this with fine-tuning on a translation ranking task. A 12-layer transformer with a 500k token vocabulary pre-trained using MLM and TLM on 109 languages is used to increase the model and vocabulary coverage. The resulting LaBSE model offers extended support to 109 languages in a single model.

The dual encoder architecture, in which the source and target text are encoded using a shared transformer embedding network separately. The translation ranking task is applied, forcing the text that paraphrases each other to have similar representations. The transformer embedding network is initialized from a BERT checkpoint trained on MLM and TLM tasks.

Performance on Cross-lingual Text Retrieval
We evaluate the proposed model using the Tatoeba corpus, a dataset consisting of up to 1,000 English-aligned sentence pairs for 112 languages. For more than 30 of the languages in the dataset, the model has no training data. The model is tasked with finding the nearest neighbor translation for a given sentence, which it calculates using the cosine distance.

To understand the performance of the model for languages at the head or tail of the training data distribution, we divide the set of languages into several groups and compute the average accuracy for each set. The first 14-language group is selected from the languages supported by m~USE, which cover the languages from the head of the distribution (head languages). We also evaluate a second language group composed of 36 languages from the XTREME benchmark. The third 82-language group, selected from the languages covered by the LASER training data, includes many languages from the tail of the distribution (tail languages). Finally, we compute the average accuracy for all languages.

The table below presents the average accuracy achieved by LaBSE, compared to the m~USE and LASER models, for each language group. As expected, all models perform strongly on the 14-language group that covers most head languages. With more languages included, the averaged accuracy for both LASER and LaBSE declines. However, the reduction in accuracy from the LaBSE model with increasing numbers of languages is much less significant, outperforming LASER significantly, particularly when the full distribution of 112 languages is included (83.7% accuracy vs. 65.5%).

Model 14 Langs 36 Langs 82 Langs All Langs
m~USE* 93.9
LASER 95.3 84.4 75.9 65.5
LaBSE 95.3 95.0 87.3 83.7
Average Accuracy (%) on Tatoeba Datasets. The “14 Langs” group consists of languages supported by m~USE; the “36 Langs” group includes languages selected by XTREME; and the “82 Langs” group represents languages covered by the LASER model. The “All Langs” group includes all languages supported by Taoteba.
* The m~USE model comes in two varieties, one built on a convolutional neural network architecture and the other a Transformer-like architecture. Here, we compare only to the Transformer version.

Support to Unsupported Languages
The average performance of all languages included in Tatoeba is very promising. Interestingly, LaBSE even performs relatively well for many of the 30+ Tatoeba languages for which it has no training data (see below). For one third of these languages the LaBSE accuracy is higher than 75% and only 8 have accuracy lower than 25%, indicating very strong transfer performance to languages without training data. Such positive language transfer is only possible due to the massively multilingual nature of LaBSE.

LaBSE accuracy for the subset of Tatoeba languages (represented with ISO 639-1/639-2 codes) for which there was no training data.

Mining Parallel Text from WebLaBSE can be used for mining parallel text (bi-text) from web-scale data. For example, we applied LaBSE to CommonCrawl, a large-scale monolingual corpus, to process 560 million Chinese and 330 million German sentences for the extraction of parallel text. Each Chinese and German sentence pair is encoded using the LaBSE model and then the encoded embedding is used to find a potential translation from a pool of 7.7 billion English sentences pre-processed and encoded by the model. An approximate nearest neighbor search is employed to quickly search through the high-dimensional sentence embeddings. After a simple filtering, the model returns 261M and 104M potential parallel pairs for English-Chinese and English-German, respectively. The trained NMT model using the mined data reaches BLEU scores of 35.7 and 27.2 on the WMT translation tasks (wmt17 for English-to-Chinese and wmt14 for English-to-German). The performance is only a few points away from current state-of-art-models trained on high quality parallel data.

ConclusionWe're excited to share this research, and the model, with the community. The pre-trained model is released at tfhub to support further research on this direction and possible downstream applications. We also believe that what we're showing here is just the beginning, and there are more important research problems to be addressed, such as building better models to support all languages.

AcknowledgementsThe core team includes Wei Wang, Naveen Arivazhagan, Daniel Cer. We would like to thank the Google Research Language team, along with our partners in other Google groups for their feedback and suggestions. Special thanks goes to Sidharth Mudgal, and Jax Law for help with data processing; as well as Jialu Liu, Tianqi Liu, Chen Chen, and Anosh Raj for help on BERT pre-training.

Source: Google AI Blog


SmartReply for YouTube Creators



It has been more than 4 years since SmartReply was launched, and since then, it has expanded to more users with the Gmail launch and Android Messages and to more devices with Android Wear. Developers now use SmartReply to respond to reviews within the Play Developer Console and can set up their own versions using APIs offered within MLKit and TFLite. With each launch there has been a unique challenge in modeling and serving that required customizing SmartReply for the task requirements.

We are now excited to share an updated SmartReply built for YouTube and implemented in YouTube Studio that helps creators engage more easily with their viewers. This model learns comment and reply representation through a computationally efficient dilated self-attention network, and represents the first cross-lingual and character byte-based SmartReply model. SmartReply for YouTube is currently available for English and Spanish creators, and this approach simplifies the process of extending the SmartReply feature to many more languages in the future.
YouTube creators receive a large volume of responses to their videos. Moreover, the community of creators and viewers on YouTube is diverse, as reflected by the creativity of their comments, discussions and videos. In comparison to emails, which tend to be long and dominated by formal language, YouTube comments reveal complex patterns of language switching, abbreviated words, slang, inconsistent usage of punctuation, and heavy utilization of emoji. Following is a sample of comments that illustrate this challenge:
Deep Retrieval
The initial release of SmartReply for Inbox encoded input emails word-by-word with a recurrent neural network, and then decoded potential replies with yet another word-level recurrent neural network. Despite the expressivity of this approach, it was computationally expensive. Instead, we found that one can achieve the same ends by designing a system that searches through a predefined list of suggestions for the most appropriate response.

This retrieval system encoded the message and its suggestion independently. First, the text was preprocessed to extract words and short phrases. This preprocessing included, but was not limited to, language identification, tokenization, and normalization. Two neural networks then simultaneously and independently encoded the message and the suggestion. This factorization allowed one to pre-compute the suggestion encodings and then search through the set of suggestions using an efficient maximum inner product search data structure. This deep retrieval approach enabled us to expand SmartReply to Gmail and since then, it has been the foundation for several SmartReply systems including the current YouTube system.

Beyond Words
The previous SmartReply systems described above relied on word level preprocessing that is well tuned for a limited number of languages and narrow genres of writing. Such systems face significant challenges in the YouTube case, where a typical comment might include heterogeneous content, like emoji, ASCII art, language switching, etc. In light of this, and taking inspiration from our recent work on byte and character language modeling, we decided to encode the text without any preprocessing. This approach is supported by research demonstrating that a deep Transformer network is able to model words and phrases from the ground up just by feeding it text as a sequence of characters or bytes, with comparable quality to word-based models.

Although initial results were promising, especially for processing comments with emoji or typos, the inference speed was too slow for production due to the fact that character sequences are longer than word equivalents and the computational complexity of self-attention layers grows quadratically as a function of sequence length. We found that shrinking the sequence length by applying temporal reduction layers at each layer of the network, similar to the dilation technique applied in WaveNet, provides a good trade-off between computation and quality.

The figure below presents a dual encoder network that encodes both the comment and the reply to maximize the mutual information between their latent representations by training the network with a contrastive objective. The encoding starts with feeding the transformer a sequence of bytes after they have been embedded. The input for each subsequent layer will be reduced by dropping a percentage of characters at equal offsets. After applying several transformer layers the sequence length is greatly truncated, significantly reducing the computational complexity. This sequence compression scheme could be substituted by other operators such as average pooling, though we did not notice any gains from more sophisticated methods, and therefore, opted to use dilation for simplicity.
A dual encoder network that maximizes the mutual information between the comments and their replies through a contrastive objective. Each encoder is fed a sequence of bytes and is implemented as a computationally efficient dilated transformer network.
A Model to Learn Them All
Instead of training a separate model for each language, we opted to train a single cross-lingual model for all supported languages. This allows the support of mixed-language usage in the comments, and enables the model to utilize the learning of common elements in one language for understanding another, such as emoji and numbers. Moreover, having a single model simplifies the logistics of maintenance and updates. While the model has been rolled out to English and Spanish, the flexibility inherent in this approach will enable it to be expanded to other languages in the future.

Inspecting the encodings of a multilingual set of suggestions produced by the model reveals that the model clusters appropriate replies, regardless of the language to which they belong. This cross-lingual capability emerged without exposing the model during training to any parallel corpus. We demonstrate in the figure below for three languages how the replies are clustered by their meaning when the model is probed with an input comment. For example, the English comment “This is a great video,” is surrounded by appropriate replies, such as “Thanks!” Moreover, inspection of the nearest replies in other languages reveal them also to be appropriate and similar in meaning to the English reply. The 2D projection also shows several other cross-lingual clusters that consist of replies of similar meaning. This clustering demonstrates how the model can support a rich cross-lingual user experience in the supported languages.
A 2D projection of the model encodings when presented with a hypothetical comment and a small list of potential replies. The neighborhood surrounding English comments (black color) consists of appropriate replies in English and their counterparts in Spanish and Arabic. Note that the network learned to align English replies with their translations without access to any parallel corpus.
When to Suggest?
Our goal is to help creators, so we have to make sure that SmartReply only makes suggestions when it is very likely to be useful. Ideally, suggestions would only be displayed when it is likely that the creator would reply to the comment and when the model has a high chance of providing a sensible and specific response. To accomplish this, we trained auxiliary models to identify which comments should trigger the SmartReply feature.

Conclusion
We’ve launched YouTube SmartReply, starting with English and Spanish comments, the first cross-lingual and character byte-based SmartReply. YouTube is a global product with a diverse user base that generates heterogeneous content. Consequently, it is important that we continuously improve comments for this global audience, and SmartReply represents a strong step in this direction.

Acknowledgements
SmartReply for YouTube creators was developed by Golnaz Farhadi, Ezequiel Baril, Cheng Lee, Claire Yuan, Coty Morrison‎, Joe Simunic‎, Rachel Bransom‎, Rajvi Mehta, Jorge Gonzalez‎, Mark Williams, Uma Roy and many more. We are grateful for the leadership support from Nikhil Dandekar, Eileen Long, Siobhan Quinn, Yun-hsuan Sung, Rachel Bernstein, and Ray Kurzweil.

Source: Google AI Blog


Evaluating Natural Language Generation with BLEURT



In the last few years, research in natural language generation (NLG) has made tremendous progress, with models now able to translate text, summarize articles, engage in conversation, and comment on pictures with unprecedented accuracy, using approaches with increasingly high levels of sophistication. Currently, there are two methods to evaluate these NLG systems: human evaluation and automatic metrics. With human evaluation, one runs a large-scale quality survey for each new version of a model using human annotators, but that approach can be prohibitively labor intensive. In contrast, one can use popular automatic metrics (e.g., BLEU), but these are oftentimes unreliable substitutes for human interpretation and judgement. The rapid progress of NLG and the drawbacks of existing evaluation methods calls for the development of novel ways to assess the quality and success of NLG systems.

In “BLEURT: Learning Robust Metrics for Text Generation” (presented during ACL 2020), we introduce a novel automatic metric that delivers ratings that are robust and reach an unprecedented level of quality, much closer to human annotation. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) builds upon recent advances in transfer learning to capture widespread linguistic phenomena, such as paraphrasing. The metric is available on Github.

Evaluating NLG Systems
In human evaluation, a piece of generated text is presented to annotators, who are tasked with assessing its quality with respect to its fluency and meaning. The text is typically shown side-by-side with a reference, authored by a human or mined from the Web.
An example questionnaire used for human evaluation in machine translation.
The advantage of this method is that it is accurate: people are still unrivaled when it comes to evaluating the quality of a piece of text. However, this method of evaluation can easily take days and involve dozens of people for just a few thousand examples, which disrupts the model development workflow.

In contrast, the idea behind automatic metrics is to provide a cheap, low-latency proxy for human-quality measurements. Automatic metrics often take two sentences as input, a candidate and a reference, and they return a score that indicates to what extent the former resembles the latter, typically using lexical overlap. A popular metric is BLEU, which counts the sequences of words in the candidate that also appear in the reference (the BLEU score is very similar to precision).

The advantages and weaknesses of automatic metrics are the opposite of those that come with human evaluation. Automatic metrics are convenient — they can be computed in real-time throughout the training process (e.g., for plotting with Tensorboard). However, they are often inaccurate due to their focus on surface-level similarities and they fail to capture the diversity of human language. Frequently, there are many perfectly valid sentences that can convey the same meaning. Overlap-based metrics that rely exclusively on lexical matches unfairly reward those that resemble the reference in their surface form, even if they do not accurately capture meaning, and penalize other paraphrases.
BLEU scores for three candidate sentences. Candidate 2 is semantically close to the reference, and yet its score is lower than Candidate 3.
Ideally, an evaluation method for NLG should combine the advantages of both human evaluation and automatic metrics — it should be relatively cheap to compute, but flexible enough to cope with linguistic diversity.

Introducing BLEURT
BLEURT is a novel, machine learning-based automatic metric that can capture non-trivial semantic similarities between sentences. It is trained on a public collection of ratings (the WMT Metrics Shared Task dataset) as well as additional ratings provided by the user.
Three candidate sentences rated by BLEURT. BLEURT captures that candidate 2 is similar to the reference, even though it contains more non-reference words than candidate 3.
Creating a metric based on machine learning poses a fundamental challenge: the metric should do well consistently on a wide range of tasks and domains, and over time. However, there is only a limited amount of training data. Indeed, public data is sparse — the WMT Metrics Task dataset, the largest collection of human ratings at the time of writing, contains ~260K human ratings covering the news domain only. This is too limited to train a metric suited for the evaluation of NLG systems of the future.

To address this problem, we employ transfer learning. First, we use the contextual word representations of BERT, a state-of-the-art unsupervised representation learning method for language understanding that has already been successfully incorporated into NLG metrics (e.g., YiSi or BERTscore).

Second, we introduce a novel pre-training scheme to increase BLEURT's robustness. Our experiments reveal that training a regression model directly over publicly available human ratings is a brittle approach, since we cannot control in what domain and across what time span the metric will be used. The accuracy is likely to drop in the presence of domain drift, i.e., when the text used comes from a different domain than the training sentence pairs. It may also drop when there is a quality drift, when the ratings to be predicted are higher than those used during training — a feature which would normally be good news because it indicates that ML research is making progress.

The success of BLEURT relies on “warming-up” the model using millions of synthetic sentence pairs before fine-tuning on human ratings. We generated training data by applying random perturbations to sentences from Wikipedia. Instead of collecting human ratings, we use a collection of metrics and models from the literature (including BLEU), which allows the number of training examples to be scaled up at very low cost.
BLEURT's data generation process combines random perturbations and scoring with pre-existing metrics and models.
Experiments reveal that pre-training significantly increases BLEURT's accuracy, especially when the test data is out-of-distribution.

We pre-train BLEURT twice, first with a language modelling objective (as explained in the original BERT paper), then with a collection of NLG evaluation objectives. We then fine-tune the model on the WMT Metrics dataset, on a set of ratings provided by the user, or a combination of both.The following figure illustrates BLEURT's training procedure end-to-end.

Results
We benchmark BLEURT against competing approaches and show that it offers superior performance, correlating well with human ratings on the WMT Metrics Shared Task (machine translation) and the WebNLG Challenge (data-to-text). For example, BLEURT is ~48% more accurate than BLEU on the WMT Metrics Shared Task of 2019. We also demonstrate that pre-training helps BLEURT cope with quality drift.
Correlation between different metrics and human ratings on the WMT'19 Metrics Shared Task.
Conclusion
As NLG models have gotten better over time, evaluation metrics have become an important bottleneck for the research in this field. There are good reasons why overlap-based metrics are so popular: they are simple, consistent, and they do not require any training data. In the use cases where multiple reference sentences are available for each candidate, they can be very accurate. While they play a critical part in our infrastructure, they are also very conservative, and only give an incomplete picture of NLG systems' performance. Our view is that ML engineers should enrich their evaluation toolkits with more flexible, semantic-level metrics.

BLEURT is our attempt to capture NLG quality beyond surface overlap. Thanks to BERT's representations and a novel pre-training scheme, our metric yields SOTA performance on two academic benchmarks, and we are currently investigating how it can improve Google products. Future research includes investigating multilinguality and multimodality.

Acknowledgements
This project was co-advised by Dipanjan Das. We thank Slav Petrov, Eunsol Choi, Nicholas FitzGerald, Jacob Devlin, Madhavan Kidambi, Ming-Wei Chang, and all the members of the Google Research Language team.

Source: Google AI Blog


Using Neural Networks to Find Answers in Tables



Much of the world’s information is stored in the form of tables, which can be found on the web or in databases and documents. These might include anything from technical specifications of consumer products to financial and country development statistics, sports results and much more. Currently, one needs to manually look at these tables to find the answer to a question or rely on a service that gives answers to specific questions (e.g., about sports results). This information would be much more accessible and useful if it could be queried through natural language.

For example, the following figure shows a table with a number of questions that people might want to ask. The answer to these questions might be found in one, or multiple, cells in a table (“Which wrestler had the most number of reigns?”), or might require aggregating multiple table cells (“How many world champions are there with only one reign?”).
A table and questions with the expected answers. Answers can be selected (#1, #4) or computed (#2, #3).
Many recent approaches apply traditional semantic parsing to this problem, where a natural language question is translated to an SQL-like database query that executes against a database to provide the answers. For example, the question “How many world champions are there with only one reign?” would be mapped to a query such as “select count(*) where column("No. of reigns") == 1;” and then executed to produce the answer. This approach often requires substantial engineering in order to generate syntactically and semantically valid queries and is difficult to scale to arbitrary questions rather than questions about very specific tables (such as sports results).

In, “TAPAS: Weakly Supervised Table Parsing via Pre-training”, accepted at ACL 2020, we take a different approach that extends the BERT architecture to encode the question jointly along with tabular data structure, resulting in a model that can then point directly to the answer. Instead of creating a model that works only for a single style of table, this approach results in a model that can be applied to tables from a wide range of domains. After pre-training on millions of Wikipedia tables, we show that our approach exhibits competitive accuracy on three academic table question-answering (QA) datasets. Additionally, in order to facilitate more exciting research in this area, we have open-sourced the code for training and testing the models as well as the models pre-trained on Wikipedia tables, available at our GitHub repo.

How to Process a Question
To process a question such as “Average time as champion for top 2 wrestlers?”, our model jointly encodes the question as well as the table content row by row using a BERT model that is extended with special embeddings to encode the table structure.

The key addition to the transformer-based BERT model are the extra embeddings that are used to encode the structured input. We rely on learned embeddings for the column index, the row index and one special rank index, which indicates the order of elements in numerical columns. The following image shows how all of these are added together at the input and fed into the transformer layers. The figure below illustrates how the question is encoded, together with the small table shown on the left. Each cell token has a special embedding that indicates its row, column and numeric rank within the column.
BERT layer input: Every input token is represented as the sum of the embeddings of its word, absolute position, segment (whether it belongs to the question or table), column and row and numeric rank (the position the cell would have if the column was sorted by its numeric values).
The model has two outputs: 1) for each table cell, a score indicates the probability that this cell will be part of the answer and 2) an aggregation operation that indicates which operation (if any) is applied to produce the final answer. The following figure shows how, for the question “Average time as champion for top 2 wrestlers?”, the model should select the first two cells of the “Combined days” column and the “AVERAGE” operation with high probability.
Model schematic: The BERT layer encodes both the question and table. The model outputs a probability for every aggregation operation and a selection probability for every table cell. For the question “Average time as champion for top 2 wrestlers?” the AVERAGE operation and the cells with the numbers 3,749 and 3,103 should have a high probability.
Pre-training
Using a method similar to how BERT is trained on text, we pre-trained our model on 6.2 million table-text pairs extracted from the English Wikipedia. During pre-training, the model learns to restore words in both table and text that have been replaced with a mask. We find that the model can do this with relatively high accuracy (71.4% of the masked tokens are restored correctly for tables unseen during training).

Learning from Answers Only
During fine-tuning the model learns how to answer questions from a table. This is done through training with either strong or weak supervision. In the case of strong supervision, for a given table and questions, one must provide the cells and aggregation operation to select (e.g., sum or count), a time-consuming and laborious process. More commonly, one trains using weak supervision, where only the correct answer (e.g., 3426 for the question in the example above) is provided. In this case, the model attempts to find an aggregation operation and cells that produce an answer close to the correct answer. This is done by computing the expectation over all the possible aggregation decisions and comparing it with the true result. The weak supervision scenario is beneficial because it allows for non-experts to provide the data needed to train the model and takes less time than strong supervision.

Results
We applied our model to three datasets — SQA, WikiTableQuestions (WTQ) and WikiSQL — and compared it to the performance of the top three state-of-the-art (SOTA) models for parsing tabular data. The comparison models included Min et al (2019) for WikiSQL, Wang et al. (2019) for WTQ and our own previous work for SQA (Mueller et al., 2019). For all datasets, we report the answer accuracy on the test sets for the weakly supervised training setup. For SQA and WIkiSQL we used our base model pre-trained on Wikipedia, while for WTQ, we found it beneficial to additionally pre-train on the SQA data. Our best models outperform the previous SOTA for SQA by more than 12 points, the previous SOTA for WTQ by more than 4 points and performs similarly to the best model published on WikiSQL.
Test answer accuracy for the weakly-supervised setup on three academic TableQA datasets.
Acknowledgements
This work was carried out by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos in the Google AI language group in Zurich. We would like to thank Yasemin Altun, Srini Narayanan, Slav Petrov, William Cohen, Massimo Nicosia, Syrine Krichene, and Jordan Boyd-Graber for useful comments and suggestions.

Source: Google AI Blog


XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization



One of the key challenges in natural language processing (NLP) is building systems that not only work in English but in all of the world’s ~6,900 languages. Luckily, while most of the world’s languages are data sparse and do not have enough data available to train robust models on their own, many languages do share a considerable amount of underlying structure. On the vocabulary level, languages often have words that stem from the same origin — for instance, “desk” in English and “Tisch” in German both come from the Latin “discus”. Similarly, many languages also mark semantic roles in similar ways, such as the use of postpositions to mark temporal and spatial relations in both Chinese and Turkish.

In NLP, there are a number of methods that leverage the shared structure of multiple languages in training in order to overcome the data sparsity problem. Historically, most of these methods focused on performing a specific task in multiple languages. Over the last few years, driven by advances in deep learning, there has been an increase in the number of approaches that attempt to learn general-purpose multilingual representations (e.g., mBERT, XLM, XLM-R), which aim to capture knowledge that is shared across languages and that is useful for many tasks. In practice, however, the evaluation of such methods has mostly focused on a small set of tasks and for linguistically similar languages.

To encourage more research on multilingual learning, we introduce “XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization”, which covers 40 typologically diverse languages (spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of syntax or semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the Niger-Congo languages Swahili and Yoruba, spoken in Africa. The code and data, including examples for running various baselines, is available here.

XTREME Tasks and Languages
The tasks included in XTREME cover a range of paradigms, including sentence classification, structured prediction, sentence retrieval and question answering. Consequently, in order for models to be successful on the XTREME benchmarks, they must learn representations that generalize to many standard cross-lingual transfer settings.

Tasks supported in the XTREME benchmark.
Each of the tasks covers a subset of the 40 languages. To obtain additional data in the low-resource languages used for analyses in XTREME, the test sets of two representative tasks, natural language inference (XNLI) and question answering (XQuAD), were automatically translated from English to the remaining languages. We show that models using the translated test sets for these tasks exhibited performance comparable to that achieved using human-labelled test sets.

Zero-shot Evaluation
To evaluate performance using XTREME, models must first be pre-trained on multilingual text using objectives that encourage cross-lingual learning. Then, they are fine-tuned on task-specific English data, since English is the most likely language where labelled data is available. XTREME then evaluates these models on zero-shot cross-lingual transfer performance, i.e., on other languages for which no task-specific data was seen. The three-step process, from pre-training to fine-tuning to zero-shot transfer, is shown in the figure below.
The cross-lingual transfer learning process for a given model: pre-training on multilingual text, followed by fine-tuning in English on downstream tasks, and finally zero-shot evaluation with XTREME.
In practice, one of the benefits of this zero-shot setting is computational efficiency — a pre-trained model only needs to be fine-tuned on English data for each task and can then be evaluated directly on other languages. Nevertheless, for tasks where labelled data is available in other languages, we also compare against fine-tuning on in-language data. Finally, we provide a combined score by obtaining the zero-shot scores on all nine XTREME tasks.

A Testbed for Transfer Learning
We conduct experiments with several state-of-the-art pre-trained multilingual models, including: multilingual BERT, a multilingual extension of the popular BERT model; XLM and XLM-R, two larger versions of multilingual BERT that have been trained on even more data; and a massively multilingual machine translation model, M4. A common feature of these models is that they have been pre-trained on large amounts of data from multiple languages. For our experiments, we choose variants of these models that are pre-trained on around 100 languages, including the 40 languages of our benchmark.

We find that while models achieve close to human performance on most existing tasks in English, performance is significantly lower for many of the other languages. Across all models, the gap between English performance and performance for the remaining languages is largest for the structured prediction and question answering tasks, while the spread of results across languages is largest for the structured prediction and sentence retrieval tasks.

For illustration, in the figure below we show the performance of the best-performing model in the zero-shot setting, XLM-R, by task and language, across all language families. The scores across tasks are not comparable, so the main focus should be the relative ranking of languages across tasks. As we can see, many high-resource languages, particularly from the Indo-European language family, are consistently ranked higher. In contrast, the model achieves lower performance on many languages from other language families such as Sino-Tibetan, Japonic, Koreanic, and Niger-Congo languages.
Performance of the best-performing model (XLM-R) across all tasks and languages in XTREME in the zero-shot setting. The reported scores are percentages based on task-specific metrics and are not directly comparable across tasks. Human performance (if available) is represented by a red star. Specific examples from each language family are represented with their ISO 639-1 codes.
In general we made a number of interesting observations.
  • In the zero-shot setting, M4 and mBERT are competitive with XLM-R for most tasks, while the latter outperforms them in the particularly challenging question answering tasks. For example, on XQuAD, XLM-R scored 76.6 compared to 64.5 for mBERT and 64.6 for M4, with similar spreads on MLQA and TyDi QA.
  • We find that baselines utilizing machine translation, which translate either the training data or test data, are very competitive. On the XNLI task, mBERT scored 65.4 in the zero shot transfer setting, and 74.0 when using translated training data.
  • We observe that the few-shot setting (i.e., using limited amounts of in-language labelled data, when available) is particularly competitive for simpler tasks, such as NER, but less useful for the more complex question answering tasks. This can be seen in the performance of mBERT, which improves by 42% on the NER task from 62.2 to 88.3 in the few-shot setting, but for the question answering task (TyDi QA), only improves by 25% (59.7 to 74.5).
  • Overall, a large gap between performance in English and other languages remains across all models and settings, which indicates that there is much potential for research on cross-lingual transfer.
Cross-lingual Transfer Analysis
Similar to previous observations regarding the generalisation ability of deep models, we observe that results improve if more pre-training data is available for a language, e.g., mBERT compared to XLM-R, which has more pre-training data. However, we find that this correlation does not hold for the structured prediction tasks, part-of-speech tagging (POS) and named entity recognition (NER), which indicates that current deep pre-trained models are not able to fully exploit the pre-training data to transfer to such syntactic tasks. We also find that models have difficulties transferring to non-Latin scripts. This is evident on the POS task, where mBERT achieves a zero-shot accuracy of 86.9 on Spanish compared to just 49.2 on Japanese.

For the natural language inference task, XNLI, we find that a model makes the same prediction on a test example in English and on the same example in another language about 70% of the time. Semi-supervised methods might be helpful in encouraging improved consistency between the predictions on examples and their translations in different languages. We also find that models struggle to predict POS tag sequences that were not seen in the English training data on which they were fine-tuned, highlighting that these models struggle to learn the syntax of other languages from the large amounts of unlabelled data used for pre-training. For named entity recognition, models have the most difficulty predicting entities that were not seen in the English training data for distant languages — accuracies on Indonesian and Swahili are 58.0 and 66.6, respectively, compared to 82.3 and 80.1 for Portguese and French.

Making Progress on Multilingual Transfer Learning
English has been the focal point of most recent advances in NLP despite being spoken by only around 15% of the world’s population. We believe that building on deep contextual representations, we now have the tools to make substantial progress on systems that serve the remainder of the world’s languages. We hope that XTREME will catalyze research in multilingual transfer learning, similar to how benchmarks such as GLUE and SuperGLUE have spurred the development of deep monolingual models, including BERT, RoBERTa, XLNet, AlBERT, and others. Stay tuned to our Twitter account for information on our upcoming website launch with a submission portal and leaderboard.

Acknowledgements:
This effort has been successful thanks to the hard work of a lot of people including, but not limited to the following (in alphabetical order of last name): Jon Clark, Orhan Firat, Dan Garrette, Junjie Hu, Graham Neubig, and Aditya Siddhant.

Source: Google AI Blog


More Efficient NLP Model Pre-training with ELECTRA



Recent advances in language pre-training have led to substantial gains in the field of natural language processing, with state-of-the-art models such as BERT, RoBERTa, XLNet, ALBERT, and T5, among many others. These methods, though they differ in design, share the same idea of leveraging a large amount of unlabeled text to build a general model of language understanding before being fine-tuned on specific NLP tasks such as sentiment analysis and question answering.

Existing pre-training methods generally fall under two categories: language models (LMs), such as GPT, which process the input text left-to-right, predicting the next word given the previous context, and masked language models (MLMs), such as BERT, RoBERTa, and ALBERT, which instead predict the identities of a small number of words that have been masked out of the input. MLMs have the advantage of being bidirectional instead of unidirectional in that they “see” the text to both the left and right of the token being predicted, instead of only to one side. However, the MLM objective (and related objectives such as XLNet’s) also have a disadvantage. Instead of predicting every single input token, those models only predict a small subset — the 15% that was masked out, reducing the amount learned from each sentence.
Existing pre-training methods and their disadvantages. Arrows indicate which tokens are used to produce a given output representation (rectangle). Left: Traditional language models (e.g., GPT) only use context to the left of the current word. Right: Masked language models (e.g., BERT) use context from both the left and right, but predict only a small subset of words for each input.
In “ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators”, we take a different approach to language pre-training that provides the benefits of BERT but learns far more efficiently. ELECTRA — Efficiently Learning an Encoder that Classifies Token Replacements Accurately — is a novel pre-training method that outperforms existing techniques given the same compute budget. For example, ELECTRA matches the performance of RoBERTa and XLNet on the GLUE natural language understanding benchmark when using less than ¼ of their compute and achieves state-of-the-art results on the SQuAD question answering benchmark. ELECTRA’s excellent efficiency means it works well even at small scale — it can be trained in a few days on a single GPU to better accuracy than GPT, a model that uses over 30x more compute. ELECTRA is being released as an open-source model on top of TensorFlow and includes a number of ready-to-use pre-trained language representation models.

Making Pre-training Faster
ELECTRA uses a new pre-training task, called replaced token detection (RTD), that trains a bidirectional model (like a MLM) while learning from all input positions (like a LM). Inspired by generative adversarial networks (GANs), ELECTRA trains the model to distinguish between “real” and “fake” input data. Instead of corrupting the input by replacing tokens with “[MASK]” as in BERT, our approach corrupts the input by replacing some input tokens with incorrect, but somewhat plausible, fakes. For example, in the below figure, the word “cooked” could be replaced with “ate”. While this makes a bit of sense, it doesn’t fit as well with the entire context. The pre-training task requires the model (i.e., the discriminator) to then determine which tokens from the original input have been replaced or kept the same. Crucially, this binary classification task is applied to every input token, instead of only a small number of masked tokens (15% in the case of BERT-style models), making RTD more efficient than MLM — ELECTRA needs to see fewer examples to achieve the same performance because it receives mode training signal per example. At the same time, RTD results in powerful representation learning, because the model must learn an accurate representation of the data distribution in order to solve the task.
Replaced token detection trains a bidirectional model while learning from all input positions.
The replacement tokens come from another neural network called the generator. While the generator can be any model that produces an output distribution over tokens, we use a small masked language model (i.e., a BERT model with small hidden size) that is trained jointly with the discriminator. Although the structure of the generator feeding into the discriminator is similar to a GAN, we train the generator with maximum likelihood to predict masked words, rather than adversarially, due to the difficulty of applying GANs to text. The generator and discriminator share the same input word embeddings. After pre-training, the generator is dropped and the discriminator (the ELECTRA model) is fine-tuned on downstream tasks. Our models all use the transformer neural architecture.

Further details on the replaced token detection (RTD) task. The fake tokens are sampled from a small masked language model that is trained jointly with ELECTRA.
ELECTRA Results
We compare ELECTRA against other state-of-the-art NLP models and found that it substantially improves over previous methods, given the same compute budget, performing comparably to RoBERTa and XLNet while using less than 25% of the compute.

The x-axis shows the amount of compute used to train the model (measured in FLOPs) and the y-axis shows the dev GLUE score. ELECTRA learns much more efficiently than existing pre-trained NLP models. Note that current best models on GLUE such as T5 (11B) do not fit on this plot because they use much more compute than others (around 10x more than RoBERTa).
Further pushing the limits of efficiency, we experimented with a small ELECTRA model that can be trained to good accuracy on a single GPU in 4 days. Although not achieving the same accuracy as larger models that require many TPUs to train, ELECTRA-small still performs quite well, even outperforming GPT while requiring only 1/30th as much compute.

Lastly, to see if the strong results held at scale, we trained a large ELECTRA model using more compute (roughly the same amount as RoBERTa, about 10% the compute as T5). This model achieves a new state-of-the-art for a single model on the SQuAD 2.0 question answering dataset (see the below table) and outperforms RoBERTa, XLNet, and ALBERT on the GLUE leaderboard. While the large-scale T5-11b model scores higher still on GLUE, ELECTRA is 1/30th the size and uses 10% of the compute to train.

Model Squad 2.0 test set
ELECTRA-Large 88.7
ALBERT-xxlarge 88.1
XLNet-Large 87.9
RoBERTa-Large 86.8
BERT-Large 80.0
SQuAD 2.0 scores for ELECTRA-Large and other state-of-the-art models (only non-ensemble models shown).
Releasing ELECTRA
We are releasing the code for both pre-training ELECTRA and fine-tuning it on downstream tasks, with currently supported tasks including text classification, question answering and sequence tagging. The code supports quickly training a small ELECTRA model on one GPU. We are also releasing pre-trained weights for ELECTRA-Large, ELECTRA-Base, and ELECTRA-Small. The ELECTRA models are currently English-only, but we hope to release models which have been pre-trained on many languages in the future.

Source: Google AI Blog


Measuring Compositional Generalization



People are capable of learning the meaning of a new word and then applying it to other language contexts. As Lake and Baroni put it, “Once a person learns the meaning of a new verb ‘dax’, he or she can immediately understand the meaning of ‘dax twice’ and ‘sing and dax’.” Similarly, one can learn a new object shape and then recognize it with different compositions of previously learned colors or materials (e.g., in the CLEVR dataset). This is because people exhibit the capacity to understand and produce a potentially infinite number of novel combinations of known components, or as Chomsky said, to make “infinite use of finite means.” In the context of a machine learning model learning from a set of training examples, this skill is called compositional generalization.

A common approach for measuring compositional generalization in machine learning (ML) systems is to split the training and testing data based on properties that intuitively correlate with compositional structure. For instance, one approach is to split the data based on sequence length — the training set consists of short examples, while the test set consists of longer examples. Another approach uses sequence patterns, meaning the split is based on randomly assigning clusters of examples sharing the same pattern to either train or test sets. For instance, the questions "Who directed Movie1" and "Who directed Movie2" both fall into the pattern "Who directed <MOVIE>" so they would be grouped together. Yet another method uses held out primitives — some linguistic primitives are shown very rarely during training (e.g., the verb “jump”), but are very prominent in testing. While each of these experiments are useful, it is not immediately clear which experiment is a "better" measure for compositionality. Is it possible to systematically design an “optimal” compositional generalization experiment?

In “Measuring Compositional Generalization: A Comprehensive Method on Realistic Data”, we attempt to address this question by introducing the largest and most comprehensive benchmark for compositional generalization using realistic natural language understanding tasks, specifically, semantic parsing and question answering. In this work, we propose a metric — compound divergence — that allows one to quantitatively assess how much a train-test split measures the compositional generalization ability of an ML system. We analyze the compositional generalization ability of three sequence to sequence ML architectures, and find that they fail to generalize compositionally. We also are releasing the Compositional Freebase Questions dataset used in the work as a resource for researchers wishing to improve upon these results.

Measuring Compositionality
In order to measure the compositional generalization ability of a system, we start with the assumption that we understand the underlying principles of how examples are generated. For instance, we begin with the grammar rules to which we must adhere when generating questions and answers. We then draw a distinction between atoms and compounds. Atoms are the building blocks that are used to generate examples and compounds are concrete (potentially partial) compositions of these atoms. For example, in the figure below, every box is an atom (e.g., Shane Steel, brother, <entity>'s <entity>, produce, etc.), which fits together to form compounds, such as produce and <verb>, Shane Steel’s brother, Did Shane Steel’s brother produce and direct Revenge of the Spy?, etc.
Building compositional sentences (compounds) from building blocks (atoms).
An ideal compositionality experiment then should have a similar atom distribution, i.e., the distribution of words and sub-phrases in the training set is as similar as possible to their distribution in the test set, but with a different compound distribution. To measure compositional generalization on a question answering task about a movie domain, one might, for instance, have the following questions in train and test:
While atoms such as “directed”, “Inception”, and “who <predicate> <entity>” appear in both the train and test sets, the compounds are different.

The Compositional Freebase Questions dataset
In order to conduct an accurate compositionality experiment, we created the Compositional Freebase Questions (CFQ) dataset, a simple, yet realistic, large dataset of natural language questions and answers generated from the public Freebase knowledge base. The CFQ can be used for text-in / text-out tasks, as well as semantic parsing. In our experiments, we focus on semantic parsing, where the input is a natural language question and the output is a query, which when executed against Freebase, produces the correct outcome. CFQ contains around 240k examples and almost 35k query patterns, making it significantly larger and more complex than comparable datasets — about 4 times that of WikiSQL with about 17x more query patterns than Complex Web Questions. Special care has been taken to ensure that the questions and answers are natural. We also quantify the complexity of the syntax in each example using the “complexity level” metric (L), which corresponds roughly to the depth of the parse tree, examples of which are shown below.
Compositional Generalization Experiments on CFQ
For a given train-test split, if the compound distributions of the train and test sets are very similar, then their compound divergence would be close to 0, indicating that they are not difficult tests for compositional generalization. A compound divergence close to 1 means that the train-test sets have many different compounds, which makes it a good test for compositional generalization. Compound divergence thus captures the notion of "different compound distribution", as desired.

We algorithmically generate train-test splits using the CFQ dataset that have a compound divergence ranging from 0 to 0.7 (the maximum that we were able to achieve). We fix the atom divergence to be very small. Then, for each split we measure the performance of three standard ML architectures — LSTM+attention, Transformer, and Universal Transformer. The results are shown in the graph below.
Compound divergence vs accuracy for three ML architectures. There is a surprisingly strong negative correlation between compound divergence and accuracy.
We measure the performance of a model by comparing the correct answers with the output string given by the model. All models achieve an accuracy greater than 95% when the compound divergence is very low. The mean accuracy on the split with highest compound divergence is below 20% for all architectures, which means that even a large training set with a similar atom distribution between train and test is not sufficient for the architectures to generalize well. For all architectures, there is a strong negative correlation between the compound divergence and the accuracy. This seems to indicate that compound divergence successfully captures the core difficulty for these ML architectures to generalize compositionally.

Potentially promising directions for future work might be to apply unsupervised pre-training on input language or output queries, or to use more diverse or more targeted learning architectures, such as syntactic attention. It would also be interesting to apply this approach to other domains such as visual reasoning, e.g. based on CLEVR, or to extend our approach to broader subsets of language understanding, including the use of ambiguous constructs, negations, quantification, comparatives, additional languages, and other vertical domains. We hope that this work will inspire others to use this benchmark to advance the compositional generalization capabilities of learning systems.

Source: Google AI Blog