Tag Archives: NLP

Navigating Recorder Transcripts Easily, with Smart Scrolling

Last year we launched Recorder, a new kind of recording app that made audio recording smarter and more useful by leveraging on-device machine learning (ML) to transcribe the recording, highlight audio events, and suggest appropriate tags for titles. Recorder makes editing, sharing and searching through transcripts easier. Yet because Recorder can transcribe very long recordings (up to 18 hours!), it can still be difficult for users to find specific sections, necessitating a new solution to quickly navigate such long transcripts.

To increase the navigability of content, we introduce Smart Scrolling, a new ML-based feature in Recorder that automatically marks important sections in the transcript, chooses the most representative keywords from each section, and then surfaces those keywords on the vertical scrollbar, like chapter headings. The user can then scroll through the keywords or tap on them to quickly navigate to the sections of interest. The models used are lightweight enough to be executed on-device without the need to upload the transcript, thus preserving user privacy.

Smart Scrolling feature UX

Under the hood
The Smart Scrolling feature is composed of two distinct tasks. The first extracts representative keywords from each section and the second picks which sections in the text are the most informative and unique.

For each task, we utilize two different natural language processing (NLP) approaches: a distilled bidirectional transformer (BERT) model pre-trained on data sourced from a Wikipedia dataset, alongside a modified extractive term frequency–inverse document frequency (TF-IDF) model. By using the bidirectional transformer and the TF-IDF-based models in parallel for both the keyword extraction and important section identification tasks, alongside aggregation heuristics, we were able to harness the advantages of each approach and mitigate their respective drawbacks (more on this in the next section).

The bidirectional transformer is a neural network architecture that employs a self-attention mechanism to achieve context-aware processing of the input text in a non-sequential fashion. This enables parallel processing of the input text to identify contextual clues both before and after a given position in the transcript.

Bidirectional Transformer-based model architecture

The extractive TF-IDF approach rates terms based on their frequency in the text compared to their inverse frequency in the trained dataset, and enables the finding of unique representative terms in the text.

Both models were trained on publicly available conversational datasets that were labeled and evaluated by independent raters. The conversational datasets were from the same domains as the expected product use cases, focusing on meetings, lectures, and interviews, thus ensuring the same word frequency distribution (Zipf’s law).

Extracting Representative Keywords
The TF-IDF-based model detects informative keywords by giving each word a score, which corresponds to how representative this keyword is within the text. The model does so, much like a standard TF-IDF model, by utilizing the ratio of the number of occurrences of a given word in the text compared to the whole of the conversational data set, but it also takes into account the specificity of the term, i.e., how broad or specific it is. Furthermore, the model then aggregates these features into a score using a pre-trained function curve. In parallel, the bidirectional transformer model, which was fine tuned on the task of extracting keywords, provides a deep semantic understanding of the text, enabling it to extract precise context-aware keywords.

The TF-IDF approach is conservative in the sense that it is prone to finding uncommon keywords in the text (high bias), while the drawback for the bidirectional transformer model is the high variance of the possible keywords that can be extracted. But when used together, these two models complement each other, forming a balanced bias-variance tradeoff.

Once the keyword scores are retrieved from both models, we normalize and combine them by utilizing NLP heuristics (e.g., the weighted average), removing duplicates across sections, and eliminating stop words and verbs. The output of this process is an ordered list of suggested keywords for each of the sections.

Rating A Section’s Importance
The next task is to determine which sections should be highlighted as informative and unique. To solve this task, we again combine the two models mentioned above, which yield two distinct importance scores for each of the sections. We compute the first score by taking the TF-IDF scores of all the keywords in the section and weighting them by their respective number of appearances in the section, followed by a summation of these individual keyword scores. We compute the second score by running the section text through the bidirectional transformer model, which was also trained on the sections rating task. The scores from both models are normalized and then combined to yield the section score.

Smart Scrolling pipeline architecture

Some Challenges
A significant challenge in the development of Smart Scrolling was how to identify whether a section or keyword is important - what is of great importance to one person can be of less importance to another. The key was to highlight sections only when it is possible to extract helpful keywords from them.

To do this, we configured the solution to select the top scored sections that also have highly rated keywords, with the number of sections highlighted proportional to the length of the recording. In the context of the Smart Scrolling features, a keyword was more highly rated if it better represented the unique information of the section.

To train the model to understand this criteria, we needed to prepare a labeled training dataset tailored to this task. In collaboration with a team of skilled raters, we applied this labeling objective to a small batch of examples to establish an initial dataset in order to evaluate the quality of the labels and instruct the raters in cases where there were deviations from what was intended. Once the labeling process was complete we reviewed the labeled data manually and made corrections to the labels as necessary to align them with our definition of importance.

Using this limited labeled dataset, we ran automated model evaluations to establish initial metrics on model quality, which were used as a less-accurate proxy to the model quality, enabling us to quickly assess the model performance and apply changes in the architecture and heuristics. Once the solution metrics were satisfactory, we utilized a more accurate manual evaluation process over a closed set of carefully chosen examples that represented expected Recorder use cases. Using these examples, we tweaked the model heuristics parameters to reach the desired level of performance using a reliable model quality evaluation.

Runtime Improvements
After the initial release of Recorder, we conducted a series of user studies to learn how to improve the usability and performance of the Smart Scrolling feature. We found that many users expect the navigational keywords and highlighted sections to be available as soon as the recording is finished. Because the computation pipeline described above can take a considerable amount of time to compute on long recordings, we devised a partial processing solution that amortizes this computation over the whole duration of the recording. During recording, each section is processed as soon as it is captured, and then the intermediate results are stored in memory. When the recording is done, Recorder aggregates the intermediate results.

When running on a Pixel 5, this approach reduced the average processing time of an hour long recording (~9K words) from 1 minute 40 seconds to only 9 seconds, while outputting the same results.

Summary
The goal of Recorder is to improve users’ ability to access their recorded content and navigate it with ease. We have already made substantial progress in this direction with the existing ML features that automatically suggest title words for recordings and enable users to search recordings for sounds and text. Smart Scrolling provides additional text navigation abilities that will further improve the utility of Recorder, enabling users to rapidly surface sections of interest, even for long recordings.

Acknowledgments
Bin Zhang, Sherry Lin, Isaac Blankensmith, Henry Liu‎, Vincent Peng‎, Guilherme Santos‎, Tiago Camolesi, Yitong Lin, James Lemieux, Thomas Hall‎, Kelly Tsai‎, Benny Schlesinger, Dror Ayalon, Amit Pitaru, Kelsie Van Deman, Console Chen, Allen Su, Cecile Basnage, Chorong Johnston‎, Shenaz Zack, Mike Tsao, Brian Chen, Abhinav Rastogi, Tracy Wu, Yvonne Yang‎.

Source: Google AI Blog


Measuring Gendered Correlations in Pre-trained NLP Models

Natural language processing (NLP) has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, e.g., Wikipedia, by repeatedly masking out words and trying to predict them (this is called masked language modeling). The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. There is then a second training stage, fine-tuning, in which the model uses task-specific training data to learn how to use the general pre-trained representations to do a concrete task, like classification. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream, to ensure the application of these models aligns with our AI Principles.

In “Measuring and Reducing Gendered Correlations in Pre-trained Models” we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models. We present experimental results over public model checkpoints and an academic task dataset to illustrate how the best practices apply, providing a foundation for exploring settings beyond the scope of this case study. We will soon release a series of checkpoints, Zari1, which reduce gendered correlations while maintaining state-of-the-art accuracy on standard NLP task metrics.

Measuring Correlations
To understand how correlations in pre-trained representations can affect downstream task performance, we apply a diverse set of evaluation metrics for studying the representation of gender. Here, we’ll discuss results from one of these tests, based on coreference resolution, which is the capability that allows models to understand the correct antecedent to a given pronoun in a sentence. For example, in the sentence that follows, the model should recognize his refers to the nurse, and not to the patient.

The standard academic formulation of the task is the OntoNotes test (Hovy et al., 2006), and we measure how accurate a model is at coreference resolution in a general setting using an F1 score over this data (as in Tenney et al. 2019). Since OntoNotes represents only one data distribution, we also consider the WinoGender benchmark that provides additional, balanced data designed to identify when model associations between gender and profession incorrectly influence coreference resolution. High values of the WinoGender metric (close to one) indicate a model is basing decisions on normative associations between gender and profession (e.g., associating nurse with the female gender and not male). When model decisions have no consistent association between gender and profession, the score is zero, which suggests that decisions are based on some other information, such as sentence structure or semantics.

BERT and ALBERT metrics on OntoNotes (accuracy) and WinoGender (gendered correlations). Low values on the WinoGender metric indicate that a model does not preferentially use gendered correlations in reasoning.

In this study, we see that neither the (Large) BERT or ALBERT public model achieves zero score on the WinoGender examples, despite achieving impressive accuracy on OntoNotes (close to 100%). At least some of this is due to models preferentially using gendered correlations in reasoning. This isn’t completely surprising: there are a range of cues available to understand text and it is possible for a general model to pick up on any or all of these. However, there is reason for caution, as it is undesirable for a model to make predictions primarily based on gendered correlations learned as priors rather than the evidence available in the input.

Best Practices
Given that it is possible for unintended correlations in pre-trained model representations to affect downstream task reasoning, we now ask: what can one do to mitigate any risk this poses when developing new NLP models?

  • It is important to measure for unintended correlations: Model quality may be assessed using accuracy metrics, but these only measure one dimension of performance, especially if the test data is drawn from the same distribution as the training data. For example, the BERT and ALBERT checkpoints have accuracy within 1% of each other, but differ by 26% (relative) in the degree to which they use gendered correlations for coreference resolution. This difference might be important for some tasks; selecting a model with low WinoGender score could be desirable in an application featuring texts about people in professions that may not conform to historical social norms, e.g., male nurses.
  • Be careful even when making seemingly innocuous configuration changes: Neural network model training is controlled by many hyperparameters that are usually selected to maximize some training objective. While configuration choices often seem innocuous, we find they can cause significant changes for gendered correlations, both for better and for worse. For example, dropout regularization is used to reduce overfitting by large models. When we increase the dropout rate used for pre-training BERT and ALBERT, we see a significant reduction in gendered correlations even after fine-tuning. This is promising since a simple configuration change allows us to train models with reduced risk of harm, but it also shows that we should be mindful and evaluate carefully when making any change in model configuration.
    Impact of increasing dropout regularization in BERT and ALBERT.
  • There are opportunities for general mitigations: A further corollary from the perhaps unexpected impact of dropout on gendered correlations is that it opens the possibility to use general-purpose methods for reducing unintended correlations: by increasing dropout in our study, we improve how the models reason about WinoGender examples without manually specifying anything about the task or changing the fine-tuning stage at all. Unfortunately, OntoNotes accuracy does start to decline as the dropout rate increases (which we can see in the BERT results), but we are excited about the potential to mitigate this in pre-training, where changes can lead to model improvements without the need for task-specific updates. We explore counterfactual data augmentation as another mitigation strategy with different tradeoffs in our paper.

What’s Next
We believe these best practices provide a starting point for developing robust NLP systems that perform well across the broadest possible range of linguistic settings and applications. Of course these techniques on their own are not sufficient to capture and remove all potential issues. Any model deployed in a real-world setting should undergo rigorous testing that considers the many ways it will be used, and implement safeguards to ensure alignment with ethical norms, such as Google's AI Principles. We look forward to developments in evaluation frameworks and data that are more expansive and inclusive to cover the many uses of language models and the breadth of people they aim to serve.

Acknowledgements
This is joint work with Xuezhi Wang, Ian Tenney, Ellie Pavlick, Alex Beutel, Jilin Chen, Emily Pitler, and Slav Petrov. We benefited greatly throughout the project from discussions with Fernando Pereira, Ed Chi, Dipanjan Das, Vera Axelrod, Jacob Eisenstein, Tulsee Doshi, and James Wexler.



1 Zari is an Afghan Muppet designed to show that ‘a little girl could do as much as everybody else’.

Source: Google AI Blog


Measuring Gendered Correlations in Pre-trained NLP Models

Natural language processing (NLP) has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, e.g., Wikipedia, by repeatedly masking out words and trying to predict them (this is called masked language modeling). The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. There is then a second training stage, fine-tuning, in which the model uses task-specific training data to learn how to use the general pre-trained representations to do a concrete task, like classification. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream, to ensure the application of these models aligns with our AI Principles.

In “Measuring and Reducing Gendered Correlations in Pre-trained Models” we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models. We present experimental results over public model checkpoints and an academic task dataset to illustrate how the best practices apply, providing a foundation for exploring settings beyond the scope of this case study. We will soon release a series of checkpoints, Zari1, which reduce gendered correlations while maintaining state-of-the-art accuracy on standard NLP task metrics.

Measuring Correlations
To understand how correlations in pre-trained representations can affect downstream task performance, we apply a diverse set of evaluation metrics for studying the representation of gender. Here, we’ll discuss results from one of these tests, based on coreference resolution, which is the capability that allows models to understand the correct antecedent to a given pronoun in a sentence. For example, in the sentence that follows, the model should recognize his refers to the nurse, and not to the patient.

The standard academic formulation of the task is the OntoNotes test (Hovy et al., 2006), and we measure how accurate a model is at coreference resolution in a general setting using an F1 score over this data (as in Tenney et al. 2019). Since OntoNotes represents only one data distribution, we also consider the WinoGender benchmark that provides additional, balanced data designed to identify when model associations between gender and profession incorrectly influence coreference resolution. High values of the WinoGender metric (close to one) indicate a model is basing decisions on normative associations between gender and profession (e.g., associating nurse with the female gender and not male). When model decisions have no consistent association between gender and profession, the score is zero, which suggests that decisions are based on some other information, such as sentence structure or semantics.

BERT and ALBERT metrics on OntoNotes (accuracy) and WinoGender (gendered correlations). Low values on the WinoGender metric indicate that a model does not preferentially use gendered correlations in reasoning.

In this study, we see that neither the (Large) BERT or ALBERT public model achieves zero score on the WinoGender examples, despite achieving impressive accuracy on OntoNotes (close to 100%). At least some of this is due to models preferentially using gendered correlations in reasoning. This isn’t completely surprising: there are a range of cues available to understand text and it is possible for a general model to pick up on any or all of these. However, there is reason for caution, as it is undesirable for a model to make predictions primarily based on gendered correlations learned as priors rather than the evidence available in the input.

Best Practices
Given that it is possible for unintended correlations in pre-trained model representations to affect downstream task reasoning, we now ask: what can one do to mitigate any risk this poses when developing new NLP models?

  • It is important to measure for unintended correlations: Model quality may be assessed using accuracy metrics, but these only measure one dimension of performance, especially if the test data is drawn from the same distribution as the training data. For example, the BERT and ALBERT checkpoints have accuracy within 1% of each other, but differ by 26% (relative) in the degree to which they use gendered correlations for coreference resolution. This difference might be important for some tasks; selecting a model with low WinoGender score could be desirable in an application featuring texts about people in professions that may not conform to historical social norms, e.g., male nurses.
  • Be careful even when making seemingly innocuous configuration changes: Neural network model training is controlled by many hyperparameters that are usually selected to maximize some training objective. While configuration choices often seem innocuous, we find they can cause significant changes for gendered correlations, both for better and for worse. For example, dropout regularization is used to reduce overfitting by large models. When we increase the dropout rate used for pre-training BERT and ALBERT, we see a significant reduction in gendered correlations even after fine-tuning. This is promising since a simple configuration change allows us to train models with reduced risk of harm, but it also shows that we should be mindful and evaluate carefully when making any change in model configuration.
    Impact of increasing dropout regularization in BERT and ALBERT.
  • There are opportunities for general mitigations: A further corollary from the perhaps unexpected impact of dropout on gendered correlations is that it opens the possibility to use general-purpose methods for reducing unintended correlations: by increasing dropout in our study, we improve how the models reason about WinoGender examples without manually specifying anything about the task or changing the fine-tuning stage at all. Unfortunately, OntoNotes accuracy does start to decline as the dropout rate increases (which we can see in the BERT results), but we are excited about the potential to mitigate this in pre-training, where changes can lead to model improvements without the need for task-specific updates. We explore counterfactual data augmentation as another mitigation strategy with different tradeoffs in our paper.

What’s Next
We believe these best practices provide a starting point for developing robust NLP systems that perform well across the broadest possible range of linguistic settings and applications. Of course these techniques on their own are not sufficient to capture and remove all potential issues. Any model deployed in a real-world setting should undergo rigorous testing that considers the many ways it will be used, and implement safeguards to ensure alignment with ethical norms, such as Google's AI Principles. We look forward to developments in evaluation frameworks and data that are more expansive and inclusive to cover the many uses of language models and the breadth of people they aim to serve.

Acknowledgements
This is joint work with Xuezhi Wang, Ian Tenney, Ellie Pavlick, Alex Beutel, Jilin Chen, Emily Pitler, and Slav Petrov. We benefited greatly throughout the project from discussions with Fernando Pereira, Ed Chi, Dipanjan Das, Vera Axelrod, Jacob Eisenstein, Tulsee Doshi, and James Wexler.



1 Zari is an Afghan Muppet designed to show that ‘a little girl could do as much as everybody else’.

Source: Google AI Blog


Measuring Gendered Correlations in Pre-trained NLP Models

Natural language processing (NLP) has seen significant progress over the past several years, with pre-trained models like BERT, ALBERT, ELECTRA, and XLNet achieving remarkable accuracy across a variety of tasks. In pre-training, representations are learned from a large text corpus, e.g., Wikipedia, by repeatedly masking out words and trying to predict them (this is called masked language modeling). The resulting representations encode rich information about language and correlations between concepts, such as surgeons and scalpels. There is then a second training stage, fine-tuning, in which the model uses task-specific training data to learn how to use the general pre-trained representations to do a concrete task, like classification. Given the broad adoption of these representations in many NLP tasks, it is crucial to understand the information encoded in them and how any learned correlations affect performance downstream, to ensure the application of these models aligns with our AI Principles.

In “Measuring and Reducing Gendered Correlations in Pre-trained Models” we perform a case study on BERT and its low-memory counterpart ALBERT, looking at correlations related to gender, and formulate a series of best practices for using pre-trained language models. We present experimental results over public model checkpoints and an academic task dataset to illustrate how the best practices apply, providing a foundation for exploring settings beyond the scope of this case study. We will soon release a series of checkpoints, Zari1, which reduce gendered correlations while maintaining state-of-the-art accuracy on standard NLP task metrics.

Measuring Correlations
To understand how correlations in pre-trained representations can affect downstream task performance, we apply a diverse set of evaluation metrics for studying the representation of gender. Here, we’ll discuss results from one of these tests, based on coreference resolution, which is the capability that allows models to understand the correct antecedent to a given pronoun in a sentence. For example, in the sentence that follows, the model should recognize his refers to the nurse, and not to the patient.

The standard academic formulation of the task is the OntoNotes test (Hovy et al., 2006), and we measure how accurate a model is at coreference resolution in a general setting using an F1 score over this data (as in Tenney et al. 2019). Since OntoNotes represents only one data distribution, we also consider the WinoGender benchmark that provides additional, balanced data designed to identify when model associations between gender and profession incorrectly influence coreference resolution. High values of the WinoGender metric (close to one) indicate a model is basing decisions on normative associations between gender and profession (e.g., associating nurse with the female gender and not male). When model decisions have no consistent association between gender and profession, the score is zero, which suggests that decisions are based on some other information, such as sentence structure or semantics.

BERT and ALBERT metrics on OntoNotes (accuracy) and WinoGender (gendered correlations). Low values on the WinoGender metric indicate that a model does not preferentially use gendered correlations in reasoning.

In this study, we see that neither the (Large) BERT or ALBERT public model achieves zero score on the WinoGender examples, despite achieving impressive accuracy on OntoNotes (close to 100%). At least some of this is due to models preferentially using gendered correlations in reasoning. This isn’t completely surprising: there are a range of cues available to understand text and it is possible for a general model to pick up on any or all of these. However, there is reason for caution, as it is undesirable for a model to make predictions primarily based on gendered correlations learned as priors rather than the evidence available in the input.

Best Practices
Given that it is possible for unintended correlations in pre-trained model representations to affect downstream task reasoning, we now ask: what can one do to mitigate any risk this poses when developing new NLP models?

  • It is important to measure for unintended correlations: Model quality may be assessed using accuracy metrics, but these only measure one dimension of performance, especially if the test data is drawn from the same distribution as the training data. For example, the BERT and ALBERT checkpoints have accuracy within 1% of each other, but differ by 26% (relative) in the degree to which they use gendered correlations for coreference resolution. This difference might be important for some tasks; selecting a model with low WinoGender score could be desirable in an application featuring texts about people in professions that may not conform to historical social norms, e.g., male nurses.
  • Be careful even when making seemingly innocuous configuration changes: Neural network model training is controlled by many hyperparameters that are usually selected to maximize some training objective. While configuration choices often seem innocuous, we find they can cause significant changes for gendered correlations, both for better and for worse. For example, dropout regularization is used to reduce overfitting by large models. When we increase the dropout rate used for pre-training BERT and ALBERT, we see a significant reduction in gendered correlations even after fine-tuning. This is promising since a simple configuration change allows us to train models with reduced risk of harm, but it also shows that we should be mindful and evaluate carefully when making any change in model configuration.
    Impact of increasing dropout regularization in BERT and ALBERT.
  • There are opportunities for general mitigations: A further corollary from the perhaps unexpected impact of dropout on gendered correlations is that it opens the possibility to use general-purpose methods for reducing unintended correlations: by increasing dropout in our study, we improve how the models reason about WinoGender examples without manually specifying anything about the task or changing the fine-tuning stage at all. Unfortunately, OntoNotes accuracy does start to decline as the dropout rate increases (which we can see in the BERT results), but we are excited about the potential to mitigate this in pre-training, where changes can lead to model improvements without the need for task-specific updates. We explore counterfactual data augmentation as another mitigation strategy with different tradeoffs in our paper.

What’s Next
We believe these best practices provide a starting point for developing robust NLP systems that perform well across the broadest possible range of linguistic settings and applications. Of course these techniques on their own are not sufficient to capture and remove all potential issues. Any model deployed in a real-world setting should undergo rigorous testing that considers the many ways it will be used, and implement safeguards to ensure alignment with ethical norms, such as Google's AI Principles. We look forward to developments in evaluation frameworks and data that are more expansive and inclusive to cover the many uses of language models and the breadth of people they aim to serve.

Acknowledgements
This is joint work with Xuezhi Wang, Ian Tenney, Ellie Pavlick, Alex Beutel, Jilin Chen, Emily Pitler, and Slav Petrov. We benefited greatly throughout the project from discussions with Fernando Pereira, Ed Chi, Dipanjan Das, Vera Axelrod, Jacob Eisenstein, Tulsee Doshi, and James Wexler.



1 Zari is an Afghan Muppet designed to show that ‘a little girl could do as much as everybody else’.

Source: Google AI Blog


Language-Agnostic BERT Sentence Embedding

A multilingual embedding model is a powerful tool that encodes text from different languages into a shared embedding space, enabling it to be applied to a range of downstream tasks, like text classification, clustering, and others, while also leveraging semantic information for language understanding. Existing approaches for generating such embeddings, like LASER or m~USE, rely on parallel data, mapping a sentence from one language directly to another language in order to encourage consistency between the sentence embeddings. While these existing multilingual approaches yield good overall performance across a number of languages, they often underperform on high-resource languages compared to dedicated bilingual models, which can leverage approaches like translation ranking tasks with translation pairs as training data to obtain more closely aligned representations. Further, due to limited model capacity and the often poor quality of training data for low-resource languages, it can be difficult to extend multilingual models to support a larger number of languages while maintaining good performance.

Illustration of a multilingual embedding space.

Recent efforts to improve language models include the development of masked language model (MLM) pre-training, such as that used by BERT, ALBERT and RoBERTa. This approach has led to exceptional gains across a wide range of languages and a variety of natural language processing tasks since it only requires monolingual text. In addition, MLM pre-training has been extended to the multilingual setting by modifying MLM training to include concatenated translation pairs, known as translation language modeling (TLM), or by simply introducing pre-training data from multiple languages. However, while the internal model representations learned during MLM and TLM training are helpful when fine-tuning on downstream tasks, without a sentence level objective, they do not directly produce sentence embeddings, which are critical for translation tasks.

In “Language-agnostic BERT Sentence Embedding”, we present a multilingual BERT embedding model, called LaBSE, that produces language-agnostic cross-lingual sentence embeddings for 109 languages. The model is trained on 17 billion monolingual sentences and 6 billion bilingual sentence pairs using MLM and TLM pre-training, resulting in a model that is effective even on low-resource languages for which there is no data available during training. Further, the model establishes a new state of the art on multiple parallel text (a.k.a. bitext) retrieval tasks. We have released the pre-trained model to the community through tfhub, which includes modules that can be used as-is or can be fine-tuned using domain-specific data.

The collection of the training data for 109 supported languages

The Model
In previous work, we proposed the use of a translation ranking task to learn a multilingual sentence embedding space. This approach tasks the model with ranking the true translation over a collection of sentences in the target language, given a sentence in the source language. The translation ranking task is trained using a dual encoder architecture with a shared transformer encoder. The resulting bilingual models achieved state-of-the-art performance on multiple parallel text retrieval tasks (including United Nations and BUCC). However, the model suffered when the bi-lingual models were extended to support multiple languages (16 languages, in our test case) due to limitations in model capacity, vocabulary coverage, training data quality and more.

Translation ranking task. Given a sentence in a given source language, the task is to find the true translation over a collection of sentences in the target language.

For LaBSE, we leverage recent advances on language model pre-training, including MLM and TLM, on a BERT-like architecture and follow this with fine-tuning on a translation ranking task. A 12-layer transformer with a 500k token vocabulary pre-trained using MLM and TLM on 109 languages is used to increase the model and vocabulary coverage. The resulting LaBSE model offers extended support to 109 languages in a single model.

The dual encoder architecture, in which the source and target text are encoded using a shared transformer embedding network separately. The translation ranking task is applied, forcing the text that paraphrases each other to have similar representations. The transformer embedding network is initialized from a BERT checkpoint trained on MLM and TLM tasks.

Performance on Cross-lingual Text Retrieval
We evaluate the proposed model using the Tatoeba corpus, a dataset consisting of up to 1,000 English-aligned sentence pairs for 112 languages. For more than 30 of the languages in the dataset, the model has no training data. The model is tasked with finding the nearest neighbor translation for a given sentence, which it calculates using the cosine distance.

To understand the performance of the model for languages at the head or tail of the training data distribution, we divide the set of languages into several groups and compute the average accuracy for each set. The first 14-language group is selected from the languages supported by m~USE, which cover the languages from the head of the distribution (head languages). We also evaluate a second language group composed of 36 languages from the XTREME benchmark. The third 82-language group, selected from the languages covered by the LASER training data, includes many languages from the tail of the distribution (tail languages). Finally, we compute the average accuracy for all languages.

The table below presents the average accuracy achieved by LaBSE, compared to the m~USE and LASER models, for each language group. As expected, all models perform strongly on the 14-language group that covers most head languages. With more languages included, the averaged accuracy for both LASER and LaBSE declines. However, the reduction in accuracy from the LaBSE model with increasing numbers of languages is much less significant, outperforming LASER significantly, particularly when the full distribution of 112 languages is included (83.7% accuracy vs. 65.5%).

Model 14 Langs 36 Langs 82 Langs All Langs
m~USE* 93.9
LASER 95.3 84.4 75.9 65.5
LaBSE 95.3 95.0 87.3 83.7
Average Accuracy (%) on Tatoeba Datasets. The “14 Langs” group consists of languages supported by m~USE; the “36 Langs” group includes languages selected by XTREME; and the “82 Langs” group represents languages covered by the LASER model. The “All Langs” group includes all languages supported by Taoteba.
* The m~USE model comes in two varieties, one built on a convolutional neural network architecture and the other a Transformer-like architecture. Here, we compare only to the Transformer version.

Support to Unsupported Languages
The average performance of all languages included in Tatoeba is very promising. Interestingly, LaBSE even performs relatively well for many of the 30+ Tatoeba languages for which it has no training data (see below). For one third of these languages the LaBSE accuracy is higher than 75% and only 8 have accuracy lower than 25%, indicating very strong transfer performance to languages without training data. Such positive language transfer is only possible due to the massively multilingual nature of LaBSE.

LaBSE accuracy for the subset of Tatoeba languages (represented with ISO 639-1/639-2 codes) for which there was no training data.

Mining Parallel Text from WebLaBSE can be used for mining parallel text (bi-text) from web-scale data. For example, we applied LaBSE to CommonCrawl, a large-scale monolingual corpus, to process 560 million Chinese and 330 million German sentences for the extraction of parallel text. Each Chinese and German sentence pair is encoded using the LaBSE model and then the encoded embedding is used to find a potential translation from a pool of 7.7 billion English sentences pre-processed and encoded by the model. An approximate nearest neighbor search is employed to quickly search through the high-dimensional sentence embeddings. After a simple filtering, the model returns 261M and 104M potential parallel pairs for English-Chinese and English-German, respectively. The trained NMT model using the mined data reaches BLEU scores of 35.7 and 27.2 on the WMT translation tasks (wmt17 for English-to-Chinese and wmt14 for English-to-German). The performance is only a few points away from current state-of-art-models trained on high quality parallel data.

ConclusionWe're excited to share this research, and the model, with the community. The pre-trained model is released at tfhub to support further research on this direction and possible downstream applications. We also believe that what we're showing here is just the beginning, and there are more important research problems to be addressed, such as building better models to support all languages.

AcknowledgementsThe core team includes Wei Wang, Naveen Arivazhagan, Daniel Cer. We would like to thank the Google Research Language team, along with our partners in other Google groups for their feedback and suggestions. Special thanks goes to Sidharth Mudgal, and Jax Law for help with data processing; as well as Jialu Liu, Tianqi Liu, Chen Chen, and Anosh Raj for help on BERT pre-training.

Source: Google AI Blog


SmartReply for YouTube Creators



It has been more than 4 years since SmartReply was launched, and since then, it has expanded to more users with the Gmail launch and Android Messages and to more devices with Android Wear. Developers now use SmartReply to respond to reviews within the Play Developer Console and can set up their own versions using APIs offered within MLKit and TFLite. With each launch there has been a unique challenge in modeling and serving that required customizing SmartReply for the task requirements.

We are now excited to share an updated SmartReply built for YouTube and implemented in YouTube Studio that helps creators engage more easily with their viewers. This model learns comment and reply representation through a computationally efficient dilated self-attention network, and represents the first cross-lingual and character byte-based SmartReply model. SmartReply for YouTube is currently available for English and Spanish creators, and this approach simplifies the process of extending the SmartReply feature to many more languages in the future.
YouTube creators receive a large volume of responses to their videos. Moreover, the community of creators and viewers on YouTube is diverse, as reflected by the creativity of their comments, discussions and videos. In comparison to emails, which tend to be long and dominated by formal language, YouTube comments reveal complex patterns of language switching, abbreviated words, slang, inconsistent usage of punctuation, and heavy utilization of emoji. Following is a sample of comments that illustrate this challenge:
Deep Retrieval
The initial release of SmartReply for Inbox encoded input emails word-by-word with a recurrent neural network, and then decoded potential replies with yet another word-level recurrent neural network. Despite the expressivity of this approach, it was computationally expensive. Instead, we found that one can achieve the same ends by designing a system that searches through a predefined list of suggestions for the most appropriate response.

This retrieval system encoded the message and its suggestion independently. First, the text was preprocessed to extract words and short phrases. This preprocessing included, but was not limited to, language identification, tokenization, and normalization. Two neural networks then simultaneously and independently encoded the message and the suggestion. This factorization allowed one to pre-compute the suggestion encodings and then search through the set of suggestions using an efficient maximum inner product search data structure. This deep retrieval approach enabled us to expand SmartReply to Gmail and since then, it has been the foundation for several SmartReply systems including the current YouTube system.

Beyond Words
The previous SmartReply systems described above relied on word level preprocessing that is well tuned for a limited number of languages and narrow genres of writing. Such systems face significant challenges in the YouTube case, where a typical comment might include heterogeneous content, like emoji, ASCII art, language switching, etc. In light of this, and taking inspiration from our recent work on byte and character language modeling, we decided to encode the text without any preprocessing. This approach is supported by research demonstrating that a deep Transformer network is able to model words and phrases from the ground up just by feeding it text as a sequence of characters or bytes, with comparable quality to word-based models.

Although initial results were promising, especially for processing comments with emoji or typos, the inference speed was too slow for production due to the fact that character sequences are longer than word equivalents and the computational complexity of self-attention layers grows quadratically as a function of sequence length. We found that shrinking the sequence length by applying temporal reduction layers at each layer of the network, similar to the dilation technique applied in WaveNet, provides a good trade-off between computation and quality.

The figure below presents a dual encoder network that encodes both the comment and the reply to maximize the mutual information between their latent representations by training the network with a contrastive objective. The encoding starts with feeding the transformer a sequence of bytes after they have been embedded. The input for each subsequent layer will be reduced by dropping a percentage of characters at equal offsets. After applying several transformer layers the sequence length is greatly truncated, significantly reducing the computational complexity. This sequence compression scheme could be substituted by other operators such as average pooling, though we did not notice any gains from more sophisticated methods, and therefore, opted to use dilation for simplicity.
A dual encoder network that maximizes the mutual information between the comments and their replies through a contrastive objective. Each encoder is fed a sequence of bytes and is implemented as a computationally efficient dilated transformer network.
A Model to Learn Them All
Instead of training a separate model for each language, we opted to train a single cross-lingual model for all supported languages. This allows the support of mixed-language usage in the comments, and enables the model to utilize the learning of common elements in one language for understanding another, such as emoji and numbers. Moreover, having a single model simplifies the logistics of maintenance and updates. While the model has been rolled out to English and Spanish, the flexibility inherent in this approach will enable it to be expanded to other languages in the future.

Inspecting the encodings of a multilingual set of suggestions produced by the model reveals that the model clusters appropriate replies, regardless of the language to which they belong. This cross-lingual capability emerged without exposing the model during training to any parallel corpus. We demonstrate in the figure below for three languages how the replies are clustered by their meaning when the model is probed with an input comment. For example, the English comment “This is a great video,” is surrounded by appropriate replies, such as “Thanks!” Moreover, inspection of the nearest replies in other languages reveal them also to be appropriate and similar in meaning to the English reply. The 2D projection also shows several other cross-lingual clusters that consist of replies of similar meaning. This clustering demonstrates how the model can support a rich cross-lingual user experience in the supported languages.
A 2D projection of the model encodings when presented with a hypothetical comment and a small list of potential replies. The neighborhood surrounding English comments (black color) consists of appropriate replies in English and their counterparts in Spanish and Arabic. Note that the network learned to align English replies with their translations without access to any parallel corpus.
When to Suggest?
Our goal is to help creators, so we have to make sure that SmartReply only makes suggestions when it is very likely to be useful. Ideally, suggestions would only be displayed when it is likely that the creator would reply to the comment and when the model has a high chance of providing a sensible and specific response. To accomplish this, we trained auxiliary models to identify which comments should trigger the SmartReply feature.

Conclusion
We’ve launched YouTube SmartReply, starting with English and Spanish comments, the first cross-lingual and character byte-based SmartReply. YouTube is a global product with a diverse user base that generates heterogeneous content. Consequently, it is important that we continuously improve comments for this global audience, and SmartReply represents a strong step in this direction.

Acknowledgements
SmartReply for YouTube creators was developed by Golnaz Farhadi, Ezequiel Baril, Cheng Lee, Claire Yuan, Coty Morrison‎, Joe Simunic‎, Rachel Bransom‎, Rajvi Mehta, Jorge Gonzalez‎, Mark Williams, Uma Roy and many more. We are grateful for the leadership support from Nikhil Dandekar, Eileen Long, Siobhan Quinn, Yun-hsuan Sung, Rachel Bernstein, and Ray Kurzweil.

Source: Google AI Blog


Evaluating Natural Language Generation with BLEURT



In the last few years, research in natural language generation (NLG) has made tremendous progress, with models now able to translate text, summarize articles, engage in conversation, and comment on pictures with unprecedented accuracy, using approaches with increasingly high levels of sophistication. Currently, there are two methods to evaluate these NLG systems: human evaluation and automatic metrics. With human evaluation, one runs a large-scale quality survey for each new version of a model using human annotators, but that approach can be prohibitively labor intensive. In contrast, one can use popular automatic metrics (e.g., BLEU), but these are oftentimes unreliable substitutes for human interpretation and judgement. The rapid progress of NLG and the drawbacks of existing evaluation methods calls for the development of novel ways to assess the quality and success of NLG systems.

In “BLEURT: Learning Robust Metrics for Text Generation” (presented during ACL 2020), we introduce a novel automatic metric that delivers ratings that are robust and reach an unprecedented level of quality, much closer to human annotation. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) builds upon recent advances in transfer learning to capture widespread linguistic phenomena, such as paraphrasing. The metric is available on Github.

Evaluating NLG Systems
In human evaluation, a piece of generated text is presented to annotators, who are tasked with assessing its quality with respect to its fluency and meaning. The text is typically shown side-by-side with a reference, authored by a human or mined from the Web.
An example questionnaire used for human evaluation in machine translation.
The advantage of this method is that it is accurate: people are still unrivaled when it comes to evaluating the quality of a piece of text. However, this method of evaluation can easily take days and involve dozens of people for just a few thousand examples, which disrupts the model development workflow.

In contrast, the idea behind automatic metrics is to provide a cheap, low-latency proxy for human-quality measurements. Automatic metrics often take two sentences as input, a candidate and a reference, and they return a score that indicates to what extent the former resembles the latter, typically using lexical overlap. A popular metric is BLEU, which counts the sequences of words in the candidate that also appear in the reference (the BLEU score is very similar to precision).

The advantages and weaknesses of automatic metrics are the opposite of those that come with human evaluation. Automatic metrics are convenient — they can be computed in real-time throughout the training process (e.g., for plotting with Tensorboard). However, they are often inaccurate due to their focus on surface-level similarities and they fail to capture the diversity of human language. Frequently, there are many perfectly valid sentences that can convey the same meaning. Overlap-based metrics that rely exclusively on lexical matches unfairly reward those that resemble the reference in their surface form, even if they do not accurately capture meaning, and penalize other paraphrases.
BLEU scores for three candidate sentences. Candidate 2 is semantically close to the reference, and yet its score is lower than Candidate 3.
Ideally, an evaluation method for NLG should combine the advantages of both human evaluation and automatic metrics — it should be relatively cheap to compute, but flexible enough to cope with linguistic diversity.

Introducing BLEURT
BLEURT is a novel, machine learning-based automatic metric that can capture non-trivial semantic similarities between sentences. It is trained on a public collection of ratings (the WMT Metrics Shared Task dataset) as well as additional ratings provided by the user.
Three candidate sentences rated by BLEURT. BLEURT captures that candidate 2 is similar to the reference, even though it contains more non-reference words than candidate 3.
Creating a metric based on machine learning poses a fundamental challenge: the metric should do well consistently on a wide range of tasks and domains, and over time. However, there is only a limited amount of training data. Indeed, public data is sparse — the WMT Metrics Task dataset, the largest collection of human ratings at the time of writing, contains ~260K human ratings covering the news domain only. This is too limited to train a metric suited for the evaluation of NLG systems of the future.

To address this problem, we employ transfer learning. First, we use the contextual word representations of BERT, a state-of-the-art unsupervised representation learning method for language understanding that has already been successfully incorporated into NLG metrics (e.g., YiSi or BERTscore).

Second, we introduce a novel pre-training scheme to increase BLEURT's robustness. Our experiments reveal that training a regression model directly over publicly available human ratings is a brittle approach, since we cannot control in what domain and across what time span the metric will be used. The accuracy is likely to drop in the presence of domain drift, i.e., when the text used comes from a different domain than the training sentence pairs. It may also drop when there is a quality drift, when the ratings to be predicted are higher than those used during training — a feature which would normally be good news because it indicates that ML research is making progress.

The success of BLEURT relies on “warming-up” the model using millions of synthetic sentence pairs before fine-tuning on human ratings. We generated training data by applying random perturbations to sentences from Wikipedia. Instead of collecting human ratings, we use a collection of metrics and models from the literature (including BLEU), which allows the number of training examples to be scaled up at very low cost.
BLEURT's data generation process combines random perturbations and scoring with pre-existing metrics and models.
Experiments reveal that pre-training significantly increases BLEURT's accuracy, especially when the test data is out-of-distribution.

We pre-train BLEURT twice, first with a language modelling objective (as explained in the original BERT paper), then with a collection of NLG evaluation objectives. We then fine-tune the model on the WMT Metrics dataset, on a set of ratings provided by the user, or a combination of both.The following figure illustrates BLEURT's training procedure end-to-end.

Results
We benchmark BLEURT against competing approaches and show that it offers superior performance, correlating well with human ratings on the WMT Metrics Shared Task (machine translation) and the WebNLG Challenge (data-to-text). For example, BLEURT is ~48% more accurate than BLEU on the WMT Metrics Shared Task of 2019. We also demonstrate that pre-training helps BLEURT cope with quality drift.
Correlation between different metrics and human ratings on the WMT'19 Metrics Shared Task.
Conclusion
As NLG models have gotten better over time, evaluation metrics have become an important bottleneck for the research in this field. There are good reasons why overlap-based metrics are so popular: they are simple, consistent, and they do not require any training data. In the use cases where multiple reference sentences are available for each candidate, they can be very accurate. While they play a critical part in our infrastructure, they are also very conservative, and only give an incomplete picture of NLG systems' performance. Our view is that ML engineers should enrich their evaluation toolkits with more flexible, semantic-level metrics.

BLEURT is our attempt to capture NLG quality beyond surface overlap. Thanks to BERT's representations and a novel pre-training scheme, our metric yields SOTA performance on two academic benchmarks, and we are currently investigating how it can improve Google products. Future research includes investigating multilinguality and multimodality.

Acknowledgements
This project was co-advised by Dipanjan Das. We thank Slav Petrov, Eunsol Choi, Nicholas FitzGerald, Jacob Devlin, Madhavan Kidambi, Ming-Wei Chang, and all the members of the Google Research Language team.

Source: Google AI Blog


Using Neural Networks to Find Answers in Tables



Much of the world’s information is stored in the form of tables, which can be found on the web or in databases and documents. These might include anything from technical specifications of consumer products to financial and country development statistics, sports results and much more. Currently, one needs to manually look at these tables to find the answer to a question or rely on a service that gives answers to specific questions (e.g., about sports results). This information would be much more accessible and useful if it could be queried through natural language.

For example, the following figure shows a table with a number of questions that people might want to ask. The answer to these questions might be found in one, or multiple, cells in a table (“Which wrestler had the most number of reigns?”), or might require aggregating multiple table cells (“How many world champions are there with only one reign?”).
A table and questions with the expected answers. Answers can be selected (#1, #4) or computed (#2, #3).
Many recent approaches apply traditional semantic parsing to this problem, where a natural language question is translated to an SQL-like database query that executes against a database to provide the answers. For example, the question “How many world champions are there with only one reign?” would be mapped to a query such as “select count(*) where column("No. of reigns") == 1;” and then executed to produce the answer. This approach often requires substantial engineering in order to generate syntactically and semantically valid queries and is difficult to scale to arbitrary questions rather than questions about very specific tables (such as sports results).

In, “TAPAS: Weakly Supervised Table Parsing via Pre-training”, accepted at ACL 2020, we take a different approach that extends the BERT architecture to encode the question jointly along with tabular data structure, resulting in a model that can then point directly to the answer. Instead of creating a model that works only for a single style of table, this approach results in a model that can be applied to tables from a wide range of domains. After pre-training on millions of Wikipedia tables, we show that our approach exhibits competitive accuracy on three academic table question-answering (QA) datasets. Additionally, in order to facilitate more exciting research in this area, we have open-sourced the code for training and testing the models as well as the models pre-trained on Wikipedia tables, available at our GitHub repo.

How to Process a Question
To process a question such as “Average time as champion for top 2 wrestlers?”, our model jointly encodes the question as well as the table content row by row using a BERT model that is extended with special embeddings to encode the table structure.

The key addition to the transformer-based BERT model are the extra embeddings that are used to encode the structured input. We rely on learned embeddings for the column index, the row index and one special rank index, which indicates the order of elements in numerical columns. The following image shows how all of these are added together at the input and fed into the transformer layers. The figure below illustrates how the question is encoded, together with the small table shown on the left. Each cell token has a special embedding that indicates its row, column and numeric rank within the column.
BERT layer input: Every input token is represented as the sum of the embeddings of its word, absolute position, segment (whether it belongs to the question or table), column and row and numeric rank (the position the cell would have if the column was sorted by its numeric values).
The model has two outputs: 1) for each table cell, a score indicates the probability that this cell will be part of the answer and 2) an aggregation operation that indicates which operation (if any) is applied to produce the final answer. The following figure shows how, for the question “Average time as champion for top 2 wrestlers?”, the model should select the first two cells of the “Combined days” column and the “AVERAGE” operation with high probability.
Model schematic: The BERT layer encodes both the question and table. The model outputs a probability for every aggregation operation and a selection probability for every table cell. For the question “Average time as champion for top 2 wrestlers?” the AVERAGE operation and the cells with the numbers 3,749 and 3,103 should have a high probability.
Pre-training
Using a method similar to how BERT is trained on text, we pre-trained our model on 6.2 million table-text pairs extracted from the English Wikipedia. During pre-training, the model learns to restore words in both table and text that have been replaced with a mask. We find that the model can do this with relatively high accuracy (71.4% of the masked tokens are restored correctly for tables unseen during training).

Learning from Answers Only
During fine-tuning the model learns how to answer questions from a table. This is done through training with either strong or weak supervision. In the case of strong supervision, for a given table and questions, one must provide the cells and aggregation operation to select (e.g., sum or count), a time-consuming and laborious process. More commonly, one trains using weak supervision, where only the correct answer (e.g., 3426 for the question in the example above) is provided. In this case, the model attempts to find an aggregation operation and cells that produce an answer close to the correct answer. This is done by computing the expectation over all the possible aggregation decisions and comparing it with the true result. The weak supervision scenario is beneficial because it allows for non-experts to provide the data needed to train the model and takes less time than strong supervision.

Results
We applied our model to three datasets — SQA, WikiTableQuestions (WTQ) and WikiSQL — and compared it to the performance of the top three state-of-the-art (SOTA) models for parsing tabular data. The comparison models included Min et al (2019) for WikiSQL, Wang et al. (2019) for WTQ and our own previous work for SQA (Mueller et al., 2019). For all datasets, we report the answer accuracy on the test sets for the weakly supervised training setup. For SQA and WIkiSQL we used our base model pre-trained on Wikipedia, while for WTQ, we found it beneficial to additionally pre-train on the SQA data. Our best models outperform the previous SOTA for SQA by more than 12 points, the previous SOTA for WTQ by more than 4 points and performs similarly to the best model published on WikiSQL.
Test answer accuracy for the weakly-supervised setup on three academic TableQA datasets.
Acknowledgements
This work was carried out by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos in the Google AI language group in Zurich. We would like to thank Yasemin Altun, Srini Narayanan, Slav Petrov, William Cohen, Massimo Nicosia, Syrine Krichene, and Jordan Boyd-Graber for useful comments and suggestions.

Source: Google AI Blog


XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization



One of the key challenges in natural language processing (NLP) is building systems that not only work in English but in all of the world’s ~6,900 languages. Luckily, while most of the world’s languages are data sparse and do not have enough data available to train robust models on their own, many languages do share a considerable amount of underlying structure. On the vocabulary level, languages often have words that stem from the same origin — for instance, “desk” in English and “Tisch” in German both come from the Latin “discus”. Similarly, many languages also mark semantic roles in similar ways, such as the use of postpositions to mark temporal and spatial relations in both Chinese and Turkish.

In NLP, there are a number of methods that leverage the shared structure of multiple languages in training in order to overcome the data sparsity problem. Historically, most of these methods focused on performing a specific task in multiple languages. Over the last few years, driven by advances in deep learning, there has been an increase in the number of approaches that attempt to learn general-purpose multilingual representations (e.g., mBERT, XLM, XLM-R), which aim to capture knowledge that is shared across languages and that is useful for many tasks. In practice, however, the evaluation of such methods has mostly focused on a small set of tasks and for linguistically similar languages.

To encourage more research on multilingual learning, we introduce “XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization”, which covers 40 typologically diverse languages (spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of syntax or semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the Niger-Congo languages Swahili and Yoruba, spoken in Africa. The code and data, including examples for running various baselines, is available here.

XTREME Tasks and Languages
The tasks included in XTREME cover a range of paradigms, including sentence classification, structured prediction, sentence retrieval and question answering. Consequently, in order for models to be successful on the XTREME benchmarks, they must learn representations that generalize to many standard cross-lingual transfer settings.

Tasks supported in the XTREME benchmark.
Each of the tasks covers a subset of the 40 languages. To obtain additional data in the low-resource languages used for analyses in XTREME, the test sets of two representative tasks, natural language inference (XNLI) and question answering (XQuAD), were automatically translated from English to the remaining languages. We show that models using the translated test sets for these tasks exhibited performance comparable to that achieved using human-labelled test sets.

Zero-shot Evaluation
To evaluate performance using XTREME, models must first be pre-trained on multilingual text using objectives that encourage cross-lingual learning. Then, they are fine-tuned on task-specific English data, since English is the most likely language where labelled data is available. XTREME then evaluates these models on zero-shot cross-lingual transfer performance, i.e., on other languages for which no task-specific data was seen. The three-step process, from pre-training to fine-tuning to zero-shot transfer, is shown in the figure below.
The cross-lingual transfer learning process for a given model: pre-training on multilingual text, followed by fine-tuning in English on downstream tasks, and finally zero-shot evaluation with XTREME.
In practice, one of the benefits of this zero-shot setting is computational efficiency — a pre-trained model only needs to be fine-tuned on English data for each task and can then be evaluated directly on other languages. Nevertheless, for tasks where labelled data is available in other languages, we also compare against fine-tuning on in-language data. Finally, we provide a combined score by obtaining the zero-shot scores on all nine XTREME tasks.

A Testbed for Transfer Learning
We conduct experiments with several state-of-the-art pre-trained multilingual models, including: multilingual BERT, a multilingual extension of the popular BERT model; XLM and XLM-R, two larger versions of multilingual BERT that have been trained on even more data; and a massively multilingual machine translation model, M4. A common feature of these models is that they have been pre-trained on large amounts of data from multiple languages. For our experiments, we choose variants of these models that are pre-trained on around 100 languages, including the 40 languages of our benchmark.

We find that while models achieve close to human performance on most existing tasks in English, performance is significantly lower for many of the other languages. Across all models, the gap between English performance and performance for the remaining languages is largest for the structured prediction and question answering tasks, while the spread of results across languages is largest for the structured prediction and sentence retrieval tasks.

For illustration, in the figure below we show the performance of the best-performing model in the zero-shot setting, XLM-R, by task and language, across all language families. The scores across tasks are not comparable, so the main focus should be the relative ranking of languages across tasks. As we can see, many high-resource languages, particularly from the Indo-European language family, are consistently ranked higher. In contrast, the model achieves lower performance on many languages from other language families such as Sino-Tibetan, Japonic, Koreanic, and Niger-Congo languages.
Performance of the best-performing model (XLM-R) across all tasks and languages in XTREME in the zero-shot setting. The reported scores are percentages based on task-specific metrics and are not directly comparable across tasks. Human performance (if available) is represented by a red star. Specific examples from each language family are represented with their ISO 639-1 codes.
In general we made a number of interesting observations.
  • In the zero-shot setting, M4 and mBERT are competitive with XLM-R for most tasks, while the latter outperforms them in the particularly challenging question answering tasks. For example, on XQuAD, XLM-R scored 76.6 compared to 64.5 for mBERT and 64.6 for M4, with similar spreads on MLQA and TyDi QA.
  • We find that baselines utilizing machine translation, which translate either the training data or test data, are very competitive. On the XNLI task, mBERT scored 65.4 in the zero shot transfer setting, and 74.0 when using translated training data.
  • We observe that the few-shot setting (i.e., using limited amounts of in-language labelled data, when available) is particularly competitive for simpler tasks, such as NER, but less useful for the more complex question answering tasks. This can be seen in the performance of mBERT, which improves by 42% on the NER task from 62.2 to 88.3 in the few-shot setting, but for the question answering task (TyDi QA), only improves by 25% (59.7 to 74.5).
  • Overall, a large gap between performance in English and other languages remains across all models and settings, which indicates that there is much potential for research on cross-lingual transfer.
Cross-lingual Transfer Analysis
Similar to previous observations regarding the generalisation ability of deep models, we observe that results improve if more pre-training data is available for a language, e.g., mBERT compared to XLM-R, which has more pre-training data. However, we find that this correlation does not hold for the structured prediction tasks, part-of-speech tagging (POS) and named entity recognition (NER), which indicates that current deep pre-trained models are not able to fully exploit the pre-training data to transfer to such syntactic tasks. We also find that models have difficulties transferring to non-Latin scripts. This is evident on the POS task, where mBERT achieves a zero-shot accuracy of 86.9 on Spanish compared to just 49.2 on Japanese.

For the natural language inference task, XNLI, we find that a model makes the same prediction on a test example in English and on the same example in another language about 70% of the time. Semi-supervised methods might be helpful in encouraging improved consistency between the predictions on examples and their translations in different languages. We also find that models struggle to predict POS tag sequences that were not seen in the English training data on which they were fine-tuned, highlighting that these models struggle to learn the syntax of other languages from the large amounts of unlabelled data used for pre-training. For named entity recognition, models have the most difficulty predicting entities that were not seen in the English training data for distant languages — accuracies on Indonesian and Swahili are 58.0 and 66.6, respectively, compared to 82.3 and 80.1 for Portguese and French.

Making Progress on Multilingual Transfer Learning
English has been the focal point of most recent advances in NLP despite being spoken by only around 15% of the world’s population. We believe that building on deep contextual representations, we now have the tools to make substantial progress on systems that serve the remainder of the world’s languages. We hope that XTREME will catalyze research in multilingual transfer learning, similar to how benchmarks such as GLUE and SuperGLUE have spurred the development of deep monolingual models, including BERT, RoBERTa, XLNet, AlBERT, and others. Stay tuned to our Twitter account for information on our upcoming website launch with a submission portal and leaderboard.

Acknowledgements:
This effort has been successful thanks to the hard work of a lot of people including, but not limited to the following (in alphabetical order of last name): Jon Clark, Orhan Firat, Dan Garrette, Junjie Hu, Graham Neubig, and Aditya Siddhant.

Source: Google AI Blog


More Efficient NLP Model Pre-training with ELECTRA



Recent advances in language pre-training have led to substantial gains in the field of natural language processing, with state-of-the-art models such as BERT, RoBERTa, XLNet, ALBERT, and T5, among many others. These methods, though they differ in design, share the same idea of leveraging a large amount of unlabeled text to build a general model of language understanding before being fine-tuned on specific NLP tasks such as sentiment analysis and question answering.

Existing pre-training methods generally fall under two categories: language models (LMs), such as GPT, which process the input text left-to-right, predicting the next word given the previous context, and masked language models (MLMs), such as BERT, RoBERTa, and ALBERT, which instead predict the identities of a small number of words that have been masked out of the input. MLMs have the advantage of being bidirectional instead of unidirectional in that they “see” the text to both the left and right of the token being predicted, instead of only to one side. However, the MLM objective (and related objectives such as XLNet’s) also have a disadvantage. Instead of predicting every single input token, those models only predict a small subset — the 15% that was masked out, reducing the amount learned from each sentence.
Existing pre-training methods and their disadvantages. Arrows indicate which tokens are used to produce a given output representation (rectangle). Left: Traditional language models (e.g., GPT) only use context to the left of the current word. Right: Masked language models (e.g., BERT) use context from both the left and right, but predict only a small subset of words for each input.
In “ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators”, we take a different approach to language pre-training that provides the benefits of BERT but learns far more efficiently. ELECTRA — Efficiently Learning an Encoder that Classifies Token Replacements Accurately — is a novel pre-training method that outperforms existing techniques given the same compute budget. For example, ELECTRA matches the performance of RoBERTa and XLNet on the GLUE natural language understanding benchmark when using less than ¼ of their compute and achieves state-of-the-art results on the SQuAD question answering benchmark. ELECTRA’s excellent efficiency means it works well even at small scale — it can be trained in a few days on a single GPU to better accuracy than GPT, a model that uses over 30x more compute. ELECTRA is being released as an open-source model on top of TensorFlow and includes a number of ready-to-use pre-trained language representation models.

Making Pre-training Faster
ELECTRA uses a new pre-training task, called replaced token detection (RTD), that trains a bidirectional model (like a MLM) while learning from all input positions (like a LM). Inspired by generative adversarial networks (GANs), ELECTRA trains the model to distinguish between “real” and “fake” input data. Instead of corrupting the input by replacing tokens with “[MASK]” as in BERT, our approach corrupts the input by replacing some input tokens with incorrect, but somewhat plausible, fakes. For example, in the below figure, the word “cooked” could be replaced with “ate”. While this makes a bit of sense, it doesn’t fit as well with the entire context. The pre-training task requires the model (i.e., the discriminator) to then determine which tokens from the original input have been replaced or kept the same. Crucially, this binary classification task is applied to every input token, instead of only a small number of masked tokens (15% in the case of BERT-style models), making RTD more efficient than MLM — ELECTRA needs to see fewer examples to achieve the same performance because it receives mode training signal per example. At the same time, RTD results in powerful representation learning, because the model must learn an accurate representation of the data distribution in order to solve the task.
Replaced token detection trains a bidirectional model while learning from all input positions.
The replacement tokens come from another neural network called the generator. While the generator can be any model that produces an output distribution over tokens, we use a small masked language model (i.e., a BERT model with small hidden size) that is trained jointly with the discriminator. Although the structure of the generator feeding into the discriminator is similar to a GAN, we train the generator with maximum likelihood to predict masked words, rather than adversarially, due to the difficulty of applying GANs to text. The generator and discriminator share the same input word embeddings. After pre-training, the generator is dropped and the discriminator (the ELECTRA model) is fine-tuned on downstream tasks. Our models all use the transformer neural architecture.

Further details on the replaced token detection (RTD) task. The fake tokens are sampled from a small masked language model that is trained jointly with ELECTRA.
ELECTRA Results
We compare ELECTRA against other state-of-the-art NLP models and found that it substantially improves over previous methods, given the same compute budget, performing comparably to RoBERTa and XLNet while using less than 25% of the compute.

The x-axis shows the amount of compute used to train the model (measured in FLOPs) and the y-axis shows the dev GLUE score. ELECTRA learns much more efficiently than existing pre-trained NLP models. Note that current best models on GLUE such as T5 (11B) do not fit on this plot because they use much more compute than others (around 10x more than RoBERTa).
Further pushing the limits of efficiency, we experimented with a small ELECTRA model that can be trained to good accuracy on a single GPU in 4 days. Although not achieving the same accuracy as larger models that require many TPUs to train, ELECTRA-small still performs quite well, even outperforming GPT while requiring only 1/30th as much compute.

Lastly, to see if the strong results held at scale, we trained a large ELECTRA model using more compute (roughly the same amount as RoBERTa, about 10% the compute as T5). This model achieves a new state-of-the-art for a single model on the SQuAD 2.0 question answering dataset (see the below table) and outperforms RoBERTa, XLNet, and ALBERT on the GLUE leaderboard. While the large-scale T5-11b model scores higher still on GLUE, ELECTRA is 1/30th the size and uses 10% of the compute to train.

Model Squad 2.0 test set
ELECTRA-Large 88.7
ALBERT-xxlarge 88.1
XLNet-Large 87.9
RoBERTa-Large 86.8
BERT-Large 80.0
SQuAD 2.0 scores for ELECTRA-Large and other state-of-the-art models (only non-ensemble models shown).
Releasing ELECTRA
We are releasing the code for both pre-training ELECTRA and fine-tuning it on downstream tasks, with currently supported tasks including text classification, question answering and sequence tagging. The code supports quickly training a small ELECTRA model on one GPU. We are also releasing pre-trained weights for ELECTRA-Large, ELECTRA-Base, and ELECTRA-Small. The ELECTRA models are currently English-only, but we hope to release models which have been pre-trained on many languages in the future.

Source: Google AI Blog