Tag Archives: Natural Language Processing

An NLU-Powered Tool to Explore COVID-19 Scientific Literature

Posted by Keith Hall, Research Scientist, Natural Language Understanding, Google Research

Due to the COVID-19 pandemic, scientists and researchers around the world are publishing an immense amount of new research in order to understand and combat the disease. While the volume of research is very encouraging, it can be difficult for scientists and researchers to keep up with the rapid pace of new publications. Traditional search engines can be excellent resources for finding real-time information on general COVID-19 questions like "How many COVID-19 cases are there in the United States?", but can struggle with understanding the meaning behind research-driven queries. Furthermore, searching through the existing corpus of COVID-19 scientific literature with traditional keyword-based approaches can make it difficult to pinpoint relevant evidence for complex queries.

To help address this problem, we are launching the COVID-19 Research Explorer, a semantic search interface on top of the COVID-19 Open Research Dataset (CORD-19), which includes more than 50,000 journal articles and preprints. We have designed the tool with the goal of helping scientists and researchers efficiently pore through articles for answers or evidence to COVID-19-related questions.

When the user asks an initial question, the tool not only returns a set of papers (like in a traditional search) but also highlights snippets from the paper that are possible answers to the question. The user can review the snippets and quickly make a decision on whether or not that paper is worth further reading. If the user is satisfied with the initial set of papers and snippets, we have added functionality to pose follow-up questions, which act as new queries for the original set of retrieved articles. Take a look at the animation below to see an example of a query and a corresponding follow-up question. We hope these features will foster knowledge exploration and efficient gathering of evidence for scientific hypotheses.

Semantic Search
A key technology powering the tool is semantic search. Semantic search aims to not just capture term overlap between a query and a document, but to really understand whether the meaning of a phrase is relevant to the user’s true intent behind their query.

Consider the query, “What regulates ACE2 expression?” Even though this seems like a simple question, certain phrases can still confuse a search engine that relies solely on text matching. For example, “regulates” can refer to a number of biological processes. While traditional information retrieval (IR) systems use techniques like query expansion to mitigate this confusion, semantic search models aim to learn these relationships implicitly.

Word order also matters. ACE2 (angiotensin converting enzyme-2) itself regulates certain biological processes, but the question is actually asking what regulates ACE2. Matching on terms alone will not distinguish between “what regulates ACE2 ” and “what ACE2 regulates.” Traditional IR systems use tricks like n-gram term matching, but semantic search methods strive to model word order and semantics at their core.

The semantic search technology we use is powered by BERT, which has recently been deployed to improve retrieval quality of Google Search. For the COVID-19 Research Explorer we faced the challenge that biomedical literature uses a language that is very different from the kinds of queries submitted to Google.com. In order to train BERT models, we required supervision — examples of queries and their relevant documents and snippets. While we relied on excellent resources produced by BioASQ for fine-tuning, such human-curated datasets tend to be small. Neural semantic search models require large amounts of training data. To augment small human-constructed datasets, we used advances in query generation to build a large synthetic corpus of questions and relevant documents in the biomedical domain.

Specifically, we used large amounts of general domain question-answer pairs to train an encoder-decoder model (part a in the figure below). This kind of neural architecture is used in tasks like machine translation that encodes one piece of text (e.g., an English sentence) and produces another piece of text (e.g., a French sentence). Here we trained the model to translate from answer passages to questions (or queries) about that passage. Next we took passages from every document in the collection, in this case CORD-19, and generated corresponding queries (part b). We then used these synthetic query-passage pairs as supervision to train our neural retrieval model (part c).

Synthetic query construction.

However, we found that there were examples where the neural model performed worse than a keyword-based model. This is because of the memorization-generalization continuum, which is well known in most fields of artificial intelligence and psycholinguistics. Keyword-based models, like tf-idf, are essentially memorizers. They memorize terms from the query and look for documents that have them. Neural retrieval models, on the other hand, learn generalizations about concepts and meaning and try to match based on those. Sometimes they can over-generalize when precision is important. For example, if I query, “What regulates ACE2 expression?”, one may want the model to generalize the concept of “regulation,” but not ACE2 beyond acronym expansion.

Hybrid Term-Neural Retrieval Model
To improve our system we built a hybrid term-neural retrieval model. A crucial observation is that both term-based and neural models can be cast as a vector space model. In other words, we can encode both the query and documents and then treat retrieval as looking for the document vectors that are most similar to the query vector, also known as k-nearest neighbor retrieval. There is a lot of research and engineering that is needed to make this work at scale, but it allows us a simple mechanism to combine methods. The simplest approach is to combine the vectors with a trade-off parameter.

Hybrid Term and Neural Retrieval.

In the figure above, the blue boxes are the term-based vectors, and the red, the neural vectors. We represent documents by concatenating these vectors. We concatenate the two vectors for queries as well, but we control the relative importance of exact term matches versus neural semantic matching. This is done via a weight parameter k. While more complex hybrid schemes are possible, we found that this simple hybrid model significantly increased quality on our biomedical literature retrieval benchmarks.

Availability and Community Feedback
The COVID-19 Research Explorer is freely available to the research community as an open alpha. Over the coming months we will be making a number of usability enhancements, so please check back often. Try out the COVID-19 Research Explorer, and please share any comments you have with us via the feedback channels on the site.

Acknowledgements
This effort has been successful thanks to the hard work of many people, including, but not limited to the following (in alphabetical order of last name): John Alex, Waleed Ammar, Greg Billock, Yale Cong, Ali Elkahky, Daniel Francisco, Stephen Greco, Stefan Hosein, Johanna Katz, Gyorgy Kiss, Margarita Kopniczky, Ivan Korotkov, Dominic Leung, Daphne Luong, Ji Ma, Ryan Mcdonald, Matt Pearson-Beck, Biao She, Jonathan Sheffi, Kester Tong, Ben Wedin

Source: Google AI Blog

An NLU-Powered Tool to Explore COVID-19 Scientific Literature

Synthetic query construction.

Hybrid Term and Neural Retrieval.

Source: Google AI Blog

Using Neural Networks to Find Answers in Tables

Posted by Thomas Müller, Software Engineer, Google Research

Much of the world’s information is stored in the form of tables, which can be found on the web or in databases and documents. These might include anything from technical specifications of consumer products to financial and country development statistics, sports results and much more. Currently, one needs to manually look at these tables to find the answer to a question or rely on a service that gives answers to specific questions (e.g., about sports results). This information would be much more accessible and useful if it could be queried through natural language.

For example, the following figure shows a table with a number of questions that people might want to ask. The answer to these questions might be found in one, or multiple, cells in a table (“Which wrestler had the most number of reigns?”), or might require aggregating multiple table cells (“How many world champions are there with only one reign?”).

A table and questions with the expected answers. Answers can be selected (#1, #4) or computed (#2, #3).

Many recent approaches apply traditional semantic parsing to this problem, where a natural language question is translated to an SQL-like database query that executes against a database to provide the answers. For example, the question “How many world champions are there with only one reign?” would be mapped to a query such as “select count(*) where column("No. of reigns") == 1;” and then executed to produce the answer. This approach often requires substantial engineering in order to generate syntactically and semantically valid queries and is difficult to scale to arbitrary questions rather than questions about very specific tables (such as sports results).

In, “TAPAS: Weakly Supervised Table Parsing via Pre-training”, accepted at ACL 2020, we take a different approach that extends the BERT architecture to encode the question jointly along with tabular data structure, resulting in a model that can then point directly to the answer. Instead of creating a model that works only for a single style of table, this approach results in a model that can be applied to tables from a wide range of domains. After pre-training on millions of Wikipedia tables, we show that our approach exhibits competitive accuracy on three academic table question-answering (QA) datasets. Additionally, in order to facilitate more exciting research in this area, we have open-sourced the code for training and testing the models as well as the models pre-trained on Wikipedia tables, available at our GitHub repo.

How to Process a Question
To process a question such as “Average time as champion for top 2 wrestlers?”, our model jointly encodes the question as well as the table content row by row using a BERT model that is extended with special embeddings to encode the table structure.

The key addition to the transformer-based BERT model are the extra embeddings that are used to encode the structured input. We rely on learned embeddings for the column index, the row index and one special rank index, which indicates the order of elements in numerical columns. The following image shows how all of these are added together at the input and fed into the transformer layers. The figure below illustrates how the question is encoded, together with the small table shown on the left. Each cell token has a special embedding that indicates its row, column and numeric rank within the column.

BERT layer input: Every input token is represented as the sum of the embeddings of its word, absolute position, segment (whether it belongs to the question or table), column and row and numeric rank (the position the cell would have if the column was sorted by its numeric values).

The model has two outputs: 1) for each table cell, a score indicates the probability that this cell will be part of the answer and 2) an aggregation operation that indicates which operation (if any) is applied to produce the final answer. The following figure shows how, for the question “Average time as champion for top 2 wrestlers?”, the model should select the first two cells of the “Combined days” column and the “AVERAGE” operation with high probability.

Model schematic: The BERT layer encodes both the question and table. The model outputs a probability for every aggregation operation and a selection probability for every table cell. For the question “Average time as champion for top 2 wrestlers?” the AVERAGE operation and the cells with the numbers 3,749 and 3,103 should have a high probability.

Pre-training
Using a method similar to how BERT is trained on text, we pre-trained our model on 6.2 million table-text pairs extracted from the English Wikipedia. During pre-training, the model learns to restore words in both table and text that have been replaced with a mask. We find that the model can do this with relatively high accuracy (71.4% of the masked tokens are restored correctly for tables unseen during training).

Learning from Answers Only
During fine-tuning the model learns how to answer questions from a table. This is done through training with either strong or weak supervision. In the case of strong supervision, for a given table and questions, one must provide the cells and aggregation operation to select (e.g., sum or count), a time-consuming and laborious process. More commonly, one trains using weak supervision, where only the correct answer (e.g., 3426 for the question in the example above) is provided. In this case, the model attempts to find an aggregation operation and cells that produce an answer close to the correct answer. This is done by computing the expectation over all the possible aggregation decisions and comparing it with the true result. The weak supervision scenario is beneficial because it allows for non-experts to provide the data needed to train the model and takes less time than strong supervision.

Results
We applied our model to three datasets — SQA, WikiTableQuestions (WTQ) and WikiSQL — and compared it to the performance of the top three state-of-the-art (SOTA) models for parsing tabular data. The comparison models included Min et al (2019) for WikiSQL, Wang et al. (2019) for WTQ and our own previous work for SQA (Mueller et al., 2019). For all datasets, we report the answer accuracy on the test sets for the weakly supervised training setup. For SQA and WIkiSQL we used our base model pre-trained on Wikipedia, while for WTQ, we found it beneficial to additionally pre-train on the SQA data. Our best models outperform the previous SOTA for SQA by more than 12 points, the previous SOTA for WTQ by more than 4 points and performs similarly to the best model published on WikiSQL.

Test answer accuracy for the weakly-supervised setup on three academic TableQA datasets.

Acknowledgements
This work was carried out by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos in the Google AI language group in Zurich. We would like to thank Yasemin Altun, Srini Narayanan, Slav Petrov, William Cohen, Massimo Nicosia, Syrine Krichene, and Jordan Boyd-Graber for useful comments and suggestions.

Source: Google AI Blog

A Scalable Approach to Reducing Gender Bias in Google Translate

Posted by Melvin Johnson, Senior Software Engineer, Google Research

Machine learning (ML) models for language translation can be skewed by societal biases reflected in their training data. One such example, gender bias, often becomes more apparent when translating between a gender-specific language and one that is less-so. For instance, Google Translate historically translated the Turkish equivalent of “He/she is a doctor” into the masculine form, and the Turkish equivalent of “He/she is a nurse” into the feminine form.

In line with Google’s AI Principles, which emphasizes the importance to avoid creating or reinforcing unfair biases, in December 2018 we announced gender-specific translations. This feature in Google Translate provides options for both feminine and masculine translations when translating queries that are gender-neutral in the source language. For this work, we developed a three-step approach, which involved detecting gender-neutral queries, generating gender-specific translations and checking for accuracy. We used this approach to enable gender-specific translations for phrases and sentences in Turkish-to-English and have now expanded this approach for English-to-Spanish translations, the most popular language-pair in Google Translate.

Left: Early example of the translation of a gender neutral English phrase to a gender-specific Spanish counterpart. In this case, only a biased example is given. Right: The new Translate provides both a feminine and a masculine translation option.

But as this approach was applied to more languages, it became apparent that there were issues in scaling. Specifically, generating masculine and feminine translations independently using a neural machine translation (NMT) system resulted in low recall, failing to show gender-specific translations for up to 40% of eligible queries, because the two translations often weren’t exactly equivalent, except for gender-related phenomena. Additionally, building a classifier to detect gender-neutrality for each source language was data intensive.

Today, along with the release of the new English-to-Spanish gender-specific translations, we announce an improved approach that uses a dramatically different paradigm to address gender bias by rewriting or post-editing the initial translation. This approach is more scalable, especially when translating from gender-neutral languages to English, since it does not require a gender-neutrality detector. Using this approach we have expanded gender-specific translations to include Finnish, Hungarian, and Persian-to-English. We have also replaced the previous Turkish-to-English system using the new rewriting-based method.

Rewriting-Based Gender-Specific Translation
The first step in the rewriting-based method is to generate the initial translation. The translation is then reviewed to identify instances where a gender-neutral source phrase yielded a gender-specific translation. If that is the case, we apply a sentence-level rewriter to generate an alternative gendered translation. Finally, both the initial and the rewritten translations are reviewed to ensure that the only difference is the gender.

Top: The original approach. Bottom: The new rewriting-based approach.

Rewriter
Building a rewriter involved generating millions of training examples composed of pairs of phrases, each of which included both masculine and feminine translations. Because such data was not readily available, we generated a new dataset for this purpose. Starting with a large monolingual dataset, we programmatically generated candidate rewrites by swapping gendered pronouns from masculine to feminine, or vice versa. Since there can be multiple valid candidates, depending on the context — for example the feminine pronoun “her” can map to either “him” or “his” and the masculine pronoun “his” can map to “her” or “hers” — a mechanism was needed for choosing the correct one. To resolve this tie, one can either use a syntactic parser or a language model. Because a syntactic parsing model would require training with labeled datasets in each language, it is less scalable than a language model, which can learn in an unsupervised fashion. So, we select the best candidate using an in-house language model trained on millions of English sentences.

This table demonstrates the data generation process. We start with the input, generate candidates and finally break the tie using a language model.

The above data generation process results in training data that goes from a masculine input to a feminine output and vice versa. We merge data from both these directions and train a one-layer transformer-based sequence-to-sequence model on it. We introduce punctuation and casing variants in the training data to increase the model robustness. Our final model can reliably produce the requested masculine or feminine rewrites 99% of the time.

Evaluation
We also devised a new method of evaluation, named bias reduction, which measures the relative reduction of bias between the new translation system and the existing system. Here “bias” is defined as making a gender choice in the translation that is unspecified in the source. For example, if the current system is biased 90% of the time and the new system is biased 45% of the time, this results in a 50% relative bias reduction. Using this metric, the new approach results in a bias reduction of ≥90% for translations from Hungarian, Finnish and Persian-to-English. The bias reduction of the existing Turkish-to-English system improved from 60% to 95% with the new approach. Our system triggers gender-specific translations with an average precision of 97% (i.e., when we decide to show gender-specific translations we’re right 97% of the time).

We’ve made significant progress since our initial launch by increasing the quality of gender-specific translations and also expanding it to 4 more language-pairs. We are committed to further addressing gender bias in Google Translate and plan to extend this work to document-level translation, as well.

Acknowledgements:
This effort has been successful thanks to the hard work of many people, including, but not limited to the following (in alphabetical order of last name): Anja Austermann, Jennifer Choi‎, Hossein Emami, Rick Genter, Megan Hancock, Mikio Hirabayashi‎, Macduff Hughes, Tolga Kayadelen, Mira Keskinen, Michelle Linch, Klaus Macherey‎, Gergely Morvay, Tetsuji Nakagawa, Thom Nelson, Mengmeng Niu, Jennimaria Palomaki‎, Alex Rudnick, Apu Shah, Jason Smith, Romina Stella, Vilis Urban, Colin Young, Angie Whitnah, Pendar Yousefi, Tao Yu

Source: Google AI Blog

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

Posted by Melvin Johnson, Senior Software Engineer, Google Research and Sebastian Ruder, Research Scientist, DeepMind

One of the key challenges in natural language processing (NLP) is building systems that not only work in English but in all of the world’s ~6,900 languages. Luckily, while most of the world’s languages are data sparse and do not have enough data available to train robust models on their own, many languages do share a considerable amount of underlying structure. On the vocabulary level, languages often have words that stem from the same origin — for instance, “desk” in English and “Tisch” in German both come from the Latin “discus”. Similarly, many languages also mark semantic roles in similar ways, such as the use of postpositions to mark temporal and spatial relations in both Chinese and Turkish.

In NLP, there are a number of methods that leverage the shared structure of multiple languages in training in order to overcome the data sparsity problem. Historically, most of these methods focused on performing a specific task in multiple languages. Over the last few years, driven by advances in deep learning, there has been an increase in the number of approaches that attempt to learn general-purpose multilingual representations (e.g., mBERT, XLM, XLM-R), which aim to capture knowledge that is shared across languages and that is useful for many tasks. In practice, however, the evaluation of such methods has mostly focused on a small set of tasks and for linguistically similar languages.

To encourage more research on multilingual learning, we introduce “XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization”, which covers 40 typologically diverse languages (spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of syntax or semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the Niger-Congo languages Swahili and Yoruba, spoken in Africa. The code and data, including examples for running various baselines, is available here.

XTREME Tasks and Languages
The tasks included in XTREME cover a range of paradigms, including sentence classification, structured prediction, sentence retrieval and question answering. Consequently, in order for models to be successful on the XTREME benchmarks, they must learn representations that generalize to many standard cross-lingual transfer settings.

Tasks supported in the XTREME benchmark.

Each of the tasks covers a subset of the 40 languages. To obtain additional data in the low-resource languages used for analyses in XTREME, the test sets of two representative tasks, natural language inference (XNLI) and question answering (XQuAD), were automatically translated from English to the remaining languages. We show that models using the translated test sets for these tasks exhibited performance comparable to that achieved using human-labelled test sets.

Zero-shot Evaluation
To evaluate performance using XTREME, models must first be pre-trained on multilingual text using objectives that encourage cross-lingual learning. Then, they are fine-tuned on task-specific English data, since English is the most likely language where labelled data is available. XTREME then evaluates these models on zero-shot cross-lingual transfer performance, i.e., on other languages for which no task-specific data was seen. The three-step process, from pre-training to fine-tuning to zero-shot transfer, is shown in the figure below.

The cross-lingual transfer learning process for a given model: pre-training on multilingual text, followed by fine-tuning in English on downstream tasks, and finally zero-shot evaluation with XTREME.

In practice, one of the benefits of this zero-shot setting is computational efficiency — a pre-trained model only needs to be fine-tuned on English data for each task and can then be evaluated directly on other languages. Nevertheless, for tasks where labelled data is available in other languages, we also compare against fine-tuning on in-language data. Finally, we provide a combined score by obtaining the zero-shot scores on all nine XTREME tasks.

A Testbed for Transfer Learning
We conduct experiments with several state-of-the-art pre-trained multilingual models, including: multilingual BERT, a multilingual extension of the popular BERT model; XLM and XLM-R, two larger versions of multilingual BERT that have been trained on even more data; and a massively multilingual machine translation model, M4. A common feature of these models is that they have been pre-trained on large amounts of data from multiple languages. For our experiments, we choose variants of these models that are pre-trained on around 100 languages, including the 40 languages of our benchmark.

We find that while models achieve close to human performance on most existing tasks in English, performance is significantly lower for many of the other languages. Across all models, the gap between English performance and performance for the remaining languages is largest for the structured prediction and question answering tasks, while the spread of results across languages is largest for the structured prediction and sentence retrieval tasks.

For illustration, in the figure below we show the performance of the best-performing model in the zero-shot setting, XLM-R, by task and language, across all language families. The scores across tasks are not comparable, so the main focus should be the relative ranking of languages across tasks. As we can see, many high-resource languages, particularly from the Indo-European language family, are consistently ranked higher. In contrast, the model achieves lower performance on many languages from other language families such as Sino-Tibetan, Japonic, Koreanic, and Niger-Congo languages.

Performance of the best-performing model (XLM-R) across all tasks and languages in XTREME in the zero-shot setting. The reported scores are percentages based on task-specific metrics and are not directly comparable across tasks. Human performance (if available) is represented by a red star. Specific examples from each language family are represented with their ISO 639-1 codes.

In general we made a number of interesting observations.

In the zero-shot setting, M4 and mBERT are competitive with XLM-R for most tasks, while the latter outperforms them in the particularly challenging question answering tasks. For example, on XQuAD, XLM-R scored 76.6 compared to 64.5 for mBERT and 64.6 for M4, with similar spreads on MLQA and TyDi QA.
We find that baselines utilizing machine translation, which translate either the training data or test data, are very competitive. On the XNLI task, mBERT scored 65.4 in the zero shot transfer setting, and 74.0 when using translated training data.
We observe that the few-shot setting (i.e., using limited amounts of in-language labelled data, when available) is particularly competitive for simpler tasks, such as NER, but less useful for the more complex question answering tasks. This can be seen in the performance of mBERT, which improves by 42% on the NER task from 62.2 to 88.3 in the few-shot setting, but for the question answering task (TyDi QA), only improves by 25% (59.7 to 74.5).
Overall, a large gap between performance in English and other languages remains across all models and settings, which indicates that there is much potential for research on cross-lingual transfer.

Cross-lingual Transfer Analysis
Similar to previous observations regarding the generalisation ability of deep models, we observe that results improve if more pre-training data is available for a language, e.g., mBERT compared to XLM-R, which has more pre-training data. However, we find that this correlation does not hold for the structured prediction tasks, part-of-speech tagging (POS) and named entity recognition (NER), which indicates that current deep pre-trained models are not able to fully exploit the pre-training data to transfer to such syntactic tasks. We also find that models have difficulties transferring to non-Latin scripts. This is evident on the POS task, where mBERT achieves a zero-shot accuracy of 86.9 on Spanish compared to just 49.2 on Japanese.

For the natural language inference task, XNLI, we find that a model makes the same prediction on a test example in English and on the same example in another language about 70% of the time. Semi-supervised methods might be helpful in encouraging improved consistency between the predictions on examples and their translations in different languages. We also find that models struggle to predict POS tag sequences that were not seen in the English training data on which they were fine-tuned, highlighting that these models struggle to learn the syntax of other languages from the large amounts of unlabelled data used for pre-training. For named entity recognition, models have the most difficulty predicting entities that were not seen in the English training data for distant languages — accuracies on Indonesian and Swahili are 58.0 and 66.6, respectively, compared to 82.3 and 80.1 for Portguese and French.

Making Progress on Multilingual Transfer Learning
English has been the focal point of most recent advances in NLP despite being spoken by only around 15% of the world’s population. We believe that building on deep contextual representations, we now have the tools to make substantial progress on systems that serve the remainder of the world’s languages. We hope that XTREME will catalyze research in multilingual transfer learning, similar to how benchmarks such as GLUE and SuperGLUE have spurred the development of deep monolingual models, including BERT, RoBERTa, XLNet, AlBERT, and others. Stay tuned to our Twitter account for information on our upcoming website launch with a submission portal and leaderboard.

Acknowledgements:
This effort has been successful thanks to the hard work of a lot of people including, but not limited to the following (in alphabetical order of last name): Jon Clark, Orhan Firat, Dan Garrette, Junjie Hu, Graham Neubig, and Aditya Siddhant.

Source: Google AI Blog

More Efficient NLP Model Pre-training with ELECTRA

Posted by Kevin Clark, Student Researcher and Thang Luong, Senior Research Scientist, Google Research, Brain Team

Recent advances in language pre-training have led to substantial gains in the field of natural language processing, with state-of-the-art models such as BERT, RoBERTa, XLNet, ALBERT, and T5, among many others. These methods, though they differ in design, share the same idea of leveraging a large amount of unlabeled text to build a general model of language understanding before being fine-tuned on specific NLP tasks such as sentiment analysis and question answering.

Existing pre-training methods generally fall under two categories: language models (LMs), such as GPT, which process the input text left-to-right, predicting the next word given the previous context, and masked language models (MLMs), such as BERT, RoBERTa, and ALBERT, which instead predict the identities of a small number of words that have been masked out of the input. MLMs have the advantage of being bidirectional instead of unidirectional in that they “see” the text to both the left and right of the token being predicted, instead of only to one side. However, the MLM objective (and related objectives such as XLNet’s) also have a disadvantage. Instead of predicting every single input token, those models only predict a small subset — the 15% that was masked out, reducing the amount learned from each sentence.

Existing pre-training methods and their disadvantages. Arrows indicate which tokens are used to produce a given output representation (rectangle). Left: Traditional language models (e.g., GPT) only use context to the left of the current word. Right: Masked language models (e.g., BERT) use context from both the left and right, but predict only a small subset of words for each input.

In “ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators”, we take a different approach to language pre-training that provides the benefits of BERT but learns far more efficiently. ELECTRA — Efficiently Learning an Encoder that Classifies Token Replacements Accurately — is a novel pre-training method that outperforms existing techniques given the same compute budget. For example, ELECTRA matches the performance of RoBERTa and XLNet on the GLUE natural language understanding benchmark when using less than ¼ of their compute and achieves state-of-the-art results on the SQuAD question answering benchmark. ELECTRA’s excellent efficiency means it works well even at small scale — it can be trained in a few days on a single GPU to better accuracy than GPT, a model that uses over 30x more compute. ELECTRA is being released as an open-source model on top of TensorFlow and includes a number of ready-to-use pre-trained language representation models.

Making Pre-training Faster
ELECTRA uses a new pre-training task, called replaced token detection (RTD), that trains a bidirectional model (like a MLM) while learning from all input positions (like a LM). Inspired by generative adversarial networks (GANs), ELECTRA trains the model to distinguish between “real” and “fake” input data. Instead of corrupting the input by replacing tokens with “[MASK]” as in BERT, our approach corrupts the input by replacing some input tokens with incorrect, but somewhat plausible, fakes. For example, in the below figure, the word “cooked” could be replaced with “ate”. While this makes a bit of sense, it doesn’t fit as well with the entire context. The pre-training task requires the model (i.e., the discriminator) to then determine which tokens from the original input have been replaced or kept the same. Crucially, this binary classification task is applied to every input token, instead of only a small number of masked tokens (15% in the case of BERT-style models), making RTD more efficient than MLM — ELECTRA needs to see fewer examples to achieve the same performance because it receives mode training signal per example. At the same time, RTD results in powerful representation learning, because the model must learn an accurate representation of the data distribution in order to solve the task.

Replaced token detection trains a bidirectional model while learning from all input positions.

The replacement tokens come from another neural network called the generator. While the generator can be any model that produces an output distribution over tokens, we use a small masked language model (i.e., a BERT model with small hidden size) that is trained jointly with the discriminator. Although the structure of the generator feeding into the discriminator is similar to a GAN, we train the generator with maximum likelihood to predict masked words, rather than adversarially, due to the difficulty of applying GANs to text. The generator and discriminator share the same input word embeddings. After pre-training, the generator is dropped and the discriminator (the ELECTRA model) is fine-tuned on downstream tasks. Our models all use the transformer neural architecture.

Further details on the replaced token detection (RTD) task. The fake tokens are sampled from a small masked language model that is trained jointly with ELECTRA.

ELECTRA Results
We compare ELECTRA against other state-of-the-art NLP models and found that it substantially improves over previous methods, given the same compute budget, performing comparably to RoBERTa and XLNet while using less than 25% of the compute.

The x-axis shows the amount of compute used to train the model (measured in FLOPs) and the y-axis shows the dev GLUE score. ELECTRA learns much more efficiently than existing pre-trained NLP models. Note that current best models on GLUE such as T5 (11B) do not fit on this plot because they use much more compute than others (around 10x more than RoBERTa).

Further pushing the limits of efficiency, we experimented with a small ELECTRA model that can be trained to good accuracy on a single GPU in 4 days. Although not achieving the same accuracy as larger models that require many TPUs to train, ELECTRA-small still performs quite well, even outperforming GPT while requiring only 1/30th as much compute.

Lastly, to see if the strong results held at scale, we trained a large ELECTRA model using more compute (roughly the same amount as RoBERTa, about 10% the compute as T5). This model achieves a new state-of-the-art for a single model on the SQuAD 2.0 question answering dataset (see the below table) and outperforms RoBERTa, XLNet, and ALBERT on the GLUE leaderboard. While the large-scale T5-11b model scores higher still on GLUE, ELECTRA is 1/30th the size and uses 10% of the compute to train.

*Model*	*Squad 2.0 test set*
ELECTRA-Large	88.7
ALBERT-xxlarge	88.1
XLNet-Large	87.9
RoBERTa-Large	86.8
BERT-Large	80.0

SQuAD 2.0 scores for ELECTRA-Large and other state-of-the-art models (only non-ensemble models shown).

Releasing ELECTRA
We are releasing the code for both pre-training ELECTRA and fine-tuning it on downstream tasks, with currently supported tasks including text classification, question answering and sequence tagging. The code supports quickly training a small ELECTRA model on one GPU. We are also releasing pre-trained weights for ELECTRA-Large, ELECTRA-Base, and ELECTRA-Small. The ELECTRA models are currently English-only, but we hope to release models which have been pre-trained on many languages in the future.

Source: Google AI Blog

Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer

Posted by Adam Roberts, Staff Software Engineer and Colin Raffel, Senior Research Scientist, Google Research

Over the past few years, transfer learning has led to a new wave of state-of-the-art results in natural language processing (NLP). Transfer learning's effectiveness comes from pre-training a model on abundantly-available unlabeled text data with a self-supervised task, such as language modeling or filling in missing words. After that, the model can be fine-tuned on smaller labeled datasets, often resulting in (far) better performance than training on the labeled data alone. The recent success of transfer learning was ignited in 2018 by GPT, ULMFiT, ELMo, and BERT, and 2019 saw the development of a huge diversity of new methods like XLNet, RoBERTa, ALBERT, Reformer, and MT-DNN. The rate of progress in the field has made it difficult to evaluate which improvements are most meaningful and how effective they are when combined.

In “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, we present a large-scale empirical survey to determine which transfer learning techniques work best and apply these insights at scale to create a new model that we call the Text-To-Text Transfer Transformer (T5). We also introduce a new open-source pre-training dataset, called the Colossal Clean Crawled Corpus (C4). The T5 model, pre-trained on C4, achieves state-of-the-art results on many NLP benchmarks while being flexible enough to be fine-tuned to a variety of important downstream tasks. In order for our results to be extended and reproduced, we provide the code and pre-trained models, along with an easy-to-use Colab Notebook to help get started.

A Shared Text-To-Text Framework
With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). We can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself.

Diagram of our text-to-text framework. Every task we consider uses text as input to the model, which is trained to generate some target text. This allows us to use the same model, loss function, and hyperparameters across our diverse set of tasks including translation (green), linguistic acceptability (red), sentence similarity (yellow), and document summarization (blue). It also provides a standard testbed for the methods included in our empirical survey.

A Large Pre-training Dataset (C4)
An important ingredient for transfer learning is the unlabeled dataset used for pre-training. To accurately measure the effect of scaling up the amount of pre-training, one needs a dataset that is not only high quality and diverse, but also massive. Existing pre-training datasets don’t meet all three of these criteria — for example, text from Wikipedia is high quality, but uniform in style and relatively small for our purposes, while the Common Crawl web scrapes are enormous and highly diverse, but fairly low quality.

To satisfy these requirements, we developed the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia. Our cleaning process involved deduplication, discarding incomplete sentences, and removing offensive or noisy content. This filtering led to better results on downstream tasks, while the additional size allowed the model size to increase without overfitting during pre-training. C4 is available through TensorFlow Datasets.

A Systematic Study of Transfer Learning Methodology
With the T5 text-to-text framework and the new pre-training dataset (C4), we surveyed the vast landscape of ideas and methods introduced for NLP transfer learning over the past few years. The full details of the investigation can be found in our paper, including experiments on:

model architectures, where we found that encoder-decoder models generally outperformed "decoder-only" language models;
pre-training objectives, where we confirmed that fill-in-the-blank-style denoising objectives (where the model is trained to recover missing words in the input) worked best and that the most important factor was the computational cost;
unlabeled datasets, where we showed that training on in-domain data can be beneficial but that pre-training on smaller datasets can lead to detrimental overfitting;
training strategies, where we found that multitask learning could be close to competitive with a pre-train-then-fine-tune approach but requires carefully choosing how often the model is trained on each task;
and scale, where we compare scaling up the model size, the training time, and the number of ensembled models to determine how to make the best use of fixed compute power.

Insights + Scale = State-of-the-Art
To explore the current limits of transfer learning for NLP, we ran a final set of experiments where we combined all of the best methods from our systematic study and scaled up our approach with Google Cloud TPU accelerators. Our largest model had 11 billion parameters and achieved state-of-the-art on the GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail benchmarks. One particularly exciting result was that we achieved a near-human score on the SuperGLUE natural language understanding benchmark, which was specifically designed to be difficult for machine learning models but easy for humans.

Extensions
T5 is flexible enough to be easily modified for application to many tasks beyond those considered in our paper, often with great success. Below, we apply T5 to two novel tasks: closed-book question answering and fill-in-the-blank text generation with variable-sized blanks.

Closed-Book Question Answering
One way to use the text-to-text framework is on reading comprehension problems, where the model is fed some context along with a question and is trained to find the question's answer from the context. For example, one might feed the model the text from the Wikipedia article about Hurricane Connie along with the question "On what date did Hurricane Connie occur?" The model would then be trained to find the date "August 3rd, 1955" in the article. In fact, we achieved state-of-the-art results on the Stanford Question Answering Dataset (SQuAD) with this approach.

In our Colab demo and follow-up paper, we trained T5 to answer trivia questions in a more difficult "closed-book" setting, without access to any external knowledge. In other words, in order to answer a question T5 can only use knowledge stored in its parameters that it picked up during unsupervised pre-training. This can be considered a constrained form of open-domain question answering.

During pre-training, T5 learns to fill in dropped-out spans of text (denoted by <M>) from documents in C4. To apply T5 to closed-book question answer, we fine-tuned it to answer questions without inputting any additional information or context. This forces T5 to answer questions based on “knowledge” that it internalized during pre-training.

T5 is surprisingly good at this task. The full 11-billion parameter model produces the exact text of the answer 50.1%, 37.4%, and 34.5% of the time on TriviaQA, WebQuestions, and Natural Questions, respectively. To put these results in perspective, the T5 team went head-to-head with the model in a pub trivia challenge and lost! Try it yourself by clicking the animation below.

Fill-in-the-Blank Text Generation
Large language models like GPT-2 excel at generating very realistic looking-text since they are trained to predict what words come next after an input prompt. This has led to numerous creative applications like Talk To Transformer and the text-based game AI Dungeon. The pre-training objective used by T5 aligns more closely with a fill-in-the-blank task where the model predicts missing words within a corrupted piece of text. This objective is a generalization of the continuation task, since the “blanks” can appear at the end of the text as well.

To make use of this objective, we created a new downstream task called sized fill-in-the-blank, where the model is asked to replace a blank with a specified number of words. For example, if we give the model the input “I like to eat peanut butter and _4_ sandwiches,” we would train it to fill in the blank with approximately 4 words.

We fine-tuned T5 on this task using C4 and found the resulting outputs to be quite realistic. It’s especially fun to see how the model adjusts its predictions based on the requested size for the missing text. For example, with the input, “I love peanut butter and _N_ sandwiches,” the outputs looked like:

Conclusion
We are excited to see how people use our findings, code, and pre-trained models to help jump-start their projects. Check out the Colab Notebook to get started, and share how you use it with us on Twitter!

Acknowledgements
This work has been a collaborative effort involving Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, Karishma Malkan, Noah Fiedel, and Monica Dinculescu.

Source: Google AI Blog

TyDi QA: A Multilingual Question Answering Benchmark

Posted by Jonathan Clark, Research Scientist, Google Research

Question answering technologies help people on a daily basis — when faced with a question, such as “Is squid ink safe to eat?”, users can ask a voice assistant or type a search and expect to receive an answer. Last year, we released the English-language Natural Questions dataset to the research community to provide a challenge that reflects the needs of real users. However, there are thousands of different languages, and many of those use very different approaches to construct meaning. For example, while English changes words to indicate one object (“book”) versus many (“books”), Arabic also has a third form to indicate if there are two of something ("كتابان", kitaban) beyond just singular ("كتاب", kitab) or plural ("كتب", kutub). In addition, some languages, such as Japanese, do not use spaces between words. Creating machine learning systems that can understand the many ways languages express meaning is challenging, and training such systems requires examples from the diverse the languages to which they will be applied.

To encourage research on multilingual question-answering, today we are releasing TyDi QA, a question answering corpus covering 11 Typologically Diverse languages. Described in our paper, “TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages”, our corpus is inspired by typological diversity, a notion that different languages express meaning in structurally different ways. Because we selected a set of languages that are typologically distant from each other for this corpus, we expect models performing well on this dataset to generalize across a large number of the languages in the world.

A Typologically Diverse Collection of Languages
TyDi QA includes over 200,000 question-answer pairs from 11 languages representing a diverse range of linguistic phenomena and data challenges. Many of these languages use non-Latin alphabets, such as Arabic, Bengali, Korean, Russian, Telugu, and Thai. Others form words in complex ways, including Arabic, Finnish, Indonesian, Kiswahili, Russian. Japanese uses four alphabets (shown by the four colors in “24時間でのサーキット周回数”) while the Korean alphabet itself is highly compositional. These languages also range from having much available data on the web (English and Arabic) to very little (Bengali and Kiswahili). We expect that systems that can address these challenges will be successful for a very large number of languages.

Creating Realistic Data
Many early QA datasets used by the research community were created by first showing paragraphs to people and then asking them to write questions based on what could be answered from reading the paragraph. However, since people could see the answer while writing each question, this approach yielded questions that often contained the same words as the answer. As a result, machine learning algorithms trained on such data would favor word matching, oblivious to the more nuanced answers required to satisfy users’ needs.

To construct a more natural dataset, we collected questions from people who wanted an answer, but did not know the answer yet. To inspire questions, we showed people an interesting passage from Wikipedia written in their native language. We then had them ask a question, any question, as long as it was not answered by the passage and they actually wanted to know the answer. This is similar to how your own curiosity might spawn questions about interesting things you see while walking down the street. We encouraged our question writers to let their imaginations run. Does a passage about ice make you think about popsicles in summer? Great! Ask who invented popsicles. Importantly, questions were written directly in each language, not translated, so many questions are unlike those seen in an English-first corpus. One question in Bengali asks, “সফেদা ফল খেতে কেমন?” (What does sapodilla taste like?) Never heard of it? That’s probably because it’s grown much more commonly in India than the U.S.

For each of these questions, we performed a Google Search for the best-matching Wikipedia article in the appropriate language and asked a person to find and highlight the answer within that article. While we expected some interesting divergences between the question and answers when the question writers did not have the answers in front of them, combined with the astonishing breadth of linguistic phenomena in the world’s languages, we found that the situation was even more complex.

For example, in Finnish, there are fascinating examples in which the words day and week are represented very differently in the question and answer. To be successful in selecting this answer sentence out of an entire Wikipedia article, a system needs to be able to recognize the relationship among the Finnish words viikonpäivät, seitsenpäiväinen, and viikko

Making Progress Together as a Research Community
It is our hope that this dataset will push the research community to innovate in ways that will create more helpful question-answering systems for users around the world. To track the community’s progress, we have established a leaderboard where participants can evaluate the quality of their machine learning systems and are also open-sourcing a question answering system that uses the data. Please visit the challenge website to view the leaderboard and learn more.

Acknowledgements
This dataset is the result of a team of many Googlers including (alphabetically) Dan Garrette, Eunsol Choi, Jennimaria Palomaki, Michael Collins, Tom Kwiatkowski, and Vitaly Nikolaev. The Finnish gloss above is by Jennimaria Palomaki.

Source: Google AI Blog

Encode, Tag and Realize: A Controllable and Efficient Approach for Text Generation

Posted by Eric Malmi and Sebastian Krause, Software Engineers, Google Research

Sequence-to-sequence (seq2seq) models have revolutionized the field of machine translation and have become the tool of choice for various text-generation tasks, such as summarization, sentence fusion and grammatical error correction. Improvements in model architecture (e.g., Transformer) and the ability to leverage large corpora of unannotated text via unsupervised pre-training have enabled the quality gains in neural network approaches we have seen in recent years.

Yet, the use of seq2seq models for text generation can come with a number of substantial drawbacks depending on the use case, such as producing outputs that are not supported by the input text (known as hallucination) and requiring large amounts of training data to reach good performance. Furthermore, seq2seq models are inherently slow at inference time, since they typically generate the output word-by-word.

In “Encode, Tag, Realize: High-Precision Text Editing,” we present a novel, open sourced method for text generation, which is designed to specifically address these three shortcomings. This method is called LaserTagger, owing to the speed and precision of the method. Instead of generating the output text from scratch, LaserTagger produces output by tagging words with predicted edit operations that are then applied to the input words in a separate realization step. This is a less error-prone way of tackling text generation, which can be handled by an easier to train and faster to execute model architecture.

Design and Functionality of LaserTagger
A distinct characteristic of many text-generation tasks is that there is often a high overlap between the input and the output. For instance, when detecting and fixing grammatical mistakes or when fusing sentences, most of the input text can remain unchanged, and only a small fraction of the words needs to be modified. For this reason, LaserTagger produces a sequence of edit operations instead of actual words. The four types of edit operations we use are: Keep (copies a word to the output), Delete (removes a word) and Keep-AddX / Delete-AddX (adds phrase X before the tagged word and optionally deletes the tagged word). This process is illustrated in the figure below, which shows an application of LaserTagger to sentence fusion.

LaserTagger applied to sentence fusion. The predicted edit operations correspond to deleting “. Turing" and adding "and he" before it. Notice the high overlap between the input and output text.

All added phrases come from a restricted vocabulary. This vocabulary is the result of an optimization process that has two goals: (1) minimizing the vocabulary size and (2) maximizing the number of training examples, where the only words necessary to add to the target text come from the vocabulary alone. Having a restricted phrase vocabulary makes the space of output decisions smaller and prevents the model from adding arbitrary words, hence mitigating the problem of hallucination. A corollary of the high-overlap property of input and output texts is that required modifications tend to be local and independent from one another. This means that the edit operations can be predicted in parallel with high accuracy, enabling a significant end-to-end speed up compared to autoregressive seq2seq models, which perform the predictions sequentially, conditioning on the previous predictions.

Results
We evaluated LaserTagger on four tasks: sentence fusion, split and rephrase, abstractive summarization, and grammar correction. Across the tasks, LaserTagger performs comparably to a strong BERT-based seq2seq baseline that uses a large number of training examples, and clearly outperforms this baseline when the number of training examples is limited. Below we show the results on the WikiSplit dataset, where the task is to rephrase a long sentence into two coherent short sentences.

When training the models on the full dataset of 1 million examples, both LaserTagger and a BERT-based seq2seq baseline model perform comparably, but when training on a subsample of 10,000 examples or less, LaserTagger clearly outperforms the baseline model (the higher the SARI score the better).

Key Advantages of LaserTagger
Compared to traditional seq2seq methods, LaserTagger has the following advantages:

Control: By controlling the output phrase vocabulary, which we can also manually edit or curate, LaserTagger is less susceptible to hallucination than the seq2seq baseline.
Inference speed: LaserTagger computes predictions up to 100 times faster than the seq2seq baseline, making it suitable for real-time applications.
Data efficiency: LaserTagger produces reasonable outputs, even when trained using only a few hundred or a few thousand training examples. In our experiments, a competitive seq2seq baseline required tens of thousands of examples to obtain comparable performance.

Why This Matters
The advantages of LaserTagger become even more pronounced when applied at large scale, such as improving the formulation of voice answers in some services by reducing the length of the responses and making them less repetitive. The high inference speed allows the model to be plugged into an existing technology stack, without adding any noticeable latency on the user side, while the improved data efficiency enables the collection of training data for many languages, thus benefiting users from different language backgrounds.

In our current work, we strive to bring similar improvements to other Google technologies that produce natural language. Furthermore, we are exploring how the editing of texts (instead of their generation from scratch) can help us to better understand user queries as they grow longer, become more complex, and come as part of a dialogue discourse. The code for LaserTagger is open-sourced to the community through our GitHub repo.

Acknowledgements
This research was conducted by Eric Malmi, Sebastian Krause, Sascha Rothe, Daniil Mirylenka, and Aliaksei Severyn. We are grateful for useful discussions with Enrique Alfonseca, Idan Szpektor, and Orgad Keller.

Source: Google AI Blog

Towards a Conversational Agent that Can Chat About…Anything

Posted by Daniel Adiwardana, Senior Research Engineer, and Thang Luong, Senior Research Scientist, Google Research, Brain Team

Modern conversational agents (chatbots) tend to be highly specialized — they perform well as long as users don’t stray too far from their expected usage. To better handle a wide variety of conversational topics, open-domain dialog research explores a complementary approach attempting to develop a chatbot that is not specialized but can still chat about virtually anything a user wants. Besides being a fascinating research problem, such a conversational agent could lead to many interesting applications, such as further humanizing computer interactions, improving foreign language practice, and making relatable interactive movie and videogame characters.

However, current open-domain chatbots have a critical flaw — they often don’t make sense. They sometimes say things that are inconsistent with what has been said so far, or lack common sense and basic knowledge about the world. Moreover, chatbots often give responses that are not specific to the current context. For example, “I don’t know,” is a sensible response to any question, but it’s not specific. Current chatbots do this much more often than people because it covers many possible user inputs.

In “Towards a Human-like Open-Domain Chatbot”, we present Meena, a 2.6 billion parameter end-to-end trained neural conversational model. We show that Meena can conduct conversations that are more sensible and specific than existing state-of-the-art chatbots. Such improvements are reflected through a new human evaluation metric that we propose for open-domain chatbots, called Sensibleness and Specificity Average (SSA), which captures basic, but important attributes for human conversation. Remarkably, we demonstrate that perplexity, an automatic metric that is readily available to any neural conversational models, highly correlates with SSA.

A chat between Meena (left) and a person (right).

Meena
Meena is an end-to-end, neural conversational model that learns to respond sensibly to a given conversational context. The training objective is to minimize perplexity, the uncertainty of predicting the next token (in this case, the next word in a conversation). At its heart lies the Evolved Transformer seq2seq architecture, a Transformer architecture discovered by evolutionary neural architecture search to improve perplexity.

Concretely, Meena has a single Evolved Transformer encoder block and 13 Evolved Transformer decoder blocks as illustrated below. The encoder is responsible for processing the conversation context to help Meena understand what has already been said in the conversation. The decoder then uses that information to formulate an actual response. Through tuning the hyper-parameters, we discovered that a more powerful decoder was the key to higher conversational quality.

Example of Meena encoding a 7-turn conversation context and generating a response, “The Next Generation”.

Conversations used for training are organized as tree threads, where each reply in the thread is viewed as one conversation turn. We extract each conversation training example, with seven turns of context, as one path through a tree thread. We choose seven as a good balance between having long enough context to train a conversational model and fitting models within memory constraints (longer contexts take more memory).

The Meena model has 2.6 billion parameters and is trained on 341 GB of text, filtered from public domain social media conversations. Compared to an existing state-of-the-art generative model, OpenAI GPT-2, Meena has 1.7x greater model capacity and was trained on 8.5x more data.

Human Evaluation Metric: Sensibleness and Specificity Average (SSA)
Existing human evaluation metrics for chatbot quality tend to be complex and do not yield consistent agreement between reviewers. This motivated us to design a new human evaluation metric, the Sensibleness and Specificity Average (SSA), which captures basic, but important attributes for natural conversations.

To compute SSA, we crowd-sourced free-form conversation with the chatbots being tested — Meena and other well-known open-domain chatbots, notably, Mitsuku, Cleverbot, XiaoIce, and DialoGPT. In order to ensure consistency between evaluations, each conversation starts with the same greeting, “Hi!”. For each utterance, the crowd workers answer two questions, “does it make sense?” and “is it specific?”. The evaluator is asked to use common sense to judge if a response is completely reasonable in context. If anything seems off — confusing, illogical, out of context, or factually wrong — then it should be rated as, “does not make sense”. If the response makes sense, the utterance is then assessed to determine if it is specific to the given context. For example, if A says, “I love tennis,” and B responds, “That’s nice,” then the utterance should be marked, “not specific”. That reply could be used in dozens of different contexts. But if B responds, “Me too, I can’t get enough of Roger Federer!” then it is marked as “specific”, since it relates closely to what is being discussed.

For each chatbot, we collect between 1600 and 2400 individual conversation turns through about 100 conversations. Each model response is labeled by crowdworkers to indicate if it is sensible and specific. The sensibleness of a chatbot is the fraction of responses labeled “sensible”, and specificity is the fraction of responses that are marked “specific”. The average of these two is the SSA score. The results below demonstrate that Meena does much better than existing state-of-the-art chatbots by large margins in terms of SSA scores, and is closing the gap with human performance.

Meena Sensibleness and Specificity Average (SSA) compared with that of humans, Mitsuku, Cleverbot, XiaoIce, and DialoGPT.

Automatic Metric: Perplexity
Researchers have long sought for an automatic evaluation metric that correlates with more accurate, human evaluation. Doing so would enable faster development of dialogue models, but to date, finding such an automatic metric has been challenging. Surprisingly, in our work, we discover that perplexity, an automatic metric that is readily available to any neural seq2seq model, exhibits a strong correlation with human evaluation, such as the SSA value. Perplexity measures the uncertainty of a language model. The lower the perplexity, the more confident the model is in generating the next token (character, subword, or word). Conceptually, perplexity represents the number of choices the model is trying to choose from when producing the next token.

During development, we benchmarked eight different model versions with varying hyperparameters and architectures, such as the number of layers, attention heads, total training steps, whether we use Evolved Transformer or regular Transformer, and whether we train with hard labels or with distillation. As illustrated in the figure below, the lower the perplexity, the better the SSA score for the model, with a strong correlation coefficient (R² = 0.93).

Interactive SSA vs. Perplexity. Each blue dot is a different version of the Meena model. A regression line is plotted demonstrating the strong correlation between SSA and perplexity. Dotted lines correspond to SSA performance of humans, other bots, Meena (base), our end-to-end trained model, and finally full Meena with filtering mechanism and tuned decoding.

Our best end-to-end trained Meena model, referred to as Meena (base), achieves a perplexity of 10.2 (smaller is better) and that translates to an SSA score of 72%. Compared to the SSA scores achieved by other chabots, our SSA score of 72% is not far from the 86% SSA achieved by the average person. The full version of Meena, which has a filtering mechanism and tuned decoding, further advances the SSA score to 79%.

Future Research & Challenges
As advocated previously, we will continue our goal of lowering the perplexity of neural conversational models through improvements in algorithms, architectures, data, and compute.

While we have focused solely on sensibleness and specificity in this work, other attributes such as personality and factuality are also worth considering in subsequent works. Also, tackling safety and bias in the models is a key focus area for us, and given the challenges related to this, we are not currently releasing an external research demo. We are evaluating the risks and benefits associated with externalizing the model checkpoint, however, and may choose to make it available in the coming months to help advance research in this area.

Acknowledgements
Several members contributed immensely to this project: David So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu. Also, thanks to Quoc Le, Samy Bengio, and Christine Robson for their leadership support. Thanks to the people who gave feedback on drafts of the paper: Anna Goldie, Abigail See, YizheZhang, Lauren Kunze, Steve Worswick, Jianfeng Gao, Scott Roy, Ilya Sutskever, Tatsu Hashimoto, Dan Jurafsky, Dilek Hakkani-tur, Noam Shazeer, Gabriel Bender, Prajit Ramachandran, Rami Al-Rfou, Michael Fink, Mingxing Tan, Maarten Bosma, and Adams Yu. Also thanks to the many volunteers who helped collect conversations with each other and with various chatbots. Finally thanks to Noam Shazeer, Rami Al-Rfou, Khoa Vo, Trieu H. Trinh, Ni Yan, Kyu Jin Hwang and the Google Brain team for their help with the project.

googblogs.com

All Google blogs and Press in one site

Tag Archives: Natural Language Processing

An NLU-Powered Tool to Explore COVID-19 Scientific Literature

Source: Google AI Blog

An NLU-Powered Tool to Explore COVID-19 Scientific Literature

Source: Google AI Blog

Using Neural Networks to Find Answers in Tables

Source: Google AI Blog

A Scalable Approach to Reducing Gender Bias in Google Translate

Source: Google AI Blog

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

Source: Google AI Blog

More Efficient NLP Model Pre-training with ELECTRA

Source: Google AI Blog

Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer

Source: Google AI Blog

TyDi QA: A Multilingual Question Answering Benchmark

Source: Google AI Blog

Encode, Tag and Realize: A Controllable and Efficient Approach for Text Generation

Source: Google AI Blog

Towards a Conversational Agent that Can Chat About…Anything

Source: Google AI Blog