Tag Archives: EMNLP

ToTTo: A Controlled Table-to-Text Generation Dataset

In the last few years, research in natural language generation, used for tasks like text summarization, has made tremendous progress. Yet, despite achieving high levels of fluency, neural systems can still be prone to hallucination (i.e., generating text that is understandable, but not faithful to the source), which can prevent these systems from being used in the many applications that require high degrees of accuracy. Consider an example from the Wikibio dataset, where the neural baseline model tasked with summarizing a Wikipedia infobox entry for Belgian football player Constant Vanden Stock incorrectly describes him as an American figure skater.

While assessing the faithfulness of generated text to the source content can be challenging, it is often easier when the source content is structured (e.g., in tabular format). Moreover, structured data can also test a model’s ability to perform reasoning and numerical inference. However, existing large-scale structured datasets are often noisy (i.e., the reference sentence cannot be fully inferred from the tabular data), making them unreliable for measuring hallucination during model development.

In “ToTTo: A Controlled Table-To-Text Generation Dataset”, we present an open domain table-to-text generation dataset created using a novel annotation process (via sentence revision) along with a controlled text generation task that can be used to assess model hallucination. ToTTo (shorthand for “Table-To-Text”) consists of 121,000 training examples, along with 7,500 examples each for development and test. Due to the accuracy of annotations, this dataset is suitable as a challenging benchmark for research in high precision text generation. The dataset and code are open-sourced on our GitHub repo.

Table-to-Text Generation
ToTTo introduces a controlled generation task in which a given Wikipedia table with a set of selected cells is used as the source material for the task of producing a single sentence description that summarizes the cell contents in the context of the table. The example below demonstrates some of the many challenges posed by the task, such as numerical reasoning, a large open-domain vocabulary, and varied table structure.

Example in the ToTTo dataset, where given the source table and set of highlighted cells (left), the goal is to generate a one sentence description, such as the “target sentence” (right). Note that generating the target sentence would require numerical inference (eleven NFL seasons) and understanding of the NFL domain.
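
To make the task input concrete, here is a minimal Python sketch of how one example might be represented and flattened into a single input string for a text generation model. The field names (page_title, section_title, table, highlighted_cells) and the linearize helper are illustrative assumptions rather than the released dataset schema, and the table contents are toy values invented for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TableExample:
    """One controlled table-to-text example (hypothetical schema)."""
    page_title: str
    section_title: str
    table: List[List[str]]                    # row 0 holds the column headers
    highlighted_cells: List[Tuple[int, int]]  # (row, column) indices into `table`


def linearize(example: TableExample) -> str:
    """Flatten the titles and the highlighted cells into one input string."""
    headers = example.table[0]
    parts = [f"page: {example.page_title}", f"section: {example.section_title}"]
    # Visit highlighted cells in row-major order so the output is deterministic.
    for row, col in sorted(example.highlighted_cells):
        parts.append(f"{headers[col]}: {example.table[row][col]}")
    return " | ".join(parts)


# Toy data invented for illustration only.
example = TableExample(
    page_title="Example Player",
    section_title="Career statistics",
    table=[
        ["Year", "Team", "Games"],
        ["2001", "Team A", "16"],
        ["2002", "Team B", "14"],
    ],
    highlighted_cells=[(2, 1), (1, 0), (1, 2)],
)
print(linearize(example))
```

Only the highlighted cells and the page and section titles reach the model in this sketch, so any fact in the output that cannot be traced back to them is a candidate hallucination.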

Annotation Process
Designing an annotation process to obtain natural but also clean target sentences from tabular data is a significant challenge. Many datasets like Wikibio and RotoWire pair naturally occurring text heuristically with tables, a noisy process that makes it difficult to disentangle whether hallucination is primarily caused by data noise or model shortcomings. On the other hand, one can elicit annotators to write sentence targets from scratch, which are faithful to the table, but the resulting targets often lack variety in terms of structure and style.

In contrast, ToTTo is constructed using a novel data annotation strategy in which annotators revise existing Wikipedia sentences in stages. This results in target sentences that are clean, as well as natural, containing interesting and varied linguistic properties. The data collection and annotation process begins by collecting tables from Wikipedia, where a given table is paired with a summary sentence collected from the supporting page context according to heuristics, such as word overlap between the page text and the table and hyperlinks referencing tabular data. This summary sentence may contain information not supported by the table and may contain pronouns with antecedents found in the table only, not the sentence itself.
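
As a rough illustration of the word-overlap heuristic mentioned above, the following Python sketch scores each candidate sentence from a page by the fraction of its tokens that also appear in the table, then picks the best one. The tokenization, the scoring function, and the helper names are assumptions made for this sketch; the post does not specify the exact heuristics used.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word and number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(sentence: str, table: List[List[str]]) -> float:
    """Fraction of sentence tokens that also appear somewhere in the table."""
    sentence_tokens = tokens(sentence)
    if not sentence_tokens:
        return 0.0
    table_tokens: Set[str] = set()
    for row in table:
        for cell in row:
            table_tokens |= tokens(cell)
    return len(sentence_tokens & table_tokens) / len(sentence_tokens)


def best_candidate(sentences: List[str], table: List[List[str]]) -> str:
    """Pick the page sentence with the highest overlap with the table."""
    return max(sentences, key=lambda s: overlap_score(s, table))


# Toy data invented for illustration only.
table = [["Year", "Team", "Games"], ["2001", "Team A", "16"]]
sentences = [
    "He was born in a small town.",
    "In 2001 he played 16 games for Team A.",
]
print(best_candidate(sentences, table))
```

A sentence selected this way can still contain unsupported details (here, nothing in the table says who "he" is), which is why the annotation stages that follow are needed.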

The annotator then highlights the cells in the table that support the sentence and deletes phrases in the sentence that are not supported by the table. They also decontextualize the sentence so that it stands alone (e.g., by resolving pronouns correctly) and fix its grammar where necessary.

We show that annotators obtain high agreement on the above task: 0.856 Fleiss Kappa for cell highlighting, and 67.0 BLEU for the final target sentence.
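
For readers unfamiliar with the agreement statistic, the following Python sketch computes Fleiss' kappa from a matrix of rating counts. It assumes, purely for illustration, that each table cell is one item and that every annotator labels it as either highlighted or not highlighted; the post does not say how items and categories were defined for the reported 0.856, and the counts below are toy values.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] is the number of raters who put
    item i in category j.

    Every item must have the same number of ratings (at least two). The
    statistic is undefined when all ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: per-item pairwise agreement, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    return (p_bar - p_e) / (1 - p_e)


# Toy ratings: 4 cells, 3 annotators, columns = [highlighted, not highlighted].
toy_counts = [[3, 0], [0, 3], [2, 1], [0, 3]]
print(round(fleiss_kappa(toy_counts), 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so 0.856 indicates that annotators largely selected the same cells.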

Dataset Analysis
We conducted a topic analysis on the ToTTo dataset over 44 categories and found that the Sports and Countries topics, each of which consists of a range of fine-grained topics (e.g., football/olympics for sports and population/buildings for countries), together comprise 56.4% of the dataset. The other 44% is composed of a much broader set of topics, including Performing Arts, Transportation, and Entertainment.

Furthermore, we conducted a manual analysis of the different types of linguistic phenomena in the dataset over 100 randomly chosen examples. The table below summarizes the fraction of examples that require reference to the page and section titles, as well as some of the linguistic phenomena in the dataset that potentially pose new challenges to current systems.

Linguistic Phenomena                              Percentage
Require reference to page title                   82%
Require reference to section title                19%
Require reference to table description            3%
Reasoning (logical, numerical, temporal etc.)     21%
Comparison across rows/columns/cells              13%
Require background information                    12%

Baseline Results
We present baseline results for three state-of-the-art models from the literature (BERT-to-BERT, Pointer Generator, and the Puduppully 2019 model) on two evaluation metrics, BLEU and PARENT. In addition to reporting the score on the overall test set, we also evaluate each model on a more challenging subset consisting of out-of-domain examples. As the table below shows, the BERT-to-BERT model performs best in terms of both BLEU and PARENT. Moreover, all models achieve considerably lower performance on the challenge set, indicating the difficulty of out-of-domain generalization.

Model                     BLEU (overall)   PARENT (overall)   BLEU (challenge)   PARENT (challenge)
BERT-to-BERT              43.9             52.6               34.8               46.7
Pointer Generator         41.6             51.6               32.2               45.2
Puduppully et al. 2019    19.2             29.2               13.9               25.8

While automatic metrics can give some indication of performance, they are not currently sufficient for evaluating hallucination in text generation systems. To better understand hallucination, we manually evaluate the top performing baseline, to determine how faithful it is to the content in the source table, under the assumption that discrepancies indicate hallucination. To compute the “Expert” performance, for each example in our multi-reference test set, we held out one reference and asked annotators to compare it with the other references for faithfulness. As the results show, the top performing baseline appears to hallucinate information ~20% of the time.

Model           Faithfulness (overall)   Faithfulness (challenge)
Expert          93.6                     91.4
BERT-to-BERT    76.2                     74.2
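
The relationship between the faithfulness scores and the hallucination rate can be checked with a few lines of arithmetic. The Python sketch below uses only the four numbers from the table above; the helper name is an assumption, and treating "not judged faithful" as "hallucinated" follows the assumption stated in the text.

```python
# Faithfulness percentages copied from the table above: (overall, challenge).
faithfulness = {
    "Expert": (93.6, 91.4),
    "BERT-to-BERT": (76.2, 74.2),
}


def unfaithful_rate(faithful_pct: float) -> float:
    """Share of outputs judged not faithful to the table, in percent."""
    return round(100.0 - faithful_pct, 1)


for split, index in (("overall", 0), ("challenge", 1)):
    model_rate = unfaithful_rate(faithfulness["BERT-to-BERT"][index])
    expert_rate = unfaithful_rate(faithfulness["Expert"][index])
    gap = round(model_rate - expert_rate, 1)
    print(f"{split}: model {model_rate}%, expert {expert_rate}%, gap {gap} points")
```

On the overall test set this gives an unfaithful rate of 23.8% for BERT-to-BERT against 6.4% for the held-out expert reference, a gap of 17.4 points.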

Model Errors and Challenges
In the table below, we present a selection of the observed model errors to highlight some of the more challenging aspects of the ToTTo dataset. We find that state-of-the-art models struggle with hallucination, numerical reasoning, and rare topics, even when using cleaned references (errors in red). The last example shows that even when the model output is correct it is sometimes not as informative as the original reference which contains more reasoning about the table (shown in blue).

Reference: in the 1939 currie cup, western province lost to transvaal by 17–6 in cape town.
Model Prediction: the first currie cup was played in 1939 in transvaal1 at newlands, with western province winning 17–6.

Reference: a second generation of microdrive was announced by ibm in 2000 with increased capacities at 512 mb and 1 gb.
Model Prediction: there were 512 microdrive models in 2000: 1 gigabyte.

Reference: the 1956 grand prix motorcycle racing season consisted of six grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc.
Model Prediction: the 1956 grand prix motorcycle racing season consisted of eight grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc.

Reference: in travis kelce’s last collegiate season, he set personal career highs in receptions (45), receiving yards (722), yards per receptions (16.0) and receiving touchdowns (8).
Model Prediction: travis kelce finished the 2012 season with 45 receptions for 722 yards (16.0 avg.) and eight touchdowns.

Conclusion
In this work, we presented ToTTo, a large, English table-to-text dataset that presents both a controlled generation task and a data annotation process based on iterative sentence revision. We also provided several state-of-the-art baselines, and demonstrated ToTTo could be a useful dataset for modeling research as well as for developing evaluation metrics that can better detect model improvements.

In addition to the proposed task, we hope our dataset can also be helpful for other tasks such as table understanding and sentence revision. ToTTo is available at our GitHub repo.

Acknowledgements
The authors wish to thank Ming-Wei Chang, Jonathan H. Clark, Kenton Lee, and Jennimaria Palomaki for their insightful discussions and support. Many thanks also to Ashwin Kakarla and his team for help with the annotations.

Source: Google AI Blog


Encode, Tag and Realize: A Controllable and Efficient Approach for Text Generation



Sequence-to-sequence (seq2seq) models have revolutionized the field of machine translation and have become the tool of choice for various text-generation tasks, such as summarization, sentence fusion and grammatical error correction. Improvements in model architecture (e.g., Transformer) and the ability to leverage large corpora of unannotated text via unsupervised pre-training have enabled the quality gains in neural network approaches we have seen in recent years.

Yet, the use of seq2seq models for text generation can come with a number of substantial drawbacks depending on the use case, such as producing outputs that are not supported by the input text (known as hallucination) and requiring large amounts of training data to reach good performance. Furthermore, seq2seq models are inherently slow at inference time, since they typically generate the output word-by-word.

In “Encode, Tag, Realize: High-Precision Text Editing,” we present a novel, open-sourced method for text generation, which is designed specifically to address these three shortcomings. This method is called LaserTagger, owing to its speed and precision. Instead of generating the output text from scratch, LaserTagger produces output by tagging words with predicted edit operations that are then applied to the input words in a separate realization step. This is a less error-prone way of tackling text generation, and it can be handled by a model architecture that is easier to train and faster to execute.

Design and Functionality of LaserTagger
A distinct characteristic of many text-generation tasks is that there is often a high overlap between the input and the output. For instance, when detecting and fixing grammatical mistakes or when fusing sentences, most of the input text can remain unchanged, and only a small fraction of the words needs to be modified. For this reason, LaserTagger produces a sequence of edit operations instead of actual words. The four types of edit operations we use are: Keep (copies a word to the output), Delete (removes a word) and Keep-AddX / Delete-AddX (adds phrase X before the tagged word and optionally deletes the tagged word). This process is illustrated in the figure below, which shows an application of LaserTagger to sentence fusion.
LaserTagger applied to sentence fusion. The predicted edit operations correspond to deleting “. Turing” and adding “and he” before it. Notice the high overlap between the input and output text.
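
The realization step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released LaserTagger code: the tag encoding (a base operation plus an optional phrase to insert before the tagged word) and the function name are assumptions, and the input sentence is reconstructed from the caption for illustration.

```python
from typing import List, Optional, Tuple

# A tag is (base operation, optional phrase to add before the tagged word).
Tag = Tuple[str, Optional[str]]


def realize(tokens: List[str], tags: List[Tag]) -> List[str]:
    """Apply one predicted edit tag per input token and return the output tokens."""
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per token")
    output: List[str] = []
    for token, (base, phrase) in zip(tokens, tags):
        if base not in ("KEEP", "DELETE"):
            raise ValueError(f"unknown base operation: {base}")
        if phrase:
            output.extend(phrase.split())  # the AddX part: insert phrase X first
        if base == "KEEP":
            output.append(token)           # KEEP copies the word; DELETE drops it
    return output


source = "Turing was born in 1912 . Turing died in 1954 .".split()
tags: List[Tag] = (
    [("KEEP", None)] * 5        # Turing was born in 1912
    + [("DELETE", None)]        # "."
    + [("DELETE", "and he")]    # second "Turing", replaced by the added phrase
    + [("KEEP", None)] * 4      # died in 1954 .
)
print(" ".join(realize(source, tags)))
```

Because every output word is either copied from the input or drawn from the added phrase, the model has no way to emit arbitrary words.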
All added phrases come from a restricted vocabulary. This vocabulary is the result of an optimization process that has two goals: (1) minimizing the vocabulary size and (2) maximizing the number of training examples, where the only words necessary to add to the target text come from the vocabulary alone. Having a restricted phrase vocabulary makes the space of output decisions smaller and prevents the model from adding arbitrary words, hence mitigating the problem of hallucination. A corollary of the high-overlap property of input and output texts is that required modifications tend to be local and independent from one another. This means that the edit operations can be predicted in parallel with high accuracy, enabling a significant end-to-end speed up compared to autoregressive seq2seq models, which perform the predictions sequentially, conditioning on the previous predictions.
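
The two goals of the vocabulary optimization can be illustrated with a simplified Python sketch. It aligns each source and target with difflib from the standard library, collects the target phrases that cannot be copied from the source, keeps the most frequent ones, and reports how many training examples are fully covered. The frequency cutoff and the helper names are assumptions made for this sketch and may differ from the procedure in the paper.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List, Tuple

Pair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def added_phrases(source: List[str], target: List[str]) -> List[str]:
    """Target-side phrases that are not copied from the aligned source."""
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    phrases = []
    for op, _, _, j1, j2 in matcher.get_opcodes():
        if op in ("insert", "replace"):
            phrases.append(" ".join(target[j1:j2]))
    return phrases


def build_vocabulary(pairs: List[Pair], size: int) -> List[str]:
    """Keep the `size` phrases needed by the largest number of examples."""
    counts: Counter = Counter()
    for source, target in pairs:
        counts.update(set(added_phrases(source, target)))
    return [phrase for phrase, _ in counts.most_common(size)]


def coverage(pairs: List[Pair], vocabulary: List[str]) -> float:
    """Fraction of examples whose added phrases all come from the vocabulary."""
    vocab = set(vocabulary)
    covered = sum(
        all(phrase in vocab for phrase in added_phrases(source, target))
        for source, target in pairs
    )
    return covered / len(pairs)


pairs: List[Pair] = [
    ("Turing was born in 1912 . Turing died in 1954 .".split(),
     "Turing was born in 1912 and he died in 1954 .".split()),
    ("Bohr was born in 1885 . Bohr died in 1962 .".split(),
     "Bohr was born in 1885 and he died in 1962 .".split()),
    ("She likes tea . She likes coffee .".split(),
     "She likes tea and she likes coffee .".split()),
]
vocabulary = build_vocabulary(pairs, size=1)
print(vocabulary, coverage(pairs, vocabulary))
```

Growing the vocabulary raises coverage but enlarges the space of output decisions, which is the trade-off the optimization has to balance.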

Results
We evaluated LaserTagger on four tasks: sentence fusion, split and rephrase, abstractive summarization, and grammar correction. Across the tasks, LaserTagger performs comparably to a strong BERT-based seq2seq baseline that uses a large number of training examples, and clearly outperforms this baseline when the number of training examples is limited. Below we show the results on the WikiSplit dataset, where the task is to rephrase a long sentence into two coherent short sentences.
When training the models on the full dataset of 1 million examples, LaserTagger and a BERT-based seq2seq baseline model perform comparably, but when training on a subsample of 10,000 examples or fewer, LaserTagger clearly outperforms the baseline model (the higher the SARI score the better).
Key Advantages of LaserTagger
Compared to traditional seq2seq methods, LaserTagger has the following advantages:
  1. Control: By controlling the output phrase vocabulary, which we can also manually edit or curate, LaserTagger is less susceptible to hallucination than the seq2seq baseline.
  2. Inference speed: LaserTagger computes predictions up to 100 times faster than the seq2seq baseline, making it suitable for real-time applications.
  3. Data efficiency: LaserTagger produces reasonable outputs, even when trained using only a few hundred or a few thousand training examples. In our experiments, a competitive seq2seq baseline required tens of thousands of examples to obtain comparable performance.
Why This Matters
The advantages of LaserTagger become even more pronounced when applied at large scale, such as improving the formulation of voice answers in some services by reducing the length of the responses and making them less repetitive. The high inference speed allows the model to be plugged into an existing technology stack, without adding any noticeable latency on the user side, while the improved data efficiency enables the collection of training data for many languages, thus benefiting users from different language backgrounds.

In our current work, we strive to bring similar improvements to other Google technologies that produce natural language. Furthermore, we are exploring how the editing of texts (instead of their generation from scratch) can help us to better understand user queries as they grow longer, become more complex, and come as part of a dialogue discourse. The code for LaserTagger is open-sourced to the community through our GitHub repo.

Acknowledgements
This research was conducted by Eric Malmi, Sebastian Krause, Sascha Rothe, Daniil Mirylenka, and Aliaksei Severyn. We are grateful for useful discussions with Enrique Alfonseca, Idan Szpektor, and Orgad Keller.

Source: Google AI Blog


Exploring Massively Multilingual, Massive Neural Machine Translation



“... perhaps the way [of translation] is to descend, from each language, down to the common base of human communication — the real but as yet undiscovered universal language — and then re-emerge by whatever particular route is convenient.” Warren Weaver, 1949

Over the last few years there has been enormous progress in the quality of machine translation (MT) systems, breaking language barriers around the world thanks to developments in neural machine translation (NMT). The success of NMT, however, is owed largely to the great amounts of supervised training data. But what about languages where data is scarce, or even absent? Multilingual NMT, with the inductive bias that “the learning signal from one language should benefit the quality of translation to other languages”, is a potential remedy.

Multilingual machine translation processes multiple languages using a single translation model. The success of multilingual training for data-scarce languages has been demonstrated for automatic speech recognition and text-to-speech systems, and by prior research on multilingual translation [1,2,3]. We previously studied the effect of scaling up the number of languages that can be learned in a single neural network, while controlling the amount of training data per language. But what happens once all constraints are removed? Can we train a single model using all of the available data, despite the huge differences across languages in data size, scripts, complexity and domains?

In “Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges” and follow-up papers [4,5,6,7], we push the limits of research on multilingual NMT by training a single NMT model on 25+ billion sentence pairs, from 100+ languages to and from English, with 50+ billion parameters. The result is an approach for massively multilingual, massive neural machine translation (M4) that demonstrates large quality improvements on both low- and high-resource languages and can be easily adapted to individual domains/languages, while showing great efficacy on cross-lingual downstream transfer tasks.

Massively Multilingual Machine Translation
Though data skew across language pairs is a great challenge in NMT, it also creates an ideal scenario in which to study transfer, where insights gained through training on one language can be applied to the translation of other languages. On one end of the distribution, there are high-resource languages like French, German and Spanish, for which there are billions of parallel examples, while on the other end, supervised data for low-resource languages such as Yoruba, Sindhi and Hawaiian is limited to a few tens of thousands of examples.
The data distribution over all language pairs (in log scale) and the relative translation quality (BLEU score) of the bilingual baselines trained on each one of these specific language pairs.
Once trained using all of the available data (25+ billion examples from 103 languages), we observe strong positive transfer towards low-resource languages, dramatically improving the translation quality of 30+ languages at the tail of the distribution by an average of 5 BLEU points. This effect is already known, but surprisingly encouraging, considering the comparison is between bilingual baselines (i.e., models trained only on specific language pairs) and a single multilingual model with representational capacity similar to a single bilingual model. This finding hints that massively multilingual models are effective at generalization, and capable of capturing the representational similarity across a large body of languages.
Translation quality comparison of a single massively multilingual model against bilingual baselines that are trained for each one of the 103 language pairs.
In our EMNLP’19 paper [5], we compare the representations of multilingual models across different languages. We find that multilingual models learn shared representations for linguistically similar languages without the need for external constraints, validating long-standing intuitions and empirical results that exploit these similarities. In [6], we further demonstrate the effectiveness of these learned representations on cross-lingual transfer on downstream tasks.
Visualization of the clustering of the encoded representations of all 103 languages, based on representational similarity. Languages are color-coded by their linguistic family.
Building Massive Neural Networks
As we increase the number of low-resource languages in the model, the quality of high-resource language translations starts to decline. This regression is a recognized effect in multi-task setups, arising from inter-task competition and the unidirectional nature of transfer (i.e., from high- to low-resource). While working on better learning and capacity control algorithms to mitigate this negative transfer, we also extend the representational capacity of our neural networks by increasing the number of model parameters, which improves the quality of translation for high-resource languages.

Numerous design choices can be made to scale neural network capacity, including adding more layers or making the hidden representations wider. Continuing our study on training deeper networks for translation, we utilized GPipe [4] to train 128-layer Transformers with over 6 billion parameters. Increasing the model capacity resulted in significantly improved performance across all languages by an average of 5 BLEU points. We also studied other properties of very deep networks, including the depth-width trade-off, trainability challenges and design choices for scaling Transformers to over 1500 layers with 84 billion parameters.

While scaling depth is one approach to increasing model capacity, exploring architectures that can exploit the multi-task nature of the problem is a very plausible complementary way forward. By modifying the Transformer architecture through the substitution of the vanilla feed-forward layers with sparsely-gated mixture of experts, we drastically scale up the model capacity, allowing us to successfully train models that surpass 50 billion parameters, which further improved translation quality across the board.
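
To illustrate the idea of replacing a feed-forward layer with a sparsely-gated mixture of experts, here is a minimal pure-Python sketch of top-k routing for a single token vector. The layer sizes, the random weights, the top_k value, and all function names are assumptions for this sketch; details used in practice, such as noisy gating and load balancing, are omitted.

```python
import math
import random
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Expert = Callable[[Vector], Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_expert(rng: random.Random, dim: int, hidden: int) -> Expert:
    """A small feed-forward expert (linear, ReLU, linear) with random weights."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def expert(x: Vector) -> Vector:
        return matvec(w_out, [max(0.0, v) for v in matvec(w_in, x)])

    return expert


def moe_layer(x: Vector, gate: Matrix, experts: List[Expert],
              top_k: int = 2) -> Tuple[Vector, List[int]]:
    """Route x to the top_k experts by gate logit and mix their outputs."""
    logits = matvec(gate, x)  # one logit per expert
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the selected experts only; the others are never evaluated.
    peak = max(logits[i] for i in chosen)
    scores = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(scores.values())
    output = [0.0] * len(x)
    for i in chosen:
        weight = scores[i] / total
        for d, value in enumerate(experts[i](x)):
            output[d] += weight * value
    return output, chosen


rng = random.Random(0)
dim, hidden, n_experts = 4, 8, 6
experts = [make_expert(rng, dim, hidden) for _ in range(n_experts)]
gate = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
output, chosen = moe_layer([0.5, -1.0, 0.25, 2.0], gate, experts)
print(chosen, output)
```

Adding experts increases the parameter count, while the computation per token stays tied to top_k, which is how capacity can grow without a proportional increase in cost.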
Translation quality improvement of a single massively multilingual model as we increase the capacity (number of parameters) compared to 103 individual bilingual baselines.
Making M4 Practical
It is inefficient to train large models with extremely high computational costs for every individual language, domain or transfer task. Instead, we present methods [7] to make these models more practical by using capacity-tunable layers to adapt a new model to specific languages or domains, without altering the original.
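
One way to picture adaptation that leaves the original model untouched is a small residual bottleneck layer placed on top of a frozen layer, with one such layer per language or domain. The Python sketch below illustrates that general idea only; the class, the toy weights, and the language key are assumptions, and the post does not describe the exact form of the layers used in [7].

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Frozen weights of one layer of the shared multilingual model (toy values).
BASE_WEIGHTS: Matrix = [[1.0, 0.0], [0.0, 1.0]]


class Adapter:
    """Small bottleneck layer added on top of a frozen layer (hypothetical)."""

    def __init__(self, dim: int, bottleneck: int) -> None:
        self.w_down: Matrix = [[0.1] * dim for _ in range(bottleneck)]
        # Zero up-projection: a fresh adapter leaves the base output unchanged.
        self.w_up: Matrix = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, hidden: Vector) -> Vector:
        squeezed = [max(0.0, v) for v in matvec(self.w_down, hidden)]
        delta = matvec(self.w_up, squeezed)
        return [h + d for h, d in zip(hidden, delta)]  # residual connection


def forward(x: Vector, adapters: Dict[str, Adapter], language: str) -> Vector:
    hidden = matvec(BASE_WEIGHTS, x)  # shared layer, never updated
    adapter = adapters.get(language)
    return adapter(hidden) if adapter else hidden


adapters = {"yo": Adapter(dim=2, bottleneck=1)}
x = [1.0, 2.0]
before = forward(x, adapters, "yo")   # same as the base output while w_up is zero
adapters["yo"].w_up = [[0.5], [0.0]]  # stands in for training only the adapter
after = forward(x, adapters, "yo")
print(before, after)
```

The size of the bottleneck controls how much capacity each language or domain receives, and removing an adapter restores the behavior of the original model exactly.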

Next Steps
At least half of the 7,000 languages currently spoken will no longer exist by the end of this century*. Can multilingual machine translation come to the rescue? We see the M4 approach as a stepping stone towards serving the next 1,000 languages; starting from such multilingual models will allow us to easily extend to new languages, domains and downstream tasks, even when parallel data is unavailable. Indeed, the path is rocky, and on the road to universal MT many promising solutions appear to be interdisciplinary. This makes multilingual NMT a plausible test bed for machine learning practitioners and theoreticians interested in exploring the annals of multi-task learning, meta-learning, training dynamics of deep nets and much more. We still have a long way to go.

Acknowledgements
This effort is built on contributions from Naveen Arivazhagan, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Chen, Yuan Cao, Yanping Huang, Sneha Kudugunta, Isaac Caswell, Aditya Siddhant, Wei Wang, Roee Aharoni, Sébastien Jean, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen and Yonghui Wu. We would also like to acknowledge support from the Google Translate, Brain, and Lingvo development teams, Jakob Uszkoreit, Noam Shazeer, Hyouk Joong Lee, Dehao Chen, Youlong Cheng, David Grangier, Colin Raffel, Katherine Lee, Thang Luong, Geoffrey Hinton, Manisha Jain, Pendar Yousefi and Macduff Hughes.


* The Cambridge Handbook of Endangered Languages (Austin and Sallabank, 2011).

Source: Google AI Blog