Tag Archives: Natural Language Processing

Google at NAACL



This week, New Orleans, LA hosted the conference of the North American Chapter of the Association for Computational Linguistics (NAACL), a venue for the latest research on computational approaches to understanding natural language. Google once again had a strong presence, presenting our research on a diverse set of topics, including dialog, summarization, machine translation, and linguistic analysis. In addition to contributing publications, Googlers were also involved as committee members, workshop organizers, and panelists, and presented one of the conference keynotes. We also provided telepresence robots, which enabled researchers who couldn’t attend in person to present their work remotely at the Widening Natural Language Processing Workshop (WiNLP).
Googler Margaret Mitchell and a researcher using our telepresence robots to remotely present their work at the WiNLP workshop.
This year NAACL also introduced a new Test of Time Award recognizing influential papers published between 2002 and 2012. We are happy and honored to recognize that all three papers receiving the award (listed below with a short summary) were co-authored by researchers who are now at Google (in blue):

BLEU: a Method for Automatic Evaluation of Machine Translation (2002)
Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu
Before the introduction of the BLEU metric, comparing Machine Translation (MT) models required expensive human evaluation. While human evaluation is still the gold standard, the strong correlation of BLEU with human judgment has permitted much faster experiment cycles. BLEU has been a reliable measure of progress, persisting through multiple paradigm shifts in MT.
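For a quick sense of how the metric behaves in practice, here is a toy sentence-level BLEU computation using NLTK; the corpus-level metric described in the paper adds a brevity penalty and aggregates n-gram counts over a whole test set, so treat this only as an illustrative sketch.

```python
# Toy illustration only: sentence-level BLEU with bigram weights via NLTK.
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # list of reference token lists
hypothesis = ["the", "cat", "sat", "on", "a", "mat"]      # candidate translation tokens

# Higher is better; weights=(0.5, 0.5) scores unigram and bigram precision only.
print(sentence_bleu(reference, hypothesis, weights=(0.5, 0.5)))
```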

Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms (2002)
Michael Collins
The structured perceptron is a generalization of the classical perceptron to structured prediction problems, where the number of possible "labels" for each input is a very large set, and each label has rich internal structure. Canonical examples are speech recognition, machine translation, and syntactic parsing. The structured perceptron was one of the first algorithms proposed for structured prediction, and has been shown to be effective in spite of its simplicity.
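The core of the algorithm fits in a few lines. Below is a minimal, unaveraged sketch of the structured perceptron update; the feature and decoding functions are placeholders that would be task-specific (e.g., Viterbi decoding for tagging).

```python
from collections import defaultdict

def structured_perceptron(train, features, decode, epochs=5):
    """Sketch of the structured perceptron update rule.

    train:    iterable of (x, y_gold) pairs
    features: maps (x, y) to a dict of feature counts
    decode:   returns the highest-scoring structure y for x under weights w
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y_gold in train:
            y_pred = decode(x, w)
            if y_pred != y_gold:
                # Promote the gold structure's features, demote the prediction's.
                for f, v in features(x, y_gold).items():
                    w[f] += v
                for f, v in features(x, y_pred).items():
                    w[f] -= v
    return w
```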

Thumbs up?: Sentiment Classification using Machine Learning Techniques (2002)
Bo Pang, Lillian Lee, Shivakumar Vaithyanathan
This paper is amongst the first works in sentiment analysis and helped define the subfield of sentiment and opinion analysis and review mining. The paper introduced a new way to look at document classification, developed the first solutions to it using supervised machine learning methods, and discussed insights and challenges. This paper also had significant data impact -- the movie review dataset has supported much of the early work in this area and is still one of the commonly used benchmark evaluation datasets.

If you attended NAACL 2018, we hope that you stopped by the booth to check out some demos, meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. You can learn more about Google research presented at NAACL 2018 below (Googlers highlighted in blue), and visit the Google AI Language Team page.

Keynote
Google Assistant or My Assistant? Towards Personalized Situated Conversational Agents
Dilek Hakkani-Tür

Publications
Bootstrapping a Neural Conversational Agent with Dialogue Self-Play, Crowdsourcing and On-Line Reinforcement Learning
Pararth Shah, Dilek Hakkani-Tür, Bing Liu, Gokhan Tür

SHAPED: Shared-Private Encoder-Decoder for Text Style Adaptation
Ye Zhang, Nan Ding, Radu Soricut

Olive Oil is Made of Olives, Baby Oil is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model
Vered Shwartz, Chris Waterson

Are All Languages Equally Hard to Language-Model?
Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, Brian Roark

Self-Attention with Relative Position Representations
Peter Shaw, Jakob Uszkoreit, Ashish Vaswani

Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems
Bing Liu, Gokhan Tür, Dilek Hakkani-Tür, Pararth Shah, Larry Heck

Workshops
Subword & Character Level Models in NLP
Organizers: Manaal Faruqui, Hinrich Schütze, Isabel Trancoso, Yulia Tsvetkov, Yadollah Yaghoobzadeh

Storytelling Workshop
Organizers: Margaret Mitchell, Ishan Misra, Ting-Hao 'Kenneth' Huang, Frank Ferraro

Ethics in NLP
Organizers: Michael Strube, Dirk Hovy, Margaret Mitchell, Mark Alfano

NAACL HLT Panels
Careers in Industry
Participants: Philip Resnik (moderator), Jason Baldridge, Laura Chiticariu, Marie Mateer, Dan Roth

Ethics in NLP
Participants: Dirk Hovy (moderator), Margaret Mitchell, Vinodkumar Prabhakaran, Mark Yatskar, Barbara Plank

Source: Google AI Blog


Advances in Semantic Textual Similarity



The recent rapid progress of neural network-based natural language understanding research, especially on learning semantic text representations, can enable truly novel products such as Smart Compose and Talk to Books. It can also help improve performance on a variety of natural language tasks which have limited amounts of training data, such as building strong text classifiers from as few as 100 labeled examples.

Below, we discuss two papers reporting recent progress on semantic representation research at Google, as well as two new models available for download on TensorFlow Hub that we hope developers will use to build new and exciting applications.

Semantic Textual Similarity
In “Learning Semantic Textual Similarity from Conversations”, we introduce a new way to learn sentence representations for semantic textual similarity. The intuition is that sentences are semantically similar if they have a similar distribution of responses. For example, “How old are you?” and “What is your age?” are both questions about age, which can be answered by similar responses such as “I am 20 years old”. In contrast, while “How are you?” and “How old are you?” contain almost identical words, they have very different meanings and lead to different responses.
Sentences are semantically similar if they can be answered by the same responses. Otherwise, they are semantically different.
In this work, we aim to learn semantic similarity by way of a response classification task: given a conversational input, we wish to classify the correct response from a batch of randomly selected responses. But the ultimate goal is to learn a model that can return encodings representing a variety of natural language relationships, including similarity and relatedness. By adding another prediction task (in this case, entailment on the SNLI dataset) and forcing both through shared encoding layers, we get even better performance on similarity measures such as the STS Benchmark (a sentence similarity benchmark) and CQA task B (a question/question similarity task). This is because logical entailment is quite different from simple equivalence and provides more signal for learning complex semantic representations.
For a given input, classification is considered a ranking problem against potential candidates.
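To make the training setup concrete, here is a minimal sketch of an in-batch ranking loss of this kind, assuming two encoder towers have already produced fixed-size vectors for the inputs and their responses; the exact scoring layers and loss used in the paper may differ.

```python
import tensorflow as tf

def response_ranking_loss(input_vecs, response_vecs):
    """In-batch softmax loss: each input should rank its own response highest.

    input_vecs, response_vecs: [batch, dim] encodings from the two encoder towers;
    the other responses in the batch serve as randomly selected negatives.
    """
    scores = tf.matmul(input_vecs, response_vecs, transpose_b=True)  # [batch, batch]
    labels = tf.range(tf.shape(scores)[0])  # the i-th response is correct for input i
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=scores))
```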
Universal Sentence Encoder
In “Universal Sentence Encoder”, we introduce a model that extends the multitask training described above by adding more tasks, jointly training them with a skip-thought-like model that predicts sentences surrounding a given selection of text. However, instead of the encoder-decoder architecture in the original skip-thought model, we make use of an encode-only architecture by way of a shared encoder to drive the prediction tasks. In this way, training time is greatly reduced while preserving the performance on a variety of transfer tasks including sentiment and semantic similarity classification. The aim is to provide a single encoder that can support as wide a variety of applications as possible, including paraphrase detection, relatedness, clustering and custom text classification.
Pairwise semantic similarity comparison via outputs from TensorFlow Hub Universal Sentence Encoder.
As described in our paper, one version of the Universal Sentence Encoder model uses a deep averaging network (DAN) encoder, while a second version uses a more complicated self-attention network architecture, the Transformer.
Multi-task training as described in “Universal Sentence Encoder”. A variety of tasks and task structures are joined by shared encoder layers/parameters (grey boxes).
With the more complicated architecture, the model performs better than the simpler DAN model on a variety of sentiment and similarity classification tasks, and for short sentences it is only moderately slower. However, compute time for the Transformer model increases noticeably as sentence length grows, whereas the compute time for the DAN model stays nearly constant.
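For intuition, a DAN-style encoder can be sketched in a few lines: average the word embeddings, then refine that average with feed-forward layers. The sizes below are illustrative, and this is not the released model.

```python
import tensorflow as tf

def dan_encoder(vocab_size=20000, embed_dim=256, out_dim=512):
    # Average word embeddings, then pass the average through feed-forward layers.
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(out_dim, activation="relu"),
        tf.keras.layers.Dense(out_dim, activation="relu"),
    ])
```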

New Models
In addition to the Universal Sentence Encoder model described above, we are also sharing two new models on TensorFlow Hub: the Universal Sentence Encoder - Large and Universal Sentence Encoder - Lite. These are pretrained TensorFlow models that return a semantic encoding for variable-length text inputs. The encodings can be used for semantic similarity measurement, relatedness, classification, or clustering of natural language text (see the usage sketch after the list below).
  • The Large model is trained with the Transformer encoder described in our second paper. It targets scenarios requiring high precision semantic representations and the best model performance at the cost of speed & size.
  • The Lite model is trained on a SentencePiece vocabulary instead of words in order to significantly reduce the vocabulary size, which is a major contributor to model size. It targets scenarios where resources like memory and CPU are limited, such as on-device or browser-based implementations.
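As a usage sketch, the snippet below loads one of the modules from TensorFlow Hub and compares sentence encodings with inner products. The module handle and version are illustrative, and the TF 1.x-style session usage matches the era of this post; newer TensorFlow Hub releases load modules with hub.load instead.

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Module handle is illustrative; see tfhub.dev for the current URL and version.
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")

sentences = ["How old are you?", "What is your age?", "How are you?"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embed(sentences))  # one fixed-size vector per sentence

# The encodings are approximately unit length, so inner products behave like
# cosine similarities: the two age questions should score highest together.
print(np.inner(vectors, vectors))
```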
We're excited to share this research, and these models, with the community. We believe that what we're showing here is just the beginning, and that there remain important research problems to be addressed, such as extending the techniques to more languages (the models discussed above currently support English). We also hope to further develop this technology so it can understand text at the paragraph or even document level. In achieving these tasks, it may be possible to make an encoder that is truly “universal”.

Acknowledgements
Daniel Cer, Mario Guajardo-Cespedes, Sheng-Yi Kong, Noah Constant for training the models, Nan Hua, Nicole Limtiaco, Rhomni St. John for transferring tasks, Steve Yuan, Yunhsuan Sung, Brian Strope, Ray Kurzweil for discussion of the model architecture. Special thanks to Sheng-Yi Kong and Noah Constant for training the Lite model.

Source: Google AI Blog


Smart Compose: Using Neural Networks to Help Write Emails



Last week at Google I/O, we introduced Smart Compose, a new feature in Gmail that uses machine learning to interactively offer sentence completion suggestions as you type, allowing you to draft emails faster. Building upon technology developed for Smart Reply, Smart Compose offers a new way to help you compose messages — whether you are responding to an incoming email or drafting a new one from scratch.
In developing Smart Compose, there were a number of key challenges to face, including:
  • Latency: Since Smart Compose provides predictions on a per-keystroke basis, it ideally must respond within 100ms so that the user does not notice any delay. Balancing model complexity and inference speed was a critical issue.
  • Scale: Gmail is used by more than 1.4 billion diverse users. In order to provide auto completions that are useful for all Gmail users, the model has to have enough modeling capacity so that it is able to make tailored suggestions in subtly different contexts.
  • Fairness and Privacy: In developing Smart Compose, we needed to address sources of potential bias in the training process, and had to adhere to the same rigorous user privacy standards as Smart Reply, making sure that our models never expose users’ private information. Furthermore, researchers had no access to emails, which meant they had to develop and train a machine learning system to work on a dataset that they themselves could not read.
Finding the Right Model
Typical language generation models, such as n-gram, neural bag-of-words (BoW), and RNN language (RNN-LM) models, learn to predict the next word conditioned on the prefix word sequence. In an email, however, the words a user has typed in the current email composing session are only one “signal” a model can use to predict the next word. In order to incorporate more context about what the user wants to say, our model is also conditioned on the email subject and the previous email body (if the user is replying to an incoming email).

One approach to include this additional context is to cast the problem as a sequence-to-sequence (seq2seq) machine translation task, where the source sequence is the concatenation of the subject and the previous email body (if there is one), and the target sequence is the current email the user is composing. While this approach worked well in terms of prediction quality, it failed to meet our strict latency constraints by orders of magnitude.

To improve on this, we combined a BoW model with an RNN-LM, which is faster than the seq2seq models with only a slight sacrifice to model prediction quality. In this hybrid approach, we encode the subject and previous email by averaging the word embeddings in each field. We then join those averaged embeddings, and feed them to the target sequence RNN-LM at every decoding step, as the model diagram below shows.
Smart Compose RNN-LM model architecture. Subject and previous email message are encoded by averaging the word embeddings in each field. The averaged embeddings are then fed to the RNN-LM at each decoding step.
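A minimal sketch of this hybrid architecture is shown below, using Keras layers with an LSTM standing in for the RNN-LM and a fixed maximum prefix length for simplicity; the layer sizes and choices are illustrative and this is not the production model.

```python
import tensorflow as tf

# Illustrative sizes only; the production model's vocabulary, dimensions and RNN
# variant are not described in detail here.
VOCAB, EMBED, HIDDEN, MAX_PREFIX = 20000, 256, 1024, 40

embed = tf.keras.layers.Embedding(VOCAB, EMBED)

subject = tf.keras.Input(shape=(None,), dtype="int32")        # tokenized subject
prev_body = tf.keras.Input(shape=(None,), dtype="int32")      # tokenized previous email
prefix = tf.keras.Input(shape=(MAX_PREFIX,), dtype="int32")   # what the user has typed so far

# Bag-of-words context: average the word embeddings of each field, then join them.
subject_vec = tf.keras.layers.GlobalAveragePooling1D()(embed(subject))
prev_vec = tf.keras.layers.GlobalAveragePooling1D()(embed(prev_body))
context = tf.keras.layers.Concatenate()([subject_vec, prev_vec])

# Feed the joined context to the language model at every decoding step by
# concatenating it with each prefix word embedding.
prefix_emb = embed(prefix)
tiled_context = tf.keras.layers.RepeatVector(MAX_PREFIX)(context)
rnn_in = tf.keras.layers.Concatenate()([prefix_emb, tiled_context])

states = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(rnn_in)
next_word_logits = tf.keras.layers.Dense(VOCAB)(states)       # next-word prediction per step

model = tf.keras.Model([subject, prev_body, prefix], next_word_logits)
```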
Accelerated Model Training & Serving
Of course, once we decided on this modeling approach we still had to tune various model hyperparameters and train the models over billions of examples, all of which can be very time-intensive. To speed things up, we used a full TPUv2 Pod to perform experiments. In doing so, we were able to train a model to convergence in less than a day.

Even after training our faster hybrid model, our initial version of Smart Compose running on a standard CPU had an average serving latency of hundreds of milliseconds, which is still unacceptable for a feature that is trying to save users' time. Fortunately, TPUs can also be used at inference time to greatly speed up the user experience. By offloading the bulk of the computation onto TPUs, we improved the average latency to tens of milliseconds while also greatly increasing the number of requests that can be served by a single machine.

Fairness and Privacy
Fairness in machine learning is very important, as language understanding models can reflect human cognitive biases resulting in unwanted word associations and sentence completions. As Caliskan et al. point out in their recent paper “Semantics derived automatically from language corpora contain human-like biases”, these associations are deeply entangled in natural language data, which presents a considerable challenge to building any language model. We are actively researching ways to continue to reduce potential biases in our training procedures. Also, since Smart Compose is trained on billions of phrases and sentences, similar to the way spam machine learning models are trained, we have done extensive testing to make sure that only common phrases used by multiple users are memorized by our model, using findings from this paper.

Future work
We are constantly working on improving the suggestion quality of the language generation model by following state-of-the-art architectures (e.g., Transformer, RNMT+, etc.) and experimenting with the most recent and advanced training techniques. We will deploy those more advanced models to production once our strict latency constraints can be met. We are also working on incorporating into our system personal language models, designed to more accurately emulate an individual’s style of writing.

Acknowledgements
The Smart Compose language generation model was developed by Benjamin Lee, Mia Chen, Gagan Bansal, Justin Lu, Jackie Tsay, Kaushik Roy, Tobias Bosch, Yinan Wang, Matthew Dierker, Katherine Evans, Thomas Jablin, Dehao Chen, Vinu Rajashekhar, Akshay Agrawal, Yuan Cao, Shuyuan Zhang, Xiaobing Liu, Noam Shazeer, Andrew Dai, Zhifeng Chen, Rami Al-Rfou, DK Choe, Yunhsuan Sung, Brian Strope, Timothy Sohn, Yonghui Wu, and many others.

Source: Google AI Blog


Introducing Semantic Experiences with Talk to Books and Semantris



Natural language understanding has evolved substantially in the past few years, in part due to the development of word vectors that enable algorithms to learn about the relationships between words, based on examples of actual language usage. These vector models map semantically similar phrases to nearby points based on equivalence, similarity or relatedness of ideas and language. Last year, we used hierarchical vector models of language to make improvements to Smart Reply for Gmail. More recently, we’ve been exploring other applications of these methods.

Today, we are proud to share Semantic Experiences, a website showing two examples of how these new capabilities can drive applications that weren’t possible before. Talk to Books is an entirely new way to explore books by starting at the sentence level, rather than the author or topic level. Semantris is a word association game powered by machine learning, where you type out words associated with a given prompt. We have also published “Universal Sentence Encoder”, which describes the models used for these examples in more detail. Lastly, we’ve provided a pretrained semantic TensorFlow module for the community to experiment with their own sentence and phrase encoding.

Modeling approach
Our approach extends the idea of representing language in a vector space by creating vectors for larger chunks of language such as full sentences and small paragraphs. Since language is composed of hierarchies of concepts, we create the vectors using a hierarchy of modules, each of which considers features that correspond to sequences at different temporal scales. Relatedness, synonymy, antonymy, meronymy, holonymy, and many other types of relationships may all be represented in vector space language models if we train them in the right way and then pose the right “questions”. We describe this method in our paper, “Efficient Natural Language Response Suggestion for Smart Reply.”

Talk to Books
With Talk to Books, we provide an entirely new way to explore books. You make a statement or ask a question, and the tool finds sentences in books that respond, with no dependence on keyword matching. In a sense you are talking to the books, getting responses which can help you determine if you’re interested in reading them or not.
Talk to Books
The models driving this experience were trained on a billion conversation-like pairs of sentences, learning to identify what a good response might look like. Once you ask your question (or make a statement), the tool searches all the sentences in over 100,000 books to find the ones that respond to your input based on semantic meaning at the sentence level; there are no predefined rules bounding the relationship between what you put in and the results you get.
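As a simplified sketch of this kind of sentence-level retrieval, the snippet below scores a query encoding against precomputed sentence encodings by brute force; the real system indexes book sentences with scalable nearest-neighbor search, and the random vectors here merely stand in for real encodings.

```python
import numpy as np

def top_responses(query_vec, sentence_vecs, sentences, k=5):
    """Return the k sentences whose encodings score highest against the query.

    Brute-force dot-product scoring; a production system would use an
    approximate nearest-neighbor index over the precomputed encodings.
    """
    scores = sentence_vecs @ query_vec
    best = np.argsort(-scores)[:k]
    return [(float(scores[i]), sentences[i]) for i in best]

# Toy usage: random vectors stand in for real sentence and query encodings.
rng = np.random.default_rng(0)
corpus = ["book sentence %d" % i for i in range(1000)]
corpus_vecs = rng.normal(size=(1000, 512))
query_vec = rng.normal(size=512)
print(top_responses(query_vec, corpus_vecs, corpus))
```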

This capability is unique and can help you find interesting books that a keyword search might not surface, but there’s still room for improvement. For example, this experiment works at the sentence level (rather than at the paragraph level, as in Smart Reply for Gmail) so a “good” matching sentence can still be taken out of context. You might find books and passages that you didn’t expect, or the reason a particular passage was highlighted might not be obvious. You may also notice that being well-known does not make a book sort to the top; this experiment looks only at how well the individual sentences match up. However, one benefit of this is that the tool may help people discover unexpected authors and titles, and surface books in a way that is fresh and innovative.

Semantris
We are also providing Semantris, a word association game that is powered by this technology. When you enter a word or phrase, the game ranks all of the words on-screen, scoring them based on how well they respond to what you typed. Again, similarity, opposites and neighboring concepts are all fair game with this semantic model. Try it out yourself to see what we mean! The time pressure in the Arcade version (shown below) will tempt you to enter single words as prompts. The Blocks version has no time pressure, which makes it a great place to try out entering phrases and sentences. You may enjoy exploring how obscure you can be with your hints.
Semantris Arcade
The examples we’re sharing today are just a few of the possible ways to think about experience and application design using these new tools. Other potential applications include classification, semantic similarity, semantic clustering, whitelist applications (selecting the right response from many alternatives), and semantic search (of which Talk to Books is an example). We hope you’ll come up with many more, inspired by these example applications. We look forward to seeing original and innovative uses of our TensorFlow models by the developer community.

Acknowledgements
Talk to Books was developed by Aaron Phillips, Amin Ahmad, Rachel Bernstein, Aaron Cohen, Noah Constant, Ray Kurzweil, Igor Krivokon, Vladimir Magay, Peter McKenzie, Bryan Richter, Chris Tar, Dave Uthus, and Ni Yan. Semantris was developed by Ben Pietrzak, RJ Mical, Steve Pucci, Maria Voitovich, Mo Adeleye, Diana Huang, Catherine McCurry, Tomomi Sohn, and Connor Moore. Core semantic search technology development was led by Brian Strope and Yunhsuan Sung. Fast scalable matching work was led by Sanjiv Kumar, Dave Dopson, and David Simcha. We'd also like to acknowledge Hallie Benjamin, Eric Breck, Mario Guajardo-Céspedes, Yoni Halpern, Margaret Mitchell, Ben Packer, Andrew Smart and Lucy Vasserman.

Source: Google AI Blog


SLING: A Natural Language Frame Semantic Parser



Until recently, most practical natural language understanding (NLU) systems used a pipeline of analysis stages, from part-of-speech tagging and dependency parsing to steps that computed a semantic representation of the input text. While this facilitated easy modularization of different analysis stages, errors in earlier stages would have cascading effects in later stages and the final representation, and the intermediate stage outputs might not be relevant on their own. For example, a typical pipeline might perform the task of dependency parsing in an early stage and the task of coreference resolution towards the end. If one was only interested in the output of coreference resolution, it would be affected by cascading effects of any errors during dependency parsing.

Today we are announcing SLING, an experimental system for parsing natural language text directly into a representation of its meaning as a semantic frame graph. The output frame graph directly captures the semantic annotations of interest to the user, while avoiding the pitfalls of pipelined systems by not running any intermediate stages, additionally preventing unnecessary computation. SLING uses a special-purpose recurrent neural network model to compute the output representation of input text through incremental editing operations on the frame graph. The frame graph, in turn, is flexible enough to capture many semantic tasks of interest (more on this below). SLING's parser is trained using only the input words, bypassing the need for producing any intermediate annotations (e.g. dependency parses).

SLING provides fast parsing at inference time by providing (a) an efficient and scalable frame store implementation and (b) a JIT compiler that generates efficient code to execute the recurrent neural network. Although SLING is experimental, it achieves a parsing speed of >2,500 tokens/second on a desktop CPU, thanks to its efficient frame store and neural network compiler. SLING is implemented in C++ and it is available for download on GitHub. The entire system is described in detail in a technical report as well.

Frame Semantic Parsing
Frame Semantics [1] represents the meaning of text — such as a sentence — as a set of formal statements. Each formal statement is called a frame, which can be seen as a unit of knowledge or meaning that also captures its interactions with concepts or other frames typically associated with it. SLING organizes each frame as a list of slots, where each slot has a name (role) and a value, which could be a literal or a link to another frame. As an example, consider the sentence:

“Many people now claim to have predicted Black Monday.”

The figure below illustrates SLING recognizing mentions of entities (e.g. people, places, or events), measurements (e.g. dates or distances), and other concepts (e.g. verbs), and placing them in the correct semantic roles for the verbs in the input. The word predicted evokes the most dominant sense of the verb "predict", denoted as a PREDICT-01 frame. Additionally, this frame also has interactions (slots) with who made the prediction (denoted via the ARG0 slot, which points to the PERSON frame for people) and what was being predicted (denoted via ARG1, which links to the EVENT frame for Black Monday). Frame semantic parsing is the task of producing a directed graph of such frames linked through slots.
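To make the representation concrete, here is a hand-rolled sketch of that frame graph using plain Python dicts; SLING's actual frame store has its own API and serialization format.

```python
# Sketch of the frame graph for the example sentence, using plain Python dicts
# rather than SLING's frame store API.
person = {"type": "PERSON", "mention": "Many people"}
event = {"type": "EVENT", "mention": "Black Monday"}
predict = {
    "type": "PREDICT-01",  # frame evoked by the verb "predicted"
    "ARG0": person,        # who made the prediction
    "ARG1": event,         # what was predicted
}
```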
Although the example above is fairly simple, frame graphs are powerful enough to model a variety of complex semantic annotation tasks. For starters, frames provide a convenient way to bring together language-internal and external information types (e.g. knowledge bases). This can then be used to address complex language understanding problems such as reference, metaphor, metonymy, and perspective. The frame graphs for these tasks only differ in the inventory of frame types, roles, and any linking constraints.

SLING
SLING trains a recurrent neural network by optimizing for the semantic frames of interest.
The internal learned representations in the network’s hidden layers replace the hand-crafted feature combinations and intermediate representations in pipelined systems. Internally, SLING uses an encoder-decoder architecture where each input word is encoded into a vector using simple lexical features like the raw word, its suffix(es), punctuation, etc. The decoder uses that representation, along with recurrent features from its own history, to compute a sequence of transitions that update the frame graph to obtain the intended frame semantic representation of the input sentence. SLING trains its model using TensorFlow and DRAGNN.

The animation below shows how frames and roles are incrementally added to the under-construction frame graph using individual transitions. As discussed earlier with our simple example sentence, SLING connects the VERB and EVENT frames using the role ARG1, signifying that the EVENT frame is the concept being predicted. The EVOKE transition evokes a frame of a specified type from the next few tokens in the text (e.g. EVENT from Black Monday). Similarly, the CONNECT transition links two existing frames with a specified role. When the input is exhausted and the last transition (denoted as STOP) is executed, the frame graph is deemed complete and returned to the user, who can inspect the graph to get the semantic meaning behind the sentence.
One key aspect of our transition system is the presence of a small fixed-size attention buffer of frames that represents the most recent frames to be evoked or modified, shown with the orange boxes in the figure above. This buffer captures the intuition that we tend to remember knowledge that was recently evoked, referred to, or enhanced. If a frame is no longer in use, it eventually gets flushed out of this buffer as new frames come into the picture. We found this simple mechanism to be surprisingly effective at capturing a large fraction of inter-frame links.
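A schematic transition sequence for the example sentence might look like the following; the transition names EVOKE, CONNECT and STOP are from the description above, but the argument encoding and span handling are simplified, with CONNECT referring to frames by their position in the attention buffer.

```python
# Schematic transition sequence for "Many people now claim to have predicted
# Black Monday." (simplified; not SLING's exact instruction encoding).
transitions = [
    ("EVOKE", "PERSON", 2),      # evoke a PERSON frame spanning "Many people"
    ("EVOKE", "PREDICT-01", 1),  # evoke a frame for the verb "predicted"
    ("CONNECT", 0, "ARG0", 1),   # PREDICT-01 --ARG0--> PERSON (buffer positions)
    ("EVOKE", "EVENT", 2),       # evoke an EVENT frame spanning "Black Monday"
    ("CONNECT", 1, "ARG1", 0),   # PREDICT-01 --ARG1--> EVENT
    ("STOP",),                   # input exhausted; the frame graph is complete
]
```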

Next Steps
The illustrative experiment above is just a launchpad for research in semantic parsing for tasks such as knowledge extraction, resolving complex references, and dialog understanding. The SLING release on Github comes with a pre-trained model for the task we illustrated, as well as examples and recipes to train your own parser on either the supplied synthetic data or your own data. We hope the community finds SLING useful and we look forward to engaging conversations about applying and extending SLING to other semantic parsing tasks.

Acknowledgements
The research described in this post was done by Michael Ringgaard, Rahul Gupta, and Fernando Pereira. We thank the TensorFlow and DRAGNN teams for open-sourcing their packages, and various colleagues on the DRAGNN team who helped us with multiple aspects of SLING's training setup.



1 Charles J. Fillmore. 1982. Frame semantics. Linguistics in the Morning Calm, pages 111–138.

Transformer: A Novel Neural Network Architecture for Language Understanding



Neural networks, in particular recurrent neural networks (RNNs), are now at the core of the leading approaches to language understanding tasks such as language modeling, machine translation and question answering. In Attention Is All You Need we introduce the Transformer, a novel neural network architecture based on a self-attention mechanism that we believe to be particularly well-suited for language understanding.

In our paper, we show that the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks. On top of higher translation quality, the Transformer requires less computation to train and is a much better fit for modern machine learning hardware, speeding up training by up to an order of magnitude.
BLEU scores (higher is better) of single models on the standard WMT newstest2014 English to German translation benchmark.
BLEU scores (higher is better) of single models on the standard WMT newstest2014 English to French translation benchmark.
Accuracy and Efficiency in Language Understanding
Neural networks usually process language by generating fixed- or variable-length vector-space representations. After starting with representations of individual words or even pieces of words, they aggregate information from surrounding words to determine the meaning of a given bit of language in context. For example, deciding on the most likely meaning and appropriate representation of the word “bank” in the sentence “I arrived at the bank after crossing the…” requires knowing if the sentence ends in “... road.” or “... river.”

RNNs have in recent years become the typical network architecture for translation, processing language sequentially in a left-to-right or right-to-left fashion. Reading one word at a time, this forces RNNs to perform multiple steps to make decisions that depend on words far away from each other. Processing the example above, an RNN could only determine that “bank” is likely to refer to the bank of a river after reading each word between “bank” and “river” step by step. Prior research has shown that, roughly speaking, the more such steps decisions require, the harder it is for a recurrent network to learn how to make those decisions.

The sequential nature of RNNs also makes it more difficult to fully take advantage of modern fast computing devices such as TPUs and GPUs, which excel at parallel and not sequential processing. Convolutional neural networks (CNNs) are much less sequential than RNNs, but in CNN architectures like ByteNet or ConvS2S the number of steps required to combine information from distant parts of the input still grows with increasing distance.

The Transformer
In contrast, the Transformer only performs a small, constant number of steps (chosen empirically). In each step, it applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position. In the earlier example “I arrived at the bank after crossing the river”, to determine that the word “bank” refers to the shore of a river and not a financial institution, the Transformer can learn to immediately attend to the word “river” and make this decision in a single step. In fact, in our English-French translation model we observe exactly this behavior.

More specifically, to compute the next representation for a given word - “bank” for example - the Transformer compares it to every other word in the sentence. The result of these comparisons is an attention score for every other word in the sentence. These attention scores determine how much each of the other words should contribute to the next representation of “bank”. In the example, the disambiguating “river” could receive a high attention score when computing a new representation for “bank”. The attention scores are then used as weights for a weighted average of all words’ representations which is fed into a fully-connected network to generate a new representation for “bank”, reflecting that the sentence is talking about a river bank.
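This comparison-and-averaging step can be written compactly. The sketch below implements a single attention head in NumPy; the full Transformer adds multiple heads, positional information, residual connections and the feed-forward network mentioned above.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over one sentence.

    x: [num_words, d_model] word representations; wq, wk, wv: projection matrices.
    Each word is compared with every other word, the scores are normalized into
    weights, and the new representation is the weighted average of the values.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # attention scores for all word pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sentence
    return weights @ v                               # weighted average of the values
```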

The animation below illustrates how we apply the Transformer to machine translation. Neural networks for machine translation typically contain an encoder reading the input sentence and generating a representation of it. A decoder then generates the output sentence word by word while consulting the representation generated by the encoder. The Transformer starts by generating initial representations, or embeddings, for each word. These are represented by the unfilled circles. Then, using self-attention, it aggregates information from all of the other words, generating a new representation per word informed by the entire context, represented by the filled circles. This step is then repeated multiple times in parallel for all words, successively generating new representations.
The decoder operates similarly, but generates one word at a time, from left to right. It attends not only to the other previously generated words, but also to the final representations generated by the encoder.

Flow of Information
Beyond computational performance and higher accuracy, another intriguing aspect of the Transformer is that we can visualize what other parts of a sentence the network attends to when processing or translating a given word, thus gaining insights into how information travels through the network.

To illustrate this, we chose an example involving a phenomenon that is notoriously challenging for machine translation systems: coreference resolution. Consider the following sentences:

“The animal didn’t cross the street because it was too tired.”
“The animal didn’t cross the street because it was too wide.”

It is obvious to most that in the first sentence “it” refers to the animal, and in the second to the street. When translating these sentences to French or German, the translation for “it” depends on the gender of the noun it refers to - and in French “animal” and “street” have different genders. In contrast to the current Google Translate model, the Transformer translates both of these sentences to French correctly. Visualizing what words the encoder attended to when computing the final representation for the word “it” sheds some light on how the network made the decision. In one of its steps, the Transformer clearly identified the two nouns “it” could refer to and the respective amount of attention reflects its choice in the different contexts.
The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a Transformer trained on English to French translation (one of eight attention heads).
Given this insight, it might not be that surprising that the Transformer also performs very well on the classic language analysis task of syntactic constituency parsing, a task the natural language processing community has attacked with highly specialized systems for decades.
In fact, with little adaptation, the same network we used for English to German translation outperformed all but one of the previously proposed approaches to constituency parsing.

Next Steps
We are very excited about the future potential of the Transformer and have already started applying it to other problems involving not only natural language but also very different inputs and outputs, such as images and video. Our ongoing experiments are accelerated immensely by the Tensor2Tensor library, which we recently open sourced. In fact, after downloading the library you can train your own Transformer networks for translation and parsing by invoking just a few commands. We hope you’ll give it a try, and look forward to seeing what the community can do with the Transformer.

Acknowledgements
This research was conducted by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez and Łukasz Kaiser. Additional thanks go to David Chenell for creating the animation above.

Google at ACL 2017



This week, Vancouver, Canada hosts the 2017 Annual Meeting of the Association for Computational Linguistics (ACL 2017), the premier conference in the field of natural language understanding, covering a broad spectrum of diverse research areas that are concerned with computational approaches to natural language.

As a leader in natural language processing & understanding and a Platinum sponsor of ACL 2017, Google will be on hand to showcase research interests that include syntax, semantics, discourse, conversation, multilingual modeling, sentiment analysis, question answering, summarization, and generally building better systems using labeled and unlabeled data, state-of-the-art modeling and learning from indirect supervision.

If you’re attending ACL 2017, we hope that you’ll stop by the Google booth to check out some demos, meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. Learn more about the Google research being presented at ACL 2017 below (Googlers highlighted in blue).

Organizing Committee
Area Chairs include: Sujith Ravi (Machine Learning), Thang Luong (Machine Translation)
Publication Chairs include: Margaret Mitchell (Advisory)

Accepted Papers
A Polynomial-Time Dynamic Programming Algorithm for Phrase-Based Decoding with a Fixed Distortion Limit
Yin-Wen Chang, Michael Collins
(Oral Session)

Cross-Sentence N-ary Relation Extraction with Graph LSTMs
Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, Wen-Tau Yih
(Oral Session)

Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision
Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, Ni Lao

Coarse-to-Fine Question Answering for Long Documents
Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, Jonathan Berant

Automatic Compositor Attribution in the First Folio of Shakespeare
Maria Ryskina, Hannah Alpert-Abrams, Dan Garrette, Taylor Berg-Kirkpatrick

A Nested Attention Neural Hybrid Model for Grammatical Error Correction
Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, Jianfeng Gao

Get To The Point: Summarization with Pointer-Generator Networks
Abigail See, Peter J. Liu, Christopher D. Manning

Identifying 1950s American Jazz Composers: Fine-Grained IsA Extraction via Modifier Composition
Ellie Pavlick*, Marius Pasca

Learning to Skim Text
Adams Wei Yu, Hongrae Lee, Quoc Le

Workshops
2017 ACL Student Research Workshop
Program Committee includes: Emily Pitler, Brian Roark, Richard Sproat

WiNLP: Women and Underrepresented Minorities in Natural Language Processing
Organizers include: Margaret Mitchell
Gold Sponsor

BUCC: 10th Workshop on Building and Using Comparable Corpora
Scientific Committee includes: Richard Sproat

CLPsych: Computational Linguistics and Clinical Psychology – From Linguistic Signal to Clinical Reality
Program Committee includes: Brian Roark, Richard Sproat

Repl4NLP: 2nd Workshop on Representation Learning for NLP
Program Committee includes: Ankur Parikh, John Platt

RoboNLP: Language Grounding for Robotics
Program Committee includes: Ankur Parikh, Tom Kwiatkowski

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Management Group includes: Slav Petrov

CoNLL-SIGMORPHON-2017 Shared Task: Universal Morphological Reinflection
Organizing Committee includes: Manaal Faruqui
Invited Speaker: Chris Dyer

SemEval: 11th International Workshop on Semantic Evaluation
Organizers include: Daniel Cer

ALW1: 1st Workshop on Abusive Language Online
Panelists include: Margaret Mitchell

EventStory: Events and Stories in the News
Program Committee includes: Silvia Pareti

NMT: 1st Workshop on Neural Machine Translation
Organizing Committee includes: Thang Luong
Program Committee includes: Hieu Pham, Taro Watanabe
Invited Speaker: Quoc Le

Tutorials
Natural Language Processing for Precision Medicine
Hoifung Poon, Chris Quirk, Kristina Toutanova, Wen-tau Yih

Deep Learning for Dialogue Systems
Yun-Nung Chen, Asli Celikyilmaz, Dilek Hakkani-Tur



* Contributed during an internship at Google.

MultiModel: Multi-Task Machine Learning Across Domains



Over the last decade, the application and performance of Deep Learning has progressed at an astonishing rate. However, current neural network architectures remain highly specialized to specific domains of application. An important question remains unanswered: Will a convergence between these domains facilitate a unified model capable of performing well across multiple domains?

Today, we present MultiModel, a neural network architecture that draws from the success of vision, language and audio networks to simultaneously solve a number of problems spanning multiple domains, including image recognition, translation and speech recognition. While strides have been made in this direction before, namely in Google’s Multilingual Neural Machine Translation System used in Google Translate, MultiModel is a first step towards the convergence of vision, audio and language understanding into a single network.

The inspiration for how MultiModel handles multiple domains comes from how the brain transforms sensory input from different modalities (such as sound, vision or taste), into a single shared representation and back out in the form of language or actions. As an analog to these modalities and the transformations they perform, MultiModel has a number of small modality-specific sub-networks for audio, images, or text, and a shared model consisting of an encoder, input/output mixer and decoder, as illustrated below.
MultiModel architecture: small modality-specific sub-networks work with a shared encoder, I/O mixer and decoder. Each petal represents a modality, transforming to and from the internal representation.
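Schematically, this routing can be sketched as follows; all of the component networks are placeholders here, and this is a structural illustration rather than the open-sourced implementation.

```python
# Structural sketch of MultiModel-style routing: modality-specific input/output
# networks around one shared encoder, mixer and decoder. All components are
# placeholders, not the actual open-sourced implementation.
class MultiModelSketch:
    def __init__(self, input_nets, output_nets, encoder, mixer, decoder):
        self.input_nets = input_nets    # e.g. {"image": ..., "text": ..., "audio": ...}
        self.output_nets = output_nets  # e.g. {"text": ..., "class": ...}
        self.encoder, self.mixer, self.decoder = encoder, mixer, decoder

    def run(self, inputs, in_modality, out_modality, outputs_so_far):
        x = self.input_nets[in_modality](inputs)        # into the shared representation
        encoded = self.encoder(x)
        mixed = self.mixer(encoded, outputs_so_far)     # combine inputs with outputs so far
        decoded = self.decoder(mixed)
        return self.output_nets[out_modality](decoded)  # back out of the shared space
```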
We demonstrate that MultiModel is capable of learning eight different tasks simultaneously: it can detect objects in images, provide captions, recognize speech, translate between four pairs of languages, and do grammatical constituency parsing at the same time. The input is given to the model together with a very simple signal that determines which output we are requesting. Below we illustrate a few examples taken from a MultiModel trained jointly on these eight tasks1:
When designing MultiModel it became clear that certain elements from each domain of research (vision, language and audio) were integral to the model’s success in related tasks. We demonstrate that these computational primitives (such as convolutions, attention, or mixture-of-experts layers) clearly improve performance on their originally intended domain of application, while not hindering MultiModel’s performance on other tasks. It is not only possible to achieve good performance while training jointly on multiple tasks, but on tasks with limited quantities of data, the performance actually improves. To our surprise, this happens even if the tasks come from different domains that would appear to have little in common, e.g., an image recognition task can improve performance on a language task.

It is important to note that while MultiModel does not establish new performance records, it does provide insight into the dynamics of multi-domain multi-task learning in neural networks, and the potential for improved learning on data-limited tasks by the introduction of auxiliary tasks. There is a longstanding saying in machine learning: “the best regularizer is more data”; in MultiModel, this data can be sourced across domains, and consequently can be obtained more easily than previously thought. MultiModel provides evidence that training in concert with other tasks can lead to good results and improve performance on data-limited tasks.

Many questions about multi-domain machine learning remain to be studied, and we will continue to work on tuning MultiModel and improving its performance. To allow this research to progress quickly, we open-sourced MultiModel as part of the Tensor2Tensor library. We believe that such synergetic models trained on data from multiple domains will be the next step in deep learning and will ultimately solve tasks beyond the reach of current narrowly trained networks.

Acknowledgements
This work is a collaboration between Googlers Łukasz Kaiser, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones and Jakob Uszkoreit, and Aidan N. Gomez from the University of Toronto. It was performed while Aidan was working with the Google Brain team.



1 The 8 tasks were: (1) speech recognition (WSJ corpus), (2) image classification (ImageNet), (3) image captioning (MS COCO), (4) parsing (Penn Treebank), (5) English-German translation, (6) German-English translation, (7) English-French translation, (8) French-English translation (all using WMT data-sets).

Accelerating Deep Learning Research with the Tensor2Tensor Library



Deep Learning (DL) has enabled the rapid advancement of many useful technologies, such as machine translation, speech recognition and object detection. In the research community, one can find code open-sourced by the authors to help in replicating their results and further advancing deep learning. However, most of these DL systems use unique setups that require significant engineering effort and may only work for a specific problem or architecture, making it hard to run new experiments and compare the results.

Today, we are happy to release Tensor2Tensor (T2T), an open-source system for training deep learning models in TensorFlow. T2T facilitates the creation of state-of-the-art models for a wide variety of ML applications, such as translation, parsing, image captioning and more, enabling the exploration of various ideas much faster than previously possible. This release also includes a library of datasets and models, including the best models from a few recent papers (Attention Is All You Need, Depthwise Separable Convolutions for Neural Machine Translation and One Model to Learn Them All) to help kick-start your own DL research.

Translation Model | Training time | BLEU (difference from baseline)
Transformer (T2T) | 3 days on 8 GPUs | 28.4 (+7.8)
SliceNet (T2T) | 6 days on 32 GPUs | 26.1 (+5.5)
GNMT+MoE | 1 day on 64 GPUs | 26.0 (+5.4)
ConvS2S | 18 days on 1 GPU | 25.1 (+4.5)
GNMT | 1 day on 96 GPUs | 24.6 (+4.0)
ByteNet | 8 days on 32 GPUs | 23.8 (+3.2)
MOSES (phrase-based baseline) | N/A | 20.6 (+0.0)
BLEU scores (higher is better) on the standard WMT English-German translation task.
As an example of the kind of improvements T2T can offer, we applied the library to machine translation. As you can see in the table above, two different T2T models, SliceNet and Transformer, outperform the previous state-of-the-art, GNMT+MoE. Our best T2T model, Transformer, is 3.8 points better than the standard GNMT model, which itself was 4 points above the baseline phrase-based translation system, MOSES. Notably, with T2T you can approach previous state-of-the-art results with a single GPU in one day: a small Transformer model (not shown above) gets 24.9 BLEU after 1 day of training on a single GPU. Now everyone with a GPU can tinker with great translation models on their own: our GitHub repo has instructions on how to do that.

Modular Multi-Task Training
The T2T library is built with familiar TensorFlow tools and defines multiple pieces needed in a deep learning system: data-sets, model architectures, optimizers, learning rate decay schemes, hyperparameters, and so on. Crucially, it enforces a standard interface between all these parts and implements current ML best practices. So you can pick any data-set, model, optimizer and set of hyperparameters, and run the training to check how it performs. We made the architecture modular, so every piece between the input data and the predicted output is a tensor-to-tensor function. If you have a new idea for the model architecture, you don’t need to replace the whole setup. You can keep the embedding part and the loss and everything else, just replace the model body by your own function that takes a tensor as input and returns a tensor.
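For example, a new model body can be sketched roughly as below; the registration decorator and base class come from Tensor2Tensor, but method names and details have shifted across library versions, so treat this as an outline rather than a drop-in snippet.

```python
import tensorflow as tf
from tensor2tensor.utils import registry, t2t_model

@registry.register_model
class MyTinyModel(t2t_model.T2TModel):
  # The body is a tensor-to-tensor function: the library provides the embeddings,
  # loss and output layer around it. (Exact method name varies by T2T version.)
  def body(self, features):
    inputs = features["inputs"]
    return tf.layers.dense(inputs, 128, activation=tf.nn.relu)
```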

This means that T2T is flexible, with training no longer pinned to a specific model or dataset. It is so easy that even architectures like the famous LSTM sequence-to-sequence model can be defined in a few dozen lines of code. One can also train a single model on multiple tasks from different domains. Taken to the limit, you can even train a single model on all data-sets concurrently, and we are happy to report that our MultiModel, trained like this and included in T2T, yields good results on many tasks even when training jointly on ImageNet (image classification), MS COCO (image captioning), WSJ (speech recognition), WMT (translation) and the Penn Treebank parsing corpus. It is the first time a single model has been demonstrated to be able to perform all these tasks at once.

Built-in Best Practices
With this initial release, we also provide scripts to generate a number of data-sets widely used in the research community1, a handful of models2, a number of hyperparameter configurations, and a well-performing implementation of other important tricks of the trade. While it is hard to list them all, if you decide to run your model with T2T you’ll get for free the correct padding of sequences and the corresponding cross-entropy loss, well-tuned parameters for the Adam optimizer, adaptive batching, synchronous distributed training, well-tuned data augmentation for images, label smoothing, and a number of hyper-parameter configurations that worked very well for us, including the ones mentioned above that achieve the state-of-the-art results on translation and may help you get good results too.

As an example, consider the task of parsing English sentences into their grammatical constituency trees. This problem has been studied for decades and competitive methods were developed with a lot of effort. It can be presented as a sequence-to-sequence problem and be solved with neural networks, but it used to require a lot of tuning. With T2T, it took us only a few days to add the parsing data-set generator and adjust our attention transformer model to train on this problem. To our pleasant surprise, we got very good results in only a week:

Parsing Model | F1 score (higher is better)
Transformer (T2T) | 91.3
Dyer et al. | 91.7
Zhu et al. | 90.4
Socher et al. | 90.4
Vinyals & Kaiser et al. | 88.3
Parsing F1 scores on the standard test set, section 23 of the WSJ. We only compare here models trained discriminatively on the Penn Treebank WSJ training set, see the paper for more results.

Contribute to Tensor2Tensor
In addition to exploring existing models and data-sets, you can easily define your own model and add your own data-sets to Tensor2Tensor. We believe the already included models will perform very well for many NLP tasks, so just adding your data-set might lead to interesting results. By making T2T modular, we also make it very easy to contribute your own model and see how it performs on various tasks. In this way the whole community can benefit from a library of baselines and deep learning research can accelerate. So head to our GitHub repository, try the new models, and contribute your own!

Acknowledgements
The release of Tensor2Tensor was only possible thanks to the widespread collaboration of many engineers and researchers. We want to acknowledge here the core team who contributed (in alphabetical order): Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, Jakob Uszkoreit, Ashish Vaswani.



1 We include a number of datasets for image classification (MNIST, CIFAR-10, CIFAR-100, ImageNet), image captioning (MS COCO), translation (WMT with multiple languages including English-German and English-French), language modelling (LM1B), parsing (Penn Treebank), natural language inference (SNLI), speech recognition (TIMIT), algorithmic problems (over a dozen tasks from reversing through addition and multiplication to algebra) and we will be adding more and welcome your data-sets too.

2 Including LSTM sequence-to-sequence RNNs, convolutional networks also with separable convolutions (e.g., Xception), recently researched models like ByteNet or the Neural GPU, and our new state-of-the-art models mentioned in this post that we will be actively updating in the repository.