
Introducing Semantic Experiences with Talk to Books and Semantris



Natural language understanding has evolved substantially in the past few years, in part due to the development of word vectors that enable algorithms to learn about the relationships between words, based on examples of actual language usage. These vector models map semantically similar phrases to nearby points based on equivalence, similarity or relatedness of ideas and language. Last year, we used hierarchical vector models of language to make improvements to Smart Reply for Gmail. More recently, we’ve been exploring other applications of these methods.

Today, we are proud to share Semantic Experiences, a website showing two examples of how these new capabilities can drive applications that weren’t possible before. Talk to Books is an entirely new way to explore books by starting at the sentence level, rather than the author or topic level. Semantris is a word association game powered by machine learning, where you type out words associated with a given prompt. We have also published “Universal Sentence Encoder”, which describes the models used for these examples in more detail. Lastly, we’ve provided a pretrained semantic TensorFlow module so the community can experiment with sentence- and phrase-level encodings of their own.

Modeling approach
Our approach extends the idea of representing language in a vector space by creating vectors for larger chunks of language such as full sentences and small paragraphs. Since language is composed of hierarchies of concepts, we create the vectors using a hierarchy of modules, each of which considers features that correspond to sequences at different temporal scales. Relatedness, synonymy, antonymy, meronymy, holonymy, and many other types of relationships may all be represented in vector space language models if we train them in the right way and then pose the right “questions”. We describe this method in our paper, “Efficient Natural Language Response Suggestion for Smart Reply.”
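To make this concrete, here is a minimal sketch of how such sentence-level vectors can be used, assuming the pretrained Universal Sentence Encoder module on TensorFlow Hub and TF 1.x-style sessions; the production systems are more elaborate:

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the pretrained sentence encoder (TF 1.x hub.Module API).
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")

sentences = [
    "How can I overcome my fear of public speaking?",
    "Facing an audience takes preparation and practice.",
    "The stock market closed higher today.",
]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embed(sentences))

# The embeddings are approximately unit length, so inner products act as
# similarity scores: the first two sentences should score higher with each
# other than either does with the third.
print(np.inner(vectors, vectors))
```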

Talk to Books
With Talk to Books, we provide an entirely new way to explore books. You make a statement or ask a question, and the tool finds sentences in books that respond, with no dependence on keyword matching. In a sense you are talking to the books, getting responses which can help you determine if you’re interested in reading them or not.
Talk to Books
The models driving this experience were trained on a billion conversation-like pairs of sentences, learning to identify what a good response might look like. Once you ask your question (or make a statement), the tool searches all the sentences in over 100,000 books to find the ones that respond to your input based on semantic meaning at the sentence level; there are no predefined rules bounding the relationship between what you put in and the results you get.
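As a rough illustration of the retrieval step (a hedged sketch, not the actual Talk to Books implementation): if every book sentence has already been embedded offline, responding to a query reduces to a vector similarity search. The `encode` function stands in for the sentence encoder from the previous sketch:

```python
import numpy as np

def build_index(sentence_vectors):
    # sentence_vectors: (num_sentences, dim) embeddings computed offline,
    # one row per book sentence; normalize so that the inner product
    # behaves like cosine similarity.
    norms = np.linalg.norm(sentence_vectors, axis=1, keepdims=True)
    return sentence_vectors / norms

def search(index, query_vector, k=5):
    # Score every sentence against the query in one matrix-vector product,
    # then return the indices and scores of the k best responses.
    query = query_vector / np.linalg.norm(query_vector)
    scores = index @ query
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

# Hypothetical usage:
# results = search(build_index(book_vectors), encode("Why do we dream?"))
```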

This capability is unique and can help you find interesting books that a keyword search might not surface, but there’s still room for improvement. For example, this experiment works at the sentence level (rather than at the paragraph level, as in Smart Reply for Gmail) so a “good” matching sentence can still be taken out of context. You might find books and passages that you didn’t expect, or the reason a particular passage was highlighted might not be obvious. You may also notice that being well-known does not make a book sort to the top; this experiment looks only at how well the individual sentences match up. However, one benefit of this is that the tool may help people discover unexpected authors and titles, and surface books in a way that is fresh and innovative.

Semantris
We are also providing Semantris, a word association game that is powered by this technology. When you enter a word or phrase, the game ranks all of the words on-screen, scoring them based on how well they respond to what you typed. Again, similarity, opposites and neighboring concepts are all fair game using this semantic model. Try it out yourself to see what we mean! The time pressure in the Arcade version (shown below) will tempt you to enter single words as prompts. The Blocks version has no time pressure, which makes it a great place to try entering phrases and sentences. You may enjoy exploring how obscure you can be with your hints.
Semantris Arcade
The examples we’re sharing today are just a few of the possible ways to think about experience and application design using these new tools. Other potential applications include classification, semantic similarity, semantic clustering, whitelist applications (selecting the right response from many alternatives), and semantic search (of which Talk to Books is an example). We hope you’ll come up with many more, inspired by these example applications. We look forward to seeing original and innovative uses of our TensorFlow models by the developer community.

Acknowledgements
Talk to Books was developed by Aaron Phillips, Amin Ahmad, Rachel Bernstein, Aaron Cohen, Noah Constant, Ray Kurzweil, Igor Krivokon, Vladimir Magay, Peter McKenzie, Bryan Richter, Chris Tar, and Dave Uthus. Semantris was developed by Ben Pietrzak, RJ Mical, Steve Pucci, Maria Voitovich, Mo Adeleye, Diana Huang, Catherine McCurry, Tomomi Sohn, and Connor Moore. We'd also like to acknowledge Hallie Benjamin, Eric Breck, Mario Guajardo-Céspedes, Yoni Halpern, Margaret Mitchell, Ben Packer, Andrew Smart and Lucy Vasserman.

SLING: A Natural Language Frame Semantic Parser



Until recently, most practical natural language understanding (NLU) systems used a pipeline of analysis stages, from part-of-speech tagging and dependency parsing to steps that computed a semantic representation of the input text. While this facilitated easy modularization of different analysis stages, errors in earlier stages would have cascading effects in later stages and the final representation, and the intermediate stage outputs might not be relevant on their own. For example, a typical pipeline might perform the task of dependency parsing in an early stage and the task of coreference resolution towards the end. If one was only interested in the output of coreference resolution, it would be affected by cascading effects of any errors during dependency parsing.

Today we are announcing SLING, an experimental system for parsing natural language text directly into a representation of its meaning as a semantic frame graph. The output frame graph directly captures the semantic annotations of interest to the user, while avoiding the pitfalls of pipelined systems by not running any intermediate stages, additionally preventing unnecessary computation. SLING uses a special-purpose recurrent neural network model to compute the output representation of input text through incremental editing operations on the frame graph. The frame graph, in turn, is flexible enough to capture many semantic tasks of interest (more on this below). SLING's parser is trained using only the input words, bypassing the need for producing any intermediate annotations (e.g. dependency parses).

SLING provides fast parsing at inference time by providing (a) an efficient and scalable frame store implementation and (b) a JIT compiler that generates efficient code to execute the recurrent neural network. Although SLING is experimental, it achieves a parsing speed of >2,500 tokens/second on a desktop CPU, thanks to its efficient frame store and neural network compiler. SLING is implemented in C++ and it is available for download on GitHub. The entire system is described in detail in a technical report as well.

Frame Semantic Parsing
Frame Semantics [1] represents the meaning of text — such as a sentence — as a set of formal statements. Each formal statement is called a frame, which can be seen as a unit of knowledge or meaning that also captures its interactions with concepts or other frames typically associated with it. SLING organizes each frame as a list of slots, where each slot has a name (role) and a value, which could be a literal or a link to another frame. As an example, consider the sentence:

“Many people now claim to have predicted Black Monday.”

The figure below illustrates SLING recognizing mentions of entities (e.g. people, places, or events), measurements (e.g. dates or distances), and other concepts (e.g. verbs), and placing them in the correct semantic roles for the verbs in the input. The word “predicted” evokes the most dominant sense of the verb “predict”, denoted as a PREDICT-01 frame. Additionally, this frame also has interactions (slots) with who made the prediction (denoted via the ARG0 slot, which points to the PERSON frame for people) and what was being predicted (denoted via ARG1, which links to the EVENT frame for Black Monday). Frame semantic parsing is the task of producing a directed graph of such frames linked through slots.
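As a toy rendering of this structure, the example’s frames and slots might be written out as follows; the frame-type and role names come from the example above, while the Python classes are purely illustrative (SLING’s actual frame store is a C++ implementation):

```python
class Frame:
    """A frame is a typed unit of meaning holding (role, value) slots."""
    def __init__(self, frame_type):
        self.type = frame_type
        self.slots = []  # list of (role, value); values are literals or Frames

    def add(self, role, value):
        self.slots.append((role, value))

person = Frame("PERSON")        # evoked by "Many people"
event = Frame("EVENT")          # evoked by "Black Monday"
predict = Frame("PREDICT-01")   # evoked by "predicted"
predict.add("ARG0", person)     # who made the prediction
predict.add("ARG1", event)      # what was being predicted
```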
Although the example above is fairly simple, frame graphs are powerful enough to model a variety of complex semantic annotation tasks. For starters, frames provide a convenient way to bring together language-internal and external information types (e.g. knowledge bases). This can then be used to address complex language understanding problems such as reference, metaphor, metonymy, and perspective. The frame graphs for these tasks only differ in the inventory of frame types, roles, and any linking constraints.

SLING
SLING trains a recurrent neural network by optimizing for the semantic frames of interest.
The internal learned representations in the network’s hidden layers replace the hand-crafted feature combinations and intermediate representations in pipelined systems. Internally, SLING uses an encoder-decoder architecture where each input word is encoded into a vector using simple lexical features like the raw word, its suffix(es), punctuation, etc. The decoder uses that representation, along with recurrent features from its own history, to compute a sequence of transitions that update the frame graph to obtain the intended frame semantic representation of the input sentence. SLING trains its model using TensorFlow and DRAGNN.

The animation below shows how frames and roles are incrementally added to the under-construction frame graph using individual transitions. As discussed earlier with our simple example sentence, SLING connects the VERB and EVENT frames using the role ARG1, signifying that the EVENT frame is the concept being predicted. The EVOKE transition evokes a frame of a specified type from the next few tokens in the text (e.g. EVENT from Black Monday). Similarly, the CONNECT transition links two existing frames with a specified role. When the input is exhausted and the last transition (denoted as STOP) is executed, the frame graph is deemed complete and returned to the user, who can inspect the graph to get the semantic meaning behind the sentence.
One key aspect of our transition system is the presence of a small fixed-size attention buffer of frames that represents the most recent frames to be evoked or modified, shown with the orange boxes in the figure above. This buffer captures the intuition that we tend to remember knowledge that was recently evoked, referred to, or enhanced. If a frame is no longer in use, it eventually gets flushed out of this buffer as new frames come into the picture. We found this simple mechanism to be surprisingly effective at capturing a large fraction of inter-frame links.
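To make the transition system concrete, here is a hedged toy sketch of how EVOKE, CONNECT, and the attention buffer might interact; the real SLING system has a richer set of transitions and is driven by the neural network rather than hand-written calls:

```python
from collections import deque

class TransitionState:
    def __init__(self, buffer_size=5):
        self.frames = []                            # frame graph under construction
        self.attention = deque(maxlen=buffer_size)  # most recently evoked/modified

    def evoke(self, frame_type, tokens):
        frame = {"type": frame_type, "tokens": tokens, "slots": []}
        self.frames.append(frame)
        self.attention.appendleft(frame)            # newest frame is most attended
        return frame

    def connect(self, source_idx, role, target_idx):
        # Indices refer to positions in the attention buffer, not the whole
        # graph, which is what keeps the action space small.
        source, target = self.attention[source_idx], self.attention[target_idx]
        source["slots"].append((role, target))
        self.attention.remove(source)
        self.attention.appendleft(source)           # a modified frame becomes recent again

state = TransitionState()
state.evoke("EVENT", ["Black", "Monday"])
state.evoke("PREDICT-01", ["predicted"])
state.connect(0, "ARG1", 1)   # PREDICT-01 --ARG1--> EVENT
# STOP: once the input is exhausted, state.frames is the finished graph.
```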

Next Steps
The illustrative experiment above is just a launchpad for research in semantic parsing for tasks such as knowledge extraction, resolving complex references, and dialog understanding. The SLING release on GitHub comes with a pre-trained model for the task we illustrated, as well as examples and recipes to train your own parser on either the supplied synthetic data or your own data. We hope the community finds SLING useful and we look forward to engaging conversations about applying and extending SLING to other semantic parsing tasks.

Acknowledgements
The research described in this post was done by Michael Ringgaard, Rahul Gupta, and Fernando Pereira. We thank the TensorFlow and DRAGNN teams for open-sourcing their packages, and various colleagues on the DRAGNN team who helped us with multiple aspects of SLING's training setup.



1 Charles J. Fillmore. 1982. Frame semantics. Linguistics in the Morning Calm, pages 111–138.

Transformer: A Novel Neural Network Architecture for Language Understanding



Neural networks, in particular recurrent neural networks (RNNs), are now at the core of the leading approaches to language understanding tasks such as language modeling, machine translation and question answering. In Attention Is All You Need we introduce the Transformer, a novel neural network architecture based on a self-attention mechanism that we believe to be particularly well-suited for language understanding.

In our paper, we show that the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks. On top of higher translation quality, the Transformer requires less computation to train and is a much better fit for modern machine learning hardware, speeding up training by up to an order of magnitude.
BLEU scores (higher is better) of single models on the standard WMT newstest2014 English to German translation benchmark.
BLEU scores (higher is better) of single models on the standard WMT newstest2014 English to French translation benchmark.
Accuracy and Efficiency in Language Understanding
Neural networks usually process language by generating fixed- or variable-length vector-space representations. After starting with representations of individual words or even pieces of words, they aggregate information from surrounding words to determine the meaning of a given bit of language in context. For example, deciding on the most likely meaning and appropriate representation of the word “bank” in the sentence “I arrived at the bank after crossing the…” requires knowing if the sentence ends in “... road.” or “... river.”

RNNs have in recent years become the typical network architecture for translation, processing language sequentially in a left-to-right or right-to-left fashion. Because they read one word at a time, RNNs must perform multiple steps to make decisions that depend on words far away from each other. Processing the example above, an RNN could only determine that “bank” is likely to refer to the bank of a river after reading each word between “bank” and “river” step by step. Prior research has shown that, roughly speaking, the more such steps decisions require, the harder it is for a recurrent network to learn how to make those decisions.

The sequential nature of RNNs also makes it more difficult to fully take advantage of modern fast computing devices such as TPUs and GPUs, which excel at parallel and not sequential processing. Convolutional neural networks (CNNs) are much less sequential than RNNs, but in CNN architectures like ByteNet or ConvS2S the number of steps required to combine information from distant parts of the input still grows with increasing distance.

The Transformer
In contrast, the Transformer only performs a small, constant number of steps (chosen empirically). In each step, it applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position. In the earlier example “I arrived at the bank after crossing the river”, to determine that the word “bank” refers to the shore of a river and not a financial institution, the Transformer can learn to immediately attend to the word “river” and make this decision in a single step. In fact, in our English-French translation model we observe exactly this behavior.

More specifically, to compute the next representation for a given word - “bank” for example - the Transformer compares it to every other word in the sentence. The result of these comparisons is an attention score for every other word in the sentence. These attention scores determine how much each of the other words should contribute to the next representation of “bank”. In the example, the disambiguating “river” could receive a high attention score when computing a new representation for “bank”. The attention scores are then used as weights for a weighted average of all words’ representations which is fed into a fully-connected network to generate a new representation for “bank”, reflecting that the sentence is talking about a river bank.
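The core of this computation can be sketched in a few lines. This is a simplified illustration of the comparison-and-weighted-average step described above, omitting the learned query/key/value projections and multiple attention heads of the actual model:

```python
import numpy as np

def self_attention(x):
    # x: (num_words, dim) matrix of word representations.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)    # compare every word to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sentence
    return weights @ x               # attention-weighted average per word

# e.g. 9 words: "I arrived at the bank after crossing the river"
words = np.random.randn(9, 64)       # stand-in word representations
updated = self_attention(words)      # "bank" now mixes in information from "river"
```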

The animation below illustrates how we apply the Transformer to machine translation. Neural networks for machine translation typically contain an encoder reading the input sentence and generating a representation of it. A decoder then generates the output sentence word by word while consulting the representation generated by the encoder. The Transformer starts by generating initial representations, or embeddings, for each word. These are represented by the unfilled circles. Then, using self-attention, it aggregates information from all of the other words, generating a new representation per word informed by the entire context, represented by the filled balls. This step is then repeated multiple times in parallel for all words, successively generating new representations.
The decoder operates similarly, but generates one word at a time, from left to right. It attends not only to the other previously generated words, but also to the final representations generated by the encoder.

Flow of Information
Beyond computational performance and higher accuracy, another intriguing aspect of the Transformer is that we can visualize what other parts of a sentence the network attends to when processing or translating a given word, thus gaining insights into how information travels through the network.

To illustrate this, we chose an example involving a phenomenon that is notoriously challenging for machine translation systems: coreference resolution. Consider the following sentences and their French translations:

“The animal didn’t cross the street because it was too tired.” → “L’animal n’a pas traversé la rue parce qu’il était trop fatigué.”
“The animal didn’t cross the street because it was too wide.” → “L’animal n’a pas traversé la rue parce qu’elle était trop large.”
It is obvious to most that in the first sentence pair “it” refers to the animal, and in the second to the street. When translating these sentences to French or German, the translation for “it” depends on the gender of the noun it refers to - and in French “animal” and “street” have different genders. In contrast to the current Google Translate model, the Transformer translates both of these sentences to French correctly. Visualizing what words the encoder attended to when computing the final representation for the word “it” sheds some light on how the network made the decision. In one of its steps, the Transformer clearly identified the two nouns “it” could refer to and the respective amount of attention reflects its choice in the different contexts.
The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a Transformer trained on English to French translation (one of eight attention heads).
Given this insight, it might not be that surprising that the Transformer also performs very well on the classic language analysis task of syntactic constituency parsing, a task the natural language processing community has attacked with highly specialized systems for decades.
In fact, with little adaptation, the same network we used for English to German translation outperformed all but one of the previously proposed approaches to constituency parsing.

Next Steps
We are very excited about the future potential of the Transformer and have already started applying it to other problems involving not only natural language but also very different inputs and outputs, such as images and video. Our ongoing experiments are accelerated immensely by the Tensor2Tensor library, which we recently open sourced. In fact, after downloading the library you can train your own Transformer networks for translation and parsing by invoking just a few commands. We hope you’ll give it a try, and look forward to seeing what the community can do with the Transformer.

Acknowledgements
This research was conducted by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez and Łukasz Kaiser. Additional thanks go to David Chenell for creating the animation above.

Google at ACL 2017



This week, Vancouver, Canada hosts the 2017 Annual Meeting of the Association for Computational Linguistics (ACL 2017), the premier conference in the field of natural language understanding, covering a broad spectrum of diverse research areas that are concerned with computational approaches to natural language.

As a leader in natural language processing & understanding and a Platinum sponsor of ACL 2017, Google will be on hand to showcase research interests that include syntax, semantics, discourse, conversation, multilingual modeling, sentiment analysis, question answering, summarization, and generally building better systems using labeled and unlabeled data, state-of-the-art modeling and learning from indirect supervision.

If you’re attending ACL 2017, we hope that you’ll stop by the Google booth to check out some demos, meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. Learn more about the Google research being presented at ACL 2017 below (Googlers highlighted in blue).

Organizing Committee
Area Chairs include: Sujith Ravi (Machine Learning), Thang Luong (Machine Translation)
Publication Chairs include: Margaret Mitchell (Advisory)

Accepted Papers
A Polynomial-Time Dynamic Programming Algorithm for Phrase-Based Decoding with a Fixed Distortion Limit
Yin-Wen Chang, Michael Collins
(Oral Session)

Cross-Sentence N-ary Relation Extraction with Graph LSTMs
Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, Wen-Tau Yih
(Oral Session)

Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision
Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, Ni Lao

Coarse-to-Fine Question Answering for Long Documents
Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, Jonathan Berant

Automatic Compositor Attribution in the First Folio of Shakespeare
Maria Ryskina, Hannah Alpert-Abrams, Dan Garrette, Taylor Berg-Kirkpatrick

A Nested Attention Neural Hybrid Model for Grammatical Error Correction
Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, Jianfeng Gao

Get To The Point: Summarization with Pointer-Generator Networks
Abigail See, Peter J. Liu, Christopher D. Manning

Identifying 1950s American Jazz Composers: Fine-Grained IsA Extraction via Modifier Composition
Ellie Pavlick*, Marius Pasca

Learning to Skim Text
Adams Wei Yu, Hongrae Lee, Quoc Le

Workshops
2017 ACL Student Research Workshop
Program Committee includes: Emily Pitler, Brian Roark, Richard Sproat

WiNLP: Women and Underrepresented Minorities in Natural Language Processing
Organizers include: Margaret Mitchell
Gold Sponsor

BUCC: 10th Workshop on Building and Using Comparable Corpora
Scientific Committee includes: Richard Sproat

CLPsych: Computational Linguistics and Clinical Psychology – From Linguistic Signal to Clinical Reality
Program Committee includes: Brian Roark, Richard Sproat

Repl4NLP: 2nd Workshop on Representation Learning for NLP
Program Committee includes: Ankur Parikh, John Platt

RoboNLP: Language Grounding for Robotics
Program Committee includes: Ankur Parikh, Tom Kwiatkowski

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Management Group includes: Slav Petrov

CoNLL-SIGMORPHON-2017 Shared Task: Universal Morphological Reinflection
Organizing Committee includes: Manaal Faruqui
Invited Speaker: Chris Dyer

SemEval: 11th International Workshop on Semantic Evaluation
Organizers include: Daniel Cer

ALW1: 1st Workshop on Abusive Language Online
Panelists include: Margaret Mitchell

EventStory: Events and Stories in the News
Program Committee includes: Silvia Pareti

NMT: 1st Workshop on Neural Machine Translation
Organizing Committee includes: Thang Luong
Program Committee includes: Hieu Pham, Taro Watanabe
Invited Speaker: Quoc Le

Tutorials
Natural Language Processing for Precision Medicine
Hoifung Poon, Chris Quirk, Kristina Toutanova, Wen-tau Yih

Deep Learning for Dialogue Systems
Yun-Nung Chen, Asli Celikyilmaz, Dilek Hakkani-Tur



* Contributed during an internship at Google.

Efficient Smart Reply, now for Gmail



Last year we launched Smart Reply, a feature for Inbox by Gmail that uses machine learning to suggest replies to email. Since the initial release, usage of Smart Reply has grown significantly, making up about 12% of replies in Inbox on mobile. Based on our examination of the use of Smart Reply in Inbox and our ideas about how humans learn and use language, we have created a new version of Smart Reply for Gmail. This version increases the percentage of usable suggestions and is more algorithmically efficient.

Novel thinking: hierarchy
Inspired by how humans understand languages and concepts, we turned to hierarchical models of language, an approach that uses hierarchies of modules, each of which can learn, remember, and recognize a sequential pattern.

The content of language is deeply hierarchical, reflected in the structure of language itself, going from letters to words to phrases to sentences to paragraphs to sections to chapters to books to authors to libraries, etc. Consider the message, "That interesting person at the cafe we like gave me a glance." The hierarchical chunks in this sentence are highly variable. The subject of the sentence is "That interesting person at the cafe we like." The modifier "interesting" tells us something about the writer's past experiences with the person. We are told that the location of an incident involving both the writer and the person is "at the cafe." We are also told that "we," meaning the writer and the person being written to, like the cafe. Additionally, each word is itself part of a hierarchy, sometimes more than one. A cafe is a type of restaurant which is a type of store which is a type of establishment, and so on.

In proposing an appropriate response to this message we might consider the meaning of the word "glance," which is potentially ambiguous. Was it a positive gesture? In that case, we might respond, "Cool!" Or was it a negative gesture? If so, does the subject say anything about how the writer felt about the negative exchange? A lot of information about the world, and an ability to make reasoned judgments, are needed to make subtle distinctions.

Given enough examples of language, a machine learning approach can discover many of these subtle distinctions. Moreover, a hierarchical approach to learning is well suited to the hierarchical nature of language. We have found that this approach works well for suggesting possible responses to emails. We use a hierarchy of modules, each of which considers features that correspond to sequences at different temporal scales, similar to how we understand speech and language.
Each module processes inputs and provides transformed representations of those inputs on its outputs (which are, in turn, available for the next level). In the Smart Reply system, and the figure above, the repeated structure has two layers of hierarchy. The first makes each feature useful as a predictor of the final result, and the second combines these features. By definition, the second works at a more abstract representation and considers a wider timescale.

By comparison, the initial release of Smart Reply encoded input emails word-by-word with a long short-term memory (LSTM) recurrent neural network, and then decoded potential replies with yet another word-level LSTM. While this type of modeling is very effective in many contexts, even with Google infrastructure, it’s an approach that requires substantial computational resources. Instead of working word-by-word, we found an effective and highly efficient path by treating the problem more holistically: comparing a simple hierarchy of vector representations of multiple features corresponding to longer time spans.
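As a hedged sketch of what comparing vector representations buys you computationally (illustrative, not the production network): once the email and a fixed set of candidate replies are mapped to vectors, ranking replies is a single matrix multiply instead of word-by-word decoding. The `email_vector` and `reply_vectors` here stand in for the outputs of the hierarchical encoders described above:

```python
import numpy as np

def rank_replies(email_vector, reply_vectors, replies, k=3):
    # reply_vectors: (num_candidates, dim), precomputed once for a fixed
    # list of candidate replies; email_vector: (dim,), computed per message.
    scores = reply_vectors @ email_vector   # one matrix multiply scores everything
    best = np.argsort(-scores)[:k]
    return [(replies[i], float(scores[i])) for i in best]

# Toy usage with random stand-in vectors:
rng = np.random.RandomState(0)
replies = ["Sounds good!", "Sorry, I can't.", "Thanks!"]
print(rank_replies(rng.randn(64), rng.randn(3, 64), replies))
```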

Semantics
We have also considered whether the mathematical space of these vector representations is implicitly semantic. Do the hierarchical network representations reflect a coarse “understanding” of the actual meaning of the inputs and the responses in order to determine which go together, or do they reflect more consistent syntactical patterns? Given many real examples of which pairs go together and, perhaps more importantly, which do not, we found that our networks are surprisingly effective and efficient at deriving representations that meet the training requirements.
So far we see that the system can find responses that are on point, without an overlap of keywords or even synonyms of keywords. More directly, we’re delighted when the system suggests results that show understanding and are helpful.

The key to this work is the confidence and trust people give us when they use the Smart Reply feature. As always, thank you for showing us the ways that work (and the ways that don’t!). With your help, we’ll do our best to keep learning.

An Upgrade to SyntaxNet, New Models and a Parsing Competition



At Google, we continuously improve the language understanding capabilities used in applications ranging from generation of email responses to translation. Last summer, we open-sourced SyntaxNet, a neural-network framework for analyzing and understanding the grammatical structure of sentences. Included in our release was Parsey McParseface, a state-of-the-art model that we had trained for analyzing English, followed quickly by a collection of pre-trained models for 40 additional languages, which we dubbed Parsey's Cousins. While we were excited to share our research and to provide these resources to the broader community, building machine learning systems that work well for languages other than English remains an ongoing challenge. We are excited to announce a few new research resources, available now, that address this problem.

SyntaxNet Upgrade
We are releasing a major upgrade to SyntaxNet. This upgrade incorporates nearly a year’s worth of our research on multilingual language understanding, and is available to anyone interested in building systems for processing and understanding text. At the core of the upgrade is a new technology that enables learning of richly layered representations of input sentences. More specifically, the upgrade extends TensorFlow to allow joint modeling of multiple levels of linguistic structure, and to allow neural-network architectures to be created dynamically during processing of a sentence or document.

Our upgrade makes it easy, for example, to build character-based models that learn to compose individual characters into words (e.g. ‘c-a-t’ spells ‘cat’). By doing so, the models can learn that words can be related to each other because they share common parts (e.g. ‘cats’ is the plural of ‘cat’ and shares the same stem; ‘wildcat’ is a type of ‘cat’). Parsey and Parsey’s Cousins, on the other hand, operated over sequences of words. As a result, they were forced to memorize words seen during training and relied mostly on the context to determine the grammatical function of previously unseen words.
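As an illustrative sketch (the upgrade itself builds on DRAGNN’s dynamically constructed networks rather than this exact model), a character-based word representation can be as simple as an LSTM reading a word one character at a time:

```python
import tensorflow as tf

NUM_CHARS, CHAR_DIM, WORD_DIM = 256, 16, 64  # toy sizes

# A word arrives as a sequence of character ids; the LSTM's final state
# becomes the word's vector, so related words ('cat', 'cats', 'wildcat')
# can share learned character-level structure.
char_ids = tf.keras.layers.Input(shape=(None,), dtype="int32")
char_vecs = tf.keras.layers.Embedding(NUM_CHARS, CHAR_DIM)(char_ids)
word_vec = tf.keras.layers.LSTM(WORD_DIM)(char_vecs)
char_to_word = tf.keras.Model(char_ids, word_vec)

# Usage: feed the byte values of 'c', 'a', 't' as a batch of one sequence,
# e.g. char_to_word.predict(np.array([[ord(c) for c in "cat"]])).
```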

As an example, consider the following (meaningless but grammatically correct) sentence:

“The gostak distims the doshes.”

This sentence was originally coined by Andrew Ingraham, who explained: “You do not know what this means; nor do I. But if we assume that it is English, we know that the doshes are distimmed by the gostak. We know too that one distimmer of doshes is a gostak.” Systematic patterns in morphology and syntax allow us to guess the grammatical function of words even when they are completely novel: we understand that ‘doshes’ is the plural of the noun ‘dosh’ (similar to the ‘cats’ example above) or that ‘distim’ is the third person singular of the verb ‘distim’. Based on this analysis we can then derive the overall structure of this sentence even though we have never seen the words before.

ParseySaurus
To showcase the new capabilities provided by our upgrade to SyntaxNet, we are releasing a set of new pretrained models called ParseySaurus. These models use the character-based input representation mentioned above and are thus much better at predicting the meaning of new words based both on their spelling and how they are used in context. The ParseySaurus models are far more accurate than Parsey’s Cousins (reducing errors by as much as 25%), particularly for morphologically-rich languages like Russian, or agglutinative languages like Turkish and Hungarian. In those languages there can be dozens of forms for each word and many of these forms might never be observed during training - even in a very large corpus.

Consider the following fictitious Russian sentence, where again the stems are meaningless, but the suffixes define an unambiguous interpretation of the sentence structure:
Even though our Russian ParseySaurus model has never seen these words, it can correctly analyze the sentence by inspecting the character sequences which constitute each word. In doing so, the system can determine many properties of the words (notice how many more morphological features there are here than in the English example). To see the sentence as ParseySaurus does, here is a visualization of how the model analyzes this sentence:
Each square represents one node in the neural network graph, and lines show the connections between them. The left-side “tail” of the graph shows the model consuming the input as one long string of characters. These are intermittently passed to the right side, where the rich web of connections shows the model composing words into phrases and producing a syntactic parse. Check out the full-size rendering here.

A Competition
You might be wondering whether character-based modeling is all we need or whether there are other techniques that might be important. SyntaxNet has lots more to offer, like beam search and different training objectives, but there are of course also many other possibilities. To find out what works well in practice, we are helping co-organize a multilingual parsing competition at this year’s Conference on Computational Natural Language Learning (CoNLL) with the goal of building syntactic parsing systems that work well in real-world settings and for 45 different languages.

The competition is made possible by the Universal Dependencies (UD) initiative, whose goal is to develop cross-linguistically consistent treebanks. Because machine learned models can only be as good as the data that they have access to, we have been contributing data to UD since 2013. For the competition, we partnered with UD and DFKI to build a new multilingual evaluation set consisting of 1000 sentences that have been translated into 20+ different languages and annotated by linguists with parse trees. This evaluation set is the first of its kind (in the past, each language had its own independent evaluation set) and will enable more consistent cross-lingual comparisons. Because the sentences have the same meaning and have been annotated according to the same guidelines, we will be able to get closer to answering the question of which languages might be harder to parse.

We hope that the upgraded SyntaxNet framework and our pre-trained ParseySaurus models will inspire researchers to participate in the competition. We have additionally created a tutorial showing how to load a Docker image and train models on the Google Cloud Platform, to facilitate participation by smaller teams with limited resources. So, if you have an idea for making your own models with the SyntaxNet framework, sign up to compete! We believe that the configurations that we are releasing are a good place to start, but we look forward to seeing how participants will be able to extend and improve these models or perhaps create better ones!

Thanks to everyone involved who made this competition happen, including our collaborators at UD-Pipe, who provide another baseline implementation to make it easy to enter the competition. Happy parsing from the main developers, Chris Alberti, Daniel Andor, Ivan Bogatyy, Mark Omernick, Zora Tung and Ji Ma!

On-Device Machine Intelligence



To build the cutting-edge technologies that enable conversational understanding and image recognition, we often apply combinations of machine learning technologies such as deep neural networks and graph-based machine learning. However, the machine learning systems that power most of these applications run in the cloud, are computationally intensive, and have significant memory requirements. What if you want machine intelligence to run on your personal phone or smartwatch, or on IoT devices, regardless of whether they are connected to the cloud?

Yesterday, we announced the launch of Android Wear 2.0, along with brand new wearable devices, that will run Google's first entirely “on-device” ML technology for powering smart messaging. This on-device ML system, developed by the Expander research team, enables technologies like Smart Reply to be used for any application, including third-party messaging apps, without ever having to connect with the cloud…so now you can respond to incoming chat messages directly from your watch, with a tap.
The research behind this began last year while our team was developing the machine learning systems that enable conversational understanding capability in Allo and Inbox. The Android Wear team reached out to us and was interested to know whether it would be possible to deploy this Smart Reply technology directly onto a smart device. Because of the limited computing power and memory on smart devices, we quickly realized that it was not possible to do so. Our product manager, Patrick McGregor, realized that this presented a unique challenge and an opportunity for the Expander team to return to the drawing board to design a completely new, lightweight, machine learning architecture — not only to enable Smart Reply on Android Wear, but also to power a wealth of other on-device mobile applications. Together with Tom Rudick, Nathan Beach, and other colleagues from the Android Wear team, we set out to build the new system.

Learning with Projections
A simple strategy to build lightweight conversational models might be to create a small dictionary of common rules (input → reply mappings) on the device and use a naive look-up strategy at inference time. This can work for simple prediction tasks involving a small set of classes using a handful of features (such as binary sentiment classification from text, e.g. “I love this movie” conveys a positive sentiment whereas the sentence “The acting was horrible” is negative). But, it does not scale to complex natural language tasks involving rich vocabularies and the wide language variability observed in chat messages. On the other hand, machine learning models like recurrent neural networks (such as LSTMs), in conjunction with graph learning, have proven to be extremely powerful tools for complex sequence learning in natural language understanding tasks, including Smart Reply. However, compressing such rich models to fit in device memory and produce robust predictions at low computation cost (rapidly on-demand) is extremely challenging. Early experiments with restricting the model to predict only a small handful of replies or using other techniques like quantization or character-level models did not produce useful results.

Instead, we built a different solution for the on-device ML system. We first use a fast, efficient mechanism to group similar incoming messages and project them to similar (“nearby”) bit vector representations. While there are several ways to perform this projection step, such as using word embeddings or encoder networks, we employ a modified version of locality sensitive hashing (LSH) to reduce the dimensionality from millions of unique words to a short, fixed-length sequence of bits. This allows us to compute a projection for an incoming message very fast, on-the-fly, with a small memory footprint on the device since we do not need to store the incoming messages, word embeddings, or even the full model used for training.
Projection step: Similar messages are grouped together and projected to nearby vectors. For example, the messages "hey, how's it going?" and "How's it going buddy?" share similar content and might be projected to the same vector 11100011. Another related message “Howdy, everything going well?” is mapped to a nearby vector 11100110 that differs only in 2 bits.
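A minimal sketch of the projection idea, using classic random-hyperplane LSH; the on-device system uses a modified LSH over its own text features, and the word-hashing featurizer here is purely illustrative:

```python
import numpy as np

rng = np.random.RandomState(0)
DIM, BITS = 1024, 8
planes = rng.randn(BITS, DIM)   # fixed random hyperplanes

def features(message):
    # Hash words into a sparse bag-of-words vector (illustrative only;
    # Python's hash() is per-process, so a real system would use a stable hash).
    v = np.zeros(DIM)
    for word in message.lower().split():
        v[hash(word) % DIM] += 1.0
    return v

def project(message):
    # One bit per hyperplane: which side of the plane the message falls on.
    # Similar messages flip few bits relative to each other.
    return "".join("1" if s > 0 else "0" for s in planes @ features(message))

print(project("hey, how's it going?"))
print(project("How's it going buddy?"))   # similar content -> nearby bit vector
```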
Next, our system takes the incoming message along with its projections and jointly trains a “message projection model” that learns to predict likely replies using our semi-supervised graph learning framework. The graph learning framework enables training a robust model by combining semantic relationships from multiple sources — message/reply interactions, word/phrase similarity, semantic cluster information — learning useful projection operations that can be mapped to good reply predictions.
Learning step: (Top) Messages along with projections and corresponding replies, if available, are used in a machine learning framework to jointly learn a “message projection model”. (Bottom) The message projection model learns to associate replies with the projections of the corresponding incoming messages. For example, the model projects two different messages “Howdy, everything going well?” and “How’s it going buddy?” (bottom center) to nearby bit vectors and learns to map these to relevant replies (bottom right).
It’s worth noting that while the message projection model can be trained using complex machine learning architectures and the power of the cloud, as described above, the model itself resides and performs inference completely on device. Apps running on the device can pass a user’s incoming messages and receive reply predictions from the on-device model without data leaving the device. The model can also be adapted to cater to the user’s writing style and individual preferences to provide a personalized experience.
Inference step: The model applies the learned projections to an incoming message (or sequence of messages) and suggests relevant and diverse replies. Inference is performed on the device, allowing the model to adapt to user data and personal writing styles.
To get the on-device system to work out of the box, we had to make a few additional improvements, such as speeding up on-device computation and generating rich, diverse replies from the model. We will have a forthcoming scientific publication that describes the on-device machine learning work in more detail.

Converse from Your Wrist
When we embarked on our journey to build this technology from scratch, we weren’t sure if the predictions would be useful or of sufficient quality. We’re quite surprised and excited about how well it works even on Android wearable devices with very limited computation and memory resources. We look forward to continuing to improve the models to provide users with more delightful conversational experiences, and we will be leveraging this on-device ML platform to enable completely new applications in the months to come.

You can now use this feature to respond to your messages directly from your Google watches or any watch that runs Android Wear 2.0. It is already enabled on Google Hangouts, Google Messenger, and many third-party messaging apps. We also provide an API for developers of third-party Wear apps.

Acknowledgements
On behalf of the Google Expander team, I would also like to thank the following people who helped make this technology a success: Andrei Broder, Andrew Tomkins, David Singleton, Mirko Ranieri, Robin Dua and Yicheng Fan.

A Large Corpus for Supervised Word-Sense Disambiguation



Understanding the various meanings of a particular word in text is key to understanding language. For example, in the sentence “he will receive stock in the reorganized company”, we know that “stock” refers to “the capital raised by a business or corporation through the issue and subscription of shares” as defined in the New Oxford American Dictionary (NOAD), based on the context. However, there are more than 10 other definitions for “stock” in NOAD, ranging from “goods in a store” to “a medieval device for punishment”. For a computer algorithm, distinguishing between these meanings is so difficult that it has been described as “AI-complete” in the past (Navigli, 2009; Ide and Veronis 1998; Mallery 1988).

In order to help further progress on this challenge, we’re happy to announce the release of word-sense annotations on the popular MASC and SemCor datasets, manually annotated with senses from the NOAD. We’re also releasing mappings from the NOAD senses to English Wordnet, which is more commonly used by the research community. This is one of the largest releases of fully sense-annotated English corpora.

Supervised Word-Sense Disambiguation
Humans distinguish between meanings of words in text easily because we have access to an enormous amount of common-sense knowledge about how the world works, and how this connects to language. For an example of the difficulty, “[stock] in a business” implies the financial sense, but “[stock] in a bodega” is more likely to refer to goods on the shelves of a store, even though a bodega is a kind of business. Acquiring sufficient knowledge in a form that a machine can use, and then applying it to understanding the words in text, is a challenge.

Supervised word-sense disambiguation (WSD) is the problem of building a machine-learned system using human-labeled data that can assign a dictionary sense to all words used in text (in contrast to entity disambiguation, which focuses on nouns, mostly proper). Building a supervised model that performs better than just assigning the most frequent sense of a word without considering the surrounding text is difficult, but supervised models can perform well when supplied with significant amounts of training data. (Navigli, 2009)
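For concreteness, here is a sketch of that most-frequent-sense baseline; the sense identifiers are hypothetical placeholders:

```python
from collections import Counter, defaultdict

def train_mfs(labeled_corpus):
    # labeled_corpus: iterable of (word, sense_id) pairs from annotated text.
    # Count how often each word carries each sense, then always predict the
    # most common one, ignoring the surrounding text entirely.
    counts = defaultdict(Counter)
    for word, sense in labeled_corpus:
        counts[word][sense] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

baseline = train_mfs([("stock", "noad:shares"), ("stock", "noad:shares"),
                      ("stock", "noad:store_goods")])
print(baseline["stock"])   # -> 'noad:shares', regardless of context
```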

By releasing this dataset, it is our hope that the research community will be able to further the advance of algorithms that allow machines to understand language better, allowing applications such as:
  • Facilitating the automatic construction of databases from text in order to answer questions and connect knowledge in documents. For example, understanding that a “hemi engine” is a kind of automotive machinery, and a “locomotive engine” is a kind of train, or that “Kanye West is a star” implies that he is a celebrity, but “Sirius is a star” implies that it is an astronomical object.
  • Disambiguating words in queries, so that results for “date palm” and “date night” or “web spam” and “spam recipe” can have distinct interpretations for different senses, and documents returned from a query have the same meaning that is implied by the query.
Manual Annotation
In the manually labeled data sets that we are releasing, each sense annotation is labeled by five raters. To ensure high quality of the sense annotation, raters are first trained with gold annotations, which were labeled by experienced linguists in a separate pilot study before the annotation task. The figure below shows an example of a rater’s work page in our annotation tool.

The left side of the page lists all candidate dictionary senses (in this case, for the word “general”). Example sentences from the dictionary are also provided. The to-be-annotated words, highlighted within a sentence, are shown on the right side of the work page. Besides linking a dictionary sense to a word, raters could also label one of three exceptions: (1) the word is a typo, (2) none of the above, and (3) I can’t decide. Raters could also check whether the word usage is a metaphor and leave comments.

The sense annotation task used for this data release achieves an inter-rater reliability score of 0.869 using Krippendorff's alpha (α >= 0.67 is considered an acceptable level of reproducibility, and α >= 0.80 is considered a highly reproducible result) (Krippendorff, 2004). Annotation counts are listed below.

          Total     noun     verb     adjective     adverb
SemCor    115k      38k      57k      11.6k         8.6k
MASC      133k      50k      12.7k    13.6k         4.2k

Wordnet Mappings
We’ve also included two sets of mappings from NOAD to Wordnet. A smaller set of 2200 words was manually mapped in a process similar to the sense annotations described above, and a larger set was created algorithmically. Together, these mappings allow for resources in Wordnet to be applied to this NOAD corpus, and for systems built using Wordnet to be evaluated using this corpus.

You can learn more about our full research results on this corpus using LSTM-based language models and semi-supervised learning in “Semi-supervised Word Sense Disambiguation with Neural Models”.

Acknowledgements
The datasets were built with help from Eric Altendorf, Heng Chen, Jutta Degener, Ryan Doherty, David Huynh, Ji Li, Julian Richardson and Binbin Ruan.

Graph-powered Machine Learning at Google



Recently, there have been significant advances in Machine Learning that enable computer systems to solve complex real-world problems. One of those advances is Google’s large scale, graph-based machine learning platform, built by the Expander team in Google Research. A technology that is behind many of the Google products and features you may use every day, graph-based machine learning is a powerful tool that can be used to power useful features such as reminders in Inbox and smart messaging in Allo, or used in conjunction with deep neural networks to power the latest image recognition system in Google Photos.
Learning with Minimal Supervision

Much of the recent success in deep learning and machine learning, in general, can be attributed to models that demonstrate high predictive capacity when trained on large amounts of labeled data -- often millions of training examples. This is commonly referred to as “supervised learning” since it requires supervision, in the form of labeled data, to train the machine learning systems. (Conversely, some machine learning methods operate directly on raw data without any supervision, a paradigm referred to as unsupervised learning.)

However, the more difficult the task, the harder it is to get sufficient high-quality labeled data. It is often prohibitively labor intensive and time-consuming to collect labeled data for every new problem. This motivated the Expander research team to build new technology for powering machine learning applications at scale and with minimal supervision.

Expander’s technology draws inspiration from how humans learn to generalize and bridge the gap between what they already know (labeled information) and novel, unfamiliar observations (unlabeled information). Known as “semi-supervised” learning, this powerful technique enables us to build systems that can work in situations where training data may be sparse. One of the key advantages to a graph-based semi-supervised machine learning approach is the fact that (a) one models labeled and unlabeled data jointly during learning, leveraging the underlying structure in the data, and (b) one can easily combine multiple types of signals (for example, relational information from Knowledge Graph along with raw features) into a single graph representation and learn over them. This is in contrast to other machine learning approaches, such as neural network methods, in which it is typical to first train a system using labeled data with features and then apply the trained system to unlabeled data.

Graph Learning: How It Works

At its core, Expander’s platform combines semi-supervised machine learning with large-scale graph-based learning by building a multi-graph representation of the data with nodes corresponding to objects or concepts and edges connecting concepts that share similarities. The graph typically contains both labeled data (nodes associated with a known output category or label) and unlabeled data (nodes for which no labels were provided). Expander’s framework then performs semi-supervised learning to label all nodes jointly by propagating label information across the graph.

However, this is easier said than done! We have to (1) learn efficiently at scale with minimal supervision (i.e., tiny amount of labeled data), (2) operate over multi-modal data (i.e., heterogeneous representations and various sources of data), and (3) solve challenging prediction tasks (i.e., large, complex output spaces) involving high dimensional data that might be noisy.

One of the primary ingredients in the entire learning process is the graph and choice of connections. Graphs come in all sizes, shapes and can be combined from multiple sources. We have observed that it is often beneficial to learn over multi-graphs that combine information from multiple types of data representations (e.g., image pixels, object categories and chat response messages for PhotoReply in Allo). The Expander team’s graph learning platform automatically generates graphs directly from data based on the inferred or known relationships between data elements. The data can be structured (for example, relational data) or unstructured (for example, sparse or dense feature representations extracted from raw data).

To understand how Expander’s system learns, let us consider an example graph shown below.
There are two types of nodes in the graph: “grey” represents unlabeled data, whereas the colored nodes represent labeled data. Relationships between nodes are represented via edges, and the thickness of each edge indicates the strength of the connection. We can formulate the semi-supervised learning problem on this toy graph as follows: predict a color (“red” or “blue”) for every node in the graph. Note that the specific choice of graph structure and colors depend on the task. For example, as shown in this research paper we recently published, a graph that we built for the Smart Reply feature in Inbox represents email messages as nodes and colors indicate semantic categories of user responses (e.g., “yes”, “awesome”, “funny”).

The Expander graph learning framework solves this labeling task by treating it as an optimization problem. At the simplest level, it learns a color label assignment for every node in the graph such that neighboring nodes are assigned similar colors depending on the strength of their connection. A naive way to solve this would be to try to learn a label assignment for all nodes at once -- this method does not scale to large graphs. Instead, we can optimize the problem formulation by propagating colors from labeled nodes to their neighbors, and then repeating the process. In each step, an unlabeled node is assigned a label by inspecting color assignments of its neighbors. We can update every node’s label in this manner and iterate until the whole graph is colored. This process is a far more efficient way to optimize the same problem and the sequence of iterations converges to a unique solution in this case. The solution at the end of the graph propagation looks something like this:
Semi-supervised learning on a graph
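A toy version of this propagation loop is easy to write down. The sketch below uses a plain weighted-average update with clamped seed nodes, whereas Expander's actual objectives are richer and run distributed over billions of nodes:

```python
import numpy as np

def propagate(weights, seed_labels, iterations=50):
    # weights: (n, n) symmetric edge-weight matrix.
    # seed_labels: (n, num_colors) one-hot rows for labeled nodes,
    # zero rows for unlabeled ones.
    scores = seed_labels.copy()
    seeded = seed_labels.sum(axis=1) > 0
    for _ in range(iterations):
        scores = weights @ scores                 # average neighbors' colors
        scores[seeded] = seed_labels[seeded]      # clamp the labeled nodes
        totals = scores.sum(axis=1, keepdims=True)
        scores = np.divide(scores, totals, out=np.zeros_like(scores),
                           where=totals > 0)      # renormalize each node
    return scores.argmax(axis=1)                  # one color per node

# 4-node toy graph: node 0 is labeled "red" (color 0), node 3 "blue" (color 1).
w = np.array([[0, 1, 0.2, 0], [1, 0, 0.2, 0], [0.2, 0.2, 0, 1], [0, 0, 1, 0.]])
seeds = np.zeros((4, 2)); seeds[0, 0] = 1; seeds[3, 1] = 1
print(propagate(w, seeds))   # -> [0 0 1 1]: red, red, blue, blue
```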
In practice, we use complex optimization functions defined over the graph structure, which incorporate additional information and constraints for semi-supervised graph learning that can lead to hard, non-convex problems. The real challenge, however, is to scale this efficiently to graphs containing billions of nodes, trillions of edges and for complex tasks involving billions of different label types.

To tackle this challenge, we created an approach outlined in Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation, published last year. It introduces a streaming algorithm to process information propagated from neighboring nodes in a distributed manner that makes it work on very large graphs. In addition, it addresses other practical concerns, notably it guarantees that the space complexity or memory requirements of the system stay constant regardless of the difficulty of the task, i.e., the overall system uses the same amount of memory regardless of whether the number of prediction labels is two (as in the above toy example) or a million or even a billion. This enables wide-ranging applications for natural language understanding, machine perception, user modeling and even joint multimodal learning for tasks involving multiple modalities such as text, image and video inputs.

Language Graphs for Learning Humor

As an example use of graph-based machine learning, consider emotion labeling, a language understanding task in Smart Reply for Inbox, where the goal is to label words occurring in natural language text with their fine-grained emotion categories. A neural network model is first applied to a text corpus to learn word embeddings, i.e., a mathematical vector representation of the meaning of each word. The dense embedding vectors are then used to build a sparse graph where nodes correspond to words and edges represent semantic relationship between them. Edge strength is computed using similarity between embedding vectors — low similarity edges are ignored. We seed the graph with emotion labels known a priori for a few nodes (e.g., laugh is labeled as “funny”) and then apply semi-supervised learning over the graph to discover emotion categories for remaining words (e.g., ROTFL gets labeled as “funny” owing to its multi-hop semantic connection to the word “laugh”).
Learning emotion associations using graph constructed from word embedding vectors
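A minimal sketch of the graph-building step described above, assuming toy two-dimensional “embeddings” and an arbitrary similarity threshold; real embedding vectors would come from a trained model.

```python
import numpy as np

def build_similarity_graph(embeddings, threshold=0.5):
    """Connect word pairs whose embedding cosine similarity exceeds
    `threshold`; low-similarity edges are dropped, keeping the graph sparse."""
    words = list(embeddings)
    vecs = np.array([embeddings[w] for w in words])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T  # All-pairs cosine similarity (fine at toy sizes).
    edges = {}
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if sims[i, j] > threshold:
                edges[(words[i], words[j])] = float(sims[i, j])
    return edges

# Toy vectors invented for illustration.
embeddings = {
    "laugh": [0.9, 0.1],
    "ROTFL": [0.8, 0.2],
    "cry":   [0.1, 0.9],
}
edges = build_similarity_graph(embeddings)  # Only ("laugh", "ROTFL") survives.
seeds = {"laugh": "funny"}  # Known a priori; propagation labels the rest.
print(edges)
```

The resulting edges and seeds could then be handed to a propagation routine like the earlier sketch to label the rest of the vocabulary.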
For applications involving large datasets with dense representations, whether observed (e.g., pixels from images) or learned by neural networks (e.g., embedding vectors), it is infeasible to compute pairwise similarity between all objects to construct edges in the graph. The Expander team solves this problem by leveraging approximate, linear-time graph construction algorithms.
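One standard way to get approximately linear-time construction (not necessarily the Expander team’s exact method) is locality-sensitive hashing: bucket vectors by the signs of their projections onto a few random hyperplanes, so that similar vectors tend to collide, and compare only vectors that share a bucket. A rough sketch:

```python
import numpy as np
from collections import defaultdict

def lsh_candidate_pairs(vectors, num_planes=8, seed=0):
    """Bucket vectors by the signs of their projections onto random
    hyperplanes; only vectors sharing a bucket are compared, so the
    number of similarity computations stays roughly linear in n."""
    rng = np.random.default_rng(seed)
    dim = len(next(iter(vectors.values())))
    planes = rng.normal(size=(num_planes, dim))
    buckets = defaultdict(list)
    for name, vec in vectors.items():
        signature = tuple((planes @ np.asarray(vec)) > 0)
        buckets[signature].append(name)
    # Emit only within-bucket pairs as candidate edges.
    for names in buckets.values():
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                yield names[i], names[j]

vectors = {"laugh": [0.9, 0.1], "ROTFL": [0.8, 0.2], "cry": [0.1, 0.9]}
print(list(lsh_candidate_pairs(vectors)))  # Similar vectors usually share a bucket.
```

A single hash table can miss near neighbors that fall into different buckets; in practice, several tables with independent hyperplanes are used to improve recall.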

Graph-based Machine Intelligence in Action

The Expander team’s machine learning system is now being used on massive graphs (containing billions of nodes and trillions of edges) to recognize and understand concepts in natural language, images, videos, and queries, powering Google products for applications like reminders, question answering, language translation, visual object recognition, dialogue understanding, and more.

We are excited that with the recent release of Allo, millions of chat users are now experiencing smart messaging technology powered by the Expander team’s system for understanding and assisting with chat conversations in multiple languages. This technology isn’t limited to large-scale models in the cloud: as announced this past week, Android Wear has opened up an on-device Smart Reply capability for developers that will provide smart replies for any messaging application. We’re excited to tackle even more challenging Internet-scale problems with Expander in the years to come.

Acknowledgements

We wish to acknowledge the hard work of all the researchers, engineers, product managers, and leaders across Google who helped make this technology a success. In particular, we would like to highlight the efforts of Allan Heydon, Andrei Broder, Andrew Tomkins, Ariel Fuxman, Bo Pang, Dana Movshovitz-Attias, Fritz Obermeyer, Krishnamurthy Viswanathan, Patrick McGregor, Peter Young, Robin Dua, Sujith Ravi and Vivek Ramavajjala.

Meet Parsey’s Cousins: Syntax for 40 languages, plus new SyntaxNet capabilities



Just in time for ACL 2016, we are pleased to announce that Parsey McParseface, released in May as part of SyntaxNet and the basis for the Cloud Natural Language API, now has 40 cousins! Parsey’s Cousins is a collection of pretrained syntactic models for 40 languages, capable of analyzing the native languages of more than half of the world’s population with often unprecedented accuracy. To better address the linguistic phenomena occurring in these languages, we have endowed SyntaxNet with new capabilities for text segmentation and morphological analysis.

When we released Parsey, we were already planning to expand to more languages, and it soon became clear that this was both urgent and important, because researchers were having trouble creating top-notch SyntaxNet models for other languages.

The reason for this is a little subtle. SyntaxNet, like other TensorFlow models, has a lot of knobs to turn that affect accuracy and speed. These knobs are called hyperparameters, and they control things like the learning rate and its decay, momentum, and random initialization. Because neural networks are more sensitive to the choice of these hyperparameters than many other machine learning algorithms, picking the right settings is very important. Unfortunately, there is no tried-and-true way of doing this; picking good hyperparameters remains mostly an empirical exercise -- we try a bunch of settings and see what works best.

An additional challenge is that training these models can take a long time: several days, even on very fast hardware. Our solution is to train many models in parallel via MapReduce, and when one looks promising, train a bunch more models with similar settings to fine-tune the results. This can really add up -- on average, we train more than 70 models per language. The plot below shows how the accuracy varies depending on the hyperparameters as training progresses. The best models are up to 4% absolute more accurate than ones trained without hyperparameter tuning.
Held-out set accuracy for various English parsing models with different hyperparameters (each line corresponds to one training run with specific hyperparameters). In some cases training is a lot slower and in many cases a suboptimal choice of hyperparameters leads to significantly lower accuracy. We are releasing the best model that we were able to train for each language.
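The recipe above amounts to random search followed by local refinement around the most promising configuration. Below is a schematic sketch; train_and_evaluate is a hypothetical stand-in for a full SyntaxNet training run, and the sampling ranges are invented for illustration.

```python
import random

def random_config():
    """Sample one hyperparameter setting from plausible ranges (illustrative)."""
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "decay": random.uniform(0.8, 0.99),
        "momentum": random.uniform(0.5, 0.95),
        "init_scale": 10 ** random.uniform(-3, -1),
    }

def perturb(config, scale=0.1):
    """Jitter a promising configuration to fine-tune around it."""
    return {k: v * random.uniform(1 - scale, 1 + scale) for k, v in config.items()}

def search(train_and_evaluate, num_random=50, num_refine=20):
    """Random search, then local refinement around the best configuration.
    `train_and_evaluate` is hypothetical: it would train a model with the
    given hyperparameters and return held-out accuracy. Each call takes
    days in practice, so the trials would run in parallel (e.g., MapReduce)."""
    scored = [(train_and_evaluate(c), c) for c in
              (random_config() for _ in range(num_random))]
    best_score, best = max(scored, key=lambda t: t[0])
    for _ in range(num_refine):
        candidate = perturb(best)
        score = train_and_evaluate(candidate)
        if score > best_score:
            best_score, best = score, candidate
    return best_score, best
```

With num_random=50 and num_refine=20, this sketch would train 70 models, in line with the per-language average quoted above.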
In order to do a good job at analyzing the grammar of other languages, it was not sufficient to just fine-tune our English setup. We also had to expand the capabilities of SyntaxNet. The first extension is a model for text segmentation, which is the task of identifying word boundaries. In languages like English, this isn’t very hard -- you can mostly look for spaces and punctuation. In Chinese, however, this can be very challenging, because words are not separated by spaces. To correctly analyze dependencies between Chinese words, SyntaxNet needs to understand text segmentation -- and now it does.
Analysis of a Chinese string into a parse tree showing dependency labels, word tokens, and parts of speech (read top to bottom for each word token).
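One common way to frame segmentation (not necessarily SyntaxNet’s exact formulation) is as per-character tagging: a model labels each character as beginning, inside, or ending a word, or as a single-character word, and the word boundaries follow directly. A toy decoder for such tags:

```python
def tags_to_words(chars, tags):
    """Recover words from per-character tags: B(egin), I(nside), E(nd),
    S(ingle). The tags themselves would come from a trained sequence
    model; here they are simply given."""
    words, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):  # A word ends at this character.
            words.append("".join(current))
            current = []
    if current:  # Tolerate a truncated tag sequence.
        words.append("".join(current))
    return words

# Hypothetical model output for a short Chinese string ("I like books").
print(tags_to_words("我喜欢书", ["S", "B", "E", "S"]))  # ['我', '喜欢', '书']
```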
The second extension is a model for morphological analysis. Morphology is a language feature that plays a comparatively small role in English. It describes inflection, i.e., how the grammatical function and meaning of a word change as its form changes. In English, we add an -s to a word to indicate plurality. In Russian, a heavily inflected language, morphology can indicate number, gender, whether a word is the subject or object of a sentence, possession, prepositional relationships, and more. To understand the syntax of a sentence in Russian, SyntaxNet needs to understand morphology -- and now it does.
Parse trees showing dependency labels, parts of speech, and morphology.
As you might have noticed, the parse trees for all of the sentences above look very similar. This is because we follow the content-head principle, under which dependencies are drawn between content words, with function words becoming leaves in the parse tree. This idea was developed by the Universal Dependencies project in order to increase parallelism between languages. Parsey’s Cousins are trained on treebanks provided by this project and are designed to be cross-linguistically consistent and thus easier to use in multi-lingual language understanding applications.

Using the same set of labels across languages can help us understand how sentences in different languages, or variations in the same language, convey the same meaning. In all of the above examples, the root indicates the main verb of the sentence, and there is a passive nominal subject (indicated by the arc labeled ‘nsubjpass’) and a passive auxiliary (‘auxpass’). If you look closely, you will also notice some differences, since the grammars of the languages differ. For example, English uses the preposition ‘by,’ where Russian uses morphology to mark that the phrase ‘the publisher (издателем)’ is in instrumental case; the meaning is the same, it is just expressed differently.
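To make the parallel concrete, here is a hypothetical, simplified rendering of such a sentence pair as (token, dependency label, morphology) tuples. The sentences are invented for illustration, but the labels (‘nsubjpass’, ‘auxpass’) and the instrumental case feature follow the Universal Dependencies conventions described above.

```python
# English: the agent of the passive verb is introduced by the preposition "by".
english = [
    ("book",      "nsubjpass", {}),   # passive nominal subject
    ("was",       "auxpass",   {}),   # passive auxiliary
    ("published", "root",      {}),   # main verb of the sentence
    ("by",        "case",      {}),   # preposition marking the agent
    ("publisher", "nmod",      {}),
]

# Russian: the same agent role is marked by morphology, not a preposition.
russian = [
    ("книга",        "nsubjpass", {"Case": "Nom"}),
    ("была",         "auxpass",   {}),
    ("опубликована", "root",      {"Voice": "Pass"}),
    ("издателем",    "nmod",      {"Case": "Ins"}),  # instrumental case = agent
]
```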

Google has been involved in the Universal Dependencies project since its inception and we are very excited to be able to bring together our efforts on datasets and modeling. We hope that this release will facilitate research progress in building computer systems that can understand all of the world’s languages.

Parsey's Cousins can be found on GitHub, along with Parsey McParseface and SyntaxNet.