Tag Archives: Natural Language Processing

Robust Neural Machine Translation



In recent years, neural machine translation (NMT) using Transformer models has experienced tremendous success. Based on deep neural networks, NMT models are usually trained end-to-end on very large parallel corpora (input/output text pairs) in an entirely data-driven fashion and without the need to impose explicit rules of language.

Despite this huge success, NMT models can be sensitive to minor perturbations of the input, which can manifest as a variety of different errors, such as under-translation, over-translation or mistranslation. For example, given a German sentence, the state-of-the-art NMT model, Transformer, will yield a correct translation.

“Der Sprecher des Untersuchungsausschusses hat angekündigt, vor Gericht zu ziehen, falls sich die geladenen Zeugen weiterhin weigern sollten, eine Aussage zu machen.”

(Machine translation to English: “The spokesman of the Committee of Inquiry has announced that if the witnesses summoned continue to refuse to testify, he will be brought to court.”),

But, when we apply a subtle change to the input sentence, say from geladenen to the synonym vorgeladenen, the translation becomes very different (and in this case, incorrect):

“Der Sprecher des Untersuchungsausschusses hat angekündigt, vor Gericht zu ziehen, falls sich die vorgeladenen Zeugen weiterhin weigern sollten, eine Aussage zu machen.”

(Machine translation to English: “The investigative committee has announced that he will be brought to justice if the witnesses who have been invited continue to refuse to testify.”).

This lack of robustness in NMT models prevents many commercial systems from being applicable to tasks that cannot tolerate this level of instability. Therefore, learning robust translation models is not just desirable, but is often required in many scenarios. Yet, while the robustness of neural networks has been extensively studied in the computer vision community, only a few prior studies on learning robust NMT models can be found in the literature.

In “Robust Neural Machine Translation with Doubly Adversarial Inputs” (to appear at ACL 2019), we propose an approach that uses generated adversarial examples to improve the stability of machine translation models against small perturbations in the input. We learn a robust NMT model to directly overcome adversarial examples generated with knowledge of the model and with the intent of distorting the model predictions. We show that this approach improves the performance of the NMT model on standard benchmarks.

Training a Model with AdvGen
An ideal NMT model would generate similar translations for separate inputs that exhibit small differences. The idea behind our approach is to perturb a translation model with adversarial inputs in the hope of improving the model’s robustness. It does this using an algorithm called Adversarial Generation (AdvGen), which generates plausible adversarial examples for perturbing the model and then feeds them back into the model for defensive training. While this method is inspired by the idea of generative adversarial networks (GANs), it does not rely on a discriminator network, but simply applies the adversarial example in training, effectively diversifying and extending the training set.

The first step is to perturb the model using AdvGen. We start by using Transformer to calculate the translation loss based on a source input sentence, a target input sentence and a target output sentence. Then AdvGen randomly selects some words in the source sentence, assuming a uniform distribution. Each word has an associated list of similar words, i.e., candidates that can be used for substitution, from which AdvGen selects the word that is most likely to introduce errors in Transformer output. Then, this generated adversarial sentence is fed back into Transformer, initiating the defense stage.
First, the Transformer model is applied to an input sentence (lower left) and, in conjunction with the target output sentence (above right) and target input sentence (middle right; beginning with the placeholder “<sos>”), the translation loss is calculated. The AdvGen function then takes the source sentence, word selection distribution, word candidates, and the translation loss as inputs to construct an adversarial source example.
During the defense stage, the adversarial sentence is fed back into the Transformer model. Again the translation loss is calculated, but this time using the adversarial source input. Using the same method as above, AdvGen uses the target input sentence, word replacement candidates, the word selection distribution calculated from the attention matrix, and the translation loss to construct an adversarial target example.
In the defense stage, the adversarial source example serves as input to the Transformer model, and the translation loss is calculated. AdvGen then uses the same method as above to generate an adversarial target example from the target input.
Finally, the adversarial sentences are fed back into Transformer, and the robustness loss is calculated using the adversarial source example, the adversarial target input example and the target sentence. If the perturbation led to a significant loss, the loss is minimized so that when the model is confronted with similar perturbations, it will not repeat the same mistake. On the other hand, if the perturbation leads to a low loss, nothing happens, indicating that the model can already handle this perturbation.
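The attack step described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's implementation: it scores each candidate by re-evaluating a caller-supplied loss function, whereas the text only says that AdvGen picks the candidate most likely to introduce errors. The names adv_gen, loss_fn, candidates and replace_ratio, the toy loss and the candidate word list are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def adv_gen(
    sentence: Sequence[str],
    candidates: Dict[str, List[str]],
    loss_fn: Callable[[List[str]], float],
    replace_ratio: float = 0.25,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return a perturbed copy of `sentence` (hypothetical helper).

    Positions are sampled uniformly; each sampled word is swapped for the
    similar-word candidate that yields the highest loss.
    """
    rng = rng or random.Random(0)
    adversarial = list(sentence)
    # Only words with at least one substitution candidate can be replaced.
    replaceable = [i for i, word in enumerate(adversarial) if candidates.get(word)]
    if not replaceable:
        return adversarial
    num_swaps = min(len(replaceable), max(1, round(replace_ratio * len(adversarial))))
    for i in rng.sample(replaceable, num_swaps):
        def loss_with(word: str, i: int = i) -> float:
            # Loss of the sentence if position i held `word` instead.
            return loss_fn(adversarial[:i] + [word] + adversarial[i + 1:])
        adversarial[i] = max(candidates[adversarial[i]], key=loss_with)
    return adversarial


def toy_loss(words: List[str]) -> float:
    """Toy stand-in for the translation loss: one word is assumed to be hard."""
    return 1.0 if "vorgeladenen" in words else 0.0


source = ["die", "geladenen", "Zeugen"]
similar = {"geladenen": ["gerufenen", "vorgeladenen"]}  # assumed candidate list
adversarial_source = adv_gen(source, similar, toy_loss)
```

In defensive training, the returned sentence would be fed back to the model and the loss on it added to the training objective.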

Model Performance
We demonstrate the effectiveness of our approach by applying it to the standard Chinese-English and English-German translation benchmarks. We observed a notable improvement of 2.8 and 1.6 BLEU points, respectively, compared to the competitive Transformer model, achieving a new state-of-the-art performance.
Comparison of Transformer model (Vaswani et al., 2017) on standard benchmarks.
We then evaluate our model on a noisy dataset, generated using a procedure similar to that described for AdvGen. We take an input clean dataset, such as that used on standard translation benchmarks, and randomly select words for similar word substitution. We find that our model exhibits improved robustness compared to other recent models.
Comparison of Transformer, Miyato et al. and Cheng et al. on artificial noisy inputs.
These results show that our method is able to overcome small perturbations in the input sentence and improve the generalization performance. It outperforms competitive translation models and achieves state-of-the-art translation performance on standard benchmarks. We hope our translation model will serve as a robust building block for improving many downstream tasks, especially when those are sensitive or intolerant to imperfect translation input.

Acknowledgements
This research was conducted by Yong Cheng, Lu Jiang and Wolfgang Macherey. Additional thanks go to our leadership Andrew Moore and Julia (Wenli) Zhu.

Source: Google AI Blog


Multilingual Universal Sentence Encoder for Semantic Retrieval



Since it was introduced last year, “Universal Sentence Encoder (USE) for English” has become one of the most downloaded pre-trained text modules in TensorFlow Hub, providing versatile sentence embedding models that convert sentences into vector representations. These vectors capture rich semantic information that can be used to train classifiers for a broad range of downstream tasks. For example, a strong sentiment classifier can be trained from as few as one hundred labeled examples, while the same vectors can still be used to measure semantic similarity and for meaning-based clustering.

Today, we are pleased to announce the release of three new USE multilingual modules with additional features and potential applications. The first two modules provide multilingual models for retrieving semantically similar text, one optimized for retrieval performance and the other for speed and less memory usage. The third model is specialized for question-answer retrieval in sixteen languages (USE-QA), and represents an entirely new application of USE. All three multilingual modules are trained using a multi-task dual-encoder framework, similar to the original USE model for English, while using techniques we developed for improving the dual-encoder with additive margin softmax approach. They are designed not only to maintain good transfer learning performance, but to perform well on semantic retrieval tasks.
Multi-task training structure of the Universal Sentence Encoder. A variety of tasks and task structures are joined by shared encoder layers/parameters (pink boxes).
Semantic Retrieval Applications
The three new modules are all built on semantic retrieval architectures, which typically split the encoding of questions and answers into separate neural networks. This split makes it possible to search among billions of potential answers within milliseconds. The key to using dual encoders for efficient semantic retrieval is to pre-encode all candidate answers to expected input queries and store them in a vector database that is optimized for solving the nearest neighbor problem, which allows a large number of candidates to be searched quickly with good precision and recall. For all three modules, the input query is then encoded into a vector on which we can perform an approximate nearest neighbor search. Together, this enables good results to be found quickly without needing to do a direct query/candidate comparison for every candidate. The prototypical pipeline is illustrated below:
A prototypical semantic retrieval pipeline, used for textual similarity.
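The pre-encode-then-search pattern can be sketched as follows. This is a toy illustration only: toy_encode is a hypothetical bag-of-words stand-in that captures no semantics (unlike the USE modules), and the exhaustive scan in search stands in for the approximate nearest neighbor index a real system would use. All function names here are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # sparse vector: token -> weight


def toy_encode(text: str) -> Vector:
    """Hypothetical encoder: L2-normalized bag of lowercase tokens."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def dot(a: Vector, b: Vector) -> float:
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def build_index(candidates: List[str]) -> List[Tuple[str, Vector]]:
    """Encode every candidate once, ahead of query time."""
    return [(text, toy_encode(text)) for text in candidates]


def search(index: List[Tuple[str, Vector]], query: str, top_k: int = 1) -> List[Tuple[float, str]]:
    """Encode only the query, then rank the stored candidate vectors."""
    query_vec = toy_encode(query)
    scored = [(dot(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


faq_index = build_index([
    "how do i reset my password",
    "where is the nearest train station",
    "what time does the store open",
])
best_score, best_match = search(faq_index, "reset my password")[0]
```

The point of the design is that the candidate side is encoded once and stored, so each query costs one encoder call plus a vector search.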
Semantic Similarity Modules
For semantic similarity tasks, the query and candidates are encoded using the same neural network. Two common semantic retrieval tasks made possible by the new modules include Multilingual Semantic Textual Similarity Retrieval and Multilingual Translation Pair Retrieval.
  • Multilingual Semantic Textual Similarity Retrieval
    Most existing approaches for finding semantically similar text require being given a pair of texts to compare. However, using the Universal Sentence Encoder, semantically similar text can be extracted directly from a very large database. For example, in an application like FAQ search, a system can first index all possible questions with associated answers. Then, given a user’s question, the system can search for known questions that are semantically similar enough to provide an answer. A similar approach was used to find comparable sentences from 50 million sentences in Wikipedia. With the new multilingual USE models, this can be done in any of the supported non-English languages.
  • Multilingual Translation Pair Retrieval
    The newly released modules can also be used to mine translation pairs to train neural machine translation systems. Given a source sentence in one language (“How do I get to the restroom?”), they can find the potential translation target in any other supported language (“¿Cómo llego al baño?”).
Both new semantic similarity modules are cross-lingual. Given an input in Chinese, for example, the modules can find the best candidates, regardless of which language they are expressed in. This versatility can be particularly useful for languages that are underrepresented on the internet. For example, an early version of these modules has been used by Chidambaram et al. (2018) to provide classifications in circumstances where the training data is only available in a single language, e.g. English, but the end system must function in a range of other languages.

USE for Question-Answer Retrieval
The USE-QA module extends the USE architecture to question-answer retrieval applications, which generally take an input query and find relevant answers from a large set of documents that may be indexed at the document, paragraph, or even sentence level. The input query is encoded with the question encoding network, while the candidates are encoded with the answer encoding network.
Visualizing the action of a neural answer retrieval system. The blue point at the north pole represents the question vector. The other points represent the embeddings of various answers. The correct answer, highlighted here in red, is “closest” to the question, in that it minimizes the angular distance. The points in this diagram are produced by an actual USE-QA model; however, they have been projected downwards from ℝ⁵⁰⁰ to ℝ³ to assist the reader’s visualization.
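To make "closest in angular distance" concrete, here is a small sketch that computes the angle between a question vector and several answer vectors and picks the smallest. The vectors are made-up three-dimensional examples, not USE-QA outputs, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def closest_answer(question: Sequence[float], answers: List[Sequence[float]]) -> int:
    """Index of the answer vector with the smallest angle to the question."""
    return min(range(len(answers)), key=lambda i: angular_distance(question, answers[i]))


question_vec = [0.0, 0.0, 1.0]   # the "north pole"
answer_vecs = [
    [1.0, 0.0, 0.0],             # on the equator
    [0.0, 0.1, 1.0],             # near the pole
    [0.0, 0.0, -1.0],            # the south pole
]
best = closest_answer(question_vec, answer_vecs)
```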
Question-answer retrieval systems also rely on the ability to understand semantics. For example, consider a possible query to one such system, Google Talk to Books, which was launched in early 2018 and backed by a sentence-level index of over 100,000 books. A query, “What fragrance brings back memories?”, yields the result, “And for me, the smell of jasmine along with the pan bagnat, it brings back my entire carefree childhood.” Without specifying any explicit rules or substitutions, the vector encoding captures the semantic similarity between the terms fragrance and smell. The advantage provided by the USE-QA module is that it can extend question-answer retrieval tasks such as this to multilingual applications.

For Researchers and Developers
We're pleased to share the latest additions to the Universal Sentence Encoder family with the research community, and are excited to see what other applications will be found. These modules can be used as-is, or fine-tuned using domain-specific data. Lastly, we will also host the semantic similarity for natural language page on Cloud AI Workshop to further encourage research in this area.

Acknowledgements
Mandy Guo, Daniel Cer, Noah Constant, Jax Law, Muthuraman Chidambaram for core modeling, Gustavo Hernandez Abrego, Chen Chen, Mario Guajardo-Cespedes for infrastructure and colabs, Steve Yuan, Chris Tar, Yunhsuan Sung, Brian Strope, Ray Kurzweil for discussion of the model architecture.

Source: Google AI Blog


Advancing Semi-supervised Learning with Unsupervised Data Augmentation



Success in deep learning has largely been enabled by key factors such as algorithmic advancements, parallel processing hardware (GPU / TPU), and the availability of large-scale labeled datasets, like ImageNet. However, when labeled data is scarce, it can be difficult to train neural networks to perform well. In this case, one can apply data augmentation methods, e.g., paraphrasing a sentence or rotating an image, to effectively increase the amount of labeled training data. Recently, there has been significant progress in the design of data augmentation approaches for a variety of areas such as natural language processing (NLP), vision, and speech. Unfortunately, data augmentation is often limited to supervised learning only, in which labels are required to transfer from original examples to augmented ones.
Example augmentation operations for text-based (top) or image-based (bottom) training data.
In our recent work, “Unsupervised Data Augmentation (UDA) for Consistency Training”, we demonstrate that one can also perform data augmentation on unlabeled data to significantly improve semi-supervised learning (SSL). Our results support the recent revival of semi-supervised learning, showing that: (1) SSL can match and even outperform purely supervised learning that uses orders of magnitude more labeled data, (2) SSL works well across domains in both text and vision and (3) SSL combines well with transfer learning, e.g., when fine-tuning from BERT. We have also open-sourced our code on GitHub for the community to replicate and build upon.

Unsupervised Data Augmentation Explained
Unsupervised Data Augmentation (UDA) makes use of both labeled data and unlabeled data. To use labeled data, it computes the loss function using standard methods for supervised learning to train the model, as shown in the left part of the graph below. For unlabeled data, consistency training is applied to enforce the predictions to be similar for an unlabeled example and the augmented unlabeled example, as shown in the right part of the graph. Here, the same model is applied to both the unlabeled example and its augmented counterpart to produce two model predictions, from which a consistency loss is computed (i.e., the distance between the two prediction distributions). UDA then computes the final loss by jointly optimizing both the supervised loss from the labeled data and the unsupervised consistency loss from the unlabeled data.

An overview of Unsupervised Data Augmentation (UDA). Left: Standard supervised loss is computed when labeled data is available. Right: With unlabeled data, a consistency loss is computed between an example and its augmented version.
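The combined objective can be sketched numerically. The version below is a minimal illustration operating on already-computed probability vectors; it assumes KL divergence as the distance between the two prediction distributions and a hypothetical consistency_weight to balance the two terms. Neither the function names nor the default weight come from the post.

```python
import math
from typing import List, Sequence


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Supervised loss for one labeled example."""
    return -math.log(probs[label])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q), used here as the distance between two predictions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def uda_loss(
    labeled_probs: List[Sequence[float]],
    labels: List[int],
    unlabeled_probs: List[Sequence[float]],
    augmented_probs: List[Sequence[float]],
    consistency_weight: float = 1.0,  # assumed hyperparameter
) -> float:
    """Supervised loss plus weighted consistency loss (illustrative)."""
    supervised = sum(
        cross_entropy(p, y) for p, y in zip(labeled_probs, labels)
    ) / len(labels)
    consistency = sum(
        kl_divergence(p, q) for p, q in zip(unlabeled_probs, augmented_probs)
    ) / len(unlabeled_probs)
    return supervised + consistency_weight * consistency


# Same model applied to an unlabeled example and its augmented counterpart.
consistent = uda_loss([[0.5, 0.5]], [0], [[0.9, 0.1]], [[0.9, 0.1]])
inconsistent = uda_loss([[0.5, 0.5]], [0], [[0.9, 0.1]], [[0.6, 0.4]])
```

When the two predictions agree the consistency term vanishes and only the supervised loss remains; disagreement between an example and its augmented version raises the total loss.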
By minimizing the consistency loss, UDA allows for label information to propagate smoothly from labeled examples to unlabeled ones. Intuitively, one can think of UDA as an implicit iterative process. First, the model relies on a small amount of labeled examples to make correct predictions for some unlabeled examples, from which the label information is propagated to augmented counterparts through the consistency loss. Over time, more and more unlabeled examples will be predicted correctly which reflects the improved generalization of the model. Various other types of noise have been tested for consistency training (e.g., Gaussian noise, adversarial noise, and others), yet we found that data augmentation outperforms all of them, leading to state-of-the-art performance on a wide variety of tasks from language to vision. UDA applies different existing augmentation methods depending on the task at hand, including back translation, AutoAugment, and TF-IDF word replacement.

Benchmarks in NLP and Computer Vision
UDA is surprisingly effective in the low-data regime. With only 20 labeled examples, UDA achieves an error rate of 4.20 on the IMDb sentiment analysis task by leveraging 50,000 unlabeled examples. This result outperforms the previous state-of-the-art model trained on 25,000 labeled examples with an error rate of 4.32. In the large-data regime, with the full training set, UDA also provides robust gains.
Benchmark on IMDb, a sentiment analysis task. UDA surpasses state-of-the-art results in supervised learning across different training sizes.
On the CIFAR-10 semi-supervised learning benchmark, UDA outperforms all existing SSL methods, such as VAT, ICT, and MixMatch by significant margins. With 4k examples, UDA achieves an error rate of 5.27, matching the performance of the fully supervised model that uses 50k examples. Furthermore, with a more advanced architecture, PyramidNet+ShakeDrop, UDA achieves a new state-of-the-art error rate of 2.7, a more than 45% reduction in error rate compared to the previous best semi-supervised result. On SVHN, UDA achieves an error rate of 2.85 with only 250 labeled examples, matching the performance of the fully supervised model trained with ~70k labeled examples.
SSL benchmark on CIFAR-10, an image classification task. UDA surpasses all existing semi-supervised learning methods, all of which use the Wide-ResNet-28-2 architecture. At 4000 examples, UDA matches the performance of the fully supervised setting with 50,000 examples.
On ImageNet with 10% labeled examples, UDA improves the top-1 accuracy from 55.1% to 68.7%. In the high-data regime with the fully labeled set and 1.3M extra unlabeled examples, UDA continues to provide gains from 78.3% to 79.0% for top-1 accuracy.

Release
We have released the codebase of UDA, together with all data augmentation methods, e.g., back-translation with pre-trained translation models, to replicate our results. We hope that this release will further advance the progress in semi-supervised learning.

Acknowledgements
Special thanks to the co-authors of the paper Zihang Dai, Eduard Hovy, and Quoc V. Le. We’d also like to thank Hieu Pham, Adams Wei Yu, Zhilin Yang, Colin Raffel, Olga Wichrowska, Ekin Dogus Cubuk, Guokun Lai, Jiateng Xie, Yulun Du, Trieu Trinh, Ran Zhao, Ola Spyra, Brandon Yang, Daiyi Peng, Andrew Dai, Samy Bengio and Jeff Dean for their help with this project. A preprint is available online.

Source: Google AI Blog


Transformer-XL: Unleashing the Potential of Attention Models



To correctly understand an article, sometimes one will need to refer to a word or a sentence that occurs a few thousand words back. This is an example of long-range dependence — a common phenomenon found in sequential data — that must be understood in order to handle many real-world tasks. While people do this naturally, modeling long-term dependency with neural networks remains a challenge. Gating-based RNNs and the gradient clipping technique improve the ability of modeling long-term dependency, but are still not sufficient to fully address this issue.

One way to approach this challenge is to use Transformers, which allow direct connections between data units, offering the promise of better capturing long-term dependency. However, in language modeling, Transformers are currently implemented with a fixed-length context, i.e. a long text sequence is truncated into fixed-length segments of a few hundred characters, and each segment is processed separately.
Vanilla Transformer with a fixed-length context at training time.
This introduces two critical limitations:
  1. The algorithm is not able to model dependencies that are longer than a fixed length.
  2. The segments usually do not respect the sentence boundaries, resulting in context fragmentation which leads to inefficient optimization. This is particularly troublesome even for short sequences, where long range dependency isn't an issue.
To address these limitations, we propose Transformer-XL, a novel architecture that enables natural language understanding beyond a fixed-length context. Transformer-XL consists of two techniques: a segment-level recurrence mechanism and a relative positional encoding scheme.

Segment-level Recurrence
During training, the representations computed for the previous segment are fixed and cached to be reused as an extended context when the model processes the next new segment. This additional connection increases the largest possible dependency length by N times, where N is the depth of the network, because contextual information is now able to flow across segment boundaries. Moreover, this recurrence mechanism also resolves the context fragmentation issue, providing necessary context for tokens in the front of a new segment.
Transformer-XL with segment-level recurrence at training time.
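The effect of the cache on dependency length can be checked with a small receptive-field calculation. The sketch below does not run a Transformer; it only tracks which input token is the earliest one that can influence a given position at the top layer, assuming causal attention and a cache that holds exactly one previous segment. The function name and parameters are hypothetical.

```python
def earliest_visible_token(
    num_layers: int, segment_len: int, position: int, use_memory: bool
) -> int:
    """Earliest input index that can influence `position` at the top layer.

    Each layer lets a position attend to the layer below, from the start of
    its attention window up to itself. Without memory the window starts at
    the segment boundary; with memory it also covers the cached previous
    segment.
    """
    earliest = position
    for _ in range(num_layers):
        segment_start = (earliest // segment_len) * segment_len
        if use_memory:
            earliest = max(0, segment_start - segment_len)
        else:
            earliest = segment_start
    return earliest


layers, seg_len, pos = 3, 4, 19  # last token of the fifth segment
vanilla_span = pos - earliest_visible_token(layers, seg_len, pos, use_memory=False)
recurrent_span = pos - earliest_visible_token(layers, seg_len, pos, use_memory=True)
```

With three layers and segments of four tokens, the vanilla span stays inside the segment, while the recurrent span grows by one segment length per layer.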
Relative Positional Encodings
Naively applying segment-level recurrence does not work, however, because the positional encodings are not coherent when we reuse the previous segments. For example, consider an old segment with contextual positions [0, 1, 2, 3]. When a new segment is processed, we have positions [0, 1, 2, 3, 0, 1, 2, 3] for the two segments combined, where the semantics of each position id is incoherent throughout the sequence. To this end, we propose a novel relative positional encoding scheme to make the recurrence mechanism possible. Moreover, different from other relative positional encoding schemes, our formulation uses fixed embeddings with learnable transformations instead of learnable embeddings, and thus is more generalizable to longer sequences at test time. When both of these approaches are combined, Transformer-XL has a much longer effective context than a vanilla Transformer model at evaluation time.
Vanilla Transformer with a fixed-length context at evaluation time.

Transformer-XL with segment-level recurrence at evaluation time.
Furthermore, Transformer-XL is able to process the elements in a new segment all together without recomputation, leading to a significant speed increase (discussed below).
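The position-id clash mentioned in the relative positional encoding discussion, and the way relative distances avoid it, can be shown with plain integers. This sketch covers only the indexing idea, not the embedding or attention formulation from the paper; the function names are hypothetical.

```python
from typing import List


def absolute_ids(segment_len: int, num_segments: int) -> List[int]:
    """Per-segment absolute ids, which repeat once segments are concatenated."""
    return [i for _ in range(num_segments) for i in range(segment_len)]


def relative_distances(memory_len: int, segment_len: int) -> List[List[int]]:
    """For each query in the current segment, its distance to every key it can see.

    Keys cover the cached memory followed by the current segment, and each
    query sees keys up to and including its own position.
    """
    rows = []
    for query in range(memory_len, memory_len + segment_len):
        rows.append([query - key for key in range(query + 1)])
    return rows


clashing_ids = absolute_ids(segment_len=4, num_segments=2)
distances = relative_distances(memory_len=4, segment_len=4)
```

Here clashing_ids repeats 0 to 3, so two different tokens share an id, whereas each row of distances gives every visible key a distinct offset from the query regardless of which segment it came from.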

Results
Transformer-XL obtains new state-of-the-art (SoTA) results on a variety of major language modeling (LM) benchmarks, including character-level and word-level tasks on both long and short sequences. Empirically, Transformer-XL enjoys three benefits:
  1. Transformer-XL learns dependency that is about 80% longer than RNNs and 450% longer than vanilla Transformers, which generally have better performance than RNNs, but are not the best for long-range dependency modeling due to fixed-length contexts (please see our paper for details).
  2. Transformer-XL is up to 1,800+ times faster than a vanilla Transformer during evaluation on language modeling tasks, because no re-computation is needed (see figures above).
  3. Transformer-XL has better performance in perplexity (more accurate at predicting a sample) on long sequences because of long-term dependency modeling, and also on short sequences by resolving the context fragmentation problem.
Transformer-XL improves the SoTA bpc/perplexity from 1.06 to 0.99 on enwiki8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without fine-tuning). We are the first to break through the 1.0 barrier on char-level LM benchmarks.
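For readers unfamiliar with the two metrics quoted above, the following sketch computes them from the probabilities a model assigns to the correct next token. It uses the standard definitions (perplexity as the exponential of the mean negative log-likelihood, bpc as the mean negative base-2 log-likelihood per character); the probability values are made up for illustration.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the average negative log-likelihood; lower is better."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def bits_per_character(char_probs: Sequence[float]) -> float:
    """Average number of bits needed to encode each character; lower is better."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


# A model that always assigns probability 1/4 to the correct word.
word_level = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that always assigns probability 1/2 to the correct character.
char_level = bits_per_character([0.5] * 8)
```

A bpc below 1.0 therefore means the model assigns, on average (geometric mean), more than one half probability to the correct next character.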

We envision many exciting potential applications of Transformer-XL, including but not limited to improving language model pretraining methods such as BERT, generating realistic, long articles, and applications in the image and speech domains, which are also important areas in the world of long-term dependency. For more detail, please see our paper.

The code, pretrained models, and hyperparameters used in our paper are also available in both TensorFlow and PyTorch on GitHub.

Source: Google AI Blog


Natural Questions: a New Corpus and Challenge for Question Answering Research



Open-domain question answering (QA) is a benchmark task in natural language understanding (NLU) that aims to emulate how people look for information, finding answers to questions by reading and understanding entire documents. Given a question expressed in natural language ("Why is the sky blue?"), a QA system should be able to read the web (such as this Wikipedia page) and return the correct answer, even if the answer is somewhat complicated and long. However, there are currently no large, publicly available sources of naturally occurring questions (i.e. questions asked by a person seeking information) and answers that can be used to train and evaluate QA models. This is because assembling a high-quality dataset for question answering requires a large source of real questions and significant human effort in finding correct answers.

To help spur research advances in QA, we are excited to announce Natural Questions (NQ), a new, large-scale corpus for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions. NQ is large, consisting of 300,000 naturally occurring questions, along with human annotated answers from Wikipedia pages, to be used in training QA systems. We have additionally included 16,000 examples where answers (to the same questions) are provided by 5 different annotators, useful for evaluating the performance of the learned QA systems. Since answering the questions in NQ requires much deeper understanding than is needed to answer trivia questions — which are already quite easy for computers to solve — we are also announcing a challenge based on this data to help advance natural language understanding in computers.

The Data
NQ is the first dataset to use naturally occurring queries and focus on finding answers by reading an entire page, rather than extracting answers from a short paragraph. To create NQ, we started with real, anonymized, aggregated queries that users have posed to Google's search engine. We then asked annotators to find answers by reading through an entire Wikipedia page as they would if the question had been theirs. Annotators looked for both long answers that cover all of the information required to infer the answer, and short answers that answer the question succinctly with the names of one or more entities. The quality of the annotations in the NQ corpus has been measured at 90% accuracy.

Our paper "Natural Questions: a Benchmark for Question Answering Research", which has been accepted for publication in Transactions of the Association for Computational Linguistics, has a full description of the data collection process. To see some more examples from the dataset, please check out the NQ website.

The Challenge
NQ is aimed at enabling QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. Systems will need to first decide whether the question is sufficiently well defined to be answerable — many questions make false assumptions or are just too ambiguous to be answered concisely. Then they will need to decide whether there is any part of the Wikipedia page that contains all of the information needed to infer the answer. We believe that the long answer identification task — finding all of the information required to infer an answer — requires a deeper level of language understanding than finding short answers once the long answers are known.
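The staged decision described above can be written down as a small control-flow sketch. Everything here is hypothetical: the Prediction fields, the function names and the three pluggable components are illustrative placeholders, not the NQ data format or any baseline system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class Prediction:
    # Index of the passage chosen as the long answer, or None if there is none.
    long_answer_index: Optional[int] = None
    short_answers: List[str] = field(default_factory=list)


def answer_from_page(
    question: str,
    passages: Sequence[str],
    is_answerable: Callable[[str], bool],
    supports_answer: Callable[[str, str], bool],
    extract_short: Callable[[str, str], List[str]],
) -> Prediction:
    """Apply the three decisions in order (all components are assumed)."""
    # 1. Is the question well defined enough to be answered at all?
    if not is_answerable(question):
        return Prediction()
    # 2. Does any part of the page contain the information needed?
    for index, passage in enumerate(passages):
        if supports_answer(question, passage):
            # 3. Short answers are extracted only once a long answer is known.
            return Prediction(index, extract_short(question, passage))
    return Prediction()


page = ["Unrelated introduction.", "The sky looks blue because of scattering."]
result = answer_from_page(
    "why is the sky blue",
    page,
    is_answerable=lambda q: len(q.split()) > 2,
    supports_answer=lambda q, p: "sky" in p,
    extract_short=lambda q, p: ["scattering"],
)
```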

It is our hope that the release of NQ, and the associated challenge, will help spur the development of more effective and robust QA systems. We encourage the NLU community to participate and to help close the large gap between the performance of current state-of-the-art approaches and a human upper bound. Please visit the challenge website to view the leaderboard and learn more.

Source: Google AI Blog


Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing



One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.

This week, we open sourced a new technique for NLP pre-training called Bidirectional Encoder Representations from Transformers, or BERT. With this release, anyone in the world can train their own state-of-the-art question answering system (or a variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. The release includes source code built on top of TensorFlow and a number of pre-trained language representation models. In our associated paper, we demonstrate state-of-the-art results on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset (SQuAD v1.1).

What Makes BERT Different?
BERT builds upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).

Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the ... account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
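The three cases can be summarized by listing which neighboring words each kind of model may use when representing a given word. This is a toy illustration of the distinction only, with hypothetical names; it does not compute any embeddings.

```python
from typing import List


def visible_context(tokens: List[str], index: int, mode: str) -> List[str]:
    """Words available when building the representation of tokens[index]."""
    if mode == "context_free":
        return []                                   # one vector per vocabulary word
    if mode == "unidirectional":
        return tokens[:index]                       # previous words only
    if mode == "bidirectional":
        return tokens[:index] + tokens[index + 1:]  # previous and next words
    raise ValueError(f"unknown mode: {mode}")


sentence = "I accessed the bank account".split()
bank = sentence.index("bank")
left_only = visible_context(sentence, bank, "unidirectional")
both_sides = visible_context(sentence, bank, "bidirectional")
```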

A visualization of BERT’s neural network architecture compared to previous state-of-the-art contextual pre-training methods is shown below. The arrows indicate the information flow from one layer to the next. The green boxes at the top indicate the final contextualized representation of each input word:
BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional.
The Strength of Bidirectionality
If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.

To solve this problem, we use the straightforward technique of masking out some of the words in the input and then condition each word bidirectionally to predict the masked words. For example:
While this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network.
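A minimal sketch of the masking step, under stated assumptions: the `[MASK]` placeholder and the 15% masking rate follow the BERT paper rather than this post, and the paper's additional replacement rules are omitted for brevity. The function name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"  # placeholder token, as used in the BERT paper


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the masked sequence and a dict {position: original token},
    which serves as the prediction target for the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], {}
    for pos, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[pos] = tok
        else:
            masked.append(tok)
    return masked, labels


sentence = "the man went to the store to buy a gallon of milk".split()
masked, labels = mask_tokens(sentence, mask_prob=0.15, seed=3)
print(masked)
print(labels)
```

Because the masked word is hidden from the input, the model can safely use both the left and the right context to predict it.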

BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: Given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example:
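As a rough illustration, training pairs for this task could be generated from an ordered list of sentences as follows. This is a sketch, not the released code: the 50/50 split between true and random next sentences and the name `make_next_sentence_examples` are assumptions.

```python
import random


def make_next_sentence_examples(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples from an ordered corpus."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to sample negatives")
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: B really follows A in the corpus.
            examples.append((a, sentences[i + 1], True))
        else:
            # Negative example: B is any sentence other than A or A's true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((a, rng.choice(candidates), False))
    return examples


corpus = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live in the Southern Hemisphere.",
]
for a, b, is_next in make_next_sentence_examples(corpus):
    print(is_next, "|", a, "->", b)
```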
Training with Cloud TPUs
Everything that we’ve described so far might seem fairly straightforward, so what’s the missing piece that made it work so well? Cloud TPUs. Cloud TPUs gave us the freedom to quickly experiment, debug, and tweak our models, which was critical in allowing us to move beyond existing pre-training techniques. The Transformer model architecture, developed by researchers at Google in 2017, also gave us the foundation we needed to make BERT successful. The Transformer is implemented in our open source release, as well as the tensor2tensor library.

Results with BERT
To evaluate performance, we compared BERT to other state-of-the-art NLP systems. Importantly, BERT achieved all of its results with almost no task-specific changes to the neural network architecture. On SQuAD v1.1, BERT achieves 93.2% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and human-level score of 91.2%:
BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks. The amount of human-labeled training data in these tasks ranges from 2,500 examples to 400,000 examples, and BERT substantially improves upon the state-of-the-art accuracy on all of them:
Making BERT Work for You
The models that we are releasing can be fine-tuned on a wide variety of NLP tasks in a few hours or less. The open source release also includes code to run pre-training, although we believe the majority of NLP researchers who use BERT will never need to pre-train their own models from scratch. The BERT models that we are releasing today are English-only, but we hope to release models which have been pre-trained on a variety of languages in the near future.

The open source TensorFlow implementation and pointers to pre-trained BERT models can be found at http://goo.gl/language/bert. Alternatively, you can get started using BERT through Colab with the notebook “BERT FineTuning with Cloud TPUs.”

You can also read our paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" for more details.

Source: Google AI Blog


Google at EMNLP 2018



This week, the annual conference on Empirical Methods in Natural Language Processing (EMNLP 2018) will be held in Brussels, Belgium. Google will have a strong presence at EMNLP, with several of our researchers presenting their work on a diverse set of topics, including language identification, segmentation, semantic parsing and question answering, and serving at various levels of the conference organization. Googlers will also be presenting their papers and participating in the co-located Conference on Computational Natural Language Learning (CoNLL 2018) shared task on multilingual parsing.

In addition to this involvement, we are sharing several new datasets with the academic community that are released with papers published at EMNLP, with the goal of accelerating progress in empirical natural language processing (NLP). These releases are designed to help account for mismatches between the datasets a machine learning model is trained and tested on, and the inputs an NLP system would be asked to handle “in the wild”. All of the datasets we are releasing include realistic, naturally occurring text, and fall into two main categories: 1) challenge sets for well-studied core NLP tasks (part-of-speech tagging, coreference) and 2) datasets to encourage new directions of research on meaning preservation under rephrasings/edits (query well-formedness, split-and-rephrase, atomic edits):
  • Noun-Verb Ambiguity in POS Tagging Dataset: English part-of-speech taggers regularly make egregious errors related to noun-verb ambiguity, despite high accuracies on standard datasets. For example: in “Mark which area you want to distress” several state-of-the-art taggers annotate “Mark” as a noun instead of a verb. We release a new dataset of over 30,000 naturally occurring non-trivial annotated examples of noun-verb ambiguity. Taggers previously indistinguishable from each other have accuracies ranging from 57% to 75% on this challenge set.
  • Query Wellformedness Dataset: Web search queries are usually “word-salad” style queries with little resemblance to natural language questions (“barack obama height” as opposed to “What is the height of Barack Obama?”). Differentiating a natural language question from a query is of importance to several applications, including dialogue. We annotate and release 25,100 queries from the open-source Paralex corpus with ratings on how close they are to well-formed natural language questions.
  • WikiSplit: Split and Rephrase Dataset Extracted from Wikipedia Edits: We extract examples of sentence splits from Wikipedia edits where one sentence gets split into two sentences that together preserve the original meaning of the sentence (e.g., “Street Rod is the first in a series of two games released for the PC and Commodore 64 in 1989.” is split into “Street Rod is the first in a series of two games.” and “It was released for the PC and Commodore 64 in 1989.”). The released corpus contains one million sentence splits with a vocabulary of more than 600,000 words.
  • WikiAtomicEdits: A Multilingual Corpus of Atomic Wikipedia Edits: Information about how people edit language in Wikipedia can be used to understand the structure of language itself. We pay particular attention to two atomic edits: insertions and deletions that consist of a single contiguous span of text. We extract around 43 million such edits in 8 languages and show that they provide valuable information about entailment and discourse. For example, insertion of “in 1949” adds a prepositional phrase to the sentence “She died there after a long illness” resulting in “She died there in 1949 after a long illness”.
These datasets join the others that Google has recently released, such as Conceptual Captions and GAP Coreference Resolution in addition to our past contributions.
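The notion of an atomic insertion used in WikiAtomicEdits (a single contiguous span of text added to a sentence) can be checked mechanically. The sketch below works on whitespace-separated words and is only an illustration; the function name `atomic_insertion` is hypothetical and this is not the extraction code used to build the corpus.

```python
def atomic_insertion(before, after):
    """Return the inserted span if `after` equals `before` plus one
    contiguous run of words, otherwise None."""
    b, a = before.split(), after.split()
    if len(a) <= len(b):
        return None
    # Length of the shared prefix.
    p = 0
    while p < len(b) and b[p] == a[p]:
        p += 1
    # Length of the shared suffix that does not overlap the prefix.
    s = 0
    while s < len(b) - p and b[len(b) - 1 - s] == a[len(a) - 1 - s]:
        s += 1
    # Every word of `before` must be covered by the prefix or the suffix.
    if p + s != len(b):
        return None
    return " ".join(a[p:len(a) - s])


before = "She died there after a long illness"
after = "She died there in 1949 after a long illness"
print(atomic_insertion(before, after))
```

An atomic deletion is the same check with the arguments swapped, i.e. `atomic_insertion(after, before)`.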

Below is a full list of Google’s involvement and publications being presented at EMNLP and CoNLL. We are particularly happy to announce that the paper “Linguistically-Informed Self-Attention for Semantic Role Labeling” was awarded one of the two Best Long Paper awards. This work was done by our 2017 intern Emma Strubell, Googlers Daniel Andor and David Weiss, and Google Faculty Advisor Andrew McCallum. We congratulate these authors, and all other researchers who are presenting their work at the conference.

Area Chairs Include:
Ming-Wei Chang, Marius Pasca, Slav Petrov, Emily Pitler, Meg Mitchell, Taro Watanabe

EMNLP Publications
A Challenge Set and Methods for Noun-Verb Ambiguity
Ali Elkahky, Kellie Webster, Daniel Andor, Emily Pitler

A Fast, Compact, Accurate Model for Language Identification of Codemixed Text
Yuan Zhang, Jason Riesa, Daniel Gillick, Anton Bakalov, Jason Baldridge, David Weiss

AirDialogue: An Environment for Goal-Oriented Dialogue Research
Wei Wei, Quoc Le, Andrew Dai, Jia Li

Content Explorer: Recommending Novel Entities for a Document Writer
Michal Lukasik, Richard Zens

Deep Relevance Ranking using Enhanced Document-Query Interactions
Ryan McDonald, George Brokos, Ion Androutsopoulos

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, Christopher D. Manning

Identifying Well-formed Natural Language Questions
Manaal Faruqui, Dipanjan Das

Learning To Split and Rephrase From Wikipedia Edit History
Jan A. Botha, Manaal Faruqui, John Alex, Jason Baldridge, Dipanjan Das

Linguistically-Informed Self-Attention for Semantic Role Labeling
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, Andrew McCallum

Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text
Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, William Cohen

Noise Contrastive Estimation for Conditional Models: Consistency and Statistical Efficiency
Zhuang Ma, Michael Collins

Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification
Kelsey Ball, Dan Garrette

Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension
Minjoon Seo, Tom Kwiatkowski, Ankur P. Parikh, Ali Farhadi, Hannaneh Hajishirzi

Policy Shaping and Generalized Update Equations for Semantic Parsing from Denotations
Dipendra Misra, Ming-Wei Chang, Xiaodong He, Wen-tau Yih

Revisiting Character-Based Neural Machine Translation with Capacity and Compression
Colin Cherry, George Foster, Ankur Bapna, Orhan Firat, Wolfgang Macherey

Self-governing neural networks for on-device short text classification
Sujith Ravi, Zornitsa Kozareva

Semi-Supervised Sequence Modeling with Cross-View Training
Kevin Clark, Minh-Thang Luong, Christopher D. Manning, Quoc Le

State-of-the-art Chinese Word Segmentation with Bi-LSTMs
Ji Ma, Kuzman Ganchev, David Weiss

Subgoal Discovery for Hierarchical Dialogue Policy Learning
Da Tang, Xiujun Li, Jianfeng Gao, Chong Wang, Lihong Li, Tony Jebara

SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation
Xinyi Wang, Hieu Pham, Zihang Dai, Graham Neubig

The Importance of Generation Order in Language Modeling
Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, George Dahl

Training Deeper Neural Machine Translation Models with Transparent Attention
Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, Yonghui Wu

Understanding Back-Translation at Scale
Sergey Edunov, Myle Ott, Michael Auli, David Grangier

Unsupervised Natural Language Generation with Denoising Autoencoders
Markus Freitag, Scott Roy

WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse
Manaal Faruqui, Ellie Pavlick, Ian Tenney, Dipanjan Das

WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community
Yiqing Hua, Cristian Danescu-Niculescu-Mizil, Dario Taraborelli, Nithum Thain, Jeffery Sorensen, Lucas Dixon

EMNLP Demos
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo, John Richardson

Universal Sentence Encoder for English
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, Ray Kurzweil

CoNLL Shared Task
Multilingual Parsing from Raw Text to Universal Dependencies
Slav Petrov, co-organizer

Universal Dependency Parsing with Multi-Treebank Models
Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, Yan Shao, Sara Stymne
(Winner of the Universal POS Tagging and Morphological Tagging subtasks, using the open-sourced Meta-BiLSTM tagger)

CoNLL Publication
Sentence-Level Fluency Evaluation: References Help, But Can Be Spared!
Katharina Kann, Sascha Rothe, Katja Filippova

Source: Google AI Blog


Text-to-Speech for Low-Resource Languages (Episode 4): One Down, 299 to Go



This is the fourth episode in the series of posts reporting on the work we are doing to build text-to-speech (TTS) systems for low resource languages. In the first episode, we described the crowdsourced acoustic data collection effort for Project Unison. In the second episode, we described how we built parametric voices based on that data. In the third episode, we described the compilation of a pronunciation lexicon for a TTS system. In this episode, we describe how to make a single TTS system speak many languages.

Developing TTS systems for any given language is a significant challenge, and requires large amounts of high quality acoustic recordings and linguistic annotations. Because of this, these systems are only available for a tiny fraction of the world's languages. A natural question that arises in this situation is, instead of attempting to build a high quality voice for a single language using monolingual data from multiple speakers, as we described in the previous three episodes, can we somehow combine the limited monolingual data from multiple speakers of multiple languages to build a single multilingual voice that can speak any language?

Building upon an initial investigation into creating a multilingual TTS system that can synthesize speech in multiple languages from a single model, we developed a new model that uses a uniform phonological representation for all languages, the International Phonetic Alphabet (IPA). The model trained using this representation can synthesize both the languages seen in the training data and languages not observed in training. This has two main benefits. First, pooling training data from related languages increases phonemic coverage, which results in improved synthesis quality for the languages observed in training. Second, because the model contains many languages pooled together, there is a better chance that an “unseen” language will have a “related” language present in the model that will guide and aid the synthesis.

Exploring the Closely Related Languages of Indonesia
We applied this multilingual approach first to languages of Indonesia, where Standard Indonesian is the official national language, and is spoken natively or as a second language by more than 200 million people. Javanese, with roughly 90 million native speakers, and Sundanese, with approximately 40 million native speakers, constitute the two largest regional languages of Indonesia. Unlike Indonesian, which has received a lot of attention from computational linguists and speech scientists over the years, both Javanese and Sundanese are currently low-resourced due to the lack of openly available high-quality corpora. We collaborated with universities in Indonesia to collect crowd-sourced Javanese and Sundanese recordings.

Since our corpus of Standard Indonesian was much larger and recorded in a professional studio, our hypothesis was that combining the three languages may result in significant improvements over the systems constructed using a “classical” monolingual approach. To test this, we first analyzed the similarities and crucial differences between the phonologies of these three languages (shown below) and used this information to design a phonological representation that allows the maximum degree of sharing between the languages while preserving their crucial differences.
Joint phoneme inventory of Indonesian, Javanese, and Sundanese in International Phonetic Alphabet notation.
The resulting Javanese and Sundanese voices trained jointly with Standard Indonesian strongly outperformed our corresponding monolingual multispeaker voices that we used as a baseline. This allowed us to launch Javanese and Sundanese TTS in Google products, such as Google Translate and Android.

Expanding to the More Diverse Language Families of South Asia
Next, we focused on the languages of South Asia spanning two very different language families: Indo-Aryan and Dravidian. Unlike the languages of Indonesia described above, these languages are much more diverse. In particular, they have significantly smaller overlap in their phonologies. The table below shows a superset of the languages in our experiment, including the variety of orthographies used, as well as modern words related to the Sanskrit word for “culture”. These languages show considerable variation within each group, but also some similarities across groups.
Descendants of Sanskrit word for “culture” across languages.
In this work, we leveraged the unified phonological representation mentioned above to make the most of the data we have and reduce the scarcity of data for certain phonemes. This was accomplished by conflating similar phonemes into a single representative phoneme in the multilingual phoneme inventory. Where possible, we use the same inventory for phonologically close languages. For example, we have an identical phoneme inventory for Telugu and Kannada, and another one for West Bengali and Odia. For other language pairs like Gujarati and Marathi, we copied over the inventory of one language to another, but made a few changes to reflect the differences in their phonemic inventories. For all languages in these experiments we retained a common underlying representation, mapping similar phonemes across different inventories, so that we could still use the data from one language in training the others.

In addition, we made sure our representation is driven by the phonology in use, rather than the orthography. For example, although there are distinct letters for long and short vowels in Marathi, they are not contrastive in a linguistic sense, so we used a single representation for them, increasing the robustness of our training data. Similarly, if two languages use one character that was historically related to the same Sanskrit letter to represent different sounds, or different letters for a similar sound, our mapping reflected the phonological closeness rather than the historical or orthographic representation. Describing all the features of the unified phoneme inventory is outside the scope of this post; the details can be found in our recent paper.
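The idea of conflating phonemes into a shared inventory can be pictured as a lookup from (language, symbol) pairs to shared symbols. The table below is a toy, hypothetical example: the specific symbols, the language code `"xx"`, and the function `to_shared` are illustrative assumptions and do not reproduce the actual inventory from the paper.

```python
# Hypothetical mapping: (language code, language-specific symbol) -> shared phoneme.
SHARED_PHONEME = {
    # Marathi: long and short vowels collapsed, since length is not treated
    # as contrastive (toy entries chosen for illustration).
    ("mr", "i"): "i",
    ("mr", "iː"): "i",
    # A made-up language "xx" where vowel length is kept distinct.
    ("xx", "i"): "i",
    ("xx", "iː"): "iː",
}


def to_shared(lang, phonemes):
    """Map a language-specific phoneme sequence to the shared inventory.

    Symbols without an explicit entry are assumed to be shared already.
    """
    return [SHARED_PHONEME.get((lang, p), p) for p in phonemes]


print(to_shared("mr", ["k", "iː", "t"]))
print(to_shared("xx", ["k", "iː", "t"]))
```

With such a mapping, training examples from different languages that share a phoneme all contribute to the same unit in the acoustic model.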
Diagram illustrating our multilingual text-to-speech approach. The input text queries are processed by language-specific linguistic front-ends to generate pronunciations in a shared phonemic representation serving as input to the language-agnostic acoustic model. The model then generates audio for the respective queries.
Our experiments focused on Indian Bengali, Gujarati, Kannada, Malayalam, Marathi, Tamil, Telugu and Urdu. For most of these languages, apart from Bengali and Marathi, the recording data and the transcriptions were crowd-sourced. For each of these languages we constructed a multilingual acoustic model that used all the data available. In addition, the acoustic model included the previously crowd-sourced Nepali and Sinhala data, as well as Hindi and Bangladeshi Bengali.

The results were encouraging: for most of the languages, the multilingual voices outperformed the voices that were constructed using the traditional monolingual approach. We performed a further experiment with the Odia language, for which we had no training data, by attempting to synthesize it using the South Asian multilingual model. Subjective listening tests revealed that native speakers of Odia judged the resulting audio to be acceptable and intelligible. The resulting voices for Marathi, Tamil, Telugu and Malayalam, built using our multilingual approach in collaboration with the Speech team, were announced at the recent “Google for India” event and are now powering Google Translate as well as other Google products.

Using crowd-sourcing in data collections was interesting from a research point of view and rewarding in terms of establishing fruitful collaborations with the native speaker communities. Our experiments with the Malayo-Polynesian, Indo-Aryan and Dravidian language families have shown that in most instances carefully sharing the data across multiple languages in a single multilingual acoustic model using deep learning techniques alleviates some of the severe data scarcity issues plaguing the low-resource languages and results in good quality voices used in Google products.

This TTS research is a first step towards applying speech and language technology to more of the world’s many languages, and it is our hope that others will join us in this effort. To contribute to the research community we have open sourced corpora for Nepali, Sinhala, Bengali, Khmer, Javanese and Sundanese as we return from the SLTU and Interspeech conferences, where we have been discussing this work with other researchers. We are planning on continuing to release additional datasets for other languages in our projects in the future.

Source: Google AI Blog


Conceptual Captions: A New Dataset and Challenge for Image Captioning



The web is filled with billions of images, helping to entertain and inform the world on a countless variety of subjects. However, much of that visual information is not accessible to those with visual impairments, or with slow internet speeds that prohibit the loading of images. Image captions, manually added by website authors using Alt-text HTML, are one way to make this content more accessible, so that a natural-language description of an image can be presented using text-to-speech systems. However, existing human-curated Alt-text HTML fields are added for only a very small fraction of web images. And while automatic image captioning can help solve this problem, accurate image captioning is a challenging task that requires advancing the state of the art of both computer vision and natural language processing.
Image captioning can help millions with visual impairments by converting image captions to text. Image by Francis Vallance (Heritage Warrior), used under CC BY 2.0 license.
Today we introduce Conceptual Captions, a new dataset consisting of ~3.3 million image/caption pairs that are created by automatically extracting and filtering image caption annotations from billions of web pages. Introduced in a paper presented at ACL 2018, Conceptual Captions represents an order of magnitude increase of captioned images over the human-curated MS-COCO dataset. As measured by human raters, the machine-curated Conceptual Captions has an accuracy of ~90%. Furthermore, because images in Conceptual Captions are pulled from across the web, it represents a wider variety of image-caption styles than previous datasets, allowing for better training of image captioning models. To track progress on image captioning, we are also announcing the Conceptual Captions Challenge for the machine learning community to train and evaluate their own image captioning models on the Conceptual Captions test bed.
Illustration of images and captions in the Conceptual Captions dataset.
Clockwise from top left, images by Jonny Hunter, SigNote Cloud, Tony Hisgett, ResoluteSupportMedia. All images used under CC BY 2.0 license
Generating the Dataset
To generate the Conceptual Captions dataset, we start by sourcing images from the web that have Alt-text HTML attributes. We automatically screen these for certain properties to ensure image quality while also avoiding undesirable content such as adult themes. We then apply text-based filtering, removing captions with non-descriptive text (such as hashtags, poor grammar or added language that does not relate to the image); we also discard texts with high sentiment polarity or adult content (for more details on the filtering criteria, please see our paper). We use existing image classification models to make sure that, for any given image, there is overlap between its Alt-text (allowing for word variations) and the labels that the image classifier outputs for that image.
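The last filter, requiring overlap between the Alt-text and the image classifier's labels, might look roughly like the sketch below. The normalization here is a crude stand-in for “allowing for word variations”, and the function names are hypothetical; the actual criteria are described in the paper.

```python
def normalize(word):
    """Crude normalization: lowercase, strip punctuation, drop a plural 's'."""
    w = word.lower().strip(".,!?\"'")
    return w[:-1] if len(w) > 3 and w.endswith("s") else w


def overlaps_labels(alt_text, classifier_labels):
    """Keep a candidate only if its Alt-text shares a word with the labels."""
    alt_words = {normalize(w) for w in alt_text.split()}
    label_words = {normalize(w) for label in classifier_labels for w in label.split()}
    return bool(alt_words & label_words)


print(overlaps_labels("Two dogs playing in the park", ["dog", "grass"]))
print(overlaps_labels("IMG_1234 final", ["dog", "grass"]))
```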

From Specific Names to General Concepts
While candidates passing the above filters tend to be good Alt-text image descriptions, a large majority use proper names (for people, venues, locations, organizations etc.). This is problematic because it is very difficult for an image captioning model to learn such fine-grained proper name inference from input image pixels, and also generate natural-language descriptions simultaneously¹.

To address the above problems we wrote software that automatically replaces proper names with words representing the same general notion, i.e., with their concept. In some cases, the proper names are removed to simplify the text. For example, we substitute people’s names (e.g., “Former Miss World Priyanka Chopra on the red carpet” becomes “actor on the red carpet”), remove location names (“Crowd at a concert in Los Angeles” becomes “Crowd at a concert”), remove named modifiers (e.g., “Italian cuisine” becomes just “cuisine”) and correct newly formed noun phrases if needed (e.g., “artist and artist” becomes “artists”, see the example illustration below).
Illustration of text modification. Image by Rockoleando used under CC BY 2.0 license.
Finally, we cluster all resolved entities (e.g., “artist”, “dog”, “neighborhood”, etc.) and keep only the candidate types which have a count of over 100 mentions, a quantity sufficient to support representation learning for these entities. This retained around 16K entity concepts such as: “person”, “actor”, “artist”, “player” and “illustration”. Less frequent ones that we retained include “baguette”, “bridle”, “deadline”, “ministry” and “funnel”.
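The frequency cut-off described above can be sketched in a few lines. The input format (one list of resolved concepts per caption) and the name `frequent_concepts` are assumptions made for illustration.

```python
from collections import Counter


def frequent_concepts(caption_concepts, min_mentions=100):
    """Return the concepts mentioned more than `min_mentions` times overall."""
    counts = Counter(c for concepts in caption_concepts for c in concepts)
    return {concept for concept, n in counts.items() if n > min_mentions}


# Toy data: "actor" appears 101 times, "baguette" exactly 100 times.
toy_captions = [["actor"]] * 101 + [["baguette"]] * 100
print(frequent_concepts(toy_captions))
```

Captions whose concepts fall below the threshold would then be dropped from the candidate set.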

In the end, it required roughly one billion (English) webpages containing over 5 billion candidate images to obtain a clean and learnable image caption dataset of over 3M samples (a rejection rate of 99.94%). Our control parameters were biased towards high precision, although these can be tuned to generate an order of magnitude more examples with lower precision.
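The quoted rejection rate follows directly from the two lower bounds given above (over 5 billion candidates, over 3 million retained samples). A quick check, using those round numbers as stand-ins for the exact counts:

```python
candidates = 5_000_000_000  # "over 5 billion candidate images" (lower bound)
kept = 3_000_000            # "over 3M samples" (lower bound)

rejection_rate = 1 - kept / candidates
print(f"{rejection_rate:.2%}")
```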

Dataset Impact
To test the usefulness of our dataset, we independently trained both RNN-based and Transformer-based image captioning models implemented in Tensor2Tensor (T2T), using the MS-COCO dataset (using 120K images with 5 human-annotated captions per image) and the new Conceptual Captions dataset (using over 3.3M images with 1 caption per image). See our paper for more details on model architectures.

These models were tested using images from the Flickr30K dataset (which are out-of-domain for both MS-COCO and Conceptual Captions), and the resulting captions were evaluated using 3 human raters per test case. The results are reported in the table below.
From these results we conclude that models trained on Conceptual Captions generalized better than competing approaches irrespective of the architecture (i.e., RNN or Transformer). In addition, we found that Transformer models did better than RNN when trained on either dataset. The conclusion from these findings is that Conceptual Captions provides the ability to train image captioning models that perform better on a wide variety of images.

Get Involved
It is our hope that this dataset will help the machine learning community advance the state of the art in image captioning models. Importantly, since no human annotators were involved in its creation, this dataset is highly scalable, potentially allowing the expansion of the dataset to enable automatic creation of Alt-text-HTML-like descriptions for an even wider variety of images. We encourage all those interested to partake in the Conceptual Captions Challenge, and we look forward to seeing what the community can do! For more details and the latest results please visit the challenge website.

Acknowledgements
Thanks to Nan Ding, Sebastian Goodman and Bo Pang for training models with the Conceptual Captions dataset, and to Amol Wankhede for driving the public release efforts for the dataset.


1 In our paper, we posit that if automatic determination of names, locations, brands, etc. from the image is needed, it should be done as a separate task that may leverage image meta-information (e.g. GPS info), or complementary techniques such as OCR.

Source: Google AI Blog


The Machine Learning Behind Android Smart Linkify



Earlier this week we launched Android 9 Pie, the latest release of Android that uses machine learning to make your phone simpler to use. One of the features in Android 9 is Smart Linkify, a new API that adds clickable links when certain types of entities are detected in text. This is useful when, for example, you receive an address from a friend in a messaging app and want to look it up on a map. With a Smart Linkify-annotated text, it’s a lot easier!
Smart Linkify is a new version of the existing Android Linkify API. It is powered by a small feed-forward neural network (500kB per language) with low latency (less than 20ms on Google Pixel phones) and small inference code (250kB), and uses essentially the same machine learning technology that powers Smart Text Selection (released as part of Android Oreo) to now also create links.

Smart Linkify is available as an open-source TextClassifier API in Android (as the generateLinks method). The models were trained using TensorFlow and exported to a custom inference library backed by TensorFlow Lite and FlatBuffers. The C++ inference library for the models is available as part of the Android Open Source framework here, and runs on each text selection and Smart Linkify API call.

Finding Entities
Looking for phone numbers and postal addresses in text is a difficult problem. Not only are there many variations in how people write them, but it’s also often ambiguous what type of entity is being represented (e.g. “Confirmation number: 857-555-3556” is not a phone number even though it takes a similar form to one). As a solution, we designed an inference algorithm with two small feedforward neural networks at its heart. This algorithm is general enough to perform all kinds of entity chunking beyond just addresses and phone numbers.

Overall, the system architecture is as follows: A given input text is first split into words (based on space separation), then all possible word subsequences of certain maximum length (15 words in our case) are generated, and for each candidate the scoring neural net assigns a value (between 0 and 1) based on whether it represents a valid entity:
For the given text string, the first network assigns low scores to non-entities and a high score for the candidate that correctly selects the whole phone number.
Next, the generated entities that overlap are removed, favoring the ones with the higher score over the conflicting ones with a lower score. Now, we have a set of entities, but still don’t know their types. So now the second neural network is used to classify the type of the entity, as either a phone number, address or in some cases, a non-entity.
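The candidate generation and overlap removal steps can be sketched as follows. The scoring function here is a toy stand-in for the first neural network (it simply rewards longer all-digit spans), and all function names are hypothetical; only the 15-word limit and the "higher score wins" rule come from the description above.

```python
def candidate_spans(words, max_len=15):
    """All (start, end) word spans, end exclusive, of at most max_len words."""
    return [
        (start, end)
        for start in range(len(words))
        for end in range(start + 1, min(start + max_len, len(words)) + 1)
    ]


def toy_score(span_words):
    """Stand-in for the scoring network; returns a value between 0 and 1."""
    if not all(w.isdigit() for w in span_words):
        return 0.05
    return len(span_words) / (len(span_words) + 1)


def select_non_overlapping(scored):
    """Greedily keep the highest-scoring spans, dropping any that overlap one already kept."""
    chosen = []
    for span, score in sorted(scored, key=lambda item: item[1], reverse=True):
        if all(span[1] <= kept[0] or span[0] >= kept[1] for kept, _ in chosen):
            chosen.append((span, score))
    return sorted(chosen)


words = "And call 857 555 3556 tomorrow.".split()
scored = [((s, e), toy_score(words[s:e])) for s, e in candidate_spans(words)]
for (s, e), score in select_non_overlapping(scored):
    print(f"{score:.2f}", " ".join(words[s:e]))
```

In the real system, each surviving span would then be passed to the second network, which assigns it a type such as phone number, address or non-entity.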

In our example sentence, “And call 857 555 3556 tomorrow.”, the only non-conflicting entities are “857 555 3556” (classified as a phone number) and “And” (classified as a non-entity).

With the non-conflicting entities identified and classified, we can easily underline them in the displayed text on the screen and launch the right app when one is clicked.

Textual Features
So far, we’ve given a general description of the way Smart Linkify locates and classifies entities in a string of text. Here, we go into more detail on how the text is processed and fed to the network.

The task of the networks, given an entity candidate in the input text, is to determine whether the entity is valid, and then to classify it. To do this, the networks need to know the context surrounding the entity (in addition to the text string of the entity itself). In machine learning this is done by representing these parts as separate features. Effectively, the input text is split into several parts that are fed to the network separately:
Given a candidate entity span, we extract:
  • Left context: five words before the entity.
  • Entity start: first three words of the entity.
  • Entity end: last three words of the entity (they can be duplicated with the previous feature if they overlap, or padded if there are not that many).
  • Right context: five words after the entity.
  • Entity content: bag of words inside the entity.
  • Entity length: size of the entity in number of words.
These features are then concatenated together and fed as an input to the neural network.
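
As a rough sketch of this splitting, the function below returns the six feature groups for a candidate span. The `PAD` token and the padding of short left/right contexts are assumptions for illustration; the post only mentions padding for the entity start and end.

```python
from typing import Dict, List, Union

PAD = "<PAD>"  # hypothetical padding token; the post does not name one


def extract_features(
    words: List[str], start: int, end: int, context: int = 5, edge: int = 3
) -> Dict[str, Union[List[str], int]]:
    """Split the text around the span words[start:end] into six feature groups."""
    entity = words[start:end]
    left = words[max(0, start - context):start]
    right = words[end:end + context]
    head, tail = entity[:edge], entity[-edge:]
    return {
        "left_context": [PAD] * (context - len(left)) + left,
        "entity_start": head + [PAD] * (edge - len(head)),
        "entity_end": [PAD] * (edge - len(tail)) + tail,
        "right_context": right + [PAD] * (context - len(right)),
        "entity_content": sorted(entity),  # bag of words: order is discarded
        "entity_length": len(entity),
    }


words = "And call 857 555 3556 tomorrow.".split(" ")
features = extract_features(words, 2, 5)
```

Each word in these groups would then be turned into a vector before the groups are concatenated into the network input.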
The feature extraction operates with words, and we use character n-grams and a capitalization feature to represent the individual words as real vectors suitable as an input of the neural network:
  • Character N-grams. Instead of using the standard word embedding technique for representing words, which keeps a separate vector for each word in the model and thus would be infeasible for mobile devices because of its large storage size, we use the hashed charactergram embedding. This technique represents the word as a set of all character subsequences of a certain length. We use lengths 1 to 5. These strings are additionally hashed and mapped to a fixed number of buckets. As a result, the final model only stores vectors for each of the hash buckets, not each word/character subsequence, and can be kept small. The embedding matrix for the hashed charactergrams that we use has 20,000 buckets and 12 dimensions.
  • Capitalization. A binary feature that indicates whether the word starts with a capital letter. This is important for the network to know because the capitalization in postal addresses is quite distinct, and it helps the networks to discriminate addresses from other text.
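
The word representation above can be sketched in a few lines. The bucket count, embedding size and n-gram lengths come from the text; the CRC32 hash, the averaging of bucket vectors, and the toy table values are assumptions made only for this example.

```python
import zlib
from typing import List

NUM_BUCKETS = 20_000  # from the text
EMBED_DIM = 12        # from the text


def char_ngrams(word: str, min_n: int = 1, max_n: int = 5) -> List[str]:
    """All contiguous character subsequences of length min_n..max_n."""
    return [
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    ]


def bucket_ids(word: str) -> List[int]:
    """Hash each n-gram into a bucket; CRC32 is an illustrative hash choice."""
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in char_ngrams(word)]


def word_vector(word: str, table: List[List[float]]) -> List[float]:
    """Average the bucket embeddings (an assumption) and append the capital flag."""
    ids = bucket_ids(word)
    summed = [0.0] * EMBED_DIM
    for b in ids:
        for d in range(EMBED_DIM):
            summed[d] += table[b][d]
    averaged = [s / max(len(ids), 1) for s in summed]
    return averaged + [1.0 if word[:1].isupper() else 0.0]


# Toy embedding table with made-up values, standing in for trained weights.
table = [[0.01 * (b % 7)] * EMBED_DIM for b in range(NUM_BUCKETS)]
vec = word_vector("Main", table)
```

Because only the 20,000 bucket rows are stored, the table size stays fixed no matter how many distinct words appear in the text.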
A Training Dataset
There is no obvious dataset for this task on which we could readily train the networks, so we came up with a training algorithm that generates synthetic examples out of realistic pieces. Concretely, we gathered lists of addresses, phone numbers and named entities (like product, place and business names) and other random words from the Web (using Schema.org annotations), and used them to synthesize the training data for the neural networks. We take the entities as they are and generate random textual contexts around them (from the list of random words from the Web). Additionally, we add phrases like “Confirmation number:” or “ID:” to the negative training data for phone numbers, to teach the network to suppress phone number matches in these contexts.
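
A minimal sketch of this kind of synthesis is shown below. The function name, the filler word list and the context-length range are hypothetical; only the overall idea (a real entity wrapped in random context, with suppressing prefixes producing negatives) comes from the text.

```python
import random
from typing import List, Optional, Tuple

Example = Tuple[str, Tuple[int, int], str]  # (text, entity word span, label)


def synthesize(
    entity: str,
    label: str,
    filler: List[str],
    rng: random.Random,
    max_context: int = 5,
    prefix: Optional[str] = None,
) -> Example:
    """Wrap an entity in random context words; a prefix turns it into a negative."""
    left = [rng.choice(filler) for _ in range(rng.randint(0, max_context))]
    right = [rng.choice(filler) for _ in range(rng.randint(0, max_context))]
    if prefix is not None:
        left += prefix.split(" ")  # e.g. "Confirmation number:"
        label = "other"            # the number must not be matched in this context
    entity_words = entity.split(" ")
    text = " ".join(left + entity_words + right)
    return text, (len(left), len(left) + len(entity_words)), label


rng = random.Random(0)
filler = ["see", "you", "at", "please", "thanks", "the", "office"]
positive = synthesize("857 555 3556", "phone", filler, rng)
negative = synthesize("857 555 3556", "phone", filler, rng, prefix="Confirmation number:")
```

Drawing the context length at random, including zero, also exposes the networks to very short texts.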

Making it Work
There are a number of additional techniques that we had to use for training the network and making a practical mobile deployment:
  • Quantizing the embedding matrix to 8 bits. We found that we could reduce the size of the model almost 4x without compromising the performance, by quantizing the embedding matrix values to 8-bit integers.
  • Sharing embedding matrices between the selection and classification networks. This brings almost no loss and makes the model 2x smaller.
  • Varying the size of the context before/after the entities. On mobile screens text is often short, with not enough context, so the network needs to be exposed to this during training as well.
  • Creating artificial negative examples out of the positive ones for the classification network. For example, from the positive example “call me 857 555-3556 today”, in which the span “857 555-3556” carries the label “phone”, we generate a negative example from the same text with a different selected span and the label “other”. This teaches the classification network to be more precise about the entity span. Without this, the network would merely detect whether there is a phone number somewhere in the input, regardless of the span.
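
To make the first item in this list concrete, here is a small sketch of 8-bit quantization. The post does not say which scheme is used, so the uniform min/max mapping below is an assumption. If the original values are stored as 32-bit floats, one byte per value gives close to the 4x reduction mentioned above, since only a scale and an offset are kept in addition.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[bytes, float, float]:
    """Map floats onto 0..255 with a uniform scale (illustrative scheme)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant inputs
    codes = bytes(min(255, round((v - lo) / scale)) for v in values)
    return codes, scale, lo


def dequantize(codes: bytes, scale: float, lo: float) -> List[float]:
    """Recover approximate float values from the 8-bit codes."""
    return [lo + c * scale for c in codes]


weights = [-0.5, 0.0, 0.25, 0.5]
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)
```

Each restored value differs from the original by at most half a quantization step, which is why accuracy can be largely preserved.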
Internationalization is Important
The automatic data extraction we use makes it easier to train language-specific models. However, making them work for all languages is a challenge, requiring careful checking of language nuance by experts, as well as having an acceptable amount of training data. We found that having one model for all Latin-script languages works well (e.g. Czech, Polish, German, English), with individual models for each of Chinese, Japanese, Korean, Thai, Arabic and Russian. While Smart Linkify currently supports 16 languages, we are experimenting with models that support even more languages, which is especially challenging given the mobile model size constraints and the trickiness of languages that do not split words on spaces.

Next Steps
While the technique described in this post enables the fast and accurate annotation of phone numbers and postal addresses in text, the recognition of flight numbers, date and time, or IBAN, is currently implemented with a more traditional technique using standard regular expressions. However, we are looking into creating ML models for date and time as well, particularly for recognizing informal relative date/time specifications prevalent in messaging context, like “next Thursday” or “in 3 weeks”.

The small model and binary size, as well as low latency, are very important for mobile deployment. The models and the code we developed are available open-source as part of the Android framework. We believe that the architecture could extend to other on-device text annotation problems and we look forward to seeing new use cases from our developer community!

Source: Google AI Blog