Tag Archives: Machine Translation

Robust Neural Machine Translation



In recent years, neural machine translation (NMT) using Transformer models has experienced tremendous success. Based on deep neural networks, NMT models are usually trained end-to-end on very large parallel corpora (input/output text pairs) in an entirely data-driven fashion and without the need to impose explicit rules of language.

Despite this huge success, NMT models can be sensitive to minor perturbations of the input, which can manifest as a variety of different errors, such as under-translation, over-translation or mistranslation. For example, given a German sentence, the state-of-the-art NMT model, Transformer, will yield a correct translation.

“Der Sprecher des Untersuchungsausschusses hat angekündigt, vor Gericht zu ziehen, falls sich die geladenen Zeugen weiterhin weigern sollten, eine Aussage zu machen.”

(Machine translation to English: “The spokesman of the Committee of Inquiry has announced that if the witnesses summoned continue to refuse to testify, he will be brought to court.”),

But, when we apply a subtle change to the input sentence, say from geladenen to the synonym vorgeladenen, the translation becomes very different (and in this case, incorrect):

“Der Sprecher des Untersuchungsausschusses hat angekündigt, vor Gericht zu ziehen, falls sich die vorgeladenen Zeugen weiterhin weigern sollten, eine Aussage zu machen.”

(Machine translation to English: “The investigative committee has announced that he will be brought to justice if the witnesses who have been invited continue to refuse to testify.”).

This lack of robustness in NMT models prevents many commercial systems from being applicable to tasks that cannot tolerate this level of instability. Therefore, learning robust translation models is not just desirable, but is often required in many scenarios. Yet, while the robustness of neural networks has been extensively studied in the computer vision community, only a few prior studies on learning robust NMT models can be found in literature.

In “Robust Neural Machine Translation with Doubly Adversarial Inputs” (to appear at ACL 2019), we propose an approach that uses generated adversarial examples to improve the stability of machine translation models against small perturbations in the input. We learn a robust NMT model to directly overcome adversarial examples generated with knowledge of the model and with the intent of distorting the model predictions. We show that this approach improves the performance of the NMT model on standard benchmarks.

Training a Model with AdvGen
An ideal NMT model would generate similar translations for separate inputs that exhibit small differences. The idea behind our approach is to perturb a translation model with adversarial inputs in the hope of improving the model’s robustness. It does this using an algorithm called Adversarial Generation (AdvGen), which generates plausible adversarial examples for perturbing the model and then feeds them back into the model for defensive training. While this method is inspired by the idea of generative adversarial networks (GANs), it does not rely on a discriminator network, but simply applies the adversarial example in training, effectively diversifying and extending the training set.

The first step is to perturb the model using AdvGen. We start by using Transformer to calculate the translation loss based on a source input sentence, a target input sentence and a target output sentence. Then AdvGen randomly selects some words in the source sentence, assuming a uniform distribution. Each word has an associated list of similar words, i.e., candidates that can be used for substitution, from which AdvGen selects the word that is most likely to introduce errors in Transformer output. Then, this generated adversarial sentence is fed back into Transformer, initiating the defense stage.
First, the Transformer model is applied to an input sentence (lower left) and, in conjunction with the target output sentence (above right) and target input sentence (middle right; beginning with the placeholder “<sos>”), the translation loss is calculated. The AdvGen function then takes the source sentence, word selection distribution, word candidates, and the translation loss as inputs to construct an adversarial source example.
During the defend stage, the adversarial sentence is fed back into the Transformer model. Again the translation loss is calculated, but this time using the adversarial source input. Using the same method as above, AdvGen uses the target input sentence, word replacement candidates, the word selection distribution calculated by the attention matrix, and the translation loss to construct an adversarial target example.
In the defense stage, the adversarial source example serves as input to the Transformer model, and the translation loss is calculated. AdvGen then uses the same method as above to generate an adversarial target example from the target input.
Finally, the adversarial sentence is fed back into Transformer and the robustness loss using the adversarial source example, the adversarial target input example and the target sentence is calculated. If the perturbation led to a significant loss, the loss is minimized so that when the model is confronted with similar perturbations, it will not repeat the same mistake. On the other hand, if the perturbation leads to a low loss, nothing happens, indicating that the model can already handle this perturbation.

Model Performance
We demonstrate the effectiveness of our approach by applying it to the standard Chinese-English and English-German translation benchmarks. We observed a notable improvement of 2.8 and 1.6 BLEU points, respectively, compared to the competitive Transformer model, achieving a new state-of-the-art performance.
Comparison of Transformer model (Vaswani et al., 2017) on standard benchmarks.
We then evaluate our model on a noisy dataset, generated using a procedure similar to that described for AdvGen. We take an input clean dataset, such as that used on standard translation benchmarks, and randomly select words for similar word substitution. We find that our model exhibits improved robustness compared to other recent models.
Comparison of Transformer, Miyao et al. and Cheng et al. on artificial noisy inputs.
These results show that our method is able to overcome small perturbations in the input sentence and improve the generalization performance. It outperforms competitive translation models and achieves state-of-the-art translation performance on standard benchmarks. We hope our translation model will serve as a robust building block for improving many downstream tasks, especially when those are sensitive or intolerant to imperfect translation input.

Acknowledgements
This research was conducted by Yong Cheng, Lu Jiang and Wolfgang Macherey. Additional thanks go to our leadership Andrew Moore and Julia (Wenli) Zhu‎.

Source: Google AI Blog


Applying AutoML to Transformer Architectures



Since it was introduced a few years ago, Google’s Transformer architecture has been applied to challenges ranging from generating fantasy fiction to writing musical harmonies. Importantly, the Transformer’s high performance has demonstrated that feed forward neural networks can be as effective as recurrent neural networks when applied to sequence tasks, such as language modeling and translation. While the Transformer and other feed forward models used for sequence problems are rising in popularity, their architectures are almost exclusively manually designed, in contrast to the computer vision domain where AutoML approaches have found state-of-the-art models that outperform those that are designed by hand. Naturally, we wondered if the application of AutoML in the sequence domain could be equally successful.

After conducting an evolution-based neural architecture search (NAS), using translation as a proxy for sequence tasks in general, we found the Evolved Transformer, a new Transformer architecture that demonstrates promising improvements on a variety of natural language processing (NLP) tasks. Not only does the Evolved Transformer achieve state-of-the-art translation results, but it also demonstrates improved performance on language modeling when compared to the original Transformer. We are releasing this new model as part of Tensor2Tensor, where it can be used for any sequence problem.

Developing the Techniques
To begin the evolutionary NAS, it was necessary for us to develop new techniques, due to the fact that the task used to evaluate the “fitness” of each architecture, WMT’14 English-German translation, is computationally expensive. This makes the searches more expensive than similar searches executed in the vision domain, which can leverage smaller datasets, like CIFAR-10. The first of these techniques is warm starting—seeding the initial evolution population with the Transformer architecture instead of random models. This helps ground the search in an area of the search space we know is strong, thereby allowing it to find better models faster.

The second technique is a new method we developed called Progressive Dynamic Hurdles (PDH), an algorithm that augments the evolutionary search to allocate more resources to the strongest candidates, in contrast to previous works, where each candidate model of the NAS is allocated the same amount of resources when it is being evaluated. PDH allows us to terminate the evaluation of a model early if it is flagrantly bad, allowing promising architectures to be awarded more resources.

The Evolved Transformer
Using these methods, we conducted a large-scale NAS on our translation task and discovered the Evolved Transformer (ET). Like most sequence to sequence (seq2seq) neural network architectures, it has an encoder that encodes the input sequence into embeddings and a decoder that uses those embeddings to construct an output sequence; in the case of translation, the input sequence is the sentence to be translated and the output sequence is the translation.

The most interesting feature of the Evolved Transformer is the convolutional layers at the bottom of both its encoder and decoder modules that were added in a similar branching pattern in both places (i.e. the inputs run through two separate convolutional layers before being added together).
A comparison between the Evolved Transformer and the original Transformer encoder architectures. Notice the branched convolution structure at the bottom of the module, which formed in both the encoder and decoder independently. See our paper for a description of the decoder.
This is particularly interesting because the encoder and decoder architectures are not shared during the NAS, so this architecture was independently discovered as being useful in both the encoder and decoder, speaking to the strength of this design. Whereas the original Transformer relied solely on self-attention, the Evolved Transformer is a hybrid, leveraging the strengths of both self-attention and wide convolution.

Evaluation of the Evolved Transformer
To test the effectiveness of this new architecture, we first compared it to the original Transformer on the English-German translation task we used during the search. We found that the Evolved Transformer had better BLEU and perplexity performance at all parameter sizes, with the biggest gain at the size compatible with mobile devices (~7 million parameters), demonstrating an efficient use of parameters. At a larger size, the Evolved Transformer reaches state-of-the-art performance on WMT’ 14 En-De with a BLEU score of 29.8 and a SacreBLEU score of 29.2.
Comparison between the Evolved Transformer and the original Transformer on WMT’14 En-De at varying sizes. The biggest gains in performance occur at smaller sizes, while ET also shows strength at larger sizes, outperforming the largest Transformer with 37.6% less parameters (models to compare are circled in green). See Table 3 in our paper for the exact numbers.
To test generalizability, we also compared ET to the Transformer on additional NLP tasks. First, we looked at translation using different language pairs, and found ET demonstrated improved performance, with margins similar to those seen on English-German; again, due to its efficient use of parameters, the biggest improvements were observed for medium sized models. We also compared the decoders of both models on language modeling using LM1B, and saw a performance improvement of nearly 2 perplexity.
Future Work
These results are the first step in exploring the application of architecture search to feed forward sequence models. The Evolved Transformer is being open sourced as part of Tensor2Tensor, where it can be used for any sequence problem. To promote reproducibility, we are also open sourcing the search space we used for our search and a Colab with an implementation of Progressive Dynamic Hurdles. We look forward to seeing what the research community does with the new model and hope that others are able to build off of these new search techniques!

Source: Google AI Blog


Introducing Translatotron: An End-to-End Speech-to-Speech Translation Model



Speech-to-speech translation systems have been developed over the past several decades with the goal of helping people who speak different languages to communicate with each other. Such systems have usually been broken into three separate components: automatic speech recognition to transcribe the source speech as text, machine translation to translate the transcribed text into the target language, and text-to-speech synthesis (TTS) to generate speech in the target language from the translated text. Dividing the task into such a cascade of systems has been very successful, powering many commercial speech-to-speech translation products, including Google Translate.

In “Direct speech-to-speech translation with a sequence-to-sequence model”, we propose an experimental new system that is based on a single attentive sequence-to-sequence model for direct speech-to-speech translation without relying on intermediate text representation. Dubbed Translatotron, this system avoids dividing the task into separate stages, providing a few advantages over cascaded systems, including faster inference speed, naturally avoiding compounding errors between recognition and translation, making it straightforward to retain the voice of the original speaker after translation, and better handling of words that do not need to be translated (e.g., names and proper nouns).

Translatotron
The emergence of end-to-end models on speech translation started in 2016, when researchers demonstrated the feasibility of using a single sequence-to-sequence model for speech-to-text translation. In 2017, we demonstrated that such end-to-end models can outperform cascade models. Many approaches to further improve end-to-end speech-to-text translation models have been proposed recently, including our effort on leveraging weakly supervised data. Translatotron goes a step further by demonstrating that a single sequence-to-sequence model can directly translate speech from one language into speech in another language, without relying on an intermediate text representation in either language, as is required in cascaded systems.

Translatotron is based on a sequence-to-sequence network which takes source spectrograms as input and generates spectrograms of the translated content in the target language. It also makes use of two other separately trained components: a neural vocoder that converts output spectrograms to time-domain waveforms, and, optionally, a speaker encoder that can be used to maintain the character of the source speaker’s voice in the synthesized translated speech. During training, the sequence-to-sequence model uses a multitask objective to predict source and target transcripts at the same time as generating target spectrograms. However, no transcripts or other intermediate text representations are used during inference.

Model architecture of Translatotron.
Performance
We validated Translatotron’s translation quality by measuring the BLEU score, computed with text transcribed by a speech recognition system. Though our results lag behind a conventional cascade system, we have demonstrated the feasibility of the end-to-end direct speech-to-speech translation.

Compared in the audio clips below are the direct speech-to-speech translation output from Translatotron to that of the baseline cascade method. In this case, both systems provide a suitable translation and speak naturally using the same canonical voice.


Input (Spanish)
Reference translation (English)
Baseline cascade translation
Translatotron translation

You can listen to more audio samples here.

Preserving Vocal Characteristics
By incorporating a speaker encoder network, Translatotron is also able to retain the original speaker’s vocal characteristics in the translated speech, which makes the translated speech sound more natural and less jarring. This feature leverages previous Google research on speaker verification and speaker adaptation for TTS. The speaker encoder is pretrained on the speaker verification task, learning to encode speaker characteristics from a short example utterance. Conditioning the spectrogram decoder on this encoding makes it possible to synthesize speech with similar speaker characteristics, even though the content is in a different language.

The audio clips below demonstrate the performance of Translatotron when transferring the original speaker’s voice to the translated speech. In this example, Translatotron gives more accurate translation than the baseline cascade model, while being able to retain the original speaker’s vocal characteristics. The Translatotron output that retains the original speaker’s voice is trained with less data than the one using the canonical voice, so that they yield slightly different translations.

Input (Spanish)
Reference translation (English)
Baseline cascade translation
Translatotron translation (canonical voice)
Translatotron translation (original speaker’s voice)

More audio samples are available here.

Conclusion
To the best of our knowledge, Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language. It is also able to retain the source speaker’s voice in the translated speech. We hope that this work can serve as a starting point for future research on end-to-end speech-to-speech translation systems.

Acknowledgments
This research was a joint work between the Google Brain, Google Translate, and Google Speech teams. Contributors include Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Mengmeng Niu, Quan Wang, Jason Pelecanos, Ignacio Lopez Moreno, Tom Walters, Heiga Zen, Patrick Nguyen, Yu Zhang, Jonathan Shen, Orhan Firat, and Yonghui Wu. We also thank Jorge Pereira and Stella Laurenzo for verifying the quality of the translation from Translatotron.

Source: Google AI Blog


Providing Gender-Specific Translations in Google Translate



Over the past few years, Google Translate has made significant improvements to translation quality by switching to an end-to-end neural network-based system. At the same time, we realized that translations from our models can reflect societal biases, such as gender bias. Specifically, languages differ a lot in how they represent gender, and when there are ambiguities during translation, the systems tend to pick gender choices that reflect societal asymmetries, resulting in biased translations. For instance, Google Translate historically translated the Turkish equivalent of “He/she is a doctor” into the masculine form, and the Turkish equivalent of “He/she is a nurse” into the feminine form.

Recently, we announced that we’re taking the first step at reducing gender bias in our translations. We now provide both feminine and masculine translations when translating single-word queries from English to four different languages (French, Italian, Portuguese, and Spanish), and when translating phrases and sentences from Turkish to English.
Gender-specific translations on the Google Translate website.
Supporting gender-specific translations for single-word queries involved enriching our underlying dictionary with gender attributes. Supporting gender-specific translations for longer queries (phrases and sentences) was particularly challenging and involved making significant changes to our translation framework. For these longer queries, we focused initially on Turkish-to-English translation. We developed a three-step approach to solve the problem of providing a masculine and feminine translation in English for a gender-neutral query in Turkish.
Detecting Gender-Neutral Queries
Many Turkish sentences that refer to people are gender-neutral, but not all are. Detecting which queries are eligible for gender-specific translations is a hard problem because Turkish is morphologically complex, meaning that reference to a person can either be explicit with a gender-neutral pronoun (e.g. O, Ona) or implicitly encoded. For example, the sentence “Biliyor mu?” has no explicit gender-neutral pronoun but can be translated as either “Does she know?” or “Does he know?”. This complexity means that we cannot use a simple list of gender-neutral pronouns to detect gender-neutral Turkish queries and need a machine-learned system. We estimate that approximately 10% of Turkish Translate queries are ambiguous, and eligible for both feminine and masculine translations.

To detect these queries, we use state-of-the-art text classification algorithms (same as those used in our Cloud Natural Language API) to build a system that is able to detect when a given Turkish query is gender-neutral. Since this introduces an additional step before obtaining the translations, we had to carefully balance model complexity with latency. We trained our system on thousands of human-rated Turkish examples, where raters were asked to judge whether a given example is gender-neutral or not. Our final classification system is a convolutional neural network that can accurately detect queries which require gender-specific translations.

Generating Gender-Specific Translations
Next, we enhanced our underlying Neural Machine Translation (NMT) system to produce feminine and masculine translations when requested. When no gender is requested, we trained the model to produce the default translation. This involved:
  • Identifying and dividing our parallel training data into those with feminine words, those with masculine and those with ungendered words.
  • Adding an additional input token to the beginning of the sentence to specify the required gender to translate to, similar to how we build multilingual NMT systems:
    • <2MALE> O bir doktor → He is a doctor
    • <2FEMALE> O bir doktor → She is a doctor
  • Training our enhanced NMT model on the feminine, masculine and ungendered data sources. We experimented with various mixing ratios for these sources to enable the model to perform equally well on the three tasks.
If a user's query is determined to be gender-neutral, we add a gender prefix to the translation request. For these requests, our final NMT model can reliably produce feminine and masculine translations 99% of the time. Additionally, the system maintains translation quality on queries without the gender prefix.

Checking for Accuracy
Finally, we have a step that decides whether to display the gender-specific translations. Since the training data that produces the masculine translation is different from the training data that produces the feminine translation, there may be differences between the two translations unrelated to gender. If the gender-specific translations are determined to be low quality, we show only the single default translation. To determine the quality of the gender-specific translations, we verify:
  • If the requested feminine translation is feminine.
  • If the requested masculine translation is masculine.
  • If the feminine and masculine translations are exactly equivalent with the exception of gender-related changes. Even minor changes in the wording between the translations will result in being filtered.
Top: The masculine and feminine translations differ only with respect to gender i.e. “he” and “his” vs “she” and “her”. Hence, we will show gender-specific translations. Bottom: The masculine and feminine translations differ correctly with respect to gender i.e. “he” vs “she”. However, the change from “really” to “actually” is not related to gender. Hence, we will filter gender-specific translations and display the default translation.
Putting it all together, input sentences first go through the classifier, which detects whether they’re eligible for gender-specific translations. If the classifier says “yes”, we send three requests to our enhanced NMT model—a feminine request, a masculine request and an ungendered request. Our final step takes into account all three responses and decides whether to display gender-specific translations or a single default translation. This step is still quite conservative in order to maximize the quality of gender-specific translations shown; hence our overall recall is only around 60%. We plan to increase our coverage and add support for more complex sentences in future iterations.

This is just the first step toward addressing gender bias in machine-translation systems and reiterates Google’s commitment to fairness in machine learning. In the future, we plan to extend gender-specific translations to more languages and to address non-binary gender in translations.

Acknowledgements:
This effort has been successful thanks to the hard work of a lot of people including, but not limited to, the following (in alphabetical order of last name): Lindsey Boran, HyunJeong Choe, Héctor Fernández Alcalde, Orhan Firat, Qin Gao, Rick Genter, Macduff Hughes, Tolga Kayadelen, James Kuczmarski, Tatiana Lando, Liu Liu, Michael Mandl, Nihal Meriç Atilla, Mengmeng Niu, Adnan Ozturel, Emily Pitler, Kathy Ray, John Richardson, Larissa Rinaldi, Alex Rudnick, Apu Shah, Jason Smith, Antonio Stella, Romina Stella, Jana Strnadova, Katrin Tomanek, Barak Turovsky, Dan Schwarz, Shilp Vaishnav, Clayton Watts, Kellie Webster, Colin Young, Pendar Yousefi, Candice Zhang and Min Zhao.

Source: Google AI Blog


Moving Beyond Translation with the Universal Transformer



Last year we released the Transformer, a new machine learning model that showed remarkable success over existing algorithms for machine translation and other language understanding tasks. Before the Transformer, most neural network based approaches to machine translation relied on recurrent neural networks (RNNs) which operate sequentially (e.g. translating words in a sentence one-after-the-other) using recurrence (i.e. the output of each step feeds into the next). While RNNs are very powerful at modeling sequences, their sequential nature means that they are quite slow to train, as longer sentences need more processing steps, and their recurrent structure also makes them notoriously difficult to train properly.

In contrast to RNN-based approaches, the Transformer used no recurrence, instead processing all words or symbols in the sequence in parallel while making use of a self-attention mechanism to incorporate context from words farther away. By processing all words in parallel and letting each word attend to other words in the sentence over multiple processing steps, the Transformer was much faster to train than recurrent models. Remarkably, it also yielded much better translation results than RNNs. However, on smaller and more structured language understanding tasks, or even simple algorithmic tasks such as copying a string (e.g. to transform an input of “abc” to “abcabc”), the Transformer does not perform very well. In contrast, models that perform well on these tasks, like the Neural GPU and Neural Turing Machine, fail on large-scale language understanding tasks like translation.

In “Universal Transformers” we extend the standard Transformer to be computationally universal (Turing complete) using a novel, efficient flavor of parallel-in-time recurrence which yields stronger results across a wider range of tasks. We built on the parallel structure of the Transformer to retain its fast training speed, but we replaced the Transformer’s fixed stack of different transformation functions with several applications of a single, parallel-in-time recurrent transformation function (i.e. the same learned transformation function is applied to all symbols in parallel over multiple processing steps, where the output of each step feeds into the next). Crucially, where an RNN processes a sequence symbol-by-symbol (left to right), the Universal Transformer processes all symbols at the same time (like the Transformer), but then refines its interpretation of every symbol in parallel over a variable number of recurrent processing steps using self-attention. This parallel-in-time recurrence mechanism is both faster than the serial recurrence used in RNNs, and also makes the Universal Transformer more powerful than the standard feedforward Transformer.
The Universal Transformer repeatedly refines a series of vector representations (shown as h1 to hm) for each position of the sequence in parallel, by combining information from different positions using self-attention and applying a recurrent transition function. Arrows denote dependencies between operations.
At each step, information is communicated from each symbol (e.g. word in the sentence) to all other symbols using self-attention, just like in the original Transformer. However, now the number of times this transformation is applied to each symbol (i.e. the number of recurrent steps) can either be manually set ahead of time (e.g. to some fixed number or to the input length), or it can be decided dynamically by the Universal Transformer itself. To achieve the latter, we added an adaptive computation mechanism to each position which can allocate more processing steps to symbols that are more ambiguous or require more computations.

As an intuitive example of how this could be useful, consider the sentence “I arrived at the bank after crossing the river”. In this case, more context is required to infer the most likely meaning of the word “bank” compared to the less ambiguous meaning of “I” or “river”. When we encode this sentence using the standard Transformer, the same amount of computation is applied unconditionally to each word. However, the Universal Transformer’s adaptive mechanism allows the model to spend increased computation only on the more ambiguous words, e.g. to use more steps to integrate the additional contextual information needed to disambiguate the word “bank”, while spending potentially fewer steps on less ambiguous words.

At first it might seem restrictive to allow the Universal Transformer to only apply a single learned function repeatedly to process its input, especially when compared to the standard Transformer which learns to apply a fixed sequence of distinct functions. But learning how to apply a single function repeatedly means the number of applications (processing steps) can now be variable, and this is the crucial difference. Beyond allowing the Universal Transformer to apply more computation to more ambiguous symbols, as explained above, it further allows the model to scale the number of function applications with the overall size of the input (more steps for longer sequences), or to decide dynamically how often to apply the function to any given part of the input based on other characteristics learned during training. This makes the Universal Transformer more powerful in a theoretical sense, as it can effectively learn to apply different transformations to different parts of the input. This is something that the standard Transformer cannot do, as it consists of fixed stacks of learned Transformation blocks applied only once.

But while increased theoretical power is desirable, we also care about empirical performance. Our experiments confirm that Universal Transformers are indeed able to learn from examples how to copy and reverse strings and how to perform integer addition much better than a Transformer or an RNN (although not quite as well as Neural GPUs). Furthermore, on a diverse set of challenging language understanding tasks the Universal Transformer generalizes significantly better and achieves a new state of the art on the bAbI linguistic reasoning task and the challenging LAMBADA language modeling task. But perhaps of most interest is that the Universal Transformer also improves translation quality by 0.9 BLEU1 over a base Transformer with the same number of parameters, trained in the same way on the same training data. Putting things in perspective, this almost adds another 50% relative improvement on top of the previous 2.0 BLEU improvement that the original Transformer showed over earlier models when it was released last year.

The Universal Transformer thus closes the gap between practical sequence models competitive on large-scale language understanding tasks such as machine translation, and computationally universal models such as the Neural Turing Machine or the Neural GPU, which can be trained using gradient descent to perform arbitrary algorithmic tasks. We are enthusiastic about recent developments on parallel-in-time sequence models, and in addition to adding computational capacity and recurrence in processing depth, we hope that further improvements to the basic Universal Transformer presented here will help us build learning algorithms that are both more powerful, more data efficient, and that generalize beyond the current state-of-the-art.

If you’d like to try this for yourself, the code used to train and evaluate Universal Transformers can be found here in the open-source Tensor2Tensor repository.

Acknowledgements
This research was conducted by Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Additional thanks go to Ashish Vaswani, Douglas Eck, and David Dohan for their fruitful comments and inspiration.



1 A translation quality benchmark widely used in the machine translation community, computed on the standard WMT newstest2014 English to German translation test data set.

Source: Google AI Blog


Transformer: A Novel Neural Network Architecture for Language Understanding



Neural networks, in particular recurrent neural networks (RNNs), are now at the core of the leading approaches to language understanding tasks such as language modeling, machine translation and question answering. In Attention Is All You Need we introduce the Transformer, a novel neural network architecture based on a self-attention mechanism that we believe to be particularly well-suited for language understanding.

In our paper, we show that the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks. On top of higher translation quality, the Transformer requires less computation to train and is a much better fit for modern machine learning hardware, speeding up training by up to an order of magnitude.
BLEU scores (higher is better) of single models on the standard WMT newstest2014 English to German translation benchmark.
BLEU scores (higher is better) of single models on the standard WMT newstest2014 English to French translation benchmark.
Accuracy and Efficiency in Language Understanding
Neural networks usually process language by generating fixed- or variable-length vector-space representations. After starting with representations of individual words or even pieces of words, they aggregate information from surrounding words to determine the meaning of a given bit of language in context. For example, deciding on the most likely meaning and appropriate representation of the word “bank” in the sentence “I arrived at the bank after crossing the…” requires knowing if the sentence ends in “... road.” or “... river.”

RNNs have in recent years become the typical network architecture for translation, processing language sequentially in a left-to-right or right-to-left fashion. Reading one word at a time, this forces RNNs to perform multiple steps to make decisions that depend on words far away from each other. Processing the example above, an RNN could only determine that “bank” is likely to refer to the bank of a river after reading each word between “bank” and “river” step by step. Prior research has shown that, roughly speaking, the more such steps decisions require, the harder it is for a recurrent network to learn how to make those decisions.

The sequential nature of RNNs also makes it more difficult to fully take advantage of modern fast computing devices such as TPUs and GPUs, which excel at parallel and not sequential processing. Convolutional neural networks (CNNs) are much less sequential than RNNs, but in CNN architectures like ByteNet or ConvS2S the number of steps required to combine information from distant parts of the input still grows with increasing distance.

The Transformer
In contrast, the Transformer only performs a small, constant number of steps (chosen empirically). In each step, it applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position. In the earlier example “I arrived at the bank after crossing the river”, to determine that the word “bank” refers to the shore of a river and not a financial institution, the Transformer can learn to immediately attend to the word “river” and make this decision in a single step. In fact, in our English-French translation model we observe exactly this behavior.

More specifically, to compute the next representation for a given word - “bank” for example - the Transformer compares it to every other word in the sentence. The result of these comparisons is an attention score for every other word in the sentence. These attention scores determine how much each of the other words should contribute to the next representation of “bank”. In the example, the disambiguating “river” could receive a high attention score when computing a new representation for “bank”. The attention scores are then used as weights for a weighted average of all words’ representations which is fed into a fully-connected network to generate a new representation for “bank”, reflecting that the sentence is talking about a river bank.

The animation below illustrates how we apply the Transformer to machine translation. Neural networks for machine translation typically contain an encoder reading the input sentence and generating a representation of it. A decoder then generates the output sentence word by word while consulting the representation generated by the encoder. The Transformer starts by generating initial representations, or embeddings, for each word. These are represented by the unfilled circles. Then, using self-attention, it aggregates information from all of the other words, generating a new representation per word informed by the entire context, represented by the filled balls. This step is then repeated multiple times in parallel for all words, successively generating new representations.
The decoder operates similarly, but generates one word at a time, from left to right. It attends not only to the other previously generated words, but also to the final representations generated by the encoder.

Flow of Information
Beyond computational performance and higher accuracy, another intriguing aspect of the Transformer is that we can visualize what other parts of a sentence the network attends to when processing or translating a given word, thus gaining insights into how information travels through the network.

To illustrate this, we chose an example involving a phenomenon that is notoriously challenging for machine translation systems: coreference resolution. Consider the following sentences and their French translations:
It is obvious to most that in the first sentence pair “it” refers to the animal, and in the second to the street. When translating these sentences to French or German, the translation for “it” depends on the gender of the noun it refers to - and in French “animal” and “street” have different genders. In contrast to the current Google Translate model, the Transformer translates both of these sentences to French correctly. Visualizing what words the encoder attended to when computing the final representation for the word “it” sheds some light on how the network made the decision. In one of its steps, the Transformer clearly identified the two nouns “it” could refer to and the respective amount of attention reflects its choice in the different contexts.
The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a Transformer trained on English to French translation (one of eight attention heads).
Given this insight, it might not be that surprising that the Transformer also performs very well on the classic language analysis task of syntactic constituency parsing, a task the natural language processing community has attacked with highly specialized systems for decades.
In fact, with little adaptation, the same network we used for English to German translation outperformed all but one of the previously proposed approaches to constituency parsing.

Next Steps
We are very excited about the future potential of the Transformer and have already started applying it to other problems involving not only natural language but also very different inputs and outputs, such as images and video. Our ongoing experiments are accelerated immensely by the Tensor2Tensor library, which we recently open sourced. In fact, after downloading the library you can train your own Transformer networks for translation and parsing by invoking just a few commands. We hope you’ll give it a try, and look forward to seeing what the community can do with the Transformer.

Acknowledgements
This research was conducted by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez and Łukasz Kaiser. Additional thanks go to David Chenell for creating the animation above.

Building Your Own Neural Machine Translation System in TensorFlow



Machine translation – the task of automatically translating between languages – is one of the most active research areas in the machine learning community. Among the many approaches to machine translation, sequence-to-sequence ("seq2seq") models [1, 2] have recently enjoyed great success and have become the de facto standard in most commercial translation systems, such as Google Translate, thanks to its ability to use deep neural networks to capture sentence meanings. However, while there is an abundance of material on seq2seq models such as OpenNMT or tf-seq2seq, there is a lack of material that teaches people both the knowledge and the skills to easily build high-quality translation systems.

Today we are happy to announce a new Neural Machine Translation (NMT) tutorial for TensorFlow that gives readers a full understanding of seq2seq models and shows how to build a competitive translation model from scratch. The tutorial is aimed at making the process as simple as possible, starting with some background knowledge on NMT and walking through code details to build a vanilla system. It then dives into the attention mechanism [3, 4], a key ingredient that allows NMT systems to handle long sentences. Finally, the tutorial provides details on how to replicate key features in the Google’s NMT (GNMT) system [5] to train on multiple GPUs.

The tutorial also contains detailed benchmark results, which users can replicate on their own. Our models provide a strong open-source baseline with performance on par with GNMT results [5]. We achieve 24.4 BLEU points on the popular WMT’14 English-German translation task.
Other benchmark results (English-Vietnamese, German-English) can be found in the tutorial.

In addition, this tutorial showcases the fully dynamic seq2seq API (released with TensorFlow 1.2) aimed at making building seq2seq models clean and easy:
  • Easily read and preprocess dynamically sized input sequences using the new input pipeline in tf.contrib.data.
  • Use padded batching and sequence length bucketing to improve training and inference speeds.
  • Train seq2seq models using popular architectures and training schedules, including several types of attention and scheduled sampling.
  • Perform inference in seq2seq models using in-graph beam search.
  • Optimize seq2seq models for multi-GPU settings.
We hope this will help spur the creation of, and experimentation with, many new NMT models by the research community. To get started on your own research, check out the tutorial on GitHub!

Core contributors
Thang Luong, Eugene Brevdo, and Rui Zhao.

Acknowledgements
We would like to especially thank our collaborator on the NMT project, Rui Zhao. Without his tireless effort, this tutorial would not have been possible. Additional thanks go to Denny Britz, Anna Goldie, Derek Murray, and Cinjon Resnick for their work bringing new features to TensorFlow and the seq2seq library. Lastly, we thank Lukasz Kaiser for the initial help on the seq2seq codebase; Quoc Le for the suggestion to replicate GNMT; Yonghui Wu and Zhifeng Chen for details on the GNMT systems; as well as the Google Brain team for their support and feedback!

References
[1] Sequence to sequence learning with neural networks, Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. NIPS, 2014.
[2] Learning phrase representations using RNN encoder-decoder for statistical machine translation, Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. EMNLP 2014.
[3] Neural machine translation by jointly learning to align and translate, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. ICLR, 2015.
[4] Effective approaches to attention-based neural machine translation, Minh-Thang Luong, Hieu Pham, and Christopher D Manning. EMNLP, 2015.
[5] Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean. Technical Report, 2016.

Introducing tf-seq2seq: An Open Source Sequence-to-Sequence Framework in TensorFlow



(Crossposted on the Google Open Source Blog)

Last year, we announced Google Neural Machine Translation (GNMT), a sequence-to-sequence (“seq2seq”) model which is now used in Google Translate production systems. While GNMT achieved huge improvements in translation quality, its impact was limited by the fact that the framework for training these models was unavailable to external researchers.

Today, we are excited to introduce tf-seq2seq, an open source seq2seq framework in TensorFlow that makes it easy to experiment with seq2seq models and achieve state-of-the-art results. To that end, we made the tf-seq2seq codebase clean and modular, maintaining full test coverage and documenting all of its functionality.

Our framework supports various configurations of the standard seq2seq model, such as depth of the encoder/decoder, attention mechanism, RNN cell type, or beam size. This versatility allowed us to discover optimal hyperparameters and outperform other frameworks, as described in our paper, “Massive Exploration of Neural Machine Translation Architectures.”
A seq2seq model translating from Mandarin to English. At each time step, the encoder takes in one Chinese character and its own previous state (black arrow), and produces an output vector (blue arrow). The decoder then generates an English translation word-by-word, at each time step taking in the last word, the previous state, and a weighted combination of all the outputs of the encoder (aka attention [3], depicted in blue) and then producing the next English word. Please note that in our implementation we use wordpieces [4] to handle rare words.
In addition to machine translation, tf-seq2seq can also be applied to any other sequence-to-sequence task (i.e. learning to produce an output sequence given an input sequence), including machine summarization, image captioning, speech recognition, and conversational modeling. We carefully designed our framework to maintain this level of generality and provide tutorials, preprocessed data, and other utilities for machine translation.

We hope that you will use tf-seq2seq to accelerate (or kick off) your own deep learning research. We also welcome your contributions to our GitHub repository, where we have a variety of open issues that we would love to have your help with!

Acknowledgments:
We’d like to thank Eugene Brevdo, Melody Guan, Lukasz Kaiser, Quoc V. Le, Thang Luong, and Chris Olah for all their help. For a deeper dive into how seq2seq models work, please see the resources below.

References:
[1] Massive Exploration of Neural Machine Translation Architectures, Denny Britz, Anna Goldie, Minh-Thang Luong, Quoc Le
[2] Sequence to Sequence Learning with Neural Networks, Ilya Sutskever, Oriol Vinyals, Quoc V. Le. NIPS, 2014
[3] Neural Machine Translation by Jointly Learning to Align and Translate, Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. ICLR, 2015
[4] Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean. Technical Report, 2016
[5] Attention and Augmented Recurrent Neural Networks, Chris Olah, Shan Carter. Distill, 2016
[6] Neural Machine Translation and Sequence-to-sequence Models: A Tutorial, Graham Neubig
[7] Sequence-to-Sequence Models, TensorFlow.org

Introducing tf-seq2seq: An Open Source Sequence-to-Sequence Framework in TensorFlow

Crossposted on the Google Research Blog

Last year, we announced Google Neural Machine Translation (GNMT), a sequence-to-sequence (“seq2seq”) model which is now used in Google Translate production systems. While GNMT achieved huge improvements in translation quality, its impact was limited by the fact that the framework for training these models was unavailable to external researchers.

Today, we are excited to introduce tf-seq2seq, an open source seq2seq framework in TensorFlow that makes it easy to experiment with seq2seq models and achieve state-of-the-art results. To that end, we made the tf-seq2seq codebase clean and modular, maintaining full test coverage and documenting all of its functionality.

Our framework supports various configurations of the standard seq2seq model, such as depth of the encoder/decoder, attention mechanism, RNN cell type, or beam size. This versatility allowed us to discover optimal hyperparameters and outperform other frameworks, as described in our paper, “Massive Exploration of Neural Machine Translation Architectures.”

A seq2seq model translating from Mandarin to English. At each time step, the encoder takes in one Chinese character and its own previous state (black arrow), and produces an output vector (blue arrow). The decoder then generates an English translation word-by-word, at each time step taking in the last word, the previous state, and a weighted combination of all the outputs of the encoder (aka attention [3], depicted in blue) and then producing the next English word. Please note that in our implementation we use wordpieces [4] to handle rare words.
In addition to machine translation, tf-seq2seq can also be applied to any other sequence-to-sequence task (i.e. learning to produce an output sequence given an input sequence), including machine summarization, image captioning, speech recognition, and conversational modeling. We carefully designed our framework to maintain this level of generality and provide tutorials, preprocessed data, and other utilities for machine translation.

We hope that you will use tf-seq2seq to accelerate (or kick off) your own deep learning research. We also welcome your contributions to our GitHub repository, where we have a variety of open issues that we would love to have your help with!

Acknowledgments:
We’d like to thank Eugene Brevdo, Melody Guan, Lukasz Kaiser, Quoc V. Le, Thang Luong, and Chris Olah for all their help. For a deeper dive into how seq2seq models work, please see the resources below.

References:
[1] Massive Exploration of Neural Machine Translation Architectures, Denny Britz, Anna Goldie, Minh-Thang Luong, Quoc Le
[2] Sequence to Sequence Learning with Neural Networks, Ilya Sutskever, Oriol Vinyals, Quoc V. Le. NIPS, 2014
[3] Neural Machine Translation by Jointly Learning to Align and Translate, Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. ICLR, 2015
[4] Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean. Technical Report, 2016
[5] Attention and Augmented Recurrent Neural Networks, Chris Olah, Shan Carter. Distill, 2016
[6] Neural Machine Translation and Sequence-to-sequence Models: A Tutorial, Graham Neubig
[7] Sequence-to-Sequence Models, TensorFlow.org

By Anna Goldie and Denny Britz, Research Software Engineer and Google Brain Resident, Google Brain Team

Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System



In the last 10 years, Google Translate has grown from supporting just a few languages to 103, translating over 140 billion words every day. To make this possible, we needed to build and maintain many different systems in order to translate between any two languages, incurring significant computational cost. With neural networks reforming many fields, we were convinced we could raise the translation quality further, but doing so would mean rethinking the technology behind Google Translate.

In September, we announced that Google Translate is switching to a new system called Google Neural Machine Translation (GNMT), an end-to-end learning framework that learns from millions of examples, and provided significant improvements in translation quality. However, while switching to GNMT improved the quality for the languages we tested it on, scaling up to all the 103 supported languages presented a significant challenge.

In “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation”, we address this challenge by extending our previous GNMT system, allowing for a single system to translate between multiple languages. Our proposed architecture requires no change in the base GNMT system, but instead uses an additional “token” at the beginning of the input sentence to specify the required target language to translate to. In addition to improving translation quality, our method also enables “Zero-Shot Translation” — translation between language pairs never seen explicitly by the system.
Here’s how it works. Let’s say we train a multilingual system with Japanese⇄English and Korean⇄English examples, shown by the solid blue lines in the animation. Our multilingual system, with the same size as a single GNMT system, shares its parameters to translate between these four different language pairs. This sharing enables the system to transfer the “translation knowledge” from one language pair to the others. This transfer learning and the need to translate between multiple languages forces the system to better use its modeling power.

This inspired us to ask the following question: Can we translate between a language pair which the system has never seen before? An example of this would be translations between Korean and Japanese where Korean⇄Japanese examples were not shown to the system. Impressively, the answer is yes — it can generate reasonable Korean⇄Japanese translations, even though it has never been taught to do so. We call this “zero-shot” translation, shown by the yellow dotted lines in the animation. To the best of our knowledge, this is the first time this type of transfer learning has worked in Machine Translation.

The success of the zero-shot translation raises another important question: Is the system learning a common representation in which sentences with the same meaning are represented in similar ways regardless of language — i.e. an “interlingua”? Using a 3-dimensional representation of internal network data, we were able to take a peek into the system as it translates a set of sentences between all possible pairs of the Japanese, Korean, and English languages.

Part (a) from the figure above shows an overall geometry of these translations. The points in this view are colored by the meaning; a sentence translated from English to Korean with the same meaning as a sentence translated from Japanese to English share the same color. From this view we can see distinct groupings of points, each with their own color. Part (b) zooms in to one of the groups, and part (c) colors by the source language. Within a single group, we see a sentence with the same meaning but from three different languages. This means the network must be encoding something about the semantics of the sentence rather than simply memorizing phrase-to-phrase translations. We interpret this as a sign of existence of an interlingua in the network.

We show many more results and analyses in our paper, and hope that its findings are not only interesting for machine learning or machine translation researchers but also to linguists and others who are interested in how multiple languages can be processed by machines using a single system.

Finally, the described Multilingual Google Neural Machine Translation system is running in production today for all Google Translate users. Multilingual systems are currently used to serve 10 of the recently launched 16 language pairs, resulting in improved quality and a simplified production architecture.