
Introducing the Schema-Guided Dialogue Dataset for Conversational Assistants



Today's virtual assistants help users to accomplish a wide variety of tasks, including finding flights, searching for nearby events and movies, making reservations, sourcing information from the web and more. They provide this functionality by offering a unified natural language interface to a wide variety of services across the web. Large-scale virtual assistants, like Google Assistant, need to integrate with a large and constantly increasing number of services, each with potentially overlapping functionality, over a wide variety of domains. Easily supporting new services, without collecting additional data or retraining the model, and reducing the maintenance workload are necessary to accommodate future growth. Despite tremendous progress, however, these challenges have often been overlooked in state-of-the-art models. This is due, in part, to the absence of suitable datasets that match the scale and complexity confronted by such virtual assistants.

In our recent paper, “Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset”, we introduce a new dataset to address these problems. The Schema-Guided Dialogue dataset (SGD) is the largest publicly available corpus of task-oriented dialogues, with over 18,000 dialogues spanning 17 domains. Equipped with various annotations, this dataset is designed to serve as an effective testbed for intent prediction, slot filling, state tracking (i.e., estimating the user's goal) and language generation, among other tasks for large-scale virtual assistants. We also propose a schema-guided approach for building virtual assistants as a solution to the aforementioned challenges. Our approach utilizes a single model across all services and domains, with no domain-specific parameters. Based on the schema-guided approach and building on the power of pre-trained language models like BERT, we open source a model for dialogue state tracking, which is applicable in a zero-shot setting (i.e., with no training data for new services and APIs) while remaining competitive in the regular setting.

The Dataset
The primary goal of releasing the SGD dataset is to confront many real-world challenges that are not sufficiently captured by existing datasets. The SGD dataset consists of over 18k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 17 domains, ranging from banks and events to media, calendar, travel, and weather. For most of these domains, the SGD dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios. SGD is the first dataset to cover such a wide variety of domains and provide multiple APIs per domain. Furthermore, to quantify the robustness of models to changes in API interfaces or to the addition of new APIs, the evaluation set contains many new services that are not present in the training set.
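
To make the structure of the data concrete, the sketch below models one service schema entry in Swift. The released dataset is JSON; the field names follow the open-sourced format, but treat the exact shapes as illustrative rather than authoritative.

import Foundation

// Illustrative Swift mirror of one service schema entry from the released
// JSON. Field names follow the open-sourced format; the exact shapes here
// are a sketch rather than a specification.
struct Slot: Codable {
    let name: String
    let description: String        // natural language description read by the model
    let isCategorical: Bool
    let possibleValues: [String]   // empty for non-categorical slots

    enum CodingKeys: String, CodingKey {
        case name, description
        case isCategorical = "is_categorical"
        case possibleValues = "possible_values"
    }
}

struct Intent: Codable {
    let name: String
    let description: String
    let requiredSlots: [String]
    let optionalSlots: [String: String]  // slot name -> default value

    enum CodingKeys: String, CodingKey {
        case name, description
        case requiredSlots = "required_slots"
        case optionalSlots = "optional_slots"
    }
}

struct ServiceSchema: Codable {
    let serviceName: String
    let description: String
    let slots: [Slot]
    let intents: [Intent]

    enum CodingKeys: String, CodingKey {
        case serviceName = "service_name"
        case description, slots, intents
    }
}

// Decoding a schema file:
// let schemas = try JSONDecoder().decode([ServiceSchema].self,
//                                        from: Data(contentsOf: schemaURL))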

For the creation of the SGD dataset, we prioritized the variety and accuracy of annotations in the included dialogues. Initially, dialogues were collected through interactions between two people in a Wizard-of-Oz-style setup, followed by crowdsourced annotation. These initial efforts revealed how difficult it is to obtain consistent annotations with this method, so we developed a new data collection process that minimized the need for complex manual annotation and considerably reduced the time and cost of data collection.

For this alternate approach, we developed a multi-domain dialogue simulator that generates dialogue skeletons over an arbitrary combination of APIs, along with the corresponding annotations, such as dialogue state and system actions. The simulator consists of two agents playing the roles of the user and the assistant. The two agents interact with each other using a finite set of actions denoting dialogue semantics, with transitions specified through a probabilistic automaton designed to capture a wide variety of dialogue trajectories. The actions generated by the simulator are converted into natural language utterances using a set of templates. Crowdsourcing is used only for paraphrasing these templatized utterances in order to make the dialogue more natural and coherent. This setup eliminates the need for complicated domain-specific instructions, keeps the crowdsourcing task simple, and yields natural dialogues with consistent, high-quality annotations.
Steps for obtaining dialogues, with assistant turns marked in red and user turns in blue. Left: The simulator generates a dialogue skeleton using a finite set of actions. Center: Actions are converted into utterances using templates (~50 per service) and slot values are replaced with natural variations. Right: Paraphrasing via crowdsourcing to make the flow cohesive.
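
To make the skeleton-generation step concrete, here is a minimal sketch of a probabilistic automaton over dialogue acts with template realization. Everything here is a hypothetical stand-in: the act set, transition weights and templates of the real simulator are far richer.

// Hypothetical dialogue acts; the real simulator covers far richer act
// sets, constraints, and per-service templates.
enum DialogueAct: String {
    case informIntent, requestSlot, informSlot, offer, confirm, goodbye
}

// Transition distribution of the probabilistic automaton:
// current act -> weighted choices for the next act.
let transitions: [DialogueAct: [(DialogueAct, Double)]] = [
    .informIntent: [(.requestSlot, 0.8), (.offer, 0.2)],
    .requestSlot:  [(.informSlot, 1.0)],
    .informSlot:   [(.requestSlot, 0.5), (.offer, 0.5)],
    .offer:        [(.confirm, 0.7), (.requestSlot, 0.3)],
    .confirm:      [(.goodbye, 1.0)],
]

func sampleNext(after act: DialogueAct) -> DialogueAct? {
    guard let options = transitions[act] else { return nil }  // terminal act
    var r = Double.random(in: 0..<1)
    for (next, weight) in options {
        r -= weight
        if r < 0 { return next }
    }
    return options.last?.0
}

// Templates turn acts into utterances (~50 per service in the real setup);
// crowdworkers then paraphrase the result.
let templates: [DialogueAct: String] = [
    .informIntent: "I need to book a table.",
    .requestSlot:  "Which city should I search in?",
    .informSlot:   "San Jose, please.",
    .offer:        "How about Olive Garden in San Jose?",
    .confirm:      "Yes, book it.",
    .goodbye:      "Done! Anything else?",
]

var act: DialogueAct? = .informIntent
while let current = act {
    print("\(current.rawValue): \(templates[current] ?? "")")
    act = sampleNext(after: current)
}
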
The Schema-Guided Approach
With the availability of the SGD dataset, it is now possible to train virtual assistants to support the diversity of services available on the web. A common approach to doing this requires a large master schema that lists all supported functions and their parameters. However, it is difficult to develop a master schema catering to all possible use cases. Even if that problem is solved, a master schema would complicate the integration of new or small-scale services and would increase the maintenance workload of the assistant. Furthermore, while there are many similar concepts across services that can be jointly modeled (for example, the similar logic for querying or specifying the number of movie tickets, flight tickets or concert tickets), the master schema approach does not facilitate joint modeling of such concepts unless an explicit mapping between them is manually defined.

The new schema-guided approach we propose addresses these limitations. This approach does not require the definition of a master schema for the assistant. Instead, each service or API provides a natural language description of the functions listed in its schema along with their associated attributes. These descriptions are then used to learn a distributed semantic representation of the schema, which is given as an additional input to the dialogue system. The dialogue system is then implemented as a single unified model, containing no domain or service specific parameters. This unified model facilitates representation of common knowledge between similar concepts in different services, while the use of distributed representations of the schema makes it possible to operate over new services that are not present in the training data. We have implemented this approach in our open-sourced dialogue state tracking model.
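
The sketch below conveys the intuition behind the zero-shot setting; it is not the released model. Slot descriptions from a schema and the user utterance are mapped into a shared space and compared, so the slots of an unseen service can still be scored. The embed function is a toy stand-in for a pretrained encoder such as BERT.

func embed(_ text: String) -> [Double] {
    // Toy stand-in for a learned encoder: hash words into a small dense
    // vector so the example is self-contained and runnable.
    var v = [Double](repeating: 0, count: 16)
    for word in text.lowercased().split(separator: " ") {
        v[(word.hashValue % 16 + 16) % 16] += 1
    }
    return v
}

func cosine(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map { $0.0 * $0.1 }.reduce(0, +)
    let na = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let nb = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return (na == 0 || nb == 0) ? 0 : dot / (na * nb)
}

// Slot descriptions taken from the schema of a service never seen in
// training; the same model can still score them against the utterance.
let slotDescriptions = [
    "number_of_seats": "Number of seats to reserve at the restaurant",
    "city": "City in which to search for the restaurant",
]

let utteranceVector = embed("I need a table for four in San Jose")
for (slot, description) in slotDescriptions {
    print(slot, cosine(utteranceVector, embed(description)))
}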

Eighth Dialogue System Technology Challenge
The Dialog System Technology Challenges (DSTCs) are a series of research competitions to accelerate the development of new dialogue technologies. This year, Google organized one of the tracks, "Schema-Guided Dialogue State Tracking", as part of the recently concluded 8th DSTC. We received submissions from a total of 25 teams from industry and academia; the submissions will be presented at the DSTC8 workshop at AAAI-20.

We believe that this dataset will act as a good benchmark for building large-scale dialogue models. We are excited and looking forward to all the innovative ways in which the research community will use it for the advancement of dialogue technologies.

Acknowledgements
This post reflects the work of our co-authors Xiaoxue Zang, Srinivas Sunkara and Raghav Gupta. We also thank Amir Fayazi and Maria Wang for help with data collection and Guan-Lin Chao for insights on model design and implementation.

Source: Google AI Blog


Assessing the Quality of Long-Form Synthesized Speech



Automatically generated speech is everywhere, from directions being read out aloud while you are driving, to virtual assistants on your phone or smart speaker devices at home. While much research is being done to try to make synthesized speech sound as natural as possible—such as generating speech for low-resource languages and creating human-like speech with Tacotron 2—how does one evaluate the generated speech? The best way to find out is to ask people, who are very good at telling if something sounds natural or not.

In the field of speech synthesis, subjects are routinely asked to listen to samples of synthesized speech and rate their quality. Yet, until now, evaluation of synthesized speech has been done on a sentence-by-sentence basis. But often one wants to know the quality of a series of sentences that belong together, such as a paragraph in a news article or a turn in a conversation. This is where it gets interesting, as there is more than one way of evaluating sentences that naturally occur in a sequence, and, surprisingly, a rigorous comparison of these different methods has not been carried out. This in turn can hinder research progress in developing products that rely on generated speech.

To address this challenge, we present “Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs”, a publication to appear at SSW10 in which we compare several ways of evaluating synthesized speech for multi-line texts. We find that when a sentence is evaluated as part of a longer text involving several sentences, the outcome is influenced by the way in which the audio sample is presented to the people evaluating it. For example, when the sentence is presented by itself, without any context, the rating people give on average is substantially different from the rating they give when they listen to the same sentence with some context (while the context doesn't have to be rated).

Evaluating Automatically Generated Speech
To determine the quality of speech signals, it is common practice to ask several human raters to give their opinion on a particular sample, on a 1-to-5 scale. This sample can be automatically generated, but it can also be natural speech (i.e., an actual person saying a sentence out loud), which serves as a control. The scores that all raters assign to a particular speech sample are averaged to get a Mean Opinion Score (MOS).
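
As a concrete illustration, computing a MOS is just averaging; a minimal sketch, with made-up ratings:

// Mean Opinion Score: average the 1-to-5 ratings that individual raters
// assigned to one speech sample. The ratings below are invented.
func meanOpinionScore(_ ratings: [Int]) -> Double {
    precondition(!ratings.isEmpty && ratings.allSatisfy { (1...5).contains($0) },
                 "ratings must be on the 1-to-5 scale")
    return Double(ratings.reduce(0, +)) / Double(ratings.count)
}

let ratings = [4, 5, 3, 4, 4, 5, 3]
print(meanOpinionScore(ratings))  // 4.0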

Until now, MOS ratings were typically collected per sentence, i.e., raters listened to sentences in isolation to form their opinion. Instead of this typical approach, we consider three different ways of presenting speech samples to raters—both with and without context—and we show that each approach yields different results. The first, presenting the sentence in isolation, is the default method commonly used in the field. An alternative method is to provide the full context for the sentence. In this case, the entire paragraph to which the sentence belongs is included and the ensemble is rated. The final approach is to provide a context-stimulus pair. Here, rather than providing full context, only some context is provided, such as the preceding sentence(s) from the original paragraph.
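
These three designs can be summarized as stimulus-assembly logic; the sketch below is our own summary, with a paragraph modeled as an array of sentences.

// The three rating designs described above; the condition names are ours.
enum PresentationCondition {
    case isolation                                // sentence alone (the common default)
    case fullContext                              // the entire paragraph is played and rated
    case contextStimulus(precedingSentences: Int) // some context; only the target is rated
}

func stimulus(for index: Int, in paragraph: [String],
              condition: PresentationCondition) -> [String] {
    switch condition {
    case .isolation:
        return [paragraph[index]]
    case .fullContext:
        return paragraph
    case .contextStimulus(let n):
        return Array(paragraph[max(0, index - n)...index])
    }
}

let paragraph = ["First sentence.", "Second sentence.", "Third sentence."]
print(stimulus(for: 2, in: paragraph,
               condition: .contextStimulus(precedingSentences: 1)))
// ["Second sentence.", "Third sentence."]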

Interestingly, these three different approaches for presenting speech give different results even when applied to natural speech. This is demonstrated in the figure below, where the MOS scores are presented for natural speech samples rated using the three different methods of presentation. Even though the sentences being rated are identical across the three different settings, the scores are different on average, depending on the context in which they were presented.
MOS results for natural speech from a dataset consisting of news articles. Though the differences appear small, they are significant between all conditions (two-tailed t-test with α=0.05).
Examination of the figure above reveals that raters rarely give top scores (a five) even to recorded human speech, which may be surprising. However, this is a typical result seen in sentence evaluation studies, and probably reflects a more general pattern of behavior: people tend to avoid the extreme ends of a scale, regardless of the task or setting.

When synthesized speech is evaluated, the differences are more pronounced.
MOS results for synthesized speech on the same news article dataset used above. All lines are synthesized speech, unless indicated otherwise.
To see if the way context is presented makes a difference, we tried several different ways of providing it: one or two sentences leading up to the sentence to be evaluated, provided as generated speech or real speech. When context is added, the scores get higher (the four blue bars on the left) except when the context presented is real speech, in which case the score drops (the rightmost blue bar). Our hypothesis is that this has to do with an anchoring effect—if the context is very good (real speech) the synthesized speech, in comparison, is perceived as less natural.

Predicting Paragraph Score
When an entire paragraph of synthesized speech is played (the yellow bar), this is perceived as even less natural than in the other settings. Our original hypothesis was a weakest-link argument—the rating is probably as bad as the worst sentence in the paragraph. If that were the case, it should be easy to predict the rating of a paragraph by considering the ratings of the individual sentences in it, perhaps simply taking the minimum value to get the paragraph rating. It turns out, however, that does not work.
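
Stated as code, the weakest-link hypothesis amounts to the predictor sketched below (the scores are invented for illustration); as noted, predictions of this kind did not match the observed paragraph ratings.

// The weakest-link hypothesis as a predictor: take the paragraph rating
// to be the minimum of its sentence-level MOS values.
func weakestLinkPrediction(_ sentenceScores: [Double]) -> Double? {
    sentenceScores.min()
}

let sentenceScores = [4.2, 3.9, 4.4]  // invented sentence-level MOS values
if let predicted = weakestLinkPrediction(sentenceScores) {
    print("Predicted paragraph MOS:", predicted)  // 3.9
}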

The failure of the weakest-link hypothesis may be due to more subtle factors that are difficult to tease out with such a simple approach. To test this, we also trained a machine learning algorithm to predict the paragraph score from the individual sentences. However, this approach, too, was unable to predict paragraph scores reliably.

Conclusion
Evaluating synthesized speech is not straightforward when multiple sentences are involved. The traditional paradigm of rating sentences in isolation does not give the full picture, and one should be aware of anchoring effects when context is provided. Rating full paragraphs might be the most conservative approach. We hope our findings help advance future work in speech synthesis where long-form content is concerned, such as audio book readers and conversational agents.

Acknowledgments
Many thanks to all authors of the paper: Rob Clark, Hanna Silen, Ralph Leith.

Source: Google AI Blog


ML Kit expands into NLP with Language Identification and Smart Reply

Posted by Christiaan Prins and Max Gubin

Today we are announcing the release of two new features to ML Kit: Language Identification and Smart Reply.

You might notice that both of these features are different from our existing APIs that were all focused on image/video processing. Our goal with ML Kit is to offer powerful but simple-to-use APIs to leverage the power of ML, independent of the domain. As such, we are excited to expand ML Kit with solutions for Natural Language Processing (NLP)!

NLP is a category of ML that deals with analyzing and generating text, speech, and other kinds of natural language data. We're excited to start out with two APIs: one that helps you identify the language of text, and one that generates reply suggestions in chat applications. Both of these features work fully on-device and are available on the latest version of the ML Kit SDK, on iOS (9.0 and higher) and Android (4.1 and higher).

Generate reply suggestions based on previous messages

A new feature popping up in messaging apps is to provide the user with a selection of suggested responses, either as actions on a notification or inside the app itself. This can really help a user respond quickly when they are busy, or serve as a handy way to initiate a longer message.

With the new Smart Reply API you can now quickly achieve the same in your own apps. The API provides suggestions based on the last 10 messages in a conversation, although it still works if only one previous message is available. It is a stateless API that fully runs on-device, so we don't keep message history in memory nor send it to a server.

textPlus app providing response suggestions using Smart Reply

We have worked closely with partners like textPlus to ensure Smart Reply is ready for prime time and they have now implemented in-app response suggestions with the latest version of their app (screenshot above).

Adding Smart Reply to your own app is done with a simple function call (using Swift in this example):

let smartReply = NaturalLanguage.naturalLanguage().smartReply()
smartReply.suggestReplies(for: conversation) { result, error in
    guard error == nil, let result = result else {
        return
    }
    if result.status == .success {
        for suggestion in result.suggestions {
            print("Suggested reply: \(suggestion.text)")
        }
    }
}

After you initialize a Smart Reply instance, call suggestReplies with a list of recent messages. The callback provides the result, which contains a list of suggestions.
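
As an illustration, the conversation might be assembled as follows; the TextMessage initializer shown reflects the ML Kit SDK at the time of writing, but check the documentation for the exact fields in your SDK version.

// Hypothetical conversation input for suggestReplies. Each message notes
// who sent it and when, so suggestions fit the local user's side.
let conversation = [
    TextMessage(text: "Hey, are we still on for dinner tonight?",
                timestamp: Date().timeIntervalSince1970,
                userID: "friend123",
                isLocalUser: false)
]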

For details on how to use the Smart Reply API, check out the documentation.

Tell me more ...

Although as a developer you can just pick up this new API and easily integrate it in your app, it may be interesting to reveal a bit about how it works under the hood. At the core of Smart Reply is a machine-learned model that is executed using TensorFlow Lite and has a modern architecture based on SentencePiece text encoding[1] and Transformer[2].

However, as we realized when we started development of the API, the core suggestion model is not all that's needed to provide a solution that developers can use in their apps. For example, we added a model to detect sensitive topics, so that we avoid making suggestions in response to profanity or in cases of personal tragedy/hardship. Also, we included language identification, to ensure we do not provide suggestions for languages the core model is not trained on. The Smart Reply feature is launching with English support first.

Identify the language of a piece of text

The language of a given text string is a subtle but helpful piece of information. Many apps have functionality that depends on language: think of features like spell checking, text translation or Smart Reply. Rather than asking a user to specify the language they use, you can use our new Language Identification API.

ML Kit recognizes text in 103 different languages and typically only requires a few words to make an accurate determination. It is fast as well, typically providing a response within 1 to 2 ms across iOS and Android phones.

Similar to the Smart Reply API, you can identify the language with a function call (using Swift in this example):

let languageId = NaturalLanguage.naturalLanguage().languageIdentification()
languageId.identifyLanguage(for: "¿Cómo estás?") { languageCode, error in
  guard error == nil, let languageCode = languageCode else {
    print("Failed to identify language with error: \(error!)")
    return
  }

  print("Identified Language: \(languageCode)")
}

The identifyLanguage function takes a piece of text, and its callback provides a BCP-47 language code. If no language can be confidently recognized, ML Kit returns the code und for undetermined. The Language Identification API can also provide a list of possible languages and their confidence values.
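
To retrieve that ranked list, the SDK provides identifyPossibleLanguages, used in the same pattern as above; the snippet below is a sketch, so consult the documentation for the exact signature in your SDK version.

languageId.identifyPossibleLanguages(for: "¿Cómo estás?") { identifiedLanguages, error in
  guard error == nil, let identifiedLanguages = identifiedLanguages else {
    print("Failed to identify possible languages with error: \(error!)")
    return
  }
  // Each entry pairs a BCP-47 code with the model's confidence.
  for language in identifiedLanguages {
    print("Language: \(language.languageCode), confidence: \(language.confidence)")
  }
}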

For details on how to use the Language Identification API, check out the documentation.

Get started today

We're really excited to expand ML Kit to include Natural Language APIs. Give the two new NLP APIs a spin today and let us know what you think! You can always reach us in our Firebase Talk Google Group.

As ML Kit grows, we look forward to adding more APIs and categories that enable you to provide smarter experiences for your users. With that, please keep an eye out for some exciting ML Kit announcements at Google I/O.

Text-to-Speech for Low-Resource Languages (Episode 4): One Down, 299 to Go



This is the fourth episode in the series of posts reporting on the work we are doing to build text-to-speech (TTS) systems for low-resource languages. In the first episode, we described the crowdsourced acoustic data collection effort for Project Unison. In the second episode, we described how we built parametric voices based on that data. In the third episode, we described the compilation of a pronunciation lexicon for a TTS system. In this episode, we describe how to make a single TTS system speak many languages.

Developing TTS systems for any given language is a significant challenge, and requires large amounts of high-quality acoustic recordings and linguistic annotations. Because of this, these systems are only available for a tiny fraction of the world's languages. A natural question that arises in this situation is, instead of attempting to build a high-quality voice for a single language using monolingual data from multiple speakers, as we described in the previous three episodes, can we somehow combine the limited monolingual data from multiple speakers of multiple languages to build a single multilingual voice that can speak any language?

Building upon an initial investigation into creating a multilingual TTS system that can synthesize speech in multiple languages from a single model, we developed a new model that uses a uniform phonological representation for all languages — the International Phonetic Alphabet (IPA). The model trained using this representation can synthesize both the languages seen in the training data and languages not observed in training. This has two main benefits: First, pooling training data from related languages increases phonemic coverage, which results in improved synthesis quality of the languages observed in training. Second, because the model contains many languages pooled together, there is a better chance that an “unseen” language will have a “related” language present in the model that will guide and aid the synthesis.

Exploring the Closely Related Languages of Indonesia
We applied this multilingual approach first to languages of Indonesia, where Standard Indonesian is the official national language, and is spoken natively or as a second language by more than 200 million people. Javanese, with roughly 90 million native speakers, and Sundanese, with approximately 40 million native speakers, constitute the two largest regional languages of Indonesia. Unlike Indonesian, which has received a lot of attention from computational linguists and speech scientists over the years, both Javanese and Sundanese are currently low-resourced due to the lack of openly available high-quality corpora. We collaborated with universities in Indonesia to collect crowd-sourced Javanese and Sundanese recordings.

Since our corpus of Standard Indonesian was much larger and recorded in a professional studio, our hypothesis was that combining the three languages might result in significant improvements over systems constructed using a “classical” monolingual approach. To test this, we first analyzed the similarities and crucial differences between the phonologies of these three languages (shown below) and used this information to design a phonological representation that allows a maximum degree of sharing between the languages while preserving their crucial differences.
Joint phoneme inventory of Indonesian, Javanese, and Sundanese in International Phonetic Alphabet notation.
The resulting Javanese and Sundanese voices trained jointly with Standard Indonesian strongly outperformed our corresponding monolingual multispeaker voices that we used as a baseline. This allowed us to launch Javanese and Sundanese TTS in Google products, such as Google Translate and Android.

Expanding to the More Diverse Language Families of South Asia
Next, we focused on the languages of South Asia, which span two very different language families: Indo-Aryan and Dravidian. Unlike the languages of Indonesia described above, these languages are much more diverse. In particular, they have significantly smaller overlap in their phonologies. The table below shows a superset of the languages in our experiment, including the variety of orthographies used, as well as modern words related to the Sanskrit word for “culture”. These languages show considerable variation within each group, but also similarities across groups.
Descendants of Sanskrit word for “culture” across languages.
In this work, we leveraged the unified phonological representation mentioned above to make the most of the data we have and to alleviate data scarcity for certain phonemes. This was accomplished by conflating similar phonemes into a single representative phoneme in the multilingual phoneme inventory. Where possible, we use the same inventory for phonologically close languages. For example, we have an identical phoneme inventory for Telugu and Kannada, and another one for West Bengali and Odia. For other language pairs, like Gujarati and Marathi, we copied the inventory of one language to the other, but made a few changes to reflect the differences in their phonemic inventories. For all languages in these experiments we retained a common underlying representation, mapping similar phonemes across different inventories, so that we could still use the data from one language in training the others.
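
The sketch below illustrates the kind of conflation involved: per-language mappings into a shared, IPA-based inventory, with phonemes pooled wherever languages are phonologically close. The fragments are illustrative, not our production inventories.

// Per-language overrides into a shared IPA-based inventory; phonemes not
// listed simply map to themselves. Illustrative fragments only.
typealias IPA = String

let sharedInventoryOverrides: [String: [IPA: IPA]] = [
    // Pool orthographically distinct but non-contrastive long/short vowels
    // under one representation (as with Marathi, discussed below).
    "mr": ["iː": "i", "uː": "u"],
    // Telugu and Kannada share an identical inventory: no overrides.
    "te": [:],
    "kn": [:],
]

func sharedPhoneme(_ phoneme: IPA, language: String) -> IPA {
    sharedInventoryOverrides[language]?[phoneme] ?? phoneme
}

print(sharedPhoneme("iː", language: "mr"))  // prints "i" (pooled with the short vowel)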

In addition, we made sure our representation is driven by the phonology in use, rather than the orthography. For example, although there are distinct letters for long and short vowels in Marathi, they are not contrastive in a linguistic sense, so we used a single representation for them, increasing the robustness of our training data. Similarly, if two languages use one character that was historically related to the same Sanskrit letter to represent different sounds, or different letters for a similar sound, our mapping reflected the phonological closeness rather than the historical or orthographic representation. Describing all the features of the unified phoneme inventory is outside the scope of this post; the details can be found in our recent paper.
Diagram illustrating our multilingual text-to-speech approach. The input text queries are processed by language-specific linguistic front-ends to generate pronunciations in a shared phonemic representation serving as input to the language-agnostic acoustic model. The model then generates audio for the respective queries.
Our experiments focused on Indian Bengali, Gujarati, Kannada, Malayalam, Marathi, Tamil, Telugu and Urdu. For most of these languages, apart from Bengali and Marathi, the recording data and the transcriptions were crowd-sourced. For each of these languages we constructed a multilingual acoustic model that used all the data available. In addition, the acoustic model included the previously crowd-sourced Nepali and Sinhala data, as well as Hindi and Bangladeshi Bengali.

The results were encouraging: for most of the languages, the multilingual voices outperformed the voices constructed using the traditional monolingual approach. We performed a further experiment with the Odia language, for which we had no training data, by attempting to synthesize it using the South Asian multilingual model. Subjective listening tests revealed that native speakers of Odia judged the resulting audio to be acceptable and intelligible. The resulting voices for Marathi, Tamil, Telugu and Malayalam, built using our multilingual approach in collaboration with the Speech team, were announced at the recent “Google for India” event and are now powering Google Translate as well as other Google products.

Using crowd-sourcing in data collection was interesting from a research point of view and rewarding in terms of establishing fruitful collaborations with the native speaker communities. Our experiments with the Malayo-Polynesian, Indo-Aryan and Dravidian language families have shown that in most instances carefully sharing the data across multiple languages in a single multilingual acoustic model using deep learning techniques alleviates some of the severe data scarcity issues plaguing the low-resource languages and results in good-quality voices used in Google products.

This TTS research is a first step towards applying speech and language technology to more of the world’s many languages, and it is our hope that others will join us in this effort. To contribute to the research community, we have open sourced corpora for Nepali, Sinhala, Bengali, Khmer, Javanese and Sundanese as we return from the SLTU and Interspeech conferences, where we have been discussing this work with other researchers. We plan to continue releasing additional datasets for other languages in our projects in the future.

Source: Google AI Blog


Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source



At Google, we spend a lot of time thinking about how computer systems can read and understand human language in order to process it in intelligent ways. Today, we are excited to share the fruits of our research with the broader community by releasing SyntaxNet, an open-source neural network framework implemented in TensorFlow that provides a foundation for Natural Language Understanding (NLU) systems. Our release includes all the code needed to train new SyntaxNet models on your own data, as well as Parsey McParseface, an English parser that we have trained for you and that you can use to analyze English text.

Parsey McParseface is built on powerful machine learning algorithms that learn to analyze the linguistic structure of language, and that can explain the functional role of each word in a given sentence. Because Parsey McParseface is the most accurate such model in the world, we hope that it will be useful to developers and researchers interested in automatic extraction of information, translation, and other core applications of NLU.

How does SyntaxNet work?

SyntaxNet is a framework for what’s known in academic circles as a syntactic parser, which is a key first component in many NLU systems. Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word's syntactic function, and it determines the syntactic relationships between words in the sentence, represented in the dependency parse tree. These syntactic relationships are directly related to the underlying meaning of the sentence in question. To take a very simple example, consider the following dependency tree for Alice saw Bob:


This structure encodes that Alice and Bob are nouns and saw is a verb. The main verb saw is the root of the sentence and Alice is the subject (nsubj) of saw, while Bob is its direct object (dobj). As expected, Parsey McParseface analyzes this sentence correctly, but also understands the following more complex example:


This structure again encodes the fact that Alice and Bob are the subject and object respectively of saw, as well as the facts that Alice is modified by a relative clause with the verb reading, that saw is modified by the temporal modifier yesterday, and so on. The grammatical relationships encoded in dependency structures allow us to easily recover the answers to various questions, for example whom did Alice see?, who saw Bob?, what had Alice been reading about? or when did Alice see Bob?.
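
As an illustration of how such a structure can be represented and queried, here is a small sketch; the token layout is ours, but the nsubj and dobj labels match the tree above.

// A dependency parse as data: each word stores its head and arc label
// (the root has none). The tokens mirror the "Alice saw Bob" tree above.
struct Token {
    let index: Int
    let form: String
    let posTag: String      // part-of-speech tag
    let head: Int?          // index of the head token; nil for the root
    let relation: String?   // dependency label, e.g. "nsubj", "dobj"
}

let sentence = [
    Token(index: 0, form: "Alice", posTag: "NOUN", head: 1,   relation: "nsubj"),
    Token(index: 1, form: "saw",   posTag: "VERB", head: nil, relation: nil),
    Token(index: 2, form: "Bob",   posTag: "NOUN", head: 1,   relation: "dobj"),
]

// "Whom did Alice see?" reduces to following the dobj arc from the root.
if let root = sentence.first(where: { $0.head == nil }),
   let object = sentence.first(where: { $0.head == root.index && $0.relation == "dobj" }) {
    print("\(root.form) -> \(object.form)")  // saw -> Bob
}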

Why is Parsing So Hard For Computers to Get Right?

One of the main problems that makes parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate-length sentences - say 20 or 30 words in length - to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context. As a very simple example, the sentence Alice drove down the street in her car has at least two possible dependency parses:


The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity.

Humans do a remarkable job of dealing with ambiguity, almost to the point where the problem is unnoticeable; the challenge is for computers to do the same. Multiple ambiguities such as these in longer sentences conspire to give a combinatorial explosion in the number of possible structures for a sentence. Usually the vast majority of these structures are wildly implausible, but are nevertheless possible and must be somehow discarded by a parser.

SyntaxNet applies neural networks to the ambiguity problem. An input sentence is processed from left to right, with dependencies between words being incrementally added as each word in the sentence is considered. At each point in processing many decisions may be possible—due to ambiguity—and a neural network gives scores for competing decisions based on their plausibility. For this reason, it is very important to use beam search in the model. Instead of simply taking the first-best decision at each point, multiple partial hypotheses are kept at each step, with hypotheses only being discarded when there are several other higher-ranked hypotheses under consideration. An example of a left-to-right sequence of decisions that produces a simple parse is shown below for the sentence I booked a ticket to Google.
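
Below is a skeletal version of that beam search, with a hypothetical scorer standing in for the neural network; it sketches the search procedure only, not SyntaxNet's actual model or transition system.

struct Hypothesis {
    var decisions: [String]  // e.g. shift/left-arc/right-arc transitions
    var score: Double        // cumulative model score (sum of log-probs)
}

// Hypothetical scorer: the real system uses a neural network to score
// each candidate decision given the current parser state.
func expand(_ h: Hypothesis) -> [Hypothesis] {
    let candidates: [(String, Double)] = [("SHIFT", -0.1), ("LEFT-ARC", -1.2), ("RIGHT-ARC", -0.7)]
    return candidates.map {
        Hypothesis(decisions: h.decisions + [$0.0], score: h.score + $0.1)
    }
}

func beamSearch(steps: Int, beamWidth: Int) -> Hypothesis? {
    var beam = [Hypothesis(decisions: [], score: 0)]
    for _ in 0..<steps {
        // Expand every hypothesis, then keep only the top few; a partial
        // parse is discarded only when enough higher-scoring ones exist.
        beam = Array(beam.flatMap(expand)
                         .sorted { $0.score > $1.score }
                         .prefix(beamWidth))
    }
    return beam.first
}

if let best = beamSearch(steps: 6, beamWidth: 8) {
    print(best.decisions, best.score)
}
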
Furthermore, as described in our paper, it is critical to tightly integrate learning and search in order to achieve the highest prediction accuracy. Parsey McParseface and other SyntaxNet models are some of the most complex networks that we have trained with the TensorFlow framework at Google. Given some data from the Google-supported Universal Treebanks project, you can train a parsing model on your own machine.

So How Accurate is Parsey McParseface?

On a standard benchmark consisting of randomly drawn English newswire sentences (the 20-year-old Penn Treebank), Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already better than any previous approach. While there are no explicit studies in the literature about human performance, we know from our in-house annotation projects that linguists trained for this task agree in 96-97% of the cases. This suggests that we are approaching human performance—but only on well-formed text. Sentences drawn from the web are a lot harder to analyze, as we learned from the Google WebTreebank (released in 2011). Parsey McParseface achieves just over 90% parse accuracy on this dataset.
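
The per-dependency accuracy cited here is commonly computed as an attachment score; a minimal sketch on the toy Alice saw Bob parse:

// Attachment score: the fraction of words whose predicted head matches
// the gold head. Arrays hold each word's head index (nil marks the root).
func attachmentScore(predicted: [Int?], gold: [Int?]) -> Double {
    precondition(predicted.count == gold.count, "parses must align")
    let correct = zip(predicted, gold).filter { $0.0 == $0.1 }.count
    return Double(correct) / Double(gold.count)
}

// "Alice saw Bob": both words attach to "saw", which is the root.
print(attachmentScore(predicted: [1, nil, 1], gold: [1, nil, 1]))  // 1.0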

While the accuracy is not perfect, it’s certainly high enough to be useful in many applications. The major source of errors at this point is examples such as the prepositional phrase attachment ambiguity described above, which require real-world knowledge (e.g. that a street is not likely to be located in a car) and deep contextual reasoning. Machine learning (and in particular, neural networks) has made significant progress in resolving these ambiguities. But we still have our work cut out for us: we would like to develop methods that can learn world knowledge and enable equal understanding of natural language across all languages and contexts.

To get started, see the SyntaxNet code and download the Parsey McParseface parser model. Happy parsing from the main developers, Chris Alberti, David Weiss, Daniel Andor, Michael Collins & Slav Petrov.