Today, we are happy to announce the release of Open Images V6, which greatly expands the annotation of the Open Images dataset with a large set of new visual relationships (e.g., “dog catching a flying disk”), human action annotations (e.g., “woman jumping”), and image-level labels (e.g., “paisley”). Notably, this release also adds localized narratives, a completely new form of multimodal annotations that consist of synchronized voice, text, and mouse traces over the objects being described. In Open Images V6, these localized narratives are available for 500k of its images. Additionally, in order to facilitate comparison to previous works, we also release localized narratives annotations for the full 123k images of the COCO dataset.
Localized Narratives One of the motivations behind localized narratives is to study and leverage the connection between vision and language, typically done via image captioning — images paired with human-authored textual descriptions of their content. One of the limitations of image captioning, however, is the lack of visual grounding, that is, localization on the image of the words in the textual description. To mitigate that, some previous works have a-posteriori drawn the bounding boxes for the nouns present in the description. In contrast, in localized narratives, every word in the textual description is grounded.
Localized narratives are generated by annotators who provide spoken descriptions of an image while they simultaneously move their mouse to hover over the regions they are describing. Voice annotation is at the core of our approach since it directly connects the description with the regions of the image it is referencing. To make the descriptions more accessible, the annotators manually transcribed their description, which was then aligned with the automatic speech transcription result. This recovers the timestamps for the description, ensuring that the three modalities (speech, text, and mouse trace) are correct and synchronized.
Alignment of manual and automatic transcriptions. Icons based on an original design from Freepik.
Speaking and pointing simultaneously are very intuitive, which allowed us to give the annotators very vague instructions about the task. This creates potential avenues of research for studying how people describe images. For example, we observed different styles when indicating the spatial extent of an object — circling, scratching, underlining, etc. — the study of which could bring valuable insights for the design of new user interfaces.
To get a sense of the amount of additional data these localized narratives represent, the total length of mouse traces is ~6400 km long, and if read aloud without stopping, all the narratives would take ~1.5 years to listen to!
New Visual Relationships, Human Actions, and Image-Level Annotations In addition to the localized narratives, in Open Images V6 we increased the types of visual relationship annotations by an order of magnitude (up to 1.4k), adding for example “man riding a skateboard”, “man and woman holding hands”, and “dog catching a flying disk”.
People in images have been at the core of computer vision interests since its inception and understanding what those people are doing is of utmost importance for many applications. That is why Open Images V6 also includes 2.5M annotations of humans performing standalone actions, such as “jumping”, “smiling”, or “laying down”.
Finally, we also added 23.5M new human-verified image-level labels, reaching a total of 59.9M over nearly 20,000 categories.
Conclusion Open Images V6 is a significant qualitative and quantitative step towards improving the unified annotations for image classification, object detection, visual relationship detection, and instance segmentation, and takes a novel approach in connecting vision and language with localized narratives. We hope that Open Images V6 will further stimulate progress towards genuine scene understanding.
Posted by Daniel Adiwardana, Senior Research Engineer, and Thang Luong, Senior Research Scientist, Google Research, Brain Team
Modern conversational agents (chatbots) tend to be highly specialized — they perform well as long as users don’t stray too far from their expected usage. To better handle a wide variety of conversational topics, open-domain dialog research explores a complementary approach attempting to develop a chatbot that is not specialized but can still chat about virtually anything a user wants. Besides being a fascinating research problem, such a conversational agent could lead to many interesting applications, such as further humanizing computer interactions, improving foreign language practice, and making relatable interactive movie and videogame characters.
However, current open-domain chatbots have a critical flaw — they often don’t make sense. They sometimes say things that are inconsistent with what has been said so far, or lack common sense and basic knowledge about the world. Moreover, chatbots often give responses that are not specific to the current context. For example, “I don’t know,” is a sensible response to any question, but it’s not specific. Current chatbots do this much more often than people because it covers many possible user inputs.
In “Towards a Human-like Open-Domain Chatbot”, we present Meena, a 2.6 billion parameter end-to-end trained neural conversational model. We show that Meena can conduct conversations that are more sensible and specific than existing state-of-the-art chatbots. Such improvements are reflected through a new human evaluation metric that we propose for open-domain chatbots, called Sensibleness and Specificity Average (SSA), which captures basic, but important attributes for human conversation. Remarkably, we demonstrate that perplexity, an automatic metric that is readily available to any neural conversational models, highly correlates with SSA.
A chat between Meena (left) and a person (right).
Meena Meena is an end-to-end, neural conversational model that learns to respond sensibly to a given conversational context. The training objective is to minimize perplexity, the uncertainty of predicting the next token (in this case, the next word in a conversation). At its heart lies the Evolved Transformerseq2seq architecture, a Transformer architecture discovered by evolutionary neural architecture search to improve perplexity.
Concretely, Meena has a single Evolved Transformer encoder block and 13 Evolved Transformer decoder blocks as illustrated below. The encoder is responsible for processing the conversation context to help Meena understand what has already been said in the conversation. The decoder then uses that information to formulate an actual response. Through tuning the hyper-parameters, we discovered that a more powerful decoder was the key to higher conversational quality.
Example of Meena encoding a 7-turn conversation context and generating a response, “The Next Generation”.
Conversations used for training are organized as tree threads, where each reply in the thread is viewed as one conversation turn. We extract each conversation training example, with seven turns of context, as one path through a tree thread. We choose seven as a good balance between having long enough context to train a conversational model and fitting models within memory constraints (longer contexts take more memory).
The Meena model has 2.6 billion parameters and is trained on 341 GB of text, filtered from public domain social media conversations. Compared to an existing state-of-the-art generative model, OpenAI GPT-2, Meena has 1.7x greater model capacity and was trained on 8.5x more data.
Human Evaluation Metric: Sensibleness and Specificity Average (SSA) Existing human evaluation metrics for chatbot quality tend to be complex and do not yield consistent agreement between reviewers. This motivated us to design a new human evaluation metric, the Sensibleness and Specificity Average (SSA), which captures basic, but important attributes for natural conversations.
To compute SSA, we crowd-sourced free-form conversation with the chatbots being tested — Meena and other well-known open-domain chatbots, notably, Mitsuku, Cleverbot, XiaoIce, and DialoGPT. In order to ensure consistency between evaluations, each conversation starts with the same greeting, “Hi!”. For each utterance, the crowd workers answer two questions, “does it make sense?” and “is it specific?”. The evaluator is asked to use common sense to judge if a response is completely reasonable in context. If anything seems off — confusing, illogical, out of context, or factually wrong — then it should be rated as, “does not make sense”. If the response makes sense, the utterance is then assessed to determine if it is specific to the given context. For example, if A says, “I love tennis,” and B responds, “That’s nice,” then the utterance should be marked, “not specific”. That reply could be used in dozens of different contexts. But if B responds, “Me too, I can’t get enough of Roger Federer!” then it is marked as “specific”, since it relates closely to what is being discussed.
For each chatbot, we collect between 1600 and 2400 individual conversation turns through about 100 conversations. Each model response is labeled by crowdworkers to indicate if it is sensible and specific. The sensibleness of a chatbot is the fraction of responses labeled “sensible”, and specificity is the fraction of responses that are marked “specific”. The average of these two is the SSA score. The results below demonstrate that Meena does much better than existing state-of-the-art chatbots by large margins in terms of SSA scores, and is closing the gap with human performance.
Automatic Metric: Perplexity Researchers have long sought for an automatic evaluation metric that correlates with more accurate, human evaluation. Doing so would enable faster development of dialogue models, but to date, finding such an automatic metric has been challenging. Surprisingly, in our work, we discover that perplexity, an automatic metric that is readily available to any neural seq2seq model, exhibits a strong correlation with human evaluation, such as the SSA value. Perplexity measures the uncertainty of a language model. The lower the perplexity, the more confident the model is in generating the next token (character, subword, or word). Conceptually, perplexity represents the number of choices the model is trying to choose from when producing the next token.
During development, we benchmarked eight different model versions with varying hyperparameters and architectures, such as the number of layers, attention heads, total training steps, whether we use Evolved Transformer or regular Transformer, and whether we train with hard labels or with distillation. As illustrated in the figure below, the lower the perplexity, the better the SSA score for the model, with a strong correlation coefficient (R2 = 0.93).
Interactive SSA vs. Perplexity. Each blue dot is a different version of the Meena model. A regression line is plotted demonstrating the strong correlation between SSA and perplexity. Dotted lines correspond to SSA performance of humans, other bots, Meena (base), our end-to-end trained model, and finally full Meena with filtering mechanism and tuned decoding.
Our best end-to-end trained Meena model, referred to as Meena (base), achieves a perplexity of 10.2 (smaller is better) and that translates to an SSA score of 72%. Compared to the SSA scores achieved by other chabots, our SSA score of 72% is not far from the 86% SSA achieved by the average person. The full version of Meena, which has a filtering mechanism and tuned decoding, further advances the SSA score to 79%.
Future Research & Challenges As advocated previously, we will continue our goal of lowering the perplexity of neural conversational models through improvements in algorithms, architectures, data, and compute.
While we have focused solely on sensibleness and specificity in this work, other attributes such as personality and factuality are also worth considering in subsequent works. Also, tackling safety and bias in the models is a key focus area for us, and given the challenges related to this, we are not currently releasing an external research demo. We are evaluating the risks and benefits associated with externalizing the model checkpoint, however, and may choose to make it available in the coming months to help advance research in this area.
Acknowledgements Several members contributed immensely to this project: David So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu. Also, thanks to Quoc Le, Samy Bengio, and Christine Robson for their leadership support. Thanks to the people who gave feedback on drafts of the paper: Anna Goldie, Abigail See, YizheZhang, Lauren Kunze, Steve Worswick, Jianfeng Gao, Scott Roy, Ilya Sutskever, Tatsu Hashimoto, Dan Jurafsky, Dilek Hakkani-tur, Noam Shazeer, Gabriel Bender, Prajit Ramachandran, Rami Al-Rfou, Michael Fink, Mingxing Tan, Maarten Bosma, and Adams Yu. Also thanks to the many volunteers who helped collect conversations with each other and with various chatbots. Finally thanks to Noam Shazeer, Rami Al-Rfou, Khoa Vo, Trieu H. Trinh, Ni Yan, Kyu Jin Hwang and the Google Brain team for their help with the project.
Posted by Abhinav Rastogi, Software Engineer and Pranav Khaitan, Engineering Lead, Google Research
Today's virtual assistants help users to accomplish a wide variety of tasks, including finding flights, searching for nearby events and movies, making reservations, sourcing information from the web and more. They provide this functionality by offering a unified natural language interface to a wide variety of services across the web. Large-scale virtual assistants, like Google Assistant, need to integrate with a large and constantly increasing number of services, each with potentially overlapping functionality, over a wide variety of domains. Supporting new services with ease, without collection of additional data or retraining the model, and reducing maintenance workload are necessary to accommodate future growth. Despite tremendous progress, however, these challenges have often been overlooked in state-of-the-art models. This is due, in part, to the absence of suitable datasets that match the scale and complexity confronted by such virtual assistants.
In our recent paper, “Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset”, we introduce a new dataset to address these problems. The Schema-Guided Dialogue dataset (SGD) is the largest publicly available corpus of task-oriented dialogues, with over 18,000 dialogues spanning 17 domains. Equipped with various annotations, this dataset is designed to serve as an effective testbed for intent prediction, slot filling, state tracking (i.e., estimating the user's goal) and language generation, among other tasks for large-scale virtual assistants. We also propose a schema-guided approach for building virtual assistants as a solution to the aforementioned challenges. Our approach utilizes a single model across all services and domains, with no domain-specific parameters. Based on the schema-guided approach and building on the power of pre-trained language models like BERT, we open source a model for dialogue state tracking, which is applicable in a zero-shot setting (i.e., with no training data for new services and APIs) while remaining competitive in the regular setting.
The Dataset The primary goal of releasing the SGD dataset is to confront many real-world challenges that are not sufficiently captured by existing datasets. The SGD dataset consists of over 18k annotated multi-domain, task-oriented conversations between a human and a virtual assistant. These conversations involve interactions with services and APIs spanning 17 domains, ranging from banks and events to media, calendar, travel, and weather. For most of these domains, the SGD dataset contains multiple different APIs, many of which have overlapping functionalities but different interfaces, which reflects common real-world scenarios. SGD is the first dataset to cover such a wide variety of domains and provide multiple APIs per domain. Furthermore, to quantify the robustness of models to changes in API interfaces or to the addition of new APIs, the evaluation set contains many new services that are not present in the training set.
For the creation of the SGD dataset, we have prioritized the variety and accuracy of annotations in the included dialogues. To begin with, dialogues were collected by interaction between two people using a Wizard-of-Oz style process, followed by crowdsourced annotation. Initial efforts revealed the difficulty in obtaining consistent annotations using this method, so we developed a new data collection process that minimized the need for complex manual annotation, and considerably reduced the time and cost of data collection.
For this alternate approach, we developed a multi-domain dialogue simulator that generates dialogue skeletons over an arbitrary combination of APIs, along with the corresponding annotations, such as dialogue state and system actions. The simulator consists of two agents playing the role of the user and the assistant. Both the agents interact with each other using a finite set of actions denoting dialogue semantics with transitions specified through a probabilistic automaton, designed to capture a wide variety of dialogue trajectories. The actions generated by the simulator are converted into natural language utterances using a set of templates. Crowdsourcing is used only for paraphrasing these templatized utterances in order to make the dialogue more natural and coherent. This setup eliminates the need for complicated domain-specific instructions while keeping the crowdsourcing task simple and yields natural dialogues with consistent, high quality annotations.
Steps for obtaining dialogues, with assistant turns marked in red and user turns in blue. Left: The simulator generates a dialogue skeleton using a finite set of actions. Center: Actions are converted into utterances using templates (~50 per service) and slot values are replaced with natural variations. Right: Paraphrasing via crowdsourcing to make the flow cohesive.
The Schema-Guided Approach With the availability of the SGD dataset, it is now possible to train virtual assistants to support the diversity of services available on the web. A common approach to do this requires a large master schema that lists all supported functions and their parameters. However, it is difficult to develop a master schema catering to all possible use cases. Even if that problem is solved, a master schema would complicate integration of new or small-scale services and would increase the maintenance workload of the assistant. Furthermore, while there are many similar concepts across services that can be jointly modeled, for example, the similarities in logic for querying or specifying the number of movie tickets, flight tickets or concert tickets, the master schema approach does not facilitate joint modeling of such concepts, unless an explicit mapping between them is manually defined.
The new schema-guided approach we propose addresses these limitations. This approach does not require the definition of a master schema for the assistant. Instead, each service or API provides a natural language description of the functions listed in its schema along with their associated attributes. These descriptions are then used to learn a distributed semantic representation of the schema, which is given as an additional input to the dialogue system. The dialogue system is then implemented as a single unified model, containing no domain or service specific parameters. This unified model facilitates representation of common knowledge between similar concepts in different services, while the use of distributed representations of the schema makes it possible to operate over new services that are not present in the training data. We have implemented this approach in our open-sourced dialogue state tracking model.
We believe that this dataset will act as a good benchmark for building large-scale dialogue models. We are excited and looking forward to all the innovative ways in which the research community will use it for the advancement of dialogue technologies.
Acknowledgements This post reflects the work of our co-authors Xiaoxue Zang, Srinivas Sunkara and Raghav Gupta. We also thank Amir Fayazi and Maria Wang for help with data collection and Guan-Lin Chao for insights on model design and implementation.
Posted by Yuan Zhang, Research Scientist and Yinfei Yang, Software Engineer, Google Research
Word order and syntactic structure have a large impact on sentence meaning — even small perturbations in word order can completely change interpretation. For example, consider the following related sentences:
Flights from New York to Florida.
Flights to Florida from New York.
Flights from Florida to New York.
All three have the same set of words. However, 1 and 2 have the same meaning — known as paraphrase pairs — while 1 and 3 have very different meanings — known as non-paraphrase pairs. The task of identifying whether pairs are paraphrase or not is called paraphrase identification, and this task is important to many real-world natural language understanding (NLU) applications such as question answering. Perhaps surprisingly, even state-of-the-art models, like BERT, would fail to correctly identify the difference between many non-paraphrase pairs like 1 and 3 above if trained only on existing NLU datasets. This is because existing datasets lack training pairs like this, so it is hard for machine learning models to learn this pattern even if they have the capability to understand complex contextual phrasings.
To address this, we are releasing two new datasets for use in the research community: Paraphrase Adversaries from Word Scrambling (PAWS) in English, and PAWS-X, an extension of the PAWS dataset to six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. Both datasets contain well-formed sentence pairs with high lexical overlap, in which about half of the pairs are paraphrase and others are not. Including new pairs in training data for state-of-the-art models improves their accuracy on this problem from <50% to 85-90%. In contrast, models that do not capture non-local contextual information fail even with new training examples. The new datasets therefore provide an effective instrument for measuring the sensitivity of models to word order and structure.
The PAWS dataset contains 108,463 human-labeled pairs in English, sourced from Quora Question Pairs (QQP) and Wikipedia pages. PAWS-X contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs. The table below gives detailed statistics of the datasets.
† The training set of PAWS-X is machine translated from a subset of the PAWS Wiki dataset in English.
Creating the PAWS Dataset in English In “PAWS: Paraphrase Adversaries from Word Scrambling,” we introduce a workflow for generating pairs of sentences that have high word overlap, but which are balanced with respect to whether they are paraphrases or not. To generate examples, source sentences are first passed to a specialized language model that creates word-swapped variants that are still semantically meaningful, but ambiguous as to whether they are paraphrase pairs or not. These were then judged by human raters for grammaticality and then multiple raters judged whether they were paraphrases of each other.
PAWS corpus creation workflow.
One problem with this swapping strategy is that it tends to produce pairs that aren't paraphrases (e.g., "why do bad things happen to good people" != "why do good things happen to bad people"). In order to ensure balance between paraphrases and non-paraphrases, we added other examples based on back-translation. Back-translation has the opposite bias as it tends to preserve meaning while changing word order and word choice. These two strategies lead to PAWS being balanced overall, especially for the Wikipedia portion.
Creating the Multilingual PAWS-X Dataset After creating PAWS, we extended it to six more languages: Chinese, French, German, Korean, Japanese, and Spanish. We hired human translators to translate the development and test sets, and used a neural machine translation (NMT) service to translate the training set. We obtained human translations (native speakers) on a random sample of 4,000 sentence pairs from the PAWS development set for each of the six languages (48,000 translations). Each sentence in a pair is presented independently so that translation is not affected by context. A randomly sampled subset was validated by a second worker. The final dataset has less than 5% word level error rate. Note, we allowed professionals to not translate a sentence if it was incomplete or ambiguous. On average, less than 2% of the pairs were not translated, and we simply excluded them. The final translated pairs are split then into new development and test sets, ~2,000 pairs for each.
Examples of human translated pairs for German(de) and Chinese(zh).
Language Understanding with PAWS and PAWS-X We train multiple models on the created dataset and measure the classification accuracy on the eval set. When trained with PAWS, strong models, such as BERT and DIIN, show remarkable improvement over when they are trained on the existing Quora Question Pairs (QQP) dataset. For example, on the PAWS data sourced from QQP (PAWS-QQP), BERT gets only 33.5 accuracy if trained on existing QQP, but it recovers to 83.1 accuracy when given PAWS training examples. Unlike BERT, a simple Bag-of-Words (BOW) model fails to learn from PAWS training examples, demonstrating its weakness at capturing non-local contextual information. These results demonstrate that PAWS effectively measures sensitivity of models to word order and structure.
Accuracy on PAWS-QQP Eval Set (English).
The figure below shows the performance of the popular multilingual BERT model on PAWS-X using several common strategies:
Zero Shot: The model is trained on the PAWS English training data, and then directly evaluated on all others. Machine translation is not involved in this strategy.
Translate Test: Train a model using the English training data, and machine-translate all test examples to English for evaluation.
Translate Train: The English training data is machine-translated into each target language to provide data to train each model.
Merged: Train a multilingual model on all languages, including the original English pairs and machine-translated data in all other languages.
The results show that cross-lingual techniques help, while it also leaves considerable headroom to drive multilingual research on the problem of paraphrase identification
Accuracy of PAWS-X Test Set using BERT Models.
It is our hope that these datasets will be useful to the research community to drive further progress on multilingual models that better exploit structure, context, and pairwise comparisons.
Acknowledgements The core team includes Luheng He, Jason Baldridge, Chris Tar. We would like to thank the Language team in Google Research, especially Emily Pitler, for the insightful comments that contributed to our papers. Many thanks also to Ashwin Kakarla, Henry Jicha, and Mengmeng Niu, for the help with the annotations.
Posted by Rajan Patel, Director, Augmented Reality
Around the world, millions of people are coming online for the first time, and many of them are among the 800 million adults worldwide who are unable to read or write, or those who are migrating to towns and cities where they are not able to speak the predominant language. As a smartphone camera-based tool, Google Lens has great potential for helping people who struggle with reading and other language-based challenges. Lens uses computer vision, machine learning and Google’s Knowledge Graph to let people turn the things they see in the real world into a visual search box, enabling them to identify objects like plants and animals, or to copy and paste text from the real world into their phone.
However, in order for Lens to be able to help the greatest number of people, we needed to create a special version that can work on even the most basic smartphones. So at I/O 2019, we announced a new version of Lens designed specifically for use in Google Go—our Search app for entry level devices—and we included a new set of features designed to help people who face reading and other language-based challenges. When users point their camera at text they don’t understand, Lens in Google Go can translate and read it out loud. It even highlights each word as it’s being read so users can follow along. If you want to try out these features for yourself, they are available today via Lens in Google Go. While Google Go was initially available only on Android Go devices and on the Google Play Store in select markets, recently, we made it available globally in the Google Play Store.
To make these reading features work, the Google Go version of Lens needs to be able to capture high quality images on a wide variety of devices, then identify the text, understand its structure, translate and overlay it in context, and finally, read it out loud.
Image Capture Image capture on entry-level devices, like those that run Android Go, is tricky since it must work on a wide variety of devices, many of which are more resource constrained than flagship phones. To build a universal tool that can reliably capture high-quality images with minimal lag, we made Lens in Google Go an early adopter of a new Android support library called CameraX. Available in Jetpack—a suite of libraries, tools, and guidance for Android developers—CameraX is an abstraction layer over the Android Camera2 API that resolves device compatibility issues so developers don't have to write their own device-specific code.
Using CameraX, we implemented two capture strategies to balance capture latency against performance impact. On higher-end phones, which are powerful enough to provide a constant stream of high-resolution frames from which to select an image, we’ve made capture instantaneous. On less advanced devices, streaming these frames could cause camera lag since the CPU is less powerful, so we process the frame when the user taps capture to produce a single, on-demand high-resolution image.
Text Recognition After Lens in Google Go captures an image, it needs to make sense of the shapes and letters that constitute the words, sentences and paragraphs. To do this, the image is scaled down and transferred to the Lens server, where the processing will be performed. Next, optical character recognition (OCR) is applied, which utilizes a region proposal network to detect character level bounding boxes that can be merged into lines for text recognition.
Merging these character boxes into words is a two-step, sequential process. The first step is to apply the Hough Transform, which assumes the text is distributed across parallel lines. The second step uses Text Flow, which instead traces text that may follow a curve by finding the shortest path through a graph of detected text boxes. This ensures that text with a variety of distributions, be they straight, curved or mixed, can be identified and processed.
Because the images captured by Lens in Google Go may include sources such as signage, handwriting or documents, a slew of additional challenges can arise. For example, the text can be obscured, scripts can be uniquely stylized, and images can be blurry. All of these issues can cause the OCR engine to misunderstand various characters within each word. To correct mistakes and improve word accuracy, Lens in Google Go uses the context of surrounding words to make corrections. It also utilizes the Knowledge Graph to provide contextual clues, such as whether a word is likely a proper noun and should not be spell-corrected.
Left: Image with bounding box around recognized text. The raw OCR output from this image reads, “Cise is beauti640”. Right: By applying Knowledge Graph in addition to context from nearby words, Lens in Google Go recognizes the words, “life is beautiful”.
Understanding Structure Once the individual words have been recognized, Lens must determine how to fit them together. The text that people come across in the real world is laid out in many different ways. A newspaper, for example, is laid out into columns, with headlines, article text, and advertisements. Meanwhile, a bus schedule, has one column for destinations and another with times. While understanding text structure comes very naturally to people, computers need to be taught how to comprehend it. Lens uses CNNs to detect coherent text blocks like columns, or text in a consistent style or color. And then, within each block, it uses signals like text-alignment, language, and the geometric relationship of the paragraphs to determine their final reading order.
One of the other challenges in detecting document structure is that people take pictures of text from different angles, often with a warped perspective. This means we cannot revert to off-the-shelf detectors that rely on axis aligned boxes, but must generalize our systems to be able to deal with homographic distortions.
Paragraph segmentation on the front page of a newspaper. Notice how “News Analysis”, which is embedded in the middle of a column, has been identified separately due to its distinct style features.
Translations in Context To provide users with the most helpful information, translations must be both accurate and contextual. Lens uses Google Translate’s neural machine translation (NMT) algorithms, to translate entire sentences at a time, rather than going word-by-word, in order to preserve proper grammar and diction.
For the translation to be most useful, it needs to be placed in the context of the original text. For example, when translating instructions on an ATM, it is important to know which buttons correspond to which instructions. Part of the challenge is accounting for the fact that the translated text can be much shorter or longer than the original. For example, German sentences tend to be longer than English ones. To accomplish this seamless overlay, Lens redistributes the translation into lines of similar length, and chooses an appropriate font size to match. It also matches the color of the translation and its background with the original text through the use of a heuristic that assumes the background and the text differ in luminosity, and that the background takes up the majority of the space. This allows Lens to classify whether a pixel represents background or text, and then sample the average color from these two regions to ensure the translated text matches the original text.
Reading the Text Out Loud The final challenge in delivering information in the most helpful way with Lens in Google Go is reading the text aloud. High-fidelity audio is generated using Google Text-to-Speech (TTS), a service that applies machine learning to disambiguate and detected entities such as dates, phone numbers and addresses, and uses that to generate realistic speech based on DeepMind’s WaveNet.
These reading features become more contextual and useful when they are paired with display. Lens utilizes timing annotations from the TTS service that mark the beginning of each word in order to highlight each word on screen as it’s being read, similar to a karaoke machine. Say for example, a user takes a picture of an ATM screen with different labels next to different buttons. This karaoke effect allows users to know which label applies to which button. It may also help users learn how to pronounce the words being translated. Looking Ahead Taken together, it is our hope that these features will have a positive impact on the day-to-day lives of millions of people. Moving forward, we will continue to work on further updates to these reading features to make the OCR more precise, including improvements to text structure understanding (e.g. multi-column text) and recognition of Indic scripts. As we address these text challenges, we continue to look for new ways that the combination of machine learning and the smartphone camera can help people as they go about their lives.
Posted by Rishabh Agarwal, Google AI Resident and Mohammad Norouzi, Research Scientist
Reinforcement learning (RL) presents a unified and flexible framework for optimizing goal-oriented behavior, and has enabled remarkable success in addressing challenging tasks such as playing video games, continuous control, and robotic learning. The success of RL algorithms in these application domains often hinges on the availability of high-quality and dense reward feedback. However, broadening the applicability of RL algorithms to environments with sparse and underspecified rewards is an ongoing challenge, requiring a learning agent to generalize (i.e., learn the right behavior) from limited feedback. A natural way to investigate the performance of RL algorithms in such problem settings is via language understanding tasks, where an agent is provided with a natural language input and needs to generate a complex response to achieve a goal specified in the input, while only receiving binary success-failure feedback.
For instance, consider a "blind" agent tasked with reaching a goal position in a maze by following a sequence of natural language commands (e.g., "Right, Up, Up, Right"). Given the input text, the agent (green circle) needs to interpret the commands and take actions based on such interpretation to generate an action sequence (a). The agent receives a reward of 1 if it reaches the goal (red star) and 0 otherwise. Because the agent doesn't have access to any visual information, the only way for the agent to solve this task and generalize to novel instructions is by correctly interpreting the instructions.
In this instruction-following task, the action trajectories a1, a2 and a3 reach the goal, but the sequences a2 and a3 do not follow the instructions. This illustrates the issue of underspecified rewards.
In these tasks, the RL agent needs to learn to generalize from sparse (only a few trajectories lead to a non-zero reward) and underspecified (no distinction between purposeful and accidental success) rewards. Importantly, because of underspecified rewards, the agent may receive positive feedback for exploiting spurious patterns in the environment. This can lead to reward hacking, causing unintended and harmful behavior when deployed in real-world systems.
In "Learning to Generalize from Sparse and Underspecified Rewards", we address the issue of underspecified rewards by developing Meta Reward Learning (MeRL), which provides more refined feedback to the agent by optimizing an auxiliary reward function. MeRL is combined with a memory buffer of successful trajectories collected using a novel exploration strategy to learn from sparse rewards. The effectiveness of our approach is demonstrated on semantic parsing, where the goal is to learn a mapping from natural language to logical forms (e.g., mapping questions to SQL programs). In the paper, we investigate the weakly-supervised problem setting, where the goal is to automatically discover logical programs from question-answer pairs, without any form of program supervision. For instance, given the question "Which nation won the most silver medals?" and a relevant Wikipedia table, an agent needs to generate an SQL-like program that results in the correct answer (i.e., "Nigeria").
The proposed approach achieves state-of-the-art results on the WikiTableQuestions and WikiSQL benchmarks, improving upon prior work by 1.2% and 2.4% respectively. MeRL automatically learns the auxiliary reward function without using any expert demonstrations, (e.g., ground-truth programs) making it more widely applicable and distinct from previousrewardlearning approaches. The diagram below depicts a high level overview of our approach:
Overview of the proposed approach. We employ (1) mode covering exploration to collect a diverse set of successful trajectories in a memory buffer; (2) Meta-learning or Bayesian optimization to learn an auxiliary reward that provides more refined feedback for policy optimization.
Meta Reward Learning (MeRL) The key insight of MeRL in dealing with underspecified rewards is that spurious trajectories and programs that achieve accidental success are detrimental to the agent's generalization performance. For example, an agent might be able to solve a specific instance of the maze problem above. However, if it learns to perform spurious actions during training, it is likely to fail when provided with unseen instructions. To mitigate this issue, MeRLoptimizes a more refined auxiliary reward function, which can differentiate between accidental and purposeful success based on features of action trajectories. The auxiliary reward is optimized by maximizing the trained agent's performance on a hold-out validation set via meta learning.
Schematic illustration of MeRL: The RL agent is trained via the reward signal obtained from the auxiliary reward model while the auxiliary rewards are trained using the generalization error of the agent.
Learning from Sparse Rewards To learn from sparse rewards, effective exploration is critical to find a set of successful trajectories. Our paper addresses this challenge by utilizing the two directions of Kullback–Leibler (KL) divergence, a measure on how different two probability distributions are. In the example below, we use KL divergence to minimize the difference between a fixed bimodal (shaded purple) and a learned gaussian (shaded green) distribution, which can represent the distribution of the agent's optimal policy and our learned policy respectively. One direction of the KL objective learns a distribution which tries to cover both the modes while the distribution learned by other objective seeks a particular mode (i.e. it prefers one mode over another). Our method exploits the mode covering KL's tendency to focus on multiple peaks to collect a diverse set of successful trajectories and mode seeking KL's implicit preference between trajectories to learn a robust policy.
Conclusion Designing reward functions that distinguish between optimal and suboptimal behavior is critical for applying RL to real-world applications. This research takes a small step in the direction of modelling reward functions without any human supervision. In future work, we'd like to tackle the credit-assignment problem in RL from the perspective of automatically learning a dense reward function.
Acknowledgements This research was done in collaboration with Chen Liang and Dale Schuurmans. We thank Chelsea Finn and Kelvin Guu for their review of the paper.
Posted by Tom Kwiatkowski and Michael Collins, Research Scientists, Google AI Language
Open-domain question answering (QA) is a benchmark task in natural language understanding (NLU) that aims to emulate how people look for information, finding answers to questions by reading and understanding entire documents. Given a question expressed in natural language ("Why is the sky blue?"), a QA system should be able to read the web (such as this Wikipedia page) and return the correct answer, even if the answer is somewhat complicated and long. However, there are currently no large, publicly available sources of naturally occurring questions (i.e. questions asked by a person seeking information) and answers that can be used to train and evaluate QA models. This is because assembling a high-quality dataset for question answering requires a large source of real questions and significant human effort in finding correct answers.
To help spur research advances in QA, we are excited to announce Natural Questions (NQ), a new, large-scale corpus for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions. NQ is large, consisting of 300,000 naturally occurring questions, along with human annotated answers from Wikipedia pages, to be used in training QA systems. We have additionally included 16,000 examples where answers (to the same questions) are provided by 5 different annotators, useful for evaluating the performance of the learned QA systems. Since answering the questions in NQ requires much deeper understanding than is needed to answer trivia questions — which are already quite easy for computers to solve — we are also announcing a challenge based on this data to help advance natural language understanding in computers.
The Data NQ is the first dataset to use naturally occurring queries and focus on finding answers by reading an entire page, rather than extracting answers from a short paragraph. To create NQ, we started with real, anonymized, aggregated queries that users have posed to Google's search engine. We then ask annotators to find answers by reading through an entire Wikipedia page as they would if the question had been theirs. Annotators look for both long answers that cover all of the information required to infer the answer, and short answers that answer the question succinctly with the names of one or more entities. The quality of the annotations in the NQ corpus has been measured at 90% accuracy.
The Challenge NQ is aimed at enabling QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. Systems will need to first decide whether the question is sufficiently well defined to be answerable — many questions make false assumptions or are just too ambiguous to be answered concisely. Then they will need to decide whether there is any part of the Wikipedia page that contains all of the information needed to infer the answer. We believe that the long answer identification task — finding all of the information required to infer an answer — requires a deeper level of language understanding than finding short answers once the long answers are known.
It is our hope that the release of NQ, and the associated challenge, will help spur the development of more effective and robust QA systems. We encourage the NLU community to participate and to help close the large gap between the performance of current state-of-the-art approaches and a human upper bound. Please visit the challenge website to view the leaderboard and learn more.
Posted by Jacob Devlin and Ming-Wei Chang, Research Scientists, Google AI Language
One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples. To help close this gap in data, researchers have developed a variety of techniques for training general purpose language representation models using the enormous amount of unannotated text on the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.
This week, we open sourced a new technique for NLP pre-training called Bidirectional Encoder Representations from Transformers, or BERT. With this release, anyone in the world can train their own state-of-the-art question answering system (or a variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. The release includes source code built on top of TensorFlow and a number of pre-trained language representation models. In our associated paper, we demonstrate state-of-the-art results on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset (SQuAD v1.1).
What Makes BERT Different? BERT builds upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).
Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the ... account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
A visualization of BERT’s neural network architecture compared to previous state-of-the-art contextual pre-training methods is shown below. The arrows indicate the information flow from one layer to the next. The green boxes at the top indicate the final contextualized representation of each input word:
BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional.
The Strength of Bidirectionality If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.
To solve this problem, we use the straightforward technique of masking out some of the words in the input and then condition each word bidirectionally to predict the masked words. For example:
While this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network.
BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: Given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example:
Training with Cloud TPUs Everything that we’ve described so far might seem fairly straightforward, so what’s the missing piece that made it work so well? Cloud TPUs. Cloud TPUs gave us the freedom to quickly experiment, debug, and tweak our models, which was critical in allowing us to move beyond existing pre-training techniques. The Transformer model architecture, developed by researchers at Google in 2017, also gave us the foundation we needed to make BERT successful. The Transformer is implemented in our open source release, as well as the tensor2tensor library.
Results with BERT To evaluate performance, we compared BERT to other state-of-the-art NLP systems. Importantly, BERT achieved all of its results with almost no task-specific changes to the neural network architecture. On SQuAD v1.1, BERT achieves 93.2% F1 score (a measure of accuracy), surpassing the previous state-of-the-art score of 91.6% and human-level score of 91.2%:
BERT also improves the state-of-the-art by 7.6% absolute on the very challenging GLUE benchmark, a set of 9 diverse Natural Language Understanding (NLU) tasks. The amount of human-labeled training data in these tasks ranges from 2,500 examples to 400,000 examples, and BERT substantially improves upon the state-of-the-art accuracy on all of them:
Making BERT Work for You The models that we are releasing can be fine-tuned on a wide variety of NLP tasks in a few hours or less. The open source release also includes code to run pre-training, although we believe the majority of NLP researchers who use BERT will never need to pre-train their own models from scratch. The BERT models that we are releasing today are English-only, but we hope to release models which have been pre-trained on a variety of languages in the near future.
Posted by Manaal Faruqui, Senior Research Scientist and Emily Pitler, Staff Research Scientist, Google AI Language
This week, the annual conference on Empirical Methods in Natural Language Processing (EMNLP 2018) will be held in Brussels, Belgium. Google will have a strong presence at EMNLP with several of our researchers presenting their research on a diverse set of topics, including language identification, segmentation, semantic parsing and question answering, additionally serving in various levels of organization in the conference. Googlers will also be presenting their papers and participating in the co-located Conference on Computational Natural Language Learning (CoNLL 2018) shared task on multilingual parsing.
In addition to this involvement, we are sharing several new datasets with the academic community that are released with papers published at EMNLP, with the goal of accelerating progress in empirical natural language processing (NLP). These releases are designed to help account for mismatches between the datasets a machine learning model is trained and tested on, and the inputs an NLP system would be asked to handle “in the wild”. All of the datasets we are releasing include realistic, naturally occurring text, and fall into two main categories: 1) challenge sets for well-studied core NLP tasks (part-of-speech tagging, coreference) and 2) datasets to encourage new directions of research on meaning preservation under rephrasings/edits (query well-formedness, split-and-rephrase, atomic edits):
Noun-Verb Ambiguity in POS Tagging Dataset: English part-of-speech taggers regularly make egregious errors related to noun-verb ambiguity, despite high accuracies on standard datasets. For example: in “Mark which area you want to distress” several state-of-the-art taggers annotate “Mark” as a noun instead of a verb. We release a new dataset of over 30,000 naturally occurring non-trivial annotated examples of noun-verb ambiguity. Taggers previously indistinguishable from each other have accuracies ranging from 57% to 75% accuracy on this challenge set.
Query Wellformedness Dataset: Web search queries are usually “word-salad” style queries with little resemblance to natural language questions (“barack obama height” as opposed to “What is the height of Barack Obama?”). Differentiating a natural language question from a query is of importance to several applications include dialogue. We annotate and release 25,100 queries from the open-source Paralex corpus with ratings on how close they are to well-formed natural language questions.
WikiSplit: Split and Rephrase Dataset Extracted from Wikipedia Edits: We extract examples of sentence splits from Wikipedia edits where one sentence gets split into two sentences that together preserve the original meaning of the sentence (E.g. “Street Rod is the first in a series of two games released for the PC and Commodore 64 in 1989.” is split into “Street Rod is the first in a series of two games.” and “It was released for the PC and Commodore 64 in 1989.”) The released corpus contains one million sentence splits with a vocabulary of more than 600,000 words.
WikiAtomicEdits: A Multilingual Corpus of Atomic Wikipedia Edits: Information about how people edit language in Wikipedia can be used to understand the structure of language itself. We pay particular attention to two atomic edits: insertions and deletions that consist of a single contiguous span of text. We extract around 43 million such edits in 8 languages and show that they provide valuable information about entailment and discourse. For example, insertion of “in 1949” adds a prepositional phrase to the sentence “She died there after a long illness” resulting in “She died there in 1949 after a long illness”.
Below is a full list of Google’s involvement and publications being presented at EMNLP and CoNLL (Googlers highlighted in blue). We are particularly happy to announce that the paper “Linguistically-Informed Self-Attention for Semantic Role Labeling” was awarded one of the two Best Long Paper awards. This work was done by our 2017 intern Emma Strubell, Googlers Daniel Andor, David Weiss and Google Faculty Advisor Andrew McCallum. We congratulate these authors, and all other researchers who are presenting their work at the conference.
Area Chairs Include: Ming-Wei Chang, Marius Pasca, Slav Petrov, Emily Pitler, Meg Mitchell, Taro Watanabe
Universal Sentence Encoder for English Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, Ray Kurzweil
Active Question Answering In traditional QA, supervised learning techniques are used in combination with labeled data to train a system that answers arbitrary input questions. While this is effective, it suffers from a lack of ability to deal with uncertainty like humans would, by reformulating questions, issuing multiple searches, evaluating and aggregating responses. Inspired by humans’ ability to "ask the right questions", ActiveQA introduces an agent that repeatedly consults the QA system. In doing so, the agent may reformulate the original question multiple times in order to find the best possible answer. We call this approach active because the agent engages in a dynamic interaction with the QA system, with the goal of improving the quality of the answers returned.
For example, consider the question “When was Tesla born?”. The agent reformulates the question in two different ways: “When is Tesla’s birthday” and “Which year was Tesla born”, retrieving answers to both questions from the QA system. Using all this information it decides to return “July 10 1856”.
What characterizes an ActiveQA system is that it learns to ask questions that lead to good answers. However, because training data in the form of question pairs, with an original question and a more successful variant, is not readily available, ActiveQA uses reinforcement learning, an approach to machine learning concerned with training agents so that they take actions that maximize a reward, while interacting with an environment.
The learning takes place as the ActiveQA agent interacts with the QA system; each question reformulation is evaluated in terms of how good the corresponding answer is, which constitutes the reward. If the answer is good, then the learning algorithm will adjust the model’s parameters so that the question reformulation that lead to the answer is more likely to be generated again, or otherwise less likely, if the answer was bad.
In our paper, we show that it is possible to train such agents to outperform the underlying QA system, the one used to provide answers to reformulations, by asking better questions. This is an important result, as the QA system is already trained with supervised learning to solve the same task. Another compelling finding of our research is that the ActiveQA agent can learn a fairly sophisticated, and still somewhat interpretable, reformulation strategy (the policy in reinforcement learning). The learned policy uses well-known information retrieval techniques such as tf-idf query term re-weighting, the process by which more informative terms are weighted more than generic ones, and word stemming.
Build Your Own ActiveQA System The TensorFlow ActiveQA package we are releasing consists of three main components, and contains all the code necessary to train and run the ActiveQA agent.
A pretrained sequence to sequence model that takes as input a question and returns its reformulations. This task is similar to machine translation, translating from English to English, and indeed the initial model can be used for general paraphrasing. For its implementation we use and customize the TensorFlow Neural Machine Translation Tutorial code. We adapted the code to support training with reinforcement learning, using policy gradient methods.*
An answer selection model. The answer selector uses a convolutional neural network and assigns a score to each triplet of original question, reformulation and answer. The selector uses pre-trained, publicly available word embeddings (GloVe).
A question answering system (the environment). For this purpose we use BiDAF, a popular question answering system, described in Seo et al. (2017).
We also provide pointers to checkpoints for all the trained models.
Google’s mission is to organize the world's information and make it universally accessible and useful, and we believe that ActiveQA is an important step in realizing that mission. We envision that this research will help us design systems that provide better and more interpretable answers, and hope it will help others develop systems that can interact with the world using natural language.
Acknowledgments Contributors to this research and release include Alham Fikri Aji, Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Wojciech Gajewski, Andrea Gesmundo, Alexey Gronskiy, Neil Houlsby, Yannic Kilcher, and Wei Wang.