Author Archives: Google AI Blog

Introducing the Inclusive Images Competition



The release of large, publicly available image datasets, such as ImageNet, Open Images and Conceptual Captions, has been one of the factors driving the tremendous progress in the field of computer vision. While these datasets are a necessary and critical part of developing useful machine learning (ML) models, some open source datasets have been found to be geographically skewed based on how they were collected. Because the shape of a dataset informs what an ML model learns, such skew may cause the research community to inadvertently develop models that perform less well on images drawn from geographical regions under-represented in those datasets. For example, the images below show one standard open-source image classifier trained on the Open Images dataset that does not properly apply “wedding” related labels to images of wedding traditions from different parts of the world.
Wedding photographs (donated by Googlers), labeled by a classifier trained on the Open Images dataset. The classifier’s label predictions are recorded below each image.
While Google is focusing on building even more representative datasets, we also want to encourage additional research in the field around ways that machine learning methods can be more robust and inclusive when learning from imperfect data sources. This is an important research challenge, and one that pushes the boundaries of ways that machine learning models are currently created. Good solutions will help ensure that even when some data sources aren’t fully inclusive, the models developed with them can be.

In support of this effort and to spur further progress in developing inclusive ML models, we are happy to announce the Inclusive Images Competition on Kaggle. Developed in partnership with the Conference on Neural Information Processing Systems Competition Track, this competition challenges you to use Open Images, a large, multilabel, publicly-available image classification dataset that is majority-sampled from North America and Europe, to train a model that will be evaluated on images collected from a different set of geographic regions across the globe.
The three geographical distributions of data in this competition. Competitors will train their models on Open Images, a widely used publicly available benchmark dataset for image classification which happens to be drawn mostly from North America and Western Europe. Models are then evaluated first on Challenge Stage 1 and finally on Challenge Stage 2, each with a different unrevealed geographical distribution. In this way, models are stress-tested for their ability to operate inclusively beyond their training data.
For model evaluation, we have created two Challenge datasets via our Crowdsource project, where we asked our volunteers from across the globe to participate in contributing photos of their surroundings. We hope that these datasets, built by donations from Google’s global community, will provide a challenging geographically-based stress test for this competition. We also plan to release a larger set of images at the end of the competition to further encourage inclusive development, with more inclusive data.
Examples of labeled images from the challenge dataset. Clockwise from top left, image donation by Peter Tester, Mukesh Kumhar, HeeYoung Moon, Sudipta Pramanik, jaturan amnatbuddee, Tomi Familoni and Anu Subhi
The Inclusive Images Competition officially started September 5th with the available training data & first stage Challenge data set. The deadline for submitting your results will be Monday, November 5th, and the test set will be released on Tuesday, November 6th. For more details and timelines, please visit the Inclusive Images Competition website.

The results of the competition will be presented at the 2018 Conference on Neural Information Processing Systems, and we will provide top-ranking competitors with travel grants to attend the conference (see this page for full details). We look forward to being part of the community's development of more inclusive, global image classification algorithms!

Acknowledgements
We would like to thank the following individuals for making the Inclusive Images Competition and dataset possible: James Atwood, Pallavi Baljekar, Parker Barnes, Anurag Batra, Eric Breck, Peggy Chi, Tulsee Doshi, Julia Elliott, Gursheesh Kaur, Akshay Gaur, Yoni Halpern, Henry Jicha, Matthew Long, Jigyasa Saxena, and D. Sculley.

Source: Google AI Blog


Conceptual Captions: A New Dataset and Challenge for Image Captioning



The web is filled with billions of images, helping to entertain and inform the world on a countless variety of subjects. However, much of that visual information is not accessible to those with visual impairments, or with slow internet speeds that prohibit the loading of images. Image captions, manually added by website authors using Alt-text HTML, are one way to make this content more accessible, so that a natural-language description of an image can be presented using text-to-speech systems. However, existing human-curated Alt-text HTML fields are added for only a very small fraction of web images. And while automatic image captioning can help solve this problem, accurate image captioning is a challenging task that requires advancing the state of the art of both computer vision and natural language processing.
Image captioning can help millions with visual impairments by converting images to text captions. Image by Francis Vallance (Heritage Warrior), used under CC BY 2.0 license.
Today we introduce Conceptual Captions, a new dataset consisting of ~3.3 million image/caption pairs that are created by automatically extracting and filtering image caption annotations from billions of web pages. Introduced in a paper presented at ACL 2018, Conceptual Captions represents an order of magnitude increase of captioned images over the human-curated MS-COCO dataset. As measured by human raters, the machine-curated Conceptual Captions has an accuracy of ~90%. Furthermore, because images in Conceptual Captions are pulled from across the web, it represents a wider variety of image-caption styles than previous datasets, allowing for better training of image captioning models. To track progress on image captioning, we are also announcing the Conceptual Captions Challenge for the machine learning community to train and evaluate their own image captioning models on the Conceptual Captions test bed.
Illustration of images and captions in the Conceptual Captions dataset.
Clockwise from top left, images by Jonny Hunter, SigNote Cloud, Tony Hisgett, ResoluteSupportMedia. All images used under CC BY 2.0 license
Generating the Dataset
To generate the Conceptual Captions dataset, we start by sourcing images from the web that have Alt-text HTML attributes. We automatically screen these for certain properties to ensure image quality while also avoiding undesirable content such as adult themes. We then apply text-based filtering, removing captions with non-descriptive text (such as hashtags, poor grammar or added language that does not relate to the image); we also discard texts with high sentiment polarity or adult content (for more details on the filtering criteria, please see our paper). We use existing image classification models to make sure that, for any given image, there is overlap between its Alt-text (allowing for word variations) and the labels that the image classifier outputs for that image.
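The last check described above, overlap between the Alt-text and the classifier's labels, can be sketched roughly as follows. This is a minimal illustration, not the pipeline from the paper: the function names, the crude plural-stripping used to stand in for "allowing for word variations", and the example labels are all assumptions.

```python
import re

def normalize(word):
    """Crude stand-in for 'word variations': lowercase and strip a trailing 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word

def tokens(text):
    """Split text into a set of normalized alphabetic tokens."""
    return {normalize(w) for w in re.findall(r"[A-Za-z]+", text)}

def passes_overlap_filter(alt_text, classifier_labels):
    """Keep a candidate only if its Alt-text shares a token with the labels.

    classifier_labels is a hypothetical list of label strings produced by
    some image classifier for the same image.
    """
    label_tokens = set()
    for label in classifier_labels:
        label_tokens |= tokens(label)
    return bool(tokens(alt_text) & label_tokens)

# Hypothetical usage: one descriptive caption and one non-descriptive one.
keep = passes_overlap_filter("two dogs playing in the snow", ["dog", "snow", "winter"])
drop = passes_overlap_filter("click here for more", ["dog", "snow", "winter"])
```

A real system would likely use proper lemmatization and a tuned overlap criterion; the single shared token used here is only the simplest possible choice.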

From Specific Names to General Concepts
While candidates passing the above filters tend to be good Alt-text image descriptions, a large majority use proper names (for people, venues, locations, organizations etc.). This is problematic because it is very difficult for an image captioning model to learn such fine-grained proper name inference from input image pixels, and also generate natural-language descriptions simultaneously (see footnote 1).

To address the above problems we wrote software that automatically replaces proper names with words representing the same general notion, i.e., with their concept. In some cases, the proper names are removed to simplify the text. For example, we substitute people's names (e.g., “Former Miss World Priyanka Chopra on the red carpet” becomes “actor on the red carpet”), remove location names (“Crowd at a concert in Los Angeles” becomes “Crowd at a concert”), remove named modifiers (e.g., “Italian cuisine” becomes just “cuisine”) and correct newly formed noun phrases if needed (e.g., “artist and artist” becomes “artists”, see the example illustration below).
Illustration of text modification. Image by Rockoleando used under CC BY 2.0 license.
Finally, we cluster all resolved entities (e.g., “artist”, “dog”, “neighborhood”, etc.) and keep only the candidate types which have a count of over 100 mentions, a quantity sufficient to support representation learning for these entities. This retained around 16K entity concepts such as: “person”, “actor”, “artist”, “player” and “illustration”. Less frequent ones that we retained include “baguette”, “bridle”, “deadline”, “ministry” and “funnel”.
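The frequency threshold in this last step can be sketched as a simple counting pass. The snippet below is only an illustration under stated assumptions: the data layout (a list of caption/concept pairs) and the choice to drop any caption that mentions a rare concept are ours, not details given in the post; only the "over 100 mentions" threshold comes from the text.

```python
from collections import Counter

MIN_MENTIONS = 100  # from the text: keep concepts with a count of over 100 mentions

def frequent_concepts(candidates, min_mentions=MIN_MENTIONS):
    """Return the set of concepts mentioned more than min_mentions times.

    candidates is a hypothetical list of (caption, concepts) pairs, where
    concepts is the list of resolved entity types found in the caption.
    """
    counts = Counter(c for _, concepts in candidates for c in concepts)
    return {c for c, n in counts.items() if n > min_mentions}

def filter_candidates(candidates, min_mentions=MIN_MENTIONS):
    """Keep only captions whose concepts are all frequent (an assumed policy)."""
    candidates = list(candidates)
    keep = frequent_concepts(candidates, min_mentions)
    return [(cap, cons) for cap, cons in candidates if all(c in keep for c in cons)]

# Hypothetical toy data: "actor" appears 101 times, "gizmo" exactly 100 times.
toy = [("actor on the red carpet", ["actor"])] * 101 + [("a gizmo", ["gizmo"])] * 100
retained = filter_candidates(toy)
```

With this toy input only the "actor" captions survive, because exactly 100 mentions does not count as over 100.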

In the end, it required roughly one billion (English) webpages containing over 5 billion candidate images to obtain a clean and learnable image caption dataset of over 3M samples (a rejection rate of 99.94%). Our control parameters were biased towards high precision, although these can be tuned to generate an order of magnitude more examples with lower precision.
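As a quick sanity check on the rejection rate quoted above, the arithmetic can be reproduced from the rounded figures in the text. The counts below are the lower bounds stated in the post ("over 5 billion" and "over 3M"), so the result is approximate.

```python
candidate_images = 5_000_000_000  # "over 5 billion candidate images"
retained_pairs = 3_000_000        # "over 3M samples"

# Fraction of candidate images that did not make it into the dataset.
rejection_rate = 1 - retained_pairs / candidate_images
formatted = f"{rejection_rate:.2%}"
```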

Dataset Impact
To test the usefulness of our dataset, we independently trained both RNN-based and Transformer-based image captioning models implemented in Tensor2Tensor (T2T), using the MS-COCO dataset (using 120K images with 5 human-annotated captions per image) and the new Conceptual Captions dataset (using over 3.3M images with 1 caption per image). See our paper for more details on model architectures.

These models were tested using images from the Flickr30K dataset (which are out-of-domain for both MS-COCO and Conceptual Captions), and the resulting captions were evaluated using 3 human raters per test case. The results are reported in the table below.
From these results we conclude that models trained on Conceptual Captions generalized better than competing approaches irrespective of the architecture (i.e., RNN or Transformer). In addition, we found that Transformer models did better than RNN when trained on either dataset. The conclusion from these findings is that Conceptual Captions provides the ability to train image captioning models that perform better on a wide variety of images.

Get Involved
It is our hope that this dataset will help the machine learning community advance the state of the art in image captioning models. Importantly, since no human annotators were involved in its creation, this dataset is highly scalable, potentially allowing the expansion of the dataset to enable automatic creation of Alt-text-HTML-like descriptions for an even wider variety of images. We encourage all those interested to partake in the Conceptual Captions Challenge, and we look forward to seeing what the community can do! For more details and the latest results please visit the challenge website.

Acknowledgements
Thanks to Nan Ding, Sebastian Goodman and Bo Pang for training models with the Conceptual Captions dataset, and to Amol Wankhede for driving the public release efforts for the dataset.


1 In our paper, we posit that if automatic determination of names, locations, brands, etc. from the image is needed, it should be done as a separate task that may leverage image meta-information (e.g. GPS info), or complementary techniques such as OCR.



Understanding Performance Fluctuations in Quantum Processors



One area of research the Google AI Quantum team pursues is building quantum processors from superconducting electrical circuits, which are attractive candidates for implementing quantum bits (qubits). While superconducting circuits have demonstrated state-of-the-art performance and extensibility to modest processor sizes comprising tens of qubits, an outstanding challenge is stabilizing their performance, which can fluctuate unpredictably. Although performance fluctuations have been observed in numerous superconducting qubit architectures, their origin isn’t well understood, impeding progress in stabilizing processor performance.

In “Fluctuations of Energy-Relaxation Times in Superconducting Qubits” published in this week’s Physical Review Letters, we use qubits as probes of their environment to show that performance fluctuations are dominated by material defects. This was done by investigating qubits’ energy relaxation times (T1) — a popular performance metric that gives the length of time that it takes for a qubit to undergo energy-relaxation from its excited to ground state — as a function of operating frequency and time.
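To make the T1 metric concrete, here is a minimal sketch of how a relaxation time could be estimated from decay data. It assumes the standard single-exponential model, in which the excited-state population after a delay t is exp(-t / T1); the delay values, units and function names are illustrative and do not describe the measurement procedure used in the paper.

```python
import math

def simulate_decay(t1_us, delays_us):
    """Ideal excited-state populations for a qubit with relaxation time t1_us."""
    return [math.exp(-t / t1_us) for t in delays_us]

def estimate_t1(delays_us, populations):
    """Estimate T1 with a least-squares fit of ln(P) = -t / T1 through the origin."""
    numerator = sum(t * math.log(p) for t, p in zip(delays_us, populations))
    denominator = sum(t * t for t in delays_us)
    slope = numerator / denominator  # equals -1 / T1 for ideal data
    return -1.0 / slope

# Hypothetical example: noiseless data generated with T1 = 25 microseconds.
delays = [5.0, 10.0, 20.0, 40.0, 80.0]
populations = simulate_decay(25.0, delays)
t1_estimate = estimate_t1(delays, populations)
```

Repeating such a fit at each operating frequency and at successive times would yield the kind of frequency-versus-time map of T1 that the post describes.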

In measuring T1, we found that some qubit operating frequencies are significantly worse than others, forming energy-relaxation hot-spots (see figure below). Our research suggests that these hot spots are due to material defects, which are themselves quantum systems that can extract energy from qubits when their frequencies overlap (i.e. are “resonant”). Surprisingly, we found that the energy-relaxation hot spots are not static, but “move” on timescales ranging from minutes to hours. From these observations, we concluded that the dynamics of defects’ frequencies into and out of resonance with qubits drives the most significant performance fluctuations.
Left: A quantum processor similar to the one that was used to investigate qubit performance fluctuations. One qubit is highlighted in blue. Right: One qubit’s energy-relaxation time “T1” plotted as a function of its operating frequency and time. We see energy-relaxation hotspots, which our data suggest are due to material defects (black arrowheads). The motion of these hotspots into and out of resonance with the qubit is responsible for the most significant energy-relaxation fluctuations. Note that these data were taken over a frequency band with an above-average density of defects.
These defects — which are typically referred to as two-level-systems (TLS) — are commonly believed to exist at the material interfaces of superconducting circuits. However, even after decades of research, their microscopic origin still puzzles researchers. In addition to clarifying the origin of qubit performance fluctuations, our data shed light on the physics governing defect dynamics, which is an important piece of this puzzle. Interestingly, from thermodynamics arguments we would not expect the defects that we see to exhibit any dynamics at all. Their energies are about one order of magnitude higher than the thermal energy available in our quantum processor, and so they should be “frozen out.” The fact that they are not frozen out suggests their dynamics may be driven by interactions with other defects that have much lower energies and can thus be thermally activated.
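The "frozen out" argument can be illustrated with a back-of-the-envelope comparison of a defect's energy, h·f, with the thermal energy, k·T. The frequency (5 GHz) and temperature (20 mK) below are assumed, typical-looking values chosen purely for illustration; the post itself does not state them.

```python
import math

PLANCK_H = 6.62607015e-34   # Planck constant in J*s (exact SI value)
BOLTZMANN_K = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def energy_ratio(freq_hz, temp_k):
    """Ratio of the energy h*f to the thermal energy k*T."""
    return PLANCK_H * freq_hz / (BOLTZMANN_K * temp_k)

# Assumed values, not taken from the post: a 5 GHz defect at 20 mK.
ratio = energy_ratio(5e9, 0.020)

# Boltzmann factor: rough relative likelihood of thermally exciting the defect.
thermal_factor = math.exp(-ratio)
```

Under these assumptions the ratio comes out at roughly 12, consistent with "about one order of magnitude", and the corresponding Boltzmann factor is tiny, which is why thermally driven dynamics would not be expected.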

The fact that qubits can be used to investigate individual material defects (which are believed to have atomic dimensions, millions of times smaller than our qubits) demonstrates that they are powerful metrological tools. While it’s clear that defect research could help address outstanding problems in materials physics, it’s perhaps surprising that it has direct implications for improving the performance of today’s quantum processors. In fact, defect metrology already informs our processor design and fabrication, and even the mathematical algorithms that we use to avoid defects during quantum processor runtime. We hope this research motivates further work into understanding material defects in superconducting circuits.



Teaching the Google Assistant to be Multilingual



Multilingual households are becoming increasingly common, with several sources [1][2][3] indicating that multilingual speakers already outnumber monolingual counterparts, and that this number will continue to grow. With this large and increasing population of multilingual users, it is more important than ever that Google develop products that can support multiple languages simultaneously to better serve our users.

Today, we’re launching multilingual support for the Google Assistant, which enables users to jump between two different languages across queries, without having to go back to their language settings. Once users select two of the supported languages, English, Spanish, French, German, Italian and Japanese, from there on out they can speak to the Assistant in either language and the Assistant will respond in kind. Previously, users had to choose a single language setting for the Assistant, changing their settings each time they wanted to use another language, but now, it’s a simple, hands-free experience for multilingual households.
The Google Assistant is now able to identify the language, interpret the query and provide a response using the right language without the user having to touch the Assistant settings.
Getting this to work, however, was not a simple feat. In fact, this was a multi-year effort that involved solving a lot of challenging problems. In the end, we broke the problem down into three discrete parts: Identifying Multiple Languages, Understanding Multiple Languages and Optimizing Multilingual Recognition for Google Assistant users.

Identifying Multiple Languages
People have the ability to recognize when someone is speaking another language, even if they do not speak the language themselves, just by paying attention to the acoustics of the speech (intonation, phonetic registry, etc.). However, defining a computational framework for automatic spoken language recognition is challenging, even with the help of full automatic speech recognition systems (see footnote 1). In 2013, Google started working on spoken language identification (LangID) technology using deep neural networks [4][5]. Today, our state-of-the-art LangID models can distinguish between pairs of languages in over 2000 alternative language pairs using recurrent neural networks, a family of neural networks which are particularly successful for sequence modeling problems, such as those in speech recognition, voice detection, speaker recognition and others. One of the challenges we ran into was working with larger sets of audio: getting models that can automatically understand multiple languages at scale, and hitting a quality standard that allowed those models to work properly.

Understanding Multiple Languages
To understand more than one language at once, multiple processes need to be run in parallel, each producing incremental results, allowing the Assistant not only to identify the language in which the query is spoken but also to parse the query to create an actionable command. For example, even for a monolingual environment, if a user asks to “set an alarm for 6pm”, the Google Assistant must understand that “set an alarm” implies opening the clock app, fulfill the explicit parameter of “6pm” and additionally infer that the alarm should be set for today. Making this work for any given pair of supported languages is a challenge, as the Assistant executes the same work it does for the monolingual case, but now must additionally enable LangID, and not just one but two monolingual speech recognition systems simultaneously (we’ll explain more about the current two-language limitation later in this post).

Importantly, the Google Assistant and other services that are referenced in the user’s query asynchronously generate real-time incremental results that need to be evaluated in a matter of milliseconds. This is accomplished with the help of an additional algorithm that ranks the transcription hypotheses provided by each of the two speech recognition systems using the probabilities of the candidate languages produced by LangID, our confidence on the transcription and the user’s preferences (such as favorite artists, for example).
Schematic of our multilingual speech recognition system used by the Google Assistant versus the standard monolingual speech recognition system. A ranking algorithm is used to select the best recognition hypotheses from two monolingual speech recognizers using relevant information about the user and the incremental LangID results.
When the user stops speaking, the model has not only determined what language was being spoken, but also what was said. Of course, this process requires a sophisticated architecture that comes with an increased processing cost and the possibility of introducing unnecessary latency.
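The ranking step described above can be sketched in a few lines. This is a toy illustration, not the Assistant's actual ranking algorithm: the scoring formula (LangID probability times recognizer confidence, plus a small bonus for user-preferred terms), the `Hypothesis` class and the bonus value are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    language: str          # language of the recognizer that produced it
    transcript: str        # candidate transcription
    asr_confidence: float  # recognizer's own confidence, assumed in [0, 1]

def rank_hypotheses(hypotheses, langid_probs, preferred_terms=(), preference_bonus=0.1):
    """Order hypotheses from best to worst under an assumed scoring rule."""
    def score(h):
        s = langid_probs.get(h.language, 0.0) * h.asr_confidence
        # Hypothetical user-preference signal, e.g. a favorite artist's name.
        if any(term.lower() in h.transcript.lower() for term in preferred_terms):
            s += preference_bonus
        return s
    return sorted(hypotheses, key=score, reverse=True)

# Hypothetical usage with one hypothesis from each monolingual recognizer.
candidates = [
    Hypothesis("es", "plei chaquira", 0.4),
    Hypothesis("en", "play shakira", 0.8),
]
ranked = rank_hypotheses(candidates, {"en": 0.7, "es": 0.3}, preferred_terms=["Shakira"])
best = ranked[0]
```

In a streaming setting this ranking would be re-run as each recognizer and LangID emit new incremental results.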

Optimizing Multilingual Recognition
To minimize these undesirable effects, the faster the system can make a decision about which language is being spoken, the better. If the system becomes certain of the language being spoken before the user finishes a query, then it will stop running the user’s speech through the losing recognizer and discard the losing hypothesis, thus lowering the processing cost and reducing any potential latency. With this in mind, we saw several ways of optimizing the system.
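The idea of committing early to one language can be sketched as a loop over incremental LangID outputs. The fixed confidence threshold used here is a deliberately simple, hypothetical stand-in for the real decision rule, and the function name and data layout are assumptions.

```python
def commit_language(langid_stream, threshold=0.9):
    """Return (language, step) once one language's probability reaches threshold.

    langid_stream is a hypothetical iterable of dicts mapping language codes
    to probabilities, one dict per incremental LangID result. If no language
    ever reaches the threshold, returns (None, number_of_steps_seen).
    """
    steps = 0
    for steps, probs in enumerate(langid_stream, start=1):
        language, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:
            # From here on, only the winning recognizer needs to keep running.
            return language, steps
    return None, steps

# Hypothetical incremental LangID outputs for a single query.
stream = [
    {"en": 0.55, "fr": 0.45},
    {"en": 0.70, "fr": 0.30},
    {"en": 0.93, "fr": 0.07},
    {"en": 0.99, "fr": 0.01},
]
decision = commit_language(stream)
```

In this example the decision is made at the third step, so the losing recognizer would not need to process the rest of the query.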

One use case we considered was that people normally use the same language throughout their query (which is also the language users generally want to hear back from the Assistant), with the exception of asking about entities with names in different languages. This means that, in most cases, focusing on the first part of the query allows the Assistant to make a preliminary guess of the language being spoken, even in sentences containing entities in a different language. With this early identification, the task is simplified by switching to a single monolingual speech recognizer, as we do for monolingual queries. Making a quick decision about how and when to commit to a single language, however, requires a final technological twist: specifically, we use a random forest technique that combines multiple contextual signals, such as the type of device being used, the number of speech hypotheses found, how often we receive similar hypotheses, the uncertainty of the individual speech recognizers, and how frequently each language is used.

An additional way we simplified and improved the quality of the system was to limit the list of candidate languages users can select. Users can choose two languages out of the six that our Home devices currently support, which will allow us to support the majority of our multilingual speakers. As we continue to improve our technology, however, we hope to tackle trilingual support next, knowing that this will further enhance the experience of our growing user base.

Bilingual to Trilingual
From the beginning, our goal has been to make the Assistant naturally conversational for all users. Multilingual support has been a highly-requested feature, and it’s something our team set its sights on years ago. But there aren’t just a lot of bilingual speakers around the globe today; we also want to make life a little easier for trilingual users, or families that live in homes where more than two languages are spoken.

With today’s update, we’re on the right track, and it was made possible by our advanced machine learning, our speech and language recognition technologies, and our team’s commitment to refine our LangID model. We’re now working to teach the Google Assistant how to process more than two languages simultaneously, and are working to add more supported languages in the future — stay tuned!


1 It is typically acknowledged that spoken language recognition is remarkably more challenging than text-based language identification, where relatively simple techniques based on dictionaries can do a good job. The time/frequency patterns of spoken words are difficult to compare, spoken words can be more difficult to delimit as they can be spoken without pause and at different paces, and microphones may record background noise in addition to speech.



Introducing a New Framework for Flexible and Reproducible Reinforcement Learning Research



Reinforcement learning (RL) research has seen a number of significant advances over the past few years. These advances have allowed agents to play games at a super-human level; notable examples include DeepMind’s DQN on Atari games along with AlphaGo and AlphaGo Zero, as well as OpenAI Five. Specifically, the introduction of replay memories in DQN enabled leveraging previous agent experience, large-scale distributed training enabled distributing the learning process across multiple workers, and distributional methods allowed agents to model full distributions, rather than simply their expected values, to learn a more complete picture of their world. This type of progress is important, as the algorithms yielding these advances are additionally applicable to other domains, such as robotics (see our recent work on robotic manipulation and teaching robots to visually self-adapt).
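For readers new to the idea, a replay memory of the kind mentioned above can be sketched in a few lines. This is a generic, minimal illustration and not the implementation shipped with any particular framework: the class name, the transition layout and the uniform sampling strategy are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return batch_size distinct transitions for a training update."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)

# Hypothetical usage: store five transitions in a buffer that holds three.
buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
batch = buffer.sample(2)
```

Sampling old transitions at random, rather than learning only from the most recent one, is what lets the agent reuse earlier experience.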

Quite often, developing these kinds of advances requires quickly iterating over a design, often with no clear direction, and disrupting the structure of established methods. However, most existing RL frameworks do not provide the combination of flexibility and stability that enables researchers to iterate on RL methods effectively, and thus explore new research directions that may not have immediately obvious benefits. Further, reproducing the results from existing frameworks is often too time-consuming, which can lead to scientific reproducibility issues down the line.

Today we’re introducing a new TensorFlow-based framework that aims to provide flexibility, stability, and reproducibility for new and experienced RL researchers alike. Inspired by one of the main components in reward-motivated behaviour in the brain and reflecting the strong historical connection between neuroscience and reinforcement learning research, this platform aims to enable the kind of speculative research that can drive radical discoveries. This release also includes a set of colabs that clarify how to use our framework.

Ease of Use
Clarity and simplicity are two key considerations in the design of this framework. The code we provide is compact (about 15 Python files) and is well-documented. This is achieved by focusing on the Arcade Learning Environment (a mature, well-understood benchmark), and four value-based agents: DQN, C51, a carefully curated simplified variant of the Rainbow agent, and the Implicit Quantile Network agent, which was presented only last month at the International Conference on Machine Learning (ICML). We hope this simplicity makes it easy for researchers to understand the inner workings of the agent and to quickly try out new ideas.

Reproducibility
We are particularly sensitive to the importance of reproducibility in reinforcement learning research. To this end, we provide our code with full test coverage; these tests also serve as an additional form of documentation. Furthermore, our experimental framework follows the recommendations given by Machado et al. (2018) on standardizing empirical evaluation with the Arcade Learning Environment.

Benchmarking
It is important for new researchers to be able to quickly benchmark their ideas against established methods. As such, we are providing the full training data of the four provided agents, across the 60 games supported by the Arcade Learning Environment, available as Python pickle files (for agents trained with our framework) and as JSON data files (for comparison with agents trained in other frameworks); we additionally provide a website where you can quickly visualize the training runs for all provided agents on all 60 games. Below we show the training runs for our 4 agents on Seaquest, one of the Atari 2600 games supported by the Arcade Learning Environment.
The training runs for our 4 agents on Seaquest. The x-axis represents iterations, where each iteration is 1 million game frames (4.5 hours of real-time play); the y-axis is the average score obtained per play. The shaded areas show confidence intervals from 5 independent runs.
We are also providing the trained deep networks from these agents, the raw statistics logs, as well as the TensorFlow event files for plotting with TensorBoard. These can all be found in the downloads section of our site.

Our hope is that our framework’s flexibility and ease-of-use will empower researchers to try out new ideas, both incremental and radical. We are already actively using it for our research and finding that it gives us the flexibility to iterate quickly over many ideas. We’re excited to see what the larger community can make of it. Check it out at our GitHub repo, play with it, and let us know what you think!

Acknowledgements
This project was only possible thanks to several collaborations at Google. The core team includes Marc G. Bellemare, Pablo Samuel Castro, Carles Gelada, Subhodeep Moitra and Saurabh Kumar. We also extend a special thanks to Sergio Guadamarra, Ofir Nachum, Yifan Wu, Clare Lyle, Liam Fedus, Kelvin Xu, Emilio Parisoto, Hado van Hasselt, Georg Ostrovski and Will Dabney, and the many people at Google who helped us test it out.
