Posted by Anurag Batra and Parker Barnes, Product Managers, Google AI
Recently, we introduced the Inclusive Images Kaggle competition, part of the NeurIPS 2018 Competition Track, with the goal of stimulating research into the effect of geographic skews in training datasets on ML model performance and spurring innovation in developing more inclusive models. While the competition has concluded, the broader movement to build more diverse datasets is just beginning.
Today, we’re announcing Open Images Extended, a new branch of Google’s Open Images dataset, which is intended to be a collection of complementary datasets with additional images and/or annotations that better represent global diversity. The first set we are adding is the Crowdsourced extension, which is seeded with 478K+ images donated by Crowdsource app users from all around the world.
About the Crowdsourced Extension of Open Images Extended
To bring greater geographic diversity to Open Images, we enabled the global community of Crowdsource app users to photograph the world around them and make their photos available to researchers and developers as part of the Open Images Extended dataset. A large majority of these images are from India, with some representation from the Middle East, Africa and Latin America.
The images focus on some key categories like household objects, plants & animals, food, and people in various professions (all faces are blurred to protect privacy). Detailed information about the composition of the dataset can be found here.
Pictures from India and Singapore contributed using the Crowdsource app.
Get Involved
This is an early step on a long journey. To build inclusive ML products, training data must represent global diversity along several dimensions. To that end, we invite the global community to help expand the Open Images Extended dataset by contributing imagery from your own hometown and community. Download the Crowdsource Android app to contribute images you’ve taken with your phone, or contact us if there are other image repositories (that you have the rights for) that you’re interested in adding to the Open Images dataset.
Acknowledgements
The release of Open Images Extended has been possible thanks to the hard work of a lot of people including, but not limited to, the following (in alphabetical order of last name): James Atwood, Pallavi Baljekar, Peggy Chi, Tulsee Doshi, Tom Duerig, Vittorio Ferrari, Akshay Gaur, Victor Gomes, Yoni Halpern, Gursheesh Kaur, Mahima Pushkarna, Jigyasa Saxena, D. Sculley, Richa Singh, Rachelle Summers.
Posted by James Cook, Yechen Li, Software Engineers and Ravi Kumar, Research Scientist
"When Solomon said there was a time and a place for everything he had not encountered the problem of parking his automobile." -Bob Edwards, Broadcast Journalist
Much of driving is spent either stuck in traffic or looking for parking. With products like Google Maps and Waze, it is our long-standing goal to help people navigate the roads easily and efficiently. But until now, there wasn’t a tool to address the all-too-common parking woes.
Last week, we launched a new feature for Google Maps for Android across 25 US cities that offers predictions about parking difficulty close to your destination so you can plan accordingly. Providing this feature required addressing some significant challenges:
Parking availability is highly variable, based on factors like the time, day of week, weather, special events, holidays, and so on. Compounding the problem, there is almost no real-time information about free parking spots.
Even in areas with internet-connected parking meters providing information on availability, this data doesn’t account for those who park illegally, park with a permit, or depart early from still-paid meters.
Roads form a mostly-planar graph, but parking structures may be more complex, with traffic flows across many levels, possibly with different layouts.
Both the supply and the demand for parking are in constant flux, so even the best system is at risk of being outdated as soon as it’s built.
To face these challenges, we used a unique combination of crowdsourcing and machine learning (ML) to build a system that can provide you with parking difficulty information for your destination, and even help you decide what mode of travel to take — in a pre-launch experiment, we saw a significant increase in clicks on the transit travel mode button, indicating that users with additional knowledge of parking difficulty were more likely to consider public transit rather than driving.
Three technical pieces were required to build the algorithms behind the parking difficulty feature: good ground truth data from crowdsourcing, an appropriate ML model and a robust set of features to train the model on.
Ground Truth Data
Gathering high-quality ground truth data is often a key challenge in building any ML solution. We began by asking individuals at a diverse set of locations and times if they found the parking difficult. But we learned that answers to subjective questions like this produce inconsistent results: for a given location and time, one person may answer that it was “easy” to find parking while another found it “difficult.” Switching to objective questions like “How long did it take to find parking?” led to an increase in answer confidence, enabling us to crowdsource a high-quality set of ground truth data with over 100K responses.
Model Features
With this data available, we began to determine features we could train a model on. Fortunately, we were able to turn to the wisdom of the crowd, and utilize anonymous aggregated information from users who opt to share their location data, which already is a vital source of information for estimates of live traffic or popular times and visit durations.
We quickly discovered that even with this data, some unique challenges remain. For example, our system shouldn’t be fooled into thinking parking is plentiful if someone is parking in a gated or private lot. Users arriving by taxi might look like a sign of abundant parking at the front door, and similarly, public-transit users might seem to park at bus stops. These false positives, and many others, all have the potential to mislead an ML system.
So we needed more robust aggregate features. Perhaps not surprisingly, the inspiration for one of these features came from our own backyard in downtown Mountain View. If Google navigation observes many users circling downtown Mountain View during lunchtime along trajectories like this one, it strongly suggests that parking might be difficult:
Our team thought about how to recognize this “fingerprint” of difficult parking as a feature to train on. In this case, we aggregate the difference between when a user should have arrived at a destination if they simply drove to the front door, versus when they actually arrived, taking into account circling, parking, and walking. If many users show a large gap between these two times, we expect this to be a useful signal that parking is difficult.
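To make the idea concrete, here is a minimal sketch of how such a gap could be aggregated over trips. It is not the production feature: the function names, the `(expected_arrival, actual_arrival)` trip format in seconds, and the 300-second "large gap" threshold are all assumptions chosen for illustration.

```python
from statistics import median


def arrival_gap_seconds(expected_arrival, actual_arrival):
    """Extra time between the expected front-door arrival and the actual arrival.

    Negative gaps (arriving early) are floored at zero.
    """
    return max(0.0, actual_arrival - expected_arrival)


def fingerprint_feature(trips, large_gap_seconds=300.0):
    """Aggregate gaps over many trips to one destination.

    trips: list of (expected_arrival, actual_arrival) pairs, in seconds.
    large_gap_seconds: hypothetical threshold for a "large" gap.
    Returns (median gap, fraction of trips with a large gap).
    """
    gaps = [arrival_gap_seconds(e, a) for e, a in trips]
    if not gaps:
        return 0.0, 0.0
    frac_large = sum(g >= large_gap_seconds for g in gaps) / len(gaps)
    return median(gaps), frac_large


# Toy trips (made-up numbers): two quick arrivals, two long searches.
trips = [(1000.0, 1060.0), (2000.0, 2600.0), (3000.0, 3480.0), (4000.0, 3990.0)]
median_gap, frac_large = fingerprint_feature(trips)
print(median_gap, frac_large)
```

Using a median and a fraction rather than a mean keeps one unusually long trip from dominating the aggregate, which seems a sensible default for noisy location data.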
From there, we continued to develop more features that took into account, for any particular destination, dispersion of parking locations, time-of-day and date dependence of parking (e.g. what if users park close to a destination in the early morning, but further away at busier hours?), historical parking data and more. In the end, we decided on roughly twenty different features along these lines for our model. Then it was time to tune the model performance.
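As a rough illustration of a dispersion-style feature, the sketch below computes the mean distance between a destination and the places where users ended up parking. The haversine formula is standard; the function names, the coordinates, and the choice of "mean distance" as the dispersion measure are assumptions for this example, not a description of the features we shipped.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def parking_dispersion_m(destination, parking_points):
    """Mean distance (meters) from the destination to observed parking locations."""
    if not parking_points:
        return 0.0
    dlat, dlon = destination
    total = sum(haversine_m(dlat, dlon, lat, lon) for lat, lon in parking_points)
    return total / len(parking_points)


# Illustrative coordinates only: tight parking early, scattered parking at noon.
destination = (37.3894, -122.0819)
morning_points = [(37.3895, -122.0820), (37.3893, -122.0818)]
noon_points = [(37.3920, -122.0850), (37.3870, -122.0790), (37.3905, -122.0780)]

morning_dispersion = parking_dispersion_m(destination, morning_points)
noon_dispersion = parking_dispersion_m(destination, noon_points)
print(round(morning_dispersion), round(noon_dispersion))
```

Computing this per time-of-day bucket would capture the pattern described above, where users park close to a destination in the early morning but further away at busier hours.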
Model Selection & Training
We decided to use a standard logistic regression ML model for this feature, for a few different reasons. First, the behavior of logistic regression is well understood, and it tends to be resilient to noise in the training data; this is a useful property when the data comes from crowdsourcing a complicated response variable like difficulty of parking. Second, it’s natural to interpret the output of these models as the probability that parking will be difficult, which we can then map into descriptive terms like “Limited parking” or “Easy.” Third, it’s easy to understand the influence of each specific feature, which makes it easier to verify that the model is behaving reasonably. For example, when we started the training process, many of us thought that the “fingerprint” feature described above would be the “silver bullet” that would crack the problem for us. We were surprised to note that this wasn’t the case at all. In fact, features based on the dispersion of parking locations turned out to be among the most powerful predictors of parking difficulty.
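The sketch below shows, on toy data, the three properties mentioned above: a plain logistic regression, a probability output mapped to a descriptive term, and per-feature weights that can be inspected. It is a self-contained illustration written with batch gradient descent, not our training pipeline; the two features, the training values, the learning rate, and the 0.5 threshold are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Probability that parking is difficult for feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, target in zip(X, y):
            err = predict_proba(weights, bias, x) - target
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def describe(probability, threshold=0.5):
    """Map a probability to a descriptive term (threshold is hypothetical)."""
    return "Limited parking" if probability >= threshold else "Easy"


# Toy rows: [scaled arrival gap, scaled parking dispersion]; label 1 = difficult.
X = [[0.1, 0.1], [0.2, 0.3], [0.3, 0.2], [0.7, 0.8], [0.8, 0.9], [0.9, 0.7]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = train(X, y)

print(weights, bias)  # inspect the influence of each feature
print(describe(predict_proba(weights, bias, [0.9, 0.9])))
print(describe(predict_proba(weights, bias, [0.1, 0.1])))
```

Because the model is linear in its inputs, comparing the learned weights (on comparably scaled features) is a direct way to see which feature the model leans on most.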
Results
With our model in hand, we were able to generate an estimate for difficulty of parking at any place and time. The figure below gives a few examples of the output of our system, which is then used to provide parking difficulty estimates for a given destination. Parking on Monday mornings, for instance, is difficult throughout the city, especially in the busiest financial and retail areas. On Saturday night, things are busy again, but now predominantly in the areas with restaurants and attractions.
Output of our parking difficulty model in the Financial District and Union Square areas of San Francisco. Red denotes a higher confidence that parking is difficult. Top row: a typical Monday at ~8am (left) and ~9pm (right). Bottom row: the same times but on a typical Saturday.
We’re excited about the opportunities to continue to improve the model quality based on user feedback. If we are able to better understand parking difficulty, we will be able to develop new and smarter forms of parking assistance — we’re very excited about future applications of ML to help make transportation more enjoyable!
Posted by Linne Ha, Senior Program Manager, Google Research for Low Resource Languages
Building a decent text-to-speech (TTS) voice for any language can be challenging, but creating one – a good, intelligible one – for a low resource language can be downright impossible. By definition, working with low resource languages can feel like a losing proposition – from the get-go, there is not enough audio data, and the data that exists may be questionable in quality. High quality audio data, and lots of it, is key to developing a high quality machine learning model. To make matters worse, most of the world’s oldest, richest spoken languages fall into this category. There are currently over 300 languages, each spoken by at least one million people, and most will be overlooked by technologists for various reasons. One important reason is that there is not enough data to conduct meaningful research and development.
Project Unison is an ongoing Google research effort, in collaboration with the Speech team, to explore innovative approaches to building a TTS voice for low resource languages – quickly, inexpensively and efficiently. This blog post will be one of several to track progress of this experiment and to share our experience with the research community at large – our successes and failures in a trial and error, iterative approach – as our adventure plays out.
One of the most critical aspects of building a TTS system is acquiring audio data. The traditional way to do this is in a professional recording studio with a voice talent, sound engineer and a voice director. The process can take considerable time and can be quite expensive. People often assume that voice talent work is similar to news reading, but it is highly specialized and the work can be very difficult.
Such investments in time and money may yield great audio, but the catch is that even if you’ve created the best TTS voice from these recordings, at best it will still sound exactly like the voice talent - the person who provided the raw audio data. (We’ve read the articles about people who have fallen for their GPS voice to find that they are real people with real names.) So the interesting problem here from a research perspective is how to create a voice that sounds human but is not identifiable as a singular person.
Crowd-sourcing projects for automatic speech recognition (ASR) for Google Voice Search had been successful in the past, with public volunteers eager to participate by providing voice samples. For ASR, the objective is to collect from a diversity of speakers and environments, capturing varying regional accents. The polar opposite is true of TTS, where the basic criteria are one unique speaker, with the standard accent, recorded in a soundproof studio.
Many years ago, Yannis Agiomyrgiannakis, Digital Signal Processing researcher on the TTS team in Google London, wrote a “manifesto” for acoustic data collection for 2000 languages. In his document, he gave technical specifications on how to convert an average room into a recording studio. Knot Pipatsrisawat, software engineer in Google Research for Low Resource Languages, built a tool that we call “ChitChat”, a portable recording studio, using Yannis’ specifications. This web app allows users to read the prompt, play back the recording and even assess the noise level of the room.
From other past research in ASR, we knew that the right tool could solve the crowdsourcing problem. ChitChat allowed us to experiment in different environments to get an idea of what kind of office space would work and what kind of problems we might encounter. After experimenting with several different laptops and tablets, we were able to find a computer that recognized the necessary peripherals (the microphone, USB converter, and preamp) for under $2,000 – much cheaper than a recording studio!
Now we needed multiple speakers of a single language. For us, it was a no-brainer to pilot Project Unison with Bangladeshi Googlers, all of whom are passionate about getting Google products to their home country (the success of Android products in Bangladesh is an example of this). Googlers by and large are passionate about their work and many offer their 20% time as a way to help, to improve or to experiment on something that may or may not work because they care. The Bangladeshi Googlers are no exception. They embodied our objectives for a crowdsourcing innovation: out of many, we could achieve (literally) one voice.
With multiple speakers, we would target speakers of similar vocal profiles and adapt them to create a blended voice. Statistical parametric synthesis is not new, but the advances in recent technology have improved quality and proved to be a lightweight solution for a project like ours.
In May of this year, we auditioned 15 Bangladeshi Googlers in Mountain View. From these recordings, the broader Bangladeshi Google community voted blindly for their preferred voice. Zakaria Haque, software engineer in Machine Intelligence, was chosen as our reference for the Bangla voice. We then narrowed down the group to five speakers based on these criteria: Dhaka accent, male (to match Zakaria’s), similarity in pitch and tone, and availability for recordings. The original plan of a spectral analysis using PRAAT proved to be unnecessary with our limited pool of candidates.
All 5 software engineers – Ahmed Chowdury, Mohammad Hossain, Syeed Faiz, Md. Arifuzzaman Arif, Sabbir Yousuf Sanny – plus Zakaria Haque recorded over 3 days in the anechoic chamber, a makeshift sound-proofed room at the Mountain View campus just before Ramadan. HyunJeong Choe, who had helped with the Korean TTS recordings, directed our volunteers.
Left: TPM Mohammad Khan measures the distance from the speaker to the mic to keep the sound quality consistent across all speakers. Right: Analytical Linguist HyunJeong Choe coaches SWE Ahmed Chowdury on how to speak in a friendly, knowledgeable, "Googly" voice
ChitChat allowed us to troubleshoot on the fly as recordings could be monitored from another room using the admin panel. In total, we recorded 2000 Bangla and English phrases mined from Wikipedia. In 30-60 minute intervals, the participants recorded over 250 sentences each.
In this session, we discovered an issue: a sudden drop in amplitude at high frequencies in a few recordings. We were worried that all the recordings might have to be scrapped.
As illustrated in the third image, speaker3 has a drop in energy above 13 kHz, which is visible in the graph and may be present during speech, distorting the speaker’s voice to sound as if he were speaking through a tube.
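A drop like this can also be checked numerically rather than by eye. The sketch below measures what fraction of a short frame's spectral energy lies at or above a cutoff frequency, using a naive discrete Fourier transform so that it needs only the standard library. It is an illustration, not the analysis we ran: the 48 kHz sample rate, the synthetic test tones, and the function name are assumptions, and only the 13 kHz cutoff comes from the recordings described above.

```python
import cmath
import math


def high_band_energy_fraction(samples, sample_rate, cutoff_hz):
    """Fraction of (non-DC) spectral energy at or above cutoff_hz.

    Uses a naive O(n^2) DFT, which is fine for a short analysis frame.
    """
    n = len(samples)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):  # skip DC, go up to the Nyquist bin
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


SAMPLE_RATE = 48000  # assumed sample rate, not stated in the post
N = 480              # 10 ms frame, so DFT bins are exactly 100 Hz apart


def tone(freq_hz, amplitude):
    """Synthetic sine tone standing in for real recorded audio."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(N)]


# One frame with content at 15 kHz, and one where that content is missing.
full_band = [a + b for a, b in zip(tone(1000, 1.0), tone(15000, 0.5))]
dropped = tone(1000, 1.0)

frac_full = high_band_energy_fraction(full_band, SAMPLE_RATE, 13000)
frac_dropped = high_band_energy_fraction(dropped, SAMPLE_RATE, 13000)
print(frac_full, frac_dropped)
```

Comparing this fraction across speakers recorded with the same setup would flag sessions where the high band is unexpectedly empty.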
Another challenge was that we didn’t have a pronunciation lexicon for Bangla as spoken in Bangladesh. We worked initially with the publicly available TTS data from the Indian Institute of Information Technology, but this represented the variant of Bangla spoken in West Bengal (India), which differs from the speech we recorded. Our internally designed pronunciation rules for Bengali were also aimed at West Bengal and would need to be revised later.
Deciding to proceed anyway, Alexander Gutkin, Speech software engineer and lead for TTS for Low Resource Languages in Google London, built an initial prototype voice. Using the preliminary text normalization rules created by Richard Sproat, Speech and Language Processing researcher, the first voice we attempted proved to be surprisingly good. The problem in the high frequencies we had seen in the recordings is undetectable in the parametric voice. When we return to the sound studio to record an additional 200 longer sentences, we plan to try an upgrade of the USB converter. Meanwhile, Martin Jansche, Natural Language Understanding software engineer, has worked with a team of native speakers on a pronunciation lexicon and model that better match the phonology of colloquial Bangladeshi Bangla. Alexander will use the additional recordings and the new pronunciation dictionary to build the second version.
NEXT UP: Building a parametric voice with multiple speaker data (Ep.2)