Tag Archives: crowd-sourcing

SCIN: A new resource for representative dermatology images

Posted by Pooja Rao, Research Scientist, Google Research

Health datasets play a crucial role in research and medical education, but it can be challenging to create a dataset that represents the real world. For example, dermatology conditions are diverse in their appearance and severity and manifest differently across skin tones. Yet, existing dermatology image datasets often lack representation of everyday conditions (like rashes, allergies and infections) and skew towards lighter skin tones. Furthermore, race and ethnicity information is frequently missing, hindering our ability to assess disparities or create solutions.

To address these limitations, we are releasing the Skin Condition Image Network (SCIN) dataset in collaboration with physicians at Stanford Medicine. We designed SCIN to reflect the broad range of concerns that people search for online, supplementing the types of conditions typically found in clinical datasets. It contains images across various skin tones and body parts, helping to ensure that future AI tools work effectively for all. We've made the SCIN dataset freely available as an open-access resource for researchers, educators, and developers, and have taken careful steps to protect contributor privacy.

Example set of images and metadata from the SCIN dataset.

Dataset composition

The SCIN dataset currently contains over 10,000 images of skin, nail, or hair conditions, directly contributed by individuals experiencing them. All contributions were made voluntarily with informed consent by individuals in the US, under an institutional-review board approved study. To provide context for retrospective dermatologist labeling, contributors were asked to take images both close-up and from slightly further away. They were given the option to self-report demographic information and tanning propensity (self-reported Fitzpatrick Skin Type, i.e., sFST), and to describe the texture, duration and symptoms related to their concern.

One to three dermatologists labeled each contribution with up to five dermatology conditions, along with a confidence score for each label. The SCIN dataset contains these individual labels, as well as an aggregated and weighted differential diagnosis derived from them that could be useful for model testing or training. These labels were assigned retrospectively and are not equivalent to a clinical diagnosis, but they allow us to compare the distribution of dermatology conditions in the SCIN dataset with existing datasets.

The SCIN dataset contains largely allergic, inflammatory and infectious conditions while datasets from clinical sources focus on benign and malignant neoplasms.

While many existing dermatology datasets focus on malignant and benign tumors and are intended to assist with skin cancer diagnosis, the SCIN dataset consists largely of common allergic, inflammatory, and infectious conditions. The majority of images in the SCIN dataset show early-stage concerns — more than half arose less than a week before the photo, and 30% arose less than a day before the image was taken. Conditions within this time window are seldom seen within the health system and therefore are underrepresented in existing dermatology datasets.

We also obtained dermatologist estimates of Fitzpatrick Skin Type (estimated FST or eFST) and layperson labeler estimates of Monk Skin Tone (eMST) for the images. This allowed comparison of the skin condition and skin type distributions to those in existing dermatology datasets. Although we did not selectively target any skin types or skin tones, the SCIN dataset has a balanced Fitzpatrick skin type distribution (with more of Types 3, 4, 5, and 6) compared to similar datasets from clinical sources.

Self-reported and dermatologist-estimated Fitzpatrick Skin Type distribution in the SCIN dataset compared with existing un-enriched dermatology datasets (Fitzpatrick17k, PH², SKINL2, and PAD-UFES-20).

The Fitzpatrick Skin Type scale was originally developed as a photo-typing scale to measure the response of skin types to UV radiation, and it is widely used in dermatology research. The Monk Skin Tone scale is a newer 10-shade scale that measures skin tone rather than skin phototype, capturing more nuanced differences between the darker skin tones. While neither scale was intended for retrospective estimation using images, the inclusion of these labels is intended to enable future research into skin type and tone representation in dermatology. For example, the SCIN dataset provides an initial benchmark for the distribution of these skin types and tones in the US population.

The SCIN dataset has a high representation of women and younger individuals, likely reflecting a combination of factors. These could include differences in skin condition incidence, propensity to seek health information online, and variations in willingness to contribute to research across demographics.

Crowdsourcing method

To create the SCIN dataset, we used a novel crowdsourcing method, which we describe in the accompanying research paper co-authored with investigators at Stanford Medicine. This approach empowers individuals to play an active role in healthcare research. It allows us to reach people at earlier stages of their health concerns, potentially before they seek formal care. Crucially, this method uses advertisements on web search result pages — the starting point for many people’s health journey — to connect with participants.

Our results demonstrate that crowdsourcing can yield a high-quality dataset with a low spam rate. Over 97.5% of contributions were genuine images of skin conditions. After performing further filtering steps to exclude images that were out of scope for the SCIN dataset and to remove duplicates, we were able to release nearly 90% of the contributions received over the 8-month study period. Most images were sharp and well-exposed. Approximately half of the contributions include self-reported demographics, and 80% contain self-reported information relating to the skin condition, such as texture, duration, or other symptoms. We found that dermatologists’ ability to retrospectively assign a differential diagnosis depended more on the availability of self-reported information than on image quality.

Dermatologist confidence in their labels (scale from 1-5) depended on the availability of self-reported demographic and symptom information.

While perfect image de-identification can never be guaranteed, protecting the privacy of individuals who contributed their images was a top priority when creating the SCIN dataset. Through informed consent, contributors were made aware of potential re-identification risks and advised to avoid uploading images with identifying features. Post-submission privacy protection measures included manual redaction or cropping to exclude potentially identifying areas, reverse image searches to exclude publicly available copies and metadata removal or aggregation. The SCIN Data Use License prohibits attempts to re-identify contributors.

We hope the SCIN dataset will be a helpful resource for those working to advance inclusive dermatology research, education, and AI tool development. By demonstrating an alternative to traditional dataset creation methods, SCIN paves the way for more representative datasets in areas where self-reported data or retrospective labeling is feasible.

Acknowledgements

We are grateful to all our co-authors Abbi Ward, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, Pradeep Kumar S, Tiya Tiyasirisokchai, Sunny Virmani, Renee Wong, Yossi Matias, Greg S. Corrado, Dale R. Webster, Dawn Siegel (Stanford Medicine), Steven Lin (Stanford Medicine), Justin Ko (Stanford Medicine), Alan Karthikesalingam and Christopher Semturs. We also thank Yetunde Ibitoye, Sami Lachgar, Lisa Lehmann, Javier Perez, Margaret Ann Smith (Stanford Medicine), Rachelle Sico, Amit Talreja, Annisah Um’rani and Wayne Westerlind for their essential contributions to this work. Finally, we are grateful to Heather Cole-Lewis, Naama Hammel, Ivor Horn, Michael Howell, Yun Liu, and Eric Teasley for their insightful comments on the study design and manuscript.

Source: Google AI Blog

Uncovering Unknown Unknowns in Machine Learning

Posted by Lora Aroyo and Praveen Paritosh, Research Scientists, Google Research

The performance of machine learning (ML) models depends both on the learning algorithms, as well as the data used for training and evaluation. The role of the algorithms is well studied and the focus of a multitude of challenges, such as SQuAD, GLUE, ImageNet, and many others. In addition, there have been efforts to also improve the data, including a series of workshops addressing issues for ML evaluation. In contrast, research and challenges that focus on the data used for evaluation of ML models are not commonplace. Furthermore, many evaluation datasets contain items that are easy to evaluate, e.g., photos with a subject that is easy to identify, and thus they miss the natural ambiguity of real world context. The absence of ambiguous real-world examples in evaluation undermines the ability to reliably test machine learning performance, which makes ML models prone to develop “weak spots”, i.e., classes of examples that are difficult or impossible for a model to accurately evaluate, because that class of examples is missing from the evaluation set.

To address the problem of identifying these weaknesses in ML models, we recently launched the Crowdsourcing Adverse Test Sets for Machine Learning (CATS4ML) Data Challenge at HCOMP 2020 (open until 30 April, 2021 to researchers and developers worldwide). The goal of the challenge is to raise the bar in ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process. CATS4ML relies on people’s abilities and intuition to spot new data examples about which machine learning is confident, but actually misclassifies.

What are ML “Weak Spots”?
There are two categories of weak spots: known unknowns and unknown unknowns. Known unknowns are examples for which a model is unsure about the correct classification. The research community continues to study this in a field known as active learning, and has found the solution to be, in very general terms, to interactively solicit new labels from people on uncertain examples. For example, if a model is not certain whether or not the subject of a photo is a cat, a person is asked to verify; but if the system is certain, a person is not asked. While there is room for improvement in this area, what is comforting is that the confidence of the model is correlated with its performance, i.e., one can see what the model doesn’t know.

Unknown unknowns, on the other hand, are examples where a model is confident about its answer, but is actually wrong. Efforts to proactively discover unknown unknowns (e.g., Attenberg 2015 and Crawford 2019) have helped uncover a multitude of unintended machine behaviours. In contrast to such approaches for the discovery of unknown unknowns, generative adversarial networks (GANs) generate unknown unknowns for image recognition models in the form of optical illusions for computers that cause deep learning models to make mistakes beyond human perception. While GANs uncover model exploits in the event of an intentional manipulation, real-world examples can better highlight a model’s failures in its day-to-day performance. These real-world examples are the unknown unknowns of interest to CATS4ML — the challenge aims to gather unmanipulated examples that humans can reliably interpret but on which many ML models would confidently disagree.

Example illustrating how optical illusions for computers caused by adversarial noise help discover machine manipulated unknown unknowns for ML models (based on Brown 2018).

First Edition of CATS4ML Data Challenge: Open Images Dataset
The CATS4ML Data Challenge focuses on visual recognition, using images and labels from the Open Images Dataset. The target images for the challenge are selected from the Open Images Dataset along with a set of 24 target labels from the same dataset. The challenge participants are invited to invent new and creative ways to explore this existing publicly available dataset and, focussed on a list of pre-selected target labels, discover examples of unknown unknowns for ML models.

Examples from the Open Images Dataset as possible unknown unknowns for ML models.

CATS4ML is a complementary effort to FAIR’s recently introduced DynaBench research platform for dynamic data collection. Where DynaBench tackles issues with static benchmarks using ML models with humans in the loop, CATS4ML focuses on improving evaluation datasets for ML by encouraging the exploration of existing ML benchmarks for adverse examples that can be unknown unknowns. The results will help detect and avoid future errors, and also will give insights to model explainability.

In this way, CATS4ML aims to raise greater awareness of the problem by providing dataset resources that developers can use to uncover the weak spots of their algorithms. This will also inform researchers on how to create benchmark datasets for machine learning that are more balanced, diverse and socially aware.

Get Involved
We invite the global community of ML researchers and practitioners to join us in the effort of discovering interesting, difficult examples from the Open Images Dataset. Register on the challenge website, download the target images and labeled data, contribute the images you discover and join the competition for the winning participant!

To score points in this competition, participants should submit a set of image-label pairs that will be confirmed by human-in-the-loop raters, whose votes should be in disagreement with the average machine score for the label over a number of machine learning models.

An example of how a submitted image can score points. The same image can score as a false positive (Left) and as a false negative (Right) with two different labels. In both cases the human verification is in disagreement with the machine score. Participants score on submitted image-label pairs, which means that one and the same image can be an example of an ML unknown unknown for different labels.

The challenge is open until 30 April, 2021 to researchers and developers worldwide. To learn more about CATS4ML and how to join, please review these slides and visit the challenge website.

Acknowledgements
The release of the CATS4ML Data Challenge has been possible thanks to the hard work of a lot of people including, but not limited to, the following (in alphabetical order of last name): Osman Aka, Ken Burke, Tulsee Doshi, Mig Gerard, Victor Gomes, Shahab Kamali, Igor Karpov, Devi Krishna, Daphne Luong, Carey Radebaugh, Jamie Taylor, Nithum Thain, Kenny Wibowo, Ka Wong, and Tong Zhou.

Source: Google AI Blog

Recreating Historical Streetscapes Using Deep Learning and Crowdsourcing

Posted by Raimondas Kiveris, Software Engineer, Google Research

For many, gazing at an old photo of a city can evoke feelings of both nostalgia and wonder — what was it like to walk through Manhattan in the 1940s? How much has the street one grew up on changed? While Google Street View allows people to see what an area looks like in the present day, what if you want to explore how places looked in the past?

To create a rewarding “time travel” experience for both research and entertainment purposes, we are launching rǝ (pronounced as re”turn"), an open source, scalable system running on Google Cloud and Kubernetes that can reconstruct cities from historical maps and photos, representing an implementation of our suite of open source tools launched earlier this year. Referencing the common prefix meaning again or anew, rǝ is meant to represent the themes of reconstruction, research, recreation and remembering behind this crowdsourced research effort, and consists of three components:

A crowdsourcing platform, which allows users to upload historical maps of cities, georectify (i.e., match them to real world coordinates), and vectorize them
A temporal map server, which shows how maps of cities change over time
A 3D experience platform, which runs on top of the rǝ map server, creating the 3D experience by using deep learning to reconstruct buildings in 3D from limited historical images and maps data.

Our goal is for rǝ to become a compendium that allows history enthusiasts to virtually experience historical cities around the world, aids researchers, policy makers and educators, and provides a dose of nostalgia to everyday users.

Bird’s eye view of Chelsea, Manhattan with a time slider from 1890 to 1970, crafted from historical photos and maps and using rǝ’s 3D reconstruction pipeline and colored with a preset Manhattan-inspired palette.

Crowdsourcing Data from Historical Maps
Reconstructing how cities used to look at scale is a challenge — historical image data is more difficult to work with than modern data, as there are far fewer images available and much less metadata captured from the images. To help with this difficulty, the rǝ maps module is a suite of open source tools that work together to create a map server with a time dimension, allowing users to jump back and forth between time periods using a slider. These tools allow users to upload scans of historical print maps, georectify them to match real world coordinates, and then convert them to vector format by tracing their geographic features. These vectorized maps are then served on a tile server and rendered as slippy maps, which lets the user zoom in and pan around.

Sub-modules of the rǝ suite of tools

The entry point of the rǝ maps module is Warper, a web app that allows users to upload historical images of maps and georectify them by finding control points on the historical map and corresponding points on a base map. The next app, Editor, allows users to load the georectified historical maps as the background and then trace their geographic features (e.g., building footprints, roads, etc.). This traced data is stored in an OpenStreetMap (OSM) vector format. They are then converted to vector tiles and served from the Server app, a vector tile server. Finally, our map renderer, Kartta, visualizes the spatiotemporal vector tiles allowing the users to navigate space and time on historical maps. These tools were built on top of numerous open source resources including OpenStreetMap, and we intend for our tools and data to be completely open source as well.

Warper and Editor work together to let users upload a map, anchor it to a base map using control points, and trace geographic features like building footprints and roads.

3D Experience
The 3D Models module aims to reconstruct the detailed full 3D structures of historical buildings using the associated images and maps data, organize these 3D models properly in one repository, and render them on the historical maps with a time dimension.

In many cases, there is only one historical image available for a building, which makes the 3D reconstruction an extremely challenging problem. To tackle this challenge, we developed a coarse-to-fine reconstruction-by-recognition algorithm.

High-level overview of rǝ’s 3D reconstruction pipeline, which takes annotated images and maps and prepares them for 3D rendering.

Starting with footprints on maps and façade regions in historical images (both are annotated by crowdsourcing or detected by automatic algorithms), the footprint of one input building is extruded upwards to generate its coarse 3D structure. The height of this extrusion is set to the number of floors from the corresponding metadata in the maps database.

In parallel, instead of directly inferring the detailed 3D structures of each façade as one entity, the 3D reconstruction pipeline recognizes all individual constituent components (e.g., windows, entries, stairs, etc.) and reconstructs their 3D structures separately based on their categories. Then these detailed 3D structures are merged with the coarse one for the final 3D mesh. The results are stored in a 3D repository and ready for 3D rendering.

The key technology powering this feature is a number of state-of-art deep learning models:

Faster region-based convolutional neural networks (RCNN) were trained using the façade component annotations for each target semantic class (e.g., windows, entries, stairs, etc), which are used to localize bounding-box level instances in historical images.
DeepLab, a semantic segmentation model, was trained to provide pixel-level labels for each semantic class.
A specifically designed neural network was trained to enforce high-level regularities within the same semantic class. This ensured that windows generated on a façade were equally spaced and consistent in shape with each other. This also facilitated consistency across different semantic classes such as stairs to ensure they are placed at reasonable positions and have consistent dimensions relative to the associated entry ways.

Key Results

Street level view of 3D-reconstructed Chelsea, Manhattan

Conclusion
With rǝ, we have developed tools that facilitate crowdsourcing to tackle the main challenge of insufficient historical data when recreating virtual cities. The 3D experience is still a work-in-progress and we aim to improve it with future updates. We hope rǝ acts as a nexus for an active community of enthusiasts and casual users that not only utilizes our historical datasets and open source code, but actively contributes to both.

Acknowledgements
This effort has been successful thanks to the hard work of many people, including, but not limited to the following (in alphabetical order of last name): Yale Cong, Feng Han, Amol Kapoor, Raimondas Kiveris, Brandon Mayer, Mark Phillips, Sasan Tavakkol, and Tim Waters (Waters Geospatial Ltd).

Source: Google AI Blog

Recreating Historical Streetscapes Using Deep Learning and Crowdsourcing

Posted by Raimondas Kiveris, Software Engineer, Google Research

A crowdsourcing platform, which allows users to upload historical maps of cities, georectify (i.e., match them to real world coordinates), and vectorize them
A temporal map server, which shows how maps of cities change over time
A 3D experience platform, which runs on top of the rǝ map server, creating the 3D experience by using deep learning to reconstruct buildings in 3D from limited historical images and maps data.

Sub-modules of the rǝ suite of tools

Warper and Editor work together to let users upload a map, anchor it to a base map using control points, and trace geographic features like building footprints and roads.

High-level overview of rǝ’s 3D reconstruction pipeline, which takes annotated images and maps and prepares them for 3D rendering.

The key technology powering this feature is a number of state-of-art deep learning models:

Faster region-based convolutional neural networks (RCNN) were trained using the façade component annotations for each target semantic class (e.g., windows, entries, stairs, etc), which are used to localize bounding-box level instances in historical images.
DeepLab, a semantic segmentation model, was trained to provide pixel-level labels for each semantic class.
A specifically designed neural network was trained to enforce high-level regularities within the same semantic class. This ensured that windows generated on a façade were equally spaced and consistent in shape with each other. This also facilitated consistency across different semantic classes such as stairs to ensure they are placed at reasonable positions and have consistent dimensions relative to the associated entry ways.

Key Results

Street level view of 3D-reconstructed Chelsea, Manhattan

Source: Google AI Blog

Recreating Historical Streetscapes Using Deep Learning and Crowdsourcing

Posted by Raimondas Kiveris, Software Engineer, Google Research

A crowdsourcing platform, which allows users to upload historical maps of cities, georectify (i.e., match them to real world coordinates), and vectorize them
A temporal map server, which shows how maps of cities change over time
A 3D experience platform, which runs on top of the rǝ map server, creating the 3D experience by using deep learning to reconstruct buildings in 3D from limited historical images and maps data.

Sub-modules of the rǝ suite of tools

Warper and Editor work together to let users upload a map, anchor it to a base map using control points, and trace geographic features like building footprints and roads.

High-level overview of rǝ’s 3D reconstruction pipeline, which takes annotated images and maps and prepares them for 3D rendering.

The key technology powering this feature is a number of state-of-art deep learning models:

Faster region-based convolutional neural networks (RCNN) were trained using the façade component annotations for each target semantic class (e.g., windows, entries, stairs, etc), which are used to localize bounding-box level instances in historical images.
DeepLab, a semantic segmentation model, was trained to provide pixel-level labels for each semantic class.
A specifically designed neural network was trained to enforce high-level regularities within the same semantic class. This ensured that windows generated on a façade were equally spaced and consistent in shape with each other. This also facilitated consistency across different semantic classes such as stairs to ensure they are placed at reasonable positions and have consistent dimensions relative to the associated entry ways.

Key Results

Street level view of 3D-reconstructed Chelsea, Manhattan

Source: Google AI Blog

Adding Diversity to Images with Open Images Extended

Posted by Anurag Batra and Parker Barnes, Product Managers, Google AI

Recently, we introduced the Inclusive Images Kaggle competition, part of the NeurIPS 2018 Competition Track, with the goal of stimulating research into the effect of geographic skews in training datasets on ML model performance, and to spur innovation in developing more inclusive models. While the competition has concluded, the broader movement to build more diverse datasets is just beginning.

Today, we’re announcing Open Images Extended, a new branch of Google’s Open Images dataset, which is intended to be a collection of complementary datasets with additional images and/or annotations that better represent global diversity. The first set we are adding is the Crowdsourced extension which is seeded with 478K+ images donated by Crowdsource app users from all around the world.

About the Crowdsourced Extension of Open Images Extended
To bring greater geographic diversity to Open Images, we enabled the global community of Crowdsource app users to photograph the world around them and make their photos available to researchers and developers as part of the Open Images Extended dataset. A large majority of these images are from India, with some representation from the Middle East, Africa and Latin America.

The images, focus on some key categories like household objects, plants & animals, food, and people in various professions (all faces are blurred to protect privacy). Detailed information about the composition of the dataset can be found here.

Pictures from India and Singapore contributed using the Crowdsource app.

Get Involved
This is an early step on a long journey. To build inclusive ML products, training data must represent global diversity along several dimensions. To that end, we invite the global community to help expand the Open Images Extended dataset by contributing imagery from your own hometown and community. Download the Crowdsource Android app to contribute images you’ve taken from your phone, or contact us if there are other image repositories (that you have the rights for) that you’re interested in adding to open-images dataset.

Acknowledgements
The release of Open Images Extended has been possible thanks to the hard work of a lot of people including, but not limited to the following (in alphabetical order of last name): James Atwood, Pallavi Baljekar, Peggy Chi, Tulsee Doshi, Tom Duerig, Vittorio Ferrari, Akshay Gaur, Victor Gomes, Yoni Halpern, Gursheesh Kaur, Mahima Pushkarna, Jigyasa Saxena, D. Sculley, Richa Singh, Rachelle Summers.

Source: Google AI Blog

Using Machine Learning to predict parking difficulty

Posted by James Cook, Yechen Li, Software Engineers and Ravi Kumar, Research Scientist

"When Solomon said there was a time and a place for everything he had not encountered the problem of parking his automobile." -Bob Edwards, Broadcast Journalist

Much of driving is spent either stuck in traffic or looking for parking. With products like Google Maps and Waze, it is our long-standing goal to help people navigate the roads easily and efficiently. But until now, there wasn’t a tool to address the all-too-common parking woes.

Last week, we launched a new feature for Google Maps for Android across 25 US cities that offers predictions about parking difficulty close to your destination so you can plan accordingly. Providing this feature required addressing some significant challenges:

Parking availability is highly variable, based on factors like the time, day of week, weather, special events, holidays, and so on. Compounding the problem, there is almost no real time information about free parking spots.
Even in areas with internet-connected parking meters providing information on availability, this data doesn’t account for those who park illegally, park with a permit, or depart early from still-paid meters.
Roads form a mostly-planar graph, but parking structures may be more complex, with traffic flows across many levels, possibly with different layouts.
Both the supply and the demand for parking are in constant flux, so even the best system is at risk of being outdated as soon as it’s built.

To face these challenges, we used a unique combination of crowdsourcing and machine learning (ML) to build a system that can provide you with parking difficulty information for your destination, and even help you decide what mode of travel to take — in a pre-launch experiment, we saw a significant increase in clicks on the transit travel mode button, indicating that users with additional knowledge of parking difficulty were more likely to consider public transit rather than driving.

Three technical pieces were required to build the algorithms behind the parking difficulty feature: good ground truth data from crowdsourcing, an appropriate ML model and a robust set of features to train the model on.

Ground Truth Data
Gathering high-quality ground truth data is often a key challenge in building any ML solution. We began by asking individuals at a diverse set of locations and times if they found the parking difficult. But we learned that answers to subjective questions like this produces inconsistent results - for a given location and time, one person may answer that it was “easy” to find parking while another found it “difficult.” Switching to objective questions like “How long did it it take to find parking?” led to an increase in answer confidence, enabling us to crowdsource a high-quality set of ground truth data with over 100K responses.

Model Features
With this data available, we began to determine features we could train a model on. Fortunately, we were able to turn to the wisdom of the crowd, and utilize anonymous aggregated information from users who opt to share their location data, which already is a vital source of information for estimates of live traffic or popular times and visit durations.

We quickly discovered that even with this data, some unique challenges remain. For example, our system shouldn’t be fooled into thinking parking is plentiful if someone is parking in a gated or private lot. Users arriving by taxi might look like a sign of abundant parking at the front door, and similarly, public-transit users might seem to park at bus stops. These false positives, and many others, all have the potential to mislead an ML system.

So we needed more robust aggregate features. Perhaps not surprisingly, the inspiration for one of these features came from our own backyard in downtown Mountain View. If Google navigation observes many users circling downtown Mountain View during lunchtime along trajectories like this one, it strongly suggests that parking might be difficult:

Our team thought about how to recognize this “fingerprint” of difficult parking as a feature to train on. In this case, we aggregate the difference between when a user should have arrived at a destination if they simply drove to the front door, versus when they actually arrived, taking into account circling, parking, and walking. If many users show a large gap between these two times, we expect this to be a useful signal that parking is difficult.

From there, we continued to develop more features that took into account, for any particular destination, dispersion of parking locations, time-of-day and date dependence of parking (e.g. what if users park close to a destination in the early morning, but further away at busier hours?), historical parking data and more. In the end, we decided on roughly twenty different features along these lines for our model. Then it was time to tune the model performance.

Model Selection & Training
We decided to use a standard logistic regression ML model for this feature, for a few different reasons. First, the behavior of logistic regression is well understood, and it tends to be resilient to noise in the training data; this is a useful property when the data comes from crowdsourcing a complicated response variable like difficulty of parking. Second, it’s natural to interpret the output of these models as the probability that parking will be difficult, which we can then map into descriptive terms like “Limited parking” or “Easy.” Third, it’s easy to understand the influence of each specific feature, which makes it easier to verify that the model is behaving reasonably. For example, when we started the training process, many of us thought that the “fingerprint” feature described above would be the “silver bullet” that would crack the problem for us. We were surprised to note that this wasn’t the case at all — in fact, it was features based on the dispersion of parking locations that turned out to be one of the most powerful predictors of parking difficulty.

Results
With our model in hand, we were able to generate an estimate for difficulty of parking at any place and time. The figure below gives a few examples of the output of our system, which is then used to provide parking difficulty estimates for a given destination. Parking on Monday mornings, for instance, is difficult throughout the city, especially in the busiest financial and retail areas. On Saturday night, things are busy again, but now predominantly in the areas with restaurants and attractions.

Output of our parking difficulty model in the Financial District and Union Square areas of San Francisco. Red denotes a higher confidence that parking is difficult. Top row: a typical Monday at ~8am (left) and ~9pm (right). Bottom row: the same times but on a typical Saturday.

We’re excited about the opportunities to continue to improve the model quality based on user feedback. If we are able to better understand parking difficulty, we will be able to develop new and smarter forms of parking assistance — we’re very excited about future applications of ML to help make transportation more enjoyable!

Source: Google Research Blog

Crowdsourcing a Text-to-Speech voice for low resource languages (episode 1)

Posted by Linne Ha, Senior Program Manager, Google Research for Low Resource Languages

Building a decent text-to-speech (TTS) voice for any language can be challenging, but creating one – a good, intelligible one – for a low resource language can be downright impossible. By definition, working with low resource languages can feel like a losing proposition – from the get go, there is not enough audio data, and the data that exists may be questionable in quality. High quality audio data, and lots of it, is key to developing a high quality machine learning model. To make matters worse, most of the world’s oldest, richest spoken languages fall into this category. There are currently over 300 languages, each spoken by at least one million people, and most will be overlooked by technologists for various reasons. One important reason is that there is not enough data to conduct meaningful research and development.

Project Unison is an on-going Google research effort, in collaboration with the Speech team, to explore innovative approaches to building a TTS voice for low resource languages – quickly, inexpensively and efficiently. This blog post will be one of several to track progress of this experiment and to share our experience with the research community at large – our successes and failures in a trial and error, iterative approach – as our adventure plays out.

One of the most critical aspects of building a TTS system is acquiring audio data. The traditional way to do this is in a professional recording studio with a voice talent, sound engineer and a voice director. The process can take considerable time and can be quite expensive. People often mistake voice talent work to be similar to a news reader, but it is highly specialized and the work can be very difficult.

Such investments in time and money may yield great audio, but the catch is that even if you’ve created the best TTS voice from these recordings, at best it will still sound exactly like the voice talent - the person who provided the raw audio data. (We’ve read the articles about people who have fallen for their GPS voice to find that they are real people with real names.) So the interesting problem here from a research perspective is how to create a voice that sounds human but is not identifiable as a singular person.

Crowd-sourcing projects for automatic speech recognition (ASR) for Google Voice Search had been successful in the past, with public volunteers eager to participate by providing voice samples. For ASR, the objective is to collect from a diversity of speakers and environments, capturing varying regional accents. The polar opposite is true of TTS, where one unique speaker, with the standard accent and in a soundproof studio is the basic criteria.

Many years ago, Yannis Agiomyrgiannakis, Digital Signal Processing researcher on the TTS team in Google London, wrote a “manifesto” for acoustic data collection for 2000 languages. In his document, he gave technical specifications on how to convert an average room into a recording studio. Knot Pipatsrisawat, software engineer in Google Research for Low Resource Languages, built a tool that we call “ChitChat”, a portable recording studio, using Yannis’ specifications. This web app allows users to read the prompt, playback the recording and even assess the noise level of the room.

From other past research in ASR, we knew that the right tool could solve the crowd sourcing problem. ChitChat allowed us to experiment in different environments to get an idea of what kind of office space would work and what kind of problems we might encounter. After experimenting with several different laptops and tablets, we were able to find a computer that recognized the necessary peripherals (the microphone, USB converter, and preamp) for under $2,000 – much cheaper than a recording studio!

Now we needed multiple speakers of a single language. For us, it was a no-brainer to pilot Project Unison with Bangladeshi Googlers, all of whom are passionate about getting Google products to their home country (the success of Android products in Bangladesh is an example of this). Googlers by and large are passionate about their work and many offer their 20% time as a way to help, to improve or to experiment on something that may or may not work because they care. The Bangladeshi Googlers are no exception. They embodied our objectives for a crowdsourcing innovation: out of many, we could achieve (literally) one voice.

With multiple speakers, we would target speakers of similar vocal profiles and adapt them to create a blended voice. Statistical parametric synthesis is not new, but the advances in recent technology have improved quality and proved to be a lightweight solution for a project like ours.

In May of this year, we auditioned 15 Bangaldeshi Googlers in Mountain View. From these recordings, the broader Bangladeshi Google community voted blindly for their preferred voice. Zakaria Haque, software engineer in Machine Intelligence, was chosen as our reference for the Bangla voice. We then narrowed down the group to five speakers based on these criteria: Dhaka accent, male (to match Zakaria’s), similarity in pitch and tone, and availability for recordings. The original plan of a spectral analysis using PRAAT proved to be unnecessary with our limited pool of candidates.

All 5 software engineers – Ahmed Chowdury, Mohammad Hossain, Syeed Faiz, Md. Arifuzzaman Arif, Sabbir Yousuf Sanny – plus Zakaria Haque recorded over 3 days in the anechoic chamber, a makeshift sound-proofed room at the Mountain View campus just before Ramadan. HyunJeong Choe, who had helped with the Korean TTS recordings, directed our volunteers.

Left: TPM Mohammad Khan measures the distance from the speaker to the mic to keep the sound quality consistent across all speakers. Right: Analytical Linguist HyunJeong Choe coaches SWE Ahmed Chowdury on how to speak in a friendly, knowledgeable, "Googly" voice

ChitChat allowed us to troubleshoot on the fly as recordings could be monitored from another room using the admin panel. In total, we recorded 2000 Bangla and English phrases mined from Wikipedia. In 30-60 minute intervals, the participants recorded over 250 sentences each.

In this session, we discovered an issue: a sudden drop in amplitude at high frequencies in a few recordings. We were worried that all the recordings might have to be scrapped.

As illustrated in the third image, speaker3 has a drop in energy above 13kHz which is visible in the graph and may be present at speech, distorting the speaker’s voice to sound as if he were speaking through a tube.

Another challenge was that we didn’t have a pronunciation lexicon for Bangla as spoken in Bangladesh. We worked initially with the publicly available TTS data from the Indian Institute of Information Technology, but this represented the variant of Bangla spoken in West Bengal (India), which differs from the speech we recorded. Our internally designed pronunciation rules for Bengali were also aimed at West Bengal and would need to be revised later.

Deciding to proceed anyway, Alexander Gutkin, Speech software engineer and lead for TTS for Low Resource Languages in Google London, built an initial prototype voice. Using the preliminary text normalization rules created by Richard Sproat, Speech and Language Processing researcher, the first voice we attempted proved to be surprisingly good. The problem in the high frequencies we had seen in the recordings is undetectable in the parametric voice.

When we return to the sound studio to record an additional 200 longer sentences, we plan to try an upgrade of the USB converter. Meanwhile, Martin Jansche, Natural Language Understanding software engineer, has worked with a team of native speakers on a pronunciation and lexicon and model that better matches the phonology of colloquial Bangladeshi Bangla. Alexander will use the additional recordings and the new pronunciation dictionary to build the second version.

NEXT UP: Building a parametric voice with multiple speaker data (Ep.2)

googblogs.com

All Google blogs and Press in one site

Tag Archives: crowd-sourcing

SCIN: A new resource for representative dermatology images

Dataset composition

Crowdsourcing method

Acknowledgements

Source: Google AI Blog

Uncovering Unknown Unknowns in Machine Learning

Source: Google AI Blog

Recreating Historical Streetscapes Using Deep Learning and Crowdsourcing

Source: Google AI Blog

Recreating Historical Streetscapes Using Deep Learning and Crowdsourcing

Source: Google AI Blog

Recreating Historical Streetscapes Using Deep Learning and Crowdsourcing

Source: Google AI Blog

Adding Diversity to Images with Open Images Extended

Source: Google AI Blog

Using Machine Learning to predict parking difficulty

Source: Google Research Blog

Crowdsourcing a Text-to-Speech voice for low resource languages (episode 1)

Source: Google Research Blog