Tag Archives: Supervised Learning

Learning to Smell: Using Deep Learning to Predict the Olfactory Properties of Molecules



Smell is a sense shared by an incredible range of living organisms, and plays a critical role in how they analyze and react to the world. For humans, our sense of smell is tied to our ability to enjoy food and can also trigger vivid memories. Smell allows us to appreciate all of the fragrances that abound in our everyday lives, be they the proverbial roses, a batch of freshly baked cookies, or a favorite perfume. Yet despite its importance, smell has not received the same level of attention from machine learning researchers as have vision and hearing.

Odor perception in humans is the result of the activation of 400 different types of olfactory receptors (ORs), expressed in 1 million olfactory sensory neurons (OSNs), in a small patch of tissue called the olfactory epithelium. These OSNs send signals to the olfactory bulb, and then to further structures in the brain. Based on analogous advances in deep learning for sight and sound, it should be possible to directly predict the end sensory result of an input molecule, even without knowing the intricate details of all the systems involved. Solving the odor prediction problem would aid in discovering new synthetic odorants, thereby reducing the ecological impact of harvesting natural products. Inspection of the resulting olfactory models may even lead to new insights into the biology of smell.

Small odorant molecules are the most basic building blocks of flavors and fragrances, and therefore represent the simplest version of the odor prediction problem. Yet each molecule can have multiple odor descriptors. Vanillin, for example, has descriptors such as sweet, vanilla, creamy, and chocolate, with some notes being more apparent than others. So odor prediction is also a multi-label classification problem.

In “Machine Learning for Scent: Learning Generalizable Perceptual Representations of Small Molecules”, we leverage graph neural networks (GNNs), a kind of deep neural network designed to operate on graphs as input, to directly predict the odor descriptors for individual molecules, without using any handcrafted rules. We demonstrate that this approach yields significantly improved performance in odor prediction compared to current state-of-the-art and is a promising direction for future research.

Graph Neural Networks for Odor Prediction
Since molecules are analogous to graphs, with atoms forming the vertices and bonds forming the edges, GNNs are the natural model of choice for their understanding. But how does one translate the structure of a molecule into a graph representation? Initially, every node in the graph is represented as a vector, using any preferred featurization — atom identity, atom charge, etc. Then, in a series of message passing steps, every node broadcasts its current vector value to each of its neighbors. An update function then takes the collection of vectors sent to it, and generates an updated vector value. This process can be repeated many times, until finally all of the nodes in the graph are summarized into a single vector via summing or averaging. That single vector, representing the entire molecule, can then be passed into a fully connected network as a learned molecular featurization. This network outputs a prediction for odor descriptors, as provided by perfume experts.
Each node is represented as a vector, and each entry in the vector initially encodes some atomic-level information.
For each node we look at adjacent nodes and collect their information, which is then transformed with a neural network into new information for the centered node. This procedure is performed iteratively. Other variants of GNNs utilize edge and graph-level information.
Illustration of a GNN for odor prediction. We translate the structure of molecules into graphs that are fed into GNN layers to learn a better representation of the nodes. These nodes are reduced into a single vector and passed into a neural network that is used to predict multiple odor descriptors.
This representation doesn’t know anything about spatial positions of atoms, and so it can’t distinguish stereoisomers, molecules made of the same atoms but in slightly different configurations that can smell different, such as (R)- and (S)-carvone. Nevertheless, we have found that even without distinguishing stereoisomers, in practice it is still possible to predict odor quite well.

For odor prediction, GNNs consistently demonstrate improved performance compared to previous state-of-the-art methods, such as random forests, which do not directly encode graph structure. The magnitude of the improvement depends on which odor one tries to predict.
Example of the performance of a GNN on odor descriptors against a strong baseline, as measured by the AUROC score. Example odor descriptors are picked randomly. Closer to 1.0 means better. In the majority of cases GNNs outperform the field-standard baseline substantially, with similar performance seen against other metrics (e.g., AUPRC, recall, precision).
Learning from the Model, and Extending It to Other Tasks
In addition to predicting odor descriptors, GNNs can be applied to other olfaction tasks. For example, take the case of classifying new or refined odor descriptors using only limited data. For each molecule, we extract a learned representation from an intermediate layer of the model that is optimized for our odor descriptors, which we call an “odor embedding”. One can think of this as an olfaction version of a color space, like RGB or CMYK. To see if this odor embedding is useful for predicting related but different tasks, we designed experiments that test our learned embedding on related tasks for which it was not originally designed. We then compared the performance of our odor embedding representation to a common chemoinformatic representation that encodes structural information of a molecule, but is agnostic to odor and found that the odor embedding generalized to several challenging new tasks, even matching state-of-the-art on some.
2D snapshot of our embedding space with some example odors highlighted. Left: Each odor is clustered in its own space. Right: The hierarchical nature of the odor descriptor. Shaded and contoured areas are computed with a kernel-density estimate of the embeddings.
Future Work
Within the realm of machine learning, smell remains the most elusive of the senses, and we’re excited to continue doing a small part to shed light on it through further fundamental research. The possibilities for future research are numerous, and touch on everything from designing new olfactory molecules that are cheaper and more sustainably produced, to digitizing scent, or even one day giving those without a sense of smell access to roses (and, unfortunately, also rotten eggs). We hope to also bring this problem to the attention of more of the machine learning world through the eventual creation and sharing of high-quality, open datasets.

Acknowledgements
This early research is the result of the work and advisement of a team of talented researchers and engineers in Google Brain — Benjamin Sanchez-Lengeling, Jennifer Wei, Brian Lee, Emily Reif, Carey Radebaugh, Max Bileschi, Yoni Halpern, and D. Sculley. We are delighted to have collaborated on this work with Richard Gerkin at ASU and Alán Aspuru-Guzik at the University of Toronto. We are of course building on an enormous amount of prior work, and have benefitted particularly from work by Justin Gilmer, George Dahl and others on fundamental methodology in GNNs, among many other works in neuroscience, statistics and chemistry. We are also grateful to helpful comments from Steven Kearnes, David Belanger, Joel Mainland, and Emily Mayhew.

Source: Google AI Blog


Accurate Online Speaker Diarization with Supervised Learning



Speaker diarization, the process of partitioning an audio stream with multiple people into homogeneous segments associated with each individual, is an important part of speech recognition systems. By solving the problem of “who spoke when”, speaker diarization has applications in many important scenarios, such as understanding medical conversations, video captioning and more. However, training these systems with supervised learning methods is challenging — unlike standard supervised classification tasks, a robust diarization model requires the ability to associate new individuals with distinct speech segments that weren't involved in training. Importantly, this limits the quality of both online and offline diarization systems. Online systems usually suffer more, since they require diarization results in real time.
Online speaker diarization on streaming audio input. Different colors in the bottom axis indicate different speakers.
In “Fully Supervised Speaker Diarization”, we describe a new model that seeks to make use of supervised speaker labels in a more effective manner. Here “fully” implies that all components in the speaker diarization system, including the estimation of the number of speakers, are trained in supervised ways, so that they can benefit from increasing the amount of labeled data available. On the NIST SRE 2000 CALLHOME benchmark, our diarization error rate (DER) is as low as 7.6%, compared to 8.8% DER from our previous clustering-based method, and 9.9% from deep neural network embedding methods. Moreover, our method achieves this lower error rate based on online decoding, making it specifically suitable for real-time applications. As such we are open sourcing the core algorithms in our paper to accelerate more research along this direction.

Clustering versus Interleaved-state RNN
Modern speaker diarization systems are usually based on clustering algorithms such as k-means or spectral clustering. Since these clustering methods are unsupervised, they could not make good use of the supervised speaker labels available in data. Moreover, online clustering algorithms usually have worse quality in real-time diarization applications with streaming audio inputs. The key difference between our model and common clustering algorithms is that in our method, all speakers’ embeddings are modeled by a parameter-sharing recurrent neural network (RNN), and we distinguish different speakers using different RNN states, interleaved in the time domain.

To understand how this works, consider the example below in which there are four possible speakers: blue, yellow, pink and green (this is arbitrary, and in fact there may be more — our model uses the Chinese restaurant process to accommodate the unknown number of speakers). Each speaker starts with its own RNN instance (with a common initial state shared among all speakers) and keeps updating the RNN state given the new embeddings from this speaker. In the example below, the blue speaker keeps updating its RNN state until a different speaker, yellow, comes in. If blue speaks again later, it resumes updating its RNN state. (This is just one of the possibilities for speech segment y7 in the figure below. If new speaker green enters, it will start with a new RNN instance.)
The generative process of our model. Colors indicate labels for speaker segments.
Representing speakers as RNN states enables us to learn the high-level knowledge shared across different speakers and utterances using RNN parameters, and this promises the usefulness of more labeled data. In contrast, common clustering algorithms almost always work with each single utterance independently, making it difficult to benefit from a large amount of labeled data.

The upshot of all this is that given time-stamped speaker labels (i.e. we know who spoke when), we can train the model with standard stochastic gradient descent algorithms. A trained model can be used for speaker diarization on new utterances from unheard speakers. Furthermore, the use of online decoding makes it more suitable for latency-sensitive applications.

Future Work
Although we've already achieved impressive diarization performance with this system, there are still many exciting directions we are currently exploring. First, we are refining our model so it can easily integrate contextual information to perform offline decoding. This will likely further reduce the DER, which is more useful for latency-insensitive applications. Second, we would like to model acoustic features directly instead of using d-vectors. In this way, the entire speaker diarization system can be trained in an end-to-end way.

To learn more about this work, please see our paper. To download the core algorithm of this system, please visit the Github page.

Acknowledgments
This work was done as a close collaboration between Google AI and Speech & Assistant teams. Contributors include Aonan Zhang (intern), Quan Wang, Zhengyao Zhu and Chong Wang.

Source: Google AI Blog


Neural Network-Generated Illustrations in Allo



Taking, sharing, and viewing selfies has become a daily habit for many — the car selfie, the cute-outfit selfie, the travel selfie, the I-woke-up-like-this selfie. Apart from a social capacity, self-portraiture has long served as a means for self and identity exploration. For some, it’s about figuring out who they are. For others it’s about projecting how they want to be perceived. Sometimes it’s both.

Photography in the form of a selfie is a very direct form of expression. It comes with a set of rules bounded by reality. Illustration, on the other hand, empowers people to define themselves - it’s warmer and less fraught than reality.
Today, Google is introducing a feature in Allo that uses a combination of neural networks and the work of artists to turn your selfie into a personalized sticker pack. Simply snap a selfie, and it’ll return an automatically generated illustrated version of you, on the fly, with customization options to help you personalize the stickers even further.
What makes you, you?
The traditional computer vision approach to mapping selfies to art would be to analyze the pixels of an image and algorithmically determine attribute values by looking at pixel values to measure color, shape, or texture. However, people today take selfies in all types of lighting conditions and poses. And while people can easily pick out and recognize qualitative features, like eye color, regardless of the lighting condition, this is a very complex task for computers. When people look at eye color, they don’t just interpret the pixel values of blue or green, but take into account the surrounding visual context.

In order to account for this, we explored how we could enable an algorithm to pick out qualitative features in a manner similar to the way people do, rather than the traditional approach of hand coding how to interpret every permutation of lighting condition, eye color, etc. While we could have trained a large convolutional neural network from scratch to attempt to accomplish this, we wondered if there was a more efficient way to get results, since we expected that learning to interpret a face into an illustration would be a very iterative process.

That led us to run some experiments, similar to DeepDream, on some of Google's existing more general-purpose computer vision neural networks. We discovered that a few neurons among the millions in these networks were good at focusing on things they weren’t explicitly trained to look at that seemed useful for creating personalized stickers. Additionally, by virtue of being large general-purpose neural networks they had already figured out how to abstract away things they didn’t need. All that was left to do was to provide a much smaller number of human labeled examples to teach the classifiers to isolate out the qualities that the neural network already knew about the image.

To create an illustration of you that captures the qualities that would make it recognizable to your friends, we worked alongside an artistic team to create illustrations that represented a wide variety of features. Artists initially designed a set of hairstyles, for example, that they thought would be representative, and with the help of human raters we used these hairstyles to train the network to match the right illustration to the right selfie. We then asked human raters to judge the sticker output against the input image to see how well it did. In some instances, they determined that some styles were not well represented, so the artists created more that the neural network could learn to identify as well.
Raters were asked to classify hairstyles that the icon on the left resembled closest. Then, once consensus was reached, resident artist Lamar Abrams drew a representation of what they had in common.
Avoiding the uncanny valley
In the study of aesthetics, a well-known problem is the uncanny valley - the hypothesis that human replicas which appear almost, but not exactly, like real human beings can feel repulsive. In machine learning, this could be compounded if were confronted by a computer’s perception of you, versus how you may think of yourself, which can be at odds.

Rather than aim to replicate a person’s appearance exactly, pursuing a lower resolution model, like emojis and stickers, allows the team to explore expressive representation by returning an image that is less about reproducing reality and more about breaking the rules of representation.
The team worked with artist Lamar Abrams to design the features that make up more than 563 quadrillion combinations.
Translating pixels to artistic illustrations
Reconciling how the computer perceives you with how you perceive yourself and what you want to project is truly an artistic exercise. This makes a customization feature that includes different hairstyles, skin tones, and nose shapes, essential. After all, illustration by its very nature can be subjective. Aesthetics are defined by race, culture, and class which can lead to creating zones of exclusion without consciously trying. As such, we strove to create a space for a range of race, age, masculinity, femininity, and/or androgyny. Our teams continue to evaluate the research results to help prevent against incorporating biases while training the system.
Creating a broad palette for identity and sentiment
There is no such thing as a ‘universal aesthetic’ or ‘a singular you’. The way people talk to their parents is different than how they talk to their friends which is different than how they talk to their colleagues. It’s not enough to make an avatar that is a literal representation of yourself when there are many versions of you. To address that, the Allo team is working with a range of artistic voices to help others extend their own voice. This first style that launched today speaks to your sarcastic side but the next pack might be more cute for those sincere moments. Then after that, maybe they’ll turn you into a dog. If emojis broadened the world of communication it’s not hard to imagine how this technology and language evolves. What will be most exciting is listening to what people say with it.

This feature is starting to roll out in Allo today for Android, and will come soon to Allo on iOS.

Acknowledgements
This work was made possible through a collaboration of the Allo Team and Machine Perception researchers at Google. We additionally thank Lamar Abrams, Koji Ashida, Forrester Cole, Jennifer Daniel, Shiraz Fuman, Dilip Krishnan, Inbar Mosseri, Aaron Sarna, and Bhavik Singh.

Equality of Opportunity in Machine Learning



As machine learning technology progresses rapidly, there is much interest in understanding its societal impact. A particularly successful branch of machine learning is supervised learning. With enough past data and computational resources, learning algorithms often produce surprisingly effective predictors of future events. To take one hypothetical example: an algorithm could, for example, be used to predict with high accuracy who will pay back their loan. Lenders might then use such a predictor as an aid in deciding who should receive a loan in the first place. Decisions based on machine learning can be both incredibly useful and have a profound impact on our lives.

Even the best predictors make mistakes. Although machine learning aims to minimize the chance of a mistake, how do we prevent certain groups from experiencing a disproportionate share of these mistakes? Consider the case of a group that we have relatively little data on and whose characteristics differ from those of the general population in ways that are relevant to the prediction task. As prediction accuracy is generally correlated with the amount of data available for training, it is likely that incorrect predictions will be more common in this group. A predictor might, for example, end up flagging too many individuals in this group as ‘high risk of default’ even though they pay back their loan. When group membership coincides with a sensitive attribute, such as race, gender, disability, or religion, this situation can lead to unjust or prejudicial outcomes.

Despite the need, a vetted methodology in machine learning for preventing this kind of discrimination based on sensitive attributes has been lacking. A naive approach might require a set of sensitive attributes to be removed from the data before doing anything else with it. This idea of “fairness through unawareness,” however, fails due to the existence of “redundant encodings.” Even if a particular attribute is not present in the data, combinations of other attributes can act as a proxy.

Another common approach, called demographic parity, asks that the prediction must be uncorrelated with the sensitive attribute. This might sound intuitively desirable, but the outcome itself is often correlated with the sensitive attribute. For example, the incidence of heart failure is substantially more common in men than in women. When predicting such a medical condition, it is therefore neither realistic nor desirable to prevent all correlation between the predicted outcome and group membership.

Equal Opportunity

Taking these conceptual difficulties into account, we’ve proposed a methodology for measuring and preventing discrimination based on a set of sensitive attributes. Our framework not only helps to scrutinize predictors to discover possible concerns. We also show how to adjust a given predictor so as to strike a better tradeoff between classification accuracy and non-discrimination if need be.

At the heart of our approach is the idea that individuals who qualify for a desirable outcome should have an equal chance of being correctly classified for this outcome. In our fictional loan example, it means the rate of ‘low risk’ predictions among people who actually pay back their loan should not depend on a sensitive attribute like race or gender. We call this principle equality of opportunity in supervised learning.

When implemented, our framework also improves incentives by shifting the cost of poor predictions from the individual to the decision maker, who can respond by investing in improved prediction accuracy. Perfect predictors always satisfy our notion, showing that the central goal of building more accurate predictors is well aligned with the goal of avoiding discrimination.

Learn more

To explore the ideas in this blog post on your own, our Big Picture team created a beautiful interactive visualization of the different concepts and tradeoffs. So, head on over to their page to learn more.

Once you’ve walked through the demo, please check out the full version of our paper, a joint work with Eric Price (UT Austin) and Nati Srebro (TTI Chicago). We’ll present the paper at this year’s Conference on Neural Information Processing Systems (NIPS) in Barcelona. So, if you’re around, be sure to stop by and chat with one of us.

Our paper is by no means the final word on this important and complex topic. It joins an ongoing conversation with a multidisciplinary focus of research. We hope to inspire future research that will sharpen the discussion of the different achievable tradeoffs surrounding discrimination and machine learning, as well as the development of tools that will help practitioners address these challenges.