Tag Archives: machine learning

Applying Machine Learning to… Yeast?

Posted by Ted Baltz, Senior Staff Software Engineer, Google Research, Accelerated Science Team

Humans have a long history with yeast, tied to the beginnings of plant domestication — baker’s (or brewer’s) yeast, Saccharomyces cerevisiae, has been used to make grains more digestible in the form of bread (or beer) for millennia. Today, yeast still has a large impact, with biologists adopting it as a model organism for biological research, genetics in particular, because it is easy to grow in the lab and is a eukaryote (i.e., unlike bacteria, it has a cell nucleus, like our cells do). It has even earned its own catchphrase in the biological community — “the awesome power of yeast genetics”. Studying the fundamentals of genetics is much easier in yeast, but is still applicable to humans since ~1000 yeast genes have a sequence homolog to human ones. Understanding how genes work together as a system is core to understanding all living things, which drives interest in this microorganism.

In collaboration with Calico Life Sciences, we present “Learning causal networks using inducible transcription factors and transcriptome-wide time series”, published in Molecular Systems Biology. Based on exhaustive experiments, we built a genome-wide model for the regulation of gene expression in S. cerevisiae and verified some of the results experimentally, enabling future investigations into less well understood biological systems. The Induction Dynamics gene Expression Atlas is available from Calico in a format that is easy to manipulate in Python, with open-sourced code for doing so on the Google Research GitHub. The data is also hosted in a standard format at the Gene Expression Omnibus.

Using Yeast to Provide Insight into Aging
Yeast reproduce through a process called budding, in which a small bud grows from the surface of the parent to produce an offspring that is almost genetically identical. Interestingly, even though yeast are single-celled organisms, they grow old and die, typically after 30 budding events. In fact, the “scars” from budding are clearly visible under a powerful microscope, allowing one to tell the age of the cell simply by looking! The problem is that researchers still do not know what causes aging to happen.

Bud scars on old yeast cells (5 μm bar for scale) — Photo Credit: Ian Foe, (Calico)

Scientists at Calico Life Sciences have pioneered a technique to make targeted perturbations to gene expression in yeast (i.e., allowing them to selectively “turn on” a gene’s activity) with the goal of understanding how aging works at the molecular level. The hope is that understanding aging in yeast will apply to aging in more complex organisms, like humans. This work is an early step in building a predictive framework for understanding the behavior of cells over time.

The Gene Expression Experiment
Genes encoded in DNA only function after being transcribed to RNA. It’s the RNA that is “translated” or “read” by ribosomes to produce protein. The level of protein production is governed by how much RNA is transcribed from DNA. Most of the work in a cell is being done by proteins, so they are key to understanding cell behavior. Yet, while we’d really like to measure the protein production levels, techniques to identify proteins at this scale are prohibitively expensive. Instead, in this experiment we use RNA as a proxy, since measuring RNA levels is easier.

The gene expression experiment is designed to perturb individual genes and measure, over time, how every other gene in the genome responds. The ability to rapidly perturb and track dynamics allows us to learn causal relationships and non-linear behaviors missing in most experiments. These dynamic data can also be used to train predictive models. This is made possible by strains of yeast with a single gene that is responsive to an external switch, in this case the hormone β-estradiol. To perturb a gene, the hormone is introduced, causing the switched gene to be overexpressed by a factor of 50 within 10 minutes. The yeast culture is then sampled at several points in time to measure the gene expression levels on microarrays. These experiments were done in parallel, with one yeast strain per culture, running concurrently.

Most of the perturbation experiments were done on a particular class of genes coding for transcription factors (TFs). These genes are the primary regulators of gene expression, coding for proteins that actually bind to the DNA strands, permitting or blocking transcription of particular genes.

When gene “a” is turned on it may upregulate gene “b” and downregulate gene “c”, and later lead to upregulation of gene “d”. Since yeast has more than 6000 genes, tracing the downstream impact of perturbing a single gene can get complicated very quickly. By combining experiments on different genes, one hopes to disambiguate the exact regulation mechanisms.

Schematic of the genome perturbation experiment: yeast strain with switchable gene “a”. Turning on a single gene (A) can result in differing levels of gene expression over time (B). Tracking these changes in comparison to those induced by turning on other genes (C and D) can provide insight into the regulation mechanisms (E).

The Gene Expression Model
For this experiment, we partnered with Calico because of the scale of the data, and the opportunity to leverage Google’s machine learning expertise and compute resources. There were more than 200 perturbation experiments on different yeast strains, each activating a single gene. In each experiment, the expression levels of all 6000 genes were measured eight times over 90 minutes, yielding a total of almost 10 million individual measurements (panel F, above). Clearly, some automation was required to analyze the data.

Our approach was to model the whole process as a system of differential equations: the rate of change of the expression of a gene is proportional to a weighted sum of the expression levels of all genes. We first estimated the time derivatives from the data by simply subtracting the expression levels at adjacent time points. We then predicted the time derivatives using only the raw expression levels themselves. By fitting a linear regression, we are, in effect, fitting the coefficients of a system of differential equations describing gene regulation. Our hope was that the differential equation model would be a low-dimensional representation of the data that could be interpreted more easily. To handle overfitting, we regularized the model using the L1-norm, which prefers to set uninformative parameters to exactly zero.
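To make the fitting procedure concrete, here is a minimal sketch in plain Python. It is not the code used in the paper: the three-gene network, the time points, the value of alpha, and all function names are assumptions chosen for illustration, and the coordinate-descent solver stands in for whatever L1 solver was actually used. The slopes are also divided by the time step, which the description above does not spell out.

```python
def finite_differences(times, series):
    """Pair each expression vector with the slope to the next time point."""
    features, slopes = [], []
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        features.append(series[k])
        slopes.append([(nxt - cur) / dt for cur, nxt in zip(series[k], series[k + 1])])
    return features, slopes


def soft_threshold(value, threshold):
    """Shrink value toward zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(X, y, alpha, sweeps=200):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = 0.0
            for row, target in zip(X, y):
                # Prediction from every gene except j (the partial residual).
                partial = sum(row[k] * w[k] for k in range(p) if k != j)
                rho += row[j] * (target - partial)
            w[j] = soft_threshold(rho / n, alpha) / col_sq[j]
    return w


def fit_network(experiments, alpha):
    """Fit one sparse regression per gene; row i holds the weights acting on gene i."""
    X, dX = [], []
    for times, series in experiments:
        features, slopes = finite_differences(times, series)
        X.extend(features)
        dX.extend(slopes)
    n_genes = len(X[0])
    return [lasso(X, [row[i] for row in dX], alpha) for i in range(n_genes)]


def simulate(W, x0, times):
    """Toy data generator: Euler steps of dx/dt = W x (illustration only)."""
    series = [list(x0)]
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        x = series[-1]
        series.append([xi + dt * sum(wij * xj for wij, xj in zip(row, x))
                       for xi, row in zip(x, W)])
    return series


# Hypothetical 3-gene network: gene 0 activates gene 1, and every gene decays.
true_W = [[-0.5, 0.0, 0.0], [0.8, -0.3, 0.0], [0.0, 0.0, -0.2]]
times = [0.0, 1.0, 2.0, 3.0]
# One "perturbation experiment" per gene: only that gene starts switched on.
experiments = [(times, simulate(true_W, x0, times))
               for x0 in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]
W_hat = fit_network(experiments, alpha=1e-3)
```

On this noise-free toy data the recovered matrix is close to `true_W`, and the entries that are truly zero come back as exact zeros, which is the behavior the L1 penalty is chosen for.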

Because each of the 200 experiments was unique, we held out each one in turn, refitting the model and allowing the selection of the best hyperparameters to optimize the out-of-sample loss. In the end, the work required a significant amount of compute, amounting to more than 50 million full regularization paths.
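The hold-out procedure can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it uses a single-feature L1-regularized fit (which has a closed form) so that the example stays short, and the function names, toy measurements, and candidate alpha values are all assumptions.

```python
import math


def fit_lasso_1d(xs, ys, alpha):
    """Closed-form minimizer of (1/2n)*sum((y - w*x)^2) + alpha*|w| for one feature."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n
    sq = sum(x * x for x in xs) / n
    if sq == 0.0:
        return 0.0
    shrunk = max(abs(rho) - alpha, 0.0)
    return math.copysign(shrunk, rho) / sq


def held_out_loss(experiments, alpha):
    """Mean squared error when each experiment in turn is predicted from the rest."""
    total, count = 0.0, 0
    for i, (test_x, test_y) in enumerate(experiments):
        train_x = [x for j, (xs, _) in enumerate(experiments) if j != i for x in xs]
        train_y = [y for j, (_, ys) in enumerate(experiments) if j != i for y in ys]
        w = fit_lasso_1d(train_x, train_y, alpha)
        total += sum((y - w * x) ** 2 for x, y in zip(test_x, test_y))
        count += len(test_x)
    return total / count


def select_alpha(experiments, alphas):
    """Pick the regularization strength with the lowest out-of-sample loss."""
    return min(alphas, key=lambda a: held_out_loss(experiments, a))


# Toy case: the feature has no real effect, so the slopes are pure noise.
toy = [([1.0], [0.1]), ([1.0], [-0.1]), ([1.0], [0.1]), ([1.0], [-0.1])]
best_alpha = select_alpha(toy, [0.0, 0.05, 1.0])
```

Here the unregularized fit (alpha of 0) chases the noise in the training experiments and does worse on the held-out one, so the selection settles on the smallest alpha that sets the weight to zero.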

Model Results
Our model made predictions about which genes would code for intermediate regulators of gene expression. This is an attempt at modeling the full gene regulation network of the organism. To verify these predictions, our collaborators at Calico collected more data from ten new strains of yeast. Three out of ten of the predictions held up in these experiments. One of the genes that the model predicted to be active encoded an unverified transcription factor, while another, previously identified as a regulator but never followed up, was found by our model to be a very active regulator. Our model was able to identify these without prior biological knowledge, demonstrating that these ML techniques might scale to other domains or organisms that are much less well studied.

More discussion of the impact of this work within the broad context of the field of genomics is available in an independent peer commentary.

Acknowledgements
We wish to thank Marc Coram, Minjie Fan, and Marc Berndl for their foundational contributions to this work, the Google Accelerated Science team for their continual support, and the entire team at Calico for the opportunity to collaborate on this experiment.

Source: Google AI Blog

MediaPipe KNIFT: Template-based Feature Matching

Posted by Zhicheng Wang and Genzhi Ye, MediaPipe team

Image Feature Correspondence with KNIFT

In many computer vision applications, a crucial building block is to establish reliable correspondences between different views of an object or scene, forming the foundation for approaches like template matching, image retrieval and structure from motion. Correspondences are usually computed by extracting distinctive view-invariant features such as SIFT or ORB from images. The ability to reliably establish such correspondences enables applications like image stitching to create panoramas or template matching for object recognition in videos (see Figure 1).

Today, we are announcing KNIFT (Keypoint Neural Invariant Feature Transform), a general purpose local feature descriptor similar to SIFT or ORB. Like them, KNIFT is a compact vector representation of local image patches that is invariant to uniform scaling, orientation, and illumination changes. However, unlike SIFT or ORB, which were engineered with heuristics, KNIFT is an embedding learned directly from a large number of corresponding local patches extracted from nearby video frames. This data-driven approach implicitly encodes complex, real-world spatial transformations and lighting changes in the embedding. As a result, the KNIFT feature descriptor appears to be more robust, not only to affine distortions, but to some degree of perspective distortions as well. We are releasing an implementation of KNIFT in MediaPipe and a KNIFT-based template matching demo in the next section to get you started.

Figure 1: Matching a real Stop Sign with a Stop Sign template using KNIFT.

Training Method

In machine learning, loosely speaking, training an embedding means finding a mapping that can translate a high dimensional vector, such as an image patch, to a relatively lower dimensional vector, such as a feature descriptor. Ideally, this mapping should have the following property: image patches around a real-world point should have the same or very similar descriptors across different views or illumination changes. We have found real-world videos to be a good source of such corresponding image patches as training data (see Figures 3 and 4) and we use the well-established Triplet Loss (see Figure 2) to train such an embedding. Each triplet consists of an anchor (denoted by a), a positive (p), and a negative (n) feature vector extracted from the corresponding image patches, and d() denotes the Euclidean distance in the feature space.

Figure 2: Triplet Loss Function.
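For readers without the figure at hand, the sketch below shows the commonly used hinge form of the triplet loss on a single triplet. The exact formula in Figure 2 is not reproduced in the text, so treat the form and the margin value of 0.2 as assumptions; `triplet_loss` is a name chosen here for illustration.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: push d(a, n) to exceed d(a, p) by at least margin."""
    d_ap = math.dist(anchor, positive)  # Euclidean distance anchor to positive
    d_an = math.dist(anchor, negative)  # Euclidean distance anchor to negative
    return max(d_ap - d_an + margin, 0.0)


# A far-away negative costs nothing; a negative barely farther than the
# positive still incurs a loss because it falls inside the margin.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
close = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.1])
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so training effort concentrates on triplets that still violate that condition.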

Training Data

The training triplets are extracted from all ~1500 video clips in the publicly available YouTube UGC Dataset. We first use an existing heuristically-engineered local feature detector to detect keypoints and compute the affine transform between two frames with a high accuracy (see Figure 4). Then we use this correspondence to find keypoint pairs and extract the patches around these keypoints. Note that the newly identified keypoints may include those that were detected but rejected by geometric verification in the first step. For each pair of matched patches, we randomly apply some form of data augmentation (e.g. random rotation or brightness adjustment) to construct the anchor-positive pair. Finally, we randomly pick an arbitrary patch from another video as the negative to finish the construction of this triplet (see Figure 5).

Figure 3: An example video clip from which we extract training triplets.

Figure 4: Finding frame correspondence using existing local features.

Figure 5: (Top to bottom) Anchor, positive and negative patches.

Hard-negative Triplet Mining

To improve model quality, we use the same hard-negative triplet mining method used by FaceNet training. We first train a base model with randomly selected triplets. Then we implement a pipeline that uses the base model to find semi-hard-negative samples (d(a,p) < d(a,n) < d(a,p)+margin) for each anchor-positive pair (Figure 6). After mixing the randomly selected triplets and hard-negative triplets, we re-train the model with this improved data.

Figure 6: (Top to bottom) Anchor, positive and semi-hard negative patches.
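The semi-hard condition quoted above translates directly into a filter over candidate negatives. The following is a minimal sketch, assuming descriptors are plain lists of floats already computed by the base model; the function name and margin value are hypothetical and this is not the actual mining pipeline.

```python
import math


def find_semi_hard_negatives(anchor, positive, candidates, margin=0.2):
    """Keep candidates n with d(a, p) < d(a, n) < d(a, p) + margin."""
    d_ap = math.dist(anchor, positive)
    return [n for n in candidates if d_ap < math.dist(anchor, n) < d_ap + margin]


anchor, positive = [0.0, 0.0], [1.0, 0.0]
candidates = [
    [0.5, 0.0],  # closer than the positive: a hard negative, not selected
    [1.1, 0.0],  # just beyond the positive, inside the margin: semi-hard
    [3.0, 0.0],  # far away: already easy, contributes no loss
]
semi_hard = find_semi_hard_negatives(anchor, positive, candidates)
```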

Model Architecture

From model architecture exploration, we have found that a relatively small architecture is sufficient to achieve decent quality, so we use a lightweight version of the Inception architecture as the KNIFT model backbone. The resulting KNIFT descriptor is a 40-dimensional float vector. For more model details, please refer to the KNIFT model card.

Benchmark

We benchmark the KNIFT model inference speed on various devices (computing 200 features) and list them in Table 1.

Table 1: KNIFT performance benchmark.

Quality-wise, we compare the average number of keypoints matched by KNIFT and by ORB (OpenCV implementation), respectively, on an in-house benchmark (Table 2). There are many publicly available image matching benchmarks, e.g. 2020 Image Matching Benchmark, but most of them focus on matching landmarks across large perspective changes in relatively high resolution images, and the tasks often require computing thousands of keypoints. In contrast, since we designed KNIFT for matching objects in large scale (i.e. billions of images) online image retrieval tasks, we devised our benchmark to focus on low cost and high precision driven use cases, i.e. 100-200 keypoints computed per image and only ~10 matching keypoints needed for reliably determining a match. In addition, to illustrate the fine-grained performance characteristics of a feature descriptor, we divide and categorize the benchmark set by object types (e.g. 2D planar surface) and image pair relations (e.g. large size difference). In Table 2, we compare the average number of keypoints matched by KNIFT and by ORB, respectively, in each category, based on the same 200 keypoint locations detected in each image by the oFast detector that comes with the ORB implementation in OpenCV.

Table 2: KNIFT vs ORB average number of matched keypoints.

From Table 2, we can see that KNIFT consistently matches more keypoints than ORB by a large margin in every category. We acknowledge that KNIFT (40-d float) is considerably larger than ORB (32-d char) and that this can have an effect on matching quality. Nevertheless, most local feature benchmarks do not take descriptor size into account, so we follow that convention here.

To make it easy for developers to try KNIFT in MediaPipe, we have built a local-feature-based template matching solution (see implementation details using MediaPipe in the next section). As a side effect, we can demonstrate the matching quality between KNIFT and ORB visually in side-by-side comparisons like Figures 7 and 9.

Figure 7: Example of “matching 2D planar surface”. (Left) KNIFT 183/240, (Right) ORB 133/240.

In Figure 7, we choose a typical U.S. Stop Sign image from Google Image Search as the template and attempt to match it with the Stop Sign in this video. This example falls into the “matching 2D planar surface” category in Table 2. Using the same 200 keypoint locations detected by oFast and the same RANSAC setting, we show that KNIFT is successful at matching the Stop Sign in 183 frames out of a total of 240 frames. In comparison, ORB matches 133 frames.

Figure 8: Example of “matching 3D untextured object”. Two template images from different views.

Figure 9: Example of “matching 3D untextured object”. (Left) KNIFT 89/150, (Right) ORB 37/150.

Figure 9 shows another matching performance comparison on an example from the “matching 3D untextured object” category in Table 2. Since this example involves large perspective changes of untextured surfaces, which is known to be challenging for local feature descriptors, we use template images from two different views (shown in Figure 8) to improve the matching performance. Again, using the same keypoint locations and RANSAC setting, we show that KNIFT is successful at matching 89 frames out of a total of 150 frames while ORB matches 37 frames.

KNIFT-based Template Matching in MediaPipe

We are releasing the aforementioned template matching solution based on KNIFT in MediaPipe, which is capable of identifying pre-defined image templates and precisely localizing recognized templates on the camera image. There are 3 major components in the template-matching MediaPipe graph shown below:

FeatureDetectorCalculator: a calculator that consumes image frames, runs the OpenCV oFast detector on the input image, and outputs keypoint locations. This calculator is also responsible for cropping patches around each keypoint with rotation and scale info and stacking them into a vector for the downstream calculator to process.
TfLiteInferenceCalculator with KNIFT model: a calculator that loads the KNIFT tflite model and performs model inference. The input tensor shape is (200, 32, 32, 1), indicating 200 32x32 local patches. The output tensor shape is (200, 40), indicating 200 40-dimensional feature descriptors. By default, the calculator runs the TFLite XNNPACK delegate, but users have the option to select the regular CPU delegate to run at a reduced speed.
BoxDetectorCalculator: a calculator that takes pre-computed keypoint locations and KNIFT descriptors and performs feature matching between the current frame and multiple template images. The output of this calculator is a list of TimedBoxProto, which contains the unique id and location of each box as a quadrilateral on the image. Aside from the classic homography RANSAC algorithm, we also apply a perspective transform verification step to ensure that the output quadrilateral does not result in too much skew or a weird shape.

Figure 10: MediaPipe graph of the demo
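To give a feel for the descriptor-matching step that precedes geometric verification, here is a small sketch of brute-force nearest-neighbor matching. It is not the BoxDetectorCalculator implementation: the ratio test, the ratio value, the function names, and the 2-d toy descriptors (real KNIFT descriptors are 40-dimensional) are assumptions, and the homography RANSAC and perspective verification steps are omitted. The threshold of 10 matches echoes the benchmark design described earlier.

```python
import math


def match_descriptors(frame_desc, template_desc, ratio=0.8):
    """Return (frame_index, template_index) pairs that pass a nearest-neighbor ratio test."""
    matches = []
    for i, d in enumerate(frame_desc):
        dists = sorted((math.dist(d, t), j) for j, t in enumerate(template_desc))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        # Accept only if the best match is clearly better than the runner-up.
        if best < ratio * second:
            matches.append((i, j))
    return matches


def has_enough_matches(matches, min_matches=10):
    """Cheap pre-check before running any geometric verification."""
    return len(matches) >= min_matches


template = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
frame = [[0.1, 0.0], [9.9, 0.1], [5.0, 5.0]]  # last one is ambiguous
matches = match_descriptors(frame, template)
```

The third frame descriptor sits equally far from every template descriptor, so the ratio test discards it rather than committing to an arbitrary match.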

Demo

In this demo, we chose three different denominations ($1, $5, $20) of U.S. dollar bills as templates and attempted to match them to various real world dollar bills in videos. We resized each input frame to 640x480 pixels, ran the oFast detector to detect 200 keypoints, and used KNIFT to extract feature descriptors from each 32x32 local image patch surrounding these keypoints. We then performed template matching between these video frames and the KNIFT features extracted from the dollar bill templates. This demo runs at 20 FPS on a Pixel 2 Phone CPU with XNNPACK.

Figure 11: Matching different U.S. dollar bills using KNIFT.

Build Your Own Templates

We have provided a set of built-in planar templates in our demo. To make it easy for users to try their own templates, we also provide a tool to build such an index with user generated templates. index_building.pbtxt is a MediaPipe graph that accepts as its input a directory path containing a set of template images. Users can use this graph to compute KNIFT descriptors for all template images (which will be stored in a single file) by 1) replacing the index_proto_filename field in the main graph and the BUILD file and 2) rebuilding the APK file. For step-by-step instructions on how we created the dollar bill demo shown above, please refer to this documentation.

Acknowledgements

We would like to thank Jiuqiang Tang, Chuo-Ling Chang, Dan Gnanapragasam‎, Howard Zhou, Jianing Wei and Ming Guang Yong for contributing to this blog post.

Source: Google Developers Blog

Free Universal Sound Separation

We are happy to announce the release of FUSS: the Free Universal Sound Separation dataset.

Audio recordings often contain a mixture of different sound sources; universal sound separation is the ability to separate such a mixture into its component sounds, regardless of the types of sound present. Previously, sound separation work has focused on separating mixtures of a small number of sound types, such as "speech" versus "nonspeech", or different instances of the same type of sound, such as speaker #1 versus speaker #2. Often in such work, the number of sounds in a mixture is also assumed to be known a priori. The FUSS dataset shifts focus to the more general problem of separating a variable number of arbitrary sounds from one another.

One major hurdle to training models in this domain is that even if you have high-quality recordings of sound mixtures, you can't easily annotate these recordings with ground truth. High-quality simulation is one approach to overcome this limitation. To achieve good results, you need a diverse set of sounds, a realistic room simulator, and code to mix these elements together for realistic, multi-source, multi-class audio with ground truth. With FUSS, we are releasing all three of these.

FUSS relies on Creative Commons licensed audio clips from freesound.org. We filtered these by license type, then using a pre-release of FSD50k [1], further filtered out sounds that aren't separable by humans when mixed together. We were left with about 23 hours of audio, consisting of 12,377 sounds useful for mixing (7,237 train, 2,883 validation, 2,257 eval). Using these clips, we created 20,000 training mixtures, 1,000 validation mixtures, and 1,000 eval mixtures.

We developed our own room simulator, implemented in TensorFlow, which generates the impulse response of a box-shaped room with frequency-dependent reflective properties given a sound source location and a mic location. As part of the dataset release, we provide pre-calculated room impulse responses used for each audio sample along with mixing code, so the research community can simulate novel audio without running the computationally expensive room simulator. Future work may include releasing the code for our room simulator and extending the simulator capabilities to address more extensive acoustic properties of rooms, materials with different reflective properties, novel room shapes, etc.
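The idea of building a mixture from pre-calculated impulse responses can be sketched in a few lines. This is an illustration only, not the released mixing code: the function names are hypothetical, the signals are tiny hand-written lists rather than audio, and a real pipeline would use an FFT-based convolution for speed.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def reverberate_and_mix(sources, impulse_responses):
    """Reverberate each source with its own response, then sum them into one mixture."""
    reverberated = [convolve(s, h) for s, h in zip(sources, impulse_responses)]
    length = max(len(r) for r in reverberated)
    # Zero-pad so that every reverberated source has the mixture's length.
    padded = [r + [0.0] * (length - len(r)) for r in reverberated]
    mixture = [sum(samples) for samples in zip(*padded)]
    return mixture, padded


sources = [[1.0, 0.0, 0.0], [0.0, 2.0]]
impulse_responses = [[1.0, 0.5], [1.0]]  # an echo for source 0, none for source 1
mixture, references = reverberate_and_mix(sources, impulse_responses)
```

Because the mixture is constructed by summation, the reverberated sources in `references` are exact ground truth for it, which is what is hard to obtain from real recordings.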

Finally, we have released a masking-based separation model, based on an improved time-domain convolutional network (TDCN++), described in our recent publications [2, 3]. On the eval set, this model achieves 12.5 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source mixtures with 37.6 dB absolute SI-SNR.
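For reference, scale-invariant SNR and its improvement over the unprocessed mixture can be computed as in the sketch below. It follows the standard definition (project the estimate onto the reference, then compare target and residual energy); the function names and the small `eps` stabilizer are choices made here, not taken from the released evaluation code.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)           # optimal rescaling of the reference
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: how much better the estimate is than the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


reference = [1.0, 0.0, -1.0, 0.0]
mixture = [1.0, 1.0, -1.0, 1.0]      # reference plus an equal-energy interferer
estimate = [1.0, 0.1, -1.0, -0.1]    # separated output with a little residual
improvement = si_snr_improvement(estimate, mixture, reference)
```

In this toy case the mixture sits at about 0 dB and the estimate at about 20 dB, so the improvement is about 20 dB. Multiplying the estimate by any constant leaves its score essentially unchanged, which is the point of the scale-invariant form.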

Source audio, reverb impulse responses, reverberated mixtures and sources created by the mixing code, and a baseline model checkpoint are available for download. Code for reverberating and mixing the audio data and for training the released model is available on our GitHub page.

The dataset will also be used in the DCASE challenge, as a component of the Sound Event Detection and Separation task. The released model will serve as a baseline for this competition, and a benchmark to demonstrate progress against in future experiments.

Our hope is this dataset will lower the barrier to new research, and particularly will allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.

By John Hershey, Scott Wisdom, and Hakan Erdogan, Google Research

References:
[1] Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font Corbera, Dmitry Bogdanov, Andrés Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra. "Freesound Datasets: A Platform for the Creation of Open Audio Datasets." International Society for Music Information Retrieval Conference (ISMIR), pp. 486–493. Suzhou, China, 2017.
[2] Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, and John R. Hershey. "Universal Sound Separation." IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 175-179. New Paltz, NY, USA, 2019.
[3] Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, and Daniel P. W. Ellis. "Improving Universal Sound Separation Using Sound Classification." IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2020.

Source: Google Open Source Blog

A Step Towards Protecting Patients from Medication Errors

Posted by Kathryn Rough, Research Scientist and Alvin Rajkomar, MD, Google Health

While no doctor, nurse, or pharmacist wants to make a mistake that harms a patient, research shows that 2% of hospitalized patients experience serious preventable medication-related incidents that can be life-threatening, cause permanent harm, or result in death. There are many factors contributing to medical mistakes, often rooted in deficient systems, tools, processes, or working conditions, rather than the flaws of individual clinicians (IOM report). To mitigate these challenges, one can imagine a system more sophisticated than the current rules-based error alerts provided in standard electronic health record software. The system would identify prescriptions that looked abnormal for the patient and their current situation, similar to a system that produces warnings for atypical credit card purchases on stolen cards. However, determining which medications are appropriate for any given patient at any given time is complex — doctors and pharmacists train for years before acquiring the skill. With the widespread use of electronic health records, it may now be feasible to use this data to identify normal and abnormal patterns of prescriptions.

In an initial effort to explore solutions to this problem, we partnered with UCSF's Bakar Computational Health Sciences Institute to publish “Predicting Inpatient Medication Orders in Electronic Health Record Data” in Clinical Pharmacology and Therapeutics, which evaluates the extent to which machine learning could anticipate normal prescribing patterns by doctors, based on electronic health records. Similar to our prior work, we used comprehensive clinical data from de-identified patient records, including the sequence of vital signs, laboratory results, past medications, procedures, diagnoses and more. Based on the patient’s current clinical state and medical history, our best model was able to anticipate physicians’ actual prescribing decisions three quarters of the time.

Model Training
The dataset used for model training included approximately three million medication orders from over 100,000 hospitalizations. It used retrospective electronic health record data, which was de-identified by randomly shifting dates and removing identifying portions of the record in accordance with HIPAA, including names, addresses, contact details, record numbers, physician names, free-text notes, images, and more. The data was not joined or combined with any other data. All research was done using the open-sourced Fast Healthcare Interoperability Resources (FHIR) format, which we’ve previously used to make healthcare data more effective for machine learning. The dataset was not restricted to a particular disease or therapeutic area, which made the machine learning task more challenging, but also helped to ensure that the model could identify a larger variety of conditions; e.g. patients suffering from dehydration require different medications than those with traumatic injuries.

We evaluated two machine learning models: a long short-term memory (LSTM) recurrent neural network and a regularized, time-bucketed logistic model, which are commonly used in clinical research. Both were compared to a simple baseline that ranked the most frequently ordered medications based on a patient’s hospital service (e.g., General Medical, General Surgical, Obstetrics, Cardiology, etc.) and amount of time since admission. Each time a medication was ordered in the retrospective data, the models ranked a list of 990 possible medications, and we assessed whether the models assigned high probabilities to the medications actually ordered by doctors in each case.
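The frequency baseline is simple enough to sketch. The version below is an assumption-laden illustration rather than the study's implementation: the 24-hour time bucket, the tuple layout of an order, the function names, and the placeholder medication names are all hypothetical.

```python
from collections import Counter, defaultdict


def time_bucket(hours_since_admission, bucket_hours=24):
    """Map time since admission to a coarse bucket (bucket width is an assumption)."""
    return int(hours_since_admission // bucket_hours)


def fit_frequency_baseline(orders):
    """Count how often each medication was ordered per (service, time bucket)."""
    counts = defaultdict(Counter)
    for service, hours, medication in orders:
        counts[(service, time_bucket(hours))][medication] += 1
    return counts


def rank_medications(counts, service, hours):
    """Rank medications from most to least frequently ordered in this context."""
    context = counts.get((service, time_bucket(hours)), Counter())
    return [medication for medication, _ in context.most_common()]


# Placeholder orders: (hospital service, hours since admission, medication).
orders = [
    ("Cardiology", 2, "med_a"),
    ("Cardiology", 5, "med_a"),
    ("Cardiology", 3, "med_b"),
    ("Obstetrics", 1, "med_c"),
]
baseline = fit_frequency_baseline(orders)
ranking = rank_medications(baseline, "Cardiology", 4)
```

Such a baseline ignores everything in the record except the service and elapsed time, which is what makes it a useful lower bar for the learned models.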

As an example of how the model was evaluated, imagine a patient who arrived at the hospital with signs of an infection. The model reviewed the information recorded in the patient’s electronic health record — a high temperature, elevated white blood cell count, quick breathing rate — and estimated how likely it would be for different medications to be prescribed in that situation. The model’s performance was evaluated by comparing its ranked choices against the medications that the physician actually prescribed (in this example, the antibiotic vancomycin and sodium chloride solution for rehydration).

Based on a patient’s medical history and current clinical characteristics, the model ranks the medications a physician is most likely to prescribe.

Findings
Our best-performing model was the LSTM model, a class of models particularly effective for handling sequential data, including text and language. These models are capable of capturing the ordering and time recency of events in the data, making them a good choice for this problem.

Nearly all (93%) top-10 lists contained at least one medication that would be ordered by clinicians for the given patient within the next day. Fifty-five percent of the time, the model correctly placed medications prescribed by the doctor as one of the top-10 most likely medications, and 75% of ordered medications were ranked in the top-25. Even for ‘false negatives’ — cases where the medication ordered by doctors did not appear among the top-25 results — the model highly ranked a medication in the same class 42% of the time. This performance was not explained by the model simply predicting previously prescribed medications. Even when we blinded the model to previous medication orders, it maintained high performance.
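The top-10 and top-25 figures are instances of a top-k hit rate, which can be computed as in this sketch. The function name and the tiny example rankings are made up for illustration; they are not study data.

```python
def top_k_rate(examples, k):
    """Fraction of orders whose medication appears among the model's k highest ranks.

    Each example is (ranked_medications, ordered_medication), with the ranking
    sorted from most to least likely.
    """
    hits = sum(1 for ranked, ordered in examples if ordered in ranked[:k])
    return hits / len(examples)


# Four toy orders, each with the model's ranking at the time of the order.
examples = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "c"),
    (["b", "a", "c"], "d"),  # the ordered medication was never ranked
    (["c", "b", "a"], "b"),
]
rates = {k: top_k_rate(examples, k) for k in (1, 2, 3)}
```

The rate can only stay the same or grow as k increases, which is why the top-25 figure is at least as high as the top-10 one.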

What Does This Mean for Patients and Clinicians?
It’s important to remember that models trained this way reproduce physician behavior as it appears in historical data, and have not learned optimal prescribing patterns, how these medications might work, or what side effects might occur. However, learning ‘normal’ is a starting point to eventually spot abnormal, potentially dangerous orders. In our next phase of research, we will examine under which circumstances these models are useful for finding medication errors that could harm patients.

The results from this exploratory work are early steps towards testing the hypothesis that machine learning can be applied to build systems that prevent mistakes and help to keep patients safe. We look forward to collaborating with doctors, pharmacists, other clinicians, and patients as we continue research to quantify whether models like this one are capable of catching errors and keeping patients safe in the hospital.

Acknowledgements
We would like to thank Atul Butte (UCSF), Claire Cui, Andrew Dai, Michael Howell, Laura Vardoulakis, Yuan (Emily) Xue, and Kun Zhang for their contributions towards the research work described in this post. We’d additionally like to thank members of our broader research team who have assisted in the development of analytical tools, data collection, maintenance of research infrastructure, assurance of data quality, and project management: Gabby Espinosa, Gerardo Flores, Michaela Hardt, Sharat Israni (UCSF), Jeff Love (UCSF), Dana Ludwig (UCSF), Hong Ji, Svetlana Kelman, I-Ching Lee, Mimi Sun, Patrik Sundberg, Chunfeng Wen, and Doris Wong.

Source: Google AI Blog

Semantic Reactor: A tool for experimenting with NLU models

Companies are using natural language understanding (NLU) to create digital personal assistants, customer service bots, and semantic search engines for reviews, forums and the news.

However, the perception that using NLU and machine learning is costly and time consuming prevents a lot of potential users from exploring its benefits.

To dispel some of the intimidation of using NLU, and to demonstrate how it can be easily used with pre-trained, generic models, we have released a tool, the Semantic Reactor, and open-sourced example code, The Mystery of the Three Bots.

The Semantic Reactor

The Semantic Reactor is a Google Sheets Add-On that allows the user to sort lines of text in a sheet using a variety of machine-learning models. It is released as a whitelisted experiment, so if you would like to check it out, fill out this application at the Google Cloud AI Workshop. Once approved, you’ll be emailed instructions on how to install it.

The tool offers ranking methods that determine how the list will be sorted. With the semantic similarity method, the lines more similar in meaning to the input will be ranked higher.

With the input-response method, the lines that are the most appropriate conversational responses are ranked higher.

Why use the Semantic Reactor?

There are a lot of interesting things you can do with the Semantic Reactor, but let’s look at the following two:

Writing dialogue for a bot that exists within a well-defined environment and has a clear purpose (like a customer service bot) using semantic similarity.
Searching within large collections of text, like from a message board. For that, we will use input-response.

Writing Dialogue for a Bot Using Semantic Similarity

For the sake of an example, let’s say you are writing dialogue for a bot that answers questions about a product, in this case, cookies.

If you’ve been running a cookie hotline for a while, you probably can list the most common cookie questions. With that data, you can create your cookie bot. Start by opening a Google Sheet and writing the common questions and answers (questions in the A column, answers in the B).

Here is the start of what that Sheet might look like. Make a copy of the Sheet, which will allow you to use the Semantic Reactor Add-on. Use the tool to experiment with new QA pairs and how each model reacts to them.

Here are a few queries to try, using the semantic similarity rank method:

Query: What are cookie ingredients?
Returns: What are cookies made of?

Query: Are cookies biscuits?
Returns: Are cookies also called biscuits?

Query: What should I serve with cookies?
Returns: What drinks go well with cookies?
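
The ranking behind queries like these can be sketched in a few lines of Python. This is only an illustration of the idea, not the Semantic Reactor's implementation: the embed function below is a hypothetical bag-of-words stand-in for a real sentence encoder, so it captures word overlap rather than meaning, and the function names are our own.

```python
import math
import re
from collections import Counter


def embed(text):
    """Hypothetical stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, lines):
    """Return the lines sorted from most to least similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(line)), line) for line in lines]
    return [line for _, line in sorted(scored, key=lambda pair: pair[0], reverse=True)]


lines = [
    "What drinks go well with cookies?",
    "What are cookies made of?",
    "Are cookies also called biscuits?",
]
ranked = rank_by_similarity("Are cookies biscuits?", lines)
print(ranked[0])  # Are cookies also called biscuits?
```

A learned sentence encoder would replace embed with dense vectors, which is what lets a query such as "What are cookie ingredients?" match a line that shares few words with it.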

Of course, that small list of responses won’t cover many of the questions people will ask your cookie bot. What the Reactor allows you to do is quickly add new QA pairs as you learn about what your users want to ask.

For example, maybe people are asking a lot about cookie calories.

You’d write the new question in column A, and the new answer in column B, and then test a few different phrasings with the Reactor. You might need to tweak the target response a few times to make sure it matches a wide variety of phrasings. You should also experiment with the three different models to see which one performs the best.

For instance, let’s say the new target question you want the model to match to is: “How many calories does a typical cookie have?”

That question might be phrased by users as:

Are cookies caloric?
A lot of calories in a cookie?
Will cookies wreck my diet?
Are cookies fattening?

The more you test with live users, the more you’ll find that they phrase their questions in ways you don’t expect. As with all things based on machine learning, constantly refreshing data, testing, and improving are all part of the process.

Searching Through Text Using Input-Response

Sometimes you can’t anticipate what users are going to ask, and sometimes you might be dealing with a lot of potential responses, maybe thousands. In cases like that, you should use the input-response ranking method. That means the model will examine the list of potential responses and then rank each one according to what it thinks is the most likely response.

Here is a Sheet containing a list of simple conversational responses. Using the input-response ranking method, try a few generic conversational openers like “Hello” or “How’s it going?”

Note that in input-response mode, the model is predicting the most likely conversational response to an input and not the most semantically similar response.

Note that “Hello,” in input-response mode, returns “Nice to meet you.” In semantic similarity mode, “Hello” returns what the model thinks is semantically closest to “Hello,” which is “What’s up?”

Now try your own! Add potential responses. Switch between the models and ranking methods to see how it changes the results (be sure to hit the “reload” button every time you add new responses).

Example Code

One of the models available on TensorFlow Hub is the Universal Sentence Encoder Lite. It’s only 1.6MB and is suitable for use within websites and on-device applications.

An open-sourced sample game that uses the USE Lite is Mystery of the Three Bots on GitHub. It’s a simple demonstration that shows how you can use a small semantic ML model to drive conversations with game characters. The corpora the game uses were created and tested using the Semantic Reactor.

You can play a running version of the game here. You can experiment with the corpora of two of the characters, the Maid and the Butler, contained within this Sheet. Be sure to make a copy of the Sheet so you can edit and add new QA pairs.

Where To Get The Models Used Within The Semantic Reactor

All of the models used in the Semantic Reactor are published and available online.

Local – Minified TensorFlow.js version of the Universal Sentence Encoder.
Basic Online – Basic version of the Universal Sentence Encoder.
Multilingual Online – Universal Sentence Encoder trained on question/answer pairs in 16 languages.

Final Thoughts

These language models are far from perfect. They use their training to give a best estimate of what to return based on the list of responses you gave them. Machine learning is about calculation, prediction, and training. Models can be improved over time with more data and tuning, and in turn, be made more accurate.

Also, because conversational models are trained on dialogue between people, and because people are biased, the models will display biases that exist in the data that they were trained on, sometimes in ways you can’t predict. For more on model bias, and more detail about how these models were trained, see the Semantic Experiences for Developers page.

By Ben Pietrzak, Steve Pucci, Aaron Cohen — Google AI

Source: Google Open Source Blog

Alfred Camera: Smart camera features using MediaPipe

Guest post by the Engineering team at Alfred Camera

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, Alfred Camera.

In this article, we’d like to give you a short overview of Alfred Camera and our experience of using MediaPipe to transform our moving object feature, and how MediaPipe has made it easier for us to achieve our goals.

What is Alfred Camera?

Fig.1 Alfred Camera Logo

Alfred Camera is a smart home app for both Android and iOS devices, with over 15 million downloads worldwide. By downloading the app, users are able to turn their spare phones into security cameras and monitors directly, which allows them to watch their homes, shops, and pets at any time. The mission of Alfred Camera is to provide affordable home security so that everyone can find peace of mind in this busy world.

The Alfred Camera team is composed of professionals in various fields, including an engineering team with several machine learning and computer vision experts. Our aim is to integrate AI technology into devices that are accessible to everyone.

Machine Learning in Alfred Camera

Alfred Camera currently has a feature called Moving Object Detection, which continuously uses the device’s camera to monitor a target scene. Once it identifies a moving object in the area, the app will begin recording the video and send notifications to the device owner. The machine learning models for detection are hand-crafted and trained by our team using TensorFlow, and run on TensorFlow Lite with good performance even on mid-tier devices. This is important because the app is leveraging old phones and we'd like the feature to reach as many users as possible.

The Challenges

We have been building AI features at Alfred Camera since 2017. In order to have a solid foundation to support our AI feature requirements for the coming years, we decided to rebuild our real-time video analysis pipeline. At the beginning of the project, the goals were to create a new pipeline that would be 1) modular enough that we could swap core algorithms easily with minimal changes in other parts of the pipeline, 2) designed with GPU acceleration in place, and 3) as cross-platform as possible, so there would be no need to create and maintain separate implementations for different platforms. Based on these goals, we surveyed several promising open source projects, but we ended up using none of them, as they either fell short on features or did not provide the readiness and stability we were looking for.

We started a small team to prototype on those goals, first for the Android platform. What came later were some tough challenges, well beyond what we originally anticipated. We went through several major design changes because some key design basics had been overlooked. We needed to implement some utilities to do things that sounded trivial but required significant effort to get right and make fast. Dealing with asynchronous processing also led us into a number of timing issues, which took the team quite some effort to address. On top of that, debugging on real devices was extremely inefficient and painful.

Things didn't just stop here. Our product is also on iOS and we had to tackle these challenges once again. Moreover, discrepancies in the behavior between the platform-specific implementations introduced additional issues that we needed to resolve.

Even though we finally managed to get the implementations to the confidence level we wanted, it was not a very pleasant experience, and we never stopped wondering whether there was a better option.

MediaPipe - A Game Changer

Google open sourced the MediaPipe project in June 2019 and it immediately caught our attention. We were surprised by how perfectly it aligned with the goals we had set, and by functionality that we could not have developed with the engineering resources we had as a small company.

We immediately decided to start an evaluation project by building a new product feature directly using MediaPipe to see if it could live up to all the promises.

Migrating to MediaPipe

To start the evaluation, we decided to migrate our existing moving object feature to see what exactly MediaPipe can do.

Our current Moving Object Detection pipeline consists of the following main components:

(Moving) Object Detection Model
As explained earlier, a TensorFlow Lite model trained by our team, tailored to run on mid-tier devices.
Low-light Detection and Low-light Filter
Calculate the average luminance of the scene, and based on the result conditionally process the incoming frames to intensify the brightness of the pixels to let our users see things in the dark. We are also controlling whether we should run the detection or not as the moving object detection model does not work properly when the frame has been processed by the filter.
Motion Detection
Sending frames through Moving Object Detection still consumes a significant amount of power even with a small model like the one we created. Running inferences continuously does not seem to be a good idea as most of the time there may not be any moving object in front of the camera. We decided to implement a gating mechanism where the frames are only being sent to the Moving Object Detection model based on the movements detected from the scene. The detection is done mainly by calculating the differences between two frames with some additional tricks that take the movements detected in a few frames before into consideration.
Area of Interest
This is a mechanism to let users manually mask out the area where they do not want the camera to see. It can also be done automatically based on regional luminance that can be generated by the aforementioned low-light detection component.
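
The luminance check and the motion gate described above can be sketched roughly as follows. This is a plain-Python illustration under our own assumptions, not Alfred Camera's shader-based implementation: frames are small grayscale lists of rows, and the names MotionGate, changed_fraction, and all threshold values are hypothetical.

```python
from collections import deque


def mean_luminance(frame):
    """Average pixel intensity of a grayscale frame (list of rows, values 0-255)."""
    total = sum(sum(row) for row in frame)
    count = sum(len(row) for row in frame)
    return total / count


def changed_fraction(prev, curr, pixel_threshold=25):
    """Fraction of pixels whose intensity changed by more than pixel_threshold."""
    changed = 0
    count = 0
    for prev_row, curr_row in zip(prev, curr):
        for p, c in zip(prev_row, curr_row):
            changed += abs(p - c) > pixel_threshold
            count += 1
    return changed / count


class MotionGate:
    """Decides whether a frame should be forwarded to the (expensive) detector."""

    def __init__(self, motion_threshold=0.05, history=3):
        self.motion_threshold = motion_threshold
        self.recent = deque(maxlen=history)  # motion flags for the last few frames
        self.prev = None

    def should_detect(self, frame):
        moved = False
        if self.prev is not None:
            moved = changed_fraction(self.prev, frame) > self.motion_threshold
        self.prev = frame
        self.recent.append(moved)
        # Keep the gate open while any of the recent frames showed motion.
        return any(self.recent)


# Toy 4x4 frames: a static dark scene, then a bright object entering and leaving.
still = [[10] * 4 for _ in range(4)]
moving = [[10] * 4 for _ in range(2)] + [[200] * 4 for _ in range(2)]

gate = MotionGate()
frames = [still, still, moving, still, still, still, still]
decisions = [gate.should_detect(f) for f in frames]
print(decisions)  # [False, False, True, True, True, True, False]
print(mean_luminance(still))  # 10.0
```

Keeping a short history of motion flags means the detector keeps running for a few frames after movement stops, which is one simple way to take earlier frames into account.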

Our current implementation has taken GPU into consideration as much as we can. A series of shaders are created to perform the tasks above and the pipeline is designed to avoid moving pixels between CPU/GPU frequently to eliminate the potential performance hits.

The pipeline involves multiple ML models that are conditionally executed, mixed CPU/GPU processing, etc. All the challenges here make it a perfect showcase for how MediaPipe could help develop a complicated pipeline.

Playing with MediaPipe

MediaPipe provides a lot of code samples for any developer to bootstrap with. We started with the Object Detection on Android sample that comes with the project because of its similarity to the back-end part of our pipeline. It did take us some time to fully understand the design concepts of MediaPipe and all the associated tools. But with the complete documentation and the great responsiveness of the MediaPipe team, we soon got up to speed and could do most of the things we wanted.

That being said, there were a few challenges we needed to overcome on the road to full migration. Our original Moving Object Detection pipeline takes input frames asynchronously, but MediaPipe has timestamp bound limitations, so we could not simply show the results asynchronously. Meanwhile, we needed to gather data through JNI in a specific data format. We came up with a workaround that resolved these issues under the circumstances, which is described below.

After wrapping our models and the processing logic into calculators and wiring them up, we successfully transformed our existing implementation and created our first MediaPipe Moving Object Detection pipeline, shown in the figure below, running on Android devices:

Fig.2 Moving Object Detection Graph

We do not block the video frame in the main calculation loop; instead, we set the detection result as an input stream to show the annotation on the screen. The whole graph is designed as a multi-function process: the left chunk is the debug annotation and video frame output module, and the rest of the graph handles the remaining calculation, e.g., low-light detection, motion-triggered detection, cropping of the area of interest, and the detection process. In this way, the graph naturally separates into real-time display and asynchronous calculation.

As a result, we are able to complete a full detection pass in under 40ms on a device with a Snapdragon 660 chipset. MediaPipe’s tight integration with TensorFlow Lite gives us the flexibility to gain even more performance by leveraging whatever acceleration techniques (GPU or DSP) are available on the device.

The following figure shows the current implementation working in action:

Fig.3 Moving Object Detection running in Alfred Camera

After getting things to run on Android, Desktop GPU (OpenGL-ES) emulation was our next target to evaluate. We are already using OpenGL-ES shaders for some computer vision operations in our pipeline. Being able to develop an algorithm on desktop and see it work in action before deploying it onto mobile platforms is a huge benefit to us. The feature was not ready when the project was first released, but the MediaPipe team soon added Desktop GPU emulation support for Linux in follow-up releases to make this possible. We have used this capability to detect and fix some issues in the graphs we created even before putting things on mobile devices. Although it currently only works on Linux, it is still a big leap forward for us.

Testing the algorithms and making sure they behave as expected is also a challenge for a camera application. MediaPipe helps us simplify this by using pre-recorded MP4 files as input so we could verify the behavior simply by replaying the files. There is also built-in profiling support that makes it easy for us to locate potential performance bottlenecks.

MediaPipe - Exactly What We Were Looking For

The result of the evaluation and the feedback from our engineering team were very positive and promising:

We are able to design/verify the algorithm and complete core implementations directly in the desktop emulation environment, and then migrate to the target platforms with minimal effort. As a result, the complexities of debugging on real devices are greatly reduced.
MediaPipe’s modular design of graphs/calculators enables us to better split up the development into different engineers/teams, try out new pipeline design easily by rewiring the graph, and test the building blocks independently to ensure quality before we put things together.
MediaPipe’s cross-platform design maximizes the reusability and minimizes fragmentation of the implementations we created. Not only are the efforts required to support a new platform greatly reduced, but we are also less worried about the behavior discrepancies on different platforms due to different interpretations of the spec from platform engineers.
Built-in graphics utilities and profiling support saved us a lot of time creating those common facilities and making them right, and we could be more focused on the key designs.
Tight integration with TensorFlow Lite really saves lots of effort for a company like us that heavily depends on TensorFlow, and it still gives us the flexibility to easily interface with other solutions.

With just a few weeks working with MediaPipe, it has shown strong capabilities to fundamentally transform how we develop our products. Without MediaPipe we could have spent months creating the same features without the same level of performance.

Summary

Alfred Camera is designed to bring home security with AI to everyone, and MediaPipe has made achieving that goal significantly easier for our team. From Moving Object Detection to future AI-powered features, we are focusing on transforming a basic security camera use case into a smart housekeeper that can help provide even more context that our users care about. With the support of MediaPipe, we have been able to accelerate our development process and bring the features to the market at an unprecedented speed. Our team is really excited about how MediaPipe could help us progress and discover new possibilities, and is looking forward to the enhancements that are yet to come to the project.

Source: Google Developers Blog

Visual Transfer Learning for Robotic Manipulation

Posted by Yen-Chen Lin, Research Intern and Andy Zeng, Research Scientist, Robotics at Google

The idea that robots can learn to directly perceive the affordances of actions on objects (i.e., what the robot can or cannot do with an object) is called affordance-based manipulation, explored in research on learning complex vision-based manipulation skills including grasping, pushing, and throwing. In these systems, affordances are represented as dense pixel-wise action-value maps that estimate how good it is for the robot to execute one of several predefined motions at each location. For example, given an RGB-D image, an affordance-based grasping model might infer grasping affordances per pixel with a convolutional neural network. The grasping affordance value at each pixel would represent the success rate of performing a corresponding motion primitive (e.g. grasping action), which would then be executed by the robot at the position with the highest value.
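
As a rough sketch of the final selection step, the snippet below picks the motion primitive and pixel with the highest predicted value. The function name best_action and the toy maps are our own illustration with made-up numbers; in a real system the maps would be predicted by the network from an RGB-D image.

```python
def best_action(affordance_maps):
    """Return (primitive, row, col, value) for the highest-valued pixel.

    affordance_maps maps each motion primitive name to a 2D list of
    predicted success values, one per pixel.
    """
    best = None
    for primitive, amap in affordance_maps.items():
        for r, row in enumerate(amap):
            for c, value in enumerate(row):
                if best is None or value > best[3]:
                    best = (primitive, r, c, value)
    return best


# Toy 2x3 maps with made-up values, standing in for network predictions.
affordance_maps = {
    "grasp": [[0.05, 0.71, 0.08],
              [0.12, 0.40, 0.93]],
    "push": [[0.30, 0.10, 0.22],
             [0.15, 0.55, 0.35]],
}
primitive, row, col, value = best_action(affordance_maps)
print(f"{primitive} at pixel ({row}, {col}), predicted success {value:.2f}")
# grasp at pixel (1, 2), predicted success 0.93
```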

Overview of affordance-based manipulation.

For methods such as this, the ability to do more with less data is incredibly important, since data collection through physical trial and error can be both time consuming and expensive. However, recent discoveries in transfer learning have shown that visual feature representations learned from large-scale computer vision datasets can be reused for deep learning agents, enabling them to learn faster and generalize better in video games and simulated environments. If end-to-end affordance-based robot learning models that map from pixels to actions could similarly benefit from these visual representations, one could begin to leverage the vast amounts of labeled visual data that are now available in order to more efficiently learn useful skills for real-world interaction with less training.

In “Learning to See before Learning to Act: Visual Pre-training for Manipulation”, a collaboration with researchers from MIT to be presented at ICRA 2020, we investigate whether existing pre-trained deep learning visual feature representations can improve the efficiency of learning robotic manipulation tasks, like grasping objects. By studying how we can intelligently transfer neural network weights between vision models and affordance-based manipulation models, we can evaluate how different visual feature representations benefit the exploration process and enable robots to quickly acquire manipulation skills using different grippers. We present practical techniques to pre-train deep learning models, which enable robots to learn to pick and grasp arbitrary objects in unstructured settings in less than 10 minutes of trial and error.

Does first learning to see improve the speed at which a robot can learn to act? In this project, we study ways in which we can transfer knowledge learned from computer vision tasks (left) to robot manipulation tasks (right).

Transfer Learning for Affordance-Based Manipulation
Affordance-based manipulation is essentially a way to reframe a manipulation task as a computer vision task, but rather than mapping pixels to object labels, we instead associate pixels with the value of actions. Since the structures of computer vision models and affordance models are so similar, one can leverage techniques from transfer learning in computer vision to enable affordance models to learn faster with less data. This approach re-purposes pre-trained neural network weights (i.e., feature representations) learned from large-scale vision datasets to initialize network weights of affordance models for robotic grasping.

In computer vision, many deep model architectures are composed of two parts: a “backbone” and a “head”. The backbone consists of weights that are responsible for early-stage image processing, e.g., filtering edges, detecting corners, and distinguishing between colors, while the head consists of network weights that are used in later-stage processing, such as identifying high-level features, recognizing contextual cues, and executing spatial reasoning. The head is often much smaller than the backbone and is also more task specific. Hence, it is common practice in transfer learning to pre-train a backbone (e.g., a ResNet) and share its weights between tasks, while randomly initializing the weights of the model head for each new task.
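
The initialization recipe can be sketched schematically as follows. This is a toy illustration rather than the code used in the paper: weights are plain Python lists, the layer names are made up, and init_for_new_task is a hypothetical helper. The transfer_head flag covers the variant in which the pre-trained head is copied as well.

```python
import random


def init_for_new_task(pretrained, head_shapes, transfer_head=False, seed=0):
    """Build initial weights for a new task from a pre-trained model.

    pretrained: dict with "backbone" and "head" sub-dicts that map
        layer names to flat lists of weights.
    head_shapes: dict mapping each new head layer name to its weight count,
        used only when the head is randomly initialized.
    """
    rng = random.Random(seed)
    # Always copy the backbone weights from the pre-trained model.
    weights = {"backbone": {k: list(v) for k, v in pretrained["backbone"].items()}}
    if transfer_head:
        # Variant: reuse the pre-trained head too.
        weights["head"] = {k: list(v) for k, v in pretrained["head"].items()}
    else:
        # Common practice: start the task-specific head from small random values.
        weights["head"] = {
            name: [rng.gauss(0.0, 0.01) for _ in range(n)]
            for name, n in head_shapes.items()
        }
    return weights


# Made-up pre-trained weights standing in for a vision model.
pretrained = {
    "backbone": {"conv1": [0.2, -0.1, 0.4], "conv2": [0.7, 0.3]},
    "head": {"classifier": [0.5, -0.5]},
}
affordance_weights = init_for_new_task(pretrained, head_shapes={"affordance": 4})
print(affordance_weights["backbone"] == pretrained["backbone"])  # True
print(len(affordance_weights["head"]["affordance"]))  # 4
```

In a real framework the same idea amounts to loading a checkpoint into the backbone layers and leaving the head at its default initialization.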

Following this recipe, we initialized our affordance-based manipulation models with backbones based on the ResNet-50 architecture and pre-trained on different vision tasks, including a classification model from ImageNet and a segmentation model from COCO. With different initializations, the robot was then tasked with learning to grasp a diverse set of objects through trial and error.

Initially, we did not see any significant gains in performance compared with training from scratch – grasping success rates on training objects were only able to rise to 77% after 1,000 trial and error grasp attempts, outperforming training from scratch by 2%. However, upon transferring network weights from both the backbone and the head of the pre-trained COCO vision model, we saw a substantial improvement in training speed – grasping success rates reached 73% in just 500 trial and error grasp attempts, and jumped to 86% by 1,000 attempts. In addition, we tested our model on new objects unseen during training and found that models with the pre-trained backbone from COCO generalize better. The grasping success rates reach 83% with pre-trained backbone alone and further improve to 90% with both pre-trained backbone and head, outperforming the 46% reached by a model trained from scratch.

Affordance-based grasping models trained from scratch can struggle to pick up new objects after 60 minutes of training (left). With pre-training from visual tasks, our affordance-based grasping models can easily generalize to picking up new objects with less than 10 minutes of training, even when evaluated with different hardware (middle: suction, right: gripper).

Transfer Learning Can Improve Exploration
In our experiments with the grasping robot, we observed that the distribution of successful grasps versus failures in the generated datasets was far more balanced when network weights from both the backbone and head of pre-trained vision models were transferred to the affordance models, as opposed to only transferring the backbone.

Number of successful grasps out of the first 50 attempts using: a random initialization of weights, backbone and head pre-trained on ImageNet, COCO pre-trained backbone only, and backbone and head trained on COCO.

These results suggest that reusing network weights from vision tasks that require object localization (e.g., instance segmentation, like COCO) has the potential to significantly improve the exploration process when learning manipulation tasks. Pre-trained weights from these tasks encourage the robot to sample actions on things that look more like objects, thereby quickly generating a more balanced dataset from which the system can learn the differences between good and bad grasps. In contrast, pre-trained weights from vision tasks that potentially discard objects’ spatial information (e.g., image classification, like ImageNet) can only improve the performance slightly compared to random initialization.

To better understand this, we visualize the neural activations that are triggered by different pre-trained models and a converged affordance model trained from scratch using a suction gripper. Interestingly, we find that the intermediate network representations learned from the head of vision models used for segmentation from the COCO dataset activate on objects in ways that are similar to the converged affordance model. This aligns with the idea that transferring as much of the vision model as possible (both backbone and head) can lead to more object-centric exploration by leveraging model weights that are better at picking up visual features and localizing objects.

Affordances predicted by different models from images of cluttered objects (a). (b) Random refers to a randomly initialized model. (c) ImageNet is a model with backbone pre-trained on ImageNet and a randomly initialized head. (d) Normal refers to a model pre-trained to detect pixels with surface normals close to the anti-gravity axis. (e) COCO is the modified segmentation model (MaskRCNN) trained on the COCO dataset. (f) Suction is a converged model learned from robot-environment interactions using the suction gripper.

Limitations and Future Work
Many of the methods that we use today for end-to-end robot learning are effectively the same as those being used for computer vision tasks. Our work here on visual pre-training illuminates this connection and demonstrates that it is possible to leverage techniques from visual pre-training to improve the learning efficiency of affordance-based manipulation applied to robotic grasping tasks. While our experiments point to a better understanding of deep learning for robots, there are still many interesting questions that have yet to be explored. For example, how do we leverage large-scale pre-training for additional modes of sensing (e.g. force-torque or tactile)? How do we extend these pre-training techniques towards more complex manipulation tasks that may not be as object-centric as grasping? These areas are promising directions for future research.

You can learn more about this work in the summary video below.

Acknowledgements
This research was done by Yen-Chen Lin (Ph.D. student at MIT), Andy Zeng, Shuran Song, Phillip Isola (faculty at MIT), and Tsung-Yi Lin, with special thanks to Johnny Lee and Ivan Krasin for valuable managerial support, Chad Richards for helpful feedback on writing, and Jonathan Thompson for fruitful technical discussions.

Source: Google AI Blog

Fast and Easy Infinitely Wide Networks with Neural Tangents

Posted by Samuel S. Schoenholz, Senior Research Scientist and Roman Novak, Research Engineer, Google Research

The widespread success of deep learning across a range of domains such as natural language processing, conversational agents, and connectomics, has transformed the landscape of research in machine learning and left researchers with a number of interesting and important open questions such as: Why do deep neural networks (DNNs) generalize so well despite being overparameterized? What is the relationship between architecture, training, and performance for deep networks? How can one extract salient features from deep learning models?

One of the key theoretical insights that has allowed us to make progress in recent years has been that increasing the width of DNNs results in more regular behavior, and makes them easier to understand. A number of recent results have shown that DNNs that are allowed to become infinitely wide converge to another, simpler, class of models called Gaussian processes. In this limit, complicated phenomena (like Bayesian inference or gradient descent dynamics of a convolutional neural network) boil down to simple linear algebra equations. Insights from these infinitely wide networks frequently carry over to their finite counterparts. As such, infinite-width networks can be used as a lens to study deep learning, but also as useful models in their own right.

Left: A schematic showing how deep neural networks induce simple input / output maps as they become infinitely wide. Right: As the width of a neural network increases, we see that the distribution of outputs over different random instantiations of the network becomes Gaussian.

Unfortunately, deriving the infinite-width limit of a finite network requires significant mathematical expertise and has to be worked out separately for each architecture studied. Once the infinite-width model is derived, coming up with an efficient and scalable implementation further requires significant engineering proficiency. Together, the process of taking a finite-width model to its corresponding infinite-width network could take months and might be the topic of a research paper in its own right.

To address this issue and to accelerate theoretical progress in deep learning, we present Neural Tangents, a new open-source software library written in JAX that allows researchers to build and train infinitely wide neural networks as easily as finite neural networks. At its core, Neural Tangents provides an easy-to-use neural network library that builds finite- and infinite-width versions of neural networks simultaneously.

As an example of the utility of Neural Tangents, imagine training a fully-connected neural network on some data. Normally, a neural network is randomly initialized and then trained using gradient descent. Initializing and training many of these neural networks results in an ensemble. Often researchers and practitioners average the predictions from different members of the ensemble together for better performance. Additionally, the variance in the predictions of members of the ensemble can be used to estimate uncertainty. The downside is that training an ensemble of networks requires a significant computational budget, so it is often avoided. However, when the neural networks become infinitely wide, the ensemble is described by a Gaussian process with a mean and variance that can be computed throughout training.
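
To make the finite-ensemble baseline concrete, here is a small sketch of averaging predictions and using their spread as an uncertainty estimate. It does not use Neural Tangents: train_member is a hypothetical stand-in that returns a slightly perturbed linear function in place of a trained network, chosen only so that the example runs with the standard library.

```python
import random
import statistics


def train_member(seed, xs):
    """Hypothetical stand-in for initializing and training one network.

    Returns predictions of a slightly perturbed linear function so that
    different seeds give slightly different outputs.
    """
    rng = random.Random(seed)
    slope = 2.0 + rng.gauss(0.0, 0.1)
    bias = rng.gauss(0.0, 0.1)
    return [slope * x + bias for x in xs]


xs = [0.0, 0.5, 1.0]
ensemble = [train_member(seed, xs) for seed in range(20)]

# Group predictions by input point, then summarize across ensemble members.
per_input = list(zip(*ensemble))
mean = [statistics.mean(preds) for preds in per_input]
variance = [statistics.variance(preds) for preds in per_input]

for x, m, v in zip(xs, mean, variance):
    print(f"x={x:.1f}  mean={m:.3f}  variance={v:.4f}")
```

With real networks every call to train_member would be a full training run, which is the cost that the closed-form infinite-width description avoids.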

With Neural Tangents, one can construct and train ensembles of these infinite-width networks at once using only five lines of code! The resulting training process is displayed below, and an interactive Colaboratory notebook going through this experiment can be found here.

In both plots we compare training of an ensemble of finite neural networks with the infinite-width ensemble of the same architecture. The empirical mean and variance of the finite ensemble is displayed as a dashed black line between two dotted black lines. The closed-form mean and variance of the infinite-width ensemble is displayed as a solid colored line inside a filled color region. In both plots finite- and infinite-width ensembles match very closely and can be hard to distinguish. Left: Outputs (vertical f-axis) on the input data (horizontal x-axis) as the training progresses. Right: Train and test loss with uncertainty over the course of training.

Although the infinite-width ensemble is governed by a simple closed-form expression, it exhibits remarkable agreement with the finite-width ensemble. And since the infinite-width ensemble is a Gaussian process, it naturally provides closed-form uncertainty estimates (filled colored regions in the figure above). These uncertainty estimates closely match the variation in predictions that is observed when training many different copies of the finite network (dashed lines).

The above example shows the power of infinite-width neural networks to capture training dynamics. However, networks built using Neural Tangents can be applied to any problem on which you could apply a regular neural network. For example, below we compare three different infinite-width neural network architectures on image recognition using the CIFAR-10 dataset. Remarkably, we can evaluate ensembles of highly elaborate models like infinitely wide residual networks in closed form under both gradient descent and fully Bayesian inference (an intractable task in the finite-width regime).

We see that, mimicking finite neural networks, infinite-width networks follow a similar hierarchy of performance, with fully-connected networks performing worse than convolutional networks, which in turn perform worse than wide residual networks. However, unlike regular training, the learning dynamics of these models are completely tractable in closed form, which allows unprecedented insight into their behavior.

We invite everyone to explore the infinite-width versions of their models with Neural Tangents, and help us open the black box of deep learning. To get started, please check out the paper, the tutorial Colab notebook, and the GitHub repo; contributions, feature requests, and bug reports are very welcome. This work has been accepted as a spotlight at ICLR 2020.

Acknowledgements
Neural Tangents is being actively developed by Lechao Xiao, Roman Novak, Jiri Hron, Jaehoon Lee, Alex Alemi, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. We also thank Yasaman Bahri and Greg Yang for the ongoing contributions to improve the library, as well as Sergey Ioffe, Ben Adlam, Ravid Ziv, and Jeffrey Pennington for frequent discussion and useful feedback. Finally, we thank Tom Small for creating the animation in the first figure.

Source: Google AI Blog

Measuring Compositional Generalization

People are capable of learning the meaning of a new word and then applying it to other language contexts. As Lake and Baroni put it, “Once a person learns the meaning of a new verb ‘dax’, he or she can immediately understand the meaning of ‘dax twice’ and ‘sing and dax’.” Similarly, one can learn a new object shape and then recognize it with different compositions of previously learned colors or materials (e.g., in the CLEVR dataset). This is because people exhibit the capacity to understand and produce a potentially infinite number of novel combinations of known components, or as Chomsky said, to make “infinite use of finite means.” In the context of a machine learning model learning from a set of training examples, this skill is called compositional generalization.

A common approach for measuring compositional generalization in machine learning (ML) systems is to split the training and testing data based on properties that intuitively correlate with compositional structure. For instance, one approach is to split the data based on sequence length: the training set consists of short examples, while the test set consists of longer examples. Another approach uses sequence patterns, meaning the split is based on randomly assigning clusters of examples sharing the same pattern to either train or test sets. For instance, the questions "Who directed Movie1" and "Who directed Movie2" both fall into the pattern "Who directed <MOVIE>" so they would be grouped together. Yet another method uses held-out primitives: some linguistic primitives are shown very rarely during training (e.g., the verb “jump”), but are very prominent in testing. While each of these experiments is useful, it is not immediately clear which experiment is a "better" measure for compositionality. Is it possible to systematically design an “optimal” compositional generalization experiment?
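The sequence-length and sequence-pattern splits described above can be sketched as follows. This is a minimal illustration under simplifying assumptions: the function names are hypothetical, length is counted in whitespace-separated tokens, and a pattern is obtained by replacing known movie titles with the <MOVIE> placeholder.

```python
import random
from collections import defaultdict


def split_by_length(examples, max_train_tokens):
    """Short examples go to train, longer ones go to test."""
    train = [e for e in examples if len(e.split()) <= max_train_tokens]
    test = [e for e in examples if len(e.split()) > max_train_tokens]
    return train, test


def pattern_of(question, movie_titles):
    """Replace each known movie title with the <MOVIE> placeholder."""
    for title in sorted(movie_titles, key=len, reverse=True):
        question = question.replace(title, "<MOVIE>")
    return question


def split_by_pattern(questions, movie_titles, test_fraction=0.5, seed=0):
    """Randomly assign whole pattern clusters to either train or test."""
    clusters = defaultdict(list)
    for question in questions:
        clusters[pattern_of(question, movie_titles)].append(question)
    patterns = sorted(clusters)
    random.Random(seed).shuffle(patterns)
    n_test = round(len(patterns) * test_fraction)
    test = [q for p in patterns[:n_test] for q in clusters[p]]
    train = [q for p in patterns[n_test:] for q in clusters[p]]
    return train, test


questions = [
    "Who directed Movie1", "Who directed Movie2",
    "Who produced Movie1", "Who produced Movie2",
]
titles = ["Movie1", "Movie2"]
train, test = split_by_pattern(questions, titles)
short, long_ = split_by_length(
    questions + ["Who directed and produced Movie1"], max_train_tokens=3)
```

Because whole clusters are assigned together, no pattern ever appears on both sides of the pattern-based split.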

In “Measuring Compositional Generalization: A Comprehensive Method on Realistic Data”, we attempt to address this question by introducing the largest and most comprehensive benchmark for compositional generalization using realistic natural language understanding tasks, specifically, semantic parsing and question answering. In this work, we propose a metric, compound divergence, that allows one to quantitatively assess how much a train-test split measures the compositional generalization ability of an ML system. We analyze the compositional generalization ability of three sequence-to-sequence ML architectures, and find that they fail to generalize compositionally. We are also releasing the Compositional Freebase Questions dataset used in the work as a resource for researchers wishing to improve upon these results.

Measuring Compositionality

In order to measure the compositional generalization ability of a system, we start with the assumption that we understand the underlying principles of how examples are generated. For instance, we begin with the grammar rules to which we must adhere when generating questions and answers. We then draw a distinction between atoms and compounds. Atoms are the building blocks that are used to generate examples, and compounds are concrete (potentially partial) compositions of these atoms. For example, in the figure below, every box is an atom (e.g., Shane Steel, brother, <entity>'s <entity>, produce, etc.), and atoms fit together to form compounds, such as produce and <verb>, Shane Steel’s brother, Did Shane Steel’s brother produce and direct Revenge of the Spy?, etc.

Building compositional sentences (compounds) from building blocks (atoms)

An ideal compositionality experiment then should have a similar atom distribution, i.e., the distribution of words and sub-phrases in the training set is as similar as possible to their distribution in the test set, but with a different compound distribution. To measure compositional generalization on a question answering task about a movie domain, one might, for instance, have the following questions in train and test:

Train set	Test set
Who directed Inception?	Did Greta Gerwig produce Goldfinger?
Did Greta Gerwig direct Goldfinger?	Who produced Inception?
...	...

While atoms such as “directed”, “Inception”, and “who <predicate> <entity>” appear in both the train and test sets, the compounds are different.

The Compositional Freebase Questions dataset

In order to conduct an accurate compositionality experiment, we created the Compositional Freebase Questions (CFQ) dataset, a simple, yet realistic, large dataset of natural language questions and answers generated from the public Freebase knowledge base. CFQ can be used for text-in / text-out tasks, as well as semantic parsing. In our experiments, we focus on semantic parsing, where the input is a natural language question and the output is a query that, when executed against Freebase, produces the correct outcome. CFQ contains around 240k examples and almost 35k query patterns, making it significantly larger and more complex than comparable datasets: about 4 times the size of WikiSQL, with about 17 times more query patterns than Complex Web Questions. Special care has been taken to ensure that the questions and answers are natural. We also quantify the complexity of the syntax in each example using the “complexity level” metric (L), which corresponds roughly to the depth of the parse tree, examples of which are shown below.

L	Question → Answer
10	What did Commerzbank acquire? → Eurohypo; Dresdner Bank
15	Did Dianna Rhodes’s spouse produce Soldier Blue? → No
20	Which costume designer of E.T. married Mannequin’s cinematographer? → Deborah Lynn Scott
40	Was Weekend Cowgirls produced, directed, and written by a film editor that The Evergreen State College and Fairway Pictures employed? → No
50	Were It’s Not About the Shawerma, The Fifth Wall, Rick’s Canoe, White Stork Is Coming, and Blues for the Avatar executive produced, edited, directed, and written by a screenwriter’s parent? → Yes

Compositional Generalization Experiments on CFQ

For a given train-test split, if the compound distributions of the train and test sets are very similar, then their compound divergence would be close to 0, indicating that the split is not a difficult test for compositional generalization. A compound divergence close to 1 means that the train and test sets have many different compounds, which makes the split a good test for compositional generalization. Compound divergence thus captures the notion of "different compound distribution", as desired.
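To make the 0-to-1 behavior concrete, here is a minimal sketch of a divergence of this kind. It assumes the definition based on the Chernoff coefficient, C_alpha(P || Q) = sum_k p_k^alpha * q_k^(1 - alpha), with divergence = 1 - C_alpha and alpha = 0.1 for compounds; this formula and the value of alpha are not stated in this post and should be checked against the paper. The function names are hypothetical, and compounds are represented here as simple labeled strings with counts, which is a simplification.

```python
from collections import Counter


def normalize(counts):
    """Turn raw compound counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def chernoff_coefficient(p, q, alpha):
    """Sum of p_k**alpha * q_k**(1 - alpha) over compounds present in both."""
    shared = p.keys() & q.keys()
    return sum(p[k] ** alpha * q[k] ** (1.0 - alpha) for k in shared)


def divergence(train_counts, test_counts, alpha):
    """1 minus the Chernoff coefficient of the train and test distributions."""
    p = normalize(train_counts)
    q = normalize(test_counts)
    return 1.0 - chernoff_coefficient(p, q, alpha)


train_compounds = Counter(
    {"who directed <entity>": 2, "did <entity> direct <entity>": 2})
test_compounds = Counter(
    {"who produced <entity>": 2, "did <entity> produce <entity>": 2})
mixed_compounds = Counter(
    {"who directed <entity>": 1, "who produced <entity>": 3})

same = divergence(train_compounds, train_compounds, alpha=0.1)      # about 0
disjoint = divergence(train_compounds, test_compounds, alpha=0.1)   # exactly 1
partial = divergence(train_compounds, mixed_compounds, alpha=0.1)   # in between
```

Under this assumed definition, a small alpha makes the measure care mostly about whether a test compound occurs in the training set at all, rather than about how closely the two frequencies match.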

We algorithmically generate train-test splits using the CFQ dataset that have a compound divergence ranging from 0 to 0.7 (the maximum that we were able to achieve). We fix the atom divergence to be very small. Then, for each split we measure the performance of three standard ML architectures — LSTM+attention, Transformer, and Universal Transformer. The results are shown in the graph below.

Compound divergence vs accuracy for three ML architectures. There is a surprisingly strong negative correlation between compound divergence and accuracy.

We measure the performance of a model by comparing the correct answers with the output string given by the model. All models achieve an accuracy greater than 95% when the compound divergence is very low. The mean accuracy on the split with highest compound divergence is below 20% for all architectures, which means that even a large training set with a similar atom distribution between train and test is not sufficient for the architectures to generalize well. For all architectures, there is a strong negative correlation between the compound divergence and the accuracy. This seems to indicate that compound divergence successfully captures the core difficulty for these ML architectures to generalize compositionally.
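The string comparison described above can be sketched as an exact-match accuracy. This is a minimal illustration with a hypothetical function name; stripping surrounding whitespace before comparing is an assumption, not a detail taken from the paper, and the query strings are placeholders.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of model outputs that equal the reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    # Assumption: only leading and trailing whitespace is ignored.
    matches = sum(
        pred.strip() == ref.strip()
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


references = ["query_a", "query_b", "query_c", "query_d"]
predictions = ["query_a", "query_b ", "query_x", "query_d"]
accuracy = exact_match_accuracy(predictions, references)
```

A metric of this kind gives no partial credit: an output that is almost correct counts the same as one that is entirely wrong.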

Promising directions for future work might be to apply unsupervised pre-training on input language or output queries, or to use more diverse or more targeted learning architectures, such as syntactic attention. It would also be interesting to apply this approach to other domains such as visual reasoning, e.g., based on CLEVR, or to extend our approach to broader subsets of language understanding, including the use of ambiguous constructs, negations, quantification, comparatives, additional languages, and other vertical domains. We hope that this work will inspire others to use this benchmark to advance the compositional generalization capabilities of learning systems.

By Marc van Zee, Software Engineer, Google Research – Brain Team

Source: Google Open Source Blog

Measuring Compositional Generalization

Posted by Marc van Zee, Software Engineer, Google Research

People are capable of learning the meaning of a new word and then applying it to other language contexts. As Lake and Baroni put it, “Once a person learns the meaning of a new verb ‘dax’, he or she can immediately understand the meaning of ‘dax twice’ and ‘sing and dax’.” Similarly, one can learn a new object shape and then recognize it with different compositions of previously learned colors or materials (e.g., in the CLEVR dataset). This is because people exhibit the capacity to understand and produce a potentially infinite number of novel combinations of known components, or as Chomsky said, to make “infinite use of finite means.” In the context of a machine learning model learning from a set of training examples, this skill is called compositional generalization.

A common approach for measuring compositional generalization in machine learning (ML) systems is to split the training and testing data based on properties that intuitively correlate with compositional structure. For instance, one approach is to split the data based on sequence length: the training set consists of short examples, while the test set consists of longer examples. Another approach uses sequence patterns, meaning the split is based on randomly assigning clusters of examples sharing the same pattern to either train or test sets. For instance, the questions "Who directed Movie1" and "Who directed Movie2" both fall into the pattern "Who directed <MOVIE>" so they would be grouped together. Yet another method uses held-out primitives: some linguistic primitives are shown very rarely during training (e.g., the verb “jump”), but are very prominent in testing. While each of these experiments is useful, it is not immediately clear which experiment is a "better" measure for compositionality. Is it possible to systematically design an “optimal” compositional generalization experiment?

In “Measuring Compositional Generalization: A Comprehensive Method on Realistic Data”, we attempt to address this question by introducing the largest and most comprehensive benchmark for compositional generalization using realistic natural language understanding tasks, specifically, semantic parsing and question answering. In this work, we propose a metric, compound divergence, that allows one to quantitatively assess how much a train-test split measures the compositional generalization ability of an ML system. We analyze the compositional generalization ability of three sequence-to-sequence ML architectures, and find that they fail to generalize compositionally. We are also releasing the Compositional Freebase Questions dataset used in the work as a resource for researchers wishing to improve upon these results.

Measuring Compositionality
In order to measure the compositional generalization ability of a system, we start with the assumption that we understand the underlying principles of how examples are generated. For instance, we begin with the grammar rules to which we must adhere when generating questions and answers. We then draw a distinction between atoms and compounds. Atoms are the building blocks that are used to generate examples, and compounds are concrete (potentially partial) compositions of these atoms. For example, in the figure below, every box is an atom (e.g., Shane Steel, brother, <entity>'s <entity>, produce, etc.), and atoms fit together to form compounds, such as produce and <verb>, Shane Steel’s brother, Did Shane Steel’s brother produce and direct Revenge of the Spy?, etc.

Building compositional sentences (compounds) from building blocks (atoms).

An ideal compositionality experiment then should have a similar atom distribution, i.e., the distribution of words and sub-phrases in the training set is as similar as possible to their distribution in the test set, but with a different compound distribution. To measure compositional generalization on a question answering task about a movie domain, one might, for instance, have the following questions in train and test:

While atoms such as “directed”, “Inception”, and “who <predicate> <entity>” appear in both the train and test sets, the compounds are different.

The Compositional Freebase Questions dataset
In order to conduct an accurate compositionality experiment, we created the Compositional Freebase Questions (CFQ) dataset, a simple, yet realistic, large dataset of natural language questions and answers generated from the public Freebase knowledge base. CFQ can be used for text-in / text-out tasks, as well as semantic parsing. In our experiments, we focus on semantic parsing, where the input is a natural language question and the output is a query that, when executed against Freebase, produces the correct outcome. CFQ contains around 240k examples and almost 35k query patterns, making it significantly larger and more complex than comparable datasets: about 4 times the size of WikiSQL, with about 17 times more query patterns than Complex Web Questions. Special care has been taken to ensure that the questions and answers are natural. We also quantify the complexity of the syntax in each example using the “complexity level” metric (L), which corresponds roughly to the depth of the parse tree, examples of which are shown below.

Compositional Generalization Experiments on CFQ
For a given train-test split, if the compound distributions of the train and test sets are very similar, then their compound divergence would be close to 0, indicating that the split is not a difficult test for compositional generalization. A compound divergence close to 1 means that the train and test sets have many different compounds, which makes the split a good test for compositional generalization. Compound divergence thus captures the notion of "different compound distribution", as desired.

We algorithmically generate train-test splits using the CFQ dataset that have a compound divergence ranging from 0 to 0.7 (the maximum that we were able to achieve). We fix the atom divergence to be very small. Then, for each split we measure the performance of three standard ML architectures — LSTM+attention, Transformer, and Universal Transformer. The results are shown in the graph below.

Compound divergence vs accuracy for three ML architectures. There is a surprisingly strong negative correlation between compound divergence and accuracy.