Tag Archives: Information Retrieval

Building SMILY, a Human-Centric, Similar-Image Search Tool for Pathology

Advances in machine learning (ML) have shown great promise for assisting in the work of healthcare professionals, such as aiding the detection of diabetic eye disease and metastatic breast cancer. Though high-performing algorithms are necessary to gain the trust and adoption of clinicians, they are not always sufficient—what information is presented to doctors and how doctors interact with that information can be crucial determinants in the utility that ML technology ultimately has for users.

The medical specialty of anatomic pathology, which is the gold standard for the diagnosis of cancer and many other diseases through microscopic analysis of tissue samples, can greatly benefit from applications of ML. Though diagnosis through pathology is traditionally done on physical microscopes, there has been a growing adoption of “digital pathology,” where high-resolution images of pathology samples can be examined on a computer. With this movement comes the potential to much more easily look up information, as is needed when pathologists tackle the diagnosis of difficult cases or rare diseases, when “general” pathologists approach specialist cases, and when trainee pathologists are learning. In these situations, a common question arises, “What is this feature that I’m seeing?” The traditional solution is for doctors to ask colleagues, or to laboriously browse reference textbooks or online resources, hoping to find an image with similar visual characteristics. The general computer vision solution to problems like this is termed content-based image retrieval (CBIR), one example of which is the “reverse image search” feature in Google Images, in which users can search for similar images by using another image as input.

Today, we are excited to share two research papers describing further progress in human-computer interaction research for similar image search in medicine. In “Similar Image Search for Histopathology: SMILY,” published in Nature Partner Journal (npj) Digital Medicine, we report on our ML-based tool for reverse image search for pathology. In our second paper, “Human-Centered Tools for Coping with Imperfect Algorithms During Medical Decision-Making” (preprint available here), which received an honorable mention at the 2019 ACM CHI Conference on Human Factors in Computing Systems, we explored different modes of refinement for image-based search and evaluated their effects on doctor interaction with SMILY.

SMILY Design
The first step in developing SMILY was to apply a deep learning model, trained using 5 billion natural, non-pathology images (e.g., dogs, trees, man-made objects, etc.), to compress images into a “summary” numerical vector, called an embedding. The network learned during the training process to distinguish similar images from dissimilar ones by computing and comparing their embeddings. This model is then used to create a database of image patches and their associated embeddings using a corpus of de-identified slides from The Cancer Genome Atlas. When a query image patch is selected in the SMILY tool, the query patch’s embedding is similarly computed and compared with the database to retrieve the image patches with the most similar embeddings.
Schematic of the steps in building the SMILY database and the process by which input image patches are used to perform the similar image search.
The tool allows a user to select a region of interest, and obtain visually-similar matches. We tested SMILY’s ability to retrieve images along a pre-specified axis of similarity (e.g. histologic feature or tumor grade), using images of tissue from the breast, colon, and prostate (3 of the most common cancer sites). We found that SMILY demonstrated promising results despite not being trained specifically on pathology images or using any labeled examples of histologic features or tumor grades.
Example of selecting a small region in a slide and using SMILY to retrieve similar images. SMILY efficiently searches a database of billions of cropped images in a few seconds. Because pathology images can be viewed at different magnifications (zoom levels), SMILY automatically searches images at the same magnification as the input image.
Second example of using SMILY, this time searching for a lobular carcinoma, a specific subtype of breast cancer.
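The retrieval step described in this section can be sketched in a few lines. This is only an illustration under stated assumptions, not the SMILY implementation: the embeddings are hand-written toy vectors rather than network outputs, Euclidean distance is assumed as the comparison, and the function name and dictionary keys are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_similar_patches(query_embedding, query_magnification, database, top_k=3):
    """Return the ids of the top_k database patches closest to the query.

    `database` is a list of dicts with the hypothetical keys
    'patch_id', 'magnification' and 'embedding'.
    """
    # Only compare against patches stored at the query's magnification.
    candidates = [p for p in database if p["magnification"] == query_magnification]
    ranked = sorted(
        candidates, key=lambda p: l2_distance(query_embedding, p["embedding"])
    )
    return [p["patch_id"] for p in ranked[:top_k]]


# Toy database; real embeddings would be produced by the trained network.
database = [
    {"patch_id": "a", "magnification": 10, "embedding": [0.9, 0.1, 0.0]},
    {"patch_id": "b", "magnification": 10, "embedding": [0.1, 0.9, 0.0]},
    {"patch_id": "c", "magnification": 40, "embedding": [1.0, 0.0, 0.0]},
    {"patch_id": "d", "magnification": 10, "embedding": [0.8, 0.2, 0.1]},
]

results = search_similar_patches([1.0, 0.0, 0.0], 10, database, top_k=2)
```

Patch "c" matches the query exactly but is skipped because it is stored at a different magnification. A linear scan like this would not scale to billions of patches; a production system would presumably rely on an optimized nearest-neighbor index.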
Refinement tools for SMILY
However, a problem emerged when we observed how pathologists interacted with SMILY. Specifically, users were trying to answer the nebulous question of “What looks similar to this image?” so that they could learn from past cases containing similar images. Yet, there was no way for the tool to understand the intent of the search: Was the user trying to find images that have a similar histologic feature, glandular morphology, overall architecture, or something else? In other words, users needed the ability to guide and refine the search results on a case-by-case basis in order to actually find what they were looking for. Furthermore, we observed that this need for iterative search refinement was rooted in how doctors often perform “iterative diagnosis”—by generating hypotheses, collecting data to test these hypotheses, exploring alternative hypotheses, and revisiting or retesting previous hypotheses in an iterative fashion. It became clear that, for SMILY to meet real user needs, it would need to support a different approach to user interaction.

Through careful human-centered research described in our second paper, we designed and augmented SMILY with a suite of interactive refinement tools that enable end-users to express what similarity means on-the-fly: 1) refine-by-region allows pathologists to crop a region of interest within the image, limiting the search to just that region; 2) refine-by-example gives users the ability to pick a subset of the search results and retrieve more results like those; and 3) refine-by-concept sliders can be used to specify that more or less of a clinical concept be present in the search results (e.g., fused glands). Rather than requiring that these concepts be built into the machine learning model, we instead developed a method that enables end-users to create new concepts post-hoc, customizing the search algorithm towards concepts they find important for each specific use case. This enables new explorations via post-hoc tools after a machine learning model has already been trained, without needing to re-train the original model for each concept or application of interest.
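One way to picture refine-by-example and refine-by-concept is as simple arithmetic on embeddings. The post does not spell out the mechanics, so everything below is an assumption made for illustration: the refined query is taken to be the centroid of the selected results, and a concept is taken to be a direction obtained from the difference between mean embeddings of example patches with and without the concept. All function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def refine_by_example(selected_embeddings):
    """Assumed behavior: the new query is the centroid of the picked results."""
    return mean_vector(selected_embeddings)


def concept_direction_from_examples(with_concept, without_concept):
    """Assumed concept vector: mean(with concept) minus mean(without concept)."""
    pos = mean_vector(with_concept)
    neg = mean_vector(without_concept)
    return [p - n for p, n in zip(pos, neg)]


def refine_by_concept(query_embedding, concept_direction, slider):
    """Shift the query along the concept; slider > 0 asks for more of it."""
    return [q + slider * c for q, c in zip(query_embedding, concept_direction)]


# A user marks two patches that show the concept and two that do not.
direction = concept_direction_from_examples(
    with_concept=[[1.0, 0.0], [1.0, 2.0]],
    without_concept=[[0.0, 1.0], [0.0, 1.0]],
)
shifted_query = refine_by_concept([0.5, 0.5], direction, slider=2.0)
example_query = refine_by_example([[0.0, 2.0], [2.0, 0.0]])
```

The point of the sketch is that both refinements operate on embeddings after the fact, which is consistent with the post's statement that the original model does not need to be re-trained for each concept.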
Through our user study with pathologists, we found that the tool-based SMILY not only increased the clinical usefulness of search results, but also significantly increased users’ trust and likelihood of adoption, compared to a conventional version of SMILY without these tools. Interestingly, these refinement tools appeared to have supported pathologists’ decision-making process in ways beyond simply performing better on similarity searches. For example, pathologists used the observed changes to their results from iterative searches as a means of progressively tracking the likelihood of a hypothesis. When search results were surprising, many re-purposed the tools to test and understand the underlying algorithm, for example, by cropping out regions they thought were interfering with the search or by adjusting the concept sliders to increase the presence of concepts they suspected were being ignored. Beyond being passive recipients of ML results, doctors were empowered with the agency to actively test hypotheses and apply their expert domain knowledge, while simultaneously leveraging the benefits of automation.
With these interactive tools enabling users to tailor each search experience to their desired intent, we are excited for SMILY’s potential to assist with searching large databases of digitized pathology images. One potential application of this technology is to index textbooks of pathology images with descriptive captions, and enable medical students or pathologists in training to search these textbooks using visual search, speeding up the educational process. Another application is for cancer researchers interested in studying the correlation of tumor morphologies with patient outcomes, to accelerate the search for similar cases. Finally, pathologists may be able to leverage tools like SMILY to locate all occurrences of a feature (e.g. signs of active cell division, or mitosis) in the same patient’s tissue sample to better understand the severity of the disease to inform cancer therapy decisions. Importantly, our findings add to the body of evidence that sophisticated machine learning algorithms need to be paired with human-centered design and interactive tooling in order to be most useful.

This work would not have been possible without Jason D. Hipp, Yun Liu, Emily Reif, Daniel Smilkov, Michael Terry, Craig H. Mermel, Martin C. Stumpe and members of Google Health and PAIR. Preprints of the two papers are available here and here.

Source: Google AI Blog

Multilingual Universal Sentence Encoder for Semantic Retrieval

Since it was introduced last year, “Universal Sentence Encoder (USE) for English” has become one of the most downloaded pre-trained text modules in TensorFlow Hub, providing versatile sentence embedding models that convert sentences into vector representations. These vectors capture rich semantic information that can be used to train classifiers for a broad range of downstream tasks. For example, a strong sentiment classifier can be trained from as few as one hundred labeled examples, while the same embeddings can still be used to measure semantic similarity and for meaning-based clustering.

Today, we are pleased to announce the release of three new USE multilingual modules with additional features and potential applications. The first two modules provide multilingual models for retrieving semantically similar text, one optimized for retrieval performance and the other for speed and less memory usage. The third model is specialized for question-answer retrieval in sixteen languages (USE-QA), and represents an entirely new application of USE. All three multilingual modules are trained using a multi-task dual-encoder framework, similar to the original USE model for English, while using techniques we developed for improving the dual-encoder with additive margin softmax approach. They are designed not only to maintain good transfer learning performance, but to perform well on semantic retrieval tasks.
Multi-task training structure of the Universal Sentence Encoder. A variety of tasks and task structures are joined by shared encoder layers/parameters (pink boxes).
Semantic Retrieval Applications
The three new modules are all built on semantic retrieval architectures, which typically split the encoding of questions and answers into separate neural networks, making it possible to search among billions of potential answers within milliseconds. The key to using dual encoders for efficient semantic retrieval is to pre-encode all candidate answers to expected input queries and store them in a vector database that is optimized for solving the nearest neighbor problem, which allows a large number of candidates to be searched quickly with good precision and recall. For all three modules, the input query is then encoded into a vector on which we can perform an approximate nearest neighbor search. Together, this enables good results to be found quickly without needing to do a direct query/candidate comparison for every candidate. The prototypical pipeline is illustrated below:
A prototypical semantic retrieval pipeline, used for textual similarity.
Semantic Similarity Modules
For semantic similarity tasks, the query and candidates are encoded using the same neural network. Two common semantic retrieval tasks made possible by the new modules include Multilingual Semantic Textual Similarity Retrieval and Multilingual Translation Pair Retrieval.
  • Multilingual Semantic Textual Similarity Retrieval
    Most existing approaches for finding semantically similar text require being given a pair of texts to compare. However, using the Universal Sentence Encoder, semantically similar text can be extracted directly from a very large database. For example, in an application like FAQ search, a system can first index all possible questions with associated answers. Then, given a user’s question, the system can search for known questions that are semantically similar enough to provide an answer. A similar approach was used to find comparable sentences from 50 million sentences in Wikipedia. With the new multilingual USE models, this can be done in any of the supported non-English languages.
  • Multilingual Translation Pair Retrieval
    The newly released modules can also be used to mine translation pairs to train neural machine translation systems. Given a source sentence in one language (“How do I get to the restroom?”), they can find the potential translation target in any other supported language (“¿Cómo llego al baño?”).
Both new semantic similarity modules are cross-lingual. Given an input in Chinese, for example, the modules can find the best candidates, regardless of which language they are expressed in. This versatility can be particularly useful for languages that are underrepresented on the internet. For example, an early version of these modules has been used by Chidambaram et al. (2018) to provide classifications in circumstances where the training data is only available in a single language, e.g. English, but the end system must function in a range of other languages.

USE for Question-Answer Retrieval
The USE-QA module extends the USE architecture to question-answer retrieval applications, which generally take an input query and find relevant answers from a large set of documents that may be indexed at the document, paragraph, or even sentence level. The input query is encoded with the question encoding network, while the candidates are encoded with the answer encoding network.
Visualizing the action of a neural answer retrieval system. The blue point at the north pole represents the question vector. The other points represent the embeddings of various answers. The correct answer, highlighted here in red, is “closest” to the question, in that it minimizes the angular distance. The points in this diagram are produced by an actual USE-QA model; however, they have been projected down from ℝ⁵⁰⁰ to ℝ³ to assist the reader’s visualization.
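The notion of “closest” used here, angular distance, is the angle between two vectors, i.e. the arccosine of their cosine similarity. The sketch below ranks a few toy answer vectors against a question vector placed at the “north pole”. The three-dimensional vectors are invented for illustration and do not come from a USE-QA model.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def closest_answer(question_vec, answer_vecs):
    """Return the key of the answer with the smallest angular distance."""
    return min(
        answer_vecs, key=lambda name: angular_distance(question_vec, answer_vecs[name])
    )


question = [0.0, 0.0, 1.0]  # the "north pole"
answers = {
    "correct": [0.1, 0.0, 0.9],
    "unrelated": [1.0, 0.0, 0.0],
    "opposite": [0.0, 0.0, -1.0],
}
best = closest_answer(question, answers)
```

Because the angle ignores vector length, only the direction of each embedding matters in this comparison.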
Question-answer retrieval systems also rely on the ability to understand semantics. For example, consider a possible query to one such system, Google Talk to Books, which was launched in early 2018 and backed by a sentence-level index of over 100,000 books. A query, “What fragrance brings back memories?”, yields the result, “And for me, the smell of jasmine along with the pan bagnat, it brings back my entire carefree childhood.” Without specifying any explicit rules or substitutions, the vector encoding captures the semantic similarity between the terms fragrance and smell. The advantage provided by the USE-QA module is that it can extend question-answer retrieval tasks such as this to multilingual applications.

For Researchers and Developers
We're pleased to share the latest additions to the Universal Sentence Encoder family with the research community, and are excited to see what other applications will be found. These modules can be used as-is, or fine-tuned using domain-specific data. Lastly, we will also host the semantic similarity for natural language page on Cloud AI Workshop to further encourage research in this area.

Mandy Guo, Daniel Cer, Noah Constant, Jax Law, Muthuraman Chidambaram for core modeling, Gustavo Hernandez Abrego, Chen Chen, Mario Guajardo-Cespedes for infrastructure and colabs, Steve Yuan, Chris Tar, Yunhsuan Sung, Brian Strope, Ray Kurzweil for discussion of the model architecture.

Source: Google AI Blog

TF-Ranking: A Scalable TensorFlow Library for Learning-to-Rank

Ranking, the process of ordering a list of items in a way that maximizes the utility of the entire list, is applicable in a wide range of domains, from search engines and recommender systems to machine translation, dialogue systems and even computational biology. In applications like these (and many others), researchers often utilize a set of supervised machine learning techniques called learning-to-rank. In many cases, these learning-to-rank techniques are applied to datasets that are prohibitively large, scenarios where the scalability of TensorFlow could be an advantage. However, there is currently no out-of-the-box support for applying learning-to-rank techniques in TensorFlow. To the best of our knowledge, there are also no other open source libraries that specialize in applying learning-to-rank techniques at scale.

Today, we are excited to share TF-Ranking, a scalable TensorFlow-based library for learning-to-rank. As described in our recent paper, TF-Ranking provides a unified framework that includes a suite of state-of-the-art learning-to-rank algorithms, and supports pairwise or listwise loss functions, multi-item scoring, ranking metric optimization, and unbiased learning-to-rank.

TF-Ranking is fast and easy to use, and creates high-quality ranking models. The unified framework gives ML researchers, practitioners and enthusiasts the ability to evaluate and choose among an array of different ranking models within a single library. Moreover, we strongly believe that a key to a useful open source library is not only providing sensible defaults, but also empowering our users to develop their own custom models. Therefore, we provide flexible APIs, within which the users can define and plug in their own customized loss functions, scoring functions and metrics.

Existing Algorithms and Metrics Support
The objective of learning-to-rank algorithms is minimizing a loss function defined over a list of items to optimize the utility of the list ordering for any given application. TF-Ranking supports a wide range of standard pointwise, pairwise and listwise loss functions as described in prior work. This ensures that researchers using the TF-Ranking library are able to reproduce and extend previously published baselines, and practitioners can make the most informed choices for their applications. Furthermore, TF-Ranking can handle sparse features (like raw text) through embeddings and scales to hundreds of millions of training instances. Thus, anyone who is interested in building real-world data intensive ranking systems such as web search or news recommendation, can use TF-Ranking as a robust, scalable solution.
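To make the three loss families concrete, here is a minimal sketch with one textbook representative of each: a pointwise squared error, a pairwise logistic loss, and a listwise softmax cross-entropy. These are plain-Python illustrations with names chosen for this example; they are not the TF-Ranking API and may differ in detail from the library's own implementations.

```python
import math


def pointwise_squared_loss(scores, labels):
    """Pointwise: each item's score is compared with its own label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_logistic_loss(scores, labels):
    """Pairwise: penalize pairs where the more relevant item is not scored higher."""
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                loss += math.log1p(math.exp(-(scores[i] - scores[j])))
    return loss


def listwise_softmax_loss(scores, labels):
    """Listwise: cross-entropy between the labels and a softmax over the whole list."""
    max_s = max(scores)
    # Log of the softmax normalizer, shifted by max_s for numerical stability.
    log_z = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
    return -sum(y * (s - log_z) for s, y in zip(scores, labels))


labels = [1, 0]          # the first item is relevant, the second is not
good_scores = [2.0, 0.0]  # ranks the relevant item first
bad_scores = [0.0, 2.0]   # ranks the relevant item last

pairwise_good = pairwise_logistic_loss(good_scores, labels)
pairwise_bad = pairwise_logistic_loss(bad_scores, labels)
listwise_good = listwise_softmax_loss(good_scores, labels)
```

All three losses are lower when the relevant item receives the higher score. For a two-item list with a single relevant item, the pairwise logistic and listwise softmax losses happen to coincide.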

Empirical evaluation is an important part of any machine learning or information retrieval research. To ensure compatibility with prior work, we support many of the commonly used ranking metrics, including Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). We also make it easy to visualize these metrics at training time on TensorBoard, an open source TensorFlow visualization dashboard.
An example of the NDCG metric (Y-axis) along the training steps (X-axis) displayed in the TensorBoard. It shows the overall progress of the metrics during training. Different methods can be compared directly on the dashboard. Best models can be selected based on the metric.
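For readers unfamiliar with the two metrics, the following sketch computes them from lists of relevance labels given in ranked order. It assumes the common exponential-gain form of NDCG, with gain 2^label − 1 and a log2 position discount; other variants exist, and this is not the TF-Ranking implementation.

```python
import math


def reciprocal_rank(ranked_labels):
    """1 / position of the first relevant item, or 0 if none is relevant."""
    for position, label in enumerate(ranked_labels, start=1):
        if label > 0:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of per-query label lists."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def dcg(ranked_labels, k=None):
    """Discounted cumulative gain over the top k items (all items if k is None)."""
    top = ranked_labels if k is None else ranked_labels[:k]
    return sum(
        (2 ** label - 1) / math.log2(position + 1)
        for position, label in enumerate(top, start=1)
    )


def ndcg(ranked_labels, k=None):
    """DCG divided by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg(sorted(ranked_labels, reverse=True), k)
    return dcg(ranked_labels, k) / ideal if ideal > 0 else 0.0


mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])
perfect = ndcg([3, 2, 1])
swapped = ndcg([0, 1])
```

Here the first query has its first relevant item at position 2 and the second at position 1, so the MRR is the mean of 1/2 and 1.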
Multi-Item Scoring
TF-Ranking supports a novel scoring mechanism wherein multiple items (e.g., web pages) can be scored jointly, an extension of the traditional scoring paradigm in which single items are scored independently. One challenge in multi-item scoring is inference: items have to be grouped and scored in subgroups, and the scores are then accumulated per item and used for sorting. To hide these complexities from the user, TF-Ranking provides a List-In-List-Out (LILO) API that wraps all this logic in the exported TF models.
The TF-Ranking library supports multi-item scoring architecture, an extension of traditional single-item scoring.
As we demonstrate in recent work, multi-item scoring is competitive in its performance to the state-of-the-art learning-to-rank models such as RankNet, MART, and LambdaMART on a public LETOR benchmark.
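The group-score-accumulate-sort procedure described in this section can be illustrated as follows. This is a sketch under assumptions rather than TF-Ranking's implementation: it enumerates every ordered subgroup of a fixed size (a real system would more likely sample them), averages the per-item scores, and uses a hypothetical toy scorer in place of a learned joint scoring function.

```python
from itertools import permutations


def multi_item_scores(items, group_scorer, group_size):
    """Score items jointly in subgroups and average the scores per item."""
    if not 1 <= group_size <= len(items):
        raise ValueError("group_size must be between 1 and len(items)")
    totals = [0.0] * len(items)
    counts = [0] * len(items)
    # Enumerate every ordered subgroup of distinct item indices.
    for group in permutations(range(len(items)), group_size):
        scores = group_scorer([items[i] for i in group])
        for index, score in zip(group, scores):
            totals[index] += score
            counts[index] += 1
    return [total / count for total, count in zip(totals, counts)]


def toy_group_scorer(group_features):
    """Hypothetical joint scorer: each item's feature relative to the group mean."""
    mean = sum(group_features) / len(group_features)
    return [feature - mean for feature in group_features]


def rank_list(items, group_scorer, group_size=2):
    """List in, list out: item indices sorted by accumulated score, best first."""
    scores = multi_item_scores(items, group_scorer, group_size)
    return sorted(range(len(items)), key=lambda i: scores[i], reverse=True)


# One scalar feature per item, purely for illustration.
ranking = rank_list([0.2, 0.9, 0.5], toy_group_scorer, group_size=2)
```

The caller passes in a list and gets a ranked list back, while the grouping and accumulation stay inside `rank_list`, which is the kind of encapsulation the list-in, list-out idea describes.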

Ranking Metric Optimization
An important research challenge in learning-to-rank is direct optimization of ranking metrics (such as the previously mentioned NDCG and MRR). These metrics, while being able to measure the performance of ranking systems better than the standard classification metrics like Area Under the Curve (AUC), have the unfortunate property of being either discontinuous or flat. Therefore standard stochastic gradient descent optimization of these metrics is problematic.

In recent work, we proposed a novel method, LambdaLoss, which provides a principled probabilistic framework for ranking metric optimization. In this framework, metric-driven loss functions can be designed and optimized by an expectation-maximization procedure. The TF-Ranking library integrates the recent advances in direct metric optimization and provides an implementation of LambdaLoss. We are hopeful that this will encourage and facilitate further research advances in the important area of ranking metric optimization.

Unbiased Learning-to-Rank
Prior research has shown that given a ranked list of items, users are much more likely to interact with the first few results, regardless of their relevance. This observation has inspired research interest in unbiased learning-to-rank, and led to the development of unbiased evaluation and several unbiased learning algorithms, based on re-weighting of training instances. In the TF-Ranking library, metrics are implemented to support unbiased evaluation and losses are implemented for unbiased learning by natively supporting re-weighting to overcome the inherent biases in user interaction datasets.
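A common form of instance re-weighting is inverse propensity weighting: a click observed at a position that users rarely examine counts for more than a click at the top. The sketch below assumes that form; the examination propensities are made-up numbers, and the function names are hypothetical rather than part of TF-Ranking.

```python
import math


def inverse_propensity_weights(positions, propensities):
    """Weight each instance by 1 / P(user examines the position it was shown at)."""
    return [1.0 / propensities[position] for position in positions]


def weighted_logistic_loss(scores, clicks, weights):
    """Sigmoid cross-entropy where each instance contributes according to its weight."""
    total = 0.0
    for score, click, weight in zip(scores, clicks, weights):
        p = 1.0 / (1.0 + math.exp(-score))
        total += -weight * (click * math.log(p) + (1 - click) * math.log(1.0 - p))
    return total / sum(weights)


# Hypothetical examination propensities per result position (1 = top result).
propensities = {1: 1.0, 2: 0.5, 3: 0.25}

# Two clicked results, shown at positions 1 and 3.
weights = inverse_propensity_weights([1, 3], propensities)
loss = weighted_logistic_loss(scores=[0.0, 0.0], clicks=[1, 1], weights=weights)
```

With these propensities, the click at position 3 carries four times the weight of the click at position 1. In practice the propensities have to be estimated, for example from randomized result orderings.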

Getting Started with TF-Ranking
TF-Ranking implements the TensorFlow Estimator interface, which greatly simplifies machine learning programming by encapsulating training, evaluation, prediction and export for serving. TF-Ranking is well integrated with the rich TensorFlow ecosystem. As described above, you can use TensorBoard to visualize ranking metrics like NDCG and MRR, as well as to pick the best model checkpoints using these metrics. Once your model is ready, it is easy to deploy it in production using TensorFlow Serving.

If you’re interested in trying TF-Ranking for yourself, please check out our GitHub repo, and walk through the tutorial examples. TF-Ranking is an active research project, and we welcome your feedback and contributions. We are excited to see how TF-Ranking can help the information retrieval and machine learning research communities.

This project was only possible thanks to the members of the core TF-Ranking team: Rama Pasumarthi, Cheng Li, Sebastian Bruch, Nadav Golbandi, Stephan Wolf, Jan Pfeifer, Rohan Anil, Marc Najork, Patrick McGregor and Clemens Mewald. We thank the members of the TensorFlow team for their advice and support: Alexandre Passos, Mustafa Ispir, Karmel Allison, Martin Wicke, and others. Finally, we extend our special thanks to our collaborators, interns and early adopters: Suming Chen, Zhen Qin, Chirag Sethi, Maryam Karimzadehgan, Makoto Uchida, Yan Zhu, Qingyao Ai, Brandon Tran, Donald Metzler, Mike Colagrosso, and many others at Google who helped in evaluating and testing the early versions of TF-Ranking.

Source: Google AI Blog

Evaluation of Speech for the Google Assistant

Voice interactions with technology are becoming a key part of our lives — from asking your phone for traffic conditions to work to using a smart device at home to turn on the lights or play music. The Google Assistant is designed to provide help and information across a variety of platforms, and is built to bring together a number of products — including Google Maps, Search, Google Photos, third party services, and more. For some of these products, we have released specific evaluation guidelines, like Search Quality Rating Guidelines. However, the Google Assistant needs its own guidelines in place, as many of its interactions utilize what is called “eyes-free technology,” when there is no screen as part of the experience.

In the past we have received requests to see our evaluation guidelines from academics who are researching improvements in voice interactions, question answering and voice-guided exploration. To facilitate their evaluations, we are publishing some of the first Google Assistant guidelines. It is our hope that making these guidelines public will help the research community build and evaluate their own systems.

Creating the Guidelines
For many queries, responses are presented on the display (like a phone) with a graph, a table, or an interactive element, like you’d see for [weather this weekend].
But spoken responses are very different from display results, as what’s on screen needs to be translated into useful speech. Furthermore, the contents of the voice response are sometimes sourced from the web, and in those cases it’s important to provide the user with a link to the original source. While users looking at their mobile device can click through to read the original web page, an eyes-free solution presents unique challenges. In order to generate the optimal audio response, we use a combination of explicit linguistic knowledge and deep learning solutions that allow us to keep answers grammatical, fluent and concise.

How do we ensure that we consistently meet user expectations on quality, across all answer types and languages? One of the tools we use to measure this is human evaluation. In these evaluations, we ask raters to make sure that answers are satisfactory across several dimensions:
  • Information Satisfaction: the content of the answer should meet the information needs of the user.
  • Length: when a displayed answer is too long, users can quickly scan it visually and locate the relevant information. For voice answers, that is not possible. It is much more important to ensure that we provide a helpful amount of information, hopefully not too much or too little. Some of our previous work is currently in use for identifying the most relevant fragments of answers.
  • Formulation: it is much easier to understand a badly formulated written answer than an ungrammatical spoken answer, so more care has to be placed in ensuring grammatical correctness.
  • Elocution: spoken answers must have proper pronunciation and prosody. Improvements in text-to-speech generation, such as WaveNet and Tacotron 2, are quickly reducing the gap with human performance.
The current version of the guidelines can be found here. Of course, guidelines are often updated, and these are just a snapshot of something that is a living, changing, always-work-in-progress evaluation!