Tag Archives: Health

Using Deep Learning to Inform Differential Diagnoses of Skin Diseases

An estimated 1.9 billion people worldwide suffer from a skin condition at any given time, and due to a shortage of dermatologists, many cases are seen by general practitioners instead. In the United States alone, up to 37% of patients seen in the clinic have at least one skin complaint and more than half of those patients are seen by non-dermatologists. However, studies demonstrate a significant gap in the accuracy of skin condition diagnoses between general practitioners and dermatologists, with the accuracy of general practitioners between 24% and 70%, compared to 77-96% for dermatologists. This can lead to suboptimal referrals, delays in care, and errors in diagnosis and treatment.

Existing strategies for non-dermatologists to improve diagnostic accuracy include the use of reference textbooks, online resources, and consultation with a colleague. Machine learning tools have also been developed with the aim of helping to improve diagnostic accuracy. Previous research has largely focused on early screening of skin cancer, in particular, whether a lesion is malignant or benign, or whether a lesion is melanoma. However, upwards of 90% of skin problems are not malignant, and addressing these more common conditions is also important to reduce the global burden of skin disease.

In “A Deep Learning System for Differential Diagnosis of Skin Diseases,” we developed a deep learning system (DLS) to address the most common skin conditions seen in primary care. Our results showed that a DLS can achieve an accuracy across 26 skin conditions that is on par with U.S. board-certified dermatologists, when presented with identical information about a patient case (images and metadata). This study highlights the potential of the DLS to augment the ability of general practitioners who did not have additional specialty training to accurately diagnose skin conditions.

DLS Design
Clinicians often face ambiguous cases for which there is no clear cut answer. For example, is this patient’s rash stasis dermatitis or cellulitis, or perhaps both superimposed? Rather than giving just one diagnosis, clinicians generate a differential diagnosis, which is a ranked list of possible diagnoses. A differential diagnosis frames the problem so that additional workup (laboratory tests, imaging, procedures, consultations) and treatments can be systematically applied until a diagnosis is confirmed. As such, a deep learning system (DLS) that produces a ranked list of possible skin conditions for a skin complaint closely mimics how clinicians think and is key to prompt triage, diagnosis and treatment for patients.

To render this prediction, the DLS processes inputs, including one or more clinical images of the skin abnormality and up to 45 types of metadata (self-reported components of the medical history such as age, sex, symptoms, etc.). For each case, multiple images were processed using the Inception-v4 neural network architecture and combined with feature-transformed metadata, for use in the classification layer. In our study, we developed and evaluated the DLS with 17,777 de-identified cases that were primarily referred from primary care clinics to a teledermatology service. Data from 2010-2017 were used for training and data from 2017-2018 for evaluation. During model training, the DLS leveraged over 50,000 differential diagnoses provided by over 40 dermatologists.

To evaluate the DLS’s accuracy, we compared it to a rigorous reference standard based on the diagnoses from three U.S. board-certified dermatologists. In total, dermatologists provided differential diagnoses for 3,756 cases (“Validation set A”), and these diagnoses were aggregated via a voting process to derive the ground truth labels. The DLS’s ranked list of skin conditions was compared with this dermatologist-derived differential diagnosis, achieving 71% and 93% top-1 and top-3 accuracies, respectively.
Schematic of the DLS and how the reference standard (ground truth) was derived via the voting of three board-certified dermatologists for each case in the validation set.
Comparison to Professional Evaluations
In this study, we also compared the accuracy of the DLS to that of three categories of clinicians on a subset of the validation A dataset (“Validation set B”): dermatologists, primary care physicians (PCPs), and nurse practitioners (NPs) — all chosen randomly and representing a range of experience, training, and diagnostic accuracy. Because typical differential diagnoses provided by clinicians only contain up to three diagnoses, we compared only the top three predictions by the DLS with the clinicians. The DLS achieved a top-3 diagnostic accuracy of 90% on the validation B dataset, which was comparable to dermatologists and substantially higher than primary care physicians (PCPs) and nurse practitioners (NPs)—75%, 60%, and 55%, respectively, for the 6 clinicians in each group. This high top-3 accuracy suggests that the DLS may help prompt clinicians (including dermatologists) to consider possibilities that were not originally in their differential diagnoses, thus improving diagnostic accuracy and condition management.
The DLS’s leading (top-1) differential diagnosis is substantially higher than PCPs and NPs, and on par with dermatologists. This accuracy increases substantially when we look at the DLS’s top-3 accuracy, suggesting that in the majority of cases the DLS’s ranked list of diagnoses contains the correct ground truth answer for the case.
Assessing Demographic Performance
Skin type, in particular, is highly relevant to dermatology, where visual assessment of the skin itself is crucial to diagnosis. To evaluate potential bias towards skin type, we examined DLS performance based on the Fitzpatrick skin type, which is a scale that ranges from Type I (“pale white, always burns, never tans”) to Type VI (“darkest brown, never burns”). To ensure sufficient numbers of cases on which to draw convincing conclusions, we focused on skin types that represented at least 5% of the data — Fitzpatrick skin types II through IV. On these categories, the DLS’s accuracy was similar, with a top-1 accuracy ranging from 69-72%, and the top-3 accuracy from 91-94%. Encouragingly, the DLS also remained accurate in patient subgroups for which significant numbers (at least 5%) were present in the dataset based on other self-reported demographic information: age, sex, and race/ethnicities. As further qualitative analysis, we assessed via saliency (explanation) techniques that the DLS was reassuringly “focusing” on the abnormalities instead of on skin tone.
Left: An example of a case with hair loss that was challenging for non-specialists to arrive at the specific diagnosis, which is necessary for determining appropriate treatment. Right: An image with regions highlighted in green showing the areas that the DLS identified as important and used to make its prediction. Center: The combined image, which indicates that the DLS mostly focused on the area with hair loss to make this prediction, instead of on forehead skin color, for example, which may indicate potential bias.
Incorporating Multiple Data Types
We also studied the effect of different types of input data on the DLS performance. Much like how having images from several angles can help a teledermatologist more accurately diagnose a skin condition, the accuracy of the DLS improves with increasing number of images. If metadata (e.g., the medical history) is missing, the model does not perform as well. This accuracy gap, which may occur in scenarios where no medical history is available, can be partially mitigated by training the DLS with only images. Nevertheless, this data suggests that providing the answers to a few questions about the skin condition can substantially improve the DLS accuracy.
The DLS performance improves when more images (blue line) or metadata (blue compared with red line) are present. In the absence of metadata as input, training a separate DLS using images alone leads to a marginal improvement compared to the current DLS (green line).
Future Work and Applications
Though these results are very promising, much work remains ahead. First, as reflective of real-world practice, the relative rarity of skin cancer such as melanoma in our dataset hindered our ability to train an accurate system to detect cancer. Related to this, the skin cancer labels in our dataset were not biopsy-proven, limiting the quality of the ground truth in this regard. Second, while our dataset did contain a variety of Fitzpatrick skin types, some skin types were too rare in this dataset to allow meaningful training or analysis. Finally, the validation dataset was from one teledermatology service. Though 17 primary care locations across two states were included, additional validation on cases from a wider geographical region will be critical. We believe these limitations can be addressed by including more cases of biopsy-proven skin cancers in the training and validation sets, and including cases representative of additional Fitzpatrick skin types and from other clinical centers.

The success of deep learning to inform the differential diagnosis of skin disease is highly encouraging of such a tool’s potential to assist clinicians. For example, such a DLS could help triage cases to guide prioritization for clinical care or could help non-dermatologists initiate dermatologic care more accurately and potentially improve access. Though significant work remains, we are excited for future efforts in examining the usefulness of such a system for clinicians. For research collaboration inquiries, please contact dermatology-research@google.com.

This work involved the efforts of a multidisciplinary team of software engineers, researchers, clinicians and cross functional contributors. Key contributors to this project include Yuan Liu, Ayush Jain, Clara Eng, David H. Way, Kang Lee, Peggy Bui, Kimberly Kanada, Guilherme de Oliveira Marinho, Jessica Gallegos, Sara Gabriele, Vishakha Gupta, Nalini Singh, Vivek Natarajan, Rainer Hofmann-Wellenhof, Greg S. Corrado, Lily H. Peng, Dale R. Webster, Dennis Ai, Susan Huang, Yun Liu, R. Carter Dunn and David Coz. The authors would like to acknowledge William Chen, Jessica Yoshimi, Xiang Ji and Quang Duong for software infrastructure support for data collection. Thanks also go to Genevieve Foti, Ken Su, T Saensuksopa, Devon Wang, Yi Gao and Linh Tran. Last but not least, this work would not have been possible without the participation of the dermatologists, primary care physicians, nurse practitioners who reviewed cases for this study, Sabina Bis who helped to establish the skin condition mapping and Amy Paller who provided feedback on the manuscript.

Source: Google AI Blog

Building SMILY, a Human-Centric, Similar-Image Search Tool for Pathology

Advances in machine learning (ML) have shown great promise for assisting in the work of healthcare professionals, such as aiding the detection of diabetic eye disease and metastatic breast cancer. Though high-performing algorithms are necessary to gain the trust and adoption of clinicians, they are not always sufficient—what information is presented to doctors and how doctors interact with that information can be crucial determinants in the utility that ML technology ultimately has for users.

The medical specialty of anatomic pathology, which is the gold standard for the diagnosis of cancer and many other diseases through microscopic analysis of tissue samples, can greatly benefit from applications of ML. Though diagnosis through pathology is traditionally done on physical microscopes, there has been a growing adoption of “digital pathology,” where high-resolution images of pathology samples can be examined on a computer. With this movement comes the potential to much more easily look up information, as is needed when pathologists tackle the diagnosis of difficult cases or rare diseases, when “general” pathologists approach specialist cases, and when trainee pathologists are learning. In these situations, a common question arises, “What is this feature that I’m seeing?” The traditional solution is for doctors to ask colleagues, or to laboriously browse reference textbooks or online resources, hoping to find an image with similar visual characteristics. The general computer vision solution to problems like this is termed content-based image retrieval (CBIR), one example of which is the “reverse image search” feature in Google Images, in which users can search for similar images by using another image as input.

Today, we are excited to share two research papers describing further progress in human-computer interaction research for similar image search in medicine. In “Similar Image Search for Histopathology: SMILY” published in Nature Partner Journal (npj) Digital Medicine, we report on our ML-based tool for reverse image search for pathology. In our second paper, Human-Centered Tools for Coping with Imperfect Algorithms During Medical Decision-Making(preprint available here), which received an honorable mention at the 2019 ACM CHI Conference on Human Factors in Computing Systems, we explored different modes of refinement for image-based search, and evaluated their effects on doctor interaction with SMILY.

SMILY Design
The first step in developing SMILY was to apply a deep learning model, trained using 5 billion natural, non-pathology images (e.g., dogs, trees, man-made objects, etc.), to compress images into a “summary” numerical vector, called an embedding. The network learned during the training process to distinguish similar images from dissimilar ones by computing and comparing their embeddings. This model is then used to create a database of image patches and their associated embeddings using a corpus of de-identified slides from The Cancer Genome Atlas. When a query image patch is selected in the SMILY tool, the query patch’s embedding is similarly computed and compared with the database to retrieve the image patches with the most similar embeddings.
Schematic of the steps in building the SMILY database and the process by which input image patches are used to perform the similar image search.
The tool allows a user to select a region of interest, and obtain visually-similar matches. We tested SMILY’s ability to retrieve images along a pre-specified axis of similarity (e.g. histologic feature or tumor grade), using images of tissue from the breast, colon, and prostate (3 of the most common cancer sites). We found that SMILY demonstrated promising results despite not being trained specifically on pathology images or using any labeled examples of histologic features or tumor grades.
Example of selecting a small region in a slide and using SMILY to retrieve similar images. SMILY efficiently searches a database of billions of cropped images in a few seconds. Because pathology images can be viewed at different magnifications (zoom levels), SMILY automatically searches images at the same magnification as the input image.
Second example of using SMILY, this time searching for a lobular carcinoma, a specific subtype of breast cancer.
Refinement tools for SMILY
However, a problem emerged when we observed how pathologists interacted with SMILY. Specifically, users were trying to answer the nebulous question of “What looks similar to this image?” so that they could learn from past cases containing similar images. Yet, there was no way for the tool to understand the intent of the search: Was the user trying to find images that have a similar histologic feature, glandular morphology, overall architecture, or something else? In other words, users needed the ability to guide and refine the search results on a case-by-case basis in order to actually find what they were looking for. Furthermore, we observed that this need for iterative search refinement was rooted in how doctors often perform “iterative diagnosis”—by generating hypotheses, collecting data to test these hypotheses, exploring alternative hypotheses, and revisiting or retesting previous hypotheses in an iterative fashion. It became clear that, for SMILY to meet real user needs, it would need to support a different approach to user interaction.

Through careful human-centered research described in our second paper, we designed and augmented SMILY with a suite of interactive refinement tools that enable end-users to express what similarity means on-the-fly: 1) refine-by-region allows pathologists to crop a region of interest within the image, limiting the search to just that region; 2) refine-by-example gives users the ability to pick a subset of the search results and retrieve more results like those; and 3) refine-by-concept sliders can be used to specify that more or less of a clinical concept be present in the search results (e.g., fused glands). Rather than requiring that these concepts be built into the machine learning model, we instead developed a method that enables end-users to create new concepts post-hoc, customizing the search algorithm towards concepts they find important for each specific use case. This enables new explorations via post-hoc tools after a machine learning model has already been trained, without needing to re-train the original model for each concept or application of interest.
Through our user study with pathologists, we found that the tool-based SMILY not only increased the clinical usefulness of search results, but also significantly increased users’ trust and likelihood of adoption, compared to a conventional version of SMILY without these tools. Interestingly, these refinement tools appeared to have supported pathologists’ decision-making process in ways beyond simply performing better on similarity searches. For example, pathologists used the observed changes to their results from iterative searches as a means of progressively tracking the likelihood of a hypothesis. When search results were surprising, many re-purposed the tools to test and understand the underlying algorithm, for example, by cropping out regions they thought were interfering with the search or by adjusting the concept sliders to increase the presence of concepts they suspected were being ignored. Beyond being passive recipients of ML results, doctors were empowered with the agency to actively test hypotheses and apply their expert domain knowledge, while simultaneously leveraging the benefits of automation.
With these interactive tools enabling users to tailor each search experience to their desired intent, we are excited for SMILY’s potential to assist with searching large databases of digitized pathology images. One potential application of this technology is to index textbooks of pathology images with descriptive captions, and enable medical students or pathologists in training to search these textbooks using visual search, speeding up the educational process. Another application is for cancer researchers interested in studying the correlation of tumor morphologies with patient outcomes, to accelerate the search for similar cases. Finally, pathologists may be able to leverage tools like SMILY to locate all occurrences of a feature (e.g. signs of active cell division, or mitosis) in the same patient’s tissue sample to better understand the severity of the disease to inform cancer therapy decisions. Importantly, our findings add to the body of evidence that sophisticated machine learning algorithms need to be paired with human-centered design and interactive tooling in order to be most useful.

This work would not have been possible without Jason D. Hipp, Yun Liu, Emily Reif, Daniel Smilkov, Michael Terry, Craig H. Mermel, Martin C. Stumpe and members of Google Health and PAIR. Preprints of the two papers are available here and here.

Source: Google AI Blog

Meet David Feinberg, head of Google Health

Dr. David Feinberg has spent his entire career caring for people’s health and wellbeing. And after years in the healthcare system, he now leads Google Health, which brings together groups from across Google and Alphabet that are using AI, product expertise and hardware to take on big healthcare challenges. We sat down with David to hear more about his pre-Google life, what he’s learned as a “Noogler” (new Googler), and what’s next for Google Health.

You joined Google after a career path that led you from child psychiatrist to hospital executive. Tell us how this journey brought you to Google Health.

I’m driven by the urgency to help people live longer, healthier lives. I started as a child psychiatrist at UCLA helping young patients with serious mental health needs. Over the course of my 25 years at UCLA, I moved from treating dozens of patients, to overseeing the UCLA health system and the more than a million patients in our care. Then, at Geisinger, I had the opportunity to support a community of more than 3 million patients.

I recall my mom being very confused by my logic of stepping away from clinical duties and moving toward administrative duties as a way of helping more people. However, in these roles, the impact lies in initiatives that have boosted patient experience, improved people’s access to healthcare, and (I hope!) helped people get more time back to live their lives.

When I began speaking with Google, I immediately saw the potential to help billions of people, in part because I believe Google is already a health company. It’s been in the company’s DNA from the start.

You say Google is already a health company. How so?

We’re already making strides in organizing and making health data more useful thanks to work being done by Cloud and AI teams. And looking across the rest of Google’s portfolio of helpful products, we’re already addressing aspects of people’s health. Search helps people answer everyday health questions, Maps helps get people to the nearest hospital, and other tools and products are addressing issues tangential to health—for instance, literacysafer driving, and air pollution.

We already have the foundation, and I’m excited by the potential to tap into Google’s strengths, its brilliant people, and its amazing products to do more for people’s health (and lives).

I believe Google is already a health company. It’s been in the company’s DNA from the start.

This isn’t the first time Google has invested directly in health efforts. What has changed over the years about Google’s solving health-related problems? 

Some of Google’s early efforts didn’t gain traction due to various challenges the entire industry was facing at the time. During this period, I was a hospital administrator and no one talked about interoperability—a term familiar to those of us in the industry today. We were only just starting to think about the behemoth task of adopting electronic health records and bringing health data online, which is why some of the early projects didn’t really get off the ground. Today we take some of this for granted as we navigate today’s more digitized healthcare systems.

The last few years have changed the healthcare landscape—offering up new opportunities and challenges. And in response, Google and Alphabet have invested in efforts that complement their strengths and put users, patients, and care providers first. Look no further than the promising AI research and mobile applications coming from Google and DeepMind Health, or Verily’s Project Baseline that is pushing the boundaries of what we think we know about human health. And there’s so much more we can and will do.

Speaking of AI, it features prominently in many of Google’s current health efforts. What’s next for this research?

There’s little doubt that AI will power the next wave of tools that can improve many facets of healthcare: delivery, access, and so much more.

When I consider the future of research, I see us continuing to be deliberate and thoughtful about sharing our findings with the research and medical communities, incorporating feedback, and generally making sure our work actually adds value to patients, doctors and care providers.

Of course, we have to work toward getting solutions out in the wild, and into the hands of the pathologist scanning slides for breast cancer, or the nurse scanning a patient’s record for the latest lab results on the go. But this needs to be executed safely, working with and listening to our users to ensure that we get this right.

Now that you’ve been here for six months, what’s been most surprising to you about Google or the team?

I can’t believe how fantastic it is to not wear a suit after decades of formal business attire. When I got the job I ended up donating most of my suits. I kept a few, you know, for weddings.

On a more serious note, I’m blown away every day by the teams I’m surrounded by, and the drive and commitment they have for the work they do. I’m thrilled to be a part of this team.

What's your life motto?

I know this sounds cheesy, but there are three words I really do say every morning when I arrive in the parking lot for work: passion, humility, integrity. These are words that ground me, and also ground the work we are doing at Google Health.

Passion means we have to get this right, and feel that health is a cause worth fighting for, every day. We need humility, because at the end of the day, if we move too quickly or mess up, people’s lives are on the line. And integrity means that we should come to work with the aim of leaving the place—and the world—better than when we found it.

A promising step forward for predicting lung cancer

Over the past three years, teams at Google have been applying AI to problems in healthcare—from diagnosing eye disease to predicting patient outcomes in medical records. Today we’re sharing new research showing how AI can predict lung cancer in ways that could boost the chances of survival for many people at risk around the world.

Lung cancer results in over 1.7 million deaths per year, making it the deadliest of all cancers worldwide—more than breast, prostate, and colorectal cancers combined—and it’s the sixth most common cause of death globally, according to the World Health Organization. While lung cancer has one of the worst survival rates among all cancers, interventions are much more successful when the cancer is caught early. Unfortunately, the statistics are sobering because the overwhelming majority of cancers are not caught until later stages.

Over the last three decades, doctors have explored ways to screen people at high-risk for lung cancer. Though lower dose CT screening has been proven to reduce mortality, there are still challenges that lead to unclear diagnosis, subsequent unnecessary procedures, financial costs, and more.

Our latest research

In late 2017, we began exploring how we could address some of these challenges using AI. Using advances in 3D volumetric modeling alongside datasets from our partners (including Northwestern University), we’ve made progress in modeling lung cancer prediction as well as laying the groundwork for future clinical testing. Today we’re publishing our promising findings in “Nature Medicine.”

Radiologists typically look through hundreds of 2D images within a single CT scan and cancer can be miniscule and hard to spot. We created a model that can not only generate the overall lung cancer malignancy prediction (viewed in 3D volume) but also identify subtle malignant tissue in the lungs (lung nodules). The model can also factor in information from previous scans, useful in predicting lung cancer risk because the growth rate of suspicious lung nodules can be indicative of malignancy.

lung cancer model.gif

This is a high level modeling framework. For each patient, the AI uses the current CT scan and, if available, a previous CT scan as input. The model outputs an overall malignancy prediction.

In our research, we leveraged 45,856 de-identified chest CT screening cases (some in which cancer was found) from NIH’s research dataset from the National Lung Screening Trial study and Northwestern University. We validated the results with a second dataset and also compared our results against 6 U.S. board-certified radiologists.

When using a single CT scan for diagnosis, our model performed on par or better than the six radiologists. We detected five percent more cancer cases while reducing false-positive exams by more than 11 percent compared to unassisted radiologists in our study. Our approach achieved an AUC of 94.4 percent (AUC is a common common metric used in machine learning and provides an aggregate measure for classification performance).

lung cancer scan.gif

For an asymptomatic patient with no history of cancer, the AI system reviewed and detected potential lung cancer that had been previously called normal.

Next steps

Despite the value of lung cancer screenings, only 2-4 percent of eligible patients in the U.S. are screened today. This work demonstrates the potential for AI to increase both accuracy and consistency, which could help accelerate adoption of lung cancer screening worldwide.

These initial results are encouraging, but further studies will assess the impact and utility in clinical practice. We’re collaborating with Google Cloud Healthcare and Life Sciences team to serve this model through the Cloud Healthcare API and are in early conversations with partners around the world to continue additional clinical validation research and deployment. If you’re a research institution or hospital system that is interested in collaborating in future research, please fill out this form.

New milestones in helping prevent eye disease with Verily

Diabetes is at an all-time high around the world, and the number of people living with the disease is only increasing. Many complications can arise from diabetes, including diabetic retinopathy (DR) and diabetic macular edema (DME)—two of the leading causes of preventable blindness in adults. In India, a shortage of more than 100,000 eye doctors—and the fact that only 6 million out of 72 million people with diabetes are screened for diabetic eye disease—mean that many individuals go undiagnosed and untreated.

Over the last three years, Google and Verily—Alphabet’s life sciences and healthcare arm—have developed a machine learning algorithm to make it easier to screen for disease, as well as expand access to screening for DR and DME. As part of this effort, we’ve conducted a global clinical research program with a focus on India. Today, we’re sharing that the first real world clinical use of the algorithm is underway at the Aravind Eye Hospital in Madurai, India.
ML graphic

How the machine learning screening works at the hospital in Madurai

Thousands of patients come through the doors of Aravind Eye Hospital and vision centers every day. Dr. R. Kim, chief medical officer and chief of retina services, says that by integrating our machine learning algorithm into their screening process, “physicians like me have more time to work closely with patients on treatment and management of their disease, while increasing the volume of screenings we can perform."

Aravind Eye Hospital.png

Screening using the algorithm at the Aravind Eye Hospital with a trained technician.

Building off our initial efforts, we believe our machine learning algorithm could be helpful in many other areas of the world where there aren’t enough eye doctors to screen a growing population with diabetes. As part of our broader screening collaboration, our partners at Verily have received CE mark for the algorithm, which means that the software has met the European Union Directive’s standards for medical devices, further validating our approach. In addition, late in 2018 we announced our research efforts in Thailand and this year we’ll expand our research and clinical efforts globally, with the goal of screening more people and preventing disease.

To read more about our research to date, visit JAMA, Ophthalmology and Nature Biomedical Engineering.

Expanding the Application of Deep Learning to Electronic Health Records

In 2018 we published a paper that showed how machine learning, when applied to medical records, can predict what might happen to patients who are hospitalized: for example, how long they would need to be in the hospital and, if discharged, how likely they would be to come back unexpectedly. Predictive models of various kinds have already been deployed in hospital settings by others, and our work aims to further improve potential clinical benefit by using new models that can make predictions faster, more accurate, and more adaptable for a broader range of clinical contexts.

Any endeavor to demonstrate the promise of machine learning requires intense collaboration between engineers, doctors, and medical researchers to make sure the work benefits patients, physicians, and health systems, and that it is equitable. Google is already fortunate to partner with some of the best academic medical centers in the world and we are now expanding this work to include Intermountain Healthcare, based in Utah.
The initial collaboration will focus on understanding how Google might adapt machine learning predictions to the various Intermountain care settings, from primary care clinics to the TeleHealth critical care unit, which remotely monitors critically ill patients in surrounding hospitals. We see potential in exploring how scalable computing platforms that include predictions might assist clinical teams in providing the best possible care.

As with our previous research, we will begin with jointly testing the performance of machine learning models on historical records, following strict policies to ensure that all data privacy and security measures are followed.

We are excited to explore how scalable computing platforms that include predictions might assist clinical teams in providing the best possible care in these settings. We additionally hope to further validate that our approach to predictions can work across health systems and improve care for patients.

Source: Google AI Blog

Improving the Effectiveness of Diabetic Retinopathy Models

Two years ago, we announced our inaugural work in training deep learning models for diabetic retinopathy (DR), a complication of diabetes that is one of the fasting growing causes of vision loss. Based on this research, we set out to apply our technology to improve health outcomes in the world. At the same time, we’ve continued our efforts to improve the model’s performance, explainability, and applicability in clinical settings. Today, we are sharing our research progress toward these goals, as well as announcing a new partner in Thailand.

Improving Model Performance with High-quality Labels
The performance of DR deep learning models is critically important, especially when subtle errors have the potential to generate a misdiagnosis. Earlier this year we published a paper in the journal Ophthalmology that looked at how we could improve our model by 1) moving toward a more granular 5-point grading scale (versus the previous 2-class system) and 2) incorporating adjudication by a panel of retinal specialists. During the adjudication process, a group of retinal specialists debated any case with disagreement until everyone agreed on the final grade. Compared to simply taking a majority vote, this method of resolving disagreements was more accurate and allowed for the identification of subtle findings, such as microaneurysms.

To increase the efficiency of the adjudication process, we carefully selected a small subset (0.22%) of images to use as a tuning set, substantially improving model performance by optimizing model hyperparameters on this more accurate reference standard. When we subsequently measured the rate of agreement against a test set of images with an adjudicated reference standard, the kappa scores (a measurement of agreement that ranges from 0 [random] to 1 [perfect agreement]) for individual retinal specialists, ophthalmologists, and the algorithm ranged from 0.82-0.91, 0.80-0.84, and 0.84, respectively.

Making our Models More Transparent
As we deploy this technology, it is important that we take the proper steps to ensure that it is transparent and trusted. To that end, we have been exploring ways to explain how the model is making its predictions, with the goal of making the DR model a better diagnostic tool and aid for doctors.

In our latest study, to be published today in Ophthalmology, we demonstrate methods by which explanations of deep learning algorithms can be shown to ophthalmologists to increase both the accuracy and confidence of their grading for diabetic eye disease. Using the results of the model trained and validated on high quality labels from our earlier study, we generated different forms of potential assistance for general ophthalmologists. We presented to the physicians the algorithm’s predicted scores for different DR severity levels as well as heatmaps highlighting image regions that most strongly drove its predictions. Using this assistance, we saw a significant increase in physicians’ diagnostic accuracy, as well as improved confidence in their diagnosis.

We saw clear evidence that showing model predictions could help physicians catch pathology they otherwise might have missed. In the retinal image below, our adjudication panel found signs of vision-threatening DR. This was missed by 2 of 3 doctors who graded it without assistance; but caught by all 3 doctors who graded it when they saw the model predictions (which accurately detected the pathology).
On the left is a fundus image graded as having proliferative (vision-threatening) DR by an adjudication panel of ophthalmologists (ground truth). On the top right is an illustration of our deep learning model’s predicted scores (“P” = proliferative, the most severe form of DR). On the bottom right is the set of grades given by physicians without assistance (“Unassisted”) and those who saw the model’s predictions (“Grades Only”).
We also saw evidence that physicians and the model can work together in a way that provides more accuracy than either individually. In the retinal image below, our adjudication panel of retina specialists considered it to have moderate DR. Without assistance, two out of three ophthalmologists grading the image marked it as no DR. In real-world settings, this situation could result in a patient missing a needed referral to a specialist.
On the left is a retinal fundus image graded as having moderate DR (“Mo”) by an adjudication panel of ophthalmologists (ground truth). On the top right is an illustration of the predicted scores (“N” = no DR, “Mi” = Mild DR, “Mo” = Moderate DR) from the model. On the bottom right is the set of scores given by physicians without assistance (“Unassisted”) and those who saw the model’s predictions (“Grades Only”).
In this particular case, our model also indicated evidence for no DR. However, when ophthalmologists saw the model’s predictions, all three gave the correct answer. Seeing that the model saw some evidence for Moderate -- even if it wasn’t the highest score -- may prompt doctors to examine particular cases more carefully for pathology they may otherwise miss. We are excited to develop assistance that works like this, where human and machine learning abilities complement each other.

A New Partner in our Global Efforts
With the help of screening programs and in collaboration with Verily, we have laid a robust foundation for the implementation of these highly accurate systems in real world clinical settings. Working with doctors at Aravind Eye Hospitals and Sankara Nethralaya in India, and now, through our new partnership with the Rajavithi Hospital, affiliated with the Department of Medical Services, Ministry of Public Health in Thailand, we are validating the model performance with patients from broad screening programs. Given the positive results of our model on their real patient population, we are now beginning to pilot the model in their screening programs. We’re looking forward to a very busy 2019!

Source: Google AI Blog

Improved Grading of Prostate Cancer Using Deep Learning

Approximately 1 in 9 men in the United States will develop prostate cancer in their lifetime, making it the most common cancer in males. Despite being common, prostate cancers are frequently non-aggressive, making it challenging to determine if the cancer poses a significant enough risk to the patient to warrant treatment such as surgical removal of the prostate (prostatectomy) or radiation therapy. A key factor that helps in the “risk stratification” of prostate cancer patients is the Gleason grade, which classifies the cancer cells based on how closely they resemble normal prostate glands when viewed on a slide under a microscope.

However, despite its widely recognized clinical importance, Gleason grading of prostate cancer is complex and subjective, as evidenced by studies reporting inter-pathologist disagreements ranging from 30-53% [1][2]. Furthermore, there are not enough speciality trained pathologists to meet the global demand for prostate cancer pathology, especially outside the United States. Recent guidelines also recommend that pathologists report the percentage of tumor of different Gleason patterns in their final report, which adds to the workload and is yet another subjective challenge for the pathologist [3]. Overall, these issues suggest an opportunity to improve the diagnosis and clinical management of prostate cancer using deep learning–based models, similar to how Google and others used such techniques to demonstrate the potential to improve metastatic breast cancer detection.

In “Development and Validation of a Deep Learning Algorithm for Improving Gleason Scoring of Prostate Cancer”, we explore whether deep learning could improve the accuracy and objectivity of Gleason grading of prostate cancer in prostatectomy specimens. We developed a deep learning system (DLS) that mirrors a pathologist’s workflow by first categorizing each region in a slide into a Gleason pattern, with lower patterns corresponding to tumors that more closely resemble normal prostate glands. The DLS then summarizes an overall Gleason grade group based on the two most common Gleason patterns present. The higher the grade group, the greater the risk of further cancer progression and the more likely the patient is to benefit from treatment.
Visual examples of Gleason patterns, which are used in the Gleason system for grading prostate cancer. Individual cancer patches are assigned a Gleason pattern based on how closely the cancer resembles normal prostate tissue, with lower numbers corresponding to more well differentiated tumors. Image Source: National Institutes of Health.
To develop and validate the DLS, we collected de-identified images of prostatectomy samples which contain a greater amount and diversity of prostate cancer than needle core biopsies, even though the latter is the more common clinical procedure. On the training data, a cohort of 32 pathologists provided detailed annotations of Gleason patterns (resulting in over 112 million annotated image patches) and an overall Gleason grade group for each image. To overcome the previously referenced variability in Gleason grading, each slide in the validation set was independently graded by 3 to 5 general pathologists (selected from a cohort of 29 pathologists) and had a final Gleason grade assigned by a genitourinary-specialist pathologist to obtain the ground-truth label for that slide.

In the paper, we show that our DLS achieved an overall accuracy of 70%, compared to an average accuracy of 61% achieved by US board-certified general pathologists in our study. Of 10 high-performing individual general pathologists who graded every slide in the validation set, the DLS was more accurate than 8. The DLS was also more accurate than the average pathologist at Gleason pattern quantitation. These improvements in Gleason grading translated into better clinical risk stratification: the DLS better identified patients at higher risk for disease recurrence after surgery than the average general pathologist, potentially enabling doctors to use this information to better match patients to therapy.
Comparison of scoring performance of the DLS with pathologists. a: Accuracy of the DLS (in red) compared with the mean accuracy among a cohort-of-29 pathologists (in green). Error bars indicate 95% confidence intervals. b: Comparison of risk stratification provided by the DLS, the cohort-of-29 pathologists, and the genitourinary specialist pathologists. Patients are divided into low and high risk groups based on their Gleason grade group, where a larger separation between the Kaplan-Meier curves of these risk groups indicates better stratification.
We also found that the DLS was able to characterize tissue morphology that appeared to lie at the cusp of two Gleason patterns, which is one reason for the disagreements in Gleason grading observed between pathologists, suggesting the possibility of creating finer grained “precision grading” of prostate cancer. While the clinical significance of these intermediate patterns (e.g. Gleason pattern 3.3 or 3.7) is not known, the increased precision of the DLS will enable further research into this interesting question.
Assessing the region-level classification of the DLS. a: Annotations from 3 pathologists compared to DLS predictions. The pathologists show general concordance on the location and the extent of tumor areas, but poor agreement in classifying Gleason patterns. The DLS’s precision Gleason pattern for each region is represented by interpolating between the DLS’s prediction patterns for Gleason patterns 3 (green), 4 (yellow), and 5 (red). b: DLS prediction
patterns compared to the distribution of pathologists’ Gleason pattern classifications on 41 million annotated image patches from the test dataset. On patches where pathologists are discordant, where the tissue is more likely to be on the cusp of two patterns, the DLS reflects this ambiguity in it's prediction scores.
While these initial results are encouraging, there is much more work to be done before systems like our DLS can be used to improve the care of prostate cancer patients. First, the accuracy of the model can be further improved with additional training data and should be validated on independent cohorts containing a larger number and more diverse group of patients. In addition, we are actively working on refining our DLS system to work on diagnostic needle core biopsies, which occur prior to the decision to undergo surgery and where Gleason grading therefore has a significantly greater impact on clinical decision-making. Further work will be needed to assess how to best integrate our DLS into the pathologist’s diagnostic workflow and the impact of such artificial-intelligence based assistance on the overall efficiency, accuracy, and prognostic ability of Gleason grading in clinical practice. Nonetheless, we are excited about the potential of technologies like this to significantly improve cancer diagnostics and patient care.

This work involved the efforts of a multidisciplinary team of software engineers, researchers, clinicians and logistics support staff. Key contributors to this project include Kunal Nagpal, Davis Foote, Yun Liu, Po-Hsuan (Cameron) Chen, Ellery Wulczyn, Fraser Tan, Niels Olson, Jenny L. Smith, Arash Mohtashamian, James H. Wren, Greg S. Corrado, Robert MacDonald, Lily H. Peng, Mahul B. Amin, Andrew J. Evans, Ankur R. Sangoi, Craig H. Mermel, Jason D. Hipp and Martin C. Stumpe. We would also like to thank Tim Hesterberg, Michael Howell, David Miller, Alvin Rajkomar, Benny Ayalew, Robert Nagle, Melissa Moran, Krishna Gadepalli, Aleksey Boyko, and Christopher Gammage. Lastly, this work would not have been possible without the aid of the pathologists who annotated data for this study.

  1. Interobserver Variability in Histologic Evaluation of Radical Prostatectomy Between Central and Local Pathologists: Findings of TAX 3501 Multinational Clinical Trial, Netto, G. J., Eisenberger, M., Epstein, J. I. & TAX 3501 Trial Investigators, Urology 77, 1155–1160 (2011).
  2. Phase 3 Study of Adjuvant Radiotherapy Versus Wait and See in pT3 Prostate Cancer: Impact of Pathology Review on Analysis, Bottke, D., Golz, R., Störkel, S., Hinke, A., Siegmann, A., Hertle, L., Miller, K., Hinkelbein, W., Wiegel, T., Eur. Urol. 64, 193–198 (2013).
  3. Utility of Quantitative Gleason Grading in Prostate Biopsies and Prostatectomy Specimens, Sauter, G. Steurer, S., Clauditz, T. S., Krech, T., Wittmer, C., Lutz, F., Lennartz, M., Janssen, T., Hakimi, N., Simon, R., von Petersdorff-Campen, M., Jacobsen, F., von Loga, K., Wilczak, W., Minner, S., Tsourlakis, M. C., Chirico, V., Haese, A., Heinzer, H., Beyer, B., Graefen, M., Michl, U., Salomon, G., Steuber, T., Budäus, L. H., Hekeler, E., Malsy-Mink, J., Kutzera, S., Fraune, C., Göbel, C., Huland, H., Schlomm, T., Clinical Eur. Urol. 69, 592–598 (2016).

Source: Google AI Blog

Applying Deep Learning to Metastatic Breast Cancer Detection

A pathologist’s microscopic examination of a tumor in patients is considered the gold standard for cancer diagnosis, and has a profound impact on prognosis and treatment decisions. One important but laborious aspect of the pathologic review involves detecting cancer that has spread (metastasized) from the primary site to nearby lymph nodes. Detection of nodal metastasis is relevant for most cancers, and forms one of the foundations of the widely-used TNM cancer staging.

In breast cancer in particular, nodal metastasis influences treatment decisions regarding radiation therapy, chemotherapy, and the potential surgical removal of additional lymph nodes. As such, the accuracy and timeliness of identifying nodal metastases has a significant impact on clinical care. However, studies have shown that about 1 in 4 metastatic lymph node staging classifications would be changed upon second pathologic review, and detection sensitivity of small metastases on individual slides can be as low as 38% when reviewed under time constraints.

Last year, we described our deep learning–based approach to improve diagnostic accuracy (LYmph Node Assistant, or LYNA) to the 2016 ISBI Camelyon Challenge, which provided gigapixel-sized pathology slides of lymph nodes from breast cancer patients for researchers to develop computer algorithms to detect metastatic cancer. While LYNA achieved significantly higher cancer detection rates (Liu et al. 2017) than had been previously reported, an accurate algorithm alone is insufficient to improve pathologists’ workflow or improve outcomes for breast cancer patients. For patient safety, these algorithms must be tested in a variety of settings to understand their strengths and weaknesses. Furthermore, the actual benefits to pathologists using these algorithms had not been previously explored and must be assessed to determine whether or not an algorithm actually improves efficiency or diagnostic accuracy.

In “Artificial Intelligence Based Breast Cancer Nodal Metastasis Detection: Insights into the Black Box for Pathologists” (Liu et al. 2018), published in the Archives of Pathology and Laboratory Medicine and “Impact of Deep Learning Assistance on the Histopathologic Review of Lymph Nodes for Metastatic Breast Cancer” (Steiner, MacDonald, Liu et al. 2018) published in The American Journal of Surgical Pathology, we present a proof-of-concept pathologist assistance tool based on LYNA, and investigate these factors.

In the first paper, we applied our algorithm to de-identified pathology slides from both the Camelyon Challenge and an independent dataset provided by our co-authors at the Naval Medical Center San Diego. Because this additional dataset consisted of pathology samples from a different lab using different processes, it improved the representation of the diversity of slides and artifacts seen in routine clinical practice. LYNA proved robust to image variability and numerous histological artifacts, and achieved similar performance on both datasets without additional development.
Left: sample view of a slide containing lymph nodes, with multiple artifacts: the dark zone on the left is an air bubble, the white streaks are cutting artifacts, the red hue across some regions are hemorrhagic (containing blood), the tissue is necrotic (decaying), and the processing quality was poor. Right: LYNA identifies the tumor region in the center (red), and correctly classifies the surrounding artifact-laden regions as non-tumor (blue).
In both datasets, LYNA was able to correctly distinguish a slide with metastatic cancer from a slide without cancer 99% of the time. Further, LYNA was able to accurately pinpoint the location of both cancers and other suspicious regions within each slide, some of which were too small to be consistently detected by pathologists. As such, we reasoned that one potential benefit of LYNA could be to highlight these areas of concern for pathologists to review and determine the final diagnosis.

In our second paper, 6 board-certified pathologists completed a simulated diagnostic task in which they reviewed lymph nodes for metastatic breast cancer both with and without the assistance of LYNA. For the often laborious task of detecting small metastases (termed micrometastases), the use of LYNA made the task subjectively “easier” (according to pathologists’ self-reported diagnostic difficulty) and halved average slide review time, requiring about one minute instead of two minutes per slide.
Left: sample views of a slide containing lymph nodes with a small metastatic breast tumor at progressively higher magnifications. Right: the same views when shown with algorithmic “assistance” (LYmph Node Assistant, LYNA) outlining the tumor in cyan.
This suggests the intriguing potential for assistive technologies such as LYNA to reduce the burden of repetitive identification tasks and to allow more time and energy for pathologists to focus on other, more challenging clinical and diagnostic tasks. In terms of diagnostic accuracy, pathologists in this study were able to more reliably detect micrometastases with LYNA, reducing the rate of missed micrometastases by a factor of two. Encouragingly, pathologists with LYNA assistance were more accurate than either unassisted pathologists or the LYNA algorithm itself, suggesting that people and algorithms can work together effectively to perform better than either alone.

With these studies, we have made progress in demonstrating the robustness of our LYNA algorithm to support one component of breast cancer TNM staging, and assessing its impact in a proof-of-concept diagnostic setting. While encouraging, the bench-to-bedside journey to help doctors and patients with these types of technologies is a long one. These studies have important limitations, such as limited dataset sizes and a simulated diagnostic workflow which examined only a single lymph node slide for every patient instead of the multiple slides that are common for a complete clinical case. Further work will be needed to assess the impact of LYNA on real clinical workflows and patient outcomes. However, we remain optimistic that carefully validated deep learning technologies and well-designed clinical tools can help improve both the accuracy and availability of pathologic diagnosis around the world.

Source: Google AI Blog

Deep Learning for Electronic Health Records

When patients get admitted to a hospital, they have many questions about what will happen next. When will I be able to go home? Will I get better? Will I have to come back to the hospital? Having precise answers to those questions helps doctors and nurses make care better, safer, and faster — if a patient’s health is deteriorating, doctors could be sent proactively to act before things get worse.

Predicting what will happen next is a natural application of machine learning. We wondered if the same types of machine learning that predict traffic during your commute or the next word in a translation from English to Spanish could be used for clinical predictions. For predictions to be useful in practice they should be, at least:
  1. Scalable: Predictions should be straightforward to create for any important outcome and for different hospital systems. Since healthcare data is very complicated and requires much data wrangling, this requirement is not straightforward to satisfy.
  2. Accurate: Predictions should alert clinicians to problems but not distract them with false alarms. With the widespread adoption of electronic health records, we set out to use that data to create more accurate prediction models.
Together with colleagues at UC San Francisco, Stanford Medicine, and The University of Chicago Medicine, we published “Scalable and Accurate Deep Learning with Electronic Health Records” in Nature Partner Journals: Digital Medicine, which contributes to these two aims.
We used deep learning models to make a broad set of predictions relevant to hospitalized patients using de-identified electronic health records. Importantly, we were able to use the data as-is, without the laborious manual effort typically required to extract, clean, harmonize, and transform relevant variables in those records. Our partners had removed sensitive individual information before we received it, and on our side, we protected the data using state-of-the-art security including logical separation, strict access controls, and encryption of data at rest and in transit.

Electronic health records (EHRs) are tremendously complicated. Even a temperature measurement has a different meaning depending on if it’s taken under the tongue, through your eardrum, or on your forehead. And that's just a simple vital sign. Moreover, each health system customizes their EHR system, making the data collected at one hospital look different than data on a similar patient receiving similar care at another hospital. Before we could even apply machine learning, we needed a consistent way to represent patient records, which we built on top of the open Fast Healthcare Interoperability Resources (FHIR) standard as described in an earlier blog post.

Once in a consistent format, we did not have to manually select or harmonize the variables to use. Instead, for each prediction, a deep learning model reads all the data-points from earliest to most recent and then learns which data helps predict the outcome. Since there are thousands of data points involved, we had to develop some new types of deep learning modeling approaches based on recurrent neural networks (RNNs) and feedforward networks.
Data in a patient's record is represented as a timeline. For illustrative purposes, we display various types of clinical data (e.g. encounters, lab tests) by row. Each piece of data, indicated as a little grey dot, is stored in FHIR, an open data standard that can be used by any healthcare institution. A deep learning model analyzed a patient's chart by reading the timeline from left to right, from the beginning of a chart to the current hospitalization, and used this data to make different types of predictions.
Thus we engineered a computer system to render predictions without hand-crafting a new dataset for each task, in a scalable manner. But setting up the data is only one part of the work; the predictions also need to be accurate.

Prediction Accuracy
The most common way to assess accuracy is by a measure called the area-under-the-receiver-operator curve, which measures how well a model distinguishes between a patient who will have a particular future outcome compared to one who will not. In this metric, 1.00 is perfect, and 0.50 is no better than random chance, so higher numbers mean the model is more accurate. By this measure, the models we reported in the paper scored 0.86 in predicting if patients will stay long in the hospital (traditional logistic regression scored 0.76); they scored 0.95 in predicting inpatient mortality (traditional methods were 0.86), and they scored 0.77 in predicting unexpected readmissions after patients are discharged (traditional methods were 0.70). These gains were statistically significant.

We also used these models to identify the conditions for which the patients were being treated. For example, if a doctor prescribed ceftriaxone and doxycycline for a patient with an elevated temperature, fever and cough, the model could identify these as signals that the patient was being treated for pneumonia. We emphasize that the model is not diagnosing patients — it picks up signals about the patient, their treatments and notes written by their clinicians, so the model is more like a good listener than a master diagnostician.

An important focus of our work includes the interpretability of the deep learning models used. An “attention map” of each prediction shows the important data points considered by the models as they make that prediction. We show an example as a proof-of-concept and see this as an important part of what makes predictions useful for clinicians.
A deep learning model was used to render a prediction 24 hours after a patient was admitted to the hospital. The timeline (top of figure) contains months of historical data and the most recent data is shown enlarged in the middle. The model "attended" to information highlighted in red that was in the patient's chart to "explain" its prediction. In this case-study, the model highlighted pieces of information that make sense clinically. Figure from our paper.
What does this mean for patients and clinicians?
The results of this work are early and on retrospective data only. Indeed, this paper represents just the beginning of the work that is needed to test the hypothesis that machine learning can be used to make healthcare better. Doctors are already inundated with alerts and demands on their attention — could models help physicians with tedious, administrative tasks so they can better focus on the patient in front of them or ones that need extra attention? Can we help patients get high-quality care no matter where they seek it? We look forward to collaborating with doctors and patients to figure out the answers to these questions and more.

Source: Google AI Blog