Tag Archives: Health

A Step Towards Protecting Patients from Medication Errors

Posted by Kathryn Rough, Research Scientist and Alvin Rajkomar, MD, Google Health

While no doctor, nurse, or pharmacist wants to make a mistake that harms a patient, research shows that 2% of hospitalized patients experience serious preventable medication-related incidents that can be life-threatening, cause permanent harm, or result in death. There are many factors contributing to medical mistakes, often rooted in deficient systems, tools, processes, or working conditions, rather than the flaws of individual clinicians (IOM report). To mitigate these challenges, one can imagine a system more sophisticated than the current rules-based error alerts provided in standard electronic health record software. The system would identify prescriptions that looked abnormal for the patient and their current situation, similar to a system that produces warnings for atypical credit card purchases on stolen cards. However, determining which medications are appropriate for any given patient at any given time is complex — doctors and pharmacists train for years before acquiring the skill. With the widespread use of electronic health records, it may now be feasible to use this data to identify normal and abnormal patterns of prescriptions.

In an initial effort to explore solutions to this problem, we partnered with UCSF's Bakar Computational Health Sciences Institute to publish “Predicting Inpatient Medication Orders in Electronic Health Record Data” in Clinical Pharmacology and Therapeutics, which evaluates the extent to which machine learning could anticipate normal prescribing patterns by doctors, based on electronic health records. Similar to our prior work, we used comprehensive clinical data from de-identified patient records, including the sequence of vital signs, laboratory results, past medications, procedures, diagnoses and more. Based on the patient’s current clinical state and medical history, our best model was able to anticipate physician’s actual prescribing decisions three quarters of the time.

Model Training
The dataset used for model training included approximately three million medication orders from over 100,000 hospitalizations. It used retrospective electronic health record data, which was de-identified by randomly shifting dates and removing identifying portions of the record in accordance with HIPAA, including names, addresses, contact details, record numbers, physician names, free-text notes, images, and more. The data was not joined or combined with any other data. All research was done using the open-sourced Fast Healthcare Interoperability Resources (FHIR) format, which we’ve previously used to make healthcare data more effective for machine learning. The dataset was not restricted to a particular disease or therapeutic area, which made the machine learning task more challenging, but also helped to ensure that the model could identify a larger variety of conditions; e.g. patients suffering from dehydration require different medications than those with traumatic injuries.

We evaluated two machine learning models: a long short-term memory (LSTM) recurrent neural network and a regularized, time-bucketed logistic model, which are commonly used in clinical research. Both were compared to a simple baseline that ranked the most frequently ordered medications based on a patient’s hospital service (e.g., General Medical, General Surgical, Obstetrics, Cardiology, etc.) and amount of time since admission. Each time a medication was ordered in the retrospective data, the models ranked a list of 990 possible medications, and we assessed whether the models assigned high probabilities to the medications actually ordered by doctors in each case.

As an example of how the model was evaluated, imagine a patient who arrived at the hospital with signs of an infection. The model reviewed the information recorded in the patient’s electronic health record — a high temperature, elevated white blood cell count, quick breathing rate — and estimated how likely it would be for different medications to be prescribed in that situation. The model’s performance was evaluated by comparing its ranked choices against the medications that the physician actually prescribed (in this example, the antibiotic vancomycin and sodium chloride solution for rehydration).

Based on a patient’s medical history and current clinical characteristics, the model ranks the medications a physician is most likely to prescribe.

Findings
Our best-performing model was the LSTM model, a class of models particularly effective for handling sequential data, including text and language. These models are capable of capturing the ordering and time recency of events in the data, making them a good choice for this problem.

Nearly all (93%) top-10 lists contained at least one medication that would be ordered by clinicians for the given patient within the next day. Fifty-five percent of the time, the model correctly placed medications prescribed by the doctor as one of the top-10 most likely medications, and 75% of ordered medications were ranked in the top-25. Even for ‘false negatives’ — cases where the medication ordered by doctors did not appear among the top-25 results — the model highly ranked a medication in the same class 42% of the time. This performance was not explained by the model simply predicting previously prescribed medications. Even when we blinded the model to previous medication orders, it maintained high performance.

What Does This Mean for Patients and Clinicians?
It’s important to remember that models trained this way reproduce physician behavior as it appears in historical data, and have not learned optimal prescribing patterns, how these medications might work, or what side effects might occur. However, learning ‘normal’ is a starting point to eventually spot abnormal, potentially dangerous orders. In our next phase of research, we will examine under which circumstances these models are useful for finding medication errors that could harm patients.

The results from this exploratory work are early first steps towards testing the hypothesis that machine learning can be applied to build systems that prevent mistakes and help to keep patients safe. We look forward to collaborating with doctors, pharmacists, other clinicians, and patients as we continue research to quantify whether models like this one are capable of catching errors, keeping patients safe in the hospital.

Acknowledgements
We would like to thank Atul Butte (UCSF), Claire Cui, Andrew Dai, Michael Howell, Laura Vardoulakis, Yuan (Emily) Xue, and Kun Zhang for their contributions towards the research work described in this post. We’d additionally like to thank members of our broader research team who have assisted in the development of analytical tools, data collection, maintenance of research infrastructure, assurance of data quality, and project management: Gabby Espinosa, Gerardo Flores, Michaela Hardt, Sharat Israni (UCSF), Jeff Love (UCSF), Dana Ludwig (UCSF), Hong Ji, Svetlana Kelman, I-Ching Lee, Mimi Sun, Patrik Sundberg, Chunfeng Wen, and Doris Wong.

Source: Google AI Blog

Connecting people with COVID-19 information and resources

Since the beginning of the year, search interest in COVID-19 has continued to climb around the world. Right now the disease is the largest topic people are looking for globally, surpassing even some of the most common and consistent queries we see in Search.

As this public health crisis has evolved into a pandemic, information needs are continuing to change, differing from region to region. When COVID-19 was declared a public health emergency by the World Health Organization (WHO) in late January, we launched an SOS Alert with resources and safety information from the WHO, along with the latest news. The alert has launched in 25 languages across dozens of countries, and people in more than 50 countries can access localized public health guidance from health authorities.

Expanding our COVID-19 Search experience
Now, as we continue to see people’s information needs expanding, we’re introducing a more comprehensive experience for COVID-19 in Search, providing easy access to authoritative information from health authorities alongside new data and visualizations. This new format organizes the search results page to help people easily navigate information and resources, and it will also make it possible to add more information over time as it becomes available.

In addition to links to helpful resources from national and local health authorities, people will also find a carousel of Twitter accounts from local civic organizations and health authorities to help connect them with the latest local guidance as it’s shared. We’ve also introduced a feature to surface some of the most common questions about the pandemic, with relevant snippets sourced from the WHO and the Centers for Disease Control and Prevention (CDC).

To help people track the latest information about the spread of the disease, we’re adding modules with statistics and a map showing COVID-19 prevalence in countries around the world. This new COVID-19 experience on Search will roll out in the coming days in English in the U.S., and we plan to add more information and expand to other languages and countries soon.

A website dedicated to help and resources
In addition to launching new features on Google Search that provide easy access to more authoritative information, we’ve worked with relevant agencies and authorities to roll out a website—available at google.com/covid19—focused on education, prevention and local resources. People can find state-based information, safety and prevention tips, search trends related to COVID-19, and further resources for individuals, educators and businesses. Launching today in the U.S., the site will be available in more languages and countries in the coming days and we’ll update the website as more resources become available. Along with our other products and initiatives, we hope these resources will help people find answers to the questions they’re asking and get the help they need.

Guidance around local health services
We’re also looking for more ways we can help people follow authoritative public health guidance and locate appropriate health services through our products. Right now in the U.S., people seeking out urgent care, hospitals and other medical services in Search or Maps will see an alert reminding them of the CDC’s recommendation that symptomatic individuals call ahead in order to avoid overwhelming health systems and increasing the risk of exposure.

As coronavirus becomes a challenge in more communities and as authorities around the world develop new guidance and tools to address the pandemic, we’ll continue to find more opportunities to connect people with key information to keep themselves, their families, and their communities safe.

Source: The Official Google Blog

Connecting people with COVID-19 information and resources

Source: Google LatLong

Get helpful health info from the NHS, right in Search

People come to Search for all types of information to navigate their lives and look after themselves and their families. When it comes to important topics like health, high-quality information is critical, and we aim to connect people with the most reliable sources on the web as quickly as possible.

Now, we’re making it even easier for people in the U.K. to find trusted information from the National Health Service (NHS). Beginning this week, when you search for health conditions like chickenpox, back pain, or the common cold, you can find Knowledge Panels with information from the NHS website that help you understand more about common causes, treatments and more.

These Knowledge Panels aim to give people authoritative, locally trusted health information, based on open source content. The NHS has formatted their content so that it’s easy to find on the web and available publicly to anyone via the NHS website—Google is one of more than 2,000 organizations using NHS website content to provide trusted information to people looking for it.

To start, these Knowledge Panels will be available for more than 250 health conditions. Of course, they’re not intended to provide medical advice, and we encourage anyone searching for health information to seek guidance from a doctor if they have a medical concern or, in an emergency, call local emergency services immediately. But we hope this feature will help people find reliable information and have more informed conversations with medical professionals to improve their care.

Source: Search

Generating Diverse Synthetic Medical Image Data for Training Machine Learning Models

Posted by Timo Kohlberger and Yuan Liu, Software Engineers, Google Health

The progress in machine learning (ML) for medical imaging that helps doctors provide better diagnoses has partially been driven by the use of large, meticulously labeled datasets. However, dataset size can be limited in real life due to privacy concerns, low patient volume at partner institutions, or by virtue of studying rare diseases. Moreover, to ensure that ML models generalize well, they need training data that span a range of subgroups, such as skin type, demographics, and imaging devices. Requiring that the size of each combinatorial subgroup (e.g., skin type A with skin condition B, taken by camera C) is also sufficiently large can quickly become impractical.

Today we are happy to share two projects aimed at both improving the diversity of ML training data, and increasing the effective amount of available training data for medical applications. The first project is a configurable method for generation of synthetic skin lesion images in order to improve coverage of rarer skin types and conditions. The second project uses synthetic images as training data to develop an ML model, that can better interpret different biological tissue types across a range of imaging devices.

Generating Diverse Images of Skin Conditions
In “DermGAN: Synthetic Generation of Clinical Skin Images with Pathology”, published in the Machine Learning for Health (ML4H) workshop at NeurIPS 2019, we address problems associated with data diversity in de-identified dermatology images taken by consumer grade cameras. This work addresses (1) the scarcity of imaging data representative of rare skin conditions, and (2) the lower frequency of data covering certain Fitzpatrick skin types. Fitzpatrick skin types range from Type I (“pale white, always burns, never tans”) to Type VI (“darkest brown, never burns”), with datasets generally containing relative few cases at the “boundaries”. In both cases, data scarcity problems are exacerbated by the low signal-to-noise ratio common in the target images, due to the lack of standardized lighting, contrast and field-of-view; variability of the background, such as furniture and clothing; and the fine details of the skin, like hair and wrinkles.

To improve diversity in the skin images, we developed a model, called DermGAN, which generates skin images that exhibit the characteristics of a given pre-specified skin condition, location, and underlying skin color. DermGAN uses an image-to-image translation approach, based on the pix2pix generative adversarial network (GAN) architecture, to learn the underlying mapping from one type of image to another.

DermGAN takes as input a real image and its corresponding, pre-generated semantic map representing the underlying characteristics of the real image (e.g., the skin condition, location of the lesion, and skin type), from which it will generate a new synthetic example with the requested characteristics. The generator is based on the U-Net architecture, but in order to mitigate checkerboard artifacts, the deconvolution layers are replaced with a resizing layer, followed by a convolution. A few customized losses are introduced to improve the quality of the synthetic images, especially within the pathological region. The discriminator component of DermGAN is solely used for training, whereas the generator is evaluated both visually and for use in augmenting the training dataset for a skin condition classifier.

Overview of the generator component of DermGAN. The model takes an RGB semantic map (red box) annotated with the skin condition's size and location (smaller orange rectangle), and outputs a realistic skin image. Colored boxes represent various neural network layers, such as convolutions and ReLU; the skip connections resemble the U-Net and enable information to be propagated at the appropriate scales.

The top row shows generated synthetic examples and the bottom row illustrates real images of basal cell carcinoma (left) and melanocytic nevus (right). More examples can be found in the paper.

In addition to generating visually realistic images, our method enables generation of images of skin conditions or skin types that are more rare and that suffer from a paucity of dermatologic images.

DermGAN can be used to generate skin images (all with melanocytic nevus in this case) with different background skin types (top, by changing the input skin color) and different-sized lesions (bottom, by changing the input lesion size). As the input skin color changes, the lesion changes appearance to match what the lesion would look like on different skin types.

Early results indicated that using the generated images as additional data to train a skin condition classifier may improve performance at detecting rare malignant conditions, such as melanoma. However, more work is needed to explore how best to utilize such generated images to improve accuracy more generally across rarer skin types and conditions.

Generating Pathology Images with Different Labels Across Diverse Scanners
The focus quality of medical images is important for accurate diagnoses. Poor focus quality can trigger both false positives and false negatives, even in otherwise accurate ML-based metastatic breast cancer detection algorithms. Determining whether or not pathology images are in-focus is difficult due to factors such as the complexity of the image acquisition process. Digitized whole-slide images could have poor focus across the entire image, but since they are essentially stitched together from thousands of smaller fields of view, they could also have subregions with different focus properties than the rest of the image. This makes manual screening for focus quality impractical and motivates the desire for an automated approach to detect poorly-focused slides and locate out-of-focus regions. Identifying regions with poor focus might enable re-scanning, or yield opportunities to improve the focusing algorithms used during the scanning process.

In our second project, presented in “Whole-slide image focus quality: Automatic assessment and impact on AI cancer detection”, published in the Journal of Pathology Informatics, we develop a method of evaluating de-identified, large gigapixel pathology images for focus quality issues. This involved training a convolutional neural network on semi-synthetic training data that represent different tissue types and slide scanner optical properties. However, a key barrier towards developing such an ML-based system was the lack of labeled data — focus quality is difficult to grade reliably and labeled datasets were not available. To exacerbate the problem, because focus quality affects minute details of the image, any data collected for a specific scanner may not be representative of other scanners, which may have differences in the physical optical systems, the stitching procedure used to recreate a large pathology image from captured image tiles, white-balance and post-processing algorithms, and more. This led us to develop a novel multi-step system for generating synthetic images that exhibit realistic out-of-focus characteristics.

We deconstructed the process of collecting training data into multiple steps. The first step was to collect images from various scanners and to label in-focus regions. This task is substantially easier than trying to determine the degree to which an image is out of focus, and can be completed by non-experts. Next, we generated synthetic out-of-focus images, inspired by the sequence of events that happen prior to a real out-of-focus image is captured: the optical blurring effect happens first, followed by those photons being collected by a sensor (a process that adds sensor noise), and finally software compression adds noise.

A sequence of images showing step-wise out-of-focus image generation. Images are shown in grayscale to accentuate the difference between steps. First, an in-focus image is collected (a) and a bokeh effect is added to produce a blurry image (b). Next, sensor noise is added to simulate a real image sensor (c), and finally JPEG compression is added to simulate the sharp edges introduced by post-acquisition software processing (d). A real out-of-focus image is shown for comparison (e).

Our study shows that modeling each step is essential for optimal results across multiple scanner types, and remarkably, enabled the detection of spectacular out-of-focus patterns in real data:

An example of a particularly interesting out-of-focus pattern across a biological tissue slice. Areas in blue were recognized by the model to be in-focus, whereas areas highlighted in yellow, orange, or red were more out of focus. The gradation in focus here (represented by concentric circles: a red/orange out-of-focus center surrounded by green/cyan mildly out-of-focus, and then a blue in-focus ring) was caused by a hard “stone” in the center that lifted the surrounding biological tissue.

Implications and Future Outlook
Though the volume of data used to develop ML systems is seen as a fundamental bottleneck, we have presented techniques for generating synthetic data that can be used to improve the diversity of training data for ML models and thereby improve the ability of ML to work well on more diverse datasets. We should caution though that these methods are not appropriate for validation data, so as to avoid bias such as an ML model performing well only on synthetic data. To ensure unbiased, statistically-rigorous evaluation, real data of sufficient volume and diversity will still be needed, though techniques such as inverse probability weighting (for example, as leveraged in our work on ML for chest X-rays) may be useful there. We continue to explore other approaches to more efficiently leverage de-identified data to improve data diversity and reduce the need for large datasets in the development of ML models for healthcare.

Acknowledgements
These projects involved the efforts of multidisciplinary teams of software engineers, researchers, clinicians and cross functional contributors. Key contributors to these projects include Timo Kohlberger, Yun Liu, Melissa Moran, Po-Hsuan Cameron Chen, Trissia Brown, Jason Hipp, Craig Mermel, Martin Stumpe, Amirata Ghorbani, Vivek Natarajan, David Coz, and Yuan Liu. The authors would also like to acknowledge Daniel Fenner, Samuel Yang, Susan Huang, Kimberly Kanada, Greg Corrado and Erica Brand for their advice, members of the Google Health dermatology and pathology teams for their support, and Ashwin Kakarla and Shivamohan Reddy Garlapati for their team for image labeling.

Source: Google AI Blog

Detecting hidden signs of anemia from the eye

Beyond helping us navigate the world, the human eye can reveal signs of underlying disease, which care providers can now uncover during a simple, non-invasive screening (a photograph taken of the back of the eye). We’ve previously shown that deep learning applied to these photos can help identify diabetic eye disease as well as cardiovascular risk factors. Today, we’re sharing how we’re continuing to use deep learning to detect anemia.

Anemia is a major public health problem that affects 1.6 billion people globally, and can cause tiredness, weakness, dizziness and drowsiness. The diagnosis of anemia typically involves a blood test to measure the amount of hemoglobin (a critical protein in your red blood cells that carries oxygen). If your hemoglobin is lower than normal, that indicates anemia. Women during pregnancy are at particularly high risk of anemia with more than 2 in 5 affected, and anemia can also be an early sign of colon cancer in otherwise healthy individuals.

Our findings

In our latest work, "Detection of anemia from retinal fundus images via deep learning" published in “Nature Biomedical Engineering” we find that a deep learning model can quantify hemoglobin using de-identified photographs of the back of the eye and common metadata (e.g. age, self-reported sex) from the UK Biobank, a population-based study. Compared to just using metadata, deep learning improved the detection of anemia (as measured using the AUC), from 74 percent to 88 percent.

To ensure these promising findings were not the result of chance or false correlations, other scientists helped to validate the model—which was initially developed on a dataset of primarily Caucasian ancestry—on a separate dataset from Asia. The performance of the model was similar on both datasets, suggesting the model could be useful in a variety of settings.

Optic disc — Multiple “explanation” techniques suggest that the optic disc is important for detecting anemia from images of the back of the eye.

Because this research uncovered new findings about the effects of anemia on the eye, we wanted to identify which parts of the eye contained signs of anemia. Our analysis revealed that much of the information comes from the optic disc and surrounding blood vessels. The optic disc is where nerves and blood vessels enter and exit the eye, and normally appears much brighter than the surrounding areas on a photograph of the back of the eye.

Key takeaways

This method to non-invasively screen for anemia could add value to existing diabetic eye disease screening programs, or support an anemia screening that would be quicker and easier than a blood test. Additionally, this work is another example of using deep learning with explainable insights to discover new biomedical knowledge, extending our previous work oncardiovascular risk factors, refractive error, and progression of macular degeneration. We hope this will inspire additional research to reveal new scientific insights from existing medical tests, and to help improve early interventions and health outcomes.

To read more about our latest research for improving the diagnosis of eye diseases, visit Nature Communications and Ophthalmology. You can find more research from Google Health team here.

Source: The Official Google Blog

Using AI to improve breast cancer screening

Breast cancer is a condition that affects far too many women across the globe. More than 55,000 people in the U.K. are diagnosed with breast cancer each year, and about 1 in 8 women in the U.S. will develop the disease in their lifetime.

Digital mammography, or X-ray imaging of the breast, is the most common method to screen for breast cancer, with over 42 million exams performed each year in the U.S. and U.K. combined. But despite the wide usage of digital mammography, spotting and diagnosing breast cancer early remains a challenge.

Reading these X-ray images is a difficult task, even for experts, and can often result in both false positives and false negatives. In turn, these inaccuracies can lead to delays in detection and treatment, unnecessary stress for patients and a higher workload for radiologists who are already in short supply.

Over the last two years, we’ve been working with leading clinical research partners in the U.K. and U.S. to see if artificial intelligence could improve the detection of breast cancer. Today, we’re sharing our initial findings, which have been published in Nature. These findings show that our AI model spotted breast cancer in de-identified screening mammograms (where identifiable information has been removed) with greater accuracy, fewer false positives, and fewer false negatives than experts. This sets the stage for future applications where the model could potentially support radiologists performing breast cancer screenings.

Our research

In collaboration with colleagues at DeepMind, Cancer Research UK Imperial Centre, Northwestern University and Royal Surrey County Hospital, we set out to see if artificial intelligence could support radiologists to spot the signs of breast cancer more accurately.

The model was trained and tuned on a representative data set comprised of de-identified mammograms from more than 76,000 women in the U.K. and more than 15,000 women in the U.S., to see if it could learn to spot signs of breast cancer in the scans. The model was then evaluated on a separate de-identified data set of more than 25,000 women in the U.K. and over 3,000 women in the U.S. In this evaluation, our system produced a 5.7 percent reduction of false positives in the U.S, and a 1.2 percent reduction in the U.K. It produced a 9.4 percent reduction in false negatives in the U.S., and a 2.7 percent reduction in the U.K.

We also wanted to see if the model could generalize to other healthcare systems. To do this, we trained the model only on the data from the women in the U.K. and then evaluated it on the data set from women in the U.S. In this separate experiment, there was a 3.5 percent reduction in false positives and an 8.1 percent reduction in false negatives, showing the model’s potential to generalize to new clinical settings while still performing at a higher level than experts.

Animation showing tumour growth and metastatic spread in breast cancer_resize.gif — This is a visualization of tumor growth and metastatic spread in breast cancer. Screening aims to detect breast cancer early, before symptoms develop.

Notably, when making its decisions, the model received less information than human experts did. The human experts (in line with routine practice) had access to patient histories and prior mammograms, while the model only processed the most recent anonymized mammogram with no extra information. Despite working from these X-ray images alone, the model surpassed individual experts in accurately identifying breast cancer.

Next steps

Looking forward to future applications, there are some promising signs that the model could potentially increase the accuracy and efficiency of screening programs, as well as reduce wait times and stress for patients. Google’s Chief Financial Officer Ruth Porat shared her optimism around potential technological breakthroughs in this area in a post in October reflecting on her personal experience with breast cancer.

But getting there will require continued research, prospective clinical studies and regulatory approval to understand and prove how software systems inspired by this research could improve patient care.

This work is the latest strand of our research looking into detection and diagnosis of breast cancer, not just within the scope of radiology, but also pathology. In 2017, we published early findings showing how our models can accurately detect metastatic breast cancer from lymph node specimens. Last year, we also developed a deep learning algorithm that could help doctors spot breast cancer more quickly and accurately in pathology slides.

We’re looking forward to working with our partners in the coming years to translate our machine learning research into tools that benefit clinicians and patients.

Source: The Official Google Blog

Lessons Learned from Developing ML for Healthcare

Posted by Yun Liu, Research Scientist and Po-Hsuan Cameron Chen, Research Engineer, Google Health

Machine learning (ML) methods are not new in medicine -- traditional techniques, such as decision trees and logistic regression, were commonly used to derive established clinical decision rules (for example, the TIMI Risk Score for estimating patient risk after a coronary event). In recent years, however, there has been a tremendous surge in leveraging ML for a variety of medical applications, such as predicting adverse events from complex medical records, and improving the accuracy of genomic sequencing. In addition to detecting known diseases, ML models can tease out previously unknown signals, such as cardiovascular risk factors and refractive error from retinal fundus photographs.

Beyond developing these models, it’s important to understand how they can be incorporated into medical workflows. Previous research indicates that doctors assisted by ML models can be more accurate than either doctors or models alone in grading diabetic eye disease and diagnosing metastatic breast cancer. Similarly, doctors are able to leverage ML-based tools in an interactive fashion to search for similar medical images, providing further evidence that doctors can work effectively with ML-based assistive tools.

In an effort to improve guidance for research at the intersection of ML and healthcare, we have written a pair of articles, published in Nature Materials and the Journal of the American Medical Association (JAMA). The first is for ML practitioners to better understand how to develop ML solutions for healthcare, and the other is for doctors who desire a better understanding of whether ML could help improve their clinical work.

How to Develop Machine Learning Models for Healthcare
In “How to develop machine learning models for healthcare” (pdf), published in Nature Materials, we discuss the importance of ensuring that the needs specific to the healthcare environment inform the development of ML models for that setting. This should be done throughout the process of developing technologies for healthcare applications, from problem selection, data collection and ML model development to validation and assessment, deployment and monitoring.

The first consideration is how to identify a healthcare problem for which there is both an urgent clinical need and for which predictions based on ML models will provide actionable insight. For example, ML for detecting diabetic eye disease can help alleviate the screening workload in parts of the world where diabetes is prevalent and the number of medical specialists is insufficient. Once the problem has been identified, one must be careful with data curation to ensure that the ground truth labels, or “reference standard”, applied to the data are reliable and accurate. This can be accomplished by validating labels via comparison to expert interpretation of the same data, such as retinal fundus photographs, or through an orthogonal procedure, such as a biopsy to confirm radiologic findings. This is particularly important since a high-quality reference standard is essential both for training useful models and for accurately measuring model performance. Therefore, it is critical that ML practitioners work closely with clinical experts to ensure the rigor of the reference standard used for training and evaluation.

Validation of model performance is also substantially different in healthcare, because the problem of distributional shift can be pronounced. In contrast to typical ML studies where a single random test split is common, the medical field values validation using multiple independent evaluation datasets, each with different patient populations that may exhibit differences in demographics or disease subtypes. Because the specifics depend on the problem, ML practitioners should work closely with clinical experts to design the study, with particular care in ensuring that the model validation and performance metrics are appropriate for the clinical setting.

Integration of the resulting assistive tools also requires thoughtful design to ensure seamless workflow integration, with consideration for measurement of the impact of these tools on diagnostic accuracy and workflow efficiency. Importantly, there is substantial value in prospective study of these tools in real patient care to better understand their real-world impact.

Finally, even after validation and workflow integration, the journey towards deployment is just beginning: regulatory approval and continued monitoring for unexpected error modes or adverse events in real use remains ahead.

Two examples of the translational process of developing, validating, and implementing ML models for healthcare based on our work in detecting diabetic eye disease and metastatic breast cancer.

Empowering Doctors to Better Understand Machine Learning for Healthcare
In “Users’ Guide to the Medical Literature: How to Read Articles that use Machine Learning,” published in JAMA, we summarize key ML concepts to help doctors evaluate ML studies for suitability of inclusion in their workflow. The goal of this article is to demystify ML, to assist doctors who need to use ML systems to understand their basic functionality, when to trust them, and their potential limitations.

The central questions doctors ask when evaluating any study, whether ML or not, remain: Was the reference standard reliable? Was the evaluation unbiased, such as assessing for both false positives and false negatives, and performing a fair comparison with clinicians? Does the evaluation apply to the patient population that I see? How does the ML model help me in taking care of my patients?

In addition to these questions, ML models should also be scrutinized to determine whether the hyperparameters used in their development were tuned on a dataset independent of that used for final model evaluation. This is particularly important, since inappropriate tuning can lead to substantial overestimation of performance, e.g., a sufficiently sophisticated model can be trained to completely memorize the training dataset and generalize poorly to new data. Ensuring that tuning was done appropriately requires being mindful of ambiguities in dataset naming, and in particular, using the terminology with which the audience is most familiar:

The intersection of two fields: ML and healthcare creates ambiguity in the term “validation dataset”. An ML validation set is typically used to refer to the dataset used for hyperparameter tuning, whereas a “clinical” validation set is typically used for final evaluation. To reduce confusion, we have opted to refer to the (ML) validation set as the “tuning” set.

Future outlook
It is an exciting time to work on AI for healthcare. The “bench-to-bedside” path is a long one that requires researchers and experts from multiple disciplines to work together in this translational process. We hope that these two articles will promote mutual understanding of what is important for ML practitioners developing models for healthcare and what is emphasized by doctors evaluating these models, thus driving further collaborations between the fields and towards eventual positive impact on patient care.

Acknowledgements
Key contributors to these projects include Yun Liu, Po-Hsuan Cameron Chen, Jonathan Krause, and Lily Peng. The authors would like to acknowledge Greg Corrado and Avinash Varadarajan for their advice, and the Google Health team for their support.

Source: Google AI Blog

Developing Deep Learning Models for Chest X-rays with Adjudicated Image Labels

Posted by Dave Steiner, MD, Research Scientist and Shravya Shetty, Technical Lead, Google Health

With millions of diagnostic examinations performed annually, chest X-rays are an important and accessible clinical imaging tool for the detection of many diseases. However, their usefulness can be limited by challenges in interpretation, which requires rapid and thorough evaluation of a two-dimensional image depicting complex, three-dimensional organs and disease processes. Indeed, early-stage lung cancers or pneumothoraces (collapsed lungs) can be missed on chest X-rays, leading to serious adverse outcomes for patients.

Advances in machine learning (ML) present an exciting opportunity to create new tools to help experts interpret medical images. Recent efforts have shown promise in improving lung cancer detection in radiology, prostate cancer grading in pathology, and differential diagnoses in dermatology. For chest X-ray images in particular, large, de-identified public image sets are available to researchers across disciplines, and have facilitated several valuable efforts to develop deep learning models for X-ray interpretation. However, obtaining accurate clinical labels for the very large image sets needed for deep learning can be difficult. Most efforts have either applied rule-based natural language processing (NLP) to radiology reports or relied on image review by individual readers, both of which may introduce inconsistencies or errors that can be especially problematic during model evaluation. Another challenge involves assembling datasets that represent an adequately diverse spectrum of cases (i.e., ensuring inclusion of both “hard” cases and “easy” cases that represent the full spectrum of disease presentation). Finally, some chest X-ray findings are non-specific and depend on clinical information about the patient to fully understand their significance. As such, establishing labels that are clinically meaningful and have consistent definitions can be a challenging component of developing machine learning models that use only the image as input. Without standardized and clinically meaningful datasets as well as rigorous reference standard methods, successful application of ML to interpretation of chest X-rays will be hindered.

To help address these issues, we recently published “Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation” in the journal Radiology. In this study we developed deep learning models to classify four clinically important findings on chest X-rays — pneumothorax, nodules and masses, fractures, and airspace opacities. These target findings were selected in consultation with radiologists and clinical colleagues, so as to focus on conditions that are both critical for patient care and for which chest X-ray images alone are an important and accessible first-line imaging study. Selection of these findings also allowed model evaluation using only de-identified images without additional clinical data.

Models were evaluated using thousands of held-out images from each dataset for which we collected high-quality labels using a panel-based adjudication process among board-certified radiologists. Four separate radiologists also independently reviewed the held-out images in order to compare radiologist accuracy to that of the deep learning models (using the panel-based image labels as the reference standard). For all four findings and across both datasets, the deep learning models demonstrated radiologist-level performance. We are sharing the adjudicated labels for the publicly available data here to facilitate additional research.

Data Overview
This work leveraged over 600,000 images sourced from two de-identified datasets. The first dataset was developed in collaboration with co-authors at the Apollo Hospitals, and consists of a diverse set of chest X-rays obtained over several years from multiple locations across the Apollo Hospitals network. The second dataset is the publicly available ChestX-ray14 image set released by the National Institutes of Health (NIH). This second dataset has served as an important resource for many machine learning efforts, yet has limitations stemming from issues with the accuracy and clinical interpretation of the currently available labels.

Chest X-ray depicting an upper left lobe pneumothorax identified by the model and the adjudication panel, but missed by the individual radiologist readers. Left: The original image. Right: The same image with the most important regions for the model prediction highlighted in orange.

Training Set Labels Using Deep Learning and Visual Image Review
For very large datasets consisting of hundreds of thousands of images, such as those needed to train highly accurate deep learning models, it is impractical to manually assign image labels. As such, we developed a separate, text-based deep learning model to extract image labels using the de-identified radiology reports associated with each X-ray. This NLP model was then applied to provide labels for over 560,000 images from the Apollo Hospitals dataset used for training the computer vision models.

To reduce noise from any errors introduced by the text-based label extraction and also to provide the relevant labels for a substantial number of the ChestX-ray14 images, approximately 37,000 images across the two datasets were visually reviewed by radiologists. These were separate from the NLP-based labels and helped to ensure high quality labels across such a large, diverse set of training images.

Creating and Sharing Improved Reference Standard Labels
To generate high-quality reference standard labels for model evaluation, we utilized a panel-based adjudication process, whereby three radiologists reviewed all final tune and test set images and resolved disagreements through discussion. This often allowed difficult findings that were initially only detected by a single radiologist to be identified and documented appropriately. To reduce the risk of bias based on any individual radiologist’s personality or seniority, the discussions took place anonymously via an online discussion and adjudication system.

Because the lack of available adjudicated labels was a significant initial barrier to our work, we are sharing with the research community all of the adjudicated labels for the publicly available ChestX-ray14 dataset, including 2,412 training/validation set images and 1,962 test set images (4,374 images in total). We hope that these labels will facilitate future machine learning efforts and enable better apples-to-apples comparisons between machine learning models for chest X-ray interpretation.

Future Outlook
This work presents several contributions: (1) releasing adjudicated labels for images from a publicly available dataset; (2) a method to scale accurate labeling of training data using a text-based deep learning model; (3) evaluation using a diverse set of images with expert-adjudicated reference standard labels; and ultimately (4) radiologist-level performance of deep learning models for clinically important findings on chest X-rays.

However, in regards to model performance, achieving expert-level accuracy on average is just a part of the story. Even though overall accuracy for the deep learning models was consistently similar to that of radiologists for any given finding, performance for both varied across datasets. For example, the sensitivity for detecting pneumothorax among radiologists was approximately 79% for the ChestX-ray14 images, but was only 52% for the same radiologists on the other dataset, suggesting a more difficult collection cases in the latter. This highlights the importance of validating deep learning tools on multiple, diverse datasets and eventually across the patient populations and clinical settings in which any model is intended to be used.

The performance differences between datasets also emphasize the need for standardized evaluation image sets with accurate reference standards in order to allow comparison across studies. For example, if two different models for the same finding were evaluated using different datasets, comparing performance would be of minimal value without knowing additional details such as the case mix, model error modes, or radiologist performance on the same cases.

Finally, the model often identified findings that were consistently missed by radiologists, and vice versa. As such, strategies that combine the unique “skills” of both the deep learning systems and human experts are likely to hold the most promise for realizing the potential of AI applications in medical image interpretation.

Acknowledgements
Key contributors to this project at Google include Sid Mittal, Gavin Duggan, Anna Majkowska, Scott McKinney, Andrew Sellergren, David Steiner, Krish Eswaran, Po-Hsuan Cameron Chen, Yun Liu, Shravya Shetty, and Daniel Tse. Significant contributions and input were also made by radiologist collaborators Joshua Reicher, Alexander Ding, and Sreenivasa Raju Kalidindi. The authors would also like to acknowledge many members of the Google Health radiology team including Jonny Wong, Diego Ardila, Zvika Ben-Haim, Rory Sayres, Shahar Jamshy, Shabir Adeel, Mikhail Fomitchev, Akinori Mitani, Quang Duong, William Chen and Sahar Kazemzadeh. Sincere appreciation also goes to the many radiologists who enabled this work through their expert image interpretation efforts throughout the project.

Source: Google AI Blog

Tools to help healthcare providers deliver better care

There has been a lot of interest around our collaboration with Ascension. As a physician, I understand. Health is incredibly personal, and your health information should be private to you and the people providing your care.

That’s why I want to clarify what our teams are doing, why we’re doing it, and how it will help your healthcare providers—and you.

Doctors and nurses love caring for patients, but aren’t always equipped with the tools they need to thrive in their mission. We have all seen headlines like "Why doctors hate their computers," with complaints about having to use "a disconnected patchwork" that makes finding critical health information like finding a needle in the haystack. The average U.S. health system has 18 electronic medical record systems, and our doctors and nurses feel like they are "data clerks" rather than healers.

Google has spent two decades on similar problems for consumers, building products such as Search, Translate and Gmail, and we believe we can adapt our technology to help. That’s why we’re building an intelligent suite of tools to help doctors, nurses, and other providers take better care of patients, leveraging our expertise in organizing information.

One of those tools aims to make health records more useful, more accessible and more searchable by pulling them into a single, easy-to-use interface for doctors. I mentioned this during my presentation last month at theHLTH Conference. Ascension is the first partner where we are working with the frontline staff to pilot this tool.

Google Health: Tools to help healthcare providers deliver better care

This effort is challenging. Health information is incredibly complex—there are misspellings, different ways of saying the same thing, handwritten scribbles, and faxes. Healthcare IT systems also don’t talk well to each other and this keeps doctors and nurses from taking the best possible care of you.

Policymakers and regulators across the world (e.g., CMS, HHS, the NHS, and EC )have called this out as an important issue. We’ve committed to help, and it’s why we built this system on interoperable standards.

To deliver such a tool to providers, the system must operate on patients' records. This is what people have been asking about in the context of our Ascension partnership, and why we want to clarify how we handle that data.

As we noted in an earlier post, our work adheres to strict regulations on handling patient data, and our Business Associate Agreement with Ascension ensures their patient data cannot be used for any other purpose than for providing our services—this means it’s never used for advertising. We’ve also published a white paper around how customer data is encrypted and isolated in the cloud.

To ensure that our tools are safe for Ascension doctors and nurses treating real patients, members of our team might come into contact with identifiable patient data. Because of this, we have strict controls for the limited Google employees who handle such data:

We develop and test our system on synthetic (fake) data and openly available datasets.
To configure, test, tune and maintain the service in a clinical setting, a limited number of screened and qualified Google staff may be exposed to real data. These staff undergo HIPAA and medical ethics training, and are individually and explicitly approved by Ascension for a limited time.
We have technical controls to further enhance data privacy. Data is accessible in a strictly controlled environment with audit trails—these controls are designed to prevent the data from leaving this environment and access to patient data is monitored and auditable.
We will further prioritize the development of technology that reduces the number of engineers that need access to patient data (similar to our external redactiontechnology).
We also participate in external certifications, like ISO 27001, where independent third-party auditors come and check our processes, including information security controls for these tools.

I graduated from medical school in 1989. I've seen tremendous progress in healthcare over the ensuing decades, but this progress has also brought with it challenges of information overload that have taken doctors’ and nurses’ attentions away from the patients they are called to serve. I believe technology has a major role to play in reversing this trend, while also improving how care is delivered in ways that can save lives.

googblogs.com

All Google blogs and Press in one site

Tag Archives: Health

A Step Towards Protecting Patients from Medication Errors

Source: Google AI Blog

Connecting people with COVID-19 information and resources

Source: The Official Google Blog

Connecting people with COVID-19 information and resources

Source: Google LatLong

Get helpful health info from the NHS, right in Search

Source: Search

Generating Diverse Synthetic Medical Image Data for Training Machine Learning Models

Source: Google AI Blog

Detecting hidden signs of anemia from the eye

Our findings

Key takeaways

Source: The Official Google Blog

Using AI to improve breast cancer screening

Our research

Next steps

Source: The Official Google Blog

Lessons Learned from Developing ML for Healthcare

Source: Google AI Blog

Developing Deep Learning Models for Chest X-rays with Adjudicated Image Labels

Source: Google AI Blog

Tools to help healthcare providers deliver better care

Google Health: Tools to help healthcare providers deliver better care

Source: The Official Google Blog