Deciphering Clinical Abbreviations with Privacy Protecting ML

Today many people have digital access to their medical records, including their doctor’s clinical notes. However, clinical notes are hard to understand because of the specialized language that clinicians use, which contains unfamiliar shorthand and abbreviations. In fact, there are thousands of such abbreviations, many of which are specific to certain medical specialities and locales or can mean multiple things in different contexts. For example, a doctor might write in their clinical notes, “pt referred to pt for lbp”, which is meant to convey the statement: “Patient referred to physical therapy for low back pain.” Coming up with this translation is tough for laypeople and computers because some abbreviations are uncommon in everyday language (e.g., “lbp” means “low back pain”), and even familiar abbreviations, such as “pt” for “patient”, can have alternate meanings, such as “physical therapy.” To disambiguate between multiple meanings, the surrounding context must be considered. It’s no easy task to decipher all the meanings, and prior research suggests that expanding the shorthand and abbreviations can help patients better understand their health, diagnoses, and treatments.

In “Deciphering clinical abbreviations with a privacy protecting machine learning system”, published in Nature Communications, we report our findings on a general method that deciphers clinical abbreviations with state-of-the-art accuracy, on par with board-certified physicians on this task. We built the model using only public data on the web that wasn't associated with any patient (i.e., no potentially sensitive data) and evaluated performance on real, de-identified notes from inpatient and outpatient clinicians from different health systems. To enable the model to generalize from web data to notes, we created a way to algorithmically rewrite large amounts of internet text to look as if it were written by a doctor (called web-scale reverse substitution), and we developed a novel inference method (called elicitive inference).

The model input is a string that may or may not contain medical abbreviations. We trained a model to output a corresponding string in which all abbreviations are simultaneously detected and expanded. If the input string does not contain an abbreviation, the model will output the original string. By Rajkomar et al used under CC BY 4.0/ Cropped from original.

Rewriting Text to Include Medical Abbreviations

Building a system to translate doctors’ notes would usually start with a large, representative dataset of clinical text where all abbreviations are labeled with their meanings. But no such dataset for general use by researchers exists. We therefore sought to develop an automated way to create such a dataset but without the use of any actual patient notes, which might include sensitive data. We also wanted to ensure that models trained on this data would still work well on real clinical notes from multiple hospital sites and types of care, such as both outpatient and inpatient.

To do this, we referenced a dictionary of thousands of clinical abbreviations and their expansions, and found sentences on the web that contained uses of the expansions from this dictionary. We then “rewrote” those sentences by abbreviating each expansion, resulting in web data that looked like it was written by a doctor. For instance, if a website contained the phrase “patients with atrial fibrillation can have chest pain,” we would rewrite this sentence to “pts with af can have cp.” We then used the abbreviated text as input to the model, with the original text serving as the label. This approach provided us with large amounts of data to train our model to perform abbreviation expansion.
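The rewriting step can be illustrated with a minimal sketch. The three-entry dictionary and the `reverse_substitute` helper below are hypothetical stand-ins chosen for illustration; the dictionary described above covers thousands of abbreviations, and the production pipeline is not shown in this post.

```python
import re

# Hypothetical mini-dictionary mapping expansions to abbreviations.
# The dictionary described in the post covers thousands of entries.
ABBREVIATIONS = {
    "patients": "pts",
    "atrial fibrillation": "af",
    "chest pain": "cp",
}

def reverse_substitute(sentence, dictionary=ABBREVIATIONS):
    """Return (model_input, label): the abbreviated text and the original text."""
    # Try longer expansions first so multi-word phrases win over shorter ones.
    expansions = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(e) for e in expansions) + r")\b",
        flags=re.IGNORECASE,
    )
    abbreviated = pattern.sub(lambda m: dictionary[m.group(1).lower()], sentence)
    return abbreviated, sentence

model_input, label = reverse_substitute(
    "patients with atrial fibrillation can have chest pain"
)
print(model_input)  # pts with af can have cp
```

The abbreviated string plays the role of the model input and the untouched sentence plays the role of the label, mirroring the pairing described above.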

The idea of “reverse substituting” the long-forms for their abbreviations was introduced in prior research, but our distributed algorithm allows us to extend the technique to large, web-sized datasets. Our algorithm, called web-scale reverse substitution (WSRS), is designed to ensure that rare terms occur more frequently and common terms are downsampled across the public web to derive a more balanced dataset. With this data in hand, we trained a series of large transformer-based language models to expand the web text.

We generate text to train our model on the decoding task by extracting phrases from public web pages that have corresponding medical abbreviations (shaded boxes on the left) and then substituting in the appropriate abbreviations (shaded dots, right). Since some words are found much more frequently than others ("patient" more than "posterior tibialis", both of which can be abbreviated “pt”), we downsampled common expansions to derive a more balanced dataset across the thousands of abbreviations. By Rajkomar et al used under CC BY 4.0.
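As a rough illustration of the balancing idea, the sketch below caps how many sentences are kept for each expansion. The per-expansion cap and the `downsample_by_expansion` helper are assumptions made for this example; the exact sampling rule used by WSRS is not described in this post.

```python
import random
from collections import defaultdict

def downsample_by_expansion(examples, max_per_expansion, seed=0):
    """Keep at most `max_per_expansion` sentences for each expansion.

    `examples` is a list of (expansion, sentence) pairs. The fixed cap is a
    hypothetical stand-in for whatever sampling rule WSRS actually uses.
    """
    rng = random.Random(seed)
    by_expansion = defaultdict(list)
    for expansion, sentence in examples:
        by_expansion[expansion].append(sentence)

    balanced = []
    for expansion, sentences in by_expansion.items():
        if len(sentences) > max_per_expansion:
            sentences = rng.sample(sentences, max_per_expansion)
        balanced.extend((expansion, s) for s in sentences)
    return balanced

# "patient" is far more common on the web than "posterior tibialis".
examples = [("patient", f"sentence {i}") for i in range(1000)]
examples += [("posterior tibialis", f"sentence {i}") for i in range(3)]

balanced = downsample_by_expansion(examples, max_per_expansion=10)
counts = defaultdict(int)
for expansion, _ in balanced:
    counts[expansion] += 1
print(dict(counts))  # {'patient': 10, 'posterior tibialis': 3}
```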

Adapting Protein Alignment Algorithms to Unstructured Clinical Text

Evaluation of these models on the particular task of abbreviation expansion is difficult. Because they produce unstructured text as output, we had to figure out which abbreviations in the input correspond to which expansions in the output. To achieve this, we created a modified version of the Needleman-Wunsch algorithm, which was originally designed for aligning divergent sequences in molecular biology, to align the model input and output and extract the corresponding abbreviation-expansion pairs. Using this alignment technique, we were able to evaluate the model’s capacity to detect and expand abbreviations accurately. We evaluated Text-to-Text Transfer Transformer (T5) models of various sizes (ranging from 60 million to over 60 billion parameters) and found that larger models performed translation better than smaller models, with the biggest model achieving the best performance.
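To make the alignment idea concrete, here is a sketch of the textbook Needleman-Wunsch algorithm applied to word tokens, followed by a simple grouping step that reads off abbreviation-expansion pairs. This is not the modified algorithm from the paper: the scoring values and the `extract_pairs` grouping rule are assumptions for illustration, and the grouping would merge two abbreviations that sit next to each other.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align token lists a and b with textbook Needleman-Wunsch."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover aligned columns.
    columns, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            columns.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            columns.append((a[i - 1], None))
            i -= 1
        else:
            columns.append((None, b[j - 1]))
            j -= 1
    columns.reverse()
    return columns

def extract_pairs(columns):
    """Group runs of non-matching columns into (abbreviation, expansion) pairs."""
    pairs, short, long_ = [], [], []
    for src, tgt in columns + [("", "")]:  # sentinel match flushes the last run
        if src is not None and src == tgt:
            if short:  # keep only runs that contain at least one input token
                pairs.append((" ".join(short), " ".join(long_)))
            short, long_ = [], []
        else:
            if src is not None:
                short.append(src)
            if tgt is not None:
                long_.append(tgt)
    return pairs

source = "pt referred to pt for lbp".split()
target = "patient referred to physical therapy for low back pain".split()
pairs = extract_pairs(align(source, target))
print(pairs)
# [('pt', 'patient'), ('pt', 'physical therapy'), ('lbp', 'low back pain')]
```

Words that appear unchanged on both sides act as anchors, so each stretch between two anchors is treated as one abbreviation and its expansion.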


Creating New Model Inference Techniques to Coax the Model

However, we did find something unexpected. When we evaluated the performance on multiple external test sets from real clinical notes, we found the models would leave some abbreviations unexpanded, and for larger models, the problem of incomplete expansion was even worse. This is mainly because, while we substitute expansions on the web for their abbreviations, we have no way of handling the abbreviations that are already present. These abbreviations therefore appear in both the rewritten text used as input and the original text used as the label, and the model learns not to expand them.

To address this, we developed a new inference-chaining technique in which the model output is fed again as input to coax the model to make further expansions as long as the model is confident in the expansion. In technical terms, our best-performing technique, which we call elicitive inference, involves examining the outputs from a beam search above a certain log-likelihood threshold. Using elicitive inference, we were able to achieve state-of-the-art capability of expanding abbreviations in multiple external test sets.

Real example of the model’s input (left) and output (right).
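The chaining loop can be sketched as follows. Everything here is a simplified illustration under stated assumptions: `beam_search` is a hypothetical callable that returns candidate outputs with log-likelihoods, the toy lookup table stands in for a real model, and the rule of picking the most likely confident candidate is one plausible reading of the description above rather than the published procedure.

```python
def elicitive_inference(text, beam_search, threshold, max_rounds=5):
    """Re-feed the model's output while a confident candidate changes the text.

    `beam_search(text)` is a hypothetical callable returning a list of
    (candidate_text, log_likelihood) pairs; it stands in for a real model.
    """
    for _ in range(max_rounds):
        confident = [
            (candidate, log_likelihood)
            for candidate, log_likelihood in beam_search(text)
            if log_likelihood >= threshold and candidate != text
        ]
        if not confident:
            break  # nothing confident left to expand
        text = max(confident, key=lambda pair: pair[1])[0]
    return text

# Toy stand-in for a model: a lookup table of beam-search outputs.
TOY_OUTPUTS = {
    "pts with af can have cp": [
        ("patients with af can have cp", -0.2),
        ("pts with af can have cp", -1.5),
    ],
    "patients with af can have cp": [
        ("patients with af can have cp", -0.3),
        ("patients with atrial fibrillation can have chest pain", -0.8),
        ("patients with afib can have cp", -4.0),
    ],
}

def toy_beam_search(text):
    return TOY_OUTPUTS.get(text, [(text, 0.0)])

result = elicitive_inference("pts with af can have cp", toy_beam_search, threshold=-1.0)
print(result)  # patients with atrial fibrillation can have chest pain
```

In the toy run, the first pass expands only “pts”, and the second pass, fed the first pass's output, expands the remaining abbreviations; the low-likelihood candidate is never selected.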

Comparative Performance

We also sought to understand how patients and doctors currently perform at deciphering clinical notes, and how our model compared. We found that lay people (people without specific medical training) demonstrated less than 30% comprehension of the abbreviations present in the sample medical texts. When we allowed them to use Google Search, their comprehension increased to nearly 75%, still leaving roughly 1 out of 4 abbreviations indecipherable. Unsurprisingly, medical students and trained physicians performed much better at the task, with an accuracy of 90%. We found that our largest model was capable of matching or exceeding experts, with an accuracy of 98%.

How does the model perform so well compared to physicians in this task? There are two important factors in the model’s high comparative performance. First, there were some abbreviations that clinicians did not even attempt to expand (such as "cm" for centimeter), which partly lowered their measured performance. This might seem unimportant, but for non-English speakers, these abbreviations may not be familiar, and so it may be helpful to have them written out. In contrast, our model is designed to comprehensively expand abbreviations. Second, clinicians are familiar with abbreviations they commonly see in their speciality, but other specialists use shorthand that is not understood by those outside their fields. Our model is trained on thousands of abbreviations across multiple specialities and therefore can decipher a breadth of terms.


Towards Improved Health Literacy

We think there are numerous avenues in which large language models (LLMs) can help advance the health literacy of patients by augmenting the information they see and read. Most LLMs are trained on data that does not look like clinical note data, and the unique distribution of this data makes it challenging to deploy these models in an out-of-the-box fashion. We have demonstrated how to overcome this limitation. Our model also serves to "normalize" clinical note data, facilitating additional capabilities of ML to make the text easier for patients of all educational and health-literacy levels to understand.


Acknowledgements

This work was carried out in collaboration with Yuchen Liu, Jonas Kemp, Benny Li, Ming-Jun Chen, Yi Zhang, Afroz Mohiddin, and Juraj Gottweis. We thank Lisa Williams, Yun Liu, Arelene Chung, and Andrew Dai for many useful conversations and discussions about this work.

Source: Google AI Blog


EHR-Safe: Generating High-Fidelity and Privacy-Preserving Synthetic Electronic Health Records

Analysis of Electronic Health Records (EHR) has a tremendous potential for enhancing patient care, quantitatively measuring performance of clinical practices, and facilitating clinical research. Statistical estimation and machine learning (ML) models trained on EHR data can be used to predict the probability of various diseases (such as diabetes), track patient wellness, and predict how patients respond to specific drugs. For such models, researchers and practitioners need access to EHR data. However, it can be challenging to leverage EHR data while ensuring data privacy and conforming to patient confidentiality regulations (such as HIPAA).

Conventional methods to anonymize data (e.g., de-identification) are often tedious and costly. Moreover, they can distort important features from the original dataset, decreasing the utility of the data significantly; they can also be susceptible to privacy attacks. Alternatively, an approach based on generating synthetic data can maintain both important dataset features and privacy.

To that end, we propose a novel generative modeling framework in “EHR-Safe: Generating High-Fidelity and Privacy-Preserving Synthetic Electronic Health Records”. With the innovative methodology in EHR-Safe, we show that synthetic data can satisfy two key properties: (i) high fidelity (i.e., they are useful for the task of interest, such as having similar downstream performance when a diagnostic model is trained on them), and (ii) privacy (i.e., they meet certain privacy measures and do not reveal any real patient's identity). Our state-of-the-art results stem from novel approaches for encoding/decoding features, normalizing complex distributions, conditioning adversarial training, and representing missing data.

Generating synthetic data from the original data with EHR-Safe.

Challenges of Generating Realistic Synthetic EHR Data

There are multiple fundamental challenges to generating synthetic EHR data. EHR data contain heterogeneous features with different characteristics and distributions. There can be numerical features (e.g., blood pressure) and categorical features with many or two categories (e.g., medical codes, mortality outcome). Some of these may be static (i.e., not varying during the modeling window), while others are time-varying, such as regular or sporadic lab measurements. Distributions might come from different families — categorical distributions can be highly non-uniform (e.g., for under-represented groups) and numerical distributions can be highly skewed (e.g., a small proportion of values being very large while the vast majority are small). Depending on a patient's condition, the number of visits can also vary drastically — some patients visit a clinic only once whereas some visit hundreds of times, leading to a variance in sequence lengths that is typically much higher compared to other time-series data. There can be a high ratio of missing features across different patients and time steps, as not all lab measurements or other input data are collected.

Examples of real EHR data: temporal numerical features (upper) and temporal categorical features (lower).

EHR-Safe: Synthetic EHR Data Generation Framework

EHR-Safe consists of a sequential encoder-decoder architecture and generative adversarial networks (GANs), depicted in the figure below. Because EHR data are heterogeneous (as described above), direct modeling of raw EHR data is challenging for GANs. To circumvent this, we propose utilizing a sequential encoder-decoder architecture to learn the mapping from the raw EHR data to the latent representations, and vice versa.

Block diagram of EHR-Safe framework.

While learning the mapping, esoteric distributions of numerical and categorical features pose a great challenge. For example, some values or numerical ranges might dominate the distribution, but the capability of modeling rare cases is essential. The proposed feature mapping and stochastic normalization (transforming original feature distributions into uniform distributions without information loss) are key to handling such data, because they convert it to distributions for which the training of the encoder-decoder and the GAN is more stable (details can be found in the paper). The mapped latent representations, generated by the encoder, are then used for GAN training. After training both the encoder-decoder framework and the GANs, EHR-Safe generates synthetic heterogeneous EHR data by feeding randomly sampled vectors to the generator. Note that only the trained generator and decoders are used for generating synthetic data.
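One way to picture stochastic normalization is the sketch below, which assumes each distinct value owns a slice of [0, 1) whose width equals its empirical frequency, and that a value is encoded as a random point inside its slice. The function names and this particular construction are illustrative assumptions; the paper gives the actual procedure.

```python
import bisect
import random
from collections import Counter

def fit_intervals(values):
    """Give each distinct value a slice of [0, 1) as wide as its frequency."""
    counts = Counter(values)
    uniques = sorted(counts)
    upper_bounds, cumulative = [], 0.0
    for value in uniques:
        cumulative += counts[value] / len(values)
        upper_bounds.append(cumulative)
    upper_bounds[-1] = 1.0  # guard against floating-point rounding
    return uniques, upper_bounds

def normalize(value, uniques, upper_bounds, rng):
    """Map a value to a random point inside its slice (half-open interval)."""
    idx = uniques.index(value)
    low = upper_bounds[idx - 1] if idx > 0 else 0.0
    return low + rng.random() * (upper_bounds[idx] - low)

def denormalize(u, uniques, upper_bounds):
    """Recover the original value from the slice that contains u."""
    idx = bisect.bisect_right(upper_bounds, u)
    return uniques[min(idx, len(uniques) - 1)]

# A skewed feature: most values are 0, a few are large.
values = [0, 0, 0, 0, 0, 0, 0, 1, 5, 250]
rng = random.Random(0)
uniques, upper_bounds = fit_intervals(values)
normalized = [normalize(v, uniques, upper_bounds, rng) for v in values]
restored = [denormalize(u, uniques, upper_bounds) for u in normalized]
print(restored == values)  # True
```

Because the dominant value 0 is spread over 70% of the unit interval instead of piling up at a single point, the encoded feature is close to uniform, and the slice lookup lets the original value be recovered exactly.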


Datasets

We focus on two real-world EHR datasets to showcase the EHR-Safe framework, MIMIC-III and eICU. Both are inpatient datasets that consist of varying lengths of sequences and include multiple numerical and categorical features with missing components.


Fidelity Results

The fidelity metrics focus on the quality of synthetically generated data by measuring the realism of the synthetic data. Higher fidelity implies that it is more difficult to differentiate between synthetic and real data. We evaluate the fidelity of synthetic data in terms of multiple quantitative and qualitative analyses.


Visualization

Having similar coverage and avoiding under-representation of certain data regimes are both important for synthetic data generation. As the t-SNE analyses below show, the coverage of the synthetic data (blue) is very similar to that of the original data (red). With membership inference metrics (introduced in the privacy section), we also verify that EHR-Safe does not just memorize the original training data.

t-SNE analyses on temporal and static data on MIMIC-III (upper) and eICU (lower) datasets.

Statistical Similarity

We provide quantitative comparisons of statistical similarity between original and synthetic data for each feature. Most statistics are well-aligned between original and synthetic data. For example, the Kolmogorov-Smirnov (KS) statistic, i.e., the maximum difference in the cumulative distribution function (CDF) between the original and the synthetic data, is mostly lower than 0.03. More detailed tables can be found in the paper. The figure below shows the CDF graphs for original vs. synthetic data for two features; the curves are very close in most cases.

CDF graphs of two features between original and synthetic EHR data. Left: Mean Airway Pressure. Right: Minute Volume Alarm.
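For readers unfamiliar with the KS statistic, the following sketch computes it for two samples as the largest gap between their empirical CDFs. The toy samples are made up for illustration and are not taken from MIMIC-III or eICU.

```python
import bisect

def ks_statistic(real, synthetic):
    """Maximum absolute gap between the empirical CDFs of two samples."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is less than or equal to x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    # The gap can only change at observed values, so checking those suffices.
    points = sorted(set(real_sorted) | set(synth_sorted))
    return max(abs(ecdf(real_sorted, x) - ecdf(synth_sorted, x)) for x in points)

print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

A value near 0 means the two distributions are nearly indistinguishable for that feature, which is the regime reported above.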

Utility

Because one of the most important use cases of synthetic data is enabling ML innovations, we focus on the fidelity metric that measures the ability of models trained on synthetic data to make accurate predictions on real data. We compare such model performance to an equivalent model trained with real data. Similar model performance would indicate that the synthetic data captures the relevant informative content for the task. As one of the important potential use cases of EHR, we focus on the mortality prediction task. We consider four different predictive models: Gradient Boosting Tree Ensemble (GBDT), Random Forest (RF), Logistic Regression (LR), and Gated Recurrent Units (GRU).

Mortality prediction performance with the model trained on real vs. synthetic data. Left: MIMIC-III. Right: eICU.

In the figure above we see that in most scenarios, models trained on synthetic data and models trained on real data perform very similarly in terms of Area Under the Receiver Operating Characteristic Curve (AUC). On MIMIC-III, the best model (GBDT) on synthetic data is only 2.6% worse than the best model on real data, whereas on eICU, the best model (RF) on synthetic data is only 0.9% worse.


Privacy Results

We consider three different privacy attacks to quantify the robustness of the synthetic data with respect to privacy.

  • Membership inference attack: The adversary predicts whether a known subject was present in the training data used for training the synthetic data model.
  • Re-identification attack: The adversary explores the probability of some features being re-identified using synthetic data and matching to the training data.
  • Attribute inference attack: The adversary predicts the value of sensitive features using synthetic data.

Privacy risk evaluation across three privacy metrics: membership-inference (top-left), re-identification (top-right), and attribute inference (bottom). The ideal value of privacy risk for membership inference is random guessing (0.5). For re-identification, the ideal case is to replace the synthetic data with disjoint holdout original data.

The figure above summarizes the results along with the ideal achievable value for each metric. We observe that the privacy metrics are very close to the ideal in all cases. The risk of an adversary determining whether a sample of the original data was used to train the model is very close to random guessing, which also verifies that EHR-Safe does not just memorize the original training data. For the attribute inference attack, we focus on the prediction task of inferring specific attributes (e.g., gender, religion, and marital status) from other attributes. We compare prediction accuracy when training a classifier with real data against the same classifier trained with synthetic data. Because the EHR-Safe bars are all lower, the results demonstrate that access to synthetic data does not lead to higher prediction performance on specific features as compared to access to the original data.


Comparison to Alternative Methods

We compare EHR-Safe to alternatives (TimeGAN, RC-GAN, C-RNN-GAN) proposed for time-series synthetic data generation. As shown below, EHR-Safe significantly outperforms each.

Downstream task performance (AUC) in comparison to alternatives.

Conclusions

We propose a novel generative modeling framework, EHR-Safe, that can generate highly realistic synthetic EHR data that are robust to privacy attacks. EHR-Safe is based on generative adversarial networks applied to the encoded raw data. We introduce multiple innovations in the architecture and training mechanisms that are motivated by the key challenges of EHR data. These innovations are key to our results that show almost-identical properties with real data (when desired downstream capabilities are considered) with almost-ideal privacy preservation. An important future direction is generative modeling capability for multimodal data, including text and image, as modern EHR data might contain both.


Acknowledgements

We gratefully acknowledge the contributions of Michel Mizrahi, Nahid Farhady Ghalaty, Thomas Jarvinen, Ashwin S. Ravi, Peter Brune, Fanyu Kong, Dave Anderson, George Lee, Arie Meir, Farhana Bandukwala, Elli Kanal, and Tomas Pfister.

Source: Google AI Blog