Tag Archives: bioinformatics

DeepNull: an open-source method to improve the discovery power of genetic association studies

In our paper “DeepNull models non-linear covariate effects to improve phenotypic prediction and association power,” we proposed a new method, DeepNull, to model the complex relationship between covariate effects on phenotypes to improve Genome-wide association studies (GWAS) results. We have released DeepNull as open source software, with a Colab notebook tutorial for its use.

Human Genetics 101

Each individual’s genetic data carries health information such as why certain individuals have a lower risk of developing skin cancer compared to others or why certain drugs differ in effectiveness between individuals. Genetic data is encoded in the human genome—a DNA sequence—composed of a 3 billion long chain built from four possible nucleotides (A, C, G, and T). Only a small subset of the genome (~4-5 million positions) varies between two individuals. One of the goals of genetic studies is to detect variants that are associated with different phenotypes (e.g., risk of diseases such as Glaucoma or observed phenotypic values such as high-density lipoprotein (HDL), low-density lipoproteins (LDL), height, etc).

Genome-wide association studies

GWAS are used to associate genetic variants with complex traits and diseases. To more accurately determine an association strength between genotype and phenotype, the interactions between phenotypes (such as age and sex) and principal components (PCs) of genotypes, must be adjusted for as covariates. Covariate adjustment in GWAS can increase precision and correct for confounding. In the linear model setting, adjustment for a covariate will improve precision (i.e., statistical power) if the distribution of the phenotype differs across levels of the covariate. For example, when performing GWAS on height, males and females have different means. All state of the art methods (e.g., BOLT-LMM, regenie) perform GWAS assuming that the effect of genotypes and covariates to phenotype is linear and additive. However, we know that the assumption of linear and additive contributions of covariates often does not reflect underlying biology, so we sought a method to more comprehensively model and adjust for the interactions between phenotypes for GWAS.

DeepNull method overview

We proposed a new method, DeepNull, to relax the linear assumption of covariate effects on phenotypes. DeepNull trains a deep neural network (DNN) to predict phenotype using all covariates in a 5-fold cross-validation. After training the DeepNull model, we make phenotype predictions for all individuals and add this prediction as one additional covariate in the association test. Major advantages of DeepNull are its simplicity to use and that it requires only a minimal change to existing GWAS pipeline implementations. In other words, to use DeepNull, we just need to add one additional covariate, which is computed by DeepNull, to the existing pipeline to perform GWAS.

DeepNull improves statistical power

We simulated data under different genetic architectures (genetic conditions) to first check that DeepNull controls type I error and then compare DeepNull statistical power with current state of the art methods (hereafter referred to as “Baseline”). First, we simulated data under genetic architectures where covariates have a linear effect on phenotype and observed that both Baseline and DeepNull have tight control of type I error. It is interesting that DeepNull power does not decrease compared to Baseline under a setting in which covariates have only a linear effect on phenotype. Next, we simulated data under genetic architectures where covariates have non-linear effects on phenotype. Both Baseline and DeepNull have tight control of type I error while DeepNull increases the statistical power depending on the genetic architecture. We observed that for certain genetic architectures, DeepNull increases the statistical power up to 20%. Below, we compare the -log p-value of test statistics computed from DeepNull versus Baseline for Apolipoprotein B (ApoB) levels obtained from UK Biobank:
Figure 1. Significance level comparison of DeepNull vs Baseline. X-axis is the -log p-value of Baseline and Y-axis is the -log p-value of DeepNull. The orange dots indicate variants that are significant for Baseline but not significant for DeepNull and green dots indicate variants that are significant for DeepNull but not significant for Baseline.

DeepNull improves phenotype prediction

We applied DeepNull to predict phenotypes by utilizing polygenic risk score (PRS) and existing covariates such as age and sex. We considered 10 phenotypes obtained from UK Biobank. We observed that DeepNull on average increased the phenotype prediction (R2 where R is Pearson correlation) by 23%. More strikingly, in the case of Glaucoma, referral probability that is computed from the fundus images (Phene et al. Ophthalmology 2019, Alipanahi et al AJHG 2021), DeepNull improves the phenotype prediction by 83.4% and in the case of LDL, DeepNull improves the phenotype prediction by 40.3%. The summary of DeepNull results versus Baseline are shown in figure 2 below:

 
 

Figure 2. DeepNull improves phenotype prediction compared to Baseline. The Y-axis is the R2 where R is the Pearson’s correlation between true and predicted value of phenotypes. Phenotypic abbreviations: alkaline phosphatase (ALP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), apolipoprotein B (ApoB), glaucoma referral probability (GRP), LDLcholesterol (LDL), sex hormone-binding globulin (SHBG), and triglycerides (TG).

Conclusion

We proposed a new framework, DeepNull, that can model the nonlinear effect of covariates on phenotypes when such nonlinearity exists. We show that DeepNull can substantially improve phenotype prediction. In addition, we show that DeepNull achieves results similar to a standard GWAS when the effect of covariate on the phenotype is linear and can significantly outperform a standard GWAS when the covariate effects are nonlinear. DeepNull is open source and is available for download from GitHub or installation via PyPI.

By Farhad Hormozdiari and Andrew Carroll – Genomics team in HealthAI

Acknowledgments

This blog summarizes the work of the following Google contributors, who we would like to thank: Zachary R. McCaw, Thomas Colthurst, Ted Yun, Nick Furlotte, Babak Alipanahi, and Cory Y. McLean. In addition, we would like to thank Alkes Price, Babak Behsaz, and Justin Cosentino for their invaluable comments and suggestions.

Analyzing genomic data in families with deep learning

The Genomics team at Google Health is excited to share our latest expansion to DeepVariant - DeepTrio.

First released in 2017, DeepVariant is an open source tool that enables researchers and clinicians to analyze an individual’s genome sequencing data and identify genetic variants, such as those that may cause disease. Our continued work on DeepVariant has been recognized for its top-of-class accuracy. With DeepTrio, we have expanded DeepVariant to be able to consider the genetic variants in the sequence data of a mother-father-child trio.

Humans are diploid organisms, carrying two copies of the human genome. Every individual inherits one copy of the genome from their mother, and the other from their father. Parental inheritance informs analysis of traits and diseases that follow Mendelian inheritance. DeepTrio learns to use the properties of Mendelian inheritance directly from sequencing data in order to more accurately identify genetic variants in cases when both parent and a child sample can be co-analyzed.

Modifying DeepVariant to analyze trio samples

DeepVariant learns to classify positions in a genome as reference or variant using representations of data similar to the “genome browser” which experts use in analysis. “Improving the Accuracy of Genomic Analysis with DeepVariant 1.0” provides a good overview.

DeepVariant receives data as a window of the genome centered on a candidate variant which it is asked to classify as either reference (no variant), heterozygous (one copy of a variant) or homozygous (both copies are variant). DeepVariant sees the sequence evidence as channels representing features of the data (see: “Looking through DeepVariant’s eyes” for a deeper explanation).

We modified DeepTrio to represent the sequence data from a trio in a single image, with a fixed height for each sample and the child in the middle. Using gold standard samples from NIST Genome in a Bottle for truth labels, we train one model to call variants in the child and another to call variants in the top parent. To call both parents, we flip the position of the parent samples.

An image of 4 of the channels that DeepTrio uses in classification (these, and 4 other channels are shown in a stack.

conceptual schematic of how trio files are used to create examples, which are then called by DeepTrio.

Figure 1. (top) An image of 4 of the channels that DeepTrio uses in classification (these, and 4 other channels are shown in a stack. (bottom) conceptual schematic of how trio files are used to create examples, which are then called by DeepTrio.

Measuring DeepTrio’s improved accuracy

We show that DeepTrio is more accurate than DeepVariant for both parent and child variant detection, with an especially pronounced advantage at lower coverages. This enables researchers to either analyze samples at higher accuracy, or to maintain comparable accuracy at a substantially reduced expense.

To assess the accuracy of DeepTrio, we compare its accuracy to DeepVariant using extensively characterized gold standards made available by NIST Genome in a Bottle. In order to have an evaluation dataset which is never seen in training, we exclude chromosome 20 from training and perform evaluations on chromosome 20.

We train DeepVariant and DeepTrio for sequencing data from two different instruments, Illumina and Pacific Biosciences (PacBio), for more information on the differences between these technologies, please see our previous blog. These sequencers both randomly sample the genome in an error-prone manner. To accurately analyze a genome, the same region needs to be sampled repeatedly. The depth of sampling at a position is called coverage. Sequencing to greater coverage is more expensive in an approximately linear manner. This often forces trade-offs between cost, accuracy, and samples sequenced. As a result, in trios parents are often sequenced at lower depth.

In the charts below, we plot the accuracy of DeepTrio and DeepVariant across a range of coverages.

DeepTrio child accuracy

DeepTrio parent accuracy

Figure 2. F1-score for DeepTrio (solid line) and DeepVariant (dashed line) on a child sample (top) and a parent sample (bottom), sequenced with an Illumina (blue) and PacBio (black) instrument. F1 is measured for all types of small variants on chromosome 20, across samples with a range of sequencing coverage (x-axis).

DeepTrio’s performance on de novo variants

Each individual has roughly 5 million variants relative to the human reference genome. The overwhelming majority of these are inherited from their parents. A small number, around 100, are new (referred to as de novo), due to copying errors during DNA replication. We demonstrate that DeepTrio substantially reduces false positives for de novo variants. For Illumina data, this comes with a smaller decrease in recovery of true positives, while for PacBio data, this trade-off does not occur.

To assess accuracy we analyzed sites where both parents are called as non-variant, but the child is called as heterozygous variant. We observe that DeepTrio is more reluctant to call a variant as de novo, which is similar to how a human would require a higher level of evidence for sites violating Mendelian inheritance. This results in a much lower false positive rate for these de novo variants, but a slightly lower recall rate in DeepTrio Illumina. Usually when this occurs, the child is still called as a variant, but the parents are given “no-call” (the classifier is not confident enough to make a call).

Accuracy on de novo calls (child heterozygous variant, parents reference call) for recall of true de novo events


Accuracy on de novo calls (child heterozygous variant, parents reference call) for recall of false positive de novo events

Figure 3. Accuracy on de novo calls (child heterozygous variant, parents reference call) for recall of true de novo events (top) and false positive de novo events (bottom) for DeepTrio (solid line) and DeepVariant (dashed line) on Illumina (blue) and PacBio (black). Accuracy is measured on chromosome 20, across samples with a range of sequencing coverage (x-axis).

Contributing to rare disease research

By releasing DeepTrio as open source software, we hope to improve analysis of genomic data, by allowing scientists to more accurately analyze samples. We hope this will enable research and clinical pipelines, leading to better resolution of rare disease cases, and improve development of therapeutics.

In addition to the release of DeepTrio’s code as open source, we have also released the sequencing data that we generated in order to train these models. That data is described in our pre-print “An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development”. By releasing both this production model, and the data required to train models of similar complexity, we hope to contribute to methods development by the genomics community.

By Andrew Carroll, Product Lead Genomics and Howard Yang, Program Manager Genomics — Google Health