Tag Archives: Google Genomics

Building better pangenomes to improve the equity of genomics

Posted by Andrew Carroll, Product Lead, and Kishwar Shafin, Research Scientist, Genomics

For decades, researchers worked together to assemble a complete copy of the molecular instructions for a human: a map of the human genome. The first draft was finished in 2000, but with several missing pieces. Even when a complete reference genome was achieved in 2022, their work was not finished. A single reference genome can’t incorporate known genetic variations, such as the variants for the gene determining whether a person has blood type A, B, AB or O. Furthermore, the reference genome didn’t represent the vast diversity of human ancestries, making it less useful for detecting disease or finding cures for people from some backgrounds than others. For the past three years, we have been part of an international collaboration with 119 scientists across 60 institutions, called the Human Pangenome Reference Consortium, to address these challenges by creating a new and more representative map of the human genome, a pangenome.

We are excited to share that today, in “A draft human pangenome reference”, published in Nature, this group is announcing the completion of the first human pangenome reference. The pangenome combines 47 individual genome reference sequences and better represents the genomic diversity of global populations. Building on Google’s deep learning technologies and past advances in genomics, we used tools based on convolutional neural networks (CNNs) and transformers to tackle the challenges of building accurate pangenome sequences and using them for genome analysis. These contributions helped the consortium build an information-rich resource for geneticists, researchers and clinicians around the world.

Using graphs to build pangenomes

In the typical analysis workflow for high-throughput DNA sequencing, a sequencing instrument reads millions of short pieces of an individual’s genome, and a program called a mapper or aligner then estimates where those pieces best fit relative to the single, linear human reference sequence. Next, variant caller software identifies the unique parts of the individual’s sequence relative to the reference.
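The two steps in this workflow can be illustrated with a toy Python sketch. It is only an illustration under strong simplifying assumptions: the reference and read are invented, the "mapper" tries every ungapped placement and keeps the one with the fewest mismatches, and the "variant caller" simply reports mismatching bases. Production aligners and variant callers use indexed, gapped alignment and statistical models instead.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference):
    """Toy mapper: return the reference offset with the fewest mismatches."""
    offsets = range(len(reference) - len(read) + 1)
    return min(
        offsets,
        key=lambda off: hamming(read, reference[off:off + len(read)]),
    )


def call_mismatches(read, reference, offset):
    """Toy variant caller: list (position, reference base, read base)."""
    return [
        (offset + i, reference[offset + i], base)
        for i, base in enumerate(read)
        if base != reference[offset + i]
    ]


# Invented example data: the read differs from the reference at one base.
reference = "ACGTTAGCCATGGA"
read = "TAGCTATG"

offset = map_read(read, reference)
variants = call_mismatches(read, reference, offset)
print(offset, variants)
```

A read whose sequence is absent from the reference has no good placement in this scheme, which is the limitation the next paragraph describes.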

But because humans carry a diverse set of sequences, sections that are present in an individual’s DNA but are not in the reference genome can’t be analyzed. One study of 910 African individuals found that a total of 300 million DNA base pairs — 10% of the roughly three billion base pair reference genome — are not present in the previous linear reference but occur in at least one of the 910 individuals.

To address this issue, the consortium used graph data structures, which suit genomics because they can represent the sequences of many people simultaneously, as a pangenome requires. Nodes in a graph genome contain the known set of sequences in a population, and paths through those nodes compactly describe the unique sequences of an individual’s DNA.
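As a rough illustration of nodes and paths, here is a minimal Python sketch. The node IDs, sequences, and sample names are invented for the example, and the plain dictionaries are an assumption made for readability, not the consortium's graph format.

```python
# Toy graph genome: each node stores a stretch of sequence, and each
# individual is a path (an ordered list of node IDs) through those nodes.
# All IDs, sequences, and names below are invented for illustration.
nodes = {
    1: "ACGT",  # shared by everyone
    2: "A",     # reference allele at a single nucleotide variant
    3: "G",     # alternate allele at the same position
    4: "TTC",   # shared by everyone
    5: "GGA",   # insertion carried by only some individuals
    6: "CA",    # shared by everyone
}

paths = {
    "reference": [1, 2, 4, 6],
    "sample_A": [1, 3, 4, 6],     # carries the SNV
    "sample_B": [1, 2, 4, 5, 6],  # carries the insertion
}


def spell(path):
    """Reconstruct a linear sequence by concatenating the nodes on a path."""
    return "".join(nodes[node_id] for node_id in path)


def shared_by_all(node_id):
    """True if every path in the graph passes through this node."""
    return all(node_id in path for path in paths.values())


for name, path in paths.items():
    print(name, spell(path))
```

Shared sequence is stored once, and each individual adds only a short list of node IDs, which is why paths describe individuals compactly.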

Schematic of a graph genome. Each color represents the sequence path of a different individual. Multiple paths passing through the same node indicate multiple individuals share that sequence, but some paths also show a single nucleotide variant (SNV), insertions, or deletions. Illustration credit Darryl Leja, National Human Genome Research Institute (NHGRI).

Actual graph genome for the major histocompatibility complex (MHC) region of the genome. Genes in MHC regions are essential to immune function and are associated with a person’s resistance and susceptibility to infectious disease and autoimmune disorders (e.g., ankylosing spondylitis and lupus). The graph shows the linear human genome reference (green) and the sequences of different individuals (gray).

Using graphs creates numerous challenges. They require highly accurate reference sequences, as well as new methods that can take the graph data structure as input. However, new sequencing technologies (such as consensus sequencing and phased assembly methods) have driven exciting progress towards solving these problems.

Long-read sequencing technology, which reads larger pieces of the genome (10,000 to millions of DNA characters long) at a time, is essential to the creation of high-quality reference sequences because larger pieces can be stitched together into assembled genomes more easily than the short pieces read out by earlier technologies. Short-read sequencing reads pieces of the genome that are only 100 to 300 DNA characters long, but has been the highly scalable basis for high-throughput sequencing methods developed in the 2000s. Though long-read sequencing is newer and has advantages for reference genome creation, many informatics methods for short reads hadn’t been developed for long-read technologies.

Evolving DeepVariant for error correction

Google initially developed DeepVariant, an open-source CNN variant caller framework, to analyze the short-read sequencing evidence of local regions of the genome. We were later able to re-train DeepVariant to yield accurate analysis of Pacific Biosciences’ long-read data.

Training and evaluation schematic for DeepVariant.

We next teamed up with researchers at the University of California, Santa Cruz (UCSC) Genomics Institute to participate in a United States Food and Drug Administration competition for another long-read sequencing technology from Oxford Nanopore. Together, we won the award for highest accuracy in the nanopore category, with single nucleotide variant (SNV) accuracy that matched short-read sequencing. This work has been used to detect and treat genetic diseases in critically ill newborns. The use of DeepVariant on long-read technologies provided the foundation for the consortium’s use of DeepVariant for error correction of pangenomes.

DeepVariant’s ability to use multiple long-read sequencing modalities proved useful for error correction in the Telomere-to-Telomere (T2T) Consortium’s effort that generated the first complete assembly of a human genome. Completing this first genome set the stage to build the multiple reference genomes required for pangenomes, and T2T was already working closely with the Human Pangenome Project (with many shared members) to scale those practices.

With a set of high-quality human reference genomes on the horizon, developing methods that could use those assemblies grew in importance. We worked to adapt DeepVariant to use the pangenome developed by the consortium. In partnership with UCSC, we built an end-to-end analysis workflow for graph-based variant detection, and demonstrated improved accuracy across several thousand samples. The use of the pangenome allows many previously missed variants to be correctly identified.

Visualization of variant calls in the KCNE1 gene (a gene with variants associated with cardiac arrhythmias and sudden death) using a pangenome reference versus the prior linear reference. Each dot represents a variant call that is correct (blue dot), incorrect (green dot; a variant is identified but is not really there), or missed (red dot). The top box shows variant calls made by DeepVariant using the pangenome reference, while the bottom shows variant calls made using the linear reference. Figure adapted from A Draft Human Pangenome Reference.

Improving pangenome sequences using transformers

Just as new sequencing technologies enabled new pangenome approaches, new informatics technologies enabled improvements for sequencing methods. Google adapted transformer architectures from analysis of human language to genome sequences to develop DeepConsensus. A key enabler for this was the development of a differentiable loss function that could handle the insertions and deletions common in sequencing data. This enabled us to have high accuracy without needing a decoder, allowing the speed required to keep up with terabytes of sequencer output.

Transformer architecture for DeepConsensus. DeepConsensus takes as input the repeated sequence of the DNA molecule, measured from fluorescent light detected by the addition of each base. DeepConsensus also uses as input more detailed information about the sequencing process, including the duration of the light pulse (referred to here as pulse width or PW), the time between pulses (IP), the signal-to-noise ratio (SN), and which side of the double helix is being measured (strand).

Effect of alignment loss function in training evaluation of model output. Better accounting of insertions and deletions by a differentiable alignment function enables the model training process to better estimate errors.
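To make the idea of a differentiable alignment loss more concrete, here is a small Python sketch of a "soft" edit distance. It is not the DeepConsensus loss itself. It is an assumed, simplified stand-in in which the hard `min` of the classic edit-distance recurrence is replaced by a smooth soft-minimum, so the result varies smoothly with the predicted base probabilities. The names `softmin`, `soft_alignment_loss`, `gamma`, and `gap_cost` are hypothetical.

```python
import math


def softmin(values, gamma):
    """Smooth stand-in for min(); approaches min(values) as gamma -> 0."""
    lowest = min(values)  # subtract the minimum for numerical stability
    total = sum(math.exp(-(v - lowest) / gamma) for v in values)
    return lowest - gamma * math.log(total)


def soft_alignment_loss(pred_probs, truth, gamma=0.1, gap_cost=1.0):
    """Soft edit distance between predicted base probabilities and truth.

    pred_probs: list of dicts mapping base -> probability, one per position.
    truth: the true sequence as a string.
    """
    n, m = len(pred_probs), len(truth)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap_cost
    for j in range(1, m + 1):
        table[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Cost of aligning prediction i to true base j: low when the
            # model puts high probability on the true base.
            mismatch = 1.0 - pred_probs[i - 1].get(truth[j - 1], 0.0)
            table[i][j] = softmin(
                [
                    table[i - 1][j - 1] + mismatch,  # match or substitution
                    table[i - 1][j] + gap_cost,      # extra predicted base
                    table[i][j - 1] + gap_cost,      # missing predicted base
                ],
                gamma,
            )
    return table[n][m]


def one_hot(sequence):
    """Fully confident predictions for a given sequence."""
    return [{base: 1.0} for base in sequence]


perfect = soft_alignment_loss(one_hot("ACGT"), "ACGT", gamma=0.01)
one_extra_base = soft_alignment_loss(one_hot("ACGT"), "ACT", gamma=0.01)
print(perfect, one_extra_base)
```

With a small `gamma` the value is close to the ordinary edit distance, so a prediction with one extra base is penalized by roughly one gap rather than being scored as wrong at every later position.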

DeepConsensus improves the yield and accuracy of instrument data. Because PacBio sequencing provides the primary sequence information for the 47 genome assemblies, we could apply DeepConsensus to improve those assemblies. With application of DeepConsensus, consortium members built a genome assembler that was able to reach a base-level assembly accuracy of 99.9997%.

Conclusion

We developed multiple new approaches to improve genetic sequencing methods, which we then used to construct pangenome references that enable more robust genome analysis.

But this is just the beginning of the story. In the next stage, a larger, worldwide group of scientists and clinicians will use this pangenome reference to study genetic diseases and make new drugs. And future pangenomes will represent even more individuals, realizing a vision summarized this way in a recent Nature story: “Every base, everywhere, all at once.” Read our post on the Keyword Blog to learn more about the human pangenome reference announcement.

Acknowledgements

Many people were involved in creating the pangenome reference, including 119 authors across 60 organizations, with the Human Pangenome Reference Consortium. This blog post highlights Google’s contributions to the broader work. We thank the research groups at UCSC Genomics Institute (GI) under Professors Benedict Paten and Karen Miga, genome polishing efforts of Arang Rhie at National Institutes of Health (NIH), Genome Assembly and Polishing of Adam Phillippy’s group, and the standards group at National Institute of Standards and Technology (NIST) of Justin Zook. We thank Google contributors: Pi-Chuan Chang, Maria Nattestad, Daniel Cook, Alexey Kolesnikov, Anastasiya Belyaeva, and Gunjan Baid. We thank Lizzie Dorfman, Elise Kleeman, Erika Hayden, Cory McLean, Shravya Shetty, Greg Corrado, Katherine Chou, and Yossi Matias for their support, coordination, and leadership. Last but not least, thanks to the research participants that provided their DNA to help build the pangenome resource.

Source: Google AI Blog

An ML-based approach to better characterize lung diseases

Posted by Babak Behsaz, Software Engineer, and Andrew Carroll, Product Lead, Genomics

The combination of the environment an individual experiences and their genetic predispositions determines the majority of their risk for various diseases. Large national efforts, such as the UK Biobank, have created large, public resources to better understand the links between environment, genetics, and disease. This has the potential to help individuals better understand how to stay healthy, clinicians to treat illnesses, and scientists to develop new medicines.

One challenge in this process is how we make sense of the vast amount of clinical measurements — the UK Biobank has many petabytes of imaging, metabolic tests, and medical records spanning 500,000 individuals. To best use this data, we need to be able to represent the information present as succinct, informative labels about meaningful diseases and traits, a process called phenotyping. That is where we can use the ability of ML models to pick up on subtle intricate patterns in large amounts of data.

We’ve previously demonstrated the ability to use ML models to quickly phenotype at scale for retinal diseases. Nonetheless, these models were trained using labels from clinician judgment, and access to clinical-grade labels is a limiting factor due to the time and expense needed to create them.

In “Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models”, published in Nature Genetics, we’re excited to highlight a method for training accurate ML models for genetic discovery of diseases, even when using noisy and unreliable labels. We demonstrate the ability to train ML models that can phenotype directly from raw clinical measurement and unreliable medical record information. This reduced reliance on medical domain experts for labeling greatly expands the range of applications for our technique to a panoply of diseases and has the potential to improve their prevention, diagnosis, and treatment. We showcase this method with ML models that can better characterize lung function and chronic obstructive pulmonary disease (COPD). Additionally, we show the usefulness of these models by demonstrating a better ability to identify genetic variants associated with COPD, improved understanding of the biology behind the disease, and successful prediction of outcomes associated with COPD.

ML for deeper understanding of exhalation

For this demonstration, we focused on COPD, the third leading cause of worldwide death in 2019, in which airway inflammation and impeded airflow can progressively reduce lung function. Lung function for COPD and other diseases is measured by recording an individual’s exhalation volume over time (the record is called a spirogram; see an example below). Although there are guidelines (called GOLD) for determining COPD status from exhalation, these use only a few, specific data points in the curve and apply fixed thresholds to those values. Much of the rich data from these spirograms is discarded in this analysis of lung function.

We reasoned that ML models trained to classify spirograms would be able to use the rich data present more completely and result in more accurate and comprehensive measures of lung function and disease, similar to what we have seen in other classification tasks like mammography or histology. We trained ML models to predict whether an individual has COPD using the full spirograms as inputs.

Spirometry and COPD status overview. Spirograms from a lung function test showing a forced expiratory volume-time spirogram (left), a forced expiratory flow-time spirogram (middle), and an interpolated forced expiratory flow-volume spirogram (right). The profile of individuals without COPD differs from that of individuals with COPD.

The common method of training models for this problem, supervised learning, requires samples to be associated with labels. Determining those labels can require the effort of very time-constrained experts. For this work, to show that we do not necessarily need medically graded labels, we decided to use a variety of widely available sources of medical record information to create those labels without medical expert review. These labels are less reliable and noisy for two reasons. First, there are gaps in the medical records of individuals because they use multiple health services. Second, COPD is often undiagnosed, meaning many with the disease will not be labeled as having it even if we compile the complete medical records. Nonetheless, we trained a model to predict these noisy labels from the spirogram curves and treat the model predictions as a quantitative COPD liability or risk score.

Noisy COPD status labels were derived using various medical record sources (clinical data). A COPD liability model is then trained to predict COPD status from raw flow-volume spirograms.

Predicting COPD outcomes

We then investigated whether the risk scores produced by our model could better predict a variety of binary COPD outcomes (for example, an individual’s COPD status, whether they were hospitalized for COPD or died from it). For comparison, we benchmarked the model relative to expert-defined measurements required to diagnose COPD, specifically FEV1/FVC, which compares specific points on the spirogram curve with a simple mathematical ratio. We observed an improvement in the ability to predict these outcomes as seen in the precision-recall curves below.

Precision-recall curves for COPD status and outcomes for our ML model (green) compared to traditional measures. Confidence intervals are shown by lighter shading.
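For readers unfamiliar with the FEV1/FVC baseline, the following Python sketch shows how such a ratio could be read off a volume-time spirogram. The curve values are invented. FEV1 is taken as the volume exhaled by one second and FVC as the maximum volume reached. The 0.70 cut-off is the commonly cited fixed threshold and is an assumption here, since the post does not state the value.

```python
def volume_at(times, volumes, t):
    """Linearly interpolate the exhaled volume (litres) at time t (seconds)."""
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 <= t <= t1:
            fraction = (t - t0) / (t1 - t0)
            return volumes[i] + fraction * (volumes[i + 1] - volumes[i])
    raise ValueError("t is outside the recorded time range")


def fev1_fvc_ratio(times, volumes):
    """Ratio of volume exhaled in the first second to total volume exhaled."""
    fev1 = volume_at(times, volumes, 1.0)
    fvc = max(volumes)
    return fev1 / fvc


# Invented volume-time spirogram: seconds and cumulative litres exhaled.
times = [0.0, 0.5, 1.0, 2.0, 4.0, 6.0]
volumes = [0.0, 2.0, 3.0, 3.6, 3.9, 4.0]

ASSUMED_THRESHOLD = 0.70  # assumption: commonly cited fixed cut-off

ratio = fev1_fvc_ratio(times, volumes)
below_threshold = ratio < ASSUMED_THRESHOLD
print(ratio, below_threshold)
```

The ratio uses only two numbers from the whole curve, which is the information loss that motivates feeding the full spirogram to an ML model.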

We also observed that separating populations by their COPD model score was predictive of all-cause mortality. This plot suggests that individuals with higher COPD risk are more likely to die earlier from any cause, and that the risk probably has implications beyond just COPD.

Survival analysis of a cohort of UK Biobank individuals stratified by their COPD model’s predicted risk quartile. The decrease of the curve indicates individuals in the cohort dying over time. For example, p100 represents the 25% of the cohort with greatest predicted risk, while p50 represents the 2nd quartile.

Identifying the genetic links with COPD

Since the goal of large scale biobanks is to bring together large amounts of both phenotype and genetic data, we also performed a test called a genome-wide association study (GWAS) to identify the genetic links with COPD and genetic predisposition. A GWAS measures the strength of the statistical association between a given genetic variant — a change in a specific position of DNA — and the observations (e.g., COPD) across a cohort of cases and controls. Genetic associations discovered in this manner can inform drug development that modifies the activity or products of a gene, as well as expand our understanding of the biology for a disease.
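The association test at the heart of a GWAS can be illustrated for a single variant with a simple allelic chi-square test, sketched below in Python. The allele counts are invented, and the test is a deliberately simplified stand-in: real studies, presumably including this one, typically use regression models that adjust for covariates such as age, sex, and ancestry.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: the alternate allele is more common in cases.
chi2, p_value = allelic_chi_square(
    case_alt=300, case_ref=700, control_alt=200, control_ref=800
)
print(chi2, p_value)
```

A GWAS repeats a test of this kind across millions of variants, which is why very small p-values are required before an association is reported.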

We showed with our ML-phenotyping method that not only do we rediscover almost all known COPD variants found by manual phenotyping, but we also find many novel genetic variants significantly associated with COPD. In addition, we see good agreement on the effect sizes for the variants discovered by both our ML approach and the manual one (R²=0.93), which provides strong evidence for validity of the newly found variants.

Left: A plot comparing the statistical power of genetic discovery using the labels for our ML model (y-axis) with the statistical power of the manual labels from a traditional study (x-axis). A value above the y = x line indicates greater statistical power in our method. Green points indicate significant findings in our method that are not found using the traditional approach. Orange points are significant in the traditional approach but not ours. Blue points are significant in both. Right: Estimates of the association effect between our method (y-axis) and traditional method (x-axis). Note that the relative values between studies are comparable but the absolute numbers are not.

Finally, our collaborators at Harvard Medical School and Brigham and Women’s Hospital further examined the plausibility of these findings by providing insights into the possible biological role of the novel variants in development and progression of COPD (you can see more discussion on these insights in the paper).

Conclusion

We demonstrated that our earlier methods for phenotyping with ML can be expanded to a wide range of diseases and can provide novel and valuable insights. We made two key observations by using this to predict COPD from spirograms and discovering new genetic insights. First, domain knowledge was not necessary to make predictions from raw medical data. Interestingly, we showed the raw medical data is probably underutilized and the ML model can find patterns in it that are not captured by expert-defined measurements. Second, we do not need medically graded labels; instead, noisy labels defined from widely available medical records can be used to generate clinically predictive and genetically informative risk scores. We hope that this work will broadly expand the ability of the field to use noisy labels and will improve our collective understanding of lung function and disease.

Acknowledgments

This work is the combined output of multiple contributors and institutions. We thank all contributors: Justin Cosentino, Babak Alipanahi, Zachary R. McCaw, Cory Y. McLean, Farhad Hormozdiari (Google), Davin Hill (Northeastern University), Tae-Hwi Schwantes-An and Dongbing Lai (Indiana University), Brian D. Hobbs and Michael H. Cho (Brigham and Women’s Hospital, and Harvard Medical School). We also thank Ted Yun and Nick Furlotte for reviewing the manuscript, Greg Corrado and Shravya Shetty for support, and Howard Yang, Kavita Kulkarni, and Tammi Huynh for helping with publication logistics.

Source: Google AI Blog

Improving the Accuracy of Genomic Analysis with DeepVariant 1.0

Posted by Andrew Carroll, Product Lead, and Pi-Chuan Chang, Technical Lead, Google Health

Sequencing genomes involves sampling short pieces of the DNA from the ~6 billion pairs of nucleobases — i.e., adenine (A), thymine (T), guanine (G), and cytosine (C) — we inherit from our parents. Genome sequencing is enabled by two key technologies: DNA sequencers (hardware) that "read" relatively small fragments of DNA, and variant callers (software) that combine the reads to identify where and how an individual's genome differs from a reference genome, like the one assembled in the Human Genome Project. Such variants may be indicators of genetic disorders, such as an elevated risk for breast cancer, pulmonary arterial hypertension, or neurodevelopmental disorders.

In 2017, we released DeepVariant, an open-source tool which identifies genome variants in sequencing data using a convolutional neural network (CNN). The sequencing process begins with a physical sample being sequenced by any of a handful of instruments, depending on the end goal of the sequencing. The raw data, which consist of numerous reads of overlapping fragments of the genome, are then mapped to a reference genome. DeepVariant analyzes these mappings to identify variant locations and distinguish them from sequencing errors.

Soon after it was first published in 2018, DeepVariant underwent a number of updates and improvements, including significant changes to improve accuracy for whole exome sequencing and polymerase chain reaction (PCR) sequencing.

We are now releasing DeepVariant v1.0, which incorporates a large number of improvements for all sequencing types. DeepVariant v1.0 is an improved version of our submission to the PrecisionFDA v2 Truth Challenge, which achieved Best Overall accuracy for 3 of 4 instrument categories. Compared to previous state-of-the-art models, DeepVariant v1.0 significantly reduces the errors for widely-used sequencing data types, including Illumina and Pacific Biosciences. In addition, through a collaboration with the UCSC Genomics Institute, we have also released a model that combines DeepVariant with the UCSC’s PEPPER method, called PEPPER-DeepVariant, which extends coverage to Oxford Nanopore data for the first time.

Sequencing Technologies and DeepVariant
For the last decade, the majority of sequence data were generated using Illumina instruments, which produce short (75-250 bases) and accurate sequences. In recent years, new technologies have become available that can sequence much longer pieces, including Pacific Biosciences, which can produce long and accurate sequences up to ~15,000 bases in length, and Oxford Nanopore, which can produce reads up to 1 million bases long, but with higher error rates. The particular type of sequencing data a researcher might use depends on the ultimate use-case.

Because DeepVariant is a deep learning method, we can quickly re-train it for these new instrument types, ensuring highly accurate sequence identification. Accuracy is important because a missed variant call could mean missing the causal variant for a disorder, while a false positive variant call could lead to identifying an incorrect one. Earlier state-of-the-art methods could reach ~99.1% accuracy (~73,000 errors) on a 35-fold coverage Illumina whole genome, whereas an early version of DeepVariant (v0.10) had ~99.4% accuracy (46,000 errors), corresponding to a 38% error reduction. DeepVariant v1.0 reduces Illumina errors by another ~22% and PacBio errors by another ~52% relative to the last DeepVariant release (v0.10).
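The error-reduction arithmetic can be checked with a few lines of Python. The error counts are the rounded figures quoted above. The v1.0 Illumina count at the end is only the value implied by applying the stated ~22% reduction to the v0.10 count, not a reported result.

```python
def relative_error_reduction(errors_before, errors_after):
    """Fraction of errors removed when going from one method to another."""
    return (errors_before - errors_after) / errors_before


baseline_errors = 73_000  # earlier state-of-the-art, rounded figure from the text
v010_errors = 46_000      # DeepVariant v0.10, rounded figure from the text

reduction = relative_error_reduction(baseline_errors, v010_errors)
print(f"{reduction:.1%}")

# Implied (not reported) Illumina error count after a further ~22% reduction.
implied_v1_illumina_errors = v010_errors * (1 - 0.22)
print(round(implied_v1_illumina_errors))
```

With these rounded counts the reduction comes out at roughly 37%, close to the quoted 38%; the small gap presumably reflects rounding in the quoted error counts.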

DeepVariant Overview
DeepVariant is a convolutional neural network (CNN) that treats the task of identifying genetic variants as an image classification problem. DeepVariant constructs tensors, essentially multi-channel images, where each channel represents an aspect of the sequence, such as the bases in the sequence (called read base), the quality of alignment between different reads (mapping quality), whether a given read supports an alternate allele (read supports variant), etc. It then analyzes these data and outputs three genotype likelihoods, corresponding to how many copies (0, 1, or 2) of a given alternate allele are present.

Example of DeepVariant data. Each row of pixels in each panel corresponds to a single read, i.e., a short genetic sequence. The top, middle, and bottom rows of panels present examples with a different number of variant alleles. Only two of the six data channels are shown: Read base (the pixel value is mapped to each of the four bases, A, C, G, or T) and Read supports variant (white means that the read is consistent with a given allele and grey means it is not). Top: Classified by DeepVariant as a “2”, which means that both chromosomes match the variant allele. Middle: Classified as a “1”, meaning that one chromosome matches the variant allele. Bottom: Classified as a “0”, implying that the variant allele is missing from both chromosomes.
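The encoding described in this section can be sketched in a few lines of Python. This is a toy with only two channels and an arbitrary base-to-value mapping, both assumptions made for illustration. DeepVariant's actual channel set and pixel encodings are not reproduced here, and the likelihoods below are made up rather than produced by a network.

```python
# Arbitrary (assumed) mapping from base to pixel intensity.
BASE_VALUE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}


def encode_pileup(reads, variant_column, alt_base):
    """Build a [read][column][channel] nested list with two channels.

    Channel 0: read base. Channel 1: whether the read supports the variant.
    """
    image = []
    for read in reads:
        supports = 1.0 if read[variant_column] == alt_base else 0.0
        image.append([[BASE_VALUE[base], supports] for base in read])
    return image


def genotype_from_likelihoods(likelihoods):
    """Pick the most likely number of alternate-allele copies (0, 1, or 2)."""
    return max(range(3), key=lambda copies: likelihoods[copies])


# Invented reads covering the same five reference positions.
reads = ["ACGTA", "ACTTA", "ACTTA", "ACGTA"]
image = encode_pileup(reads, variant_column=2, alt_base="T")

# Made-up likelihoods standing in for the network's three outputs.
genotype = genotype_from_likelihoods([0.02, 0.95, 0.03])
print(len(image), len(image[0]), len(image[0][0]), genotype)
```

Here half of the reads support the alternate allele, the pattern a classifier would be expected to label as one copy.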

Technical Improvements in DeepVariant v1.0
Because DeepVariant uses the same codebase for each data type, improvements apply to each of Illumina, PacBio, and Oxford Nanopore. Below, we show the numbers for Illumina and PacBio for two types of small variants: SNPs (single nucleotide polymorphisms, which change a single base without changing sequence length) and INDELs (insertions and deletions).

Training on an extended truth set
The Genome in a Bottle consortium from the National Institute of Standards and Technology (NIST) creates gold-standard samples with known variants covering the regions of the genome. These are used as labels to train DeepVariant. Using long-read technologies, the Genome in a Bottle consortium expanded the set of confident variants, increasing the regions described by the standard set from 85% of the genome to 92% of it. These more difficult regions were already used in training the PacBio models, and including them in the Illumina models reduced errors by 11%. By relaxing the filter for reads of lower mapping quality, we further reduced errors by 4% for Illumina and 13% for PacBio.

Haplotype sorting of long reads
We inherit one copy of DNA from our mother and another from our father. PacBio and Oxford Nanopore sequences are long enough to separate sequences by parental origin, which is called a haplotype. By providing this information to the neural network, DeepVariant improves its identification of random sequence errors and can better determine whether a variant has a copy from one or both parents.
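A toy Python sketch of why haplotype information helps is shown below. It assumes reads that all start at the same reference position and a single known heterozygous site that tags each read's haplotype. The reads, haplotype names, and helper functions are invented, and this is not how DeepVariant itself sorts reads.

```python
def sort_reads_by_haplotype(reads, het_position, allele_to_haplotype):
    """Group reads by the allele they carry at a known heterozygous site."""
    groups = {name: [] for name in allele_to_haplotype.values()}
    for read in reads:
        name = allele_to_haplotype.get(read[het_position])
        if name is not None:
            groups[name].append(read)
    return groups


def alt_support(groups, position, alt_base):
    """Fraction of reads in each haplotype that carry alt_base at position."""
    return {
        name: sum(read[position] == alt_base for read in members) / len(members)
        for name, members in groups.items()
        if members
    }


# Invented reads; position 1 is the heterozygous tagging site (C vs T).
reads = [
    "ACGTAC", "ACGTAC", "ACGTAC",
    "ACGAAC",  # same haplotype, with one base that looks like a random error
    "ATGTGC", "ATGTGC", "ATGTGC",
]
groups = sort_reads_by_haplotype(reads, 1, {"C": "hap1", "T": "hap2"})

consistent = alt_support(groups, position=4, alt_base="G")
sporadic = alt_support(groups, position=3, alt_base="A")
print(consistent, sporadic)
```

An allele seen in every read of one haplotype and none of the other looks like a real variant inherited from one parent, while an allele seen in a single read of one haplotype looks more like a sequencing error.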

Re-aligning reads to the alternate (ALT) allele

DeepVariant uses input sequence fragments that have been aligned to a reference genome. The optimal alignment for variants that include insertions or deletions could be different if the aligner knew they were present. To capture this information, we implemented an additional alignment step relative to the candidate variant. The figure below shows an additional second row where the reads are aligned to the candidate variant, which is a large insertion. You can see sequences that abruptly stop in the first row can now be fully aligned, providing additional information.

Example of DeepVariant data with realignment to ALT allele. DeepVariant is presented the information in both rows of data for the same example. Only two of the six data channels are shown: Read base (channel #1) and Read supports variant (channel #5). Top: Shows the reads aligned to the reference (in DeepVariant v0.10 and earlier this is all DeepVariant sees). Bottom: Shows the reads aligned to the candidate variant (in this case, a long insertion of sequence). The red arrow indicates where the inserted sequence begins.
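The effect of aligning reads to the candidate variant can be sketched in Python with a toy insertion. The sequences are invented, and the ungapped, brute-force placement is an assumption made to keep the example short. It is not DeepVariant's realignment code.

```python
def build_alt_haplotype(reference, position, inserted):
    """Apply a candidate insertion immediately before `position`."""
    return reference[:position] + inserted + reference[position:]


def best_ungapped_mismatches(read, target):
    """Fewest mismatches over all ungapped placements of read on target."""
    return min(
        sum(a != b for a, b in zip(read, target[off:off + len(read)]))
        for off in range(len(target) - len(read) + 1)
    )


# Invented reference and candidate insertion.
reference = "GATTACAGGCTTAACC"
alt = build_alt_haplotype(reference, position=8, inserted="TGCATGCA")

# A read that starts in reference sequence and runs into the insertion.
read = "ACAGTGCATG"

mismatches_vs_ref = best_ungapped_mismatches(read, reference)
mismatches_vs_alt = best_ungapped_mismatches(read, alt)
print(mismatches_vs_ref, mismatches_vs_alt)
```

The read fits the ALT haplotype cleanly but not the reference, which is the extra evidence the second row of the image provides.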

Use a small network to post-process outputs
Variants can have multiple alleles, with a different base inherited from each parent. DeepVariant’s classifier only generates a probability for one potential variant at a time. In previous versions, simple hand-written rules converted the probabilities into a composite call, but these rules failed in some edge cases. They also separated the way a final call was made from the backpropagation used to train the network. By adding a small, fully-connected neural network to the post-processing step, we are able to better handle these tricky multi-allelic cases.

Adding data to train the release model
The timeframe for the competition was compressed, so we trained only with data similar to the challenge data (PCR-Free NovaSeq) to speed model training. In our production releases, we seek high accuracy for multiple instruments as well as PCR+ preparations. Training with data from these diverse classes helps the model generalize, so our DeepVariant v1.0 release model outperforms the one submitted.

The charts below show the error reduction achieved by each improvement.

Training a Hybrid model
DeepVariant v1.0 also includes a hybrid model for PacBio and Illumina reads. In this case, the model leverages the strengths of both input types, without needing new logic.

Example of DeepVariant merging data from both PacBio and Illumina. Only two of the six data channels are shown: Read base (channel #1) and Read supports variant (channel #5). The longer PacBio reads (at the upper part of the image) span the region being called entirely, while the shorter Illumina reads span only a portion of the region.

We observed no change in SNP errors, suggesting that PacBio reads are strictly superior for SNP calling. We observed a further 49% reduction in INDEL errors relative to the PacBio model, suggesting that the INDEL error modes of Illumina and PacBio HiFi can be used in a complementary manner.

PEPPER-DeepVariant: A Pipeline for Oxford Nanopore Data Using DeepVariant
Until the PrecisionFDA competition, a DeepVariant model was not available for Oxford Nanopore data, because the higher base error rate created too many candidates for DeepVariant to classify. We partnered with the UC Santa Cruz Genomics Institute, which has extensive expertise with Nanopore data. They had previously trained a deep learning method called PEPPER, which could narrow down the candidates to a more tractable number. The larger neural network of DeepVariant can then accurately characterize the remaining candidates with a reasonable runtime.

The combined PEPPER-DeepVariant pipeline with the Oxford Nanopore model is open-source and available on GitHub. This pipeline was able to achieve a superior SNP calling accuracy to DeepVariant Illumina on the PrecisionFDA challenge, which is the first time anyone has shown Nanopore outperforming Illumina in this way.

Conclusion
DeepVariant v1.0 isn’t the end of development. We look forward to working with the genomics community to further maximize the value of genomic data to patients and researchers.

Source: Google AI Blog

DeepVariant Accuracy Improvements for Genetic Datatypes

Posted by Pi-Chuan Chang, Software Engineer and Lizzie Dorfman, Technical Program Manager, Google Brain Team

Last December we released DeepVariant, a deep learning model that has been trained to analyze genetic sequences and accurately identify the differences, known as variants, that make us all unique. Our initial post focused on how DeepVariant approaches “variant calling” as an image classification problem, and is able to achieve greater accuracy than previous methods.

Today we are pleased to announce the launch of DeepVariant v0.6, which includes some major accuracy improvements. In this post we describe how we train DeepVariant, and how we were able to improve DeepVariant's accuracy for two common sequencing scenarios, whole exome sequencing and polymerase chain reaction sequencing, simply by adding representative data into DeepVariant's training process.

Many Types of Sequencing Data
Approaches to genomic sequencing vary depending on the type of DNA sample (e.g., from blood or saliva), how the DNA was processed (e.g., amplification techniques), which technology was used to sequence the data (e.g., instruments can vary even within the same manufacturer) and what section or how much of the genome was sequenced. These differences result in a very large number of sequencing "datatypes".

Typically, variant calling tools have been tuned for one specific datatype and perform relatively poorly on others. Given the extensive time and expertise involved in tuning variant callers for new datatypes, it seemed infeasible to customize each tool for every one. In contrast, with DeepVariant we are able to improve accuracy for new datatypes simply by including representative data in the training process, without negatively impacting overall performance.

Truth Sets for Variant Calling
Deep learning models depend on having high-quality data for training and evaluation. In the field of genomics, the Genome in a Bottle (GIAB) consortium, which is hosted by the National Institute of Standards and Technology (NIST), produces human genomes for use in technology development, evaluation, and optimization. The benefit of working with GIAB benchmarking genomes is that their true sequence is known (at least to the extent currently possible). To achieve this, GIAB takes a single person's DNA and repeatedly sequences it using a wide variety of laboratory methods and sequencing technologies (i.e., many datatypes) and analyzes the resulting data using many different variant calling tools. A tremendous amount of work then follows to evaluate and adjudicate discrepancies to produce a high-confidence "truth set" for each genome.

The majority of DeepVariant’s training data is from the first benchmarking genome released by GIAB, HG001. The sample, from a woman of northern European ancestry, was made available as part of the International HapMap Project, the first large-scale effort to identify common patterns of human genetic variation. Because DNA from HG001 is commercially available and so well characterized, it is often the first sample used to test new sequencing technologies and variant calling tools. By using many replicates and different datatypes of HG001, we can generate millions of training examples, which helps DeepVariant learn to accurately classify many datatypes, and even generalize to datatypes it has never seen before.

Improved Exome Model in v0.5
In the v0.5 release we formalized a benchmarking-compatible training strategy that withholds a complete sample, HG002, from training, along with all data from chromosome 20. HG002, the second benchmarking genome released by GIAB, is from a male of Ashkenazi Jewish ancestry. Testing on this sample, which differs in both sex and ethnicity from HG001, helps to ensure that DeepVariant is performing well for diverse populations. Additionally, reserving chromosome 20 for testing guarantees that we can evaluate DeepVariant's accuracy for any datatype that has truth data available.
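The hold-out rule described above can be written down as a small filter. This is a minimal sketch under stated assumptions: the record layout (`sample` and `chrom` keys) and the chromosome label `chr20` are hypothetical, and it is not DeepVariant's actual training code.

```python
from typing import Dict, List, Tuple

HELD_OUT_SAMPLE = "HG002"      # the complete sample withheld from training
HELD_OUT_CHROMOSOME = "chr20"  # withheld from training for every sample

Example = Dict[str, str]  # hypothetical record with "sample" and "chrom" keys


def split_examples(examples: List[Example]) -> Tuple[List[Example], List[Example]]:
    """Send an example to the test set if it comes from the held-out sample
    or from the held-out chromosome; everything else is available for training."""
    train, test = [], []
    for ex in examples:
        if ex["sample"] == HELD_OUT_SAMPLE or ex["chrom"] == HELD_OUT_CHROMOSOME:
            test.append(ex)
        else:
            train.append(ex)
    return train, test
```

Because chromosome 20 is excluded regardless of sample, even a datatype that is only available for a training sample such as HG001 still has unseen data left over for evaluation.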

In v0.5 we also focused on exome data, which covers the subset of the genome that directly codes for proteins. The exome is only ~1% of the whole human genome, so whole exome sequencing (WES) costs less than whole genome sequencing (WGS). The exome also harbors many variants of clinical significance, which makes it useful for both researchers and clinicians. To increase exome accuracy we added a variety of WES datatypes, provided by DNAnexus, to DeepVariant's training data. The v0.5 WES model shows 43% fewer indel (insertion-deletion) errors and a 22% reduction in single nucleotide polymorphism (SNP) errors.

The total number of exome errors for HG002 across DeepVariant versions, broken down by indel errors (left) and SNP errors (right). Errors are either false positive (FP), colored yellow, or false negative (FN), colored blue. The largest accuracy jump is between v0.4 and v0.5, largely attributable to a reduction in indel FPs.
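For readers unfamiliar with how such error counts are derived, the sketch below compares a call set against a truth set and computes a relative error reduction between two versions. The tuple representation of a variant and the function names are assumptions made for illustration, and any numbers passed in are the caller's own; none of the post's results are reproduced here.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def count_errors(calls: Set[Variant], truth: Set[Variant]) -> Dict[str, int]:
    """False positives are calls absent from the truth set;
    false negatives are truth variants that were not called."""
    fp = len(calls - truth)
    fn = len(truth - calls)
    return {"FP": fp, "FN": fn, "total": fp + fn}


def relative_reduction(old_errors: int, new_errors: int) -> float:
    """Fraction by which the error count dropped between two versions."""
    if old_errors == 0:
        raise ValueError("old_errors must be positive")
    return (old_errors - new_errors) / old_errors
```

A statement such as "43% fewer errors" corresponds to `relative_reduction(old, new)` being about 0.43 for the two versions' total error counts.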

Improved Whole Genome Sequencing Model for PCR+ data in v0.6
Our newest release of DeepVariant, v0.6, focuses on improved accuracy for data that has undergone DNA amplification via polymerase chain reaction (PCR) prior to sequencing. PCR is an easy and inexpensive way to amplify very small quantities of DNA, and sequencing the amplified DNA yields what is known as PCR positive (PCR+) sequencing data. It is well known, however, that PCR can be prone to bias and errors, and non-PCR-based (or PCR-free) DNA preparation methods are increasingly common. DeepVariant's training data prior to the v0.6 release was exclusively PCR-free data, and PCR+ was one of the few datatypes for which DeepVariant had underperformed in external evaluations. By adding PCR+ examples to DeepVariant's training data, also provided by DNAnexus, we have seen significant accuracy improvements for this datatype, including a 60% reduction in indel errors.

DeepVariant v0.6 shows major accuracy improvements for PCR+ data, largely attributable to a reduction in indel errors. Here we re-analyze two PCR+ samples that were used in external evaluations, including DNAnexus on the left (see details in figure 10) and bcbio on the right, showing how indel accuracy improves with each DeepVariant version.

Independent evaluations of DeepVariant v0.6 from both DNAnexus and bcbio are also available. Their analyses support our findings of improved indel accuracy, and also include comparisons to other variant calling tools.

Looking Forward
We released DeepVariant as open source software to encourage collaboration and to accelerate the use of this technology to solve real world problems. As the pace of innovation in sequencing technologies continues to grow, including more clinical applications, we are optimistic that DeepVariant can be further extended to produce consistent and highly accurate results. We hope that researchers will use DeepVariant v0.6 to accelerate discoveries, and if there is a sequencing datatype that you would like to see us prioritize, please let us know.

Source: Google AI Blog

DeepVariant: Highly Accurate Genomes With Deep Neural Networks

Crossposted on the Google Research Blog

Across many scientific disciplines, but in particular in the field of genomics, major breakthroughs have often resulted from new technologies. From Sanger sequencing, which made it possible to sequence the human genome, to the microarray technologies that enabled the first large-scale genome-wide experiments, new instruments and tools have allowed us to look ever more deeply into the genome and apply the results broadly to health, agriculture and ecology.

One of the most transformative new technologies in genomics was high-throughput sequencing (HTS), which first became commercially available in the early 2000s. HTS allowed scientists and clinicians to produce sequencing data quickly, cheaply, and at scale. However, the output of HTS instruments is not the genome sequence for the individual being analyzed — for humans this is 3 billion paired bases (guanine, cytosine, adenine and thymine) organized into 23 pairs of chromosomes. Instead, these instruments generate ~1 billion short sequences, known as reads. Each read represents just 100 of the 3 billion bases, and per-base error rates range from 0.1-10%. Processing the HTS output into a single, accurate and complete genome sequence is a major outstanding challenge. The importance of this problem, for biomedical applications in particular, has motivated efforts such as the Genome in a Bottle Consortium (GIAB), which produces high confidence human reference genomes that can be used for validation and benchmarking, as well as the precisionFDA community challenges, which are designed to foster innovation that will improve the quality and accuracy of HTS-based genomic tests.
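The figures quoted above imply a mean coverage and a per-read error burden that are easy to work out. The short sketch below does the arithmetic using only the numbers in the paragraph; the function names are illustrative.

```python
GENOME_SIZE = 3_000_000_000   # bases in the human genome, as quoted above
NUM_READS = 1_000_000_000     # approximate number of reads per run
READ_LENGTH = 100             # bases per read


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads overlapping any given base."""
    return num_reads * read_length / genome_size


def expected_errors_per_read(read_length: int, per_base_error_rate: float) -> float:
    """Expected number of erroneous bases in a single read."""
    return read_length * per_base_error_rate


if __name__ == "__main__":
    print(f"coverage: {mean_coverage(NUM_READS, READ_LENGTH, GENOME_SIZE):.1f}x")
    for rate in (0.001, 0.10):  # the 0.1% and 10% ends of the quoted range
        print(f"rate {rate:.3f}: {expected_errors_per_read(READ_LENGTH, rate):.1f} errors per read")
```

With these inputs the mean coverage is roughly 33x, and a 100-base read is expected to carry between about 0.1 and 10 erroneous bases, which is why a single mismatching read is weak evidence for a variant.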

CAPTION: For any given location in the genome, there are multiple reads among the ~1 billion that include a base at that position. Each read is aligned to a reference, and then each of the bases in the read is compared to the base of the reference at that location. When a read includes a base that differs from the reference, it may indicate a variant (a difference in the true sequence), or it may be an error.
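The comparison described in this caption can be sketched in a few lines. The sketch assumes a simplified alignment model in which each read is a `(start, sequence)` pair placed on the reference without gaps, using 0-based coordinates; real aligners also report insertions, deletions, and quality scores, all of which are ignored here.

```python
from collections import Counter
from typing import Dict, List, Tuple

Read = Tuple[int, str]  # hypothetical gapless alignment: (0-based start, bases)


def pileup_counts(reads: List[Read], position: int) -> Counter:
    """Count the bases that the overlapping reads show at one reference position."""
    counts: Counter = Counter()
    for start, seq in reads:
        offset = position - start
        if 0 <= offset < len(seq):
            counts[seq[offset]] += 1
    return counts


def mismatches(reference: str, reads: List[Read], position: int) -> Dict[str, int]:
    """Return the non-reference bases seen at a position, with their read counts."""
    ref_base = reference[position]
    return {b: n for b, n in pileup_counts(reads, position).items() if b != ref_base}
```

A non-empty result from `mismatches` only says that some reads disagree with the reference; deciding whether that disagreement is a variant or a sequencing error is the harder problem the post goes on to describe.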

Today, we announce the open source release of DeepVariant, a deep learning technology to reconstruct the true genome sequence from HTS sequencer data with significantly greater accuracy than previous classical methods. This work is the product of more than two years of research by the Google Brain team, in collaboration with Verily Life Sciences. DeepVariant transforms the task of variant calling, as this reconstruction problem is known in genomics, into an image classification problem well-suited to Google's existing technology and expertise.

CAPTION: Each of the four images above is a visualization of actual sequencer reads aligned to a reference genome. A key question is how to use the reads to determine whether there is a variant on both chromosomes, on just one chromosome, or on neither chromosome. There is more than one type of variant, with SNPs and insertions/deletions being the most common. A: a true SNP on one chromosome pair, B: a deletion on one chromosome, C: a deletion on both chromosomes, D: a false variant caused by errors. It's easy to see that these look quite distinct when visualized in this manner.
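To make the three-way question (variant on neither, one, or both chromosomes) concrete, here is a deliberately naive baseline that decides from the fraction of reads supporting the alternate allele. The thresholds `low` and `high` are arbitrary assumptions, and this is not how DeepVariant makes the decision: DeepVariant replaces hand-set rules of this kind with a trained image classifier.

```python
def naive_genotype(ref_count: int, alt_count: int, low: float = 0.2, high: float = 0.8) -> str:
    """Classify a site from read counts alone.

    Returns "0/0" (variant on neither chromosome), "0/1" (on one),
    "1/1" (on both), or "no-call" when no reads cover the site.
    """
    depth = ref_count + alt_count
    if depth == 0:
        return "no-call"
    alt_fraction = alt_count / depth
    if alt_fraction < low:
        return "0/0"
    if alt_fraction > high:
        return "1/1"
    return "0/1"
```

A rule like this cannot distinguish a true low-fraction variant from a cluster of sequencing errors (case D in the caption), which is the kind of pattern a classifier looking at the whole pileup can pick up.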

We started with GIAB reference genomes, for which there is high-quality ground truth (or the closest approximation currently possible). Using multiple replicates of these genomes, we produced tens of millions of training examples in the form of multi-channel tensors encoding the HTS instrument data, and then trained a TensorFlow-based image classification model to identify the true genome sequence from the experimental data produced by the instruments. Although the resulting deep learning model, DeepVariant, had no specialized knowledge about genomics or HTS, within a year it had won the highest SNP accuracy award at the precisionFDA Truth Challenge, outperforming state-of-the-art methods. Since then, we've further reduced the error rate by more than 50%.
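As a rough illustration of what a multi-channel tensor encoding of reads might look like, the sketch below turns gapless read alignments into a small rows × columns × channels array of nested lists. The two channels (a base code and a differs-from-reference flag), the numeric codes, and the function name are hypothetical choices made for this example; DeepVariant's real encoding uses more channels and is not reproduced here. The window is assumed to lie entirely within the reference.

```python
from typing import List, Tuple

Read = Tuple[int, str]  # hypothetical gapless alignment: (0-based start, bases)

# Hypothetical base codes; 0 is reserved for "no read coverage at this column".
BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode_pileup(
    reference: str, reads: List[Read], window_start: int, window_len: int
) -> List[List[List[int]]]:
    """Build a [read][column][channel] tensor for a window of the reference.

    Channel 0 holds the base code; channel 1 is 1 where the read base
    differs from the reference base and 0 otherwise.
    """
    tensor = []
    for start, seq in reads:
        row = [[0, 0] for _ in range(window_len)]
        for i, base in enumerate(seq):
            col = start + i - window_start
            if 0 <= col < window_len:
                row[col] = [BASE_CODE[base], int(base != reference[start + i])]
        tensor.append(row)
    return tensor
```

Each read becomes one row of the "image", so a column in which many rows carry a 1 in the second channel stands out visually, much as in the figure panels described earlier.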

DeepVariant is being released as open source software to encourage collaboration and to accelerate the use of this technology to solve real world problems. To further this goal, we partnered with Google Cloud Platform (GCP) to deploy DeepVariant workflows on GCP, available today, in configurations optimized for low-cost and fast turnarounds using scalable GCP technologies like the Pipelines API. This paired set of releases provides a smooth ramp for users to explore and evaluate the capabilities of DeepVariant in their current compute environment while providing a scalable, cloud-based solution to satisfy the needs of even the largest genomics datasets.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community. This is all part of a broader goal to apply Google technologies to healthcare and other scientific applications, and to make the results of these efforts broadly accessible.

By Mark DePristo and Ryan Poplin, Google Brain Team

Source: Google Open Source Blog

Reproducible Science: Cancer Researchers Embrace Containers in the Cloud

Posted by Dr. Kyle Ellrott, Oregon Health and Sciences University, Dr. Josh Stuart, University of California Santa Cruz, and Dr. Paul Boutros, Ontario Institute for Cancer Research

Today we hear from the principal investigators of the ICGC-TCGA DREAM Somatic Mutation Calling Challenges about how they are encouraging cancer researchers to make use of Docker and Google Cloud Platform to gain a deeper understanding of the complex genetic mutations that occur in cancer, while doing so in a reproducible way.
– Nicole Deflaux and Jonathan Bingham, Google Genomics

Today’s genomic analysis software tools often give different answers when run in different computing environments; that’s like getting a different diagnosis from your doctor depending on which examination room you’re sitting in. Reproducible science matters, especially in cancer research where so many lives are at stake. The Cancer Moonshot has called for the research world to “Break down silos and bring all the cancer fighters together”. Portable software “containers” and cloud computing hold the potential to help achieve these goals by making scientific data analysis more reproducible, reusable and scalable.

Our team of researchers from the Ontario Institute for Cancer Research, University of California Santa Cruz, Sage Bionetworks and Oregon Health and Sciences University is pushing the frontiers by encouraging scientists to package up their software in reusable Docker containers and make use of cloud-resident data from the Cancer Cloud Pilots funded by the National Cancer Institute.

In 2014 we initiated the ICGC-TCGA DREAM Somatic Mutation Calling (SMC) Challenges where Google provided credits on Google Cloud Platform. The first result of this collaboration was the DREAM-SMC DNA challenge, a public challenge that engaged cancer researchers from around the world to find the best methods for discovering DNA somatic mutations. By the end of the challenge, over 400 registered participants competed by submitting 3,500 open-source entries for 14 test genomes, providing key insights on the strengths and limitations of the current mutation detection methods.

The SMC-DNA challenge enabled comparison of results, but it did little to facilitate the exchange of cross-platform software tools. Accessing extremely large genome sequence input files and shepherding complex software pipelines created a “double whammy” that discouraged data sharing and software reuse.

How can we overcome these barriers?

Exciting developments have taken place in the past couple of years that may annihilate these last barriers. The availability of cloud technologies and containerization can serve as the vanguards of reproducibility and interoperability.

Thus, a new way of creating open DREAM challenges has emerged: rather than encouraging the status quo where participants run their own methods themselves on their own systems, and the results cannot be verified, the new challenge design requires participants to submit open-source code packaged in Docker containers so that anyone can run their methods and verify the results. Real-time leaderboards show which entries are winning and top performers have a chance to claim a prize.

Working with Google Genomics and Google Cloud Platform, the DREAM-SMC organizers are now using cloud and containerization technologies to enable portability and reproducibility as a core part of the DREAM challenges. The latest SMC installments, the SMC-Het Challenge and the SMC-RNA Challenge, have implemented this new plan:

SMC-Het Challenge: Tumour biopsies are composed of many different cell types in addition to tumour cells, including normal tissue and infiltrating immune cells. Furthermore, the tumours themselves are made of a mixture of different subpopulations, all related to one another through cell division and mutation. Critically, each subpopulation can have distinct clinical outcomes, with some more resistant to treatment or more likely to metastasize than others. The goal of the SMC-Het Challenge is to identify the best methods for predicting tumour subpopulations and their “family tree” of relatedness from genome sequencing data.
SMC-RNA Challenge: The alteration of RNA production is a fundamental mechanism by which cancer cells rewire cellular circuitry. Genomic rearrangements in cancer cells can produce fused protein products that can bestow Frankenstein-like properties. Both RNA abundances and novel fusions can serve as the basis for clinically-important prognostic biomarkers. The SMC-RNA Challenge will identify the best methods to detect such rogue expressed RNAs in cancer cells.

Ultimately, success will be gauged by the amount of serious participation in these latest competitions. So far, the signs are encouraging. SMC-Het, which focuses on a very new research area, launched in November 2015 and has already enlisted 18 teams contributing over 70 submissions. SMC-RNA just recently launched and will run until early 2017, with several of the world leaders in the field starting to prepare entries. What’s great about the submissions being packaged in containers is that even after the challenges end, the tested methods can be applied and further adapted by anyone around the world.

Thus, the moon shot need not be a lucky solo attempt made by one hero in one moment of inspiration. Instead, the new informatics of clouds and containers will enable us to combine intelligence so we can build a series of bridges from here to there.

To participate in the DREAM challenges, visit the SMC-Het and SMC-RNA Challenge sites.
