
A universal SNP and small-indel variant caller using deep neural networks

Ryan Poplin^{1,2}, Pi-Chuan Chang^{2}, David Alexander^{2}, Scott Schwartz^{2}, Thomas Colthurst^{2}, Alexander Ku^{2}, Dan Newburger^{1}, Jojo Dijamco^{1}, Nam Nguyen^{1}, Pegah T Afshar^{1}, Sam S Gross^{1}, Lizzie Dorfman^{1,2}, Cory Y McLean^{1,2} & Mark A DePristo^{1,2}

Abstract

Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant sites and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.

Calling genetic variants from next-generation sequencing (NGS) data has proven challenging because NGS reads are not only errorful (with error rates from ~0.1-10%) but arise from a complex error process that depends on properties of the instrument, preceding data processing tools, and the genome sequence itself^{1-5}. State-of-the-art variant callers use a variety of statistical techniques to model these error processes to accurately identify differences between the reads and the reference genome caused either by real genetic variants or by errors in the reads^{3-6}. For example, the widely used GATK uses logistic regression to model base errors, hidden Markov models to compute read likelihoods, and naive Bayes classification to identify variants, which are then filtered to remove likely false positives using a Gaussian mixture model with hand-crafted features capturing common error modes^{5}. These techniques allow the GATK to achieve high but still imperfect accuracy on the Illumina sequencing platform^{3,4}. Generalizing these models to other sequencing technologies (for example, Ion Torrent^{7,8}) has proven difficult due to the need to manually retune or extend these statistical models, which is problematic in an area with such rapid technological progress^{1}.
Here we describe a variant caller, called DeepVariant, that replaces the assortment of statistical modeling components with a single deep learning model. Deep learning is a machine learning technique applicable to a variety of domains, including image classification^{9}, translation^{10}, gaming^{11,12} and the life sciences^{13-16}. This toolchain (Fig. 1) begins by finding candidate single nucleotide polymorphisms (SNPs) and indels in reads aligned to the reference genome with high sensitivity but low specificity using standard, algorithmic preprocessing techniques. The deep learning model, using the Inception architecture^{17}, emits probabilities for each of the three diploid genotypes at a locus using a pileup image of the reference and read data around each candidate variant (Fig. 1). The model is trained using labeled true genotypes, after which it is frozen and can then be applied to novel sites or samples. In the following experiments, DeepVariant was trained on an independent set of samples or variants from those being evaluated.
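To make the calling rule concrete, the sketch below applies the decision just described: the CNN's three genotype probabilities are compared and a variant call is emitted only when the most likely genotype is heterozygous or homozygous non-reference. This is an illustrative sketch, not DeepVariant code; the function name and the example probabilities are ours.

```python
import numpy as np

GENOTYPES = ("hom-ref", "het", "hom-alt")

def call_from_probabilities(probs):
    """Emit a variant call if the most likely diploid genotype is non-reference.

    probs is a length-3 array of CNN output probabilities for
    (hom-ref, het, hom-alt) at one candidate site.
    """
    best = int(np.argmax(probs))
    if GENOTYPES[best] == "hom-ref":
        return None  # most likely genotype matches the reference: no call emitted
    return GENOTYPES[best], float(probs[best])

# A site where the CNN strongly favors a heterozygous genotype.
print(call_from_probabilities(np.array([0.02, 0.95, 0.03])))  # ('het', 0.95)
```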
The deep learning model was trained without specialized knowledge about genomics or next-generation sequencing, and yet it can learn to call genetic variants more accurately than state-of-the-art methods. When applied to the Platinum Genomes Project NA12878 data^{18}, DeepVariant produced a callset with better performance than the GATK when evaluated on the held-out chromosomes of the Genome in a Bottle ground-truth set (Supplementary Figs. 1a and 2). For further validation, we sequenced 35 replicates of NA12878 using a standard whole-genome sequencing (WGS) protocol and called variants on 27 replicates using a GATK best-practices pipeline and DeepVariant using a model trained on the other eight replicates (Online Methods). DeepVariant produced more accurate results with greater consistency across a variety of quality metrics (Supplementary Fig. 1b and Supplementary Notes 1, 10 and 11).
Like many variant calling algorithms, the GATK relies on a model that assumes read errors to be independent^{5}. Though this has long been recognized as an invalid assumption^{2}, the true likelihood function that models multiple reads simultaneously is unknown^{5,19,20}. Because DeepVariant presents an image of all of the reads relevant for a putative variant together, the convolutional neural network (CNN) is able to account for the complex dependence among the reads by virtue of being a universal approximator^{21}. This manifests itself as a tight concordance between the estimated probability of error from the likelihood function and the observed error rate (Supplementary Fig. 1c), where DeepVariant’s CNN is well calibrated, more so than the GATK. That the CNN has approximated this true but unknown interdependent likelihood function is the essential technical advance enabling us to replace the hand-crafted statistical models used in other approaches with a single deep learning model, and still achieve such high performance in variant calling.

Received 15 December 2017; accepted 2 August 2018; published online 24 September 2018; doi:10.1038/nbt.4235

Figure 1 DeepVariant workflow overview. Before DeepVariant, NGS reads are first aligned to a reference genome and cleaned up with duplicate marking and, optionally, local assembly. Left box: first, the aligned reads are scanned for sites that may be different from the reference genome. The read and reference data are encoded as an image for each candidate variant site. A trained CNN calculates the genotype likelihoods for each site. A variant call is emitted if the most likely genotype is heterozygous or homozygous non-reference. Middle box: training the CNN reuses the DeepVariant machinery to generate pileup images for a sample with known genotypes. These labeled image + genotype pairs, along with an initial CNN, which can be a random model, a CNN trained for other image classification tests, or a prior DeepVariant model, are used to optimize the CNN parameters to maximize genotype prediction accuracy using a stochastic gradient descent algorithm. After a maximum number of cycles or time has elapsed or the model’s performance has converged, the final trained model is frozen and can then be used for variant calling. Right box: the reference and read bases, quality scores, and other read features are encoded into a red-green-blue (RGB) pileup image at a candidate variant. This encoded image is provided to the CNN to calculate the genotype likelihoods for the three diploid genotype states of homozygous reference (hom-ref), heterozygous (het) or homozygous alternate (hom-alt). In this example a heterozygous variant call is emitted, as the most probable genotype here is “het”. In all panels, blue boxes represent data and red boxes are processes. Details of all processes are given in the Online Methods.
To further benchmark the performance of DeepVariant, we submitted variant calls for a blinded sample, NA24385, to the US Food and Drug Administration (FDA)-sponsored variant calling Truth Challenge in May 2016 and won the “highest performance” award for SNPs as assessed by an independent team using a different evaluation methodology. For this contest DeepVariant was trained only on data available from the CEPH (Centre d’Etude du Polymorphisme Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. In achieving high accuracy as measured via F1, or the harmonic mean of sensitivity and positive predictive value (PPV), on this new sample (SNP F1 = 99.95%, indel F1 = 98.98%), we show that DeepVariant can generalize beyond its training data. We then applied the same dataset and evaluation methodology to a variety of both recent and commonly used bioinformatics methods, including the GATK, FreeBayes^{22}, SAMtools^{23}, 16GT^{24} and Strelka^{25} (Table 1). DeepVariant demonstrated more than 50% fewer errors per genome (4,652 errors) compared to the next-best algorithm (9,531 errors). We also evaluated the same set of methods using the synthetic diploid sample CHM1-CHM13^{26} (Table 2). In our tests DeepVariant outperformed all other methods for calling both SNP and indel mutations, without needing to adjust filtering thresholds or other parameters.
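As a reminder, F1 is the harmonic mean of recall (sensitivity) and precision (PPV); the short check below recomputes the DeepVariant SNP F1 in Table 1 from that table's recall and precision columns (the helper function is ours, included only for illustration).

```python
def f1_score(recall, precision):
    """Harmonic mean of recall (sensitivity) and precision (PPV)."""
    return 2 * recall * precision / (recall + precision)

# DeepVariant (live GitHub), SNPs, recall and precision taken from Table 1.
print(round(f1_score(0.99975, 0.99989), 5))  # 0.99982, matching the F1 column
```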
We further explored how well DeepVariant’s CNN generalizes beyond its training data. First, a model trained with read data aligned to human genome build GRCh37 and applied to reads aligned to GRCh38 had similar performance (overall F1 = 99.45%) to one trained on GRCh38 and then applied to GRCh38 (overall F1 = 99.53%), thereby demonstrating that a model learned from one version of the human genome reference can be applied to other versions with effectively no loss in accuracy (Supplementary Table 1 and Supplementary Note 2). Second, models trained using human reads and ground-truth data achieved high accuracy when applied to a mouse data set^{27} (F1 = 98.29%), outperforming training on the mouse data itself (F1 = 97.84%; Supplementary Table 2 and Supplementary Note 3).
Table 1 Evaluation of several bioinformatics methods on the high-coverage, whole-genome sample NA24385
| Method | Type | F1 | Recall | Precision | TP | FN | FP | FP.gt | FP.al | Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepVariant (live GitHub) | Indel | 0.99507 | 0.99347 | 0.99666 | 357,641 | 2,350 | 1,198 | 217 | 840 | Latest GitHub v0.4.1-b4e8d37d |
| GATK (raw) | Indel | 0.99366 | 0.99219 | 0.99512 | 357,181 | 2,810 | 1,752 | 377 | 995 | 3.8-0-ge9d806836 |
| Strelka | Indel | 0.99227 | 0.98829 | 0.99628 | 355,777 | 4,214 | 1,329 | 221 | 855 | 2.8.4-3-gbe58942 |
| DeepVariant (pFDA) | Indel | 0.99112 | 0.98776 | 0.99450 | 355,586 | 4,405 | 1,968 | 846 | 1,027 | pFDA submission May 2016 |
| GATK (VQSR) | Indel | 0.99010 | 0.98454 | 0.99573 | 354,425 | 5,566 | 1,522 | 343 | 909 | 3.8-0-ge9d806836 |
| GATK (flt) | Indel | 0.98229 | 0.96881 | 0.99615 | 348,764 | 11,227 | 1,349 | 370 | 916 | 3.8-0-ge9d806836 |
| FreeBayes | Indel | 0.94091 | 0.91917 | 0.96372 | 330,891 | 29,100 | 12,569 | 9,149 | 3,347 | v1.1.0-54-g49413aa |
| 16GT | Indel | 0.92732 | 0.91102 | 0.94422 | 327,960 | 32,031 | 19,364 | 10,700 | 7,745 | v1.0-34e8f934 |
| SAMtools | Indel | 0.87951 | 0.83369 | 0.93066 | 300,120 | 59,871 | 22,682 | 2,302 | 20,282 | 1.6 |
| DeepVariant (live GitHub) | SNP | 0.99982 | 0.99975 | 0.99989 | 3,054,552 | 754 | 350 | 157 | 38 | Latest GitHub v0.4.1-b4e8d37d |
| DeepVariant (pFDA) | SNP | 0.99958 | 0.99944 | 0.99973 | 3,053,579 | 1,727 | 837 | 409 | 78 | pFDA submission May 2016 |
| Strelka | SNP | 0.99935 | 0.99893 | 0.99976 | 3,052,050 | 3,256 | 732 | 87 | 136 | 2.8.4-3-gbe58942 |
| GATK (raw) | SNP | 0.99914 | 0.99973 | 0.99854 | 3,054,494 | 812 | 4,469 | 176 | 257 | 3.8-0-ge9d806836 |
| 16GT | SNP | 0.99583 | 0.99850 | 0.99318 | 3,050,725 | 4,581 | 20,947 | 3,476 | 3,899 | v1.0-34e8f934 |
| GATK (VQSR) | SNP | 0.99436 | 0.98940 | 0.99937 | 3,022,917 | 32,389 | 1,920 | 80 | 170 | 3.8-0-ge9d806836 |
| FreeBayes | SNP | 0.99124 | 0.98342 | 0.99919 | 3,004,641 | 50,665 | 2,434 | 351 | 1,232 | v1.1.0-54-g49413aa |
| SAMtools | SNP | 0.99021 | 0.98114 | 0.99945 | 2,997,677 | 57,629 | 1,651 | 1,040 | 200 | 1.6 |
| GATK (flt) | SNP | 0.98958 | 0.97953 | 0.99983 | 2,992,764 | 62,542 | 509 | 168 | 26 | 3.8-0-ge9d806836 |
The dataset used in this evaluation is the same as in the precisionFDA Truth Challenge (pFDA). Several methods are compared, including the DeepVariant callset as submitted to the contest and the most recent DeepVariant version from GitHub. Each method was run according to the individual authors’ best-practice recommendations and represents a good-faith effort to achieve best results. Comparisons to the Genome in a Bottle truth set for this sample were performed using the hap.py software, available on GitHub at http://github.com/Illumina/hap.py, using the same version of the GIAB truth set (v3.2.2) used by pFDA. The overall accuracy (F1, sort order within each variant type), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are shown over the whole genome. False positives are further divided into those caused by genotype mismatches (FP.gt) and those caused by allele mismatches (FP.al). Finally, the version of the software used for each method is provided. We present three GATK callsets: GATK (raw), the unfiltered calls emitted by the HaplotypeCaller; GATK (VQSR), the callset filtered with variant quality score recalibration (VQSR); and GATK (flt), the raw GATK callset filtered with run-flt in CHM-eval. See Supplementary Note 7 for more details.
This last experiment is especially demanding as not only do the species differ but nearly all of the sequencing parameters do as well: 50× 2 × 148 bp from an Illumina TruSeq prep sequenced on a HiSeq 2500 for the human sample and 27× 2 × 100 bp reads from a custom sequencing preparation run on an Illumina Genome Analyzer II for mouse^{27}. Thus, DeepVariant is robust to changes in sequencing depth, preparation protocol, instrument type, genome build and even mammalian species, thereby enabling resequencing projects in nonhuman species, which often have no ground-truth data to guide their efforts^{27,28}, to leverage the large and growing ground-truth data in humans.
To further assess its capabilities, we trained DeepVariant to call variants in eight datasets from Genome in a Bottle^{29} that spanned a variety of sequencing instruments and protocols, including whole-genome and exome sequencing technologies, with read lengths from 50 to many thousands of base pairs (Supplementary Tables 3 and 4 and Supplementary Notes 4 and 5). We used the already processed BAM files to introduce additional variability, as these BAMs differed in their alignment and cleaning steps. The results of this experiment all exhibit a characteristic pattern: the candidate variants have the highest sensitivity but a low PPV (mean of 57.6%), which varies substantially by dataset. After retraining, all of the callsets achieve high PPVs (mean of 99.3%) while largely preserving the candidate callset sensitivity (mean loss of 2.3%). The high PPVs and low loss of sensitivity indicate that DeepVariant can learn a model that captures the technology-specific error processes in sufficient detail to separate real variation from false positives with high fidelity for many different sequencing technologies.
Next we analyzed the behavior of DeepVariant on two non-Illumina WGS datasets, one from ThermoFisher (SOLiD) and one from Pacific Biosciences (PacBio), and on two exome datasets from Illumina (TruSeq) and Ion Torrent (Ion Ampliseq). The SOLiD and PacBio WGS datasets have high error rates in the candidate callsets. SOLiD (13.9% PPV for SNPs, 96.2% for indels and 14.3% overall) has many SNP artifacts from the mapping of short, color-space reads. The PacBio dataset is the opposite, with many false indels (79.8% PPV for SNPs, 1.4% for indels and 22.1% overall) owing to this technology’s high indel error rate. Training DeepVariant to call variants in an exome is likely to be particularly challenging. Exomes have far fewer variants (~20k-30k)^{30} than found in a whole genome (~4-5M)^{31}. The non-uniform coverage and sequencing errors from the exome capture or amplification technology also introduce many false positive variants^{32}. For example, at 8.1%, the PPV of our candidate variants for Ion Ampliseq is the lowest of all our datasets.
Despite the low initial PPVs, the retrained models in DeepVariant separated errors from real variants with high accuracy in the WGS datasets (PPVs of 99.0% and 97.3% for SOLiD and PacBio, respectively), though with a larger loss in sensitivity (candidates 82.5% and final 76.6% for SOLiD and 93.4% and 88.5%, respectively, for PacBio) than other technologies. Furthermore, despite the challenges of retraining deep learning models with limited data, the exome datasets also performed well, with a small reduction in sensitivity (from 91.9% to 89.3% and 94.0% to 92.6% for Ion Ampliseq and TruSeq candidates and final calls, respectively) for a substantial boost in PPV (from 8.1% to 99.7% and 65.3% to 99.3% for Ion and TruSeq, respectively). The performance of DeepVariant compares favorably to those of callsets submitted to the Genome in a Bottle project site using tools developed specifically for each NGS technology and to callsets produced by the GATK or SAMtools (Supplementary Table 5).
The accuracy numbers presented here should not be viewed as the maximum achievable by either the sequencing technology or DeepVariant. For consistency, we used the same model architecture, image representation, training parameters and candidate variant criteria for each technology. Because DeepVariant achieves high PPVs for all technologies, the overall accuracy is effectively driven by the sensitivity of the candidate callset. Improvements to the data processing steps before DeepVariant and the algorithm used to identify candidate variants are likely to translate into further improvements in overall accuracy, particularly for multi-allelic indels.
Table 2 Evaluation of several bioinformatics methods on the high-coverage, whole-genome synthetic diploid sample CHM1-CHM13
| Method | Type | F1 | Recall | Precision | TP | FN | FP | Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepVariant | Indel | 0.95806 | 0.92868 | 0.98936 | 529,137 | 40,634 | 5,690 | v0.4.1-b4e8d37d |
| Strelka | Indel | 0.95074 | 0.91623 | 0.98796 | 522,039 | 47,732 | 6,363 | 2.8.4-3-gbe58942 |
| 16GT | Indel | 0.94010 | 0.90803 | 0.97452 | 517,369 | 52,402 | 13,527 | v1.0-34e8f934 |
| GATK (raw) | Indel | 0.93268 | 0.89504 | 0.97363 | 509,969 | 59,802 | 13,811 | 3.8-0-ge9d806836 |
| GATK (VQSR) | Indel | 0.91212 | 0.84497 | 0.99087 | 481,441 | 88,330 | 4,437 | 3.8-0-ge9d806836 |
| FreeBayes | Indel | 0.90438 | 0.83025 | 0.99305 | 473,053 | 96,718 | 3,313 | v1.1.0-54-g49413aa |
| SAMtools | Indel | 0.86976 | 0.79089 | 0.96611 | 450,626 | 119,145 | 15,807 | 1.6 |
| DeepVariant | SNP | 0.99103 | 0.98888 | 0.99319 | 3,518,118 | 39,553 | 24,132 | v0.4.1-b4e8d37d |
| Strelka | SNP | 0.98865 | 0.98107 | 0.99636 | 3,490,314 | 67,357 | 12,749 | 2.8.4-3-gbe58942 |
| 16GT | SNP | 0.97862 | 0.98966 | 0.96782 | 3,520,894 | 36,777 | 117,078 | v1.0-34e8f934 |
| FreeBayes | SNP | 0.96910 | 0.94837 | 0.99075 | 3,373,984 | 183,687 | 31,492 | v1.1.0-54-g49413aa |
| GATK (VQSR) | SNP | 0.96895 | 0.94542 | 0.99368 | 3,363,476 | 194,195 | 21,379 | 3.8-0-ge9d806836 |
| SAMtools | SNP | 0.96818 | 0.94386 | 0.99378 | 3,357,947 | 199,724 | 21,012 | 1.6 |
| GATK (raw) | SNP | 0.96646 | 0.95685 | 0.97627 | 3,404,167 | 153,504 | 82,748 | 3.8-0-ge9d806836 |
Several methods are compared, including the most recent DeepVariant version from GitHub. Each method was run according to the individual authors’ best-practice recommendations and represents a good faith effort to achieve best results. Comparisons to the CHM1-CHM13 truth set were performed using the CHM-eval.kit software, available on GitHub at https://github.com/lh3/CHM-eval, release version 0.5. The overall accuracy (F1, sort order within each variant type), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are shown over the whole genome. Finally, the version of the software used for each method is provided. Note that we present two GATK callsets: GATK (raw), the unfiltered calls emitted by the HaplotypeCaller; and GATK (VQSR), the callset filtered with the VQSR. See Supplementary Note 7 for more details.

Conversely, despite its effectiveness, representing variant calls as images and applying general image-classification models is certainly suboptimal, as we were unable to effectively encode all of the available information in the reads and reference into the three-channel image.
Taken together, our results demonstrate that the deep learning approach employed by DeepVariant can learn a statistical model describing the relationship between the experimentally observed NGS reads and genetic variants in that data for several sequencing technologies. Technologies like DeepVariant change the problem of calling variants from a process of expert-driven, technology-specific statistical modeling to a more automated process of optimizing a general model against data. With DeepVariant, creating an NGS caller for a new sequencing technology becomes a simpler matter of developing the appropriate preprocessing steps, training a deep learning model on sequencing data from samples with ground-truth data, and applying this model to new, even nonhuman, samples (see Supplementary Note 6).

At its core, DeepVariant generates candidate entities with high sensitivity but low specificity, represents the experimental data about each entity in a machine-learning-compatible format and then applies deep learning to assign meaningful biological labels to these entities. This general framework for inferring biological entities from raw, errorful, indirect experimental data is likely to be applicable to other high-throughput instruments.
The results presented in Figure 1, Supplementary Figures 1 and 2, and Supplementary Tables 1-8 were generated with the original, internal version of DeepVariant. Since then we have rewritten DeepVariant to make it available as open source software. As a result, several improvements to the DeepVariant method have been made that are not captured in the analyses presented here, including switching to TensorFlow^{33} to train the model, using the inception_v3 neural network architecture and using a multichannel tensor representation for the genomics data instead of an RGB image. The results in Tables 1 and 2 used the open source version of DeepVariant; the evaluation scripts are available as Supplementary Software. The latest version of DeepVariant is available on GitHub (https://github.com/google/deepvariant/).
Also note that several other deep-learning-based variant callers have since been described^{34,35}.

METHODS

Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.
Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

ACKNOWLEDGMENTS

We thank J. Zook and his collaborators at NIST for their work developing the Genome in a Bottle resources, the Verily sequencing facility for running the NA12878 replicates, and our colleagues at Verily and Google for their feedback on this manuscript and the project in general. This work was supported by internal funding.

AUTHOR CONTRIBUTIONS

R.P. and M.A.D. designed the study, analyzed and interpreted results and wrote the paper. R.P., P.-C.C., D.A., S.S., T.C., A.K., D.N., J.D., N.N., P.T.A., S.S.G., L.D., C.Y.M. and M.A.D. performed experiments and contributed to the software.

COMPETING INTERESTS

D.N., J.D., N.N., P.T.A. and S.S.G. are employees of Verily Life Sciences. P.-C.C., D.A., S.S., T.C. and A.K. are employees of Google Inc. R.P., L.D., C.Y.M. and M.A.D. are employees of Verily Life Sciences and Google Inc. This work was internally funded by Verily Life Sciences and Google Inc.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html. Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
  1. Goodwin, S., McPherson, J.D. & McCombie, W.R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333-351 (2016).
  2. Nielsen, R., Paul, J.S., Albrechtsen, A. & Song, Y.S. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12, 443-451 (2011).
  3. Li, H. Towards better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843-2851 (2014).
  4. Goldfeder, R.L. et al. Medical implications of technical accuracy in genome sequencing. Genome Med. 8, 24 (2016).
  5. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491-498 (2011).
  6. Ding, J. et al. Feature-based classifiers for somatic mutation detection in tumournormal paired sequencing data. Bioinformatics 28, 167-175 (2012).
  7. Bragg, L.M., Stone, G., Butler, M.K., Hugenholtz, P. & Tyson, G.W. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput. Biol. 9, e1003031 (2013).
  8. Yeo, Z.X., Wong, J.C.L., Rozen, S.G. & Lee, A.S.G. Evaluation and optimisation of indel detection workflows for ion torrent sequencing of the BRCA1 and BRCA2 genes. BMC Genomics 15, 516 (2014).
  9. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process Syst. 25, 1097-1105 (2012).
  10. Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
  11. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484-489 (2016).
  12. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529-533 (2015).
  13. Min, S., Lee, B. & Yoon, S. Deep learning in bioinformatics. Brief. Bioinform. 18, 851-869 (2017).
  14. Alipanahi, B., Delong, A., Weirauch, M.T. & Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831-838 (2015).
  15. Zhou, J. & Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931-934 (2015).
  16. Xiong, H.Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2015).
  17. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. Preprint at https://arxiv.org/abs/1512.00567 (2015).
  18. Eberle, M.A. et al. A reference dataset of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27, 157-164 (2017).
  19. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851-1858 (2008).
  20. Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124-1132 (2009).
  21. Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359-366 (1989).
  22. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/abs/1207.3907 (2012).
  23. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).
  24. Luo, R., Schatz, M.C. & Salzberg, S.L. 16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model. Gigascience 6, 1-4 (2017).
  25. Kim, S. et al. Strelka2: fast and accurate variant calling for clinical sequencing applications. Preprint at bioRxiv https://doi.org/10.1101/192872 (2017).
  26. Li, H. et al. New synthetic-diploid benchmark for accurate variant calling evaluation. Preprint at bioRxiv https://doi.org/10.1101/223297 (2017).
  27. Keane, T.M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289-294 (2011).
  28. Van der Auwera, G. What are the standard resources for non-human genomes? http://gatkforums.broadinstitute.org/gatk/discussion/1243/what-are-the-standard-resources-for-non-human-genomes (2018).
  29. Zook, J.M. et al. Extensive Sequencing of Seven Human Genomes to Characterize Benchmark Reference Materials (Cold Spring Harbor, 2015).
  30. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285-291 (2016).
  31. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68-74 (2015).
  32. Robasky, K., Lewis, N.E. & Church, G.M. The role of replicates for error mitigation in next-generation sequencing. Nat. Rev. Genet. 15, 56-62 (2014).
  33. Abadi, M., Agarwal, A., Barham, P., Brevdo, E. & Chen, Z. TensorFlow: large-scale machine learning on heterogeneous systems, 2015. Preprint at https://arxiv.org/abs/1603.04467 (2015).
  34. Luo, R., Sedlazeck, F.J., Lam, T.-W. & Schatz, M. Clairvoyante: a multi-task convolutional deep neural network for variant calling in single molecule sequencing. Preprint at bioRxiv https://doi.org/10.1101/310458 (2018).
  35. Torracinta, R. & Campagne, F. Training genotype callers with neural networks. Preprint at bioRxiv https://doi.org/10.1101/097469 (2016).

ONLINE METHODS

Haplotype-aware realignment of reads. Mapped reads are preprocessed using an error-tolerant, local De-Bruijn-graph-based read assembly procedure that realigns them according to their most likely derived haplotype. Candidate windows across the genome are selected for reassembly by looking for any evidence of possible genetic variation, such as mismatching or soft-clipped bases. The selection criteria for a candidate window are very permissive so that true variation is unlikely to be missed. All candidate windows across the genome are considered independently. De Bruijn graphs are constructed using multiple fixed k-mer sizes (from 20 to 75, inclusive, with increments of 5) out of the reference genome bases for the candidate window, as well as all overlapping reads. Edges are given a weight determined by how many times they are observed in the reads. We trim any edges with weight less than three, except that edges found in the reference are never trimmed. Candidate haplotypes are generated by traversing the assembly graphs and the top two most likely haplotypes are selected that best explain the read evidence. The likelihood function used to score haplotypes is a traditional pair HMM with fixed parameters that do not depend on base quality scores. This likelihood function assumes that each read is independent. Finally, each read is then realigned to its most likely haplotype using a Smith-Waterman-like algorithm with an additional affine gap penalty score for homopolymer indels. This procedure updates both the position and the CIGAR string for each read.
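A minimal sketch of the graph construction and pruning just described, using simple Python data structures of our own choosing (illustrative only, not DeepVariant's implementation): k-mer sizes run from 20 to 75 in steps of 5, edge weights count read support, and edges seen fewer than three times are trimmed unless they appear in the reference.

```python
from collections import defaultdict

K_MER_SIZES = range(20, 76, 5)  # fixed k-mer sizes: 20, 25, ..., 75
MIN_EDGE_WEIGHT = 3

def build_pruned_debruijn_edges(reference, reads, k):
    """Weight edges by read support; trim weight < 3 unless the edge is in the reference."""
    weights = defaultdict(int)
    reference_edges = set()
    for i in range(len(reference) - k):
        reference_edges.add((reference[i:i + k], reference[i + 1:i + k + 1]))
    for read in reads:
        for i in range(len(read) - k):
            weights[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    kept = {e: w for e, w in weights.items()
            if w >= MIN_EDGE_WEIGHT or e in reference_edges}
    for e in reference_edges:   # reference edges are never trimmed
        kept.setdefault(e, weights[e])
    return kept
```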
Finding candidate variants. Candidate variants for evaluation with the deep learning model are identified with the following algorithm. We consider each position in the reference genome independently. For each site in the genome, we collect all the reads that overlap that site. The CIGAR string of each read is decoded and the corresponding allele aligned to that site is determined; these are classified into either a reference-matching base, a reference-mismatching base, an insertion with a specific sequence, or a deletion with a specific length. We count the number of occurrences of each distinct allele across all reads. See Supplementary Note 8 and the current implementation at https://github.com/google/deepvariant/blob/r0.4/deepvariant/make_examples.py#L770.
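Schematically, the per-site allele tally can be pictured as below. The shorthand used to represent a decoded allele (a tag plus its bases or deletion length) is our own, chosen only for illustration; the real implementation lives in make_examples.py linked above.

```python
from collections import Counter

def count_alleles(decoded_alleles):
    """Tally distinct alleles observed at one reference position.

    decoded_alleles holds one entry per overlapping read, already decoded from
    that read's CIGAR string and sequence, for example:
        ("ref", "A")      reference-matching base
        ("snp", "T")      reference-mismatching base
        ("ins", "ATT")    insertion with a specific inserted sequence
        ("del", 2)        deletion with a specific length
    """
    return Counter(decoded_alleles)

counts = count_alleles([("ref", "A"), ("snp", "T"), ("snp", "T"), ("ins", "ATT")])
print(counts.most_common(1))  # [(('snp', 'T'), 2)]
```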

If any candidates pass our calling thresholds at a site in the genome, we emit a VCF-like record with chromosome, start, reference bases and alternate bases, where reference bases and alternate bases are the VCF-compatible representation of all of the passing alleles.
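For illustration, the emitted record might be modeled as a small structure like the one below (a sketch with field names of our choosing, mirroring the CHROM/POS/REF/ALT columns of a VCF line rather than DeepVariant's actual data structures).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CandidateVariant:
    """VCF-like record for a site whose alleles passed the calling thresholds."""
    chromosome: str
    start: int                 # 0-based start of the reference allele
    reference_bases: str
    alternate_bases: List[str] = field(default_factory=list)

# A candidate SNP and a candidate 2-bp deletion at the same site.
record = CandidateVariant("chr20", 9_999_999, "ACT", ["TCT", "A"])
```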
We filter away any unusable reads (see is_usable_read() below) if a read is marked as a duplicate, if it is marked as failing vendor quality checks, if it is not aligned or is not the primary alignment, if its mapping quality is less than 10, or if it is paired and not marked as properly placed. We further only include read bases as potential alleles if all of the bases in the alleles have a base quality ≥ 10. We emit variant calls only at standard (ACGT) bases in the reference genome. It is possible to force candidate variants to be emitted (randomly with probability p) at sites with no alternate alleles, which are used as homozygous reference training sites. There is no constraint on the size of indels emitted, so long as the exact position and bases are present in the CIGAR string and they are consistent across multiple reads.
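A sketch of the read filter described above, assuming pysam-style AlignedSegment flags and fields (the attribute names are pysam's; the function itself is our illustrative reconstruction, not the DeepVariant source):

```python
MIN_MAPPING_QUALITY = 10

def is_usable_read(read):
    """Return True if a read passes the filters described above."""
    if read.is_duplicate or read.is_qcfail:
        return False                      # duplicate, or failed vendor quality checks
    if read.is_unmapped or read.is_secondary:
        return False                      # not aligned, or not the primary alignment
    if read.mapping_quality < MIN_MAPPING_QUALITY:
        return False
    if read.is_paired and not read.is_proper_pair:
        return False                      # paired but not marked as properly placed
    return True
```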
Creating images around candidate variants. The second phase of DeepVariant encodes the reference and read support for each candidate variant into an RGB image. The pseudocode for this component is shown below; it contains all of the key operations to build the image, leaving out for clarity error handling, code to deal with edge cases such as those in which variants occur close to the start or end of the chromosome, and the implementation of nonessential and/or obvious functions. See Supplementary Note 9 and the current implementation at https://github.com/google/deepvariant/blob/r0.4/deepvariant/pileup_image.py.
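The referenced pseudocode does not survive in this snapshot. As an illustrative stand-in, the sketch below encodes a window around a candidate variant into an RGB array with one row for the reference and one row per read; the specific channel assignments and intensity values here are our own choices, not DeepVariant's exact pixel encoding.

```python
import numpy as np

BASE_INTENSITY = {"A": 250, "G": 180, "T": 100, "C": 30}  # illustrative values only

def encode_pileup(window_ref, window_reads, max_reads=95):
    """Build a toy (rows, width, 3) uint8 pileup image for one candidate site.

    window_reads holds (bases, base_qualities, is_reverse_strand) per read,
    already clipped or padded to the window width.
    """
    width = len(window_ref)
    image = np.zeros((max_reads + 1, width, 3), dtype=np.uint8)
    for col, base in enumerate(window_ref):            # row 0: the reference
        image[0, col] = (BASE_INTENSITY.get(base, 0), 255, 255)
    for row, (bases, quals, is_reverse) in enumerate(window_reads[:max_reads], start=1):
        for col, (base, qual) in enumerate(zip(bases, quals)):
            image[row, col] = (BASE_INTENSITY.get(base, 0),   # channel 0: base identity
                               min(255, qual * 6),            # channel 1: rescaled quality
                               0 if is_reverse else 255)      # channel 2: strand
    return image
```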
The actual implementation of this code uses a reservoir sampler to randomly remove reads at locations where there is excessive coverage. This downsampling occurs conceptually within the reads.get_overlapping() function but occurs in our implementation anywhere where there are more than 10,000 reads in a tiling of 300-bp intervals on the chromosome.
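Reservoir sampling keeps a uniform random subset of at most k items from a stream whose length is not known in advance. A minimal sketch follows (standard Algorithm R; the cap of 10,000 reads per 300-bp tile comes from the description above, everything else is our illustration):

```python
import random

MAX_READS_PER_TILE = 10_000  # cap per 300-bp interval, as described above

def reservoir_sample(reads, k=MAX_READS_PER_TILE, seed=None):
    """Return a uniform random sample of at most k reads from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, read in enumerate(reads):
        if i < k:
            reservoir.append(read)
        else:
            j = rng.randint(0, i)   # inclusive of i
            if j < k:
                reservoir[j] = read
    return reservoir
```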
Deep learning. DistBelief^{36} was used to represent models, train models on labeled images, export trained models, and evaluate trained models on unlabeled images. We adapted the Inception v2 architecture to our input images and our three-state (hom-ref, het, hom-alt) genotype classification problem. Specifically, we created an input image layer that rescales our input images to 299 × 299 pixels without shifting or scaling our pixel values. This input layer is attached to the ConvNetJuly2015v2^{17} CNN with nine partitions and weight decay of 0.00004. The final output layer of the CNN is a three-class Softmax layer with fully connected inputs to the preceding layer initialized with Gaussian random weights and s.d. of 0.001 and a weight decay of 0.00004.
The CNN was trained using stochastic gradient descent in batches of 32 images with eight replicated models and RMS decay of 0.9. For the Platinum Genomes, precisionFDA, NA12878 replicates, mouse and genome build experiments, multiple models were trained (using the product of learning rates of [0.00095, 0.001, 0.0015] and momenta [0.8, 0.85, 0.9]) for 80 h or until training accuracy converged, and the model with the highest accuracy on the training set was selected as the final model. For the multiple sequencing technologies experiment, a single model was trained with learning rate 0.0015 and momentum 0.8 for 250,000 update steps. In all experiments unless otherwise noted, the CNN was initialized with weights from the ImageNet model ConvNetJuly2015v2^{17}.
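The original training used DistBelief and an Inception v2 variant, which are not reproducible as written from this description alone. As a rough modern analogue only, the sketch below wires an ImageNet-initialized InceptionV3 (the architecture adopted by the open-source DeepVariant) to a three-class softmax head with the initializer, weight decay and SGD settings quoted above (learning rate 0.0015, momentum 0.8, batches of 32); everything else about this snippet is our assumption, not the authors' code.

```python
import tensorflow as tf

# Inception backbone on 299x299 RGB pileup images, pretrained on ImageNet.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3), pooling="avg")

# Three-class softmax head (hom-ref, het, hom-alt) with small Gaussian init
# and L2 weight decay of 0.00004, mirroring the description above.
genotype_probs = tf.keras.layers.Dense(
    3, activation="softmax",
    kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.001),
    kernel_regularizer=tf.keras.regularizers.l2(0.00004))(backbone.output)

model = tf.keras.Model(backbone.input, genotype_probs)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.0015, momentum=0.8),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
# model.fit(pileup_images, genotype_labels, batch_size=32, ...) on labeled examples.
```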