
A universal SNP and small-indel variant caller using deep neural networks

Ryan Poplin^{1,2}, Pi-Chuan Chang^{2}, David Alexander^{2}, Scott Schwartz^{2}, Thomas Colthurst^{2}, Alexander Ku^{2}, Dan Newburger^{1}, Jojo Dijamco^{1}, Nam Nguyen^{1}, Pegah T Afshar^{1}, Sam S Gross^{1}, Lizzie Dorfman^{1,2}, Cory Y McLean^{1,2} & Mark A DePristo^{1,2}

Abstract

Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant sites and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.

Calling genetic variants from next-generation sequencing (NGS) data has proven challenging because NGS reads are not only errorful (with error rates from ~0.1-10%) but arise from a complex error process that depends on properties of the instrument, preceding data processing tools, and the genome sequence itself^{1-5}. State-of-the-art variant callers use a variety of statistical techniques to model these error processes to accurately identify differences between the reads and the reference genome caused either by real genetic variants or by errors in the reads^{3-6}. For example, the widely used GATK uses logistic regression to model base errors, hidden Markov models to compute read likelihoods, and naive Bayes classification to identify variants, which are then filtered to remove likely false positives using a Gaussian mixture model with hand-crafted features capturing common error modes^{5}. These techniques allow the GATK to achieve high but still imperfect accuracy on the Illumina sequencing platform^{3,4}. Generalizing these models to other sequencing technologies (for example, Ion Torrent^{7,8}) has proven difficult due to the need to manually retune or extend these statistical models, which is problematic in an area with such rapid technological progress^{1}.
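As a point of reference for the statistical machinery described above, the sketch below works through the classical diploid genotype likelihood computed under the assumption that read errors are independent. The uniform error model, the allele set and the toy pileup are illustrative choices for exposition, not the GATK's actual implementation.

```python
import math
from itertools import combinations_with_replacement

def genotype_log_likelihoods(observed_bases, base_error_probs, alleles=("A", "C", "G", "T")):
    """Classical diploid genotype likelihoods assuming independent read errors.

    For genotype {a1, a2}, each observed base b contributes
    P(b | genotype) = 0.5 * P(b | a1) + 0.5 * P(b | a2), where
    P(b | a) = 1 - e if b == a, else e / 3 (e = per-base error probability).
    """
    log_liks = {}
    for a1, a2 in combinations_with_replacement(alleles, 2):
        total = 0.0
        for base, e in zip(observed_bases, base_error_probs):
            p1 = (1.0 - e) if base == a1 else e / 3.0
            p2 = (1.0 - e) if base == a2 else e / 3.0
            total += math.log(0.5 * p1 + 0.5 * p2)
        log_liks[(a1, a2)] = total
    return log_liks

# Toy pileup: 7 reads supporting A and 5 supporting G, all at ~1% error (Q20).
bases = ["A"] * 7 + ["G"] * 5
errors = [0.01] * len(bases)
likelihoods = genotype_log_likelihoods(bases, errors)
print(max(likelihoods, key=likelihoods.get))  # ('A', 'G'): a heterozygous call under this toy model
```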
Here we describe a variant caller, called DeepVariant, that replaces the assortment of statistical modeling components with a single deep learning model. Deep learning is a machine learning technique applicable to a variety of domains, including image classification^{9}, translation^{10}, gaming^{11,12} and the life sciences^{13-16}. This toolchain (Fig. 1) begins by finding candidate single nucleotide polymorphisms (SNPs) and indels in reads aligned to the reference genome with high sensitivity but low specificity using standard, algorithmic preprocessing techniques. The deep learning model, using the Inception architecture^{17}, emits probabilities for each of the three diploid genotypes at a locus using a pileup image of the reference and read data around each candidate variant (Fig. 1). The model is trained using labeled true genotypes, after which it is frozen and can then be applied to novel sites or samples. In the following experiments, DeepVariant was trained on an independent set of samples or variants from those being evaluated.
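A toy sketch of this call flow is given below. The candidate thresholds, the (reference count, alternate count) site representation and the stand-in genotype probabilities are invented for illustration; in DeepVariant the probabilities come from the CNN applied to a pileup image, not from allele fractions.

```python
import numpy as np

GENOTYPES = ("hom-ref", "het", "hom-alt")

def find_candidates(site_counts, min_alt_count=2, min_alt_fraction=0.12):
    """Toy candidate generator: flag sites whose non-reference allele fraction
    exceeds a permissive threshold (high sensitivity, low specificity)."""
    candidates = []
    for pos, (ref_count, alt_count) in enumerate(site_counts):
        depth = ref_count + alt_count
        if depth and alt_count >= min_alt_count and alt_count / depth >= min_alt_fraction:
            candidates.append(pos)
    return candidates

def call_site(genotype_probs):
    """Emit a variant call only when the most likely genotype is het or hom-alt."""
    genotype = GENOTYPES[int(np.argmax(genotype_probs))]
    return genotype if genotype != "hom-ref" else None

# Toy data: (reference-supporting reads, alternate-supporting reads) per site.
counts = [(30, 0), (18, 14), (29, 1), (2, 27)]
for pos in find_candidates(counts):
    # Fake genotype probabilities from the allele fraction, purely to show the flow.
    alt_frac = counts[pos][1] / sum(counts[pos])
    probs = np.array([(1 - alt_frac) ** 2, 2 * alt_frac * (1 - alt_frac), alt_frac ** 2])
    print(pos, call_site(probs / probs.sum()))  # 1 het, 3 hom-alt
```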
The deep learning model was trained without specialized knowledge about genomics or next-generation sequencing, and yet it can learn to call genetic variants more accurately than state-of-the-art methods. When applied to the Platinum Genomes Project NA12878 data^{18}, DeepVariant produced a callset with better performance than the GATK when evaluated on the held-out chromosomes of the Genome in a Bottle ground-truth set (Supplementary Figs. 1a and 2). For further validation, we sequenced 35 replicates of NA12878 using a standard whole-genome sequencing (WGS) protocol and called variants on 27 replicates using a GATK best-practices pipeline and DeepVariant using a model trained on the other eight replicates (Online Methods). DeepVariant produced more accurate results with greater consistency across a variety of quality metrics (Supplementary Fig. 1b and Supplementary Notes 1, 10 and 11).
Like many variant calling algorithms, the GATK relies on a model that assumes read errors to be independent^{5}. Though this has long been recognized as an invalid assumption^{2}, the true likelihood function that models multiple reads simultaneously is unknown^{5,19,20}. Because DeepVariant presents an image of all of the reads relevant for a putative variant together, the convolutional neural network (CNN) is able to account for the complex dependence among the reads by virtue of being a universal approximator^{21}.
Received 15 December 2017; accepted 2 August 2018; published online 24 September 2018; doi:10.1038/nbt.4235

Figure 1 DeepVariant workflow overview. Before DeepVariant, NGS reads are first aligned to a reference genome and cleaned up with duplicate marking and, optionally, local assembly. Left box: first, the aligned reads are scanned for sites that may be different from the reference genome. The read and reference data are encoded as an image for each candidate variant site. A trained CNN calculates the genotype likelihoods for each site. A variant call is emitted if the most likely genotype is heterozygous or homozygous non-reference. Middle box: training the CNN reuses the DeepVariant machinery to generate pileup images for a sample with known genotypes. These labeled image + genotype pairs, along with an initial CNN, which can be a random model, a CNN trained for other image classification tests, or a prior DeepVariant model, are used to optimize the CNN parameters to maximize genotype prediction accuracy using a stochastic gradient descent algorithm. After a maximum number of cycles or time has elapsed or the model’s performance has converged, the final trained model is frozen and can then be used for variant calling. Right box: the reference and read bases, quality scores, and other read features are encoded into a red-green-blue (RGB) pileup image at a candidate variant. This encoded image is provided to the CNN to calculate the genotype likelihoods for the three diploid genotype states of homozygous reference (hom-ref), heterozygous (het) or homozygous alternate (hom-alt). In this example a heterozygous variant call is emitted, as the most probable genotype here is “het”. In all panels, blue boxes represent data and red boxes are processes. Details of all processes are given in the Online Methods.

This manifests itself as a tight concordance between the estimated probability of error from the likelihood function and the observed error rate (Supplementary Fig. 1c), where DeepVariant's CNN is well calibrated, more so than the GATK. That the CNN has approximated this true but unknown interdependent likelihood function is the essential technical advance enabling us to replace the hand-crafted statistical models used in other approaches with a single deep learning model, and still achieve such high performance in variant calling.
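The calibration claim can be made concrete with a reliability-style calculation: calls are binned by their predicted probability of error and the mean prediction in each bin is compared with the empirically observed error rate. The sketch below uses simulated predictions and illustrates only the evaluation idea, not the analysis code behind Supplementary Fig. 1c.

```python
import numpy as np

def calibration_curve(predicted_error_probs, was_wrong, n_bins=10):
    """Bin calls by predicted error probability and report the observed error
    rate per bin; a well-calibrated caller tracks the y = x diagonal."""
    predicted = np.asarray(predicted_error_probs, dtype=float)
    wrong = np.asarray(was_wrong, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(predicted, bins) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((predicted[mask].mean(), wrong[mask].mean(), int(mask.sum())))
    return rows

# Simulated example: error status drawn from the predicted probabilities,
# so the curve should lie close to the diagonal.
rng = np.random.default_rng(0)
p_err = rng.uniform(0.0, 0.3, size=10_000)
was_wrong = rng.uniform(size=10_000) < p_err
for mean_pred, observed, n in calibration_curve(p_err, was_wrong):
    print(f"predicted {mean_pred:.3f}  observed {observed:.3f}  n={n}")
```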
To further benchmark the performance of DeepVariant, we submitted variant calls for a blinded sample, NA24385, to the US Food and Drug Administration (FDA)-sponsored variant calling Truth Challenge in May 2016 and won the “highest performance” award for SNPs as assessed by an independent team using a different evaluation methodology. For this contest DeepVariant was trained only on data available from the CEPH (Centre d’Etude du Polymorphisme Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. In achieving high accuracy as measured via F1, or the harmonic mean of sensitivity and positive predictive value (PPV), on this new sample (SNP F1 = 99.95%, indel F1 = 98.98%), we show that DeepVariant can generalize beyond its training data.
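F1 here is simply the harmonic mean of sensitivity (recall) and PPV (precision); the two-line check below reproduces the F1 column of Table 1 from the DeepVariant (pFDA) SNP row.

```python
def f1(recall, precision):
    """Harmonic mean of sensitivity (recall) and positive predictive value (precision)."""
    return 2.0 * recall * precision / (recall + precision)

# DeepVariant (pFDA) SNP row of Table 1: recall 0.99944, precision 0.99973.
print(round(f1(0.99944, 0.99973), 5))  # 0.99958, matching the table's F1 column
```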

We then applied the same dataset and evaluation methodology to a variety of both recent and commonly used bioinformatics methods, including the GATK, FreeBayes^{22}, SAMtools^{23}, 16GT^{24} and Strelka^{25} (Table 1). DeepVariant demonstrated more than 50% fewer errors per genome (4,652 errors) compared to the next-best algorithm (9,531 errors). We also evaluated the same set of methods using the synthetic diploid sample CHM1-CHM13^{26} (Table 2). In our tests DeepVariant outperformed all other methods for calling both SNP and indel mutations, without needing to adjust filtering thresholds or other parameters.
We further explored how well DeepVariant’s CNN generalizes beyond its training data. First, a model trained with read data aligned to human genome build GRCh37 and applied to reads aligned to GRCh38 had similar performance (overall F1 = 99.45%) to one trained on GRCh38 and then applied to GRCh38 (overall F1 = 99.53%), thereby demonstrating that a model learned from one version of the human genome reference can be applied to other versions with effectively no loss in accuracy (Supplementary Table 1 and Supplementary Note 2). Second, models trained using human reads and ground-truth data achieved high accuracy when applied to a mouse dataset^{27} (F1 = 98.29%), outperforming training on the mouse data itself (F1 = 97.84%; Supplementary Table 2 and Supplementary Note 3).
Table 1 Evaluation of several bioinformatics methods on the high-coverage, whole-genome sample NA24385
| Method | Type | F1 | Recall | Precision | TP | FN | FP | FP.gt | FP.al | Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepVariant (live GitHub) | Indel | 0.99507 | 0.99347 | 0.99666 | 357,641 | 2,350 | 1,198 | 217 | 840 | Latest GitHub v0.4.1-b4e8d37d |
| GATK (raw) | Indel | 0.99366 | 0.99219 | 0.99512 | 357,181 | 2,810 | 1,752 | 377 | 995 | 3.8-0-ge9d806836 |
| Strelka | Indel | 0.99227 | 0.98829 | 0.99628 | 355,777 | 4,214 | 1,329 | 221 | 855 | 2.8.4-3-gbe58942 |
| DeepVariant (pFDA) | Indel | 0.99112 | 0.98776 | 0.99450 | 355,586 | 4,405 | 1,968 | 846 | 1,027 | pFDA submission May 2016 |
| GATK (VQSR) | Indel | 0.99010 | 0.98454 | 0.99573 | 354,425 | 5,566 | 1,522 | 343 | 909 | 3.8-0-ge9d806836 |
| GATK (flt) | Indel | 0.98229 | 0.96881 | 0.99615 | 348,764 | 11,227 | 1,349 | 370 | 916 | 3.8-0-ge9d806836 |
| FreeBayes | Indel | 0.94091 | 0.91917 | 0.96372 | 330,891 | 29,100 | 12,569 | 9,149 | 3,347 | v1.1.0-54-g49413aa |
| 16GT | Indel | 0.92732 | 0.91102 | 0.94422 | 327,960 | 32,031 | 19,364 | 10,700 | 7,745 | v1.0-34e8f934 |
| SAMtools | Indel | 0.87951 | 0.83369 | 0.93066 | 300,120 | 59,871 | 22,682 | 2,302 | 20,282 | 1.6 |
| DeepVariant (live GitHub) | SNP | 0.99982 | 0.99975 | 0.99989 | 3,054,552 | 754 | 350 | 157 | 38 | Latest GitHub v0.4.1-b4e8d37d |
| DeepVariant (pFDA) | SNP | 0.99958 | 0.99944 | 0.99973 | 3,053,579 | 1,727 | 837 | 409 | 78 | pFDA submission May 2016 |
| Strelka | SNP | 0.99935 | 0.99893 | 0.99976 | 3,052,050 | 3,256 | 732 | 87 | 136 | 2.8.4-3-gbe58942 |
| GATK (raw) | SNP | 0.99914 | 0.99973 | 0.99854 | 3,054,494 | 812 | 4,469 | 176 | 257 | 3.8-0-ge9d806836 |
| 16GT | SNP | 0.99583 | 0.99850 | 0.99318 | 3,050,725 | 4,581 | 20,947 | 3,476 | 3,899 | v1.0-34e8f934 |
| GATK (VQSR) | SNP | 0.99436 | 0.98940 | 0.99937 | 3,022,917 | 32,389 | 1,920 | 80 | 170 | 3.8-0-ge9d806836 |
| FreeBayes | SNP | 0.99124 | 0.98342 | 0.99919 | 3,004,641 | 50,665 | 2,434 | 351 | 1,232 | v1.1.0-54-g49413aa |
| SAMtools | SNP | 0.99021 | 0.98114 | 0.99945 | 2,997,677 | 57,629 | 1,651 | 1,040 | 200 | 1.6 |
| GATK (flt) | SNP | 0.98958 | 0.97953 | 0.99983 | 2,992,764 | 62,542 | 509 | 168 | 26 | 3.8-0-ge9d806836 |
The dataset used in this evaluation is the same as in the precisionFDA Truth Challenge (pFDA). Several methods are compared, including the DeepVariant callset as submitted to the contest and the most recent DeepVariant version from GitHub. Each method was run according to the individual authors’ best-practice recommendations and represents a good-faith effort to achieve best results. Comparisons to the Genome in a Bottle truth set for this sample were performed using the hap.py software, available on GitHub at http://github.com/Illumina/hap.py, using the same version of the GIAB truth set (v3.2.2) used by pFDA. The overall accuracy (F1, sort order within each variant type), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are shown over the whole genome. False positives are further divided into those caused by genotype mismatches (FP.gt) and those caused by allele mismatches (FP.al). Finally, the version of the software used for each method is provided. We present three GATK callsets: GATK (raw), the unfiltered calls emitted by the HaplotypeCaller; GATK (VQSR), the callset filtered with variant quality score recalibration (VQSR); and GATK (flt), the raw GATK callset filtered with run-flt in CHM-eval. See Supplementary Note 7 for more details.
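The “50% fewer errors per genome” comparison quoted in the text can be recomputed directly from Table 1 by summing false negatives and false positives across both variant types for each caller, as in the short check below (counts copied from Table 1).

```python
# (FN, FP) per variant type, copied from Table 1 for the top three callers.
table1_errors = {
    "DeepVariant (live GitHub)": {"Indel": (2350, 1198), "SNP": (754, 350)},
    "Strelka":                   {"Indel": (4214, 1329), "SNP": (3256, 732)},
    "GATK (raw)":                {"Indel": (2810, 1752), "SNP": (812, 4469)},
}

for method, rows in table1_errors.items():
    total_errors = sum(fn + fp for fn, fp in rows.values())
    print(f"{method}: {total_errors} total errors")
# DeepVariant (live GitHub): 4652; Strelka (the next-best caller): 9531; GATK (raw): 9843.
```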
This last experiment is especially demanding as not only do the species differ but nearly all of the sequencing parameters do as well: 50× 2 × 148 bp from an Illumina TruSeq prep sequenced on a HiSeq 2500 for the human sample and 27× 2 × 100 bp reads from a custom sequencing preparation run on an Illumina Genome Analyzer II for mouse^{27}. Thus, DeepVariant is robust to changes in sequencing depth, preparation protocol, instrument type, genome build and even mammalian species, thereby enabling resequencing projects in nonhuman species, which often have no ground-truth data to guide their efforts^{27,28}, to leverage the large and growing ground-truth data in humans.
To further assess its capabilities, we trained DeepVariant to call variants in eight datasets from Genome in a Bottle^{29} that spanned a variety of sequencing instruments and protocols, including whole-genome and exome sequencing technologies, with read lengths from 50 to many thousands of base pairs (Supplementary Tables 3 and 4 and Supplementary Notes 4 and 5). We used the already processed BAM files to introduce additional variability, as these BAMs differed in their alignment and cleaning steps. The results of this experiment all exhibit a characteristic pattern: the candidate variants have the highest sensitivity but a low PPV (mean of 57.6%), which varies substantially by dataset. After retraining, all of the callsets achieve high PPVs (mean of 99.3%) while largely preserving the candidate callset sensitivity (mean loss of 2.3%). The high PPVs and low loss of sensitivity indicate that DeepVariant can learn a model that captures the technology-specific error processes in sufficient detail to separate real variation from false positives with high fidelity for many different sequencing technologies.
Next we analyzed the behavior of DeepVariant on two non-Illumina WGS datasets, one from ThermoFisher (SOLiD) and one from Pacific Biosciences (PacBio), and on two exome datasets from Illumina (TruSeq) and Ion Torrent (Ion Ampliseq). The SOLiD and PacBio WGS datasets have high error rates in the candidate callsets. SOLiD (13.9% PPV for SNPs, 96.2% for indels and 14.3% overall) has many SNP artifacts from the mapping of short, color-space reads.
The PacBio dataset is the opposite, with many false indels (79.8% PPV for SNPs, 1.4% for indels and 22.1% overall) owing to this technology’s high indel error rate. Training DeepVariant to call variants in an exome is likely to be particularly challenging. Exomes have far fewer variants (~20k-30k)^{30} than found in a whole genome (~4-5M)^{31}. The non-uniform coverage and sequencing errors from the exome capture or amplification technology also introduce many false positive variants^{32}. For example, at 8.1%, the PPV of our candidate variants for Ion Ampliseq is the lowest of all our datasets.
Despite the low initial PPVs, the retrained models in DeepVariant separated errors from real variants with high accuracy in the WGS datasets (PPVs of 99.0% and 97.3% for SOLiD and PacBio, respectively), though with a larger loss in sensitivity (candidates 82.5% and final 76.6% for SOLiD and 93.4% and 88.5%, respectively, for PacBio) than other technologies. Furthermore, despite the challenges of retraining deep learning models with limited data, the exome datasets also performed well, with a small reduction in sensitivity (from 91.9% to 89.3% and 94.0% to 92.6% for Ion Ampliseq and TruSeq candidates and final calls, respectively) for a substantial boost in PPV (from 8.1% to 99.7% and 65.3% to 99.3% for Ion and TruSeq, respectively). The performance of DeepVariant compares favorably to those of callsets submitted to the Genome in a Bottle project site using tools developed specifically for each NGS technology and to callsets produced by the GATK or SAMtools (Supplementary Table 5).
The accuracy numbers presented here should not be viewed as the maximum achievable by either the sequencing technology or DeepVariant. For consistency, we used the same model architecture, image representation, training parameters and candidate variant criteria for each technology. Because DeepVariant achieves high PPVs for all technologies, the overall accuracy is effectively driven by the sensitivity of the candidate callset. Improvements to the data processing steps before DeepVariant and to the algorithm used to identify candidate variants are likely to translate into further improvements in overall accuracy, particularly for multi-allelic indels.
Table 2 Evaluation of several bioinformatics methods on the high-coverage, whole-genome synthetic diploid sample CHM1-CHM13
| Method | Type | F1 | Recall | Precision | TP | FN | FP | Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepVariant | Indel | 0.95806 | 0.92868 | 0.98936 | 529,137 | 40,634 | 5,690 | v0.4.1-b4e8d37d |
| Strelka | Indel | 0.95074 | 0.91623 | 0.98796 | 522,039 | 47,732 | 6,363 | 2.8.4-3-gbe58942 |
| 16GT | Indel | 0.94010 | 0.90803 | 0.97452 | 517,369 | 52,402 | 13,527 | v1.0-34e8f934 |
| GATK (raw) | Indel | 0.93268 | 0.89504 | 0.97363 | 509,969 | 59,802 | 13,811 | 3.8-0-ge9d806836 |
| GATK (VQSR) | Indel | 0.91212 | 0.84497 | 0.99087 | 481,441 | 88,330 | 4,437 | 3.8-0-ge9d806836 |
| FreeBayes | Indel | 0.90438 | 0.83025 | 0.99305 | 473,053 | 96,718 | 3,313 | v1.1.0-54-g49413aa |
| SAMtools | Indel | 0.86976 | 0.79089 | 0.96611 | 450,626 | 119,145 | 15,807 | 1.6 |
| DeepVariant | SNP | 0.99103 | 0.98888 | 0.99319 | 3,518,118 | 39,553 | 24,132 | v0.4.1-b4e8d37d |
| Strelka | SNP | 0.98865 | 0.98107 | 0.99636 | 3,490,314 | 67,357 | 12,749 | 2.8.4-3-gbe58942 |
| 16GT | SNP | 0.97862 | 0.98966 | 0.96782 | 3,520,894 | 36,777 | 117,078 | v1.0-34e8f934 |
| FreeBayes | SNP | 0.96910 | 0.94837 | 0.99075 | 3,373,984 | 183,687 | 31,492 | v1.1.0-54-g49413aa |
| GATK (VQSR) | SNP | 0.96895 | 0.94542 | 0.99368 | 3,363,476 | 194,195 | 21,379 | 3.8-0-ge9d806836 |
| SAMtools | SNP | 0.96818 | 0.94386 | 0.99378 | 3,357,947 | 199,724 | 21,012 | 1.6 |
| GATK (raw) | SNP | 0.96646 | 0.95685 | 0.97627 | 3,404,167 | 153,504 | 82,748 | 3.8-0-ge9d806836 |
Several methods are compared, including the most recent DeepVariant version from GitHub. Each method was run according to the individual authors’ best-practice recommendations and represents a good-faith effort to achieve best results. Comparisons to the CHM1-CHM13 truth set were performed using the CHM-eval.kit software, available on GitHub at https://github.com/lh3/CHM-eval, release version 0.5. The overall accuracy (F1, sort order within each variant type), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are shown over the whole genome. Finally, the version of the software used for each method is provided. Note that we present two GATK callsets: GATK (raw), the unfiltered calls emitted by the HaplotypeCaller; and GATK (VQSR), the callset filtered with the VQSR. See Supplementary Note 7 for more details.

Conversely, despite its effectiveness, representing variant calls as images and applying general image-classification models is certainly suboptimal, as we were unable to effectively encode all of the available information in the reads and reference into the three-channel image.
Taken together, our results demonstrate that the deep learning approach employed by DeepVariant can learn a statistical model describing the relationship between the experimentally observed NGS reads and genetic variants in that data for several sequencing technologies. Technologies like DeepVariant change the problem of calling variants from a process of expert-driven, technology-specific statistical modeling to a more automated process of optimizing a general model against data. With DeepVariant, creating an NGS caller for a new sequencing technology becomes a simpler matter of developing the appropriate preprocessing steps, training a deep learning model on sequencing data from samples with ground-truth data, and applying this model to new, even nonhuman, samples (see Supplementary Note 6).

At its core, DeepVariant generates candidate entities with high sensitivity but low specificity, represents the experimental data about each entity in a machine-learning-compatible format and then applies deep learning to assign meaningful biological labels to these entities. This general framework for inferring biological entities from raw, errorful, indirect experimental data is likely to be applicable to other high-throughput instruments.
The results presented in Figure 1, Supplementary Figures 1 and 2, and Supplementary Tables 1-8 were generated with the original, internal version of DeepVariant. Since then we have rewritten DeepVariant to make it available as open source software. As a result, several improvements to the DeepVariant method have been made that are not captured in the analyses presented here, including switching to TensorFlow^{33} to train the model, using the inception_v3 neural network architecture and using a multichannel tensor representation for the genomics data instead of an RGB image. The results in Tables 1 and 2 used the open source version of DeepVariant; the evaluation scripts are available as Supplementary Software. The latest version of DeepVariant is available on GitHub (https://github.com/google/deepvariant/).
Also note that several other deep-learning-based variant callers have since been described^{34,35}.

METHODS

Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.
Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

ACKNOWLEDGMENTS

We thank J. Zook and his collaborators at NIST for their work developing the Genome in a Bottle resources, the Verily sequencing facility for running the NA12878 replicates, and our colleagues at Verily and Google for their feedback on this manuscript and the project in general. This work was supported by internal funding.

AUTHOR CONTRIBUTIONS

R.P. and M.A.D. designed the study, analyzed and interpreted results and wrote the paper. R.P., P.-C.C., D.A., S.S., T.C., A.K., D.N., J.D., N.N., P.T.A., S.S.G., L.D., C.Y.M. and M.A.D. performed experiments and contributed to the software.

COMPETING INTERESTS

D.N., J.D., N.N., P.T.A. and S.S.G. are employees of Verily Life Sciences. P.-C.C., D.A., S.S., T.C. and A.K. are employees of Google Inc. R.P., L.D., C.Y.M. and M.A.D. are employees of Verily Life Sciences and Google Inc. This work was internally funded by Verily Life Sciences and Google Inc.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html. Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Goodwin, S., McPherson, J.D. & McCombie, W.R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333-351 (2016).
2. Nielsen, R., Paul, J.S., Albrechtsen, A. & Song, Y.S. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12, 443-451 (2011).
3. Li, H. Towards better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843-2851 (2014).
4. Goldfeder, R.L. et al. Medical implications of technical accuracy in genome sequencing. Genome Med. 8, 24 (2016).
5. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491-498 (2011).
6. Ding, J. et al. Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics 28, 167-175 (2012).
7. Bragg, L.M., Stone, G., Butler, M.K., Hugenholtz, P. & Tyson, G.W. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput. Biol. 9, e1003031 (2013).
8. Yeo, Z.X., Wong, J.C.L., Rozen, S.G. & Lee, A.S.G. Evaluation and optimisation of indel detection workflows for ion torrent sequencing of the BRCA1 and BRCA2 genes. BMC Genomics 15, 516 (2014).
9. Krizhevsky, A., Sutskever, I. & Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097-1105 (2012).
10. Wu, Y. et al. Google's neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
11. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484-489 (2016).
12. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529-533 (2015).
13. Min, S., Lee, B. & Yoon, S. Deep learning in bioinformatics. Brief. Bioinform. 18, 851-869 (2017).
14. Alipanahi, B., Delong, A., Weirauch, M.T. & Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831-838 (2015).
15. Zhou, J. & Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931-934 (2015).
16. Xiong, H.Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2015).
17. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the Inception architecture for computer vision. Preprint at https://arxiv.org/abs/1512.00567 (2015).
18. Eberle, M.A. et al. A reference dataset of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27, 157-164 (2017).
19. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851-1858 (2008).
20. Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124-1132 (2009).
21. Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359-366 (1989).
22. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/abs/1207.3907 (2012).
23. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).
24. Luo, R., Schatz, M.C. & Salzberg, S.L. 16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model. Gigascience 6, 1-4 (2017).
25. Kim, S. et al. Strelka2: fast and accurate variant calling for clinical sequencing applications. Preprint at bioRxiv https://doi.org/10.1101/192872 (2017).
26. Li, H. et al. New synthetic-diploid benchmark for accurate variant calling evaluation. Preprint at bioRxiv https://doi.org/10.1101/223297 (2017).
27. Keane, T.M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289-294 (2011).
28. Van der Auwera, G. What are the standard resources for non-human genomes? http://gatkforums.broadinstitute.org/gatk/discussion/1243/what-are-the-standard-resources-for-non-human-genomes (2018).
29. Zook, J.M. et al. Extensive Sequencing of Seven Human Genomes to Characterize Benchmark Reference Materials (Cold Spring Harbor, 2015).
30. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285-291 (2016).
31. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68-74 (2015).
32. Robasky, K., Lewis, N.E. & Church, G.M. The role of replicates for error mitigation in next-generation sequencing. Nat. Rev. Genet. 15, 56-62 (2014).
33. Abadi, M., Agarwal, A., Barham, P., Brevdo, E. & Chen, Z. TensorFlow: large-scale machine learning on heterogeneous systems. Preprint at https://arxiv.org/abs/1603.04467 (2015).
34. Luo, R., Sedlazeck, F.J., Lam, T.-W. & Schatz, M. Clairvoyante: a multi-task convolutional deep neural network for variant calling in single molecule sequencing. Preprint at bioRxiv https://doi.org/10.1101/310458 (2018).
35. Torracinta, R. & Campagne, F. Training genotype callers with neural networks. Preprint at bioRxiv https://doi.org/10.1101/097469 (2016).

ONLINE METHODS

Haplotype-aware realignment of reads. Mapped reads are preprocessed using an error-tolerant, local De-Bruijn-graph-based read assembly procedure that realigns them according to their most likely derived haplotype. Candidate windows across the genome are selected for reassembly by looking for any evidence of possible genetic variation, such as mismatching or soft-clipped bases. The selection criteria for a candidate window are very permissive so that true variation is unlikely to be missed. All candidate windows across the genome are considered independently. De Bruijn graphs are constructed using multiple fixed k-mer sizes (from 20 to 75, inclusive, with increments of 5) out of the reference genome bases for the candidate window, as well as all overlapping reads. Edges are given a weight determined by how many times they are observed in the reads. We trim any edges with weight less than three, except that edges found in the reference are never trimmed. Candidate haplotypes are generated by traversing the assembly graphs and the top two most likely haplotypes are selected that best explain the read evidence. The likelihood function used to score haplotypes is a traditional pair HMM with fixed parameters that do not depend on base quality scores. This likelihood function assumes that each read is independent. Finally, each read is then realigned to its most likely haplotype using a Smith-Waterman-like algorithm with an additional affine gap penalty score for homopolymer indels. This procedure updates both the position and the CIGAR string for each read.
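A minimal sketch of the graph-construction and edge-pruning step for a single k-mer size is given below. The haplotype enumeration, pair-HMM scoring and Smith-Waterman realignment are not shown, the sequence-only data structures are illustrative rather than DeepVariant's, and whether the reference contributes to edge weights is an assumption made here.

```python
from collections import defaultdict

def build_debruijn_edges(reference, reads, k, min_weight=3):
    """Build De Bruijn graph edges for one k-mer size: nodes are k-mers, an edge
    joins consecutive overlapping k-mers. Edges with weight < min_weight are
    pruned unless they also occur in the reference, which is never trimmed."""
    def kmer_edges(seq):
        for i in range(len(seq) - k):
            yield seq[i:i + k], seq[i + 1:i + k + 1]

    weights = defaultdict(int)
    for seq in [reference] + list(reads):
        for edge in kmer_edges(seq):
            weights[edge] += 1
    reference_edges = set(kmer_edges(reference))
    return {edge: w for edge, w in weights.items()
            if w >= min_weight or edge in reference_edges}

# The text uses k-mer sizes from 20 to 75, inclusive, in steps of 5.
KMER_SIZES = range(20, 76, 5)

# Toy window with a tiny k purely for demonstration.
ref_window = "ACGTACGTACGT"
reads = ["ACGTACTTACGT"] * 4 + ["ACGTACGTACGT"] * 6
print(len(build_debruijn_edges(ref_window, reads, k=4)))
```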
Finding candidate variants. Candidate variants for evaluation with the deep learning model are identified with the following algorithm. We consider each position in the reference genome independently. For each site in the genome, we collect all the reads that overlap that site. The CIGAR string of each read is decoded and the corresponding allele aligned to that site is determined; these are classified into either a reference-matching base, a reference-mismatching base, an insertion with a specific sequence, or a deletion with a specific length. We count the number of occurrences of each distinct allele across all reads. See Supplementary Note 8 and the current implementation at https://github.com/google/deepvariant/blob/r0.4/deepvariant/make_examples.py#L770.
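The CIGAR walk that classifies each read's allele at a site can be sketched as below. The read representation (a dict with pos, cigar and seq), the insertion-anchoring convention and the omission of base- and mapping-quality filters are simplifications for illustration, not DeepVariant's internal data structures.

```python
from collections import Counter

def allele_at(read, site):
    """Return the allele a read carries at a 0-based reference position: the
    aligned base, 'ins:<seq>' for an insertion anchored at the site, or
    'del:<len>' for a deletion spanning it. Simplified CIGAR walk."""
    ref_pos, read_pos = read["pos"], 0
    for op, length in read["cigar"]:
        if op == "M":                      # aligned bases (match or mismatch)
            if ref_pos <= site < ref_pos + length:
                return read["seq"][read_pos + (site - ref_pos)]
            ref_pos += length
            read_pos += length
        elif op == "I":                    # insertion, anchored at the preceding base
            if ref_pos - 1 == site:
                return "ins:" + read["seq"][read_pos:read_pos + length]
            read_pos += length
        elif op == "D":                    # deletion of reference bases
            if ref_pos <= site < ref_pos + length:
                return "del:%d" % length
            ref_pos += length
        elif op == "S":                    # soft clip consumes read bases only
            read_pos += length
    return None

def count_alleles(reads, site):
    return Counter(a for a in (allele_at(r, site) for r in reads) if a is not None)

reads = [
    {"pos": 100, "cigar": [("M", 10)], "seq": "ACGTACGTAC"},
    {"pos": 100, "cigar": [("M", 10)], "seq": "ACGTTCGTAC"},
    {"pos": 100, "cigar": [("M", 4), ("D", 1), ("M", 5)], "seq": "ACGTCGTAC"},
]
print(count_alleles(reads, 104))  # Counter({'A': 1, 'T': 1, 'del:1': 1})
```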

If any candidates pass our calling thresholds at a site in the genome, we emit a VCF-like record with chromosome, start, reference bases and alternate bases, where reference bases and alternate bases are the VCF-compatible representation of all of the passing alleles.
We filter away any unusable reads (see is_usable_read() below) if a read is marked as a duplicate, if it is marked as failing vendor quality checks, if it is not aligned or is not the primary alignment, if its mapping quality is less than 10, or if it is paired and not marked as properly placed. We further only include read bases as potential alleles if all of the bases in the alleles have a base quality ≥ 10. We emit variant calls only at standard (ACGT) bases in the reference genome. It is possible to force candidate variants to be emitted (randomly with probability p) at sites with no alternate alleles, which are used as homozygous reference training sites. There is no constraint on the size of indels emitted, so long as the exact position and bases are present in the CIGAR string and they are consistent across multiple reads.
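These filters map directly onto SAM flag fields. The sketch below assumes a read object exposing pysam-style attributes (is_duplicate, is_qcfail, is_unmapped, is_secondary, mapping_quality, is_paired, is_proper_pair); it paraphrases the criteria in the text and is not the actual is_usable_read() implementation.

```python
MIN_MAPPING_QUALITY = 10  # threshold stated in the text

def is_usable_read(read):
    """Mirror of the read filters described above, assuming pysam-style
    AlignedSegment attributes on `read`."""
    if read.is_duplicate or read.is_qcfail:          # duplicate or vendor QC failure
        return False
    if read.is_unmapped or read.is_secondary:        # unaligned or non-primary alignment
        return False
    if read.mapping_quality < MIN_MAPPING_QUALITY:   # low mapping quality
        return False
    if read.is_paired and not read.is_proper_pair:   # paired but not properly placed
        return False
    return True
```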
Creating images around candidate variants. The second phase of DeepVariant encodes the reference and read support for each candidate variant into an RGB image. The pseudocode for this component is shown below; it contains all of the key operations to build the image, leaving out for clarity error handling, code to deal with edge cases such as those in which variants occur close to the start or end of the chromosome, and the implementation of nonessential and/or obvious functions. See Supplementary Note 9 and the current implementation at https://github.com/google/deepvariant/blob/r0.4/deepvariant/pileup_image.py.
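Since the pseudocode itself is not reproduced in this snapshot, the sketch below shows one plausible, simplified encoding of a pileup into an RGB array: one row per read, one column per window position, with red carrying base identity, green carrying base quality and blue carrying strand. The channel assignments, image dimensions and the assumption that reads are already projected onto window coordinates are illustrative choices, not DeepVariant's exact scheme.

```python
import numpy as np

BASE_TO_RED = {"A": 250, "C": 180, "G": 110, "T": 40}  # illustrative base-to-intensity map

def encode_pileup_image(ref_window, reads, width=221, max_rows=100):
    """Toy RGB pileup encoding: reference bases in the first rows, then one row
    per read. Reads are dicts with 'bases', 'quals' and 'is_forward', already
    aligned to window columns (a simplification)."""
    image = np.zeros((max_rows, width, 3), dtype=np.uint8)
    for col, base in enumerate(ref_window[:width]):        # reference band at the top
        image[0:5, col] = (BASE_TO_RED.get(base, 0), 255, 255)
    for row, read in enumerate(reads[: max_rows - 5], start=5):
        for col, (base, qual) in enumerate(zip(read["bases"][:width], read["quals"][:width])):
            image[row, col, 0] = BASE_TO_RED.get(base, 0)              # base identity
            image[row, col, 1] = min(int(qual * 255 / 60), 255)        # scaled base quality
            image[row, col, 2] = 255 if read["is_forward"] else 70     # strand
    return image

# Toy usage: a 12-bp window with two short reads.
reads = [
    {"bases": "ACGTACGTACGT", "quals": [30] * 12, "is_forward": True},
    {"bases": "ACGTTCGTACGT", "quals": [25] * 12, "is_forward": False},
]
print(encode_pileup_image("ACGTACGTACGT", reads).shape)  # (100, 221, 3)
```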
The actual implementation of this code uses a reservoir sampler to randomly remove reads at locations where there is excessive coverage. This downsampling occurs conceptually within the reads.get_overlapping() function but occurs in our implementation anywhere where there are more than 10,000 reads in a tiling of 300-bp intervals on the chromosome.
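Reservoir sampling keeps a bounded, uniformly random subset of reads in a single pass; the sketch below is the standard algorithm with the 10,000-read cap mentioned above, leaving out the 300-bp tiling logic.

```python
import random

def reservoir_sample(reads, max_reads=10_000, seed=None):
    """Single-pass uniform downsampling: every input read has the same
    probability of ending up in the returned subset of size <= max_reads."""
    rng = random.Random(seed)
    reservoir = []
    for i, read in enumerate(reads):
        if i < max_reads:
            reservoir.append(read)
        else:
            j = rng.randint(0, i)       # uniform over all reads seen so far
            if j < max_reads:
                reservoir[j] = read
    return reservoir

# Example: downsample 25,000 synthetic read IDs to at most 10,000.
print(len(reservoir_sample(range(25_000))))  # 10000
```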
Deep learning. DistBelief^{36} was used to represent models, train models on labeled images, export trained models, and evaluate trained models on unlabeled images.

We adapted the Inception v2 architecture to our input images and our three-state (hom-ref, het, hom-alt) genotype classification problem. Specifically, we created an input image layer that rescales our input images to 299 × 299 pixels without shifting or scaling our pixel values. This input layer is attached to the ConvNetJuly2015v2^{17} CNN with nine partitions and weight decay of 0.00004. The final output layer of the CNN is a three-class Softmax layer with fully connected inputs to the preceding layer, initialized with Gaussian random weights with s.d. of 0.001 and a weight decay of 0.00004.
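The classifier head described in this paragraph can be sketched with tf.keras primitives. Only the 299 × 299 rescaling, the three-class softmax output, the Gaussian initialization (s.d. 0.001) and the 0.00004 weight decay come from the text; the tiny convolutional trunk is a placeholder for the Inception-style CNN, and the pileup-image input shape is illustrative.

```python
import tensorflow as tf

WEIGHT_DECAY = 0.00004  # value stated in the text
l2_reg = tf.keras.regularizers.l2(WEIGHT_DECAY)

def build_genotype_classifier(input_shape=(100, 221, 3)):
    """Rescale a pileup image to 299 x 299 and classify it into the three
    diploid genotype states (hom-ref, het, hom-alt) with a softmax layer."""
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=input_shape),
        tf.keras.layers.Resizing(299, 299),                 # rescale to 299 x 299 (no mean/std normalization)
        tf.keras.layers.Conv2D(32, 3, activation="relu",    # placeholder trunk, not Inception
                               kernel_regularizer=l2_reg),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu",
                               kernel_regularizer=l2_reg),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(
            3, activation="softmax",                        # hom-ref, het, hom-alt
            kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.001),
            kernel_regularizer=l2_reg),
    ])

model = build_genotype_classifier()
model.summary()
```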