
A universal SNP and small-indel variant caller using deep neural networks

Ryan Poplin^{1,2}, Pi-Chuan Chang^{2}, David Alexander^{2}, Scott Schwartz^{2}, Thomas Colthurst^{2}, Alexander Ku^{2}, Dan Newburger^{1}, Jojo Dijamco^{1}, Nam Nguyen^{1}, Pegah T Afshar^{1}, Sam S Gross^{1}, Lizzie Dorfman^{1,2}, Cory Y McLean^{1,2} & Mark A DePristo^{1,2}

Abstract

Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant sites and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.

Calling genetic variants from next-generation sequencing (NGS) data has proven challenging because NGS reads are not only errorful (with error rates from ~0.1-10%) but arise from a complex error process that depends on properties of the instrument, preceding data processing tools, and the genome sequence itself^{1-5}. State-of-the-art variant callers use a variety of statistical techniques to model these error processes to accurately identify differences between the reads and the reference genome caused either by real genetic variants or by errors in the reads^{3-6}. For example, the widely used GATK uses logistic regression to model base errors, hidden Markov models to compute read likelihoods, and naive Bayes classification to identify variants, which are then filtered to remove likely false positives using a Gaussian mixture model with hand-crafted features capturing common error modes^{5}. These techniques allow the GATK to achieve high but still imperfect accuracy on the Illumina sequencing platform^{3,4}. Generalizing these models to other sequencing technologies (for example, Ion Torrent^{7,8}) has proven difficult due to the need to manually retune or extend these statistical models, which is problematic in an area with such rapid technological progress^{1}.
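As a point of reference for the statistical machinery described above, the sketch below works through the classical diploid genotype likelihood computed under the assumption that read errors are independent. The uniform error model, the allele set and the toy pileup are illustrative choices for exposition, not the GATK's actual implementation.

```python
import math
from itertools import combinations_with_replacement

def genotype_log_likelihoods(observed_bases, base_error_probs, alleles=("A", "C", "G", "T")):
    """Classical diploid genotype likelihoods assuming independent read errors.

    For genotype {a1, a2}, each observed base b contributes
    P(b | genotype) = 0.5 * P(b | a1) + 0.5 * P(b | a2), where
    P(b | a) = 1 - e if b == a, else e / 3 (e = per-base error probability).
    """
    log_liks = {}
    for a1, a2 in combinations_with_replacement(alleles, 2):
        total = 0.0
        for base, e in zip(observed_bases, base_error_probs):
            p1 = (1.0 - e) if base == a1 else e / 3.0
            p2 = (1.0 - e) if base == a2 else e / 3.0
            total += math.log(0.5 * p1 + 0.5 * p2)
        log_liks[(a1, a2)] = total
    return log_liks

# Toy pileup: 7 reads supporting A and 5 supporting G, all at ~1% error (Q20).
bases = ["A"] * 7 + ["G"] * 5
errors = [0.01] * len(bases)
likelihoods = genotype_log_likelihoods(bases, errors)
print(max(likelihoods, key=likelihoods.get))  # ('A', 'G'): a heterozygous call under this toy model
```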
Here we describe a variant caller, called DeepVariant, that replaces the assortment of statistical modeling components with a single deep learning model. Deep learning is a machine learning technique applicable to a variety of domains, including image classification^{9}, translation^{10}, gaming^{11,12} and the life sciences^{13-16}. This toolchain (Fig. 1) begins by finding candidate single nucleotide polymorphisms (SNPs) and indels in reads aligned to the reference genome with high sensitivity but low specificity using standard, algorithmic preprocessing techniques. The deep learning model, using the Inception architecture^{17}, emits probabilities for each of the three diploid genotypes at a locus using a pileup image of the reference and read data around each candidate variant (Fig. 1). The model is trained using labeled true genotypes, after which it is frozen and can then be applied to novel sites or samples. In the following experiments, DeepVariant was trained on an independent set of samples or variants from those being evaluated.
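A toy sketch of this call flow is given below. The candidate thresholds, the (reference count, alternate count) site representation and the stand-in genotype probabilities are invented for illustration; in DeepVariant the probabilities come from the CNN applied to a pileup image, not from allele fractions.

```python
import numpy as np

GENOTYPES = ("hom-ref", "het", "hom-alt")

def find_candidates(site_counts, min_alt_count=2, min_alt_fraction=0.12):
    """Toy candidate generator: flag sites whose non-reference allele fraction
    exceeds a permissive threshold (high sensitivity, low specificity)."""
    candidates = []
    for pos, (ref_count, alt_count) in enumerate(site_counts):
        depth = ref_count + alt_count
        if depth and alt_count >= min_alt_count and alt_count / depth >= min_alt_fraction:
            candidates.append(pos)
    return candidates

def call_site(genotype_probs):
    """Emit a variant call only when the most likely genotype is het or hom-alt."""
    genotype = GENOTYPES[int(np.argmax(genotype_probs))]
    return genotype if genotype != "hom-ref" else None

# Toy data: (reference-supporting reads, alternate-supporting reads) per site.
counts = [(30, 0), (18, 14), (29, 1), (2, 27)]
for pos in find_candidates(counts):
    # Fake genotype probabilities from the allele fraction, purely to show the flow.
    alt_frac = counts[pos][1] / sum(counts[pos])
    probs = np.array([(1 - alt_frac) ** 2, 2 * alt_frac * (1 - alt_frac), alt_frac ** 2])
    print(pos, call_site(probs / probs.sum()))  # 1 het, 3 hom-alt
```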
The deep learning model was trained without specialized knowledge about genomics or next-generation sequencing, and yet it can learn to call genetic variants more accurately than state-of-the-art methods. When applied to the Platinum Genomes Project NA12878 data^{18}, DeepVariant produced a callset with better performance than the GATK when evaluated on the held-out chromosomes of the Genome in a Bottle ground-truth set (Supplementary Figs. 1a and 2). For further validation, we sequenced 35 replicates of NA12878 using a standard whole-genome sequencing (WGS) protocol and called variants on 27 replicates using a GATK best-practices pipeline and DeepVariant using a model trained on the other eight replicates (Online Methods). DeepVariant produced more accurate results with greater consistency across a variety of quality metrics (Supplementary Fig. 1b and Supplementary Notes 1, 10 and 11).
Like many variant calling algorithms, the GATK relies on a model that assumes read errors to be independent^{5}. Though this has long been recognized as an invalid assumption^{2}, the true likelihood function that models multiple reads simultaneously is unknown^{5,19,20}. Because DeepVariant presents an image of all of the reads relevant for a putative variant together, the convolutional neural network (CNN) is able to account for the complex dependence among the reads by virtue of being a universal approximator^{21}.
Received 15 December 2017; accepted 2 August 2018; published online 24 September 2018; doi:10.1038/nbt.4235

Figure 1 DeepVariant workflow overview. Before DeepVariant, NGS reads are first aligned to a reference genome and cleaned up with duplicate marking and, optionally, local assembly. Left box: first, the aligned reads are scanned for sites that may be different from the reference genome. The read and reference data are encoded as an image for each candidate variant site. A trained CNN calculates the genotype likelihoods for each site. A variant call is emitted if the most likely genotype is heterozygous or homozygous non-reference. Middle box: training the CNN reuses the DeepVariant machinery to generate pileup images for a sample with known genotypes. These labeled image + genotype pairs, along with an initial CNN, which can be a random model, a CNN trained for other image classification tests, or a prior DeepVariant model, are used to optimize the CNN parameters to maximize genotype prediction accuracy using a stochastic gradient descent algorithm. After a maximum number of cycles or time has elapsed or the model’s performance has converged, the final trained model is frozen and can then be used for variant calling. Right box: the reference and read bases, quality scores, and other read features are encoded into a red-green-blue (RGB) pileup image at a candidate variant. This encoded image is provided to the CNN to calculate the genotype likelihoods for the three diploid genotype states of homozygous reference (hom-ref), heterozygous (het) or homozygous alternate (hom-alt). In this example a heterozygous variant call is emitted, as the most probable genotype here is “het”. In all panels, blue boxes represent data and red boxes are processes. Details of all processes are given in the Online Methods.

This manifests itself as a tight concordance between the estimated probability of error from the likelihood function and the observed error rate (Supplementary Fig. 1c), where DeepVariant's CNN is well calibrated, more so than the GATK. That the CNN has approximated this true but unknown interdependent likelihood function is the essential technical advance enabling us to replace the hand-crafted statistical models used in other approaches with a single deep learning model, and still achieve such high performance in variant calling.
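The calibration claim can be made concrete with a reliability-style calculation: calls are binned by their predicted probability of error and the mean prediction in each bin is compared with the empirically observed error rate. The sketch below uses simulated predictions and illustrates only the evaluation idea, not the analysis code behind Supplementary Fig. 1c.

```python
import numpy as np

def calibration_curve(predicted_error_probs, was_wrong, n_bins=10):
    """Bin calls by predicted error probability and report the observed error
    rate per bin; a well-calibrated caller tracks the y = x diagonal."""
    predicted = np.asarray(predicted_error_probs, dtype=float)
    wrong = np.asarray(was_wrong, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(predicted, bins) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((predicted[mask].mean(), wrong[mask].mean(), int(mask.sum())))
    return rows

# Simulated example: error status drawn from the predicted probabilities,
# so the curve should lie close to the diagonal.
rng = np.random.default_rng(0)
p_err = rng.uniform(0.0, 0.3, size=10_000)
was_wrong = rng.uniform(size=10_000) < p_err
for mean_pred, observed, n in calibration_curve(p_err, was_wrong):
    print(f"predicted {mean_pred:.3f}  observed {observed:.3f}  n={n}")
```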
To further benchmark the performance of DeepVariant, we submitted variant calls for a blinded sample, NA24385, to the US Food and Drug Administration (FDA)-sponsored variant calling Truth Challenge in May 2016 and won the “highest performance” award for SNPs as assessed by an independent team using a different evaluation methodology. For this contest DeepVariant was trained only on data available from the CEPH (Centre d’Etude du Polymorphisme Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. In achieving high accuracy as measured via F1, or the harmonic mean of sensitivity and positive predictive value (PPV), on this new sample (SNP F1 = 99.95%, indel F1 = 98.98%), we show that DeepVariant can generalize beyond its training data.
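F1 here is simply the harmonic mean of sensitivity (recall) and PPV (precision); the two-line check below reproduces the F1 column of Table 1 from the DeepVariant (pFDA) SNP row.

```python
def f1(recall, precision):
    """Harmonic mean of sensitivity (recall) and positive predictive value (precision)."""
    return 2.0 * recall * precision / (recall + precision)

# DeepVariant (pFDA) SNP row of Table 1: recall 0.99944, precision 0.99973.
print(round(f1(0.99944, 0.99973), 5))  # 0.99958, matching the table's F1 column
```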

We then applied the same dataset and evaluation methodology to a variety of both recent and commonly used bioinformatics methods, including the GATK, FreeBayes^{22}, SAMtools^{23}, 16GT^{24} and Strelka^{25} (Table 1). DeepVariant demonstrated more than 50% fewer errors per genome (4,652 errors) compared to the next-best algorithm (9,531 errors). We also evaluated the same set of methods using the synthetic diploid sample CHM1-CHM13^{26} (Table 2). In our tests DeepVariant outperformed all other methods for calling both SNP and indel mutations, without needing to adjust filtering thresholds or other parameters.
We further explored how well DeepVariant’s CNN generalizes beyond its training data. First, a model trained with read data aligned to human genome build GRCh37 and applied to reads aligned to GRCh38 had similar performance (overall F1 = 99.45%) to one trained on GRCh38 and then applied to GRCh38 (overall F1 = 99.53%), thereby demonstrating that a model learned from one version of the human genome reference can be applied to other versions with effectively no loss in accuracy (Supplementary Table 1 and Supplementary Note 2). Second, models trained using human reads and ground-truth data achieved high accuracy when applied to a mouse dataset^{27} (F1 = 98.29%), outperforming training on the mouse data itself (F1 = 97.84%; Supplementary Table 2 and Supplementary Note 3).
Table 1 Evaluation of several bioinformatics methods on the high-coverage, whole-genome sample NA24385
| Method | Type | F1 | Recall | Precision | TP | FN | FP | FP.gt | FP.al | Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepVariant (live GitHub) | Indel | 0.99507 | 0.99347 | 0.99666 | 357,641 | 2,350 | 1,198 | 217 | 840 | Latest GitHub v0.4.1-b4e8d37d |
| GATK (raw) | Indel | 0.99366 | 0.99219 | 0.99512 | 357,181 | 2,810 | 1,752 | 377 | 995 | 3.8-0-ge9d806836 |
| Strelka | Indel | 0.99227 | 0.98829 | 0.99628 | 355,777 | 4,214 | 1,329 | 221 | 855 | 2.8.4-3-gbe58942 |
| DeepVariant (pFDA) | Indel | 0.99112 | 0.98776 | 0.99450 | 355,586 | 4,405 | 1,968 | 846 | 1,027 | pFDA submission May 2016 |
| GATK (VQSR) | Indel | 0.99010 | 0.98454 | 0.99573 | 354,425 | 5,566 | 1,522 | 343 | 909 | 3.8-0-ge9d806836 |
| GATK (flt) | Indel | 0.98229 | 0.96881 | 0.99615 | 348,764 | 11,227 | 1,349 | 370 | 916 | 3.8-0-ge9d806836 |
| FreeBayes | Indel | 0.94091 | 0.91917 | 0.96372 | 330,891 | 29,100 | 12,569 | 9,149 | 3,347 | v1.1.0-54-g49413aa |
| 16GT | Indel | 0.92732 | 0.91102 | 0.94422 | 327,960 | 32,031 | 19,364 | 10,700 | 7,745 | v1.0-34e8f934 |
| SAMtools | Indel | 0.87951 | 0.83369 | 0.93066 | 300,120 | 59,871 | 22,682 | 2,302 | 20,282 | 1.6 |
| DeepVariant (live GitHub) | SNP | 0.99982 | 0.99975 | 0.99989 | 3,054,552 | 754 | 350 | 157 | 38 | Latest GitHub v0.4.1-b4e8d37d |
| DeepVariant (pFDA) | SNP | 0.99958 | 0.99944 | 0.99973 | 3,053,579 | 1,727 | 837 | 409 | 78 | pFDA submission May 2016 |
| Strelka | SNP | 0.99935 | 0.99893 | 0.99976 | 3,052,050 | 3,256 | 732 | 87 | 136 | 2.8.4-3-gbe58942 |
| GATK (raw) | SNP | 0.99914 | 0.99973 | 0.99854 | 3,054,494 | 812 | 4,469 | 176 | 257 | 3.8-0-ge9d806836 |
| 16GT | SNP | 0.99583 | 0.99850 | 0.99318 | 3,050,725 | 4,581 | 20,947 | 3,476 | 3,899 | v1.0-34e8f934 |
| GATK (VQSR) | SNP | 0.99436 | 0.98940 | 0.99937 | 3,022,917 | 32,389 | 1,920 | 80 | 170 | 3.8-0-ge9d806836 |
| FreeBayes | SNP | 0.99124 | 0.98342 | 0.99919 | 3,004,641 | 50,665 | 2,434 | 351 | 1,232 | v1.1.0-54-g49413aa |
| SAMtools | SNP | 0.99021 | 0.98114 | 0.99945 | 2,997,677 | 57,629 | 1,651 | 1,040 | 200 | 1.6 |
| GATK (flt) | SNP | 0.98958 | 0.97953 | 0.99983 | 2,992,764 | 62,542 | 509 | 168 | 26 | 3.8-0-ge9d806836 |
The dataset used in this evaluation is the same as in the precisionFDA Truth Challenge (pFDA). Several methods are compared, including the DeepVariant callset as submitted to the contest and the most recent DeepVariant version from GitHub. Each method was run according to the individual authors’ best-practice recommendations and represents a good-faith effort to achieve best results. Comparisons to the Genome in a Bottle truth set for this sample were performed using the hap.py software, available on GitHub at http://github.com/Illumina/hap.py, using the same version of the GIAB truth set (v3.2.2) used by pFDA. The overall accuracy (F1, sort order within each variant type), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are shown over the whole genome. False positives are further divided into those caused by genotype mismatches (FP.gt) and those caused by allele mismatches (FP.al). Finally, the version of the software used for each method is provided. We present three GATK callsets: GATK (raw), the unfiltered calls emitted by the HaplotypeCaller; GATK (VQSR), the callset filtered with variant quality score recalibration (VQSR); and GATK (flt), the raw GATK callset filtered with run-flt in CHM-eval. See Supplementary Note 7 for more details.
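The “50% fewer errors per genome” comparison quoted in the text can be recomputed directly from Table 1 by summing false negatives and false positives across both variant types for each caller, as in the short check below (counts copied from Table 1).

```python
# (FN, FP) per variant type, copied from Table 1 for the top three callers.
table1_errors = {
    "DeepVariant (live GitHub)": {"Indel": (2350, 1198), "SNP": (754, 350)},
    "Strelka":                   {"Indel": (4214, 1329), "SNP": (3256, 732)},
    "GATK (raw)":                {"Indel": (2810, 1752), "SNP": (812, 4469)},
}

for method, rows in table1_errors.items():
    total_errors = sum(fn + fp for fn, fp in rows.values())
    print(f"{method}: {total_errors} total errors")
# DeepVariant (live GitHub): 4652; Strelka (the next-best caller): 9531; GATK (raw): 9843.
```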
This last experiment is especially demanding as not only do the species differ but nearly all of the sequencing parameters do as well: 50× 2 × 148 bp from an Illumina TruSeq prep sequenced on a HiSeq 2500 for the human sample and 27× 2 × 100 bp reads from a custom sequencing preparation run on an Illumina Genome Analyzer II for mouse^{27}. Thus, DeepVariant is robust to changes in sequencing depth, preparation protocol, instrument type, genome build and even mammalian species, thereby enabling resequencing projects in nonhuman species, which often have no ground-truth data to guide their efforts^{27,28}, to leverage the large and growing ground-truth data in humans.
To further assess its capabilities, we trained DeepVariant to call variants in eight datasets from Genome in a Bottle^{29} that spanned a variety of sequencing instruments and protocols, including whole-genome and exome sequencing technologies, with read lengths from 50 to many thousands of base pairs (Supplementary Tables 3 and 4 and Supplementary Notes 4 and 5). We used the already processed BAM files to introduce additional variability, as these BAMs differed in their alignment and cleaning steps. The results of this experiment all exhibit a characteristic pattern: the candidate variants have the highest sensitivity but a low PPV (mean of 57.6%), which varies substantially by dataset. After retraining, all of the callsets achieve high PPVs (mean of 99.3%) while largely preserving the candidate callset sensitivity (mean loss of 2.3%). The high PPVs and low loss of sensitivity indicate that DeepVariant can learn a model that captures the technology-specific error processes in sufficient detail to separate real variation from false positives with high fidelity for many different sequencing technologies.
Next we analyzed the behavior of DeepVariant on two non-Illumina WGS datasets, one from ThermoFisher (SOLiD) and one from Pacific Biosciences (PacBio), and on two exome datasets from Illumina (TruSeq) and Ion Torrent (Ion Ampliseq). The SOLiD and PacBio WGS datasets have high error rates in the candidate callsets. SOLiD (13.9% PPV for SNPs, 96.2% for indels and 14.3% overall) has many SNP artifacts from the mapping of short, color-space reads.
The PacBio dataset is the opposite, with many false indels (79.8% PPV for SNPs, 1.4% for indels and 22.1% overall) owing to this technology’s high indel error rate. Training DeepVariant to call variants in an exome is likely to be particularly challenging. Exomes have far fewer variants (~20k-30k)^{30} than found in a whole genome (~4-5M)^{31}. The non-uniform coverage and sequencing errors from the exome capture or amplification technology also introduce many false positive variants^{32}. For example, at 8.1%, the PPV of our candidate variants for Ion Ampliseq is the lowest of all our datasets.
Despite the low initial PPVs, the retrained models in DeepVariant separated errors from real variants with high accuracy in the WGS datasets (PPVs of 99.0% and 97.3% for SOLiD and PacBio, respectively), though with a larger loss in sensitivity (candidates 82.5% and final 76.6% for SOLiD and 93.4% and 88.5%, respectively, for PacBio) than other technologies. Furthermore, despite the challenges of retraining deep learning models with limited data, the exome datasets also performed well, with a small reduction in sensitivity (from 91.9% to 89.3% and 94.0% to 92.6% for Ion Ampliseq and TruSeq candidates and final calls, respectively) for a substantial boost in PPV (from 8.1% to 99.7% and 65.3% to 99.3% for Ion and TruSeq, respectively). The performance of DeepVariant compares favorably to those of callsets submitted to the Genome in a Bottle project site using tools developed specifically for each NGS technology and to callsets produced by the GATK or SAMtools (Supplementary Table 5).
The accuracy numbers presented here should not be viewed as the maximum achievable by either the sequencing technology or DeepVariant. For consistency, we used the same model architecture, image representation, training parameters and candidate variant criteria for each technology. Because DeepVariant achieves high PPVs for all technologies, the overall accuracy is effectively driven by the sensitivity of the candidate callset. Improvements to the data processing steps before DeepVariant and to the algorithm used to identify candidate variants are likely to translate into further improvements in overall accuracy, particularly for multi-allelic indels.
Table 2 Evaluation of several bioinformatics methods on the high-coverage, whole-genome synthetic diploid sample CHM1-CHM13
| Method | Type | F1 | Recall | Precision | TP | FN | FP | Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepVariant | Indel | 0.95806 | 0.92868 | 0.98936 | 529,137 | 40,634 | 5,690 | v0.4.1-b4e8d37d |
| Strelka | Indel | 0.95074 | 0.91623 | 0.98796 | 522,039 | 47,732 | 6,363 | 2.8.4-3-gbe58942 |
| 16GT | Indel | 0.94010 | 0.90803 | 0.97452 | 517,369 | 52,402 | 13,527 | v1.0-34e8f934 |
| GATK (raw) | Indel | 0.93268 | 0.89504 | 0.97363 | 509,969 | 59,802 | 13,811 | 3.8-0-ge9d806836 |
| GATK (VQSR) | Indel | 0.91212 | 0.84497 | 0.99087 | 481,441 | 88,330 | 4,437 | 3.8-0-ge9d806836 |
| FreeBayes | Indel | 0.90438 | 0.83025 | 0.99305 | 473,053 | 96,718 | 3,313 | v1.1.0-54-g49413aa |
| SAMtools | Indel | 0.86976 | 0.79089 | 0.96611 | 450,626 | 119,145 | 15,807 | 1.6 |
| DeepVariant | SNP | 0.99103 | 0.98888 | 0.99319 | 3,518,118 | 39,553 | 24,132 | v0.4.1-b4e8d37d |
| Strelka | SNP | 0.98865 | 0.98107 | 0.99636 | 3,490,314 | 67,357 | 12,749 | 2.8.4-3-gbe58942 |
| 16GT | SNP | 0.97862 | 0.98966 | 0.96782 | 3,520,894 | 36,777 | 117,078 | v1.0-34e8f934 |
| FreeBayes | SNP | 0.96910 | 0.94837 | 0.99075 | 3,373,984 | 183,687 | 31,492 | v1.1.0-54-g49413aa |
| GATK (VQSR) | SNP | 0.96895 | 0.94542 | 0.99368 | 3,363,476 | 194,195 | 21,379 | 3.8-0-ge9d806836 |
| SAMtools | SNP | 0.96818 | 0.94386 | 0.99378 | 3,357,947 | 199,724 | 21,012 | 1.6 |
| GATK (raw) | SNP | 0.96646 | 0.95685 | 0.97627 | 3,404,167 | 153,504 | 82,748 | 3.8-0-ge9d806836 |
Several methods are compared, including the most recent DeepVariant version from GitHub. Each method was run according to the individual authors’ best-practice recommendations and represents a good-faith effort to achieve best results. Comparisons to the CHM1-CHM13 truth set were performed using the CHM-eval.kit software, available on GitHub at https://github.com/lh3/CHM-eval, release version 0.5. The overall accuracy (F1, sort order within each variant type), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are shown over the whole genome. Finally, the version of the software used for each method is provided. Note that we present two GATK callsets: GATK (raw), the unfiltered calls emitted by the HaplotypeCaller; and GATK (VQSR), the callset filtered with the VQSR. See Supplementary Note 7 for more details.

Conversely, despite its effectiveness, representing variant calls as images and applying general image-classification models is certainly suboptimal, as we were unable to effectively encode all of the available information in the reads and reference into the three-channel image.
Taken together, our results demonstrate that the deep learning approach employed by DeepVariant can learn a statistical model describing the relationship between the experimentally observed NGS reads and genetic variants in that data for several sequencing technologies. Technologies like DeepVariant change the problem of calling variants from a process of expert-driven, technology-specific statistical modeling to a more automated process of optimizing a general model against data. With DeepVariant, creating an NGS caller for a new sequencing technology becomes a simpler matter of developing the appropriate preprocessing steps, training a deep learning model on sequencing data from samples with ground-truth data, and applying this model to new, even nonhuman, samples (see Supplementary Note 6).

At its core, DeepVariant generates candidate entities with high sensitivity but low specificity, represents the experimental data about each entity in a machine-learning-compatible format and then applies deep learning to assign meaningful biological labels to these entities. This general framework for inferring biological entities from raw, errorful, indirect experimental data is likely to be applicable to other high-throughput instruments.
The results presented in Figure 1, Supplementary Figures 1 and 2, and Supplementary Tables 1-8 were generated with the original, internal version of DeepVariant. Since then we have rewritten DeepVariant to make it available as open source software. As a result, several improvements to the DeepVariant method have been made that are not captured in the analyses presented here, including switching to TensorFlow^{33} to train the model, using the inception_v3 neural network architecture and using a multichannel tensor representation for the genomics data instead of an RGB image. The results in Tables 1 and 2 used the open source version of DeepVariant; the evaluation scripts are available as Supplementary Software. The latest version of DeepVariant is available on GitHub (https://github.com/google/deepvariant/).
Also note that several other deep-learning-based variant callers have since been described^{34,35}.

METHODS

Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.
Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

ACKNOWLEDGMENTS

We thank J. Zook and his collaborators at NIST for their work developing the Genome in a Bottle resources, the Verily sequencing facility for running the NA12878 replicates, and our colleagues at Verily and Google for their feedback on this manuscript and the project in general. This work was supported by internal funding.

AUTHOR CONTRIBUTIONS

R.P. and M.A.D. designed the study, analyzed and interpreted results and wrote the paper. R.P., P.-C.C., D.A., S.S., T.C., A.K., D.N., J.D., N.N., P.T.A., S.S.G., L.D., C.Y.M. and M.A.D. performed experiments and contributed to the software.

COMPETING INTERESTS

D.N., J.D., N.N., P.T.A. and S.S.G. are employees of Verily Life Sciences. P.-C.C., D.A., S.S., T.C. and A.K. are employees of Google Inc. R.P., L.D., C.Y.M. and M.A.D. are employees of Verily Life Sciences and Google Inc. This work was internally funded by Verily Life Sciences and Google Inc.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html. Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Goodwin, S., McPherson, J.D. & McCombie, W.R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333-351 (2016).
2. Nielsen, R., Paul, J.S., Albrechtsen, A. & Song, Y.S. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12, 443-451 (2011).
3. Li, H. Towards better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843-2851 (2014).
4. Goldfeder, R.L. et al. Medical implications of technical accuracy in genome sequencing. Genome Med. 8, 24 (2016).
5. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491-498 (2011).
6. Ding, J. et al. Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics 28, 167-175 (2012).
7. Bragg, L.M., Stone, G., Butler, M.K., Hugenholtz, P. & Tyson, G.W. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput. Biol. 9, e1003031 (2013).
8. Yeo, Z.X., Wong, J.C.L., Rozen, S.G. & Lee, A.S.G. Evaluation and optimisation of indel detection workflows for ion torrent sequencing of the BRCA1 and BRCA2 genes. BMC Genomics 15, 516 (2014).
9. Krizhevsky, A., Sutskever, I. & Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097-1105 (2012).
10. Wu, Y. et al. Google's neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
11. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484-489 (2016).
12. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529-533 (2015).
13. Min, S., Lee, B. & Yoon, S. Deep learning in bioinformatics. Brief. Bioinform. 18, 851-869 (2017).
14. Alipanahi, B., Delong, A., Weirauch, M.T. & Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831-838 (2015).
15. Zhou, J. & Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931-934 (2015).
16. Xiong, H.Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2015).
17. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the Inception architecture for computer vision. Preprint at https://arxiv.org/abs/1512.00567 (2015).
18. Eberle, M.A. et al. A reference dataset of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27, 157-164 (2017).
19. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851-1858 (2008).
20. Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124-1132 (2009).
21. Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359-366 (1989).
22. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/abs/1207.3907 (2012).
23. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).
24. Luo, R., Schatz, M.C. & Salzberg, S.L. 16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model. Gigascience 6, 1-4 (2017).
25. Kim, S. et al. Strelka2: fast and accurate variant calling for clinical sequencing applications. Preprint at bioRxiv https://doi.org/10.1101/192872 (2017).
26. Li, H. et al. New synthetic-diploid benchmark for accurate variant calling evaluation. Preprint at bioRxiv https://doi.org/10.1101/223297 (2017).
27. Keane, T.M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289-294 (2011).
28. Van der Auwera, G. What are the standard resources for non-human genomes? http://gatkforums.broadinstitute.org/gatk/discussion/1243/what-are-the-standard-resources-for-non-human-genomes (2018).
29. Zook, J.M. et al. Extensive Sequencing of Seven Human Genomes to Characterize Benchmark Reference Materials (Cold Spring Harbor, 2015).
30. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285-291 (2016).
31. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68-74 (2015).
32. Robasky, K., Lewis, N.E. & Church, G.M. The role of replicates for error mitigation in next-generation sequencing. Nat. Rev. Genet. 15, 56-62 (2014).
33. Abadi, M., Agarwal, A., Barham, P., Brevdo, E. & Chen, Z. TensorFlow: large-scale machine learning on heterogeneous systems. Preprint at https://arxiv.org/abs/1603.04467 (2015).
34. Luo, R., Sedlazeck, F.J., Lam, T.-W. & Schatz, M. Clairvoyante: a multi-task convolutional deep neural network for variant calling in single molecule sequencing. Preprint at bioRxiv https://doi.org/10.1101/310458 (2018).
35. Torracinta, R. & Campagne, F. Training genotype callers with neural networks. Preprint at bioRxiv https://doi.org/10.1101/097469 (2016).

ONLINE METHODS

Haplotype-aware realignment of reads. Mapped reads are preprocessed using an error-tolerant, local De-Bruijn-graph-based read assembly procedure that realigns them according to their most likely derived haplotype. Candidate windows across the genome are selected for reassembly by looking for any evidence of possible genetic variation, such as mismatching or soft-clipped bases. The selection criteria for a candidate window are very permissive so that true variation is unlikely to be missed. All candidate windows across the genome are considered independently. De Bruijn graphs are constructed using multiple fixed k-mer sizes (from 20 to 75, inclusive, with increments of 5) out of the reference genome bases for the candidate window, as well as all overlapping reads. Edges are given a weight determined by how many times they are observed in the reads. We trim any edges with weight less than three, except that edges found in the reference are never trimmed. Candidate haplotypes are generated by traversing the assembly graphs and the top two most likely haplotypes are selected that best explain the read evidence. The likelihood function used to score haplotypes is a traditional pair HMM with fixed parameters that do not depend on base quality scores. This likelihood function assumes that each read is independent. Finally, each read is then realigned to its most likely haplotype using a Smith-Waterman-like algorithm with an additional affine gap penalty score for homopolymer indels. This procedure updates both the position and the CIGAR string for each read.
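A minimal sketch of the graph-construction and edge-pruning step for a single k-mer size is given below. The haplotype enumeration, pair-HMM scoring and Smith-Waterman realignment are not shown, the sequence-only data structures are illustrative rather than DeepVariant's, and whether the reference contributes to edge weights is an assumption made here.

```python
from collections import defaultdict

def build_debruijn_edges(reference, reads, k, min_weight=3):
    """Build De Bruijn graph edges for one k-mer size: nodes are k-mers, an edge
    joins consecutive overlapping k-mers. Edges with weight < min_weight are
    pruned unless they also occur in the reference, which is never trimmed."""
    def kmer_edges(seq):
        for i in range(len(seq) - k):
            yield seq[i:i + k], seq[i + 1:i + k + 1]

    weights = defaultdict(int)
    for seq in [reference] + list(reads):
        for edge in kmer_edges(seq):
            weights[edge] += 1
    reference_edges = set(kmer_edges(reference))
    return {edge: w for edge, w in weights.items()
            if w >= min_weight or edge in reference_edges}

# The text uses k-mer sizes from 20 to 75, inclusive, in steps of 5.
KMER_SIZES = range(20, 76, 5)

# Toy window with a tiny k purely for demonstration.
ref_window = "ACGTACGTACGT"
reads = ["ACGTACTTACGT"] * 4 + ["ACGTACGTACGT"] * 6
print(len(build_debruijn_edges(ref_window, reads, k=4)))
```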
Finding candidate variants. Candidate variants for evaluation with the deep learning model are identified with the following algorithm. We consider each position in the reference genome independently. For each site in the genome, we collect all the reads that overlap that site. The CIGAR string of each read is decoded and the corresponding allele aligned to that site is determined; these are classified into either a reference-matching base, a reference-mismatching base, an insertion with a specific sequence, or a deletion with a specific length. We count the number of occurrences of each distinct allele across all reads. See Supplementary Note 8 and the current implementation at https://github.com/google/deepvariant/blob/r0.4/deepvariant/make_examples.py#L770.
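The CIGAR walk that classifies each read's allele at a site can be sketched as below. The read representation (a dict with pos, cigar and seq), the insertion-anchoring convention and the omission of base- and mapping-quality filters are simplifications for illustration, not DeepVariant's internal data structures.

```python
from collections import Counter

def allele_at(read, site):
    """Return the allele a read carries at a 0-based reference position: the
    aligned base, 'ins:<seq>' for an insertion anchored at the site, or
    'del:<len>' for a deletion spanning it. Simplified CIGAR walk."""
    ref_pos, read_pos = read["pos"], 0
    for op, length in read["cigar"]:
        if op == "M":                      # aligned bases (match or mismatch)
            if ref_pos <= site < ref_pos + length:
                return read["seq"][read_pos + (site - ref_pos)]
            ref_pos += length
            read_pos += length
        elif op == "I":                    # insertion, anchored at the preceding base
            if ref_pos - 1 == site:
                return "ins:" + read["seq"][read_pos:read_pos + length]
            read_pos += length
        elif op == "D":                    # deletion of reference bases
            if ref_pos <= site < ref_pos + length:
                return "del:%d" % length
            ref_pos += length
        elif op == "S":                    # soft clip consumes read bases only
            read_pos += length
    return None

def count_alleles(reads, site):
    return Counter(a for a in (allele_at(r, site) for r in reads) if a is not None)

reads = [
    {"pos": 100, "cigar": [("M", 10)], "seq": "ACGTACGTAC"},
    {"pos": 100, "cigar": [("M", 10)], "seq": "ACGTTCGTAC"},
    {"pos": 100, "cigar": [("M", 4), ("D", 1), ("M", 5)], "seq": "ACGTCGTAC"},
]
print(count_alleles(reads, 104))  # Counter({'A': 1, 'T': 1, 'del:1': 1})
```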

If any candidates pass our calling thresholds at a site in the genome, we emit a VCF-like record with chromosome, start, reference bases and alternate bases, where reference bases and alternate bases are the VCF-compatible representation of all of the passing alleles.
We filter away any unusable reads (see is_usable_read() below) if a read is marked as a duplicate, if it is marked as failing vendor quality checks, if it is not aligned or is not the primary alignment, if its mapping quality is less than 10, or if it is paired and not marked as properly placed. We further only include read bases as potential alleles if all of the bases in the alleles have a base quality ≥ 10. We emit variant calls only at standard (ACGT) bases in the reference genome. It is possible to force candidate variants to be emitted (randomly with probability p) at sites with no alternate alleles, which are used as homozygous reference training sites. There is no constraint on the size of indels emitted, so long as the exact position and bases are present in the CIGAR string and they are consistent across multiple reads.
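These filters map directly onto SAM flag fields. The sketch below assumes a read object exposing pysam-style attributes (is_duplicate, is_qcfail, is_unmapped, is_secondary, mapping_quality, is_paired, is_proper_pair); it paraphrases the criteria in the text and is not the actual is_usable_read() implementation.

```python
MIN_MAPPING_QUALITY = 10  # threshold stated in the text

def is_usable_read(read):
    """Mirror of the read filters described above, assuming pysam-style
    AlignedSegment attributes on `read`."""
    if read.is_duplicate or read.is_qcfail:          # duplicate or vendor QC failure
        return False
    if read.is_unmapped or read.is_secondary:        # unaligned or non-primary alignment
        return False
    if read.mapping_quality < MIN_MAPPING_QUALITY:   # low mapping quality
        return False
    if read.is_paired and not read.is_proper_pair:   # paired but not properly placed
        return False
    return True
```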
Creating images around candidate variants. The second phase of DeepVariant encodes the reference and read support for each candidate variant into an RGB image. The pseudocode for this component is shown below; it contains all of the key operations to build the image, leaving out for clarity error handling, code to deal with edge cases such as those in which variants occur close to the start or end of the chromosome, and the implementation of nonessential and/or obvious functions. See Supplementary Note 9 and the current implementation at https://github.com/google/deepvariant/blob/r0.4/deepvariant/pileup_image.py.
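Since the pseudocode itself is not reproduced in this snapshot, the sketch below shows one plausible, simplified encoding of a pileup into an RGB array: one row per read, one column per window position, with red carrying base identity, green carrying base quality and blue carrying strand. The channel assignments, image dimensions and the assumption that reads are already projected onto window coordinates are illustrative choices, not DeepVariant's exact scheme.

```python
import numpy as np

BASE_TO_RED = {"A": 250, "C": 180, "G": 110, "T": 40}  # illustrative base-to-intensity map

def encode_pileup_image(ref_window, reads, width=221, max_rows=100):
    """Toy RGB pileup encoding: reference bases in the first rows, then one row
    per read. Reads are dicts with 'bases', 'quals' and 'is_forward', already
    aligned to window columns (a simplification)."""
    image = np.zeros((max_rows, width, 3), dtype=np.uint8)
    for col, base in enumerate(ref_window[:width]):        # reference band at the top
        image[0:5, col] = (BASE_TO_RED.get(base, 0), 255, 255)
    for row, read in enumerate(reads[: max_rows - 5], start=5):
        for col, (base, qual) in enumerate(zip(read["bases"][:width], read["quals"][:width])):
            image[row, col, 0] = BASE_TO_RED.get(base, 0)              # base identity
            image[row, col, 1] = min(int(qual * 255 / 60), 255)        # scaled base quality
            image[row, col, 2] = 255 if read["is_forward"] else 70     # strand
    return image

# Toy usage: a 12-bp window with two short reads.
reads = [
    {"bases": "ACGTACGTACGT", "quals": [30] * 12, "is_forward": True},
    {"bases": "ACGTTCGTACGT", "quals": [25] * 12, "is_forward": False},
]
print(encode_pileup_image("ACGTACGTACGT", reads).shape)  # (100, 221, 3)
```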
The actual implementation of this code uses a reservoir sampler to randomly remove reads at locations where there is excessive coverage. This downsampling occurs conceptually within the reads.get_overlapping() function but occurs in our implementation anywhere where there are more than 10,000 reads in a tiling of 300-bp intervals on the chromosome.
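Reservoir sampling keeps a bounded, uniformly random subset of reads in a single pass; the sketch below is the standard algorithm with the 10,000-read cap mentioned above, leaving out the 300-bp tiling logic.

```python
import random

def reservoir_sample(reads, max_reads=10_000, seed=None):
    """Single-pass uniform downsampling: every input read has the same
    probability of ending up in the returned subset of size <= max_reads."""
    rng = random.Random(seed)
    reservoir = []
    for i, read in enumerate(reads):
        if i < max_reads:
            reservoir.append(read)
        else:
            j = rng.randint(0, i)       # uniform over all reads seen so far
            if j < max_reads:
                reservoir[j] = read
    return reservoir

# Example: downsample 25,000 synthetic read IDs to at most 10,000.
print(len(reservoir_sample(range(25_000))))  # 10000
```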
Deep learning. DistBelief^{36} was used to represent models, train models on labeled images, export trained models, and evaluate trained models on unlabeled images.

We adapted the Inception v2 architecture to our input images and our three-state (hom-ref, het, hom-alt) genotype classification problem. Specifically, we created an input image layer that rescales our input images to 299 × 299 pixels without shifting or scaling our pixel values. This input layer is attached to the ConvNetJuly2015v2^{17} CNN with nine partitions and weight decay of 0.00004. The final output layer of the CNN is a three-class Softmax layer with fully connected inputs to the preceding layer, initialized with Gaussian random weights with s.d. of 0.001 and a weight decay of 0.00004.
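The classifier head described in this paragraph can be sketched with tf.keras primitives. Only the 299 × 299 rescaling, the three-class softmax output, the Gaussian initialization (s.d. 0.001) and the 0.00004 weight decay come from the text; the tiny convolutional trunk is a placeholder for the Inception-style CNN, and the pileup-image input shape is illustrative.

```python
import tensorflow as tf

WEIGHT_DECAY = 0.00004  # value stated in the text
l2_reg = tf.keras.regularizers.l2(WEIGHT_DECAY)

def build_genotype_classifier(input_shape=(100, 221, 3)):
    """Rescale a pileup image to 299 x 299 and classify it into the three
    diploid genotype states (hom-ref, het, hom-alt) with a softmax layer."""
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=input_shape),
        tf.keras.layers.Resizing(299, 299),                 # rescale to 299 x 299 (no mean/std normalization)
        tf.keras.layers.Conv2D(32, 3, activation="relu",    # placeholder trunk, not Inception
                               kernel_regularizer=l2_reg),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu",
                               kernel_regularizer=l2_reg),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(
            3, activation="softmax",                        # hom-ref, het, hom-alt
            kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.001),
            kernel_regularizer=l2_reg),
    ])

model = build_genotype_classifier()
model.summary()
```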