
A universal SNP and small-indel variant caller using deep neural networks

Ryan Poplin^{1,2}, Pi-Chuan Chang^{2}, David Alexander^{2}, Scott Schwartz^{2}, Thomas Colthurst^{2}, Alexander Ku^{2}, Dan Newburger^{1}, Jojo Dijamco^{1}, Nam Nguyen^{1}, Pegah T Afshar^{1}, Sam S Gross^{1}, Lizzie Dorfman^{1,2}, Cory Y McLean^{1,2} & Mark A DePristo^{1,2}

Abstract

Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant sites and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.

Calling genetic variants from next-generation sequencing (NGS) data has proven challenging because NGS reads are not only errorful (with error rates from ~0.1-10%) but arise from a complex error process that depends on properties of the instrument, preceding data processing tools, and the genome sequence itself^{1-5}. State-of-the-art variant callers use a variety of statistical techniques to model these error processes to accurately identify differences between the reads and the reference genome caused either by real genetic variants or by errors in the reads^{3-6}. For example, the widely used GATK uses logistic regression to model base errors, hidden Markov models to compute read likelihoods, and naive Bayes classification to identify variants, which are then filtered to remove likely false positives using a Gaussian mixture model with hand-crafted features capturing common error modes^{5}. These techniques allow the GATK to achieve high but still imperfect accuracy on the Illumina sequencing platform^{3,4}. Generalizing these models to other sequencing technologies (for example, Ion Torrent^{7,8}) has proven difficult due to the need to manually retune or extend these statistical models, which is problematic in an area with such rapid technological progress^{1}.
Here we describe a variant caller, called DeepVariant, that replaces the assortment of statistical modeling components with a single deep learning model. Deep learning is a machine learning technique applicable to a variety of domains, including image classification^{9}, translation^{10}, gaming^{11,12} and the life sciences^{13-16}. This toolchain (Fig. 1) begins by finding candidate single nucleotide polymorphisms (SNPs) and indels in reads aligned to the reference genome with high sensitivity but low specificity using standard, algorithmic preprocessing techniques. The deep learning model, using the Inception architecture^{17}, emits probabilities for each of the three diploid genotypes at a locus using a pileup image of the reference and read data around each candidate variant (Fig. 1). The model is trained using labeled true genotypes, after which it is frozen and can then be applied to novel sites or samples. In the following experiments, DeepVariant was trained on an independent set of samples or variants from those being evaluated.
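To make the calling rule concrete, the sketch below applies the decision just described: the CNN's three genotype probabilities are compared and a variant call is emitted only when the most likely genotype is heterozygous or homozygous non-reference. This is an illustrative sketch, not DeepVariant code; the function name and the example probabilities are ours.

```python
import numpy as np

GENOTYPES = ("hom-ref", "het", "hom-alt")

def call_from_probabilities(probs):
    """Emit a variant call if the most likely diploid genotype is non-reference.

    probs is a length-3 array of CNN output probabilities for
    (hom-ref, het, hom-alt) at one candidate site.
    """
    best = int(np.argmax(probs))
    if GENOTYPES[best] == "hom-ref":
        return None  # most likely genotype matches the reference: no call emitted
    return GENOTYPES[best], float(probs[best])

# A site where the CNN strongly favors a heterozygous genotype.
print(call_from_probabilities(np.array([0.02, 0.95, 0.03])))  # ('het', 0.95)
```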
The deep learning model was trained without specialized knowledge about genomics or next-generation sequencing, and yet it can learn to call genetic variants more accurately than state-of-the-art methods. When applied to the Platinum Genomes Project NA12878 data^{18}, DeepVariant produced a callset with better performance than the GATK when evaluated on the held-out chromosomes of the Genome in a Bottle ground-truth set (Supplementary Figs. 1a and 2). For further validation, we sequenced 35 replicates of NA12878 using a standard whole-genome sequencing (WGS) protocol and called variants on 27 replicates using a GATK best-practices pipeline and DeepVariant using a model trained on the other eight replicates (Online Methods). DeepVariant produced more accurate results with greater consistency across a variety of quality metrics (Supplementary Fig. 1b and Supplementary Notes 1, 10 and 11).
Like many variant calling algorithms, the GATK relies on a model that assumes read errors to be independent^{5}. Though this has long been recognized as an invalid assumption^{2}, the true likelihood function that models multiple reads simultaneously is unknown^{5,19,20}. Because DeepVariant presents an image of all of the reads relevant for a putative variant together, the convolutional neural network (CNN) is able to account for the complex dependence among the reads by virtue of being a universal approximator^{21}. This manifests itself as a tight concordance between the estimated probability of error from the likelihood function and the observed error rate (Supplementary Fig. 1c), where DeepVariant’s CNN is well calibrated, more so than the GATK. That the CNN has approximated this true but unknown interdependent likelihood function is the essential technical advance enabling us to replace the hand-crafted statistical models used in other approaches with a single deep learning model, and still achieve such high performance in variant calling.

Received 15 December 2017; accepted 2 August 2018; published online 24 September 2018; doi:10.1038/nbt.4235

Figure 1 DeepVariant workflow overview. Before DeepVariant, NGS reads are first aligned to a reference genome and cleaned up with duplicate marking and, optionally, local assembly. Left box: first, the aligned reads are scanned for sites that may be different from the reference genome. The read and reference data are encoded as an image for each candidate variant site. A trained CNN calculates the genotype likelihoods for each site. A variant call is emitted if the most likely genotype is heterozygous or homozygous non-reference. Middle box: training the CNN reuses the DeepVariant machinery to generate pileup images for a sample with known genotypes. These labeled image + genotype pairs, along with an initial CNN, which can be a random model, a CNN trained for other image classification tests, or a prior DeepVariant model, are used to optimize the CNN parameters to maximize genotype prediction accuracy using a stochastic gradient descent algorithm. After a maximum number of cycles or time has elapsed or the model’s performance has converged, the final trained model is frozen and can then be used for variant calling. Right box: the reference and read bases, quality scores, and other read features are encoded into a red-green-blue (RGB) pileup image at a candidate variant. This encoded image is provided to the CNN to calculate the genotype likelihoods for the three diploid genotype states of homozygous reference (hom-ref), heterozygous (het) or homozygous alternate (hom-alt). In this example a heterozygous variant call is emitted, as the most probable genotype here is “het”. In all panels, blue boxes represent data and red boxes are processes. Details of all processes are given in the Online Methods.
To further benchmark the performance of DeepVariant, we submitted variant calls for a blinded sample, NA24385, to the US Food and Drug Administration (FDA)-sponsored variant calling Truth Challenge in May 2016 and won the “highest performance” award for SNPs as assessed by an independent team using a different evaluation methodology. For this contest DeepVariant was trained only on data available from the CEPH (Centre d’Etude du Polymorphisme Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. In achieving high accuracy as measured via F1, or the harmonic mean of sensitivity and positive predictive value (PPV), on this new sample (SNP F1 = 99.95%, indel F1 = 98.98%), we show that DeepVariant can generalize beyond its training data. We then applied the same dataset and evaluation methodology to a variety of both recent and commonly used bioinformatics methods, including the GATK, FreeBayes^{22}, SAMtools^{23}, 16GT^{24} and Strelka^{25} (Table 1). DeepVariant demonstrated more than 50% fewer errors per genome (4,652 errors) compared to the next-best algorithm (9,531 errors). We also evaluated the same set of methods using the synthetic diploid sample CHM1-CHM13^{26} (Table 2). In our tests DeepVariant outperformed all other methods for calling both SNP and indel mutations, without needing to adjust filtering thresholds or other parameters.
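As a reminder, F1 is the harmonic mean of recall (sensitivity) and precision (PPV); the short check below recomputes the DeepVariant SNP F1 in Table 1 from that table's recall and precision columns (the helper function is ours, included only for illustration).

```python
def f1_score(recall, precision):
    """Harmonic mean of recall (sensitivity) and precision (PPV)."""
    return 2 * recall * precision / (recall + precision)

# DeepVariant (live GitHub), SNPs, recall and precision taken from Table 1.
print(round(f1_score(0.99975, 0.99989), 5))  # 0.99982, matching the F1 column
```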
We further explored how well DeepVariant’s CNN generalizes beyond its training data. First, a model trained with read data aligned to human genome build GRCh37 and applied to reads aligned to GRCh38 had similar performance (overall F1 = 99.45%) to one trained on GRCh38 and then applied to GRCh38 (overall F1 = 99.53%), thereby demonstrating that a model learned from one version of the human genome reference can be applied to other versions with effectively no loss in accuracy (Supplementary Table 1 and Supplementary Note 2). Second, models trained using human reads and ground-truth data achieved high accuracy when applied to a mouse data set^{27} (F1 = 98.29%), outperforming training on the mouse data itself (F1 = 97.84%; Supplementary Table 2 and Supplementary Note 3).
Table 1 Evaluation of several bioinformatics methods on the high-coverage, whole-genome sample NA24385
| Method | Type | F1 | Recall | Precision | TP | FN | FP | FP.gt | FP.al | Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepVariant (live GitHub) | Indel | 0.99507 | 0.99347 | 0.99666 | 357,641 | 2,350 | 1,198 | 217 | 840 | Latest GitHub v0.4.1-b4e8d37d |
| GATK (raw) | Indel | 0.99366 | 0.99219 | 0.99512 | 357,181 | 2,810 | 1,752 | 377 | 995 | 3.8-0-ge9d806836 |
| Strelka | Indel | 0.99227 | 0.98829 | 0.99628 | 355,777 | 4,214 | 1,329 | 221 | 855 | 2.8.4-3-gbe58942 |
| DeepVariant (pFDA) | Indel | 0.99112 | 0.98776 | 0.99450 | 355,586 | 4,405 | 1,968 | 846 | 1,027 | pFDA submission May 2016 |
| GATK (VQSR) | Indel | 0.99010 | 0.98454 | 0.99573 | 354,425 | 5,566 | 1,522 | 343 | 909 | 3.8-0-ge9d806836 |
| GATK (flt) | Indel | 0.98229 | 0.96881 | 0.99615 | 348,764 | 11,227 | 1,349 | 370 | 916 | 3.8-0-ge9d806836 |
| FreeBayes | Indel | 0.94091 | 0.91917 | 0.96372 | 330,891 | 29,100 | 12,569 | 9,149 | 3,347 | v1.1.0-54-g49413aa |
| 16GT | Indel | 0.92732 | 0.91102 | 0.94422 | 327,960 | 32,031 | 19,364 | 10,700 | 7,745 | v1.0-34e8f934 |
| SAMtools | Indel | 0.87951 | 0.83369 | 0.93066 | 300,120 | 59,871 | 22,682 | 2,302 | 20,282 | 1.6 |
| DeepVariant (live GitHub) | SNP | 0.99982 | 0.99975 | 0.99989 | 3,054,552 | 754 | 350 | 157 | 38 | Latest GitHub v0.4.1-b4e8d37d |
| DeepVariant (pFDA) | SNP | 0.99958 | 0.99944 | 0.99973 | 3,053,579 | 1,727 | 837 | 409 | 78 | pFDA submission May 2016 |
| Strelka | SNP | 0.99935 | 0.99893 | 0.99976 | 3,052,050 | 3,256 | 732 | 87 | 136 | 2.8.4-3-gbe58942 |
| GATK (raw) | SNP | 0.99914 | 0.99973 | 0.99854 | 3,054,494 | 812 | 4,469 | 176 | 257 | 3.8-0-ge9d806836 |
| 16GT | SNP | 0.99583 | 0.99850 | 0.99318 | 3,050,725 | 4,581 | 20,947 | 3,476 | 3,899 | v1.0-34e8f934 |
| GATK (VQSR) | SNP | 0.99436 | 0.98940 | 0.99937 | 3,022,917 | 32,389 | 1,920 | 80 | 170 | 3.8-0-ge9d806836 |
| FreeBayes | SNP | 0.99124 | 0.98342 | 0.99919 | 3,004,641 | 50,665 | 2,434 | 351 | 1,232 | v1.1.0-54-g49413aa |
| SAMtools | SNP | 0.99021 | 0.98114 | 0.99945 | 2,997,677 | 57,629 | 1,651 | 1,040 | 200 | 1.6 |
| GATK (flt) | SNP | 0.98958 | 0.97953 | 0.99983 | 2,992,764 | 62,542 | 509 | 168 | 26 | 3.8-0-ge9d806836 |
The dataset used in this evaluation is the same as in the precisionFDA Truth Challenge (pFDA). Several methods are compared, including the DeepVariant callset as submitted to the contest and the most recent DeepVariant version from GitHub. Each method was run according to the individual authors’ best-practice recommendations and represents a good-faith effort to achieve best results. Comparisons to the Genome in a Bottle truth set for this sample were performed using the hap.py software, available on GitHub at http://github.com/Illumina/hap.py, using the same version of the GIAB truth set (v3.2.2) used by pFDA. The overall accuracy (F1, sort order within each variant type), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are shown over the whole genome. False positives are further divided into those caused by genotype mismatches (FP.gt) and those caused by allele mismatches (FP.al). Finally, the version of the software used for each method is provided. We present three GATK callsets: GATK (raw), the unfiltered calls emitted by the HaplotypeCaller; GATK (VQSR), the callset filtered with variant quality score recalibration (VQSR); and GATK (flt), the raw GATK callset filtered with run-flt in CHM-eval. See Supplementary Note 7 for more details.
This last experiment is especially demanding as not only do the species differ but nearly all of the sequencing parameters do as well: 50× 2 × 148 bp from an Illumina TruSeq prep sequenced on a HiSeq 2500 for the human sample and 27× 2 × 100 bp reads from a custom sequencing preparation run on an Illumina Genome Analyzer II for mouse^{27}. Thus, DeepVariant is robust to changes in sequencing depth, preparation protocol, instrument type, genome build and even mammalian species, thereby enabling resequencing projects in nonhuman species, which often have no ground-truth data to guide their efforts^{27,28}, to leverage the large and growing ground-truth data in humans.
To further assess its capabilities, we trained DeepVariant to call variants in eight datasets from Genome in a Bottle^{29} that spanned a variety of sequencing instruments and protocols, including whole-genome and exome sequencing technologies, with read lengths from 50 to many thousands of base pairs (Supplementary Tables 3 and 4 and Supplementary Notes 4 and 5). We used the already processed BAM files to introduce additional variability, as these BAMs differed in their alignment and cleaning steps. The results of this experiment all exhibit a characteristic pattern: the candidate variants have the highest sensitivity but a low PPV (mean of 57.6%), which varies substantially by dataset. After retraining, all of the callsets achieve high PPVs (mean of 99.3%) while largely preserving the candidate callset sensitivity (mean loss of 2.3%). The high PPVs and low loss of sensitivity indicate that DeepVariant can learn a model that captures the technology-specific error processes in sufficient detail to separate real variation from false positives with high fidelity for many different sequencing technologies.
Next we analyzed the behavior of DeepVariant on two non-Illumina WGS datasets, one from ThermoFisher (SOLiD) and one from Pacific Biosciences (PacBio), and on two exome datasets from Illumina (TruSeq) and Ion Torrent (Ion Ampliseq). The SOLiD and PacBio WGS datasets have high error rates in the candidate callsets. SOLiD (13.9% PPV for SNPs, 96.2% for indels and 14.3% overall) has many SNP artifacts from the mapping of short, color-space reads. The PacBio dataset is the opposite, with many false indels (79.8% PPV for SNPs, 1.4% for indels and 22.1% overall) owing to this technology’s high indel error rate. Training DeepVariant to call variants in an exome is likely to be particularly challenging. Exomes have far fewer variants (~20k-30k)^{30} than found in a whole genome (~4-5M)^{31}. The non-uniform coverage and sequencing errors from the exome capture or amplification technology also introduce many false positive variants^{32}. For example, at 8.1%, the PPV of our candidate variants for Ion Ampliseq is the lowest of all our datasets.
Despite the low initial PPVs, the retrained models in DeepVariant separated errors from real variants with high accuracy in the WGS datasets (PPVs of 99.0% and 97.3% for SOLiD and PacBio, respectively), though with a larger loss in sensitivity (candidates 82.5% and final 76.6% for SOLiD and 93.4% and 88.5%, respectively, for PacBio) than other technologies. Furthermore, despite the challenges of retraining deep learning models with limited data, the exome datasets also performed well, with a small reduction in sensitivity (from 91.9% to 89.3% and 94.0% to 92.6% for Ion Ampliseq and TruSeq candidates and final calls, respectively) for a substantial boost in PPV (from 8.1% to 99.7% and 65.3% to 99.3% for Ion and TruSeq, respectively). The performance of DeepVariant compares favorably to those of callsets submitted to the Genome in a Bottle project site using tools developed specifically for each NGS technology and to callsets produced by the GATK or SAMtools (Supplementary Table 5).
The accuracy numbers presented here should not be viewed as the maximum achievable by either the sequencing technology or DeepVariant. For consistency, we used the same model architecture, image representation, training parameters and candidate variant criteria for each technology. Because DeepVariant achieves high PPVs for all technologies, the overall accuracy is effectively driven by the sensitivity of the candidate callset. Improvements to the data processing steps before DeepVariant and the algorithm used to identify candidate variants are likely to translate into further improvements in overall accuracy, particularly for multi-allelic indels.
Table 2 Evaluation of several bioinformatics methods on the high-coverage, whole-genome synthetic diploid sample CHM1-CHM13
| Method | Type | F1 | Recall | Precision | TP | FN | FP | Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepVariant | Indel | 0.95806 | 0.92868 | 0.98936 | 529,137 | 40,634 | 5,690 | v0.4.1-b4e8d37d |
| Strelka | Indel | 0.95074 | 0.91623 | 0.98796 | 522,039 | 47,732 | 6,363 | 2.8.4-3-gbe58942 |
| 16GT | Indel | 0.94010 | 0.90803 | 0.97452 | 517,369 | 52,402 | 13,527 | v1.0-34e8f934 |
| GATK (raw) | Indel | 0.93268 | 0.89504 | 0.97363 | 509,969 | 59,802 | 13,811 | 3.8-0-ge9d806836 |
| GATK (VQSR) | Indel | 0.91212 | 0.84497 | 0.99087 | 481,441 | 88,330 | 4,437 | 3.8-0-ge9d806836 |
| FreeBayes | Indel | 0.90438 | 0.83025 | 0.99305 | 473,053 | 96,718 | 3,313 | v1.1.0-54-g49413aa |
| SAMtools | Indel | 0.86976 | 0.79089 | 0.96611 | 450,626 | 119,145 | 15,807 | 1.6 |
| DeepVariant | SNP | 0.99103 | 0.98888 | 0.99319 | 3,518,118 | 39,553 | 24,132 | v0.4.1-b4e8d37d |
| Strelka | SNP | 0.98865 | 0.98107 | 0.99636 | 3,490,314 | 67,357 | 12,749 | 2.8.4-3-gbe58942 |
| 16GT | SNP | 0.97862 | 0.98966 | 0.96782 | 3,520,894 | 36,777 | 117,078 | v1.0-34e8f934 |
| FreeBayes | SNP | 0.96910 | 0.94837 | 0.99075 | 3,373,984 | 183,687 | 31,492 | v1.1.0-54-g49413aa |
| GATK (VQSR) | SNP | 0.96895 | 0.94542 | 0.99368 | 3,363,476 | 194,195 | 21,379 | 3.8-0-ge9d806836 |
| SAMtools | SNP | 0.96818 | 0.94386 | 0.99378 | 3,357,947 | 199,724 | 21,012 | 1.6 |
| GATK (raw) | SNP | 0.96646 | 0.95685 | 0.97627 | 3,404,167 | 153,504 | 82,748 | 3.8-0-ge9d806836 |
Several methods are compared, including the most recent DeepVariant version from GitHub. Each method was run according to the individual authors’ best-practice recommendations and represents a good faith effort to achieve best results. Comparisons to the CHM1-CHM13 truth set were performed using the CHM-eval.kit software, available on GitHub at https://github.com/lh3/CHM-eval, release version 0.5. The overall accuracy (F1, sort order within each variant type), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are shown over the whole genome. Finally, the version of the software used for each method is provided. Note that we present two GATK callsets: GATK (raw), the unfiltered calls emitted by the HaplotypeCaller; and GATK (VQSR), the callset filtered with the VQSR. See Supplementary Note 7 for more details.

Conversely, despite its effectiveness, representing variant calls as images and applying general image-classification models is certainly suboptimal, as we were unable to effectively encode all of the available information in the reads and reference into the three-channel image.
Taken together, our results demonstrate that the deep learning approach employed by DeepVariant can learn a statistical model describing the relationship between the experimentally observed NGS reads and genetic variants in that data for several sequencing technologies. Technologies like DeepVariant change the problem of calling variants from a process of expert-driven, technology-specific statistical modeling to a more automated process of optimizing a general model against data. With DeepVariant, creating an NGS caller for a new sequencing technology becomes a simpler matter of developing the appropriate preprocessing steps, training a deep learning model on sequencing data from samples with ground-truth data, and applying this model to new, even nonhuman, samples (see Supplementary Note 6).

At its core, DeepVariant generates candidate entities with high sensitivity but low specificity, represents the experimental data about each entity in a machine-learning-compatible format and then applies deep learning to assign meaningful biological labels to these entities. This general framework for inferring biological entities from raw, errorful, indirect experimental data is likely to be applicable to other high-throughput instruments.
The results presented in Figure 1, Supplementary Figures 1 and 2, and Supplementary Tables 1-8 were generated with the original, internal version of DeepVariant. Since then we have rewritten DeepVariant to make it available as open source software. As a result, several improvements to the DeepVariant method have been made that are not captured in the analyses presented here, including switching to TensorFlow^{33} to train the model, using the inception_v3 neural network architecture and using a multichannel tensor representation for the genomics data instead of an RGB image. The results in Tables 1 and 2 used the open source version of DeepVariant; the evaluation scripts are available as Supplementary Software. The latest version of DeepVariant is available on GitHub (https://github.com/google/deepvariant/).
Also note that several other deep-learning-based variant callers have since been described^{34,35}.

METHODS

Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.
Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

ACKNOWLEDGMENTS

We thank J. Zook and his collaborators at NIST for their work developing the Genome in a Bottle resources, the Verily sequencing facility for running the NA12878 replicates, and our colleagues at Verily and Google for their feedback on this manuscript and the project in general. This work was supported by internal funding.

AUTHOR CONTRIBUTIONS

R.P. and M.A.D. designed the study, analyzed and interpreted results and wrote the paper. R.P., P.-C.C., D.A., S.S., T.C., A.K., D.N., J.D., N.N., P.T.A., S.S.G., L.D., C.Y.M. and M.A.D. performed experiments and contributed to the software.

COMPETING INTERESTS

D.N., J.D., N.N., P.T.A. and S.S.G. are employees of Verily Life Sciences. P.-C.C., D.A., S.S., T.C. and A.K. are employees of Google Inc. R.P., L.D., C.Y.M. and M.A.D. are employees of Verily Life Sciences and Google Inc. This work was internally funded by Verily Life Sciences and Google Inc.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html. Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
  1. Goodwin, S., McPherson, J.D. & McCombie, W.R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333-351 (2016).
  2. Nielsen, R., Paul, J.S., Albrechtsen, A. & Song, Y.S. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12, 443-451 (2011).
  3. Li, H. Towards better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843-2851 (2014).
  4. Goldfeder, R.L. et al. Medical implications of technical accuracy in genome sequencing. Genome Med. 8, 24 (2016).
  5. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491-498 (2011).
  6. Ding, J. et al. Feature-based classifiers for somatic mutation detection in tumournormal paired sequencing data. Bioinformatics 28, 167-175 (2012).
  7. Bragg, L.M., Stone, G., Butler, M.K., Hugenholtz, P. & Tyson, G.W. Shining a light on dark sequencing: characterising errors in Ion Torrent PGM data. PLoS Comput. Biol. 9, e1003031 (2013).
  8. Yeo, Z.X., Wong, J.C.L., Rozen, S.G. & Lee, A.S.G. Evaluation and optimisation of indel detection workflows for ion torrent sequencing of the BRCA1 and BRCA2 genes. BMC Genomics 15, 516 (2014).
  9. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process Syst. 25, 1097-1105 (2012).
  10. Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
  11. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484-489 (2016).
  12. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529-533 (2015).
  13. Min, S., Lee, B. & Yoon, S. Deep learning in bioinformatics. Brief. Bioinform. 18, 851-869 (2017).
  14. Alipanahi, B., Delong, A., Weirauch, M.T. & Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831-838 (2015).
  15. Zhou, J. & Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931-934 (2015).
  16. Xiong, H.Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2015).
  17. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. Preprint at https://arxiv.org/abs/1512.00567 (2015).
  18. Eberle, M.A. et al. A reference dataset of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27, 157-164 (2017).
  19. Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18, 1851-1858 (2008).
  20. Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124-1132 (2009).
  21. Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359-366 (1989).
  22. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/abs/1207.3907 (2012).
  23. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).
  24. Luo, R., Schatz, M.C. & Salzberg, S.L. 16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model. Gigascience 6, 1-4 (2017).
  25. Kim, S. et al. Strelka2: fast and accurate variant calling for clinical sequencing applications. Preprint at bioRxiv https://doi.org/10.1101/192872 (2017).
  26. Li, H. et al. New synthetic-diploid benchmark for accurate variant calling evaluation. Preprint at bioRxiv https://doi.org/10.1101/223297 (2017).
  27. Keane, T.M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289-294 (2011).
  28. Van der Auwera, G. What are the standard resources for non-human genomes? http://gatkforums.broadinstitute.org/gatk/discussion/1243/what-are-the-standard-resources-for-non-human-genomes (2018).
  29. Zook, J.M. et al. Extensive Sequencing of Seven Human Genomes to Characterize Benchmark Reference Materials (Cold Spring Harbor, 2015).
  30. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285-291 (2016).
  31. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68-74 (2015).
  32. Robasky, K., Lewis, N.E. & Church, G.M. The role of replicates for error mitigation in next-generation sequencing. Nat. Rev. Genet. 15, 56-62 (2014).
  33. Abadi, M., Agarwal, A., Barham, P., Brevdo, E. & Chen, Z. TensorFlow: large-scale machine learning on heterogeneous systems, 2015. Preprint at https://arxiv.org/abs/1603.04467 (2015).
  34. Luo, R., Sedlazeck, F.J., Lam, T.-W. & Schatz, M. Clairvoyante: a multi-task convolutional deep neural network for variant calling in single molecule sequencing. Preprint at bioRxiv https://doi.org/10.1101/310458 (2018).
  35. Torracinta, R. & Campagne, F. Training genotype callers with neural networks. Preprint at bioRxiv https://doi.org/10.1101/097469 (2016).

ONLINE METHODS

Haplotype-aware realignment of reads. Mapped reads are preprocessed using an error-tolerant, local De-Bruijn-graph-based read assembly procedure that realigns them according to their most likely derived haplotype. Candidate windows across the genome are selected for reassembly by looking for any evidence of possible genetic variation, such as mismatching or soft-clipped bases. The selection criteria for a candidate window are very permissive so that true variation is unlikely to be missed. All candidate windows across the genome are considered independently. De Bruijn graphs are constructed using multiple fixed k-mer sizes (from 20 to 75, inclusive, with increments of 5) out of the reference genome bases for the candidate window, as well as all overlapping reads. Edges are given a weight determined by how many times they are observed in the reads. We trim any edges with weight less than three, except that edges found in the reference are never trimmed. Candidate haplotypes are generated by traversing the assembly graphs and the top two most likely haplotypes are selected that best explain the read evidence. The likelihood function used to score haplotypes is a traditional pair HMM with fixed parameters that do not depend on base quality scores. This likelihood function assumes that each read is independent. Finally, each read is then realigned to its most likely haplotype using a Smith-Waterman-like algorithm with an additional affine gap penalty score for homopolymer indels. This procedure updates both the position and the CIGAR string for each read.
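A minimal sketch of the graph construction and pruning just described, using simple Python data structures of our own choosing (illustrative only, not DeepVariant's implementation): k-mer sizes run from 20 to 75 in steps of 5, edge weights count read support, and edges seen fewer than three times are trimmed unless they appear in the reference.

```python
from collections import defaultdict

K_MER_SIZES = range(20, 76, 5)  # fixed k-mer sizes: 20, 25, ..., 75
MIN_EDGE_WEIGHT = 3

def build_pruned_debruijn_edges(reference, reads, k):
    """Weight edges by read support; trim weight < 3 unless the edge is in the reference."""
    weights = defaultdict(int)
    reference_edges = set()
    for i in range(len(reference) - k):
        reference_edges.add((reference[i:i + k], reference[i + 1:i + k + 1]))
    for read in reads:
        for i in range(len(read) - k):
            weights[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    kept = {e: w for e, w in weights.items()
            if w >= MIN_EDGE_WEIGHT or e in reference_edges}
    for e in reference_edges:   # reference edges are never trimmed
        kept.setdefault(e, weights[e])
    return kept
```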
Finding candidate variants. Candidate variants for evaluation with the deep learning model are identified with the following algorithm. We consider each position in the reference genome independently. For each site in the genome, we collect all the reads that overlap that site. The CIGAR string of each read is decoded and the corresponding allele aligned to that site is determined; these are classified into either a reference-matching base, a reference-mismatching base, an insertion with a specific sequence, or a deletion with a specific length. We count the number of occurrences of each distinct allele across all reads. See Supplementary Note 8 and the current implementation at https://github.com/google/deepvariant/blob/r0.4/deepvariant/make_examples.py#L770.
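Schematically, the per-site allele tally can be pictured as below. The shorthand used to represent a decoded allele (a tag plus its bases or deletion length) is our own, chosen only for illustration; the real implementation lives in make_examples.py linked above.

```python
from collections import Counter

def count_alleles(decoded_alleles):
    """Tally distinct alleles observed at one reference position.

    decoded_alleles holds one entry per overlapping read, already decoded from
    that read's CIGAR string and sequence, for example:
        ("ref", "A")      reference-matching base
        ("snp", "T")      reference-mismatching base
        ("ins", "ATT")    insertion with a specific inserted sequence
        ("del", 2)        deletion with a specific length
    """
    return Counter(decoded_alleles)

counts = count_alleles([("ref", "A"), ("snp", "T"), ("snp", "T"), ("ins", "ATT")])
print(counts.most_common(1))  # [(('snp', 'T'), 2)]
```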

If any candidates pass our calling thresholds at a site in the genome, we emit a VCF-like record with chromosome, start, reference bases and alternate bases, where reference bases and alternate bases are the VCF-compatible representation of all of the passing alleles.
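For illustration, the emitted record might be modeled as a small structure like the one below (a sketch with field names of our choosing, mirroring the CHROM/POS/REF/ALT columns of a VCF line rather than DeepVariant's actual data structures).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CandidateVariant:
    """VCF-like record for a site whose alleles passed the calling thresholds."""
    chromosome: str
    start: int                 # 0-based start of the reference allele
    reference_bases: str
    alternate_bases: List[str] = field(default_factory=list)

# A candidate SNP and a candidate 2-bp deletion at the same site.
record = CandidateVariant("chr20", 9_999_999, "ACT", ["TCT", "A"])
```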
We filter away any unusable reads (see is_usable_read() below) if a read is marked as a duplicate, if it is marked as failing vendor quality checks, if it is not aligned or is not the primary alignment, if its mapping quality is less than 10, or if it is paired and not marked as properly placed. We further only include read bases as potential alleles if all of the bases in the alleles have a base quality ≥ 10. We emit variant calls only at standard (ACGT) bases in the reference genome. It is possible to force candidate variants to be emitted (randomly with probability p) at sites with no alternate alleles, which are used as homozygous reference training sites. There is no constraint on the size of indels emitted, so long as the exact position and bases are present in the CIGAR string and they are consistent across multiple reads.
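A sketch of the read filter described above, assuming pysam-style AlignedSegment flags and fields (the attribute names are pysam's; the function itself is our illustrative reconstruction, not the DeepVariant source):

```python
MIN_MAPPING_QUALITY = 10

def is_usable_read(read):
    """Return True if a read passes the filters described above."""
    if read.is_duplicate or read.is_qcfail:
        return False                      # duplicate, or failed vendor quality checks
    if read.is_unmapped or read.is_secondary:
        return False                      # not aligned, or not the primary alignment
    if read.mapping_quality < MIN_MAPPING_QUALITY:
        return False
    if read.is_paired and not read.is_proper_pair:
        return False                      # paired but not marked as properly placed
    return True
```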
Creating images around candidate variants. The second phase of DeepVariant encodes the reference and read support for each candidate variant into an RGB image. The pseudocode for this component is shown below; it contains all of the key operations to build the image, leaving out for clarity error handling, code to deal with edge cases such as those in which variants occur close to the start or end of the chromosome, and the implementation of nonessential and/or obvious functions. See Supplementary Note 9 and the current implementation at https://github.com/google/deepvariant/blob/r0.4/deepvariant/pileup_image.py.
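The referenced pseudocode does not survive in this snapshot. As an illustrative stand-in, the sketch below encodes a window around a candidate variant into an RGB array with one row for the reference and one row per read; the specific channel assignments and intensity values here are our own choices, not DeepVariant's exact pixel encoding.

```python
import numpy as np

BASE_INTENSITY = {"A": 250, "G": 180, "T": 100, "C": 30}  # illustrative values only

def encode_pileup(window_ref, window_reads, max_reads=95):
    """Build a toy (rows, width, 3) uint8 pileup image for one candidate site.

    window_reads holds (bases, base_qualities, is_reverse_strand) per read,
    already clipped or padded to the window width.
    """
    width = len(window_ref)
    image = np.zeros((max_reads + 1, width, 3), dtype=np.uint8)
    for col, base in enumerate(window_ref):            # row 0: the reference
        image[0, col] = (BASE_INTENSITY.get(base, 0), 255, 255)
    for row, (bases, quals, is_reverse) in enumerate(window_reads[:max_reads], start=1):
        for col, (base, qual) in enumerate(zip(bases, quals)):
            image[row, col] = (BASE_INTENSITY.get(base, 0),   # channel 0: base identity
                               min(255, qual * 6),            # channel 1: rescaled quality
                               0 if is_reverse else 255)      # channel 2: strand
    return image
```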
The actual implementation of this code uses a reservoir sampler to randomly remove reads at locations where there is excessive coverage. This downsampling occurs conceptually within the reads.get_overlapping() function but occurs in our implementation anywhere where there are more than 10,000 reads in a tiling of 300-bp intervals on the chromosome.
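Reservoir sampling keeps a uniform random subset of at most k items from a stream whose length is not known in advance. A minimal sketch follows (standard Algorithm R; the cap of 10,000 reads per 300-bp tile comes from the description above, everything else is our illustration):

```python
import random

MAX_READS_PER_TILE = 10_000  # cap per 300-bp interval, as described above

def reservoir_sample(reads, k=MAX_READS_PER_TILE, seed=None):
    """Return a uniform random sample of at most k reads from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, read in enumerate(reads):
        if i < k:
            reservoir.append(read)
        else:
            j = rng.randint(0, i)   # inclusive of i
            if j < k:
                reservoir[j] = read
    return reservoir
```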
Deep learning. DistBelief^{36} was used to represent models, train models on labeled images, export trained models, and evaluate trained models on unlabeled images. We adapted the Inception v2 architecture to our input images and our three-state (hom-ref, het, hom-alt) genotype classification problem. Specifically, we created an input image layer that rescales our input images to 299 × 299 pixels without shifting or scaling our pixel values. This input layer is attached to the ConvNetJuly2015v2^{17} CNN with nine partitions and weight decay of 0.00004. The final output layer of the CNN is a three-class Softmax layer with fully connected inputs to the preceding layer initialized with Gaussian random weights and s.d. of 0.001 and a weight decay of 0.00004.
The CNN was trained using stochastic gradient descent in batches of 32 images with eight replicated models and RMS decay of 0.9. For the Platinum Genomes, precisionFDA, NA12878 replicates, mouse and genome build experiments, multiple models were trained (using the product of learning rates of [0.00095, 0.001, 0.0015] and momenta [0.8, 0.85, 0.9]) for 80 h or until training accuracy converged, and the model with the highest accuracy on the training set was selected as the final model. For the multiple sequencing technologies experiment, a single model was trained with learning rate 0.0015 and momentum 0.8 for 250,000 update steps. In all experiments unless otherwise noted, the CNN was initialized with weights from the ImageNet model ConvNetJuly2015v2^{17}.
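The original training used DistBelief and an Inception v2 variant, which are not reproducible as written from this description alone. As a rough modern analogue only, the sketch below wires an ImageNet-initialized InceptionV3 (the architecture adopted by the open-source DeepVariant) to a three-class softmax head with the initializer, weight decay and SGD settings quoted above (learning rate 0.0015, momentum 0.8, batches of 32); everything else about this snippet is our assumption, not the authors' code.

```python
import tensorflow as tf

# Inception backbone on 299x299 RGB pileup images, pretrained on ImageNet.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3), pooling="avg")

# Three-class softmax head (hom-ref, het, hom-alt) with small Gaussian init
# and L2 weight decay of 0.00004, mirroring the description above.
genotype_probs = tf.keras.layers.Dense(
    3, activation="softmax",
    kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.001),
    kernel_regularizer=tf.keras.regularizers.l2(0.00004))(backbone.output)

model = tf.keras.Model(backbone.input, genotype_probs)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.0015, momentum=0.8),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
# model.fit(pileup_images, genotype_labels, batch_size=32, ...) on labeled examples.
```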