A framework for variation discovery and genotyping using next-generation DNA sequencing data

M.A. DePristo $^{1, *}$ , E. Banks $^{1}$ , R.E. Poplin $^{1}$ , K.V. Garimella $^{1}$ , J.R. Maguire $^{1}$ , C. Hartl $^{1}$ , A.A. Philippakis $^{1, 2, 3}$ , G. del Angel $^{1}$ , M.A. Rivas $^{1, 4}$ , M. Hanna $^{1}$ , A. McKenna $^{1}$ , T.J. Fennell $^{1}$ , A.M. Kernytsky $^{1}$ , A.Y. Sivachenko $^{1}$ , K. Cibulskis $^{1}$ , S.B. Gabriel $^{1}$ , D. Altshuler $^{1, 3, 4}$ , and M.J. Daly $^{1, 3, 4}$

$^{1}$ Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Five Cambridge Center, Cambridge, Massachusetts 02142
$^{2}$ Brigham and Women's Hospital, 75 Francis Street, Boston, MA 02115 USA
$^{3}$ Harvard Medical School, Boston, MA 02116
$^{4}$ Center for Human Genetic Research, Massachusetts General Hospital, Richard B. Simches Research Center, Boston, Massachusetts 02114, USA

Abstract

Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (4×) 1000 Genomes Project datasets.

Introduction

Recent advances in next-generation sequencing (NGS) technology now provide the first cost-effective approach to large-scale resequencing of human samples for medical and population genetics. Projects such as the 1000 Genomes$^{1}$, The Cancer Genome Atlas and numerous large medically-focused exome sequencing projects$^{2}$ are underway in an attempt to elucidate the full spectrum of human genetic diversity$^{1}$ and the complete genetic architecture of human disease. The ability to examine the entire genome in an unbiased way will make possible comprehensive searches for standing variation in common disease; mutations underlying linkages in Mendelian disease$^{3}$; as well as spontaneously arising variation for which no gene-mapping shortcuts are available (e.g., somatic mutations in cancer$^{4-6}$ and de novo mutations$^{7, 8}$ in autism and schizophrenia).

Many capabilities are required to obtain a complete and accurate record of the variation from NGS data. Mapping reads to the reference genome$^{9-12}$ is a first critical computational challenge whose cost necessitates that each read be aligned independently, guaranteeing that many reads spanning indels will be misaligned. The per-base quality scores, which convey the probability that the called base in the read is the true sequenced base$^{13}$, are quite inaccurate and co-vary with features like sequencing technology, machine cycle and sequence context$^{14-16}$. These misaligned reads and inaccurate quality scores propagate into single nucleotide polymorphism (SNP) discovery and genotyping, a general problem that becomes acute in projects with multiple sequencing technologies, generated by many centers using rapidly evolving experimental processing pipelines, such as the 1000 Genomes Project.
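The link between a reported quality score and an error probability can be made concrete. The sketch below assumes the standard Phred convention, Q = -10 log10(p), which this section does not spell out, and uses a hypothetical helper, `empirical_quality`, to compare a reported quality with the mismatch rate observed in a bin of bases. It illustrates what miscalibration means and is not the recalibration procedure implemented in the GATK.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10.0)

def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p)

def empirical_quality(mismatches, observations):
    """Empirical Phred quality for a bin of bases.

    A pseudo-count of one is added to both counts (an assumption of this
    sketch) so that a bin with zero mismatches keeps a finite quality.
    """
    return error_prob_to_phred((mismatches + 1) / (observations + 1))

# Hypothetical bin: bases reported as Q30 (1 error in 1000) that actually
# mismatch the reference about once per 100 bases at non-variant sites.
reported_q = 30
emp_q = empirical_quality(mismatches=999, observations=99_999)
print(f"reported Q{reported_q}, empirical Q{emp_q:.1f}")
```

In a bin like this one the reported score overstates the accuracy of the bases by roughly ten Phred units, which is the kind of discrepancy that recalibration is meant to remove.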

Given well mapped, aligned, and calibrated reads, resolving even simple SNPs, let alone more complex variation such as multi-nucleotide substitutions, insertions and deletions, inversions, rearrangements, and copy number variation, requires sensitive and specific statistical models$^{9-12, 16-24}$. Separating true variation from machine artifacts due to the high rate and context-specific nature of sequencing errors is the outstanding challenge in NGS analysis. Previous approaches have relied on filtering SNP calls that exhibit characteristics outside of their normal ranges, such as occurring at sites with too much coverage$^{18, 20}$, or by requiring non-reference bases to occur on at least three reads in both synthesis orientations$^{21}$. Though effective, such hard filters are frustratingly difficult to develop, require parameterization for each new data set, and are necessarily either restrictive (high specificity, as in 1000 Genomes) or tolerant (high sensitivity, used in Mendelian disease studies, with concomitantly more false positives). Moreover, all of these challenges must be addressed within the context of a proliferation of sequencing technology platforms and study designs (e.g. whole genome shotgun, exome capture sequencing, multiple samples sequenced at shallow coverage), a point not tackled in previous work.
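A hard filter of the kind described above can be sketched in a few lines. The record fields and the `max_depth` default are hypothetical, and the strand rule is read here as "at least three non-reference reads on each strand", which is one interpretation of the text rather than a quoted specification.

```python
from dataclasses import dataclass

@dataclass
class CandidateSNP:
    depth: int          # total reads covering the site
    alt_forward: int    # non-reference reads on the forward strand
    alt_reverse: int    # non-reference reads on the reverse strand

def passes_hard_filter(snp, max_depth=120, min_alt_per_strand=3):
    """Return True if the call survives both fixed thresholds."""
    if snp.depth > max_depth:
        return False    # site has too much coverage
    return (snp.alt_forward >= min_alt_per_strand
            and snp.alt_reverse >= min_alt_per_strand)

# Hypothetical calls: one well supported, one too deep, one strand-biased.
calls = [CandidateSNP(60, 5, 4), CandidateSNP(300, 5, 4), CandidateSNP(60, 6, 0)]
kept = [c for c in calls if passes_hard_filter(c)]
```

Both thresholds have to be chosen again for every new coverage level or capture design, which is the parameterization burden the paragraph describes.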

Here we present a single framework and associated tools capable of discovering high-quality variation and genotyping individual samples using diverse sequencing machines and experimental designs (Figure 1). We present several novel methods addressing the challenges listed above in local realignment, base quality recalibration, multi-sample SNP calling and adaptive error modeling, which we apply to three prototypical NGS data sets (Table 1). In each data set we include CEPH individual NA12878 to demonstrate the consistency of results for this individual across all three data sets.

Results

Here we describe a three-part conceptual framework (Figure 1):

Phase 1: raw read data with platform-dependent biases is transformed into a single, generic representation with well-calibrated base error estimates, mapped to their correct genomic origin, and aligned consistently with respect to one another. Mapping algorithms place reads with an initial alignment on the reference genome, either generated in, or converted to, the technology-independent SAM/BAM reference file format $^{25}$ . Next, molecular duplicates are eliminated (Suppl. Mats), initial alignments are refined by local realignment, and then an empirically accurate per-base error model is determined.

Phase 2: the analysis-ready SAM/BAM files are analyzed to discover all sites with statistical evidence for an alternate allele present among the samples, including SNPs, short indels, and CNVs. CNV discovery and genotyping methods, though part of this conceptual framework, are described elsewhere $^{26}$ .

Phase 3: technical covariates, known sites of variation, genotypes for individuals, linkage disequilibrium, and family and population structure are integrated with the raw variant calls from Phase 2 to separate true polymorphic sites from machine artifacts, and at these sites high-quality genotypes are determined for all samples.

All components after initial mapping and duplicate marking are instantiated in the Genome Analysis ToolKit (GATK)$^{27}$.

Applying the analysis pipeline to HiSeq data at ~6x of NA12878

2.72B bases (~96%) of the 2.83B non-N bases in the autosomal regions and chromosome X of the human reference genome have sufficient coverage to call variants in the 101bp paired-end HiSeq data (Table 1). Even though the HiSeq reads were aligned with the gap-enabled BWA, more than 15% of the reads that span known homozygous indels in NA12878 are misaligned (Supplemental Table 1). Realignment corrects 6.6M of 2.4B total reads in 950K regions covering 21 Mb in the HiSeq data, eliminating 1.8M loci with significant accumulation of mismatching bases (Supplemental Table 2). The initial data processing steps (Phase 1) eliminate ~300K SNP calls, more than one fifth of the raw novel calls, with quality metrics consistent with more than 90% of these SNPs being false positives (Table 2).

The initial 4.2M confidently called non-reference sites include 99.7% and 99.5% of the HapMap3 and 1KG Trio sites genotyped as non-reference in NA12878; at these variant sites the sequencing and genotyping calls are concordant 99.9% of the time (Table 2). Variant quality score recalibration of these initial calls identifies a tranche of SNPs with estimated FDR of <1% containing 3.2M known variants and 362K novel variants, a 90% dbSNP rate, and transition/transversion (Ti/Tv) ratios of 2.15 and 2.05, respectively, consistent with our genome-wide expectations (Box 1). While the variant recalibrator removed ~595K total variants with a Ti/Tv ratio of ~1.2, it retained 99% and 97.3% of the HapMap3 and 1KG Trio non-reference sites. The discordant sites have 100× higher genotype discrepancy rates, suggesting that the sites themselves may be problematic. Almost all of the variants in the 1% tranche are already present in the even higher stringency 0.1% FDR tranche, while analysis of the 10% FDR tranche suggests that some more variants could be obtained, at the cost of many more false positives (Figure 4).

Applying the analysis pipeline to 28Mb exome capture at ~150x of NA12878

The raw data processing tools here eliminated ~450 novel call sites from the pre-MSA/pre-recal call set, representing more than 20% of all the novel calls, with a Ti/Tv of 0.30 (fully consistent with all being false positives), while adding several sites present in HapMap3 and the 1KG Trio. The raw whole exome data call set, at ~150× coverage (Table 1), includes >99% of both the HapMap3 and 1KG Trio non-reference sites within the 28Mb exome target region, with >99.8% genotype concordance at these sites. As with HiSeq, however, even with recalibration and local realignment, the Ti/Tv ratio of the novel sites in the initial SNP calls indicates that more than 50% of these calls are false positives. Variant quality score recalibration, using only 5400 SNPs for training, identifies a high-quality subset of calls that captures >98% of the HapMap3 and 1KG Trio sites in the target regions. The value of the tranches is more pronounced in the whole exome (Figure 4d), where 900 of the 1039 novel calls come from tranches with FDRs under 1%, despite needing to reach into the 10% FDR tranche to include most true positive SNPs.
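The Ti/Tv reasoning used in the last two subsections can be written down as a two-component mixture. The sketch below assumes that false positives behave like random substitutions, which have a Ti/Tv of 0.5 because each base has one possible transition and two possible transversions, and that true variants have a known expected Ti/Tv supplied by the caller. The function names and the example expected values are illustrative; the expectations the authors actually use are given in Box 1, which is not reproduced here.

```python
def transition_fraction(titv):
    """Fraction of SNPs that are transitions for a given Ti/Tv ratio."""
    return titv / (1.0 + titv)

def estimated_true_fraction(observed_titv, expected_titv, error_titv=0.5):
    """Estimate the fraction of calls in a set that are true variants.

    The set is modeled as a mixture of true variants (expected_titv) and
    random errors (error_titv). The mixture is solved in terms of the
    fraction of transitions and the result is clamped to [0, 1].
    """
    t_obs = transition_fraction(observed_titv)
    t_true = transition_fraction(expected_titv)
    t_err = transition_fraction(error_titv)
    fraction = (t_obs - t_err) / (t_true - t_err)
    return max(0.0, min(1.0, fraction))

# A Ti/Tv of 0.30 lies below the random-error value of 0.5, so the
# estimate clamps to 0.0: consistent with every call being an error.
removed_exome_calls = estimated_true_fraction(0.30, expected_titv=3.0)

# A hypothetical call set whose Ti/Tv sits between the two components.
mixed_calls = estimated_true_fraction(1.0, expected_titv=2.1)
```

Because the estimate depends on the assumed expected ratio, it is best read as a rough consistency check on a call set rather than a measured error rate.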

The HiSeq WGS and exome capture datasets differ drastically in their sequencing protocols (WGS vs. hybrid capture), the sequencing machines (HiSeq vs. GA), and the initial alignment tools (BWA vs. MAQ). Nevertheless, the exome call set is remarkably consistent with the subset of calls from HiSeq that overlap the target regions of the hybrid capture protocol. 94% of the HiSeq calls are also called in the final exome set sliced at 10% FDR (data not shown), and at these sites the non-reference discrepancy rate is extremely low (<0.4%). Mapping differences between the aligners used for the HiSeq (BWA) and exome (MAQ) data sets account for the vast majority of these discordant calls, with the remainder of the differences due to limited coverage in the exome, and only a small minority of sites due to differential SNP calling or variant quality score recalibration. Overall, despite the technical differences in the capture and sequencing protocols of the HiSeq and exome data sets, the data processing pipeline presented here uncovers a remarkably consistent set of SNPs in exomes with excellent genotyping accuracy.

Applying the analysis pipeline to low-pass (4×) sequencing of NA12878 with 60 unrelated CEPH individuals

Multi-sample low-pass resequencing poses a major challenge for variant discovery and genotyping because there is so little evidence at any particular locus in the genome for any given sample (Table 1). Consequently, it is in precisely this situation, where there is little signal from true SNPs, that our data processing tools are most valuable, as can be seen from the progression of call sets in Table 2. Local realignment and base quality recalibration eliminate ~650K false positive SNPs among 13M sites, 4× more sites than in the HiSeq data set, with an aggregate Ti/Tv of 0.7. The initial low-pass CEU set includes over 13M called sites among all individuals, of which nearly 7M are novel. NA12878 herself has 2.9M variants, of which 430K are novel. The 4× average coverage limits the sensitivity and concordance of this call set, with only 84% and 80% of HapMap3 and 1KG Trio sites assigned a non-reference genotype in the NA12878 sample, both with a 20% non-reference discrepancy (NRD) rate.
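The two per-sample metrics quoted above, the fraction of known non-reference sites recovered and the NRD rate, can be computed from paired genotype tables. This section does not define NRD formally, so the sketch assumes a common definition: the fraction of discordant genotypes among sites genotyped in both sets, excluding sites where both sets agree on homozygous reference. Genotypes are encoded as alternate allele counts (0, 1 or 2), and the site keys are hypothetical.

```python
def non_reference_sensitivity(truth, calls):
    """Fraction of truth non-reference sites that are called non-reference."""
    nonref = [site for site, g in truth.items() if g > 0]
    if not nonref:
        return 0.0
    found = sum(1 for site in nonref if calls.get(site, 0) > 0)
    return found / len(nonref)

def non_reference_discrepancy(truth, calls):
    """Discordance rate over sites present in both sets.

    Sites where both genotypes are homozygous reference are ignored, so
    that the many trivially concordant reference sites do not mask errors.
    """
    compared = discordant = 0
    for site, g_truth in truth.items():
        if site not in calls:
            continue
        g_call = calls[site]
        if g_truth == 0 and g_call == 0:
            continue
        compared += 1
        if g_truth != g_call:
            discordant += 1
    return discordant / compared if compared else 0.0

# Hypothetical genotypes for one sample at five sites.
truth = {"s1": 1, "s2": 2, "s3": 0, "s4": 1, "s5": 1}
calls = {"s1": 1, "s2": 1, "s3": 0, "s4": 0}   # s5 has no call at all
nrs = non_reference_sensitivity(truth, calls)
nrd = non_reference_discrepancy(truth, calls)
```

In this toy example a missing call (s5) lowers sensitivity but does not enter the discrepancy rate, while an undercalled heterozygote (s4) counts against both.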

The variant quality recalibrator identifies from the 13M potential variants ~6M known and 1.5M novel sites in tranches from 0.1% to 10% FDR. Figure 5a highlights several key features of the data: the allele frequency distribution of these calls closely matches the population genetics expectation, and the vast majority of HapMap3 and 1000 Genomes official CEU call sites are recovered, with the proportion nearing 100% for more common variant sites. Although we selected a 0.1% FDR tranche for analysis here, which contains the bulk of HapMap3, 1KG Trio, and HiSeq sites, another ~700K true sites can be found in the 1% and 10% FDR tranches, albeit among many more false positives. This highest quality tranche includes nearly all variants observed more than 5 times in the samples and 1.4M novel sites, with the SNPs in the tranches at 1% and 10% generally occupying the lower alternate allele frequency range (Figure 5b). The overall picture is clear: calling multiple samples simultaneously, even with only a handful of reads spanning a SNP for any given sample, enables one to detect the vast majority of common variant sites present in the cohort with a high degree of sensitivity.

While the bulk properties of the 61-sample call set are good, we expect the low-pass 4× design to limit variation discovery and genotyping in each sample relative to deep resequencing. In the 61-sample call set we discover ~80% of the non-reference sites in NA12878 according to HapMap3, 1KG Trio, and HiSeq call sets (Table 2). The ~20% of the missed variant sites from these three data sets had little to no coverage in the NA12878 sample in the low-pass data and, therefore, could not be assigned a genotype using only the NGS data, a general limitation of the low-pass sequencing strategy (Table 2, Figure 5c/d). The multi-sample discovery design, however, affords us the opportunity to apply imputation to refine and recover genotypes at sites with little or no sequencing data. Applying genotype-likelihood based imputation with Beagle$^{28}$ to the 61-sample call set recovers an additional 15-20% of the non-reference sites in NA12878 that had insufficient coverage in the sequencing data (Table 2) as well as vastly improving genotyping accuracy (Figure 5c/d).

We further characterize the quality of our low-pass call set as a function of the number of samples included during the discovery process in addition to NA12878 herself. Increasing the number of samples in the cohort rapidly improves both sensitivity and specificity of the call set. As evidence mounts with more samples that a particular site is polymorphic, our confidence in the call increases and the site is more likely to be called (Figure 6a). Distinguishing true positive variants from sequencing and data processing artifacts is more difficult with few samples and, consequently, low aggregated coverage; adding more reads gives the error covariates the power they need for the variant recalibrator to identify erroneous sites (Figures 6b and 6c).

The combination of multi-sample SNP calling, variant quality recalibration using error covariates, and imputation allows one to achieve a high-quality call set, both in aggregate and per-sample, with astoundingly little data. The aggregated 61-sample set at 4× coverage includes only four times as much sequencing data as the HiSeq data, yet we discover 3.2M polymorphic sites in NA12878, which includes 97%, 91%, and 87% of the variants in HapMap3, 1000 Genomes Trio, and HiSeq call sets, respectively, while also finding 5M additional variants among the 60 other samples.

Comparison of hard filtering to variant quality score recalibration

Supplemental Table 3 lists the quality of call sets derived using our previous filtering approaches on all three data sets relative to the adaptive recalibrator described here. In all cases the adaptive approach outperforms the manually optimized hard filtering previously developed for this calling system for the 1000 Genomes pilot data. This highlights two important points. First, a principled integration of all covariates (which may have a complex correlation structure) should and does outperform single manually defined thresholds applied to each covariate independently, with the added benefit of not requiring human intervention. Second, an accurate ranking of discovered putative variants by the probability that each represents a true site permits the definition of tranches for specificity or sensitivity (Figure 4c-e) as appropriate to the needs of the specific project. Although the most permissive tranche includes almost all sites that have any chance of being true polymorphisms, which is critical for projects looking for single large-effect mutations, the vast majority of true polymorphisms are present in the highest quality tranche of data (not shown).
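The idea that a good ranking lets each project pick its own operating point can be sketched as follows. This is a simplified illustration and not the tranche procedure implemented in the GATK: calls are assumed to be already sorted from most to least confident, the false discovery rate of each prefix is estimated from its Ti/Tv using the same random-error assumption (Ti/Tv of 0.5) as above, and `expected_titv` is a value the user must supply.

```python
def fdr_from_counts(n_ti, n_tv, expected_titv, error_titv=0.5):
    """Estimate FDR of a call set from its transition/transversion counts."""
    t_obs = n_ti / (n_ti + n_tv)
    t_true = expected_titv / (1.0 + expected_titv)
    t_err = error_titv / (1.0 + error_titv)
    true_fraction = (t_obs - t_err) / (t_true - t_err)
    return 1.0 - max(0.0, min(1.0, true_fraction))

def tranche_size(ranked_is_transition, target_fdr, expected_titv):
    """Largest number of top-ranked calls whose estimated FDR meets the target.

    ranked_is_transition: booleans, one per call, ordered from the most
    confident call to the least confident one.
    """
    n_ti = n_tv = 0
    best = 0
    for i, is_transition in enumerate(ranked_is_transition, start=1):
        if is_transition:
            n_ti += 1
        else:
            n_tv += 1
        if fdr_from_counts(n_ti, n_tv, expected_titv) <= target_fdr:
            best = i
    return best

# Hypothetical ranking: confident calls look like true variants (2 Ti : 1 Tv),
# the tail is dominated by transversions.
ranked = [True, True, False, True, True, False, False, False, False, False]
strict = tranche_size(ranked, target_fdr=0.1, expected_titv=2.0)
permissive = tranche_size(ranked, target_fdr=0.3, expected_titv=2.0)
```

A stricter target keeps a shorter prefix of the ranking and a more permissive one reaches further down the list, which mirrors the trade-off between the 0.1%, 1% and 10% tranches discussed in the text.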

Comparison of this calling pipeline to Crossbow

To calibrate the additional value of the tools described here we contrast our results with SNPs called on our raw NA12878 exome data using Crossbow$^{29}$, a package combining bowtie, a gapless read mapping tool based on the Burrows-Wheeler transformation$^{30}$, and SoapSNP for SNP detection$^{16}$. We chose to perform this analysis on the exome data because its wide range of read depths and complex error modes make SNP calling a challenge, especially given the small number of novel variants (1000 per sample) expected in this 28 Mb target. Supplemental Table 4 compares the high-level results of the GATK and Crossbow calling pipelines. Key metrics such as the number of novel SNP calls, their Ti/Tv ratio, the number of calls not seen in either the 1000G trio or the HiSeq data, and the high nonsense/read-through rates indicate that the Crossbow call set has lower specificity than the GATK pipeline. This is the case despite applying an aggressive P-value threshold (P < 0.01) for the base quality rank sum test$^{16}$ to filter false positive variants, which reduces the sensitivity to HM3, 1000G, and the HiSeq call sets by >3%. As usual, the intersection set between GATK and Crossbow is more specific but less sensitive than the calls unique to each pipeline (Table 1), a clear sign that despite the advances presented here significant work remains in perfecting calling in data sets like single sample exome capture. The value of the data processing and error modeling presented here is nonetheless clear, and applying local realignment and base quality score recalibration, which are publicly available, easy-to-use modules in the GATK, is likely to improve the results of the Crossbow pipeline.

Discussion

The inaccuracy and covariation patterns of base quality scores differ strikingly between sequencing technologies (Figure 3), and if uncorrected they can propagate into downstream analyses. Accurate recalibration of base quality scores eliminates these sequencer-specific biases (Figure 3) and enables integration of data generated from multiple systems. Although recalibration was developed for early NGS data sets like those from the 1000 Genomes Project pilot, its impact is still significant even for data emerging today on newer sequencers like the HiSeq 2000. Together with local realignment, these two data processing methods eliminate millions of mostly false positive variants while preserving nearly all truly variable sites, such as those in HapMap3 and 1KG Trio sites (Table 2). In single sample data sets, such as HiSeq and exome, without realignment and recalibration these false variants account for more than a fifth of all of the novel calls.

Even with very deep coverage, the naïve Bayesian model for SNP calling results in an initial call set with a surprisingly large number of false-positive calls. While we expect 3.3M known and 330K novel non-reference sites in a single European sample sequenced genome-wide, the initial HiSeq call set contains 3.5M known and 800K novel calls. The excessive number of variable sites, and the low Ti/Tv ratio in particular among the novel calls, implies that ~600K of these variants are likely errors resulting from stochastic and systemic sequencing and alignment errors. The same calculations suggest that a similar fraction of the initial exome calls are likely false positives, while more than 80% of the initial novel low-pass SNP calls are likely errors. The adaptive error modeling developed here enables us to identify these false positive variants based on their dissimilarity to known variants, despite error rates of 50-80% among the novel variants.
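For readers unfamiliar with the term, a naïve Bayesian genotype model can be sketched as below. This section does not give the model's equations, so the code assumes the textbook form: reads are treated as independent, each base's error probability comes from its Phred quality, errors are spread evenly over the three other bases, and a diploid genotype emits either allele with probability one half. The prior `het_prior` and all names are illustrative choices, not values from the paper.

```python
import math

GENOTYPES = ("AA", "AB", "BB")   # A = reference allele, B = alternate allele

def base_likelihood(base, allele, err):
    """P(observed base | true allele) with errors spread over 3 other bases."""
    return 1.0 - err if base == allele else err / 3.0

def genotype_posteriors(pileup, ref, alt, het_prior=1e-3):
    """Posterior over hom-ref, het and hom-alt genotypes at one site.

    pileup: list of (base, phred_quality) tuples for one diploid sample;
    qualities are assumed to be greater than zero.
    """
    priors = {"AA": 1.0 - 1.5 * het_prior, "AB": het_prior, "BB": het_prior / 2}
    alleles = {"AA": (ref, ref), "AB": (ref, alt), "BB": (alt, alt)}
    log_post = {}
    for g in GENOTYPES:
        a1, a2 = alleles[g]
        total = math.log(priors[g])
        for base, q in pileup:
            err = 10 ** (-q / 10.0)
            total += math.log(0.5 * base_likelihood(base, a1, err)
                              + 0.5 * base_likelihood(base, a2, err))
        log_post[g] = total
    # Normalize in log space to avoid underflow at high coverage.
    top = max(log_post.values())
    unnorm = {g: math.exp(v - top) for g, v in log_post.items()}
    norm = sum(unnorm.values())
    return {g: v / norm for g, v in unnorm.items()}

# Hypothetical pileups: a balanced mix of two bases, and reference only.
het_site = genotype_posteriors([("A", 30)] * 10 + [("G", 30)] * 10, "A", "G")
ref_site = genotype_posteriors([("A", 30)] * 20, "A", "G")
```

A model of this form trusts the quality scores and the alignments completely, so systematic errors in either one show up as confident but false variant calls, which is the failure mode the paragraph above quantifies.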

In each step of the pipeline, the improvements derive from the correction of systematic errors made in base calling or read mapping/alignment. By characterizing the specific NGS machine error processes and capturing our certainty, or lack thereof, that a putative variant is truly present in the sample or population, we deliver not a single concrete call set but a continuum from confident to less reliable variant calls for use as appropriate to the specific needs of downstream analysis. Mendelian disease projects can select a more sensitive set of calls with a higher error rate to avoid missing that single, high-impact variant, while community-resource projects like the 1000 Genomes Project can place a high premium on specificity.

The division between SNP discovery and preliminary genotyping and genotype refinement (columns 2 and 3, Figure 1) avoids embedding in the discovery phase assumptions about population structure, sample relationships, and the linkage disequilibrium relationships between variants. Consequently, our calling approach applies equally well to population samples in Hardy-Weinberg equilibrium as to mother-father-child trios or interbreeding families suffering from Mendelian disorders. Critically, our framework produces highly sensitive and specific variation calls without the use of linkage disequilibrium (LD) and so can be applied in situations where LD information is unavailable or weak (many organisms) or would confound analytic goals such as studying LD patterns themselves or comparing Neanderthals and modern humans$^{31}$. Where appropriate, however, imputation can be applied to great value, as we demonstrate in the 61-sample CEU low-pass call set.

The analysis results presented here clearly indicate that even with our best current approaches we are still far from obtaining a complete and accurate picture of genetic variation of all types in even a single sample. Even with the HiSeq 101bp paired-end reads, nearly 4% (~100 Mb) of the potentially callable genome is considered poorly mapped (Suppl. Mats) and analysis of variants within these regions requires care. Nearly two-thirds of the differences between the HiSeq and exome call sets can be attributed to different read mappings between BWA and MAQ.

The challenge of obtaining accurate variant calls from NGS data is substantial. We have developed an analysis framework for NGS data that achieves consistent and accurate results across a wide array of experimental design options, including diverse sequencing machinery and distinct sequencing approaches. We have introduced here an integrated approach to data processing and variation discovery from NGS data that is designed to meet these requirements. Using data generated both at the Broad Institute and throughout the 1000 Genomes Project, we have demonstrated that improved calibration of base quality scores, local realignment to accommodate indels, the simultaneous evaluation of multiple samples from a population, and finally an assessment of the likelihood that an identified variable site is a true biological DNA variant together significantly improve the sensitivity and specificity of variant discovery from NGS data. The impending arrival of yet more NGS technologies makes modular, extensible frameworks like ours even more important: they produce high-quality variant and genotype calls despite the distinct error modes of multiple technologies, for many experimental designs.

Supplementary Material

Refer to Web version on PubMed Central for supplementary material.

Acknowledgments

Many thanks to our colleagues in Medical and Population Genetics and Cancer Informatics and the 1000 Genomes Project who encouraged and supported us during the development of the Genome Analysis Toolkit and associated tools. This work was supported by grants from the National Human Genome Research Institute, including the Large Scale Sequencing and Analysis of Genomes grant (U54 HG003067) and the Joint SNP and CNV calling in 1000 Genomes sequence data grant (U01 HG005208). We would also like to thank our excellent anonymous reviewers for their thoughtful comments.

References

1. The 1000 Genomes Project Consortium. A map of human genome variation from population scale sequencing. Nature. 2010
2. Yi X, et al. Sequencing of 50 Human Exomes Reveals Adaptation to High Altitude. Science. 2010; 329:75-78. [PubMed: 20595611]
3. Ng SB, et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2009
4. Lee W, et al. The mutation spectrum revealed by paired genome sequences from a lung cancer patient. Nature. 2010; 465:473-477. [PubMed: 20505728]
5. Pleasance ED, et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2009
6. Beroukhim R, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010; 463:899-905. [PubMed: 20164920]
7. Roach JC, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010; 328:636-639. [PubMed: 20220176]
8. Conrad DF, et al. Variation in genome-wide mutation rates within and between human families. Submitted.
9. Li R, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009; 25:1966-1967. [PubMed: 19497933]
10. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research. 2008; 18:1851-1858. [PubMed: 18714091]
11. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009; 25:1754-1760. [PubMed: 19451168]
12. Ning Z, Cox AJ, Mullikin JC. SSAHA: a fast search method for large DNA databases. Genome Research. 2001; 11:1725-1729. [PubMed: 11591649]
13. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research. 1998; 8:186-194. [PubMed: 9521922]
14. Brockman W, et al. Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Research. 2008; 18:763-770. [PubMed: 18212088]
15. Li M, Nordborg M, Li LM. Adjust quality scores from alignment and improve sequencing accuracy. Nucleic Acids Res. 2004; 32:5183-5191. [PubMed: 15459287]
16. Li R, et al. SNP detection for massively parallel whole-genome resequencing. Genome Research. 2009; 19:1124-1132. [PubMed: 19420381]
17. Drmanac R, et al. Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science. 2010; 327:78-81. [PubMed: 19892942]
18. Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008; 456:53-59. [PubMed: 18987734]
19. Koboldt D, Chen K, Wylie T, Larson D. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. 2009
20. Wheeler DA, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008; 452:872-876. [PubMed: 18421352]
21. Mokry M, et al. Accurate SNP and mutation detection by targeted custom microarray-based genomic enrichment of short-fragment sequencing libraries. Nucleic Acids Res. 2010:1-9.
22. Shen Y, et al. A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome Research. 2010; 20:273-280. [PubMed: 20019143]
23. Hoberman R, et al. A probabilistic approach for SNP discovery in high-throughput human resequencing data. Genome Research. 2009; 19:1542-1552. [PubMed: 19605794]
24. Malhis N, Jones S. High quality SNP calling using Illumina data at shallow coverage. Bioinformatics. 2010; 26:1029. [PubMed: 20190250]
25. Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25:2078-2079. [PubMed: 19505943]
26. Handsaker RE, Korn JM, Nemesh J, McCarroll SA. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature Genetics. 2011 In press.
27. McKenna AH, et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research. 2010
28. Browning BL, Yu Z. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am J Hum Genet. 2009; 85:847-861. [PubMed: 19931040]
29. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Searching for SNPs with cloud computing. Genome Biol. 2009; 10:R134. [PubMed: 19930550]
30. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009; 10:R25. [PubMed: 19261174]
31. Green RE, et al. A draft sequence of the Neandertal genome. Science. 2010; 328:710-722. [PubMed: 20448178]
32. Gnirke A, et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009; 27:182-189. [PubMed: 19182786]
33. Ng S, Turner E, Robertson P, Flygare S. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009
34. Mckernan KJ, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research. 2009; 19:1527-1541. [PubMed: 19546169]
35. Ebersberger I, Metzler D, Schwarz C, Pääbo S. Genomewide comparison of DNA sequences between humans and chimpanzees. Am J Hum Genet. 2002; 70:1490-1497. [PubMed: 11992255]
36. Freudenberg-Hua Y, et al. Single nucleotide variation analysis in 65 candidate genes for CNS disorders in a representative sample of the European population. Genome Research. 2003; 13:2271-2276. [PubMed: 14525928]
37. Durbin, R.; Eddy, S.; Krogh, A.; Mitchison, G. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press; 1998.
38. Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008; 36:e105. [PubMed: 18660515]
39. HUGO Consortium. Mapping human genetic diversity in Asia. Science. 2009; 326:1541-1545. [PubMed: 20007900]
40. Bishop, C. Pattern recognition and machine learning. Springer: 2006.

Figure 1.
Framework for variation discovery and genotyping from next-generation DNA sequencing. See text for a detailed description.

Figure 2.
Effect of MSA on alignments (NA12878, chr1:1,510,530-1,510,589). IGV visualization of alignments in region chr1:1,510,446-1,510,622 from the (a) Trio NA12878 Illumina reads from 1000 Genomes and (b) NA12878 HiSeq reads before (left) and after (right) multiple sequence realignment. Reads are depicted as arrows oriented by increasing machine cycle; highlighted bases indicate mismatches to the reference: A is green, G is orange, T is red, and deleted bases are dashes; a coverage histogram per base is shown above the reads. Both the 4bp indel (rs34877486) and the C/T polymorphism (rs28788874) are present in dbSNP, as are the artifactual A/G polymorphisms (rs28782535 and rs28783181) resulting from the mis-modeled indel, indicating that these sites are common misalignment errors.

Figure 3.
Raw (violet) and recalibrated (blue) base quality scores for NGS paired-end read sets of NA12878 from (a) Illumina/GA, (b) Life/SOLiD, and (c) Roche/454 lanes from 1000 Genomes, and (d) Illumina/HiSeq. For each technology, the top panel shows reported base quality scores compared to the empirical estimates (Methods); the middle panel shows the difference between the average reported and empirical quality score for each machine cycle, with positive and negative cycle values given for the first and second read in the pair, respectively; the bottom panel shows the difference between reported and empirical quality scores for each of the 16 genomic dinucleotide contexts. For example, the AG context occurs at all sites in a read where G is the current nucleotide and A is the preceding one in the read. Root-mean-square errors (RMSE) are given for the pre- and post-recalibration curves.

Figure 4.
(a) Relationship in the HiSeq call set between strand bias and quality by depth, for genomic locations in HapMap3 (red) and dbSNP (yellow) used for training the variant quality score recalibrator (left) and the same annotations applied to differentiate likely true positive (green) from false positive (purple) novel SNPs. (b,c,d) Quality tranches in the recalibrated HiSeq (b), exome (c), and low-pass CEU (d) calls, beginning with (top) the highest-quality but smallest call set, with an estimated false positive rate among novel SNP calls of <1/1000, and extending to a more comprehensive call set (bottom) that includes effectively all true positives in the raw call set along with more false positive calls, for a cumulative false positive rate of 10%. Each successive call set contains within it the previous tranche's true and false positive calls (shaded bars) as well as tranche-specific calls of both classes (solid bars). The tranche selected for further analyses here is indicated.
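The tranche false positive rates in Figure 4 are estimates. One simple way to see how a novel Ti/Tv ratio relates to such an estimate is a two-component mixture, sketched below. This is an illustrative model and not necessarily the calculation performed by the variant quality score recalibrator: it assumes that true novel SNPs have a Ti/Tv of 2.07 and that false positives have a Ti/Tv of 0.5 (both values are assumptions chosen for the example), and the function name is hypothetical. The two observed ratios are the novel Ti/Tv values reported in Table 2 for the HiSeq raw and final call sets.

```python
def estimated_false_positive_fraction(titv_observed, titv_true=2.07, titv_false=0.5):
    """Estimate the fraction of calls that are false positives from a Ti/Tv ratio.

    Models the call set as a mixture of true variants (Ti/Tv = titv_true) and
    false positives (Ti/Tv = titv_false). Both defaults are assumptions made
    for this illustration.
    """
    def transition_fraction(titv):
        # A Ti/Tv ratio of r means a fraction r / (1 + r) of SNPs are transitions.
        return titv / (1.0 + titv)

    p_obs = transition_fraction(titv_observed)
    p_true = transition_fraction(titv_true)
    p_false = transition_fraction(titv_false)

    # Solve p_obs = (1 - f) * p_true + f * p_false for the mixing fraction f.
    fraction = (p_true - p_obs) / (p_true - p_false)

    # Clamp to [0, 1]: an observed ratio outside the assumed range would
    # otherwise give a fraction below 0 or above 1.
    return min(1.0, max(0.0, fraction))


raw_novel = estimated_false_positive_fraction(1.29)    # HiSeq raw calls, novel Ti/Tv
final_novel = estimated_false_positive_fraction(2.05)  # HiSeq final call set, novel Ti/Tv
print(f"raw: {raw_novel:.3f}, final: {final_novel:.3f}")
```

Under these assumptions the estimate is sensitive to the chosen expected Ti/Tv, so it is best read as a relative indicator of call set quality rather than an exact error rate.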

Figure 5.
Variation discovered among 60 individuals from the CEPH population from 1000 Genomes pilot phase plus low-pass NA12878. (a) Discovered SNPs by non-reference allele count in the 61 CEPH cohort, colored by known (light blue, striped) and novel (dark blue, filled) variation, along with non-reference sensitivity to CEU HapMap3 and 1000 Genomes low-pass variants. (b) Quality and certainty of discovered SNPs by non-reference allele count. The histogram depicts the certainty of called variation broken out into 0.1, 1, and 10% novel FDR tranches. The Ti/Tv ratio is shown for known and novel variation for each allele count, aggregating the novel calls with allele count > 74 due to their limited numbers. (c,d) Genotyping accuracy for NA12878 from reads alone (blue circles) and following genotype-likelihood based imputation (pink squares) called in the 61 sample call set, as assessed by the NRD rate to HiSeq genotypes, as a function of allele count (c) and sequencing depth (d).
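The NR sensitivity and NRD rate used in Figure 5 and Table 2 can be illustrated with a small calculation. The sketch below assumes the commonly used definitions: non-reference sensitivity is the fraction of sites that are non-reference in the comparison set and are also called non-reference, and the non-reference discordance (NRD) rate is the fraction of discordant genotypes among sites where at least one of the two genotypes is non-reference. The function name, the genotype strings, and the toy counts are hypothetical; the Methods remain the authoritative definition.

```python
from collections import Counter

HOM_REF, HET, HOM_VAR = "0/0", "0/1", "1/1"


def nr_metrics(pairs):
    """Return (NR sensitivity, NRD rate) for (called, truth) genotype pairs.

    `pairs` holds one (called_genotype, truth_genotype) tuple per shared site.
    """
    counts = Counter(pairs)
    total = sum(counts.values())

    # Sites where both genotypes are homozygous reference carry no
    # information about non-reference calls and are excluded from NRD.
    informative = total - counts[(HOM_REF, HOM_REF)]
    concordant_nonref = counts[(HET, HET)] + counts[(HOM_VAR, HOM_VAR)]
    nrd = 1.0 - concordant_nonref / informative if informative else float("nan")

    # Sensitivity: of the truth non-reference sites, how many were called
    # non-reference (regardless of whether the exact genotype matches).
    truth_nonref = sum(n for (_, truth), n in counts.items() if truth != HOM_REF)
    called_nonref = sum(
        n for (called, truth), n in counts.items()
        if truth != HOM_REF and called != HOM_REF
    )
    sensitivity = called_nonref / truth_nonref if truth_nonref else float("nan")
    return sensitivity, nrd


# Toy data: 90 matching hets, 5 het calls at hom-var sites,
# 5 missed hets, and 100 sites that are hom-ref in both sets.
toy = (
    [(HET, HET)] * 90
    + [(HET, HOM_VAR)] * 5
    + [(HOM_REF, HET)] * 5
    + [(HOM_REF, HOM_REF)] * 100
)
sensitivity, nrd = nr_metrics(toy)
print(f"NR sensitivity: {sensitivity:.2%}, NRD rate: {nrd:.2%}")
```

Excluding the sites that are homozygous reference in both sets matters because they dominate any genome-wide comparison and would otherwise make almost any call set look concordant.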

sample vs. 101bp HiSeq) as well as the vagaries of sampling at 4× coverage. Because most of these missed sites are common and are consequently called in the other samples, imputation recovers $\sim$50% of these sites. (b,c) Increasing power to identify strand-biased, likely false positive SNP calls with additional samples. Histograms of the Strand Bias annotation at raw variant calls discovered in the low-pass CEU data using NA12878 at 4× combined with one other CEU individual (b) and with 60 other individuals (c), stratified into sites present (green) and not present (purple) in the 1000 Genomes CEU trio.

Table 1
Next-generation DNA sequencing data sets analyzed

	HiSeq	Exome	Low-pass
Samples	NA12878	NA12878	NA12878 + 60 unrelated CEPH individuals
Sequencing technologies	Whole genome shotgun; Illumina HiSeq 2000$^{18}$	Agilent exome hybrid capture$^{32,33}$; Illumina GenomeAnalyzer$^{18}$	Whole genome shotgun; Illumina GenomeAnalyzer$^{18}$; Life/SOLiD$^{34}$; Roche/454$^{20}$
Coverage per sample	60×	150×; 93% of bases at > 20× coverage	4×
Read architecture	101bp paired end	76/101bp paired end	25, 36, 51, 76, 250 (454) bp single and paired ends
Targeted area	2.85 Gb of autosomes and chrX	28 Mb	2.85 Gb of autosomes and chrX
Data set source	Novel, generated for this article	Novel, generated for this article	1000 Genomes Project
Aligner(s)	BWA$^{11}$	MAQ$^{10}$	MAQ$^{10}$; Corona Lite; SSAHA$^{12}$

Raw to recalibrated, imputed SNP calls for HiSeq, Exome, and 61 sample low-pass data sets. Part one of each section summarizes the impact of local

recalibrated reads.

Call set	Site discovery						Comparison to NA12878 variants
	No. of SNPs				Ti/Tv		HM3 concordance		HM3 concordance
	All	Known	Novel	dbSNP%	Known	Novel	NR sensitivity	NRD rate	NR sensitivity	NRD rate
HiSeq
Raw reads, all calls	4.43 M	3.49 M	941 K	78.77	2.05	1.29	99.74	0.10	99.57	0.20
Unique to raw read calls	263 K	37 K	226 K	13.95	1.37	0.70	0.02	37.97	0.09	12.64
Unique to +recal/+MSA calls	9.8 K	1.8 K	8.0 K	18.08	1.38	1.39	0.00	18.18	0.00	9.93
+recal/+MSA, all calls	4.18 M	3.45 M	722 K	82.71	2.06	1.57	99.72	0.09	99.48	0.19
Filtered by variant recalibration	595 K	235 K	360 K	39.44	1.19	1.21	0.67	3.00	2.2	4.31
Final call set	3.58 M	3.22 M	362 K	89.89	2.15	2.05	99.05	0.07	97.28	0.10
Low-pass
Raw reads, all calls	13.4 M	6.5 M	6.9 M	48.77	2.05	1.13	83.97	20.34	80.45	22.53
Unique to raw read calls	670 K	32 K	638 K	4.74	1.19	0.67	0.01	49.21	0.02	52.57


Table 2

Nat Genet. Author manuscript; available in PMC 2011 November 01.

Call set	Site discovery						Comparison to NA12878 variants
	No. of SNPs				Ti/Tv		HM3 concordance		HM3 concordance
	All	Known	Novel	dbSNP%	Known	Novel	NR sensitivity	NRD rate	NR sensitivity	NRD rate
+recal/+MSA, all calls	18.5K	16.8K	1.7K	90.77	3.20	1.61	99.07	0.08	99.13	0.11
Filtered by variant recalibration	1274	609	665	47.8	1.85	0.84	0.59	N/A	0.76	N/A
Final call set	17.2K	16.2K	1039	93.96	3.27	2.57	98.49	0.08	98.38	0.11

Users may view, print, copy, download and text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms
*Corresponding author: depristo@broadinstitute.org.
Author contributions
M.A.D., E.B., R.E.P., K.V.G., J.R.M., C.H., A.A.P., G.d.A., M.A.R., T.J.F., A.Y.S., K.C. conceived of, implemented, and performed analytic approaches. M.A.D., E.B., R.E.P., K.V.G., G.d.A., A.M.K., M.J.D. wrote the manuscript. M.A.D., M.H., A.M. developed Picard and GATK infrastructure underlying the tools implemented here. M.A.D., S.B.G., D.A., M.J.D. led the team.
