Scaling accurate genetic variant discovery to tens of thousands of samples 对数千个样本进行准确的遗传变异发现扩展
Ryan Poplin ^(1){ }^{1}, Valentin Ruano-Rubio ^(1){ }^{1}, Mark A. DePristo ^(1){ }^{1}, Tim J. Fennell ^(1){ }^{1}, Mauricio O. Carneiro ^(1){ }^{1}, Geraldine A. Van der Auwera ^(1){ }^{1}, David E. Kling ^(1){ }^{1}, Laura D. Gauthier ^(1,2){ }^{1,2}, Ami Levy-Moonshine ^(1){ }^{1}, David Roazen ^(1){ }^{1}, Khalid Shakir ^(1){ }^{1}, Joel Thibault ^(1){ }^{1}, Sheila Chandran ^(1){ }^{1}, Chris Whelan ^(1){ }^{1}, Monkol Lek ^(1,2){ }^{1,2}, Stacey Gabriel ^(1){ }^{1}, Mark J Daly ^(1,2){ }^{1,2}, Ben Neale ^(1,2){ }^{1,2}, Daniel G. MacArthur ^(1,2){ }^{1,2}, and Eric Banks ^(1){ }^{1} Ryan Poplin ^(1){ }^{1} 、Valentin Ruano-Rubio ^(1){ }^{1} 、Mark A. DePristo ^(1){ }^{1} 、Tim J. Fennell ^(1){ }^{1} 、Mauricio O. Carneiro ^(1){ }^{1} 、Geraldine A. Van der Auwera ^(1){ }^{1} 、David E. Kling ^(1){ }^{1} 、Laura D. Gauthier ^(1,2){ }^{1,2} 、Ami Levy-Moonshine ^(1){ }^{1} 、David Roazen ^(1){ }^{1} 、Khalid Shakir ^(1){ }^{1} 、Joel Thibault ^(1){ }^{1} 、Sheila Chandran ^(1){ }^{1} 、Chris Whelan ^(1){ }^{1} 、Monkol Lek ^(1,2){ }^{1,2} 、Stacey Gabriel ^(1){ }^{1} 、Mark J Daly ^(1,2){ }^{1,2} 、Ben Neale ^(1,2){ }^{1,2} 、Daniel G. MacArthur ^(1,2){ }^{1,2} 和 Eric Banks ^(1){ }^{1}^(1){ }^{1} Broad Institute, 75 Ames Street, Cambridge, MA 02142 宽泛研究所,75 Ames 街,剑桥,马萨诸塞州 02142^(2){ }^{2} Massachusetts General Hospital, Boston, MA 02114 马萨诸塞总医院,波士顿,MA 02114
Abstract 摘要
Comprehensive disease gene discovery in both common and rare diseases will require the efficient and accurate detection of all classes of genetic variation across tens to hundreds of thousands of human samples. We describe here a novel assembly-based approach to variant calling, the GATK HaplotypeCaller (HC) and Reference Confidence Model (RCM), that determines genotype likelihoods independently per-sample but performs joint calling across all samples within a project simultaneously. We show by calling over 90,000 samples from the Exome Aggregation Consortium (ExAC) that, in contrast to other algorithms, the HC-RCM scales efficiently to very large sample sizes without loss in accuracy; and that the accuracy of indel variant calling is superior in comparison to other algorithms. More importantly, the HC-RCM produces a fully squared-off matrix of genotypes across all samples at every genomic position being investigated. The HCRCM is a novel, scalable, assembly-based algorithm with abundant applications for population genetics and clinical studies. 全面的常见和罕见疾病基因发现将需要在数以万计到数十万的人类样本中有效和准确地检测所有类型的遗传变异。我们在此描述了一种新颖的基于组装的变异检测方法,即 GATK HaplotypeCaller(HC)和 Reference Confidence Model(RCM),它能独立地确定每个样本的基因型可能性,但同时对项目中的所有样本进行联合检测。我们通过检测来自 Exome Aggregation Consortium (ExAC)的 90,000 多个样本发现,与其他算法相比,HC-RCM 能够高效地扩展到非常大的样本量而不会损失准确性;而且缩进变异检测的准确性也优于其他算法。更重要的是,HC-RCM 在所有被调查的基因组位置上产生了一个完全对称的基因型矩阵。HC-RCM 是一种新颖、可扩展的基于组装的算法,在种群遗传学和临床研究中有广泛的应用。
1 Introduction 1 引言
Next-generation sequencing technologies combined with large population cohorts are facilitating the identification of rare allelic variants for population 下一代测序技术与大规模人群队列相结合,有助于识别人群中的罕见等位基因变异
and medical genetic studies [18,4,25,14][18,4,25,14]. Although high-throughput sequencing platforms currently generate up to four billion paired-end reads per run, low signal-to-noise ratios necessitate the use of data processing algorithms that differentiate true variants from machine-generated artifacts [19, 11]. 和医学遗传学研究 [18,4,25,14][18,4,25,14] 。尽管高通量测序平台当前每次运行可产生高达 40 亿配对末端读数,但信噪比较低仍需要使用数据处理算法来区分真实变异和机器产生的伪迹[19, 11]。
The first generation of variant calling algorithms scanned short read data mapped to a reference sequence to identify mismatches [1,6,15,17,22][1,6,15,17,22]. Reference sequence mismatches were probabilistically sorted using algorithms that took into account the reported base quality and context prior to calling variants [22]. These position-based or so-called “pileup” callers, which include SAMtools Li et al 2009 and the GATK’s UnifiedGenotyper (UG), proved highly effective at calling small nucleotide polymorphisms (SNPs) but are unable to attain high accuracy for indel variants due to their reliance on independent short read alignments to a reference sequence [1,6,15,17,22][1,6,15,17,22]. In contrast, assembly-based variant callers including Platypus [22] and the GATK’s HaplotypeCaller (HC), construct theoretical haplotypes via de Bruijn-like graphs from a consensus of the reads covering the genomic region [13,22,7,10,20][13,22,7,10,20]. Although assembly-based algorithms call indels with greater accuracy, they do not scale well due in part to exponential increases in graph complexity with the number of samples. Thus, calling variants jointly on large numbers of samples becomes increasingly computationally intensive until the requirements exceed hardware performance limitations [6]. 第一代变异检测算法扫描了映射到参考序列的短读取数据,以识别错配 [1,6,15,17,22][1,6,15,17,22] 。使用考虑报告的碱基质量和上下文的概率算法对参考序列错配进行排序,从而调用变体[22]。这些基于位置或所谓的"堆叠"呼叫器,包括 SAMtools Li et al 2009 和 GATK 的 UnifiedGenotyper(UG),在呼叫小核苷酸多态性(SNP)方面非常有效,但由于依赖于独立的短读取与参考序列的对齐,无法达到高精度的插入缺失(indel)变体 [1,6,15,17,22][1,6,15,17,22] 。相比之下,包括 Platypus[22]和 GATK 的 HaplotypeCaller(HC)在内的基于组装的变体呼叫器,通过来自覆盖基因组区域的读取的共识构建理论单倍体 [13,22,7,10,20][13,22,7,10,20] 。尽管基于组装的算法能更准确地呼叫插入缺失,但由于图形复杂性随样本数量呈指数增加,因此可扩展性很差。因此,在大量样本上联合呼叫变体变得计算密集,直到要求超过硬件性能限制[6]。
One naive solution would be simply to discover and genotype variants in each sample separately and then merge the independently discovered variants across all samples. In this way, data from a single sample would only be retained at positions in which that sample is variant. A major shortcoming of this approach is the only genotype calls present would be heterozygous or homozygous variant. It would be impossible to determine whether missing genotype calls were homozygous reference or void of any read data at all. As an illustration of this shortcoming, take the case where a mutation is discovered in a single sample within a cohort; a naively merged list would merely contain “no-call” genotypes for all other samples making estimating the population allele frequency difficult. What is required for the most accurate estimates of population allele frequencies is instead to jointly call variants over all samples together, creating a “squared-off matrix” of samples by genomic position where each cell in the matrix contains genotype likelihoods for all possible alleles (including the reference) at the corresponding genomic position. In addition, is has been previously shown [6] that joint calling has 一个 naïve 的解决方案是单独在每个样本中发现和测序变体,然后将独立发现的变体合并到所有样本中。这样做的话,只会保留在该样本中存在变体的位点的数据。这种方法的主要缺点是只能得到杂合或纯合的变体基因型,无法确定缺失的基因型是纯合参考还是根本没有读取数据。举个例子,如果在一个样本中发现了一个突变,则简单合并的列表中其他样本的基因型都会是"无呼叫",这会使得难以估计群体等位基因频率。相反,要准确估计群体等位基因频率,需要对所有样本进行联合呼叫,创建一个"方阵"的样本-基因组位点矩阵,每个单元格包含该位点所有可能等位基因(包括参考等位基因)的基因型似然值。此外,先前的研究[6]表明,联合呼叫具有优于单独呼叫的性能。
Figure 1: GATK Variant Calling Best Practices. The HaplotypeCaller takes in analysis-ready reads and performs variant calling per sample to produce unfiltered genotype likelihoods. 图 1:GATK 变量调用最佳实践。HaplotypeCaller 接收分析就绪的 reads 并对每个样本执行变量调用以产生未经过滤的基因型似然。
several other advantages: it improves sensitivity at low coverage positions and powers the training of a more accurate statistical filtering model, which is a crucial step in the variant discovery pipeline (Figure 1). 几个其他优势:它提高了低覆盖位置的敏感性,并为更精确的统计过滤模型的训练提供动力,这是变体发现管道中的关键步骤(图 1)。
To solve the problem of joint calling large cohorts using graph-based assembly without introducing intractable computational complexity, the Reference Confidence Model was developed as a module for the HC (Figure 2). In brief, this combined HaplotypeCaller-Reference Confidence Model (HCRCM) algorithm constructs candidate haplotypes for each individual sample in the cohort, avoiding the exponential complexity of a graph describing all samples simultaneously and ultimately accelerating variant calling compared with the previous approach to joint calling. Constructed haplotypes are used to calculate genotype likelihoods using a pair-hidden Markov model (pair-HMM), and these likelihoods are stored in an intermediate file for subsequent joint variant calling across all samples (which includes allele frequency estimation and genotype assignment). In addition to likelihoods for all alleles explicitly observed a sample’s reads, the model generates the likelihood over the set of unobserved, non-reference alleles. Here it is shown that the accuracy of the HC-RCM algorithm is comparable to or better than other 为了解决使用基于图形的组装对大型群体进行联合呼叫而不引入复杂计算的问题,开发了参考置信度模型作为 HC(图 2)的一个模块。简而言之,这种结合 HaplotypeCaller - 参考置信度模型(HCRCM)算法为队列中的每个个体样本构建候选单倍型,避免了同时描述所有样本的图形的指数复杂性,从而加快了与之前联合呼叫方法相比的变异呼叫。构造的单倍型用于使用对 HMM(对 HMM)计算基因型似然度,并将这些似然度存储在中间文件中,以便在所有样本上进行后续的联合变异呼叫(包括等位基因频率估计和基因型分配)。除了观察到读数的所有等位基因的似然度之外,该模型还生成未观察到的非参考等位基因集的似然度。这里显示,HC-RCM 算法的准确性与其他算法相当或更好。
Figure 2: HaplotypeCaller-Reference Confidence Model overview. The four basic steps of variant calling with the HC-RCM including identification of the ActiveRegions, assembly of candidate haplotypes using de Bruijn-like graphs, determination of the per-read likelihoods of candidate haplotypes using a pair-HMM, and genotype assignment. 图 2:HaplotypeCaller-参考基因型信赖度模型概述。使用 HC-RCM 进行变异检测的 4 个基本步骤包括:识别 ActiveRegions、使用 de Bruijn 式图谱组装候选单倍型、利用配对隐马尔可夫模型确定每条读数的候选单倍型的似然度、以及基因型赋值。
widely used tools and that the computational performance on large sample sizes is superior. 广泛使用的工具,且在大样本量上的计算性能优于其他。
2 Results 2 结果
2.1 Brief Methodology 2.1 简要方法论
The role of the HaplotypeCaller within the GATK variant calling pipeline is shown in Figure 1 (see http://www.broadinstitute.org/gatk for more details regarding the GATK Best Practices). High-throughput sequencing analyses start with raw read data (FASTQ files) that are converted to recalibrated, analysis-ready reads in the Binary sequence Alignment/Map (BAM) format using SAMtools and GATK modules [16,6][16,6]. Reads are initially aligned to a reference sequence e.g. human reference genome (GRCh37) using an mapping/aligner program e.g. Burrows-Wheeler Aligner (BWA) 在 GATK 变体调用管道中, HaplotypeCaller 的作用如图 1 所示(有关 GATK 最佳实践的更多详细信息,请参见 http://www.broadinstitute.org/gatk)。高通量测序分析从原始读数数据(FASTQ 文件)开始,使用 SAMtools 和 GATK 模块将其转换为经过校准的、分析就绪的二进制序列比对/映射(BAM)格式的读数。读数最初使用映射/对准程序(如 Burrows-Wheeler Aligner, BWA)与参考序列(如人类参考基因组 GRCh37)对齐。 [16,5][16,5].
For computational efficiency, variant calling is focused on “ActiveRegions”, or regions of the genome that vary significantly from the reference. ActiveRegions are defined by genomic intervals where the BWA-aligned reads contain evidence that they are in disagreement with the reference, using criteria such as base mismatches, insertion/deletion markers in the reads and high base-quality soft clips [16]. The reads from these ActiveRegions are split into overlapping subsequences, or k-mers, and subsequently reassembled using de-Bruijn-like graphs into candidate haplotypes. This is followed by the construction of a pair-HMM using state transition probabilities derived from the read base qualities. This pair-HMM is then used to calculate the likelihood that each read was derived from each haplotype (Supplemental Figure 5). These likelihoods are used to calculate the raw genotype likelihoods for each candidate variant. Ultimately, genotype likelihoods across all samples are used to call raw variants for the cohort. 为了提高计算效率,变异检测集中在"活动区域"上,即基因组中与参考序列存在明显差异的区域。活动区域是由 BWA 比对的读数证明它们与参考序列不一致的基因组区间定义的,包括碱基错配、插入/删除标记和高质量的软剪切[16]。这些活动区域中的读数被拆分成重叠的子序列(即 k-mer),然后使用类 de Bruijn 图重新组装成候选单倍型。接下来构建一对隐马尔可夫模型,使用读数碱基质量得到的状态转移概率。利用这对隐马尔可夫模型计算每个读数来自每个单倍型的似然性(补充图 5)。这些似然性用于计算每个候选变异的原始基因型似然性。最终,整个队列中所有样本的基因型似然性用于检测原始变异。
The raw, unfiltered output of the HaplotypeCaller is not appropriate to be used for downstream analyses. The HC-RCM aims to attain maximum sensitivity with the consequence of retaining some false positive variants. The subsequent steps of the GATK Best Practices pipeline (Figure 1) identify and filter out false positives to achieve maximum accuracy, as described in previous in previous work [6]. For the accuracy results presented here, the HC-RCM output is subjected to joint genotyping and variant recalibration in accordance with the GATK Best Practices. 来自 HaplotypeCaller 的原始、未经过滤的输出不适合用于后续分析。HC-RCM 旨在实现最大的敏感性,但这也意味着保留了某些假阳性变异。GATK 最佳实践流水线(图 1)的后续步骤可以识别并过滤掉这些假阳性变异,从而实现最高的准确性,如之前的研究[6]所述。在这里介绍的准确性结果中,HC-RCM 输出经过联合基因分型和变异重新校准,以符合 GATK 最佳实践。
For further details on the HC-RCM algorithm, see the Supplemental Methods section. 有关 HC-RCM 算法的更多详细信息,请参见补充方法部分。
2.2 Comparative Analyses 2.2 对比分析
To validate the sensitivity and specificity of this new variant calling algorithm, a set of analyses was performed comparing the HaplotypeCaller with three other variant calling algorithms (Figure 3). These other algorithms consisted of two pileup variant callers, SAMtools[17] and the GATK Unified Genotyper (UG) [6], as well as Platypus, another assembly-based algorithm [22]. SNPs and indels were called with each of the four variant calling algorithms from whole genome and exome-capture sequencing data from the well characterized CEU HapMap trio [2]. Variant caller accuracy was evaluated using the Genome in a Bottle standard (GiaB) [26] as described in the supplemental methods. 为了验证这种新的变体呼叫算法的灵敏度和特异性,进行了一系列分析,将 HaplotypeCaller 与其他三种变体呼叫算法进行了比较(图 3)。这些其他算法包括两种堆叠变体 caller,SAMtools[17]和 GATK 统一基因分型器(UG)[6],以及 Platypus,另一种基于组装的算法[22]。使用四种变体呼叫算法从全基因组和外显子捕获测序数据中呼叫了单核苷酸多态性(SNP)和插入缺失(indel)。使用瓶中基因组(GiaB)[26]标准评估了变体 caller 的准确性,如补充方法中所述。
Figure 3: Variant caller comparison. Accuracy comparisons of both SNP and indel variant calls over all three samples in the CEU trio by the GATK HaplotypeCaller, Platypus, SAMtools, and Unified Genotyper. WGS data was PCR-free, 250bp paired-end reads sequenced by an Illumina HiSeq. WES data was 76 bp paired-end reads sequenced by an Illumina HiSeq. Sensitivity is plotted as false negative rate (FNR), for which lower values equate to better sensitivity. Specificity is plotted as false discovery rate FDR, for which lower values are also better. For genotype concordance higher values indicate better genotype call accuracy. 图 3: 变异检测器比较。对 CEU 三联体三个样本中 SNP 和缩插变异的准确性比较,使用 GATK HaplotypeCaller、Platypus、SAMtools 和 Unified Genotyper。全基因组测序数据采用无 PCR 的 250 个碱基对端序列,由 Illumina HiSeq 测序。全外显子测序数据采用 76 个碱基对端序列,由 Illumina HiSeq 测序。灵敏度以假阴性率 (FNR) 表示,值较低表示灵敏度较高。特异性以假发现率 FDR 表示,值较低表示特异性较高。基因型一致性,值较高表示基因型检测准确性较高。
Figure 3 shows the results of these analyses, which indicate that the HaplotypeCaller effectively calls both SNP and indel variants from both whole genome and exome-captured sequence data. Although Figure 3 shows that SAMtools called a greater number of both SNP and indel variants compared to the other callers, the specificity of these calls was substantially lower than the other algorithms, especially for indel variants, as indicated by high false discovery rates (FDR). In contrast, the HaplotypeCaller algorithm called large numbers of SNP and indel variants with high sensitivity and low FDRs on both whole genome and exome captured DNA, suggesting that the HaplotypeCaller has both high sensitivity and specificity. 图 3 显示了这些分析的结果,表明 HaplotypeCaller 有效地从全基因组和外显子捕获序列数据中调用 SNP 和插入缺失变体。虽然图 3 显示 SAMtools 调用的 SNP 和插入缺失变体数量比其他调用器多,但这些调用的特异性明显低于其他算法,特别是对于插入缺失变体,表现为高假发现率(FDR)。相比之下,HaplotypeCaller 算法在全基因组和外显子捕获 DNA 上调用大量 SNP 和插入缺失变体,具有高灵敏度和低 FDR,表明 HaplotypeCaller 具有高灵敏度和高特异性。
To determine algorithm genotype assignment accuracy, the genotype concordance was determined. Genotype concordance measures the accuracy of genotype assignments of true positive variants to a gold standard (GiaB). In contrast to the other variant calling algorithms, the HaplotypeCaller had exceptionally high genotype concordance values for indel variant calls from both whole genome and exome captured data. These results suggest that the HaplotypeCaller calls genotypes with superior accuracy compared with the other algorithms. 为了确定算法基因型分配的准确性,确定了基因型一致性。基因型一致性测量了将真阳性变异体分配给金标准(GiaB)的基因型分配的准确性。与其他变异检测算法相比,HaplotypeCaller 对从全基因组和外显子捕获数据获得的缺失/插入变异的基因型一致性值都非常高。这些结果表明,与其他算法相比,HaplotypeCaller 呼叫基因型的准确性更优。
2.3 Scaling 2.3 缩放
The Reference Confidence Model was integrated into the HaplotypeCaller to enable scaling of joint calling up to hundreds of thousands of exomes. To determine how this algorithm performs on large sample sets, runtime was examined as a function of the number of samples. Figure 4 shows the runtimes for variant calling using GATK HaplotypeCaller, GATK UnifiedGenotyper, Platypus and SAMtools. For the 20X exomes used here, HaplotypeCaller requires slightly more CPU-hours than the other algorithms for up to 250 samples. Beyond 250 samples, HaplotypeCaller runtime continues to increase linearly with the number of samples, while the other algorithms increase superlinearly. Furthermore, the efficient scaling of the HC-RCM algorithm was leveraged to produce a joint call set with over 90,000 exomes for the Exome Aggregation Consortium [14]. 参考置信度模型集成到 HaplotypeCaller 中,以实现对成千上万外显子组进行联合调用的可扩展性。为了确定该算法在大样本集上的表现,我们检查了样本数量对运行时间的影响。图 4 显示了使用 GATK HaplotypeCaller、GATK UnifiedGenotyper、Platypus 和 SAMtools 进行变异检测的运行时间。对于这里使用的 20X 外显子组,HaplotypeCaller 的 CPU 小时数略高于其他算法,最多可达 250 个样本。超过 250 个样本后,HaplotypeCaller 的运行时间继续线性增加,而其他算法的增长呈超线性趋势。此外,HC-RCM 算法的高效可扩展性被利用,为外显子组聚合联盟生成了包含超过 90,000 个外显子组的联合调用集[14]。
Figure 4: Scaling of whole exome calling with the HaplotypeCaller-Reference Confidence Model. Comparison of computational runtimes as a function of sample size. For each algorithm, variants were called over 1M bases of genomic territory and extrapolated to full exome runtime. Error bars represent 95%95 \% confidence intervals from 10 independent runs. 图 4:使用 HaplotypeCaller-Reference 置信度模型对全外显子测序进行缩放分析。样本量对计算运行时间的影响对比。对于每种算法,在 100 万碱基的基因组区域内进行了变异检测,并将其外推到全外显子测序的运行时间。误差条代表 10 次独立运行的 95%置信区间。
3 Discussion 3 讨论
The above validation studies demonstrate that the HaplotypeCaller calls both SNP and indel variants with high sensitivity and specificity. Now part of the widely used GATK pipeline, the HC removes a critical bottleneck in variant calling by enabling this algorithm to scale without losses in accuracy or sensitivity. 上述验证研究表明,HaplotypeCaller 能够以高灵敏度和高特异性检测 SNP 和插入缺失变异。现已成为广泛使用的 GATK 流程的一部分,HC 通过使该算法能够在准确性或灵敏度不降低的情况下进行扩展,解决了变异检测中的一个关键瓶颈。
Our analyses show that SAMtools is the most sensitive SNP variant caller using both whole genome and exome-captured data. These results are consistent with other studies for whole genomic data which have shown that locusbased algorithms accurately call SNP variants [17, 22]. However, SAMtools also has a high false discovery rate for SNP variant calls, suggesting that though highly sensitive, it lacks the specificity of the other variant calling algorithms evaluated in this study. 我们的分析表明,SAMtools 是使用全基因组和外显子捕获数据的最敏感的 SNP 变异检测工具。这些结果与其他研究全基因组数据的研究一致,这些研究表明基于位点的算法能准确地检测 SNP 变异[17, 22]。然而,SAMtools 对于 SNP 变异检测也存在较高的假阳性率,这表明虽然其敏感性很高,但其特异性低于本研究评估的其他变异检测算法。
In our results, the HaplotypeCaller was the most sensitive indel variant caller using both whole genome and exome-capture data, which is a conclusion consistent with other reports comparing variant callers [22, 21, 9]. However, in contrast with the report of Rimmer, et al., the results here show that the HaplotypeCaller had the lowest FDR on whole genomic data, suggesting that it had the highest specificity for indel variant calls [22]. Reasons for the discrepancies between the two studies, include different versions of the algorithm used; the current study used the HaplotypeCaller (v3.4) while Rimmer, et al used a much older version (v2.5). It is likely that more recent versions of the HaplotypeCaller software improved variant calling accuracy as has been reported by Narzisi, et al. [20] and our internal tests (data not shown). In addition to the different software versions used, there is a substantial difference in the validation methods between the two studies. Although Rimmer et al., used a fosmid dataset to validate their variant calls, the majority of variants called as false-positives by Platypus were actually present in dbSNP137, suggesting that their “truth” data contained a high frequency of errors [22]. In contrast, this analysis used the GiaB standard which contains variants validated using multiple data sets, platforms, variant callers, and filtering strategies suggesting that the GiaB is a highly validated “truth” dataset [26]. Using this high quality “truth” set it is demonstrated that the HaplotypeCaller calls indel variants with both high sensitivity and specificity. 在我们的结果中,HaplotypeCaller 是使用全基因组和外显子捕获数据最敏感的插入缺失变异检测器,这与其他比较变异检测器的报告一致[22,21,9]。然而,与 Rimmer 等人的报告相反,这里的结果显示 HaplotypeCaller 在全基因组数据上具有最低的假阳性率,表明它具有最高的插入缺失变异检测的特异性[22]。两项研究之间存在差异的原因包括所使用算法版本不同;本研究使用的是 HaplotypeCaller(v3.4),而 Rimmer 等人使用的是一个更老的版本(v2.5)。很可能更新版本的 HaplotypeCaller 软件提高了变异检测的准确性,这也在 Narzisi 等人的报告和我们的内部测试中得到证实(未发表数据)。除了使用的软件版本不同,两项研究的验证方法也存在实质性差异。尽管 Rimmer 等人使用了长片段克隆数据集来验证他们的变异检测结果,但 Platypus 检测为假阳性的大多数变异实际存在于 dbSNP137,这表明他们的"真实"数据集包含大量错误[22]。相比之下,本分析使用了 GiaB 标准,其中包含通过多个数据集、平台、变异检测器和过滤策略验证的变异[26]。使用这个高质量的"真实"数据集,证明 HaplotypeCaller 检测插入缺失变异具有较高的敏感性和特异性。
These comparison data show that all of the algorithms assign genotypes with similar accuracy from SNP variants, consistent with the report 这些比较数据显示,所有算法从 SNP 变体中分配基因型的准确性都非常相似,这与报告一致
of [22]. However, substantial differences in genotype concordance were observed among algorithms for indel variants. HaplotypeCaller’s local assembly and pairHMM likelihood calculation are more robust to errors in repetitive contexts, leading to the higher genotyping accuracy. 根据[22]的结果,对于插入缺失变异(Indel)而言,不同算法之间在基因型一致性上存在显著差异。HaplotypeCaller 利用局部组装和配对 HMM 似然计算更好地抵御重复区域中的错误,从而实现更高的基因型精确度。
Platypus FDRs were higher than expected for both SNPs and indels in the WGS results. The orthogonal Illumina Platinum Genomes evaluation [9] showed a FDR of less than 1 在 WGS 结果中,短序列多态性(SNPs)和插入缺失(indels)的假阳性检测率(FDR)均高于预期。相互独立的 Illumina Platinum Genomes 评估[9]显示,FDR 低于 1。
While variant calling accuracy on a single sample is of primary importance in some studies, in nearly all contexts the interpretation of variants requires data derived from many samples. Analyses like genome-wide association studies leverage larger numbers of samples to increase their power to detect genotype-phenotype associations. In a clinical research setting, analysts often interpret the effect of a given putatively causal mutation based on population frequency estimated from a large joint call set of samples from healthy individuals used as a reference panel. To produce such a reference panel, it is crucial that all samples be jointly called and evaluated across as many other samples as possible. The estimation of the frequencies of rare alleles in large populations is possible only with accurate and confident genotype calls for all samples, especially those that are homozygous for the reference allele. 尽管在某些研究中单个样本的变异检测精度非常重要,但在几乎所有情况下,对变异的解释都需要从多个样本收集的数据。像基因组关联研究这样的分析依赖更多样本数量来提高检测基因型-表型关联的能力。在临床研究中,分析人员通常根据大型联合样本集中健康个体的群体频率来解释某一推测性致病突变的效果。要建立这样一个参考样本库,关键是要对所有样本进行联合检测和评估。只有准确和可靠地确定所有样本的基因型,特别是对于参考等位基因同源的样本,才能够估计大型群体中稀有等位基因的频率。
Joint calling increases variant calling sensitivity over low coverage regions and improves filtering accuracy [6]. The HaplotypeCaller algorithm further improves accuracy by using a local assembly method for variant discovery. However joint calling across large sets of samples requires an algorithm that can effectively scale. HC-RCM runtimes increase linearly with sample number, enabling the algorithm to produce data across tens of thousands of samples, as demonstrated by the production of a call set featuring over 90,000 exome samples used in the ExAC study [14]. 联合召唤提高了低覆盖区域的变体调用灵敏度,并提高了过滤准确性[6]。HaplotypeCaller 算法进一步提高了准确性,它采用了局部组装方法进行变异发现。然而,对大量样本进行联合调用需要一种能够有效扩展的算法。HC-RCM 运行时间随样本数量线性增加,使该算法能够处理数万个样本,这在 ExAC 研究中使用的超过 90,000 个外显子样本集上得到了证明[14]。
4 Supplemental Methods 4 补充方法
4.1 Data 4.1 数据
Input data for the variant caller evaluations were obtained from the 1000 genomes repository [5]. Both data sets, consisting of the 2x250bp PCR-free whole genome samples and 2x76bp exome samples, were retrieved as BAM files of high coverage CEU trio samples. These samples, NA12891, NA12892, variant 调用器评估的输入数据是从 1000 基因组资源库[5]获得的。包含 2x250bp 免 PCR 全基因组样本和 2x76bp 外显子样本的两个数据集,以高覆盖度 CEU 三联体样本的 BAM 文件形式检索。这些样本是 NA12891、NA12892。
NA12878 representing the father, mother, and daughter respectively have been described previously [2]. Prior to analysis, all sequencing data were aligned with the Burrows-Wheeler transform algorithm (BWA) [16]. The Genome in a Bottle (GiaB) standard version 2.18 for NA12878 was used for the variant caller accuracy comparison [26]. NA12878 代表父亲、母亲和女儿[2]已经在先前研究中描述过。在分析之前,所有测序数据都使用 Burrows-Wheeler 变换算法(BWA)[16]进行了比对。用于变异检测准确性比较的是 Genome in a Bottle(GiaB)标准版本 2.18[26]。
4.2 Software 4.2 软件
Software versions were Platypus version 0.7.8, SAMtools version 1.1 and the UnifiedGenotyper and HaplotypeCaller-Reference Confidence Model (HCRCM) were obtained from the Genome Analysis Toolkit version 3.4. The authors made a good faith effort to run each tool according to its respective best practice recommendations by following the directions found on each tool author’s webpage. 软件版本为 Platypus 版本 0.7.8,SAMtools 版本 1.1,UnifiedGenotyper 和 HaplotypeCaller-参考置信度模型(HCRCM)来自 Genome Analysis Toolkit 版本 3.4。作者尽力按照各工具作者网页上的说明运行每个工具,以遵循各自的最佳实践建议。
4.3 Evaluation Metrics 4.3 评估指标
Figure 3 reports variant caller accuracy using false negative rate (FNR), false discovery rate (FDR), and genotype concordance with respect to the Genome in a Bottle (GiaB) reference standard as the truth set [26]. A variant at a position listed in the truth set that also has an alternate allele matching the truth set is considered a true positive (TP). A variant that is called within the GiaB confidence region that does not occur in the list of truth variants or have an alternate allele matching the truth variant is considered a false positive (FP). 图 3 报告了变异检测器的精确度,使用了假阴性率(FNR)、假发现率(FDR)和与 Genome in a Bottle (GiaB)参考标准的基因型一致性作为真实集[26]。如果在真实集中列出的位置有一个变异且备用等位基因与真实集相匹配,则被视为真阳性(TP)。如果在 GiaB 置信区域内检测到一个变异但不在真实变异列表中或备用等位基因与真实变异不匹配,则被视为假阳性(FP)。
Algorithm sensitivity is calculated by dividing the number of true positive variants by the total number of variants in the GiaB standard [26, 22]. Here we report the FNR (1-sensitivity) to better visualize the differences between algorithms. 算法敏感性通过将真阳性变体的数量除以 GiaB 标准[26, 22]中的总变体数来计算。在此,我们报告 FNR(1-敏感性)以更好地可视化算法之间的差异。
The FDR describes the frequency of incorrect variant calls in a call set and is calculated by dividing the number of false-positive ( FP ) variant calls by total number of called variants (the sum of both true and false positive variant calls) as described [22, 6]. Genotype concordance (GT concordance), measures the accuracy of genotype calls and is calculated by dividing the number of correctly assigned genotypes (both heterozygous and homozygous variant) by the total number of true positive variants [26]. 错误变异调用频率指标(FDR)描述了调用集中错误变异调用的频率,它是通过将假阳性(FP)变异调用的数量除以已调用变异的总数(包括真阳性和假阳性变异调用)计算得出的[22, 6]。基因型一致性(GT 一致性)是衡量基因型调用准确性的指标,它是通过将正确分配的基因型(包括杂合子和纯合子变异)数量除以真阳性变异总数计算得出的[26]。
HaplotypeCaller is the method responsible for variant calling in the current best practices version of the GATK pipeline for the processing of highthroughput sequencing data. This component was developed to address the shortcomings of previous locus-based callers, primarily an inability to call structural variants including indels and repetitive sequences [22]. The HaplotypeCaller’s approach to variant calling is reference-based local reassembly of genomic regions containing non-reference evidence. Using this approach, the method presented here avoids many of the pitfalls of global alignment to the reference sequence used in mapping/locus-based approaches e.g. SAMtools and the Unified Genotyper [17, 6]. Given that this local assembly approach becomes exponentially more computationally intensive with increasing numbers of samples, the reference confidence model (RCM) was developed to reduce the computational burden associated with the tandem local reassembly of multiple samples. The combined algorithm is described in detail below. There are several preprocessing steps (Figure 1) that are explained in detail on the GATK website www.broadinstitute.org/gatk and in previous publications [6, 3]. Initially, raw sequence data is aligned and de-duplicated using BWA and Picard Tools, respectively [17, 15]. Additional preprocessing steps used to produce analysis-ready reads, including the Base Quality Score Recalibrator (BQSR), are part of the GATK pipeline and have been described previously [6,3][6,3]. Note also that reads with mapping quality less than 20 are filtered by the tool engine and excluded from HaplotypeCaller. 哈普洛泰型检测器是当前 GATK 管道中负责变体检测的方法,用于处理高通量测序数据。该组件旨在解决以前基于位点的检测器的局限性,主要是无法检测包括插入/缺失和重复序列在内的结构变异[22]。哈普洛泰型检测器对变体检测的方法是基于参考序列对包含非参考证据的基因组区域进行局部重组装。采用这种方法,所提出的方法可以避免许多基于全局比对参考序列的方法(例如 SAMtools 和 Unified Genotyper)[17,6]中遇到的问题。鉴于这种局部装配方法随样本数量增加而呈指数增长的计算负担,开发了参考信心模型(RCM)来减轻多个样本串行局部重组装的计算负担。下面详细介绍了这种组合算法。在 GATK 网站 www.broadinstitute.org/gatk 和以前的出版物[6,3]中,有几个预处理步骤(图 1)有详细说明。最初,使用 BWA 和 Picard Tools 分别对原始测序数据进行比对和重复去除[17,15]。GATK 管道中用于生成分析就绪读数的其他预处理步骤,包括测序质量校正(BQSR),已经在之前描述过。另请注意,映射质量低于 20 的读数被工具引擎过滤掉,从而不会进入哈普洛泰型检测器。
4.4.1 Defining ActiveRegions 4.4.1 定义活动区域
To define an ActiveRegion of a sample’s genome using data from that sample’s reads, the HaplotypeCaller operates in three phases. It computes an “active probability” for each locus, smoothes the probability signal via convolution, and thresholds the resulting signal to define the ActiveRegion boundaries. Using the original alignment assigned by BWA, the HaplotypeCaller performs genotyping at each pileup position comparing the reference allele with any non-reference possibility, incorporating mismatch evidence, but also insertions, deletions, and soft-clips. Sites are assigned an “active probability” based on these genotype likelihoods and a heterozygosity prior ( 0.001 for SNPs and 0.0001 for indels by default). That probability is then convolved across loci with a Gaussian kernel (sigma of 17 bp by default) to expand 使用该样本的读数数据来定义样本基因组的 ActiveRegion,HaplotypeCaller 分三个阶段操作。它计算每个位点的"激活概率",通过卷积平滑概率信号,并对结果信号进行阈值处理以定义 ActiveRegion 边界。使用 BWA 分配的原始比对,HaplotypeCaller 在每个堆积位置执行基因型分型,比较参考等位基因与任何非参考可能性,结合错配证据,以及插入、删除和软剪接。根据这些基因型似然和异质性先验(默认 SNP 为 0.001,缺失为 0.0001),给每个位点分配一个"激活概率"。然后使用高斯核(默认 sigma 为 17 bp)在位点间进行卷积以扩展该概率。
the activity signal. An ActiveRegion is thus defined as an interval of contiguous loci where the active probability exceeds a certain threshold (0.002 by default.) By default, HaplotypeCaller incorporates data from reads that cover an interval of up to 100bp adjacent to but outside the thresholded ActiveRegion during assembly, but these reads do not contribute to genotype likelihoods. Reads outside the ActiveRegion and its extension are not considered at all. Minimum ActiveRegion size is by default 50bp while the maximum size is 300 bp . If an ActiveRegion exceeds the maximum size after the thresholding step then is it split into two, such that the regions for which each will emit variants abut, but reads overlapping the junction will be considered in both. 活动信号。因此,ActiveRegion 被定义为连续位点区间,其中活动概率超过某个阈值(默认为 0.002)。默认情况下,HaplotypeCaller 在组装期间会使用涵盖 ActiveRegion 外 100bp 范围内的读数,但这些读数不会对基因型似然性产生贡献。不考虑 ActiveRegion 及其扩展区域外的任何读数。默认 ActiveRegion 的最小大小为 50bp,最大大小为 300 bp。如果一个 ActiveRegion 在阈值步骤后超过最大大小,则它将被分成两个部分,使得每个部分将发出的变体相互接壤,但重叠连接的读数将在两个部分中都被考虑。
4.4.2 Graph Construction 4.4.2 图构建
The second step involves de Bruijn-like graph construction for each ActiveRegion [23]. The reads within the current ActiveRegion and the reference sequence corresponding to that ActiveRegion are parsed into k-mers of 10 and 25 nucleotides in length. If the k-mer-ized reference sequence contains nonunique k -mers, then the value k in incremented by 10 until a maximum of 65. If the reference sequence for the ActiveRegion contains non-unique 65-mers, then assembly is aborted. The graph is initialized by connecting overlapping reference k-mers into a single path. Edges in the graph are assigned weights according to how many reads contain each pair of k -mers the edge connects. The graph is simplified by merging paths with entire k -mers in common. Edges with limited k-mer support are pruned out of the graph and potential haplotypes are removed if supported by fewer than two reads by default. Paths that do not have a terminal kmer that connects back to the reference haplotype, referred to as “dangling tails”, are attempted to be merged to the reference using Smith-Waterman alignment [24]. If no overlap with the reference is found, paths containing the dangling tail will be discarded. Each candidate haplotype is aligned to the reference sequence using Smith-Waterman . The output of this alignment generates a CIGAR string for each candidate haplotype (H_(j))[17]\left(H_{j}\right)[17]. This CIGAR is used to help translate the haplotypes into the variants that will be output in the VCF. 第二步涉及为每个 ActiveRegion [23] 构建类似 de Bruijn 的图。对于当前 ActiveRegion 中的读数和对应于该 ActiveRegion 的参考序列,将其解析为长度为 10 和 25 个核苷酸的 k 个核苷酸。如果 k 个核苷酸化参考序列包含非唯一的 k 个核苷酸,则 k 的值会增加 10,直到最大值为 65。如果 ActiveRegion 的参考序列包含非唯一的 65 个核苷酸,则中止装配。通过将重叠的参考 k 个核苷酸连接成单个路径来初始化图。根据每对 k 个核苷酸连接的读数数量为图边分配权重。通过合并具有完整 k 个核苷酸共同的路径来简化图。修剪掉支持有限的 k 个核苷酸的边,并且如果支持少于两个读数,则删除潜在的单倍体。对于没有连接回参考单倍体的终端 kmer 的路径,称为"悬挂尾巴",尝试使用 Smith-Waterman 对齐将其与参考文献合并 [24]。如果找不到与参考文献的重叠,则会丢弃包含悬挂尾巴的路径。使用 Smith-Waterman 将每个候选单倍体与参考序列对齐。该对齐的输出生成每个候选单倍体的 CIGAR 字符串 (H_(j))[17]\left(H_{j}\right)[17] 。该 CIGAR 用于帮助将单倍体转译为将在 VCF 中输出的变体。
Figure 5: Pair-HMM overview. This diagram shown depicts the states, transitions, and transition rates for the read-haplotype alignment. The alignment of a read to a candidate haplotype is allowed to take on one of three states; match, insertion, or deletion (M, I, or D). epsilon\epsilon represents the gap extension penalty while delta\delta is the gap open penalty. Other transition probabilities are defined based on the constraint that the sum of the transition probabilities must equal one. 图 5:配对隐马尔可夫模型概述。此示意图描述了读取-单倍型对齐的状态、转移和转移速率。读取与候选单倍型的对齐可以采取三种状态之一:匹配、插入或删除 (M、I 或 D)。 epsilon\epsilon 代表间隙延长惩罚,而 delta\delta 是间隙开启惩罚。其他转移概率是基于转移概率之和必须等于一的约束定义的。
4.4.3 Pair-HMM 4.4.3 成对隐马尔科夫模型
A paired Hidden Markov Model (pair-HMM) is used to determine the likelihood for each combination of candidate haplotype (H_(j))\left(H_{j}\right) and read data (R_(i))\left(R_{i}\right), namely P(R_(i)∣H_(j))[12,8]P\left(R_{i} \mid H_{j}\right)[12,8]. 一对隐马尔可夫模型(pair-HMM)被用来确定每一个候选单倍体 (H_(j))\left(H_{j}\right) 和读数据 (R_(i))\left(R_{i}\right) 组合的可能性,即 P(R_(i)∣H_(j))[12,8]P\left(R_{i} \mid H_{j}\right)[12,8] 。
Figure 5 shows a graphical representation of a global alignment model, which is a simplified version of the pair-HMM used herein. For additional details and a description of the complete model, the reader is referred to Durbin et al., [8]. The pair-HMM calculates an alignment score for each read (R_(i))\left(R_{i}\right) and candidate haplotype (H_(j))\left(H_{j}\right). There are three primary “states” of the algorithm, match (M_(ij))\left(M_{i j}\right), insertion (I_(ij))\left(I_{i j}\right), and deletion (D_(ij))\left(D_{i j}\right). An alignment of a read to a haplotype can be described by a sequence of these states. The transition probability from the match to the insertion or deletion state is given by the gap open penalty (delta)(\delta). The probability that an alignment stays in either the insertion or deletion state is given by the gap extension penalty (epsilon)(\epsilon). The gap extension penalty is held constant (default 10) while the gap open penalty may be derived from GATK’s BQSR if base insertion and deletion qualities have been output in the recalibrated BAM. If BQSR 图 5 显示了一个全局比对模型的图形表示,这是本文使用的成对隐马尔科夫模型的简化版本。有关更多详细信息和完整模型的描述,请参阅 Durbin 等人的文献[8]。该成对隐马尔科夫模型会为每个读数 (R_(i))\left(R_{i}\right) 和候选单倍型 (H_(j))\left(H_{j}\right) 计算一个比对得分。该算法有三个主要的"状态":匹配 (M_(ij))\left(M_{i j}\right) 、插入 (I_(ij))\left(I_{i j}\right) 和删除 (D_(ij))\left(D_{i j}\right) 。读数到单倍型的比对可以用这些状态的序列来描述。从匹配状态转换到插入或删除状态的概率由间隙开启罚分 (delta)(\delta) 给出。保持在插入或删除状态的概率由间隙延伸罚分 (epsilon)(\epsilon) 给出。间隙延伸罚分保持恒定(默认 10),而间隙开启罚分可以从 GATK 的 BQSR 中获得,前提是在校正的 BAM 文件中输出了碱基插入和删除质量。
base insertion and deletion qualities are not available, a constant value of 45 is assigned. For alignments in the match state, the probability of emitting a base identical to the reference (P_(ij))\left(P_{i j}\right) is given by the complement of the base error probability given by the base quality. 基础插入和删除质量不可用,将分配一个恒定值 45。对于匹配状态下的对齐,发射与参考相同碱基的概率由碱基质量给出的碱基误差概率的补码给出。
Recurrence relations for the states with respect to position i in the read and position j in the haplotype are given as follows: 对于读数的位置 i 和单倍型的位置 j 的状态 recurrence 关系如下:
The pair-HMM can be used to correct for PCR errors that can lead to spurious indel calls. In this error correction mode, when the alignment sequence is being calculated for a tandem repeat sequence in the reference, the base insertion and deletion qualities are decreased in order to convey the reduced confidence in indels occurring within repetitive contexts. The default is to use a conservative correction. For PCR-free genomes, the authors recommend setting the PCR indel model to “NONE”. 成对隐马尔科夫模型可用于纠正可能导致虚假插入缺失调用的 PCR 错误。在此错误修正模式下,在为参考序列中的串联重复序列计算比对序列时,会降低碱基插入和删除质量,以表示在重复上下文中出现插入缺失的信心降低。默认情况下使用保守的修正。对于无 PCR 的基因组,作者建议将 PCR 插入缺失模型设置为"无"。
4.4.4 Assigning Genotypes 4.4.4 分配基因型
The CIGAR derived from the Smith-Waterman alignment of the discovered haplotypes is used to translate the haplotypes into variant events with respect to the reference. Genotyping is then performed at each of the events discovered in any haplotype. Per-read haplotype likelihood output from the pair-HMM is used to calculate raw genotype likelihoods using a Bayesian model. Given that multiple haplotypes may support a variant allele, the per-read allele likelihood at a given variant position is taken as the maximum of the likelihoods for haplotypes containing the allele. Equation 1 gives the genotype likelihood for a diploid genotype G_(i)G_{i} composed of one copy of allele A_(1)A_{1} and one copy of allele A_(2)A_{2}. Briefly, the probability of a candidate genotype P(R_(i)∣G_(l))P\left(R_{i} \mid G_{l}\right), is the product over all the reads of the mean read likelihoods for the alleles in the specified genotype 从发现的单倍型的史密斯-沃特曼(Smith-Waterman)比对中得到的 CIGAR 被用于将单倍型转换为相对于参考序列的变异事件。然后在每一个在任何单倍型中发现的事件上进行基因型分型。从对偶 HMM 输出的逐读单倍型似然被用于使用贝叶斯模型计算原始基因型似然。鉴于可能有多个单倍型支持一个变异等位基因,给定变异位置的逐读等位基因似然被取为包含该等位基因的单倍型的似然的最大值。公式 1 给出了由一个 A_(1)A_{1} 等位基因和一个 A_(2)A_{2} 等位基因组成的二倍体基因型 G_(i)G_{i} 的基因型似然。简单地说,候选基因型 P(R_(i)∣G_(l))P\left(R_{i} \mid G_{l}\right) 的概率是所有读段的包含指定基因型中等位基因的平均读段似然的乘积。
To determine the posterior probability of each candidate genotype P(G_(l)∣R_(i))P\left(G_{l} \mid R_{i}\right), the raw genotype likelihoods P(R_(i)∣G_(l))P\left(R_{i} \mid G_{l}\right) are marginalized using a Bayesian model:. 要确定每个候选基因型 P(G_(l)∣R_(i))P\left(G_{l} \mid R_{i}\right) 的后验概率,原始基因型似然度 P(R_(i)∣G_(l))P\left(R_{i} \mid G_{l}\right) 使用贝叶斯模型进行边缘化:.
The numerator consists of the product of the prior probability of a genotype P(G)P(G) and a raw genotype likelihood divided by the sum of the likelihoods of all the possible genotypes for the set of alleles called in the variant. At this stage in the best practices pipeline, the genotype prior is flat such that the prior probability of all genotypes are equal. The GATK tool CalculateGenotypePosteriors can be used in post-processing to apply more informed priors. 分母由先验概率 P(G)P(G) 基因型和原始基因型似然度的乘积除以所有可能基因型的似然度之和组成。在最佳实践管道的这一阶段,基因型先验是平坦的,所有基因型的先验概率都相等。 GATK 工具 CalculateGenotypePosteriors 可用于后处理以应用更有信息量的先验。
4.4.5 Reference Confidence Model 4.4.5 参考置信度模型
The two main requirements for joint calling large cohorts using local assembly are maintaining manageable compute complexity and outputting data for every variant in every sample. The reference confidence model addresses the former requirement by processing the samples individually, reducing the number of probable paths through the assembly graph, resulting in fewer candidate haplotypes, and decreasing the number of computationally expensive likelihood calculations that must be performed for each sample. The solution to the latter involves the addition of the symbolic non-reference allele (<NON_REF>) to aggregate evidence for a variant that was not explicitly called as an alternate allele. This additional candidate allele category enables the estimation of likelihoods to genotypes featuring an allele not seen in the sample in question. For variant sites, for each read the non-reference likelihood is set to the median of the set of allele likelihoods that are worst than the best allele. For sites that appear to be reference, non-reference likelihoods are estimated using a SNP model and an indel model. 使用本地组装进行大型队列联合呼叫的两个主要要求是维护可管理的计算复杂性和为每个样本中的每个变体输出数据。参考置信度模型通过单独处理样本来解决前一个要求,减少通过组装图的可能路径数量,从而产生更少的候选染色体组和减少必须为每个样本执行的计算昂贵的可能性计算。后一个问题的解决方案涉及添加符号非参考等位基因()来聚合未被明确调用为备用等位基因的变体的证据。这种额外的候选等位基因类别使得可以估计包含在样本中未观察到的等位基因的基因型的似然性。对于变异位点,对于每个读数,非参考似然性被设置为比最佳等位基因更差的一组等位基因似然性的中位数。对于似乎是参考的位点,使用 SNP 模型和 indel 模型估计非参考似然性。
Under the SNP model, base qualities are used to assign oer-read allele likelihoods for the reference and non-reference alleles. The per-read allele likelihoods for reference bases are assigned using the base error probability (epsilon)(\epsilon) derived from the base quality. Bases that do not match the reference or are adjacent to soft clips, insertions, or deletions are considered non-reference evidence. The likelihood for read R_(i)R_{i} with base b_(i)b_{i} and base quality qq compared to the reference allele with base b_(r)b_{r} is given by: 在 SNP 模型下,使用碱基质量来为参考和非参考等位基因分配每个读段的概率。参考碱基的每个读段等位基因概率是使用从碱基质量中推导出的碱基错误概率 (epsilon)(\epsilon) 来分配的。与参考不匹配或邻近软剪切、插入或缺失的碱基被视为非参考证据。相比参考等位基因具有碱基 b_(r)b_{r} 的读段 R_(i)R_{i} 中具有碱基 b_(i)b_{i} 和碱基质量 qq 的概率为:
Where epsilon=10^(-q//10)\epsilon=10^{-q / 10} for base quality qq. Deletions are assigned a constant quality of 30 . Bases with base quality of 6 or below are discarded. 其中 epsilon=10^(-q//10)\epsilon=10^{-q / 10} 表示质量为 qq 的碱基。缺失的碱基被指定为恒定的质量 30。碱基质量为 6 或以下的会被丢弃。
Under the indel model, only reads that are considered informative for all possible indels of up to size 10bp (by default) are considered. For each event size up to the maximum to be considered, bases are removed from the reference to simulate a deletion or from the read to simulate an insertion. A read is considered informative if, for the pair of sequences with the indel modification, the sum of the base qualities for mismatching bases in the read is greater than the sum of the base qualities for bases that are mismatched according to the original alignment. The number of reads informative for indels is capped at 40 . A constant reference quality of 45 is used for each indel-informative read. 在缺失/插入模型下,仅考虑长度不超过 10bp(默认)的所有可能缺失/插入对于所有读段都是信息性的。对于每一种最大可考虑的事件大小,从参考序列中删除碱基模拟缺失,或从读段中删除碱基模拟插入。如果对于包含缺失/插入修改的序列对,读段中错配碱基的质量和大于原始比对中错配碱基的质量和,则该读段被视为信息性。信息性缺失/插入读段的数量最多为 40 条。每个信息性缺失/插入读段使用 45 的恒定参考质量。
Genotype likelihoods for genotypes incorporating the non-reference allele are calculated in the same way as described above for likelihoods derived from both SNP and indel models. The likelihoods with the lower genotype quality are then assigned to the site. 包含非参考等位基因的基因型似然度的计算方式与上述从 SNP 和插入缺失模型导出的似然度计算方式相同。然后将具有较低基因型质量的似然度分配给该位点。
Output from the HC-RCM is captured in a new intermediate file format called genomic variant call format (gVCF) that contains likelihood data for every position in the genome (or specified intervals). This is an intermediate output that is not appropriate for analysis until the data has been processed with the full GATK Best Practices pipeline. 来自 HC-RCM 的输出被捕获到一个名为基因组变体呼叫格式(gVCF)的新中间文件格式中,该格式包含了整个基因组(或指定区间)每个位置的似然数据。这是一种中间输出,在用完整的 GATK 最佳实践管道处理数据之前是不适合分析的。
4.5 Running HaplotypeCaller-Reference Confidence Model 运行 HaplotypeCaller-参考置信度模型
Use of the Reference Confidence Model is not the HaplotypeCaller default and must be specified on the commandline. (See https://software. broadinstitute.org/gatk/documentation/article?id=3893 for exact command lines.) Output from this mode will be intermediate gVCF files that are not appropriate for analysis. The reader is encouraged to visit the GATK Forums website for more information about the gVCF intermediate file format used by the HC-RCM (https://software.broadinstitute. org/gatk/documentation/article.php?id=4017). gVCFs must be genotyped with the GATK GenotypeGVCFs tool to remove low quality, likely 使用参考可信度模型不是 HaplotypeCaller 的默认选项,必须在命令行中指定。(有关确切命令行,请参见 https://software.broadinstitute.org/gatk/documentation/article?id=3893。)此模式的输出将是不适合分析的中间 gVCF 文件。我们鼓励读者访问 GATK 论坛网站,了解有关 HC-RCM 使用的 gVCF 中间文件格式的更多信息(https://software.broadinstitute.org/gatk/documentation/article.php?id=4017)。必须使用 GATK GenotypeGVCFs 工具对 gVCF 进行基因型检测,以去除低质量、可能是错误的变体。
artifactual alleles and assign a quality score to each variant. For many of the scaling analyses presented here, per-sample gVCFs were first combined into multi-sample gVCFs using the GATK3 tool CombineGVCFs before genotyping that output with GenotypeGVCFs. For multi-sample analysis, GenotypeGVCFs also produces the finalized annotation values that will be used as features in VQSR filtering. 假象性等位基因并为每个变体分配质量分数。对于此处提供的许多缩放分析,首先使用 GATK3 工具 CombineGVCFs 将每个样本的 gVCF 合并为多个样本的 gVCF,然后使用 GenotypeGVCFs 对其进行基因型检测。对于多样本分析,GenotypeGVCFs 还生成最终的注释值,这些值将用作 VQSR 过滤的特征。
References 参考文献
[1] C. A. Albers, G. Lunter, D. G. MacArthur, G. McVean, W. H. Ouwehand, and R. Durbin. Dindel: Accurate indel calls from short-read data. Genome Research, 21(6):961-973, 2011. [1] C. A. Albers、G. Lunter、D. G. MacArthur、G. McVean、W. H. Ouwehand 和 R. Durbin。Dindel:从短读数据中准确地进行插入和缺失检测。Genome Research,21(6):961-973,2011。
[2] DM Altshuler, RA Gibbs, L Peltonen, E Dermitzakis, SF Schaffner, F Yu, PE Bonnen, PI de Bakker, P Deloukas, SB Gabriel, R Gwilliam, S Hunt, M Inouye, X Jia, A Palotie, M Parkin, P Whittaker, K Chang, A Hawes, LR Lewis, Y Ren, D Wheeler, DM Muzny, C Barnes, K Darvishi, M Hurles, JM Korn, K Kristiansson, C Lee, and SA McCarrol. Integrating common and rare genetic variation in diverse human populations. Nature, 467:52-58, 2010. [2] DM Altshuler, RA Gibbs, L Peltonen, E Dermitzakis, SF Schaffner, F Yu, PE Bonnen, PI de Bakker, P Deloukas, SB Gabriel, R Gwilliam, S Hunt, M Inouye, X Jia, A Palotie, M Parkin, P Whittaker, K Chang, A Hawes, LR Lewis, Y Ren, D Wheeler, DM Muzny, C Barnes, K Darvishi, M Hurles, JM Korn, K Kristiansson, C Lee, and SA McCarrol. 整合不同人群中的普通和稀有遗传变异。Nature, 467:52-58, 2010。
[3] Geraldine A Auwera, Mauricio O Carneiro, Christopher Hartl, Ryan Poplin, Guillermo del Angel, Ami Levy-Moonshine, Tadeusz Jordan, Khalid Shakir, David Roazen, Joel Thibault, et al. From fastq data to high-confidence variant calls: The genome analysis toolkit best practices pipeline. Current Protocols in Bioinformatics, pages 11-10, 2013. [3] Geraldine A Auwera, Mauricio O Carneiro, Christopher Hartl, Ryan Poplin, Guillermo del Angel, Ami Levy-Moonshine, Tadeusz Jordan, Khalid Shakir, David Roazen, Joel Thibault 等。从 fastq 数据到高置信度变异调用:基因组分析工具包最佳实践管道。《生物信息学当代方案》, 第 11-10 页, 2013 年。
[4] Ewa Bielczyk-Maczyńska, Jovana Serbanovic-Canic, Lauren Ferreira, Nicole Soranzo, Derek L. Stemple, Willem H. Ouwehand, and Ana Cvejic. A loss of function screen of identified genome-wide association study loci reveals new genes controlling hematopoiesis. PLoS Genet, 10(7):e1004450, 072014. [4] 埃娃·比尔琴克-马欣斯卡、约瓦娜·塞尔巴诺维奇-察尼奇、劳伦·费雷拉、妮可·索兰佐、德里克·L·施特姆普尔、威廉·H·奥文汉德和安娜·齐维奇。一项功能丧失筛查发现的基因组范围关联研究座位揭示了新的控制造血的基因。PLoS Genet, 10(7):e1004450, 072014.
[5] 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56-65, 2012. [5] 1000 Genomes Project Consortium 等. 基于 1,092 个人类基因组的综合遗传变异图. Nature, 491(7422):56-65, 2012.
[6] Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, An- [6] 马克 A. 德普里斯托, 埃里克 · 班克斯, 瑞安 · 波普林, 基兰 V. 加里梅拉, 贾里德 R. 马奎尔, 克里斯托弗 · 哈特尔, 安东尼 A. 菲利帕基斯, 吉尔莫 · 德尔安格尔, 曼纽尔 A. 里瓦斯, 马特 · 汉娜, 亚伦 · 麦肯纳, 蒂姆 J. 费内尔