Scaling accurate genetic variant discovery to tens of thousands of samples
Ryan Poplin^1, Valentin Ruano-Rubio^1, Mark A. DePristo^1, Tim J. Fennell^1, Mauricio O. Carneiro^1, Geraldine A. Van der Auwera^1, David E. Kling^1, Laura D. Gauthier^{1,2}, Ami Levy-Moonshine^1, David Roazen^1, Khalid Shakir^1, Joel Thibault^1, Sheila Chandran^1, Chris Whelan^1, Monkol Lek^{1,2}, Stacey Gabriel^1, Mark J. Daly^{1,2}, Ben Neale^{1,2}, Daniel G. MacArthur^{1,2}, and Eric Banks^1

^1 Broad Institute, 75 Ames Street, Cambridge, MA 02142
^2 Massachusetts General Hospital, Boston, MA 02114
Abstract
Comprehensive disease gene discovery in both common and rare diseases will require the efficient and accurate detection of all classes of genetic variation across tens to hundreds of thousands of human samples. We describe here a novel assembly-based approach to variant calling, the GATK HaplotypeCaller (HC) and Reference Confidence Model (RCM), that determines genotype likelihoods independently per sample but performs joint calling across all samples within a project simultaneously. By calling over 90,000 samples from the Exome Aggregation Consortium (ExAC), we show that, in contrast to other algorithms, the HC-RCM scales efficiently to very large sample sizes without loss in accuracy, and that its indel calling accuracy is superior to that of other algorithms. More importantly, the HC-RCM produces a fully squared-off matrix of genotypes across all samples at every genomic position being investigated. The HC-RCM is a novel, scalable, assembly-based algorithm with abundant applications for population genetics and clinical studies.
1 Introduction
Next-generation sequencing technologies combined with large population cohorts are facilitating the identification of rare allelic variants for population and medical genetic studies [18, 4, 25, 14]. Although high-throughput sequencing platforms currently generate up to four billion paired-end reads per run, low signal-to-noise ratios necessitate the use of data processing algorithms that differentiate true variants from machine-generated artifacts [19, 11].
The first generation of variant calling algorithms scanned short read data mapped to a reference sequence to identify mismatches [1, 6, 15, 17, 22]. Reference sequence mismatches were probabilistically sorted using algorithms that took into account the reported base quality and context prior to calling variants [22]. These position-based or so-called "pileup" callers, which include SAMtools [17] and the GATK's UnifiedGenotyper (UG), proved highly effective at calling single nucleotide polymorphisms (SNPs) but are unable to attain high accuracy for indel variants due to their reliance on independent short read alignments to a reference sequence [1, 6, 15, 17, 22]. In contrast, assembly-based variant callers, including Platypus [22] and the GATK's HaplotypeCaller (HC), construct candidate haplotypes via de Bruijn-like graphs from a consensus of the reads covering the genomic region [13, 22, 7, 10, 20]. Although assembly-based algorithms call indels with greater accuracy, they do not scale well, due in part to exponential increases in graph complexity with the number of samples. Thus, calling variants jointly on large numbers of samples becomes increasingly computationally intensive until the requirements exceed hardware performance limitations [6].
One naive solution would be simply to discover and genotype variants in each sample separately and then merge the independently discovered variants across all samples. In this way, data from a single sample would only be retained at positions at which that sample is variant. A major shortcoming of this approach is that the only genotype calls present would be heterozygous or homozygous variant; it would be impossible to determine whether missing genotype calls were homozygous reference or simply lacked read data altogether. As an illustration of this shortcoming, take the case where a mutation is discovered in a single sample within a cohort: a naively merged list would merely contain "no-call" genotypes for all other samples, making estimation of the population allele frequency difficult. The most accurate estimates of population allele frequencies instead require calling variants jointly over all samples, creating a "squared-off matrix" of samples by genomic position in which each cell contains genotype likelihoods for all possible alleles (including the reference) at the corresponding genomic position. In addition, it has been previously shown [6] that joint calling has
Figure 1: GATK Variant Calling Best Practices. The HaplotypeCaller takes in analysis-ready reads and performs variant calling per sample to produce unfiltered genotype likelihoods.
several other advantages: it improves sensitivity at low coverage positions and powers the training of a more accurate statistical filtering model, which is a crucial step in the variant discovery pipeline (Figure 1).
To solve the problem of joint calling large cohorts using graph-based assembly without introducing intractable computational complexity, the Reference Confidence Model was developed as a module for the HC (Figure 2). In brief, this combined HaplotypeCaller-Reference Confidence Model (HC-RCM) algorithm constructs candidate haplotypes for each individual sample in the cohort, avoiding the exponential complexity of a graph describing all samples simultaneously and ultimately accelerating variant calling compared with the previous approach to joint calling. Constructed haplotypes are used to calculate genotype likelihoods using a pair hidden Markov model (pair-HMM), and these likelihoods are stored in an intermediate file for subsequent joint variant calling across all samples (which includes allele frequency estimation and genotype assignment). In addition to likelihoods for all alleles explicitly observed in a sample's reads, the model generates the likelihood over the set of unobserved, non-reference alleles. Here it is shown that the accuracy of the HC-RCM algorithm is comparable to or better than other
Figure 2: HaplotypeCaller-Reference Confidence Model overview. The four basic steps of variant calling with the HC-RCM including identification of the ActiveRegions, assembly of candidate haplotypes using de Bruijn-like graphs, determination of the per-read likelihoods of candidate haplotypes using a pair-HMM, and genotype assignment.
widely used tools and that the computational performance on large sample sizes is superior.
2 Results
2.1 Brief Methodology
The role of the HaplotypeCaller within the GATK variant calling pipeline is shown in Figure 1 (see http://www.broadinstitute.org/gatk for more details regarding the GATK Best Practices). High-throughput sequencing analyses start with raw read data (FASTQ files) that are converted to recalibrated, analysis-ready reads in the Binary sequence Alignment/Map (BAM) format using SAMtools and GATK modules [16, 6]. Reads are initially aligned to a reference sequence, e.g. the human reference genome (GRCh37), using a mapping/alignment program such as the Burrows-Wheeler Aligner (BWA) [16, 5].
For computational efficiency, variant calling is focused on "ActiveRegions", regions of the genome that vary significantly from the reference. ActiveRegions are defined by genomic intervals where the BWA-aligned reads contain evidence of disagreement with the reference, using criteria such as base mismatches, insertion/deletion markers in the reads, and high base-quality soft clips [16]. The reads from these ActiveRegions are split into overlapping subsequences, or k-mers, and subsequently reassembled using de Bruijn-like graphs into candidate haplotypes. This is followed by the construction of a pair-HMM using state transition probabilities derived from the read base qualities. The pair-HMM is then used to calculate the likelihood that each read was derived from each haplotype (Supplemental Figure 5). These likelihoods are used to calculate the raw genotype likelihoods for each candidate variant. Ultimately, genotype likelihoods across all samples are used to call raw variants for the cohort.
The raw, unfiltered output of the HaplotypeCaller is not appropriate for downstream analyses. The HC-RCM aims to attain maximum sensitivity, with the consequence of retaining some false positive variants. The subsequent steps of the GATK Best Practices pipeline (Figure 1) identify and filter out false positives to achieve maximum accuracy, as described in previous work [6]. For the accuracy results presented here, the HC-RCM output is subjected to joint genotyping and variant recalibration in accordance with the GATK Best Practices.
For further details on the HC-RCM algorithm, see the Supplemental Methods section.
2.2 Comparative Analyses
To validate the sensitivity and specificity of this new variant calling algorithm, a set of analyses was performed comparing the HaplotypeCaller with three other variant calling algorithms (Figure 3). These other algorithms consisted of two pileup variant callers, SAMtools [17] and the GATK UnifiedGenotyper (UG) [6], as well as Platypus, another assembly-based algorithm [22]. SNPs and indels were called with each of the four variant calling algorithms from whole genome and exome-capture sequencing data from the well characterized CEU HapMap trio [2]. Variant caller accuracy was evaluated using the Genome in a Bottle (GiaB) standard [26] as described in the supplemental methods.
Figure 3: Variant caller comparison. Accuracy comparisons of both SNP and indel variant calls over all three samples in the CEU trio by the GATK HaplotypeCaller, Platypus, SAMtools, and UnifiedGenotyper. WGS data were PCR-free, 250 bp paired-end reads sequenced on an Illumina HiSeq. WES data were 76 bp paired-end reads sequenced on an Illumina HiSeq. Sensitivity is plotted as false negative rate (FNR), for which lower values equate to better sensitivity. Specificity is plotted as false discovery rate (FDR), for which lower values are also better. For genotype concordance, higher values indicate better genotype call accuracy.
Figure 3 shows the results of these analyses, which indicate that the HaplotypeCaller effectively calls both SNP and indel variants from both whole genome and exome-captured sequence data. Although Figure 3 shows that SAMtools called a greater number of both SNP and indel variants compared to the other callers, the specificity of these calls was substantially lower than the other algorithms, especially for indel variants, as indicated by high false discovery rates (FDR). In contrast, the HaplotypeCaller algorithm called large numbers of SNP and indel variants with high sensitivity and low FDRs on both whole genome and exome captured DNA, suggesting that the HaplotypeCaller has both high sensitivity and specificity.
To assess algorithm genotype assignment accuracy, genotype concordance was determined. Genotype concordance measures the accuracy of genotype assignments of true positive variants relative to a gold standard (GiaB). In contrast to the other variant calling algorithms, the HaplotypeCaller had exceptionally high genotype concordance values for indel variant calls from both whole genome and exome-captured data. These results suggest that the HaplotypeCaller calls genotypes with superior accuracy compared with the other algorithms.
2.3 Scaling
The Reference Confidence Model was integrated into the HaplotypeCaller to enable scaling of joint calling up to hundreds of thousands of exomes. To determine how this algorithm performs on large sample sets, runtime was examined as a function of the number of samples. Figure 4 shows the runtimes for variant calling using GATK HaplotypeCaller, GATK UnifiedGenotyper, Platypus and SAMtools. For the 20X exomes used here, HaplotypeCaller requires slightly more CPU-hours than the other algorithms for up to 250 samples. Beyond 250 samples, HaplotypeCaller runtime continues to increase linearly with the number of samples, while the other algorithms increase superlinearly. Furthermore, the efficient scaling of the HC-RCM algorithm was leveraged to produce a joint call set with over 90,000 exomes for the Exome Aggregation Consortium [14].
Figure 4: Scaling of whole exome calling with the HaplotypeCaller-Reference Confidence Model. Comparison of computational runtimes as a function of sample size. For each algorithm, variants were called over 1M bases of genomic territory and extrapolated to full exome runtime. Error bars represent 95% confidence intervals from 10 independent runs.
3 Discussion
The above validation studies demonstrate that the HaplotypeCaller calls both SNP and indel variants with high sensitivity and specificity. Now part of the widely used GATK pipeline, the HC-RCM removes a critical bottleneck in variant calling by enabling assembly-based calling to scale without losses in accuracy or sensitivity.
Our analyses show that SAMtools is the most sensitive SNP variant caller on both whole genome and exome-captured data. These results are consistent with other studies of whole genome data, which have shown that locus-based algorithms accurately call SNP variants [17, 22]. However, SAMtools also has a high false discovery rate for SNP variant calls, suggesting that though highly sensitive, it lacks the specificity of the other variant calling algorithms evaluated in this study.
In our results, the HaplotypeCaller was the most sensitive indel variant caller on both whole genome and exome-capture data, a conclusion consistent with other reports comparing variant callers [22, 21, 9]. However, in contrast with the report of Rimmer et al., the results here show that the HaplotypeCaller had the lowest FDR on whole genome data, suggesting that it had the highest specificity for indel variant calls [22]. Reasons for the discrepancies between the two studies include the different versions of the algorithm used: the current study used HaplotypeCaller v3.4, while Rimmer et al. used the much older v2.5. It is likely that more recent versions of the HaplotypeCaller software improved variant calling accuracy, as has been reported by Narzisi et al. [20] and observed in our internal tests (data not shown). In addition to the different software versions used, there is a substantial difference in the validation methods between the two studies. Although Rimmer et al. used a fosmid dataset to validate their variant calls, the majority of variants called as false positives by Platypus were actually present in dbSNP137, suggesting that their "truth" data contained a high frequency of errors [22]. In contrast, this analysis used the GiaB standard, which contains variants validated using multiple data sets, platforms, variant callers, and filtering strategies, indicating that the GiaB is a highly validated "truth" dataset [26]. Using this high quality "truth" set, it is demonstrated that the HaplotypeCaller calls indel variants with both high sensitivity and specificity.
These comparison data show that all of the algorithms assign genotypes with similar accuracy for SNP variants, consistent with the report of Rimmer et al. [22]. However, substantial differences in genotype concordance were observed among algorithms for indel variants. The HaplotypeCaller's local assembly and pair-HMM likelihood calculation are more robust to errors in repetitive contexts, leading to its higher genotyping accuracy.
Platypus FDRs were higher than expected for both SNPs and indels in the WGS results; the orthogonal Illumina Platinum Genomes evaluation [9] showed an FDR of less than 1%.
While variant calling accuracy on a single sample is of primary importance in some studies, in nearly all contexts the interpretation of variants requires data derived from many samples. Analyses like genome-wide association studies leverage larger numbers of samples to increase their power to detect genotype-phenotype associations. In a clinical research setting, analysts often interpret the effect of a given putatively causal mutation based on population frequency estimated from a large joint call set of samples from healthy individuals used as a reference panel. To produce such a reference panel, it is crucial that all samples be jointly called and evaluated across as many other samples as possible. The estimation of the frequencies of rare alleles in large populations is possible only with accurate and confident genotype calls for all samples, especially those that are homozygous for the reference allele.
Joint calling increases variant calling sensitivity over low coverage regions and improves filtering accuracy [6]. The HaplotypeCaller algorithm further improves accuracy by using a local assembly method for variant discovery. However joint calling across large sets of samples requires an algorithm that can effectively scale. HC-RCM runtimes increase linearly with sample number, enabling the algorithm to produce data across tens of thousands of samples, as demonstrated by the production of a call set featuring over 90,000 exome samples used in the ExAC study [14].
4 Supplemental Methods
4.1 Data
Input data for the variant caller evaluations were obtained from the 1000 Genomes repository [5]. Both data sets, consisting of 2x250 bp PCR-free whole genome samples and 2x76 bp exome samples, were retrieved as BAM files of high coverage CEU trio samples. These samples, NA12891, NA12892, and NA12878, representing the father, mother, and daughter, respectively, have been described previously [2]. Prior to analysis, all sequencing data were aligned with the Burrows-Wheeler transform algorithm (BWA) [16]. The Genome in a Bottle (GiaB) standard version 2.18 for NA12878 was used for the variant caller accuracy comparison [26].
4.2 Software
Software versions were Platypus 0.7.8 and SAMtools 1.1; the UnifiedGenotyper and the HaplotypeCaller-Reference Confidence Model (HC-RCM) were obtained from the Genome Analysis Toolkit version 3.4. The authors made a good faith effort to run each tool according to its respective best practice recommendations by following the directions found on each tool author's webpage.
4.3 Evaluation Metrics
Figure 3 reports variant caller accuracy using false negative rate (FNR), false discovery rate (FDR), and genotype concordance with respect to the Genome in a Bottle (GiaB) reference standard as the truth set [26]. A variant at a position listed in the truth set that also has an alternate allele matching the truth set is considered a true positive (TP). A variant that is called within the GiaB confidence region that does not occur in the list of truth variants or have an alternate allele matching the truth variant is considered a false positive (FP).
Algorithm sensitivity is calculated by dividing the number of true positive variants by the total number of variants in the GiaB standard [26, 22]. Here we report the FNR (1-sensitivity) to better visualize the differences between algorithms.
The FDR describes the frequency of incorrect variant calls in a call set and is calculated by dividing the number of false positive (FP) variant calls by the total number of called variants (the sum of both true and false positive variant calls) as described [22, 6]. Genotype concordance (GT concordance) measures the accuracy of genotype calls and is calculated by dividing the number of correctly assigned genotypes (both heterozygous and homozygous variant) by the total number of true positive variants [26].
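As a concrete illustration of these definitions, the short sketch below computes the three reported metrics from true/false positive counts and the truth-set size. The counts in the example are hypothetical, and these helper functions are not the evaluation code used in this study.

```python
# Illustrative only: the metrics reported in Figure 3, computed from
# hypothetical counts of true positives (TP), false positives (FP),
# correctly genotyped TPs, and the number of variants in the truth set.

def false_negative_rate(tp: int, truth_total: int) -> float:
    """FNR = 1 - sensitivity, where sensitivity = TP / total truth variants."""
    return 1.0 - tp / truth_total

def false_discovery_rate(tp: int, fp: int) -> float:
    """FDR = FP / (TP + FP): the fraction of called variants that are wrong."""
    return fp / (tp + fp)

def genotype_concordance(correct_genotypes: int, tp: int) -> float:
    """Fraction of true positive calls whose genotype matches the truth set."""
    return correct_genotypes / tp

if __name__ == "__main__":
    tp, fp, truth_total, correct_gt = 9500, 120, 10000, 9400  # hypothetical counts
    print(false_negative_rate(tp, truth_total))   # 0.05
    print(false_discovery_rate(tp, fp))           # ~0.012
    print(genotype_concordance(correct_gt, tp))   # ~0.989
```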
4.4 HaplotypeCaller-Reference Confidence Model Algorithm

HaplotypeCaller is the method responsible for variant calling in the current best practices version of the GATK pipeline for the processing of high-throughput sequencing data. This component was developed to address the shortcomings of previous locus-based callers, primarily an inability to call structural variants such as indels and variants within repetitive sequences [22]. The HaplotypeCaller's approach to variant calling is reference-based local reassembly of genomic regions containing non-reference evidence. Using this approach, the method presented here avoids many of the pitfalls of the global alignment to the reference sequence used in mapping/locus-based approaches such as SAMtools and the UnifiedGenotyper [17, 6]. Given that this local assembly approach becomes exponentially more computationally intensive with increasing numbers of samples, the Reference Confidence Model (RCM) was developed to reduce the computational burden associated with the tandem local reassembly of multiple samples. The combined algorithm is described in detail below. There are several preprocessing steps (Figure 1) that are explained in detail on the GATK website (www.broadinstitute.org/gatk) and in previous publications [6, 3]. Initially, raw sequence data are aligned and de-duplicated using BWA and Picard Tools, respectively [17, 15]. Additional preprocessing steps used to produce analysis-ready reads, including Base Quality Score Recalibration (BQSR), are part of the GATK pipeline and have been described previously [6, 3]. Note also that reads with mapping quality less than 20 are filtered by the tool engine and excluded from HaplotypeCaller.
4.4.1 Defining ActiveRegions
To define an ActiveRegion of a sample's genome using data from that sample's reads, the HaplotypeCaller operates in three phases. It computes an "active probability" for each locus, smooths the probability signal via convolution, and thresholds the resulting signal to define the ActiveRegion boundaries. Using the original alignment assigned by BWA, the HaplotypeCaller performs genotyping at each pileup position, comparing the reference allele with any non-reference possibility and incorporating not only mismatch evidence but also insertions, deletions, and soft clips. Sites are assigned an "active probability" based on these genotype likelihoods and a heterozygosity prior (0.001 for SNPs and 0.0001 for indels by default). That probability is then convolved across loci with a Gaussian kernel (sigma of 17 bp by default) to expand the activity signal. An ActiveRegion is thus defined as an interval of contiguous loci where the active probability exceeds a certain threshold (0.002 by default). By default, HaplotypeCaller incorporates data from reads that cover an interval of up to 100 bp adjacent to but outside the thresholded ActiveRegion during assembly, but these reads do not contribute to genotype likelihoods. Reads outside the ActiveRegion and its extension are not considered at all. The minimum ActiveRegion size is 50 bp by default, while the maximum size is 300 bp. If an ActiveRegion exceeds the maximum size after the thresholding step, it is split in two, such that the regions for which each will emit variants abut, but reads overlapping the junction are considered in both.
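The following is a minimal sketch of this three-phase logic (compute, smooth, threshold). The function and parameter names are hypothetical, the defaults follow the text, and the per-locus active probabilities are assumed to have been computed already from the genotype likelihoods described above.

```python
import numpy as np

def define_active_regions(active_prob, sigma=17, threshold=0.002,
                          min_size=50, max_size=300):
    """Sketch: smooth per-locus active probabilities with a Gaussian kernel,
    threshold the smoothed signal, and enforce the size limits."""
    # Phase 2: convolve the raw activity signal with a Gaussian kernel.
    radius = 4 * sigma
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    smoothed = np.convolve(active_prob, kernel, mode="same")

    # Phase 3: threshold into contiguous candidate intervals.
    regions, start = [], None
    for pos, p in enumerate(smoothed):
        if p > threshold and start is None:
            start = pos
        elif p <= threshold and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(smoothed)))

    # Enforce size limits: pad small regions, split oversized ones.
    final = []
    for s, e in regions:
        if e - s < min_size:
            pad = (min_size - (e - s)) // 2 + 1
            s, e = max(0, s - pad), e + pad
        while e - s > max_size:
            final.append((s, s + max_size))
            s += max_size
        final.append((s, e))
    return final
```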
4.4.2 Graph Construction
The second step involves de Bruijn-like graph construction for each ActiveRegion [23]. The reads within the current ActiveRegion and the reference sequence corresponding to that ActiveRegion are parsed into k-mers of 10 and 25 nucleotides in length. If the k-mer-ized reference sequence contains non-unique k-mers, then the value of k is incremented by 10 up to a maximum of 65. If the reference sequence for the ActiveRegion contains non-unique 65-mers, then assembly is aborted. The graph is initialized by connecting overlapping reference k-mers into a single path. Edges in the graph are assigned weights according to how many reads contain each pair of k-mers the edge connects. The graph is simplified by merging paths with entire k-mers in common. Edges with limited k-mer support are pruned out of the graph, and potential haplotypes are removed if supported by fewer than two reads by default. Paths that do not have a terminal k-mer connecting back to the reference haplotype, referred to as "dangling tails", are merged into the reference, where possible, using Smith-Waterman alignment [24]. If no overlap with the reference is found, paths containing the dangling tail are discarded. Each candidate haplotype is aligned to the reference sequence using Smith-Waterman alignment. The output of this alignment generates a CIGAR string for each candidate haplotype $H_j$ [17]. This CIGAR is used to help translate the haplotypes into the variants that will be output in the VCF.
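A simplified sketch of this k-mer graph construction is shown below. It covers only the reference backbone and the edge-support bookkeeping (haplotype extraction, path merging, dangling-tail recovery, and Smith-Waterman realignment are omitted), and the function name and return values are illustrative rather than GATK's implementation.

```python
from collections import defaultdict

def build_kmer_graph(reference, reads, k=10, min_support=2):
    """Sketch: k-mer graph with a reference backbone and read-weighted edges."""
    ref_kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    if len(set(ref_kmers)) != len(ref_kmers):
        # Non-unique reference k-mers: the caller would retry with a larger k
        # (k + 10, up to 65) before aborting assembly.
        raise ValueError("non-unique reference k-mers; increase k")
    ref_edges = set(zip(ref_kmers, ref_kmers[1:]))

    support = defaultdict(int)  # (kmer_from, kmer_to) -> number of supporting reads
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for edge in zip(kmers, kmers[1:]):
            support[edge] += 1

    # Keep the reference backbone; prune alternate edges with weak read support.
    edges = set(ref_edges)
    edges.update(e for e, w in support.items() if w >= min_support)
    return edges, support
```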
Figure 5: Pair-HMM overview. This diagram depicts the states, transitions, and transition rates for the read-haplotype alignment. The alignment of a read to a candidate haplotype is allowed to take on one of three states: match, insertion, or deletion (M, I, or D). $\epsilon$ represents the gap extension penalty while $\delta$ is the gap open penalty. Other transition probabilities are defined based on the constraint that the sum of the transition probabilities must equal one.
4.4.3 Pair-HMM
A paired Hidden Markov Model (pair-HMM) is used to determine the likelihood for each combination of candidate haplotype $H_j$ and read $R_i$, namely $P(R_i \mid H_j)$ [12, 8].
Figure 5 shows a graphical representation of a global alignment model, which is a simplified version of the pair-HMM used herein. For additional details and a description of the complete model, the reader is referred to Durbin et al. [8]. The pair-HMM calculates an alignment score for each read $R_i$ and candidate haplotype $H_j$. There are three primary "states" of the algorithm: match ($M_{ij}$), insertion ($I_{ij}$), and deletion ($D_{ij}$). An alignment of a read to a haplotype can be described by a sequence of these states. The transition probability from the match to the insertion or deletion state is given by the gap open penalty ($\delta$). The probability that an alignment stays in either the insertion or deletion state is given by the gap extension penalty ($\epsilon$). The gap extension penalty is held constant (default 10) while the gap open penalty may be derived from GATK's BQSR if base insertion and deletion qualities have been output in the recalibrated BAM. If BQSR
base insertion and deletion qualities are not available, a constant value of 45 is assigned. For alignments in the match state, the probability of emitting a base identical to the reference ($P_{ij}$) is given by the complement of the base error probability derived from the base quality.
Recurrence relations for the states with respect to position $i$ in the read and position $j$ in the haplotype are given as follows:
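(The equations below follow the standard formulation of Durbin et al. [8] and the transition structure of Figure 5, written here as a reconstruction consistent with those definitions; $P_{ij}$ is the match-state emission probability for read base $i$ against haplotype base $j$.)

$$
\begin{aligned}
M_{i,j} &= P_{ij}\left[(1-2\delta)\,M_{i-1,j-1} + (1-\epsilon)\left(I_{i-1,j-1} + D_{i-1,j-1}\right)\right] \\
I_{i,j} &= \delta\,M_{i-1,j} + \epsilon\,I_{i-1,j} \\
D_{i,j} &= \delta\,M_{i,j-1} + \epsilon\,D_{i,j-1}
\end{aligned}
$$

Under this formulation, the overall read-haplotype likelihood $P(R_i \mid H_j)$ is obtained by summing the match and insertion states over the final read position.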
The pair-HMM can be used to correct for PCR errors that can lead to spurious indel calls. In this error correction mode, when the alignment sequence is being calculated for a tandem repeat sequence in the reference, the base insertion and deletion qualities are decreased in order to convey the reduced confidence in indels occurring within repetitive contexts. The default is to use a conservative correction. For PCR-free genomes, the authors recommend setting the PCR indel model to "NONE".
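To make the model concrete, the following minimal sketch implements a pair-HMM forward pass with constant Phred-scaled gap penalties (the defaults quoted above). It is illustrative only and not the optimized GATK implementation; the initial distribution and the handling of mismatch emissions are simplifying assumptions.

```python
import numpy as np

def pair_hmm_likelihood(read, quals, haplotype, gap_open=45, gap_ext=10):
    """Sketch of a pair-HMM forward pass returning P(read | haplotype)."""
    delta = 10.0 ** (-gap_open / 10.0)   # gap open probability
    eps = 10.0 ** (-gap_ext / 10.0)      # gap extension probability
    n, m = len(read), len(haplotype)

    M = np.zeros((n + 1, m + 1))
    I = np.zeros((n + 1, m + 1))
    D = np.zeros((n + 1, m + 1))
    D[0, :] = 1.0 / m  # assumption: the read may start anywhere on the haplotype

    for i in range(1, n + 1):
        err = 10.0 ** (-quals[i - 1] / 10.0)
        for j in range(1, m + 1):
            # Assumption: a mismatch splits the error probability across
            # the three alternative bases.
            p_emit = (1.0 - err) if read[i - 1] == haplotype[j - 1] else err / 3.0
            M[i, j] = p_emit * ((1 - 2 * delta) * M[i - 1, j - 1]
                                + (1 - eps) * (I[i - 1, j - 1] + D[i - 1, j - 1]))
            I[i, j] = delta * M[i - 1, j] + eps * I[i - 1, j]
            D[i, j] = delta * M[i, j - 1] + eps * D[i, j - 1]

    # The read must be fully consumed; sum M and I over haplotype end positions.
    return float(M[n, :].sum() + I[n, :].sum())

# Example: pair_hmm_likelihood("ACGT", [30, 30, 30, 30], "TTACGTTT")
```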
4.4.4 Assigning Genotypes
The CIGAR derived from the Smith-Waterman alignment of the discovered haplotypes is used to translate the haplotypes into variant events with respect to the reference. Genotyping is then performed at each of the events discovered in any haplotype. Per-read haplotype likelihood output from the pair-HMM is used to calculate raw genotype likelihoods using a Bayesian model. Given that multiple haplotypes may support a variant allele, the per-read allele likelihood at a given variant position is taken as the maximum of the likelihoods for haplotypes containing the allele. Equation 1 gives the genotype likelihood for a diploid genotype $G_l$ composed of one copy of allele $A_1$ and one copy of allele $A_2$. Briefly, the likelihood of a candidate genotype, $P(R \mid G_l)$, is the product over all the reads of the mean of the read likelihoods for the alleles in the specified genotype.
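Consistent with that description, Equation 1 can be written as (a reconstruction, with the product taken over reads $R_i$ and a factor of one half for each of the two allele copies):

$$
P(R \mid G_l = A_1/A_2) \;=\; \prod_i \left[\frac{P(R_i \mid A_1)}{2} + \frac{P(R_i \mid A_2)}{2}\right] \qquad (1)
$$

where $P(R_i \mid A)$ is the per-read allele likelihood defined above, i.e. the maximum over the candidate haplotypes containing allele $A$.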
To determine the posterior probability of each candidate genotype, $P(G_l \mid R)$, the raw genotype likelihoods $P(R \mid G_l)$ are marginalized using a Bayesian model:
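In the usual form of Bayes' rule (a sketch consistent with the description below), with $R$ denoting the full set of reads:

$$
P(G_l \mid R) \;=\; \frac{P(G_l)\,P(R \mid G_l)}{\sum_k P(G_k)\,P(R \mid G_k)}
$$

With the flat prior used at this stage, the denominator reduces to the sum of the raw genotype likelihoods over all candidate genotypes.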
The numerator consists of the product of the prior probability of a genotype, $P(G)$, and a raw genotype likelihood, divided by the sum of the likelihoods of all the possible genotypes for the set of alleles called in the variant. At this stage in the Best Practices pipeline, the genotype prior is flat, such that the prior probabilities of all genotypes are equal. The GATK tool CalculateGenotypePosteriors can be used in post-processing to apply more informed priors.
4.4.5 Reference Confidence Model
The two main requirements for joint calling large cohorts using local assembly are maintaining manageable compute complexity and outputting data for every variant in every sample. The Reference Confidence Model addresses the former requirement by processing the samples individually, reducing the number of probable paths through the assembly graph, resulting in fewer candidate haplotypes, and decreasing the number of computationally expensive likelihood calculations that must be performed for each sample. The solution to the latter involves the addition of the symbolic non-reference allele (<NON_REF>) to aggregate evidence for a variant that was not explicitly called as an alternate allele. This additional candidate allele category enables the estimation of likelihoods for genotypes featuring an allele not seen in the sample in question. For variant sites, the per-read non-reference likelihood is set to the median of the set of allele likelihoods that are worse than that of the best allele. For sites that appear to be reference, non-reference likelihoods are estimated using a SNP model and an indel model.
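At variant sites, that per-read rule can be sketched as follows; the helper name is hypothetical and ties are handled here by a simplifying assumption.

```python
import statistics

def non_ref_likelihood(allele_likelihoods):
    """Per-read <NON_REF> likelihood at a variant site: the median of the
    allele likelihoods that are worse than the best allele's likelihood."""
    best = max(allele_likelihoods)
    worse = [lik for lik in allele_likelihoods if lik < best]
    # Simplifying assumption: if every allele ties with the best,
    # fall back to the best likelihood itself.
    return statistics.median(worse) if worse else best
```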
Under the SNP model, base qualities are used to assign per-read allele likelihoods for the reference and non-reference alleles. The per-read allele likelihoods for reference bases are assigned using the base error probability $\epsilon$ derived from the base quality. Bases that do not match the reference, or that are adjacent to soft clips, insertions, or deletions, are considered non-reference evidence. The likelihood for read $R_i$ with base $b_i$ and base quality $q$ compared to the reference allele with base $b_r$ is given by:
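For concreteness, this per-read likelihood can be written as follows (a sketch consistent with the description above; whether the mismatch probability is further divided among the three alternative bases is an implementation detail assumed away here):

$$
P(b_i \mid b_r) \;=\; \begin{cases} 1 - \epsilon & \text{if } b_i = b_r \\ \epsilon & \text{otherwise} \end{cases}
$$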
where $\epsilon = 10^{-q/10}$ for base quality $q$. Deletions are assigned a constant quality of 30. Bases with base quality of 6 or below are discarded.
Under the indel model, only reads that are informative for all possible indels of up to 10 bp (by default) are considered. For each event size up to the maximum to be considered, bases are removed from the reference to simulate a deletion, or from the read to simulate an insertion. A read is considered informative if, for the pair of sequences with the indel modification, the sum of the base qualities for mismatching bases in the read is greater than the sum of the base qualities for bases that are mismatched according to the original alignment. The number of reads informative for indels is capped at 40. A constant reference quality of 45 is used for each indel-informative read.
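A simplified sketch of this informative-read test is shown below; the helper names are hypothetical, the indel is injected at the read's leading edge for brevity, and GATK's actual handling of alignment offsets and read ends is more involved.

```python
def mismatch_quality_sum(read, quals, ref, offset):
    """Sum of base qualities at read positions that mismatch the reference
    when the read is placed at the given reference offset."""
    total = 0
    for k, base in enumerate(read):
        pos = offset + k
        if pos >= len(ref) or base != ref[pos]:
            total += quals[k]
    return total

def is_indel_informative(read, quals, ref, offset, max_indel=10):
    """A read is informative if every indel hypothesis up to max_indel bp
    scores worse (higher mismatch-quality sum) than the original alignment."""
    original = mismatch_quality_sum(read, quals, ref, offset)
    for size in range(1, max_indel + 1):
        # Simulate a deletion by removing bases from the reference...
        deleted_ref = ref[:offset] + ref[offset + size:]
        # ...and an insertion by removing bases from the read.
        inserted_read, inserted_quals = read[size:], quals[size:]
        if (mismatch_quality_sum(read, quals, deleted_ref, offset) <= original or
                mismatch_quality_sum(inserted_read, inserted_quals, ref, offset) <= original):
            return False
    return True
```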
Genotype likelihoods for genotypes incorporating the non-reference allele are calculated in the same way as described above for likelihoods derived from both SNP and indel models. The likelihoods with the lower genotype quality are then assigned to the site.
Output from the HC-RCM is captured in a new intermediate file format called genomic variant call format (gVCF) that contains likelihood data for every position in the genome (or specified intervals). This is an intermediate output that is not appropriate for analysis until the data has been processed with the full GATK Best Practices pipeline.
4.5 Running HaplotypeCaller-Reference Confidence Model
Use of the Reference Confidence Model is not the HaplotypeCaller default and must be specified on the command line. (See https://software.broadinstitute.org/gatk/documentation/article?id=3893 for exact command lines.) Output from this mode will be intermediate gVCF files that are not appropriate for analysis. The reader is encouraged to visit the GATK Forums website for more information about the gVCF intermediate file format used by the HC-RCM (https://software.broadinstitute.org/gatk/documentation/article.php?id=4017). gVCFs must be genotyped with the GATK GenotypeGVCFs tool to remove low quality, likely
artifactual alleles and assign a quality score to each variant. For many of the scaling analyses presented here, per-sample gVCFs were first combined into multi-sample gVCFs using the GATK3 tool CombineGVCFs before genotyping that output with GenotypeGVCFs. For multi-sample analysis, GenotypeGVCFs also produces the finalized annotation values that will be used as features in VQSR filtering.
References
[1] C. A. Albers, G. Lunter, D. G. MacArthur, G. McVean, W. H. Ouwehand, and R. Durbin. Dindel: Accurate indel calls from short-read data. Genome Research, 21(6):961-973, 2011.
[2] DM Altshuler, RA Gibbs, L Peltonen, E Dermitzakis, SF Schaffner, F Yu, PE Bonnen, PI de Bakker, P Deloukas, SB Gabriel, R Gwilliam, S Hunt, M Inouye, X Jia, A Palotie, M Parkin, P Whittaker, K Chang, A Hawes, LR Lewis, Y Ren, D Wheeler, DM Muzny, C Barnes, K Darvishi, M Hurles, JM Korn, K Kristiansson, C Lee, and SA McCarrol. Integrating common and rare genetic variation in diverse human populations. Nature, 467:52-58, 2010.
[3] Geraldine A Auwera, Mauricio O Carneiro, Christopher Hartl, Ryan Poplin, Guillermo del Angel, Ami Levy-Moonshine, Tadeusz Jordan, Khalid Shakir, David Roazen, Joel Thibault, et al. From fastq data to high-confidence variant calls: The genome analysis toolkit best practices pipeline. Current Protocols in Bioinformatics, pages 11-10, 2013.
[4] Ewa Bielczyk-Maczyńska, Jovana Serbanovic-Canic, Lauren Ferreira, Nicole Soranzo, Derek L. Stemple, Willem H. Ouwehand, and Ana Cvejic. A loss of function screen of identified genome-wide association study loci reveals new genes controlling hematopoiesis. PLoS Genet, 10(7):e1004450, 2014.
[5] 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56-65, 2012.
[6] Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, An-