
The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

Aaron McKenna,¹ Matthew Hanna,¹ Eric Banks,¹ Andrey Sivachenko,¹ Kristian Cibulskis,¹ Andrew Kernytsky,¹ Kiran Garimella,¹ David Altshuler,¹,² Stacey Gabriel,¹ Mark Daly,¹,² and Mark A. DePristo¹,³
¹Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA;
²Center for Human Genetic Research, Massachusetts General Hospital, Richard B. Simches Research Center, Boston, Massachusetts 02114, USA

Abstract

Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS (the 1000 Genomes pilot alone includes nearly five terabases) make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

[Supplemental material is available online at http://www.genome.org.]
In recent years, there has been a rapid expansion in the number of next-generation sequencing platforms, including Illumina (Bentley et al. 2008), the Applied Biosystems SOLiD System (McKernan et al. 2009), 454 Life Sciences (Roche) (Margulies et al. 2005), Helicos HeliScope (Shendure and Ji 2008), and most recently Complete Genomics (Drmanac et al. 2010). Many tools have been created to work with next-generation sequencer data, from read-based aligners like MAQ (Li et al. 2008a), BWA (Li and Durbin 2009), and SOAP (Li et al. 2008b), to single nucleotide polymorphism and structural variation detection tools like BreakDancer (Chen et al. 2009), VarScan (Koboldt et al. 2009), and MAQ. Although these tools are highly effective in their problem domains, there still exists a large development gap between sequencing output and analysis results, in part because tailoring these analysis tools to answer specific scientific questions can be laborious and difficult. General frameworks are available for processing next-generation sequencing data but tend to focus on specific classes of analysis problems, like quality assessment of sequencing data in PIQA (Martinez-Alcantara et al. 2009), or require specialized knowledge of an existing framework, as with BioConductor's ShortRead toolset (Morgan et al. 2009). The lack of sophisticated and flexible
programming frameworks that enable downstream analysts to access and manipulate the massive sequencing data sets in a programmatic way has been a hindrance to the rapid development of new tools and methods.
With the emergence of the SAM file specification (Li et al. 2009) as the standard format for storage of platform-independent next-generation sequencing data, we saw the opportunity to implement an analysis programming framework which takes advantage of this common input format to simplify the up-front coding costs for end users. Here, we present the Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce (Dean and Ghemawat 2008). By separating specific analysis calculations from common data management infrastructure, tools are easy to write while benefiting from ongoing improvements to the core GATK. The GATK engine is constantly being refined and optimized for correctness, stability, and CPU and memory efficiency; this well-structured software core allows the GATK to support advanced features such as distributed and automatic shared-memory parallelization. Here, we highlight the capabilities of the GATK, which has been used to implement a range of analysis methods for projects like The Cancer Genome Atlas (http://cancergenome.nih.gov) and the 1000 Genomes Project (http://www.1000genomes.org), by describing the implementation of depth of coverage analysis tools

and a Bayesian single nucleotide polymorphism (SNP) genotyper, and show the application of these tools to the 1000 Genomes Project pilot data.

Methods

The GATK development environment is currently provided as a platform-independent Java 1.6 framework. The core system uses the nascent standard sequence alignment/map (SAM) format to represent reads using a production-quality SAM library, which is publicly available (http://picard.sourceforge.net). This SAM Java development kit handles parsing the sequencer reads, as well as providing ways to query for reads that span specific genomic regions. The binary alignment version of the SAM format, called binary alignment/map (BAM), is compressed and indexed, and is used by the GATK for performance reasons due to its smaller size and ability to be indexed for search. The core system can accommodate reads from any sequencing platform following conversion to BAM format and sorting in read coordinate order, and has been extensively tested on Illumina (Bentley et al. 2008), Applied Biosystems SOLiD System (McKernan et al. 2009), 454 Life Sciences (Roche) (Margulies et al. 2005), and Complete Genomics (Drmanac et al. 2010). The GATK supports BAM files with alignments emitted from most next-generation sequence aligners and has been tested with many BAMs aligned using a variety of publicly available alignment tools. Many other forms of genomic information are supported as well, including common public database formats like the HapMap (International HapMap Consortium 2003) and dbSNP (Sherry et al. 2001) variation databases. A variety of genotyping and variation formats are also supported by the GATK, including common emerging SNP formats like GLF (http://samtools.sourceforge.net), VCF (http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2), and the GELI text format (http://www.broadinstitute.org/gsa/wiki/index.php/Single_sample_genotyper#The_GeliText_file_format). As the list of available variant and other reference-associated metadata formats is constantly growing, the GATK allows end users to incorporate modules for new formats into the GATK; further information can be found on our website. The GATK is available as an open-source framework on The Broad Institute's website, http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit.
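As a concrete illustration of the data access the SAM library provides, the minimal sketch below opens an indexed BAM and iterates over the reads overlapping a region. It is not GATK code: the class and method names follow the net.sf.samtools package of the Picard SAM library of that era, and the file name x.bam is a placeholder, so treat the details as assumptions.

import java.io.File;
import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.util.CloseableIterator;

public class RegionQueryExample {
    public static void main(String[] args) {
        // requires a coordinate-sorted BAM with an accompanying .bai index
        SAMFileReader reader = new SAMFileReader(new File("x.bam"));
        // iterate over every read overlapping chr1:10,000-20,000
        CloseableIterator<SAMRecord> it = reader.queryOverlapping("chr1", 10000, 20000);
        while (it.hasNext()) {
            SAMRecord read = it.next();
            System.out.println(read.getReadName() + "\t" + read.getAlignmentStart());
        }
        it.close();
        reader.close();
    }
}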

GATK architecture

The GATK was designed using the functional programming paradigm of MapReduce. This approach makes a contract with the developer, in which analysis tools are constructed so that the underlying framework can easily parallelize and distribute processing; this methodology has been used by companies like Google and Yahoo! (Bhandarkar 2009) to manage massive computing infrastructures in a scalable way. MapReduce divides computations into two separate steps; in the first, the larger problem is subdivided

into many discrete independent pieces, which are fed to the map function; this is followed by the reduce function, joining the map results back into a final product. Calculations like SNP discovery and genotyping naturally operate at the map level of MapReduce, since they perform calculations at each locus of the genome independently. On the other hand, calculations that aggregate data over multiple points in the genome, such as peak calling in chromatin immunoprecipitation with massively parallel sequencing (ChIP-seq) experiments, would utilize the reduce function of MapReduce to integrate the heights of read pileups across loci to detect sites of transcriptional regulation (Pepke et al. 2009).
Consequently, the GATK is structured into traversals, which provide the division and preparation of data, and analysis modules (walkers), which provide the map and reduce methods that consume the data. In this contract, the traversal provides a succession of associated bundles of data to the analysis walker, and the analysis walker consumes these bundles of data, optionally emitting an output for each bundle to be reduced. Since many analysis methods for next-generation sequencing data have similar access patterns, the GATK can provide a small but nearly comprehensive set of traversal types that satisfy the data access needs of the majority of analysis tools. The small number of these traversal types, shared among many tools, enables the core GATK development team to optimize each traversal for correctness, stability, CPU performance, and memory footprint and in many cases allows them to automatically parallelize calculations.
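To make the traversal/walker contract concrete, here is a minimal locus walker sketch patterned on the method signatures shown later in Figure 4; the LocusWalker base class name, its type parameters, and the out stream are assumptions. The walker simply counts the loci it is shown.

// A minimal locus walker: map is called once per locus, reduce folds the
// map results together in reference order.
public class CountLociWalker extends LocusWalker<Integer, Long> {
    // map: one call per single-base locus, with its reads and reference data
    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
        return 1; // each locus contributes one unit to the reduction
    }
    // reduceInit: the starting value of the running reduction
    public Long reduceInit() { return 0L; }
    // reduce: join one map result into the accumulated total
    public Long reduce(Integer value, Long sum) { return sum + value; }
    public void onTraversalDone(Long result) {
        out.println("Traversed " + result + " loci."); // 'out' as in Figure 4
    }
}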

Traversal types

As stated above, the GATK provides a collection of common data presentation schemes, called traversals, to walker developers (Table 1). For example, traversals “by each sequencer read” and “by every read covering each single base position in a genome” (Fig. 1) are the standard methods for accessing data for several analyses such as counting reads, building base quality histograms, reporting average coverage of sequencer reads over the genome, and calling SNPs. The “by every read covering each single base position in a genome” traversal, called a locus-based traversal, is the most commonly used traversal type. It presents the analysis walkers with all the associated genomic data, including all the reads that span the genomic location, all reference ordered data (which includes variation data, associated interval information, and other genetic features), and the reference base at the specific locus in the genome. Each of these single-base loci are passed to the walker’s map function in succession. This traversal type accommodates common analysis methods that are concerned with computation over the sequencer-read pile up (the collection of nucleotides from each read at this location), like genotyping and depth of coverage calculations, along with methods that are concerned with variant analysis and concordance calculations.
The other common traversal type, a read-based traversal, presents the analysis walker with each read individually, passing each read once and only once to the walker's map function.
Table 1. Traversal types available in the Genome Analysis Toolkit

TraverseLoci: Each single base locus in the genome, with its associated reads, reference ordered data, and reference base, is presented to the analysis walker.
TraverseReads: Each read is presented to the analysis walker, once and only once, with its associated reference bases.
TraverseDuplicates: The walker is supplied with a list of duplicate reads and unique reads at each reference locus.
TraverseLocusWindows: Walkers are supplied the reads, reference ordered data, and reference bases for a whole interval of the genome, as opposed to a single base as in TraverseLoci.
The GATK was designed in a modular way, which allows the addition of new traversal types that address users’ analysis needs, in addition to providing established common traversal methods, as listed in the table.

Figure 1. Read-based and locus-based traversals. Read-based traversals provide a sequencer read and its associated reference data during each iteration of the traversal. Locus-based traversals are provided with the reference base, associated reference ordered data, and the pileup of read bases at the given locus. These iterations are repeated respectively for each read or each reference base in the input BAM file.

Along with the sequencer read, the walker is presented with the reference bases that the read overlaps (reads that do not align to the reference do not have an accompanying reference sequence). This type of traversal is useful for analyzing read quality scores and alignment scores and for merging reads from multiple BAM files. Currently, traversals of overlapping or otherwise multilocus arrangements are not implemented, but the architecture is sufficiently general to enable such complex access patterns with a concomitant increase in memory requirements. As an example, we are currently designing a traversal that accesses read mate pairs together, using additional computation and memory resources to look ahead for the associated reads in the BAM file at every locus.
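A read-based walker follows the same contract with a per-read map call. The sketch below builds a base-quality histogram; the ReadWalker base class and its map signature are assumptions patterned on the locus walker in Figure 4, while SAMRecord.getBaseQualities() is the SAM library accessor for per-base Phred scores.

// Histogram of Phred base-quality scores across every read in the input.
public class BaseQualityHistogramWalker extends ReadWalker<byte[], long[]> {
    // map: called once and only once per read
    public byte[] map(ReferenceContext ref, SAMRecord read) {
        return read.getBaseQualities(); // one Phred score per base
    }
    public long[] reduceInit() { return new long[64]; } // bins for Q0..Q63
    // reduce: accumulate this read's qualities into the running histogram
    public long[] reduce(byte[] quals, long[] histogram) {
        for (byte q : quals) histogram[q]++;
        return histogram;
    }
    public void onTraversalDone(long[] histogram) {
        for (int q = 0; q < histogram.length; q++)
            if (histogram[q] > 0) out.println("Q" + q + "\t" + histogram[q]);
    }
}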

Sharding

One of the most challenging aspects of next-generation sequencing is managing the massive scale of sequencing data. Intelligently shattering this overwhelming collection of information into manageable pieces is critical for scalability, limiting memory consumption, and effective parallelization of tasks. In this respect the GATK has taken a novel approach by dividing up data into multikilobase-size pieces, which we have termed "shards." Exact shard sizes are calculated by the GATK engine and are based on the underlying storage characteristics of the BAM and the demands on the system. These shards contain all the information for the associated subregion of the genome, including the reference bases, SNP information, and other reference ordered data, along with the reads taken from the SAM file. Each of these shards will be subdivided again by the traversal engine as it feeds data to the walker, but this shattering of data allows large chunks of data to be handed out in a manageable and controlled fashion. The sharding system is agnostic to the underlying file system or overarching execution manager, a design decision made to ensure compatibility with as many system configurations as possible. It would be possible to implement sharding systems that take advantage of data-localizing file systems such as Hadoop (http://hadoop.apache.org), or parallel computing platforms such as LSF (http://www.platform.com) or Sun Grid Engine (http://gridengine.sunsource.net).
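The core idea is easy to see in isolation. The toy sketch below is not GATK engine code (there, shard sizes are derived from the BAM's storage characteristics); it simply shatters a contig into fixed-width shards.

import java.util.ArrayList;
import java.util.List;

public class ShardingSketch {
    /** Split the 1-based interval [1, contigLength] into shards of at most shardSize bases. */
    static List<int[]> shatter(int contigLength, int shardSize) {
        List<int[]> shards = new ArrayList<int[]>();
        for (int start = 1; start <= contigLength; start += shardSize) {
            int stop = Math.min(start + shardSize - 1, contigLength);
            shards.add(new int[] { start, stop }); // one shard: [start, stop]
        }
        return shards;
    }

    public static void main(String[] args) {
        // a 1 Mb contig cut into 100 kb shards yields ten pieces
        for (int[] shard : shatter(1000000, 100000))
            System.out.println(shard[0] + "-" + shard[1]);
    }
}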
The GATK also handles associating reference-ordered data with specific loci in these shards. The engine matches user-supplied data, for instance, dbSNP or HapMap variation information, to their specific locus. The GATK is not limited in the number of reference ordered data tracks that can be presented to the analysis

modules. Multisite data are provided for each base in reads, and for each locus passed to the analysis module. Support for multilocus events, such as genomic translocations, is an active research area and is ongoing.
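Because both the reads and the reference-ordered data are sorted by genomic position, matching a track to loci reduces to a positional lookup. A toy sketch, using a hypothetical dbSNP-like track keyed by position (the record identifiers are invented):

import java.util.TreeMap;

public class RodMatchSketch {
    public static void main(String[] args) {
        // hypothetical variation track: position -> record identifier
        TreeMap<Integer, String> track = new TreeMap<Integer, String>();
        track.put(1050, "rs0001");
        track.put(1200, "rs0002");

        int[] loci = { 1000, 1050, 1100, 1200 };
        for (int locus : loci) {
            String record = track.get(locus); // ROD record at this locus, if any
            System.out.println(locus + "\t" + (record == null ? "." : record));
        }
    }
}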

Interval processing

Many biological questions have context in only limited stretches of the genome; the GATK allows users to bracket the region or regions presented to the analysis walkers by specifying active intervals. These intervals can be specified either on the command line in common interval formats like the UCSC’s BED format (Kent et al. 2002), or with a custom interval format that the GATK has defined. This provides end users with the ability to target regions of interest, like processing only HapMap called sites or determining coverage over all of the exons of a gene set.
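For instance, a BED record is zero-based and half-open, so each line must be shifted when converted to one-based, inclusive genomic coordinates. A minimal parsing sketch (illustrating only the conversion, not the GATK's actual interval reader):

public class BedIntervalSketch {
    /** Convert one BED line (0-based, half-open) to a 1-based, inclusive interval string. */
    static String toInterval(String bedLine) {
        String[] fields = bedLine.split("\t");
        String contig = fields[0];
        int start = Integer.parseInt(fields[1]) + 1; // 0-based start -> 1-based
        int stop = Integer.parseInt(fields[2]);      // half-open end equals inclusive end
        return contig + ":" + start + "-" + stop;
    }

    public static void main(String[] args) {
        // a sub-interval of the extended MHC discussed in the Results
        System.out.println(toInterval("chr6\t32000000\t32100000")); // chr6:32000001-32100000
    }
}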

Merging input files

Many analysis methods are concerned with computation over all data for a single individual, group, or population. For a variety of reasons, data produced from next-generation sequencing are not always organized and clustered in the same manner, making the management and collation of a single composite data source for an analysis tedious and error prone. To address this, the GATK is capable of merging multiple BAM files on the fly, allowing multiple sequencing runs or other input files to be clustered together seamlessly into a single analysis without altering the input files. Sequencer run information, including read-group information, is preserved in this process, which allows walkers to determine the original sequencing information if necessary. The merged sequencing data can also be written to disk; this is an effective means of merging data into meaningful groupings for later use.
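Merging coordinate-sorted inputs on the fly is a k-way merge: keep the head of each input in a priority queue and repeatedly emit the smallest. The sketch below models reads by their alignment starts only; it illustrates the idea rather than the GATK's BAM-level implementation.

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSketch {
    private static class Head implements Comparable<Head> {
        final int start;              // alignment start of this input's current read
        final Iterator<Integer> rest; // the remainder of that input
        Head(int start, Iterator<Integer> rest) { this.start = start; this.rest = rest; }
        public int compareTo(Head other) { return Integer.valueOf(start).compareTo(other.start); }
    }

    /** Emit alignment starts from several coordinate-sorted inputs in global order. */
    static void merge(List<List<Integer>> inputs) {
        PriorityQueue<Head> queue = new PriorityQueue<Head>();
        for (List<Integer> input : inputs) {
            Iterator<Integer> it = input.iterator();
            if (it.hasNext()) queue.add(new Head(it.next(), it));
        }
        while (!queue.isEmpty()) {
            Head head = queue.poll();
            System.out.println(head.start); // hand this read to the traversal
            if (head.rest.hasNext()) queue.add(new Head(head.rest.next(), head.rest));
        }
    }

    public static void main(String[] args) {
        merge(Arrays.asList(Arrays.asList(100, 250, 900), Arrays.asList(120, 300)));
    }
}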

Parallelization

The GATK provides multiple approaches for the parallelization of tasks. With interval processing, users can split tasks by genomic locations (i.e., dividing up a job by chromosome) and farm out each interval to a GATK instance on a distributed computing system, like Sun Grid Engine or LSF. A sample script is included in the supplemental material. The GATK also supports an automatic shared-memory parallelization, where the GATK manages multiple instances of the traversal engine and the given walker on a single machine. Walkers that wish to use this shared memory parallelization implement the TreeReducible interface, which enables the GATK to merge together two reduce results (Fig. 2). With these methods, the GATK is able to ensure correct serial reassembly of the results from multiple threads in reference-based order. The GATK also collects the output from the individual walkers, merging them in the correct reference based order, alleviating the tedious task of tracking output sources from tool developers.
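The contract is tiny: a walker exposes an associative merge of two partial reduce results, which the engine applies in reference order. A sketch of the interface and a counting walker's implementation (the generic shape is an assumption; the name TreeReducible comes from the text):

// The only requirement for shared-memory parallelization: an associative
// merge of two in-order partial reduce results.
interface TreeReducible<ReduceType> {
    ReduceType treeReduce(ReduceType lhs, ReduceType rhs);
}

// For a walker that reduces to a count, merging partial results is addition,
// exactly as in the genotyper of Figure 4.
class CountMerger implements TreeReducible<Integer> {
    public Integer treeReduce(Integer lhs, Integer rhs) {
        return lhs + rhs; // order-preserving combination of two thread results
    }
}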

Data collection and processing

The following analyses were conducted using publicly available pilot data from the 1000 Genomes Project. The data were collected from the Data Coordination Center (DCC) as BAM files aligned using MAQ (Li et al. 2008a) for Illumina reads, SSAHA2 (Ning et al. 2001) for 454 reads, and Corona (http://solidsoftwaretools.com/gf/project/corona/) for SOLiD reads. Sequencing data were stored on a cluster of five Isilon IQ 12000x network area storage devices and processed on a distributed blade farm using Platform Computing's LSF software. Shared-memory parallel processing jobs were executed on an AMD Opteron 16-processor server with 128 gigabytes of RAM.

Figure 2. Shared memory parallel tree-reduction in the GATK. Each thread executes independent MapReduce calls on a single instance of the analysis walker, and the GATK uses the user specified tree-reduce function to merge together the reduce results of each thread in sequential order. The final in-order reduce result is returned.


Results

Depth of coverage walker

Determining the depth of coverage (DoC) in the whole genome, whole exome, or in a targeted hybrid capture sequencing run is a computationally simple, but critical, analysis tool. Depth-of-coverage calculations play an important role in accurate CNV discovery, SNP calling, and other downstream analysis methods (Campbell et al. 2008). Although computationally simple, the creation of this analysis tool is traditionally entangled with the tedious and often fragile task of loading and managing massive stores of sequence read-based information.
We have implemented a depth of coverage walker in the GATK to illustrate the power of the GATK framework, as well as to demonstrate the simplicity of coding to the toolkit. The DoC code contains 83 lines of code, extending the locus walker template. At each site the walker receives a list of the reads covering the reference base and emits the size of the pileup. The end user can optionally exclude reads of low mapping quality, reads indicated to be deletions at the current locus, and other read filtering criteria. Like all GATK-based tools, the DoC analysis can also be provided with a list of regions to calculate coverage, summing the average coverage over each region. This capability is particularly useful in quality control and assessment metrics for hybrid capture resequencing (Gnirke et al. 2009). This methodology can also be used to quantify sequencing results over complex or highly variable regions. One of these regions, the extended major histocompatibility complex (MHC), can have problematic sequencer-read alignments due to the high rate of genetic variability (Stewart et al. 2004). A simple test of an upstream sequencing pipeline is to analyze this region for effective mapping of reads using the depth of coverage walker, the output of which is shown in Figure 3. The figure shows the depth of coverage in the MHC region of all JPT samples from pilot 2 of the 1000 Genomes Project; high variability in read coverage is clearly visible, especially in regions that correspond to HLA regions of the MHC (de Bakker et al. 2006).
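A sketch of the walker's core logic follows. This is not the 83-line tool itself: the MIN_MAPPING_QUALITY threshold and the pileup accessors are illustrative assumptions, patterned on the walker signatures in Figure 4.

// Emit the filtered pileup depth at each locus.
public class DepthOfCoverageWalker extends LocusWalker<Integer, Long> {
    private static final int MIN_MAPPING_QUALITY = 10; // hypothetical filter threshold

    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
        int depth = 0;
        for (SAMRecord read : context.getBasePileup().getReads())
            if (read.getMappingQuality() >= MIN_MAPPING_QUALITY)
                depth++; // count only confidently mapped reads
        out.println(context.getLocation() + "\t" + depth);
        return depth;
    }

    public Long reduceInit() { return 0L; }
    public Long reduce(Integer depth, Long totalBases) { return totalBases + depth; }
    public void onTraversalDone(Long totalBases) {
        out.println("Total filtered base observations: " + totalBases);
    }
}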

Simple Bayesian genotyper

Bayesian estimation of the most likely genotype from next-generation DNA resequencing reads has already proven valuable (Li et al. 2008a,b; Li and Durbin 2009). Our final example GATK tool is a simple Bayesian genotyper. The genotyper, though naïve, provides a framework for implementing more advanced genotyping and variant discovery approaches that incorporate more realistic read-mapping error and base-miscall models. An improved framework could also handle samples or regions of the genome where the ploidy is not two, such as in tumor samples or regions of copy-number variation. This simple Bayesian genotyper serves both as the starting point for more advanced statistical inference tools and also as an ideal place to highlight the shared memory and distributed parallelization capabilities of the GATK core engine.

Figure 3. MHC depth of coverage in JPT samples of the 1000 Genomes Project pilot 2, calculated using the GATK depth of coverage tool. Coverage is averaged over 2.5-kb regions, where lines represent a local polynomial regression of coverage. The track containing all known annotated genes from the UCSC Genome Browser is shown in gray, with HLA genes highlighted in red. Coverage drops near 32.1 M and 32.7 M correspond with increasing density of HLA genes.

In brief, our example genotyper computes the posterior probability of each genotype, given the pileup of sequencer reads that cover the current locus and the expected heterozygosity of the sample; the expected heterozygosity is used to derive the prior probability of each of the 10 possible diploid genotypes. The posterior follows the Bayesian formulation (Shoemaker et al. 1999):
$$p(G \mid D) = \frac{p(G)\,p(D \mid G)}{p(D)}$$
where $D$ represents our data (the read base pileup at this reference base) and $G$ represents the given genotype. The term $p(G)$ is the prior probability of seeing this genotype, which is influenced by its identity as a homozygous reference, heterozygous, or homozygous nonreference genotype. The value $p(D)$ is constant over all genotypes and can be ignored, and
$$p(D \mid G) = \prod_{b \in \text{pileup}} p(b \mid G)$$
where $b$ represents each base covering the target locus. The probability of each base given the genotype is defined as $p(b \mid G) = p(b \mid \{A_1, A_2\}) = \frac{1}{2} p(b \mid A_1) + \frac{1}{2} p(b \mid A_2)$ when the genotype $G = \{A_1, A_2\}$ is decomposed into its two alleles. The probability of seeing a base given an allele is
$$p(b \mid A) = \begin{cases} e/3, & b \neq A \\ 1 - e, & b = A \end{cases}$$
and the epsilon term $e$ is the reversed Phred-scaled quality score at the base (i.e., the probability that the base call is wrong). Finally, the assigned genotype at each site is the genotype with the greatest posterior probability, which is emitted to disk if its log-odds score exceeds a set threshold.
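As a worked instance of these formulas (an illustrative example, not from the original text): for a single observed base $b = A$ with Phred quality Q20, the miscall probability is $e = 10^{-20/10} = 0.01$, so

$$p(b \mid \{A,A\}) = 1 - e = 0.99$$
$$p(b \mid \{A,C\}) = \tfrac{1}{2}(1 - e) + \tfrac{1}{2}\cdot\tfrac{e}{3} = 0.495 + 0.0017 \approx 0.497$$
$$p(b \mid \{C,C\}) = \tfrac{e}{3} \approx 0.0033$$

Each allele matching the observed base contributes half of its weight, and multiplying such terms over the whole pileup is what lets the posterior separate homozygous from heterozygous genotypes.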
The algorithm was implemented in the GATK as a locus-based walker, in 57 lines of Java code (Fig. 4). Along with implementing the locus walker strategy, it also implements the TreeReducible interface, which allows the GATK to parallelize the MapReduce calls across processors. We applied the genotyping algorithm above to Pilot 2 deep-coverage data for the CEU daughter, sample NA12878, on chromosome 1 of the 1000 Genomes Project data using Illumina sequencing technology. On a single processor, this calculation requires 863 min to process the 247,249,719 loci of chromosome 1.
public SimpleCall map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    // we don't deal with the N ref base case
    if (ref.getBase() == 'N' || ref.getBase() == 'n') return null;

    ReadBackedPileup pileup = context.getBasePileup();
    // start each genotype's log10 likelihood from its reference-polarized prior
    double likelihoods[] = DiploidGenotypePriors.getReferencePolarizedPriors(ref.getBase(),
            DiploidGenotypePriors.HUMAN_HETEROZYGOSITY,
            0.01);

    // get the bases and qualities from the pileup
    byte bases[] = pileup.getBases();
    byte quals[] = pileup.getQuals();

    // for each genotype, accumulate its likelihood over every base in the pileup
    for (GENOTYPE genotype : GENOTYPE.values())
        for (int index = 0; index < bases.length; index++) {
            if (quals[index] > 0) {
                // our epsilon is the de-Phred scored base quality
                double epsilon = Math.pow(10, quals[index] / -10.0);
                byte pileupBase = bases[index];
                double p = 0;
                for (char r : genotype.toString().toCharArray())
                    p += r == pileupBase ? 1 - epsilon : epsilon / 3;
                likelihoods[genotype.ordinal()] += Math.log10(p / genotype.toString().length());
            }
        }

    Integer sortedList[] = MathUtils.sortPermutation(likelihoods);
    // create a call using the best genotype (GENOTYPE.values()[sortedList[9]].toString()) and
    // calculate the LOD score as best minus next best (indices 9 and 8 in the sorted list,
    // since the best likelihoods are closest to zero)
    return new SimpleCall(context.getLocation(),
            GENOTYPE.values()[sortedList[9]].toString(),
            likelihoods[sortedList[9]] - likelihoods[sortedList[8]],
            ref.getBase());
}

public Integer reduceInit() {
    return 0;
}

public Integer reduce(SimpleCall value, Integer sum) {
    // emit calls whose LOD exceeds the threshold; the reduce value counts loci seen
    if (value != null && value.LOD > LODScore) outputStream.println(value.toString());
    return sum + 1;
}

public Integer treeReduce(Integer lhs, Integer rhs) {
    return lhs + rhs;
}

public void onTraversalDone(Integer result) {
    out.println("Simple Genotyper genotyped " + result + " loci.");
}
Figure 4. Code sample for the simple genotyper walker. The map function uses a naïve Bayesian method to generate genotypes, given the pileup of read bases at the current locus, and emits a call containing the likelihoods for each of the 10 possible genotypes (assuming a diploid organism). This is then output to disk. The implementation of the tree-reduce function provides directions to the GATK engine for reducing two in-order parallel reduce results, allowing parallelization of the genotyper.

Figure 5. Parallelization of genotyping in the GATK. (A) 1000 Genomes Project sample NA12878's chromosome 1 was genotyped using both shared memory parallelization and distributed parallelization methods. Both methods follow a near exponential curve (B) as the processor count was increased, and using the distributed methodology it was possible to see elapsed-time gains out to 50 processors.

Discussion

By separating specific analysis calculations from the common data management infrastructure, we have been able to optimize the GATK framework for correctness, stability, and CPU and memory efficiency, and even to automatically parallelize most analysis tools on both shared memory machines and distributed clusters. Despite less than 1 yr of development, the GATK already underlies several critical tools in both the 1000 Genomes Project and The Cancer Genome Atlas, including quality-score recalibration, multiple-sequence realignment, HLA typing, multiple-sample SNP genotyping, and indel discovery and genotyping. The GATK's robustness and efficiency has enabled these tools to be easily and rapidly deployed in recent projects to routinely process terabases of Illumina, SOLiD, and 454 data, as well as processing hundreds of lanes each week in the production resequencing facilities at the Broad Institute. In the near future, we intend to expand the GATK to support additional data access patterns to enable the implementation of local reference-guided assembly, copy-number variation detection, inversions, and general structural variation algorithms.

Acknowledgments

We thank our colleagues in the Medical and Population Genetics and Cancer Informatics programs at the Broad Institute, who have encouraged and supported us during the development of the Genome Analysis Toolkit and have been such enthusiastic early adopters, in particular, Gad Getz, Anthony Philippakis, and Paul de Bakker. We also thank our reviewers for their valuable feedback on the manuscript. This work was supported by grants from the National Human Genome Research Institute, including the Large Scale Sequencing and Analysis of Genomes grant (54 HG003067) and the Joint SNP and CNV calling in 1000 Genomes sequence data grant (U01 HG005208).

References

Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53-59.

Bhandarkar M. 2009. Practical problem solving with hadoop and pig. In USENIX. The USENIX Association, San Diego, CA.

Campbell PJ, Stephens PJ, Pleasance ED, O’Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, et al. 2008. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 40: 722-729.

Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, et al. 2009. BreakDancer: An algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6: 677-681.

Dean J, Ghemawat S. 2008. MapReduce: Simplified data processing on large clusters. Commun ACM 51: 107-113.

de Bakker PIW, McVean G, Sabeti PC, Miretti MM, Green T, Marchini J, Ke X, Monsuur AJ, Whittaker P, Delgado M, et al. 2006. A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nat Genet 38: 1166-1172.

Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G, et al. 2010. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327: 78-81.
Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, et al. 2009. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 27: 182-189.

International HapMap Consortium. 2003. The International HapMap Project. Nature 426: 789-796.

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. 2002. The Human Genome Browser at UCSC. Genome Res 12: 996-1006.
Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, Ding L. 2009. VarScan: Variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25: 2283-2285.

Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754-1760.

Li H, Ruan J, Durbin R. 2008a. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18: 1851-1858.

Li R, Li Y, Kristiansen K, Wang J. 2008b. SOAP: Short oligonucleotide alignment program. Bioinformatics 24: 713-714.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-2079.

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-380.

Martinez-Alcantara A, Ballesteros E, Feng C, Rojas M, Koshinsky H, Fofanov V, Havlak P, Fofanov Y. 2009. PIQA: Pipeline for Illumina G1 genome analyzer data quality assessment. Bioinformatics 25: 2438-2439.

McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, Clouser CR, Duncan C, Ichikawa JK, Lee CC, et al. 2009. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res 19: 1527-1541.

Morgan M, Anders S, Lawrence M, Aboyoun P, Pages H, Gentleman R. 2009. ShortRead: A bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics 25: 2607-2608.
Ning Z, Cox AJ, Mullikin JC. 2001. SSAHA: A fast search method for large DNA databases. Genome Res 11: 1725-1729.

Pepke S, Wold B, Mortazavi A. 2009. Computation for ChIP-seq and RNAseq studies. Nat Methods 6: S22-S32.

Shendure J, Ji H. 2008. Next-generation DNA sequencing. Nat Biotechnol 26: 1135-1145.

Sherry S, Ward M-H, Kholodov M, Baker J, Phan L, Smigielski E, Sirotkin K. 2001. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res 29: 308-311.

Shoemaker JS, Painter IS, Weir BS. 1999. Bayesian statistics in genetics: A guide for the uninitiated. Trends Genet 15: 354-358.

Stewart CA, Horton R, Allcock RJN, Ashurst JL, Atrazhev AM, Coggill P, Dunham I, Forbes S, Halls K, Howson JMM, et al. 2004. Complete MHC haplotype sequencing for common disease gene mapping. Genome Res 14: 1176-1187.

Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, et al. 2008. The diploid genome sequence of an Asian individual. Nature 456: 60-65.

Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen Y-J, Makhijani V, Roth GT, et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452: 872-876.
Received March 11, 2010; accepted in revised form July 12, 2010.

³Corresponding author.

    E-mail depristo@broadinstitute.org.

    Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.107524.110.