- Split View 分割视图
-
Views
-
Cite 引用
Cite
Bo Liu, Dixian Zhu, Yadong Wang, deBWT: parallel construction of Burrows–Wheeler Transform for large collection of genomes with de Bruijn-branch encoding, Bioinformatics, Volume 32, Issue 12, June 2016, Pages i174–i182, https://doi.org/10.1093/bioinformatics/btw266
- Share Icon Share 分享
Abstract 摘要
Motivation : With the development of high-throughput sequencing, the number of assembled genomes continues to rise. It is critical to well organize and index many assembled genomes to promote future genomics studies. Burrows–Wheeler Transform (BWT) is an important data structure of genome indexing, which has many fundamental applications; however, it is still non-trivial to construct BWT for large collection of genomes, especially for highly similar or repetitive genomes. Moreover, the state-of-the-art approaches cannot well support scalable parallel computing owing to their incremental nature, which is a bottleneck to use modern computers to accelerate BWT construction.
动机:随着高通量测序技术的发展,已组装基因组的数量不断增加。对众多已组装的基因组进行良好的组织和索引对促进未来的基因组学研究至关重要。Burrows-Wheeler Transform(BWT)是一种重要的基因组索引数据结构,它有许多基本应用;然而,为大量基因组,尤其是高度相似或重复的基因组构建 BWT 仍然不是一件容易的事。此外,最先进的方法由于其增量性质,不能很好地支持可扩展的并行计算,这成为利用现代计算机加速 BWT 构建的瓶颈。
Results : We propose de Bruijn branch-based BWT constructor (deBWT), a novel parallel BWT construction approach. DeBWT innovatively represents and organizes the suffixes of input sequence with a novel data structure, de Bruijn branch encoding. This data structure takes the advantage of de Bruijn graph to facilitate the comparison between the suffixes with long common prefix, which breaks the bottleneck of the BWT construction of repetitive genomic sequences. Meanwhile, deBWT also uses the structure of de Bruijn graph for reducing unnecessary comparisons between suffixes. The benchmarking suggests that, deBWT is efficient and scalable to construct BWT for large dataset by parallel computing. It is well-suited to index many genomes, such as a collection of individual human genomes, with multiple-core servers or clusters.
成果:我们提出了基于 de Bruijn 分支的 BWT 构造器(deBWT),这是一种新颖的并行 BWT 构造方法。DeBWT 创新性地使用一种新型数据结构 de Bruijn 分支编码来表示和组织输入序列的后缀。这种数据结构利用 de Bruijn 图的优势,便于对具有长公共前缀的后缀进行比较,从而打破了重复基因组序列 BWT 构建的瓶颈。同时,deBWT 还利用 de Bruijn 图的结构减少了后缀间不必要的比较。基准测试表明,deBWT 可以通过并行计算高效、可扩展地构建大型数据集的 BWT。它非常适合使用多核服务器或集群为许多基因组(如人类单个基因组集合)建立索引。
Availability and implementation : deBWT is implemented in C language, the source code is available at https://github.com/hitbc/deBWT or https://github.com/DixianZhu/deBWT
可用性和实现:deBWT 用 C 语言实现,源代码可从 https://github.com/hitbc/deBWT 或 https://github.com/DixianZhu/deBWT 获取。
Contact: ydwang@hit.edu.cn
联系人: ydwang@hit.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
补充信息:补充数据可在 Bioinformatics online 上获取。
1 Introduction 1 引言
With the rapid development and ubiquitous application of high-throughput sequencing, many genomes have been sequenced in cutting-edge genomics studies. For example, 1000 Genomes ( The 1000 Genomes Project Consortium, 2015 ) and UK10K ( The UK10K Consortium, 2015 ) projects have sequenced many thousands of individual human genomes. Moreover, as the cost of sequencing continuously decreases, e.g. the cost of sequencing a human sample has already been lower than 1000 dollars ( Watson, 2014 ), the number of genomes may explosively increase in the future. Under this circumstance, it is fundamental to well organize and index the large amount of genomes to facilitate future genomics studies.
随着高通量测序技术的快速发展和广泛应用,许多基因组已在前沿基因组学研究中完成测序。例如,"1000 基因组"(The 1000 Genomes Project Consortium, 2015)和 "UK10K"(The UK10K Consortium, 2015)项目已经测序了数千个人类基因组。此外,随着测序成本的不断降低,例如人类样本的测序成本已经低于 1000 美元(Watson,2014 年),基因组的数量在未来可能会呈爆炸式增长。在这种情况下,对大量基因组进行良好的组织和索引是促进未来基因组学研究的基础。
Burrows–Wheeler Transform (BWT; Burrows and Wheeler, 1994 ; Ferragina and Manzini, 2000 ) is a self-indexing data structure having many fundamental applications, such as genome indexing ( Hon et al. , 2004 ; Karkkainen, 2007 ), sequence alignment ( Lam et al. , 2008 ; Li and Durbin, 2009a ; Langmead and Salzberg, 2012 ), genome compression ( Makinen et al. , 2010 ; Cox et al. , 2012 ), genome assembly ( Simpson and Durbin, 2012 ; Li, 2012 ) and sequencing error correction ( Cox et al. , 2011 ). However, the BWT construction of genomic sequence(s) is a non-trivial task. Mainly, the core of BWT construction is to determine the lexicographical order of all the suffixes of the input sequence(s). Because there could be many repetitive sequences within a genome ( Treangen and Salzberg, 2012 ), the cost would be prohibitively high to straightforwardly compare all the suffixes to determine their lexicographical orders. The problem is even more serious for constructing the BWT of many highly similar genomes, such as a large collection of individual human genomes, as there would be many common sequences to make the whole input even more repetitive.
Burrows-Wheeler Transform(BWT;Burrows 和 Wheeler,1994 年;Ferragina 和 Manzini,2000 年)是一种自索引数据结构,有许多基本应用,如基因组索引(Hon 等人,2004 年;Karkkainen,2007 年)、序列比对(Lam 等人,2008 年;Li 和 Durbin,2009a 年;Langmead 和 Salzberg,2012 年)、基因组压缩(Makinen 等人,2010 年;Cox 等人,2012 年)、基因组组装(Makinen et al.,2008;Li 和 Durbin,2009a;Langmead 和 Salzberg,2012)、基因组压缩(Makinen 等人,2010;Cox 等人,2012)、基因组组装(Simpson 和 Durbin,2012;Li,2012)和测序纠错(Cox 等人,2011)。然而,基因组序列的 BWT 构建并非易事。主要而言,BWT 构建的核心是确定输入序列所有后缀的词序。由于基因组中可能有许多重复序列(Treangen 和 Salzberg,2012 年),直接比较所有后缀以确定其词法顺序的成本过高。对于构建许多高度相似基因组的 BWT 来说,这个问题甚至更加严重,比如大量的人类个体基因组,因为会有许多共同的序列使整个输入更加重复。
Efforts have been made to BWT construction. As the BWT of input sequence(s) can be directly derived from the corresponding suffix array (SA), many extant SA construction methods ( Smyth and Turpin, 2007 ) are applicable to this task. However, the memory footprint may be not practical for large genomic sequences, e.g. sequence over tens of Giga basepairs (Gbp), as proposed SA construction methods usually need to store the entire SA in memory. Although there are also proposed SA construction methods having smaller memory footprints ( Crauser and Ferragina, 2008 ; Nong et al. , 2014 ), however, they are at the expense of speed, as they need to use external memory.
人们一直在努力构建 BWT。由于输入序列的 BWT 可以直接从相应的后缀数组(SA)中导出,许多现存的后缀数组构建方法(Smyth 和 Turpin,2007 年)都适用于这项任务。然而,对于大型基因组序列,例如超过数十亿基对(Gbp)的序列,内存占用可能并不实用,因为拟议的后缀阵列构建方法通常需要在内存中存储整个后缀阵列。虽然也有人提出了内存占用更小的 SA 构建方法(Crauser 和 Ferragina,2008 年;Nong 等人,2014 年),但这些方法需要使用外部内存,因此牺牲了速度。
Most of the state-of-the-art BWT construction methods take the advantage of the incremental construction approach, which is on the basis of the property that the relative lexicographical order of a set of sorted suffixes will not be changed by adding new suffixes. ( Hon et al. , 2007 ) proposed the first compressed suffix array (CSA) construction method using this property. Mainly, it logarithmically partitions the input sequence into blocks, and incrementally builds the CSA from shortest to longest suffixes in three steps: (i) construct the SA for a new block of suffixes; (ii) sequentially insert each of the suffixes within the new block into the CSA of the old blocks, based on the property that the relative order of the suffixes within the old blocks do not change, and the CSA values monotonically increase for the suffixes having the same initial character; (iii) merge the new and old blocks to update the CSA. There are other proposed BWT construction methods in this incremental blockwise approach, which have various implementations. ( Ferragina et al. , 2012 ) proposed a BWT construction method, bwt-disk, similar to ( Hon et al. , 2007 ), which also logarithmically partitions the input sequence. But it sorts each new block with a modified DC3 algorithm, and takes advantage of the last-first mapping (LF-mapping) property of BWT ( Ferragina and Manzini, 2005 ) to merge new and old blocks. ( Liu et al. , 2014b ) also proposed a BWT construction method, ParaBWT, in a similar manner. It uses a longest common prefix table to facilitate the sorting of newly added suffixes, but also merges the new and old blocks based on LF-mapping. The main contribution of this method is that it implements parallelization for the sorting of newly added blocks, which is beneficial for processing large input sequences. Other than constructing BWT for one or more large sequences, this approach is also used for indexing large collections of sequencing reads. Bauer et al. (2013) proposed BCR, an algorithm for constructing the BWT of a large set of reads. It uses a specific partition of SA, i.e. partitioning all the suffixes into blocks by their positions on the corresponding sequencing reads. With this partition, the markers (denoted as specific characters) of the reads can be fully used for improving the efficiency of the sorting and merging of blocks. ( Li, 2014 ) also proposed a similar method, RopeBWT2, with improved ability of handling the sequences in various lengths.
大多数最先进的 BWT 构建方法都采用了增量构建方法,该方法基于这样一个特性,即一组已排序后缀的相对词序不会因为添加新后缀而改变。(Hon 等人,2007 年)首次提出了利用这一特性的压缩后缀数组(CSA)构建方法。该方法主要是将输入序列按对数分割成块,然后分三步从最短后缀到最长后缀逐步构建 CSA:(i) 构建新后缀块的 CSA;(ii) 根据旧后缀块中后缀的相对顺序不变,且具有相同初始字符的后缀的 CSA 值单调递增的特性,将新后缀块中的每个后缀依次插入旧后缀块的 CSA 中;(iii) 合并新旧后缀块,更新 CSA。在这种增量式顺时针方法中,还有其他一些拟议的 BWT 构建方法,它们有不同的实现方式。(Ferragina 等人,2012)提出了一种 BWT 构建方法 bwt-disk,与(Hon 等人,2007)类似,也是对输入序列进行对数分割。但它采用改进的 DC3 算法对每个新块进行排序,并利用 BWT 的后先映射(LF-mapping)特性(Ferragina 和 Manzini,2005 年)合并新旧块。(Liu等人,2014b)也以类似的方式提出了一种BWT构建方法--ParaBWT。它使用一个最长公共前缀表来方便对新添加的后缀进行排序,同时也基于 LF 映射来合并新旧词块。 这种方法的主要贡献在于,它实现了新添加区块排序的并行化,有利于处理大型输入序列。除了为一个或多个大型序列构建 BWT 之外,这种方法还可用于为大型测序读取集合建立索引。Bauer 等人(2013 年)提出了 BCR 算法,用于构建大型读取集合的 BWT。它使用了一种特定的 SA 分区,即按照后缀在相应测序读数上的位置将所有后缀划分成块。通过这种划分,可以充分利用读数的标记(表示为特定字符)来提高块的排序和合并效率。(Li,2014)也提出了一种类似的方法 RopeBWT2,并改进了处理不同长度序列的能力。
Besides the blockwise incremental approach, there are also other approaches proposed. ( Karkkainen, 2007 ) proposed a method that constructs BWT in a different blockwise manner. It samples a set of suffixes as splitters to bin all the suffixes into various blocks, and for each of the blocks, the proposed method addresses all the suffixes with difference cover sample (DCS). ( Liu et al. , 2014a ) proposed a graphics processing unit (GPU-based) BWT construction method, CX1, for indexing a large set of short reads. The main idea of the method is to bin all the suffixes by their initial k-mers, and address all the bins with GPU-based radix sort. This method is mainly designed for reads with limited length, as the radix sort relies on the auxiliary characters attached to the reads.
除了顺时针递增法,还有其他一些方法。(Karkkainen, 2007)提出了一种以不同的顺时针方式构建 BWT 的方法。它将一组后缀作为分割器进行采样,将所有后缀分成不同的区块,对于每个区块,提出的方法用差分覆盖采样(DCS)处理所有后缀。( Liu 等人,2014a) 提出了一种基于图形处理器(GPU)的 BWT 构建方法 CX1,用于为大量短读建立索引。该方法的主要思路是将所有后缀按其初始 k-mers 进行分选,并使用基于 GPU 的 radix 排序对所有分选进行处理。这种方法主要针对长度有限的读取数据,因为弧度排序依赖于读取数据所附的辅助字符。
The major advantage of the incremental blockwise approach is that, based on the LF-mapping property, it provides an effective way to compare suffixes with long common prefix, which is critical to the BWT construction of large repetitive genomes. However, owing to the incremental nature, this approach is not suitable for parallel computing. Considering the rapid increase of assembled genomes, the input sequence will be much larger than before. In this situation, processing the large sequence with parallel computing is favorable, especially that modern servers and clusters have much more CPU cores and RAM than before. Recent studies have made the efforts to BWT construction with parallel computing. ParaBWT implemented the parallel BWT construction for large sequence; however, the results demonstrated that it is not scalable, i.e. the speedup will saturate when a couple of threads (e.g. eight threads) are used. This is mainly owing to the bottleneck of the incremental blockwise approach. As a non-incremental approach, CX1 is scalable; however, it has limitation on the length of input sequences.
增量式顺时针方法的主要优点是,基于 LF 映射特性,它提供了一种有效的方法来比较具有长公共前缀的后缀,这对于大型重复基因组的 BWT 构建至关重要。然而,由于其增量性质,这种方法并不适合并行计算。考虑到已组装基因组的快速增长,输入序列将比以前大得多。在这种情况下,利用并行计算处理大序列是有利的,尤其是现代服务器和集群的 CPU 内核和内存都比以前多得多。最近的一些研究为利用并行计算构建 BWT 做出了努力。ParaBWT 实现了大序列的并行 BWT 构建,但结果表明它不具有可扩展性,即当使用几个线程(如 8 个线程)时,速度提升将达到饱和。这主要是由于增量顺时针方法的瓶颈造成的。作为一种非递增方法,CX1 是可扩展的,但它对输入序列的长度有限制。
Herein, we propose de Bruijn branch-based BWT constructor (deBWT), a novel scalable parallel BWT construction method, which draws support from de Bruijn graph (dBG). The relationship between dBG and suffix trie was explored in previous studies ( Marcus et al. , 2014 ); however, it has still not been fully used in BWT construction. The main contribution of deBWT is to represent and organize the suffixes of input sequence with the property of dBG that all the copies of the same k-mer within the input sequence(s) collapse to the same vertex of the dBG of the sequence(s). The critical point of deBWT is to represent the suffixes with a novel data structure, de Bruijn branch encoding, which is derived from the unipaths of the dBG of the input sequence. This data structure facilitates the comparison between the suffixes with long common prefix. Moreover, deBWT partitions the whole BWT into blocks by their initial k-mers, and uses the property of dBG to avoid unnecessary sorting for some of the blocks, i.e. the BWT characters of some blocks can be derived in constant time based on the topology of the graph.
在此,我们提出了基于 de Bruijn 分支的 BWT 构建器(deBWT),这是一种新颖的可扩展并行 BWT 构建方法,它从 de Bruijn 图(dBG)中获得支持。之前的研究(Marcus 等人,2014 年)探讨了 dBG 与后缀三元组之间的关系;然而,它在 BWT 构建中仍未得到充分应用。deBWT 的主要贡献在于利用 dBG 的特性来表示和组织输入序列的后缀,即输入序列中相同 k-mer 的所有副本都塌缩到序列 dBG 的相同顶点上。deBWT 的关键点在于用一种新颖的数据结构--de Bruijn 分支编码--来表示后缀,这种数据结构来自输入序列 dBG 的单路径。这种数据结构便于比较具有长共同前缀的后缀。此外,deBWT 还能将整个 BWT 按其初始 k-mers 分割成块,并利用 dBG 的特性避免对某些块进行不必要的排序,即某些块的 BWT 字符可根据图的拓扑结构在恒定时间内导出。
We benchmarked deBWT with various datasets, and the result suggests that it has fast speed and good scalability to multiple threads. Especially, deBWT is well-suited to the BWT construction of a collection of highly similar genomic sequences, such as multiple human genomes, which may have wide application in the future genomics studies.
我们用各种数据集对 deBWT 进行了基准测试,结果表明它具有快速的速度和良好的多线程可扩展性。特别是,deBWT 非常适合构建高度相似的基因组序列集合(如多个人类基因组)的 BWT,这可能会在未来的基因组学研究中得到广泛应用。
2 Methods 2 种方法
2.1 Preliminary 2.1 初步情况
Let a DNA sequence,
假设 DNA 序列
A suffix of
The dBG of
In the following subsections, we present the unipath-based BWT construction approach at first (Section 2.2), then the overview of deBWT (Section 2.3) and more detailed information about the implementation of the various steps of the method (Section 2.4-2.7).
在下面的小节中,我们将首先介绍基于单路径的 BWT 构建方法(第 2.2 节),然后介绍 deBWT 的概述(第 2.3 节)以及有关该方法各步骤实现的详细信息(第 2.4-2.7 节)。
2.2 Unipath-based BWT construction
2.2 基于统一路径的 BWT 构建
2.2.1 The unipath representation of suffix
2.2.1 后缀的单路径表示法
The DNA sequence
DNA 序列
a DNA sequence can be represented by an ordered list of the unipaths of the dBG.
DNA 序列可以用 dBG 的单路径有序列表来表示。
This lemma is easy to justify by collapsing all such edges,
将所有这样的边
Each of the suffixes can be represented as an ordered list of unipaths, or a substring of a specific unipath appending an ordered list of unipaths.
每个后缀都可以表示为单链路的有序列表,或特定单链路附加单链路有序列表的子串。
Each of the suffixes,
每个后缀
2.2.2 The unipath-based comparison between two suffixes
2.2.2 基于统一路径的两后缀比较
Given two suffixes of
给定两个后缀分别为
First, considering the two ordered lists of k-mers,
首先,考虑两个有序的 k-mers 列表
Secondly, if
其次,如果
Then it becomes an iterative comparison between the aligned unipaths, i.e. if the two unipaths,
如果
Considering the property of dBG, if the two unipaths,
考虑到 dBG 的特性,如果
With these observations, we designed a novel data structure named as de Bruijn branch encoding ( Fig. 1d ). The de Bruijn branch encoding of
根据这些观察结果,我们设计了一种名为 de Bruijn 分支编码的新型数据结构(图 1d)。
Lemma 2: For two suffixes at least k bp long, their lexicographical order can be determined by comparing their projection suffixes defined by the de Bruijn branch encoding of the DNA sequence, if the initial k-mers of the two suffixes are identical.
定理 2:对于长度至少为 k bp 的两个后缀,如果两个后缀的初始 k-mers 相同,则可以通过比较 DNA 序列的 de Bruijn 分支编码所定义的投影后缀来确定它们的词典顺序。
This lemma is easy to justify with the observations mentioned above, and it provides a cost-effective solution of the comparison between two suffixes with long common prefix.
根据上述观察结果,这个 Lemma 很容易得到证明,而且它为具有长共同前缀的两个后缀之间的比较提供了一个经济有效的解决方案。
2.2.3 The k-mer partition of BWT
2.2.3 BWT 的 k-mer 分区
According to the definition of SA, the suffixes starting with identical k-mers (suppose it as
根据 SA 的定义,以相同 k-mer 开始的后缀(假设为
Thus, for each of the
因此,对于对应于不同初始 k-mer 的
Lemma 3: If a vertex
定理 3:如果
This lemma is obtained with the observation that if a vertex of the dBG is single-in, all the suffixes with this k-mer must have the same previous character, i.e. the BWT of this part is purely
这个 Lemma 是通过观察得到的,即如果 dBG 的一个顶点是单入顶点,那么所有带有这个 k-mer 的后缀都必须具有相同的前一个字符,即这个部分的 BWT 纯粹是
2.3 Overview of the deBWT method
2.3 deBWT 方法概述
As in many cases, the input is not only one, but a set of DNA sequences, such as a set of genomes, chromosomes or assembled contigs, we re-define the sequence to be indexed as
由于在很多情况下,输入的不只是一个,而是一组 DNA 序列,如一组基因组、染色体或组装等位基因,因此我们将序列重新定义为
DeBWT constructs the BWT of
DeBWT 分以下三个主要阶段构建
dBG building and analysis: deBWT builds a dBG of all the involved sequences
with a user-defined parameter k, sort the k-mers to build the k-mer partition of the BWT, recognize all the unipaths as well as the multiple-out and multiple-in vertices and solves all the parts corresponding to single-in vertices.
dBG 构建和分析:deBWT 用用户定义的参数 k 对所有涉及的序列 构建 dBG,对 k 单链体进行排序以构建 BWT 的 k 单链体分区,识别所有单链路以及多出和多入顶点,并解决与单入顶点相对应的所有部分。de Bruijn branch encoding generation: deBWT computes the de Bruijn branch encoding of
( ), bins all the suffixes of corresponding to the unsolved parts of BWT and computes their projection suffixes.
de Bruijn 分支编码生成:deBWT 计算 ( ) 的 de Bruijn 分支编码,将 中对应于 BWT 未解决部分的所有后缀进行分类,并计算其投影后缀。BWT construction with projection suffixes: for each of the unsolved BWT parts, deBWT builds the SA of the involved projection suffixes to determine the BWT characters of the part.
使用投影后缀构建 BWT:对于每个未解决的 BWT 部分,deBWT 都会构建相关投影后缀的 SA,以确定该部分的 BWT 字符。
A schematic illustration of the method is in Figure 2 .
图 2 是该方法的示意图。
2.4 dBG building and analysis
2.4 dBG 的构建与分析
It needs all the k-mers as well as the numbers of their copies in
它需要
To build the k-mer partition of the BWT, deBWT sorts all the (k + 1)-mers by their lexicographic order, and respectively merges all the (k + 1)-mers with identical first k bp prefixes into k-mers to build the partition. For each of the k-mers, its number of copies is calculated by directly summing up the corresponding (k + 1)-mers. Meanwhile, the multiple-out vertices can also be recognized during the merging, and deBWT records all the multiple-out vertices for building the de Bruijn branch encoding in the later step.
为了建立 BWT 的 k-聚合体分区,deBWT 将所有 (k + 1)-mers 按其词序排序,并分别将前 k 个 bp 前缀相同的所有 (k + 1)-mers 合并成 k-聚合体,以建立分区。对于每一个 k-mers,其副本数是通过直接求和相应的 (k + 1)-mers 来计算的。同时,在合并过程中还可以识别出多重出顶点,deBWT 会记录所有多重出顶点,以便在后一步建立 de Bruijn 分支编码。
The multiple-in vertices is then recognized based on the sorted (k + 1)-mer list. That is, deBWT partitions the sorted (k + 1)-mer list into four ordered lists, each corresponding to a specific initial character, A, C, G or T. Then deBWT recognizes all the multiple-in vertices by a four-way merging on the lists. During the merging, two tasks are done simultaneously, i.e. if a vertex is recognized as single-in, deBWT assigns it the first character of the corresponding (k + 1)-mer and the number of copies to solve the BWT part; otherwise, deBWT records the k-mer in another data structure to generate the
然后,根据排序后的 (k + 1)-mer 列表识别多入顶点。也就是说,deBWT 将排序后的 (k + 1)-mer 列表分成四个有序列表,每个列表对应一个特定的初始字符,即 A、C、G 或 T,然后 deBWT 通过对列表进行四向合并来识别所有多入顶点。在合并过程中,有两项任务同时进行,即如果一个顶点被识别为单入顶点,deBWT 将为其分配相应 (k + 1)-mer 的首字符和副本数,以解决 BWT 部分;否则,deBWT 将在另一个数据结构中记录 k-mer,以便在后一步生成
It is worth noting that, owing to the existence of the auxiliary character #, deBWT recognizes all the k-mers that have at least one copy previous to and next to # as multiple-out and multiple-in k-mers, respectively.
值得注意的是,由于辅助字符 # 的存在,deBWT 将所有在 # 之前和 # 之后至少有一个副本的 k-mers 分别识别为多出 k-mers 和多入 k-mers。
The parallelization of this step is straightforward. The k-mer counting can be directly parallelized ( Marçais and Kingsford, 2011 ). There are many feasible parallel integer sorting approaches for sorting the (k + 1)-mers and we used a simple approach, i.e. a radix sort is implemented to partition all the (k + 1)-mers into blocks, and each of the blocks is further processed in parallel by integer quick sort. The merging of the k-mer lists is not executed in parallel; however, the cost is linear to the number of (k + 1)-mers and not expensive.
这一步骤的并行化非常简单。k-mer 计数可以直接并行化(Marçais 和 Kingsford,2011 年)。有许多可行的并行整数排序方法来对 (k + 1)-mers 进行排序,而我们使用了一种简单的方法,即使用弧度排序将所有 (k + 1)-mers 划分为多个区块,然后通过整数快速排序对每个区块进行并行处理。k-mer 列表的合并并不是并行执行的;不过,合并成本与 (k + 1)-mers 的数量成线性关系,并不昂贵。
2.5 The generation of de Bruijn branch encoding and projection suffixes
2.5 生成 de Bruijn 分支编码和投影后缀
DeBWT builds a hash table-based data structure, de Bruijn branch index, to index all multiple-in and multiple-out k-mers ( Supplementary Fig. 2 ). With this index, the de Bruijn branch encoding and the
DeBWT 建立了一个基于哈希表的数据结构,即 de Bruijn 分支索引,用于索引所有多入和多出 k-mers (附图 2)。有了这个索引,只需扫描
DeBWT initially allocates an empty string as
DeBWT 最初分配一个空字符串作为
DeBWT then scans
然后,DeBWT 从上游向下游扫描
The parallelization of this step is implemented as follows. DeBWT divides
这一步的并行化实现如下。DeBWT 将
2.6 BWT construction with projection suffixes
2.6 带投影后缀的 BWT 结构
For each of the BWT parts corresponding to the multiple-in k-mers, deBWT constructs the SA of the projection suffixes with the
对于与多入 k-mers 对应的每个 BWT 部分,deBWT 使用
We did a modification on the recursive process of the original quick-sort method to improve the efficiency. That is, for a specific sub-array of suffixes transferred into the recursive function, deBWT checks whether all the suffixes have the same BWT characters at first. If this is the case, deBWT marks the sub-array as sorted, as the precise lexicographical order of these suffixes is not necessary for the BWT construction; otherwise, deBWT calls the original recursive function to further sort the sub-array. This modification can also be seen as an extension of Lemma 3.
为了提高效率,我们对原始快速排序方法的递归过程进行了修改。也就是说,对于转入递归函数的特定后缀子数组,deBWT 会首先检查所有后缀是否都有相同的 BWT 字符。如果是这样,deBWT 会将该子数组标记为已排序,因为 BWT 结构并不需要这些后缀的精确词序;否则,deBWT 会调用原始递归函数对子数组进行进一步排序。这一修改也可以看作是对 Lemma 3 的扩展。
2.7 Additional processing
2.7 额外处理
After all the operations mentioned above, there are still
经过上述所有操作后,仍有
3 Results 3 项成果
We benchmarked deBWT with three datasets mimicking various real application scenarios. (i) A dataset consists of 10 in silico human genomes (totally 30.9 Gbp). Each of the genomes is generated by integrating the variants of a specific sample from 1000 Genomes ( The 1000 Genomes Project Consortium, 2015 ) into the human reference genome GRCh37/Hg19. This dataset mimics the indexing of multiple individual human genomes, which has many applications in genomic studies. (ii) A dataset consists of a set of simulated contigs. As long read sequencing technologies, such as Single Molecular Real-Time sequencing, have improved the contig N50 of human genome assembly to >10 million bp ( http://www.pacb.com/blog/toward-platinum-genomes-pacbio-releases-a-new-higher-quality-chm1-assembly-to-ncbi/ ), we randomly extracted 3000 sequences (each is about 10M bp long, totally 30.2 Gbp) from the 10 in silico human genomes with an in-house script, which is revised from Wgsim simulator ( Li et al. , 2009b ; https://github.com/lh3/wgsim ). (iii) A dataset consists of eight primate genomes including gibbon, gorilla, orangutan, rhesus, baboon, chimp, bonobo and human (downloaded from: http://hgdownload.soe.ucsc.edu/downloads.html ). This dataset assessed the ability of deBWT to index more diverse genomes.
我们用三个模拟各种实际应用场景的数据集对 deBWT 进行了基准测试。(i) 一个数据集由 10 个硅人类基因组(总计 30.9 Gbp)组成。每个基因组都是将来自 1000 基因组(The 1000 Genomes Project Consortium, 2015)的特定样本的变异整合到人类参考基因组 GRCh37/Hg19 中生成的。该数据集模拟了多个人类个体基因组的索引,在基因组研究中有许多应用。(ii) 数据集由一组模拟等位基因组成。由于单分子实时测序等长读测序技术已将人类基因组组装的等位基因 N50 提高到大于 1000 万 bp ( http://www.pacb.com/blog/toward-platinum-genomes-pacbio-releases-a-new-higher-quality-chm1-assembly-to-ncbi/ ),我们从 10 个人类基因组中随机提取了 3000 个序列(每个序列长约 1000 万 bp,共计 30.(iii) 一个由长臂猿、大猩猩、猩猩、恒河猴、狒狒、黑猩猩、倭黑猩猩和人类等 8 个灵长类动物基因组组成的数据集(下载地址: http://hgdownload.soe.ucsc.edu/downloads.html )。该数据集评估了 deBWT 索引更多样化基因组的能力。
The benchmark was implemented on a server with four Intel Xeon E4820 CPUs (32 cores in total) at 2.00 GHz and 1 Terabytes RAM, running Linux Ubuntu 14.04. DeBWT uses Jellyfish (version 2.1.4; Marçais and Kingsford, 2011 ) for implementing the k-mer counting of the input sequences (the parameter k is configured as 31). Two recently published methods, RopeBWT2 ( Li, 2014 ) and ParaBWT (version 1.0.8-binary-x86_64) ( Liu et al. , 2014b ), were also performed on the same datasets for comparison.
该基准测试是在一台服务器上进行的,该服务器配备四颗英特尔至强 E4820 CPU(共 32 核),主频为 2.00 GHz,内存为 1 TB,运行 Linux Ubuntu 14.04。DeBWT 使用 Jellyfish(版本 2.1.4;Marçais 和 Kingsford,2011 年)对输入序列进行 k-mer 计数(参数 k 设置为 31)。最近发表的两种方法 RopeBWT2 ( Li, 2014 ) 和 ParaBWT (version 1.0.8-binary-x86_64) ( Liu et al. , 2014b ) 也在相同的数据集上进行了比较。
At first, we tested the performance of deBWT with 32 threads, i.e. running with all the 32 CPU cores of the server. ParaBWT was also run with 32 threads, but RopeBWT2 was run with its default setting, as it does not support parallel computing. The elapsed time ( Table 1 ) indicates that deBWT and ParaBWT have comparable speed, while RopeBWT2 is slower, likely owing to the fact that it does not support parallel computing and the algorithm could not be suited to long sequences. We further investigate the processing of deBWT, and found that the speed of deBWT was largely slowed down by Jellyfish owing to the format of its output file. The default output of Jellyfish is a binary file in an unpublished format. As the details about the format is unknown for us, we used the ‘dump’ command of Jellyfish to convert the output file into text file, and then converted the text file into binary file in our own format as the input of further steps. This file conversion costs a couple of hours for all the three datasets, i.e. about 60–70% of the total running time. The time cost would be much reduced if the output format of jellyfish was available, or if other k-mer counting tools with similar performance and readable output format were used. Deducting the time of file conversion, deBWT is much faster than the other two methods.
首先,我们用 32 个线程测试了 deBWT 的性能,即使用服务器的全部 32 个 CPU 内核运行。ParaBWT 也使用 32 个线程运行,但由于 RopeBWT2 不支持并行计算,因此使用其默认设置运行。运行时间(表 1)表明,deBWT 和 ParaBWT 的速度相当,而 RopeBWT2 的速度较慢,这可能是由于它不支持并行计算,算法不适合长序列。我们进一步研究了 deBWT 的处理过程,发现由于输出文件的格式问题,deBWT 的速度在很大程度上被 Jellyfish 拖慢了。Jellyfish 的默认输出是一个未公开格式的二进制文件。由于我们不知道格式的详细信息,因此我们使用 Jellyfish 的 "dump "命令将输出文件转换成文本文件,然后将文本文件转换成我们自己格式的二进制文件,作为进一步步骤的输入。对所有三个数据集来说,文件转换花费了几个小时,约占总运行时间的 60-70%。如果水母的输出格式可用,或者使用其他具有类似性能和可读输出格式的 k-mer 计数工具,时间成本将大大降低。扣除文件转换时间,deBWT 比其他两种方法快得多。
表 1.32 个 CPU 内核的运行时间(分钟)
Methods 方法 . | Human genomes 人类基因组 . | Human contigs 人类等位基因 . | Primate genomes 灵长类动物基因组 . |
---|---|---|---|
deBWT | 134 | 129 | 330 |
deBWT (no conversion) deBWT (无转换) | 48 | 56 | 100 |
ParaBWT | 241 | 262 | 180 |
RopeBWT2 | 1694 | 2247 | 1546 |
Methods . | Human genomes . | Human contigs . | Primate genomes . |
---|---|---|---|
deBWT | 134 | 129 | 330 |
deBWT (no conversion) | 48 | 56 | 100 |
ParaBWT | 241 | 262 | 180 |
RopeBWT2 | 1694 | 2247 | 1546 |
‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出格式转换的时间。
表 1.32 个 CPU 内核的运行时间(分钟)
Methods 方法 . | Human genomes 人类基因组 . | Human contigs 人类等位基因 . | Primate genomes 灵长类动物基因组 . |
---|---|---|---|
deBWT | 134 | 129 | 330 |
deBWT (no conversion) deBWT (无转换) | 48 | 56 | 100 |
ParaBWT | 241 | 262 | 180 |
RopeBWT2 | 1694 | 2247 | 1546 |
Methods . | Human genomes . | Human contigs . | Primate genomes . |
---|---|---|---|
deBWT | 134 | 129 | 330 |
deBWT (no conversion) | 48 | 56 | 100 |
ParaBWT | 241 | 262 | 180 |
RopeBWT2 | 1694 | 2247 | 1546 |
‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出格式转换的时间。
We investigated the time cost of the various steps of deBWT ( Table 2 ). Mainly, two issues are observed.
我们研究了 deBWT 各个步骤的时间成本(表 2)。主要发现了两个问题。
表 2.deBWT 各步骤的时间(分钟)
Steps 步骤 . | Human genomes 人类基因组 . | Human contigs 人类等位基因 . | Primate genomes 灵长类动物基因组 . |
---|---|---|---|
Phase1: dBG building and analysis 第 1 阶段: dBG 建设与分析 | |||
k-mer counting k 元计数 | 16 | 16 | 26 |
File conversion 文件转换 | 87 | 74 | 229 |
k-mer sorting k 元排序 | 3 | 3 | 8 |
dBG analysis dBG 分析 | 8 | 7 | 19 |
Phase2: The generation of de Bruijn branch encoding and projection suffixes 第 2 阶段:生成 de Bruijn 分支编码和投影后缀 | |||
de Bruijn branch encoding and de Bruijn 分支编码和 | 9 | 12 | 16 |
Phase3: BWT construction with projection suffixes 第 3 阶段:使用投影后缀构建 BWT | |||
Projection suffixes sorting 投影后缀排序 | 4 | 12 | 10 |
Additional processing 附加处理 | |||
Additional processing 附加处理 | 7 | 6 | 22 |
Supplement | |||
k-mer counting with KMC2 使用 KMC2 进行 k-聚合物计数 | 7 | 9 | 12 |
Steps . | Human genomes . | Human contigs . | Primate genomes . |
---|---|---|---|
Phase1: dBG building and analysis | |||
k-mer counting | 16 | 16 | 26 |
File conversion | 87 | 74 | 229 |
k-mer sorting | 3 | 3 | 8 |
dBG analysis | 8 | 7 | 19 |
Phase2: The generation of de Bruijn branch encoding and projection suffixes | |||
de Bruijn branch encoding and | 9 | 12 | 16 |
Phase3: BWT construction with projection suffixes | |||
Projection suffixes sorting | 4 | 12 | 10 |
Additional processing | |||
Additional processing | 7 | 6 | 22 |
Supplement | |||
k-mer counting with KMC2 | 7 | 9 | 12 |
表 2.deBWT 各步骤的时间(分钟)
Steps 步骤 . | Human genomes 人类基因组 . | Human contigs 人类等位基因 . | Primate genomes 灵长类动物基因组 . |
---|---|---|---|
Phase1: dBG building and analysis 第 1 阶段: dBG 建设与分析 | |||
k-mer counting k 元计数 | 16 | 16 | 26 |
File conversion 文件转换 | 87 | 74 | 229 |
k-mer sorting k 元排序 | 3 | 3 | 8 |
dBG analysis dBG 分析 | 8 | 7 | 19 |
Phase2: The generation of de Bruijn branch encoding and projection suffixes 第 2 阶段:生成 de Bruijn 分支编码和投影后缀 | |||
de Bruijn branch encoding and de Bruijn 分支编码和 | 9 | 12 | 16 |
Phase3: BWT construction with projection suffixes 第 3 阶段:使用投影后缀构建 BWT | |||
Projection suffixes sorting 投影后缀排序 | 4 | 12 | 10 |
Additional processing 附加处理 | |||
Additional processing 附加处理 | 7 | 6 | 22 |
Supplement | |||
k-mer counting with KMC2 使用 KMC2 进行 k-聚合物计数 | 7 | 9 | 12 |
Steps . | Human genomes . | Human contigs . | Primate genomes . |
---|---|---|---|
Phase1: dBG building and analysis | |||
k-mer counting | 16 | 16 | 26 |
File conversion | 87 | 74 | 229 |
k-mer sorting | 3 | 3 | 8 |
dBG analysis | 8 | 7 | 19 |
Phase2: The generation of de Bruijn branch encoding and projection suffixes | |||
de Bruijn branch encoding and | 9 | 12 | 16 |
Phase3: BWT construction with projection suffixes | |||
Projection suffixes sorting | 4 | 12 | 10 |
Additional processing | |||
Additional processing | 7 | 6 | 22 |
Supplement | |||
k-mer counting with KMC2 | 7 | 9 | 12 |
First, most of the core steps of deBWT, i.e. k-mer counting, k-mer sorting, de Bruijn branch encoding and
首先,deBWT 的大多数核心步骤,即 k-mer 计数、k-mer 排序、de Bruijn 分支编码和
Second, the I/O operation is the main issue slowing down the method. Other than the file conversion step mentioned above, there are also many I/O operations in the ‘de Bruijn graph analysis’ and the ‘additional processing’ steps. That is, in the dBG analysis step, deBWT needs to merge the files recording the four ordered lists of k-mers to recognize the multiple-in k-mers; and in the additional processing step, deBWT needs to convert the large sequences to be indexed from text (fasta format) into binary format and merge various BWT parts. Although these operations theoretically have low time complexity, they also depend on the performance of the file system of the computer as well as the implementation of the program. It is also an important future work for us to further optimize these operations.
其次,I/O 操作是拖慢该方法的主要问题。除了上面提到的文件转换步骤,在 "德布鲁因图分析 "和 "附加处理 "步骤中也有很多 I/O 操作。也就是说,在 dBG 分析步骤中,deBWT 需要合并记录四个有序 k-mers 列表的文件,以识别多入 k-mers;而在附加处理步骤中,deBWT 需要将需要索引的大序列从文本(fasta 格式)转换为二进制格式,并合并 BWT 的各个部分。虽然这些操作理论上时间复杂度较低,但也取决于计算机文件系统的性能以及程序的实现。进一步优化这些操作也是我们未来的一项重要工作。
Other than the two issues mentioned above, the time cost of the projection suffixes sorting step is especially critical, as it is the core step to handle the long repetitions within the input sequence(s). The total time cost of this step is
除上述两个问题外,投影后缀排序步骤的时间成本尤为关键,因为它是处理输入序列中长重复的核心步骤。该步骤的总时间成本为
Although the theoretical upper bound is large, however,
虽然理论上限很大,但
表 3.10 个人类基因组数据集的
Quantiles 定量 . | 0.50 . | 0.90 . | 0.95 . | 0.99 . | 0.999 . | 0.9999 . |
---|---|---|---|---|---|---|
107 | 588 | 2382 | 95 019 | 298 598 | 515 006 | |
1760 | 11 872 | 238 368 | 1 925 600 | 3 232 832 | 3 387 040 |
Quantiles . | 0.50 . | 0.90 . | 0.95 . | 0.99 . | 0.999 . | 0.9999 . |
---|---|---|---|---|---|---|
107 | 588 | 2382 | 95 019 | 298 598 | 515 006 | |
1760 | 11 872 | 238 368 | 1 925 600 | 3 232 832 | 3 387 040 |
表 3.10 个人类基因组数据集的
Quantiles 定量 . | 0.50 . | 0.90 . | 0.95 . | 0.99 . | 0.999 . | 0.9999 . |
---|---|---|---|---|---|---|
107 | 588 | 2382 | 95 019 | 298 598 | 515 006 | |
1760 | 11 872 | 238 368 | 1 925 600 | 3 232 832 | 3 387 040 |
Quantiles . | 0.50 . | 0.90 . | 0.95 . | 0.99 . | 0.999 . | 0.9999 . |
---|---|---|---|---|---|---|
107 | 588 | 2382 | 95 019 | 298 598 | 515 006 | |
1760 | 11 872 | 238 368 | 1 925 600 | 3 232 832 | 3 387 040 |
We also run deBWT with 8, 16, 24 threads to investigate its scalability. The results ( Table 4 ) suggest that deBWT can gradually speedup with the increase of threads, i.e. it has good scalability. However, the speed of ParaBWT is nearly the same with the various settings on threads. This is likely owing to the incremental nature of the ParaBWT method, which may limit its performance on modern servers and clusters. The time of the various steps of deBWT with various numbers of threads is in Figure 3 . It indicates that the two core steps, de Bruijn branch encoding and
我们还使用 8、16、24 个线程运行 deBWT,以研究其可扩展性。结果(表 4)表明,deBWT 可以随着线程数的增加而逐渐加速,即具有良好的可扩展性。然而,在不同的线程设置下,ParaBWT 的速度几乎相同。这可能是由于 ParaBWT 方法的增量性质,这可能会限制其在现代服务器和集群上的性能。不同线程数下 deBWT 各步骤的时间见图 3。从图中可以看出,de Bruijn 分支编码和
表 4.不同线程数量下的运行时间(分钟)
Methods 方法 . | 8 threads 8 条线 . | 16 threads 16 条线 . | 24 threads 24 条线 . | 32 threads 32 线程 . |
---|---|---|---|---|
Human genomes 人类基因组 | ||||
deBWT | 194 | 153 | 142 | 134 |
deBWT (no conversion) deBWT (无转换) | 109 | 68 | 56 | 48 |
ParaBWT | 265 | 240 | 240 | 241 |
Human contigs 人类等位基因 | ||||
deBWT | 183 | 154 | 123 | 129 |
vdeBWT (no conversion) vdeBWT(无转换) | 116 | 86 | 56 | 56 |
ParaBWT | 294 | 277 | 276 | 262 |
Primate genomes 灵长类动物基因组 | ||||
deBWT | 423 | 355 | 332 | 330 |
deBWT (no conversion) deBWT (无转换) | 193 | 125 | 105 | 100 |
ParaBWT | 196 | 182 | 181 | 180 |
Methods . | 8 threads . | 16 threads . | 24 threads . | 32 threads . |
---|---|---|---|---|
Human genomes | ||||
deBWT | 194 | 153 | 142 | 134 |
deBWT (no conversion) | 109 | 68 | 56 | 48 |
ParaBWT | 265 | 240 | 240 | 241 |
Human contigs | ||||
deBWT | 183 | 154 | 123 | 129 |
vdeBWT (no conversion) | 116 | 86 | 56 | 56 |
ParaBWT | 294 | 277 | 276 | 262 |
Primate genomes | ||||
deBWT | 423 | 355 | 332 | 330 |
deBWT (no conversion) | 193 | 125 | 105 | 100 |
ParaBWT | 196 | 182 | 181 | 180 |
‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output file.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出文件格式转换的时间。
表 4.不同线程数量下的运行时间(分钟)
Methods 方法 . | 8 threads 8 条线 . | 16 threads 16 条线 . | 24 threads 24 条线 . | 32 threads 32 线程 . |
---|---|---|---|---|
Human genomes 人类基因组 | ||||
deBWT | 194 | 153 | 142 | 134 |
deBWT (no conversion) deBWT (无转换) | 109 | 68 | 56 | 48 |
ParaBWT | 265 | 240 | 240 | 241 |
Human contigs 人类等位基因 | ||||
deBWT | 183 | 154 | 123 | 129 |
vdeBWT (no conversion) vdeBWT(无转换) | 116 | 86 | 56 | 56 |
ParaBWT | 294 | 277 | 276 | 262 |
Primate genomes 灵长类动物基因组 | ||||
deBWT | 423 | 355 | 332 | 330 |
deBWT (no conversion) deBWT (无转换) | 193 | 125 | 105 | 100 |
ParaBWT | 196 | 182 | 181 | 180 |
Methods . | 8 threads . | 16 threads . | 24 threads . | 32 threads . |
---|---|---|---|---|
Human genomes | ||||
deBWT | 194 | 153 | 142 | 134 |
deBWT (no conversion) | 109 | 68 | 56 | 48 |
ParaBWT | 265 | 240 | 240 | 241 |
Human contigs | ||||
deBWT | 183 | 154 | 123 | 129 |
vdeBWT (no conversion) | 116 | 86 | 56 | 56 |
ParaBWT | 294 | 277 | 276 | 262 |
Primate genomes | ||||
deBWT | 423 | 355 | 332 | 330 |
deBWT (no conversion) | 193 | 125 | 105 | 100 |
ParaBWT | 196 | 182 | 181 | 180 |
‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output file.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出文件格式转换的时间。
We further run deBWT on the in silico human genome dataset with various configurations on the k parameter to investigate its effect Table 5 . It can be observed from the result that, on a large range of k parameters, i.e. k = 23–31, the total running time is close, but for smaller k parameter, the time consumption is higher. This is likely owing to the fact that the moderate long k-mers (such as 23- to 31-mers) may have similar ability to span short repeats. In this situation, the structure of the dBG does not change much with these k configurations, i.e. there are similar numbers of unipaths as well as their copies in the graph. However, when k is smaller, the unipaths will be shorter and have more copies, which would make the de Bruijn branch encoding longer and more projection suffixes need to be sorted. k > 32 could have better ability to span repeats, which may improve the overall performance; however, it requires much more RAM space, as a k-mer cannot be stored by one 64-bits cell.
我们进一步在硅人类基因组数据集上以不同的 k 参数配置运行 deBWT,以研究其影响(见表 5)。从结果可以看出,在较大的 k 参数范围内(即 k = 23-31),总运行时间接近,但对于较小的 k 参数,耗时较多。这可能是由于中等长度的 k-分子(如 23 至 31-分子)可能具有类似于跨越短重复序列的能力。在这种情况下,dBG 的结构并不会因为这些 k 配置而发生太大变化,即图中的单路径及其副本数量相似。然而,当 k 较小时,单链路会更短,副本会更多,这将使 de Bruijn 分支编码更长,需要排序的投影后缀也更多。
表 5.不同 k 参数配置下的硅学人类基因组数据集的运行时间(分钟)
Methods 方法 . | k = 19 . | k = 23 . | k = 27 . | k = 31 . |
---|---|---|---|---|
deBWT | 142 | 124 | 131 | 134 |
deBWT (no conversion) deBWT (无转换) | 75 | 51 | 47 | 48 |
Methods . | k = 19 . | k = 23 . | k = 27 . | k = 31 . |
---|---|---|---|---|
deBWT | 142 | 124 | 131 | 134 |
deBWT (no conversion) | 75 | 51 | 47 | 48 |
‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output file.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出文件格式转换的时间。
表 5.不同 k 参数配置下的硅学人类基因组数据集的运行时间(分钟)
Methods 方法 . | k = 19 . | k = 23 . | k = 27 . | k = 31 . |
---|---|---|---|---|
deBWT | 142 | 124 | 131 | 134 |
deBWT (no conversion) deBWT (无转换) | 75 | 51 | 47 | 48 |
Methods . | k = 19 . | k = 23 . | k = 27 . | k = 31 . |
---|---|---|---|---|
deBWT | 142 | 124 | 131 | 134 |
deBWT (no conversion) | 75 | 51 | 47 | 48 |
‘DeBWT’ indicates the elapsed time of deBWT, and ‘deBWT (no conversion)’ deducts the time of the format conversion of Jellyfish output file.
DeBWT "表示 deBWT 所耗费的时间,"deBWT(无转换)"扣除 Jellyfish 输出文件格式转换的时间。
The memory footprint of deBWT ( Table 6 ) depends on both of the method itself and the used k-mer counting tool. The memory usage of Jellyfish and KMC2 is highly configurable, and we set them to use relatively large memory to accomplish the k-mer counting step as fast as possible. The major RAM costs of the three phases of deBWT are different. In the first phase, the major cost originates from the data structure of k-mer sorting. Briefly, deBWT uses a linear table like
deBWT 的内存占用(表 6)取决于方法本身和使用的 k-mer 计数工具。Jellyfish 和 KMC2 的内存使用是高度可配置的,我们将它们设置为使用相对较大的内存,以尽可能快地完成 k-mer计数步骤。deBWT 三个阶段的主要内存成本是不同的。在第一阶段,主要成本来自 k-mer排序的数据结构。简而言之,deBWT 使用
表 6.32 个 CPU 内核的内存占用空间(单位:千兆字节)
Methods 方法 . | Human genomes 人类基因组 . | Human contigs 人类等位基因 . | Primate genomes 灵长类动物基因组 . |
---|---|---|---|
deBWT | 120/78/38 | 120/63/34 | 235/203/58 |
ParaBWT | 30 | 30 | 29 |
RopeBWT2 | 30 | 24 | 40 |
Supplement | |||
KMC2 | 119 | 119 | 119 |
Methods . | Human genomes . | Human contigs . | Primate genomes . |
---|---|---|---|
deBWT | 120/78/38 | 120/63/34 | 235/203/58 |
ParaBWT | 30 | 30 | 29 |
RopeBWT2 | 30 | 24 | 40 |
Supplement | |||
KMC2 | 119 | 119 | 119 |
For the ‘x/y/z’ of deBWT in the memory columns, the x, y and z values respectively indicate the memory footprints of Jellyfish, phase1 of deBWT, and phases2 and phases3 of deBWT.
对于内存列中 deBWT 的 "x/y/z",x、y 和 z 值分别表示 Jellyfish、deBWT 第一阶段、deBWT 第二阶段和第三阶段的内存足迹。
表 6.32 个 CPU 内核的内存占用空间(单位:千兆字节)
Methods 方法 . | Human genomes 人类基因组 . | Human contigs 人类等位基因 . | Primate genomes 灵长类动物基因组 . |
---|---|---|---|
deBWT | 120/78/38 | 120/63/34 | 235/203/58 |
ParaBWT | 30 | 30 | 29 |
RopeBWT2 | 30 | 24 | 40 |
Supplement | |||
KMC2 | 119 | 119 | 119 |
Methods . | Human genomes . | Human contigs . | Primate genomes . |
---|---|---|---|
deBWT | 120/78/38 | 120/63/34 | 235/203/58 |
ParaBWT | 30 | 30 | 29 |
RopeBWT2 | 30 | 24 | 40 |
Supplement | |||
KMC2 | 119 | 119 | 119 |
For the ‘x/y/z’ of deBWT in the memory columns, the x, y and z values respectively indicate the memory footprints of Jellyfish, phase1 of deBWT, and phases2 and phases3 of deBWT.
对于内存列中 deBWT 的 "x/y/z",x、y 和 z 值分别表示 Jellyfish、deBWT 第一阶段、deBWT 第二阶段和第三阶段的内存足迹。
表 7.硅学人类基因组和等位基因的统计数据
Statistics 统计资料 . | Human genomes 人类基因组 . | Human contigs 人类等位基因 . |
---|---|---|
length of input sequences 输入序列长度 | 30955436371 | 30200003020 |
distinct k-mers 不同的 k 元素 | 5073730669 | 4025285321 |
multiple-out k-mers 多层 K | 18820763 | 17238123 |
multiple-in k-mers 复式K线 | 18821805 | 17237511 |
copies of multiple-out k-mers 多出 k-mer 的副本 | 2364004617 | 2301293218 |
copies of multiple-in k-mers 复本 | 2364445711 | 2300904807 |
Statistics . | Human genomes . | Human contigs . |
---|---|---|
length of input sequences | 30955436371 | 30200003020 |
distinct k-mers | 5073730669 | 4025285321 |
multiple-out k-mers | 18820763 | 17238123 |
multiple-in k-mers | 18821805 | 17237511 |
copies of multiple-out k-mers | 2364004617 | 2301293218 |
copies of multiple-in k-mers | 2364445711 | 2300904807 |
表 7.硅学人类基因组和等位基因的统计数据
Statistics 统计资料 . | Human genomes 人类基因组 . | Human contigs 人类等位基因 . |
---|---|---|
length of input sequences 输入序列长度 | 30955436371 | 30200003020 |
distinct k-mers 不同的 k 元素 | 5073730669 | 4025285321 |
multiple-out k-mers 多层 K | 18820763 | 17238123 |
multiple-in k-mers 复式K线 | 18821805 | 17237511 |
copies of multiple-out k-mers 多出 k-mer 的副本 | 2364004617 | 2301293218 |
copies of multiple-in k-mers 复本 | 2364445711 | 2300904807 |
Statistics . | Human genomes . | Human contigs . |
---|---|---|
length of input sequences | 30955436371 | 30200003020 |
distinct k-mers | 5073730669 | 4025285321 |
multiple-out k-mers | 18820763 | 17238123 |
multiple-in k-mers | 18821805 | 17237511 |
copies of multiple-out k-mers | 2364004617 | 2301293218 |
copies of multiple-in k-mers | 2364445711 | 2300904807 |
First, for both the datasets, the numbers of multiple-out and -in k-mers are much less than
首先,对于这两个数据集来说,多出和-入 k-mers 的数量都远远小于
Second, the numbers of the copies of multiple-out and -in k-mers are also an order lower than
其次,虽然人类基因组具有重复性,但多出和-入 k-mers的拷贝数也比
The RAM cost of the third step is also similar to that of the second phase. To sort the projection suffixes, it needs to keep the de Bruijn branch encoding and the
第三步的 RAM 成本也与第二阶段类似。为了对投影后缀进行排序,需要在 RAM 中保留 de Bruijn 分支编码、
4 Discussion 4 讨论
The well organization and indexing of many genomes will be on wide demand in future genomics studies, with the rapid increase of assembled genomes. As an important genome indexing data structure, BWT may have many applications; however, the construction of BWT for a large collection of genomes, especially highly similar re-sequenced genomes (e.g. many human individual genomes), is still a non-trivial task. Moreover, owing to the incremental nature of the state-of-the-art methods, it is hard to construct BWT with scalable parallel computing. This is a bottleneck to fully use the computational resources of modern servers or clusters to handle large amount of data.
随着组装基因组的迅速增加,对许多基因组进行良好的组织和索引将是未来基因组学研究的广泛需求。作为一种重要的基因组索引数据结构,BWT 可能会有很多应用;然而,为大量基因组,尤其是高度相似的重测序基因组(如许多人类个体基因组)构建 BWT 仍然是一项非同小可的任务。此外,由于最先进方法的增量性质,很难通过可扩展的并行计算来构建 BWT。这是充分利用现代服务器或集群的计算资源处理大量数据的瓶颈。
We propose deBWT, a novel parallel BWT construction approach, to break the bottleneck. The main contribution of deBWT is its dBG-based representation and organization of suffixes, which facilitates the comparison of suffixes with long common prefixes and avoid unnecessary comparisons. Moreover, owing to its non-incremental design, deBWT has good scalability to various computational resources. These properties make deBWT well-suited to construct BWT for large collections of highly similar or repetitive genomes with modern servers or clusters. In the experiments, deBWT achieves a substantial improvement on the speed of indexing multiple individual human genomes and contigs. For more diverse genomes, e.g. multiple primate genomes, deBWT also shows faster speed and better parallelization; however, the improvement is smaller, likely owing to that the density of the dBG is lower. That is, there are more k-mers and unipaths to handle, but the overall repetitiveness of the input is lower than highly similar genomes.
我们提出了一种新颖的并行 BWT 构建方法 deBWT,以打破这一瓶颈。deBWT 的主要贡献在于其基于 dBG 的后缀表示和组织方式,这有利于比较具有长公共前缀的后缀,避免不必要的比较。此外,由于其非递增设计,deBWT 对各种计算资源具有良好的可扩展性。这些特性使得 deBWT 非常适合在现代服务器或集群上为大量高度相似或重复的基因组构建 BWT。在实验中,deBWT 大大提高了多个人类基因组和等位基因的索引速度。对于更多样化的基因组,例如多个灵长类动物基因组,deBWT 也能显示出更快的速度和更好的并行性;但是,改进幅度较小,这可能是由于 dBG 的密度较低。也就是说,有更多的 k-mers 和单路径需要处理,但输入的整体重复性低于高度相似的基因组。
Comparing with state-of-the-art approaches, deBWT has obviously larger memory footprint. There are potential solutions to reduce the memory footprints of the various phases of deBWT.
与最先进的方法相比,deBWT 的内存占用显然更大。有一些潜在的解决方案可以减少 deBWT 各个阶段的内存占用。
For phase 1, it is feasible to bin the k-mers into several subsets and separately sort each of the subsets with limited memory. The results of the multiple subsets can be straightforwardly merged into the ordered list of all the k-mers with small memory space.
在第 1 阶段,可以将 k-mer 分成多个子集,并利用有限的内存分别对每个子集进行排序。多个子集的结果可以直接合并为所有 k-mer 的有序列表,内存空间很小。
For phase 2, it is also possible to reduce the memory footprint by keeping only a proportion of
对于第二阶段,也可以通过只保留一部分
For phase 3, it can also keep only a proportion of unsolved of BWT partitions in memory as all such partitions are independent.
对于第 3 阶段,由于所有这些分区都是独立的,它也可以只将一部分未解决的 BWT 分区保留在内存中。
There are two possible improvements on deBWT, which are important future works for us.
对 deBWT 有两种可能的改进,这是我们未来的重要工作。
First, deBWT straightforwardly sorts the projection suffixes by quick-sort. Because the de Bruijn branch encoding can be also seen as a special DNA sequence, it is also possible to use other approaches to further accelerate the projection suffix sorting step. For example, the method proposed by Karkkainen (2007) uses DCS to accelerate the sorting of the binned suffixes of the original input sequence. This method could be also used for sorting the binned projection suffixes without loss of the ability of parallel computing, as it is non-incremental.
首先,deBWT 通过快速排序直接对投影后缀进行排序。由于 de Bruijn 分支编码也可以看作是一种特殊的 DNA 序列,因此也可以使用其他方法来进一步加快投影后缀排序步骤。例如,Karkkainen(2007 年)提出的方法使用 DCS 来加速原始输入序列的分选后缀排序。这种方法也可用于分档投影后缀的排序,而不会损失并行计算的能力,因为它是非递增的。
Second, for the current version of deBWT, the I/O-intensive steps are still not optimized, which slowed down the speed. We plan to further optimize the I/O-intensive steps to improve the efficiency of deBWT. Meanwhile, as k-mer counting is still an open problem, and advanced k-mer counting tools are developing ( Perez et al. , 2016 ), we also plan to replace Jellyfish by other more advanced k-mer counting tools, or remove the file conversion step by directly accessing the default Jellyfish output file, to break the practical bottleneck of the method.
其次,在当前版本的 deBWT 中,I/O 密集型步骤仍未优化,从而降低了速度。我们计划进一步优化 I/O 密集型步骤,以提高 deBWT 的效率。同时,由于k-mer计数仍是一个开放性问题,先进的k-mer计数工具也在不断发展(Perez et al. , 2016),我们还计划用其他更先进的k-mer计数工具替换Jellyfish,或者取消文件转换步骤,直接访问默认的Jellyfish输出文件,以突破该方法的实用瓶颈。
Funding 资金筹措
This work has been partially supported by the National Nature Science Foundation of China (Nos: 61301204 and 31301089), the National High-Tech Research and Development Program (863) of China (Nos: 2015AA020101, 2015AA020108 and 2014AA021505) and the National Science and Technology Major Project (No: 2013ZX03005012).
本研究得到了国家自然科学基金(编号:61301204 和 31301089)、国家高技术研究发展计划(863)(编号:2015AA020101、2015AA020108 和 2014AA021505)和国家科技重大专项(编号:2013ZX03005012)的部分资助。
Conflict of Interest : none declared.
利益冲突:无。
References 参考资料
鲍尔 M.J .等 .( 2013 ) 构建和反演字符串集合 bwt 的轻量级算法 .Theor.Comput.Sci ., 483 , 134 - 148 .
Bentley J.L Sedgewick R. ( 1997 ) Fast algorithms for sorting and searching strings.第 8 届离散算法年会论文集 ACM,加利福尼亚州旧金山。
Burrow M Wheeler D.J. ( 1994 ) A block-sorting lossless data compression algorithm.技术报告 124,加利福尼亚州数字设备公司。
Cox A.J .et al.( 2011 ) Poster abstract: hypothesis-free detection of splice junctions in RNA-Seq data.In Proceedings of CSHL conference on Genome Informatics .纽约 Spring Harbor.
Cox A.J .et al.( 2012 ) 利用 Burrows-Wheeler 变换大规模压缩基因组序列数据库 .生物信息学 , 28 , 1415 - 1419 .
Crauser A. Ferragina P. ( 2008 ) 外存储器中后缀数组构建的理论与实验研究 .Algorithmica , 32 , 1 - 35 .
Deorowicz S .et al.( 2015 ) KMC 2: fast and resource-frugal k-mer counting .生物信息学 , 31 , 1569 - 1576 .
Ferragina P.Manzini G. ( 2000 ) Opportunistic data structures with applications .In:第 41 届计算机科学基础 (FOCS) 年度研讨会论文集 .加利福尼亚州雷东多海滩 , 390 - 398 .
Ferragina P.Manzini G. ( 2005 ) Indexing compressed text .J ACM , 52 , 552 - 581 .
Ferragina P .et al.( 2012 ) 外部存储器中的轻量级数据索引和压缩 .Algorithmica , 63 , 707 - 730 .
Gnerre S .et al.( 2011 ) 从大规模并行序列数据中获得高质量的哺乳动物基因组组装草案 .Proc.Natl.USA , 108 , 1513 - 1518 .
Hon W.K. et al.( 2004 ) 压缩后缀阵列和 FM-Index 在搜索 DNA 序列中的实际应用 .In:第六届算法工程与实验研讨会暨第一届分析算法与组合学研讨会论文集 .New Orleans, LA, USA, 2004, pp.
Hon W.K. et al.( 2007 ) 构建大字母压缩后缀数组 .Algorithmica , 48 , 23 - 36 .
Karkkainen J. ( 2007 ) 通过顺时针后缀排序实现小空间内的快速 BWT .Theor.Comput.Sci .
Lam T.W. et al.( 2008 ) DNA 的压缩索引和局部比对 .Bioinformatics , 24 , 791 - 797 .
Langmead B. Salzberg S.L. ( 2012 ) Fast gapped-read alignment with Bowtie 2 .Nat.Methods , 9 , 357 - 359 .
Li H.Durbin R. ( 2009a ) 利用 Burrows-Wheeler 变换实现快速准确的短读数比对 .Bioinformatics , 25 , 1754 - 1760 .
Li H .et al.( 2009b ) 序列对齐/映射格式和 SAMtools .生物信息学 , 25 , 2078 - 2079 .
Li H. ( 2012 ) Exploring single-sample snp and indel calling with whole-genome de novo assembly .Bioinformatics , 28 , 1838 - 1844 .
Li H. ( 2014 ) Fast construction of FM-index for long sequence reads .Bioinformatics , 30 , 3274 - 3275 .
Liu C .et al.( 2014a ) GPU 加速的短文本大集合 BWT 构建. ArXiv , 1401.7457
刘宇等 .( 2014b ) 面向基因组大数据的 Burrows-Wheeler 变换和后缀阵列的并行和空间高效构建 .IEEE/ACM Trans.Comput.Biol.Bioinform .
Makinen V .et al.( 2010 ) 高度重复序列集合的存储和检索 .J. Comp. Biol .17 , 281 - 308 .
Marçais G. Kingsford C. ( 2011 ) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers .Bioinformatics , 27 , 764 - 770 .
Marcus S .et al.( 2014 ) SplitMEM: 泛基因组分析的后缀跳转图形算法 .生物信息学 , 30 , 3476 - 3483 .
Nong G .et al.( 2014 ) 使用 d 关键子串在外部存储器中构建后缀数组 .ACM Trans.Inform.Syst .
佩雷斯 N . 等人.( 2016 ) k 聚合体计数算法的计算性能评估 .J. Comp.Biol ., 23 , 248 - 255 .
Simpson J.T. Durbin R. ( 2012 ) Efficient de novo assembly of large genomes using compressed data structures .Genome Res . , 22 , 549 - 556 .
Smyth W.F. Turpin A.H. ( 2007 ) A taxonomy of suffix array construction algorithms.ACM Comput.Surv .
1000 个基因组计划联盟 ( 2015 ) 人类遗传变异的全球参照 .Nature , 526 , 68 - 74 .
The UK10K Consortium ( 2015 ) The UK10K project identifies rare variants in health and disease .Nature , 526 , 82 - 90 .
Tomescu A.I Medvedev P. ( 2016 ) Safe and complete contig assembly via omnitigs.Recomb 2016, lnbi, 9649: 152-163
Treangen T.J. Salzberg S.L. ( 2012 ) Repetitive DNA and next-generation sequencing: computational challenges and solutions .Nat.Genet .
Watson M. ( 2014 ) 照亮 DNA 测序的未来 .Genome Biol .
Zimin A.V .et al.( 2013 ) The MaSuRCA genome assembler .Bioinformatics , 29 , 2669 - 2677 .
Author notes 作者说明
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
作者希望大家知道,他们认为前两位作者应被视为共同第一作者。
© 作者 2016。牛津大学出版社出版。
这是一篇根据知识共享署名非商业性许可协议 ( http://creativecommons.org/licenses/by-nc/4.0/ ) 条款发布的开放获取文章,该协议允许在任何媒体上进行非商业性的再使用、分发和复制,但必须适当引用原作。如需商业再利用,请联系 journals.permissions@oup.com