Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM 使用 BWA-MEM 比对序列读数、克隆序列和组装轮廓
Heng Li 李恒Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, MA 02142, USA 美国麻省剑桥剑桥中心 7 号哈佛大学和麻省理工学院宽体研究所
Abstract 摘要
Summary: BWA-MEM is a new alignment algorithm for aligning sequence reads or assembly contigs against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases. For mapping 100bp sequences, BWA-MEM shows better performance than several state-of-art read aligners to date. 摘要:BWA-MEM 是一种新的比对算法,用于将序列读数或组装序列拼接段与大型参考基因组(如人类)进行比对。它能自动选择局部和端到端比对,支持配对端读数,并进行融合片段比对。该算法对测序错误具有强大的鲁棒性,适用于从 70bp 到数兆碱基的广泛序列长度。对于 100bp 序列的映射,BWA-MEM 的性能优于目前几种最先进的读数比对器。
Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at http://github.com/h3/bwa Contact: hengli@broadinstitute.org 可用性和实施:BWA-MEM 作为 BWA 的一个组件实现,可从 http://github.com/h3/bwa 获取。联系人:hengli@broadinstitute.org
1 INTRODUCTION 1 简介
Most short-reads mappers for next-generation sequencing (NGS) data were developed when sequence reads were about 36bp in length. For 36bp reads, it is reasonable to require end-to-end alignment (i.e. every read base to be aligned to the reference) and to only report hits within certain hamming or edit distance. However, with emerging technologies and improved chemistry, NGS reads are not short any more, which poses new challenges to read alignment. For 100bp or longer reads, it becomes more important to allow long gaps under the affine-gap penalty and to report multiple nonoverlapping local hits potentially caused by structural variations or misassemblies in the reference genome. Many short-read alignment algorithms are not applicable or not preferred for mapping longer reads. At the same time, although several mature algorithms exist for aligning capillary sequence reads, they are slow and lack features for analyzing large-scale NGS data. Fast moving NGS technologies keep pressing for the development of new alignment algorithms. 针对下一代测序(NGS)数据的大部分短读比对程序都是在测序读长约 36 bp 时开发的。对于 36 bp 的读长,要求端到端比对(即每个读比对到参考基因组)并且只报告在特定汉明或编辑距离内的比对结果是合理的。然而,随着新兴技术和化学方法的进步,NGS 读长已经不再短,这给读比对带来了新的挑战。对于 100 bp 或更长的读长,使用亲和间隙惩罚方式允许长间隙,并报告可能由结构变异或参考基因组组装错误导致的多个不重叠的局部比对结果变得更为重要。许多短读比对算法对于较长读长的比对并不适用或不被推荐。与此同时,虽然已经有几种成熟的算法可用于毛细管测序读的比对,但它们速度较慢,且缺乏分析大规模 NGS 数据的特性。快速发展的 NGS 技术一直在催促新的比对算法的开发。
In this background, a few long-read alignment algorithms, notably including BWA-SW Li and Durbin, 2010), Bowtie2 (Langmead and Salzberg, 2012), Cushaw2 (Liu and Schmidt, 2012) and GEM (Marco-Sola et al., 2012), have been developed. However, they all have some weakness. BWA-SW is slower than bowtie2 for 100bp reads at a comparable accuracy and less accurate then Cushaw2 at a comparable speed. Bowtie2 and Cushaw2 are slower for 600bp reads (see Results). While GEM is both fast and accurate for up to approximately 1000bp reads, it mandates end-to-end alignment and does not perform affine-gap alignment, which limits its uses for long-read alignment. Even for typical sequence reads ranged from 100bp to 1000bp in length, no mappers work well across the full spectrum. At the same time, the increasing read length not only calls for new alignment algorithms, but also opens the opportunity to de novo assembly which typically results in contigs ranged from several hundred base pairs to a few megabases. Very few algorithms are able to align sequences of such variable 在这个背景下,已经开发了一些长读比对算法,其中包括 BWA-SW (Li and Durbin, 2010)、Bowtie2 (Langmead and Salzberg, 2012)、Cushaw2 (Liu and Schmidt, 2012) 和 GEM (Marco-Sola et al., 2012)。但是它们都有一些缺点。BWA-SW 对于 100bp 读长的准确性与 bowtie2 相当,但速度较慢。Bowtie2 和 Cushaw2 在 600bp 读长时会变慢(参见结果)。虽然 GEM 对于长达 1000bp 的读长既快又准确,但它要求端到端比对,不支持亲和间隙比对,这限制了它在长读比对中的应用。即使对于从 100bp 到 1000bp 长度的典型序列读,也没有一个比对器能够在全程良好工作。与此同时,读长的增加不仅需要新的比对算法,也为无参考基因组组装开辟了机会,通常会产生从几百个碱基对到几兆碱基对的 contig。很少有算法能够比对如此可变长度的序列。
lengths at high accuracy and to be robust to translocations in the assembly. All these concerns motivated us to explore a new alignment algorithm. 以高精度测量长度并且对组装过程中的易位具有强大鲁棒性。所有这些考虑促使我们探索一种新的对齐算法。
2 METHODS 两种方法
2.1 Aligning a single query sequence 2.1 对单个查询序列进行对齐
2.1.1 Seeding and re-seeding BWA-MEM follows the canonical seed-and-extend paradigm. It initially seeds an alignment with supermaximal exact matches (SMEMs) using an algorithm we found previously (Li, 2012, Algorithm 5), which essentially finds at each query position the longest exact match covering the position. However, occasionally the true alignment may not contain any SMEMs. To reduce mismappings caused by missing seeds, we introduce re-seeding. Suppose we have a SMEM of length ll with kk occurrences in the reference genome. If ll is too large (over 28bp by default), we re-seed with the longest exact matches that cover the middle base of the SMEM and occur at least k+1k+1 times in the genome. Such seeds can be found by requiring a minimum occurrence in the original SMEM algorithm. 2.1.1 播種和重新播種
BWA-MEM 遵循經典的種子擴展範式。 它最初使用我們以前發現的算法(Li,2012 年,算法 5)播種對齊,該算法本質上在每個查詢位置找到覆蓋該位置的最長精確匹配。 但是,有時真正的對齊可能不包含任何超最大精確匹配。 為了減少由於缺失種子而導致的錯誤映射,我們引入了重新播種。 假設我們有一個長度為 ll 的超最大精確匹配,在參考基因組中出現 kk 次。 如果 ll 太大(默認超過 28 個碱基),我們會使用覆蓋超最大精確匹配中間碹基並至少在基因組中出現 k+1k+1 次的最長精確匹配來重新播種。 可以通過在原始超最大精確匹配算法中要求最小出現次數來找到這些種子。
2.1.2 Chaining and chain filtering We call a group of seeds that are colinear and close to each other as a chain. We greedily chain the seeds while seeding and then filter out short chains that are largely contained in a long chain and are much worse than the long chain (by default, both 50%50 \% and 38bp shorter than the long chain). Chain filtering aims to reduce unsuccessful seed extension at a later step. Each chain may not always correspond to a final hit. Chains detected here do not need to be accurate. 2.1.2 连锁和链式过滤
我们把彼此靠近且共线的种子组称为链。我们在播种时贪心地连锁种子,然后过滤掉短链条,这些短链条大部分包含于较长链条中,且明显劣于长链条(默认情况下,这些短链条比长链条短 50%50 \% 和 38bp)。链式过滤旨在减少后续步骤中不成功的种子扩展。每条链不一定都会对应一个最终的匹配。这里检测到的链条不需要完全准确。
2.1.3 Seed extension We rank a seed by the length of the chain it belongs to and then by the seed length. For each seed in the ranked list, from the best to the worst, we drop the seed if it is already contained in an alignment found before, or extend the seed with a banded affine-gap-penalty dynamic programming (DP) if it potentially leads to a new alignment. 2.1.3 种子扩展
我们根据种子所属链的长度和种子长度对种子进行排序。对于排名从最好到最差的每个种子,如果它已经包含在之前找到的比对中,我们就将其丢弃,如果它可能导致新的比对,我们就使用带带宽的亲和力差距惩罚动态规划(DP)来扩展种子。
BWA-MEM’s seed extension differs from the standard seed extension in two aspects. Firstly, suppose at a certain extension step we come to reference position xx with the best extension score achieved at query position yy. BWAMEM will stop extension if the difference between the best extension score and the score at (x,y)(x, y) is larger than Z+|x-y|xxp_("gapExt ")Z+|x-y| \times p_{\text {gapExt }}, where p_("gapExt ")p_{\text {gapExt }} is the gap extension penalty and ZZ is an arbitrary cutoff. This heuristic avoids extension through a poorly aligned region with good flanking alignment. It is similar to the X-dropoff heuristic in BLAST (Altschul et al., 1997), but does not penalize long gaps in one of the reference or the query sequences. BWA-MEM 的种子扩展与标准种子扩展在两个方面有所不同。首先,假设在某个扩展步骤中,我们到达了参考位置 xx ,并在查询位置 yy 获得了最佳扩展得分。如果最佳扩展得分与位置 (x,y)(x, y) 处的得分差异大于 Z+|x-y|xxp_("gapExt ")Z+|x-y| \times p_{\text {gapExt }} (其中 p_("gapExt ")p_{\text {gapExt }} 为间隙扩展惩罚, ZZ 为任意截止值),BWA-MEM 将停止扩展。这种启发式方法可以避免通过与边缘配准良好但局部配准较差的区域进行扩展。这与 BLAST(Altschul 等人,1997 年)中的 X-dropoff 启发式类似,但不会惩罚参考序列或查询序列中的长间隙。
Secondly, while extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence. If the difference between the best score reaching the end and the best local alignment score is below a threshold, the local alignment will be rejected even if it has a higher score. BWA-MEM uses this strategy to automatically choose between local and end-to-end alignments. It may align through true variations towards the end of a read and thus reduces reference bias, while avoids introducing excessive mismatches and gaps which may happen when we force a chimeric read into an end-to-end alignment. 其次,在延伸种子时,BWA-MEM 尝试跟踪到达查询序列末端的最佳扩展得分。如果达到末端的最佳得分与最佳局部比对得分之差低于阈值,即使局部比对得分较高,也会被拒绝。BWA-MEM 使用这种策略来自动选择局部比对和端到端比对。它可能会将真实变异与读取的末端进行比对,从而减少参考偏差,同时避免引入过多的错配和缺口,这可能会发生在我们将嵌合体读取强行放入端到端比对时。
Fig. 1. Percent mapped reads as a function of the false alignment rate under different mapping quality cutoff. Alignments with mapping quality 3 or lower are excluded. An alignment is wrong if after correcting clipping, its start position is within 20 bp from the simulated position. 10^(6)10^{6} pairs of 101bp reads are simulated from the human reference genome using wgsim (http://bit.ly/wgsim2) with 1.5%1.5 \% substitution errors and 0.2%0.2 \% indel variants. The insert size follows a normal distribution N(500,50^(2))N\left(500,50^{2}\right). The reads are aligned back to the genome either as single end (SE; top panel) or as paired end (PE; bottom panel). GEM is configured to allow up to 5 gaps and to output suboptimal alignments (option ’ -e5-m5-s1-\mathrm{e} 5-\mathrm{m} 5-\mathrm{s} 1 ’ for SE and ’ e5 -m5 -s1 -pb’ for PE). GEM does not compute mapping quality. Its mapping quality is estimated with a BWA-like algorithm with suboptimal alignments available. Other mappers are run with the default setting except for specifying the insert size distribution. The run time in seconds on a single CPU core is shown in the parentheses. 图 1. 在不同映射质量阈值下错误比对率与百分比映射读数的关系。将映射质量小于等于 3 的比对排除在外。如果纠正 clipping 后,一个比对的起始位置在模拟位置上下 20bp 范围内,则被认为是错误比对。使用 wgsim ( http://bit.ly/wgsim2)从人类参考基因组模拟生成 10^(6)10^{6} 对 101bp 长度的读数,添加了 1.5%1.5 \% 个替换错误和 0.2%0.2 \% 个缩删变异。插入片段大小服从正态分布 N(500,50^(2))N\left(500,50^{2}\right) 。将读数比对回参考基因组,可以是单端(SE;上面图)也可以是双端(PE;下面图)。GEM 配置允许最多 5 个 gap,并输出次优比对(SE 选项 ' -e5-m5-s1-\mathrm{e} 5-\mathrm{m} 5-\mathrm{s} 1 ',PE 选项'e5 -m5 -s1 -pb')。GEM 不计算映射质量,而是使用类似于 BWA 的算法估算,并利用次优比对。其他比对工具使用默认设置,只是指定了插入片段大小分布。括号内显示的是单 CPU 核上的运行时间(秒)。
2.2 Paired-end mapping 双端测序
2.2.1 Rescuing missing hits Like BWA (Li and Durbin, 2009), BWAMEM processes a batch of reads at a time. For each batch, it estimates the mean mu\mu and the variance sigma^(2)\sigma^{2} of the insert size distribution from reliable single-end hits. For the top 100 hits (by default) of either end, if the mate is unmapped in a window [ mu-4sigma,mu+4sigma\mu-4 \sigma, \mu+4 \sigma ] from each hit, BWA-MEM performs SSE2-based Smith-Waterman alignment (SW; Farrar 2007) for the mate within the window. The second best SW score is recorded to detect potential mismapping in a long tandem repeat. Hits found from both the single-sequence alignment and SW rescuing will be used for pairing. 2.2.1 拯救丢失命中 与 BWA(Li 和 Durbin,2009 年)类似,BWAMEM 一次处理一批读数。对于每一批数据,它从可靠的单端命中中估计插入大小分布的平均值 mu\mu 和方差 sigma^(2)\sigma^{2} 。对于每端的前 100 个命中(默认值),如果配对端在每个命中的窗口[ mu-4sigma,mu+4sigma\mu-4 \sigma, \mu+4 \sigma ]内未被映射,则 BWAMEM 会对窗口内的配对端执行基于 SSE2 的 Smith-Waterman 对齐(SW; Farrar 2007)。记录第二个最佳 SW 得分以检测长串联重复中的潜在错误映射。从单序列对齐和 SW 拯救中找到的命中将用于配对。
2.2.2 Pairing Given the ii-th hit for the first read and jj-th hit for the second, BWA-MEM computes their distance d_(ij)d_{i j} if the two hits are in the right orientation, or sets d_(ij)d_{i j} to infinity otherwise. It scores the pair (i,j)(i, j) with S_(ij)=S_(i)+S_(j)-min{-alog_(4)P(d_(ij)),U}S_{i j}=S_{i}+S_{j}-\min \left\{-a \log _{4} P\left(d_{i j}\right), U\right\}, where S_(i)S_{i} and S_(j)S_{j} are the SW score of the two hits, respectively, aa is the matching score and P(d)P(d) gives the probability of observing an insert size larger than dd assuming a normal distribution. ’ log_(4)\log _{4} ’ arises when we interpret SW score as odds ratio Durbin et al, 1998). UU is a threshold that controls pairing: if d_(ij)d_{i j} is small enough such that -alog_(4)P(d_(ij)) < U-a \log _{4} P\left(d_{i j}\right)<U, BWA-MEM prefers to pair the two ends; otherwise it prefers the unpaired alignments. BWA-MEM chooses the pair (i,j)(i, j) that maximizes S_(ij)S_{i j} as the final alignments for both ends. This pairing strategy jointly considers the alignment scores, the insert size and the possibility of chimeric read pairs. 2.2.2 配对
给定第一条读数的第 ii 个击中和第二条读数的第 jj 个击中, BWA-MEM 计算它们的距离 d_(ij)d_{i j} ,如果两个击中是正确的方向,否则将 d_(ij)d_{i j} 设置为无穷大。它使用 S_(ij)=S_(i)+S_(j)-min{-alog_(4)P(d_(ij)),U}S_{i j}=S_{i}+S_{j}-\min \left\{-a \log _{4} P\left(d_{i j}\right), U\right\} 的评分来评估这个配对,其中 S_(i)S_{i} 和 S_(j)S_{j} 分别是两个击中的 SW 得分, aa 是匹配得分, P(d)P(d) 给出了观察到大于 dd 的插入尺寸的概率,假设服从正态分布。' log_(4)\log _{4} '源于将 SW 得分解释为优势比(Durbin et al, 1998)。 UU 是一个阈值,用于控制配对:如果 d_(ij)d_{i j} 足够小,使得 -alog_(4)P(d_(ij)) < U-a \log _{4} P\left(d_{i j}\right)<U ,BWA-MEM 倾向于配对这两端;否则它倾向于非配对的比对。BWA-MEM 选择使 S_(ij)S_{i j} 最大化的配对 (i,j)(i, j) 作为两端的最终比对。这种配对策略同时考虑了比对得分、插入尺寸和嵌合式读数对的可能性。
3 RESULTS AND DISCUSSIONS 3 结果与讨论
We evaluated the performance of BWA-MEM on simulated data together with NovoAlign (http://novocraft.com), GEM, Bowtie2, Cushaw2, SeqAlto (Mu et al., 2012), BWA-SW and BWA (Figure1). On accuracy, NovoAlign is the best. BWA-MEM is close to NovoAlign for PE reads and is comparable to GEM and Cushaw2 for SE. On speed, BWA-MEM is similar to GEM and Bowtie2 for this data set, but is about 6 times as fast as Bowtie2 and Cushaw2 for a 650bp long-read data set. 我们一起评估了 BWA-MEM 在模拟数据上的性能,同时还评估了 NovoAlign (http://novocraft.com)、GEM、Bowtie2、Cushaw2、SeqAlto(Mu et al., 2012)、BWA-SW 和 BWA(图 1)。就准确性而言,NovoAlign 是最好的。BWA-MEM 在配对端读取(PE reads)上接近 NovoAlign,在单端读取(SE)上与 GEM 和 Cushaw2 相当。就速度而言,BWA-MEM 对于本数据集与 GEM 和 Bowtie2 相似,但在 650 bp 长读取数据集上比 Bowtie2 和 Cushaw2 快约 6 倍。
We speculate BWA-MEM is more performant for longer reads firstly because of its advanced seeding algorithm, which identifies most standing seeds with one pass of the read, and secondly because of banded dynamic programming, which guarantees linear time complexity in the length of query sequences. BWA-MEM and BWA-SW are also able to identify chimeric reads, a crucial feature for contig alignment but lacked in most NGS long-read mappers. Our earlier paper (Li and Durbin, 2010) discussed this in details. 我们推测 BWA-MEM 在处理较长的测序 reads 时表现较好,首先是因为它先进的种子搜索算法可以在一次读取时识别出大多数可用种子,其次是因为它采用带状动态规划的方法,可以保证在查询序列长度上具有线性时间复杂度。BWA-MEM 和 BWA-SW 还能识别嵌合 reads,这对于拼接序列的比对是一个关键功能,但大多数 NGS 长读测序比对工具却没有实现。我们之前的论文(Li 和 Durbin,2010)已经对此进行了详细讨论。
To evaluate the performance for even longer query sequences, we aligned the E. coli K-12 MG1655 substrain (AC:NC_000913; 4.6Mb in length) against the 536 strain (AC:NC_008253) with both BWA-MEM and nucmer (Kurtz et al., 2004). Nucmer finished the alignment in 25 seconds, giving 105,505 substitution differences between the strains. BWA-MEM is slower. It finished the alignment in 131 seconds, reporting 104,321 substitutions of which 102,241 overlapping with the nucmer output. Manual examination of the substitutions unique to each aligner reveals that most of them fall in short regions with very high divergence. It is unclear which aligner is better. Note that although nucmer is faster in the evaluation, only BWA-MEM scales well to large genomes. 为了评估更长序列查询的性能,我们将大肠杆菌 K-12 MG1655 亚株(登录号:NC_000913;长度 4.6Mb)与 536 株(登录号:NC_008253)进行了 BWA-MEM 和 nucmer(Kurtz et al.,2004)的比对。nucmer 在 25 秒内完成了比对,发现两个菌株之间有 105,505 个替换差异。BWA-MEM 比对速度较慢,需要 131 秒完成,发现 104,321 个替换差异,其中 102,241 个与 nucmer 的输出重叠。对两个比对工具独有的替换差异进行人工检查发现,它们大多集中在高度分歧的短区域。难以判断哪个比对工具更优。需要注意的是,虽然 nucmer 在评估中更快,但 BWA-MEM 在大基因组中的可扩展性更好。
BWA-MEM is a fast and accurate aligner for sequence reads and is one of the few that work well for both 70bp reads and long sequences up to a few megabases. Technically, it is possible to make BWA-MEM even faster for long sequences by using SSE2-based banded DP and by restricting DP to regions not covered by a long exact match. Seeding is the bottleneck for short sequences, while banded DP is the bottleneck for long sequences. BWA-MEM 是一种快速准确的序列读取比对器,能够很好地适用于 70bp 短 reads 以及长达几兆碱基的长序列。从技术上来说,通过使用基于 SSE2 的带状动态规划以及限制动态规划计算区域至不被长精确匹配覆盖的区域,可以进一步加快 BWA-MEM 在长序列上的比对速度。对于短序列,seed 匹配是瓶颈,而对于长序列,带状动态规划是瓶颈。
Altschul, S. F. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25:3389-402. 阿尔特舒尔, S. F. 等人 (1997). 间隙 BLAST 和 PSI-BLAST: 蛋白质数据库搜索程序的新一代. 核酸研究, 25:3389-402.