Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
使用 BWA-MEM 对序列读取、克隆序列和组装 contig 进行比对

Heng Li 李恒Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, MA 02142, USA
哈佛大学和麻省理工学院广泛研究所，剑桥中心 7 号，剑桥，MA 02142，美国

Abstract 摘要

Summary: BWA-MEM is a new alignment algorithm for aligning sequence reads or assembly contigs against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment. The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases. For mapping 100bp sequences, BWA-MEM shows better performance than several state-of-art read aligners to date.
摘要：BWA-MEM 是一种新的比对算法，用于将序列读取或组装的 contig 与大型参考基因组（如人类）进行比对。它自动选择局部比对和端到端比对，支持成对末端读取并执行嵌合比对。该算法对测序错误具有鲁棒性，适用于从 70bp 到几兆碱基的广泛序列长度。对于 100bp 序列的比对，BWA-MEM 的性能优于迄今为止几种最先进的读取比对工具。

Availability and implementation: BWA-MEM is implemented as a component of BWA, which is available at http://github.com/h3/bwa Contact: hengli@broadinstitute.org
可用性和实现：BWA-MEM 作为 BWA 的一个组件实现，地址为 http://github.com/h3/bwa 联系方式：hengli@broadinstitute.org

1 INTRODUCTION 1 引言

Most short-reads mappers for next-generation sequencing (NGS) data were developed when sequence reads were about 36bp in length. For 36bp reads, it is reasonable to require end-to-end alignment (i.e. every read base to be aligned to the reference) and to only report hits within certain hamming or edit distance. However, with emerging technologies and improved chemistry, NGS reads are not short any more, which poses new challenges to read alignment. For 100bp or longer reads, it becomes more important to allow long gaps under the affine-gap penalty and to report multiple nonoverlapping local hits potentially caused by structural variations or misassemblies in the reference genome. Many short-read alignment algorithms are not applicable or not preferred for mapping longer reads. At the same time, although several mature algorithms exist for aligning capillary sequence reads, they are slow and lack features for analyzing large-scale NGS data. Fast moving NGS technologies keep pressing for the development of new alignment algorithms.
大多数针对下一代测序（NGS）数据的短读段比对工具是在序列读段长度约为 36bp 时开发的。对于 36bp 的读段，要求端到端比对（即每个读段碱基都与参考序列比对）并仅报告在某个汉明距离或编辑距离内的比对结果是合理的。然而，随着新技术和改进化学的出现，NGS 读段不再短，这给读段比对带来了新的挑战。对于 100bp 或更长的读段，允许在仿射缺口惩罚下存在长缺口变得更加重要，并且需要报告可能由结构变异或参考基因组中的错误组装引起的多个不重叠的局部比对结果。许多短读段比对算法不适用于或不推荐用于比对更长的读段。同时，尽管存在几种成熟的算法用于比对毛细管序列读段，但它们速度较慢，并且缺乏分析大规模 NGS 数据的功能。快速发展的 NGS 技术不断推动新比对算法的开发。
In this background, a few long-read alignment algorithms, notably including BWA-SW Li and Durbin, 2010), Bowtie2 (Langmead and Salzberg, 2012), Cushaw2 (Liu and Schmidt, 2012) and GEM (Marco-Sola et al., 2012), have been developed. However, they all have some weakness. BWA-SW is slower than bowtie2 for 100bp reads at a comparable accuracy and less accurate then Cushaw2 at a comparable speed. Bowtie2 and Cushaw2 are slower for 600bp reads (see Results). While GEM is both fast and accurate for up to approximately 1000bp reads, it mandates end-to-end alignment and does not perform affine-gap alignment, which limits its uses for long-read alignment. Even for typical sequence reads ranged from 100bp to 1000bp in length, no mappers work well across the full spectrum. At the same time, the increasing read length not only calls for new alignment algorithms, but also opens the opportunity to de novo assembly which typically results in contigs ranged from several hundred base pairs to a few megabases. Very few algorithms are able to align sequences of such variable
在这种背景下，开发了一些长读段比对算法，特别包括 BWA-SW（Li 和 Durbin，2010）、Bowtie2（Langmead 和 Salzberg，2012）、Cushaw2（Liu 和 Schmidt，2012）和 GEM（Marco-Sola 等，2012）。然而，它们都有一些缺点。BWA-SW 在与 Bowtie2 相当的准确度下，对于 100bp 的读取速度较慢，并且在与 Cushaw2 相当的速度下准确度较低。对于 600bp 的读取，Bowtie2 和 Cushaw2 的速度较慢（见结果）。虽然 GEM 对于大约 1000bp 的读取速度快且准确，但它要求端到端比对，并且不执行仿射缺口比对，这限制了它在长读段比对中的应用。即使对于长度在 100bp 到 1000bp 范围内的典型序列读取，没有任何比对工具能够在整个范围内良好工作。同时，读取长度的增加不仅需要新的比对算法，还为 de novo 组装提供了机会，通常会产生从几百个碱基对到几兆碱基的 contig。能够对如此可变序列进行比对的算法非常少。
lengths at high accuracy and to be robust to translocations in the assembly. All these concerns motivated us to explore a new alignment algorithm.
在高精度下测量长度，并对组装中的易位具有鲁棒性。所有这些问题促使我们探索一种新的比对算法。

2 METHODS 2 种方法

2.1 Aligning a single query sequence
2.1 对齐单个查询序列

2.1.1 Seeding and re-seeding BWA-MEM follows the canonical seed-and-extend paradigm. It initially seeds an alignment with supermaximal exact matches (SMEMs) using an algorithm we found previously (Li, 2012, Algorithm 5), which essentially finds at each query position the longest exact match covering the position. However, occasionally the true alignment may not contain any SMEMs. To reduce mismappings caused by missing seeds, we introduce re-seeding. Suppose we have a SMEM of length

l

with

k

occurrences in the reference genome. If

l

is too large (over 28bp by default), we re-seed with the longest exact matches that cover the middle base of the SMEM and occur at least

k + 1

times in the genome. Such seeds can be found by requiring a minimum occurrence in the original SMEM algorithm.
2.1.1 种子和重新种子 BWA-MEM 遵循经典的种子扩展范式。它最初使用我们之前发现的算法（Li, 2012, 算法 5）通过超最大精确匹配（SMEMs）为比对播种，该算法本质上在每个查询位置找到覆盖该位置的最长精确匹配。然而，有时真实的比对可能不包含任何 SMEMs。为了减少由于缺失种子而导致的错误比对，我们引入了重新播种。假设我们在参考基因组中有一个长度为

l

的 SMEM，出现了

k

次。如果

l

太大（默认超过 28bp），我们将使用覆盖 SMEM 中间碱基并在基因组中至少出现

k + 1

次的最长精确匹配进行重新播种。这些种子可以通过在原始 SMEM 算法中要求最小出现次数来找到。
2.1.2 Chaining and chain filtering We call a group of seeds that are colinear and close to each other as a chain. We greedily chain the seeds while seeding and then filter out short chains that are largely contained in a long chain and are much worse than the long chain (by default, both

50 %

and 38bp shorter than the long chain). Chain filtering aims to reduce unsuccessful seed extension at a later step. Each chain may not always correspond to a final hit. Chains detected here do not need to be accurate.
2.1.2 链接和链过滤我们将一组共线且彼此接近的种子称为链。在播种时，我们贪婪地链接种子，然后过滤掉大部分包含在长链中的短链，这些短链的质量远低于长链（默认情况下，短链的长度均为

50 %

和比长链短 38bp）。链过滤的目的是减少后续步骤中不成功的种子扩展。每个链不一定总是对应于最终的命中。在这里检测到的链不需要是准确的。
2.1.3 Seed extension We rank a seed by the length of the chain it belongs to and then by the seed length. For each seed in the ranked list, from the best to the worst, we drop the seed if it is already contained in an alignment found before, or extend the seed with a banded affine-gap-penalty dynamic programming (DP) if it potentially leads to a new alignment.
2.1.3 种子扩展我们根据种子所属链的长度对种子进行排名，然后根据种子长度进行排名。在排名列表中，从最好到最差，对于每个种子，如果它已经包含在之前找到的比对中，我们将丢弃该种子；如果它可能导致新的比对，我们将使用带状仿射缺口惩罚动态规划（DP）扩展该种子。

BWA-MEM’s seed extension differs from the standard seed extension in two aspects. Firstly, suppose at a certain extension step we come to reference position

x

with the best extension score achieved at query position

y

. BWAMEM will stop extension if the difference between the best extension score and the score at

(x, y)

is larger than

Z + | x - y | \times p_{gapExt}

, where

p_{gapExt}

is the gap extension penalty and

Z

is an arbitrary cutoff. This heuristic avoids extension through a poorly aligned region with good flanking alignment. It is similar to the X-dropoff heuristic in BLAST (Altschul et al., 1997), but does not penalize long gaps in one of the reference or the query sequences.
BWA-MEM 的种子扩展与标准种子扩展在两个方面有所不同。首先，假设在某个扩展步骤中，我们到达参考位置

x

，在查询位置

y

达到了最佳扩展分数。如果最佳扩展分数与

(x, y)

处的分数之间的差异大于

Z + | x - y | \times p_{gapExt}

，BWA-MEM 将停止扩展，其中

p_{gapExt}

是缺口扩展惩罚，

Z

是一个任意的截止值。这种启发式方法避免了通过对齐不良的区域进行扩展，而该区域具有良好的侧翼对齐。它类似于 BLAST 中的 X-dropoff 启发式方法（Altschul 等，1997），但不惩罚参考序列或查询序列中的长缺口。

Secondly, while extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence. If the difference between the best score reaching the end and the best local alignment score is below a threshold, the local alignment will be rejected even if it has a higher score. BWA-MEM uses this strategy to automatically choose between local and end-to-end alignments. It may align through true variations towards the end of a read and thus reduces reference bias, while avoids introducing excessive mismatches and gaps which may happen when we force a chimeric read into an end-to-end alignment.
其次，在扩展种子时，BWA-MEM 尝试跟踪到达查询序列末尾的最佳扩展得分。如果到达末尾的最佳得分与最佳局部比对得分之间的差异低于阈值，即使局部比对得分更高，也会被拒绝。BWA-MEM 使用这种策略在局部比对和端到端比对之间自动选择。它可能通过真实变异对读取的末尾进行比对，从而减少参考偏差，同时避免在强制将嵌合读取放入端到端比对时引入过多的不匹配和缺口。

Fig. 1. Percent mapped reads as a function of the false alignment rate under different mapping quality cutoff. Alignments with mapping quality 3 or lower are excluded. An alignment is wrong if after correcting clipping, its start position is within 20 bp from the simulated position.

10^{6}

pairs of 101bp reads are simulated from the human reference genome using wgsim (http://bit.ly/wgsim2) with

1.5 %

substitution errors and

0.2 %

indel variants. The insert size follows a normal distribution

N (500, 50^{2})

. The reads are aligned back to the genome either as single end (SE; top panel) or as paired end (PE; bottom panel). GEM is configured to allow up to 5 gaps and to output suboptimal alignments (option ’

- e 5 - m 5 - s 1

’ for SE and ’ e5 -m5 -s1 -pb’ for PE). GEM does not compute mapping quality. Its mapping quality is estimated with a BWA-like algorithm with suboptimal alignments available. Other mappers are run with the default setting except for specifying the insert size distribution. The run time in seconds on a single CPU core is shown in the parentheses.
图 1. 在不同的比对质量阈值下，映射读取的百分比与错误比对率的关系。比对质量为 3 或更低的比对被排除。如果在修正剪切后，其起始位置距离模拟位置在 20 bp 以内，则该比对是错误的。

10^{6}

对 101bp 的读取是使用 wgsim（http://bit.ly/wgsim2）从人类参考基因组模拟的，具有

1.5 %

个替代错误和

0.2 %

个插入缺失变异。插入大小遵循正态分布

N (500, 50^{2})

。读取被比对回基因组，既可以是单端（SE；上面面板），也可以是成对端（PE；下面面板）。GEM 被配置为允许最多 5 个缺口，并输出次优比对（单端的选项为'

- e 5 - m 5 - s 1

'，成对端的选项为'e5 -m5 -s1 -pb'）。GEM 不计算比对质量。其比对质量是使用类似 BWA 的算法估算的，次优比对可用。其他比对工具在默认设置下运行，除了指定插入大小分布。单个 CPU 核心的运行时间以秒为单位显示在括号中。

2.2 Paired-end mapping 2.2 配对末端比对

2.2.1 Rescuing missing hits Like BWA (Li and Durbin, 2009), BWAMEM processes a batch of reads at a time. For each batch, it estimates the mean

μ

and the variance

σ^{2}

of the insert size distribution from reliable single-end hits. For the top 100 hits (by default) of either end, if the mate is unmapped in a window [

μ - 4 σ, μ + 4 σ

] from each hit, BWA-MEM performs SSE2-based Smith-Waterman alignment (SW; Farrar 2007) for the mate within the window. The second best SW score is recorded to detect potential mismapping in a long tandem repeat. Hits found from both the single-sequence alignment and SW rescuing will be used for pairing.
2.2.1 救援缺失的比对像 BWA（Li 和 Durbin，2009）一样，BWAMEM 一次处理一批读取。对于每一批，它从可靠的单端比对中估计插入大小分布的均值

μ

和方差

σ^{2}

。对于任一端的前 100 个比对（默认），如果在每个比对的窗口 [

μ - 4 σ, μ + 4 σ

] 中配对未比对，BWA-MEM 会在窗口内对配对执行基于 SSE2 的 Smith-Waterman 比对（SW；Farrar 2007）。记录第二好的 SW 分数以检测长串联重复中的潜在错误比对。从单序列比对和 SW 救援中找到的比对将用于配对。
2.2.2 Pairing Given the

i

-th hit for the first read and

j

-th hit for the second, BWA-MEM computes their distance

d_{i j}

if the two hits are in the right orientation, or sets

d_{i j}

to infinity otherwise. It scores the pair

(i, j)

with

S_{i j} = S_{i} + S_{j} - min {- a \log_{4} P (d_{i j}), U}

, where

S_{i}

and

S_{j}

are the SW score of the two hits, respectively,

a

is the matching score and

P (d)

gives the probability of observing an insert size larger than

d

assuming a normal distribution. ’

\log_{4}

’ arises when we interpret SW score as odds ratio Durbin et al, 1998).

U

is a threshold that controls pairing: if

d_{i j}

is small enough such that

- a \log_{4} P (d_{i j}) < U

, BWA-MEM prefers to pair the two ends; otherwise it prefers the unpaired alignments. BWA-MEM chooses the pair

(i, j)

that maximizes

S_{i j}

as the final alignments for both ends. This pairing strategy jointly considers the alignment scores, the insert size and the possibility of chimeric read pairs.
2.2.2 配对给定第一个读取的

i

-th 命中和第二个读取的

j

-th 命中，BWA-MEM 计算它们的距离

d_{i j}

，如果两个命中方向正确，否则将

d_{i j}

设置为无穷大。它用

S_{i j} = S_{i} + S_{j} - min {- a \log_{4} P (d_{i j}), U}

对配对

(i, j)

进行评分，其中

S_{i}

和

S_{j}

分别是两个命中的 SW 分数，

a

是匹配分数，

P (d)

给出了观察到的插入大小大于

d

的概率，假设服从正态分布。’

\log_{4}

’ 出现于我们将 SW 分数解释为赔率比时（Durbin 等，1998）。

U

是控制配对的阈值：如果

d_{i j}

足够小以至于

- a \log_{4} P (d_{i j}) < U

，BWA-MEM 更倾向于配对两个末端；否则，它更倾向于未配对的比对。BWA-MEM 选择最大化

S_{i j}

的配对

(i, j)

作为两个末端的最终比对。该配对策略共同考虑了比对分数、插入大小和嵌合读取对的可能性。

3 RESULTS AND DISCUSSIONS
3 结果与讨论

We evaluated the performance of BWA-MEM on simulated data together with NovoAlign (http://novocraft.com), GEM, Bowtie2, Cushaw2, SeqAlto (Mu et al., 2012), BWA-SW and BWA (Figure1). On accuracy, NovoAlign is the best. BWA-MEM is close to NovoAlign for PE reads and is comparable to GEM and Cushaw2 for SE. On speed, BWA-MEM is similar to GEM and Bowtie2 for this data set, but is about 6 times as fast as Bowtie2 and Cushaw2 for a 650bp long-read data set.
我们评估了 BWA-MEM 在模拟数据上的性能，并与 NovoAlign（http://novocraft.com）、GEM、Bowtie2、Cushaw2、SeqAlto（Mu 等，2012）、BWA-SW 和 BWA 进行了比较（图 1）。在准确性方面，NovoAlign 表现最佳。对于 PE 读取，BWA-MEM 接近 NovoAlign，对于 SE 读取则与 GEM 和 Cushaw2 相当。在速度方面，BWA-MEM 在该数据集上与 GEM 和 Bowtie2 相似，但在 650bp 长读数据集上，BWA-MEM 的速度约为 Bowtie2 和 Cushaw2 的 6 倍。

We speculate BWA-MEM is more performant for longer reads firstly because of its advanced seeding algorithm, which identifies most standing seeds with one pass of the read, and secondly because of banded dynamic programming, which guarantees linear time complexity in the length of query sequences. BWA-MEM and BWA-SW are also able to identify chimeric reads, a crucial feature for contig alignment but lacked in most NGS long-read mappers. Our earlier paper (Li and Durbin, 2010) discussed this in details.
我们推测 BWA-MEM 在处理较长的读取时性能更佳，首先是因为它的先进种子算法，可以通过一次读取识别大多数静态种子，其次是因为带状动态规划，保证了查询序列长度的线性时间复杂度。BWA-MEM 和 BWA-SW 还能够识别嵌合读取，这是拼接比对的一个关键特性，而大多数 NGS 长读取映射器则缺乏这一功能。我们早期的论文（Li 和 Durbin，2010）对此进行了详细讨论。

To evaluate the performance for even longer query sequences, we aligned the E. coli K-12 MG1655 substrain (AC:NC_000913; 4.6Mb in length) against the 536 strain (AC:NC_008253) with both BWA-MEM and nucmer (Kurtz et al., 2004). Nucmer finished the alignment in 25 seconds, giving 105,505 substitution differences between the strains. BWA-MEM is slower. It finished the alignment in 131 seconds, reporting 104,321 substitutions of which 102,241 overlapping with the nucmer output. Manual examination of the substitutions unique to each aligner reveals that most of them fall in short regions with very high divergence. It is unclear which aligner is better. Note that although nucmer is faster in the evaluation, only BWA-MEM scales well to large genomes.
为了评估更长查询序列的性能，我们将大肠杆菌 K-12 MG1655 亚种（AC:NC_000913；长度 4.6Mb）与 536 株（AC:NC_008253）进行了比对，使用了 BWA-MEM 和 nucmer（Kurtz 等，2004）。Nucmer 在 25 秒内完成了比对，发现两株之间有 105,505 个替代差异。BWA-MEM 较慢，完成比对花费了 131 秒，报告了 104,321 个替代，其中 102,241 个与 nucmer 输出重叠。对每个比对器独有的替代进行手动检查，发现它们大多数位于具有非常高差异性的短区域内。目前尚不清楚哪个比对器更好。请注意，尽管在评估中 nucmer 更快，但只有 BWA-MEM 能够很好地扩展到大型基因组。

BWA-MEM is a fast and accurate aligner for sequence reads and is one of the few that work well for both 70bp reads and long sequences up to a few megabases. Technically, it is possible to make BWA-MEM even faster for long sequences by using SSE2-based banded DP and by restricting DP to regions not covered by a long exact match. Seeding is the bottleneck for short sequences, while banded DP is the bottleneck for long sequences.
BWA-MEM 是一个快速且准确的序列比对工具，是为数不多的能够很好处理 70bp 读段和长达几兆碱基的长序列的工具之一。从技术上讲，可以通过使用基于 SSE2 的带状动态规划和将动态规划限制在未被长精确匹配覆盖的区域来使 BWA-MEM 在长序列上更快。对于短序列，种子生成是瓶颈，而对于长序列，带状动态规划是瓶颈。

4 ACKNOWLEDGEMENTS 4 致谢

Funding: NIH 1U01HG005208-01.
资助：NIH 1U01HG005208-01。

REFERENCES 参考文献

Altschul, S. F. et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25:3389-402.
Altschul, S. F. 等 (1997). Gapped BLAST 和 PSI-BLAST：新一代蛋白质数据库搜索程序。核酸研究, 25:3389-402.

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM 使用 BWA-MEM 对序列读取、克隆序列和组装 contig 进行比对