
Sequence Alignment/Map Format Specification

The SAM/BAM Format Specification Working Group

16 Nov 2023

Abstract

The master version of this document can be found at https://github.com/samtools/hts-specs. This printing is version 346a94a from that repository, last modified on the date shown above.

1 The SAM Format Specification

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must precede the alignments. Header lines start with '@', while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information.
This specification is for version 1.6 of the SAM and BAM formats. Each SAM and BAM file may optionally specify the version being used via the @HD VN tag. For full version history see Appendix B.
SAM file contents are 7-bit US-ASCII, except for certain field values as individually specified which may contain other Unicode characters encoded in UTF-8. Alternatively and equivalently, SAM files are encoded in UTF-8 but non-ASCII characters are permitted only within certain field values as explicitly specified in the descriptions of those fields.
Where it makes a difference, SAM file contents should be read and written using the POSIX / C locale. For example, floating-point values in SAM always use '.' for the decimal-point character.
The regular expressions in this specification are written using the POSIX / IEEE Std 1003.1 extended syntax.

1.1 An example

Suppose we have the following alignment with bases in lowercase clipped from the alignment. Read r001/1 and r001/2 constitute a read pair; r003 is a chimeric read; r004 represents a split alignment.
The corresponding SAM format is:
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1

1.2 Terminologies and Concepts

Template: A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences.
Segment: A contiguous sequence or subsequence.
Read: A raw sequence that comes off a sequencing machine. A read may consist of multiple segments. For sequencing data, reads are indexed by the order in which they are sequenced.
Linear alignment: An alignment of a read to a single reference sequence that may include insertions, deletions, skips and clipping, but may not include direction changes (i.e., one portion of the alignment on forward strand and another portion of alignment on reverse strand). A linear alignment can be represented in a single SAM record.
Chimeric alignment: An alignment of a read that cannot be represented as a linear alignment. A chimeric alignment is represented as a set of linear alignments that do not have large overlaps. Typically, one of the linear alignments in a chimeric alignment is considered the "representative" alignment, and the others are called "supplementary" and are distinguished by the supplementary alignment flag. All the SAM records in a chimeric alignment have the same QNAME and the same values for 0x40 and 0x80 flags (see Section 1.4). The decision regarding which linear alignment is representative is arbitrary.
Read alignment: A linear alignment or a chimeric alignment that is the complete representation of the alignment of the read.
Multiple mapping: The correct placement of a read may be ambiguous, e.g., due to repeats. In this case, there may be multiple read alignments for the same read. One of these alignments is considered primary. All the other alignments have the secondary alignment flag set in the SAM records that represent them. All the SAM records have the same QNAME and the same values for 0x40 and 0x80 flags. Typically the alignment designated primary is the best alignment, but the decision may be arbitrary.

1-based coordinate system: A coordinate system where the first base of a sequence is one. In this coordinate system, a region is specified by a closed interval. For example, the region between the 3rd and the 7th bases inclusive is $[3, 7]$. The SAM, VCF, GFF and Wiggle formats use the 1-based coordinate system.
0-based coordinate system: A coordinate system where the first base of a sequence is zero. In this coordinate system, a region is specified by a half-closed-half-open interval. For example, the region between the 3rd and the 7th bases inclusive is $[2, 7)$. The BAM, BCFv2, BED, and PSL formats use the 0-based coordinate system.
Phred scale: Given a probability $0 < p \le 1$, the phred scale of $p$ equals $-10\log_{10} p$, rounded to the closest integer.
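The two coordinate conventions and the Phred scale defined above can be illustrated with a short Python sketch. The function names are hypothetical and chosen only for this illustration. Python's built-in round() resolves exact .5 ties to the nearest even integer, which is an assumption on our part because the definition does not say how ties are broken.

```python
import math


def to_zero_based_half_open(start: int, end: int) -> tuple:
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start-1, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def to_one_based_closed(start: int, end: int) -> tuple:
    """Convert a 0-based half-open interval [start, end) to 1-based closed [start+1, end]."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def phred(p: float) -> int:
    """Phred-scale a probability p with 0 < p <= 1: -10 * log10(p), rounded."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must satisfy 0 < p <= 1")
    return round(-10 * math.log10(p))


if __name__ == "__main__":
    # The region between the 3rd and 7th bases inclusive, in both conventions.
    print(to_zero_based_half_open(3, 7))
    print(to_one_based_closed(2, 7))
    print(phred(0.001))
```

Note that only the start coordinate changes between the two conventions; the end coordinate is numerically the same because the 0-based interval excludes its end.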

1.2.1 Character set restrictions

Reference sequence names, CIGAR strings, and several other field types are used as values or parts of values of other fields in SAM and related formats such as VCF. To ensure that these other fields' representations are unambiguous, these field types disallow particular delimiter characters.
Query or read names may contain any printable ASCII characters in the range [!-~] apart from '@', so that SAM alignment lines can be easily distinguished from header lines. (They are also limited in length.)
Reference sequence names may contain any printable ASCII characters in the range [!-~] apart from backslashes, commas, quotation marks, and brackets (i.e., apart from the characters \ , " ` ' ( ) [ ] { } < >) and may not start with '*' or '='.
Thus they match the following regular expression:

[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

For clarity, elsewhere in this specification we write this set of allowed characters as a character class [:rname:] and extend the POSIX regular expression notation to use ∧*= to indicate the omission of '*' and '=' from the character class. Thus this regular expression can be written more clearly as [:rname:∧*=][:rname:]*.
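As a minimal sketch, the reference sequence name rule can be checked with Python's re module. The pattern below is the expanded form of [:rname:∧*=][:rname:]*, that is, every printable ASCII character except the excluded delimiters, with '*' and '=' additionally excluded from the first position. The helper name is_valid_reference_name is hypothetical.

```python
import re

# First character: allowed set without '*' and '='.
# Remaining characters: the full allowed set, including '*' and '='.
RNAME_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_reference_name(name: str) -> bool:
    """Return True if the whole string is a legal reference sequence name."""
    return RNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["chr1", "HLA-A*01:01", "*", "=chr1", "chr,1", "chr 1"]:
        print(repr(candidate), is_valid_reference_name(candidate))
```

fullmatch is used rather than match so that a valid prefix followed by a forbidden character (for example a comma) is rejected.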

1.3 The header section

Each header line begins with the character '@' followed by one of the two-letter header record type codes defined in this section. In the header, each line is TAB-delimited and, apart from @CO lines, each data field follows a format 'TAG:VALUE' where TAG is a two-character string that defines the format and content of VALUE. Thus header lines match /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/. Within each (non-@CO) header line, no field tag may appear more than once and the order in which the fields appear is not significant.
The following table describes the header record types that may be used and their predefined tags. Tags listed with '*' are required; e.g., every @SQ header line must have SN and LN fields. As with alignment optional fields (see Section 1.5), you can freely add new tags for further data fields. Tags containing lowercase letters are reserved for local use and will not be formally defined in any future version of this specification.
Tag    Description
@HD    File-level metadata. Optional. If present, there must be only one @HD line and it must be the first line of the file.
  VN*  Format version. Accepted format: /^[0-9]+\.[0-9]+$/.
  SO   Sorting order of alignments. Valid values: unknown (default), unsorted, queryname and coordinate. For coordinate sort, the major sort key is the RNAME field, with order defined by the order of @SQ lines in the header. The minor sort key is the POS field. For alignments with equal RNAME and POS, order is arbitrary. All alignments with '*' in RNAME field follow alignments with some other value but otherwise are in arbitrary order. For queryname sort, no explicit requirement is made regarding the ordering other than that it be applied consistently throughout the entire file.
  GO   Grouping of alignments, indicating that similar alignment records are grouped together but the file is not necessarily sorted overall. Valid values: none (default), query (alignments are grouped by QNAME), and reference (alignments are grouped by RNAME/POS).
  SS   Sub-sorting order of alignments. Valid values are of the form sort-order:sub-sort, where sort-order is the same value stored in the SO tag and sub-sort is an implementation-dependent colon-separated string further describing the sort order, but with some predefined terms defined in Section 1.3.1. For example, if an algorithm relies on a coordinate sort that, at each coordinate, is further sorted by query name then the header could contain @HD SO:coordinate SS:coordinate:queryname. If the primary sort is not one of the predefined primary sort orders, then unsorted should be used and the sub-sort is effectively the major sort. For example, if sorted by an auxiliary tag MI then by coordinate then the header could contain @HD SO:unsorted SS:unsorted:MI:coordinate. Regular expression: (coordinate|queryname|unsorted)(:[A-Za-z0-9_-]+)+
@SQ    Reference sequence dictionary. The order of @SQ lines defines the alignment sorting order.
  SN*  Reference sequence name. The SN tags and all individual AN names in all @SQ lines must be distinct. The value of this field is used in the alignment records in RNAME and RNEXT fields. Regular expression: [:rname:∧*=][:rname:]*
  LN*  Reference sequence length. Range: $[1, 2^{31}-1]$
  AH   Indicates that this sequence is an alternate locus. The value is the locus in the primary assembly for which this sequence is an alternative, in the format 'chr:start-end', 'chr' (if known), or '*' (if unknown), where 'chr' is a sequence in the primary assembly. Must not be present on sequences in the primary assembly.
  AN   Alternative reference sequence names. A comma-separated list of alternative names that tools may use when referring to this reference sequence. These alternative names are not used elsewhere within the SAM file; in particular, they must not appear in alignment records' RNAME or RNEXT fields. Regular expression: name(,name)* where name is [:rname:∧*=][:rname:]*
  AS   Genome assembly identifier.
  DS   Description. UTF-8 encoding may be used.
  M5   MD5 checksum of the sequence. See Section 1.3.2.
  SP   Species.
  TP   Molecule topology. Valid values: linear (default) and circular.
  UR   URI of the sequence. This value may start with one of the standard protocols, e.g., 'http:' or 'ftp:'. If it does not start with one of these protocols, it is assumed to be a file-system path.
@RG    Read group. Unordered multiple @RG lines are allowed.
  ID*  Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read group IDs may be modified when merging SAM files in order to handle collisions.
  BC   Barcode sequence identifying the sample or library. This value is the expected barcode bases as read by the sequencing machine in the absence of errors. If there are several barcodes for the sample/library (e.g., one on each end of the template), the recommended implementation concatenates all the barcodes separating them with hyphens ('-').
  CN   Name of sequencing center producing the read.
  DS   Description. UTF-8 encoding may be used.
  DT   Date the run was produced (ISO8601 date or date/time).
  FO   Flow order. The array of nucleotide bases that correspond to the nucleotides used for each flow of each read. Multi-base flows are encoded in IUPAC format, and non-nucleotide flows by various other characters. Format: /\*|[ACMGRSVTWYHKDBN]+/
  KS   The array of nucleotide bases that correspond to the key sequence of each read.
  LB   Library.
  PG   Programs used for processing the read group.
  PI   Predicted median insert size, rounded to the nearest integer.
  PL   Platform/technology used to produce the reads. Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SINGULAR, SOLID, and ULTIMA. This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown.
  PM   Platform model. Free-form text providing further details of the platform/technology used.
  PU   Platform unit (e.g., flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier.
  SM   Sample. Use pool name where a pool is being sequenced.
@PG    Program.
  ID*  Program record identifier. Each @PG line must have a unique ID. The value of ID is used in the alignment PG tag and PP tags of other @PG lines. PG IDs may be modified when merging SAM files in order to handle collisions.
  PN   Program name.
  CL   Command line. UTF-8 encoding may be used.
  PP   Previous @PG-ID. Must match another @PG header's ID tag. @PG records may be chained using PP tag, with the last record in the chain having no PP tag. This chain defines the order of programs that have been applied to the alignment. PP values may be modified when merging SAM files in order to handle collisions of PG IDs. The first PG record in a chain (i.e., the one referred to by the PG tag in a SAM record) describes the most recent program that operated on the SAM record. The next PG record in the chain describes the next most recent program that operated on the SAM record. The PG ID on a SAM record is not required to refer to the newest PG record in a chain. It may refer to any PG record in a chain, implying that the SAM record has been operated on by the program in that PG record, and the program(s) referred to via the PP tag.
  DS   Description. UTF-8 encoding may be used.
  VN   Program version.
@CO    One-line text comment. Unordered multiple @CO lines are allowed. UTF-8 encoding may be used.
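The header rules in this section (TAB-delimited 'TAG:VALUE' fields, no repeated tags within a line, required tags per record type, free-text @CO lines) can be sketched as a small parser. This is only an illustration under stated assumptions: the function name parse_header_line and the REQUIRED_TAGS table are hypothetical, the patterns are the header-line expressions /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ and /^@CO\t.*/, and values that may legitimately contain UTF-8 (such as DS or CL) are not handled by the printable-ASCII value pattern used here.

```python
import re

HEADER_RE = re.compile(r"@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+")
COMMENT_RE = re.compile(r"@CO\t.*")

# Tags marked with '*' in the table above.
REQUIRED_TAGS = {"HD": {"VN"}, "SQ": {"SN", "LN"}, "RG": {"ID"}, "PG": {"ID"}}


def parse_header_line(line: str):
    """Parse one header line into (record_type, {tag: value})."""
    line = line.rstrip("\r\n")
    if COMMENT_RE.fullmatch(line):
        return "CO", {"text": line[4:]}  # everything after "@CO\t"
    match = HEADER_RE.fullmatch(line)
    if match is None:
        raise ValueError(f"malformed header line: {line!r}")
    record_type = match.group(1)
    fields = {}
    for field in line.split("\t")[1:]:
        tag, value = field[:2], field[3:]  # field looks like "TG:value"
        if tag in fields:
            raise ValueError(f"tag {tag} appears more than once")
        fields[tag] = value
    missing = REQUIRED_TAGS[record_type] - set(fields)
    if missing:
        raise ValueError(f"@{record_type} line lacks required tags: {sorted(missing)}")
    return record_type, fields


if __name__ == "__main__":
    print(parse_header_line("@HD\tVN:1.6\tSO:coordinate"))
    print(parse_header_line("@SQ\tSN:ref\tLN:45"))
```

Fields are stored in a dictionary because the order of fields within a header line is not significant.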

1.3.1 Defined sub-sort terms

While the SS sub-sort field allows implementation-defined keywords, some terms are predefined with specific meanings.

lexicographical sort order is defined as a character-based dictionary sort with the character order as defined by the POSIX C locale. For example "abc", "abc17", "abc5", "abc59" and "abcd" are in lexicographical order.

natural sort order is similar to lexicographical order except that runs of adjacent digits are considered to be numbers embedded within the text string, ordered numerically when compared to each other and ordered as single digits when compared to the surrounding non-digit characters. Runs that differ only in the number of leading zeros (thus are numerically tied) are ordered by more-zeros coming before fewer-zeros. The characters '-' and '.' are considered as ordinary characters, so apparently negative or fractional values are not treated as part of an embedded number. For example, "abc", "abc+5", "abc-5", "abc.d", "abc03", "abc5", "abc008", "abc08", "abc8", "abc17", "abc17.+", "abc17.2", "abc17.d", "abc59" and "abcd" are in natural order.

umi is a lexicographical sort by the UMI tag. The MI tag should be used for comparing UMIs. The RX tag may be used in its absence but is not guaranteed to be unique across multiple libraries.
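The natural sort order described above can be sketched as a pairwise comparator. This is one possible reading rather than a reference implementation: the name natural_compare is hypothetical, and we assume that the leading-zeros tie-break is applied as soon as two numerically equal digit runs are met, before any later characters are examined.

```python
from functools import cmp_to_key


def _is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def natural_compare(a: str, b: str) -> int:
    """Return a negative, zero or positive value if a sorts before, with or after b."""
    i = j = 0
    while i < len(a) and j < len(b):
        if _is_digit(a[i]) and _is_digit(b[j]):
            # Both strings are at a digit run: compare the runs as numbers.
            start_i, start_j = i, j
            while i < len(a) and _is_digit(a[i]):
                i += 1
            while j < len(b) and _is_digit(b[j]):
                j += 1
            run_a, run_b = a[start_i:i], b[start_j:j]
            if int(run_a) != int(run_b):
                return -1 if int(run_a) < int(run_b) else 1
            if len(run_a) != len(run_b):
                # Numerically tied: more leading zeros (a longer run) sorts first.
                return -1 if len(run_a) > len(run_b) else 1
        else:
            # Ordinary character comparison in C-locale (code point) order.
            if a[i] != b[j]:
                return -1 if a[i] < b[j] else 1
            i += 1
            j += 1
    # One string is exhausted: the one with nothing left sorts first.
    rest_a, rest_b = len(a) - i, len(b) - j
    return (rest_a > rest_b) - (rest_a < rest_b)


if __name__ == "__main__":
    names = ["abc59", "abc8", "abc", "abc08", "abc17", "abc008", "abcd", "abc5"]
    print(sorted(names, key=cmp_to_key(natural_compare)))
```

A comparator is used instead of a sort key because a digit run must be compared numerically against another digit run but character-wise against a non-digit, which is awkward to express as a single key tuple.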

1.3.2 Reference MD5 calculation

The M5 tag on @SQ lines allows reference sequences to be uniquely identified through the MD5 digest of the sequence itself. As the digest is based on the sequence and nothing else, it can help resolve ambiguities with reference naming. For example, it allows a quick way of checking that references named '1', 'Chr1' and 'chr1' in different files are in fact the same.
The reference sequence must be in the 7-bit US-ASCII character set. All valid reference bases can be represented in this set, and it avoids the problem of determining exactly which 8-bit representation may have been used. Padding characters (see Section 3.2) must be represented only using the '*' character.
The digest is calculated as follows:
  • All characters outside of the inclusive range 33 ('!') to 126 ('~') are stripped out. This removes all unprintable and whitespace characters including spaces and new lines. Everything else is retained, even if not a legal nucleotide code.
  • All lowercase characters are converted to uppercase. This operation is equivalent to calling toupper() on characters in the POSIX locale.
  • The MD5 digest is calculated as described in RFC 1321 and presented as a 32 character lowercase hexadecimal number.
As an example, if the reference contains the following characters (including spaces):

ACGT ACGT ACGT
acgt acgt acgt
... 12345 !!!
then the digest is that of the string ACGTACGTACGTACGTACGTACGT...12345!!! and the resulting tag would be M5:dfabdbb36e239a6da88957841f32b8e4.
In padded SAM files, the padding bases should be inserted into the reference as '*' characters. Taking the example in Section 3.2, the padded version of the reference is
AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
and the corresponding tag is M5:caad65b937c4bc0b33c08f62a9fb5411.
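The three digest steps listed in this section can be sketched with Python's standard hashlib module. The function names normalize_reference and reference_md5 are hypothetical; the steps themselves (strip everything outside ASCII 33 to 126, uppercase, MD5 as lowercase hex) follow the description above.

```python
import hashlib


def normalize_reference(sequence: str) -> str:
    """Apply the first two steps: keep only characters 33..126, then uppercase."""
    kept = "".join(ch for ch in sequence if 33 <= ord(ch) <= 126)
    # Only ASCII survives the filter, so upper() behaves like toupper() in the POSIX locale.
    return kept.upper()


def reference_md5(sequence: str) -> str:
    """Return the M5 value: the MD5 digest as 32 lowercase hexadecimal characters."""
    return hashlib.md5(normalize_reference(sequence).encode("ascii")).hexdigest()


if __name__ == "__main__":
    example = "ACGT ACGT ACGT\nacgt acgt acgt\n... 12345 !!!"
    print(normalize_reference(example))
    print("M5:" + reference_md5(example))
```

Run on the three-line example reference, the normalized string is the one quoted in the text, so the printed tag can be compared directly against the M5 value given there.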

1.4 The alignment section: mandatory fields

In the SAM format, each alignment line typically represents the linear alignment of a segment. Each line consists of 11 or more TAB-separated fields. The first eleven fields are always present and in the order shown below; if the information represented by any of these fields is unavailable, that field's value will be a placeholder, either '0' or '*' as determined by the field's type. The following table gives an overview of these mandatory fields in the SAM format:
Col  Field  Type    Regexp/Range                   Brief description
1    QNAME  String  [!-?A-~]{1,254}                Query template NAME
2    FLAG   Int     $[0, 2^{16}-1]$                bitwise FLAG
3    RNAME  String  \*|[:rname:∧*=][:rname:]*      Reference sequence NAME
4    POS    Int     $[0, 2^{31}-1]$                1-based leftmost mapping POSition
5    MAPQ   Int     $[0, 2^{8}-1]$                 MAPping Quality
6    CIGAR  String  \*|([0-9]+[MIDNSHPX=])+        CIGAR string
7    RNEXT  String  \*|=|[:rname:∧*=][:rname:]*    Reference name of the mate/next read
8    PNEXT  Int     $[0, 2^{31}-1]$                Position of the mate/next read
9    TLEN   Int     $[-2^{31}+1, 2^{31}-1]$        observed Template LENgth
10   SEQ    String  \*|[A-Za-z=.]+                 segment SEQuence
11   QUAL   String  [!-~]+                         ASCII of Phred-scaled base QUALity+33
All mapped segments in alignment lines are represented on the forward genomic strand. For segments that have been mapped to the reverse strand, the recorded SEQ is reverse complemented from the original unmapped sequence and CIGAR, QUAL, and strand-sensitive optional fields are reversed and thus recorded consistently with the sequence bases as represented.
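As a rough illustration of the mandatory-field layout, the sketch below splits one alignment line on TAB characters and converts the integer columns. The class name AlignmentRecord and the function parse_alignment_line are hypothetical, and no validation beyond the column count and integer conversion is attempted.

```python
from typing import List, NamedTuple


class AlignmentRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: List[str]  # any fields after the eleventh, left unparsed


def parse_alignment_line(line: str) -> AlignmentRecord:
    """Split a TAB-separated alignment line into its 11 mandatory fields."""
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) < 11:
        raise ValueError(f"expected at least 11 fields, found {len(columns)}")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = columns[:11]
    return AlignmentRecord(
        qname, int(flag), rname, int(pos), int(mapq), cigar,
        rnext, int(pnext), int(tlen), seq, qual, columns[11:],
    )


if __name__ == "__main__":
    record = parse_alignment_line(
        "r001\t147\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1"
    )
    print(record.qname, record.flag, record.pos, record.tlen, record.optional)
```

The sample line is the last record of the example in Section 1.1, written with explicit TAB separators.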
  1. QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME '*' indicates the information is unavailable. In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.
  2. FLAG: Combination of bitwise FLAGs. Each bit is explained in the following table:
Bit           Description
1     0x1     template having multiple segments in sequencing
2     0x2     each segment properly aligned according to the aligner
4     0x4     segment unmapped
8     0x8     next segment in the template unmapped
16    0x10    SEQ being reverse complemented
32    0x20    SEQ of the next segment in the template being reverse complemented
64    0x40    the first segment in the template
128   0x80    the last segment in the template
256   0x100   secondary alignment
512   0x200   not passing filters, such as platform/vendor quality controls
1024  0x400   PCR or optical duplicate
2048  0x800   supplementary alignment
  • For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies 'FLAG & 0x900 == 0'. This line is called the primary line of the read.
  • Bit 0x100 marks the alignment not to be used in certain analyses when the tools in use are aware of this bit. It is typically used to flag alternative mappings when multiple mappings are presented in a SAM.
  • Bit 0x800 indicates that the corresponding alignment line is part of a chimeric alignment. A line flagged with 0x800 is called a supplementary line.
  • Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800.
  • Bit 0x10 indicates whether SEQ has been reverse complemented and QUAL reversed. When bit 0x4 is unset, this corresponds to the strand to which the segment has been mapped: bit 0x10 unset indicates the forward strand, while set indicates the reverse strand. When 0x4 is set, this indicates whether the unmapped read is stored in its original orientation as it came off the sequencing machine.
  • Bits 0x40 and 0x80 reflect the read ordering within each template inherent in the sequencing technology used. If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template or when this information is lost during data processing.
  • If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80.
  • Bits that are not listed in the table are reserved for future use. They should not be set when writing and should be ignored on reading by current software.
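To make the bit layout concrete, here is a small sketch that decodes a FLAG value into the bits of the table above and tests the primary-line condition, written here as (FLAG & 0x900) == 0. The short bit names and the function names are illustrative assumptions, not terms defined by the specification.

```python
# Bit value -> short illustrative name (the names are not part of the format).
FLAG_BITS = {
    0x1: "multi_segment",
    0x2: "properly_aligned",
    0x4: "unmapped",
    0x8: "next_unmapped",
    0x10: "reverse_complemented",
    0x20: "next_reverse_complemented",
    0x40: "first_segment",
    0x80: "last_segment",
    0x100: "secondary",
    0x200: "filter_failed",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """List the names of all defined bits set in flag, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def is_primary_line(flag: int) -> bool:
    """A primary line has neither the secondary (0x100) nor the supplementary (0x800) bit."""
    return (flag & 0x900) == 0


def is_unmapped(flag: int) -> bool:
    """Bit 0x4 is the only reliable indicator that the read is unmapped."""
    return (flag & 0x4) != 0


if __name__ == "__main__":
    for flag in (99, 147, 2064):  # FLAG values from the example in Section 1.1
        print(flag, decode_flag(flag), is_primary_line(flag))
```

The three values shown are the FLAGs of r001/1, r001/2 and the supplementary r003 line in the example of Section 1.1.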
  3. RNAME: Reference sequence NAME of the alignment. If @SQ header lines are present, RNAME (if not '*') must be present in one of the SQ-SN tags. An unmapped segment without coordinate has a '*' at this field. However, an unmapped segment may also have an ordinary coordinate such that it can be placed at a desired position after sorting. If RNAME is '*', no assumptions can be made about POS and CIGAR.

  4. POS: 1-based leftmost mapping POSition of the first CIGAR operation that "consumes" a reference base (see table below). The first base in a reference sequence has coordinate 1. POS is set as 0 for an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and CIGAR.

  5. MAPQ: MAPping Quality. It equals $-10\log_{10}\Pr\{\text{mapping position is wrong}\}$, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

  6. CIGAR: CIGAR string. The CIGAR operations are given in the following table (set '*' if unavailable):
Op  BAM  Description                                             Consumes query  Consumes reference
M   0    alignment match (can be a sequence match or mismatch)   yes             yes
I   1    insertion to the reference                              yes             no
D   2    deletion from the reference                             no              yes
N   3    skipped region from the reference                       no              yes
S   4    soft clipping (clipped sequences present in SEQ)        yes             no
H   5    hard clipping (clipped sequences NOT present in SEQ)    no              no
P   6    padding (silent deletion from padded reference)         no              no
=   7    sequence match                                          yes             yes
X   8    sequence mismatch                                       yes             yes
  • "Consumes query" and "consumes reference" indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
  • H can only be present as the first and/or last operation.
  • S may only have H operations between them and the ends of the CIGAR string.
  • For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
  • Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.
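The "consumes query" and "consumes reference" columns translate directly into two length computations, sketched below. The function names are hypothetical, and the pattern used for validation is the CIGAR expression \*|([0-9]+[MIDNSHPX=])+.

```python
import re

CIGAR_RE = re.compile(r"\*|([0-9]+[MIDNSHPX=])+")
OPERATION_RE = re.compile(r"([0-9]+)([MIDNSHPX=])")

CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> list:
    """Return a list of (length, operation) pairs; '*' yields an empty list."""
    if CIGAR_RE.fullmatch(cigar) is None:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    if cigar == "*":
        return []
    return [(int(length), op) for length, op in OPERATION_RE.findall(cigar)]


def query_length(operations: list) -> int:
    """Sum of M/I/S/=/X lengths; this should equal len(SEQ) when SEQ is stored."""
    return sum(length for length, op in operations if op in CONSUMES_QUERY)


def reference_length(operations: list) -> int:
    """Number of reference bases spanned, i.e. the sum of M/D/N/=/X lengths."""
    return sum(length for length, op in operations if op in CONSUMES_REFERENCE)


if __name__ == "__main__":
    ops = parse_cigar("8M2I4M1D3M")  # r001/1 in the example of Section 1.1
    print(ops, query_length(ops), reference_length(ops))
```

For a mapped record, POS + reference_length - 1 then gives the 1-based coordinate of the last reference base covered by the alignment.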
  7. RNEXT: Reference sequence name of the primary alignment of the NEXT read in the template. For the last read, the next read is the first read in the template. If @SQ header lines are present, RNEXT (if not '*' or '=') must be present in one of the SQ-SN tags. This field is set as '*' when the information is unavailable, and set as '=' if RNEXT is identical to RNAME. If not '=' and the next read in the template has one primary mapping (see also bit 0x100 in FLAG), this field is identical to RNAME at the primary line of the next read. If RNEXT is '*', no assumptions can be made on PNEXT and bit 0x20.
  8. PNEXT: 1-based Position of the primary alignment of the NEXT read in the template. Set as 0 when the information is unavailable. This field equals POS at the primary line of the next read. If PNEXT is 0, no assumptions can be made on RNEXT and bit 0x20.
  9. TLEN: signed observed Template LENgth. For primary reads where the primary alignments of all reads in the template are mapped to the same reference sequence, the absolute value of TLEN equals the distance between the mapped end of the template and the mapped start of the template, inclusively (i.e., end - start + 1). Note that mapped base is defined to be one that aligns to the reference as described by CIGAR, hence excludes soft-clipped bases. The TLEN field is positive for the leftmost segment of the template, negative for the rightmost, and the sign for any middle segment is undefined. If segments cover the same coordinates then the choice of which is leftmost and rightmost is arbitrary, but the two ends must still have differing signs. It is set as 0 for a single-segment template or when the information is unavailable (e.g., when the first or last segment of a multi-segment template is unmapped or when the two are mapped to different reference sequences).

The intention of this field is to indicate where the other end of the template has been aligned without needing to read the remainder of the SAM file. Unfortunately there has been no clear consensus on the definitions of the template mapped start and end. Thus the exact definitions are implementation-defined.
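
Because the exact definition is implementation-defined, the following Python sketch shows only one plausible reading for a two-segment template: the template spans from the smallest mapped start to the largest mapped end. The function name `tlen_for_pair` and the tie-breaking rule are assumptions made for illustration.

```python
def tlen_for_pair(start1, end1, start2, end2):
    """Return (TLEN of segment 1, TLEN of segment 2).

    Coordinates are 1-based inclusive mapped positions on the same
    reference, with soft-clipped bases excluded.
    """
    length = max(end1, end2) - min(start1, start2) + 1
    # Assumption: on a tie, segment 1 is arbitrarily treated as leftmost.
    if start1 <= start2:
        return length, -length
    return -length, length
```

With a first segment mapped at 7..24 and a second at 39..47, this yields 41 and -41.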

10. SEQ: segment SEQuence. This field can be a '*' when the sequence is not stored. If not a '*', the length of the sequence must equal the sum of lengths of M/I/S/=/X operations in CIGAR. An '=' denotes the base is identical to the reference base. No assumptions can be made on the letter cases.

11. QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A base quality is the phred-scaled base error probability which equals $-10 \log_{10} \Pr\{\text{base is wrong}\}$. This field can be a '*' when quality is not stored. If not a '*', SEQ must not be a '*', and the length of the quality string ought to equal the length of SEQ.
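
As an illustration of the QUAL encoding, here is a small Python sketch that converts between Phred scores, their printable characters, and error probabilities. The helper names are hypothetical, and the upper bound of 93 is the largest score that still maps to a printable ASCII character ('~').

```python
import math

def phred_to_char(q):
    """Encode one Phred score as its QUAL character (score + 33)."""
    if not 0 <= q <= 93:
        raise ValueError("Phred score outside printable range")
    return chr(q + 33)

def qual_string_to_phred(qual):
    """Decode a QUAL string into a list of Phred scores."""
    return [ord(c) - 33 for c in qual]

def error_probability(q):
    """Probability that the base is wrong for Phred score q."""
    return 10 ** (-q / 10)

def phred_from_probability(p):
    """Phred-scaled value of an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```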

1.5 The alignment section: optional fields

All optional fields follow the TAG:TYPE:VALUE format where TAG is a two-character string that matches /[A-Za-z][A-Za-z0-9]/. Within each alignment line, no TAG may appear more than once and the order in which the optional fields appear is not significant. A TAG containing lowercase letters is reserved for end users. In an optional field, TYPE is a single case-sensitive letter which defines the format of VALUE:

| Type | Regexp matching VALUE | Description |
|---|---|---|
| A | [!-~] | Printable character |
| i | [-+]?[0-9]+ | Signed integer |
| f | [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)? | Single-precision floating number |
| Z | [ !-~]* | Printable string, including space |
| H | ([0-9A-F][0-9A-F])* | Byte array in the Hex format |
| B | [cCsSiIf](,[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?)* | Integer or numeric array |

For an integer or numeric array (type 'B'), the first letter indicates the type of numbers in the following comma separated array. The letter can be one of 'cCsSiIf', corresponding to int8_t (signed 8-bit integer), uint8_t (unsigned 8-bit integer), int16_t, uint16_t, int32_t, uint32_t and float, respectively. During import/export, the element type may be changed if the new type is also compatible with the array.
Predefined tags are described in the separate Sequence Alignment/Map Optional Fields Specification. See that document for details of existing standard tag fields and conventions around creating new tags that may be of general interest. Tags starting with 'X', 'Y' or 'Z' and tags containing lowercase letters in either position are reserved for local use and will not be formally defined in any future version of these specifications.
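
The TAG:TYPE:VALUE grammar lends itself to a small validating parser. The sketch below is illustrative only: `parse_optional_field` and `is_local_tag` are hypothetical names, and the per-type patterns are the regular expressions the SAM specification gives for each TYPE.

```python
import re

TAG_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]")
NUMBER = r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?"
VALUE_PATTERNS = {
    "A": r"[!-~]",
    "i": r"[-+]?[0-9]+",
    "f": NUMBER,
    "Z": r"[ !-~]*",
    "H": r"([0-9A-F][0-9A-F])*",
    "B": r"[cCsSiIf](," + NUMBER + r")*",
}

def parse_optional_field(field):
    """Parse one optional field into (tag, type, converted value)."""
    tag, typ, value = field.split(":", 2)  # VALUE itself may contain ':'
    if not TAG_PATTERN.fullmatch(tag):
        raise ValueError(f"bad TAG: {tag!r}")
    if typ not in VALUE_PATTERNS or not re.fullmatch(VALUE_PATTERNS[typ], value):
        raise ValueError(f"bad TYPE or VALUE in {field!r}")
    if typ == "i":
        return tag, typ, int(value)
    if typ == "f":
        return tag, typ, float(value)
    if typ == "B":
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, typ, (subtype, [convert(x) for x in items])
    return tag, typ, value

def is_local_tag(tag):
    """True for tags reserved for local use (X?, Y?, Z?, or any lowercase)."""
    return tag[0] in "XYZ" or any(c.islower() for c in tag)
```

Splitting on the first two colons only is a deliberate choice, since 'Z' values such as annotation strings may themselves contain colons.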

2 Recommended Practice for the SAM Format

This section describes the best practice for representing data in the SAM format. These practices are not required in general, but may be required by a specific software package for it to function properly.
  1. The header section
    1 The @HD line should be present, with either the SO tag or the GO tag (but not both) specified.

    2 The @SQ lines should be present if reads have been mapped.

    3 When an RG tag appears anywhere in the alignment section, there should be a single corresponding @RG line with matching ID tag in the header.

    4 When a PG tag appears anywhere in the alignment section, there should be a single corresponding @PG line with matching ID tag in the header.
  2. Adjacent CIGAR operations should be different.
  3. No alignments should be assigned mapping quality 255.
  4. Unmapped reads
    1 For an unmapped paired-end or mate-pair read whose mate is mapped, the unmapped read should have RNAME and POS identical to its mate.

    2 If all segments in a template are unmapped, their RNAME should be set as '*' and POS as 0.

    3 If POS plus the sum of lengths of M/=/X/D/N operations in CIGAR exceeds the length specified in the LN field of the @SQ header line (if exists) with an SN equal to RNAME, the alignment should be unmapped, unless the reference sequence is circular (see below).

    4 Unmapped reads should be stored in the orientation in which they came off the sequencing machine and have their reverse flag bit (0x10) correspondingly unset.
  5. Multiple mapping
    1 When one segment is present in multiple lines to represent a multiple mapping of the segment, only one of these records should have the secondary alignment flag bit (0x100) unset. RNEXT and PNEXT point to the primary line of the next read in the template.

    2 SEQ and QUAL of secondary alignments should be set to '*' to reduce the file size.
  6. Optional tags:
    1 If the template has more than 2 segments, the TC tag should be present.

    2 The NM tag should be present.
  7. Circular reference sequences
    Mappings that cross the coordinate 'join' in circular reference sequences (i.e., those whose @SQ headers specify TP:circular) may be represented as follows:

    1 (Preferred) As usual POS should be between 1 and the @SQ header's LN value, but POS plus the sum of the lengths of M/=/X/D/N operations may exceed LN. Coordinates greater than LN are interpreted by subtracting LN so that bases at LN+1, LN+2, LN+3, ... are considered to be mapped at positions 1, 2, 3, ...; thus each (1-based) position $p$ is interpreted as $((p - 1) \bmod \mathrm{LN}) + 1$.

    2 Alternatively, such alignments may be split across several records: one record representing the initial portion of the segment ending at LN, one representing the final portion starting from 1, and any other records representing additional portions in between spanning the entire reference sequence. One record (chosen arbitrarily) is considered primary and the remainder have their supplementary flag bit (0x800) set.
  8. Annotation dummy reads: These have SEQ set to *, FLAG bits 0x100 and 0x200 set (secondary and filtered), and a CT tag.

    1 If you wish to store free text in a CT tag, use the key value Note (uppercase N) to match GFF3.

    2 Multi-segment annotation (e.g., a gene with introns) should be described with multiple lines in SAM (like a multi-segment read). Where there is a clear biological direction (e.g., a gene), the first segment (FLAG bit 0x40) is used for the first section (e.g., the 5' end of the gene). Thus a GenBank entry location like complement(join(85052..85354, 85441..85621, 86097..86284)) would have three lines in SAM with a common QNAME:

| | FLAG | POS | CIGAR | Optional fields |
|---|---|---|---|---|
| The 5' fragment | | 86097 | 188M | FI:i:1 TC:i:3 |
| Middle fragment | | 85441 | 181M | FI:i:2 TC:i:3 |
| The 3' fragment | | 85052 | 303M | FI:i:3 TC:i:3 |

    3 If converting GFF3 to SAM, store any key/value pairs from column 9 in the CT tag, except for the unique ID which is used for the QNAME. GFF3 columns 1 (seqid), 4 (start) and 5 (end) are encoded using SAM columns RNAME, POS and CIGAR to hold the length. GFF3 columns 3 (type) and 7 (strand) are stored explicitly in the CT tag. Remaining GFF3 columns 2 (source), 6 (score), and 8 (phase) are stored in the CT tag using key values FSource, FScore and FPhase (uppercase keys are restricted in GFF3, so these names avoid clashes). Split location features are described with multiple lines in GFF3, and similarly become multi-segment dummy reads in SAM, with the RNEXT and PNEXT columns filled in appropriately. In the absence of a convention in SAM/BAM for reads wrapping the origin of a circular genome, any GFF3 feature line wrapping the origin must be split into two segments in SAM.
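
The two representations for circular references in item 7 can be sketched in Python as follows. This is an illustration under assumptions: `wrap_position` and `split_at_join` are hypothetical helper names, and `span` stands for the number of reference bases the alignment consumes.

```python
def wrap_position(p, ln):
    """Interpret a 1-based coordinate p on a circular reference of length ln."""
    return (p - 1) % ln + 1

def split_at_join(pos, span, ln):
    """Split an alignment covering `span` reference bases from `pos` into
    1-based inclusive (start, end) pieces that each stay within 1..ln."""
    pieces = []
    start, remaining = pos, span
    while remaining > 0:
        length = min(remaining, ln - start + 1)
        pieces.append((start, start + length - 1))
        remaining -= length
        start = 1  # every later piece begins at the origin
    return pieces
```

For a reference with LN of 100, an alignment at POS 95 spanning 10 bases is either kept as one record whose coordinates 101..104 wrap to 1..4, or split into the pieces 95..100 and 1..4.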

3 Guide for Describing Assembly Sequences in SAM

3.1 Unpadded versus padded representation

To describe alignments, we can regard the reference sequence with no respect to other alignments against it. Such a reference sequence is called an unpadded reference. A position on an unpadded reference, referred to as an unpadded position, is not affected by any alignments. When we use unpadded references and positions to describe alignments, we say we are using the unpadded representation.
Alternatively, to describe the same alignments, we can modify the reference sequence to contain pads that make room for sequences inserted relative to the reference. A pad is effectively a gap and conventionally represented by an asterisk '*'. A reference sequence containing pads is called a padded reference. A position which counts the *'s is referred to as a padded position. A padded reference sequence may be affected by the query alignments and because of gap insertions is typically longer than the unpadded reference. The padded position of one query alignment may be affected by other query alignments.
Unpadded and padded are different representations of the same alignments. They are convertible to each other with no loss of any information. The unpadded representation is more common due to the convenience of a fixed coordinate system, while the padded representation has the advantage that alignments can be simply described by the start and end coordinates without using complex CIGAR strings. SAM traditionally uses the unpadded representation. The ACE assembly format uses the padded representation exclusively.

3.2 Padded SAM

The SAM format is typically used to describe alignments against an unpadded reference sequence, but it is also able to describe alignments against a padded reference. In the latter case, we say we are using a padded SAM. A padded SAM is a valid SAM, but with the difference that the reference and positions in use are padded. There may be more than one way to describe the padded representation. We recommend the following; see also the discussion in Cock et al.
In a padded SAM, alignments and coordinates are described with respect to the padded reference sequence. Unlike traditional padded representations like the ACE file format where pads/gaps are recorded in reads using *'s, we do not write *'s in the SEQ field of the SAM format. Instead, we describe pads in the query sequences as deletions from the padded reference using the CIGAR 'D' operation. In a padded SAM, the insertion and padding CIGAR operations ('I' and 'P') are not used because the padded reference already considers all the insertions.
The following shows the padded SAM for the example alignment in Section 1.1. Notably, the length of ref is 47 instead of 45. POS of the last three alignments are all shifted by 2. CIGAR of alignments bridging the 2 bp insertion are also changed.
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:47
ref 516 ref 1 0 14M2D31M * 0 0 AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT *
r001 99 ref 7 30 14M1D3M = 39 41 TTAGATAAAGGATACTG *
* 768 ref 8 30 1M * 0 0 * * CT:Z:.;Warning;Note=Ref wrong?
r002 0 ref 9 30 3S6M1D5M * 0 0 AAAAGATAAGGATA * PT:Z:1;4;+;homopolymer
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 18 30 6M14N5M * 0 0 ATAGCTTCAGC *
r001 147 ref 39 30 9M = 7 -41 CAGCGGCAT * NM:i:1
Here we also exemplify the recommended practice for storing the reference sequence and the reference annotations in SAM when necessary. For a reference sequence in SAM, QNAME should be identical to RNAME, POS set to 1 and FLAG to 516 (filtered and unmapped); for an annotation, FLAG should be set to 768 (filtered and secondary) with no restriction to QNAME. Dummy reads for annotation would typically have a CT tag to hold the annotation information; see the discussion of dummy reads in Section 2. See also the separate Optional Fields Specification for full details of the CT and PT annotation tags.
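
To make the coordinate shift concrete, the Python sketch below maps unpadded positions to padded ones. The padded reference string is derived here from the `ref` record above (its SEQ with the two pads implied by 14M2D31M written as '*'); the function name `unpadded_to_padded` is hypothetical.

```python
# SEQ of the 'ref' record with '*' inserted for the 2D in 14M2D31M
PADDED_REF = "AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT"

def unpadded_to_padded(padded_ref):
    """Return a list m where m[u] is the 1-based padded position of the
    1-based unpadded position u (m[0] is unused)."""
    mapping = [0]
    for padded_pos, base in enumerate(padded_ref, start=1):
        if base != "*":
            mapping.append(padded_pos)
    return mapping
```

Positions before the pads are unchanged, while every position after them moves right by 2, which is why the later alignments in the example have POS shifted by 2.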

4 The BAM Format Specification

4.1 The BGZF compression format

BGZF is block compression implemented on top of the standard gzip file format. The goal of BGZF is to provide good compression while allowing efficient random access to the BAM file for indexed queries. The BGZF format is 'gunzip compatible', in the sense that a compliant gunzip utility can decompress a BGZF compressed file.
A BGZF file is a series of concatenated BGZF blocks, each no larger than 64 Kb before or after compression. Each BGZF block is itself a spec-compliant gzip archive which contains an "extra field" in the format described in RFC1952. The gzip file format allows the inclusion of application-specific extra fields and these are ignored by compliant decompression implementations. The gzip specification also allows gzip files to be concatenated. The result of decompressing concatenated gzip files is the concatenation of the uncompressed data.
Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:
  1. The F.EXTRA bit in the header is set to indicate that extra fields are present.
  2. The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII 'BC').
  3. The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of payload).
  4. The payload of the BGZF extra field is a 16-bit unsigned integer in little endian format. This integer gives the size of the containing BGZF block minus one.
On disk, a complete BGZF file is a series of blocks as shown in the following table. (All integers are little endian as is required by RFC1952.)

| Field | Description | Type | Value |
|---|---|---|---|
| | List of compression blocks (until the end of the file) | | |
| ID1 | gzip IDentifier1 | uint8_t | 31 |
| ID2 | gzip IDentifier2 | uint8_t | 139 |
| CM | gzip Compression Method | uint8_t | 8 |
| FLG | gzip FLaGs | uint8_t | 4 |
| MTIME | gzip Modification TIME | uint32_t | |
| XFL | gzip eXtra FLags | uint8_t | |
| OS | gzip Operating System | uint8_t | |
| XLEN | gzip eXtra LENgth | uint16_t | |
| | Extra subfield(s) (total size XLEN) | | |
| | Additional RFC1952 extra subfields if present | | |
| SI1 | Subfield Identifier1 | uint8_t | 66 |
| SI2 | Subfield Identifier2 | uint8_t | 67 |
| SLEN | Subfield LENgth | uint16_t | 2 |
| BSIZE | total Block SIZE minus 1 | uint16_t | |
| | Additional RFC1952 extra subfields if present | | |
| CDATA | Compressed DATA by zlib::deflate() | uint8_t[BSIZE-XLEN-19] | |
| CRC32 | CRC-32 | uint32_t | |
| ISIZE | Input SIZE (length of uncompressed data) | uint32_t | |

The random access method to be described next limits the uncompressed contents of each BGZF block to a maximum of $2^{16}$ bytes of data. Thus while ISIZE is stored as a uint32_t as per the gzip format, in BGZF it is limited to the range [0, 65536]. BSIZE can represent BGZF block sizes in the range [1, 65536], though typically BSIZE will be rather less than ISIZE due to compression.
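
The block layout above can be decoded with Python's standard `struct` and `zlib` modules. This is a minimal sketch rather than a production reader: `parse_bgzf_block` is a hypothetical name, the whole file is assumed to be in memory as `buf`, and only the fields listed in the table are interpreted.

```python
import struct
import zlib

def parse_bgzf_block(buf, offset=0):
    """Decode the BGZF block starting at `offset`.

    Returns (uncompressed data, total block size in bytes).
    """
    id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
        "<BBBBIBBH", buf, offset)
    if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
        raise ValueError("not a BGZF block")
    # Walk the extra subfields looking for the 'BC' one that carries BSIZE.
    pos = offset + 12
    extra_end = pos + xlen
    bsize = None
    while pos < extra_end:
        si1, si2, slen = struct.unpack_from("<BBH", buf, pos)
        if (si1, si2, slen) == (66, 67, 2):
            (bsize,) = struct.unpack_from("<H", buf, pos + 4)
        pos += 4 + slen
    if bsize is None:
        raise ValueError("BGZF 'BC' extra subfield missing")
    block_size = bsize + 1
    trailer = offset + block_size - 8          # CRC32 and ISIZE
    cdata = buf[extra_end:trailer]             # BSIZE - XLEN - 19 bytes
    crc32, isize = struct.unpack_from("<II", buf, trailer)
    data = zlib.decompress(cdata, wbits=-15)   # raw DEFLATE, no gzip wrapper
    if len(data) != isize or zlib.crc32(data) != crc32:
        raise ValueError("BGZF block failed its integrity check")
    return data, block_size
```

Calling the function repeatedly, advancing `offset` by the returned block size each time, walks the file block by block.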

4.1.1 Random access

BGZF files support random access through the BAM file index. To achieve this, the BAM file index uses virtual file offsets into the BGZF file. Each virtual file offset is an unsigned 64-bit integer, defined as: coffset<<16|uoffset, where coffset is an unsigned byte offset into the BGZF file to the beginning of a BGZF block, and uoffset is an unsigned byte offset into the uncompressed data stream represented by that BGZF block. Virtual file offsets can be compared, but subtraction between virtual file offsets and addition between a virtual offset and an integer are both disallowed.
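
The packing of a virtual file offset can be sketched in a few lines of Python. The helper names are hypothetical; the 48-bit limit on coffset simply follows from fitting both parts into 64 bits.

```python
def make_virtual_offset(coffset, uoffset):
    """Pack a block start (coffset) and an offset within the uncompressed
    block (uoffset) into one 64-bit virtual file offset."""
    if not 0 <= uoffset < 1 << 16:
        raise ValueError("uoffset must fit in 16 bits")
    if not 0 <= coffset < 1 << 48:
        raise ValueError("coffset must fit in 48 bits")
    return coffset << 16 | uoffset

def split_virtual_offset(voffset):
    """Recover (coffset, uoffset) from a virtual file offset."""
    return voffset >> 16, voffset & 0xFFFF
```

Because coffset occupies the high bits, comparing two virtual offsets as plain integers orders them by file position, which is the only arithmetic the format permits on them.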

4.1.2 End-of-file marker

An end-of-file (EOF) trailer or marker block should be written at the end of BGZF files, so that unintended file truncation can be easily detected. The EOF marker block is a particular empty BGZF block encoded with the default zlib compression level settings, and consists of the following 28 hexadecimal bytes:
1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00
The presence of this EOF marker at the end of a BGZF file indicates that the immediately following physical EOF is the end of the file as intended by the program that wrote it. Empty BGZF blocks are not otherwise special; in particular, the presence of an EOF marker block does not by itself signal end of file.
The absence of this final EOF marker should trigger a warning or error soon after opening a BGZF file where random access is available. When reading a BGZF file in sequential streaming fashion, ideally this EOF check should be performed when the end of the stream is reached. Checking that the final BGZF block in the file decompresses to empty or checking that the last 28 bytes of the file are exactly the bytes above are both sufficient tests; each is likely more convenient in different circumstances.
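
The byte-comparison variant of the EOF check might look like the following Python sketch. The function names are hypothetical; the marker bytes are the 28 bytes listed above.

```python
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)

def tail_is_eof_marker(tail):
    """True if the given bytes end with the 28-byte BGZF EOF marker."""
    return tail[-len(BGZF_EOF):] == BGZF_EOF

def file_has_eof_marker(path):
    """Check the last 28 bytes of a file on disk (requires random access)."""
    with open(path, "rb") as fh:
        fh.seek(0, 2)                       # jump to the end to learn the size
        size = fh.tell()
        fh.seek(max(0, size - len(BGZF_EOF)))
        return tail_is_eof_marker(fh.read())
```

A reader would typically warn rather than fail outright when this returns false, since the data blocks themselves may still be intact.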

4.2 The BAM format

BAM is compressed in the BGZF format. All multi-byte numbers in BAM are little-endian, regardless of the machine endianness. The format is formally described in the following table where values in brackets are the default when the corresponding information is not available; an underlined word in uppercase denotes a field in the SAM format.

| Field | Description | Type | Value |
|---|---|---|---|
| magic | BAM magic string | char[4] | BAM\1 |
| l_text | Length of the header text, including any NUL padding | uint32_t | |
| text | Plain header text in SAM; not necessarily NUL-terminated | char[l_text] | |
| n_ref | # reference sequences | uint32_t | |
| | List of reference information (n = n_ref) | | |
| l_name | Length of the reference name plus 1 (including NUL) | uint32_t (limited) | |
| name | Reference sequence name; NUL-terminated | char[l_name] | |
| l_ref | Length of the reference sequence | uint32_t | |
| | List of alignments (until the end of the file) | | |
| block_size | Total length of the alignment record, excluding this field | uint32_t (limited) | |
| refID | Reference sequence ID, -1 <= refID < n_ref; -1 for a read without a mapping position | int32_t | |
| pos | 0-based leftmost coordinate (= POS - 1) | int32_t | |
| l_read_name | Length of read_name below (= length(QNAME) + 1) | uint8_t | |
| mapq | Mapping quality (= MAPQ) | uint8_t | |
| bin | BAI index bin, see Section 4.2.1 | uint16_t | |
| n_cigar_op | Number of operations in CIGAR, see Section 4.2.2 | uint16_t | |
| flag | Bitwise flags (= FLAG) | uint16_t | |
| l_seq | Length of SEQ | uint32_t (limited) | |
| next_refID | Ref-ID of the next segment (-1 <= next_refID < n_ref) | int32_t | |
| next_pos | 0-based leftmost pos of the next segment (= PNEXT - 1) | int32_t | |
| tlen | Template length (= TLEN) | int32_t | |
| read_name | Read name, NUL-terminated (QNAME with trailing '\0') | char[l_read_name] | |
| cigar | CIGAR: op_len<<4\|op. 'MIDNSHP=X' → '012345678' | uint32_t[n_cigar_op] | |
| seq | 4-bit encoded read: '=ACMGRSVTWYHKDBN' → [0, 15]. See Section 4.2.3 | uint8_t[(l_seq+1)/2] | |
| qual | Phred-scaled base qualities. See Section 4.2.3 | char[l_seq] | |
| | List of auxiliary data (until the end of the alignment block) | | |
| tag | Two-character tag | char[2] | |
| val_type | Value type: AcCsSiIfZHB, see Section 4.2.4 | char | |
| value | Tag value | (by val_type) | |

Most length and count fields described as uint32_t have additional constraints on their range: l_text due to implementation limits; n_ref because refID and next_refID are signed; l_ref because tlen is signed; those marked "limited" are limited by available memory and the practical size of the data represented well before they are limited by, e.g., Java's signed 32-bit integer maximum array size.
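
As an illustration of the record layout, the fixed-length fields that follow block_size (refID through tlen, 32 bytes in total) can be unpacked in one call with Python's `struct` module. The sketch assumes `block` holds one alignment record without its block_size prefix; `parse_alignment_core` is a hypothetical name.

```python
import struct

# refID, pos, l_read_name, mapq, bin, n_cigar_op, flag, l_seq,
# next_refID, next_pos, tlen: all little-endian, no padding
CORE = struct.Struct("<iiBBHHHIiii")
CORE_FIELDS = ("refID", "pos", "l_read_name", "mapq", "bin", "n_cigar_op",
               "flag", "l_seq", "next_refID", "next_pos", "tlen")

def parse_alignment_core(block):
    """Decode the fixed fields and read_name of one BAM alignment record."""
    record = dict(zip(CORE_FIELDS, CORE.unpack_from(block, 0)))
    start = CORE.size
    end = start + record["l_read_name"]
    # read_name is NUL-terminated, so drop the final byte
    record["read_name"] = block[start:end - 1].decode("ascii")
    return record
```

The cigar, seq, qual and auxiliary fields follow read_name at offsets that depend on l_read_name, n_cigar_op and l_seq.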

4.2.1 BIN field calculation

BIN is calculated using the reg2bin() function in Section 5.3. For mapped reads this uses POS-1 (i.e., 0-based left position) and the alignment end point using the alignment length from the CIGAR string. For unmapped reads (e.g., paired-end reads where only one part is mapped, see Section 2) and reads whose CIGAR strings consume no reference bases at all, the alignment is treated as being of length one. Note unmapped reads with POS 0 (which becomes -1 in BAM) therefore use reg2bin(-1, 0) which is computed as 4680.
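
For readers who want to reproduce the value 4680, here is a Python rendering of the usual UCSC-style bin calculation for a 0-based half-open interval. It is a sketch for illustration, and Section 5.3 remains the authoritative definition; `record_bin` is a hypothetical wrapper that applies the length-one rule described above.

```python
def reg2bin(beg, end):
    """Smallest BAI bin fully containing the 0-based interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0

def record_bin(pos0, reference_length):
    """Bin for a record at 0-based pos0; zero-length spans count as length 1."""
    return reg2bin(pos0, pos0 + max(reference_length, 1))
```

Python's right shift on negative integers is arithmetic, so `-1 >> 14` is -1 and the unmapped case lands one bin below the first 16 kbp bin.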

4.2.2 N_CIGAR_OP field

With 16 bits, n_cigar_op can keep at most 65535 CIGAR operations in BAM files. For an alignment with more CIGAR operations, BAM stores the real CIGAR, encoded the same way as the cigar field in BAM, in the CG optional tag of type 'B,I', and sets CIGAR to 'kSmN' as a placeholder, where 'k' equals l_seq, 'm' is the reference sequence length in the alignment, and 'S' and 'N' are the soft-clipping and reference-clip CIGAR operators, respectively; i.e., in the binary form, n_cigar_op=2 and cigar=[k<<4|4, m<<4|3]. If tag CG is present and the first CIGAR operation clips the entire read, a BAM parsing library is expected to update n_cigar_op and cigar with the real CIGAR stored in the CG tag and remove the now-redundant CG tag.
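
The binary CIGAR encoding and the placeholder for over-long CIGARs can be sketched in Python as follows. The function names are hypothetical; operations are given as (length, op) pairs.

```python
CIGAR_OPS = "MIDNSHP=X"   # operation codes 0..8, in this order
MAX_CIGAR_OPS = 65535     # limit imposed by the 16-bit n_cigar_op field

def encode_cigar(ops):
    """Encode (length, op) pairs as uint32 values: op_len << 4 | op."""
    return [(length << 4) | CIGAR_OPS.index(op) for length, op in ops]

def decode_cigar(values):
    """Inverse of encode_cigar."""
    return [(v >> 4, CIGAR_OPS[v & 0xF]) for v in values]

def placeholder_cigar(l_seq, ref_len):
    """The 'kSmN' placeholder written when the real CIGAR goes in the CG tag."""
    return encode_cigar([(l_seq, "S"), (ref_len, "N")])

def needs_cg_tag(ops):
    """True if the CIGAR has too many operations for n_cigar_op."""
    return len(ops) > MAX_CIGAR_OPS
```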

4.2.3 SEQ and QUAL encoding

Sequence is encoded in 4-bit values, with adjacent bases packed into the same byte starting with the highest 4 bits first. When l_seq is odd the bottom 4 bits of the last byte are undefined, but we recommend writing these as zero. The case-insensitive base codes '=ACMGRSVTWYHKDBN' are mapped to [0, 15] respectively with all other characters mapping to 'N' (value 15).
Omitted sequence, represented in SAM as '*', is represented by l_seq being 0 and seq and qual zero-length.

Base qualities are stored as bytes in the range [0,93], without any +33 conversion to printable ASCII. When base qualities are omitted but the sequence is not, qual is filled with 0xFF bytes (to length I_seq).
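
The packing rules for seq and qual can be illustrated with a short Python sketch. The function names are hypothetical, and `encode_qual` assumes the SAM QUAL string '*' means that qualities were omitted.

```python
BASE_CODES = "=ACMGRSVTWYHKDBN"   # index in this string is the 4-bit value

def pack_seq(seq):
    """Pack a SEQ string two bases per byte, high nibble first."""
    codes = [BASE_CODES.find(base.upper()) for base in seq]
    codes = [c if c >= 0 else 15 for c in codes]   # unknown characters -> 'N'
    if len(codes) % 2:
        codes.append(0)                            # zero the unused low nibble
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))

def unpack_seq(packed, l_seq):
    """Recover the (uppercase) sequence of length l_seq from packed bytes."""
    bases = []
    for byte in packed:
        bases.append(BASE_CODES[byte >> 4])
        bases.append(BASE_CODES[byte & 0xF])
    return "".join(bases[:l_seq])

def encode_qual(qual, l_seq):
    """Convert a SAM QUAL string to BAM bytes (no +33 offset)."""
    if qual == "*":
        return b"\xff" * l_seq
    return bytes(ord(c) - 33 for c in qual)
```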

4.2.4 Auxiliary data encoding

Optional alignment fields are stored one after another, immediately following the qual field, and are included in block_size. Each field is represented as a two-character tag followed by a single type character and then its value, whose length is determined by the field's type.

Single character 'A' fields have a total length of 4 bytes, with the value represented as a single byte:

| tag | val_type | value |
|---|---|---|
| char[2] | A | char |

While all single (i.e., non-array) integer types are stored in SAM as 'i', in BAM any of 'cCsSiI' may be used together with the correspondingly-sized binary integer value, chosen according to the field value's magnitude. Similarly floating point 'f' fields are represented as IEEE binary32 values. Thus BAM numeric fields have a total length of 4, 5, or 7 bytes:

| val_type | value | Total field length (bytes) |
|---|---|---|
| c | int8_t | 4 |
| C | uint8_t | 4 |
| s | int16_t | 5 |
| S | uint16_t | 5 |
| i | int32_t | 7 |
| I | uint32_t | 7 |
| f | float | 7 |

String fields and hex-formatted byte arrays are represented as NUL-terminated text strings:

| tag | val_type | value |
|---|---|---|
| char[2] | Z or H | char ... char NUL |

The representation of a 'B' array field starts with a sub-type character similar to the numeric field types above and a count (uint32_t, but limited by memory and block_size) giving the number of elements in the array. The array elements follow, encoded as binary integers or IEEE floats sized according to the sub-type.
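
The encodings described in this section can be sketched with Python's `struct` module. This is an illustrative encoder, not a definitive one: `encode_aux` is a hypothetical name, SAM-level types are passed in as `typ`, and for integers the sketch simply picks the first type in `INT_TYPES` whose range fits, which is one of several valid choices.

```python
import struct

# Candidate BAM integer types, smallest first: (code, struct format, min, max)
INT_TYPES = [
    ("C", "<B", 0, 255), ("c", "<b", -128, 127),
    ("S", "<H", 0, 65535), ("s", "<h", -32768, 32767),
    ("I", "<I", 0, 2**32 - 1), ("i", "<i", -2**31, 2**31 - 1),
]
ARRAY_FORMATS = {"c": "b", "C": "B", "s": "h", "S": "H",
                 "i": "i", "I": "I", "f": "f"}

def encode_aux(tag, typ, value):
    """Encode one optional field (SAM type A, i, f, Z, H or B) as BAM bytes."""
    head = tag.encode("ascii")
    if typ == "A":
        return head + b"A" + value.encode("ascii")
    if typ == "i":
        for code, fmt, lo, hi in INT_TYPES:
            if lo <= value <= hi:
                return head + code.encode("ascii") + struct.pack(fmt, value)
        raise ValueError("integer out of range for BAM")
    if typ == "f":
        return head + b"f" + struct.pack("<f", value)
    if typ in ("Z", "H"):
        return head + typ.encode("ascii") + value.encode("ascii") + b"\x00"
    if typ == "B":
        subtype, items = value
        fmt = "<" + ARRAY_FORMATS[subtype]
        body = b"".join(struct.pack(fmt, x) for x in items)
        return (head + b"B" + subtype.encode("ascii")
                + struct.pack("<I", len(items)) + body)
    raise ValueError(f"unknown type {typ!r}")
```

Concatenating the results for all of a record's fields gives the auxiliary data block that follows qual.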

5 Indexing BAM

Indexing aims to achieve fast retrieval of alignments overlapping a specified region without scanning every alignment. BAM must be sorted by the reference ID and then the leftmost coordinate before indexing.
This section describes the binning scheme underlying coordinate-sorted BAM indices and its implementation in the long-established BAI format. The CSI format documented elsewhere uses a similar binning scheme and can also be used to index BAM.

5.1 Algorithm

5.1.1 Basic binning index

The UCSC binning scheme was suggested by Richard Durbin and Lincoln Stein and is explained in Kent et al. In this scheme, each bin represents a contiguous genomic region which is either fully contained in or non-overlapping with another bin; each alignment is associated with a bin which represents the smallest region containing the entire alignment. The binning scheme is essentially a representation of an R-tree. A distinct bin uniquely corresponds to a distinct internal node in an R-tree. Bin A is a child of Bin B if the region represented by A is contained in B.
To find the alignments that overlap a specified region, we need to get the bins that overlap the region, and then test each alignment in the bins to check overlap. To quickly find alignments associated with a specified bin, we can keep in the index the start file offsets of chunks of alignments which all have the bin. As alignments are sorted by the leftmost coordinates, alignments having the same bin tend to be clustered together on the disk and therefore usually a bin is only associated with a few chunks. Traversing all the alignments having the same bin usually needs a few seek calls. Given the set of bins that overlap the specified region, we can visit alignments in the order of their leftmost coordinates and stop seeking the rest when an alignment falls outside the required region. This strategy saves half of the seek calls on average.
In the BAI format, each bin may span or . Bin 0 spans a 512 Mbp region, bins span , and bins span 16 kbp regions. This implies that this index format does not support reference chromosome sequences longer than .
在 BAI 格式中,每个 bin 可能跨越 。bin 0 跨越 512 Mbp 区域,bins 跨越 ,而 bins 跨越 16 kbp 区域。这意味着该索引格式不支持长度超过 的参考染色体序列。
The CSI format generalises the sizes of the bins, and supports reference sequences of the same length as are supported by SAM and BAM.
CSI 格式概括了箱子的大小,并支持与 SAM 和 BAM 相同长度的参考序列。
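The bin numbering described above follows a regular pattern: level l (0 to 5) holds 8^l bins, each spanning 2^(29-3l) bp, and the first bin at that level is numbered (8^l - 1)/7. The following C sketch makes that arithmetic explicit. The function names are illustrative only and are not part of this specification.

```c
#include <stdint.h>

/* Number of levels in the BAI scheme (level 0 is the single 512 Mbp bin). */
#define BAI_LEVELS 6

/* First bin number at a level: 0, 1, 9, 73, 585, 4681 (and 37449 for level 6,
 * which is one past the last real bin). */
static uint32_t bai_level_first_bin(int level)
{
    return ((1u << (3 * level)) - 1) / 7;
}

/* Span in bp of every bin at a level: 2^29 at level 0 down to 2^14 at level 5. */
static uint32_t bai_level_span(int level)
{
    return 1u << (29 - 3 * level);
}

/* Level of a bin number in 0..37448, or -1 if the number is not a real bin. */
static int bai_bin_level(uint32_t bin)
{
    int level;
    if (bin >= bai_level_first_bin(BAI_LEVELS)) return -1;
    for (level = BAI_LEVELS - 1; level > 0; --level)
        if (bin >= bai_level_first_bin(level)) break;
    return level;
}

/* Zero-based start coordinate of the region covered by a real bin.
 * Returns UINT32_MAX for numbers outside 0..37448. */
static uint32_t bai_bin_start(uint32_t bin)
{
    int level = bai_bin_level(bin);
    if (level < 0) return UINT32_MAX;
    return (bin - bai_level_first_bin(level)) * bai_level_span(level);
}
```

For instance, bin 4682 is the second 16 kbp bin and so starts at coordinate 16384, while 37450 (used for the metadata pseudo-bin in the BAI format) is rejected as lying outside the normal range.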

5.1.2 Reducing small chunks

Around the boundary of two adjacent bins, we may see many small chunks, some having a smaller bin and the rest a larger bin. To reduce the number of seek calls, we may join two chunks having the same bin if they are close to each other. After this process, a joined chunk will contain alignments with different bins. We need to keep in the index the file offset of the end of each chunk to identify its boundaries.
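As an illustration of the joining step, the following C sketch merges neighbouring chunks in one bin's chunk list. It is a minimal sketch under stated assumptions, not the method of any particular implementation: the `chunk_t` type, the function names, and the `max_gap` threshold are hypothetical, and "close" is taken to mean that the compressed-file parts of the BGZF virtual offsets (the upper 48 bits) differ by at most `max_gap` bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* A chunk of alignments, delimited by (virtual) file offsets [beg, end). */
typedef struct { uint64_t beg, end; } chunk_t;

/* Assumption: two chunks are "close" when the compressed-file offsets
 * (virtual offset >> 16) of a's end and b's start are within max_gap bytes. */
static int chunks_are_close(chunk_t a, chunk_t b, uint64_t max_gap)
{
    uint64_t a_end = a.end >> 16, b_beg = b.beg >> 16;
    return b_beg <= a_end || b_beg - a_end <= max_gap;
}

/* Merge close chunks in place. The input must be sorted by beg.
 * Returns the number of chunks remaining. */
static size_t merge_close_chunks(chunk_t *c, size_t n, uint64_t max_gap)
{
    size_t i, out = 0;
    if (n == 0) return 0;
    for (i = 1; i < n; ++i) {
        if (chunks_are_close(c[out], c[i], max_gap)) {
            /* Extend the current chunk; it may now span other bins' alignments. */
            if (c[i].end > c[out].end) c[out].end = c[i].end;
        } else {
            c[++out] = c[i];
        }
    }
    return out + 1;
}
```

Because a merged chunk may cover alignments from other bins, a reader still has to test each alignment in the chunk against the query region.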

5.1.3 Combining with linear index

For an alignment starting beyond 64 Mbp, we always need to seek to some chunks in bin 0, which can be avoided by using a linear index. In the linear index, for each tiling 16384 bp window on the reference, we record the smallest file offset of the alignments that overlap with the window. Given a region [rbeg, rend), we only need to visit a chunk whose end file offset is larger than the file offset of the 16 kbp window containing rbeg.
With both binning and linear indices, we can retrieve alignments in most regions with just one seek call.
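The rule in the preceding paragraph can be sketched in C as a filter over candidate chunks. The names `chunk_t`, `linear_min_offset` and `filter_chunks` are hypothetical, and clamping to the last window when rbeg lies beyond the recorded intervals is an assumption of this sketch rather than something stated here.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t beg, end; } chunk_t;  /* (virtual) file offsets */

/* File offset recorded for the 16 kbp window containing rbeg.
 * ioffset[] is the linear index of one reference sequence (n_intv entries). */
static uint64_t linear_min_offset(const uint64_t *ioffset, size_t n_intv,
                                  uint32_t rbeg)
{
    size_t win = (size_t)(rbeg >> 14);         /* 16384 bp windows */
    if (n_intv == 0) return 0;                  /* no linear index: keep all */
    if (win >= n_intv) win = n_intv - 1;        /* assumption: clamp to last */
    return ioffset[win];
}

/* Keep only chunks whose end offset is larger than min_off.
 * Compacts the array in place and returns the number of chunks kept. */
static size_t filter_chunks(chunk_t *c, size_t n, uint64_t min_off)
{
    size_t i, out = 0;
    for (i = 0; i < n; ++i)
        if (c[i].end > min_off) c[out++] = c[i];
    return out;
}
```

A query would first collect the chunks of all overlapping bins, then call `filter_chunks` with the value from `linear_min_offset` before seeking.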

5.1.4 A conceptual example

Suppose we have a genome shorter than 144 kbp. We can design a binning scheme which consists of three types of bins: bin 0 spans 0-144 kbp, bins 1, 2 and 3 span 48 kbp each, and bins 4 to 12 span 16 kbp each:

0 (0-144kbp)
1 (0-48kbp)                           2 (48-96kbp)                          3 (96-144kbp)
4 (0-16k)   5 (16-32k)   6 (32-48k)   7 (48-64k)   8 (64-80k)   9 (80-96k)   10   11   12

An alignment starting at 65 kbp and ending at 67 kbp would have bin number 8, which is the smallest bin containing the alignment. Similarly, an alignment starting at 51 kbp and ending at 70 kbp would go to bin 2, while an alignment between [40k, 49k] would go to bin 0. Suppose we want to find all the alignments overlapping region [65k, 71k). We first calculate that bins 0, 2 and 8 overlap with this region and then traverse the alignments in these bins to find the required alignments. With a binning index alone, we need to visit the alignment at [40k, 49k] as it belongs to bin 0. But with a linear index, we know that such an alignment stops before 64 kbp and cannot overlap the specified region. A seek call can thus be saved.
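The toy three-level scheme can be written down directly. The sketch below is illustrative only: the function names are hypothetical, coordinates are in bp with 1 kbp taken as 1000 bp, and regions are zero-based and half-open.

```c
/* Bin of an alignment covering [beg, end) in the toy scheme:
 * bin 0 spans 0-144 kbp, bins 1-3 span 48 kbp, bins 4-12 span 16 kbp. */
static int toy_reg2bin(int beg, int end)
{
    --end;  /* make the end coordinate inclusive */
    if (beg / 16000 == end / 16000) return 4 + beg / 16000;
    if (beg / 48000 == end / 48000) return 1 + beg / 48000;
    return 0;
}

/* Bins that may hold alignments overlapping [beg, end).
 * Writes at most 13 entries to list and returns how many were written. */
static int toy_reg2bins(int beg, int end, int list[13])
{
    int i = 0, k;
    --end;
    list[i++] = 0;                                             /* whole genome */
    for (k = 1 + beg / 48000; k <= 1 + end / 48000; ++k) list[i++] = k;
    for (k = 4 + beg / 16000; k <= 4 + end / 16000; ++k) list[i++] = k;
    return i;
}
```

With these definitions an alignment covering 65-67 kbp falls in bin 8, one covering 51-70 kbp in bin 2, one covering 40-49 kbp in bin 0, and a query for 65-71 kbp yields bins 0, 2 and 8.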

5.2 The BAI index format for BAM files

Field                  Description                                                    Type      Value
magic                  Magic string                                                   char[4]   BAI\1
n_ref                  # reference sequences                                          uint32_t
List of indices (n = n_ref)
  n_bin                # distinct bins (for the binning index)                        uint32_t
  List of distinct bins (n = n_bin)
    bin                Distinct bin                                                   uint32_t
    n_chunk            # chunks                                                       uint32_t  limited
    List of chunks (n = n_chunk)
      chunk_beg        (Virtual) file offset of the start of the chunk                uint64_t
      chunk_end        (Virtual) file offset of the end of the chunk                  uint64_t
  n_intv               # 16kbp intervals (for the linear index)                       uint32_t
  List of intervals (n = n_intv)
    ioffset            (Virtual) file offset of the first alignment in the interval   uint64_t
n_no_coor (optional)   Number of unplaced unmapped reads (RNAME *)                    uint64_t
The index file may optionally contain additional metadata providing a summary of the number of mapped and placed unmapped read-segments per reference sequence, and of any unplaced unmapped read-segments. This is stored in an optional extra metadata pseudo-bin for each reference sequence, and in the optional trailing n_no_coor field at the end of the file.
The pseudo-bins appear in the references' lists of distinct bins as bin number 37450 (which is beyond the normal range) and are laid out so as to be compatible with real bins and their chunks:
Field        Description                                                           Type      Value
bin          Magic bin number                                                      uint32_t  37450
n_chunk      # chunks                                                              uint32_t  2
ref_beg      (Virtual) file offset of the start of reads placed on this reference  uint64_t
ref_end      (Virtual) file offset of the end of reads placed on this reference    uint64_t
n_mapped     Number of mapped read-segments for this reference                     uint64_t
n_unmapped   Number of unmapped read-segments for this reference                   uint64_t
The ref_beg/ref_end fields locate the first and last reads on this reference sequence, whether they are mapped or placed unmapped. Thus they are equal to the minimum chunk_beg and maximum chunk_end respectively.
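To show how the layout above, including the pseudo-bins and the trailing n_no_coor field, might be walked, here is a minimal C sketch that summarises a BAI index held in memory. It assumes that multi-byte integers are stored little-endian, as elsewhere in BAM. The `bai_summary_t` type and the function names are hypothetical, and the sketch only counts items rather than building a usable index.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BAI_PSEUDO_BIN 37450u

typedef struct {
    uint32_t n_ref;
    uint64_t n_bins;       /* real bins, summed over all references */
    uint64_t n_mapped;     /* summed from pseudo-bins, when present */
    uint64_t n_unmapped;
    int      has_no_coor;  /* 1 if the optional trailing field exists */
    uint64_t n_no_coor;
} bai_summary_t;

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t le64(const uint8_t *p)
{
    return (uint64_t)le32(p) | (uint64_t)le32(p + 4) << 32;
}

/* Returns 0 on success, -1 if the buffer is truncated or malformed. */
static int bai_summarize(const uint8_t *buf, size_t len, bai_summary_t *s)
{
    size_t pos = 8;
    uint32_t r, b, n_bin, n_chunk, n_intv, bin;

    memset(s, 0, sizeof *s);
    if (len < 8 || memcmp(buf, "BAI\1", 4) != 0) return -1;
    s->n_ref = le32(buf + 4);

    for (r = 0; r < s->n_ref; ++r) {
        if (len - pos < 4) return -1;
        n_bin = le32(buf + pos); pos += 4;
        for (b = 0; b < n_bin; ++b) {
            if (len - pos < 8) return -1;
            bin = le32(buf + pos);
            n_chunk = le32(buf + pos + 4);
            pos += 8;
            if ((len - pos) / 16 < n_chunk) return -1;
            if (bin == BAI_PSEUDO_BIN) {
                /* Two 16-byte "chunks": ref_beg/ref_end, n_mapped/n_unmapped. */
                if (n_chunk != 2) return -1;
                s->n_mapped   += le64(buf + pos + 16);
                s->n_unmapped += le64(buf + pos + 24);
            } else {
                ++s->n_bins;
            }
            pos += (size_t)n_chunk * 16;
        }
        if (len - pos < 4) return -1;
        n_intv = le32(buf + pos); pos += 4;
        if ((len - pos) / 8 < n_intv) return -1;
        pos += (size_t)n_intv * 8;
    }
    if (len - pos >= 8) {               /* optional trailing n_no_coor */
        s->has_no_coor = 1;
        s->n_no_coor = le64(buf + pos);
    }
    return 0;
}
```

Because the pseudo-bin is laid out like a real bin with two chunks, a reader that does not know about it can skip it with the same arithmetic used for ordinary bins.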

5.3 C source code for computing bin number and overlapping bins

The following functions compute bin numbers and overlaps for a BAI-style binning scheme with 6 levels and a minimum bin size of 2^14 bp. See the CSI specification for generalisations of these functions designed for binning schemes with arbitrary depth and sizes.
When these functions are called with regions representing unplaced unmapped reads, e.g., reg2bin(-1, 0), they involve operations such as (-1) >> 14 which are undefined or implementation-defined in some programming languages. They must be implemented as if these operations use the common two's-complement semantics: reg2bin(-1, 0) returns 4680, and reg2bins(-1, 0, list) includes bin 4680 in its output.
#include <stdint.h>  /* uint16_t */

/* calculate bin given an alignment covering [beg,end) (zero-based, half-closed-half-open) */
int reg2bin(int beg, int end)
{
    --end;  /* make the end coordinate inclusive */
    /* try the smallest (16 kbp) bins first; ((1<<3*l)-1)/7 is the first bin at level l */
    if (beg>>14 == end>>14) return ((1<<15)-1)/7 + (beg>>14);
    if (beg>>17 == end>>17) return ((1<<12)-1)/7 + (beg>>17);
    if (beg>>20 == end>>20) return ((1<<9)-1)/7 + (beg>>20);
    if (beg>>23 == end>>23) return ((1<<6)-1)/7 + (beg>>23);
    if (beg>>26 == end>>26) return ((1<<3)-1)/7 + (beg>>26);
    return 0;  /* only the 512 Mbp bin contains the whole alignment */
}
/* calculate the list of bins that may overlap with region [beg,end) (zero-based) */
#define MAX_BIN (((1<<18)-1)/7)
int reg2bins(int beg, int end, uint16_t list[MAX_BIN])
{
    int i = 0, k;
    --end;  /* make the end coordinate inclusive */
    list[i++] = 0;  /* bin 0 overlaps every region */
    /* one loop per level; 1, 9, 73, 585 and 4681 are the first bins of levels 1 to 5 */
    for (k = 1 + (beg>>26); k <= 1 + (end>>26); ++k) list[i++] = k;
    for (k = 9 + (beg>>23); k <= 9 + (end>>23); ++k) list[i++] = k;
    for (k = 73 + (beg>>20); k <= 73 + (end>>20); ++k) list[i++] = k;
    for (k = 585 + (beg>>17); k <= 585 + (end>>17); ++k) list[i++] = k;
    for (k = 4681 + (beg>>14); k <= 4681 + (end>>14); ++k) list[i++] = k;
    return i;  /* number of bins written to list */
}
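Right-shifting a negative value is implementation-defined in C, so code that must give the two's-complement result for unplaced unmapped reads on every platform may prefer an explicit helper. The following is one possible sketch: `asr` and `reg2bin_portable` are hypothetical names, and the helper computes floor(x / 2^n) using only shifts of non-negative values.

```c
#include <stdint.h>

/* Arithmetic right shift that is well defined for negative x:
 * returns floor(x / 2^n) for 0 <= n < 31. */
static int32_t asr(int32_t x, int n)
{
    return x >= 0 ? x >> n : -(((-1 - x) >> n) + 1);
}

/* Same result as reg2bin() under two's-complement semantics, for a
 * region [beg, end) that is zero-based and half-open. */
static int reg2bin_portable(int32_t beg, int32_t end)
{
    --end;
    if (asr(beg, 14) == asr(end, 14)) return 4681 + asr(beg, 14);
    if (asr(beg, 17) == asr(end, 17)) return 585 + asr(beg, 17);
    if (asr(beg, 20) == asr(end, 20)) return 73 + asr(beg, 20);
    if (asr(beg, 23) == asr(end, 23)) return 9 + asr(beg, 23);
    if (asr(beg, 26) == asr(end, 26)) return 1 + asr(beg, 26);
    return 0;
}
```

With this helper, the region (-1, 0) used for unplaced unmapped reads maps to bin 4680 regardless of how the compiler treats shifts of negative integers.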

Appendix A Parsing region notation

Parsing region notation such as name[:begin[-end]] (in which omission of the outer bracketed portion indicates a request for the entire reference sequence) would be simple if name could not itself contain ':' characters, but this is not the case. (This notation does not appear in the SAM format itself, but various tools use it as a convenient way for their users to specify regions of interest.)
The set of valid reference sequence names is usually already known when parsing this notation, for example because the associated @SQ headers have already been encountered. Tools can use this set to determine unambiguously which colons could delimit a known-valid reference sequence name.
In pseudocode form, a string str can be parsed as follows:

consider the rightmost ':' character, if any, of str
if str is of the form 'prefix:NUM' or 'prefix:NUM-NUM'
   or generally 'prefix:suffix' for some plausible interval suffix
then
    if both prefix and str are in the known set then ...error: ambiguous representation
    else if prefix is in the known set then return (prefix, NUM...NUM)
    else if str is in the known set then return (str, entire sequence)
    else ...error: unknown reference sequence name
else ...either str does not contain a colon or the suffix is not plausibly numeric
    if str is in the known set then return (str, entire sequence)
    else ...error: unknown reference sequence name or invalid interval syntax

The check leading to "error: ambiguous representation" is important as it prevents confusing interpretations of actually ambiguous input. Typically the set of valid reference sequence names will not contain names that are prefixes of other names in the set, so in practice this error will not usually be encountered in non-malicious data.
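One way the disambiguation procedure of this appendix could be realised in C is sketched below. It is an illustration, not a reference implementation: the function and enumerator names are hypothetical, the known set is a plain array searched linearly, and a "plausible interval suffix" is simplified to NUM or NUM-NUM made of decimal digits only.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum { REGION_OK_INTERVAL, REGION_OK_WHOLE,
       REGION_ERR_AMBIGUOUS, REGION_ERR_UNKNOWN };

/* Is the first len bytes of name exactly one of the known names? */
static int is_known(const char *name, size_t len,
                    const char *const *known, size_t n_known)
{
    size_t i;
    for (i = 0; i < n_known; ++i)
        if (strlen(known[i]) == len && memcmp(known[i], name, len) == 0)
            return 1;
    return 0;
}

/* Simplified plausibility test: NUM or NUM-NUM, decimal digits only. */
static int plausible_interval(const char *s)
{
    if (!isdigit((unsigned char)*s)) return 0;
    while (isdigit((unsigned char)*s)) ++s;
    if (*s == '\0') return 1;
    if (*s != '-') return 0;
    ++s;
    if (!isdigit((unsigned char)*s)) return 0;
    while (isdigit((unsigned char)*s)) ++s;
    return *s == '\0';
}

/* On REGION_OK_INTERVAL, *name_len is the length of the name prefix and
 * *interval points at the text after the colon. On REGION_OK_WHOLE the
 * whole of str is the name and *interval is NULL. */
static int parse_region(const char *str, const char *const *known,
                        size_t n_known, size_t *name_len,
                        const char **interval)
{
    const char *colon = strrchr(str, ':');
    size_t full = strlen(str);
    int str_known = is_known(str, full, known, n_known);

    *name_len = full;
    *interval = NULL;
    if (colon != NULL && plausible_interval(colon + 1)) {
        size_t plen = (size_t)(colon - str);
        int prefix_known = is_known(str, plen, known, n_known);
        if (prefix_known && str_known) return REGION_ERR_AMBIGUOUS;
        if (prefix_known) {
            *name_len = plen;
            *interval = colon + 1;
            return REGION_OK_INTERVAL;
        }
    }
    return str_known ? REGION_OK_WHOLE : REGION_ERR_UNKNOWN;
}
```

Given a known set containing "chr1" and "HLA-A*01:01", the string "chr1:200-300" is split at the colon, whereas "HLA-A*01:01" is returned whole because its prefix "HLA-A*01" is not a known name.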
Either in addition to this algorithm or as an alternative to it, tools can use additional delimiter characters to make an unambiguously parsable notation. We recommend a convention using curly brackets around the reference sequence name, that is {name}[:begin[-end]], as being memorable, easily typed, unambiguous, and not expanded by most shells.
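Splitting the curly-bracket form needs no knowledge of the known set. The sketch below uses a hypothetical function name and relies on the assumption that reference sequence names cannot themselves contain '}' (see the reference sequence name rules in Section 1.2.1), so the first closing bracket ends the name.

```c
#include <stddef.h>
#include <string.h>

/* Split "{name}" or "{name}:interval".
 * On success returns 0, sets *name and *name_len to the bracketed name
 * (not NUL-terminated), and sets *interval to the text after the colon,
 * or to NULL when the whole sequence is requested.
 * Returns -1 if str is not in the curly-bracket form. */
static int split_braced_region(const char *str, const char **name,
                               size_t *name_len, const char **interval)
{
    const char *close;
    if (str[0] != '{') return -1;
    close = strchr(str, '}');
    if (close == NULL || close == str + 1) return -1;  /* unterminated or empty */
    if (close[1] != '\0' && close[1] != ':') return -1;
    *name = str + 1;
    *name_len = (size_t)(close - str - 1);
    *interval = close[1] == ':' ? close + 2 : NULL;
    return 0;
}
```

A tool could try this function first and fall back to the known-set algorithm when it returns -1, so that both notations are accepted.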

Appendix B SAM Version History

This lists the date of each tagged SAM version along with changes that have been made while that version was current. The key changes that caused the version number to change are shown in bold.
Additions and changes to the standard predefined tags are listed in the separate Sequence Alignment/Map Optional Fields Specification.

1.6: 28 November 2017 to current

  • Add SINGULAR to the list of @RG PL header tag values. (May 2023)
  • Clarify that @RG PI values are integers. (May 2023)
  • Add ELEMENT and ULTIMA to the list of @RG PL header tag values. (Aug 2022)
  • Clarify that header field tags must be distinct within each line, and that the ordering of both header fields and alignment optional fields is not significant. (Jun 2021)
  • Clarify the meaning of TLEN when secondary alignments are present. (May 2021)
  • Bin calculation changed for alignment records whose CIGAR strings consume no reference bases: like unmapped records, they are considered to have length one (rather than zero). (Jan 2021)
  • Correct the description of index pseudo-bins, which previously stated that ref_beg/ref_end, then named unmapped_beg/unmapped_end, include only placed unmapped reads. (Jul 2020)
  • Add DNBSEQ to the list of @RG PL header tag values. (Apr 2020)
  • Add @SQ TP circular/linear topology header tag. (May 2019)
  • Restricted the allowable punctuation characters in reference sequence names (in @SQ SN, RNAME, etc). The sets of characters allowed in @SQ SN and @SQ AN are now identical, which enlarges the previous AN set. (Jan 2019)

    We recommend that implementations validating reference sequence names do so using the rules in Section 1.2.1; are more lenient for files declaring @HD VN ≤ 1.5; and validate AN only against these rules, not the previous more restrictive AN rules.
  • Add @HD SS sorting details header tag. (Oct 2018)
  • B array optional fields may have no entries: this was already representable in BAM, and it is now clarified that empty arrays are permitted in SAM too. (Jul 2018)
  • Add @SQ DS header tag. (Jul 2018)
  • Add @RG BC header tag. (Apr 2018)
  • Permit UTF-8 in a few header tags. (Mar 2018)
  • Add support for CIGAR strings with more than 65,535 operations. (Nov 2017)

1.5: 23 May 2013 to November 2017

  • Add @SQ AN header tag, allowing only alphanumeric and '*+. @_ - -' characters in its names. (Jul 2017)
  • Add @SQ AH header tag. (Mar 2017)
  • Auxiliary tags migrated to SAMtags document. (Sep 2016)
  • Z and H auxiliary tags are permitted to be zero length. (Jun 2016)
  • QNAME limited to 254 bytes (was 255). (Aug 2015)
  • Generalise 0x200 flag bit as filtered-out bit. (Aug 2015)
  • Add @HD GO for group order. (Mar 2015)
  • Add ONT to the @RG PL and @RG PM header tags. (Mar 2015)
  • Add meaning to reverse FLAG on unmapped reads. (Mar 2015)
  • Document the idxstats .bai elements. (Nov 2014)
  • Addition of CSI index. (Sep 2014)
  • Add @PG DS header field. (Dec 2013)
  • Document the BAM EOF byte values. (Dec 2013)
  • Glossary of alignment types. (May 2013)
  • Note that PNEXT/RNEXT points to next read, not segment. (May 2013)
  • Add SUPPLEMENTARY flag bit. (May 2013)

1.4: 21 April 2011 to May 2013

  • Add guide to using sequence annotations (CT/PT tags). (Mar 2012)
  • Increase max reference length from 2^29 to 2^31. (Sep 2011)
  • Clarify @SQ M5 header tag generation. (Sep 2011)
  • Describe padded alignments. (Sep 2011)
  • Add @RG FO, KS header fields. (Apr 2011)
  • Clarify chaining of PG records. (Apr 2011)
  • Add B array auxiliary tag type. (Apr 2011)
  • Permit IUPAC in SEQ and MD auxiliary tag. (Apr 2011)
  • Permit QNAME "*". (Apr 2011)

1.3: July 2010 to April 2011

  • Add RG PG header field. (Nov 2010)
  • Add BAM description and index sections. (Nov 2010)
  • Add '=' and 'X' CIGAR operations. (July 2010)
  • Removal of FLAG letters. (July 2010)
  • The SM header field, previously mandatory for @RG, is now optional. (July 2010)

1.0: 2009 to July 2010

Initial edition.

  1. Hence in particular SAM files must not begin with a byte order mark (BOM) and lines of text are delimited by ASCII line terminator characters only. In addition to the local platform's text file line termination conventions, implementations may wish to support LF and CR LF for interoperability with other platforms.
  2. The values in the FLAG column correspond to bitwise flags as follows: 99 = 0x63: first/next is reverse-complemented/properly aligned/multiple segments; 0: no flags set, thus a mapped single segment; 2064 = 0x810: supplementary/reverse-complemented; 147 = 0x93: last (second of a pair)/reverse-complemented/properly aligned/multiple segments.
  3. Chimeric alignments are primarily caused by structural variations, gene fusions, misassemblies, RNA-seq or experimental protocols. They are more frequent given longer reads. For a chimeric alignment, the linear alignments constituting the alignment are largely non-overlapping; each linear alignment may have high mapping quality and is informative in SNP/INDEL calling. In contrast, multiple mappings are caused primarily by repeats. They are less frequent given longer reads. If a read has multiple mappings, all these mappings are almost entirely overlapping with each other; except the single-best optimal mapping, all the other mappings get mapping quality <Q3 and are ignored by most SNP/INDEL callers.

    Characters that are not disallowed include '|', which historically appeared in reference names derived from NCBI FASTA files, and ':', which appears in HLA allele names. Appendix A describes approaches for parsing name[:begin-end] region notation unambiguously even though name may itself contain colons.

    Best practice is to use lowercase tags while designing and experimenting with new data field tags or for fields of local interest only. For new tags that are of general interest, raise an hts-specs issue or email samtools-devel@lists.sourceforge.net to have an uppercase equivalent added to the specification. This way collisions of the same uppercase tag being used with different meanings can be avoided.
  4. It is known that widely used software libraries have differing definitions of the queryname sort order, meaning care should be taken when operating on multiple files of varying provenance. Tools may wish to use the sub-sort field to explicitly distinguish between natural and lexicographical ordering. See Section 1.3.1.

    The repetition of sort-order enables a limited form of validation. For example, @HD SO:queryname SS:coordinate:TLEN would indicate that the data has been re-sorted (by query name) by a non-SS-aware tool and the SS field should be ignored.

    See https://www.ncbi.nlm.nih.gov/grc/help/definitions for descriptions of alternate locus and primary assembly.

    For example, given '@SQ SN:MT AN:chrMT,M,chrM LN:16569 TP:circular', tools can ensure that a user's request for any of 'MT', 'chrMT', 'M', or 'chrM' succeeds and refers to the same sequence.

    The previous footnote's example identifies MT as a circular chromosome. The TP field is often omitted, which implies linear.
  5. Reference sequence names may contain any printable ASCII characters with the exception of certain punctuation characters, and may not start with '*' or '='. See Section 1.2.1 for details and an explanation of the [:rname:] notation.
  6. The manipulation of bitwise flags is described at Wikipedia (see "Bit field") and elsewhere.

    For example, in Illumina paired-end sequencing, first (0x40) corresponds to the R1 'forward' read and last (0x80) to the R2 'reverse' read. (Despite the terminology, this is unrelated to the segments' orientations when they are mapped: either, neither, or both may have their reverse flag bits (0x10) set after mapping.)
  7. Thus a segment aligning in the forward direction at base 100 for length 50 and a segment aligning in the reverse direction at base 200 for length 50 indicate the template covers bases 100 to 249 and has length 150.
  8. The earliest versions of this specification used 5' to 5' (in original orientation, TLEN#1; dashed parts of the reads indicate soft-clipped bases) while later ones used leftmost to rightmost mapped base (TLEN#2). Note: these two definitions agree in most alignments, but differ in the case of overlaps where the first segment aligns beyond the start of the last segment.
    [Figure: TLEN#1 and TLEN#2 illustrated for an unambiguous scenario and an ambiguous scenario.]
    The number of digits in an integer optional field is not explicitly limited in SAM. However, BAM can represent values in the range [-2^31, 2^32), so in practice this is the realistic range of values for SAM's 'i' as well.

    For example, the six-character Hex string '1AE301' represents the byte array [0x1a, 0xe3, 0x1].

    Explicit typing eases format parsing and helps to reduce the file size when SAM is converted to BAM.

    See SAMtags.pdf at https://github.com/samtools/hts-specs.
  9. The impact of this representation on indexing and random access is yet to be explored by implementations.
  10. Peter J. A. Cock, James K. Bonfield, Bastien Chevreux, and Heng Li, SAM/BAM format v1.5 extensions for de novo assemblies, bioRxiv 020024; doi:10.1101/020024.
  11. Writing pads/gaps as '*'s in the SEQ field might have been more convenient, but this caused concerns for backward compatibility.

    See Annotation and Padding in SAMtags.pdf.
  12. L. Peter Deutsch, GZIP file format specification version 4.3, RFC 1952.

    It is worth noting that there is a known bug in the Java GZIPInputStream class: concatenated gzip archives cannot be successfully decompressed by this class. BGZF files can be created and manipulated using the built-in Java util.zip package, but naive use of GZIPInputStream on a BGZF file will not work due to this bug.
  13. Empty in the sense of having been formed by compressing a data block of length zero.

    An implementation that supports reopening a BAM file in append mode could produce a file by writing headers and alignment records to it, closing it (adding an EOF marker); then reopening it for append, writing more alignment records, and closing it (adding an EOF marker). The resulting BAM file would contain an embedded insignificant EOF marker block that should be effectively ignored when it is read.

    It is useful to produce a diagnostic at the beginning of reading a file, so that interactive users can abort lengthy analysis of potentially-corrupted files. Of course, this is only possible if the stream in question supports random access.
  14. As noted in Section 1.4, reserved FLAG bits should be written as zero and ignored on reading by current software.

    For backward compatibility, an absent QNAME (represented as '*' in SAM) is stored as a C string "*\0".
  15. The signedness and size used for each integer value is an implementation choice, but is typically the smallest that suffices.

    The BAM representation of 'H' field values as textual hexadecimal digits rather than binary data is for historical reasons. Modern applications may prefer to use 'B,C' array fields rather than 'H' fields.
  16. See CSIv1.pdf at https://github.com/samtools/hts-specs. This is a separate specification because CSI is also used to index other coordinate-sorted file formats in addition to BAM.

    W. James Kent et al., The Human Genome Browser at UCSC, Genome Res. 2002 12: 996-1006; doi:10.1101/gr.229102; PMID:12045153. See in particular The Database, p1003.
  17. The number of chunks in a single bin is effectively limited by available memory and in any case is typically a maximum of some thousands.

    By placed unmapped read we mean a read that is unmapped according to its FLAG but whose RNAME and POS fields are filled in, thus "placing" it on a reference sequence (see Section 2). In contrast, unplaced unmapped reads have '*' and 0 for RNAME and POS.
  18. See Appendix A of SAMtags.pdf at https://github.com/samtools/hts-specs.