| Value | Type | Name |
| :--- | :--- | :--- |
| 0x7 | itf8 | codec id |
| 0x2 | itf8 | number of bytes to follow |
| 0x0 | itf8 | offset |
| 0x1 | itf8 | K parameter |
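
The four bytes in the table above can be read back with a small ITF8 decoder. What follows is a minimal sketch, assuming the ITF8 layout defined elsewhere in this specification (the number of leading 1 bits in the first byte gives the number of additional bytes). The names `read_itf8` and `parse_int_param_encoding` are illustrative and not part of the format, and the second helper only suits codecs whose parameters are all ITF8 integers (for example BETA, SUBEXP and GAMMA).

```python
def read_itf8(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one ITF8 integer starting at pos; return (value, next_pos)."""
    b0 = buf[pos]
    if b0 < 0x80:  # 0xxxxxxx: single byte
        return b0, pos + 1
    if b0 < 0xC0:  # 10xxxxxx: one extra byte
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if b0 < 0xE0:  # 110xxxxx: two extra bytes
        value = ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2]
        return value, pos + 3
    if b0 < 0xF0:  # 1110xxxx: three extra bytes
        value = ((b0 & 0x0F) << 24) | (buf[pos + 1] << 16)
        value |= (buf[pos + 2] << 8) | buf[pos + 3]
        return value, pos + 4
    # 1111xxxx: four extra bytes, only the low 4 bits of the last byte are used
    value = ((b0 & 0x0F) << 28) | (buf[pos + 1] << 20) | (buf[pos + 2] << 12)
    value |= (buf[pos + 3] << 4) | (buf[pos + 4] & 0x0F)
    if value >= 1 << 31:  # reinterpret as a signed 32-bit integer
        value -= 1 << 32
    return value, pos + 5


def parse_int_param_encoding(buf: bytes) -> tuple[int, list[int]]:
    """Parse codec id, parameter byte count, then ITF8 integer parameters."""
    codec_id, pos = read_itf8(buf, 0)
    n_bytes, pos = read_itf8(buf, pos)
    end = pos + n_bytes
    params = []
    while pos < end:
        value, pos = read_itf8(buf, pos)
        params.append(value)
    return codec_id, params


# The example from the table: codec id 7 (SUBEXP), offset 0, K parameter 1.
codec_id, params = parse_int_param_encoding(bytes([0x07, 0x02, 0x00, 0x01]))
```
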

| size in bytes | N | key 1 | value 1 | key... | value ... | key N | value N |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

| Codec | ID | Parameters | Comment |
| :--- | :--- | :--- | :--- |
| NULL | 0 | none | series not preserved |
| EXTERNAL | 1 | int block content id | the block content identifier used to associate external data blocks with data series |
| Deprecated (GOLOMB) | 2 | int offset, int M | Golomb coding |
| HUFFMAN | 3 | array<int>, array<int> | coding with int/byte values |
| BYTE_ARRAY_LEN | 4 | encoding<int> array length, encoding<byte> bytes | coding of byte arrays with array length |
| BYTE_ARRAY_STOP | 5 | byte stop, int external block content id | coding of byte arrays with a stop value |
| BETA | 6 | int offset, int number of bits | binary coding |
| SUBEXP | 7 | int offset, int K | subexponential coding |
| Deprecated (GOLOMB_RICE) | 8 | int offset, int $\log_2 m$ | Golomb-Rice coding |
| GAMMA | 9 | int offset | Elias gamma coding |
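A detailed description of all the above coding algorithms and their parameters can be found in section 13.
4 Checksums
Checksums are used to ensure data integrity. The following checksum algorithms are used in CRAM.
4.1 CRC32
This is a cyclic redundancy checksum 32 bits long with the polynomial 0x04C11DB7. Please refer to ITU-T V.42 for more details. The value of the CRC32 hash function is written as an integer.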
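
As a hedged illustration, the checksum can be computed with Python's standard `zlib.crc32`, which implements the same polynomial. The sketch assumes the resulting integer is serialised as a 32-bit little-endian value, as for the `int` data type defined elsewhere in this specification; the helper name `crc32_field` is hypothetical.

```python
import struct
import zlib


def crc32_field(preceding: bytes) -> bytes:
    """Return the 4 bytes of a CRC32 field covering the given preceding bytes."""
    checksum = zlib.crc32(preceding) & 0xFFFFFFFF  # force an unsigned 32-bit value
    # Assumption: the integer is written in little-endian byte order.
    return struct.pack("<I", checksum)


# Standard CRC32 check input; the well-known result is 0xCBF43926.
field = crc32_field(b"123456789")
```
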

| Data type | Name | Value |
| :--- | :--- | :--- |
| byte[4] | format magic number | CRAM (0x43 0x52 0x41 0x4d) |
| unsigned byte | major format number | 3 (0x3) |
| unsigned byte | minor format number | 1 (0x1) |
| byte[20] | file id | CRAM file identifier (e.g. file name or SHA1 checksum) |
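
The 26-byte layout in the table above can be parsed as follows. This is a minimal sketch: `FileDefinition` and `parse_file_definition` are illustrative names, and padding a short file id with NUL bytes in the usage example is an assumption rather than something the table states.

```python
import struct
from typing import NamedTuple


class FileDefinition(NamedTuple):
    major: int
    minor: int
    file_id: bytes


def parse_file_definition(data: bytes) -> FileDefinition:
    """Parse magic number, major/minor version and the 20-byte file id."""
    if len(data) < 26:
        raise ValueError("file definition needs at least 26 bytes")
    magic, major, minor, file_id = struct.unpack_from("<4sBB20s", data, 0)
    if magic != b"CRAM":
        raise ValueError("bad magic number: %r" % magic)
    return FileDefinition(major, minor, file_id)


# Usage: a version 3.1 definition whose file id is a NUL-padded name (assumed).
example = b"CRAM" + bytes([3, 1]) + b"example.cram".ljust(20, b"\0")
definition = parse_file_definition(example)
```
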
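Valid CRAM major.minor version numbers are as follows:
1.0 The original public CRAM release.
2.0 The first CRAM release implemented in both Java and C; tidied up the differences between implementation and specification in 1.0.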

| Data type | Name | Value |
| :--- | :--- | :--- |
| int32 | length | the sum of the lengths of all blocks in this container (headers and data) and any padding bytes (CRAM header container only); equal to the total byte length of the container minus the byte length of this header structure |
| itf8 | reference sequence id | reference sequence identifier or <br> -1 for unmapped reads <br> -2 for multiple reference sequences. <br> All slices in this container must have a reference sequence id matching this value. |
| itf8 | starting position on the reference | the alignment start position |
| itf8 | alignment span | the length of the alignment |
| itf8 | number of records | number of records in the container |
| ltf8 | record counter | 1-based sequential index of records in the file/stream. |
| ltf8 | bases | number of read bases |
| itf8 | number of blocks | the total number of blocks in this container |
| array<itf8> | landmarks | the locations of slices in this container as byte offsets from the end of this container header, used for random access indexing. For sequence data containers, the landmark count must equal the slice count. Since the block before the first slice is the compression header, landmarks[0] is equal to the byte length of the compression header. |
| int | crc32 | CRC32 hash of all the preceding bytes in the container. |
| byte[ ] | blocks | The blocks contained within the container. |
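
The relationship between the `length` and `landmarks` fields and absolute file positions can be sketched as below. The function names and the `header_end` parameter (the file offset of the first byte after the container header) are assumptions introduced for illustration only.

```python
def slice_offsets(header_end: int, landmarks: list[int]) -> list[int]:
    """Convert landmarks (relative to the end of the header) to file offsets."""
    return [header_end + landmark for landmark in landmarks]


def container_end(header_end: int, length: int) -> int:
    """File offset just past the container: 'length' excludes the header."""
    return header_end + length


# Usage: a header ending at byte 100, two slices and 9000 bytes of blocks.
offsets = slice_offsets(100, [50, 4050])
# landmarks[0] is also the byte length of the compression header block: 50.
end = container_end(100, 9000)
```
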

| Data type | Name | Value |
| :--- | :--- | :--- |
| byte | method | the block compression method (and first CRAM version): <br> 0: raw (none)* <br> 1: gzip <br> 2: bzip2 (v2.0) <br> 3: lzma (v3.0) <br> 4: rans4x8 (v3.0) <br> 5: rans4x16 (v3.1) <br> 6: adaptive arithmetic coder (v3.1) <br> 7: fqzcomp (v3.1) <br> 8: name tokeniser (v3.1) |
| byte | block content type id | the block content type identifier |
| itf8 | block content id | the block content identifier used to associate external data blocks with data series |
| itf8 | size in bytes* | size of the block data after applying block compression |
| itf8 | raw size in bytes* | size of the block data before applying block compression |
| byte[ ] | block data | the data stored in the block: <br> $\bullet$ bit stream of CRAM records (core data block) <br> $\bullet$ byte stream (external data block) <br> $\bullet$ additional fields (header blocks) |
| byte[4] | CRC32 | CRC32 hash value for all preceding bytes in the block |
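The SAM header block is formatted as a 32-bit little-endian integer holding the length of the SAM header text, excluding nul-termination bytes, followed by the text itself. Although the length field is 32-bit, the maximum permitted value is $2^{31}$, and all lengths must be positive.
The following constraints apply to the SAM header text:
The SQ:MD5 checksum is required unless the reference sequence has been embedded into the file.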
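
The SAM header block layout described above (a 32-bit little-endian length followed by the text) can be read with a few lines of Python. This is only a sketch: the function name is hypothetical, the input is assumed to be the already decompressed block data, and decoding the text as UTF-8 is an assumption.

```python
import struct


def parse_sam_header_block(data: bytes) -> str:
    """Extract the SAM header text from decompressed SAM header block data."""
    if len(data) < 4:
        raise ValueError("block too short to hold the length field")
    # Read the length as a signed 32-bit little-endian integer.
    (length,) = struct.unpack_from("<i", data, 0)
    if length < 0:
        raise ValueError("negative SAM header length")
    if 4 + length > len(data):
        raise ValueError("SAM header text is truncated")
    # Any bytes after the text (nul terminators or padding) are ignored.
    return data[4 : 4 + length].decode("utf-8")


# Usage: a one-line header followed by two nul bytes.
text = "@HD\tVN:1.6\n"
block = struct.pack("<i", len(text)) + text.encode("utf-8") + b"\0\0"
parsed = parse_sam_header_block(block)
```
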
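8.4 Compression header block
The compression header block consists of 3 parts: the preservation map, the data series encoding map and the tag encoding map.
Preservation map
The preservation map contains information about which data was preserved in the CRAM file. It is stored as a map with byte[2] keys: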
| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| RN | bool | read names included | true if read names are preserved for all reads |
| AP | bool | AP data series delta | true if AP data series is delta, false otherwise |
| RR | bool | reference required | true if reference sequence is required to restore the data completely |
| SM | byte[5] | substitution matrix | substitution matrix |
| TD | array<byte> | tag ids dictionary | a list of lists of tag ids, see tag encoding section |

| Key | Value data type | Name | Value |
| :---: | :---: | :---: | :---: |
| BF | encoding<int> | BAM bit flags | see separate section |
| CF | encoding<int> | CRAM bit flags | see specific section |
| RI | encoding<int> | reference id | record reference id from the SAM file header |
| RL | encoding<int> | read lengths | read lengths |
| AP | encoding<int> | in-seq positions | if AP-Delta = true: 0-based alignment start delta from the AP value in the previous record. Note this delta may be negative, for example when switching references in a multi-reference slice. When the record is the first in the slice, the previous position used is the slice alignment-start field (hence the first delta should be zero for single-reference slices, or the AP value itself for multi-reference slices). <br> if AP-Delta = false: encodes the alignment start position directly |
| RG | encoding<int> | read groups | read groups. Special value '-1' stands for no group. |
| RN$^{a}$ | encoding<byte[ ]> | read names | read names |
| MF | encoding<int> | next mate bit flags | see specific section |
| NS | encoding<int> | next fragment reference sequence id | reference sequence ids for the next fragment |
| NP | encoding<int> | next mate alignment start | alignment positions for the next fragment |
| TS | encoding<int> | template size | template sizes |
| NF | encoding<int> | distance to next fragment | number of records to skip to the next fragment $^{b}$ |
| TL$^{c}$ | encoding<int> | tag ids | list of tag ids, see tag encoding section |
| FN | encoding<int> | number of read features | number of read features in each record |
| FC | encoding<byte> | read features codes | see separate section |
| FP | encoding<int> | in-read positions | positions of the read features; a positive delta to the last position (starting with zero) |
| DL | encoding<int> | deletion lengths | base-pair deletion lengths |
| BB | encoding<byte[ ]> | stretches of bases | bases |
| QQ | encoding<byte[ ]> | stretches of quality scores | quality scores |
| BS | encoding<byte> | base substitution codes | base substitution codes |
| IN | encoding<byte[ ]> | insertion | inserted bases |
| RS | encoding<int> | reference skip length | number of skipped bases for the 'N' read feature |
| PD | encoding<int> | padding | number of padded bases |
| HC | encoding<int> | hard clip | number of hard clipped bases |
| SC | encoding<byte[ ]> | soft clip | soft clipped bases |
| MQ | encoding<int> | mapping qualities | mapping quality scores |
| BA | encoding<byte> | bases | bases |
| QS | encoding<byte> | quality scores | quality scores |
| TC$^{d}$ | N/A | legacy field | to be ignored |
| TN$^{d}$ | N/A | legacy field | to be ignored |

| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| TAG ID 1:TAG TYPE 1 | encoding<byte[ ]> | read tag 1 | tag values (names and types are available in the data series code) |
| ... | | ... | ... |
| TAG ID N:TAG TYPE N | encoding<byte[ ]> | read tag N | ... |

| Data type | Name | Value |
| :--- | :--- | :--- |
| itf8 | reference sequence id | reference sequence identifier or <br> -1 for unmapped reads <br> -2 for multiple reference sequences. <br> This value must match that of its enclosing container. |
| itf8 | alignment start | the alignment start position |
| itf8 | alignment span | the length of the alignment |
| itf8 | number of records | the number of records in the slice |
| ltf8 | record counter | 1-based sequential index of records in the file/stream |
| itf8 | number of blocks | the number of blocks in the slice |
| itf8[ ] | block content ids | block content ids of the blocks in the slice |
| itf8 | embedded reference bases block content id | block content id for the embedded reference sequence bases or -1 for none |
| byte[16] | reference md5 | MD5 checksum of the reference bases within the slice boundaries. If this slice has reference sequence id of -1 (unmapped) or -2 (multi-ref) the MD5 should be 16 bytes of \0. For embedded references, the MD5 can either be all-zeros or the MD5 of the embedded sequence. |
| byte[ ] | optional tags | a series of tag,type,value tuples encoded as per BAM auxiliary fields. |
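The alignment start and alignment span values should only be used during decoding if the slice has mapped data aligned to a single reference (reference sequence id $\geq 0$). For multi-reference slices or slices with unmapped data, it is recommended to fill these fields with the value 0.
The MD5sum should not be validated if the stored checksum is all-zero. Embedded references should follow the same capitalisation and alphabet rules applied to external references prior to MD5sum calculation. If an embedded reference is used, it is not a requirement that it exactly matches the reference used for sequence alignment. For example, it may contain "N" bases where there is no coverage, or it may have different base calls for SNP variants. Hence when embedded sequences are used, the MD5sum refers to the checksum of the embedded sequence and should not be validated against any external reference file.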
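
The validation rule above can be sketched as follows. The function name is hypothetical, and the normalisation step is simplified to upper-casing only, which is an assumption standing in for the full capitalisation and alphabet rules applied to external references.

```python
import hashlib


def reference_md5_ok(stored_md5: bytes, ref_bases: bytes) -> bool:
    """Check the slice header MD5 against the reference bases for the slice."""
    if stored_md5 == bytes(16):
        # An all-zero checksum means "do not validate".
        return True
    # Assumption: normalisation is reduced to upper-casing in this sketch.
    return hashlib.md5(ref_bases.upper()).digest() == stored_md5


# Usage: the same bases in lower case still match after normalisation.
stored = hashlib.md5(b"ACGTN").digest()
matches = reference_md5_ok(stored, b"acgtn")
```

For a slice with an embedded reference, `ref_bases` would be the embedded sequence rather than bases taken from an external reference file.
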

| Data type | Name | Value |
| :--- | :--- | :--- |
| bit[ ] | CRAM record 1 | The first CRAM record |
| ... | ... | ... |
| bit[ ] | CRAM record N | The Nth CRAM record |

| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | BF | BAM bit flags | see BAM bit flags below |
| int | CF | CRAM bit flags | see CRAM bit flags below |
| - | - | Positional data | See section 10.2 |
| - | - | Read names | See section 10.3 |
| - | - | Mate records | See section 10.4 |
| - | - | Auxiliary tags | See section 10.5 |
| - | - | Sequences | See sections 10.6 and 10.7 |
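BAM bit flags (BF data series)
The following flags are duplicated from the SAM and BAM specifications, with identical meaning. Note however that some of these flags can be derived during decoding, so they may be omitted from the CRAM file and the bits computed from both reads of a paired-end library residing within the same slice.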
| Bit flag | Comment | Description |
| :---: | :---: | :---: |
| 0x1 | | template having multiple segments in sequencing |
| 0x2 | | each segment properly aligned according to the aligner |
| 0x4 | | segment unmapped $^{a}$ |
| 0x8 | calculated $^{b}$ or stored in the mate's info | next segment in template unmapped |
| 0x10 | | SEQ being reverse complemented |
| 0x20 | calculated $^{b}$ or stored in the mate's info | SEQ of the next segment in the template being reverse complemented |
| 0x40 | | the first segment in the template $^{c}$ |
| 0x80 | | the last segment in the template $^{c}$ |
| 0x100 | | secondary alignment |
| 0x200 | | not passing quality controls |
| 0x400 | | PCR or optical duplicate |
| 0x800 | | supplementary alignment |
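$^{a}$ Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions may be made about bits 0x2, 0x100 and 0x800.
$^{b}$ For segments within the same slice.
$^{c}$ Bits 0x40 and 0x80 reflect the read ordering within each template inherent in the sequencing technology used, which may be independent of the actual mapping orientation. If 0x40 and 0x80 are both set, the read is part of a linear template (one where the template sequence is expected to be in a linear order), but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template (such as one constructed by stitching together other templates) or when this information is lost during data processing.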
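
The BF bits and the interpretation of 0x40 and 0x80 given in note c can be expressed with a small flag type. This is an illustrative sketch: the member names of `BamFlag` and the helper `template_index` are chosen here for readability and are not defined by the specification.

```python
from enum import IntFlag


class BamFlag(IntFlag):
    PAIRED = 0x1
    PROPER_PAIR = 0x2
    UNMAPPED = 0x4
    MATE_UNMAPPED = 0x8
    REVERSE = 0x10
    MATE_REVERSE = 0x20
    FIRST = 0x40
    LAST = 0x80
    SECONDARY = 0x100
    QC_FAIL = 0x200
    DUPLICATE = 0x400
    SUPPLEMENTARY = 0x800


def template_index(flags: int) -> str:
    """Classify a read's place in its template from bits 0x40 and 0x80."""
    first = bool(flags & BamFlag.FIRST)
    last = bool(flags & BamFlag.LAST)
    if first and last:
        return "middle"  # part of a linear template, neither first nor last
    if first:
        return "first"
    if last:
        return "last"
    return "unknown"  # index within the template is not known


# Usage: a paired read that is the first segment of its template.
position = template_index(BamFlag.PAIRED | BamFlag.FIRST)
```
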

| Bit flag | Name | Description |
| :--- | :--- | :--- |
| 0x1 | quality scores stored as array | quality scores can be stored as read features or as an array similar to read bases. |
| 0x2 | detached | mate information is stored verbatim (e.g. because the pair spans multiple slices or the fields differ to the CRAM computed method) |
| 0x4 | has mate downstream | tells if the next segment should be expected further in the stream |
| 0x8 | decode sequence as "*" | informs the decoder that the sequence is unknown and that any encoded reference differences are present only to recreate the CIGAR string. |
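
A decoder typically tests these bits individually. The snippet below is a small sketch that maps a CF value to the flag names from the table above; the dictionary `CF_NAMES` and the function `cram_flag_names` are hypothetical helpers.

```python
# Bit values and names taken from the table above.
CF_NAMES = {
    0x1: "quality scores stored as array",
    0x2: "detached",
    0x4: "has mate downstream",
    0x8: "decode sequence as '*'",
}


def cram_flag_names(cf: int) -> list[str]:
    """List the names of the CRAM bit flags set in cf, lowest bit first."""
    return [name for bit, name in CF_NAMES.items() if cf & bit]


# Usage: qualities stored as an array, mate information stored verbatim.
names = cram_flag_names(0x3)
```
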
procedure DecodeRecord
    BAM_flags ← ReadItem(BF, Integer)
    CRAM_flags ← ReadItem(CF, Integer)
    DecodePositions                        ▷ See section 10.2
    DecodeNames                            ▷ See section 10.3
    DecodeMateData                         ▷ See section 10.4
    DecodeTagData                          ▷ See section 10.5
    if (BAM_flags AND 4) = 0 then          ▷ Unmapped flag
        DecodeMappedRead                   ▷ See section 10.6
    else
        DecodeUnmappedRead                 ▷ See section 10.7
    end if
end procedure

| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | RI | ref id | reference sequence id (only present in <br> multiref slices) |
| int | RL | read length | the length of the read |
| int | AP | alignment start | the alignment start position |
| int | RG | read group | the read group identifier expressed as <br> the Nh record in the header, starting <br> from 0 with -1 for no group |
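When the container's AP_delta setting is non-zero, each AP item is a difference from the previous record's position, with the first record in a slice taken relative to the slice header's alignment start; otherwise AP holds the position itself. A minimal Python sketch of that rule follows. The function name and argument layout are assumptions made for illustration, and the AP items are assumed to have been read from the data series already.

```python
from typing import Iterable, List


def decode_alignment_positions(ap_values: Iterable[int],
                               ap_delta: bool,
                               slice_alignment_start: int) -> List[int]:
    """Turn raw AP items for one slice into alignment positions."""
    positions = []
    # For delta coding the running position starts at the slice's alignment start.
    last_position = slice_alignment_start
    for ap in ap_values:
        position = ap + last_position if ap_delta else ap
        positions.append(position)
        last_position = position
    return positions
```

For example, AP items 0, 5, 10 in a delta-coded slice starting at 100 give positions 100, 105 and 115.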
procedure DECODEPOSITIONS
    if slice_header.reference_sequence_id = -2 then
        reference_id ← READITEM(RI, Integer)
    else
        reference_id ← slice_header.reference_sequence_id
    end if
    read_length ← READITEM(RL, Integer)
    if container_pmap.AP_delta ≠ 0 then
        if first_record_in_slice then
            last_position ← slice_header.alignment_start
        end if
        alignment_position ← READITEM(AP, Integer) + last_position
        last_position ← alignment_position
    else
        alignment_position ← READITEM(AP, Integer)
    end if
    read_group ← READITEM(RG, Integer)
end procedure
| Data series <br> type | Data series <br> name | Field | Description |
| :--- | :--- | :--- | :--- |
| byte[] | RN | read names | read names |
procedure DECODENAMES
    if container_pmap.read_names_included = 1 then
        read_name ← READITEM(RN, Byte[])
    else
        read_name ← GENERATENAME
    end if
end procedure
| Data series <br> type | Data series name | Description |
| :--- | :--- | :--- |
| int | NF | the number of records to skip to the next fragment |
In the above case, the NS (mate reference name), NP (mate position) and TS (template size) fields of both records should be derived once the mate has also been decoded. The mate reference name and position are obvious and are simply copied from the mate. The template size is computed using the method described in the SAM specification: the inclusive distance from the leftmost to the rightmost mapped base, with a positive sign for the leftmost record and a negative sign for the rightmost record.
| Data series <br> type | Data series name | Description |
| :--- | :--- | :--- |
| int | MF | next mate bit flags, see table below |
| byte[] | RN | the read name (if and only if not known already) |
| int | NS | mate reference sequence identifier |
| int | NP | mate alignment start position |
| int | TS | the size of the template (insert size) |
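The template size rule described above for mates that are not stored verbatim can be sketched in Python as follows. The function name and the `(leftmost, rightmost)` tuple layout are assumptions for illustration; how ties are broken when both records start at the same coordinate is also an assumption here (the first argument is treated as leftmost).

```python
from typing import Tuple

Span = Tuple[int, int]  # (leftmost, rightmost) mapped coordinate, inclusive


def template_sizes(this_span: Span, mate_span: Span) -> Tuple[int, int]:
    """Return (TS for this record, TS for its mate)."""
    leftmost = min(this_span[0], mate_span[0])
    rightmost = max(this_span[1], mate_span[1])
    size = rightmost - leftmost + 1  # inclusive distance
    # Positive for the leftmost record, negative for the rightmost one.
    if this_span[0] <= mate_span[0]:
        return size, -size
    return -size, size
```

For a pair mapped at 100..149 and 300..349 this gives 250 for the first record and -250 for the second.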
| Bit flag | Name | Description |
| :--- | :--- | :--- |
| 0x1 | mate negative strand bit | the bit is set if the mate is on the negative strand |
| 0x2 | mate unmapped bit | the bit is set if the mate is unmapped |
procedure DECODEMATEDATA
    if CF AND 2 then    ▷ Detached from mate
        mate_flags ← READITEM(MF, Integer)
        if mate_flags AND 1 then
            bam_flags ← bam_flags OR 0x20    ▷ Mate is reverse-complemented
        end if
        if mate_flags AND 2 then
            bam_flags ← bam_flags OR 0x08    ▷ Mate is unmapped
        end if
        if container_pmap.read_names_included ≠ 1 then
            read_name ← READITEM(RN, Byte[])
        end if
        mate_ref_id ← READITEM(NS, Integer)
        mate_position ← READITEM(NP, Integer)
        template_size ← READITEM(TS, Integer)
    else if CF AND 4 then    ▷ Mate is downstream
        if next_frag.bam_flags AND 0x10 then
            this.bam_flags ← this.bam_flags OR 0x20    ▷ next segment reverse complemented
        end if
        if next_frag.bam_flags AND 0x04 then
            this.bam_flags ← this.bam_flags OR 0x08    ▷ next segment unmapped
        end if
        next_frag ← READITEM(NF, Integer)
        next_record ← this_record + next_frag + 1
        Resolve mate_ref_id for this_record and next_record once both have been decoded
        Resolve mate_position for this_record and next_record once both have been decoded
        Find leftmost and rightmost mapped coordinate in records this_record and next_record.
        For leftmost of this_record and next_record: template_size ← rightmost - leftmost + 1
        For rightmost of this_record and next_record: template_size ← -(rightmost - leftmost + 1)
    end if
end procedure
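Note that, as in the SAM specification, a template may have more than two alignment records. In this case the "mate" of each record is considered to be the next record, with the mate of the last record being the first, forming a circular list. The above algorithm is a simplification that does not handle this scenario. The full method needs to observe when record this + NF is itself flagged as having an additional mate downstream. One recommended approach is to resolve the mate information in a second pass, once the entire slice has been decoded. The last segment in the mate chain needs to set the bam_flags bits 0x20 and 0x08 according to the first segment. For brevity, this is also not listed in the above algorithm.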
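The second-pass approach can be sketched in Python as below. This is only an illustration under stated assumptions: the per-record BAM flags and NF values are assumed to be available as two parallel lists for one slice, `None` marks a record with no downstream mate, and all names are hypothetical. Only the mate strand and mate unmapped bits are resolved here.

```python
from typing import List, Optional

BAM_UNMAPPED, BAM_MATE_UNMAPPED = 0x04, 0x08
BAM_REVERSE, BAM_MATE_REVERSE = 0x10, 0x20


def resolve_mate_flags(bam_flags: List[int],
                       next_frag: List[Optional[int]]) -> List[int]:
    """Set mate bits for every chain of records linked through NF.

    next_frag[i] is the NF value of record i, or None when the record
    has no mate downstream. The mate of record i is i + NF + 1, and the
    mate of the last record in a chain is the first record.
    """
    flags = list(bam_flags)
    # Records that some earlier record points at are not chain starts.
    pointed_at = {i + nf + 1 for i, nf in enumerate(next_frag) if nf is not None}
    for first in range(len(flags)):
        if first in pointed_at or next_frag[first] is None:
            continue
        chain = [first]
        while next_frag[chain[-1]] is not None:
            chain.append(chain[-1] + next_frag[chain[-1]] + 1)
        for pos, rec in enumerate(chain):
            mate = chain[(pos + 1) % len(chain)]  # circular list
            if flags[mate] & BAM_REVERSE:
                flags[rec] |= BAM_MATE_REVERSE
            if flags[mate] & BAM_UNMAPPED:
                flags[rec] |= BAM_MATE_UNMAPPED
    return flags
```

With three consecutive records whose flags are 0x10, 0x00 and 0x04, the last record takes its mate bits from the first, so the result is 0x10, 0x08 and 0x24.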
| Data series <br> type | Data series <br> name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | TL | tag line | an index into the tag dictionary (TD) |
| * | ??? | tag name/type | 3 character key (2 tag identifier and 1 tag type), as specified by the tag dictionary |
procedure DECODETAGDATA
    tag_line ← READITEM(TL, Integer)
    for all ele ∈ container_pmap.tag_dict(tag_line) do
        name ← first two characters of ele
        tag(type) ← last character of ele
        tag(name) ← READITEM(ele, Byte[])
    end for
end procedure
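The loop above can be sketched in Python as follows. The 3-byte keys of one tag line are assumed to have been looked up in the tag dictionary already, and `read_item` is a hypothetical callback standing in for READITEM on the data series named by the key.

```python
from typing import Callable, Dict, Iterable, Tuple


def split_tag_key(key: bytes) -> Tuple[str, str]:
    """Split a 3-byte key into (two-character tag name, one-character type)."""
    if len(key) != 3:
        raise ValueError("tag dictionary keys are 3 bytes long")
    return key[:2].decode("ascii"), chr(key[2])


def decode_tag_line(keys: Iterable[bytes],
                    read_item: Callable[[bytes], bytes]) -> Dict[str, Tuple[str, bytes]]:
    """Read one value per key and return {name: (type, raw value bytes)}."""
    tags = {}
    for key in keys:
        name, tag_type = split_tag_key(key)
        tags[name] = (tag_type, read_item(key))
    return tags
```

The raw value bytes are left uninterpreted here; turning them into typed values depends on the tag type character.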
| Data series type | Data series <br> name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | FN | number of read <br> features | the number of read features |
| int | FP | in-read-position $^{a}$ | delta-position of the read feature |
| byte | FC | read feature code | See feature codes below |
| * | * | read feature data $^{a}$ | See feature codes below |
| int | MQ | mapping qualities | mapping quality score |
| byte[read length] | QS | quality scores | the base qualities, if preserved |
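$^{a}$ Repeated FN times, once for each read feature.
Read feature codes
Each feature code has its own associated data series containing further information specific to that feature. The following codes are used to distinguish variations in read coordinates: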
| Feature code | Id | Data series <br> type | Data series <br> name | Description |
| :---: | :---: | :---: | :---: | :---: |
| Bases | b (0x62) | byte[] | BB | a stretch of bases |
| Scores | q (0x71) | byte[] | QQ | a stretch of scores |
| Read base | B (0x42) | byte,byte | BA,QS | A base and associated quality <br> score |
| Substitution | X (0x58) | byte | BS | base substitution codes, SAM <br> operators X, M and = |
| Insertion | I (0x49) | byte[] | IN | inserted bases, SAM operator <br> I |
| Deletion | D (0x44) | int | DL | number of deleted bases, <br> SAM operator D |
| Insert base | i (0x69) | byte | BA | single inserted base, SAM <br> operator I |
| Quality score | Q (0x51) | byte | QS | single quality score |
| Reference skip | N (0x4E) | int | RS | number of skipped bases, <br> SAM operator N |
| Soft clip | S (0x53) | byte[] | SC | soft clipped bases, SAM <br> operator S |
| Padding | P (0x50) | int | PD | number of padded bases, <br> SAM operator P |
| Hard clip | H (0x48) | int | HC | number of hard clipped bases, <br> SAM operator H |
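Note that, for compatibility with BAM, all base comparisons should be made in a case-insensitive manner, and all bases written to the SC, IN and BA data series should be in upper case.
Base substitution codes (BS data series)
A base substitution is defined as a change from one nucleotide base (the reference base) to another (the read base), including N as an unknown or missing base. There are 5 supported reference bases (ACGTN), each with 4 possible substitutions. Any other base type, such as an ambiguity code, must be written verbatim using the BA data series.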
| Ref. base | BS code 0 | BS code 1 | BS code 2 | BS code 3 |
| :--- | :---: | :---: | :---: | :---: |
| A | T | C | G | N |
| C | G | A | T | N |
| G | C | T | A | N |
| T | A | G | C | N |
| N | A | C | G | T |
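As an illustration, the table above can be held as a small lookup keyed by reference base. This Python sketch assumes the mapping shown in the table is the one in effect; the names are hypothetical.

```python
# One string per reference base; index = BS code (0..3), as in the table above.
SUBSTITUTIONS = {
    "A": "TCGN",
    "C": "GATN",
    "G": "CTAN",
    "T": "AGCN",
    "N": "ACGT",
}


def read_base_for(ref_base: str, bs_code: int) -> str:
    """Return the read base for a reference base and a BS code."""
    # Base comparisons are case-insensitive, so normalise to upper case.
    return SUBSTITUTIONS[ref_base.upper()][bs_code]
```

A reference base outside ACGTN raises `KeyError` here, reflecting that such bases are written verbatim through the BA data series rather than as substitutions.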
procedure DECODEMAPPEDREAD
    feature_number ← READITEM(FN, Integer)
    last_feature_position ← 0
    for i ← 1 to feature_number do
        DECODEFEATURE
    end for
    mapping_quality ← READITEM(MQ, Integer)
    if CF AND 1 then    ▷ Quality stored as an array
        for i ← 1 to read_length do
            quality_score ← READITEM(QS, Integer)
        end for
    end if
end procedure
procedure DECODEFEATURE
    feature_code ← READITEM(FC, Integer)
    feature_position ← READITEM(FP, Integer) + last_feature_position
    last_feature_position ← feature_position
    if feature_code = 'B' then
        base ← READITEM(BA, Byte)
        quality_score ← READITEM(QS, Byte)
    else if feature_code = 'X' then
        substitution_code ← READITEM(BS, Byte)
    else if feature_code = 'I' then
        inserted_bases ← READITEM(IN, Byte[])
    else if feature_code = 'S' then
        softclip_bases ← READITEM(SC, Byte[])
    else if feature_code = 'H' then
        hardclip_length ← READITEM(HC, Integer)
    else if feature_code = 'P' then
        pad_length ← READITEM(PD, Integer)
    else if feature_code = 'D' then
        deletion_length ← READITEM(DL, Integer)
    else if feature_code = 'N' then
        ref_skip_length ← READITEM(RS, Integer)
    else if feature_code = 'i' then
        base ← READITEM(BA, Byte)
    else if feature_code = 'b' then
        bases ← READITEM(BB, Byte[])
    else if feature_code = 'q' then
        quality_scores ← READITEM(QQ, Byte[])
    else if feature_code = 'Q' then
        quality_score ← READITEM(QS, Byte)
    end if
end procedure
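The chain of tests above amounts to a lookup from feature code to data series. A compact Python sketch of the same logic follows; `read_item` is a hypothetical callback standing in for READITEM that takes a data series name and returns its next item, and the returned tuple layout is an assumption.

```python
from typing import Callable, List, Tuple

# Feature code -> data series read for that feature, from the table above.
FEATURE_SERIES = {
    "b": ("BB",), "q": ("QQ",), "B": ("BA", "QS"), "X": ("BS",),
    "I": ("IN",), "D": ("DL",), "i": ("BA",), "Q": ("QS",),
    "N": ("RS",), "S": ("SC",), "P": ("PD",), "H": ("HC",),
}


def decode_features(feature_count: int,
                    read_item: Callable[[str], object]) -> List[Tuple[str, int, tuple]]:
    """Decode FN read features as (code, in-read position, data items)."""
    features = []
    last_position = 0  # FP is stored as a delta from the previous feature
    for _ in range(feature_count):
        code = chr(read_item("FC"))
        position = read_item("FP") + last_position
        last_position = position
        data = tuple(read_item(series) for series in FEATURE_SERIES[code])
        features.append((code, position, data))
    return features
```

An unknown feature code raises `KeyError` in this sketch, which makes corrupt input visible rather than silently skipped.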
10.7 Unmapped reads
The CRAM record structure for unmapped reads has the following additional fields:
| Data series type | Data series <br> name | Field | Description |
| :--- | :--- | :--- | :--- |
| byte[read length] | BA | bases | the read bases |
| byte[read length] | QS | quality scores | the base qualities, if preserved |
procedure DECODEUNMAPPEDREAD
    for i ← 1 to read_length do
        base ← READITEM(BA, Byte)
    end for
    if CF AND 1 then    ▷ Quality stored as an array
        for i ← 1 to read_length do
            quality_score ← READITEM(QS, Byte)
        end for
    end if
end procedure
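The EXTERNAL encoding simply stores data verbatim in an external block with the given ID. If the data type is Byte the data is stored as-is; if the data type is Integer it is stored in ITF8 format.
Parameters
The CRAM format defines the following parameters for the EXTERNAL encoding: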
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | external id | id of an external block containing the byte stream |
| Symbol | Code length | Codeword |
| :--- | :--- | :--- |
| A | 1 | 0 |
| B | 3 | 100 |
| C | 3 | 101 |
| D | 3 | 110 |
| E | 4 | 1110 |
| F | 4 | 1111 |
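The codewords in the table above can be derived from the symbols and their code lengths alone. The Python sketch below assumes a canonical assignment (symbols ordered by code length, then by symbol value, with consecutive codewords), which reproduces the table; the function name is illustrative.

```python
from typing import Dict, Sequence


def canonical_codewords(alphabet: Sequence, bit_lengths: Sequence[int]) -> Dict:
    """Assign canonical Huffman codewords, returned as bit strings."""
    pairs = sorted(zip(bit_lengths, alphabet))  # by length, then symbol
    codes = {}
    code = 0
    prev_len = pairs[0][0]
    for length, symbol in pairs:
        # Moving to a longer code length appends zero bits.
        code <<= (length - prev_len)
        # A zero-length code (single-symbol alphabet) uses no bits at all.
        codes[symbol] = format(code, "0{}b".format(length)) if length else ""
        code += 1
        prev_len = length
    return codes
```

Calling it with symbols A to F and lengths 1, 3, 3, 3, 4, 4 yields 0, 100, 101, 110, 1110 and 1111.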
Parameters
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8[] | alphabet | list of all encoded symbols (values) |
| itf8[] | bit-lengths | array of bit-lengths for each symbol in the alphabet |
| Data type | Name | Comment |
| :--- | :--- | :--- |
| encoding<int> | lengths encoding | an encoding describing how the arrays lengths are <br> captured |
| encoding<byte> | values encoding | an encoding describing how the values are captured |
| Bytes | Meaning |
| :--- | :--- |
| 0x04 | BYTE_ARRAY_LEN codec ID |
| 0x0a | 10 remaining bytes of BYTE_ARRAY_LEN parameters |
| | |
| 0x03 | HUFFMAN codec ID, for aux tag lengths |
| 0x04 | 4 more bytes of HUFFMAN parameters |
| 0x01 | Alphabet array size = 1 |
| 0x02 | alphabet symbol; (length = 2) |
| 0x01 | Codeword array size = 1 |
| 0x00 | Code length = 0 (zero bits needed as alphabet is size 1) |
| | |
| 0x01 | EXTERNAL codec ID, for aux tag values |
| 0x02 | 2 more bytes of EXTERNAL parameters |
| 0x80 0xc8 | ITF8 encoding for block ID 200 |
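The byte sequence in the table above can be walked with a short Python sketch. It assumes each encoding is laid out as an ITF8 codec ID, an ITF8 parameter byte count, then the parameter bytes, as the example shows; the function names are illustrative and the ITF8 reader follows the usual 1 to 5 byte layout.

```python
from typing import Tuple


def read_itf8(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one ITF8 integer at pos; return (value, next position)."""
    b0 = buf[pos]
    if b0 < 0x80:
        return b0, pos + 1
    if b0 < 0xC0:
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if b0 < 0xE0:
        return ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    if b0 < 0xF0:
        return (((b0 & 0x0F) << 24) | (buf[pos + 1] << 16)
                | (buf[pos + 2] << 8) | buf[pos + 3]), pos + 4
    value = (((b0 & 0x0F) << 28) | (buf[pos + 1] << 20) | (buf[pos + 2] << 12)
             | (buf[pos + 3] << 4) | (buf[pos + 4] & 0x0F))
    if value >= 1 << 31:  # interpret as a signed 32-bit integer
        value -= 1 << 32
    return value, pos + 5


def parse_encoding(buf: bytes, pos: int = 0) -> Tuple[int, bytes, int]:
    """Return (codec id, parameter bytes, next position)."""
    codec_id, pos = read_itf8(buf, pos)
    size, pos = read_itf8(buf, pos)
    return codec_id, buf[pos:pos + size], pos + size


def parse_byte_array_len(params: bytes):
    """Split BYTE_ARRAY_LEN parameters into its two nested encodings."""
    len_codec, len_params, pos = parse_encoding(params, 0)
    val_codec, val_params, _ = parse_encoding(params, pos)
    return (len_codec, len_params), (val_codec, val_params)
```

Applied to the example, the outer codec is 4 with 10 parameter bytes, the lengths encoding is HUFFMAN (3) and the values encoding is EXTERNAL (1) with block ID 200.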
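BYTE_ARRAY_STOP: codec ID 5
Can encode types Byte[].
Byte arrays are captured as a sequence of bytes terminated by a special stop byte. The data returned does not include the stop byte itself. In contrast to BYTE_ARRAY_LEN, the value is always encoded with EXTERNAL, so the parameter is an external id instead of another encoding.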
| Data type | Name | Comment |
| :--- | :--- | :--- |
| byte | stop byte | a special byte treated as a delimiter |
| itf8 | external id | id of an external block containing the byte stream |
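Reading one BYTE_ARRAY_STOP value amounts to scanning the external block for the stop byte. A minimal Python sketch, assuming the block contents are already uncompressed in memory and that the caller tracks the read position (the function name is illustrative):

```python
from typing import Tuple


def read_byte_array_stop(block: bytes, pos: int, stop_byte: int) -> Tuple[bytes, int]:
    """Return (array without the stop byte, position after the stop byte)."""
    # bytes.index raises ValueError if no stop byte follows pos.
    end = block.index(bytes([stop_byte]), pos)
    return block[pos:end], end + 1
```

With a stop byte of 0 and a block holding `read1\0read2\0`, two successive calls return `read1` and then `read2`.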
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is subtracted from each <br> value during decode |
| itf8 | length | the number of bits used |
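Given the two parameters above, decoding a value reads a fixed number of bits and subtracts the offset. The Python sketch below assumes bits are consumed most significant bit first within each byte; that ordering and the `BitReader` helper are assumptions made for illustration.

```python
class BitReader:
    """Hypothetical helper that reads bits MSB-first from a byte string."""

    def __init__(self, data: bytes):
        self.data = data
        self.bit_pos = 0

    def read_bits(self, n: int) -> int:
        value = 0
        for _ in range(n):
            byte = self.data[self.bit_pos >> 3]
            bit = (byte >> (7 - (self.bit_pos & 7))) & 1
            value = (value << 1) | bit
            self.bit_pos += 1
        return value


def decode_beta(reader: BitReader, offset: int, length: int) -> int:
    """Read `length` bits and subtract `offset`, as in the table above."""
    return reader.read_bits(length) - offset
```

An encoder would do the reverse: add the offset, then write the result in `length` bits.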
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is subtracted from each value during decode |
| itf8 | k | the order of the subexponential coding |
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset to subtract from each <br> value after decode |
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is added to each value |
| itf8 | M | the golomb parameter (number <br> of bins) |
Other contributors to the specification include John Marshall, Rishi Nag, Kenta Sato, Artem Tarasov and Jason Travis.
We would also like to sincerely thank everyone who has raised GitHub issues and/or otherwise helped us improve the specification.
$^{1}$ Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney, Efficient storage of high throughput DNA sequencing data using reference-based compression, Genome Res. 2011 21: 734-740; doi:10.1101/gr.114819.110; PMID:21245279.