This specification describes the CRAM 3.0 and 3.1 formats.
CRAM has the following major objectives:
1. Significantly better lossless compression than BAM
2. Full compatibility with BAM
3. Effortless transition to CRAM from using BAM files
4. Support for controlled loss of BAM data
The first three objectives allow users to take immediate advantage of the CRAM format while offering a smooth transition path from using BAM files. The fourth objective supports the exploration of different lossy compression strategies and provides a framework in which to effect these choices. Please note that the CRAM format does not impose any rules about what data should or should not be preserved. Instead, CRAM supports a wide range of lossless and lossy data preservation strategies enabling users to choose which data should be preserved.
Data in CRAM is stored either as CRAM records or using one of the general purpose compressors (gzip, bzip2). CRAM records are compressed using a number of different encoding strategies. For example, bases are reference compressed by encoding base differences rather than storing the bases themselves.${ }^{1}$
2 Data types
The CRAM specification uses logical data types and storage data types; logical data types are written as words (e.g. int) while storage data types are written using single letters (e.g. i). The difference between the two is that storage data types define how logical data types are stored in CRAM. Data in CRAM is stored either as bits or bytes. Writing values as bits and bytes is described in detail below.
2.1 Logical data types
Byte
Signed byte (8 bits).
Integer
Signed 32-bit integer.
Long
Signed 64-bit integer.
Array
An array of any logical data type: array<type>
2.2 Writing bits to a bit stream
A bit stream consists of a sequence of 1s and 0s. The bits are written most significant bit first, where new bits are stacked to the right and full bytes on the left are written out. In a bit stream the last byte will be incomplete if fewer than 8 bits have been written to it. In this case the bits in the last byte are shifted to the left.
Example of writing to bit stream
Let’s consider the following example. The table below shows a sequence of write operations:
| Operation order | Buffer state before | Written bits | Buffer state after | Issued bytes |
| :--- | :--- | :--- | :--- | :--- |
| 1 | 0x0 | 1 | 0x1 | - |
| 2 | 0x1 | 0 | 0x2 | - |
| 3 | 0x2 | 11 | 0xB | - |
| 4 | 0xB | 00000111 | 0x7 | 0xB0 |
After flushing the above bit stream the following bytes are written: 0xB0 0x70. Please note that the last byte was 0x7 before shifting to the left and became 0x70 after that:
> echo "obase=16; ibase=2; 00000111" | bc
7
> echo "obase=16; ibase=2; 01110000" | bc
70
And the whole bit sequence:
> echo "obase=2; ibase=16; B070" | bc
1011000001110000
When reading the bits from the bit sequence it must be known that only 12 bits are meaningful and the bit stream should not be read after that.
Note on writing to bit stream
When writing to a bit stream both the value and the number of bits in the value must be known. This is because programming languages normally operate with bytes (8 bits), so specifying which bits are to be written requires a bit-holder, for example an integer, together with the number of bits in it. Equally, when reading a value from a bit stream the number of bits must be known in advance. In the case of prefix codes (e.g. Huffman) all possible bit combinations are either known in advance or it is possible to calculate how many bits will follow based on the first few bits. Alternatively, two codes can be combined, where the first contains the number of bits to read.
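The write operations from the example table can be reproduced with a small bit writer. This is a minimal sketch, not a reference implementation: the class name `BitWriter` and its methods are hypothetical names chosen for illustration.

```python
class BitWriter:
    """Accumulates bits most significant bit first and issues full bytes."""

    def __init__(self):
        self.out = bytearray()  # bytes already issued
        self.buf = 0            # pending bits, right-aligned
        self.nbits = 0          # number of pending bits (0..7)

    def write(self, value, nbits):
        """Append the lowest `nbits` bits of `value`, most significant first."""
        for shift in range(nbits - 1, -1, -1):
            self.buf = (self.buf << 1) | ((value >> shift) & 1)
            self.nbits += 1
            if self.nbits == 8:  # a full byte is ready on the left
                self.out.append(self.buf)
                self.buf, self.nbits = 0, 0

    def flush(self):
        """Shift an incomplete last byte to the left, issue it, return all bytes."""
        if self.nbits:
            self.out.append((self.buf << (8 - self.nbits)) & 0xFF)
            self.buf, self.nbits = 0, 0
        return bytes(self.out)


# The four operations from the table: 1, 0, 11, 00000111
w = BitWriter()
w.write(0b1, 1)
w.write(0b0, 1)
w.write(0b11, 2)
w.write(0b00000111, 8)
result = w.flush()
print(result.hex())  # b070
```

Note that both the value and the bit count are passed to `write`, which mirrors the requirement stated above that the number of bits must always be known.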
2.3 Writing bytes to a byte stream
The interpretation of a byte stream is straightforward. CRAM uses little-endian byte order where applicable and defines the following storage data types:
Boolean (bool)
Boolean is written as 1 byte with 0x0 being ‘false’ and 0x1 being ‘true’.
Integer (int32)
Signed 32-bit integer, written as 4 bytes in little-endian byte order.
Long (int64)
Signed 64-bit integer, written as 8 bytes in little-endian byte order.
ITF-8 integer (itf8)
This is an alternative way to write an integer value. The idea is similar to UTF-8 encoding and therefore this encoding is called ITF-8 (Integer Transformation Format - 8 bit).
The most significant bits of the first byte have special meaning and are called the ‘prefix’. These are 0 to 4 true bits followed by a 0. The number of 1s denotes the number of bytes to follow. To accommodate 32 bits such a representation requires 5 bytes, with only the 4 lower bits used in the last (fifth) byte.
LTF-8 long (ltf8)
See ITF-8 for more details. The only difference between ITF-8 and LTF-8 is the number of bytes used to encode a single value. A long requires 64 bits, which can be encoded in at most 9 bytes, with the first byte consisting of just 1s (the value 0xFF).
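The ITF-8 layout described above can be sketched as an encoder and decoder pair. The function names are hypothetical. The sketch assumes that negative values are written as their unsigned 32-bit two's-complement pattern, and that the 5-byte form carries 4 value bits in the first byte, 8 bits in each of the next three bytes, and the lowest 4 bits in the final byte, as the prose describes.

```python
def itf8_encode(value):
    """Encode a signed 32-bit integer as ITF-8 (1 to 5 bytes)."""
    v = value & 0xFFFFFFFF  # assumption: negatives use the unsigned bit pattern
    if v < 0x80:            # prefix 0: no bytes follow
        return bytes([v])
    if v < 0x4000:          # prefix 10: one byte follows
        return bytes([0x80 | (v >> 8), v & 0xFF])
    if v < 0x200000:        # prefix 110: two bytes follow
        return bytes([0xC0 | (v >> 16), (v >> 8) & 0xFF, v & 0xFF])
    if v < 0x10000000:      # prefix 1110: three bytes follow
        return bytes([0xE0 | (v >> 24), (v >> 16) & 0xFF,
                      (v >> 8) & 0xFF, v & 0xFF])
    # prefix 1111: four bytes follow, the last one uses only its low 4 bits
    return bytes([0xF0 | (v >> 28), (v >> 20) & 0xFF, (v >> 12) & 0xFF,
                  (v >> 4) & 0xFF, v & 0x0F])


def itf8_decode(data, pos=0):
    """Decode one ITF-8 value from `data` at `pos`; return (value, new_pos)."""
    b0 = data[pos]
    if b0 < 0x80:
        v, n = b0, 1
    elif b0 < 0xC0:
        v, n = ((b0 & 0x3F) << 8) | data[pos + 1], 2
    elif b0 < 0xE0:
        v, n = ((b0 & 0x1F) << 16) | (data[pos + 1] << 8) | data[pos + 2], 3
    elif b0 < 0xF0:
        v = ((b0 & 0x0F) << 24) | (data[pos + 1] << 16) \
            | (data[pos + 2] << 8) | data[pos + 3]
        n = 4
    else:
        v = ((b0 & 0x0F) << 28) | (data[pos + 1] << 20) \
            | (data[pos + 2] << 12) | (data[pos + 3] << 4) \
            | (data[pos + 4] & 0x0F)
        n = 5
    if v >= 0x80000000:  # reinterpret as a signed 32-bit integer
        v -= 0x100000000
    return v, pos + n


print(itf8_encode(128).hex())  # 8080
print(itf8_encode(-1).hex())   # ffffffff0f
```

Small values cost a single byte, which is why ITF-8 is used for counts, identifiers and sizes throughout the format. An LTF-8 codec would follow the same pattern with a prefix of up to eight 1 bits.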
Array (array<type>)
A variable sized array with an explicitly written dimension. Array length is written first as integer (itf8), followed by the elements of the array.
Implicit or fixed-size arrays are also used, written as type[] or type[4] (for example). These have no explicit dimension included in the file format and instead rely on the specification itself to document the array size.
Encoding
Encoding is a data type that specifies how data series have been compressed. Encodings are defined as encoding<type> where the type is a logical data type as opposed to a storage data type.
An encoding is written as follows. The first integer (itf8) denotes the codec id and the second integer (itf8) the number of bytes in the following encoding-specific values.
Subexponential encoding example:
| Value | Type | Name |
| :--- | :--- | :--- |
| 0x7 | itf8 | codec id |
| 0x2 | itf8 | number of bytes to follow |
| 0x0 | itf8 | offset |
| 0x1 | itf8 | K parameter |
The first byte "0x7" is the codec id.
The next byte "0x2" denotes the length of the bytes to follow (2).
The subexponential encoding has 2 parameters: integer (itf8) offset and integer (itf8) K.
offset = 0x0 = 0
K = 0x1 = 1
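Reading the serialized encoding from the example (bytes 0x7 0x2 0x0 0x1) can be sketched as follows. The helper names `read_itf8`, `parse_encoding` and `parse_subexp_params` are hypothetical, and the ITF-8 reader assumes the byte layout described in the ITF-8 paragraph above.

```python
def read_itf8(data, pos):
    """Read one ITF-8 value from `data` at `pos`; return (value, new_pos)."""
    b0 = data[pos]
    n = 0  # leading 1 bits in the first byte = number of bytes that follow
    while n < 4 and b0 & (0x80 >> n):
        n += 1
    if n == 4:
        # 5-byte form: 4 bits, three full bytes, then the low 4 bits of the last
        v = b0 & 0x0F
        for b in data[pos + 1:pos + 4]:
            v = (v << 8) | b
        v = (v << 4) | (data[pos + 4] & 0x0F)
    else:
        v = b0 & (0x7F >> n)
        for b in data[pos + 1:pos + 1 + n]:
            v = (v << 8) | b
    if v >= 0x80000000:  # reinterpret as a signed 32-bit integer
        v -= 0x100000000
    return v, pos + 1 + n


def parse_encoding(data, pos=0):
    """Return (codec_id, parameter_bytes, new_pos) for one serialized encoding."""
    codec_id, pos = read_itf8(data, pos)
    nbytes, pos = read_itf8(data, pos)  # length of the codec-specific values
    return codec_id, data[pos:pos + nbytes], pos + nbytes


def parse_subexp_params(params):
    """Split the parameter bytes of a subexponential encoding into (offset, K)."""
    offset, p = read_itf8(params, 0)
    k, p = read_itf8(params, p)
    return offset, k


codec_id, params, end = parse_encoding(bytes([0x07, 0x02, 0x00, 0x01]))
print(codec_id, parse_subexp_params(params))  # 7 (0, 1)
```

Because the parameter length is written explicitly, a reader can skip the parameters of a codec it does not recognise without understanding their layout.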
Map
A map is a collection of keys and associated values. A map with N keys is written as follows:
| size in bytes | N | key 1 | value 1 | key... | value ... | key N | value N |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
Both the size in bytes and the number of keys are written as integer (itf8). Keys and values are written according to their data types and are specific to each map.
String
A string is represented as a byte array using UTF-8 format. Read names, reference sequence names and tag values with type ‘Z’ are stored as UTF-8.
3 Encodings
Encoding is a data structure that captures information about compression details of a data series that are required to uncompress it. This could be a set of constants required to initialize a specific decompression algorithm or statistical properties of a data series or, in case of data series being stored in an external block, the block content id.
Encoding notation is defined as the keyword ‘encoding’ followed by its data type in angular brackets, for example ‘encoding<byte>’ stands for an encoding that operates on a data series of data type ‘byte’.
Encodings may have parameters of different data types, for example the EXTERNAL encoding has only one parameter, the integer id of the external block. The following encodings are defined:
| Codec | ID | Parameters | Comment |
| :--- | :--- | :--- | :--- |
| NULL | 0 | none | series not preserved |
| EXTERNAL | 1 | int block content id | the block content identifier used to associate external data blocks with data series |
| Deprecated (GOLOMB) | 2 | int offset, int M | Golomb coding |
| HUFFMAN | 3 | array<int>, array<int> | coding with int/byte values |
| BYTE_ARRAY_LEN | 4 | encoding<int> array length, encoding<byte> bytes | coding of byte arrays with array length |
| BYTE_ARRAY_STOP | 5 | byte stop, int external block content id | coding of byte arrays with a stop value |
| BETA | 6 | int offset, int number of bits | binary coding |
| SUBEXP | 7 | int offset, int K | subexponential coding |
| Deprecated (GOLOMB_RICE) | 8 | int offset, int $\log_{2} m$ | Golomb-Rice coding |
| GAMMA | 9 | int offset | Elias gamma coding |
See section 13 for more detailed descriptions of all the above coding algorithms and their parameters.
4 Checksums
Checksumming is used to ensure data integrity. The following checksumming algorithms are used in CRAM.
4.1 CRC32
This is a 32-bit cyclic redundancy checksum with the polynomial 0x04C11DB7. Please refer to ITU-T V.42 for more details. The value of the CRC32 hash function is written as an integer.
4.2 CRC32 sum
CRC32 sum is a combination of CRC32 values obtained by summing up all individual CRC32 values modulo $2^{32}$.
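Both checksums can be sketched with Python's standard `zlib.crc32`, which implements the same CRC-32 (polynomial 0x04C11DB7, as used by ITU-T V.42 and gzip). The wrapper names `crc32` and `crc32_sum` are hypothetical.

```python
import zlib


def crc32(data):
    """CRC32 of a byte string as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def crc32_sum(chunks):
    """Combine per-chunk CRC32 values by summing them modulo 2**32."""
    total = 0
    for chunk in chunks:
        total = (total + crc32(chunk)) % (1 << 32)
    return total


print(hex(crc32(b"123456789")))  # 0xcbf43926, the standard CRC-32 check value
```

Since addition is commutative, the CRC32 sum does not depend on the order in which the individual values are combined.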
5 File structure
The overall CRAM file structure is described in this section. Please refer to other sections of this document for more detailed information.
A CRAM file consists of a fixed length file definition, followed by a CRAM header container, then zero or more data containers, and finally a special end-of-file container.
Figure 1: A CRAM file consists of a file definition, followed by a header container, then other containers.
Containers consist of one or more blocks. The first container, called the CRAM header container, is used to store a textual header as described in the SAM specification (see section 7.1). This container may have additional padding bytes present for purposes of permitting inline rewriting of the SAM header with small changes in size. These padding bytes are undefined, but we recommend filling with nuls. The padding bytes can either be in explicit uncompressed Block structures, or as unallocated extra space where the size of the container is larger than the combined size of blocks held within it.
Figure 2: The first container holds the CRAM header text.
Each container starts with a container header structure followed by one or more blocks. The first block in each container is the compression header block giving details of how to decode data in subsequent blocks. Each block starts with a block header structure followed by the block data.
Figure 3: Containers as a series of blocks
The blocks after the compression header are organised logically into slices. One slice may contain, for example, a contiguous region of alignment data. Slices begin with a slice header block and are followed by one or more data blocks. It is these data blocks which hold the primary bulk of CRAM data. The data blocks are further subdivided into a core data block and one or more external data blocks.
Figure 4: Slices formed from a series of concatenated blocks
6 File definition
Each CRAM file starts with a fixed length (26 bytes) definition with the following fields:
| Data type | Name | Value |
| :--- | :--- | :--- |
| byte[4] | format magic number | CRAM (0x43 0x52 0x41 0x4d) |
| unsigned byte | major format number | 3 (0x3) |
| unsigned byte | minor format number | 1 (0x1) |
| byte[20] | file id | CRAM file identifier (e.g. file name or SHA1 checksum) |
Valid CRAM major.minor version numbers are as follows:
1.0 The original public CRAM release.
2.0 The first CRAM release implemented in both Java and C; tidied up implementation vs specification differences in 1.0.
2.1 Gained end of file markers; compatible with 2.0.
3.0 Additional compression methods; header and data checksums; improvements for unsorted data.
3.1 Additional EXTERNAL compression codecs only.
CRAM 3.0 and 3.1 differ only in the list of compression methods available, so tools that output CRAM 3 without using any 3.1 codecs should write the header to indicate 3.0 in order to permit maximum compatibility.
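The 26-byte file definition maps directly onto a fixed-size record. The following is a minimal sketch using Python's `struct` module; the function names are hypothetical, and padding a short file id with nul bytes is an assumption of this sketch rather than a stated rule.

```python
import struct

# 4-byte magic, major version, minor version, 20-byte file id = 26 bytes
FILE_DEFINITION = struct.Struct("<4sBB20s")


def parse_file_definition(data):
    """Return (major, minor, file_id) from the first 26 bytes of a CRAM file."""
    if len(data) < FILE_DEFINITION.size:
        raise ValueError("file definition is truncated")
    magic, major, minor, file_id = FILE_DEFINITION.unpack_from(data)
    if magic != b"CRAM":
        raise ValueError("not a CRAM file")
    return major, minor, file_id


def build_file_definition(major, minor, file_id=b""):
    """Pack a file definition; struct pads a short file id with nul bytes."""
    return FILE_DEFINITION.pack(b"CRAM", major, minor, file_id)


# A writer that uses no 3.1 codecs would record version 3.0
definition = build_file_definition(3, 0, b"example.cram")
print(len(definition), parse_file_definition(definition)[:2])  # 26 (3, 0)
```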
7 Container header structure
The file definition is followed by one or more containers with the following header structure where the container content is stored in the ‘blocks’ field:
| Data type | Name | Value |
| :---: | :---: | :---: |
| int32 | length | the sum of the lengths of all blocks in this container (headers and data) and any padding bytes (CRAM header container only); equal to the total byte length of the container minus the byte length of this header structure |
| itf8 | reference sequence id | reference sequence identifier or <br> -1 for unmapped reads <br> -2 for multiple reference sequences. <br> All slices in this container must have a reference sequence id matching this value. |
| itf8 | starting position on the reference | the alignment start position |
| itf8 | alignment span | the length of the alignment |
| itf8 | number of records | number of records in the container |
| ltf8 | record counter | 1-based sequential index of records in the file/stream. |
| ltf8 | bases | number of read bases |
| itf8 | number of blocks | the total number of blocks in this container |
| array<itf8> | landmarks | the locations of slices in this container as byte offsets from the end of this container header, used for random access indexing. For sequence data containers, the landmark count must equal the slice count. <br> Since the block before the first slice is the compression header, landmarks[0] is equal to the byte length of the compression header. |
| int | crc32 | CRC32 hash of all the preceding bytes in the container. |
| byte[] | blocks | The blocks contained within the container. |
In the initial CRAM header container, the reference sequence id, starting position on the reference, and alignment span fields must be ignored when reading. The landmarks array is optional for the CRAM header, but if it exists it should point to block offsets instead of slices, with the first block containing the textual header.
In data containers specifying unmapped reads or multiple reference sequences (i.e. reference sequence id < 0), the starting position on the reference and alignment span fields must be ignored when reading. When writing, it is recommended to set each of these ignored fields to the value 0.
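Reading the container header fields in the order given in the table can be sketched as follows. All function and key names are hypothetical. The sketch assumes the ITF-8 and LTF-8 byte layouts from section 2.3, that the crc32 field is stored as a little-endian 32-bit integer, and that it covers the header bytes from the length field up to, but not including, the crc32 field itself.

```python
import struct
import zlib


def read_varint(data, pos, bits=32):
    """Read one ITF-8 (bits=32) or LTF-8 (bits=64) value; return (value, new_pos)."""
    b0 = data[pos]
    max_extra = 4 if bits == 32 else 8
    n = 0  # leading 1 bits in the first byte = number of bytes that follow
    while n < max_extra and b0 & (0x80 >> n):
        n += 1
    if bits == 32 and n == 4:
        # 5-byte ITF-8: 4 bits, three full bytes, then the low 4 bits of the last
        v = b0 & 0x0F
        for b in data[pos + 1:pos + 4]:
            v = (v << 8) | b
        v = (v << 4) | (data[pos + 4] & 0x0F)
    else:
        v = b0 & (0xFF >> (n + 1))
        for b in data[pos + 1:pos + 1 + n]:
            v = (v << 8) | b
    if v >= 1 << (bits - 1):  # reinterpret as a signed integer
        v -= 1 << bits
    return v, pos + 1 + n


def parse_container_header(data, pos=0):
    """Return (fields, offset_of_first_block) for one container header."""
    start = pos
    (length,) = struct.unpack_from("<i", data, pos)
    pos += 4
    hdr = {"length": length}
    for name, bits in (("ref_seq_id", 32), ("start", 32), ("span", 32),
                       ("n_records", 32), ("record_counter", 64),
                       ("bases", 64), ("n_blocks", 32)):
        hdr[name], pos = read_varint(data, pos, bits)
    n_landmarks, pos = read_varint(data, pos)  # array length comes first
    hdr["landmarks"] = []
    for _ in range(n_landmarks):
        landmark, pos = read_varint(data, pos)
        hdr["landmarks"].append(landmark)
    (stored_crc,) = struct.unpack_from("<I", data, pos)
    hdr["crc_ok"] = stored_crc == zlib.crc32(data[start:pos])
    if hdr["ref_seq_id"] < 0:
        # unmapped or multi-reference: these two fields must be ignored
        hdr["start"] = hdr["span"] = None
    return hdr, pos + 4
```

Landmarks are relative to the returned offset, so the first slice of a data container would begin at `offset_of_first_block + landmarks[0]`.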
7.1 CRAM header container
The first container in a CRAM file contains a textual header in one or more blocks. See section 8.3 for more details on the layout of data within these blocks and constraints applied to the contents of the SAM header.
The landmarks field of the container header structure may be used to indicate the offsets of the blocks used in the header container. These may optionally be omitted by specifying an array size of zero.
8 Block structure
Containers consist of one or more blocks. Block compression is applied independently and in addition to any encodings used to compress data within the block. Blocks have the following header structure with the data stored in the ‘block data’ field:
| Data type | Name | Value |
| :--- | :--- | :--- |
| byte | method | the block compression method (and first CRAM version): <br> 0: raw (none)* <br> 1: gzip <br> 2: bzip2 (v2.0) <br> 3: lzma (v3.0) <br> 4: rans4x8 (v3.0) <br> 5: rans4x16 (v3.1) <br> 6: adaptive arithmetic coder (v3.1) <br> 7: fqzcomp (v3.1) <br> 8: name tokeniser (v3.1) |
| byte | block content type id | the block content type identifier |
| itf8 | block content id | the block content identifier used to associate external data blocks with data series |
| itf8 | size in bytes* | size of the block data after applying block compression |
| itf8 | raw size in bytes* | size of the block data before applying block compression |
| byte[] | block data | the data stored in the block: <br> $\bullet$ bit stream of CRAM records (core data block) <br> $\bullet$ byte stream (external data block) <br> $\bullet$ additional fields (header blocks) |
| byte[4] | CRC32 | CRC32 hash value for all preceding bytes in the block |
Note on raw method: both compressed and raw sizes must be set to the same value.
Empty blocks may occur in the files. Blocks with a raw (uncompressed) size of zero are treated as empty, irrespective of their “method” byte. This is equivalent to interpreting them as having method zero (raw) and compressed size of zero.
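The block layout and the empty-block rule can be sketched as a small parser. The names are hypothetical and several details are assumptions of this sketch: the CRC32 is taken to be stored little-endian, the gzip method is decoded with `zlib.decompress(..., wbits=47)` (which accepts either a gzip or a zlib wrapper), and only the raw and gzip methods are handled.

```python
import struct
import zlib


def read_itf8(data, pos):
    """Read one ITF-8 value from `data` at `pos`; return (value, new_pos)."""
    b0 = data[pos]
    n = 0  # leading 1 bits in the first byte = number of bytes that follow
    while n < 4 and b0 & (0x80 >> n):
        n += 1
    if n == 4:
        v = b0 & 0x0F
        for b in data[pos + 1:pos + 4]:
            v = (v << 8) | b
        v = (v << 4) | (data[pos + 4] & 0x0F)
    else:
        v = b0 & (0x7F >> n)
        for b in data[pos + 1:pos + 1 + n]:
            v = (v << 8) | b
    if v >= 0x80000000:
        v -= 0x100000000
    return v, pos + 1 + n


def parse_block(data, pos=0):
    """Parse one block at `pos`; return (fields, new_pos)."""
    start = pos
    method, content_type = data[pos], data[pos + 1]
    pos += 2
    content_id, pos = read_itf8(data, pos)
    comp_size, pos = read_itf8(data, pos)
    raw_size, pos = read_itf8(data, pos)
    payload = data[pos:pos + comp_size]
    pos += comp_size
    (stored_crc,) = struct.unpack_from("<I", data, pos)  # assumed little-endian
    if stored_crc != zlib.crc32(data[start:pos]):
        raise ValueError("block CRC32 mismatch")
    if raw_size == 0:
        payload = b""  # empty block, irrespective of the method byte
    elif method == 1:
        payload = zlib.decompress(payload, wbits=47)  # gzip or zlib wrapper
    elif method != 0:
        raise NotImplementedError("method %d not handled in this sketch" % method)
    if len(payload) != raw_size:
        raise ValueError("raw size does not match block data")
    return {"method": method, "content_type": content_type,
            "content_id": content_id, "data": payload}, pos + 4
```

For the raw method the final length check also enforces the rule that the compressed and raw sizes must be equal.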
8.1 Block content types
CRAM has the following block content types:
| Block content type | Block content type id | Name | Contents |
| :--- | :--- | :--- | :--- |
| FILE_HEADER | 0 | CRAM header block | CRAM header |
| COMPRESSION_HEADER | 1 | Compression header block | See specific section |
| SLICE_HEADER ${ }^{\text{a}}$ | 2 | Slice header block | See specific section |
| | 3 | | reserved |
| EXTERNAL_DATA | 4 | external data block | data produced by external encodings |
| CORE_DATA | 5 | core data block | bit stream of all encodings except for external encodings |
8.2 Block content id
Block content id is used to distinguish between external blocks in the same slice. Each external encoding has an id parameter which must be one of the external block content ids. For external blocks the content id is a positive integer. For all other blocks content id should be 0. Consequently, external encodings must not use a content id less than 1.
Data blocks
Data is stored in data blocks. There are two types of data blocks: core data blocks and external data blocks. The difference between core and external data blocks is that core data blocks consist of data series that are compressed using bit encodings while the external data blocks are byte compressed. One core data block and any number of external data blocks are associated with each slice.
Writing to and reading from core and external data blocks is organised through CRAM records. Each data series is associated with an encoding. In case of external encodings the block content id is used to identify the block where the data series is stored. Please note that external blocks can have multiple data series associated with them; in this case the values from these data series will be interleaved.
8.3 CRAM header block(s)
The SAM header is stored in the first block of the CRAM header container (see section 7.1). This block may be uncompressed or gzip compressed only. This block is followed by zero or more uncompressed expansion blocks. If present, these permit in-place editing of the CRAM header, allowing it to grow or shrink with a compensatory size change applied to the subsequent expansion block, avoiding the need to rewrite the remainder of the file. The contents of any expansion blocks should be zero bytes (nul characters).
The format of the initial SAM header block is a 32-bit little-endian integer holding the length of the text of the SAM header, minus nul-termination bytes, followed by the text itself. Although 32-bit, the maximum permitted value is $2^{31}$, and all lengths must be positive.
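The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the block content is assumed to be already uncompressed, and decoding the text as UTF-8 is an assumption made for the example.

```python
import struct

def write_sam_header_block(text: str, padding: int = 0) -> bytes:
    """Build block content: 32-bit little-endian length, text, optional nul padding."""
    raw = text.encode("utf-8")  # assumption: header text is UTF-8 compatible
    return struct.pack("<I", len(raw)) + raw + b"\0" * padding

def read_sam_header_text(block_data: bytes) -> str:
    """Extract the SAM header text from uncompressed header block content."""
    if len(block_data) < 4:
        raise ValueError("block too short to hold the length field")
    (length,) = struct.unpack_from("<I", block_data, 0)
    # Lengths must be positive and may not exceed 2^31.
    if length == 0 or length > 2**31:
        raise ValueError("SAM header length out of the permitted range")
    if 4 + length > len(block_data):
        raise ValueError("declared length exceeds the block content")
    # Any bytes after the text (for example nul padding) are ignored.
    return block_data[4:4 + length].decode("utf-8")
```

Because the length excludes any nul bytes that follow the text, a reader can recover the header without scanning for a terminator.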
The following constraints apply to the SAM header text:
The SQ:MD5 checksum is required unless the reference sequence has been embedded into the file.
8.4 Compression header block
The compression header block consists of 3 parts: preservation map, data series encoding map and tag encoding map.
Preservation map
The preservation map contains information about which data was preserved in the CRAM file. It is stored as a map with byte[2] keys:
| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| RN | bool | read names included | true if read names are preserved for all reads |
| AP | bool | AP data series delta | true if AP data series is delta, false otherwise |
| RR | bool | reference required | true if reference sequence is required to restore <br> the data completely |
| SM | byte[5] | substitution matrix | substitution matrix |
| TD | array<byte> | tag ids dictionary | a list of lists of tag ids, see tag encoding section |
The boolean values are optional, defaulting to true when absent, although it is recommended to explicitly set them. SM and TD are mandatory.
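The defaulting rule can be illustrated with a short sketch. It assumes the preservation map has already been parsed into a Python dict keyed by two-character strings; that representation and the function name are hypothetical choices for this example.

```python
def apply_preservation_defaults(parsed: dict) -> dict:
    """Fill in defaults for an already-parsed preservation map."""
    # SM and TD are mandatory, so their absence is an error.
    for mandatory in ("SM", "TD"):
        if mandatory not in parsed:
            raise ValueError(f"preservation map is missing mandatory key {mandatory}")
    # The boolean entries RN, AP and RR default to true when absent.
    result = {"RN": True, "AP": True, "RR": True}
    result.update(parsed)
    return result
```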
Data series encodings
Each data series has an encoding. These encodings are stored in a map with byte[2] keys and are decoded in approximately this order$^{2}$:
| Key | Value data type | Name | Value |
| :---: | :---: | :---: | :---: |
| BF | encoding<int> | BAM bit flags | see separate section |
| CF | encoding<int> | CRAM bit flags | see specific section |
| RI | encoding<int> | reference id | record reference id from the SAM file header |
| RL | encoding<int> | read lengths | read lengths |
| AP | encoding<int> | in-seq positions | if AP-Delta = true: 0-based alignment start <br> delta from the AP value in the previous record. <br> Note this delta may be negative, for example <br> when switching references in a multi-reference <br> slice. When the record is the first in the slice, the <br> previous position used is the slice alignment-start <br> field (hence the first delta should be zero for <br> single-reference slices, or the AP value itself for <br> multi-reference slices). <br> if AP-Delta = false: encodes the alignment start <br> position directly |
| RG | encoding<int> | read groups | read groups. Special value ' -1 ' stands for no <br> group. |
| $\mathrm{RN}^{\mathrm{a}}$ | encoding<byte[ ]> | read names | read names |
| MF | encoding<int> | next mate bit flags | see specific section |
| NS | encoding<int> | next fragment <br> reference sequence id | reference sequence ids for the next fragment |
| NP | encoding<int> | next mate alignment <br> start | alignment positions for the next fragment |
| TS | encoding<int> | template size | template sizes |
| NF | encoding<int> | distance to next <br> fragment | number of records to skip to the next fragment ${ }^{b}$ |
| $\mathrm{TL}^{\mathrm{c}}$ | encoding<int> | tag ids | list of tag ids, see tag encoding section |
| FN | encoding<int> | number of read <br> features | number of read features in each record |
| FC | encoding<byte> | read features codes | see separate section |
| FP | encoding<int> | in-read positions | positions of the read features; a positive delta to <br> the last position (starting with zero) |
| DL | encoding<int> | deletion lengths | base-pair deletion lengths |
| BB | encoding<byte[]> | stretches of bases | bases |
| QQ | encoding<byte[ ]> | stretches of quality <br> scores | quality scores |
| BS | encoding<byte> | base substitution <br> codes | base substitution codes |
| IN | encoding<byte[]> | insertion | inserted bases |
| RS | encoding<int> | reference skip length | number of skipped bases for the ' N ' read feature |
| PD | encoding<int> | padding | number of padded bases |
| HC | encoding<int> | hard clip | number of hard clipped bases |
| SC | encoding<byte[ ]> | soft clip | soft clipped bases |
| MQ | encoding<int> | mapping qualities | mapping quality scores |
| BA | encoding<byte> | bases | bases |
| QS | encoding<byte> | quality scores | quality scores |
| $\mathrm{TC}^{\mathrm{d}}$ | N/A | legacy field | to be ignored |
| $\mathrm{TN}^{\mathrm{d}}$ | N/A | legacy field | to be ignored |
$^{a}$ Note that RN is decoded after MF if the record is detached from the mate and we are attempting to auto-generate read names.
$^{b}$ The count is reset for each slice so NF can only refer to a record later within this slice.
$^{c}$ TL is followed by decoding the tag values themselves, in order of appearance in the tag dictionary.
$^{d}$ TC and TN are legacy data series from CRAM 1.0. They have no function in CRAM 3.0 and should not be present. However some implementations do output them and decoders must silently skip these fields. It is illegal for TC and TN to contain any data values, although there may be empty blocks associated with them.
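As an illustration of the AP rules in the table above, the following sketch turns already-decoded AP integers into alignment start positions for one slice. The function name and argument layout are hypothetical; the AP values are assumed to have been produced by whichever encoding the compression header specifies.

```python
def decode_alignment_starts(ap_values, ap_delta, slice_alignment_start=0):
    """Convert decoded AP integers into alignment start positions for one slice."""
    if not ap_delta:
        # AP-Delta = false: each value is the alignment start itself.
        return list(ap_values)
    positions = []
    # For the first record the "previous" position is the slice alignment start.
    previous = slice_alignment_start
    for delta in ap_values:
        previous += delta  # deltas may be negative, e.g. when the reference changes
        positions.append(previous)
    return positions
```

For a single-reference slice starting at position 1000, the AP values 0, 5, 10 give starts 1000, 1005, 1015. For a multi-reference slice whose header alignment start is 0, the first AP value is the position itself.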
Tag encodings
The tag dictionary (TD) describes the unique combinations of tag id / type that occur on each alignment record. For example if we search the id / types present in each record and find only two combinations (X1:i BC:Z SA:Z and X1:i BC:Z) then we have two dictionary entries in the TD map.
Let $L_{i}=\{T_{i0}, T_{i1}, \ldots, T_{ix}\}$ be a list of all tag ids for a record $R_{i}$, where $i$ is the sequential record index and $T_{ij}$ denotes the $j$-th tag id in the record. The list of unique $L_{i}$ is stored as the TD value in the preservation map. Maintaining the order is not a requirement for encoders (hence "combinations"), but it is permissible and thus different permutations, each encoded with their own elements in TD, should be supported by the decoder. Each $L_{i}$ element in TD is assigned a sequential integer number starting with 0. These integer numbers are referred to by the TL data series. Using TD, an integer from the TL data series can be mapped back into a list of tag ids. Thus per alignment record we only need to store tag values and not their ids and types.
The TD is written as a byte array consisting of $L_{i}$ values separated with \0. Each $L_{i}$ value is written as a concatenation of 3 byte $T_{ij}$ elements: tag id followed by BAM tag type code (one of A, c, C, s, S, i, I, f, Z, H or B, as described in the SAM specification). For example the TD for tag lists X1:i BC:Z SA:Z and X1:i BC:Z may be encoded as X1CBCZSAZ\0X1CBCZ\0, with X1C indicating a 1 byte unsigned value for tag X1.
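A minimal sketch of reading such a byte array back into tag lists follows. The function name is hypothetical, and the sketch assumes, as in the example above, that every list including the last is followed by a nul byte.

```python
def parse_tag_dictionary(td: bytes):
    """Split a TD byte array into lists of (tag id, BAM type code) pairs."""
    chunks = td.split(b"\0")
    # A trailing nul leaves one empty piece at the end; drop it.
    if chunks and chunks[-1] == b"":
        chunks.pop()
    entries = []
    for chunk in chunks:
        if len(chunk) % 3 != 0:
            raise ValueError("tag list length is not a multiple of 3 bytes")
        entries.append([
            (chunk[i:i + 2].decode("ascii"), chr(chunk[i + 2]))
            for i in range(0, len(chunk), 3)
        ])
    return entries
```

The index of each list in the returned value is the integer that the TL data series uses to refer to it, so a TL value of 1 for the example above selects the list X1:C BC:Z.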
Tag values
The encodings used for different tags are stored in a map. The key is 3 bytes formed from the BAM tag id and type code, matching the TD dictionary described above. Unlike the Data Series Encoding Map, the key is stored in the map as an ITF8 encoded integer, constructed using (char1 << 16) + (char2 << 8) + type. For example, the 3-byte representation of OQ:Z is {0x4F, 0x51, 0x5A} and these bytes are interpreted as the integer key 0x004F515A, leading to an ITF8 byte stream {0xE0, 0x4F, 0x51, 0x5A}.
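The key construction and its ITF8 form can be checked with a short sketch. The function names are hypothetical, and the ITF8 encoder below follows the variable-length integer layout defined in the data types section of this specification; it is included only to reproduce the OQ:Z example.

```python
def tag_key(tag_id: str, type_code: str) -> int:
    """Combine a two-character tag id and a type code into the integer map key."""
    c1, c2 = tag_id.encode("ascii")
    return (c1 << 16) + (c2 << 8) + ord(type_code)

def itf8_encode(value: int) -> bytes:
    """Encode a 32-bit integer as ITF8 (1 to 5 bytes)."""
    v = value & 0xFFFFFFFF  # treat negative values as their 32-bit pattern
    if v < 0x80:
        return bytes([v])
    if v < 0x4000:
        return bytes([0x80 | (v >> 8), v & 0xFF])
    if v < 0x200000:
        return bytes([0xC0 | (v >> 16), (v >> 8) & 0xFF, v & 0xFF])
    if v < 0x10000000:
        return bytes([0xE0 | (v >> 24), (v >> 16) & 0xFF, (v >> 8) & 0xFF, v & 0xFF])
    return bytes([0xF0 | (v >> 28), (v >> 20) & 0xFF, (v >> 12) & 0xFF,
                  (v >> 4) & 0xFF, v & 0x0F])
```

Any key built from printable ASCII characters is at least 0x200000, so tag keys always occupy four ITF8 bytes with a leading 0xE0.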
| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| TAG ID 1:TAG TYPE 1 | encoding<byte[ ]> | read tag 1 | tag values (names and types are <br> available in the data series code) |
| $\ldots$ | | $\ldots$ | $\ldots$ |
| TAG ID N:TAG TYPE N | encoding<byte[]> | read tag N | $\ldots$ |
Note that tag values are encoded as arrays of bytes. The routines to convert tag values into byte arrays and back are the same as in BAM, except that the value type is captured in the tag key rather than in the value. Hence a value consumes 1 byte for types 'C' and 'c', 2 bytes for types 'S' and 's', 4 bytes for types 'I', 'i' and 'f', and a variable number of bytes for types 'H', 'Z' and 'B'.
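The byte counts above can be expressed as a small helper. This is a sketch under stated assumptions: the function name is hypothetical, type 'A' is treated as one byte as in BAM, 'Z' and 'H' values are taken to be nul-terminated as in BAM, and 'B' values are taken to use the BAM layout of a one-byte element type followed by a 32-bit little-endian element count.

```python
import struct

# Fixed-width BAM value types and their sizes in bytes.
FIXED_SIZES = {"A": 1, "c": 1, "C": 1, "s": 2, "S": 2, "i": 4, "I": 4, "f": 4}

def tag_value_size(type_code: str, data: bytes, offset: int = 0) -> int:
    """Return how many bytes one tag value of the given type occupies in data."""
    if type_code in FIXED_SIZES:
        return FIXED_SIZES[type_code]
    if type_code in ("Z", "H"):
        # Nul-terminated string; the terminator is counted as part of the value.
        return data.index(b"\0", offset) - offset + 1
    if type_code == "B":
        element_type = chr(data[offset])
        (count,) = struct.unpack_from("<I", data, offset + 1)
        return 1 + 4 + count * FIXED_SIZES[element_type]
    raise ValueError(f"unknown tag type {type_code!r}")
```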
8.5 Slice header block
The slice header block is never compressed (block method=raw). For reference mapped reads the slice header also defines the reference sequence context of the data blocks associated with the slice. Mapped reads can be stored along with placed unmapped$^{3}$ reads on the same reference within the same slice.
Slices with the Multiple Reference flag (-2) set as the sequence ID in the header may contain reads mapped to multiple external references, including unmapped$^{3}$ reads (placed on these references or unplaced), but multiple embedded references cannot be combined in this way. When multiple references are used, the RI data series will be used to determine the reference sequence ID for each record. This data series is not present when only a single reference is used within a slice.
The Unmapped (-1) sequence ID in the header is for slices containing only unplaced unmapped$^{3}$ reads.
A slice containing data that does not use the external reference in any sequence may set the reference MD5 sum to zero. This can happen because the data is unmapped or the sequence has been stored verbatim instead of via reference-differencing. This latter scenario is recommended for unsorted or non-coordinate-sorted data.
The slice header block contains the following fields.
| Data type | Name | Value |
| :--- | :--- | :--- |
| itf8 | reference sequence id | reference sequence identifier or <br> -1 for unmapped reads <br> -2 for multiple reference sequences. <br> This value must match that of its enclosing <br> container. |
| itf8 | alignment start | the alignment start position |
| itf8 | alignment span | the length of the alignment |
| itf8 | number of records | the number of records in the slice |
| ltf8 | record counter | 1-based sequential index of records in the <br> file/stream |
| itf8 | number of blocks | the number of blocks in the slice |
| itf8[] | block content ids | block content ids of the blocks in the slice |
| itf8 | embedded reference bases block content id | block content id for the embedded reference <br> sequence bases or -1 for none |
| byte[16] | reference md5 | MD5 checksum of the reference bases within <br> the slice boundaries. If this slice has <br> reference sequence id of -1 (unmapped) or -2 <br> (multi-ref) the MD5 should be 16 bytes of $\backslash 0$. <br> For embedded references, the MD5 can either <br> be all-zeros or the MD5 of the embedded <br> sequence. |
| byte[] | optional tags | a series of tag,type,value tuples encoded as <br> per BAM auxiliary fields. |
The alignment start and alignment span values should only be utilised during decoding if the slice has mapped data aligned to a single reference (reference sequence id $\geq 0$). For multi-reference slices or those with unmapped data, it is recommended to fill these fields with value 0.
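A decoder might apply this rule roughly as follows. The function names and return values are hypothetical and serve only to illustrate how the reference sequence id selects whether the alignment fields are meaningful.

```python
def slice_kind(ref_seq_id: int) -> str:
    """Classify a slice by the reference sequence id in its header."""
    if ref_seq_id >= 0:
        return "single-reference"
    if ref_seq_id == -1:
        return "unmapped"
    if ref_seq_id == -2:
        return "multi-reference"
    raise ValueError(f"invalid reference sequence id {ref_seq_id}")

def usable_alignment_fields(ref_seq_id: int, alignment_start: int, alignment_span: int):
    """Return (start, span) only when the header values should be used, else None."""
    if slice_kind(ref_seq_id) != "single-reference":
        # Multi-reference and unmapped slices: ignore whatever is stored.
        return None
    return (alignment_start, alignment_span)
```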
MD5sums should not be validated if the stored checksum is all-zero. Embedded references should follow the same capitalisation and alphabetical rules as applied to external references prior to MD5sum calculations. If an embedded reference is used, it is not a requirement that it exactly matches the reference used for sequence alignments. For example, it may contain "N" bases where coverage is absent or it could have different base calls for SNP variants. Hence when embedded sequences are used, the MD5sum refers to the checksum of the embedded sequence and should not be validated against any external reference files.
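The validation rule can be sketched as below. The function name is hypothetical, and upper-casing the bases before hashing is a simplifying assumption standing in for the full capitalisation and alphabet rules mentioned above.

```python
import hashlib

def reference_md5_ok(stored_md5: bytes, reference_bases: bytes) -> bool:
    """Check a slice's stored MD5 against the reference bases it covers."""
    if len(stored_md5) != 16:
        raise ValueError("the reference md5 field is always 16 bytes")
    if stored_md5 == bytes(16):
        # An all-zero checksum means validation is not performed.
        return True
    # Assumption: normalisation is reduced to upper-casing for this sketch.
    return hashlib.md5(reference_bases.upper()).digest() == stored_md5
```

For an embedded reference, `reference_bases` would be the embedded sequence itself rather than bases taken from an external reference file.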
Note that where an embedded reference differs from the original reference used for alignment, the MD and NM tags may need to be stored verbatim for records where the respective embedded and external reference substrings differ.
The optional tags are encoded in the same manner as BAM tags. That is, a series of binary encoded tags concatenated together, where each tag consists of a 2 byte key (matching [A-Za-z][A-Za-z0-9]) followed by a 1 byte type ([AfZHcCsSiIB]) followed by a string of bytes in a format defined by the type.
Tags starting in a capital letter are reserved while lowercase ones or those starting with X, Y or Z are user definable. Any tag not understood by a decoder should be skipped over without producing an error.
At present no tags are defined.
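The key naming rule for slice header tags can be illustrated with a small check. The function name and the returned labels are hypothetical.

```python
import re

# Two-byte key pattern quoted in the text above.
_KEY_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]")

def classify_slice_tag(key: str) -> str:
    """Return 'user' for user-definable keys and 'reserved' for the rest."""
    if not _KEY_PATTERN.fullmatch(key):
        raise ValueError(f"invalid tag key {key!r}")
    # Lowercase first letters, and X, Y or Z, are left to users.
    if key[0].islower() or key[0] in "XYZ":
        return "user"
    return "reserved"
```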
8.6 Core data block
A core data block is a bit stream (most significant bit first) consisting of data from one or more CRAM records. Please note that one byte could hold more than one CRAM record as a minimal CRAM record could be just a few bits long. The core data block has the following fields:
| Data type | Name | Value |
| :--- | :--- | :--- |
| bit[ ] | CRAM record 1 | The first CRAM record |
| $\ldots$ | $\ldots$ | $\ldots$ |
| bit[ ] | CRAM record N | The Nth CRAM record |
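Reading a most-significant-bit-first stream can be sketched with a small helper class. The class and method names are hypothetical; the sketch shows only the bit ordering, not any particular core encoding.

```python
class BitReader:
    """Read unsigned integers from a byte string, most significant bit first."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, counted in bits from the start

    def read_bits(self, n: int) -> int:
        value = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            # Bit 7 of each byte is consumed first, bit 0 last.
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value
```

Because the position is tracked in bits, consecutive reads may start and end in the middle of a byte, which is how several short records can share one byte.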
8.7 External data blocks
The relationship between the core data block and external data blocks is shown in the following picture:
Figure 5: The relationship between core and external encodings, and core and external data blocks.
The picture shows how a CRAM record (on the left) is distributed between the core data block and one or more external data blocks, via core or external encodings. The specific encodings presented are only examples for purposes of illustration. The main point is to distinguish between core bit encodings whose output is always stored in a core data block, and external byte encodings whose output is always stored in external data blocks.
9 End of file container
A special container is used to mark the end of a file or stream. It is required in version 3 or later. The idea is to provide an easy and quick way to detect that a CRAM file or stream is complete. The marker is basically an empty container with ref seq id set to -1 (unaligned) and alignment start set to 4542278.
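The two marker values can be reproduced from their ITF8 byte sequences, which this section lists for the EOF container (ff ff ff ff 0f and e0 45 4f 46). The decoder below is a sketch with a hypothetical function name, following the ITF8 layout defined in the data types section.

```python
def itf8_decode(data: bytes, offset: int = 0):
    """Decode one ITF8 integer; return (value, number of bytes consumed)."""
    b0 = data[offset]
    if b0 < 0x80:
        value, size = b0, 1
    elif b0 < 0xC0:
        value, size = ((b0 & 0x3F) << 8) | data[offset + 1], 2
    elif b0 < 0xE0:
        value = ((b0 & 0x1F) << 16) | (data[offset + 1] << 8) | data[offset + 2]
        size = 3
    elif b0 < 0xF0:
        value = (((b0 & 0x0F) << 24) | (data[offset + 1] << 16)
                 | (data[offset + 2] << 8) | data[offset + 3])
        size = 4
    else:
        value = (((b0 & 0x0F) << 28) | (data[offset + 1] << 20)
                 | (data[offset + 2] << 12) | (data[offset + 3] << 4)
                 | (data[offset + 4] & 0x0F))
        size = 5
    # Interpret the 32-bit pattern as a signed integer.
    if value >= 1 << 31:
        value -= 1 << 32
    return value, size
```

Decoding e0 45 4f 46 yields 4542278, which is 0x454F46, the ASCII bytes of "EOF"; this makes the marker easy to spot in a hex dump.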
The complete content of the EOF container is explained in detail below:
| hex bytes | data type | decimal value | field name |
| :---: | :---: | :---: | :---: |
| Container header | | | |
| 0f 00 00 00 | integer | 15 | size of blocks data |
| ff ff ff ff 0f | itf8 | -1 | ref seq id |
| e0 45 4f 46 | itf8 | 4542278 | alignment start |
| 00 | itf8 | 0 | alignment span |
| 00 | itf8 | 0 | number of records |
| 00 | itf8 | 0 | global record counter |
| 00 | itf8 | 0 | bases |
| 01 | itf8 | 1 | block count |
| 00 | array | 0 | landmarks |
| 05 bd d9 4f | integer | 1339669765 | container header CRC32 |
| Compression header block | | | |
| 00 | byte | 0 (RAW) | compression method |
| 01 | byte | 1 (COMPRESSION_HEADER) | block content type |
| 00 | itf8 | 0 | block content id |
| 06 | itf8 | 6 | compressed size |
| 06 | itf8 | 6 | uncompressed size |
| Compression header | | | |
| 01 | itf8 | 1 | preservation map byte size |
| 00 | itf8 | 0 | preservation map size |
| 01 | itf8 | 1 | encoding map byte size |
| 00 | itf8 | 0 | encoding map size |
| 01 | itf8 | 1 | tag encoding byte size |
| 00 | itf8 | 0 | tag encoding map size |
| ee 63 01 4b | integer | 1258382318 | block CRC32 |
When compiled together the EOF marker is 38 bytes long and in hex representation is: 0f 00 00 00 ff ff ff ff 0f e0 45 4f 46 00 00 00 00 01 00 05 bd d9 4f 00 01 00 06 06 01 00 01 00 01 00 ee 63 01 4b
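As a rough illustration, the marker bytes can be decoded again to recover the first few fields of the table. The helper `read_itf8` is a hypothetical name written for this sketch; it follows the ITF8 integer layout defined in the data types section. The final comparison assumes that the container header CRC32 covers all preceding header bytes and uses the same CRC-32 variant as `zlib.crc32`, so its result is left unasserted here.

```python
import zlib
from typing import Tuple

EOF_MARKER = bytes.fromhex(
    "0f 00 00 00 ff ff ff ff 0f e0 45 4f 46 00 00 00 00 01 00 05 bd d9 4f "
    "00 01 00 06 06 01 00 01 00 01 00 ee 63 01 4b"
)


def read_itf8(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one ITF8 value at buf[pos]; return (signed value, next position)."""
    b0 = buf[pos]
    if b0 < 0x80:    # 0xxxxxxx: 1 byte
        value, size = b0, 1
    elif b0 < 0xC0:  # 10xxxxxx: 2 bytes
        value, size = ((b0 & 0x3F) << 8) | buf[pos + 1], 2
    elif b0 < 0xE0:  # 110xxxxx: 3 bytes
        value = ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2]
        size = 3
    elif b0 < 0xF0:  # 1110xxxx: 4 bytes
        value = ((b0 & 0x0F) << 24) | (buf[pos + 1] << 16) \
            | (buf[pos + 2] << 8) | buf[pos + 3]
        size = 4
    else:            # 1111xxxx: 5 bytes, only the low 4 bits of the last byte used
        value = ((b0 & 0x0F) << 28) | (buf[pos + 1] << 20) \
            | (buf[pos + 2] << 12) | (buf[pos + 3] << 4) | (buf[pos + 4] & 0x0F)
        size = 5
    if value >= 1 << 31:  # reinterpret as a signed 32-bit integer
        value -= 1 << 32
    return value, pos + size


blocks_size = int.from_bytes(EOF_MARKER[0:4], "little")
ref_seq_id, pos = read_itf8(EOF_MARKER, 4)
alignment_start, pos = read_itf8(EOF_MARKER, pos)

# The header CRC32 occupies bytes 19..22; compare it with a CRC of bytes 0..18.
stored_crc = int.from_bytes(EOF_MARKER[19:23], "little")
header_crc_matches = zlib.crc32(EOF_MARKER[:19]) == stored_crc
```

Note that the alignment start value 4542278 is 0x454f46, whose three bytes are the ASCII characters "EOF".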
10 Record structure
A CRAM record is based on the SAM record but has additional features allowing for more efficient data storage. In contrast to a BAM record, a CRAM record uses bits as well as bytes for data storage. This way, for example, various coding techniques which output variable length binary codes can be used directly in CRAM. On the other hand, data series that do not require binary coding can be stored separately in external blocks with some other compression applied to them independently.
As CRAM data series may be interleaved within the same blocks${}^{4}$, understanding the order in which CRAM data series must be decoded is vital.
The overall flowchart is below, with more detailed description in the subsequent sections.
10.1 CRAM record
Both mapped and unmapped reads start with the following fields. Please note that the data series type refers to the logical data type and the data series name corresponds to the data series encoding map.
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | BF | BAM bit flags | see BAM bit flags below |
| int | CF | CRAM bit flags | see CRAM bit flags below |
| - | - | Positional data | See section 10.2 |
| - | - | Read names | See section 10.3 |
| - | - | Mate records | See section 10.4 |
| - | - | Auxiliary tags | See section 10.5 |
| - | - | Sequences | See sections 10.6 and 10.7 |
BAM bit flags (BF data series)
The following flags are duplicated from the SAM and BAM specification, with identical meaning. Note, however, that some of these flags can be derived during decode, so they may be omitted from the CRAM file and computed instead when both reads of a paired-end library reside within the same slice.
| Bit flag | Comment | Description |
| :---: | :---: | :---: |
| 0x1 | | template having multiple segments in sequencing |
| 0x2 | | each segment properly aligned according to the aligner |
| 0x4 | | segment unmapped ${}^{\mathrm{a}}$ |
| 0x8 | calculated ${}^{\mathrm{b}}$ or stored in the mate's info | next segment in template unmapped |
| 0x10 | | SEQ being reverse complemented |
| 0x20 | calculated ${}^{\mathrm{b}}$ or stored in the mate's info | SEQ of the next segment in the template being reverse complemented |
| 0x40 | | the first segment in the template ${}^{\mathrm{c}}$ |
| 0x80 | | the last segment in the template ${}^{\mathrm{c}}$ |
| 0x100 | | secondary alignment |
| 0x200 | | not passing quality controls |
| 0x400 | | PCR or optical duplicate |
| 0x800 | | supplementary alignment |
${}^{\mathrm{a}}$ Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions may be made about bits 0x2, 0x100 and 0x800.
${}^{\mathrm{b}}$ For segments within the same slice.
${}^{\mathrm{c}}$ Bits 0x40 and 0x80 reflect the read ordering within each template inherent in the sequencing technology used, which may be independent from the actual mapping orientation. If 0x40 and 0x80 are both set, the read is part of a linear template (one where the template sequence is expected to be in a linear order), but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template (such as one constructed by stitching together other templates) or when this information is lost during data processing.
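The footnote rules for bits 0x4, 0x40 and 0x80 can be expressed as a short sketch. The function names and the returned labels are hypothetical and exist only to make the four combinations of 0x40 and 0x80 explicit.

```python
def is_unmapped(bam_flags: int) -> bool:
    """Bit 0x4 is the only reliable indicator that a read is unmapped."""
    return bool(bam_flags & 0x4)


def segment_position(bam_flags: int) -> str:
    """Classify a read within its template from bits 0x40 and 0x80."""
    first = bool(bam_flags & 0x40)
    last = bool(bam_flags & 0x80)
    if first and last:
        return "middle"   # part of a linear template, neither first nor last
    if first:
        return "first"
    if last:
        return "last"
    return "unknown"      # index within the template is not known
```

For instance, a flag value of 0xC1 has both 0x40 and 0x80 set and is therefore classified as a middle segment.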
CRAM bit flags (CF data series)
The CRAM bit flags (also known as compression bit flags) expressed as an integer represent the CF data series. The following compression flags are defined for each CRAM read record:
| Bit flag | Name | Description |
| :--- | :--- | :--- |
| 0x1 | quality scores stored as array | quality scores can be stored as read features or as an array similar to read bases. |
| 0x2 | detached | mate information is stored verbatim (e.g. because the pair spans multiple slices or the fields differ to the CRAM computed method) |
| 0x4 | has mate downstream | tells if the next segment should be expected further in the stream |
| 0x8 | decode sequence as "*" | informs the decoder that the sequence is unknown and that any encoded reference differences are present only to recreate the CIGAR string. |
The following pseudocode describes the general process of decoding an entire CRAM record. The sequence data itself is in one of two encoding formats depending on whether the record is aligned (mapped).
Decode pseudocode
procedure DecodeRecord
    BAM_flags ← ReadItem(BF, Integer)
    CRAM_flags ← ReadItem(CF, Integer)
    DecodePositions                          ▷ See section 10.2
    DecodeNames                              ▷ See section 10.3
    DecodeMateData                           ▷ See section 10.4
    DecodeTagData                            ▷ See section 10.5
    if (BAM_flags AND 4) = 0 then            ▷ Unmapped flag is not set
        DecodeMappedRead                     ▷ See section 10.6
    else
        DecodeUnmappedRead                   ▷ See section 10.7
    end if
end procedure
This pseudocode is not meant to be a fully implementable programming language, but to act as an algorithmic guide to the order and structure of CRAM decoding.
The ReadItem function referred to above takes two arguments: the data series name and the data type used by the Encoding. It uses the codec specified in the Container Compression Header to retrieve the next value from that data series. Note there is only one permitted data type per data series, so the second argument is redundant and is included only as an aide-mémoire.
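To make the role of ReadItem more concrete, the following sketch models each data series as an independent stream from which the next value is pulled on demand. `SeriesReader` and `decode_record_flags` are hypothetical names, and the pre-decoded integer lists stand in for real codecs, which this sketch does not implement.

```python
from typing import Dict, Iterator, List, Tuple


class SeriesReader:
    """Hypothetical stand-in for ReadItem: one value stream per data series."""

    def __init__(self, series: Dict[str, List[int]]) -> None:
        self._streams: Dict[str, Iterator[int]] = {
            name: iter(values) for name, values in series.items()
        }

    def read_item(self, name: str) -> int:
        """Return the next value of the named data series."""
        return next(self._streams[name])


def decode_record_flags(reader: SeriesReader) -> Tuple[int, int, str]:
    """Read BF then CF, and report which sequence decoder would run."""
    bam_flags = reader.read_item("BF")
    cram_flags = reader.read_item("CF")
    # Positions, names, mate data and tags would be decoded here, in that order.
    branch = "mapped" if (bam_flags & 0x4) == 0 else "unmapped"
    return bam_flags, cram_flags, branch


reader = SeriesReader({"BF": [0x63, 0x4D], "CF": [0x5, 0x3]})
first = decode_record_flags(reader)
second = decode_record_flags(reader)
```

Because every series keeps its own cursor, the order in which series are read matters only per series, which is why the decode order of the fields is fixed by the specification.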
10.2 CRAM positional data
Following the bit-wise BAM and CRAM flags, CRAM encodes position-related data including reference, alignment positions and length, and read-group. Positional data is stored for both mapped and unmapped sequences, as unmapped data may still be "placed" at a specific location in the genome (without being aligned). Typically this is done to keep a sequence pair (paired-end or mate-pair sequencing libraries) together when one of the pair aligns and the other does not.
For reads stored in a position-sorted slice, the AP-delta flag in the compression header preservation map should be set and the AP data series will be delta encoded, using the slice alignment-start value as the first position to delta against. Note for multi-reference slices this may mean that the AP series includes negative values, such as when moving from an alignment at the end of one reference sequence to the start of the next or to unmapped unplaced data. When the AP-delta flag is not set the AP data series is stored as a normal integer value.
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | RI | ref id | reference sequence id (only present in multiref slices) |
| int | RL | read length | the length of the read |
| int | AP | alignment start | the alignment start position |
| int | RG | read group | the read group identifier expressed as the Nth record in the header, starting from 0 with -1 for no group |
procedure DecodePositions
    if slice_header.reference_sequence_id = -2 then
        reference_id ← ReadItem(RI, Integer)
    else
        reference_id ← slice_header.reference_sequence_id
    end if
    read_length ← ReadItem(RL, Integer)
    if container_pmap.AP_delta ≠ 0 then
        if first_record_in_slice then
            last_position ← slice_header.alignment_start
        end if
        alignment_position ← ReadItem(AP, Integer) + last_position
        last_position ← alignment_position
    else
        alignment_position ← ReadItem(AP, Integer)
    end if
    read_group ← ReadItem(RG, Integer)
end procedure
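The AP handling in the procedure above can be sketched as follows. The function name and its list-based input are assumptions made for illustration; in a real decoder the AP values would come from the codec for that data series, one per record.

```python
from typing import Iterable, List


def decode_alignment_positions(
    ap_values: Iterable[int], slice_alignment_start: int, ap_delta: bool
) -> List[int]:
    """Turn raw AP data series values into absolute alignment positions."""
    positions: List[int] = []
    # The first delta is taken against the slice header alignment start.
    last_position = slice_alignment_start
    for ap in ap_values:
        position = ap + last_position if ap_delta else ap
        positions.append(position)
        last_position = position
    return positions


delta_decoded = decode_alignment_positions([0, 5, 12, 0], 1000, ap_delta=True)
plain_decoded = decode_alignment_positions([1000, 1005], 1000, ap_delta=False)
```

With delta encoding the stored values stay small for position-sorted data, which is the reason the flag is recommended for sorted slices.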
10.3 Read names (RN data series)
Read names can be preserved in the CRAM format, but this is optional and is governed by the RN preservation map key in the container compression header. See section 8.4. When read names are not preserved the CRAM decoder should generate names, typically based on the file name and a numeric ID of the read using the record counter field of the slice header block. Note read names may still be preserved even when the RN compression header key indicates otherwise, such as where a read is part of a read-pair and the pair spans multiple slices. In this situation the record will be marked as detached (see the CF data series) and the mate data below (section 10.4) will contain the read name.
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| byte[ ] | RN | read names | read names |
procedure DecodeNames
    if container_pmap.read_names_included = 1 then
        read_name ← ReadItem(RN, Byte[])
    else
        read_name ← GenerateName
    end if
end procedure
10.4 Mate records
There are two ways in which mate information can be preserved in CRAM. If the next fragment is not in the same slice, we store verbatim copies of the insert size, mate reference chromosome and positions, and mate flags (mapped status, orientation) for both records. In this case both records are labelled as "detached" in the CF data series using flag 0x2.
If this and the next fragment are within the same slice, we can derive much of this information by comparing the two records. The upstream record has the CF flag 0x4 (mate downstream) set and stores the number of records to skip (in the NF data series) between this record and the record for the next fragment on this template, with zero meaning the next fragment is also the next record. The downstream record has neither CF flag 0x2 (detached) nor 0x4 (mate downstream) set, nor does it use the NF data series (unless it also has an additional "next fragment" to refer to).
It is not mandatory to use this deduplication approach, and CRAM write implementations may optionally label data as detached even when all records for the template reside in the same slice. One reason to do this may be to preserve inconsistent data so that it round-trips through the CRAM format with full fidelity.
| Data series type | Data series name | Description |
| :--- | :--- | :--- |
| int | NF | the number of records to skip to the next fragment |
In the above case, the NS (mate reference name), NP (mate position) and TS (template size) fields for both records should be derived once the mate has also been decoded. Mate reference name and position are obvious and simply copied from the mate. The template size is computed using the method described in the SAM specification; the inclusive distance from the leftmost to rightmost mapped bases with the sign being positive for the leftmost record and negative for the rightmost record.
If the next fragment is not found within this slice then the following structure is included into the CRAM record. Note there are cases where read-pairs within the same slice may be marked as detached and use this structure, such as to store mate-pair information that does not match the algorithm used by CRAM for computing the mate data on-the-fly.
| Data series type | Data series name | Description |
| :--- | :--- | :--- |
| int | MF | next mate bit flags, see table below |
| byte[ ] | RN | the read name (if and only if not known already) |
| int | NS | mate reference sequence identifier |
| int | NP | mate alignment start position |
| int | TS | the size of the template (insert size) |
Next mate bit flags (MF data series)
The next mate bit flags expressed as an integer represent the MF data series. These represent the missing bits we excluded from the BF data series (when compared to the full SAM/BAM flags). The following bit flags are defined:
| Bit flag | Name | Description |
| :--- | :--- | :--- |
| 0x1 | mate negative strand bit | the bit is set if the mate is on the negative strand |
| 0x2 | mate unmapped bit | the bit is set if the mate is unmapped |
Decode mate pseudocode
In the following pseudocode we are assuming the current record is this and its mate is next_frag.
procedure DecodeMateData
    if CF AND 2 then                                    ▷ Detached from mate
        mate_flags ← ReadItem(MF, Integer)
        if mate_flags AND 1 then
            bam_flags ← bam_flags OR 0x20               ▷ Mate is reverse-complemented
        end if
        if mate_flags AND 2 then
            bam_flags ← bam_flags OR 0x08               ▷ Mate is unmapped
        end if
        if container_pmap.read_names_included ≠ 1 then
            read_name ← ReadItem(RN, Byte[])
        end if
        mate_ref_id ← ReadItem(NS, Integer)
        mate_position ← ReadItem(NP, Integer)
        template_size ← ReadItem(TS, Integer)
    else if CF AND 4 then                               ▷ Mate is downstream
        if next_frag.bam_flags AND 0x10 then
            this.bam_flags ← this.bam_flags OR 0x20     ▷ next segment reverse complemented
        end if
        if next_frag.bam_flags AND 0x04 then
            this.bam_flags ← this.bam_flags OR 0x08     ▷ next segment unmapped
        end if
        next_frag ← ReadItem(NF, Integer)
        next_record ← this_record + next_frag + 1
        Resolve mate_ref_id for this_record and next_record once both have been decoded
        Resolve mate_position for this_record and next_record once both have been decoded
        Find leftmost and rightmost mapped coordinate in records this_record and next_record.
        For leftmost of this_record and next_record: template_size ← rightmost - leftmost + 1
        For rightmost of this_record and next_record: template_size ← -(rightmost - leftmost + 1)
    end if
end procedure
Note as with the SAM specification a template may be permitted to have more than two alignment records. In this case the "mate" for each record is considered to be the next record, with the mate for the last record being the first to form a circular list. The above algorithm is a simplification that does not deal with this scenario. The full method needs to observe when record this + NF is also labelled as having an additional mate downstream. One recommended approach is to resolve the mate information in a second pass, once the entire slice has been decoded. The final segment in the mate chain needs to set bam_flags fields 0x20 and 0x08 accordingly based on the first segment. This is also not listed in the above algorithm, for brevity.
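For the simple two-record case, the second-pass resolution of mate fields might look like the sketch below. The `Record` type and `resolve_pair` function are hypothetical. The sketch assumes both records are mapped to the same reference and, as a further assumption not taken from the text, treats the first argument as leftmost when both start positions are equal. Templates with more than two records are not handled.

```python
from dataclasses import dataclass


@dataclass
class Record:
    ref_id: int
    start: int              # leftmost mapped base (inclusive)
    end: int                # rightmost mapped base (inclusive)
    bam_flags: int = 0
    mate_ref_id: int = -1
    mate_position: int = 0
    template_size: int = 0


def resolve_pair(this: Record, mate: Record) -> None:
    """Fill in mate fields for two records of one template in the same slice."""
    for rec, other in ((this, mate), (mate, this)):
        rec.mate_ref_id = other.ref_id
        rec.mate_position = other.start
        if other.bam_flags & 0x10:   # other is reverse complemented
            rec.bam_flags |= 0x20
        if other.bam_flags & 0x04:   # other is unmapped
            rec.bam_flags |= 0x08
    leftmost = min(this.start, mate.start)
    rightmost = max(this.end, mate.end)
    span = rightmost - leftmost + 1  # inclusive distance
    first, second = (this, mate) if this.start <= mate.start else (mate, this)
    first.template_size = span       # positive for the leftmost record
    second.template_size = -span     # negative for the rightmost record


this = Record(ref_id=0, start=100, end=149)
mate = Record(ref_id=0, start=300, end=349, bam_flags=0x10)
resolve_pair(this, mate)
```

Running the resolution only after the whole slice has been decoded avoids needing the downstream record before it has been read.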
10.5 Auxiliary tags
Tags are encoded using a tag line (TL data series) integer into the tag dictionary (TD field in the compression header preservation map, see section 8.4). See section 8.4 for a more detailed description of this process.
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | TL | tag line | an index into the tag dictionary (TD) |
| $*$ | ??? | tag name/type | 3 character key (2 tag identifier and 1 tag type), as specified by the tag dictionary |
procedure DecodeTagData
    tag_line ← ReadItem(TL, Integer)
    for all ele ∈ container_pmap.tag_dict(tag_line) do
        name ← first two characters of ele
        tag(type) ← last character of ele
        tag(name) ← ReadItem(ele, Byte[])
    end for
end procedure
In the above procedure, name is a two letter tag name and type is one of the permitted types documented in the SAM/BAM specification. Type is one of A (a single character), c (signed 8-bit integer), C (unsigned 8-bit integer), s (signed 16-bit integer), S (unsigned 16-bit integer), i (signed 32-bit integer), I (unsigned 32-bit integer), f (32-bit float), Z (nul-terminated string), H (nul-terminated string of hex digits) and B (binary data in array format with the first byte being one of c, C, s, S, i, I, f using the meaning above, a 32-bit integer for the number of array elements, followed by array data encoded using the specified format). All integers are little endian encoded.
For example a SAM tag MQ:i has name MQ and type i and will be decoded using one of the MQc, MQC, MQs, MQS, MQi and MQI data series depending on size and sign of the integer value.
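The type codes listed above map naturally onto little-endian `struct` formats. The sketch below is illustrative only: `decode_tag_value` is a hypothetical helper, it expects the raw bytes of a single tag value, and it reads the B array element count as an unsigned 32-bit integer, which is an assumption about the signedness.

```python
import struct

# Little-endian struct formats for the fixed-width tag types.
_FIXED = {"c": "<b", "C": "<B", "s": "<h", "S": "<H",
          "i": "<i", "I": "<I", "f": "<f"}


def decode_tag_value(tag_type: str, raw: bytes):
    """Decode the raw bytes of one auxiliary tag value according to its type."""
    if tag_type == "A":
        return chr(raw[0])
    if tag_type in _FIXED:
        return struct.unpack_from(_FIXED[tag_type], raw)[0]
    if tag_type in ("Z", "H"):
        # Nul-terminated string; H holds hex digits and is returned as text.
        return raw.split(b"\x00", 1)[0].decode("ascii")
    if tag_type == "B":
        subtype = chr(raw[0])                        # one of c, C, s, S, i, I, f
        count = struct.unpack_from("<I", raw, 1)[0]  # number of array elements
        fmt = "<" + str(count) + _FIXED[subtype][1]
        return list(struct.unpack_from(fmt, raw, 5))
    raise ValueError("unknown tag type: " + tag_type)


# The data series name is the two letter tag name plus the type character.
series_name = "MQ" + "C"
array_raw = b"s" + struct.pack("<I", 2) + struct.pack("<2h", -1, 300)
array_value = decode_tag_value("B", array_raw)
```

A value such as mapping quality 60 fits in an unsigned byte, so an encoder could place it in the MQC series, where it would be stored as the single byte 0x3c.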
Note that some auxiliary tags can be created automatically during decode, so they can optionally be removed by the encoder. However, if the decoder finds a tag stored verbatim it should use this in preference to automatically computing the value.
The RG (read group) auxiliary tag should be created if the read group (RG data series) value is not -1.
The MD and NM auxiliary tags store the differences (an edit string) between the sequence and the reference along with the number of mismatches. These may optionally be created on-the-fly during reference-based sequence reconstruction and should match the description provided in the SAMtags document. An encoder may decide to store these verbatim when no reference is used or where the automatically constructed values differ from the input data.
Note there is no mechanism to describe which records have MD/NM present and which do not. If this is deemed important, the only recourse is to store all MD and NM tags verbatim and to request that the decoding software does not automatically generate its own for records that have no stored MD and NM tags.
10.6 Mapped reads
Read feature records
Read features are used to store read details that are expressed using read coordinates (e.g. base differences with respect to the reference sequence). The read feature records start with the number of read features followed by the read features themselves. Each read feature has its position encoded as the distance since the last feature position, or as the absolute position (i.e. delta vs zero) for the first feature. Finally the single mapping quality and per-base quality scores are stored.
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | FN | number of read features | the number of read features |
| int | FP | in-read-position $^{\mathrm{a}}$ | delta-position of the read feature |
| byte | FC | read feature code | See feature codes below |
| $*$ | $*$ | read feature data $^{\mathrm{a}}$ | See feature codes below |
| int | MQ | mapping qualities | mapping quality score |
| byte[read length] | QS | quality scores | the base qualities, if preserved |

$^{\mathrm{a}}$ Repeated FN times, once for each read feature.
Read feature codes
Each feature code has its own associated data series containing further information specific to that feature. The following codes are used to distinguish variations in read coordinates:
| Feature code | Id | Data series type | Data series name | Description |
| :---: | :---: | :---: | :---: | :---: |
| Bases | b (0x62) | byte[] | BB | a stretch of bases |
| Scores | q (0x71) | byte[] | QQ | a stretch of scores |
| Read base | B (0x42) | byte,byte | BA,QS | a base and associated quality score |
| Substitution | X (0x58) | byte | BS | base substitution codes, SAM operators X, M and = |
| Insertion | I (0x49) | byte[] | IN | inserted bases, SAM operator I |
| Deletion | D (0x44) | int | DL | number of deleted bases, SAM operator D |
| Insert base | i (0x69) | byte | BA | single inserted base, SAM operator I |
| Quality score | Q (0x51) | byte | QS | single quality score |
| Reference skip | N (0x4E) | int | RS | number of skipped bases, SAM operator N |
| Soft clip | S (0x53) | byte[] | SC | soft clipped bases, SAM operator S |
| Padding | P (0x50) | int | PD | number of padded bases, SAM operator P |
| Hard clip | H (0x48) | int | HC | number of hard clipped bases, SAM operator H |
Note for compatibility with BAM, all base comparisons should be done in a case-insensitive manner, and all bases written to the SC, IN and BA data series should be in upper-case.
Base substitution codes (BS data series)
A base substitution is defined as a change from one nucleotide base (reference base) to another (read base), including N as an unknown or missing base. There are 5 supported reference bases (ACGTN), with 4 possible substitutions for each base. Any other base type, such as an ambiguity code, must be written verbatim using the BA data series.
The codes for all possible substitutions are stored in a two-dimensional substitution matrix, indexed by reference base (A, C, G, T, N) and BS code (0-3), with each matrix element holding the modified base.
Substitution Matrix Format
There are 5 possible base types supported by the BS data series: A, C, G, T and N. Hence for any reference base there are 4 possible substitutions. Each of these substitution possibilities is numbered 0 to 3. The full list of substitution codes for a specific reference base is therefore four 2-bit numbers (0-3), listed in the order A, C, G, T, N with the reference base itself omitted. These are packed into a single byte with the high 2 bits first.
For example, for reference base C we would record the BS numerical values for substituting C with A, G, T and N respectively. If we wish A = 1, G = 0, T = 2 and N = 3 then we would store binary 01001011, or hex 0x4B.
The full substitution matrix is 5 bytes, each storing the 4 BS codes for reference base A, C, G, T and N respectively.
A complete matrix that maps C/G together and A/T together may look like this:
| | Seq. base | | | | |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Ref. base | **A** | **C** | **G** | **T** | **N** |
| A | - | 1 | 2 | 0 | 3 |
| C | 1 | - | 0 | 2 | 3 |
| G | 2 | 0 | - | 1 | 3 |
| T | 0 | 2 | 1 | - | 3 |
| N | 0 | 1 | 2 | 3 | - |
This would be encoded as
binary 01100011, 01001011, 10000111, 00100111, 00011011
or hex 0x63, 0x4b, 0x87, 0x27, 0x1b.
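The packing rule can be checked with a short sketch. The function names and the dictionary layout of `EXAMPLE` are illustrative assumptions, not part of the format; the values are those of the example matrix above.

```python
BASES = "ACGTN"

def pack_row(ref, codes):
    """Pack the four 2-bit BS codes for one reference base, high bits first."""
    byte = 0
    for base in BASES:
        if base == ref:
            continue  # the reference base itself has no substitution code
        byte = (byte << 2) | codes[base]
    return byte

def pack_matrix(matrix):
    """Return the 5-byte substitution matrix, one byte per reference base."""
    return bytes(pack_row(ref, matrix[ref]) for ref in BASES)

# BS code for each (reference base, read base) pair from the example matrix.
EXAMPLE = {
    "A": {"C": 1, "G": 2, "T": 0, "N": 3},
    "C": {"A": 1, "G": 0, "T": 2, "N": 3},
    "G": {"A": 2, "C": 0, "T": 1, "N": 3},
    "T": {"A": 0, "C": 2, "G": 1, "N": 3},
    "N": {"A": 0, "C": 1, "G": 2, "T": 3},
}

print(pack_matrix(EXAMPLE).hex())  # 634b87271b
```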
To decode, we would use the following lookup table, showing the same data as above with codes sorted into 0, 1, 2, 3 order.
| | BS Code | | | |
| :--- | :---: | :---: | :---: | :---: |
| Ref. base | **0** | **1** | **2** | **3** |
| A | T | C | G | N |
| C | G | A | T | N |
| G | C | T | A | N |
| T | A | G | C | N |
| N | A | C | G | T |
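Building such a lookup table from the 5 stored bytes might look like the sketch below. `unpack_matrix` is a hypothetical name, and the input is assumed to be the 5-byte matrix exactly as laid out in the preceding section.

```python
BASES = "ACGTN"

def unpack_matrix(raw):
    """Return {reference base: [read base for BS code 0, 1, 2, 3]}."""
    table = {}
    for ref, byte in zip(BASES, raw):
        others = [b for b in BASES if b != ref]
        row = [None] * 4
        for i, base in enumerate(others):
            code = (byte >> (6 - 2 * i)) & 0x3  # high 2 bits belong to the first base
            row[code] = base
        table[ref] = row
    return table

lookup = unpack_matrix(bytes.fromhex("634b87271b"))
# A substitution feature with BS code 0 against reference base C yields read base G.
read_base = lookup["C"][0]
```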
Substitution Code Assignment
There is no strict requirement on using a specific substitution matrix, nor that it be optimal. However, one strategy may be to ensure the most common substitution is always given code 0, the next most common code 1, and so on. This means the distribution of BS values will be skewed towards lower values, which helps improve compression over more uniformly distributed frequencies.
For example, let us assume the following substitution frequencies for base A:
AC: 15%
AG: 25%
AT: 55%
AN: 5%
Then the substitution codes are T = 0, G = 1, C = 2, N = 3.
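One way to implement this frequency-ranked strategy is sketched below. It is only one possible encoder policy; `assign_codes` is a hypothetical name, and ties are broken by input order, which is an arbitrary choice.

```python
def assign_codes(freqs):
    """Map each substituted base to a BS code, most frequent substitution first."""
    ranked = sorted(freqs, key=lambda base: -freqs[base])
    return {base: code for code, base in enumerate(ranked)}

# Frequencies (in percent) for substitutions of reference base A.
codes_for_a = assign_codes({"C": 15, "G": 25, "T": 55, "N": 5})
```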
Decode mapped read pseudocode
procedure DecodeMappedRead
    feature_number ← ReadItem(FN, Integer)
    last_feature_position ← 0
    for i ← 1 to feature_number do
        DecodeFeature
    end for
    mapping_quality ← ReadItem(MQ, Integer)
    if CF AND 1 then    ▷ Quality stored as an array
        for i ← 1 to read_length do
            quality_score ← ReadItem(QS, Integer)
        end for
    end if
end procedure
procedure DecodeFeature
    feature_code ← ReadItem(FC, Integer)
    feature_position ← ReadItem(FP, Integer) + last_feature_position
    last_feature_position ← feature_position
    if feature_code = 'B' then
        base ← ReadItem(BA, Byte)
        quality_score ← ReadItem(QS, Byte)
    else if feature_code = 'X' then
        substitution_code ← ReadItem(BS, Byte)
    else if feature_code = 'I' then
        inserted_bases ← ReadItem(IN, Byte[])
    else if feature_code = 'S' then
        softclip_bases ← ReadItem(SC, Byte[])
    else if feature_code = 'H' then
        hardclip_length ← ReadItem(HC, Integer)
    else if feature_code = 'P' then
        pad_length ← ReadItem(PD, Integer)
    else if feature_code = 'D' then
        deletion_length ← ReadItem(DL, Integer)
    else if feature_code = 'N' then
        ref_skip_length ← ReadItem(RS, Integer)
    else if feature_code = 'i' then
        base ← ReadItem(BA, Byte)
    else if feature_code = 'b' then
        bases ← ReadItem(BB, Byte[])
    else if feature_code = 'q' then
        quality_scores ← ReadItem(QQ, Byte[])
    else if feature_code = 'Q' then
        quality_score ← ReadItem(QS, Byte)
    end if
end procedure
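The feature loop above can be sketched in Python as a table-driven dispatch. This is an illustration under assumptions: `read_item(series, type)` is a hypothetical callback standing in for ReadItem, and the returned tuple layout is chosen only for readability.

```python
# Feature code -> data series (and their types) read for that feature.
FEATURE_SERIES = {
    "B": [("BA", "byte"), ("QS", "byte")],
    "X": [("BS", "byte")],
    "I": [("IN", "byte[]")],
    "S": [("SC", "byte[]")],
    "H": [("HC", "int")],
    "P": [("PD", "int")],
    "D": [("DL", "int")],
    "N": [("RS", "int")],
    "i": [("BA", "byte")],
    "b": [("BB", "byte[]")],
    "q": [("QQ", "byte[]")],
    "Q": [("QS", "byte")],
}

def decode_features(read_item):
    """Decode all read features of one record as (code, position, values) tuples."""
    features = []
    position = 0  # the first delta is relative to zero
    for _ in range(read_item("FN", "int")):
        code = chr(read_item("FC", "byte"))
        position += read_item("FP", "int")  # delta since the previous feature
        values = [read_item(series, kind) for series, kind in FEATURE_SERIES[code]]
        features.append((code, position, values))
    return features
```

Keeping the code-to-series mapping in a table means the position bookkeeping is written once rather than repeated in every branch.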
10.7 Unmapped reads
The CRAM record structure for unmapped reads has the following additional fields:
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| byte[read length] | BA | bases | the read bases |
| byte[read length] | QS | quality scores | the base qualities, if preserved |
procedure DecodeUnmappedRead
    for i ← 1 to read_length do
        base ← ReadItem(BA, Byte)
    end for
    if CF AND 1 then    ▷ Quality stored as an array
        for i ← 1 to read_length do
            quality_score ← ReadItem(QS, Byte)
        end for
    end if
end procedure
11 Reference sequences
The CRAM format is built around the use of reference sequences, even though in some cases they are not required. In contrast to BAM, CRAM has strict rules about reference sequences.
1. The M5 (sequence MD5 checksum) field of the @SQ sequence record in the BAM header is required, and the UR (URI for the sequence FASTA file, optionally gzipped) field is strongly advised. The rule for calculating the MD5 is to remove any non-base symbols (such as \n, the sequence name or length, and spaces) and upper-case the rest. Here are some examples:
Please note that the latter calculates the checksum for 11 bases from position 10 (inclusive) to 20 (inclusive) and the bases are counted 1-based, so the first base position is 1.
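The checksum rule can be sketched in Python as follows. This is a minimal illustration: `reference_md5` and `region` are hypothetical helpers, the input is assumed to be FASTA-style text lines, and only header lines and whitespace are stripped, which is a simplification of "non-base symbols".

```python
import hashlib

def reference_md5(lines):
    """MD5 of a reference sequence given FASTA-style lines (hypothetical helper)."""
    md5 = hashlib.md5()
    for line in lines:
        if line.startswith(">"):               # sequence name line, not bases
            continue
        bases = "".join(line.split()).upper()  # drop newlines and spaces, upper-case
        md5.update(bases.encode("ascii"))
    return md5.hexdigest()

def region(sequence, start, end):
    """Bases from start to end, both inclusive and 1-based."""
    return sequence[start - 1:end]
```

With these conventions, `region(seq, 10, 20)` returns 11 bases, matching the inclusive 1-based counting described above.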
2. All CRAM reader implementations are expected to check for reference MD5 checksums and report any missing or mismatching entries. Consequently, all writer implementations are expected to ensure that all checksums are injected or checked during compression time.
3. In some cases reads may be mapped beyond the reference sequence. All out-of-range reference bases are assumed to be 'N'.
4. MD5 checksum bytes in the slice header should be ignored for unmapped or multi-reference slices.
12 Indexing
General notes
Indexing is only valid on coordinate (reference ID and then leftmost position) sorted files.
Please note that CRAM indexing is external to the file format itself and may change independently of the file format specification in the future. For example, a new type of index file may appear.
Individual records are not indexed in CRAM files; slices should be used instead as the unit of random access. Another important difference between CRAM and BAM indexing is that the CRAM container header and compression header block (the first block in the container) must always be read before decoding a slice. Therefore two read operations are required for random access in CRAM.
Indexing a CRAM file is deemed to be a lightweight operation because it usually does not require any CRAM records to be read. Indexing information can be obtained from container headers, namely sequence id, alignment start and span, container start byte offset and slice byte offset inside the container (landmarks). The exception to this is with multi-reference containers, where the "RI" data series must be read.
CRAM index
A CRAM index is a gzipped tab delimited file containing the following columns:
1. Reference sequence id
2. Alignment start (ignored on read for unmapped slices, set to 0 on write)
3. Alignment span (ignored on read for unmapped slices, set to 0 on write)
4. Absolute byte offset of the Container header in the file.
5. Relative byte offset of the Slice header block, from the end of the container header. This is the same as the "landmark" field in the container header.
6. Slice size in bytes (including slice header and all blocks).
Each line represents a slice in the CRAM file. Please note that all slices must be listed in the index file.
Multi-reference slices may need to have multiple lines for the same slice; one for each reference contained within that slice. In this case the index reference sequence ID will be the actual reference ID (from the "RI" data series) and not -2.
Slices containing solely unmapped unplaced data (reference ID -1) still require values for all columns, although the alignment start and span will be ignored. It is recommended that they are both set to zero.
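Reading such an index can be sketched with the Python standard library. The names `CraiEntry` and `read_crai` and the field names are illustrative assumptions; the sketch takes an already opened binary file object and performs no validation beyond integer conversion.

```python
import gzip
import io
from typing import NamedTuple

class CraiEntry(NamedTuple):
    ref_id: int            # reference sequence id
    start: int             # alignment start
    span: int              # alignment span
    container_offset: int  # absolute byte offset of the container header
    slice_offset: int      # landmark: offset from the end of the container header
    slice_size: int        # slice size in bytes

def read_crai(fileobj):
    """Parse a gzipped, tab delimited CRAM index from a binary file object."""
    entries = []
    with gzip.open(fileobj, "rt") as text:
        for line in text:
            if line.strip():
                entries.append(CraiEntry(*map(int, line.split("\t"))))
    return entries

# Usage with an in-memory index holding one mapped and one unmapped slice.
raw = gzip.compress(b"0\t100\t500\t1000\t200\t3000\n-1\t0\t0\t5000\t150\t2500\n")
entries = read_crai(io.BytesIO(raw))
```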
To illustrate this, the absolute and relative offsets used in a three-slice container are shown in the diagram below.
BAM index
BAM indexes are supported by using 4-byte integer pointers called landmarks that are stored in the container header. A BAM index pointer is a 64-bit value with 48 bits reserved for the BAM block start position and 16 bits reserved for the in-block offset. When used to index CRAM files, the first 48 bits are used to store the CRAM container start position and the last 16 bits are used to store the index of the landmark in the landmark array stored in the container header. The landmark index can be used to access the appropriate slice.
The above indexing scheme treats CRAM slices as individual records in a BAM file. This allows BAM indexing to be applied to CRAM files; however, it introduces some overhead when seeking to a specific alignment start, because all preceding records in the slice must be read and discarded.
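The 48/16-bit split can be sketched as plain bit arithmetic. This sketch assumes the container start occupies the upper 48 bits and the landmark index the lower 16 bits, mirroring the usual BAM virtual offset layout; the function names are hypothetical.

```python
def pack_pointer(container_start, landmark_index):
    """Combine a container start position and a landmark index into 64 bits."""
    if not 0 <= container_start < (1 << 48):
        raise ValueError("container start does not fit in 48 bits")
    if not 0 <= landmark_index < (1 << 16):
        raise ValueError("landmark index does not fit in 16 bits")
    return (container_start << 16) | landmark_index

def unpack_pointer(pointer):
    """Split a 64-bit pointer back into (container start, landmark index)."""
    return pointer >> 16, pointer & 0xFFFF
```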
13 Encodings
13.1 Introduction
The basic idea for codings is to efficiently represent some values in binary format. This can be achieved in a number of ways that most frequently involve some knowledge about the nature of the values being encoded, for example, distribution statistics. The methods for choosing the best encoding and determining its parameters are very diverse and are not part of the CRAM format specification, which only describes how the information needed to decode the values should be stored.
Note two of the encodings (Golomb and Golomb-Rice) are listed as deprecated. These are still formally part of the CRAM specification, but have not been used by the primary implementations and may not be well supported. Therefore their use is permitted, but not recommended.
Offset
Many of the codings listed below encode positive integer numbers. An integer offset value is used to allow any integer numbers, and not just positive ones, to be encoded. It can also be used for monotonically decreasing distributions with the maximum not equal to zero. For example, given an offset of 10 and a value to be encoded of 1, the actually encoded value would be offset + value = 11. Then when decoding, the offset would be subtracted from the decoded value.
13.2 EXTERNAL: codec ID 1
Can encode types Byte, Integer.
The EXTERNAL coding is simply storage of data verbatim to an external block with a given ID. If the type is Byte the data is stored as-is, otherwise for Integer type the data is stored in ITF8.
Parameters
CRAM format defines the following parameters of EXTERNAL coding:
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | external id | id of an external block containing the byte stream |
13.3 Huffman coding: codec ID 3
Can encode types Byte, Integer.
Huffman coding replaces symbols (values to encode) by binary codewords, with common symbols having shorter codewords such that the total message of binary codewords is shorter than using uniform binary codeword lengths. The general process consists of the following steps.
1. Obtain symbol code lengths.
   - If encoding:
     - Compute symbol frequencies.
     - Compute code lengths from frequencies.
   - If decoding:
     - Read code lengths from codec parameters.
2. Encode or decode bits as per the symbol to codeword table. Codewords have the "prefix property" that no codeword is a prefix of another codeword, enabling unambiguous decode bit by bit.
The use of canonical Huffman codes means that we only need to store the code lengths and use the same algorithm in both encoder and decoder to generate the codewords. This is achieved by ensuring our symbol alphabet has a natural sort order and codewords are assigned in numerical order.
Important note: for alphabets with only one value, the codeword will be zero bits long. This makes the Huffman codec an efficient mechanism for specifying constant values.
Canonical code computation
1. Sort the alphabet in ascending order by bit-length, and then by numerical order of the values.
2. The first symbol in the list is assigned a codeword of the same length as the symbol's original codeword, but consisting of all zeros. This will often be a single zero ('0').
3. Each subsequent symbol is assigned the next binary number in sequence, ensuring that following codes are always higher in value.
4. When you reach a longer codeword, then after incrementing, append zeros until the length of the new codeword is equal to the code length of that symbol.
Examples
| Symbol | Code length | Codeword |
| :--- | :--- | :--- |
| A | 1 | 0 |
| B | 3 | 100 |
| C | 3 | 101 |
| D | 3 | 110 |
| E | 4 | 1110 |
| F | 4 | 1111 |
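The canonical assignment can be sketched in a few lines of Python. `canonical_codes` is a hypothetical name, codewords are returned as bit strings purely for readability, and the code lengths are assumed to describe a valid prefix code.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codewords given {symbol: bit length}."""
    if not lengths:
        return {}
    # Sort by bit length, then by symbol value.
    ordered = sorted(lengths, key=lambda sym: (lengths[sym], sym))
    codes = {}
    code = 0
    prev_len = lengths[ordered[0]]
    for i, sym in enumerate(ordered):
        n = lengths[sym]
        if i > 0:
            # Increment, then append zeros when the code length grows.
            code = (code + 1) << (n - prev_len)
        codes[sym] = format(code, "b").zfill(n) if n else ""
        prev_len = n
    return codes

table = canonical_codes({"A": 1, "B": 3, "C": 3, "D": 3, "E": 4, "F": 4})
```

Applied to the code lengths in the table above this reproduces the listed codewords, and a one-symbol alphabet with length 0 yields an empty codeword.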
Parameters
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8[] | alphabet | list of all encoded symbols (values) |
| itf8[] | bit-lengths | array of bit-lengths for each symbol in the alphabet |
13.4 Byte array coding
Often there is a need to encode an array of bytes where the length is not predetermined. For example, the read identifiers differ per alignment record, possibly with different lengths, and this length must be stored somewhere. There are two choices available: storing the length explicitly (BYTE_ARRAY_LEN) or continuing to read bytes until a termination value is seen (BYTE_ARRAY_STOP).
Note that, in contrast to this, quality values are known to be the same length as the sequence, which is an already known quantity, so their length does not need to be encoded using the byte array codecs.
BYTE_ARRAY_LEN: codec ID 4
Can encode types Byte[].
Byte arrays are captured length-first, meaning that the length of every array element is written using an additional encoding. For example this could be a HUFFMAN encoding or another EXTERNAL block. The length is decoded first followed by the data, followed by the next length and data, and so on.
This encoding can therefore be considered as a nested encoding, with each pair of nested encodings containing their own set of parameters. The byte stream for parameters of the BYTE_ARRAY_LEN encoding is therefore the concatenation of the length and value encoding parameters as described in section 2.3.
The parameters for BYTE_ARRAY_LEN are listed below:
| Data type | Name | Comment |
| :--- | :--- | :--- |
| encoding<int> | lengths encoding | an encoding describing how the array lengths are captured |
| encoding<byte> | values encoding | an encoding describing how the values are captured |
For example, the bytes specifying a BYTE_ARRAY_LEN encoding, including the codec and parameters, for a 16-bit X0 auxiliary tag (“X0C”) may use bar(H)\overline{\mathrm{H}} UFFMA bar(N)\overline{\mathrm{N}} encoding to specify the length (always 2 bytes) and an EXTERNAL encoding to store the value to an external block with ID 200. 例如,为 16 位 X0 辅助标记("X0C")指定 BYTE_ARRAY_LEN 编码(包括编解码器和参数)的字节可使用 bar(H)\overline{\mathrm{H}} UFFMA bar(N)\overline{\mathrm{N}} 编码指定长度(始终为 2 字节),并使用 EXTERNAL 编码将值存储到 ID 为 200 的外部块中。
Bytes 字节
Meaning 意义
0x04
BYTE_ARRAY_LEN codec ID BYTE_ARRAY_LEN 编解码器 ID
0x0a
10 remaining bytes of BYTE_ARRAY_LEN parameters BYTE_ARRAY_LEN 参数剩余的 10 个字节
0x03
HUFFMAN codec ID, for aux tag lengths HUFFMAN 编解码器 ID,用于辅助标记长度
0x04
4 more bytes of HUFFMAN parameters 再有 4 个字节的 HUFFMAN 参数
0x01
Alphabet array size =1=1 字母数组大小 =1=1
0x02
alphabet symbol; (length =2)=2) 字母符号;(长度 =2)=2)
0x01
Codeword array size =1=1 码字阵列大小 =1=1
0x00
Code length =0=0 (zero bits needed as alphabet is size 1) 代码长度 =0=0 (由于字母表的大小为 1,因此需要 0 比特)
0x01
EXTERNAL codec ID, for aux tag values 外部编解码器 ID,用于辅助标记值
0x02
2 more bytes of EXTERNAL parameters 另外 2 个字节的外部参数
2 more bytes of EXTERNAL parameters| 2 more bytes of EXTERNAL parameters |
| :--- |
0x80 0xc8
ITF8 encoding for block ID 200 区块 ID 200 的 ITF8 编码
Bytes Meaning
0x04 BYTE_ARRAY_LEN codec ID
0x0a 10 remaining bytes of BYTE_ARRAY_LEN parameters
0x03 HUFFMAN codec ID, for aux tag lengths
0x04 4 more bytes of HUFFMAN parameters
0x01 Alphabet array size =1
0x02 alphabet symbol; (length =2)
0x01 Codeword array size =1
0x00 Code length =0 (zero bits needed as alphabet is size 1)
0x01 EXTERNAL codec ID, for aux tag values
0x02 "2 more bytes of EXTERNAL parameters"
0x80 0xc8 ITF8 encoding for block ID 200| Bytes | Meaning |
| :--- | :--- |
| 0x04 | BYTE_ARRAY_LEN codec ID |
| 0x0a | 10 remaining bytes of BYTE_ARRAY_LEN parameters |
| | |
| 0x03 | HUFFMAN codec ID, for aux tag lengths |
| 0x04 | 4 more bytes of HUFFMAN parameters |
| 0x01 | Alphabet array size $=1$ |
| 0x02 | alphabet symbol; (length $=2)$ |
| 0x01 | Codeword array size $=1$ |
| 0x00 | Code length $=0$ (zero bits needed as alphabet is size 1) |
| | |
| 0x01 | EXTERNAL codec ID, for aux tag values |
| 0x02 | 2 more bytes of EXTERNAL parameters |
| 0x80 0xc8 | ITF8 encoding for block ID 200 |
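The byte layout in the table above can be reproduced with a short sketch. The helper names (`itf8_small`, `encoding`, `huffman_params`, `x0_tag_encoding`) are hypothetical, and the ITF8 helper only covers the one- and two-byte forms needed here; it assumes ITF8 behaves as defined in the data types section of this specification.

```python
def itf8_small(value: int) -> bytes:
    """ITF8-encode an integer in [0, 2**14); longer forms are left out of this sketch."""
    if value < 0 or value >= 0x4000:
        raise ValueError("only the one- and two-byte ITF8 forms are sketched here")
    if value < 0x80:
        return bytes([value])                           # 0xxxxxxx
    return bytes([0x80 | (value >> 8), value & 0xFF])   # 10xxxxxx xxxxxxxx


def encoding(codec_id: int, params: bytes) -> bytes:
    """Codec ID, then the byte count of the parameters, then the parameters."""
    return itf8_small(codec_id) + itf8_small(len(params)) + params


def huffman_params(symbols, code_lengths) -> bytes:
    """Alphabet array followed by the matching array of code lengths."""
    out = itf8_small(len(symbols)) + b"".join(itf8_small(s) for s in symbols)
    out += itf8_small(len(code_lengths)) + b"".join(itf8_small(n) for n in code_lengths)
    return out


def x0_tag_encoding() -> bytes:
    lengths = encoding(3, huffman_params([2], [0]))  # HUFFMAN: length is always 2
    values = encoding(1, itf8_small(200))            # EXTERNAL: block ID 200
    return encoding(4, lengths + values)             # BYTE_ARRAY_LEN wraps both


print(x0_tag_encoding().hex(" "))  # 04 0a 03 04 01 02 01 00 01 02 80 c8
```

The outer length byte (0x0a) falls out of the nesting: six bytes for the HUFFMAN sub-encoding plus four for the EXTERNAL sub-encoding.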
BYTE_ARRAY_STOP: codec ID 5
Can encode types Byte[].
Byte arrays are captured as a sequence of bytes terminated by a special stop byte. The data returned does not include the stop byte itself. In contrast to BYTE_ARRAY_LEN the value is always encoded with EXTERNAL so the parameter is an external id instead of another encoding.
| Data type | Name | Comment |
| :--- | :--- | :--- |
| byte | stop byte | a special byte treated as a delimiter |
| itf8 | external id | id of an external block containing the byte stream |
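A minimal sketch of reading BYTE_ARRAY_STOP values from an already decompressed external block follows. The function name, the sample block contents and the choice of 0x00 as stop byte are assumptions made for illustration.

```python
def read_byte_array_stop(block: bytes, pos: int, stop_byte: int):
    """Return (value, next_pos); the returned value excludes the stop byte."""
    end = block.index(bytes([stop_byte]), pos)  # ValueError if no stop byte remains
    return block[pos:end], end + 1


# Hypothetical external block holding two read names delimited by stop byte 0x00.
block = b"read/1\x00read/2\x00"
first, pos = read_byte_array_stop(block, 0, 0x00)
second, pos = read_byte_array_stop(block, pos, 0x00)
print(first, second)
```

Because the stop byte is a delimiter, it cannot also appear inside a value; an encoder would need to pick a byte that never occurs in the data.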
13.5 Beta coding: codec ID 6
Can encode types Integer.
Definition
Beta coding is the most common way to represent numbers in binary notation and is sometimes referred to as binary coding. The decoder reads the specified fixed number of bits (most significant first) and subtracts the offset value to get the decoded integer.
Parameters
CRAM format defines the following parameters of beta coding:
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is subtracted from each value during decode |
| itf8 | length | the number of bits used |
Examples
If we have integer values in the range 10 to 15 inclusive, the largest value would traditionally need 4 bits, but with an offset of -10 we can hold values 0 to 5, using a fixed size of 3 bits. Using the fixed Offset and Length coming from the beta parameters, we decode each value by reading Length bits and subtracting Offset, so a stored 0 decodes to $0-(-10)=10$ and a stored 5 decodes to $5-(-10)=15$.
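The example can be checked with a small sketch. `BitReader`, `beta_encode` and `beta_decode` are hypothetical names, and the sketch assumes bits are packed into bytes most significant bit first.

```python
class BitReader:
    """Reads bits from a byte string, most significant bit of each byte first."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # position in bits

    def read_bits(self, count: int) -> int:
        value = 0
        for _ in range(count):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def beta_encode(values, offset: int, length: int) -> bytes:
    """Store value + offset in `length` bits each, zero-padding the final byte."""
    bits = "".join(format(v + offset, f"0{length}b") for v in values)
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def beta_decode(reader: BitReader, offset: int, length: int) -> int:
    """Read `length` bits and subtract the offset."""
    return reader.read_bits(length) - offset


packed = beta_encode([10, 12, 15], offset=-10, length=3)
reader = BitReader(packed)
decoded = [beta_decode(reader, offset=-10, length=3) for _ in range(3)]
print(packed.hex(), decoded)
```

Three values occupy 9 bits here, so the packed form is two bytes; storing them as plain 4-bit values would need 12 bits.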
13.6 Subexponential coding: codec ID 7
Can encode types Integer.
Definition
Subexponential coding$^{6}$ is parametrized by a non-negative integer $k$. For values $n<2^{k+1}$ subexponential coding produces codewords identical to Rice coding$^{7}$. For larger values it grows logarithmically with $n$.
Encoding
1. Add offset to $n$.
2. Determine $u$ and $b$ values from $n$:
$$b=\begin{cases}k & \text{if } n<2^{k} \\ \left\lfloor\log_{2} n\right\rfloor & \text{if } n \geq 2^{k}\end{cases} \quad u=\begin{cases}0 & \text{if } n<2^{k} \\ b-k+1 & \text{if } n \geq 2^{k}\end{cases}$$
3. Write $u$ in unary form: $u$ 1 bits followed by a single 0 bit.
4. Write the bottom $b$ bits of $n$ in binary form.
Decoding
1. Read $u$ in unary form, counting the number of leading 1s (prefix) in the codeword (discard the trailing 0 bit).
2. Determine $n$ via:
(a) if $u=0$ then read $n$ as a $k$-bit binary number.
(b) if $u \geq 1$ then read $x$ as a $(u+k-1)$-bit binary number. Let $n=2^{u+k-1}+x$.
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is subtracted from each value during decode |
| itf8 | k | the order of the subexponential coding |
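The subexponential encoding and decoding steps can be sketched as follows. For readability the codewords are handled as strings of '0' and '1' characters rather than packed bytes; the function names are hypothetical. The decoder subtracts the offset as the final step, following the parameter table.

```python
def subexp_encode(value: int, offset: int, k: int) -> str:
    n = value + offset                      # add offset to n
    if n < 0:
        raise ValueError("value + offset must be non-negative")
    if n < (1 << k):
        b, u = k, 0
    else:
        b = n.bit_length() - 1              # floor(log2 n)
        u = b - k + 1
    unary = "1" * u + "0"                   # u one-bits then a single zero
    low = format(n & ((1 << b) - 1), f"0{b}b") if b else ""
    return unary + low                      # bottom b bits of n


def subexp_decode(bits: str, pos: int, offset: int, k: int):
    """Return (value, next_pos) for the codeword starting at bits[pos]."""
    u = 0
    while bits[pos] == "1":
        u += 1
        pos += 1
    pos += 1                                # discard the trailing 0 bit
    if u == 0:
        width, base = k, 0
    else:
        width = u + k - 1
        base = 1 << width
    x = int(bits[pos:pos + width], 2) if width else 0
    return base + x - offset, pos + width


for n in (0, 3, 4, 7, 8):
    print(n, subexp_encode(n, offset=0, k=2))
```

With $k=2$, the values below $2^{k+1}=8$ get 3- or 4-bit codewords that match Rice coding, while 8 is the first value to use a longer unary prefix.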
13.7 Gamma coding: codec ID 9
Can encode types Integer.
Definition
Elias gamma code is a prefix encoding of positive integers. This is a combination of unary coding and beta coding. The first is used to capture the number of bits required for beta coding to capture the value.
Encoding
1. Write the integer in binary.
2. Subtract 1 from the number of bits written in step 1 and prepend that many zeros.
An equivalent way to express the same process:
1. Separate the integer into the highest power of 2 it contains ($2^{N}$) and the remaining $N$ binary digits of the integer.
2. Encode $N$ in unary; that is, as $N$ zeroes followed by a one.
3. Append the remaining $N$ binary digits to this representation of $N$.
Decoding
1. Read and count 0s from the stream until you reach the first 1. Call this count of zeroes $N$.
2. Considering the one that was reached to be the first digit of the integer, with a value of $2^{N}$, read the remaining $N$ digits of the integer.
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset to subtract from each value after decode |
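A sketch of the gamma encoding and decoding steps, again using strings of '0' and '1' characters. The function names are hypothetical, and the encoder adding the offset before coding is an assumption that mirrors the decode-side subtraction in the parameter table.

```python
def gamma_encode(value: int, offset: int) -> str:
    n = value + offset                      # assumed inverse of the decode-side subtraction
    if n < 1:
        raise ValueError("gamma coding needs value + offset >= 1")
    binary = format(n, "b")                 # step 1: write the integer in binary
    return "0" * (len(binary) - 1) + binary # step 2: prepend (bit count - 1) zeros


def gamma_decode(bits: str, pos: int, offset: int):
    """Return (value, next_pos) for the codeword starting at bits[pos]."""
    zeros = 0
    while bits[pos] == "0":                 # count zeros up to the first 1
        zeros += 1
        pos += 1
    n = int(bits[pos:pos + zeros + 1], 2)   # the leading 1 plus N further digits
    return n - offset, pos + zeros + 1


for n in (1, 2, 5, 16):
    print(n, gamma_encode(n, offset=0))
```

An offset of 1 on the encode side would let the value 0 be stored, since gamma codes only represent positive integers.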
13.8 DEPRECATED: Golomb coding: codec ID 2
Can encode types Integer.
Note this codec has not been used in any known CRAM implementation since before CRAM v1.0. Nor is it implemented in some of the major software. Therefore its use is not recommended.
Definition
Golomb encoding is a prefix encoding optimal for representation of random positive numbers following geometric distribution.
Encoding
1. Fix the parameter $M$ to an integer value.
2. For $N$, the number to be encoded, find
(a) quotient $q=\lfloor N / M\rfloor$
(b) remainder $r=N \bmod M$
3. Write $q$ in unary form: $q$ 1 bits followed by a single 0 bit.
4. Set $b=\left\lceil\log_{2}(M)\right\rceil$
i. If $r<2^{b}-M$ code $r$ as plain binary using $b-1$ bits.
ii. If $r \geq 2^{b}-M$ code the number $r+2^{b}-M$ in plain binary representation using $b$ bits.
Decoding
1. Read $q$ via unary coding: count the number of 1 bits and consume the following 0 bit.
2. Set $b=\left\lceil\log_{2}(M)\right\rceil$
3. Read $r$ via $b-1$ bits of binary coding
4. If $r \geq 2^{b}-M$
(a) Read 1 single bit, $x$.
(b) Set $r=r \times 2+x-\left(2^{b}-M\right)$
5. Value is $q \times M+r-\text{offset}$
Golomb coding takes the following parameters:
| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is added to each value |
| itf8 | M | the golomb parameter (number of bins) |
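The Golomb steps above can be sketched as below, using strings of '0' and '1' characters. The function names are hypothetical. The sketch assumes the quotient is written in unary before the remainder (as implied by the decoding steps), that the encoder adds the offset, and it treats $M=1$ (where $b=0$) as having an empty remainder code.

```python
def ceil_log2(m: int) -> int:
    return (m - 1).bit_length()             # ceil(log2 m) for m >= 1


def golomb_encode(value: int, offset: int, m: int) -> str:
    q, r = divmod(value + offset, m)
    b = ceil_log2(m)
    cutoff = (1 << b) - m                   # 2^b - M
    bits = "1" * q + "0"                    # quotient in unary
    if r < cutoff:
        bits += format(r, f"0{b - 1}b")     # short form: b-1 bits
    elif b > 0:
        bits += format(r + cutoff, f"0{b}b")  # long form: b bits
    return bits


def golomb_decode(bits: str, pos: int, offset: int, m: int):
    """Return (value, next_pos) for the codeword starting at bits[pos]."""
    q = 0
    while bits[pos] == "1":
        q += 1
        pos += 1
    pos += 1                                # consume the 0 bit
    b = ceil_log2(m)
    cutoff = (1 << b) - m
    r = 0
    if b > 0:
        r = int(bits[pos:pos + b - 1], 2) if b > 1 else 0
        pos += b - 1
        if r >= cutoff:
            x = int(bits[pos])              # one extra bit
            pos += 1
            r = r * 2 + x - cutoff
    return q * m + r - offset, pos


print(golomb_encode(7, 0, 5), golomb_encode(8, 0, 5))
```

With $M=5$ the cutoff is $2^{3}-5=3$, so remainders 0 to 2 take two bits and remainders 3 and 4 take three.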
13.9 DEPRECATED: Golomb-Rice coding: codec ID 8
Can encode types Integer.
Note this codec has not been used in any known CRAM implementation since before CRAM v1.0. Nor is it implemented in some of the major software. Therefore its use is not recommended.
Golomb-Rice coding is a special case of Golomb coding when the M parameter is a power of 2. The reason for this coding is that the division operations in Golomb coding can be replaced with bit shift operators as well as avoiding the extra $r<2^{b}-M$ check.
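To illustrate the point about shifts, here is a minimal sketch with $M=2^{s}$; the function name and the parameter `log2_m` (standing for $s$) are hypothetical. The quotient and remainder come from a shift and a mask, and the remainder always occupies exactly $s$ bits.

```python
def rice_encode(n: int, log2_m: int) -> str:
    q = n >> log2_m                         # same as n // M
    r = n & ((1 << log2_m) - 1)             # same as n % M
    low = format(r, f"0{log2_m}b") if log2_m else ""
    return "1" * q + "0" + low              # unary quotient, then fixed-width remainder


print(rice_encode(9, 2))  # 11001
```

Here 9 with $M=4$ gives quotient 2 and remainder 1, hence two 1 bits, the terminating 0 and the two-bit remainder 01.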
14 External compression methods
External encoding operates on bytes only. Therefore any data series must be translated into bytes before sending data into an external block. The following methods are defined. Exact definitions of these methods are in their respective internet links or the ancillary CRAMcodecs document found alongside this specification.
Integer values are written as ITF8, which then can be translated into an array of bytes.
Strings, like read name, are translated into bytes according to UTF8 rules. In most cases these should coincide with ASCII, making the translation trivial.
Each method has an associated numeric code which is defined in Section 8.
14.1 Gzip
The Gzip specification is defined in RFC 1952. Gzip in turn is an encapsulation on the Deflate algorithm defined in RFC 1951.
14.2 Bzip2
First available in CRAM v2.0.
Bzip2 is a compression method utilising the Burrows Wheeler Transform, Move To Front transform, Run Length Encoding and a Huffman entropy encoder. It is often superior to Gzip for textual data.
An informal format specification exists: https://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf
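Both general-purpose methods described above are available in the Python standard library, so a block payload can be round-tripped with a short sketch. The function names and the string method labels are hypothetical; the numeric method codes and the surrounding CRAM block structure are defined elsewhere in this specification and are not modelled here.

```python
import bz2
import gzip


def compress_payload(payload: bytes, method: str) -> bytes:
    """Compress raw block bytes with one of the general-purpose methods."""
    if method == "gzip":
        return gzip.compress(payload)
    if method == "bzip2":
        return bz2.compress(payload)
    raise ValueError(f"method not covered by this sketch: {method}")


def decompress_payload(data: bytes, method: str) -> bytes:
    if method == "gzip":
        return gzip.decompress(data)
    if method == "bzip2":
        return bz2.decompress(data)
    raise ValueError(f"method not covered by this sketch: {method}")


payload = b"read_name_0001\x00read_name_0002\x00" * 500
for method in ("gzip", "bzip2"):
    packed = compress_payload(payload, method)
    assert decompress_payload(packed, method) == payload
    print(method, len(payload), "->", len(packed))
```

The Gzip stream starts with the RFC 1952 magic bytes 0x1f 0x8b followed by compression method 8 (Deflate), while a Bzip2 stream starts with the ASCII characters `BZh`.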
First available in CRAM v3.0.
rANS is the range-coder variant of the Asymmetric Numerical System$^{8}$.
“4x8” refers to 4-way interleaving with 8-bit renormalisation.
This variant of rANS first appeared in CRAM v3.0.
Details of this algorithm have been moved to the CRAMcodecs document.
14.5 rANS4x16 codec
First available in CRAM v3.1.
“4x16” refers to 4-way interleaving with 16-bit renormalisation.
This variant of rANS first appeared in CRAM v3.1.
Details of this algorithm are listed in the CRAMcodecs document.
14.6 adaptive arithmetic coding
First available in CRAM v3.1.
An entropy encoder that is slower but slightly more concise than rANS. It achieves this by adapting the probabilities as it compresses and decompresses instead of using a fixed table.
Details of this algorithm are listed in the CRAMcodecs document.
14.7 fqzcomp codec
First available in CRAM v3.1.
This is a method dedicated to compression of quality values.
Details of this algorithm are listed in the CRAMcodecs document.
14.8 name tokeniser
First available in CRAM v3.1.
This is a method dedicated to compression of read names.
Details of this algorithm are listed in the CRAMcodecs document.
15 Appendix
15.1 Choosing the container size
CRAM format does not constrain the size of the containers. However, the following should be considered when deciding the container size:
Data can be compressed better by using larger containers
Random access performance is better for smaller containers
Streaming is more convenient for small containers
Applications typically buffer containers into memory
We recommend 1 megabyte containers. They are small enough to provide good random access and streaming performance while being large enough to provide good compression. 1 MB containers are also small enough to fit into the L2 cache of most modern CPUs.
Some simplified examples are provided below to fit data into 1 MB containers.
Unmapped short reads with bases, read names, recalibrated and original quality scores
We have 10,000 unmapped short reads (100bp) with read names, recalibrated and original quality scores. We estimate 0.4 bits/base (read names) + 0.4 bits/base (bases) + 3 bits/base (recalibrated quality scores) + 3 bits/base (original quality scores) $\approx 7$ bits/base. Space estimate is $10000 \times 100 \times 7$ bits $\approx 0.9$ MB. Data could be stored in a single container.
Unmapped long reads with bases, read names and quality scores
We have 10,000 unmapped long reads (10 kb) with read names and quality scores. We estimate: 0.4 bits/base (bases) + 3 bits/base (original quality scores) $\approx 3.5$ bits/base. Space estimate is $10000 \times 10000 \times 3.5$ bits $\approx 42$ MB. Data could be stored in $42 \times 1$ MB containers.
Mapped short reads with bases, pairing and mapping information
We have 250,000 mapped short reads (100 bp) with bases, pairing and mapping information. We estimate the compression to be 0.2 bits/base. Space estimate is $250000 \times 100 \times 0.2$ bits $\approx 0.6$ MB. Data could be stored in a single container.
Embedded reference sequences
We have a reference sequence (10 Mb). We estimate the compression to be 2 bits/base. Space estimate is $10000000 \times 2$ bits $\approx 2.4$ MB. Data could be written into three containers: 1 MB + 1 MB + 0.4 MB.
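The arithmetic behind these four estimates can be reproduced with a short sketch. The function names are hypothetical, and the sketch assumes 1 MB means $2^{20}$ bytes, which the text does not state; the quoted figures are rounded, so the computed sizes only land close to them.

```python
import math

CONTAINER_BYTES = 1 << 20  # assumption: a "1 MB" container holds 2**20 bytes


def estimate_bytes(records: int, bases_per_record: int, bits_per_base: float) -> float:
    """Estimated compressed size in bytes."""
    return records * bases_per_record * bits_per_base / 8


def containers_needed(total_bytes: float) -> int:
    return max(1, math.ceil(total_bytes / CONTAINER_BYTES))


scenarios = {
    "unmapped short reads": (10_000, 100, 7),
    "unmapped long reads": (10_000, 10_000, 3.5),
    "mapped short reads": (250_000, 100, 0.2),
    "embedded reference": (1, 10_000_000, 2),
}
for name, args in scenarios.items():
    size = estimate_bytes(*args)
    print(f"{name}: {size / CONTAINER_BYTES:.2f} MB, {containers_needed(size)} container(s)")
```

Under that assumption the container counts come out as 1, 42, 1 and 3, matching the counts given in the examples.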
15.2 CRAM History
Pre-CRAM: 2010
The primary concepts and ideas of CRAM stem from work at the European Bioinformatics Institute in 2010 and 2011, published in:
Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney, Efficient storage of high throughput DNA sequencing data using reference-based compression, Genome Res. 2011 21: 734-740; doi:10.1101/gr.114819.110; PMID:21245279.
CRAM 0.x: 2011
Vadim Zalunin implemented the ideas in the paper, now named CRAM, in the Java CRAMtools package. This included versions from 0.3 to 0.86$^{9}$.
Reimplementing CRAM in C$^{11}$ exposed a number of issues with the 1.0 specification and disparities between the specification text and the Java implementation. CRAM 2.0 unified implementation with specification.
Other changes included:
Support for multiple references per container, to permit storage of highly fragmented assemblies.
Soft-clips and inserted bases moved to their own separate data-series instead of sharing one.
Slice headers contain meta-data tracking the number of records and bases.
Corrected the BF (bam flag) data series to match the BAM specification.
Improved encoding of auxiliary tags.
CRAM 2.1: 2014
This is the first version to appear in HTSJDK (version 1.127), ported from the Java CRAMtools package.
EOF blocks are added in order to spot truncated files.
CRAM 3.0: 2014
Primarily this is an optimisation of size and speed.
Inclusion of LZMA compression library.
Inclusion of the custom rANS Order-0 and Order-1 entropy encoders.
Checksums added to all file format structures to ensure data integrity.
CRAM 3.1: 2023
Note: the formal draft appeared in 2019, and was initially demonstrated in 2016.
This adds new EXTERNAL compression methods, described in the separate CRAMcodecs document, and expands the list of permitted “methods” in the CRAM Block structure.
The aim of the new compression methods is improved compression: better performance with the newer SIMD rANS implementation, and smaller files with the custom name tokeniser and quality codec.
The format is otherwise identical to 3.0.
15.3 Contributors and Acknowledgements
Markus Fritz, Rasko Leinonen, Guy Cochrane and Ewan Birney (EBI): Initial ideas behind CRAM.
Vadim Zalunin (EBI): Initial Java implementation of CRAM and previous maintainer of CRAM specification.
James Bonfield (Sanger Institute): Initial C implementation of CRAM and current maintainer of CRAM specification.
Joel Thibault (Broad Institute): previous maintainer of CRAM specification.
Chris Norman (Broad Institute): previous maintainer of CRAM specification and worked on the HTSJDK implementation.
Robert Buels (UC Berkeley): First JavaScript implementation of CRAM
Michael Macias (St Jude Children’s Research Hospital): First Rust implementation of CRAM
Other specification contributors include: John Marshall, Rishi Nag, Kenta Sato, Artem Tarasov and Jason Travis.
Plus a big thank you to everyone who has raised GitHub issues and/or helped us improve the specification in other ways.
$^{1}$ Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney, Efficient storage of high throughput DNA sequencing data using reference-based compression, Genome Res. 2011 21: 734-740; doi:10.1101/gr.114819.110; PMID:21245279.
$^{a}$ Formerly MAPPED_SLICE_HEADER. Now used by all slice headers regardless of mapping status.
$^{2}$ The precise order is defined in section 10.
$^{3}$ Unmapped reads can be placed or unplaced. By placed unmapped read we mean a read that is unmapped according to bit 0x4 of the BF (BAM bit flags) data series, but has position fields filled in, thus “placing” it on a reference sequence. In contrast, unplaced unmapped reads have a reference sequence ID of -1 and alignment position of 0.
$^{4}$ Interleaving can sometimes provide better compression, however it also adds dependency between types of data meaning it is not possible to selectively decode one data series if it co-locates with another data series in the same block.