
CRAM format specification (version 3.1)

samtools-devel@lists.sourceforge.net

4 Sep 2024

Abstract

The master version of this document can be found at https://github.com/samtools/hts-specs. This printing is version 4127441 from that repository, last modified on the date shown above.

license: Apache 2.0

1 Overview

This specification describes the CRAM 3.0 and 3.1 formats.

CRAM has the following major objectives:
  1. Significantly better lossless compression than BAM
  2. Full compatibility with BAM
  3. Effortless transition to CRAM from using BAM files
  4. Support for controlled loss of BAM data
The first three objectives allow users to take immediate advantage of the CRAM format while offering a smooth transition path from using BAM files. The fourth objective supports the exploration of different lossy compression strategies and provides a framework in which to effect these choices. Please note that the CRAM format does not impose any rules about what data should or should not be preserved. Instead, CRAM supports a wide range of lossless and lossy data preservation strategies enabling users to choose which data should be preserved.

Data in CRAM is stored either as CRAM records or using one of the general purpose compressors (gzip, bzip2). CRAM records are compressed using a number of different encoding strategies. For example, bases are reference compressed by encoding base differences rather than storing the bases themselves.¹

2 Data types

The CRAM specification uses logical data types and storage data types; logical data types are written as words (e.g. int) while storage (physical) data types are written using single letters (e.g. i). The difference between the two is that storage data types define how logical data types are stored in CRAM. Data in CRAM is stored either as bits or bytes. Writing values as bits and bytes is described in detail below.

2.1 Logical data types

Byte

Signed byte (8 bits).

Integer

Signed 32-bit integer.

Long

Signed 64-bit integer.

Array

An array of any logical data type: array<type>

2.2 Writing bits to a bit stream

A bit stream consists of a sequence of 1s and 0s. The bits are written most significant bit first, where new bits are stacked to the right and full bytes on the left are written out. In a bit stream the last byte will be incomplete if fewer than 8 bits have been written to it. In this case the bits in the last byte are shifted to the left.

Example of writing to bit stream

Let’s consider the following example. The table below shows a sequence of write operations:
| Operation order | Buffer state before | Written bits | Buffer state after | Issued bytes |
| :--- | :--- | :--- | :--- | :--- |
| 1 | 0x0 | 1 | 0x1 | - |
| 2 | 0x1 | 0 | 0x2 | - |
| 3 | 0x2 | 11 | 0xB | - |
| 4 | 0xB | 00000111 | 0x7 | 0xB0 |

After flushing the above bit stream the following bytes are written: 0xB0 0x70. Please note that the last byte was 0x7 before shifting to the left and became 0x70 after that:
> echo "obase=16; ibase=2; 00000111" | bc
7
> echo "obase=16; ibase=2; 01110000" | bc
70
And the whole bit sequence:
> echo "obase=2; ibase=16; B070" | bc
1011000001110000
When reading the bits from the bit sequence it must be known that only 12 bits are meaningful and the bit stream should not be read after that.
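The write operations in the example above can be reproduced with a small bit writer. This is a minimal sketch rather than a reference implementation; the class name `BitWriter` and its methods are hypothetical names chosen for illustration.

```python
class BitWriter:
    """Accumulates bits most significant bit first and emits whole bytes."""

    def __init__(self):
        self.buffer = 0          # pending bits, held in the low end of an int
        self.nbits = 0           # number of pending bits in the buffer
        self.out = bytearray()   # bytes issued so far

    def write(self, value, nbits):
        # Stack the new bits to the right of whatever is pending.
        self.buffer = (self.buffer << nbits) | (value & ((1 << nbits) - 1))
        self.nbits += nbits
        # Issue full bytes from the left.
        while self.nbits >= 8:
            self.nbits -= 8
            self.out.append((self.buffer >> self.nbits) & 0xFF)
        self.buffer &= (1 << self.nbits) - 1

    def flush(self):
        # Shift an incomplete last byte to the left before writing it.
        if self.nbits:
            self.out.append((self.buffer << (8 - self.nbits)) & 0xFF)
            self.buffer = 0
            self.nbits = 0
        return bytes(self.out)


writer = BitWriter()
for value, nbits in [(0b1, 1), (0b0, 1), (0b11, 2), (0b00000111, 8)]:
    writer.write(value, nbits)
encoded = writer.flush()
print(encoded.hex())
```

Only 12 of the 16 bits in `encoded` carry data, which is why a reader needs to know the bit count from elsewhere.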

Note on writing to bit stream

When writing to a bit stream both the value and the number of bits in the value must be known. This is because programming languages normally operate with bytes (8 bits) and to specify which bits are to be written requires a bit-holder, for example an integer, and the number of bits in it. Equally, when reading a value from a bit stream the number of bits must be known in advance. In case of prefix codes (e.g. Huffman) all possible bit combinations are either known in advance or it is possible to calculate how many bits will follow based on the first few bits. Alternatively, two codes can be combined, where the first contains the number of bits to read.

2.3 Writing bytes to a byte stream

The interpretation of a byte stream is straightforward. CRAM uses little-endian byte order when applicable and defines the following storage data types:

Boolean (bool)

Boolean is written as 1 byte with 0x0 being ‘false’ and 0x1 being ‘true’.

Integer (int32)

Signed 32-bit integer, written as 4 bytes in little-endian byte order.

Long (int64)

Signed 64-bit integer, written as 8 bytes in little-endian byte order.
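The three fixed-width storage types above map directly onto Python's `struct` module. The helper names below are hypothetical and serve only as a sketch of the byte layouts just described.

```python
import struct


def write_bool(value):
    """bool: one byte, 0x1 for true and 0x0 for false."""
    return b"\x01" if value else b"\x00"


def write_int32(value):
    """int32: signed, 4 bytes, little-endian."""
    return struct.pack("<i", value)


def write_int64(value):
    """int64: signed, 8 bytes, little-endian."""
    return struct.pack("<q", value)


example = write_bool(True) + write_int32(1) + write_int64(-2)
print(example.hex(" "))
```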

ITF-8 integer (itf8)

This is an alternative way to write an integer value. The idea is similar to UTF-8 encoding and therefore this encoding is called ITF-8 (Integer Transformation Format - 8 bit).

The most significant bits of the first byte have special meaning and are called the ‘prefix’. These are 0 to 4 true bits followed by a 0. The number of 1s denotes the number of bytes to follow. To accommodate 32 bits, such a representation requires 5 bytes, with only the 4 lower bits used in the last byte (byte 5).
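The prefix rule can be sketched as an encoder and decoder pair. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the placement of the value bits (most significant first, directly after the prefix) is an assumption that goes beyond what the paragraph states, and negative values are assumed to be written using their 32-bit two's-complement bit pattern.

```python
def itf8_encode(value):
    """Encode a signed 32-bit integer as 1 to 5 ITF-8 bytes."""
    v = value & 0xFFFFFFFF  # negatives become their two's-complement pattern
    if v < 0x80:            # prefix 0: 7 value bits
        return bytes([v])
    if v < 0x4000:          # prefix 10: 14 value bits
        return bytes([0x80 | (v >> 8), v & 0xFF])
    if v < 0x200000:        # prefix 110: 21 value bits
        return bytes([0xC0 | (v >> 16), (v >> 8) & 0xFF, v & 0xFF])
    if v < 0x10000000:      # prefix 1110: 28 value bits
        return bytes([0xE0 | (v >> 24), (v >> 16) & 0xFF, (v >> 8) & 0xFF, v & 0xFF])
    # prefix 1111: 32 value bits, only the low 4 bits of byte 5 are used
    return bytes([0xF0 | (v >> 28), (v >> 20) & 0xFF, (v >> 12) & 0xFF,
                  (v >> 4) & 0xFF, v & 0x0F])


def itf8_decode(data, pos=0):
    """Return (value, next_pos) for the ITF-8 integer starting at data[pos]."""
    b0 = data[pos]
    extra = 0
    while extra < 4 and b0 & (0x80 >> extra):   # count leading 1 bits (max 4)
        extra += 1
    if extra < 4:
        v = b0 & (0x7F >> extra)
        for b in data[pos + 1:pos + 1 + extra]:
            v = (v << 8) | b
    else:
        v = ((b0 & 0x0F) << 28) | (data[pos + 1] << 20) | (data[pos + 2] << 12) \
            | (data[pos + 3] << 4) | (data[pos + 4] & 0x0F)
    if v >= 1 << 31:                            # reinterpret as signed 32-bit
        v -= 1 << 32
    return v, pos + 1 + extra


for n in (0, 127, 128, 16383, 16384, 2**28, -1):
    print(n, itf8_encode(n).hex(" "))
```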

LTF-8 long (ltf8)

See ITF-8 for more details. The only difference between ITF-8 and LTF-8 is the number of bytes used to encode a single value. To do so 64 bits are required, and this can be done with 9 bytes at most, with the first byte consisting of just 1s (the value 0xFF).
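A corresponding LTF-8 sketch follows. As with ITF-8, the function names are hypothetical and the exact bit placement is an assumption: the first byte is taken to carry `n` leading 1 bits for `n` following bytes, with the value bits stored most significant first, and the 9-byte form is taken to be 0xFF followed by the full 64-bit value.

```python
def ltf8_encode(value):
    """Encode a signed 64-bit integer as 1 to 9 LTF-8 bytes."""
    v = value & 0xFFFFFFFFFFFFFFFF
    for extra in range(8):
        # With `extra` following bytes, 7 + 7 * extra value bits are available.
        if v < (1 << (7 + 7 * extra)):
            prefix = (0xFF << (8 - extra)) & 0xFF        # `extra` leading 1 bits
            first = prefix | (v >> (8 * extra))
            rest = v & ((1 << (8 * extra)) - 1)
            return bytes([first]) + rest.to_bytes(extra, "big")
    return b"\xff" + v.to_bytes(8, "big")                # all-ones first byte


def ltf8_decode(data, pos=0):
    """Return (value, next_pos) for the LTF-8 integer starting at data[pos]."""
    b0 = data[pos]
    extra = 0
    while extra < 8 and b0 & (0x80 >> extra):            # count leading 1 bits
        extra += 1
    v = b0 & (0x7F >> extra) if extra < 8 else 0
    for b in data[pos + 1:pos + 1 + extra]:
        v = (v << 8) | b
    if v >= 1 << 63:                                     # reinterpret as signed
        v -= 1 << 64
    return v, pos + 1 + extra


for n in (0, 128, 2**35, 2**56, -1):
    print(n, ltf8_encode(n).hex(" "))
```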

Array (array<type>)

A variable sized array with an explicitly written dimension. Array length is written first as integer (itf8), followed by the elements of the array.

Implicit or fixed-size arrays are also used, written as type[] or type[4] (for example). These have no explicit dimension included in the file format and instead rely on the specification itself to document the array size.

Encoding

Encoding is a data type that specifies how data series have been compressed. Encodings are defined as encoding<type> where the type is a logical data type as opposed to a storage data type.

An encoding is written as follows. The first integer (itf8) denotes the codec id and the second integer (itf8) the number of bytes in the following encoding-specific values.

Subexponential encoding example:

| Value | Type | Name |
| :--- | :--- | :--- |
| 0x7 | itf8 | codec id |
| 0x2 | itf8 | number of bytes to follow |
| 0x0 | itf8 | offset |
| 0x1 | itf8 | K parameter |

The first byte "0x7" is the codec id.

The next byte "0x2" denotes the length of the bytes to follow (2).

The subexponential encoding has 2 parameters: integer (itf8) offset and integer (itf8) K.

offset = 0x0 = 0
K = 0x1 = 1
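The four bytes of the subexponential example can be read back with a short parser. This is a sketch under stated assumptions: `read_itf8` and `read_encoding` are hypothetical helper names, and the multi-byte ITF-8 layout is assumed to place the value bits most significant first after the prefix.

```python
def read_itf8(data, pos):
    """Minimal ITF-8 reader returning (value, next_pos)."""
    b0 = data[pos]
    extra = 0
    while extra < 4 and b0 & (0x80 >> extra):
        extra += 1
    if extra < 4:
        value = b0 & (0x7F >> extra)
        for b in data[pos + 1:pos + 1 + extra]:
            value = (value << 8) | b
    else:
        value = ((b0 & 0x0F) << 28) | (data[pos + 1] << 20) | (data[pos + 2] << 12) \
            | (data[pos + 3] << 4) | (data[pos + 4] & 0x0F)
    if value >= 1 << 31:
        value -= 1 << 32
    return value, pos + 1 + extra


def read_encoding(data, pos=0):
    """Read codec id, then the byte count, then that many parameter bytes."""
    codec_id, pos = read_itf8(data, pos)
    size, pos = read_itf8(data, pos)
    params = data[pos:pos + size]
    return codec_id, params, pos + size


codec_id, params, end = read_encoding(bytes([0x07, 0x02, 0x00, 0x01]))
offset, p = read_itf8(params, 0)   # first SUBEXP parameter
k, p = read_itf8(params, p)        # second SUBEXP parameter
print(codec_id, offset, k)
```

Because the parameter block carries its own byte count, a reader can skip the parameters of a codec it does not recognise.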
Map

A map is a collection of keys and associated values. A map with N keys is written as follows:
{key1: value1, key2: value2, ..., keyN: valueN}

| size in bytes | N | key 1 | value 1 | key ... | value ... | key N | value N |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
Both the size in bytes and the number of keys are written as integer (itf8). Keys and values are written according to their data types and are specific to each map.
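The layout can be sketched as below. Several things here are assumptions made for illustration only: the helper names are hypothetical, the two-byte keys and one-byte values are placeholders rather than entries defined in this section, and "size in bytes" is assumed to count every byte after the size field itself, including the key count N. The sketch also restricts itself to values below 128 so that each ITF-8 integer fits in one byte.

```python
def itf8_small(value):
    """ITF-8 for values 0..127, which occupy a single byte."""
    if not 0 <= value < 0x80:
        raise ValueError("this sketch only handles single-byte ITF-8 values")
    return bytes([value])


def write_map(entries):
    """entries: list of (key_bytes, value_bytes), already serialised per their types."""
    body = itf8_small(len(entries)) + b"".join(k + v for k, v in entries)
    # Assumption: the size field counts the key count plus all keys and values.
    return itf8_small(len(body)) + body


# Placeholder map with two-byte keys and one-byte boolean values.
encoded_map = write_map([(b"RN", b"\x01"), (b"AP", b"\x00")])
print(encoded_map.hex(" "))
```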

String

Strings are represented as byte arrays using UTF-8 format. Read names, reference sequence names and tag values with type ‘Z’ are stored as UTF-8.

3 Encodings

Encoding is a data structure that captures information about compression details of a data series that are required to uncompress it. This could be a set of constants required to initialize a specific decompression algorithm or statistical properties of a data series or, in case of data series being stored in an external block, the block content id.
Encoding notation is defined as the keyword ‘encoding’ followed by its data type in angular brackets, for example ‘encoding<byte>’ stands for an encoding that operates on a data series of data type ‘byte’.

Encodings may have parameters of different data types, for example the EXTERNAL encoding has only one parameter, integer id of the external block. The following encodings are defined:
| Codec | ID | Parameters | Comment |
| :--- | :--- | :--- | :--- |
| NULL | 0 | none | series not preserved |
| EXTERNAL | 1 | int block content id | the block content identifier used to associate external data blocks with data series |
| Deprecated (GOLOMB) | 2 | int offset, int M | Golomb coding |
| HUFFMAN | 3 | array<int>, array<int> | coding with int/byte values |
| BYTE_ARRAY_LEN | 4 | encoding<int> array length, encoding<byte> bytes | coding of byte arrays with array length |
| BYTE_ARRAY_STOP | 5 | byte stop, int external block content id | coding of byte arrays with a stop value |
| BETA | 6 | int offset, int number of bits | binary coding |
| SUBEXP | 7 | int offset, int K | subexponential coding |
| Deprecated (GOLOMB_RICE) | 8 | int offset, int $\log_2 m$ | Golomb-Rice coding |
| GAMMA | 9 | int offset | Elias gamma coding |

See section 13 for more detailed descriptions of all the above coding algorithms and their parameters.

4 Checksums

Checksumming is used to ensure data integrity. The following checksumming algorithms are used in CRAM.

4.1 CRC32

This is a 32-bit cyclic redundancy checksum with the polynomial 0x04C11DB7. Please refer to ITU-T V.42 for more details. The value of the CRC32 hash function is written as an integer.
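In Python the checksum can be computed with `zlib.crc32`, on the assumption that zlib's CRC-32 is the same function as the one referenced here (it uses the same generator polynomial and is the variant used by gzip). Writing the result as 4 little-endian bytes is likewise an assumption, following the integer layout of section 2.3; `crc32_bytes` is a hypothetical helper name.

```python
import struct
import zlib


def crc32_bytes(data):
    """CRC32 of `data`, serialised as a 4-byte little-endian integer."""
    return struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF)


checksum = zlib.crc32(b"123456789")   # conventional CRC test input
print(hex(checksum), crc32_bytes(b"123456789").hex(" "))
```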

4.2 CRC32 sum

CRC32 sum is a combination of CRC32 values obtained by summing all individual CRC32 values modulo $2^{32}$.
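A minimal sketch of the combination, again assuming `zlib.crc32` provides the CRC32 of section 4.1; the function name `crc32_sum` is hypothetical.

```python
import zlib


def crc32_sum(chunks):
    """Sum the CRC32 of each chunk, reducing modulo 2**32."""
    total = 0
    for chunk in chunks:
        total = (total + zlib.crc32(chunk)) & 0xFFFFFFFF
    return total


combined = crc32_sum([b"ACGT", b"TTGA", b"ACGT"])
print(hex(combined))
```

Because addition is commutative, the combined value does not depend on the order in which the individual values are summed.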

5 File structure

The overall CRAM file structure is described in this section. Please refer to other sections of this document for more detailed information.

A CRAM file consists of a fixed length file definition, followed by a CRAM header container, then zero or more data containers, and finally a special end-of-file container.
| File definition | CRAM Header Container | Data Container | $\cdots$ | Data Container | CRAM EOF Container |
| :---: | :---: | :---: | :---: | :---: | :---: |

Figure 1: A CRAM file consists of a file definition, followed by a header container, then other containers.

Containers consist of one or more blocks. The first container, called the CRAM header container, is used to store a textual header as described in the SAM specification (see section 7.1). This container may have additional padding bytes present for purposes of permitting inline rewriting of the SAM header with small changes in size. These padding bytes are undefined, but we recommend filling with nuls. The padding bytes can either be in explicit uncompressed Block structures, or as unallocated extra space where the size of the container is larger than the combined size of blocks held within it.
Figure 2: The first container holds the CRAM header text.

Each container starts with a container header structure followed by one or more blocks. The first block in each container is the compression header block giving details of how to decode data in subsequent blocks. Each block starts with a block header structure followed by the block data.
Figure 3: Containers as a series of blocks

The blocks after the compression header are organised logically into slices. One slice may contain, for example, a contiguous region of alignment data. Slices begin with a slice header block and are followed by one or more data blocks. It is these data blocks which hold the primary bulk of CRAM data. The data blocks are further subdivided into a core data block and one or more external data blocks.
Figure 4: Slices formed from a series of concatenated blocks

6 File definition

Each CRAM file starts with a fixed length (26 bytes) definition with the following fields:

| Data type | Name | Value |
| :--- | :--- | :--- |
| byte[4] | format magic number | CRAM (0x43 0x52 0x41 0x4d) |
| unsigned byte | major format number | 3 (0x3) |
| unsigned byte | minor format number | 1 (0x1) |
| byte[20] | file id | CRAM file identifier (e.g. file name or SHA1 checksum) |

Valid CRAM major.minor version numbers are as follows:

1.0 The original public CRAM release.

2.0 The first CRAM release implemented in both Java and C; tidied up implementation vs specification differences in 1.0.

2.1 Gained end of file markers; compatible with 2.0.

3.0 Additional compression methods; header and data checksums; improvements for unsorted data.

3.1 Additional EXTERNAL compression codecs only.

CRAM 3.0 and 3.1 differ only in the list of compression methods available, so tools that output CRAM 3 without using any 3.1 codecs should write the header to indicate 3.0 in order to permit maximum compatibility.
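The 26-byte file definition and the version recommendation can be sketched with `struct`. The function names are hypothetical, and padding a short file id with zero bytes is a choice made for this sketch, not something this section prescribes.

```python
import struct

FILE_DEFINITION = struct.Struct("<4sBB20s")   # magic, major, minor, file id


def build_file_definition(file_id, uses_31_codecs):
    """Build the definition, declaring 3.0 unless 3.1-only codecs are in use."""
    minor = 1 if uses_31_codecs else 0
    # struct pads a short file id with zero bytes up to 20 bytes.
    return FILE_DEFINITION.pack(b"CRAM", 3, minor, file_id)


def parse_file_definition(buf):
    """Return (major, minor, file_id) from the first 26 bytes of a CRAM file."""
    if len(buf) < FILE_DEFINITION.size:
        raise ValueError("file definition is truncated")
    magic, major, minor, file_id = FILE_DEFINITION.unpack_from(buf)
    if magic != b"CRAM":
        raise ValueError("not a CRAM file")
    return major, minor, file_id


definition = build_file_definition(b"example.cram", uses_31_codecs=False)
print(len(definition), parse_file_definition(definition)[:2])
```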

7 Container header structure

The file definition is followed by one or more containers with the following header structure where the container content is stored in the ‘blocks’ field:
| Data type | Name | Value |
| :--- | :--- | :--- |
| int32 | length | the sum of the lengths of all blocks in this container (headers and data) and any padding bytes (CRAM header container only); equal to the total byte length of the container minus the byte length of this header structure |
| itf8 | reference sequence id | reference sequence identifier or <br> -1 for unmapped reads <br> -2 for multiple reference sequences. <br> All slices in this container must have a reference sequence id matching this value. |
| itf8 | starting position on the reference | the alignment start position |
| itf8 | alignment span | the length of the alignment |
| itf8 | number of records | number of records in the container |
| ltf8 | record counter | 1-based sequential index of records in the file/stream. |
| ltf8 | bases | number of read bases |
| itf8 | number of blocks | the total number of blocks in this container |
| array<itf8> | landmarks | the locations of slices in this container as byte offsets from the end of this container header, used for random access indexing. For sequence data containers, the landmark count must equal the slice count. <br> Since the block before the first slice is the compression header, landmarks[0] is equal to the byte length of the compression header. |
| int | crc32 | CRC32 hash of all the preceding bytes in the container. |
| byte[] | blocks | The blocks contained within the container. |

In the initial CRAM header container, the reference sequence id, starting position on the reference, and alignment span fields must be ignored when reading. The landmarks array is optional for the CRAM header, but if it exists it should point to block offsets instead of slices, with the first block containing the textual header.
In data containers specifying unmapped reads or multiple reference sequences (i.e. reference sequence id < 0), the starting position on the reference and alignment span fields must be ignored when reading. When writing, it is recommended to set each of these ignored fields to the value 0.
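A reader for the container header fields might look like the sketch below. It is illustrative only: the function and dictionary key names are hypothetical, the ITF-8 and LTF-8 readers assume the value bits are stored most significant first after the prefix, and the `int` crc32 field is assumed to be 4 little-endian bytes checked with `zlib.crc32`. The special handling of the CRAM header container is left out.

```python
import struct
import zlib


def read_itf8(data, pos):
    b0 = data[pos]
    extra = 0
    while extra < 4 and b0 & (0x80 >> extra):
        extra += 1
    if extra < 4:
        value = b0 & (0x7F >> extra)
        for b in data[pos + 1:pos + 1 + extra]:
            value = (value << 8) | b
    else:  # five-byte form: only the low 4 bits of the last byte are used
        value = ((b0 & 0x0F) << 28) | (data[pos + 1] << 20) | (data[pos + 2] << 12) \
            | (data[pos + 3] << 4) | (data[pos + 4] & 0x0F)
    return (value - (1 << 32) if value >= 1 << 31 else value), pos + 1 + extra


def read_ltf8(data, pos):
    b0 = data[pos]
    extra = 0
    while extra < 8 and b0 & (0x80 >> extra):
        extra += 1
    value = b0 & (0x7F >> extra) if extra < 8 else 0
    for b in data[pos + 1:pos + 1 + extra]:
        value = (value << 8) | b
    return (value - (1 << 64) if value >= 1 << 63 else value), pos + 1 + extra


def parse_container_header(data, pos=0):
    """Return (header dict, position of the first block)."""
    start = pos
    (length,) = struct.unpack_from("<i", data, pos)
    pos += 4
    header = {"length": length}
    for name in ("ref_seq_id", "ref_start", "alignment_span", "n_records"):
        header[name], pos = read_itf8(data, pos)
    for name in ("record_counter", "n_bases"):
        header[name], pos = read_ltf8(data, pos)
    header["n_blocks"], pos = read_itf8(data, pos)
    n_landmarks, pos = read_itf8(data, pos)      # array<itf8>: length first
    header["landmarks"] = []
    for _ in range(n_landmarks):
        offset, pos = read_itf8(data, pos)
        header["landmarks"].append(offset)
    (stored_crc,) = struct.unpack_from("<I", data, pos)
    if zlib.crc32(data[start:pos]) != stored_crc:
        raise ValueError("container header CRC32 mismatch")
    if header["ref_seq_id"] < 0:
        # Unmapped or multi-reference container: ignore start and span.
        header["ref_start"] = header["alignment_span"] = 0
    return header, pos + 4
```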

7.1 CRAM header container

The first container in a CRAM file contains a textual header in one or more blocks. See section 8.3 for more details on the layout of data within these blocks and constraints applied to the contents of the SAM header.
The landmarks field of the container header structure may be used to indicate the offsets of the blocks used in the header container. These may optionally be omitted by specifying an array size of zero.

8 Block structure

Containers consist of one or more blocks. Block compression is applied independently and in addition to any encodings used to compress data within the block. Blocks have the following header structure, with the data stored in the ‘block data’ field:
| Data type | Name | Value |
| :--- | :--- | :--- |
| byte | method | the block compression method (and first CRAM version): <br> 0: raw (none)* <br> 1: gzip <br> 2: bzip2 (v2.0) <br> 3: lzma (v3.0) <br> 4: rans4x8 (v3.0) <br> 5: rans4x16 (v3.1) <br> 6: adaptive arithmetic coder (v3.1) <br> 7: fqzcomp (v3.1) <br> 8: name tokeniser (v3.1) |
| byte | block content type id | the block content type identifier |
| itf8 | block content id | the block content identifier used to associate external data blocks with data series |
| itf8 | size in bytes* | size of the block data after applying block compression |
| itf8 | raw size in bytes* | size of the block data before applying block compression |
| byte[] | block data | the data stored in the block: <br> • bit stream of CRAM records (core data block) <br> • byte stream (external data block) <br> • additional fields (header blocks) |
| byte[4] | CRC32 | CRC32 hash value for all preceding bytes in the block |

* Note on raw method: both compressed and raw sizes must be set to the same value.

Empty blocks may occur in the files. Blocks with a raw (uncompressed) size of zero are treated as empty, irrespective of their "method" byte. This is equivalent to interpreting them as having method zero (raw) and compressed size of zero.
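The block layout above can be sketched as a small reader. This is a minimal illustration rather than a reference implementation: the ITF8 byte layout follows the definition given elsewhere in this specification as restated here, the CRC32 is assumed to be the zlib-style checksum stored as a little-endian 32-bit value, and the names `read_itf8` and `parse_block` are hypothetical. No decompression is attempted; the payload is returned as stored.

```python
import struct
import zlib


def read_itf8(buf, pos):
    """Decode one ITF8 integer from buf at pos; return (value, new_pos)."""
    b0 = buf[pos]
    if b0 < 0x80:                                   # 0xxxxxxx
        return b0, pos + 1
    if b0 < 0xC0:                                   # 10xxxxxx + 1 byte
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if b0 < 0xE0:                                   # 110xxxxx + 2 bytes
        return ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    if b0 < 0xF0:                                   # 1110xxxx + 3 bytes
        return (((b0 & 0x0F) << 24) | (buf[pos + 1] << 16)
                | (buf[pos + 2] << 8) | buf[pos + 3]), pos + 4
    # 1111xxxx + 4 bytes: only the low 4 bits of the final byte are used.
    val = (((b0 & 0x0F) << 28) | (buf[pos + 1] << 20) | (buf[pos + 2] << 12)
           | (buf[pos + 3] << 4) | (buf[pos + 4] & 0x0F))
    if val >= 1 << 31:                              # reinterpret as signed 32-bit
        val -= 1 << 32
    return val, pos + 5


def parse_block(buf, pos=0):
    """Parse one block starting at pos; return (fields, position after block)."""
    start = pos
    method, content_type = buf[pos], buf[pos + 1]
    pos += 2
    content_id, pos = read_itf8(buf, pos)
    comp_size, pos = read_itf8(buf, pos)
    raw_size, pos = read_itf8(buf, pos)
    data = bytes(buf[pos:pos + comp_size])
    pos += comp_size
    # Assumption: CRC32 is stored little-endian and covers all preceding bytes.
    (stored_crc,) = struct.unpack_from("<I", buf, pos)
    if stored_crc != zlib.crc32(buf[start:pos]):
        raise ValueError("block CRC32 mismatch")
    pos += 4
    if raw_size == 0:
        # Empty block: treated as method 0 (raw) with no data.
        method, data = 0, b""
    elif method == 0 and comp_size != raw_size:
        raise ValueError("raw block must have equal compressed and raw sizes")
    fields = {"method": method, "content_type": content_type,
              "content_id": content_id, "raw_size": raw_size, "data": data}
    return fields, pos
```

A caller would pass `fields["data"]` to the decompressor selected by `fields["method"]` before interpreting the payload.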

8.1 Block content types

CRAM has the following block content types:

| Block content type | Block content type id | Name | Contents |
| :--- | :--- | :--- | :--- |
| FILE_HEADER | 0 | CRAM header block | CRAM header |
| COMPRESSION_HEADER | 1 | Compression header block | See specific section |
| SLICE_HEADER${}^{\text{a}}$ | 2 | Slice header block | See specific section |
| | 3 | | reserved |
| EXTERNAL_DATA | 4 | external data block | data produced by external encodings |
| CORE_DATA | 5 | core data block | bit stream of all encodings except for external encodings |

8.2 Block content id

Block content id is used to distinguish between external blocks in the same slice. Each external encoding has an id parameter which must be one of the external block content ids. For external blocks the content id is a positive integer. For all other blocks the content id should be 0. Consequently, external encodings must not use a content id less than 1.

Data blocks

Data is stored in data blocks. There are two types of data blocks: core data blocks and external data blocks. The difference between core and external data blocks is that core data blocks consist of data series that are compressed using bit encodings while the external data blocks are byte compressed. One core data block and any number of external data blocks are associated with each slice.

Writing to and reading from core and external data blocks is organised through CRAM records. Each data series is associated with an encoding. In case of external encodings the block content id is used to identify the block where the data series is stored. Please note that external blocks can have multiple data series associated with them; in this case the values from these data series will be interleaved.
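The interleaving described above can be illustrated with a toy model. The sketch below assumes two byte-valued data series (BA and QS) that both point at external block content id 5; that mapping, and the helper names `write_series` and `read_series`, are hypothetical and chosen only for illustration. Values are appended to the shared block in the order they are produced, so the reader must consume them in the same order.

```python
from collections import defaultdict

# Assumption for illustration: both series use the same external block.
SERIES_TO_BLOCK = {"BA": 5, "QS": 5}


def write_series(values_in_order):
    """Append each (series, byte value) to the block its series maps to."""
    blocks = defaultdict(bytearray)
    for series, value in values_in_order:
        blocks[SERIES_TO_BLOCK[series]].append(value)
    return {cid: bytes(data) for cid, data in blocks.items()}


def read_series(blocks, series_order):
    """Read values back, keeping one cursor per block content id."""
    cursors = defaultdict(int)
    out = []
    for series in series_order:
        cid = SERIES_TO_BLOCK[series]
        out.append((series, blocks[cid][cursors[cid]]))
        cursors[cid] += 1
    return out
```

Because the cursor belongs to the block rather than to the data series, decoding the series in a different order from the one used when writing would return the wrong values.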

8.3 CRAM header block(s)

The SAM header is stored in the first block of the CRAM header container (see section 7.1). This block may be uncompressed or gzip compressed only. This block is followed by zero or more uncompressed expansion blocks. If present, these permit in-place editing of the CRAM header, allowing it to grow or shrink with a compensatory size change applied to the subsequent expansion block, avoiding the need to rewrite the remainder of the file. The contents of any expansion blocks should be zero bytes (nul characters).

The format of the initial SAM header block is a 32-bit little-endian integer holding the length of the text of the SAM header, minus nul-termination bytes, followed by the text itself. Although 32-bit, the maximum permitted value is $2^{31}$, and all lengths must be positive.
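As a minimal sketch of the layout just described, the function below extracts the SAM header text from the already uncompressed contents of the first header block. The name `parse_sam_header_block` is hypothetical, and decoding the text as UTF-8 is an assumption of this sketch.

```python
import struct


def parse_sam_header_block(data):
    """Return the SAM header text from uncompressed block data."""
    if len(data) < 4:
        raise ValueError("block too short to hold the length field")
    # 32-bit little-endian length of the header text.
    (length,) = struct.unpack_from("<I", data, 0)
    if length > 2 ** 31:
        raise ValueError("header length exceeds the permitted maximum")
    if len(data) < 4 + length:
        raise ValueError("block shorter than the declared header length")
    # Any bytes after the text (for example nul padding) are ignored.
    return data[4:4 + length].decode("utf-8")
```

For example, a block holding the length 11 followed by `@HD\tVN:1.6\n` and some nul padding yields just the 11 characters of header text.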

The following constraints apply to the SAM header text:
  • The SQ:MD5 checksum is required unless the reference sequence has been embedded into the file.

8.4 Compression header block

The compression header block consists of 3 parts: preservation map, data series encoding map and tag encoding map.

Preservation map

The preservation map contains information about which data was preserved in the CRAM file. It is stored as a map with byte[2] keys:

| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| RN | bool | read names included | true if read names are preserved for all reads |
| AP | bool | AP data series delta | true if AP data series is delta, false otherwise |
| RR | bool | reference required | true if reference sequence is required to restore the data completely |
| SM | byte[5] | substitution matrix | substitution matrix |
| TD | array<byte> | tag ids dictionary | a list of lists of tag ids, see tag encoding section |

The boolean values are optional, defaulting to true when absent, although it is recommended to explicitly set them. SM and TD are mandatory.
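The defaulting rule can be sketched as follows, assuming the preservation map has already been decoded into a Python dictionary keyed by the two-character strings shown in the table. The function name `apply_preservation_defaults` is hypothetical.

```python
def apply_preservation_defaults(pmap):
    """Fill in absent boolean keys and check the mandatory ones."""
    for key in ("SM", "TD"):
        if key not in pmap:
            raise ValueError("mandatory preservation map key missing: " + key)
    # RN, AP and RR default to true when absent.
    result = {"RN": True, "AP": True, "RR": True}
    result.update(pmap)
    return result
```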

Data series encodings

Each data series has an encoding. These encodings are stored in a map with byte[2] keys and are decoded in approximately this order${}^{2}$:

| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| BF | encoding<int> | BAM bit flags | see separate section |
| CF | encoding<int> | CRAM bit flags | see specific section |
| RI | encoding<int> | reference id | record reference id from the SAM file header |
| RL | encoding<int> | read lengths | read lengths |
| AP | encoding<int> | in-seq positions | if AP-Delta = true: 0-based alignment start delta from the AP value in the previous record. Note this delta may be negative, for example when switching references in a multi-reference slice. When the record is the first in the slice, the previous position used is the slice alignment-start field (hence the first delta should be zero for single-reference slices, or the AP value itself for multi-reference slices).<br>if AP-Delta = false: encodes the alignment start position directly |
| RG | encoding<int> | read groups | read groups. Special value '-1' stands for no group. |
| RN${}^{\text{a}}$ | encoding<byte[]> | read names | read names |
| MF | encoding<int> | next mate bit flags | see specific section |
| NS | encoding<int> | next fragment reference sequence id | reference sequence ids for the next fragment |
| NP | encoding<int> | next mate alignment start | alignment positions for the next fragment |
| TS | encoding<int> | template size | template sizes |
| NF | encoding<int> | distance to next fragment | number of records to skip to the next fragment${}^{\text{b}}$ |
| TL${}^{\text{c}}$ | encoding<int> | tag ids | list of tag ids, see tag encoding section |
| FN | encoding<int> | number of read features | number of read features in each record |
| FC | encoding<byte> | read features codes | see separate section |
| FP | encoding<int> | in-read positions | positions of the read features; a positive delta to the last position (starting with zero) |
| DL | encoding<int> | deletion lengths | base-pair deletion lengths |
| BB | encoding<byte[]> | stretches of bases | bases |
| QQ | encoding<byte[]> | stretches of quality scores | quality scores |
| BS | encoding<byte> | base substitution codes | base substitution codes |
| IN | encoding<byte[]> | insertion | inserted bases |
| RS | encoding<int> | reference skip length | number of skipped bases for the 'N' read feature |
| PD | encoding<int> | padding | number of padded bases |
| HC | encoding<int> | hard clip | number of hard clipped bases |
| SC | encoding<byte[]> | soft clip | soft clipped bases |
| MQ | encoding<int> | mapping qualities | mapping quality scores |
| BA | encoding<byte> | bases | bases |
| QS | encoding<byte> | quality scores | quality scores |
| TC${}^{\text{d}}$ | N/A | legacy field | to be ignored |
| TN${}^{\text{d}}$ | N/A | legacy field | to be ignored |
${}^{\text{a}}$ Note that RN is decoded after MF if the record is detached from the mate and we are attempting to auto-generate read names.

${}^{\text{b}}$ The count is reset for each slice so NF can only refer to a record later within this slice.

${}^{\text{c}}$ TL is followed by decoding the tag values themselves, in order of appearance in the tag dictionary.

${}^{\text{d}}$ TC and TN are legacy data series from CRAM 1.0. They have no function in CRAM 3.0 and should not be present. However, some implementations do output them and decoders must silently skip these fields. It is illegal for TC and TN to contain any data values, although there may be empty blocks associated with them.
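The AP rule in the table above can be sketched as a running sum. This is an illustration only: `decode_alignment_starts` is a hypothetical name, and the AP values are assumed to have been decoded from their encoding into plain integers already.

```python
def decode_alignment_starts(ap_values, ap_delta, slice_alignment_start):
    """Turn decoded AP values into alignment start positions for one slice."""
    if not ap_delta:
        # AP-Delta = false: values are the alignment starts themselves.
        return list(ap_values)
    positions = []
    previous = slice_alignment_start   # used for the first record in the slice
    for delta in ap_values:
        previous += delta              # the delta may be negative
        positions.append(previous)
    return positions
```

With a single-reference slice starting at 1000, the deltas 0, 5, 10 give positions 1000, 1005, 1015. With a multi-reference slice whose alignment start is 0, the first value is the position itself, and a later negative delta such as -400 simply moves the position back when the reference changes.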

Tag encodings

The tag dictionary (TD) describes the unique combinations of tag id / type that occur on each alignment record. For example if we search the id / types present in each record and find only two combinations - X1:i BC:Z SA:Z and X1:i BC:Z - then we have two dictionary entries in the TD map.

Let $L_i = \{T_{i0}, T_{i1}, \ldots, T_{ix}\}$ be a list of all tag ids for a record $R_i$, where $i$ is the sequential record index and $T_{ij}$ denotes the $j$-th tag id in the record. The list of unique $L_i$ is stored as the TD value in the preservation map. Maintaining the order is not a requirement for encoders (hence "combinations"), but it is permissible and thus different permutations, each encoded with their own elements in TD, should be supported by the decoder. Each $L_i$ element in TD is assigned a sequential integer number starting with 0. These integer numbers are referred to by the TL data series. Using TD, an integer from the TL data series can be mapped back into a list of tag ids. Thus per alignment record we only need to store tag values and not their ids and types.

The TD is written as a byte array consisting of $L_i$ values separated with \0. Each $L_i$ value is written as a concatenation of 3-byte $T_{ij}$ elements: tag id followed by BAM tag type code (one of A, c, C, s, S, i, I, f, Z, H or B, as described in the SAM specification). For example the TD for tag lists X1:i BC:Z SA:Z and X1:i BC:Z may be encoded as X1CBCZSAZ\0X1CBCZ\0, with X1C indicating a 1 byte unsigned value for tag X1.
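A minimal sketch of reading the TD byte array back into its lists is shown below. The function name `parse_tag_dictionary` is hypothetical; the index of each returned list is the integer that the TL data series refers to.

```python
def parse_tag_dictionary(td):
    """Split a TD byte array into lists of (tag id, type code) pairs."""
    chunks = td.split(b"\0")
    # A trailing \0 leaves an empty final chunk that is not a real entry.
    if chunks and chunks[-1] == b"":
        chunks.pop()
    entries = []
    for chunk in chunks:
        if len(chunk) % 3 != 0:
            raise ValueError("TD entry is not a whole number of 3-byte elements")
        entries.append([(chunk[i:i + 2].decode("ascii"), chr(chunk[i + 2]))
                        for i in range(0, len(chunk), 3)])
    return entries
```

Applied to the example bytes `X1CBCZSAZ\0X1CBCZ\0`, entry 0 is the list X1:C, BC:Z, SA:Z and entry 1 is X1:C, BC:Z, so a TL value of 1 selects the two-tag combination.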

Tag values

The encodings used for different tags are stored in a map. The key is 3 bytes formed from the BAM tag id and type code, matching the TD dictionary described above. Unlike the Data Series Encoding Map, the key is stored in the map as an ITF8 encoded integer, constructed using (char1 << 16) + (char2 << 8) + type. For example, the 3-byte representation of OQ:Z is {0x4F, 0x51, 0x5A} and these bytes are interpreted as the integer key 0x004F515A, leading to an ITF8 byte stream {0xE0, 0x4F, 0x51, 0x5A}.

| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| TAG ID 1:TAG TYPE 1 | encoding<byte[]> | read tag 1 | tag values (names and types are available in the data series code) |
| ... | | ... | ... |
| TAG ID N:TAG TYPE N | encoding<byte[]> | read tag N | ... |
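The key construction for OQ:Z can be reproduced with a short sketch. The helper names `tag_key` and `write_itf8` are hypothetical, and the ITF8 byte layout is restated from the definition given elsewhere in this specification.

```python
def tag_key(tag_id, type_code):
    """Build the integer map key from a 2-character tag id and a type code."""
    char1, char2 = tag_id.encode("ascii")
    return (char1 << 16) + (char2 << 8) + ord(type_code)


def write_itf8(value):
    """Encode a 32-bit integer as ITF8."""
    value &= 0xFFFFFFFF
    if value < 0x80:
        return bytes([value])
    if value < 0x4000:
        return bytes([0x80 | (value >> 8), value & 0xFF])
    if value < 0x200000:
        return bytes([0xC0 | (value >> 16), (value >> 8) & 0xFF, value & 0xFF])
    if value < 0x10000000:
        return bytes([0xE0 | (value >> 24), (value >> 16) & 0xFF,
                      (value >> 8) & 0xFF, value & 0xFF])
    # Five-byte form: the final byte carries only the low 4 bits.
    return bytes([0xF0 | (value >> 28), (value >> 20) & 0xFF,
                  (value >> 12) & 0xFF, (value >> 4) & 0xFF, value & 0x0F])
```

Here `tag_key("OQ", "Z")` gives 0x4F515A, and `write_itf8` of that key gives the four bytes E0 4F 51 5A, matching the example in the text.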
Note that tag values are encoded as arrays of bytes. The routines to convert tag values into byte arrays and back are the same as in BAM, with the exception of the value type being captured in the tag key rather than in the value. Hence a value consumes 1 byte for types 'C' and 'c', 2 bytes for types 'S' and 's', 4 bytes for types 'I', 'i' and 'f', and a variable number of bytes for types 'H', 'Z' and 'B'.
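The sizes listed above can be sketched as a lookup plus two variable-length cases. This follows the BAM auxiliary field layout as described in the SAM specification: 'Z' and 'H' values are nul-terminated, and a 'B' value is a subtype byte, a little-endian 32-bit element count, then the elements. The 1-byte 'A' type comes from BAM rather than from the paragraph above, and the name `tag_value_length` is hypothetical.

```python
import struct

FIXED_SIZES = {"A": 1, "c": 1, "C": 1, "s": 2, "S": 2, "i": 4, "I": 4, "f": 4}
ARRAY_ELEMENT_SIZES = {"c": 1, "C": 1, "s": 2, "S": 2, "i": 4, "I": 4, "f": 4}


def tag_value_length(type_code, buf, pos=0):
    """Return the number of bytes used by a tag value starting at buf[pos]."""
    if type_code in FIXED_SIZES:
        return FIXED_SIZES[type_code]
    if type_code in ("Z", "H"):
        # Length includes the terminating nul byte.
        return buf.index(b"\0", pos) - pos + 1
    if type_code == "B":
        subtype = chr(buf[pos])
        (count,) = struct.unpack_from("<I", buf, pos + 1)
        return 1 + 4 + count * ARRAY_ELEMENT_SIZES[subtype]
    raise ValueError("unknown tag type: " + type_code)
```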

8.5 Slice header block

The slice header block is never compressed (block method=raw). For reference mapped reads the slice header also defines the reference sequence context of the data blocks associated with the slice. Mapped reads can be stored along with placed unmapped${}^{3}$ reads on the same reference within the same slice.

Slices with the Multiple Reference flag (-2) set as the sequence ID in the header may contain reads mapped to multiple external references, including unmapped${}^{3}$ reads (placed on these references or unplaced), but multiple embedded references cannot be combined in this way. When multiple references are used, the RI data series will be used to determine the reference sequence ID for each record. This data series is not present when only a single reference is used within a slice.

The Unmapped (-1) sequence ID in the header is for slices containing only unplaced unmapped${}^{3}$ reads.

A slice containing data that does not use the external reference in any sequence may set the reference MD5 sum to zero. This can happen because the data is unmapped or the sequence has been stored verbatim instead of via reference-differencing. This latter scenario is recommended for unsorted or non-coordinate-sorted data.

The slice header block contains the following fields.

| Data type | Name | Value |
| :--- | :--- | :--- |
| itf8 | reference sequence id | reference sequence identifier or<br>-1 for unmapped reads<br>-2 for multiple reference sequences.<br>This value must match that of its enclosing container. |
| itf8 | alignment start | the alignment start position |
| itf8 | alignment span | the length of the alignment |
| itf8 | number of records | the number of records in the slice |
| ltf8 | record counter | 1-based sequential index of records in the file/stream |
| itf8 | number of blocks | the number of blocks in the slice |
| itf8[] | block content ids | block content ids of the blocks in the slice |
| itf8 | embedded reference bases block content id | block content id for the embedded reference sequence bases or -1 for none |
| byte[16] | reference md5 | MD5 checksum of the reference bases within the slice boundaries. If this slice has reference sequence id of -1 (unmapped) or -2 (multi-ref) the MD5 should be 16 bytes of \0. For embedded references, the MD5 can either be all-zeros or the MD5 of the embedded sequence. |
| byte[] | optional tags | a series of tag,type,value tuples encoded as per BAM auxiliary fields. |
The alignment start and alignment span values should only be utilised during decoding if the slice has mapped data aligned to a single reference (reference sequence id $\geq 0$). For multi-reference slices or those with unmapped data, it is recommended to fill these fields with value 0.
MD5sums should not be validated if the stored checksum is all-zero. Embedded references should follow the same capitalisation and alphabetical rules as applied to external references prior to MD5sum calculations. If an embedded reference is used, it is not a requirement that it exactly matches the reference used for sequence alignments. For example, it may contain “N” bases where coverage is absent or it could have different base calls for SNP variants. Hence when embedded sequences are used, the MD5sum refers to the checksum of the embedded sequence and should not be validated against any external reference files.

Note that where an embedded reference differs from the original reference used for alignment, the MD and NM tags may need to be stored verbatim for records where the respective embedded and external reference substrings differ.
The optional tags are encoded in the same manner as BAM tags, i.e. a series of binary encoded tags concatenated together, where each tag consists of a 2 byte key (matching [A-Za-z][A-Za-z0-9]) followed by a 1 byte type ([AfZHcCsSiIB]) followed by a string of bytes in a format defined by the type.
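As an illustration of the key/type/value layout described above, the following is a minimal Python sketch of a tag parser. The function name `parse_tags` and the `_FIXED` table are hypothetical, and the value layouts (little-endian fixed-width integers and floats, NUL-terminated strings for Z and H, and a subtype byte plus 32-bit count for B arrays) are assumed to follow the BAM auxiliary field encoding. It is not taken from any reference implementation.

```python
import struct

# Assumed BAM fixed-width value types: type code -> little-endian struct format.
_FIXED = {"A": "<c", "c": "<b", "C": "<B", "s": "<h", "S": "<H",
          "i": "<i", "I": "<I", "f": "<f"}


def parse_tags(buf: bytes):
    """Parse concatenated (key, type, value) tuples; returns a list of triples."""
    tags, pos = [], 0
    while pos < len(buf):
        key = buf[pos:pos + 2].decode("ascii")   # 2 byte key
        typ = chr(buf[pos + 2])                  # 1 byte type
        pos += 3
        if typ in _FIXED:
            fmt = _FIXED[typ]
            (value,) = struct.unpack_from(fmt, buf, pos)
            pos += struct.calcsize(fmt)
            if typ == "A":
                value = value.decode("ascii")
        elif typ in "ZH":
            end = buf.index(b"\x00", pos)        # NUL-terminated string
            value = buf[pos:end].decode("ascii")
            pos = end + 1
        elif typ == "B":
            sub = chr(buf[pos])                  # element type of the array
            (count,) = struct.unpack_from("<I", buf, pos + 1)
            fmt = "<%d%s" % (count, _FIXED[sub][1])
            value = list(struct.unpack_from(fmt, buf, pos + 5))
            pos += 5 + struct.calcsize(fmt)
        else:
            raise ValueError("unknown tag type %r" % typ)
        tags.append((key, typ, value))
    return tags
```

A decoder built this way can step over tags it does not understand, because the type byte alone determines how many bytes each value occupies.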

Tags starting with a capital letter are reserved, while lowercase ones or those starting with X, Y or Z are user definable. Any tag not understood by a decoder should be skipped over without producing an error.

At present no tags are defined.

8.6 Core data block

A core data block is a bit stream (most significant bit first) consisting of data from one or more CRAM records. Please note that one byte could hold more than one CRAM record, as a minimal CRAM record could be just a few bits long. The core data block has the following fields:
| Data type | Name | Value |
| :--- | :--- | :--- |
| bit[ ] | CRAM record 1 | The first CRAM record |
| ... | ... | ... |
| bit[ ] | CRAM record N | The Nth CRAM record |
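To make the "most significant bit first" convention concrete, here is a small Python sketch of a bit reader over a core data block. The class name `BitReader` and its method are hypothetical helpers introduced for illustration only. Real decoders would normally read whole codewords more efficiently than one bit at a time.

```python
class BitReader:
    """Read bits from a byte string, most significant bit of each byte first."""

    def __init__(self, data: bytes):
        self.data = data
        self.bit_pos = 0  # absolute bit offset from the start of the block

    def read_bits(self, n: int) -> int:
        """Return the next n bits as an unsigned integer."""
        value = 0
        for _ in range(n):
            byte = self.data[self.bit_pos >> 3]
            shift = 7 - (self.bit_pos & 7)   # bit 7 of each byte is read first
            value = (value << 1) | ((byte >> shift) & 1)
            self.bit_pos += 1
        return value
```

Because the reader keeps a bit offset rather than a byte offset, consecutive records can begin and end in the middle of a byte, which is what allows one byte to hold several minimal records.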

8.7 External data blocks

The relationship between the core data block and external data blocks is shown in the following picture:
Figure 5: The relationship between core and external encodings, and core and external data blocks.

The picture shows how a CRAM record (on the left) is distributed between the core data block and one or more external data blocks, via core or external encodings. The specific encodings presented are only examples for purposes of illustration. The main point is to distinguish between core bit encodings whose output is always stored in a core data block, and external byte encodings whose output is always stored in external data blocks.

9 End of file container

A special container is used to mark the end of a file or stream. It is required in version 3 or later. The idea is to provide an easy and quick way to detect that a CRAM file or stream is complete. The marker is basically an empty container with ref seq id set to -1 (unaligned) and alignment start set to 4542278.

The complete content of the EOF container is explained in detail below:
| hex bytes | data type | decimal value | field name |
| :---: | :---: | :---: | :---: |
| Container header | | | |
| 0f 00 00 00 | integer | 15 | size of blocks data |
| ff ff ff ff 0f | itf8 | -1 | ref seq id |
| e0 45 4f 46 | itf8 | 4542278 | alignment start |
| 00 | itf8 | 0 | alignment span |
| 00 | itf8 | 0 | number of records |
| 00 | itf8 | 0 | global record counter |
| 00 | itf8 | 0 | bases |
| 01 | itf8 | 1 | block count |
| 00 | array | 0 | landmarks |
| 05 bd d9 4f | integer | 1339669765 | container header CRC32 |
| Compression header block | | | |
| 00 | byte | 0 (RAW) | compression method |
| 01 | byte | 1 (COMPRESSION_HEADER) | block content type |
| 00 | itf8 | 0 | block content id |
| 06 | itf8 | 6 | compressed size |
| 06 | itf8 | 6 | uncompressed size |
| Compression header | | | |
| 01 | itf8 | 1 | preservation map byte size |
| 00 | itf8 | 0 | preservation map size |
| 01 | itf8 | 1 | encoding map byte size |
| 00 | itf8 | 0 | encoding map size |
| 01 | itf8 | 1 | tag encoding byte size |
| 00 | itf8 | 0 | tag encoding map size |
| ee 63 01 4b | integer | 1258382318 | block CRC32 |
When compiled together the EOF marker is 38 bytes long and in hex representation is: 0f 00 00 00 ff ff ff ff 0f e0 45 4f 46 00 00 00 00 01 00 05 bd d9 4f 00 01 00 06 06 01 00 01 00 01 00 ee 63 01 4b
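A reader can test for completeness by comparing the final 38 bytes of a file against this marker. The Python sketch below assembles the marker from the fields listed in the table and decodes the two multi-byte ITF8 values. The names `EOF_MARKER`, `read_itf8` and `ends_with_eof` are hypothetical, and the ITF8 decoder assumes the variable-length layout defined in the data types section of this specification (one to five bytes, with the count of leading 1 bits in the first byte selecting the length).

```python
# The 38 byte EOF container, grouped field by field as in the table above.
EOF_MARKER = bytes.fromhex(
    "0f000000" "ffffffff0f" "e0454f46" "00" "00" "00" "00" "01" "00" "05bdd94f"
    "00" "01" "00" "06" "06" "010001000100" "ee63014b"
)


def read_itf8(buf: bytes, pos: int = 0):
    """Decode one ITF8 integer; returns (signed 32-bit value, next offset)."""
    b0 = buf[pos]
    if b0 < 0x80:
        val, n = b0, 1
    elif b0 < 0xC0:
        val, n = ((b0 & 0x3F) << 8) | buf[pos + 1], 2
    elif b0 < 0xE0:
        val, n = ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], 3
    elif b0 < 0xF0:
        val = ((b0 & 0x0F) << 24) | (buf[pos + 1] << 16) | (buf[pos + 2] << 8) | buf[pos + 3]
        n = 4
    else:
        # 5 byte form: only the low 4 bits of the last byte are used.
        val = ((b0 & 0x0F) << 28) | (buf[pos + 1] << 20) | (buf[pos + 2] << 12) \
            | (buf[pos + 3] << 4) | (buf[pos + 4] & 0x0F)
        n = 5
    if val >= 1 << 31:          # reinterpret as a signed 32-bit integer
        val -= 1 << 32
    return val, pos + n


def ends_with_eof(data: bytes) -> bool:
    """True if the byte string terminates with the EOF container."""
    return data.endswith(EOF_MARKER)
```

The alignment start value is not arbitrary: 4542278 is 0x454f46, and those three bytes are the ASCII characters "EOF".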

10 Record structure

A CRAM record is based on the SAM record but has additional features allowing for more efficient data storage. In contrast to a BAM record, a CRAM record uses bits as well as bytes for data storage. This way, for example, various coding techniques which output variable length binary codes can be used directly in CRAM. On the other hand, data series that do not require binary coding can be stored separately in external blocks with some other compression applied to them independently.

As CRAM data series may be interleaved within the same blocks$^{4}$, understanding the order in which CRAM data series must be decoded is vital.
The overall flowchart is below, with more detailed description in the subsequent sections.

10.1 CRAM record

Both mapped and unmapped reads start with the following fields. Please note that the data series type refers to the logical data type and the data series name corresponds to the data series encoding map.
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | BF | BAM bit flags | see BAM bit flags below |
| int | CF | CRAM bit flags | see CRAM bit flags below |
| - | - | Positional data | See section 10.2 |
| - | - | Read names | See section 10.3 |
| - | - | Mate records | See section 10.4 |
| - | - | Auxiliary tags | See section 10.5 |
| - | - | Sequences | See sections 10.6 and 10.7 |

BAM bit flags (BF data series)

The following flags are duplicated from the SAM and BAM specification, with identical meaning. Note, however, that some of these flags can be derived during decoding, so they may be omitted from the CRAM file and the bits computed instead when both reads of a paired-end library reside within the same slice.
| Bit flag | Comment | Description |
| :---: | :---: | :---: |
| 0x1 | | template having multiple segments in sequencing |
| 0x2 | | each segment properly aligned according to the aligner |
| 0x4 | | segment unmapped $^{\mathrm{a}}$ |
| 0x8 | calculated $^{\mathrm{b}}$ or stored in the mate's info | next segment in template unmapped |
| 0x10 | | SEQ being reverse complemented |
| 0x20 | calculated $^{\mathrm{b}}$ or stored in the mate's info | SEQ of the next segment in the template being reverse complemented |
| 0x40 | | the first segment in the template $^{\mathrm{c}}$ |
| 0x80 | | the last segment in the template $^{\mathrm{c}}$ |
| 0x100 | | secondary alignment |
| 0x200 | | not passing quality controls |
| 0x400 | | PCR or optical duplicate |
| 0x800 | | Supplementary alignment |
$^{\mathrm{a}}$ Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions may be made about bits 0x2, 0x100 and 0x800.

$^{\mathrm{b}}$ For segments within the same slice.

$^{\mathrm{c}}$ Bits 0x40 and 0x80 reflect the read ordering within each template inherent in the sequencing technology used, which may be independent from the actual mapping orientation. If 0x40 and 0x80 are both set, the read is part of a linear template (one where the template sequence is expected to be in a linear order), but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template is unknown. This may happen for a non-linear template (such as one constructed by stitching together other templates) or when this information is lost during data processing.
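The flag values and the rules in the notes above can be captured in a few lines of Python. The sketch below is illustrative only: the member names of `BamFlag` and the helper functions are hypothetical, while the numeric values are those listed in the table.

```python
from enum import IntFlag


class BamFlag(IntFlag):
    """BF bit values; the member names are illustrative, not from the spec."""
    PAIRED = 0x1
    PROPER_PAIR = 0x2
    UNMAPPED = 0x4
    MATE_UNMAPPED = 0x8
    REVERSE = 0x10
    MATE_REVERSE = 0x20
    FIRST = 0x40
    LAST = 0x80
    SECONDARY = 0x100
    QC_FAIL = 0x200
    DUPLICATE = 0x400
    SUPPLEMENTARY = 0x800


def is_unmapped(bf: int) -> bool:
    """Bit 0x4 is the only reliable indicator that a read is unmapped."""
    return bool(bf & BamFlag.UNMAPPED)


def template_position(bf: int) -> str:
    """Interpret bits 0x40 and 0x80 together."""
    first = bool(bf & BamFlag.FIRST)
    last = bool(bf & BamFlag.LAST)
    if first and last:
        return "middle"    # part of a linear template, neither first nor last
    if first:
        return "first"
    if last:
        return "last"
    return "unknown"       # index of the read in the template is not known
```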

CRAM bit flags (CF data series)

The CRAM bit flags (also known as compression bit flags) expressed as an integer represent the CF data series. The following compression flags are defined for each CRAM read record:
| Bit flag | Name | Description |
| :--- | :--- | :--- |
| 0x1 | quality scores stored as array | quality scores can be stored as read features or as an array similar to read bases. |
| 0x2 | detached | mate information is stored verbatim (e.g. because the pair spans multiple slices or the fields differ to the CRAM computed method) |
| 0x4 | has mate downstream | tells if the next segment should be expected further in the stream |
| 0x8 | decode sequence as "*" | informs the decoder that the sequence is unknown and that any encoded reference differences are present only to recreate the CIGAR string. |
The following pseudocode describes the general process of decoding an entire CRAM record. The sequence data itself is in one of two encoding formats depending on whether the record is aligned (mapped).

Decode pseudocode

procedure DECODERECORD
    BAM_flags ← READITEM(BF, Integer)
    CRAM_flags ← READITEM(CF, Integer)
    DECODEPOSITIONS                      ▷ See section 10.2
    DECODENAMES                          ▷ See section 10.3
    DECODEMATEDATA                       ▷ See section 10.4
    DECODETAGDATA                        ▷ See section 10.5
    if (BAM_flags AND 4) = 0 then        ▷ Unmapped flag
        DECODEMAPPEDREAD                 ▷ See section 10.6
    else
        DECODEUNMAPPEDREAD               ▷ See section 10.7
    end if
end procedure
This pseudocode is not meant to be a fully implementable programming language, but to act as an algorithmic guide to the order and structure of CRAM decoding.

The READITEM function referred to above takes two arguments: the data series name and the data type used by the Encoding. It will use the codec specified in the Container Compression Header to retrieve the next value from that data series. Note there is only one permitted data type per data series, so the second argument is redundant and is included only as an aide-mémoire.

10.2 CRAM positional data

Following the bit-wise BAM and CRAM flags, CRAM encodes positional related data including reference, alignment positions and length, and read-group. Positional data is stored for both mapped and unmapped sequences, as unmapped data may still be “placed” at a specific location in the genome (without being aligned). Typically this is done to keep a sequence pair (paired-end or mate-pair sequencing libraries) together when one of the pair aligns and the other does not.
For reads stored in a position-sorted slice, the AP-delta flag in the compression header preservation map should be set and the AP data series will be delta encoded, using the slice alignment-start value as the first position to delta against. Note for multi-reference slices this may mean that the AP series includes negative values, such as when moving from an alignment at the end of one reference sequence to the start of the next or to unmapped unplaced data. When the AP-delta flag is not set the AP data series is stored as a normal integer value.
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | RI | ref id | reference sequence id (only present in multiref slices) |
| int | RL | read length | the length of the read |
| int | AP | alignment start | the alignment start position |
| int | RG | read group | the read group identifier expressed as the Nth record in the header, starting from 0 with -1 for no group |
procedure DECODEPOSITIONS
    if slice_header.reference_sequence_id = −2 then
        reference_id ← READITEM(RI, Integer)
    else
        reference_id ← slice_header.reference_sequence_id
    end if
    read_length ← READITEM(RL, Integer)
    if container_pmap.AP_delta ≠ 0 then
        if first_record_in_slice then
            last_position ← slice_header.alignment_start
        end if
        alignment_position ← READITEM(AP, Integer) + last_position
        last_position ← alignment_position
    else
        alignment_position ← READITEM(AP, Integer)
    end if
    read_group ← READITEM(RG, Integer)
end procedure
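As a worked illustration of the AP handling in DECODEPOSITIONS, the Python sketch below turns a list of already-decoded AP values into absolute positions. The function name and arguments are hypothetical. It assumes the AP values have been read from the data series in record order and that `slice_alignment_start` is the alignment start field of the slice header.

```python
def decode_alignment_positions(ap_values, ap_delta, slice_alignment_start):
    """Convert raw AP data series values into absolute alignment positions."""
    positions = []
    last_position = slice_alignment_start   # the first delta is against the slice start
    for ap in ap_values:
        if ap_delta:
            last_position = last_position + ap   # running sum of deltas
            positions.append(last_position)
        else:
            positions.append(ap)                  # value stored verbatim
    return positions
```

For a slice starting at position 1000 with AP values 0, 5, 3, 0 and AP-delta set, the decoded positions are 1000, 1005, 1008, 1008. A negative delta simply moves the running position backwards, as can happen in multi-reference slices.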

10.3 Read names (RN data series)

Read names can be preserved in the CRAM format, but this is optional and is governed by the RN preservation map key in the container compression header. See section 8.4. When read names are not preserved the CRAM decoder should generate names, typically based on the file name and a numeric ID of the read using the record counter field of the slice header block. Note read names may still be preserved even when the RN compression header key indicates otherwise, such as where a read is part of a read-pair and the pair spans multiple slices. In this situation the record will be marked as detached (see the CF data series) and the mate data below (section 10.4) will contain the read name.
| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| byte[ ] | RN | read names | read names |
procedure DECODENAMES
    if container_pmap.read_names_included = 1 then
        read_name ← READITEM(RN, Byte[])
    else
        read_name ← GENERATENAME
    end if
end procedure

10.4 Mate records

There are two ways in which mate information can be preserved in CRAM. If the next fragment is not in the same slice we store verbatim copies of the insert size, mate reference chromosome and positions, and mate flags (mapped status, orientation) for both records. In this case both records are labelled as “detached” in the CF data series using bit 2 (flag value 0x2).

If this and the next fragment are within the same slice, we can derive much of this information by comparing the two records. The upstream record has the CF bit 4 (mate downstream) flag set and stores the number of records to skip (in the NF data series) between this record and the record for the next fragment on this template, with zero meaning the next fragment is also the next record. The downstream record has neither CF bit 2 (detached) nor CF bit 4 (mate downstream) set, nor does it use the NF data series (unless it also has an additional “next fragment” to refer to).
It is not mandatory to use this deduplication approach, and CRAM write implementations may optionally label data as detached even when all records for the template reside in the same slice. One reason to do this may be to preserve inconsistent data so that it round-trips through the CRAM format with full fidelity.
| Data series type | Data series name | Description |
| :--- | :--- | :--- |
| int | NF | the number of records to skip to the next fragment |
In the above case, the NS (mate reference name), NP (mate position) and TS (template size) fields for both records should be derived once the mate has also been decoded. Mate reference name and position are obvious and simply copied from the mate. The template size is computed using the method described in the SAM specification; the inclusive distance from the leftmost to rightmost mapped bases with the sign being positive for the leftmost record and negative for the rightmost record.
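The template size rule described above can be sketched in Python as follows. The function name and arguments are hypothetical, coordinates are assumed to be 1-based inclusive start and end positions of the mapped bases of each record, and the tie-break when both records start at the same position (the first argument is treated as leftmost) is an assumption made here for illustration.

```python
def template_sizes(start_a, end_a, start_b, end_b):
    """Return (TS for record A, TS for record B) for two mapped records."""
    leftmost = min(start_a, start_b)
    rightmost = max(end_a, end_b)
    span = rightmost - leftmost + 1      # inclusive distance
    if start_a <= start_b:               # A is the leftmost record
        return span, -span
    return -span, span
```

For example, records covering 100 to 149 and 300 to 349 give a span of 349 - 100 + 1 = 250, so the leftmost record receives +250 and the rightmost record receives -250.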

If the next fragment is not found within this slice then the following structure is included into the CRAM record. Note there are cases where read-pairs within the same slice may be marked as detached and use this structure, such as to store mate-pair information that does not match the algorithm used by CRAM for computing the mate data on-the-fly.
| Data series type | Data series name | Description |
| :--- | :--- | :--- |
| int | MF | next mate bit flags, see table below |
| byte[] | RN | the read name (if and only if not known already) |
| int | NS | mate reference sequence identifier |
| int | NP | mate alignment start position |
| int | TS | the size of the template (insert size) |

Next mate bit flags (MF data series)

The next mate bit flags expressed as an integer represent the MF data series. These represent the missing bits we excluded from the BF data series (when compared to the full SAM/BAM flags). The following bit flags are defined:
| Bit flag | Name | Description |
| :--- | :--- | :--- |
| 0x1 | mate negative strand bit | the bit is set if the mate is on the negative strand |
| 0x2 | mate unmapped bit | the bit is set if the mate is unmapped |

Decode mate pseudocode

In the following pseudocode we are assuming the current record is this and its mate is next_frag.

procedure DECODEMATEDATA
    if CF AND 2 then                                    ▷ Detached from mate
        mate_flags ← READITEM(MF, Integer)
        if mate_flags AND 1 then
            bam_flags ← bam_flags OR 0x20               ▷ Mate is reverse-complemented
        end if
        if mate_flags AND 2 then
            bam_flags ← bam_flags OR 0x08               ▷ Mate is unmapped
        end if
        if container_pmap.read_names_included ≠ 1 then
            read_name ← READITEM(RN, Byte[])
        end if
        mate_ref_id ← READITEM(NS, Integer)
        mate_position ← READITEM(NP, Integer)
        template_size ← READITEM(TS, Integer)
    else if CF AND 4 then                               ▷ Mate is downstream
        if next_frag.bam_flags AND 0x10 then
            this.bam_flags ← this.bam_flags OR 0x20     ▷ next segment reverse complemented
        end if
        if next_frag.bam_flags AND 0x04 then
            this.bam_flags ← this.bam_flags OR 0x08     ▷ next segment unmapped
        end if
        next_frag ← READITEM(NF, Integer)
        next_record ← this_record + next_frag + 1
        Resolve mate_ref_id for this_record and next_record once both have been decoded
        Resolve mate_position for this_record and next_record once both have been decoded
        Find leftmost and rightmost mapped coordinate in records this_record and next_record.
        For leftmost of this_record and next_record: template_size ← rightmost − leftmost + 1
        For rightmost of this_record and next_record: template_size ← −(rightmost − leftmost + 1)
    end if
end procedure
Note that, as with the SAM specification, a template may have more than two alignment records. In this case the "mate" for each record is considered to be the next record, with the mate for the last record being the first, so that the records form a circular list. The above algorithm is a simplification that does not deal with this scenario. The full method needs to observe when record this + NF is also labelled as having an additional mate downstream. One recommended approach is to resolve the mate information in a second pass, once the entire slice has been decoded. The final segment in the mate chain needs to set the bam_flags bits 0x20 and 0x08 according to the first segment. This is also omitted from the above algorithm, for brevity.
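As a rough illustration of the detached branch and of the template size rule in the pseudocode above, the following Python sketch may help. The function names and the choice to pass coordinates as plain integers are assumptions made for this example only; they are not defined by the specification.

```python
# BAM flag bits that are restored from the MF data series.
BAM_MATE_REVERSE = 0x20
BAM_MATE_UNMAPPED = 0x08


def apply_mate_flags(bam_flags: int, mate_flags: int) -> int:
    """Fold the two MF bits back into a BAM flag word (hypothetical helper)."""
    if mate_flags & 0x1:            # mate negative strand bit
        bam_flags |= BAM_MATE_REVERSE
    if mate_flags & 0x2:            # mate unmapped bit
        bam_flags |= BAM_MATE_UNMAPPED
    return bam_flags


def template_sizes(leftmost: int, rightmost: int) -> tuple[int, int]:
    """Return (size for the leftmost record, size for the rightmost record)."""
    span = rightmost - leftmost + 1
    return span, -span
```

The sketch covers only the two-record case; templates with more than two records need the second-pass handling described in the note above.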

10.5 Auxiliary tags

Tags are encoded using a tag line (TL data series), an integer index into the tag dictionary (TD field in the compression header preservation map). See section 8.4 for a more detailed description of this process.

| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | TL | tag line | an index into the tag dictionary (TD) |
| * | ??? | tag name/type | 3 character key (2 character tag identifier and 1 character tag type), as specified by the tag dictionary |
procedure DECODETAGDATA
    tag_line ← READITEM(TL, Integer)
    for all ele ∈ container_pmap.tag_dict(tag_line) do
        name ← first two characters of ele
        tag(type) ← last character of ele
        tag(name) ← READITEM(ele, Byte[])
    end for
end procedure
In the above procedure, name is a two letter tag name and type is one of the permitted types documented in the SAM/BAM specification. Type is A (a single character), c (signed 8-bit integer), C (unsigned 8-bit integer), s (signed 16-bit integer), S (unsigned 16-bit integer), i (signed 32-bit integer), I (unsigned 32-bit integer), f (32-bit float), Z (nul-terminated string), H (nul-terminated string of hex digits) and B (binary data in array format with the first byte being one of c, C, s, S, i, I, f using the meaning above, a 32-bit integer for the number of array elements, followed by array data encoded using the specified format). All integers are little endian encoded.
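To make the type codes concrete, here is a minimal Python sketch that turns the raw bytes of one tag value into a Python object. The helper name `decode_tag_value` is hypothetical, and reading the B array count as an unsigned 32-bit integer is an assumption of this sketch rather than something stated in the text above.

```python
import struct

# struct format characters for the fixed-width tag types (all little endian).
_FIXED = {"c": "b", "C": "B", "s": "h", "S": "H", "i": "i", "I": "I", "f": "f"}


def decode_tag_value(tag_type: str, data: bytes):
    """Decode the raw bytes of one auxiliary tag value (hypothetical helper)."""
    if tag_type == "A":                       # single character
        return chr(data[0])
    if tag_type in _FIXED:                    # fixed-width integer or float
        return struct.unpack_from("<" + _FIXED[tag_type], data)[0]
    if tag_type in ("Z", "H"):                # nul-terminated string
        return data.split(b"\x00", 1)[0].decode("ascii")
    if tag_type == "B":                       # subtype, element count, elements
        subtype = chr(data[0])
        (count,) = struct.unpack_from("<I", data, 1)
        return list(struct.unpack_from(f"<{count}{_FIXED[subtype]}", data, 5))
    raise ValueError(f"unknown tag type {tag_type!r}")
```

An H value is returned here as its hex text; converting it to bytes is left to the caller.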

For example, a SAM tag MQ:i has name MQ and type i and will be decoded using one of the MQc, MQC, MQs, MQS, MQi and MQI data series, depending on the size and sign of the integer value.

Note that some auxiliary tags can be created automatically during decode and so can optionally be removed by the encoder. However, if the decoder finds a tag stored verbatim it should use this in preference to automatically computing the value.

The RG (read group) auxiliary tag should be created if the read group (RG data series) value is not -1.

The MD and NM auxiliary tags store the differences (an edit string) between the sequence and the reference along with the number of mismatches. These may optionally be created on-the-fly during reference-based sequence reconstruction and should match the description provided in the SAMtags document. An encoder may decide to store these verbatim when no reference is used or where the automatically constructed values differ from the input data.

Note there is no mechanism to describe which records have MD/NM present and which do not. If this is deemed important, the only recourse is to store all MD and NM verbatim and to request that the decoding software does not automatically generate its own for records that have no stored MD and NM tags.

10.6 Mapped reads

Read feature records

Read features are used to store read details that are expressed using read coordinates (e.g. base differences respective to the reference sequence). The read feature records start with the number of read features followed by the read features themselves. Each read feature has the position encoded as the distance since the last feature position, or the absolute position (i.e. delta vs zero) for the first feature. Finally the single mapping quality and per-base quality scores are stored.

| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| int | FN | number of read features | the number of read features |
| int | FP | in-read-position $^{a}$ | delta-position of the read feature |
| byte | FC | read feature code | See feature codes below |
| * | * | read feature data $^{a}$ | See feature codes below |
| int | MQ | mapping qualities | mapping quality score |
| byte[read length] | QS | quality scores | the base qualities, if preserved |

$^{a}$ Repeated FN times, once for each read feature.

Read feature codes

Each feature code has its own associated data series containing further information specific to that feature. The following codes are used to distinguish variations in read coordinates:

| Feature code | Id | Data series type | Data series name | Description |
| :--- | :--- | :--- | :--- | :--- |
| Bases | b (0x62) | byte[] | BB | a stretch of bases |
| Scores | q (0x71) | byte[] | QQ | a stretch of scores |
| Read base | B (0x42) | byte,byte | BA,QS | a base and associated quality score |
| Substitution | X (0x58) | byte | BS | base substitution codes, SAM operators X, M and = |
| Insertion | I (0x49) | byte[] | IN | inserted bases, SAM operator I |
| Deletion | D (0x44) | int | DL | number of deleted bases, SAM operator D |
| Insert base | i (0x69) | byte | BA | single inserted base, SAM operator I |
| Quality score | Q (0x51) | byte | QS | single quality score |
| Reference skip | N (0x4E) | int | RS | number of skipped bases, SAM operator N |
| Soft clip | S (0x53) | byte[] | SC | soft clipped bases, SAM operator S |
| Padding | P (0x50) | int | PD | number of padded bases, SAM operator P |
| Hard clip | H (0x48) | int | HC | number of hard clipped bases, SAM operator H |

Note that for compatibility with BAM, all base comparisons should be done in a case-insensitive manner, and all bases written to the SC, IN and BA data series should be in upper-case.

Base substitution codes (BS data series)

A base substitution is defined as a change from one nucleotide base (reference base) to another (read base), including N as an unknown or missing base. There are 5 supported reference bases (ACGTN), with 4 possible substitutions for each base. Any other base type, such as an ambiguity code, must be written verbatim using the BA data series.

The codes for all possible substitutions are stored in a two-dimensional substitution matrix, indexed by reference base (A, C, G, T, N) and BS code (0-3), with each matrix element holding the modified base.

Substitution Matrix Format

There are 5 possible base types supported by the BS data series: A, C, G, T and N. Hence for any reference base there are 4 possible substitutions. Each of these substitution possibilities is numbered 0 to 3, listed in the order shown above (omitting the reference base type). Therefore the full list of substitution codes for a specific reference base is four 2-bit numbers (0-3) in the order shown above, minus the reference base itself. These are packed into a single byte with the high 2 bits first.

For example, for reference base C we would record the BS numerical values for substituting C with A, G, T and N respectively. If we wish A=1, G=0, T=2 and N=3 then we would store binary 01001011, or hex 0x4B.

The full substitution matrix is 5 bytes, each storing the 4 BS codes for reference base A, C, G, T and N respectively.
A complete matrix that maps C/G together and A/T together may look like this:

| Ref. base \ Seq. base | A | C | G | T | N |
| :--- | :---: | :---: | :---: | :---: | :---: |
| A | - | 1 | 2 | 0 | 3 |
| C | 1 | - | 0 | 2 | 3 |
| G | 2 | 0 | - | 1 | 3 |
| T | 0 | 2 | 1 | - | 3 |
| N | 0 | 1 | 2 | 3 | - |

This would be encoded as binary 01100011, 01001011, 10000111, 00100111, 00011011, or hex 0x63, 0x4b, 0x87, 0x27, 0x1b.

To decode, we would use the following lookup table, showing the same data as above with codes sorted into 0, 1, 2, 3 order.

| Ref. base \ BS code | 0 | 1 | 2 | 3 |
| :--- | :---: | :---: | :---: | :---: |
| A | T | C | G | N |
| C | G | A | T | N |
| G | C | T | A | N |
| T | A | G | C | N |
| N | A | C | G | T |
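The packing and the decode lookup described above can be sketched in Python as follows. This is an illustration only: `pack_row` and `decode_table` are hypothetical names, and the representation of a lookup row as a 4-character string indexed by BS code is a choice made for this example.

```python
BASES = "ACGTN"


def pack_row(ref: str, codes: dict[str, int]) -> int:
    """Pack the four 2-bit BS codes for one reference base into a byte."""
    byte = 0
    for base in BASES:
        if base != ref:                      # the reference base itself is omitted
            byte = (byte << 2) | codes[base]
    return byte


def decode_table(matrix: bytes) -> dict[str, str]:
    """Map each reference base to a 4-character string indexed by BS code."""
    table = {}
    for ref, byte in zip(BASES, matrix):
        others = [b for b in BASES if b != ref]
        row = [""] * 4
        for i, base in enumerate(others):
            code = (byte >> (6 - 2 * i)) & 0x3   # high 2 bits come first
            row[code] = base
        table[ref] = "".join(row)
    return table
```

For instance, `decode_table(bytes([0x63, 0x4B, 0x87, 0x27, 0x1B]))["A"][0]` gives the read base for reference A with BS code 0.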

Substitution Code Assignment

There is no strict requirement on using a specific substitution matrix, nor that it be optimal. However, one strategy may be to ensure the most common substitution is always given code 0, the next most common code 1, and so on. This means the distribution of BS values will be skewed towards lower values, which compresses better than a more uniform distribution.

For example, let us assume the following substitution frequencies for base A:

AC: 15%
AG: 25%
AT: 55%
AN: 5%

Then the substitution codes are T=0, G=1, C=2, N=3.
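The frequency-ranking strategy can be written in a few lines of Python. This is only one possible approach, as the text notes, and `assign_codes` is a hypothetical name; ties are broken by input order here, which is an arbitrary choice.

```python
def assign_codes(freqs: dict[str, float]) -> dict[str, int]:
    """Give code 0 to the most frequent substitution, 1 to the next, and so on."""
    ranked = sorted(freqs, key=lambda base: -freqs[base])
    return {base: code for code, base in enumerate(ranked)}


# Frequencies for reference base A from the example above.
codes_for_a = assign_codes({"C": 15, "G": 25, "T": 55, "N": 5})
```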

Decode mapped read pseudocode

procedure DECODEMAPPEDREAD
    feature_number ← READITEM(FN, Integer)
    last_feature_position ← 0
    for i ← 1 to feature_number do
        DECODEFEATURE
    end for
    mapping_quality ← READITEM(MQ, Integer)
    if CF AND 1 then                                  ▷ Quality stored as an array
        for i ← 1 to read_length do
            quality_score ← READITEM(QS, Integer)
        end for
    end if
end procedure

procedure DECODEFEATURE
    feature_code ← READITEM(FC, Integer)
    feature_position ← READITEM(FP, Integer) + last_feature_position
    last_feature_position ← feature_position
    if feature_code = 'B' then
        base ← READITEM(BA, Byte)
        quality_score ← READITEM(QS, Byte)
    else if feature_code = 'X' then
        substitution_code ← READITEM(BS, Byte)
    else if feature_code = 'I' then
        inserted_bases ← READITEM(IN, Byte[])
    else if feature_code = 'S' then
        softclip_bases ← READITEM(SC, Byte[])
    else if feature_code = 'H' then
        hardclip_length ← READITEM(HC, Integer)
    else if feature_code = 'P' then
        pad_length ← READITEM(PD, Integer)
    else if feature_code = 'D' then
        deletion_length ← READITEM(DL, Integer)
    else if feature_code = 'N' then
        ref_skip_length ← READITEM(RS, Integer)
    else if feature_code = 'i' then
        base ← READITEM(BA, Byte)
    else if feature_code = 'b' then
        bases ← READITEM(BB, Byte[])
    else if feature_code = 'q' then
        quality_scores ← READITEM(QQ, Byte[])
    else if feature_code = 'Q' then
        quality_score ← READITEM(QS, Byte)
    end if
end procedure
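The feature loop above can be expressed compactly with a lookup from feature code to data series. The Python sketch below is an assumption-laden illustration: the per-series lists in `streams` stand in for READITEM on already-decoded data series, and `decode_features` is a hypothetical name.

```python
# Data series read for each feature code, taken from the feature code table.
FEATURE_SERIES = {
    "B": ("BA", "QS"), "X": ("BS",), "I": ("IN",), "S": ("SC",),
    "H": ("HC",), "P": ("PD",), "D": ("DL",), "N": ("RS",),
    "i": ("BA",), "b": ("BB",), "q": ("QQ",), "Q": ("QS",),
}


def decode_features(streams: dict[str, list]) -> list[tuple]:
    """Decode all read features of one record from per-series value queues."""
    def read(series):                      # stand-in for READITEM
        return streams[series].pop(0)

    features = []
    position = 0                           # last_feature_position
    for _ in range(read("FN")):
        code = chr(read("FC"))
        position += read("FP")             # FP is a delta from the last feature
        payload = tuple(read(s) for s in FEATURE_SERIES[code])
        features.append((position, code, payload))
    return features
```

Each returned tuple holds the absolute in-read position, the feature code, and the values read for that feature.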

10.7 Unmapped reads

The CRAM record structure for unmapped reads has the following additional fields:

| Data series type | Data series name | Field | Description |
| :--- | :--- | :--- | :--- |
| byte[read length] | BA | bases | the read bases |
| byte[read length] | QS | quality scores | the base qualities, if preserved |
procedure DECODEUNMAPPEDREAD
    for i ← 1 to read_length do
        base ← READITEM(BA, Byte)
    end for
    if CF AND 1 then                                  ▷ Quality stored as an array
        for i ← 1 to read_length do
            quality_score ← READITEM(QS, Byte)
        end for
    end if
end procedure

11 Reference sequences

The CRAM format is natively based upon the use of reference sequences, even though in some cases they are not required. In contrast to the BAM format, the CRAM format has strict rules about reference sequences.
  1. The M5 (sequence MD5 checksum) field of the @SQ sequence record in the BAM header is required, and the UR field (URI of the sequence FASTA file, optionally gzipped) is strongly advised. The rule for calculating the MD5 is to remove any non-base symbols (such as \n, the sequence name or length, and spaces) and to upper-case the rest. Here are some examples:

    > samtools faidx human_g1k_v37.fasta 1 | grep -v '^>' | tr -d '\n' | tr a-z A-Z | md5sum -
    1b22b98cdeb4a9304cb5d48026a85128 -
    > samtools faidx human_g1k_v37.fasta 1:10-20 | grep -v '^>' | tr -d '\n' | tr a-z A-Z | md5sum
    0f2a4865e3952676ffad2c3671f14057 -

     Please note that the latter calculates the checksum for 11 bases from position 10 (inclusive) to 20 (inclusive) and the bases are counted 1-based, so the first base position is 1.

  2. All CRAM reader implementations are expected to check for reference MD5 checksums and report any missing or mismatching entries. Consequently, all writer implementations are expected to ensure that all checksums are injected or checked at compression time.

  3. In some cases reads may be mapped beyond the reference sequence. All out of range reference bases are assumed to be 'N'.

  4. MD5 checksum bytes in the slice header should be ignored for unmapped or multi-reference slices.
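The MD5 rule from item 1 can be sketched in Python as below. This is a simplified illustration that assumes a single-sequence FASTA held in memory and treats "non-base symbols" as header lines plus whitespace; `reference_md5` is a hypothetical name.

```python
import hashlib


def reference_md5(fasta_text: str) -> str:
    """MD5 of one FASTA sequence: header dropped, whitespace removed, upper-cased."""
    lines = [ln for ln in fasta_text.splitlines() if not ln.startswith(">")]
    sequence = "".join("".join(lines).split()).upper()
    return hashlib.md5(sequence.encode("ascii")).hexdigest()
```

For whole-chromosome references a streaming variant that feeds `hashlib.md5().update()` line by line would avoid holding the sequence in memory.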

12 Indexing

General notes

Indexing is only valid on coordinate (reference ID and then leftmost position) sorted files.

Please note that CRAM indexing is external to the file format itself and may change independently of the file format specification in the future. For example, a new type of index file may appear.

Individual records are not indexed in CRAM files; slices should be used instead as the unit of random access. Another important difference between CRAM and BAM indexing is that the CRAM container header and compression header block (first block in the container) must always be read before decoding a slice. Therefore two read operations are required for random access in CRAM.

Indexing a CRAM file is deemed to be a lightweight operation because it usually does not require any CRAM records to be read. Indexing information can be obtained from container headers, namely sequence id, alignment start and span, container start byte offset and slice byte offset inside the container (landmarks). The exception to this is with multi-reference containers, where the "RI" data series must be read.

CRAM index

A CRAM index is a gzipped tab delimited file containing the following columns:
  1. Reference sequence id
  2. Alignment start (ignored on read for unmapped slices, set to 0 on write)
  3. Alignment span (ignored on read for unmapped slices, set to 0 on write)
  4. Absolute byte offset of the Container header in the file.
  5. Relative byte offset of the Slice header block, from the end of the container header. This is the same as the "landmark" field in the container header.
  6. Slice size in bytes (including slice header and all blocks).

Each line represents a slice in the CRAM file. Please note that all slices must be listed in the index file.

Multi-reference slices may need multiple lines for the same slice, one for each reference contained within that slice. In this case the index reference sequence ID will be the actual reference ID (from the "RI" data series) and not -2.

Slices containing solely unmapped unplaced data (reference ID -1) still require values for all columns, although the alignment start and span will be ignored. It is recommended that they are both set to zero.
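Since the index is just gzipped tab-delimited text with six integer columns, reading it takes little code. The sketch below is a minimal Python example; the `CraiEntry` field names are invented for this illustration and the function takes the file contents as bytes.

```python
import gzip
from typing import NamedTuple


class CraiEntry(NamedTuple):
    ref_id: int
    start: int
    span: int
    container_offset: int   # absolute byte offset of the container header
    slice_offset: int       # landmark, relative to the end of the container header
    slice_size: int         # slice header plus all blocks


def parse_crai(data: bytes) -> list[CraiEntry]:
    """Parse the gzipped, tab-delimited contents of a CRAM index."""
    text = gzip.decompress(data).decode("ascii")
    return [CraiEntry(*map(int, line.split("\t")))
            for line in text.splitlines() if line]
```

A reader would typically group the entries by `ref_id` and search by `start` and `span` to find the slices overlapping a query region.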
The absolute and relative offsets used in a three-slice container are shown in the diagram below.

BAM index

BAM indexes are supported by using 4-byte integer pointers called landmarks that are stored in the container header. A BAM index pointer is a 64-bit value with 48 bits reserved for the BAM block start position and 16 bits reserved for the in-block offset. When used to index CRAM files, the first 48 bits store the CRAM container start position and the last 16 bits store the index of the landmark in the landmark array held in the container header. The landmark index can be used to access the appropriate slice.

The above indexing scheme treats CRAM slices as individual records in a BAM file. This allows BAM indexing to be applied to CRAM files; however, it introduces some overhead when seeking a specific alignment start because all preceding records in the slice must be read and discarded.
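The 48/16 bit split can be illustrated with two small Python helpers. This sketch assumes that the "first 48 bits" are the most significant bits, as in BAM virtual file offsets; the function names are hypothetical.

```python
def pack_pointer(container_start: int, landmark_index: int) -> int:
    """Combine a container offset (48 bits) and a landmark index (16 bits)."""
    if not 0 <= container_start < 1 << 48 or not 0 <= landmark_index < 1 << 16:
        raise ValueError("value out of range")
    return (container_start << 16) | landmark_index


def unpack_pointer(pointer: int) -> tuple[int, int]:
    """Split a 64-bit pointer into (container start, landmark index)."""
    return pointer >> 16, pointer & 0xFFFF
```

After unpacking, the landmark index selects an entry of the container header's landmark array, which in turn gives the slice offset.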

13 Encodings

13.1 Introduction

The basic idea of the codings is to efficiently represent some values in binary format. This can be achieved in a number of ways that most frequently involve some knowledge about the nature of the values being encoded, for example, distribution statistics. The methods for choosing the best encoding and determining its parameters are very diverse and are not part of the CRAM format specification, which only describes how the information needed to decode the values should be stored.

Note that two of the encodings (Golomb and Golomb-Rice) are listed as deprecated. These are still formally part of the CRAM specification, but have not been used by the primary implementations and may not be well supported. Therefore their use is permitted, but not recommended.

Offset

Many of the codings listed below encode positive integer numbers. An integer offset value is used to allow any integer numbers, and not just positive ones, to be encoded. It can also be used for monotonically decreasing distributions with the maximum not equal to zero. For example, given an offset of 10 and a value to be encoded of 1, the actually encoded value would be offset + value = 11. Then when decoding, the offset would be subtracted from the decoded value.
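The offset mechanism amounts to one addition on encode and one subtraction on decode, as the following Python sketch shows. The `choose_offset` helper, which picks the smallest offset that makes every value non-negative, is one possible policy assumed for this example and is not prescribed by the specification.

```python
def choose_offset(values: list[int]) -> int:
    """Pick an offset so that every value + offset is >= 0 (hypothetical policy)."""
    return max(0, -min(values))


def apply_offset(value: int, offset: int) -> int:
    return value + offset           # the number actually handed to the codec


def remove_offset(stored: int, offset: int) -> int:
    return stored - offset          # recovers the original value on decode
```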

13.2 EXTERNAL: codec ID 1

Can encode types Byte, Integer.

The EXTERNAL coding is simply storage of data verbatim to an external block with a given ID. If the type is Byte the data is stored as-is, otherwise for Integer type the data is stored in ITF8.

Parameters

CRAM format defines the following parameters of EXTERNAL coding:

| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | external id | id of an external block containing the byte stream |

13.3 Huffman coding: codec ID 3

Can encode types Byte, Integer.

Huffman coding replaces symbols (values to encode) by binary codewords, with common symbols having shorter codewords such that the total message of binary codewords is shorter than using uniform binary codeword lengths. The general process consists of the following steps.
  • Obtain symbol code lengths.
    • If encoding:
      • Compute symbol frequencies.
      • Compute code lengths from frequencies.
    • If decoding:
      • Read code lengths from codec parameters.
  • Compute canonical Huffman codewords from code lengths$^{5}$.
  • Encode or decode bits as per the symbol to codeword table. Codewords have the “prefix property” that no codeword is a prefix of another codeword, enabling unambiguous decode bit by bit.
The use of canonical Huffman codes means that we only need to store the code lengths and use the same algorithm in both encoder and decoder to generate the codewords. This is achieved by ensuring our symbol alphabet has a natural sort order and codewords are assigned in numerical order.

Important note: for alphabets with only one value, the codeword will be zero bits long. This makes the Huffman codec an efficient mechanism for specifying constant values.

Canonical code computation

  1. Sort the alphabet ascending using bit-lengths and then using numerical order of the values.
  2. The first symbol in the list gets assigned a codeword which is the same length as the symbol’s code length but all zeros. This will often be a single zero (‘0’).
  3. Each subsequent symbol is assigned the next binary number in sequence, ensuring that following codes are always higher in value.
  4. When you reach a symbol with a longer code length, then after incrementing, append zeros until the length of the new codeword is equal to that symbol’s code length.

Examples

| Symbol | Code length | Codeword |
| :--- | :--- | :--- |
| A | 1 | 0 |
| B | 3 | 100 |
| C | 3 | 101 |
| D | 3 | 110 |
| E | 4 | 1110 |
| F | 4 | 1111 |
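The canonical code computation above can be sketched in a few lines of Python. This is an illustrative sketch only, not taken from any CRAM implementation: the function name `canonical_codewords` is hypothetical, and codewords are returned as strings of `0` and `1` characters purely for readability.

```python
def canonical_codewords(alphabet, bit_lengths):
    """Assign canonical Huffman codewords given symbols and their code lengths.

    Hypothetical helper for illustration; returns {symbol: bit string}.
    """
    # Step 1: sort by code length, then by numerical symbol value.
    pairs = sorted(zip(bit_lengths, alphabet))
    codes = {}
    code = 0
    prev_len = pairs[0][0]
    for i, (length, symbol) in enumerate(pairs):
        if i > 0:
            # Steps 3 and 4: take the next binary number, then append
            # zeros if this symbol has a longer code length.
            code = (code + 1) << (length - prev_len)
        # Step 2 is the i == 0 case: an all-zero codeword.
        codes[symbol] = format(code, "b").zfill(length) if length else ""
        prev_len = length
    return codes


symbols = [ord(c) for c in "ABCDEF"]
table = canonical_codewords(symbols, [1, 3, 3, 3, 4, 4])
for s in symbols:
    print(chr(s), table[s])
```

With the code lengths from the example table this reproduces the codewords listed there. A single-symbol alphabet with code length 0 yields an empty codeword, matching the important note above.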

Parameters

| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8[] | alphabet | list of all encoded symbols (values) |
| itf8[] | bit-lengths | array of bit-lengths for each symbol in the alphabet |

13.4 Byte array coding

Often there is a need to encode an array of bytes where the length is not predetermined. For example the read identifiers differ per alignment record, possibly with different lengths, and this length must be stored somewhere. There are two choices available: storing the length explicitly (BYTE_ARRAY_LEN) or continuing to read bytes until a termination value is seen (BYTE_ARRAY_STOP).

Note that, in contrast to this, quality values are known to be the same length as the sequence, which is an already known quantity, so this does not need to be encoded using the byte array codecs.

BYTE_ARRAY_LEN: codec ID 4

Can encode types Byte[].

Byte arrays are captured length-first, meaning that the length of every array element is written using an additional encoding. For example this could be a HUFFMAN encoding or another EXTERNAL block. The length is decoded first followed by the data, followed by the next length and data, and so on.

This encoding can therefore be considered as a nested encoding, with each pair of nested encodings containing their own set of parameters. The byte stream for parameters of the BYTE_ARRAY_LEN encoding is therefore the concatenation of the length and value encoding parameters as described in section 2.3.

The parameters for BYTE_ARRAY_LEN are listed below:

| Data type | Name | Comment |
| :--- | :--- | :--- |
| encoding<int> | lengths encoding | an encoding describing how the array lengths are captured |
| encoding<byte> | values encoding | an encoding describing how the values are captured |
For example, the bytes specifying a BYTE_ARRAY_LEN encoding, including the codec and parameters, for a 16-bit X0 auxiliary tag (“X0C”) may use HUFFMAN encoding to specify the length (always 2 bytes) and an EXTERNAL encoding to store the value to an external block with ID 200.

| Bytes | Meaning |
| :--- | :--- |
| 0x04 | BYTE_ARRAY_LEN codec ID |
| 0x0a | 10 remaining bytes of BYTE_ARRAY_LEN parameters |
| | |
| 0x03 | HUFFMAN codec ID, for aux tag lengths |
| 0x04 | 4 more bytes of HUFFMAN parameters |
| 0x01 | Alphabet array size = 1 |
| 0x02 | alphabet symbol; (length = 2) |
| 0x01 | Codeword array size = 1 |
| 0x00 | Code length = 0 (zero bits needed as alphabet is size 1) |
| | |
| 0x01 | EXTERNAL codec ID, for aux tag values |
| 0x02 | 2 more bytes of EXTERNAL parameters |
| 0x80 0xc8 | ITF8 encoding for block ID 200 |
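As a rough illustration of how the byte sequence above nests, the following Python sketch parses it into a small dictionary. It assumes the ITF8 layout defined earlier in this specification (1 to 5 bytes, selected by the leading bits of the first byte, interpreted as a signed 32-bit value) and only handles the three codecs used in this example. The names `read_itf8`, `read_itf8_array` and `parse_encoding` are hypothetical and not part of any CRAM library.

```python
def read_itf8(buf, pos):
    """Return (value, new_pos) for one ITF8 integer starting at buf[pos]."""
    b0 = buf[pos]
    if b0 < 0x80:
        val, n = b0, 1
    elif b0 < 0xC0:
        val, n = ((b0 & 0x3F) << 8) | buf[pos + 1], 2
    elif b0 < 0xE0:
        val, n = ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], 3
    elif b0 < 0xF0:
        val = ((b0 & 0x0F) << 24) | (buf[pos + 1] << 16) \
            | (buf[pos + 2] << 8) | buf[pos + 3]
        n = 4
    else:
        # Five-byte form: only the low 4 bits of the last byte are used.
        val = ((b0 & 0x0F) << 28) | (buf[pos + 1] << 20) \
            | (buf[pos + 2] << 12) | (buf[pos + 3] << 4) | (buf[pos + 4] & 0x0F)
        n = 5
    if val >= 1 << 31:  # interpret as a signed 32-bit integer
        val -= 1 << 32
    return val, pos + n


def read_itf8_array(buf, pos):
    """Read an ITF8 count followed by that many ITF8 values."""
    count, pos = read_itf8(buf, pos)
    values = []
    for _ in range(count):
        v, pos = read_itf8(buf, pos)
        values.append(v)
    return values, pos


def parse_encoding(buf, pos=0):
    """Parse codec ID, parameter size and parameters; return (dict, new_pos)."""
    codec, pos = read_itf8(buf, pos)
    size, pos = read_itf8(buf, pos)
    end = pos + size
    if codec == 1:    # EXTERNAL
        block_id, pos = read_itf8(buf, pos)
        enc = {"codec": "EXTERNAL", "external_id": block_id}
    elif codec == 3:  # HUFFMAN
        alphabet, pos = read_itf8_array(buf, pos)
        lengths, pos = read_itf8_array(buf, pos)
        enc = {"codec": "HUFFMAN", "alphabet": alphabet, "bit_lengths": lengths}
    elif codec == 4:  # BYTE_ARRAY_LEN: two nested encodings
        len_enc, pos = parse_encoding(buf, pos)
        val_enc, pos = parse_encoding(buf, pos)
        enc = {"codec": "BYTE_ARRAY_LEN", "lengths": len_enc, "values": val_enc}
    else:
        raise ValueError(f"codec {codec} not handled in this sketch")
    if pos != end:
        raise ValueError("parameter size does not match parsed length")
    return enc, pos


example = bytes.fromhex("040a030401020100010280c8")
enc, end = parse_encoding(example)
print(enc)
```

The check on `end` mirrors the role of the size bytes (0x0a, 0x04, 0x02) in the table: each one states how many parameter bytes follow for that codec.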

BYTE_ARRAY_STOP: codec ID 5

Can encode types Byte[].

Byte arrays are captured as a sequence of bytes terminated by a special stop byte. The data returned does not include the stop byte itself. In contrast to BYTE_ARRAY_LEN the value is always encoded with EXTERNAL so the parameter is an external id instead of another encoding.

| Data type | Name | Comment |
| :--- | :--- | :--- |
| byte | stop byte | a special byte treated as a delimiter |
| itf8 | external id | id of an external block containing the byte stream |
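A minimal sketch of BYTE_ARRAY_STOP decoding, assuming the external block has already been decompressed into a Python `bytes` object. The function name, the NUL stop byte and the sample read names are illustrative assumptions, not values mandated by the specification.

```python
def decode_byte_array_stop(block, pos, stop_byte):
    """Return (array, new_pos); the stop byte is consumed but not returned."""
    end = block.index(stop_byte, pos)  # raises ValueError if no stop byte
    return block[pos:end], end + 1


# Hypothetical external block holding two read names, each ended by 0x00.
block = b"read1\x00read22\x00"
pos = 0
names = []
while pos < len(block):
    name, pos = decode_byte_array_stop(block, pos, 0x00)
    names.append(name)
print(names)
```

Each call advances the position past the delimiter, so successive records read from the same external block in order.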

13.5 Beta coding: codec ID 6

Can encode types Integer.

Definition

Beta coding is the most common way to represent numbers in binary notation and is sometimes referred to as binary coding. The decoder reads the specified fixed number of bits (most significant first) and subtracts the offset value to get the decoded integer.

Parameters

CRAM format defines the following parameters of beta coding:

| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is subtracted from each value during decode |
| itf8 | length | the number of bits used |

Examples

If we have integer values in the range 10 to 15 inclusive, the largest value would traditionally need 4 bits, but with an offset of -10 we can hold values 0 to 5, using a fixed size of 3 bits. Using fixed Offset and Length coming from the beta parameters, we decode these values as:

| Offset | Length | Bits | Value |
| :--- | :--- | :--- | :--- |
| -10 | 3 | 000 | 10 |
| -10 | 3 | 001 | 11 |
| -10 | 3 | 010 | 12 |
| -10 | 3 | 011 | 13 |
| -10 | 3 | 100 | 14 |
| -10 | 3 | 101 | 15 |
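The table can be reproduced with a short Python sketch. For readability, bits are modelled here as strings of `0` and `1` characters rather than being read from a packed bit stream; the function names are hypothetical.

```python
def beta_encode(value, offset, length):
    """Add the offset and write the result as `length` bits, MSB first."""
    raw = value + offset
    if not 0 <= raw < (1 << length):
        raise ValueError("value does not fit in the given number of bits")
    return format(raw, "b").zfill(length) if length else ""


def beta_decode(bits, offset, length):
    """Read `length` bits (MSB first) from an iterator, then subtract offset."""
    raw = 0
    for _ in range(length):
        raw = (raw << 1) | int(next(bits))
    return raw - offset


for value in range(10, 16):
    word = beta_encode(value, -10, 3)
    print(word, beta_decode(iter(word), -10, 3))
```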

13.6 Subexponential coding: codec ID 7

Can encode types Integer.

Definition

Subexponential coding$^{6}$ is parametrized by a non-negative integer $k$. For values $n<2^{k+1}$ subexponential coding produces codewords identical to Rice coding$^{7}$. For larger values it grows logarithmically with $n$.

Encoding

  1. Add offset to $n$.
  2. Determine $u$ and $b$ values from $n$:

$$b=\begin{cases} k & \text{if } n<2^{k} \\ \lfloor\log_{2} n\rfloor & \text{if } n \geq 2^{k} \end{cases} \quad u=\begin{cases} 0 & \text{if } n<2^{k} \\ b-k+1 & \text{if } n \geq 2^{k} \end{cases}$$

  3. Write $u$ in unary form; $u$ 1 bits followed by a single 0 bit.
  4. Write the bottom $b$ bits of $n$ in binary form.

Decoding

  1. Read $u$ in unary form, counting the number of leading 1s (prefix) in the codeword (discard the trailing 0 bit).
  2. Determine $n$ via:

    (a) if $u=0$ then read $n$ as a $k$-bit binary number.

    (b) if $u \geq 1$ then read $x$ as a $(u+k-1)$-bit binary number. Let $n=2^{u+k-1}+x$.
  3. Subtract offset from $n$.

Examples

| Number | Codeword, $k=0$ | Codeword, $k=1$ | Codeword, $k=2$ |
| :--- | :--- | :--- | :--- |
| 0 | 0 | 00 | 000 |
| 1 | 10 | 01 | 001 |
| 2 | 1100 | 100 | 010 |
| 3 | 1101 | 101 | 011 |
| 4 | 111000 | 11000 | 1000 |
| 5 | 111001 | 11001 | 1001 |
| 6 | 111010 | 11010 | 1010 |
| 7 | 111011 | 11011 | 1011 |
| 8 | 11110000 | 1110000 | 110000 |
| 9 | 11110001 | 1110001 | 110001 |
| 10 | 11110010 | 1110010 | 110010 |
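The encoding and decoding steps can be checked against the table with the following Python sketch. Bits are again modelled as character strings for readability, and the function names are hypothetical.

```python
def subexp_encode(value, k, offset=0):
    """Encode one integer with subexponential coding of order k."""
    n = value + offset
    if n < 0:
        raise ValueError("value + offset must be non-negative")
    if n < (1 << k):
        b, u = k, 0
    else:
        b = n.bit_length() - 1  # floor(log2(n))
        u = b - k + 1
    low = format(n & ((1 << b) - 1), "b").zfill(b) if b else ""
    return "1" * u + "0" + low  # unary prefix, then the bottom b bits


def subexp_decode(bits, k, offset=0):
    """Decode one integer from an iterator over '0'/'1' characters."""
    u = 0
    while next(bits) == "1":  # the terminating 0 bit is consumed here
        u += 1
    nbits = k if u == 0 else u + k - 1
    x = 0
    for _ in range(nbits):
        x = (x << 1) | int(next(bits))
    n = x if u == 0 else (1 << nbits) + x
    return n - offset


for n in range(11):
    print(n, [subexp_encode(n, k) for k in (0, 1, 2)])
```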

Parameters

| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is subtracted from each value during decode |
| itf8 | k | the order of the subexponential coding |

13.7 Gamma coding: codec ID 9

Can encode types Integer.

Definition

Elias gamma code is a prefix encoding of positive integers. This is a combination of unary coding and beta coding. The first is used to capture the number of bits required for beta coding to capture the value.

Encoding

  1. Write the integer in binary.
  2. Subtract 1 from the number of bits written in step 1 and prepend that many zeros.

An equivalent way to express the same process:

  1. Separate the integer into the highest power of 2 it contains ($2^{N}$) and the remaining $N$ binary digits of the integer.
  2. Encode $N$ in unary; that is, as $N$ zeroes followed by a one.
  3. Append the remaining $N$ binary digits to this representation of $N$.

Decoding

  1. Read and count 0s from the stream until you reach the first 1. Call this count of zeroes $N$.
  2. Considering the one that was reached to be the first digit of the integer, with a value of $2^{N}$, read the remaining $N$ digits of the integer.

Examples

| Value | Codeword |
| :--- | :--- |
| 1 | 1 |
| 2 | 010 |
| 3 | 011 |
| 4 | 00100 |
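A small Python sketch of the gamma steps, using character strings for bits. The function names are hypothetical. Following the Offset convention described in section 13.1, this sketch assumes the offset is added before encoding and subtracted after decoding.

```python
def gamma_encode(value, offset=0):
    """Elias gamma: N zeros followed by the (N + 1)-bit binary value."""
    n = value + offset
    if n < 1:
        raise ValueError("gamma coding needs a positive integer")
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits, offset=0):
    """Decode one integer from an iterator over '0'/'1' characters."""
    zeros = 0
    while next(bits) == "0":  # the first 1 bit is consumed here
        zeros += 1
    n = 1  # that 1 bit is the leading digit, worth 2**zeros
    for _ in range(zeros):
        n = (n << 1) | int(next(bits))
    return n - offset


for v in (1, 2, 3, 4):
    print(v, gamma_encode(v))
```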

Parameters

| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset to subtract from each value after decode |

13.8 DEPRECATED: Golomb coding: codec ID 2

Can encode types Integer.

Note this codec has not been used in any known CRAM implementation since before CRAM v1.0. Nor is it implemented in some of the major software. Therefore its use is not recommended.

Definition

Golomb encoding is a prefix encoding optimal for representation of random positive numbers following a geometric distribution.

Encoding

  1. Fix the parameter $M$ to an integer value.
  2. For $N$, the number to be encoded, find

    (a) quotient $q=\lfloor N / M\rfloor$

    (b) remainder $r=N \bmod M$
  3. Generate Codeword

    (a) The Code format: <Quotient Code><Remainder Code>, where

    (b) Quotient Code (in unary coding)

        i. Write a $q$-length string of 1 bits

        ii. Write a 0 bit

    (c) Remainder Code (in truncated binary encoding)

        Set $b=\lceil\log_{2}(M)\rceil$

        i. If $r<2^{b}-M$ code $r$ as plain binary using $b-1$ bits.

        ii. If $r \geq 2^{b}-M$ code the number $r+2^{b}-M$ in plain binary representation using $b$ bits.

Decoding

  1. Read $q$ via unary coding: count the number of 1 bits and consume the following 0 bit.
  2. Set $b=\lceil\log_{2}(M)\rceil$
  3. Read $r$ via $b-1$ bits of binary coding
  4. If $r \geq 2^{b}-M$

    (a) Read 1 single bit, $x$.

    (b) Set $r=r * 2+x-\left(2^{b}-M\right)$
  5. Value is $q * M+r-$ offset

Examples

| Number | Codeword, M=10 (thus b=4) |
| :--- | :--- |
| 0 | 0000 |
| 4 | 0100 |
| 10 | 10000 |
| 26 | 1101100 |
| 42 | 11110010 |
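The Golomb steps above, including the truncated binary remainder, can be sketched in Python as follows. Bits are modelled as character strings, the function names are hypothetical, and the sketch assumes $M \geq 2$ so that $b-1$ is never negative.

```python
def golomb_encode(value, m, offset=0):
    """Golomb code: unary quotient, then truncated binary remainder."""
    n = value + offset
    if n < 0 or m < 2:
        raise ValueError("this sketch needs value + offset >= 0 and M >= 2")
    q, r = divmod(n, m)
    b = (m - 1).bit_length()  # ceil(log2(M)) for M >= 2
    cutoff = (1 << b) - m     # 2^b - M
    if r < cutoff:
        rem = format(r, "b").zfill(b - 1) if b > 1 else ""
    else:
        rem = format(r + cutoff, "b").zfill(b)
    return "1" * q + "0" + rem


def golomb_decode(bits, m, offset=0):
    """Decode one integer from an iterator over '0'/'1' characters."""
    q = 0
    while next(bits) == "1":  # the terminating 0 bit is consumed here
        q += 1
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    r = 0
    for _ in range(b - 1):
        r = (r << 1) | int(next(bits))
    if r >= cutoff:
        # Long remainder form: one more bit, then undo the 2^b - M shift.
        r = r * 2 + int(next(bits)) - cutoff
    return q * m + r - offset


for n in (0, 4, 10, 26, 42):
    print(n, golomb_encode(n, 10))
```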

Parameters

Golomb coding takes the following parameters:

| Data type | Name | Comment |
| :--- | :--- | :--- |
| itf8 | offset | offset is added to each value |
| itf8 | M | the golomb parameter (number of bins) |

13.9 DEPRECATED: Golomb-Rice coding: codec ID 8

Can encode types Integer.

Note this codec has not been used in any known CRAM implementation since before CRAM v1.0. Nor is it implemented in some of the major software. Therefore its use is not recommended.

Golomb-Rice coding is a special case of Golomb coding when the M parameter is a power of 2. The reason for this coding is that the division operations in Golomb coding can be replaced with bit shift operators, as well as avoiding the extra $r<2^{b}-M$ check.
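To illustrate the simplification, the sketch below encodes and decodes with shifts and masks only. The parameter name `log2_m` (so that $M = 2^{\text{log2\_m}}$) and the function names are assumptions made for this example; the exact parameter layout of this codec is not shown in this section.

```python
def rice_encode(value, log2_m, offset=0):
    """Golomb-Rice: unary quotient, then a fixed log2_m-bit remainder."""
    n = value + offset
    if n < 0:
        raise ValueError("value + offset must be non-negative")
    q = n >> log2_m               # replaces the division by M
    r = n & ((1 << log2_m) - 1)   # replaces the modulo by M
    rem = format(r, "b").zfill(log2_m) if log2_m else ""
    return "1" * q + "0" + rem


def rice_decode(bits, log2_m, offset=0):
    """Decode one integer from an iterator over '0'/'1' characters."""
    q = 0
    while next(bits) == "1":
        q += 1
    r = 0
    for _ in range(log2_m):       # always exactly log2_m remainder bits
        r = (r << 1) | int(next(bits))
    return (q << log2_m) + r - offset


print(rice_encode(9, 2))  # M = 4: quotient 2, remainder 1
```

Because $2^{b}-M=0$ when $M$ is a power of 2, the remainder always uses exactly $b$ bits, which is why no truncated binary branch is needed.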

14 External compression methods

External encoding operates on bytes only. Therefore any data series must be translated into bytes before sending data into an external block. The following methods are defined. Exact definitions of these methods are in their respective internet links or the ancillary CRAMcodecs document found alongside this specification.

Integer values are written as ITF8, which then can be translated into an array of bytes.

Strings, like read names, are translated into bytes according to UTF8 rules. In most cases these should coincide with ASCII, making the translation trivial.

Each method has an associated numeric code which is defined in Section 8.

14.1 Gzip

The Gzip specification is defined in RFC 1952. Gzip in turn is an encapsulation of the Deflate algorithm defined in RFC 1951.

14.2 Bzip2

First available in CRAM v2.0.

Bzip2 is a compression method utilising the Burrows Wheeler Transform, Move To Front transform, Run Length Encoding and a Huffman entropy encoder. It is often superior to Gzip for textual data.

An informal format specification exists:

https://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf

14.3 LZMA

First available in CRAM v3.0.

LZMA is the Lempel-Ziv Markov chain algorithm. CRAM uses the xz Stream format to encapsulate this algorithm, as defined in https://tukaani.org/xz/xz-file-format.txt.

14.4 rANS4x8 codec

First available in CRAM v3.0.

rANS is the range-coder variant of Asymmetric Numeral Systems$^{8}$.

“4x8” refers to 4-way interleaving with 8-bit renormalisation.

This variant of rANS first appeared in CRAM v3.0.

Details of this algorithm have been moved to the CRAMcodecs document.

14.5 rANS4x16 codec

First available in CRAM v3.1.

“4x16” refers to 4-way interleaving with 16-bit renormalisation.

This variant of rANS first appeared in CRAM v3.1.

Details of this algorithm are listed in the CRAMcodecs document.

14.6 adaptive arithmetic coding

First available in CRAM v3.1.

An entropy encoder that is slower than rANS but compresses slightly better. It achieves this by adapting the probabilities as it compresses and decompresses instead of using a fixed table.

Details of this algorithm are listed in the CRAMcodecs document.

14.7 fqzcomp codec

First available in CRAM v3.1.

This is a method dedicated to compression of quality values.

Details of this algorithm are listed in the CRAMcodecs document.

14.8 name tokeniser

First available in CRAM v3.1.

This is a method dedicated to compression of read names.

Details of this algorithm are listed in the CRAMcodecs document.

15 Appendix

15.1 Choosing the container size

CRAM format does not constrain the size of the containers. However, the following should be considered when deciding the container size:
  • Data can be compressed better by using larger containers
  • Random access performance is better for smaller containers
  • Streaming is more convenient for small containers
  • Applications typically buffer containers into memory
We recommend 1 megabyte containers. They are small enough to provide good random access and streaming performance while being large enough to provide good compression. 1 MB containers are also small enough to fit into the L2 cache of most modern CPUs.

Some simplified examples are provided below to fit data into 1 MB containers.

Unmapped short reads with bases, read names, recalibrated and original quality scores

We have 10,000 unmapped short reads (100 bp) with read names, recalibrated and original quality scores. We estimate 0.4 bits/base (read names) + 0.4 bits/base (bases) + 3 bits/base (recalibrated quality scores) + 3 bits/base (original quality scores) $\approx 7$ bits/base. Space estimate is $10000 \times 100 \times 7$ bits $\approx 0.9$ MB. Data could be stored in a single container.

Unmapped long reads with bases, read names and quality scores

We have 10,000 unmapped long reads (10 kb) with read names and quality scores. We estimate: 0.4 bits/base (bases) + 3 bits/base (original quality scores) $\approx 3.5$ bits/base. Space estimate is $10000 \times 10000 \times 3.5$ bits $\approx 42$ MB. Data could be stored in $42 \times 1$ MB containers.

Mapped short reads with bases, pairing and mapping information

We have 250,000 mapped short reads (100 bp) with bases, pairing and mapping information. We estimate the compression to be 0.2 bits/base. Space estimate is $250000 \times 100 \times 0.2$ bits $\approx 0.6$ MB. Data could be stored in a single container.

Embedded reference sequences

We have a reference sequence (10 Mb). We estimate the compression to be 2 bits/base. Space estimate is $10000000 \times 2$ bits $\approx 2.4$ MB. Data could be written into three containers: 1 MB + 1 MB + 0.4 MB.
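The arithmetic behind these estimates is simply reads × read length × bits per base, divided by 8 to give bytes. The Python sketch below recomputes the four cases and prints each in both decimal megabytes ($10^{6}$ bytes) and binary mebibytes ($2^{20}$ bytes), since the text does not state which convention its rounded figures use. The function name and dictionary keys are hypothetical.

```python
def estimate_bytes(n_reads, read_len, bits_per_base):
    """Rough compressed size in bytes for a batch of reads."""
    return n_reads * read_len * bits_per_base / 8


cases = {
    "unmapped short reads": estimate_bytes(10_000, 100, 7),
    "unmapped long reads": estimate_bytes(10_000, 10_000, 3.5),
    "mapped short reads": estimate_bytes(250_000, 100, 0.2),
    "embedded reference": estimate_bytes(1, 10_000_000, 2),
}

for name, nbytes in cases.items():
    print(f"{name}: {nbytes / 1e6:.2f} MB, {nbytes / 2**20:.2f} MiB")
```

These are order-of-magnitude estimates only; the bits/base inputs are the approximations quoted above, and real data will vary.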

15.2 CRAM History

Pre-CRAM: 2010

The primary concepts and ideas of CRAM stem from work at the European Bioinformatics Institute in 2010 and 2011, published in:

Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney, Efficient storage of high throughput DNA sequencing data using reference-based compression, Genome Res. 2011 21: 734-740; doi:10.1101/gr.114819.110; PMID:21245279.

CRAM 0.x: 2011

Vadim Zalunin implemented the ideas in the paper, now named CRAM, in the Java CRAMtools package. This included versions from 0.3 to 0.86$^{9}$.

CRAM 1.0: 2012

The first official launch of the CRAM specification, in the Java CRAMtools package$^{10}$.

This was publicised at https://github.com/enasequence/cramtools.

CRAM 2.0: 2013

Reimplementing CRAM in C$^{11}$ exposed a number of issues with the 1.0 specification and disparities between the specification text and the Java implementation. CRAM 2.0 unified implementation with specification.

Other changes included:
  • Support for multiple references per container, to permit storage of highly fragmented assemblies.
  • Soft-clips and inserted bases moved to their own separate data-series instead of sharing one.
  • Slice headers contain meta-data tracking the number of records and bases.
  • Corrected the BF (bam flag) data series to match the BAM specification.
  • Improved encoding of auxiliary tags.

CRAM 2.1: 2014

This is the first version to appear in HTSJDK (version 1.127), ported from the Java CRAMtools package.
  • EOF blocks are added in order to spot truncated files.

CRAM 3.0: 2014

Primarily this is an optimisation of size and speed.
  • Inclusion of LZMA compression library.
  • Inclusion of the custom rANS Order-0 and Order-1 entropy encoders.
  • Checksums added to all file format structures to ensure data integrity.

CRAM 3.1: 2023

Note: the formal draft appeared in 2019, and was initially demonstrated in 2016.

This adds new EXTERNAL compression methods, described in the separate CRAMcodecs document, and expands the list of permitted “methods” in the CRAM Block structure.

The new compression methods aim to improve both speed, through the newer SIMD rANS implementation, and file size, through the custom name tokeniser and quality codec.
The format is otherwise identical to 3.0.

15.3 Contributors and Acknowledgements

  • Markus Fritz, Rasko Leinonen, Guy Cochrane and Ewan Birney (EBI): Initial ideas behind CRAM.
  • Vadim Zalunin (EBI): Initial Java implementation of CRAM and previous maintainer of the CRAM specification.
  • James Bonfield (Sanger Institute): Initial C implementation of CRAM and current maintainer of the CRAM specification.
  • Joel Thibault (Broad Institute): Previous maintainer of the CRAM specification.
  • Chris Norman (Broad Institute): Previous maintainer of the CRAM specification; worked on the HTSJDK implementation.
  • Robert Buels (UC Berkeley): First JavaScript implementation of CRAM.
  • Michael Macias (St Jude Children’s Research Hospital): First Rust implementation of CRAM.
  • Other specification contributors include: John Marshall, Rishi Nag, Kenta Sato, Artem Tarasov and Jason Travis.
  • Plus a big thank you to everyone who has raised GitHub issues and/or helped us improve the specification in other ways.

${ }^{1}$ Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney, Efficient storage of high throughput DNA sequencing data using reference-based compression, Genome Res. 2011 21: 734-740; doi:10.1101/gr.114819.110; PMID:21245279.

${ }^{a}$ Formerly MAPPED_SLICE_HEADER. Now used by all slice headers regardless of mapping status.

${ }^{2}$ The precise order is defined in section 10.

${ }^{3}$ Unmapped reads can be placed or unplaced. By placed unmapped read we mean a read that is unmapped according to bit 0x4 of the BF (BAM bit flags) data series, but has position fields filled in, thus “placing” it on a reference sequence. In contrast, unplaced unmapped reads have a reference sequence ID of -1 and alignment position of 0.

${ }^{4}$ Interleaving can sometimes provide better compression, however it also adds dependency between types of data, meaning it is not possible to selectively decode one data series if it co-locates with another data series in the same block.

${ }^{6}$ Fast progressive lossless image compression, Paul G. Howard and Jeffrey Scott Vitter, 1994. http://www.ittc.ku.edu/~jsv/Papers/HoV94.progressive_FELICS.pdf

${ }^{7}$ https://en.wikipedia.org/wiki/Golomb_coding#Rice_coding

${ }^{8}$ J. Duda, Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding, http://arxiv.org/abs/1311.2540

${ }^{9}$ https://github.com/vadimzalunin/crammer/releases

${ }^{10}$ https://github.com/enasequence/cramtools

${ }^{11}$ Staden IO_Lib 1.13.0 and later HTSlib 0.2.0