
CRAM format specification (version 3.1)

samtools-devel@lists.sourceforge.net

4 Sep 2024

Abstract

The master version of this document can be found at https://github.com/samtools/hts-specs. This printing is version 4127441 from that repository, last modified on the date shown above.

license: Apache 2.0

1 Overview

This specification describes the CRAM 3.0 and 3.1 formats.

CRAM has the following major objectives:
  1. Significantly better lossless compression than BAM
  2. Full compatibility with BAM
  3. Effortless transition to CRAM from using BAM files
  4. Support for controlled loss of BAM data
The first three objectives allow users to take immediate advantage of the CRAM format while offering a smooth transition path from using BAM files. The fourth objective supports the exploration of different lossy compression strategies and provides a framework in which to effect these choices. Please note that the CRAM format does not impose any rules about what data should or should not be preserved. Instead, CRAM supports a wide range of lossless and lossy data preservation strategies enabling users to choose which data should be preserved.

Data in CRAM is stored either as CRAM records or using one of the general purpose compressors (gzip, bzip2). CRAM records are compressed using a number of different encoding strategies. For example, bases are reference compressed by encoding base differences rather than storing the bases themselves.¹

2 Data types

The CRAM specification uses logical data types and storage data types; logical data types are written as words (e.g. int) while storage data types are written using single letters (e.g. i). The difference between the two is that storage data types define how logical data types are stored in CRAM. Data in CRAM is stored either as bits or bytes. Writing values as bits and bytes is described in detail below.

2.1 Logical data types

Byte

Signed byte (8 bits).

Integer

Signed 32-bit integer.

Long

Signed 64-bit integer.

Array

An array of any logical data type: array<type>

2.2 Writing bits to a bit stream

A bit stream consists of a sequence of 1s and 0s. The bits are written most significant bit first, where new bits are stacked to the right and full bytes on the left are written out. In a bit stream the last byte will be incomplete if fewer than 8 bits have been written to it. In this case the bits in the last byte are shifted to the left.

Example of writing to bit stream

Let’s consider the following example. The table below shows a sequence of write operations:
| Operation order | Buffer state before | Written bits | Buffer state after | Issued bytes |
| :--- | :--- | :--- | :--- | :--- |
| 1 | 0x0 | 1 | 0x1 | - |
| 2 | 0x1 | 0 | 0x2 | - |
| 3 | 0x2 | 11 | 0xB | - |
| 4 | 0xB | 00000111 | 0x7 | 0xB0 |
After flushing the above bit stream the following bytes are written: 0xB0 0x70. Please note that the last byte was 0x7 before shifting to the left and became 0x70 after that:
> echo "obase=16; ibase=2; 00000111" | bc
7
> echo "obase=16; ibase=2; 01110000" | bc
70
And the whole bit sequence:
> echo "obase=2; ibase=16; B070" | bc
1011000001110000
When reading the bits from the bit sequence it must be known that only 12 bits are meaningful and the bit stream should not be read after that.
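
The write operations in the table above can be reproduced with a small bit writer. This is a minimal sketch rather than a reference implementation; the class name `BitWriter` and the helper `demo_bit_writer` are hypothetical names chosen for illustration.

```python
class BitWriter:
    """Accumulate bits most significant bit first and emit whole bytes."""

    def __init__(self):
        self.buffer = 0          # pending bits, held as an integer
        self.nbits = 0           # how many bits of `buffer` are pending
        self.out = bytearray()   # bytes issued so far

    def write(self, value, nbits):
        """Append the low `nbits` bits of `value`, most significant first."""
        self.buffer = (self.buffer << nbits) | (value & ((1 << nbits) - 1))
        self.nbits += nbits
        while self.nbits >= 8:                 # issue full bytes from the left
            self.nbits -= 8
            self.out.append((self.buffer >> self.nbits) & 0xFF)
        self.buffer &= (1 << self.nbits) - 1   # keep only the pending bits

    def flush(self):
        """Left-shift any incomplete final byte, emit it, return all bytes."""
        if self.nbits:
            self.out.append((self.buffer << (8 - self.nbits)) & 0xFF)
            self.buffer = 0
            self.nbits = 0
        return bytes(self.out)


def demo_bit_writer():
    """Replay operations 1 to 4 from the table, then flush."""
    w = BitWriter()
    w.write(0b1, 1)
    w.write(0b0, 1)
    w.write(0b11, 2)
    w.write(0b00000111, 8)
    return w.flush()
```

Calling `demo_bit_writer()` yields the two bytes 0xB0 and 0x70 described in the text; only the first 12 bits of that output are meaningful.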

Note on writing to bit stream

When writing to a bit stream both the value and the number of bits in the value must be known. This is because programming languages normally operate with bytes (8 bits) and to specify which bits are to be written requires a bit-holder, for example an integer, and the number of bits in it. Equally, when reading a value from a bit stream the number of bits must be known in advance. In case of prefix codes (e.g. Huffman) all possible bit combinations are either known in advance or it is possible to calculate how many bits will follow based on the first few bits. Alternatively, two codes can be combined, where the first contains the number of bits to read.
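
The reading side can be sketched in the same way. The example below assumes the reader already knows the width of each value (1, 1, 2 and 8 bits, as in the earlier write example); `BitReader` and `demo_bit_reader` are hypothetical names used only for illustration.

```python
class BitReader:
    """Read bits most significant bit first from a bytes object."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # position in bits from the start of `data`

    def read(self, nbits):
        """Return the next `nbits` bits as an unsigned integer."""
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos >> 3]            # byte holding this bit
            bit = (byte >> (7 - (self.pos & 7))) & 1   # MSB of each byte first
            value = (value << 1) | bit
            self.pos += 1
        return value


def demo_bit_reader():
    """Read back the four values written in the earlier example."""
    r = BitReader(bytes([0xB0, 0x70]))
    # The widths are not stored in the stream; the caller supplies them.
    return [r.read(1), r.read(1), r.read(2), r.read(8)]
```

After these four reads the position is at bit 12, and the remaining 4 padding bits should not be interpreted.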

2.3 Writing bytes to a byte stream

The interpretation of a byte stream is straightforward. CRAM uses little-endian byte order where applicable and defines the following storage data types:

Boolean (bool)

Boolean is written as 1 byte, with 0x0 being ‘false’ and 0x1 being ‘true’.

Integer (int32)

Signed 32-bit integer, written as 4 bytes in little-endian byte order.

Long (int64)

Signed 64-bit integer, written as 8 bytes in little-endian byte order.
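
As an illustration of the three fixed-width storage types above, the following sketch uses Python's standard `struct` module. The function names are hypothetical.

```python
import struct


def write_bool(value):
    """bool: one byte, 0x0 for false and 0x1 for true."""
    return b"\x01" if value else b"\x00"


def write_int32(value):
    """int32: signed 32-bit integer, 4 bytes, little-endian."""
    return struct.pack("<i", value)


def write_int64(value):
    """int64: signed 64-bit integer, 8 bytes, little-endian."""
    return struct.pack("<q", value)
```

For example, `write_int32(1)` gives the bytes 01 00 00 00, with the least significant byte first.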

ITF-8 integer (itf8)

This is an alternative way to write an integer value. The idea is similar to UTF-8 encoding and therefore this encoding is called ITF-8 (Integer Transformation Format - 8 bit).

The most significant bits of the first byte have special meaning and are called the ‘prefix’. These are 0 to 4 true bits followed by a 0. The number of 1s denotes the number of bytes to follow. To accommodate 32 bits such a representation requires 5 bytes, with only the 4 lower bits used in the last byte (byte 5).
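
A minimal sketch of ITF-8 encoding and decoding follows. It assumes that negative values are treated as unsigned 32-bit quantities, and that in the 5-byte form the top 4 bits of the value sit in the low nibble of the first byte while the lowest 4 bits sit in the low nibble of byte 5. Both points are my reading of the description above and should be checked against a reference implementation. The function names are hypothetical.

```python
def itf8_encode(value):
    """Encode a signed 32-bit integer as ITF-8 (1 to 5 bytes)."""
    v = value & 0xFFFFFFFF               # view the value as unsigned 32-bit
    if v < 0x80:                         # prefix 0: 7 value bits
        return bytes([v])
    if v < 0x4000:                       # prefix 10: 14 value bits
        return bytes([0x80 | (v >> 8), v & 0xFF])
    if v < 0x200000:                     # prefix 110: 21 value bits
        return bytes([0xC0 | (v >> 16), (v >> 8) & 0xFF, v & 0xFF])
    if v < 0x10000000:                   # prefix 1110: 28 value bits
        return bytes([0xE0 | (v >> 24), (v >> 16) & 0xFF,
                      (v >> 8) & 0xFF, v & 0xFF])
    # 5 bytes (assumed layout): only the low 4 bits of byte 5 are used
    return bytes([0xF0 | (v >> 28), (v >> 20) & 0xFF, (v >> 12) & 0xFF,
                  (v >> 4) & 0xFF, v & 0x0F])


def itf8_decode(data, pos=0):
    """Return (value, next_pos) for the ITF-8 integer starting at data[pos]."""
    b0 = data[pos]
    if b0 < 0x80:
        v, n = b0, 1
    elif b0 < 0xC0:
        v, n = ((b0 & 0x3F) << 8) | data[pos + 1], 2
    elif b0 < 0xE0:
        v, n = ((b0 & 0x1F) << 16) | (data[pos + 1] << 8) | data[pos + 2], 3
    elif b0 < 0xF0:
        v, n = (((b0 & 0x0F) << 24) | (data[pos + 1] << 16)
                | (data[pos + 2] << 8) | data[pos + 3]), 4
    else:
        v, n = (((b0 & 0x0F) << 28) | (data[pos + 1] << 20)
                | (data[pos + 2] << 12) | (data[pos + 3] << 4)
                | (data[pos + 4] & 0x0F)), 5
    if v >= 0x80000000:                  # reinterpret as signed 32-bit
        v -= 0x100000000
    return v, pos + n
```

Small non-negative values, which dominate most CRAM fields, take a single byte; under the stated assumptions any negative value takes the full 5 bytes.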

LTF-8 long (ltf8)

See ITF-8 for more details. The only difference between ITF-8 and LTF-8 is the number of bytes used to encode a single value. A 64-bit value requires at most 9 bytes, in which case the first byte consists entirely of 1s (0xFF).
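
The same idea extends to 64 bits. The sketch below assumes that the count of leading 1 bits in the first byte gives the number of following bytes (0 to 8), that the remaining bits of the first byte hold the most significant value bits, and that negative values are treated as unsigned 64-bit quantities. These are assumptions drawn from the ITF-8 description, not statements from this section, and the function names are hypothetical.

```python
def ltf8_encode(value):
    """Encode a signed 64-bit integer as LTF-8 (1 to 9 bytes)."""
    v = value & 0xFFFFFFFFFFFFFFFF       # view the value as unsigned 64-bit
    for nbytes in range(1, 9):
        if v < (1 << (7 * nbytes)):      # n bytes carry 7 * n value bits
            prefix = (0xFF << (9 - nbytes)) & 0xFF   # (n - 1) leading 1 bits
            low_bits = 8 * (nbytes - 1)
            first = prefix | (v >> low_bits)
            rest = (v & ((1 << low_bits) - 1)).to_bytes(nbytes - 1, "big")
            return bytes([first]) + rest
    return b"\xff" + v.to_bytes(8, "big")   # 9 bytes: 0xFF, then all 64 bits


def ltf8_decode(data, pos=0):
    """Return (value, next_pos) for the LTF-8 integer starting at data[pos]."""
    b0 = data[pos]
    extra = 0
    while extra < 8 and (b0 << extra) & 0x80:   # count leading 1 bits
        extra += 1
    v = b0 & (0xFF >> (extra + 1))              # value bits left in byte 1
    for i in range(1, extra + 1):
        v = (v << 8) | data[pos + i]
    if v >= 1 << 63:                            # reinterpret as signed 64-bit
        v -= 1 << 64
    return v, pos + 1 + extra
```

Values below 2^56 fit in 8 bytes or fewer; anything larger, including every negative value under the stated assumptions, uses the 9-byte form that starts with 0xFF.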

Array (array<type>)

A variable sized array with an explicitly written dimension. Array length is written first as integer (itf8), followed by the elements of the array.

Implicit or fixed-size arrays are also used, written as type[] or type[4] (for example). These have no explicit dimension included in the file format and instead rely on the specification itself to document the array size.

Encoding

Encoding is a data type that specifies how data series have been compressed. Encodings are defined as encoding<type> where the type is a logical data type as opposed to a storage data type.

An encoding is written as follows. The first integer (itf8) denotes the codec id and the second integer (itf8) the number of bytes in the following encoding-specific values.

Subexponential encoding example:
| Value | Type | Name |
| :--- | :--- | :--- |
| 0x7 | itf8 | codec id |
| 0x2 | itf8 | number of bytes to follow |
| 0x0 | itf8 | offset |
| 0x1 | itf8 | K parameter |
The first byte "0x7" is the codec id.

The next byte "0x2" denotes the length of the bytes to follow (2).

The subexponential encoding has 2 parameters: integer (itf8) offset and integer (itf8) K.

offset = 0x0 = 0
K = 0x1 = 1
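
The four bytes of this example can be parsed as follows. This sketch only handles one-byte ITF-8 values (0 to 127), which is all the example needs; the helper names `read_itf8_small` and `parse_subexp_encoding` are hypothetical.

```python
def read_itf8_small(data, pos):
    """Read a one-byte ITF-8 value (0..127); enough for this example only."""
    if data[pos] & 0x80:
        raise ValueError("multi-byte ITF-8 is not handled in this sketch")
    return data[pos], pos + 1


def parse_subexp_encoding(data):
    """Parse codec id, parameter length, then the SUBEXP parameters."""
    codec_id, pos = read_itf8_small(data, 0)
    nbytes, pos = read_itf8_small(data, pos)
    params = data[pos:pos + nbytes]      # the encoding-specific values
    if codec_id != 7:
        raise ValueError("not a SUBEXP encoding")
    offset, p = read_itf8_small(params, 0)
    k, p = read_itf8_small(params, p)
    return {"codec_id": codec_id, "offset": offset, "K": k}
```

Because the parameter length is written before the parameters, a reader that does not recognise a codec id can still skip over its parameters.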

Map

A map is a collection of keys and associated values. A map with N keys is written as follows:
{key1: value1, key2: value2, ..., keyN: valueN}
| size in bytes | N | key 1 | value 1 | key ... | value ... | key N | value N |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
Both the size in bytes and the number of keys are written as integer (itf8). Keys and values are written according to their data types and are specific to each map.

String

A string is represented as a byte array using UTF-8 format. Read names, reference sequence names and tag values with type ‘Z’ are stored as UTF-8.

3 Encodings

Encoding is a data structure that captures information about compression details of a data series that are required to uncompress it. This could be a set of constants required to initialize a specific decompression algorithm or statistical properties of a data series or, in case of data series being stored in an external block, the block content id.
Encoding notation is defined as the keyword ‘encoding’ followed by its data type in angle brackets, for example ‘encoding<byte>’ stands for an encoding that operates on a data series of data type ‘byte’.

Encodings may have parameters of different data types, for example the EXTERNAL encoding has only one parameter, integer id of the external block. The following encodings are defined:
| Codec | ID | Parameters | Comment |
| :--- | :--- | :--- | :--- |
| NULL | 0 | none | series not preserved |
| EXTERNAL | 1 | int block content id | the block content identifier used to associate external data blocks with data series |
| Deprecated (GOLOMB) | 2 | int offset, int M | Golomb coding |
| HUFFMAN | 3 | array<int>, array<int> | coding with int/byte values |
| BYTE_ARRAY_LEN | 4 | encoding<int> array length, encoding<byte> bytes | coding of byte arrays with array length |
| BYTE_ARRAY_STOP | 5 | byte stop, int external block content id | coding of byte arrays with a stop value |
| BETA | 6 | int offset, int number of bits | binary coding |
| SUBEXP | 7 | int offset, int K | subexponential coding |
| Deprecated (GOLOMB_RICE) | 8 | int offset, int $\log_2 m$ | Golomb-Rice coding |
| GAMMA | 9 | int offset | Elias gamma coding |
See section 13 for more detailed descriptions of all the above coding algorithms and their parameters.

4 Checksums

Checksums are used to ensure data integrity. The following checksum algorithms are used in CRAM.

4.1 CRC32

This is a 32-bit cyclic redundancy checksum with the polynomial 0x04C11DB7. Please refer to ITU-T V.42 for more details. The value of the CRC32 hash function is written as an integer.

4.2 CRC32 sum

CRC32 sum is a combination of CRC32 values, formed by summing up all individual CRC32 values modulo $2^{32}$.
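
Both checksums can be sketched with Python's standard `zlib` module, whose `crc32` function implements the common CRC-32 with this polynomial. The wrapper names `crc32` and `crc32_sum` are hypothetical.

```python
import zlib


def crc32(data):
    """CRC32 of a byte string as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def crc32_sum(chunks):
    """Sum of the individual CRC32 values, modulo 2**32."""
    total = 0
    for chunk in chunks:
        total = (total + crc32(chunk)) % (1 << 32)
    return total
```

Because addition is commutative, the CRC32 sum does not depend on the order in which the individual items are processed.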

5 File structure

The overall CRAM file structure is described in this section. Please refer to other sections of this document for more detailed information.

A CRAM file consists of a fixed length file definition, followed by a CRAM header container, then zero or more data containers, and finally a special end-of-file container.
| File definition | CRAM Header Container | Data Container | ... | Data Container | CRAM EOF Container |
| :---: | :---: | :---: | :---: | :---: | :---: |
Figure 1: A CRAM file consists of a file definition, followed by a header container, then other containers.

Containers consist of one or more blocks. The first container, called the CRAM header container, is used to store a textual header as described in the SAM specification (see section 7.1). This container may have additional padding bytes present to permit inline rewriting of the SAM header with small changes in size. These padding bytes are undefined, but we recommend filling them with NUL bytes. The padding bytes can either be in explicit uncompressed Block structures, or be unallocated extra space where the size of the container is larger than the combined size of the blocks held within it.
Figure 2: The first container holds the CRAM header text.

Each container starts with a container header structure followed by one or more blocks. The first block in each container is the compression header block giving details of how to decode data in subsequent blocks. Each block starts with a block header structure followed by the block data.
Figure 3: Containers as a series of blocks

The blocks after the compression header are organised logically into slices. One slice may contain, for example, a contiguous region of alignment data. Slices begin with a slice header block and are followed by one or more data blocks. It is these data blocks which hold the primary bulk of CRAM data. The data blocks are further subdivided into a core data block and one or more external data blocks.
Figure 4: Slices formed from a series of concatenated blocks

6 File definition

Each CRAM file starts with a fixed length (26 bytes) definition with the following fields:
| Data type | Name | Value |
| :--- | :--- | :--- |
| byte[4] | format magic number | CRAM (0x43 0x52 0x41 0x4d) |
| unsigned byte | major format number | 3 (0x3) |
| unsigned byte | minor format number | 1 (0x1) |
| byte[20] | file id | CRAM file identifier (e.g. file name or SHA1 checksum) |
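
Reading the 26-byte file definition can be sketched with the standard `struct` module. The function name and the returned dictionary keys are hypothetical; the field layout is taken from the table above.

```python
import struct


def parse_file_definition(data):
    """Parse the fixed 26-byte CRAM file definition."""
    if len(data) < 26:
        raise ValueError("the file definition needs 26 bytes")
    # 4-byte magic, two unsigned bytes, 20-byte file id
    magic, major, minor, file_id = struct.unpack("<4sBB20s", data[:26])
    if magic != b"CRAM":
        raise ValueError("bad magic number")
    return {"major": major, "minor": minor, "file_id": file_id}
```

A reader would typically check the major and minor numbers against the list of valid versions before going on to parse any containers.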
Valid CRAM major.minor version numbers are as follows:

1.0 The original public CRAM release.

2.0 The first CRAM release implemented in both Java and C; tidied up implementation vs specification differences in 1.0.

2.1 Gained end of file markers; compatible with 2.0.

3.0 Additional compression methods; header and data checksums; improvements for unsorted data.

3.1 Additional EXTERNAL compression codecs only.

CRAM 3.0 and 3.1 differ only in the list of compression methods available, so tools that output CRAM 3 without using any 3.1 codecs should write the header to indicate 3.0 in order to permit maximum compatibility.
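
A writer could apply this recommendation with a check along the following lines. This is only a sketch: the set of 3.1 method ids (5 to 8) is taken from the block compression method list in section 8, and the names `CRAM_3_1_METHODS` and `minor_version_for` are hypothetical.

```python
# Block compression methods first available in CRAM 3.1:
# rans4x16, adaptive arithmetic coder, fqzcomp, name tokeniser.
CRAM_3_1_METHODS = {5, 6, 7, 8}


def minor_version_for(methods_used):
    """Return the minor format number to write for a CRAM 3 file."""
    return 1 if CRAM_3_1_METHODS & set(methods_used) else 0
```

A file that only uses raw, gzip, bzip2, lzma or rans4x8 blocks would then be labelled 3.0 and remain readable by decoders that predate 3.1.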

7 Container header structure

The file definition is followed by one or more containers with the following header structure where the container content is stored in the ‘blocks’ field:
| Data type | Name | Value |
| :--- | :--- | :--- |
| int32 | length | the sum of the lengths of all blocks in this container (headers and data) and any padding bytes (CRAM header container only); equal to the total byte length of the container minus the byte length of this header structure |
| itf8 | reference sequence id | reference sequence identifier, or<br>-1 for unmapped reads,<br>-2 for multiple reference sequences.<br>All slices in this container must have a reference sequence id matching this value. |
| itf8 | starting position on the reference | the alignment start position |
| itf8 | alignment span | the length of the alignment |
| itf8 | number of records | number of records in the container |
| ltf8 | record counter | 1-based sequential index of records in the file/stream. |
| ltf8 | bases | number of read bases |
| itf8 | number of blocks | the total number of blocks in this container |
| array<itf8> | landmarks | the locations of slices in this container as byte offsets from the end of this container header, used for random access indexing. For sequence data containers, the landmark count must equal the slice count.<br>Since the block before the first slice is the compression header, landmarks[0] is equal to the byte length of the compression header. |
| int | crc32 | CRC32 hash of all the preceding bytes in the container. |
| byte[] | blocks | The blocks contained within the container. |
In the initial CRAM header container, the reference sequence id, starting position on the reference, and alignment span fields must be ignored when reading. The landmarks array is optional for the CRAM header, but if it exists it should point to block offsets instead of slices, with the first block containing the textual header.
In data containers specifying unmapped reads or multiple reference sequences (i.e. reference sequence id < 0), the starting position on the reference and alignment span fields must be ignored when reading. When writing, it is recommended to set each of these ignored fields to the value 0.
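
The container header fields can be read in order as sketched below. The ITF-8 and LTF-8 readers assume the byte layouts discussed in section 2.3 (in particular the 5-byte ITF-8 form), the crc32 field is assumed to be stored as a 4-byte little-endian integer covering the header bytes that precede it, and all function names and dictionary keys are hypothetical.

```python
import struct
import zlib


def read_itf8(buf, pos):
    """Return (signed 32-bit value, next position)."""
    b0 = buf[pos]
    n = 0 if b0 < 0x80 else 1 if b0 < 0xC0 else 2 if b0 < 0xE0 else 3 if b0 < 0xF0 else 4
    v = b0 & (0xFF >> (n + 1)) if n < 4 else b0 & 0x0F
    for i in range(1, n + 1):
        if n == 4 and i == 4:
            v = (v << 4) | (buf[pos + i] & 0x0F)   # low 4 bits of byte 5 only
        else:
            v = (v << 8) | buf[pos + i]
    return (v - (1 << 32) if v >= 1 << 31 else v), pos + 1 + n


def read_ltf8(buf, pos):
    """Return (signed 64-bit value, next position)."""
    b0 = buf[pos]
    n = 0
    while n < 8 and (b0 << n) & 0x80:              # count leading 1 bits
        n += 1
    v = b0 & (0xFF >> (n + 1))
    for i in range(1, n + 1):
        v = (v << 8) | buf[pos + i]
    return (v - (1 << 64) if v >= 1 << 63 else v), pos + 1 + n


def parse_container_header(buf, pos=0):
    """Parse one container header starting at buf[pos]."""
    start = pos
    (length,) = struct.unpack_from("<i", buf, pos)
    pos += 4
    h = {"length": length}
    for name in ("ref_seq_id", "start", "span", "n_records"):
        h[name], pos = read_itf8(buf, pos)
    for name in ("record_counter", "bases"):
        h[name], pos = read_ltf8(buf, pos)
    h["n_blocks"], pos = read_itf8(buf, pos)
    n_landmarks, pos = read_itf8(buf, pos)
    h["landmarks"] = []
    for _ in range(n_landmarks):
        value, pos = read_itf8(buf, pos)
        h["landmarks"].append(value)
    (stored_crc,) = struct.unpack_from("<I", buf, pos)
    h["crc_ok"] = stored_crc == (zlib.crc32(buf[start:pos]) & 0xFFFFFFFF)
    h["header_size"] = pos + 4 - start
    if h["ref_seq_id"] < 0:              # unmapped (-1) or multi-reference (-2)
        h["start"] = h["span"] = None    # these fields must be ignored
    return h
```

The blocks of the container begin `header_size` bytes after the start of the header, and the landmarks are offsets relative to that point.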

7.1 CRAM header container

The first container in a CRAM file contains a textual header in one or more blocks. See section 8.3 for more details on the layout of data within these blocks and constraints applied to the contents of the SAM header.
The landmarks field of the container header structure may be used to indicate the offsets of the blocks used in the header container. These may optionally be omitted by specifying an array size of zero.

8 Block structure

Containers consist of one or more blocks. Block compression is applied independently and in addition to any encodings used to compress data within the block. Each block has the following header structure, with the data stored in the ‘block data’ field:
| Data type | Name | Value |
| :--- | :--- | :--- |
| byte | method | the block compression method (and first CRAM version):<br>0: raw (none)*<br>1: gzip<br>2: bzip2 (v2.0)<br>3: lzma (v3.0)<br>4: rans4x8 (v3.0)<br>5: rans4x16 (v3.1)<br>6: adaptive arithmetic coder (v3.1)<br>7: fqzcomp (v3.1)<br>8: name tokeniser (v3.1) |
| byte | block content type id | the block content type identifier |
| itf8 | block content id | the block content identifier used to associate external data blocks with data series |
| itf8 | size in bytes* | size of the block data after applying block compression |
| itf8 | raw size in bytes* | size of the block data before applying block compression |
| byte[] | block data | the data stored in the block:<br>• bit stream of CRAM records (core data block)<br>• byte stream (external data block)<br>• additional fields (header blocks) |
| byte[4] | CRC32 | CRC32 hash value for all preceding bytes in the block |
* Note on raw method: both compressed and raw sizes must be set to the same value.
Empty blocks may occur in the files. Blocks with a raw (uncompressed) size of zero are treated as empty, irrespective of their “method” byte. This is equivalent to interpreting them as having method zero (raw) and compressed size of zero.
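The block layout and the empty-block rule can be illustrated with a short Python sketch. It is not a reference implementation: the names `read_itf8` and `parse_block` are hypothetical, the ITF8 decoder follows the integer encoding defined elsewhere in this specification, and the CRC32 is assumed to be the zlib/gzip CRC-32 stored as a little-endian 32-bit integer. Decompression of the block data is deliberately left out.

```python
import struct
import zlib


def read_itf8(buf, pos):
    """Decode one ITF8 integer starting at buf[pos]; return (value, next_pos)."""
    b0 = buf[pos]
    if b0 < 0x80:                       # 0xxxxxxx: 1 byte
        return b0, pos + 1
    if b0 < 0xC0:                       # 10xxxxxx: 2 bytes
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if b0 < 0xE0:                       # 110xxxxx: 3 bytes
        return ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    if b0 < 0xF0:                       # 1110xxxx: 4 bytes
        return (((b0 & 0x0F) << 24) | (buf[pos + 1] << 16)
                | (buf[pos + 2] << 8) | buf[pos + 3]), pos + 4
    # 1111xxxx: 5 bytes; only the low 4 bits of the last byte are used
    val = (((b0 & 0x0F) << 28) | (buf[pos + 1] << 20) | (buf[pos + 2] << 12)
           | (buf[pos + 3] << 4) | (buf[pos + 4] & 0x0F))
    if val >= 1 << 31:                  # interpret as a signed 32-bit value
        val -= 1 << 32
    return val, pos + 5


def parse_block(buf, pos=0):
    """Parse one block; return (fields, next_pos). Data is NOT decompressed."""
    start = pos
    method, content_type = buf[pos], buf[pos + 1]
    pos += 2
    content_id, pos = read_itf8(buf, pos)
    size, pos = read_itf8(buf, pos)        # size after block compression
    raw_size, pos = read_itf8(buf, pos)    # size before block compression
    data = bytes(buf[pos:pos + size])
    pos += size
    # Assumption: CRC-32 as computed by zlib, stored little-endian.
    (stored_crc,) = struct.unpack_from("<I", buf, pos)
    if zlib.crc32(buf[start:pos]) != stored_crc:
        raise ValueError("block CRC32 mismatch")
    pos += 4
    if raw_size == 0:
        # Empty block: treated as method 0 (raw) with no data,
        # irrespective of the method byte.
        method, data = 0, b""
    elif method == 0 and size != raw_size:
        raise ValueError("raw block: compressed and raw sizes differ")
    fields = {"method": method, "content_type": content_type,
              "content_id": content_id, "raw_size": raw_size, "data": data}
    return fields, pos
```

A caller would dispatch on `fields["method"]` to select a decompressor and then check that the decompressed length equals `raw_size`.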

8.1 Block content types

CRAM has the following block content types:

| Block content type | Block content type id | Name | Contents |
| :--- | :--- | :--- | :--- |
| FILE_HEADER | 0 | CRAM header block | CRAM header |
| COMPRESSION_HEADER | 1 | Compression header block | See specific section |
| SLICE_HEADER$^{a}$ | 2 | Slice header block | See specific section |
| | 3 | | reserved |
| EXTERNAL_DATA | 4 | external data block | data produced by external encodings |
| CORE_DATA | 5 | core data block | bit stream of all encodings except for external encodings |
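For readers implementing a parser, the content type ids listed here map naturally onto an enumeration. The following Python sketch is only one possible representation; the class name `BlockContentType` and the helper `content_type_from_byte` are hypothetical.

```python
from enum import IntEnum


class BlockContentType(IntEnum):
    FILE_HEADER = 0
    COMPRESSION_HEADER = 1
    SLICE_HEADER = 2
    # id 3 is reserved and deliberately has no member
    EXTERNAL_DATA = 4
    CORE_DATA = 5


def content_type_from_byte(value):
    """Map a block content type id byte to the enum, rejecting id 3 and unknown ids."""
    try:
        return BlockContentType(value)
    except ValueError:
        raise ValueError(f"reserved or unknown block content type id: {value}") from None
```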

8.2 Block content id

Block content id is used to distinguish between external blocks in the same slice. Each external encoding has an id parameter which must be one of the external block content ids. For external blocks the content id is a positive integer. For all other blocks the content id should be 0. Consequently, no external encoding may use a content id less than 1.

Data blocks

Data is stored in data blocks. There are two types of data blocks: core data blocks and external data blocks. The difference between core and external data blocks is that core data blocks consist of data series that are compressed using bit encodings while the external data blocks are byte compressed. One core data block and any number of external data blocks are associated with each slice.

Writing to and reading from core and external data blocks is organised through CRAM records. Each data series is associated with an encoding. In the case of external encodings, the block content id is used to identify the block where the data series is stored. Please note that external blocks can have multiple data series associated with them; in this case the values from these data series will be interleaved.
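The interleaving of data series within a shared external block can be pictured with a small Python sketch. Everything named here is illustrative: the class `ExternalBlocks`, the content ids 7 and 9, the series-to-block mapping and the payload bytes are all made up, and each value is assumed to occupy a single byte. The point is only that every external block keeps one read cursor, so series sharing a block consume its bytes in record order.

```python
class ExternalBlocks:
    """Uncompressed external blocks of one slice, keyed by block content id."""

    def __init__(self, blocks):
        for content_id in blocks:
            if content_id < 1:
                raise ValueError("external block content ids must be positive")
        self._data = dict(blocks)
        self._pos = dict.fromkeys(blocks, 0)   # one read cursor per block

    def read(self, content_id, n=1):
        """Consume the next n bytes of the block with this content id."""
        data = self._data[content_id]
        start = self._pos[content_id]
        if start + n > len(data):
            raise EOFError(f"external block {content_id} exhausted")
        self._pos[content_id] = start + n
        return data[start:start + n]


# Hypothetical encoding map: BA and QS share block 7, MQ uses block 9.
series_block = {"BA": 7, "QS": 7, "MQ": 9}
blocks = ExternalBlocks({7: b"A!C#", 9: b"\x3c\x28"})

records = []
for _ in range(2):
    base = blocks.read(series_block["BA"])   # block 7: base byte
    qual = blocks.read(series_block["QS"])   # block 7: next byte is the quality
    mapq = blocks.read(series_block["MQ"])   # block 9 has its own cursor
    records.append((base, qual, mapq))
```

Because BA and QS share block 7, its bytes alternate base, quality, base, quality; a decoder must therefore read the series in exactly the order in which the encoder wrote them.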

8.3 CRAM header block(s)

The SAM header is stored in the first block of the CRAM header container (see section 7.1). This block may only be uncompressed or gzip compressed. It is followed by zero or more uncompressed expansion blocks. If present, these permit in-place editing of the CRAM header: the header may grow or shrink, with a compensating size change applied to the subsequent expansion block, avoiding the need to rewrite the remainder of the file. The contents of any expansion blocks should be zero bytes (nul characters).

The format of the initial SAM header block is a 32-bit little-endian integer holding the length of the text of the SAM header, excluding any nul-termination bytes, followed by the text itself. Although the field is 32-bit, the maximum permitted value is $2^{31}$, and all lengths must be positive.
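As a sketch of this layout, the following Python function extracts the SAM header text from the already uncompressed data of the first header block. The function name `read_sam_header_text` is hypothetical, and decoding the text as UTF-8 is an assumption of this example rather than something stated here.

```python
import struct


def read_sam_header_text(block_data):
    """Return the SAM header text held in an uncompressed CRAM header block."""
    if len(block_data) < 4:
        raise ValueError("block too short to hold the length field")
    # 32-bit little-endian length of the text, excluding nul termination.
    (length,) = struct.unpack_from("<I", block_data, 0)
    if length == 0 or length > 2 ** 31:
        raise ValueError(f"invalid SAM header length: {length}")
    if 4 + length > len(block_data):
        raise ValueError("SAM header text is truncated")
    # Any bytes after the text (for example nul padding) are ignored.
    return block_data[4:4 + length].decode("utf-8")
```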

The following constraints apply to the SAM header text:
  • The SQ:MD5 checksum (the M5 tag of each @SQ header line) is required unless the reference sequence has been embedded into the file.
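A writer or validator might check this constraint along the following lines. This is a minimal Python sketch with a hypothetical function name, `sq_lines_missing_m5`; it only reports which @SQ lines lack an M5 tag and leaves it to the caller to decide whether that is an error, since that depends on whether the reference has been embedded.

```python
def sq_lines_missing_m5(sam_header_text):
    """Return the SN names of @SQ header lines that carry no M5 tag."""
    missing = []
    for line in sam_header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue
        # Header fields have the form TAG:VALUE with a two-character tag.
        tags = {f[:2]: f[3:] for f in fields[1:] if len(f) >= 3 and f[2] == ":"}
        if "M5" not in tags:
            missing.append(tags.get("SN", "?"))
    return missing
```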

8.4 Compression header block

The compression header block consists of 3 parts: preservation map, data series encoding map and tag encoding map.

Preservation map

The preservation map contains information about which data was preserved in the CRAM file. It is stored as a map with byte[2] keys:

| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| RN | bool | read names included | true if read names are preserved for all reads |
| AP | bool | AP data series delta | true if AP data series is delta, false otherwise |
| RR | bool | reference required | true if reference sequence is required to restore the data completely |
| SM | byte[5] | substitution matrix | substitution matrix |
| TD | array<byte> | tag ids dictionary | a list of lists of tag ids, see tag encoding section |
The boolean values are optional, defaulting to true when absent, although it is recommended to explicitly set them. SM and TD are mandatory.
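The defaulting rule can be sketched in Python as a post-processing step on an already parsed map. The function `apply_preservation_defaults` and the representation of the map as a dictionary keyed by two-character strings are assumptions of this example; parsing the map itself is not shown.

```python
BOOL_KEYS = ("RN", "AP", "RR")     # optional, default to true when absent
MANDATORY_KEYS = ("SM", "TD")      # must always be present


def apply_preservation_defaults(entries):
    """Return a copy of a parsed preservation map with defaults filled in."""
    missing = [key for key in MANDATORY_KEYS if key not in entries]
    if missing:
        raise ValueError(f"preservation map lacks mandatory keys: {missing}")
    if len(entries["SM"]) != 5:
        raise ValueError("substitution matrix (SM) must be exactly 5 bytes")
    result = dict(entries)
    for key in BOOL_KEYS:
        result.setdefault(key, True)
    return result
```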

Data series encodings

Each data series has an encoding. These encodings are stored in a map with byte[2] keys and are decoded in approximately this order$^{2}$:

| Key | Value data type | Name | Value |
| :--- | :--- | :--- | :--- |
| BF | encoding<int> | BAM bit flags | see separate section |
| CF | encoding<int> | CRAM bit flags | see specific section |
| RI | encoding<int> | reference id | record reference id from the SAM file header |
| RL | encoding<int> | read lengths | read lengths |
| AP | encoding<int> | in-seq positions | if AP-Delta = true: 0-based alignment start delta from the AP value in the previous record. Note this delta may be negative, for example when switching references in a multi-reference slice. When the record is the first in the slice, the previous position used is the slice alignment-start field (hence the first delta should be zero for single-reference slices, or the AP value itself for multi-reference slices).<br>if AP-Delta = false: encodes the alignment start position directly |
| RG | encoding<int> | read groups | read groups. Special value '-1' stands for no group. |
| RN$^{a}$ | encoding<byte[]> | read names | read names |
| MF | encoding<int> | next mate bit flags | see specific section |
| NS | encoding<int> | next fragment reference sequence id | reference sequence ids for the next fragment |
| NP | encoding<int> | next mate alignment start | alignment positions for the next fragment |
| TS | encoding<int> | template size | template sizes |
| NF | encoding<int> | distance to next fragment | number of records to skip to the next fragment$^{b}$ |
| TL$^{c}$ | encoding<int> | tag ids | list of tag ids, see tag encoding section |
| FN | encoding<int> | number of read features | number of read features in each record |
| FC | encoding<byte> | read features codes | see separate section |
| FP | encoding<int> | in-read positions | positions of the read features; a positive delta to the last position (starting with zero) |
| DL | encoding<int> | deletion lengths | base-pair deletion lengths |
| BB | encoding<byte[]> | stretches of bases | bases |
| QQ | encoding<byte[]> | stretches of quality scores | quality scores |
| BS | encoding<byte> | base substitution codes | base substitution codes |
| IN | encoding<byte[]> | insertion | inserted bases |
| RS | encoding<int> | reference skip length | number of skipped bases for the 'N' read feature |
| PD | encoding<int> | padding | number of padded bases |
| HC | encoding<int> | hard clip | number of hard clipped bases |
| SC | encoding<byte[]> | soft clip | soft clipped bases |
| MQ | encoding<int> | mapping qualities | mapping quality scores |
| BA | encoding<byte> | bases | bases |
| QS | encoding<byte> | quality scores | quality scores |
| TC$^{d}$ | N/A | legacy field | to be ignored |
| TN$^{d}$ | N/A | legacy field | to be ignored |
$^{a}$ Note that RN is decoded after MF if the record is detached from its mate and read names are being auto-generated.

$^{b}$ The count is reset for each slice, so NF can only refer to a record later within this slice.

$^{c}$ TL is followed by decoding the tag values themselves, in order of appearance in the tag dictionary.

$^{d}$ TC and TN are legacy data series from CRAM 1.0. They have no function in CRAM 3.0 and should not be present. However, some implementations do output them and decoders must silently skip these fields. It is illegal for TC and TN to contain any data values, although there may be empty blocks associated with them.
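The two delta-coded series in this table, AP and FP, can be illustrated with a short Python sketch. The function names are hypothetical and the inputs are assumed to be the integers already produced by the AP and FP decoders. The multi-reference example assumes a slice alignment-start field of 0, which is what makes the first delta equal to the AP value itself.

```python
def decode_alignment_starts(ap_values, ap_delta, slice_alignment_start):
    """Turn decoded AP values into alignment start positions."""
    if not ap_delta:
        # AP-Delta = false: the values are the alignment starts themselves.
        return list(ap_values)
    starts = []
    previous = slice_alignment_start   # used for the first record in the slice
    for delta in ap_values:
        previous += delta              # the delta may be negative
        starts.append(previous)
    return starts


def decode_feature_positions(fp_values):
    """Turn decoded FP values of one record into in-read feature positions."""
    positions = []
    last = 0                           # deltas start from position zero
    for delta in fp_values:
        last += delta
        positions.append(last)
    return positions


# Single-reference slice starting at 100: the first delta is zero.
single_ref = decode_alignment_starts([0, 5, 3], True, 100)
# Multi-reference slice (alignment start assumed 0): first delta is the AP value.
multi_ref = decode_alignment_starts([1000, 20, -900], True, 0)
```

Here `single_ref` is `[100, 105, 108]` and `multi_ref` is `[1000, 1020, 120]`; the negative delta in the second case corresponds to a switch to another reference.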

Tag encodings

The tag dictionary (TD) describes the unique combinations of tag id / type that occur on each alignment record. For example, if we search the id / types present in each record and find only two combinations - X1:i BC:Z SA:Z and X1:i BC:Z - then we have two dictionary entries in the TD map.

Let $L_{i}=\{T_{i0}, T_{i1}, \ldots, T_{ix}\}$ be a list of all tag ids for a record $R_{i}$, where $i$ is the sequential record index and $T_{ij}$ denotes the $j$-th tag id in the record. The list of unique $L_{i}$ is stored as the TD value in the preservation map. Maintaining the order is not a requirement for encoders (hence “combinations”), but it is permissible, and thus different permutations, each encoded with their own elements in TD, should be supported by the decoder. Each $L_{i}$ element in TD is assigned a sequential integer number starting with 0. These integer numbers are referred to by the TL data series. Using TD, an integer from the TL data series can be mapped back into a list of tag ids. Thus per alignment record we only need to store tag values and not their ids and types.
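On the encoder side, the relationship between TD and TL might be sketched as follows in Python. The function name `build_tag_dictionary` is hypothetical, and each tag is represented here as a 3-character string (two-character id plus type code) purely for illustration. This sketch treats differently ordered lists as distinct entries, which is one of the permitted behaviours.

```python
def build_tag_dictionary(record_tag_lists):
    """Return (td, tl): the unique tag lists and one TL index per record."""
    td = []        # unique tag lists, in order of first appearance
    index = {}     # tag list -> its sequential number in td
    tl = []        # per-record index into td
    for tags in record_tag_lists:
        key = tuple(tags)
        if key not in index:
            index[key] = len(td)   # numbering starts at 0
            td.append(key)
        tl.append(index[key])
    return td, tl


records = [
    ["X1C", "BCZ", "SAZ"],
    ["X1C", "BCZ"],
    ["X1C", "BCZ", "SAZ"],
]
td, tl = build_tag_dictionary(records)
```

With these three records, `td` holds two entries and `tl` is `[0, 1, 0]`, so a decoder recovers each record's tag list as `td[tl[i]]`.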

The TD is written as a byte array consisting of $L_{i}$ values separated with \0. Each $L_{i}$ value is written as a concatenation of 3-byte $T_{ij}$ elements: tag id followed by BAM tag type code (one of A, c, C, s, S, i, I, f, Z, H or B, as described in the SAM specification). For example, the TD for tag lists X1:i BC:Z SA:Z and X1:i BC:Z may be encoded as X1CBCZSAZ\0X1CBCZ\0, with X1C indicating a 1 byte unsigned value for tag X1.
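A decoder could split the TD byte array back into tag lists roughly as in the Python sketch below. The name `parse_td` is hypothetical, and the sketch assumes that every list, including the last, is terminated by a \0 byte, as in the example encoding given here.

```python
def parse_td(td):
    """Split a TD byte array into lists of (tag id, type code) pairs."""
    if not td:
        return []
    # Assumption: each tag list is nul-terminated, so drop the final nul
    # before splitting to avoid a spurious empty entry at the end.
    body = td[:-1] if td.endswith(b"\x00") else td
    entries = []
    for chunk in body.split(b"\x00"):
        if len(chunk) % 3 != 0:
            raise ValueError("tag list length is not a multiple of 3 bytes")
        entries.append([
            (chunk[i:i + 2].decode("ascii"), chr(chunk[i + 2]))
            for i in range(0, len(chunk), 3)
        ])
    return entries


tag_lists = parse_td(b"X1CBCZSAZ\x00X1CBCZ\x00")
```

A TL value of 1 then selects `tag_lists[1]`, that is the tags X1 (type C) and BC (type Z).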

Tag values

The encodings used for different tags are stored in a map. The key is 3 bytes formed from the BAM tag id and type code, matching the TD dictionary described above. Unlike the Data Series Encoding Map, the key is stored in the map as an ITF8 encoded integer, constructed using (char1 << 16) + (char2 << 8) + type. For example, the 3-byte representation of OQ:Z is {0x4F, 0x51, 0x5A} and these bytes are interpreted as the integer key 0x004F515A, leading to an ITF8 byte stream {0xE0, 0x4F, 0x51, 0x5A}.
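The key construction and its ITF8 form can be reproduced with a few lines of Python. The function names `tag_encoding_key` and `write_itf8` are hypothetical; the ITF8 encoder follows the integer encoding defined elsewhere in this specification.

```python
def tag_encoding_key(tag_id, type_code):
    """Build the integer map key from a 2-character tag id and a type code."""
    if len(tag_id) != 2 or len(type_code) != 1:
        raise ValueError("expected a 2-character tag id and a 1-character type")
    return (ord(tag_id[0]) << 16) + (ord(tag_id[1]) << 8) + ord(type_code)


def write_itf8(value):
    """Encode a 32-bit integer as ITF8 bytes."""
    v = value & 0xFFFFFFFF             # treat negative values as unsigned 32-bit
    if v < 0x80:                       # fits in 7 bits
        return bytes([v])
    if v < 0x4000:                     # fits in 14 bits
        return bytes([0x80 | (v >> 8), v & 0xFF])
    if v < 0x200000:                   # fits in 21 bits
        return bytes([0xC0 | (v >> 16), (v >> 8) & 0xFF, v & 0xFF])
    if v < 0x10000000:                 # fits in 28 bits
        return bytes([0xE0 | (v >> 24), (v >> 16) & 0xFF,
                      (v >> 8) & 0xFF, v & 0xFF])
    # full 32 bits: the last byte carries only the low 4 bits
    return bytes([0xF0 | (v >> 28), (v >> 20) & 0xFF,
                  (v >> 12) & 0xFF, (v >> 4) & 0xFF, v & 0x0F])


key = tag_encoding_key("OQ", "Z")
encoded = write_itf8(key)
```

For OQ:Z this gives the key 0x004F515A and the four bytes E0 4F 51 5A, matching the example in the text.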