
GTZ: a fast compression and cloud transmission tool optimized for FASTQ files

Yuting Xing , Gen , Zhenguo Wang , Bolun Feng , Zhuo Song and Chengkun Wu

From 16th International Conference on Bioinformatics (InCoB 2017)
Shenzhen, China. 20-22 September 2017

Abstract

Background: The dramatic development of DNA sequencing technology is generating truly big data, demanding ever more storage and bandwidth. To speed up data sharing and bring data to computing resources faster and cheaper, it is necessary to develop a compression tool that can support efficient compression and transmission of sequencing data onto cloud storage.

Results: This paper presents GTZ, a compression and transmission tool optimized for FASTQ files. As a reference-free lossless FASTQ compressor, GTZ treats the different lines of FASTQ separately, utilizes adaptive context modelling to estimate their characteristic probabilities, and compresses data blocks with arithmetic coding. GTZ can also be used to compress multiple files or directories at once. Furthermore, as a tool for the cloud computing era, it can either save compressed data locally or transmit it directly into the cloud, by choice. We evaluated the performance of GTZ on several diverse FASTQ benchmarks. Results show that in most cases it outperforms many other tools in terms of compression ratio, speed and stability.

Conclusions: GTZ is a tool that enables efficient lossless FASTQ data compression and simultaneous data transmission onto the cloud. It emerges as a useful tool for NGS data storage and transmission in the cloud environment. GTZ is freely available online at: https://github.com/Genetalks/gtz.
Keywords: FASTQ, Compression, General-purpose, Lossless, Parallel compression and transmission, Cloud computing

Background

Next generation sequencing (NGS) has greatly facilitated the development of genome analyses, which is vital for reaching the goal of precision medicine. Yet the exponential growth of accumulated sequencing data poses serious challenges to the transmission and storage of NGS data. Efficient compression methods provide the possibility to address this increasingly prominent problem.
Previously, general-purpose compression tools, such as gzip (http://www.gzip.org/), bzip2 (http://www.bzip.org/) and 7-zip (www.7-zip.org), have been utilized to compress NGS data. These tools do not take advantage of the
characteristics of genome data, such as a small alphabet and repeated sequence segments, which leaves space for performance optimization. Recently, specialized compression tools have been developed for NGS data. These tools are either reference-based or reference-free; the main difference lies in whether extra genome sequences are used as references. Reference-based algorithms encode the differences between the target and reference sequences, and consume more memory to improve compression performance. GenCompress [1] and SimGene [2] use various entropy encoders, such as arithmetic, Golomb and Huffman coders, to compress integer values that describe properties of reads, such as starting position and read length. A statistical compression method, GReEn [3], uses an adaptive model to estimate probabilities based on character frequencies; characters are then compressed with an arithmetic encoder driven by these probabilities.
QUIP [4] exploits arithmetic coding with order-3 and higher-order Markov chain models in all three parts of FASTQ data. LW-FQZip [5] utilizes incremental and run-length-limited encoding schemes to compress the metadata and quality scores, respectively; reads are pre-processed by a light-weight mapping model, and the three components are then combined and compressed by a general-purpose tool such as LZMA. Fqzcomp [6] estimates character probabilities by order-k context modelling and compresses NGS data in FASTQ format with arithmetic coders.
Nevertheless, reference-based algorithms can be inefficient if the similarity between target and reference sequences is low, so reference-free methods have also been proposed. Biocompress, proposed in [7], is a compression method dedicated to genomic sequences. Its main idea is based on the classical dictionary-based compression method, the Ziv-Lempel [8] algorithm: repeats and palindromes are encoded using the length and position of their earliest occurrences. As an extension of biocompress [7], biocompress-2 [9] exploits the same scheme and falls back to order-2 arithmetic coding when no significant repetition exists. The DSRC [10] algorithm splits sequences into blocks and compresses them independently with LZ77 [8] and Huffman [11] encoding. It is faster than QUIP in both compression and decompression speed, but inferior to the latter in compression ratio. DSRC2 [12], the multithreaded version of DSRC [10], splits the input into three streams for pre-processing; afterwards, metadata, reads and quality scores are compressed separately as in DSRC. A boosting algorithm, SCALCE [13], which re-organizes the reads, outperforms other algorithms on most datasets in both compression ratio and compression speed.
Nowadays, it is evident that cloud computing has become increasingly important for genomic analyses. However, the above-mentioned tools were developed for local usage: compression has to be completed locally before data transmission onto the cloud can begin.
AdOC, proposed in [14], is a general-purpose tool that allows compression and communication to overlap in a distributed computing environment. It presents a model for transport-level compression with dynamic adaptation of the compression level, usable in environments where resource availability and bandwidth vary unpredictably.
Generally, the compression performance of universal compression algorithms such as AdOC is unsatisfactory on NGS datasets.
In this paper, we present GTZ, a lossless and efficient compression tool to be used jointly with cloud computing for large-scale genomic data analyses:
  1. GTZ exploits context-model technology combined with multiple prediction modelling schemes, and employs parallel processing to improve compression speed.
  2. GTZ can compress directories or folders into a single archive, which is called a multi-stream file system. This all-in-one scheme serves the purposes of transmission, validation and storage.
  3. GTZ supports random access to files and archives. GTZ utilizes block storage, so users can extract parts of genome sequences from a FASTQ file, or individual files from a folder, without completely decompressing the archive.
  4. GTZ can transfer compressed blocks to cloud storage while compression is still in progress, which is a novel feature compared with other compression tools. By overlapping compression with transmission, it can greatly reduce the total time needed for compression and data transmission onto the cloud. For instance, it can compress and transmit a 200GB FASTQ file to cloud storage such as AWS and Alibaba Cloud within 14 min.
  5. GTZ provides a Python API, through which users can integrate GTZ into their own applications flexibly.
In the remainder of this paper, we introduce how GTZ works and evaluate its performance on several benchmark datasets using the AWS service.

Methods

GTZ supports efficient compression in parallel, parallel transmission and random fetching. Figure 1 demonstrates the workflow of GTZ processing.
GTZ involves procedures on the client and at the cloud end.
A client takes the following steps:
(1) Read in streams of large data files.
(2) Pre-process the input by dividing data streams into three sub-streams: metadata, base sequence, and quality score.
(3) Buffer the sub-streams in local memory and assemble them into fixed-size data blocks of different types.
(4) Compress assembled data blocks and their descriptions, and then transmit output blocks into the cloud storage.
On the cloud, the following steps are executed:
(1) Create three types of object-oriented containers (shown in Fig. 2), which define a tree structure.
(2) Loop and wait to receive output blocks sent by the client.
Fig. 1 The workflow of GTZ
(3) Save received output blocks into block containers according to their types.
(4) Stop if no more output blocks are received.
We will explain all the steps of processing FASTQ files in further detail below:

The client reading streams of large data files

Raw NGS data files are typically stored in FASTQ format. A typical FASTQ file contains four lines per sequence: Line 1 begins with a character '@' followed by a sequence identifier; Line 2 holds the raw sequence composed of A, C, G and T (with N for undetermined bases); Line 3 begins with a character '+' and is optionally followed by the same sequence identifier (and any description) again; Line 4 holds the corresponding quality scores, in ASCII characters, for the sequence characters in Line 2. An example of a read is given in Table 1.
Data pre-processing
During the second step, a data stream is split into metadata sub-streams, base sequence sub-streams and quality

Fig. 2 The hierarchy of data containers
Table 1 The format of a FASTQ file
Line 1: @ERR194147.1.HSQ1004:134:C0D8DACXX:1:1104:3874:86,238/1
Line 2: GGTTCCTACTTNAGGGTCATTAAATAGCCCACACGTC
Line 3: +
Line 4:
score sub-streams. (Since uninformative comment lines normally do not provide any useful information for compression, comment streams are omitted during pre-processing.) Three types of data pre-processing controllers buffer the respective sub-streams and save them in fixed-size data blocks. Afterwards, data blocks with annotations (numbers of blocks, sizes of blocks and types of streams) are sent to the corresponding compression units. Figure 3 demonstrates how data files are pre-processed with the help of pre-processing controllers and compression units.
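The split described above can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical names, not the actual GTZ implementation: each four-line FASTQ record is divided into metadata, base-sequence and quality-score sub-streams, and the uninformative '+' comment line is dropped.

```python
# Split a FASTQ stream into three sub-streams (illustrative sketch).
from itertools import islice

def split_fastq(lines):
    """Yield (metadata, bases, qualities) per record; the '+' line is discarded."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))  # one FASTQ record = 4 lines
        if len(record) < 4:
            break
        meta, bases, _plus, quals = (s.rstrip("\n") for s in record)
        yield meta, bases, quals

records = [
    "@ERR194147.1 example",
    "GGTTCCTACTTNAGGGTCATT",
    "+",
    "IIIIHHHGGGFFFEEEDDDCC",
]
meta, bases, quals = next(split_fastq(records))
```

In GTZ the three yielded streams would then be buffered and cut into fixed-size blocks before compression.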

Compressing data

GTZ is a general-purpose compression tool that uses statistical modelling (http://marknelson.us/1991/02/01/arithmetic-coding-statistical-modeling-data-compression/) and arithmetic coding.
Statistical modelling can be categorized into two types: static and adaptive. Conventional methods are normally static, which means probabilities are calculated after sequences are scanned from beginning to end. A static model keeps a fixed table of character-frequency counts. Although static models produce relatively accurate results, their drawbacks are obvious:
  1. It is time-consuming to read all the sequences into main memory before compression.
  2. If an input stream does not match the previously accumulated statistics well, the compression ratio will degrade, and the output stream may even become larger than the input stream.
In GTZ, we employ an adaptive statistical data compression technique based on context modelling. An adaptive model need not scan the whole sequence and generate probabilities before coding. Instead, adaptive prediction provides on-the-fly reading and compression; that is, probabilities are calculated from the characters already read into memory, and may change as more characters are scanned. Initially, the performance of adaptive statistical modelling may be poor due to the lack of data. However, as more sequences are processed, the prediction tends to become more accurate.
Every time the compressor encodes a character, it updates the counter in the prediction table. When a new character arrives (suppose the preceding sequence is ABCD), GTZ traverses the prediction table, finds every character that has followed ABCD before, and compares their appearance frequencies. For instance, if ABCDX has appeared 10 times and ABCDY only once, then GTZ will assign a higher probability to X.
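The counting scheme above can be sketched as a minimal adaptive order-k context model. This is an illustrative sketch in the spirit of the description, not GTZ's actual code: counts are updated after every encoded character, so predictions sharpen as more data is seen.

```python
# Minimal adaptive order-k context model (illustrative sketch).
from collections import defaultdict

class AdaptiveModel:
    def __init__(self, order=4):
        self.order = order
        # context string -> {character -> count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, context, char):
        """Record that `char` followed `context` (only the last `order` chars matter)."""
        self.counts[context[-self.order:]][char] += 1

    def predict(self, context):
        """Return P(char | context) from the frequencies seen so far."""
        table = self.counts[context[-self.order:]]
        total = sum(table.values())
        return {c: n / total for c, n in table.items()} if total else {}

m = AdaptiveModel(order=4)
for _ in range(10):
    m.update("ABCD", "X")   # ABCDX seen 10 times
m.update("ABCD", "Y")       # ABCDY seen once
probs = m.predict("ABCD")   # X is now far more likely than Y
```

After these updates, P(X | ABCD) = 10/11 and P(Y | ABCD) = 1/11, matching the worked example in the text.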
The workflow of an adaptive model is depicted in Fig. 4. The box 'Update model' means converting low-order models to high-order models (the meaning of low-order and high-order will be discussed in the next subsection).
Adaptive prediction modelling can effectively reduce compression time: there is no need to read all sequences at once, and it allows scanning and compression to overlap.
GTZ utilizes specific compression units for different kinds of data blocks: a low-order encoder for genetic sequences, a multi-order encoder for quality scores and mixed encoders for metadata. Finally, the outputs of this procedure are fixed-size blocks.
The main idea of arithmetic coding is to convert reads into a floating-point number ranging from zero to one (precisely, greater than or equal to zero and less than one) based on the predictive probabilities of characters. If the statistical model estimates every single character accurately for the compressor, we obtain high compression performance. On the contrary, a poor prediction may result in expansion of the original sequence instead of compression. Thus, the performance of a compressor largely relies on whether the statistical model can output near-optimal predictive probabilities.
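The interval narrowing at the heart of arithmetic coding can be illustrated directly. The sketch below uses a fixed (static) model for simplicity, whereas GTZ uses adaptive probabilities: each character shrinks the interval [low, high) in proportion to its model probability, so the whole message maps to a single number in [0, 1).

```python
# Interval narrowing in arithmetic coding (illustrative, fixed probabilities).
from fractions import Fraction

def encode_interval(message, probs):
    """Return the final [low, high) interval for `message` under fixed `probs`."""
    # Assign each symbol a cumulative sub-range of [0, 1), in sorted symbol order.
    cum, ranges = Fraction(0), {}
    for sym in sorted(probs):
        ranges[sym] = (cum, cum + Fraction(probs[sym]))
        cum += Fraction(probs[sym])
    low, high = Fraction(0), Fraction(1)
    for ch in message:
        lo, hi = ranges[ch]
        width = high - low
        low, high = low + width * lo, low + width * hi  # narrow to ch's sub-range
    return low, high

low, high = encode_interval("AAB", {"A": Fraction(3, 4), "B": Fraction(1, 4)})
# Any number in [low, high) identifies "AAB"; the interval width equals the
# message probability (3/4 * 3/4 * 1/4 = 9/64), so likelier messages need fewer bits.
```

A real coder emits the interval incrementally in bits and renormalizes, but the narrowing logic is the same.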

A low-order encoder for reads

The simplest implementation of adaptive modelling is order-0. Strictly speaking, it does not consider any context
Fig. 3 Pre-process data files with pre-processing controllers and compression units
Fig. 4 Work flow of a typical statistical modelling
information, so this short-sighted model can only see the current character and makes predictions independent of the preceding sequence. Similarly, an order-1 encoder makes predictions based on one preceding character. Consequently, low-order modelling contributes little to compression performance; its main advantage is that it is very memory-efficient. Hence, for quality score streams that do not have spatial locality, a low-order model is adequate for a moderate compression rate.
Our tailored low-order encoder for reads is demonstrated in Fig. 5. The first step is to transform sequences with the BWT (Burrows-Wheeler transform) algorithm, which rearranges reads into runs of similar characters. In the second step, the zero-order and first-order prediction models are used to calculate the appearance probability of each character. Since poor probability accuracy leads to undesirable encoding results, we add interpolation after quantizing the weighted average probability, to reduce prediction errors and improve compression ratios. In the last step, the bit arithmetic coding algorithm produces decimals ranging from zero to one as outputs to represent sequences.
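The BWT step can be demonstrated with a naive implementation. This quadratic sketch is for illustration only (production encoders build the transform from suffix arrays): sorting all rotations groups similar characters into runs, which the downstream order-0/order-1 models then exploit.

```python
# Naive Burrows-Wheeler transform (illustrative, O(n^2 log n)).
def bwt(s, sentinel="$"):
    s += sentinel  # unique terminator makes the transform invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)  # last column of sorted rotations

out = bwt("banana")
```

For "banana" the transform yields "annb$aa": the three a's and two n's end up adjacent, exactly the kind of run structure that a low-order adaptive model compresses well.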

A multi-order encoder for quality scores

The statistical model needs to supply non-uniform probability distributions to the arithmetic coder. High-order modelling assigns high probabilities to characters that appear frequently and low probabilities to those that appear infrequently. As a result, compared with low-order encoders, higher-order encoders can enhance adaptive modelling.
A high-order model considers several characters preceding the current position. It obtains better compression performance at the expense of more memory usage. Higher-order modelling used to be avoided due to limited memory capacity, which is no longer a problem.
Without any transformation, a multi-order encoder for quality scores (see Fig. 6) includes two procedures:
Firstly, to generate probabilities of characters, the input stream flows through an expanding character probability prediction model, which is composed of first-order, second-order, fourth-order and sixth-order prediction models and a matching model. As in the low-order encoder, the probabilities of characters undergo weighted averaging, quantization and interpolation to obtain the final results. Secondly, we use the bit arithmetic coding algorithm for compression.
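The weighted-averaging step can be sketched as follows. The weights and the per-order distributions here are illustrative assumptions (the paper does not publish GTZ's actual mixing coefficients); the idea is simply that predictions from several context orders are blended into one distribution for the arithmetic coder.

```python
# Weighted averaging of predictions from several context orders (sketch).
def mix_predictions(per_order_probs, weights):
    """Combine {order: {char: p}} dicts into one normalized distribution."""
    mixed = {}
    for order, probs in per_order_probs.items():
        w = weights[order]
        for ch, p in probs.items():
            mixed[ch] = mixed.get(ch, 0.0) + w * p
    total = sum(mixed.values())
    return {ch: p / total for ch, p in mixed.items()}

# Hypothetical example: the order-2 model is more confident than order-1,
# so it receives a larger weight.
probs = mix_predictions(
    {1: {"I": 0.5, "H": 0.5}, 2: {"I": 0.9, "H": 0.1}},
    weights={1: 0.25, 2: 0.75},
)
```

In a full encoder the weights themselves would adapt (e.g. favouring higher orders once their contexts have been seen often enough), and the mixed distribution would then be quantized and interpolated as described above.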

A hybrid scheme for metadata

For metadata sub-streams, GTZ first uses delimiters (punctuation) to split them into segments, then processes the metadata differently according to their fields:
For numbers in ascending or descending order, we employ incremental encoding to represent the difference between each metadata value and its preceding neighbour. For instance, '3458644' will be compressed into 3,1,1,3,-2,-2,0. For runs of identical characters, we exploit run-length-limited encoding to record their values and numbers of
Fig. 5 A low-order encoder scheme
Fig. 6 A multi-order encoder scheme
repetition. For random numbers with various precisions, we convert their formats with UTF-8 coding without adding a single separator, and then use a low-order encoder for compression. All other fields are compressed directly with the low-order encoder.
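The two simplest of these field codings, incremental (delta) encoding and run-length encoding, can be sketched directly; the delta example reproduces the '3458644' transformation from the text. This is an illustrative sketch, not GTZ's implementation.

```python
# Delta encoding and run-length encoding for metadata fields (sketch).
def delta_encode(digits):
    """'3458644' -> [3, 1, 1, 3, -2, -2, 0]: first value, then successive differences."""
    nums = [int(d) for d in digits]
    return [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]

def run_length_encode(s):
    """'AAAB' -> [('A', 3), ('B', 1)]: each run stored as (value, count)."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

deltas = delta_encode("3458644")
runs = run_length_encode("AAAB")
```

Small deltas and short run lists are far more compressible by the downstream low-order encoder than the raw field text.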
In conclusion, during this process, sub-streams are fed into a dynamic probability prediction model and an arithmetic encoder, and are transformed into fixed-size compressed blocks.

Data transmission

The key objective is to transmit output blocks to a given cloud storage platform, together with annotations about the types, sizes and numbers of the data blocks.
Note that different types of encoders may differ in compression speed, which can block the data pipe. Thus, in our system, a pipe-filter pattern is designed to synchronize input and output speeds: the input flow is blocked when the input stream is faster than the output stream, and the pipe is also blocked when there is no input flow.
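The back-pressure behaviour described above boils down to a bounded queue between pipeline stages. The sketch below (an assumption about the mechanism, not GTZ's code) shows the two blocking cases: a full queue blocks the faster producer, and an empty queue blocks the consumer until data arrives.

```python
# Pipe-filter back-pressure via a bounded queue (illustrative sketch).
import queue
import threading

pipe = queue.Queue(maxsize=4)  # bounded: put() blocks when 4 blocks are pending
results = []

def producer():
    """Compression stage: pushes block IDs; blocks when the consumer lags."""
    for block_id in range(10):
        pipe.put(block_id)
    pipe.put(None)  # end-of-stream marker

def consumer():
    """Upload stage: get() blocks while the pipe is empty."""
    while (block := pipe.get()) is not None:
        results.append(block)  # e.g. upload the compressed block to cloud storage

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```

Because the queue preserves order and both stages run concurrently, compression and upload overlap while neither stage can outrun the other by more than the queue capacity.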

Storage at the cloud end - Creating an object-oriented nested container system

GTZ creates containers as storage compartments that provide a way to manage instances and store file directories. They are organized in a tree structure. Containers can be nested to represent locations of instances: a root container represents a complete compressed file; a block container includes different types of sub-stream containers where specific instances are stored. The nesting structure is showed in Fig. 2.
A root container represents a FASTQ file and holds block containers, each of which includes metadata sub-containers, base sequence sub-containers and quality score sub-containers. A metadata sub-container nests repetitive data blocks, random data blocks, incremental data blocks, etc. Base sequence sub-containers and quality score sub-containers each nest a fixed number of instance blocks. Taking base sequences as an example, output blocks 0 to (N-1) are stored in the 0th block container, output blocks N to (2N-1) in the 1st block container, and so on.
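A toy version of this hierarchy can be written down with nested data classes. The field names and the container capacity below are hypothetical; the sketch only shows the tree shape of Fig. 2 and the block-to-container arithmetic described above.

```python
# Toy container hierarchy: root -> block containers -> typed sub-containers (sketch).
from dataclasses import dataclass, field

@dataclass
class SubContainer:
    stream_type: str                      # 'metadata' | 'bases' | 'quality'
    blocks: list = field(default_factory=list)

@dataclass
class BlockContainer:
    subs: dict = field(default_factory=lambda: {
        t: SubContainer(t) for t in ("metadata", "bases", "quality")})

@dataclass
class RootContainer:
    block_containers: list = field(default_factory=list)

    def store(self, index, stream_type, blob, blocks_per_container=1000):
        """Blocks 0..N-1 go to container 0, N..2N-1 to container 1, and so on."""
        cid = index // blocks_per_container
        while len(self.block_containers) <= cid:
            self.block_containers.append(BlockContainer())
        self.block_containers[cid].subs[stream_type].blocks.append(blob)

root = RootContainer()
root.store(0, "bases", b"block-0", blocks_per_container=2)  # -> container 0
root.store(2, "bases", b"block-2", blocks_per_container=2)  # -> container 1
```

Because a block's container index is a pure function of its sequence number, the cloud end can file incoming blocks without any coordination with the client.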
Table 2 Descriptions of 8 FASTQ datasets used for performance evaluation
Dataset            Species                Reference genome size   Encoding        No. of quality scores in data file
ERR233152          P. aeruginosa          556                     Sanger          32
SRR935126          A. thaliana            9755                    Sanger          39
SRR489793          C. elegans             12,807                  Illumina 1.8+   38
SRR801793          L. pneumophila         2756                    Sanger          38
SRR125858          H. sapiens             50,744                  Sanger          39
SRR5419422         RNA seq (H. sapiens)   15,095                  Illumina 1.8+   6
ERR1137269         metagenomes            56,543                  Illumina 1.8+   7
NA12878 (read 2)   H. sapiens             202,631                 Sanger          38
Table 3 Compression ratios of different tools on 8 FASTQ datasets
Compression ratio (%)

Dataset            GTZ     DSRC2   QUIP    LW-FQZip   Fqzcomp   LFQC    pigz
ERR233152          15.9    16.7    19      19         16.8      –       26.4
SRR935126          18.6    19.6    17.7    20.5       17.8      –       30.2
SRR489793          22.8    22.7    22.6    25.5       22.5      –       34.4
SRR801793          21.4    21.9    21.1    21.2       20.8      –       34.1
SRR125858          19.4    19.5    18.9    23.1       28.9      –       31
SRR5419422         12.8    13.9    10.9    12.5       12        ERROR   22
ERR1137269         12.2    13.4    12.8    14.3       11.9      ERROR   21.9
NA12878 (read 2)   19.8    24      20.4    TLE        19.9      TLE     24.7
avg                17.86   18.96   17.93   19.44      18.83     –       28.09
SD                 3.87    3.97    4.07    4.64       5.60      3.62    5.05
CV                 0.22    0.21    0.23    0.24       0.30      0.30    –
The best results of all the tools are boldfaced
This kind of hierarchy allows users to maintain a directory structure to manage compressed files, thereby facilitating random access to specific sequences. Here, we show how to decompress and extract target data from the compressed archive: in decompression mode, the system indexes the start line number (given by the user through the command line), fetches the corresponding sequence from its block containers, and decompresses the requested number of lines (also specified by the user).
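The indexing step above reduces to simple block arithmetic. The sketch below uses assumed parameters (lines per block, blocks per container are hypothetical; GTZ's actual block geometry is not given here) to map a user-supplied start line to the single block that holds it, so only that block needs decompressing.

```python
# Map a start line to (container, block, offset) for random access (sketch).
def locate_block(start_line, lines_per_block, blocks_per_container):
    """Return (container index, block index within container, line offset in block)."""
    block = start_line // lines_per_block
    return (block // blocks_per_container,   # which block container
            block % blocks_per_container,    # which block inside it
            start_line % lines_per_block)    # line offset within that block

# Hypothetical geometry: 4096 lines per block, 16 blocks per container.
loc = locate_block(start_line=123_456, lines_per_block=4096, blocks_per_container=16)
```

With fixed-size blocks, this lookup is O(1) and requires no index scan, which is what makes partial extraction cheap compared with decompressing the whole archive.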