
ACO: lossless quality score compression based on adaptive coding order

Xidian University

Ma Mingming

Xidian University

Li Fu

Xidian University

Liu Xianming

Peng Cheng Laboratory

Shi Guangming

Xidian University

Research Article

Keywords: High-throughput sequencing, quality score compression, lossless compression, adaptive coding order
Posted Date: April 22nd, 2021
DOI: https://doi.org/10.21203/rs.3.rs-418072/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.

ACO: lossless quality score compression based on adaptive coding order

Yi Niu, Mingming Ma, Xianming Liu and Guangming Shi

Abstract

Background: With the rapid development of high-throughput sequencing technology, the cost of whole-genome sequencing is dropping rapidly, leading to exponential growth of genome data. Although the compression of DNA bases has improved significantly in recent years, the compression of quality scores remains challenging.

Results: In this paper, by reinvestigating the inherent correlations between the quality scores and the sequencing process, we propose a novel lossless quality score compressor based on an adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores adaptively along the most correlated trajectory according to the sequencing process. In cooperation with adaptive arithmetic coding and context modeling, ACO achieves state-of-the-art quality score compression performance with moderate complexity.

Conclusions: This competence enables ACO to serve as a candidate tool for quality score compression. ACO has been adopted by AVS (the Audio Video coding Standard Workgroup of China) and is freely available at https://github.com/Yoniming/code.

Keywords: High-throughput sequencing; quality score compression; lossless compression; adaptive coding order

Background

Sequencing technology has gradually become a basic technology widely used in biological research [1]. Obtaining the genetic information of different organisms helps us improve our understanding of the organic world. In the past decades, the price of human whole-genome sequencing (WGS) has dropped sharply, declining faster than Moore's Law would predict [2]. As a result, the amount of next-generation sequencing (NGS) data grows exponentially, even exceeding that of astronomical data [3]. How to efficiently compress the DNA data generated by large-scale genome projects has become an important factor restricting the further development of the DNA sequencing industry.
There are two major problems in the compression of DNA data: nucleotide compression and quality score compression. The quality scores take up more than half of the compressed data and have been shown to be more difficult to compress than the nucleotide data. In particular, with the development of assembling techniques, nucleotide compression has improved significantly, which makes quality score compression one of the main bottlenecks in current DNA data storage and transfer applications.
The quality score (QS) represents the confidence level of every base character in the sequencing procedure, but it has a much larger alphabet (41-46 distinct levels). [7] reveals that there are strong correlations among adjacent quality scores, which can be regarded as the foundation of the current lossless quality score compression pipeline: 1) use a Markov model to estimate the conditional probability of the quality score; 2) traverse every position of the reads in raster scan order; 3) encode the quality score with arithmetic or range coding.
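The three-stage pipeline above can be sketched in a few lines. The snippet below estimates the ideal code length an adaptive arithmetic coder would spend under an order-k Markov model with add-one smoothing; the function name and the toy string are illustrative and not taken from any of the cited compressors.

```python
from collections import defaultdict
import math

def adaptive_code_length(symbols, order=1):
    """Ideal code length (bits) of an adaptive arithmetic coder driven by
    an order-`order` Markov model with add-one (Laplace) smoothing."""
    alphabet = sorted(set(symbols))
    # One count table per context, initialized to the Laplace prior.
    counts = defaultdict(lambda: {s: 1 for s in alphabet})
    bits = 0.0
    context = tuple(symbols[:order])  # seed the context with the first symbols
    for s in symbols[order:]:
        ctx = counts[context]
        total = sum(ctx.values())
        bits += -math.log2(ctx[s] / total)  # ideal arithmetic-code cost
        ctx[s] += 1                         # adaptive probability update
        context = context[1:] + (s,)
    return bits

qs = "IIIHHGGFFFEEDDCCBBAA"  # a toy quality-score string
print(adaptive_code_length(qs, order=1))
```

A real range coder adds a small constant overhead per stream, but the per-symbol cost tracks this estimate closely.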
Based on this pipeline, three distinguished lossless compressors have been proposed: GTZ [8], Quip [9] and FQZcomp [5]. The only differences among these three works are the Markov model orders and the context quantization strategies, so the compression ratio varies within a narrow range, depending on the data distribution. This unavoidably raises the negative view that there is not much room for further improvement of the lossless compression ratio.
In this paper, by reinvestigating the sequencing process, we reveal two main drawbacks of the existing raster-scan-based quality score compression strategy. First, the raster scan order is a "depth-first" traversal of the reads. However, as indicated in [7], the quality scores have a descending trend along a single read. This makes the piece-wise stationary assumption of Markov modeling untenable. Second, the sequencing process is conducted by multi-spectral imaging, but the FASTQ file simply stores the quality scores as a stack of 1D signals. The raster-scan-based techniques compress every read independently, which fails to exploit the potential 2D correlations between spatially adjacent reads (not the adjacent reads of the FASTQ file).
To overcome these two drawbacks, we propose a novel quality score compression technique based on an adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores along the most correlated directions, which can be regarded as a reorganization of the stack of independent 1D quality score vectors into highly correlated 2D matrices. Another improvement of the proposed ACO technique over the existing techniques is the compound context modeling strategy. As detailed in the Methods section, in addition to the adjacent QS values, the ACO context models consist of two further aspects: 1) the global average of every read; 2) the variation of the DNA bases. The compound context model not only benefits probability estimation and arithmetic coding; more importantly, in the implementation it saves ACO from multiple random accesses of the input FASTQ file: the compression can be accomplished in a single pass, at the cost of some context dilution and side information.
Experimental results show that the proposed ACO technique achieves state-of-the-art performance for lossless quality score compression, with clear gains in compression ratio over FQZcomp [5]. The only drawback of ACO is its memory cost: compared with FQZcomp, ACO requires additional memory to buffer the quality score matrices and to store the compound context models, respectively, which should no longer be a big problem for a current PC.

Declaration

In this section, we first analyze the data characteristics of the quality scores and illustrate, through a concrete example, that the coding order has a definite impact on quality score compression, which motivates us to compress the quality scores along the direction with the strongest data correlation. Second, by analyzing the sequencing principle and the generation process of the FASTQ file, we exploit the extra relevance in the quality score data to build a novel compound context quantization model.
Impact of coding order
The quality score represents the estimated probability that the corresponding nucleotide in the read is erroneous; it is an evaluation of the reliability of the base character. This information is used both for quality control of the original data and for downstream analysis. Fig.1 shows the quality score distributions of four reads of ERR2438054. It can be seen that, owing to noise, the quality score is a random and unstable signal, yet there is a strong correlation between adjacent quality scores. We can therefore exploit these characteristics and change the coding order to improve the compression ratio. Changing the order does not sound as if it should change the entropy, because according to information theory the information content of a source is a function of probability, represented by the source entropy. However, since an adaptive arithmetic coder is used, the encoder updates the symbol probabilities regularly, so changing the order can reduce the size of the bitstream. The principle of the arithmetic coder is not the main subject of this paper, so we only give a test experiment to show the influence of coding order on the compression result. First, we create two random signals and form their sum. Then we randomly permute the summed signal and record it, and we also sort it by value and record the result as Z3. Finally, the three signals are encoded with a 0-order adaptive arithmetic coder, and the sorted signal yields the smallest bitstream. This is because sorting places data with similar distribution and strong correlation together, so the reordered signal cooperates better with the probability update mechanism of the adaptive arithmetic coder.
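The toy experiment can be reproduced under mild assumptions. A pure add-one adaptive model is order-invariant, so the sketch below models the "regular probability update" with the periodic count rescaling that practical arithmetic coders perform; the signal construction, sample size and rescaling limit are our own illustrative choices, not values from the paper.

```python
import math
import random

def adaptive_bits(seq, limit=256):
    """Ideal code length (bits) of an order-0 adaptive arithmetic coder whose
    counts are halved whenever the total reaches `limit`, mimicking the
    periodic rescaling real coders perform (this is what makes order matter)."""
    counts = {s: 1 for s in set(seq)}
    bits = 0.0
    for s in seq:
        total = sum(counts.values())
        bits += -math.log2(counts[s] / total)
        counts[s] += 1
        if total + 1 >= limit:  # rescale: keep counts >= 1
            counts = {k: (v + 1) // 2 for k, v in counts.items()}
    return bits

random.seed(0)
# Two random signals whose sum is the test source (the paper's exact
# construction was lost in extraction; this mirrors its spirit).
x = [random.randint(0, 20) for _ in range(5000)]
y = [random.randint(0, 20) for _ in range(5000)]
z = [a + b for a, b in zip(x, y)]

z_shuffled = z[:]          # randomly permuted version
random.shuffle(z_shuffled)
z_sorted = sorted(z)       # Z3: sorted by value

print(adaptive_bits(z_shuffled) > adaptive_bits(z_sorted))  # True
```

The sorted signal compresses better because, within any rescaling window, one symbol dominates the counts, exactly the "similar symbols clustered together" effect described above.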
Fig. 1: Quality score distribution curves of four reads of ERR2438054

Mining more relevant information

Take the currently widely used HiSeq sequencing platform as an example: the sequencing process consists of three steps: 1) construction of the DNA library, 2) generation of DNA clusters by bridge PCR amplification, and 3) sequencing. In this paper we reinvestigate the sequencing step to mine more inherent correlations among the quality scores to aid the compression task. The basic principle of sequencing is multi-spectral imaging of the flowcell. The flowcell is the carrier of sequencing; each flowcell has eight lanes with chemically modified inner surfaces. Each lane contains 96 tiles and each tile has a unique cluster. Every cluster corresponds to one DNA patch, so one flowcell can generate 768 DNA reads simultaneously.
As shown in Fig.2, the sequencing process consists of five steps. In step 1, the polymerase and one type of dNTP are added into the flowcell to activate the fluorescence of the specific clusters. In step 2, the multi-spectral camera takes one shot of the flowcell at the wavelength corresponding to the added dNTP. In step 3, chemical reagents wash out the flowcell to prepare for the next imaging. These three steps are repeated four times with different dNTPs and different imaging wavelengths to obtain a four-channel multi-spectral image. In step 4, based on the captured four-channel image, the sequencing machine not only estimates the most likely base type of every cluster but also evaluates the confidence level of that estimate; these are stored as the bases and the quality scores, respectively. Steps 1-4 constitute one sequencing cycle, which sequences one position (depth) of all 768 reads in the flowcell. In step 5, the sequencing cycle is repeated; the number of cycles corresponds to the length of the reads.
As discussed in detail below, three aspects correlate with the quality score values: 1) the number of cycles, 2) base changes, and 3) the local position on the chip.
The number of cycles affects the distribution of the quality scores. DNA polymerases are used in the synthesis-and-sequencing process. At the beginning of sequencing the synthesis reaction is not very stable, but the quality of the enzyme is very good, so the quality scores fluctuate in the high-quality region. As sequencing progresses, the reaction stabilizes, but the enzyme activity and specificity gradually decrease, so the cumulative error is gradually amplified. As a result, the probability of error increases and the overall quality score shows a downward trend. As shown in Fig.3, as sequencing progresses, the mean of the quality scores decreases gradually while the variance increases. It is therefore improper to model every read as a stationary random signal along the traditional raster scan order.
Base changes also affect the distribution of the quality scores. As discussed before, the recognition of base types in a flowcell is conducted in a four-step loop according to the order of dNTPs and wavelengths. For example, assume the loop order is 'A-C-G-T'. If the bases of a read are '...AA...', then after the imaging of the first 'A' the flowcell is washed four times until the imaging of the second 'A'. But if the bases are '...TA...', the machine washes the flowcell only once before the imaging of 'A'. Hence, if the cluster retains some residual, the former 'A' base will affect the imaging of the latter 'A' base, which may cause ambiguity, so the quality score of the latter 'A' drops significantly. Although some machines adopt compound dNTPs to replace the four-step loop, the residual still affects the quality scores. Therefore, for quality score compression, the base change should be considered as side information to model the marginal probability of every quality score.
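The wash-count effect described above can be turned into a small side-information feature. The sketch below assumes a fixed 'A-C-G-T' cycle and is purely illustrative; real platforms and ACO's actual context construction may differ.

```python
# Fixed dNTP imaging cycle (an assumption for illustration).
ORDER = "ACGT"

def wash_distance(prev_base, cur_base):
    """Number of wash/imaging steps between imaging prev_base and imaging
    cur_base; 4 means a full cycle (e.g. 'A' followed by another 'A')."""
    d = (ORDER.index(cur_base) - ORDER.index(prev_base)) % 4
    return d if d != 0 else 4

def base_context(read):
    """Per-position side information for the QS context model: the wash
    distance to the previous base (0 for the first position)."""
    return [0] + [wash_distance(a, b) for a, b in zip(read, read[1:])]

print(base_context("AACGTA"))  # → [0, 4, 1, 1, 1, 1]
```

A small wash distance (e.g. the 'TA' case) means more chance of residual fluorescence, so it is a useful conditioning variable for the probability model.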
The local position on the chip affects the distribution of the quality scores. The flowcell can be regarded as a 2D array in which every cluster corresponds to an entry. The fluorescence of a high-amplitude entry may diffuse to the adjacent entries of the array, the well-known "cross-talk" phenomenon [11]. In other words, there are spatial correlations among adjacent quality scores. However, the stored FASTQ file is a 1D stack of all the reads, which ignores this correlation. Therefore, the compression of quality scores should mine the potential 2D spatial correlations among the reads.
Fig. 3: Distribution of the quality scores of a FASTQ file

Methods

In this section, we discuss the proposed adaptive coding order (ACO) based quality score compression technique. The two contributions of ACO are: 1) an adaptive scan order that replaces the traditional raster scan order and forms a more stationary signal; 2) a compound context model that accounts for the influence of base changes while exploring the potential correlations among quality scores.

Traversing the quality scores along the most correlated directions

As can be seen from Fig.3, as the read length increases, the column mean decreases but the variance grows, which indicates a strong correlation between columns. At the same time, the reduction in the column mean is consistent with the actual sequencing process, in which the quality scores show a descending trend along a single read. It has been verified that changing the scan order can improve the performance of an adaptive arithmetic coder, and coding along the more stationary direction of the signal yields a better coding effect.
All compression methods based on an arithmetic coder traverse the data with the scan order of Fig.4(a). Under this scan order, the quality scores are encoded line by line; after scanning one line, the encoder continues from the beginning of the next line. Obviously, after encoding the last character of one line, jumping to the first character of the next line causes a large discontinuity, which makes the transition between signals unstable. We therefore use an adaptive scan order instead of the traditional raster scan order to achieve a stable traversal of the signal, as shown in Fig.4(b). The traversal starts from the first element and moves down the column to its end, then moves upward from the end of the next column. Unlike the traditional scanning method, ACO scans in the shape of a snake. The reason for the snake traversal is to make the transition between columns smoother: the last symbols of one column are most relevant to the last symbols of the next column, so the correlation between the red and green symbols in Fig.4(b) is obviously stronger than that between the red and blue symbols. Therefore, after the red symbol is encoded, it is more appropriate to continue with the green symbol than with the blue symbol of the second column. By changing the scan order, the encoding proceeds along a more stationary direction, and the probability update mechanism of the adaptive arithmetic coder is fully utilized without introducing other factors.
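The snake traversal can be sketched as follows. The function treats the quality scores as a matrix whose rows are reads, visiting the columns alternately downward and upward so that consecutive symbols stay spatially adjacent.

```python
def snake_order(matrix):
    """Traverse a QS matrix column by column, reversing direction on each
    column so there is no jump back to the top of the next column."""
    rows, cols = len(matrix), len(matrix[0])
    out = []
    for c in range(cols):
        # Even columns go top-to-bottom, odd columns bottom-to-top.
        idx = range(rows) if c % 2 == 0 else range(rows - 1, -1, -1)
        out.extend(matrix[r][c] for r in idx)
    return out

m = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(snake_order(m))  # → [1, 4, 7, 8, 5, 2, 3, 6, 9]
```

Note that the traversal is trivially invertible given the matrix dimensions, so the decoder can restore the original read order without extra side information.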
Fig. 4: Comparison of traditional scanning and ACO scanning: (a) traditional traversal method; (b) adaptive scan order

Compound context modeling
As the Declaration section explains, the compression of quality scores should mine the potential 2D spatial correlations among the reads, so we use a compound context model to express the extra relevance in the quality score data. The ACO context model contains two additional aspects; the first is the global average of each read. As the example in the Declaration shows, reordering the data so that similar symbols cluster together yields good compression results. As shown in Fig.1, the distribution curves of the four reads are very similar; only a few singular points differ. It is thus an attractive strategy to cluster and code rows with similar distributions, but computing the distribution of each row takes many steps, and clustering similar rows also costs time and space. Instead, we compute the mean of each row to reflect its distribution and group the rows with the same mean value. For row information, the mean is a measure of stationarity: rows with the same mean can be regarded as having approximately the same distribution over the whole row, although singular points may keep the distribution curves from coinciding completely. Compared with computing the Kullback-Leibler divergence between rows, using the row mean saves a lot of computation and time without wasting the correlation between rows. The row clustering method needs to transmit extra information to the decoder to record the change of row order; facing the same problem, using the mean also requires transmitting the mean information of each row to the decoder. In practice, we compare the extra coding cost with the actual gain brought by the row mean: when the gain is greater than the cost, we add the row mean information to the context.
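A minimal sketch of the row-mean context follows, assuming quality scores in [0, 41] and a uniform binning into eight classes (both parameters are our assumptions, not values from the paper).

```python
def mean_context(reads_qs, n_bins=8, qs_min=0, qs_max=41):
    """Quantize each read's mean quality score into one of n_bins bins.
    The bin index serves as a per-read context; one index per read is the
    only side information the decoder needs."""
    step = (qs_max - qs_min) / n_bins
    ctx = []
    for qs in reads_qs:
        mean = sum(qs) / len(qs)
        b = min(int((mean - qs_min) / step), n_bins - 1)  # clamp top bin
        ctx.append(b)
    return ctx

reads = [[38, 39, 40, 37], [20, 22, 19, 21], [5, 6, 7, 4]]
print(mean_context(reads))  # → [7, 4, 1]
```

Uniform binning is the simplest choice; the dynamic-programming quantizer discussed next replaces it with bins chosen to minimize distortion.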
Specifically, in the process of building the context model, the problem of context dilution arises, so we need to design a suitable quantization method for the mean value to solve it. This is a dynamic programming problem, and the optimization objective of quantizing a discretely distributed random variable is to minimize the distortion. The objective can be written as

minimize D = Σ_i p(x_i) d(x_i, q(x_i)),

where x_i is a value of the variable that has nonzero probability, q(x_i) is the quantization value of x_i, and d(·,·) is a specific distortion measure. We can define a condition set to indicate that each specific value corresponds to a specific quantization value. Define a quantized set
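One plausible reading of the objective above is optimal scalar quantization of a discrete distribution, which admits a dynamic program over contiguous bins of the sorted values. The sketch below uses squared error as the distortion measure d and the probability-weighted mean as each bin's representative q; both choices are assumptions, since the original equation was lost in extraction.

```python
def optimal_quantizer(values, probs, k):
    """Partition the sorted `values` into k contiguous bins minimizing
    expected squared-error distortion; returns the minimal distortion."""
    n = len(values)

    def bin_cost(i, j):
        # Distortion of representing values[i:j] by their weighted mean.
        p = sum(probs[i:j])
        if p == 0:
            return 0.0
        rep = sum(probs[t] * values[t] for t in range(i, j)) / p
        return sum(probs[t] * (values[t] - rep) ** 2 for t in range(i, j))

    INF = float("inf")
    # dp[b][j] = minimal distortion covering values[:j] with b bins.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for b in range(1, k + 1):
        for j in range(1, n + 1):
            dp[b][j] = min(dp[b - 1][i] + bin_cost(i, j) for i in range(j))
    return dp[k][n]

vals = [0, 1, 2, 10, 11, 12]
p = [1 / 6] * 6
print(optimal_quantizer(vals, p, 2))  # optimal bins: {0,1,2} and {10,11,12}
```

The O(k·n²) dynamic program is cheap here because n is the number of distinct mean values, which is small after rounding.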