这是用户在 2024-12-11 17:43 为 https://app.immersivetranslate.com/pdf-pro/08445742-e249-4bfa-9ae4-72cf81d904ad 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

Research SRC Report 研究 SRC 报告

A Block-sorting Lossless Data Compression Algorithm
一种块排序无损数据压缩算法

M. Burrows and D.J. Wheeler
M. Burrows 和 D.J. Wheeler
Systems Research Center 系统研究中心
130 Lytton Avenue 130 利顿大道
Palo Alto, California 94301
帕洛阿尔托,加利福尼亚 94301

Systems Research Center 系统研究中心

The charter of SRC is to advance both the state of knowledge and the state of the art in computer systems. From our establishment in 1984, we have performed basic and applied research to support Digital’s business objectives. Our current work includes exploring distributed personal computing on multiple platforms, networking, programming technology, system modelling and management techniques, and selected applications.
, SRC 的章程是推进计算机系统的知识状态和技术状态。自 1984 年成立以来,我们进行了基础和应用研究,以支持 Digital 的商业目标。我们目前的工作包括探索多个平台上的分布式个人计算、网络、编程技术、系统建模和管理技术以及选定的应用。

Our strategy is to test the technical and practical value of our ideas by building hardware and software prototypes and using them as daily tools. Interesting systems are too complex to be evaluated solely in the abstract; extended use allows us to investigate their properties in depth. This experience is useful in the short term in refining our designs, and invaluable in the long term in advancing our knowledge. Most of the major advances in information systems have come through this strategy, including personal computing, distributed systems, and the Internet.
我们的策略是通过构建硬件和软件原型并使用它们作为日常工具来测试我们想法的技术和实践价值。有趣的系统过于复杂,不能仅仅在抽象层面进行评估;长期使用使我们能够深入了解它们的特性。这种经验在短期内有助于完善我们的设计,在长期内对提升我们的知识具有无价价值。信息系统的大部分重大进步都是通过这种策略实现的,包括个人计算、分布式系统和互联网。
We also perform complementary work of a more mathematical flavor. Some of it is in established fields of theoretical computer science, such as the analysis of algorithms, computational geometry, and logics of programming. Other work explores new ground motivated by problems that arise in our systems research.
我们还在数学味更浓的补充工作中发挥作用。其中一些工作在理论计算机科学已建立的领域,如算法分析、计算几何和程序逻辑。其他工作则探索由我们的系统研究中出现的问题所激发的新领域。

We have a strong commitment to communicating our results; exposing and testing our ideas in the research and development communities leads to improved understanding. Our research report series supplements publication in professional journals and conferences. We seek users for our prototype systems among those with whom we have common interests, and we encourage collaboration with university researchers.
我们对我们沟通结果有强烈的承诺;在研发社区中展示和测试我们的想法,有助于提高理解。我们的研究报告系列补充了在专业期刊和会议上的发表。我们在与我们共同感兴趣的人群中寻找原型系统的用户,并鼓励与大学研究人员合作。
Robert W. Taylor, Director
罗伯特·W·泰勒,总监

A Block-sorting Lossless Data Compression Algorithm
一种块排序无损数据压缩算法

M. Burrows and D.J. Wheeler
M. Burrows 和 D.J. Wheeler

May 10, 1994 1994 年 5 月 10 日
David Wheeler is a Professor of Computer Science at the University of Cambridge, U.K. His electronic mail address is: djw3@cl.cam.ac.uk
David Wheeler 是英国剑桥大学计算机科学教授,他的电子邮件地址是:djw3@cl.cam.ac.uk

©Digital Equipment Corporation 1994.
©数字设备公司 1994.
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of the Systems Research Center of Digital Equipment Corporation in Palo Alto, California; an acknowledgment of the authors and individual contributors to the work; and all applicable portions of the copyright notice. Copying, reproducing, or republishing for any other purpose shall require a license with payment of fee to the Systems Research Center. All rights reserved.
本作品不得全部或部分用于任何商业目的进行复制或再生产。为非营利性教育和研究目的,在全部或部分复制时无需支付费用,但须包括以下内容:注明此类复制已获得加州帕洛阿尔托数字设备公司系统研究中心的许可;承认本作品的作者和个别贡献者;以及所有适用的版权声明部分。为任何其他目的的复制、再生产或重新发布,均需向系统研究中心支付费用并取得许可。版权所有。

Authors' abstract 作者摘要

We describe a block-sorting, lossless data compression algorithm, and our implementation of that algorithm. We compare the performance of our implementation with widely available data compressors running on the same hardware.
我们描述了一种块排序、无损数据压缩算法,以及该算法的实现。我们比较了我们的实现与在相同硬件上运行的广泛可用的数据压缩器的性能。
The algorithm works by applying a reversible transformation to a block of input text. The transformation does not itself compress the data, but reorders it to make it easy to compress with simple algorithms such as move-to-front coding.
该算法通过将输入文本块应用可逆变换来工作。这种变换本身并不压缩数据,而是重新排列它,使其易于使用简单的算法(如前向移动编码)进行压缩。
Our algorithm achieves speed comparable to algorithms based on the techniques of Lempel and Ziv, but obtains compression close to the best statistical modelling techniques. The size of the input block must be large (a few kilobytes) to achieve good compression.
我们的算法在速度上与基于 Lempel 和 Ziv 技术的算法相当,但压缩效果接近最佳的统计建模技术。输入块的大小必须大(几 KB)才能实现良好的压缩。

Contents 内容

1 Introduction … 1 1 引言 … 1
2 The reversible transformation … 2
2 可逆转换…2

3 Why the transformed string compresses well … 5
3 为什么转换后的字符串压缩得很好……5

4 An efficient implementation … 8
4 一个高效的实现……8

4.1 Compression … 8
4.1 压缩 … 8

4.2 Decompression … 12
4.2 解压缩 … 12

5 Algorithm variants … 13
5 算法变体 … 13

6 Performance of implementation … 15
6 实现性能 … 15

7 Conclusions … 16
7 结论 … 16

1 Introduction 1 引言

The most widely used data compression algorithms are based on the sequential data compressors of Lempel and Ziv [1, 2]. Statistical modelling techniques may produce superior compression [3], but are significantly slower.
最广泛使用的数据压缩算法基于 Lempel 和 Ziv 的顺序数据压缩器[1, 2]。统计建模技术可能产生更优的压缩[3],但速度明显较慢。
In this paper, we present a technique that achieves compression within a percent or so of that achieved by statistical modelling techniques, but at speeds comparable to those of algorithms based on Lempel and Ziv’s.
在这篇论文中,我们介绍了一种技术,它实现的压缩效果与统计建模技术相当,但速度与基于 Lempel 和 Ziv 算法的速度相当。
Our algorithm does not process its input sequentially, but instead processes a block of text as a single unit. The idea is to apply a reversible transformation to a block of text to form a new block that contains the same characters, but is easier to compress by simple compression algorithms. The transformation tends to group characters together so that the probability of finding a character close to another instance of the same character is increased substantially. Text of this kind can easily be compressed with fast locally-adaptive algorithms, such as move-to-front coding [4] in combination with Huffman or arithmetic coding.
我们的算法不是按顺序处理输入,而是将一段文本作为一个整体进行处理。其思路是对一段文本应用可逆变换,形成一个包含相同字符的新块,但更容易通过简单的压缩算法进行压缩。这种变换倾向于将字符分组,从而显著增加找到与另一个相同字符实例相邻的字符的概率。这种类型的文本可以很容易地通过快速局部自适应算法进行压缩,例如与霍夫曼或算术编码结合的移动到前编码[4]。
Briefly, our algorithm transforms a string S S SS of N N NN characters by forming the N N NN rotations (cyclic shifts) of S S SS, sorting them lexicographically, and extracting the last character of each of the rotations. A string L L LL is formed from these characters, where the i i ii th character of L L LL is the last character of the i i ii th sorted rotation. In addition to L L LL, the algorithm computes the index I I II of the original string S S SS in the sorted list of rotations. Surprisingly, there is an efficient algorithm to compute the original string S S SS given only L L LL and I I II.
简要来说,我们的算法通过形成 S S SS N N NN 次旋转(循环移位),按字典顺序排序,并提取每个旋转的最后字符,将 N N NN 个字符的字符串 S S SS 转换为字符串 L L LL ,其中 L L LL 的第 i i ii 个字符是第 i i ii 个排序旋转的最后字符。除了 L L LL 之外,算法还计算原始字符串 S S SS 在旋转排序列表中的索引 I I II 。令人惊讶的是,有一个高效的算法可以仅根据 L L LL I I II 计算原始字符串 S S SS
The sorting operation brings together rotations with the same initial characters. Since the initial characters of the rotations are adjacent to the final characters, consecutive characters in L L LL are adjacent to similar strings in S S SS. If the context of a character is a good predictor for the character, L L LL will be easy to compress with a simple locally-adaptive compression algorithm.
排序操作将具有相同初始字符的旋转组合在一起。由于旋转的初始字符与最终字符相邻, L L LL 中的连续字符与 S S SS 中的相似字符串相邻。如果字符的上下文是字符的良好预测器,则使用简单的本地自适应压缩算法对 L L LL 进行压缩将变得容易。
In the following sections, we describe the transformation in more detail, and show that it can be inverted. We explain more carefully why this transformation tends to group characters to allow a simple compression algorithm to work more effectively. We then describe efficient techniques for implementing the transformation and its inverse, allowing this algorithm to be competitive in speed with Lempel-Ziv-based algorithms, but achieving better compression. Finally, we give the performance of our implementation of this algorithm, and compare it with well-known compression programmes.
在以下章节中,我们更详细地描述了这种转换,并展示了它是可以逆的。我们更仔细地解释了为什么这种转换倾向于将字符分组,以便简单的压缩算法能更有效地工作。然后,我们描述了实现这种转换及其逆变换的高效技术,使该算法在速度上与基于 Lempel-Ziv 的算法具有竞争力,但实现了更好的压缩。最后,我们给出了我们实现此算法的性能,并将其与著名的压缩程序进行了比较。
The algorithm described here was discovered by one of the authors (Wheeler) in 1983 while he was working at AT&T Bell Laboratories, though it has not previously been published.
该算法是由作者之一(Wheeler)在 1983 年当他还在 AT&T 贝尔实验室工作时发现的,尽管它之前尚未发表。

2 The reversible transformation
2 可逆转换

This section describes two sub-algorithms. Algorithm C performs the reversible transformation that we apply to a block of text before compressing it, and Algorithm D performs the inverse operation. A later section suggests a method for compressing the transformed block of text.
本节描述了两个子算法。算法 C 执行我们在压缩文本块之前应用的可逆变换,而算法 D 执行逆操作。后面的章节建议了一种压缩变换后的文本块的方法。
In the description below, we treat strings as vectors whose elements are characters.
在下面的描述中,我们将字符串视为元素为字符的向量。

Algorithm C: Compression transformation
算法 C:压缩变换

This algorithm takes as input a string S S SS of N N NN characters S [ 0 ] , , S [ N 1 ] S [ 0 ] , , S [ N 1 ] S[0],dots,S[N-1]S[0], \ldots, S[N-1] selected from an ordered alphabet X X XX of characters. To illustrate the technique, we also give a running example, using the string S = S = S=S= ‘abraca’, N = 6 N = 6 N=6N=6, and the alphabet X = { a , b , c , r } X = a , b , c , r X={^(')a^('),b^('),^(')c^('),r^(')}X=\left\{{ }^{\prime} a^{\prime}, b^{\prime},{ }^{\prime} c^{\prime}, r^{\prime}\right\}.
此算法以一个字符串 S S SS 作为输入,该字符串由 N N NN 个字符 S [ 0 ] , , S [ N 1 ] S [ 0 ] , , S [ N 1 ] S[0],dots,S[N-1]S[0], \ldots, S[N-1] 组成,这些字符选自一个有序字符集 X X XX 。为了说明技术,我们还提供了一个运行示例,使用字符串 S = S = S=S= ‘abraca’, N = 6 N = 6 N=6N=6 ,以及字符集 X = { a , b , c , r } X = a , b , c , r X={^(')a^('),b^('),^(')c^('),r^(')}X=\left\{{ }^{\prime} a^{\prime}, b^{\prime},{ }^{\prime} c^{\prime}, r^{\prime}\right\}

C1. [sort rotations] C1. [排序旋转]

Form a conceptual N × N N × N N xx NN \times N matrix M M MM whose elements are characters, and whose rows are the rotations (cyclic shifts) of S S SS, sorted in lexicographical order. At least one of the rows of M M MM contains the original string S S SS. Let I I II be the index of the first such row, numbering from zero.
构建一个概念矩阵,其元素为字符,行是 S S SS 的旋转(循环移位),按字典顺序排序。至少 M M MM 的一行包含原始字符串 S S SS 。令 I I II 为这样的第一行的索引,从零开始编号。

In our example, the index I = 1 I = 1 I=1I=1 and the matrix M M MM is
在我们的例子中,索引 I = 1 I = 1 I=1I=1 和矩阵 M M MM
row 
0 aabrac
1 abraca 拥抱
2 acaabr 无翻译内容
3 bracaa
4 caabra 暂无翻译
5 racaab 无翻译内容
row 0 aabrac 1 abraca 2 acaabr 3 bracaa 4 caabra 5 racaab| row | | | :---: | :---: | | 0 | aabrac | | 1 | abraca | | 2 | acaabr | | 3 | bracaa | | 4 | caabra | | 5 | racaab |
C2. [find last characters of rotations]
C2. [找到旋转的最后几个字符]

Let the string L L LL be the last column of M M MM, with characters L [ 0 ] , , L [ N 1 ] L [ 0 ] , , L [ N 1 ] L[0],dots,L[N-1]L[0], \ldots, L[N-1] (equal to M [ 0 , N 1 ] , , M [ N 1 , N 1 ] M [ 0 , N 1 ] , , M [ N 1 , N 1 ] M[0,N-1],dots,M[N-1,N-1]M[0, N-1], \ldots, \mathrm{M}[N-1, N-1] ). The output of the transformation is the pair ( L , I ) ( L , I ) (L,I)(L, I).
让字符串 L L LL 成为 M M MM 的最后一列,字符 L [ 0 ] , , L [ N 1 ] L [ 0 ] , , L [ N 1 ] L[0],dots,L[N-1]L[0], \ldots, L[N-1] (等于 M [ 0 , N 1 ] , , M [ N 1 , N 1 ] M [ 0 , N 1 ] , , M [ N 1 , N 1 ] M[0,N-1],dots,M[N-1,N-1]M[0, N-1], \ldots, \mathrm{M}[N-1, N-1] )。变换的输出是成对出现的 ( L , I ) ( L , I ) (L,I)(L, I)
In our example, L = L = L=L= ‘caraab’ and I = 1 I = 1 I=1I=1 (from step C1).
在我们的例子中, L = L = L=L= “caraab” 和 I = 1 I = 1 I=1I=1 (来自步骤 C1)。

Algorithm D: Decompression transformation
算法 D:解压缩变换

We use the example and notation introduced in Algorithm C. Algorithm D uses the output ( L , I L , I L,IL, I ) of Algorithm C to reconstruct its input, the string S S SS of length N N NN.
我们使用在算法 C 中引入的示例和符号。算法 D 使用算法 C 的输出( L , I L , I L,IL, I )来重建其输入,长度为 N N NN 的字符串 S S SS
D1. [find first characters of rotations]
D1. [找到旋转的第一个字符]

This step calculates the first column F F FF of the matrix M M MM of Algorithm C. This is done by sorting the characters of L L LL to form F F FF. We observe that any column of the matrix M M MM is a permutation of the original string S S SS. Thus, L L LL and F F FF are both permutations of S S SS, and therefore of one another. Furthermore, because the rows of M M MM are sorted, and F F FF is the first column of M M MM, the characters in F F FF are also sorted.
这一步计算算法 C 的矩阵 M M MM 的第一列 F F FF 。这是通过排序 L L LL 中的字符来形成 F F FF 。我们观察到矩阵 M M MM 的任何一列都是原始字符串 S S SS 的排列。因此, L L LL F F FF 都是 S S SS 的排列,因此也是彼此的排列。此外,因为 M M MM 的行已排序,且 F F FF M M MM 的第一列,所以 F F FF 中的字符也是排序的。

In our example, F = F = F=F= ‘aaabcr’.
在我们的例子中, F = F = F=F= ‘aaabcr’。

D2. [build list of predecessor characters]
D2. [构建前驱字符列表]

To assist our explanation, we describe this step in terms of the contents of the matrix M M MM. The reader should remember that the complete matrix is not available to the decompressor; only the strings F , L F , L F,LF, L, and the index I I II (from the input) are needed by this step.
为了帮助我们的解释,我们用矩阵 M M MM 的内容来描述这一步。读者应记住,完整的矩阵对解压器不可用;这一步只需要字符串 F , L F , L F,LF, L 和索引 I I II (来自输入)。

Consider the rows of the matrix M M MM that start with some given character ch . Algorithm C ensured that the rows of matrix M M MM are sorted lexicographically, so the rows that start with c h c h chc h are ordered lexicographically.
考虑以某个给定字符 ch 开头的矩阵 M M MM 的行。算法 C 确保矩阵 M M MM 的行按字典顺序排序,因此以 c h c h chc h 开头的行按字典顺序排列。

We define the matrix M M M^(')M^{\prime} formed by rotating each row of M M MM one character to the right, so for each i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1, and each j = 0 , , N 1 j = 0 , , N 1 j=0,dots,N-1j=0, \ldots, N-1,
我们定义由将 M M MM 的每一行向右旋转一个字符形成的矩阵 M M M^(')M^{\prime} ,因此对于每个 i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 ,以及每个 j = 0 , , N 1 j = 0 , , N 1 j=0,dots,N-1j=0, \ldots, N-1
M [ i , j ] = M [ i , ( j 1 ) mod N ] M [ i , j ] = M [ i , ( j 1 ) mod N ] M^(')[i,j]=M[i,(j-1)mod N]M^{\prime}[i, j]=M[i,(j-1) \bmod N]
In our example, M M MM and M M M^(')M^{\prime} are:
在我们的例子中, M M MM M M M^(')M^{\prime} 是:
row  M M MM M M M^(')M^{\prime}
0 aabrac caabra 暂无翻译
1 abraca 拥抱 aabrac
2 acaabr 无翻译内容 racaab 无翻译内容
3 bracaa abraca 拥抱
4 caabra 暂无翻译 acaabr 无翻译内容
5 racaab 无翻译内容 bracaa
row M M^(') 0 aabrac caabra 1 abraca aabrac 2 acaabr racaab 3 bracaa abraca 4 caabra acaabr 5 racaab bracaa| row | $M$ | $M^{\prime}$ | | :---: | :---: | :---: | | 0 | aabrac | caabra | | 1 | abraca | aabrac | | 2 | acaabr | racaab | | 3 | bracaa | abraca | | 4 | caabra | acaabr | | 5 | racaab | bracaa |
Like M M MM, each row of M M M^(')M^{\prime} is a rotation of S S SS, and for each row of M M MM there is a corresponding row in M M M^(')M^{\prime}. We constructed M M M^(')M^{\prime} from M M MM so that the rows
类似于 M M MM M M M^(')M^{\prime} 的每一行都是 S S SS 的旋转,并且对于 M M MM 的每一行,在 M M M^(')M^{\prime} 中都有对应的行。我们从 M M MM 构建了 M M M^(')M^{\prime} ,以便行

of M M M^(')M^{\prime} are sorted lexicographically starting with their second character. So, if we consider only those rows in M M M^(')M^{\prime} that start with a character c h c h chc h, they must appear in lexicographical order relative to one another; they have been sorted lexicographically starting with their second characters, and their first characters are all the same and so do not affect the sort order. Therefore, for any given character c h c h chc h, the rows in M M MM that begin with c h c h chc h appear in the same order as the rows in M M M^(')M^{\prime} that begin with c h c h chc h.
M M M^(')M^{\prime} 中的元素按字典顺序排序,从第二个字符开始。因此,如果我们只考虑 M M M^(')M^{\prime} 中以字符 c h c h chc h 开头的行,它们必须按字典顺序相互排列;它们已经按第二个字符的字典顺序排序,并且第一个字符都相同,因此不影响排序顺序。因此,对于任何给定的字符 c h c h chc h M M MM 中以 c h c h chc h 开头的行与 M M M^(')M^{\prime} 中以 c h c h chc h 开头的行以相同的顺序出现。
In our example, this is demonstrated by the rows that begin with ’ a a aa '. The rows ‘aabrac’, ‘abraca’, and ‘acaabr’ are rows 0 , 1 , 2 0 , 1 , 2 0,1,20,1,2 in M M MM and correspond to rows 1 , 3 , 4 1 , 3 , 4 1,3,41,3,4 in M M M^(')M^{\prime}.
在我们的例子中,这通过以’ a a aa ‘开头的行来展示。行‘aabrac’、‘abraca’和‘acaabr’是 M M MM 中的 0 , 1 , 2 0 , 1 , 2 0,1,20,1,2 行,对应于 M M M^(')M^{\prime} 中的 1 , 3 , 4 1 , 3 , 4 1,3,41,3,4 行。
Using F F FF and L L LL, the first columns of M M MM and M M M^(')M^{\prime} respectively, we calculate a vector T T TT that indicates the correspondence between the rows of the two matrices, in the sense that for each j = 0 , , N 1 j = 0 , , N 1 j=0,dots,N-1j=0, \ldots, N-1, row j j jj of M M M^(')M^{\prime} corresponds to row T [ j ] T [ j ] T[j]T[j] of M M MM.
使用 F F FF L L LL ,分别计算 M M MM M M M^(')M^{\prime} 的第一列,得到一个向量 T T TT ,该向量指示两个矩阵行之间的对应关系,即对于每个 j = 0 , , N 1 j = 0 , , N 1 j=0,dots,N-1j=0, \ldots, N-1 M M M^(')M^{\prime} 的第 j j jj 行对应于 M M MM 的第 T [ j ] T [ j ] T[j]T[j] 行。

If L [ j ] L [ j ] L[j]L[j] is the k k kk th instance of c h c h chc h in L L LL, then T [ j ] = i T [ j ] = i T[j]=iT[j]=i where F [ i ] F [ i ] F[i]F[i] is the k k kk th instance of c h c h chc h in F F FF. Note that T T TT represents a one-to-one correspondence between elements of F F FF and elements of L L LL, and F [ T [ j ] ] = L [ j ] F [ T [ j ] ] = L [ j ] F[T[j]]=L[j]F[T[j]]=L[j].
如果 L [ j ] L [ j ] L[j]L[j] L L LL c h c h chc h 的第 k k kk 个实例,那么 T [ j ] = i T [ j ] = i T[j]=iT[j]=i ,其中 F [ i ] F [ i ] F[i]F[i] F F FF c h c h chc h 的第 k k kk 个实例。注意, T T TT 代表 F F FF 的元素与 L L LL F [ T [ j ] ] = L [ j ] F [ T [ j ] ] = L [ j ] F[T[j]]=L[j]F[T[j]]=L[j] 的元素之间的一对一对应。
In our example, T T TT is: ( 4 0 5 1 2 3 ) 4 0 5 1 2 3 quad([4,0,5,1,2,3])\quad\left(\begin{array}{llllll}4 & 0 & 5 & 1 & 2 & 3\end{array}\right).
在我们的例子中, T T TT 是: ( 4 0 5 1 2 3 ) 4 0 5 1 2 3 quad([4,0,5,1,2,3])\quad\left(\begin{array}{llllll}4 & 0 & 5 & 1 & 2 & 3\end{array}\right)

D3. [form output S S SS ]
D3. [表输出 S S SS ]

Now, for each i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1, the characters L [ i ] L [ i ] L[i]L[i] and F [ i ] F [ i ] F[i]F[i] are the last and first characters of the row i i ii of M M MM. Since each row is a rotation of S S SS, the character L [ i ] L [ i ] L[i]L[i] cyclicly precedes the character F [ i ] F [ i ] F[i]F[i] in S S SS. From the construction of T T TT, we have F [ T [ j ] ] = L [ j ] F [ T [ j ] ] = L [ j ] F[T[j]]=L[j]F[T[j]]=L[j]. Substituting i = T [ j ] i = T [ j ] i=T[j]i=T[j], we have L [ T [ j ] ] L [ T [ j ] ] L[T[j]]L[T[j]] cyclicly precedes L [ j ] L [ j ] L[j]L[j] in S S SS.
现在,对于每个 i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 ,字符 L [ i ] L [ i ] L[i]L[i] F [ i ] F [ i ] F[i]F[i] 分别是 i i ii 行中 M M MM 的最后一和第一个字符。由于每一行都是 S S SS 的旋转,字符 L [ i ] L [ i ] L[i]L[i] S S SS 中循环先于字符 F [ i ] F [ i ] F[i]F[i] 。从 T T TT 的构造中,我们有 F [ T [ j ] ] = L [ j ] F [ T [ j ] ] = L [ j ] F[T[j]]=L[j]F[T[j]]=L[j] 。代入 i = T [ j ] i = T [ j ] i=T[j]i=T[j] ,我们得到 L [ T [ j ] ] L [ T [ j ] ] L[T[j]]L[T[j]] S S SS 中循环先于 L [ j ] L [ j ] L[j]L[j]

The index I I II is defined by Algorithm C such that row I I II of M M MM is S S SS. Thus, the last character of S S SS is L [ I ] L [ I ] L[I]L[I]. We use the vector T T TT to give the predecessors of each character:
索引 I I II 由算法 C 定义,使得 M M MM 的第 I I II 行是 S S SS 。因此, S S SS 的最后字符是 L [ I ] L [ I ] L[I]L[I] 。我们使用向量 T T TT 来给出每个字符的前驱:
for each i = 0 , , N 1 : S [ N 1 i ] = L [ T i [ I ] ]  for each  i = 0 , , N 1 : S [ N 1 i ] = L T i [ I ] " for each "i=0,dots,N-1:S[N-1-i]=L[T^(i)[I]]\text { for each } i=0, \ldots, N-1: S[N-1-i]=L\left[T^{i}[I]\right]
where T 0 [ x ] = x T 0 [ x ] = x T^(0)[x]=xT^{0}[x]=x, and T i + 1 [ x ] = T [ T i [ x ] ] T i + 1 [ x ] = T T i [ x ] T^(i+1)[x]=T[T^(i)[x]]T^{i+1}[x]=T\left[T^{i}[x]\right]. This yields S S SS, the original input to the compressor.
T 0 [ x ] = x T 0 [ x ] = x T^(0)[x]=xT^{0}[x]=x T i + 1 [ x ] = T [ T i [ x ] ] T i + 1 [ x ] = T T i [ x ] T^(i+1)[x]=T[T^(i)[x]]T^{i+1}[x]=T\left[T^{i}[x]\right] 。这产生了 S S SS ,这是压缩机的原始输入。

In our example, S = S = S=S= ‘abraca’.
在我们的例子中, S = S = S=S= ‘abraca’。

We could have defined T T TT so that the string S S SS would be generated from front to back, rather than the other way around. The description above matches the pseudo-code given in Section 4.2.
我们可以定义 T T TT ,使得字符串 S S SS 从前向后生成,而不是相反。上面的描述与第 4.2 节中给出的伪代码相匹配。
The sequence T i [ I ] T i [ I ] T^(i)[I]T^{i}[I] for i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 is not necessarily a permutation of the numbers 0 , , N 1 0 , , N 1 0,dots,N-10, \ldots, N-1. If the original string S S SS is of the form Z p Z p Z^(p)Z^{p} for some substring Z Z ZZ and some p > 1 p > 1 p > 1p>1, then the sequence T i [ I ] T i [ I ] T^(i)[I]T^{i}[I] for i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 will also be of the form Z p Z p Z^('p)Z^{\prime p} for some subsequence Z Z Z^(')Z^{\prime}. That is, the repetitions in S S SS will be generated by visiting the same elements of T T TT repeatedly. For example, if S = S = S=S= ‘cancan’, Z = Z = Z=Z= ‘can’ and p = 2 p = 2 p=2p=2, the sequence T i [ I ] T i [ I ] T^(i)[I]T^{i}[I] for i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 will be [ 2 , 4 , 0 , 2 , 4 , 0 ] [ 2 , 4 , 0 , 2 , 4 , 0 ] [2,4,0,2,4,0][2,4,0,2,4,0].
序列 T i [ I ] T i [ I ] T^(i)[I]T^{i}[I] 对于 i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 不一定是数字 0 , , N 1 0 , , N 1 0,dots,N-10, \ldots, N-1 的排列。如果原始字符串 S S SS 的形式为 Z p Z p Z^(p)Z^{p} 对于某些子串 Z Z ZZ 和某些 p > 1 p > 1 p > 1p>1 ,那么 T i [ I ] T i [ I ] T^(i)[I]T^{i}[I] 对于 i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 的序列也将是 Z p Z p Z^('p)Z^{\prime p} 对于某些子序列 Z Z Z^(')Z^{\prime} 的形式。也就是说, S S SS 中的重复将通过重复访问 T T TT 中的相同元素来生成。例如,如果 S = S = S=S= 是 'cancan', Z = Z = Z=Z= 是 'can' 和 p = 2 p = 2 p=2p=2 ,那么 T i [ I ] T i [ I ] T^(i)[I]T^{i}[I] 对于 i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 的序列将是 [ 2 , 4 , 0 , 2 , 4 , 0 ] [ 2 , 4 , 0 , 2 , 4 , 0 ] [2,4,0,2,4,0][2,4,0,2,4,0]

3 Why the transformed string compresses well
3 为什么转换后的字符串压缩效果好

Algorithm C sorts the rotations of an input string S S SS, and generates the string L L LL consisting of the last character of each rotation.
算法 C 对输入字符串 S S SS 的旋转进行排序,并生成由每个旋转的最后一个字符组成的字符串 L L LL
To see why this might lead to effective compression, consider the effect on a single letter in a common word in a block of English text. We will use the example of the letter ’ t t tt ’ in the word ’ t t tt he’, and assume an input string containing many instances of ‘the’.
为了了解这可能导致有效压缩的原因,考虑对英语文本块中常见单词的单个字母的影响。我们将以单词“ t t tt he”中的字母‘ t t tt ’为例,并假设输入字符串包含许多“the”的实例。
When the list of rotations of the input is sorted, all the rotations starting with 'he ’ will sort together, a large proportion of them are likely to end in ’ t t tt '. One region of the string L L LL will therefore contain a disproportionately large number of ’ t t tt ’ characters, intermingled with other characters that can proceed 'he ’ in English, such as space, ’ s s ss ', ’ T T TT ', and ’ S S SS '.
当输入旋转列表排序后,所有以'he’开头的旋转将一起排序,其中很大一部分可能以' t t tt '结尾。因此,字符串 L L LL 的一个区域将包含不成比例的大量' t t tt '字符,与其他可以放在'he’之前的英语字符混合,例如空格、' s s ss '、' T T TT '和' S S SS '。
The same argument can be applied to all characters in all words, so any localized region of the string L L LL is likely to contain a large number of a few distinct characters. The overall effect is that the probability that given character c h c h chc h will occur at a given point in L L LL is very high if c h c h chc h occurs near that point in L L LL, and is low otherwise. This property is exactly the one needed for effective compression by a move-tofront coder [4], which encodes an instance of character c h c h chc h by the count of distinct characters seen since the next previous occurrence of c h c h chc h. When applied to the string L L LL, the output of a move-to-front coder will be dominated by low numbers, which can be efficiently encoded with a Huffman or arithmetic coder.
相同的论点可以应用于所有单词中的所有字符,因此字符串 L L LL 的任何本地化区域很可能包含大量几个不同的字符。整体效果是,如果 c h c h chc h L L LL 的附近发生,给定字符 c h c h chc h L L LL 的特定点出现的概率非常高,否则则很低。这个特性正是移动到前端的编码器[4]所需要的一个特性,它通过自下一个之前出现以来看到的唯一字符的数量来编码字符 c h c h chc h 的实例。当应用于字符串 L L LL 时,移动到前端的编码器的输出将由低数字主导,这些数字可以用 Huffman 或算术编码器有效地编码。
Figure 1 shows an example of the algorithm at work. Each line is the first few characters and final character of a rotation of a version of this document. Note the grouping of similar characters in the column of final characters.
图 1 展示了算法运行的一个示例。每一行是文档版本旋转后的前几个字符和最后一个字符。注意最终字符列中相似字符的分组。
For completeness, we now give details of one possible way to encode the output of Algorithm C, and the corresponding inverse operation. A complete compression algorithm is created by combining these encoding and decoding operations with Algorithms C and D.
为了完整性,我们现在给出一种可能的编码算法 C 输出细节及其逆操作。通过结合这些编码和解码操作以及算法 C 和 D,创建了一个完整的压缩算法。
 final char (L) - 最终字符(L)
final
char
(L)
final char (L)| final | | :--- | | char | | (L) |
sorted rotations 已排序的旋转
a n to decompress. It achieves compression
n 解压。它实现了压缩
o n to perform only comparisons to a depth
n 仅用于执行深度比较
o n transformation} This section describes
n 变换} 本节描述
0 n transformation} We use the example and
n 变换} 我们使用示例和
o n treats the right-hand side as the most
n 将右侧视为最
a n tree for each 16 kbyte input block, enc
每个 16 千字节输入块 n 棵树
a n tree in the output stream, then encodes
n 棵树在输出流中,然后进行编码
i n turn, set $L[i]$ to be the
n 轮,将 $L[i]$ 设置为
i n turn, set $R[i]$ to the
n 将 $R[i]$ 设置为
o n unusual data. Like the algorithm of Man
不寻常的数据。例如,曼算法
a n use a single set of probabilities table
使用单一的概率表
e n using the positions of the suffixes in
在使用后缀的位置上
i n n nn value at a given point in the vector $ R $ R $R\$ \mathrm{R}
n n nn 向量 $ R $ R $R\$ \mathrm{R} 中给定点的值
e n we present modifications that improve t t tt
我们在其中提出了一些改进
e n when the block size is quite large. Ho
当区块大小相当大时。
i n which codes that have not been seen in
在哪些尚未见过的代码中
i n with $ch$ appear in the {\em same order
n 与 $ch$ 出现的顺序相同
i n with $ch$. In our exam
n 与$ch$。在我们的考试中
0 n with Huffman or arithmetic coding. Bri
n 使用 Huffman 或算术编码。Bri
o n with figures given by Bell \cite{bell}.
n 与 Bell 给出的数字相关。
"final char (L)" sorted rotations a n to decompress. It achieves compression o n to perform only comparisons to a depth o n transformation} This section describes 0 n transformation} We use the example and o n treats the right-hand side as the most a n tree for each 16 kbyte input block, enc a n tree in the output stream, then encodes i n turn, set $L[i]$ to be the i n turn, set $R[i]$ to the o n unusual data. Like the algorithm of Man a n use a single set of probabilities table e n using the positions of the suffixes in i n value at a given point in the vector $R e n we present modifications that improve t e n when the block size is quite large. Ho i n which codes that have not been seen in i n with $ch$ appear in the {\em same order i n with $ch$. In our exam 0 n with Huffman or arithmetic coding. Bri o n with figures given by Bell \cite{bell}.| final <br> char <br> (L) | sorted rotations | | :---: | :---: | | a | n to decompress. It achieves compression | | o | n to perform only comparisons to a depth | | o | n transformation} This section describes | | 0 | n transformation} We use the example and | | o | n treats the right-hand side as the most | | a | n tree for each 16 kbyte input block, enc | | a | n tree in the output stream, then encodes | | i | n turn, set $L[i]$ to be the | | i | n turn, set $R[i]$ to the | | o | n unusual data. Like the algorithm of Man | | a | n use a single set of probabilities table | | e | n using the positions of the suffixes in | | i | $n$ value at a given point in the vector $\$ \mathrm{R}$ | | e | n we present modifications that improve $t$ | | e | n when the block size is quite large. Ho | | i | n which codes that have not been seen in | | i | n with $ch$ appear in the {\em same order | | i | n with $ch$. In our exam | | 0 | n with Huffman or arithmetic coding. Bri | | o | n with figures given by Bell \cite{bell}. |
Figure 1: Example of sorted rotations. Twenty consecutive rotations from the sorted list of rotations of a version of this paper are shown, together with the final character of each rotation.
图 1:排序旋转示例。展示了从本文版本旋转排序列表中的二十个连续旋转,以及每个旋转的最后一个字符。

Algorithm M: Move-to-front coding
算法 M:移到前面编码

This algorithm encodes the output ( L , I L , I L,IL, I ) of Algorithm C, where L L LL is a string of length N N NN and I I II is an index. It encodes L L LL using a move-to-front algorithm and a Huffman or arithmetic coder.
此算法编码算法 C 的输出( L , I L , I L,IL, I ),其中 L L LL 是一个长度为 N N NN 的字符串, I I II 是一个索引。它使用前向移动算法和 Huffman 或算术编码器来编码 L L LL
The running example of Algorithm C is continued here.
算法 C 的运行示例在此继续。

M1. [move-to-front coding]
M1. [移动到前端编码]

This step encodes each of the characters in L L LL by applying the move-tofront technique to the individual characters. We define a vector of integers R [ 0 ] , , R [ N 1 ] R [ 0 ] , , R [ N 1 ] R[0],dots,R[N-1]R[0], \ldots, R[N-1], which are the codes for the characters L [ 0 ] , , L [ N L [ 0 ] , , L [ N L[0],dots,L[N-L[0], \ldots, L[N- 1 ] 1 ] 1]1].
这一步通过应用前移技术对 L L LL 中的每个字符进行编码。我们定义一个整数向量 R [ 0 ] , , R [ N 1 ] R [ 0 ] , , R [ N 1 ] R[0],dots,R[N-1]R[0], \ldots, R[N-1] ,其中包含字符 L [ 0 ] , , L [ N L [ 0 ] , , L [ N L[0],dots,L[N-L[0], \ldots, L[N- 1 ] 1 ] 1]1] 的代码。

Initialize a list Y Y YY of characters to contain each character in the alphabet X X XX exactly once.
初始化一个包含字母表中每个字母恰好一次的字符列表 Y Y YY

For each i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 in turn, set R [ i ] R [ i ] R[i]R[i] to the number of characters preceding character L [ i ] L [ i ] L[i]L[i] in the list Y Y YY, then move character L [ i ] L [ i ] L[i]L[i] to the front of Y Y YY.
对于每个 i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 依次,将 R [ i ] R [ i ] R[i]R[i] 设置为列表 Y Y YY 中字符 L [ i ] L [ i ] L[i]L[i] 之前字符的数量,然后将字符 L [ i ] L [ i ] L[i]L[i] 移动到 Y Y YY 的前面。
Taking Y = [ a , b , c Y = a , b , c Y=[^(')a^('),b^('),^(')c^('):}Y=\left[{ }^{\prime} a^{\prime}, b^{\prime},{ }^{\prime} \mathbf{c}^{\prime}\right., ‘r’] initially, and L = L = L=L= ‘caraab’, we compute the vector R : 2 1 3 1 0 3 ) R : 2 1 3 1 0 3 {:R:quad[2,1,3,1,0,3])\left.R: \quad \begin{array}{llllll}2 & 1 & 3 & 1 & 0 & 3\end{array}\right).
Y = [ a , b , c Y = a , b , c Y=[^(')a^('),b^('),^(')c^('):}Y=\left[{ }^{\prime} a^{\prime}, b^{\prime},{ }^{\prime} \mathbf{c}^{\prime}\right. ,‘r’]初始,和 L = L = L=L= ‘caraab’,我们计算向量 R : 2 1 3 1 0 3 ) R : 2 1 3 1 0 3 {:R:quad[2,1,3,1,0,3])\left.R: \quad \begin{array}{llllll}2 & 1 & 3 & 1 & 0 & 3\end{array}\right)
M2. [encode] M2. [编码]
Apply Huffman or arithmetic coding to the elements of R R RR, treating each element as a separate token to be coded. Any coding technique can be applied as long as the decompressor can perform the inverse operation. Call the output of this coding process O U T O U T OUTO U T. The output of Algorithm C is the pair (OUT, I I II ) where I I II is the value computed in step C 1 .
应用霍夫曼编码或算术编码于 R R RR 的元素,将每个元素视为一个单独的标记进行编码。只要解压器能够执行逆操作,任何编码技术都可以应用。将此编码过程的输出称为 O U T O U T OUTO U T 。算法 C 的输出是(OUT, I I II )对,其中 I I II 是 C 1 步骤中计算出的值。

Algorithm W: Move-to-front decoding
算法 W:移到前面解码

This algorithm is the inverse of Algorithm M. It computes the pair ( L , I ) ( L , I ) (L,I)(L, I) from the pair (OUT, I).
这个算法是算法 M 的逆算法。它从对(OUT,I)计算对 ( L , I ) ( L , I ) (L,I)(L, I)
We assume that the initial value of the list Y Y YY used in step M1 is available to the decompressor, and that the coding scheme used in step M2 has an inverse operation.
我们假设在步骤 M1 中使用的列表 Y Y YY 的初始值对解压器可用,并且步骤 M2 中使用的编码方案具有逆操作。

W1. [decode] W1. [解码]

Decode the coded stream O U T O U T OUTO U T using the inverse of the coding scheme used in step M2. The result is a vector R R RR of N N NN integers.
解码使用步骤 M2 中编码方案的逆运算对编码流 O U T O U T OUTO U T 进行解码。结果是包含 N N NN 个整数的向量 R R RR

In our example, R R RR is: ( 2 1 3 1 0 3 ) 2 1 3 1 0 3 quad([2,1,3,1,0,3])\quad\left(\begin{array}{llllll}2 & 1 & 3 & 1 & 0 & 3\end{array}\right).
在我们的例子中, R R RR 是: ( 2 1 3 1 0 3 ) 2 1 3 1 0 3 quad([2,1,3,1,0,3])\quad\left(\begin{array}{llllll}2 & 1 & 3 & 1 & 0 & 3\end{array}\right)
W2. [inverse move-to-front coding]
W2. [逆移动到前面编码]

The goal of this step is to calculate a string L L LL of N N NN characters, given the move-to-front codes R [ 0 ] , , R [ N 1 ] R [ 0 ] , , R [ N 1 ] R[0],dots,R[N-1]R[0], \ldots, R[N-1].
该步骤的目标是计算给定移动到前端的代码 R [ 0 ] , , R [ N 1 ] R [ 0 ] , , R [ N 1 ] R[0],dots,R[N-1]R[0], \ldots, R[N-1] N N NN 个字符的字符串 L L LL

Initialize a list Y Y YY of characters to contain the characters of the alphabet X X XX in the same order as in step M1.
初始化一个包含字母表 X X XX 字符的列表 Y Y YY ,顺序与步骤 M1 中相同。

For each i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 in turn, set L [ i ] L [ i ] L[i]L[i] to be the character at position R [ i ] R [ i ] R[i]R[i] in list Y Y YY (numbering from 0 ), then move that character to the front of Y Y YY. The resulting string L L LL is the last column of matrix M M MM of Algorithm C. The output of this algorithm is the pair ( L , I ) ( L , I ) (L,I)(L, I), which is the input to AlgorithmD.
对于每个 i = 0 , , N 1 i = 0 , , N 1 i=0,dots,N-1i=0, \ldots, N-1 依次,将 L [ i ] L [ i ] L[i]L[i] 设置为列表 Y Y YY 中位置 R [ i ] R [ i ] R[i]R[i] 的字符(从 0 开始编号),然后将该字符移动到 Y Y YY 的前面。得到的字符串 L L LL 是算法 C 矩阵 M M MM 的最后一列。该算法的输出是 ( L , I ) ( L , I ) (L,I)(L, I) ,它是算法 D 的输入。
Taking Y = [ a , b , , , r ] Y = a , b , , , r Y=[^(')a^('),b^('),^(')^('),^('),r^(')]Y=\left[' a^{\prime}, b^{\prime}, '{ }^{\prime}, ', r^{\prime}\right] initially (as in Algorithm M), we compute the string L = L = L=L= ‘caraab’.
采取初始值 Y = [ a , b , , , r ] Y = a , b , , , r Y=[^(')a^('),b^('),^(')^('),^('),r^(')]Y=\left[' a^{\prime}, b^{\prime}, '{ }^{\prime}, ', r^{\prime}\right] (如算法 M 中所述),我们计算字符串 L = L = L=L= ‘caraab’。

4 An efficient implementation
4 一个高效的实现

Efficient implementations of move-to-front, Huffman, and arithmetic coding are well known. Here we concentrate on the unusual steps in Algorithms C and D.
高效实现前向移动、Huffman 和算术编码是众所周知的。在这里,我们专注于算法 C 和 D 中的不寻常步骤。

4.1 Compression 4.1 压缩

The most important factor in compression speed is the time taken to sort the rotations of the input block. A simple application of quicksort requires little additional space (one word per character), and works fairly well on most input data. However, its worst-case performance is poor.
压缩速度最重要的因素是排序输入块旋转所需的时间。简单应用快速排序只需要很少的额外空间(每个字符一个单词),并且在大多数输入数据上表现相当好。然而,它的最坏情况性能较差。
A faster way to implement Algorithm C is to reduce the problem of sorting the rotations to the problem of sorting the suffixes of a related string.
一种实现算法 C 的更快方法是将旋转排序问题转化为相关字符串后缀排序问题。
To compress a string S S SS, first form the string S S S^(')S^{\prime}, which is the concatenation of S S SS with E O F E O F EOFE O F, a new character that does not appear in S S SS. Now apply Algorithm C to S S S^(')S^{\prime}. Because E O F E O F EOFE O F is different from all other characters in S S S^(')S^{\prime}, the suffixes of S S S^(')S^{\prime} sort in the same order as the rotations of S S S^(')S^{\prime}. Hence, to sort the rotations, it suffices to sort the suffixes of S S S^(')S^{\prime}. This can be done in linear time and space by building a suffix tree, which then can be walked in lexicographical order to recover the sorted suffixes. We used McCreight’s suffix tree construction algorithm [5] in an implementation of Algorithm C. Its performance is within 40 % 40 % 40%40 \% of the fastest technique we have found when operating on text files. Unfortunately, suffix tree algorithms need space for more than four machine words per input character.
为了压缩字符串 S S SS ,首先形成字符串 S S S^(')S^{\prime} ,它是 S S SS E O F E O F EOFE O F 的连接, E O F E O F EOFE O F 是一个不在 S S SS 中出现的新的字符。现在将算法 C 应用于 S S S^(')S^{\prime} 。因为 E O F E O F EOFE O F S S S^(')S^{\prime} 中的所有其他字符都不同,所以 S S S^(')S^{\prime} 的尾部按 S S S^(')S^{\prime} 的旋转顺序排序。因此,为了排序旋转,只需对 S S S^(')S^{\prime} 的尾部进行排序即可。这可以通过构建后缀树并在字典序中遍历它来恢复排序后的尾部,以线性时间和空间完成。我们在算法 C 的实现中使用了 McCreight 的后缀树构建算法[5]。其性能在我们找到的最快技术中位于 40 % 40 % 40%40 \% 以内。不幸的是,后缀树算法需要为每个输入字符预留超过四个机器字的空间。
Manber and Myers give a simple algorithm for sorting the suffixes of a string in O ( N log N ) O ( N log N ) O(N log N)O(N \log N) time [6]. The algorithm they describe requires only two words per
曼伯和迈尔斯提供了一个简单的算法,用于在 O ( N log N ) O ( N log N ) O(N log N)O(N \log N) 时间内对字符串后缀进行排序。他们描述的算法只需要每个单词两个词组[6]。

input character. The algorithm works by sorting suffixes on their first i i ii characters, then using the positions of the suffixes in the sorted array as the sort key for another sort on the first 2 i 2 i 2i2 i characters. Unfortunately, the performance of their algorithm is significantly worse than the suffix tree approach.
输入字符。该算法通过按后缀的第一个 i i ii 字符对后缀进行排序,然后使用排序数组中后缀的位置作为另一个按第一个 2 i 2 i 2i2 i 字符排序的排序键。不幸的是,他们算法的性能明显低于后缀树方法。
The fastest scheme we have tried so far uses a variant of quicksort to generate the sorted list of suffixes. Its performance is significantly better than the suffix tree when operating on text, and it uses significantly less space. Unlike the suffix tree however, its performance can degrade considerably when operating on some types of data. Like the algorithm of Manber and Myers, our algorithm uses only two words per input character.
我们迄今为止尝试的最快方案使用了一种快速排序的变体来生成后缀的有序列表。它在处理文本时的性能显著优于后缀树,并且占用的空间也显著更少。然而,与后缀树不同的是,当处理某些类型的数据时,其性能可能会显著下降。像 Manber 和 Myers 的算法一样,我们的算法每个输入字符只使用两个单词。

Algorithm Q: Fast quicksorting on suffixes
算法 Q:后缀快速排序

This algorithm sorts the suffixes of the string S S SS, which contains N N NN characters S [ 0 , , N 1 ] S [ 0 , , N 1 ] S[0,dots,N-1]S[0, \ldots, N-1]. The algorithm works by applying a modified version of quicksort to the suffixes of S S SS.
此算法对字符串 S S SS 的后缀进行排序,该字符串包含 N N NN 个字符 S [ 0 , , N 1 ] S [ 0 , , N 1 ] S[0,dots,N-1]S[0, \ldots, N-1] 。算法通过应用对 S S SS 后缀进行修改的快速排序版本来工作。
First we present a simple version of the algorithm. Then we present modifications that improve the speed and make bad performance less likely.
首先,我们展示算法的简单版本。然后,我们介绍一些改进,这些改进可以提高速度并降低不良性能的可能性。
Q1. [form extended string]
Q1. [扩展字符串]

Let k k kk be the number of characters that fit in a machine word.
k k kk 表示适合于机器字中的字符数。

Form the string S S S^(')S^{\prime} from S S SS by appending k k kk additional EOF characters to S S SS, where E O F E O F EOFE O F does not appear in S S SS.
S S SS 中通过追加 k k kk 个 EOF 字符形成字符串 S S S^(')S^{\prime} ,其中 E O F E O F EOFE O F 不在 S S SS 中。
Q2. [form array of words]
Q2. [形成单词数组]

Initialize an array W W WW of N N NN words W [ 0 , , N 1 ] W [ 0 , , N 1 ] W[0,dots,N-1]W[0, \ldots, N-1], such that W [ i ] W [ i ] W[i]W[i] contains the characters S [ i , , i + k 1 ] S [ i , , i + k 1 ] S^(')[i,dots,i+k-1]S^{\prime}[i, \ldots, i+k-1] arranged so that integer comparisons on the words agree with lexicographic comparisons on the k k kk-character strings.
初始化一个包含 N N NN 个单词的数组 W W WW ,使得 W [ i ] W [ i ] W[i]W[i] 包含的字符 S [ i , , i + k 1 ] S [ i , , i + k 1 ] S^(')[i,dots,i+k-1]S^{\prime}[i, \ldots, i+k-1] 按字典序排列,以便单词上的整数比较与 k k kk 字符字符串上的字典序比较一致。
Packing characters into words has two benefits: It allows two prefixes to be compared k k kk bytes at a time using aligned memory accesses, and it allows many slow cases to be eliminated (described in step Q7).
将字符打包成单词有两个好处:它允许每次使用对齐的内存访问比较两个前缀 k k kk 字节,并且允许消除许多慢速情况(在步骤 Q7 中描述)。
Q3. [form array of suffixes]
Q3. [后缀数组]

In this step we initialize an array V V VV of N N NN integers. If an element of V V VV contains j j jj, it represents the suffix of S S S^(')S^{\prime} whose first character is S [ j ] S [ j ] S^(')[j]S^{\prime}[j]. When the algorithm is complete, suffix V [ i ] V [ i ] V[i]V[i] will be the i i ii th suffix in lexicographical order.
在这一步,我们初始化一个包含 N N NN 个整数的数组 V V VV 。如果 V V VV 中的元素包含 j j jj ,则表示 S S S^(')S^{\prime} 的后缀,其第一个字符是 S [ j ] S [ j ] S^(')[j]S^{\prime}[j] 。当算法完成后,后缀 V [ i ] V [ i ] V[i]V[i] 将是字典序中的第 i i ii 个后缀。
Initialize an array V V VV of integers so that for each i = 0 , , N 1 : V [ i ] = i i = 0 , , N 1 : V [ i ] = i i=0,dots,N-1:V[i]=ii=0, \ldots, N-1: V[i]=i.
初始化一个整数数组 V V VV ,使得对于每个 i = 0 , , N 1 : V [ i ] = i i = 0 , , N 1 : V [ i ] = i i=0,dots,N-1:V[i]=ii=0, \ldots, N-1: V[i]=i
Q4. [radix sort] Q4. [基数排序]
Sort the elements of V V VV, using the first two characters of each suffix as the sort key. This can be done efficiently using radix sort.
V V VV 的元素进行排序,使用每个后缀的前两个字符作为排序键。这可以通过基数排序有效地完成。
Q5. [iterate over each character in the alphabet]
Q5. [遍历字母表中的每个字符]

For each character c h c h chc h in the alphabet, perform steps Q6, Q7.
对于字母表中的每个字符 c h c h chc h ,执行步骤 Q6、Q7。

Once this iteration is complete, V V VV represents the sorted suffixes of S S SS, and the algorithm terminates.
一旦这一迭代完成, V V VV 表示 S S SS 的排序后后缀,算法终止。
Q6. [quicksort suffixes starting with c h c h chc h ]
Q6. [以 c h c h chc h 开头的 quicksort 后缀]

For each character c h c h ch^(')c h^{\prime} in the alphabet: Apply quicksort to the elements of V V VV starting with c h c h chc h followed by c h c h ch^(')c h^{\prime}. In the implementation of quicksort, compare the elements of V V VV by comparing the suffixes they represent by indexing into the array W W WW. At each recursion step, keep track of the number of characters that have compared equal in the group, so that they need not be compared again.
对于字母表中的每个字符 c h c h ch^(')c h^{\prime} :从 c h c h chc h 开始,对 V V VV 的元素应用快速排序,接着是 c h c h ch^(')c h^{\prime} 。在快速排序的实现中,通过索引到数组 W W WW ,比较 V V VV 中元素的表示后缀。在每次递归步骤中,跟踪组中比较相等的字符数量,以便它们不需要再次比较。

All the suffixes starting with c h c h chc h have now been sorted into their final positions in V V VV.
所有以 c h c h chc h 开头的后缀现在都已排入 V V VV 的最终位置。

Q7. [update sort keys]
Q7. [更新排序键]

For each element V [ i ] V [ i ] V[i]V[i] corresponding to a suffix starting with c h c h chc h (that is, for which S [ V [ i ] ] = c h S [ V [ i ] ] = c h S[V[i]]=chS[V[i]]=c h ), set W [ V [ i ] ] W [ V [ i ] ] W[V[i]]W[V[i]] to a value with c h c h chc h in its high-order bits (as before) and with i i ii in its low-order bits (which step Q2 set to the k 1 k 1 k-1k-1 characters following the initial c h c h chc h ). The new value of W [ V [ i ] ] W [ V [ i ] ] W[V[i]]W[V[i]] sorts into the same position as the old value, but has the desirable property that it is distinct from all other values in W W WW. This speeds up subsequent sorting operations, since comparisons with the new elements cannot compare equal.
对于每个对应于以 c h c h chc h 开头的后缀的元素 V [ i ] V [ i ] V[i]V[i] (即,对于 S [ V [ i ] ] = c h S [ V [ i ] ] = c h S[V[i]]=chS[V[i]]=c h ),将 W [ V [ i ] ] W [ V [ i ] ] W[V[i]]W[V[i]] 设置为具有 c h c h chc h 在其高阶位(如前所述)和 i i ii 在其低阶位的值(Q2 步骤将其设置为初始 c h c h chc h 之后的 k 1 k 1 k-1k-1 字符)。新值 W [ V [ i ] ] W [ V [ i ] ] W[V[i]]W[V[i]] 与旧值排序到相同的位置,但具有一个理想属性,即它与 W W WW 中的所有其他值都不同。这加快了后续的排序操作,因为与新元素的比较不能相等。
This basic algorithm can be refined in a number of ways. An obvious improvement is to pick the character c h c h chc h in step Q5 starting with the least common character in S S SS, and proceeding to the most common. This allows the updates of step Q7 to have the greatest effect.
这个基本算法可以通过多种方式改进。一个明显的改进是在步骤 Q5 中从 S S SS 中最不常见的字符开始选择 c h c h chc h ,然后逐步到最常见的字符。这使得步骤 Q7 的更新效果最大化。
As described, the algorithm performs poorly when S S SS contains many repeated substrings. We deal with this problem with two mechanisms.
如描述所述,当 S S SS 包含许多重复的子字符串时,该算法表现不佳。我们通过两种机制来解决这个问题。
The first mechanism handles strings of a single repeated character by replacing step Q6 with the following steps.
第一个机制通过以下步骤替换步骤 Q6 来处理单个重复字符的字符串。
Q6a. [quicksort suffixes starting c h , c h c h , c h ch,ch^(')c h, c h^{\prime} where c h c h ] c h c h {:ch!=ch^(')]\left.c h \neq c h^{\prime}\right]
Q6a. [quicksort 后缀从 c h , c h c h , c h ch,ch^(')c h, c h^{\prime} 开始,其中 c h c h ] c h c h {:ch!=ch^(')]\left.c h \neq c h^{\prime}\right]

For each character c h c h c h c h ch^(')!=chc h^{\prime} \neq c h in the alphabet: Apply quicksort to the elements of V V VV starting with c h c h chc h followed by c h c h ch^(')c h^{\prime}. In the implementation of quicksort,
对于字母表中的每个字符 c h c h c h c h ch^(')!=chc h^{\prime} \neq c h :从 c h c h chc h 开始,应用快速排序到 V V VV 的元素,然后是 c h c h ch^(')c h^{\prime} 。在快速排序的实现中:

compare the elements of V V VV by comparing the suffixes they represent by indexing into the array W W WW. At each recursion step, keep track of the number of characters that have compared equal in the group, so that they need not be compared again.
比较 V V VV 的元素,通过索引到数组 W W WW 来比较它们所表示的后缀。在每次递归步骤中,跟踪组中比较相等的字符数,这样它们就不需要再次比较了。
Q6b. [sort low suffixes starting c h , c h ] c h , c h ] ch,ch]c h, c h]
Q6b. [从 c h , c h ] c h , c h ] ch,ch]c h, c h] 开始排序低后缀]

This step sorts the suffixes starting runs which would sort before an infinite string of c h c h chc h characters. We observe that among such suffixes, long initial runs of character c h c h chc h sort after shorter initial runs, and runs of equal length sort in the same order as the characters at the ends of the run. This ordering can be obtained with a simple loop over the suffixes, starting with the shortest.
这一步对以 c h c h chc h 字符无限序列开头的后缀进行排序。我们观察到,在这些后缀中,长初始运行字符 c h c h chc h 排在短初始运行字符之后,长度相等的运行以运行末尾字符的相同顺序排序。这种排序可以通过对后缀进行简单循环获得,从最短的开始。

Let V [ i ] V [ i ] V[i]V[i] represent the lexicographically least suffix starting with c h c h chc h. Let j j jj be the lowest value such that suffix V [ j ] V [ j ] V[j]V[j] starts with c h , c h c h , c h ch,chc h, c h.
V [ i ] V [ i ] V[i]V[i] 代表以 c h c h chc h 开头的字典序最小的后缀。让 j j jj 为最小的值,使得后缀 V [ j ] V [ j ] V[j]V[j] c h , c h c h , c h ch,chc h, c h 开头。
while not (i = j) do
    if (V[i] > 0) and (S[V[i]-1] = ch) then
        V[j] := V[i]-1;
        j := j + 1
    end;
    i := i + 1
end;
All the suffixes that start with c h , c h c h , c h ch,chc h, c h and that are lexicographically less than an infinite sequence of c h c h chc h characters are now sorted.
所有以 c h , c h c h , c h ch,chc h, c h 开头且在字典序上小于无限序列 c h c h chc h 的后缀现在已排序。
Q6c. [sort high suffixes starting c h , c h ] c h , c h ] ch,ch]c h, c h]
Q6c. [从 c h , c h ] c h , c h ] ch,ch]c h, c h] 开始排序高后缀]

This step sorts the suffixes starting runs which would sort after an infinite string of c h c h chc h characters. This step is very like step Q6b, but with the direction of the loop reversed.
这一步对以 c h c h chc h 字符无限序列之后的后缀进行排序。这一步与 Q6b 步骤非常相似,但循环的方向相反。

Let V [ i ] V [ i ] V[i]V[i] represent the lexicographically greatest suffix starting with c h c h chc h. Let j j jj be the highest value such that suffix V [ j ] V [ j ] V[j]V[j] starts with c h , c h c h , c h ch,chc h, c h.
V [ i ] V [ i ] V[i]V[i] 代表以 c h c h chc h 开头的字典序最大的后缀。让 j j jj 是满足后缀 V [ j ] V [ j ] V[j]V[j] c h , c h c h , c h ch,chc h, c h 开头的最高值。
while not (i = j) do
    if (V[i] > 0) and (S[V[i]-1] = ch) then
        V[j] := V[i]-1;
        j := j - 1
    end;
    i := i - 1
end;
All the suffixes that start with c h , c h c h , c h ch,chc h, c h and that are lexicographically greater than an infinite sequence of c h c h chc h characters are now sorted.
所有以 c h , c h c h , c h ch,chc h, c h 开头且在字典序上大于无限序列 c h c h chc h 的后缀现在已排序。
The second mechanism handles long strings of a repeated substring of more than one character. For such strings, we use a doubling technique similar to that used in Manber and Myers’ scheme. We limit our quicksort implementation to perform comparisons only to a depth D D DD. We record where suffixes have been left unsorted with a bit in the corresponding elements of V V VV.
第二个机制处理由一个或多个字符组成的重复子串的长字符串。对于此类字符串,我们使用类似于 Manber 和 Myers 方案中使用的加倍技术。我们将我们的快速排序实现限制为仅对深度 D D DD 进行比较。我们使用 V V VV 中相应元素的位来记录未排序的后缀。
Once steps Q1 to Q7 are complete, the elements of the array W W WW indicate sorted positions up to D × k D × k D xx kD \times k characters, since each comparison compares k k kk characters. If unsorted portions of V V VV remain, we simply sort them again, limiting the comparison depth to D D DD as before. This time, each comparison compares D × k D × k D xx kD \times k characters, to yield a list of suffixes sorted on their first D 2 × k D 2 × k D^(2)xx kD^{2} \times k characters. We repeat this process until the entire string is sorted.
array W W WW 的元素表示排序到的 D × k D × k D xx kD \times k 字符的位置,因为每次比较都涉及 k k kk 个字符。如果还有未排序的 V V VV 部分,我们只需再次排序,将比较深度限制为之前的 D D DD 。这次,每次比较涉及 D × k D × k D xx kD \times k 个字符,以生成按第一个 D 2 × k D 2 × k D^(2)xx kD^{2} \times k 个字符排序的后缀列表。我们重复此过程,直到整个字符串排序完成。
The combination of radix sort, a careful implementation of quicksort, and the mechanisms for dealing with repeated strings produce a sorting algorithm that is extremely fast on most inputs, and is quite unlikely to perform very badly on real input data.
基数排序、精心实现的快速排序以及处理重复字符串的机制相结合,产生了一种在大多数输入上非常快速的排序算法,并且在实际输入数据上表现非常糟糕的可能性相当低。
We are currently investigating further variations of quicksort. The following approach seems promising. By applying the technique of Q6a-Q6c to all overlapping repeated strings, and by caching previously computed sort orders in a hash table, we can produce an algorithm that sorts the suffixes of a string in approximately the same time as the modified algorithm Q , but using only 6 bytes of space per input character ( 4 bytes for a pointer, 1 byte for the input character itself, and 1 byte amortized space in the hash table). This approach has poor performance in the worst case, but bad cases seem to be rare in practice.
我们目前正在进一步研究快速排序的变体。以下方法看起来很有希望。通过将 Q6a-Q6c 技术应用于所有重叠的重复字符串,并通过在哈希表中缓存先前计算的排序顺序,我们可以产生一个算法,该算法在约相同的时间内对字符串的后缀进行排序,但每个输入字符仅使用 6 个字节的存储空间(4 个字节用于指针,1 个字节用于输入字符本身,以及 1 个字节的哈希表摊销空间)。这种方法在最坏情况下性能较差,但在实践中坏情况似乎很少。

4.2 Decompression 4.2 解压缩

Steps D1 and D2 can be accomplished efficiently with only two passes over the data, and one pass over the alphabet, as shown in the pseudo-code below. In the first pass, we construct two arrays: C [ C [ C[C[ alphabet ] ] ]] and P [ 0 , , N 1 ] . C [ c h ] P [ 0 , , N 1 ] . C [ c h ] P[0,dots,N-1].C[ch]P[0, \ldots, N-1] . C[c h] is the total number of instances in L L LL of characters preceding character c h c h chc h in the alphabet. P [ i ] P [ i ] P[i]P[i] is the number of instances of character L [ i ] L [ i ] L[i]L[i] in the prefix L [ 0 , , i 1 ] L [ 0 , , i 1 ] L[0,dots,i-1]L[0, \ldots, i-1] of L L LL. In practice, this first pass could be combined with the move-to-front decoding step. Given the arrays L , C , P L , C , P L,C,PL, C, P, the array T T TT defined in step D 2 is given by:
步骤 D1 和 D2 可以通过仅对数据进行两次遍历和一次遍历字母表来高效完成,如下面的伪代码所示。在第一次遍历中,我们构建两个数组: C [ C [ C[C[ 字母表 ] ] ]] P [ 0 , , N 1 ] . C [ c h ] P [ 0 , , N 1 ] . C [ c h ] P[0,dots,N-1].C[ch]P[0, \ldots, N-1] . C[c h] 是字符 c h c h chc h 在字母表中的前缀 L L LL 中字符的实例总数。 P [ i ] P [ i ] P[i]P[i] 是字符 L [ i ] L [ i ] L[i]L[i] L L LL 的前缀 L [ 0 , , i 1 ] L [ 0 , , i 1 ] L[0,dots,i-1]L[0, \ldots, i-1] 中的实例数。在实践中,这个第一次遍历可以与移动到前面的解码步骤结合。给定数组 L , C , P L , C , P L,C,PL, C, P ,步骤 D2 中定义的数组 T T TT 如下:
for each i = 0 , , N 1 : T [ i ] = P [ i ] + C [ L [ i ] ]  for each  i = 0 , , N 1 : T [ i ] = P [ i ] + C [ L [ i ] ] " for each "i=0,dots,N-1:T[i]=P[i]+C[L[i]]\text { for each } i=0, \ldots, N-1: T[i]=P[i]+C[L[i]]
In a second pass over the data, we use the starting position I to complete Algorithm D to generate S S SS.
在第二次数据遍历中,我们使用起始位置 I 来完成算法 D 以生成 S S SS
Initially, all elements of C C CC are zero, and the last column of matrix M M MM is the vector L [ 0 , , N 1 ] L [ 0 , , N 1 ] L[0,dots,N-1]L[0, \ldots, N-1].
最初, C C CC 的所有元素都是零,矩阵 M M MM 的最后一列是向量 L [ 0 , , N 1 ] L [ 0 , , N 1 ] L[0,dots,N-1]L[0, \ldots, N-1]
for i := 0 to N-1 do
    P[i] := C[L[i]];
    C[L[i]] := C[L[i]] + 1
end;
Now C[ch] is the number of instances in L of character ch. The value P[i] is the
number of instances of character L[i] in the prefix L[0,\ldots,i-1] of L.
sum := 0;
for ch := FIRST(alphabet) to LAST(alphabet) do
    sum := sum + C[ch];
    C[ch] := sum - C[ch];
end;
Now C[ch] is the total number of instances in L of characters preceding ch in the
alphabet.
i := I;
for j := N-1 downto 0 do
    S[j] := L[i];
    i := P[i] + C[L[i]]
end
The decompressed result is now S [ 0 , , N 1 ] S [ 0 , , N 1 ] S[0,dots,N-1]S[0, \ldots, N-1].
解压后的结果是现在 S [ 0 , , N 1 ] S [ 0 , , N 1 ] S[0,dots,N-1]S[0, \ldots, N-1]

5 Algorithm variants 5 种算法变体

In the example given in step C 1 , we treated the left-hand side of each rotation as the most significant when sorting. In fact, our implementation treats the right-hand side as the most significant, so that the decompressor will generate its output S S SS from left to right using the code of Section 4.2. This choice has almost no effect on the compression achieved.
在步骤 C 1 给出的示例中,我们在排序时将每个旋转的左侧视为最重要的。实际上,我们的实现将右侧视为最重要的,这样解压器将使用第 4.2 节中的代码从左到右生成其输出 S S SS 。这种选择对压缩效果几乎没有影响。
The algorithm can be modified to use a different coding scheme instead of move-to-front in step M1. Compression can improve slightly if move-to-front is replaced by a scheme in which codes that have not been seen in a very long time are moved only part-way to the front. We have not exploited this in our implementations.
算法可以被修改,在步骤 M1 中使用不同的编码方案,而不是移动到前面。如果用一种方案代替移动到前面,其中很久没有见过的代码只部分移动到前面,压缩可以略有改善。我们尚未在我们的实现中利用这一点。
Although simple Huffman and arithmetic coders do well in step M2, a more complex coding scheme can do slightly better. This is because the probability of a certain value at a given point in the vector R R RR depends to a certain extent on the immediately preceding value. In practice, the most important effect is that zeroes tend to occur in runs in R R RR. We can take advantage of this effect by representing each run of zeroes by a code indicating the length of the run. A second set of Huffman codes can be used for values immediately following a run of zeroes, since the next value cannot be another run of zeroes.
尽管简单的 Huffman 和算术编码器在步骤 M2 中表现良好,但更复杂的编码方案可以做得更好。这是因为向量 R R RR 中某一点上某个值的概率在一定程度上取决于紧接其前的值。在实践中,最重要的效果是零通常在 R R RR 中成串出现。我们可以通过用一个表示串长度的代码来表示每个零的串来利用这种效果。由于下一个值不可能是另一个零的串,因此可以使用另一组 Huffman 代码来表示紧接零串之后的值。
 大小(字节)
Size
(bytes)
Size (bytes)| Size | | :---: | | (bytes) |
 CPU 时间/秒压缩
CPU time/s
compress
CPU time/s compress| CPU time/s | | :---: | | compress |
 压缩文件
Compressed
file
Compressed file| Compressed | | ---: | | file |
 dits/大小(字节)
dits/
size (bytes
dits/ size (bytes| dits/ | | :---: | | size (bytes |
char 字符
"Size (bytes)" "CPU time/s compress" "Compressed file" "dits/ size (bytes" char | | Size <br> (bytes) | CPU time/s <br> compress | | Compressed <br> file | dits/ <br> size (bytes | | :--- | ---: | :---: | :---: | ---: | :---: | | char | | | | | |
| | ||
Table 1: Results of compressing fourteen files of the Calgary Compression Corpus.
表 1:卡尔加里压缩语料库 14 个文件压缩结果
 块大小(字节)
Block size
(bytes)
Block size (bytes)| Block size | | :---: | | (bytes) |
 bits/字符 book1
bits/character
book1
bits/character book1| bits/character | | :---: | | book1 |
Hector corpus 赫克托尔语料库
1 k 1 千 4.34 4.35
4 k 4 千 3.86 3.83
16 k 16 千 3.43 3.39
64 k 3.00 2.98
256 k 2.68 2.65
750 k 750 千卡 2.49 -
1 M - 2.43
4 M 4 M (由于“4 M”可能指的是多种情况,如文件大小、网络带宽等,但缺乏上下文,无法确定其具体含义,因此直接保留原样。) - 2.26
16 M 16 M (由于“16 M”可能指的是内存大小或其他技术术语,这里直接保留原样。) - 2.13
64 M - 2.04
103 M 103 M (由于“103 M”可能是一个特定的代码或缩写,没有上下文无法确定其含义,因此直接保留原样。) - 2.01
"Block size (bytes)" "bits/character book1" Hector corpus 1 k 4.34 4.35 4 k 3.86 3.83 16 k 3.43 3.39 64 k 3.00 2.98 256 k 2.68 2.65 750 k 2.49 - 1 M - 2.43 4 M - 2.26 16 M - 2.13 64 M - 2.04 103 M - 2.01| Block size <br> (bytes) | bits/character <br> book1 | | | :---: | :---: | :---: | | Hector corpus | | | | 1 k | 4.34 | 4.35 | | 4 k | 3.86 | 3.83 | | 16 k | 3.43 | 3.39 | | 64 k | 3.00 | 2.98 | | 256 k | 2.68 | 2.65 | | 750 k | 2.49 | - | | 1 M | - | 2.43 | | 4 M | - | 2.26 | | 16 M | - | 2.13 | | 64 M | - | 2.04 | | 103 M | - | 2.01 |
Table 2: The effect of varying block size ( N N NN ) on compression of book 1 from the Calgary Compression Corpus and of the entire Hector corpus.
表 2:从卡尔加里压缩语料库和整个 Hector 语料库中,改变块大小( N N NN )对书籍 1 压缩效果的影响。

6 Performance of implementation
6 实现性能

In Table 1 we give the results of compressing the fourteen commonly used files of the Calgary Compression Corpus [7] with our algorithm. Comparison of these figures with those given by Bell, Witten and Cleary [3] indicate that our algorithm does well on non-text inputs as well as text inputs.
在表 1 中,我们给出了使用我们的算法压缩 Calgary 压缩语料库[7]中十四个常用文件的结果。将这些数字与 Bell、Witten 和 Cleary[3]给出的数字进行比较表明,我们的算法在非文本输入以及文本输入方面都表现良好。
Our implementation of Algorithm C uses the techniques described in Section 4.1, and a simple move-to-front coder. In each case, the block size N N NN is the length of the file being compressed.
我们的算法 C 实现使用了第 4.1 节中描述的技术,以及一个简单的移动到前端的编码器。在每种情况下,块大小 N N NN 是压缩文件的长度。
We compress the output of the move-to-front coder with a modified Huffman coder that uses the technique described in Section 5. Our coder calculates new Huffman trees for each 16 kbyte input block, rather than computing one tree for its whole input.
我们使用在第 5 节中描述的技术修改后的 Huffman 编码器压缩 move-to-front 编码器的输出。我们的编码器为每个 16 kbyte 输入块计算新的 Huffman 树,而不是为其整个输入计算一棵树。
In Table 1, compression effectiveness is expressed as output bits per input character. The CPU time measurements were made on a DECstation 5000/200, which has a MIPS R3000 processor clocked at 25 MHz with a 64 kbyte cache. CPU time is given rather than elapsed time so the time spent performing I/O is excluded.
在表 1 中,压缩效率以每输入字符输出比特数表示。CPU 时间测量是在 DECstation 5000/200 上进行的,该设备配备了一个 25 MHz 的 MIPS R3000 处理器和 64 kbyte 缓存。CPU 时间以 CPU 时间给出,而不是以经过的时间给出,因此不包括执行 I/O 所花费的时间。
In Table 2, we show the variation of compression effectiveness with input block size for two inputs. The first input is the file book 1 from the Calgary Compression Corpus. The second is the entire Hector corpus [8], which consists of a little over 100 MBytes of modern English text. The table shows that compression improves with increasing block size, even when the block size is quite large. However, increasing the block size above a few tens of millions of characters has only a small effect.
在表 2 中,我们展示了两个输入的压缩有效性随输入块大小的变化。第一个输入是来自卡尔加里压缩语料库的文件 book 1。第二个是整个 Hector 语料库[8],它包含超过 100 MBytes 的现代英语文本。表格显示,随着块大小的增加,压缩效果得到改善,即使块大小相当大。然而,将块大小增加到几千万字符以上,效果只有很小。
If the block size were increased indefinitely, we expect that the algorithm would not approach the theoretical optimum compression because our simple byte-wise move-to-front coder introduces some loss. A better coding scheme might achieve optimum compression asymptotically, but this is not likely to be of great practical importance.
如果块大小无限增加,我们预计该算法不会接近理论上的最优压缩,因为我们的简单字节移动到前端的编码器引入了一些损失。一个更好的编码方案可能在渐近上实现最优压缩,但这不太可能具有很大的实际意义。

Comparison with other compression programmes
与其他压缩程序的比较

Table 3 compares our implementation of Algorithm C with three other compression programmes, chosen for their wide availability. The same fourteen files were compressed individually with each algorithm, and the results totalled. The bits per character values are the means of the values for the individual files. This metric was chosen to allow easy comparison with figures given by Bell [3].
表 3 比较了我们的 C 算法实现与三种其他压缩程序,这些程序因其广泛可用性而被选中。使用每种算法分别压缩了相同的十四个文件,并将结果汇总。每字符比特值是单个文件值的平均值。选择此指标是为了便于与 Bell [3]给出的数据进行比较。
Total CPU time/s 总 CPU 时间/秒 Total compressed Programme  Total compressed   Programme  {:[" Total compressed "],[" Programme "]:}\begin{array}{c}\text { Total compressed } \\ \text { Programme }\end{array} mean compress  mean   compress  {:[" mean "],[" compress "]:}\begin{array}{c}\text { mean } \\ \text { compress }\end{array}
decompress 解压缩
Total CPU time/s " Total compressed Programme " " mean compress " decompress | | Total CPU time/s | | $\begin{array}{c}\text { Total compressed } \\ \text { Programme }\end{array}$ | $\begin{array}{c}\text { mean } \\ \text { compress }\end{array}$ | | :--- | :---: | :---: | :---: | :---: | | decompress | | | | |
) ) ))
compress is version 4.2.3 of the well-known LZW-based tool [9,10]. gzip is version 1.2.4 of Gailly’s LZ77-based tool [1, 11].
压缩是著名的基于 LZW 的工具的 4.2.3 版本[9,10]。gzip 是 Gailly 基于 LZ77 的工具的 1.2.4 版本[1, 11]。

Alg-C/D is Algorithms C and D, with our back-end Huffman coder. comp-2 is Nelson’s comp-2 coder, limited to a fourth-order model [12].
Alg-C/D 是 C 和 D 算法,使用我们的后端 Huffman 编码器。comp-2 是 Nelson 的 comp-2 编码器,限制在四阶模型[12]。
Table 3: Comparison with other compression programmes.
表 3:与其他压缩程序的比较。

7 Conclusions 7 结论

We have described a compression technique that works by applying a reversible transformation to a block of text to make redundancy in the input more accessible to simple coding schemes. Our algorithm is general-purpose, in that it does well on both text and non-text inputs. The transformation uses sorting to group characters together based on their contexts; this technique makes use of the context on only one side of each character.
我们已经描述了一种压缩技术,该技术通过将可逆变换应用于文本块来使输入中的冗余更容易被简单的编码方案访问。我们的算法是通用型的,在文本和非文本输入上都表现良好。这种变换使用排序将字符根据其上下文分组;这种技术仅利用每个字符一侧的上下文。
To achieve good compression, input blocks of several thousand characters are needed. The effectiveness of the algorithm continues to improve with increasing block size at least up to blocks of several million characters.
为了实现良好的压缩,需要输入几千个字符的块。随着块大小的增加,至少达到几百万字符的块,算法的有效性持续提高。
Our algorithm achieves compression comparable with good statistical modellers, yet is closer in speed to coders based on the algorithms of Lempel and Ziv. Like Lempel and Ziv’s algorithms, our algorithm decompresses faster than it compresses.
我们的算法在压缩方面与优秀的统计模型器相当,但在速度上更接近基于 Lempel 和 Ziv 算法的编码器。像 Lempel 和 Ziv 的算法一样,我们的算法解压缩速度比压缩速度快。

Acknowledgements 致谢

We would like to thank Andrei Broder, Cynthia Hibbard, Greg Nelson, and Jim Saxe for their suggestions on the algorithms and the paper.
我们想感谢 Andrei Broder、Cynthia Hibbard、Greg Nelson 和 Jim Saxe 对算法和论文的建议。

References 参考文献

[1] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory. Vol. IT-23, No. 3, May 1977, pp. 337-343.
[1] J. Ziv 和 A. Lempel. 一种用于顺序数据压缩的通用算法。IEEE 信息理论杂志。第 23 卷,第 3 期,1977 年 5 月,第 337-343 页。

[2] J. Ziv and A. Lempel. Compression of individual sequences via variable rate coding. IEEE Transactions on Information Theory. Vol. IT-24, No. 5, September 1978, pp. 530-535.
[2] J. Ziv 和 A. Lempel. 通过可变率编码压缩单个序列。IEEE 信息理论杂志。第 24 卷第 5 期,1978 年 9 月,第 530-535 页。

[3] T. Bell, I. H. Witten, and J.G. Cleary. Modeling for text compression. ACM Computing Surveys, Vol. 21, No. 4, December 1989, pp. 557-589.
[3] T. Bell,I. H. Witten,和 J.G. Cleary。文本压缩建模。ACM 计算评论,第 21 卷,第 4 期,1989 年 12 月,第 557-589 页。

[4] J.L. Bentley, D.D. Sleator, R.E. Tarjan, and V.K. Wei. A locally adaptive data compression algorithm. Communications of the ACM, Vol. 29, No. 4, April 1986, pp. 320-330.
[4] J.L. Bentley,D.D. Sleator,R.E. Tarjan,和 V.K. Wei。一种局部自适应数据压缩算法。ACM 通讯,第 29 卷,第 4 期,1986 年 4 月,第 320-330 页。

[5] E.M. McCreight. A space economical suffix tree construction algorithm. Journal of the ACM, Vol. 32, No. 2, April 1976, pp. 262-272.
[5] E.M. McCreight. 空间高效的后缀树构造算法。ACM 期刊,第 32 卷,第 2 期,1976 年 4 月,第 262-272 页。

[6] U. Manber and E. W. Myers, Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, Volume 22, No. 5, October 1993, pp. 935-948.
[6] U. Manber 和 E. W. Myers,后缀数组:在线字符串搜索的新方法。美国工业与应用数学学会计算杂志,第 22 卷第 5 期,1993 年 10 月,第 935-948 页。

[7] I. H. Witten and T. Bell. The Calgary/Canterbury text compression corpus. Anonymous ftp from ftp.cpsc.ucalgary.ca: /pub/text.compression.corpus/text.compression.corpus.tar.z
[7] I. H. Witten 和 T. Bell. 卡尔加里/坎特伯雷文本压缩语料库。匿名 FTP 从 ftp.cpsc.ucalgary.ca:/pub/text.compression.corpus/text.compression.corpus.tar.z

[8] L. Glassman, D. Grinberg, C. Hibbard, J. Meehan, L. Guarino Reid, and M-C. van Leunen. Hector: Connecting Words with Definitions. Proceedings of the 8th Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, Waterloo, Canada, October, 1992. pp. 3773. Also available as Research Report 92a, Digital Equipment Corporation Systems Research Center, 130 Lytton Ave, Palo Alto, CA. 94301.
[8] L. Glassman, D. Grinberg, C. Hibbard, J. Meehan, L. Guarino Reid, 和 M-C. van Leunen. Hector:将词汇与定义相连接。第 8 届 UW 新牛津英语词典与文本研究中心年度会议论文集,加拿大滑铁卢,1992 年 10 月。第 3773 页。也可作为研究报告 92a,数字设备公司系统研究中心,130 Lytton Ave,Palo Alto,CA 94301。

[9] T.A. Welch. A technique for high performance data compression. IEEE Computer, Vol. 17, No. 6, June 1984, pp. 8-19.
[9] T.A. Welch. 高性能数据压缩技术。IEEE 计算机,第 17 卷,第 6 期,1984 年 6 月,第 8-19 页。

[10] Peter Jannesen et al. Compress, Version 4.2.3. Posted to the Internet newsgroup comp. sources . reviewed, 28th August, 1992. Anonymous ftp from gatekeeper.dec.com: /pub/misc/ncompress-4.2.3
[10] Peter Jannesen 等人。Compress,版本 4.2.3。发布到互联网新闻组 comp.sources。审查,1992 年 8 月 28 日。从 gatekeeper.dec.com 的匿名 ftp:/pub/misc/ncompress-4.2.3

[11] J. Gailly et al. Gzip,Version 1.2.4. Anonymous ftp from prep.ai.mit.edu: /pub/gnu/gzip-1.2.4.tar.gz
[11] J. Gailly 等人。Gzip,版本 1.2.4。匿名 FTP 从 prep.ai.mit.edu:/pub/gnu/gzip-1.2.4.tar.gz

[12] Mark Nelson. Arithmetic coding and statistical modeling. Dr. Dobbs Journal. February, 1991. Anonymous ftp from wuarchive.wustl.edu: /mirrors/msdos/ddjmag/ddj9102.zip
[12] 马克·尼尔森。算术编码和统计建模。Dobbs 博士杂志。1991 年 2 月。来自 wuarchive.wustl.edu 的匿名 ftp:/mirrors/msdos/ddjmag/ddj9102.zip