A Universal Algorithm for Sequential Data Compression 一种通用的序列数据压缩算法
JACOB ZIV, FELLOW, IEEE, AND ABRAHAM LEMPEL, MEMBER, IEEE 雅各布·齐夫,IEEE 研究员,和亚伯拉罕·伦佩尔,IEEE 会员
Abstract 摘要
A universal algorithm for sequential data compression is presented. Its performance is investigated with respect to a nonprobabilistic model of constrained sources. The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source. 提出了一种用于序列数据压缩的通用算法。其性能在约束源的非概率模型下进行了研究。所提出的通用编码所达到的压缩比均匀地接近由设计用于匹配完全指定源的块到变量编码和变量到块编码所能达到的压缩比下限。
I. INTRODUCTION 一. 引言
IN MANY situations arising in digital communications and data processing, the encountered strings of data display various structural regularities or are otherwise subject to certain constraints, thereby allowing for storage and time-saving techniques of data compression. Given a discrete data source, the problem of data compression is first to identify the limitations of the source, and second to devise a coding scheme which, subject to certain performance criteria, will best compress the given source. 在数字通信和数据处理的许多情况下,遇到的数据字符串显示出各种结构规律,或者受到某些约束,从而允许采用数据压缩的存储和节省时间的技术。给定一个离散数据源,数据压缩的问题首先是识别源的限制,其次是设计一个编码方案,该方案在满足某些性能标准的情况下,能够最佳地压缩给定的源。
Once the relevant source parameters have been identified, the problem reduces to one of minimum-redundancy coding. This phase of the problem has received extensive treatment in the literature [1]-[7]. 一旦识别出相关的源参数,问题就简化为最小冗余编码。这一问题的阶段在文献中得到了广泛的研究 [1]-[7]。
When no a priori knowledge of the source characteristics is available, and if statistical tests are either impossible or unreliable, the problem of data compression becomes considerably more complicated. In order to overcome these difficulties one must resort to universal coding schemes whereby the coding process is interlaced with a learning process for the varying source characteristics [8], [9]. Such coding schemes inevitably require a larger working memory space and generally employ performance criteria that are appropriate for a wide variety of sources. 当没有源特征的先验知识可用时,如果统计测试既不可能又不可靠,数据压缩的问题变得复杂得多。为了克服这些困难,必须诉诸于通用编码方案,其中编码过程与变化的源特征的学习过程交织在一起 [8],[9]。这种编码方案不可避免地需要更大的工作内存空间,并且通常采用适用于各种源的性能标准。
In this paper, we describe a universal coding scheme which can be applied to any discrete source and whose performance is comparable to certain optimal fixed code book schemes designed for completely specified sources. For lack of adequate criteria, we do not attempt to rank the proposed scheme with respect to other possible universal coding schemes. Instead, for the broad class of sources defined in Section III, we derive upper bounds on the compression efficiency attainable with full a priori knowledge of the source by fixed code book schemes, and 在本文中,我们描述了一种通用编码方案,该方案可以应用于任何离散源,其性能可与为完全指定源设计的某些最优固定码本方案相媲美。由于缺乏足够的标准,我们不试图将所提出的方案与其他可能的通用编码方案进行排名。相反,对于第三节中定义的广泛源类,我们推导了在完全先验知识下,固定码本方案可达到的压缩效率的上界。
then show that the efficiency of our universal code with no a priori knowledge of the source approaches those bounds. 然后表明我们的通用编码在没有先验知识的情况下,其效率接近这些界限。
The proposed compression algorithm is an adaptation of a simple copying procedure discussed recently [10] in a study on the complexity of finite sequences. Basically, we employ the concept of encoding future segments of the source-output via maximum-length copying from a buffer containing the recent past output. The transmitted codeword consists of the buffer address and the length of the copied segment. With a predetermined initial load of the buffer and the information contained in the codewords, the source data can readily be reconstructed at the decoding end of the process. 所提议的压缩算法是对最近在有限序列复杂性研究中讨论的简单复制过程的改编[10]。基本上,我们采用通过从包含最近输出的缓冲区进行最大长度复制来编码源输出的未来段的概念。传输的代码字由缓冲区地址和复制段的长度组成。通过预设的缓冲区初始负载和代码字中包含的信息,源数据可以在解码过程的末端轻松重建。
The main drawback of the proposed algorithm is its susceptibility to error propagation in the event of a channel error. 所提算法的主要缺点是其在通道错误发生时容易受到错误传播的影响。
II. The Compression Algorithm II. 压缩算法
The proposed compression algorithm consists of a rule for parsing strings of symbols from a finite alphabet AA into substrings, or words, whose lengths do not exceed a prescribed integer L_(s)L_{s}, and a coding scheme which maps these substrings sequentially into uniquely decipherable codewords of fixed length L_(c)L_{c} over the same alphabet AA. 所提出的压缩算法包括一个规则,用于将有限字母表 AA 中的符号字符串解析为子字符串或单词,这些子字符串的长度不超过规定的整数 L_(s)L_{s} ,以及一个编码方案,该方案将这些子字符串顺序映射为相同字母表 AA 上固定长度 L_(c)L_{c} 的唯一可解码代码字。
The word-length bounds L_(s)L_{s} and L_(c)L_{c} allow for boundeddelay encoding and decoding, and they are related by 字长界限 L_(s)L_{s} 和 L_(c)L_{c} 允许有界延迟的编码和解码,它们之间的关系是
where |~x~|\lceil x\rceil is the least integer not smaller than xx, the logarithm base is the cardinality alpha\alpha of the alphabet AA, and nn is the length of a buffer, employed at the encoding end of the process, which stores the latest nn symbols emitted by the source. The exact relationship between nn and L_(s)L_{s} is discussed in Section III. Typically, n≃L_(s)alpha^(hL_(s))n \simeq L_{s} \alpha^{h L_{s}}, where 0 < h < 1<h<1. For on-line decoding, a buffer of similar length has to be employed at the decoding end also. 其中 |~x~|\lceil x\rceil 是不小于 xx 的最小整数,日志的基数是字母表 alpha\alpha 的基数 AA ,而 nn 是在编码过程的编码端使用的缓冲区的长度,用于存储源发出的最新 nn 个符号。关于 nn 和 L_(s)L_{s} 之间的确切关系在第三节中讨论。通常情况下, n≃L_(s)alpha^(hL_(s))n \simeq L_{s} \alpha^{h L_{s}} ,其中 0 < h < 1<h<1 。对于在线解码,解码端也必须使用相似长度的缓冲区。
To describe the exact mechanics of the parsing and coding procedures, we need some preparation by way of notation and definitions. 为了描述解析和编码过程的确切机制,我们需要通过符号和定义进行一些准备。
Consider a finite alphabet AA of alpha\alpha symbols, say A=A={0,1,cdots,alpha-1}\{0,1, \cdots, \alpha-1\}. A string, or word, SS of length ℓ(S)=k\ell(S)=k over AA is an ordered kk-tuple S=s_(1)s_(2)cdotss_(k)S=s_{1} s_{2} \cdots s_{k} of symbols from AA. To indicate a substring of SS which starts at position ii and ends at position jj, we write S(i,j)S(i, j). When i <= j,S(i,j)=i \leq j, S(i, j)=s_(i)s_(i+1)cdotss_(j)s_{i} s_{i+1} \cdots s_{j}, but when i > ji>j, we take S(i,j)=LambdaS(i, j)=\Lambda, the null string of length zero. 考虑一个有限字母表 AA ,包含 alpha\alpha 个符号,比如 A=A={0,1,cdots,alpha-1}\{0,1, \cdots, \alpha-1\} 。一个字符串或单词 SS ,长度为 ℓ(S)=k\ell(S)=k ,由 AA 构成,是一个有序的 kk -元组 S=s_(1)s_(2)cdotss_(k)S=s_{1} s_{2} \cdots s_{k} ,由 AA 中的符号组成。为了表示一个子字符串 SS ,它从位置 ii 开始,到位置 jj 结束,我们写作 S(i,j)S(i, j) 。当 i <= j,S(i,j)=i \leq j, S(i, j)=s_(i)s_(i+1)cdotss_(j)s_{i} s_{i+1} \cdots s_{j} 时,但当 i > ji>j 时,我们取 S(i,j)=LambdaS(i, j)=\Lambda ,长度为零的空字符串。
The concatenation of strings QQ and RR forms a new string S=QRS=Q R; if ℓ(Q)=k\ell(Q)=k and ℓ(R)=m\ell(R)=m, then ℓ(S)=k+m,Q\ell(S)=k+m, Q=S(1,k)=S(1, k), and R=S(k+1,k+m)R=S(k+1, k+m). For each j,0 <= j <=j, 0 \leq j \leq 字符串 QQ 和 RR 的连接形成一个新字符串 S=QRS=Q R ;如果 ℓ(Q)=k\ell(Q)=k 和 ℓ(R)=m\ell(R)=m ,那么 ℓ(S)=k+m,Q\ell(S)=k+m, Q=S(1,k)=S(1, k) ,并且 R=S(k+1,k+m)R=S(k+1, k+m) 。对于每个 j,0 <= j <=j, 0 \leq j \leq ℓ(S),S(1,j)\ell(S), S(1, j) is called a prefix of S;S(1,j)S ; S(1, j) is a proper prefix of SS if j < ℓ(S)j<\ell(S). ℓ(S),S(1,j)\ell(S), S(1, j) 被称为 S;S(1,j)S ; S(1, j) 的前缀,如果 j < ℓ(S)j<\ell(S) ,则是 SS 的真前缀。
Given a proper prefix S(1,j)S(1, j) of a string SS and a positive integer ii such that i <= ji \leq j, let L(i)L(i) denote the largest nonnegative integer ℓ <= ℓ(S)-j\ell \leqslant \ell(S)-j such that 给定字符串 SS 的一个适当前缀 S(1,j)S(1, j) 和一个正整数 ii ,使得 i <= ji \leq j ,令 L(i)L(i) 表示最大的非负整数 ℓ <= ℓ(S)-j\ell \leqslant \ell(S)-j ,使得
and let pp be a position of S(1,j)S(1, j) for which 让 pp 成为 S(1,j)S(1, j) 的一个位置,其中
L(p)=max_(1 <= i <= j){L(i)}L(p)=\max _{1 \leq i \leq j}\{L(i)\}
The substring S(j+1,j+L(p))S(j+1, j+L(p)) of SS is called the reproducible extension of S(1,j)S(1, j) into SS, and the integer pp is called the pointer of the reproduction. For example, if SS=00101011=00101011 and j=3j=3, then L(1)=1L(1)=1 since S(j+1,j+1)=S(j+1, j+1)=S(1,1)S(1,1) but S(j+1,j+2)!=S(1,2)S(j+1, j+2) \neq S(1,2). Similarly, L(2)=4L(2)=4 and L(3)=0L(3)=0. Hence, S(3+1,3+4)=0101S(3+1,3+4)=0101 is the reproducible extension of S(1,3)=001S(1,3)=001 into SS with pointer p=2p=2. 子串 S(j+1,j+L(p))S(j+1, j+L(p)) 的 SS 被称为 S(1,j)S(1, j) 到 SS 的可重现扩展,整数 pp 被称为重现的指针。例如,如果 SS=00101011=00101011 和 j=3j=3 ,那么 L(1)=1L(1)=1 因为 S(j+1,j+1)=S(j+1, j+1)=S(1,1)S(1,1) 但 S(j+1,j+2)!=S(1,2)S(j+1, j+2) \neq S(1,2) 。类似地, L(2)=4L(2)=4 和 L(3)=0L(3)=0 。因此, S(3+1,3+4)=0101S(3+1,3+4)=0101 是 S(1,3)=001S(1,3)=001 到 SS 的可重现扩展,指针为 p=2p=2 。
Now, to describe the encoding process, let S=s_(1)s_(2)dotsS=s_{1} s_{2} \ldots denote the string of symbols emitted by the source. The sequential encoding of SS entails parsing SS into successive source words, S=S_(1)S_(2)cdotsS=S_{1} S_{2} \cdots, and assigning a codeword C_(i)C_{i} for each S_(i)S_{i}. For bounded-delay encoding, the length ℓ_(i)\ell_{i} of each S_(i)S_{i} is at most equal to a predetermined parameter L_(s)L_{s}, while each C_(i)C_{i} is of fixed length L_(c)L_{c} as given by (1). 现在,为了描述编码过程,令 S=s_(1)s_(2)dotsS=s_{1} s_{2} \ldots 表示源发出的符号字符串。对 SS 的顺序编码涉及将 SS 解析为连续的源词 S=S_(1)S_(2)cdotsS=S_{1} S_{2} \cdots ,并为每个 S_(i)S_{i} 分配一个代码字 C_(i)C_{i} 。对于有界延迟编码,每个 S_(i)S_{i} 的长度 ℓ_(i)\ell_{i} 至多等于预定参数 L_(s)L_{s} ,而每个 C_(i)C_{i} 的固定长度 L_(c)L_{c} 如 (1) 所示。
To initiate the encoding process, we assume that the output SS of the source was preceded by a string ZZ of n-n-L_(s)L_{s} zeros, and we store the string B_(1)=ZS(1,L_(s))B_{1}=Z S\left(1, L_{s}\right) in the buffer. If S(1,j)S(1, j) is the reproducible extension of ZZ into ZS(1,L_(s)-1)Z S\left(1, L_{s}-1\right), then S_(1)=S(1,j+1)S_{1}=S(1, j+1) and ℓ_(1)=j+1\ell_{1}=j+1. To determine the next source word, we shift out the first ℓ_(1)\ell_{1} symbols from the buffer and feed into it the next ℓ_(1)\ell_{1} symbols of SS