A Universal Algorithm for Sequential Data Compression 用于顺序数据压缩的通用算法
JACOB ZIV, FELLOW, IEEE, AND ABRAHAM LEMPEL, MEMBER, IEEE 雅各布·齐夫, IEEE 院士, 亚伯拉罕·伦普尔, IEEE 会员
Abstract 摘要
A universal algorithm for sequential data compression is presented. Its performance is investigated with respect to a nonprobabilistic model of constrained sources. The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source. 呈现了一种用于序列数据压缩的通用算法。其性能是根据受约束源的非概率模型进行的调查。所提出的通用代码所获得的压缩率均匀地接近了块到可变长编码和可变长到块编码所能达到的压缩率下限,这些编码旨在匹配完全指定的源。
I. INTRODUCTION 一、 引言
IN MANY situations arising in digital communications and data processing, the encountered strings of data display various structural regularities or are otherwise subject to certain constraints, thereby allowing for storage and time-saving techniques of data compression. Given a discrete data source, the problem of data compression is first to identify the limitations of the source, and second to devise a coding scheme which, subject to certain performance criteria, will best compress the given source. 在数字通信和数据处理中的许多情况下,遇到的数据串显示各种结构性规则或受到某些约束,从而允许使用数据压缩的存储和节时技术。给定一个离散数据源,数据压缩的问题首先是识别该源的局限性,其次是设计一个在某些性能标准下最佳压缩给定源的编码方案。
Once the relevant source parameters have been identified, the problem reduces to one of minimum-redundancy coding. This phase of the problem has received extensive treatment in the literature [1]-[7]. 一旦确定了相关的源参数,问题就简化为最小冗余编码的问题。这个问题的这一阶段在文献[1]-[7]中已经得到了广泛的研究。
When no a priori knowledge of the source characteristics is available, and if statistical tests are either impossible or unreliable, the problem of data compression becomes considerably more complicated. In order to overcome these difficulties one must resort to universal coding schemes whereby the coding process is interlaced with a learning process for the varying source characteristics [8], [9]. Such coding schemes inevitably require a larger working memory space and generally employ performance criteria that are appropriate for a wide variety of sources. 当没有先验源特征知识可用,且统计检验要么不可能要么不可靠时,数据压缩问题将变得更加复杂。为克服这些困难,必须依赖通用编码方案,其中编码过程与对变化源特征的学习过程交织在一起。这种编码方案必然需要更大的工作内存空间,并通常采用适合多种源的性能准则。
In this paper, we describe a universal coding scheme which can be applied to any discrete source and whose performance is comparable to certain optimal fixed code book schemes designed for completely specified sources. For lack of adequate criteria, we do not attempt to rank the proposed scheme with respect to other possible universal coding schemes. Instead, for the broad class of sources defined in Section III, we derive upper bounds on the compression efficiency attainable with full a priori knowledge of the source by fixed code book schemes, and 在本文中,我们描述了一种通用编码方案,可应用于任何离散源,其性能可与为特定源设计的某些最优定长码本方案相媲美。由于缺乏适当的标准,我们没有尝试将所提出的方案与其他可能的通用编码方案进行排名比较。相反,对于第三节中定义的广泛源类,我们推导了利用定长码本方案在完全事先知道源的情况下可以达到的最大压缩效率的上界。
then show that the efficiency of our universal code with no a priori knowledge of the source approaches those bounds. 然后证明我们的无先验知识的通用编码的效率接近于这些界限。
The proposed compression algorithm is an adaptation of a simple copying procedure discussed recently [10] in a study on the complexity of finite sequences. Basically, we employ the concept of encoding future segments of the source-output via maximum-length copying from a buffer containing the recent past output. The transmitted codeword consists of the buffer address and the length of the copied segment. With a predetermined initial load of the buffer and the information contained in the codewords, the source data can readily be reconstructed at the decoding end of the process. 该提议的压缩算法是对最近[10]在关于有限序列复杂性的研究中讨论的简单复制过程的改编。基本上,我们采用通过从包含最近过去输出的缓冲区进行最大长度复制来对源输出的未来段进行编码的概念。传输的代码字由缓冲区地址和复制段的长度组成。通过缓冲区的预定初始负载和代码字中包含的信息,可以在解码过程的末端轻松地重构源数据。
The main drawback of the proposed algorithm is its susceptibility to error propagation in the event of a channel error. 所提出算法的主要缺点是易受信道错误传播的影响。
II. The Compression Algorithm 压缩算法
The proposed compression algorithm consists of a rule for parsing strings of symbols from a finite alphabet AA into substrings, or words, whose lengths do not exceed a prescribed integer L_(s)L_{s}, and a coding scheme which maps these substrings sequentially into uniquely decipherable codewords of fixed length L_(c)L_{c} over the same alphabet AA. 提议的压缩算法由以下两部分组成:一个将来自有限字母表 AA 的符号串解析为长度不超过预定整数 L_(s)L_{s} 的子串(或单词)的规则,以及一个将这些子串按顺序映射到同一字母表 AA 上的固定长度 L_(c)L_{c} 的唯一可解密编码字的编码方案。
The word-length bounds L_(s)L_{s} and L_(c)L_{c} allow for boundeddelay encoding and decoding, and they are related by 词长界限 L_(s)L_{s} 和 L_(c)L_{c} 允许有界延迟编码和解码,它们之间有如下关系
where |~x~|\lceil x\rceil is the least integer not smaller than xx, the logarithm base is the cardinality alpha\alpha of the alphabet AA, and nn is the length of a buffer, employed at the encoding end of the process, which stores the latest nn symbols emitted by the source. The exact relationship between nn and L_(s)L_{s} is discussed in Section III. Typically, n≃L_(s)alpha^(hL_(s))n \simeq L_{s} \alpha^{h L_{s}}, where 0 < h < 1<h<1. For on-line decoding, a buffer of similar length has to be employed at the decoding end also. 其中 |~x~|\lceil x\rceil 是不小于 xx 的最小整数,对数底数是字母表 AA 的基数 alpha\alpha , nn 是编码端使用的缓冲区长度,存储源最近输出的 nn 个符号。 nn 和 L_(s)L_{s} 之间的确切关系在第三节中讨论。通常, n≃L_(s)alpha^(hL_(s))n \simeq L_{s} \alpha^{h L_{s}} ,其中 0 < h < 1<h<1 。对于在线解码,解码端也需要使用类似长度的缓冲区。
To describe the exact mechanics of the parsing and coding procedures, we need some preparation by way of notation and definitions. 为了描述解析和编码过程的确切机制,我们需要通过符号和定义进行一些准备工作。
Consider a finite alphabet AA of alpha\alpha symbols, say A=A={0,1,cdots,alpha-1}\{0,1, \cdots, \alpha-1\}. A string, or word, SS of length ℓ(S)=k\ell(S)=k over AA is an ordered kk-tuple S=s_(1)s_(2)cdotss_(k)S=s_{1} s_{2} \cdots s_{k} of symbols from AA. To indicate a substring of SS which starts at position ii and ends at position jj, we write S(i,j)S(i, j). When i <= j,S(i,j)=i \leq j, S(i, j)=s_(i)s_(i+1)cdotss_(j)s_{i} s_{i+1} \cdots s_{j}, but when i > ji>j, we take S(i,j)=LambdaS(i, j)=\Lambda, the null string of length zero. 考虑一个有限字母表 AA ,其中有 alpha\alpha 个符号,如 A=A={0,1,cdots,alpha-1}\{0,1, \cdots, \alpha-1\} 。长度为 ℓ(S)=k\ell(S)=k 的字符串或单词 SS 是从 AA 中选择的符号的有序 kk 元组 S=s_(1)s_(2)cdotss_(k)S=s_{1} s_{2} \cdots s_{k} 。为了表示 SS 中从位置 ii 开始到位置 jj 结束的子字符串,我们写为 S(i,j)S(i, j) 。当 i <= j,S(i,j)=i \leq j, S(i, j)=s_(i)s_(i+1)cdotss_(j)s_{i} s_{i+1} \cdots s_{j} 时,但是当 i > ji>j 时,我们取 S(i,j)=LambdaS(i, j)=\Lambda ,即长度为零的空字符串。
The concatenation of strings QQ and RR forms a new string S=QRS=Q R; if ℓ(Q)=k\ell(Q)=k and ℓ(R)=m\ell(R)=m, then ℓ(S)=k+m,Q\ell(S)=k+m, Q=S(1,k)=S(1, k), and R=S(k+1,k+m)R=S(k+1, k+m). For each j,0 <= j <=j, 0 \leq j \leq 字符串 QQ 和 RR 的连接形成一个新的字符串 S=QRS=Q R ;如果 ℓ(Q)=k\ell(Q)=k 和 ℓ(R)=m\ell(R)=m ,则 ℓ(S)=k+m,Q\ell(S)=k+m, Q=S(1,k)=S(1, k) , R=S(k+1,k+m)R=S(k+1, k+m) 。对于每个 j,0 <= j <=j, 0 \leq j \leq ℓ(S),S(1,j)\ell(S), S(1, j) is called a prefix of S;S(1,j)S ; S(1, j) is a proper prefix of SS if j < ℓ(S)j<\ell(S). ℓ(S),S(1,j)\ell(S), S(1, j) 是 S;S(1,j)S ; S(1, j) 的前缀,如果 j < ℓ(S)j<\ell(S) ,则 ℓ(S),S(1,j)\ell(S), S(1, j) 是 S;S(1,j)S ; S(1, j) 的真前缀。
Given a proper prefix S(1,j)S(1, j) of a string SS and a positive integer ii such that i <= ji \leq j, let L(i)L(i) denote the largest nonnegative integer ℓ <= ℓ(S)-j\ell \leqslant \ell(S)-j such that 给定一个字符串 SS 的合适前缀 S(1,j)S(1, j) 和一个正整数 ii ,使得 i <= ji \leq j ,令 L(i)L(i) 表示最大的非负整数 ℓ <= ℓ(S)-j\ell \leqslant \ell(S)-j ,使得
and let pp be a position of S(1,j)S(1, j) for which 并且让 pp 成为 S(1,j)S(1, j) 的一个位置
L(p)=max_(1 <= i <= j){L(i)}L(p)=\max _{1 \leq i \leq j}\{L(i)\}
The substring S(j+1,j+L(p))S(j+1, j+L(p)) of SS is called the reproducible extension of S(1,j)S(1, j) into SS, and the integer pp is called the pointer of the reproduction. For example, if SS=00101011=00101011 and j=3j=3, then L(1)=1L(1)=1 since S(j+1,j+1)=S(j+1, j+1)=S(1,1)S(1,1) but S(j+1,j+2)!=S(1,2)S(j+1, j+2) \neq S(1,2). Similarly, L(2)=4L(2)=4 and L(3)=0L(3)=0. Hence, S(3+1,3+4)=0101S(3+1,3+4)=0101 is the reproducible extension of S(1,3)=001S(1,3)=001 into SS with pointer p=2p=2. S(j+1,j+L(p))S(j+1, j+L(p)) 是 SS 的可重复扩展到 SS ,整数 pp 是该重复的指针。例如,如果 SS=00101011=00101011 和 j=3j=3 ,那么 L(1)=1L(1)=1 ,因为 S(j+1,j+1)=S(j+1, j+1)=S(1,1)S(1,1) 但 S(j+1,j+2)!=S(1,2)S(j+1, j+2) \neq S(1,2) 。同样地, L(2)=4L(2)=4 和 L(3)=0L(3)=0 。因此, S(3+1,3+4)=0101S(3+1,3+4)=0101 是 S(1,3)=001S(1,3)=001 到 SS 的可重复扩展,指针为 p=2p=2 。
Now, to describe the encoding process, let S=s_(1)s_(2)dotsS=s_{1} s_{2} \ldots denote the string of symbols emitted by the source. The sequential encoding of SS entails parsing SS into successive source words, S=S_(1)S_(2)cdotsS=S_{1} S_{2} \cdots, and assigning a codeword C_(i)C_{i} for each S_(i)S_{i}. For bounded-delay encoding, the length ℓ_(i)\ell_{i} of each S_(i)S_{i} is at most equal to a predetermined parameter L_(s)L_{s}, while each C_(i)C_{i} is of fixed length L_(c)L_{c} as given by (1). 现在,为了描述编码过程,让 S=s_(1)s_(2)dotsS=s_{1} s_{2} \ldots 表示源发出的符号串。 SS 的顺序编码涉及将 SS 解析为连续的源字, S=S_(1)S_(2)cdotsS=S_{1} S_{2} \cdots ,并为每个 S_(i)S_{i} 分配一个代码字 C_(i)C_{i} 。对于有界延迟编码,每个 S_(i)S_{i} 的长度 ℓ_(i)\ell_{i} 最多等于预先确定的参数 L_(s)L_{s} ,而每个 C_(i)C_{i} 的长度 L_(c)L_{c} 是固定的,由(1)式给出。
To initiate the encoding process, we assume that the output SS of the source was preceded by a string ZZ of n-n-L_(s)L_{s} zeros, and we store the string B_(1)=ZS(1,L_(s))B_{1}=Z S\left(1, L_{s}\right) in the buffer. If S(1,j)S(1, j) is the reproducible extension of ZZ into ZS(1,L_(s)-1)Z S\left(1, L_{s}-1\right), then S_(1)=S(1,j+1)S_{1}=S(1, j+1) and ℓ_(1)=j+1\ell_{1}=j+1. To determine the next source word, we shift out the first