A Universal Algorithm for Sequential Data Compression
用于顺序数据压缩的通用算法

JACOB ZIV, FELLOW, IEEE, AND ABRAHAM LEMPEL, MEMBER, IEEE
雅各布·齐夫, IEEE 院士, 亚伯拉罕·伦普尔, IEEE 会员

Abstract 摘要

A universal algorithm for sequential data compression is presented. Its performance is investigated with respect to a nonprobabilistic model of constrained sources. The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
呈现了一种用于序列数据压缩的通用算法。其性能是根据受约束源的非概率模型进行的调查。所提出的通用代码所获得的压缩率均匀地接近了块到可变长编码和可变长到块编码所能达到的压缩率下限,这些编码旨在匹配完全指定的源。

I. INTRODUCTION 一、引言

IN MANY situations arising in digital communications and data processing, the encountered strings of data display various structural regularities or are otherwise subject to certain constraints, thereby allowing for storage and time-saving techniques of data compression. Given a discrete data source, the problem of data compression is first to identify the limitations of the source, and second to devise a coding scheme which, subject to certain performance criteria, will best compress the given source.
在数字通信和数据处理中的许多情况下,遇到的数据串显示各种结构性规则或受到某些约束,从而允许使用数据压缩的存储和节时技术。给定一个离散数据源,数据压缩的问题首先是识别该源的局限性,其次是设计一个在某些性能标准下最佳压缩给定源的编码方案。

Once the relevant source parameters have been identified, the problem reduces to one of minimum-redundancy coding. This phase of the problem has received extensive treatment in the literature [1]-[7].
一旦确定了相关的源参数,问题就简化为最小冗余编码的问题。这个问题的这一阶段在文献[1]-[7]中已经得到了广泛的研究。

When no a priori knowledge of the source characteristics is available, and if statistical tests are either impossible or unreliable, the problem of data compression becomes considerably more complicated. In order to overcome these difficulties one must resort to universal coding schemes whereby the coding process is interlaced with a learning process for the varying source characteristics [8], [9]. Such coding schemes inevitably require a larger working memory space and generally employ performance criteria that are appropriate for a wide variety of sources.
当没有先验源特征知识可用,且统计检验要么不可能要么不可靠时,数据压缩问题将变得更加复杂。为克服这些困难,必须依赖通用编码方案,其中编码过程与对变化源特征的学习过程交织在一起。这种编码方案必然需要更大的工作内存空间,并通常采用适合多种源的性能准则。

In this paper, we describe a universal coding scheme which can be applied to any discrete source and whose performance is comparable to certain optimal fixed code book schemes designed for completely specified sources. For lack of adequate criteria, we do not attempt to rank the proposed scheme with respect to other possible universal coding schemes. Instead, for the broad class of sources defined in Section III, we derive upper bounds on the compression efficiency attainable with full a priori knowledge of the source by fixed code book schemes, and
在本文中，我们描述了一种通用编码方案，可应用于任何离散源，其性能可与为特定源设计的某些最优定长码本方案相媲美。由于缺乏适当的标准，我们没有尝试将所提出的方案与其他可能的通用编码方案进行排名比较。相反，对于第三节中定义的广泛源类，我们推导了利用定长码本方案在完全事先知道源的情况下可以达到的最大压缩效率的上界。

then show that the efficiency of our universal code with no a priori knowledge of the source approaches those bounds.
然后证明我们的无先验知识的通用编码的效率接近于这些界限。

The proposed compression algorithm is an adaptation of a simple copying procedure discussed recently [10] in a study on the complexity of finite sequences. Basically, we employ the concept of encoding future segments of the source-output via maximum-length copying from a buffer containing the recent past output. The transmitted codeword consists of the buffer address and the length of the copied segment. With a predetermined initial load of the buffer and the information contained in the codewords, the source data can readily be reconstructed at the decoding end of the process.
该提议的压缩算法是对最近[10]在关于有限序列复杂性的研究中讨论的简单复制过程的改编。基本上,我们采用通过从包含最近过去输出的缓冲区进行最大长度复制来对源输出的未来段进行编码的概念。传输的代码字由缓冲区地址和复制段的长度组成。通过缓冲区的预定初始负载和代码字中包含的信息,可以在解码过程的末端轻松地重构源数据。

The main drawback of the proposed algorithm is its susceptibility to error propagation in the event of a channel error.
所提出算法的主要缺点是易受信道错误传播的影响。

II. The Compression Algorithm
压缩算法

The proposed compression algorithm consists of a rule for parsing strings of symbols from a finite alphabet

A

into substrings, or words, whose lengths do not exceed a prescribed integer

L_{s}

, and a coding scheme which maps these substrings sequentially into uniquely decipherable codewords of fixed length

L_{c}

over the same alphabet

A

.
提议的压缩算法由以下两部分组成:一个将来自有限字母表

A

的符号串解析为长度不超过预定整数

L_{s}

的子串(或单词)的规则,以及一个将这些子串按顺序映射到同一字母表

A

上的固定长度

L_{c}

的唯一可解密编码字的编码方案。

The word-length bounds

L_{s}

and

L_{c}

allow for boundeddelay encoding and decoding, and they are related by
词长界限

L_{s}

和

L_{c}

允许有界延迟编码和解码,它们之间有如下关系

L_{c} = 1 + ⌈ \log (n - L_{s}) ⌉ + ⌈ \log L_{s} ⌉,

where

⌈ x ⌉

is the least integer not smaller than

x

, the logarithm base is the cardinality

α

of the alphabet

A

, and

n

is the length of a buffer, employed at the encoding end of the process, which stores the latest

n

symbols emitted by the source. The exact relationship between

n

and

L_{s}

is discussed in Section III. Typically,

n ≃ L_{s} α^{h L_{s}}

, where 0

< h < 1

. For on-line decoding, a buffer of similar length has to be employed at the decoding end also.
其中

⌈ x ⌉

是不小于

x

的最小整数,对数底数是字母表

A

的基数

α

n

是编码端使用的缓冲区长度,存储源最近输出的

n

个符号。

n

和

L_{s}

之间的确切关系在第三节中讨论。通常,

n ≃ L_{s} α^{h L_{s}}

,其中 0

< h < 1

。对于在线解码,解码端也需要使用类似长度的缓冲区。

To describe the exact mechanics of the parsing and coding procedures, we need some preparation by way of notation and definitions.
为了描述解析和编码过程的确切机制,我们需要通过符号和定义进行一些准备工作。

Consider a finite alphabet

A

α

symbols, say

A =

{0, 1, \dots, α - 1}

. A string, or word,

S

of length

ℓ (S) = k

over

A

is an ordered

k

-tuple

S = s_{1} s_{2} \dots s_{k}

of symbols from

A

. To indicate a substring of

S

which starts at position

i

and ends at position

j

, we write

S (i, j)

. When

i \leq j, S (i, j) =

s_{i} s_{i + 1} \dots s_{j}

, but when

i > j

, we take

S (i, j) = Λ

, the null string of length zero.
考虑一个有限字母表

A

，其中有

α

个符号，如

A =

{0, 1, \dots, α - 1}

。长度为

ℓ (S) = k

的字符串或单词

S

是从

A

中选择的符号的有序

k

元组

S = s_{1} s_{2} \dots s_{k}

。为了表示

S

中从位置

i

开始到位置

j

结束的子字符串，我们写为

S (i, j)

。当

i \leq j, S (i, j) =

s_{i} s_{i + 1} \dots s_{j}

时,但是当

i > j

时,我们取

S (i, j) = Λ

,即长度为零的空字符串。

The concatenation of strings

Q

and

R

forms a new string

S = Q R

; if

ℓ (Q) = k

and

ℓ (R) = m

, then

ℓ (S) = k + m, Q

= S (1, k)

, and

R = S (k + 1, k + m)

. For each

j, 0 \leq j \leq

字符串

Q

和

R

的连接形成一个新的字符串

S = Q R

;如果

ℓ (Q) = k

和

ℓ (R) = m

,则

ℓ (S) = k + m, Q

= S (1, k)

R = S (k + 1, k + m)

。对于每个

j, 0 \leq j \leq

ℓ (S), S (1, j)

is called a prefix of

S; S (1, j)

is a proper prefix of

S

j < ℓ (S)

ℓ (S), S (1, j)

是

S; S (1, j)

的前缀,如果

j < ℓ (S)

,则

ℓ (S), S (1, j)

是

S; S (1, j)

的真前缀。

Given a proper prefix

S (1, j)

of a string

S

and a positive integer

i

such that

i \leq j

, let

L (i)

denote the largest nonnegative integer

ℓ ⩽ ℓ (S) - j

such that
给定一个字符串

S

的合适前缀

S (1, j)

和一个正整数

i

,使得

i \leq j

,令

L (i)

表示最大的非负整数

ℓ ⩽ ℓ (S) - j

,使得

S (i, i + ℓ - 1) = S (j + 1, j + ℓ)

and let

p

be a position of

S (1, j)

for which
并且让

p

成为

S (1, j)

的一个位置

L (p) = max_{1 \leq i \leq j} {L (i)}

The substring

S (j + 1, j + L (p))

S

is called the reproducible extension of

S (1, j)

into

S

, and the integer

p

is called the pointer of the reproduction. For example, if

S

= 00101011

and

j = 3

, then

L (1) = 1

since

S (j + 1, j + 1) =

S (1, 1)

but

S (j + 1, j + 2) \neq S (1, 2)

. Similarly,

L (2) = 4

and

L (3) = 0

. Hence,

S (3 + 1, 3 + 4) = 0101

is the reproducible extension of

S (1, 3) = 001

into

S

with pointer

p = 2

S (j + 1, j + L (p))

是

S

的可重复扩展到

S

，整数

p

是该重复的指针。例如，如果

S

= 00101011

和

j = 3

，那么

L (1) = 1

，因为

S (j + 1, j + 1) =

S (1, 1)

但

S (j + 1, j + 2) \neq S (1, 2)

。同样地，

L (2) = 4

和

L (3) = 0

。因此，

S (3 + 1, 3 + 4) = 0101

是

S (1, 3) = 001

到

S

的可重复扩展，指针为

p = 2

。

Now, to describe the encoding process, let

S = s_{1} s_{2} \dots

denote the string of symbols emitted by the source. The sequential encoding of

S

entails parsing

S

into successive source words,

S = S_{1} S_{2} \dots

, and assigning a codeword

C_{i}

for each

S_{i}

. For bounded-delay encoding, the length

ℓ_{i}

of each

S_{i}

is at most equal to a predetermined parameter

L_{s}

, while each

C_{i}

is of fixed length

L_{c}

as given by (1).
现在,为了描述编码过程,让

S = s_{1} s_{2} \dots

表示源发出的符号串。

S

的顺序编码涉及将

S

解析为连续的源字,

S = S_{1} S_{2} \dots

,并为每个

S_{i}

分配一个代码字

C_{i}

。对于有界延迟编码,每个

S_{i}

的长度

ℓ_{i}

最多等于预先确定的参数

L_{s}

,而每个

C_{i}

的长度

L_{c}

是固定的,由(1)式给出。

To initiate the encoding process, we assume that the output

S

of the source was preceded by a string

Z

n -

L_{s}

zeros, and we store the string

B_{1} = Z S (1, L_{s})

in the buffer. If

S (1, j)

is the reproducible extension of

Z

into

Z S (1, L_{s} - 1)

, then

S_{1} = S (1, j + 1)

and

ℓ_{1} = j + 1

. To determine the next source word, we shift out the first

ℓ_{1}

symbols from the buffer and feed into it the next

ℓ_{1}

symbols of

S

to obtain the string

B_{2} = B_{1} (ℓ_{1} + 1, n) S (L_{s} + 1

L_{s} + ℓ_{1}

). Now we look for the reproducible extension

E

B_{2} (1, n - L_{s})

into

B_{2} (1, n - 1)

, and set

S_{2} = E s

, where

s

is the symbol next to

E

B_{2}

. In general, if

B_{i}

denotes the string of

n

source symbols stored in the buffer when we are ready to determine the

i

th source word

S_{i}

, the successive encoding steps can be formally described as follows.
为了启动编码过程,我们假设源输出

S

是由长度为

n -

的

L_{s}

个 0 组成的字符串

Z

前缀的,并将字符串

B_{1} = Z S (1, L_{s})

存储在缓冲区中。如果

S (1, j)

是

Z

延伸到

Z S (1, L_{s} - 1)

的可复制扩展,那么

S_{1} = S (1, j + 1)

和

ℓ_{1} = j + 1

。为了确定下一个源词,我们从缓冲区中移出前

ℓ_{1}

个符号,并将

S

的下一个

ℓ_{1}

个符号输入到其中,以获得字符串

B_{2} = B_{1} (ℓ_{1} + 1, n) S (L_{s} + 1

L_{s} + ℓ_{1}

)。现在我们寻找

B_{2} (1, n - L_{s})

延伸到

B_{2} (1, n - 1)

的可重复扩展

E

,并设置

S_{2} = E s

,其中

s

是

B_{2}

中紧挨着

E

的符号。一般而言,如果

B_{i}

表示在我们准备确定第

i

个源词

S_{i}

时存储在缓冲区中的

n

个源符号的字符串,则可以将连续的编码步骤形式化地描述如下。

Initially, set $B_{1} = 0^{n - L_{s}} S (1, L_{s})$ , i.e., the all-zero string of length $n - L_{s}$ followed by the first $L_{s}$ symbols of $S$ .
最初,设置 $B_{1} = 0^{n - L_{s}} S (1, L_{s})$ ,即长度为 $n - L_{s}$ 的全零字符串,后跟 $S$ 的前 $L_{s}$ 个符号。
Having determined $B_{i}, i \geq 1$ , set
确定 $B_{i}, i \geq 1$ 之后，设置

S_{i} = B_{i} (n - L_{s} + 1, n - L_{s} + ℓ_{i})

where the prefix of length

ℓ_{i} - 1

S_{i}

is the reproducible extension of

B_{i} (1, n - L_{s})

into

B_{i} (1, n - 1)

.
其中长度为

ℓ_{i} - 1

的

S_{i}

前缀是将

B_{i} (1, n - L_{s})

扩展到

B_{i} (1, n - 1)

的可重复扩展。
3) If

p_{i}

is the reproduction pointer used to determine

S_{i}

, then the codeword

C_{i}

for

S_{i}

is given by
如果

p_{i}

是用于确定

S_{i}

的复制指针,那么

S_{i}

的代码字

C_{i}

为

C_{i} = C_{i 1} C_{i 2} C_{i 3}

where

C_{i 1}

is the radix-

α

representation of

p_{i} - 1

with

ℓ (C_{i 1}) = ⌈ \log (n - L_{s}) ⌉, C_{i 2}

is the radix-

α

representation of

ℓ_{i} - 1

with

ℓ (C_{i 2}) = ⌈ \log L_{s} ⌉

, and

C_{i 3}

is the last symbol of

S_{i}

, i.e., the symbol occupying position

n - L_{s} + ℓ_{i}

B_{i}

. The total length of

C_{i}

is given by
其中

C_{i 1}

是

p_{i} - 1

的基数-

α

表示形式,

ℓ (C_{i 1}) = ⌈ \log (n - L_{s}) ⌉, C_{i 2}

是

ℓ_{i} - 1

的基数-

α

表示形式,

ℓ (C_{i 2}) = ⌈ \log L_{s} ⌉

C_{i 3}

是

S_{i}

的最后一个符号,即

B_{i}

第

n - L_{s} + ℓ_{i}

个位置的符号。

C_{i}

的总长度由

ℓ (C_{i}) = ⌈ \log (n - L_{s}) ⌉ + ⌈ \log L_{s} ⌉ + 1

in accordance with (1). 根据(1)。
4) To update the contents of the buffer, shift out the symbols occupying the first

ℓ_{i}

positions of the buffer while feeding in the next

ℓ_{i}

symbols from the source to obtain
更新缓冲区内容时,先移出缓冲区前

ℓ_{i}

个位置占用的符号,然后从源中输入下一个

ℓ_{i}

个符号

B_{i + 1} = B_{i} (ℓ_{i} + 1, n) S (h_{i} + 1, h_{i} + ℓ_{i})

where

h_{i}

is the position of

S

occupied by the last symbol of

B_{i}

.
其中

h_{i}

是

S

占用

B_{i}

最后一个符号的位置。

This completes the description of the encoding process. It is easy to verify that the parsing rule defined by (2) guarantees a bounded, positive source word length in each iteration; in fact,

1 \leq ℓ_{i} \leq L_{s}

for each

i

thus allowing for a radix-

α

representation of

ℓ_{i} - 1

with

⌈ \log L_{s} ⌉

symbols from

A

. Also, since

1 \leq p_{i} \leq n - L_{s}

for each

i

, it is possible to represent

p_{i} - 1

with

⌈ \log (n - L_{s}) ⌉

symbols from

A

.
这完成了编码过程的描述。很容易验证由(2)定义的解析规则保证了每次迭代中源词长度的有界性和正性; 事实上,

1 \leq ℓ_{i} \leq L_{s}

对于每一个

i

从而允许用

α

进制表示

ℓ_{i} - 1

中

⌈ \log L_{s} ⌉

个符号来自

A

。此外,由于

1 \leq p_{i} \leq n - L_{s}

对于每一个

i

，因此可用

⌈ \log (n - L_{s}) ⌉

个来自

A

的符号表示

p_{i} - 1

。

Decoding can be performed simply by reversing the encoding process. Here we employ a buffer of length

n -

L_{s}

to store the latest decoded source symbols. Initially, the buffer is loaded with

n - L_{s}

zeros. If

D_{i} = d_{1} d_{2} \dots d_{n - L_{s}}

denotes the contents of the buffer after

C_{i - 1}

has been decoded into

S_{i - 1}

, then
解码只需简单地反转编码过程。这里我们使用长度为

n -

的缓冲区

L_{s}

来存储最近解码的源符号。初始时,缓冲区被加载为

n - L_{s}

个零。如果

D_{i} = d_{1} d_{2} \dots d_{n - L_{s}}

表示在

C_{i - 1}

被解码为

S_{i - 1}

之后缓冲区的内容,那么

S_{i}_{1} = D_{i} (n - L_{s} - ℓ_{i - 1} + 1, n - L_{s})

where

ℓ_{i - 1} = ℓ (S_{i - 1})

, and where

D_{i + 1}

can be obtained from

D_{i}

and

C_{i}

as follows.
当

ℓ_{i - 1} = ℓ (S_{i - 1})

，且可以从

D_{i}

和

C_{i}

获得

D_{i + 1}

，如下所示。

Determine

p_{i} - 1

and

ℓ_{i} - 1

from the first

⌈ \log (n - L_{s}) ⌉

and the next

⌈ \log L_{s} ⌉

symbols of

C_{i}

. Then, apply

ℓ_{i} - 1

shifts while feeding the contents of stage

p_{i}

into stage

n -

L_{s}

. The first of these shifts will change the buffer contents from

D_{i}

to
从

C_{i}

的前

⌈ \log (n - L_{s}) ⌉

个和后

⌈ \log L_{s} ⌉

个符号确定

p_{i} - 1

和

ℓ_{i} - 1

。然后在将

p_{i}

阶段的内容输送到

n -

阶段时应用

ℓ_{i} - 1

个位移操作。第一次位移将把缓冲区内容从

D_{i}

改变为

D_{i}^{(1)} = d_{2} d_{3} \dots d_{n - L_{s}} d_{p_{i}} = d_{1}^{(1)} d_{2}^{(1)} \dots d_{n - L_{n}}^{(1)}

Similarly, if

j \leq ℓ_{i} - 1

, the

j

th shift will transform

D_{i}^{(j - 1)}

= d_{1}^{(j - 1)} d_{2}^{(j - 1)} \dots d_{n - L_{s}}^{(j - 1)}

into

D_{i}^{(j)} = d_{2}^{(j - 1)} d_{3}^{(j - 1)} \dots

d_{n - L_{s}}^{(j - 1)} d_{p_{i}}^{(j - 1)} = d_{1}^{(j)} d_{2}^{(j)} \dots d_{n - L_{s}}^{(j)}

. After these

ℓ_{i} - 1

shifts are completed, shift once more, while feeding the last symbol of

C_{i}

into stage

n - L_{s}

of the buffer. It is easy to verify that the resulting load of the buffer contains

S_{i}

in its last

ℓ_{i} = ℓ (S_{i})

positions.
类似地,如果

j \leq ℓ_{i} - 1

,第

j

次移位将

D_{i}^{(j - 1)}

= d_{1}^{(j - 1)} d_{2}^{(j - 1)} \dots d_{n - L_{s}}^{(j - 1)}

转换为

D_{i}^{(j)} = d_{2}^{(j - 1)} d_{3}^{(j - 1)} \dots

d_{n - L_{s}}^{(j - 1)} d_{p_{i}}^{(j - 1)} = d_{1}^{(j)} d_{2}^{(j)} \dots d_{n - L_{s}}^{(j)}

。在这些

ℓ_{i} - 1

次移位完成后,再移位一次,同时将

C_{i}

的最后一个符号输入缓冲区的

n - L_{s}

级。很容易验证,缓冲区的最终负载在其最后

ℓ_{i} = ℓ (S_{i})

个位置包含

S_{i}

。

The following example will serve to illustrate the mechanics of the algorithm. Consider the ternary (

α = 3

) input string
下面的示例将说明该算法的机制。考虑三进制（

α = 3

）输入字符串

S = 001010210210212021021200 \dots

and an encoder with parameters

L_{s} = 9

and

n = 18

. (These parameters were chosen to simplify the illustration; they do not reflect the design considerations to be discussed in Section III.) According to (1), the corresponding codeword length is given by
以及一个参数为

L_{s} = 9

和

n = 18

的编码器。(这些参数是为了简化说明而选择的;它们并不反映第三节中要讨论的设计考虑因素。)根据(1),相应的码字长度由

L_{c} = 1 + \log_{3} (18 - 9) + \log_{3} 9 = 5 .

Initially, the buffer is loaded with

n - L_{s} = 9

zeros, followed by the first

L_{s} = 9

digits of

S

, namely,
最初，缓冲区被加载了

n - L_{s} = 9

个零，然后是

S

的前

L_{s} = 9

个数字，即

B_{1} = \underset{n - L_{s} = 9}{\underset{⏟}{000000000}} \underset{L_{s} = 9}{\underset{⏟}{001010210}} .

To determine the first source word

S_{1}

, we have to find the longest prefix

B_{1} (10, 9 + ℓ_{1} - 1)

of
要确定第一个源词语

S_{1}

，我们必须找到最长的前缀

B_{1} (10, 9 + ℓ_{1} - 1)

B_{1} (10, 17) = 00101021

which matches a substring of

B_{1}

that starts in position

p_{1}

\leq 9

and then set

S_{1} = B_{1} (10, 9 + ℓ_{1})

. It is easily seen that the longest match in this case is

B_{1} (10, 11) = 00

, and hence

S_{1} = 001

and

ℓ_{1} = 3

. The pointer

p_{1}

for this step can be any integer between one and nine; we choose to set

p_{1} = 9

. The two-digit radix-3 representation of

p_{1} - 1

C_{11} = 22

, and that of

ℓ_{1} - 1

C_{12} = 02

. Since

C_{i 3}

is always equal to the last symbol of

S_{i}

, the codeword for

S_{1}

is given by

C_{1} =

22021.
该匹配

B_{1}

的子字符串,从位置

p_{1}

\leq 9

开始,然后设置

S_{1} = B_{1} (10, 9 + ℓ_{1})

。很容易看出,在这种情况下最长的匹配是

B_{1} (10, 11) = 00

,因此

S_{1} = 001

和

ℓ_{1} = 3

。这一步中指针

p_{1}

可以是 1 到 9 之间的任意整数;我们选择设置

p_{1} = 9

。

p_{1} - 1

的二进制三进位表示是

C_{11} = 22

ℓ_{1} - 1

的二进制三进位表示是

C_{12} = 02

。由于

C_{i 3}

始终等于

S_{i}

的最后一个符号,

S_{1}

的码字是由

C_{1} =

22021 给出的。

To obtain the buffer load

B_{2}

for the second step, we shift out the first

ℓ_{1} = 3

digits of

B_{1}

and feed in the next 3 digits

S (10, 12) = 210

of the input string

S

. The details of steps 2,3 , and 4 are tabulated below, where pointer positions are indicated by arrows and where the source words

S_{i}

are indicated by the italic substring of the corresponding buffer load

B_{i}

要获得第二步的缓冲区负载

B_{2}

，我们将

B_{1}

的前

ℓ_{1} = 3

位移出并输入输入字符串

S

的下 3 位

S (10, 12) = 210

。下表列出了步骤 2、3 和 4 的详细信息，其中指针位置由箭头指示，源词

S_{i}

由相应缓冲区负荷

B_{i}

的斜体子字符串表示。

\begin{array}{cc} ↓ \\ B_{2} & = 000000001010210210, \\ ↓ & C_{2} = 21102 \\ B_{3} & = 000010102102102120, \end{array} C_{3} = 20212 .

III. Compression of CONSTrAined Sources
受约束源的压缩

In this section, we investigate the performance of the proposed compression algorithm with respect to a nonprobabilistic model of constrained information sources. After defining the source model, we derive lower bounds on the compression ratios attainable by block-to-variable and variable-to-block codes under full knowledge of the source, and then show that the compression ratio achieved by our universal code approaches these bounds.
在本节中,我们研究了所提出的压缩算法在受约束信息源的非概率模型下的性能。在定义源模型后,我们推导出在完全了解源的情况下,块变长码和变长块码所能达到的压缩率下限,然后证明我们的通用码所实现的压缩率接近这些下限。

A. Definition of the Source Model
源模型的定义

Let

A = {0, 1, \dots, α - 1}

be the given

α

-symbol alphabet, and let

A^{*}

denote the set of all finite strings over

A

. Given a string

S \in A^{*}

and a positive integer

m \leq ℓ (S)

, let

S {m}

denote the set of all substrings of length

m

contained in

S

, and let

S (m)

denote the cardinality of

S {m}

. That is,
令

A = {0, 1, \dots, α - 1}

为给定的

α

字符集, 令

A^{*}

表示

A

上的所有有限字符串的集合. 给定一个字符串

S \in A^{*}

和一个正整数

m \leq ℓ (S)

, 令

S {m}

表示

S

中长度为

m

的所有子串的集合, 令

S (m)

表示

S {m}

的基数.

S {m} = ⋃_{i = 0}^{ℓ (S) - m} S (i + 1, i + m)

and 和

S (m) = | S {m} |

Given a subset

σ

A^{*}

, let
给定

σ

是

A^{*}

的子集, 令

σ {m} = {S \in σ ∣ ℓ (S) = m}

and let

σ (m)

denote the cardinality of

σ {m}

.
并且让

σ (m)

表示

σ {m}

的基数。
A subset

σ

A^{*}

is called a source if the following three properties hold:
子集

σ

称为源，如果满足以下三个属性:

$A \subset σ$ (i.e., $σ$ contains all the unit length strings),
$A \subset σ$ （即， $σ$ 包含所有单位长度的字符串），
$S \in σ$ implies $S S \in σ$ ,
$S \in σ$ 意味着 $S S \in σ$
$S \in σ$ implies $S {m} \subset σ {m}$ .
$S \in σ$ 意味着 $S {m} \subset σ {m}$ 。

Typically, such a source

σ

is defined by specifying a finite set of strings over

A

which are forbidden to appear as. substrings of elements belonging to

σ

, and therefore

σ (m)

< α^{m}

for all

m

exceeding some

m_{0}

.
通常,这种来源

σ

是通过指定一组有限的字符串,它们不能作为

σ

中元素的子串而被定义的,因此

σ (m)

< α^{m}

所有

m

超过某个

m_{0}

。

With every source

σ

, we associate a sequence

h (1)

h (2), \dots

of parameters, called the

h

-parameters of

σ

, where

^{1}

对于每个源

σ

，我们都会关联一个参数序列

h (1)

h (2), \dots

，称为

h

参数，其中

^{1}

h (m) = \frac{1}{m} \log σ (m), m = 1, 2, \dots

It is clear that

0 \leq h (m) \leq 1

for all

m

and, by 2 ) it is also clear that

m h (m)

is a nondecreasing function of

m

. The sequence of

h

-parameters, however, is usually nonincreasing in

m

. To avoid any possible confusion in the sequel, we postulate this property as an additional defining property of a source. Namely, we require
很明显,

0 \leq h (m) \leq 1

对于所有

m

,并且,根据 2)也很明显,

m h (m)

是

m

的非递减函数。但是,

h

参数的序列通常在

m

中是非递增的。为了避免后续可能产生的任何困惑,我们将此特性作为源的额外定义属性来要求。
4)

h (m) = 1 / m \log σ (m)

is a nonincreasing function of

m

.
4)

h (m) = 1 / m \log σ (m)

是

m

的非递增函数。

B. Some Lower Bounds on the Compression Ratio
压缩率的一些下限

Consider a compression coding scheme for a source

σ

which employs a block-to-variable (

B V

) code book of

M

pairs

(X_{i}, Y_{i})

of words over

A

, with

ℓ (X_{i}) = L

for

i =

1, 2, \dots, M

. The encoding of an infinitely long string

S \in

σ

by such a code is carried out by first parsing

S

into blocks of length

L

, and then replacing each block

X_{i}

by the corresponding codeword

Y_{i}

. It is assumed, of course, that the code book is exhaustive with respect to

σ

and uniquely decipherable [2]. Hence, we must have
考虑一个源码

σ

的压缩编码方案,它使用了一个块到变长(

B V

)代码本,其中包括

M

对

(X_{i}, Y_{i})

单词,这些单词来自

A

,且

ℓ (X_{i}) = L

为

i =

1, 2, \dots, M

。通过这种编码方式,无限长的字符串

S \in

σ

会被编码成由长度为

L

的块

X_{i}

组成,每个块被替换为相应的代码字

Y_{i}

。当然,假设代码本针对

σ

是充分的,并且是唯一可解的[2]。因此,我们必须有:

{X_{i}}_{i = 1}^{M} = σ {L}

or 或

M = σ (L) = α^{L h (L)}

and 和

max_{1 \leq i \leq M} {ℓ (Y_{i})} \geq \log M = L h (L)

The compression ratio

ρ_{i}

associated with the

i

th wordpair of the code is given by
相关于代码的第

i

对单词对的压缩比为

ρ_{i} = \frac{ℓ (Y_{i})}{L}

The

B V

compression ratio,

ρ_{B V} (σ, M)

, of the source

σ

. is defined as the minimax value of

ρ_{i}

, where the maximization is over all word-pairs of a given code, and the minimization is over the set

C_{B V} (σ, M)

of all

B V

code books consisting of

M

word-pairs. Thus,
源码

σ

的

B V

压缩比

ρ_{B V} (σ, M)

定义为

ρ_{i}

的极小极大值,其中最大化是针对给定编码的所有词对,而最小化是针对所有由

M

个词对组成的

B V

编码本集合

C_{B V} (σ, M)

。

\begin{aligned} \begin{aligned} ρ_{B V} (σ, M) = & min_{C_{B V} (σ, M)} max_{1 \leq i \leq M} {\frac{ℓ (Y_{i})}{L}} \\ \geq \frac{\log M}{L} = \frac{L h (L)}{L} = h (L) \end{aligned} \\ ^{1} Throughout this paper, log x means the base- α logarithm of x . \end{aligned}

For later reference, we record this result in the following lemma.
为了后续参考,我们在下面的引理中记录了这个结果。

Lemma 1: 引理 1:

ρ_{B V} (σ, M) \geq h (L)

where 哪里

L h (L) = \log M

Now, consider a compression scheme which employs a variable-to-block (

V B

) code book of

M

word-pairs (

X_{i}, Y_{i}

), with

ℓ (Y_{i}) = L

for all

i = 1, 2, \dots, M

. In this case, the compression ratio associated with the

i

th word-pair is given by
现在,考虑一个使用可变到块(v-b)代码簿的压缩方案,其中包含 n 个单词对,其中所有单词的长度均为 m。在这种情况下,与第 k 个单词对相关的压缩比给出为

ρ_{i} = \frac{L}{ℓ {(X_{i})}^{'}}

and similarly, the

V B

compression ratio

ρ_{V B} (σ, M)

σ

is defined as the minimax value of

ρ_{i}

over all word-pairs and over the set

C_{V B} (σ, M)

V B

code books with

M

wordpairs.
同样地，

V B

压缩比

ρ_{V B} (σ, M)

被定义为

σ

上的极值，其中涉及所有单词对以及

C_{V B} (σ, M)

中的

V B

个代码本，每个代码本包含

M

个单词对。

Lemma 2: 引理 2:

ρ_{V B} (σ, M) \geq h (L_{M})

where 哪里

L_{M} = max {ℓ ∣ M \geq σ (ℓ)}

Proof: We may assume, without loss of generality, that in every code under consideration
我们可以设无损概括地说,在每个考虑的代码中

ℓ (X_{1}) \leq ℓ (X_{2}) \leq \dots \leq ℓ (X_{M})

and hence for each

C \in C_{V B} (σ, M)

,
因此对于每个

C \in C_{V B} (σ, M)

max ρ_{i} (C) = \frac{L (C)}{ℓ (X_{1})}

Since

C

is exhaustive with respect to

σ

, we have
由于

C

对

σ

是广泛的,我们有

M \geq σ (ℓ_{1})

where

ℓ_{1} = ℓ (X_{1})

; and since

C

is uniquely decipherable, we have
在

ℓ_{1} = ℓ (X_{1})

；由于

C

是唯一可解码的，我们有

L (C) \geq \log M

From the definition of

L_{M}

, inequality (6), and the nondecreasing property of

σ (ℓ)

, we obtain
从

L_{M}

的定义、不等式(6)和

σ (ℓ)

的非递减性质可以得出

ℓ_{1} \leq L_{M}

From (5), (7), and (8), we have
从(5)、(7)和(8)式中，我们有

max ρ_{i} (C) \geq \frac{\log M}{L_{M}}

and since 而且

M \geq σ (L_{M}) = α^{L_{M} h (L_{M})}

it follows that for each

C \in C_{V B} (σ, M)

,
因此对于每个

C \in C_{V B} (σ, M)

而言，

max ρ_{i} (C) \geq h (L_{M})

Q.E.D. 证明完毕。

Remarks 评论

Since the value of $L$ in the context of Lemma 1 satisfies the definition of $L_{M}$ as given in Lemma 2, it follows that the bounds of both lemmas are essentially the same, despite the basic difference between the respective coding schemes.
鉴于在引理 1 的上下文中 $L$ 的值满足引理 2 给出的 $L_{M}$ 的定义,因此两个引理的界限本质上是相同的,尽管它们各自的编码方案存在基本差异。
By 2), the second defining property of a source, the per-word bounds derived above apply to indefinitely long messages as well, since the whole message may consist of repeated appearances of the same worst case word.
根据第二个定义的源特性，上述推导的逐词界限也适用于无限长的消息，因为整个消息可能由同一个最坏情况词的重复出现组成。
By 4), the nonincreasing property of the $h$ -parameters, the form of the derived bounds confirms the intuitive expectation that an increase in the size $M$ of the employed code book causes a decrease in the lower bound on the attainable compression ratio.
根据 4)中 $h$ 参数的非增性质,衍生界限的形式确认了直观预期,即所采用的代码本尺寸 $M$ 增加会导致可实现压缩比的下界降低。

C. Performance of the Proposed Algorithm
提议算法的性能

We proceed now to derive an upper bound on the compression ratio attainable by the algorithm of Section II. To this end, we consider the worst case source message of length

n - L_{s}

, where

n

is the prospective buffer length and

L_{s}

is the maximal word-length. The bound obtained for this case will obviously apply to all messages of length

n

L_{s}

or greater.
我们现在要推导第二节算法所能达到的最大压缩比。为此,我们考虑长度为

n - L_{s}

First, we assume that only the

h

-parameters of the source under consideration are known to the designer of the encoder. Later, when we discuss the universal performance of the proposed algorithm, we will show that even this restricted

a

priori knowledge of the source is actually unessential.
首先,我们假设只有

h

参数对于编码器的设计者是已知的。当我们讨论所提出算法的整体性能时,我们将证明即使这种受限的

a

先验知识也是不必要的。

We begin by choosing the buffer length

n

to be an integer of the form
我们首先选择缓冲区长度

n

为以下形式的整数

n = \sum_{m = 1}^{λ} m α^{m} + \sum_{m = λ + 1}^{ℓ} m σ (ℓ) + (ℓ + 1) (N_{ℓ + 1} + 1)

where 哪里

N_{ℓ + 1} = \sum_{m = 1}^{λ} (ℓ - m) α^{m} + \sum_{m = λ + 1}^{ℓ} (ℓ - m) σ (ℓ)

λ = ⌊ ℓ h (ℓ) ⌋

, the integer part of

\log σ (ℓ) = ℓ h (ℓ)

, and

λ = ⌊ ℓ h (ℓ) ⌋

，

\log σ (ℓ) = ℓ h (ℓ)

的整数部分，和

ℓ = L_{s} - 1

The specific valuc of the parameter

L_{s}

is left for later determination (see the first remark following the proof of Theorem 1). The reasoning that motivates the given form of

n

will become clear from subsequent derivations.
参数

L_{s}

的具体值将留待以后确定（见定理 1 证明后的第一条备注）。给定

n

形式的推导过程将在后续推导中阐明。

Consider a string

S \in σ {n - L_{s}}

, and let

N (S)

denote the number of words into which

S

is parsed by the algorithm during the encoding process. Recalling that each of these words is mapped into a codeword of fixed length

L_{c}

(see (1)), it follows that the compression ratio

ρ (S)

associated with the string

S

is given by
考虑一个字符串

S \in σ {n - L_{s}}

，令

N (S)

表示在编码过程中算法将其解析为的字词数。回顾每个字词都被映射为固定长度

L_{c}

的码字(见(1))，则与字符串

S

Remarks 评论

The value of $ϵ (L_{s})$ decreases with $L_{s}$ and, consequently, the compression ratio $ρ$ approaches the value of $h (L_{s} - 1)$ , the $h$ -parameter associated with the second largest word-length processed by the encoder. Given any $δ > 0$ , one can always find the least integer $ℓ_{s}$ such that $ρ - h (ℓ_{s} - 1) \leq δ$ . The magnitude of the acceptable de-
值 $ϵ (L_{s})$ 随 $L_{s}$ 而降低,因此压缩率 $ρ$ 接近 $h (L_{s} - 1)$ 的值,这是编码器处理的第二大单词长度相关的 $h$ 参数。对于任何给定的 $δ > 0$ ,总可以找到最小整数 $ℓ_{s}$ 使得 $ρ - h (ℓ_{s} - 1) \leq δ$ 成立。可接受的解压缩幅度的大小
viation $δ$ determines the operational range of $L_{s}$ , namely, $L_{s} \geq ℓ_{s}$ .
航空 $δ$ 决定了 $L_{s}$ 的操作范围,即 $L_{s} \geq ℓ_{s}$ 。
Since our code maps source words of variable length at most $L_{s}$ into codewords of fixed length $L_{c}$ , we adopt as a reference for comparison the best $V B$ code $C_{V B}$ discussed in Subsection III-B. The counterpart of our block-length $L_{c}$ is $L (C)$ of (5), and by (7) we can write
由于我们的代码将长度最长为 $L_{s}$ 的源词映射到长度固定为 $L_{c}$ 的码字,我们采用第 III-B 小节中讨论的最佳 $V B$ 码 $C_{V B}$ 作为比较参考。我们的块长 $L_{c}$ 的对应项是(5)中的 $L (C)$ ,根据(7)我们可以写为

L_{c} \approx \log M \approx L_{M} h (L_{M})

where

M

is the code book size and

h (L_{M})

is the lower bound (see Lemma 2) on the compression ratio

ρ_{V B}

attainable by

C_{V B}

. Since for sufficiently large

L_{s}

, we have also
其中

M

是码书大小,

h (L_{M})

是可实现的压缩比下限 (见引理 2),

ρ_{V B}

是可实现的压缩比. 当

L_{s}

足够大时, 还有

L_{c} \approx (L_{s} - 1) ρ \approx (L_{s} - 1) h (L_{s} - 1),

it follows that

ρ \approx ρ_{V B}

.
故此，

ρ \approx ρ_{V B}

。
We turn now to investigate the universal performance of the proposed algorithm, i.e., the performance when no a priori knowledge of the source to be compressed is available.
我们现在转向研究所提出算法的普遍性能,即在没有预先了解要压缩的源信息的情况下的性能。

Given

δ_{1}

and

h_{1}

such that

0 < δ_{1} < h_{1} < 1

, let

ℓ_{1}

be the least positive integer satisfying
给定

δ_{1}

和

h_{1}

满足

0 < δ_{1} < h_{1} < 1

条件,求

ℓ_{1}

的最小正整数值

δ_{1} \geq ϵ (ℓ_{1} + 1) = \frac{1}{ℓ_{1}} (3 + 3 \log ℓ_{1} + \log \frac{ℓ_{1} + 1}{2})

and let

K = α^{ℓ_{1} h_{1}}

and

λ_{1} = ⌊ ℓ_{1} h_{1} ⌋

. Consider the encoder

E_{1}

which employs a buffer of length

n_{1} = n (h_{1}, ℓ_{1})

, obtained from (9) and (10) by setting

ℓ = ℓ_{1}, λ = λ_{1}

, and

σ (ℓ)

= K

, and whose maximal word-length

L_{s}

is equal to

ℓ_{1} +

1.
并让

K = α^{ℓ_{1} h_{1}}

和

λ_{1} = ⌊ ℓ_{1} h_{1} ⌋

。考虑采用长度为

n_{1} = n (h_{1}, ℓ_{1})

的缓冲区的编码器

E_{1}

,该缓冲区由(9)和(10)设置

ℓ = ℓ_{1}, λ = λ_{1}

和

σ (ℓ)

= K

获得,其最大字长

L_{s}

等于

ℓ_{1} +

1。

It follows from Theorem 1 that if this encoder is applied to a source

σ_{1}

such that

h_{σ_{1}} (ℓ_{1}) = h_{1}

, then the resulting compression ratio

ρ_{1} (σ_{1})

satisfies
根据定理 1，如果对源

σ_{1}

应用此编码器，使得

h_{σ_{1}} (ℓ_{1}) = h_{1}

，则所得压缩比

ρ_{1} (σ_{1})

满足

ρ_{1} (σ_{1}) \leq h_{σ_{1}} (ℓ_{1}) + δ_{1} = h_{1} + δ_{1}

Suppose now that

h_{1}

and

δ_{1}

were chosen to fit the prescribed upper value

ρ_{1} = h_{1} + δ_{1}

of a prospective compression ratio range (

ρ_{0}, ρ_{1}

) with
假设现在

h_{1}

和

δ_{1}

被选择来适应预期压缩比范围 (

ρ_{0}, ρ_{1}

) 的规定上限

ρ_{1} = h_{1} + δ_{1}

0 < δ_{1} R < ρ_{0} < ρ_{1} < 1, R > 1

where

R

is an adjustment parameter to be determined later. As shown above, the encoder

E_{1}

is then matched exactly to the upper value

ρ_{1}

of the prospective compression range. In order to accommodate the whole given range, we propose to employ a slightly larger encoder

E_{0}

whose buffer is of length

n_{0} = n_{1} - ℓ_{1} + ℓ_{0}

, where

n_{1}

and

ℓ_{1}

are the parameters of

E_{1}

, and where

ℓ_{0}

is an integer, greater than

ℓ_{1}

, for which the solution

h_{0}

of the equation
其中

R

是稍后确定的调整参数。如上所示,编码器

E_{1}

精确地匹配于预期压缩范围的上限值

ρ_{1}

。为了容纳整个给定范围,我们建议使用略大的编码器

E_{0}

,其缓冲区长度为

n_{0} = n_{1} - ℓ_{1} + ℓ_{0}

,其中

n_{1}

和

ℓ_{1}

是

E_{1}

的参数,而

ℓ_{0}

是一个大于

ℓ_{1}

的整数,其方程的解为

h_{0}

。

n_{1} - ℓ_{1} + ℓ_{0} = n (h_{0}, ℓ_{0})

satisfies 满足

ρ_{0} - ϵ (ℓ_{0}) < h_{0} \leq ρ_{0} - ϵ (ℓ_{0} + 1)

Noting that

n_{0} - ℓ_{0} = n_{1} - ℓ_{1}

is fixed, it is clear from (9) and (10) that as

ℓ_{0}

increases

h_{0}

decreases; also, (21) and the fact that

ℓ_{0} > ℓ_{1}

imply that

ρ_{0} - ϵ (ℓ_{0} + 1) > ρ_{0} - ϵ (ℓ_{1} + 1)

\geq ρ_{0} - δ_{1} > 0

. Hence, there exist

h_{0} > 0

and

ℓ_{0} > ℓ_{1}

that satisfy both (22) and (23).
注意到

n_{0} - ℓ_{0} = n_{1} - ℓ_{1}

是固定的,从(9)和(10)可以清楚地看出,随着

ℓ_{0}

的增加,

h_{0}

会减少;同时,(21)和

ℓ_{0} > ℓ_{1}

的事实也意味着

ρ_{0} - ϵ (ℓ_{0} + 1) > ρ_{0} - ϵ (ℓ_{1} + 1)

\geq ρ_{0} - δ_{1} > 0

。因此,存在

h_{0} > 0

和

ℓ_{0} > ℓ_{1}

同时满足(22)和(23)。

In analogy with (20), it is also clear that if

E_{0}

is applied to a source

σ_{0}

such that

h_{σ_{0}} (ℓ_{0}) = h_{0}

, then the resulting compression ratio

ρ_{0} (σ_{0})

satisfies
根据(20)的类比,如果对源

σ_{0}

应用

E_{0}

,使得

h_{σ_{0}} (ℓ_{0}) = h_{0}

,则所得的压缩比

ρ_{0} (σ_{0})

满足

ρ_{0} (σ_{0}) \leq h_{σ_{0}} (ℓ_{0}) + δ_{0} = h_{0} + δ_{0}

where

δ_{0} = ϵ (ℓ_{0} + 1)

. 在哪里

δ_{0} = ϵ (ℓ_{0} + 1)

。
From (23) and (24), we have

ρ_{0} (σ_{0}) \leq ρ_{0} - ϵ (ℓ_{0} + 1) + δ_{0}

= ρ_{0}

, and
从(23)和(24)中,我们有

ρ_{0} (σ_{0}) \leq ρ_{0} - ϵ (ℓ_{0} + 1) + δ_{0}

= ρ_{0}

δ_{0} \leq ρ_{0} - h_{0} < ϵ (ℓ_{0}) \leq ϵ (ℓ_{1} + 1) \leq δ_{1} < \frac{1}{R} ρ_{0}

Hence,

h_{0}

can be made arbitrarily close to

ρ_{0}

, and consequently, the encoder

E_{0}

matches as closely as desired the lower end

ρ_{0}

of the given compression range.
因此,

h_{0}

可以任意接近

ρ_{0}

,因此,编码器

E_{0}

可以匹配给定压缩范围的下端

ρ_{0}

。

Theorem 2: Let

σ

be a source for which a matched encoder

E

achieves a compression ratio

ρ (σ)

within the range

(ρ_{0}, ρ_{1})

. Then the compression ratio

ρ_{0} (σ)

, achieved for

σ

E_{0}

, satisfies
定理 2：设

σ

为一个源，其匹配编码器

E

可在范围

(ρ_{0}, ρ_{1})

内实现压缩比

ρ (σ)

。那么由

E_{0}

为

σ

实现的压缩比

ρ_{0} (σ)

满足

ρ_{0} (σ) \leq ρ (σ) + Δ,

where 哪里

Δ \leq \frac{⌈ \log d ⌉}{ℓ_{1}} d = max {\frac{h_{1}}{h_{0}}, \frac{1}{1 - h_{0}}}

(Typically,

(h_{1} / h_{0}) > (1 / 1 - h_{0})

and

d = (h_{1} / h_{0})

.)
(通常情况下，

(h_{1} / h_{0}) > (1 / 1 - h_{0})

和

d = (h_{1} / h_{0})

。)
Proof: To prove the theorem we shall consider the obviously worst case of applying

E_{0}

to the source

σ_{1}

whose matched encoder

E_{1}

realizes

ρ_{1}

.
证明:为了证明该定理,我们将考虑应用

E_{0}

到其编码器

E_{1}

实现

ρ_{1}

的源

σ_{1}

的显而易见的最坏情况。

Let

ρ_{0} (σ_{1})

denote the compression ratio achievable by

E_{0}

when applied to

σ_{1}

. According to (12), we have
令

ρ_{0} (σ_{1})

表示在应用于

σ_{1}

时

E_{0}

可达到的压缩率。根据(12)式,我们有

ρ_{0} (σ_{1}) = \frac{L_{c 0}}{n_{0} - (ℓ_{0} + 1)} N_{0} (σ_{1})

where 哪里

L_{c i} = 1 + ⌈ \log (ℓ_{i} + 1) ⌉ + ⌈ \log (n_{i} - ℓ_{i} - 1) ⌉, i \in {0, 1}

and

N_{i} (σ_{j}), i, j \in {0, 1}

, is the maximum number of words into which a string

S \in σ_{j} {n_{i} - ℓ_{i} - 1}

is parsed by

E_{i}

.
和

N_{i} (σ_{j}), i, j \in {0, 1}

，是将字符串

S \in σ_{j} {n_{i} - ℓ_{i} - 1}

通过

E_{i}

解析的最大单词数。

Since

n_{0} - ℓ_{0} = n_{1} - ℓ_{1}

and

ℓ_{0} > ℓ_{1}

, it is easy to verify that

^{2} N_{0} (σ_{1}) \leq N_{1} (σ_{1})

. Also by (16),
自从

n_{0} - ℓ_{0} = n_{1} - ℓ_{1}

和

ℓ_{0} > ℓ_{1}

，很容易验证

^{2} N_{0} (σ_{1}) \leq N_{1} (σ_{1})

。同时由式(16)可得，

N_{1} (σ_{1}) \leq N_{1}^{'} (σ_{1}) = \frac{n_{1} - (ℓ_{1} + 1)}{ℓ_{1}} = \frac{n_{0} - (ℓ_{0} + 1)}{ℓ_{1}}

Hence 因此

\begin{aligned} ρ_{0} (σ_{1}) & \leq \frac{L_{c 0}}{ℓ_{1}} = \frac{L_{c 1}}{ℓ_{1}} + \frac{L_{c 0} - L_{c 1}}{ℓ_{1}} \\ \leq ρ_{1} + \frac{L_{c 0} - L_{c 1}}{ℓ_{1}} \end{aligned}

and since 而且

L_{c 0} - L_{c 1} = ⌈ \log (ℓ_{0} + 1) ⌉ - ⌈ \log (ℓ_{1} + 1) ⌉ \leq ⌈ \log k ⌉,

where

k = (ℓ_{0} / ℓ_{1})

, we obtain
其中

k = (ℓ_{0} / ℓ_{1})

，我们得到

ρ_{0} (σ_{1}) \leq ρ_{1} + \frac{1}{ℓ_{1}} ⌈ \log k ⌉ .

To obtain an upper bound on

k

, we first observe that
要获得

k

的上界,我们首先观察到

\begin{aligned} ℓ_{0} \sum_{m = λ_{0} + 1}^{ℓ_{0}} (ℓ_{0} - m & + 1) α^{ℓ_{0} h_{0}} \leq n_{0} - ℓ_{0} \\ = n_{1} - ℓ_{1} \leq ℓ_{1} \sum_{m = 1}^{ℓ_{1}} (ℓ_{1} - m + 1) α^{ℓ_{1} h_{1}} \end{aligned}

which reduces to 减为

k^{2} {(1 - h_{0})}^{2} \leq α^{ℓ_{1} h_{1} (1 - k (h_{0} / h_{1}))}

Now, either

k (1 - h_{0}) < 1

, or else the exponent on the right side of (26) must be nonnegative and thus

k \leq (h_{1} /

h_{0}

). In either case,
现在,要么

k (1 - h_{0}) < 1

,否则(26)右侧的指数必须是非负的,因此

k \leq (h_{1} /

h_{0}

) 。在任何一种情况下,

k \leq max {\frac{h_{1}}{h_{0}}, \frac{1}{1 - h_{0}}} = d

Q.E.D. 证明完毕。

Theorems 1 and 2 demonstrate the efficiency and universality of the proposed algorithm. They show that an encoder designed to operate over a prescribed compression range performs, practically, as well as one designed to match a specific compression ratio within the given range.
定理 1 和 2 证明了所提出的算法的效率和广泛适用性。它们表明，设计用于指定压缩范围的编码器在实际应用中的表现与专门针对给定范围内的特定压缩比设计的编码器相当。

Moreover, for given complexity i.e., a given codeword length, the compression efficiency is comparable to that of an optimal variable-to-block code book designed to match a given source.
此外,对于给定的复杂性,即给定的码字长度,压缩效率与设计用于匹配给定源的最佳可变到块码本相当。

REFERENCES 参考文献

[1] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proc. IRE, vol. 40, pp. 1098-1101, 1952.
[1] 赫夫曼, "最小冗余码构建方法"，《IRE 杂志》，第 40 卷，第 1098-1101 页，1952 年。
[2] R. M. Karp, “Minimum-redundancy coding for the discrete noiseless channel,” IRE Trans. Inform. Theory, vol. IT-17, pp. 27-38, Jan. 1961.
[2] R. M. Karp, "最小冗余编码离散无噪信道," IRE 信息理论会议论文集, 第 IT-17 卷, 第 27-38 页, 1961 年 1 月.
[3] B. F. Varn, “Optimal variable length codes,” Inform. Contr., vol. 19, pp. 289-301, 1971.
[3] B. F. Varn, "可变长度码的最优编码,"《信息控制》,第 19 卷,289-301 页,1971 年。
[4] Y. Perl, M. R. Gary, and S. Even, “Efficient generation of optimal prefix code: Equiprobable words using unequal cost letters,” J. ACM, vol. 22, pp. 202-214, April 1975.
[4] 玉, M. R. Gary, 和 S. Even, "高效生成最佳前缀码: 使用不等成本字母的等概率单词," J. ACM, 第 22 卷, 第 202-214 页, 1975 年 4 月。
[5] A. Lempel, S. Even, and M. Cohn, “An algorithm for optimal prefix parsing of a noiseless and memoryless channel,” IEEE Trans. Inform. Theory, vol. IT-19, pp. 208-214, March 1973.
[5] A. 勒姆泊尔, S. 伊文, 以及 M. 科恩, "一种无噪音和无记忆信道最优前缀解析算法", IEEE 信息理论学报, 第 19 卷, 第 208-214 页, 1973 年 3 月。
[6] F. Jelinek and K. S. Schneider, “On variable length to block coding,” IEEE Trans. Inform. Theory, vol. IT-18, pp. 765-774, Nov. 1972.
[6] F. Jelinek 和 K. S. Schneider, "关于变长编码到定长编码的转换," IEEE 信息理论学报, 卷 IT-18, 页 765-774, 1972 年 11 月.
[7] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[7] R. G. 高勒,信息论与可靠通信。纽约:威利,1968。
[8] J. Ziv, “Coding of sources with unknown statistics—Part I; Probability of encoding error,” IEEE Trans. Inform. Theory, vol. IT-18, pp. 384-394, May 1972.
[8] J. Ziv, "未知统计源编码—第一部分；编码误差概率," IEEE 信息理论交易, vol. IT-18, 第 384-394 页, 1972 年 5 月.
[9] L. D. Davisson, “Universal noiseless coding,” IEEE Trans. Inform. Theory, vol. IT-19, pp. 783-795, Nov. 1973.
[9] L. D. Davisson, "通用无噪编码," IEEE 信息理论汇刊, vol. IT-19, pp. 783-795, 1973 年 11 月.
[10] A. Lempel and J. Ziv, “On the complexity of finite sequences,” IEEE Trans. Inform. Theory, vol. IT-22, pp. 75-81, Jan. 1976.
[10] A. 伦佩尔和 J. 齐夫,"论有限序列的复杂性,"IEEE 信息论交易杂志,第 IT-22 卷,第 75-81 页,1976 年 1 月。
[11] B. M. Fitingof, “Optimal coding in the case of unknown and changing message statistics,” Prob. Inform. Transm., vol. 2, pp. 3-11, 1966.
[11] B. M. 费廷诺夫, "在未知和变化的消息统计情况下的最优编码,"《信息传输问题》, 第 2 卷, 第 3-11 页, 1966 年。

Manuscript received June 23, 1975; revised July 6, 1976. Paper previously presented at the IEEE International Symposium on Information Theory, Ronneby, Sweden, June 21-24, 1976.
手稿于 1975 年 6 月 23 日收到;1976 年 7 月 6 日修订。该论文曾在 1976 年 6 月 21 日至 24 日于瑞典隆内比举行的 IEEE 国际信息理论研讨会上进行过演讲。
J. Ziv was with the Department of Electrical Engineering, TechnionIsrael Institute of Technology, Haifa, Israel. He is now with the Bell Telephone Laboratories, Murray Hill, NJ 07974.
齐郑六治已与以色列海法技术学院电气工程系工作,现任职于美国新泽西州默里山贝尔电话实验室。
A. Lempel was with the Department of Electrical Engineering, Tech-nion-Israel Institute of Technology, Haifa, Israel. He is now with the Sperry Research Center, Sudbury, MA 01776.
列佩尔先生曾在以色列海法技术学院电气工程系工作,目前他在马萨诸塞州苏德伯里斯皮里研究中心任职。
2 The assertion here is analogous to that of $[10$ , theorem 1].
这里的断言类似于[ $[10$ ，定理 1]。

A Universal Algorithm for Sequential Data Compression 用于顺序数据压缩的通用算法

Abstract 摘要

I. INTRODUCTION 一、 引言

II. The Compression Algorithm压缩算法

III. Compression of CONSTrAined Sources受约束源的压缩

A. Definition of the Source Model源模型的定义

B. Some Lower Bounds on the Compression Ratio压缩率的一些下限

Lemma 1: 引理 1:

Remarks 评论

C. Performance of the Proposed Algorithm提议算法的性能

Remarks 评论

REFERENCES 参考文献

A Universal Algorithm for Sequential Data Compression
用于顺序数据压缩的通用算法

I. INTRODUCTION 一、引言

II. The Compression Algorithm
压缩算法

III. Compression of CONSTrAined Sources
受约束源的压缩

A. Definition of the Source Model
源模型的定义

B. Some Lower Bounds on the Compression Ratio
压缩率的一些下限

C. Performance of the Proposed Algorithm
提议算法的性能