
Universal Text Preprocessing for Data Compression

Jürgen Abel, Member, IEEE, and William Teahan, Member, IEEE

Abstract

Several preprocessing algorithms for text files are presented which complement each other and which are performed prior to the compression scheme. The algorithms need no external dictionary and are language independent. The compression gain is compared, along with the costs in speed, for the BWT, PPM, and LZ compression schemes. The average overall compression gain is in the range of 3 to 5 percent for the text files of the Calgary Corpus and between 2 and 9 percent for the text files of the large Canterbury Corpus.

Index Terms: Algorithms, data compression, BWT, LZ, PPM, preprocessing, text compression.

1 Introduction

Today, the most popular schemes for lossless data compression are the Burrows-Wheeler Compression Algorithm (BWCA) [1], Prediction by Partial Matching (PPM) [2], and Lempel-Ziv (LZ) [3] schemes. The first two schemes are context related, whereas the LZ scheme is dictionary-based. Even though each of these schemes can be used to compress text files, they do not consider the special properties of textual data, and the compression rate can be enhanced by using preprocessing algorithms specialized for textual data. Text preprocessing algorithms are reversible transformations that are performed before the compression scheme at encoding time and after the decompression scheme at decoding time. Since textual data make up a substantial part of the Internet and other information systems, efficient compression of textual data is of significant practical interest.
This paper presents a number of text preprocessing algorithms, including a text recognition scheme and five separate algorithms: capital letter conversion, EOL coding, word replacement, phrase replacement, and alphabet reordering. The different techniques are processed sequentially in order to complement and boost each other. The approach presented here is a universal one which, in contrast to many others, needs no fixed external information like dictionaries. It is language independent, as long as the text is based on Latin letters, and nevertheless achieves a significant compression gain. The first four algorithms work with most universal compression schemes; the last algorithm is especially designed for sort-based schemes like the BWCA.
The impact of the text preprocessing algorithms is illustrated using the example of the 10 text files of the Calgary Corpus (bib, book1, book2, news, paper1, paper2, progc,
  • J. Abel is with the University Duisburg-Essen, Department of Communications Systems, Faculty of Engineering Sciences, Bismarckstrasse 81, D-47057 Duisburg, Germany. E-mail: juergen.abel@acm.org.
  • W. Teahan is with the School of Informatics, University of Wales, Bangor, Gwynedd LL57 1UT, UK. E-mail: wjt@informatics.bangor.ac.uk.
Manuscript received 2 Aug. 2003; revised 14 Sept. 2004; accepted 22 Dec. 2004; published online 16 Mar. 2005.

For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TC-0114-0803.

progl, progp, and trans) for the BWCA, PPM, and LZ compression schemes. The BWCA is represented by the program ABC 2.4 from Abel [4], the PPM algorithm is represented by the program PPMD+ from Teahan [5], and, for the LZ scheme, the program GZIP from Gailly is used [6].

2 Previous Work

The preprocessing of textual data for data compression is the subject of many publications. In some articles, the treatment of textual data is embedded within the compression scheme itself, but could easily be separated into two independent parts: a preprocessing algorithm and a standard compression algorithm, which are processed sequentially one after the other.
Bentley et al. describe a word-based compression scheme [7], where words are replaced by an index into an MTF (Move To Front) list. The dictionary of the words is transmitted implicitly by transmitting the word during its first occurrence. This scheme can be divided into a parsing part that performs the preprocessing and a standard MTF ranking scheme. A word-based variation of the PPM scheme is presented by Moffat [8]. He uses order-0, order-1, and order-2 word models to achieve better compression than the MTF scheme from Bentley et al. Similar schemes which differentiate between alphanumeric strings and punctuation strings and which also use an implicit dictionary are presented by Horspool and Cormack [9]. Again, these schemes can be divided into a parsing part and a coding part using Huffman codes.
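To make the word-based MTF idea of Bentley et al. concrete, here is a minimal Python sketch of such an encoder and decoder. The token format (plain integers for ranks, strings for first occurrences) is an illustrative assumption, not the exact coding of [7]:

```python
def mtf_word_encode(words):
    """Word-based MTF: a word seen before is replaced by its current
    rank in a move-to-front list; an unseen word is transmitted
    literally (implicit dictionary) and then moved to the front."""
    mtf, out = [], []
    for w in words:
        if w in mtf:
            i = mtf.index(w)
            out.append(i)      # rank token
            mtf.pop(i)
        else:
            out.append(w)      # literal token on first occurrence
        mtf.insert(0, w)
    return out

def mtf_word_decode(tokens):
    mtf, words = [], []
    for t in tokens:
        w = mtf.pop(t) if isinstance(t, int) else t
        words.append(w)
        mtf.insert(0, w)
    return words

text = "the cat sat on the mat the cat".split()
enc = mtf_word_encode(text)        # ['the', 'cat', 'sat', 'on', 3, 'mat', 1, 4]
assert mtf_word_decode(enc) == text
```

Frequently repeated words thus collapse to small integers, which a standard entropy coder can compress well.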
Teahan and Cleary describe several methods for enlarging the alphabet of the textual data [10]. Besides the replacement of common bigrams by a one-symbol token, they propose methods for encoding special forms of bigrams called digrams (two letters representing a single sound, as ea in "bread" or ng in "sing"). The replacements are processed using a fixed set of frequently used bigrams in the English language, which makes this approach language dependent. Teahan and Cleary [11] describe a word-based compression scheme where the word dictionary is adaptively built from the already processed input data. This can also be achieved by a preprocessing stage if the words are replaced by corresponding tokens. Teahan presents a further comparison between two different word-based compression schemes in his PhD thesis [12]. The first scheme uses function words, which include articles, prepositions, pronouns, numbers, conjunctions, auxiliary verbs, and certain irregular forms. The second scheme uses the most frequently used words in the English language. Both schemes require external dictionaries and are language dependent.
A special case of word encoding is the star encoding method from Franceschini and Mukherjee [13] and later from Kruse and Mukherjee [14]. This method replaces words by a symbol sequence that mostly consists of repetitions of the symbol "*". This requires the use of an external dictionary that must be known by the sender as well as the receiver. Inside the dictionary, the words are sorted first by their length and second by their frequency in the English language, using information obtained from Horspool and Cormack [9]. In [13], the words are encoded by sequences of the same length, where letters are replaced by "*", e.g., "b*D**". In [14], all words of the same length are encoded by the sequences "*...*", "A*...*", ..., "Z*...*", "a*...*", ..., "z*...*", ..., where the length of the encoded sequence is equal to the length of the word being encoded. The requirement of an external dictionary makes these methods again language dependent. Later, Sun et al. present an improved dictionary-based scheme called StarNT, which works with a ternary search tree [15] and is about twice as fast in encoding and about four times as fast in decoding as the former approach. The first 312 words of the dictionary are the most frequently used words of the English language. The rest of the dictionary is filled with words sorted by their length first and then by their frequency. This language dependent approach reaches an average compression gain of about 10 percent for PPM, 13 percent for BWCA, and 19 percent for LZ-based compression schemes.
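As a rough illustration of length-preserving star codes, the sketch below assigns to the words of each length codewords of that same length, handing out the all-star codeword first and then codewords with a single letter and stars elsewhere. The enumeration order and the frequency-sorted input are assumptions for illustration only; they do not reproduce the exact dictionaries of [13], [14], or StarNT:

```python
import string

LETTERS = string.ascii_uppercase + string.ascii_lowercase  # "A".."Z", "a".."z"

def star_codewords(length):
    """Yield length-preserving codewords: the all-star word first, then
    words with one letter and stars elsewhere (assumed order)."""
    yield "*" * length
    for pos in range(length):
        for c in LETTERS:
            yield "*" * pos + c + "*" * (length - pos - 1)
    # codewords with two or more letters are omitted in this sketch

def build_star_dict(words_by_frequency):
    """Assign each word a codeword of equal length; the input is assumed
    to be sorted most frequent first, as in the external dictionary."""
    gens, code = {}, {}
    for w in words_by_frequency:
        if len(w) not in gens:
            gens[len(w)] = star_codewords(len(w))
        code[w] = next(gens[len(w)])
    return code

code = build_star_dict(["the", "and", "for", "to", "of"])
# {'the': '***', 'and': 'A**', 'for': 'B**', 'to': '**', 'of': 'A*'}
```

The encoded text then consists largely of the "*" symbol, an artificial regularity that the subsequent compression stage can exploit.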
Preprocessing methods, specialized for a specific compression scheme, are presented by Chapin and Tate [16] and, later, by Chapin [17]. They describe several methods for reordering the alphabet prior to using the BWCA in order to place letters with similar contexts close to one another. Since the Burrows-Wheeler transformation (BWT) is a permutation of the input symbols based on a lexicographic sorting of the suffixes, this reordering places areas of similar contexts at the BWT output stage closer together, and these can be exploited by the latter stages of the BWCA. The paper compares several heuristic and computed reorderings, where the heuristic approaches achieve better results on text files than the computed ones. The average gain for BWCA using heuristic reorderings over the normal alphabetic order was 0.4 percent on the text files of the Calgary Corpus. Balkenhol and Shtarkov use a very similar heuristic alphabet reordering for preprocessing with BWCA [18]. A different alphabet reordering for BWCA is used in the paper from Kruse and Mukherjee [19], which also describes a bigram encoding method and a word encoding method based on their star encoding.
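Alphabet reordering itself amounts to a fixed byte permutation applied before the BWT stage and inverted after decompression. A minimal sketch follows, using a purely hypothetical reordering that moves some vowels next to one another; the published heuristic orders in [16], [17], [18] differ:

```python
def make_reorder_tables(swaps):
    """Build forward and inverse 256-byte permutation tables from a list
    of (symbol, symbol) swaps; all other byte values map to themselves."""
    fwd = list(range(256))
    for a, b in swaps:
        fwd[ord(a)], fwd[ord(b)] = fwd[ord(b)], fwd[ord(a)]
    inv = [0] * 256
    for i, v in enumerate(fwd):
        inv[v] = i
    return bytes(fwd), bytes(inv)

# Hypothetical reordering: swap some vowels into the b..f range so that
# letters with similar contexts become lexicographic neighbours of "a".
FWD, INV = make_reorder_tables([("e", "b"), ("i", "c"), ("o", "d"), ("u", "f")])

def reorder(data: bytes) -> bytes:    # applied before the BWT
    return data.translate(FWD)

def restore(data: bytes) -> bytes:    # applied after decompression
    return data.translate(INV)

sample = b"preprocessing"
assert restore(reorder(sample)) == sample
```

Because the permutation only changes the sort order of the suffixes, the transform costs a single table lookup per byte in each direction.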
Grabowski proposes several text preprocessing methods in his publication [20], which focuses on improvements for BWCA, but some techniques can also be used for other compression schemes. Besides the already mentioned techniques like alphabet reordering and bigram, trigram, and quadgram replacement, Grabowski suggests three new algorithms. The first one is capital conversion. An escape symbol and the corresponding lowercase letter replace a capital letter at the beginning of a word. If the second letter of the word is capitalized too, the replacement is omitted. This technique increases context dependencies and similarities between words, which can be exploited by standard compression schemes. The second algorithm is space stuffing, where a space symbol is placed at the beginning of each line in order to change the context that follows the end-of-line symbol (EOL) to one space instead of various symbols. The last algorithm is EOL coding, which replaces EOL symbols by space symbols and separately encodes the former EOL positions, represented by the number of blanks since the previous EOL symbol. These numbers are encoded either within the symbol stream itself or in a separate data stream. Grabowski, who attributes the EOL coding idea to Taylor, suggests using either space stuffing or EOL coding for preprocessing text files but omits EOL coding because of unstable side-effects. His preprocessing algorithms without EOL coding achieve an average gain for BWCA of 2.64 percent on the 10 text files of the Calgary Corpus. Since he uses a set of fixed bigrams, trigrams, and quadgrams, his proposal requires an external dictionary and is language dependent.
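A minimal sketch of two of these transforms, capital conversion and space stuffing, is given below. The escape byte 0x01 is an assumption (it must not occur in the input text); Grabowski's exact escape handling and word model may differ:

```python
import re

ESC = "\x01"  # assumed escape symbol, presumed absent from the input text

def capital_convert(text: str) -> str:
    """Replace a capital letter at the start of a word by ESC plus its
    lowercase form; words whose second letter is also capitalized
    (e.g. "IEEE") are left unchanged."""
    def repl(m):
        word = m.group(0)
        if len(word) > 1 and word[1].isupper():
            return word
        return ESC + word[0].lower() + word[1:]
    return re.sub(r"\b[A-Z][A-Za-z]*", repl, text)

def capital_restore(text: str) -> str:
    return re.sub(ESC + "([a-z])", lambda m: m.group(1).upper(), text)

def space_stuff(text: str) -> str:
    """Put a space at the beginning of each line so the context after an
    EOL symbol is always a blank (inverse: drop the space after EOL)."""
    return text.replace("\n", "\n ")

s = "The BWT stage.\nIEEE style applies."
assert capital_restore(capital_convert(s)) == s
```

After capital conversion, "The" and "the" share the same letter sequence, so they strengthen the same contexts in the compression model.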
Franceschini et al. extend the star encoding method by using different schemes for the indices into the dictionary [21], called Length-Preserving Transform (LPT), Reverse Length-Preserving Transform (RLPT), and Shortened-Context Length-Preserving Transform (SCLPT). All of these require an external dictionary and are language dependent. Franceschini et al. reported, for SCLPT, which achieves the best results, a gain for BWCA of 7.1 percent and, for PPMD+, a gain of 3.8 percent for the files of the Calgary Corpus (including the files paper3, paper4, paper5, and paper6). A further improvement of the star encoding method, presented by Awan et al. [22], is called Length Index Preserving Transform (LIPT). LIPT encodes a word as a string that can be interpreted as an index into a dictionary. The string consists of three parts: a single symbol "*", a symbol between "a" and "z", and a sequence of symbols from the set "a"..."z", "A"..."Z". The second part of the string, the single symbol, represents the length l of the word, where "a" stands for length 1 and "z" for length 26. The third part is the encoded index inside the set of words with length l. The indices are encoded in a base-52 number representation decremented by 1, where "a" represents 0, ..., "z" represents 25, "A" represents 26, ..., and "Z" represents 51. An empty substring represents the number 0. Therefore, a word of length 3 with index 0 is encoded as "*c", a word of length 3 with index 1 as "*ca", a word of length 3 with index 27 as "*cA", and so on. LIPT achieves a gain for BWCA on the Calgary Corpus of 4.1 percent and of 5.6 percent for GZIP.
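The LIPT codeword construction can be reproduced directly from this description. The following sketch implements the mapping; the digit order for multi-symbol indices is our assumption, since the examples above only exercise single-symbol indices:

```python
import string

DIGITS = string.ascii_lowercase + string.ascii_uppercase  # "a"=0 .. "Z"=51

def lipt_codeword(length: int, index: int) -> str:
    """Build a LIPT codeword: "*", a length symbol "a" (1) .. "z" (26),
    and the word's index within its length group written in base 52,
    decremented by 1 per digit, so that index 0 is the empty string."""
    assert 1 <= length <= 26
    suffix = ""
    n = index
    while n > 0:
        n -= 1                        # index 1 -> "a", 27 -> "A", ...
        suffix = DIGITS[n % 52] + suffix
        n //= 52
    return "*" + DIGITS[length - 1] + suffix

assert lipt_codeword(3, 0) == "*c"
assert lipt_codeword(3, 1) == "*ca"
assert lipt_codeword(3, 27) == "*cA"
```

Because frequent words get small indices and hence short suffixes, the codeword stream exposes the dictionary rank structure to the compressor.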
Isal and Moffat present different text preprocessing schemes for bigrams and words [23] using internal and external dictionaries. In their paper, tokens with values above 255 are used so that they can be combined with normal symbols, which requires the compression scheme to handle alphabets with more than 8 bits. For text files, the word-based schemes with internal dictionaries give the highest compression gain. Later, Isal et al. combined the word preprocessing scheme with different global structure transformations and entropy coding schemes [24]. Because of the use of an internal dictionary, where each word is