Language Modeling Is Compression
Abstract
It has long been established that predictive models can be transformed into lossless compressors and vice versa.
Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models.
Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors.
In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models.
We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning.
For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively.
Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
1 Introduction
Information theory and machine learning are inextricably linked and have even been referred to as “two sides of the same coin” (MacKay, 2003).
One particularly elegant connection is the essential equivalence between probabilistic models of data and lossless compression.
The source coding theorem (Shannon, 1948) is the fundamental theorem describing this idea, i.e., the expected message length in bits of an optimal entropy encoder is equal to the negative $\log_2$-likelihood of the statistical model.
In other words, maximizing the $\log_2$-likelihood (of the data) is equivalent to minimizing the number of bits required per message.
Indeed, lossless compression with a probabilistic model can be achieved in a variety of different ways, including Huffman coding (Huffman, 1952), arithmetic coding (Pasco, 1977; Rissanen, 1976), and asymmetric numeral systems (Duda, 2009).
Arithmetic coding, in particular, is known to be optimal in terms of coding length, meaning that the overall compression performance depends on the capabilities of the probabilistic model (Fig. 1).
Incidentally, in recent years, large pre-trained Transformers (Vaswani et al., 2017), so-called foundation models (Bommasani et al., 2021), have proven to be highly successful across a wide range of predictive tasks (Bubeck et al., 2023; Rae et al., 2021) and are thus promising candidates for use with arithmetic coding.
Indeed, Transformer-based compression with arithmetic coding has produced state-of-the-art results both in the online (Bellard, 2021; Mao et al., 2022) and offline settings (Valmeekam et al., 2023).
In the online setting, a pseudo-randomly initialized model is directly trained on the stream of data that is to be compressed, while the offline setting, which we consider in our work, trains the model on an external dataset before employing it to compress a (potentially different) data stream.
Consequently, offline compression is performed in-context, with a fixed set of model parameters.
Transformers have demonstrated impressive in-context learning abilities (Laskin et al., 2023; Brown et al., 2020; Wei et al., 2022; Genewein et al., 2023), which renders them ideally suited for offline compression.
However, as we will discuss in this work, Transformers are actually trained to compress well, and therefore must have good in-context learning abilities.
The context length is a key limiting factor in offline compression, as it dictates the maximum number of bytes a model can compress at a time.
Transformers can only compress a few kilobytes (each “token” being coded with 2 or 3 bytes), while requiring a lot of compute.
Correspondingly, many challenging predictive tasks (e.g., algorithmic reasoning or long-term memory) require long contexts (Delétang et al., 2023), and thus extending these models’ context lengths is a key challenge which is gaining increased attention (Zaheer et al., 2020; Guo et al., 2022; Bulatov et al., 2023).
The in-context compression view provides insights into the failure modes of current foundation models.
This Work
We advocate for using (lossless) compression to study foundation models.
To that end, we conduct an extensive empirical investigation of the offline (in-context) compression capabilities of large language models, with the rationale that they have recently become readily available (Hoffmann et al., 2022; Touvron et al., 2023) and can thus be used for compression without the training overhead.
We empirically demonstrate that these models, while (meta-)trained primarily on text, also achieve state-of-the-art compression rates across different data modalities, using their context to condition a general-purpose compressor to excel at a particular task.
Moreover, we shed new light on scaling laws (Kaplan et al., 2020), showing that they also hold true for compression but that measuring the compression rates instead of the log loss adds a twist: Scaling beyond a certain point will deteriorate the compression performance since the model parameters need to be accounted for in the compressed output.
Finally, we advocate for framing (self-supervised) prediction through the lens of compression as it encompasses generalization: a model that compresses well generalizes well (Hutter, 2006).
Contributions
We make the following contributions:
- We empirically investigate the lossless compression capabilities of foundation models. To that end, we review how to compress with predictive models via arithmetic coding and call attention to the connection between current language modeling research and compression.
- We show that foundation models, trained primarily on text, are general-purpose compressors due to their in-context learning abilities. For example, Chinchilla 70B achieves compression rates of 43.4% on ImageNet patches and 16.4% on LibriSpeech samples, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively.
- We provide a novel view on scaling laws, showing that the dataset size provides a hard limit on model size in terms of compression performance and that scaling is not a silver bullet.
- We leverage the compression-prediction equivalence to employ compressors as generative models and visually illustrate the performance of the underlying compressor.
- We demonstrate that tokenization, which can be viewed as a pre-compression, does not, in general, improve compression performance, but allows models to increase the information content in their context and is thus generally employed to improve prediction performance.
2 Background
In this section, we review the necessary background on information theory and its relation to likelihood maximization.
To that end, we consider streams of data $x_{1:n} := x_1 x_2 \ldots x_n \in \mathcal{X}^n$ of length $n$ from a finite set of symbols $\mathcal{X}$.
We write $x_{\leq j} = x_{<j+1} := x_{1:j}$ for $j \leq n$ and denote the empty string as $\epsilon$.
Finally, we denote the concatenation of two strings $s$ and $r$ by $sr$.
Coding Distributions
A coding distribution $\rho$ is a sequence of probability mass functions $\rho_n : \mathcal{X}^n \to (0, 1]$, which for all $n \in \mathbb{N}$ satisfy the constraint that $\rho_n(x_{1:n}) = \sum_{y \in \mathcal{X}} \rho_{n+1}(x_{1:n} y)$ for all $x_{1:n} \in \mathcal{X}^n$, with the base case $\rho_0(\epsilon) := 1$.
From here on out, whenever the meaning is clear from the argument to $\rho$, we drop the subscript on $\rho$.
Under this definition, the conditional probability of a symbol $x_n$ given previous data $x_{<n}$ is defined as $\rho(x_n \mid x_{<n}) := \rho(x_{1:n}) / \rho(x_{<n})$, with the familiar chain rules $\rho(x_{1:n}) = \prod_{i=1}^n \rho(x_i \mid x_{<i})$ and $\rho(x_{j:k} \mid x_{<j}) = \prod_{i=j}^k \rho(x_i \mid x_{<i})$ following.
Lossless Compression
The goal of lossless compression is to encode a stream of symbols $x_{1:n}$ sampled from a coding distribution $\rho$ into a bitstream of minimal (expected) length, while ensuring that the original data sequence is recoverable from the bitstream.
To that end, we use a binary source code $c : \mathcal{X}^* \to \{0, 1\}^*$, which assigns to each possible data sequence $x_{1:n}$ a binary code word $c(x_{1:n})$ of length $\ell_c(x_{1:n})$ (in bits).
Thus, the aim is to minimize the expected bits per sequence $L := \mathbb{E}_{x \sim \rho}[\ell_c(x)]$, i.e., encoding rare sequences with more bits and frequent sequences with fewer bits.
Shannon’s source coding theorem establishes the limit on possible data compression as $L \geq H(\rho)$ for any possible code, where $H(\rho) := \mathbb{E}_{x \sim \rho}[-\log_2 \rho(x)]$ is the Shannon entropy (Shannon, 1948).
Arithmetic Coding
Given a coding distribution $\rho$ and a sequence $x_{1:n}$, arithmetic coding (Pasco, 1977; Rissanen, 1976) constructs a code with almost optimal length.
It directly connects coding and compression with prediction and modeling: compressing well means modeling well in a log-loss sense and vice-versa.
Assuming infinite precision for the arithmetic operations involved, the arithmetic code has length $-\lceil \log_2 \rho(x_{1:n}) \rceil + 1$ bits, whereas the optimal code length is $-\log_2 \rho(x_{1:n})$ bits.
A practical implementation that is subject to $B$-bit precision adds further $O(n 2^{-B})$ bits (Howard & Vitter, 1991), which is negligible for 32- or 64-bit arithmetic.
In the following we consider infinite precision arithmetic coders and refer to Witten et al. (1987) for the finite-precision implementation.
Arithmetic Encoder
The arithmetic code of a sequence $x_{1:n}$ is the binary representation of a number $\lambda \in [0, 1)$.
We identify $\lambda$ by narrowing down an interval that encloses $\lambda$ step by step (maintaining a growing prefix of the binary representation of $\lambda$ throughout the process).
Initially, this interval is $I_0 = [0, 1)$.
In step $k > 0$ (i.e., encoding $x_k$), we first partition the previous interval $I_{k-1} = [l_{k-1}, u_{k-1})$ into sub-intervals $\tilde{I}_k(y)$, one for each letter $y$ from $\mathcal{X}$.
The size of sub-interval $\tilde{I}_k(y)$ that represents letter $y$ is $(u_{k-1} - l_{k-1}) \cdot \rho(y \mid x_{<k})$.
Formally, we define

$$\tilde{I}_k(x) := \Big[\, l_{k-1} + (u_{k-1} - l_{k-1}) \sum_{y < x} \rho(y \mid x_{<k}), \;\; l_{k-1} + (u_{k-1} - l_{k-1}) \sum_{y \leq x} \rho(y \mid x_{<k}) \Big), \tag{1}$$

assuming a strict order on $\mathcal{X}$.
To encode $x_k$ we proceed with its corresponding interval, i.e., $I_k = \tilde{I}_k(x_k)$.
Finally, we choose $\lambda \in I_n$ with the shortest binary representation in the terminating interval $I_n$ and use that binary representation to encode $x_{1:n}$.
Fig. 1 illustrates this process.
Arithmetic Decoder
Given $\lambda$ and $\rho$, decoding the $k$-th letter is easy: Starting with $I_0 = [0, 1)$, find $x$ such that $\lambda \in \tilde{I}_k(x)$ to decode $x_k = x$, then set $I_k = \tilde{I}_k(x_k)$ and proceed with the $(k{+}1)$-st letter.
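To make the interval-narrowing procedure concrete, here is a minimal arithmetic encoder/decoder sketch using exact rational arithmetic (Python's `fractions`), i.e., the infinite-precision setting described above. The callable `model(prefix)` is a hypothetical stand-in for the coding distribution $\rho(\cdot \mid x_{<k})$ and is assumed to return a dict of next-symbol probabilities.

```python
from fractions import Fraction

def encode(seq, model, alphabet):
    """Arithmetic encoding: narrow the interval [low, high) using Eq. (1)."""
    low, high = Fraction(0), Fraction(1)
    for k, symbol in enumerate(seq):
        probs = model(seq[:k])            # rho(y | x_<k) for every y in the alphabet
        width, cum = high - low, Fraction(0)
        for y in alphabet:                # assumes a strict order on the alphabet
            if y == symbol:
                low, high = low + width * cum, low + width * (cum + probs[y])
                break
            cum += probs[y]
    # Pick a lambda in [low, high) with a short binary expansion, bit by bit.
    bits, value, step = [], Fraction(0), Fraction(1, 2)
    while value < low:
        if value + step < high:
            bits.append(1)
            value += step
        else:
            bits.append(0)
        step /= 2
    return bits

def decode(bits, model, alphabet, length):
    """Recover x_{1:n} by locating lambda in the nested sub-intervals."""
    lam = sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))
    seq, low, high = [], Fraction(0), Fraction(1)
    for _ in range(length):
        probs = model(seq)
        width, cum = high - low, Fraction(0)
        for y in alphabet:
            lo, hi = low + width * cum, low + width * (cum + probs[y])
            if lo <= lam < hi:
                seq.append(y)
                low, high = lo, hi
                break
            cum += probs[y]
    return seq
```

With, e.g., `alphabet = ['a', 'b']` and a `model` returning `{'a': Fraction(2, 3), 'b': Fraction(1, 3)}`, `decode(encode(seq, model, alphabet), model, alphabet, len(seq))` round-trips the input, and the number of emitted bits stays within a couple of bits of $-\log_2 \rho(x_{1:n})$.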
Likelihood Maximization
In practice, the source distribution $\rho$ is usually unknown and is instead estimated with a parametric probabilistic model $\hat{\rho}$.
Thus, instead of achieving code length $-\lceil \log_2 \rho(x_{1:n}) \rceil$ for the sequence $x_{1:n}$, we obtain the suboptimal length $-\lceil \log_2 \hat{\rho}(x_{1:n}) \rceil$.
As a result, the expected (suboptimal) number of bits is the cross-entropy:

$$H(\rho, \hat{\rho}) := \mathbb{E}_{x \sim \rho}\Big[\sum_{i=1}^n -\log_2 \hat{\rho}(x_i \mid x_{<i})\Big]. \tag{2}$$

Thus, we can minimize the expected length of the encoded data stream with symbols distributed according to $\rho$ by minimizing the cross-entropy with respect to some $\hat{\rho}$, which is equivalent to likelihood maximization (MacKay, 2003).
However, Eq. (2) is exactly the same objective used to train current foundation models, i.e., the log-loss.
Thus, minimizing the log-loss is equivalent to minimizing the compression rate of that model used as a lossless compressor with arithmetic coding, i.e., current language model training protocols use a maximum-compression objective.
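As a small illustration of this equivalence (a sketch, not the paper's evaluation code): given the per-symbol $\log_2$-probabilities a model assigns to a sequence, the arithmetic-coding output length is, up to roughly two bits, their negated sum, so the compression rate can be read off the log-loss directly.

```python
import math

def compression_rate_from_log_loss(log2_probs, raw_size_bytes):
    """Compressed size under arithmetic coding is ~ -log2 rho_hat(x_{1:n}) bits (Eq. 2)."""
    compressed_bits = -sum(log2_probs)          # summed negative log2-likelihood
    return (compressed_bits / 8) / raw_size_bytes

# A model assigning probability 1/2 to each of 2048 observed bytes yields
# 2048 bits = 256 bytes, i.e., a compression rate of 12.5%.
print(compression_rate_from_log_loss([math.log2(0.5)] * 2048, 2048))
```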
Compression-Based Sequence Prediction
Analogous to how a predictive distribution can be used for lossless compression via arithmetic coding (described above), any compressor can be employed for sequence prediction (Frank et al., 2000).
The main idea is to define $\rho$ as the coding distribution $2^{-\ell_c(\cdot)}$, where $\ell_c(x_{1:n})$ is the length of sequence $x_{1:n}$ when encoded with compressor $c$ (e.g., gzip).
We thus recover the conditional distribution $\rho(x_i \mid x_{<i})$ by computing $2^{\ell_c(x_{<i}) - \ell_c(x_{<i} x_i)}$, for all $x_i$.
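A hedged sketch of this construction with gzip as the compressor $c$; the normalization over the 256 candidate bytes is an implementation choice (the formula above is unnormalized), and constant overheads such as the gzip header cancel in the length difference.

```python
import gzip

def next_byte_distribution(history: bytes):
    """rho(x_i | x_<i) proportional to 2^(l_c(x_<i) - l_c(x_<i x_i)) for each candidate byte."""
    base = len(gzip.compress(history, compresslevel=9))
    weights = [
        2.0 ** (base - len(gzip.compress(history + bytes([b]), compresslevel=9)))
        for b in range(256)
    ]
    total = sum(weights)
    return [w / total for w in weights]
```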
Universal Coding
Above we discussed optimal (arithmetic) coding with respect to data sampled from a fixed distribution $\rho$.
In contrast, universal (optimal) source coding with respect to all computable sampling distributions can, in theory, be achieved by choosing $\ell_c(x_{1:n})$ as the Kolmogorov complexity of $x_{1:n}$ (Kolmogorov, 1998; Li & Vitányi, 2019).
For this choice, the conditional distribution described above is universally optimal over $x_{<i}$, recovering the Solomonoff predictor (Solomonoff, 1964a, b; Rathmanner & Hutter, 2011).
The Solomonoff predictor is a Bayesian mixture of all predictors that can be programmed in a chosen Turing-complete programming language.
More precisely, for a predictor $q$ of program-length $\ell_c(q)$ bits,
the Solomonoff predictor assigns a prior weight of $2^{-\ell_c(q)}$ to predictor $q$.
That is, if $\mathcal{Q}$ is the set of all predictors that can be programmed and computed, the Solomonoff predictor assigns probability $S(x_{1:n}) = \sum_{q \in \mathcal{Q}} 2^{-\ell_c(q)} q(x_{1:n})$ to a sequence $x_{1:n}$, if every predictor $q$ assigns that sequence probability $q(x_{1:n})$.
Therefore, $S(x_{1:n}) \geq 2^{-\ell_c(q)} q(x_{1:n})$ for all $q \in \mathcal{Q}$, and thus $-\log_2 S(x_{1:n}) \leq -\log_2 q(x_{1:n}) + \ell_c(q)$.
Observe that $\ell_c(q)$ is a constant of $q$ that is independent of the sequence length.
Therefore, compressing optimally is equivalent to predicting optimally and vice versa (Hutter, 2005).
3 Experimental Evaluation
We now present our evaluation of the (in-context) compression capabilities of foundation models.
Compressors
We compare our arithmetic coding-based language model compressors to two competitive general-purpose lossless compressors: gzip (Deutsch, 1996) and its improvement LZMA2 (Pavlov, 2019), used by the 7zip software.
Both are based on Huffman coding (Huffman, 1952) and the Lempel-Ziv-Welch algorithm (Welch, 1984).
We also consider specialized lossless compressors for image and audio data, i.e., PNG (Boutell, 1997) and FLAC (Coalson, 2008), respectively.
Finally, we evaluate two types of language models (of different sizes) with arithmetic coding: vanilla decoder-only Transformers (Vaswani et al., 2017), which we pretrain on the enwik8 dataset, and pretrained Chinchilla-like foundation models (Hoffmann et al., 2022).
3.1 Datasets
We consider datasets of three different modalities, text, image, and audio, which have (a priori) very different biases for compression and thus provide a good testbed for evaluating a compressor’s general capabilities.
To render the results comparable across modalities, all our datasets are 1GB.
A key question is how to reconcile the different context lengths of the compressors we consider.
Transformers are restricted to short contexts ($C = 2048$ bytes, i.e., 2048 tokens of 8 bits that represent the ASCII characters, for our trained models, and roughly 10 kilobytes for Chinchilla models), while gzip uses a maximum context of 32 kilobytes, and LZMA2 has a virtually “infinite” context length.
Having a longer context allows a compressor to exploit more sequential dependencies to achieve a better compression rate.
For compressors with finite contexts, there are two approaches to compress sequences that are longer than the context length: (i) slide the compressor byte by byte, thus always processing a history of the previous $C - 1$ bytes when compressing a new byte, and (ii) chunk the data stream into $S$ sequences of $C$ bytes and evaluate the in-context compression (without any history) averaged across batches.
For Transformers, we consider the latter approach since sliding would increase their (already very long) running time by a factor of $C$.
Therefore, we chunk all datasets into sequences of 2048 bytes and feed them to the compressors one-by-one.
However, since classical compressors usually include a header in their compressed output, which can be larger than the compressed data in some cases, we only count the header once across all batches when computing the compression rate.
Moreover, since chunking deteriorates the performance of classical compressors, which have context lengths $\geq 2048$ bytes, we also report their compression rates on the unchunked datasets.
We consider the following datasets:
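The chunked evaluation for classical compressors can be sketched as follows; estimating the per-chunk header as the size of compressing an empty input is our simplification, not the paper's exact bookkeeping.

```python
import gzip

CHUNK_SIZE = 2048

def chunked_compression_rate(data: bytes) -> float:
    """Compress every 2048-byte chunk independently (no cross-chunk history),
    counting the compressor's fixed header only once across all chunks."""
    header = len(gzip.compress(b"", compresslevel=9))   # rough header estimate (assumption)
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    compressed = header + sum(len(gzip.compress(c, compresslevel=9)) - header for c in chunks)
    return compressed / len(data)
```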
enwik9
The enwik9 dataset (Hutter, 2006) consists of the first $10^9$ (1 billion) bytes of the English Wikipedia XML dump on March 3rd, 2006 and is typically used to measure a model’s ability to compress data.
It is an extension of the enwik8 dataset that only contains the first 100 million bytes.
We train our vanilla Transformer models on enwik8, but evaluate on both enwik8 and enwik9 (to evaluate the out-of-distribution compression performance).
While enwik8 is included in enwik9, it only represents the first 10% and thus still constitutes a significant distribution shift.
ImageNet
The ImageNet dataset (Russakovsky et al., 2015) contains annotated images from the WordNet hierarchy.
Since 2010, the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection.
We extract contiguous patches from all images, flatten them, and convert them to grayscale (so that each byte represents exactly one pixel) to obtain samples of 2048 bytes.
We then concatenate these patches, following the original dataset order, to create a dataset of 1 GB.
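A sketch of the patch extraction; the 32 × 64 patch shape and the non-overlapping row-major grid are our assumptions for illustration, since the text above only fixes grayscale samples of 2048 bytes.

```python
import numpy as np
from PIL import Image

PATCH_H, PATCH_W = 32, 64   # 32 * 64 = 2048 grayscale bytes per patch (assumed shape)

def image_to_patches(path: str) -> list:
    """Convert an image to grayscale (one byte per pixel) and cut it into
    contiguous, flattened 2048-byte patches."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.uint8)
    height, width = img.shape
    patches = []
    for top in range(0, height - PATCH_H + 1, PATCH_H):
        for left in range(0, width - PATCH_W + 1, PATCH_W):
            patches.append(img[top:top + PATCH_H, left:left + PATCH_W].tobytes())
    return patches
```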
LibriSpeech
LibriSpeech (Panayotov et al., 2015) is a corpus of approximately 1000 hours of 16kHz English speech.
The data is derived from audiobooks from the LibriVox project and has been carefully segmented and aligned.
We chunk the samples into batches of 2048 bytes and gather enough such chunks to form a dataset of size 1 GB.
3.2 Comparing Compression Rates
Table 1: Compression rates (compressed size / raw size) on different datasets (lower is better). The raw compression rate does not take the parameter size into account for the Transformer and Chinchilla models, while the adjusted compression rate counts the parameter size as part of the compressed size. All datasets have a raw size of 1GB. Random data serves as a baseline and should not be compressible. Transformer and Chinchilla are predictive models, which we use with arithmetic coding to obtain lossless compressors. We train the Transformer models from scratch on enwik8, while the Chinchilla models are pretrained on large text datasets. Transformers trained on enwik overfit to that data modality, while Chinchilla models are good compressors for various data types.
| Chunk Size | Compressor | Raw enwik9 (%) | Raw ImageNet (%) | Raw LibriSpeech (%) | Raw Random (%) | Adjusted enwik9 (%) | Adjusted ImageNet (%) | Adjusted LibriSpeech (%) | Adjusted Random (%) |
|---|---|---|---|---|---|---|---|---|---|
| ∞ | gzip | 32.3 | 70.7 | 36.4 | 100.0 | 32.3 | 70.7 | 36.4 | 100.0 |
| ∞ | LZMA2 | 23.0 | 57.9 | 29.9 | 100.0 | 23.0 | 57.9 | 29.9 | 100.0 |
| ∞ | PNG | 42.9 | 58.5 | 32.2 | 100.0 | 42.9 | 58.5 | 32.2 | 100.0 |
| ∞ | FLAC | 89.5 | 61.9 | 30.9 | 107.8 | 89.5 | 61.9 | 30.9 | 107.8 |
| 2048 | gzip | 48.1 | 68.6 | 38.5 | 100.1 | 48.1 | 68.6 | 38.5 | 100.1 |
| 2048 | LZMA2 | 50.0 | 62.4 | 38.2 | 100.0 | 50.0 | 62.4 | 38.2 | 100.0 |
| 2048 | PNG | 80.6 | 61.7 | 37.6 | 103.2 | 80.6 | 61.7 | 37.6 | 103.2 |
| 2048 | FLAC | 88.9 | 60.9 | 30.3 | 107.2 | 88.9 | 60.9 | 30.3 | 107.2 |
| 2048 | Transformer 200K | 30.9 | 194.0 | 146.6 | 195.5 | 30.9 | 194.0 | 146.6 | 195.5 |
| 2048 | Transformer 800K | 21.7 | 185.1 | 131.1 | 200.1 | 21.9 | 185.3 | 131.3 | 200.3 |
| 2048 | Transformer 3.2M | 17.0 | 215.8 | 228.2 | 224.0 | 17.7 | 216.5 | 228.9 | 224.7 |
| 2048 | Chinchilla 1B | 11.3 | 62.2 | 24.9 | 108.8 | 211.3 | 262.2 | 224.9 | 308.8 |
| 2048 | Chinchilla 7B | 10.2 | 54.7 | 23.6 | 101.6 | 1410.2 | 1454.7 | 1423.6 | 1501.6 |
| 2048 | Chinchilla 70B | 8.3 | 48.0 | 21.0 | 100.8 | 14008.3 | 14048.0 | 14021.0 | 14100.8 |
Table 1 shows the compression rates for all compressors and datasets.
We show both the raw compression rate, which does not take the model size (in bytes) into account, as well as the adjusted rate, which does.
The size of the Python program for classical compressors is very small (a few kilobytes at most) and thus barely affects the compression rate.
In contrast, language models suffer a huge loss in compression rate due to their large size, which cannot be offset when compressing only 1GB of data.
We encode each neural network parameter with 2 bytes, using a float16 representation since quantizing weights to this level does not significantly affect performance (Tao et al., 2022) and is standard for model inference.
Note that further compressing the float16 parameters using classical compressors does not significantly reduce their size (we obtained rates of 92.2% and 89.1% on a 38M parameter Transformer with gzip and LZMA2, respectively).
Also, recall that we only consider the offline setting, which computes the adjusted compression rate using a two-part code (i.e., it adds the model size to the log-loss of the data).
In contrast, prequential (online) coding would provide an alternative view on adjusted compression by computing the adjusted compression rate as the log-loss plus the size of the training script (not the model parameters).
According to prior work, prequential coding leads to better compression with overparametrized neural networks (Blier & Ollivier, 2018), however, it requires training the model online (which reduces performance and cannot be performed with foundation models) both during encoding and decoding (which is very costly for our models).
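The adjusted rate in Table 1 is the two-part code described above; a minimal sketch:

```python
def adjusted_compression_rate(compressed_bytes: float, num_params: int, raw_bytes: int) -> float:
    """Two-part code: the decoder needs the float16 parameters (2 bytes each)
    in addition to the arithmetic-coded data."""
    return (compressed_bytes + 2 * num_params) / raw_bytes

# Chinchilla 70B on 1 GB of enwik9: ~0.083 * 1e9 coded bytes plus ~140 GB of weights.
# Prints ~140.08, i.e., an adjusted rate of roughly 14008% as in Table 1.
print(adjusted_compression_rate(0.083e9, 70_000_000_000, 10**9))
```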
Foundation Models Are General-Purpose Compressors
A lossless compressor induces an injective function over bit sequences, meaning that we cannot compress all sequences equally well (by the pigeonhole principle).
Consequently, in practice, compressors are often tailored to a particular setting, e.g., FLAC for audio or PNG for images, and thus fail to compress other data modalities well (see Table 1).
In contrast, general-purpose compressors, such as gzip, offer good performance on a wide range of data sources.
Surprisingly, Chinchilla models, while trained primarily on text, also appear to be general-purpose compressors, as they outperform all other compressors, even on image and audio data (see Table 1).
Note that Chinchilla models have not been trained on this kind of data according to Appendix A of Hoffmann et al. (2022), which states that the training dataset consists of a mix of internet text data (Wikipedia, websites, GitHub) and books.
However, it is still possible (but unlikely) that some images or audio samples were encoded into text on some websites.
Thus, Chinchilla models achieve their impressive compression performance by conditioning a (meta-)trained model to a particular task at hand via in-context learning (Genewein et al., 2023).
In contrast, smaller Transformers, trained manually on enwik8, only achieve good compression rates on similar Wikipedia data, i.e., enwik9.
However, larger models’ stronger in-context compression (or in-context learning) comes at a price: the number of parameters, which has to be offset with increasingly large data sources when computing the adjusted compression rate (see Section 3.3).
Finally, note that since Chinchilla has been trained on Wikipedia, the enwik9 results are in-distribution.
3.3 Optimal Model-Dataset Size Tradeoff
As shown in Table 1, foundation models incur a huge cost in compression rates when accounting for their size, which is in the order of hundreds of GBs for billions of parameters.
In theory, if the dataset is infinite, we can ignore the model’s size since it is insignificant compared to the size of the dataset.
However, in practice, a foundation model can only achieve non-trivial (adjusted) compression rates when evaluated on datasets in the order of TBs (or more).
Since this is infeasible under reasonable hardware constraints, we instead investigate the optimal model size with smaller Transformers that we train on enwik8.
Recall that the model size (in bytes) is twice the number of (float16) parameters.
Fig. 2 visualizes the adjusted compression rate for vanilla Transformers of different model sizes for the enwik datasets.
We observe that larger models achieve better compression rates on larger datasets, thus justifying recent trends in model scaling (Kaplan et al., 2020).
However, they achieve worse rates on smaller datasets, indicating that scaling laws are, in fact, dependent on the size of the test set.
That is, for each dataset, the model sizes reach a critical point, after which the adjusted compression rate starts to increase again since the number of parameters is too big compared to the size of the dataset.
Note that we evaluate offline compression, i.e., we do not necessarily compress the data the model was trained on, meaning that the results on enwik7 and enwik8 are in-distribution, while the enwik9 results are out-of-distribution.
Nevertheless, larger models still achieve better compression rates on enwik9 than enwik8, illustrating the benefits of scaling.
3.4 Compressors as Generative Models
In Section 2, we discussed how any compressor can be employed as a sequence prediction model.
Concretely, for compressor $c$, we sample the next byte according to the distribution $\hat{\rho}(x_i \mid x_{<i}) \sim 2^{\ell_c(x_{<i}) - \ell_c(x_{<i} x_i)}$, i.e., we compute the length $\ell_c$ of the compressed sequence $c(x_{<i} b)$ for all possible $b$.
Thus, if a byte $b$ leads to a particularly short compressed sequence (when concatenated with $x_{<i}$), it will have a higher probability of being sampled next.
Note that any constant in the length function (e.g., the header for classical compressors) disappears when we normalize the distribution.
Since generic compressors have a low intrinsic bias, sampling data without conditioning does not yield interesting results as it looks random.
Thus, we condition the compressors on part of an existing sequence (1948 bytes for enwik9, half of the sample for ImageNet and LibriSpeech) and generate the remaining bytes with the compression-based generative model.
We compare the generative performance of gzip and Chinchilla 70B across all three data modalities in Figs. 3, 5 and 4 for text, image, and audio data, respectively.
In general, generative models can be evaluated in one of two ways: sampling the next byte
(i) using teacher forcing, i.e., conditioning on the true subsequence $x_{<i}$, or (ii) via autoregressive sampling, i.e., conditioning on the model’s previous outputs.
The latter induces a distribution shift, and with it undesired side effects (Ortega et al., 2021), but is standard and thus what we choose to visualize.
Figure 5: Compression-based generation for image data. We condition gzip and Chinchilla on the first half of every row of an ImageNet image and then sample the remaining half autoregressively. Both models produce incoherent samples, but Chinchilla appears much less noisy than gzip.
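A sketch of the autoregressive sampling loop with gzip as the underlying compressor, using the same normalized compressor-induced distribution as in Section 2 (slow but illustrative: 256 compressions per generated byte).

```python
import gzip
import random

def sample_continuation(prompt: bytes, num_bytes: int) -> bytes:
    """Autoregressively sample bytes from the gzip-induced distribution
    rho(x_i | x_<i) ~ 2^(l_c(x_<i) - l_c(x_<i x_i)), normalized over 256 candidates."""
    context = bytearray(prompt)
    for _ in range(num_bytes):
        base = len(gzip.compress(bytes(context), compresslevel=9))
        weights = [
            2.0 ** (base - len(gzip.compress(bytes(context) + bytes([b]), compresslevel=9)))
            for b in range(256)
        ]
        context.append(random.choices(range(256), weights=weights)[0])
    return bytes(context[len(prompt):])
```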
3.5 Sequential Evolution of In-Context Compression
Language models take a very different “approach” to compression compared to classical compressors.
Classical compressors have a small program size and optimize for a large context length to exploit sequential dependencies in the data.
In contrast, foundation models consist of billions of parameters, which enable rapid adaptation in their (relatively) short context window (Genewein et al., 2023).
Thus, arithmetic coding-based compressors rely heavily on the predictive models’ in-context learning capabilities to achieve competitive compression performance.
We investigate this phenomenon in Fig. 6, which visualizes the compression rate across sequence lengths for gzip, Chinchilla 1B and a Transformer pretrained on enwik8.
Intuitively, the longer the sequence, the more data the model can process in its context, and therefore, the better the compression.
As expected, most compression rates decrease quickly with increasing sequence length, indicating that the models learn some data statistics in-context, without any gradient-based training.
As in Table 1, the Chinchilla model achieves the best compression rates across all three data modalities and sequence lengths.
Figure 6: In-context compression rate over sequence length. For each dataset, we compute the compression rate for all subsequences of 2048 bytes, averaged over 100 sequences.
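The curves in Fig. 6 can be reproduced from a model's per-byte code lengths alone; a minimal sketch of the cumulative in-context rate:

```python
def in_context_rates(log2_probs):
    """Compression rate of the first k bytes of a sequence, for every k:
    cumulative code length (in bytes) divided by k raw bytes."""
    bits, rates = 0.0, []
    for k, lp in enumerate(log2_probs, start=1):
        bits += -lp                     # -log2 rho_hat(x_k | x_<k)
        rates.append((bits / 8) / k)
    return rates                        # rates typically drop as the context grows
```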
3.6 Tokenization Is Compression
Table 2: Raw compression rates (compressed size / raw size) on enwik9 for Transformers trained on enwik8 with different tokenizers, ASCII and byte-pair encoding (BPE), with various vocabulary sizes. Transformers compress better with simpler tokenizers. However, larger vocabulary sizes reduce the sequence length, meaning that more information can be packed into the context.
| Tokenization | 200K params | 6.4M params | 38M params |
|---|---|---|---|
| ASCII | 22.9 | 13.6 | 6.4 |
| BPE 1000 | 25.4 | 14.8 | 6.9 |
| BPE 2000 | 25.6 | 15.7 | 7.4 |
| BPE 5000 | 23.1 | 17.1 | 8.7 |
| BPE 10000 | 21.3 | 17.0 | 8.9 |
| BPE 20000 | 19.3 | 16.4 | 9.0 |
Transformers are generally not trained on raw input data but on tokenized versions thereof, both for efficiency and performance reasons.
As a consequence, Transformers are trained on compressed data, with tokenizers acting as the compressor.
Since tokenization is known to have an impact on the generalization performance (Radford et al., 2019), we investigate its impact on the compression rate in Table 2.
Concretely, we train Transformers on enwik8 using different tokenizers: ASCII, i.e., an alphabet of size 256 (no tokenization), and byte-pair encoding trained on enwik8, with various vocabulary sizes (1K, 2K, 5K, 10K, and 20K tokens).
Note that the tokenizations are lossless.
Increasing the number of tokens (i.e., the “alphabet size”) reduces the length of the sequence and thus increases the amount of information in a model’s context.
However, decreasing the sequence length comes at a price: the token vocabulary is larger, which makes the prediction task more challenging since reducing the entropy of the conditional distribution becomes increasingly difficult for larger alphabet sizes.
In theory, as the tokenization is a lossless compression, the two effects should compensate.
In practice, we observe that if the model is small, increasing the number of possible tokens boosts the compression performance.
In contrast, for bigger models, it seems that the converse happens: having a larger token vocabulary harms the final compression rate of the model.
Nevertheless, short sequence lengths also help Transformers since their time complexity scales quadratically with context length, and it has been shown they do not generalize well to long contexts (Delétang et al., 2023; Ruoss et al., 2023).
This explains why most practical Transformer implementations still use some form of tokenization, e.g., SentencePiece (Kudo & Richardson, 2018).
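A toy byte-pair-encoding sketch (not the SentencePiece setup referenced above) that makes the "tokenization is pre-compression" point explicit: each merge losslessly shortens the sequence while enlarging the alphabet.

```python
from collections import Counter

def bpe_tokenize(text: str, num_merges: int):
    """Greedy BPE: repeatedly replace the most frequent adjacent pair of tokens
    with a new token id, losslessly shortening the sequence."""
    seq = list(text.encode("utf-8"))
    for new_token in range(256, 256 + num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(new_token)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq

text = "low lower lowest " * 100
tokens = bpe_tokenize(text, num_merges=50)
# Larger vocabularies pack the same bytes into fewer tokens (more context per window),
# at the cost of a harder next-token prediction problem over a bigger alphabet.
print(len(text.encode("utf-8")), "bytes ->", len(tokens), "tokens")
```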
4 Related work
Prediction vs. Compression
Leveraging Shannon’s source coding theorem (Shannon, 1948), a plethora of approaches exploit the connection between prediction and compression.
For example, context-tree weighting (CTW) (Willems et al., 1995) mixes the predictions of many underlying Markov models to achieve lossless compression via arithmetic coding (Pasco, 1977; Rissanen, 1976).
Similarly, prediction by partial matching (PPM) (Cleary & Witten, 1984) also leverages arithmetic coding, but uses a contiguous context matching method to create probability distributions based on the history of characters in a sequence.
Likewise, PAQ8 (Knoll & de Freitas, 2012) uses a weighted combination of predictions from a large number of models (most of them based on context matching, but unlike PPM also noncontiguous context matches).
In a different setting, Veness et al. (2015) demonstrated how to employ compression to obtain value estimates of a policy in an environment.
Frank et al. (2000) and later Teahan & Harper (2003) introduced the idea of classification with compressors.
Recently, Jiang et al. (2023) applied this technique with NLP tasks, paired with a k-nearest-neighbour algorithm.
The results are surprisingly good for simple general purpose compressors like gzip.
Jiang et al. (2022) exploit the same idea but train the compressor on a vast amount of unlabeled data first.
Finally, van den Oord & Schrauwen (2014) apply arithmetic coding to image compression using Student distribution mixtures and Gaussian processes as predictors.
Compression With Neural Networks
Prior work demonstrated that neural predictive distributions can be employed to perform lossless compression via arithmetic coding (Schmidhuber & Heil, 1996; Mahoney, 2000; Knoll, 2014; Cox, 2016; Schiopu et al., 2018; Goyal et al., 2019; Liu et al., 2019; Mentzer et al., 2019, 2020; Schiopu & Munteanu, 2020; Rhee et al., 2022; Mikolov, 2012).
Similarly, neural networks were also shown to achieve strong lossless compression rates when replacing arithmetic coding with asymmetric numeral systems (Hoogeboom et al., 2019; Kingma et al., 2019; Townsend et al., 2019; Barzen et al., 2022).
While these approaches assume the existence of a separate training set, a different line of work investigated arithmetic coding-based neural compression in a purely online fashion, i.e., training the model only on the data stream that is to be compressed (Bellard, 2019; Goyal et al., 2020; Bellard, 2021; Mao et al., 2022).
Finally, concurrent work (Valmeekam et al., 2023) also investigated lossless offline compression with foundation models, using arithmetic coding with LLaMA-7B (Touvron et al., 2023).
Compression Biases: Tokenization, Model Size, etc.
Much effort has been devoted to understanding the inductive biases of neural networks.
Here, we are mostly interested in the biases of Natural Language Processing (NLP) and Transformers.
Kudo & Richardson (2018) defined a tokenizer for NLP-related research, an improvement of well-known techniques like byte-pair encoding (BPE) (Sennrich et al., 2016), BPE dropout (Provilkov et al., 2020), and subword regularization (Kudo, 2018).
In this paper, we show how these tokenization techniques act as pre-compressors for the data, and can significantly affect the final compression rates when paired with a neural model.
More general studies have been performed on generalization (Neyshabur et al., 2017), which, we argue, is equivalent to the model’s compressive power when accounting for the parameters’ code-length.
Finally, some work has been done on compressing the neural models’ parameters themselves (Cheng et al., 2017).
5 Conclusion
In this paper we investigated how and why compression and prediction are equivalent.
Arithmetic coding transforms a prediction model into a compressor, and, conversely, a compressor can be transformed into a predictor by using the coding lengths to construct probability distributions following Shannon’s entropy principle.
We evaluated large pretrained models used as compressors against various standard compressors, and showed they are competitive not only on text but also on modalities they have never been trained on (images, audio data).
We showed that the compression viewpoint provides novel insights on scaling laws since it takes the model size into account, unlike the log-loss objective, which is standard in current language modeling research.
Consequently, we showed that the optimal model size is inextricably linked to the dataset size and cannot be scaled without limit.
Acknowledgments
We thank Jörg Bornschein, Nando de Freitas, Slav Petrov, and Zhengdong Wang for their helpful feedback and insightful discussions.
References
- Barzen et al. (2022) Benjamin Lukas Cajus Barzen, Fedor Glazov, Jonas Geistert, and Thomas Sikora. Accelerated deep lossless image coding with unified paralleleized GPU coding architecture. In PCS, 2022.
- Bellard (2019) Fabrice Bellard. Lossless data compression with neural networks. Technical report, Amarisoft, 2019.
- Bellard (2021) Fabrice Bellard. NNCP v2: Lossless data compression with transformer. Technical report, Amarisoft, 2021.
- Blier & Ollivier (2018) Léonard Blier and Yann Ollivier. The description length of deep learning models. In NeurIPS, 2018.
- Bommasani et al. (2021) Rishi Bommasani et al. On the opportunities and risks of foundation models. arXiv:2108.07258, 2021.
- Boutell (1997) Thomas Boutell. PNG (portable network graphics) specification version 1.0. RFC, 1997.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language models are few-shot learners. In NeurIPS, 2020.
- Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv:2303.12712, 2023.
- Bulatov et al. (2023) Aydar Bulatov, Yuri Kuratov, and Mikhail S. Burtsev. Scaling transformer to 1m tokens and beyond with RMT. arXiv:2304.11062, 2023.
- Cheng et al. (2017) Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv:1710.09282, 2017.
- Cleary & Witten (1984) John G. Cleary and Ian H. Witten. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun., 1984.
- Coalson (2008) Josh Coalson. Free lossless audio codec, 2008. URL https://xiph.org/flac.
- Cox (2016) David Cox. Syntactically informed text compression with recurrent neural networks. arXiv:1608.02893, 2016.
- Delétang et al. (2023) Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and Pedro A. Ortega. Neural networks and the chomsky hierarchy. In ICLR, 2023.
- Deutsch (1996) Peter Deutsch. GZIP file format specification version 4.3. RFC, 1996.
- Duda (2009) Jarek Duda. Asymmetric numeral systems. arXiv:0902.0271, 2009.
- Frank et al. (2000) Eibe Frank, Chang Chui, and Ian H. Witten. Text categorization using compression models. In Data Compression Conference, 2000.
- Genewein et al. (2023) Tim Genewein, Grégoire Delétang, Anian Ruoss, Li Kevin Wenliang, Elliot Catt, Vincent Dutordoir, Jordi Grau-Moya, Laurent Orseau, Marcus Hutter, and Joel Veness. Memory-based meta-learning on non-stationary distributions. arXiv:2302.03067, 2023.
- Goyal et al. (2019) Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, and Idoia Ochoa. Deepzip: Lossless data compression using recurrent neural networks. In DCC, 2019.
- Goyal et al. (2020) Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, and Idoia Ochoa. Dzip: Improved general-purpose lossless compression based on novel neural network modeling. In DCC, 2020.
- Guo et al. (2022) Mandy Guo, Joshua Ainslie, David C. Uthus, Santiago Ontañón, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. Longt5: Efficient text-to-text transformer for long sequences. In NAACL-HLT (Findings), 2022.
- Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training compute-optimal large language models. arXiv:2203.15556, 2022.
- Hoogeboom et al. (2019) Emiel Hoogeboom, Jorn W. T. Peters, Rianne van den Berg, and Max Welling. Integer discrete flows and lossless compression. In NeurIPS, 2019.
- Howard & Vitter (1991) Paul G. Howard and Jeffrey Scott Vitter. Analysis of arithmetic coding for data compression. In Data Compression Conference, 1991.
- Huffman (1952) David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 1952.
- Hutter (2005) Marcus Hutter. Universal Artificial Intellegence - Sequential Decisions Based on Algorithmic Probability. Springer, 2005.
- Hutter (2006) Marcus Hutter. 500’000€ prize for compressing human knowledge, 2006. URL http://prize.hutter1.net.
- Jiang et al. (2022) Zhiying Jiang, Yiqin Dai, Ji Xin, Ming Li, and Jimmy Lin. Few-shot non-parametric learning with deep latent variable model. In NeurIPS, 2022.
- Jiang et al. (2023) Zhiying Jiang, Matthew Y. R. Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, and Jimmy Lin. "low-resource" text classification: A parameter-free classification method with compressors. In ACL (Findings), 2023.
- Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.
- Kingma et al. (2019) Friso H. Kingma, Pieter Abbeel, and Jonathan Ho. Bit-swap: Recursive bits-back coding for lossless compression with hierarchical latent variables. In ICML, 2019.
- Knoll (2014) Byron Knoll. CMIX, 2014. URL http://www.byronknoll.com/cmix.html.
- Knoll & de Freitas (2012) Byron Knoll and Nando de Freitas. A machine learning perspective on predictive coding with PAQ8. In DCC, 2012.
- Kolmogorov (1998) Andrei N. Kolmogorov. On tables of random numbers. Theoretical Computer Science, 1998.
- Kudo (2018) Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In ACL (1), 2018.
- Kudo & Richardson (2018) Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP (Demonstration), 2018.
- Laskin et al. (2023) Michael Laskin, Luyu Wang, et al. In-context reinforcement learning with algorithm distillation. In ICLR. OpenReview.net, 2023.
- Li & Vitányi (2019) Ming Li and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications, 4th Edition. Springer, 2019.
- Liu et al. (2019) Qian Liu, Yiling Xu, and Zhu Li. DecMac: A deep context model for high efficiency arithmetic coding. In ICAIIC, 2019.
- MacKay (2003) David J. C. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.
- Mahoney (2000) Matthew V. Mahoney. Fast text compression with neural networks. In FLAIRS, 2000.
- Mao et al. (2022) Yu Mao, Yufei Cui, Tei-Wei Kuo, and Chun Jason Xue. TRACE: A fast transformer-based general-purpose lossless compressor. In WWW, 2022.
- Mentzer et al. (2019) Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Practical full resolution learned lossless image compression. In CVPR, 2019.
- Mentzer et al. (2020) Fabian Mentzer, Luc Van Gool, and Michael Tschannen. Learning better lossless compression using lossy compression. In CVPR, 2020.
- Mikolov (2012) Tomas Mikolov. Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
- Neyshabur et al. (2017) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In NIPS, 2017.
- Ortega et al. (2021) Pedro A. Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Pérolat, Tom Everitt, Corentin Tallec, Emilio Parisotto, Tom Erez, Yutian Chen, Scott E. Reed, Marcus Hutter, Nando de Freitas, and Shane Legg. Shaking the foundations: delusions in sequence models for interaction and control. arXiv:2110.10819, 2021.
- Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In ICASSP, 2015.
- Pasco (1977) Richard C. Pasco. Source coding algorithms for fast data compression (ph.d. thesis abstr.). IEEE Trans. Inf. Theory, 1977.
- Pavlov (2019) Igor Pavlov. 7z Format, 2019. URL http://www.7-zip.org/7z.html.
- Provilkov et al. (2020) Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. Bpe-dropout: Simple and effective subword regularization. In ACL, 2020.
- Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
- Rae et al. (2021) Jack W. Rae et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv:2112.11446, 2021.
- Rathmanner & Hutter (2011) Samuel Rathmanner and Marcus Hutter. A philosophical treatise of universal induction. Entropy, 2011.
- Rhee et al. (2022) Hochang Rhee, Yeong Il Jang, Seyun Kim, and Nam Ik Cho. LC-FDNet: Learned lossless image compression with frequency decomposition network. In CVPR, 2022.
- Rissanen (1976) Jorma Rissanen. Generalized kraft inequality and arithmetic coding. IBM J. Res. Dev., 1976.
- Ruoss et al. (2023) Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. Randomized positional encodings boost length generalization of transformers. In ACL (2), 2023.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis., 2015.
- Schiopu & Munteanu (2020) Ionut Schiopu and Adrian Munteanu. Deep-learning-based lossless image coding. IEEE Trans. Circuits Syst. Video Technol., 2020.
- Schiopu et al. (2018) Ionut Schiopu, Yu Liu, and Adrian Munteanu. CNN-based prediction for lossless coding of photographic images. In PCS, 2018.
- Schmidhuber & Heil (1996) Jürgen Schmidhuber and Stefan Heil. Sequential neural text compression. IEEE Trans. Neural Networks, 1996.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL (1), 2016.
- Shannon (1948) Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 1948.
- Solomonoff (1964a) Ray J. Solomonoff. A formal theory of inductive inference. part I. Inf. Control., 1964a.
- Solomonoff (1964b) Ray J. Solomonoff. A formal theory of inductive inference. part II. Inf. Control., 1964b.
- Tao et al. (2022) Chaofan Tao, Lu Hou, Wei Zhang, Lifeng Shang, Xin Jiang, Qun Liu, Ping Luo, and Ngai Wong. Compression of generative pre-trained language models via quantization. In ACL (1), 2022.
- Teahan & Harper (2003) William J. Teahan and David J. Harper. Using Compression-Based Language Models for Text Categorization, pp. 141–165. Springer Netherlands, 2003.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. Llama: Open and efficient foundation language models. arXiv:2302.13971, 2023.
- Townsend et al. (2019) James Townsend, Thomas Bird, and David Barber. Practical lossless compression with latent variables using bits back coding. In ICLR (Poster), 2019.
- Valmeekam et al. (2023) Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-François Chamberland, and Srinivas Shakkottai. Llmzip: Lossless text compression using large language models. arXiv:2306.04050, 2023.
- van den Oord & Schrauwen (2014) Aäron van den Oord and Benjamin Schrauwen. The student-t mixture as a natural image patch prior with application to image compression. J. Mach. Learn. Res., 2014.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
- Veness et al. (2015) Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, and Guillaume Desjardins. Compress and control. In AAAI, 2015.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
- Welch (1984) Terry A. Welch. A technique for high-performance data compression. Computer, 1984.
- Willems et al. (1995) Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The context-tree weighting method: basic properties. IEEE Trans. Inf. Theory, 1995.
- Witten et al. (1987) Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Commun. ACM, 1987.
- Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences. In NeurIPS, 2020.