License: CC BY 4.0
arXiv:2407.06645v3 [cs.LG] 11 Jul 2024

Entropy Law: The Story Behind Data Compression and LLM Performance

Mingjia Yin, Chuhan Wu*†, Yufei Wang, Hao Wang*†,

Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen

State Key Laboratory of Cognitive Intelligence & University of Science and Technology of China

Noah’s Ark Lab, Huawei

{mingjia-yin, wanghao3, cheneh}@ustc.edu.cn
{wuchuhan1, tangruiming}@huawei.com
*Corresponding authors. †Equal contribution.
Abstract

Data is the cornerstone of large language models (LLMs), but not all data is useful for model learning. Carefully selected data can better elicit the capabilities of LLMs with much less computational overhead. Most methods concentrate on evaluating the quality of individual samples in data selection, while the combinatorial effects among samples are neglected. Even if each sample is of perfect quality, their combinations may be suboptimal in teaching LLMs due to their intrinsic homogeneity or contradiction. In this paper, we aim to uncover the underlying relationships between LLM performance and data selection. Inspired by the information compression nature of LLMs, we uncover an "entropy law" that connects LLM performance with data compression ratio and first-epoch training loss, which reflect the information redundancy of a dataset and the mastery of inherent knowledge encoded in this dataset, respectively. Through both theoretical deduction and empirical evaluation, we find that model performance is negatively correlated with the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose an efficient and universal data selection method named ZIP for training LLMs, which aims to prioritize data subsets exhibiting a low compression ratio. Based on a multi-stage algorithm that selects diverse data in a greedy manner, we can obtain a good data subset with satisfactory diversity. Extensive experiments have been conducted to validate the entropy law and the superiority of ZIP across different LLM backbones and alignment stages. We also present an interesting application of the entropy law that can detect potential performance risks at the beginning of model training.
¹ Code can be found at https://github.com/USTC-StarTeam/ZIP.

1 Introduction

In recent years, Large Language Models (LLMs) have gained significant attention from both academia and industry, applied in various domains, such as chatbots (Ouyang et al., 2022; Achiam et al., 2023), chemistry tools (M. Bran et al., 2024), and programming assistants (GitHub, 2020). The great success of LLMs depends on their general intelligence obtained from a vast amount of data collected from various sources (Albalak et al., 2024; Wang et al., 2023c). Through pretraining on trillions of tokens to master diverse knowledge and tuning on smaller instruction data to align models with human preference, LLMs can effectively utilize their knowledge to follow user instructions, do commonsense reasoning, and solve real-world problems (Zhao et al., 2023).

However, not all data are useful for teaching LLMs, especially when computational resources are limited (Albalak et al., 2024). For example, we can better elicit the capability of LLMs by fine-tuning them on carefully curated samples rather than a large but noisy data collection (Ouyang et al., 2022; Chowdhery et al., 2023; Meta, 2020; Zhou et al., 2023). However, selecting the proper data for LLM training is quite complicated and abstruse, since the space of data preprocessing and combination is almost unlimited. Due to the huge computational overhead of LLM training, manual or empirical data selection based on trial-and-error feedback is rather cumbersome and even impractical. Therefore, automatic data selection methods are necessary for LLM development under limited computational budgets.

Intuitively, high-quality samples are expected to have better efficiency in teaching LLMs. For example, the successful practice of LIMA (Zhou et al., 2023) shows the powerful effect of data quality on LLM performance that can surpass the amount of data. Therefore, existing methods usually focus on quality-oriented data selection, based either on heuristic rules (Raffel et al., 2020; Rae et al., 2021; Xie et al., 2023; Chowdhery et al., 2023; Li et al., 2023) or evaluation models (Wettig et al., 2024; Chen et al., 2023; Lu et al., 2023; Liu et al., 2023; Cui et al., 2023). Heuristic methods typically involve hand-crafted rules (e.g., sentence number (Raffel et al., 2020), word count (Rae et al., 2021), length (Shen, 2024)) to evaluate data across multiple dimensions. Model-based approaches, on the contrary, rely on well-established LLMs such as GPT-4 (Achiam et al., 2023) to provide quality assessments of training samples in different views, such as direct scoring (Chen et al., 2023), task tagging (Lu et al., 2023), and pairwise scoring (Liu et al., 2023). However, most of these approaches evaluate different data samples independently, which neglects the intricate combinatorial effects among samples. As illustrated in Figure 1, even if each sample is in perfect quality, their combinations may still be suboptimal due to their mutual information redundancy or inconsistency. Although the quality-based subset is composed of all three good samples, the knowledge they encode is actually redundant and conflicting. In contrast, another data subset composed of several relatively lower-quality but diverse samples may convey more information than the above subset in the teaching of LLMs. Therefore, quality-based data selection does not fully align with the goal of maximizing the knowledge mastery of LLMs.

Figure 1: An illustrative example describing different data selection paradigms. Quality-based data selection relies on sample-level data quality measurements while overlooking combinatorial effects among samples. Information-amount-based selection aims to select samples maximizing the overall information amount.

In many recent studies, researchers have shown that the basic mechanism of auto-regressive language modeling in LLMs is information compression (Delétang et al., 2023; Huang et al., 2024). Thus, the knowledge condensed by LLMs actually depends on the effective information encoded by training data. This intuition opens another direction of data selection, i.e., based on the effective information amount of data. In this paper, we uncover the underlying relations between LLM performance and data homogeneity, which can be measured by various canonical lossless compression algorithms (e.g., DEFLATE in ZIP). Through both theoretical analysis and empirical experiments, we formulate the "entropy law", which shows that the compression ratio of training data is a decisive factor affecting model performance, if the overall quality and consistency of selected samples remain unchanged. Motivated by the entropy law, we propose an effective and efficient data selection algorithm called ZIP to select heterogeneous data with a low compression ratio, which aims to maximize the amount of effective information for LLM learning. Specifically, we devise a multi-stage greedy strategy to find an approximate solution that guarantees a low compression ratio without exhausting all possible combinations, and it iterates continuously until we obtain a predetermined number of samples. In each iteration, ZIP performs preliminary filtering to choose a smaller pool of candidates, and then selects a few samples from the reduced pool that minimize the compression ratio of the selected dataset in a cascaded manner. By training LLMs on a collection of diverse samples that encode heterogeneous and complementary information, the capabilities of LLMs can be better elicited. Extensive experiments on different LLM backbones at different stages of LLM alignment demonstrate the superiority of ZIP over various quality-based baselines. We also present an interesting application of the entropy law that can detect potential performance risks at the beginning of model training, which can effectively reduce the computational overhead in LLM development.

2 Related Works

2.1 Language Modeling and Information Compression

The relationship between language modeling and data compression has long intrigued researchers (Shannon, 1948, 1951). Pandey (2024) has identified a data-dependent scaling law that takes data's gzip compressibility into consideration. Besides, recent empirical studies have confirmed that language models can act as versatile data compressors (Delétang et al., 2023), and that the intelligence of LLMs can be quantified by their capacity for text compression (Huang et al., 2024). Let a text corpus be generated from an underlying distribution $\rho$. A lossless compression algorithm $\mathcal{C}$ is then expected to encode a text sequence $x_{1:n}$ into a bitstream $\mathcal{C}(x_{1:n})$ of minimal length, ensuring that $x_{1:n}$ can be perfectly recovered from $\mathcal{C}(x_{1:n})$. The expected number of bits of an optimal $\mathcal{C}$ is equal to $\mathbb{E}_{x\sim\rho}[-\log_2 \rho(x)] = \mathbb{E}_{x\sim\rho}[-\sum_{i=1}^{n}\log_2 \rho(x_i \mid x_{1:i-1})]$ (Shannon, 1948). The underlying distribution $\rho$ is usually unknown in reality, but it can be estimated by a language model $\rho_{\text{model}}$. Then the expected number of bits of an optimal $\mathcal{C}$ can be updated as:

$$\mathbb{E}_{x\sim\rho}\Big[-\sum_{i=1}^{n}\log_2 \rho_{\text{model}}(x_i \mid x_{1:i-1})\Big]. \qquad (1)$$

Equation 1 is the cross-entropy loss employed in training LLMs, thereby establishing a coherent relationship between LLMs and information compression. This foundational insight paves the way for this work.
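To make this coding view concrete, the short sketch below compares the ideal code length implied by Equation 1, computed from a model's per-token probabilities, with the length produced by a generic DEFLATE compressor. The text and the probability values are hypothetical placeholders rather than outputs of any particular language model.

```python
# Minimal sketch of Eq. (1): an arithmetic coder driven by a language model needs
# about -sum_i log2 p_model(x_i | x_{1:i-1}) bits, i.e. the cross-entropy in bits.
# The per-token probabilities below are hypothetical placeholders.
import math
import zlib

text = "the cat sat on the mat"
token_probs = [0.20, 0.05, 0.10, 0.30, 0.25, 0.15]  # assumed p_model(x_i | x_{1:i-1})

model_bits = -sum(math.log2(p) for p in token_probs)         # ideal code length under the model
deflate_bits = 8 * len(zlib.compress(text.encode("utf-8")))  # a generic lossless compressor

print(f"model code length  : {model_bits:.1f} bits")
print(f"DEFLATE code length: {deflate_bits} bits")
```

A model that assigns higher probability to the actual next tokens yields a shorter code, which is exactly what a lower cross-entropy loss means.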

2.2 Alignment of Large Language Models

Large Language Models (LLMs) have recently gained significant attention from academia and industry. LLM alignment, which includes supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF), has emerged as a crucial technique for adapting LLMs to end tasks using natural language instructions (Zhao et al., 2023; Wang et al., 2023c). Alignment is performed using instruction datasets consisting of multiple (Instruction, Output) pairs, which require LLMs to follow the instructions and generate corresponding outputs. Early explorations have focused on constructing or expanding instruction datasets through methods such as crowd-sourcing (Wang et al., 2022; Köpf et al., 2024), self-instruction (Taori et al., 2023; Peng et al., 2023; Wang et al., 2023b), or the combination of existing datasets (Wang et al., 2023a; Ivison et al., 2023). Fine-tuned LLMs on these datasets have demonstrated promising capabilities to adhere to instructions across various contexts and align with human expectations.

2.3 Data Selection for LLM Alignment

A growing body of research has emphasized the importance of selecting appropriate data for LLM alignment, which can prevent potential quality issues and optimize computational resource allocation. As a prominent example, LIMA (Zhou et al., 2023) has demonstrated superior performance by carefully crafting only 1,000 high-quality samples for SFT, highlighting the crucial importance of data quality. The current literature on selecting alignment data has focused on selecting samples according to individual sample quality, which can be categorized into heuristic methods (Shen, 2024) and model-based methods (Chen et al., 2023; Lu et al., 2023; Liu et al., 2023; Li et al., 2023, 2024; Du et al., 2023). Heuristic methods typically employ specific criteria, such as response length (Shen, 2024), to guide data selection. On the other hand, model-based methods adopt various strategies to leverage the capabilities of established language models for evaluating sample quality. For example, IFD (Li et al., 2023) measures the change in response loss when instructions are removed, and selects those with the most significant changes. Building upon IFD, SuperFiltering (Li et al., 2024) introduces a lightweight proxy model for a more efficient calculation of the IFD score. In addition, other model-based methods employ proprietary LLMs to assess data quality. In a pioneering work, AlpaGasus (Chen et al., 2023) uses ChatGPT directly to assign data quality scores to samples, while #InsTag (Lu et al., 2023) proposes assigning tags to each sample using ChatGPT and evaluates sample quality based on the number of tags. DEITA (Liu et al., 2023) uses ChatGPT-generated data to train two Llama-based scorers, assigning complexity and quality scores to each sample, and ultimately selecting samples with the highest hybrid scores. However, existing methods are mainly designed to pick data based on sample-wise quality measurements, which are usually weak in reflecting the overall dataset quality. In this paper, we focus on the relation between performance and dataset quality, which can be efficiently measured by data compression metrics.

3 Entropy Law: Connecting Model Performance with Data Compression

In this section, we provide some theoretical analysis of the relations between data compression and LLM performance. Intuitively, the correctness and diversity of the training data would affect the performance of the final model. Meanwhile, the performance of an LLM may be suboptimal if the data have severe intrinsic conflicts or the model has poor mastery of the information encoded by the data, which can be indicated by the training loss. Based on these assumptions, we denote the performance of an LLM as $Z$, which is expected to be influenced by the following factors:

  • Data compression ratio $R$: This metric can be derived by dividing the pre-compression data size by the post-compression size, which can be computed by various off-the-shelf compression algorithms (see the short sketch after this list). Intuitively, a dataset with a lower compression ratio indicates a higher information density.

  • Training loss $L$: Indicates whether the data are hard for the model to memorize. Given the same base model, a high training loss is usually due to noisy or inconsistent information in the dataset. In practice, the average loss of a small number of training steps in the first training epoch is sufficient to produce an indicative $L$ value, so that the model does not overfit the data.

  • Data consistency $C$: The consistency of data is reflected by the entropy of the probability of the next token given the previous contexts. Higher data consistency usually yields a lower training loss. The performance of LLMs is usually suboptimal if the dataset has poor consistency.² (² Assume we have two question-answer pairs $(q_1, a_1)$ and $(q_2, a_2)$. When an LLM updates on each QA pair, it learns the mutual information (MI) between the question and answer, i.e., $I(q_1; a_1)$ and $I(q_2; a_2)$. When it updates on both, it learns the joint MI $I(q_1 q_2; a_1 a_2)$. If $q_1$ and $q_2$ are not independent, we have $I(q_1 q_2; a_1 a_2) < I(q_1; a_1) + I(q_2; a_2)$, whose detailed derivation can be found in Appendix A. This implies that the total knowledge learned by LLMs is narrowed if the answers to similar questions are highly inconsistent.)

  • Average data quality $Q$: This reflects the average sample-level quality of the data, which can be measured through various objective and subjective aspects.
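As a concrete reference for the compression ratio $R$ above, the following minimal sketch measures it with DEFLATE via Python's zlib; the two toy sample lists are purely illustrative and are not drawn from the datasets used in this paper.

```python
# Minimal sketch of the compression ratio R: raw bits divided by compressed bits,
# using DEFLATE (zlib). The toy corpora below are illustrative only.
import zlib

def compression_ratio(samples):
    """Bits of the raw corpus divided by bits of its DEFLATE-compressed form."""
    raw = "\n".join(samples).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, level=9))

redundant = ["How do I sort a list in Python?"] * 100   # highly homogeneous corpus
diverse = [f"Question {i}: explain topic {i} in one sentence." for i in range(100)]

print(compression_ratio(redundant))  # high ratio -> low information density
print(compression_ratio(diverse))    # lower ratio -> higher information density
```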

Given a certain amount of training data, the model performance can be estimated by the above factors:

$$Z \propto f(R, L, C, Q), \qquad (2)$$

where $f$ is a hidden function. Given a specific base model, the scale of $L$ usually depends on $R$ and $C$, which can be formulated as:

$$L \propto g(R, C). \qquad (3)$$

$L$ is expected to be monotonic in $R$ and $C$, since a dataset with higher homogeneity or better data consistency is easier for a model to learn. Thus, we can rewrite the above formula as follows:

$$C \propto g'(R, L), \qquad (4)$$

where $g'$ is an inverse function. By combining the three equations above, we have:

$$Z \propto f(R, L, g'(R, L), Q) \propto h(R, L, Q), \qquad (5)$$

where $h$ is another hidden function. If a data selection method does not substantially change the average data quality $Q$, we can approximately regard the variable $Q$ as a constant. Therefore, the final performance can be roughly formulated as follows:

$$Z \propto h'(R, L), \qquad (6)$$

which means that the model performance is correlated with the data compression ratio and the training loss. We name this relationship the "entropy law".

We can draw two deductions from the entropy law:

  • If we further regard the data consistency as a constant, the training loss is directly influenced by the compression ratio (Eq. 3). Thus, the model performance is controlled by the compression ratio: $Z$ is usually worse if the data compression ratio $R$ is higher, which will be validated by our experiments.

  • Given the same compression ratio $R$, a higher training loss means lower data consistency. Thus, the effective knowledge learned by the model may be more limited. This can be used to predict the performance of an LLM on different data with similar compression ratios and sample qualities. We will show later the application of this deduction in our practice.

Notably, the entropy law reveals a coherent connection between downstream model performance and data compression ratio, setting it apart from the previously proposed data-dependent scaling law by Pandey (2024). Building upon the entropy law, we derive a data selection algorithm in Section 4 and demonstrate its application in practical large-scale LLM development in Section 5.3.

4 ZIP: Lightweight Data Selection for LLM Alignment

Guided by the findings of the entropy law, we propose an effective and efficient method named ZIP to select data samples based on data compression ratios, which aims to maximize the amount of effective information given a limited training data budget. Although there exists a subset with the lowest compression ratio, it is impractical to find it due to the huge combination space of data samples. Thus, we propose an iterative multi-stage greedy algorithm to efficiently obtain an approximate solution with a relatively low compression ratio. In each iteration, we first use a global selection stage to choose a pool of candidate samples that have low compression ratios, which aims to find samples with high information density. We then employ a coarse-grained local selection stage that incorporates a smaller set of samples with the lowest redundancy with respect to the already selected samples. Finally, we use a fine-grained local selection stage that minimizes the similarity among the samples to be added. The above process is repeated until we obtain a sufficient amount of data. The workflow of our method is summarized in Algorithm 1, whose details are introduced as follows.

4.1 Global Selection

In general, we maintain an information redundancy state $\pi_{\mathcal{D}}$ that evaluates the "information gain" of each sample. Intuitively, data with high intra-sample information redundancy are unlikely to have good global diversity. For example, a sample with repeated patterns or echoed conversation turns usually has low educational value in LLM training. Thus, we initialize this state by calculating the sample-level compression ratio for the entire dataset $\mathcal{D}$. In each iteration, we select the $K_1$ samples with the lowest scores in $\pi_{\mathcal{D}}$ to form an initial candidate pool $\mathcal{D}_{K_1}$, which provides a good set for subsequent local selection.

4.2 Local Coarse-grained Selection

Since the global selection does not well consider the mutual relations among samples, we further conduct local selection to pick diverse samples. To ensure good computational efficiency, we introduce a coarse-grained selection phase to narrow the candidate pool into a smaller one with $K_2$ samples. We first compute the compression ratio of a merged set that adds each sample in $\mathcal{D}_{K_1}$ to the selected set $\mathcal{D}'$. We use this score to update the information redundancy state $\pi_{\mathcal{D}}$ to better indicate the current information gain of these samples. Based on the scores of the samples in $\mathcal{D}_{K_1}$, we select the $K_2$ samples with the lowest scores. These samples form a small subset for final fine-grained selection, where each sample has good diversity with respect to the selected dataset $\mathcal{D}'$.

Algorithm 1 Pseudo code of ZIP
Require: The original dataset $\mathcal{D}$ of size $N$; a data compressor $\mathcal{C}$; the number of selected samples $m$; the number of compression ratios to be updated in the global selection $K_1$; the data pool size of local selection $K_2$; the data budget of local selection $K_3$; the compression ratio function $g(\mathcal{C}(D)) = \frac{\text{Bits}(D)}{\text{Bits}(\mathcal{C}(D))}$, which measures the compression ratio by the number of bits.
Ensure: A data subset $\mathcal{D}'$ of size $m$ with a relatively low compression ratio.
1: Init $\mathcal{D}'$ as an empty set $\Phi$
2: Init $\pi_{\mathcal{D}} = \{g(d_1), g(d_2), \dots, g(d_N)\}$ by calculating the compression ratio of each sample in $\mathcal{D}$
3: while $|\mathcal{D}'| < m$ do
4:     // Stage 1: Global Selection
5:     $\mathcal{D}_{K_1}$ = Bottom-K($\mathcal{D} \setminus \mathcal{D}'$, $\pi_{\mathcal{D} \setminus \mathcal{D}'}$, $K_1$)   // select the $K_1$ samples with the lowest scores
6:     // Stage 2: Local Coarse-grained Selection
7:     $\pi_{K_1} = \{g(\mathcal{D}' \cup \{d\}) \mid d \in \mathcal{D}_{K_1}\}$   // compression ratios of the union of $\mathcal{D}'$ and each candidate $d$
8:     Update the corresponding values in $\pi_{\mathcal{D}}$ with $\pi_{K_1}$
9:     $\mathcal{D}_{K_2}$ = Bottom-K($\mathcal{D}_{K_1}$, $\pi_{K_1}$, $K_2$)   // select the $K_2$ samples with the lowest compression ratios
10:    // Stage 3: Local Fine-grained Selection
11:    Init $\mathcal{D}_{K_3}$ as an empty set $\Phi$
12:    while $|\mathcal{D}_{K_3}| < K_3$ do
13:        $\pi_{K_2} = \{g(\mathcal{D}_{K_3} \cup \{d\}) \mid d \in \mathcal{D}_{K_2}\}$
14:        $d^* = \operatorname{arg\,min}_{d} \pi_{K_2}$
15:        $\mathcal{D}_{K_3} = \mathcal{D}_{K_3} \cup \{d^*\}$
16:        $\mathcal{D}_{K_2} = \mathcal{D}_{K_2} \setminus \{d^*\}$
17:    end while
18:    // Update the selected dataset
19:    $\mathcal{D}' = \mathcal{D}' \cup \mathcal{D}_{K_3}$
20: end while

4.3 Local Fine-grained Selection

Although the above stage ensures that the candidate pool has distinct information from the selected set, the information redundancy among the samples within this pool is not measured. Thus, we aim to further pick samples from this subset that are diverse from each other. Concretely, we initialize a local selected set $\mathcal{D}_{K_3}$, and compute the compression ratio of the union of $\mathcal{D}_{K_3}$ and each sample in $\mathcal{D}_{K_2}$. We add the sample with the lowest compression ratio into $\mathcal{D}_{K_3}$ and remove it from $\mathcal{D}_{K_2}$. By repeating this process, we obtain a small subset that contains samples not only different from the selected set $\mathcal{D}'$ but also distinct from each other. We conduct the three stages above until the size of $\mathcal{D}'$ reaches a predefined data budget $m$. The entire selection process is easy to implement since it is model-free, and it can be accelerated using multiple threads. It can select data efficiently and effectively from a large candidate pool for high-quality LLM training.
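For readers who prefer runnable code to pseudocode, the snippet below is a compact Python sketch of the three-stage greedy loop in Algorithm 1, using zlib (DEFLATE) as the compressor $\mathcal{C}$. The helper names, the toy pool, and the default values of $K_1$, $K_2$, and $K_3$ are illustrative assumptions, not the configuration used in our experiments.

```python
# Compact sketch of the ZIP selection loop (Algorithm 1) with zlib as the compressor.
# Pool contents and K1/K2/K3 values are placeholders, not the paper's settings.
import zlib

def ratio(samples):
    raw = "\n".join(samples).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, level=9))

def zip_select(pool, m, k1=50, k2=20, k3=5):
    selected, remaining = [], list(pool)
    scores = {d: ratio([d]) for d in remaining}  # init pi_D with sample-level ratios
    while len(selected) < m and remaining:
        # Stage 1: global selection -- K1 candidates with the lowest current scores
        cand1 = sorted(remaining, key=scores.get)[:k1]
        # Stage 2: local coarse-grained selection -- ratio of the selected set union each candidate
        for d in cand1:
            scores[d] = ratio(selected + [d])
        cand2 = sorted(cand1, key=scores.get)[:k2]
        # Stage 3: local fine-grained selection -- greedily grow a small, mutually diverse batch
        batch = []
        while len(batch) < k3 and cand2:
            best = min(cand2, key=lambda d: ratio(batch + [d]))
            batch.append(best)
            cand2.remove(best)
        selected.extend(batch)
        remaining = [d for d in remaining if d not in batch]
    return selected[:m]

# Toy usage: select 20 samples from a small synthetic pool.
pool = [f"instruction {i % 7}: do task {i}" for i in range(200)]
print(len(zip_select(pool, m=20)))
```

As noted above, the per-candidate compression calls are independent of each other, so they can be parallelized across CPU threads in a practical implementation.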

5 Experiments

ZIP is content-agnostic and model-free, making it suitable for various stages of LLM alignment. We systematically evaluate the effectiveness of ZIP through experiments conducted in the SFT and RLHF stages, as described in Sections 5.1 and 5.2, respectively. Subsequently, Section 5.3 presents an in-depth analysis to empirically support the proposed entropy law, including a practical application guided by this law.

5.1 Data Selection for SFT

5.1.1 Setup

Data Pool & Data Selection We follow DEITA (Liu et al., 2023) to establish a large-scale data pool comprising 300K high-quality samples obtained from WizardLM (Xu et al., 2023), ShareGPT (Chiang et al., 2023), and UltraChat (Ding et al., 2023). Subsequently, various data selection techniques are employed to extract a subset of this pool for LLM instruction tuning. Notably, previous studies controlled the data budget by limiting the number of instances, whereas we managed the total token count to ensure a fair allocation of the compute budget among all methods. To achieve this, we initially select 10,000 samples using ZIP and calculate the corresponding token count. Then, we apply other methods to continue data selection until the required token count is reached.

Training & Evaluation We fine-tune Mistral-7B (Jiang et al., 2023) and Llama-3-8B (Meta, 2020) on the selected dataset. Other training details can be found in Appendix B. As for evaluation, we adopt MT-bench (Zheng et al., 2023) as our benchmark. Specifically, MT-bench is a challenging multi-turn question set with LLM judgments to evaluate model responses, which exhibits a high level of alignment with human preferences.

Baselines We select baselines from two groups. The first group includes heuristic methods: (1) Random, which randomly selects instances from the data pool to verify the fundamental effectiveness of other methods; (2) Cluster, which adopts K-means clustering based on the sample representations and selects cluster centroids; (3) Perplexity, which selects the samples with the highest training loss. The second group of baselines includes model-based methods: (1) DEITA (Liu et al., 2023), which employs ChatGPT-generated data to train a Llama-based data complexity evaluator and a quality evaluator, and selects samples with the highest hybrid scores; (2) SuperFiltering (Li et al., 2024), which assesses each sample by calculating the change in response loss upon instruction removal and introduces a lightweight proxy model to calculate the score more efficiently.

5.1.2 Results

Table 1: Model performance comparison between different data selection baselines based on Mistral-7B and Llama-3-8B on the SFT stage. We also provide the computational cost, the average token length of selected data, and the average data quality produced by AlpaGasus (Chen et al., 2023). The best results are bolded, and the second-best numbers are underlined. † means CPU-only methods.

Model | MT-bench ↑ | Cost ↓ | Avg. length | Quality
Mistral-7B-based models with SFT
Random† | 6.85 | 10s | 976 | 4.08
Cluster | 6.91 | 15h | 970 | 4.05
Perplexity | 6.89 | 8h | 981 | 4.09
SuperFiltering | 6.12 | 14h | 1579 | 4.10
DEITA | 6.82 | 21h | 2048 | 4.03
ZIP† | 7.08 | 4.5h | 543 | 4.00
Llama-3-8B-based models with SFT
Random† | 7.16 | 10s | 892 | 4.08
Cluster | 7.18 | 16h | 886 | 3.95
Perplexity | 7.09 | 9h | 895 | 3.96
SuperFiltering | 6.59 | 14h | 1481 | 3.99
DEITA | 7.11 | 21h | 2048 | 4.09
ZIP† | 7.28 | 4.5h | 470 | 4.00

Main comparison We compare ZIP with various data selection methods based on Mistral-7B and Llama-3-8B, and the results are presented in Table 1. ZIP outperforms other data selection approaches on all backbones, which can be attributed to ZIP’s ability to model the complex combinatorial effects among samples. Furthermore, our observations indicate that model-based data selection methods often fail to produce satisfactory outcomes when a fixed token number is given. This is because the sample-level evaluations are not updated correspondingly after selecting some samples, leading to biased evaluations for the remaining samples. Additionally, some of these methods adopt strategies to enhance data diversity, such as DEITA, which controls the representation distances of selected samples. However, these strategies only provide a rough assessment of the combinatorial effects within the representation space, since semantic distances do not necessarily reflect information redundancy.

Selection bias in sample length across different strategies We also provide the average length of tokenized samples in Table 1. The average token length of Random provides an estimation for the entire data pool, which is used to analyze other methods. From the table, we can observe that Cluster and Perplexity exhibit similar selection preferences as Random. Additionally, DEITA and SuperFiltering predominantly select lengthy data samples. This bias may stem from the LLMs' inclination toward generating longer responses (Saito et al., 2023). However, given the limited budget of selected tokens, choosing excessively lengthy data will reduce the information density and degrade the capabilities of models trained on such data. In contrast, ZIP tends to select shorter samples. Furthermore, we plot the token length distribution of these methods, as depicted in Figure 2 and Figure 6. Consistent with the previous results, we can observe similar distributions for Random, Cluster, and Perplexity. The token length distributions of DEITA and SuperFiltering are severely skewed, deviating greatly from the original data distribution. In contrast to these model-based approaches, ZIP exhibits no bias toward selecting lengthy samples.

Cost comparison of different strategies We provide a detailed cost analysis of each method in Table 1. Except for the Random method, ZIP required the least time to complete the data selection process, demonstrating greater efficiency than other methods. Notably, ZIP’s computations are entirely executed on CPUs, resulting in significant cost savings. Furthermore, ZIP is independent of proprietary LLMs used by DEITA or the proxy model employed by Cluster, Perplexity, and SuperFiltering. This model-free characteristic endows ZIP with notable efficiency and versatility.

Figure 2: The distribution of average token number across datasets selected by different algorithms for Mistral-7B. Panels: (a) ZIP, (b) Random, (c) Diversity, (d) Perplexity, (e) DEITA, (f) SuperFiltering.

Selected data quality of different strategies We have followed AlpaGasus (Chen et al., 2023) to evaluate the quality of each data sample in the selected datasets by prompting ChatGPT, with the quality scores ranging from 0 to 5. The quality scores of multi-turn samples are the average scores of each turn. The results are presented in Table 1. Surprisingly, the quality scores of the selected datasets are highly similar, even with significant differences in selection mechanisms. This may suggest that the average quality distribution remains relatively uniform in the original data pool. Notably, even SOTA model-based methods like DEITA (Liu et al., 2023) and SuperFiltering (Li et al., 2024) select data with similar quality scores, potentially contradicting their original conclusions. We posit that this discrepancy stems from the setting of the data budget, which is controlled by the number of samples in prior studies. Considering the selection bias discussed above, these methods tend to select lengthy samples, resulting in a significantly higher token count compared with baselines. For instance, under this setting, data selected by DEITA would possess 2.7 times the number of tokens compared to ZIP. However, we argue it is fairer to control the data budget by the token count, since it guarantees a similar compute budget among all methods.³ (³ In practical implementation, the training steps of all methods are almost equal by employing the packing technique detailed in Axolotl.)

5.2 Data Selection for RLHF

Table 2: Model performance comparison between different data selection baselines based on Llama-3-8B on the RLHF stage. We also provide the computational cost of each method.
Model | MT-bench ↑ | Cost | Avg. length
Base | 7.18 | NA | NA
Random† | 7.33 | 5s | 464
Score | 7.30 | NA | 489
ZIP† | 7.42 | 1.1h | 357

5.2.1 Setup

Data Pool & Data Selection The data pool used for preference alignment is a cleaned version of UltraFeedback (Cui et al., 2023; Bartolome et al., 2023), which consists of around 60k samples in the form of a "chosen-rejected" pair. Similarly to the SFT stage, we ensure each data selection method selects data with an approximately equal token count. Since a "chosen-rejected" data pair encompasses two data points, we select 5,000 data pairs with ZIP and then apply other methods to select data with the corresponding token budget.

Figure 3: Entropy law demonstration of Mistral-7B. Panels: (a) entropy law w.r.t. compression ratio, (b) entropy law w.r.t. training loss. The entropy law curve is fitted with the results of different methods.

Figure 4: Entropy law curve of Llama-3-8B. Panels: (a) entropy law w.r.t. compression ratio, (b) entropy law w.r.t. training loss. The entropy law curve is fitted with the results of different methods.

Training & Evaluation Building upon the model previously fine-tuned with SFT, we further refine it using RLHF. In particular, we employ Kahneman-Tversky Optimization (KTO) (Ethayarajh et al., 2024) for preference alignment, a novel method that shows promising potential in aligning preferences. Additional training details can be found in Appendix B. For evaluation, we continue to utilize MT-bench (Zheng et al., 2023) as our benchmark to assess the capabilities of LLMs fine-tuned with data selected using diverse data selection strategies.

Baselines We compare ZIP with the following baselines: (1) Random, which randomly samples some "chosen-rejected" pairs from the data pool. (2) Score, which selects the "chosen-rejected" pairs with the highest "chosen-scores". These scores are obtained through LLM evaluation of the response quality (Cui et al., 2023; Bartolome et al., 2023).

5.2.2 Main results

Table 2 presents the results of different data selection strategies on the preference alignment stage of LLMs. Similar to the SFT stage, models aligned with data selected by ZIP can yield the best downstream performance, demonstrating the necessity for modeling the combinatorial effects. Besides, we find Score and Random are on par with each other, even though the selection process of Score is far more expensive than Random. This is unsurprising, as Score does not consider the combinatorial effects, which may limit the knowledge amount of the selected dataset.

5.3 Empirical Validation of Entropy Law

Figure 5: Practical application of the entropy law in incremental training data updates, where $x_1, x_2, x_3, x_4, x_5$ are five data versions.

In this section, we aim to demonstrate the proposed entropy law. Specifically, we plot the model performance of Mistral-7B and Llama-3-8B with respect to the data compression ratio and training loss in Figures 3 and 4, respectively. Besides, we plot entropy-law curves by fitting the results. From the two figures, we can draw the following analysis:

Relationship between model performance, data compression ratio, and training loss In Figure 3(a) and Figure 4(a), LLMs trained on data with a lower compression ratio typically exhibit enhanced performance. Since the learning process of LLMs is highly relevant to information compression, we can regard LLMs as data compressors. Then data with a lower compression ratio carry a higher knowledge amount, which is more beneficial to the compressors. Besides, a lower compression ratio usually corresponds to a higher training loss, as illustrated in Figures 3(b) and 4(b). This is because data resistant to compression carry more knowledge, posing a greater challenge for LLMs to absorb the encapsulated knowledge.
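The entropy-law curves in Figures 3 and 4 are obtained by fitting the per-method results. A minimal sketch of such a fit is shown below; the (compression ratio, MT-bench) pairs are hypothetical stand-ins for the measured values, and the low-order polynomial is an assumed functional form rather than the exact one used for the figures.

```python
# Minimal sketch of fitting an entropy-law style curve to (compression ratio, score)
# pairs. The data points are hypothetical, not the paper's measurements.
import numpy as np

ratios = np.array([3.2, 3.4, 3.5, 3.7, 3.9])   # hypothetical data compression ratios
scores = np.array([7.1, 7.0, 6.9, 6.8, 6.6])   # hypothetical MT-bench scores

curve = np.poly1d(np.polyfit(ratios, scores, deg=2))  # assumed low-order polynomial fit

print(curve(3.3))                          # predicted score at an unseen compression ratio
print(np.corrcoef(ratios, scores)[0, 1])   # negative correlation, as the entropy law suggests
```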

Model performance interpretation with entropy law Considering the three methods with comparable compression ratios and training loss, namely Random, Cluster, and Perplexity, the corresponding model performances are close. This phenomenon may seem counter-intuitive, given the distinct criteria used for data selection. However, it aligns with the predictions of our proposed entropy law: when the average data quality, training loss, and data compression ratio are similar, the model performance is expected to be comparable as well. Thus, the entropy law has the potential to serve as a criterion for predicting the model’s performance on data, thereby guiding the training of LLMs.

Practical application of entropy law Incremental version update of training data is a common setting in practical LLM development. Usually, the training data amount remains relatively stable, with only a minor portion undergoing modification. We have conducted incremental training data update experiments in real scenarios, with results depicted in Figure 5. Due to confidentiality, only the relative order of the results is provided. Guided by the entropy law, assuming the data quality $Q$ does not significantly decay after each incremental update, we can expect a model performance improvement with a decreased data compression ratio. This expectation is supported by the results from data versions $x_1$ to $x_4$ in Figure 5, since their compression ratios are decreased after each incremental update. However, the data version $x_5$ exhibits an abnormal increase in the loss and data compression ratio, which serves as an early indicator of potential model performance degradation due to a decline in training data consistency. This prediction is further confirmed by subsequent post-training model performance evaluations, as illustrated in Figure 5. Thus, the entropy law can be utilized as a guideline for LLM training to identify potential risks of experimental failure without training the model on the full dataset until convergence. This is particularly significant given the substantial costs associated with training an LLM.

6 Conclusion

In this paper, we delve deeply into the data selection problem from a data compression perspective. Inspired by the insight that language modeling is performing information compression, we propose an entropy law delineating the coherent relationship between model performance, data compression ratio, and training loss. Theoretically guided by the entropy law, we propose a new data selection algorithm, ZIP, to select data with a nearly lowest compression ratio, which is model-free and content-agnostic, rendering it significantly lightweight and versatile. Experimental results have demonstrated the effectiveness and efficiency of ZIP on various LLM backbones during the SFT and RLHF stages. Further in-depth analysis provides empirical evidence for the entropy law, which could serve as a criterion for LLM performance prediction on specific data.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • Albalak et al. (2024) Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang. 2024. A Survey on Data Selection for Language Models. arXiv:2402.16827 [cs.CL]
  • Bartolome et al. (2023) Alvaro Bartolome, Gabriel Martin, and Daniel Vila. 2023. Notus. https://github.com/argilla-io/notus.
  • Chen et al. (2023) Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. 2023. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701 (2023).
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023).
  • Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
  • Cui et al. (2023) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377 (2023).
  • Delétang et al. (2023) Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. 2023. Language Modeling Is Compression. CoRR abs/2309.10668 (2023).
  • Ding et al. (2023) Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. In EMNLP. Association for Computational Linguistics, 3029–3051.
  • Du et al. (2023) Qianlong Du, Chengqing Zong, and Jiajun Zhang. 2023. Mods: Model-oriented data selection for instruction tuning. arXiv preprint arXiv:2311.15653 (2023).
  • Ethayarajh et al. (2024) Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306 (2024).
  • GitHub (2020) GitHub. 2020. GitHub Copilot. https://github.com/features/copilot/
  • Huang et al. (2024) Yuzhen Huang, Jinghan Zhang, Zifei Shan, and Junxian He. 2024. Compression Represents Intelligence Linearly. arXiv preprint arXiv:2404.09937 (2024).
  • Ivison et al. (2023) Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew E. Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023. Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2. CoRR abs/2311.10702 (2023).
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR abs/2310.06825 (2023).
  • Köpf et al. (2024) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. 2024. Openassistant conversations-democratizing large language model alignment. Advances in Neural Information Processing Systems 36 (2024).
  • Li et al. (2024) Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, and Tianyi Zhou. 2024. Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning. CoRR abs/2402.00530 (2024).
  • Li et al. (2023) Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. 2023. From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning. CoRR abs/2308.12032 (2023).
  • Liu et al. (2023) Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, and Junxian He. 2023. What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning. CoRR abs/2312.15685 (2023).
  • Lu et al. (2023) Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2023. # InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models. In The Twelfth International Conference on Learning Representations.
  • M. Bran et al. (2024) Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. 2024. Augmenting large language models with chemistry tools. Nature Machine Intelligence (2024), 1–11.
  • Meta (2020) Meta. 2020. Llama3. https://ai.meta.com/blog/meta-llama-3/
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
  • Pandey (2024) Rohan Pandey. 2024. gzip Predicts Data-dependent Scaling Laws. arXiv preprint arXiv:2405.16684 (2024).
  • Peng et al. (2023) Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction Tuning with GPT-4. CoRR abs/2304.03277 (2023).
  • Rae et al. (2021) Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446 (2021).
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21, 140 (2020), 1–67.
  • Saito et al. (2023) Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. 2023. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076 (2023).
  • Shannon (1948) Claude E. Shannon. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 3 (1948), 379–423.
  • Shannon (1951) Claude E Shannon. 1951. Prediction and entropy of printed English. Bell system technical journal 30, 1 (1951), 50–64.
  • Shen (2024) Ming Shen. 2024. Rethinking Data Selection for Supervised Fine-Tuning. CoRR abs/2402.06094 (2024).
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model.
  • Wang et al. (2023a) Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023a. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. In NeurIPS.
  • Wang et al. (2023b) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In ACL (1). Association for Computational Linguistics, 13484–13508.
  • Wang et al. (2022) Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705 (2022).
  • Wang et al. (2023c) Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023c. Data management for large language models: A survey. arXiv preprint arXiv:2312.01700 (2023).
  • Wettig et al. (2024) Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. 2024. QuRating: Selecting High-Quality Data for Training Language Models. arXiv preprint arXiv:2402.09739 (2024).
  • Xie et al. (2023) Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy S Liang. 2023. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems 36 (2023), 34201–34227.
  • Xu et al. (2023) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. WizardLM: Empowering Large Language Models to Follow Complex Instructions. CoRR abs/2304.12244 (2023).
  • Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. CoRR abs/2303.18223 (2023).
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In NeurIPS.
  • Zhou et al. (2023) Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023. LIMA: Less Is More for Alignment. In NeurIPS.

Appendix A Derivations of joint mutual information of two QA pairs

$$
\begin{aligned}
I(q_1 q_2; a_1 a_2) &= H(a_1 a_2) - H(a_1 a_2 \mid q_1 q_2) \\
&= H(a_1 a_2) - H(a_1 \mid q_1 q_2) - H(a_2 \mid a_1 q_1 q_2) \\
&= H(a_1 a_2) - H(a_1 \mid q_1) - H(a_2 \mid q_2) \\
&\leq H(a_1) + H(a_2) - H(a_1 \mid q_1) - H(a_2 \mid q_2) \\
&= I(q_1; a_1) + I(q_2; a_2).
\end{aligned}
\tag{7}
$$

The second equality applies the chain rule of entropy; the third holds under the assumption that each answer depends only on its own question, i.e., $H(a_1 \mid q_1 q_2) = H(a_1 \mid q_1)$ and $H(a_2 \mid a_1 q_1 q_2) = H(a_2 \mid q_2)$; the inequality follows from the subadditivity of joint entropy. Equality is achieved when $a_1$ is independent of $a_2$ (and, similarly, when $q_1$ is independent of $q_2$).

Appendix B Training Details

Platform All experiments were conducted on a platform with 64 Intel Xeon Gold 6326 CPU cores @ 2.90GHz, two mainstream high-performance GPUs, and 500GB of memory. The training code is based on the popular open-source framework Axolotl (https://github.com/OpenAccess-AI-Collective/axolotl).

Data preprocessing To format the multi-turn conversation data, we adopt the Vicuna-style template for Mistral-7B and the Llama-3 template for Llama-3-8B. Samples longer than the maximum input sequence length are truncated. In addition, samples are packed to speed up training for SFT.
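For concreteness, a minimal sketch of this formatting step is given below, assuming a Hugging Face-style tokenizer and the common USER/ASSISTANT layout of the Vicuna template; the exact system prompt, separators, and packing logic are handled by Axolotl in practice, so the strings here are illustrative assumptions only.

```python
# Illustrative Vicuna-style formatting and truncation; not the exact Axolotl template.
SYSTEM = ("A chat between a curious user and an artificial intelligence assistant. "
          "The assistant gives helpful, detailed, and polite answers to the user's questions.")

def format_vicuna(turns, tokenizer, max_len=2048):
    """turns: list of (user_msg, assistant_msg) pairs forming one conversation."""
    text = SYSTEM
    for user_msg, assistant_msg in turns:
        text += f" USER: {user_msg} ASSISTANT: {assistant_msg}</s>"
    # Truncate to the maximum input sequence length used for SFT.
    return tokenizer(text, truncation=True, max_length=max_len)["input_ids"]
```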

Hyper-parameters For ZIP, the selection numbers $K_1$, $K_2$, and $K_3$ are set to 10000, 200, and 100, respectively. For SFT, we share the following hyper-parameters across all backbones: the training batch size is 128, the number of training epochs is 4, the input sequence length is 2048, and the warm-up ratio is 0.1. We adopt a different learning rate for each backbone: 4e-6 for Mistral-7B and 1e-5 for Llama-3-8B. For RLHF, the learning rate of KTO is set to 1e-6 and the batch size to 128.
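For reference, these settings can be summarized as follows; the dictionary keys are illustrative labels rather than the exact configuration fields of the training framework, while the values are those reported above.

```python
# Hyper-parameters reported in this section (key names are illustrative).
ZIP_SELECTION = {"K1": 10_000, "K2": 200, "K3": 100}

SFT_COMMON = {
    "train_batch_size": 128,
    "num_epochs": 4,
    "sequence_len": 2048,
    "warmup_ratio": 0.1,
}

LEARNING_RATE = {
    "Mistral-7B": 4e-6,
    "Llama-3-8B": 1e-5,
}

RLHF_KTO = {"learning_rate": 1e-6, "batch_size": 128}
```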

Appendix C Token length distribution of more backbones

The token length distribution of the data selected for Llama-3-8B is depicted in Figure 6; it is similar to that of Mistral-7B.

Figure 6: The distribution of average token number across datasets selected by different algorithms for Llama-3-8B. Panels: (a) ZIP, (b) Random, (c) Diversity, (d) Perplexity, (e) DEITA, (f) SuperFiltering.

Appendix D Hyper-parameter sensitivity

Figure 7: Model performance w.r.t. different hyper-parameters. Panels: (a) model performance w.r.t. $K_1$, (b) model performance w.r.t. $K_2$, (c) model performance w.r.t. $K_3$.

ZIP involves three hyper-parameters $K_1$, $K_2$, and $K_3$ for improved efficiency. We aim to investigate the impact of these hyper-parameters on model performance, with results depicted in Figure 7.

Perceived sample number in global selection $K_1$ $K_1$ decides the number of samples to be updated in the global selection stage. We vary $K_1$ over {200, 1000, 10000, 20000}, with results presented in Figure 7(a). The model performance exhibits an increasing trend as $K_1$ increases. When a smaller $K_1$ is specified, ZIP is only exposed to a limited set of samples, which can cause it to degenerate into a variant that consistently selects samples based on individual compression ratios, neglecting the modeling of combinatorial effects. Furthermore, the compression ratio associated with the currently selected dataset typically increases with each update, whereas the compression ratios of the other samples remain unchanged. Consequently, a large $K_1$ may cause the compression ratios of the un-updated samples to be underestimated, leading to the selection of inferior samples. As a result, a model performance degradation can be observed when $K_1$ is set to 20,000.

Data pool size of local selection $K_2$ $K_2$ decides the number of samples selected from the preceding $K_1$ samples. We vary $K_2$ over {100, 200, 500, 1000}, with results presented in Figure 7(b). The model performance increases with a larger $K_2$, which aligns with intuition since the algorithm can consider the combinatorial effects of more samples. However, once $K_2$ exceeds a threshold, the model performance saturates, indicating that the local selection results remain similar even with a larger local data pool.

Data budget of local selection $K_3$ $K_3$ decides the number of samples selected from the preceding $K_2$ samples. We vary $K_3$ over {50, 100, 150, 200}, with results presented in Figure 7(c). The results exhibit a trend similar to that of $K_1$, yet the underlying causes are inverse. A large $K_3$ makes ZIP degenerate into a trivial variant that consistently selects samples based on individual compression ratios. On the other hand, a small $K_3$ leads to more frequent compression-ratio updates, which can also result in underestimated compression ratios for some inferior samples.
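To make the roles of $K_1$, $K_2$, and $K_3$ concrete, the following sketch outlines a multi-stage greedy selection loop consistent with the descriptions above, using zlib as the compressor and treating a lower joint compression ratio as preferable. It is a simplified illustration under these assumptions, not the exact ZIP implementation.

```python
import zlib

def c_ratio(texts):
    """Compression ratio (original bytes / compressed bytes) of a list of text samples."""
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

def zip_select(pool, budget, K1=10_000, K2=200, K3=100):
    """Schematic multi-stage greedy selection: lower joint compression ratio is preferred."""
    selected = []                                                 # indices of chosen samples
    scores = {i: c_ratio([pool[i]]) for i in range(len(pool))}    # individual ratios
    remaining = set(scores)
    while len(selected) < budget and remaining:
        # Global stage: take the K1 remaining samples with the lowest (possibly stale) scores
        # and refresh their scores with respect to the currently selected subset.
        global_cand = sorted(remaining, key=scores.get)[:K1]
        for i in global_cand:
            scores[i] = c_ratio([pool[j] for j in selected] + [pool[i]])
        # Keep the K2 best refreshed candidates as the local pool.
        local_pool = sorted(global_cand, key=scores.get)[:K2]
        # Local stage: greedily add up to K3 samples that keep the joint ratio low.
        for _ in range(min(K3, budget - len(selected))):
            best = min(local_pool, key=lambda i: c_ratio([pool[j] for j in selected] + [pool[i]]))
            selected.append(best)
            local_pool.remove(best)
            remaining.discard(best)
            if not local_pool:
                break
    return [pool[i] for i in selected]
```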