Distilabel: Bringing Synthetic Data Generation and AI Feedback to Everyone

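A few months ago, we released distilabel, a Python library for synthetic data generation and AI feedback using LLMs. At the time, we started with a simple approach in which the user could generate synthetic data with a single LLM called the generator and then label it with another LLM called the labeller. This approach was useful for many scenarios, especially for generating preference datasets. Thanks to this first version, we generated and openly shared impactful datasets such as argilla/distilabel-capybara-dpo-7k-binarized, argilla/OpenHermesPreferences, argilla/distilabel-intel-orca-dpo-pairs, and others, which have been used to train several SOTA models. We also saw growing adoption of distilabel in the community, including cool community projects such as davanstrien/haiku.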

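That said, we knew the previous implementation was not suitable for more complex synthetic data generation pipelines such as DEITA, which require running several LLMs and more complex steps. On top of that, we wanted to reduce the complexity of the project as it scaled and make it easier for the community to contribute. Our decision was clear: we needed to rewrite the library from scratch to address this and to make it more extensible, maintainable, and scalable.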

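Today, we're happy to announce distilabel 1.0.0, a new version of the library that brings a new architecture for building complex data processing pipelines with LLMs, and that we hope will make it easier for the community to create and share synthetic data generation pipelines.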


Pipelines, Steps, Tasks and LLMs

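This new version of distilabel allows building a Pipeline with any number of Steps or Tasks, which can be connected to one another so that the output of one step or task is fed as input to another. It is no longer about one generator and one labeller, but about a series of steps that can be chained together to build complex data processing pipelines with LLMs.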
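To make the chaining idea concrete, here is a minimal sketch of a pipeline as a directed acyclic graph, written with the Python standard library only. It is not distilabel's implementation: the step names and the `connections` mapping are hypothetical, and `graphlib.TopologicalSorter` is used merely to show that any valid execution order runs a step after the steps that feed it.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key lists the steps whose output it consumes.
connections = {
    "load_dataset": set(),
    "generate_a": {"load_dataset"},
    "generate_b": {"load_dataset"},
    "judge": {"generate_a", "generate_b"},
}

# static_order() yields every step after all of its predecessors.
order = list(TopologicalSorter(connections).static_order())
print(order)
```

The two generation steps have no dependency on each other, so their relative order is not fixed; only the loader must come first and the judge last.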

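A Step is the more generic node, relying on a base class that does not require using an LLM or a model. The input of each Step is a batch of data: a list of dictionaries, where each dictionary represents a row of the dataset and the keys are the column names. A Step can then: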

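Add or remove keys from the dictionaries to modify the columns of the final dataset.
Filter out dictionaries to remove rows from the dataset.
Add new dictionaries to the batch to add new rows to the dataset.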
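The three operations above can be sketched with plain Python functions over a batch. This is only an illustration of the data shape (a list of dictionaries); the function names and the `length` column are hypothetical and none of this is distilabel API.

```python
from typing import Any, Dict, List

Batch = List[Dict[str, Any]]


def add_length_column(batch: Batch) -> Batch:
    """Add a column: every row gets a new 'length' key."""
    return [{**row, "length": len(row["instruction"])} for row in batch]


def drop_short_rows(batch: Batch, min_length: int) -> Batch:
    """Remove rows: keep only the dictionaries that pass the filter."""
    return [row for row in batch if row["length"] >= min_length]


def add_uppercase_rows(batch: Batch) -> Batch:
    """Add rows: append one new dictionary for each existing one."""
    extra = [{**row, "instruction": row["instruction"].upper()} for row in batch]
    return batch + extra


batch = [{"instruction": "Write a haiku"}, {"instruction": "Hi"}]
batch = add_length_column(batch)
batch = drop_short_rows(batch, min_length=5)
batch = add_uppercase_rows(batch)
print(batch)
```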

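In addition, Step offers a simple lifecycle: it exposes a load method that can be used to create the resources needed by the process method, which is where the actual data processing takes place. Finally, each Step can define a list of runtime parameters that can be used to configure the behaviour of the Step for each pipeline execution.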
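As a rough picture of that lifecycle, the toy class below mirrors the load-then-process order and a per-run parameter. It deliberately does not subclass anything from distilabel: `PrefixStep`, its `prefix` argument, and the `_resource` attribute are all hypothetical names chosen for the sketch, and the real base class has its own signatures.

```python
from typing import Any, Dict, Iterator, List, Optional


class PrefixStep:
    """Toy step: NOT a distilabel class, it only mirrors the load/process lifecycle."""

    def __init__(self, prefix: str = "") -> None:
        # 'prefix' plays the role of a runtime parameter set per execution.
        self.prefix = prefix
        self._resource: Optional[Dict[str, str]] = None

    def load(self) -> None:
        # Create expensive resources once, before any batch is processed.
        self._resource = {"greeting": "Hello"}

    def process(self, batch: List[Dict[str, Any]]) -> Iterator[List[Dict[str, Any]]]:
        if self._resource is None:
            raise RuntimeError("call load() before process()")
        greeting = self._resource["greeting"]
        yield [
            {**row, "text": f"{self.prefix}{greeting}, {row['name']}"}
            for row in batch
        ]


step = PrefixStep(prefix=">> ")
step.load()
result = next(step.process([{"name": "Ada"}]))
print(result)
```

Keeping resource creation in `load` rather than in the constructor means the object can be built and configured cheaply, and the costly part runs only where the step actually executes.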

Building on this basic idea, distilabel offers two additional kinds of steps besides the normal Step: the GeneratorStep and the GlobalStep.

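A GeneratorStep is a node that loads data from a source (e.g., a dataset from the Hugging Face Hub) or generates new data (e.g., using SelfInstruct and a list of topics). GeneratorSteps are therefore the starting nodes of a pipeline and do not require any incoming edges.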

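A GlobalStep, on the other hand, works exactly like a Step, but it receives all the data from the previous steps at once. This allows it to aggregate data from previous steps or to perform operations that require the full dataset, such as filtering out duplicated rows.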
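The difference between the two kinds of node can be sketched in plain Python, assuming nothing about the real classes: a generator-style function emits batches without any input, and a global-style function only makes sense once it has every row. The function names here are hypothetical.

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


def generate_batches(rows: List[Row], batch_size: int) -> Iterator[List[Row]]:
    """Generator-style node: has no upstream step, it only emits batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def deduplicate(all_rows: List[Row], key: str) -> List[Row]:
    """Global-style node: sees every row at once, so duplicates can be found."""
    seen = set()
    unique = []
    for row in all_rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


source = [{"instruction": text} for text in ["a", "b", "a", "c", "b"]]
batches = list(generate_batches(source, batch_size=2))
# A global step waits for all batches before it runs.
all_rows = [row for batch in batches for row in batch]
unique_rows = deduplicate(all_rows, key="instruction")
print(len(batches), [row["instruction"] for row in unique_rows])
```

Deduplicating batch by batch would miss the repeated "a" and "b" here, because each duplicate sits in a different batch from its first occurrence.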

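Continuing with the Step concept, we have evolved the Task concept from the previous version: a Task is now a Step that knows how to use an LLM to perform a specific task, such as text generation, evolving instructions or responses, judging the quality of a text, and so on.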

Pipeline execution

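For this new version, we have also changed the way pipelines are executed. In the previous version, execution was sequential: the generator ran first, and then the labeller.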

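Now, execution is parallel, using several processes, each of which executes a different step of the pipeline. When a subprocess is created, it runs the load method of its step and then starts processing the batches it receives from an input queue. The resulting batches are sent back to the main process through an output queue, where they are distributed to the next steps in the pipeline.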
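The queue protocol described above can be sketched as follows. To keep the example short and runnable as a single script on any platform, it uses a thread and `queue.Queue` instead of separate processes, so it is an analogy for the message flow rather than a reproduction of distilabel's executor; the `None` sentinel and the `suffix` resource are assumptions made for the sketch.

```python
import queue
import threading
from typing import Any, Dict, List

Batch = List[Dict[str, Any]]


def worker(input_queue: queue.Queue, output_queue: queue.Queue) -> None:
    # Stand-in for load(): runs once when the worker starts.
    suffix = "!"
    while True:
        batch = input_queue.get()
        if batch is None:  # sentinel: no more batches will arrive
            output_queue.put(None)
            break
        # Stand-in for process(): transform the batch and send it back.
        output_queue.put([{**row, "text": row["text"] + suffix} for row in batch])


input_queue: queue.Queue = queue.Queue()
output_queue: queue.Queue = queue.Queue()
thread = threading.Thread(target=worker, args=(input_queue, output_queue))
thread.start()

# The main side feeds batches in and then signals the end of the data.
for batch in ([{"text": "a"}], [{"text": "b"}], None):
    input_queue.put(batch)

# The main side collects results until the worker echoes the sentinel.
results: Batch = []
while (item := output_queue.get()) is not None:
    results.extend(item)
thread.join()
print(results)
```

With real processes the same shape applies, except that batches must be picklable to cross the process boundary and each worker holds its own copy of whatever `load` created.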

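For this first version, we decided to use the multiprocessing module from the Python standard library to manage the subprocesses and to have single-node pipeline execution, which is enough for most cases. That said, we put a lot of thought into the architecture design so that distributed execution can be supported in the future using libraries such as Ray.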

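Sharing pipelines

One of the main goals of the new version of distilabel is to make it easier for the community to create and share synthetic data generation pipelines. To achieve this, we added a new feature that allows serializing a pipeline to a JSON or YAML file and loading it back from the file, which makes it possible to tweak the pipeline's runtime parameters and run it again. In addition, pushing the resulting dataset to the Hugging Face Hub automatically pushes the pipeline to the Hub as well and adds a description of the pipeline to the dataset card, making it easier to re-run the pipeline in the future.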
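The serialize, reload, and tweak cycle can be illustrated with the standard `json` module. The dictionary layout below is a hypothetical, simplified description invented for this sketch; the file that distilabel actually writes may be structured differently.

```python
import json

# Hypothetical, simplified pipeline description; the real file format may differ.
pipeline_description = {
    "steps": [
        {"name": "load_dataset", "runtime_parameters": {"split": "train"}},
        {"name": "text_generation", "runtime_parameters": {"temperature": 1.0}},
    ],
    "connections": [{"from": "load_dataset", "to": ["text_generation"]}],
}

serialized = json.dumps(pipeline_description, indent=2)

# Later, possibly on another machine: load, tweak a runtime parameter, rerun.
restored = json.loads(serialized)
for step in restored["steps"]:
    if step["name"] == "text_generation":
        step["runtime_parameters"]["temperature"] = 0.7

print(restored["steps"][1])
```

Because the structure of the pipeline and its runtime parameters live in the same file, someone receiving it can change a parameter without touching any Python code.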

Hugging Face Hub Dataset Card

And what if you're not a script person 👨🏻‍💻? ...

No worries, we've got you covered! We have also added a CLI that can get the info of a pipeline from a file or URL:

distilabel pipeline info --config "https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml"

Distilabel CLI Info

and run a pipeline from a file or URL:

distilabel pipeline run --config "https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini-with-generations/raw/main/pipeline.yaml" \
    --param load_dataset.repo_id=distilabel-internal-testing/instruction-dataset-mini \
    --param load_dataset.split=test \
    --param generate_with_gpt35.llm.generation_kwargs.max_new_tokens=512 \
    --param generate_with_gpt35.llm.generation_kwargs.temperature=0.7 \
    --param to_argilla.dataset_name=text_generation_with_gpt35 \
    --param to_argilla.dataset_workspace=admin
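Each `--param` value is a dotted path followed by a value, and the path lines up with the nested `parameters` dictionary that `pipeline.run` accepts in Python. The sketch below shows one way such strings could be folded into a nested dictionary. How the CLI really parses and type-converts these values is not described in this post, so `parse_params` is a hypothetical helper and it leaves every value as a string.

```python
from typing import Any, Dict, List


def parse_params(params: List[str]) -> Dict[str, Any]:
    """Turn 'a.b.c=value' strings into a nested dictionary (values stay strings)."""
    nested: Dict[str, Any] = {}
    for param in params:
        dotted_key, value = param.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Walk down, creating intermediate dictionaries as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


parameters = parse_params([
    "load_dataset.split=test",
    "generate_with_gpt35.llm.generation_kwargs.max_new_tokens=512",
    "generate_with_gpt35.llm.generation_kwargs.temperature=0.7",
])
print(parameters)
```

The first path segment is the step name, which is why naming steps explicitly matters when a pipeline is meant to be shared and reconfigured from the command line.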

Overview of the differences

The main differences between the former and the current version are:

	distilabel ≤ 0.6.0	distilabel ≥ 1.0.0
# of LLMs	2 at most	From 0 to N (not mandatory to use LLMs)
# of Tasks	2 at most (generator and labeller)	From 1 to N
Integrations	OpenAI, vLLM, Llama.cpp, Transformers, Inference Endpoints, Together, Anyscale, Ollama and Vertex AI	Same as before, but also Cohere, Azure OpenAI and LiteLLM
Flow	`Generator` → `Labeller`	Any → … → Any (where … can be an arbitrary number of tasks)
Execution	Sequential	Parallel
Ease of contribution	Medium-Hard	Easy
Argilla	Integrated on every pipeline	Detached, following a plug and play approach anytime
Hierarchy	`LLM`	`Pipeline` > `Step` > `Task` (> `LLM`)
Syntax	generator and labeller	Arbitrary, defined by the user
Approach	Chained Python functions	DAG
Sharing	Hard to share pipelines	Easy to share pipelines thanks to the serialization and the CLI

Former

Only suitable for the generator-labeller scenario
Hard to scale and maintain
Not suitable for most synthetic data generation pipelines

from datasets import load_dataset
from distilabel.llm import OpenAILLM
from distilabel.pipeline import pipeline
from distilabel.tasks import TextGenerationTask

dataset = (
    load_dataset("HuggingFaceH4/instruction-dataset", split="test[:10]")
    .remove_columns(["completion", "meta"])
    .rename_column("prompt", "input")
)

task = TextGenerationTask()
generator = OpenAILLM(task=task, max_new_tokens=512)

pipeline = pipeline("preference", "instruction-following", generator=generator)
dataset = pipeline.generate(dataset)

Current

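Can have any number of steps of any kind (not only LLMs)
More extensible, maintainable, and scalable
Since the steps run in parallel, local LLMs may require more compute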

from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline() as pipeline:
    load_dataset = LoadDataFromDicts(
        name="load_dataset",
        data=[
            {
                "instruction": "Write a short story about a dragon that saves a princess from a tower.",
            },
        ],
    )

    text_generation = TextGeneration(
        name="text_generation",
        llm=OpenAILLM(model="gpt-4"),
    )

    load_dataset.connect(text_generation)

    ...

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            "text_generation": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
            ...
        },
    )

    distiset.push_to_hub(
        "distilabel-internal-testing/instruction-dataset-mini-with-generations"
    )

What's next?

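You can check out the distilabel GitHub repository and the documentation to learn more about the new version and to start creating your own synthetic data generation pipelines.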

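We hope this new version of distilabel will make it easier for the community to create and share synthetic data generation pipelines, and that it will help democratize the use of synthetic data generation and AI feedback. We're excited to see what the community builds with this new version of distilabel, and we look forward to your feedback and contributions!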
