
The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Travis Hoppe, Anish Thite, Stella Biderman, Charles Foster, Noa Nabeshima, Sid Black, Jason Phang, Shawn Presser, Laurence Golding, Horace He, Connor Leahy

EleutherAI
contact@eleuther.ai

Abstract

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets, both existing and newly constructed, many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.¹

1 Introduction

Recent breakthroughs in general-purpose language modeling have demonstrated the effectiveness of training massive models on large text corpora for downstream applications (Radford et al., 2019; Shoeybi et al., 2019; Raffel et al., 2019; Rosset, 2019; Brown et al., 2020; Lepikhin et al., 2020). As the field continues to scale up language model training, the demand for high-quality massive text data will continue to grow (Kaplan et al., 2020).
The growing need for data in language modeling has caused most existing large-scale language models to turn to the Common Crawl for most or all of their data (Brown et al., 2020; Raffel et al., 2019). While training on the Common Crawl has been effective, recent work has shown that dataset diversity leads to better downstream generalization capability (Rosset, 2019). Additionally, large-scale language models have been shown to effectively acquire knowledge in a novel domain with only relatively small amounts of training data from that domain (Rosset, 2019; Brown et al., 2020; Carlini et al., 2020). These results suggest that by mixing together a large number of smaller, high-quality, diverse datasets, we can improve the general cross-domain knowledge and downstream generalization capabilities of the model compared to models trained on only a handful of data sources.
To address this need, we introduce the Pile: an 825.18 GiB English text dataset designed for training large-scale language models. The Pile is composed of 22 diverse and high-quality datasets, including both established natural language processing datasets and several newly introduced ones. In addition to its utility in training large language models, the Pile can also serve as a broad-coverage benchmark for the cross-domain knowledge and generalization ability of language models.
We introduce new datasets derived from the following sources: PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter. We also introduce OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText (Gokaslan and Cohen, 2019) and BookCorpus (Zhu et al., 2015; Kobayashi, 2018) datasets, respectively.
In addition, we incorporate several existing high-quality datasets: Books3 (Presser, 2020), Project Gutenberg (PG-19) (Rae et al., 2019), OpenSubtitles (Tiedemann, 2016), English Wikipedia, DM Mathematics (Saxton et al., 2019), EuroParl (Koehn, 2005), and the Enron Emails corpus (Klimt and Yang, 2004). To supplement these, we also introduce a new filtered subset of Common Crawl, Pile-CC, with improved extraction quality.

Figure 1: Treemap of Pile components by effective size.
Through our analyses, we confirm that the Pile is significantly distinct from pure Common Crawl data. Additionally, our evaluations show that the existing GPT-2 and GPT-3 models perform poorly on many components of the Pile, and that models trained on the Pile significantly outperform both raw and filtered Common Crawl models. To complement the performance evaluations, we also perform an exploratory analysis of the text within the Pile to provide a detailed picture of the data. We hope that our extensive documentation of the construction and characteristics of the Pile will help researchers make informed decisions about potential downstream applications.
Finally, we make publicly available the preprocessing code for the constituent datasets of the Pile and the code for constructing alternative versions.² In the interest of reproducibility, we also document all processing performed on each dataset (and the Pile as a whole) in as much detail as possible. For further details about the processing of each dataset, see Section 2 and Appendix C.

1.1 Contributions

The core contributions of this paper are:
  1. The introduction of an 825.18 GiB English-language dataset for language modeling combining 22 diverse sources.
  2. The introduction of 14 new language modeling datasets, which we expect to be of independent interest to researchers.
  3. Evaluations demonstrating significant improvements across many domains by GPT-2-sized models trained on this new dataset, compared to training on CC-100 and raw Common Crawl.
  4. The investigation and documentation of this dataset, which we hope will better inform researchers about how to use it as well as motivate them to undertake similar investigations of their own data.

2 The Pile Datasets

The Pile is composed of 22 constituent sub-datasets, as shown in Table 1. Following Brown et al. (2020), we increase the weights of higher-quality components, with certain high-quality datasets such as Wikipedia being seen up to 3 times (“epochs”) for each full epoch over the Pile. Detailed information about the construction of each dataset is available in Appendix C.
| Component | Raw Size | Weight | Epochs | Effective Size | Mean Document Size |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Pile-CC | 227.12 GiB | 18.11% | 1.0 | 227.12 GiB | 4.33 KiB |
| PubMed Central | 90.27 GiB | 14.40% | 2.0 | 180.55 GiB | 30.55 KiB |
| Books3† | 100.96 GiB | 12.07% | 1.5 | 151.44 GiB | 538.36 KiB |
| OpenWebText2 | 62.77 GiB | 10.01% | 2.0 | 125.54 GiB | 3.85 KiB |
| ArXiv | 56.21 GiB | 8.96% | 2.0 | 112.42 GiB | 46.61 KiB |
| Github | 95.16 GiB | 7.59% | 1.0 | 95.16 GiB | 5.25 KiB |
| FreeLaw | 51.15 GiB | 6.12% | 1.5 | 76.73 GiB | 15.06 KiB |
| Stack Exchange | 32.20 GiB | 5.13% | 2.0 | 64.39 GiB | 2.16 KiB |
| USPTO Backgrounds | 22.90 GiB | 3.65% | 2.0 | 45.81 GiB | 4.08 KiB |
| PubMed Abstracts | 19.26 GiB | 3.07% | 2.0 | 38.53 GiB | 1.30 KiB |
| Gutenberg (PG-19)† | 10.88 GiB | 2.17% | 2.5 | 27.19 GiB | 398.73 KiB |
| OpenSubtitles† | 12.98 GiB | 1.55% | 1.5 | 19.47 GiB | 30.48 KiB |
| Wikipedia (en)† | 6.38 GiB | 1.53% | 3.0 | 19.13 GiB | 1.11 KiB |
| DM Mathematics† | 7.75 GiB | 1.24% | 2.0 | 15.49 GiB | 8.00 KiB |
| Ubuntu IRC | 5.52 GiB | 0.88% | 2.0 | 11.03 GiB | 545.48 KiB |
| BookCorpus2 | 6.30 GiB | 0.75% | 1.5 | 9.45 GiB | 369.87 KiB |
| EuroParl† | 4.59 GiB | 0.73% | 2.0 | 9.17 GiB | 68.87 KiB |
| HackerNews | 3.90 GiB | 0.62% | 2.0 | 7.80 GiB | 4.92 KiB |
| YoutubeSubtitles | 3.73 GiB | 0.60% | 2.0 | 7.47 GiB | 22.55 KiB |
| PhilPapers | 2.38 GiB | 0.38% | 2.0 | 4.76 GiB | 73.37 KiB |
| NIH ExPorter | 1.89 GiB | 0.30% | 2.0 | 3.79 GiB | 2.11 KiB |
| Enron Emails† | 0.88 GiB | 0.14% | 2.0 | 1.76 GiB | 1.78 KiB |
| The Pile | 825.18 GiB | | | 1254.20 GiB | 5.91 KiB |
Table 1: Overview of datasets in the Pile before creating the held-out sets. Raw Size is the size before any up- or down-sampling. Weight is the percentage of bytes in the final dataset occupied by each dataset. Epochs is the number of passes over each constituent dataset during a full epoch over the Pile. Effective Size is the approximate number of bytes in the Pile occupied by each dataset. Datasets marked with a † are used with minimal preprocessing from prior work.
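To make the relationship between the Raw Size, Epochs, Weight, and Effective Size columns concrete, the short Python sketch below recomputes a few rows of Table 1. It is an illustration of the arithmetic only, not code from the Pile repository; the component names and sizes are copied from the table above.

```python
# Illustrative sketch of the Table 1 arithmetic (not the official the-pile code):
# effective_size = raw_size * epochs, and weight = effective_size / total_effective.

# (raw size in GiB, epochs) for a few components, copied from Table 1.
components = {
    "Pile-CC":        (227.12, 1.0),
    "PubMed Central": (90.27,  2.0),
    "Wikipedia (en)": (6.38,   3.0),
    "Enron Emails":   (0.88,   2.0),
}

TOTAL_EFFECTIVE_GIB = 1254.20  # total effective size of the Pile, from Table 1

for name, (raw_gib, epochs) in components.items():
    effective_gib = raw_gib * epochs
    weight = effective_gib / TOTAL_EFFECTIVE_GIB
    print(f"{name:15s} {effective_gib:8.2f} GiB effective  {weight:6.2%} weight")
```

Running this reproduces the published values to within rounding, e.g. Wikipedia (en): 6.38 GiB × 3.0 ≈ 19.13 GiB effective, or about 1.53% of the Pile.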


2.1 Pile-CC

Common Crawl is a collection of website crawls from 2008 onwards, including raw web pages, metadata, and text extractions. Because of its raw nature, Common Crawl has the advantage of including text from diverse domains, but at the cost of highly variable data quality. As a result, using Common Crawl typically requires well-designed extraction and filtering. Our Common Crawl-based dataset, Pile-CC, uses jusText (Endrédy and Novák, 2013) on Web Archive files (raw HTTP responses including page HTML) for extraction, which yields higher quality output than directly using the WET files (extracted plaintext).
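As a rough illustration of the extraction approach described above, the snippet below runs jusText over the raw HTML of a single page and keeps only the non-boilerplate paragraphs. This is a minimal sketch, not the actual Pile-CC pipeline (which applies this kind of extraction to Common Crawl WARC archives at scale rather than fetching pages directly); the example URL is a placeholder.

```python
# Minimal sketch: boilerplate removal with jusText on one page's raw HTML.
# The real Pile-CC pipeline processes Common Crawl Web ARChive (WARC)
# responses in bulk; this example simply fetches a single placeholder page.
import urllib.request

import justext  # pip install justext

html = urllib.request.urlopen("https://example.com/").read()  # placeholder URL

paragraphs = justext.justext(html, justext.get_stoplist("English"))
text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
print(text)
```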

2.2 PubMed Central

PubMed Central (PMC) is a subset of the PubMed online repository for biomedical articles run by the United States of America's National Center for Biotechnology Information (NCBI), providing open, full-text access to nearly five million publications. Most publications indexed by PMC are recent, and their inclusion has been mandated for all NIH-funded research since 2008 by the NIH Public Access Policy. We included PMC in the hope that it will benefit potential downstream applications in the medical domain.

2.3 Books3

Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020). Bibliotik consists of a mix of fiction and nonfiction books and is almost an order of magnitude

¹ https://pile.eleuther.ai/
² https://github.com/EleutherAI/the-pile