Thank you for purchasing the MEAP edition of Build a Large Language Model (From Scratch).
In this book, I invite you to embark on an educational journey with me to learn how to build Large Language Models (LLMs) from the ground up. Together, we'll delve deep into the LLM training pipeline, starting from data loading and culminating in finetuning LLMs on custom datasets.
For many years, I've been deeply immersed in the world of deep learning, coding LLMs, and have found great joy in explaining complex concepts thoroughly. This book has been a long-standing idea in my mind, and I'm thrilled to finally have the opportunity to write it and share it with you. Those of you familiar with my work, especially from my blog, have likely seen glimpses of my approach to coding from scratch. This method has resonated well with many readers, and I hope it will be equally effective for you.
I've designed the book to emphasize hands-on learning, primarily using PyTorch and without relying on pre-existing libraries. With this approach, coupled with numerous figures and illustrations, I aim to provide you with a thorough understanding of how LLMs work, their limitations, and customization methods. Moreover, we'll explore commonly used workflows and paradigms in pretraining and fine-tuning LLMs, offering insights into their development and customization.
The book is structured with detailed step-by-step introductions, ensuring no critical detail is overlooked. To gain the most from this book, you should have a background in Python programming. Prior experience in deep learning and a foundational understanding of PyTorch, or familiarity with other deep learning frameworks like TensorFlow, will be beneficial.
I warmly invite you to engage in the liveBook discussion forum for any questions, suggestions, or feedback you might have. Your contributions are immensely valuable and appreciated in enhancing this learning journey.
Sebastian Raschka
brief contents
CHAPTERS
1 Understanding Large Language Models
2 Working with Text Data
3 Coding Attention Mechanisms
4 Implementing a GPT model from Scratch To Generate Text
5 Pretraining on Unlabeled Data
6 Finetuning for Classification
7 Finetuning to Follow Instructions
Appendix A. Introduction to PyTorch
Appendix B. References and Further Reading
Appendix C. Exercise Solutions
Appendix D. Adding Bells and Whistles to the Training Loop
Appendix E. Parameter-efficient Finetuning with LoRA
1
Understanding Large Language Models
This chapter covers
High-level explanations of the fundamental concepts behind large language models (LLMs)
Insights into the transformer architecture from which LLMs, such as the ones used on the ChatGPT platform, are derived
A plan for building an LLM from scratch
Large language models (LLMs), such as those offered in OpenAI's ChatGPT, are deep neural network models that have been developed over the past few years. They ushered in a new era for Natural Language Processing (NLP). Before the advent of large language models, traditional methods excelled at categorization tasks such as email spam classification and straightforward pattern recognition that could be captured with handcrafted rules or simpler models. However, they typically underperformed in language tasks that demanded complex understanding and generation abilities, such as parsing detailed instructions, conducting contextual analysis, or creating coherent and contextually appropriate original text. For example, previous generations of language models could not write an email from a list of keywords, a task that is trivial for contemporary LLMs.
LLMs have remarkable capabilities to understand, generate, and interpret human language. However, it's important to clarify that when we say language models "understand," we mean that they can process and generate text in ways that appear coherent and contextually relevant, not that they possess human-like consciousness or comprehension.
Enabled by advancements in deep learning, which is a subset of machine learning and artificial intelligence (AI) focused on neural networks, LLMs are trained on vast quantities of text data. This allows LLMs to capture deeper contextual information and subtleties of human language compared to previous approaches. As a result, LLMs have significantly improved performance in a wide range of NLP tasks, including text translation, sentiment analysis, question answering, and many more.
Another important distinction between contemporary LLMs and earlier NLP models is that these earlier NLP models were typically designed for specific tasks, for example, text categorization, language translation and so forth. Whereas those earlier NLP models excelled in their narrow applications, LLMs demonstrate a broader proficiency across a wide range of NLP tasks.
The success behind LLMs can be attributed to the transformer architecture which underpins many LLMs, and the vast amounts of data LLMs are trained on, allowing them to capture a wide variety of linguistic nuances, contexts, and patterns that would be challenging to manually encode.
This shift towards implementing models based on the transformer architecture and using large training datasets to train LLMs has fundamentally transformed NLP, providing more capable tools for understanding and interacting with human language.
Beginning with this chapter, we set the foundation to accomplish the primary objective of this book: understanding LLMs by implementing a ChatGPT-like LLM based on the transformer architecture step by step in code.
1.1 What is an LLM?
An LLM, a large language model, is a neural network designed to understand, generate, and respond to human-like text. These models are deep neural networks trained on massive amounts of text data, sometimes encompassing large portions of the entire publicly available text on the internet.
The "large" in large language model refers to both the model's size in terms of parameters and the immense dataset on which it's trained. Models like this often have tens or even hundreds of billions of parameters, which are the adjustable weights in the network that are optimized during training to predict the next word in a sequence. Next-word prediction is sensible because it harnesses the inherent sequential nature of language to train models on understanding context, structure, and relationships within text. Yet, it is a very simple task and so it is surprising to many researchers that it can produce such capable models. We will discuss and implement the next-word training procedure in later chapters step by step.
LLMs utilize an architecture called the transformer (covered in more detail in section 1.4), which allows them to pay selective attention to different parts of the input when making predictions, making them especially adept at handling the nuances and complexities of human language.
Since LLMs are capable of generating text, LLMs are also often referred to as a form of generative artificial intelligence (AI), often abbreviated as generative AI or GenAI. As illustrated in Figure 1.1, AI encompasses the broader field of creating machines that can perform tasks requiring human-like intelligence, including understanding language, recognizing patterns, and making decisions, and includes subfields like machine learning and deep learning.
Figure 1.1 As this hierarchical depiction of the relationship between the different fields suggests, LLMs represent a specific application of deep learning techniques, leveraging their ability to process and generate human-like text. Deep learning is a specialized branch of machine learning that focuses on using multi-layer neural networks. And machine learning and deep learning are fields aimed at implementing algorithms that enable computers to learn from data and perform tasks that typically require human intelligence.
The algorithms used to implement AI are the focus of the field of machine learning. Specifically, machine learning involves the development of algorithms that can learn from and make predictions or decisions based on data without being explicitly programmed. To illustrate this, imagine a spam filter as a practical application of machine learning. Instead of manually writing rules to identify spam emails, a machine learning algorithm is fed examples of emails labeled as spam and legitimate emails. By minimizing the error in its predictions on a training dataset, the model then learns to recognize patterns and characteristics indicative of spam, enabling it to classify new emails as either spam or legitimate.
As illustrated in Figure 1.1, deep learning is a subset of machine learning that focuses on utilizing neural networks with three or more layers (also called deep neural networks) to model complex patterns and abstractions in data. In contrast to deep learning, traditional machine learning requires manual feature extraction. This means that human experts need to identify and select the most relevant features for the model.
While the field of AI is nowadays dominated by machine learning and deep learning, it also includes other approaches, for example, using rule-based systems, genetic algorithms, expert systems, fuzzy logic, or symbolic reasoning.
Returning to the spam classification example, in traditional machine learning, human experts might manually extract features from email text such as the frequency of certain trigger words ("prize," "win," "free"), the number of exclamation marks, use of all uppercase words, or the presence of suspicious links. This dataset, created based on these expert-defined features, would then be used to train the model. In contrast to traditional machine learning, deep learning does not require manual feature extraction. This means that human experts do not need to identify and select the most relevant features for a deep learning model. (However, in both traditional machine learning and deep learning for spam classification, you still require the collection of labels, such as spam or non-spam, which need to be gathered either by an expert or users.)
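To make the manual feature-engineering step more concrete, here is a small Python sketch of what such expert-defined features could look like; the specific trigger words, features, and example email are hypothetical and chosen only to illustrate the kind of hand-crafted work a deep learning model avoids:
# A hypothetical, hand-crafted feature extractor for the spam example.
# The trigger words and features are illustrative, not from a real system.
def extract_features(email_text):
    trigger_words = ("prize", "win", "free")
    words = email_text.lower().split()
    return {
        "num_trigger_words": sum(w.strip("!.,") in trigger_words for w in words),
        "num_exclamation_marks": email_text.count("!"),
        "num_all_caps_words": sum(w.isupper() and len(w) > 1 for w in email_text.split()),
        "contains_link": "http://" in email_text or "https://" in email_text,
    }
print(extract_features("WIN a FREE prize now!!! http://example.com"))
# {'num_trigger_words': 3, 'num_exclamation_marks': 3, 'num_all_caps_words': 2, 'contains_link': True}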
The upcoming sections will cover some of the problems LLMs can solve today, the challenges that LLMs address, and the general LLM architecture, which we will implement in this book.
1.2 Applications of LLMs
Owing to their advanced capabilities to parse and understand unstructured text data, LLMs have a broad range of applications across various domains. Today, LLMs are employed for machine translation, generation of novel texts (see Figure 1.2), sentiment analysis, text summarization, and many other tasks. LLMs have recently been used for content creation, such as writing fiction, articles, and even computer code.
Figure 1.2 LLM interfaces enable natural language communication between users and AI systems. This screenshot shows ChatGPT writing a poem according to a user's specifications.
LLMs can also power sophisticated chatbots and virtual assistants, such as OpenAI's ChatGPT or Google's Gemini (formerly called Bard), which can answer user queries and augment traditional search engines such as Google Search or Microsoft Bing.
Moreover, LLMs may be used for effective knowledge retrieval from vast volumes of text in specialized areas such as medicine or law. This includes sifting through documents, summarizing lengthy passages, and answering technical questions.
In short, LLMs are invaluable for automating almost any task that involves parsing and generating text. Their applications are virtually endless, and as we continue to innovate and explore new ways to use these models, it's clear that LLMs have the potential to redefine our relationship with technology, making it more conversational, intuitive, and accessible.
In this book, we will focus on understanding how LLMs work from the ground up, coding an LLM that can generate texts. We will also learn about techniques that allow LLMs to carry out queries, ranging from answering questions to summarizing text, translating text into different languages, and more. In other words, in this book, we will learn how complex LLM assistants such as ChatGPT work by building one step by step.
1.3 Stages of building and using LLMs
Why should we build our own LLMs? Coding an LLM from the ground up is an excellent exercise to understand its mechanics and limitations. Also, it equips us with the required knowledge for pretraining or finetuning existing open-source LLM architectures to our own domain-specific datasets or tasks.
Research has shown that when it comes to modeling performance, custom-built LLMs (those tailored for specific tasks or domains) can outperform general-purpose LLMs, such as those provided by ChatGPT, which are designed for a wide array of applications. Examples of this include BloombergGPT, which is specialized for finance, and LLMs that are tailored for medical question answering (please see the Further Reading and References section in Appendix B for more details).
Using custom-built LLMs offers several advantages, particularly regarding data privacy. For instance, companies may prefer not to share sensitive data with third-party LLM providers like OpenAI due to confidentiality concerns. Additionally, developing custom LLMs enables deployment directly on customer devices, such as laptops and smartphones, which is something companies like Apple are currently exploring. This local implementation can significantly decrease latency and reduce server-related costs. Furthermore, custom LLMs grant developers complete autonomy, allowing them to control updates and modifications to the model as needed.
The general process of creating an LLM includes pretraining and finetuning. The term "pre" in "pretraining" refers to the initial phase where a model like an LLM is trained on a large, diverse dataset to develop a broad understanding of language. This pretrained model then serves as a foundational resource that can be further refined through finetuning, a process where the model is specifically trained on a narrower dataset that is more specific to particular tasks or domains. This two-stage training approach consisting of pretraining and finetuning is depicted in Figure 1.3.
Figure 1.3 Pretraining an LLM involves next-word prediction on large text datasets. A pretrained LLM can then be finetuned using a smaller labeled dataset.
As illustrated in Figure 1.3, the first step in creating an LLM is to train it on a large corpus of text data, sometimes referred to as raw text. Here, "raw" refers to the fact that this data is just regular text without any labeling information [1]. (Filtering may be applied, such as removing formatting characters or documents in unknown languages.)
This first training stage of an LLM is also known as pretraining, creating an initial pretrained LLM, often called a base or foundation model. A typical example of such a model is the GPT-3 model (the precursor of the original model offered in ChatGPT). This model is capable of text completion, that is, finishing a half-written sentence provided by a user. It also has limited few-shot capabilities, which means it can learn to perform new tasks based on only a few examples instead of needing extensive training data. This is further illustrated in the next section, Introducing the transformer architecture.
After obtaining a pretrained LLM from training on large text datasets, where the LLM is trained to predict the next word in the text, we can further train the LLM on labeled data, also known as finetuning.
The two most popular categories of finetuning LLMs include instruction-finetuning and finetuning for classification tasks. In instruction-finetuning, the labeled dataset consists of instruction and answer pairs, such as a query to translate a text accompanied by the correctly translated text. In classification finetuning, the labeled dataset consists of texts and associated class labels, for example, emails associated with spam and non-spam labels.
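To make the two dataset formats more tangible, the following is a minimal sketch of what individual training records could look like in Python; the field names and the dictionary layout are illustrative assumptions rather than a prescribed format:
# Hypothetical examples of labeled records for the two finetuning flavors.

# Instruction-finetuning: instruction-and-answer pairs.
instruction_record = {
    "instruction": "Translate the following sentence into German.",
    "input": "The weather is nice today.",
    "output": "Das Wetter ist heute schön.",
}

# Classification finetuning: text with an associated class label.
classification_record = {
    "text": "Congratulations, you have won a free prize! Click here to claim it.",
    "label": "spam",
}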
In this book, we will cover both code implementations for pretraining and finetuning an LLM, and we will delve deeper into the specifics of instruction-finetuning and finetuning for classification later in this book after pretraining a base LLM.
1.4 Introducing the transformer architecture
Most modern LLMs rely on the transformer architecture, which is a deep neural network architecture introduced in the 2017 paper Attention Is All You Need. To understand LLMs, we briefly have to go over the original transformer, which was originally developed for machine translation, translating English texts to German and French. A simplified version of the transformer architecture is depicted in Figure 1.4.
Figure 1.4 A simplified depiction of the original transformer architecture, which is a deep learning model for language translation. The transformer consists of two parts, an encoder that processes the input text and produces an embedding representation (a numerical representation that captures many different factors in different dimensions) of the text that the decoder can use to generate the translated text one word at a time. Note that this figure shows the final stage of the translation process where the decoder has to generate only the final word ("Beispiel"), given the original input text ("This is an example") and a partially translated sentence ("Das ist ein"), to complete the translation.
The transformer architecture depicted in Figure 1.4 consists of two submodules, an encoder and a decoder. The encoder module processes the input text and encodes it into a series of numerical representations or vectors that capture the contextual information of the input. Then, the decoder module takes these encoded vectors and generates the output text from them. In a translation task, for example, the encoder would encode the text from the source language into vectors, and the decoder would decode these vectors to generate text in the target language. Both the encoder and decoder consist of many layers connected by a so-called self-attention mechanism. You may have many questions regarding how the inputs are preprocessed and encoded. These will be addressed in a step-by-step implementation in the subsequent chapters.
A key component of transformers and LLMs is the self-attention mechanism (not shown), which allows the model to weigh the importance of different words or tokens in a sequence relative to each other. This mechanism enables the model to capture long-range dependencies and contextual relationships within the input data, enhancing its ability to generate coherent and contextually relevant output. However, due to its complexity, we will defer the explanation to chapter 3, where we will discuss and implement it step by step. Moreover, we will also discuss and implement the data preprocessing steps to create the model inputs in chapter 2, Working with Text Data.
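As a rough preview of what chapter 3 builds up step by step, the following is a minimal PyTorch sketch of scaled dot-product self-attention over a handful of random token vectors; it omits the trainable weight matrices, masking, and multi-head logic of the full implementation:
import torch

# A minimal sketch of scaled dot-product self-attention (no trainable
# projections, masking, or multiple heads); for illustration only.
torch.manual_seed(123)
inputs = torch.rand(6, 4)                    # 6 tokens, 4-dimensional embeddings

scores = inputs @ inputs.T                   # pairwise similarity between tokens
weights = torch.softmax(scores / inputs.shape[-1] ** 0.5, dim=-1)  # attention weights
context = weights @ inputs                   # each row is a context-aware token vector

print(weights.shape)   # torch.Size([6, 6])
print(context.shape)   # torch.Size([6, 4])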
Later variants of the transformer architecture, such as the so-called BERT (short for bidirectional encoder representations from transformers) and the various GPT models (short for generative pretrained transformers), built on this concept to adapt this architecture for different tasks. (References can be found in Appendix B.)
BERT, which is built upon the original transformer's encoder submodule, differs in its training approach from GPT. While GPT is designed for generative tasks, BERT and its variants specialize in masked word prediction, where the model predicts masked or hidden words in a given sentence as illustrated in Figure 1.5. This unique training strategy equips BERT with strengths in text classification tasks, including sentiment prediction and document categorization. As an application of its capabilities, as of this writing, Twitter uses BERT to detect toxic content.
Figure 1.5 A visual representation of the transformer's encoder and decoder submodules. On the left, the encoder segment exemplifies BERT-like LLMs, which focus on masked word prediction and are primarily used for tasks like text classification. On the right, the decoder segment showcases GPT-like LLMs, designed for generative tasks and producing coherent text sequences.
GPT, on the other hand, focuses on the decoder portion of the original transformer architecture and is designed for tasks that require generating texts. This includes machine translation, text summarization, fiction writing, writing computer code, and more. We will discuss the GPT architecture in more detail in the remaining sections of this chapter and implement it from scratch in this book.
GPT models, primarily designed and trained to perform text completion tasks, also show remarkable versatility in their capabilities. These models are adept at executing both zero-shot and few-shot learning tasks. Zero-shot learning refers to the ability to generalize to completely unseen tasks without any prior specific examples. On the other hand, few-shot learning involves learning from a minimal number of examples the user provides as input, as shown in Figure 1.6.
Figure 1.6 In addition to text completion, GPT-like LLMs can solve various tasks based on their inputs without needing retraining, finetuning, or task-specific model architecture changes. Sometimes, it is helpful to provide examples of the target within the input, which is known as a few-shot setting. However, GPT-like LLMs are also capable of carrying out tasks without a specific example, which is called the zero-shot setting.
TRANSFORMERS VERSUS LLMS
Today's LLMs are based on the transformer architecture introduced in the previous section. Hence, transformers and LLMs are terms that are often used synonymously in the literature. However, note that not all transformers are LLMs since transformers can also be used for computer vision. Also, not all LLMs are transformers, as there are large language models based on recurrent and convolutional architectures. The main motivation behind these alternative approaches is to improve the computational efficiency of LLMs. However, whether these alternative LLM architectures can compete with the capabilities of transformer-based LLMs and whether they are going to be adopted in practice remains to be seen. For simplicity, this book uses the term "LLM" to refer to transformer-based LLMs similar to GPT. (Interested readers can find literature references describing these architectures in the Further Reading section at the end of this chapter.)
1.5 Utilizing large datasets
The large training datasets for popular GPT- and BERT-like models represent diverse and comprehensive text corpora encompassing billions of words, which include a vast array of topics and natural and computer languages. To provide a concrete example, Table 1.1 summarizes the dataset used for pretraining GPT-3, which served as the base model for the first version of ChatGPT.
Table 1.1 The pretraining dataset of the popular GPT-3 LLM

Dataset name             Dataset description           Number of tokens    Proportion in training data
CommonCrawl (filtered)   Web crawl data                410 billion         60%
WebText2                 Web crawl data                19 billion          22%
Books1                   Internet-based book corpus    12 billion          8%
Books2                   Internet-based book corpus    55 billion          8%
Wikipedia                High-quality text             3 billion           3%
Table 1.1 reports the number of tokens, where a token is a unit of text that a model reads, and the number of tokens in a dataset is roughly equivalent to the number of words and punctuation characters in the text. We will cover tokenization, the process of converting text into tokens, in more detail in the next chapter.
The main takeaway is that the scale and diversity of this training dataset allows these models to perform well on diverse tasks including language syntax, semantics, and context, and even some requiring general knowledge.
GPT-3 DATASET DETAILS
Table 1.1 displays the dataset used for GPT-3. The proportions column in the table sums up to 100% of the sampled data, adjusted for rounding errors. Although the subsets in the "Number of Tokens" column total 499 billion, the model was trained on only 300 billion tokens. The authors of the GPT-3 paper did not specify why the model was not trained on all 499 billion tokens.
For context, consider the size of the CommonCrawl dataset, which alone consists of 410 billion tokens and requires about 570 GB of storage. In comparison, later iterations of models like GPT-3, such as Meta's LLaMA, have expanded their training scope to include additional data sources like arXiv research papers (92 GB) and StackExchange's code-related Q&As (78 GB).
The authors of the GPT-3 paper did not share the training dataset, but a comparable dataset that is publicly available is Dolma: an Open Corpus of Three Trillion Tokens for LLM Pretraining Research by Soldaini et al. 2024 (https://arxiv.org/abs/2402. ). However, the collection may contain copyrighted works, and the exact usage terms may depend on the intended use case and country.
The pretrained nature of these models makes them incredibly versatile for further finetuning on downstream tasks, which is why they are also known as base or foundation models. Pretraining LLMs requires access to significant resources and is very expensive. For example, the GPT-3 pretraining cost is estimated to be $4.6 million in terms of cloud computing credits [2].
The good news is that many pretrained LLMs, available as open-source models, can be used as general purpose tools to write, extract, and edit texts that were not part of the training data. Also, LLMs can be finetuned on specific tasks with relatively smaller datasets, reducing the computational resources needed and improving performance on the specific task.
In this book, we will implement the code for pretraining and use it to pretrain an LLM for educational purposes. All computations will be executable on consumer hardware. After implementing the pretraining code we will learn how to reuse openly available model weights and load them into the architecture we will implement, allowing us to skip the expensive pretraining stage when we finetune LLMs later in this book.
1.6 A closer look at the GPT architecture
Previously in this chapter, we mentioned the terms GPT-like models, GPT-3, and ChatGPT. Let's now take a closer look at the general GPT architecture. First, GPT stands for Generative Pretrained Transformer and was originally introduced in the following paper:
Improving Language Understanding by Generative Pre-Training (2018) by Radford et al. from OpenAI, http://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
GPT-3 is a scaled-up version of this model that has more parameters and was trained on a larger dataset. And the original model offered in ChatGPT was created by finetuning GPT-3 on a large instruction dataset using a method from OpenAI's InstructGPT paper, which we will cover in more detail in chapter 7, Finetuning with Human Feedback To Follow Instructions. As we have seen earlier in Figure 1.6, these models are competent text completion models and can carry out other tasks such as spelling correction, classification, or language translation. This is actually very remarkable given that GPT models are pretrained on a relatively simple next-word prediction task, as illustrated in Figure 1.7.
Figure 1.7 In the next-word pretraining task for GPT models, the system learns to predict the upcoming word in a sentence by looking at the words that have come before it. This approach helps the model understand how words and phrases typically fit together in language, forming a foundation that can be applied to various other tasks.
The next-word prediction task is a form of self-supervised learning, which is a form of self-labeling. This means that we don't need to collect labels for the training data explicitly but can leverage the structure of the data itself: we can use the next word in a sentence or document as the label that the model is supposed to predict. Since this next-word prediction task allows us to create labels "on the fly," it is possible to leverage massive unlabeled text datasets to train LLMs as previously discussed in section 1.5, Utilizing large datasets.
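As a small illustration of how such labels can be created "on the fly," the sketch below simply shifts a token sequence by one position so that each input token's target is the token that follows it; the example sentence is made up, and chapter 2 implements this idea properly with a tokenizer and data loader:
# Creating next-word prediction labels by shifting the sequence by one position.
# The example sentence is made up for illustration purposes.
tokens = ["LLMs", "learn", "to", "predict", "one", "word", "at", "a", "time"]

inputs = tokens[:-1]    # everything except the last token
targets = tokens[1:]    # everything except the first token

for inp, tgt in zip(inputs, targets):
    print(f"{inp!r:12} ---> {tgt!r}")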
Compared to the original transformer architecture we covered in section 1.4, the general GPT architecture is relatively simple. Essentially, it's just the decoder part without the encoder as illustrated in Figure 1.8. Since decoder-style models like GPT generate text by predicting text one word at a time, they are considered a type of autoregressive model. Autoregressive models incorporate their previous outputs as inputs for future predictions. Consequently, in GPT, each new word is chosen based on the sequence that precedes it, which improves coherence of the resulting text.
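The autoregressive idea can be summarized in a few lines of Python. In the sketch below, the next_word function is a toy stand-in for a trained GPT-like model (a simple lookup table) so the loop can run end to end; a real LLM would condition its prediction on the entire preceding sequence rather than the last word only:
# A minimal sketch of autoregressive, one-word-at-a-time generation.
toy_model = {"Das": "ist", "ist": "ein", "ein": "Beispiel"}

def next_word(context):
    # Toy stand-in for a trained model; a real LLM uses the full context.
    return toy_model.get(context[-1], "<end>")

def generate(prompt, max_new_words=3):
    words = prompt.split()
    for _ in range(max_new_words):
        words.append(next_word(words))   # the new word becomes part of the next input
    return " ".join(words)

print(generate("Das"))   # Das ist ein Beispiel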
Architectures such as GPT-3 are also significantly larger than the original transformer model. For instance, the original transformer repeated the encoder and decoder blocks six times. GPT-3 has 96 transformer layers and 175 billion parameters in total.
Figure 1.8 The GPT architecture employs only the decoder portion of the original transformer. It is designed for unidirectional, left-to-right processing, making it well-suited for text generation and next-word prediction tasks to generate text in iterative fashion one word at a time.
GPT-3 was introduced in 2020, which, by the standards of deep learning and large language model (LLM) development, is considered a long time ago. However, more recent architectures, such as Meta's Llama models, are still based on the same underlying concepts, introducing only minor modifications. Hence, understanding GPT remains as relevant as ever, and this book focuses on implementing the prominent architecture behind GPT while providing pointers to specific tweaks employed by alternative LLMs.
Lastly, it's interesting to note that although the original transformer model, consisting of encoder and decoder blocks, was explicitly designed for language translation, GPT models, despite their larger yet simpler decoder-only architecture aimed at next-word prediction, are also capable of performing translation tasks. This capability was initially unexpected to researchers, as it emerged from a model primarily trained on a next-word prediction task, which is a task that did not specifically target translation.
The ability to perform tasks that the model wasn't explicitly trained to perform is called an "emergent behavior." This capability isn't explicitly taught during training but emerges as a natural consequence of the model's exposure to vast quantities of multilingual data in diverse contexts. The fact that GPT models can "learn" the translation patterns between languages and perform translation tasks even though they weren't specifically trained for it demonstrates the benefits and capabilities of these large-scale, generative language models. We can perform diverse tasks without using diverse models for each.
1.7 Building a large language model
In this chapter, we laid the groundwork for understanding LLMs. In the remainder of this book, we will be coding one from scratch. We will take the fundamental idea behind GPT as a blueprint and tackle this in three stages, as outlined in Figure 1.9.
Figure 1.9 The stages of building LLMs covered in this book include implementing the LLM architecture and data preparation process, pretraining an LLM to create a foundation model, and finetuning the foundation model to become a personal assistant or text classifier.
First, we will learn about the fundamental data preprocessing steps and code the attention mechanism that is at the heart of every LLM.
Next, in stage 2, we will learn how to code and pretrain a GPT-like LLM capable of generating new texts. And we will also go over the fundamentals of evaluating LLMs, which is essential for developing capable NLP systems.
Note that pretraining an LLM from scratch is a significant endeavor, demanding thousands to millions of dollars in computing costs for GPT-like models. Therefore, the focus of stage 2 is on implementing training for educational purposes using a small dataset. In addition, the book will also provide code examples for loading openly available model weights.
Finally, in stage 3, we will take a pretrained LLM and finetune it to follow instructions such as answering queries or classifying texts -- the most common tasks in many real-world applications and research.
I hope you are looking forward to embarking on this exciting journey!
1.8 Summary
LLMs have transformed the field of natural language processing, which previously mostly relied on explicit rule-based systems and simpler statistical methods. The advent of LLMs introduced new deep learning-driven approaches that led to advancements in understanding, generating, and translating human language.
Modern LLMs are trained in two main steps.
First, they are pretrained on a large corpus of unlabeled text by using the prediction of the next word in a sentence as a "label."
Then, they are finetuned on a smaller, labeled target dataset to follow instructions or perform classification tasks.
LLMs are based on the transformer architecture. The key idea of the transformer architecture is an attention mechanism that gives the LLM selective access to the whole input sequence when generating the output one word at a time.
The original transformer architecture consists of an encoder for parsing text and a decoder for generating text.
LLMs for generating text and following instructions, such as GPT-3 and ChatGPT, only implement decoder modules, simplifying the architecture.
Large datasets consisting of billions of words are essential for pretraining LLMs. In this book, we will implement and train LLMs on small datasets for educational purposes but also see how we can load openly available model weights.
While the general pretraining task for GPT-like models is to predict the next word in a sentence, these LLMs exhibit "emergent" properties such as capabilities to classify, translate, or summarize texts.
Once an LLM is pretrained, the resulting foundation model can be finetuned more efficiently for various downstream tasks.
LLMs finetuned on custom datasets can outperform general LLMs on specific tasks.
[1] Readers with a background in machine learning may note that labeling information is typically required for traditional machine learning models and deep neural networks trained via the conventional supervised learning paradigm. However, this is not the case for the pretraining stage of LLMs. In this phase, LLMs leverage self-supervised learning, where the model generates its own labels from the input data. This concept is covered later in this chapter.
[2] GPT-3, The Language Model, https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_4600000_language_model/
Working with Text Data
This chapter covers
Preparing text for large language model training
Splitting text into word and subword tokens
Byte pair encoding as a more advanced way of tokenizing text
Sampling training examples with a sliding window approach
Converting tokens into vectors that feed into a large language model
In the previous chapter, we covered the general structure of large language models (LLMs) and learned that they are pretrained on vast amounts of text. Specifically, our focus was on decoder-only LLMs based on the transformer architecture, which underlies the models used in ChatGPT and other popular GPT-like LLMs.
During the pretraining stage, LLMs process text one word at a time. Training LLMs with millions to billions of parameters using a next-word prediction task yields models with impressive capabilities. These models can then be further finetuned to follow general instructions or perform specific target tasks. But before we can implement and train LLMs in the upcoming chapters, we need to prepare the training dataset, which is the focus of this chapter, as illustrated in Figure 2.1.
Figure 2.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter will explain and code the data preparation and sampling pipeline that provides the LLM with the text data for pretraining.
In this chapter, you'll learn how to prepare input text for training LLMs. This involves splitting text into individual word and subword tokens, which can then be encoded into vector representations for the LLM. You'll also learn about advanced tokenization schemes like byte pair encoding, which is utilized in popular LLMs like GPT. Lastly, we'll implement a sampling and data loading strategy to produce the input-output pairs necessary for training LLMs in subsequent chapters.
2.1 Understanding word embeddings
Deep neural network models, including LLMs, cannot process raw text directly. Since text is categorical, it isn't compatible with the mathematical operations used to implement and train neural networks. Therefore, we need a way to represent words as continuous-valued vectors. (Readers unfamiliar with vectors and tensors in a computational context can learn more in Appendix A, section A2.2 Understanding tensors.)
The concept of converting data into a vector format is often referred to as embedding. Using a specific neural network layer or another pretrained neural network model, we can embed different data types, for example, video, audio, and text, as illustrated in Figure 2.2.
Figure 2.2 Deep learning models cannot process data formats like video, audio, and text in their raw form. Thus, we use an embedding model to transform this raw data into a dense vector representation that deep learning architectures can easily understand and process. Specifically, this figure illustrates the process of converting raw data into a three-dimensional numerical vector.
As shown in Figure 2.2, we can process various different data formats via embedding models. However, it's important to note that different data formats require distinct embedding models. For example, an embedding model designed for text would not be suitable for embedding audio or video data.
At its core, an embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space -- the primary purpose of embeddings is to convert non-numeric data into a format that neural networks can process.
While word embeddings are the most common form of text embedding, there are also embeddings for sentences, paragraphs, or whole documents. Sentence or paragraph embeddings are popular choices for retrieval-augmented generation. Retrieval-augmented generation combines generation (like producing text) with retrieval (like searching an external knowledge base) to pull relevant information when generating text, which is a technique that is beyond the scope of this book. Since our goal is to train GPT-like LLMs, which learn to generate text one word at a time, this chapter focuses on word embeddings.
There are several algorithms and frameworks that have been developed to generate word embeddings. One of the earlier and most popular examples is the Word2Vec approach. Word2Vec trained a neural network architecture to generate word embeddings by predicting the context of a word given the target word, or vice versa. The main idea behind Word2Vec is that words that appear in similar contexts tend to have similar meanings. Consequently, when projected into 2-dimensional word embeddings for visualization purposes, it can be seen that similar terms cluster together, as shown in Figure 2.3.
Figure 2.3 If word embeddings are two-dimensional, we can plot them in a two-dimensional scatterplot for visualization purposes as shown here. When using word embedding techniques, such as Word2Vec, words corresponding to similar concepts often appear close to each other in the embedding space. For instance, different types of birds appear closer to each other in the embedding space compared to countries and cities.
Word embeddings can have varying dimensions, from one to thousands. As shown in Figure 2.3, we can choose two-dimensional word embeddings for visualization purposes. A higher dimensionality might capture more nuanced relationships but at the cost of computational efficiency.
While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimizing the embeddings as part of the LLM training instead of using Word2Vec is that the embeddings are optimized to the specific task and data at hand. We will implement such embedding layers later in this chapter. Furthermore, LLMs can also create contextualized output embeddings, as we discuss in chapter 3.
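The following minimal PyTorch sketch shows what such an embedding layer is: a trainable lookup table whose rows are the token vectors and whose weights are optimized along with the rest of the model during training. The vocabulary size and embedding dimension below are example values:
import torch

torch.manual_seed(123)

vocab_size = 50257      # example value, e.g., the GPT-2 vocabulary size
embedding_dim = 768     # example value, e.g., the smallest GPT-2 embedding size

# The embedding layer is a trainable lookup table of shape (vocab_size, embedding_dim).
embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([11, 42, 256])          # three example token IDs
token_embeddings = embedding_layer(token_ids)    # look up one vector per token

print(token_embeddings.shape)                    # torch.Size([3, 768])
print(embedding_layer.weight.requires_grad)      # True: updated during training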
Unfortunately, high-dimensional embeddings present a challenge for visualization because our sensory perception and common graphical representations are inherently limited to three dimensions or fewer, which is why Figure 2.3 showed two-dimensional embeddings in a two-dimensional scatterplot. However, when working with LLMs, we typically use embeddings with a much higher dimensionality than shown in Figure 2.3. For both GPT-2 and GPT-3, the embedding size (often referred to as the dimensionality of the model's hidden states) varies based on the specific model variant and size. It is a trade-off between performance and efficiency. The smallest GPT-2 models (117M and 125M parameters) use an embedding size of 768 dimensions to provide concrete examples. The largest GPT-3 model (175B parameters) uses an embedding size of 12,288 dimensions.
The upcoming sections in this chapter will walk through the required steps for preparing the embeddings used by an LLM, which include splitting text into words, converting words into tokens, and turning tokens into embedding vectors.
2.2 Tokenizing text
This section covers how we split input text into individual tokens, a required preprocessing step for creating embeddings for an LLM. These tokens are either individual words or special characters, including punctuation characters, as shown in Figure 2.4.
Figure 2.4 A view of the text processing steps covered in this section in the context of an LLM. Here, we split an input text into individual tokens, which are either words or special characters, such as punctuation characters. In upcoming sections, we will convert the text into token IDs and create token embeddings.
The text we will tokenize for LLM training is a short story by Edith Wharton called The Verdict, which has been released into the public domain and is thus permitted to be used for LLM training tasks. The text is available on Wikisource at https://en.wikisource.org/wiki/The_Verdict, and you can copy and paste it into a text file; I copied it into a text file named "the-verdict.txt" so that it can be loaded using Python's standard file reading utilities:
Listing 2.1 Reading in a short story as text sample into Python
with open("the-verdict.txt", "r", encoding="utf-8") as f:
raw_text = f.read()
print("Total number of character:", len(raw_text))
print(raw_text[:99])
Alternatively, you can find this "the-verdict.txt" file in this book's GitHub repository at https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/01 main-chapter-code. 或者,你可以在本书的 GitHub 仓库中找到这个"the-verdict.txt"文件,网址是 https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/01 main-chapter-code.
The print command prints the total number of characters followed by the first 100 characters of this file for illustration purposes: 打印命令会打印出此文件中字符的总数,然后再打印出前 100 个字符,供参考使用:
Total number of character: 20479 总字符数:20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 我一直认为杰克·吉斯伯恩是一个廉价的天才,尽管他是个不错的家伙,所以这并不令人惊讶
Our goal is to tokenize this 20,479-character short story into individual words and special characters that we can then turn into embeddings for LLM training in the upcoming chapters.
TEXT SAMPLE SIZES 文本样本大小
Note that it's common to process millions of articles and hundreds of thousands of books -- many gigabytes of text -- when working with LLMs. However, for educational purposes, it's sufficient to work with smaller text samples like a single book to illustrate the main ideas behind the text processing steps and to make it possible to run it in reasonable time on consumer hardware. 请注意,在使用LLMs时,处理数百万篇文章和数十万本书籍——即数十个吉字节的文本是很常见的。但是,为教学目的,使用单本书籍这样的较小文本样本就足以说明文本处理步骤背后的主要思想,并能在消费级硬件上在合理的时间内运行。
How can we best split this text to obtain a list of tokens? For this, we go on a small excursion and use Python's regular expression library re for illustration purposes. (Note that you don't have to learn or memorize any regular expression syntax since we will transition to a pre-built tokenizer later in this chapter.) 我们如何才能最好地拆分这段文本以获得一个标记列表?为此,我们进行一次小小的远足,并使用 Python 的正则表达式库 re 作为说明。(请注意,您不需要学习或记住任何正则表达式语法,因为我们稍后将过渡到一个预建的标记器。)
Using some simple example text, we can use the re.split command with the following syntax to split a text on whitespace characters: 使用一些简单的示例文本,我们可以使用以下语法使用 re.split 命令在空白字符上拆分文本:
import re
text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)
print(result)
The result is a list of individual words, whitespaces, and punctuation characters: 结果是一个由单个单词、空格和标点符号组成的列表
Note that the simple tokenization scheme above mostly works for separating the example text into individual words, however, some words are still connected to punctuation characters that we want to have as separate list entries. We also refrain from making all text lowercase because capitalization helps LLMs distinguish between proper nouns and common nouns, understand sentence structure, and learn to generate text with proper capitalization. 请注意,上面简单的标记化方案主要用于将示例文本分为单个单词,但一些单词仍然与我们希望作为单独列表条目的标点符号相连。我们还避免将所有文本转换为小写,因为大写有助于LLMs区分专有名词和普通名词、了解句子结构,并学会生成带有适当大写的文本。
Let's modify the regular expression to split on whitespaces (\s), commas, and periods:
result = re.split(r'([,.]|\s)', text)
print(result)
We can see that the words and punctuation characters are now separate list entries just as we wanted: 我们可以看到,单词和标点符号现在是单独的列表项,正如我们所想:
A small remaining issue is that the list still includes whitespace characters. Optionally, we can remove these redundant characters safely as follows: 剩下的一个小问题是该列表仍然包含空白字符。可以如下安全地移除这些冗余字符:
result = [item for item in result if item.strip()]
print(result)
The resulting whitespace-free output looks as follows:
When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is sensitive to indentation and spacing). Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme that includes whitespaces. 在开发一个简单的分词器时,我们是应该将空白字符编码为单独的字符还是直接删除它们,这取决于我们的应用程序及其要求。删除空白字符可以减少内存和计算要求。但是,如果我们训练对文本的确切结构敏感的模型(例如,对缩进和间距敏感的 Python 代码),保留空白字符会很有用。在这里,我们为了简单性和简洁性而删除空白字符。稍后,我们将切换到包含空白字符的分词方案。
The tokenization scheme we devised above works well on the simple sample text. Let's modify it a bit further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double-dashes we have seen earlier in the first 100 characters of Edith Wharton's short story, along with additional special characters: 我们上面设计的标记化方案在简单的示例文本上运作良好。让我们稍作修改,使其也能处理问号、引号以及艾迪丝·沃顿短篇小说前 100 个字符中出现的双破折号等其他类型的标点符号,以及其他特殊字符。
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)
As we can see based on the results summarized in Figure 2.5, our tokenization scheme can now handle the various special characters in the text successfully. 根据图 2.5 中总结的结果,我们可以看到,我们的分词方案现在可以成功处理文本中的各种特殊字符。
Figure 2.5 The tokenization scheme we implemented so far splits text into individual words and punctuation characters. In the specific example shown in this figure, the sample text gets split into individual tokens. 图 2.5 我们目前实施的标记化方案将文本拆分为单个单词和标点符号字符。在此图中显示的特定示例中,示例文本被拆分为 个单独的标记。
Now that we got a basic tokenizer working, let's apply it to Edith Wharton's entire short story: 现在我们已经有了一个基本的分词器工作,让我们将其应用于艾迪丝·华顿的整篇短篇小说:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))
The above print statement outputs 4690, which is the number of tokens in this text (without whitespaces).
Let's print the first 30 tokens for a quick visual check: 让我们打印前 30 个单词以快速视觉检查:
print(preprocessed[:30])
The resulting output shows that our tokenizer appears to be handling the text well since all words and special characters are neatly separated: 生成的输出显示,我们的标记器似乎能很好地处理文本,因为所有单词和特殊字符都被整洁地分开了:
2.3 Converting tokens into token IDs
In the previous section, we tokenized a short story by Edith Wharton into individual tokens. In this section, we will convert these tokens from a Python string to an integer representation to produce the so-called token IDs. This conversion is an intermediate step before converting the token IDs into embedding vectors.
To map the previously generated tokens into token IDs, we have to build a so-called vocabulary first. This vocabulary defines how we map each unique word and special character to a unique integer, as shown in Figure 2.6. 为了将先前生成的标记映射到标记 ID,我们必须首先构建一个所谓的词汇表。此词汇表定义了我们如何将每个唯一的单词和特殊字符映射到一个唯一的整数,如图 2.6 所示。
Figure 2.6 We build a vocabulary by tokenizing the entire text in a training dataset into individual tokens. These individual tokens are then sorted alphabetically, and duplicate tokens are removed. The unique tokens are then aggregated into a vocabulary that defines a mapping from each unique token to a unique integer value. The depicted vocabulary is purposefully small for illustration purposes and contains no punctuation or special characters for simplicity. 图 2.6 我们通过将训练数据集中的整个文本分词成独立的标记来构建词汇表。这些独立的标记然后按字母顺序排序,并删除重复的标记。然后将这些唯一的标记聚合到一个词汇表中,该词汇表定义了每个唯一标记到唯一整数值的映射。所描述的词汇表故意很小,仅用于说明目的,不包含任何标点符号或特殊字符,以保持简单性。
In the previous section, we tokenized Edith Wharton's short story and assigned it to a Python variable called preprocessed. Let's now create a list of all unique tokens and sort them alphabetically to determine the vocabulary size: 在上一节中,我们对埃迪丝·华顿的短篇小说进行了标记分词,并将其分配给名为 preprocessed 的 Python 变量。现在让我们创建一个包含所有唯一标记的列表,并将其按字母顺序排序以确定词汇量大小:
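One way to do this, sketched here so that it defines the all_words variable used in Listing 2.2 below (the exact variable names are assumptions consistent with the surrounding listings), is:

all_words = sorted(list(set(preprocessed)))
vocab_size = len(all_words)
print(vocab_size)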
After determining that the vocabulary size is 1,130 via the above code, we create the vocabulary and print its first 51 entries for illustration purposes: 在确定词汇量为 1,130 通过上述代码后,我们创建了词汇表并打印了其前 51 个条目作为说明:
Listing 2.2 Creating a vocabulary 2.2 创建词汇表
vocab = {token:integer for integer,token in enumerate(all_words)}

for i, item in enumerate(vocab.items()):
    print(item)
    if i > 50:
        break
The output is as follows:
('!', 0)
('"', 1)
("'", 2)
...
('Her', 49)
('Hermia', 50)
As we can see, based on the output above, the dictionary contains individual tokens associated with unique integer labels. Our next goal is to apply this vocabulary to convert new text into token IDs, as illustrated in Figure 2.7. 正如我们可以看到的,根据上述输出,字典包含与唯一整数标签相关联的个别标记。我们的下一个目标是将这些词汇应用于将新文本转换为令牌 ID,如图 2.7 所示。
Figure 2.7 Starting with a new text sample, we tokenize the text and use the vocabulary to convert the text tokens into token IDs. The vocabulary is built from the entire training set and can be applied to the training set itself and any new text samples. The depicted vocabulary contains no punctuation or special characters for simplicity. 图 2.7 从新的文本样本开始,我们对文本进行分词并使用词汇表将文本标记转换为标记 ID。词汇表是从整个训练集构建的,可应用于训练集本身和任何新的文本样本。为简单起见,所描述的词汇表不包含标点符号或特殊字符。
Later in this book, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text. For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens. 在本书后面,当我们想要将 LLM 的输出从数字转换回文本时,我们也需要一种方法将令牌 ID 转换为文本。为此,我们可以创建一个词汇表的反向版本,将令牌 ID 映射回相应的文本令牌。
Let's implement a complete tokenizer class in Python with an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary. In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text. 让我们在 Python 中实现一个完整的标记器类,其中包含一个编码方法,该方法将文本拆分为标记并执行字符串到整数的映射,以通过词汇表产生标记 ID。此外,我们实现了一个解码方法,该方法执行反向整数到字符串的映射,将标记 ID 转换回文本。
The code for this tokenizer implementation is as in listing 2.3: 此分词器实现的代码如列表 2.3 所示:
Listing 2.3 Implementing a simple text tokenizer 列表 2.3 实现简单的文本分词器
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab  #A
        self.int_to_str = {i:s for s,i in vocab.items()}  #B

    def encode(self, text):  #C
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):  #D
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)  #E
        return text
#A Store the vocabulary as a class attribute for access in the encode and decode methods 将词汇作为类属性进行存储,以便在编码和解码方法中访问
#B Create an inverse vocabulary that maps token IDs back to the original text tokens #B 创建一个反向词汇表,将令牌 ID 映射回原始文本令牌
#C Process input text into token IDs #C 将输入文本处理为标记 ID
#D Convert token IDs back into text #D 将令牌 ID 转换回文本
#E Replace spaces before the specified punctuation 我们来替换指定标点符号前的空格
Using the SimpleTokenizerV1 Python class above, we can now instantiate new tokenizer objects via an existing vocabulary, which we can then use to encode and decode text, as illustrated in Figure 2.8. 使用上述 SimpleTokenizerV1 Python 类,我们现在可以通过现有的词汇表实例化新的分词器对象,然后我们可以用它们来编码和解码文本,如图 2.8 所示。
Figure 2.8 Tokenizer implementations share two common methods: an encode method and a decode method. The encode method takes in the sample text, splits it into individual tokens, and converts the tokens into token IDs via the vocabulary. The decode method takes in token IDs, converts them back into text tokens, and concatenates the text tokens into natural text. 图 2.8 标记器实现共享两个常见方法:编码方法和解码方法。编码方法接受样本文本,将其拆分为单独的标记,并通过词汇表将标记转换为标记 ID。解码方法接受标记 ID,将其转换回文本标记,并将文本标记连接成自然文本。
Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and tokenize a passage from Edith Wharton's short story to try it out in practice: 让我们从 SimpleTokenizerV1 类中实例化一个新的分词器对象,并对埃迪斯·沃顿的短篇小说中的一段文字进行分词,以在实践中试用一下:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable
pride."""
ids = tokenizer.encode(text)
print(ids)
The code above prints the following token IDs: 以上代码输出以下 token ID:
Next, let's see if we can turn these token IDs back into text using the decode method: 让我们接下来看看是否可以使用解码方法将这些令牌 ID 转换回文本
print(tokenizer.decode(ids))
This outputs the following text: 这就输出以下文本:
'" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.' "这是他最后一幅作品,你知道,"吉斯伯恩夫人自豪地说。
Based on the output above, we can see that the decode method successfully converted the token IDs back into the original text. 根据上面的输出,我们可以看到解码方法成功地将令牌 ID 转换回原始文本。
So far, so good. We implemented a tokenizer capable of tokenizing and de-tokenizing text based on a snippet from the training set. Let's now apply it to a new text sample that is not contained in the training set: 到目前为止,一切顺利。我们实现了一个分词器,能够基于训练集中的片段对文本进行分词和去分词。现在让我们将其应用于一个不在训练集中的新文本样本:
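A minimal sketch of this step, reusing the tokenizer we just instantiated (the exact sample sentence is illustrative; any text containing a word that is not in the vocabulary, such as "Hello", triggers the same issue):

text = "Hello, do you like tea?"
print(tokenizer.encode(text))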
Executing the code above will result in the following error: 执行上述代码将导致以下错误:
...
KeyError: 'Hello' KeyError: '你好'
The problem is that the word "Hello" was not used in the short story The Verdict. Hence, it is not contained in the vocabulary. This highlights the need to consider large and diverse training sets to extend the vocabulary when working on LLMs.
In the next section, we will test the tokenizer further on text that contains unknown words, and we will also discuss additional special tokens that can be used to provide further context for an LLM during training. 在下一节中,我们将进一步测试分词器在包含未知词语的文本上的性能,并讨论可以用来为训练期间的LLM提供更多上下文的其他特殊标记。
2.4 Adding special context tokens 2.4 添加特殊上下文标记
In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set. In this section, we will modify this tokenizer to handle unknown words. 在上一个部分中,我们实施了一个简单的分词器并将其应用于训练集中的一段文章。在这一部分中,我们将修改这个分词器以处理未知词。
In particular, we will modify the vocabulary and tokenizer we implemented in the previous section into a new version, SimpleTokenizerV2, that supports two new tokens, <|unk|> and <|endoftext|>, as illustrated in Figure 2.9.
Figure 2.9 We add special tokens to a vocabulary to deal with certain contexts. For instance, we add an <|unk|> token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary. Furthermore, we add an <|endoftext|> token that we can use to separate two unrelated text sources. 图 2.9 我们向词汇表添加特殊令牌以处理特定上下文。例如,我们添加一个<|unk|>令牌来表示训练数据中不存在的新词和未知词,这些词不属于现有的词汇表。此外,我们添加一个<|endoftext|>令牌,可用于分隔两个无关的文本来源。
As shown in Figure 2.9, we can modify the tokenizer to use an <|unk|> token if it encounters a word that is not part of the vocabulary. Furthermore, we add an <|endoftext|> token between unrelated texts. For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert an <|endoftext|> token before each document or book that follows a previous text source, as illustrated in Figure 2.10. This helps the LLM understand that, although these text sources are concatenated for training, they are, in fact, unrelated.
Text concatenated from all independent sources:
"... in a thrilling overtime victory. <|endoftext|> ... days happily ever after. <|endoftext|> ... marking its highest gain in the past three months. <|endoftext|> ... journey had forever changed her heart."
Figure 2.10 When working with multiple independent text source, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers, signaling the start or end of a particular segment, allowing for more effective processing and understanding by the LLM. 图 2.10 当处理多个独立的文本源时,我们在这些文本之间添加<|endoftext|>标记。这些<|endoftext|>标记作为标记,标识特定段落的开始或结束,从而实现更有效的处理和理解 by LLM。
Let's now modify the vocabulary to include these two special tokens, <|endoftext|> and <|unk|>, by adding them to the list of all unique words that we created in the previous section:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(len(vocab.items()))
Based on the output of the print statement above, the new vocabulary size is 1161 (the vocabulary size in the previous section was 1159). 根据上面的打印语句输出,新的词汇量为 1161(前一节的词汇量为 1159)。
As an additional quick check, let's print the last 5 entries of the updated vocabulary: 作为一个额外的快速检查,让我们打印更新后词汇表的最后 5 个条目:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)
The code above prints the following: 上面的代码会打印出以下内容:
Based on the code output above, we can confirm that the two new special tokens were indeed successfully incorporated into the vocabulary. Next, we adjust the tokenizer from code listing 2.3 accordingly, as shown in listing 2.4: 根据上述代码输出,我们可以确认这两个新的特殊标记已成功并入词汇表。接下来,我们根据代码清单 2.3 相应调整分词器,如代码清单 2.4 所示:
Listing 2.4 A simple text tokenizer that handles unknown words 列表 2.4 一个简单的文本分词器,可以处理未知词
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [item if item in self.str_to_int  #A
                        else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)  #B
        return text
#A Replace unknown words by <|unk|> tokens
#B Replace spaces before the specified punctuations #B 在指定标点符号前替换空格
Compared to the SimpleTokenizerV1 we implemented in code listing 2.3 in the previous section, the new SimpleTokenizerV2 replaces unknown words with <|unk|> tokens. Let's try it out in practice on a simple text sample that we concatenate from two independent and unrelated sentences:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)
The output is as follows:
'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.'
Next, let's tokenize the sample text using the SimpleTokenizerV2 on the vocab we previously created in listing 2.2: 接下来,让我们使用之前在 2.2 代码清单中创建的词汇表,利用 SimpleTokenizerV2 对示例文本进行分词:
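A minimal sketch of this step, reusing the text variable and vocab defined above:

tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))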
Above, we can see that the list of token IDs contains 1159 for the <|endoftext|> separator token as well as two 1160 tokens, which are used for unknown words. 在上面,我们可以看到令牌 ID 列表中包含 1159 作为<|endoftext|>分隔符令牌,以及两个 1160 令牌,用于未知单词。
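To check the reverse direction, we can de-tokenize the token IDs again; a short sketch using the same tokenizer object is:

print(tokenizer.decode(tokenizer.encode(text)))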
'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.' <|unk|>,你喜欢茶吗?<|endoftext|>在<|unk|>阳光下的露台上。
Based on comparing the de-tokenized text above with the original input text, we know that the training dataset, Edith Wharton's short story The Verdict, did not contain the words "Hello" and "palace." 根据对比上述去标记文本与原始输入文本,我们知道训练数据集埃迪丝·沃顿的短篇小说《判决》中不包含"hello"和"palace"这两个词。
So far, we have discussed tokenization as an essential step in processing text as input to LLMs. Depending on the LLM, some researchers also consider additional special tokens such as the following: 到目前为止,我们已经讨论了词元化作为将文本输入LLMs处理的一个关键步骤。根据LLM,一些研究人员还会考虑以下一些特殊的令牌:
[BOS] (beginning of sequence): This token marks the start of a text. It signifies to the LLM where a piece of content begins. [BOS]这个标记表示文本的开始。它向LLM表示内容的起始位置。
[EOS] (end of sequence): This token is positioned at the end of a text, and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins. [EOS]这个标记位于文本的末尾,当拼接多个无关的文本时特别有用,类似于<|endoftext|>。例如,当合并两篇不同的维基百科文章或书籍时,[EOS]标记指示一篇文章的结尾和下一篇的开始。
[PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or "padded" using the [PAD] token, up to the length of the longest text in the batch. [PAD] (填充):在训练批量大小大于 1 的LLMs时,批量可能包含长度不同的文本。为确保所有文本长度相同,使用[PAD]标记对较短的文本进行扩展或"填充",直至达到批量中最长文本的长度。
Note that the tokenizer used for GPT models does not need any of these tokens mentioned above but only uses an <|endoftext|> token for simplicity. The <|endoftext|> token is analogous to the [EOS] token mentioned above and is also used for padding. However, as we'll explore in subsequent chapters when training on batched inputs, we typically use a mask, meaning we don't attend to padded tokens. Thus, the specific token chosen for padding becomes inconsequential.
Moreover, the tokenizer used for GPT models also doesn't use an <|unk|> token for outof-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks down words into subword units, which we will discuss in the next section. 此外,用于 GPT 模型的分词器也没有使用<|unk|>标记来表示词汇表外的词。相反,GPT 模型使用字节对编码分词器,将单词分解为子词单元,我们将在下一节讨论这一点。
2.5 Byte pair encoding 2.5 字节对编码
We implemented a simple tokenization scheme in the previous sections for illustration purposes. This section covers a more sophisticated tokenization scheme based on a concept called byte pair encoding (BPE). The BPE tokenizer covered in this section was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT. 我们在之前的部分为说明目的实施了一个简单的分词方案。本节介绍了一种基于称为字节对编码(BPE)的概念的更复杂的分词方案。本节介绍的 BPE 分词器被用于训练LLMs等 GPT-2、GPT-3 和 ChatGPT 中使用的原始模型。
Since implementing BPE can be relatively complicated, we will use an existing Python open-source library called tiktoken (https://github.com/openai/tiktoken), which implements the BPE algorithm very efficiently based on source code in Rust. Similar to other Python libraries, we can install the tiktoken library via Python's pip installer from the terminal: 由于实施 BPE 相对比较复杂,我们将使用一个现有的 Python 开源库 called tiktoken(https://github.com/openai/tiktoken),它基于 Rust 源代码非常高效地实现了 BPE 算法。与其他 Python 库类似,我们可以通过 Python 的 pip 安装器从终端安装 tiktoken 库:
pip install tiktoken 安装 tiktoken
The code in this chapter is based on tiktoken 0.5.1. You can use the following code to check the version you currently have installed: 本章中的代码基于 tiktoken 0.5.1。您可以使用以下代码检查当前安装的版本:
from importlib.metadata import version
import tiktoken
print("tiktoken version:", version("tiktoken"))
Once installed, we can instantiate the BPE tokenizer from tiktoken as follows: 安装后,我们可以如下实例化 tiktoken 的 BPE 分词器:
tokenizer = tiktoken.get_encoding("gpt2")
The usage of this tokenizer is similar to SimpleTokenizerV2 we implemented previously via an encode method: 这个标记器的用法与我们之前实现的 SimpleTokenizerV2 类似,都通过 encode 方法进行编码
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of
someunknownPlace."
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
The code above prints the following token IDs: 以上代码输出以下 token ID:
We can then convert the token IDs back into text using the decode method, similar to our SimpleTokenizerV2 earlier: 然后我们可以使用解码方法将令牌 ID 转换回文本,类似于我们之前的 SimpleTokenizerV2
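A sketch of this step, reusing the integers variable from above:

strings = tokenizer.decode(integers)
print(strings)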
'Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.' 你好,你喜欢喝茶吗?<|endoftext|>在一个不知名的阳光洒满的露台上。
We can make two noteworthy observations based on the token IDs and decoded text above. First, the <|endoftext|> token is assigned a relatively large token ID, namely, 50256. In fact, the BPE tokenizer, which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of 50,257, with <|endoftext|> being assigned the largest token ID.
Second, the BPE tokenizer above encodes and decodes unknown words, such as "someunknownPlace", correctly. The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens?
The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-ofvocabulary words. So, thanks to the BPE algorithm, if the tokenizer encounters an unfamiliar word during tokenization, it can represent it as a sequence of subword tokens or characters, as illustrated in Figure 2.11. 基于 BPE 的算法将未在预定义词汇表中的单词分解为较小的子词单元甚至个别字符,使其能够处理不在词汇表中的单词。因此,得益于 BPE 算法,如果分词器在分词过程中遇到了一个陌生的单词,它可以将其表示为一系列子词标记或字符,如图 2.11 所示。
Figure 2.11 BPE tokenizers break down unknown words into subwords and individual characters. This way, a BPE tokenizer can parse any word and doesn't need to replace unknown words with special tokens, such as <|unk|>.
As illustrated in Figure 2.11, the ability to break down unknown words into individual characters ensures that the tokenizer, and consequently the LLM that is trained with it, can process any text, even if it contains words that were not present in its training data. 如图 2.11 所示,将未知单词分解为单个字符的能力可确保分词器及其训练的LLM能够处理任何文本,即使其包含训练数据中不存在的单词。
EXERCISE 2.1 BYTE PAIR ENCODING OF UNKNOWN WORDS 第 2.1 练习 未知词的字节对编码
Try the BPE tokenizer from the tiktoken library on the unknown words "Akwirw ier" and print the individual token IDs. Then, call the decode function on each of the resulting integers in this list to reproduce the mapping shown in Figure 2.11. Lastly, call the decode method on the token IDs to check whether it can reconstruct the original input, "Akwirw ier". 尝试对未知词"Akwirw ier"使用 tiktoken 库中的 BPE 分词器,并打印出单个令牌 ID。然后在这个列表中调用每个结果整数上的解码函数,以复现图 2.11 中显示的映射。最后,调用令牌 ID 上的解码方法,检查是否能重建原始输入"Akwirw ier"。
A detailed discussion and implementation of BPE is out of the scope of this book, but in short, it builds its vocabulary by iteratively merging frequent characters into subwords and frequent subwords into words. For example, BPE starts with adding all individual single characters to its vocabulary ("a", "b", ...). In the next stage, it merges character combinations that frequently occur together into subwords. For example, "d" and "e" may be merged into the subword "de," which is common in many English words like "define", "depend", "made", and "hidden". The merges are determined by a frequency cutoff. 本书不涉及详细讨论和实现 BPE 的内容,但简而言之,它通过反复将频繁出现的字符合并成子词,将频繁出现的子词合并成词来构建词汇表。例如,BPE 最初将所有单个字符添加到词汇表中("a","b"等)。在下一阶段,它将经常共同出现的字符组合合并成子词。例如,"d"和"e"可能会合并成子词"de",这在许多英语单词中很常见,如"define","depend","made"和"hidden"。合并的依据是频率阈值。
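The following toy sketch is not the actual GPT-2 tokenizer and is not part of the book's code; it merely illustrates the merge idea described above by repeatedly merging the most frequent adjacent symbol pair in a tiny, made-up corpus:

from collections import Counter

def most_frequent_pair(words):
    # Count how often each adjacent symbol pair occurs across all words
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol
    merged_words = []
    for word in words:
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged_words.append(new_word)
    return merged_words

corpus = ["define", "depend", "made", "hidden"]
words = [list(w) for w in corpus]   # start from individual characters
for _ in range(3):                  # perform a few merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", words)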
2.6 Data sampling with a sliding window 使用滑动窗口进行数据采样 2.6
The previous section covered the tokenization steps and conversion from string tokens into integer token IDs in great detail. The next step before we can finally create the embeddings for the LLM is to generate the input-target pairs required for training an LLM. 上一节详细介绍了标记化步骤和将字符串标记转换为整数标记 ID 的过程。在我们最终创建LLM的嵌入之前,下一步是生成训练LLM所需的输入-目标对。
What do these input-target pairs look like? As we learned in chapter 1, LLMs are pretrained by predicting the next word in a text, as depicted in figure 2.12 . 这些输入-目标对看起来是什么样的?正如我们在第 1 章中所学的那样,LLMs是通过预测文本中的下一个词来进行预训练的,如图 2.12 所示。
Figure 2.12 Given a text sample, extract input blocks as subsamples that serve as input to the LLM, and the LLM's prediction task during training is to predict the next word that follows the input block. During training, we mask out all words that are past the target. Note that the text shown in this figure would undergo tokenization before the LLM can process it; however, this figure omits the tokenization step for clarity. 图 2.12 给定一个文本样本,提取输入块作为子样本,作为训练期间LLM的输入,而LLM的预测任务是预测输入块后的下一个单词。在训练期间,我们会屏蔽掉目标之后的所有单词。请注意,本图所示的文本在LLM处理之前会经过分词,但为了清楚起见,这里省略了分词步骤。
In this section we implement a data loader that fetches the input-target pairs depicted in Figure 2.12 from the training dataset using a sliding window approach. 在本节中,我们实现了一个数据加载器,它使用滑动窗口的方法从训练数据集中获取图 2.12 所描述的输入-目标对。
To get started, we will first tokenize the whole The Verdict short story we worked with earlier using the BPE tokenizer introduced in the previous section: 要开始,我们首先要使用上一节引入的 BPE 分词器对之前我们处理过的《审判》短篇小说进行分词:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
raw_text = f.read()
enc_text = tokenizer.encode(raw_text)
print(len(enc_text))
Executing the code above will return 5145, the total number of tokens in the training set, after applying the BPE tokenizer. 执行上述代码将返回 5145,这是应用 BPE 分词器后训练集中的总标记数。
Next, we remove the first 50 tokens from the dataset for demonstration purposes as it results in a slightly more interesting text passage in the next steps: 接下来,为了演示目的,我们从数据集中删除前 50 个标记,这会在后续步骤中产生一些更有趣的文本段落:
enc_sample = enc_text[50:]
One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1:
context_size = 4 #A
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y: {y}")
#A The context size determines how many tokens are included in the input 输入的上下文大小决定了包含的标记数量
Running the above code prints the following output: 运行上述代码会打印以下输出:
Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks depicted earlier in figure 2.12, as follows: 处理输入和目标,目标是输入向右移动一个位置,我们可以创建第 2.12 图中描述的下一个词预测任务。
for i in range(1, context_size+1):
context = enc_sample[:i]
desired = enc_sample[i]
print(context, "---->", desired)
The code above prints the following: 上面的代码会打印出以下内容:
Everything left of the arrow (---->) refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict. 箭头(----)左边的所有内容都是指输入给LLM的内容,而箭头右边的令牌 ID 代表LLM需要预测的目标令牌 ID。
For illustration purposes, let's repeat the previous code but convert the token IDs into text: 为了说明的目的,让我们再次重复以前的代码,但将令牌 IDs 转换为文本:
for i in range(1, context_size+1):
context = enc_sample[:i]
desired = enc_sample[i]
print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))
The following outputs show how the input and outputs look in text format: 以下输出以文本格式显示输入和输出的外观:
and ----> established
and established ----> himself
and established himself ----> in
and established himself in ----> a
We've now created the input-target pairs that we can use for LLM training in upcoming chapters.
There's only one more task before we can turn the tokens into embeddings, as we mentioned at the beginning of this chapter: implementing an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays. 在我们将标记转换为嵌入之前,还有最后一个任务需要完成,正如我们在本章开头提到的那样:实现一个高效的数据加载器,它可以遍历输入数据集并以 PyTorch 张量的形式返回输入和目标,PyTorch 张量可以被视为多维数组。
In particular, we are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor that includes the targets for the LLM to predict, as depicted in Figure 2.13. 我们特别对返回两个张量感兴趣:一个包含 LLM 所看到的文本的输入张量,以及一个包含 LLM 需要预测的目标的目标张量,如图 2.13 所示。
Figure 2.13 To implement efficient data loaders, we collect the inputs in a tensor, x, where each row represents one input context. A second tensor, y, contains the corresponding prediction targets (next words), which are created by shifting the input by one position.
While Figure 2.13 shows the tokens in string format for illustration purposes, the code implementation will operate on token IDs directly since the encode method of the BPE tokenizer performs both tokenization and conversion into token IDs as a single step. 尽管图 2.13 以字符串格式显示了令牌以便于说明,但是代码实现将直接操作令牌 ID,因为 BPE 分词器的 encode 方法会在单一步骤内完成分词和转换为令牌 ID 两项操作。
For the efficient data loader implementation, we will use PyTorch's built-in Dataset and DataLoader classes. For additional information and guidance on installing PyTorch, please see section A.1.3, Installing PyTorch, in Appendix A. 为了有效的数据加载器实现,我们将使用 PyTorch 内置的 Dataset 和 DataLoader 类。关于安装 PyTorch 的更多信息和指导,请参见附录 A 中的 A.1.3 节,安装 PyTorch。
The code for the dataset class is shown in code listing 2.5: 数据集类的代码在代码清单 2.5 中显示:
Listing 2.5 A dataset for batched inputs and targets 列表 2.5 用于批量输入和目标的数据集
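A sketch of a dataset class that matches the #A-#D annotations below and the description that follows; names such as GPTDatasetV1, input_chunk, and target_chunk are taken from the surrounding text, while the remaining details are assumptions rather than the definitive implementation:

import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)  #A

        for i in range(0, len(token_ids) - max_length, stride):  #B
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):  #C
        return len(self.input_ids)

    def __getitem__(self, idx):  #D
        return self.input_ids[idx], self.target_ids[idx]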
#A Tokenize the entire text #A 将整个文本进行分词
#B Use a sliding window to chunk the book into overlapping sequences of max_length 使用滑动窗口将书籍划分为最大长度的重叠序列
#C Return the total number of rows in the dataset 返回数据集中的总行数
#D Return a single row from the dataset #D 从数据集返回单个行
The GPTDatasetV1 class in listing 2.5 is based on the PyTorch Dataset class and defines how individual rows are fetched from the dataset, where each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor. The target_chunk tensor contains the corresponding targets. I recommend reading on to see how the data returned from this dataset looks like when we combine the dataset with a PyTorch DataLoader -- this will bring additional intuition and clarity. 列出 2.5 中的 GPTDatasetV1 类基于 PyTorch Dataset 类,定义了如何从数据集中获取单个行,每行由多个令牌 ID 组成(基于 max_length)并分配给 input_chunk 张量。target_chunk 张量包含相应的目标。我建议继续阅读,看看当我们将该数据集与 PyTorch DataLoader 结合时,返回的数据是什么样子 - 这将带来更多直觉和清晰度。
If you are new to the structure of PyTorch Dataset classes, such as shown in listing 2.5, please read section A.6, Setting up efficient data loaders, in Appendix A, which explains the general structure and usage of PyTorch Dataset and DataLoader classes. 如果您是第一次接触 PyTorch 数据集类的结构,如示例 2.5 所示,请阅读附录 A 中的第 A.6 节"设置高效的数据加载器",该节解释了 PyTorch 数据集和数据加载器类的一般结构和使用。
The following code will use the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader: 以下代码将使用 GPTDatasetV1 来通过 PyTorch DataLoader 以批次加载输入:
Listing 2.6 A data loader to generate batches with input-target pairs
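A sketch of such a function, consistent with the #C and #D annotations below and with how create_dataloader_v1 is used later in this chapter; the default argument values shown here are assumptions:

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")                   # initialize the BPE tokenizer
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)  # create the dataset
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,      #C
        num_workers=num_workers   #D
    )
    return dataloader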
#C drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training #C drop_last=True 在训练过程中丢弃最后一个 batch,如果它的长度短于指定的 batch_size,以防止损失出现尖峰
#D The number of CPU processes to use for preprocessing 使用的 CPU 进程数量用于预处理
Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4 to develop an intuition of how the GPTDatasetV1 class from listing 2.5 and the create_dataloader_v1 function from listing 2.6 work together: 让我们用批量大小为 1 来测试数据加载器,对于一个 LLM 上下文大小为 4 的输入,以开发对 listing 2.5 中的 GPTDatasetV1 类和 listing 2.6 中的 create_dataloader_v1 函数如何协作的直观理解:
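A sketch of this test, using the batch size, context size, and stride values discussed in the surrounding text (shuffle=False is an assumption to make the example reproducible):

dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)   #A
first_batch = next(data_iter)
print(first_batch)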
#A convert dataloader into a Python iterator to fetch the next entry via Python's built-in next() function 将数据加载器转换为 Python 迭代器,以通过 Python 内置的 next() 函数获取下一个条目
Executing the preceding code prints the following: 执行上述代码将输出以下内容:
The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs. Since the max_length is set to 4, each of the two tensors contains 4 token IDs. Note that an input size of 4 is relatively small and only chosen for illustration purposes. It is common to train LLMs with input sizes of at least 256. first_batch 变量包含两个张量:第一个张量存储输入令牌 ID,第二个张量存储目标令牌 ID。由于 max_length 设置为 4,因此两个张量中都包含 4 个令牌 ID。请注意,输入大小为 4 相对较小,仅用于说明目的。通常情况下,会使用至少 256 的输入大小进行训练。
To illustrate the meaning of stride=1, let's fetch another batch from this dataset: 以 stride=1 为例说明其含义,让我们从这个数据集中再获取一个 batch:
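Continuing with the data_iter iterator from above, a sketch of this step is:

second_batch = next(data_iter)
print(second_batch)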
If we compare the first with the second batch, we can see that the second batch's token IDs are shifted by one position compared to the first batch (for example, the second ID in the first batch's input is 367 , which is the first ID of the second batch's input). The stride setting dictates the number of positions the inputs shift across batches, emulating a sliding window approach, as demonstrated in Figure 2.14. 如果我们将第一批次与第二批次进行比较,我们可以看到第二批次的 token ID 相对于第一批次平移了一个位置(例如,第一批次输入中的第二个 ID 为 367,这是第二批次输入中的第一个 ID)。步长设置决定了输入在批次间移动的位置数,模拟了图 2.14 中所示的滑动窗口方法。
Figure 2.14 When creating multiple batches from the input dataset, we slide an input window across the text. If the stride is set to 1 , we shift the input window by 1 position when creating the next batch. If we set the stride equal to the input window size, we can prevent overlaps between the batches. 图 2.14 当从输入数据集创建多个批次时,我们将输入窗口滑动过文本。如果步幅设置为 1,我们在创建下一个批次时将输入窗口移动 1 个位置。如果我们将步幅设置为等于输入窗口大小,我们可以防止批次之间重叠。
EXERCISE 2.2 DATA LOADERS WITH DIFFERENT STRIDES AND CONTEXT SIZES 练习 2.2 具有不同步幅和上下文大小的数据加载器
To develop more intuition for how the data loader works, try running it with different combinations of max_length and stride settings.
Batch sizes of 1 , such as we have sampled from the data loader so far, are useful for illustration purposes. If you have previous experience with deep learning, you may know that small batch sizes require less memory during training but lead to more noisy model updates. Just like in regular deep learning, the batch size is a trade-off and hyperparameter to experiment with when training LLMs. 1 批大小,如我们目前从数据加载器中采样的那样,对于说明目的很有用。如果您之前有深度学习经验,您可能知道小批大小在训练期间需要的内存更少,但会导致更嘈杂的模型更新。与常规深度学习一样,批量大小是一个需要权衡的超参数,在训练时需要进行实验。
Before we move on to the two final sections of this chapter that are focused on creating the embedding vectors from the token IDs, let's have a brief look at how we can use the data loader to sample with a batch size greater than 1 : 在我们转到本章最后两节关于从标记 ID 创建嵌入向量的内容之前,让我们简单看一下如何使用数据加载器以大于 1 的批量大小进行采样:
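A sketch of such a call; the batch size of 8 and max_length of 4 are illustrative choices consistent with the batch used later in this chapter, stride=4 follows the note below, and shuffle=False is an assumption for reproducibility:

dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)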
Note that we increase the stride to 4 . This is to utilize the data set fully (we don't skip a single word) but also avoid any overlap between the batches, since more overlap could lead to increased overfitting. 注意我们将步长增加到 4。这是为了完全利用数据集(我们不跳过任何单词),但同时也避免批次之间的任何重叠,因为更多的重叠可能会导致过拟合加剧。
In the final two sections of this chapter, we will implement embedding layers that convert the token IDs into continuous vector representations, which serve as input data format for LLMs. 在本章的最后两节中,我们将实现嵌入层,将令牌 ID 转换为连续的向量表示,这些向量表示用作LLMs的输入数据格式。
2.7 Creating token embeddings 创建令牌嵌入
The last step for preparing the input text for LLM training is to convert the token IDs into embedding vectors, as illustrated in Figure 2.15, which will be the focus of these two last remaining sections of this chapter. 为LLM训练准备输入文本的最后一步是将 token ID 转换为嵌入向量,如图 2.15 所示,这将是本章最后两节的重点。
Figure 2.15 Preparing the input text for an LLM involves tokenizing text, converting text tokens to token IDs, and converting token IDs into vector embedding vectors. In this section, we consider the token IDs created in previous sections to create the token embedding vectors. 图 2.15 为LLM准备输入文本包括分词、将文本标记转换为标记 ID,以及将标记 ID 转换为向量嵌入向量。在本节中,我们考虑在前几节中创建的标记 ID 来创建标记嵌入向量。
In addition to the processes outlined in Figure 2.15, it is important to note that we initialize these embedding weights with random values as a preliminary step. This initialization serves as the starting point for the LLM's learning process. We will optimize the embedding weights as part of the LLM training in chapter 5. 除了图 2.15 中概述的过程外,我们还需要注意,我们将这些嵌入权重初始化为随机值作为初步步骤。这种初始化作为LLM学习过程的起点。我们将在第 5 章的LLM训练中优化嵌入权重。
A continuous vector representation, or embedding, is necessary since GPT-like LLMs are deep neural networks trained with the backpropagation algorithm. If you are unfamiliar with how neural networks are trained with backpropagation, please read section A.4, Automatic differentiation made easy, in Appendix A. 连续的向量表示或嵌入是必要的,因为类似于 GPT 的LLMs是使用反向传播算法训练的深度神经网络。如果您不熟悉如何使用反向传播训练神经网络,请阅读附录 A 中的 A.4 节"Automatic differentiation made easy"。
Let's illustrate how the token ID to embedding vector conversion works with a hands-on example. Suppose we have the following four input tokens with IDs 2, 3, 5, and 1 : 让我们通过一个实际的例子来说明令牌 ID 到嵌入向量的转换过程。假设我们有以下四个输入令牌,它们的 ID 分别是 2、3、5 和 1:
input_ids = torch.tensor([2, 3, 5, 1])
For the sake of simplicity and illustration purposes, suppose we have a small vocabulary of only 6 words (instead of the 50,257 words in the BPE tokenizer vocabulary), and we want to create embeddings of size 3 (in GPT-3, the embedding size is 12,288 dimensions): 为了简单起见和说明目的,假设我们只有 6 个词的小词汇表(而不是 BPE 分词器词汇表中的 50,257 个词),我们想创建大小为 3 的嵌入(在 GPT-3 中,嵌入大小为 12,288 维)
vocab_size = 6
output_dim = 3
Using the vocab_size and output_dim, we can instantiate an embedding layer in PyTorch, setting the random seed to 123 for reproducibility purposes: 使用 vocab_size 和 output_dim,我们可以在 PyTorch 中实例化一个嵌入层,为了可重复性设置随机种子为 123:
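A sketch of this step, assuming torch has been imported earlier:

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight)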
We can see that the weight matrix of the embedding layer contains small, random values. These values are optimized during LLM training as part of the LLM optimization itself, as we will see in upcoming chapters. Moreover, we can see that the weight matrix has six rows and three columns. There is one row for each of the six possible tokens in the vocabulary. And there is one column for each of the three embedding dimensions. 我们可以看到,嵌入层的权重矩阵包含小的随机值。这些值在LLM训练过程中作为LLM优化的一部分进行优化,正如我们将在后续章节中看到的。此外,我们可以看到权重矩阵有六行三列。每个词汇表中有六个可能的令牌对应一行。每个嵌入维度对应一列。
After we instantiated the embedding layer, let's now apply it to a token ID to obtain the embedding vector: 在我们初始化嵌入层之后,现在让我们将其应用于一个 token ID,以获得嵌入向量:
print(embedding_layer(torch.tensor([3])))
The returned embedding vector is a single row with three values, one per embedding dimension.
If we compare the embedding vector for token ID 3 to the previous embedding matrix, we see that it is identical to the 4th row (Python starts with a zero index, so it's the row corresponding to index 3). In other words, the embedding layer is essentially a look-up operation that retrieves rows from the embedding layer's weight matrix via a token ID. 如果我们将 token ID 3 的嵌入向量与之前的嵌入矩阵进行比较,我们会发现它与第 4 行(Python 从零索引开始,所以对应于索引 3 的行)完全相同。换句话说,嵌入层本质上是一个查找操作,通过 token ID 从嵌入层的权重矩阵中检索行。
EMBEDDING LAYERS VERSUS MATRIX MULTIPLICATION 嵌入层与矩阵乘法
For those who are familiar with one-hot encoding, the embedding layer approach above is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully connected layer, which is illustrated in the supplementary code on GitHub at https://github.com/rasbt/LLMs-from-scratch/tree/ main/ch02/03 bonus embedding-vs-matmul. Because the embedding layer is just a more efficient implementation equivalent to the one-hot encoding and matrixmultiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation. 对于熟悉 one-hot 编码的人来说,上述的嵌入层方法本质上就是一种更有效的实现 one-hot 编码后的矩阵乘法方式,这在 GitHub 上的补充代码中有所说明,地址为 https://github.com/rasbt/LLMs-from-scratch/tree/ main/ch02/03 bonus embedding-vs-matmul。由于嵌入层只是对 one-hot 编码和矩阵乘法方法的更有效实现,因此可以将其视为一种可通过反向传播优化的神经网络层。
Previously, we have seen how to convert a single token ID into a three-dimensional embedding vector. Let's now apply that to all four input IDs we defined earlier (torch.tensor([2, 3, 5, 1])): 之前,我们已经看到了如何将单个标记 ID 转换为三维嵌入向量。现在让我们将此应用于我们之前定义的四个输入 ID (torch.tensor([2, 3, 5, 1])):
print(embedding_layer(input_ids))
The print output reveals that this results in a 4 x 3 matrix:
Each row in this output matrix is obtained via a lookup operation from the embedding weight matrix, as illustrated in Figure 2.16. 该输出矩阵的每一行都是通过从嵌入权重矩阵中进行查找操作获得的,如图 2.16 所示。
Figure 2.16 Embedding layers perform a look-up operation, retrieving the embedding vector corresponding to the token ID from the embedding layer's weight matrix. For instance, the embedding vector of the token ID 5 is the sixth row of the embedding layer weight matrix (it is the sixth instead of the fifth row because Python starts counting at 0 ). For illustration purposes, we assume that the token IDs were produced by the small vocabulary we used in section 2.3. 图 2.16 嵌入层执行查找操作,从嵌入层的权重矩阵中检索对应于令牌 ID 的嵌入向量。例如,令牌 ID 5 的嵌入向量是嵌入层权重矩阵的第六行(而不是第五行,因为 Python 从 0 开始计数)。为了说明目的,我们假设令牌 ID 是由第 2.3 节中使用的小词汇表产生的。
This section covered how we create embedding vectors from token IDs. The next and final section of this chapter will add a small modification to these embedding vectors to encode positional information about a token within a text. 本节介绍了我们如何从标记 ID 创建嵌入向量。本章的下一节和最后一节将对这些嵌入向量进行细小的修改,以编码有关文本中标记位置的信息。
2.8 Encoding word positions 2.8 编码词位置
In the previous section, we converted the token IDs into a continuous vector representation, the so-called token embeddings. In principle, this is a suitable input for an LLM. However, a minor shortcoming of LLMs is that their self-attention mechanism, which will be covered in detail in chapter 3, doesn't have a notion of position or order for the tokens within a sequence. 在上一个部分中,我们将 token ID 转换为连续的向量表示,即所谓的 token 嵌入。从原理上说,这是一个适合LLM的输入。然而,LLMs的一个小缺点是它们的自注意机制,将在第 3 章中详细介绍,对序列中 token 的位置或顺序没有概念。
The way the previously introduced embedding layer works is that the same token ID always gets mapped to the same vector representation, regardless of where the token ID is positioned in the input sequence, as illustrated in Figure 2.17. 之前介绍的嵌入层的工作方式是,相同的令牌 ID 总是会被映射到相同的向量表示,而不管令牌 ID 在输入序列中的位置如何,如图 2.17 所示。
Figure 2.17 The embedding layer converts a token ID into the same vector representation regardless of where it is located in the input sequence. For example, the token ID 5, whether it's in the first or third position in the token ID input vector, will result in the same embedding vector. 图 2.17 嵌入层将令牌 ID 转换为相同的向量表示,而不管其位于输入序列的何处。例如,令牌 ID 5,无论位于令牌 ID 输入向量的第一或第三位置,都会生成相同的嵌入向量。
In principle, the deterministic, position-independent embedding of the token ID is good for reproducibility purposes. However, since the self-attention mechanism of LLMs itself is also position-agnostic, it is helpful to inject additional position information into the LLM. 原则上,令牌 ID 的确定性、独立于位置的嵌入有利于可重复性的目的。然而,由于LLMs自身的自注意机制也是无位置依赖的,因此向LLM中注入额外的位置信息会很有帮助。
Absolute positional embeddings are directly associated with specific positions in a sequence. For each position in the input sequence, a unique embedding is added to the token's embedding to convey its exact location. For instance, the first token will have a specific positional embedding, the second token another distinct embedding, and so on, as illustrated in Figure 2.18. 绝对位置嵌入直接与序列中的特定位置相关联。对于输入序列中的每个位置,都会向令牌的嵌入添加一个唯一的嵌入,以传达其确切位置。例如,第一个令牌将具有特定的位置嵌入,第二个令牌将具有另一个不同的嵌入,依此类推,如图 2.18 所示。
Figure 2.18 Positional embeddings are added to the token embedding vector to create the input embeddings for an LLM. The positional vectors have the same dimension as the original token embeddings. The token embeddings are shown with value 1 for simplicity. 图 2.18 位置嵌入被添加到标记嵌入向量中,以创建 LLM的输入嵌入。位置向量与原始标记嵌入具有相同的维度。为简单起见,标记嵌入以值 1 表示。
Instead of focusing on the absolute position of a token, the emphasis of relative positional embeddings is on the relative position or distance between tokens. This means the model learns the relationships in terms of "how far apart" rather than "at which exact position." The advantage here is that the model can generalize better to sequences of varying lengths, even if it hasn't seen such lengths during training. 不是关注令牌的绝对位置,相对位置嵌入强调令牌之间的相对位置或距离。这意味着模型学习的是"相距有多远"的关系,而不是"在哪个确切的位置"。这里的优势在于,即使在训练过程中没有见过这样的长度序列,模型也能更好地推广到不同长度的序列。
Both types of positional embeddings aim to augment the capacity of LLMs to understand the order and relationships between tokens, ensuring more accurate and context-aware predictions. The choice between them often depends on the specific application and the nature of the data being processed. 两种位置嵌入都旨在增强 LLMs理解令牌之间顺序和关系的能力,确保更精确和上下文感知的预测。在它们之间的选择通常取决于具体应用程序和所处理数据的性质。
OpenAI's GPT models use absolute positional embeddings that are optimized during the training process rather than being fixed or predefined like the positional encodings in the original Transformer model. This optimization process is part of the model training itself, which we will implement later in this book. For now, let's create the initial positional embeddings to create the LLM inputs for the upcoming chapters. OpenAI 的 GPT 模型使用在训练过程中优化的绝对位置嵌入,而不像原始 Transformer 模型中预先定义的位置编码。这种优化过程是模型训练本身的一部分,我们将在本书后续实现。现在,让我们创建初始位置嵌入,为即将到来的章节建立LLM输入。
Previously, we focused on very small embedding sizes in this chapter for illustration purposes. We now consider more realistic and useful embedding sizes and encode the input tokens into a 256-dimensional vector representation. This is smaller than what the original GPT-3 model used (in GPT-3, the embedding size is 12,288 dimensions) but still reasonable for experimentation. Furthermore, we assume that the token IDs were created by the BPE tokenizer that we implemented earlier, which has a vocabulary size of 50,257 : 之前,我们在本章中为了说明目的而关注了非常小的嵌入尺寸。现在我们考虑更实际和有用的嵌入尺寸,并将输入标记编码为 256 维向量表示。这比原始 GPT-3 模型使用的尺寸(GPT-3 中的嵌入尺寸为 12,288 维)小,但仍然适合实验。此外,我们假设令牌 ID 是由我们之前实现的 BPE 分词器创建的,该分词器的词汇表大小为 50,257。
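A sketch of the corresponding embedding layer, using the vocabulary size and embedding dimension just described:

vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)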
Using the token_embedding_layer above, if we sample data from the data loader, we embed each token in each batch into a 256-dimensional vector. If we have a batch size of 8 with four tokens each, the result will be an 8 x 4 x 256 tensor.
Let's instantiate the data loader from section 2.6, Data sampling with a sliding window, first: 让我们首先实例化第 2.6 节的数据加载器,即滑动窗口数据采样:
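A sketch of this step; the batch size of 8 and max_length of 4 follow the description below, and shuffle=False is an assumption for reproducibility:

max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)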
As we can see, the token ID tensor is 8 x 4-dimensional, meaning that the data batch consists of 8 text samples with 4 tokens each.
Let's now use the embedding layer to embed these token IDs into 256-dimensional vectors: 让我们现在使用嵌入层将这些标记 ID 嵌入到 256 维向量中:
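A sketch of this step, reusing the inputs tensor and the token_embedding_layer from above:

token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)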
The preceding print function call returns the following: 前面的 print 函数调用返回以下内容:
torch.Size([8, 4, 256])
As we can tell based on the 8 x 4 x 256-dimensional tensor output, each token ID is now embedded as a 256-dimensional vector.
For a GPT model's absolute embedding approach, we just need to create another embedding layer that has the same dimension as the token_embedding_layer: 对于 GPT 模型的绝对嵌入方法,我们只需要创建另一个嵌入层,其维度与 token_embedding_layer 相同
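A sketch of this step; setting context_length to the max_length used above is an assumption consistent with the explanation that follows:

context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)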
As shown in the preceding code example, the input to the pos_embeddings is usually a placeholder vector torch.arange(context_length), which contains a sequence of numbers 0, 1, ..., up to the maximum input length - 1. The context_length is a variable that represents the supported input size of the LLM. Here, we choose it similar to the maximum length of the input text. In practice, input text can be longer than the supported context length, in which case we have to truncate the text.
The output of the print statement is as follows: 打印语句的输出结果如下:
torch.Size([4, 256])
As we can see, the positional embedding tensor consists of four 256-dimensional vectors. We can now add these directly to the token embeddings, where PyTorch will add the 4 x 256-dimensional pos_embeddings tensor to each 4 x 256-dimensional token embedding tensor in each of the 8 batches:
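A sketch of this final step:

input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)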
The input_embeddings we created, as summarized in Figure 2.19, are the embedded input examples that can now be processed by the main LLM modules, which we will begin implementing in chapter 3 如图 2.19 所示,我们创建的 input_embeddings 是可以由第 3 章开始实现的主要 LLM 模块处理的嵌入式输入示例。
Figure 2.19 As part of the input processing pipeline, input text is first broken up into individual tokens. These tokens are then converted into token IDs using a vocabulary. The token IDs are converted into embedding vectors to which positional embeddings of a similar size are added, resulting in input embeddings that are used as input for the main LLM layers. 图 2.19 作为输入处理管线的一部分,输入文本首先被分割成独立的词元。这些词元然后使用词汇表转换为词元 ID。将词元 ID 转换为嵌入向量,并添加相同大小的位置嵌入,从而得到作为主要LLM层输入的输入嵌入。
2.9 Summary 2.9 总结
LLMs can't process raw text, so textual data must be converted into numerical vectors, known as embeddings. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations.
As the first step, raw text is broken into tokens, which can be words or characters. Then, the tokens are converted into integer representations, termed token IDs. 作为第一步,原始文本被分解成可以是单词或字符的标记。然后,这些标记被转换成整数表示,称为标记 ID。
Special tokens, such as <|unk|> and <|endoftext|>, can be added to enhance the model's understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts. <代码 0>特殊标记,如<|unk|>和<|endoftext|>,可以添加以增强模型的理解能力,并处理各种上下文,如未知词或标记不相关文本之间的边界。
The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into subword units or individual characters.
We use a sliding window approach on tokenized data to generate inputtarget pairs for LLM training. 我们对分词数据使用滑动窗口方法来生成LLM训练的输入目标对。
Embedding layers in PyTorch function as a lookup operation, retrieving vectors corresponding to token IDs. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs. 在 PyTorch 中,嵌入层函数作为查找操作,检索与令牌 ID 对应的向量。得到的嵌入向量提供了令牌的连续表征,这对于训练深度学习模型(如LLMs)至关重要。
While token embeddings provide consistent vector representations for each token, they lack a sense of the token's position in a sequence. To rectify this, two main types of positional embeddings exist: absolute and relative. OpenAI's GPT models utilize absolute positional embeddings that are added to the token embedding vectors and are optimized during the model training. 虽然令牌嵌入为每个令牌提供了一致的向量表示,但它们缺乏对令牌在序列中位置的感知。为了纠正这一点,存在两种主要类型的位置嵌入:绝对和相对。OpenAI 的 GPT 模型利用绝对位置嵌入,它们被添加到令牌嵌入向量中,并在模型训练期间进行优化。
3
Coding Attention Mechanisms
This chapter covers 本章涵盖
Exploring the reasons for using attention mechanisms in neural networks 探索在神经网络中使用注意力机制的原因
Introducing a basic self-attention framework and progressing to an enhanced selfattention mechanism 介绍一个基本的自注意力框架并发展到一个增强的自注意力机制
Implementing a causal attention module that allows LLMs to generate one token at a time 实施一个因果注意力模块,使LLMs能够逐个生成标记
Masking randomly selected attention weights with dropout to reduce overfitting 使用随机遮蔽注意力权重进行 dropout 以减少过拟合
Stacking multiple causal attention modules into a multi-head attention module 将多个因果注意力模块堆叠成一个多头注意力模块
In the previous chapter, you learned how to prepare the input text for training LLMs. This involved splitting text into individual word and subword tokens, which can be encoded into vector representations, the so-called embeddings, for the LLM. 在上一章中,您学习了如何为训练LLMs而准备输入文本。这涉及到将文本分割成个别词和子词标记,可以将其编码为向量表示,即所谓的嵌入,用于LLM。
In this chapter, we will now look at an integral part of the LLM architecture itself, attention mechanisms, as illustrated in Figure 3.1. 在这一章中,我们将看一看LLM架构本身的一个重要组成部分,即注意力机制,如图 3.1 所示。
Figure 3.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter focuses on attention mechanisms, which are an integral part of an LLM architecture. 图 3.1 编码LLM的三个主要阶段、在通用文本数据集上预训练LLM以及在标注数据集上微调的心理模型。 本章专注于注意力机制,这是LLM架构的不可或缺的一部分。
Attention mechanisms are a comprehensive topic, which is why we are devoting a whole chapter to it. We will largely look at these attention mechanisms in isolation and focus on them at a mechanistic level. In the next chapter, we will then code the remaining parts of the LLM surrounding the self-attention mechanism to see it in action and to create a model to generate text. 注意力机制是一个广泛的话题,这就是我们专门为此划分一章的原因。我们将主要从机械层面孤立地研究这些注意力机制。在下一章中,我们将编码自注意力机制周围的其余部分,以查看其运作情况并创建一个生成文本的模型。
Over the course of this chapter, we will implement four different variants of attention mechanisms, as illustrated in Figure 3.2. 在本章的过程中,我们将实现图 3.2 中所示的四种不同的注意力机制变体。
Figure 3.2 The figure depicts different attention mechanisms we will code in this chapter, starting with a simplified version of self-attention before adding the trainable weights. The causal attention mechanism adds a mask to self-attention that allows the LLM to generate one word at a time. Finally, multi-head attention organizes the attention mechanism into multiple heads, allowing the model to capture various aspects of the input data in parallel. 图 3.2 该图描绘了我们将在本章中编写的不同注意力机制,从简化版的自注意开始,然后添加可训练权重。因果注意力机制在自注意力上添加了一个掩码,允许LLM一次生成一个单词。最后,多头注意力将注意力机制组织成多个头,允许模型并行捕捉输入数据的各个方面。
These different attention variants shown in Figure 3.2 build on each other, and the goal is to arrive at a compact and efficient implementation of multi-head attention at the end of this chapter that we can then plug into the LLM architecture we will code in the next chapter. 图 3.2 中展示的这些不同注意力变体是建立在彼此之上的,本章的目标是最终达到一个紧凑高效的多头注意力实现,我们可以将其插入到下一章中我们要编写的LLM架构中。
3.1 The problem with modeling long sequences 3.1 对于建模长序列的问题
Before we dive into the self-attention mechanism at the heart of LLMs later in this chapter, let's consider the problem with pre-LLM architectures that do not include attention mechanisms. Suppose we want to develop a language translation model that translates text from one language into another. As shown in Figure 3.3, we can't simply translate a text word by word due to the grammatical structures in the source and target language.
Certain words in the generated translation require access to words that appear earlier or later in the original sentence 在生成的翻译中,某些词语需要访问原句中稍早或稍后出现的词语
Figure 3.3 When translating text from one language to another, such as German to English, it's not possible to merely translate word by word. Instead, the translation process requires contextual understanding and grammar alignment. 图 3.3 在从一种语言翻译到另一种语言时,例如从德语翻译到英语,并不可能仅凭逐字翻译。相反,翻译过程需要具有上下文理解和语法对齐。
To address the issue that we cannot translate text word by word, it is common to use a deep neural network with two submodules, a so-called encoder and decoder. The job of the encoder is to first read in and process the entire text, and the decoder then produces the translated text. 为了解决我们无法逐字翻译文本的问题,通常会使用一个由两个子模块组成的深度神经网络,即所谓的编码器和解码器。编码器的任务是首先读入和处理整个文本,而解码器则产生翻译后的文本。
We already briefly discussed encoder-decoder networks when we introduced the transformer architecture in chapter 1 (section 1.4, Using LLMs for different tasks). Before the advent of transformers, recurrent neural networks (RNNs) were the most popular encoder-decoder architecture for language translation. 我们在第 1 章(第 1.4 节,使用LLMs进行不同任务)介绍变换器架构时已经简要讨论了编码器-解码器网络。在变换器出现之前,循环神经网络(RNN)是用于语言翻译的最流行的编码器-解码器架构。
An RNN is a type of neural network where outputs from previous steps are fed as inputs to the current step, making them well-suited for sequential data like text. If you are unfamiliar with RNNs, don't worry, you don't need to know the detailed workings of RNNs to follow this discussion; our focus here is more on the general concept of the encoder-decoder setup.
In an encoder-decoder RNN, the input text is fed into the encoder, which processes it sequentially. The encoder updates its hidden state (the internal values at the hidden layers) at each step, trying to capture the entire meaning of the input sentence in the final hidden state, as illustrated in Figure 3.4. The decoder then takes this final hidden state to start generating the translated sentence, one word at a time. It also updates its hidden state at each step, which is supposed to carry the context necessary for the next-word prediction. 在编码器-解码器 RNN 中,输入文本被馈送到编码器中进行顺序处理。编码器在每个步骤中更新其隐藏状态(隐藏层的内部值),试图在最终隐藏状态中捕获整个输入句子的含义,如图 3.4 所示。然后,解码器利用这最终的隐藏状态开始逐字生成翻译后的句子。它也在每个步骤中更新自己的隐藏状态,这个状态应该携带下一个词预测所需的上下文。
Figure 3.4 Before the advent of transformer models, encoder-decoder RNNs were a popular choice for machine translation. The encoder takes a sequence of tokens from the source language as input, where a hidden state (an intermediate neural network layer) of the encoder encodes a compressed representation of the entire input sequence. Then, the decoder uses its current hidden state to begin the translation, token by token. 图 3.4 在变换器模型出现之前,编码器-解码器循环神经网络是机器翻译的流行选择。编码器将源语言的一系列标记作为输入,其中编码器的隐藏状态(中间神经网络层)编码了整个输入序列的压缩表示。然后,解码器使用其当前的隐藏状态开始逐个标记地进行翻译。
While we don't need to know the inner workings of these encoder-decoder RNNs, the key idea here is that the encoder part processes the entire input text into a hidden state (memory cell). The decoder then takes in this hidden state to produce the output. You can think of this hidden state as an embedding vector, a concept we discussed in chapter 2. 虽然我们不需要了解这些编码-解码 RNN 的内部工作原理,但关键思想在于编码器部分将整个输入文本处理为隐藏状态(记忆单元)。解码器然后接收这个隐藏状态来产生输出。你可以将这个隐藏状态视为一个嵌入向量,这是我们在第 2 章中讨论过的概念。
The big issue and limitation of encoder-decoder RNNs is that the RNN can't directly access earlier hidden states from the encoder during the decoding phase. Consequently, it relies solely on the current hidden state, which encapsulates all relevant information. This can lead to a loss of context, especially in complex sentences where dependencies might span long distances. 编码器-解码器递归神经网络的一个重大问题和局限性是,在解码阶段,递归神经网络无法直接访问编码器的早期隐藏状态。因此,它完全依赖于当前隐藏状态,该状态包含了所有相关信息。这可能导致上下文丢失,尤其是在依赖可能跨越长距离的复杂句子中。
For readers unfamiliar with RNNs, it is not essential to understand or study this architecture as we will not be using it in this book. The takeaway message of this section is that encoder-decoder RNNs had a shortcoming that motivated the design of attention mechanisms. 对于不熟悉循环神经网络(RNNs)的读者来说,理解和学习这种架构并不是必需的,因为我们在本书中不会使用它。本节的要点是编码器-解码器 RNNs 存在一个缺陷,这引发了注意力机制的设计。
3.2 Capturing data dependencies with attention mechanisms 通过注意力机制捕获数据依赖关系
Before transformer LLMs, it was common to use RNNs for language modeling tasks such as language translation, as mentioned previously. RNNs work fine for translating short sentences but don't work well for longer texts as they don't have direct access to previous words in the input. 在 transformer LLMs之前,人们通常会使用 RNN 来进行语言建模任务,如语言翻译,如前所述。RNN 可以很好地翻译短句,但对于较长的文本不太适用,因为它们无法直接访问输入中的先前单词。
One major shortcoming in this approach is that the RNN must remember the entire encoded input in a single hidden state before passing it to the decoder, as illustrated in Figure 3.4 in the previous section. 这种方法的一个主要缺点是 RNN 必须在单个隐藏状态中记住整个编码输入,然后再传递给解码器,如前一节中图 3.4 所示。
Hence, researchers developed the so-called Bahdanau attention mechanism for RNNs in 2014 (named after the first author of the respective paper), which modifies the encoder-decoder RNN such that the decoder can selectively access different parts of the input sequence at each decoding step as illustrated in Figure 3.5.
Figure 3.5 Using an attention mechanism, the text-generating decoder part of the network can access all input tokens selectively. This means that some input tokens are more important than others for generating a given output token. The importance is determined by the so-called attention weights, which we will compute later. Note that this figure shows the general idea behind attention and does not depict the exact implementation of the Bahdanau mechanism, which is an RNN method outside this book's scope. 图 3.5 使用注意力机制,网络的文本生成解码器部分可以选择性地访问所有输入标记。这意味着某些输入标记对于生成给定的输出标记而言更加重要。重要性由所谓的注意力权重决定,我们将在稍后计算。请注意,此图仅展示了注意力背后的一般思想,并未描述巴达纳乌机制的具体实现,该机制是一种超出本书范畴的 RNN 方法。
Interestingly, only three years later, researchers found that RNN architectures are not required for building deep neural networks for natural language processing and proposed the original transformer architecture (discussed in chapter 1) with a self-attention mechanism inspired by the Bahdanau attention mechanism. 有趣的是,仅 3 年后,研究人员发现神经网络架构并非构建深度神经网络进行自然语言处理所需,并提出了由 Bahdanau 注意力机制启发的原始 transformer 架构(在第 1 章中讨论)及其自注意力机制。
Self-attention is a mechanism that allows each position in the input sequence to attend to all positions in the same sequence when computing the representation of a sequence. Self-attention is a key component of contemporary LLMs based on the transformer architecture, such as the GPT series. 自注意力是一种机制,它允许输入序列中的每个位置在计算序列表示时注意同一序列中的所有位置。自注意力是建立在变换架构之上的当代LLMs系列(如 GPT)的关键组成部分。
This chapter focuses on coding and understanding this self-attention mechanism used in GPT-like models, as illustrated in Figure 3.6. In the next chapter, we will then code the remaining parts of the LLM. 本章着重于编码和理解图 3.6 所示的 GPT 类模型中使用的自注意力机制。在下一章中,我们将编码LLM的其余部分。
Figure 3.6 Self-attention is a mechanism in transformers that is used to compute more efficient input representations by allowing each position in a sequence to interact with and weigh the importance of all other positions within the same sequence. In this chapter, we will code this self-attention mechanism from the ground up before we code the remaining parts of the GPT-like LLM in the following chapter. 图 3.6 自注意力是 transformers 中的一种机制,用于通过允许序列中的每个位置与同一序列中的所有其他位置进行交互和加权重要性来计算更高效的输入表示。在本章中,我们将从头开始编写这种自注意力机制,然后在下一章中编写类似 GPT 的LLM的其余部分。
3.3 Attending to different parts of the input with self-attention 通过自注意力关注输入的不同部分
We'll now cover the inner workings of the self-attention mechanism and learn how to code it from the ground up. Self-attention serves as the cornerstone of every LLM based on the transformer architecture. It's worth noting that this topic may require a lot of focus and attention (no pun intended), but once you grasp its fundamentals, you will have conquered one of the toughest aspects of this book and implementing LLMs in general. 我们现在将介绍自注意力机制的内部工作原理,并学习如何从头开始对其进行编码。自注意力是基于 Transformer 架构的每一个LLM的基础。值得注意的是,这个话题可能需要大量的专注和注意力(不是有意的),但一旦你掌握了其基本原理,你就将征服本书中最艰难的几个方面,并在普遍情况下实现LLMs。
THE "SELF" IN SELF-ATTENTION 自注意力中的"自我"
In self-attention, the "self" refers to the mechanism's ability to compute attention weights by relating different positions within a single input sequence. It assesses and learns the relationships and dependencies between various parts of the input itself, such as words in a sentence or pixels in an image. This is in contrast to traditional attention mechanisms, where the focus is on the relationships between elements of two different sequences, such as in sequence-to-sequence models where the attention might be between an input sequence and an output sequence, such as the example depicted in Figure 3.5. 在自注意力中,"自"指机制通过比较单个输入序列中不同位置之间的关系来计算注意力权重的能力。它评估和学习输入本身不同部分之间的关系和依赖性,如句子中的词语或图像中的像素。这与传统的注意力机制形成对比,后者关注两个不同序列的元素之间的关系,例如在序列到序列模型中,注意力可能集中在输入序列和输出序列之间,如图 3.5 所示的示例。
Since self-attention can appear complex, especially if you are encountering it for the first time, we will begin by introducing a simplified version of self-attention in the next subsection. Afterwards, in section 3.4 , we will then implement the self-attention mechanism with trainable weights, which is used in LLMs. 由于自注意力可能很复杂,特别是如果你是第一次接触它,我们将在下一个小节中介绍自注意力的简化版本。之后,在第 3.4 节中,我们将实现具有可训练权重的自注意力机制,这是在LLMs中使用的。
3.3.1 A simple self-attention mechanism without trainable weights 3.3.1 没有可训练权重的简单自注意力机制
In this section, we implement a simplified variant of self-attention, free from any trainable weights, which is summarized in Figure 3.7. The goal of this section is to illustrate a few key concepts in self-attention before adding trainable weights next in section 3.4. 在这一部分中,我们实现了一个简化版的自注意力机制,没有任何可训练的权重,这在图 3.7 中总结。本节的目的是在 3.4 节添加可训练权重之前,说明自注意力机制的一些关键概念。
Figure 3.7 The goal of self-attention is to compute a context vector, for each input element, that combines information from all other input elements. In the example depicted in this figure, we compute the context vector z^(2). The importance or contribution of each input element for computing z^(2) is determined by the attention weights α_21 to α_2T. When computing z^(2), the attention weights are calculated with respect to input element x^(2) and all other inputs. The exact computation of these attention weights is discussed later in this section.
Figure 3.7 shows an input sequence, denoted as x, consisting of T elements represented as x^(1) to x^(T). This sequence typically represents text, such as a sentence, that has already been transformed into token embeddings, as explained in chapter 2.
For example, consider an input text like "Your journey starts with one step." In this case, each element of the sequence, such as x^(1), corresponds to a d-dimensional embedding vector representing a specific token, like "Your." In Figure 3.7, these input vectors are shown as 3-dimensional embeddings.
In self-attention, our goal is to calculate context vectors z^(i) for each element x^(i) in the input sequence. A context vector can be interpreted as an enriched embedding vector.
To illustrate this concept, let's focus on the embedding vector of the second input element, x^(2) (which corresponds to the token "journey"), and the corresponding context vector, z^(2), shown at the bottom of Figure 3.7. This enhanced context vector, z^(2), is an embedding that contains information about x^(2) and all other input elements x^(1) to x^(T).
In self-attention, context vectors play a crucial role. Their purpose is to create enriched representations of each element in an input sequence (like a sentence) by incorporating information from all other elements in the sequence, as illustrated in Figure 3.7. This is essential in LLMs, which need to understand the relationship and relevance of words in a sentence to each other. Later, we will add trainable weights that help an LLM learn to construct these context vectors so that they are relevant for the LLM to generate the next token. 在自注意力中,上下文向量扮演着至关重要的角色。它们的目的是通过吸收序列(如句子)中所有其他元素的信息,为序列中的每个元素创造丰富的表述,如图 3.7 所示。这在需要理解句子中单词之间的关系和相关性的LLMs中至关重要。后续,我们将添加可训练的权重,帮助LLM学习构建这些上下文向量,使其对LLM生成下一个标记相关。
In this section, we implement a simplified self-attention mechanism to compute these weights and the resulting context vector one step at a time. 在本节中,我们实现了一种简化的自注意力机制来逐步计算这些权重和相应的上下文向量。
Consider the following input sentence, which has already been embedded into 3-dimensional vectors as discussed in chapter 2. We choose a small embedding dimension for illustration purposes to ensure it fits on the page without line breaks:
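A sketch of such an input tensor, with values chosen to be consistent with the outputs printed later in this section (for example, the dot product 0.9544 shown below):

import torch
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)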
The first step of implementing self-attention is to compute the intermediate values ω, referred to as attention scores, as illustrated in Figure 3.8. (Please note that Figure 3.8 displays the values of the preceding inputs tensor in a truncated version; for example, 0.87 is truncated to 0.8 due to spatial constraints. In this truncated version, the embeddings of the words "journey" and "starts" may appear similar by random chance.)
Figure 3.8 The overall goal of this section is to illustrate the computation of the context vector z^(2) using the second input element x^(2) as a query. This figure shows the first intermediate step, computing the attention scores ω_21 to ω_2T between the query x^(2) and all other input elements as a dot product. (Note that the numbers in the figure are truncated to one digit after the decimal point to reduce visual clutter.)
Figure 3.8 illustrates how we calculate the intermediate attention scores between the query token and each input token. We determine these scores by computing the dot product of the query, x^(2), with every other input token:
query = inputs[1] #A
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)
#A The second input token serves as the query 第二个输入令牌用作查询
The computed attention scores are as follows: 计算得出的注意力分数如下:
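With the example inputs tensor shown above, these scores evaluate to approximately:

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])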
A dot product is essentially just a concise way of multiplying two vectors elementwise and then summing the products, which we can demonstrate as follows: 点积本质上只是一种简洁的方式来逐个元素相乘两个向量,然后对这些乘积求和,我们可以如下方式演示:
res = 0.
for idx, element in enumerate(inputs[0]):
res += inputs[0][idx] * query[idx]
print(res)
print(torch.dot(inputs[0], query))
The output confirms that the sum of the element-wise multiplication gives the same results as the dot product:
tensor(0.9544)
tensor(0.9544)
Beyond viewing the dot product operation as a mathematical tool that combines two vectors to yield a scalar value, the dot product is a measure of similarity because it quantifies how much two vectors are aligned: a higher dot product indicates a greater degree of alignment or similarity between the vectors. In the context of selfattention mechanisms, the dot product determines the extent to which elements in a sequence attend to each other: the higher the dot product, the higher the similarity and attention score between two elements. 将点积运算视为将两个矢量组合以产生标量值的数学工具之外,点积是一种相似性度量,因为它量化了两个矢量的对齐程度:点积越高,说明这两个矢量的对齐程度或相似程度越高。在自注意力机制的背景下,点积确定序列中的元素彼此关注的程度:点积越高,两个元素之间的相似性和注意力得分就越高。
In the next step, as shown in Figure 3.9, we normalize each of the attention scores that we computed previously. 在下一步中,如图 3.9 所示,我们标准化了之前计算的每个注意力分数。
Figure 3.9 After computing the attention scores ω_21 to ω_2T with respect to the input query x^(2), the next step is to obtain the attention weights α_21 to α_2T by normalizing the attention scores.
The main goal behind the normalization shown in Figure 3.9 is to obtain attention weights that sum up to 1 . This normalization is a convention that is useful for interpretation and for maintaining training stability in an LLM. Here's a straightforward method for achieving this normalization step: 图 3.9 中所示的归一化的主要目标是获得注意力权重之和为 1。这种归一化是一种惯例,有助于解释并保持LLM中的训练稳定性。以下是实现此归一化步骤的简单方法:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
In practice, it's more common and advisable to use the softmax function for normalization. This approach is better at managing extreme values and offers more favorable gradient properties during training. Below is a basic implementation of the softmax function for normalizing the attention scores: 在实践中,使用 softmax 函数进行归一化更加常见和可取。这种方法更擅长处理极端值,并在训练过程中提供更有利的梯度特性。下面是 softmax 函数的基本实现,用于归一化注意力得分:
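A minimal sketch of such a naive softmax function and its application to the attention scores (the function and variable names are assumptions):

def softmax_naive(x):
    # exponentiate each score and divide by the sum of the exponentiated scores
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())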
As the output shows, the softmax function also meets the objective and normalizes the attention weights such that they sum to 1 : 如输出所示,softmax 函数也满足目标并将注意力权重归一化,使它们的总和为 1
In addition, the softmax function ensures that the attention weights are always positive. This makes the output interpretable as probabilities or relative importance, where higher weights indicate greater importance. 此外,softmax 函数确保注意力权重始终为正。这使得输出可解释为概率或相对重要性,其中较高的权重表示更高的重要性。
Note that this naive softmax implementation (softmax_naive) may encounter numerical instability problems, such as overflow and underflow, when dealing with large or small input values. Therefore, in practice, it's advisable to use the PyTorch implementation of softmax, which has been extensively optimized for performance: 请注意,这个简单的 softmax 实现(softmax_naive)可能会遇到数值稳定性问题,如溢出和下溢,当处理大或小的输入值时。因此,在实践中,最好使用 PyTorch 的 softmax 实现,它已经过广泛的性能优化。
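For example:

attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())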
Now that we computed the normalized attention weights, we are ready for the final step illustrated in Figure 3.10: calculating the context vector z^(2) by multiplying the embedded input tokens, x^(i), with the corresponding attention weights and then summing the resulting vectors.
Figure 3.10 The final step, after calculating and normalizing the attention scores to obtain the attention weights for query x^(2), is to compute the context vector z^(2). This context vector is a combination of all input vectors x^(1) to x^(T) weighted by the attention weights.
The context vector z^(2) depicted in Figure 3.10 is calculated as a weighted sum of all input vectors. This involves multiplying each input vector by its corresponding attention weight:
query = inputs[1] # 2nd input token is the query
context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
context_vec_2 += attn_weights_2[i]*x_i
print(context_vec_2)
The results of this computation are as follows: 这次计算的结果如下:
tensor([0.4419, 0.6515, 0.5683])
In the next section, we will generalize this procedure for computing context vectors to calculate all context vectors simultaneously. 在下一节中,我们将概括这种用于计算上下文向量的过程,同时计算所有的上下文向量。
3.3.2 Computing attention weights for all input tokens 3.3.2 为所有输入标记计算注意力权重
In the previous section, we computed attention weights and the context vector for input 2, as shown in the highlighted row in Figure 3.11. Now, we are extending this computation to calculate attention weights and context vectors for all inputs. 在前一节中,我们计算了输入 2 的注意力权重和上下文向量,如图 3.11 中高亮的行所示。现在,我们正在扩展这一计算过程,计算所有输入的注意力权重和上下文向量。
Figure 3.11 The highlighted row shows the attention weights for the second input element as a query, as we computed in the previous section. This section generalizes the computation to obtain all other attention weights. 图 3.11 高亮行显示了作为查询的第二个输入元素的注意力权重,我们在上一节中计算了这些权重。本节概括了计算所有其他注意力权重的方法。
We follow the same three steps as before, as summarized in Figure 3.12, except that we make a few modifications in the code to compute all context vectors instead of only the second context vector, z^(2).
Figure 3.12 In self-attention, we begin by computing the attention scores, which are then normalized to obtain attention weights that sum up to 1 . These attention weights are used to compute the context vectors as a weighted sum of the inputs. 图 3.12 在自注意力中,我们首先计算注意力分数,然后将其标准化以获得注意力权重,这些权重之和为 1。这些注意力权重用于计算上下文向量,这是输入的加权和。
First, in step 1 as illustrated in Figure 3.12, we add an additional for-loop to compute the dot products for all pairs of inputs. 首先,如图 3.12 所示的第 1 步中,我们添加了一个额外的 for 循环来计算所有输入对的点积。
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
for j, x_j in enumerate(inputs):
attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)
The resulting attention scores are as follows: 注意力分数如下:
Each element in the preceding tensor represents an attention score between each pair of inputs, as illustrated in Figure 3.11. Note that the values in Figure 3.11 are normalized, which is why they differ from the unnormalized attention scores in the preceding tensor. We will take care of the normalization later. 前述张量中的每个元素表示每对输入之间的注意力分数,如图 3.11 所示。请注意,图 3.11 中的值已被归一化,这就是为什么它们与前述张量中未归一化的注意力分数有所不同。我们将在稍后处理归一化的问题。
When computing the preceding attention score tensor, we used for-loops in Python. However, for-loops are generally slow, and we can achieve the same results using matrix multiplication: 在计算前一注意力分数张量时,我们在 Python 中使用了 for 循环。然而,for 循环通常比较慢,我们可以通过矩阵乘法来获得相同的结果:
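A sketch of the matrix multiplication version (attn_scores is an assumed variable name):

attn_scores = inputs @ inputs.T
print(attn_scores)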
We can visually confirm that the results are the same as before: 我们可以通过目视确认结果与之前相同:
In step 2, as illustrated in Figure 3.12, we now normalize each row so that the values in each row sum to 1 : 如图 3.12 所示,在步骤 2 中,我们现在将每一行规范化,使每一行的值之和为 1。
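A sketch of this row-wise normalization step:

attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)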
This returns the following attention weight tensor that matches the values shown in Figure 3.10: 这返回以下与图 3.10 中显示的值相匹配的注意力权重张量:
In the context of using PyTorch, the dim parameter in functions like torch.softmax specifies the dimension of the input tensor along which the function will be computed. By setting dim=-1, we are instructing the softmax function to apply the normalization along the last dimension of the attn_scores tensor. If attn_scores is a 2D tensor (for example, with a shape of [rows, columns]), dim=-1 will normalize across the columns so that the values in each row (summing over the column dimension) sum up to 1.
Before we move on to step 3, the final step shown in Figure 3.12, let's briefly verify that the rows indeed all sum to 1 : 在我们转到图 3.12 中显示的第 3 步之前,让我们简单地验证这些行确实都加起来等于 1:
In the third and last step, we now use these attention weights to compute all context vectors via matrix multiplication: 在第三和最后一步中,我们现在利用这些注意力权重通过矩阵乘法计算所有上下文向量:
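A sketch of this step (the variable name all_context_vecs is an assumption):

all_context_vecs = attn_weights @ inputs
print(all_context_vecs)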
In the resulting output tensor, each row contains a 3-dimensional context vector: 在生成的输出张量中,每行包含一个 3 维的上下文向量:
We can double-check that the code is correct by comparing the 2nd row with the context vector z^(2) that we computed previously in section 3.3.1:
Based on the result, we can see that the previously calculated context_vec_2 matches the second row in the previous tensor exactly: 根据结果,我们可以看到之前计算的 context_vec_2 与先前张量的第二行完全一致
This concludes the code walkthrough of a simple self-attention mechanism. In the next section, we will add trainable weights, enabling the LLM to learn from data and improve its performance on specific tasks. 这就是一个简单的自注意力机制的代码演练。在下一节中,我们将添加可训练的权重,使LLM能够从数据中学习,并提高在特定任务上的性能。
3.4 Implementing self-attention with trainable weights 使用可训练权重实现 self-attention
In this section, we are implementing the self-attention mechanism that is used in the original transformer architecture, the GPT models, and most other popular LLMs. This self-attention mechanism is also called scaled dot-product attention. Figure 3.13 provides a mental model illustrating how this self-attention mechanism fits into the broader context of implementing an LLM.
Figure 3.13 A mental model illustrating how the self-attention mechanism we code in this section fits into the broader context of this book and chapter. In the previous section, we coded a simplified attention mechanism to understand the basic mechanism behind attention mechanisms. In this section, we add trainable weights to this attention mechanism. In the upcoming sections, we will then extend this self-attention mechanism by adding a causal mask and multiple heads. 图 3.13 一个心理模型说明了我们在本节中编码的自注意力机制如何适应本书和本章的更广泛背景。在上一节中,我们编码了一个简化的注意力机制来理解注意力机制背后的基本机制。在本节中,我们为这种注意力机制添加了可训练的权重。在接下来的几节中,我们将通过添加因果遮罩和多个头来扩展这种自注意力机制。
As illustrated in Figure 3.13 the self-attention mechanism with trainable weights builds on the previous concepts: we want to compute context vectors as weighted sums over the input vectors specific to a certain input element. As you will see, there are only slight differences compared to the basic self-attention mechanism we coded earlier in section 3.3. 如图 3.13 所示,使用可训练权重的自注意力机制建立在前述概念之上:我们希望计算特定于某个输入元素的加权和上下文向量。如你所见,与我们在第 3.3 节中编写的基本自注意力机制相比,只有细微差别。
The most notable difference is the introduction of weight matrices that are updated during model training. These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce "good" context vectors. (Note that we will train the LLM in chapter 5.)
We will tackle this self-attention mechanism in the two subsections. First, we will code it step-by-step as before. Second, we will organize the code into a compact Python class that can be imported into an LLM architecture, which we will code in chapter 4. 我们将在两个小节中探讨这种自注意力机制。首先,我们将像以前一样逐步编码它。其次,我们将把代码组织成一个紧凑的 Python 类,可以导入到第 4 章中编写的LLM架构中。
3.4.1 Computing the attention weights step by step 逐步计算注意力权重
We will implement the self-attention mechanism step by step by introducing the three trainable weight matrices W_q, W_k, and W_v. These three matrices are used to project the embedded input tokens, x^(i), into query, key, and value vectors as illustrated in Figure 3.14.
(Figure 3.14 annotations: the second input token serves as the query, and the value vector corresponding to the first input token is obtained via matrix multiplication between the weight matrix W_v and the input token x^(1).)
Figure 3.14 In the first step of the self-attention mechanism with trainable weight matrices, we compute query (q^(i)), key (k^(i)), and value (v^(i)) vectors for input elements x^(i). Similar to previous sections, we designate the second input, x^(2), as the query input. The query vector q^(2) is obtained via matrix multiplication between the input x^(2) and the weight matrix W_q. Similarly, we obtain the key and value vectors via matrix multiplication involving the weight matrices W_k and W_v.
Earlier in section 3.3.1, we defined the second input element x^(2) as the query when we computed the simplified attention weights to compute the context vector z^(2). Later, in section 3.3.2, we generalized this to compute all context vectors z^(1) to z^(T) for the six-word input sentence "Your journey starts with one step."
Similarly, we will start by computing only one context vector, z^(2), for illustration purposes. In the next section, we will modify this code to calculate all context vectors.
Let's begin by defining a few variables: 让我们从定义几个变量开始:
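A sketch of these definitions (the variable names x_2, d_in, and d_out are assumptions consistent with the code that follows):

x_2 = inputs[1]          #A
d_in = inputs.shape[1]   #B
d_out = 2                #C
#A The second input token serves as the query
#B The input embedding size, d_in=3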
#C The output embedding size, d_out=2 输出嵌入大小, d_out=2
Note that in GPT-like models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input (d_in=3) and output (d_out=2) dimensions here. 请注意,在 GPT 类似的模型中,输入和输出维度通常是相同的,但为了便于说明计算过程,我们在此选择不同的输入(d_in=3)和输出(d_out=2)维度。
Next, we initialize the three weight matrices W_q, W_k, and W_v that are shown in Figure 3.14:
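A sketch of this initialization; the specific random seed is an assumption chosen so that the query vector printed below can be reproduced:

torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)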
Note that we are setting requires_grad=False to reduce clutter in the outputs for illustration purposes, but if we were to use the weight matrices for model training, we would set requires_grad=True to update these matrices during model training. 请注意,我们设置 requires_grad=False 是为了减少输出中的混乱,但如果我们要将权重矩阵用于模型训练,我们将设置 requires_grad=True 以便在模型训练期间更新这些矩阵。
Next, we compute the query, key, and value vectors as shown earlier in Figure 3.14: 接下来,我们计算如图 3.14 所示的查询、键和值向量:
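A sketch of this projection step for the query input x_2:

query_2 = x_2 @ W_query
key_2   = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)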
As we can see based on the output for the query, this results in a 2-dimensional vector since we set the number of columns of the corresponding weight matrix, via d_out, to 2 : 正如我们从查询的输出中可以看到的,由于我们通过 d_out 将相应权重矩阵的列数设置为 2,因此这导致了一个 2 维向量
tensor([0.4306, 1.4551])
WEIGHT PARAMETERS VS ATTENTION WEIGHTS 权重参数与注意力权重
Note that in the weight matrices W, the term "weight" is short for "weight parameters," the values of a neural network that are optimized during training. This is not to be confused with the attention weights. As we already saw in the previous section, attention weights determine the extent to which a context vector depends on the different parts of the input, i.e., to what extent the network focuses on different parts of the input.
In summary, weight parameters are the fundamental, learned coefficients that define the network's connections, while attention weights are dynamic, context-specific values. 总而言之,权重参数是定义网络连接的基本学习系数,而注意力权重是动态的上下文相关值。
Even though our temporary goal is to only compute the one context vector, z^(2), we still require the key and value vectors for all input elements as they are involved in computing the attention weights with respect to the query q^(2), as illustrated in Figure 3.14.
We can obtain all keys and values via matrix multiplication: 通过矩阵乘法,我们可以得到所有的键和值
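A sketch of this step, projecting all inputs at once:

keys = inputs @ W_key
values = inputs @ W_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)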
As we can tell from the outputs, we successfully projected the 6 input tokens from a 3D onto a 2D embedding space: 从输出结果来看,我们成功地将 3D 空间中的 6 个输入标记投影到了 2D 嵌入空间中
The second step is now to compute the attention scores, as shown in Figure 3.15. 下一步是计算注意力分数,如图 3.15 所示。
Figure 3.15 The attention score computation is a dot-product computation similar to what we have used in the simplified self-attention mechanism in section 3.3. The new aspect here is that we are not directly computing the dot-product between the input elements but using the query and key obtained by transforming the inputs via the respective weight matrices. 图 3.15 注意力分数计算是一种点积计算,类似于我们在第 3.3 节中使用的简化自注意力机制。这里的新方面是,我们不是直接计算输入元素之间的点积,而是使用通过相应权重矩阵变换输入得到的查询和键。
First, let's compute the attention score ω_22:
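A sketch of this computation (keys_2 and attn_score_22 are assumed variable names):

keys_2 = keys[1] #A
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)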
#A Remember that Python starts indexing at 0 记住 Python 从 0 开始索引
This results in the following unnormalized attention score:
tensor(1.8524)
Again, we can generalize this computation to all attention scores via matrix multiplication: 再次,我们可以通过矩阵乘法推广此计算到所有注意力得分:
attn_scores_2 = query_2 @ keys.T # All attention scores for given query
print(attn_scores_2)
As we can see, as a quick check, the second element in the output matches attn_score_22 we computed previously: 如我们所见,作为一个快速检查,输出中的第二个元素与我们先前计算的 attn_score_22 匹配
The third step is now going from the attention scores to the attention weights, as illustrated in Figure 3.16. 第三步现在是从注意力分数转换到注意力权重,如图 3.16 所示。
Figure 3.16 After computing the attention scores ω_21 to ω_2T, the next step is to normalize these scores using the softmax function to obtain the attention weights α_21 to α_2T.
Next, as illustrated in Figure 3.16, we compute the attention weights by scaling the attention scores and using the softmax function we used earlier. The difference to earlier is that we now scale the attention scores by dividing them by the square root of the embedding dimension of the keys, d_k (note that taking the square root is mathematically the same as exponentiating by 0.5):
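A sketch of this scaled normalization step:

d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)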
THE RATIONALE BEHIND SCALED-DOT PRODUCT ATTENTION 缩放点积注意力的基本原理
The reason for the normalization by the embedding dimension size is to improve the training performance by avoiding small gradients. For instance, when scaling up the embedding dimension, which is typically greater than a thousand for GPT-like LLMs, large dot products can result in very small gradients during backpropagation due to the softmax function applied to them. As dot products increase, the softmax function behaves more like a step function, resulting in gradients nearing zero. These small gradients can drastically slow down learning or cause training to stagnate.
The scaling by the square root of the embedding dimension is the reason why this self-attention mechanism is also called scaled-dot product attention. 缩放嵌入维度平方根是这种自注意机制也被称为缩放点积注意力的原因。
Now, the final step is to compute the context vectors, as illustrated in Figure 3.17. 现在,最后一步是计算上下文向量,如图 3.17 所示。
Figure 3.17 In the final step of the self-attention computation, we compute the context vector by combining all value vectors via the attention weights. 图 3.17 在自注意力计算的最后一步中,我们通过注意力权重将所有值向量组合起来计算上下文向量。
Similar to section 3.3, where we computed the context vector as a weighted sum over the input vectors, we now compute the context vector as a weighted sum over the value vectors. Here, the attention weights serve as a weighting factor that weighs the respective importance of each value vector. Similar to section 3.3, we can use matrix multiplication to obtain the output in one step: 与第 3.3 节类似,我们将上下文向量计算为输入向量的加权和,现在我们将上下文向量计算为值向量的加权和。在这里,注意权重作为一个加权因子,评估每个值向量的相对重要性。与第 3.3 节类似,我们可以使用矩阵乘法一步得到输出:
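A sketch of this final step:

context_vec_2 = attn_weights_2 @ values
print(context_vec_2)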
The contents of the resulting vector are as follows: 生成向量的内容如下:
tensor([0.3061, 0.8210])
So far, we only computed a single context vector, z^(2). In the next section, we will generalize the code to compute all context vectors in the input sequence, z^(1) to z^(T).
WHY QUERY, KEY, AND VALUE? 为什么是查询、键和值?
The terms "key," "query," and "value" in the context of attention mechanisms are borrowed from the domain of information retrieval and databases, where similar concepts are used to store, search, and retrieve information. 在注意力机制的背景下,"键(key)"、"查询(query)"和"值(value)"这些术语来自于信息检索和数据库领域,在那里使用类似的概念来存储、搜索和检索信息。
A "query" is analogous to a search query in a database. It represents the current item (e.g., a word or token in a sentence) the model focuses on or tries to understand. The query is used to probe the other parts of the input sequence to determine how much attention to pay to them. 一个"查询"类似于数据库中的搜索查询。它表示模型当前关注或试图理解的项目(例如,句子中的单词或标记)。查询用于探查输入序列的其他部分,以确定应该给予它们多少注意力。
The "key" is like a database key used for indexing and searching. In the attention mechanism, each item in the input sequence (e.g., each word in a sentence) has an associated key. These keys are used to match with the query. 'key'就像是用于索引和搜索的数据库键。在注意力机制中,输入序列中的每个项目(例如,句子中的每个单词)都有一个相关联的键。这些键用于与查询进行匹配。
The "value" in this context is similar to the value in a key-value pair in a database. It represents the actual content or representation of the input items. Once the model determines which keys (and thus which parts of the input) are most relevant to the query (the current focus item), it retrieves the corresponding values. 在这个上下文中,"值"类似于数据库中键值对中的值。它代表输入项的实际内容或表示。一旦模型确定哪些键(以及哪些输入部分)与当前焦点项最相关,它就会检索相应的值。
3.4.2 Implementing a compact self-attention Python class 3.4.2 实现一个紧凑的自注意力力 Python 类
In the previous sections, we have gone through a lot of steps to compute the self-attention outputs. This was mainly done for illustration purposes so we could go through one step at a time. In practice, with the LLM implementation in the next chapter in mind, it is helpful to organize this code into a Python class as follows: 在前面的部分中,我们已经经历了很多步骤来计算自注意力输出。这主要是为了说明目的,让我们一步一步地进行。在实践中,考虑到下一章中的LLM实现,将这些代码组织成一个 Python 类会很有帮助,如下所示:
Listing 3.1 A compact self-attention class 3.1 节 一个紧凑的自注意力类
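A sketch of what this listing can look like, pieced together from the description that follows (the attribute names W_query, W_key, and W_value match that description; the seed in the usage example is an assumption):

import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T  # unnormalized attention scores (omega)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec

We can then use this class on the inputs tensor from earlier, for example:

torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))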
In this PyTorch code, SelfAttention_v1 is a class derived from nn.Module, which is a fundamental building block of PyTorch models that provides the necessary functionalities for creating and managing model layers.
The __init__ method initializes trainable weight matrices (W_query, W_key, and W_value) for queries, keys, and values, each transforming the input dimension d_in to an output dimension d_out.
During the forward pass, using the forward method, we compute the attention scores (attn_scores) by multiplying queries and keys, normalizing these scores using softmax. Finally, we create a context vector by weighting the values with these normalized attention scores. 在正向传播过程中,使用正向方法,我们通过将查询和键相乘来计算注意力得分(attn_scores),并使用 softmax 对这些得分进行归一化。最后,我们使用这些归一化的注意力得分加权得到上下文向量。
Since inputs contains six embedding vectors, this results in a matrix storing the six context vectors:
tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        ...,
        grad_fn=)
As a quick check, notice how the second row ([0.3061, 0.8210]) matches the contents of context_vec_2 in the previous section.
Figure 3.18 summarizes the self-attention mechanism we just implemented. 图 3.18 总结了我们刚刚实现的自注意力机制。
Figure 3.18 In self-attention, we transform the input vectors in the input matrix X with the three weight matrices, Wq, Wk, and Wv. Then, we compute the attention weight matrix based on the resulting queries (Q) and keys (K). Using the attention weights and values (V), we then compute the context vectors (Z). (For visual clarity, we focus on a single input text with n tokens in this figure, not a batch of multiple inputs. Consequently, the 3D input tensor is simplified to a 2D matrix in this context. This approach allows for a more straightforward visualization and understanding of the processes involved. Also, for consistency with later figures, the values in the attention matrix do not depict the real attention weights.)
As shown in Figure 3.18, self-attention involves the trainable weight matrices W_q, W_k, and W_v. These matrices transform input data into queries, keys, and values, which are crucial components of the attention mechanism. As the model is exposed to more data during training, it adjusts these trainable weights, as we will see in upcoming chapters.
We can improve the SelfAttention_v1 implementation further by utilizing PyTorch's nn.Linear layers, which effectively perform matrix multiplication when the bias units are disabled. Additionally, a significant advantage of using nn.Linear instead of manually implementing nn.Parameter(torch.rand(...)) is that nn.Linear has an optimized weight initialization scheme, contributing to more stable and effective model training.
Listing 3.2 A self-attention class using PyTorch's Linear layers 使用 PyTorch 的 Linear 层的自注意力类
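A sketch of this listing under the same assumptions as Listing 3.1, with the nn.Parameter matrices replaced by nn.Linear layers (the qkv_bias argument and the seed 789 in the usage example are assumptions):

class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))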
Note that SelfAttention_v1 and SelfAttention_v2 give different outputs because they use different initial weights for the weight matrices since nn.Linear uses a more sophisticated weight initialization scheme.
Note that nn.Linear in SelfAttention_v2 uses a different weight initialization scheme than the nn.Parameter(torch.rand(d_in, d_out)) used in SelfAttention_v1, which causes both mechanisms to produce different results. To check that both implementations, SelfAttention_v1 and SelfAttention_v2, are otherwise similar, we can transfer the weight matrices from a SelfAttention_v2 object to a SelfAttention_v1, such that both objects then produce the same results.
Your task is to correctly assign the weights from an instance of SelfAttention_v2 to an instance of SelfAttention_v1. To do this, you need to understand the relationship between the weights in both versions. (Hint: nn.Linear stores the weight matrix in a transposed form.) After the assignment, you should observe that both instances produce the same outputs.
In the next section, we will make enhancements to the self-attention mechanism, focusing specifically on incorporating causal and multi-head elements. The causal aspect involves modifying the attention mechanism to prevent the model from accessing future information in the sequence, which is crucial for tasks like language modeling, where each word prediction should only depend on previous words. 在下一部分中,我们将对自注意机制进行改进,特别关注因果关系和多头元素的融合。因果方面涉及到修改注意力机制,防止模型访问序列中的未来信息,这对于像语言建模这样的任务非常关键,因为每个词的预测只应该依赖于之前的词。
The multi-head component involves splitting the attention mechanism into multiple "heads." Each head learns different aspects of the data, allowing the model to simultaneously attend to information from different representation subspaces at different positions. This improves the model's performance in complex tasks. 多头组件包括将注意力机制拆分为多个"头"。每个头学习数据的不同方面,允许模型同时关注不同位置的不同表示子空间中的信息。这提高了模型在复杂任务中的性能。
3.5 Hiding future words with causal attention 3.5 使用因果注意力隐藏未来词
In this section, we modify the standard self-attention mechanism to create a causal attention mechanism, which is essential for developing an LLM in the subsequent chapters. 在这个部分,我们修改了标准的自注意力机制来创造因果注意力机制,这对于在后续章节中开发一个LLM是至关重要的。
Causal attention, also known as masked attention, is a specialized form of self-attention. It restricts a model to only consider previous and current inputs in a sequence when processing any given token. This is in contrast to the standard self-attention mechanism, which allows access to the entire input sequence at once. 因果注意力,也称为遮罩注意力,是一种专门的自注意力形式。它限制模型在处理任何给定令牌时仅考虑序列中的先前和当前输入。这与标准自注意力机制形成对比,后者允许在一次性访问整个输入序列。
Consequently, when computing attention scores, the causal attention mechanism ensures that the model only factors in tokens that occur at or before the current token in the sequence. 因此,在计算注意力得分时,因果注意力机制确保模型仅考虑在当前令牌之前或在当前令牌处出现的令牌。
To achieve this in GPT-like LLMs, for each token processed, we mask out the future tokens, which come after the current token in the input text, as illustrated in Figure 3.19. 要在类似 GPT 的LLMs中实现这一点,对于每个被处理的标记,我们都会屏蔽在输入文本中该标记之后出现的未来标记,如图 3.19 所示。
Figure 3.19 In causal attention, we mask out the attention weights above the diagonal such that for a given input, the LLM can't access future tokens when computing the context vectors using the attention weights. For example, for the word "journey" in the second row, we only keep the attention weights for the words before ("Your") and in the current position ("journey"). 图 3.19 在因果注意力中,我们屏蔽了对角线以上的注意力权重,这样对于给定的输入,LLM在使用注意力权重计算上下文向量时就无法访问未来的令牌。例如,对于第二行中的单词"journey",我们只保留之前("Your")和当前位置("journey")单词的注意力权重。
As illustrated in Figure 3.19, we mask out the attention weights above the diagonal, and we normalize the non-masked attention weights, such that the attention weights sum to 1 in each row. In the next section, we will implement this masking and normalization procedure in code.
3.5.1 Applying a causal attention mask 3.5.1 应用因果注意力遮罩
In this section, we implement the causal attention mask in code. We start with the procedure summarized in Figure 3.20. 在本节中,我们在代码中实现了因果注意力掩码。我们从图 3.20 中总结的过程开始。
Figure 3.20 One way to obtain the masked attention weight matrix in causal attention is to apply the softmax function to the attention scores, zeroing out the elements above the diagonal and normalizing the resulting matrix. 图 3.20 在因果注意力中获得遮蔽注意力权重矩阵的一种方法是将注意力分数应用 softmax 函数,将对角线上方的元素归零,然后对生成的矩阵进行正则化。
To implement the steps to apply a causal attention mask to obtain the masked attention weights as summarized in Figure 3.20, let's work with the attention scores and weights from the previous section to code the causal attention mechanism. 实现将因果注意力遮罩应用于获取掩蔽注意力权重的步骤,如图 3.20 所示,让我们使用上一节中的注意力得分和权重来编码因果注意力机制。
In the first step illustrated in Figure 3.20, we compute the attention weights using the softmax function as we have done in previous sections: 如图 3.20 所示的第一步中,我们使用之前章节中介绍的 softmax 函数计算注意力权重。
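A sketch of this computation, reusing the sa_v2 object defined above (the variable names are assumptions):

queries = sa_v2.W_query(inputs) #A
keys = sa_v2.W_key(inputs)
attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)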
#A Reuse the query and key weight matrices of the SelfAttention_v2 object from the previous section for convenience #重复使用前一节中 SelfAttention_v2 对象的查询和键权重矩阵,以方便起见
This results in the following attention weights: 这导致了以下注意力权重:
We can implement step 2 in Figure 3.20 using PyTorch's tril function to create a mask where the values above the diagonal are zero: 我们可以使用 PyTorch 的 tril 函数在图 3.20 中实现步骤 2,以创建一个掩码,其中对角线上方的值为零:
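A sketch of this masking step (mask_simple and masked_simple are assumed variable names):

context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)
masked_simple = attn_weights * mask_simple
print(masked_simple)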
As we can see, the elements above the diagonal are successfully zeroed out: 正如我们所见,对角线以上的元素已成功归零:
The third step in Figure 3.20 is to renormalize the attention weights to sum up to 1 again in each row. We can achieve this by dividing each element in each row by the sum in each row: 图 3.20 中的第三步是将注意力权重重新归一化,使每一行的和重新等于 1。我们可以通过将每一行的每个元素除以该行的和来实现这一点:
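A sketch of this renormalization step:

row_sums = masked_simple.sum(dim=-1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)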
The result is an attention weight matrix where the attention weights above the diagonal are zeroed out and where the rows sum to 1 : 结果是一个注意力权重矩阵,其中对角线以上的注意力权重被置零,且每一行的权重和为 1
INFORMATION LEAKAGE 信息泄露
When we apply a mask and then renormalize the attention weights, it might initially appear that information from future tokens (which we intend to mask) could still influence the current token because their values are part of the softmax calculation. However, the key insight is that when we renormalize the attention weights after masking, what we're essentially doing is recalculating the softmax over a smaller subset (since masked positions don't contribute to the softmax value). 当我们应用一个掩码并重新规范化注意力权重时,看起来信息从未来的令牌(我们打算屏蔽的令牌)可能仍然会影响当前的令牌,因为它们的值是 softmax 计算的一部分。但关键洞见是,当我们在屏蔽后重新规范化注意力权重时,我们实际上就是重新计算一个更小子集(因为被屏蔽的位置不会对 softmax 值产生贡献)的 softmax。
The mathematical elegance of softmax is that despite initially including all positions in the denominator, after masking and renormalizing, the effect of the masked positions is nullified - they don't contribute to the softmax score in any meaningful way. 软最大熵的数学优雅在于,尽管初始将所有位置包含在分母中,但在遮蔽和重新标准化之后,被遮蔽位置的影响被消除了 - 它们不会以任何有意义的方式对软最大熵得分产生贡献。
In simpler terms, after masking and renormalization, the distribution of attention weights is as if it was calculated only among the unmasked positions to begin with. This ensures there's no information leakage from future (or otherwise masked) tokens as we intended. 简单来说,在掩码和重新归一化之后,注意力权重的分布就好像最初只计算了非掩码位置一样。这确保了我们预期的不会有任何来自未来(或其他被掩码)标记的信息泄露。
While we could be technically done with implementing causal attention at this point, we can take advantage of a mathematical property of the softmax function and implement the computation of the masked attention weights more efficiently in fewer steps, as shown in Figure 3.21. 虽然我们在实现因果注意力方面可以算是技术上完成了,但我们可以利用 softmax 函数的一个数学特性,在更少的步骤中更有效地实现计算遮罩注意力权重,如图 3.21 所示。
Figure 3.21 A more efficient way to obtain the masked attention weight matrix in causal attention is to mask the attention scores with negative infinity values before applying the softmax function. 图 3.21 在因果关注中,通过在 softmax 函数应用之前用负无穷大值掩蔽注意力分数的方式可以得到更高效的掩蔽注意力权重矩阵。
The softmax function converts its inputs into a probability distribution. When negative infinity values (-∞) are present in a row, the softmax function treats them as zero probability. (Mathematically, this is because e^(-∞) approaches 0.)
We can implement this more efficient masking "trick" by creating a mask with 1 's above the diagonal and then replacing these 1 's with negative infinity (-inf) values: 我们可以通过创建一个对角线以上全为 1 的掩码,然后将这些 1 替换为负无穷(-inf)值来实现更高效的掩码"技巧":
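A sketch of this more efficient variant:

mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)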
As we can see based on the output, the values in each row sum to 1 , and no further normalization is necessary: 正如我们从输出结果中可以看到,每一行的值之和为 1,因此无需进一步标准化:
We could now use the modified attention weights to compute the context vectors via context_vec = attn_weights @ values, as in section 3.4. However, in the next section, we first cover another minor tweak to the causal attention mechanism that is useful for reducing overfitting when training LLMs. 现在,我们可以使用修改后的注意力权重来通过 context_vec = attn_weights @ values 计算上下文向量,如第 3.4 节所述。然而,在下一节中,我们首先讨论了另一个对训练LLMs时减少过拟合很有用的因果注意力机制的微小调整。
3.5.2 Masking additional attention weights with dropout
Dropout in deep learning is a technique where randomly selected hidden layer units are ignored during training, effectively "dropping" them out. This method helps prevent overfitting by ensuring that a model does not become overly reliant on any specific set of hidden layer units. It's important to emphasize that dropout is only used during training and is disabled afterward.
In the transformer architecture, including models like GPT, dropout in the attention mechanism is typically applied in two specific areas: after calculating the attention scores or after applying the attention weights to the value vectors. 在变换器架构中,包括 GPT 等模型,dropout 通常应用于注意力机制的两个特定区域:在计算注意力分数后或在将注意力权重应用于值向量后。
Here, we will apply the dropout mask after computing the attention weights, as illustrated in Figure 3.22, because it's the more common variant in practice. 在这里,我们将在计算注意力权重之后应用 dropout 掩码,如图 3.22 所示,因为这是在实践中更常见的变体。
Figure 3.22 Using the causal attention mask (upper left), we apply an additional dropout mask (upper right) to zero out additional attention weights to reduce overfitting during training. 图 3.22 使用因果注意力遮罩(左上)后,我们应用额外的 dropout 遮罩(右上)来将一些注意力权重置零,以在训练过程中减少过拟合。
In the following code example, we use a dropout rate of 50%, which means masking out half of the attention weights. (When we train the GPT model in later chapters, we will use a lower dropout rate, such as 0.1 or 0.2.)
In the following code, we apply PyTorch's dropout implementation first to a 6 × 6 tensor consisting of ones for illustration purposes:
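A sketch of this illustration (the seed is an assumption so that the example is reproducible):

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) #A
example = torch.ones(6, 6) #B
print(dropout(example))
#A We choose a dropout rate of 50%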
#B Here, we create a matrix of 1 's 这里,我们创建了一个由 1 组成的矩阵
As we can see, approximately half of the values are zeroed out: 正如我们所见,大约一半的值被设为零:
When applying dropout to an attention weight matrix with a rate of 50%, half of the elements in the matrix are randomly set to zero. To compensate for the reduction in active elements, the values of the remaining elements in the matrix are scaled up by a factor of 1/0.5 = 2. This scaling is crucial to maintain the overall balance of the attention weights, ensuring that the average influence of the attention mechanism remains consistent during both the training and inference phases.
Now, let's apply dropout to the attention weight matrix itself: 现在,让我们将 dropout 应用于注意力权重矩阵本身:
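A sketch of this step, reusing the dropout layer and attention weights from above:

torch.manual_seed(123)
print(dropout(attn_weights))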
The resulting attention weight matrix now has additional elements zeroed out and the remaining ones rescaled: 由此得出的注意力权重矩阵现在有更多的元素被设置为零,剩余的元素则被重新缩放
Note that the resulting dropout outputs may look different depending on your operating system; you can read more about this inconsistency here on the PyTorch issue tracker at https://github.com/pytorch/pytorch/issues/121595. 请注意,由于操作系统的差异,生成的 dropout 输出可能会有所不同;您可以在 PyTorch issue 跟踪器上 https://github.com/pytorch/pytorch/issues/121595 处阅读更多关于这一不一致性的信息。
Having gained an understanding of causal attention and dropout masking, we will develop a concise Python class in the following section. This class is designed to facilitate the efficient application of these two techniques. 在了解因果注意力和 dropout 遮蔽的基础上,我们将在以下部分开发一个简洁的 Python 类。该类旨在促进这两种技术的高效应用。
3.5.3 Implementing a compact causal attention class 实现紧凑因果注意力类
In this section, we will now incorporate the causal attention and dropout modifications into the SelfAttention Python class we developed in section 3.4. This class will then serve as a template for developing multi-head attention in the upcoming section, which is the final attention class we implement in this chapter. 在本节中,我们将把因果注意力和 dropout 修改整合到第 3.4 节中开发的 SelfAttention Python 类中。这个类将作为未来章节中开发多头注意力的模板,这是本章实现的最后一个注意力类。
But before we begin, one more thing is to ensure that the code can handle batches consisting of more than one input so that the CausalAttention class supports the batch outputs produced by the data loader we implemented in chapter 2. 但是在我们开始之前,还有一件事需要确保,就是代码能够处理由多个输入组成的批次,以便 CausalAttention 类能够支持我们在第 2 章中实现的数据加载器生成的批量输出。
For simplicity, to simulate such batch inputs, we duplicate the input text example: 为了简单起见,为了模拟这种批量输入,我们复制输入文本示例:
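A sketch of this duplication step (the variable name batch is an assumption):

batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape) #A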
#A 2 inputs with 6 tokens each, and each token has embedding dimension 3 #一个含有 6 个标记的两个输入,每个标记的嵌入维度为 3
This results in a 3D tensor consisting of 2 input texts with 6 tokens each, where each token is a 3-dimensional embedding vector: 这会导致一个 3D 张量,由 2 个包含 6 个标记的输入文本组成,其中每个标记都是一个 3 维嵌入向量
torch.Size([2, 6, 3])
The following CausalAttention class is similar to the SelfAttention class we implemented earlier, except that we now added the dropout and causal mask components as highlighted in the following code: 以下 CausalAttention 类与我们之前实现的 SelfAttention 类类似,不同之处在于我们现在添加了 dropout 和因果掩码组件,如以下代码所示:
Listing 3.3 A compact causal attention class 表示 3.3 一个简洁的因果注意力类
class CausalAttention(nn.Module): 类 CausalAttention(nn.Module):
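    # The remainder of this listing, sketched to be consistent with the
    # annotations #A-#D below and with the usage later in this section:
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) #A
        self.register_buffer('mask',
            torch.triu(torch.ones(context_length, context_length), diagonal=1)) #B

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.transpose(1, 2) #C
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) #D
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context_vec = attn_weights @ values
        return context_vec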
#A Compared to the previous SelfAttention_v1 class, we added a dropout layer 与前一个 SelfAttention_v1 类相比,我们添加了一个 dropout 层
#B The register_buffer call is also a new addition (more information is provided in the following text) #B 注册缓存(register_buffer)调用也是一个新添加项(更多信息在后续文本中提供)
#C We transpose dimensions 1 and 2, keeping the batch dimension at the first position (0) #C 我们转置维度 1 和 2,保持批次维度位于第一个位置(0)
#D In PyTorch, operations with a trailing underscore are performed in-place, avoiding unnecessary memory copies 在 PyTorch 中,带有尾部下划线的操作是就地执行的,避免了不必要的内存拷贝
While all added code lines should be familiar from previous sections, we now added a self.register_buffer() call in the __init__ method. The use of register_buffer in PyTorch is not strictly necessary for all use cases but offers several advantages here. For instance, when we use the CausalAttention class in our LLM, buffers are automatically moved to the appropriate device (CPU or GPU) along with our model, which will be relevant when training the LLM in future chapters. This means we don't need to manually ensure these tensors are on the same device as the model parameters, avoiding device mismatch errors.
We can use the CausalAttention class as follows, similar to SelfAttention previously: 我们可以像之前的自注意力一样使用因果注意力类:
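A usage sketch, assuming the batch tensor from above and the same d_in=3 and d_out=2 settings as in the earlier single-head examples:

torch.manual_seed(123)
context_length = batch.shape[1]
d_in, d_out = 3, 2
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs = ca(batch)
print("context_vecs.shape:", context_vecs.shape)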
The resulting context vector is a 3D tensor where each token is now represented by a 2D embedding:
context_vecs.shape: torch.Size([2, 6, 2])
Figure 3.23 provides a mental model that summarizes what we have accomplished so far. 图 3.23 提供了一个总结我们到目前为止所完成工作的心智模型。
Figure 3.23 A mental model summarizing the four different attention modules we are coding in this chapter. We began with a simplified attention mechanism, added trainable weights, and then added a causal attention mask. In the remainder of this chapter, we will extend the causal attention mechanism and code multi-head attention, which is the final module we will use in the LLM implementation in the next chapter.
As illustrated in Figure 3.23, in this section, we focused on the concept and implementation of causal attention in neural networks. In the next section, we will expand on this concept and implement a multi-head attention module that implements several of such causal attention mechanisms in parallel. 如图 3.23 所示,在本节中,我们重点关注了因果关注在神经网络中的概念和实现。在下一节中,我们将扩展这一概念,并实现一个多头关注模块,该模块并行实现了几种此类因果关注机制。
3.6 Extending single-head attention to multi-head attention 将单头注意力扩展到多头注意力
In this final section of this chapter, we extend the previously implemented causal attention class to multiple heads. This is also called multi-head attention.
The term "multi-head" refers to dividing the attention mechanism into multiple "heads," each operating independently. In this context, a single causal attention module can be considered single-head attention, where there is only one set of attention weights processing the input sequentially. 术语"多头"指的是将注意力机制划分为多个"头",每个"头"都独立运作。在这种情况下,单个因果注意力模块可视为单头注意力,其中只有一组注意力权重按顺序处理输入。
In the following subsections, we will tackle this expansion from causal attention to multi-head attention. The first subsection will intuitively build a multi-head attention module by stacking multiple CausalAttention modules for illustration purposes. The second subsection will then implement the same multi-head attention module in a more complicated but computationally more efficient way.
In practical terms, implementing multi-head attention involves creating multiple instances of the self-attention mechanism (depicted earlier in Figure 3.18 in section 3.4.1), each with its own weights, and then combining their outputs. Using multiple instances of the self-attention mechanism can be computationally intensive, but it's crucial for the kind of complex pattern recognition that models like transformer-based LLMs are known for.
Figure 3.24 illustrates the structure of a multi-head attention module, which consists of multiple single-head attention modules, as previously depicted in Figure 3.18, stacked on top of each other. 图 3.24 说明了多头注意力模块的结构,其由多个如图 3.18 所示的单头注意力模块堆叠而成。
Figure 3.24 The multi-head attention module in this figure depicts two single-head attention modules stacked on top of each other. So, instead of using a single matrix Wv for computing the value matrices, in a multi-head attention module with two heads, we now have two value weight matrices: Wv1 and Wv2. The same applies to the other weight matrices, Wq and Wk. We obtain two sets of context vectors Z1 and Z2 that we can combine into a single context vector matrix Z.
As mentioned before, the main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections - the results of multiplying the input data (like the query, key, and value vectors in attention mechanisms) by a weight matrix. 如前所述,多头注意力的主要思想是多次(并行)运行注意力机制,采用不同的、已学习的线性投影 - 将输入数据(如注意力机制中的查询、键和值向量)乘以权重矩阵的结果。
In code, we can achieve this by implementing a simple MultiHeadAttentionWrapper class that stacks multiple instances of our previously implemented CausalAttention module: 在代码中,我们可以通过实现一个简单的 MultiHeadAttentionWrapper 类来实现这一点,该类堆叠了我们之前实现的 CausalAttention 模块的多个实例:
Listing 3.4 A wrapper class to implement multi-head attention 列表 3.4 实现多头注意力的包装类
class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length,
                 dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)
For example, if we use this MultiHeadAttentionWrapper class with two attention heads (via num_heads=2) and a CausalAttention output dimension of d_out=2, we obtain 4-dimensional context vectors (d_out*num_heads=4), as illustrated in Figure 3.25.
Figure 3.25 Using the MultiHeadAttentionWrapper, we specified the number of attention heads (num_heads). If we set num_heads=2, as shown in this figure, we obtain a tensor with two sets of context vector matrices. In each context vector matrix, the rows represent the context vectors corresponding to the tokens, and the columns correspond to the embedding dimension specified via d_out. We concatenate these context vector matrices along the column dimension. Since we have 2 attention heads and an embedding dimension of 2, the final embedding dimension is 2 × 2 = 4.
To illustrate Figure 3.25 further with a concrete example, we can use the MultiHeadAttentionWrapper class similar to the CausalAttention class before: 使用与 CausalAttention 类类似的 MultiHeadAttentionWrapper 类来进一步说明图 3.25 中的一个具体例子:
torch.manual_seed(123)
context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
This results in the following tensor representing the context vectors: 这导致以下张量代表上下文向量:
The first dimension of the resulting context_vecs tensor is 2 since we have two input texts (the input texts are duplicated, which is why the context vectors are exactly the same for those). The second dimension refers to the 6 tokens in each input. The third dimension refers to the 4-dimensional embedding of each token.
Change the input arguments for the MultiHeadAttentionWrapper(..., num_heads=2) call such that the output context vectors are 2-dimensional instead of 4-dimensional while keeping the setting num_heads=2. Hint: You don't have to modify the class implementation; you just have to change one of the other input arguments.
In this section, we implemented a MultiHeadAttentionWrapper that combined multiple single-head attention modules. However, note that these are processed sequentially via [head(x) for head in self.heads] in the forward method. We can improve this implementation by processing the heads in parallel. One way to achieve this is by computing the outputs for all attention heads simultaneously via matrix multiplication, as we will explore in the next section. 在这一节中,我们实现了一个 MultiHeadAttentionWrapper,它将多个单头注意力模块结合在一起。然而,请注意这些是通过 forward 方法中的[head(x) for head in self.heads]顺序处理的。我们可以通过并行处理头部来改进这个实现。实现这一目标的一种方法是通过矩阵乘法同时计算所有注意力头的输出,我们将在下一节中探讨这一点。
3.6.2 Implementing multi-head attention with weight splits 3.6.2 实施权重拆分的多头注意力机制
In the previous section, we created a MultiHeadAttentionWrapper to implement multihead attention by stacking multiple single-head attention modules. This was done by instantiating and combining several CausalAttention objects. 在上一节中,我们创建了一个 MultiHeadAttentionWrapper 来通过堆叠多个单头注意力模块来实现多头注意力。这是通过实例化并组合多个 CausalAttention 对象来完成的。
Instead of maintaining two separate classes, MultiHeadAttentionWrapper and CausalAttention, we can combine both of these concepts into a single MultiHeadAttention class. Also, in addition to just merging the MultiHeadAttentionWrapper with the CausalAttention code, we will make some other modifications to implement multi-head attention more efficiently. 我们可以将 MultiHeadAttentionWrapper 和 CausalAttention 两个独立的类合并为一个 MultiHeadAttention 类。此外,除了将 MultiHeadAttentionWrapper 与 CausalAttention 代码合并之外,我们还会做一些其他修改,以更有效地实现多头注意力机制。
In the MultiHeadAttentionWrapper, multiple heads are implemented by creating a list of CausalAttention objects (self.heads), each representing a separate attention head. The CausalAttention class independently performs the attention mechanism, and the results from each head are concatenated. In contrast, the following MultiHeadAttention class integrates the multi-head functionality within a single class. It splits the input into multiple heads by reshaping the projected query, key, and value tensors and then combines the results from these heads after computing attention. 在 MultiHeadAttentionWrapper 中,多个头部通过创建一个 CausalAttention 对象列表(self.heads)来实现,每个对象表示一个独立的注意力头部。CausalAttention 类独立执行注意力机制,并将每个头部的结果进行拼接。相比之下,下面的 MultiHeadAttention 类将多头功能集成在单个类中。它通过重塑投影的查询、键和值张量将输入分成多个头部,然后在计算注意力之后将这些头部的结果组合在一起。
Let's take a look at the MultiHeadAttention class before we discuss it further: 让我们在进一步讨论之前看一下 MultiHeadAttention 类:
Listing 3.5 An efficient multi-head attention class 列出 3.5 一个高效的多头注意力类
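A sketch of this class, reconstructed to match the interface used throughout this chapter and the #D-#H annotations below (not a verbatim listing):

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)        #D
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)    #D
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)  #D
        keys = keys.transpose(1, 2)        #E
        queries = queries.transpose(1, 2)  #E
        values = values.transpose(1, 2)    #E
        attn_scores = queries @ keys.transpose(2, 3)             #F
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]   #G
        attn_scores.masked_fill_(mask_bool, -torch.inf)          #H
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context_vec = (attn_weights @ values).transpose(1, 2)
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)
        return context_vec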
#D We implicitly split the matrix by adding a num_heads dimension. Then we unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim) 我们通过添加一个 num_heads 维度来隐式地分割矩阵。然后我们展开最后一个维度:(b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
#E Transpose from shape (b, num_tokens, num_heads, head_dim) to (b, num_heads, num_tokens, head_dim) 从形状(b, num_tokens, num_heads, head_dim)转置到(b, num_heads, num_tokens, head_dim)
#F Compute dot product for each head 计算每个头的点积
#G Mask truncated to the number of tokens #G 截断到令牌数的口罩
#H Use the mask to fill attention scores #H 使用遮罩来填充注意力分数
Even though the reshaping (.view) and transposing (.transpose) of tensors inside the MultiHeadAttention class looks very complicated, mathematically, the MultiHeadAttention class implements the same concept as the MultiHeadAttentionWrapper earlier. 尽管 MultiHeadAttention 类内的张量重塑(.view)和转置(.transpose)看起来非常复杂,但从数学上讲,MultiHeadAttention 类实现了与之前的 MultiHeadAttentionWrapper 相同的概念。
On a big-picture level, in the previous MultiHeadAttentionWrapper, we stacked multiple single-head attention layers that we combined into a multi-head attention layer. The MultiHeadAttention class takes an integrated approach. It starts with a multi-head layer and then internally splits this layer into individual attention heads, as illustrated in Figure 3.26. 从宏观角度来看,在之前的 MultiHeadAttentionWrapper 中,我们堆叠了多个单头注意力层,将它们组合成一个多头注意力层。MultiHeadAttention 类采取了一种集成的方法。它从一个多头层开始,然后在内部将该层拆分为单独的注意力头,如图 3.26 所示。
Figure 3.26 In the MultiHeadAttentionWrapper class with two attention heads, we initialized two weight matrices Wq1 and Wq2 and computed two query matrices Q1 and Q2, which requires two matrix multiplications, as illustrated at the top of this figure. In the MultiHeadAttention class, we initialize one larger weight matrix Wq, only perform one matrix multiplication with the inputs to obtain a query matrix Q, and then split the query matrix into Q1 and Q2, as shown at the bottom of this figure. We do the same for the keys and values, which are not shown to reduce visual clutter.
The splitting of the query, key, and value tensors, as depicted in Figure 3.26, is achieved through tensor reshaping and transposing operations using PyTorch's .view and .transpose methods. The input is first transformed (via linear layers for queries, keys, and values) and then reshaped to represent multiple heads. 如图 3.26 所示,通过使用 PyTorch 的.view 和.transpose 方法进行张量重塑和转置操作,实现了查询、键和值张量的分割。输入首先经过线性层变换(针对查询、键和值),然后重塑以表示多个注意力头。
The key operation is to split the d_out dimension into num_heads and head_dim, where head_dim = d_out / num_heads. This splitting is then achieved using the .view method: a tensor of dimensions ( b , num_tokens, d_out) is reshaped to dimension ( b , num_tokens, num_heads, head_dim). 关键操作是将 d_out 维度拆分为 num_heads 和 head_dim,其中 head_dim = d_out / num_heads。这种拆分通过.view 方法实现:维度为(b,num_tokens,d_out)的张量被重塑为维度(b,num_tokens,num_heads,head_dim)。
The tensors are then transposed to bring the num_heads dimension before the num_tokens dimension, resulting in a shape of (b, num_heads, num_tokens, head_dim). This transposition is crucial for correctly aligning the queries, keys, and values across the different heads and performing batched matrix multiplications efficiently. 然后将张量转置,使 num_heads 维度位于 num_tokens 维度之前,得到形状为(b, num_heads, num_tokens, head_dim)。这种转置对于正确对齐不同头部的查询、键和值,并有效地执行批量矩阵乘法至关重要。
To illustrate this batched matrix multiplication, suppose we have the following example tensor: 为了说明这种批次矩阵乘法,假设我们有以下示例张量:
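For illustration, a hypothetical tensor of this shape can be created with random values:

torch.manual_seed(123)
a = torch.rand(1, 2, 3, 4)  #A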
#A The shape of this tensor is (b, num_heads, num_tokens, head_dim)
Now, we perform a batched matrix multiplication between the tensor itself and a view of the tensor where we transposed the last two dimensions, num_tokens and head_dim: 现在,我们对张量本身和张量的视图进行批处理矩阵乘法,其中我们转置了最后两个维度,num_tokens 和 head_dim:
print(a @ a.transpose(2, 3))
The result is as follows: 结果如下:
In this case, the matrix multiplication implementation in PyTorch handles the 4-dimensional input tensor so that the matrix multiplication is carried out between the last two dimensions (num_tokens, head_dim) and then repeated for the individual heads.
For instance, the batched matrix multiplication above is a more compact way of computing the matrix multiplication for each head separately, as shown below:
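A sketch of the equivalent per-head computation, using the example tensor a from above:

first_head = a[0, 0, :, :]
first_res = first_head @ first_head.T
print("First head:\n", first_res)

second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("Second head:\n", second_res)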
The results are exactly the same results that we obtained when using the batched matrix multiplication print (a @ a.transpose (2, 3)) earlier: 结果正好是我们之前使用批量矩阵乘法 print(a @ a.transpose(2, 3))时得到的结果:
Continuing with MultiHeadAttention, after computing the attention weights and context vectors, the context vectors from all heads are transposed back to the shape (b, num_tokens, num_heads, head_dim). These vectors are then reshaped (flattened) into the shape (b, num_tokens, d_out), effectively combining the outputs from all heads. 继续对 MultiHeadAttention 进行操作,在计算完注意力权重和上下文向量后,来自所有头的上下文向量被转置回形状(b, num_tokens, num_heads, head_dim)。这些向量随后被重塑(展平)为形状(b, num_tokens, d_out),从而有效地将所有头的输出组合在一起。
Additionally, we added a so-called output projection layer (self.out_proj) to MultiHeadAttention after combining the heads, which is not present in the CausalAttention class. This output projection layer is not strictly necessary (see the References section in Appendix B for more details), but it is commonly used in many LLM architectures, which is why we added it here for completeness. 此外,我们在将头部组合后添加了一个所谓的输出投影层(self.out_proj)到 MultiHeadAttention 中,这在 CausalAttention 类中不存在。这个输出投影层并不是绝对必要的(更多细节请参见附录 B 中的参考资料部分),但它在许多LLM架构中很常用,这就是我们在这里添加它的原因。
Even though the MultiHeadAttention class looks more complicated than the MultiHeadAttentionWrapper due to the additional reshaping and transposition of tensors, it is more efficient. The reason is that we only need one matrix multiplication to compute the keys, for instance, keys = self.W_key(x) (the same is true for the queries and values). In the MultiHeadAttentionWrapper, we needed to repeat this matrix multiplication, which is computationally one of the most expensive steps, for each attention head.
The MultiHeadAttention class can be used similar to the SelfAttention and CausalAttention classes we implemented earlier: 多头注意力(MultiHeadAttention)类可以像我们之前实现的自注意力(SelfAttention)和因果注意力(CausalAttention)类一样使用:
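For instance, reusing the batch tensor and the settings from the earlier examples, a usage sketch could look as follows:

torch.manual_seed(123)
batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)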
In this section, we implemented the MultiHeadAttention class that we will use in the upcoming sections when implementing and training the LLM itself. Note that while the code is fully functional, we used relatively small embedding sizes and numbers of attention heads to keep the outputs readable. 在本节中,我们实现了 MultiHeadAttention 类,我们将在接下来的章节中实现和训练 LLM 时使用它。请注意,虽然代码功能完全正常,但我们使用了相对较小的嵌入尺寸和关注头的数量,以保持输出可读。
For comparison, the smallest GPT-2 model (117 million parameters) has 12 attention heads and a context vector embedding size of 768. The largest GPT-2 model (1.5 billion parameters) has 25 attention heads and a context vector embedding size of 1600 . Note that the embedding sizes of the token inputs and context embeddings are the same in GPT models (d_in = d_out). 对于比较而言,最小的 GPT-2 模型(1.17 亿参数)有 12 个注意力头和 768 的上下文向量嵌入大小。最大的 GPT-2 模型(1.5 十亿参数)有 25 个注意力头和 1600 的上下文向量嵌入大小。需要注意的是,GPT 模型中的令牌输入和上下文嵌入的嵌入大小是相同的(d_in = d_out)。
Using the MultiHeadAttention class, initialize a multi-head attention module that has the same number of attention heads as the smallest GPT-2 model (12 attention heads). Also ensure that you use the respective input and output embedding sizes similar to GPT-2 (768 dimensions). Note that the smallest GPT-2 model supports a context length of 1024 tokens. 使用 MultiHeadAttention 类初始化一个多头注意力模块,它拥有与最小版本的 GPT-2 模型相同的注意力头数(12 个注意力头)。同时确保使用的输入和输出嵌入大小与 GPT-2 相同(768 维)。注意,最小版本的 GPT-2 模型支持 1024 个 token 的上下文长度。
3.7 Summary 3.7 总结
Attention mechanisms transform input elements into enhanced context vector representations that incorporate information about all inputs. 注意力机制将输入元素转换为增强的上下文向量表示,这些表示包含了所有输入的信息。
A self-attention mechanism computes the context vector representation as a weighted sum over the inputs. 自注意力机制将上下文向量表示计算为输入的加权和。
In a simplified attention mechanism, the attention weights are computed via dot products. 在简化的注意力机制中,注意力权重通过点积计算得出。
A dot product is just a concise way of multiplying two vectors elementwise and then summing the products. 点积只是一种简洁的方式来逐元素地乘以两个向量,然后对这些乘积求和。
Matrix multiplications, while not strictly required, help us to implement computations more efficiently and compactly by replacing nested for-loops.
In self-attention mechanisms that are used in LLMs, also called scaled-dot product attention, we include trainable weight matrices to compute intermediate transformations of the inputs: queries, values, and keys. 在用于LLMs的自注意力机制(也称为缩放点积注意力)中,我们包括可训练的权重矩阵来计算输入的中间转换:查询、值和键。
When working with LLMs that read and generate text from left to right, we add a causal attention mask to prevent the LLM from accessing future tokens. 当使用从左到右读取和生成文本的LLMs时,我们添加因果注意力掩码以防止LLM访问未来的 Token。
Next to causal attention masks to zero out attention weights, we can also add a dropout mask to reduce overfitting in LLMs. 除了使用因果注意力掩码来减少注意力权重外,我们还可以添加 dropout 掩码来降低LLMs中的过拟合。
The attention modules in transformer-based LLMs involve multiple instances of causal attention, which is called multi-head attention. 基于变压器的LLMs中的注意力模块涉及多个因果注意力的实例,这称为多头注意力。
We can create a multi-head attention module by stacking multiple instances of causal attention modules. 我们可以通过堆叠多个因果注意力模块来创建一个多头注意力模块。
A more efficient way of creating multi-head attention modules involves batched matrix multiplications. 创建多头注意力模块的更有效方式涉及批量矩阵乘法。
4
Implementing a GPT model from Scratch To Generate Text 从头实现一个 GPT 模型以生成文本
This chapter covers
Coding a GPT-like large language model (LLM) that can be trained to generate human-like text 编码一个类似 GPT 的大型语言模型(LLM),可以训练生成人类文本
Normalizing layer activations to stabilize neural network training 稳定神经网络训练的正则化层激活
Adding shortcut connections in deep neural networks to train models more effectively 在深度神经网络中添加快捷连接以更有效地训练模型
Implementing transformer blocks to create GPT models of various sizes 实现 transformer 模块来创建不同规模的 GPT 模型
Computing the number of parameters and storage requirements of GPT models 计算 GPT 模型的参数数量和存储需求
In the previous chapter, you learned and coded the multi-head attention mechanism, one of the core components of LLMs. In this chapter, we will now code the other building blocks of an LLM and assemble them into a GPT-like model that we will train in the next chapter to generate human-like text, as illustrated in Figure 4.1. 在上一章中,您学习并编写了多头注意力机制,这是LLMs的核心组件之一。在本章中,我们将编写LLM的其他构建块,并将它们组装成一个类似 GPT 的模型,我们将在下一章中对其进行训练以生成类人文本,如图 4.1 所示。
Figure 4.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter focuses on implementing the LLM architecture, which we will train in the next chapter. 图 4.1 编码LLM的三个主要阶段、在通用文本数据集上预训练LLM以及在标注数据集上微调的心智模型。本章重点介绍了LLM架构的实现,我们将在下一章对其进行训练。
The LLM architecture, referenced in Figure 4.1, consists of several building blocks that we will implement throughout this chapter. We will begin with a top-down view of the model architecture in the next section before covering the individual components in more detail. 图 4.1 所引用的LLM体系结构由我们将在本章中实现的几个构建块组成。我们将首先在下一节中从自上而下的角度介绍模型体系结构,然后再详细介绍各个组件。
4.1 Coding an LLM architecture 4.1 编码 LLM 架构
LLMs, such as GPT (which stands for Generative Pretrained Transformer), are large deep neural network architectures designed to generate new text one word (or token) at a time. However, despite their size, the model architecture is less complicated than you might think, since many of its components are repeated, as we will see later. Figure 4.2 provides a top-down view of a GPT-like LLM, with its main components highlighted. LLMs等是设计用于一次生成一个词(或标记)的新文本的大型深层神经网络体系结构。然而,尽管它们的规模很大,但模型体系结构并不像你可能想象的那么复杂,因为它的许多组件都是重复的,正如我们稍后将看到的那样。图 4.2 提供了一个 GPT 式LLM的自上而下的视图,突出显示了其主要组件。
Figure 4.2 A mental model of a GPT model. Next to the embedding layers, it consists of one or more transformer blocks containing the masked multi-head attention module we implemented in the previous chapter. 图 4.2 GPT 模型的心智模型。除了嵌入层外,它由一个或多个包含我们在上一章中实现的掩码多头注意力模块的变压器块组成。
As you can see in Figure 4.2, we have already covered several aspects, such as input tokenization and embedding, as well as the masked multi-head attention module. The focus of this chapter will be on implementing the core structure of the GPT model, including its transformer blocks, which we will then train in the next chapter to generate human-like text. 如图 4.2 所示,我们已经涵盖了诸如输入令牌化和嵌入以及掩码多头注意力模块等多个方面。本章的重点将是实现 GPT 模型的核心结构,包括其变换器块,我们将在下一章中对其进行训练以生成类人文本。
In the previous chapters, we used smaller embedding dimensions for simplicity, ensuring that the concepts and examples could comfortably fit on a single page. Now, in this chapter, we are scaling up to the size of a small GPT-2 model, specifically the smallest version with 124 million parameters, as described in Radford et al.'s paper, "Language Models are Unsupervised Multitask Learners." Note that while the original report mentions 117 million parameters, this was later corrected. 在之前的章节中,我们使用较小的嵌入维度来简化操作,确保概念和示例可以舒适地放在一页之内。现在在这一章中,我们将规模扩大到一个小型 GPT-2 模型的尺寸,具体来说就是 Radford 等人在"Language Models are Unsupervised Multitask Learners"论文中描述的最小版本,拥有 1.24 亿个参数。需要注意的是,虽然原始报告提到有 1.17 亿个参数,但这个数字后来被更正了。
Chapter 6 will focus on loading pretrained weights into our implementation and adapting it for larger GPT-2 models with 345, 762, and 1,542 million parameters. In the context of deep learning and LLMs like GPT, the term "parameters" refers to the trainable weights of the model. These weights are essentially the internal variables of the model that are adjusted and optimized during the training process to minimize a specific loss function. This optimization allows the model to learn from the training data. 第 6 章将重点介绍如何将预训练权重加载到我们的实现中,并调整它以适应更大的 GPT-2 模型,这些模型有 3.45 亿、7.62 亿和 15.42 亿个参数。在深度学习和 GPT 等模型中,"参数"一词指的是模型的可训练权重。这些权重实质上是模型的内部变量,在训练过程中进行调整和优化,以最小化特定的损失函数。这种优化使模型能够从训练数据中学习。
For example, in a neural network layer that is represented by a 2,048 x 2,048-dimensional matrix (or tensor) of weights, each element of this matrix is a parameter. Since there are 2,048 rows and 2,048 columns, the total number of parameters in this layer is 2,048 multiplied by 2,048, which equals 4,194,304 parameters.
GPT-2 VERSUS GPT-3 GPT-2 与 GPT-3
Note that we are focusing on GPT-2 because OpenAI has made the weights of the pretrained model publicly available, which we will load into our implementation in chapter 6. GPT-3 is fundamentally the same in terms of model architecture, except that it is scaled up from 1.5 billion parameters in GPT-2 to 175 billion parameters in GPT-3, and it is trained on more data. As of this writing, the weights for GPT-3 are not publicly available. GPT-2 is also a better choice for learning how to implement LLMs, as it can be run on a single laptop computer, whereas GPT-3 requires a GPU cluster for training and inference. According to Lambda Labs, it would take 355 years to train GPT-3 on a single V100 datacenter GPU, and 665 years on a consumer RTX 8000 GPU. 请注意,我们关注 GPT-2,因为 OpenAI 已经公开了预训练模型的权重,我们将在第 6 章中加载它们。GPT-3 在模型架构方面与 GPT-2 基本相同,不同的是它从 GPT-2 的 15 亿参数扩展到了 1750 亿参数,并且它使用了更多的数据进行训练。 截至目前,GPT-3 的权重还未公开。 与 GPT-3 相比,GPT-2 也是学习如何实现的更好选择,因为它可以在单个笔记本电脑上运行,而 GPT-3 需要 GPU 集群进行训练和推理。 根据 Lambda Labs 的数据,在单个 V100 数据中心 GPU 上训练 GPT-3 需要 355 年,在消费级 RTX 8000 GPU 上需要 665 年。
We specify the configuration of the small GPT-2 model via the following Python dictionary, which we will use in the code examples later: 我们通过以下 Python 字典指定小型 GPT-2 模型的配置,该字典稍后将用于代码示例:
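A sketch of this dictionary, with values consistent with the descriptions that follow (context_length=1024 and n_layers=12 are the commonly cited settings for the 124-million-parameter GPT-2 model):

GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of transformer blocks
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias
}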
In the GPT_CONFIG_124M dictionary, we use concise variable names for clarity and to prevent long lines of code: 在 GPT_CONFIG_124M 字典中,我们使用简洁的变量名称以增加清晰度并防止代码行过长:
"vocab_size" refers to a vocabulary of 50,257 words, as used by the BPE tokenizer from chapter 2. "vocab_size"指的是 50,257 个单词的词汇表,这是在第 2 章中使用 BPE 分词器时采用的。
"context_length" denotes the maximum number of input tokens the model can handle, via the positional embeddings discussed in chapter 2 . "context_length"表示模型可处理的最大输入令牌数量,这是通过第 2 章讨论的位置嵌入实现的。
"emb_dim" represents the embedding size, transforming each token into a 768-dimensional vector. "emb_dim"表示嵌入大小,将每个令牌转换为 768 维向量。
"n_heads" indicates the count of attention heads in the multi-head attention mechanism, as implemented in chapter 3. "n_heads"表示在第 3 章中实现的多头注意力机制中的注意力头的数量。
"n_layers" specifies the number of transformer blocks in the model, which will be elaborated on in upcoming sections. "n_layers"指定模型中变压器块的数量,这将在后续部分中详细解释。
"drop_rate" indicates the intensity of the dropout mechanism (0.1 implies a drop of hidden units) to prevent overfitting, as covered in chapter 3 . "drop_rate"表示 dropout 机制的强度(0.1 意味着 隐藏单元的下降),以防止过拟合,如第 3 章所述。
"qkv_bias" determines whether to include a bias vector in the Linear layers of the multi-head attention for query, key, and value computations. We will initially disable this, following the norms of modern LLMs, but will revisit it in chapter 6 when we load pretrained GPT-2 weights from OpenAI into our model. "qkv_bias"决定是否在多头注意力的查询、键和值计算的线性层中包含偏置向量。我们最初会禁用此功能,遵循现代LLMs的规范,但在第 6 章加载来自 OpenAI 的预训练 GPT-2 权重到我们的模型时会重新审视它。
Using the configuration above, we will start this chapter by implementing a GPT placeholder architecture (DummyGPTModel) in this section, as shown in Figure 4.3. This will provide us with a big-picture view of how everything fits together and what other components we need to code in the upcoming sections to assemble the full GPT model architecture. 根据上述配置,我们将在本章开始时实施 GPT 占位符体系结构(DummyGPTModel),如图 4.3 所示。这将为我们提供一个整体视图,了解如何将所有内容组合在一起,以及在接下来的章节中我们需要编码哪些其他组件来组装完整的 GPT 模型体系结构。
Figure 4.3 A mental model outlining the order in which we code the GPT architecture. In this chapter, we will start with the GPT backbone, a placeholder architecture, before we get to the individual core pieces and eventually assemble them in a transformer block for the final GPT architecture. 图 4.3 概述我们编码 GPT 架构的顺序的心智模型。在本章中,我们将从 GPT 主干(一个占位架构)开始,然后再到各个核心部件,最终将它们组装成一个 transformer 模块以构建最终的 GPT 架构。
The numbered boxes shown in Figure 4.3 illustrate the order in which we tackle the individual concepts required to code the final GPT architecture. We will start with step 1 , a placeholder GPT backbone we call DummyGPTModel: 图 4.3 中显示的编号框说明了我们处理最终 GPT 架构所需的各个概念的顺序。我们将从步骤 1 开始,即我们称之为 DummyGPTModel 的占位符 GPT 主干。
Listing 4.1 A placeholder GPT model architecture class 类占位符 GPT 模型架构
import torch
import torch.nn as nn
class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])]) #A
        self.final_norm = DummyLayerNorm(cfg["emb_dim"]) #B
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

class DummyTransformerBlock(nn.Module): #C
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x): #D
        return x

class DummyLayerNorm(nn.Module): #E
    def __init__(self, normalized_shape, eps=1e-5): #F
        super().__init__()

    def forward(self, x):
        return x
#A Use a placeholder for TransformerBlock #A 使用占位符进行 TransformerBlock
#B Use a placeholder for LayerNorm 用一个占位符替换 LayerNorm
#C A simple placeholder class that will be replaced by a real TransformerBlock later #C 一个简单的占位符类,稍后将被真正的 TransformerBlock 取代
#D This block does nothing and just returns its input. #D 这个块什么也不做,只是返回它的输入。
#E A simple placeholder class that will be replaced by a real LayerNorm later #F The parameters here are just to mimic the LayerNorm interface.
The DummyGPTModel class in this code defines a simplified version of a GPT-like model using PyTorch's neural network module (nn.Module). The model architecture in the DummyGPTModel class consists of token and positional embeddings, dropout, a series of transformer blocks (DummyTransformerBlock), a final layer normalization (DummyLayerNorm), and a linear output layer (out_head). The configuration is passed in via a Python dictionary, for instance, the GPT_CONFIG_124M dictionary we created earlier. 此代码中的 DummyGPTModel 类定义了一个使用 PyTorch 神经网络模块(nn.Module)的类似 GPT 模型的简化版本。DummyGPTModel 类中的模型架构由 token 和位置嵌入、dropout、一系列 transformer 块(DummyTransformerBlock)、最终的层归一化(DummyLayerNorm)和线性输出层(out_head)组成。配置通过 Python 字典传入,例如我们之前创建的 GPT_CONFIG_124M 字典。
The forward method describes the data flow through the model: it computes token and positional embeddings for the input indices, applies dropout, processes the data through the transformer blocks, applies normalization, and finally produces logits with the linear output layer. 前向传播方法描述了模型中的数据流:它计算输入索引的令牌和位置嵌入,应用 dropout,通过变换器块处理数据,应用归一化,最终使用线性输出层产生 logits。
The code above is already functional, as we will see later in this section after we prepare the input data. However, for now, note in the code above that we have used placeholders (DummyLayerNorm and DummyTransformerBlock) for the transformer block and layer normalization, which we will develop in later sections. 上面的代码已经可以运行了,我们将在本节后面看到。不过现在,请注意上面的代码中使用了占位符(DummyLayerNorm 和 DummyTransformerBlock)来代表 transformer 块和层归一化,我们会在后续章节中开发这些部分。
Next, we will prepare the input data and initialize a new GPT model to illustrate its usage. Building on the figures we have seen in chapter 2 , where we coded the tokenizer, Figure 4.4 provides a high-level overview of how data flows in and out of a GPT model. 接下来,我们将准备输入数据并初始化一个新的 GPT 模型,以说明其用法。建立在我们在第 2 章中看到的图形上,在那里我们编码了令牌生成器,图 4.4 提供了数据如何流入和流出 GPT 模型的高级概述。
Figure 4.4 A big-picture overview showing how the input data is tokenized, embedded, and fed to the GPT model. Note that in our DummyGPTModel class coded earlier, the token embedding is handled inside the GPT model. In LLMs, the embedded input token dimension typically matches the output dimension. The output embeddings here represent the context vectors we discussed in chapter 3.
To implement the steps shown in Figure 4.4, we tokenize a batch consisting of two text inputs for the GPT model using the tiktoken tokenizer introduced in chapter 2 : 为了实现图 4.4 所示的步骤,我们使用第 2 章介绍的 tiktoken 分词器对包含两个文本输入的批量数据进行分词,供 GPT 模型使用:
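A sketch of this step is shown below; the two example sentences are placeholders, chosen so that each encodes to four tokens with the GPT-2 BPE tokenizer:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"   # placeholder example text
txt2 = "Every day holds a"        # placeholder example text
batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)  #A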
#A The first row corresponds to the first text, and the second row corresponds to the second text #A 第一行对应第一段文字,第二行对应第二段文字
Next, we initialize a new 124 million parameter DummyGPTModel instance and feed it the tokenized batch: 接下来,我们初始化一个新的 1.24 亿参数的 DummyGPTModel 实例,并将其输入的令牌化批次喂入其中:
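A minimal sketch of this step, assuming the GPT_CONFIG_124M dictionary and the batch tensor from above:

torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)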
The output tensor has two rows corresponding to the two text samples. Each text sample consists of 4 tokens; each token is a 50,257-dimensional vector, which matches the size of the tokenizer's vocabulary. 输出张量有两行,对应两个文本样本。每个文本样本由 4 个标记组成;每个标记是一个 50,257 维向量,与标记器词汇表的大小相匹配。
The embedding has 50,257 dimensions because each of these dimensions refers to a unique token in the vocabulary. At the end of this chapter, when we implement the postprocessing code, we will convert these 50,257-dimensional vectors back into token IDs, which we can then decode into words. 嵌入层有 50,257 个维度,因为每个维度都对应着词汇表中的一个独特的标记。在本章的最后,当我们实现后处理代码时,我们将把这些 50,257 维度的向量转换回标记 ID,然后将其解码为单词。
Now that we have taken a top-down look at the GPT architecture and its in- and outputs, we will code the individual placeholders in the upcoming sections, starting with the real layer normalization class that will replace the DummyLayerNorm in the previous code. 现在我们从顶向下的角度看了 GPT 架构及其输入和输出,在接下来的章节中,我们将编写各个占位符,从替代先前代码中的 DummyLayerNorm 的实际层归一化类开始。
4.2 Normalizing activations with layer normalization 使用层归一化对激活值进行归一化
Training deep neural networks with many layers can sometimes prove challenging due to issues like vanishing or exploding gradients. These issues lead to unstable training dynamics and make it difficult for the network to effectively adjust its weights, which means the learning process struggles to find a set of parameters (weights) for the neural network that minimizes the loss function. In other words, the network has difficulty learning the underlying patterns in the data to a degree that would allow it to make accurate predictions or decisions. (If you are new to neural network training and the concepts of gradients, a brief introduction to these concepts can be found in Section A.4, Automatic Differentiation Made Easy in Appendix A: Introduction to PyTorch. However, a deep mathematical understanding of gradients is not required to follow the contents of this book.) 培训具有多层的深度神经网络有时会很有挑战性,这是由于诸如梯度消失或梯度爆炸等问题造成的。这些问题导致了训练动力的不稳定,使网络很难有效调整其权重,这意味着学习过程难以找到一组参数(权重)使神经网络损失函数最小化。换句话说,网络很难学习到足以进行准确预测或决策的数据潜在模式。(如果您是神经网络训练和梯度概念的新手,可以在附录 A:PyTorch 简介的第 A.4 节中找到这些概念的简介。但是,要理解本书的内容,对梯度的深入数学理解并非必需。)
In this section, we will implement layer normalization to improve the stability and efficiency of neural network training. 在这个部分,我们将实现层归一化来提高神经网络训练的稳定性和效率。
The main idea behind layer normalization is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1 , also known as unit variance. This adjustment speeds up the convergence to effective weights and ensures consistent, reliable training. As we have seen in the previous section, based on the DummyLayerNorm placeholder, in GPT-2 and modern transformer architectures, layer normalization is typically applied before and after the multi-head attention module and before the final output layer. 层归一化的主要思想是调整神经网络层的激活(输出)使其均值为 0,方差为 1,也称为单位方差。这种调整加速了有效权重的收敛,确保了一致可靠的训练。正如我们在上一节中所见,基于 DummyLayerNorm 占位符,在 GPT-2 和现代 transformer 架构中,层归一化通常应用于多头注意力模块之前和之后,以及最终输出层之前。
Before we implement layer normalization in code, Figure 4.5 provides a visual overview of how layer normalization functions. 在我们在代码中实现层标准化之前,图 4.5 提供了层标准化功能的可视化概述。
Figure 4.5 An illustration of layer normalization where the 5 layer outputs, also called activations, are normalized such that they have a zero mean and variance of 1 . 图 4.5 一个层归一化的说明,其中 5 个层输出,也称为激活,被归一化为均值为 0,方差为 1。
We can recreate the example shown in Figure 4.5 via the following code, where we implement a neural network layer with 5 inputs and 6 outputs that we apply to two input examples: 我们可以通过以下代码重新创建图 4.5 中所示的示例,其中我们实现了一个具有 5 个输入和 6 个输出的神经网络层,并将其应用于两个输入示例:
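A sketch of this example (the inputs are random values, so the exact numbers will differ from the figure):

torch.manual_seed(123)
batch_example = torch.randn(2, 5) #A
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)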
#A create 2 training examples with 5 dimensions (features) each #A 创建两个维度为 5 的训练样例
This prints the following tensor, where the first row lists the layer outputs for the first input and the second row lists the layer outputs for the second row: 这会打印以下张量,其中第一行列出了第一个输入的层输出,第二行列出了第二个输入的层输出:
The neural network layer we have coded consists of a Linear layer followed by a non-linear activation function, ReLU (short for Rectified Linear Unit), which is a standard activation function in neural networks. If you are unfamiliar with ReLU, it simply thresholds negative inputs to 0 , ensuring that a layer outputs only positive values, which explains why the resulting layer output does not contain any negative values. (Note that we will use another, more sophisticated activation function in GPT, which we will introduce in the next section). 我们编码的神经网络层由一个线性层和一个非线性激活函数 ReLU(即修正线性单元)组成,这是神经网络中的标准激活函数。如果您不熟悉 ReLU,它只是将负输入阈值化为 0,确保层输出仅包含正值,这就解释了为什么结果层输出不包含任何负值。(请注意,我们将在 GPT 中使用另一种更复杂的激活函数,我们将在下一节中介绍它)。
Before we apply layer normalization to these outputs, let's examine the mean and variance: 在我们对这些输出应用层归一化之前,让我们来检查一下均值和方差:
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)
The first row in the mean tensor above contains the mean value for the first input row, and the second output row contains the mean for the second input row. 上述平均张量中的第一行包含第一个输入行的平均值,第二个输出行包含第二个输入行的平均值。
Using keepdim=True in operations like mean or variance calculation ensures that the output tensor retains the same number of dimensions as the input tensor, even though the operation reduces the tensor along the dimension specified via dim. For instance, without keepdim=True, the returned mean tensor would be a 2-dimensional vector [0.1324, 0.2170] instead of a 2×1-dimensional matrix [[0.1324], [0.2170]].
The dim parameter specifies the dimension along which the calculation of the statistic (here, mean or variance) should be performed in a tensor, as shown in Figure 4.6. 维度参数指定在张量中执行统计量(此处为均值或方差)计算的维度,如图 4.6 所示。
dim=0 calculates the mean across the row dimension to obtain one mean per column
Figure 4.6 An illustration of the dim parameter when calculating the mean of a tensor. For instance, if we have a 2D tensor (matrix) with dimensions [rows, columns], using dim=0 will perform the operation across rows (vertically, as shown at the bottom), resulting in an output that aggregates the data for each column. Using dim=1 or dim=-1 will perform the operation across columns (horizontally, as shown at the top), resulting in an output aggregating the data for each row.
As Figure 4.6 explains, for a 2D tensor (like a matrix), using dim=-1 for operations such as mean or variance calculation is the same as using dim=1. This is because -1 refers to the tensor's last dimension, which corresponds to the columns in a 2D tensor. Later, when adding layer normalization to the GPT model, which produces 3D tensors with shape [batch_size, num_tokens, embedding_size], we can still use dim=-1 for normalization across the last dimension, avoiding a change from dim=1 to dim=2. 如图 4.6 所示,对于二维张量(如矩阵),对于诸如平均值或方差计算等操作使用 dim=-1 与使用 dim=1 是相同的。这是因为 -1 指的是张量的最后一个维度,在二维张量中对应于列。后来在为 GPT 模型添加层归一化时,它会产生三维张量,形状为[batch_size, num_tokens, embedding_size],我们仍然可以使用 dim=-1 对最后一个维度进行归一化,避免从 dim=1 变更为 dim=2。
Next, let us apply layer normalization to the layer outputs we obtained earlier. The operation consists of subtracting the mean and dividing by the square root of the variance (also known as standard deviation): 接下来,让我们对之前得到的层输出应用层归一化。该操作包括减去平均值并除以方差的平方根(也称为标准差):
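A sketch of this normalization step, applied to the out, mean, and var tensors from above:

out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)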
As we can see based on the results, the normalized layer outputs, which now also contain negative values, have zero mean and a variance of 1 : 正如我们根据结果看到的那样,归一化层的输出现在也包含负值,具有零均值和 1 的方差:
Note that a value such as 2.98e-08 in the output tensor is the scientific notation for 2.98 × 10^-8, which is 0.0000000298 in decimal form. This value is very close to 0, but it is not exactly 0 due to small numerical errors that can accumulate because of the finite precision with which computers represent numbers.
To improve readability, we can also turn off the scientific notation when printing tensor values by setting sci_mode to False: 为了提高可读性,我们还可以在打印张量值时关闭科学记数法,将 sci_mode 设置为 False:
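For example:

torch.set_printoptions(sci_mode=False)
print("Mean:\n", mean)
print("Variance:\n", var)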
So far, in this section, we have coded and applied layer normalization in a step-by-step process. Let's now encapsulate this process in a PyTorch module that we can use in the GPT model later: 到目前为止,在本节中,我们已经编码并逐步应用了层归一化。现在,让我们把这个过程封装在一个 PyTorch 模块中,供我们在 GPT 模型中使用:
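A sketch of such a module, consistent with the description that follows (the scale, shift, and eps attributes are the ones discussed below):

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift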
This specific implementation of layer normalization operates on the last dimension of the input tensor x, which represents the embedding dimension (emb_dim). The variable eps is a small constant (epsilon) added to the variance to prevent division by zero during normalization. The scale and shift are two trainable parameters (of the same dimension as the input) that the LLM automatically adjusts during training if it is determined that doing so would improve the model's performance on its training task. This allows the model to learn appropriate scaling and shifting that best suit the data it is processing.
BIASED VARIANCE 偏差方差
In our variance calculation method, we have opted for an implementation detail by setting unbiased=False. For those curious about what this means, in the variance calculation, we divide by the number of inputs n in the variance formula. This approach does not apply Bessel's correction, which typically uses n-1 instead of n in the denominator to adjust for bias in sample variance estimation. This decision results in a so-called biased estimate of the variance. For large-scale language models (LLMs), where the embedding dimension n is significantly large, the difference between using n and n-1 is practically negligible. We chose this approach to ensure compatibility with the GPT-2 model's normalization layers and because it reflects TensorFlow's default behavior, which was used to implement the original GPT-2 model. Using a similar setting ensures our method is compatible with the pretrained weights we will load in chapter 6.
Let's now try the LayerNorm module in practice and apply it to the batch input: 让我们现在在实践中尝试 LayerNorm 模块,并将其应用于批量输入:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)
As we can see based on the results, the layer normalization code works as expected and normalizes the values of each of the two inputs such that they have a mean of 0 and variance of 1 : 根据结果,可以看到层归一化代码按预期工作,将两个输入的值都归一化,使得它们的均值为 0,方差为 1
In this section, we covered one of the building blocks we will need to implement the GPT architecture, as shown in the mental model in Figure 4.7. 在本节中,我们介绍了实现 GPT 架构所需的基本组件之一,如图 4.7 所示的心智模型所示。
Figure 4.7 A mental model listing the different building blocks we implement in this chapter to assemble the GPT architecture. 图 4.7 一个列出我们在本章中实现的用于组装 GPT 架构的不同构建块的心智模型。
In the next section, we will look at the GELU activation function, which is one of the activation functions used in LLMs, instead of the traditional ReLU function we used in this section. 在下一节中,我们将看看 GELU 激活函数,这是<code1001>中使用的激活函数之一,而不是我们在本节中使用的传统 ReLU 函数。
LAYER NORMALIZATION VERSUS BATCH NORMALIZATION 层规范化与批量规范化
If you are familiar with batch normalization, a common and traditional normalization method for neural networks, you may wonder how it compares to layer normalization. Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the feature dimension. LLMs often require significant computational resources, and the available hardware or the specific use case can dictate the batch size during training or inference. Since layer normalization normalizes each input independently of the batch size, it offers more flexibility and stability in these scenarios. This is particularly beneficial for distributed training or when deploying models in environments where resources are constrained. 如果您熟悉批量归一化,这是神经网络常见的传统归一化方法,您可能会想知道它与层归一化相比如何。与批量归一化不同,后者是在批次维度上进行归一化,而层归一化是在特征维度上进行归一化。LLMs通常需要大量计算资源,训练或推理期间的批次大小可能由可用的硬件或特定用例决定。由于层归一化是在每个输入上独立进行归一化,而不受批次大小的影响,因此它在这些场景下提供了更大的灵活性和稳定性。对于分布式训练或在资源受限环境中部署模型来说,这一特性尤其有益。
4.3 Implementing a feed forward network with GELU activations 4.3 使用 GELU 激活函数实现前馈网络
In this section, we implement a small neural network submodule that is used as part of the transformer block in LLMs. We begin with implementing the GELU activation function, which plays a crucial role in this neural network submodule. (For additional information on implementing neural networks in PyTorch, please see section A. 5 Implementing multilayer neural networks in Appendix A.) 在这一部分中,我们实现了一个小型神经网络子模块,作为LLMs中变压器块的一部分使用。我们首先实现了 GELU 激活函数,它在这个神经网络子模块中扮演了关键角色。(有关在 PyTorch 中实现神经网络的更多信息,请参见附录 A 中的第 A.5 节"实现多层神经网络")。
Historically, the ReLU activation function has been commonly used in deep learning due to its simplicity and effectiveness across various neural network architectures. However, in LLMs, several other activation functions are employed beyond the traditional ReLU. Two notable examples are GELU (Gaussian Error Linear Unit) and SwiGLU (Swish-Gated Linear Unit). 从历史上看,ReLU 激活函数由于其简单性和在各种神经网络架构中的有效性而广泛使用。然而,在LLMs中,除了传统的 ReLU 之外,还使用了几种其他的激活函数。两个值得注意的例子是 GELU(Gaussian Error Linear Unit)和 SwiGLU(Swish-Gated Linear Unit)。
GELU and SwiGLU are more complex and smooth activation functions incorporating Gaussian and sigmoid-gated linear units, respectively. They offer improved performance for deep learning models, unlike the simpler ReLU. GELU 和 SwiGLU 是更复杂和平滑的激活函数,分别融合了高斯和 sigmoid 门控线性单元。与简单的 ReLU 相比,它们为深度学习模型提供了更好的性能。
The GELU activation function can be implemented in several ways; the exact version is defined as GELU(x) = x·Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution. In practice, however, it's common to implement a computationally cheaper approximation (the original GPT-2 model was also trained with this approximation): GELU(x) ≈ 0.5·x·(1 + tanh(sqrt(2/π)·(x + 0.044715·x³)))
In code, we can implement this function as PyTorch module as follows: 在代码中,我们可以将这个函数实现为 PyTorch 模块如下:
Listing 4.3 An implementation of the GELU activation function 代码示例 4.3 GELU 激活函数的实现
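A sketch implementing the tanh-based approximation given above:

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # tanh-based approximation of GELU, as used for GPT-2
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))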
Next, to get an idea of what this GELU function looks like and how it compares to the ReLU function, let's plot these functions side by side: 接下来,为了了解 GELU 函数的样子以及它与 ReLU 函数的对比,让我们将这些函数并列绘制:
import matplotlib.pyplot as plt
gelu, relu = GELU(), nn.ReLU()
x = torch.linspace(-3, 3, 100) #A
y_gelu, y_relu = gelu(x), relu(x)
plt.figure(figsize=(8, 3))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} activation function")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)
plt.tight_layout()
plt.show()
#A Create 100 sample data points in the range -3 to 3 在-3 到 3 的范围内创建 100 个样本数据点
As we can see in the resulting plot in Figure 4.8, ReLU is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero. GELU is a smooth, nonlinear function that approximates ReLU but with a non-zero gradient for negative values. 正如我们在图 4.8 所示的结果图中所看到的,ReLU 是一个分段线性函数,如果输入为正则直接输出输入,否则输出为零。GELU 是一个平滑的非线性函数,它近似于 ReLU,但对于负值具有非零梯度。
Figure 4.8 The output of the GELU and ReLU plots using matplotlib. The x-axis shows the function inputs and the y-axis shows the function outputs. 图 4.8 使用 matplotlib 输出 GELU 和 ReLU 图。x 轴显示函数输入,y 轴显示函数输出。
The smoothness of GELU, as shown in Figure 4.8, can lead to better optimization properties during training, as it allows for more nuanced adjustments to the model's parameters. In contrast, ReLU has a sharp corner at zero, which can sometimes make optimization harder, especially in networks that are very deep or have complex architectures. Moreover, unlike ReLU, which outputs zero for any negative input, GELU allows for a small, non-zero output for negative values. This characteristic means that during the training process, neurons that receive negative input can still contribute to the learning process, albeit to a lesser extent than positive inputs.
Next, let's use the GELU function to implement the small neural network module, FeedForward, that we will be using in the LLM's transformer block later: 接下来,让我们使用 GELU 函数来实现小型神经网络模块 FeedForward,我们将在LLM的 transformer 块中使用它:
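A sketch of this module, consistent with the 768 -> 3072 -> 768 expansion and contraction described below:

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),  # expansion
            GELU(),                                         # activation
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),  # contraction
        )

    def forward(self, x):
        return self.layers(x)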
As we can see in the preceding code, the FeedForward module is a small neural network consisting of two Linear layers and a GELU activation function. In the 124-million-parameter GPT model, it receives input batches of tokens that each have an embedding size of 768, via the GPT_CONFIG_124M dictionary, where GPT_CONFIG_124M["emb_dim"] = 768.
Figure 4.9 shows how the embedding size is manipulated inside this small feed forward neural network when we pass it some inputs. 图 4.9 显示了当我们将一些输入传递给这个小型前馈神经网络时,嵌入大小是如何被操纵的。
Figure 4.9 (annotations): the input and output tensors both have shape [2, 3, 768], where the three values represent the batch size (2), the number of tokens (3), and the embedding size (768). The first linear layer increases the embedding dimension by a factor of 4 (from 768 to 3072), and the second linear layer decreases it by a factor of 4 back to 768.
Figure 4.9 provides a visual overview of the connections between the layers of the feed forward neural network. It is important to note that this neural network can accommodate variable batch sizes and numbers of tokens in the input. However, the embedding size for each token is determined and fixed when initializing the weights. 图 4.9 提供了前馈神经网络各层之间连接的概览。需要注意的是,这种神经网络可以容纳可变的批大小和输入的标记数量。但是,每个标记的嵌入大小在初始化权重时就已确定并固定。
Following the example in Figure 4.9, let's initialize a new FeedForward module with a token embedding size of 768 and feed it a batch input with 2 samples and 3 tokens each: 以图 4.9 中的示例为例,让我们初始化一个新的 FeedForward 模块,其 token 嵌入大小为 768,并输入一个 batch 尺寸为 2、token 数量为 3 的输入:
ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768) #A
out = ffn(x)
print(out.shape)
As we can see, the shape of the output tensor is the same as that of the input tensor: 我们可以看到,输出张量的形状与输入张量的形状相同:
torch.Size([2, 3, 768])
The FeedForward module we implemented in this section plays a crucial role in enhancing the model's ability to learn from and generalize the data. Although the input and output dimensions of this module are the same, it internally expands the embedding dimension into a higher-dimensional space through the first linear layer as illustrated in Figure 4.10. This expansion is followed by a non-linear GELU activation, and then a contraction back to the original dimension with the second linear transformation. Such a design allows for the exploration of a richer representation space. 在本节中我们实现的前馈模块在增强模型从数据中学习和泛化的能力方面起着关键作用。尽管该模块的输入和输出维度相同,但它通过第一个线性层将嵌入维度扩展到更高维空间,如图 4.10 所示。此扩展随后经过非线性 GELU 激活,然后通过第二个线性变换收缩回到原始维度。这种设计允许探索更丰富的表示空间。
Figure 4.10 An illustration of the expansion and contraction of the layer outputs in the feed forward neural network. First, the inputs expand by a factor of 4 from 768 to 3072 values. Then, the second layer compresses the 3072 values back into a 768 -dimensional representation. 图 4.10 前馈神经网络中层输出展开和压缩的示例。首先,输入从 768 扩展到 3072 个值,扩展因子为 4。然后,第二层将 3072 个值压缩回 768 维的表示。
Moreover, the uniformity in input and output dimensions simplifies the architecture by enabling the stacking of multiple layers, as we will do later, without the need to adjust dimensions between them, thus making the model more scalable. 此外,输入和输出维度的一致性通过允许堆叠多个层而简化了架构,我们将在后面这样做,无需在它们之间调整维度,从而使模型更具可扩展性。
As illustrated in Figure 4.11, we have now implemented most of the LLM's building blocks. 如图 4.11 所示,我们现在已经实现了大部分LLM的建筑块。
Figure 4.11 A mental model showing the topics we cover in this chapter, with the black checkmarks indicating those that we have already covered. 图 4.11 一个心智模型显示了我们在本章中涵盖的主题,黑色勾号表示我们已经讨论过的那些主题。
In the next section, we will go over the concept of shortcut connections that we insert between different layers of a neural network, which are important for improving the training performance in deep neural network architectures. 在下一节中,我们将介绍我们在神经网络的不同层之间插入的快捷连接的概念,这对于提高深度神经网络架构中的训练性能很重要。
4.4 Adding shortcut connections 4.4 添加快捷连接
Next, let's discuss the concept behind shortcut connections, also known as skip or residual connections. Originally, shortcut connections were proposed for deep networks in computer vision (specifically, in residual networks) to mitigate the challenge of vanishing gradients. The vanishing gradient problem refers to the issue where gradients (which guide weight updates during training) become progressively smaller as they propagate backward through the layers, making it difficult to effectively train earlier layers, as illustrated in Figure 4.12. 下一步,让我们讨论快捷连接的概念,也称为跳跃或残差连接。最初,快捷连接是为计算机视觉的深度网络(特别是残差网络)提出的,以缓解消失梯度的挑战。消失梯度问题是指梯度(在训练期间指导权重更新)在通过层反向传播时逐步变小,这使得有效训练早期层变得困难,如图 4.12 所示。
Figure 4.12 A comparison between a deep neural network consisting of 5 layers without (on the left) and with shortcut connections (on the right). Shortcut connections involve adding the inputs of a layer to its outputs, effectively creating an alternate path that bypasses certain layers. The gradient values shown denote the mean absolute gradient at each layer, which we will compute in the code example that follows.
As illustrated in Figure 4.12, a shortcut connection creates an alternative, shorter path for the gradient to flow through the network by skipping one or more layers, which is achieved by adding the output of one layer to the output of a later layer. This is why these connections are also known as skip connections. They play a crucial role in preserving the flow of gradients during the backward pass in training. 正如图 4.12 所示,快捷连接为梯度在网络中流动创造了另一条更短的路径,通过跳过一个或多个层实现。这是通过将一层的输出加到较后一层的输出来实现的。这就是为什么这些连接也被称为跳跃连接。它们在训练过程中反向传播梯度流动中发挥了关键作用。
In the code example below, we implement the neural network shown in Figure 4.12 to see how we can add shortcut connections in the forward method: 在下面的代码示例中,我们实现了图 4.12 所示的神经网络,以了解我们如何在前向方法中添加快捷连接:
Listing 4.5 A neural network to illustrate shortcut connections 4.5 节 说明神经网络的快捷连接
class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        # One Linear layer followed by a GELU activation per pair of consecutive layer sizes
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(d_in, d_out), GELU())
            for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])
        ])

    def forward(self, x):
        for layer in self.layers:
            layer_output = layer(x)  # Compute the output of the current layer
            # Check if shortcut can be applied
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x
The code implements a deep neural network with 5 layers, each consisting of a Linear layer and a GELU activation function. In the forward pass, we iteratively pass the input through the layers and optionally add the shortcut connections depicted in Figure 4.12 if the self.use_shortcut attribute is set to True. 代码实现了一个具有 5 层的深度神经网络,每层由一个线性层和一个 GELU 激活函数组成。在前向传递过程中,我们将输入逐层传递,如果 self.use_shortcut 属性设置为 True,则还可以选择性地添加图 4.12 所示的快捷连接。
Let's use this code to first initialize a neural network without shortcut connections. Here, each layer will be initialized such that it accepts an example with 3 input values and returns 3 output values. The last layer returns a single output value: 让我们使用这段代码首先初始化一个没有快捷连接的神经网络。在这里,每个层将被初始化成接受 3 个输入值的示例,并返回 3 个输出值。最后一层返回一个单一的输出值。
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123) # specify random seed for the initial weights for reproducibility
model_without_shortcut = ExampleDeepNeuralNetwork(
layer_sizes, use_shortcut=False
)
Next, we implement a function that computes the gradients in the model's backward pass:
def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate loss based on how close the target
    # and output are
    loss = nn.MSELoss()
    loss = loss(output, target)

    # Backward pass to calculate the gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")
In the preceding code, we specify a loss function that computes how close the model output and a user-specified target (here, for simplicity, the value 0) are. Then, when calling loss.backward(), PyTorch computes the loss gradient for each layer in the model. We can iterate through the weight parameters via model.named_parameters(). Suppose we have a weight parameter matrix for a given layer; that layer then has a correspondingly shaped matrix of gradient values, and we print the mean absolute value of these gradients to obtain a single gradient value per layer, which makes it easier to compare the gradients between layers.
In short, the .backward() method is a convenient method in PyTorch that computes loss gradients, which are required during model training, without implementing the math for the gradient calculation ourselves, thereby making working with deep neural networks much more accessible. If you are unfamiliar with the concept of gradients and neural network training, I recommend reading sections A.4, Automatic differentiation made easy and A.7 A typical training loop in appendix A. 总之,.backward()方法是一种方便的 PyTorch 方法,它可以计算训练模型所需的损失梯度,而无需自己实现梯度计算的数学公式,从而使深度神经网络的工作更加易于访问。如果你不熟悉梯度和神经网络训练的概念,我建议你阅读附录 A 中的 A.4 节《轻松掌握自动微分》和 A.7 节《典型的训练循环》。
Let's now use the print_gradients function and apply it to the model without skip connections: 让我们现在使用 print_gradients 函数,并将其应用于没有跳跃连接的模型:
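For example, using the model_without_shortcut and sample_input defined above, the call looks as follows:

print_gradients(model_without_shortcut, sample_input)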
The output is as follows: 输出如下:
layers.0.0.weight has gradient mean of 0.00020173587836325169 层.0.0.权重的梯度均值为 0.00020173587836325169
layers.1.0.weight has gradient mean of 0.0001201116101583466 层.1.0.权重的梯度均值为 0.0001201116101583466
layers.2.0.weight has gradient mean of 0.0007152041653171182 layers.2.0.weight 的梯度均值为 0.0007152041653171182
layers.3.0.weight has gradient mean of 0.001398873864673078 layers.3.0.weight 的梯度均值为 0.001398873864673078
layers.4.0.weight has gradient mean of 0.005049646366387606 layers.4.0.weight 的梯度均值为 0.005049646366387606
As we can see based on the output of the print_gradients function, the gradients become smaller as we progress from the last layer (layers.4) to the first layer (layers.0), which is a phenomenon called the vanishing gradient problem. 根据 print_gradients 函数的输出结果可以看出,从最后一层(layers.4)到第一层(layers.0),梯度越来越小,这就是著名的消失梯度问题。
Let's now instantiate a model with skip connections and see how it compares: 让我们现在实例化一个带有跳跃连接的模型,并看看它与其他模型的比较:
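One way to do this mirrors the earlier initialization, reusing the same random seed for comparability but setting use_shortcut=True:

torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)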
layers.0.0.weight has gradient mean of 0.22169792652130127 layers.0.0.weight 的梯度均值为 0.22169792652130127
layers.1.0.weight has gradient mean of 0.20694105327129364 layers.1.0.weight 的梯度均值为 0.20694105327129364
layers.2.0.weight has gradient mean of 0.32896995544433594 层.2.0.权重的梯度平均值为 0.32896995544433594
layers.3.0.weight has gradient mean of 0.2665732502937317 layers.3.0.weight 的梯度均值为 0.2665732502937317
layers.4.0.weight has gradient mean of 1.3258541822433472 layers.4.0.weight 的梯度均值为 1.3258541822433472
As we can see, based on the output, the last layer (layers.4) still has a larger gradient than the other layers. However, the gradient value stabilizes as we progress towards the first layer (layers.0) and doesn't shrink to a vanishingly small value. 从输出来看,最后一层(layers.4)的梯度仍然比其他层更大。但是,随着向第一层(layers.0)的进展,梯度值稳定下来,不会缩小到微不足道的值。
In conclusion, shortcut connections are important for overcoming the limitations posed by the vanishing gradient problem in deep neural networks. Shortcut connections are a core building block of very large models such as LLMs, and they will help facilitate more effective training by ensuring consistent gradient flow across layers when we train the GPT model in the next chapter. 总之,捷径连接对于克服深度神经网络中消失梯度问题的局限性很重要。捷径连接是LLMs等非常大型模型的核心构建块,它们将有助于在下一章训练 GPT 模型时确保跨层的梯度流一致,从而实现更有效的训练。
After introducing shortcut connections, we will now connect all of the previously covered concepts (layer normalization, GELU activations, feed forward module, and shortcut connections) in a transformer block in the next section, which is the final building block we need to code the GPT architecture. 在介绍快捷连接之后,我们将在下一节中将之前介绍的所有概念(层归一化、GELU 激活、前馈模块和快捷连接)结合到一个 transformer 块中,这是我们编写 GPT 架构所需的最后一个基础模块。
4.5 Connecting attention and linear layers in a transformer block 4.5 在变换器块中连接注意力和线性层
In this section, we are implementing the transformer block, a fundamental building block of GPT and other LLM architectures. This block, which is repeated a dozen times in the 124 million parameter GPT-2 architecture, combines several concepts we have previously covered: multi-head attention, layer normalization, dropout, feed forward layers, and GELU activations, as illustrated in Figure 4.13. In the next section, we will then connect this transformer block to the remaining parts of the GPT architecture. 在本节中,我们正在实现变换器块,这是 GPT 和其他 LLM 架构的基本构建块。该块在 124 百万参数的 GPT-2 架构中重复了十几次,结合了我们之前介绍的几个概念:多头注意力、层归一化、Dropout、前馈层和 GELU 激活函数,如图 4.13 所示。在下一节中,我们将把这个变换器块连接到 GPT 架构的其余部分。
Figure 4.13 An illustration of a transformer block. The bottom of the diagram shows input tokens that have been embedded into 768 -dimensional vectors. Each row corresponds to one token's vector representation. The outputs of the transformer block are vectors of the same dimension as the input, which can then be fed into subsequent layers in an LLM. 图 4.13 变压器块的示意图。图底显示已嵌入到 768 维向量的输入标记。每行对应一个标记的向量表示。变压器块的输出是与输入相同维度的向量,可以传递给LLM中的后续层。
As shown in Figure 4.13, the transformer block combines several components, including the masked multi-head attention module from chapter 3 and the FeedForward module we implemented in Section 4.3. 如图 4.13 所示,transformer 模块结合了多个组件,包括第 3 章中的掩码多头注意力模块和我们在 4.3 节中实现的前馈模块。
When a transformer block processes an input sequence, each element in the sequence (for example, a word or subword token) is represented by a fixed-size vector (in the case of Figure 4.13, 768 dimensions). The operations within the transformer block, including multihead attention and feed forward layers, are designed to transform these vectors in a way that preserves their dimensionality. 当变压器块处理输入序列时,序列中的每个元素(例如单词或子词标记)都用固定大小的向量(在图 4.13 中为 768 维)来表示。变压器块内的操作,包括多头注意力和前馈层,旨在以保持其维度的方式转换这些向量。
The idea is that the self-attention mechanism in the multi-head attention block identifies and analyzes relationships between elements in the input sequence. In contrast, the feed forward network modifies the data individually at each position. This combination not only enables a more nuanced understanding and processing of the input but also enhances the model's overall capacity for handling complex data patterns. 自注意机制在多头注意力模块中识别和分析输入序列中元素之间的关系。相比之下,前馈网络独立修改每个位置的数据。这种组合不仅可以实现更细腻的理解和处理输入,还可以增强模型处理复杂数据模式的整体能力。
In code, we can create the TransformerBlock as follows: 在代码中,我们可以如下创建 TransformerBlock:
Listing 4.6 The transformer block component of GPT 4.6 GPT 变换器块组件
from previous_chapters import MultiHeadAttention
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        shortcut = x  #A
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        shortcut = x  #B
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  #C

        return x
#A Shortcut connection for attention block #用于注意块的快捷连接
#B Shortcut connection for feed forward block 前馈块的快捷连接
#C Add the original input back #C 将原始输入添加回来
The given code defines a TransformerBlock class in PyTorch that includes a multi-head attention mechanism (MultiHeadAttention) and a feed forward network (FeedForward), both configured based on a provided configuration dictionary (cfg), such as GPT_CONFIG_124M.
Layer normalization (LayerNorm) is applied before each of these two components, and dropout is applied after them to regularize the model and prevent overfitting. This is also known as Pre-LayerNorm. Older architectures, such as the original transformer model, applied layer normalization after the self-attention and feed-forward networks instead, known as Post-LayerNorm, which often leads to worse training dynamics. 层归一化(LayerNorm)应用于这两个组件之前,并在它们之后应用 dropout 以规范模型并防止过拟合。这也被称为 Pre-LayerNorm。原始的变换模型等较旧的架构在自注意力和前馈网络之后应用层归一化,称为 Post-LayerNorm,这往往导致训练动态性变差。
The class also implements the forward pass, where each component is followed by a shortcut connection that adds the input of the block to its output. This critical feature helps gradients flow through the network during training and improves the learning of deep models as explained in section 4.4. 该类还实现了前向传递,其中每个组件都跟随着一个快捷连接,将该块的输入添加到其输出中。这一关键特性有助于在训练期间梯度流经网络,并如第 4.4 节所解释的那样提高深度模型的学习。
Using the GPT_CONFIG_124M dictionary we defined earlier, let's instantiate a transformer block and feed it some sample data: 使用我们之前定义的 GPT_CONFIG_124M 字典,让我们实例化一个 transformer 模块并输入一些示例数据:
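A quick check might look as follows; the sample tensor shape is chosen to match the 768-dimensional embeddings in GPT_CONFIG_124M, and the variable names are illustrative:

torch.manual_seed(123)
x = torch.rand(2, 4, 768)  # Sample input with shape [batch_size, num_tokens, emb_dim]
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)

print("Input shape:", x.shape)       # torch.Size([2, 4, 768])
print("Output shape:", output.shape) # torch.Size([2, 4, 768])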
As we can see from the code output, the transformer block maintains the input dimensions in its output, indicating that the transformer architecture processes sequences of data without altering their shape throughout the network. 从代码输出中我们可以看到,转换器块保持了输入维度在输出中,这表示转换器架构在整个网络中处理数据序列而不改变它们的形状。
The preservation of shape throughout the transformer block architecture is not incidental but a crucial aspect of its design. This design enables its effective application across a wide range of sequence-to-sequence tasks, where each output vector directly corresponds to an input vector, maintaining a one-to-one relationship. However, the output is a context vector that encapsulates information from the entire input sequence, as we learned in chapter 3. This means that while the physical dimensions of the sequence (length and feature size) remain unchanged as it passes through the transformer block, the content of each output vector is re-encoded to integrate contextual information from across the entire input sequence. 在变压器块体系结构中,保持形状不是偶然的,而是其设计的关键方面。这种设计使其能够广泛应用于各种序列到序列任务,每个输出向量直接对应一个输入向量,保持一对一的关系。然而,输出是一个上下文向量,它包含了整个输入序列的信息,正如我们在第 3 章中所学到的。这意味着,当输入序列通过变压器块时,其物理尺寸(长度和特征大小)保持不变,但每个输出向量的内容都被重新编码,以整合整个输入序列的上下文信息。
With the transformer block implemented in this section, we now have all the building blocks, as shown in Figure 4.14, needed to implement the GPT architecture in the next section. 在本节中实现的 Transformer 块,我们现在拥有了所有的构建块,如图 4.14 所示,这些构建块将在下一节中用于实现 GPT 架构。
Figure 4.14 A mental model of the different concepts we have implemented in this chapter so far. 图 4.14 我们在本章迄今实施的不同概念的心智模型。
As illustrated in Figure 4.14, the transformer block combines layer normalization, the feed forward network, including GELU activations, and shortcut connections, which we already covered earlier in this chapter. As we will see in the upcoming section, this transformer block will make up the main component of the GPT architecture we will implement.
4.6 Coding the GPT model 4.6 编码 GPT 模型
We started this chapter with a big-picture overview of a GPT architecture that we called DummyGPTModel. In this DummyGPTModel code implementation, we showed the input and outputs to the GPT model, but its building blocks remained a black box using a DummyTransformerBlock and DummyLayerNorm class as placeholders. 我们以一个叫 DummyGPTModel 的 GPT 架构的宏观概述开始了这一章。在这个 DummyGPTModel 代码实现中,我们展示了 GPT 模型的输入和输出,但其构建块仍然是一个黑箱,使用 DummyTransformerBlock 和 DummyLayerNorm 类作为占位符。
In this section, we are now replacing the DummyTransformerBlock and DummyLayerNorm placeholders with the real TransformerBlock and LayerNorm classes we coded later in this chapter to assemble a fully working version of the original 124 million parameter version of GPT-2. In chapter 5, we will pretrain a GPT-2 model, and in chapter 6 , we will load in the pretrained weights from OpenAI. 在本节中,我们现在将 DummyTransformerBlock 和 DummyLayerNorm 占位符替换为我们在本章后面编写的真实的 TransformerBlock 和 LayerNorm 类,以组装 GPT-2 原始版本的 1.24 亿参数的完全工作版本。在第 5 章中,我们将预训练一个 GPT-2 模型,在第 6 章中,我们将加载来自 OpenAI 的预训练权重。
Before we assemble the GPT-2 model in code, let's look at its overall structure in Figure 4.15, which combines all the concepts we covered so far in this chapter. 在我们在代码中组装 GPT-2 模型之前,让我们看看图 4.15 中的整体结构,它结合了我们在本章中涵盖的所有概念。
Figure 4.15 An overview of the GPT model architecture. This figure illustrates the flow of data through the GPT model. Starting from the bottom, tokenized text is first converted into token embeddings, which are then augmented with positional embeddings. This combined information forms a tensor that is passed through a series of transformer blocks shown in the center (each containing multi-head attention and feed forward neural network layers with dropout and layer normalization), which are stacked on top of each other and repeated 12 times. 图 4.15 GPT 模型架构概览。该图说明了数据在 GPT 模型中的流动。从底部开始,分词文本首先转换成词嵌入,然后添加位置嵌入。这些组合信息形成一个张量,通过中间的一系列 Transformer 块(每个块包含多头注意力和前馈神经网络层,并应用 dropout 和层标准化),该 Transformer 块被堆叠并重复 12 次。
As shown in Figure 4.15, the transformer block we coded in Section 4.5 is repeated many times throughout a GPT model architecture. In the case of the 124 million parameter GPT-2 model, it's repeated 12 times, which we specify via the "n_layers" entry in the GPT_CONFIG_124M dictionary. In the case of the largest GPT-2 model, with 1,542 million parameters, this transformer block is repeated 48 times.
As shown in Figure 4.15, the output from the final transformer block then goes through a final layer normalization step before reaching the linear output layer. This layer maps the transformer's output to a high-dimensional space (in this case, 50,257 dimensions, corresponding to the model's vocabulary size) to predict the next token in the sequence. 如图 4.15 所示,最终变换器块的输出然后经过最终的层归一化步骤,然后到达线性输出层。该层将变换器的输出映射到高维空间(在本例中为 50,257 维,对应于模型的词汇表大小),以预测序列中的下一个标记。
Let's now implement the architecture we see in Figure 4.15 in code: 现在让我们在代码中实现图 4.15 中所示的架构:
Listing 4.7 The GPT model architecture implementation 4.7 章 GPT 模型架构实现
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))  #A
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
#A The device setting will allow us to train the model on a CPU or GPU, depending on which device the input data sits on
Thanks to the TransformerBlock class we implemented in Section 4.5, the GPTModel class is relatively small and compact. 得益于我们在第 4.5 节中实现的 TransformerBlock 类,GPTModel 类相对较小和紧凑。
The __init__ constructor of this GPTModel class initializes the token and positional embedding layers using the configurations passed in via a Python dictionary, cfg. These embedding layers are responsible for converting input token indices into dense vectors and adding positional information, as discussed in chapter 2.
Next, the __init__ method creates a sequential stack of TransformerBlock modules equal to the number of layers specified in cfg. Following the transformer blocks, a LayerNorm layer is applied, standardizing the outputs from the transformer blocks to stabilize the learning process. Finally, a linear output head without bias is defined, which projects the transformer's output into the vocabulary space of the tokenizer to generate logits for each token in the vocabulary.
The forward method takes a batch of input token indices, computes their embeddings, applies the positional embeddings, passes the sequence through the transformer blocks, normalizes the final output, and then computes the logits, representing the next token's unnormalized probabilities. We will convert these logits into tokens and text outputs in the next section. 前向方法采用一批输入令牌索引,计算它们的嵌入,应用位置嵌入,将序列传递通过变压器块,对最终输出进行标准化,然后计算对下一个令牌的非标准化概率的 logits。我们将在下一节中将这些 logits 转换为令牌和文本输出。
Let's now initialize the 124 million parameter GPT model using the GPT_CONFIG_124M dictionary we pass into the cfg parameter and feed it with the batch text input we created at the beginning of this chapter: 我们现在使用在本章开头创建的批处理文本输入来初始化 1.24 亿参数的 GPT 模型,将 GPT_CONFIG_124M 字典传递到 cfg 参数中
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)
The preceding code prints the contents of the input batch followed by the output tensor: 输入批次的内容后跟输出张量将被打印
Input batch:
tensor([[ 6109,  3626,  6100,   345],   # token IDs of text 1
        [ 6109,  1110,  6622,   257]])  # token IDs of text 2

Output shape: torch.Size([2, 4, 50257])
As we can see, the output tensor has the shape [2, 4, 50257], since we passed in 2 input texts with 4 tokens each. The last dimension, 50,257, corresponds to the vocabulary size of the tokenizer. In the next section, we will see how to convert each of these 50,257-dimensional output vectors back into tokens.
Before we move on to the next section and code the function that converts the model outputs into text, let's spend a bit more time with the model architecture itself and analyze its size. 在我们转到下一节并编写将模型输出转换为文本的函数之前,让我们花更多时间在模型架构本身上并分析其大小。
Using the numel() method, short for "number of elements," we can collect the total number of parameters in the model's parameter tensors: 使用 numel()方法,即"元素数量",我们可以收集模型参数张量中的总参数数量:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}") 打印(f"总参数数量: {total_params:,}")
The result is as follows: 结果如下:
Total number of parameters: 163,009,536
Now, a curious reader might notice a discrepancy. Earlier, we spoke of initializing a 124 million parameter GPT model, so why is the actual number of parameters 163 million, as shown in the preceding code output? 现在,好奇的读者可能会注意到一个差异。早些时候,我们谈到初始化一个 1.24 亿参数的 GPT 模型,那么为什么实际参数数量为 1.63 亿,如前述代码输出所示呢?
The reason is a concept called weight tying that is used in the original GPT-2 architecture, which means that the original GPT-2 architecture is reusing the weights from the token embedding layer in its output layer. To understand what this means, let's take a look at the shapes of the token embedding layer and linear output layer that we initialized on the model via the GPTModel earlier: 原因是一个称为权重绑定的概念,它被用于原始的 GPT-2 架构,这意味着原始的 GPT-2 架构在其输出层中重复使用了来自令牌嵌入层的权重。为了理解这意味着什么,让我们来看一下我们通过 GPTModel 在模型上初始化的令牌嵌入层和线性输出层的形状:
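For example, we can print the weight shapes of both layers; tok_emb and out_head are the attribute names from the GPTModel implementation above:

print("Token embedding layer shape:", model.tok_emb.weight.shape)
print("Output layer shape:", model.out_head.weight.shape)

Both print torch.Size([50257, 768]), that is, one 768-dimensional row per token in the vocabulary.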
The token embedding and output layers are both very large because they each contain 50,257 rows, one per token in the tokenizer's vocabulary. Let's remove the output layer parameter count from the total GPT-2 model count according to the weight tying:
total_params_gpt2 = total_params - sum(p.numel() for p in model.out_head.parameters())
print(f"Number of trainable parameters considering weight tying: {total_params_gpt2:,}")
The output is as follows: 输出如下:
Number of trainable parameters considering weight tying: 124,412,160 考虑权重绑定的可训练参数数量:124,412,160
As we can see, the model is now only 124 million parameters large, matching the original size of the GPT-2 model. 正如我们所见,该模型现在只有 1.24 亿个参数,与 GPT-2 模型的原始大小相匹配。
Weight tying reduces the overall memory footprint and computational complexity of the model. However, in my experience, using separate token embedding and output layers results in better training and model performance; hence, we are using separate layers in our GPTModel implementation. The same is true for modern LLMs. However, we will revisit and implement the weight tying concept later in chapter 6 when we load the pretrained weights from OpenAI. 权重绑定可以减少模型的整体内存占用和计算复杂度。然而,根据我的经验,使用独立的标记嵌入和输出层可以获得更好的训练和模型性能;因此,在我们的 GPTModel 实现中使用了独立的层。现代LLMs也是如此。但是,我们将在第 6 章中重新访问并实现权重绑定概念,届时我们将加载来自 OpenAI 的预训练权重。
EXERCISE 4.1 NUMBER OF PARAMETERS IN FEED FORWARD AND ATTENTION MODULES 第 4.1 练习 前馈和注意力模块中的参数数量
Calculate and compare the number of parameters that are contained in the feed forward module and those that are contained in the multi-head attention module. 计算并比较前馈模块和多头注意力模块中包含的参数数量。
Lastly, let us compute the memory requirements of the 163 million parameters in our GPTModel object: 最后,让我们计算 GPTModel 对象中 1.63 亿参数的内存需求:
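A minimal sketch of this computation, corresponding to the annotations #A and #B below, is:

total_size_bytes = total_params * 4               #A
total_size_mb = total_size_bytes / (1024 * 1024)  #B
print(f"Total size of the model: {total_size_mb:.2f} MB")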
#A Calculate the total size in bytes (assuming float32, 4 bytes per parameter)
#B Convert to megabytes
The result is as follows: 结果如下:
Total size of the model: 621.83 MB 模型总大小:621.83 MB
In conclusion, by calculating the memory requirements for the 163 million parameters in our GPTModel object and assuming each parameter is a 32-bit float taking up 4 bytes, we find that the total size of the model amounts to 621.83 MB, illustrating the relatively large storage capacity required to accommodate even relatively small LLMs.
In this section, we implemented the GPTModel architecture and saw that it outputs numeric tensors of shape [batch_size, num_tokens, vocab_size]. In the next section, we will write the code to convert these output tensors into text. 在本节中,我们实现了 GPTModel 架构,并发现它输出形状为 [batch_size, num_tokens, vocab_size] 的数字张量。在下一节中,我们将编写代码将这些输出张量转换为文本。
EXERCISE 4.2 INITIALIZING LARGER GPT MODELS
In this chapter, we initialized a 124 million parameter GPT model, which is known as "GPT-2 small." Without making any code modifications besides updating the configuration file, use the GPTModel class to implement GPT-2 medium (using 1024-dimensional embeddings, 24 transformer blocks, 16 multi-head attention heads), GPT-2 large (1280-dimensional embeddings, 36 transformer blocks, 20 multi-head attention heads), and GPT-2 XL (1600-dimensional embeddings, 48 transformer blocks, 25 multi-head attention heads). As a bonus, calculate the total number of parameters in each GPT model.
4.7 Generating text 4.7 生成文本
In this final section of this chapter, we will implement the code that converts the tensor outputs of the GPT model back into text. Before we get started, let's briefly review how a generative model like an LLM generates text one word (or token) at a time, as shown in Figure 4.16. 在本章的最后一部分,我们将实现将 GPT 模型的张量输出转换回文本的代码。在开始之前,让我们简要回顾一下如图 4.16 所示的生成模型(如 LLM)如何一个词(或令牌)一个词地生成文本。
Figure 4.16 This diagram illustrates the step-by-step process by which an LLM generates text, one token at a time. Starting with an initial input context ("Hello, I am"), the model predicts a subsequent token during each iteration, appending it to the input context for the next round of prediction. As shown, the first iteration adds "a", the second "model", and the third "ready", progressively building the sentence. 图 4.16 此图说明了一个LLM如何逐步生成文本,一个令牌接一个令牌地进行。从一个初始输入上下文("Hello, I am")开始,模型在每次迭代中预测一个后续的令牌,并将其附加到下一轮预测的输入上下文中。如图所示,第一次迭代添加了"a",第二次添加了"model",第三次添加了"ready",逐步构建了这个句子。
Figure 4.16 illustrates the step-by-step process by which a GPT model generates text given an input context, such as "Hello, I am," on a big-picture level. With each iteration, the input context grows, allowing the model to generate coherent and contextually appropriate text. By the 6th iteration, the model has constructed a complete sentence: "Hello, I am a model ready to help." 图 4.16 说明了 GPT 模型在给定输入上下文(如"Hello, I am")的情况下如何逐步生成文本的过程。随着每一次迭代,输入上下文会增加,使模型能够生成连贯且与上下文相适应的文本。到第 6 次迭代时,该模型已构建完成一个完整的句子:"Hello, I am a model ready to help."
In the previous section, we saw that our current GPTModel implementation outputs tensors with shape [batch_size, num_token, vocab_size]. Now, the question is, how does a GPT model go from these output tensors to the generated text shown in Figure 4.16 ? 在上一节中,我们看到我们当前的 GPTModel 实现输出形状为 [batch_size, num_token, vocab_size] 的张量。那么,GPT 模型如何从这些输出张量生成图 4.16 所示的文本呢?
The process by which a GPT model goes from output tensors to generated text involves several steps, as illustrated in Figure 4.17. These steps include decoding the output tensors, selecting tokens based on a probability distribution, and converting these tokens into human-readable text. 图 4.17 所示,GPT 模型将输出张量转换为生成的文本需要经历几个步骤。这些步骤包括解码输出张量、根据概率分布选择标记、并将这些标记转换为可读文本。
Figure 4.17 details the mechanics of text generation in a GPT model by showing a single iteration in the token generation process. The process begins by encoding the input text into token IDs, which are then fed into the GPT model. The outputs of the model are then converted back into text and appended to the original input text. 图 4.17 详细说明了 GPT 模型中文本生成的机制,展示了单次令牌生成过程。该过程首先将输入文本编码为令牌 ID,然后将其输入 GPT 模型。模型的输出随后被转换回文本,并附加到原始输入文本之后。
The next-token generation process detailed in Figure 4.17 illustrates a single step where the GPT model generates the next token given its input. 图 4.17 中描述的下一个令牌生成过程说明了 GPT 模型根据其输入生成下一个令牌的单个步骤。
In each step, the model outputs a matrix with vectors representing potential next tokens. The vector corresponding to the next token is extracted and converted into a probability distribution via the softmax function. Within the vector containing the resulting probability scores, the index of the highest value is located, which translates to the token ID. This token ID is then decoded back into text, producing the next token in the sequence. Finally, this token is appended to the previous inputs, forming a new input sequence for the subsequent iteration. This step-by-step process enables the model to generate text sequentially, building coherent phrases and sentences from the initial input context. 在每个步骤中,模型输出一个包含代表潜在下一个令牌的向量的矩阵。提取对应于下一个令牌的向量,并通过 softmax 函数转换为概率分布。在包含结果概率分数的向量中,找到最高值的索引,这对应于令牌 ID。然后将此令牌 ID 解码回文本,产生序列中的下一个令牌。最后,将此令牌添加到先前的输入中,形成后续迭代的新输入序列。这种逐步的过程使模型能够顺序生成文本,从初始输入上下文构建连贯的短语和句子。
In practice, we repeat this process over many iterations, such as shown in Figure 4.16 earlier, until we reach a user-specified number of generated tokens. 在实践中,我们会重复这个过程多次迭代,如图 4.16 所示,直到达到用户指定的生成令牌数。
In code, we can implement the token-generation process as follows: 在代码中,我们可以按照以下方式实现令牌生成过程:
Listing 4.8 A function for the GPT model to generate text 代码清单 4.8 GPT 模型生成文本的函数
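A minimal greedy-decoding implementation consistent with the annotations #A through #F below might look as follows; the parameter names are assumptions:

def generate_text_simple(model, idx, max_new_tokens, context_size):  #A
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]  #B
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]  #C
        probas = torch.softmax(logits, dim=-1)  #D
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  #E
        idx = torch.cat((idx, idx_next), dim=1)  #F
    return idx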
#A idx is a (batch, n_tokens) array of indices in the current context #A idx 是当前上下文中的(批次, n_tokens)索引数组
#B Crop the current context if it exceeds the supported context size, e.g., if the LLM supports only 5 tokens and the context size is 10, then only the last 5 tokens are used as context
#C Focus only on the last time step, so that (batch, n_token, vocab_size) becomes (batch, vocab_size) #C 仅关注最后一个时间步骤,使(batch, n_token, vocab_size)变为(batch, vocab_size)
#D probas has shape (batch, vocab_size) #D 概率分布具有(批次,词汇量)的形状
#E idx_next has shape (batch, 1) #E idx_next 的形状为(batch, 1)
#F Append sampled index to the running sequence, where idx has shape (batch, n_tokens +1 ) 将采样的索引附加到运行序列中,其中 idx 的形状为 (batch, n_tokens +1)
In the preceding code, the generate_text_simple function uses a softmax function to convert the logits into a probability distribution, from which we identify the position with the highest value via torch.argmax. The softmax function is monotonic, meaning it preserves the order of its inputs when transformed into outputs. So, in practice, the softmax step is redundant since the position with the highest score in the softmax output tensor is the same position as in the logits tensor. In other words, we could apply the torch.argmax function to the logits tensor directly and get identical results. However, we coded the conversion to illustrate the full process of transforming logits to probabilities, which can add additional intuition, such as that the model generates the most likely next token, which is known as greedy decoding.
In the next chapter, when we will implement the GPT training code, we will also introduce additional sampling techniques where we modify the softmax outputs such that the model doesn't always select the most likely token, which introduces variability and creativity in the generated text. 在下一章中,当我们实施 GPT 训练代码时,我们还将引入其他采样技术,其中我们修改了 softmax 输出,使得模型并不总是选择最可能的令牌,这在生成的文本中引入了可变性和创造性。
This process of generating one token ID at a time and appending it to the context using the generate_text_simple function is further illustrated in Figure 4.18. (The token ID generation process for each iteration is detailed in Figure 4.17.)
Figure 4.18 An illustration showing six iterations of a token prediction cycle, where the model takes a sequence of initial token IDs as input, predicts the next token, and appends this token to the input sequence for the next iteration. (The token IDs are also translated into their corresponding text for better understanding.) 图 4.18 展示了令牌预测循环的六个迭代过程。模型接收一系列初始令牌 ID 作为输入,预测下一个令牌,并将该令牌附加到输入序列中,供下一次迭代使用。(令牌 ID 也被翻译成相应的文本,以便更好地理解。)
As shown in Figure 4.18, we generate the token IDs in an iterative fashion. For instance, in iteration 1, the model is provided with the tokens corresponding to "Hello , I am", predicts the next token (with ID 257, which is "a"), and appends it to the input. This process is repeated until the model produces the complete sentence "Hello, I am a model ready to help." after six iterations. 如图 4.18 所示,我们以迭代的方式生成令牌 ID。例如,在第 1 次迭代中,模型被提供"Hello, I am"对应的令牌,预测下一个令牌(ID 为 257,即"a"),并将其附加到输入中。该过程重复进行,直至模型在第六次迭代后生成完整的句子"Hello, I am a model ready to help."。
Let's now try out the generate_text_simple function with the "Hello, I am" context as model input, as shown in Figure 4.18, in practice. 现在让我们尝试使用"Hello, I am"作为模型输入,在实践中运行 generate_text_simple 函数,如图 4.18 所示。
First, we encode the input context into token IDs: 首先,我们将输入上下文编码为令牌 ID:
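For example, using the tiktoken BPE tokenizer from chapter 2; the variable names here are assumptions:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print("encoded:", encoded)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # Add the batch dimension
print("encoded_tensor.shape:", encoded_tensor.shape)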
Next, we put the model into .eval() mode, which disables random components like dropout, which are only used during training, and use the generate_text_simple function on the encoded input tensor: 接下来,我们将模型置于.eval()模式,这将关闭诸如 dropout 之类的随机组件,这些组件仅在训练期间使用,并对编码的输入张量使用 generate_text_simple 函数:
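A sketch of these steps, assuming the encoded_tensor from above and generating six new tokens, is:

model.eval()  # Disable dropout and other random components used only during training
out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT_CONFIG_124M["context_length"]
)
decoded_text = tokenizer.decode(out.squeeze(0).tolist())  # Remove the batch dimension and detokenize
print(decoded_text)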
The model output in text format is as follows: 模型输出的文本格式如下:
Hello, I am Featureiman Byeswickattribute argue 你好,我是 Featureiman Byeswickattribute 争论
As we can see, based on the preceding output, the model generated gibberish, which is not at all like the coherent text shown in Figure 4.18. What happened? The reason why the model is unable to produce coherent text is that we haven't trained it yet. So far, we just implemented the GPT architecture and initialized a GPT model instance with initial random weights. 根据前面的输出,我们可以看到,该模型生成了毫无意义的内容,这与图 4.18 中所示的连贯文本完全不同。这是为什么呢?模型无法生成连贯的文本是因为我们还没有对其进行训练。到目前为止,我们只是实现了 GPT 架构,并用随机初始权重初始化了一个 GPT 模型实例。
Model training is a large topic in itself, and we will tackle it in the next chapter. 模型训练本身就是一个很大的话题,我们将在下一章中讨论它。
EXERCISE 4.3 USING SEPARATE DROPOUT PARAMETERS 练习 4.3 使用独立的 dropout 参数
At the beginning of this chapter, we defined a global "drop_rate" setting in the GPT_CONFIG_124M dictionary to set the dropout rate in various places throughout the GPTModel architecture. Change the code to specify a separate dropout value for the various dropout layers throughout the model architecture. (Hint: there are three distinct places where we used dropout layers: the embedding layer, shortcut layer, and multi-head attention module.) 在本章开头,我们在 GPT_CONFIG_124M 字典中定义了全局"drop_rate"设置,用于在 GPTModel 架构的各个位置设置 dropout 率。请修改代码,为模型架构中的各个 dropout 层指定单独的 dropout 值。(提示:我们使用了 dropout 层的三个不同位置:嵌入层、快捷连接层和多头注意力模块。)
4.8 Summary 4.8 总结
Layer normalization stabilizes training by ensuring that each layer's outputs have a consistent mean and variance. 层归一化通过确保每一层的输出具有一致的均值和方差来稳定训练。
Shortcut connections are connections that skip one or more layers by feeding the output of one layer directly to a deeper layer, which helps mitigate the vanishing gradient problem when training deep neural networks, such as LLMs. 快捷连接是跳过一个或多个层的连接,通过将一个层的输出直接馈入更深层的层来实现,这有助于缓解训练深度神经网络(如LLMs)时梯度消失的问题。
Transformer blocks are a core structural component of GPT models, combining masked multi-head attention modules with fully connected feed-forward networks that use the GELU activation function. 变换器块是 GPT 模型的核心结构组件,结合了遮掩多头注意力模块和使用 GELU 激活函数的完全连接前馈网络。
GPT models are LLMs with many repeated transformer blocks that have millions to billions of parameters. GPT 模型由数以百万到数十亿参数的多个重复的 Transformer 块组成。
GPT models come in various sizes, for example, 124, 345, 762, and 1542 million parameters, which we can implement with the same GPTModel Python class.
The text generation capability of a GPT-like LLM involves decoding output tensors into human-readable text by sequentially predicting one token at a time based on a given input context. 类似 GPT 的文本生成能力涉及通过基于给定输入上下文逐个预测词元的方式将输出张量解码为可读文本。
Without training, a GPT model generates incoherent text, which underscores the importance of model training for coherent text generation, which is the topic of subsequent chapters. 没有训练,GPT 模型会生成不连贯的文本,这突出了模型训练对于生成连贯文本的重要性,这是后续章节的主题。
5
Pretraining on Unlabeled Data 无标签数据的预训练
This chapter covers 本章涵盖
Computing the training and validation set losses to assess the quality of LLMgenerated text during training 计算训练集和验证集损失以评估训练期间 LLM 生成文本的质量
Implementing a training function and pretraining the LLM 实施培训功能并预先训练LLM
Saving and loading model weights to continue training an LLM 保存和加载模型权重以继续训练一个LLM
Loading pretrained weights from OpenAI
In the previous chapters, we implemented the data sampling, attention mechanism and coded the LLM architecture. The core focus of this chapter is to implement a training function and pretrain the LLM, as illustrated in Figure 5.1. 在前几章中,我们实现了数据采样、注意力机制,并编码了LLM架构。本章的核心重点是实现一个训练函数并预训练LLM,如图 5.1 所示。
Figure 5.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset and finetuning it on a labeled dataset. This chapter focuses on pretraining the LLM, which includes implementing the training code, evaluating the performance, and saving and loading model weights. 图 5.1 编码LLM的三个主要阶段的心智模型:在一般文本数据集上对LLM进行预训练,然后在标记的数据集上微调。本章着重介绍了对LLM进行预训练的工作,包括实现训练代码、评估性能,以及保存和加载模型权重。
As illustrated in Figure 5.1, we will also learn about basic model evaluation techniques to measure the quality of the generated text, which is a requirement for optimizing the LLM during the training process. Moreover, we will discuss how to load pretrained weights, giving our LLM a solid starting point for finetuning in the upcoming chapters. 如图 5.1 所示,我们还将学习基本的模型评估技术来评估生成文本的质量,这是在训练过程中优化LLM的要求。此外,我们将讨论如何加载预训练权重,为我们的LLM在即将到来的章节中进行微调提供良好的起点。
WEIGHT PARAMETERS 重量参数
In the context of LLMs and other deep learning models, weights refer to the trainable parameters that the learning process adjusts. These weights are also known as weight parameters or simply parameters. In frameworks like PyTorch, these weights are stored in linear layers, for example, which we used to implement the multi-head attention module in chapter 3 and the GPTModel in chapter 4. After initializing a layer (new_layer = torch.nn.Linear(...)), we can access its weights through the .weight attribute, new_layer.weight. Additionally, for convenience, PyTorch allows direct access to all a model's trainable parameters, including weights and biases, through the method model.parameters(), which we will use later when implementing the model training. 在LLMs和其他深度学习模型的背景下,权重指的是学习过程调整的可训练参数。这些权重也被称为权重参数或简称参数。在 PyTorch 等框架中,这些权重存储在线性层中,例如我们在第 3 章中用于实现多头注意力模块,以及在第 4 章中用于实现 GPTModel。初始化一个层(new_layer = torch.nn.Linear(...))后,我们可以通过.weight 属性访问其权重,即 new_layer.weight。此外,为了方便起见,PyTorch 允许通过 model.parameters()方法直接访问模型的所有可训练参数,包括权重和偏差,我们将在实现模型训练时使用这种方法。
5.1 Evaluating generative text models 5.1 评估生成式文本模型
We begin this chapter by setting up the LLM for text generation based on code from the previous chapter and discuss basic ways to evaluate the quality of the generated text in this section. The content we cover in this section and the remainder of this chapter is outlined in Figure 5.2. 我们在这一章节中首先建立了基于前一章内容的文本生成的LLM,并在本节中探讨了评估生成文本质量的基本方法。本节和本章剩余部分的内容概要如图 5.2 所示。
Figure 5.2 An overview of the topics covered in this chapter. We begin by recapping the text generation from the previous chapter and implementing basic model evaluation techniques that we can use during the pretraining stage. 图 5.2 本章涵盖的主题概览。我们首先回顾前一章的文本生成,并实现基本的模型评估技术,这些技术可以在预训练阶段使用。
As shown in Figure 5.2, the next subsection recaps the text generation we set up at the end of the previous chapter before we dive into the text evaluation and calculation of the training and validation losses in the subsequent subsections. 如图 5.2 所示,下一个小节概括了我们在上一章结尾设置的文本生成,然后我们深入探讨了后续小节中的文本评估以及训练和验证损失的计算。
5.1.1 Using GPT to generate text 使用 GPT 生成文本
In this section, we set up the LLM and briefly recap the text generation process we implemented in chapter 4 . We begin by initializing the GPT model that we will evaluate and train in this chapter, using the GPTModel class and GPT_CONFIG_124M dictionary from chapter 4: 在本节中,我们设置了LLM,并简要回顾了我们在第 4 章中实现的文本生成过程。我们首先初始化将在本章中评估和训练的 GPT 模型,使用第 4 章中的 GPTModel 类和 GPT_CONFIG_124M 字典:
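A sketch of this setup is shown below, assuming the GPTModel class from chapter 4 can be imported, for example via the previous_chapters module used earlier; the dropout rate of 0.1 carried over from chapter 4 is also an assumption here. The annotations #A and #B refer to the notes that follow:

import torch
from previous_chapters import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,  #A
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,       #B
    "qkv_bias": False
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()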
#A We shorten the context length from 1024 to 256 tokens
#B It's possible and common to set dropout to 0.
Considering the GPT_CONFIG_124M dictionary, the only adjustment we have made compared to the previous chapter is reducing the context length (context_length) to 256 tokens. This modification reduces the computational demands of training the model, making it possible to carry out the training on a standard laptop computer. 考虑到 GPT_CONFIG_124M 字典,我们相比上一章唯一做出的调整是将上下文长度(context_length)减少到 256 个标记。这种修改减少了训练模型的计算需求,使得在标准笔记本电脑上进行训练成为可能。
Originally, the GPT-2 model with 124 million parameters was configured to handle up to 1,024 tokens. After the training process, at the end of this chapter, we will update the context size setting and load pretrained weights to work with a model configured for a 1,024 -token context length. 最初,拥有 1.24 亿参数的 GPT-2 模型被配置为处理最多 1,024 个标记。在训练过程结束后,我们将更新上下文大小设置,并加载预训练权重以与配置为 1,024 个标记上下文长度的模型一起工作。
Using the GPTModel instance, we adopt the generate_text_simple function introduced in the previous chapter and introduce two handy functions, text_to_token_ids and token_ids_to_text. These functions facilitate the conversion between text and token representations, a technique we will utilize throughout this chapter. To provide a clearer understanding, Figure 5.3 illustrates this process before we dive into the code.
Figure 5.3 Generating text involves encoding text into token IDs that the LLM processes into logit vectors. The logit vectors are then converted back into token IDs, detokenized into a text representation. 图 5.3 生成文本涉及将文本编码为令牌 ID,LLM会将其处理为 logit 向量。然后将 logit 向量转换回令牌 ID,并去标记为文本表示。
Figure 5.3 illustrates a three-step text generation process using a GPT model. First, the tokenizer converts input text into a series of token IDs, as discussed in chapter 2. Second, the model receives these token IDs and generates corresponding logits, which are vectors representing the probability distribution for each token in the vocabulary, as discussed in chapter 4. Third, these logits are converted back into token IDs, which the tokenizer decodes into human-readable text, completing the cycle from textual input to textual output. 图 5.3 说明了使用 GPT 模型的三步文本生成过程。首先,分词器将输入文本转换为一系列令牌 ID,正如第 2 章所讨论的。其次,模型接收这些令牌 ID 并生成相应的 logits,这些 logits 是代表每个词汇表中每个令牌的概率分布的向量,正如第 4 章所讨论的。第三,这些 logits 被转换回令牌 ID,分词器将其解码为人类可读的文本,完成从文本输入到文本输出的整个循环。
In code, we implement the text generation process as follows: 在代码中,我们将文本生成过程实现如下:
Listing 5.1 Utility functions for text to token ID conversion 5.1 节 文本到标记 ID 转换的实用函数
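A minimal version of these utilities, together with the generation call that produces the output shown below, might look as follows; the variable names and the number of generated tokens are assumptions:

import tiktoken

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    return torch.tensor(encoded).unsqueeze(0)  # Add the batch dimension

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # Remove the batch dimension
    return tokenizer.decode(flat.tolist())

tokenizer = tiktoken.get_encoding("gpt2")
start_context = "Every effort moves you"
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))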
Using the preceding code, the model generates the following text: 使用上述代码,该模型生成以下文本:
Output text:
Every effort moves you rentingetic wasn refres RexMeCHicular stren
Based on the output, it's clear the model isn't yet producing coherent text because it hasn't undergone training. To define what makes text "coherent" or "high quality," we have to implement a numerical method to evaluate the generated content. This approach will enable us to monitor and enhance the model's performance throughout its training process. 根据输出结果,很明显该模型尚未经过训练,因此无法产生连贯的文本。要定义什么是"连贯"或"高质量"的文本,我们需要实施数值方法来评估生成的内容。这种方法将使我们能够在训练过程中监控和提高模型的性能。
The following section introduces how we calculate a loss metric for the generated outputs. This loss serves as a progress and success indicator of the training progress. Furthermore, in subsequent chapters on finetuning LLMs, we will review additional methodologies for assessing model quality. 以下部分介绍了我们如何计算生成输出的损失指标。这种损失用作训练进度的进度和成功指标。此外,在后续关于微调的章节中,我们将回顾评估模型质量的其他方法。
5.1.2 Calculating the text generation loss 5.1.2 计算文本生成损失
This section explores techniques for numerically assessing text quality generated during training by calculating a so-called text generation loss. We go over this topic step-by-step with a practical example to make the concepts clear and applicable, beginning with a short recap of how the data is loaded from chapter 2 and how the text is generated via the generate_text_simple function from chapter 4. 本节探讨在训练过程中通过计算所谓的文本生成损失来数值评估生成文本质量的技术。我们会通过一个实际示例一步步介绍这个话题,使概念变得清晰且具有可操作性,从第 2 章中如何加载数据以及如何通过第 4 章中的 generate_text_simple 函数生成文本开始简单回顾。
Figure 5.4 illustrates the overall flow from input text to LLM-generated text using a fivestep procedure. 图 5.4 说明了从输入文本到使用五步程序生成LLM文本的整体流程。
Figure 5.4 For each of the 3 input tokens, shown on the left, we compute a vector containing probability scores corresponding to each token in the vocabulary. The index position of the highest probability score in each vector represents the most likely next token ID. These token IDs associated with the highest probability scores are selected and mapped back into a text that represents the text generated by the model. 图 5.4 对于左侧显示的 3 个输入标记,我们计算了一个向量,包含与词汇表中每个标记对应的概率得分。每个向量中概率得分最高的索引位置代表最可能的下一个标记 ID。与最高概率得分相关的这些标记 ID 被选中,映射回一个代表模型生成的文本。
The text generation process in Figure 5.4 outlines what the generate_text_simple function from chapter 4 does internally. We need to perform these same initial steps before we can compute a loss that measures the generated text quality later in this section. 图 5.4 中的文本生成过程概括了第 4 章中 generate_text_simple 函数内部的执行过程。在后续部分计算生成文本质量的损失之前,我们需要执行这些初始步骤。
Figure 5.4 outlines the text generation process with a small 7-token vocabulary to fit this image on a single page. However, our GPTModel works with a much larger vocabulary consisting of 50,257 words; hence, the token IDs in the following codes will range from 0 to 50,256 rather than 0 to 6 . 图 5.4 概述了使用小型 7 个词汇表的文本生成过程,以使该图像适合在单个页面上显示。然而,我们的 GPTModel 使用了更大的词汇表,包含 50,257 个单词;因此,以下代码中的令牌 ID 范围将从 0 到 50,256,而非 0 到 6。
Also, Figure 5.4 only shows a single text example ("every effort moves") for simplicity. In the following hands-on code example that implements the steps in Figure 5.4, we will work with two input examples ("every effort moves" and "I really like") as inputs for the GPT model: 另外,图 5.4 只显示了一个文本示例("every effort moves")以简化操作。在下面实施图 5.4 中步骤的动手代码示例中,我们将使用两个输入示例("every effort moves"和"I really like")作为 GPT 模型的输入。
Consider the two input examples, which have already been mapped to token IDs, corresponding to step 1 in Figure 5.4: 考虑两个输入示例,它们已被映射到令牌 ID,对应于图 5.4 中的第 1 步:
inputs = torch.tensor([[16833, 3626, 6100], # ["every effort moves",
[40, 1107, 588]]) # "I really like"]
Matching these inputs, the targets contain the token IDs we aim for the model to produce:
targets = torch.tensor([[3626, 6100, 345 ], # [" effort moves you",
                       [1107,  588, 11311]]) #  " really like chocolate"]
Note that the targets are the inputs but shifted one position forward, a concept we covered in chapter 2 during the implementation of the data loader. This shifting strategy is crucial for teaching the model to predict the next token in a sequence.
We now feed the inputs into the model to calculate the logit vectors for the two input examples, each comprising three tokens, and apply the softmax function to transform these logit values into probability scores, which corresponds to step 2 in Figure 5.4:
with torch.no_grad():  #A
    logits = model(inputs)
probas = torch.softmax(logits, dim=-1) # Probability of each token in vocabulary
print(probas.shape)
#A Disable gradient tracking since we are not training, yet #不跟踪渐变,因为我们现在没有在训练
The resulting tensor dimension of the probability score (probas) tensor is as follows: 概率得分(probas)张量的结果张量维度如下:
torch.Size([2, 3, 50257])
The first number, 2, corresponds to the two examples (rows) in the inputs, also known as batch size. The second number, 3, corresponds to the number of tokens in each input (row). Finally, the last number corresponds to the embedding dimensionality, which is determined by the vocabulary size, as discussed in previous chapters. 第一个数字 2 对应输入中的两个样本(行),也称为批大小。第二个数字 3 对应每个输入(行)中的令牌数。最后一个数字对应嵌入维度,正如前几章所讨论的,它由词汇表大小决定。
Following the conversion from logits to probabilities via the softmax function, the generate_text_simple function from chapter 4 then converts the resulting probability scores back into text, as illustrated in steps 3-5 in Figure 5.4. 经过通过 softmax 函数从 logits 转换为概率后,第 4 章的 generate_text_simple 函数将生成的概率得分转换回文本,如图 5.4 中步骤 3-5 所示。
We can implement steps 3 and 4 by applying the argmax function to the probability scores to obtain the corresponding token IDs: 我们可以通过将 argmax 函数应用到概率分数上来获取相应的标记 ID,从而实现步骤 3 和 4
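For example, with the probas tensor computed above:

token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)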
Given that we have 2 input batches, each containing 3 tokens, applying the argmax function to the probability scores (step 3 in Figure 5.4) yields 2 sets of outputs, each with 3 predicted token IDs: 鉴于我们有 2 个输入批次,每个批次包含 3 个令牌,将 argmax 函数应用于概率分数(图 5.4 中的步骤 3)将产生 2 组输出,每组都有 3 个预测的令牌 ID。
Token IDs:
tensor([[[16657], # First batch
[ 339],
[42826]],
[[49906], # Second batch
[29669],
[41751]]])
Finally, step 5 converts the token IDs back into text: 最后,第 5 步将令牌 ID 转换回文本:
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")
When we decode these tokens, we find that these output tokens are quite different from
the target tokens we want the model to generate:
Targets batch 1: effort moves you
Outputs batch 1: Armed heNetflix
The model produces random text that is different from the target text because it has not been trained yet. We now get to the part where we evaluate the performance of the model's generated text numerically via a so-called loss as illustrated in Figure 5.4. Not only is this useful for measuring the quality of the generated text, but it's also a building block for implementing the training function later, which we use to update the model's weight to improve the generated text. 该模型生成的随机文本与目标文本不同,因为它尚未接受过训练。我们现在进入评估模型生成文本性能的部分,通过所谓的损失函数数值化地体现,如图 5.4 所示。这不仅有助于评估生成文本的质量,也是实现后续训练函数的基础,我们会用它来更新模型权重以改善生成文本。
Figure 5.5 We now implement the text evaluation function in the remainder of this section. In the next section, we apply this evaluation function to the entire dataset we use for model training. 图 5.5 我们现在在本节的其余部分实现文本评估函数。在下一节中,我们将此评估函数应用于用于模型训练的整个数据集。
Part of the text evaluation process that we implement in the remainder of this section, as shown in Figure 5.5, is to measure "how far" the generated tokens are from the correct predictions (targets). The training function we implement later in this chapter will use this information to adjust the model weights to generate text that is more similar to (or ideally matches) the target text. 我们在本节剩余部分实现的文本评估过程(如图 5.5 所示)是衡量生成的标记与正确预测(目标)之间的"距离"。我们在本章后面实现的训练函数将利用这些信息调整模型权重,生成更接近(或理想情况下匹配)目标文本的文本。
The model training aims to increase the softmax probability in the index positions corresponding to the correct target token IDs, as illustrated in Figure 5.6. This softmax probability is also used in the evaluation metric we are implementing in the remainder of this section to numerically assess the model's generated outputs: the higher the probability in the correct positions, the better. 模型训练的目的是增加图 5.6 所示的正确目标标记 ID 位置的 softmax 概率。在本节剩余部分实施的评估指标中也使用了这个 softmax 概率来数字化评估模型生成的输出:正确位置的概率越高,效果越好。
Figure 5.6 Before training, the model produces random next-token probability vectors. The goal of model training is to ensure that the probability values corresponding to the highlighted target token IDs are maximized. 图 5.6 训练前,模型生成随机的下一个令牌概率向量。模型训练的目标是确保与高亮显示的目标令牌 ID 对应的概率值最大化。
Remember that Figure 5.6 displays the softmax probabilities for a compact 7-token vocabulary to fit everything into a single figure. This implies that the starting random values will hover around 1/7, which equals approximately 0.14.
However, the vocabulary we are using for our GPT-2 model has 50,257 tokens, so most of the initial probabilities will hover around 0.00002 via 1/50,257.
For each of the two input texts, we can print the initial softmax probability scores corresponding to the target tokens via the following code: 对于两个输入文本中的每一个,我们可以通过以下代码打印目标标记对应的初始 softmax 概率得分:
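One way to index out these probabilities is shown below; the explicit token positions [0, 1, 2] match the three tokens per example, and text_idx is an illustrative name:

text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)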
The 3 target token ID probabilities for each batch are as follows: 每批次 3 个目标令牌 ID 概率如下:
Text 1: tensor([7.4541e-05, 3.1061e-05, 1.1563e-05])
Text 2: tensor([1.0337e-05, 5.6776e-05, 4.7559e-06])
The goal of training an LLM is to maximize these values, aiming to get them as close to a probability of 1 . This way, we ensure the LLM consistently picks the target tokenessentially the next word in the sentence-as the next token it generates. 训练LLM的目标是最大化这些值,目标是将它们尽可能接近 1 的概率。这样,我们就可以确保LLM始终选择目标令牌(即句子中的下一个词)作为它生成的下一个令牌。
BACKPROPAGATION 反向传播
How do we maximize the softmax probability values corresponding to the target tokens? The big picture is that we update the model weights so that the model outputs higher values for the respective token IDs we want to generate. The weight update is done via a process called backpropagation, a standard technique for training deep neural networks (see sections A. 3 to A. 7 in Appendix A for more details about backpropagation and model training). 如何最大化目标标记对应的 softmax 概率值?大局部是,我们更新模型权重以使模型输出更高的值对应于我们想要生成的标记 ID。权重更新是通过称为反向传播的过程完成的,这是训练深度神经网络的标准技术(更多关于反向传播和模型训练的详细信息请参见附录 A 中的 A. 3 到 A. 7 节)。
Backpropagation requires a loss function, which calculates the difference between the model's predicted output (here, the probabilities corresponding to the target token IDs) and the actual desired output. This loss function measures how far off the model's predictions are from the target values. 反向传播需要一个损失函数,它计算模型预测输出(这里是与目标令牌 ID 对应的概率)与实际所需输出之间的差异。这个损失函数衡量模型预测值与目标值相差多远。
In the remainder of this section, we calculate the loss for the probability scores of the two example batches, target_probas_1 and target_probas_2. The main steps are illustrated in Figure 5.7.
Figure 5.7 Calculating the loss involves several steps. Steps 1 to 3 calculate the token probabilities corresponding to the target tensors. These probabilities are then transformed via a logarithm and averaged in steps 4-6. 图 5.7 计算损失涉及几个步骤。步骤 1 到 3 计算对应于目标张量的标记概率。然后在步骤 4-6 通过对数变换和取平均值来转换这些概率。
Since we already applied steps 1-3 listed in Figure 5.7 to obtain target_probas_1 and target_probas_2, we proceed with step 4, applying the logarithm to the probability scores: 由于我们已经应用了图 5.7 中列出的步骤 1-3 来获得 target_probas_1 和 target_probas_2,我们现在进行第 4 步,对概率分数应用对数运算。
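For example, the two probability tensors can be concatenated and log-transformed in one step; log_probas is an illustrative name:

log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)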
Working with logarithms of probability scores is more manageable in mathematical optimization than handling the scores directly. This topic is outside the scope of this book, but I've detailed it further in a lecture, which is linked in the reference section in appendix B. 用对数概率得分来工作比直接处理得分更容易管理。这个话题不在本书的范围之内,但我在附录 B 的参考部分链接了一个讲座,对此进行了更详细的介绍。
Next, we combine these log probabilities into a single score by computing the average (step 5 in Figure 5.7): 接下来,我们将这些对数概率组合成一个总分,计算它们的平均值(图 5.7 中的第 5 步):
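A sketch of this averaging step:

avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)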
The resulting average log probability score is as follows: 得到的平均对数概率评分如下:
tensor(-10.7940)
The goal is to get the average log probability as close to 0 as possible by updating the model's weights as part of the training process, which we will implement later in section 5.2 . 目标是通过在第 5.2 节中实施的训练过程更新模型权重,尽可能将平均对数概率接近 0。
However, in deep learning, the common practice isn't to push the average log probability up to 0 but rather to bring the negative average log probability down to 0. The negative average log probability is simply the average log probability multiplied by -1, which corresponds to step 6 in Figure 5.7:
This negative value, -10.7940 turned into 10.7940, is known as the cross entropy loss in deep learning.
PyTorch comes in handy here, as it already has a built-in cross_entropy function that takes care of all these 6 steps in Figure 5.7 for us. PyTorch 在这里很有帮助,因为它已经内置了一个 cross_entropy 函数,可以帮我们处理图 5.7 中的所有 6 个步骤。
CROSS ENTROPY LOSS 交叉熵损失
At its core, the cross entropy loss is a popular measure in machine learning and deep learning that quantifies the difference between two probability distributions: typically, the true distribution of labels (here, tokens in a dataset) and the predicted distribution from a model (for instance, the token probabilities generated by an LLM).
In the context of machine learning and specifically in frameworks like PyTorch, the cross_entropy function computes this measure for discrete outcomes, which is similar to the negative average log probability of the target tokens given the model's generated token probabilities, making the terms cross entropy and negative average log probability related and often used interchangeably in practice. 在机器学习的背景下,尤其是在像 PyTorch 这样的框架中,cross_entropy 函数计算离散结果的这种度量,这类似于给定模型生成的 token 概率,目标 token 的负平均对数概率,因此交叉熵和负平均对数概率这两个术语在实践中往往可以互换使用。
Before we apply the cross entropy function, let's briefly recall the shape of the logits and target tensors: 在应用交叉熵函数之前,让我们简单回顾一下 logits 和 target 张量的形状:
print("Logits shape:", logits.shape)
print("Targets shape:", targets.shape)
The resulting shapes are as follows:
Logits shape: torch.Size([2, 3, 50257])
Targets shape: torch.Size([2, 3])
As we can see, the logits tensor has three dimensions: batch size, number of tokens, and vocabulary size. The targets tensor has two dimensions: batch size and number of tokens. 正如我们所见,logits 张量有三个维度:batch size、token 数量和词汇表大小。targets 张量有两个维度:batch size 和 token 数量。
For the cross_entropy function in PyTorch, we want to flatten these tensors by combining them over the batch dimension:
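A minimal way to perform this flattening, given the shapes above (the variable names logits_flat and targets_flat are illustrative):

logits_flat = logits.flatten(0, 1)   # [2, 3, 50257] -> [6, 50257]
targets_flat = targets.flatten()     # [2, 3] -> [6]
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)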
Remember that the targets are the token IDs we want the LLM to generate, and the logits contain the unscaled model outputs before they enter the softmax function to obtain the probability scores. 请记住,目标是我们希望 LLM生成的令牌 ID,而对数包含在通过 softmax 函数获得概率分数之前的未缩放模型输出。
Previously, we applied the softmax function, selected the probability scores corresponding to the target IDs, and computed the negative average log probabilities. PyTorch's cross_entropy function will take care of all these steps for us: 之前,我们应用了 softmax 函数,选择了与目标 ID 对应的概率得分,并计算了负平均对数概率。PyTorch 的 cross_entropy 函数将为我们处理所有这些步骤。
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)
The resulting loss is the same that we obtained previously when applying the individual steps shown in Figure 5.7 manually: 所得到的损失与我们之前手动应用图 5.7 所示的各个步骤时获得的结果相同。
tensor(10.7940)
PERPLEXITY 疑惑
Perplexity is a measure often used alongside cross entropy loss to evaluate the performance of models in tasks like language modeling. It can provide a more interpretable way to understand the uncertainty of a model in predicting the next token in a sequence. 困惑度是一个经常与交叉熵损失一起使用的衡量标准,用于评估模型在如语言建模等任务中的性能。它可以提供一种更具可解释性的方式来理解模型在预测序列中下一个令牌的不确定性。
Perplexity measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset. Similar to the loss, a lower perplexity indicates that the model predictions are closer to the actual distribution. 困惑度度量了模型预测的概率分布与数据集中单词的实际分布之间的吻合程度。与损失类似,困惑度越低,模型预测与实际分布越接近。
Perplexity can be calculated as perplexity = torch.exp(loss), which returns tensor(48725.8203) when applied to the previously calculated loss.
Perplexity is often considered more interpretable than the raw loss value because it signifies the effective vocabulary size about which the model is uncertain at each step. In the given example, this would translate to the model being unsure about which among roughly 48,726 tokens in the vocabulary to generate as the next token.
In this section, we calculated the loss for two small text inputs for illustration purposes. In the next section, we apply the loss computation to the entire training and validation sets. 在本节中,我们计算了两个小文本输入的损失,仅作为说明。在下一节中,我们将损失计算应用于整个训练集和验证集。
5.1.3 Calculating the training and validation set losses 5.1.3 计算训练集和验证集损失
In this section, we first prepare the training and validation datasets that we will use to train the LLM later in this chapter. Then, we calculate the cross entropy for the training and validation sets, as illustrated in Figure 5.8, which is an important component of the model training process. 在本节中,我们首先准备用于训练本章稍后介绍的LLM的训练和验证数据集。然后,我们计算训练和验证集的交叉熵,如图 5.8 所示,这是模型训练过程的一个重要组成部分。
Figure 5.8 After computing the cross entropy loss in the previous section, we now apply this loss computation to the entire text dataset that we will use for model training. 图 5.8 在上一节中计算交叉熵损失之后,我们现在将此损失计算应用于我们将用于模型训练的整个文本数据集。
To compute the loss on the training and validation datasets as illustrated in Figure 5.8, we use a very small text dataset, the "The Verdict" short story by Edith Wharton, which we have already worked with in chapter 2. By selecting a text from the public domain, we circumvent any concerns related to usage rights. Additionally, the reason why we use such a small dataset is that it allows for the execution of code examples on a standard laptop computer in a matter of minutes, even without a high-end GPU, which is particularly advantageous for educational purposes. 如图 5.8 所示,为了计算训练和验证数据集上的损失,我们使用了一个非常小的文本数据集,"The Verdict"短篇故事,这是我们在第 2 章中已经使用过的。通过选择公共领域的文本,我们避免了与使用权相关的任何顾虑。此外,之所以使用如此小的数据集,是因为它允许在标准笔记本电脑上在几分钟内执行代码示例,即使没有高端 GPU,这对于教育目的来说特别有利。
Interested readers can also use the supplementary code of this book to prepare a larger-scale dataset consisting of more than 60,000 public domain books from Project Gutenberg and train an LLM on these (see appendix D for details).
THE COST OF PRETRAINING LLMS 预训练的成本 LLMS
To put the scale of our project into perspective, consider the training of the 7 billion parameter Llama 2 model, a relatively popular openly available LLM. This model required 184,320 GPU hours on expensive A100 GPUs, processing 2 trillion tokens. At the time of writing, running an 8xA100 cloud server on AWS costs around $30 per hour. A rough estimate puts the total training cost of such an LLM at around $690,000 (calculated as 184,320 hours divided by 8, then multiplied by $30).
The following code loads the "The Verdict" short story we used in chapter 2 : 以下代码加载了我们在第 2 章中使用的短篇小说"The Verdict":
file_path = "the-verdict.txt"
with open(file_path, "r", encoding="utf-8") as file:
text_data = file.read()
After loading the dataset, we can check the number of characters and tokens in the dataset: 载入数据集后,我们可以检查数据集中的字符和令牌数量
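A short sketch of this check, assuming the GPT-2 BPE tokenizer from chapter 2 (via the tiktoken library):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")  # the BPE tokenizer used in chapter 2
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))
print("Characters:", total_characters)
print("Tokens:", total_tokens)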
With just 5,145 tokens, the text might seem too small to train an LLM, but as mentioned earlier, it's for educational purposes so that we can run the code in minutes instead of weeks. Plus, we will be loading pretrained weights from OpenAI into our GPTModel code at the end of this chapter. 尽管只有 5,145 个标记,这篇文本可能看起来太小,无法训练一个LLM,但如前所述,这是出于教育目的,这样我们就可以在几分钟内而不是几周内运行代码。另外,我们将在本章末尾将 OpenAI 预训练权重加载到我们的 GPTModel 代码中。
Next, we divide the dataset into a training and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training. This process is visualized in Figure 5.9 . 接下来,我们将数据集分为训练集和验证集,并使用第 2 章中的数据加载器准备用于训练的批次。该过程如图 5.9 所示。
For the data loaders we create below, we use a stride equal to the context length so that the training batches do not overlap.
Figure 5.9 When preparing the data loaders, we split the input text into training and validation set portions. Then, we tokenize the text (only shown for the training set portion for simplicity) and divide the tokenized text into chunks of a user-specified length (here 6). Finally, we shuffle the rows and organize the chunked text into batches (here, batch size 2), which we can use for model training. 图 5.9 在准备数据加载器时,我们将输入文本划分为训练集和验证集部分。然后,我们对文本进行标记化(仅显示训练集部分以简化)并将标记化文本划分为用户指定长度的块(这里为 6)。最后,我们打乱行并将分块文本组织成批次(这里为批量大小 2),以用于模型训练。
For visualization purposes, Figure 5.9 uses a max_length=6 due to spatial constraints. However, for the actual data loaders we are implementing, we set the max_length equal to the 256-token context length that the LLM supports so that the LLM sees longer texts during training. 为了可视化目的,图 5.9 使用 max_length=6,这是由于空间限制。但是对于我们正在实现的实际数据加载器,我们将 max_length 设置为LLM支持的 256-token 上下文长度,以便LLM在训练期间看到更长的文本。
TRAINING WITH VARIABLE LENGTHS 变长度训练
We are training the model with training data presented in similarly-sized chunks for simplicity and efficiency. However, in practice, it can also be beneficial to train an LLM with variable-length inputs to help the LLM generalize better across different types of inputs when it is being used.
To implement the data splitting and loading visualized in Figure 5.9, we first define a train_ratio to use 90% of the data for training and the remaining 10% as validation data for model evaluation during training:
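A minimal sketch of this split, using the 90/10 ratio mentioned above:

train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]   # first 90% of the text for training
val_data = text_data[split_idx:]     # remaining 10% for validation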
Using the train_data and val_data subsets, we can now create the respective data loader reusing the create_dataloader_v1 code from chapter 2 : 利用 train_data 和 val_data 子集,我们现在可以重复使用第 2 章中的 create_dataloader_v1 代码创建相应的数据加载器:
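The following is a sketch of how these loaders might be created; it assumes that create_dataloader_v1 from chapter 2 accepts the batch_size, max_length, stride, shuffle, drop_last, and num_workers arguments shown here:

torch.manual_seed(123)
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],  # stride equal to the context length, so batches do not overlap
    drop_last=True,
    shuffle=True,
    num_workers=0
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)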
We used a relatively small batch size in the preceding code to reduce the computational resource demand because we were working with a very small dataset. In practice, training LLMs with batch sizes of 1,024 or larger is not uncommon. 在前面的代码中,我们使用了相对较小的批次大小来减少计算资源需求,因为我们在处理一个非常小的数据集。在实践中,使用 1,024 或更大的批次大小进行训练并不罕见。
As an optional check, we can iterate through the data loaders to ensure that they were created correctly: 作为一项可选的检查,我们可以遍历数据加载器,以确保它们被正确创建:
print("Train loader:")
for x, y in train_loader:
print(x.shape, y.shape)
print("\nValidation loader:")
for x, y in val_loader:
print(x.shape, y.shape)
Based on the preceding code output, we have 9 training set batches with 2 samples and 256 tokens each. Since we allocated only 10% of the data for validation, there is only one validation batch consisting of 2 input examples.
As expected, the input data (x) and target data (y) have the same shape (the batch size times the number of tokens in each batch) since the targets are the inputs shifted by one position, as discussed in chapter 2. 正如预期的那样,输入数据(x)和目标数据(y)具有相同的形状(批次大小乘以每批次中的 token 数量),因为目标数据是输入数据向右移动一个位置得到的,正如第 2 章所讨论的那样。
Next, we implement a utility function to calculate the cross entropy loss of a given batch returned via the training and validation loader: 接下来,我们实现一个效用函数来计算通过训练和验证加载器返回的给定批次的交叉熵损失:
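A minimal version of this utility function, consistent with the annotation below, could look like this:

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)    #A
    target_batch = target_batch.to(device)  #A
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()  # flatten over the batch dimension, as before
    )
    return loss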
#A The transfer to a given device allows us to move the input and target data to a GPU if one is available
We can now use this calc_loss_batch utility function, which computes the loss for a single batch, to implement the following calc_loss_loader function that computes the loss over all the batches sampled by a given data loader: 我们现在可以使用这个 calc_loss_batch 效用函数,它计算单个批次的损失,来实现以下 calc_loss_loader 函数,它计算给定数据加载器采样的所有批次的损失:
Listing 5.2 Function to compute the training and validation loss 代码清单 5.2 计算训练和验证损失的函数
def calc_loss_loader(data_loader, model, device, num_batches=None):
total_loss = 0.
if len(data_loader) == 0:
return float("nan")
elif num_batches is None:
num_batches = len(data_loader) #A
else:
num_batches = min(num_batches, len(data_loader)) #B
for i, (input_batch, target_batch) in enumerate(data_loader):
if i < num_batches:
loss = calc_loss_batch(input_batch, target_batch, model, device)
total_loss += loss.item()
#C
else:
break
return total_loss / num_batches #D
#A Iterate over all batches if no fixed num_batches is specified
#B Reduce the number of batches to match the total number of batches in the data loader if num_batches exceeds the number of batches in the data loader
#C Sum loss for each batch 每个批次的 #C 损失之和
#D Average the loss over all batches 平均所有批次的损失
By default, the calc_loss_loader function iterates over all batches in a given data loader, accumulates the loss in the total_loss variable, and then computes and averages the loss over the total number of batches. Alternatively, we can specify a smaller number of batches via num_batches to speed up the evaluation during model training.
Let's now see this calc_loss_loader function in action, applying it to the training and validation set loaders:
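A sketch of this evaluation step, matching the annotations below:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  #A
model.to(device)
with torch.no_grad():  #B
    train_loss = calc_loss_loader(train_loader, model, device)  #C
    val_loss = calc_loss_loader(val_loader, model, device)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)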
#A If you have a machine with a CUDA-supported GPU, the LLM will train on the GPU without making any changes to the code 如果您有一台配备 CUDA 支持的 GPU 的机器,LLM 将在 GPU 上进行训练,无需对代码进行任何更改
#B Disable gradient tracking for efficiency because we are not training, yet #B 禁用梯度跟踪以提高效率,因为我们目前未进行训练
#C Via the device setting, we ensure that the data is loaded onto the same device as the LLM model 通过 device 设置,我们确保数据被加载到与LLM模型相同的设备上
The resulting loss values are as follows: 最终的损失值如下:
Training loss: 10.98758347829183 训练损失: 10.98758347829183
The loss values are relatively high because the model has not yet been trained. For comparison, the loss approaches 0 if the model learns to generate the next tokens as they appear in the training and validation sets. 损失值相对较高,因为模型尚未训练好。对比而言,如果模型学会生成与训练集和验证集中出现的下一个标记相同的标记,损失就会趋近于 0。
Now that we have a way to measure the quality of the generated text, in the next section, we train the LLM to reduce this loss so that it becomes better at generating text, as illustrated in Figure 5.10. 现在我们有了衡量生成文本质量的方法,在下一节中,我们训练LLM以降低此损失,使其在生成文本方面变得更出色,如图 5.10 所示。
Figure 5.10 We have recapped the text generation process and implemented basic model evaluation techniques to compute the training and validation set losses. Next, we will go to the training functions and pretrain the LLM. 图 5.10 我们已经概括了文本生成过程并实现了基本的模型评估技术来计算训练和验证集的损失。接下来,我们将转到训练函数并预先训练LLM。
As shown in Figure 5.10, the next section focuses on pretraining the LLM. After model training, we implement alternative text generation strategies and save and load pretrained model weights. 如图 5.10 所示,下一部分重点关注预训练LLM。在模型训练之后,我们实施备用文本生成策略,并保存和加载预训练的模型权重。
5.2 Training an LLM 5.2 训练一个LLM
In this section, we finally implement the code for pretraining the LLM, our GPTModel. For this, we focus on a straightforward training loop, as illustrated in Figure 5.11, to keep the code concise and readable. However, interested readers can learn about more advanced techniques, including learning rate warmup, cosine annealing, and gradient clipping, in Appendix D, Adding Bells and Whistles to the Training Loop. 在这一部分,我们最终实现了用于预训练LLM(我们的 GPT 模型)的代码。为此,我们专注于一个简单明了的训练循环,如图 5.11 所示,以使代码简洁且易读。但是,有兴趣的读者可以在附录 D"为训练循环添加附加功能"中了解更多高级技术,如学习率预热、余弦退火和梯度裁剪。
Figure 5.11 A typical training loop for training deep neural networks in PyTorch consists of several steps, iterating over the batches in the training set for several epochs. In each loop, we calculate the loss for each training set batch to determine loss gradients, which we use to update the model weights so that the training set loss is minimized. 图 5.11 在 PyTorch 中训练深度神经网络的典型训练循环包括几个步骤,在训练集上迭代多个 epoch。在每个循环中,我们计算每个训练集批次的损失来确定损失梯度,然后使用这些梯度来更新模型权重,以最小化训练集损失。
The flowchart in Figure 5.11 depicts a typical PyTorch neural network training workflow, which we use for training an LLM. It outlines eight steps, starting with iterating over each epoch, processing batches, resetting and calculating gradients, updating weights, and concluding with monitoring steps like printing losses and generating text samples. If you are relatively new to training deep neural networks with PyTorch and any of these steps are unfamiliar, consider reading sections A.5 to A.8 in Appendix A, Introduction to PyTorch.
In code, we can implement this training flow via the following train_model_simple function: 在代码中,我们可以通过以下 train_model_simple 函数来实现这个训练流程:
Listing 5.3 The main function for pretraining LLMs 5.3 节 预训练的主要函数LLMs
tokens_seen += input_batch.numel()
#A Initialize lists to track losses and tokens seen #A 初始化列表以跟踪损失和观察到的令牌
#B Start the main training loop 开始主训练循环
#C Reset loss gradients from previous batch iteration #C 重置上一批次迭代的损失梯度
#D Calculate loss gradients #D 计算损失梯度
#E Update model weights using loss gradients 使用损失梯度更新模型权重
#F Optional evaluation step #F 可选评估步骤
#G Print a sample text after each epoch #G 在每个 epoch 后打印一个样本文本
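Putting the annotations above together, a minimal version of train_model_simple might look as follows; the exact function signature and the eval_freq-based evaluation schedule are assumptions based on how the function is used later in this section:

def train_model_simple(model, train_loader, val_loader, optimizer, device,
                       num_epochs, eval_freq, eval_iter, start_context, tokenizer):
    train_losses, val_losses, track_tokens_seen = [], [], []  #A
    tokens_seen, global_step = 0, -1

    for epoch in range(num_epochs):  #B
        model.train()
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  #C
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  #D
            optimizer.step()  #E
            tokens_seen += input_batch.numel()
            global_step += 1

            if global_step % eval_freq == 0:  #F
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        generate_and_print_sample(model, tokenizer, device, start_context)  #G

    return train_losses, val_losses, track_tokens_seen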
Note that the train_model_simple function we just created uses two functions we have not defined yet: evaluate_model and generate_and_print_sample. 注意,我们刚刚创建的 train_model_simple 函数使用了两个我们还没有定义的函数:evaluate_model 和 generate_and_print_sample。
The evaluate_model function corresponds to step 7 in Figure 5.11. It prints the training and validation set losses after each model update so we can evaluate whether the training improves the model. 评估_模型函数对应图 5.11 中的第 7 步。它在每次模型更新后打印训练集和验证集的损失,以便我们评估训练是否改善了模型。
More specifically, the evaluate_model function calculates the loss over the training and validation sets while ensuring the model is in evaluation mode, with gradient tracking and dropout disabled:
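A minimal version of evaluate_model, consistent with the annotations below:

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()  #A
    with torch.no_grad():  #B
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss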
#A Dropout is disabled during evaluation for stable, reproducible results 在评估时,dropout 无效以获得稳定、可重复的结果
#B Disable gradient tracking, which is not required during evaluation, to reduce the computational overhead 关闭梯度跟踪,这在评估期间是不需要的,以减少计算开销
Similar to evaluate_model, the generate_and_print_sample function is a convenience function that we use to track whether the model improves during the training. In particular, the generate_and_print_sample function takes a text snippet (start_context) as input, converts it into token IDs, and feeds it to the LLM to generate a text sample using the generate_text_simple function we used earlier: 与 evaluate_model 相似,generate_and_print_sample 函数是一个便捷函数,我们使用它来跟踪模型在训练过程中的改进情况。具体来说,generate_and_print_sample 函数以一个文本片段(start_context)作为输入,将其转换为 token ID,并将其输入到LLM中,使用我们之前使用的 generate_text_simple 函数生成文本样本:
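A sketch of this convenience function; it assumes the text_to_token_ids and token_ids_to_text helpers from section 5.1 and that our GPTModel exposes its positional embedding as pos_emb:

def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]  # context length supported by the model
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # compact one-line print format
    model.train()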
While the evaluate_model function gives us a numeric estimate of the model's training progress, this generate_and_print_sample text function provides a concrete text example generated by the model to judge its capabilities during training. 虽然 evaluate_model 函数为我们提供了模型训练进度的数字估计,但 generate_and_print_sample 文本函数提供了由模型生成的具体文本示例,以评估训练期间的模型功能。
ADAMW
Adam optimizers are a popular choice for training deep neural networks. However, in our training loop, we opt for the AdamW optimizer. AdamW is a variant of Adam that improves the weight decay approach, which aims to minimize model complexity and prevent overfitting by penalizing larger weights. This adjustment allows AdamW to achieve more effective regularization and better generalization and is thus frequently used in the training of LLMs.
Let's see this all in action by training a GPTModel instance for 10 epochs using an AdamW optimizer and the train_model_simple function we defined earlier. 我们来通过使用 AdamW 优化器和刚才定义的 train_model_simple 函数,对 GPTModel 实例进行 10 个 epoch 的训练来观察实际效果。
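A sketch of this training run, matching the annotation below; the learning rate, weight decay, and evaluation settings are illustrative assumptions:

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)  #A
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)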
#A The .parameters() method returns all trainable weight parameters of the model #A .parameters()方法返回模型的所有可训练权重参数
Executing the train_model_simple function starts the training process, which takes about 5 minutes on a MacBook Air or a similar laptop to complete. The output printed during this execution is as follows:
Ep 1 (Step 000000): Train loss 9.781, Val loss 9.933
Ep 1 (Step 000005): Train loss 8.111, Val loss 8.339
Every effort moves you,,,,,,,,,,,,
Ep 2 (Step 000010): Train loss 6.661, Val loss 7.048
Ep 2 (Step 000015): Train loss 5.961, Val loss 6.616
Every effort moves you, and, and, and, and, and, and, and, and, and, and, and, and, and,
and, and, and, and, and, and, and, and, and, and, and,
[...] #A
Ep 9 (Step 000080): Train loss 0.541, Val loss 6.393
Every effort moves you?" "Yes--quite insensible to the irony. She wanted him
vindicated--and by me!" He laughed again, and threw back the window-curtains, I had the
donkey. "There were days when I
Ep 10 (Step 000085): Train loss 0.391, Val loss 6.452
Every effort moves you know," was one of the axioms he laid down across the Sevres and
silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run
over from Monte Carlo; and Mrs. Gis
#A Intermediate results removed to save space
As we can see, based on the results printed during the training, the training loss improves drastically, starting with a value of 9.781 and converging to 0.391. The language skills of the model have improved quite a lot. In the beginning, the model is only able to append commas to the start context ("Every effort moves you,,,,,,,,,,,,,") or repeat the word "and". At the end of the training, it can generate grammatically correct text.
Similar to the training set loss, we can see that the validation loss starts high (9.933) and decreases during the training. However, it never becomes as small as the training set loss and remains at 6.452 after the 10th epoch.
Before discussing the validation loss in more detail, let's create a simple plot that shows the training and validation set losses side by side: 在更详细地讨论验证损失之前,让我们创建一个简单的图表,显示训练集和验证集的损失对比:
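A minimal plotting sketch, consistent with the annotations below; it assumes matplotlib is installed and that train_losses, val_losses, and tokens_seen were returned by train_model_simple:

import matplotlib.pyplot as plt

def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax2 = ax1.twiny()  #A
    ax2.plot(tokens_seen, train_losses, alpha=0)  #B
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)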
#A Create a second x-axis that shares the same y-axis
#B Invisible plot for aligning ticks 隐藏的刻度对齐图
The resulting training and validation loss plot is shown in Figure 5.12. 生成的训练和验证损失图如图 5.12 所示。
Figure 5.12 At the beginning of the training, we observe that both the training and validation set losses sharply decrease, which is a sign that the model is learning. However, the training set loss continues to decrease past the second epoch, whereas the validation loss stagnates. This is a sign that the model is still learning, but it's overfitting to the training set past epoch 2. 图 5.12 在训练开始时,我们观察到训练集和验证集的损失都急剧下降,这表明该模型正在学习。然而,训练集损失在第二个 epoch 之后继续下降,而验证集损失则停滞不前。这意味着该模型仍在学习,但在第 2 个 epoch 之后开始过拟合训练集。
As Figure 5.12 shows, both the training and validation losses start to improve for the first epoch. However, the losses start to diverge past the second epoch. This divergence and the fact that the validation loss is much larger than the training loss indicate that the model is overfitting to the training data. We can confirm that the model memorizes the training data verbatim by searching for the generated text snippets, such as "quite insensible to the irony" in the "The Verdict" text file. 正如图 5.12 所示,训练损失和验证损失在第一个时期都开始有所改善。然而,在第二个时期之后,这两个损失开始发散。这种发散以及验证损失远大于训练损失表明,该模型正过拟合于训练数据。我们可以通过在"The Verdict"文本文件中搜索生成的文本片段(如"quite insensible to the irony")来确认该模型在记住训练数据方面做到了逐字逐句。
This memorization is expected since we are working with a very, very small training dataset and training the model for multiple epochs. Usually, it's common to train a model on a much, much larger dataset for only one epoch. 这是预料之中的,因为我们正在使用一个非常小的训练数据集,并训练模型多个周期。通常情况下,在一个大得多的数据集上只训练一个周期就是常见的做法。
As mentioned earlier, interested readers can try to train the model on 60,000 public domain books from Project Gutenberg, where this overfitting does not occur; see appendix D for details.
In the upcoming section, as shown in Figure 5.13, we explore sampling methods employed by LLMs to mitigate memorization effects, resulting in more novel generated text. 在即将到来的章节中,如图 5.13 所示,我们探讨了 LLMs 采用的采样方法,以缓解记忆化效应,从而生成更新颖的文本。
Figure 5.13 Our model can generate coherent text after implementing the training function. However, it often memorizes passages from the training set verbatim. The following section covers strategies to generate more diverse output texts. 图 5.13 我们的模型在实施培训功能后可以生成连贯的文本。然而,它经常原封不动地背诵训练集中的段落。以下部分涵盖了生成更多元化输出文本的策略。
As illustrated in Figure 5.13, the next section will cover text generation strategies for LLM to reduce training data memorization and increase the originality of the LLM-generated text before we cover weight loading and saving and loading pretrained weights from OpenAI's GPT model. 如图 5.13 所示,下一节将介绍文本生成策略,用于LLM减少训练数据记忆并增加LLM生成文本的原创性,然后我们将介绍权重的加载和保存以及从 OpenAI 的 GPT 模型加载预训练的权重。
5.3 Decoding strategies to control randomness 5.3 控制随机性的解码策略
In this section, we will cover text generation strategies (also called decoding strategies) to generate more original text. First, we briefly revisit the generate_text_simple function from the previous chapter that we used inside the generate_and_print_sample earlier in this chapter. Then, we will cover two techniques, temperature scaling, and top-k sampling, to improve this function. 在这个部分,我们将介绍生成更原创文本的文本生成策略(也称为解码策略)。首先,我们简要回顾了前一章中我们在 generate_and_print_sample 中使用的 generate_text_simple 函数。然后,我们将介绍两种技术,温度缩放和 top-k 采样,来改进这个函数。
We begin by transferring the model back from the GPU to the CPU since inference with a relatively small model does not require a GPU. Also, after training, we put the model into evaluation mode to turn off random components such as dropout:
model.to("cpu")
model.eval()
Next, we plug the GPTModel instance (model) into the generate_text_simple function, which uses the LLM to generate one token at a time: 接下来,我们将 GPTModel 实例(model)插入 generate_text_simple 函数,该函数使用LLM逐个生成令牌:
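A sketch of this call; the max_new_tokens value is illustrative, and text_to_token_ids and token_ids_to_text are the helpers from section 5.1:

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))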
Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun 每一个努力都会让你前进,"这是他在精美装饰的餐桌上留下的一条金句
As explained earlier in section 5.1.2, at each generation step the generated token is selected as the token with the largest probability score among all tokens in the vocabulary.
The following subsections introduce two concepts to control the randomness and diversity of the generated text: temperature scaling and top-k sampling. 以下小节介绍两个概念来控制生成文本的随机性和多样性:温度缩放和 top-k 采样。
5.3.1 Temperature scaling 5.3.1 温度缩放
This section introduces temperature scaling, a technique that adds a probabilistic selection process to the next-token generation task. 这一部分介绍了温度缩放技术,这是一种在下一个令牌生成任务中添加概率选择过程的技术。
Previously, inside the generate_text_simple function, we always sampled the token with the highest probability as the next token using torch.argmax, also known as greedy decoding. To generate text with more variety, we can replace the argmax with a function that samples from a probability distribution (here, the probability scores the LLM generates for each vocabulary entry at each token generation step). 以前,在 generate_text_simple 函数内部,我们总是使用 torch.argmax(也称为贪婪解码)来采样具有最高概率的令牌作为下一个令牌。为了生成更多样化的文本,我们可以用一个从概率分布中采样的函数来替换 argmax(这里,概率分数是LLM在每个令牌生成步骤中为每个词汇表项生成的)。
To illustrate the probabilistic sampling with a concrete example, let's briefly discuss the next-token generation process using a very small vocabulary for illustration purposes: 为了用具体的例子说明概率采样,让我们简要讨论使用非常小的词汇表进行下一个词生成的过程:
vocab = {
"closer": 0,
"every": 1,
"effort": 2,
"forward": 3,
"inches": 4,
"moves": 5,
"pizza": 6,
"toward": 7,
"you": 8,
}
inverse_vocab = {v: k for k, v in vocab.items()}
Next, assume the LLM is given the start context "every effort moves you" and generates the following next-token logits:
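The following illustrative logit values are assumptions except for the three largest entries (4.51, 6.75, and 6.28), which match the top-3 logits shown later in this section:

next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)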
As discussed in the previous chapter, inside generate_text_simple, we convert the logits into probabilities via the softmax function and obtain the token ID corresponding to the generated token via the argmax function, which we can then map back into text via the inverse vocabulary:
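In code, this greedy selection might look as follows:

probas = torch.softmax(next_token_logits, dim=0)
next_token_id = torch.argmax(probas).item()
print(inverse_vocab[next_token_id])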
Since the largest logit value, and correspondingly the largest softmax probability score, is in the fourth position (index position 3 since Python uses 0-indexing), the generated word is "forward".
To implement a probabilistic sampling process, we can now replace the argmax with the multinomial function in PyTorch: 要实现概率采样过程,我们现在可以在 PyTorch 中用多项式函数替换 argmax:
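A minimal sketch of this replacement:

torch.manual_seed(123)
probas = torch.softmax(next_token_logits, dim=0)
next_token_id = torch.multinomial(probas, num_samples=1).item()
print(inverse_vocab[next_token_id])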
The printed output is "forward" just like before. What happened? The multinomial function samples the next token proportional to its probability score. In other words, "forward" is still the most likely token and will be selected by multinomial most of the time but not all the time. To illustrate this, let's implement a function that repeats this sampling 1000 times: 打印输出仍为"forward"。发生了什么?多项式函数根据概率得分采样下一个词元。换句话说,"forward"仍然是最可能的词元,并且大部分时间都会被多项式选中,但并非总是如此。为了说明这一点,让我们实现一个函数,该函数重复进行 1000 次采样:
def print_sampled_tokens(probas):
torch.manual_seed(123)
sample = [torch.multinomial(probas, num_samples=1).item() for i in range(1_000)]
sampled_ids = torch.bincount(torch.tensor(sample))
for i, freq in enumerate(sampled_ids):
print(f"{freq} x {inverse_vocab[i]}")
print_sampled_tokens(probas)
The sampling output is as follows: 采样输出结果如下:
73 x closer
0 x every
0 x effort
582 x forward 582 x 向前
2 x inches 2 x 英寸
0 x moves 0 次移动
0 x pizza 0 个披萨
343 x toward 343 x 朝向
As we can see based on the output, the word "forward" is sampled most of the time (582 out of 1000 times), but other tokens such as "closer", "inches", and "toward" will also be sampled some of the time. This means that if we replaced the argmax function with the multinomial function inside the generate_and_print_sample function, the LLM would sometimes generate texts such as "every effort moves you toward", "every effort moves you inches", and "every effort moves you closer" instead of "every effort moves you forward".
We can further control the distribution and selection process via a concept called temperature scaling, where temperature scaling is just a fancy description for dividing the logits by a number greater than 0 :
def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)
Temperatures greater than 1 result in more uniformly distributed token probabilities, and temperatures smaller than 1 will result in more confident (sharper or more peaky) distributions. Let's illustrate this by plotting the original probabilities alongside probabilities scaled with different temperature values:
temperatures = [1, 0.1, 5]  #A
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]
x = torch.arange(len(vocab))
bar_width = 0.15
fig, ax = plt.subplots(figsize=(5, 3))
for i, T in enumerate(temperatures):
    rects = ax.bar(x + i * bar_width, scaled_probas[i],
                   bar_width, label=f'Temperature = {T}')
ax.set_ylabel('Probability')
ax.set_xticks(x)
ax.set_xticklabels(vocab.keys(), rotation=90)
ax.legend()
plt.tight_layout()
plt.show()
#A Original, lower, and higher confidence
The resulting plot is shown in Figure 5.14.
Figure 5.14 A temperature of 1 represents the unscaled probability scores for each token in the vocabulary. Decreasing the temperature to 0.1 sharpens the distribution, so the most likely token (here "forward") will have an even higher probability score. Vice versa, increasing the temperature to 5 makes the distribution more uniform.
A temperature of 1 divides the logits by 1 before passing them to the softmax function to compute the probability scores. In other words, using a temperature of 1 is the same as not using any temperature scaling. In this case, the tokens are selected with a probability equal to the original softmax probability scores via the multinomial sampling function in PyTorch.
Also, as we can see in Figure 5.14, applying very small temperatures, such as 0.1, will result in sharper distributions such that the behavior of the multinomial function selects the most likely token (here: "forward") almost 100% of the time, approaching the behavior of the argmax function. Vice versa, a temperature of 5 results in a more uniform distribution where other tokens are selected more often. This can add more variety to the generated texts but also more often results in nonsensical text. For example, using the temperature of 5 results in texts such as "every effort moves you pizza" about 4% of the time.
EXERCISE 5.1
Use the print_sampled_tokens function to print the sampling frequencies of the softmax probabilities scaled with the temperatures shown in Figure 5.14. How often is the word "pizza" sampled in each case? Can you think of a faster and more accurate way to determine how often the word "pizza" is sampled?
5.3.2 Top-k sampling
In the previous section, we implemented a probabilistic sampling approach coupled with temperature scaling to increase the diversity of the outputs. We saw that higher temperature values result in more uniformly distributed next-token probabilities, which result in more diverse outputs as they reduce the likelihood of the model repeatedly selecting the most probable token. This method allows for exploring less likely but potentially more interesting and creative paths in the generation process. However, one downside of this approach is that it sometimes leads to grammatically incorrect or completely nonsensical outputs such as "every effort moves you pizza".
In this section, we introduce another concept called top-k sampling, which, when combined with probabilistic sampling and temperature scaling, can improve the text generation results.
In top-k sampling, we can restrict the sampled tokens to the top-k most likely tokens and exclude all other tokens from the selection process by masking their probability scores, as illustrated in Figure 5.15.
By assigning zero probabilities to the non-top-k positions, we ensure that the next token is always sampled from a top-k position
Figure 5.15 Using top-k sampling with k=3, we focus on the 3 tokens associated with the highest logits and mask out all other tokens with negative infinity (-inf) before applying the softmax function. This results in a probability distribution with a probability value of 0 assigned to all non-top-k tokens.
The approach outlined in Figure 5.15 replaces all non-selected logits with the negative infinity value (-inf), such that when computing the softmax values, the probability scores of the non-top-k tokens are 0, and the remaining probabilities sum up to 1. (Careful readers may remember this masking trick from the causal attention module we implemented in chapter 3 in section 3.5.1 Applying a causal attention mask.)
In code, we can implement the top-k procedure outlined in Figure 5.15 as follows, starting with the selection of the tokens with the largest logit values:
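A sketch of this selection step with k=3:

top_k = 3
top_logits, top_pos = torch.topk(next_token_logits, top_k)
print("Top logits:", top_logits)
print("Top positions:", top_pos)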
Subsequently, we apply PyTorch's where function to set the logit values of tokens that are below the lowest logit value within our top-3 selection to negative infinity (-inf).
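A sketch of this masking step, consistent with the annotations below and reusing top_logits from the previous step:

new_logits = torch.where(
    condition=next_token_logits < top_logits[-1],  #A
    input=torch.tensor(float("-inf")),             #B
    other=next_token_logits                        #C
)
print(new_logits)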
#A Identifies logits less than the minimum in the top 3
#B Assigns -inf to these lower logits
#C Retains the original logits for all other tokens
The resulting logits for the next token in the 9-token vocabulary are as follows:
tensor([4.5100, -inf, -inf, 6.7500, -inf, -inf, -inf, 6.2800, -inf])
Lastly, let's apply the softmax function to turn these into next-token probabilities:
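For example (topk_probas is an illustrative variable name):

topk_probas = torch.softmax(new_logits, dim=0)
print(topk_probas)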
We can now apply the temperature scaling and multinomial function for probabilistic sampling introduced in the previous section to select the next token among these 3 nonzero probability scores to generate the next token. We do this in the next section by modifying the text generation function.
5.3.3 Modifying the text generation function
The previous two subsections introduced two concepts to increase the diversity of LLM-generated text: temperature sampling and top-k sampling. In this section, we combine and add these concepts to modify the generate_text_simple function we used to generate text via the LLM earlier, creating a new generate function:
Listing 5.4 A modified text generation function with more diversity
def generate(model, idx, max_new_tokens, context_size,
             temperature=0.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):  #A
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]
        if top_k is not None:  #B
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(logits < min_val,
                                 torch.tensor(float("-inf")).to(logits.device),
                                 logits)
        if temperature > 0.0:  #C
            logits = logits / temperature
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:  #D
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)
        if idx_next == eos_id:  #E
            break
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
#A For-loop is the same as before: get logits, and only focus on the last time step
#B In this new section, we filter logits with top_k sampling
#C This is the new section where we apply temperature scaling
#D Carry out greedy next-token selection as before when temperature scaling is disabled
#E Stop generating early if an end-of-sequence token is encountered and eos_id is specified
Let's now see this new generate function in action:
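A sketch of such a call; the top_k, temperature, and max_new_tokens values are illustrative:

torch.manual_seed(123)
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=15,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=25,
    temperature=1.4
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))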
Output text: 输出文本:
Every effort moves you stand to work on surprise, a one of us had gone with random- 每一次尝试都让你站在惊喜的边缘,我们中的一个人已经随机行动了
As we can see, the generated text is very different from the one we previously generated via the generate_text_simple function at the beginning of section 5.3 ("Every effort moves you know," was one of the axioms he laid..."), which was a memorized passage from the training set.
EXERCISE 5.2
Play around with different temperatures and top-k settings. Based on your observations, can you think of applications where lower temperature and top-k settings are desired? Vice versa, can you think of applications where higher temperature and top-k settings are preferred? (It's recommended to also revisit this exercise at the end of the chapter after loading the pretrained weights from OpenAI.)
EXERCISE 5.3
What are the different combinations of settings for the generate function to force deterministic behavior, that is, disabling the random sampling such that it always produces the same outputs similar to the generate_simple function?
So far, we covered how to pretrain LLMs and use them to generate text. The last two sections of this chapter will discuss how we save and load the trained LLM and how we load pretrained weights from OpenAI.
5.4 Loading and saving model weights in PyTorch
In this chapter, we have discussed how to numerically evaluate the training progress and pretrain an LLM from scratch. Even though both the LLM and dataset were relatively small, this exercise showed that pretraining LLMs is computationally expensive. Thus, it is important to be able to save the LLM so that we don't have to rerun the training every time we want to use it in a new session.
As illustrated in the chapter overview in Figure 5.16, we cover how to save and load a pretrained model in this section. Then, in the upcoming section, we will load a more capable pretrained GPT model from OpenAI into our GPTModel instance.
Figure 5.16 After training and inspecting the model, it is often helpful to save the model so that we can use or continue training it later, which is the topic of this section before we load the pretrained model weights from OpenAI in the final section of this chapter.
Fortunately, saving a PyTorch model is relatively straightforward. The recommended way is to save a model's so-called state_dict, a dictionary mapping each layer to its parameters, using the torch.save function as follows:
torch.save(model.state_dict(), "model.pth")
In the preceding code, "model.pth" is the filename where the state_dict is saved. The .pth extension is a convention for PyTorch files, though we could technically use any file extension.
Then, after saving the model weights via the state_dict, we can load the model weights into a new GPTModel model instance as follows:
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(torch.load("model.pth"))
model.eval()
As discussed in chapter 4, dropout helps prevent the model from overfitting to the training data by randomly "dropping out" of a layer's neurons during training. However, during inference, we don't want to randomly drop out any of the information the network has learned. Using model.eval() switches the model to evaluation mode for inference, disabling the dropout layers of the model.
If we plan to continue pretraining a model later, for example, using the train_model_simple function we defined earlier in this chapter, saving the optimizer state is also recommended.
Adaptive optimizers such as AdamW store additional parameters for each model weight. AdamW uses historical data to adjust learning rates for each model parameter dynamically. Without it, the optimizer resets, and the model may learn suboptimally or even fail to converge properly, which means that it will lose the ability to generate coherent text. Using torch.save, we can save both the model and optimizer state_dict contents as follows:
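For example (the filename model_and_optimizer.pth is an illustrative choice):

torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    },
    "model_and_optimizer.pth"
)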
Then, we can restore the model and optimizer states as follows by first loading the saved data via torch.load and then using the load_state_dict method:
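A sketch of this restore step, assuming the checkpoint file and optimizer settings from above:

checkpoint = torch.load("model_and_optimizer.pth")
model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()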
EXERCISE 5.4
After saving the weights, load the model and optimizer in a new Python session or Jupyter notebook file and continue pretraining it for 1 more epoch using the train_model_simple function.
5.5 Loading pretrained weights from OpenAI
Previously, for educational purposes, we trained a small GPT-2 model using a limited dataset comprising a short-story book. This approach allowed us to focus on the fundamentals without the need for extensive time and computational resources.
Fortunately, OpenAI openly shared the weights of their GPT-2 models, thus eliminating the need to invest tens to hundreds of thousands of dollars in retraining the model on a large corpus ourselves.
In the remainder of this section, we load these weights into our GPTModel class and use the model for text generation. Here, weights refer to the weight parameters stored in the .weight attributes of PyTorch's Linear and Embedding layers, for example. We accessed them earlier via model.parameters() when training the model.
In the next chapters, we will reuse these pretrained weights to finetune the model for a text classification task and follow instructions similar to ChatGPT.
Note that OpenAI originally saved the GPT-2 weights via TensorFlow, which we have to install to load the weights in Python. Moreover, the following code will use a progress bar tool called tqdm to track the download process, which we also have to install.
You can install these libraries by executing the following command in your terminal:
pip install tensorflow>=2.15.0 tqdm>=4.66
The download code is relatively long, mostly boilerplate, and not very interesting. Hence, instead of devoting precious space in this chapter to discussing Python code for fetching files from the internet, we download the gpt_download.py Python module directly from this chapter's online repository:
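One way to fetch the module (the exact path within the repository is an assumption; see the repository linked later in this section for the authoritative location):

import urllib.request

url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/"
    "main/ch05/01_main-chapter-code/gpt_download.py"
)
filename = url.split("/")[-1]
urllib.request.urlretrieve(url, filename)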
Next, after downloading this file to the local directory of your Python session, readers are encouraged to briefly inspect the contents of this file to ensure that it was saved correctly and contains valid Python code.
We can now import the download_and_load_gpt2 function from the gpt_download.py file as follows, which will load the GPT-2 architecture settings (settings) and weight parameters (params) into our Python session:
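For example (the models_dir argument is an assumed download location):

from gpt_download import download_and_load_gpt2

settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")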
UPDATED DOWNLOAD INSTRUCTIONS
If the download code does not work for you, it could be due to intermittent internet connection, server issues, or changes in how OpenAI shares the weights of the open-source GPT-2 model. In this case, please visit this chapter's online code repository at https://github.com/rasbt/LLMs-from-scratch for alternative and updated instructions, and please reach out via the Manning Forum for further questions.
After the execution of the previous code has been completed, let's inspect the contents of settings and params:
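For example:

print("Settings:", settings)
print("Parameter dictionary keys:", params.keys())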
The contents are as follows:
Settings: {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}
Parameter dictionary keys: dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])
Both settings and params are Python dictionaries. The settings dictionary stores the LLM architecture settings similarly to our manually defined GPT_CONFIG_124M settings. The params dictionary contains the actual weight tensors. Note that we only printed the dictionary keys because printing the weight contents would take up too much screen space. However, we can inspect these weight tensors by printing the whole dictionary via print(params) or by selecting individual tensors via the respective dictionary keys, for example, the embedding layer weights:
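For instance, assuming the token embedding weights are stored under the 'wte' key (as the dictionary keys above suggest):

print(params["wte"])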
The weights of the token embedding layer are as follows:
[ 0.04034033 ... 0.08605453 0.00253983 0.04318958]
We downloaded and loaded the weights of the smallest GPT-2 model via the download_and_load_gpt2(model_size="124M", ...) setting. However, note that OpenAI also shares the weights of larger models: "355M", "774M", and "1558M". The overall architecture of these differently-sized GPT models is the same, as illustrated in Figure 5.17. 我们通过设置 download_and_load_gpt2(model_size="124M", ...)下载并加载了最小型 GPT-2 模型的权重。但请注意,OpenAI 也共享了更大型号的模型权重:"355M"、"774M" 和 "1558M"。如图 5.17 所示,这些不同大小的 GPT 模型的整体架构是相同的。
Every effort moves you 每一次努力都推动你前进
Figure 5.17 GPT-2 LLMs come in several different model sizes, ranging from 124 million to 1,558 million parameters. The core architecture is the same, with the only difference being the embedding sizes and the number of times individual components like the attention heads and transformer blocks are repeated. 图 5.17 GPT-2 模型有多种不同的规模,参数数量从 1.24 亿到 15.58 亿不等。它们的核心架构是相同的,只有嵌入尺寸和注意力头以及变换器块重复的次数有所不同。
As illustrated in Figure 5.17, the overall architecture of the differently-sized GPT-2 models remains the same, except that different architectural elements are repeated different numbers of times, and the embedding size differs. The remaining code in this chapter is also compatible with these larger models. 如图 5.17 所示,不同大小的 GPT-2 模型的总体架构保持不变,只是不同的架构元素重复的次数不同,嵌入大小也有所不同。本章中的其余代码也与这些更大的模型兼容。
After loading the GPT-2 model weights into Python, we still need to transfer them from the settings and params dictionaries into our GPTModel instance. 在将 GPT-2 模型权重加载到 Python 中之后,我们还需要将它们从设置和参数字典转移到我们的 GPTModel 实例中。
First, we create a dictionary that lists the differences between the different GPT model sizes, as explained in Figure 5.17:
model_configs = {
"gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
"gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
"gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
"gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
Suppose we are interested in loading the smallest model, "gpt2-small (124M)". We can use the corresponding settings from the model_configs table to update our full-length GPT_CONFIG_124M that we defined and used earlier throughout the chapter as follows:
model_name = "gpt2-small (124M)"
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
Careful readers may remember that we used a 256-token length earlier, but the original GPT-2 models from OpenAI were trained with a 1,024-token length, so we have to update the NEW_CONFIG accordingly:
NEW_CONFIG.update({"context_length": 1024})
Also, OpenAI used bias vectors in the multi-head attention module's linear layers to implement the query, key, and value matrix computations. Bias vectors are not commonly used in LLMs anymore as they don't improve the modeling performance and are thus unnecessary. However, since we are working with pretrained weights, we need to match the settings for consistency and enable these bias vectors: 此外,OpenAI 在多头注意力模块的线性层中使用偏差向量来实现查询、键和值矩阵的计算。偏差向量在LLMs中不再常用,因为它们不会提高建模性能,因此是不必要的。然而,由于我们正在使用预训练的权重,我们需要匹配设置以保持一致性并启用这些偏差向量。
NEW_CONFIG.update({"qkv_bias": True})
We can now use the updated NEW_CONFIG dictionary to initialize a new GPTModel instance:
gpt = GPTModel(NEW_CONFIG)
gpt.eval()
By default, the GPTModel instance is initialized with random weights for pretraining. The last step to using OpenAI's model weights is to override these random weights with the weights we loaded into the params dictionary. 默认情况下,GPTModel 实例使用随机权重进行预训练。使用 OpenAI 的模型权重的最后一步是用我们加载到 params 字典中的权重来覆盖这些随机权重。
For this, we will first define a small assign utility function that checks whether two tensors or arrays (left and right) have the same dimensions or shape and returns the right tensor as trainable PyTorch parameters: 对于这个,我们首先定义一个小的赋值工具函数,检查两个张量或数组(左和右)是否具有相同的维度或形状,并将右张量作为可训练的 PyTorch 参数返回:
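A minimal version of this utility function could look like this:

def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))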
Next, we define a load_weights_into_gpt function that loads the weights from the params dictionary into a GPTModel instance gpt: 下一步,我们定义一个 load_weights_into_gpt 函数,它将 params 字典中的权重加载到 GPTModel 实例 gpt 中:
Listing 5.5 Loading OpenAI weights into our GPT model code
import numpy as np
#A Setting the model's positional and token embedding weights to those specified in params. 将模型的位置和标记嵌入权重设置为 params 中指定的权重。
#B Iterate over each transformer block in the model. 遍历模型中的每个变换器块。
#C The np.split function is used to divide the attention and bias weights into three equal parts for the query, key, and value components.
#D The original GPT-2 model by OpenAI reused the token embedding weights in the output layer to reduce the total number of parameters, which is a concept known as weight tying.
In the load_weights_into_gpt function, we carefully match the weights from OpenAI's implementation with our GPTModel implementation. To pick a specific example, OpenAI stored the weight tensor for the output projection layer for the first transformer block as params["blocks"][0]["attn"]["c_proj"]["w"]. In our implementation, this weight tensor corresponds to gpt.trf_blocks[b].att.out_proj.weight, where gpt is a GPTModel instance. 在 load_weights_into_gpt 函数中,我们仔细地将 OpenAI 实现中的权重与我们的 GPTModel 实现相匹配。以一个特定的例子来说,OpenAI 将第一个 transformer 块的输出投影层的权重张量存储为 params["blocks"][0]["attn"]["c_proj"]["w"]。在我们的实现中,这个权重张量对应于 gpt.trf_blocks[b].att.out_proj.weight,其中 gpt 是一个 GPTModel 实例。
Developing the load_weights_into_gpt function took a lot of guesswork since OpenAI used a slightly different naming convention from ours. However, the assign function would alert us if we try to match two tensors with different dimensions. Also, if we made a mistake in this function, we would notice this as the resulting GPT model would be unable to produce coherent text. 开发 load_weights_into_gpt 函数需要大量猜测,因为 OpenAI 使用的命名约定略有不同于我们的。然而,如果我们尝试匹配具有不同尺寸的两个张量,assign 函数会提醒我们。此外,如果我们在此函数中犯了错误,我们也会注意到这一点,因为生成的 GPT 模型将无法产生连贯的文本。
Let's now try the load_weights_into_gpt function out in practice and load the OpenAI model weights into our GPTModel instance gpt:
load_weights_into_gpt(gpt, params)
gpt.to(device)
If the model is loaded correctly, we can now use it to generate new text using our previous generate function: 如果模型载入正确,我们现在可以使用之前的生成函数来生成新的文本
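A sketch of such a generation call; the top_k and temperature values are illustrative:

torch.manual_seed(123)
token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))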
Every effort moves you toward finding an ideal new way to practice something! What makes us want to be on top of that? 每一个努力都让你更接近于找到一种全新的理想方式来实践某事!是什么让我们想要在这方面做到最好呢?
We can be confident that we loaded the model weights correctly because the model can produce coherent text. A tiny mistake in this process would cause the model to fail. 我们可以确信我们正确地加载了模型权重,因为模型能够产生连贯的文本。这个过程中的一点点错误会导致模型失败。
In the following chapters, we will work further with this pretrained model and fine-tune it to classify text and follow instructions. 在接下来的章节中,我们将继续使用这个预训练模型,并对其进行微调,以实现文本分类和指令跟踪功能。
EXERCISE 5.5 练习 5.5
Calculate the training and validation set losses of the GPTModel with the pretrained weights from OpenAI on the "The Verdict" dataset. 使用 OpenAI 预训练权重在"The Verdict"数据集上计算 GPTModel 的训练损失和验证损失。
EXERCISE 5.6 练习 5.6
Readers are encouraged to experiment with GPT-2 models of different sizes, for example, the largest 1558M parameter model, and compare the generated text to the 124M model we loaded in this chapter.
5.6 Summary 5.6 总结
When LLMs generate text, they output one token at a time. 当LLMs生成文本时,它们一次输出一个标记。
By default, the next token is generated by converting the model outputs into probability scores and selecting the token from the vocabulary that corresponds to the highest probability score, which is known as "greedy decoding." 默认情况下,下一个令牌是通过将模型输出转换为概率分数并选择对应于最高概率分数的词汇表中的令牌生成的,这被称为"贪婪解码"。
Using probabilistic sampling and temperature scaling, we can influence the diversity and coherence of the generated text. 使用概率性抽样和温度缩放,我们可以影响生成文本的多样性和连贯性。
Training and validation set losses can be used to gauge the quality of text generated by LLM during training. 训练集和验证集的损失可用于评估LLM在训练过程中生成文本的质量。
Pretraining an LLM involves changing its weights to minimize the training loss. 预训练一个LLM涉及到改变其权重以最小化训练损失。
The training loop for LLMs itself is a standard procedure in deep learning, using a conventional cross entropy loss and AdamW optimizer. 用于LLMs本身的训练循环是深度学习中的标准流程,使用常规的交叉熵损失和 AdamW 优化器。
Pretraining an LLM on a large text corpus is time- and resource-intensive so we can load openly available weights from OpenAI as an alternative to pretraining the model on a large dataset ourselves. 在大型文本语料库上预训练LLM是非常耗时耗资的,因此我们可以加载 OpenAI 公开提供的权重,作为自己在大型数据集上预训练模型的替代方案。
6
Finetuning for Classification 用于分类的微调
This chapter covers
Introducing different LLM finetuning approaches 介绍不同的LLM微调方法
Preparing a dataset for text classification 为文本分类准备数据集
Modifying a pretrained LLM for finetuning 微调预训练的LLM
Finetuning an LLM to identify spam messages 微调 LLM 识别垃圾邮件
Evaluating the accuracy of a finetuned LLM classifier 评估微调后LLM分类器的准确性
Using a finetuned LLM to classify new data 使用微调的LLM对新数据进行分类
In previous chapters, we coded the LLM architecture, pretrained it, and learned how to import pretrained weights from an external source, such as OpenAI, into our model. In this chapter, we are reaping the fruits of our labor by finetuning the LLM on a specific target task, such as classifying text, as illustrated in figure 6.1. The concrete example we will examine is classifying text messages as spam or not spam. 在前几章中,我们编码了LLM架构、预训练了它,并学习了如何从外部源(如 OpenAI)导入预训练权重到我们的模型。在本章中,我们通过在特定目标任务(如图 6.1 所示的文本分类)上微调LLM来收获劳动成果。我们将探讨一个具体的例子,即将文本消息分类为垃圾邮件或非垃圾邮件。
Figure 6.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset and finetuning it. This chapter focuses on finetuning a pretrained LLM as a classifier. 图 6.1 编码 LLM 的三个主要阶段的心智模型,在一般文本数据集上预训练 LLM,并对其进行微调。本章着重介绍如何将预训练的 LLM 作为分类器进行微调。
Figure 6.1 shows two main ways of finetuning an LLM: finetuning for classification (step 8) and finetuning an LLM to follow instructions (step 9). In the next section, we will discuss these two ways of finetuning in more detail. 图 6.1 显示了两种主要的微调LLM的方法:用于分类的微调(步骤 8)和微调LLM以遵循指令(步骤 9)。在下一节中,我们将更详细地讨论这两种微调方法。
6.1 Different categories of finetuning 6.1 不同类型的微调
The most common ways to finetune language models are instruction-finetuning and classification-finetuning. Instruction-finetuning involves training a language model on a set of tasks using specific instructions to improve its ability to understand and execute tasks described in natural language prompts, as illustrated in figure 6.2. 微调语言模型最常见的方式是指令微调和分类微调。指令微调涉及使用特定指令训练语言模型执行一组任务,以提高其理解和执行自然语言提示中描述的任务的能力,如图 6.2 所示。
Figure 6.2 Illustration of two different instruction-finetuning scenarios. At the top, the model is tasked with determining whether a given text is spam. At the bottom, the model is given an instruction on how to translate an English sentence into German. 图 6.2 两种不同指令微调场景的示例。在顶部,该模型的任务是确定给定文本是否为垃圾邮件。在底部,该模型被给出一条将英语句子翻译成德语的指令。
The next chapter will discuss instruction-finetuning, as illustrated in figure 6.2. Meanwhile, this chapter is centered on classification-finetuning, a concept you might already be acquainted with if you have a background in machine learning. 下一章将讨论指令微调,如图 6.2 所示。同时,本章着重于分类微调,这个概念如果你有机器学习背景的话可能已经熟悉了。
In classification-finetuning, the model is trained to recognize a specific set of class labels, such as "spam" and "not spam." Examples of classification tasks extend beyond large language models and email filtering; they include identifying different species of plants from images, categorizing news articles into topics like sports, politics, or technology, and distinguishing between benign and malignant tumors in medical imaging. 在分类微调中,模型被训练来识别特定的类别标签集合,例如"垃圾邮件"和"非垃圾邮件"。分类任务的示例不仅仅局限于大型语言模型和电子邮件过滤,还包括从图像中识别不同种类的植物、将新闻文章归类为体育、政治或技术等主题,以及在医学成像中区分良性和恶性肿瘤。
The key point is that a classification-finetuned model is restricted to predicting classes it has encountered during its training-for instance, it can determine whether something is "spam" or "not spam," as illustrated in figure 6.3, but it can't say anything else about the input text. 关键点在于,分类精调模型仅限于预测其训练期间遇到的类别-例如,它可以确定某物是"垃圾邮件"还是"非垃圾邮件",如图 6.3 所示,但不能说明输入文本的其他任何内容。
Figure 6.3 Illustration of a text classification scenario using an LLM. A model finetuned for spam classification does not require further instruction alongside the input. In contrast to an instruction-finetuned model, it can only respond with "spam" and "not spam." 图 6.3 使用LLM进行文本分类场景的插图。针对垃圾邮件分类进行微调的模型不需要额外的指令就可以处理输入。相比之下,经指令微调的模型只能响应"垃圾邮件"和"非垃圾邮件"。
In contrast to the classification-finetuned model depicted in figure 6.3, an instructionfinetuned model typically has the capability to undertake a broader range of tasks. We can view a classification-finetuned model as highly specialized, and generally, it is easier to develop a specialized model than a generalist model that works well across various tasks. 与图 6.3 所示的分类精调模型相比,指令精调模型通常具有较广泛的任务能力。我们可以将分类精调模型视为高度专业化的,通常开发专门的模型比开发在各种任务中都表现良好的通用模型要更容易。
CHOOSING THE RIGHT APPROACH 选择正确的方法
Instruction-finetuning improves a model's ability to understand and generate responses based on specific user instructions. Instruction-finetuning is best suited for models that need to handle a variety of tasks based on complex user instructions, improving flexibility and interaction quality. Classification-finetuning, on the other hand, is ideal for projects requiring precise categorization of data into predefined classes, such as sentiment analysis or spam detection. 指令精调可提高模型理解并生成特定用户指令的响应的能力。指令精调最适合需要处理各种任务的模型,这些任务基于复杂的用户指令,从而提高了灵活性和交互质量。相比之下,分类精调更适合于需要将数据精确地分类到预定义类别的项目,如情感分析或垃圾邮件检测。
While instruction-finetuning is more versatile, it demands larger datasets and greater computational resources to develop models proficient in various tasks. In contrast, classification-finetuning requires less data and compute power, but its use is confined to the specific classes on which the model has been trained. 尽管指令微调更加通用灵活,但它需要更大的数据集和更强大的计算资源来开发擅长各种任务的模型。相比之下,分类微调需要更少的数据和计算能力,但其使用范围局限于模型接受训练的特定类别。
6.2 Preparing the dataset 准备数据集
In the remainder of this chapter, we will modify and classification-finetune the GPT model we implemented and pretrained in the previous chapters. We begin with downloading and preparing the dataset, as illustrated in figure 6.4. 在本章的剩余部分中,我们将修改并对我们在前几章中实施并预训练的 GPT 模型进行分类微调。我们首先下载并准备数据集,如图 6.4 所示。
Figure 6.4 Illustration of the three-stage process for classification-finetuning the LLM in this chapter. Stage 1 involves dataset preparation. Stage 2 focuses on model setup. Stage 3 covers the finetuning and evaluation of the model. 图 6.4 本章中用于分类微调LLM的三阶段过程图解。第 1 阶段涉及数据集准备。第 2 阶段重点在于模型设置。第 3 阶段涵盖模型的微调和评估。
To provide an intuitive and useful example of classification-finetuning, we will work with a text message dataset that consists of spam and non-spam messages. 为了提供一个直观有用的分类微调示例,我们将使用一个由垃圾邮件和非垃圾邮件消息组成的文本消息数据集进行工作。
Note that these are text messages typically sent via phone, not email. However, the same steps also apply to email classification, and interested readers can find links to email spam classification datasets in the References section in appendix B. 注意,这些通常通过手机发送的文本消息,而不是电子邮件。但是,这些步骤也适用于电子邮件分类,对此感兴趣的读者可以在附录 B 的参考文献部分找到电子邮件垃圾邮件分类数据集的链接。
The first step is to download the dataset via the following code: 第一步是通过以下代码下载数据集:
Listing 6.1 Downloading and unzipping the dataset 清单 6.1 下载和解压数据集
import urllib.request
import zipfile
import os
from pathlib import Path
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extracted_path = "sms_spam_collection"
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"
def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
if data_file_path.exists():
print(f"{data_file_path} already exists. Skipping download and extraction.")
return
with urllib.request.urlopen(url) as response: #A
with open(zip_path, "wb") as out_file:
out_file.write(response.read())
with zipfile.ZipFile(zip_path, "r") as zip_ref: #B
zip_ref.extractall(extracted_path)
original_file_path = Path(extracted_path) / "SMSSpamCollection"
os.rename(original_file_path, data_file_path) #C
print(f"File downloaded and saved as {data_file_path}")
download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
#A Downloading the file
#B Unzipping the file
#C Adding a .tsv file extension
After executing the preceding code, the dataset is saved as a tab-separated text file, SMSSpamCollection.tsv, in the sms_spam_collection folder. We can load it into a pandas DataFrame as follows: 执行前述代码后,数据集保存为制表符分隔的文本文件 SMSSpamCollection.tsv,位于 sms_spam_collection 文件夹中。我们可以按如下方式将其加载到 pandas DataFrame 中:
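The loading step itself is short; a minimal version, assuming tab-separated columns named "Label" and "Text" (the column names that appear in figure 6.5 and in the SpamDataset class later in this chapter), could look like this:

import pandas as pd

df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])
df    #A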
#A Renders the data frame in a Jupyter notebook. Alternatively, use print(df). 在 Jupyter 笔记本中渲染数据框。或者使用 print(df)。
The resulting data frame of the spam dataset is shown in figure 6.5. 垃圾邮件数据集的生成数据框显示在图 6.5 中。
      Label  Text
0     ham    Go until jurong point, crazy.. Available only ...
1     ham    Ok lar... Joking wif u oni...
2     spam   Free entry in 2 a wkly comp to win FA Cup fina...
3     ham    dun say so early hor... c already then say...
4     ham    Nah I don't think he goes to usf, he lives aro...
...   ...    ...
5571  ham    Rofl. Its true to its name
5572 rows x 2 columns
Figure 6.5 Preview of the SMSSpamCollection dataset in a pandas DataFrame, showing class labels ("ham" or "spam") and corresponding text messages. The dataset consists of 5,572 rows (text messages and labels). 图 6.5 展示了一个 pandas DataFrame 中的 SMSSpamCollection 数据集的预览,显示了类标签("ham"或"spam")和相应的文本消息。该数据集包含 5,572 行(文本消息和标签)。
Let's examine the class label distribution: 让我们来检查一下类别标签分布情况:
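A one-liner along these lines produces the counts shown below:

print(df["Label"].value_counts())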
Executing the previous code, we find that the data contains "ham" (i.e., not spam) far more frequently than "spam": 执行前面的代码,我们发现数据中"ham"(即非垃圾邮件)的出现频率远远高于"spam"。
Label
ham 4825
spam 747
Name: count, dtype: int64
For simplicity, and because we prefer a small dataset for educational purposes (which will facilitate faster fine-tuning of the large language model), we choose to undersample the dataset to include 747 instances from each class. While there are several other methods to handle class imbalances, these are beyond the scope of a book on large language models. Readers interested in exploring methods for dealing with imbalanced data can find additional information in the References section in appendix B. 为了简单起见,以及因为我们偏好一个小型数据集用于教育目的(这将有助于更快地微调大型语言模型),我们选择对数据集进行欠采样,包含每个类别 747 个实例。虽然有几种其他方法可以处理类别不平衡,但这些超出了大型语言模型书籍的范围。对于有兴趣探索处理不平衡数据的方法的读者,可以在附录 B 的参考文献部分找到更多信息。
We use the following code to undersample the dataset and create a balanced dataset: 我们使用以下代码来对数据集进行欠采样并创建一个平衡的数据集:
Listing 6.2 Creating a balanced dataset 示例 6.2 创建平衡的数据集
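A minimal implementation consistent with the annotations below is sketched here; the helper name create_balanced_dataset and the random seed are assumptions:

def create_balanced_dataset(df):
    num_spam = df[df["Label"] == "spam"].shape[0]
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)    #B
    balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]])            #C
    return balanced_df

balanced_df = create_balanced_dataset(df)
print(balanced_df["Label"].value_counts())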
#B Randomly sample "ham" instances to match the number of "spam" instances 随机抽取"ham"实例,使其数量与"spam"实例数量相匹配
#C Combine ham "subset" with "spam" 将火腿"子集"与"垃圾信息"合并
After executing the previous code to balance the dataset, we can see that we now have equal amounts of spam and non-spam messages: 执行前一个代码平衡数据集后,我们现在可以看到垃圾信息和正常信息的数量相等
Label
ham     747
spam    747
Name: count, dtype: int64
Next, we convert the "string" class labels "ham" and "spam" into integer class labels 0 and 1, respectively:
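Assuming a balanced_df DataFrame like the one constructed above, this conversion can be done with a simple mapping:

balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})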
This process is similar to converting text into token IDs. However, instead of using the GPT vocabulary, which consists of more than 50,000 words, we are dealing with just two token IDs: 0 and 1. 这个过程类似于将文本转换为令牌 ID。但是,我们不是使用包含超过 50,000 个单词的 GPT 词汇表,而是仅处理 0 和 1 两个令牌 ID。
We create a random_split function to split the dataset into three parts: 70% for training, 10% for validation, and 20% for testing. (These ratios are common in machine learning to train, adjust, and evaluate models.)
def random_split(df, train_frac, validation_frac):
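A minimal sketch of this function, together with saving the resulting splits to CSV files so they can be reloaded later, might look like this; the file names train.csv, validation.csv, and test.csv are assumptions:

def random_split(df, train_frac, validation_frac):
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)   # Shuffle the entire DataFrame
    train_end = int(len(df) * train_frac)                             # Compute split indices
    validation_end = train_end + int(len(df) * validation_frac)

    train_df = df[:train_end]
    validation_df = df[train_end:validation_end]
    test_df = df[validation_end:]
    return train_df, validation_df, test_df

train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)   # 70% train, 10% validation, 20% test

train_df.to_csv("train.csv", index=None)
validation_df.to_csv("validation.csv", index=None)
test_df.to_csv("test.csv", index=None)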
In this section, we downloaded the dataset, balanced it, and split it into training and evaluation subsets. In the next section, we will set up the PyTorch data loaders that will be used to train the model. 在这个部分,我们下载了数据集,对其进行了平衡,并将其拆分为训练和评估子集。在下一部分,我们将建立 PyTorch 数据加载器,用于训练模型。
6.3 Creating data loaders 创建数据加载器
In this section, we develop PyTorch data loaders that are conceptually similar to the ones we implemented in chapter 2 . 在本节中,我们开发 PyTorch 数据加载器,它在概念上与我们在第 2 章中实现的加载器类似。
Previously, in chapter 2, we utilized a sliding window technique to generate uniformly sized text chunks, which were then grouped into batches for more efficient model training. Each chunk functioned as an individual training instance. 在第 2 章中,我们使用滑动窗口技术生成了大小均匀的文本块,然后将它们分组为批次以更高效地进行模型训练。每个块都作为一个独立的训练实例。
However, in this chapter, we are working with a spam dataset that contains text messages of varying lengths. To batch these messages as we did with the text chunks in chapter 2, we have two primary options: 然而,在本章中,我们正在使用一个包含不同长度文本消息的垃圾信息数据集。为了像第二章中处理文本块一样对这些消息进行批处理,我们有两个主要选择:
Truncate all messages to the length of the shortest message in the dataset or batch. 将所有消息截断为数据集或批次中最短消息的长度。
Pad all messages to the length of the longest message in the dataset or batch. 将所有消息的长度填充至数据集或批次中最长消息的长度。
Option 1 is computationally cheaper, but it may result in significant information loss if shorter messages are much smaller than the average or longest messages, potentially reducing model performance. So, we opt for the second option, which preserves the entire content of all messages. 选择 1 在计算上更便宜,但如果较短的消息远小于平均或最长消息,可能会导致重大的信息损失,从而降低模型性能。因此,我们选择第二个选项,保留所有消息的完整内容。
To implement option 2, where all messages are padded to the length of the longest message in the dataset, we add padding tokens to all shorter messages. For this purpose, we use "<|endoftext|>" as a padding token, as discussed in chapter 2.
However, instead of appending the string "<|endoftext|>" to each of the text messages directly, we can add the token ID corresponding to "<|endoftext|>" to the encoded text messages as illustrated in figure 6.6. 然而,我们可以在编码的文本消息中添加对应于"<|endoftext|>"的令牌 ID,而不是直接将字符串"<|endoftext|>"附加到每个文本消息中,如图 6.6 所示。
Figure 6.6 An illustration of the input text preparation process. First, each input text message is converted into a sequence of token IDs. Then, to ensure uniform sequence lengths, shorter sequences are padded with a padding token (in this case, token ID 50256) to match the length of the longest sequence. 图 6.6 输入文本准备过程的说明。首先,每个输入文本消息都被转换为一个令牌 ID 序列。然后,为了确保序列长度统一,较短的序列使用填充令牌(在本例中为令牌 ID 50256)填充到与最长序列相同的长度。
Figure 6.6 presumes that 50,256 is the token ID of the padding token "<|endoftext|>". We can double-check that this is indeed the correct token ID by encoding the "<|endoftext|>" text using the GPT-2 tokenizer from the tiktoken package that we used in previous chapters:
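Using the tiktoken tokenizer from earlier chapters, the check might look like this (passing allowed_special so the special token is not rejected):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))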
Executing the preceding code indeed returns [50256]. 执行前面的代码确实返回了[50256]。
As we have seen in chapter 2, we first need to implement a PyTorch Dataset, which specifies how the data is loaded and processed, before we can instantiate the data loaders. 正如我们在第 2 章中所看到的,在我们能够实例化数据加载器之前,我们首先需要实现一个 PyTorch 数据集,它指定了如何加载和处理数据。
For this purpose, we define the SpamDataset class, which implements the concepts illustrated in figure 6.6. This SpamDataset class handles several key tasks: it identifies the longest sequence in the training dataset, encodes the text messages, and ensures that all other sequences are padded with a padding token to match the length of the longest sequence. 为此,我们定义了 SpamDataset 类,它实现了图 6.6 中说明的概念。这个 SpamDataset 类处理了几个关键任务:它识别训练数据集中最长的序列,对文本消息进行编码,并确保所有其他序列都用填充标记进行填充,以匹配最长序列的长度。
Listing 6.4 Setting up a PyTorch Dataset class
import torch
from torch.utils.data import Dataset

class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)

        self.encoded_texts = [                               #A
            tokenizer.encode(text) for text in self.data["Text"]
        ]

        if max_length is None:
            self.max_length = self._longest_encoded_length()
        else:
            self.max_length = max_length
            self.encoded_texts = [                           #B
                encoded_text[:self.max_length]
                for encoded_text in self.encoded_texts
            ]

        self.encoded_texts = [                               #C
            encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))
            for encoded_text in self.encoded_texts
        ]

    def __getitem__(self, index):
        encoded = self.encoded_texts[index]
        label = self.data.iloc[index]["Label"]
        return (
            torch.tensor(encoded, dtype=torch.long),
            torch.tensor(label, dtype=torch.long)
        )

    def __len__(self):
        return len(self.data)

    def _longest_encoded_length(self):
        max_length = 0
        for encoded_text in self.encoded_texts:
            encoded_length = len(encoded_text)
            if encoded_length > max_length:
                max_length = encoded_length
        return max_length
#A Pre-tokenize texts #A 预处理文本
#B Truncate sequences if they are longer than max_length 如果序列长度超过最大长度,则截断序列
#C Pad sequences to the longest sequence #C 将序列填充到最长序列
The SpamDataset class loads data from the CSV files we created earlier, tokenizes the text using the GPT-2 tokenizer from tiktoken and allows us to pad or truncate the sequences to a uniform length determined by either the longest sequence or a predefined maximum length. This ensures each input tensor is of the same size, which is necessary to create the batches in the training data loader we implement next: SpamDataset 类从我们之前创建的 CSV 文件中加载数据,使用 tiktoken 中的 GPT-2 分词器对文本进行分词,并允许我们将序列填充或截断到由最长序列或预定义的最大长度确定的统一长度。这确保了每个输入张量的大小相同,这对于实现下一步的训练数据加载器中的批处理是必要的。
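Instantiating the training dataset could look as follows; the train.csv file name is the one assumed in the earlier split sketch:

train_dataset = SpamDataset(
    csv_file="train.csv",
    max_length=None,
    tokenizer=tokenizer
)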
Note that the longest sequence length is stored in the dataset's max_length attribute. If you are curious to see the number of tokens in the longest sequence, you can use the following code: 请注意,最长序列长度存储在数据集的 max_length 属性中。如果您想查看最长序列中的令牌数,可以使用以下代码:
print(train_dataset.max_length)
The code outputs 120, showing that the longest sequence contains no more than 120 tokens, a common length for text messages. It's worth noting that the model can handle sequences of up to 1,024 tokens, given its context length limit. If your dataset includes longer texts, you can pass max_length when creating the training dataset in the preceding code to ensure that the data does not exceed the model's supported input (context) length.
Next, we pad the validation and test sets to match the length of the longest training sequence. It's important to note that any validation and test set samples exceeding the length of the longest training example are truncated using encoded_text[:self.max_length] in the SpamDataset code we defined earlier. This truncation is optional; you could also set max_length=None for both validation and test sets, provided there are no sequences exceeding 1,024 tokens in these sets. 接下来,我们将验证集和测试集补齐到与最长训练序列长度相同的长度。需要注意的是,任何超过最长训练示例长度的验证集和测试集样本都使用我们之前定义的 SpamDataset 代码中的 encoded_text[:self.max_length]进行了截断。这种截断是可选的;你也可以为验证集和测试集都设置 max_length=None,前提是这些集合中没有超过 1,024 个令牌的序列。
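The corresponding validation and test datasets might be created like this, again assuming the file names from the split step:

val_dataset = SpamDataset(
    csv_file="validation.csv",
    max_length=train_dataset.max_length,
    tokenizer=tokenizer
)
test_dataset = SpamDataset(
    csv_file="test.csv",
    max_length=train_dataset.max_length,
    tokenizer=tokenizer
)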
EXERCISE 6.1 INCREASING THE CONTEXT LENGTH 练习 6.1 增加上下文长度
Pad the inputs to the maximum number of tokens the model supports and observe how it impacts the predictive performance. 将输入填充到模型支持的最大令牌数,并观察其对预测性能的影响。
Using the datasets as inputs, we can instantiate the data loaders similarly to what we did in chapter 2. However, in this case, the targets represent class labels rather than the next tokens in the text. For instance, choosing a batch size of 8 , each batch will consist of 8 training examples of length 120 and the corresponding class label of each example, as illustrated in figure 6.7. 使用这些数据集作为输入,我们可以像第 2 章中一样实例化数据加载器。然而,在这种情况下,目标表示类标签而不是文本中的下一个令牌。例如,选择一个批量大小为 8,每个批次将包含 8 个长度为 120 的训练示例和每个示例的相应类标签,如图 6.7 所示。
Figure 6.7 An illustration of a single training batch consisting of 8 text messages represented as token IDs. Each text message consists of token IDs. In addition, a class label array stores the class labels corresponding to the text messages, which can be either 0 (not spam) or 1 (spam). 图 6.7 一个包含 8 个文本消息的单一训练批次的示意图。每个文本消息由 个标记 ID 表示。此外,一个类标签数组存储了文本消息对应的 个类标签,可以是 0(非垃圾)或 1(垃圾)。
The following code creates the training, validation, and test set data loaders that load the text messages and labels in batches of size 8, as illustrated in figure 6.7: 以下代码创建了训练、验证和测试集数据加载器,它们以大小为 8 的批次加载文本消息和标签,如图 6.7 所示:
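A sketch of the data loader setup, assuming a batch size of 8 and dropping the last incomplete training batch, is shown below; the #A annotation refers to the num_workers setting:

from torch.utils.data import DataLoader

num_workers = 0    #A
batch_size = 8

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    drop_last=True,
)
val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    drop_last=False,
)
test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=num_workers,
    drop_last=False,
)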
#A This setting ensures compatibility with most computers #A 这种设置确保与大多数计算机的兼容性
To ensure that the data loaders are working and are indeed returning batches of the expected size, we iterate over the training loader and then print the tensor dimensions of the last batch: 为确保数据加载器正常工作并确实返回预期大小的批次,我们遍历训练加载器,然后打印最后一个批次的张量维度:
for input_batch, target_batch in train_loader:
pass
print("Input batch dimensions:", input_batch.shape)
print("Label batch dimensions", target_batch.shape)
As we can see, the input batches consist of 8 training examples with 120 tokens each, as expected. The label tensor stores the class labels corresponding to the 8 training examples. 正如我们所见,输入批次包含 8 个训练样本,每个样本有 120 个 token,符合预期。标签张量存储了这 8 个训练样本对应的类别标签。
Lastly, to get an idea of the dataset size, let's print the total number of batches in each dataset: 最后,为了了解数据集的大小,让我们打印出每个数据集中的批次总数:
print(f"{len(train_loader)} training batches")
print(f"{len(val_loader)} validation batches")
print(f"{len(test_loader)} test batches")
The number of batches in each dataset is as follows:
19 validation batches 19 个验证批次
38 test batches 38 批次测试
This concludes the data preparation in this chapter. Next, we will prepare the model for finetuning.
6.4 Initializing a model with pretrained weights
In this section, we prepare the model we will use for the classification-finetuning to identify spam messages. We start with initializing the pretrained model we worked with in the previous chapter, as illustrated in figure 6.8.
Figure 6.8 Illustration of the three-stage process for classification-finetuning the LLM in this chapter. After completing stage 1, preparing the dataset, this section focuses on initializing the LLM we will finetune to classify spam messages.
We start the model preparation process by reusing the configurations from chapter 5:
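A configuration block along the lines of chapter 5, with the values implied by the rest of this chapter (an embedding dimension of 768 and a context length of 1,024 for the 124M model; the exact dictionary layout here is an assumption), might look like this:

CHOOSE_MODEL = "gpt2-small (124M)"

BASE_CONFIG = {
    "vocab_size": 50257,      # Vocabulary size
    "context_length": 1024,   # Context length
    "drop_rate": 0.0,         # Dropout rate
    "qkv_bias": True          # Query-key-value bias (needed to load the OpenAI weights)
}

model_configs = {
    "gpt2-small (124M)":  {"emb_dim": 768,  "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)":  {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)":    {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])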
Next, we import the download_and_load_gpt2 function from the gpt_download.py file we downloaded in chapter 5. Furthermore, we also reuse the GPTModel class and load_weights_into_gpt function from chapter 5 to load the downloaded weights into the GPT model:
Listing 6.6 Loading a pretrained GPT model
from gpt_download import download_and_load_gpt2
from chapter05 import GPTModel, load_weights_into_gpt
model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")
model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()
After loading the model weights into the GPTModel, we use the text generation utility function from the previous chapters to ensure that the model generates coherent text:
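A quick sanity check could reuse the chapter 5 text generation utilities; the import path and the number of new tokens are assumptions here:

from chapter05 import generate_text_simple, text_to_token_ids, token_ids_to_text

text_1 = "Every effort moves you"

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_1, tokenizer),
    max_new_tokens=15,
    context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokenizer))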
As we can see based on the following output, the model generates coherent text, which is an indicator that the model weights have been loaded correctly:
Every effort moves you forward.
The first step is to understand the importance of your work
Now, before we start finetuning the model as a spam classifier, let's see if the model can perhaps already classify spam messages by prompting it with instructions:
text_2 = (
    "Is the following text 'spam'? Answer with 'yes' or 'no':"
    " 'You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award.'"
)

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_2, tokenizer),
    max_new_tokens=23,
    context_size=BASE_CONFIG["context_length"]
)
print(token_ids_to_text(token_ids, tokenizer))
The model output is as follows:
Is the following text 'spam'? Answer with 'yes' or 'no': 'You are a winner you have been specially selected to receive $1000 cash or a $2000 award.'
The following text 'spam'? Answer with 'yes' or 'no': 'You are a winner
Based on the output, it's apparent that the model struggles with following instructions.
This is anticipated, as it has undergone only pretraining and lacks instruction-finetuning, which we will explore in the upcoming chapter.
The next section prepares the model for classification-finetuning.
6.5 Adding a classification head
In this section, we modify the pretrained large language model to prepare it for classification-finetuning. To do this, we replace the original output layer, which maps the hidden representation to a vocabulary of 50,257 tokens, with a smaller output layer that maps to two classes: 0 ("not spam") and 1 ("spam"), as shown in figure 6.9.
Figure 6.9 This figure illustrates adapting a GPT model for spam classification by altering its architecture. Initially, the model's linear output layer mapped 768 hidden units to a vocabulary of 50,257 tokens. For spam detection, this layer is replaced with a new output layer that maps the same 768 hidden units to just two classes, representing "spam" and "not spam."
As shown in figure 6.9, we use the same model as in previous chapters except for replacing the output layer.
OUTPUT LAYER NODES
We could technically use a single output node since we are dealing with a binary classification task. However, this would require modifying the loss function, as discussed in an article in the Reference section in appendix B. Therefore, we choose a more general approach where the number of output nodes matches the number of classes. For example, for a 3-class problem, such as classifying news articles as "Technology", "Sports", or "Politics", we would use three output nodes, and so forth.
Before we attempt the modification illustrated in figure 6.9, let's print the model architecture via print(model), which prints the following:
Above, we can see the architecture we implemented in chapter 4 neatly laid out. As discussed in chapter 4, the GPTModel consists of embedding layers followed by 12 identical transformer blocks (only the last block is shown for brevity), followed by a final LayerNorm and the output layer, out_head.
Next, we replace the out_head with a new output layer, as illustrated in figure 6.9, that we will finetune.
FINETUNING SELECTED LAYERS VERSUS ALL LAYERS
Since we start with a pretrained model, it's not necessary to finetune all model layers. This is because, in neural network-based language models, the lower layers generally capture basic language structures and semantics that are applicable across a wide range of tasks and datasets. So, finetuning only the last layers (layers near the output), which are more specific to nuanced linguistic patterns and task-specific features, can often be sufficient to adapt the model to new tasks. A nice side effect is that it is computationally more efficient to finetune only a small number of layers. Interested readers can find more information, including experiments, on which layers to finetune in the References section for this chapter in appendix B.
To get the model ready for classification-finetuning, we first freeze the model, meaning that we make all layers non-trainable:
for param in model.parameters():
    param.requires_grad = False
Then, as shown in figure 6.9, we replace the output layer (model.out_head), which originally maps the layer inputs to 50,257 dimensions (the size of the vocabulary):
Listing 6.7 Adding a classification layer
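The replacement itself is a single linear layer with two output units; the random seed is included only for reproducibility:

torch.manual_seed(123)

num_classes = 2
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG["emb_dim"],
    out_features=num_classes
)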
Note that in the preceding code, we use BASE_CONFIG["emb_dim"], which is equal to 768 in the "gpt2-small (124M)" model, to keep the code below more general. This means we can also use the same code to work with the larger GPT-2 model variants.
This new model.out_head output layer has its requires_grad attribute set to True by default, which means that it's the only layer in the model that will be updated during training.
Technically, training the output layer we just added is sufficient. However, as I found in experiments, finetuning additional layers can noticeably improve the predictive performance of the finetuned model. (For more details, refer to the References in appendix B.)
Additionally, we configure the last transformer block and the final LayerNorm module, which connects this block to the output layer, to be trainable, as depicted in figure 6.10.
Figure 6.10 The GPT model we developed in earlier chapters, which we loaded previously, includes 12 repeated transformer blocks. Alongside the output layer, we set the final LayerNorm and the last transformer block as trainable, while the remaining 11 transformer blocks and the embedding layers are kept non-trainable.
To make the final LayerNorm and last transformer block trainable, as illustrated in figure 6.10, we set their respective requires_grad to True:
for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True

for param in model.final_norm.parameters():
    param.requires_grad = True
EXERCISE 6.2 FINETUNING THE WHOLE MODEL
Instead of finetuning just the final transformer block, finetune the entire model and assess the impact on predictive performance.
Even though we added a new output layer and marked certain layers as trainable or non-trainable, we can still use this model in a similar way to previous chapters. For instance, we can feed it an example text just as we have done in earlier chapters. Consider the following example:
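As a concrete illustration (the example text is an assumption; it encodes to 4 tokens with the GPT-2 tokenizer), we could run:

inputs = tokenizer.encode("Do you have time")
inputs = torch.tensor(inputs).unsqueeze(0)    # Add a batch dimension
print("Inputs:", inputs)
print("Inputs dimensions:", inputs.shape)     # shape: (batch_size, num_tokens)

with torch.no_grad():
    outputs = model(inputs)
print("Outputs:\n", outputs)
print("Outputs dimensions:", outputs.shape)   # shape: (batch_size, num_tokens, num_classes)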
In chapters 4 and 5, a similar input would have produced an output tensor of [1, 4, 50257], where 50,257 represents the vocabulary size. As in previous chapters, the number of output rows corresponds to the number of input tokens (in this case, 4). However, each output's embedding dimension (the number of columns) is now reduced to 2 instead of 50,257 since we replaced the output layer of the model.
Remember that we are interested in finetuning this model so that it returns a class label that indicates whether a model input is spam or not spam. To achieve this, we don't need to finetune all 4 output rows but can focus on a single output token. In particular, we will focus on the last row corresponding to the last output token, as illustrated in figure 6.11.
Figure 6.11 An illustration of the GPT model with a 4-token example input and output. The output tensor consists of 2 columns due to the modified output layer. We are only focusing on the last row corresponding to the last token when finetuning the model for spam classification.
To extract the last output token, illustrated in figure 6.11, from the output tensor, we use the following code:
print("Last output token:", outputs[:, -1, :])
This prints the following:
Last output token: tensor([[-3.5983, 3.9902]])
Before we proceed to the next section, let's recap our discussion. We will focus on converting the values into a class-label prediction. But first, let's understand why we are particularly interested in the last output token, and not the 1st, 2nd, or 3rd output token.
In chapter 3, we explored the attention mechanism, which establishes a relationship between each input token and every other input token. Subsequently, we introduced the concept of a causal attention mask, commonly used in GPT-like models. This mask restricts a token's focus to only its current position and those before it, ensuring that each token can only be influenced by itself and preceding tokens, as illustrated in figure 6.12.
Figure 6.12 Illustration of the causal attention mechanism as discussed in chapter 3 , where the attention scores between input tokens are displayed in a matrix format. The empty cells indicate masked positions due to the causal attention mask, preventing tokens from attending to future tokens. The values in the cells represent attention scores, with the last token, "time," being the only one that computes attention scores for all preceding tokens.
Given the causal attention mask setup shown in figure 6.12, the last token in a sequence accumulates the most information since it is the only token with access to data from all the previous tokens. Therefore, in our spam classification task, we focus on this last token during the finetuning process.
Having modified the model, the next section will detail the process of transforming the last token into class label predictions and calculate the model's initial prediction accuracy. Following this, we will finetune the model for the spam classification task in the subsequent section.
EXERCISE 6.3 FINETUNING THE FIRST VERSUS LAST TOKEN
Rather than finetuning the last output token, try finetuning the first output token and observe the changes in predictive performance when finetuning the model in later sections.
6.6 Calculating the classification loss and accuracy
So far in this chapter, we have prepared the dataset, loaded a pretrained model, and modified it for classification-finetuning. Before we proceed with the finetuning itself, only one small part remains: implementing the model evaluation functions used during finetuning, as illustrated in figure 6.13. We will tackle this in this section.
Figure 6.13 Illustration of the three-stage process for classification-finetuning the LLM in this chapter. This section implements the last step of stage 2 , implementing the functions to evaluate the model's performance to classify spam messages before, during, and after the finetuning.
Before implementing the evaluation utilities, let's briefly discuss how we convert the model outputs into class label predictions.
In the previous chapter, we computed the token ID of the next token generated by the LLM by converting the 50,257 outputs into probabilities via the softmax function and then returning the position of the highest probability via the argmax function. In this chapter, we take the same approach to calculate whether the model outputs a "spam" or "not spam" prediction for a given input, as shown in figure 6.14 , with the only difference being that we work with 2-dimensional instead of 50,257-dimensional outputs.
Figure 6.14. The model outputs corresponding to the last token are converted into probability scores for each input text. Then, the class labels are obtained by looking up the index position of the highest probability score. Note that the model predicts the spam labels incorrectly because it has not yet been trained.
To illustrate figure 6.14 with a concrete example, let's consider the last token output from the previous section:
print("Last output token:", outputs[:, -1, :])
The values of the tensor corresponding to the last token are as follows:
Last output token: tensor([[-3.5983, 3.9902]])
We can obtain the class label via the following code:
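Converting these two values into a class label can be done with a softmax followed by argmax, for example:

probas = torch.softmax(outputs[:, -1, :], dim=-1)
label = torch.argmax(probas)
print("Class label:", label.item())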
In this case, the code returns 1, meaning the model predicts that the input text is "spam." Using the softmax function here is optional because the largest outputs directly correspond to the highest probability scores, as mentioned in chapter 5 . Hence, we can simplify the code as follows, without using softmax:
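The simplified version without the softmax could read:

logits = outputs[:, -1, :]
label = torch.argmax(logits)
print("Class label:", label.item())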
This concept can be used to compute the so-called classification accuracy, which measures the percentage of correct predictions across a dataset.
To determine the classification accuracy, we apply the argmax-based prediction code to all examples in the dataset and calculate the proportion of correct predictions by defining a calc_accuracy_loader function:
Listing 6.8 Calculating the classification accuracy
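A version of this function consistent with the #A annotation below (the exact loop structure is a sketch) follows:

def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    model.eval()
    correct_predictions, num_examples = 0, 0

    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            input_batch = input_batch.to(device)
            target_batch = target_batch.to(device)

            with torch.no_grad():
                logits = model(input_batch)[:, -1, :]    #A
            predicted_labels = torch.argmax(logits, dim=-1)

            num_examples += predicted_labels.shape[0]
            correct_predictions += (predicted_labels == target_batch).sum().item()
        else:
            break
    return correct_predictions / num_examples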
#A Logits of last output token
Let's use the function to determine the classification accuracies across various datasets estimated from 10 batches for efficiency:
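Putting this together, a sketch of the device setup and the accuracy calls (num_batches=10 for the quick estimate mentioned above) could look like:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

torch.manual_seed(123)
train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)
val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)
test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)

print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")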
Via the device setting, the model automatically runs on a GPU if a GPU with Nvidia CUDA support is available and otherwise runs on a CPU. The output is as follows:
Training accuracy: 46.25%
Validation accuracy: 45.00%
Test accuracy: 48.75%
As we can see, the prediction accuracies are near a random prediction, which would be 50% in this case. To improve the prediction accuracies, we need to finetune the model.
However, before we begin finetuning the model, we need to define the loss function that we will optimize during the training process. Our objective is to maximize the spam classification accuracy of the model, which means that the preceding code should output the correct class labels: 0 for non-spam and 1 for spam texts.
However, classification accuracy is not a differentiable function, so we use cross entropy loss as a proxy to maximize accuracy. This is the same cross entropy loss discussed in chapter 5 .
Accordingly, the calc_loss_batch function remains the same as in chapter 5, with one adjustment: we focus on optimizing only the last token, model(input_batch)[:, -1, :], rather than all tokens, model(input_batch):
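With that adjustment, a minimal calc_loss_batch could look like this:

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)[:, -1, :]    # Logits of the last output token
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss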
We use the calc_loss_batch function to compute the loss for a single batch obtained from the previously defined data loaders. To calculate the loss for all batches in a data loader, we define the calc_loss_loader function, which is identical to the one described in chapter 5 :
Listing 6.9 Calculating the classification loss
def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:    #A
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches
Similar to calculating the training accuracy, we now compute the initial loss for each data set:
with torch.no_grad():    #B
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
    test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)
#A Ensure number of batches doesn't exceed batches in data loader
#B Disable gradient tracking for efficiency because we are not training, yet
print(f"Training loss: {train_loss:.3f}") 输出(f"训练损失: {train_loss:.3f}")
print(f"Validation loss: {val_loss:.3f}") 打印(f"验证损失: {val_loss:.3f}")
print(f"Test loss: {test_loss:.3f}") 打印(f"测试损失: {test_loss:.3f}")
The initial loss values are as follows: 最初的损失值如下:
Training loss: 3.095
Validation loss: 2.583
Test loss: 2.322
In the next section, we will implement a training function to finetune the model, which means adjusting the model to minimize the training set loss. Minimizing the training set loss will help increase the classification accuracy, our overall goal.
6.7 Finetuning the model on supervised data
In this section, we define and use the training function to finetune the pretrained LLM and improve its spam classification accuracy. The training loop, illustrated in figure 6.15, is the same overall training loop we used in chapter 5, with the only difference being that we calculate the classification accuracy instead of generating a sample text for evaluating the model.
Figure 6.15 A typical training loop for training deep neural networks in PyTorch consists of several steps, iterating over the batches in the training set for several epochs. In each loop, we calculate the loss for each training set batch to determine loss gradients, which we use to update the model weights to minimize the training set loss.
The training function implementing the concepts shown in figure 6.15 also closely mirrors the train_model_simple function used for pretraining the model in chapter 5 .
The only two distinctions are that we now track the number of training examples seen (examples_seen) instead of the number of tokens, and we calculate the accuracy after each epoch instead of printing a sample text:
Listing 6.10 Finetuning the model to classify spam
def train_classifier_simple(model, train_loader, val_loader, optimizer, device,
                            num_epochs, eval_freq, eval_iter, tokenizer):
    # Initialize lists to track losses and examples seen
    train_losses, val_losses, train_accs, val_accs = [], [], [], []
    examples_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()    #A

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()    #B
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()    #C
            optimizer.step()    #D
            examples_seen += input_batch.shape[0]    #E
            global_step += 1

            if global_step % eval_freq == 0:    #F
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        #G
        train_accuracy = calc_accuracy_loader(
            train_loader, model, device, num_batches=eval_iter
        )
        val_accuracy = calc_accuracy_loader(
            val_loader, model, device, num_batches=eval_iter
        )
        print(f"Training accuracy: {train_accuracy*100:.2f}% | ", end="")
        print(f"Validation accuracy: {val_accuracy*100:.2f}%")
        train_accs.append(train_accuracy)
        val_accs.append(val_accuracy)

    return train_losses, val_losses, train_accs, val_accs, examples_seen
#A Set model to training mode
#B Reset loss gradients from previous batch iteration
#C Calculate loss gradients
#D Update model weights using loss gradients
#E New: track examples instead of tokens
#F Optional evaluation step
#G Calculate accuracy after each epoch
The evaluate_model function used in the preceding train_classifier_simple is identical to the one we used in chapter 5:
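For reference, a minimal evaluate_model in line with chapter 5 could be written as:

def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss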
Next, we initialize the optimizer, set the number of training epochs, and initiate the training using the train_classifier_simple function. We will discuss the choice of the number of training epochs after we have evaluated the results. The training takes about 6 minutes on an M3 MacBook Air laptop computer and less than half a minute on a V100 or A100 GPU:
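A sketch of the training setup is shown below; the learning rate and weight decay values are assumptions, while eval_freq=50 and eval_iter=5 match the evaluation cadence visible in the output that follows:

import time

start_time = time.time()
torch.manual_seed(123)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
num_epochs = 5

train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=50, eval_iter=5,
    tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")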
The output we see during the training is as follows:
Ep 1 (Step 000000): Train loss 2.153, Val loss 2.392
Ep 1 (Step 000050): Train loss 0.617, Val loss 0.637
Ep 1 (Step 000100): Train loss 0.523, Val loss 0.557
Training accuracy: 70.00% | Validation accuracy: 72.50%
Ep 2 (Step 000150): Train loss 0.561, Val loss 0.489
Ep 2 (Step 000200): Train loss 0.419, Val loss 0.397
Ep 2 (Step 000250): Train loss 0.409, Val loss 0.353
Training accuracy: 82.50% | Validation accuracy: 85.00%
Ep 3 (Step 000300): Train loss 0.333, Val loss 0.320
Ep 3 (Step 000350): Train loss 0.340, Val loss 0.306
Training accuracy: 90.00% | Validation accuracy: 90.00%
Ep 4 (Step 000400): Train loss 0.136, Val loss 0.200
Ep 4 (Step 000450): Train loss 0.153, Val loss 0.132
Ep 4 (Step 000500): Train loss 0.222, Val loss 0.137
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 5 (Step 000550): Train loss 0.207, Val loss 0.143
Ep 5 (Step 000600): Train loss 0.083, Val loss 0.074
Training accuracy: 100.00% | Validation accuracy: 97.50%
Training completed in 5.65 minutes.
Similar to chapter 5, we then use matplotlib to plot the loss function for the training and validation set:
Listing 6.11 Plotting the classification loss
import matplotlib.pyplot as plt

def plot_values(epochs_seen, examples_seen, train_values, val_values, label="loss"):
    fig, ax1 = plt.subplots(figsize=(5, 3))

    ax1.plot(epochs_seen, train_values, label=f"Training {label}")    #A
    ax1.plot(epochs_seen, val_values, linestyle="-.", label=f"Validation {label}")
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel(label.capitalize())
    ax1.legend()

    ax2 = ax1.twiny()    #B
    ax2.plot(examples_seen, train_values, alpha=0)    # Invisible plot for aligning ticks
    ax2.set_xlabel("Examples seen")

    fig.tight_layout()    #C
    plt.savefig(f"{label}-plot.pdf")
    plt.show()

#A Plot training and validation loss against epochs
#B Create a second x-axis for examples seen
#C Adjust layout to make room
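The plotting call for the loss values mirrors the accuracy-plot call shown later in this section:

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))
plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses)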
The resulting loss curves are shown in the plot in figure 6.16.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-242.jpg?height=698&width=1210&top_left_y=186&top_left_x=319)
Figure 6.16 This graph shows the model's training and validation loss over the five training epochs. The training loss, represented by the solid line, and the validation loss, represented by the dashed line, both sharply decline in the first epoch and gradually stabilize towards the fifth epoch. This pattern indicates good learning progress and suggests that the model learned from the training data while generalizing well to the unseen validation data.
As we can see based on the sharp downward slope in figure 6.16, the model is learning well from the training data, and there is little to no indication of overfitting; that is, there is no noticeable gap between the training and validation set losses.
CHOOSING THE NUMBER OF EPOCHS
Earlier, when we initiated the training, we set the number of epochs to 5. The number of epochs depends on the dataset and the task's difficulty, and there is no universal solution or recommendation. An epoch number of 5 is usually a good starting point. If the model overfits after the first few epochs, as a loss plot such as the one shown in figure 6.16 could indicate, we may need to reduce the number of epochs. Conversely, if the trendline suggests that the validation loss could improve with further training, we should increase the number of epochs. In this concrete case, 5 epochs was a reasonable number, as there is no sign of early overfitting and the validation loss is close to 0.
Using the same plot_values function, let's now also plot the classification accuracies:
epochs_tensor = torch.linspace(0, num_epochs, len(train_accs))
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_accs))
plot_values(epochs_tensor, examples_seen_tensor, train_accs, val_accs, label="accuracy")
The resulting accuracy graphs are shown in figure 6.17.
Figure 6.17 Both the training accuracy (solid line) and the validation accuracy (dashed line) increase substantially in the early epochs and then plateau, achieving almost perfect accuracy scores of 1.0. The close proximity of the two lines throughout the epochs suggests that the model does not overfit the training data much.
Based on the accuracy plot in figure 6.17, the model achieves a relatively high training and validation accuracy after epochs 4 and 5 .
However, it's important to note that we previously set eval_iter=5 when using the train_classifier_simple function, which means our estimations of training and validation performance were based on only 5 batches for efficiency during training.
Now, we will calculate the performance metrics for the training, validation, and test sets across the entire dataset by running the following code, this time without defining the eval_iter value:
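Leaving out num_batches so that the whole of each data loader is used, the calls might read:

train_accuracy = calc_accuracy_loader(train_loader, model, device)
val_accuracy = calc_accuracy_loader(val_loader, model, device)
test_accuracy = calc_accuracy_loader(test_loader, model, device)

print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")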
The resulting accuracy values are as follows:
Training accuracy: 97.21%
Validation accuracy: 97.32%
Test accuracy: 95.67%
The training and test set performances are almost identical.
A slight discrepancy between the training and test set accuracies suggests minimal overfitting of the training data. Typically, the validation set accuracy is somewhat higher than the test set accuracy because the model development often involves tuning hyperparameters to perform well on the validation set, which might not generalize as effectively to the test set.
This situation is common, but the gap could potentially be minimized by adjusting the model's settings, such as increasing the dropout rate (drop_rate) or the weight_decay parameter in the optimizer configuration.
6.8 Using the LLM as a spam classifier
After finetuning and evaluating the model in the previous sections, we are now in the final stage of this chapter, as illustrated in figure 6.18: using the model to classify spam messages.
Figure 6.18 Illustration of the three-stage process for classification-finetuning the LLM in this chapter. This section implements the final step of stage 3 , using the finetuned model to classify new spam messages.
Finally, let's use the finetuned GPT-based spam classification model. The following classify_review function follows data preprocessing steps similar to those we used in the SpamDataset implemented earlier in this chapter. And then, after processing text into token IDs, the function uses the model to predict an integer class label, similar to what we have implemented in section 6.6 , and then returns the corresponding class name:
Listing 6.12 Using the model to classify new texts
def classify_review(text, model, tokenizer, device, max_length=None, pad_token_id=50256):
    model.eval()

    input_ids = tokenizer.encode(text)    #A
    supported_context_length = model.pos_emb.weight.shape[1]

    input_ids = input_ids[:min(max_length, supported_context_length)]    #B
    input_ids += [pad_token_id] * (max_length - len(input_ids))    #C
    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)    #D

    with torch.no_grad():    #E
        logits = model(input_tensor)[:, -1, :]    #F
    predicted_label = torch.argmax(logits, dim=-1).item()

    return "spam" if predicted_label == 1 else "not spam"    #G

#A Prepare inputs to the model
#B Truncate sequences if they are too long
#C Pad sequences to the longest sequence
#D Add batch dimension
#E Model inference without gradient tracking
#F Logits of the last output token
#G Return the classified result
Let's try this classify_review function on an example text:
text_1 = (
    "You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award."
)

print(classify_review(
    text_1, model, tokenizer, device, max_length=train_dataset.max_length
))
The resulting model correctly predicts "spam". Next, let's try another example:
text_2 = (
    "Hey, just wanted to check if we're still on"
    " for dinner tonight? Let me know!"
)

print(classify_review(
    text_2, model, tokenizer, device, max_length=train_dataset.max_length
))
Also, here, the model makes a correct prediction and returns a "not spam" label.
Finally, let's save the model in case we want to reuse it later without having to train it again, using the torch.save method we introduced in the previous chapter:
torch.save(model.state_dict(), "review_classifier.pth")
Once saved, the model can be loaded as follows:
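Loading it back uses the standard PyTorch pattern, for example:

model_state_dict = torch.load("review_classifier.pth")
model.load_state_dict(model_state_dict)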
6.9 Summary
- There are different strategies for finetuning LLMs, including classification-finetuning (this chapter) and instruction-finetuning (next chapter).
- Classification-finetuning involves replacing the output layer of an LLM with a small classification layer.
- In the case of classifying text messages as "spam" or "not spam," the new classification layer consists of only 2 output nodes; in previous chapters, the number of output nodes was equal to the number of unique tokens in the vocabulary, namely, 50,257.
- Instead of predicting the next token in the text as in pretraining, classification-finetuning trains the model to output a correct class label, for example, "spam" or "not spam."
- The model input for finetuning is text converted into token IDs, similar to pretraining.
- Before finetuning an LLM, we load the pretrained model as a base model.
- Evaluating a classification model involves calculating the classification accuracy (the fraction or percentage of correct predictions).
- Finetuning a classification model uses the same cross entropy loss function that is used for pretraining the LLM.
7
Finetuning to Follow Instructions
This chapter covers
- Introduction to the instruction finetuning process of LLMs
- Preparing a dataset for supervised instruction finetuning
- Organizing instruction data in training batches
- Loading a pretrained LLM and finetuning it to follow human instructions
- Extracting LLM-generated instruction responses for evaluation
- Evaluating an instruction-finetuned LLM
In previous chapters, we implemented the LLM architecture, carried out pretraining, and imported pretrained weights from external sources into our model. Then, in the previous chapter, we focused on finetuning our LLM for a specific classification task: distinguishing between spam and non-spam text messages. In this chapter, we implement the process for finetuning an LLM to follow human instructions, as illustrated in figure 7.1, which is one of the main techniques behind developing LLMs for chatbot applications, personal assistants, and other conversational tasks.
Figure 7.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it. This chapter focuses on finetuning a pretrained LLM to follow human instructions.
Figure 7.1 shows two main ways of finetuning an LLM: finetuning for classification (step 8) and finetuning an LLM to follow instructions (step 9). We implemented step 8 in the previous chapter. This chapter focuses on finetuning an LLM using an instruction dataset, a process that will be further explained in the next section.
\subsection*{7.1 Introduction to instruction finetuning}
In chapter 5, we saw that pretraining an LLM involves a training procedure where it learns to generate one word at a time. The resulting pretrained LLM is capable of text completion, meaning it can finish sentences or write text paragraphs given a fragment as input.
However, pretrained LLMs often struggle with specific instructions, such as "Fix the grammar in this text" or "Convert this text into passive voice." We will examine a concrete example of this in section 7.5, where we load the pretrained LLM as the basis for instruction finetuning.
In this chapter, we focus on improving the LLM's ability to follow such instructions and generate a desired response, as illustrated in figure 7.2.
Convert 45 kilometers to meters. -> Desired response: 45 kilometers is 45000 meters.
Provide a synonym for "bright". -> Desired response: A synonym for "bright" is "radiant".
Edit the following sentence to remove all passive voice: "The ... -> Desired response: The artist composed the song.
Figure 7.2 This figure shows examples of instructions that are processed by an LLM to generate desired responses.
In the remainder of this chapter, we will implement the instruction finetuning process in several steps, beginning with the dataset preparation, as shown in figure 7.3.
Figure 7.3 Illustration of the three-stage process for instruction finetuning the LLM in this chapter. Stage 1 involves dataset preparation. Stage 2 focuses on model setup and finetuning. Stage 3 covers the evaluation of the model.
Preparing the dataset is a key aspect of instruction finetuning where most of the time in this chapter is spent. The next section, as illustrated in figure 7.3, implements the code to download and format the dataset, which is the first step in the dataset preparation process.
\subsection*{7.2 Preparing a dataset for supervised instruction finetuning}
In this section, we download and format the instruction dataset for instruction finetuning a pretrained LLM in this chapter. The dataset consists of 1100 instruction-response pairs similar to those shown in figure 7.2. This dataset has been specifically created for this book, but interested readers can find alternative, publicly available instruction datasets in appendix B.
The following code implements and executes a function to download this dataset, which is a relatively small file, only 204 KB in size, in JSON format. JSON, or JavaScript Object Notation, mirrors the structure of Python dictionaries, providing a simple structure for data interchange that is both human-readable and machine-friendly.
Listing 7.1 Downloading the dataset
import json
import os
import urllib.request

def download_and_load_file(file_path, url):
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
    else: #A
        with open(file_path, "r", encoding="utf-8") as file:
            text_data = file.read()

    with open(file_path, "r") as file:
        data = json.load(file)

    return data

file_path = "instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main"
    "/ch07/01_main-chapter-code/instruction-data.json"
)

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))
\#A Skip download if file was already downloaded
The output of executing the preceding code is as follows:
Number of entries: 1100
The data list, which we loaded from the JSON file, contains the 1100 entries of the instruction dataset. Let's print one of the entries to see how each entry is structured:
print("Example entry:\n", data[50])
The content of the example entry is as follows:
Example entry:
\{'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"\}
As we can see, the example entries are Python dictionary objects containing an 'instruction', 'input', and 'output'. Let's take a look at another example:
print("Another example entry:\n", data[999])
The contents of this entry show that the 'input' field may occasionally be empty:
Another example entry:
\{'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."\}
Instruction finetuning, also known as supervised instruction finetuning, involves training a model on a dataset where the input-output pairs, like those we extracted from the JSON file, are explicitly provided. There are various methods to format these entries for LLMs. Figure 7.4 illustrates two different example formats, often referred to as prompt styles, used in the training of notable LLMs such as Alpaca and Phi-3. Alpaca was one of the early LLMs to publicly detail its instruction finetuning process. Phi-3, developed by Microsoft, is included to demonstrate the diversity in prompt styles.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-253.jpg?height=677&width=1235&top_left_y=1071&top_left_x=314)
Figure 7.4 Comparison of prompt styles for instruction finetuning in LLMs. The Alpaca style (left) uses a structured format with defined sections for instruction, input, and response, while the Phi-3 style (right) employs a simpler format with designated <|user|> and <|assistant|> tokens.
The rest of this chapter uses the Alpaca prompt style since it is one of the most popular ones, largely because it helped define the original approach to finetuning.
\section*{EXERCISE 7.1 CHANGING PROMPT STYLES}
After finetuning the model with the Alpaca prompt style, try the Phi-3 prompt style shown in figure 7.4 and observe if it affects the response quality of the model.
Let's define a format_input function that we can use to convert the entries in the data list into the Alpaca-style input format depicted in figure 7.4:
\section*{Listing 7.2 Implementing the prompt formatting function}
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text
This format_input function takes a dictionary entry as input and constructs a formatted string. Let's test it on the dataset entry data[50], which we looked at earlier:
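A minimal sketch of that call (the variable name model_input is illustrative):

model_input = format_input(data[50])
print(model_input)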
The formatted input looks as follows:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
\#\#\# Instruction:
Identify the correct spelling of the following word.
\#\#\# Input:
Ocassion
\#\#\# Response:
The correct spelling is 'Occasion.'
Note that the format_input function skips the optional \#\#\# Input: section if the 'input' field is empty, which we can test by applying the format_input function to the entry data[999] that we inspected earlier:
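For example (a minimal sketch, again assuming the data list from listing 7.1):

print(format_input(data[999]))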
As we can see based on the following output, entries with an empty 'input' field don't contain an \#\#\# Input: section in the formatted input:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
\#\#\# Instruction:
What is an antonym of 'complicated'?
\#\#\# Response:
An antonym of 'complicated' is 'simple'.
Before we move on to setting up the PyTorch data loaders in the next section, let's divide the dataset into training, validation, and test sets analogous to what we have done with the spam classification dataset in the previous chapter. Here's how we calculate the portions:
\section*{Listing 7.3 Partitioning the dataset}
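A minimal sketch of such a partitioning; the 85%/10%/5% split ratios are assumptions inferred from the set sizes reported below:

train_portion = int(len(data) * 0.85)  # 85% of the data for training
test_portion = int(len(data) * 0.1)    # 10% for testing
val_portion = len(data) - train_portion - test_portion  # remaining 5% for validation

train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]

print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))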
This partitioning results in the following dataset sizes:
Training set length: 935
Validation set length: 55
Test set length: 110
Having successfully downloaded and partitioned the dataset, and gained a clear understanding of the dataset prompt formatting, we are now ready for the core implementation of the instruction finetuning process. In the upcoming section, we will focus on developing the method for constructing the training batches for finetuning the LLM.
\subsection*{7.3 Organizing data into training batches}
As we progress into the implementation phase of our instruction finetuning process, the next step, illustrated in figure 7.5, focuses on constructing the training batches effectively. This involves defining a method that will ensure our model receives the formatted training data during the finetuning process.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-256.jpg?height=788&width=1263&top_left_y=900&top_left_x=302)
Figure 7.5 After downloading the dataset and implementing text formatting utility function in the previous section, this section focuses on assembling the training batches.
In the previous chapter, the training batches were created automatically by the PyTorch DataLoader class, which employs a default collate function to combine lists of samples into batches. A collate function is responsible for taking a list of individual data samples and merging them into a single batch that can be processed efficiently by the model during training.
However, the batching process for instruction finetuning in this chapter is a bit more involved and requires us to create our own custom collate function that we will later plug into the DataLoader. We implement this custom collate function to handle the specific requirements and formatting of our instruction finetuning dataset.
In this section, we will tackle the batching process in several steps including the coding of the custom collate function, as illustrated in figure 7.6.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-257.jpg?height=1080&width=1228&top_left_y=497&top_left_x=319)
Figure 7.6 An illustration of the five substeps involved in implementing the batching process: applying the prompt template defined in the previous section, using tokenization from previous chapters, adding padding tokens, creating target token IDs, and replacing -100 placeholder tokens to mask padding tokens in the loss function.
First, to implement steps 2.1 and 2.2 as illustrated in figure 7.6, we code an InstructionDataset class that applies format_input from the previous section and pretokenizes all inputs in the dataset, similar to the SpamDataset in chapter 6. These two steps are illustrated in more detail in figure 7.7.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-258.jpg?height=660&width=1261&top_left_y=184&top_left_x=301)
Figure 7.7 This diagram shows how entries are first formatted using a specific prompt template and then tokenized, resulting in a sequence of token IDs that the model can process.
The two-step process illustrated in figure 7.7 is implemented in the __init__ constructor method of the InstructionDataset:
Listing 7.4 Implementing an instruction dataset class
\#A Pre-tokenize texts
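A minimal sketch of such a dataset class, assuming the format_input function from listing 7.2 and a tiktoken tokenizer; the \#\#\# Response: section header is an assumption consistent with the Alpaca prompt style:

import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        self.encoded_texts = []
        for entry in data:  #A
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)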
Similar to the approach in chapter 6, we aim to accelerate training by collecting multiple training examples in a batch, which necessitates padding all inputs to a similar length. As in the previous chapter, we use the <|endoftext|> token as a padding token.
Instead of appending the <|endoftext|> tokens to the text inputs, we can append its token ID to the pre-tokenized inputs directly. To remind us which token ID we should use, we can use the tokenizer's .encode method on an <|endoftext|> token:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

The resulting token ID is 50256.
In chapter 6, we padded all examples in a dataset to the same length. Moving on to step 2.3 in figure 7.6, here, we adopt a more sophisticated approach by developing a custom collate function that we can pass to the data loader. This custom collate function pads the training examples in each batch to have the same length, while allowing different batches to have different lengths, as illustrated in figure 7.8. This approach minimizes unnecessary padding by only extending sequences to match the longest one in each batch, not the whole dataset.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-260.jpg?height=330&width=1238&top_left_y=184&top_left_x=312)
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-260.jpg?height=337&width=1172&top_left_y=586&top_left_x=319)
Figure 7.8 This figure shows the padding of training examples in batches using token ID 50256 to ensure uniform length within each batch. Each batch may have different lengths, as shown by the first and second batches in this figure.
We can implement the padding process illustrated in figure 7.8 with a custom collate function as follows:
\#A Find the longest sequence in the batch
\#B Pad and prepare inputs
\#C Remove extra padded token added earlier
\#D Convert list of inputs to tensor and transfer to target device
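A minimal sketch of this draft collate function, following the #A-#D annotations above (the parameter names are assumptions):

import torch

def custom_collate_draft_1(batch, pad_token_id=50256, device="cpu"):
    batch_max_length = max(len(item)+1 for item in batch)  #A

    inputs_lst = []
    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))  #B
        inputs = torch.tensor(padded[:-1])  #C
        inputs_lst.append(inputs)

    inputs_tensor = torch.stack(inputs_lst).to(device)  #D
    return inputs_tensor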
The custom_collate_draft_1 we implemented is designed to be integrated into a PyTorch DataLoader, but it can also function as a standalone tool. Here, we use it independently to test and verify that it operates as intended. Let's try it on three different inputs that we want to assemble into a batch, where each example gets padded to the same length:
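For example, with three hypothetical token ID lists (the exact IDs are illustrative; the text below only assumes that inputs_1 holds 5 token IDs):

inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]

batch = (inputs_1, inputs_2, inputs_3)
print(custom_collate_draft_1(batch))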
The resulting batch looks as follows:
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-262.jpg?height=146&width=703&top_left_y=193&top_left_x=245)
As we can see based on the preceding output, all inputs have been padded to the length of the longest input list, inputs_1, which contains 5 token IDs.
So, we have just implemented our first custom collate function to create batches from lists of inputs. However, as you learned in chapters 5 and 6 , we also need to create batches with the target token IDs, corresponding to the batch of input IDs. These target IDs, as illustrated in figure 7.9, are crucial because they represent what we want the model to generate and what we need during training to calculate the loss for the weight updates, similar to previous chapters.
(Figure content: the input is formatted into an instruction-response template and tokenized; end-of-text tokens (50256) are added to pad the data samples to the same length; and a list of target token IDs is created for the model to learn, consisting of the inputs shifted by one position plus an additional padding token.)
Figure 7.9 An illustration of the five substeps involved in implementing the batching process. We are now focusing on step 2.4, the creation of target token IDs. This step is essential as it enables the model to learn and predict the tokens it needs to generate.
As illustrated in figure 7.9, we are now modifying our custom collate function to also return the target token IDs in addition to the input token IDs.
Similar to the process described in chapter 5 for pretraining an LLM, the target token IDs match the input token IDs but are shifted one position to the right. This setup, as shown in figure 7.10, allows the LLM to learn how to predict the next token in a sequence.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-263.jpg?height=479&width=1013&top_left_y=458&top_left_x=521)
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-263.jpg?height=375&width=870&top_left_y=1151&top_left_x=522)
(Figure annotations: the target vector does not contain the first input ID; the token IDs in the target are similar to the input IDs but shifted by one position; an end-of-text (padding) token is always added to the target.)
Figure 7.10 This figure illustrates the input and target token alignment used in the instruction finetuning process of an LLM. For each input sequence, the corresponding target sequence is created by shifting the token IDs one position to the right, omitting the first token of the input, and appending an end-of-text token.
The following updated collate function generates the target token IDs, as illustrated in figure 7.10 , from the input token IDs:
\#A Truncate the last token for inputs
\#B Shift +1 to the right for targets
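A minimal sketch of this extended draft, building on custom_collate_draft_1 and following the #A and #B annotations above:

import torch

def custom_collate_draft_2(batch, pad_token_id=50256, device="cpu"):
    batch_max_length = max(len(item)+1 for item in batch)

    inputs_lst, targets_lst = [], []
    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])   #A
        targets = torch.tensor(padded[1:])   #B
        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor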
Applied to the example batch consisting of 3 input lists we defined earlier, the new custom_collate_draft_2 function now returns the input and the target batch:
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-264.jpg?height=269&width=1047&top_left_y=1560&top_left_x=257)
\#A The 1st tensor represents inputs
\#B The 2nd tensor represents the targets
In the next step, we assign a -100 placeholder value to all padding tokens, as illustrated in figure 7.11. This special value allows us to exclude these padding tokens from contributing to the training loss calculation, ensuring that only meaningful data influences model learning.
More details on this process will be discussed after implementing this modification. (In chapter 6, we did not have to worry about this since we only trained the model based on the last output token.)
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-265.jpg?height=1006&width=1247&top_left_y=355&top_left_x=305)
Figure 7.11 This figure illustrates step 2.5 in the token replacement process we apply to the data batches. After creating the target sequence by shifting token IDs one position to the right and appending an end-of-text token, step 2.5 focuses on replacing end-of-text padding tokens with a placeholder value (-100).
In step 2.5, as shown in figure 7.11, we replace the end-of-text tokens, which we previously used as padding tokens and are assigned token ID 50256, with -100 in the target token list. (The choice of -100 as a replacement will be clarified later.)
However, note that we retain one end-of-text token, ID 50256, in the target list, as depicted in figure 7.12. This allows the LLM to learn when to generate an end-of-text token in response to instructions, which we use as an indicator that the generated response is complete.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-266.jpg?height=432&width=1252&top_left_y=166&top_left_x=298)
Figure 7.12 This figure illustrates step 2.5 in the token replacement process in the target batch for the training data preparation. It shows the replacement of all but the first instance of the end-of-text token, which we use as padding, with the placeholder value -100, while keeping the initial end-of-text token in each target sequence.
In the following code, we modify our custom collate function to replace tokens with ID 50256 with -100 in the target lists, as illustrated in figure 7.12. Additionally, we introduce an allowed_max_length parameter to optionally limit the length of the samples. This adjustment will be useful if you plan to work with your own datasets that exceed the 1024-token context size supported by the GPT-2 model. The code for this updated collate function is as follows:
\section*{Listing 7.5 Implementing a custom batch collate function}
def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    batch_max_length = max(len(item)+1 for item in batch)
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])  # Truncate the last token for inputs
        targets = torch.tensor(padded[1:])  # Shift +1 to the right for targets

        mask = targets == pad_token_id          #A
        indices = torch.nonzero(mask).squeeze() #A
        if indices.numel() > 1:                 #A
            targets[indices[1:]] = ignore_index #A

        if allowed_max_length is not None:      #B
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)

    return inputs_tensor, targets_tensor
\#A Replace all but the first padding tokens in targets by ignore_index
\#B Optionally truncate to maximum sequence length
Again, let's try the collate function on the sample batch that we created earlier to check that it works as intended:
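For instance (a minimal sketch, reusing the batch tuple from the custom_collate_draft_1 example):

inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)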
The results are as follows, where the first tensor represents the inputs, and the second tensor represents the targets:
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-267.jpg?height=276&width=715&top_left_y=1312&top_left_x=246)
The modified collate function works as expected, altering the target list by inserting the token ID -100. What is the logic behind this adjustment? Let's explore the underlying purpose of this modification.
For demonstration purposes, consider the following simple and self-contained example where each output logit can correspond to a potential token from the model's vocabulary. Here's how we might calculate the cross entropy loss (introduced in chapter 5) during training when the model predicts a sequence of tokens, similar to what we have done in chapter 5 when pretraining the model, or in chapter 6 when finetuning the model for classification:
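A small, self-contained sketch of such a calculation; the logit values here are illustrative and chosen so that the resulting loss matches the value reported below:

import torch

logits_1 = torch.tensor(
    [[-1.0, 1.0],   # predicted logits for the 1st token
     [-0.5, 1.5]]   # predicted logits for the 2nd token
)
targets_1 = torch.tensor([0, 1])  # correct token indices to generate

loss_1 = torch.nn.functional.cross_entropy(logits_1, targets_1)
print(loss_1)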
The loss value calculated by the previous code is 1.1269 .
tensor(1.1269)
Adding an additional token ID will, as we would expect, affect the loss calculation.
\#A New 3rd token ID prediction
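Continuing the sketch from above, a third prediction row and target entry are appended:

logits_2 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5],
     [-0.5, 1.5]]   #A New 3rd token ID prediction
)
targets_2 = torch.tensor([0, 1, 1])

loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print(loss_2)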
The loss value, after adding the third token, is now 0.7936 .
So far, we have carried out some more or less obvious example calculations using the cross entropy loss function in PyTorch, the same loss function we used in the training functions of chapters 5 and 6 , as well as the one we will use in this chapter.
Now, let's get to the interesting part and see what happens if we replace the third target token ID with -100 :
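Sketching this with the same logits as before, only the third target entry changes:

targets_3 = torch.tensor([0, 1, -100])

loss_3 = torch.nn.functional.cross_entropy(logits_2, targets_3)
print(loss_3)
print("loss_1 == loss_3:", loss_1 == loss_3)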
Based on this result, we can see that the resulting loss on these 3 training examples is identical to the loss we calculated from the 2 training examples earlier. In other words, the cross entropy loss function ignored the third entry in the targets_3 vector, the token ID corresponding to -100. (Interested readers can try to replace the -100 value with another token ID that is not 0 or 1 and will see that this results in an error.)
So, what's so special about -100 that it's ignored by the cross entropy loss? The default setting of the cross entropy function in PyTorch is cross_entropy(..., ignore_index=-100). This means that it ignores targets labeled with -100 .
In this chapter, we take advantage of this ignore_index to ignore the additional end-of-text (padding) tokens that we used to pad the training examples to have the same length in each batch.
However, as shown earlier in figure 7.12, we want to keep one 50256 (end-of-text) token ID in the targets because it helps the LLM to learn to generate end-of-text tokens, which we can use as an indicator that a response is complete.
In addition to masking out padding tokens, it is also common to mask out the target token IDs that correspond to the instruction, as illustrated in figure 7.13.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-269.jpg?height=508&width=1256&top_left_y=956&top_left_x=292)
(Figure content: for a formatted example ending in "\#\#\# Response: Great results were achieved by the team.<|endoftext|>", the tokenized target becomes [-100, -100, -100, -100, -100, ..., 13, 50256], where the instruction tokens are replaced by -100.)
Figure 7.13 The left side shows the formatted input text we tokenize and then feed to the LLM during training. The right side shows the target text we prepare for the LLM where we can optionally mask out the instruction section, which means replacing the corresponding token IDs with the -100 ignore_index value.
By masking out the target token IDs that correspond to the instruction, as shown in figure 7.13, the cross entropy loss is only computed for the generated response target IDs. By masking out the instruction tokens, the model is trained to focus on generating accurate responses rather than also memorizing the instructions, which can help reduce overfitting.
Currently, researchers are divided on whether masking the instructions as shown in figure 7.13 is universally beneficial during instruction finetuning. For instance, a recent paper titled "Instruction Tuning With Loss Over Instructions" demonstrated that not masking the instructions benefits the LLM performance (see the references in appendix B for more details). In this chapter, we do not apply masking and leave it as an optional exercise for the reader.
\section*{EXERCISE 7.2 INSTRUCTION AND INPUT MASKING}
After completing the chapter and finetuning the model with the InstructionDataset implemented in this section, replace the instruction and input tokens with the -100 mask to implement the instruction masking method illustrated in Figure 7.13. Then, evaluate whether this has a positive effect on model performance.
\subsection*{7.4 Creating data loaders for an instruction dataset}
In the previous section, we went through several stages to implement an InstructionDataset class and a custom_collate_fn function for the instruction dataset. In this section, as shown in figure 7.14, we can reap the fruits of our labor by simply plugging both InstructionDataset objects and the custom_collate_fn function into PyTorch data loaders. These loaders will automatically shuffle and organize the batches for the LLM instruction finetuning process.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-271.jpg?height=788&width=1270&top_left_y=167&top_left_x=298)
Figure 7.14 In previous sections, we prepared the dataset and implemented a custom collate function for batching the instruction dataset. In this section, we create and apply the data loaders to the training, validation, and test sets that we need for the LLM instruction finetuning and evaluation.
Before we implement the data loader creation step shown in figure 7.14 , we have to briefly talk about the device setting of the custom_collate_fn we implemented in the previous section.
The custom_collate_fn includes code to move the input and target tensors (for example, torch.stack(inputs_lst).to(device)) to a specified device, which can be either "cpu" or "cuda" (for GPUs), or optionally "mps" for Macs with Apple Silicon chips. (Note that using an "mps" device may result in numerical differences compared to the contents of this chapter, as Apple Silicon support in PyTorch is still experimental.)
In previous chapters, we moved the data onto the target device (for example, the GPU memory when device="cuda") in the main training loop. Having this as part of the collate function offers the advantage of performing this device transfer process as a background process outside the training loop, preventing it from blocking the GPU during model training.
The following code initializes the device variable:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# if torch.backends.mps.is_available():  #A
#     device = torch.device("mps")       #A
\#A Uncomment these two lines to use the GPU on an Apple Silicon chip
Next, to reuse the chosen device setting in custom_collate_fn when we plug it into the PyTorch DataLoader class later in this section, we use the partial function from Python's functools standard library to create a new version of the function with the device argument pre-filled. Additionally, we set the allowed_max_length to 1024, which truncates the data to the maximum context length supported by the GPT-2 model we finetune later in this chapter:
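A minimal sketch of this step (the name customized_collate_fn is an assumption):

from functools import partial

customized_collate_fn = partial(
    custom_collate_fn,
    device=device,
    allowed_max_length=1024
)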
Next, we can set up the data loaders as we did in previous chapters, but this time we will use our custom collate function for the batching process:
Listing 7.6 Initializing the data loaders
\#A You can try to increase this number if parallel Python processes are supported by your operating system
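A minimal sketch of such a setup, assuming the train_data, val_data, and test_data splits from listing 7.3 and the batch size of 8 mentioned below; the seed value is illustrative:

import torch
from torch.utils.data import DataLoader

num_workers = 0  #A
batch_size = 8

torch.manual_seed(123)

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)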
Let's examine the dimensions of the input and target batches generated by the training loader:
print("Train loader:") 打印("火车加载器:")
for inputs, targets in train_loader: 对于 inputs, targets in train_loader:
print(inputs.shape, targets.shape)
The output is as follows (truncated for space reasons):
In the preceding output, we can see that the first input and target batch have dimensions \(8 \times 61\), where 8 represents the batch size, and 61 is the number of tokens in each training example in this batch. The second input and target batch have a different number of tokens, for instance, 76 .
As we saw in the preceding code output, thanks to our custom collate function, the data loader is able to create batches of different lengths. In the next section, we load a pretrained LLM that we can then finetune with this data loader.
\subsection*{7.5 Loading a pretrained LLM}
In the previous sections, we spent a lot of time preparing the dataset for instruction finetuning, which is a key aspect of the supervised finetuning process. Many other aspects are the same as in pretraining, allowing us to reuse much of the code from earlier chapters.
Before beginning instruction finetuning, we first load a pretrained GPT model, as shown in figure 7.15, that we want to finetune.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-274.jpg?height=759&width=1250&top_left_y=276&top_left_x=299)
Figure 7.15 After the dataset preparation, the process of finetuning an LLM for instruction-following begins with loading a pretrained LLM, which serves as the foundation for subsequent training. This pretrained model, having already learned general language patterns and knowledge from vast amounts of text data, is then adapted for instruction following through the finetuning process in the next section.
As shown in the chapter overview diagram in figure 7.15, this section focuses on step 4, loading a pretrained LLM to serve as the starting point for instruction finetuning, similar to the process in previous chapters. However, instead of using the smallest 124 million parameter model as before, we load the medium-sized model with 355 million parameters. The reason for this choice is that the 124 million parameter model is too limited in capacity to achieve qualitatively satisfactory results via instruction finetuning.
This is done using the same code as in section 5.5 of chapter 5 and section 6.4 of chapter 6 except that we now specify "gpt2-medium (355M)" instead of "gpt2-small (124M)". Please note that executing the code provided below will initiate the download of the medium-sized GPT model, which has a storage requirement of approximately 1.42 gigabytes. This is roughly three times larger than the storage space needed for the small model:
Listing 7.7 Loading the pretrained model
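A minimal sketch of this step, reusing the weight-loading helpers from earlier chapters; the module and helper names download_and_load_gpt2, GPTModel, and load_weights_into_gpt are assumptions carried over from chapters 4 and 5, while BASE_CONFIG and CHOOSE_MODEL are referenced again later in this chapter:

from gpt_download import download_and_load_gpt2
from chapter04 import GPTModel
from chapter05 import load_weights_into_gpt

BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
}

CHOOSE_MODEL = "gpt2-medium (355M)"
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()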
After executing the code in the previous section, several files will be downloaded, similar to the process in earlier chapters. The downloaded files include:
Before diving into finetuning the model in the next section, let's take a moment to assess the pretrained LLM's performance on one of the validation tasks by comparing its output to the expected response. This will give us a baseline understanding of how well the model performs on an instruction-following task right out of the box, prior to finetuning, and will help us appreciate the impact of finetuning later on. We use the first example from the validation set for this assessment:
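For instance, assuming the format_input function and the val_data split from earlier sections:

input_text = format_input(val_data[0])
print(input_text)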
Below is an instruction that describes a task. Write a response that appropriately completes the request.

\#\#\# Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'
Next, we generate the model's response using the generate function from chapter 5:

from chapter05 import generate, text_to_token_ids, token_ids_to_text

token_ids = generate(
    model=model,
    idx=text_to_token_ids(input_text, tokenizer),
    max_new_tokens=35,
    context_size=BASE_CONFIG["context_length"],
    eos_id=50256,
)
generated_text = token_ids_to_text(token_ids, tokenizer)
It's important to note that the generate function returns the combined input and output text. This behavior was convenient in previous chapters since pretrained LLMs are primarily designed as text-completion models, where the input and output are concatenated to create a coherent and legible text. However, when evaluating the model's performance on a specific task, we often want to focus solely on the model's generated response.
To isolate the model's response text, we need to subtract the length of the input instruction from the start of the generated_text:
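A minimal sketch consistent with the description that follows:

response_text = generated_text[len(input_text):].strip()
print(response_text)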
This code snippet removes the input text from the beginning of the generated_text, leaving us with only the model's generated response. The strip() function is then applied to remove any leading or trailing whitespace characters. The output is as follows:
\#\#\# Response:
The chef cooks the meal every day.
\#\#\# Instruction:
Convert the active sentence to passive: 'The chef cooks the
As we can see from the output, the pretrained model is not yet capable of correctly following the given instruction. While it does create a "Response" section, it simply repeats the original input sentence and part of the instruction, failing to convert the active sentence to passive voice as requested.
In the upcoming section, we implement the finetuning process to improve the model's ability to comprehend and appropriately respond to such requests.
\subsection*{7.6 Finetuning the LLM on instruction data}
As illustrated in the chapter overview in figure 7.16, this section focuses on finetuning the LLM. We take the pretrained model loaded in the previous section and further train it using the instruction dataset prepared earlier in this chapter.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-277.jpg?height=573&width=1233&top_left_y=1312&top_left_x=317)
Figure 7.16 In step 5 of finetuning the LLM for instruction-following, we train the pretrained model loaded in the previous section on the instruction dataset prepared earlier in this chapter.
As mentioned earlier, we already did all the hard work when we implemented the instruction dataset processing at the beginning of this chapter. For the finetuning process itself, we can reuse the loss calculation and training functions implemented in chapter 5 during the pretraining:
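A minimal sketch of this step, assuming the calc_loss_loader utility from chapter 5; the number of evaluated batches is illustrative:

from chapter05 import calc_loss_loader

model.to(device)

torch.manual_seed(123)
with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)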
The initial loss values are as follows; as in previous chapters, our goal is to minimize this loss:
Training loss: 3.825908660888672
Validation loss: 3.7619335651397705
\section*{DEALING WITH HARDWARE LIMITATIONS}
Note that using and training a larger model like GPT-2 medium (355 million parameters) is more computationally intensive compared to the smaller GPT-2 model (124 million parameters) used in previous chapters. If you encounter issues due to hardware limitations, you can switch to the smaller model by changing CHOOSE_MODEL = "gpt2-medium (355M)" to CHOOSE_MODEL = "gpt2-small (124M)" in section 7.5. Alternatively, to speed up the model training, consider using a GPU. The following supplementary section in this book's code repository lists several options for using cloud GPUs: https://github.com/rasbt/LLMs-from-scratch/tree/main/setup
Table 7.1 provides reference runtimes for training each model on various devices, including CPUs and GPUs. Running this code on a compatible GPU requires no code changes and can significantly speed up training. For the results shown in this chapter, I used the GPT-2 medium model and trained it on an A100 GPU.
Table 7.1 Reference runtimes for instruction finetuning GPT-2
\begin{tabular}{|l|l|l|}
\hline Model name & Device & \begin{tabular}{l}
Runtime for 2 \\
epochs
\end{tabular} \\
\hline gpt2-medium (355M) & CPU (M3 MacBook Air) & 15.78 minutes \\
\hline gpt2-medium (355M) & GPU (NVIDIA L4) & 1.83 minutes \\
\hline gpt2-medium (355M) & GPU (NVIDIA A100) & 0.86 minutes \\
\hline gpt2-small (124M) & CPU (M3 MacBook Air) & 5.74 minutes \\
\hline gpt2-small (124M) & GPU (NVIDIA L4) & 0.69 minutes \\
\hline gpt2-small (124M) & GPU (NVIDIA A100) & 0.39 minutes \\
\hline
\end{tabular}
With the model and data loaders prepared, we can now proceed to train the model. The following code sets up the training process, including initializing the optimizer, setting the number of epochs, and defining the evaluation frequency and starting context to evaluate generated LLM responses during training based on the first validation set instruction (val_data[0]) we looked at earlier:
\section*{Listing 7.8 Instruction finetuning the pretrained LLM}
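A minimal sketch of the training setup, assuming the train_model_simple function from chapter 5; the optimizer hyperparameters and evaluation settings shown here are illustrative:

import time
from chapter05 import train_model_simple

start_time = time.time()
torch.manual_seed(123)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)
num_epochs = 2

train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context=format_input(val_data[0]), tokenizer=tokenizer
)

end_time = time.time()
print(f"Training completed in {(end_time - start_time)/60:.2f} minutes.")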
The following output displays the training progress over two epochs, where a steady decrease in losses indicates improving ability to follow instructions and generate appropriate responses:
Ep 1 (Step 000000): Train loss 2.637, Val loss 2.626
Ep 1 (Step 000005): Train loss 1.174, Val loss 1.103
Ep 1 (Step 000010): Train loss 0.872, Val loss 0.944
Ep 1 (Step 000015): Train loss 0.857, Val loss 0.906
...
Ep 1 (Step 000115): Train loss 0.520, Val loss 0.665
Below is an instruction that describes a task. Write a response that appropriately
completes the request. ### Instruction: Convert the active sentence to passive: 'The
chef cooks the meal every day.' ### Response: The meal is prepared every day by the
chef.<|endoftext|>The following is an instruction that describes a task. Write a
response that appropriately completes the request. ### Instruction: Convert the active
sentence to passive:
Ep 2 (Step 000120): Train loss 0.438, Val loss 0.670
Ep 2 (Step 000125): Train loss 0.453, Val loss 0.685
Ep 2 (Step 000130): Train loss 0.448, Val loss 0.681
Ep 2 (Step 000135): Train loss 0.408, Val loss 0.677
...
Ep 2 (Step 000230): Train loss 0.300, Val loss 0.657
Below is an instruction that describes a task. Write a response that appropriately
completes the request. ### Instruction: Convert the active sentence to passive: 'The
chef cooks the meal every day.' ### Response: The meal is cooked every day by the chef.
<|endoftext|>The following is an instruction that describes a task. Write a response
that appropriately completes the request. ### Instruction: What is the capital of the
United Kingdom
Training completed in 0.87 minutes.
The training output shows that the model is learning effectively, as we can tell based on the consistently decreasing training and validation loss values over the two epochs. This suggests that the model is gradually improving its ability to understand and follow the provided instructions. (Since the model demonstrated effective learning within these two epochs, extending the training to a third epoch or more is not essential, and may even be counterproductive here as it could lead to increased overfitting.)
Moreover, the generated responses at the end of each epoch let us inspect the model's progress in correctly executing the given task in the validation set example. In this case, the model successfully converts the active sentence "The chef cooks the meal every day." into its passive voice counterpart: "The meal is cooked every day by the chef."
We will revisit and evaluate the response quality of the model in more detail in a later section. But now, to conclude this section, let's examine the training and validation loss curves to gain additional insights into the model's learning process. For this, we use the plot_losses function from chapter 5:
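For instance (a minimal sketch; the plot_losses signature shown here is an assumption carried over from chapter 5):

from chapter05 import plot_losses

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)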
The resulting loss plot is shown in figure 7.17.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-282.jpg?height=709&width=1230&top_left_y=487&top_left_x=316)
Figure 7.17 A plot showing the training and validation loss trends over two epochs. The solid line represents the training loss, showing a sharp decrease before stabilizing, while the dotted line represents the validation loss, which follows a similar pattern.
As we can see in the loss plot shown in figure 7.17, the model's performance on both the training and validation sets improves substantially over the course of training. The rapid decrease in losses during the initial phase indicates that the model is quickly learning meaningful patterns and representations from the data. Then, as training progresses to the second epoch, the losses continue to decrease but at a slower rate, suggesting that the model is finetuning its learned representations and converging to a stable solution.
While the loss plot in figure 7.17 indicates that the model is training effectively, the most crucial aspect is its performance in terms of response quality and correctness. In the remaining sections of this chapter, we will extract the responses and store them in a format that allows us to evaluate and quantify the response quality.
\section*{EXERCISE 7.3 FINETUNING ON THE ORIGINAL ALPACA DATASET}
The so-called Alpaca dataset by researchers at Stanford is one of the earliest and most popular openly shared instruction datasets, consisting of 52,002 entries. As an alternative to the instruction-data.json file we use in this chapter, consider finetuning an LLM on this dataset. The dataset is available at the following URL: https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
This dataset contains 52,002 entries, which is approximately 50 times more than those we used in this chapter, and most entries are longer as well. Thus, it's highly recommended to conduct the training using a GPU to accelerate the finetuning process. If you encounter out-of-memory errors, consider reducing the batch_size from 8 to 4, 2, or even 1. Additionally, lowering the allowed_max_length from 1024 to 512 or 256 can further help manage memory issues.
\subsection*{7.7 Extracting and saving responses}
After finetuning the LLM on the training portion of the instruction dataset as described in the previous section, we now proceed to evaluate its performance on the held-out test set. To accomplish this, we first extract the model-generated responses for each input in the test dataset and collect them for manual analysis as illustrated in the chapter overview in figure 7.18.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-284.jpg?height=890&width=1235&top_left_y=182&top_left_x=314)
Figure 7.18 This section is focused on extracting and collecting the model responses on the held-out test dataset for further analysis. The next section covers model evaluation to quantify the performance of the instruction-finetuned LLM.
We start with step 7, the instruction response extraction step illustrated in figure 7.18, using the generate function. We then print the model responses alongside the expected test set answers for the first three test set entries, presenting them side by side for comparison:
\#A Iterate over the first 3 test set samples
\#B Use the generate function imported in section 7.5
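A minimal sketch of this step, following the #A and #B annotations above and the slicing and .replace() extraction described below:

torch.manual_seed(123)

for entry in test_data[:3]:  #A
    input_text = format_input(entry)

    token_ids = generate(  #B
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )

    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text}")
    print("-------------------------------------")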
As mentioned earlier, the generate function returns the combined input and output text, so we use slicing and the .replace() method on the generated_text contents to extract the model's response. The instructions, followed by the given test set response and model response are shown below:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
\#\#\# Instruction:
Rewrite the sentence using a simile.
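Beyond these three examples, the responses for the full test set also need to be collected and saved, since section 7.8 loads them from instruction-data-with-response.json under the key model_response. A minimal sketch of that collection and saving step (the loop details are illustrative):

from tqdm import tqdm

for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = generated_text[len(input_text):].replace("### Response:", "").strip()

    test_data[i]["model_response"] = response_text

with open("instruction-data-with-response.json", "w") as file:
    json.dump(test_data, file, indent=4)  # "indent" for pretty-printing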
import re

# Remove white spaces and parentheses from file name
file_name = f"{re.sub(r'[ ()]', '', CHOOSE_MODEL)}-sft.pth"
torch.save(model.state_dict(), file_name)
print(f"Model saved as {file_name}")

The saved model can then be loaded via model.load_state_dict(torch.load("gpt2-medium355M-sft.pth")).
\subsection*{7.8 Evaluating the finetuned LLM}
Previously, we judged the performance of an instruction finetuned model by looking at its responses on 3 examples of the test set. While this gives us a rough idea of how well the model performs, this method does not really scale well to larger amounts of responses. So, in this section, as indicated in the chapter overview in figure 7.19, we implement a method to automate the response evaluation of the finetuned LLM using another, larger LLM.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-289.jpg?height=799&width=1249&top_left_y=939&top_left_x=302)
After extracting the responses by our finetuned LLM, we use another LLM to automatically evaluate these responses
Figure 7.19 In this last step of the instruction finetuning pipeline, we implement a method to quantify the performance of the finetuned model by scoring the responses it generated for the test.
To implement step 9 shown in figure 7.19, which involves evaluating test set responses in an automated fashion, we utilize an existing instruction-finetuned 8 billion parameter Llama 3 model developed by Meta AI. This model can be run locally using the open-source Ollama application (https://ollama.com).
Ollama is an efficient application for running LLMs on a laptop. It serves as a wrapper around the open-source llama.cpp library (https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency. However, note that Ollama is only a tool for generating text using LLMs (inference) and does not support training or finetuning LLMs.
\section*{USING LARGER LLMS VIA WEB APIS}
The 8 billion parameter Llama 3 model is a very capable LLM that runs locally. However, it's not as capable as large proprietary LLMs such as GPT-4 offered by OpenAI. For readers interested in exploring how to utilize GPT-4 through the OpenAI API to assess generated model responses, an optional code notebook is available within the supplementary materials accompanying this book at https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/03_model-evaluation/llm-instruction-eval-openai.ipynb
To execute the following code, please install Ollama by visiting https://ollama.com and following the provided instructions for your operating system:
- For macOS and Windows users: Open the downloaded Ollama application. If prompted to install command line usage, select "yes."
- For Linux users: Use the installation command available on the Ollama website.
Before implementing the model evaluation code, let's first download the Llama 3 model and verify that Ollama is functioning correctly by using it from the command line terminal.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-291.jpg?height=620&width=1235&top_left_y=176&top_left_x=311)
Figure 7.20 Two options for running Ollama. The left panel illustrates starting Ollama using ollama serve. The right panel shows a second option in macOS, running the Ollama application in the background instead of using the ollama serve command to start the application.
With the Ollama application or ollama serve running in a different terminal as shown in figure 7.20, execute the following command on the command line (not in a Python session) to try out the 8 billion parameter Llama 3 model:
ollama run llama3
The first time you execute this command, the 8 billion parameter Llama 3 model, which takes up 4.7 GB of storage space, will be automatically downloaded. The output looks as follows:
\section*{ALTERNATIVE OLLAMA MODELS}
Note that the llama3 in the ollama run llama3 command refers to the instruction-finetuned 8 billion parameter Llama 3 model. Using Ollama with the llama3 model requires approximately 16 GB of RAM. If your machine does not have sufficient RAM, you can try using a smaller model, such as the 3.8 billion parameter phi-3 model via ollama run phi3, which only requires around 8 GB of RAM.
For more powerful computers, you can also use the larger 70 billion parameter Llama 3 model by replacing llama3 with llama3:70b. However, keep in mind that this model requires significantly more computational resources.
Once the model download is complete, we are presented with a command-line interface that allows us to interact with the model. For example, try asking the model, "What do llamas eat?":
What do llamas eat?
Llamas are ruminant animals, which means they have a four-chambered
stomach and eat plants that are high in fiber. In the wild, llamas
typically feed on:
Grasses: They love to graze on various types of grasses, including tall
grasses, wheat, oats, and barley.
Note that the response you may be seeing might differ since Ollama is not deterministic as of this writing.
You can end this ollama run llama3 session using the input /bye. However, make sure to keep the ollama serve command or the Ollama application running for the remainder of this chapter.
The following code verifies that the Ollama session is running properly before we use Ollama to evaluate the test set responses generated in the previous section:
import psutil

def check_if_running(process_name):
    running = False
    for proc in psutil.process_iter(["name"]):
        if process_name in proc.info["name"]:
            running = True
            break
    return running

ollama_running = check_if_running("ollama")

if not ollama_running:
    raise RuntimeError("Ollama not running. Launch ollama before proceeding.")
print("Ollama running:", check_if_running("ollama"))
Ensure that the output from executing the previous code displays Ollama running: True. If it shows False, please verify that the ollama serve command or the Ollama application is actively running.
\section*{RUNNING THE CODE IN A NEW PYTHON SESSION}
If you closed your Python session after section 7.7, or if you prefer to execute the remaining code in this chapter in a different Python session, you can execute the following code, which loads the instruction and response data file we created in section 7.7 and redefines the format_input function we used earlier (the tqdm progress bar utility is used later):
import json
from tqdm import tqdm

file_path = "instruction-data-with-response.json"

with open(file_path, "r") as file:
    test_data = json.load(file)

def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text
An alternative to the ollama run command for interacting with the model is through its REST API using Python. The following query_model function demonstrates how to use the API :
Listing 7.10 Querying a local Ollama model
\#A Create the data payload as a dictionary
\#B Convert the dictionary to a JSON formatted string and encode it to bytes
\#C Create a request object, setting the method to POST and adding necessary headers
\#D Send the request and capture the response
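A minimal sketch that follows the #A-#D annotations above, assuming Ollama's REST endpoint at http://localhost:11434/api/chat with streaming disabled:

import json
import urllib.request

def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    data = {  #A
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

    payload = json.dumps(data).encode("utf-8")  #B

    request = urllib.request.Request(url, data=payload, method="POST")  #C
    request.add_header("Content-Type", "application/json")

    with urllib.request.urlopen(request) as response:  #D
        response_data = json.loads(response.read().decode("utf-8"))

    return response_data["message"]["content"]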
Before running the subsequent code cells in this notebook, ensure that Ollama is still running. The previous code cells should print "Ollama running: True" to confirm that the model is active and ready to receive requests.
Here's an example of how to use the query_model function we just implemented:
model = "llama3" 模型 = "llama3"
result = query_model("What do Llamas eat?", model) 结果 = 查询_模型("羊驼吃什么?", 模型)
print(result) 打印(结果)
The resulting response is as follows:
Llamas are ruminant animals, which means they have a four-chambered stomach that allows them to digest plant-based foods. Their diet typically consists of:
1. Grasses: Llamas love to graze on grasses, including tall grasses, short grasses, and even weeds.
...
Using the query_model function defined earlier, we can evaluate the responses generated by our finetuned model with a prompt that asks the Llama 3 model to rate our finetuned model's responses on a scale from 0 to 100, based on the given test set response as the reference.
First, we apply this approach to the first three examples from the test set that we examined in a previous section:
for entry in test_data[:3]:
    prompt = (
        f"Given the input {format_input(entry)} "
        f"and correct output {entry['output']}, "
        f"score the model response {entry['model_response']}"
        f" on a scale from 0 to 100, where 100 is the best score. "
    )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model_response"])
    print("\nScore:")
    print(">>", query_model(prompt))
    print("\n------------------------")
This prints outputs similar to the following ones (note that Ollama is not fully deterministic, as of this writing, so the generated texts may vary):
Dataset response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a bullet.

Score:
>> A scoring task!

To evaluate the model response "The car is as fast as a bullet.", I'll consider how well it follows the instruction and uses a simile that's coherent, natural-sounding, and effective in conveying the idea of speed.

Here are some factors to consider:

1. **Follows instruction**: Yes, the model uses a simile to rewrite the sentence.
2. **Coherence and naturalness**: The comparison between the car's speed and a bullet is common and easy to understand. It's a good choice for a simile that conveys the idea of rapid movement.
3. **Effectiveness in conveying idea of speed**: A bullet is known for its high velocity, which makes it an excellent choice to describe a fast-moving car.

Considering these factors, I'd score the model response "The car is as fast as a bullet." around 85 out of 100. The simile is well-chosen, coherent, and effectively conveys the idea of speed. Well done, model!

------------------------

Dataset response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud associated with thunderstorms is a cumulus cloud.

Score:
>> A scoring task!

I'll evaluate the model's response based on its accuracy and relevance to the original instruction.

**Accuracy:** The model's response is partially correct. Cumulus clouds are indeed associated with fair weather and not typically linked to thunderstorms. The correct answer, cumulonimbus, is a type of cloud that is closely tied to thunderstorm formation.

**Relevance:** The model's response is somewhat relevant, as it mentions clouds in the context of thunderstorms. However, the specific type of cloud mentioned (cumulus) is not directly related to thunderstorms.

Considering these factors, I would score the model response a **40 out of 100**. While the response attempts to address the instruction, it provides an incorrect answer and lacks relevance to the original question.

------------------------

Dataset response:
>> Jane Austen.

Model response:
>> The author of 'Pride and Prejudice' is Jane Austen.

Score:
>> A simple one!

My model response: "The author of 'Pride and Prejudice' is Jane Austen."

Score: 99

Reasoning:
The response directly answers the question, providing the correct name of the author.
The sentence structure is clear and easy to understand.
There's no room for misinterpretation or ambiguity.

Overall, a perfect score!
Based on the generated responses, we can observe that the Llama 3 model provides reasonable evaluations and is capable of assigning partial points when a model's answer is not entirely correct. For instance, if we consider the evaluation of the "cumulus cloud" answer, the model acknowledges the partial correctness of the response.
The previous prompt returns highly detailed evaluations in addition to the score. We can modify the prompt to generate only integer scores ranging from 0 to 100, where 100 represents the best possible score. This modification allows us to calculate an average score for our model, which serves as a more concise and quantitative assessment of its performance.
The following generate_model_scores function uses a modified prompt that tells the model to "Respond with the integer number only.":
\section*{Listing 7.11 Evaluating the instruction finetuning LLM}
from tqdm import tqdm

def generate_model_scores(json_data, json_key, model="llama3"):
    # format_input and query_model were defined earlier in this chapter
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input {format_input(entry)} "
            f"and correct output {entry['output']}, "
            f"score the model response {entry[json_key]}"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."   #A
        )
        score = query_model(prompt, model)
        try:
            scores.append(int(score))
        except ValueError:
            print(f"Could not convert score: {score}")
            continue
    return scores
\#A Modified instruction line to only return the score
Let's now apply the generate_model_scores function to the entire test_data set, which takes about 1 minute on an M3 MacBook Air:

scores = generate_model_scores(test_data, "model_response")
print(f"Number of scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")
The results are as follows:
Scoring entries: 100% 110/110 [01:10<00:00, 1.56it/s]
Number of scores: 110 of 110
Average score: 54.16
The evaluation output shows that our finetuned model achieves an average score above 50, which provides a useful benchmark for comparison against other models or for experimenting with different training configurations to improve the model's performance.
It's worth noting that Ollama is not entirely deterministic at the time of this writing, which means that the scores you obtain might slightly vary from the ones presented above. To obtain more robust results, you can repeat the evaluation multiple times and average the resulting scores.
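For instance, a minimal sketch of such a repeated evaluation might look as follows (the choice of three repetitions is arbitrary):

num_runs = 3
run_averages = []
for _ in range(num_runs):
    scores = generate_model_scores(test_data, "model_response")
    run_averages.append(sum(scores) / len(scores))

print(f"Average score over {num_runs} runs: "
      f"{sum(run_averages)/len(run_averages):.2f}")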
To further improve our model's performance, we can explore various strategies, such as:
- Adjusting the hyperparameters during finetuning, such as the learning rate, batch size, or number of epochs.
- Increasing the size of the training dataset or diversifying the examples to cover a broader range of topics and styles.
- Experimenting with different prompts or instruction formats to guide the model's responses more effectively.
- Considering the use of a larger pretrained model, which may have greater capacity to capture complex patterns and generate more accurate responses.
\section*{PERFORMANCE OF LLAMA 3 MODELS}
For reference, when using the methodology described in this section, the Llama 3 8B base model, without any finetuning, achieves an average score of 58.51 on the test set. The Llama 3 8B instruct model, which has been finetuned on a general instruction-following dataset, achieves an impressive average score of 82.6.
\section*{EXERCISE 7.4 PARAMETER-EFFICIENT FINETUNING WITH LORA}
To instruction finetune an LLM more efficiently, modify the code in this chapter to use the low-rank adaptation method (LoRA) from appendix E. Compare the training runtime and model performance before and after the modification.
\subsection*{7.9 Conclusions}
This chapter marks the conclusion of our journey through the LLM development cycle. We have covered all the essential steps, including implementing an LLM architecture, pretraining an LLM, and finetuning it for specific tasks, as summarized in figure 7.21.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-301.jpg?height=578&width=1252&top_left_y=180&top_left_x=298)
Figure 7.21 An overview of the different stages of implementing, pretraining, and finetuning an LLM covered in this book.
The next subsection will give you some ideas for what to look into next after the essential steps shown in figure 7.21.
\subsection*{7.9.1 What's next?}
While we covered the most essential steps, as illustrated in figure 7.21, there is an optional step that can be performed after instruction finetuning: preference finetuning. Preference finetuning is particularly useful for customizing a model to better align with specific user preferences. If you are interested in exploring this further, please refer to the 04_preference-tuning-with-dpo folder in this book's supplementary GitHub repository at https://github.com/rasbt/LLMs-from-scratch/tree/main/ch07/04_preference-tuning-with-dpo.
In addition to the main content covered in this book, the GitHub repository also contains a large selection of bonus material that you may find valuable. To learn more about these additional resources, please visit the Bonus Material section on the repository's README page: https://github.com/rasbt/LLMs-from-scratch?tab=readme-ov-file#bonus-material.
\subsection*{7.9.2 Staying up to date in a fast-moving field}
The fields of AI and LLM research are evolving at a rapid (and, depending on who you ask, exciting) pace. To keep up with the latest advancements, consider exploring recent research papers on arXiv at https://arxiv.org/list/cs.LG/recent. Additionally, many researchers and practitioners are very active in sharing and discussing the latest developments on social media platforms like X (formerly Twitter) and Reddit. The subreddit r/LocalLLaMA, in particular, is a good resource for connecting with the community and staying informed about the latest tools and trends.
I also regularly share insights and write about the latest in LLM research on my blog, available at https://magazine.sebastianraschka.com and https://sebastianraschka.com/blog.
\subsection*{7.9.3 Final words}
I hope you have enjoyed this journey of implementing an LLM from the ground up and coding the pretraining and finetuning functions from scratch. In my opinion, building an LLM from scratch is the most effective way to gain a deep understanding of how LLMs work. I hope that this hands-on approach has provided you with valuable insights and a solid foundation in LLM development.
While the primary purpose of this book is educational, you may be interested in utilizing different and more powerful LLMs for real-world applications. For this, I recommend exploring popular tools such as Axolotl (https://github.com/OpenAccess-AI-Collective/axolotl) or LitGPT (https://github.com/Lightning-AI/litgpt), which I am actively involved in developing.
Thank you for joining me on this learning journey, and I wish you all the best in your future endeavors in the exciting field of LLMs and AI!
\subsection*{7.10 Summary}
- The instruction finetuning process adapts a pretrained LLM to follow human instructions and generate desired responses.
- Preparing the dataset involves downloading an instruction-response dataset, formatting the entries, and splitting it into training, validation, and test sets.
- Training batches are constructed using a custom collate function that pads sequences, creates target token IDs, and masks padding tokens.
- We load a pretrained GPT-2 medium model with 355M parameters to serve as the starting point for instruction finetuning.
- The pretrained model is finetuned on the instruction dataset using a training loop similar to pretraining.
- Evaluation involves extracting model responses on a test set and scoring them, e.g. using another LLM.
- The Ollama application with an 8B parameter Llama model can be used to automatically score the finetuned model's responses on the test set, providing an average score to quantify performance.
\section*{Appendix A. Introduction to PyTorch}
\section*{This chapter covers}
- An overview of the PyTorch deep learning library
- Setting up an environment and workspace for deep learning
- Tensors as a fundamental data structure for deep learning
- The mechanics of training deep neural networks
- Training models on GPUs
This chapter is designed to equip you with the necessary skills and knowledge to put deep learning into practice and implement large language models (LLMs) from scratch.
We will introduce PyTorch, a popular Python-based deep learning library, which will be our primary tool for the remainder of this book. This chapter will also guide you through setting up a deep learning workspace armed with PyTorch and GPU support.
Then, you'll learn about the essential concept of tensors and their usage in PyTorch. We will also delve into PyTorch's automatic differentiation engine, a feature that enables us to conveniently and efficiently use backpropagation, which is a crucial aspect of neural network training.
Note that this chapter is meant as a primer for those who are new to deep learning in PyTorch. While this chapter explains PyTorch from the ground up, it's not meant to be an exhaustive coverage of the PyTorch library. Instead, this chapter focuses on the PyTorch fundamentals that we will use to implement LLMs throughout this book. If you are already familiar with deep learning, you may skip this appendix and directly move on to chapter 2, working with text data.
\section*{A.1 What is PyTorch}
PyTorch (https://pytorch.org) is an open-source Python-based deep learning library. According to Papers With Code (https://paperswithcode.com/trends), a platform that tracks and analyzes research papers, PyTorch has been the most widely used deep learning library for research since 2019 by a wide margin. And according to the Kaggle Data Science and Machine Learning Survey 2022 (https://www.kaggle.com/c/kaggle-survey-2022), the number of respondents using PyTorch is approximately 40% and grows steadily every year.
One of the reasons why PyTorch is so popular is its user-friendly interface and efficiency. However, despite its accessibility, it doesn't compromise on flexibility, providing advanced users the ability to tweak lower-level aspects of their models for customization and optimization. In short, for many practitioners and researchers, PyTorch offers just the right balance between usability and features.
In the following subsections, we will define the main features PyTorch has to offer.
\section*{A.1.1 The three core components of PyTorch}
PyTorch is a relatively comprehensive library, and one way to approach it is to focus on its three broad components, which are summarized in figure A.1.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-304.jpg?height=691&width=1242&top_left_y=1116&top_left_x=301)
Figure A.1 PyTorch's three main components include a tensor library as a fundamental building block for computing, automatic differentiation for model optimization, and deep learning utility functions, making it easier to implement and train deep neural network models.
Firstly, PyTorch is a tensor library that extends the concept of the array-oriented programming library NumPy with the additional feature of accelerated computation on GPUs, thus providing a seamless switch between CPUs and GPUs.
Secondly, PyTorch is an automatic differentiation engine, also known as autograd, which enables the automatic computation of gradients for tensor operations, simplifying backpropagation and model optimization.
Finally, PyTorch is a deep learning library, meaning that it offers modular, flexible, and efficient building blocks (including pre-trained models, loss functions, and optimizers) for designing and training a wide range of deep learning models, catering to both researchers and developers.
After defining the term deep learning and installing PyTorch in the two following subsections, the remainder of this chapter will go over these three core components of PyTorch in more detail, along with hands-on code examples.
\section*{A.1.2 Defining deep learning}
LLMs are often referred to as AI models in the news. However, as illustrated in the first section of chapter 1 (1.1 What is an LLM?) LLMs are also a type of deep neural network, and PyTorch is a deep learning library. Sounds confusing? Let's take a brief moment and summarize the relationship between these terms before we proceed.
AI is fundamentally about creating computer systems capable of performing tasks that usually require human intelligence. These tasks include understanding natural language, recognizing patterns, and making decisions. (Despite significant progress, AI is still far from achieving this level of general intelligence.)
Machine learning represents a subfield of AI (as illustrated in figure A.2) that focuses on developing and improving learning algorithms. The key idea behind machine learning is to enable computers to learn from data and make predictions or decisions without being explicitly programmed to perform the task. This involves developing algorithms that can identify patterns and learn from historical data and improve their performance over time with more data and feedback.
\section*{Artificial intelligence (Al)}
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-306.jpg?height=646&width=1169&top_left_y=332&top_left_x=337)
Figure A. 2 Deep learning is a subcategory of machine learning that is focused on the implementation of deep neural networks. In turn, machine learning is a subcategory of Al that is concerned with algorithms that learn from data. Al is the broader concept of machines being able to perform tasks that typically require human intelligence.
Machine learning has been integral in the evolution of AI, powering many of the advancements we see today, including LLMs. Machine learning is also behind technologies like recommendation systems used by online retailers and streaming services, email spam filtering, voice recognition in virtual assistants, and even self-driving cars. The introduction and advancement of machine learning have significantly enhanced AI's capabilities, enabling it to move beyond strict rule-based systems and adapt to new inputs or changing environments.
Deep learning is a subcategory of machine learning that focuses on the training and application of deep neural networks. These deep neural networks were originally inspired by how the human brain works, particularly the interconnection between many neurons. The "deep" in deep learning refers to the multiple hidden layers of artificial neurons or nodes that allow them to model complex, nonlinear relationships in the data.
Unlike traditional machine learning techniques that excel at simple pattern recognition, deep learning is particularly good at handling unstructured data like images, audio, or text, so deep learning is particularly well suited for LLMs.
The typical predictive modeling workflow (also referred to as supervised learning) in machine learning and deep learning is summarized in figure A.3.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-307.jpg?height=839&width=1235&top_left_y=189&top_left_x=314)
Once a model is trained, we can use it to predict the labels of new data
Figure A. 3 The supervised learning workflow for predictive modeling consists of a training stage where a model is trained on labeled examples in a training dataset. The trained model can then be used to predict the labels of new observations.
Using a learning algorithm, a model is trained on a training dataset consisting of examples and corresponding labels. In the case of an email spam classifier, for example, the training dataset consists of emails and their spam and not-spam labels that a human identified. Then, the trained model can be used on new observations (new emails) to predict their unknown label (spam or not spam).
Of course, we also want to add a model evaluation between the training and inference stages to ensure that the model satisfies our performance criteria before using it in a realworld application.
Note that the workflow for training and using LLMs, as we will see later in this book, is similar to the workflow depicted in figure A. 3 if we train them to classify texts. And if we are interested in training LLMs for generating texts, which is the main focus of this book, figure A. 3 still applies. In this case, the labels during pretraining can be derived from the text itself (the next-word prediction task introduced in chapter 1). And the LLM will generate entirely new text (instead of predicting labels) given an input prompt during inference.
\section*{A.1.3 Installing PyTorch}
PyTorch can be installed just like any other Python library or package. However, since PyTorch is a comprehensive library featuring CPU- and GPU-compatible codes, the installation may require additional explanation.
\section*{PYTHON VERSION}
Many scientific computing libraries do not immediately support the newest version of Python. Therefore, when installing PyTorch, it's advisable to use a version of Python that is one or two releases older. For instance, if the latest version of Python is 3.13, using Python 3.10 or 3.11 is recommended.
For instance, there are two versions of PyTorch: a leaner version that only supports CPU computing and a version that supports both CPU and GPU computing. If your machine has a CUDA-compatible GPU that can be used for deep learning (ideally an NVIDIA T4, RTX 2080 Ti , or newer), I recommend installing the GPU version. Regardless, the default command for installing PyTorch is as follows in a code terminal:
pip install torch
Suppose your computer supports a CUDA-compatible GPU. In that case, this will automatically install the PyTorch version that supports GPU acceleration via CUDA, given that the Python environment you're working on has the necessary dependencies (like pip) installed.
\section*{AMD GPUS FOR DEEP LEARNING}
As of this writing, PyTorch has also added experimental support for AMD GPUs via ROCm. Please see https://pytorch.org for additional instructions.
However, to explicitly install the CUDA-compatible version of PyTorch, it's often better to specify the CUDA version you want PyTorch to be compatible with. PyTorch's official website (https://pytorch.org) provides commands to install PyTorch with CUDA support for different operating systems, as shown in figure A.4.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-309.jpg?height=551&width=1188&top_left_y=189&top_left_x=358)
Figure A. 4 Access the PyTorch installation recommendation on https://pytorch.org to customize and select the installation command for your system.
(Note that the command shown in figure A. 4 will also install the torchvision and torchaudio libraries, which are optional for this book.)
As of this writing, this book is based on PyTorch 2.0.1, so it's recommended to use the following installation command to install the exact version to guarantee compatibility with this book:
pip install torch==2.0.1
However, as mentioned earlier, given your operating system, the installation command might differ slightly from the one shown above. Thus, I recommend visiting https://pytorch.org and using the installation menu (see figure A.4) to select the installation command for your operating system, replacing torch with torch==2.0.1 in that command.
To check the version of PyTorch, you can execute the following code in Python:

import torch
torch.__version__

This prints:

'2.0.1'
\section*{PYTORCH AND TORCH}
Note that the Python library is named "torch" primarily because it's a continuation of the Torch library but adapted for Python (hence, "PyTorch"). The name "torch" acknowledges the library's roots in Torch, a scientific computing framework with wide support for machine learning algorithms, which was initially created using the Lua programming language.
If you are looking for additional recommendations and instructions for setting up your Python environment or installing the other libraries used later in this book, I recommend visiting the supplementary GitHub repository of this book at https://github.com/rasbt/LLMs-from-scratch.
After installing PyTorch, you can check whether your installation recognizes your built-in NVIDIA GPU by running the following code in Python:
import torch
torch.cuda.is_available()
This returns:
True
If the command returns True, you are all set. If the command returns False, your computer may not have a compatible GPU, or PyTorch does not recognize it. While GPUs are not required for the initial chapters in this book, which are focused on implementing LLMs for educational purposes, they can significantly speed up deep learning-related computations.
If you don't have access to a GPU, there are several cloud computing providers where users can run GPU computations for an hourly cost. A popular Jupyter-notebook-like environment is Google Colab (https://colab.research.google.com), which provides time-limited access to GPUs as of this writing. Using the "Runtime" menu, it is possible to select a GPU, as shown in the screenshot in figure A.5.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-311.jpg?height=521&width=1228&top_left_y=187&top_left_x=319)
Figure A. 5 Select a GPU device for Google Colab under the Runtime/Change runtime type menu.
\section*{PYTORCH ON APPLE SILICON}
If you have an Apple Mac with an Apple Silicon chip (like the M1, M2, M3, or newer models), you have the option to leverage its capabilities to accelerate PyTorch code execution. To use your Apple Silicon chip for PyTorch, you first need to install PyTorch as you normally would. Then, to check if your Mac supports PyTorch acceleration with its Apple Silicon chip, you can run a simple code snippet in Python:
print(torch.backends.mps.is_available())
If it returns True, it means that your Mac has an Apple Silicon chip that can be used to accelerate PyTorch code.
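As a hedged illustration (not part of the original text), the snippet below shows one common way to pick a device that falls back to the CPU when no MPS backend is available; the variable name device is an arbitrary choice:

import torch

# Use the Apple Silicon (MPS) backend if available, otherwise fall back to the CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(device)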
\section*{EXERCISE A.1}
Install and set up PyTorch on your computer.
\section*{EXERCISE A.2}
Run the supplementary Chapter 2 code at https://github.com/rasbt/LLMs-from-scratch that checks whether your environment is set up correctly.
\section*{A.2 Understanding tensors}
Tensors represent a mathematical concept that generalizes vectors and matrices to potentially higher dimensions. In other words, tensors are mathematical objects that can be characterized by their order (or rank), which provides the number of dimensions. For example, a scalar (just a number) is a tensor of rank 0, a vector is a tensor of rank 1, and a matrix is a tensor of rank 2, as illustrated in figure A.6.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-312.jpg?height=558&width=311&top_left_y=619&top_left_x=356)
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-312.jpg?height=568&width=212&top_left_y=607&top_left_x=731)
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-312.jpg?height=513&width=322&top_left_y=660&top_left_x=1046)
Figure A.6 An illustration of tensors with different ranks. Here 0D corresponds to rank 0, 1D to rank 1, and 2D to rank 2. Note that a 3D vector, which consists of 3 elements, is still a rank 1 tensor.
From a computational perspective, tensors serve as data containers. For instance, they hold multi-dimensional data, where each dimension represents a different feature. Tensor libraries, such as PyTorch, can create, manipulate, and compute with these multidimensional arrays efficiently. In this context, a tensor library functions as an array library.
PyTorch tensors are similar to NumPy arrays but have several additional features important for deep learning. For example, PyTorch adds an automatic differentiation engine, simplifying computing gradients, as discussed later in section A.4. PyTorch tensors also support GPU computations to speed up deep neural network training, which we will discuss later in this appendix.
\section*{PYTORCH HAS A NUMPY-LIKE API}
As you will see in the upcoming sections, PyTorch adopts most of the NumPy array API and syntax for its tensor operations. If you are new to NumPy, you can get a brief overview of the most relevant concepts via my article Scientific Computing in Python: Introduction to NumPy and Matplotlib at https://sebastianraschka.com/blog/2020/numpy-intro.html.
The following subsections will look at the basic operations of the PyTorch tensor library, showing how to create simple tensors and going over some of the essential operations.
\section*{A.2.1 Scalars, vectors, matrices, and tensors}
As mentioned earlier, PyTorch tensors are data containers for array-like structures. A scalar is a 0-dimensional tensor (for instance, just a number), a vector is a 1-dimensional tensor, and a matrix is a 2-dimensional tensor. There is no specific term for higher-dimensional tensors, so we typically refer to a 3-dimensional tensor as just a 3D tensor, and so forth.
We can create objects of PyTorch's Tensor class using the torch.tensor function as follows:
Listing A.1 Creating PyTorch tensors

import torch

tensor0d = torch.tensor(1)                                      #A
tensor1d = torch.tensor([1, 2, 3])                              #B
tensor2d = torch.tensor([[1, 2], [3, 4]])                       #C
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])   #D

#A create a 0D tensor (scalar) from a Python integer
#B create a 1D tensor (vector) from a Python list
#C create a 2D tensor from a nested Python list
#D create a 3D tensor from a nested Python list
\section*{A.2.2 Tensor data types}
In the previous section, we created tensors from Python integers. In this case, PyTorch adopts the default 64-bit integer data type from Python. We can access the data type of a tensor via the .dtype attribute of a tensor:
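A minimal sketch of this check, reusing tensor1d from listing A.1, might look as follows:

print(tensor1d.dtype)

This prints torch.int64. If we instead create a tensor from Python floats, for example torch.tensor([1.0, 2.0, 3.0]), PyTorch uses a 32-bit floating point data type (torch.float32) by default.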
This choice is primarily due to the balance between precision and computational efficiency. A 32-bit floating point number offers sufficient precision for most deep learning tasks, while consuming less memory and computational resources than a 64-bit floating point number. Moreover, GPU architectures are optimized for 32-bit computations, and using this data type can significantly speed up model training and inference.
Moreover, it is possible to readily change the precision using a tensor's .to method. The following code demonstrates this by changing a 64-bit integer tensor into a 32-bit float tensor:
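A minimal sketch consistent with the output shown below might be (the variable name floatvec is an arbitrary choice):

floatvec = tensor1d.to(torch.float32)
print(floatvec.dtype)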
This returns:
torch.float32
For more information about different tensor data types available in PyTorch, I recommend checking the official documentation at https://pytorch.org/docs/stable/tensors.html.
\section*{A.2.3 Common PyTorch tensor operations}
Comprehensive coverage of all the different PyTorch tensor operations and commands is outside the scope of this book. However, we will briefly describe relevant operations as we introduce them throughout the book.
Before we move on to the next section covering the concept behind computation graphs, below is a list of the most essential PyTorch tensor operations.
We already introduced the torch.tensor() function to create new tensors.
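The example that produces the output below is not shown here; a sketch consistent with that output (and with the .shape, .reshape, and .matmul results that follow) would be:

tensor2d = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(tensor2d)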
This prints:
tensor([[1, 2, 3],
        [4, 5, 6]])
In addition, the .shape attribute allows us to access the shape of a tensor:
print(tensor2d.shape)
The output is:
torch.Size([2, 3])
As you can see above, .shape returns [2, 3], which means that the tensor has 2 rows and 3 columns. To reshape the tensor into a 3 by 2 tensor, we can use the .reshape method:
print(tensor2d.reshape(3, 2))
This prints:
tensor([[1, 2],
        [3, 4],
        [5, 6]])
However, note that the more common command for reshaping tensors in PyTorch is .view():
print(tensor2d.view(3, 2))
The output is:
tensor([[1, 2],
        [3, 4],
        [5, 6]])
Similar to .reshape and .view, there are several cases where PyTorch offers multiple syntax options for executing the same computation. This is because PyTorch initially followed the original Lua Torch syntax convention but then also added syntax to make it more similar to NumPy upon popular request.
Next, we can use .T to transpose a tensor, which means flipping it across its diagonal. Note that this is different from reshaping a tensor, as you can see based on the result below:
print(tensor2d.T)
The output is:
tensor([[1, 4],
        [2, 5],
        [3, 6]])
Lastly, the common way to multiply two matrices in PyTorch is the .matmul method:
print(tensor2d.matmul(tensor2d.T))
The output is:
tensor([[14, 32],
        [32, 77]])
However, we can also adopt the @ operator, which accomplishes the same thing more compactly:
print(tensor2d @ tensor2d.T)
This prints:
tensor([[14, 32],
        [32, 77]])
As mentioned earlier, we will introduce additional operations throughout this book when needed. For readers who'd like to browse through all the different tensor operations available in PyTorch (hint: we won't need most of these), I recommend checking out the official documentation at https://pytorch.org/docs/stable/tensors.html.
\section*{A.3 Seeing models as computation graphs}
In the previous section, we covered one of the major three components of PyTorch, namely, its tensor library. Next in line is PyTorch's automatic differentiation engine, also known as autograd. PyTorch's autograd system provides functions to compute gradients in dynamic computational graphs automatically. But before we dive deeper into computing gradients in the next section, let's define the concept of a computational graph.
A computational graph (or computation graph in short) is a directed graph that allows us to express and visualize mathematical expressions. In the context of deep learning, a computation graph lays out the sequence of calculations needed to compute the output of a neural network -- we will need this later to compute the required gradients for backpropagation, which is the main training algorithm for neural networks.
Let's look at a concrete example to illustrate the concept of a computation graph. The following code implements the forward pass (prediction step) of a simple logistic regression classifier, which can be seen as a single-layer neural network, returning a score between 0 and 1 that is compared to the true class label ( 0 or 1 ) when computing the loss:
\section*{Listing A.2 A logistic regression forward pass}

import torch
import torch.nn.functional as F   #A

y = torch.tensor([1.0])    #B
x1 = torch.tensor([1.1])   #C
w1 = torch.tensor([2.2])   #D
b = torch.tensor([0.0])    #E
z = x1 * w1 + b            #F
a = torch.sigmoid(z)       #G

loss = F.binary_cross_entropy(a, y)
\#A This import statement is a common convention in PyTorch to prevent long lines of code
\#B true label
\#C input feature
\#D weight parameter
\#E bias unit
\#F net input
\#G activation \& output
If not all components in the code above make sense to you, don't worry. The point of this example is not to implement a logistic regression classifier but rather to illustrate how we can think of a sequence of computations as a computation graph, as shown in figure A.7.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-318.jpg?height=436&width=1254&top_left_y=352&top_left_x=295)
Figure A. 7 A logistic regression forward pass as a computation graph. The input feature \(x_{1}\) is multiplied by a model weight \(w_{1}\) and passed through an activation function \(\sigma\) after adding the bias. The loss is computed by comparing the model output \(a\) with a given label \(y\).
In fact, PyTorch builds such a computation graph in the background, and we can use this to calculate gradients of a loss function with respect to the model parameters (here w1 and b) to train the model, which is the topic of the upcoming sections.
\section*{A.4 Automatic differentiation made easy}
In the previous section, we introduced the concept of computation graphs. If we carry out computations in PyTorch, it will build such a graph internally by default if one of its terminal nodes has the requires_grad attribute set to True. This is useful if we want to compute gradients. Gradients are required when training neural networks via the popular backpropagation algorithm, which can be thought of as an implementation of the chain rule from calculus for neural networks, as illustrated in figure A.8.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-319.jpg?height=776&width=1209&top_left_y=180&top_left_x=338)
Figure A.8 The most common way of computing the loss gradients in a computation graph involves applying the chain rule from right to left, which is also called reverse-mode automatic differentiation or backpropagation. It means we start from the output layer (or the loss itself) and work backward through the network to the input layer. This is done to compute the gradient of the loss with respect to each parameter (weights and biases) in the network, which informs how we update these parameters during training.
\section*{PARTIAL DERIVATIVES AND GRADIENTS}
Figure A. 8 shows partial derivatives, which measure the rate at which a function changes with respect to one of its variables. A gradient is a vector containing all of the partial derivatives of a multivariate function, a function with more than one variable as input.
If you are not familiar with or don't remember the partial derivatives, gradients, or the chain rule from calculus, don't worry. On a high level, all you need to know for this book is that the chain rule is a way to compute gradients of a loss function with respect to the model's parameters in a computation graph. This provides the information needed to update each parameter in a way that minimizes the loss function, which serves as a proxy for measuring the model's performance, using a method such as gradient descent. We will revisit the computational implementation of this training loop in PyTorch in section A.7, A typical training loop.
Now, how is this all related to the second component of the PyTorch library we mentioned earlier, the automatic differentiation (autograd) engine? By tracking every operation performed on tensors, PyTorch's autograd engine constructs a computational graph in the background. Then, calling the grad function, we can compute the gradient of the loss with respect to model parameter w1 as follows:
Listing A.3 Computing gradients via autograd

import torch
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)

loss = F.binary_cross_entropy(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True)   #A
grad_L_b = grad(loss, b, retain_graph=True)
\#A By default, PyTorch destroys the computation graph after calculating the gradients to free memory.
However, since we are going to reuse this computation graph shortly, we set retain_graph=True so that it stays in memory.
Let's show the resulting gradients of the loss with respect to the model's parameters:
print(grad_L_w1)
print(grad_L_b)
This prints:

(tensor([-0.0898]),)
(tensor([-0.0817]),)

Above, we have been using the grad function "manually," which can be useful for experimentation, debugging, and demonstrating concepts. But in practice, PyTorch provides even more high-level tools to automate this process. For instance, we can call .backward on the loss, and PyTorch will compute the gradients of all the leaf nodes in the graph, which will be stored via the tensors' .grad attributes:
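The corresponding code is not shown here; a minimal sketch, reusing the tensors from listing A.3, might look like this (the stored gradient values match the ones returned by the grad function above):

loss.backward()    # computes gradients for all leaf tensors with requires_grad=True
print(w1.grad)     # tensor([-0.0898])
print(b.grad)      # tensor([-0.0817])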
If this section is packed with a lot of information and you are overwhelmed by the calculus concepts, don't worry. While this calculus jargon was a means to explain PyTorch's autograd component, all you need to take away from this section is that PyTorch takes care of the calculus for us via the .backward method -- we won't need to compute any derivatives or gradients by hand in this book.
\section*{A.5 Implementing multilayer neural networks}
In the previous sections, we covered PyTorch's tensor and autograd components. This section focuses on PyTorch as a library for implementing deep neural networks.
To provide a concrete example, we focus on a multilayer perceptron, which is a fully connected neural network, as illustrated in figure A.9.
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-322.jpg?height=1091&width=1235&top_left_y=171&top_left_x=316)
Figure A. 9 An illustration of a multilayer perceptron with 2 hidden layers. Each node represents a unit in the respective layer. Each layer has only a very small number of nodes for illustration purposes.
When implementing a neural network in PyTorch, we typically subclass the torch.nn.Module class to define our own custom network architecture. This Module base class provides a lot of functionality, making it easier to build and train models. For instance, it allows us to encapsulate layers and operations and keep track of the model's parameters.
Within this subclass, we define the network layers in the __init__ constructor and specify how they interact in the forward method. The forward method describes how the input data passes through the network and comes together as a computation graph.
In contrast, the backward method, which we typically do not need to implement ourselves, is used during training to compute gradients of the loss function with respect to the model parameters, as we will see in section A.7, A typical training loop.
The following code implements a classic multilayer perceptron with two hidden layers to illustrate a typical usage of the Module class:
\section*{Listing A.4 A multilayer perceptron with two hidden layers}
\#A It's useful to code the number of inputs and outputs as variables to reuse the same code for datasets with different numbers of features and classes.
\#B The Linear layer takes the number of input and output nodes as arguments.
\#C Nonlinear activation functions are placed between the hidden layers.
\#D The number of output nodes of one hidden layer has to match the number of inputs of the next layer.
\#E The outputs of the last layer are called logits.
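The listing body itself is not included above. The following is a sketch consistent with the annotations and with the model summary printed below (50 inputs, hidden layers with 30 and 20 units, 3 outputs); the exact formatting may differ from the original listing:

class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):    #A
        super().__init__()
        self.layers = torch.nn.Sequential(
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),         #B
            torch.nn.ReLU(),                         #C
            # 2nd hidden layer
            torch.nn.Linear(30, 20),                 #D
            torch.nn.ReLU(),
            # output layer
            torch.nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits                                #E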
We can then instantiate a new neural network object as follows:
model = NeuralNetwork(50, 3)
But before using this new model object, it is often useful to call print on the model to see a summary of its structure:
print(model)
This prints:
NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)
Note that we used the Sequential class when we implemented the NeuralNetwork class. Using Sequential is not required, but it can make our life easier if we have a series of layers that we want to execute in a specific order, as is the case here. This way, after instantiating self.layers = Sequential(...) in the __init__ constructor, we just have to call the self.layers instead of calling each layer individually in the NeuralNetwork's forward method.
Next, let's check the total number of trainable parameters of this model:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)
This prints:
Total number of trainable model parameters: 2213
Note that each parameter for which requires_grad=True counts as a trainable parameter and will be updated during training (more on that later in section A.7, A typical training loop).
In the case of our neural network model with the two hidden layers above, these trainable parameters are contained in the torch.nn.Linear layers. A linear layer multiplies the inputs with a weight matrix and adds a bias vector. This is sometimes also referred to as a feedforward or fully connected layer.
Based on the print(model) call we executed above, we can see that the first Linear layer is at index position 0 in the layers attribute. We can access the corresponding weight parameter matrix as follows:
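The corresponding code is not shown here; based on the .shape access in the next step, a minimal sketch would plausibly be:

print(model.layers[0].weight)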
Since this is a large matrix that is not shown in its entirety, let's use the .shape attribute to show its dimensions:
print(model.layers[0].weight.shape)
The result is:
torch.Size([30, 50])
(Similarly, you could access the bias vector via model.layers[0].bias.)
The weight matrix above is a 30×50 matrix, and we can see that requires_grad is set to True, which means its entries are trainable -- this is the default setting for weights and biases in torch.nn.Linear.
Note that if you execute the code above on your computer, the numbers in the weight matrix will likely differ from those shown above. This is because the model weights are initialized with small random numbers, which are different each time we instantiate the network. In deep learning, initializing model weights with small random numbers is desired to break symmetry during training -- otherwise, the nodes would be just performing the same operations and updates during backpropagation, which would not allow the network to learn complex mappings from inputs to outputs.
However, while we want to keep using small random numbers as initial values for our layer weights, we can make the random number initialization reproducible by seeding PyTorch's random number generator via manual_seed:
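The code is not included here; a minimal sketch consistent with the output below might be:

torch.manual_seed(123)
model = NeuralNetwork(50, 3)
print(model.layers[0].weight)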
The result is:
Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)
Now, after we spent some time inspecting the NeuralNetwork instance, let's briefly see how it's used via the forward pass:
torch.manual_seed(123)
X = torch.rand((1, 50))
out = model(X)
print(out)

The result is:

tensor([[-0.1262,  0.1080, -0.1792]], grad_fn=<AddmmBackward0>)
In the code above, we generated a single random training example X as a toy input (note that our network expects 50-dimensional feature vectors) and fed it to the model, returning three scores. When we call model(X), it will automatically execute the forward pass of the model.
The forward pass refers to calculating output tensors from input tensors. This involves passing the input data through all the neural network layers, starting from the input layer, through hidden layers, and finally to the output layer.
These three numbers returned above correspond to a score assigned to each of the three output nodes. Notice that the output tensor also includes a grad_fn value.
Here, grad_fn=<AddmmBackward0> represents the last-used function to compute a variable in the computational graph. In particular, grad_fn=<AddmmBackward0> means that the tensor we are inspecting was created via a matrix multiplication and addition operation. PyTorch will use this information when it computes gradients during backpropagation. The <AddmmBackward0> part of grad_fn=<AddmmBackward0> specifies the operation that was performed. In this case, it is an Addmm operation. Addmm stands for matrix multiplication (mm) followed by an addition (Add).
If we just want to use a network without training or backpropagation, for example, if we use it for prediction after training, constructing this computational graph for backpropagation can be wasteful as it performs unnecessary computations and consumes additional memory. So, when we use a model for inference (for instance, making predictions) rather than training, it is a best practice to use the torch.no_grad() context manager, as shown below. This tells PyTorch that it doesn't need to keep track of the gradients, which can result in significant savings in memory and computation.
with torch.no_grad():
    out = model(X)
print(out)
In PyTorch, it's common practice to code models such that they return the outputs of the last layer (logits) without passing them to a nonlinear activation function. That's because PyTorch's commonly used loss functions combine the softmax (or sigmoid for binary classification) operation with the negative log-likelihood loss in a single class. The reason for this is numerical efficiency and stability. So, if we want to compute class-membership probabilities for our predictions, we have to call the softmax function explicitly:
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)
The values can now be interpreted as class-membership probabilities that sum up to 1. The values are roughly equal for this random input, which is expected for a randomly initialized model without training.
In the following two sections, we will learn how to set up an efficient data loader and train the model.
\section*{A.6 Setting up efficient data loaders}
In the previous section, we defined a custom neural network model. Before we can train this model, we have to briefly talk about creating efficient data loaders in PyTorch, which we will iterate over when training the model. The overall idea behind data loading in PyTorch is illustrated in figure A. 10 .
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-328.jpg?height=535&width=1237&top_left_y=501&top_left_x=315)
Figure A. 10 PyTorch implements a Dataset and a DataLoader class. The Dataset class is used to instantiate objects that define how each data record is loaded. The DataLoader handles how the data is shuffled and assembled into batches.
Following the illustration in figure A.10, in this section, we will implement a custom Dataset class that we will use to create a training and a test dataset that we'll then use to create the data loaders.
Let's start by creating a simple toy dataset of five training examples with two features each. Accompanying the training examples, we also create a tensor containing the corresponding class labels: three examples belong to class 0, and two examples belong to class 1. In addition, we also make a test set consisting of two entries. The code to create this dataset is shown below.
Listing A.5 Creating a small toy dataset
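The listing body is not included above. The following sketch uses illustrative feature values (hypothetical, not necessarily the original numbers) but matches the described structure: five training examples with two features each, labels 0, 0, 0, 1, 1, and a two-example test set:

X_train = torch.tensor([
    [-1.2,  3.1],
    [-0.9,  2.9],
    [-0.5,  2.6],
    [ 2.3, -1.1],
    [ 2.7, -1.5],
])
y_train = torch.tensor([0, 0, 0, 1, 1])

X_test = torch.tensor([
    [-0.8,  2.8],
    [ 2.6, -1.6],
])
y_test = torch.tensor([0, 1])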
CLASS LABEL NUMBERING PyTorch requires that class labels start with label 0, and the largest class label value should not exceed the number of output nodes minus 1 (since Python index counting starts at 0). So, if we have class labels 0, 1, 2, 3, and 4, the neural network output layer should consist of 5 nodes.
Next, we create a custom dataset class, ToyDataset, by subclassing from PyTorch's Dataset parent class, as shown below.
Listing A.6 Defining a custom Dataset class
from torch.utils.data import Dataset
#A Instructions for retrieving exactly one data record and the corresponding label
#B Instructions for returning the total length of the dataset
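The class definition itself is not included above; a sketch consistent with the annotations and the description that follows might be:

class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):          #A
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y

    def __len__(self):
        return self.features.shape[0]      #B

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)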
This custom ToyDataset class's purpose is to use it to instantiate a PyTorch DataLoader. But before we get to this step, let's briefly go over the general structure of the ToyDataset code.
In PyTorch, the three main components of a custom Dataset class are the __init__ constructor, the __getitem__ method, and the __len__ method, as shown in code listing A. 6 above.
In the __init__ method, we set up attributes that we can access later in the __getitem__ and __len__ methods. This could be file paths, file objects, database connectors, and so on. Since we created a tensor dataset that sits in memory, we are simply assigning X and y to these attributes, which are placeholders for our tensor objects.
In the __getitem__ method, we define instructions for returning exactly one item from the dataset via an index. This means the features and the class label corresponding to a single training example or test instance. (The data loader will provide this index, which we will cover shortly.)
Finally, the __len__ method contains instructions for retrieving the length of the dataset. Here, we use the .shape attribute of a tensor to return the number of rows in the feature array. In the case of the training dataset, we have five rows, which we can double-check as follows:
print(len(train_ds))
The result is:
5
Now that we defined a PyTorch Dataset class we can use for our toy dataset, we can use PyTorch's DataLoader class to sample from it, as shown in the code listing below:
\section*{Listing A.7 Instantiating data loaders}
from torch.utils.data import DataLoader
\#A The ToyDataset instance created earlier serves as input to the data loader.
\#B Whether or not to shuffle the data
\#C The number of background processes
\#D It is not necessary to shuffle a test dataset
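The remaining lines of the listing are not included above. A sketch consistent with the annotations and the surrounding text (batch size 2, seeded shuffling, num_workers=0) might look as follows:

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds,    #A
    batch_size=2,
    shuffle=True,        #B
    num_workers=0        #C
)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,       #D
    num_workers=0
)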
After instantiating the training data loader, we can iterate over it as shown below. (The iteration over the test_loader works similarly but is omitted for brevity.)
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)
As we can see based on the output above, the train_loader iterates over the training dataset, visiting each training example exactly once. This is known as a training epoch. Since we seeded the random number generator using torch.manual_seed(123) above, you should get the exact same shuffling order of training examples as shown above. However, if you iterate over the dataset a second time, you will see that the shuffling order will change. This is desired to prevent deep neural networks from getting caught in repetitive update cycles during training.
Note that we specified a batch size of 2 above, but the third batch only contains a single example. That's because we have five training examples, which is not evenly divisible by 2. In practice, having a substantially smaller batch as the last batch in a training epoch can disturb the convergence during training. To prevent this, it's recommended to set drop_last=True, which will drop the last batch in each epoch, as shown below:
\section*{Listing A.8 A training loader that drops the last batch}
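The listing body is not included above; a sketch based on the training loader defined earlier, with the drop_last flag added, might be:

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True
)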
Lastly, let's discuss the setting num_workers=0 in the DataLoader. This parameter in PyTorch's DataLoader function is crucial for parallelizing data loading and preprocessing. When num_workers is set to 0 , the data loading will be done in the main process and not in separate worker processes. This might seem unproblematic, but it can lead to significant slowdowns during model training when we train larger networks on a GPU. This is because instead of focusing solely on the processing of the deep learning model, the CPU must also take time to load and preprocess the data. As a result, the GPU can sit idle while waiting for the CPU to finish these tasks. In contrast, when num_workers is set to a number greater than zero, multiple worker processes are launched to load data in parallel, freeing the main process to focus on training your model and better utilizing your system's resources, which is illustrated in figure A. 11
Data loading without multiple workers
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-333.jpg?height=524&width=592&top_left_y=787&top_left_x=293)
Data loading with multiple workers
![](https://cdn.mathpix.com/cropped/2024_07_13_ba8d9eb54f12ebe396d8g-333.jpg?height=554&width=641&top_left_y=751&top_left_x=922)
Figure A. 11 Loading data without multiple workers (setting num_workers=0) will create a data loading bottleneck where the model sits idle until the next batch is loaded as illustrated in the left subpanel. If multiple workers are enabled, the data loader can already queue up the next batch in the background as shown in the right subpanel.
However, if we are working with very small datasets, setting num_workers to 1 or larger may not be necessary, since the total training time takes only fractions of a second anyway. Similarly, if you are working with tiny datasets or interactive environments such as Jupyter notebooks, increasing num_workers may not provide any noticeable speedup. It might, in fact, lead to some issues. One potential issue is the overhead of spinning up multiple worker processes, which could take longer than the actual data loading when your dataset is small.
Furthermore, for Jupyter notebooks, setting num_workers to greater than 0 can sometimes lead to issues related to the sharing of resources between different processes, resulting in errors or notebook crashes. Therefore, it's essential to understand the trade-off and make a calculated decision on setting the num_workers parameter. When used correctly, it can be a beneficial tool but should be adapted to your specific dataset size and computational environment for optimal results.
In my experience, setting num_workers=4 usually leads to optimal performance on many real-world datasets, but optimal settings depend on your hardware and the code used for loading a training example defined in the Dataset class.
\section*{A.7 A typical training loop}
So far, we've discussed all the requirements for training neural networks: PyTorch's tensor library, autograd, the Module API, and efficient data loaders. Let's now combine all these things and train a neural network on the toy dataset from the previous section. The training code is shown in code listing A. 9 below.
Listing A.9 Neural network training in PyTorch
import torch.nn.functional as F
#A The dataset from the previous section has 2 features and 2 classes
#B The optimizer needs to know which parameters to optimize
#C Set the gradients from the previous round to zero to prevent unintended gradient accumulation
#D Compute the gradients of the loss with respect to the model parameters
#E The optimizer uses the gradients to update the model parameters
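The code of the listing is not included above apart from the import and annotations. The following is a reconstruction sketched from the surrounding description (a NeuralNetwork with 2 inputs and 2 outputs, an SGD optimizer with a learning rate of 0.5, 3 training epochs, and cross-entropy applied to the logits); details such as the logging format are illustrative:

import torch

torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)            #A
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)       #B

num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):
        logits = model(features)
        loss = F.cross_entropy(logits, labels)

        optimizer.zero_grad()                                 #C
        loss.backward()                                       #D
        optimizer.step()                                      #E

        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train Loss: {loss:.2f}")
    model.eval()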
Running the code in listing A. 9 above yields the following outputs:
As we can see, the loss reaches zero after 3 epochs, a sign that the model converged on the training set. However, before we evaluate the model's predictions, let's go over some of the details of the preceding code listing.
First, note that we initialized a model with two inputs and two outputs. That's because the toy dataset from the previous section has two input features and two class labels to predict. We used a stochastic gradient descent (SGD) optimizer with a learning rate (lr) of 0.5 . The learning rate is a hyperparameter, meaning it's a tunable setting that we have to experiment with based on observing the loss. Ideally, we want to choose a learning rate such that the loss converges after a certain number of epochs -- the number of epochs is another hyperparameter to choose.
\section*{EXERCISE A.3}
How many parameters does the neural network introduced at the beginning of this section have?
In practice, we often use a third dataset, a so-called validation dataset, to find the optimal hyperparameter settings. A validation dataset is similar to a test set. However, while we only want to use a test set precisely once to avoid biasing the evaluation, we usually use the validation set multiple times to tweak the model settings.
We also introduced new settings called model.train() and model.eval(). As these names imply, these settings are used to put the model into a training and an evaluation mode. This is necessary for components that behave differently during training and inference, such as dropout or batch normalization layers. Since we don't have dropout or other components in our NeuralNetwork class that are affected by these settings, using model.train() and model.eval() is redundant in our code above. However, it's best practice to include them anyway to avoid unexpected behaviors when we change the model architecture or reuse the code to train a different model.
As discussed earlier, we pass the logits directly into the cross_entropy loss function, which will apply the softmax function internally for efficiency and numerical stability reasons. Then, calling loss.backward() will calculate the gradients in the computation graph that PyTorch constructed in the background. The optimizer.step() method will use the gradients to update the model parameters to minimize the loss. In the case of the SGD optimizer, this means multiplying the gradients with the learning rate and adding the scaled negative gradient to the parameters.
PREVENTING UNDESIRED GRADIENT ACCUMULATION It is important to include an optimizer.zero_grad() call in each update round to reset the gradients to zero. Otherwise, the gradients will accumulate, which may be undesired.
After we trained the model, we can use it to make predictions, as shown below:
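A minimal sketch of such a prediction step, assuming the X_train tensor from the previous section:

model.eval()
with torch.no_grad():
    outputs = model(X_train)  # compute the logits for the training examples
print(outputs)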
The results are as follows:
tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])
To obtain the class membership probabilities, we can then use PyTorch's softmax function, as follows:
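For example, a short sketch of this step (the set_printoptions call mentioned below simply disables scientific notation to keep the output legible):

torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1)  # convert logits to class-membership probabilities
print(probas)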
This outputs:
tensor([[0.9991, 0.0009],
        [0.9982, 0.0018],
        [0.9949, 0.0051],
        [0.0491, 0.9509],
        [0.0307, 0.9693]])
Let's consider the first row in the code output above. Here, the first value (column) means that the training example has a 99.91% probability of belonging to class 0 and a 0.09% probability of belonging to class 1. (The set_printoptions call is used here to make the outputs more legible.)
We can convert these values into class label predictions using PyTorch's argmax function, which returns the index position of the highest value in each row if we set dim=1 (setting dim=0 would return the highest value in each column, instead):
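For example, applied to the probabilities computed above:

predictions = torch.argmax(probas, dim=1)  # index of the most likely class per row
print(predictions)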
Note that it is unnecessary to compute softmax probabilities to obtain the class labels. We could also apply the argmax function to the logits (outputs) directly:
Above, we computed the predicted labels for the training dataset. Since the training dataset is relatively small, we could compare it to the true training labels by eye and see that the model is 100% correct. We can double-check this using the == comparison operator:
predictions == y_train
The results are:
Using torch.sum, we can count the number of correct predictions as follows:
torch.sum(predictions == y_train)
The output is:
5
Since the dataset consists of 5 training examples, we have 5 out of 5 predictions that are correct, which equals 5/5 × 100% = 100% prediction accuracy.
However, to generalize the computation of the prediction accuracy, let's implement a compute_accuracy function as shown in the following code listing.
Listing A. 10 A function to compute the prediction accuracy
#A This returns a tensor of True/False values depending on whether the labels match
#B The sum operation counts the number of True values
#C This is the fraction of correct predictions, a value between 0 and 1; .item() returns the value of the tensor as a Python float
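A minimal sketch of such a compute_accuracy function, matching the annotations above (the exact signature is an assumption here):

def compute_accuracy(model, dataloader):
    model = model.eval()
    correct = 0.0
    total_examples = 0

    for idx, (features, labels) in enumerate(dataloader):
        with torch.no_grad():
            logits = model(features)
        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions  #A
        correct += torch.sum(compare)  #B
        total_examples += len(compare)

    return (correct / total_examples).item()  #C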
Note that the compute_accuracy function in listing A.10 iterates over a data loader to compute the number and fraction of correct predictions. This is because when we work with large datasets, we typically can only call the model on a small part of the dataset due to memory limitations. The compute_accuracy function above is a general method that scales to datasets of arbitrary size since, in each iteration, the dataset chunk that the model receives is the same size as the batch size seen during training.
Notice that the internals of the compute_accuracy function are similar to what we used before when we converted the logits to the class labels.
We can then apply the function to the training set as follows:
print(compute_accuracy(model, train_loader))
The result is:
\section*{1.0}
Similarly, we can apply the function to the test set as follows:
>>> print(compute_accuracy(model, test_loader))
This prints:
\section*{1.0}
In this section, we learned how we can train a neural network using PyTorch. Next, let's see how we can save and restore models after training.
\section*{A. 8 Saving and loading models}
In the previous section, we successfully trained a model. Let's now see how we can save a trained model to reuse it later.
Here's the recommended way to save and load models in PyTorch:
torch.save(model.state_dict(), "model.pth")
The model's state_dict is a Python dictionary object that maps each layer in the model to its trainable parameters (weights and biases). Note that "model.pth" is an arbitrary filename for the model file saved to disk. We can give it any name and file ending we like; however, .pth and .pt are the most common conventions.
Once we saved the model, we can restore it from disk as follows:
model = NeuralNetwork(2, 2)
model.load_state_dict(torch.load("model.pth"))
The torch.load("model.pth") function reads the file "model.pth" and reconstructs the Python dictionary object containing the model's parameters while model.load_state_dict() applies these parameters to the model, effectively restoring its learned state from when we saved it.
Note that the line model = NeuralNetwork(2, 2) above is not strictly necessary if you execute this code in the same session where you saved the model. However, I included it here to illustrate that we need an instance of the model in memory to apply the saved parameters. Here, the NeuralNetwork(2, 2) architecture needs to match the original saved model exactly.
Now, we are well equipped to use PyTorch to implement large language models in the upcoming chapters. However, before we jump to the next chapter, the last section will show you how to train PyTorch models faster using one or more GPUs (if available).
\section*{A. 9 Optimizing training performance with GPUs}
In this last section of this chapter, we will see how we can utilize GPUs, which will accelerate deep neural network training compared to regular CPUs. First, we will introduce the main concepts behind GPU computing in PyTorch. Then, we will train a model on a single GPU. Finally, we will look at distributed training using multiple GPUs.
\section*{A.9.1 PyTorch computations on GPU devices}
As you will see, modifying the training loop from section A.7 to optionally run on a GPU is relatively simple and only requires changing three lines of code.
Before we make the modifications, it's crucial to understand the main concept behind GPU computations within PyTorch. First, we need to introduce the notion of devices. In PyTorch, a device is where computations occur, and data resides. The CPU and the GPU are examples of devices. A PyTorch tensor resides in a device, and its operations are executed on the same device.
Let's see how this works in action. Assuming that you installed a GPU-compatible version of PyTorch as explained in section A.1.3, Installing PyTorch, we can double-check that our runtime indeed supports GPU computing via the following code:
print(torch.cuda.is_available())
The result is:
True
Now, suppose we have two tensors that we can add as follows -- this computation will be carried out on the CPU by default:
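Here is a small sketch of what this looks like, first on the CPU and then after transferring both tensors to the GPU via the .to() method (the tensor values are arbitrary):

tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])
print(tensor_1 + tensor_2)  # computed on the CPU by default

tensor_1 = tensor_1.to("cuda")
tensor_2 = tensor_2.to("cuda")
print(tensor_1 + tensor_2)  # now computed on the first GPU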
Notice that the resulting tensor now includes the device information, device='cuda:0', which means that the tensors reside on the first GPU. If your machine hosts multiple GPUs, you have the option to specify which GPU you'd like to transfer the tensors to. You can do this by indicating the device ID in the transfer command. For instance, you can use .to("cuda:0"), .to("cuda:1"), and so on.
However, it is important to note that all tensors must be on the same device. Otherwise, the computation will fail, as shown below, where one tensor resides on the CPU and the other on the GPU:
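For example, moving only one of the two tensors back to the CPU triggers the error shown below:

tensor_1 = tensor_1.to("cpu")
print(tensor_1 + tensor_2)  # tensor_1 is on the CPU, tensor_2 is still on the GPU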
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
In this section, we learned that GPU computations on PyTorch are relatively straightforward. All we have to do is transfer the tensors onto the same GPU device, and PyTorch will handle the rest. Equipped with this information, we can now train the neural network from the previous section on a GPU.
\section*{A.9.2 Single-GPU training}
Now that we are familiar with transferring tensors to the GPU, we can modify the training loop from section A.7, A typical training loop, to run on a GPU. This requires only changing three lines of code, as shown in code listing A.11 below.
\section*{Listing A.11 A training loop on a GPU}
torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
device = torch.device("cuda")  #A
model = model.to(device)  #B
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):
        features, labels = features.to(device), labels.to(device)  #C
        logits = model(features)
        loss = F.cross_entropy(logits, labels)  # Loss function

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:.2f}")
    model.eval()
    # Optional model evaluation
#A Define a device variable that defaults to a GPU.
#B Transfer the model onto the GPU.
#C Transfer the data onto the GPU.
Running the above code will output the following, similar to the results obtained on the CPU previously in section A.7:
We can also use .to("cuda") instead of device = torch.device("cuda"). As we saw in section A.9.1, transferring a tensor to "cuda" instead of torch.device("cuda") works as well and is shorter. We can also modify the statement to the following, which will make the same code executable on a CPU if a GPU is not available, which is usually considered best practice when sharing PyTorch code:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
In the case of the modified training loop above, we probably won't see a speed-up because of the memory transfer cost from CPU to GPU. However, we can expect a significant speedup when training deep neural networks, especially large language models.
As we saw in this section, training a model on a single GPU in PyTorch is relatively easy. Next, let's introduce another concept: training models on multiple GPUs.
\section*{PYTORCH ON MACOS}
On an Apple Mac with an Apple Silicon chip (such as the M1, M2, M3, or newer models), rather than a computer with an Nvidia GPU, you can change
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
to
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
to take advantage of this chip.
\section*{EXERCISE A. 4}
Compare the runtime of matrix multiplication on a CPU to a GPU. At what matrix size do you begin to see the matrix multiplication on the GPU being faster than on the CPU? Hint: I recommend using the %timeit command in Jupyter to compare the runtime. For example, given matrices a and b, run the command %timeit a @ b in a new notebook cell.
\section*{A.9.3 Training with multiple GPUs}
In this section, we will briefly go over the concept of distributed training. Distributed training is the concept of dividing the model training across multiple GPUs and machines.
Why do we need this? Even when it is possible to train a model on a single GPU or machine, the process could be exceedingly time-consuming. The training time can be significantly reduced by distributing the training process across multiple machines, each with potentially multiple GPUs. This is particularly crucial in the experimental stages of model development, where numerous training iterations might be necessary to finetune the model parameters and architecture.
MULTI-GPU COMPUTING IS OPTIONAL For this book, it is not required to have access to or use multiple GPUs. This section is included for those who are interested in how multi-GPU computing works in PyTorch.
In this section, we will look at the most basic case of distributed training: PyTorch's DistributedDataParallel (DDP) strategy. DDP enables parallelism by splitting the input data across the available devices and processing these data subsets simultaneously.
How does this work? PyTorch launches a separate process on each GPU, and each process receives and keeps a copy of the model -- these copies will be synchronized during training. To illustrate this, suppose we have two GPUs that we want to use to train a neural network, as shown in figure A.12.
Figure A.12 The model and data transfer in DDP involves two key steps. First, we create a copy of the model on each of the GPUs. Then we divide the input data into unique minibatches that we pass on to each model copy.
Each of the two GPUs will receive a copy of the model. Then, in every training iteration, each model will receive a minibatch (or just batch) from the data loader. We can use a DistributedSampler to ensure that each GPU will receive a different, non-overlapping batch when using DDP.
Since each model copy will see a different sample of the training data, the model copies will return different logits as outputs and compute different gradients during the backward pass. These gradients are then averaged and synchronized during training to update the models. This way, we ensure that the models don't diverge, as illustrated in figure A.13.
Figure A.13 The forward and backward pass in DDP are executed independently on each GPU with its corresponding data subset. Once the forward and backward passes are completed, gradients from each model replica (on each GPU) are synchronized across all GPUs. This ensures that every model replica has the same updated weights.
The benefit of using DDP is the enhanced speed it offers for processing the dataset compared to a single GPU. Barring a minor communication overhead between devices that comes with DDP use, it can theoretically process a training epoch in half the time with two GPUs compared to just one. The time efficiency scales up with the number of GPUs, allowing us to process an epoch eight times faster if we have eight GPUs, and so on.
MULTI-GPU COMPUTING IN INTERACTIVE ENVIRONMENTS DDP does not function properly within interactive Python environments like Jupyter notebooks, which don't handle multiprocessing in the same way a standalone Python script does. Therefore, the following code should be executed as a script, not within a notebook interface like Jupyter. This is because DDP needs to spawn multiple processes, and each process should have its own Python interpreter instance.
Let's now see how this works in practice. For brevity, we will only focus on the core parts of the previous code that need to be adjusted for DDP training. However, for readers who want to run the code on their own multi-GPU machine or a cloud instance of their choice, it is recommended to use the standalone script provided in this book's GitHub repository at https://github.com/rasbt/LLMs-from-scratch.
First, we will import a few additional submodules, classes, and functions for distributed training in PyTorch, as shown in code listing A.12 below.
Listing A.12 PyTorch utilities for distributed training
import torch.multiprocessing as mp
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
Before we dive deeper into the changes to make the training compatible with DDP, let's briefly go over the rationale and usage for these newly imported utilities that we need alongside the DistributedDataParallel class.
PyTorch's multiprocessing submodule contains functions such as multiprocessing.spawn, which we will use to spawn multiple processes and apply a function to multiple inputs in parallel. We will use it to spawn one training process per GPU.
If we spawn multiple processes for training, we will need a way to divide the dataset among these different processes. For this, we will use the DistributedSampler.
The init_process_group and destroy_process_group functions are used to initialize and quit the distributed training modes. The init_process_group function should be called at the beginning of the training script to initialize a process group for each process in the distributed setup, and destroy_process_group should be called at the end of the training script to destroy a given process group and release its resources.
The following code in listing A. 13 below illustrates how these new components are used to implement DDP training for the NeuralNetwork model we implemented earlier.
Listing A.13 Model training with DistributedDataParallel strategy
#A Address of the main node
#B Any free port on the machine
#C nccl stands for NVIDIA Collective Communication Library.
#D rank refers to the index of the GPU we want to use.
#E world_size is the number of GPUs to use.
#F Sets the current GPU device on which tensors will be allocated and operations will be performed.
#G DistributedSampler takes care of the shuffling now
#H Enables faster memory transfer when training on GPU
#I Splits the dataset into distinct, non-overlapping subsets for each process (GPU)
#J The main function running the model training
#K rank is the GPU ID
#L Clean up resource allocation
#M Launch the main function using multiple processes, where nprocs=world_size means one process per GPU.
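Since only the annotations of listing A.13 are reproduced above, here is a minimal sketch of how the pieces fit together, assuming the NeuralNetwork class, the ToyDataset class, and the toy tensors (X_train, y_train, X_test, y_test) from the earlier sections are defined in the same script; the evaluation step is omitted for brevity:

import os
import torch
import torch.nn.functional as F
import torch.multiprocessing as mp
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group


def ddp_setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"   #A
    os.environ["MASTER_PORT"] = "12345"       #B
    init_process_group(backend="nccl", rank=rank, world_size=world_size)  #C #D #E
    torch.cuda.set_device(rank)               #F


def prepare_dataset():
    train_ds = ToyDataset(X_train, y_train)
    test_ds = ToyDataset(X_test, y_test)
    train_loader = DataLoader(
        dataset=train_ds,
        batch_size=2,
        shuffle=False,                        #G
        pin_memory=True,                      #H
        drop_last=True,
        sampler=DistributedSampler(train_ds)  #I
    )
    test_loader = DataLoader(dataset=test_ds, batch_size=2, shuffle=False)
    return train_loader, test_loader


def main(rank, world_size, num_epochs):       #J
    ddp_setup(rank, world_size)               #K
    train_loader, test_loader = prepare_dataset()
    model = NeuralNetwork(num_inputs=2, num_outputs=2)
    model.to(rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
    model = DDP(model, device_ids=[rank])     # Wrap the model to synchronize gradients

    for epoch in range(num_epochs):
        model.train()
        for features, labels in train_loader:
            features, labels = features.to(rank), labels.to(rank)
            logits = model(features)
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"[GPU{rank}] Epoch: {epoch+1:03d}/{num_epochs:03d}"
                  f" | Train/Val Loss: {loss:.2f}")

    destroy_process_group()                   #L


if __name__ == "__main__":
    print("Number of GPUs available:", torch.cuda.device_count())
    torch.manual_seed(123)
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size, 3), nprocs=world_size)  #M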
Before we run the code from listing A.13, here is a summary of how it works, in addition to the annotations above. We have a __name__ == "__main__" clause at the bottom containing code that is executed when we run the code as a Python script instead of importing it as a module. This code first prints the number of available GPUs using torch.cuda.device_count(), sets a random seed for reproducibility, and then spawns new processes using PyTorch's multiprocessing.spawn function. Here, the spawn function launches one process per GPU by setting nprocs=world_size, where the world size is the number of available GPUs. This spawn function launches the code in the main function we define in the same script with some additional arguments provided via args. Note that the main function has a rank argument that we don't include in the mp.spawn() call. That's because the rank, which refers to the process ID we use as the GPU ID, is already passed automatically.
The main function sets up the distributed environment via ddp_setup -- another function we defined, loads the training and test sets, sets up the model, and carries out the training. Compared to the single-GPU training in section A.9.2, we now transfer the model and data to the target device via .to(rank), which we use to refer to the GPU device ID. Also, we wrap the model via DDP, which enables the synchronization of the gradients between the different GPUs during training. After the training finishes and we evaluate the models, we use destroy_process_group() to cleanly exit the distributed training and free up the allocated resources.
Earlier, we mentioned that each GPU will receive a different subsample of the training data. To ensure this, we set sampler=DistributedSampler(train_ds) in the training loader.
The last function to discuss is ddp_setup. It sets the main node's address and port to allow for communication between the different processes, initializes the process group with the NCCL backend (designed for GPU-to-GPU communication), and sets the rank (process identifier) and world size (total number of processes). Finally, it specifies the GPU device corresponding to the current model training process rank.
\section*{SELECTING AVAILABLE GPUS ON A MULTI-GPU MACHINE}
If you wish to restrict the number of GPUs used for training on a multi-GPU machine, the simplest way is to use the CUDA_VISIBLE_DEVICES environment variable. To illustrate this, suppose your machine has multiple GPUs, and you only want to use one GPU, for example, the GPU with index 0. Instead of python some_script.py, you can run the code from the terminal as follows:
CUDA_VISIBLE_DEVICES=0 python some_script.py
Or, if your machine has four GPUs and you only want to use the first and third GPU, you can use
CUDA_VISIBLE_DEVICES=0,2 python some_script.py
Setting CUDA_VISIBLE_DEVICES in this way is a simple and effective way to manage GPU allocation without modifying your PyTorch scripts.
Let's now run this code and see how it works in practice by launching the code as a script from the terminal:
python ch02-DDP-script.py
Note that it should work on both single- and multi-GPU machines. If we run this code on a single GPU, we should see the following output:
PyTorch version: 2.0.1+cu117
CUDA available: True
Number of GPUs available: 1
[GPU0] Epoch: 001/003 | Batchsize 002 | Train/Val Loss: 0.62
[GPU0] Epoch: 001/003 | Batchsize 002 | Train/Val Loss: 0.32
[GPU0] Epoch: 002/003 | Batchsize 002 | Train/Val Loss: 0.11
[GPU0] Epoch: 002/003 | Batchsize 002 | Train/Val Loss: 0.07
[GPU0] Epoch: 003/003 | Batchsize 002 | Train/Val Loss: 0.02
[GPU0] Epoch: 003/003 | Batchsize 002 | Train/Val Loss: 0.03
[GPU0] Training accuracy 1.0
[GPU0] Test accuracy 1.0
The code output looks similar to the one in section A.9.2, which is a good sanity check.
Now, if we run the same command and code on a machine with two GPUs, we should see the following:
This prints:
[33901]
[86]
\# . . .
You can then use the following code to assemble the original string:
print(tokenizer.decode([33901, 86, 343, 86, 220, 959]))
This returns:
'Akwirw ier'
\section*{EXERCISE 2.2}
The code for the data loader with max_length=2 and stride=2:
dataloader = create_dataloader(raw_text, batch_size=4, max_length=2, stride=2)
It produces batches of the following format:
tensor([[  40,  367],
        [2885, 1464],
        [1807, 3619],
        [ 402,  271]])
The code of the second data loader with max_length=8 and stride=2:
dataloader = create_dataloader(raw_text, batch_size=4, max_length=8, stride=2)
An example batch looks as follows:
\section*{EXERCISE 3.2}
To achieve an output dimension of 2, similar to what we had in single-head attention, we need to change the projection dimension d_out to 1.
d_out = 1
mha = MultiHeadAttentionWrapper(d_in, d_out, block_size, 0.0, num_heads=2)
\section*{EXERCISE 3.3}
The initialization for the smallest GPT-2 model is as follows:
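A sketch of one possible answer, assuming the MultiHeadAttention class from chapter 3; GPT-2 small uses 768-dimensional token embeddings, 12 attention heads, and a context length of 1,024 tokens (the dropout value below is a placeholder):

block_size = 1024
d_in, d_out = 768, 768
num_heads = 12

mha = MultiHeadAttention(d_in, d_out, block_size, 0.0, num_heads)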
\section*{Chapter 4}
\section*{EXERCISE 4.1}
We can calculate the number of parameters in the feed forward and attention modules as follows:
block = TransformerBlock(GPT_CONFIG_124M)

total_params = sum(p.numel() for p in block.ff.parameters())
print(f"Total number of parameters in feed forward module: {total_params:,}")

total_params = sum(p.numel() for p in block.att.parameters())
print(f"Total number of parameters in attention module: {total_params:,}")
As we can see, the feed forward module contains approximately twice as many parameters as the attention module:
Total number of parameters in feed forward module: 4,722,432
Total number of parameters in attention module: 2,360,064
\section*{EXERCISE 4.2}
To instantiate the other GPT model sizes, we can modify the configuration dictionary as follows (here shown for GPT-2 XL):
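A sketch of the configuration change, assuming the GPT_CONFIG_124M dictionary from chapter 4; GPT-2 XL uses a 1,600-dimensional embedding, 48 transformer blocks, and 25 attention heads:

GPT_CONFIG_XL = GPT_CONFIG_124M.copy()
GPT_CONFIG_XL["emb_dim"] = 1600
GPT_CONFIG_XL["n_layers"] = 48
GPT_CONFIG_XL["n_heads"] = 25

model = GPTModel(GPT_CONFIG_XL)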
Then, reusing the code from Section 4.6 to calculate the number of parameters and RAM requirements, we find the following:
gpt2-xl:
Total number of parameters: 1,637,792,000
Number of trainable parameters considering weight tying: 1,557,380,800
Total size of the model: 6247.68 MB
\section*{Chapter 5}
\section*{EXERCISE 5.1}
We can print the number of times the token (or word) "pizza" is sampled using the print_sampled_tokens function we defined in this section. Let's start with the code we defined in section 5.3.1.
The "pizza" token is sampled \(0 x\) if the temperature is 0 or 0.1 , and it is sampled \(32 \times\) if the temperature is scaled up to 5 . The estimated probability is \(32 / 1000 \times 100 \%=3.2 \%\).
The actual probability is \(4.3 \%\) and contained in the rescaled softmax probability tensor (scaled_probas[2][6]).
\section*{EXERCISE 5.2}
Top-k sampling and temperature scaling are settings that have to be adjusted based on the LLM and the desired degree of diversity and randomness in the output.
When using relatively small top-k values (e.g., smaller than 10) and the temperature is set below 1, the model's output becomes less random and more deterministic. This setting is useful when we need the generated text to be more predictable, coherent, and closer to the most likely outcomes based on the training data.
Applications for such low k and temperature settings include generating formal documents or reports where clarity and accuracy are most important. Other examples of applications include technical analysis or code generation tasks, where precision is crucial. Also, question answering and educational content require accurate answers where a temperature below 1 is helpful.
On the other hand, larger top-k values (e.g., values in the range of 20 to 40) and temperature values above 1 are useful when using LLMs for brainstorming or generating creative content, such as fiction.
\section*{EXERCISE 5.3}
There are multiple ways to force deterministic behavior with the generate function:
1. Setting top_k=None and applying no temperature scaling;
2. Setting top_k=1.
\section*{EXERCISE 5.4}
In essence, we have to load the model and optimizer that we saved in the main chapter:
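A minimal sketch of that loading step; the checkpoint filename, the dictionary keys, and the optimizer hyperparameters below assume the save format used in chapter 5 and may need to be adapted:

checkpoint = torch.load("model_and_optimizer.pth")

model = GPTModel(GPT_CONFIG_124M)
model.load_state_dict(checkpoint["model_state_dict"])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])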
Then, call the train_model_simple function with num_epochs=1 to train the model for another epoch.
\section*{EXERCISE 5.5}
We can use the following code to calculate the training and validation set losses of the GPT model:
The resulting losses for the 124M-parameter model are as follows:
Training loss: 3.754748503367106
Validation loss: 3.559617757797241
The main observation is that the training and validation set performances are in the same ballpark. This can have multiple explanations.
1. The Verdict was not part of the pretraining dataset when OpenAI trained GPT-2. Hence, the model is not explicitly overfitting to the training set and performs similarly well on The Verdict's training and validation set portions. (The validation set loss is slightly lower than the training set loss, which is unusual in deep learning. However, it's likely due to random noise since the dataset is relatively small. In practice, if there is no overfitting, the training and validation set performances are expected to be roughly identical).
2. The Verdict was part of GPT-2's training dataset. In this case, we can't tell whether the model is overfitting the training data because the validation set would have been used for training as well. To evaluate the degree of overfitting, we'd need a new dataset generated after OpenAI finished training GPT-2 to make sure that it couldn't have been part of the pretraining.
\section*{EXERCISE 5.6}
In the main chapter, we experimented with the smallest GPT-2 model, which has only 124M parameters. The reason was to keep the resource requirements as low as possible. However, you can easily experiment with larger models with minimal code changes. For example, to load the 1558M instead of the 124M model weights in chapter 5, the only two lines of code that we have to change are the following:
\section*{Chapter 6}
\section*{EXERCISE 6.1}
We can pad the inputs to the maximum number of tokens the model supports by setting max_length=1024 when initializing the datasets:
However, the additional padding results in a substantially worse test accuracy of 78.33% (versus the 95.67% in the main chapter).
\section*{EXERCISE 6.2}
Instead of finetuning just the final transformer block, we can finetune the entire model by removing the following lines from the code:
for param in model.parameters():
param.requires_grad = False
This modification results in a 1% improved test accuracy of 96.67% (versus the 95.67% in the main chapter).
\section*{EXERCISE 6.3}
Rather than finetuning the last output token, we can finetune the first output token by changing model(input_batch)[:, -1, :] to model(input_batch)[:, 0, :] everywhere in the code.
As expected, since the first token contains less information than the last token, this change results in a substantially worse test accuracy of 75.00% (versus the 95.67% in the main chapter).
\section*{Chapter 7}
\section*{EXERCISE 7.1}
The Phi-3 prompt format, which is shown in figure 7.4 in chapter 7, looks as follows for a given example input:
<|user|>
Identify the correct spelling of the following word: 'Occasion'
<|assistant|>
The correct spelling is 'Occasion'.
To use this template, we can modify the format_input function as follows:
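A sketch of such a modification; the exact string layout of the Phi-3 template is an assumption here:

def format_input(entry):
    # Wrap the instruction in the Phi-3 user tag
    instruction_text = f"<|user|>\n{entry['instruction']}"
    # Append the optional input field, if present
    input_text = f"\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text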
\section*{EXERCISE 7.2}
To mask out the instruction tokens, we first make a small change to the InstructionDataset class so that it also collects the length of each instruction; the annotations below summarize that change:
#A Separate list for instruction lengths
#B Collect instruction lengths
#C Return both instruction lengths and texts separately
Next, we update the custom_collate_fn where, due to the changes in the InstructionDataset, each batch is now a tuple containing (instruction_length, item) instead of just item. In addition, we now mask the corresponding instruction tokens in the target ID list:
def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    #A
    batch_max_length = max(len(item)+1 for instruction_length, item in batch)
    inputs_lst, targets_lst = [], []
    for instruction_length, item in batch:
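        # The remainder of the loop body below is a sketch that assumes the padding logic
        # of the original custom_collate_fn from chapter 7; the new part is the masking
        # of the instruction tokens in the targets (#B).
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])   # inputs: drop the last token
        targets = torch.tensor(padded[1:])   # targets: shift everything one position right

        mask = targets == pad_token_id       # keep only the first padding token in targets
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        targets[:instruction_length - 1] = ignore_index  #B (targets are shifted by one)

        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor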
#B Mask all input and instruction tokens in the targets
When evaluating a model finetuned with this instruction masking method, it performs slightly worse (by approximately 4 points, using the Ollama Llama 3 method from chapter 7). This is consistent with observations in the "Instruction Tuning With Loss Over Instructions" paper (https://arxiv.org/abs/2405.14394).
\section*{EXERCISE 7.3}
To finetune the model on the original Stanford Alpaca dataset (https://github.com/tatsu-lab/stanford_alpaca), we just have to change the file URL from the dataset used in chapter 7 to the Alpaca dataset URL.
Note that the dataset contains 52k entries (50x more than in chapter 7), and the entries are longer than the ones we worked with in chapter 7.
Thus, it's highly recommended that the training be run on a GPU.
If you encounter out-of-memory errors, consider reducing the batch size from 8 to 4, 2, or 1. In addition to lowering the batch size, you may also want to consider lowering the allowed_max_length from 1024 to 512 or 256.
Below are a few examples from the Alpaca dataset, including the generated model responses:
\section*{EXERCISE 7.4}
To instruction finetune the model using LoRA, use the relevant classes and functions from appendix E:
Next, add the following lines of code below the model loading code in section 7.5:
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters before: {total_params:,}")
for param in model.parameters():
param.requires_grad = False
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters after: {total_params:,}")
replace_linear_with_lora(model, rank=16, alpha=16)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable LoRA parameters: {total_params:,}")
model.to(device)
Note that on an Nvidia L4 GPU, the finetuning with LoRA takes 1.30 minutes to run. On the same GPU, the original code takes 1.80 minutes to run. So, LoRA is approximately 1.4 times faster in this case. The score, evaluated with the Ollama Llama 3 method from chapter 7, is around 50, which is in the same ballpark as the original model.
Appendix D. Adding Bells and Whistles to the Training Loop
In this appendix, we enhance the training function for the pretraining and finetuning processes covered in chapters 5-7. This appendix, in particular, covers learning rate warmup, cosine decay, and gradient clipping in the first three sections.
The final section then incorporates these techniques into the training function developed in chapter 5 and pretrains an LLM.
To make the code in this appendix self-contained, we reinitialize the model we trained in chapter 5.
import torch
from previous_chapters import GPTModel
GPT_CONFIG_124M = {
"vocab_size": 50257, # Vocabulary size
"ctx_len": 256, # Shortened context length (orig: 1024)
"emb_dim": 768, # Embedding dimension
"n_heads": 12, # Number of attention heads
"n_layers": 12, # Number of layers
"drop_rate": 0.1, # Dropout rate
"qkv_bias": False # Query-key-value bias
}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()
After initializing the model, we also need to initialize the data loaders we used in chapter 5. First, we load the "The Verdict" short story:
import os
import urllib.request
file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-
chapter-code/the-verdict.txt"
if not os.path.exists(file_path):
with urllib.request.urlopen(url) as response:
text_data = response.read().decode('utf-8')
with open(file_path, "w", encoding="utf-8") as file:
file.write(text_data)
else:
with open(file_path, "r", encoding="utf-8") as file:
text_data = file.read()
Next, we load the text_data into the data loaders:
from previous_chapters import create_dataloader_v1
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
torch.manual_seed(123)
train_loader = create_dataloader_v1(
text_data[:split_idx],
batch_size=2,
max_length=GPT_CONFIG_124M["ctx_len"],
stride=GPT_CONFIG_124M["ctx_len"],
drop_last=True,
shuffle=True
)
val_loader = create_dataloader_v1(
text_data[split_idx:],
batch_size=2,
max_length=GPT_CONFIG_124M["ctx_len"],
stride=GPT_CONFIG_124M["ctx_len"],
drop_last=False,
shuffle=False
)
Now that we have re-instantiated the model and data loaders we used in chapter 5, the next section will introduce the enhancements we make to the training function.
\section*{D.1 Learning rate warmup}
The first technique we introduce is learning rate warmup. Implementing a learning rate warmup can stabilize the training of complex models such as LLMs. This process involves gradually increasing the learning rate from a very low initial value (initial_lr) to a maximum value specified by the user (peak_lr). Starting the training with smaller weight updates decreases the risk of the model encountering large, destabilizing updates during its training phase.
Suppose we plan to train an LLM for 15 epochs, starting with an initial learning rate of 0.0001 and increasing it to a maximum learning rate of 0.01. Furthermore, we define 20 warmup steps to increase the initial learning rate from 0.0001 to 0.01 in the first 20 training steps:
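A short sketch of these settings (the total_training_steps definition below is an assumption that simply counts all batches over all epochs):

n_epochs = 15
initial_lr = 0.0001
peak_lr = 0.01
warmup_steps = 20
total_training_steps = len(train_loader) * n_epochs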
Next, we implement a simple training loop template to illustrate this warmup process:
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)
lr_increment = (peak_lr - initial_lr) / warmup_steps  #A

global_step = -1
track_lrs = []

for epoch in range(n_epochs):  #B
    for input_batch, target_batch in train_loader:
        optimizer.zero_grad()
        global_step += 1

        if global_step < warmup_steps:  #C
            lr = initial_lr + global_step * lr_increment
        else:
            lr = peak_lr

        for param_group in optimizer.param_groups:  #D
            param_group["lr"] = lr

        track_lrs.append(optimizer.param_groups[0]["lr"])  #E
#A This increment determines by how much we increase the initial_lr in each of the 20 warmup steps.
#B Execute a typical training loop iterating over the batches in the training loader in each epoch.
#C Update the learning rate if we are still in the warmup phase.
#D Apply the calculated learning rate to the optimizer.
#E In a complete training loop, the loss and the model updates would be calculated here, which are omitted for simplicity in this example.
After running the preceding code, we visualize how the learning rate was changed by the training loop above to verify that the learning rate warmup works as intended:
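A minimal plotting sketch using matplotlib:

import matplotlib.pyplot as plt

plt.figure(figsize=(5, 3))
plt.ylabel("Learning rate")
plt.xlabel("Step")
plt.plot(range(len(track_lrs)), track_lrs)
plt.tight_layout()
plt.show()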
The resulting plot is shown in Figure D.1.
Figure D.1 The learning rate warmup increases the learning rate for the first 20 training steps. After 20 steps, the learning rate reaches the peak of 0.01 and remains constant for the rest of the training.
As shown in Figure D.1, the learning rate starts with a low value and increases for 20 steps until it reaches the maximum value after 20 steps.
In the next section, we will modify the learning rate further so that it decreases after reaching the maximum learning rate, which further helps improve the model training.
\section*{D.2 Cosine decay}
Another widely adopted technique for training complex deep neural networks and LLMs is cosine decay. This method modulates the learning rate throughout the training epochs, making it follow a cosine curve after the warmup stage.
In its popular variant, cosine decay reduces (or decays) the learning rate to nearly zero, mimicking the trajectory of a half-cosine cycle. The gradual learning rate decrease in cosine decay aims to decelerate the pace at which the model updates its weights. This is particularly important as it helps minimize the risk of overshooting the loss minima during the training process, which is essential for ensuring the stability of the training during its later phases.
We can modify the training loop template from the previous section, adding cosine decay as follows:
import math
min_lr = 0.1 * initial_lr
track_lrs = []
lr_increment = (peak_lr - initial_lr) / warmup_steps
global_step = -1
for epoch in range(n_epochs):
    for input_batch, target_batch in train_loader:
        optimizer.zero_grad()
        global_step += 1

        if global_step < warmup_steps:
            lr = initial_lr + global_step * lr_increment
        else:  #B
            progress = ((global_step - warmup_steps) /
                        (total_training_steps - warmup_steps))
            lr = min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

        for param_group in optimizer.param_groups:
            param_group["lr"] = lr
        track_lrs.append(optimizer.param_groups[0]["lr"])
Again, to verify that the learning rate has changed as intended, we plot the learning rate:
The resulting learning rate plot is shown in Figure D.2.
Figure D.2 The first 20 steps of linear learning rate warmup are followed by a cosine decay, which reduces the learning rate in a half-cosine cycle until it reaches its minimum point at the end of training.
As shown in Figure D.2, the learning rate starts with a linear warmup phase, which increases for 20 steps until it reaches the maximum value after 20 steps. After the 20 steps of linear warmup, cosine decay kicks in, reducing the learning rate gradually until it reaches its minimum.
\section*{D.3 Gradient clipping}
In this section, we introduce gradient clipping, another important technique for enhancing stability during LLM training. This method involves setting a threshold above which gradients are downscaled to a predetermined maximum magnitude. This process ensures that the updates to the model's parameters during backpropagation stay within a manageable range.
For example, applying the max_norm=1.0 setting within PyTorch's clip_grad_norm_ function ensures that the norm of the gradients does not surpass 1.0. Here, the term "norm" signifies the measure of the gradient vector's length, or magnitude, within the model's parameter space, specifically referring to the L2 norm, also known as the Euclidean norm.
In mathematical terms, for a vector \(v\) with components \(v_1, v_2, \ldots, v_n\), the L2 norm is described as:
\[
\|v\|_2 = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}
\]
This calculation method is also applied to matrices.
For instance, consider a gradient matrix \(G\). If we aim to clip these gradients to a max_norm of 1, we first compute the L2 norm of these gradients, \(\|G\|_2\).
Given that \(\|G\|_2\) exceeds our max_norm of 1, we scale down the gradients to ensure their norm equals exactly 1. This is achieved through a scaling factor, calculated as max_norm\(/\|G\|_2\). Consequently, the adjusted gradient matrix \(G'\) becomes
\[
G' = \frac{\text{max\_norm}}{\|G\|_2} \cdot G
\]
To illustrate this gradient clipping process, we would begin by initializing a new model and calculating the loss for a training batch, similar to the procedure in a standard training loop:
from previous_chapters import calc_loss_batch
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
loss = calc_loss_batch(input_batch, target_batch, model, device)
loss.backward()
Upon calling the .backward() method in the preceding code snippet, PyTorch calculates the loss gradients and stores them in a .grad attribute for each model weight (parameter) tensor.
For illustration purposes, we can define the following find_highest_gradient utility function to identify the highest gradient value by scanning all the .grad attributes of the model's weight tensors after calling .backward():
def find_highest_gradient(model):
    max_grad = None
    for param in model.parameters():
        if param.grad is not None:
            grad_values = param.grad.data.flatten()
            max_grad_param = grad_values.max()
            if max_grad is None or max_grad_param > max_grad:
                max_grad = max_grad_param
    return max_grad

print(find_highest_gradient(model))
The largest gradient value identified by the preceding code is as follows:
tensor(...)
Let's now apply gradient clipping, which can be implemented with one line of code, and see how this affects the largest gradient value:
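That one line uses the clip_grad_norm_ utility mentioned earlier; afterwards, we can re-run the find_highest_gradient check:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(find_highest_gradient(model))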
The largest gradient value after applying the gradient clipping with the max norm of 1 is substantially smaller than before:
tensor(0.0166)
In the next section, we will put all the concepts covered in this appendix so far into action and modify the LLM training function.
\section*{D.4 The modified training function}
In this final section of this appendix, we improve the train_model_simple training function we used in chapter 5 by adding the three concepts we introduced: linear warmup, cosine decay, and gradient clipping. Together, these methods help stabilize LLM training.
The code is as follows, with the changes compared to train_model_simple annotated:
#A Retrieve the initial learning rate from the optimizer, assuming we use it as the peak learning rate
#B Calculate the total number of iterations in the training process
#C Calculate the learning rate increment during the warmup phase
#D Adjust the learning rate based on the current phase (warmup or cosine annealing)
#E Apply the calculated learning rate to the optimizer
#F Apply gradient clipping after the warmup phase to avoid exploding gradients
#G Everything below here remains unchanged compared to the train_model_simple function used in chapter 5
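Because only the annotations of the listing are reproduced above, here is a minimal sketch of such a train_model function; it assumes the calc_loss_batch and calc_loss_loader helpers from chapter 5, and the evaluation and sample-generation details are simplified compared to the original:

import math

def train_model(model, train_loader, val_loader, optimizer, device,
                n_epochs, eval_freq, eval_iter, warmup_steps,
                initial_lr=3e-5, min_lr=1e-6):
    train_losses, val_losses, track_lrs = [], [], []
    global_step = -1

    peak_lr = optimizer.param_groups[0]["lr"]              #A
    total_training_steps = len(train_loader) * n_epochs    #B
    lr_increment = (peak_lr - initial_lr) / warmup_steps   #C

    for epoch in range(n_epochs):
        model.train()
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()
            global_step += 1

            if global_step < warmup_steps:                  #D
                lr = initial_lr + global_step * lr_increment
            else:
                progress = ((global_step - warmup_steps) /
                            (total_training_steps - warmup_steps))
                lr = min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

            for param_group in optimizer.param_groups:      #E
                param_group["lr"] = lr
            track_lrs.append(lr)

            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()

            if global_step > warmup_steps:                  #F
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            optimizer.step()

            if global_step % eval_freq == 0:                #G
                model.eval()
                with torch.no_grad():
                    train_loss = calc_loss_loader(
                        train_loader, model, device, num_batches=eval_iter)
                    val_loss = calc_loss_loader(
                        val_loader, model, device, num_batches=eval_iter)
                model.train()
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                print(f"Ep {epoch+1} (Iter {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

    return train_losses, val_losses, track_lrs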
After defining the train_model function, we can use it to train the model in a similar fashion to the train_model_simple method in chapter 5:
The training will take about 5 minutes to complete on a MacBook Air or similar laptop and print the following outputs:
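For example, a usage sketch (the hyperparameter values below are placeholders rather than the exact settings used to produce the output that follows):

peak_lr = 5e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)

train_losses, val_losses, lrs = train_model(
    model.to(device), train_loader, val_loader, optimizer, device,
    n_epochs=15, eval_freq=5, eval_iter=1,
    warmup_steps=20, initial_lr=1e-5, min_lr=1e-5
)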
Ep 1 (Iter 000000): Train loss 10.934, Val loss 10.939
Ep 1 (Iter 000005): Train loss 8.529, Val loss 8.843
Ep 2 (Iter 000010): Train loss 6.400, Val loss 6.825
Ep 2 (Iter 000015): Train loss 6.116, Val loss 6.861
Every effort moves you,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...
the irony. She wanted him vindicated--and by me!" He laughed again, and threw back his
head to look up at the sketch of the donkey. "There were days when I
Ep 15 (Iter 000130): Train loss 0.101, Val loss 6.707
Every effort moves you?" "Yes--quite insensible to the irony. She wanted him
vindicated--and by me!" He laughed again, and threw back his head to look up at the
sketch of the donkey. "There were days when I
As in chapter 5, the model begins to overfit after a few epochs since it is a very small dataset that we iterate over multiple times. However, we can see that the function is working since it minimizes the training set loss.
Readers are encouraged to train the model on a larger text dataset and compare the results obtained with this more sophisticated training function to the results that can be obtained with the train_model_simple function used in chapter 5.
Appendix E. Parameter-efficient Finetuning with LoRA
This appendix introduces low-rank adaptation (LoRA), one of the most widely used techniques for parameter-efficient finetuning. After explaining the main idea behind LoRA, this appendix builds on the spam classification finetuning example from chapter 6 and finetunes the LLM. It's important to note, however, that LoRA finetuning is also applicable to the supervised instruction finetuning discussed in chapter 7.
\section*{E.1 Introduction to LoRA}
LoRA, or low-rank adaptation, is a technique that adapts a pretrained model to better suit a specific, often smaller, dataset by adjusting only a small subset of the model's weight parameters. The "low-rank" aspect refers to the mathematical concept of limiting model adjustments to a smaller dimensional subspace of the total weight parameter space, which effectively captures the most influential directions of the weight parameter changes during training.
The LoRA method is useful and popular because it enables efficient finetuning of large models on task-specific data, significantly cutting down on the computational costs and resources that are usually required for finetuning.
To explain how LoRA works, suppose there is a large weight matrix \(W\) associated with a specific layer. LoRA can be applied to all linear layers in an LLM, as we will see later, but we focus on a single layer for illustration purposes in this section.
When training deep neural networks, during backpropagation, we learn a \(\Delta W\) matrix, which contains information on how much we want to update the original weight parameters to minimize the loss function during training. In the rest of this appendix, we will use the term "weight" as a shorthand for the model's weight parameters.
In regular training and finetuning, the weight update is defined as follows:
\[
W_{\text{updated}} = W + \Delta W
\]
The LoRA method proposed by Hu et al. (https://arxiv.org/abs/2106.09685) offers a more efficient alternative to computing the weight updates \(\Delta W\) by learning an approximation of it:
\[
\Delta W \approx AB
\]
where \(A\) and \(B\) are two matrices much smaller than \(W\), and \(AB\) represents the matrix multiplication product between \(A\) and \(B\).
Using LoRA, we can then reformulate the weight update we defined earlier as follows:
\[
W_{\text{updated}} = W + AB
\]
Figure E.1 illustrates the weight update formulas for full finetuning and LoRA side by side.
Figure E.1 A comparison between weight update methods: regular finetuning and LoRA. On the left, regular finetuning involves updating the pretrained weight matrix \(W\) directly with \(\Delta W\). On the right, LoRA uses two smaller matrices \(A\) and \(B\) to approximate \(\Delta W\), where the product \(AB\) is added to \(W\), and \(r\) denotes the inner dimension, a tunable hyperparameter.
If you paid close attention, you might have noticed that the visual representations of full finetuning and LoRA in Figure E.1 differ slightly from the formulas presented earlier. This variation is attributed to the distributive law of matrix multiplication, which allows us to separate the original and updated weights rather than combine them. For example, in the case of regular finetuning with \(x\) as the input data, we can express the computation as follows:
\[
x (W + \Delta W) = x W + x \Delta W
\]
Similarly, we can write the following for LoRA:
\[
x (W + AB) = x W + x AB
\]
Besides reducing the number of weights to update during training, the ability to keep the LoRA weight matrices separate from the original model weights makes LoRA even more useful in practice. Practically, this allows for the pretrained model weights to remain unchanged, with the LoRA matrices being applied dynamically after training when using the model.
Keeping the LoRA weights separate is very useful in practice because it enables model customization without needing to store multiple complete versions of an LLM. This reduces storage requirements and improves scalability, as only the smaller LoRA matrices need to be adjusted and saved when we customize LLMs for each specific customer or application.
Now that we have discussed what LoRA is all about, in the following sections, let's see how it can be used to finetune an LLM for spam classification, similar to the finetuning example in chapter 6.
\section*{E.2 Preparing the dataset}
Before applying LoRA to the spam classification example from chapter 6, we have to load the dataset and pretrained model we will work with.
The code in this section repeats the data preparation from chapter 6. (Note that instead of repeating the code in this section, we could also open and run the chapter 6 notebook and then insert the LoRA code from section E.4 there.)
First, we download the dataset and save it as CSV files:
Listing E.1 Downloading and preparing the dataset
As a verification step, we iterate through the data loaders and check that the batches contain 8 training examples each, where each training example consists of 120 tokens:
print("Train loader:")
for input_batch, target_batch in train_loader:
pass
print("Input batch dimensions:", input_batch.shape)
print("Label batch dimensions", target_batch.shape)
\section*{E. 3 Initializing the model}
This section repeats the code from chapter 6 to load and prepare the pretrained GPT model. We begin with downloading the model weights and loading them into the GPTModel class:
\section*{Listing E. 4 Loading a pretrained GPT model}
from gpt_download import download_and_load_gpt2
from previous_chapters import GPTModel, load_weights_into_gpt
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")
model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()
To ensure that the model was loaded correctly, let's double-check that it generates coherent text:
from previous_chapters import (
    generate_text_simple,
    text_to_token_ids,
    token_ids_to_text
)

text_1 = "Every effort moves you"

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text_1, tokenizer),
    max_new_tokens=15,
    context_size=BASE_CONFIG["context_length"]
)

print(token_ids_to_text(token_ids, tokenizer))
As we can see based on the output below, the model generates coherent text, which is an indicator that the model weights were loaded correctly:
Every effort moves you forward.
The first step is to understand the importance of your work
Next, we prepare the model for classification-finetuning similar to chapter 6, where we replace the output layer:
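A minimal sketch of this step, assuming the two-class setup from chapter 6 (any additional layers that are unfrozen there are omitted here):

torch.manual_seed(123)
num_classes = 2
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG["emb_dim"],
    out_features=num_classes
)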
Lastly, let's calculate the initial classification accuracy of the not-finetuned model (we expect this to be around 50%, which means that the model is not yet able to distinguish between spam and non-spam messages reliably):
The initial prediction accuracies are as follows:
Training accuracy: 46.25%
Validation accuracy: 45.00%
Test accuracy: 48.75%
\section*{E. 4 Parameter-efficient finetuning with LoRA}
In this section, we modify and finetune the LLM using LoRA. We begin by initializing a LoRALayer that creates the matrices \(A\) and \(B\), along with the alpha scaling factor and the rank ( \(r\) ) setting.
This layer can accept an input and compute the corresponding output, as illustrated in Figure E.2.
Figure E. 2 The LoRA matrices \(A\) and \(B\) are applied to the layer inputs and are involved to compute the model outputs. The inner dimension \(r\) of these matrices serves as a setting that adjusts the number of trainable parameters by varying the sizes of \(A\) and \(B\).
In code, the LoRA layer depicted in Figure E.2 can be implemented as follows:
Listing E.5 Implementing a LoRA layer
import math

class LoRALayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = torch.nn.Parameter(torch.empty(in_dim, rank))
        torch.nn.init.kaiming_uniform_(self.A, a=math.sqrt(5)) #A
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        x = self.alpha * (x @ self.A @ self.B)
        return x
#A Same initialization that is used for Linear layers in PyTorch
In the preceding code, the rank governs the inner dimension of the matrices \(A\) and \(B\). Essentially, this setting determines the number of extra parameters introduced by LoRA and thus balances the adaptability of the model against its parameter efficiency.
The other important setting, alpha, functions as a scaling factor for the output from the low-rank adaptation. It primarily dictates the degree to which the output from the adapted layer can impact the original layer's output. This can be seen as a way to regulate the impact of the low-rank adaptation on the layer's output.
The LoRALayer class we have implemented so far enables us to transform the inputs of a layer.
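To make the tensor shapes concrete, here is a small usage sketch with hypothetical dimensions (768 matches the GPT-2 embedding size used later):

torch.manual_seed(123)
lora_layer = LoRALayer(in_dim=768, out_dim=768, rank=16, alpha=16)

x = torch.randn(2, 6, 768)  # (batch_size, num_tokens, in_dim)
print(lora_layer(x).shape)  # torch.Size([2, 6, 768]); all zeros until B is updated
print(sum(p.numel() for p in lora_layer.parameters()))  # 24,576 = 768*16 + 16*768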
In LoRA, the typical goal is to substitute existing Linear layers, allowing weight updates to be applied directly to the pre-existing pretrained weights, as illustrated in Figure E.3.
Figure E.3 Illustration of the integration of LoRA into a model layer. The original pretrained weights (\(W\)) of a layer are combined with the outputs from the LoRA matrices (\(A\) and \(B\)), which approximate the weight update matrix (\(\Delta W\)). The final output is calculated by adding the output of the adapted layer (using LoRA weights) to the original output.
To integrate the original Linear layer weights, as shown in Figure E.3, we now create a LinearWithLoRA layer. This layer utilizes the previously implemented LoRALayer and is designed to replace existing Linear layers within a neural network, such as the self-attention or feed-forward modules in the GPTModel:
Listing E.6 A LinearWithLoRA layer to replace Linear layers
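A minimal sketch of this layer, building on the LoRALayer defined above:

class LinearWithLoRA(torch.nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear  # the original (pretrained) Linear layer
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)  # original output plus low-rank update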
The preceding code combines a standard Linear layer with the LoRALayer. The forward method computes the output by adding the results from the original linear layer and the LoRA layer.
Since the weight matrix \(B\) (self.B in LoRALayer) is initialized with zero values, the product of matrices \(A\) and \(B\) results in a zero matrix. This ensures that the multiplication does not alter the original weights, as adding zero does not change them.
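A quick sanity check (with arbitrary toy dimensions) illustrates this zero-initialization property:

torch.manual_seed(123)
linear = torch.nn.Linear(10, 5)
layer_lora = LinearWithLoRA(linear, rank=2, alpha=4)

x = torch.randn(1, 10)
print(torch.allclose(linear(x), layer_lora(x)))  # True, because B starts out as all zeros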
To apply LoRA to the earlier defined GPTModel, we also introduce a replace_linear_with_lora function. This function will swap all existing Linear layers in the model with the newly created LinearWithLoRA layers:
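A minimal sketch of such a function follows; the #A and #B markers refer to the annotations below:

def replace_linear_with_lora(model, rank, alpha):
    for name, module in model.named_children():
        if isinstance(module, torch.nn.Linear):
            setattr(model, name, LinearWithLoRA(module, rank, alpha))  #A
        else:
            replace_linear_with_lora(module, rank, alpha)  #B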
#A Replace the Linear layer with LinearWithLoRA
#B Recursively apply the same function to child modules
We have now implemented all the necessary code to replace the Linear layers in the GPTModel with the newly developed LinearWithLoRA layers for parameter-efficient finetuning. In the subsequent sections, we will apply the LinearWithLoRA upgrade to all Linear layers found in the multi-head attention, feed-forward modules, and the output layer of the GPTModel, as illustrated in Figure E.4.
Figure E.4 A diagram representing the architecture of the GPT model. It highlights the parts of the model where Linear layers are upgraded to LinearWithLoRA layers for parameter-efficient finetuning.
Before we apply the LinearWithLoRA layer upgrades depicted in Figure E.4, we first freeze the original model parameters:
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters before: {total_params:,}")

for param in model.parameters():
    param.requires_grad = False

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters after: {total_params:,}")
After running the preceding code, we can see that none of the 124 million model parameters are trainable:
Total trainable parameters before: 124,441,346
Total trainable parameters after: 0
Next, we use the replace_linear_with_lora function to replace the Linear layers:
replace_linear_with_lora(model, rank=16, alpha=16)

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable LoRA parameters: {total_params:,}")
The number of trainable parameters, after adding the LoRA layers, is as follows:
Total trainable LoRA parameters: 2,666,528
As we can see, using LoRA reduces the number of trainable parameters by almost 50× (124,441,346 / 2,666,528 ≈ 47). A rank and alpha of 16 are good default choices, but it is also common to increase the rank parameter, which in turn increases the number of trainable parameters. Alpha is usually chosen to be half, double, or equal to the rank.
Let's verify that the layers have been modified as intended by printing the model architecture:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

print(model)
As we can see from the preceding output, the model now includes the new LinearWithLoRA layers, each of which combines the original Linear layer, which we set to non-trainable, with the new LoRA layer that we will finetune.
However, before we begin finetuning the model, let's calculate the initial classification accuracy:
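This reuses the calc_accuracy_loader calls from section E.3 (again estimating from the first 10 batches), now with the LoRA-augmented model:

torch.manual_seed(123)
train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)
val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)
test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)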
The resulting accuracy values are as follows:
Training accuracy: 46.25%
Validation accuracy: 45.00%
Test accuracy: 48.75%
If we compare these accuracy values to the initial ones from chapter 6, we notice they are identical. This occurs because we initialized the LoRA matrix \(B\) with zeros. Consequently, the product \(AB\) results in a zero matrix, which ensures that, before we start finetuning, the LoRA layers do not alter the original outputs, since adding zero changes nothing.
Now, let's move on to the exciting part and finetune the model using the training function from chapter 6. The training takes about 15 minutes on an M3 MacBook Air laptop and less than half a minute on a V100 or A100 GPU:
Listing E.7 Finetuning a model with LoRA layers
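A sketch of this listing is shown below. It assumes that the train_classifier_simple function from chapter 6 is available via previous_chapters and that we reuse the chapter 6 hyperparameters (the exact optimizer settings are assumptions here):

import time
from previous_chapters import train_classifier_simple

start_time = time.time()
torch.manual_seed(123)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
num_epochs = 5

train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=50, eval_iter=5
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")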
The output we see during the training is as follows:
Ep 1 (Step 000000): Train loss 3.820, Val loss 3.462
Ep 1 (Step 000050): Train loss 0.396, Val loss 0.364
Ep 1 (Step 000100): Train loss 0.111, Val loss 0.229
Training accuracy: 97.50% | Validation accuracy: 95.00%
Ep 2 (Step 000150): Train loss 0.135, Val loss 0.073
Ep 2 (Step 000200): Train loss 0.008, Val loss 0.052
Ep 2 (Step 000250): Train loss 0.021, Val loss 0.179
Training accuracy: 97.50% | Validation accuracy: 97.50%
Ep 3 (Step 000300): Train loss 0.096, Val loss 0.080
Ep 3 (Step 000350): Train loss 0.010, Val loss 0.116
Training accuracy: 97.50% | Validation accuracy: 95.00%
Ep 4 (Step 000400): Train loss 0.003, Val loss 0.151
Ep 4 (Step 000450): Train loss 0.008, Val loss 0.077
Ep 4 (Step 000500): Train loss 0.001, Val loss 0.147
Training accuracy: 100.00% | Validation accuracy: 97.50%
Ep 5 (Step 000550): Train loss 0.007, Val loss 0.094
Ep 5 (Step 000600): Train loss 0.000, Val loss 0.056
Training accuracy: 100.00% | Validation accuracy: 97.50%
Training completed in 12.10 minutes.
Note that training the model with LoRA takes longer than training it without LoRA in chapter 6, because the LoRA layers introduce an additional computation during the forward pass. However, for larger models, where backpropagation becomes more costly, models typically train faster with LoRA than without it.
As we can see, the model achieves perfect training accuracy and a very high validation accuracy. Let's also visualize the loss curves to better assess whether the training has converged:
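One way to do this is a plain matplotlib sketch over the loss values returned by the training function (rather than the chapter 6 plotting helper):

import matplotlib.pyplot as plt

epochs_seen = torch.linspace(0, num_epochs, len(train_losses))
plt.plot(epochs_seen, train_losses, label="Training loss")
plt.plot(epochs_seen, val_losses, linestyle="-.", label="Validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()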
The resulting plot is shown in Figure E.5.
Figure E.5 The training and validation loss curves over the five training epochs. Initially, both the training and validation loss decrease sharply; then they level off, indicating that the model is converging, which means it is not expected to improve noticeably with further training.
In addition to evaluating the model based on the loss curves shown in Figure E.5, let's also calculate the accuracies on the full training, validation, and test sets (during training, we approximated the training and validation set accuracies from 5 batches via the eval_iter=5 setting):
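Assuming that the chapter 6 calc_accuracy_loader function iterates over the entire data loader when num_batches is omitted, this can be sketched as follows:

train_accuracy = calc_accuracy_loader(train_loader, model, device)
val_accuracy = calc_accuracy_loader(val_loader, model, device)
test_accuracy = calc_accuracy_loader(test_loader, model, device)

print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")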
Training accuracy: 100.00%
Validation accuracy: 96.64%
Test accuracy: 98.00%
These accuracy values show that the model performs well across the training, validation, and test datasets. With a training accuracy of 100%, the model has perfectly learned the training data. However, the slightly lower validation and test accuracies (96.64% and 98.00%, respectively) suggest a small degree of overfitting, as the model does not generalize quite as well on unseen data compared to the training set. Overall, the results are very impressive considering that we finetuned only a relatively small number of model weights (2.7 million LoRA weights instead of the original 124 million model weights).