1 Understanding large language models
This chapter covers
- High-level explanations of the fundamental concepts behind large language models (LLMs)
- Insights into the transformer architecture from which LLMs are derived
- A plan for building an LLM from scratch
Large language models (LLMs), such as those offered in OpenAI’s ChatGPT, are deep neural network models that have been developed over the past few years. They ushered in a new era for natural language processing (NLP). Before the advent of LLMs, traditional methods excelled at categorization tasks such as email spam classification and straightforward pattern recognition that could be captured with handcrafted rules or simpler models. However, they typically underperformed in language tasks that demanded complex understanding and generation abilities, such as parsing detailed instructions, conducting contextual analysis, and creating coherent and contextually appropriate original text. For example, previous generations of language models could not write an email from a list of keywords—a task that is trivial for contemporary LLMs.
LLMs have remarkable capabilities to understand, generate, and interpret human language. However, it’s important to clarify that when we say language models “understand,” we mean that they can process and generate text in ways that appear coherent and contextually relevant, not that they possess human-like consciousness or comprehension.
Enabled by advancements in deep learning, which is a subset of machine learning and artificial intelligence (AI) focused on neural networks, LLMs are trained on vast quantities of text data. This large-scale training allows LLMs to capture deeper contextual information and subtleties of human language compared to previous approaches. As a result, LLMs have significantly improved performance in a wide range of NLP tasks, including text translation, sentiment analysis, question answering, and many more.
Another important distinction between contemporary LLMs and earlier NLP models is that earlier NLP models were typically designed for specific tasks, such as text categorization, language translation, etc. While those earlier NLP models excelled in their narrow applications, LLMs demonstrate a broader proficiency across a wide range of NLP tasks.
The success behind LLMs can be attributed to the transformer architecture that underpins many LLMs and the vast amounts of data on which LLMs are trained, allowing them to capture a wide variety of linguistic nuances, contexts, and patterns that would be challenging to encode manually.
This shift toward implementing models based on the transformer architecture and using large training datasets to train LLMs has fundamentally transformed NLP, providing more capable tools for understanding and interacting with human language.
The following discussion sets a foundation to accomplish the primary objective of this book: understanding LLMs by implementing a ChatGPT-like LLM based on the transformer architecture step by step in code.
1.1 What is an LLM?
An LLM is a neural network designed to understand, generate, and respond to human-like text. These models are deep neural networks trained on massive amounts of text data, sometimes encompassing large portions of the entire publicly available text on the internet.
The “large” in “large language model” refers to both the model’s size in terms of parameters and the immense dataset on which it’s trained. Models like this often have tens or even hundreds of billions of parameters, which are the adjustable weights in the network that are optimized during training to predict the next word in a sequence. Next-word prediction is sensible because it harnesses the inherent sequential nature of language to train models on understanding context, structure, and relationships within text. Yet, it is a very simple task, and so it is surprising to many researchers that it can produce such capable models. In later chapters, we will discuss and implement the next-word training procedure step by step.
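To make the term “parameters” concrete, the following minimal PyTorch sketch counts the adjustable weights of a deliberately tiny network. The layer sizes are arbitrary toy values, not the dimensions of any real LLM; production models apply the same principle to billions of such values.

```python
import torch.nn as nn

# A deliberately tiny stand-in for a language model: its "parameters"
# are the adjustable weights and biases inside each layer.
tiny_model = nn.Sequential(
    nn.Embedding(num_embeddings=1000, embedding_dim=64),  # 1,000-word toy vocabulary
    nn.Linear(64, 64),
    nn.Linear(64, 1000),  # scores over the vocabulary for the next word
)

num_params = sum(p.numel() for p in tiny_model.parameters())
print(f"Trainable parameters: {num_params:,}")  # 133,160 here; GPT-3 has about 175 billion
```

The count printed here is roughly six orders of magnitude smaller than that of GPT-3, but the meaning of “parameter” is exactly the same.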
LLMs utilize an architecture called the transformer, which allows them to pay selective attention to different parts of the input when making predictions, making them especially adept at handling the nuances and complexities of human language.
Since LLMs are capable of generating text, LLMs are also often referred to as a form of generative artificial intelligence, often abbreviated as generative AI or GenAI. As illustrated in figure 1.1, AI encompasses the broader field of creating machines that can perform tasks requiring human-like intelligence, including understanding language, recognizing patterns, and making decisions, and includes subfields like machine learning and deep learning.
Figure 1.1 As this hierarchical depiction of the relationship between the different fields suggests, LLMs represent a specific application of deep learning techniques, using their ability to process and generate human-like text. Deep learning is a specialized branch of machine learning that focuses on using multilayer neural networks. Machine learning and deep learning are fields aimed at implementing algorithms that enable computers to learn from data and perform tasks that typically require human intelligence.
The algorithms used to implement AI are the focus of the field of machine learning. Specifically, machine learning involves the development of algorithms that can learn from and make predictions or decisions based on data without being explicitly programmed. To illustrate this, imagine a spam filter as a practical application of machine learning. Instead of manually writing rules to identify spam emails, a machine learning algorithm is fed examples of emails labeled as spam and legitimate emails. By minimizing the error in its predictions on a training dataset, the model then learns to recognize patterns and characteristics indicative of spam, enabling it to classify new emails as either spam or not spam.
As illustrated in figure 1.1, deep learning is a subset of machine learning that focuses on utilizing neural networks with three or more layers (also called deep neural networks) to model complex patterns and abstractions in data. In contrast to deep learning, traditional machine learning requires manual feature extraction. This means that human experts need to identify and select the most relevant features for the model.
While the field of AI is now dominated by machine learning and deep learning, it also includes other approaches—for example, using rule-based systems, genetic algorithms, expert systems, fuzzy logic, or symbolic reasoning.
Returning to the spam classification example, in traditional machine learning, human experts might manually extract features from email text such as the frequency of certain trigger words (for example, “prize,” “win,” “free”), the number of exclamation marks, use of all uppercase words, or the presence of suspicious links. This dataset, created based on these expert-defined features, would then be used to train the model. In contrast to traditional machine learning, deep learning does not require manual feature extraction. This means that human experts do not need to identify and select the most relevant features for a deep learning model. (However, both traditional machine learning and deep learning for spam classification still require the collection of labels, such as spam or non-spam, which need to be gathered either by an expert or users.)
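To illustrate what manual feature extraction looks like in practice, here is a small sketch that turns an email into the kind of expert-defined feature vector described above. The particular features and trigger words are illustrative choices only, not a recommended spam filter.

```python
import re

TRIGGER_WORDS = {"prize", "win", "free"}  # expert-chosen trigger words

def extract_features(email_text):
    """Turn raw email text into a fixed-length, expert-defined feature vector."""
    words = re.findall(r"[A-Za-z']+", email_text)
    return [
        sum(w.lower() in TRIGGER_WORDS for w in words),   # trigger-word count
        email_text.count("!"),                            # number of exclamation marks
        sum(w.isupper() and len(w) > 1 for w in words),   # all-uppercase words
        float("http://" in email_text or "https://" in email_text),  # contains a link
    ]

print(extract_features("WIN a FREE prize now!!! http://example.com"))
# [3, 3, 2, 1.0] -- this vector, not the raw text, is what a traditional model sees
```

A deep learning model, by contrast, would consume a representation of the raw text directly and learn which patterns matter on its own.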
Let’s look at some of the problems LLMs can solve today, the challenges that LLMs address, and the general LLM architecture we will implement later.
1.2 Applications of LLMs
Owing to their advanced capabilities to parse and understand unstructured text data, LLMs have a broad range of applications across various domains. Today, LLMs are employed for machine translation, generation of novel texts (see figure 1.2), sentiment analysis, text summarization, and many other tasks. LLMs have recently been used for content creation, such as writing fiction, articles, and even computer code.
Figure 1.2 LLM interfaces enable natural language communication between users and AI systems. This screenshot shows ChatGPT writing a poem according to a user’s specifications.
LLMs can also power sophisticated chatbots and virtual assistants, such as OpenAI’s ChatGPT or Google’s Gemini (formerly called Bard), which can answer user queries and augment traditional search engines such as Google Search or Microsoft Bing.
Moreover, LLMs may be used for effective knowledge retrieval from vast volumes of text in specialized areas such as medicine or law. This includes sifting through documents, summarizing lengthy passages, and answering technical questions.
In short, LLMs are invaluable for automating almost any task that involves parsing and generating text. Their applications are virtually endless, and as we continue to innovate and explore new ways of using these models, it is clear that LLMs have the potential to redefine our relationship with technology, making it more conversational, intuitive, and accessible.
We will focus on understanding how LLMs work from the ground up, coding an LLM that can generate texts. You will also learn about techniques that allow LLMs to carry out queries, ranging from answering questions to summarizing text, translating text into different languages, and more. In other words, you will learn how complex LLM assistants such as ChatGPT work by building one step by step.
1.3 Stages of building and using LLMs
Why should we build our own LLMs? Coding an LLM from the ground up is an excellent exercise to understand its mechanics and limitations. Also, it equips us with the required knowledge for pretraining or fine-tuning existing open source LLM architectures to our own domain-specific datasets or tasks.
NOTE Most LLMs today are implemented using the PyTorch deep learning library, which is what we will use. Readers can find a comprehensive introduction to PyTorch in appendix A.
Research has shown that when it comes to modeling performance, custom-built LLMs—those tailored for specific tasks or domains—can outperform general-purpose LLMs, such as those provided by ChatGPT, which are designed for a wide array of applications. Examples of these include BloombergGPT (specialized for finance) and LLMs tailored for medical question answering (see appendix B for more details).
Using custom-built LLMs offers several advantages, particularly regarding data privacy. For instance, companies may prefer not to share sensitive data with third-party LLM providers like OpenAI due to confidentiality concerns. Additionally, developing smaller custom LLMs enables deployment directly on customer devices, such as laptops and smartphones, which is something companies like Apple are currently exploring. This local implementation can significantly decrease latency and reduce server-related costs. Furthermore, custom LLMs grant developers complete autonomy, allowing them to control updates and modifications to the model as needed.
The general process of creating an LLM includes pretraining and fine-tuning. The “pre” in “pretraining” refers to the initial phase where a model like an LLM is trained on a large, diverse dataset to develop a broad understanding of language. This pretrained model then serves as a foundational resource that can be further refined through fine-tuning, a process where the model is specifically trained on a narrower dataset that is more specific to particular tasks or domains. This two-stage training approach consisting of pretraining and fine-tuning is depicted in figure 1.3.
Figure 1.3 Pretraining an LLM involves next-word prediction on large text datasets. A pretrained LLM can then be fine-tuned using a smaller labeled dataset.
The first step in creating an LLM is to train it on a large corpus of text data, sometimes referred to as raw text. Here, “raw” refers to the fact that this data is just regular text without any labeling information. (Filtering may be applied, such as removing formatting characters or documents in unknown languages.)
NOTE Readers with a background in machine learning may note that labeling information is typically required for traditional machine learning models and deep neural networks trained via the conventional supervised learning paradigm. However, this is not the case for the pretraining stage of LLMs. In this phase, LLMs use self-supervised learning, where the model generates its own labels from the input data.
This first training stage of an LLM is also known as pretraining, creating an initial pretrained LLM, often called a base or foundation model. A typical example of such a model is the GPT-3 model (the precursor of the original model offered in ChatGPT). This model is capable of text completion—that is, finishing a half-written sentence provided by a user. It also has limited few-shot capabilities, which means it can learn to perform new tasks based on only a few examples instead of needing extensive training data.
After obtaining a pretrained LLM from training on large text datasets, where the LLM is trained to predict the next word in the text, we can further train the LLM on labeled data, also known as fine-tuning.
The two most popular categories of fine-tuning LLMs are instruction fine-tuning and classification fine-tuning. In instruction fine-tuning, the labeled dataset consists of instruction and answer pairs, such as a query to translate a text accompanied by the correctly translated text. In classification fine-tuning, the labeled dataset consists of texts and associated class labels—for example, emails associated with “spam” and “not spam” labels.
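The difference between these two flavors of fine-tuning is easiest to see in the shape of a single training record. The records below are made-up illustrations; real fine-tuning datasets contain many thousands of such pairs.

```python
# One record from a hypothetical instruction-fine-tuning dataset:
# an instruction (plus optional input) paired with the desired answer.
instruction_example = {
    "instruction": "Translate the following sentence into German.",
    "input": "This is an example.",
    "output": "Das ist ein Beispiel.",
}

# One record from a hypothetical classification-fine-tuning dataset:
# a text paired with a class label from a fixed set of categories.
classification_example = {
    "text": "Congratulations, you have won a free prize! Click here to claim it.",
    "label": "spam",  # the only other allowed label in this toy setup is "not spam"
}
```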
We will cover code implementations for pretraining and fine-tuning an LLM, and we will delve deeper into the specifics of both instruction and classification fine-tuning after pretraining a base LLM.
1.4 Introducing the transformer architecture
Most modern LLMs rely on the transformer architecture, which is a deep neural network architecture introduced in the 2017 paper “Attention Is All You Need” (https://arxiv.org/abs/1706.03762). To understand LLMs, we must understand the original transformer, which was developed for machine translation, translating English texts to German and French. A simplified version of the transformer architecture is depicted in figure 1.4.
Figure 1.4 A simplified depiction of the original transformer architecture, which is a deep learning model for language translation. The transformer consists of two parts: (a) an encoder that processes the input text and produces an embedding representation (a numerical representation that captures many different factors in different dimensions) of the text that the (b) decoder can use to generate the translated text one word at a time. This figure shows the final stage of the translation process where the decoder has to generate only the final word (“Beispiel”), given the original input text (“This is an example”) and a partially translated sentence (“Das ist ein”), to complete the translation.
The transformer architecture consists of two submodules: an encoder and a decoder. The encoder module processes the input text and encodes it into a series of numerical representations or vectors that capture the contextual information of the input. Then, the decoder module takes these encoded vectors and generates the output text. In a translation task, for example, the encoder would encode the text from the source language into vectors, and the decoder would decode these vectors to generate text in the target language. Both the encoder and decoder consist of many layers connected by a so-called self-attention mechanism. You may have many questions regarding how the inputs are preprocessed and encoded. These will be addressed in a step-by-step implementation in subsequent chapters.
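As a quick illustration of this encoder-decoder split (not the implementation we will build later), the sketch below pushes toy tensors through PyTorch's built-in nn.Transformer module. The tensor sizes are arbitrary, and a real translation model would add token embeddings, positional information, and masking on top of this.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Toy dimensions: 10 source tokens, 7 target tokens, batch of 1, 512-dim vectors
src = torch.rand(1, 10, 512)  # already-embedded source sentence (e.g., English)
tgt = torch.rand(1, 7, 512)   # already-embedded, partially generated target (e.g., German)

model = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,  # the original paper's 6+6 layout
    batch_first=True,
)

out = model(src, tgt)  # encoder encodes src; decoder attends to it while processing tgt
print(out.shape)       # torch.Size([1, 7, 512]) -- one vector per target position
```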
A key component of transformers and LLMs is the self-attention mechanism (not shown), which allows the model to weigh the importance of different words or tokens in a sequence relative to each other. This mechanism enables the model to capture long-range dependencies and contextual relationships within the input data, enhancing its ability to generate coherent and contextually relevant output. However, due to its complexity, we will defer further explanation to chapter 3, where we will discuss and implement it step by step.
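As a small preview of what chapter 3 implements step by step, the core of (scaled dot-product) self-attention can be sketched in a few lines. This minimal version omits the learned query, key, and value projections and the masking that a complete implementation adds.

```python
import torch

torch.manual_seed(123)
x = torch.rand(6, 4)  # 6 tokens in a sequence, each represented by a 4-dimensional vector

# In full self-attention, queries, keys, and values are learned projections of x;
# here we use x itself for all three to keep the sketch minimal.
scores = x @ x.T / x.shape[-1] ** 0.5    # pairwise similarity between tokens, scaled
weights = torch.softmax(scores, dim=-1)  # each row sums to 1: how much token i attends to token j
context = weights @ x                    # each token becomes a weighted mix of all tokens

print(weights.shape, context.shape)      # torch.Size([6, 6]) torch.Size([6, 4])
```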
Later variants of the transformer architecture, such as BERT (short for bidirectional encoder representations from transformers) and the various GPT models (short for generative pretrained transformers), built on this concept to adapt this architecture for different tasks. If interested, refer to appendix B for further reading suggestions.
BERT, which is built upon the original transformer’s encoder submodule, differs in its training approach from GPT. While GPT is designed for generative tasks, BERT and its variants specialize in masked word prediction, where the model predicts masked or hidden words in a given sentence, as shown in figure 1.5. This unique training strategy equips BERT with strengths in text classification tasks, including sentiment prediction and document categorization. As an application of its capabilities, as of this writing, X (formerly Twitter) uses BERT to detect toxic content.
Figure 1.5 A visual representation of the transformer’s encoder and decoder submodules. On the left, the encoder segment exemplifies BERT-like LLMs, which focus on masked word prediction and are primarily used for tasks like text classification. On the right, the decoder segment showcases GPT-like LLMs, designed for generative tasks and producing coherent text sequences.
GPT, on the other hand, focuses on the decoder portion of the original transformer architecture and is designed for tasks that require generating texts. This includes machine translation, text summarization, fiction writing, writing computer code, and more.
GPT models, primarily designed and trained to perform text completion tasks, also show remarkable versatility in their capabilities. These models are adept at executing both zero-shot and few-shot learning tasks. Zero-shot learning refers to the ability to generalize to completely unseen tasks without any prior specific examples. On the other hand, few-shot learning involves learning from a minimal number of examples the user provides as input, as shown in figure 1.6.
Figure 1.6 In addition to text completion, GPT-like LLMs can solve various tasks based on their inputs without needing retraining, fine-tuning, or task-specific model architecture changes. Sometimes it is helpful to provide examples of the target within the input, which is known as a few-shot setting. However, GPT-like LLMs are also capable of carrying out tasks without a specific example, which is called zero-shot setting.
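The distinction in figure 1.6 comes down to what the user puts into the prompt, not to any change in the model itself. The two strings below are hypothetical prompts of the kind one might send to a GPT-like model.

```python
# Zero-shot: the task is described, but no worked example is given.
zero_shot_prompt = (
    "Translate the following word into German: cheese\n"
    "Answer:"
)

# Few-shot: a handful of worked examples precede the actual query.
few_shot_prompt = (
    "Translate the following words into German.\n"
    "example: apple -> Apfel\n"
    "example: house -> Haus\n"
    "cheese ->"
)
```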
1.5 Utilizing large datasets
The large training datasets for popular GPT- and BERT-like models represent diverse and comprehensive text corpora encompassing billions of words, which include a vast array of topics and natural and computer languages. To provide a concrete example, table 1.1 summarizes the dataset used for pretraining GPT-3, which served as the base model for the first version of ChatGPT.
Table 1.1 The pretraining dataset of the popular GPT-3 LLM
| Dataset name | Dataset description | Number of tokens | Proportion in training data |
|---|---|---|---|
| CommonCrawl (filtered) | Web crawl data | 410 billion | 60% |
| WebText2 | Web crawl data | 19 billion | 22% |
| Books1 | Internet-based book corpus | 12 billion | 8% |
| Books2 | Internet-based book corpus | 55 billion | 8% |
| Wikipedia | High-quality text | 3 billion | 3% |
Table 1.1 reports the number of tokens, where a token is a unit of text that a model reads and the number of tokens in a dataset is roughly equivalent to the number of words and punctuation characters in the text. Chapter 2 addresses tokenization, the process of converting text into tokens.
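For a rough intuition about the token counts in table 1.1, the sketch below approximates tokens as words plus punctuation characters. Actual LLM tokenizers use subword schemes (covered in chapter 2), so real counts differ somewhat, but the order of magnitude is comparable.

```python
import re

text = "This is an example. LLMs read text as tokens, not characters!"

# Approximate tokens as words plus standalone punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(len(tokens), tokens[:8])
# 14 ['This', 'is', 'an', 'example', '.', 'LLMs', 'read', 'text']
```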
The main takeaway is that the scale and diversity of this training dataset allow these models to perform well on diverse tasks, including language syntax, semantics, and context—even some requiring general knowledge.
The pretrained nature of these models makes them incredibly versatile for further fine-tuning on downstream tasks, which is why they are also known as base or foundation models. Pretraining LLMs requires access to significant resources and is very expensive. For example, the GPT-3 pretraining cost is estimated to be $4.6 million in terms of cloud computing credits (https://mng.bz/VxEW).
The good news is that many pretrained LLMs, available as open source models, can be used as general-purpose tools to write, extract, and edit texts that were not part of the training data. Also, LLMs can be fine-tuned on specific tasks with relatively smaller datasets, reducing the computational resources needed and improving performance.
We will implement the code for pretraining and use it to pretrain an LLM for educational purposes. All computations are executable on consumer hardware. After implementing the pretraining code, we will learn how to reuse openly available model weights and load them into the architecture we will implement, allowing us to skip the expensive pretraining stage when we fine-tune our LLM.
1.6 A closer look at the GPT architecture
GPT was originally introduced in the paper “Improving Language Understanding by Generative Pre-Training” (https://mng.bz/x2qg) by Radford et al. from OpenAI. GPT-3 is a scaled-up version of this model that has more parameters and was trained on a larger dataset. In addition, the original model offered in ChatGPT was created by fine-tuning GPT-3 on a large instruction dataset using a method from OpenAI’s InstructGPT paper (https://arxiv.org/abs/2203.02155). As figure 1.6 shows, these models are competent text completion models and can carry out other tasks such as spelling correction, classification, or language translation. This is actually very remarkable given that GPT models are pretrained on a relatively simple next-word prediction task, as depicted in figure 1.7.
Figure 1.7 In the next-word prediction pretraining task for GPT models, the system learns to predict the upcoming word in a sentence by looking at the words that have come before it. This approach helps the model understand how words and phrases typically fit together in language, forming a foundation that can be applied to various other tasks.
The next-word prediction task is a form of self-supervised learning, which is a form of self-labeling. This means that we don’t need to collect labels for the training data explicitly but can use the structure of the data itself: we can use the next word in a sentence or document as the label that the model is supposed to predict. Since this next-word prediction task allows us to create labels “on the fly,” it is possible to use massive unlabeled text datasets to train LLMs.
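The following sketch shows what creating labels “on the fly” means: each position in an unlabeled sequence yields an input context together with the next token as its target. The word-level tokens here are a stand-in for the subword token IDs used in practice.

```python
tokens = ["LLMs", "learn", "to", "predict", "one", "word", "at", "a", "time"]

# Each training example pairs a context with the word that follows it.
for i in range(1, len(tokens)):
    context, target = tokens[:i], tokens[i]
    print(context, "-->", target)
# ['LLMs'] --> learn
# ['LLMs', 'learn'] --> to
# ... and so on; no human-provided labels are needed.
```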
Compared to the original transformer architecture we covered in section 1.4, the general GPT architecture is relatively simple. Essentially, it’s just the decoder part without the encoder (figure 1.8). Since decoder-style models like GPT generate text by predicting text one word at a time, they are considered a type of autoregressive model. Autoregressive models incorporate their previous outputs as inputs for future predictions. Consequently, in GPT, each new word is chosen based on the sequence that precedes it, which improves the coherence of the resulting text.
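Autoregressive generation boils down to a simple loop: score all candidate next tokens, pick one, append it to the input, and repeat. In the sketch below, next_token_scores is a hypothetical stand-in for a trained GPT-like model, and greedy selection is used for simplicity; real systems often sample from the scores instead.

```python
import torch

vocab = ["<end>", "Das", "ist", "ein", "Beispiel"]

def next_token_scores(token_ids):
    # Hypothetical stand-in for a trained GPT-like model: returns one score per
    # vocabulary entry. Here we simply favor the entry after the last token seen.
    scores = torch.zeros(len(vocab))
    scores[(token_ids[-1] + 1) % len(vocab)] = 1.0
    return scores

generated = [1]  # start from the token "Das"
for _ in range(4):  # generate up to four more tokens
    scores = next_token_scores(generated)
    next_id = int(torch.argmax(scores))  # greedy: pick the highest-scoring token
    if vocab[next_id] == "<end>":
        break
    generated.append(next_id)            # feed the output back in as new input

print(" ".join(vocab[i] for i in generated))  # Das ist ein Beispiel
```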
Architectures such as GPT-3 are also significantly larger than the original transformer model. For instance, the original transformer repeated the encoder and decoder blocks six times. GPT-3 has 96 transformer layers and 175 billion parameters in total.
Figure 1.8 The GPT architecture employs only the decoder portion of the original transformer. It is designed for unidirectional, left-to-right processing, making it well suited for text generation and next-word prediction tasks to generate text in an iterative fashion, one word at a time.
GPT-3 was introduced in 2020, which, by the standards of deep learning and large language model development, is considered a long time ago. However, more recent architectures, such as Meta’s Llama models, are still based on the same underlying concepts, introducing only minor modifications. Hence, understanding GPT remains as relevant as ever, so I focus on implementing the prominent architecture behind GPT while providing pointers to specific tweaks employed by alternative LLMs.
Although the original transformer model, consisting of encoder and decoder blocks, was explicitly designed for language translation, GPT models—despite their larger yet simpler decoder-only architecture aimed at next-word prediction—are also capable of performing translation tasks. This capability was initially unexpected to researchers, as it emerged from a model primarily trained on a next-word prediction task, which is a task that did not specifically target translation.
The ability to perform tasks that the model wasn’t explicitly trained to perform is called an emergent behavior. This capability isn’t explicitly taught during training but emerges as a natural consequence of the model’s exposure to vast quantities of multilingual data in diverse contexts. The fact that GPT models can “learn” the translation patterns between languages and perform translation tasks even though they weren’t specifically trained for it demonstrates the benefits and capabilities of these large-scale, generative language models. We can perform diverse tasks without using diverse models for each.
1.7 Building a large language model
Now that we’ve laid the groundwork for understanding LLMs, let’s code one from scratch. We will take the fundamental idea behind GPT as a blueprint and tackle this in three stages, as outlined in figure 1.9.
Figure 1.9 The three main stages of coding an LLM are implementing the LLM architecture and data preparation process (stage 1), pretraining an LLM to create a foundation model (stage 2), and fine-tuning the foundation model to become a personal assistant or text classifier (stage 3).
In stage 1, we will learn about the fundamental data preprocessing steps and code the attention mechanism at the heart of every LLM. Next, in stage 2, we will learn how to code and pretrain a GPT-like LLM capable of generating new texts. We will also go over the fundamentals of evaluating LLMs, which is essential for developing capable NLP systems.
Pretraining an LLM from scratch is a significant endeavor, demanding thousands to millions of dollars in computing costs for GPT-like models. Therefore, the focus of stage 2 is on implementing training for educational purposes using a small dataset. In addition, I also provide code examples for loading openly available model weights.
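Loading openly available weights usually amounts to copying a saved parameter dictionary (a state_dict in PyTorch) into a model with a matching architecture. The snippet below shows only this general pattern with a toy model and a made-up file name; it is not the exact loading code used later in the book.

```python
import torch
import torch.nn as nn

# A tiny stand-in model; the real GPT architecture is implemented in later chapters.
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))

# Saving and re-loading a state_dict is the general PyTorch pattern that also
# applies to openly released LLM weights (the file name here is made up).
torch.save(model.state_dict(), "toy_weights.pth")

restored = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))  # same architecture
restored.load_state_dict(torch.load("toy_weights.pth"))  # copy the saved parameter values in
restored.eval()  # ready for generation or fine-tuning without repeating pretraining
```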
Finally, in stage 3, we will take a pretrained LLM and fine-tune it to follow instructions such as answering queries or classifying texts—the most common tasks in many real-world applications and research.
I hope you are looking forward to embarking on this exciting journey!
Summary
- LLMs have transformed the field of natural language processing, which previously mostly relied on explicit rule-based systems and simpler statistical methods. The advent of LLMs introduced new deep learning-driven approaches that led to advancements in understanding, generating, and translating human language.
- Modern LLMs are trained in two main steps:
  - First, they are pretrained on a large corpus of unlabeled text by using the prediction of the next word in a sentence as a label.
  - Then, they are fine-tuned on a smaller, labeled target dataset to follow instructions or perform classification tasks.
- LLMs are based on the transformer architecture. The key idea of the transformer architecture is an attention mechanism that gives the LLM selective access to the whole input sequence when generating the output one word at a time.
- The original transformer architecture consists of an encoder for parsing text and a decoder for generating text.
- LLMs for generating text and following instructions, such as GPT-3 and ChatGPT, only implement decoder modules, simplifying the architecture.
- Large datasets consisting of billions of words are essential for pretraining LLMs.
- While the general pretraining task for GPT-like models is to predict the next word in a sentence, these LLMs exhibit emergent properties, such as capabilities to classify, translate, or summarize texts.
- Once an LLM is pretrained, the resulting foundation model can be fine-tuned more efficiently for various downstream tasks.
- LLMs fine-tuned on custom datasets can outperform general LLMs on specific tasks.