
How LLMs Work, Explained Without Math


I'm sure you agree that it has become impossible to ignore Generative AI (GenAI), as we are constantly bombarded with mainstream news about Large Language Models (LLMs). Very likely you have tried ChatGPT, maybe even keep it open all the time as an assistant.

A basic question I think a lot of people have about the GenAI revolution is where does the apparent intelligence these models have come from. In this article, I'm going to attempt to explain in simple terms and without using advanced math how generative text models work, to help you think about them as computer algorithms and not as magic.

What Does An LLM Do?

I'll begin by clearing a big misunderstanding people have regarding how Large Language Models work. The assumption that most people make is that these models can answer questions or chat with you, but in reality all they can do is take some text you provide as input and guess what the next word (or more accurately, the next token) is going to be.

Let's start to unravel the mystery of LLMs from the tokens.

Tokens

A token is the basic unit of text understood by the LLM. It is convenient to think of tokens as words, but for the LLM the goal is to encode text as efficiently as possible, so in many cases tokens represent sequences of characters that are shorter or longer than whole words. Punctuation symbols and spaces are also represented as tokens, either individually or grouped with other characters.

The complete list of tokens used by an LLM is said to be the LLM's vocabulary, since it can be used to express any possible text. The byte pair encoding (BPE) algorithm is commonly used by LLMs to generate a token vocabulary given an input dataset. Just so that you have some rough idea of scale, the GPT-2 language model, which is open source and can be studied in detail, uses a vocabulary of 50,257 tokens.

Each token in an LLM's vocabulary is given a unique identifier, usually a number. The LLM uses a tokenizer to convert between regular text given as a string and an equivalent sequence of tokens, given as a list of token numbers. If you are familiar with Python and want to play with tokens, you can install the tiktoken package from OpenAI:

$ pip install tiktoken

Then try this in a Python prompt:

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-2")

>>> encoding.encode("The quick brown fox jumps over the lazy dog.")
[464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13]

>>> encoding.decode([464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13])
'The quick brown fox jumps over the lazy dog.'

>>> encoding.decode([464])
'The'
>>> encoding.decode([2068])
' quick'
>>> encoding.decode([13])
'.'

You can see in this experiment that for the GPT-2 language model token 464 represents the word "The", and token 2068 represents the word " quick", including a leading space. This model uses token 13 for the period.

Because tokens are determined algorithmically, you may find strange things, such as these three variants of the word "the", all encoded as different tokens by GPT-2:

>>> encoding.encode('The')
[464]
>>> encoding.encode('the')
[1169]
>>> encoding.encode(' the')
[262]

The BPE algorithm doesn't always map entire words to tokens. In fact, words that are less frequently used do not get to be their own token and have to be encoded with multiple tokens. Here is an example of a word that this model encodes with two tokens:

>>> encoding.encode("Payment")
[19197, 434]

>>> encoding.decode([19197])
'Pay'
>>> encoding.decode([434])
'ment'

Next Token Predictions

As I stated above, given some text, a language model makes predictions about what token will follow right after. If it helps to see this with Python pseudo-code, here is how you could run one of these models to get predictions for the next token:

predictions = get_token_predictions(['The', ' quick', ' brown', ' fox'])

The function gets a list of input tokens, which are encoded from the prompt provided by the user. In this example I'm assuming words are all individual tokens. To keep things simple I'm using the textual representation of each token, but as you've seen before in reality each token will be passed to the model as a number.

The returned value of this function is a data structure that assigns each token in the vocabulary a probability of following the input text. If this were based on GPT-2, the return value of the function would be a list of 50,257 floating point numbers, each predicting the probability that the corresponding token will come next.

In the example above you could imagine that a well trained language model will give the token "jumps" a high probability of following the partial phrase "The quick brown fox" that I used as prompt. Once again assuming a model trained appropriately, you could also imagine that the probability of a random word such as "potato" continuing this phrase is going to be much lower and close to 0.
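
To make this more tangible, here is a hypothetical, made-up example of what those predictions could look like for this prompt, written as a mapping from tokens to probabilities instead of the raw list of 50,257 floats:

# Hypothetical values for illustration only; a real model returns one
# probability for every token in the vocabulary.
predictions = {
    ' jumps': 0.81,     # likely continuations receive most of the probability
    ' runs': 0.05,
    ' potato': 0.0001,  # unrelated words get probabilities close to 0
    # ...one entry for each remaining token in the vocabulary
}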

To be able to produce reasonable predictions, the language model has to go through a training process. During training, it is presented with lots and lots of text to learn from. At the end of the training, the model is able to calculate next token probabilities for a given token sequence using data structures that it has built using all the text that it saw in training.

Is this different from what you expected? I hope this is starting to look less magical now.

Generating Long Text Sequences

Since the model can only predict what the next token is going to be, the only way to make it generate complete sentences is to run the model multiple times in a loop. With each loop iteration a new token is generated, chosen from the returned probabilities. This token is then added to the input that is given to the model on the next iteration of the loop, and this continues until sufficient text has been generated.

Let's look at a more complete Python pseudo-code showing how this would work:

def generate_text(prompt, num_tokens, hyperparameters):
    tokens = tokenize(prompt)
    for i in range(num_tokens):
        predictions = get_token_predictions(tokens)
        next_token = select_next_token(predictions, hyperparameters)
        tokens.append(next_token)
    return ''.join(tokens)

The generate_text() function takes a user prompt as an argument. This could be, for example, a question.

The tokenize() helper function converts the prompt to an equivalent list of tokens, using tiktoken or a similar library. Inside the for-loop, the get_token_predictions() function is where the AI model is called to get the probabilities for the next token, as in the previous example.

The job of the select_next_token() function is to take the next token probabilities (or predictions) and pick the best token to continue the input sequence. The function could just pick the token with the highest probability, which in machine learning is called a greedy selection. Better yet, it can pick a token using a random number generator that honors the probabilities returned by the model, and in that way add some variety to the generated text. This will also make the model produce different responses if given the same prompt multiple times.
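
Here is a minimal sketch of how select_next_token() could be implemented, assuming the predictions are given as a dictionary that maps each candidate token to its probability:

import random

def select_next_token(predictions, hyperparameters):
    # Greedy selection: always return the most probable token.
    if hyperparameters.get('greedy'):
        return max(predictions, key=predictions.get)
    # Weighted random selection that honors the model's probabilities.
    tokens = list(predictions.keys())
    weights = list(predictions.values())
    return random.choices(tokens, weights=weights, k=1)[0]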

To make the token selection process even more flexible, the probabilities returned by the LLM can be modified using hyperparameters, which are passed to the text generation function as arguments. The hyperparameters allow you to control the "greediness" of the token selection process. If you have used LLMs, you are likely familiar with the temperature hyperparameter. With a higher temperature, the token probabilities are flattened out, which increases the chances that less likely tokens are selected, with the end result of making the generated text look more creative or unusual. You may have also used two other hyperparameters called top_p and top_k, which control how many of the most probable tokens are considered for selection.
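
As a rough sketch of how these hyperparameters could be applied before sampling (a real model applies the temperature to the raw scores known as logits, but re-weighting the probabilities as shown here has the same effect):

def adjust_probabilities(predictions, temperature=1.0, top_k=None):
    # A temperature above 1 flattens the distribution; below 1 sharpens it.
    weights = {t: p ** (1 / temperature) for t, p in predictions.items()}
    if top_k is not None:
        # top_k: only the k most probable tokens remain candidates.
        best = sorted(weights, key=weights.get, reverse=True)[:top_k]
        weights = {t: weights[t] for t in best}
    # Re-normalize so that the remaining weights add up to 1 again.
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}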

Once a token has been selected, the loop iterates and now the model is given an input that includes the new token at the end, and one more token is generated to follow it. The num_tokens argument controls how many iterations to run the loop for, or in other words, how much text to generate. The generated text can (and often does) end mid-sentence, because the LLM has no concept of sentences or paragraphs, since it just works on one token at a time. To prevent the generated text from ending in the middle of a sentence, we could consider the num_tokens argument as a maximum instead of an exact number of tokens to generate, and in that case we could stop the loop when a period token is generated.
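
As a sketch, the loop in generate_text() could be modified like this to treat num_tokens as a maximum, assuming as before that the period is its own token:

for i in range(num_tokens):
    predictions = get_token_predictions(tokens)
    next_token = select_next_token(predictions, hyperparameters)
    tokens.append(next_token)
    if next_token == '.':  # stop early at the end of a sentence
        break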

If you've reached this point and understood everything then congratulations, you now know how LLMs work at a high level. Are you interested in more details? In the next section I'll get a bit more technical, while still doing my best to avoid referencing the math that supports this technology, which is quite advanced.

Model Training

Unfortunately, discussing how a model is trained is actually difficult without using math. What I'm going to do is start by showing you a very simple training approach.

Given that the task is to predict tokens that follow other tokens, a simple way to train a model is to get all the pairs of consecutive tokens that appear in the training dataset and build a table of probabilities with them.

Let's do this with a short vocabulary and dataset. Let's say the model's vocabulary has the following five tokens:

['I', 'you', 'like', 'apples', 'bananas']

To keep this example short and simple, I'm not going to consider spaces or punctuation symbols as tokens.

Let's use a training dataset that is composed of three sentences:

  • I like apples
  • I like bananas
  • you like bananas

We can build a 5x5 table and in each cell write how many times the token representing the row of the cell is followed by the token representing the column. Here is the table built from the three sentences in the dataset:

          I      you    like    apples   bananas
I                       2
you                     1
like                            1        2
apples
bananas

Hopefully this is clear. The dataset has two instances of "I like", one instance of "you like", one instance of "like apples" and two of "like bananas".

Now that we know how many times each pair of tokens appeared in the training dataset, we can calculate the probability of each token following another. To do this, we convert the numbers in each row to probabilities. For example, the token "like" in the middle row of the table was followed once by "apples" and twice by "bananas". That means that "apples" follows "like" 33.3% of the time, and "bananas" follows it the remaining 66.7% of the time.

Here is the complete table with all the probabilities calculated. Empty cells have a probability of 0%.

          I      you    like    apples   bananas
I                       100%
you                     100%
like                            33.3%    66.7%
apples    25%    25%    25%              25%
bananas   25%    25%    25%     25%

The rows for "I", "you" and "like" are easy to calculate, but "apples" and "bananas" present a problem because they have no data at all, since the dataset does not have any examples with these tokens being followed by other tokens. Here we have a "hole" in our training. To make sure that the model produces a prediction even when lacking training, I have decided to split the probabilities of a follow-up token for "apples" and "bananas" evenly across the other four possible tokens. This could obviously generate strange results, but at least the model will not get stuck when it reaches one of these two tokens.
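
Here is a short Python sketch that builds this probabilities table from the example dataset, including the even split that fills the holes:

from collections import defaultdict

vocabulary = ['I', 'you', 'like', 'apples', 'bananas']
dataset = [
    ['I', 'like', 'apples'],
    ['I', 'like', 'bananas'],
    ['you', 'like', 'bananas'],
]

# Count how many times each token is followed by each other token.
counts = defaultdict(lambda: defaultdict(int))
for sentence in dataset:
    for first, second in zip(sentence, sentence[1:]):
        counts[first][second] += 1

# Convert the counts in each row to probabilities. Tokens with no data
# get an even split across the other four tokens.
probabilities_table = {}
for token in vocabulary:
    total = sum(counts[token].values())
    if total > 0:
        probabilities_table[token] = {t: n / total for t, n in counts[token].items()}
    else:
        others = [t for t in vocabulary if t != token]
        probabilities_table[token] = {t: 1 / len(others) for t in others}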

The problem of holes in training data is actually important. In real LLMs the training datasets are very large, so you would not find training holes that are as obvious as in my tiny example above. But smaller holes that are more difficult to detect due to low coverage in the training data do exist and are fairly common. The quality of the token predictions the LLM makes in these poorly trained areas can be bad, but often in ways that are difficult to perceive. This is one of the reasons LLMs can sometimes hallucinate, which happens when the generated text reads well, but contains factual errors or inconsistencies.

Using the probabilities table above, you may now imagine how an implementation of the get_token_predictions() function would work. In Python pseudo-code it would be something like this:

def get_token_predictions(input_tokens):
    last_token = input_tokens[-1]
    return probabilities_table[last_token]

Simpler than expected, right? The function accepts a sequence of tokens, which come from the user prompt. It takes the last token in the sequence, and returns the row in the probabilities table that corresponds to that token.

If you were to call this function with ['you', 'like'] as input tokens, for example, the function would return the row for "like", which gives the token "apples" a 33.3% chance of continuing the sentence, and the token "bananas" the other 66.7%. With these probabilities, the select_next_token() function shown above should choose "apples" one out of three times.
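
Using the probabilities_table built in the earlier sketch, the call would look like this in a Python prompt:

>>> get_token_predictions(['you', 'like'])
{'apples': 0.3333333333333333, 'bananas': 0.6666666666666666}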

When the "apples" token is selected as a continuation of "you like", the sentence "you like apples" will be formed. This is an original sentence that did not exist in the training dataset, yet it is perfectly reasonable. Hopefully you are starting to get an idea of how these models can come up with what appears to be original ideas or concepts, just by reusing patterns and stitching together different bits of what they learned in training.

The Context Window

The approach I took in the previous section to train my mini-language model is called a Markov chain.

An issue with this technique is that only one token (the last of the input) is used to make a prediction. Any text that appears before that last token doesn't have any influence when choosing how to continue, so we can say that the context window of this solution is equal to one token, which is very small. With such a small context window the model constantly "forgets" its line of thought and jumps from one word to the next without much consistency.

To improve the model's predictions a larger probabilities table can be constructed. To use a context window of two tokens, additional table rows would have to be added to represent all possible sequences of two tokens. With the five tokens I used in the example there would be 25 new rows in the probabilities table, one for each pair of tokens, added to the 5 single-token rows that are already there. The model would have to be trained again, this time looking at groups of three tokens in addition to the pairs. Then on each call to the get_token_predictions() function, the last two tokens from the input would be used when available to find the corresponding row in the larger probabilities table.
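
Here is a sketch of this extension, reusing the dataset and imports from the earlier sketch and keying the table by up to two tokens of context (the hole-filling step is omitted for brevity):

# Count follow-ups for every context of one or two tokens.
pair_counts = defaultdict(lambda: defaultdict(int))
for sentence in dataset:
    for i in range(1, len(sentence)):
        context = tuple(sentence[max(0, i - 2):i])  # last one or two tokens
        pair_counts[context][sentence[i]] += 1

def get_token_predictions(input_tokens):
    context = tuple(input_tokens[-2:])  # use up to two tokens of context
    followers = pair_counts[context]
    total = sum(followers.values())
    return {t: n / total for t, n in followers.items()}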

But a context window of 2 tokens is still insufficient. For the generated text to be consistent with itself and make at least some basic sense, a much larger context window is needed. Without a large enough context it is impossible for newly generated tokens to relate to concepts or ideas expressed in previous tokens. So what can we do? Increasing the context window to 3 tokens would add 125 additional rows to the probabilities table, and the quality would still be very poor. How large do we need to make the context window?

The open source GPT-2 model from OpenAI uses a context window of 1024 tokens. To be able to implement a context window of this size using Markov chains, each row of the probabilities table would have to represent a sequence that is between 1 and 1024 tokens long. Using the above example vocabulary of 5 tokens, there are 5^1024 possible sequences that are 1024 tokens long. How many table rows are required to represent this? I did the calculation in a Python session (scroll to the right to see the complete number):

>>> pow(5, 1024)
55626846462680034577255817933310101605480399511558295763833185422180110870347954896357078975312775514101683493275895275128810854038836502721400309634442970528269449838300058261990253686064590901798039126173562593355209381270166265416453973718012279499214790991212515897719252957621869994522193843748736289511290126272884996414561770466127838448395124802899527144151299810833802858809753719892490239782222290074816037776586657834841586939662825734294051183140794537141608771803070715941051121170285190347786926570042246331102750604036185540464179153763503857127117918822547579033069472418242684328083352174724579376695971173152319349449321466491373527284227385153411689217559966957882267024615430273115634918212890625

That is a lot of rows! And this is only a portion of the table, since we would also need sequences that are 1023 tokens long, 1022, etc., all the way to 1, since we want to make sure shorter sequences can also be handled when not enough tokens are available in the input. Markov chains are fun to work with, but they do have a big scalability problem.

And a context window of 1024 tokens isn't even that great anymore. With GPT-3, the context window was increased to 2048 tokens, then increased to 4096 in GPT-3.5. GPT-4 started with 8192 tokens, later got increased to 32K, and then again to 128K (that's right, 128,000 tokens!). Models with 1M or larger context windows are starting to appear now, allowing models to have much better consistency and recall when they make token predictions.

In conclusion, Markov chains allow us to think about the problem of text generation in the right way, but they have big issues that prevent us from considering them as a viable solution.

From Markov Chains to Neural Networks

Obviously we have to forget the idea of having a table of probabilities, since a table for a reasonable context window would require an impossibly large amount of RAM. What we can do is replace the table with a function that returns an approximation of what the token probabilities would be, generated algorithmically instead of stored as a big table. This is actually something that neural networks can do well.

A neural network is a special type of function that takes some inputs, performs some calculations on them, and returns an output. For a language model the inputs are the tokens that represent the prompt, and the output is the list of predicted probabilities for the next token.

I said neural networks are "special" functions. What makes them special is that in addition to the function logic, the calculations they perform on the inputs are controlled by a number of externally defined parameters. Initially, the parameters of the network are not known, and as a result, the function produces an output that is completely useless. The training process for the neural network consists of finding the parameters that make the function perform the best when evaluated on the data from the training dataset, with the assumption that if the function works well with the training data it will work comparably well with other data.
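
As a toy illustration only (a real LLM layer is far more complex), here is a tiny function whose output is entirely determined by externally provided parameters:

import math

def tiny_network(inputs, weights, biases):
    # Compute one score per vocabulary token from the inputs and parameters.
    scores = [sum(w * x for w, x in zip(row, inputs)) + b
              for row, b in zip(weights, biases)]
    # Turn the scores into probabilities that add up to 1 (a "softmax").
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

Training would then amount to searching for the weights and biases values that make this function produce good predictions.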

During the training process, the parameters are iteratively adjusted in small increments using an algorithm called backpropagation, which is heavy on math, so I won't discuss it in this article. With each adjustment, the predictions of the neural network are expected to become a tiny bit better. After an update to the parameters, the network is evaluated again against the training dataset, and the results inform the next round of adjustments. This process continues until the function makes good next token predictions on the training dataset.
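
In the same pseudo-code style used earlier, the training loop looks roughly like this, where compare() and adjust_parameters() stand in for the math-heavy parts:

for epoch in range(num_epochs):
    for input_tokens, next_token in training_examples:
        predictions = model(input_tokens, parameters)
        loss = compare(predictions, next_token)           # how wrong was the model?
        parameters = adjust_parameters(parameters, loss)  # backpropagation step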

To help you have an idea of the scale at which neural networks work, consider that the GPT-2 model has about 1.5 billion parameters, and GPT-3 increased the parameter count to 175 billion. GPT-4 is said to have about 1.76 trillion parameters. Training neural networks at this scale with current generation hardware takes a very long time, usually weeks or months.

What is interesting is that because there are so many parameters, all calculated through a lengthy iterative process without human assistance, it is difficult to understand how a model works. A trained LLM is like a black box that is extremely difficult to debug, because most of the "thinking" of the model is hidden in the parameters. Even those who trained it have trouble explaining its inner workings.

Layers, Transformers and Attention

You may be curious to know what mysterious calculations happen inside the neural network function that can, with the help of well tuned parameters, take a list of input tokens and somehow output reasonable probabilities for the token that follows.

A neural network is configured to perform a chain of operations, each called a layer. The first layer receives the inputs, and performs some type of transformation on them. The transformed inputs enter the next layer and are transformed once again. This continues until the data reaches the final layer and is transformed one last time, generating the output, or prediction.
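
In pseudo-code, the chain of layers amounts to something like this:

def neural_network(inputs, layers):
    data = inputs
    for layer in layers:
        data = layer(data)  # each layer transforms the previous layer's output
    return data  # the output of the last layer is the prediction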

Machine learning experts come up with different types of layers that perform mathematical transformations on the input data, and they also figure out ways to organize and group layers so that they achieve a desired result. Some layers are general purpose, while others are designed to work on a specific type of input data, such as images or, as in the case of LLMs, tokenized text.

The neural network architecture that is the most popular today for text generation in large language models is called the Transformer. LLMs that use this design are said to be GPTs, or Generative Pre-Trained Transformers.

The distinctive characteristic of transformer models is a layer calculation they perform called Attention, that allows them to derive relationships and patterns between tokens that are in the context window, which are then reflected in the resulting probabilities for the next token.

The Attention mechanism was initially used in language translators, as a way to find which tokens in an input sequence are the most important to extract its meaning. This mechanism gives modern translators the ability to "understand" a sentence at a basic level, by focusing on (or driving "attention" to) the important words or tokens.

Do LLMs Have Intelligence?

By now you may be starting to form an opinion on whether LLMs show some form of intelligence in the way they generate text.

I personally do not see LLMs as having an ability to reason or come up with original thoughts, but that does not mean to say they're useless. Thanks to the clever calculations they perform on the tokens that are in the context window, LLMs are able to pick up on patterns that exist in the user prompt and match them to similar patterns learned during training. The text they generate is formed from bits and pieces of training data for the most part, but the way in which they stitch words (tokens, really) together is highly sophisticated, in many cases producing results that feel original and useful.

On the other hand, given the propensity of LLMs to hallucinate, I wouldn't trust any workflow in which the LLM produces output that goes straight to end users without verification by a human.

Will the larger LLMs that are going to appear in the following months or years achieve anything that resembles true intelligence? I feel this isn't going to happen with the GPT architecture due to its many limitations, but who knows, maybe with some future innovations we'll get there.

The End

Thank you for staying with me until the end! I hope I have piqued your interest enough for you to decide to continue learning, and eventually face all that scary math that you cannot avoid if you want to understand every detail. In that case, I can't recommend Andrej Karpathy's Neural Networks: Zero to Hero video series enough.

Become a Patron!

Hello, and thank you for visiting my blog! If you enjoyed this article, please consider supporting my work on this blog on Patreon!

4 comments
  • #1 Vasco said 3 days ago

    Nice article!

    On the conclusion about "Do LLMs Have Intelligence?": it is more about what human intelligence is. Assuming we agree that we are made out of matter, one could also say the following sentence about humans:

    "I personally do not see Humans as having an ability to reason or come up with original thoughts, but that does not mean to say they're useless. Thanks to the clever calculations they perform (on the tokens) that are in the context window, Humans are able to pick up on patterns that exist in the user prompt and match them to similar patterns learned during training. The text they generate is formed from bits and pieces of training data for the most part, but the way in which they stitch words together is highly sophisticated, in many cases producing results that feel original and useful."

  • #2 Rodri said 3 days ago

    A very interesting reading. Thank you for the hard work!

  • #3 Miguel Grinberg said 3 days ago

    @Vasco: with all due respect, what you are saying makes absolutely no sense.

  • #4 Jay said 2 days ago

    Beautiful and elegant and very useful summary! Well done!

Leave a Comment