It’s Just Adding One Word at a Time
一次只加一个词
That ChatGPT can automatically generate something that reads even superficially like human-written text is remarkable, and unexpected. But how does it do it? And why does it work? My purpose here is to give a rough outline of what’s going on inside ChatGPT—and then to explore why it is that it can do so well in producing what we might consider to be meaningful text. I should say at the outset that I’m going to focus on the big picture of what’s going on—and while I’ll mention some engineering details, I won’t get deeply into them. (And the essence of what I’ll say applies just as well to other current “large language models” [LLMs] as to ChatGPT.)
ChatGPT 能够自动生成表面上看起来像人类书写文本的东西是令人瞩目的,也是出乎意料的。但它究竟是如何做到的呢?为什么它能起作用?我的目的是在这里给出一个大致的概述,说明 ChatGPT 内部发生了什么,然后探讨为什么它能如此出色地生成我们认为有意义的文本。我应该在一开始就说明,我将专注于正在发生的大局——虽然我会提到一些工程细节,但我不会深入探讨它们。(而且,我将要说的本质也同样适用于其他当前的“大型语言模型”[LLMs],而不仅仅是 ChatGPT。)
The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a “reasonable continuation” of whatever text it’s got so far, where by “reasonable” we mean “what one might expect someone to write after seeing what people have written on billions of webpages, etc.”
要解释的第一件事是,ChatGPT 始终在尝试做的根本事情是生成一个“合理的延续”,无论它到目前为止得到了什么文本,这里的“合理”是指“在看到人们在数十亿网页等上写的内容后,人们可能会期望某人写的内容”。
So let’s say we’ve got the text “The best thing about AI is its ability to”. Imagine scanning billions of pages of human-written text (say on the web and in digitized books) and finding all instances of this text—then seeing what word comes next what fraction of the time. ChatGPT effectively does something like this, except that (as I’ll explain) it doesn’t look at literal text; it looks for things that in a certain sense “match in meaning”. But the end result is that it produces a ranked list of words that might follow, together with “probabilities”:
所以,假设我们有文本“AI 最好的事情是它的能力”。想象一下扫描数十亿页人类撰写的文本(比如在网络上和数字化书籍中),并找到所有这个文本的实例——然后看看下一个词是什么,以及它在多大程度上出现。ChatGPT 实际上做了类似的事情,除了(正如我将解释的)它不看字面文本;它寻找在某种意义上“意义匹配”的东西。但最终结果是,它生成一个可能跟随的单词排名列表,以及“概率”:
And the remarkable thing is that when ChatGPT does something like write an essay what it’s essentially doing is just asking over and over again “given the text so far, what should the next word be?”—and each time adding a word. (More precisely, as I’ll explain, it’s adding a “token”, which could be just a part of a word, which is why it can sometimes “make up new words”.)
令人惊奇的是,当 ChatGPT 做诸如写文章这样的事情时,它本质上只是在不断地问“到目前为止,下一个词应该是什么?”——每次都加一个词。(更确切地说,正如我将解释的,它是在添加一个“标记”,这可能只是一个词的一部分,这就是为什么它有时会“创造新词”。)
But, OK, at each step it gets a list of words with probabilities. But which one should it actually pick to add to the essay (or whatever) that it’s writing? One might think it should be the “highest-ranked” word (i.e. the one to which the highest “probability” was assigned). But this is where a bit of voodoo begins to creep in. Because for some reason—that maybe one day we’ll have a scientific-style understanding of—if we always pick the highest-ranked word, we’ll typically get a very “flat” essay, that never seems to “show any creativity” (and even sometimes repeats word for word). But if sometimes (at random) we pick lower-ranked words, we get a “more interesting” essay.
但是,好吧,在每一步它都会得到一个带有概率的单词列表。但它应该选择哪一个来添加到它正在写的文章(或任何内容)中呢?人们可能会认为它应该是“排名最高”的单词(即被分配了最高“概率”的那个)。但这就是一点巫术开始悄悄渗入的地方。因为出于某种原因——也许有一天我们会有科学式的理解——如果我们总是选择排名最高的单词,我们通常会得到一篇非常“平淡”的文章,似乎从未“显示出任何创造力”(甚至有时会逐字重复)。但如果有时(随机地)我们选择排名较低的单词,我们会得到一篇“更有趣”的文章。
The fact that there’s randomness here means that if we use the same prompt multiple times, we’re likely to get different essays each time. And, in keeping with the idea of voodoo, there’s a particular so-called “temperature” parameter that determines how often lower-ranked words will be used, and for essay generation, it turns out that a “temperature” of 0.8 seems best. (It’s worth emphasizing that there’s no “theory” being used here; it’s just a matter of what’s been found to work in practice. And for example the concept of “temperature” is there because exponential distributions familiar from statistical physics happen to be being used, but there’s no “physical” connection—at least so far as we know.)
这里存在随机性的事实意味着,如果我们多次使用相同的提示,我们可能会每次得到不同的文章。而且,按照巫毒的概念,有一个所谓的“温度”参数决定了低排名词的使用频率,对于文章生成来说,发现 0.8 的“温度”似乎是最好的。(值得强调的是,这里没有使用“理论”;这只是实践中发现有效的方法。例如,“温度”的概念存在是因为使用了统计物理学中熟悉的指数分布,但至少就我们所知,没有“物理”联系。)
Before we go on I should explain that for purposes of exposition I’m mostly not going to use the full system that’s in ChatGPT; instead I’ll usually work with a simpler GPT-2 system, which has the nice feature that it’s small enough to be able to run on a standard desktop computer. And so for essentially everything I show I’ll be able to include explicit Wolfram Language code that you can immediately run on your computer. (Click any picture here to copy the code behind it.)
在我们继续之前,我应该解释一下,为了说明的目的,我主要不会使用 ChatGPT 中的完整系统;相反,我通常会使用一个更简单的 GPT-2 系统,它有一个很好的特性,即它小到可以在标准台式计算机上运行。因此,对于我展示的几乎所有内容,我都能够包括你可以立即在你的计算机上运行的明确的 Wolfram 语言代码。(点击这里的任何图片复制其背后的代码。)
For example, here’s how to get the table of probabilities above. First, we have to retrieve the underlying “language model” neural net:
例如,以下是如何获取上述概率表的方法。首先,我们需要检索底层“语言模型”神经网络:
Later on, we’ll look inside this neural net, and talk about how it works. But for now we can just apply this “net model” as a black box to our text so far, and ask for the top 5 words by probability that the model says should follow:
后来,我们将深入研究这个神经网络,并讨论它的工作原理。但现在,我们可以将这个“网络模型”作为一个黑箱应用到我们迄今为止的文本中,并要求模型预测接下来概率最高的 5 个单词:
This takes that result and makes it into an explicit formatted “dataset”:
这会将该结果转换为显式的格式化“数据集”:
Here’s what happens if one repeatedly “applies the model”—at each step adding the word that has the top probability (specified in this code as the “decision” from the model):
如果一个人反复“应用模型”——在每一步添加具有最高概率的单词(在此代码中指定为模型的“决策”),会发生以下情况:
What happens if one goes on longer? In this (“zero temperature”) case what comes out soon gets rather confused and repetitive:
如果继续下去会发生什么?在这种(“零温度”)情况下,很快就会出现相当混乱和重复的情况:
But what if instead of always picking the “top” word one sometimes randomly picks “non-top” words (with the “randomness” corresponding to “temperature” 0.8)? Again one can build up text:
但如果在选择“顶部”单词时,有时随机选择“非顶部”单词(“随机性”对应于“温度”0.8),又会怎样呢?同样可以构建文本:
And every time one does this, different random choices will be made, and the text will be different—as in these 5 examples:
每次这样做时,都会做出不同的随机选择,文本也会不同——就像这 5 个例子一样:
It’s worth pointing out that even at the first step there are a lot of possible “next words” to choose from (at temperature 0.8), though their probabilities fall off quite quickly (and, yes, the straight line on this log-log plot corresponds to an n–1 “power-law” decay that’s very characteristic of the general statistics of language):
值得指出的是,即使在第一步,也有很多可能的“下一个词”可供选择(在温度 0.8 的情况下),尽管它们的概率下降得非常快(是的,这个对数-对数图上的直线对应于 n–1 的“幂律”衰减,这是语言总体统计的一个非常典型的特征):
So what happens if one goes on longer? Here’s a random example. It’s better than the top-word (zero temperature) case, but still at best a bit weird:
那么如果时间更长会发生什么?这里有一个随机例子。它比最高词(零温度)情况要好,但仍然最多有点奇怪:
This was done with the simplest GPT-2 model (from 2019). With the newer and bigger GPT-3 models the results are better. Here’s the top-word (zero temperature) text produced with the same “prompt”, but with the biggest GPT-3 model:
这是使用最简单的 GPT-2 模型(来自 2019 年)完成的。使用更新、更大的 GPT-3 模型,结果会更好。以下是使用相同的“提示”,但使用最大的 GPT-3 模型生成的顶级词(零温度)文本:
And here’s a random example at “temperature 0.8”:
这是一个在“温度 0.8”下的随机例子:
Where Do the Probabilities Come From?
概率从何而来?
OK, so ChatGPT always picks its next word based on probabilities. But where do those probabilities come from? Let’s start with a simpler problem. Let’s consider generating English text one letter (rather than word) at a time. How can we work out what the probability for each letter should be?
好的,所以 ChatGPT 总是根据概率选择下一个单词。但这些概率从哪里来呢?让我们从一个更简单的问题开始。让我们考虑一次生成一个字母(而不是单词)的英文文本。我们如何计算每个字母的概率呢?
A very minimal thing we could do is just take a sample of English text, and calculate how often different letters occur in it. So, for example, this counts letters in the Wikipedia article on “cats”:
我们可以做的一件非常简单的事情是,只需取一段英文文本的样本,并计算其中不同字母出现的频率。例如,这计算了维基百科上关于“猫”的文章中字母的出现频率:
And this does the same thing for “dogs”:
而对于“狗”,它也做同样的事情:
The results are similar, but not the same (“o” is no doubt more common in the “dogs” article because, after all, it occurs in the word “dog” itself). Still, if we take a large enough sample of English text we can expect to eventually get at least fairly consistent results:
结果相似,但并不相同(“o”在“dogs”文章中无疑更常见,毕竟它出现在单词“dog”本身中)。尽管如此,如果我们取足够大的英语文本样本,我们可以期望最终至少得到相当一致的结果:
Here’s a sample of what we get if we just generate a sequence of letters with these probabilities:
如果我们只是根据这些概率生成一系列字母,我们会得到这样的样本:
We can break this into “words” by adding in spaces as if they were letters with a certain probability:
我们可以通过以一定概率添加空格,将其分解为“单词”,就像它们是字母一样:
We can do a slightly better job of making “words” by forcing the distribution of “word lengths” to agree with what it is in English:
我们可以通过强制“单词长度”的分布与英语中的分布一致,从而更好地制作“单词”
We didn’t happen to get any “actual words” here, but the results are looking slightly better. To go further, though, we need to do more than just pick each letter separately at random. And, for example, we know that if we have a “q”, the next letter basically has to be “u”.
我们在这里没有碰巧得到任何“实际的单词”,但结果看起来稍微好一些。不过,要进一步发展,我们需要做的不仅仅是随机挑选每个字母。例如,我们知道如果我们有一个“q”,下一个字母基本上必须是“u”。
Here’s a plot of the probabilities for letters on their own:
这是单个字母概率的图表:
And here’s a plot that shows the probabilities of pairs of letters (“2-grams”) in typical English text. The possible first letters are shown across the page, the second letters down the page:
这是显示典型英文文本中字母对(“2-grams”)概率的图表。可能的第一个字母显示在页面横轴上,第二个字母显示在页面纵轴上:
And we see here, for example, that the “q” column is blank (zero probability) except on the “u” row. OK, so now instead of generating our “words” a single letter at a time, let’s generate them looking at two letters at a time, using these “2-gram” probabilities. Here’s a sample of the result—which happens to include a few “actual words”:
我们在这里看到,例如,“q”列是空白的(零概率),除了在“u”行。好的,所以现在我们不是一次生成一个字母来生成“单词”,而是使用这些“2-gram”概率一次生成两个字母。以下是结果的一个样本——其中碰巧包含了一些“实际单词”:
With sufficiently much English text we can get pretty good estimates not just for probabilities of single letters or pairs of letters (2-grams), but also for longer runs of letters. And if we generate “random words” with progressively longer n-gram probabilities, we see that they get progressively “more realistic”:
有了足够多的英文文本,我们不仅可以很好地估计单个字母或两个字母(2-grams)的概率,还可以估计更长字母序列的概率。如果我们用逐渐增加的 n-gram 概率生成“随机单词”,我们会看到它们逐渐变得更“真实”:
But let’s now assume—more or less as ChatGPT does—that we’re dealing with whole words, not letters. There are about 40,000 reasonably commonly used words in English. And by looking at a large corpus of English text (say a few million books, with altogether a few hundred billion words), we can get an estimate of how common each word is. And using this we can start generating “sentences”, in which each word is independently picked at random, with the same probability that it appears in the corpus. Here’s a sample of what we get:
但让我们现在假设——或多或少像 ChatGPT 那样——我们处理的是整个单词,而不是字母。英语中大约有 40,000 个合理常用的单词。通过查看大量的英语文本语料库(比如几百万本书,总共几百亿个单词),我们可以估计每个单词的常见程度。利用这些信息,我们可以开始生成“句子”,其中每个单词都是独立随机选择的,概率与它在语料库中出现的概率相同。以下是我们得到的一个样本:
Not surprisingly, this is nonsense. So how can we do better? Just like with letters, we can start taking into account not just probabilities for single words but probabilities for pairs or longer n-grams of words. Doing this for pairs, here are 5 examples of what we get, in all cases starting from the word “cat”:
毫不奇怪,这是胡说八道。那么我们怎样才能做得更好呢?就像处理字母一样,我们可以开始考虑的不仅仅是单个单词的概率,而是成对或更长的单词 n-gram 的概率。对于成对的情况,以下是从单词“cat”开始的 5 个例子:
It’s getting slightly more “sensible looking”. And we might imagine that if we were able to use sufficiently long n-grams we’d basically “get a ChatGPT”—in the sense that we’d get something that would generate essay-length sequences of words with the “correct overall essay probabilities”. But here’s the problem: there just isn’t even close to enough English text that’s ever been written to be able to deduce those probabilities.
它看起来稍微“合理”了一些。我们可能会想象,如果我们能够使用足够长的 n-grams,我们基本上就会“得到一个 ChatGPT”——从某种意义上说,我们会得到一个能够生成具有“正确整体作文概率”的长篇文字序列的东西。但问题是:根本没有足够多的曾经写过的英文文本能够推导出这些概率。
In a crawl of the web there might be a few hundred billion words; in books that have been digitized there might be another hundred billion words. But with 40,000 common words, even the number of possible 2-grams is already 1.6 billion—and the number of possible 3-grams is 60 trillion. So there’s no way we can estimate the probabilities even for all of these from text that’s out there. And by the time we get to “essay fragments” of 20 words, the number of possibilities is larger than the number of particles in the universe, so in a sense they could never all be written down.
在网络爬取中可能会有几千亿个单词;在已经数字化的书籍中可能会有另外几千亿个单词。但使用 40,000 个常用单词,即使是可能的 2-grams 数量也已经达到 16 亿——而可能的 3-grams 数量达到 60 万亿。因此,我们无法从现有的文本中估计这些的概率。当我们到达 20 个单词的“文章片段”时,可能性的数量比宇宙中的粒子数量还要大,所以在某种意义上它们永远不可能全部被写下来。
So what can we do? The big idea is to make a model that lets us estimate the probabilities with which sequences should occur—even though we’ve never explicitly seen those sequences in the corpus of text we’ve looked at. And at the core of ChatGPT is precisely a so-called “large language model” (LLM) that’s been built to do a good job of estimating those probabilities.
那么我们能做什么呢?一个大的想法是建立一个模型,让我们能够估计序列应该出现的概率——即使在我们看过的文本语料库中从未明确见过这些序列。而 ChatGPT 的核心正是一个所谓的“大型语言模型”(LLM),它被构建得能够很好地估计这些概率。
What Is a Model?
什么是模型?
Say you want to know (as Galileo did back in the late 1500s) how long it’s going to take a cannon ball dropped from each floor of the Tower of Pisa to hit the ground. Well, you could just measure it in each case and make a table of the results. Or you could do what is the essence of theoretical science: make a model that gives some kind of procedure for computing the answer rather than just measuring and remembering each case.
假设你想知道(就像伽利略在 16 世纪末所做的那样)从比萨塔的每一层丢下一个炮弹需要多长时间才能落地。好吧,你可以每次都测量一下,并记录结果。或者你可以做理论科学的本质:建立一个模型,给出某种计算答案的程序,而不是仅仅测量和记忆每个案例。
Let’s imagine we have (somewhat idealized) data for how long the cannon ball takes to fall from various floors:
让我们想象一下,我们有一些(略微理想化的)数据,关于炮弹从不同楼层坠落所需的时间:
How do we figure out how long it’s going to take to fall from a floor we don’t explicitly have data about? In this particular case, we can use known laws of physics to work it out. But say all we’ve got is the data, and we don’t know what underlying laws govern it. Then we might make a mathematical guess, like that perhaps we should use a straight line as a model:
我们如何确定从一个没有明确数据的楼层坠落需要多长时间?在这个特定的情况下,我们可以使用已知的物理定律来计算。但假设我们只有数据,而不知道支配它的基本定律。那么我们可能会做一个数学猜测,比如也许我们应该使用一条直线作为模型:
We could pick different straight lines. But this is the one that’s on average closest to the data we’re given. And from this straight line we can estimate the time to fall for any floor.
我们可以选择不同的直线。但这是平均最接近我们给定数据的直线。从这条直线我们可以估计任何楼层的坠落时间。
How did we know to try using a straight line here? At some level we didn’t. It’s just something that’s mathematically simple, and we’re used to the fact that lots of data we measure turns out to be well fit by mathematically simple things. We could try something mathematically more complicated—say a + b x + c x2—and then in this case we do better:
我们怎么知道在这里尝试使用一条直线呢?在某种程度上我们并不知道。这只是数学上简单的东西,而且我们习惯于我们测量的许多数据最终都能被数学上简单的东西很好地拟合。我们可以尝试一些数学上更复杂的东西——比如 a + b x + c x2——然后在这种情况下我们做得更好:
Things can go quite wrong, though. Like here’s the best we can do with a + b/x + c sin(x):
事情可能会变得非常糟糕。就像我们能用 a + b/x + c sin(x) 做到的最好的那样:
It is worth understanding that there’s never a “model-less model”. Any model you use has some particular underlying structure—then a certain set of “knobs you can turn” (i.e. parameters you can set) to fit your data. And in the case of ChatGPT, lots of such “knobs” are used—actually, 175 billion of them.
值得了解的是,从来没有“无模型的模型”。你使用的任何模型都有某种特定的底层结构——然后有一组你可以调整的“旋钮”(即你可以设置的参数)来适应你的数据。而在 ChatGPT 的情况下,使用了大量的这种“旋钮”——实际上,有 1750 亿个。
But the remarkable thing is that the underlying structure of ChatGPT—with “just” that many parameters—is sufficient to make a model that computes next-word probabilities “well enough” to give us reasonable essay-length pieces of text.
但值得注意的是,ChatGPT 的底层结构——仅使用如此多的参数——足以构建一个计算下一个单词概率“足够好”的模型,从而为我们提供合理长度的文章文本。
Models for Human-Like Tasks
人类任务模型
The example we gave above involves making a model for numerical data that essentially comes from simple physics—where we’ve known for several centuries that “simple mathematics applies”. But for ChatGPT we have to make a model of human-language text of the kind produced by a human brain. And for something like that we don’t (at least yet) have anything like “simple mathematics”. So what might a model of it be like?
我们上面给出的例子涉及为本质上来自简单物理学的数值数据制作模型——我们在几个世纪前就知道“简单的数学适用”。但对于 ChatGPT,我们必须为人类大脑产生的人类语言文本制作模型。对于这样的东西,我们(至少目前)还没有类似“简单数学”的东西。那么,它的模型可能是什么样的呢?
Before we talk about language, let’s talk about another human-like task: recognizing images. And as a simple example of this, let’s consider images of digits (and, yes, this is a classic machine learning example):
在我们谈论语言之前,让我们先谈谈另一个类似人类的任务:识别图像。作为一个简单的例子,让我们考虑数字图像(是的,这是一个经典的机器学习例子):
One thing we could do is get a bunch of sample images for each digit:
我们可以做的一件事是为每个数字获取一堆样本图像:
Then to find out if an image we’re given as input corresponds to a particular digit we could just do an explicit pixel-by-pixel comparison with the samples we have. But as humans we certainly seem to do something better—because we can still recognize digits, even when they’re for example handwritten, and have all sorts of modifications and distortions:
然后,为了确定我们作为输入给出的图像是否对应于某个特定数字,我们可以简单地与我们拥有的样本进行逐像素比较。但作为人类,我们显然做得更好——因为我们仍然可以识别数字,即使它们是手写的,并且具有各种修改和扭曲:
When we made a model for our numerical data above, we were able to take a numerical value x that we were given, and just compute a + b x for particular a and b. So if we treat the gray-level value of each pixel here as some variable xi is there some function of all those variables that—when evaluated—tells us what digit the image is of? It turns out that it’s possible to construct such a function. Not surprisingly, it’s not particularly simple, though. And a typical example might involve perhaps half a million mathematical operations.
当我们为上面的数值数据建立模型时,我们能够取一个给定的数值 x,并计算 a + b x,其中 a 和 b 是特定的值。那么,如果我们将每个像素的灰度值视为某个变量 x,是否存在一个函数,当评估该函数时,可以告诉我们图像是什么数字?事实证明,可以构建这样一个函数。不出所料,它并不特别简单。一个典型的例子可能涉及大约五十万次数学运算。
But the end result is that if we feed the collection of pixel values for an image into this function, out will come the number specifying which digit we have an image of. Later, we’ll talk about how such a function can be constructed, and the idea of neural nets. But for now let’s treat the function as black box, where we feed in images of, say, handwritten digits (as arrays of pixel values) and we get out the numbers these correspond to:
但最终结果是,如果我们把图像的像素值集合输入到这个函数中,出来的将是指定我们拥有哪个数字图像的数字。稍后,我们将讨论如何构建这样一个函数,以及神经网络的概念。但现在让我们把这个函数当作一个黑箱,我们输入手写数字的图像(作为像素值数组),然后我们得到这些图像对应的数字:
But what’s really going on here? Let’s say we progressively blur a digit. For a little while our function still “recognizes” it, here as a “2”. But soon it “loses it”, and starts giving the “wrong” result:
但这里到底发生了什么?假设我们逐渐模糊一个数字。有一段时间,我们的函数仍然“识别”它,这里识别为“2”。但很快它“失去它”,并开始给出“错误”的结果:
But why do we say it’s the “wrong” result? In this case, we know we got all the images by blurring a “2”. But if our goal is to produce a model of what humans can do in recognizing images, the real question to ask is what a human would have done if presented with one of those blurred images, without knowing where it came from.
但为什么我们说这是“错误”的结果呢?在这种情况下,我们知道我们通过模糊一个“2”得到了所有这些图像。但如果我们的目标是生成一个模型,模拟人类在识别图像方面的能力,真正要问的问题是,如果一个人在没有知道图像来源的情况下,面对这些模糊图像时会怎么做。
And we have a “good model” if the results we get from our function typically agree with what a human would say. And the nontrivial scientific fact is that for an image-recognition task like this we now basically know how to construct functions that do this.
如果我们从函数中得到的结果通常与人类所说的相符,那么我们就有一个“好的模型”。而一个非平凡的科学事实是,对于这样的图像识别任务,我们现在基本上知道如何构建能够完成这项工作的函数。
Can we “mathematically prove” that they work? Well, no. Because to do that we’d have to have a mathematical theory of what we humans are doing. Take the “2” image and change a few pixels. We might imagine that with only a few pixels “out of place” we should still consider the image a “2”. But how far should that go? It’s a question of human visual perception. And, yes, the answer would no doubt be different for bees or octopuses—and potentially utterly different for putative aliens.
我们能“从数学上证明”它们有效吗?嗯,不能。因为要做到这一点,我们需要有一个关于我们人类在做什么的数学理论。以“2”图像为例,改变几个像素。我们可能会认为,只有几个像素“位置不对”,我们仍然应该认为这个图像是“2”。但这种变化应该到什么程度呢?这是一个关于人类视觉感知的问题。而且,是的,对于蜜蜂或章鱼来说,答案无疑会有所不同——对于假设的外星人来说,可能完全不同。
Neural Nets 神经网络
OK, so how do our typical models for tasks like image recognition actually work? The most popular—and successful—current approach uses neural nets. Invented—in a form remarkably close to their use today—in the 1940s, neural nets can be thought of as simple idealizations of how brains seem to work.
好的,那么我们用于图像识别等任务的典型模型实际上是如何工作的呢?目前最流行且成功的方法是使用神经网络。神经网络在 1940 年代以一种与今天使用方式非常接近的形式被发明,可以被认为是大脑工作方式的简单理想化。
In human brains there are about 100 billion neurons (nerve cells), each capable of producing an electrical pulse up to perhaps a thousand times a second. The neurons are connected in a complicated net, with each neuron having tree-like branches allowing it to pass electrical signals to perhaps thousands of other neurons. And in a rough approximation, whether any given neuron produces an electrical pulse at a given moment depends on what pulses it’s received from other neurons—with different connections contributing with different “weights”.
在人类大脑中有大约 1000 亿个神经元(神经细胞),每个神经元能够产生每秒最多可能达一千次的电脉冲。神经元在一个复杂的网络中相互连接,每个神经元都有树状的分支,使其能够将电信号传递给可能成千上万的其他神经元。粗略地近似,任何给定神经元在给定时刻是否产生电脉冲取决于它从其他神经元接收到的脉冲——不同的连接以不同的“权重”贡献。
When we “see an image” what’s happening is that when photons of light from the image fall on (“photoreceptor”) cells at the back of our eyes they produce electrical signals in nerve cells. These nerve cells are connected to other nerve cells, and eventually the signals go through a whole sequence of layers of neurons. And it’s in this process that we “recognize” the image, eventually “forming the thought” that we’re “seeing a 2” (and maybe in the end doing something like saying the word “two” out loud).
当我们“看到一个图像”时,实际上是来自图像的光子落在眼睛后部的(“光感受器”)细胞上,它们在神经细胞中产生电信号。这些神经细胞与其他神经细胞相连,最终信号通过一系列神经元层。正是在这个过程中,我们“识别”图像,最终“形成我们正在“看到一个 2”的想法”(也许最后会做诸如大声说出“二”这个词的事情)。
The “black-box” function from the previous section is a “mathematicized” version of such a neural net. It happens to have 11 layers (though only 4 “core layers”):
上一节中的“黑箱”功能是这种神经网络的“数学化”版本。它恰好有 11 层(尽管只有 4 个“核心层”):
There’s nothing particularly “theoretically derived” about this neural net; it’s just something that—back in 1998—was constructed as a piece of engineering, and found to work. (Of course, that’s not much different from how we might describe our brains as having been produced through the process of biological evolution.)
这种神经网络并没有什么特别“理论推导”的地方;它只是在 1998 年作为一项工程构建的,并且发现它是有效的。(当然,这与我们可能描述我们的大脑是通过生物进化过程产生的方式并没有太大不同。)
OK, but how does a neural net like this “recognize things”? The key is the notion of attractors. Imagine we’ve got handwritten images of 1’s and 2’s:
好的,但像这样的神经网络是如何“识别事物”的呢?关键在于吸引子的概念。想象一下,我们有手写的 1 和 2 的图像:
We somehow want all the 1’s to “be attracted to one place”, and all the 2’s to “be attracted to another place”. Or, put a different way, if an image is somehow “closer to being a 1” than to being a 2, we want it to end up in the “1 place” and vice versa.
我们希望所有的 1 都“被吸引到一个地方”,所有的 2 都“被吸引到另一个地方”。或者,换句话说,如果一个图像在某种程度上“更接近于 1”而不是 2,我们希望它最终在“1 的地方”,反之亦然。
As a straightforward analogy, let’s say we have certain positions in the plane, indicated by dots (in a real-life setting they might be positions of coffee shops). Then we might imagine that starting from any point on the plane we’d always want to end up at the closest dot (i.e. we’d always go to the closest coffee shop). We can represent this by dividing the plane into regions (“attractor basins”) separated by idealized “watersheds”:
作为一个简单的类比,假设我们在平面上有一些位置,用点表示(在现实生活中,它们可能是咖啡店的位置)。然后我们可以想象,从平面上的任何一点出发,我们总是希望到达最近的点(即我们总是去最近的咖啡店)。我们可以通过将平面划分为由理想化的“分水岭”分隔的区域(“吸引盆地”)来表示这一点:
We can think of this as implementing a kind of “recognition task” in which we’re not doing something like identifying what digit a given image “looks most like”—but rather we’re just, quite directly, seeing what dot a given point is closest to. (The “Voronoi diagram” setup we’re showing here separates points in 2D Euclidean space; the digit recognition task can be thought of as doing something very similar—but in a 784-dimensional space formed from the gray levels of all the pixels in each image.)
我们可以将这看作是一种“识别任务”的实现,在这种任务中,我们不是在做类似于识别给定图像“看起来最像”哪个数字的事情——而是相当直接地,看看给定的点最接近哪个点。(我们在这里展示的“Voronoi 图”设置将二维欧几里得空间中的点分开;数字识别任务可以被认为是在做非常相似的事情——但在由每个图像中所有像素的灰度级别形成的 784 维空间中。)
So how do we make a neural net “do a recognition task”? Let’s consider this very simple case:
那么我们如何让神经网络“执行识别任务”呢?让我们考虑这个非常简单的情况:
Our goal is to take an “input” corresponding to a position {x,y}—and then to “recognize” it as whichever of the three points it’s closest to. Or, in other words, we want the neural net to compute a function of {x,y} like:
我们的目标是获取一个对应于位置{x,y}的“输入”——然后将其“识别”为与其最接近的三个点中的一个。或者换句话说,我们希望神经网络计算一个{x,y}的函数,如下所示:
So how do we do this with a neural net? Ultimately a neural net is a connected collection of idealized “neurons”—usually arranged in layers—with a simple example being:
那么我们如何用神经网络来实现这一点呢?最终,神经网络是由理想化的“神经元”连接而成的集合——通常排列成层——一个简单的例子是:
Each “neuron” is effectively set up to evaluate a simple numerical function. And to “use” the network, we simply feed numbers (like our coordinates x and y) in at the top, then have neurons on each layer “evaluate their functions” and feed the results forward through the network—eventually producing the final result at the bottom:
每个“神经元”都被有效地设置为评估一个简单的数值函数。要“使用”这个网络,我们只需在顶部输入数字(如我们的坐标 x 和 y),然后让每一层的神经元“评估它们的函数”并将结果通过网络向前传递——最终在底部产生最终结果:
In the traditional (biologically inspired) setup each neuron effectively has a certain set of “incoming connections” from the neurons on the previous layer, with each connection being assigned a certain “weight” (which can be a positive or negative number). The value of a given neuron is determined by multiplying the values of “previous neurons” by their corresponding weights, then adding these up and adding a constant—and finally applying a “thresholding” (or “activation”) function. In mathematical terms, if a neuron has inputs
在传统的(生物启发的)设置中,每个神经元实际上都有来自前一层神经元的一组“传入连接”,每个连接都被分配了一个特定的“权重”(可以是正数或负数)。给定神经元的值是通过将“前一层神经元”的值乘以其对应的权重,然后将这些值相加并加上一个常数——最后应用一个“阈值化”(或“激活”)函数来确定的。在数学术语中,如果一个神经元有输入 x = {x1, x2 …},那么我们计算 f[w . x + b],其中权重 w 和常数 b 通常在网络中的每个神经元选择不同;函数 f 通常是相同的。
Computing w . x + b is just a matter of matrix multiplication and addition. The “activation function” f introduces nonlinearity (and ultimately is what leads to nontrivial behavior). Various activation functions commonly get used; here we’ll just use Ramp (or ReLU):
计算 w . x + b 只是矩阵乘法和加法的问题。“激活函数” f 引入了非线性(最终导致非平凡行为)。常用的激活函数有很多种;这里我们只使用 Ramp (或 ReLU):
For each task we want the neural net to perform (or, equivalently, for each overall function we want it to evaluate) we’ll have different choices of weights. (And—as we’ll discuss later—these weights are normally determined by “training” the neural net using machine learning from examples of the outputs we want.)
对于我们希望神经网络执行的每个任务(或者,等效地,对于我们希望它评估的每个整体功能),我们将有不同的权重选择。(并且——正如我们稍后将讨论的——这些权重通常通过使用机器学习从我们想要的输出的示例中“训练”神经网络来确定。)
Ultimately, every neural net just corresponds to some overall mathematical function—though it may be messy to write out. For the example above, it would be:
最终,每个神经网络都对应于某种总体数学函数——尽管它可能很复杂难以写出来。对于上面的例子,它将是:
The neural net of ChatGPT also just corresponds to a mathematical function like this—but effectively with billions of terms.
ChatGPT 的神经网络也仅仅对应于这样一个数学函数——但实际上有数十亿个项。
But let’s go back to individual neurons. Here are some examples of the functions a neuron with two inputs (representing coordinates x and y) can compute with various choices of weights and constants (and Ramp as activation function):
但让我们回到单个神经元。以下是一些具有两个输入(表示坐标 x 和 y)的神经元可以通过不同选择的权重和常数(以及 Ramp 作为激活函数)计算的函数示例:
But what about the larger network from above? Well, here’s what it computes:
但上面更大的网络呢?好吧,这就是它计算的内容:
It’s not quite “right”, but it’s close to the “nearest point” function we showed above.
它不完全“正确”,但它接近我们上面展示的“最近点”功能。
Let’s see what happens with some other neural nets. In each case, as we’ll explain later, we’re using machine learning to find the best choice of weights. Then we’re showing here what the neural net with those weights computes:
让我们看看其他神经网络会发生什么。在每种情况下,我们稍后会解释,我们使用机器学习来找到最佳的权重选择。然后我们在这里展示带有这些权重的神经网络计算的内容:
Bigger networks generally do better at approximating the function we’re aiming for. And in the “middle of each attractor basin” we typically get exactly the answer we want. But at the boundaries—where the neural net “has a hard time making up its mind”—things can be messier.
较大的网络通常在逼近我们目标函数方面表现得更好。在每个吸引子盆地的“中间”,我们通常会得到我们想要的准确答案。但在边界——神经网络“难以做出决定”的地方——事情可能会更混乱。
With this simple mathematical-style “recognition task” it’s clear what the “right answer” is. But in the problem of recognizing handwritten digits, it’s not so clear. What if someone wrote a “2” so badly it looked like a “7”, etc.? Still, we can ask how a neural net distinguishes digits—and this gives an indication:
在这个简单的数学风格的“识别任务”中,很明显什么是“正确答案”。但在识别手写数字的问题中,就不那么清楚了。如果有人写了一个“2”写得非常糟糕,看起来像“7”,等等?尽管如此,我们可以问神经网络如何区分数字——这给出了一个指示:
Can we say “mathematically” how the network makes its distinctions? Not really. It’s just “doing what the neural net does”. But it turns out that that normally seems to agree fairly well with the distinctions we humans make.
我们能从数学上解释网络是如何进行区分的吗?并不真的可以。它只是在“做神经网络所做的事情”。但事实证明,这通常似乎与我们人类所做的区分相当一致。
Let’s take a more elaborate example. Let’s say we have images of cats and dogs. And we have a neural net that’s been trained to distinguish them. Here’s what it might do on some examples:
让我们举一个更详细的例子。假设我们有一些猫和狗的图片。我们有一个经过训练的神经网络,可以区分它们。以下是它在一些例子中可能会做的事情:
Now it’s even less clear what the “right answer” is. What about a dog dressed in a cat suit? Etc. Whatever input it’s given the neural net will generate an answer, and in a way reasonably consistent with how humans might. As I’ve said above, that’s not a fact we can “derive from first principles”. It’s just something that’s empirically been found to be true, at least in certain domains. But it’s a key reason why neural nets are useful: that they somehow capture a “human-like” way of doing things.
现在更不清楚什么是“正确答案”了。那只穿着猫装的狗呢?等等。无论给它什么输入,神经网络都会生成一个答案,并且在某种程度上与人类可能的方式相当一致。正如我上面所说,这不是我们可以“从第一原理推导”的事实。这只是经验上被发现为真实的东西,至少在某些领域是这样。但这是神经网络有用的关键原因:它们以某种方式捕捉到了一种“类人”的做事方式。
Show yourself a picture of a cat, and ask “Why is that a cat?”. Maybe you’d start saying “Well, I see its pointy ears, etc.” But it’s not very easy to explain how you recognized the image as a cat. It’s just that somehow your brain figured that out. But for a brain there’s no way (at least yet) to “go inside” and see how it figured it out. What about for an (artificial) neural net? Well, it’s straightforward to see what each “neuron” does when you show a picture of a cat. But even to get a basic visualization is usually very difficult.
展示一张猫的图片,并问“为什么那是猫?”。也许你会开始说“嗯,我看到它的尖耳朵等等。”但解释你如何识别这张图片为猫并不容易。只是你的大脑不知怎么就弄明白了。但对于大脑来说,目前还没有办法“进入”并看到它是如何弄明白的。那么对于(人工)神经网络呢?嗯,当你展示一张猫的图片时,很容易看到每个“神经元”在做什么。但要得到一个基本的可视化通常非常困难。
In the final net that we used for the “nearest point” problem above there are 17 neurons. In the net for recognizing handwritten digits there are 2190. And in the net we’re using to recognize cats and dogs there are 60,650. Normally it would be pretty difficult to visualize what amounts to 60,650-dimensional space. But because this is a network set up to deal with images, many of its layers of neurons are organized into arrays, like the arrays of pixels it’s looking at.
在我们用于上述“最近点”问题的最终网络中有 17 个神经元。在用于识别手写数字的网络中有 2190 个。而在我们用于识别猫和狗的网络中有 60,650 个。通常情况下,要想象相当于 60,650 维的空间是非常困难的。但由于这是一个为处理图像而设置的网络,其许多神经元层被组织成数组,就像它正在观察的像素数组一样。
And if we take a typical cat image
如果我们拿一张典型的猫图片
then we can represent the states of neurons at the first layer by a collection of derived images—many of which we can readily interpret as being things like “the cat without its background”, or “the outline of the cat”:
然后我们可以用一组衍生图像来表示第一层神经元的状态——其中许多我们可以很容易地解释为“没有背景的猫”或“猫的轮廓”:
By the 10th layer it’s harder to interpret what’s going on:
到了第 10 层,解释发生了什么变得更难了:
But in general we might say that the neural net is “picking out certain features” (maybe pointy ears are among them), and using these to determine what the image is of. But are those features ones for which we have names—like “pointy ears”? Mostly not.
但一般来说,我们可以说神经网络是在“挑选出某些特征”(可能其中之一是尖耳朵),并利用这些特征来确定图像是什么。但这些特征是我们有名字的吗——比如“尖耳朵”?大多数不是。
Are our brains using similar features? Mostly we don’t know. But it’s notable that the first few layers of a neural net like the one we’re showing here seem to pick out aspects of images (like edges of objects) that seem to be similar to ones we know are picked out by the first level of visual processing in brains.
我们的脑是否使用了相似的特征?大多数情况下我们并不知道。但值得注意的是,像我们这里展示的这种神经网络的前几层似乎挑选出了图像的一些方面(如物体的边缘),这些方面似乎与我们已知的大脑视觉处理的第一层所挑选出的相似。
But let’s say we want a “theory of cat recognition” in neural nets. We can say: “Look, this particular net does it”—and immediately that gives us some sense of “how hard a problem” it is (and, for example, how many neurons or layers might be needed). But at least as of now we don’t have a way to “give a narrative description” of what the network is doing. And maybe that’s because it truly is computationally irreducible, and there’s no general way to find what it does except by explicitly tracing each step. Or maybe it’s just that we haven’t “figured out the science”, and identified the “natural laws” that allow us to summarize what’s going on.
但假设我们想要一个神经网络中的“猫识别理论”。我们可以说:“看,这个特定的网络做到了”——这立即给了我们一些关于“问题有多难”的感觉(例如,可能需要多少神经元或层)。但至少到目前为止,我们没有办法“给出叙述性描述”说明网络在做什么。也许这是因为它在计算上是不可简化的,除了通过明确追踪每一步,没有一般的方法来找到它在做什么。或者也许只是因为我们还没有“弄清楚科学”,并且识别出允许我们总结正在发生的事情的“自然法则”。
We’ll encounter the same kinds of issues when we talk about generating language with ChatGPT. And again it’s not clear whether there are ways to “summarize what it’s doing”. But the richness and detail of language (and our experience with it) may allow us to get further than with images.
当我们讨论使用 ChatGPT 生成语言时,我们会遇到同样的问题。同样不清楚是否有一种方法可以“总结它在做什么”。但是,语言的丰富性和细节(以及我们对它的经验)可能使我们能够比处理图像时走得更远。
Machine Learning, and the Training of Neural Nets
机器学习,以及神经网络的训练
We’ve been talking so far about neural nets that “already know” how to do particular tasks. But what makes neural nets so useful (presumably also in brains) is that not only can they in principle do all sorts of tasks, but they can be incrementally “trained from examples” to do those tasks.
我们到目前为止一直在讨论那些“已经知道”如何执行特定任务的神经网络。但使神经网络如此有用(可能在大脑中也是如此)的是,它们不仅原则上可以执行各种任务,而且可以通过“从示例中逐步训练”来执行这些任务。
When we make a neural net to distinguish cats from dogs we don’t effectively have to write a program that (say) explicitly finds whiskers; instead we just show lots of examples of what’s a cat and what’s a dog, and then have the network “machine learn” from these how to distinguish them.
当我们制作一个区分猫和狗的神经网络时,我们不需要有效地编写一个(比如说)明确寻找胡须的程序;相反,我们只是展示大量什么是猫和什么是狗的例子,然后让网络从这些例子中“机器学习”如何区分它们。
And the point is that the trained network “generalizes” from the particular examples it’s shown. Just as we’ve seen above, it isn’t simply that the network recognizes the particular pixel pattern of an example cat image it was shown; rather it’s that the neural net somehow manages to distinguish images on the basis of what we consider to be some kind of “general catness”.
关键在于,训练后的网络从它所展示的特定示例中“泛化”出来。正如我们上面所看到的,问题不仅仅在于网络识别出它所展示的示例猫图像的特定像素模式;而是神经网络以某种方式设法根据我们认为的某种“一般猫性”来区分图像。
So how does neural net training actually work? Essentially what we’re always trying to do is to find weights that make the neural net successfully reproduce the examples we’ve given. And then we’re relying on the neural net to “interpolate” (or “generalize”) “between” these examples in a “reasonable” way.
那么神经网络训练实际上是如何工作的呢?本质上,我们总是在试图找到使神经网络成功重现我们给出的例子的权重。然后我们依赖神经网络以“合理”的方式“在这些例子之间”“插值”(或“泛化”)。
Let’s look at a problem even simpler than the nearest-point one above. Let’s just try to get a neural net to learn the function:
让我们来看一个比上述最近点问题更简单的问题。让我们尝试让神经网络学习以下函数:
For this task, we’ll need a network that has just one input and one output, like:
对于这项任务,我们需要一个只有一个输入和一个输出的网络,如下所示:
But what weights, etc. should we be using? With every possible set of weights the neural net will compute some function. And, for example, here’s what it does with a few randomly chosen sets of weights:
但是我们应该使用什么权重等呢?对于每一种可能的权重组合,神经网络都会计算某种函数。例如,以下是它使用一些随机选择的权重组合所得到的结果:
And, yes, we can plainly see that in none of these cases does it get even close to reproducing the function we want. So how do we find weights that will reproduce the function?
而且,是的,我们可以清楚地看到,在这些情况下,它甚至无法接近我们想要的函数。那么,我们如何找到能够重现该函数的权重呢?
The basic idea is to supply lots of “input → output” examples to “learn from”—and then to try to find weights that will reproduce these examples. Here’s the result of doing that with progressively more examples:
基本思路是提供大量的“输入 → 输出”示例以供“学习”——然后尝试找到能够重现这些示例的权重。以下是通过逐步增加示例数量进行此操作的结果:
At each stage in this “training” the weights in the network are progressively adjusted—and we see that eventually we get a network that successfully reproduces the function we want. So how do we adjust the weights? The basic idea is at each stage to see “how far away we are” from getting the function we want—and then to update the weights in such a way as to get closer.
在每个阶段的这种“训练”中,网络中的权重逐渐调整——我们看到最终我们得到了一个成功重现我们所需功能的网络。那么我们如何调整权重呢?基本思想是在每个阶段查看“我们离所需功能还有多远”——然后以一种更接近的方式更新权重。
To find out “how far away we are” we compute what’s usually called a “loss function” (or sometimes “cost function”). Here we’re using a simple (L2) loss function that’s just the sum of the squares of the differences between the values we get, and the true values. And what we see is that as our training process progresses, the loss function progressively decreases (following a certain “learning curve” that’s different for different tasks)—until we reach a point where the network (at least to a good approximation) successfully reproduces the function we want:
要找出“我们有多远”,我们计算通常称为“损失函数”(有时也称为“成本函数”)的值。这里我们使用一个简单的(L2)损失函数,它只是我们得到的值与真实值之间的差的平方和。我们看到的是,随着我们的训练过程进行,损失函数逐渐减少(遵循不同任务不同的“学习曲线”)——直到我们达到一个点,网络(至少在很好的近似程度上)成功地重现了我们想要的函数:
Alright, so the last essential piece to explain is how the weights are adjusted to reduce the loss function. As we’ve said, the loss function gives us a “distance” between the values we’ve got, and the true values. But the “values we’ve got” are determined at each stage by the current version of neural net—and by the weights in it. But now imagine that the weights are variables—say wi. We want to find out how to adjust the values of these variables to minimize the loss that depends on them.
好的,所以最后要解释的一个关键部分是如何调整权重以减少损失函数。正如我们所说,损失函数为我们提供了我们得到的值与真实值之间的“距离”。但是,“我们得到的值”是由当前版本的神经网络——以及其中的权重——在每个阶段确定的。但现在想象一下,权重是变量——比如说 w。我们希望找出如何调整这些变量的值以最小化依赖于它们的损失。
For example, imagine (in an incredible simplification of typical neural nets used in practice) that we have just two weights w1 and w2. Then we might have a loss that as a function of w1 and w2 looks like this:
例如,想象一下(这是对实践中使用的典型神经网络的极大简化)我们只有两个权重 w1 和 w2。然后我们可能会有一个损失函数,作为 w1 和 w2 的函数看起来像这样:
Numerical analysis provides a variety of techniques for finding the minimum in cases like this. But a typical approach is just to progressively follow the path of steepest descent from whatever previous w1, w2 we had:
数值分析提供了各种技术来在这种情况下找到最小值。但一个典型的方法是只需从我们之前的任何 w1, w2 开始,沿着最陡下降的路径逐步进行:
Like water flowing down a mountain, all that’s guaranteed is that this procedure will end up at some local minimum of the surface (“a mountain lake”); it might well not reach the ultimate global minimum.
就像水从山上流下一样,可以保证的是这个过程最终会达到某个局部最小值(“山间湖泊”);但它很可能不会达到最终的全局最小值。
It’s not obvious that it would be feasible to find the path of the steepest descent on the “weight landscape”. But calculus comes to the rescue. As we mentioned above, one can always think of a neural net as computing a mathematical function—that depends on its inputs, and its weights. But now consider differentiating with respect to these weights. It turns out that the chain rule of calculus in effect lets us “unravel” the operations done by successive layers in the neural net. And the result is that we can—at least in some local approximation—“invert” the operation of the neural net, and progressively find weights that minimize the loss associated with the output.
在“权重景观”上找到最陡下降路径并不明显是可行的。但微积分来拯救了我们。正如我们上面提到的,人们总是可以将神经网络视为计算一个数学函数——该函数取决于其输入和权重。但现在考虑对这些权重进行微分。事实证明,微积分的链式法则有效地让我们“解开”了神经网络中连续层所执行的操作。结果是,我们至少可以在一些局部近似中“反转”神经网络的操作,并逐步找到使与输出相关的损失最小化的权重。
The picture above shows the kind of minimization we might need to do in the unrealistically simple case of just 2 weights. But it turns out that even with many more weights (ChatGPT uses 175 billion) it’s still possible to do the minimization, at least to some level of approximation. And in fact the big breakthrough in “deep learning” that occurred around 2011 was associated with the discovery that in some sense it can be easier to do (at least approximate) minimization when there are lots of weights involved than when there are fairly few.
上图显示了在只有 2 个权重的不切实际的简单情况下,我们可能需要进行的某种最小化。但事实证明,即使有更多的权重(ChatGPT 使用了 1750 亿个),仍然可以进行最小化,至少在某种程度上可以进行近似。事实上,2011 年左右发生的“深度学习”的重大突破与这样一个发现有关:在某种意义上,当涉及大量权重时,进行(至少是近似)最小化可能比涉及较少权重时更容易。
In other words—somewhat counterintuitively—it can be easier to solve more complicated problems with neural nets than simpler ones. And the rough reason for this seems to be that when one has a lot of “weight variables” one has a high-dimensional space with “lots of different directions” that can lead one to the minimum—whereas with fewer variables it’s easier to end up getting stuck in a local minimum (“mountain lake”) from which there’s no “direction to get out”.
换句话说——有点违反直觉——用神经网络解决更复杂的问题比解决更简单的问题更容易。这似乎是因为当一个人有很多“权重变量”时,一个人有一个高维空间,有“很多不同的方向”可以引导一个人达到最小值——而变量较少时,更容易陷入局部最小值(“山湖”),从那里没有“方向可以出去”。
It’s worth pointing out that in typical cases there are many different collections of weights that will all give neural nets that have pretty much the same performance. And usually in practical neural net training there are lots of random choices made—that lead to “different-but-equivalent solutions”, like these:
值得指出的是,在典型情况下,有许多不同的权重集合都会使神经网络具有几乎相同的性能。通常在实际的神经网络训练中,会做出许多随机选择——这些选择会导致“不同但等效的解决方案”,例如这些:
But each such “different solution” will have at least slightly different behavior. And if we ask, say, for an “extrapolation” outside the region where we gave training examples, we can get dramatically different results:
但每个这样的“不同解决方案”至少会有略微不同的行为。如果我们要求,比如说,在给定训练示例的区域之外进行“外推”,我们可能会得到截然不同的结果:
But which of these is “right”? There’s really no way to say. They’re all “consistent with the observed data”. But they all correspond to different “innate” ways to “think about” what to do “outside the box”. And some may seem “more reasonable” to us humans than others.
但这些中哪一个才是“正确的”?真的没有办法说。它们都“与观察到的数据一致”。但它们都对应于不同的“内在”方式来“思考”如何“跳出框框”。而且有些可能在我们人类看来“更合理”。
The Practice and Lore of Neural Net Training
神经网络训练的实践与知识
Particularly over the past decade, there’ve been many advances in the art of training neural nets. And, yes, it is basically an art. Sometimes—especially in retrospect—one can see at least a glimmer of a “scientific explanation” for something that’s being done. But mostly things have been discovered by trial and error, adding ideas and tricks that have progressively built a significant lore about how to work with neural nets.
特别是在过去的十年里,训练神经网络的技术有了许多进步。而且,是的,这基本上是一门艺术。有时——尤其是事后看来——人们至少可以看到对所做事情的“科学解释”的一丝曙光。但大多数情况下,这些发现是通过反复试验,逐步添加想法和技巧,逐渐积累了关于如何使用神经网络的重要知识。
There are several key parts. First, there’s the matter of what architecture of neural net one should use for a particular task. Then there’s the critical issue of how one’s going to get the data on which to train the neural net. And increasingly one isn’t dealing with training a net from scratch: instead a new net can either directly incorporate another already-trained net, or at least can use that net to generate more training examples for itself.
有几个关键部分。首先,是关于应该为特定任务使用哪种神经网络架构的问题。然后是关于如何获取训练神经网络所需数据的关键问题。而且,越来越多的人不再从头开始训练网络:相反,一个新的网络可以直接包含另一个已经训练好的网络,或者至少可以使用该网络为自己生成更多的训练示例。
One might have thought that for every particular kind of task one would need a different architecture of neural net. But what’s been found is that the same architecture often seems to work even for apparently quite different tasks. At some level this reminds one of the idea of universal computation (and my Principle of Computational Equivalence), but, as I’ll discuss later, I think it’s more a reflection of the fact that the tasks we’re typically trying to get neural nets to do are “human-like” ones—and neural nets can capture quite general “human-like processes”.
人们可能会认为,对于每一种特定的任务,都需要一种不同的神经网络架构。但事实证明,即使是对于明显非常不同的任务,相同的架构也经常有效。在某种程度上,这让人想起通用计算的概念(以及我的计算等效原理),但正如我稍后将讨论的那样,我认为这更多地反映了我们通常试图让神经网络完成的任务是“类人”的任务——而神经网络可以捕捉到相当普遍的“类人过程”。
In earlier days of neural nets, there tended to be the idea that one should “make the neural net do as little as possible”. For example, in converting speech to text it was thought that one should first analyze the audio of the speech, break it into phonemes, etc. But what was found is that—at least for “human-like tasks”—it’s usually better just to try to train the neural net on the “end-to-end problem”, letting it “discover” the necessary intermediate features, encodings, etc. for itself.
在神经网络的早期,有一种观点认为应该“让神经网络做尽可能少的事情”。例如,在将语音转换为文本时,人们认为应该首先分析语音的音频,将其分解为音素等。但后来发现,至少对于“类似人类的任务”,通常最好直接尝试在“端到端问题”上训练神经网络,让它自己“发现”必要的中间特征、编码等。
There was also the idea that one should introduce complicated individual components into the neural net, to let it in effect “explicitly implement particular algorithmic ideas”. But once again, this has mostly turned out not to be worthwhile; instead, it’s better just to deal with very simple components and let them “organize themselves” (albeit usually in ways we can’t understand) to achieve (presumably) the equivalent of those algorithmic ideas.
还有一种想法是,应该将复杂的个体组件引入神经网络,让它实际上“明确实现特定的算法思想”。但再次证明,这大多不值得;相反,最好是处理非常简单的组件,让它们“自我组织”(尽管通常是以我们无法理解的方式)来实现(大概)那些算法思想的等效效果。
That’s not to say that there are no “structuring ideas” that are relevant for neural nets. Thus, for example, having 2D arrays of neurons with local connections seems at least very useful in the early stages of processing images. And having patterns of connectivity that concentrate on “looking back in sequences” seems useful—as we’ll see later—in dealing with things like human language, for example in ChatGPT.
这并不是说神经网络没有相关的“结构化思想”。因此,例如,在图像处理的早期阶段,具有局部连接的 2D 神经元阵列似乎非常有用。并且具有集中于“回看序列”的连接模式似乎很有用——正如我们稍后将看到的——在处理诸如人类语言之类的事情时,例如在 ChatGPT 中。
But an important feature of neural nets is that—like computers in general—they’re ultimately just dealing with data. And current neural nets—with current approaches to neural net training—specifically deal with arrays of numbers. But in the course of processing, those arrays can be completely rearranged and reshaped. And as an example, the network we used for identifying digits above starts with a 2D “image-like” array, quickly “thickening” to many channels, but then “concentrating down” into a 1D array that will ultimately contain elements representing the different possible output digits:
但神经网络的一个重要特征是——就像计算机一般——它们最终只是在处理数据。而当前的神经网络——使用当前的神经网络训练方法——具体处理的是数字数组。但在处理过程中,这些数组可以被完全重新排列和重塑。例如,我们用于识别上述数字的网络从一个 2D 的“图像类”数组开始,迅速“增厚”到许多通道,但随后“集中”成一个 1D 数组,该数组最终包含表示不同可能输出数字的元素:
But, OK, how can one tell how big a neural net one will need for a particular task? It’s something of an art. At some level the key thing is to know “how hard the task is”. But for human-like tasks that’s typically very hard to estimate. Yes, there may be a systematic way to do the task very “mechanically” by computer. But it’s hard to know if there are what one might think of as tricks or shortcuts that allow one to do the task at least at a “human-like level” vastly more easily. It might take enumerating a giant game tree to “mechanically” play a certain game; but there might be a much easier (“heuristic”) way to achieve “human-level play”.
但是,好吧,一个人如何判断为一个特定任务需要多大的神经网络呢?这是一门艺术。在某种程度上,关键是要知道“任务有多难”。但对于类似人类的任务,这通常很难估计。是的,可能有一种系统的方法可以通过计算机非常“机械地”完成任务。但很难知道是否有一些可以认为是技巧或捷径的东西,这些技巧或捷径可以让人至少在“类似人类的水平”上更容易地完成任务。可能需要枚举一个巨大的游戏树来“机械地”玩某个游戏;但可能有一种更容易(“启发式”)的方法来实现“人类水平的游戏”。
When one’s dealing with tiny neural nets and simple tasks one can sometimes explicitly see that one “can’t get there from here”. For example, here’s the best one seems to be able to do on the task from the previous section with a few small neural nets:
当处理微小的神经网络和简单的任务时,有时可以明确地看到“无法从这里到达那里”。例如,以下是使用几个小型神经网络在前一节任务中似乎能做的最好的结果:
And what we see is that if the net is too small, it just can’t reproduce the function we want. But above some size, it has no problem—at least if one trains it for long enough, with enough examples. And, by the way, these pictures illustrate a piece of neural net lore: that one can often get away with a smaller network if there’s a “squeeze” in the middle that forces everything to go through a smaller intermediate number of neurons. (It’s also worth mentioning that “no-intermediate-layer”—or so-called “perceptron”—networks can only learn essentially linear functions—but as soon as there’s even one intermediate layer it’s always in principle possible to approximate any function arbitrarily well, at least if one has enough neurons, though to make it feasibly trainable one typically has some kind of regularization or normalization.)
我们看到,如果网络太小,它就无法重现我们想要的功能。但如果网络达到一定大小,它就没有问题——至少如果训练时间足够长,有足够的例子。顺便说一下,这些图片说明了一个神经网络的传说:如果中间有一个“挤压”,迫使所有东西通过较少的中间神经元,通常可以用较小的网络。(还值得一提的是,“无中间层”——或所谓的“感知器”——网络只能学习本质上的线性函数——但只要有一个中间层,原则上总是可以任意好地近似任何函数,至少如果有足够的神经元,尽管为了使其可行训练,通常会有某种正则化或归一化。)
OK, so let’s say one’s settled on a certain neural net architecture. Now there’s the issue of getting data to train the network with. And many of the practical challenges around neural nets—and machine learning in general—center on acquiring or preparing the necessary training data. In many cases (“supervised learning”) one wants to get explicit examples of inputs and the outputs one is expecting from them. Thus, for example, one might want images tagged by what’s in them, or some other attribute. And maybe one will have to explicitly go through—usually with great effort—and do the tagging. But very often it turns out to be possible to piggyback on something that’s already been done, or use it as some kind of proxy. And so, for example, one might use alt tags that have been provided for images on the web. Or, in a different domain, one might use closed captions that have been created for videos. Or—for language translation training—one might use parallel versions of webpages or other documents that exist in different languages.
好的,假设已经确定了一种特定的神经网络架构。现在的问题是如何获取数据来训练网络。神经网络(以及一般的机器学习)的许多实际挑战都集中在获取或准备必要的训练数据上。在许多情况下(“监督学习”),人们希望获得输入的明确示例以及期望从它们得到的输出。因此,例如,可能希望对图像进行标记,说明其中包含的内容或其他属性。可能需要明确地逐一进行标记——通常需要付出巨大的努力。但通常可以利用已经完成的工作,或者将其作为某种代理。因此,例如,可以使用网络上提供的图像替代文本(alt tags)。或者,在不同的领域,可以使用为视频创建的封闭字幕。或者——用于语言翻译训练——可以使用存在于不同语言中的网页或其他文档的平行版本。
How much data do you need to show a neural net to train it for a particular task? Again, it’s hard to estimate from first principles. Certainly the requirements can be dramatically reduced by using “transfer learning” to “transfer in” things like lists of important features that have already been learned in another network. But generally neural nets need to “see a lot of examples” to train well. And at least for some tasks it’s an important piece of neural net lore that the examples can be incredibly repetitive. And indeed it’s a standard strategy to just show a neural net all the examples one has, over and over again. In each of these “training rounds” (or “epochs”) the neural net will be in at least a slightly different state, and somehow “reminding it” of a particular example is useful in getting it to “remember that example”. (And, yes, perhaps this is analogous to the usefulness of repetition in human memorization.)
你需要向神经网络展示多少数据才能训练它完成特定任务?同样,很难从第一原理进行估计。当然,通过使用“迁移学习”将已经在另一个网络中学习到的重要特征列表“迁移”进来,可以显著减少需求。但一般来说,神经网络需要“看到很多例子”才能训练好。至少对于某些任务来说,一个重要的神经网络知识是例子可以非常重复。实际上,向神经网络展示所有已有的例子,一遍又一遍,是一种标准策略。在每一个“训练轮次”(或“epochs”)中,神经网络至少会处于稍微不同的状态,而“提醒它”某个特定的例子有助于让它“记住那个例子”。(而且,是的,这可能类似于重复在人类记忆中的有用性。)
But often just repeating the same example over and over again isn’t enough. It’s also necessary to show the neural net variations of the example. And it’s a feature of neural net lore that those “data augmentation” variations don’t have to be sophisticated to be useful. Just slightly modifying images with basic image processing can make them essentially “as good as new” for neural net training. And, similarly, when one’s run out of actual video, etc. for training self-driving cars, one can go on and just get data from running simulations in a model videogame-like environment without all the detail of actual real-world scenes.
但通常仅仅重复同一个例子一遍又一遍是不够的。还需要向神经网络展示例子的变体。神经网络的一个特点是,这些“数据增强”变体不必复杂就能有用。只需用基本的图像处理稍微修改图像,就可以使它们对神经网络训练来说“几乎像新的一样”。同样地,当一个人用完了实际的视频等来训练自动驾驶汽车时,可以继续从在类似视频游戏的模拟环境中运行模拟中获取数据,而不需要实际真实世界场景的所有细节。
How about something like ChatGPT? Well, it has the nice feature that it can do “unsupervised learning”, making it much easier to get it examples to train from. Recall that the basic task for ChatGPT is to figure out how to continue a piece of text that it’s been given. So to get it “training examples” all one has to do is get a piece of text, and mask out the end of it, and then use this as the “input to train from”—with the “output” being the complete, unmasked piece of text. We’ll discuss this more later, but the main point is that—unlike, say, for learning what’s in images—there’s no “explicit tagging” needed; ChatGPT can in effect just learn directly from whatever examples of text it’s given.
ChatGPT 怎么样?它有一个很好的特点,就是可以进行“无监督学习”,这使得获取训练示例变得更加容易。回想一下,ChatGPT 的基本任务是弄清楚如何继续它所给出的一段文本。因此,要获取“训练示例”,只需获取一段文本,遮盖其结尾部分,然后将其作为“训练输入”——“输出”则是完整的、未遮盖的文本。我们稍后会详细讨论这一点,但主要的一点是——与学习图像中的内容不同——不需要“明确的标记”;ChatGPT 实际上可以直接从它所给出的任何文本示例中学习。
OK, so what about the actual learning process in a neural net? In the end it’s all about determining what weights will best capture the training examples that have been given. And there are all sorts of detailed choices and “hyperparameter settings” (so called because the weights can be thought of as “parameters”) that can be used to tweak how this is done. There are different choices of loss function (sum of squares, sum of absolute values, etc.). There are different ways to do loss minimization (how far in weight space to move at each step, etc.). And then there are questions like how big a “batch” of examples to show to get each successive estimate of the loss one’s trying to minimize. And, yes, one can apply machine learning (as we do, for example, in Wolfram Language) to automate machine learning—and to automatically set things like hyperparameters.
好的,那么在神经网络中的实际学习过程是怎样的呢?最终,这一切都是关于确定哪些权重能够最好地捕捉到已给出的训练样本。而且有各种各样的详细选择和“超参数设置”(之所以这样称呼,是因为权重可以被视为“参数”)可以用来调整如何完成这一过程。有不同的损失函数选择(平方和、绝对值和等)。有不同的损失最小化方法(每次在权重空间中移动多远等)。然后还有诸如展示多少个“批次”的样本以获得每次连续估计的损失等问题,这些损失是试图最小化的。而且,是的,可以应用机器学习(例如,我们在 Wolfram 语言中这样做)来自动化机器学习,并自动设置诸如超参数之类的东西。
But in the end the whole process of training can be characterized by seeing how the loss progressively decreases (as in this Wolfram Language progress monitor for a small training):
但最终,整个训练过程可以概括为观察损失如何逐渐减少(如在这个用于小型训练的 Wolfram 语言进度监控器中所示):
And what one typically sees is that the loss decreases for a while, but eventually flattens out at some constant value. If that value is sufficiently small, then the training can be considered successful; otherwise it’s probably a sign one should try changing the network architecture.
通常看到的是,损失会减少一段时间,但最终会稳定在某个常数值。如果该值足够小,那么训练可以被认为是成功的;否则可能是一个应该尝试改变网络架构的信号。
Can one tell how long it should take for the “learning curve” to flatten out? Like for so many other things, there seem to be approximate power-law scaling relationships that depend on the size of neural net and amount of data one’s using. But the general conclusion is that training a neural net is hard—and takes a lot of computational effort. And as a practical matter, the vast majority of that effort is spent doing operations on arrays of numbers, which is what GPUs are good at—which is why neural net training is typically limited by the availability of GPUs.
一个人能说出“学习曲线”需要多长时间才能变平吗?就像许多其他事物一样,似乎存在依赖于神经网络大小和使用数据量的近似幂律缩放关系。但一般结论是训练神经网络很难——并且需要大量的计算努力。作为一个实际问题,绝大多数努力都花在数字数组的操作上,这正是 GPU 擅长的——这也是为什么神经网络训练通常受限于 GPU 的可用性。
In the future, will there be fundamentally better ways to train neural nets—or generally do what neural nets do? Almost certainly, I think. The fundamental idea of neural nets is to create a flexible “computing fabric” out of a large number of simple (essentially identical) components—and to have this “fabric” be one that can be incrementally modified to learn from examples. In current neural nets, one’s essentially using the ideas of calculus—applied to real numbers—to do that incremental modification. But it’s increasingly clear that having high-precision numbers doesn’t matter; 8 bits or less might be enough even with current methods.
在未来,是否会有从根本上更好的方法来训练神经网络——或者一般地做神经网络所做的事情?我认为几乎肯定是有的。神经网络的基本思想是通过大量简单(本质上相同)的组件创建一个灵活的“计算织物”,并使这种“织物”能够通过增量修改从示例中学习。在当前的神经网络中,人们基本上使用微积分的思想——应用于实数——来进行这种增量修改。但越来越明显的是,高精度数字并不重要;即使使用当前的方法,8 位或更少的位数可能就足够了。
With computational systems like cellular automata that basically operate in parallel on many individual bits it’s never been clear how to do this kind of incremental modification, but there’s no reason to think it isn’t possible. And in fact, much like with the “deep-learning breakthrough of 2012” it may be that such incremental modification will effectively be easier in more complicated cases than in simple ones.
在像元胞自动机这样的计算系统中,它们基本上在许多单独的位上并行操作,如何进行这种增量修改一直不明确,但没有任何理由认为这是不可能的。事实上,就像 2012 年的“深度学习突破”一样,这种增量修改在更复杂的情况下可能比在简单的情况下更容易实现。
Neural nets—perhaps a bit like brains—are set up to have an essentially fixed network of neurons, with what’s modified being the strength (“weight”) of connections between them. (Perhaps in at least young brains significant numbers of wholly new connections can also grow.) But while this might be a convenient setup for biology, it’s not at all clear that it’s even close to the best way to achieve the functionality we need. And something that involves the equivalent of progressive network rewriting (perhaps reminiscent of our Physics Project) might well ultimately be better.
神经网络——可能有点像大脑——被设置为具有本质上固定的神经元网络,其中被修改的是它们之间连接的强度(“权重”)。(也许至少在年轻的大脑中,大量全新的连接也可以生长。)但虽然这对生物学来说可能是一个方便的设置,但完全不清楚这是否接近于实现我们需要的功能的最佳方式。而涉及等效于渐进网络重写的东西(可能让人想起我们的物理项目)可能最终会更好。
But even within the framework of existing neural nets there’s currently a crucial limitation: neural net training as it’s now done is fundamentally sequential, with the effects of each batch of examples being propagated back to update the weights. And indeed with current computer hardware—even taking into account GPUs—most of a neural net is “idle” most of the time during training, with just one part at a time being updated. And in a sense this is because our current computers tend to have memory that is separate from their CPUs (or GPUs). But in brains it’s presumably different—with every “memory element” (i.e. neuron) also being a potentially active computational element. And if we could set up our future computer hardware this way it might become possible to do training much more efficiently.
但即使在现有神经网络的框架内,目前也存在一个关键限制:现在进行的神经网络训练本质上是顺序的,每个批次示例的效果被传播回以更新权重。实际上,使用当前的计算机硬件——即使考虑到 GPU——在训练期间,神经网络的大部分时间都是“空闲”的,只有一个部分在更新。从某种意义上说,这是因为我们当前的计算机往往将内存与 CPU(或 GPU)分开。但在大脑中,情况可能不同——每个“记忆元素”(即神经元)也是一个潜在的活跃计算元素。如果我们能够以这种方式设置未来的计算机硬件,可能会使训练效率大大提高。
“Surely a Network That’s Big Enough Can Do Anything!”
“当然,一个足够大的网络可以做任何事情!”
The capabilities of something like ChatGPT seem so impressive that one might imagine that if one could just “keep going” and train larger and larger neural networks, then they’d eventually be able to “do everything”. And if one’s concerned with things that are readily accessible to immediate human thinking, it’s quite possible that this is the case. But the lesson of the past several hundred years of science is that there are things that can be figured out by formal processes, but aren’t readily accessible to immediate human thinking.
ChatGPT 这样的能力看起来如此令人印象深刻,以至于人们可能会想象,如果能够“继续下去”并训练越来越大的人工神经网络,那么它们最终将能够“做任何事情”。如果一个人关心的是那些可以立即被人类思维轻易触及的事物,那么这可能是正确的。但过去几百年的科学教训是,有些事情可以通过形式化过程解决,但不容易被人类思维立即触及。
Nontrivial mathematics is one big example. But the general case is really computation. And ultimately the issue is the phenomenon of computational irreducibility. There are some computations which one might think would take many steps to do, but which can in fact be “reduced” to something quite immediate. But the discovery of computational irreducibility implies that this doesn’t always work. And instead there are processes—probably like the one below—where to work out what happens inevitably requires essentially tracing each computational step:
非平凡数学是一个大例子。但一般情况实际上是计算。最终的问题是计算不可简化性的现象。有些计算可能看起来需要很多步骤才能完成,但实际上可以“简化”为相当直接的东西。但计算不可简化性的发现意味着这并不总是有效。相反,存在一些过程——可能类似于下面的过程——在这些过程中,要弄清楚发生了什么,不可避免地需要基本上追踪每个计算步骤:
The kinds of things that we normally do with our brains are presumably specifically chosen to avoid computational irreducibility. It takes special effort to do math in one’s brain. And it’s in practice largely impossible to “think through” the steps in the operation of any nontrivial program just in one’s brain.
我们通常用大脑做的事情大概都是特意选择来避免计算不可约性的。在脑子里做数学需要特别的努力。而且在实践中,几乎不可能仅凭大脑“思考”任何非平凡程序的操作步骤。
But of course for that we have computers. And with computers we can readily do long, computationally irreducible things. And the key point is that there’s in general no shortcut for these.
但当然,为此我们有计算机。有了计算机,我们可以轻松地进行长时间、计算上不可简化的事情。关键点是,通常没有捷径可走。
Yes, we could memorize lots of specific examples of what happens in some particular computational system. And maybe we could even see some (“computationally reducible”) patterns that would allow us to do a little generalization. But the point is that computational irreducibility means that we can never guarantee that the unexpected won’t happen—and it’s only by explicitly doing the computation that you can tell what actually happens in any particular case.
是的,我们可以记住许多特定计算系统中发生的事情的具体例子。也许我们甚至可以看到一些(“计算上可约”)的模式,这使我们能够进行一些概括。但关键是,计算不可约性意味着我们永远无法保证意外不会发生——只有通过明确地进行计算,你才能知道在任何特定情况下实际发生了什么。
And in the end there’s just a fundamental tension between learnability and computational irreducibility. Learning involves in effect compressing data by leveraging regularities. But computational irreducibility implies that ultimately there’s a limit to what regularities there may be.
最终,学习能力和计算不可约性之间存在着根本的张力。学习实际上是通过利用规律来压缩数据。但计算不可约性意味着最终可能存在的规律是有限的。
As a practical matter, one can imagine building little computational devices—like cellular automata or Turing machines—into trainable systems like neural nets. And indeed such devices can serve as good “tools” for the neural net—like Wolfram|Alpha can be a good tool for ChatGPT. But computational irreducibility implies that one can’t expect to “get inside” those devices and have them learn.
实际上,可以想象将小型计算设备——如细胞自动机或图灵机——构建到像神经网络这样的可训练系统中。事实上,这些设备可以作为神经网络的良好“工具”——就像 Wolfram|Alpha 可以作为 ChatGPT 的良好工具一样。但计算不可约性意味着不能期望“进入”这些设备并让它们学习。
Or put another way, there’s an ultimate tradeoff between capability and trainability: the more you want a system to make “true use” of its computational capabilities, the more it’s going to show computational irreducibility, and the less it’s going to be trainable. And the more it’s fundamentally trainable, the less it’s going to be able to do sophisticated computation.
或者换句话说,能力与可训练性之间存在一个最终的权衡:你越希望系统“真正利用”其计算能力,它就越会表现出计算不可约性,也就越不可训练。而它越基本可训练,就越不能进行复杂的计算。
(For ChatGPT as it currently is, the situation is actually much more extreme, because the neural net used to generate each token of output is a pure “feed-forward” network, without loops, and therefore has no ability to do any kind of computation with nontrivial “control flow”.)
(对于当前的 ChatGPT 来说,情况实际上更加极端,因为用于生成每个输出标记的神经网络是一个纯粹的“前馈”网络,没有循环,因此没有任何能力进行任何具有非平凡“控制流”的计算。)
Of course, one might wonder whether it’s actually important to be able to do irreducible computations. And indeed for much of human history it wasn’t particularly important. But our modern technological world has been built on engineering that makes use of at least mathematical computations—and increasingly also more general computations. And if we look at the natural world, it’s full of irreducible computation—that we’re slowly understanding how to emulate and use for our technological purposes.
当然,人们可能会想知道是否真的需要能够进行不可简化的计算。确实,在人类历史的很大一部分时间里,这并不是特别重要。但我们的现代技术世界是建立在至少利用数学计算的工程上的——而且越来越多地利用更一般的计算。如果我们看看自然界,它充满了不可简化的计算——我们正在慢慢理解如何模拟并将其用于我们的技术目的。
Yes, a neural net can certainly notice the kinds of regularities in the natural world that we might also readily notice with “unaided human thinking”. But if we want to work out things that are in the purview of mathematical or computational science the neural net isn’t going to be able to do it—unless it effectively “uses as a tool” an “ordinary” computational system.
是的,神经网络当然可以注意到我们在“不借助人类思维”的情况下也能轻易注意到的自然界中的规律。但如果我们想要解决数学或计算科学范畴内的问题,神经网络将无法做到——除非它有效地“将普通计算系统作为工具使用”。
But there’s something potentially confusing about all of this. In the past there were plenty of tasks—including writing essays—that we’ve assumed were somehow “fundamentally too hard” for computers. And now that we see them done by the likes of ChatGPT we tend to suddenly think that computers must have become vastly more powerful—in particular surpassing things they were already basically able to do (like progressively computing the behavior of computational systems like cellular automata).
但这一切可能有些令人困惑。过去有很多任务——包括写论文——我们一直认为计算机“从根本上来说太难”完成。而现在我们看到像 ChatGPT 这样的工具能够完成这些任务,我们往往会突然认为计算机的性能一定大大提高了——特别是在已经基本能够完成的事情上(比如逐步计算像细胞自动机这样的计算系统的行为)。
But this isn’t the right conclusion to draw. Computationally irreducible processes are still computationally irreducible, and are still fundamentally hard for computers—even if computers can readily compute their individual steps. And instead what we should conclude is that tasks—like writing essays—that we humans could do, but we didn’t think computers could do, are actually in some sense computationally easier than we thought.
但这并不是正确的结论。计算上不可约简的过程仍然是计算上不可约简的,并且对于计算机来说仍然是根本性的难题——即使计算机可以轻松计算它们的各个步骤。相反,我们应该得出的结论是,我们人类可以完成的任务——比如写文章——实际上在某种意义上比我们想象的更容易计算。
In other words, the reason a neural net can be successful in writing an essay is because writing an essay turns out to be a “computationally shallower” problem than we thought. And in a sense this takes us closer to “having a theory” of how we humans manage to do things like writing essays, or in general deal with language.
换句话说,神经网络能够在写作文方面取得成功的原因是因为写作文实际上是一个比我们想象的“计算上更浅显”的问题。从某种意义上说,这使我们更接近于“拥有一个理论”,即我们人类如何能够完成诸如写作文这样的事情,或者更广泛地说,如何处理语言。
If you had a big enough neural net then, yes, you might be able to do whatever humans can readily do. But you wouldn’t capture what the natural world in general can do—or that the tools that we’ve fashioned from the natural world can do. And it’s the use of those tools—both practical and conceptual—that have allowed us in recent centuries to transcend the boundaries of what’s accessible to “pure unaided human thought”, and capture for human purposes more of what’s out there in the physical and computational universe.
如果你有一个足够大的神经网络,那么,是的,你可能能够做任何人类可以轻易做到的事情。但你不会捕捉到自然界一般能做的事情——或者我们用自然界制造的工具能做的事情。正是使用这些工具——无论是实际的还是概念的——使我们在最近几个世纪超越了“纯粹无助的人类思维”所能触及的界限,并为人类目的捕捉到物理和计算宇宙中更多的东西。
The Concept of Embeddings
嵌入的概念
Neural nets—at least as they’re currently set up—are fundamentally based on numbers. So if we’re going to to use them to work on something like text we’ll need a way to represent our text with numbers. And certainly we could start (essentially as ChatGPT does) by just assigning a number to every word in the dictionary. But there’s an important idea—that’s for example central to ChatGPT—that goes beyond that. And it’s the idea of “embeddings”. One can think of an embedding as a way to try to represent the “essence” of something by an array of numbers—with the property that “nearby things” are represented by nearby numbers.
神经网络——至少目前是这样设置的——基本上是基于数字的。因此,如果我们打算用它们来处理文本这样的东西,我们需要一种用数字表示文本的方法。当然,我们可以从(基本上像 ChatGPT 那样)为字典中的每个单词分配一个数字开始。但是,有一个重要的概念——例如,它是 ChatGPT 的核心——超越了这一点。这个概念就是“嵌入”。可以将嵌入视为一种通过数字数组来尝试表示某物的“本质”的方法——具有“附近的事物”用附近的数字表示的特性。
And so, for example, we can think of a word embedding as trying to lay out words in a kind of “meaning space” in which words that are somehow “nearby in meaning” appear nearby in the embedding. The actual embeddings that are used—say in ChatGPT—tend to involve large lists of numbers. But if we project down to 2D, we can show examples of how words are laid out by the embedding:
因此,例如,我们可以将词嵌入视为试图在某种“意义空间”中排列单词,其中在意义上“接近”的单词在嵌入中也会接近。实际使用的嵌入——比如在 ChatGPT 中——往往涉及大量数字列表。但如果我们将它们投影到二维空间,我们可以展示词嵌入如何排列单词的示例:
And, yes, what we see does remarkably well in capturing typical everyday impressions. But how can we construct such an embedding? Roughly the idea is to look at large amounts of text (here 5 billion words from the web) and then see “how similar” the “environments” are in which different words appear. So, for example, “alligator” and “crocodile” will often appear almost interchangeably in otherwise similar sentences, and that means they’ll be placed nearby in the embedding. But “turnip” and “eagle” won’t tend to appear in otherwise similar sentences, so they’ll be placed far apart in the embedding.
而且,是的,我们看到的内容确实很好地捕捉到了典型的日常印象。但如何构建这样的嵌入呢?大致的想法是查看大量文本(这里是从网络上获取的 50 亿个单词),然后看看不同单词出现的“环境”有多“相似”。例如,“鳄鱼”和“鳄鱼”通常会在其他相似的句子中几乎互换出现,这意味着它们在嵌入中会被放置在附近。但“萝卜”和“鹰”不会倾向于出现在其他相似的句子中,所以它们在嵌入中会被放置得很远。
But how does one actually implement something like this using neural nets? Let’s start by talking about embeddings not for words, but for images. We want to find some way to characterize images by lists of numbers in such a way that “images we consider similar” are assigned similar lists of numbers.
但是如何使用神经网络实际实现这样的功能呢?让我们从讨论嵌入开始,不是针对单词,而是针对图像。我们希望找到某种方法,通过数字列表来描述图像,使得“我们认为相似的图像”被分配到相似的数字列表。
How do we tell if we should “consider images similar”? Well, if our images are, say, of handwritten digits we might “consider two images similar” if they are of the same digit. Earlier we discussed a neural net that was trained to recognize handwritten digits. And we can think of this neural net as being set up so that in its final output it puts images into 10 different bins, one for each digit.
我们如何判断是否应该“认为图像相似”?好吧,如果我们的图像是手写数字,我们可能会“认为两个图像相似”,如果它们是相同的数字。之前我们讨论了一个被训练来识别手写数字的神经网络。我们可以认为这个神经网络被设置成在其最终输出中将图像放入 10 个不同的箱子中,每个数字一个箱子。
But what if we “intercept” what’s going on inside the neural net before the final “it’s a ‘4’” decision is made? We might expect that inside the neural net there are numbers that characterize images as being “mostly 4-like but a bit 2-like” or some such. And the idea is to pick up such numbers to use as elements in an embedding.
但如果在神经网络内部做出最终“这是‘4’”的决定之前“拦截”正在发生的事情呢?我们可能会预期在神经网络内部有一些数字,它们将图像描述为“主要像 4,但有点像 2”或类似的特征。而想法是提取这些数字,用作嵌入中的元素。
So here’s the concept. Rather than directly trying to characterize “what image is near what other image”, we instead consider a well-defined task (in this case digit recognition) for which we can get explicit training data—then use the fact that in doing this task the neural net implicitly has to make what amount to “nearness decisions”. So instead of us ever explicitly having to talk about “nearness of images” we’re just talking about the concrete question of what digit an image represents, and then we’re “leaving it to the neural net” to implicitly determine what that implies about “nearness of images”.
所以这是概念。我们不是直接尝试描述“什么图像靠近什么其他图像”,而是考虑一个定义明确的任务(在这种情况下是数字识别),我们可以获得明确的训练数据——然后利用神经网络在执行此任务时隐含地必须做出“接近决策”这一事实。因此,我们不必明确地谈论“图像的接近性”,我们只是谈论图像代表什么数字的具体问题,然后“让神经网络”隐含地确定这关于“图像的接近性”意味着什么。
So how in more detail does this work for the digit recognition network? We can think of the network as consisting of 11 successive layers, that we might summarize iconically like this (with activation functions shown as separate layers):
那么,这种数字识别网络是如何更详细地工作的呢?我们可以将网络视为由 11 个连续层组成,我们可以像这样图示总结(激活函数显示为单独的层):
At the beginning we’re feeding into the first layer actual images, represented by 2D arrays of pixel values. And at the end—from the last layer—we’re getting out an array of 10 values, which we can think of saying “how certain” the network is that the image corresponds to each of the digits 0 through 9.
在开始时,我们将实际图像输入到第一层,这些图像由像素值的二维数组表示。最后——从最后一层——我们得到一个包含 10 个值的数组,可以认为它表示网络对图像对应于数字 0 到 9 的“确定程度”。
Feed in the image and the values of the neurons in that last layer are:
输入图像 ,最后一层神经元的值为:
In other words, the neural net is by this point “incredibly certain” that this image is a 4—and to actually get the output “4” we just have to pick out the position of the neuron with the largest value.
换句话说,神经网络此时“非常确定”这张图像是 4——实际上要得到输出“4”,我们只需要找出值最大的神经元的位置。
But what if we look one step earlier? The very last operation in the network is a so-called softmax which tries to “force certainty”. But before that’s been applied the values of the neurons are:
但要是我们再往前看一步呢?网络中的最后一个操作是一个所谓的 softmax,它试图“强制确定性”。但在应用这个操作之前,神经元的值是:
The neuron representing “4” still has the highest numerical value. But there’s also information in the values of the other neurons. And we can expect that this list of numbers can in a sense be used to characterize the “essence” of the image—and thus to provide something we can use as an embedding. And so, for example, each of the 4’s here has a slightly different “signature” (or “feature embedding”)—all very different from the 8’s:
表示“4”的神经元仍然具有最高的数值。但在其他神经元的值中也有信息。我们可以预期,这个数字列表在某种意义上可以用来表征图像的“本质”,从而提供我们可以用作嵌入的东西。因此,例如,这里的每个“4”都有稍微不同的“签名”(或“特征嵌入”)——所有这些都与“8”非常不同:
Here we’re essentially using 10 numbers to characterize our images. But it’s often better to use much more than that. And for example in our digit recognition network we can get an array of 500 numbers by tapping into the preceding layer. And this is probably a reasonable array to use as an “image embedding”.
在这里,我们基本上使用 10 个数字来描述我们的图像。但通常使用更多的数字会更好。例如,在我们的数字识别网络中,我们可以通过访问前一层来获得一个包含 500 个数字的数组。这可能是一个合理的数组,可以用作“图像嵌入”。
If we want to make an explicit visualization of “image space” for handwritten digits we need to “reduce the dimension”, effectively by projecting the 500-dimensional vector we’ve got into, say, 3D space:
如果我们想要对手写数字的“图像空间”进行明确的可视化,我们需要“降低维度”,通过将我们得到的 500 维向量投影到 3D 空间中来实现:
We’ve just talked about creating a characterization (and thus embedding) for images based effectively on identifying the similarity of images by determining whether (according to our training set) they correspond to the same handwritten digit. And we can do the same thing much more generally for images if we have a training set that identifies, say, which of 5000 common types of object (cat, dog, chair, …) each image is of. And in this way we can make an image embedding that’s “anchored” by our identification of common objects, but then “generalizes around that” according to the behavior of the neural net. And the point is that insofar as that behavior aligns with how we humans perceive and interpret images, this will end up being an embedding that “seems right to us”, and is useful in practice in doing “human-judgement-like” tasks.
我们刚刚讨论了基于有效识别图像相似性(从而嵌入)来创建图像特征化,具体是通过确定(根据我们的训练集)它们是否对应于相同的 handwritten digit。如果我们有一个训练集,可以识别出每个图像属于 5000 种常见物体(猫、狗、椅子……)中的哪一种,那么我们可以更普遍地对图像进行同样的处理。通过这种方式,我们可以创建一个由我们对常见物体的识别“锚定”的图像嵌入,然后根据神经网络的行为“围绕这一点进行泛化”。关键在于,只要这种行为与我们人类感知和解释图像的方式一致,这将最终成为一个“对我们来说似乎正确”的嵌入,并且在实践中对于执行“类似人类判断”的任务非常有用。
OK, so how do we follow the same kind of approach to find embeddings for words? The key is to start from a task about words for which we can readily do training. And the standard such task is “word prediction”. Imagine we’re given “the ___ cat”. Based on a large corpus of text (say, the text content of the web), what are the probabilities for different words that might “fill in the blank”? Or, alternatively, given “___ black ___” what are the probabilities for different “flanking words”?
好的,那么我们如何采用类似的方法来找到单词的嵌入呢?关键是从一个我们可以轻松进行训练的单词任务开始。标准的此类任务是“单词预测”。想象一下,我们得到的是“这只___猫”。基于一个大型文本语料库(例如,网络的文本内容),不同单词“填补空白”的概率是多少?或者,给定“___黑色___”,不同“侧翼单词”的概率是多少?
How do we set this problem up for a neural net? Ultimately we have to formulate everything in terms of numbers. And one way to do this is just to assign a unique number to each of the 50,000 or so common words in English. So, for example, “the” might be 914, and “ cat” (with a space before it) might be 3542. (And these are the actual numbers used by GPT-2.) So for the “the ___ cat” problem, our input might be {914, 3542}. What should the output be like? Well, it should be a list of 50,000 or so numbers that effectively give the probabilities for each of the possible “fill-in” words. And once again, to find an embedding, we want to “intercept” the “insides” of the neural net just before it “reaches its conclusion”—and then pick up the list of numbers that occur there, and that we can think of as “characterizing each word”.
我们如何为神经网络设置这个问题?最终我们必须将一切都用数字来表示。一种方法是为每个大约 50,000 个常用英语单词分配一个唯一的数字。例如,“the”可能是 914,“ cat”(前面有一个空格)可能是 3542。(这些是 GPT-2 实际使用的数字。)因此,对于“the ___ cat”问题,我们的输入可能是{914, 3542}。输出应该是什么样的?它应该是一个包含大约 50,000 个数字的列表,有效地给出每个可能的“填充”单词的概率。再次强调,为了找到嵌入,我们希望在神经网络“得出结论”之前“拦截”其“内部”——然后拾取那里出现的数字列表,我们可以将其视为“表征每个单词”。
OK, so what do those characterizations look like? Over the past 10 years there’ve been a sequence of different systems developed (word2vec, GloVe, BERT, GPT, …), each based on a different neural net approach. But ultimately all of them take words and characterize them by lists of hundreds to thousands of numbers.
好的,那么这些特征化是什么样子的呢?在过去的 10 年里,已经开发了一系列不同的系统(word2vec、GloVe、BERT、GPT 等),每个系统都基于不同的神经网络方法。但最终,它们都通过数百到数千个数字的列表来表征单词。
In their raw form, these “embedding vectors” are quite uninformative. For example, here’s what GPT-2 produces as the raw embedding vectors for three specific words:
在它们的原始形式中,这些“嵌入向量”相当不具信息性。例如,以下是 GPT-2 为三个特定单词生成的原始嵌入向量:
If we do things like measure distances between these vectors, then we can find things like “nearnesses” of words. Later we’ll discuss in more detail what we might consider the “cognitive” significance of such embeddings. But for now the main point is that we have a way to usefully turn words into “neural-net-friendly” collections of numbers.
如果我们像测量这些向量之间的距离那样做事情,那么我们就能找到诸如词语“接近性”之类的东西。稍后我们将更详细地讨论我们可能会考虑的这种嵌入的“认知”意义。但目前的主要观点是,我们有一种方法可以将词语有效地转化为“神经网络友好”的数字集合。
But actually we can go further than just characterizing words by collections of numbers; we can also do this for sequences of words, or indeed whole blocks of text. And inside ChatGPT that’s how it’s dealing with things. It takes the text it’s got so far, and generates an embedding vector to represent it. Then its goal is to find the probabilities for different words that might occur next. And it represents its answer for this as a list of numbers that essentially give the probabilities for each of the 50,000 or so possible words.
但实际上,我们不仅可以将单词表示为数字集合,还可以表示单词序列或整个文本块。在 ChatGPT 内部,这就是它处理事物的方式。它获取到目前为止的文本,并生成一个嵌入向量来表示它。然后,它的目标是为可能出现的不同单词找到概率。它将这个答案表示为一个数字列表,这些数字基本上给出了大约 50,000 个可能单词中每个单词的概率。
(Strictly, ChatGPT does not deal with words, but rather with “tokens”—convenient linguistic units that might be whole words, or might just be pieces like “pre” or “ing” or “ized”. Working with tokens makes it easier for ChatGPT to handle rare, compound and non-English words, and, sometimes, for better or worse, to invent new words.)
(严格来说,ChatGPT 不处理单词,而是处理“标记”——这些方便的语言单位可能是完整的单词,或者只是像“pre”或“ing”或“ized”这样的片段。使用标记使 ChatGPT 更容易处理稀有、复合和非英语单词,有时,无论是好是坏,还能创造新词。)
Inside ChatGPT ChatGPT 内部
OK, so we’re finally ready to discuss what’s inside ChatGPT. And, yes, ultimately, it’s a giant neural net—currently a version of the so-called GPT-3 network with 175 billion weights. In many ways this is a neural net very much like the other ones we’ve discussed. But it’s a neural net that’s particularly set up for dealing with language. And its most notable feature is a piece of neural net architecture called a “transformer”.
好的,所以我们终于准备好讨论 ChatGPT 的内部结构了。是的,最终它是一个巨大的神经网络——目前是所谓的 GPT-3 网络的一个版本,拥有 1750 亿个权重。在许多方面,这个神经网络与我们讨论过的其他神经网络非常相似。但它是一个特别为处理语言而设置的神经网络。它最显著的特点是一种称为“transformer”的神经网络架构。
In the first neural nets we discussed above, every neuron at any given layer was basically connected (at least with some weight) to every neuron on the layer before. But this kind of fully connected network is (presumably) overkill if one’s working with data that has particular, known structure. And thus, for example, in the early stages of dealing with images, it’s typical to use so-called convolutional neural nets (“convnets”) in which neurons are effectively laid out on a grid analogous to the pixels in the image—and connected only to neurons nearby on the grid.
在上述我们讨论的第一个神经网络中,任何给定层的每个神经元基本上都与前一层的每个神经元(至少在某些权重上)连接。但如果处理的数据具有特定且已知的结构,这种全连接网络可能是(可能)过度设计的。因此,例如,在处理图像的早期阶段,通常会使用所谓的卷积神经网络(“convnets”),其中神经元有效地排列在类似于图像像素的网格上——并且仅与网格上附近的神经元连接。
The idea of transformers is to do something at least somewhat similar for sequences of tokens that make up a piece of text. But instead of just defining a fixed region in the sequence over which there can be connections, transformers instead introduce the notion of “attention”—and the idea of “paying attention” more to some parts of the sequence than others. Maybe one day it’ll make sense to just start a generic neural net and do all customization through training. But at least as of now it seems to be critical in practice to “modularize” things—as transformers do, and probably as our brains also do.
变压器的想法是对构成一段文本的标记序列做一些至少有点类似的事情。但与其只是在序列中定义一个固定的区域,在这个区域中可以有连接,变压器引入了“注意力”的概念——以及“更关注”序列的某些部分而不是其他部分的想法。也许有一天,直接启动一个通用的神经网络并通过训练进行所有定制会变得合理。但至少到目前为止,在实践中“模块化”事物——就像变压器所做的那样,可能也像我们的大脑所做的那样——似乎是至关重要的。
OK, so what does ChatGPT (or, rather, the GPT-3 network on which it’s based) actually do? Recall that its overall goal is to continue text in a “reasonable” way, based on what it’s seen from the training it’s had (which consists in looking at billions of pages of text from the web, etc.) So at any given point, it’s got a certain amount of text—and its goal is to come up with an appropriate choice for the next token to add.
好的,那么 ChatGPT(或者更确切地说,它所基于的 GPT-3 网络)实际上是做什么的呢?回想一下,它的总体目标是根据它从训练中学到的内容(包括查看来自网络等的数十亿页文本),以“合理”的方式继续文本。因此,在任何给定点,它都有一定数量的文本——它的目标是提出一个适当的下一个标记选择来添加。
It operates in three basic stages. First, it takes the sequence of tokens that corresponds to the text so far, and finds an embedding (i.e. an array of numbers) that represents these. Then it operates on this embedding—in a “standard neural net way”, with values “rippling through” successive layers in a network—to produce a new embedding (i.e. a new array of numbers). It then takes the last part of this array and generates from it an array of about 50,000 values that turn into probabilities for different possible next tokens. (And, yes, it so happens that there are about the same number of tokens used as there are common words in English, though only about 3000 of the tokens are whole words, and the rest are fragments.)
它分三个基本阶段运行。首先,它获取与迄今为止的文本相对应的标记序列,并找到一个表示这些标记的嵌入(即数字数组)。然后,它以“标准神经网络方式”对这个嵌入进行操作,通过网络中的连续层“传递”值,以产生一个新的嵌入(即一个新的数字数组)。然后,它获取这个数组的最后一部分,并从中生成一个大约 50,000 个值的数组,这些值变成不同可能的下一个标记的概率。(是的,碰巧使用的标记数量与英语中的常用单词数量大致相同,尽管只有大约 3000 个标记是完整单词,其余的是片段。)
A critical point is that every part of this pipeline is implemented by a neural network, whose weights are determined by end-to-end training of the network. In other words, in effect nothing except the overall architecture is “explicitly engineered”; everything is just “learned” from training data.
一个关键点是,这个管道的每个部分都是由一个神经网络实现的,其权重由网络的端到端训练决定。换句话说,除了整体架构之外,实际上没有任何东西是“明确设计的”;所有东西都是从训练数据中“学习”的。
There are, however, plenty of details in the way the architecture is set up—reflecting all sorts of experience and neural net lore. And—even though this is definitely going into the weeds—I think it’s useful to talk about some of those details, not least to get a sense of just what goes into building something like ChatGPT.
然而,架构的设置方式中有许多细节——反映了各种经验和神经网络知识。而且——尽管这确实有点深入细节——我认为讨论其中一些细节是有用的,尤其是因为可以了解构建像 ChatGPT 这样的东西需要什么。
First comes the embedding module. Here’s a schematic Wolfram Language representation for it for GPT-2:
首先是嵌入模块。以下是 GPT-2 的嵌入模块的 Wolfram 语言表示图:
The input is a vector of n tokens (represented as in the previous section by integers from 1 to about 50,000). Each of these tokens is converted (by a single-layer neural net) into an embedding vector (of length 768 for GPT-2 and 12,288 for ChatGPT’s GPT-3). Meanwhile, there’s a “secondary pathway” that takes the sequence of (integer) positions for the tokens, and from these integers creates another embedding vector. And finally the embedding vectors from the token value and the token position are added together—to produce the final sequence of embedding vectors from the embedding module.
输入是一个包含 n 个标记的向量(在前面的部分中表示为从 1 到大约 50,000 的整数)。这些标记中的每一个都通过一个单层神经网络转换为嵌入向量(对于 GPT-2 长度为 768,对于 ChatGPT 的 GPT-3 长度为 12,288)。同时,有一个“次要路径”接收标记的(整数)位置序列,并从这些整数创建另一个嵌入向量。最后,来自标记值和标记位置的嵌入向量被相加——以产生来自嵌入模块的最终嵌入向量序列。
Why does one just add the token-value and token-position embedding vectors together? I don’t think there’s any particular science to this. It’s just that various different things have been tried, and this is one that seems to work. And it’s part of the lore of neural nets that—in some sense—so long as the setup one has is “roughly right” it’s usually possible to home in on details just by doing sufficient training, without ever really needing to “understand at an engineering level” quite how the neural net has ended up configuring itself.
为什么只是将 token-value 和 token-position 嵌入向量相加?我不认为这有什么特别的科学依据。只是尝试了各种不同的方法,而这种方法似乎有效。而且,神经网络的一部分知识是——在某种意义上——只要设置大致正确,通常可以通过足够的训练来细化细节,而不需要真正“在工程层面上理解”神经网络是如何配置的。
Here’s what the embedding module does, operating on the string hello hello hello hello hello hello hello hello hello hello bye bye bye bye bye bye bye bye bye bye:
这是嵌入模块的功能,对字符串“hello hello hello hello hello hello hello hello hello hello bye bye bye bye bye bye bye bye bye bye”进行操作:
The elements of the embedding vector for each token are shown down the page, and across the page we see first a run of “hello” embeddings, followed by a run of “bye” ones. The second array above is the positional embedding—with its somewhat-random-looking structure being just what “happened to be learned” (in this case in GPT-2).
每个标记的嵌入向量的元素显示在页面下方,在页面横向上,我们首先看到一串“hello”嵌入,然后是一串“bye”嵌入。上面的第二个数组是位置嵌入——其看起来有些随机的结构正是“碰巧被学习到的”(在这种情况下是 GPT-2)。
OK, so after the embedding module comes the “main event” of the transformer: a sequence of so-called “attention blocks” (12 for GPT-2, 96 for ChatGPT’s GPT-3). It’s all pretty complicated—and reminiscent of typical large hard-to-understand engineering systems, or, for that matter, biological systems. But anyway, here’s a schematic representation of a single “attention block” (for GPT-2):
好的,所以在嵌入模块之后是变压器的“主事件”:一系列所谓的“注意力块”(GPT-2 有 12 个,ChatGPT 的 GPT-3 有 96 个)。这一切都非常复杂——让人想起典型的大型难以理解的工程系统,或者就此而言,生物系统。但无论如何,这里有一个单个“注意力块”(用于 GPT-2)的示意图表示:
Within each such attention block there are a collection of “attention heads” (12 for GPT-2, 96 for ChatGPT’s GPT-3)—each of which operates independently on different chunks of values in the embedding vector. (And, yes, we don’t know any particular reason why it’s a good idea to split up the embedding vector, or what the different parts of it “mean”; this is just one of those things that’s been “found to work”.)
在每个这样的注意力块中,有一组“注意力头”(GPT-2 有 12 个,ChatGPT 的 GPT-3 有 96 个)——每个注意力头独立地对嵌入向量中的不同块进行操作。(而且,是的,我们不知道任何特定的原因为什么将嵌入向量分开是个好主意,或者它的不同部分“意味着”什么;这只是那些“被发现有效”的事情之一。)
OK, so what do the attention heads do? Basically they’re a way of “looking back” in the sequence of tokens (i.e. in the text produced so far), and “packaging up the past” in a form that’s useful for finding the next token. In the first section above we talked about using 2-gram probabilities to pick words based on their immediate predecessors. What the “attention” mechanism in transformers does is to allow “attention to” even much earlier words—thus potentially capturing the way, say, verbs can refer to nouns that appear many words before them in a sentence.
好的,那么注意力头是做什么的呢?基本上,它们是一种在标记序列(即到目前为止生成的文本)中“回顾”的方式,并以对找到下一个标记有用的形式“打包过去”。在上面第一部分中,我们讨论了使用 2-gram 概率根据其直接前驱选择单词。变压器中的“注意力”机制所做的是允许“关注”更早的单词——从而可能捕捉到动词可以指代句子中出现在它们之前许多词的名词的方式。
At a more detailed level, what an attention head does is to recombine chunks in the embedding vectors associated with different tokens, with certain weights. And so, for example, the 12 attention heads in the first attention block (in GPT-2) have the following (“look-back-all-the-way-to-the-beginning-of-the-sequence-of-tokens”) patterns of “recombination weights” for the “hello, bye” string above:
在更详细的层面上,注意力头的作用是通过某些权重重新组合与不同标记相关的嵌入向量中的块。因此,例如,第一个注意力块中的 12 个注意力头(在 GPT-2 中)对于上述“hello, bye”字符串具有以下(“回顾到标记序列的开头”)“重组权重”模式:
After being processed by the attention heads, the resulting “re-weighted embedding vector” (of length 768 for GPT-2 and length 12,288 for ChatGPT’s GPT-3) is passed through a standard “fully connected” neural net layer. It’s hard to get a handle on what this layer is doing. But here’s a plot of the 768×768 matrix of weights it’s using (here for GPT-2):
经过注意力头处理后,生成的“重新加权的嵌入向量”(对于 GPT-2 长度为 768,对于 ChatGPT 的 GPT-3 长度为 12,288)通过一个标准的“全连接”神经网络层传递。很难理解这一层在做什么。但这里有一个它使用的 768×768 权重矩阵的图(这里是针对 GPT-2 的):
Taking 64×64 moving averages, some (random-walk-ish) structure begins to emerge:
采用 64×64 移动平均,一些(类似随机游走的)结构开始显现:
What determines this structure? Ultimately it’s presumably some “neural net encoding” of features of human language. But as of now, what those features might be is quite unknown. In effect, we’re “opening up the brain of ChatGPT” (or at least GPT-2) and discovering, yes, it’s complicated in there, and we don’t understand it—even though in the end it’s producing recognizable human language.
是什么决定了这种结构?最终,这可能是某种“神经网络编码”的人类语言特征。但到目前为止,这些特征可能是什么还完全未知。实际上,我们正在“打开 ChatGPT(或至少是 GPT-2)的大脑”,并发现,是的,里面很复杂,我们还不理解它——尽管最终它产生了可识别的人类语言。
OK, so after going through one attention block, we’ve got a new embedding vector—which is then successively passed through additional attention blocks (a total of 12 for GPT-2; 96 for GPT-3). Each attention block has its own particular pattern of “attention” and “fully connected” weights. Here for GPT-2 are the sequence of attention weights for the “hello, bye” input, for the first attention head:
好的,所以在经过一个注意力块之后,我们得到了一个新的嵌入向量——然后它被依次传递到更多的注意力块(GPT-2 总共 12 个;GPT-3 总共 96 个)。每个注意力块都有其特定的“注意力”和“全连接”权重模式。这里是 GPT-2 的“hello, bye”输入的第一个注意力头的注意力权重序列:
And here are the (moving-averaged) “matrices” for the fully connected layers:
这里是全连接层的(移动平均)“矩阵”:
Curiously, even though these “matrices of weights” in different attention blocks look quite similar, the distributions of the sizes of weights can be somewhat different (and are not always Gaussian):
令人好奇的是,尽管这些不同注意力块中的“权重矩阵”看起来非常相似,但权重大小的分布可能会有所不同(并且不总是高斯分布):
So after going through all these attention blocks what is the net effect of the transformer? Essentially it’s to transform the original collection of embeddings for the sequence of tokens to a final collection. And the particular way ChatGPT works is then to pick up the last embedding in this collection, and “decode” it to produce a list of probabilities for what token should come next.
所以在经历了所有这些注意力块之后,变压器的净效应是什么?本质上,它是将原始的令牌序列的嵌入集合转换为最终的集合。而 ChatGPT 的具体工作方式是选择这个集合中的最后一个嵌入,并“解码”它以产生下一个令牌的概率列表。
So that’s in outline what’s inside ChatGPT. It may seem complicated (not least because of its many inevitably somewhat arbitrary “engineering choices”), but actually the ultimate elements involved are remarkably simple. Because in the end what we’re dealing with is just a neural net made of “artificial neurons”, each doing the simple operation of taking a collection of numerical inputs, and then combining them with certain weights.
所以这就是 ChatGPT 的概要内容。它可能看起来很复杂(尤其是因为它有许多不可避免的有些任意的“工程选择”),但实际上所涉及的最终元素非常简单。因为最终我们所处理的是一个由“人工神经元”组成的神经网络,每个神经元执行简单的操作,即获取一组数值输入,然后用某些权重将它们组合起来。
The original input to ChatGPT is an array of numbers (the embedding vectors for the tokens so far), and what happens when ChatGPT “runs” to produce a new token is just that these numbers “ripple through” the layers of the neural net, with each neuron “doing its thing” and passing the result to neurons on the next layer. There’s no looping or “going back”. Everything just “feeds forward” through the network.
ChatGPT 的原始输入是一组数字(到目前为止的标记的嵌入向量),当 ChatGPT“运行”以生成新标记时,这些数字只是在神经网络的各层中“波动”,每个神经元“做它的事情”并将结果传递给下一层的神经元。没有循环或“回退”。一切只是通过网络“前馈”。
It’s a very different setup from a typical computational system—like a Turing machine—in which results are repeatedly “reprocessed” by the same computational elements. Here—at least in generating a given token of output—each computational element (i.e. neuron) is used only once.
这是一个与典型计算系统(如图灵机)非常不同的设置,在典型计算系统中,结果会由相同的计算元素反复“重新处理”。在这里——至少在生成给定的输出标记时——每个计算元素(即神经元)只使用一次。
But there is in a sense still an “outer loop” that reuses computational elements even in ChatGPT. Because when ChatGPT is going to generate a new token, it always “reads” (i.e. takes as input) the whole sequence of tokens that come before it, including tokens that ChatGPT itself has “written” previously. And we can think of this setup as meaning that ChatGPT does—at least at its outermost level—involve a “feedback loop”, albeit one in which every iteration is explicitly visible as a token that appears in the text that it generates.
但在某种意义上,ChatGPT 中仍然存在一个“外部循环”,它甚至在 ChatGPT 中重用计算元素。因为当 ChatGPT 要生成一个新标记时,它总是“读取”(即作为输入)它之前的整个标记序列,包括 ChatGPT 自己之前“写入”的标记。我们可以认为这种设置意味着 ChatGPT 在其最外层级别上确实涉及一个“反馈循环”,尽管在这个循环中,每次迭代都明确可见为其生成的文本中的一个标记。
But let’s come back to the core of ChatGPT: the neural net that’s being repeatedly used to generate each token. At some level it’s very simple: a whole collection of identical artificial neurons. And some parts of the network just consist of (“fully connected”) layers of neurons in which every neuron on a given layer is connected (with some weight) to every neuron on the layer before. But particularly with its transformer architecture, ChatGPT has parts with more structure, in which only specific neurons on different layers are connected. (Of course, one could still say that “all neurons are connected”—but some just have zero weight.)
但让我们回到 ChatGPT 的核心:被反复用来生成每个标记的神经网络。在某种程度上,它非常简单:一大堆相同的人工神经元。网络的某些部分仅由(“全连接”)层组成,其中给定层上的每个神经元(具有某种权重)都连接到前一层上的每个神经元。但特别是由于其 transformer 架构,ChatGPT 的部分具有更多结构,其中只有不同层上的特定神经元相互连接。(当然,人们仍然可以说“所有神经元都连接”——但有些只是权重为零。)
In addition, there are aspects of the neural net in ChatGPT that aren’t most naturally thought of as just consisting of “homogeneous” layers. And for example—as the iconic summary above indicates—inside an attention block there are places where “multiple copies are made” of incoming data, each then going through a different “processing path”, potentially involving a different number of layers, and only later recombining. But while this may be a convenient representation of what’s going on, it’s always at least in principle possible to think of “densely filling in” layers, but just having some weights be zero.
此外,ChatGPT 中的神经网络的某些方面并不自然地被认为是仅由“同质”层组成的。例如——如上述标志性总结所示——在注意力块内部,传入数据会在某些地方被“复制多份”,每份随后通过不同的“处理路径”,可能涉及不同数量的层,并且仅在稍后重新组合。尽管这可能是对正在发生的事情的一种方便表示,但在原则上,总是至少有可能将层“密集填充”,只是有些权重为零。
If one looks at the longest path through ChatGPT, there are about 400 (core) layers involved—in some ways not a huge number. But there are millions of neurons—with a total of 175 billion connections and therefore 175 billion weights. And one thing to realize is that every time ChatGPT generates a new token, it has to do a calculation involving every single one of these weights. Implementationally these calculations can be somewhat organized “by layer” into highly parallel array operations that can conveniently be done on GPUs. But for each token that’s produced, there still have to be 175 billion calculations done (and in the end a bit more)—so that, yes, it’s not surprising that it can take a while to generate a long piece of text with ChatGPT.
如果查看通过 ChatGPT 的最长路径,大约涉及 400(核心)层——在某些方面这不是一个巨大的数字。但有数百万个神经元——总共有 1750 亿个连接,因此有 1750 亿个权重。需要意识到的是,每次 ChatGPT 生成一个新标记时,它都必须进行涉及每个权重的计算。在实现上,这些计算可以“按层”组织成高度并行的数组操作,这些操作可以在 GPU 上方便地完成。但对于每个生成的标记,仍然必须进行 1750 亿次计算(最后还要多一点)——因此,确实,使用 ChatGPT 生成一篇长文需要一段时间并不令人惊讶。
But in the end, the remarkable thing is that all these operations—individually as simple as they are—can somehow together manage to do such a good “human-like” job of generating text. It has to be emphasized again that (at least so far as we know) there’s no “ultimate theoretical reason” why anything like this should work. And in fact, as we’ll discuss, I think we have to view this as a—potentially surprising—scientific discovery: that somehow in a neural net like ChatGPT’s it’s possible to capture the essence of what human brains manage to do in generating language.
但最终,令人惊奇的是,所有这些操作——尽管它们各自都很简单——却能以某种方式共同完成如此出色的“类人”文本生成工作。必须再次强调的是(至少据我们所知),没有任何“终极理论原因”可以解释为什么这样的事情应该有效。事实上,正如我们将讨论的,我认为我们必须将此视为一个——可能令人惊讶的——科学发现:即在像 ChatGPT 这样的神经网络中,可以捕捉到人类大脑在生成语言时所实现的核心本质。
The Training of ChatGPT
ChatGPT 的训练
OK, so we’ve now given an outline of how ChatGPT works once it’s set up. But how did it get set up? How were all those 175 billion weights in its neural net determined? Basically they’re the result of very large-scale training, based on a huge corpus of text—on the web, in books, etc.—written by humans. As we’ve said, even given all that training data, it’s certainly not obvious that a neural net would be able to successfully produce “human-like” text. And, once again, there seem to be detailed pieces of engineering needed to make that happen. But the big surprise—and discovery—of ChatGPT is that it’s possible at all. And that—in effect—a neural net with “just” 175 billion weights can make a “reasonable model” of text humans write.
好的,所以我们现在已经概述了 ChatGPT 在设置后如何工作。但它是如何设置的呢?它的神经网络中的所有 1750 亿个权重是如何确定的?基本上,它们是基于大量文本语料库进行大规模训练的结果——这些文本来自网络、书籍等,由人类撰写。正如我们所说,即使有所有这些训练数据,神经网络能否成功生成“类人”文本也绝非显而易见。而且,再次强调,似乎需要详细的工程细节才能实现这一点。但 ChatGPT 的巨大惊喜和发现是,这完全是可能的。而且,实际上,一个拥有“仅仅”1750 亿个权重的神经网络可以成为一个“合理模型”,模拟人类撰写的文本。
In modern times, there’s lots of text written by humans that’s out there in digital form. The public web has at least several billion human-written pages, with altogether perhaps a trillion words of text. And if one includes non-public webpages, the numbers might be at least 100 times larger. So far, more than 5 million digitized books have been made available (out of 100 million or so that have ever been published), giving another 100 billion or so words of text. And that’s not even mentioning text derived from speech in videos, etc. (As a personal comparison, my total lifetime output of published material has been a bit under 3 million words, and over the past 30 years I’ve written about 15 million words of email, and altogether typed perhaps 50 million words—and in just the past couple of years I’ve spoken more than 10 million words on livestreams. And, yes, I’ll train a bot from all of that.)
在现代,有许多以数字形式存在的人类撰写的文本。公共网络至少有数十亿个人类撰写的页面,总共可能有数万亿字的文本。如果包括非公开的网页,这些数字可能至少大 100 倍。到目前为止,已有超过 500 万本数字化书籍可供使用(总共约有 1 亿本出版物),提供了大约 1000 亿字的文本。这甚至还没有提到从视频中的语音派生的文本等。(作为个人比较,我一生中出版物的总字数略低于 300 万字,在过去 30 年里我写了大约 1500 万字的电子邮件,总共打字可能 5000 万字——而在过去几年里,我在直播中说了超过 1000 万字。而且,是的,我会用所有这些来训练一个机器人。)
But, OK, given all this data, how does one train a neural net from it? The basic process is very much as we discussed it in the simple examples above. You present a batch of examples, and then you adjust the weights in the network to minimize the error (“loss”) that the network makes on those examples. The main thing that’s expensive about “back propagating” from the error is that each time you do this, every weight in the network will typically change at least a tiny bit, and there are just a lot of weights to deal with. (The actual “back computation” is typically only a small constant factor harder than the forward one.)
但是,好吧,考虑到所有这些数据,如何从这些数据中训练神经网络呢?基本过程与我们上面讨论的简单示例非常相似。您提供一批示例,然后调整网络中的权重以最小化网络在这些示例上产生的误差(“损失”)。关于从误差“反向传播”的主要问题是,每次执行此操作时,网络中的每个权重通常都会至少改变一点点,而且需要处理的权重数量非常多。(实际的“反向计算”通常仅比前向计算难一个小常数因子。)
With modern GPU hardware, it’s straightforward to compute the results from batches of thousands of examples in parallel. But when it comes to actually updating the weights in the neural net, current methods require one to do this basically batch by batch. (And, yes, this is probably where actual brains—with their combined computation and memory elements—have, for now, at least an architectural advantage.)
在现代 GPU 硬件上,并行计算来自数千个示例批次的结果非常简单。但当涉及到实际更新神经网络中的权重时,当前的方法基本上需要逐批次进行。(而且,是的,这可能是实际大脑——具有结合计算和存储元素——至少在架构上目前具有优势的地方。)
Even in the seemingly simple cases of learning numerical functions that we discussed earlier, we found we often had to use millions of examples to successfully train a network, at least from scratch. So how many examples does this mean we’ll need in order to train a “human-like language” model? There doesn’t seem to be any fundamental “theoretical” way to know. But in practice ChatGPT was successfully trained on a few hundred billion words of text.
即使在我们之前讨论的学习数值函数的看似简单的情况下,我们发现通常需要使用数百万个例子才能成功训练一个网络,至少从头开始。那么,这又意味着我们需要多少例子才能训练一个“类人语言”模型呢?似乎没有任何根本的“理论”方法可以知道。但在实践中,ChatGPT 成功地在一万亿字左右的文本上进行了训练。
Some of the text it was fed several times, some of it only once. But somehow it “got what it needed” from the text it saw. But given this volume of text to learn from, how large a network should it require to “learn it well”? Again, we don’t yet have a fundamental theoretical way to say. Ultimately—as we’ll discuss further below—there’s presumably a certain “total algorithmic content” to human language and what humans typically say with it. But the next question is how efficient a neural net will be at implementing a model based on that algorithmic content. And again we don’t know—although the success of ChatGPT suggests it’s reasonably efficient.
有些文本被它多次输入,有些只输入了一次。但不知怎么的,它“从看到的文本中得到了所需的东西”。但是,考虑到这种学习文本的量,它需要多大的网络才能“学好”呢?我们还没有一个基本的理论方法来回答这个问题。最终——正如我们将在下面进一步讨论的那样——人类语言和人类通常用它说的话可能有一个“总的算法内容”。但下一个问题是,神经网络在基于这种算法内容实现模型时会有多高效。我们仍然不知道——尽管 ChatGPT 的成功表明它是相当高效的。
And in the end we can just note that ChatGPT does what it does using a couple hundred billion weights—comparable in number to the total number of words (or tokens) of training data it’s been given. In some ways it’s perhaps surprising (though empirically observed also in smaller analogs of ChatGPT) that the “size of the network” that seems to work well is so comparable to the “size of the training data”. After all, it’s certainly not that somehow “inside ChatGPT” all that text from the web and books and so on is “directly stored”. Because what’s actually inside ChatGPT are a bunch of numbers—with a bit less than 10 digits of precision—that are some kind of distributed encoding of the aggregate structure of all that text.
最后我们可以注意到,ChatGPT 通过使用数千亿个权重来完成它的工作——这与它所接收的训练数据的总字数(或标记数)相当。在某些方面,这可能是令人惊讶的(尽管在较小的 ChatGPT 类似模型中也有经验观察到),“网络的大小”似乎与“训练数据的大小”相当。毕竟,这当然不是说“在 ChatGPT 内部”所有来自网络和书籍等的文本都“直接存储”。因为实际上在 ChatGPT 内部的是一堆数字——精度略低于 10 位——这些数字是所有文本的聚合结构的某种分布式编码。
Put another way, we might ask what the “effective information content” is of human language and what’s typically said with it. There’s the raw corpus of examples of language. And then there’s the representation in the neural net of ChatGPT. That representation is very likely far from the “algorithmically minimal” representation (as we’ll discuss below). But it’s a representation that’s readily usable by the neural net. And in this representation it seems there’s in the end rather little “compression” of the training data; it seems on average to basically take only a bit less than one neural net weight to carry the “information content” of a word of training data.
换句话说,我们可能会问,人类语言的“有效信息量”是什么,通常用它说什么。有语言的原始语料库。然后是 ChatGPT 神经网络中的表示。这种表示很可能远不是“算法上最小的”表示(正如我们将在下面讨论的)。但它是神经网络容易使用的表示。在这种表示中,似乎最终对训练数据的“压缩”很少;似乎平均而言,基本上只需要不到一个神经网络权重来承载一个词的训练数据“信息量”。
When we run ChatGPT to generate text, we’re basically having to use each weight once. So if there are n weights, we’ve got of order n computational steps to do—though in practice many of them can typically be done in parallel in GPUs. But if we need about n words of training data to set up those weights, then from what we’ve said above we can conclude that we’ll need about n2 computational steps to do the training of the network—which is why, with current methods, one ends up needing to talk about billion-dollar training efforts.
当我们运行 ChatGPT 生成文本时,基本上每次都需要使用每个权重一次。因此,如果有 n 个权重,我们需要进行大约 n 次计算步骤——尽管在实践中,许多步骤通常可以在 GPU 中并行完成。但如果我们需要大约 n 个词的训练数据来设置这些权重,那么根据我们上面所说的,我们可以得出结论,我们需要大约 n2 次计算步骤来训练网络——这就是为什么,使用当前方法,最终需要谈论数十亿美元的训练工作。
Beyond Basic Training 超越基础训练
The majority of the effort in training ChatGPT is spent “showing it” large amounts of existing text from the web, books, etc. But it turns out there’s another—apparently rather important—part too.
训练 ChatGPT 的大部分工作是“向它展示”来自网络、书籍等的大量现有文本。但事实证明,还有另一个——显然相当重要的——部分。
As soon as it’s finished its “raw training” from the original corpus of text it’s been shown, the neural net inside ChatGPT is ready to start generating its own text, continuing from prompts, etc. But while the results from this may often seem reasonable, they tend—particularly for longer pieces of text—to “wander off” in often rather non-human-like ways. It’s not something one can readily detect, say, by doing traditional statistics on the text. But it’s something that actual humans reading the text easily notice.
一旦从原始文本语料库中完成了“原始训练”,ChatGPT 内部的神经网络就可以开始生成自己的文本,从提示等继续。但尽管这些结果通常看起来合理,它们往往会——特别是对于较长的文本——以通常非常不像人类的方式“偏离”。这不是可以通过对文本进行传统统计来轻易检测到的。但这是实际阅读文本的人很容易注意到的事情。
And a key idea in the construction of ChatGPT was to have another step after “passively reading” things like the web: to have actual humans actively interact with ChatGPT, see what it produces, and in effect give it feedback on “how to be a good chatbot”. But how can the neural net use that feedback? The first step is just to have humans rate results from the neural net. But then another neural net model is built that attempts to predict those ratings. But now this prediction model can be run—essentially like a loss function—on the original network, in effect allowing that network to be “tuned up” by the human feedback that’s been given. And the results in practice seem to have a big effect on the success of the system in producing “human-like” output.
在构建 ChatGPT 时的一个关键思想是,在“被动阅读”诸如网络上的内容之后,要有另一个步骤:让实际的人类主动与 ChatGPT 互动,查看它产生的内容,并实际上给它反馈“如何成为一个好的聊天机器人”。但是神经网络如何使用这些反馈呢?第一步只是让人类对神经网络的结果进行评级。然后构建另一个神经网络模型,试图预测这些评级。但现在这个预测模型可以运行——本质上像一个损失函数——在原始网络上,实际上允许该网络通过所给出的人类反馈进行“调优”。在实践中,这些结果似乎对系统产生“类人”输出的成功产生了重大影响。
In general, it’s interesting how little “poking” the “originally trained” network seems to need to get it to usefully go in particular directions. One might have thought that to have the network behave as if it’s “learned something new” one would have to go in and run a training algorithm, adjusting weights, and so on.
一般来说,有趣的是,“最初训练的”网络似乎只需要很少的“调整”就能让它有效地朝特定方向发展。人们可能会认为,要让网络表现得好像“学到了新东西”,就必须进入并运行训练算法,调整权重等。
But that’s not the case. Instead, it seems to be sufficient to basically tell ChatGPT something one time—as part of the prompt you give—and then it can successfully make use of what you told it when it generates text. And once again, the fact that this works is, I think, an important clue in understanding what ChatGPT is “really doing” and how it relates to the structure of human language and thinking.
但事实并非如此。相反,似乎只需在给定的提示中一次性告诉 ChatGPT 某些内容,然后它就能成功地利用你告诉它的内容来生成文本。而且再次强调,我认为这一点的有效性是理解 ChatGPT“真正在做什么”以及它如何与人类语言和思维结构相关的重要线索。
There’s certainly something rather human-like about it: that at least once it’s had all that pre-training you can tell it something just once and it can “remember it”—at least “long enough” to generate a piece of text using it. So what’s going on in a case like this? It could be that “everything you might tell it is already in there somewhere”—and you’re just leading it to the right spot. But that doesn’t seem plausible. Instead, what seems more likely is that, yes, the elements are already in there, but the specifics are defined by something like a “trajectory between those elements” and that’s what you’re introducing when you tell it something.
这确实有一些类似人类的特点:至少在你对它进行了所有预训练之后,你可以告诉它一次,它就能“记住它”——至少“足够长时间”以生成一段使用它的文本。那么在这种情况下发生了什么呢?可能是“你可能告诉它的所有内容已经在那里某个地方”——而你只是引导它找到正确的位置。但这种说法似乎不太可信。相反,更有可能的是,是的,元素已经在那里了,但具体内容是由类似于“这些元素之间的轨迹”定义的,而这就是你告诉它某些内容时引入的东西。
And indeed, much like for humans, if you tell it something bizarre and unexpected that completely doesn’t fit into the framework it knows, it doesn’t seem like it’ll successfully be able to “integrate” this. It can “integrate” it only if it’s basically riding in a fairly simple way on top of the framework it already has.
实际上,就像对人类一样,如果你告诉它一些完全不符合它所知道框架的奇怪和意外的事情,它似乎无法成功地“整合”这一点。它只能在基本上以相当简单的方式建立在它已有的框架之上时,才能“整合”它。
It’s also worth pointing out again that there are inevitably “algorithmic limits” to what the neural net can “pick up”. Tell it “shallow” rules of the form “this goes to that”, etc., and the neural net will most likely be able to represent and reproduce these just fine—and indeed what it “already knows” from language will give it an immediate pattern to follow. But try to give it rules for an actual “deep” computation that involves many potentially computationally irreducible steps and it just won’t work. (Remember that at each step it’s always just “feeding data forward” in its network, never looping except by virtue of generating new tokens.)
同样值得再次指出的是,神经网络能够“捕捉”到的东西不可避免地存在“算法限制”。告诉它“浅显”的规则,比如“这个变成那个”等等,神经网络很可能能够很好地表示和重现这些规则——事实上,它从语言中“已经知道”的东西会立即给它一个可以遵循的模式。但试图给它涉及许多潜在计算不可简化步骤的实际“深度”计算规则,它就无法工作了。(记住,在每一步中,它总是在其网络中“前馈数据”,除了通过生成新标记外,从不循环。)
Of course, the network can learn the answer to specific “irreducible” computations. But as soon as there are combinatorial numbers of possibilities, no such “table-lookup-style” approach will work. And so, yes, just like humans, it’s time then for neural nets to “reach out” and use actual computational tools. (And, yes, Wolfram|Alpha and Wolfram Language are uniquely suitable, because they’ve been built to “talk about things in the world”, just like the language-model neural nets.)
当然,网络可以学习特定“不可简化”计算的答案。但是,一旦有组合数量的可能性,这种“查表式”的方法就不再有效。因此,是的,就像人类一样,现在是神经网络“伸出援手”并使用实际计算工具的时候了。(而且,是的,Wolfram|Alpha 和 Wolfram Language 是独一无二的合适选择,因为它们已经构建为“谈论世界上的事物”,就像语言模型神经网络一样。)
What Really Lets ChatGPT Work?
是什么让 ChatGPT 真正工作?
Human language—and the processes of thinking involved in generating it—have always seemed to represent a kind of pinnacle of complexity. And indeed it’s seemed somewhat remarkable that human brains—with their network of a “mere” 100 billion or so neurons (and maybe 100 trillion connections) could be responsible for it. Perhaps, one might have imagined, there’s something more to brains than their networks of neurons—like some new layer of undiscovered physics. But now with ChatGPT we’ve got an important new piece of information: we know that a pure, artificial neural network with about as many connections as brains have neurons is capable of doing a surprisingly good job of generating human language.
人类语言——以及产生它所涉及的思维过程——一直以来似乎都代表着一种复杂性的顶峰。确实,人类大脑——拥有大约 1000 亿个神经元(可能还有 100 万亿个连接)的网络——能够负责这种复杂性,这似乎有些令人惊讶。也许,人们可能会想象,大脑的网络不仅仅是神经元,还有一些未被发现的新物理层。但现在,随着 ChatGPT 的出现,我们获得了重要的新信息:我们知道,一个纯粹的人工神经网络,其连接数与大脑的神经元数量相当,能够出色地生成人类语言。
And, yes, that’s still a big and complicated system—with about as many neural net weights as there are words of text currently available out there in the world. But at some level it still seems difficult to believe that all the richness of language and the things it can talk about can be encapsulated in such a finite system. Part of what’s going on is no doubt a reflection of the ubiquitous phenomenon (that first became evident in the example of rule 30) that computational processes can in effect greatly amplify the apparent complexity of systems even when their underlying rules are simple. But, actually, as we discussed above, neural nets of the kind used in ChatGPT tend to be specifically constructed to restrict the effect of this phenomenon—and the computational irreducibility associated with it—in the interest of making their training more accessible.
而且,是的,这仍然是一个庞大而复杂的系统——其神经网络权重数量大约与世界上目前可用的文本字数相当。但在某种程度上,仍然难以相信语言的丰富性及其所能谈论的事物可以被封装在这样的有限系统中。部分原因无疑反映了无处不在的现象(这在规则 30 的例子中首次变得明显),即计算过程实际上可以大大增强系统的表面复杂性,即使其基本规则很简单。但实际上,正如我们上面所讨论的,ChatGPT 中使用的神经网络类型往往被特别构建以限制这种现象的效果——以及与之相关的计算不可约性——以利于其训练的可行性。
So how is it, then, that something like ChatGPT can get as far as it does with language? The basic answer, I think, is that language is at a fundamental level somehow simpler than it seems. And this means that ChatGPT—even with its ultimately straightforward neural net structure—is successfully able to “capture the essence” of human language and the thinking behind it. And moreover, in its training, ChatGPT has somehow “implicitly discovered” whatever regularities in language (and thinking) make this possible.
那么,像 ChatGPT 这样的东西是如何在语言方面取得如此大的进展的呢?我认为,基本的答案是,语言在根本上比看起来要简单得多。这意味着,即使 ChatGPT 最终采用了简单的神经网络结构,它也能成功地“捕捉到”人类语言及其背后的思维的本质。而且,在它的训练过程中,ChatGPT 已经以某种方式“隐含地发现了”语言(以及思维)中的任何规律,使得这一切成为可能。
The success of ChatGPT is, I think, giving us evidence of a fundamental and important piece of science: it’s suggesting that we can expect there to be major new “laws of language”—and effectively “laws of thought”—out there to discover. In ChatGPT—built as it is as a neural net—those laws are at best implicit. But if we could somehow make the laws explicit, there’s the potential to do the kinds of things ChatGPT does in vastly more direct, efficient—and transparent—ways.
我认为 ChatGPT 的成功为我们提供了一个基本且重要的科学证据:它表明我们可以期待发现一些重要的新“语言定律”——以及有效的“思维定律”。在 ChatGPT 中——作为一个神经网络构建——这些定律充其量是隐含的。但如果我们能够以某种方式使这些定律明确,就有可能以更加直接、高效和透明的方式做 ChatGPT 所做的事情。
But, OK, so what might these laws be like? Ultimately they must give us some kind of prescription for how language—and the things we say with it—are put together. Later we’ll discuss how “looking inside ChatGPT” may be able to give us some hints about this, and how what we know from building computational language suggests a path forward. But first let’s discuss two long-known examples of what amount to “laws of language”—and how they relate to the operation of ChatGPT.
但是,好吧,那么这些法律可能是什么样的呢?最终,它们必须给我们某种关于语言——以及我们用它所说的事物——是如何组合的处方。稍后我们将讨论“窥探 ChatGPT 内部”如何能够给我们一些关于这方面的提示,以及我们从构建计算语言中知道的内容如何指明前进的道路。但首先让我们讨论两个长期以来被认为是“语言法则”的例子——以及它们与 ChatGPT 运作的关系。
The first is the syntax of language. Language is not just a random jumble of words. Instead, there are (fairly) definite grammatical rules for how words of different kinds can be put together: in English, for example, nouns can be preceded by adjectives and followed by verbs, but typically two nouns can’t be right next to each other. Such grammatical structure can (at least approximately) be captured by a set of rules that define how what amount to “parse trees” can be put together:
第一种是语言的语法。语言不仅仅是单词的随意组合。相反,有(相当)明确的语法规则来规定不同种类的单词如何组合:例如,在英语中,名词可以由形容词前置和动词后置,但通常两个名词不能紧挨着。这种语法结构可以(至少近似地)通过一组规则来捕捉,这些规则定义了如何组合相当于“解析树”的内容:
ChatGPT doesn’t have any explicit “knowledge” of such rules. But somehow in its training it implicitly “discovers” them—and then seems to be good at following them. So how does this work? At a “big picture” level it’s not clear. But to get some insight it’s perhaps instructive to look at a much simpler example.
ChatGPT 没有任何明确的“知识”关于这些规则。但在其训练过程中,它隐含地“发现”了它们——然后似乎很擅长遵循它们。那么这是如何工作的呢?在“大局”层面上并不清楚。但为了获得一些见解,或许看看一个简单得多的例子是有指导意义的。
Consider a “language” formed from sequences of (’s and )’s, with a grammar that specifies that parentheses should always be balanced, as represented by a parse tree like:
考虑一种由('s 和 )'s 序列形成的“语言”,其语法规定括号应始终保持平衡,如以下解析树所示:
Can we train a neural net to produce “grammatically correct” parenthesis sequences? There are various ways to handle sequences in neural nets, but let’s use transformer nets, as ChatGPT does. And given a simple transformer net, we can start feeding it grammatically correct parenthesis sequences as training examples. A subtlety (which actually also appears in ChatGPT’s generation of human language) is that in addition to our “content tokens” (here “(” and “)”) we have to include an “End” token, that’s generated to indicate that the output shouldn’t continue any further (i.e. for ChatGPT, that one’s reached the “end of the story”).
我们可以训练一个神经网络来生成“语法正确”的括号序列吗?在神经网络中处理序列有多种方法,但让我们使用像 ChatGPT 那样的变压器网络。给定一个简单的变压器网络,我们可以开始将语法正确的括号序列作为训练示例输入。一个微妙之处(实际上在 ChatGPT 生成人类语言时也会出现)是,除了我们的“内容标记”(这里为“(”和“)”)外,我们还需要包含一个“结束”标记,该标记生成以指示输出不应再继续(即对于 ChatGPT,表示已到达“故事的结尾”)。
If we set up a transformer net with just one attention block with 8 heads and feature vectors of length 128 (ChatGPT also uses feature vectors of length 128, but has 96 attention blocks, each with 96 heads) then it doesn’t seem possible to get it to learn much about parenthesis language. But with 2 attention blocks, the learning process seems to converge—at least after 10 million or so examples have been given (and, as is common with transformer nets, showing yet more examples just seems to degrade its performance).
如果我们设置一个只有一个注意力块的变压器网络,该注意力块有 8 个头和长度为 128 的特征向量(ChatGPT 也使用长度为 128 的特征向量,但有 96 个注意力块,每个有 96 个头),那么似乎不可能让它学会太多关于括号语言的知识。但如果有 2 个注意力块,学习过程似乎会收敛——至少在给出了大约 1000 万左右的例子之后(而且,像变压器网络常见的那样,展示更多的例子似乎只会降低其性能)。
So with this network, we can do the analog of what ChatGPT does, and ask for probabilities for what the next token should be—in a parenthesis sequence:
所以有了这个网络,我们可以做类似于 ChatGPT 所做的事情,并询问下一个标记应该是什么的概率——在一个括号序列中:
And in the first case, the network is “pretty sure” that the sequence can’t end here—which is good, because if it did, the parentheses would be left unbalanced. In the second case, however, it “correctly recognizes” that the sequence can end here, though it also “points out” that it’s possible to “start again”, putting down a “(”, presumably with a “)” to follow. But, oops, even with its 400,000 or so laboriously trained weights, it says there’s a 15% probability to have “)” as the next token—which isn’t right, because that would necessarily lead to an unbalanced parenthesis.
在第一种情况下,网络“非常确定”序列不能在这里结束——这是好事,因为如果真的在这里结束,括号就会不平衡。然而,在第二种情况下,它“正确识别”出序列可以在这里结束,但它也“指出”可以“重新开始”,放下一个“(”,大概后面会跟一个“)”。但是,哎呀,即使它有大约 400,000 个精心训练的权重,它说下一个标记有 15%的概率是“)”——这是不对的,因为那必然会导致括号不平衡。
Here’s what we get if we ask the network for the highest-probability completions for progressively longer sequences of (’s:
如果我们要求网络为越来越长的(’s 序列提供最高概率的补全,我们会得到以下结果:
And, yes, up to a certain length the network does just fine. But then it starts failing. It’s a pretty typical kind of thing to see in a “precise” situation like this with a neural net (or with machine learning in general). Cases that a human “can solve in a glance” the neural net can solve too. But cases that require doing something “more algorithmic” (e.g. explicitly counting parentheses to see if they’re closed) the neural net tends to somehow be “too computationally shallow” to reliably do. (By the way, even the full current ChatGPT has a hard time correctly matching parentheses in long sequences.)
是的,在一定长度内,网络表现良好。但随后它开始失败。这在神经网络(或一般机器学习)中是一种相当典型的现象。人类“一眼就能解决”的情况,神经网络也能解决。但对于需要“更多算法”的情况(例如,显式计数括号以查看它们是否关闭),神经网络往往会因为“计算深度不足”而无法可靠地完成。(顺便说一句,即使是当前完整的 ChatGPT 在长序列中正确匹配括号也存在困难。)
So what does this mean for things like ChatGPT and the syntax of a language like English? The parenthesis language is “austere”—and much more of an “algorithmic story”. But in English it’s much more realistic to be able to “guess” what’s grammatically going to fit on the basis of local choices of words and other hints. And, yes, the neural net is much better at this—even though perhaps it might miss some “formally correct” case that, well, humans might miss as well. But the main point is that the fact that there’s an overall syntactic structure to the language—with all the regularity that implies—in a sense limits “how much” the neural net has to learn. And a key “natural-science-like” observation is that the transformer architecture of neural nets like the one in ChatGPT seems to successfully be able to learn the kind of nested-tree-like syntactic structure that seems to exist (at least in some approximation) in all human languages.
那么这对于像 ChatGPT 和英语这样的语言的语法意味着什么呢?这种括号语言是“简朴的”——而且更多的是一种“算法故事”。但在英语中,基于局部词汇选择和其他提示来“猜测”语法上会适合什么要现实得多。而且,是的,神经网络在这方面要好得多——尽管它可能会错过一些“形式上正确”的情况,而这些情况,嗯,人类也可能错过。但主要观点是,语言具有整体的句法结构——以及由此带来的所有规律性——在某种意义上限制了神经网络需要学习的内容。一个关键的“类似自然科学的”观察是,像 ChatGPT 中的神经网络的变压器架构似乎能够成功地学习似乎存在于所有人类语言中(至少在某种近似程度上)的嵌套树状句法结构。
Syntax provides one kind of constraint on language. But there are clearly more. A sentence like “Inquisitive electrons eat blue theories for fish” is grammatically correct but isn’t something one would normally expect to say, and wouldn’t be considered a success if ChatGPT generated it—because, well, with the normal meanings for the words in it, it’s basically meaningless.
语法提供了一种语言的约束。但显然还有更多。像“好奇的电子吃蓝色理论作为鱼”这样的句子在语法上是正确的,但不是人们通常会说的,如果 ChatGPT 生成它也不会被认为是成功的——因为,好吧,用这些词的正常含义,它基本上是无意义的。
But is there a general way to tell if a sentence is meaningful? There’s no traditional overall theory for that. But it’s something that one can think of ChatGPT as having implicitly “developed a theory for” after being trained with billions of (presumably meaningful) sentences from the web, etc.
但是,有没有一种普遍的方法来判断一个句子是否有意义呢?没有传统的整体理论来解决这个问题。但可以认为,ChatGPT 在经过数十亿(大概是有意义的)来自网络等的句子训练后,已经隐含地“发展了一种理论”。
What might this theory be like? Well, there’s one tiny corner that’s basically been known for two millennia, and that’s logic. And certainly in the syllogistic form in which Aristotle discovered it, logic is basically a way of saying that sentences that follow certain patterns are reasonable, while others are not. Thus, for example, it’s reasonable to say “All X are Y. This is not Y, so it’s not an X” (as in “All fishes are blue. This is not blue, so it’s not a fish.”). And just as one can somewhat whimsically imagine that Aristotle discovered syllogistic logic by going (“machine-learning-style”) through lots of examples of rhetoric, so too one can imagine that in the training of ChatGPT it will have been able to “discover syllogistic logic” by looking at lots of text on the web, etc. (And, yes, while one can therefore expect ChatGPT to produce text that contains “correct inferences” based on things like syllogistic logic, it’s a quite different story when it comes to more sophisticated formal logic—and I think one can expect it to fail here for the same kind of reasons it fails in parenthesis matching.)
这种理论可能是什么样的?好吧,有一个小角落基本上已经被人们所知了两千年,那就是逻辑。当然,在亚里士多德发现的这种三段论形式中,逻辑基本上是一种说遵循某些模式的句子是合理的,而其他则不是的方式。因此,例如,说“所有的 X 都是 Y。这不是 Y,所以它不是 X”(如“所有的鱼都是蓝色的。这不是蓝色的,所以它不是鱼。”)是合理的。正如人们可以有些幽默地想象亚里士多德通过(“机器学习风格”)大量修辞例子发现了三段论逻辑,同样,人们也可以想象在 ChatGPT 的训练中,它能够通过查看网络上大量的文本等“发现三段论逻辑”。(而且,是的,虽然因此可以期望 ChatGPT 生成包含基于三段论逻辑的“正确推理”的文本,但在更复杂的正式逻辑方面,这是一个完全不同的情况——我认为人们可以预期它在这里失败,原因与它在括号匹配中失败的原因相同。)
But beyond the narrow example of logic, what can be said about how to systematically construct (or recognize) even plausibly meaningful text? Yes, there are things like Mad Libs that use very specific “phrasal templates”. But somehow ChatGPT implicitly has a much more general way to do it. And perhaps there’s nothing to be said about how it can be done beyond “somehow it happens when you have 175 billion neural net weights”. But I strongly suspect that there’s a much simpler and stronger story.
但在逻辑的狭隘例子之外,关于如何系统地构建(或识别)甚至看似有意义的文本,可以说些什么呢?是的,有一些像 Mad Libs 这样的东西使用非常具体的“短语模板”。但不知何故,ChatGPT 隐含地有一种更普遍的方法来做到这一点。也许除了“当你有 1750 亿个神经网络权重时,它就会发生”之外,没有什么可说的。但我强烈怀疑有一个更简单、更强大的故事。
Meaning Space and Semantic Laws of Motion
意义空间与语义运动定律
We discussed above that inside ChatGPT any piece of text is effectively represented by an array of numbers that we can think of as coordinates of a point in some kind of “linguistic feature space”. So when ChatGPT continues a piece of text this corresponds to tracing out a trajectory in linguistic feature space. But now we can ask what makes this trajectory correspond to text we consider meaningful. And might there perhaps be some kind of “semantic laws of motion” that define—or at least constrain—how points in linguistic feature space can move around while preserving “meaningfulness”?
我们在上面讨论过,在 ChatGPT 内部,任何一段文本实际上都是由一组数字表示的,我们可以将其视为某种“语言特征空间”中的点的坐标。因此,当 ChatGPT 继续一段文本时,这相当于在语言特征空间中追踪一条轨迹。但现在我们可以问,是什么使得这条轨迹对应于我们认为有意义的文本。是否可能存在某种“语义运动定律”,定义或至少约束了语言特征空间中的点在保持“有意义性”的同时如何移动?
So what is this linguistic feature space like? Here’s an example of how single words (here, common nouns) might get laid out if we project such a feature space down to 2D:
那么这种语言特征空间是什么样的呢?这里有一个例子,展示了如果我们将这种特征空间投影到二维空间中,单个词(这里指常见名词)是如何布局的:
We saw another example above based on words representing plants and animals. But the point in both cases is that “semantically similar words” are placed nearby.
我们在上面看到了另一个基于代表植物和动物的词语的例子。但在这两种情况下,重点是“语义相似的词语”被放置在附近。
As another example, here’s how words corresponding to different parts of speech get laid out:
作为另一个例子,以下是如何将对应于不同词性的单词进行布局:
Of course, a given word doesn’t in general just have “one meaning” (or necessarily correspond to just one part of speech). And by looking at how sentences containing a word lay out in feature space, one can often “tease apart” different meanings—as in the example here for the word “crane” (bird or machine?):
当然,一个给定的词通常不仅仅只有一个“意思”(或者不一定只对应一个词性)。通过观察包含一个词的句子在特征空间中的布局,人们通常可以“分解”出不同的含义——就像这里对“crane”(鸟或机器?)这个词的例子一样:
OK, so it’s at least plausible that we can think of this feature space as placing “words nearby in meaning” close in this space. But what kind of additional structure can we identify in this space? Is there for example some kind of notion of “parallel transport” that would reflect “flatness” in the space? One way to get a handle on that is to look at analogies:
好的,所以至少可以合理地认为我们可以将这个特征空间视为将“意义相近的词语”在这个空间中靠近。但是我们能在这个空间中识别出什么样的额外结构呢?例如,是否存在某种“平行移动”的概念,反映出空间中的“平坦性”?一种理解这一点的方法是看类比:
And, yes, even when we project down to 2D, there’s often at least a “hint of flatness”, though it’s certainly not universally seen.
而且,是的,即使我们投影到 2D,通常至少会有一种“平坦的感觉”,尽管这肯定不是普遍看到的。
So what about trajectories? We can look at the trajectory that a prompt for ChatGPT follows in feature space—and then we can see how ChatGPT continues that:
那么轨迹呢?我们可以看看 ChatGPT 的提示在特征空间中遵循的轨迹——然后我们可以看到 ChatGPT 是如何继续的:
There’s certainly no “geometrically obvious” law of motion here. And that’s not at all surprising; we fully expect this to be a considerably more complicated story. And, for example, it’s far from obvious that even if there is a “semantic law of motion” to be found, what kind of embedding (or, in effect, what “variables”) it’ll most naturally be stated in.
这里显然没有“几何上明显的”运动定律。这完全不令人惊讶;我们完全预期这会是一个相当复杂的故事。例如,即使存在一个“语义运动定律”,也远非显而易见它最自然地会以何种嵌入(或实际上,何种“变量”)来表述。
In the picture above, we’re showing several steps in the “trajectory”—where at each step we’re picking the word that ChatGPT considers the most probable (the “zero temperature” case). But we can also ask what words can “come next” with what probabilities at a given point:
在上图中,我们展示了“轨迹”中的几个步骤——在每一步中,我们选择了 ChatGPT 认为最可能的单词(“零温度”情况)。但我们也可以询问在某个特定点下,哪些单词可以“接下来”出现,以及它们的概率是多少:
And what we see in this case is that there’s a “fan” of high-probability words that seems to go in a more or less definite direction in feature space. What happens if we go further? Here are the successive “fans” that appear as we “move along” the trajectory:
在这种情况下,我们看到的是一个高概率词汇的“扇形”,似乎在特征空间中朝着一个或多或少明确的方向发展。如果我们继续深入会怎样?以下是随着我们沿着轨迹“移动”而出现的连续“扇形”:
Here’s a 3D representation, going for a total of 40 steps:
这是一个 3D 表示,总共进行 40 步:
And, yes, this seems like a mess—and doesn’t do anything to particularly encourage the idea that one can expect to identify “mathematical-physics-like” “semantic laws of motion” by empirically studying “what ChatGPT is doing inside”. But perhaps we’re just looking at the “wrong variables” (or wrong coordinate system) and if only we looked at the right one, we’d immediately see that ChatGPT is doing something “mathematical-physics-simple” like following geodesics. But as of now, we’re not ready to “empirically decode” from its “internal behavior” what ChatGPT has “discovered” about how human language is “put together”.
是的,这看起来一团糟——并且没有特别鼓励人们通过经验研究“ChatGPT 内部在做什么”来期望能够识别出“类似数学物理”的“语义运动定律”。但也许我们只是在看“错误的变量”(或错误的坐标系),如果我们只看正确的变量,我们会立即看到 ChatGPT 在做一些“类似数学物理简单”的事情,比如遵循测地线。但到目前为止,我们还没有准备好从它的“内部行为”“经验解码”出 ChatGPT“发现了”关于人类语言是如何“组合”的。
Semantic Grammar and the Power of Computational Language
语义语法与计算语言的力量
What does it take to produce “meaningful human language”? In the past, we might have assumed it could be nothing short of a human brain. But now we know it can be done quite respectably by the neural net of ChatGPT. Still, maybe that’s as far as we can go, and there’ll be nothing simpler—or more human understandable—that will work. But my strong suspicion is that the success of ChatGPT implicitly reveals an important “scientific” fact: that there’s actually a lot more structure and simplicity to meaningful human language than we ever knew—and that in the end there may be even fairly simple rules that describe how such language can be put together.
要产生“有意义的人类语言”需要什么?过去,我们可能会认为这除了人类大脑之外别无他法。但现在我们知道,ChatGPT 的神经网络可以做得相当不错。尽管如此,也许这就是我们所能达到的极限,不会有更简单或更易于人类理解的方法了。但我强烈怀疑,ChatGPT 的成功隐含地揭示了一个重要的“科学”事实:实际上,有意义的人类语言比我们以往所知的更具结构性和简单性——最终可能存在相当简单的规则来描述如何构建这种语言。
As we mentioned above, syntactic grammar gives rules for how words corresponding to things like different parts of speech can be put together in human language. But to deal with meaning, we need to go further. And one version of how to do this is to think about not just a syntactic grammar for language, but also a semantic one.
正如我们上面提到的,句法语法给出了如何将对应于不同词性的词组合在一起的规则。但要处理意义,我们需要更进一步。一种方法不仅是考虑语言的句法语法,还要考虑语义语法。
For purposes of syntax, we identify things like nouns and verbs. But for purposes of semantics, we need “finer gradations”. So, for example, we might identify the concept of “moving”, and the concept of an “object” that “maintains its identity independent of location”. There are endless specific examples of each of these “semantic concepts”. But for the purposes of our semantic grammar, we’ll just have some general kind of rule that basically says that “objects” can “move”. There’s a lot to say about how all this might work (some of which I’ve said before). But I’ll content myself here with just a few remarks that indicate some of the potential path forward.
为了语法的目的,我们识别名词和动词等事物。但为了语义的目的,我们需要“更精细的区分”。例如,我们可能会识别“移动”的概念,以及“物体”的概念,即“保持其身份独立于位置”。这些“语义概念”中的每一个都有无数的具体例子。但为了我们的语义语法的目的,我们只会有一条大致的规则,基本上说“物体”可以“移动”。关于这一切如何运作,有很多可以说的(其中一些我以前说过)。但我在这里只满足于几句指出一些潜在前进路径的评论。
It’s worth mentioning that even if a sentence is perfectly OK according to the semantic grammar, that doesn’t mean it’s been realized (or even could be realized) in practice. “The elephant traveled to the Moon” would doubtless “pass” our semantic grammar, but it certainly hasn’t been realized (at least yet) in our actual world—though it’s absolutely fair game for a fictional world.
值得一提的是,即使一个句子在语义语法上是完全正确的,这并不意味着它在实践中已经实现(甚至可能实现)。“大象去了月球”无疑会“通过”我们的语义语法,但它肯定还没有在我们的现实世界中实现(至少目前还没有)——尽管它在虚构世界中绝对是公平的游戏。
When we start talking about “semantic grammar” we’re soon led to ask “What’s underneath it?” What “model of the world” is it assuming? A syntactic grammar is really just about the construction of language from words. But a semantic grammar necessarily engages with some kind of “model of the world”—something that serves as a “skeleton” on top of which language made from actual words can be layered.
当我们开始谈论“语义语法”时,我们很快就会问“它下面是什么?”它假设了什么样的“世界模型”?句法语法实际上只是关于从单词构建语言。但语义语法必然会涉及某种“世界模型”——作为“骨架”的东西,实际单词构成的语言可以在此基础上分层。
Until recent times, we might have imagined that (human) language would be the only general way to describe our “model of the world”. Already a few centuries ago there started to be formalizations of specific kinds of things, based particularly on mathematics. But now there’s a much more general approach to formalization: computational language.
直到最近,我们可能还认为(人类)语言是描述我们“世界模型”的唯一通用方式。几个世纪前,基于数学的特定类型事物的形式化已经开始出现。但现在有一种更通用的形式化方法:计算语言。
And, yes, that’s been my big project over the course of more than four decades (as now embodied in the Wolfram Language): to develop a precise symbolic representation that can talk as broadly as possible about things in the world, as well as abstract things that we care about. And so, for example, we have symbolic representations for cities and molecules and images and neural networks, and we have built-in knowledge about how to compute about those things.
是的,这确实是我四十多年来一直在进行的大型项目(现在体现在 Wolfram 语言中):开发一种精确的符号表示,能够尽可能广泛地谈论世界上的事物,以及我们关心的抽象事物。因此,例如,我们有城市的符号表示、分子的符号表示、图像的符号表示和神经网络的符号表示,并且我们内置了关于如何计算这些事物的知识。
And, after decades of work, we’ve covered a lot of areas in this way. But in the past, we haven’t particularly dealt with “everyday discourse”. In “I bought two pounds of apples” we can readily represent (and do nutrition and other computations on) the “two pounds of apples”. But we don’t (quite yet) have a symbolic representation for “I bought”.
而且,经过几十年的工作,我们已经以这种方式覆盖了很多领域。但在过去,我们并没有特别处理“日常对话”。在“我买了两磅苹果”中,我们可以很容易地表示(并对“两磅苹果”进行营养和其他计算)。但我们还没有(还)对“我买了”有一个符号表示。
It’s all connected to the idea of semantic grammar—and the goal of having a generic symbolic “construction kit” for concepts, that would give us rules for what could fit together with what, and thus for the “flow” of what we might turn into human language.
这一切都与语义语法的概念有关——目标是拥有一个通用的符号“构建工具包”来表达概念,这将为我们提供什么可以与什么组合的规则,从而为我们可能转化为人类语言的“流程”提供指导。
But let’s say we had this “symbolic discourse language”. What would we do with it? We could start off doing things like generating “locally meaningful text”. But ultimately we’re likely to want more “globally meaningful” results—which means “computing” more about what can actually exist or happen in the world (or perhaps in some consistent fictional world).
但假设我们有了这种“符号话语语言”。我们能用它做什么呢?我们可以开始做一些生成“局部有意义的文本”的事情。但最终我们可能想要更多“全局有意义”的结果——这意味着要“计算”更多关于世界上实际存在或发生的事情(或者可能在某个一致的虚构世界中)。
Right now in Wolfram Language we have a huge amount of built-in computational knowledge about lots of kinds of things. But for a complete symbolic discourse language we’d have to build in additional “calculi” about general things in the world: if an object moves from A to B and from B to C, then it’s moved from A to C, etc.
在 Wolfram 语言中,我们现在拥有大量关于各种事物的内置计算知识。但对于一个完整的符号论述语言,我们需要额外构建关于世界中一般事物的“计算”:如果一个物体从 A 移动到 B,再从 B 移动到 C,那么它就从 A 移动到了 C,等等。
Given a symbolic discourse language we might use it to make “standalone statements”. But we can also use it to ask questions about the world, “Wolfram|Alpha style”. Or we can use it to state things that we “want to make so”, presumably with some external actuation mechanism. Or we can use it to make assertions—perhaps about the actual world, or perhaps about some specific world we’re considering, fictional or otherwise.
给定一个符号话语语言,我们可以用它来做出“独立陈述”。但我们也可以用它来询问关于世界的问题,“Wolfram|Alpha 风格”。或者我们可以用它来陈述我们“想要实现的事情”,大概是通过某种外部驱动机制。或者我们可以用它来做出断言——可能是关于现实世界的,也可能是关于我们正在考虑的某个特定世界,无论是虚构的还是其他的。
Human language is fundamentally imprecise, not least because it isn’t “tethered” to a specific computational implementation, and its meaning is basically defined just by a “social contract” between its users. But computational language, by its nature, has a certain fundamental precision—because in the end what it specifies can always be “unambiguously executed on a computer”. Human language can usually get away with a certain vagueness. (When we say “planet” does it include exoplanets or not, etc.?) But in computational language we have to be precise and clear about all the distinctions we’re making.
人类语言从根本上来说是不精确的,尤其是因为它没有“绑定”到特定的计算实现上,其意义基本上是由用户之间的“社会契约”定义的。但计算语言本质上具有一定的基本精确性——因为最终它所指定的内容总是可以在计算机上“明确地执行”。人类语言通常可以容忍一定的模糊性。(当我们说“行星”时,它是否包括系外行星等?)但在计算语言中,我们必须对我们所做的所有区分精确且清晰。
It’s often convenient to leverage ordinary human language in making up names in computational language. But the meanings they have in computational language are necessarily precise—and might or might not cover some particular connotation in typical human language usage.
在计算语言中使用普通人类语言来命名通常很方便。但它们在计算语言中的含义必然是精确的——并且可能或可能不涵盖典型人类语言使用中的某些特定内涵。
How should one figure out the fundamental “ontology” suitable for a general symbolic discourse language? Well, it’s not easy. Which is perhaps why little has been done since the primitive beginnings Aristotle made more than two millennia ago. But it really helps that today we now know so much about how to think about the world computationally (and it doesn’t hurt to have a “fundamental metaphysics” from our Physics Project and the idea of the ruliad).
一个人应该如何找出适合一般符号话语语言的基本“本体论”?嗯,这并不容易。这或许就是为什么自亚里士多德两千多年前的初步工作以来,进展甚微的原因。但今天我们对如何计算地思考世界了解得如此之多,这确实有帮助(并且拥有我们物理项目中的“基本形而上学”和 ruliad 的概念也无妨)。
But what does all this mean in the context of ChatGPT? From its training ChatGPT has effectively “pieced together” a certain (rather impressive) quantity of what amounts to semantic grammar. But its very success gives us a reason to think that it’s going to be feasible to construct something more complete in computational language form. And, unlike what we’ve so far figured out about the innards of ChatGPT, we can expect to design the computational language so that it’s readily understandable to humans.
但是在 ChatGPT 的背景下,这一切意味着什么呢?通过其训练,ChatGPT 有效地“拼凑”了一定数量(相当令人印象深刻)的语义语法。但它的成功让我们有理由认为,构建一个更完整的计算语言形式将是可行的。而且,与我们目前对 ChatGPT 内部机制的了解不同,我们可以预期设计计算语言,使其易于被人类理解。
When we talk about semantic grammar, we can draw an analogy to syllogistic logic. At first, syllogistic logic was essentially a collection of rules about statements expressed in human language. But (yes, two millennia later) when formal logic was developed, the original basic constructs of syllogistic logic could now be used to build huge “formal towers” that include, for example, the operation of modern digital circuitry. And so, we can expect, it will be with more general semantic grammar. At first, it may just be able to deal with simple patterns, expressed, say, as text. But once its whole computational language framework is built, we can expect that it will be able to be used to erect tall towers of “generalized semantic logic”, that allow us to work in a precise and formal way with all sorts of things that have never been accessible to us before, except just at a “ground-floor level” through human language, with all its vagueness.
当我们谈论语义语法时,可以将其比作三段论逻辑。起初,三段论逻辑本质上是一套关于用人类语言表达的陈述的规则。但是(是的,两千年后)当形式逻辑被发展出来时,三段论逻辑的原始基本构造现在可以用来构建巨大的“形式塔”,例如现代数字电路的操作。因此,我们可以预期,语义语法也会如此。起初,它可能只能处理简单的模式,比如说,以文本形式表达。但一旦其整个计算语言框架建立起来,我们可以预期它将能够用来建造高耸的“广义语义逻辑”塔,使我们能够以精确和形式化的方式处理各种以前只能通过人类语言在“地面层”上接触到的事物,以及所有这些事物的模糊性。
We can think of the construction of computational language—and semantic grammar—as representing a kind of ultimate compression in representing things. Because it allows us to talk about the essence of what’s possible, without, for example, dealing with all the “turns of phrase” that exist in ordinary human language. And we can view the great strength of ChatGPT as being something a bit similar: because it too has in a sense “drilled through” to the point where it can “put language together in a semantically meaningful way” without concern for different possible turns of phrase.
我们可以将计算语言——以及语义语法的构建——视为表示事物的一种最终压缩形式。因为它允许我们谈论可能性的本质,而不必处理普通人类语言中存在的所有“表达方式”。我们可以将 ChatGPT 的强大之处视为与此有些相似:因为它在某种意义上也已经“钻研透彻”,能够“以语义上有意义的方式组合语言”,而不必考虑不同的可能表达方式。
So what would happen if we applied ChatGPT to underlying computational language? The computational language can describe what’s possible. But what can still be added is a sense of “what’s popular”—based for example on reading all that content on the web. But then—underneath—operating with computational language means that something like ChatGPT has immediate and fundamental access to what amount to ultimate tools for making use of potentially irreducible computations. And that makes it a system that can not only “generate reasonable text”, but can expect to work out whatever can be worked out about whether that text actually makes “correct” statements about the world—or whatever it’s supposed to be talking about.
那么,如果我们对底层计算语言应用 ChatGPT 会发生什么?计算语言可以描述什么是可能的。但仍然可以添加的是一种“什么是流行”的感觉——例如基于阅读网络上所有的内容。但随后——在底层——使用计算语言意味着像 ChatGPT 这样的系统可以立即且根本地访问用于利用潜在不可简化的计算的终极工具。这使得它不仅是一个可以“生成合理文本”的系统,而且可以期望弄清楚关于该文本是否实际上对其所谈论的世界或任何它应该谈论的内容做出“正确”陈述的所有可以弄清楚的内容。
So … What Is ChatGPT Doing, and Why Does It Work?
所以……ChatGPT 在做什么,为什么它有效?
The basic concept of ChatGPT is at some level rather simple. Start from a huge sample of human-created text from the web, books, etc. Then train a neural net to generate text that’s “like this”. And in particular, make it able to start from a “prompt” and then continue with text that’s “like what it’s been trained with”.
ChatGPT 的基本概念在某种程度上相当简单。从网络上、书籍等中获取大量人类创建的文本样本开始。然后训练一个神经网络来生成“类似这样的”文本。特别是,让它能够从一个“提示”开始,然后继续生成“类似于它所训练过的”文本。
As we’ve seen, the actual neural net in ChatGPT is made up of very simple elements—though billions of them. And the basic operation of the neural net is also very simple, consisting essentially of passing input derived from the text it’s generated so far “once through its elements” (without any loops, etc.) for every new word (or part of a word) that it generates.
正如我们所见,ChatGPT 中的实际神经网络由非常简单的元素组成——尽管有数十亿个。神经网络的基本操作也非常简单,基本上包括将其生成的文本的输入“一次通过其元素”(没有任何循环等),以生成每个新词(或词的一部分)。
But the remarkable—and unexpected—thing is that this process can produce text that’s successfully “like” what’s out there on the web, in books, etc. And not only is it coherent human language, it also “says things” that “follow its prompt” making use of content it’s “read”. It doesn’t always say things that “globally make sense” (or correspond to correct
但令人惊讶且出乎意料的是,这个过程可以生成成功“类似于”网络上、书籍中等的内容。而且,它不仅仅是连贯的人类语言,它还“说出了”“遵循其提示”的内容,利用了它“阅读”过的内容。它并不总是说出“全局上合理”(或对应于正确计算)的事情——因为(例如,在不访问 Wolfram|Alpha 的“计算超能力”的情况下)它只是说出了在其训练材料中“听起来正确”的事情。
The specific engineering of ChatGPT has made it quite compelling. But ultimately (at least until it can use outside tools) ChatGPT is “merely” pulling out some “coherent thread of text” from the “statistics of conventional wisdom” that it’s accumulated. But it’s amazing how human-like the results are. And as I’ve discussed, this suggests something that’s at least scientifically very important: that human language (and the patterns of thinking behind it) are somehow simpler and more “law like” in their structure than we thought. ChatGPT has implicitly discovered it. But we can potentially explicitly expose it, with semantic grammar, computational language, etc.
ChatGPT 的具体工程设计使其非常引人注目。但最终(至少在它能使用外部工具之前),ChatGPT“仅仅”是从它积累的“常识统计”中提取出一些“连贯的文本线索”。但令人惊讶的是,结果是多么像人类。正如我所讨论的,这表明至少在科学上非常重要的一点:人类语言(及其背后的思维模式)在结构上比我们想象的要简单和“规律性”得多。ChatGPT 已经隐含地发现了这一点。但我们有可能通过语义语法、计算语言等显式地揭示它。
What ChatGPT does in generating text is very impressive—and the results are usually very much like what we humans would produce. So does this mean ChatGPT is working like a brain? Its underlying artificial-neural-net structure was ultimately modeled on an idealization of the brain. And it seems quite likely that when we humans generate language many aspects of what’s going on are quite similar.
ChatGPT 在生成文本方面所做的工作非常令人印象深刻——结果通常非常像我们人类会产生的。那么这是否意味着 ChatGPT 像大脑一样工作呢?其底层的人工神经网络结构最终是基于对大脑的理想化模型。而且似乎很有可能,当我们人类生成语言时,所发生的许多方面都非常相似。
When it comes to training (AKA learning) the different “hardware” of the brain and of current computers (as well as, perhaps, some undeveloped algorithmic ideas) forces ChatGPT to use a strategy that’s probably rather different (and in some ways much less efficient) than the brain. And there’s something else as well: unlike even in typical algorithmic computation, ChatGPT doesn’t internally “have loops” or “recompute on data”. And that inevitably limits its computational capability—even with respect to current computers, but definitely with respect to the brain.
当涉及到训练(也就是学习)大脑和当前计算机的不同“硬件”(以及可能还有一些未开发的算法思想)时,ChatGPT 被迫使用一种可能相当不同(并且在某些方面效率更低)的策略,而不是大脑。还有其他一些东西:与典型的算法计算不同,ChatGPT 内部没有“循环”或“重新计算数据”。这不可避免地限制了它的计算能力——即使与当前计算机相比,但肯定与大脑相比。
It’s not clear how to “fix that” and still maintain the ability to train the system with reasonable efficiency. But to do so will presumably allow a future ChatGPT to do even more “brain-like things”. Of course, there are plenty of things that brains don’t do so well—particularly involving what amount to irreducible computations. And for these both brains and things like ChatGPT have to seek “outside tools”—like Wolfram Language.
尚不清楚如何“解决这个问题”并仍然保持以合理的效率训练系统的能力。但这样做可能会让未来的 ChatGPT 能够做更多“类似大脑的事情”。当然,大脑也有很多不擅长的事情——特别是涉及不可简化的计算。对于这些,大脑和像 ChatGPT 这样的东西都需要寻求“外部工具”——比如 Wolfram 语言。
But for now it’s exciting to see what ChatGPT has already been able to do. At some level it’s a great example of the fundamental scientific fact that large numbers of simple computational elements can do remarkable and unexpected things. But it also provides perhaps the best impetus we’ve had in two thousand years to understand better just what the fundamental character and principles might be of that central feature of the human condition that is human language and the processes of thinking behind it.
但目前看到 ChatGPT 已经能够做到的事情令人兴奋。在某种程度上,这是大量简单计算元素能够完成令人惊讶和意想不到的事情的一个极好例子。但它也提供了也许是两千年来我们最好的一次动力,以更好地理解人类状况的中心特征——人类语言及其背后的思维过程的基本特征和原则。
Thanks 谢谢
I’ve been following the development of neural nets now for about 43 years, and during that time I’ve interacted with many people about them. Among them—some from long ago, some from recently, and some across many years—have been: Giulio Alessandrini, Dario Amodei, Etienne Bernard, Taliesin Beynon, Sebastian Bodenstein, Greg Brockman, Jack Cowan, Pedro Domingos, Jesse Galef, Roger Germundsson, Robert Hecht-Nielsen, Geoff Hinton, John Hopfield, Yann LeCun, Jerry Lettvin, Jerome Louradour, Marvin Minsky, Eric Mjolsness, Cayden Pierce, Tomaso Poggio, Matteo Salvarezza, Terry Sejnowski, Oliver Selfridge, Gordon Shaw, Jonas Sjöberg, Ilya Sutskever, Gerry Tesauro and Timothee Verdier. For help with this piece, I’d particularly like to thank Giulio Alessandrini and Brad Klee.
我一直在关注神经网络的发展,大约有 43 年了,在这段时间里,我和许多人交流过。其中——有些是很久以前的,有些是最近的,还有一些是多年来的——包括:朱利奥·阿莱桑德里尼、达里奥·阿莫代、埃蒂安·贝尔纳、塔利森·贝农、塞巴斯蒂安·博德斯坦、格雷格·布罗克曼、杰克·考恩、佩德罗·多明戈斯、杰西·加尔夫、罗杰·格蒙森、罗伯特·海克特-尼尔森、杰弗里·辛顿、约翰·霍普菲尔德、扬·勒昆、杰瑞·莱特文、杰罗姆·卢拉杜尔、马文·明斯基、埃里克·姆约尔斯内斯、凯登·皮尔斯、托马索·波吉奥、马泰奥·萨尔瓦雷扎、特里·塞伊诺斯基、奥利弗·塞尔弗里奇、戈登·肖、乔纳斯·希约伯格、伊利亚·苏茨克维、杰瑞·特萨罗和蒂莫泰·韦迪尔。对于这篇文章的帮助,我要特别感谢朱利奥·阿莱桑德里尼和布拉德·克利。