
See also:
“LLM Tech Comes to Wolfram Language” »

A discussion about the history of neural nets »

What Is ChatGPT Doing … and Why Does It Work?

It’s Just Adding One Word at a Time

That ChatGPT can automatically generate something that reads even superficially like human-written text is remarkable, and unexpected. But how does it do it? And why does it work? My purpose here is to give a rough outline of what’s going on inside ChatGPT—and then to explore why it is that it can do so well in producing what we might consider to be meaningful text. I should say at the outset that I’m going to focus on the big picture of what’s going on—and while I’ll mention some engineering details, I won’t get deeply into them. (And the essence of what I’ll say applies just as well to other current “large language models” [LLMs] as to ChatGPT.)

The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a “reasonable continuation” of whatever text it’s got so far, where by “reasonable” we mean “what one might expect someone to write after seeing what people have written on billions of webpages, etc.”

So let’s say we’ve got the text “The best thing about AI is its ability to”. Imagine scanning billions of pages of human-written text (say on the web and in digitized books) and finding all instances of this text—then seeing what word comes next what fraction of the time. ChatGPT effectively does something like this, except that (as I’ll explain) it doesn’t look at literal text; it looks for things that in a certain sense “match in meaning”. But the end result is that it produces a ranked list of words that might follow, together with “probabilities”:

And the remarkable thing is that when ChatGPT does something like write an essay what it’s essentially doing is just asking over and over again “given the text so far, what should the next word be?”—and each time adding a word. (More precisely, as I’ll explain, it’s adding a “token”, which could be just a part of a word, which is why it can sometimes “make up new words”.)

But, OK, at each step it gets a list of words with probabilities. But which one should it actually pick to add to the essay (or whatever) that it’s writing? One might think it should be the “highest-ranked” word (i.e. the one to which the highest “probability” was assigned). But this is where a bit of voodoo begins to creep in. Because for some reason—that maybe one day we’ll have a scientific-style understanding of—if we always pick the highest-ranked word, we’ll typically get a very “flat” essay, that never seems to “show any creativity” (and even sometimes repeats word for word). But if sometimes (at random) we pick lower-ranked words, we get a “more interesting” essay.

The fact that there’s randomness here means that if we use the same prompt multiple times, we’re likely to get different essays each time. And, in keeping with the idea of voodoo, there’s a particular so-called “temperature” parameter that determines how often lower-ranked words will be used, and for essay generation, it turns out that a “temperature” of 0.8 seems best. (It’s worth emphasizing that there’s no “theory” being used here; it’s just a matter of what’s been found to work in practice. And for example the concept of “temperature” is there because exponential distributions familiar from statistical physics happen to be being used, but there’s no “physical” connection—at least so far as we know.)

Before we go on I should explain that for purposes of exposition I’m mostly not going to use the full system that’s in ChatGPT; instead I’ll usually work with a simpler GPT-2 system, which has the nice feature that it’s small enough to be able to run on a standard desktop computer. And so for essentially everything I show I’ll be able to include explicit Wolfram Language code that you can immediately run on your computer. (Click any picture here to copy the code behind it.)

For example, here’s how to get the table of probabilities above. First, we have to retrieve the underlying “language model” neural net:
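A minimal Wolfram Language sketch of this step (the model name follows the GPT-2 entry in the Wolfram Neural Net Repository, and should be treated as an assumption):

```wolfram
(* Fetch a GPT-2 language model from the Wolfram Neural Net Repository *)
model = NetModel[{"GPT-2 Transformer Trained on WebText Data",
    "Task" -> "LanguageModeling"}]
```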

Later on, we’ll look inside this neural net, and talk about how it works. But for now we can just apply this “net model” as a black box to our text so far, and ask for the top 5 words by probability that the model says should follow:
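In sketch form (the "TopProbabilities" property is the standard way to ask a classification-style Wolfram net for its most probable outputs; treat the exact property name as an assumption for this particular model):

```wolfram
(* Ask the net for the 5 most probable next tokens, as token -> probability *)
model["The best thing about AI is its ability to", {"TopProbabilities", 5}]
```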

This takes that result and makes it into an explicit formatted “dataset”:
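For instance, one might wrap the rules returned above like this (a sketch; the exact output form of the model is an assumption):

```wolfram
(* Format the token -> probability rules as an explicit Dataset *)
Dataset[Association[
  model["The best thing about AI is its ability to", {"TopProbabilities", 5}]]]
```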

Here’s what happens if one repeatedly “applies the model”—at each step adding the word that has the top probability (specified in this code as the “decision” from the model):
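A sketch of that loop (assuming the net's "Decision" property returns the single most probable next token; tokens in GPT-2's vocabulary typically carry their own leading space, so plain concatenation suffices):

```wolfram
(* Greedy, zero-temperature continuation: append the top token 7 times *)
NestList[StringJoin[#, model[#, "Decision"]] &,
  "The best thing about AI is its ability to", 7]
```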

What happens if one goes on longer? In this (“zero temperature”) case what comes out soon gets rather confused and repetitive:

But what if instead of always picking the “top” word one sometimes randomly picks “non-top” words (with the “randomness” corresponding to “temperature” 0.8)? Again one can build up text:
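In sketch form (assuming the model supports a "RandomSample" property with a "Temperature" option, as language models in the Wolfram Neural Net Repository generally do):

```wolfram
(* Sample the next token from the distribution at temperature 0.8 *)
model["The best thing about AI is its ability to",
  {"RandomSample", "Temperature" -> 0.8}]
```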

And every time one does this, different random choices will be made, and the text will be different—as in these 5 examples:

It’s worth pointing out that even at the first step there are a lot of possible “next words” to choose from (at temperature 0.8), though their probabilities fall off quite quickly (and, yes, the straight line on this log-log plot corresponds to an n⁻¹ “power-law” decay that’s very characteristic of the general statistics of language):
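One way to make such a plot (a sketch, reusing the assumed property names above):

```wolfram
(* Rank the top 50 next-token probabilities and plot them on log-log axes;
   a straight line of slope -1 here is the Zipf-like n^-1 decay *)
probs = Values[Association[model["The best thing about AI is its ability to",
     {"TopProbabilities", 50}]]];
ListLogLogPlot[ReverseSort[probs], Joined -> True]
```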

So what happens if one goes on longer? Here’s a random example. It’s better than the top-word (zero temperature) case, but still at best a bit weird:

This was done with the simplest GPT-2 model (from 2019). With the newer and bigger GPT-3 models the results are better. Here’s the top-word (zero temperature) text produced with the same “prompt”, but with the biggest GPT-3 model:

And here’s a random example at “temperature 0.8”:

Where Do the Probabilities Come From?

OK, so ChatGPT always picks its next word based on probabilities. But where do those probabilities come from? Let’s start with a simpler problem. Let’s consider generating English text one letter (rather than word) at a time. How can we work out what the probability for each letter should be?

A very minimal thing we could do is just take a sample of English text, and calculate how often different letters occur in it. So, for example, this counts letters in the Wikipedia article on “cats”:

And this does the same thing for “dogs”:
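Both counts are one-liners in Wolfram Language (LetterCounts and WikipediaData are built-in functions):

```wolfram
(* Tally how often each letter occurs in the two Wikipedia articles *)
LetterCounts[WikipediaData["cats"]]
LetterCounts[WikipediaData["dogs"]]
```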

The results are similar, but not the same (“o” is no doubt more common in the “dogs” article because, after all, it occurs in the word “dog” itself). Still, if we take a large enough sample of English text we can expect to eventually get at least fairly consistent results:

Here’s a sample of what we get if we just generate a sequence of letters with these probabilities:
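A minimal sketch of that sampling, using letter counts like the ones above as (unnormalized) weights:

```wolfram
(* Draw 100 letters independently, weighted by empirical frequency *)
counts = LetterCounts[ToLowerCase[WikipediaData["cats"]]];
StringJoin[RandomChoice[Values[counts] -> Keys[counts], 100]]
```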

We can break this into “words” by adding in spaces as if they were letters with a certain probability:

We can do a slightly better job of making “words” by forcing the distribution of “word lengths” to agree with what it is in English:

We didn’t happen to get any “actual words” here, but the results are looking slightly better. To go further, though, we need to do more than just pick each letter separately at random. And, for example, we know that if we have a “q”, the next letter basically has to be “u”.

Here’s a plot of the probabilities for letters on their own:

And here’s a plot that shows the probabilities of pairs of letters (“2-grams”) in typical English text. The possible first letters are shown across the page, the second letters down the page:

And we see here, for example, that the “q” column is blank (zero probability) except on the “u” row. OK, so now instead of generating our “words” a single letter at a time, let’s generate them looking at two letters at a time, using these “2-gram” probabilities. Here’s a sample of the result—which happens to include a few “actual words”:
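One way to sketch this in Wolfram Language is as a first-order Markov chain over letters (the source article and the starting letter are arbitrary choices here):

```wolfram
(* Empirical 2-gram counts from one article's letters *)
chars = Characters[StringDelete[ToLowerCase[WikipediaData["cats"]],
    Except[LetterCharacter]]];
pairs = Counts[Partition[chars, 2, 1]];

(* Sample the next letter conditioned on the current one *)
next[c_] := With[{cand = KeySelect[pairs, First[#] == c &]},
   Last[RandomChoice[Values[cand] -> Keys[cand]]]];
StringJoin[NestList[next, "t", 60]]
```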

With sufficiently much English text we can get pretty good estimates not just for probabilities of single letters or pairs of letters (2-grams), but also for longer runs of letters. And if we generate “random words” with progressively longer n-gram probabilities, we see that they get progressively “more realistic”:

But let’s now assume—more or less as ChatGPT does—that we’re dealing with whole words, not letters. There are about 40,000 reasonably commonly used words in English. And by looking at a large corpus of English text (say a few million books, with altogether a few hundred billion words), we can get an estimate of how common each word is. And using this we can start generating “sentences”, in which each word is independently picked at random, with the same probability that it appears in the corpus. Here’s a sample of what we get:
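As a toy stand-in for that giant corpus, one can sample words independently by their frequencies in a single article:

```wolfram
(* Pick 15 words at random, weighted by how often each occurs *)
wcounts = Counts[TextWords[ToLowerCase[WikipediaData["cats"]]]];
StringRiffle[RandomChoice[Values[wcounts] -> Keys[wcounts], 15]]
```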

Not surprisingly, this is nonsense. So how can we do better? Just like with letters, we can start taking into account not just probabilities for single words but probabilities for pairs or longer n-grams of words. Doing this for pairs, here are 5 examples of what we get, in all cases starting from the word “cat”:

It’s getting slightly more “sensible looking”. And we might imagine that if we were able to use sufficiently long n-grams we’d basically “get a ChatGPT”—in the sense that we’d get something that would generate essay-length sequences of words with the “correct overall essay probabilities”. But here’s the problem: there just isn’t even close to enough English text that’s ever been written to be able to deduce those probabilities.

In a crawl of the web there might be a few hundred billion words; in books that have been digitized there might be another hundred billion words. But with 40,000 common words, even the number of possible 2-grams is already 1.6 billion—and the number of possible 3-grams is 60 trillion. So there’s no way we can estimate the probabilities even for all of these from text that’s out there. And by the time we get to “essay fragments” of 20 words, the number of possibilities is larger than the number of particles in the universe, so in a sense they could never all be written down.
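(The counts are straightforward: 40,000² = 1.6 × 10⁹ possible 2-grams, and 40,000³ = 6.4 × 10¹³, i.e. on the order of 60 trillion 3-grams; a 20-word fragment has 40,000²⁰ ≈ 10⁹² possibilities, more than the roughly 10⁸⁰ particles in the observable universe.)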

So what can we do? The big idea is to make a model that lets us estimate the probabilities with which sequences should occur—even though we’ve never explicitly seen those sequences in the corpus of text we’ve looked at. And at the core of ChatGPT is precisely a so-called “large language model” (LLM) that’s been built to do a good job of estimating those probabilities.

What Is a Model?

Say you want to know (as Galileo did back in the late 1500s) how long it’s going to take a cannon ball dropped from each floor of the Tower of Pisa to hit the ground. Well, you could just measure it in each case and make a table of the results. Or you could do what is the essence of theoretical science: make a model that gives some kind of procedure for computing the answer rather than just measuring and remembering each case.

Let’s imagine we have (somewhat idealized) data for how long the cannon ball takes to fall from various floors:

How do we figure out how long it’s going to take to fall from a floor we don’t explicitly have data about? In this particular case, we can use known laws of physics to work it out. But say all we’ve got is the data, and we don’t know what underlying laws govern it. Then we might make a mathematical guess, like that perhaps we should use a straight line as a model:

We could pick different straight lines. But this is the one that’s on average closest to the data we’re given. And from this straight line we can estimate the time to fall for any floor.

How did we know to try using a straight line here? At some level we didn’t. It’s just something that’s mathematically simple, and we’re used to the fact that lots of data we measure turns out to be well fit by mathematically simple things. We could try something mathematically more complicated—say a + b x + c x²—and then in this case we do better:
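Both fits are a single call in Wolfram Language (here `data` is assumed to be the list of {floor, time} pairs from the plot above):

```wolfram
(* Least-squares fits with different basis functions *)
line = Fit[data, {1, x}, x];        (* straight line: a + b x *)
quad = Fit[data, {1, x, x^2}, x];   (* quadratic: a + b x + c x^2 *)
```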

Things can go quite wrong, though. Like here’s the best we can do with a + b/x + c sin(x):

It is worth understanding that there’s never a “model-less model”. Any model you use has some particular underlying structure—then a certain set of “knobs you can turn” (i.e. parameters you can set) to fit your data. And in the case of ChatGPT, lots of such “knobs” are used—actually, 175 billion of them.

But the remarkable thing is that the underlying structure of ChatGPT—with “just” that many parameters—is sufficient to make a model that computes next-word probabilities “well enough” to give us reasonable essay-length pieces of text.

Models for Human-Like Tasks

The example we gave above involves making a model for numerical data that essentially comes from simple physics—where we’ve known for several centuries that “simple mathematics applies”. But for ChatGPT we have to make a model of human-language text of the kind produced by a human brain. And for something like that we don’t (at least yet) have anything like “simple mathematics”. So what might a model of it be like?

Before we talk about language, let’s talk about another human-like task: recognizing images. And as a simple example of this, let’s consider images of digits (and, yes, this is a classic machine learning example):

One thing we could do is get a bunch of sample images for each digit:

Then to find out if an image we’re given as input corresponds to a particular digit we could just do an explicit pixel-by-pixel comparison with the samples we have. But as humans we certainly seem to do something better—because we can still recognize digits, even when they’re for example handwritten, and have all sorts of modifications and distortions:

When we made a model for our numerical data above, we were able to take a numerical value x that we were given, and just compute a + b x for particular a and b. So if we treat the gray-level value of each pixel here as some variable xᵢ, is there some function of all those variables that—when evaluated—tells us what digit the image is of? It turns out that it’s possible to construct such a function. Not surprisingly, it’s not particularly simple, though. And a typical example might involve perhaps half a million mathematical operations.

But the end result is that if we feed the collection of pixel values for an image into this function, out will come the number specifying which digit we have an image of. Later, we’ll talk about how such a function can be constructed, and the idea of neural nets. But for now let’s treat the function as black box, where we feed in images of, say, handwritten digits (as arrays of pixel values) and we get out the numbers these correspond to:

But what’s really going on here? Let’s say we progressively blur a digit. For a little while our function still “recognizes” it, here as a “2”. But soon it “loses it”, and starts giving the “wrong” result:

But why do we say it’s the “wrong” result? In this case, we know we got all the images by blurring a “2”. But if our goal is to produce a model of what humans can do in recognizing images, the real question to ask is what a human would have done if presented with one of those blurred images, without knowing where it came from.

And we have a “good model” if the results we get from our function typically agree with what a human would say. And the nontrivial scientific fact is that for an image-recognition task like this we now basically know how to construct functions that do this.

Can we “mathematically prove” that they work? Well, no. Because to do that we’d have to have a mathematical theory of what we humans are doing. Take the “2” image and change a few pixels. We might imagine that with only a few pixels “out of place” we should still consider the image a “2”. But how far should that go? It’s a question of human visual perception. And, yes, the answer would no doubt be different for bees or octopuses—and potentially utterly different for putative aliens.

Neural Nets

OK, so how do our typical models for tasks like image recognition actually work? The most popular—and successful—current approach uses neural nets. Invented—in a form remarkably close to their use today—in the 1940s, neural nets can be thought of as simple idealizations of how brains seem to work.

In human brains there are about 100 billion neurons (nerve cells), each capable of producing an electrical pulse up to perhaps a thousand times a second. The neurons are connected in a complicated net, with each neuron having tree-like branches allowing it to pass electrical signals to perhaps thousands of other neurons. And in a rough approximation, whether any given neuron produces an electrical pulse at a given moment depends on what pulses it’s received from other neurons—with different connections contributing with different “weights”.

When we “see an image” what’s happening is that when photons of light from the image fall on (“photoreceptor”) cells at the back of our eyes they produce electrical signals in nerve cells. These nerve cells are connected to other nerve cells, and eventually the signals go through a whole sequence of layers of neurons. And it’s in this process that we “recognize” the image, eventually “forming the thought” that we’re “seeing a 2” (and maybe in the end doing something like saying the word “two” out loud).

The “black-box” function from the previous section is a “mathematicized” version of such a neural net. It happens to have 11 layers (though only 4 “core layers”):

There’s nothing particularly “theoretically derived” about this neural net; it’s just something that—back in 1998—was constructed as a piece of engineering, and found to work. (Of course, that’s not much different from how we might describe our brains as having been produced through the process of biological evolution.)

OK, but how does a neural net like this “recognize things”? The key is the notion of attractors. Imagine we’ve got handwritten images of 1’s and 2’s:

We somehow want all the 1’s to “be attracted to one place”, and all the 2’s to “be attracted to another place”. Or, put a different way, if an image is somehow “closer to being a 1” than to being a 2, we want it to end up in the “1 place” and vice versa.

As a straightforward analogy, let’s say we have certain positions in the plane, indicated by dots (in a real-life setting they might be positions of coffee shops). Then we might imagine that starting from any point on the plane we’d always want to end up at the closest dot (i.e. we’d always go to the closest coffee shop). We can represent this by dividing the plane into regions (“attractor basins”) separated by idealized “watersheds”:

We can think of this as implementing a kind of “recognition task” in which we’re not doing something like identifying what digit a given image “looks most like”—but rather we’re just, quite directly, seeing what dot a given point is closest to. (The “Voronoi diagram” setup we’re showing here separates points in 2D Euclidean space; the digit recognition task can be thought of as doing something very similar—but in a 784-dimensional space formed from the gray levels of all the pixels in each image.)

So how do we make a neural net “do a recognition task”? Let’s consider this very simple case:

Our goal is to take an “input” corresponding to a position {x,y}—and then to “recognize” it as whichever of the three points it’s closest to. Or, in other words, we want the neural net to compute a function of {x,y} like:
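The target function itself is easy to state directly (the three anchor points here are made-up coordinates):

```wolfram
(* "Recognize" {x, y} as the index of the nearest of three points *)
pts = {{1, 2}, {-2, 0}, {2, -2}};
nearest[p : {_, _}] := First[Nearest[pts -> "Index", p]]
nearest[{0.5, 1.0}]  (* -> 1 *)
```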

So how do we do this with a neural net? Ultimately a neural net is a connected collection of idealized “neurons”—usually arranged in layers—with a simple example being:

Each “neuron” is effectively set up to evaluate a simple numerical function. And to “use” the network, we simply feed numbers (like our coordinates x and y) in at the top, then have neurons on each layer “evaluate their functions” and feed the results forward through the network—eventually producing the final result at the bottom:

In the traditional (biologically inspired) setup each neuron effectively has a certain set of “incoming connections” from the neurons on the previous layer, with each connection being assigned a certain “weight” (which can be a positive or negative number). The value of a given neuron is determined by multiplying the values of “previous neurons” by their corresponding weights, then adding these up and adding a constant—and finally applying a “thresholding” (or “activation”) function. In mathematical terms, if a neuron has inputs x = {x₁, x₂, …} then we compute f[w . x + b], where the weights w and constant b are generally chosen differently for each neuron in the network; the function f is usually the same.

Computing w . x + b is just a matter of matrix multiplication and addition. The “activation function” f introduces nonlinearity (and ultimately is what leads to nontrivial behavior). Various activation functions commonly get used; here we’ll just use Ramp (or ReLU):
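As a one-line Wolfram Language sketch of a single neuron:

```wolfram
(* Weighted sum of inputs, plus a bias, passed through ReLU *)
neuron[x_List, w_List, b_] := Ramp[w . x + b]
neuron[{2.0, -1.5}, {0.3, -0.8}, 0.1]  (* -> 1.9 *)
```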

For each task we want the neural net to perform (or, equivalently, for each overall function we want it to evaluate) we’ll have different choices of weights. (And—as we’ll discuss later—these weights are normally determined by “training” the neural net using machine learning from examples of the outputs we want.)

Ultimately, every neural net just corresponds to some overall mathematical function—though it may be messy to write out. For the example above, it would be:

The neural net of ChatGPT also just corresponds to a mathematical function like this—but effectively with billions of terms.

But let’s go back to individual neurons. Here are some examples of the functions a neuron with two inputs (representing coordinates x and y) can compute with various choices of weights and constants (and Ramp as activation function):

But what about the larger network from above? Well, here’s what it computes:

It’s not quite “right”, but it’s close to the “nearest point” function we showed above.

Let’s see what happens with some other neural nets. In each case, as we’ll explain later, we’re using machine learning to find the best choice of weights. Then we’re showing here what the neural net with those weights computes:

Bigger networks generally do better at approximating the function we’re aiming for. And in the “middle of each attractor basin” we typically get exactly the answer we want. But at the boundaries—where the neural net “has a hard time making up its mind”—things can be messier.

With this simple mathematical-style “recognition task” it’s clear what the “right answer” is. But in the problem of recognizing handwritten digits, it’s not so clear. What if someone wrote a “2” so badly it looked like a “7”, etc.? Still, we can ask how a neural net distinguishes digits—and this gives an indication:

Can we say “mathematically” how the network makes its distinctions? Not really. It’s just “doing what the neural net does”. But it turns out that that normally seems to agree fairly well with the distinctions we humans make.

Let’s take a more elaborate example. Let’s say we have images of cats and dogs. And we have a neural net that’s been trained to distinguish them. Here’s what it might do on some examples:

Now it’s even less clear what the “right answer” is. What about a dog dressed in a cat suit? Etc. Whatever input it’s given the neural net will generate an answer, and in a way reasonably consistent with how humans might. As I’ve said above, that’s not a fact we can “derive from first principles”. It’s just something that’s empirically been found to be true, at least in certain domains. But it’s a key reason why neural nets are useful: that they somehow capture a “human-like” way of doing things.

Show yourself a picture of a cat, and ask “Why is that a cat?”. Maybe you’d start saying “Well, I see its pointy ears, etc.” But it’s not very easy to explain how you recognized the image as a cat. It’s just that somehow your brain figured that out. But for a brain there’s no way (at least yet) to “go inside” and see how it figured it out. What about for an (artificial) neural net? Well, it’s straightforward to see what each “neuron” does when you show a picture of a cat. But even to get a basic visualization is usually very difficult.

In the final net that we used for the “nearest point” problem above there are 17 neurons. In the net for recognizing handwritten digits there are 2190. And in the net we’re using to recognize cats and dogs there are 60,650. Normally it would be pretty difficult to visualize what amounts to 60,650-dimensional space. But because this is a network set up to deal with images, many of its layers of neurons are organized into arrays, like the arrays of pixels it’s looking at.

And if we take a typical cat image


then we can represent the states of neurons at the first layer by a collection of derived images—many of which we can readily interpret as being things like “the cat without its background”, or “the outline of the cat”:

By the 10th layer it’s harder to interpret what’s going on:

But in general we might say that the neural net is “picking out certain features” (maybe pointy ears are among them), and using these to determine what the image is of. But are those features ones for which we have names—like “pointy ears”? Mostly not.

Are our brains using similar features? Mostly we don’t know. But it’s notable that the first few layers of a neural net like the one we’re showing here seem to pick out aspects of images (like edges of objects) that seem to be similar to ones we know are picked out by the first level of visual processing in brains.

But let’s say we want a “theory of cat recognition” in neural nets. We can say: “Look, this particular net does it”—and immediately that gives us some sense of “how hard a problem” it is (and, for example, how many neurons or layers might be needed). But at least as of now we don’t have a way to “give a narrative description” of what the network is doing. And maybe that’s because it truly is computationally irreducible, and there’s no general way to find what it does except by explicitly tracing each step. Or maybe it’s just that we haven’t “figured out the science”, and identified the “natural laws” that allow us to summarize what’s going on.

We’ll encounter the same kinds of issues when we talk about generating language with ChatGPT. And again it’s not clear whether there are ways to “summarize what it’s doing”. But the richness and detail of language (and our experience with it) may allow us to get further than with images.

Machine Learning, and the Training of Neural Nets

We’ve been talking so far about neural nets that “already know” how to do particular tasks. But what makes neural nets so useful (presumably also in brains) is that not only can they in principle do all sorts of tasks, but they can be incrementally “trained from examples” to do those tasks.

When we make a neural net to distinguish cats from dogs we don’t effectively have to write a program that (say) explicitly finds whiskers; instead we just show lots of examples of what’s a cat and what’s a dog, and then have the network “machine learn” from these how to distinguish them.

And the point is that the trained network “generalizes” from the particular examples it’s shown. Just as we’ve seen above, it isn’t simply that the network recognizes the particular pixel pattern of an example cat image it was shown; rather it’s that the neural net somehow manages to distinguish images on the basis of what we consider to be some kind of “general catness”.

So how does neural net training actually work? Essentially what we’re always trying to do is to find weights that make the neural net successfully reproduce the examples we’ve given. And then we’re relying on the neural net to “interpolate” (or “generalize”) “between” these examples in a “reasonable” way.

Let’s look at a problem even simpler than the nearest-point one above. Let’s just try to get a neural net to learn the function:

For this task, we’ll need a network that has just one input and one output, like:

But what weights, etc. should we be using? With every possible set of weights the neural net will compute some function. And, for example, here’s what it does with a few randomly chosen sets of weights:

And, yes, we can plainly see that in none of these cases does it get even close to reproducing the function we want. So how do we find weights that will reproduce the function?

The basic idea is to supply lots of “input → output” examples to “learn from”—and then to try to find weights that will reproduce these examples. Here’s the result of doing that with progressively more examples:
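In Wolfram Language this whole loop is essentially one NetTrain call; a minimal sketch (the layer sizes and the target function are illustrative stand-ins, not the ones in the figures):

```wolfram
(* A tiny one-input, one-output net trained on sampled {x -> f(x)} pairs *)
net = NetChain[{LinearLayer[10], Ramp, LinearLayer[10], Ramp, LinearLayer[1]},
   "Input" -> 1, "Output" -> 1];
data = Table[{x} -> {Sin[2 x]}, {x, RandomReal[{-3, 3}, 500]}];
trained = NetTrain[net, data, LossFunction -> MeanSquaredLossLayer[]]
```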

At each stage in this “training” the weights in the network are progressively adjusted—and we see that eventually we get a network that successfully reproduces the function we want. So how do we adjust the weights? The basic idea is at each stage to see “how far away we are” from getting the function we want—and then to update the weights in such a way as to get closer.

To find out “how far away we are” we compute what’s usually called a “loss function” (or sometimes “cost function”). Here we’re using a simple (L2) loss function that’s just the sum of the squares of the differences between the values we get, and the true values. And what we see is that as our training process progresses, the loss function progressively decreases (following a certain “learning curve” that’s different for different tasks)—until we reach a point where the network (at least to a good approximation) successfully reproduces the function we want:
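In symbols, for training pairs (xᵢ, yᵢ) and a net computing f_w, this L2 loss is L(w) = Σᵢ (f_w(xᵢ) − yᵢ)².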

Alright, so the last essential piece to explain is how the weights are adjusted to reduce the loss function. As we’ve said, the loss function gives us a “distance” between the values we’ve got, and the true values. But the “values we’ve got” are determined at each stage by the current version of neural net—and by the weights in it. But now imagine that the weights are variables—say wi. We want to find out how to adjust the values of these variables to minimize the loss that depends on them.

For example, imagine (in an incredible simplification of typical neural nets used in practice) that we have just two weights w₁ and w₂. Then we might have a loss that as a function of w₁ and w₂ looks like this:

Numerical analysis provides a variety of techniques for finding the minimum in cases like this. But a typical approach is just to progressively follow the path of steepest descent from whatever previous w₁, w₂ we had:
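A naive sketch of that descent on a made-up two-weight loss surface (the surface and the step size are purely illustrative):

```wolfram
(* Gradient descent: repeatedly step against the local gradient *)
loss[w1_, w2_] := (w1^2 + w2 - 11)^2 + (w1 + w2^2 - 7)^2;
step[{a_, b_}, eta_] := {a, b} -
   eta ({D[loss[u, v], u], D[loss[u, v], v]} /. {u -> a, v -> b});
path = NestList[step[#, 0.01] &, {0., 0.}, 50]
(* which "mountain lake" the path ends in depends on where it starts *)
```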

Like water flowing down a mountain, all that’s guaranteed is that this procedure will end up at some local minimum of the surface (“a mountain lake”); it might well not reach the ultimate global minimum.

It’s not obvious that it would be feasible to find the path of the steepest descent on the “weight landscape”. But calculus comes to the rescue. As we mentioned above, one can always think of a neural net as computing a mathematical function—that depends on its inputs, and its weights. But now consider differentiating with respect to these weights. It turns out that the chain rule of calculus in effect lets us “unravel” the operations done by successive layers in the neural net. And the result is that we can—at least in some local approximation—“invert” the operation of the neural net, and progressively find weights that minimize the loss associated with the output.

The picture above shows the kind of minimization we might need to do in the unrealistically simple case of just 2 weights. But it turns out that even with many more weights (ChatGPT uses 175 billion) it’s still possible to do the minimization, at least to some level of approximation. And in fact the big breakthrough in “deep learning” that occurred around 2011 was associated with the discovery that in some sense it can be easier to do (at least approximate) minimization when there are lots of weights involved than when there are fairly few.

In other words—somewhat counterintuitively—it can be easier to solve more complicated problems with neural nets than simpler ones. And the rough reason for this seems to be that when one has a lot of “weight variables” one has a high-dimensional space with “lots of different directions” that can lead one to the minimum—whereas with fewer variables it’s easier to end up getting stuck in a local minimum (“mountain lake”) from which there’s no “direction to get out”.

It’s worth pointing out that in typical cases there are many different collections of weights that will all give neural nets that have pretty much the same performance. And usually in practical neural net training there are lots of random choices made—that lead to “different-but-equivalent solutions”, like these:

But each such “different solution” will have at least slightly different behavior. And if we ask, say, for an “extrapolation” outside the region where we gave training examples, we can get dramatically different results:

But which of these is “right”? There’s really no way to say. They’re all “consistent with the observed data”. But they all correspond to different “innate” ways to “think about” what to do “outside the box”. And some may seem “more reasonable” to us humans than others.

The Practice and Lore of Neural Net Training

Particularly over the past decade, there’ve been many advances in the art of training neural nets. And, yes, it is basically an art. Sometimes—especially in retrospect—one can see at least a glimmer of a “scientific explanation” for something that’s being done. But mostly things have been discovered by trial and error, adding ideas and tricks that have progressively built a significant lore about how to work with neural nets.

There are several key parts. First, there’s the matter of what architecture of neural net one should use for a particular task. Then there’s the critical issue of how one’s going to get the data on which to train the neural net. And increasingly one isn’t dealing with training a net from scratch: instead a new net can either directly incorporate another already-trained net, or at least can use that net to generate more training examples for itself.

One might have thought that for every particular kind of task one would need a different architecture of neural net. But what’s been found is that the same architecture often seems to work even for apparently quite different tasks. At some level this reminds one of the idea of universal computation (and my Principle of Computational Equivalence), but, as I’ll discuss later, I think it’s more a reflection of the fact that the tasks we’re typically trying to get neural nets to do are “human-like” ones—and neural nets can capture quite general “human-like processes”.

In earlier days of neural nets, there tended to be the idea that one should “make the neural net do as little as possible”. For example, in converting speech to text it was thought that one should first analyze the audio of the speech, break it into phonemes, etc. But what was found is that—at least for “human-like tasks”—it’s usually better just to try to train the neural net on the “end-to-end problem”, letting it “discover” the necessary intermediate features, encodings, etc. for itself.

There was also the idea that one should introduce complicated individual components into the neural net, to let it in effect “explicitly implement particular algorithmic ideas”. But once again, this has mostly turned out not to be worthwhile; instead, it’s better just to deal with very simple components and let them “organize themselves” (albeit usually in ways we can’t understand) to achieve (presumably) the equivalent of those algorithmic ideas.

That’s not to say that there are no “structuring ideas” that are relevant for neural nets. Thus, for example, having 2D arrays of neurons with local connections seems at least very useful in the early stages of processing images. And having patterns of connectivity that concentrate on “looking back in sequences” seems useful—as we’ll see later—in dealing with things like human language, for example in ChatGPT.

But an important feature of neural nets is that—like computers in general—they’re ultimately just dealing with data. And current neural nets—with current approaches to neural net training—specifically deal with arrays of numbers. But in the course of processing, those arrays can be completely rearranged and reshaped. And as an example, the network we used for identifying digits above starts with a 2D “image-like” array, quickly “thickening” to many channels, but then “concentrating down” into a 1D array that will ultimately contain elements representing the different possible output digits:

But, OK, how can one tell how big a neural net one will need for a particular task? It’s something of an art. At some level the key thing is to know “how hard the task is”. But for human-like tasks that’s typically very hard to estimate. Yes, there may be a systematic way to do the task very “mechanically” by computer. But it’s hard to know if there are what one might think of as tricks or shortcuts that allow one to do the task at least at a “human-like level” vastly more easily. It might take enumerating a giant game tree to “mechanically” play a certain game; but there might be a much easier (“heuristic”) way to achieve “human-level play”.

When one’s dealing with tiny neural nets and simple tasks one can sometimes explicitly see that one “can’t get there from here”. For example, here’s the best one seems to be able to do on the task from the previous section with a few small neural nets:

And what we see is that if the net is too small, it just can’t reproduce the function we want. But above some size, it has no problem—at least if one trains it for long enough, with enough examples. And, by the way, these pictures illustrate a piece of neural net lore: that one can often get away with a smaller network if there’s a “squeeze” in the middle that forces everything to go through a smaller intermediate number of neurons. (It’s also worth mentioning that “no-intermediate-layer”—or so-called “perceptron”—networks can only learn essentially linear functions—but as soon as there’s even one intermediate layer it’s always in principle possible to approximate any function arbitrarily well, at least if one has enough neurons, though to make it feasibly trainable one typically has some kind of regularization or normalization.)

OK, so let’s say one’s settled on a certain neural net architecture. Now there’s the issue of getting data to train the network with. And many of the practical challenges around neural nets—and machine learning in general—center on acquiring or preparing the necessary training data. In many cases (“supervised learning”) one wants to get explicit examples of inputs and the outputs one is expecting from them. Thus, for example, one might want images tagged by what’s in them, or some other attribute. And maybe one will have to explicitly go through—usually with great effort—and do the tagging. But very often it turns out to be possible to piggyback on something that’s already been done, or use it as some kind of proxy. And so, for example, one might use alt tags that have been provided for images on the web. Or, in a different domain, one might use closed captions that have been created for videos. Or—for language translation training—one might use parallel versions of webpages or other documents that exist in different languages.

How much data do you need to show a neural net to train it for a particular task? Again, it’s hard to estimate from first principles. Certainly the requirements can be dramatically reduced by using “transfer learning” to “transfer in” things like lists of important features that have already been learned in another network. But generally neural nets need to “see a lot of examples” to train well. And at least for some tasks it’s an important piece of neural net lore that the examples can be incredibly repetitive. And indeed it’s a standard strategy to just show a neural net all the examples one has, over and over again. In each of these “training rounds” (or “epochs”) the neural net will be in at least a slightly different state, and somehow “reminding it” of a particular example is useful in getting it to “remember that example”. (And, yes, perhaps this is analogous to the usefulness of repetition in human memorization.)

But often just repeating the same example over and over again isn’t enough. It’s also necessary to show the neural net variations of the example. And it’s a feature of neural net lore that those “data augmentation” variations don’t have to be sophisticated to be useful. Just slightly modifying images with basic image processing can make them essentially “as good as new” for neural net training. And, similarly, when one’s run out of actual video, etc. for training self-driving cars, one can go on and just get data from running simulations in a model videogame-like environment without all the detail of actual real-world scenes.

How about something like ChatGPT? Well, it has the nice feature that it can do “unsupervised learning”, making it much easier to get it examples to train from. Recall that the basic task for ChatGPT is to figure out how to continue a piece of text that it’s been given. So to get it “training examples” all one has to do is get a piece of text, and mask out the end of it, and then use this as the “input to train from”—with the “output” being the complete, unmasked piece of text. We’ll discuss this more later, but the main point is that—unlike, say, for learning what’s in images—there’s no “explicit tagging” needed; ChatGPT can in effect just learn directly from whatever examples of text it’s given.
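A minimal sketch of how such training pairs can be manufactured from raw text (in practice the split point would be chosen at token boundaries, and varied):

```wolfram
(* Mask off the end of a passage: prompt -> continuation *)
makeExample[text_String, n_Integer] :=
  StringTake[text, n] -> StringDrop[text, n]
makeExample["The best thing about AI is its ability to learn", 31]
(* -> "The best thing about AI is its " -> "ability to learn" *)
```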

OK, so what about the actual learning process in a neural net? In the end it’s all about determining what weights will best capture the training examples that have been given. And there are all sorts of detailed choices and “hyperparameter settings” (so called because the weights can be thought of as “parameters”) that can be used to tweak how this is done. There are different choices of loss function (sum of squares, sum of absolute values, etc.). There are different ways to do loss minimization (how far in weight space to move at each step, etc.). And then there are questions like how big a “batch” of examples to show to get each successive estimate of the loss one’s trying to minimize. And, yes, one can apply machine learning (as we do, for example, in Wolfram Language) to automate machine learning—and to automatically set things like hyperparameters.

But in the end the whole process of training can be characterized by seeing how the loss progressively decreases (as in this Wolfram Language progress monitor for a small training):

And what one typically sees is that the loss decreases for a while, but eventually flattens out at some constant value. If that value is sufficiently small, then the training can be considered successful; otherwise it’s probably a sign one should try changing the network architecture.

Can one tell how long it should take for the “learning curve” to flatten out? Like for so many other things, there seem to be approximate power-law scaling relationships that depend on the size of neural net and amount of data one’s using. But the general conclusion is that training a neural net is hard—and takes a lot of computational effort. And as a practical matter, the vast majority of that effort is spent doing operations on arrays of numbers, which is what GPUs are good at—which is why neural net training is typically limited by the availability of GPUs.

In the future, will there be fundamentally better ways to train neural nets—or generally do what neural nets do? Almost certainly, I think. The fundamental idea of neural nets is to create a flexible “computing fabric” out of a large number of simple (essentially identical) components—and to have this “fabric” be one that can be incrementally modified to learn from examples. In current neural nets, one’s essentially using the ideas of calculus—applied to real numbers—to do that incremental modification. But it’s increasingly clear that having high-precision numbers doesn’t matter; 8 bits or less might be enough even with current methods.

With computational systems like cellular automata that basically operate in parallel on many individual bits it’s never been clear how to do this kind of incremental modification, but there’s no reason to think it isn’t possible. And in fact, much like with the “deep-learning breakthrough of 2012” it may be that such incremental modification will effectively be easier in more complicated cases than in simple ones.

Neural nets—perhaps a bit like brains—are set up to have an essentially fixed network of neurons, with what’s modified being the strength (“weight”) of connections between them. (Perhaps in at least young brains significant numbers of wholly new connections can also grow.) But while this might be a convenient setup for biology, it’s not at all clear that it’s even close to the best way to achieve the functionality we need. And something that involves the equivalent of progressive network rewriting (perhaps reminiscent of our Physics Project) might well ultimately be better.

But even within the framework of existing neural nets there’s currently a crucial limitation: neural net training as it’s now done is fundamentally sequential, with the effects of each batch of examples being propagated back to update the weights. And indeed with current computer hardware—even taking into account GPUs—most of a neural net is “idle” most of the time during training, with just one part at a time being updated. And in a sense this is because our current computers tend to have memory that is separate from their CPUs (or GPUs). But in brains it’s presumably different—with every “memory element” (i.e. neuron) also being a potentially active computational element. And if we could set up our future computer hardware this way it might become possible to do training much more efficiently.

“Surely a Network That’s Big Enough Can Do Anything!”

The capabilities of something like ChatGPT seem so impressive that one might imagine that if one could just “keep going” and train larger and larger neural networks, then they’d eventually be able to “do everything”. And if one’s concerned with things that are readily accessible to immediate human thinking, it’s quite possible that this is the case. But the lesson of the past several hundred years of science is that there are things that can be figured out by formal processes, but aren’t readily accessible to immediate human thinking.

Nontrivial mathematics is one big example. But the general case is really computation. And ultimately the issue is the phenomenon of computational irreducibility. There are some computations which one might think would take many steps to do, but which can in fact be “reduced” to something quite immediate. But the discovery of computational irreducibility implies that this doesn’t always work. And instead there are processes—probably like the one below—where to work out what happens inevitably requires essentially tracing each computational step:

The kinds of things that we normally do with our brains are presumably specifically chosen to avoid computational irreducibility. It takes special effort to do math in one’s brain. And it’s in practice largely impossible to “think through” the steps in the operation of any nontrivial program just in one’s brain.

But of course for that we have computers. And with computers we can readily do long, computationally irreducible things. And the key point is that there’s in general no shortcut for these.

Yes, we could memorize lots of specific examples of what happens in some particular computational system. And maybe we could even see some (“computationally reducible”) patterns that would allow us to do a little generalization. But the point is that computational irreducibility means that we can never guarantee that the unexpected won’t happen—and it’s only by explicitly doing the computation that you can tell what actually happens in any particular case.

And in the end there’s just a fundamental tension between learnability and computational irreducibility. Learning involves in effect compressing data by leveraging regularities. But computational irreducibility implies that ultimately there’s a limit to what regularities there may be.

As a practical matter, one can imagine building little computational devices—like cellular automata or Turing machines—into trainable systems like neural nets. And indeed such devices can serve as good “tools” for the neural net—like Wolfram|Alpha can be a good tool for ChatGPT. But computational irreducibility implies that one can’t expect to “get inside” those devices and have them learn.

Or put another way, there’s an ultimate tradeoff between capability and trainability: the more you want a system to make “true use” of its computational capabilities, the more it’s going to show computational irreducibility, and the less it’s going to be trainable. And the more it’s fundamentally trainable, the less it’s going to be able to do sophisticated computation.

(For ChatGPT as it currently is, the situation is actually much more extreme, because the neural net used to generate each token of output is a pure “feed-forward” network, without loops, and therefore has no ability to do any kind of computation with nontrivial “control flow”.)

Of course, one might wonder whether it’s actually important to be able to do irreducible computations. And indeed for much of human history it wasn’t particularly important. But our modern technological world has been built on engineering that makes use of at least mathematical computations—and increasingly also more general computations. And if we look at the natural world, it’s full of irreducible computation—that we’re slowly understanding how to emulate and use for our technological purposes.

Yes, a neural net can certainly notice the kinds of regularities in the natural world that we might also readily notice with “unaided human thinking”. But if we want to work out things that are in the purview of mathematical or computational science the neural net isn’t going to be able to do it—unless it effectively “uses as a tool” an “ordinary” computational system.

But there’s something potentially confusing about all of this. In the past there were plenty of tasks—including writing essays—that we’ve assumed were somehow “fundamentally too hard” for computers. And now that we see them done by the likes of ChatGPT we tend to suddenly think that computers must have become vastly more powerful—in particular surpassing things they were already basically able to do (like progressively computing the behavior of computational systems like cellular automata).

But this isn’t the right conclusion to draw. Computationally irreducible processes are still computationally irreducible, and are still fundamentally hard for computers—even if computers can readily compute their individual steps. And instead what we should conclude is that tasks—like writing essays—that we humans could do, but we didn’t think computers could do, are actually in some sense computationally easier than we thought.

In other words, the reason a neural net can be successful in writing an essay is because writing an essay turns out to be a “computationally shallower” problem than we thought. And in a sense this takes us closer to “having a theory” of how we humans manage to do things like writing essays, or in general deal with language.

If you had a big enough neural net then, yes, you might be able to do whatever humans can readily do. But you wouldn’t capture what the natural world in general can do—or that the tools that we’ve fashioned from the natural world can do. And it’s the use of those tools—both practical and conceptual—that have allowed us in recent centuries to transcend the boundaries of what’s accessible to “pure unaided human thought”, and capture for human purposes more of what’s out there in the physical and computational universe.

The Concept of Embeddings

Neural nets—at least as they’re currently set up—are fundamentally based on numbers. So if we’re going to use them to work on something like text we’ll need a way to represent our text with numbers. And certainly we could start (essentially as ChatGPT does) by just assigning a number to every word in the dictionary. But there’s an important idea—that’s for example central to ChatGPT—that goes beyond that. And it’s the idea of “embeddings”. One can think of an embedding as a way to try to represent the “essence” of something by an array of numbers—with the property that “nearby things” are represented by nearby numbers.

And so, for example, we can think of a word embedding as trying to lay out words in a kind of “meaning space” in which words that are somehow “nearby in meaning” appear nearby in the embedding. The actual embeddings that are used—say in ChatGPT—tend to involve large lists of numbers. But if we project down to 2D, we can show examples of how words are laid out by the embedding:

And, yes, what we see does remarkably well in capturing typical everyday impressions. But how can we construct such an embedding? Roughly the idea is to look at large amounts of text (here 5 billion words from the web) and then see “how similar” the “environments” are in which different words appear. So, for example, “alligator” and “crocodile” will often appear almost interchangeably in otherwise similar sentences, and that means they’ll be placed nearby in the embedding. But “turnip” and “eagle” won’t tend to appear in otherwise similar sentences, so they’ll be placed far apart in the embedding.
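
As a minimal sketch of this idea (and emphatically not the actual method used to build real embeddings), one can characterize each word by counts of the words that occur around it in a toy corpus, and then compare those count vectors:

```python
# A toy illustration: characterize each word by counts of the
# words that appear near it, then compare those counts.
from collections import Counter

corpus = ("the alligator swam in the river . "
          "the crocodile swam in the river . "
          "the eagle flew over the mountain .").split()

def context_counts(word, window=2):
    # Collect the words appearing within `window` positions of `word`.
    counts = Counter()
    for i, w in enumerate(corpus):
        if w == word:
            for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                if j != i:
                    counts[corpus[j]] += 1
    return counts

def similarity(w1, w2):
    # Cosine similarity between the two context-count vectors.
    c1, c2 = context_counts(w1), context_counts(w2)
    dot = sum(c1[k] * c2[k] for k in c1)
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return dot / (norm(c1) * norm(c2))

print(similarity("alligator", "crocodile"))  # high: similar environments
print(similarity("alligator", "eagle"))      # lower: different environments
```

With real embeddings the “environments” are captured by a trained neural net rather than by raw counts, but the underlying intuition, that similar environments imply similar representations, is the same.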

But how does one actually implement something like this using neural nets? Let’s start by talking about embeddings not for words, but for images. We want to find some way to characterize images by lists of numbers in such a way that “images we consider similar” are assigned similar lists of numbers.

How do we tell if we should “consider images similar”? Well, if our images are, say, of handwritten digits we might “consider two images similar” if they are of the same digit. Earlier we discussed a neural net that was trained to recognize handwritten digits. And we can think of this neural net as being set up so that in its final output it puts images into 10 different bins, one for each digit.

But what if we “intercept” what’s going on inside the neural net before the final “it’s a ‘4’” decision is made? We might expect that inside the neural net there are numbers that characterize images as being “mostly 4-like but a bit 2-like” or some such. And the idea is to pick up such numbers to use as elements in an embedding.

So here’s the concept. Rather than directly trying to characterize “what image is near what other image”, we instead consider a well-defined task (in this case digit recognition) for which we can get explicit training data—then use the fact that in doing this task the neural net implicitly has to make what amount to “nearness decisions”. So instead of us ever explicitly having to talk about “nearness of images” we’re just talking about the concrete question of what digit an image represents, and then we’re “leaving it to the neural net” to implicitly determine what that implies about “nearness of images”.

So how in more detail does this work for the digit recognition network? We can think of the network as consisting of 11 successive layers, that we might summarize iconically like this (with activation functions shown as separate layers):

At the beginning we’re feeding into the first layer actual images, represented by 2D arrays of pixel values. And at the end—from the last layer—we’re getting out an array of 10 values, which we can think of as saying “how certain” the network is that the image corresponds to each of the digits 0 through 9.

Feed in the image (here a handwritten 4) and the values of the neurons in that last layer are:

In other words, the neural net is by this point “incredibly certain” that this image is a 4—and to actually get the output “4” we just have to pick out the position of the neuron with the largest value.

But what if we look one step earlier? The very last operation in the network is a so-called softmax which tries to “force certainty”. But before that’s been applied the values of the neurons are:
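
The softmax operation itself is simple to state: exponentiate each value and normalize so the results sum to 1, which strongly amplifies whichever value is largest. Here is a worked example with made-up pre-softmax values (the actual numbers from the figure aren't reproduced in this snapshot):

```python
import numpy as np

# Hypothetical pre-softmax values for digits 0-9 (illustrative only);
# digit 4 has the largest value.
values = np.array([-2.1, 1.3, 3.0, -0.5, 9.2, -1.7, 0.2, 1.1, 2.4, -0.9])

probs = np.exp(values - values.max())   # exponentiate (shifted for stability)
probs /= probs.sum()                    # normalize to sum to 1

print(probs.round(4))       # the entry for 4 dominates: "forced certainty"
print(int(probs.argmax()))  # -> 4
```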

The neuron representing “4” still has the highest numerical value. But there’s also information in the values of the other neurons. And we can expect that this list of numbers can in a sense be used to characterize the “essence” of the image—and thus to provide something we can use as an embedding. And so, for example, each of the 4’s here has a slightly different “signature” (or “feature embedding”)—all very different from the 8’s:

Here we’re essentially using 10 numbers to characterize our images. But it’s often better to use much more than that. And for example in our digit recognition network we can get an array of 500 numbers by tapping into the preceding layer. And this is probably a reasonable array to use as an “image embedding”.

If we want to make an explicit visualization of “image space” for handwritten digits we need to “reduce the dimension”, effectively by projecting the 500-dimensional vector we’ve got into, say, 3D space:

We’ve just talked about creating a characterization (and thus embedding) for images based effectively on identifying the similarity of images by determining whether (according to our training set) they correspond to the same handwritten digit. And we can do the same thing much more generally for images if we have a training set that identifies, say, which of 5000 common types of object (cat, dog, chair, …) each image is of. And in this way we can make an image embedding that’s “anchored” by our identification of common objects, but then “generalizes around that” according to the behavior of the neural net. And the point is that insofar as that behavior aligns with how we humans perceive and interpret images, this will end up being an embedding that “seems right to us”, and is useful in practice in doing “human-judgement-like” tasks.

OK, so how do we follow the same kind of approach to find embeddings for words? The key is to start from a task about words for which we can readily do training. And the standard such task is “word prediction”. Imagine we’re given “the ___ cat”. Based on a large corpus of text (say, the text content of the web), what are the probabilities for different words that might “fill in the blank”? Or, alternatively, given “___ black ___” what are the probabilities for different “flanking words”?

How do we set this problem up for a neural net? Ultimately we have to formulate everything in terms of numbers. And one way to do this is just to assign a unique number to each of the 50,000 or so common words in English. So, for example, “the” might be 914, and “ cat” (with a space before it) might be 3542. (And these are the actual numbers used by GPT-2.) So for the “the ___ cat” problem, our input might be {914, 3542}. What should the output be like? Well, it should be a list of 50,000 or so numbers that effectively give the probabilities for each of the possible “fill-in” words. And once again, to find an embedding, we want to “intercept” the “insides” of the neural net just before it “reaches its conclusion”—and then pick up the list of numbers that occur there, and that we can think of as “characterizing each word”.

OK, so what do those characterizations look like? Over the past 10 years there’s been a sequence of different systems developed (word2vec, GloVe, BERT, GPT, …), each based on a different neural net approach. But ultimately all of them take words and characterize them by lists of hundreds to thousands of numbers.

In their raw form, these “embedding vectors” are quite uninformative. For example, here’s what GPT-2 produces as the raw embedding vectors for three specific words:

If we do things like measure distances between these vectors, then we can find things like “nearnesses” of words. Later we’ll discuss in more detail what we might consider the “cognitive” significance of such embeddings. But for now the main point is that we have a way to usefully turn words into “neural-net-friendly” collections of numbers.
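
A common such measure is cosine distance. Here is a sketch with invented 3-element vectors standing in for the real 768-element GPT-2 ones:

```python
import numpy as np

def cosine_distance(u, v):
    # Smaller distance = "nearer" under this measure.
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented toy vectors; real GPT-2 embedding vectors have 768 elements.
cat = np.array([0.9, 0.1, 0.3])
dog = np.array([0.8, 0.2, 0.35])
car = np.array([-0.5, 0.9, 0.0])

print(cosine_distance(cat, dog))   # small: "cat" and "dog" are close
print(cosine_distance(cat, car))   # larger: "cat" and "car" are far apart
```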

But actually we can go further than just characterizing words by collections of numbers; we can also do this for sequences of words, or indeed whole blocks of text. And inside ChatGPT that’s how it’s dealing with things. It takes the text it’s got so far, and generates an embedding vector to represent it. Then its goal is to find the probabilities for different words that might occur next. And it represents its answer for this as a list of numbers that essentially give the probabilities for each of the 50,000 or so possible words.

(Strictly, ChatGPT does not deal with words, but rather with “tokens”—convenient linguistic units that might be whole words, or might just be pieces like “pre” or “ing” or “ized”. Working with tokens makes it easier for ChatGPT to handle rare, compound and non-English words, and, sometimes, for better or worse, to invent new words.)

Inside ChatGPT

OK, so we’re finally ready to discuss what’s inside ChatGPT. And, yes, ultimately, it’s a giant neural net—currently a version of the so-called GPT-3 network with 175 billion weights. In many ways this is a neural net very much like the other ones we’ve discussed. But it’s a neural net that’s particularly set up for dealing with language. And its most notable feature is a piece of neural net architecture called a “transformer”.

In the first neural nets we discussed above, every neuron at any given layer was basically connected (at least with some weight) to every neuron on the layer before. But this kind of fully connected network is (presumably) overkill if one’s working with data that has particular, known structure. And thus, for example, in the early stages of dealing with images, it’s typical to use so-called convolutional neural nets (“convnets”) in which neurons are effectively laid out on a grid analogous to the pixels in the image—and connected only to neurons nearby on the grid.

The idea of transformers is to do something at least somewhat similar for sequences of tokens that make up a piece of text. But instead of just defining a fixed region in the sequence over which there can be connections, transformers instead introduce the notion of “attention”—and the idea of “paying attention” more to some parts of the sequence than others. Maybe one day it’ll make sense to just start a generic neural net and do all customization through training. But at least as of now it seems to be critical in practice to “modularize” things—as transformers do, and probably as our brains also do.

OK, so what does ChatGPT (or, rather, the GPT-3 network on which it’s based) actually do? Recall that its overall goal is to continue text in a “reasonable” way, based on what it’s seen from the training it’s had (which consists in looking at billions of pages of text from the web, etc.) So at any given point, it’s got a certain amount of text—and its goal is to come up with an appropriate choice for the next token to add.

It operates in three basic stages. First, it takes the sequence of tokens that corresponds to the text so far, and finds an embedding (i.e. an array of numbers) that represents these. Then it operates on this embedding—in a “standard neural net way”, with values “rippling through” successive layers in a network—to produce a new embedding (i.e. a new array of numbers). It then takes the last part of this array and generates from it an array of about 50,000 values that turn into probabilities for different possible next tokens. (And, yes, it so happens that there are about the same number of tokens used as there are common words in English, though only about 3000 of the tokens are whole words, and the rest are fragments.)
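
Here is a shape-level sketch of those three stages. The weights are random stand-ins (the real ones come from training), and a single random layer stands in for the whole transformer stack; the point is just the flow from token IDs to next-token probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 50_000, 128   # toy sizes (ChatGPT's GPT-3 embeddings have length 12,288)

E = rng.normal(size=(VOCAB, DIM))       # stage 1: token-embedding table
W_mid = rng.normal(size=(DIM, DIM))     # stand-in for the transformer stack
W_out = rng.normal(size=(DIM, VOCAB))   # stage 3: "decode" matrix

def next_token_probabilities(tokens):
    x = E[tokens]                 # stage 1: tokens -> embedding vectors
    x = np.tanh(x @ W_mid)        # stage 2: values "ripple through" the net
    logits = x[-1] @ W_out        # stage 3: decode the last embedding
    p = np.exp(logits - logits.max())
    return p / p.sum()            # ~50,000 probabilities for the next token

print(next_token_probabilities([914, 3542]).shape)   # (50000,)
```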

A critical point is that every part of this pipeline is implemented by a neural network, whose weights are determined by end-to-end training of the network. In other words, in effect nothing except the overall architecture is “explicitly engineered”; everything is just “learned” from training data.

There are, however, plenty of details in the way the architecture is set up—reflecting all sorts of experience and neural net lore. And—even though this is definitely going into the weeds—I think it’s useful to talk about some of those details, not least to get a sense of just what goes into building something like ChatGPT.

First comes the embedding module. Here’s a schematic Wolfram Language representation for it for GPT-2:

The input is a vector of n tokens (represented as in the previous section by integers from 1 to about 50,000). Each of these tokens is converted (by a single-layer neural net) into an embedding vector (of length 768 for GPT-2 and 12,288 for ChatGPT’s GPT-3). Meanwhile, there’s a “secondary pathway” that takes the sequence of (integer) positions for the tokens, and from these integers creates another embedding vector. And finally the embedding vectors from the token value and the token position are added together—to produce the final sequence of embedding vectors from the embedding module.
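
A minimal sketch of this embedding module, with random arrays standing in for the learned token and position tables (sizes here are GPT-2-like):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MAXLEN, DIM = 50_000, 1024, 768   # GPT-2-like sizes

token_table = rng.normal(size=(VOCAB, DIM))      # learned in the real net
position_table = rng.normal(size=(MAXLEN, DIM))  # likewise learned

def embedding_module(tokens):
    # One embedding for each token's value, one for its position, added.
    positions = np.arange(len(tokens))
    return token_table[tokens] + position_table[positions]

print(embedding_module([914, 3542]).shape)   # (2, 768)
```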

Why does one just add the token-value and token-position embedding vectors together? I don’t think there’s any particular science to this. It’s just that various different things have been tried, and this is one that seems to work. And it’s part of the lore of neural nets that—in some sense—so long as the setup one has is “roughly right” it’s usually possible to home in on details just by doing sufficient training, without ever really needing to “understand at an engineering level” quite how the neural net has ended up configuring itself.

Here’s what the embedding module does, operating on the string hello hello hello hello hello hello hello hello hello hello bye bye bye bye bye bye bye bye bye bye:

The elements of the embedding vector for each token are shown down the page, and across the page we see first a run of “hello” embeddings, followed by a run of “bye” ones. The second array above is the positional embedding—with its somewhat-random-looking structure being just what “happened to be learned” (in this case in GPT-2).

OK, so after the embedding module comes the “main event” of the transformer: a sequence of so-called “attention blocks” (12 for GPT-2, 96 for ChatGPT’s GPT-3). It’s all pretty complicated—and reminiscent of typical large hard-to-understand engineering systems, or, for that matter, biological systems. But anyway, here’s a schematic representation of a single “attention block” (for GPT-2):

Within each such attention block there are a collection of “attention heads” (12 for GPT-2, 96 for ChatGPT’s GPT-3)—each of which operates independently on different chunks of values in the embedding vector. (And, yes, we don’t know any particular reason why it’s a good idea to split up the embedding vector, or what the different parts of it “mean”; this is just one of those things that’s been “found to work”.)

OK, so what do the attention heads do? Basically they’re a way of “looking back” in the sequence of tokens (i.e. in the text produced so far), and “packaging up the past” in a form that’s useful for finding the next token. In the first section above we talked about using 2-gram probabilities to pick words based on their immediate predecessors. What the “attention” mechanism in transformers does is to allow “attention to” even much earlier words—thus potentially capturing the way, say, verbs can refer to nouns that appear many words before them in a sentence.

At a more detailed level, what an attention head does is to recombine chunks in the embedding vectors associated with different tokens, with certain weights. And so, for example, the 12 attention heads in the first attention block (in GPT-2) have the following (“look-back-all-the-way-to-the-beginning-of-the-sequence-of-tokens”) patterns of “recombination weights” for the “hello, bye” string above:
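
The article doesn't spell out the formulas, but the standard “scaled dot-product” form of transformer attention gives a concrete picture of what recombining chunks “with certain weights” means. Here is a minimal single-head sketch, with random stand-ins for the learned matrices and a mask that implements the “look back only” behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64    # per-head chunk size (768 / 12 heads for GPT-2)

def attention_head(X, Wq, Wk, Wv):
    # X holds one chunk of the embedding vector for each of n tokens.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # Mask so each position can only "look back" at itself and earlier tokens.
    scores += np.triu(np.full(scores.shape, -1e9), k=1)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)     # the "recombination weights"
    return w @ V                          # re-weighted combination of chunks

X = rng.normal(size=(5, d))               # chunks for 5 tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(attention_head(X, Wq, Wk, Wv).shape)   # (5, 64)
```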

After being processed by the attention heads, the resulting “re-weighted embedding vector” (of length 768 for GPT-2 and length 12,288 for ChatGPT’s GPT-3) is passed through a standard “fully connected” neural net layer. It’s hard to get a handle on what this layer is doing. But here’s a plot of the 768×768 matrix of weights it’s using (here for GPT-2):

Taking 64×64 moving averages, some (random-walk-ish) structure begins to emerge:

What determines this structure? Ultimately it’s presumably some “neural net encoding” of features of human language. But as of now, what those features might be is quite unknown. In effect, we’re “opening up the brain of ChatGPT” (or at least GPT-2) and discovering, yes, it’s complicated in there, and we don’t understand it—even though in the end it’s producing recognizable human language.

OK, so after going through one attention block, we’ve got a new embedding vector—which is then successively passed through additional attention blocks (a total of 12 for GPT-2; 96 for GPT-3). Each attention block has its own particular pattern of “attention” and “fully connected” weights. Here for GPT-2 are the sequence of attention weights for the “hello, bye” input, for the first attention head:

And here are the (moving-averaged) “matrices” for the fully connected layers:

Curiously, even though these “matrices of weights” in different attention blocks look quite similar, the distributions of the sizes of weights can be somewhat different (and are not always Gaussian):

So after going through all these attention blocks what is the net effect of the transformer? Essentially it’s to transform the original collection of embeddings for the sequence of tokens to a final collection. And the particular way ChatGPT works is then to pick up the last embedding in this collection, and “decode” it to produce a list of probabilities for what token should come next.

So that’s in outline what’s inside ChatGPT. It may seem complicated (not least because of its many inevitably somewhat arbitrary “engineering choices”), but actually the ultimate elements involved are remarkably simple. Because in the end what we’re dealing with is just a neural net made of “artificial neurons”, each doing the simple operation of taking a collection of numerical inputs, and then combining them with certain weights.

The original input to ChatGPT is an array of numbers (the embedding vectors for the tokens so far), and what happens when ChatGPT “runs” to produce a new token is just that these numbers “ripple through” the layers of the neural net, with each neuron “doing its thing” and passing the result to neurons on the next layer. There’s no looping or “going back”. Everything just “feeds forward” through the network.

It’s a very different setup from a typical computational system—like a Turing machine—in which results are repeatedly “reprocessed” by the same computational elements. Here—at least in generating a given token of output—each computational element (i.e. neuron) is used only once.

But there is in a sense still an “outer loop” that reuses computational elements even in ChatGPT. Because when ChatGPT is going to generate a new token, it always “reads” (i.e. takes as input) the whole sequence of tokens that come before it, including tokens that ChatGPT itself has “written” previously. And we can think of this setup as meaning that ChatGPT does—at least at its outermost level—involve a “feedback loop”, albeit one in which every iteration is explicitly visible as a token that appears in the text that it generates.
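
That outer loop can be sketched directly. The network below is a random stand-in for the one sketched earlier, and the “End” token ID is hypothetical; the point is that the whole sequence so far, including previously generated tokens, is fed back in at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 10          # tiny toy vocabulary
END = 9             # hypothetical "End" token ID

def next_token_probabilities(tokens):
    # Random stand-in for the full network.
    p = rng.random(VOCAB)
    return p / p.sum()

tokens = [3, 7]     # the prompt, as token IDs

# The "outer loop": read the whole sequence so far (including tokens the
# model itself produced), sample one new token, append it, repeat.
while tokens[-1] != END and len(tokens) < 20:
    probs = next_token_probabilities(tokens)
    tokens.append(int(rng.choice(VOCAB, p=probs)))

print(tokens)
```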

But let’s come back to the core of ChatGPT: the neural net that’s being repeatedly used to generate each token. At some level it’s very simple: a whole collection of identical artificial neurons. And some parts of the network just consist of (“fully connected”) layers of neurons in which every neuron on a given layer is connected (with some weight) to every neuron on the layer before. But particularly with its transformer architecture, ChatGPT has parts with more structure, in which only specific neurons on different layers are connected. (Of course, one could still say that “all neurons are connected”—but some just have zero weight.)

In addition, there are aspects of the neural net in ChatGPT that aren’t most naturally thought of as just consisting of “homogeneous” layers. And for example—as the iconic summary above indicates—inside an attention block there are places where “multiple copies are made” of incoming data, each then going through a different “processing path”, potentially involving a different number of layers, and only later recombining. But while this may be a convenient representation of what’s going on, it’s always at least in principle possible to think of “densely filling in” layers, but just having some weights be zero.

If one looks at the longest path through ChatGPT, there are about 400 (core) layers involved—in some ways not a huge number. But there are millions of neurons—with a total of 175 billion connections and therefore 175 billion weights. And one thing to realize is that every time ChatGPT generates a new token, it has to do a calculation involving every single one of these weights. Implementationally these calculations can be somewhat organized “by layer” into highly parallel array operations that can conveniently be done on GPUs. But for each token that’s produced, there still have to be 175 billion calculations done (and in the end a bit more)—so that, yes, it’s not surprising that it can take a while to generate a long piece of text with ChatGPT.

But in the end, the remarkable thing is that all these operations—individually as simple as they are—can somehow together manage to do such a good “human-like” job of generating text. It has to be emphasized again that (at least so far as we know) there’s no “ultimate theoretical reason” why anything like this should work. And in fact, as we’ll discuss, I think we have to view this as a—potentially surprising—scientific discovery: that somehow in a neural net like ChatGPT’s it’s possible to capture the essence of what human brains manage to do in generating language.

The Training of ChatGPT

OK, so we’ve now given an outline of how ChatGPT works once it’s set up. But how did it get set up? How were all those 175 billion weights in its neural net determined? Basically they’re the result of very large-scale training, based on a huge corpus of text—on the web, in books, etc.—written by humans. As we’ve said, even given all that training data, it’s certainly not obvious that a neural net would be able to successfully produce “human-like” text. And, once again, there seem to be detailed pieces of engineering needed to make that happen. But the big surprise—and discovery—of ChatGPT is that it’s possible at all. And that—in effect—a neural net with “just” 175 billion weights can make a “reasonable model” of text humans write.

In modern times, there’s lots of text written by humans that’s out there in digital form. The public web has at least several billion human-written pages, with altogether perhaps a trillion words of text. And if one includes non-public webpages, the numbers might be at least 100 times larger. So far, more than 5 million digitized books have been made available (out of 100 million or so that have ever been published), giving another 100 billion or so words of text. And that’s not even mentioning text derived from speech in videos, etc. (As a personal comparison, my total lifetime output of published material has been a bit under 3 million words, and over the past 30 years I’ve written about 15 million words of email, and altogether typed perhaps 50 million words—and in just the past couple of years I’ve spoken more than 10 million words on livestreams. And, yes, I’ll train a bot from all of that.)

But, OK, given all this data, how does one train a neural net from it? The basic process is very much as we discussed it in the simple examples above. You present a batch of examples, and then you adjust the weights in the network to minimize the error (“loss”) that the network makes on those examples. The main thing that’s expensive about “back propagating” from the error is that each time you do this, every weight in the network will typically change at least a tiny bit, and there are just a lot of weights to deal with. (The actual “back computation” is typically only a small constant factor harder than the forward one.)

With modern GPU hardware, it’s straightforward to compute the results from batches of thousands of examples in parallel. But when it comes to actually updating the weights in the neural net, current methods require one to do this basically batch by batch. (And, yes, this is probably where actual brains—with their combined computation and memory elements—have, for now, at least an architectural advantage.)
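
A toy version of this batch-by-batch training step, using a single layer of weights and a plain mean-squared-error loss (vastly simpler than the real setup, but with the same forward-then-backward shape):

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(4, 3))       # the "network": one layer of weights
X = rng.normal(size=(32, 4))      # a batch of 32 example inputs
Y = rng.normal(size=(32, 3))      # the corresponding target outputs

for step in range(100):
    pred = X @ W                          # forward pass on the whole batch
    err = pred - Y
    loss = (err ** 2).mean()              # the "loss" on this batch
    grad = X.T @ err * (2 / err.size)     # error propagated back to the weights
    W -= 0.1 * grad                       # every weight shifts a little

print(round(float(loss), 4))              # the loss shrinks as W is tuned
```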

Even in the seemingly simple cases of learning numerical functions that we discussed earlier, we found we often had to use millions of examples to successfully train a network, at least from scratch. So how many examples does this mean we’ll need in order to train a “human-like language” model? There doesn’t seem to be any fundamental “theoretical” way to know. But in practice ChatGPT was successfully trained on a few hundred billion words of text.

Some of the text it was fed several times, some of it only once. But somehow it “got what it needed” from the text it saw. But given this volume of text to learn from, how large a network should it require to “learn it well”? Again, we don’t yet have a fundamental theoretical way to say. Ultimately—as we’ll discuss further below—there’s presumably a certain “total algorithmic content” to human language and what humans typically say with it. But the next question is how efficient a neural net will be at implementing a model based on that algorithmic content. And again we don’t know—although the success of ChatGPT suggests it’s reasonably efficient.

And in the end we can just note that ChatGPT does what it does using a couple hundred billion weights—comparable in number to the total number of words (or tokens) of training data it’s been given. In some ways it’s perhaps surprising (though empirically observed also in smaller analogs of ChatGPT) that the “size of the network” that seems to work well is so comparable to the “size of the training data”. After all, it’s certainly not that somehow “inside ChatGPT” all that text from the web and books and so on is “directly stored”. Because what’s actually inside ChatGPT are a bunch of numbers—with a bit less than 10 digits of precision—that are some kind of distributed encoding of the aggregate structure of all that text.

Put another way, we might ask what the “effective information content” is of human language and what’s typically said with it. There’s the raw corpus of examples of language. And then there’s the representation in the neural net of ChatGPT. That representation is very likely far from the “algorithmically minimal” representation (as we’ll discuss below). But it’s a representation that’s readily usable by the neural net. And in this representation it seems there’s in the end rather little “compression” of the training data; it seems on average to basically take only a bit less than one neural net weight to carry the “information content” of a word of training data.

When we run ChatGPT to generate text, we’re basically having to use each weight once. So if there are n weights, we’ve got of order n computational steps to do—though in practice many of them can typically be done in parallel in GPUs. But if we need about n words of training data to set up those weights, then from what we’ve said above we can conclude that we’ll need about n² computational steps to do the training of the network—which is why, with current methods, one ends up needing to talk about billion-dollar training efforts.
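
Plugging in rough numbers makes the scale vivid. Taking the 175 billion weights mentioned above, and assuming (as an illustrative round number) a few hundred billion training tokens:

```python
n_weights = 175e9    # weights in the network (from the article)
n_tokens = 300e9     # assumed round number for "a few hundred billion" words

ops_per_token = n_weights                # ~n calculations per generated token
ops_for_training = n_weights * n_tokens  # ~n * n = n^2 calculations overall

print(f"{ops_per_token:.0e} operations per token")     # ~2e+11
print(f"{ops_for_training:.0e} operations to train")   # ~5e+22
```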

Beyond Basic Training

The majority of the effort in training ChatGPT is spent “showing it” large amounts of existing text from the web, books, etc. But it turns out there’s another—apparently rather important—part too.

As soon as it’s finished its “raw training” from the original corpus of text it’s been shown, the neural net inside ChatGPT is ready to start generating its own text, continuing from prompts, etc. But while the results from this may often seem reasonable, they tend—particularly for longer pieces of text—to “wander off” in often rather non-human-like ways. It’s not something one can readily detect, say, by doing traditional statistics on the text. But it’s something that actual humans reading the text easily notice.

And a key idea in the construction of ChatGPT was to have another step after “passively reading” things like the web: to have actual humans actively interact with ChatGPT, see what it produces, and in effect give it feedback on “how to be a good chatbot”. But how can the neural net use that feedback? The first step is just to have humans rate results from the neural net. But then another neural net model is built that attempts to predict those ratings. But now this prediction model can be run—essentially like a loss function—on the original network, in effect allowing that network to be “tuned up” by the human feedback that’s been given. And the results in practice seem to have a big effect on the success of the system in producing “human-like” output.
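
Here is a deliberately oversimplified sketch of that last idea, running the rating-prediction model “like a loss function” to nudge the generator. Everything below (the tiny generator, the linear rating model, the plain gradient step) is a placeholder; the actual procedure used for ChatGPT involves a reward model trained on human ratings plus reinforcement-learning machinery not shown here:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

W_gen = rng.normal(size=(DIM, DIM))   # placeholder "generator" weights
w_rating = rng.normal(size=DIM)       # placeholder rating-prediction model

def generate(prompt_vec):
    return np.tanh(prompt_vec @ W_gen)

def predicted_rating(response_vec):
    # Stand-in for the net trained to predict human ratings.
    return w_rating @ response_vec

prompt = rng.normal(size=DIM)
for _ in range(50):
    resp = generate(prompt)
    # Gradient of the predicted rating w.r.t. W_gen (tanh chain rule);
    # stepping uphill treats "minus the rating" as the loss.
    grad = np.outer(prompt, (1 - resp ** 2) * w_rating)
    W_gen += 0.01 * grad

print(round(float(predicted_rating(generate(prompt))), 3))  # rating has climbed
```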

In general, it’s interesting how little “poking” the “originally trained” network seems to need to get it to usefully go in particular directions. One might have thought that to have the network behave as if it’s “learned something new” one would have to go in and run a training algorithm, adjusting weights, and so on.

But that’s not the case. Instead, it seems to be sufficient to basically tell ChatGPT something one time—as part of the prompt you give—and then it can successfully make use of what you told it when it generates text. And once again, the fact that this works is, I think, an important clue in understanding what ChatGPT is “really doing” and how it relates to the structure of human language and thinking.

There’s certainly something rather human-like about it: that at least once it’s had all that pre-training you can tell it something just once and it can “remember it”—at least “long enough” to generate a piece of text using it. So what’s going on in a case like this? It could be that “everything you might tell it is already in there somewhere”—and you’re just leading it to the right spot. But that doesn’t seem plausible. Instead, what seems more likely is that, yes, the elements are already in there, but the specifics are defined by something like a “trajectory between those elements” and that’s what you’re introducing when you tell it something.

And indeed, much like for humans, if you tell it something bizarre and unexpected that completely doesn’t fit into the framework it knows, it doesn’t seem like it’ll successfully be able to “integrate” this. It can “integrate” it only if it’s basically riding in a fairly simple way on top of the framework it already has.

It’s also worth pointing out again that there are inevitably “algorithmic limits” to what the neural net can “pick up”. Tell it “shallow” rules of the form “this goes to that”, etc., and the neural net will most likely be able to represent and reproduce these just fine—and indeed what it “already knows” from language will give it an immediate pattern to follow. But try to give it rules for an actual “deep” computation that involves many potentially computationally irreducible steps and it just won’t work. (Remember that at each step it’s always just “feeding data forward” in its network, never looping except by virtue of generating new tokens.)

Of course, the network can learn the answer to specific “irreducible” computations. But as soon as there are combinatorial numbers of possibilities, no such “table-lookup-style” approach will work. And so, yes, just like humans, it’s time then for neural nets to “reach out” and use actual computational tools. (And, yes, Wolfram|Alpha and Wolfram Language are uniquely suitable, because they’ve been built to “talk about things in the world”, just like the language-model neural nets.)

What Really Lets ChatGPT Work?

Human language—and the processes of thinking involved in generating it—have always seemed to represent a kind of pinnacle of complexity. And indeed it’s seemed somewhat remarkable that human brains—with their network of a “mere” 100 billion or so neurons (and maybe 100 trillion connections) could be responsible for it. Perhaps, one might have imagined, there’s something more to brains than their networks of neurons—like some new layer of undiscovered physics. But now with ChatGPT we’ve got an important new piece of information: we know that a pure, artificial neural network with about as many connections as brains have neurons is capable of doing a surprisingly good job of generating human language.

And, yes, that’s still a big and complicated system—with about as many neural net weights as there are words of text currently available out there in the world. But at some level it still seems difficult to believe that all the richness of language and the things it can talk about can be encapsulated in such a finite system. Part of what’s going on is no doubt a reflection of the ubiquitous phenomenon (that first became evident in the example of rule 30) that computational processes can in effect greatly amplify the apparent complexity of systems even when their underlying rules are simple. But, actually, as we discussed above, neural nets of the kind used in ChatGPT tend to be specifically constructed to restrict the effect of this phenomenon—and the computational irreducibility associated with it—in the interest of making their training more accessible.

So how is it, then, that something like ChatGPT can get as far as it does with language? The basic answer, I think, is that language is at a fundamental level somehow simpler than it seems. And this means that ChatGPT—even with its ultimately straightforward neural net structure—is successfully able to “capture the essence” of human language and the thinking behind it. And moreover, in its training, ChatGPT has somehow “implicitly discovered” whatever regularities in language (and thinking) make this possible.

The success of ChatGPT is, I think, giving us evidence of a fundamental and important piece of science: it’s suggesting that we can expect there to be major new “laws of language”—and effectively “laws of thought”—out there to discover. In ChatGPT—built as it is as a neural net—those laws are at best implicit. But if we could somehow make the laws explicit, there’s the potential to do the kinds of things ChatGPT does in vastly more direct, efficient—and transparent—ways.

But, OK, so what might these laws be like? Ultimately they must give us some kind of prescription for how language—and the things we say with it—are put together. Later we’ll discuss how “looking inside ChatGPT” may be able to give us some hints about this, and how what we know from building computational language suggests a path forward. But first let’s discuss two long-known examples of what amount to “laws of language”—and how they relate to the operation of ChatGPT.

The first is the syntax of language. Language is not just a random jumble of words. Instead, there are (fairly) definite grammatical rules for how words of different kinds can be put together: in English, for example, nouns can be preceded by adjectives and followed by verbs, but typically two nouns can’t be right next to each other. Such grammatical structure can (at least approximately) be captured by a set of rules that define how what amount to “parse trees” can be put together:

ChatGPT doesn’t have any explicit “knowledge” of such rules. But somehow in its training it implicitly “discovers” them—and then seems to be good at following them. So how does this work? At a “big picture” level it’s not clear. But to get some insight it’s perhaps instructive to look at a much simpler example.

Consider a “language” formed from sequences of (’s and )’s, with a grammar that specifies that parentheses should always be balanced, as represented by a parse tree like:

Can we train a neural net to produce “grammatically correct” parenthesis sequences? There are various ways to handle sequences in neural nets, but let’s use transformer nets, as ChatGPT does. And given a simple transformer net, we can start feeding it grammatically correct parenthesis sequences as training examples. A subtlety (which actually also appears in ChatGPT’s generation of human language) is that in addition to our “content tokens” (here “(” and “)”) we have to include an “End” token, that’s generated to indicate that the output shouldn’t continue any further (i.e. for ChatGPT, that one’s reached the “end of the story”).
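
As a sketch, training data of this kind is easy to generate; here is one simple (and by no means unique) way to produce balanced sequences ending in the “End” token, along with the prefix-to-next-token pairs a net would be trained on:

```python
import random

random.seed(0)

def balanced_sequence(n_pairs):
    # One simple way to generate a grammatically correct sequence.
    seq, depth, remaining = [], 0, n_pairs
    while remaining or depth:
        if depth == 0 or (remaining and random.random() < 0.5):
            seq.append("(")
            depth += 1
            remaining -= 1
        else:
            seq.append(")")
            depth -= 1
    return seq + ["End"]   # "End" marks where the output should stop

# Training pairs: each prefix predicts the token that follows it.
examples = []
for _ in range(3):
    seq = balanced_sequence(random.randint(1, 4))
    examples += [(seq[:i], seq[i]) for i in range(1, len(seq))]

print(examples[:3])
```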

If we set up a transformer net with just one attention block with 8 heads and feature vectors of length 128 (ChatGPT also uses feature vectors of length 128, but has 96 attention blocks, each with 96 heads) then it doesn’t seem possible to get it to learn much about parenthesis language. But with 2 attention blocks, the learning process seems to converge—at least after 10 million or so examples have been given (and, as is common with transformer nets, showing yet more examples just seems to degrade its performance).

So with this network, we can do the analog of what ChatGPT does, and ask for probabilities for what the next token should be—in a parenthesis sequence:

And in the first case, the network is “pretty sure” that the sequence can’t end here—which is good, because if it did, the parentheses would be left unbalanced. In the second case, however, it “correctly recognizes” that the sequence can end here, though it also “points out” that it’s possible to “start again”, putting down a “(”, presumably with a “)” to follow. But, oops, even with its 400,000 or so laboriously trained weights, it says there’s a 15% probability to have “)” as the next token—which isn’t right, because that would necessarily lead to an unbalanced parenthesis.

Here’s what we get if we ask the network for the highest-probability completions for progressively longer sequences of (’s:

And, yes, up to a certain length the network does just fine. But then it starts failing. It’s a pretty typical kind of thing to see in a “precise” situation like this with a neural net (or with machine learning in general). Cases that a human “can solve in a glance” the neural net can solve too. But cases that require doing something “more algorithmic” (e.g. explicitly counting parentheses to see if they’re closed) the neural net tends to somehow be “too computationally shallow” to reliably do. (By the way, even the full current ChatGPT has a hard time correctly matching parentheses in long sequences.)
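
For contrast, that “more algorithmic” operation, explicitly counting parentheses, takes only a few lines as an ordinary program:

```python
def parentheses_balanced(seq):
    # The explicit counting that the net fails to reliably emulate.
    depth = 0
    for ch in seq:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False    # a ")" appeared with nothing left to close
    return depth == 0       # balanced only if everything got closed

print(parentheses_balanced("(()())"))   # True
print(parentheses_balanced("((((("))    # False, however long the run of ('s
```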

So what does this mean for things like ChatGPT and the syntax of a language like English? The parenthesis language is “austere”—and much more of an “algorithmic story”. But in English it’s much more realistic to be able to “guess” what’s grammatically going to fit on the basis of local choices of words and other hints. And, yes, the neural net is much better at this—even though perhaps it might miss some “formally correct” case that, well, humans might miss as well. But the main point is that the fact that there’s an overall syntactic structure to the language—with all the regularity that implies—in a sense limits “how much” the neural net has to learn. And a key “natural-science-like” observation is that the transformer architecture of neural nets like the one in ChatGPT seems to successfully be able to learn the kind of nested-tree-like syntactic structure that seems to exist (at least in some approximation) in all human languages.

Syntax provides one kind of constraint on language. But there are clearly more. A sentence like “Inquisitive electrons eat blue theories for fish” is grammatically correct but isn’t something one would normally expect to say, and wouldn’t be considered a success if ChatGPT generated it—because, well, with the normal meanings for the words in it, it’s basically meaningless.

But is there a general way to tell if a sentence is meaningful? There’s no traditional overall theory for that. But it’s something that one can think of ChatGPT as having implicitly “developed a theory for” after being trained with billions of (presumably meaningful) sentences from the web, etc.

What might this theory be like? Well, there’s one tiny corner that’s basically been known for two millennia, and that’s logic. And certainly in the syllogistic form in which Aristotle discovered it, logic is basically a way of saying that sentences that follow certain patterns are reasonable, while others are not. Thus, for example, it’s reasonable to say “All X are Y. This is not Y, so it’s not an X” (as in “All fishes are blue. This is not blue, so it’s not a fish.”). And just as one can somewhat whimsically imagine that Aristotle discovered syllogistic logic by going (“machine-learning-style”) through lots of examples of rhetoric, so too one can imagine that in the training of ChatGPT it will have been able to “discover syllogistic logic” by looking at lots of text on the web, etc. (And, yes, while one can therefore expect ChatGPT to produce text that contains “correct inferences” based on things like syllogistic logic, it’s a quite different story when it comes to more sophisticated formal logic—and I think one can expect it to fail here for the same kind of reasons it fails in parenthesis matching.)

But beyond the narrow example of logic, what can be said about how to systematically construct (or recognize) even plausibly meaningful text? Yes, there are things like Mad Libs that use very specific “phrasal templates”. But somehow ChatGPT implicitly has a much more general way to do it. And perhaps there’s nothing to be said about how it can be done beyond “somehow it happens when you have 175 billion neural net weights”. But I strongly suspect that there’s a much simpler and stronger story.

Meaning Space and Semantic Laws of Motion

We discussed above that inside ChatGPT any piece of text is effectively represented by an array of numbers that we can think of as coordinates of a point in some kind of “linguistic feature space”. So when ChatGPT continues a piece of text this corresponds to tracing out a trajectory in linguistic feature space. But now we can ask what makes this trajectory correspond to text we consider meaningful. And might there perhaps be some kind of “semantic laws of motion” that define—or at least constrain—how points in linguistic feature space can move around while preserving “meaningfulness”?

So what is this linguistic feature space like? Here’s an example of how single words (here, common nouns) might get laid out if we project such a feature space down to 2D:
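
A picture of this general kind can be made in a few lines. Here’s a hedged sketch, assuming gensim’s downloadable GloVe vectors as a stand-in for the embedding (ChatGPT’s own embedding isn’t public):

```python
# Project pre-trained word embeddings down to 2D so that
# "semantically similar words" land near one another.
import gensim.downloader as api
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

vectors = api.load("glove-wiki-gigaword-50")  # 50-dim GloVe vectors
words = ["dog", "cat", "fish", "apple", "pear",
         "chair", "table", "city", "town"]
points = PCA(n_components=2).fit_transform([vectors[w] for w in words])

plt.scatter(points[:, 0], points[:, 1], s=10)
for (x, y), w in zip(points, words):
    plt.annotate(w, (x, y))
plt.show()
```

With an embedding like this, “apple” and “pear” typically land near each other, well away from “chair” and “table”.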

We saw another example above based on words representing plants and animals. But the point in both cases is that “semantically similar words” are placed nearby.

As another example, here’s how words corresponding to different parts of speech get laid out:

Of course, a given word doesn’t in general just have “one meaning” (or necessarily correspond to just one part of speech). And by looking at how sentences containing a word lay out in feature space, one can often “tease apart” different meanings—as in the example here for the word “crane” (bird or machine?):

OK, so it’s at least plausible that we can think of this feature space as placing “words nearby in meaning” close in this space. But what kind of additional structure can we identify in this space? Is there for example some kind of notion of “parallel transport” that would reflect “flatness” in the space? One way to get a handle on that is to look at analogies:
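
One classic way to probe that “flatness” is word-vector arithmetic. Continuing the GloVe sketch above (with `vectors` already loaded):

```python
# If "man"->"woman" and "king"->"queen" are roughly parallel
# directions, then king - man + woman should land near "queen".
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=3))
# For GloVe-style embeddings this typically ranks "queen" first.
```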

And, yes, even when we project down to 2D, there’s often at least a “hint of flatness”, though it’s certainly not universally seen.

So what about trajectories? We can look at the trajectory that a prompt for ChatGPT follows in feature space—and then we can see how ChatGPT continues that:
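
One can get a rough look at such a trajectory with a small open model; here’s a sketch assuming GPT-2 from the Hugging Face transformers library as a stand-in for ChatGPT (whose internals aren’t publicly accessible):

```python
# Trace the hidden-state "trajectory" of a prompt through a small
# open model (GPT-2 as a stand-in), projected to 2D with PCA.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sklearn.decomposition import PCA

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The best thing about AI is its ability to",
                   return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One final-layer hidden state per token: the prompt's "path".
states = out.hidden_states[-1][0].numpy()        # (n_tokens, 768)
trajectory = PCA(n_components=2).fit_transform(states)
print(trajectory)                                # successive 2D points
```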

There’s certainly no “geometrically obvious” law of motion here. And that’s not at all surprising; we fully expect this to be a considerably more complicated story. And, for example, it’s far from obvious that even if there is a “semantic law of motion” to be found, what kind of embedding (or, in effect, what “variables”) it’ll most naturally be stated in.

In the picture above, we’re showing several steps in the “trajectory”—where at each step we’re picking the word that ChatGPT considers the most probable (the “zero temperature” case). But we can also ask what words can “come next” with what probabilities at a given point:
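
Continuing the GPT-2 sketch above (`model`, `tokenizer` and `inputs` as before), the ranked “fan” at a given point is just a softmax over the model’s final logits:

```python
import torch

# Probabilities for the next token at the current point;
# the "zero temperature" choice is simply the top entry.
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode(idx)!r}  {p:.3f}")
```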

And what we see in this case is that there’s a “fan” of high-probability words that seems to go in a more or less definite direction in feature space. What happens if we go further? Here are the successive “fans” that appear as we “move along” the trajectory:

Here’s a 3D representation, going for a total of 40 steps:

And, yes, this seems like a mess—and doesn’t do anything to particularly encourage the idea that one can expect to identify “mathematical-physics-like” “semantic laws of motion” by empirically studying “what ChatGPT is doing inside”. But perhaps we’re just looking at the “wrong variables” (or wrong coordinate system) and if only we looked at the right one, we’d immediately see that ChatGPT is doing something “mathematical-physics-simple” like following geodesics. But as of now, we’re not ready to “empirically decode” from its “internal behavior” what ChatGPT has “discovered” about how human language is “put together”.

Semantic Grammar and the Power of Computational Language

What does it take to produce “meaningful human language”? In the past, we might have assumed it could be nothing short of a human brain. But now we know it can be done quite respectably by the neural net of ChatGPT. Still, maybe that’s as far as we can go, and there’ll be nothing simpler—or more human understandable—that will work. But my strong suspicion is that the success of ChatGPT implicitly reveals an important “scientific” fact: that there’s actually a lot more structure and simplicity to meaningful human language than we ever knew—and that in the end there may be even fairly simple rules that describe how such language can be put together.

As we mentioned above, syntactic grammar gives rules for how words corresponding to things like different parts of speech can be put together in human language. But to deal with meaning, we need to go further. And one version of how to do this is to think about not just a syntactic grammar for language, but also a semantic one.

For purposes of syntax, we identify things like nouns and verbs. But for purposes of semantics, we need “finer gradations”. So, for example, we might identify the concept of “moving”, and the concept of an “object” that “maintains its identity independent of location”. There are endless specific examples of each of these “semantic concepts”. But for the purposes of our semantic grammar, we’ll just have some general kind of rule that basically says that “objects” can “move”. There’s a lot to say about how all this might work (some of which I’ve said before). But I’ll content myself here with just a few remarks that indicate some of the potential path forward.
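
As a deliberately toy illustration of what one such rule might look like (entirely hypothetical; not a piece of ChatGPT or of Wolfram Language):

```python
# A hypothetical fragment of a "semantic grammar": concepts carry
# semantic categories, and a rule says which combinations make sense.
OBJECTS = {"elephant", "ball", "electron"}  # keep identity independent of location
ABSTRACTIONS = {"theory", "justice"}

def plausible_motion_sentence(subject):
    """Rule: 'objects' can 'move'; abstractions (normally) cannot."""
    return subject in OBJECTS

print(plausible_motion_sentence("elephant"))  # True
print(plausible_motion_sentence("theory"))    # False: "the theory traveled" is flagged
```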

It’s worth mentioning that even if a sentence is perfectly OK according to the semantic grammar, that doesn’t mean it’s been realized (or even could be realized) in practice. “The elephant traveled to the Moon” would doubtless “pass” our semantic grammar, but it certainly hasn’t been realized (at least yet) in our actual world—though it’s absolutely fair game for a fictional world.

When we start talking about “semantic grammar” we’re soon led to ask “What’s underneath it?” What “model of the world” is it assuming? A syntactic grammar is really just about the construction of language from words. But a semantic grammar necessarily engages with some kind of “model of the world”—something that serves as a “skeleton” on top of which language made from actual words can be layered.

Until recent times, we might have imagined that (human) language would be the only general way to describe our “model of the world”. Already a few centuries ago there started to be formalizations of specific kinds of things, based particularly on mathematics. But now there’s a much more general approach to formalization: computational language.

And, yes, that’s been my big project over the course of more than four decades (as now embodied in the Wolfram Language): to develop a precise symbolic representation that can talk as broadly as possible about things in the world, as well as abstract things that we care about. And so, for example, we have symbolic representations for cities and molecules and images and neural networks, and we have built-in knowledge about how to compute about those things.

And, after decades of work, we’ve covered a lot of areas in this way. But in the past, we haven’t particularly dealt with “everyday discourse”. In “I bought two pounds of apples” we can readily represent (and do nutrition and other computations on) the “two pounds of apples”. But we don’t (quite yet) have a symbolic representation for “I bought”.

It’s all connected to the idea of semantic grammar—and the goal of having a generic symbolic “construction kit” for concepts, that would give us rules for what could fit together with what, and thus for the “flow” of what we might turn into human language.

But let’s say we had this “symbolic discourse language”. What would we do with it? We could start off doing things like generating “locally meaningful text”. But ultimately we’re likely to want more “globally meaningful” results—which means “computing” more about what can actually exist or happen in the world (or perhaps in some consistent fictional world).

Right now in Wolfram Language we have a huge amount of built-in computational knowledge about lots of kinds of things. But for a complete symbolic discourse language we’d have to build in additional “calculi” about general things in the world: if an object moves from A to B and from B to C, then it’s moved from A to C, etc.
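
A minimal sketch of what one tiny piece of such a “calculus” might look like (plain Python, purely illustrative):

```python
# Transitive "calculus of motion": if an object moved A->B and B->C,
# infer that it moved A->C (computed as a reachability closure).
from itertools import product

moves = {("A", "B"), ("B", "C")}

def closure(facts):
    """Repeatedly add implied moves until nothing new appears."""
    result = set(facts)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(result, result)
               if b == c and (a, d) not in result}
        if not new:
            return result
        result |= new

print(closure(moves))  # {('A', 'B'), ('B', 'C'), ('A', 'C')}
```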

Given a symbolic discourse language we might use it to make “standalone statements”. But we can also use it to ask questions about the world, “Wolfram|Alpha style”. Or we can use it to state things that we “want to make so”, presumably with some external actuation mechanism. Or we can use it to make assertions—perhaps about the actual world, or perhaps about some specific world we’re considering, fictional or otherwise.

Human language is fundamentally imprecise, not least because it isn’t “tethered” to a specific computational implementation, and its meaning is basically defined just by a “social contract” between its users. But computational language, by its nature, has a certain fundamental precision—because in the end what it specifies can always be “unambiguously executed on a computer”. Human language can usually get away with a certain vagueness. (When we say “planet” does it include exoplanets or not, etc.?) But in computational language we have to be precise and clear about all the distinctions we’re making.

It’s often convenient to leverage ordinary human language in making up names in computational language. But the meanings they have in computational language are necessarily precise—and might or might not cover some particular connotation in typical human language usage.

How should one figure out the fundamental “ontology” suitable for a general symbolic discourse language? Well, it’s not easy. Which is perhaps why little has been done since the primitive beginnings Aristotle made more than two millennia ago. But it really helps that today we now know so much about how to think about the world computationally (and it doesn’t hurt to have a “fundamental metaphysics” from our Physics Project and the idea of the ruliad).

But what does all this mean in the context of ChatGPT? From its training ChatGPT has effectively “pieced together” a certain (rather impressive) quantity of what amounts to semantic grammar. But its very success gives us a reason to think that it’s going to be feasible to construct something more complete in computational language form. And, unlike what we’ve so far figured out about the innards of ChatGPT, we can expect to design the computational language so that it’s readily understandable to humans.

When we talk about semantic grammar, we can draw an analogy to syllogistic logic. At first, syllogistic logic was essentially a collection of rules about statements expressed in human language. But (yes, two millennia later) when formal logic was developed, the original basic constructs of syllogistic logic could now be used to build huge “formal towers” that include, for example, the operation of modern digital circuitry. And so, we can expect, it will be with more general semantic grammar. At first, it may just be able to deal with simple patterns, expressed, say, as text. But once its whole computational language framework is built, we can expect that it will be able to be used to erect tall towers of “generalized semantic logic”, that allow us to work in a precise and formal way with all sorts of things that have never been accessible to us before, except just at a “ground-floor level” through human language, with all its vagueness.

We can think of the construction of computational language—and semantic grammar—as representing a kind of ultimate compression in representing things. Because it allows us to talk about the essence of what’s possible, without, for example, dealing with all the “turns of phrase” that exist in ordinary human language. And we can view the great strength of ChatGPT as being something a bit similar: because it too has in a sense “drilled through” to the point where it can “put language together in a semantically meaningful way” without concern for different possible turns of phrase.

So what would happen if we applied ChatGPT to underlying computational language? The computational language can describe what’s possible. But what can still be added is a sense of “what’s popular”—based for example on reading all that content on the web. But then—underneath—operating with computational language means that something like ChatGPT has immediate and fundamental access to what amount to ultimate tools for making use of potentially irreducible computations. And that makes it a system that can not only “generate reasonable text”, but can expect to work out whatever can be worked out about whether that text actually makes “correct” statements about the world—or whatever it’s supposed to be talking about.

So … What Is ChatGPT Doing, and Why Does It Work?

The basic concept of ChatGPT is at some level rather simple. Start from a huge sample of human-created text from the web, books, etc. Then train a neural net to generate text that’s “like this”. And in particular, make it able to start from a “prompt” and then continue with text that’s “like what it’s been trained with”.
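
In code, the whole outer loop is tiny; here’s a self-contained sketch, again with GPT-2 assumed as a stand-in for ChatGPT, and always taking the most probable token (the “zero temperature” case):

```python
# Repeatedly ask "given the text so far, what comes next?"
# and append the answer, one token at a time.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer("The best thing about AI is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        next_id = model(ids).logits[0, -1].argmax()  # most probable next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(ids[0]))
```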

As we’ve seen, the actual neural net in ChatGPT is made up of very simple elements—though billions of them. And the basic operation of the neural net is also very simple, consisting essentially of passing input derived from the text it’s generated so far “once through its elements” (without any loops, etc.) for every new word (or part of a word) that it generates.

But the remarkable—and unexpected—thing is that this process can produce text that’s successfully “like” what’s out there on the web, in books, etc. And not only is it coherent human language, it also “says things” that “follow its prompt”, making use of content it’s “read”. It doesn’t always say things that “globally make sense” (or correspond to correct computations)—because (without, for example, accessing the “computational superpowers” of Wolfram|Alpha) it’s just saying things that “sound right” based on what things “sounded like” in its training material.

The specific engineering of ChatGPT has made it quite compelling. But ultimately (at least until it can use outside tools) ChatGPT is “merely” pulling out some “coherent thread of text” from the “statistics of conventional wisdom” that it’s accumulated. But it’s amazing how human-like the results are. And as I’ve discussed, this suggests something that’s at least scientifically very important: that human language (and the patterns of thinking behind it) are somehow simpler and more “law like” in their structure than we thought. ChatGPT has implicitly discovered it. But we can potentially explicitly expose it, with semantic grammar, computational language, etc.

What ChatGPT does in generating text is very impressive—and the results are usually very much like what we humans would produce. So does this mean ChatGPT is working like a brain? Its underlying artificial-neural-net structure was ultimately modeled on an idealization of the brain. And it seems quite likely that when we humans generate language many aspects of what’s going on are quite similar.

When it comes to training (AKA learning), the different “hardware” of the brain and of current computers (as well as, perhaps, some undeveloped algorithmic ideas) forces ChatGPT to use a strategy that’s probably rather different (and in some ways much less efficient) than the brain’s. And there’s something else as well: unlike even in typical algorithmic computation, ChatGPT doesn’t internally “have loops” or “recompute on data”. And that inevitably limits its computational capability—even with respect to current computers, but definitely with respect to the brain.

It’s not clear how to “fix that” and still maintain the ability to train the system with reasonable efficiency. But to do so will presumably allow a future ChatGPT to do even more “brain-like things”. Of course, there are plenty of things that brains don’t do so well, particularly involving what amount to irreducible computations. And for these, both brains and things like ChatGPT have to seek “outside tools”—like Wolfram Language.

But for now it’s exciting to see what ChatGPT has already been able to do. At some level it’s a great example of the fundamental scientific fact that large numbers of simple computational elements can do remarkable and unexpected things. But it also provides perhaps the best impetus we’ve had in two thousand years to understand better just what the fundamental character and principles might be of that central feature of the human condition that is human language and the processes of thinking behind it.

Thanks

I’ve been following the development of neural nets now for about 43 years, and during that time I’ve interacted with many people about them. Among them—some from long ago, some from recently, and some across many years—have been: Giulio Alessandrini, Dario Amodei, Etienne Bernard, Taliesin Beynon, Sebastian Bodenstein, Greg Brockman, Jack Cowan, Pedro Domingos, Jesse Galef, Roger Germundsson, Robert Hecht-Nielsen, Geoff Hinton, John Hopfield, Yann LeCun, Jerry Lettvin, Jerome Louradour, Marvin Minsky, Eric Mjolsness, Cayden Pierce, Tomaso Poggio, Matteo Salvarezza, Terry Sejnowski, Oliver Selfridge, Gordon Shaw, Jonas Sjöberg, Ilya Sutskever, Gerry Tesauro and Timothee Verdier. For help with this piece, I’d particularly like to thank Giulio Alessandrini and Brad Klee.


