This post relates an observation I've made in my work with GPT-2, which I have not seen made elsewhere.

IMO, this observation sheds a good deal of light on how the GPT-2/3/etc models (hereafter just "GPT") work internally.

There is an accompanying Colab notebook which will let you interactively explore the phenomenon I describe here.

[Edit: updated with another section on comparing to the inputs, rather than the outputs. This arguably resolves some of my confusion at the end. Thanks to algon33 and Gurkenglas for relevant suggestions here.]

[Edit 5/17/21: I've recently written a new Colab notebook which extends this post in various ways:

  • trying the "lens" on various models from 125M to 2.7B parameters, including GPT-Neo and CTRL
  • exploring the contributions of the attention and MLP sub-blocks within transformer blocks/layers
  • trying out a variant of the "decoder" used in this post, which dramatically helps with interpreting some models

]

overview

  • GPT's probabilistic predictions are a linear function of the activations in its final layer. If one applies the same function to the activations of intermediate GPT layers, the resulting distributions make intuitive sense.
    • This "logit lens" provides a simple (if partial) interpretability lens for GPT's internals.
    • Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
  • These distributions gradually converge to the final distribution over the layers of the network, often getting close to that distribution long before the end.
    • At some point in the middle, GPT will have formed a "pretty good guess" as to the next token, and the later layers seem to be refining these guesses in light of one another.
    • The general trend, as one moves from earlier to later layers, is
      • "nonsense / not interpretable" (sometimes, in very early layers) -->
      • "shallow guesses (words that are the right part of speech / register / etc)" -->
      • "better guesses"
    • ...though some of those phases are sometimes absent.
  • On the other hand, only the inputs look like the input tokens.
    • In the logit lens, the early layers sometimes look like nonsense, and sometimes look like very simple guesses about the output. They almost never look like the input.
    • Apparently, the model does not "keep the inputs around" for a while and gradually process them into some intermediate representation, then into a prediction.
    • Instead, the inputs are immediately converted to a very different representation, which is smoothly refined into the final prediction.
  • This is reminiscent of the perspective in Universal Transformers which sees transformers as iteratively refining a guess.
    • However, Universal Transformers have both an encoder and decoder, while GPT is only a decoder. This means GPT faces a tradeoff between keeping around the input tokens, and producing the next tokens.
    • Eventually it has to spit out the next token, so the longer it spends (in depth terms) processing something that looks like token i, the less time it has to convert it into token i+1. GPT has a deadline, and the clock is ticking.
  • More speculatively, this suggests that GPT mostly "thinks in predictive space," immediately converting inputs to predicted outputs, then refining guesses in light of other guesses that are themselves being refined.
    • I think this might suggest there is some fundamentally better way to do sampling from GPT models? I'm having trouble writing out the intuition clearly, so I'll leave it for later posts.
  • Caveat: I call this a "lens" because it is one way of extracting information from GPT's internal activations. I imagine there is other information present in the activations that cannot be understood by looking at logits over tokens. The logit lens shows us some of what is going on, not all of it.

background on GPT's structure

You can skip or skim this if you already know it.

  • Input and output
    • As input, GPT takes a sequence of tokens. Each token is a single item from a vocabulary of N_v=50257 byte pairs (mostly English words).
    • As output, GPT returns a probability distribution over the vocabulary. It is trained so this distribution predicts the next token.
    • That is, the model's outputs are shifted forward by one position relative to the inputs. The token at position i should, after flowing through the layers of the model, turn into the token at position i+1. (More accurately, a distribution over the token at position i+1.)
  • Vocab and embedding spaces
    • The vocab has size N_v=50257, but GPT works internally in a smaller "embedding" vector space, of dimension N_e.
      • For example, in the GPT-2 1558M model size, N_e=1600. (Below, I'll often assume we're talking about GPT-2 1558M for concreteness.)
    • There is an N_v-by-N_e embedding matrix W which is used to project the vocab space into the embedding space and vice versa.
  • In, blocks, out
    • The first thing that happens to the inputs is a multiplication by W, which projects them into the embedding space. [1]
    • The resulting 1600-dimensional vector then passes through many neural network blocks, each of which returns another 1600-dimensional vector.
    • At the end, the final 1600-dimensional vector is multiplied by W's transpose to project back into vocab space.
    • The resulting 50257-dim vectors are treated as logits. Applying the softmax function to them gives you the output probability distribution. (See the code sketch after this list.)
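To make the shapes concrete, here is a minimal sketch of that pipeline using the HuggingFace transformers implementation of GPT-2. This is my own illustration, not code from the post's notebook; the attribute names (model.transformer.wte, out.hidden_states, etc.) are HuggingFace's, and gpt2-xl is the 1558M checkpoint with N_e=1600.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# gpt2-xl is the 1558M-parameter GPT-2 discussed in the post (N_e = 1600)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

tokens = tokenizer("We train GPT-2 on a large corpus", return_tensors="pt")

W = model.transformer.wte.weight        # embedding matrix, shape (N_v, N_e) = (50257, 1600)

with torch.no_grad():
    out = model(**tokens, output_hidden_states=True)

# out.hidden_states: tuple of 49 tensors of shape (batch, seq, N_e): the embedding
# output followed by each block's output (HuggingFace returns the last one with the
# final layer norm already applied).
# out.logits: (batch, seq, N_v) -- one next-token distribution per position.
print(W.shape, len(out.hidden_states), out.logits.shape)
```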

the logit lens

As described above, GPT schematically looks like

  • Project the input tokens from vocab space into the 1600-dim embedding space
  • Modify this 1600-dim vector many times
  • Project the final 1600-dim vector back into vocab space

We have a "dictionary," W, that lets us convert between vocab space and embedding space at any point. We know that some vectors in embedding space make sense when converted into vocab space:

  • The very first embedding vectors are just the input tokens (in embedding space)
  • The very last embedding vectors are just the output logits (in embedding space)

What about the 1600-dim vectors produced in the middle of the network, say the output of the 12th layer or the 33rd? If we convert them to vocab space, do the results make sense? The answer is yes.
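In code terms, here is what "converting to vocab space" looks like. This is a simplified sketch continuing the HuggingFace example above, not the post's exact notebook code: it reuses the final layer norm ln_f and the tied unembedding (multiplication by W^T) for every layer, whereas the plots below use the per-block normalization described in the notes.

```python
import torch

def logit_lens(model, hidden_states):
    """Project every layer's activations back into vocab space.

    Simplified sketch: applies the final layer norm (ln_f) and the tied
    unembedding (lm_head, i.e. multiplication by W^T) to each layer's output.
    (HuggingFace already applies ln_f to the last element, so the final row
    is only approximately equal to out.logits.)
    """
    with torch.no_grad():
        layer_logits = [model.lm_head(model.transformer.ln_f(h)) for h in hidden_states]
    return torch.stack(layer_logits)            # (n_layers + 1, batch, seq, N_v)

all_logits = logit_lens(model, out.hidden_states)
top_guesses = all_logits.argmax(dim=-1)         # top-1 token id at every (layer, position)

# Decode the top guess at the final position, layer by layer:
print([tokenizer.decode([int(t)]) for t in top_guesses[:, 0, -1]])
```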

logits

For example: the plots below show the logit lens on GPT-2 as it predicts a segment of the abstract of the GPT-3 paper. (This is a segment in the middle of the abstract; it can see all the preceding text, but I'm not visualizing the activations for it.)

For readability, I've made two plots showing two consecutive stretches of 10 tokens. Notes on how to read them:

  • The input tokens are shown as 45-degree tilted axis labels at the bottom.
  • The correct output (i.e. the input shifted by one) is likewise shown at the top.
    • A (*) is added in these labels when the model's top guess matched the correct output.
  • The vertical axis indexes the layers (or "blocks"), zero-indexed from 0 to 47. To make the plots less huge I skip every other intermediate layer. The Colab notebook lets you control this skipping as you like.
  • The top guess for each token, according to the model's activations at a given layer, is printed in each cell.
  • The colors show the logit associated with the top guess. These tend to increase steadily as the model converges on a "good guess," then get refined in the last layers.
  • Cells are outlined when their top guess matches the final top guess.
  • For transformer experts: the "activations" here are the block outputs after layer norm, but before the learned point-wise transformation.


There are various amusing and interesting things one can glimpse in these plots. The "early guesses" are generally wrong but often sensible enough in some way:

  • "We train GPT-3..." 000? (someday!)
  • "GPT-3, an..." enormous? massive? (not wrong!)
  • "We train GPT-3, an aut..." oreceptor? (later converges to the correct oregressive)
  • "model with 175..." million? (later converges to a comma, not the correct billion)

ranks

The view above focuses only on the top-1 guess at each layer, which is a reductive window on the full distributions.

Another way to look at things: we still reduce the final output to the top-1 guess, but we compare other distributions to the final one by looking at the rank of the final top-1 guess.

Even if the middle of the model hasn't yet converged to the final answer, maybe it's got that answer somewhere in its top 3, top 10, etc. That's a lot better than "top 50257."

Here's the same activations as ranks. (Remember: these are ranks of the model's final top-1 prediction, not the true token.)
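For reference, this quantity is straightforward to compute from the per-layer logits in the sketch above (again my own sketch, not the notebook code):

```python
# Rank of the final layer's top-1 prediction within each intermediate distribution.
# Rank 1 means the intermediate layer already agrees with the final top guess.
final_top1 = all_logits[-1].argmax(dim=-1)                       # (batch, seq)

idx = final_top1.unsqueeze(0).expand(all_logits.shape[:-1]).unsqueeze(-1)
final_token_scores = torch.gather(all_logits, -1, idx)           # (layers, batch, seq, 1)
ranks = (all_logits >= final_token_scores).sum(dim=-1)           # (layers, batch, seq)
```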





In most cases, the network's uncertainty has drastically reduced by the middle layers. The order of the top candidates may not be right, and the probabilities may not be perfectly calibrated, but it's got the gist already.

KL divergence and input discarding

Another way of comparing the similarity of two probability distributions is the KL divergence. Taking the KL divergence of the intermediate probabilities w/r/t the final probabilities, we get a more continuous view of how the distributions smoothly converge to the model's output.

Because KL divergence is a more holistic measure of the similarity between two distributions than the ones I've used above, it's also my preferred metric for making the point that nothing looks like the input.
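As with the ranks, this is easy to compute from the per-layer logits above; a sketch:

```python
import torch.nn.functional as F

# KL(final || layer): how far each intermediate distribution is from the
# model's final output distribution, per layer and per position.
log_probs = F.log_softmax(all_logits, dim=-1)            # (layers, batch, seq, vocab)
final_log_probs = log_probs[-1]                          # (batch, seq, vocab)

kl_vs_final = (final_log_probs.exp() * (final_log_probs - log_probs)).sum(dim=-1)
# kl_vs_final[k, b, i] = KL(final distribution || layer-k distribution) at position i
```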

In the plots above, I've skipped the input layer (i.e. the input tokens in embedding space). Why? Because they're so different from everything else, they distract the eye!

In the plots below, where color is KL divergence, I include the input as well. If we trust that KL divergence is a decent holistic way to compare two distributions (I've seen the same pattern with other metrics), then:

  • Immediately, after the very first layer, the input has been transformed into something that looks more like the final output (47 layers later) than it does like the input.
  • After this one discontinuous jump, the distribution progresses in a much more smooth way to the final output distribution.



other examples

I show several other examples in the Colab notebook. I'll breeze through a few of them here.

copying a rare token

Sometimes it's clear that the next token should be a "copy" of an earlier token: whatever arbitrary thing was in that slot, spit it out again.

If this is a token with relatively low prior probability, one would think it would be useful to "keep it around" from the input so later positions can look at it and copy it. But as we saw, the input is never "kept around"!

What happens instead? I tried this text:

Sometimes, when people say plasma, they mean a state of matter. Other times, when people say plasma

As shown below (truncated to the last few tokens for visibility), the model correctly predicts "plasma" at the last position, but only figures it out in the very last layers.

Apparently it is keeping around a representation of the token "plasma" with enough resolution to copy it . . . but it only retrieves this representation at the end! (In the rank view, the rank of plasma is quite low until the very end.)

This is surprising to me. The repetition is directly visible in the input: "when people say" is copied verbatim. If you just applied the rule "if input seems to be repeating, keep repeating it," you'd be good. Instead, the model scrambles away the pattern, then recovers it later through some other computational route.


extreme repetition

We've all seen GPT sampling get into a loop where text repeats itself exactly, over and over. When text is repeating like this, where is the pattern "noticed"?

At least in the following example, it's noticed in the upper half of the network, while the lower half can't see it even after several rounds of repetition.


why? / is this surprising?

First, some words about why this trick can even work at all.

One can imagine models that perform the exact same computation as GPT-2, for which this trick would not work. For instance, each layer could perform some arbitrary vector rotation of the previous one before doing anything else to it. This would preserve all the information, but the change of basis would prevent the vectors from making sense when multiplied by W^T.

Why doesn't the model do this? Two relevant facts:

1. Transformers are residual networks. Every connection in them looks like x + f(x) where f is the learned part. So the identity is very easy to learn.

This tends to keep things in the same basis across different layers, unless there's some reason to switch.

2. Transformers are usually trained with weight decay, which is almost the same thing as L2 regularization. This encourages learned weights to have small L2 norm.

That means the model will try to "spread out" a computation across as many layers as possible (since the sum-of-squares is less than the square-of-sums). Given the task of turning an input into an output, the model will generally prefer changing the input a little, then a little more, then a little more, bit by bit.
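A rough back-of-the-envelope illustration of that penalty structure (my own example, not something from the original post): suppose a representation needs to move by a vector of norm 1, and the model can do this either in one layer or spread evenly across all 48 layers. The single-layer version contributes 1^2 = 1 to the sum of squared step norms; the spread-out version contributes 48 * (1/48)^2 = 1/48. Under an L2-style penalty, spreading the change over many layers is roughly 48x cheaper, which is the sense in which "the sum-of-squares is less than the square-of-sums."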

1+2 are a good story if you want to explain why the same vector basis is used across the network, and why things change smoothly. This story would render the whole thing unsurprising . . . except that the input is discarded in such a discontinuous way!

I would have expected a U-shaped pattern, where the early layers mostly look like the input, the late layers mostly look like the output, and there's a gradual "flip" in the middle between the two perspectives. Instead, the input space immediately vanishes, and we're in output space the whole way.

Maybe there is some math fact I'm missing here.

Or, maybe there's some sort of "hidden" invertible relationship between

  • the embedding of a given token, and
  • the model's prior for what token comes after it (given no other information)

so that a token like "plasma" is kept around from the input -- but not in the form "the output is plasma," instead in the form "the output is [the kind of word that comes after plasma]."

However, I'm not convinced by that story as stated. For one thing, GPT layers don't share their weights, so the mapping between these two spaces would have to be separately memorized by each layer, which seems costly. Additionally, if this were true, we'd expect the very early activations to look like naive context-less guesses for the next token. Often they are, but just as often they're weird nonsense like "Garland."

addendum: more on "input discarding"

In comments, Gurkenglas noted that the plots showing KL(final || layer) don't tell the whole story.

The KL divergence is not a metric: it is not symmetric and does not obey the triangle inequality. Hence my intuitive picture of the distribution "jumping" from the input to the first layer, then smoothly converging to the final layer, is misleading: it implies we are measuring distances along a path through some space, but KL divergence does not measure distance in any space.

Gurkenglas and algon33 suggested plotting the KL divergences of everything w/r/t the input rather than the output: KL(input || layer).

Note that the input is close to a distribution that just assigns probability 1 to the input token ("close" because W * W^T is not invertible), so this is similar to asking "how probable is the input token, according to each layer?" That's a question which is also natural to answer by plotting ranks: what rank is assigned to the input token by each layer?
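Both quantities are again easy to read off from the per-layer logits computed earlier. A sketch, using layer 0 of that stack (the embedding output, which in the HuggingFace example also includes positional embeddings) as the "input" distribution:

```python
# KL(input || layer): distance from the lens distribution at the input layer.
input_log_probs = log_probs[0]                                    # (batch, seq, vocab)
kl_vs_input = (input_log_probs.exp() * (input_log_probs - log_probs)).sum(dim=-1)

# Rank assigned to the literal input token by each layer's distribution.
input_ids = tokens["input_ids"]                                   # (batch, seq)
idx = input_ids.unsqueeze(0).expand(all_logits.shape[:-1]).unsqueeze(-1)
input_scores = torch.gather(all_logits, -1, idx)
input_ranks = (all_logits >= input_scores).sum(dim=-1)            # (layers, batch, seq)
```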

Below, I show both: KL(input || layer), and the rank of the input token according to later layers.

  • For KL(input || layer), I use the same color scale as in the plots for KL(final || layer), so the two are comparable.
  • For the ranks, I do not use the same color scale: I have the colors bottom out at rank 1000 instead of rank 100. This gives more visual insight into where the model could be preserving input information.



  • There is still a fast jump in KL(input || layer) after the input.
    • However, it's far smaller than the jump in KL(output || layer) at the same point.
    • Note that the darkest color, meaning KL=30, does not appear on the plot of KL(input || layer).
    • On the plot of KL(output || layer), however, the maximum values were in fact much greater than 30; I cut off the color scale at 30 so other distinctions were perceptible at all.
  • Likewise, while ranks jump quickly after the input, they often stay relatively high in the context of a ~50K vocab.
    • I am curious about the differences here: some tokens are "preserved" much more in this sense than others.
    • This is apparently contextual, not just based on the token itself. Note the stark differences between the rank trajectories of the first, second, and third commas in the passage.

It's possible that the relatively high ranks -- in the 100s or 1000s, but not the 10000s -- of input tokens in many cases is (related to) the mechanism by which the model "keeps around" rarer tokens in order to copy them later.

As some evidence for this, I will show plots like the above for the plasma example. Here, I show a segment including the first instance of "plasma," rather than the second which copies it.




The preservation of "plasma" here is striking.
这里“等离子体”的保存是显著的。

My intuitive guess is that the rarity, or (in some sense) "surprisingness," of the token causes early layers to preserve it: this would provide a mechanism for providing raw access to rare tokens in the later layers, which otherwise only be looking at more plausible tokens that GPT had guessed for the corresponding positions.
我直觉的猜测是,标记的稀有性,或者说在某种意义上的“出人意料”,导致早期层对此进行保留:这为后续层提供了一个机制,以原始方式访问稀有标记,而后者则只能查看 GPT 为相应位置猜测的更可信标记。

On the other hand, this story has trouble explaining why "G" and "PT" are not better preserved in the GPT3 abstract plots just above. This is the first instance of "GPT" in the full passage, so the model can't rely on copies of these at earlier positions. That said, my sense of scale for "well-preservedness" is a wild guess, and these particular metrics may not be ideal for capturing it anyway.
另一方面,这个故事很难解释为什么“G”和“PT”在上面的 GPT3 抽象图中没有得到更好的保留。这是全文中“GPT”的第一次出现,因此模型无法依赖于早期位置的这些复制品。尽管如此,我对“良好保留度”的尺度感知仅是个大胆的猜测,这些特定的指标可能也不适合捕捉这一点。




  1. Right after this, positional embeddings are added. I'm ignoring positional embeddings in the post, but mention them in this footnote for accuracy. ↩︎

comments

I think this might suggest there is some fundamentally better way to do sampling from GPT models? I'm having trouble writing out the intuition clearly, so I'll leave it for later posts.

Unroll the sampling process: hook up all the individual GPT instances into a single long model, bypass the discretizing/embedding layers to make it differentiable end-to-end, and do gradient ascent to find the sequence which maximizes likelihood conditional on the fixed input.

Interesting, but not (I think?) the direction I was headed in.

I was thinking more about the way the model seems to be managing a tradeoff between preserving the representation of token i and producing the representation of token i+1.

The depth-wise continuity imposed by weight decay means late layers are representing something close to the final output -- in late layers the model is roughly looking at its own guesses, even if they were wrong, which seems suboptimal.

Consider this scenario:

  • The model does poorly at position i, assigning very low probability to the true token residing at i+1.
  • To retain a clear view of the input sequence, the model now needs to "keep around" the true token at i+1, since its own guess is a poor proxy.
  • But early layers don't know that: they can't "look up" and notice the poor prediction. So they just treat i+1 like any other position. (I.e. there's no way to implement a selective "copy when we got it wrong" mechanism)
  • In late layers, position i+1 has been converted into a guess about i+2 by the earlier layers, so we can't rely on it to tell us what really occupied i+1.
  • And position i has been converted to a bad guess about position i+1, so if we use it as a proxy for i+1 we'll do poorly.

My sampling idea was something like "let's replace (or interpolate) late activations with embeddings of the actual next token, so the model can see what really happened, even when its probability was low." (This is for sampling specifically because it'd be too slow in training, where you want to process a whole window at once with matrix operations; sampling has to be a loop anyway, so there's no cost to adding stuff that only works as a loop.)

But, thinking about it more, the model clearly can perform well in scenarios like the above, e.g. my plasma example and also many other cases naturally arising in language which GPT handles well.

I have no idea how it does it -- indeed the connection structure feels weirdly adverse to such operations -- but apparently it does. So it's probably premature to assume it can't do this well, and attempt to "help it out" with extra tricks.

How far away is this from being implementable?

It doesn't sound hard at all. The things Gwern is describing are the same sort of thing that people do for interpretability where they, eg, find an image that maximizes the probability of the network predicting a target class.

Of course, you need access to the model, so only OpenAI could do it for GPT-3 right now.

Doing it with GPT-3 would be quite challenging just for compute requirements like RAM. You'd want to test this out on GPT-2-117M first, definitely. If the approach works at all, it should work well for the smallest models too.

This is very neat. I definitely agree that I find the discontinuity from the first transformer block surprising. One thing which occurred to me that might be interesting to do is to try and train a linear model to reconstitute the input from the activations at different layers to get an idea of how the model is encoding the input. You could either train one linear model on data randomly sampled from different layers, or a separate linear model for each layer, and then see if there are any interesting patterns like whether the accuracy increases or decreases as you get further into the model. You could also see if the resulting matrix has any relationship to the embedding matrix (e.g. are the two matrices farther apart or closer together than would be expected by chance?). One possible hypothesis that this might let you test is whether the information about the input is being stored indirectly via what the model's guess is given that input or whether it's just being stored in parts of the embedding space that aren't very relevant to the output (if it's the latter, the linear model should put a lot of weight on basis elements that have very little weight in the embedding matrix).
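A rough sketch of the per-layer probe being proposed here (my own illustration; hidden and input_embeddings are assumed to hold activations at a single layer and the corresponding rows of the embedding matrix, collected from runs like the ones in the post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_reconstruction_probe(hidden: np.ndarray, input_embeddings: np.ndarray):
    """Fit a linear map from one layer's activations back to the input embeddings.

    hidden:           (n_examples, N_e) activations at a fixed layer
    input_embeddings: (n_examples, N_e) rows of W for the input token at each example
    """
    probe = LinearRegression().fit(hidden, input_embeddings)
    r2 = probe.score(hidden, input_embeddings)   # in-sample R^2; held-out data would be better
    return probe.coef_, r2                       # coef_ can then be compared against W, as suggested
```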

One thing which occurred to me that might be interesting to do is to try and train a linear model to reconstitute the input from the activations at different layers to get an idea of how the model is encoding the input. You could either train one linear model on data randomly sampled from different layers, or a separate linear model for each layer, and then see if there are any interesting patterns like whether the accuracy increases or decreases as you get further into the model.

That's a great idea!

One possible hypothesis that this might let you test is whether the information about the input is being stored indirectly via what the model's guess is given that input or whether it's just being stored in parts of the embedding space that aren't very relevant to the output (if it's the latter, the linear model should put a lot of weight on basis elements that have very little weight in the embedding matrix).

Hmm... I guess there is some reason to think the basis elements have special meaning (as opposed to the elements of any other basis for the same space), since the layer norm step operates in this basis.

But I doubt there are actually individual components the embedding cares little about, as that seems wasteful (you want to compress 50K into 1600 as well as you possibly can), and if the embedding cares about them even a little bit then the model needs to slot in the appropriate predictive information, eventually.

Thinking out loud, I imagine there might be a pattern where embeddings of unlikely tokens (given the context) are repurposed in the middle for computation (you know they're near-impossible so you don't need to track them closely), and then smoothly subtracted out at the end. There's probably a way to check if that's happening.

That's a great idea!

Thanks! I'd be quite excited to know what you find if you end up trying it.

Hmm... I guess there is some reason to think the basis elements have special meaning (as opposed to the elements of any other basis for the same space), since the layer norm step operates in this basis.

But I doubt there are actually individual components the embedding cares little about, as that seems wasteful (you want to compress 50K into 1600 as well as you possibly can), and if the embedding cares about them even a little bit then the model needs to slot in the appropriate predictive information, eventually.

Thinking out loud, I imagine there might be a pattern where embeddings of unlikely tokens (given the context) are repurposed in the middle for computation (you know they're near-impossible so you don't need to track them closely), and then smoothly subtracted out at the end. There's probably a way to check if that's happening.

I wasn't thinking you would do this with the natural component basis—though it's probably worth trying that also—but rather doing some sort of matrix decomposition on the embedding matrix to get a basis ordered by importance (e.g. using PCA or NMF—PCA is simpler though I know NMF is what OpenAI Clarity usually uses when they're trying to extract interpretable basis elements from neural network activations) and then seeing what the linear model looks like in that basis. You could even just do something like what you're saying and find some sort of basis ordered by the frequency of the tokens that each basis element corresponds to (though I'm not sure exactly what the right way would be to generate such a basis).

I also thought of PCA/SVD, but I imagine matrix decompositions like these would be misleading here.

What matters here (I think) is not some basis of N_emb orthogonal vectors in embedding space, but some much larger set of ~exp(N_emb) almost orthogonal vectors. We only have 1600 degrees of freedom to tune, but they're continuous degrees of freedom, and this lets us express >>1600 distinct vectors in vocab space as long as we accept some small amount of reconstruction error.

I expect GPT and many other neural models are effectively working in such a space of nearly orthogonal vectors, and picking/combining elements of it. A decomposition into orthogonal vectors won't really illuminate this. I wish I knew more about this topic -- are there standard techniques?

You might want to look into NMF, which, unlike PCA/SVD, doesn't aim to create an orthogonal projection. It works well for interpretability because its components cannot cancel each other out, which makes its features more intuitive to reason about. I think it is essentially what you want, although I don't think it will allow you to find directly the 'larger set of almost orthogonal vectors' you're looking for.

Related layer visualizations: "Looking for Grammar in All The Right Places".

Maybe I am misunderstanding something, but to me it is very intuitive that there is a big jump from the embedding output to the first transformer block output. The embedding is backpropagated into, so it makes sense to see all representations as representations of the prediction we are trying to make, i.e. of the next word.

But the embedding is a prediction of the next word based on only a single word, the word that is being embedded. So the prediction of the next word is by necessity very bad (the BPE ensures that, IIUC, because tokens that would always follow one another are merged).

The first transformer block integrates hundreds of words of context into the prediction; that's where the big jump comes from.

Is it really trained to output the input offset by one, or just to have the last slot contain the next word? Because I would expect it to be better at copying the input over by one...

If each layer were trained to give its best guess at the next token, this myopia would prevent all sorts of hiding data for later. This would be a good experiment for your last story, yes? I expect this would perform very poorly, though if it doesn't, hooray, for I really don't expect that version to develop inner optimizers.

I think I understand your question and was also confused by this for a bit so I wanted to add in some points of clarification. First I want to point out that I really couldn't find a satisfactory explanation of this particular detail (at least one that I could understand) so I pieced this together myself from looking at the huggingface code for GPT2. I may get some details wrong.

During training at each step the GPT2 takes in N tokens and outputs N tokens. But the i-th output token is computed in such a way that it only relies on the information from tokens 1, ..., i and is meant to predict the i+1-th token from these. I think it's best to think of each output being computed independently of the others (though this isn't strictly true since the separate outputs are computed by shared matrices). So for each i, we train the network so that the i-th output produces the correct result given the _input_ tokens 1, ..., i. There is a term in the loss function for each output token and the total loss is the sum of all the losses of the output tokens. The outputs at other positions do not play a role in the i-th output token, only the first 1, ..., i input tokens do.
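A minimal sketch of that shifted objective (my own illustration of the setup being described, not the huggingface code itself):

```python
import torch
import torch.nn.functional as F

def shifted_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Score the output at position i against the input token at position i+1.

    logits:    (batch, seq, vocab) -- the model's output at every position
    input_ids: (batch, seq)        -- the tokens that were fed in
    """
    pred = logits[:, :-1, :]                  # outputs for positions 1..N-1
    target = input_ids[:, 1:]                 # the tokens actually found at positions 2..N
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```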

During inference, given an input of k tokens, we are only concerned with the k-th output token (which should predict the token following the first k). GPT-3 also produces predictions for the outputs before position k but these are just ignored since we already know what these values should be.

Is it really trained to output the input offset by one, or just to have the last slot contain the next word? Because I would expect it to be better at copying the input over by one...

Not sure I understand the distinction, could you rephrase?

If by "last slot" you mean last layer (as opposed to earlier layers), that seems like the same thing as outputting the input offset by one.

If by "last slot" you mean the token N+1 given tokens (1, 2, ... N), then no, that's not how GPT works. If you put in tokens (1, 2, ... N), you always get guesses for tokens (2, 3, ..., N+1) in response. This is true even if all you care about is the guess for N+1.

I meant your latter interpretation.

Can you measure the KL-divergence at each layer from the input, rather than the output? KL does not satisfy the triangle inequality, so maybe most of the layers are KL-close to both input and output?

GPT uses ReLU, yes? Then the regularization would make it calculate using small values, which would be possible because ReLU is nonlinear on small values. If we used an activation function that's linear on small values, I would therefore expect more of the calculation to be visible.

Can you measure the KL-divergence at each layer from the input, rather than the output? KL does not satisfy the triangle inequality, so maybe most of the layers are KL-close to both input and output?

One can do this in the Colab notebook by calling show_token_progress with comparisons_vs="first" rather than the default "final". IIRC, this also shows a discontinuous flip at the bottom followed by slower change.

(This is similar to asking the question "do the activations assign high or low probability to the input token?" One can answer the same question by plotting logits or ranks with the input layer included.)

GPT uses ReLU, yes? Then the regularization would make it calculate using small values, which would be possible because ReLU is nonlinear on small values.

It uses gelu, but gelu has the same property. However, note that I am extracting activations right after the application of a layer norm operation, which shifts/scales the activations to mean 0 and L2 norm 1 before passing them to the next layer.

gelu has the same property

Actually, gelu is differentiable at 0, so it is linear on close-to-zero values.

Ah, I think we miscommunicated.

I meant "gelu(x) achieves its maximum curvature somewhere near x=0."

People often interpret relu as a piecewise linear version of functions like elu and gelu, which are curved near x=0 and linear for large |x|. In this sense gelu is like relu.

It sounds like you were, instead, talking about the property of relu that you can get nonlinear behavior for arbitrarily small inputs.

This is indeed unique to relu -- I remember some DeepMind (?) paper that used floating point underflow to simulate relu, and then made NNs out of just linear floating point ops. Obviously you can't simulate a differentiable function with that trick.

(OpenAI?)

floating point underflow to simulate relu

Oh that's not good. Looks like we'd need a version of float that keeps track of an interval of possible floats (by the two floats at the end of the interval). Then we could simulate the behavior of infinite-precision floats so long as the network keeps the bounds tight, and we could train the network to keep the simulation in working order. Then we could see whether, in a network thus linear at small numbers, every visibly large effect has a visibly large cause.

By the way - have you seen what happens when you finetune GPT to reinforce this pattern that you're observing, that every entry of the table, not just the top right one, predicts an input token?

IIRC, this also shows a discontinuous flip at the bottom followed by slower change.

Maybe edit the post so you include this? I know I was wondering about this too.

Post has now been updated with a long-ish addendum about this topic.

Good idea, I'll do that.

I know I'd run those plots before, but running them again after writing the post felt like it resolved some of the mystery. If our comparison point is the input, rather than the output, the jump in KL/rank is still there but it's smaller.

Moreover, the rarer the input token is, the more it seems to be preserved in later layers (in the sense of low KL / low vocab rank). This may be how tokens like "plasma" are "kept around" for later use.

Consider also trying the other direction - after all, KL is asymmetric.

Apparently it is keeping around a representation of the token "plasma" with enough resolution to copy it . . . but it only retrieves this representation at the end! (In the rank view, the rank of plasma is quite low until the very end.)

This is surprising to me. The repetition is directly visible in the input: "when people say" is copied verbatim. If you just applied the rule "if input seems to be repeating, keep repeating it," you'd be good. Instead, the model scrambles away the pattern, then recovers it later through some other computational route.

 

One more reason why this is surprising is that other experiments found that this behaviour (forgetting then recalling) is common in MLMs (masked language models) but not in simple language models like GPT-2 (see this blog post and more specifically this graph). The interpretation is that "for MLMs, representations initially acquire information about the context around the token, partially forgetting the token identity and producing a more generalized token representation; the token identity then gets recreated at the top layer" (citing from the blog post).

However, the logit lens here seems to indicate that this may happen in GPT-2 (large) too. Could this be a virtue of scale, where the same behaviour one obtains with an MLM is reached by an LM as well, given sufficient scale?

Are these known facts? If not, I think there's a paper in here.

In all of this, there seems to be an implicit assumption that the ordering of the embedding dimensions is consistent across layers, in the sense that "dog" is more strongly associated with dimension 12 in layers 2, 3, 4, etc.

I don't see any reason why this should be the case from either a training or model structure perspective. How, then, does the logit lens (which should clearly not be invariant with regard to a permutation of its inputs) still produce valid results for some intermediate layers?

Because the model has residual connections.

Ah, got it. Thanks a ton!

Cool project. There were some changes in HuggingFace's transformers package which are affecting your Colab implementation. See here:

https://github.com/huggingface/transformers/issues/29576

47 layers layer

47 layers later ?

Could you try a prompt that tells it to end a sentence with a particular word, and see how that word casts its influence back over the sentence? I know that this works with GPT-3, but I didn't really understand how it could.

Interesting topic! I'm not confident this lens would reveal much about it (vs. attention maps or something), but it's worth a try.

I'd encourage you to try this yourself with the Colab notebook, since you presumably have more experience writing this kind of prompt than I do.

Hey I'm not finished reading this yet but I noticed something off about what you said.

At the end, the final 1600-dimensional vector is multiplied by W's transpose to project back into vocab space.

This isn't quite right. They don't multiply by W's transpose at the end. Rather there is a completely new matrix at the end, whose shape is the same as the transpose of W.

You can see this in huggingface's code for GPT2. In the class GPT2LMHeadModel the final matrix multiplication is performed by the matrix called "lm_head", whereas the matrix you call W, which is used to map 50,257-dimensional vectors into 1600-dimensional space, is called "wte" (found in the GPT2Model class). You can see from the code that wte has shape "Vocab size x Embed Size" while lm_head has shape "Embed Size x Vocab size", so lm_head does have the same shape as W transpose but doesn't have the same numbers.


Edit: I could be wrong here, though. Maybe lm_head was set to be equal to wte transpose? I'm looking through the GPT-2 paper but don't see anything like that mentioned.

Maybe lm_head was set to be equal to wte transpose?

Yes, this is the case in GPT-2. Perhaps the huggingface implementation supports making these two matrices different, but they are the same in the official GPT-2.

  • In OpenAI's tensorflow code, see lines 154 and 171 of src/model.py. The variable "wte" is defined on 151, then re-used on 171.
  • In the original GPT paper, see eqs. (2) in section 3.1. The same matrix W_e is used twice. (The GPT-2 and GPT-3 papers just refer you back to the GPT paper for architecture details, so the GPT paper is the place to look.)
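If you want to check this in a current huggingface install, the tying is easy to verify directly (a quick sketch; as I understand it the two attributes point at the same tensor when weights are tied, which is the default for GPT-2):

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# lm_head.weight should be the very same tensor as the token embedding wte.weight
print(model.lm_head.weight is model.transformer.wte.weight)   # expected: True
print(model.lm_head.weight.shape)                             # (50257, 768) for the smallest GPT-2
```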

Edit: I think the reason this is obscured in the huggingface implementation is that they always distinguish the internal layers of a transformer from the "head" used to convert the final layer outputs into predictions. The intent is easy swapping between different "heads" with the same "body" beneath.

This forces their code to allow for heads that differ from the input embedding matrix, even when they implement models like GPT-2 where the official specification says they are the same.

Edit2: might as well say explicitly that I find the OpenAI tensorflow code much more readable than the huggingface code. This isn't a critique of the latter; it's trying to support every transformer out there in a unified framework. But if you only care about GPT, this introduces a lot of distracting abstraction.

Thanks for the info.

This was a great read, very informative.