It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?
In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition.
Consider a toy model where we train an embedding of five features of varying importance, where "importance" is a scalar multiplier on the mean squared error loss, in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features. With dense features, the model learns to represent an orthogonal basis of the two most important features (similar to what Principal Component Analysis might give us), and the other three features are not represented. But if we make the features sparse, this changes:
Not only can models store additional features in superposition by tolerating some interference, but we'll show that, at least in certain limited cases, models can perform computation while in superposition. (In particular, we'll show that models can put simple circuits computing the absolute value function in superposition.) This leads us to hypothesize that the neural networks we observe in practice are in some sense noisily simulating larger, highly sparse networks. In other words, it's possible that models we train can be thought of as doing “the same thing as” an imagined much-larger model, representing the exact same features but with no interference.
Feature superposition isn't a novel idea. A number of previous interpretability papers have considered it.
For interpretability researchers, our main contribution is providing a direct demonstration that superposition occurs in artificial neural networks given a relatively natural setup, suggesting this may also occur in practice. That is, we show a case where interpreting neural networks as having sparse structure in superposition isn't just a useful post-hoc interpretation, but actually the "ground truth" of a model. We offer a theory of when and why this occurs, revealing a phase diagram for superposition. This explains why neurons are sometimes "monosemantic", responding to a single feature, and sometimes "polysemantic", responding to many unrelated features. We also find that, at least in our toy model, superposition exhibits complex geometric structure.
But our results may also be of broader interest. We find preliminary evidence that superposition may be linked to adversarial examples and grokking, and might also suggest a theory for the performance of mixture of experts models. More broadly, the toy model we investigate has unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar to the fractional quantum Hall effect in physics, among other striking phenomena. We originally investigated the subject to gain understanding of cleanly-interpretable neurons in larger models, but we've found these toy models to be surprisingly interesting in their own right.
In our toy models, we are able to demonstrate that:

- Superposition is a real, observed phenomenon.
- Both monosemantic and polysemantic neurons can form.
- At least some kinds of computation can be performed in superposition.
- Whether features are stored in superposition is governed by a phase change.
- Superposition organizes features into geometric structures such as digons, triangles, pentagons, and tetrahedrons.
Our toy models are simple ReLU networks, so it seems fair to say that neural networks exhibit these properties in at least some regimes, but it's very unclear how much of this generalizes to real networks.
In our work, we often think of neural networks as having features of the input represented as directions in activation space. This isn't a trivial claim. It isn't obvious what kind of structure we should expect neural network representations to have. When we say something like "word embeddings have a gender direction" or "vision models have curve detector neurons", one is implicitly making strong claims about the structure of network representations.
Despite this, we believe this kind of "linear representation hypothesis" is supported both by significant empirical findings and theoretical arguments. One might think of this as two separate properties, which we'll explore in more detail shortly:

- Decomposability: network representations can be described in terms of independently understandable features.
- Linearity: features are represented by directions.
If we hope to reverse engineer neural networks, we need a property like decomposability. Decomposability is what allows us to reason about the model without fitting the whole thing in our heads! But it's not enough for things to be decomposable: we need to be able to access the decomposition somehow. In order to do this, we need to identify the individual features within a representation. In a linear representation, this corresponds to determining which directions in activation space correspond to which independent features of the input.
Sometimes, identifying feature directions is very easy because features seem to correspond to neurons. For example, many neurons in the early layers of InceptionV1 clearly correspond to features (e.g. curve detector neurons).
Superposition has been hypothesized in previous work.
The goal of this section will be to motivate these ideas and unpack them in detail.
It's worth noting that many of the ideas in this section have close connections to ideas in other lines of interpretability research (especially disentanglement), neuroscience (distributed representations, population codes, etc), compressed sensing, and many other lines of work. This section will focus on articulating our perspective on the problem. We'll discuss these other lines of work in detail in Related Work.
When we talk about "features" and how they're represented, this is ultimately theory building around several observed empirical phenomena. Before describing how we conceptualize those results, we'll simply describe some of the major results motivating our thinking:
V("king") - V("man") + V("woman") = V("queen")
(but see V("king") - V("man") + V("woman") = V("queen")
这样的嵌入向量运算(但请参见 As a result, we tend to think of neural network representations as being composed of features which are represented as directions. We'll unpack this idea in the following sections.
因此,我们往往认为神经网络的表示由特征构成,这些特征表现为方向。我们将在接下来的章节中详细阐述这一概念。
Our use of the term "feature" is motivated by the interpretable properties of the input we observe neurons (or word embedding directions) responding to. There's a rich variety of such observed properties!
(In the vision domain, these observed properties range from those of low-level neurons, such as curve detectors, up to far more abstract ones.)
But even with that motivation, it turns out to be quite challenging to create a satisfactory definition of a feature. Rather than offer a single definition we're confident about, we consider three potential working definitions:
We've written this paper with the final "neurons in sufficiently large models" definition in mind. But we aren't overly attached to it, and actually think it's probably important to not prematurely attach to a definition.
(Avoiding premature commitment to definitions is a theme of a famous book by Lakatos, Proofs and Refutations.)
As we've mentioned in previous sections, we generally think of features as being represented by directions. For example, in word embeddings, "gender" and "royalty" appear to correspond to directions, allowing arithmetic like V("king") - V("man") + V("woman") = V("queen").
Let's call a neural network representation linear if features correspond to directions in activation space. In a linear representation, each feature f_i has a corresponding representation direction W_i; several features activating with values x_{f_1}, x_{f_2}, … are represented by the sum x_{f_1} W_{f_1} + x_{f_2} W_{f_2} + …
We don't think it's a coincidence that neural networks empirically seem to have linear representations. Neural networks are built from linear functions interspersed with non-linearities. In some sense, the linear functions are the vast majority of the computation (for example, as measured in FLOPs). Linear representations are the natural format for neural networks to represent information in! Concretely, there are three major benefits:
It is possible to construct non-linear representations, and retrieve information from them, if you use multiple layers (although even these examples can be seen as linear representations with more exotic features). We provide an example in the appendix. However, our intuition is that non-linear representations are generally inefficient for neural networks.
One might think that a linear representation can only store as many features as it has dimensions, but it turns out this isn't the case! We'll see that the phenomenon we call superposition will allow models to store more features – potentially many more features – in linear representations.
For discussion on how this view of features squares with a conception of features as being multidimensional manifolds, see the appendix “What about Multidimensional Features?”.
Even if features are encoded as directions, a natural question to ask is which directions? In some cases, it seems useful to consider the basis directions, but in others it doesn't. Why is this?
When researchers study word embeddings, it doesn't make sense to analyze basis directions. There would be no reason to expect a basis dimension to be different from any other possible direction. One way to see this is to imagine applying some random invertible linear transformation to the embedding: the transformed representation carries exactly the same information, yet has entirely different basis dimensions.
But many neural network layers are not like this. Often, something about the architecture makes the basis directions special, such as applying an activation function. This "breaks the symmetry", making those directions special, and potentially encouraging features to align with the basis dimensions. We call this a privileged basis, and call the basis directions "neurons." Often, these neurons correspond to interpretable features.
From this perspective, it only makes sense to ask if a neuron is interpretable when it is in a privileged basis. In fact, we typically reserve the word "neuron" for basis directions which are in a privileged basis. (See longer discussion here.)
Note that having a privileged basis doesn't guarantee that features will be basis-aligned – we'll see that they often aren't! But it's a minimal condition for the question to even make sense.
Even when there is a privileged basis, it's often the case that neurons are "polysemantic", responding to several unrelated features. One explanation for this is the superposition hypothesis.
Several results from mathematics suggest that something like this might be plausible: although an n-dimensional space can contain only n orthogonal vectors, it can contain exponentially many "almost orthogonal" vectors (as in the Johnson–Lindenstrauss lemma), and compressed sensing shows that sparse high-dimensional vectors can be recovered from much lower-dimensional projections.
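To make the "almost orthogonal vectors" point concrete, here is a minimal numpy sketch (the dimension and vector count are arbitrary choices of ours) showing that many random unit vectors in a moderately high-dimensional space have small pairwise cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 2048                 # hypothetical: 2048 vectors in only 512 dimensions
V = rng.normal(size=(k, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalize to unit vectors

# Off-diagonal cosine similarities measure pairwise "interference".
G = V @ V.T
off_diag = G[~np.eye(k, dtype=bool)]
print(f"max |cos|  = {np.abs(off_diag).max():.3f}")   # far below 1
print(f"mean |cos| = {np.abs(off_diag).mean():.3f}")  # concentrates near 1/sqrt(d)
```

Even with four times as many vectors as dimensions, no pair is remotely parallel, which is exactly the room superposition exploits.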
Concretely, in the superposition hypothesis, features are represented as almost-orthogonal directions in the vector space of neuron outputs. Since the features are only almost-orthogonal, one feature activating looks like other features slightly activating. Tolerating this "noise" or "interference" comes at a cost. But for neural networks with highly sparse features, this cost may be outweighed by the benefit of being able to represent more features! (Crucially, sparsity greatly reduces the costs since sparse features are rarely active to interfere with each other, and non-linear activation functions create opportunities to filter out small amounts of noise.)
One way to think of this is that a small neural network may be able to noisily "simulate" a sparse larger model:
Although we've described superposition with respect to neurons, it can also occur in representations with an unprivileged basis, such as a word embedding. Superposition simply means that there are more features than dimensions.
The ideas in this section might be thought of in terms of four progressively more strict properties that neural network representations might have: decomposability, linearity, non-superposition, and basis alignment.
The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter two (non-superposition and basis-aligned) are properties we believe only sometimes occur.
If one takes the superposition hypothesis seriously, a natural first question is whether neural networks can actually noisily represent more features than they have neurons. If they can't, the superposition hypothesis may be comfortably dismissed.
The intuition from linear models would be that this isn't possible: the best a linear model can do is to store the principal components. But we'll see that adding just a slight nonlinearity can make models behave in a radically different way! This will be our first demonstration of superposition. (It will also be an object lesson in the complexity of even very simple neural networks.)
Our goal is to explore whether a neural network can project a high-dimensional vector x into a lower-dimensional vector h, and then recover it.
We begin by describing the high-dimensional vector x, whose components we think of as the "features" the model should represent.
Since we don't have any ground truth for features, we need to create synthetic data for x that captures the properties we believe real features have, most importantly sparsity and varying importance.
Concretely, our synthetic data is defined as follows: each component x_i of the input vectors is zero with probability S (the sparsity), and otherwise drawn uniformly from [0, 1].
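A minimal sketch of this sampling procedure (the function name and arguments are our own):

```python
import numpy as np

def sample_features(batch, n, sparsity, rng):
    """Each feature is 0 with probability `sparsity`, else Uniform[0, 1]."""
    x = rng.uniform(0.0, 1.0, size=(batch, n))
    inactive = rng.uniform(size=(batch, n)) < sparsity   # True -> feature is zero
    return np.where(inactive, 0.0, x)

rng = np.random.default_rng(0)
x = sample_features(batch=8, n=5, sparsity=0.9, rng=rng)
print(x)   # mostly zeros at S = 0.9
```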
We will actually consider two models, which we motivate below. The first "linear model" is a well understood baseline which does not exhibit superposition. The second "ReLU output model" is a very simple model which does exhibit superposition. The two models vary only in the final activation function.
Why these models?
The superposition hypothesis suggests that each feature in the higher-dimensional model corresponds to a direction in the lower-dimensional space. This means we can represent the down projection as a linear map h = Wx.
To recover the original vector, we'll use the transpose of the same matrix, so the reconstruction is x' = W^T W x.
We also add a bias. One motivation for this is that it allows the model to set features it doesn't represent to their expected value. But we'll see later that the ability to set a negative bias is important for superposition for a second set of reasons – roughly, it allows models to discard small amounts of noise.
The final step is whether to add an activation function. This turns out to be critical to whether superposition occurs. In a real neural network, when features are actually used by the model to do computation, there will be an activation function, so it seems principled to include one at the end.
Our loss is mean squared error weighted by the feature importances, I_i, described above: L = Σ_x Σ_i I_i (x_i − x'_i)².
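To make the setup concrete, here is one way the two models and the importance-weighted loss could be implemented in PyTorch. This is a sketch under our reading of the setup (class name, initialization scale, and defaults are our own choices), not reference code from the paper:

```python
import torch

class ToyModel(torch.nn.Module):
    """x' = (optionally ReLU)(W^T W x + b), with W of shape (m, n).
    Columns of W are the feature embedding directions."""
    def __init__(self, n_features, m_hidden, final_relu=True):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(m_hidden, n_features) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(n_features))
        self.final_relu = final_relu

    def forward(self, x):
        h = x @ self.W.T            # down-project: (batch, m)
        out = h @ self.W + self.b   # up-project:   (batch, n)
        return torch.relu(out) if self.final_relu else out

def weighted_mse(x, x_hat, importance):
    # importance: (n,) scalar multiplier on each feature's squared error
    return (importance * (x - x_hat) ** 2).sum(dim=-1).mean()
```

Setting `final_relu=False` gives the linear baseline; `final_relu=True` gives the ReLU output model.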
Our first experiment will simply be to train a few ReLU output models with different sparsity levels and visualize the results. (We'll also train a linear model – if optimized well enough, the linear model solution does not depend on sparsity level.)
The main question is how to visualize the results. The simplest way is to visualize W^T W (a features-by-features matrix describing how each feature is recovered) and b (a feature-length bias vector).
But the thing we really care about is this hypothesized phenomenon of superposition – does the model represent "extra features" by storing them non-orthogonally? Is there a way to get at it more explicitly? Well, one question is just how many features the model learns to represent. For any feature, whether or not it is represented is determined by ||W_i||, the norm of its embedding vector.
We'd also like to understand whether a given feature shares its dimension with other features. For this, we calculate Σ_{j≠i} (Ŵ_i · W_j)², the total squared projection of the other features onto the direction of feature i.
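Both diagnostics can be computed directly from W (columns taken as feature embeddings); a sketch with our own helper names:

```python
import numpy as np

def feature_norms(W):
    """||W_i|| for each feature column i: ~1 means represented, ~0 means not."""
    return np.linalg.norm(W, axis=0)

def superposition(W):
    """For each feature i, sum over j != i of (W_i_hat . W_j)^2:
    how much the other features project onto feature i's direction."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    W_hat = W / np.maximum(norms, 1e-9)
    proj = W_hat.T @ W              # entry (i, j) = W_i_hat . W_j
    np.fill_diagonal(proj, 0.0)
    return (proj ** 2).sum(axis=1)
```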
We can visualize the model we looked at previously this way:
Now that we have a way to visualize models, we can start to actually do experiments. We'll start by considering models with only a few features (n = 5) and a small hidden dimension (m = 2), trained at several sparsity levels.
As our standard intuitions would expect, the linear model always learns the top-m most important features, representing them orthogonally and dropping the rest – much like PCA.
The results are qualitatively similar for models with more features and hidden dimensions. For example, in larger models we see the same progression: as sparsity increases, more and more features are stored in superposition.
In the previous section, we observed a surprising empirical result: adding a ReLU to the output of our model allowed a radically different solution – superposition – which doesn't occur in linear models.
The model where it occurs is still quite mathematically simple. Can we analytically understand why superposition is occurring? And for that matter, why does adding a single non-linearity make things so different from the linear model case? It turns out that we can get a fairly satisfying answer, revealing that our model is governed by balancing two competing forces – feature benefit and interference – which will be useful intuition going forwards. We'll also discover a connection to the famous Thomson Problem in chemistry.
Let's start with the linear case. This is well understood by prior work! If one wants to understand why linear models don't exhibit superposition, the easy answer is to observe that linear models essentially perform PCA. But this isn't fully satisfying: if we set aside all our knowledge and intuition about linear functions for a moment, why exactly is it that superposition can't occur?
A deeper understanding can come from the results of Saxe et al., who studied the learning dynamics of deep linear networks.
The Saxe results reveal that there are fundamentally two competing forces which control learning dynamics in the considered model. Firstly, the model can attain a better loss by representing more features (we've labeled this "feature benefit"). But it also gets a worse loss if it represents more than it can fit orthogonally due to "interference" between features.
(As an aside, it is interesting to compare the notion of interference in linear models with the notion of coherence in compressed sensing.)
Can we achieve a similar kind of understanding for the ReLU output model? Concretely, we'd like to understand the expected loss of a given weight configuration in closed form.
The integral over the input distribution decomposes into terms according to how many features are active at once; in the high-sparsity limit, the term for 1-sparse vectors (exactly one feature active) dominates.
This new equation is vaguely similar to the famous Thomson problem in chemistry. In particular, if we assume uniform importance and that there are a fixed number of features with unit-norm embeddings, minimizing the loss amounts to spreading points over a hypersphere so as to minimize their pairwise interference – much as the Thomson problem spreads mutually repelling electrons over a sphere.
Another interesting property is that ReLU makes negative interference free in the 1-sparse case. This explains why the solutions we've seen prefer to only have negative interference when possible. Further, using a negative bias can convert small positive interferences into essentially being negative interferences.
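A quick numeric check of this claim, using a hypothetical pair of feature embeddings:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

w_i = np.array([1.0, 0.0])        # feature i's embedding
w_j_neg = np.array([-0.6, 0.8])   # negative interference: w_i . w_j < 0
w_j_pos = np.array([0.6, 0.8])    # positive interference: w_i . w_j > 0

x_i = 1.0                         # only feature i active (the 1-sparse case)
h = x_i * w_i
# Feature j was inactive, so its reconstructed value should be 0:
print(relu(h @ w_j_neg))   # 0.0 -> ReLU filters negative interference for free
print(relu(h @ w_j_pos))   # 0.6 -> positive interference leaks through as error
```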
What about the terms corresponding to less sparse vectors? We leave explicitly writing these out to the reader, but the main idea is that there are multiple compounding interferences, and the "active features" can experience interference. In a later section, we'll see that features often organize themselves into sparse interference graphs such that only a small number of features interfere with another feature – it's interesting to note that this reduces the probability of compounding interference and makes the 1-sparse loss term more important relative to others.
The results in the previous section seem to suggest that there are three outcomes for a feature when we train a model: (1) the feature may simply not be learned; (2) the feature may be learned, and represented in superposition; or (3) the model may represent a feature with a dedicated dimension. The transitions between these three outcomes seem sharp. Possibly, there's some kind of phase change.
(Here we use "phase change" in the generic sense of a discontinuous change, rather than the more technical sense of a discontinuity that appears only in the limit of infinite system size.)
One way to understand this better is to explore if there's something like a "phase diagram" from physics, which could help us understand when a feature is expected to be in one of these regimes. Although we can see hints of this in our previous experiment, it's hard to really isolate what's going on because many features are changing at once and there may be interaction effects. As a result, we set up the following experiment to better isolate the effects.
As an initial experiment, we consider models with 2 features but only 1 hidden layer dimension. We still consider the ReLU output model, x' = ReLU(W^T W x + b).
We can compare this to a theoretical "toy model of the toy model" where we can get closed-form solutions for the loss of different weight configurations as a function of importance and sparsity. There are three natural ways to store 2 features in 1 dimension: dedicate the dimension to the more important feature (W = [1, 0]), dedicate it to the less important feature (W = [0, 1]), or store both features in antipodal superposition (W = [1, −1]).
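Rather than deriving the closed forms here, a Monte Carlo sketch can compare the three configurations (the bias is omitted and the importance values are hypothetical, so this is an approximation of the "toy model of the toy model", not the paper's exact analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def expected_loss(W, importance, sparsity, n_samples=200_000):
    """Monte Carlo loss of x' = ReLU(W^T W x) for 2 features in 1 dimension."""
    x = rng.uniform(size=(n_samples, 2))
    x[rng.uniform(size=(n_samples, 2)) < sparsity] = 0.0
    h = x @ W                       # scalar hidden activation per sample
    x_hat = relu(np.outer(h, W))    # W^T W x, then ReLU
    return (importance * (x - x_hat) ** 2).sum(axis=1).mean()

importance = np.array([1.0, 0.7])  # hypothetical relative importance
configs = {"first only": np.array([1.0, 0.0]),
           "second only": np.array([0.0, 1.0]),
           "antipodal": np.array([1.0, -1.0])}
for S in (0.0, 0.9, 0.99):
    losses = {name: expected_loss(W, importance, S) for name, W in configs.items()}
    print(f"S={S}: best = {min(losses, key=losses.get)}  {losses}")
```

At low sparsity the dedicated-dimension solution wins; at high sparsity the antipodal solution takes over, which is exactly the crossover behind the phase change discussed below.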
As expected, sparsity is necessary for superposition to occur, but we can see that it interacts in an interesting way with relative feature importance. But most interestingly, there appears to be a real phase change, observed in both the empirical and theoretical diagrams! The optimal weight configuration discontinuously changes in magnitude and superposition. (In the theoretical model, we can analytically confirm that there's a first-order phase change: there's crossover between the functions, causing a discontinuity in the derivative of the optimal loss.)
We can ask this same question of embedding three features in two dimensions. This problem still has a single "extra feature" (now the third one) we can study, asking what happens as we vary its importance relative to the other two and change sparsity.
For the theoretical model, we now consider four natural solutions. We can describe solutions by asking "what feature direction, if any, did each feature get embedded as?"
These diagrams suggest that there really is a phase change between different strategies for encoding features. However, we'll see in the next section that there's much more complex structure this preliminary view doesn't capture.
We've seen that superposition can allow a model to represent extra features, and that the number of extra features increases as we increase sparsity. In this section, we'll investigate this relationship in more detail, discovering an unexpected geometric story: features seem to organize themselves into geometric structures such as pentagons and tetrahedrons! In some ways, the structure described in this section seems "too elegant to be true" and we think there's a good chance it's at least partly idiosyncratic to the toy model we're investigating. But it seems worth investigating because if anything about this generalizes to real models, it may give us a lot of leverage in understanding their representations.
We'll start by investigating uniform superposition, where all features are identical: independent, equally important and equally sparse. It turns out that uniform superposition has a surprising connection to the geometry of uniform polytopes! Later, we'll move on to investigate non-uniform superposition, where features are not identical. It turns out that this can be understood, at least to some extent, as a deformation of uniform superposition.
As mentioned above, we begin our investigation with uniform superposition, where all features have the same importance and sparsity. We'll see later that this case has some unexpected structure, but there's also a much more basic reason to study it: it's much easier to reason about than the non-uniform case, and has fewer variables we need to worry about in our experiments.
We'd like to understand what happens as we change the feature sparsity, S, holding everything else constant.
A convenient way to measure the number of features the model has learned is to look at the squared Frobenius norm, ||W||_F² = Σ_i ||W_i||². Since each represented feature contributes a norm of roughly 1, this approximately counts the number of represented features.
We'll plot D = m / ||W||_F², the number of hidden dimensions per represented feature, as a function of increasing sparsity.
Surprisingly, we find that this graph is "sticky" at 1 and ½: at 1, every represented feature gets a dedicated dimension, while at ½, features pair up into antipodal pairs, with two features sharing a single dimension with opposite signs.
It turns out that antipodal pairs are just the tip of the iceberg. Hiding underneath this curve are a number of extremely specific geometric configurations of features.
In the previous section, we saw that there's a sticky regime where the model has "half a dimension per feature" in some sense. This is an average statistical property of the features the model represents, but it seems to hint at something interesting. Is there a way we could understand what "fraction of a dimension" a specific feature gets?
We'll define the dimensionality of the i-th feature, D_i, as:

D_i = ||W_i||² / Σ_j (Ŵ_i · W_j)²

where Ŵ_i = W_i / ||W_i|| is the unit vector in the direction of feature i.
Intuitively, the numerator represents the extent to which a given feature is represented, while the denominator measures "how many features share the dimension it is embedded in", by projecting each feature onto its direction. In the antipodal case, each feature participating in an antipodal pair will have a dimensionality of ½ = 1 / (1 + 1).
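A sketch of this computation (our own helper name), with the antipodal case as a sanity check:

```python
import numpy as np

def feature_dimensionality(W):
    """D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2, columns of W as features."""
    norms = np.linalg.norm(W, axis=0)
    W_hat = W / np.maximum(norms, 1e-9)
    proj = W_hat.T @ W                # entry (i, j) = W_i_hat . W_j
    return norms ** 2 / (proj ** 2).sum(axis=1)

# Antipodal pair in 1 dimension: each feature should get dimensionality 1/2.
W = np.array([[1.0, -1.0]])
print(feature_dimensionality(W))      # -> [0.5, 0.5]
```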
We can now break the above plot down on a per-feature basis. This reveals many more of these "sticky points"! To help us understand this better, we're going to create a scatter plot annotated with some additional information:
Let's look at the resulting plot, and then we'll try to figure out what it's showing us:
What is going on with the points clustering at specific fractions?? We'll see shortly that the model likes to create specific weight geometries and kind of jumps between the different configurations.
In the previous section, we developed a theory of superposition as a phase change. But everything on this plot between 0 (not learning a feature) and 1 (dedicating a dimension to a feature) is superposition. Superposition is what happens when features have fractional dimensionality. That is to say – superposition isn't just one thing!
How can we relate this to our original understanding of the phase change? We often think of water as only having three phases: ice, water and steam. But this is a simplification: there are actually many phases of ice, often corresponding to different crystal structures (eg. hexagonal vs cubic ice). In a vaguely similar way, neural network features seem to also have many other phases within the general category of "superposition."
In the previous diagram, we found that there are distinct lines corresponding to dimensionality of: ¾ (tetrahedron), ⅔ (triangle), ½ (antipodal pair), ⅖ (pentagon), ⅜ (square antiprism), and 0 (feature not learned). We believe there would also be a 1 (dedicated dimension for a feature) line if not for the fact that basis features are indistinguishable from other directions in the dense regime.
Several of these configurations may jump out as solutions to the famous Thomson problem. (In particular, square antiprisms are much less famous than cubes and are primarily of note for their role in molecular geometry due to being a Thomson problem solution.) As we saw earlier, there is a very real sense in which our model can be understood as solving a generalized version of the Thomson problem. When our model chooses to represent a feature, the feature is embedded as a point on an m-dimensional hypersphere, and the interference terms in the loss push the embedded points apart, much as the repulsion between electrons does in the Thomson problem.
A second clue as to what's going on is that there are lines for the Thomson solutions which are uniform polyhedra (e.g. tetrahedron), but there seem to be split lines where we'd expect to see non-uniform solutions (e.g. instead of a ⅗ line for triangular bipyramids, we see a co-occurrence of points at ⅔ for triangles and points at ½ for antipodal pairs). In a uniform polyhedron, all vertices have the same geometry, and so if we embed features as them each feature has the same dimensionality. But if we embed features as a non-uniform polyhedron, different features will have more or less interference with others.
In particular, many of the Thomson solutions can be understood as tegum products (an operation which constructs polytopes by embedding two polytopes in orthogonal subspaces) of smaller uniform polytopes. (In the earlier graph visualizations of feature geometry, two subgraphs are disconnected if and only if they are in different tegum factors.) As a result, we should expect their dimensionality to actually correspond to the underlying factor uniform polytopes.
This also suggests a possible reason why we observe 3D Thomson problem solutions, despite the fact that we're actually studying a higher dimensional version of the problem. Just as many 3D Thomson solutions are tegum products of 2D and 1D solutions, perhaps higher dimensional solutions are often tegum products of 1D, 2D, and 3D solutions.
The orthogonality of factors in tegum products has interesting implications. For the purposes of superposition, it means that there can't be any "interference" across tegum-factors. This may be preferred by the toy model: having many features interfere simultaneously could be really bad for it. (See related discussion in our earlier mathematical analysis.)
At this point, it's worth making explicit that there's a correspondence between polytopes and symmetric, positive semi-definite, low-rank matrices (i.e. matrices of the form W^T W).
In some ways, the correspondence is trivial. If one has a rank-m matrix of the form W^T W, the columns of W are n points in an m-dimensional space, and these points (defined up to rotation) can be taken as the vertices of a polytope.
Put another way, there's an exact correspondence between polytopes and strategies for superposition. For example, every strategy for putting three features in superposition in a 2-dimensional space corresponds to a triangle, and every triangle corresponds to such a strategy. From this perspective, it doesn't seem surprising that if we have three equally important and equally sparse features, the optimal strategy is an equilateral triangle.
This correspondence also goes the other direction. Suppose we have a rank-m symmetric positive semi-definite matrix. We can always factor it as W^T W, recovering a set of feature embedding vectors that is unique up to rotation.
In fact, given such a set of vectors, we can construct a polytope by taking the points they define as vertices.
So far, this section has focused on the geometry of uniform superposition, where all features are of equal importance, equal sparsity, and independent. The model is essentially solving a variant of the Thomson problem. Because all features are the same, solutions corresponding to uniform polyhedra get especially low loss. In this subsection, we'll study non-uniform superposition, where features are somehow not uniform. They may vary in importance and sparsity, or have a correlational structure that makes them not independent. This distorts the uniform geometry we saw earlier.
In practice, it seems like superposition in real neural networks will be non-uniform, so developing an understanding of it seems important. Unfortunately, we're far from a comprehensive theory of the geometry of non-uniform superposition at this point. As a result, the goal of this section will merely be to highlight some of the more striking phenomena we observe:
We attempt to illustrate these phenomena with some representative experiments below.
The simplest kind of non-uniform superposition is to vary one feature and leave the others uniform. As an experiment, let's represent five equally important features in two dimensions and vary the sparsity of just one of them.
If we make it sufficiently sparse, there's a phase change, and it collapses from a pentagon to a pair of digons with the sparser point at zero. The phase change corresponds to loss curves corresponding to the two different geometries crossing over. (This observation allows us to directly confirm that it is genuinely a first order phase change.)
To visualize the solutions, we canonicalize them, rotating them to align with each other in a consistent manner.
These results seem to suggest that, at least in some cases, non-uniform superposition can be understood as a deformation of uniform superposition and jumping between uniform superposition configurations rather than a totally different regime. Since uniform superposition has a lot of understandable structure, but real world superposition is almost certainly non-uniform, this seems very promising!
The reason pentagonal solutions are not on the unit circle is that models reduce the effect of positive interference, setting a slight negative bias to cut off noise and setting their weight norms slightly above 1 to compensate.
A note for reimplementations: optimizing with a two-dimensional hidden space makes this easier to study, but the actual optimization problem turns out to be really challenging for gradient descent – much harder than even just having three dimensions. Getting clean results required fitting each model multiple times and taking the solution with the lowest loss. However, there's a silver lining to this: visualizing the sub-optimal solutions on a scatter plot as above allows us to see the loss curves for different geometries and gain greater insight into the phase change.
A more complicated form of non-uniform superposition occurs when there are correlations between features. This seems essential for understanding superposition in the real world, where many features are correlated or anti-correlated.
For example, one very pragmatic question to ask is whether we should expect polysemantic neurons to group the same features together across models. If the groupings were random, you could use this to detect polysemantic neurons, by comparing across models! However, we'll see that correlational structure strongly influences which features are grouped together in superposition.
The behavior seems to be quite nuanced, with a kind of "order of preferences" for how correlated features behave in superposition. The model ideally represents correlated features orthogonally, in separate tegum factors with no interactions between them. When that fails, it prefers to arrange them so that they're as close together as possible – it prefers positive interference between correlated features over negative interference. Finally, when there isn't enough space to represent all the correlated features, it will collapse them and represent their principal component instead! Conversely, when features are anti-correlated, models prefer to have them interfere, especially with negative interference. We'll demonstrate this with a few experiments below.
Throughout this section we'll refer to "correlated feature sets" and "anticorrelated feature sets".
Correlated Feature Sets. Our correlated feature sets can be thought of as "bundles" of co-occurring features. One can imagine a highly idealized version of what might happen in an image classifier: there could be a bundle of features used to identify animals (fur, ears, eyes) and another bundle used to identify buildings (corners, windows, doors). Features from one of these bundles are likely to appear together. Mathematically, we represent this by linking the choice of whether all the features in a correlated feature set are zero or not. Recall that we originally defined our synthetic distribution to have each feature be zero with probability S; for a correlated feature set, we instead make this choice once for the entire set, so its features are either all zero or all active together.
Anticorrelated Feature Sets. One could also imagine anticorrelated features which are extremely unlikely to occur together. To simulate these, we'll have anticorrelated feature sets where only one feature in the set can be active at a time. Concretely, the feature set is entirely zero with probability S; otherwise, exactly one feature in the set, chosen uniformly at random, takes a uniform [0, 1] value.
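A sketch of both sampling schemes (function names are our own; we assume the whole-set activation probability mirrors the per-feature sparsity S, which is our reading rather than a detail confirmed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_correlated_set(batch, set_size, sparsity):
    """All features in the set are zero together, else each is Uniform[0, 1]."""
    active = rng.uniform(size=(batch, 1)) >= sparsity    # one decision per set
    vals = rng.uniform(size=(batch, set_size))
    return np.where(active, vals, 0.0)

def sample_anticorrelated_set(batch, set_size, sparsity):
    """Set is entirely zero with probability `sparsity`; otherwise exactly
    one randomly chosen feature in the set is active."""
    out = np.zeros((batch, set_size))
    active = rng.uniform(size=batch) >= sparsity
    which = rng.integers(set_size, size=batch)
    out[np.arange(batch), which] = rng.uniform(size=batch)
    return out * active[:, None]
```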
For our initial investigation, we simply train a number of small toy models with correlated and anti-correlated features and observe what happens. To make this easy to study, we limit ourselves to models with a two-dimensional hidden space, so the feature embeddings can be plotted directly.
It turns out that the tendency of models to arrange correlated features to be orthogonal is actually quite a strong phenomenon. In particular, for larger models, it seems to generate a kind of "local almost-orthogonal basis" where, even though the model as a whole is in superposition, the correlated feature sets considered in isolation are (nearly) orthogonal and can be understood as having very little superposition.
To investigate this, we train a larger model with two sets of correlated features and visualize the resulting W^T W matrix.
If this result holds in real neural networks, it suggests we might be able to make a kind of "local non-superposition" assumption, where for certain sub-distributions we can assume that the activating features are not in superposition. This could be a powerful result, allowing us to confidently use methods such as PCA which might not be principled to generally use in the context of superposition.
One of the most interesting properties is that there seems to be a trade-off between Principal Components Analysis (PCA) and superposition. If there are two correlated features, the model can either represent both in superposition, or collapse them and dedicate a dimension to their shared principal component.
As an experiment, we consider six features, organized into three sets of correlated pairs. Features in each correlated pair are represented by a given color (red, green, and blue). The correlation is created by having both features always activate together – they're either both zero or neither zero. (The exact non-zero values they take when they activate is uncorrelated.)
As we vary the sparsity of the features, we find that in the very sparse regime, we observe superposition as expected, with features arranged in a hexagon and correlated features side-by-side. As we decrease sparsity, the features progressively "collapse" into their principal components. In very dense regimes, the solution becomes equivalent to PCA.
These results seem to hint that PCA and superposition are in some sense complementary strategies which trade off with one another. As features become more correlated, PCA becomes a better strategy. As features become sparser, superposition becomes a better strategy. When features are both sparse and correlated, mixtures of each strategy seem to occur. It would be nice to more deeply understand this space of tradeoffs.
It's also interesting to think about this in the context of continuous equivariant features, such as features which occur in different rotations.
The focus of this paper is how superposition contributes to the functioning of fully trained neural networks, but as a brief detour it's interesting to ask how our toy models – and the resulting superposition – evolve over the course of training.
There are several reasons why these models seem like a particularly interesting case for studying learning dynamics. Firstly, unlike most neural networks, the fully trained models converge to a simple but non-trivial structure that rhymes with an emerging thread of evidence that neural network learning dynamics might have geometric weight structure that we can understand. One might hope that understanding the final structure would make it easier for us to understand the evolution over training. Secondly, superposition hints at surprisingly discrete structure (regular polytopes of all things!). We'll find that the underlying learning dynamics are also surprisingly discrete, continuing an emerging trend of evidence that neural network learning might be less continuous than it seems. Finally, since superposition has significant implications for interpretability, it would be nice to understand how it emerges over training – should we expect models to use superposition early on, or is it something that only emerges later in training, as models struggle to fit more features in?
Unfortunately, we aren't able to give these questions the detailed investigation they deserve within the scope of this paper. Instead, we'll limit ourselves to a couple particularly striking phenomena we've noticed, leaving more detailed investigation for future work.
Perhaps the most striking phenomenon we've noticed is that the learning dynamics of toy models with large numbers of features appear to be dominated by "energy level jumps" where features jump between different feature dimensionalities. (Recall that a feature's dimensionality is the fraction of a dimension dedicated to representing a feature.)
Let's consider the problem setup we studied when investigating the geometry of uniform superposition in the previous section, where we have a large number of features of equal importance and sparsity. As we saw previously, the features ultimately arrange themselves into a small number of polytopes with fractional dimensionalities.
A natural question to ask is what happens to these feature dimensionalities over the course of training. Let's pick one model where all the features converge into digons and observe. In the first plot, each colored line corresponds to the dimensionality of a single feature. The second plot shows how the loss curve changes over the same duration.
Note how the dimensionality of some features "jump" between different values and swap places. As this happens, the loss curve also undergoes a sudden drop (a very small one at the first jump, and a larger one at the second jump).
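To see these jumps directly, one could log each feature's dimensionality throughout training. The sketch below is our own (reusing the hypothetical ToyModel class from the earlier sketch), not the paper's training code:

```python
import torch

def train_and_track(model, importance, sparsity, steps=5000, batch=1024):
    """Train the ReLU output toy model, recording each feature's
    dimensionality over training to look for "energy level" jumps."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    history = []
    for step in range(steps):
        x = torch.rand(batch, model.W.shape[1])
        x = x * (torch.rand_like(x) >= sparsity)         # sparsify features
        loss = (importance * (x - model(x)) ** 2).sum(-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 50 == 0:
            with torch.no_grad():
                W = model.W                               # (m, n), columns = features
                norms = W.norm(dim=0).clamp_min(1e-9)
                proj = (W / norms).T @ W
                D = norms ** 2 / (proj ** 2).sum(dim=1)
                history.append(D.numpy().copy())
    return history   # plot one line per feature to see the discrete jumps
```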
These results make us suspect that seemingly smooth decreases of the loss curve in larger models are in fact composed of many small jumps of features between different configurations. (For similar results of sudden mechanistic changes, see Olsson et al.'s induction head phase change.)
Many of our toy model solutions can be understood as corresponding to geometric structures. This is especially easy to see and study when there are only two or three hidden dimensions, where the feature embeddings can be visualized directly.
It turns out that, at least in some cases, the learning dynamics leading to these structures can be understood as a sequence of simple, independent geometric transformations!
One particularly interesting example of this phenomenon occurs in the context of correlated features, as studied in the previous section. Consider the problem of representing two sets of correlated features in a three-dimensional hidden space.
(Although the last solution – an octahedron with features from different correlated sets arranged in antipodal pairs – seems to be a strong attractor, the learning trajectory visualized above appears to be one of a few different learning trajectories that attract the model. The different trajectories vary at step C: sometimes the model gets pulled directly into the antiprism configuration from the start, or organizes features into antipodal pairs. Presumably this depends on which feature geometry the model is closest to when step B ends.)
The learning dynamics we observe here seem directly related to previous findings on simple models.
Although we're most interested in the implications of superposition for interpretability, there appears to be a connection to adversarial examples. If one gives it a little thought, this connection can actually be quite intuitive.
In a model without superposition, the end-to-end weights for the first feature are:

(W^T W)_0 = (1, 0, 0, …, 0)

But in a model with superposition, it's something like:

(W^T W)_0 = (1, ε, −ε, ε, …)

The small ε interference terms give an adversary leverage: by slightly perturbing many other features at once, they can produce a large change in the recovered value of the first feature.
To test this, we generated L2 adversarial examples (allowing a max L2 attack norm of 0.1 of the average input norm). We originally generated attacks with gradient descent, but found that for extremely sparse examples where ReLU neurons are in the zero regime 99% of the time, attacks were difficult, effectively due to gradient masking.
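As an illustration of what an analytic attack on this toy model could look like (our own reconstruction, not necessarily the exact attack used in the experiments; the function name and the first-order, pre-ReLU approximation are our choices):

```python
import numpy as np

def analytic_l2_attack(W, b, x, feature, eps):
    """Perturb x to maximally change the recovered value of `feature`
    within an L2 budget `eps`, using the row of end-to-end weights."""
    v = (W.T @ W)[feature]                 # pre-ReLU sensitivity of the feature
    delta = eps * v / np.linalg.norm(v)    # flip sign to decrease instead
    relu = lambda z: np.maximum(z, 0.0)
    clean = relu(W.T @ W @ x + b)[feature]
    adv = relu(W.T @ W @ (x + delta) + b)[feature]
    return x + delta, clean, adv
```

In a model without superposition that row is a one-hot vector, so the budget can only be spent on the feature itself; with superposition, the ε entries let the attack recruit every interfering feature at once.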
We find that vulnerability to adversarial examples sharply increases as superposition forms (increasing by >3x), and that the level of vulnerability closely tracks the number of features per dimension (the reciprocal of feature dimensionality).
We're hesitant to speculate about the extent to which superposition is responsible for adversarial examples in practice. There are compelling theories for why adversarial examples occur without reference to superposition (e.g. the hypothesis that they arise from non-robust but predictive features of the data).
In addition to observing that superposition can cause models to be vulnerable to adversarial examples, we briefly experimented with adversarial training to see if the relationship could be used in the other direction to reduce superposition. To keep training reasonably efficient, we used the analytic optimal attack against a random feature. We found that this did reduce superposition, but attacks had to be made unreasonably large (80% input L2 norm) to fully eliminate it, which didn't seem satisfying. Perhaps stronger adversarial attacks would work better. We didn't explore this further since the increased cost and complexity of adversarial training made us want to prioritize other lines of attack on superposition first.
So far, we've explored superposition in a model without a privileged basis. We can rotate the hidden activations arbitrarily and, as long as we rotate all the weights, have the exact same model behavior. That is, for any ReLU output model with weights W, the model with weights OW (for any orthogonal matrix O) behaves identically, since (OW)^T (OW) = W^T W.
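This invariance is easy to verify numerically; a minimal check (our own):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 8
W = rng.normal(size=(m, n))
O, _ = np.linalg.qr(rng.normal(size=(m, m)))   # random orthogonal matrix

# (OW)^T (OW) = W^T O^T O W = W^T W, so model behavior is unchanged.
print(np.allclose((O @ W).T @ (O @ W), W.T @ W))   # True
```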
Models without a privileged basis are elegant, and can be an interesting analogue for certain neural network representations which don't have a privileged basis – word embeddings, or the transformer residual stream. But we'd also (and perhaps primarily) like to understand neural network representations where there are neurons which do impose a privileged basis, such as transformer MLP layers or conv net neurons.
Our goal in this section is to explore the simplest toy model which gives us a privileged basis. There are at least two ways we could do this: we could add an activation function or apply L1 regularization to the hidden layer. We'll focus on adding an activation function, since the representation we are most interested in understanding is hidden layers with neurons, such as the transformer MLP layer.
This gives us the following "ReLU hidden layer" model:

h = ReLU(Wx)
x' = ReLU(W^T h + b)
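A minimal PyTorch sketch of this model, under the same assumptions as our earlier ToyModel sketch (initialization scale and class name are our own choices):

```python
import torch

class ReluHiddenModel(torch.nn.Module):
    """h = ReLU(W x); x' = ReLU(W^T h + b). The hidden ReLU gives the
    m neuron directions a privileged basis."""
    def __init__(self, n_features, m_hidden):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(m_hidden, n_features) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = torch.relu(x @ self.W.T)            # (batch, m) neuron activations
        return torch.relu(h @ self.W + self.b)  # (batch, n) reconstruction
```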
We'll train this model on the same data as before.
Adding a ReLU to the hidden layer radically changes the model from an interpretability perspective. The key thing is that while the hidden activations of our previous model could be freely rotated without changing behavior, the hidden-layer ReLU breaks that rotational symmetry, making the individual neuron directions meaningful.
We'll discuss this in much more detail shortly, but here's a comparison of weights resulting from a linear hidden layer model and a ReLU hidden layer model:
Recall that we think of basis elements in the input as "features," and basis elements in the middle layer as "neurons". Thus W describes which features map to which neurons.
What we see in the above plot is that the features are aligning with neurons in a structured way! Many of the neurons are simply dedicated to representing a feature! (This is the critical property that justifies why neuron-focused interpretability approaches – such as much of the work in the original Circuits thread – can be effective in some circumstances.)
Let's explore this in more detail.
Having a privileged basis opens up new possibilities for visualizing our models. As we saw above, we can simply inspect W directly: its rows correspond to neurons and its columns to features.
This stack plot visualization can be nice as models get bigger. It also makes polysemantic neurons obvious: they are simply the neurons with more than one significant weight.
We'll now visualize ReLU hidden layer toy models with a small number of features and neurons, trained across a range of sparsities.
However, we found that these small models were harder to optimize. For each model shown, we trained 1000 models and visualized the one with the lowest loss. Although the typical solutions are often similar to the minimal-loss solutions shown, selecting the minimal-loss solutions reveals even more structure in how features align with neurons. It also reveals that there are ranges of sparsity values where the optimal solutions for all models trained on data with that sparsity share the same weight configuration.
The solutions are visualized below, both as the raw weights $W$ and as per-neuron stack plots.
The most important thing to pay attention to is how there's a shift from monosemantic to polysemantic neurons as sparsity increases. Monosemantic neurons do exist in some regimes! Polysemantic neurons exist in others. And they can both exist in the same model! Moreover, while it's not quite clear how to formalize this, it looks a great deal like there's a neuron-level phase change, mirroring the feature phase changes we saw earlier.
It's also interesting to examine the structure of the polysemantic solutions, which turn out to be surprisingly structured and neuron-aligned. Features typically correspond to sets of neurons (monosemantic neurons might be seen as the special case where features correspond to singleton sets). There's also structure in the degree of polysemanticity: neurons transition from monosemantic, to representing only a few features, to gradually representing more. However, it's unclear how much of this generalizes to real models.
Unfortunately, the toy model described in this section has a significant weakness, which limits the regimes in which it shows interesting results. The issue is that the model doesn't benefit from the ReLU hidden layer – it has no role except limiting how the model can encode information. If given any chance, the model will circumvent it. For example, given a hidden layer bias, the model will set all the biases to be positive, shifting the neurons into a positive regime where they behave linearly. If one removes the bias, but gives the model enough features, it will simulate a bias by averaging over many features. The model will only use the ReLU activation function if absolutely forced, which is a significant mark against studying this toy model.
We'll introduce a model without this issue in the next section, but wanted to study this model as a simpler case study.
So far, we've shown that neural networks can store sparse features in superposition and then recover them. But we actually believe superposition is more powerful than this – we think that neural networks can perform computation entirely in superposition rather than just using it as storage. This model will also give us a more principled way to study a privileged basis where features align with basis dimensions.
To explore this, we consider a new setup where we imagine our input and output layer to be the layers of our hypothetical disentangled model, but have our hidden layer be a smaller layer we're imagining to be the observed model which might use superposition. We'll then try to compute a simple non-linear function and explore whether it can use superposition to do this. Since the model will have (and need to use) the hidden layer non-linearity, we'll also see features align with a privileged basis.
Specifically, we'll have the model compute the absolute value of its inputs, $y = |x|$.
Since this model needs ReLU to compute absolute value, it doesn't have the issues the model in the previous section had with trying to avoid the activation function.
The input feature vector, $x$, is sparse as before, but each active feature is now sampled uniformly from $[-1, 1]$, so that computing the absolute value is a genuinely non-linear task.
Following the previous section, we'll consider the "ReLU hidden layer" toy model variant, but no longer tie the two weights to be identical: $h = \mathrm{ReLU}(W_1 x)$, $y' = \mathrm{ReLU}(W_2 h + b)$.
The loss is still the mean squared error weighted by feature importances, $L = \sum_i I_i\,(y'_i - |x_i|)^2$, averaged over the input distribution.
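For concreteness, a minimal PyTorch sketch of this setup, including the importance-weighted loss; the feature count, hidden size, sparsity, and importance schedule are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AbsValueModel(nn.Module):
    """h = ReLU(W1 x);  y' = ReLU(W2 h + b), trained so that y' ≈ |x|."""
    def __init__(self, n_features=8, n_hidden=16):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.W2 = nn.Parameter(torch.randn(n_features, n_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = torch.relu(x @ self.W1.T)
        return torch.relu(h @ self.W2.T + self.b)

n_features, sparsity = 8, 0.9
importance = 0.8 ** torch.arange(n_features)   # decaying feature importances

model = AbsValueModel(n_features)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(10_000):
    x = torch.rand(1024, n_features) * 2 - 1                       # uniform in [-1, 1]
    x = x * (torch.rand(1024, n_features) < 1 - sparsity).float()  # sparsify
    loss = (importance * (model(x) - x.abs()) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```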
With this model, it's a bit less straightforward to study how individual features get embedded; because of the ReLU on the hidden layer, we can't just study a single product of the weight matrices the way we studied $W^\top W$ before.
As we saw in the previous section, having a hidden layer activation function means that it makes sense to visualize the weights in terms of neurons. We can visualize each neuron's incoming weights (a row of $W_1$) and outgoing weights (a column of $W_2$) side by side.
Let's look at what happens when we train a model with enough hidden neurons to compute absolute value without superposition – two per feature, since $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$ – on relatively dense data.
The resulting model – modulo a subtle issue about rescaling input and output weights – implements absolute value in exactly this way, dedicating a pair of neurons to each feature: one responds to its positive values and one to its negative values.
15. Because ReLU is positively homogeneous, the model has some freedom during learning: a neuron's input weights can be scaled down by any positive factor so long as its output weights are scaled up correspondingly, leaving the computation unchanged.
We've seen that – as expected – our toy model can learn to implement absolute value. But can it use superposition to compute absolute value for more features? To test this, we train models with more features than hidden neurons, across a range of sparsities.
A couple of notes on visualization: Since we're primarily interested in understanding superposition and polysemantic neurons, we'll show a stacked weight plot of the absolute values of weights. The features are colored by whether they are in superposition. To make the diagrams easier to read, neurons are faintly colored based on how polysemantic they are (as judged by eye from the plots). Neurons are ordered by the importance of their largest feature.
Much like we saw in the ReLU hidden layer models, these results demonstrate that activation functions, under the right circumstances, create a privileged basis and cause features to align with basis dimensions. In the dense regime, we end up with each neuron representing a single feature, and we can read feature values directly off of neuron activations.
However, once the features become sufficiently sparse, this model, too, uses superposition to represent more features than it has neurons. This result is notable because it demonstrates the ability of neural networks to perform computation even on data that is represented in superposition.
17. You might ask whether we can quantify how much extra computational power superposition provides by inspecting the loss. Unfortunately, this isn't easy: superposition appears as the task becomes sparser, so the losses of models with different amounts of superposition aren't directly comparable – they measure loss on different tasks! Keep in mind that the model needs to use the hidden-layer ReLU to compute absolute value; gradient descent finds solutions that approximate this computation effectively even when each neuron encodes a mixture of several features.
Focusing on the intermediate sparsity regimes, we find several additional qualitative behaviors that we find fascinatingly reminiscent of behavior that has been observed in real, full-scale neural networks:
To begin, we find that in some regimes, many of the model's neurons will encode pure features, but a subset of them will be highly polysemantic. This is similar to the phase change we saw earlier in the ReLU output model. However, in that case, the phase change was with respect to features, with more important features not being put in superposition. In this experiment, the neurons don't have any intrinsic importance, but we see that the neurons representing the most important features (on the left) tend to be monosemantic.
We find this to bear a suggestive resemblance to some previous work in vision models, which found some layers that contained "mostly pure" feature neurons, but with some neurons representing additional features on a different scale.
We also note that many neurons appear to be associated with a single "primary" feature – encoded by a relatively large weight – coupled with one or more "secondary" features encoded by smaller-magnitude weights on the same neuron. If we were to observe the activations of such a neuron over a range of input examples, we would find that the largest activations of that neuron were all or nearly all associated with the presence of the "primary" feature, but that the lower-magnitude activations were much more polysemantic.
Intriguingly, that description closely matches what researchers have found in previous work on language models.
If neural networks can perform computation in superposition, a natural question is to ask how exactly they're doing so. What does that look like mechanically, in terms of the weights? In this subsection, we'll (mostly) work through one such model and see an interesting motif of asymmetric superposition. (We use the term "motif" in the sense of the original Circuits thread, inspired by its use in systems biology.)
The model we're trying to understand is shown below on the left, visualized as a neuron weight stack plot, with features corresponding to colors. The model is only doing a limited amount of superposition, and many of the weights can be understood as simply implementing absolute value in the expected way.
However, there are a few neurons doing something else…
These other neurons implement two instances of asymmetric superposition and inhibition. Each instance consists of two neurons:
One neuron does asymmetric superposition. In normal superposition, one might store two features with equal weights (e.g. both at magnitude $1/\sqrt{2}$). Here, the neuron instead stores one feature with a large weight and the other with a small weight (with the corresponding output weights scaled inversely), so that one feature is represented faithfully while the other is exposed to one-sided interference.
To avoid the consequences of that interference, the model has another neuron heavily inhibit the feature in the case where there would have been positive interference. This essentially converts positive interference (which could greatly increase the loss) into negative interference (which has limited consequences due to the output ReLU).
There are a few other weights this doesn't explain. (We believe they're effectively small conditional biases.) But this asymmetric superposition and inhibition pattern appears to be the primary story.
Although superposition is scientifically interesting, much of our interest comes from a pragmatic motivation: we believe that superposition is deeply connected to the challenge of using interpretability to make claims about the safety of AI systems. In particular, it is a clear challenge to what we see as the most promising path toward saying that neural networks won't perform certain harmful behaviors, or toward catching "unknown unknowns" safety problems. This is because superposition is deeply linked to the ability to identify and enumerate over all features in a model, and the ability to enumerate over all features would be a powerful primitive for making claims about model behavior.
We begin this section by describing how "solving superposition" in a certain sense is equivalent to many strong interpretability properties which might be useful for safety. Next, we'll describe three high level strategies one might take to "solving superposition." Finally, we'll describe a few other additional strategic considerations.
We'd like a way to have confidence that models will never do certain behaviors such as "deliberately deceive" or "manipulate." Today, it's unclear how one might show this, but we believe a promising tool would be the ability to identify and enumerate over all features. The ability to have a universal quantifier over the fundamental units of neural network computation is a significant step towards saying that certain types of circuits don't exist.
18. Ultimately, we want to assert that models won't perform certain classes of behaviors. Enumerating all features makes it easy to assert that a given feature doesn't exist (e.g. "there is no 'deceptive behavior' feature"), but that isn't quite what we want: we expect models that represent the real world to also represent undesirable behaviors. However, it may be possible to construct subtler claims, such as "no 'deceptive behavior' feature participates in circuits X, Y, and Z." Feature enumeration also seems like a powerful tool against "unknown unknowns", since in some sense it exhaustively covers the network's behavior.
How does this relate to superposition? It turns out that the ability to enumerate over features is deeply intertwined with superposition. One way to see this is to imagine a neural network with a privileged basis and without superposition (like the monosemantic neurons found in early InceptionV1). In such a model, enumerating over features would reduce to simply enumerating over neurons.
For this reason, we'll call any method that gives us the ability to enumerate over features – and equivalently, unfold activations – a "solution to superposition". Any solution is on the table, from creating models that just don't have superposition, to identifying what directions correspond to features after the fact. We'll discuss the space of possibilities shortly.
We've motivated "solving superposition" in terms of feature enumeration, but it's worth noting that it's equivalent to (or necessary for) many other interpretability properties one might care about:
At a very high level, there seem to be three potential approaches to resolving superposition: (1) creating models that simply don't have superposition; (2) finding an overcomplete basis that describes how features are represented in models that do have superposition; and (3) hybrid approaches which change models, not to eliminate superposition, but to make it easier to find such an overcomplete basis after the fact.
Our sense is that all of these approaches are possible if one doesn't care about having a competitive model. For example, we believe it's possible to accomplish any of these for the toy models described in this paper. However, as one starts to consider serious neural networks, let alone modern large language models, all of these approaches begin to look very difficult. We'll outline the challenges we see for each approach in the following sections.
With that said, it's worth highlighting one bright spot before we focus on the challenges. You might have believed that superposition was something you could never fully get rid of, but that doesn't seem to be the case. All our results seem to suggest that superposition and polysemanticity are phases with sharp transitions. That is, there may exist a regime for every model where it has no superposition or polysemanticity. The question is largely whether the cost of getting rid of or otherwise resolving superposition is too high.
It's actually quite easy to get rid of superposition in the toy models described in this paper, albeit at the cost of a higher loss. Simply apply an L1 regularization term to the hidden layer activations (i.e. add $\lambda \sum_i |h_i|$ to the loss).
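Concretely, this is a one-term change to the training objective. A sketch, continuing the tied-weight model above (the penalty coefficient is an illustrative assumption):

```python
import torch

# Continuing the tied-weight ReLU hidden layer sketch above: add an L1
# penalty on the hidden activations h to push the model out of superposition.
l1_coeff = 1e-3   # penalty strength (illustrative; too large destroys performance)

def loss_fn(model, x):
    h = torch.relu(x @ model.W.T)
    x_hat = torch.relu(h @ model.W + model.b)
    return ((x_hat - x) ** 2).mean() + l1_coeff * h.abs().mean()
```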
However, it seems likely that models are significantly benefitting from superposition. Roughly, the sparser features are, the more features can be squeezed in per neuron. And many features in language models seem very sparse! For example, language models know about individuals with only modest public presences, such as several of the authors of this paper. Presumably we only occur with frequency significantly less than one in a million tokens. As a result, it may be the case that superposition effectively makes models much bigger.
All of this paints a picture where getting rid of superposition may be fairly achievable, but doing so will have a large performance cost. For a model with a fixed number of neurons, superposition helps – potentially a lot.
But this is only true if the constraint is thought of in terms of neurons. That is, superposition lets a model with a fixed number of neurons behave like a much larger, sparser model; but since the neurons of that imagined larger model would activate only rarely, the comparison looks quite different if one instead holds the number of flops fixed.
One family of models which change the flop-neuron relationship are Mixture of Experts (MoE) models (see the reviews of this literature). MoE models run only a subset of "expert" blocks on each input, so they can have many more neurons for the same number of flops per token.
It's unclear how far this can be pushed, especially given difficult engineering constraints. But there's an obvious lower bound, which is likely too optimistic but is interesting to think about: what if models only expended flops on neuron activations, and recovered the compute of all non-activating neurons? In this world, it seems unlikely that superposition would be optimal: you could always split a polysemantic neuron into dedicated neurons for each feature with the same cost, except for the cases where there would have been interference that hurt the model anyways. Our preliminary investigations comparing various types of superposition in terms of "loss reduction per activation frequency" seem to suggest that superposition is not optimal on these terms, although it asymptotically becomes as good as dedicated feature dimensions. Another way to think of this is that superposition exploits a gap between the sparsity of neurons and the sparsity of the underlying features; MoE eats that same gap, and so we should expect MoE models to have less superposition.
To be clear, MoE models are already well studied, and we don't think this changes the capabilities case for them. (If anything, superposition offers a theory for why MoE models have not proven more effective for capabilities when the case for them seems so initially compelling!) But if one's goal is to create competitive models that don't have superposition, MoE models become interesting to think about. We don't necessarily think that they specifically are the right path forward – our goal here has been to use them as an example of why we think it remains plausible there may be ways to build competitive superposition-free models.
The opposite strategy of creating a superposition-free model is to take a regular model, which has superposition, and find an overcomplete basis describing how features are embedded after the fact. This appears to be a relatively standard sparse coding (or dictionary learning) problem, where we want to take the activations of neural network layers and find out which directions correspond to features. More formally, given a matrix of hidden-layer activations sampled over many stimuli, if we hypothesize some number of underlying features, we can try to find a dictionary of candidate feature directions together with a sparse matrix of coefficients whose product reconstructs the activations. This approach has been explored in prior work.
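As a rough sketch of what this could look like in practice, here is standard dictionary learning applied to synthetic stand-in "activations" using scikit-learn; all sizes, and the choice of `DictionaryLearning` specifically, are illustrative assumptions rather than a recommended recipe:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
n_samples, n_neurons, n_true = 2000, 32, 64

# Synthetic stand-in for a layer's activations: sparse ground-truth features
# embedded as random directions in neuron space.
directions = rng.normal(size=(n_true, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
coeffs = rng.random((n_samples, n_true)) * (rng.random((n_samples, n_true)) < 0.05)
activations = coeffs @ directions

# Look for an overcomplete basis: more dictionary atoms than neurons.
dl = DictionaryLearning(n_components=n_true, alpha=0.1,
                        transform_algorithm="lasso_lars", random_state=0)
codes = dl.fit_transform(activations)   # sparse coefficients per sample
atoms = dl.components_                  # candidate feature directions
```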
The advantage of this is that we don't need to worry about whether we're damaging model performance. On the other hand, many other things are harder:
In addition to approaches which address superposition purely at training time, or purely after the fact, it may be possible to take "hybrid approaches" which do a mixture. For example, even if one can't train models that are entirely free of superposition, it may be possible to produce models with less superposition, which are then easier to decode.
21. In particular, it seems reasonable to expect that measures like L1 regularization, applied without architectural changes, could at least slightly reduce superposition without hurting performance. Note that there should be some level of superposition at which the derivative of the loss with respect to the amount of superposition is zero – otherwise the model would prefer to use more or less superposition. So, at least over some range, we should be able to reduce superposition without affecting model performance. Alternatively, architectural changes might make finding an overcomplete basis in large models easier or more computationally tractable, separately from trying to reduce superposition.
Phase Changes as Cause For Hope. Is totally getting rid of superposition a realistic hope? One could easily imagine a world where it can only be asymptotically reduced, and never fully eliminated. While the results in this paper seem to suggest that superposition is hard to get rid of because it's actually very useful, the upshot of it corresponding to a phase change is that there's a regime where it totally doesn't exist. If we can find a way to push models into the non-superposition regime, it seems likely it can be totally eliminated.
Any superposition-free model would be a powerful tool for research. We believe that most of the research risk is in whether one can make performant superposition free models, rather than whether it's possible to make superposition free models at all. Of course, ultimately, we need to make performant models. But a non-performant superposition free model could still be a very useful research tool for studying superposition in normal models. At present, it's challenging to study superposition in models because we have no ground truth for what the features are. (This is also the reason why the toy models described in this paper can be studied – we do know what the features are!) If we had a superposition-free model, we may be able to use it as a ground truth to study superposition in regular models.
Local bases are not enough. Earlier, when we considered the geometry of non-uniform superposition, we observed that models often form local orthogonal bases, where co-occurring features are orthogonal. This suggests a strategy for locally understanding models on sufficiently narrow sub-distributions. However, if our goal is to eventually make useful statements about the safety of models, we need mechanistic accounts that hold for the full distribution (and off distribution). Local bases seem unlikely to give this to us.
Why are we interested in toy models? We believe they are useful proxies for studying the superposition we suspect might exist in real neural networks. But how can we know if they're actually a useful toy model? Our best validation is whether their predictions are consistent with empirical observations regarding polysemanticity. To the best of our knowledge they are. In particular:
This doesn't mean that everything about our toy model reflects real neural networks. Our intuition is that some of the phenomena we observe (superposition, monosemantic vs polysemantic neurons, perhaps the relationship to adversarial examples) are likely to generalize, while other phenomena (especially the geometry and learning dynamics results) are much more uncertain.
This paper has shown that the superposition hypothesis is true in certain toy models. But if anything, we're left with many more questions about it than we had at the start. In this final section, we review some of the questions which strike us as most important: what do we know, and what would we like future work to clarify?
Our work is inspired by research exploring the features that naturally occur in neural networks. Many models form at least some interpretable features. Word embeddings have semantic directions (see the well-known word-analogy results).
The earliest reference to superposition in artificial neural networks that we're aware of is Arora et al.'s work on linear structure in word embeddings, which models each embedding as a sparse combination of underlying sense vectors.
In parallel with this, investigations of individual neurons in models with privileged bases were beginning to grapple with "polysemantic" neurons which respond to unrelated inputs.
Separate from all of this, Cheung et al. studied how multiple models can be stored in superposition within a single network's parameters.
The goal of learning disentangled representations arises from Bengio et al.'s influential position paper on representation learning.
Concretely, disentanglement research often explores whether one can train a VAE or GAN where basis dimensions correspond to the major features one might use to describe the problem (e.g. rotation, lighting, gender… as relevant). Early work often focused on semi-supervised approaches where the features were known in advance, but fully unsupervised approaches started to develop around 2016.
Put another way, the goal of disentanglement might be described as imposing a strong privileged basis on representations which are rotationally invariant by default. This helps get at ways in which the questions of polysemanticity and superposition are a bit different from disentanglement. Consider that when we deal with neurons, rather than embeddings, we have a privileged basis by default. It varies by model, but many neurons just cleanly respond to features. This means that polysemanticity arises as a kind of anomalous behavior, and superposition arises as a hypothesis for explaining it. The question then isn't how to impose a privileged basis, but how to remove superposition as a fundamental problem to accessing features.
Of course, if the superposition hypothesis is true, there are still a number of connections to disentanglement. On the one hand, it seems likely superposition occurs in the latent spaces of generative models, even though that isn't an area we've investigated. If so, it may be that superposition is a major reason why disentanglement is difficult. Superposition may allow generative models to be much more effective than they would otherwise be without. Put another way, disentanglement often assumes a small number of important latent variables to explain the data. There are clearly examples of such variables, like the orientation of objects – but what if a large number of sparse, rare, individually unimportant features are collectively very important? Superposition would be the natural way for models to represent this.
22. A subtler issue is that GANs and VAEs typically assume Gaussian latent spaces. Sparse latent variables are highly non-Gaussian, but the central limit theorem implies that the superposition of many such variables will look increasingly Gaussian. So the latent spaces of some generative models may actually force models into superposition strategies! On the other hand, one can imagine ideas from disentanglement playing a role in building architectures that resist superposition, by creating a more strongly privileged basis.
The toy problems we consider are quite similar to the problems considered in the field of compressed sensing, which is also known as compressive sensing and sparse recovery. However, there are some important differences:
In general, our toy model is solving a similar problem using less powerful methods than compressed sensing algorithms, especially because the computational model is so much more restricted (to just a single linear transformation and a non-linearity) compared to the arbitrary computation that might be used by a compressed sensing algorithm.
As a result, compressed sensing lower bounds—which give lower bounds on the dimension of the embedding such that recovery is still possible—can be interpreted as giving an upper bound on the amount of superposition in our toy model. In particular, in various compressed sensing settings, one can recover an $n$-dimensional, $k$-sparse vector from an $m$-dimensional projection if and only if $m$ is at least on the order of $k \log(n/k)$.
At first, this bound appears to allow a number of features that is exponential in the number of embedding dimensions: rearranging $m \gtrsim k \log(n/k)$ shows that, for fixed sparsity $k$, the number of recoverable features can grow like $n \approx k\,e^{m/k}$.
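To get a feel for the scale of the bound, consider an illustrative worked instance (the numbers are our own, chosen only for concreteness): with $n = 10^4$ candidate features of which at most $k = 5$ are simultaneously active, recovery requires only $m \gtrsim k \log(n/k) = 5 \ln(2000) \approx 38$ embedding dimensions. Conversely, a few dozen dimensions could in principle support thousands of sufficiently sparse features.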
A striking parallel between our toy model and compressed sensing is the existence of phase changes.
24. Notably, in the compressed sensing case, the phase change occurs in the limit as the number of dimensions goes to infinity – for finite-dimensional spaces, the transition is rapid but not discontinuous. In compressed sensing, if one considers the two-dimensional space defined by a vector's sparsity and dimensionality, there are sharp phase transitions where vectors are recoverable almost surely in one regime and almost surely unrecoverable in the other.
Another interesting line of work has tried to build useful sparse recovery algorithms using neural networks.
Sparse Coding studies the problem of finding a sparse representation of dense data. One can think of it as being like compressed sensing, except the matrix projecting sparse vectors into the lower dimensional space is also unknown. This topic goes by many different names including sparse coding (most common in neuroscience), dictionary learning (in computer science), and sparse frame design (in mathematics). For a general introduction, we refer readers to a textbook by Michael Elad.
Classic sparse coding algorithms take an expectation-maximization approach (this includes Olshausen et al.'s early work).
From our perspective, sparse coding is interesting because it's probably the most natural mathematical formulation of trying to "solve superposition" by discovering which directions correspond to features. Interestingly, this is the reverse of how sparse coding is usually applied in neuroscience, which tends to treat biological neurons as computing a sparse code of their inputs; we are instead interested in running it in the other direction, to find features in superposition over neurons. But can these methods actually resolve superposition in practice? Prior work has attempted to use sparse coding to discover sparse structure, and recently, work by Sharkey et al., building on the original version of this paper, has had preliminary success using sparse autoencoders to extract features from superposition in toy models. Overall, we are only at the earliest stages of using sparse coding and dictionary learning in this way, but the outlook seems quite promising. See the section "Approach 2: Finding an overcomplete basis" for more discussion.
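A minimal sketch of a sparse autoencoder of the general kind used in that line of work; the architecture and penalty here are our illustrative assumptions, not a reproduction of Sharkey et al.'s code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose activations into an overcomplete set of candidate features."""
    def __init__(self, n_neurons=32, n_dict=256):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_dict)
        self.decoder = nn.Linear(n_dict, n_neurons, bias=False)

    def forward(self, acts):
        code = torch.relu(self.encoder(acts))   # candidate feature activations
        return self.decoder(code), code

def sae_loss(sae, acts, l1_coeff=1e-3):
    recon, code = sae(acts)
    return ((recon - acts) ** 2).mean() + l1_coeff * code.abs().mean()
```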
Our work explores representations in artificial “neurons”. Neuroscientists study similar questions in biological neurons. There are a variety of theories for how information could be encoded by a group of neurons. At one extreme is a local code, in which every individual stimulus is represented by a separate neuron. At the other extreme is a maximally-dense distributed code, in which the information-theoretic capacity of the population is fully utilized, and every neuron in the population plays a necessary role in representing every input.
One challenge in comparing our work with the neuroscience literature is that a "distributed representation" seems to mean different things. Consider an overly-simplified example of a population of neurons, each taking a binary value of active or inactive, and a stimulus set of sixteen items: four shapes, with four colors (an example borrowed from prior work).
Decomposability doesn’t necessarily mean each feature gets its own neuron. Instead, it could be that each feature corresponds to a “direction in activation-space”
26. Although the notion of a "direction in activation space" is common in the distributed-coding literature, we haven't encountered a specific term for this hypothesis, perhaps through oversight on our part. We call this hypothesis linearity, given scalar "activations" (firing rates, in biological neurons). "Feature neurons" are only incentivized to develop when there is a privileged basis. In biological neurons, metabolic factors are often hypothesized to induce a privileged basis, producing "sparse coding": this is what one would expect if a neural system's energy consumption grows linearly or sublinearly with firing rate, and experimental evidence seems to support this. Moreover, neurons are the basic unit by which biological neural networks implement non-linear transformations, so if a feature needs to be transformed non-linearly, a "feature neuron" is an effective way to do it.
Any decomposable linear code that uses orthogonal feature vectors is functionally equivalent from the viewpoint of a linear readout. So, a code can both be “maximally distributed” — in the sense that every neuron participates in representing every input, making each neuron extremely polysemantic — and also have no more features than it has dimensions. In this conception, it’s clear that a code can be fully “distributed” and also have no superposition.
A notable difference between our work and the neuroscience literature we have encountered is that we treat the probability with which features co-occur as a central concept.
28. A related but distinct concept in the neuroscience literature is the "binding problem".
One hypothesis in neuroscience is that highly compressed representations might have an important use in long-range communication between brain areas.
It's worth noting that the term "distributed representation" is also used in deep learning, and has the same ambiguities of meaning there. Our sense is that some influential early works (e.g. in the connectionist literature) use the term without clearly distinguishing these meanings.
After publishing the original version of this paper, a number of readers generously brought to our attention additional connections to prior work. We don't have a sufficiently deep understanding of this work to offer a detailed review, but we offer a brief overview below:
Inspired by the original Circuits Thread and Distill's Discussion Article experiment, the authors invited several external researchers who we had previously discussed our preliminary results with to comment on this work. Their comments are included below.
Tom McGrath is a research scientist at DeepMind.
The results in this paper are an important contribution - they really further our theoretical understanding of a phenomenon that may be central to interpretability research and understanding network representations more generally. It’s surprising that such simple settings can produce these rich phenomena. We’ve reproduced the experiments in the Demonstrating Superposition and Superposition as a Phase Change sections and have a minor additional result to contribute.
It is possible to exactly solve the expected loss for the basic ReLU output toy model by making an appropriate substitution in the loss integral.
Making this substitution renders the integral analytically tractable, which allows us to plot the full loss surface and solve for the loss minima directly. We show some example loss surfaces below:
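We don't reproduce the closed-form derivation here, but the expected loss it solves can also be estimated numerically. A Monte Carlo sketch for the two-feature, one-dimensional ReLU output model (the parameterization, and the omission of the bias, are simplifying assumptions):

```python
import numpy as np

def expected_loss(w1, w2, sparsity, importance=(1.0, 1.0), n_mc=20_000, seed=0):
    """Monte Carlo estimate of the expected loss of the n=2, m=1 ReLU output
    model x' = ReLU(W^T W x) (bias omitted for simplicity)."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_mc, 2))
    x = x * (rng.random((n_mc, 2)) < 1 - sparsity)   # each feature zero w.p. sparsity
    W = np.array([[w1, w2]])                          # 1 x 2 embedding matrix
    x_hat = np.maximum(x @ W.T @ W, 0.0)
    return (np.array(importance) * (x_hat - x) ** 2).sum(axis=1).mean()

# Trace the loss surface over a grid of weights.
grid = np.linspace(-1.5, 1.5, 31)
surface = [[expected_loss(w1, w2, sparsity=0.9) for w1 in grid] for w2 in grid]
```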
Although many of these loss surfaces (Figure 1a, 1b) have minima qualitatively similar to one of the network weights used in the section Superposition as a Phase Change, we also find a new phase, shown in Figure 1c.
As Figure 1c shows, some combinations of sparsity and relative feature importance lead to loss surfaces with two minima (once the symmetry of the loss surface is accounted for).
Original Authors' Response: This closed-form analysis of the expected loss is a welcome sharpening of the phase-change picture described in this paper.
Jeffrey Wu and Dan Mossing are members of the Alignment team at OpenAI.
We are very excited about these toy models of polysemanticity. This work sits at a rare intersection of being plausibly very important for training more interpretable models and being very simple and elegant. The results have been surprisingly easy to replicate -- we have reproduced (with very little fuss) plots similar to those in the Demonstrating Superposition – Basic Results, Geometry – Feature Dimensionality, and Learning Dynamics – Discrete "Energy Level" Jumps sections.
Original Authors' Response: We really appreciate this replication of our basic results. Some of our findings were quite surprising to us, and this gives us more confidence that they aren't the result of an idiosyncratic quirk or bug in our implementations.
Spencer Becker-Kahn is a senior research scholar at the Future of Humanity Institute and a SERI Machine Learning Alignment Theory Scholar.
After seeing preliminary results, I independently replicated some of the key graphs from Demonstrating Superposition – Basic Results and, using very small toy models, produced a series of plots consistent with the conceptual picture emerging in Geometry – Feature Dimensionality and Superposition and Learning Dynamics.
See also Twitter thread.
Adam Jermyn is an independent researcher focused on AI alignment and interpretability. He was previously a Research Fellow at the Flatiron Institute’s Center for Computational Astrophysics. Evan Hubinger is a Research Fellow at MIRI. Nicholas Schiefer is a member of the technical staff at Anthropic and an author of the original paper.
Inspired by the results in this paper and the previous paper introducing the SoLU activation, we have been investigating whether changes to the model architecture or training process can reduce superposition in toy models. After replicating several of these results independently, we made the following extensions in that direction:
At least in some limits, this suggests there may not be a price to be paid for monosemanticity. Detailed results can be found in our paper, Engineering Monosemanticity in Toy Models (Alignment Forum, ArXiV).
Tom Henighan and Chris Olah are authors of the original paper.
In the "Feature Dimensionality" section, we found that features organized into clean polytopes when there are more features than can be easily represented in the embedding dimensions.
We briefly investigated this further and found that the number of features competing to be represented significantly influences this phenomenon. Cleaner structure often seems to emerge when there is more "pressure" – more features competing to be represented. This is especially true at high sparsity levels. Additionally, training longer seems to also produce cleaner structure.
More investigation would be needed to really understand this phenomenon.
Marius Hobbhahn is a PhD student at the University of Tuebingen.
I replicated the “Basic Results” in Section 2 (“Demonstrating Superposition”) and all of Section 7 (“Superposition in a Privileged Basis”) of the “Toy Models of Superposition” paper. All of my findings are identical to the ones described in the paper. I replicated most findings in the follow-up “Superposition, Memorization, and Double Descent” paper.
The details of my replication can be found in my write-up “More Findings on Memorization and Double Descent”.
Lee Sharkey, Dan Braun, and Beren Millidge are researchers at Conjecture.
The results from this paper, and the strategic picture it paints, inspired our preliminary follow-up work that aimed to address some of the challenges described in the section titled 'Approach 2: Finding an overcomplete basis'.
Before studying the activations of real neural networks, where we're not sure what the 'ground truth' features are, we studied a toy example. We generated a set of toy ground truth features and created a dataset using sparse combinations of them. We found that a one-layer sparse autoencoder with an L1 penalty on its hidden-layer activations could recover the ground-truth features.
For the toy dataset, we knew how many ground truth features there were. But we ultimately want to count the features used by real neural networks, where the number of features is unknown. We explored three ways to count the features in the toy dataset: a) Counting dead neurons in the autoencoders; b) looking at autoencoder losses; and c) comparing the features learned by autoencoders of different sizes. We found indications that these methods might be suitable to count the number of features in superposition in real neural data.
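As an illustration of the first of these counting methods, a sketch that counts the dictionary units which ever activate; the threshold, and the `sae` interface from the sketch above, are assumptions:

```python
import torch

@torch.no_grad()
def count_live_features(sae, acts, threshold=1e-6):
    """Count dictionary units that ever activate above a small threshold;
    the remainder are 'dead neurons' in the autoencoder."""
    _, code = sae(acts)                          # (n_samples, n_dict)
    return int((code.max(dim=0).values > threshold).sum())
```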
We also applied our method to real activations from a small language model. Our initial, preliminary investigations led to inconclusive results, possibly resulting from having used autoencoders that were either too small or undertrained. At the time of writing, investigations are ongoing.
Neel Nanda is an external researcher in mechanistic interpretability. This is a description of his blog post, Actually, Othello-GPT Has A Linear Emergent World Representation.
I describe a natural experiment testing the linear representation hypothesis described in this paper – the idea that features correspond to directions in neural networks.
Background: Martin Wattenberg (an author on this paper) and colleagues recently found that, if you train a transformer language model to predict the next token in synthetic Othello games (where each move is a randomly chosen legal move), it forms an emergent model of the board state (despite only being trained to predict the next move!). They showed that the state of the board (whether each cell was empty, black, or white) could be recovered with high accuracy by a one-hidden-layer MLP probe. They further showed that you could use the world model to causally intervene on the model's residual stream. By choosing another board state, and changing the residual stream (with gradient descent) such that the probe indicates that new board state, they caused the model to output legal moves in the new board state, even if the edited board state was impossible to reach via legal Othello play!
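As a sketch of what such a probe looks like, here is a minimal linear probe; the shapes and the per-cell three-class setup are illustrative assumptions, not the actual Othello-GPT code:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: d_model residual-stream activations, 64 board cells,
# three classes per cell (empty / black / white, or "mine" / "theirs").
d_model, n_cells, n_classes = 512, 64, 3
probe = nn.Linear(d_model, n_cells * n_classes)

def probe_loss(resid, labels):
    """resid: (batch, d_model) activations; labels: (batch, n_cells) class ids."""
    logits = probe(resid).view(-1, n_cells, n_classes)
    return nn.functional.cross_entropy(logits.reshape(-1, n_classes),
                                       labels.reshape(-1))
```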
Pre-Registered Hypothesis: The probing and causal intervention together provided strong evidence that the model had learned to represent features corresponding to the state of each square on the board. Yet, notably, linear probes were not able to recover the board state. Since linear features should be recoverable with a one-layer probe, and the causal intervention suggests the model both computes and uses the board state, this seemed like significant evidence against the linear representation hypothesis.
However, Chris Olah (an author on this paper) argued the model might still be representing features linearly if it used a different set of features, and that the probe and causal intervention may be picking up on this different set of features. This created an informal pre-registered prediction of the hypothesis which was contrary to the evidence at the time.
Results: I independently came to the same conclusion as Chris and investigated the Othello playing model. I found that the model does form an emergent model of the board state that is linearly represented and can be extracted with a linear probe. But that as the model plays both black and white moves, the model represents the state of a cell as whether it has “my colour” vs “opponent’s colour”. Further, I found circumstantial evidence that these features are used by the model, as we can linearly intervene on the residual stream using the directions given by the probe to edit the represented board state, and the model plays legal moves in the new board state.
I consider these results to be notable, as the paper’s results provided evidence against the linear representation hypothesis and the hypothesis faced genuine risk of falsification. And the hypothesis made non-trivial predictions that were contrary to where the evidence pointed, but these turned out to be true. This is both a proof of concept that there are underlying principles of neural networks which have predictive power about models, and a natural experiment supporting the linear representation hypothesis.
I think that there is further work interpreting the Othello playing model that could test other hypotheses in this paper and our broader conceptual frameworks about neural networks and transformers, such as by looking for monosemantic vs superposed neurons in its MLP layers. The model is both complex enough to be interesting and expose principles of how transformers learn algorithms, yet the algorithmic nature of the task and existence of the probe suggests that finding circuits will be tractable. I elaborate on what I consider promising directions of future work in a follow-up post.
Fred Zhang is a PhD student in the Theory Group of the EECS Department at UC Berkeley.
In the Geometry of Superposition section, the paper defines a notion of feature dimensionality, $D_i = \|W_i\|^2 / \sum_j (\hat{W_i} \cdot W_j)^2$, where $W_i$ is the embedding vector of the $i$-th feature and $\hat{W_i} = W_i / \|W_i\|$ is its unit vector.
Following this definition, the paper makes the remark that "Empirically, it seems that the dimensionality of all features add up to the number of embedding dimensions when the features are 'packed efficiently' in some sense." In this comment, I point out a natural, theoretical explanation of this observation. The argument is via the notion of leverage score in matrix approximation. I’ll define it first, then explain how it connects to feature dimensionality.
At a conceptual level, leverage score is a measure of the importance of a row in composing the row space of a matrix. For instance, if a row is orthogonal to all other rows, its leverage score is 1, meaning that it's maximally important. This is natural, since removing it would decrease the rank of the matrix and completely change the row space. Formally, if $A$ is a tall, full-rank $n \times d$ matrix ($n \geq d$) with rows $a_1, \dots, a_n$, the leverage score of row $i$ is $\tau_i = a_i^\top (A^\top A)^{-1} a_i$.
Notice that the denominator term equals $\hat{W_i}^\top W^\top W \hat{W_i}$ (writing $W$ for the matrix whose rows are the feature embedding vectors), which is precisely the kind of quadratic form that appears in the leverage score.
Three quick remarks on this definition:
Returning to my main point, another nice fact about leverage scores is that they sum up to the rank of the matrix. In the tall and thin case above, they sum up to $d$ (if the matrix is full-rank). Given that, it is natural that this paper makes the empirical observation that the sum of the feature dimensionalities $D_i$ equals the number of embedding dimensions when the vectors are "packed efficiently".
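This fact is easy to check numerically; a small sketch with random data and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))   # a tall, full-rank matrix

# Leverage score of row i: tau_i = a_i^T (A^T A)^{-1} a_i.
G_inv = np.linalg.inv(A.T @ A)
tau = np.einsum("ij,jk,ik->i", A, G_inv, A)

print(tau.sum())   # ~5.0: leverage scores sum to rank(A) = d
```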
We provide a notebook to reproduce some of the core diagrams in this article here. (It isn't comprehensive, since we needed to rewrite code for our experiments to run outside our codebase.) We provide a separate notebook for the theoretical phase change diagrams.
Note that the reproductions by other researchers mentioned in comments above were not based on this code, but are instead fully independent replications with clean code from the description in an early draft of this article.
We're extremely grateful to a number of colleagues across several organizations for their invaluable support in our writing of this paper.
Jeff Wu, Daniel Mossing, Tom McGrath, and Kshitij Sachan did independent replications of many of our experiments, greatly increasing our confidence in our results. Kshitij Sachan's and Tom McGrath's additional investigations and insightful questions both pushed us to clarify our understanding of the superposition phase change (both as reflected in this paper, and in further understanding which we learned from them not captured here). Buck Shlegeris, Adam Scherlis, and Adam Jermyn shared valuable insights into the mathematical nature of the toy problem and related work. Adam Jermyn also coined the term "virtual neurons."
Gabriel Goh, Neel Nanda, Vladimir Mikulik, and Nick Cammarata gave detailed feedback which improved the paper, in addition to being motivating. Alex Dimakis, Piotr Indyk, Dan Yamins generously took time to discuss these results with us and give advice on how they might connect to their area of expertise. Finally, we benefited from the feedback and comments of James Bradbury, Sebastian Farquhar, Shan Carter, Patrick Mineault, Alex Tamkin, Paul Christiano, Evan Hubinger, Ian McKenzie, and Sid Black. We're additionally grateful to Trenton Bricken and Manjari Narayan for referring us to valuable related work we originally missed. Thanks to Ken Kahn for typo corrections.
Finally, we're very grateful to all our colleagues at Anthropic for their advice and support: Daniela Amodei, Jack Clark, Tom Brown, Ben Mann, Nick Joseph, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Timothy Telleen-Lawton, Anna Chen, Yuntao Bai, Jeffrey Ladish, Deep Ganguli, Liane Lovitt, Nova DasSarma, Jia Yuan Loke, Jackson Kernion, Tom Conerly, Scott Johnston, Jamie Kerr, Sheer El Showk, Stanislav Fort, Rebecca Raible, Saurav Kadavath, Rune Kvist, Jarrah Bloomfield, Eli Tran-Johnson, Rob Gilson, Guro Khundadze, Filipe Dobreira, Ethan Perez, Sam Bowman, Sam Ringer, Sebastian Conybeare, Jeeyoon Hyun, Michael Sellitto, Jared Mueller, Joshua Landau, Cameron McKinnon, Sandipan Kundu, Jasmine Brazilek, Da Yan, Robin Larson, Noemí Mercado, Anna Goldie, Azalia Mirhoseini, Jennifer Zhou, Erick Galankin, James Sully, Dustin Li, James Landis.
Basic Results - The basic toy model results demonstrating the existence of superposition were done by Nelson Elhage and Chris Olah. Chris suggested the toy model and Nelson ran the experiments.
Phase Change - Chris Olah ran the empirical phase change experiments, with help from Nelson Elhage. Martin Wattenberg introduced the theoretical model where exact losses for specific weight configurations can be computed.
Geometry - The uniform superposition geometry results were discovered by Nelson Elhage and Nicholas Schiefer, with help from Chris Olah. Nelson discovered the original geometric structures.
Learning Dynamics - Nelson Elhage discovered the "energy level jump" phenomenon, in collaboration with Nicholas Schiefer and Chris Olah. Martin Wattenberg discovered the "geometric transformations" phenomenon.
Adversarial Examples - Chris Olah and Catherine Olsson found evidence of a connection between superposition and adversarial examples.
Superposition with a Privileged Basis / Doing Computation - Chris Olah did the basic investigation of superposition in a privileged basis. Nelson Elhage, with help from Chris, investigated the "absolute value" model which provided a more principled demonstration of superposition and showed that computation could be done while in superposition. Nelson discovered the "asymmetric superposition" motif.
Theory - The theoretical picture articulated over the course of this paper (especially in the "mathematical understanding" section) was developed in conversations between all authors, but especially Chris Olah, Jared Kaplan, Martin Wattenberg, Nelson Elhage, Tristan Hume, Tom Henighan, Catherine Olsson, Nicholas Schiefer, Dawn Drain, Shauna Kravec, Roger Grosse, Robert Lasenby, and Sam McCandlish. Jared introduced the strategy of rewriting the loss by grouping terms with the number of active features. Both Jared and Martin independently noticed the value of investigating analytically tractable special cases of the model.
Strategic Picture - The strategic picture articulated in this paper – What does superposition mean for interpretability and safety? What would a suitable solution be? How might one solve it? – developed in extensive conversations between authors, and in particular Chris Olah, Tristan Hume, Nelson Elhage, Dario Amodei, Jared Kaplan. Nelson Elhage recognized the potential importance of "enumerative safety", further articulated by Dario. Tristan brainstormed extensively about ways one might solve superposition and pushed Chris on this topic.
Writing - The paper was primarily drafted by Chris Olah, with some sections by Nelson Elhage, Tristan Hume, Martin Wattenberg, and Catherine Olsson. All authors contributed to editing, with particularly significant contributions from Zac Hatfield-Dodds, Robert Lasenby, Kipply Chen, and Roger Grosse.
Illustration - The paper was primarily illustrated by Chris Olah, with help from Tristan Hume, Nelson Elhage, and Catherine Olsson.
Please cite as:
Elhage, et al., "Toy Models of Superposition", Transformer Circuits Thread, 2022.
BibTeX Citation:
@article{elhage2022superposition,
  title={Toy Models of Superposition},
  author={Elhage, Nelson and Hume, Tristan and Olsson, Catherine and Schiefer, Nicholas and Henighan, Tom and Kravec, Shauna and Hatfield-Dodds, Zac and Lasenby, Robert and Drain, Dawn and Chen, Carol and Grosse, Roger and McCandlish, Sam and Kaplan, Jared and Amodei, Dario and Wattenberg, Martin and Olah, Christopher},
  year={2022},
  journal={Transformer Circuits Thread},
  note={https://transformer-circuits.pub/2022/toy_model/index.html}
}
This paper relies on the assumption that representations are linear. But what if models don't use linear feature directions to represent information? What might that concretely look like?
Neural networks have nonlinearities that make it theoretically possible to compress information even more compactly than a linear superposition. There are reasons we think models are unlikely to pervasively use nonlinear compression schemes:
Regardless of whether large models end up using nonlinear compression, it should be possible to view directions being used with nonlinear compression as linear feature directions and reverse engineer the computation being used for compression like any other circuit. If this kind of encoding is pervasive throughout the network then it may merit some kind of automated decoding. It shouldn’t pose a fundamental challenge to interpretability unless the model learns a scheme for doing complex computation while staying in a complicated nonlinear representation, which we suspect is unlikely.
To help provide intuition, the simplest example of what a nonlinear compression scheme might look like is compressing two [0,1) dimensions
This works by quantizing the
We can compare the mean squared error loss on random uniform dense values of
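The surrounding sentences are truncated in this extraction, so what follows is only a minimal sketch of one scheme consistent with the description: quantize the first coordinate into k cells, for some integer k, and pack the second coordinate into the residual precision of a single [0,1) value. The function names and the exact packing below are our own illustration, not necessarily the construction the authors had in mind.

import numpy as np

def compress(x, y, k):
    # Quantize x into k cells and store y in the leftover precision of one scalar.
    return (np.floor(x * k) + y) / k

def decompress(z, k):
    zk = z * k
    cell = np.floor(zk)
    x_hat = (cell + 0.5) / k   # midpoint of x's quantization cell
    y_hat = zk - cell          # y round-trips exactly, up to floating-point error
    return x_hat, y_hat

# Mean squared error on dense, uniformly random inputs:
rng = np.random.default_rng(0)
x, y = rng.uniform(size=(2, 100_000))
x_hat, y_hat = decompress(compress(x, y, k=16), k=16)
mse = 0.5 * (np.mean((x - x_hat) ** 2) + np.mean((y - y_hat) ** 2))
# mse is roughly 1/(24 * k**2), about 1.6e-4 for k = 16: y is recovered almost
# exactly, while x pays quantization error with variance 1/(12 * k**2).

On dense inputs this beats any linear map from two dimensions to one, which must discard an entire direction and so incurs a per-feature mean squared error on the order of 1/24 — illustrating why nonlinear schemes are more compact in principle even though, as argued above, models seem unlikely to use them pervasively.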
Here, we formalize the relationship between a compressed sensing lower bound and the toy model.
Let
We derive the following theorem:
Theorem 1. Suppose that the toy model recovers all
We prove this result by framing our toy model as a compressed sensing algorithm. The primary barrier to doing so is that our optimization only searches for vectors that are close in
Lemma 1. Suppose that we have a toy model
Proof. We construct
Lastly, we use the deterministic compressed sensing lower bound of Do Ba, Indyk, Price, and Woodruff
Theorem 2 (Corollary 3.1 in
for an approximation factor
Theorem 1 follows directly from Lemma 1 and Theorem 2.
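Since the precise statements above are truncated in this extraction, we note for orientation the standard form of the Do Ba–Indyk–Price–Woodruff lower bound (a paraphrase of their published result, not the exact corollary quoted here): if a matrix $A \in \mathbb{R}^{m \times n}$ admits a recovery algorithm $\mathcal{A}$ with $\|x - \mathcal{A}(Ax)\|_1 \le C \cdot \min_{k\text{-sparse } x'} \|x - x'\|_1$ for every $x$ and a constant approximation factor $C$, then

$$m = \Omega\!\left(k \log \frac{n}{k}\right).$$

Read through Lemma 1, this is presumably the sense in which a toy model that recovers all sufficiently sparse feature vectors cannot compress them into too few hidden dimensions.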
Replication & Forthcoming Paper
Kshitij Sachan is a research intern at Redwood Research.
Redwood Research has been working on toy models of polysemanticity, inspired by Anthropic's work. We plan to publish our results separately; during our research we replicated many of the experiments in this paper. Specifically, we replicated all plots in the Demonstrating Superposition and Superposition as a Phase Change sections (visualizations of the ReLU models with different sparsities, and the phase diagrams), as well as the plot in The Geometry of Superposition – Uniform Superposition. We found that the phase diagrams look quite different depending on the activation function, suggesting that in this toy model some activation functions induce more polysemanticity than others.
Original Authors' Response: Redwood's further analysis of the superposition phase change significantly advanced our own understanding of the issue – we're very excited for their analysis to be shared with the world. We also appreciate the independent replication of our basic results.
Update: The research by Redwood mentioned in the previous comment, Polysemanticity and Capacity in Neural Networks (Alignment Forum, arXiv), is out! They study a slightly different toy model and get some really interesting results. Highlights include analytical traction on understanding a variant of the toy model, understanding superposition in terms of constrained optimization, and analysis of the role different activation functions play.