It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?
In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition.
Consider a toy model where we train an embedding of five features of varying importance, where "importance" is a scalar multiplier on the mean squared error loss. We embed these features in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features. With dense features, the model learns to represent an orthogonal basis of the two most important features (similar to what Principal Components Analysis might give us), and the other three features are not represented. But if we make the features sparse, this changes:
Not only can models store additional features in superposition by tolerating some interference, but we'll show that, at least in certain limited cases, models can perform computation while in superposition. (In particular, we'll show that models can put simple circuits computing the absolute value function in superposition.) This leads us to hypothesize that the neural networks we observe in practice are in some sense noisily simulating larger, highly sparse networks. In other words, it's possible that models we train can be thought of as doing “the same thing as” an imagined much-larger model, representing the exact same features but with no interference.
Feature superposition isn't a novel idea. A number of previous interpretability papers have considered it
For interpretability researchers, our main contribution is providing a direct demonstration that superposition occurs in artificial neural networks given a relatively natural setup, suggesting this may also occur in practice. That is, we show a case where interpreting neural networks as having sparse structure in superposition isn't just a useful post-hoc interpretation, but actually the "ground truth" of a model. We offer a theory of when and why this occurs, revealing a phase diagram for superposition. This explains why neurons are sometimes "monosemantic", responding to a single feature, and sometimes "polysemantic", responding to many unrelated features. We also find that, at least in our toy models, superposition exhibits complex geometric structure.
But our results may also be of broader interest. We find preliminary evidence that superposition may be linked to adversarial examples and grokking, and might also suggest a theory for the performance of mixture of experts models. More broadly, the toy model we investigate has unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar to the fractional quantum Hall effect in physics, among other striking phenomena. We originally investigated the subject to gain understanding of cleanly-interpretable neurons in larger models, but we've found these toy models to be surprisingly interesting in their own right.
In our toy models, we are able to demonstrate that:
Our toy models are simple ReLU networks, so it seems fair to say that neural networks exhibit these properties in at least some regimes, but it's very unclear what to generalize to real networks.
In our work, we often think of neural networks as having features of the input represented as directions in activation space. This isn't a trivial claim. It isn't obvious what kind of structure we should expect neural network representations to have. When we say something like "word embeddings have a gender direction" or "vision models have curve detector neurons", one is implicitly making strong claims about the structure of network representations.
Despite this, we believe this kind of "linear representation hypothesis" is supported both by significant empirical findings and theoretical arguments. One might think of this as two separate properties, which we'll explore in more detail shortly:
If we hope to reverse engineer neural networks, we need a property like decomposability. Decomposability is what allows us to reason about the model without fitting the whole thing in our heads! But it's not enough for things to be decomposable: we need to be able to access the decomposition somehow. In order to do this, we need to identify the individual features within a representation. In a linear representation, this corresponds to determining which directions in activation space correspond to which independent features of the input.
Sometimes, identifying feature directions is very easy because features seem to correspond to neurons. For example, many neurons in the early layers of InceptionV1 clearly correspond to features (e.g. curve detector neurons
Superposition has been hypothesized in previous work
The goal of this section will be to motivate these ideas and unpack them in detail.
It's worth noting that many of the ideas in this section have close connections to ideas in other lines of interpretability research (especially disentanglement), neuroscience (distributed representations, population codes, etc), compressed sensing, and many other lines of work. This section will focus on articulating our perspective on the problem. We'll discuss these other lines of work in detail in Related Work.
When we talk about "features" and how they're represented, this is ultimately theory building around several observed empirical phenomena. Before describing how we conceptualize those results, we'll simply describe some of the major results motivating our thinking:
V("king") - V("man") + V("woman") = V("queen")
(but see V("king") - V("man") + V("woman") = V("queen")
这样的嵌入向量运算(但请参见 As a result, we tend to think of neural network representations as being composed of features which are represented as directions. We'll unpack this idea in the following sections.
因此,我们往往认为神经网络的表示由特征构成,这些特征表现为方向。我们将在接下来的章节中详细阐述这一概念。
Our use of the term "feature" is motivated by the interpretable properties of the input we observe neurons (or word embedding directions) responding to. There's a rich variety of such observed properties!
In vision, these properties range from low-level neurons such as curve detectors
But even with that motivation, it turns out to be quite challenging to create a satisfactory definition of a feature. Rather than offer a single definition we're confident about, we consider three potential working definitions:
We've written this paper with the final "neurons in sufficiently large models" definition in mind. But we aren't overly attached to it, and actually think it's probably important to not prematurely attach to a definition.
a famous book by Lakatos
As we've mentioned in previous sections, we generally think of features as being represented by directions. For example, in word embeddings, "gender" and "royalty" appear to correspond to directions, allowing arithmetic like V("king") - V("man") + V("woman") = V("queen")
Let's call a neural network representation linear if features correspond to directions in activation space. In a linear representation, each feature
We don't think it's a coincidence that neural networks empirically seem to have linear representations. Neural networks are built from linear functions interspersed with non-linearities. In some sense, the linear functions are the vast majority of the computation (for example, as measured in FLOPs). Linear representations are the natural format for neural networks to represent information in! Concretely, there are three major benefits:
It is possible to construct non-linear representations, and retrieve information from them, if you use multiple layers (although even these examples can be seen as linear representations with more exotic features). We provide an example in the appendix. However, our intuition is that non-linear representations are generally inefficient for neural networks.
One might think that a linear representation can only store as many features as it has dimensions, but it turns out this isn't the case! We'll see that the phenomenon we call superposition will allow models to store more features – potentially many more features – in linear representations.
For discussion on how this view of features squares with a conception of features as being multidimensional manifolds, see the appendix “What about Multidimensional Features?”.
Even if features are encoded as directions, a natural question to ask is which directions? In some cases, it seems useful to consider the basis directions, but in others it doesn't. Why is this?
When researchers study word embeddings, it doesn't make sense to analyze basis directions. There would be no reason to expect a basis dimension to be different from any other possible direction. One way to see this is to imagine applying some random linear transformation
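To make this symmetry argument concrete, here is a minimal NumPy sketch (the embedding matrix and its dimensions are made-up stand-ins): applying a random orthogonal transformation to every embedding leaves all dot products unchanged, so nothing that reads the embeddings linearly can tell the rotated space from the original, and no basis direction is special.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 64))                  # hypothetical embeddings (vocab x dim)
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # a random orthogonal transformation

E_rot = E @ Q                                    # rotate every embedding vector

# All pairwise similarities are unchanged, so the rotated embeddings are
# functionally equivalent: basis directions carry no special meaning.
assert np.allclose(E @ E.T, E_rot @ E_rot.T, atol=1e-6)
```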
But many neural network layers are not like this. Often, something about the architecture makes the basis directions special, such as applying an activation function. This "breaks the symmetry", making those directions special, and potentially encouraging features to align with the basis dimensions. We call this a privileged basis, and call the basis directions "neurons." Often, these neurons correspond to interpretable features.
From this perspective, it only makes sense to ask if a neuron is interpretable when it is in a privileged basis. In fact, we typically reserve the word "neuron" for basis directions which are in a privileged basis. (See longer discussion here.)
Note that having a privileged basis doesn't guarantee that features will be basis-aligned – we'll see that they often aren't! But it's a minimal condition for the question to even make sense.
Even when there is a privileged basis, it's often the case that neurons are "polysemantic", responding to several unrelated features. One explanation for this is the superposition hypothesis
Several results from mathematics suggest that something like this might be plausible:
Concretely, in the superposition hypothesis, features are represented as almost-orthogonal directions in the vector space of neuron outputs. Since the features are only almost-orthogonal, one feature activating looks like other features slightly activating. Tolerating this "noise" or "interference" comes at a cost. But for neural networks with highly sparse features, this cost may be outweighed by the benefit of being able to represent more features! (Crucially, sparsity greatly reduces the costs since sparse features are rarely active to interfere with each other, and non-linear activation functions create opportunities to filter out small amounts of noise.)
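As a toy numerical illustration of this idea (the dimensions, feature counts, and bias value below are arbitrary choices for demonstration, not settings from our experiments): random directions in a d-dimensional space are almost orthogonal, so a sparse vector over many more than d features can be compressed into d dimensions and approximately recovered, with a ReLU and a small negative bias filtering out most of the interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 400, 100

# Random unit vectors in 100 dimensions are almost orthogonal: their pairwise
# dot products concentrate around 0 with standard deviation ~ 1/sqrt(100).
W = rng.normal(size=(n_dims, n_features))
W /= np.linalg.norm(W, axis=0)

x = np.zeros(n_features)
x[[7, 123, 301]] = 1.0                   # a sparse input: three active features

h = W @ x                                # compress 400 features into 100 dimensions
x_hat = np.maximum(0, W.T @ h - 0.3)     # ReLU + negative bias filters interference

# The active features should have much larger recovered values than the rest.
print(np.sort(np.argsort(x_hat)[-3:]))   # typically recovers [  7 123 301]
```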
One way to think of this is that a small neural network may be able to noisily "simulate" a sparse larger model:
Although we've described superposition with respect to neurons, it can also occur in representations with an unprivileged basis, such as a word embedding. Superposition simply means that there are more features than dimensions.
The ideas in this section might be thought of in terms of four progressively more strict properties that neural network representations might have.
The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter (non-superposition and basis-aligned) are properties we believe only sometimes occur.
If one takes the superposition hypothesis seriously, a natural first question is whether neural networks can actually noisily represent more features than they have neurons. If they can't, the superposition hypothesis may be comfortably dismissed.
The intuition from linear models would be that this isn't possible: the best a linear model can do is to store the principal components. But we'll see that adding just a slight nonlinearity can make models behave in a radically different way! This will be our first demonstration of superposition. (It will also be an object lesson in the complexity of even very simple neural networks.)
Our goal is to explore whether a neural network can project a high dimensional vector
We begin by describing the high-dimensional vector
Since we don't have any ground truth for features, we need to create synthetic data for
Concretely, our synthetic data is defined as follows: The input vectors
We will actually consider two models, which we motivate below. The first "linear model" is a well understood baseline which does not exhibit superposition. The second "ReLU output model" is a very simple model which does exhibit superposition. The two models vary only in the final activation function.
Why these models?
The superposition hypothesis suggests that each feature in the higher-dimensional model corresponds to a direction in the lower-dimensional space. This means we can represent the down projection as a linear map
To recover the original vector, we'll use the transpose of the same matrix
We also add a bias. One motivation for this is that it allows the model to set features it doesn't represent to their expected value. But we'll see later that the ability to set a negative bias is important for superposition for a second set of reasons – roughly, it allows models to discard small amounts of noise.
The final step is whether to add an activation function. This turns out to be critical to whether superposition occurs. In a real neural network, when features are actually used by the model to do computation, there will be an activation function, so it seems principled to include one at the end.
Our loss is the mean squared error weighted by the feature importances, I_i, described above:

$$L = \sum_{x} \sum_{i} I_i \big(x_i - x'_i\big)^2$$

where x' denotes the model's reconstruction of the input x.
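To make the setup concrete, here is a minimal PyTorch sketch of the ReLU output model and its weighted loss. The hyperparameters (importance curve, sparsity, learning rate, step count) are illustrative guesses rather than our exact experimental settings; dropping the final ReLU gives the linear baseline.

```python
import torch

n_features, n_hidden = 5, 2
S = 0.9                                               # sparsity: P(x_i = 0)
importance = 0.9 ** torch.arange(n_features).float()  # decaying importance I_i

W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

def batch(n=1024):
    # Each feature is zero with probability S, otherwise uniform on [0, 1].
    x = torch.rand(n, n_features)
    return x * (torch.rand(n, n_features) < (1 - S))

for step in range(10_000):
    x = batch()
    h = x @ W.T                        # down-project: h = W x
    x_hat = torch.relu(h @ W + b)      # ReLU output model: x' = ReLU(W^T h + b);
                                       # drop the ReLU for the linear baseline
    loss = (importance * (x - x_hat) ** 2).sum(dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```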
Our first experiment will simply be to train a few ReLU output models with different sparsity levels and visualize the results. (We'll also train a linear model – if optimized well enough, the linear model solution does not depend on sparsity level.)
The main question is how to visualize the results. The simplest way is to visualize
But the thing we really care about is this hypothesized phenomenon of superposition – does the model represent "extra features" by storing them non-orthogonally? Is there a way to get at it more explicitly? Well, one question is just how many features the model learns to represent. For any feature, whether or not it is represented is determined by ‖W_i‖, the norm of its embedding vector.
We'd also like to understand whether a given feature shares its dimension with other features. For this, we calculate
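A sketch of what we take this calculation to be (our NumPy formulation): for each feature i, sum the squared projections of every other feature onto its direction, Σ_{j≠i} (Ŵ_i · W_j)². A value of zero means the feature has a dimension to itself; larger values mean it shares its dimension with other features.

```python
import numpy as np

def shared_dimension(W, eps=1e-9):
    """For each feature i, compute sum_{j != i} (W_hat_i . W_j)^2, where
    columns of W are feature embeddings and W_hat_i is the unit vector
    in feature i's direction."""
    norms = np.linalg.norm(W, axis=0)        # ||W_i|| per column
    W_hat = W / np.maximum(norms, eps)       # unit feature directions
    proj = (W_hat.T @ W) ** 2                # (W_hat_i . W_j)^2
    np.fill_diagonal(proj, 0.0)              # exclude each feature itself
    return proj.sum(axis=1)
```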
We can visualize the model we looked at previously this way:
Now that we have a way to visualize models, we can start to actually do experiments. We'll start by considering models with only a few features (
As our standard intuitions would expect, the linear model always learns the top-
The results are qualitatively similar for models with more features and hidden dimensions. For example, if we consider a model with
In the previous section, we observed a surprising empirical result: adding a ReLU to the output of our model allowed a radically different solution – superposition – which doesn't occur in linear models.
The model where it occurs is still quite mathematically simple. Can we analytically understand why superposition is occurring? And for that matter, why does adding a single non-linearity make things so different from the linear model case? It turns out that we can get a fairly satisfying answer, revealing that our model is governed by balancing two competing forces – feature benefit and interference – which will be useful intuition going forwards. We'll also discover a connection to the famous Thomson Problem in chemistry.
Let's start with the linear case. This is well understood by prior work! If one wants to understand why linear models don't exhibit superposition, the easy answer is to observe that linear models essentially perform PCA. But this isn't fully satisfying: if we set aside all our knowledge and intuition about linear functions for a moment, why exactly is it that superposition can't occur?
A deeper understanding can come from the results of Saxe et al.
The Saxe results reveal that there are fundamentally two competing forces which control learning dynamics in the considered model. Firstly, the model can attain a better loss by representing more features (we've labeled this "feature benefit"). But it also gets a worse loss if it represents more than it can fit orthogonally, due to "interference" between features. (As an aside, it's interesting to compare interference in the linear model with the notion of coherence in compressed sensing; they can be seen as the … of the same vector.)
Can we achieve a similar kind of understanding for the ReLU output model? Concretely, we'd like to understand
The integral over
This new equation is vaguely similar to the famous Thomson problem in chemistry. In particular, if we assume uniform importance and that there are a fixed number of features with
Another interesting property is that ReLU makes negative interference free in the 1-sparse case. This explains why the solutions we've seen prefer to only have negative interference when possible. Further, using a negative bias can convert small positive interferences into essentially being negative interferences.
What about the terms corresponding to less sparse vectors? We leave explicitly writing these out to the reader, but the main idea is that there are multiple compounding interferences, and the "active features" can experience interference. In a later section, we'll see that features often organize themselves into sparse interference graphs such that only a small number of features interfere with another feature – it's interesting to note that this reduces the probability of compounding interference and makes the 1-sparse loss term more important relative to others.
The results in the previous section seem to suggest that there are three outcomes for a feature when we train a model: (1) the feature may simply not be learned; (2) the feature may be learned, and represented in superposition; or (3) the model may represent a feature with a dedicated dimension. The transitions between these three outcomes seem sharp. Possibly, there's some kind of phase change.
Here, we use "phase change" in the generic sense of a discontinuous change, rather than the more technical sense of a discontinuity that arises only in the limit of infinite system size.
One way to understand this better is to explore if there's something like a "phase diagram" from physics, which could help us understand when a feature is expected to be in one of these regimes. Although we can see hints of this in our previous experiment, it's hard to really isolate what's going on because many features are changing at once and there may be interaction effects. As a result, we set up the following experiment to better isolate the effects.
As an initial experiment, we consider models with 2 features but only 1 hidden layer dimension. We still consider the ReLU output model,
We can compare this to a theoretical "toy model of the toy model" where we can get closed form solutions for the loss of different weight configurations as a function of importance and sparsity. There are three natural ways to store 2 features in 1 dimension:
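(Our reading of the three candidates: dedicate the hidden dimension to the first feature, W = (1, 0); dedicate it to the second, W = (0, 1); or store both antipodally in superposition, W = (1, −1).) As a sanity check on this kind of analysis, the candidate losses can also be estimated numerically rather than in closed form. The sketch below fixes the bias to zero for brevity, which shifts the exact crossover points but not the qualitative picture; the importance values and sparsity grid are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_loss(w, importance, S, n=200_000):
    """Monte Carlo estimate of the ReLU output model loss for 2 features in
    1 hidden dimension with weights w and bias fixed to 0."""
    x = rng.uniform(0, 1, (n, 2)) * (rng.uniform(0, 1, (n, 2)) < (1 - S))
    h = x @ w                                 # scalar hidden activation
    x_hat = np.maximum(0, np.outer(h, w))     # x' = ReLU(w^T h)
    return (importance * (x - x_hat) ** 2).sum(axis=1).mean()

configs = {
    "feature 1 only": np.array([1.0, 0.0]),
    "feature 2 only": np.array([0.0, 1.0]),
    "superposition":  np.array([1.0, -1.0]),
}
for S in [0.0, 0.9, 0.999]:
    losses = {k: mc_loss(w, np.array([1.0, 0.5]), S) for k, w in configs.items()}
    print(S, min(losses, key=losses.get), losses)
```

In this sketch, the dedicated solution for the more important feature wins in the dense regime, while the antipodal solution wins at high sparsity, consistent with the phase diagrams discussed here.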
As expected, sparsity is necessary for superposition to occur, but we can see that it interacts in an interesting way with relative feature importance. But most interestingly, there appears to be a real phase change, observed in both the empirical and theoretical diagrams! The optimal weight configuration discontinuously changes in magnitude and superposition. (In the theoretical model, we can analytically confirm that there's a first-order phase change: there's crossover between the functions, causing a discontinuity in the derivative of the optimal loss.)
We can ask this same question of embedding three features in two dimensions. This problem still has a single "extra feature" (now the third one) we can study, asking what happens as we vary its importance relative to the other two and change sparsity.
For the theoretical model, we now consider four natural solutions. We can describe solutions by asking "what feature direction did
These diagrams suggest that there really is a phase change between different strategies for encoding features. However, we'll see in the next section that there's much more complex structure this preliminary view doesn't capture.
We've seen that superposition can allow a model to represent extra features, and that the number of extra features increases as we increase sparsity. In this section, we'll investigate this relationship in more detail, discovering an unexpected geometric story: features seem to organize themselves into geometric structures such as pentagons and tetrahedrons! In some ways, the structure described in this section seems "too elegant to be true" and we think there's a good chance it's at least partly idiosyncratic to the toy model we're investigating. But it seems worth investigating because if anything about this generalizes to real models, it may give us a lot of leverage in understanding their representations.
We'll start by investigating uniform superposition, where all features are identical: independent, equally important and equally sparse. It turns out that uniform superposition has a surprising connection to the geometry of uniform polytopes! Later, we'll move on to investigate non-uniform superposition, where features are not identical. It turns out that this can be understood, at least to some extent, as a deformation of uniform superposition.
As mentioned above, we begin our investigation with uniform superposition, where all features have the same importance and sparsity. We'll see later that this case has some unexpected structure, but there's also a much more basic reason to study it: it's much easier to reason about than the non-uniform case, and has fewer variables we need to worry about in our experiments.
We'd like to understand what happens as we change feature sparsity,
A convenient way to measure the number of features the model has learned is to look at the Frobenius norm, ‖W‖_F.
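As a sketch of this measurement (the function name is ours): since each represented feature has an embedding of roughly unit norm, ‖W‖_F² approximately counts the number of represented features, and dividing the number of hidden dimensions by it gives an average number of dimensions per represented feature.

```python
import numpy as np

def dims_per_feature(W):
    # ||W||_F^2 = sum_i ||W_i||^2 roughly counts represented features,
    # since each represented feature has an embedding of roughly unit norm.
    n_hidden = W.shape[0]
    return n_hidden / np.linalg.norm(W) ** 2
```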
We'll plot
Surprisingly, we find that this graph is "sticky" at 1 and ½.
It turns out that antipodal pairs are just the tip of the iceberg. Hiding underneath this curve are a number of extremely specific geometric configurations of features.
In the previous section, we saw that there's a sticky regime where the model has "half a dimension per feature" in some sense. This is an average statistical property of the features the model represents, but it seems to hint at something interesting. Is there a way we could understand what "fraction of a dimension" a specific feature gets?
We'll define the dimensionality of the i-th feature, D_i, as:

$$D_i = \frac{\|W_i\|^2}{\sum_j \left(\hat{W}_i \cdot W_j\right)^2}$$

where W_i is the embedding vector of the i-th feature (the i-th column of W) and Ŵ_i is its unit-normalized version.

Intuitively, the numerator represents the extent to which a given feature is represented, while the denominator measures "how many features share the dimension it is embedded in", by projecting each feature onto its direction. In the antipodal case, each feature participating in an antipodal pair will have a dimensionality of ½.
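In code, the definition is a direct transcription (columns of W are feature embeddings; the denominator reuses the projection sums from the earlier sketch):

```python
import numpy as np

def feature_dimensionality(W, eps=1e-9):
    """D_i = ||W_i||^2 / sum_j (W_hat_i . W_j)^2 for each feature i."""
    norms = np.linalg.norm(W, axis=0)          # ||W_i||
    W_hat = W / np.maximum(norms, eps)         # unit feature directions
    return norms ** 2 / ((W_hat.T @ W) ** 2).sum(axis=1)
```

For a unit-norm antipodal pair, the denominator is 1 + 1 = 2, recovering the dimensionality of ½ noted above.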
We can now break the above plot down on a per-feature basis. This reveals many more of these "sticky points"! To help us understand this better, we're going to create a scatter plot annotated with some additional information:
Let's look at the resulting plot, and then we'll try to figure out what it's showing us:
What is going on with the points clustering at specific fractions? We'll see shortly that the model likes to create specific weight geometries and kind of jumps between the different configurations.
In the previous section, we developed a theory of superposition as a phase change. But everything on this plot between 0 (not learning a feature) and 1 (dedicating a dimension to a feature) is superposition. Superposition is what happens when features have fractional dimensionality. That is to say – superposition isn't just one thing!
How can we relate this to our original understanding of the phase change? We often think of water as only having three phases: ice, water and steam. But this is a simplification: there are actually many phases of ice, often corresponding to different crystal structures (eg. hexagonal vs cubic ice). In a vaguely similar way, neural network features seem to also have many other phases within the general category of "superposition."
In the previous diagram, we found that there are distinct lines corresponding to dimensionality of: ¾ (tetrahedron), ⅔ (triangle), ½ (antipodal pair), ⅖ (pentagon), ⅜ (square antiprism), and 0 (feature not learned). We believe there would also be a 1 (dedicated dimension for a feature) line if not for the fact that basis features are indistinguishable from other directions in the dense regime.
Several of these configurations may jump out as solutions to the famous Thomson problem. (In particular, square antiprisms are much less famous than cubes and are primarily of note for their role in molecular geometry due to being a Thomson problem solution.) As we saw earlier, there is a very real sense in which our model can be understood as solving a generalized version of the Thomson problem. When our model chooses to represent a feature, the feature is embedded as a point on an
A second clue as to what's going on is that there are lines for the Thomson solutions which are uniform polyhedra (e.g. tetrahedron), but there seem to be split lines where we'd expect to see non-uniform solutions (e.g. instead of a ⅗ line for triangular bipyramids, we see a co-occurrence of points at ⅔ for triangles and points at ½ for antipodal pairs). In a uniform polyhedron, all vertices have the same geometry, and so if we embed features as them each feature has the same dimensionality. But if we embed features as a non-uniform polyhedron, different features will have more or less interference with others.
In particular, many of the Thomson solutions can be understood as tegum products (an operation which constructs polytopes by embedding two polytopes in orthogonal subspaces) of smaller uniform polytopes. (In the earlier graph visualizations of feature geometry, two subgraphs are disconnected if and only if they are in different tegum factors.) As a result, we should expect their dimensionality to actually correspond to the underlying factor uniform polytopes.
This also suggests a possible reason why we observe 3D Thomson problem solutions, despite the fact that we're actually studying a higher dimensional version of the problem. Just as many 3D Thomson solutions are tegum products of 2D and 1D solutions, perhaps higher dimensional solutions are often tegum products of 1D, 2D, and 3D solutions.
The orthogonality of factors in tegum products has interesting implications. For the purposes of superposition, it means that there can't be any "interference" across tegum-factors. This may be preferred by the toy model: having many features interfere simultaneously could be really bad for it. (See related discussion in our earlier mathematical analysis.)
At this point, it's worth making explicit that there's a correspondence between polytopes and symmetric, positive semi-definite, low-rank matrices (i.e. matrices of the form WᵀW).
In some ways, the correspondence is trivial. If one has a rank-
Put another way, there's an exact correspondence between polytopes and strategies for superposition. For example, every strategy for putting three features in superposition in a 2-dimensional space corresponds to a triangle, and every triangle corresponds to such a strategy. From this perspective, it doesn't seem surprising that if we have three equally important and equally sparse features, the optimal strategy is an equilateral triangle.
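As a tiny worked example of this correspondence (the numbers are computed here, not taken from our experiments): three equally spaced unit vectors in two dimensions (the equilateral-triangle strategy) give a symmetric, positive semi-definite, rank-2 Gram matrix with −½ in every off-diagonal entry, which is exactly the interference pattern of that strategy.

```python
import numpy as np

# Three equally spaced unit vectors in 2D: the equilateral-triangle strategy.
angles = np.array([0, 2 * np.pi / 3, 4 * np.pi / 3])
W = np.stack([np.cos(angles), np.sin(angles)])   # shape (2, 3); columns are features

# The corresponding symmetric, positive semi-definite, rank-2 matrix W^T W:
print(np.round(W.T @ W, 3))   # 1.0 on the diagonal, -0.5 everywhere else
```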
This correspondence also goes the other direction. Suppose we have a rank
In fact, given such a set of orthogonal vectors, we can construct a polytope by starting with
So far, this section has focused on the geometry of uniform superposition, where all features are of equal importance, equal sparsity, and independent. The model is essentially solving a variant of the Thomson problem. Because all features are the same, solutions corresponding to uniform polyhedra get especially low loss. In this subsection, we'll study non-uniform superposition, where features are somehow not uniform. They may vary in importance and sparsity, or have a correlational structure that makes them not independent. This distorts the uniform geometry we saw earlier.
In practice, it seems like superposition in real neural networks will be non-uniform, so developing an understanding of it seems important. Unfortunately, we're far from a comprehensive theory of the geometry of non-uniform superposition at this point. As a result, the goal of this section will merely be to highlight some of the more striking phenomena we observe:
We attempt to illustrate these phenomena with some representative experiments below.
The simplest kind of non-uniform superposition is to vary one feature and leave the others uniform. As an experiment, let's consider a setup where we represent
If we make it sufficiently sparse, there's a phase change, and it collapses from a pentagon to a pair of digons with the sparser point at zero. The phase change corresponds to loss curves corresponding to the two different geometries crossing over. (This observation allows us to directly confirm that it is genuinely a first order phase change.)
To visualize the solutions, we canonicalize them, rotating them to align with each other in a consistent manner.
These results seem to suggest that, at least in some cases, non-uniform superposition can be understood as a deformation of uniform superposition and jumping between uniform superposition configurations rather than a totally different regime. Since uniform superposition has a lot of understandable structure, but real world superposition is almost certainly non-uniform, this seems very promising!
The reason pentagonal solutions are not on the unit circle is because models reduce the effect of positive interference, setting a slight negative bias to cut off noise and setting their weights to
A note for reimplementations: optimizing with a two-dimensional hidden space makes this easier to study, but the actual optimization turns out to be really challenging for gradient descent – a lot harder than even just having three dimensions. Getting clean results required fitting each model multiple times and taking the solution with the lowest loss. However, there's a silver lining to this: visualizing the sub-optimal solutions on a scatter plot as above allows us to see the loss curves for different geometries and gain greater insight into the phase change.
A more complicated form of non-uniform superposition occurs when there are correlations between features. This seems essential for understanding superposition in the real world, where many features are correlated or anti-correlated.
For example, one very pragmatic question to ask is whether we should expect polysemantic neurons to group the same features together across models. If the groupings were random, you could use this to detect polysemantic neurons, by comparing across models! However, we'll see that correlational structure strongly influences which features are grouped together in superposition.
The behavior seems to be quite nuanced, with a kind of "order of preferences" for how correlated features behave in superposition. The model ideally represents correlated features orthogonally, in separate tegum factors with no interactions between them. When that fails, it prefers to arrange them so that they're as close together as possible – it prefers positive interference between correlated features over negative interference. Finally, when there isn't enough space to represent all the correlated features, it will collapse them and represent their principal component instead! Conversely, when features are anti-correlated, models prefer to have them interfere, especially with negative interference. We'll demonstrate this with a few experiments below.
Throughout this section we'll refer to "correlated feature sets" and "anticorrelated feature sets".
Correlated Feature Sets. Our correlated feature sets can be thought of as "bundles" of co-occurring features. One can imagine a highly idealized version of what might happen in an image classifier: there could be a bundle of features used to identify animals (fur, ears, eyes) and another bundle used to identify buildings (corners, windows, doors). Features from one of these bundles are likely to appear together. Mathematically, we represent this by linking the choice of whether all the features in a correlated feature set are zero or not together. Recall that we originally defined our synthetic distribution to have features be zero with probability
Anticorrelated Feature Sets. One could also imagine anticorrelated features which are extremely unlikely to occur together. To simulate these, we'll have anticorrelated feature sets where only one feature in the set can be active at a time. To simulate this, we'll have the feature set be entirely zero with probability
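A minimal NumPy sketch of both samplers as we read these definitions (the exact probabilities in the anticorrelated case are one reasonable parameterization, since the value is left unspecified above):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_correlated_set(batch, set_size, S):
    # One shared coin per example: either the whole bundle of features may
    # activate (each uniform on [0, 1]), or the whole bundle is zero.
    x = rng.uniform(0, 1, (batch, set_size))
    on = rng.uniform(0, 1, (batch, 1)) < (1 - S)
    return x * on

def sample_anticorrelated_set(batch, set_size, S):
    # The whole set is zero with probability S; otherwise exactly one member,
    # chosen uniformly at random, is active.
    x = np.zeros((batch, set_size))
    on = rng.uniform(0, 1, batch) < (1 - S)
    which = rng.integers(0, set_size, batch)
    x[np.arange(batch), which] = rng.uniform(0, 1, batch) * on
    return x
```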
For our initial investigation, we simply train a number of small toy models with correlated and anti-correlated features and observe what happens. To make this easy to study, we limit ourselves to the
It turns out that the tendency of models to arrange correlated features to be orthogonal is actually quite a strong phenomenon. In particular, for larger models, it seems to generate a kind of "local almost-orthogonal basis" where, even though the model as a whole is in superposition, the correlated feature sets considered in isolation are (nearly) orthogonal and can be understood as having very little superposition.
To investigate this, we train a larger model with two sets of correlated features and visualize
If this result holds in real neural networks, it suggests we might be able to make a kind of "local non-superposition" assumption, where for certain sub-distributions we can assume that the activating features are not in superposition. This could be a powerful result, allowing us to confidently use methods such as PCA which might not be principled to generally use in the context of superposition.
One of the most interesting properties is that there seems to be a trade off with Principal Components Analysis (PCA) and superposition. If there are two correlated features
As an experiment, we consider six features, organized into three sets of correlated pairs. Features in each correlated pair are represented by a given color (red, green, and blue). The correlation is created by having both features always activate together – they're either both zero or neither zero. (The exact non-zero values they take when they activate is uncorrelated.)
As we vary the sparsity of the features, we find that in the very sparse regime, we observe superposition as expected, with features arranged in a hexagon and correlated features side-by-side. As we decrease sparsity, the features progressively "collapse" into their principal components. In very dense regimes, the solution becomes equivalent to PCA.
These results seem to hint that PCA and superposition are in some sense complementary strategies which trade off with one another. As features become more correlated, PCA becomes a better strategy. As features become sparser, superposition becomes a better strategy. When features are both sparse and correlated, mixtures of each strategy seem to occur. It would be nice to more deeply understand this space of tradeoffs.
It's also interesting to think about this in the context of continuous equivariant features, such as features which occur in different rotations.
The focus of this paper is how superposition contributes to the functioning of fully trained neural networks, but as a brief detour it's interesting to ask how our toy models – and the resulting superposition – evolve over the course of training.
There are several reasons why these models seem like a particularly interesting case for studying learning dynamics. Firstly, unlike most neural networks, the fully trained models converge to a simple but non-trivial structure that rhymes with an emerging thread of evidence that neural network learning dynamics might have geometric weight structure that we can understand. One might hope that understanding the final structure would make it easier for us to understand the evolution over training. Secondly, superposition hints at surprisingly discrete structure (regular polytopes of all things!). We'll find that the underlying learning dynamics are also surprisingly discrete, continuing an emerging trend of evidence that neural network learning might be less continuous than it seems. Finally, since superposition has significant implications for interpretability, it would be nice to understand how it emerges over training – should we expect models to use superposition early on, or is it something that only emerges later in training, as models struggle to fit more features in?
Unfortunately, we aren't able to give these questions the detailed investigation they deserve within the scope of this paper. Instead, we'll limit ourselves to a couple particularly striking phenomena we've noticed, leaving more detailed investigation for future work.
Perhaps the most striking phenomenon we've noticed is that the learning dynamics of toy models with large numbers of features appear to be dominated by "energy level jumps" where features jump between different feature dimensionalities. (Recall that a feature's dimensionality is the fraction of a dimension dedicated to representing a feature.)
Let's consider the problem setup we studied when investigating the geometry of uniform superposition in the previous section, where we have a large number of features of equal importance and sparsity. As we saw previously, the features ultimately arrange themselves into a small number of polytopes with fractional dimensionalities.
A natural question to ask is what happens to these feature dimensionalities over the course of training. Let's pick one model where all the features converge into digons and observe. In the first plot, each colored line corresponds to the dimensionality of a single feature. The second plot shows how the loss curve changes over the same duration.
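One simple way to observe these jumps is to record per-feature dimensionality as training proceeds. A sketch, assuming the training setup from the earlier PyTorch snippet and the feature_dimensionality helper defined above (the logging cadence is arbitrary):

```python
import matplotlib.pyplot as plt

history = []
for step in range(10_000):
    x = batch()
    x_hat = torch.relu((x @ W.T) @ W + b)
    loss = (importance * (x - x_hat) ** 2).sum(dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        history.append(feature_dimensionality(W.detach().numpy()))

# Each line tracks one feature's dimensionality over training; plateaus and
# jumps between them correspond to the "energy level" transitions described above.
plt.plot(history)
plt.xlabel("training step / 100")
plt.ylabel("feature dimensionality")
plt.show()
```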