Transformer Circuits Thread

Toy Models of Superposition

Authors

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg,Christopher Olah

Affiliations

Anthropic, Harvard

Published

Sept 14, 2022
* Core Research Contributor; ‡ Correspondence to colah@anthropic.com; Author contributions statement below.


It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?

In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition . When features are sparse, superposition allows compression beyond what a linear model would do, at the cost of "interference" that requires nonlinear filtering.

Consider a toy model where we train an embedding of five features of varying importance (where “importance” is a scalar multiplier on mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features. With dense features, the model learns to represent an orthogonal basis of the most important two features (similar to what Principal Component Analysis might give us), and the other three features are not represented. But if we make the features sparse, this changes:

This figure and a few others can be reproduced using the toy model framework Colab notebook in our Github repo

Not only can models store additional features in superposition by tolerating some interference, but we'll show that, at least in certain limited cases, models can perform computation while in superposition. (In particular, we'll show that models can put simple circuits computing the absolute value function in superposition.) This leads us to hypothesize that the neural networks we observe in practice are in some sense noisily simulating larger, highly sparse networks. In other words, it's possible that models we train can be thought of as doing “the same thing as” an imagined much-larger model, representing the exact same features but with no interference.

Feature superposition isn't a novel idea. A number of previous interpretability papers have considered it , and it's very closely related to the long-studied topic of compressed sensing in mathematics , as well as the ideas of distributed, dense, and population codes in neuroscience and deep learning . What, then, is the contribution of this paper?

For interpretability researchers, our main contribution is providing a direct demonstration that superposition occurs in artificial neural networks given a relatively natural setup, suggesting this may also occur in practice. That is, we show a case where interpreting neural networks as having sparse structure in superposition isn't just a useful post-hoc interpretation, but actually the "ground truth" of a model. We offer a theory of when and why this occurs, revealing a  phase diagram for superposition. This explains why neurons are sometimes "monosemantic" responding to a single feature, and sometimes "polysemantic" responding to many unrelated features. We also discover that, at least in our toy model, superposition exhibits complex geometric structure.

But our results may also be of broader interest. We find preliminary evidence that superposition may be linked to adversarial examples and grokking, and might also suggest a theory for the performance of mixture of experts models. More broadly, the toy model we investigate has unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar to the fractional quantum Hall effect in physics, among other striking phenomena. We originally investigated the subject to gain understanding of cleanly-interpretable neurons in larger models, but we've found these toy models to be surprisingly interesting in their own right.

Key Results From Our Toy Models

In our toy models, we are able to demonstrate that:

Our toy models are simple ReLU networks, so it seems fair to say that neural networks exhibit these properties in at least some regimes, but it's very unclear what to generalize to real networks.







Definitions and Motivation: Features, Directions, and Superposition

In our work, we often think of neural networks as having features of the input represented as directions in activation space. This isn't a trivial claim. It isn't obvious what kind of structure we should expect neural network representations to have. When we say something like "word embeddings have a gender direction" or "vision models have curve detector neurons", one is implicitly making strong claims about the structure of network representations.

Despite this, we believe this kind of "linear representation hypothesis" is supported both by significant empirical findings and theoretical arguments. One might think of this as two separate properties, which we'll explore in more detail shortly:

If we hope to reverse engineer neural networks, we need a property like decomposability. Decomposability is what allows us to reason about the model without fitting the whole thing in our heads! But it's not enough for things to be decomposable: we need to be able to access the decomposition somehow. In order to do this, we need to identify the individual features within a representation. In a linear representation, this corresponds to determining which directions in activation space correspond to which independent features of the input.

Sometimes, identifying feature directions is very easy because features seem to correspond to neurons. For example, many neurons in the early layers of InceptionV1 clearly correspond to features (e.g. curve detector neurons ). Why is it that we sometimes get this extremely helpful property, but in other cases don't? We hypothesize that there are really two countervailing forces driving this:

Superposition has been hypothesized in previous work , and in some cases, assuming something like superposition has been shown to help find interpretable structure . However, we're not aware of feature superposition having been unambiguously demonstrated to occur in neural networks before ( demonstrates a closely related phenomenon of model superposition). The goal of this paper is to change that, demonstrating superposition and exploring how it interacts with privileged bases. If superposition occurs in networks, it deeply influences what approaches to interpretability research make sense, so unambiguous demonstration seems important.

The goal of this section will be to motivate these ideas and unpack them in detail.

It's worth noting that many of the ideas in this section have close connections to ideas in other lines of interpretability research (especially disentanglement), neuroscience (distributed representations, population codes, etc), compressed sensing, and many other lines of work. This section will focus on articulating our perspective on the problem. We'll discuss these other lines of work in detail in Related Work.

Empirical Phenomena

When we talk about "features" and how they're represented, this is ultimately theory building around several observed empirical phenomena. Before describing how we conceptualize those results, we'll simply describe some of the major results motivating our thinking:

As a result, we tend to think of neural network representations as being composed of features which are represented as directions. We'll unpack this idea in the following sections.

What are Features?

Our use of the term "feature" is motivated by the interpretable properties of the input we observe neurons (or word embedding directions) responding to. There's a rich variety of such observed properties! In the context of vision, these have ranged from low-level neurons like curve detectors and high-low frequency detectors, to more complex neurons like oriented dog-head detectors or car detectors, to extremely abstract neurons corresponding to famous people, emotions, geographic regions, and more . In language models, researchers have found word embedding directions such as a male-female or singular-plural direction , low-level neurons disambiguating words that occur in multiple languages, much more abstract neurons, and "action" output neurons that help produce certain words . We'd like to use the term "feature" to encompass all these properties.

But even with that motivation, it turns out to be quite challenging to create a satisfactory definition of a feature. Rather than offer a single definition we're confident about, we consider three potential working definitions:

We've written this paper with the final "neurons in sufficiently large models" definition in mind. But we aren't overly attached to it, and actually think it's probably important to not prematurely attach to a definition. A famous book by Lakatos illustrates the importance of uncertainty about definitions and how important rethinking definitions often is in the context of research.

Features as Directions

As we've mentioned in previous sections, we generally think of features as being represented by directions. For example, in word embeddings, "gender" and "royalty" appear to correspond to directions, allowing arithmetic like V("king") - V("man") + V("woman") = V("queen") . Examples of interpretable neurons are also cases of features as directions, since the amount a neuron activates corresponds to a basis direction in the representation.

Let's call a neural network representation linear if features correspond to directions in activation space. In a linear representation, each feature f_i has a corresponding representation direction W_i. The presence of multiple features f_1, f_2… activating with values x_{f_1}, x_{f_2}… is represented by x_{f_1}W_{f_1} + x_{f_2}W_{f_2}.... To be clear, the features being represented are almost certainly nonlinear functions of the input. It's only the map from features to activation vectors which is linear. Note that whether something is a linear representation depends on what you consider to be the features.
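As a tiny concrete illustration of this encoding (our own sketch, with made-up numbers rather than anything from a trained model), an activation vector in a linear representation is just the weighted sum of per-feature directions:

```python
# Minimal illustration of a linear representation (toy numbers, not from the paper):
# the activation is x_{f_1} W_1 + x_{f_2} W_2 + x_{f_3} W_3.
import numpy as np

W = np.array([[1.0, 0.0, 0.7],    # column W_i = direction for feature f_i
              [0.0, 1.0, 0.7]])   # 3 features embedded in a 2-dimensional activation space
x_features = np.array([0.5, 0.0, 1.2])   # feature activations x_{f_i}
activation = W @ x_features              # point in activation space representing these features
```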

We don't think it's a coincidence that neural networks empirically seem to have linear representations. Neural networks are built from linear functions interspersed with non-linearities. In some sense, the linear functions are the vast majority of the computation (for example, as measured in FLOPs). Linear representations are the natural format for neural networks to represent information in! Concretely, there are three major benefits:

It is possible to construct non-linear representations, and retrieve information from them, if you use multiple layers (although even these examples can be seen as linear representations with more exotic features). We provide an example in the appendix. However, our intuition is that non-linear representations are generally inefficient for neural networks.

One might think that a linear representation can only store as many features as it has dimensions, but it turns out this isn't the case! We'll see that the phenomenon we call superposition will allow models to store more features – potentially many more features – in linear representations.

For discussion on how this view of features squares with a conception of features as being multidimensional manifolds, see the appendix “What about Multidimensional Features?”.

Privileged vs Non-privileged Bases

Even if features are encoded as directions, a natural question to ask is which directions? In some cases, it seems useful to consider the basis directions, but in others it doesn't. Why is this?

When researchers study word embeddings, it doesn't make sense to analyze basis directions. There would be no reason to expect a basis dimension to be different from any other possible direction. One way to see this is to imagine applying some random linear transformation M to the word embedding, and apply M^{-1} to the following weights. This would produce an identical model where the basis dimensions are totally different. This is what we mean by a non-privileged basis. Of course, it's possible to study activations without a privileged basis, you just need to identify interesting directions to study somehow, such as creating a gender direction in a word embedding by taking the difference vector between "man" and "woman".
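The sketch below illustrates this symmetry argument numerically, with hypothetical shapes and names: applying a random invertible M to the embedding and M^{-1} to the following weights leaves the composed function unchanged, so no basis direction of the embedding is special.

```python
# Sketch of the "non-privileged basis" argument (toy shapes, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 8))   # a toy "word embedding" matrix: 50 tokens, 8 dimensions
U = rng.normal(size=(8, 4))    # the weights that consume the embedding
M = rng.normal(size=(8, 8))    # a random linear transformation (almost surely invertible)

out_original    = E @ U
out_transformed = (E @ M) @ (np.linalg.inv(M) @ U)   # rotate the embedding, un-rotate the next layer
assert np.allclose(out_original, out_transformed)    # identical model, completely different basis
```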

But many neural network layers are not like this. Often, something about the architecture makes the basis directions special, such as applying an activation function. This "breaks the symmetry", making those directions special, and potentially encouraging features to align with the basis dimensions. We call this a privileged basis, and call the basis directions "neurons." Often, these neurons correspond to interpretable features.

From this perspective, it only makes sense to ask if a neuron is interpretable when it is in a privileged basis. In fact, we typically reserve the word "neuron" for basis directions which are in a privileged basis. (See longer discussion here.)

Note that having a privileged basis doesn't guarantee that features will be basis-aligned – we'll see that they often aren't! But it's a minimal condition for the question to even make sense.

The Superposition Hypothesis

Even when there is a privileged basis, it's often the case that neurons are "polysemantic", responding to several unrelated features. One explanation for this is the superposition hypothesis. Roughly, the idea of superposition is that neural networks "want to represent more features than they have neurons", so they exploit a property of high-dimensional spaces to simulate a model with many more neurons.

Several results from mathematics suggest that something like this might be plausible:

Concretely, in the superposition hypothesis, features are represented as almost-orthogonal directions in the vector space of neuron outputs. Since the features are only almost-orthogonal, one feature activating looks like other features slightly activating. Tolerating this "noise" or "interference" comes at a cost. But for neural networks with highly sparse features, this cost may be outweighed by the benefit of being able to represent more features! (Crucially, sparsity greatly reduces the costs since sparse features are rarely active to interfere with each other, and non-linear activation functions create opportunities to filter out small amounts of noise.)
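As a rough numerical illustration of the "almost-orthogonal" point (our own sketch, not an experiment from the paper), random unit vectors in a moderately high-dimensional space already have small pairwise dot products, so far more directions than dimensions can coexist with limited interference:

```python
# Pairwise "interference" between random unit vectors in d dimensions scales like 1/sqrt(d).
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 10_000                       # 10,000 candidate feature directions in 512 dimensions
V = rng.normal(size=(d, n))
V /= np.linalg.norm(V, axis=0)           # normalize each column to unit length
dots = V[:, :200].T @ V[:, 200:400]      # a sample of pairwise dot products
print(np.abs(dots).mean(), np.abs(dots).max())   # typically ~0.035 on average, well under 0.2 at worst
```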

One way to think of this is that a small neural network may be able to noisily "simulate" a sparse larger model:

Although we've described superposition with respect to neurons, it can also occur in representations with an unprivileged basis, such as a word embedding. Superposition simply means that there are more features than dimensions.

Summary: A Hierarchy of Feature Properties

The ideas in this section might be thought of in terms of four progressively more strict properties that neural network representations might have.

The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter (non-superposition and basis-aligned) are properties we believe only sometimes occur.







Demonstrating Superposition

If one takes the superposition hypothesis seriously, a natural first question is whether neural networks can actually noisily represent more features than they have neurons. If they can't, the superposition hypothesis may be comfortably dismissed.

The intuition from linear models would be that this isn't possible: the best a linear model can do is to store the principal components. But we'll see that adding just a slight nonlinearity can make models behave in a radically different way! This will be our first demonstration of superposition. (It will also be an object lesson in the complexity of even very simple neural networks.)

Experiment Setup

Our goal is to explore whether a neural network can project a high dimensional vector x \in R^n into a lower dimensional vector h\in R^m and then recover it. (This experiment setup could also be viewed as an autoencoder reconstructing x.)

The Feature Vector (x)

We begin by describing the high-dimensional vector x: the activations of our idealized, disentangled larger model. We call each element x_i a "feature" because we're imagining features to be perfectly aligned with neurons in the hypothetical larger model. In a vision model, this might be a Gabor filter, a curve detector, or a floppy ear detector. In a language model, it might correspond to a token referring to a specific famous person, or a clause being a particular kind of description.

Since we don't have any ground truth for features, we need to create synthetic data for x which simulates any important properties we believe features have from the perspective of modeling them. We make three major assumptions:

Concretely, our synthetic data is defined as follows: The input vectors x are synthetic data intended to simulate the properties we believe the true underlying features of our task have. We consider each dimension x_i to be a "feature". Each one has an associated sparsity S_i and importance I_i. We let x_i=0 with probability S_i; otherwise it is uniformly distributed between [0,1]. (The choice to have features distributed uniformly is arbitrary; an exponential or power law distribution would also be very natural.) In practice, we focus on the case where all features have the same sparsity, S_i = S.
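A minimal sketch of this synthetic distribution (our own code with hypothetical names; the notebook in the paper's GitHub repo is the authoritative version): each x_i is zero with probability S and otherwise uniform on [0, 1].

```python
import numpy as np

def sample_features(batch_size: int, n_features: int, sparsity: float,
                    rng: np.random.Generator) -> np.ndarray:
    """Return (batch_size, n_features) synthetic activations: x_i = 0 w.p. S, else U[0, 1]."""
    values = rng.uniform(0.0, 1.0, size=(batch_size, n_features))
    active = rng.uniform(size=(batch_size, n_features)) >= sparsity   # active with probability 1 - S
    return values * active

rng = np.random.default_rng(0)
x = sample_features(batch_size=1024, n_features=20, sparsity=0.9, rng=rng)
```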

The Model (x \to x')

We will actually consider two models, which we motivate below. The first "linear model" is a well understood baseline which does not exhibit superposition. The second "ReLU output model" is a very simple model which does exhibit superposition. The two models vary only in the final activation function.

Linear Model
h = Wx
x' = W^Th + b
x' = W^TWx + b
ReLU Output Model
h = Wx
x' = \text{ReLU}(W^Th + b)
x' = \text{ReLU}(W^TWx + b)
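A sketch of both models in PyTorch (our guess at a minimal implementation, not the authors' notebook code; the initialization scale is arbitrary). The two models share the tied down-projection W and the bias b, and differ only in the final ReLU.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """x' = W^T W x + b (linear model) or ReLU(W^T W x + b) (ReLU output model)."""
    def __init__(self, n_features: int, n_hidden: int, use_relu: bool = True):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)  # columns W_i are feature directions
        self.b = nn.Parameter(torch.zeros(n_features))
        self.use_relu = use_relu

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W.T             # h = W x        (project n features down to m dimensions)
        out = h @ self.W + self.b    # x' = W^T h + b (recover with the transposed weights)
        return torch.relu(out) if self.use_relu else out
```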

Why these models?

The superposition hypothesis suggests that each feature in the higher-dimensional model corresponds to a direction in the lower-dimensional space. This means we can represent the down projection as a linear map h=Wx. Note that each column W_i corresponds to the direction in the lower-dimensional space that represents a feature x_i.

To recover the original vector, we'll use the transpose of the same matrix, W^T. This has the advantage of avoiding any ambiguity regarding what direction in the lower-dimensional space really corresponds to a feature. It also seems relatively mathematically principled (recall that W^T = W^{-1} if W is orthonormal; although W can't be literally orthonormal, our intuition from compressed sensing is that it will be "almost orthonormal" in the sense of Candes & Tao), and it empirically works.

We also add a bias. One motivation for this is that it allows the model to set features it doesn't represent to their expected value. But we'll see later that the ability to set a negative bias is important for superposition for a second set of reasons – roughly, it allows models to discard small amounts of noise.

The final step is whether to add an activation function. This turns out to be critical to whether superposition occurs. In a real neural network, when features are actually used by the model to do computation, there will be an activation function, so it seems principled to include one at the end.

The Loss

Our loss is weighted mean squared error weighted by the feature importances, I_i, described above:
L = \sum_x \sum_i I_i (x_i - x'_i)^2
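A sketch of this importance-weighted loss in code (the importance schedule shown, I_i = 0.7^i for n = 20 features, is the one used in the small-model experiments below and is an assumption here):

```python
import torch

def weighted_mse(x: torch.Tensor, x_pred: torch.Tensor, importance: torch.Tensor) -> torch.Tensor:
    """Sum over features of I_i * (x_i - x'_i)^2, averaged over the batch."""
    return (importance * (x - x_pred) ** 2).sum(dim=-1).mean()

importance = torch.tensor([0.7 ** i for i in range(20)])   # I_i = 0.7^i
```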

Basic Results

Our first experiment will simply be to train a few ReLU output models with different sparsity levels and visualize the results. (We'll also train a linear model – if optimized well enough, the linear model solution does not depend on sparsity level.)
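A minimal training sketch for one such model (this assumes the ToyModel class sketched earlier; the optimizer, step count, learning rate, and batch size are guesses rather than the paper's settings):

```python
import torch

def train(model, importance, sparsity, steps=5_000, batch_size=1024, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    n_features = model.W.shape[1]
    for _ in range(steps):
        x = torch.rand(batch_size, n_features)
        x = x * (torch.rand_like(x) >= sparsity)                 # zero each feature with probability S
        x_pred = model(x)
        loss = (importance * (x - x_pred) ** 2).sum(-1).mean()   # importance-weighted MSE
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```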

The main question is how to visualize the results. The simplest way is to visualize W^TW (a features by features matrix) and b (a feature length vector). Note that features are arranged from most important to least, so the results have a fairly nice structure. Here's an example of what this type of visualization might look like, for a small model (n=20, m=5) which behaves in the "expected linear model-like" way, only representing as many features as it has dimensions:
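One simple way to produce this kind of plot (our own matplotlib sketch; W and b are assumed to be NumPy arrays, e.g. model.W.detach().numpy() and model.b.detach().numpy()):

```python
import matplotlib.pyplot as plt

def show_WtW_and_b(W, b):
    """Plot the feature-by-feature Gram matrix W^T W alongside the bias vector b."""
    WtW = W.T @ W
    fig, axes = plt.subplots(1, 2, figsize=(6, 3), gridspec_kw={"width_ratios": [4, 1]})
    axes[0].imshow(WtW, cmap="RdBu", vmin=-1, vmax=1)
    axes[0].set_title("$W^TW$")
    axes[1].imshow(b[:, None], cmap="RdBu", vmin=-1, vmax=1)
    axes[1].set_title("$b$")
    plt.show()
```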

But the thing we really care about is this hypothesized phenomenon of superposition – does the model represent "extra features" by storing them non-orthogonally? Is there a way to get at it more explicitly? Well, one question is just how many features the model learns to represent. For any feature, whether or not it is represented is determined by ||W_i||, the norm of its embedding vector.

We'd also like to understand whether a given feature shares its dimension with other features. For this, we calculate \sum_{j\neq i} (\hat{W_i}\cdot W_j)^2, projecting all other features onto the direction vector of W_i. It will be 0 if the feature is orthogonal to other features (dark blue below). On the other hand, values \geq 1 mean that there is some group of other features which can activate W_i as strongly as feature i itself!
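Both diagnostics are easy to compute from W; a sketch (column i of W is the embedding of feature i):

```python
import numpy as np

def feature_norms(W: np.ndarray) -> np.ndarray:
    """||W_i|| for each feature: near 1 if the feature is represented, near 0 if not."""
    return np.linalg.norm(W, axis=0)

def interference(W: np.ndarray) -> np.ndarray:
    """sum_{j != i} (W_i_hat . W_j)^2: how strongly other features project onto feature i's direction."""
    W_hat = W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-9)
    proj = W_hat.T @ W                   # entry (i, j) is W_i_hat . W_j
    np.fill_diagonal(proj, 0.0)          # exclude the j = i term
    return (proj ** 2).sum(axis=1)
```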

We can visualize the model we looked at previously this way:

Now that we have a way to visualize models, we can start to actually do experiments.  We'll start by considering models with only a few features (n=20, m=5, I_i=0.7^i). This will make it easy to visually see what happens. We consider a linear model, and several ReLU-output models trained on data with different feature sparsity levels:

As our standard intuitions would expect, the linear model always learns the top-m most important features, analogous to learning the top principal components. The ReLU output model behaves the same on dense features (1-S=1.0), but as sparsity increases, we see superposition emerge. The model represents more features by having them not be orthogonal to each other. It starts with less important features, and gradually affects the most important ones. Initially this involves arranging them in antipodal pairs, where one feature’s representation vector is exactly the negative of the other’s, but we observe it gradually transition to other geometric structures as it represents more features.  We'll discuss feature geometry further in the later section, The Geometry of Superposition.

The results are qualitatively similar for models with more features and hidden dimensions. For example, if we consider a model with m=20 hidden dimensions and n=80 features (with importance increased to I_i=0.9^i to account for having more features), we observe essentially a rescaled version of the visualization above:

Mathematical Understanding

In the previous section, we observed a surprising empirical result: adding a ReLU to the output of our model allowed a radically different solution – superposition – which doesn't occur in linear models.

The model where it occurs is still quite mathematically simple. Can we analytically understand why superposition is occurring? And for that matter, why does adding a single non-linearity make things so different from the linear model case? It turns out that we can get a fairly satisfying answer, revealing that our model is governed by balancing two competing forces – feature benefit and interference – which will be useful intuition going forwards. We'll also discover a connection to the famous Thomson Problem in chemistry.

Let's start with the linear case. This is well understood by prior work! If one wants to understand why linear models don't exhibit superposition, the easy answer is to observe that linear models essentially perform PCA. But this isn't fully satisfying: if we set aside all our knowledge and intuition about linear functions for a moment, why exactly is it that superposition can't occur?

A deeper understanding can come from the results of Saxe et al., who study the learning dynamics of linear neural networks – that is, neural networks without activation functions. Such models are ultimately linear functions, but because they are the composition of multiple linear functions the dynamics are potentially quite complex. The punchline of their paper reveals that neural network weights can be thought of as optimizing a simple closed-form solution. We can tweak their problem to be a bit more similar to our linear case (we have the model be x' = W^TWx, but leave x Gaussian-distributed as in Saxe), revealing the following equation:
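Under those assumptions (the model x' = W^TWx + b with the x_i i.i.d. unit-variance Gaussians, and our importance-weighted squared error), the expected loss works out to roughly the following; this is our own reconstruction of the closed form rather than a quotation of Saxe et al.:

L = \sum_i I_i \left[ \left(1 - ||W_i||^2\right)^2 + \sum_{j \neq i} (W_i \cdot W_j)^2 \right] + \sum_i I_i b_i^2

The (1 - ||W_i||^2)^2 term rewards representing each feature well, while the \sum_{j \neq i} (W_i \cdot W_j)^2 term penalizes non-orthogonal embeddings.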

The Saxe results reveal that there are fundamentally two competing forces which control learning dynamics in the considered model. Firstly, the model can attain a better loss by representing more features (we've labeled this "feature benefit"). But it also gets a worse loss if it represents more than it can fit orthogonally, due to "interference" between features. (As a brief aside, it's interesting to contrast the linear model interference, \sum_{i\neq j}|W_i \cdot W_j|^2, to the notion of coherence in compressed sensing, \max_{i\neq j}|W_i \cdot W_j|. We can see them as the L^2 and L^\infty norms of the same vector.) In fact, this makes it never worthwhile for the linear model to represent more features than it has dimensions. To prove that superposition is never optimal in a linear model, solve for the gradient of the loss being zero or consult Saxe et al.

Can we achieve a similar kind of understanding for the ReLU output model? Concretely, we'd like to understand L=\int_x ||I(x-\text{ReLU}(W^TWx+b))||^2 d\textbf{p}(x) where x is distributed such that x_i=0 with probability S.

The integral over x decomposes into a term for each sparsity pattern according to the binomial expansion of ((1\!-\!S)+S)^n. We can group terms of the sparsity together, rewriting the loss as L = (1\!-\!S)^n L_n +\ldots+ (1\!-\!S)S^{n-1} L_1+ S^n L_0, with each L_k corresponding to the loss when the input is a k-sparse vector. Note that as S\to 1, L_1 and L_0 dominate. The L_0 term, corresponding to the loss on a zero vector, is just a penalty on positive biases, \sum_i \text{ReLU}(b_i)^2. So the interesting term is L_1, the loss on 1-sparse vectors:
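Writing this term out explicitly is a useful exercise; the following is our own sketch of L_1 under the setup above (a single feature i active with value x \sim U[0,1], all other features zero, and the overall combinatorial constant dropped):

L_1 \propto \sum_i \int_0^1 \left[ I_i \left(x - \text{ReLU}(||W_i||^2 x + b_i)\right)^2 + \sum_{j \neq i} I_j \, \text{ReLU}\left((W_j \cdot W_i)\, x + b_j\right)^2 \right] dx

The first summand is a feature benefit term, and the second is the interference that an active feature i causes on every other output j; it vanishes whenever W_j \cdot W_i\, x + b_j \leq 0, which is the sense in which negative interference is free below.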

This new equation is vaguely similar to the famous Thomson problem in chemistry. In particular, if we assume uniform importance and that there are a fixed number of features with ||W_i|| = 1 and the rest have ||W_i|| = 0, and that b_i = 0, then the feature benefit term is constant and the interference term becomes a generalized Thomson problem – we're just packing points on the surface of the sphere with a slightly unusual energy function. (We'll see this can be a productive analogy when we resume our empirical investigation in the following sections!)

Another interesting property is that ReLU makes negative interference free in the 1-sparse case. This explains why the solutions we've seen prefer to only have negative interference when possible. Further, using a negative bias can convert small positive interferences into essentially being negative interferences.

What about the terms corresponding to less sparse vectors? We leave explicitly writing these out to the reader, but the main idea is that there are multiple compounding interferences, and the "active features" can experience interference. In a later section, we'll see that features often organize themselves into sparse interference graphs such that only a small number of features interfere with another feature – it's interesting to note that this reduces the probability of compounding interference and makes the 1-sparse loss term more important relative to others.







Superposition as a Phase Change

The results in the previous section seem to suggest that there are three outcomes for a feature when we train a model: (1) the feature may simply not be learned; (2) the feature may be learned, and represented in superposition; or (3) the model may represent a feature with a dedicated dimension. The transitions between these three outcomes seem sharp. Possibly, there's some kind of phase change. Here, we use “phase change” in the generalized sense of “discontinuous change”, rather than in the more technical sense of a discontinuity arising in the limit of infinite system size.

One way to understand this better is to explore if there's something like a "phase diagram" from physics, which could help us understand when a feature is expected to be in one of these regimes.  Although we can see hints of this in our previous experiment, it's hard to really isolate what's going on because many features are changing at once and there may be interaction effects. As a result, we set up the following experiment to better isolate the effects.

As an initial experiment, we consider models with 2 features but only 1 hidden layer dimension. We still consider the ReLU output model, \text{ReLU}(W^T W x - b). The first feature has an importance of 1.0. On one axis, we vary the importance of the 2nd "extra" feature from 0.1 to 10. On the other axis, we vary the sparsity of all features from 1.0 to 0.01. We then plot whether the 2nd "extra" feature is not learned, learned in superposition, or learned and represented orthogonally. To reduce noise, we train ten models for each point and average over the results, discarding the model with the highest loss.

We can compare this to a theoretical "toy model of the toy model" where we can get closed form solutions for the loss of different weight configurations as a function of importance and sparsity. There are three natural ways to store 2 features in 1 dimension: W=[1,0] (ignore [0,1], throwing away the extra feature), W=[0,1] (ignore [1,0], throwing away the first feature to give the extra feature a dedicated dimension), and W=[1,-1] (store the features in superposition, losing the ability to represent [1,1], the combination of both features at the same time). We call this last solution “antipodal” because the two basis vectors [1, 0] and [0, 1] are mapped in opposite directions. It turns out we can analytically determine the loss for these solutions (details can be found in this notebook).
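A quick numerical sanity check of these three candidates (a Monte Carlo sketch of our own, with the bias fixed at zero for simplicity and arbitrary example values for the extra feature's importance and the sparsity; the linked notebook derives the losses in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_loss(W, importance, sparsity, n_samples=200_000):
    """Monte Carlo estimate of the importance-weighted loss for ReLU(W^T W x), with b = 0."""
    n = W.shape[1]
    x = rng.uniform(0, 1, (n_samples, n)) * (rng.uniform(size=(n_samples, n)) >= sparsity)
    x_pred = np.maximum(x @ W.T @ W, 0.0)
    return (importance * (x - x_pred) ** 2).sum(axis=1).mean()

importance = np.array([1.0, 3.0])    # extra feature 3x as important (example value)
for name, W in [("drop extra feature", np.array([[1.0, 0.0]])),
                ("drop first feature", np.array([[0.0, 1.0]])),
                ("antipodal pair",     np.array([[1.0, -1.0]]))]:
    print(name, mc_loss(W, importance, sparsity=0.9))
```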

As expected, sparsity is necessary for superposition to occur, but we can see that it interacts in an interesting way with relative feature importance. But most interestingly, there appears to be a real phase change, observed in both the empirical and theoretical diagrams! The optimal weight configuration discontinuously changes in magnitude and superposition. (In the theoretical model, we can analytically confirm that there's a first-order phase change: there's crossover between the functions, causing a discontinuity in the derivative of the optimal loss.)

We can ask this same question of embedding three features in two dimensions. This problem still has a single "extra feature" (now the third one) we can study, asking what happens as we vary its importance relative to the other two and change sparsity.

For the theoretical model, we now consider four natural solutions. We can describe solutions by asking "what feature direction did W ignore?" For example, W might just not represent the extra feature – we'll write this W \perp [0, 0, 1]. Or W might ignore one of the other features, W \perp [1, 0, 0]. But the interesting thing is that there are two ways to use superposition to make antipodal pairs. We can put the "extra feature" in an antipodal pair with one of the others (W \perp [0, 1, 1]) or put the other two features in superposition and give the extra feature a dedicated dimension (W \perp [1, 1, 0]). Details on the closed form losses for these solutions can be found in this notebook. We do not consider a last solution of putting all the features in joint superposition, W \perp [1, 1, 1].

These diagrams suggest that there really is a phase change between different strategies for encoding features. However, we'll see in the next section that there's much more complex structure this preliminary view doesn't capture.







The Geometry of Superposition

We've seen that superposition can allow a model to represent extra features, and that the number of extra features increases as we increase sparsity. In this section, we'll investigate this relationship in more detail, discovering an unexpected geometric story: features seem to organize themselves into geometric structures such as pentagons and tetrahedrons! In some ways, the structure described in this section seems "too elegant to be true" and we think there's a good chance it's at least partly idiosyncratic to the toy model we're investigating. But it seems worth investigating because if anything about this generalizes to real models, it may give us a lot of leverage in understanding their representations.

We'll start by investigating uniform superposition, where all features are identical: independent, equally important and equally sparse. It turns out that uniform superposition has a surprising connection to the geometry of uniform polytopes! Later, we'll move on to investigate non-uniform superposition, where features are not identical. It turns out that this can be understood, at least to some extent, as a deformation of uniform superposition.

Uniform Superposition

As mentioned above, we begin our investigation with uniform superposition, where all features have the same importance and sparsity. We'll see later that this case has some unexpected structure, but there's also a much more basic reason to study it: it's much easier to reason about than the non-uniform case, and has fewer variables we need to worry about in our experiments.

We'd like to understand what happens as we change feature sparsity, S. Since all features are equally important, we will assume without loss of generality (scaling the importance of all features by the same amount simply scales the loss, and does not change the optimal solutions) that each feature has importance I_i = 1. We'll study a model with n=400 features and m=30 hidden dimensions, but it turns out the number of features and hidden dimensions doesn't matter very much. In particular, it turns out that the number of input features n doesn't matter as long as it's much larger than the number of hidden dimensions, n \gg m. And it also turns out that the number of hidden dimensions doesn't really matter as long as we're interested in the ratio of features learned to hidden features. Doubling the number of hidden dimensions just doubles the number of features the model learns.

A convenient way to measure the number of features the model has learned is to look at the Frobenius norm, ||W||_F^2. Since ||W_i||^2\simeq 1 if a feature is represented and ||W_i||^2\simeq 0 if it is not, this is roughly the number of features the model has learned to represent. Conveniently, this norm is basis-independent, so it still behaves nicely in the dense regime S=0 where the feature basis isn't privileged by anything and the model represents features with arbitrary directions instead.

We'll plot D^* = m / ||W||_F^2, which we can think of as the "dimensions per feature":
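This statistic is a one-liner given W (sketch):

```python
import numpy as np

def dims_per_feature(W: np.ndarray) -> float:
    """D* = m / ||W||_F^2, with W of shape (m_hidden, n_features)."""
    m = W.shape[0]
    return m / (np.linalg.norm(W) ** 2)   # squared Frobenius norm ~ number of represented features
```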

Surprisingly, we find that this graph is "sticky" at 1 and 1/2. (This very vaguely resembles the fractional quantum Hall effect – see e.g. this diagram.) Why is this? On inspection, the 1/2 "sticky point" seems to correspond to a precise geometric arrangement where features come in "antipodal pairs", each being exactly the negative of the other, allowing two features to be packed into each hidden dimension. It appears that antipodal pairs are so effective that the model preferentially uses them over a wide range of the sparsity regime.

It turns out that antipodal pairs are just the tip of the iceberg. Hiding underneath this curve are a number of extremely specific geometric configurations of features.

Feature Dimensionality

In the previous section, we saw that there's a sticky regime where the model has "half a dimension per feature" in some sense. This is an average statistical property of the features the model represents, but it seems to hint at something interesting. Is there a way we could understand what "fraction of a dimension" a specific feature gets?

We'll define the dimensionality of the ith feature, D_i, as:

D_i ~=~ \frac{||W_i||^2}{\sum_j (\hat{W_i} \cdot W_j)^2}

where W_i is the weight vector column associated with the ith feature, and \hat{W_i} is the unit version of that vector.

Intuitively, the numerator represents the extent to which a given feature is represented, while the denominator is "how many features share the dimension it is embedded in" by projecting each feature onto its dimension. In the antipodal case, each feature participating in an antipodal pair will have a dimensionality of D = 1 / (1+1) = 1/2 while features which are not learned will have a dimensionality of 0. Empirically, it seems that the dimensionality of all features add up to the number of embedding dimensions when the features are "packed efficiently" in some sense.
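A sketch of this per-feature dimensionality in code (note the denominator includes the j = i term, so an isolated unit-norm feature gets D_i = 1 and each member of an antipodal pair gets D_i = 1/2):

```python
import numpy as np

def feature_dimensionality(W: np.ndarray) -> np.ndarray:
    """D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2 for each feature i; W has shape (m_hidden, n_features)."""
    norms_sq = (W ** 2).sum(axis=0)                         # ||W_i||^2
    W_hat = W / np.maximum(np.sqrt(norms_sq), 1e-9)
    proj_sq = (W_hat.T @ W) ** 2                            # (W_i_hat . W_j)^2, including j = i
    return norms_sq / np.maximum(proj_sq.sum(axis=1), 1e-9)
```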

We can now break the above plot down on a per-feature basis. This reveals many more of these "sticky points"! To help us understand this better, we're going to create a scatter plot annotated with some additional information:

Let's look at the resulting plot, and then we'll try to figure out what it's showing us:

What is going on with the points clustering at specific fractions?? We'll see shortly that the model likes to create specific weight geometries and kind of jumps between the different configurations.

In the previous section, we developed a theory of superposition as a phase change. But everything on this plot between 0 (not learning a feature) and 1 (dedicating a dimension to a feature) is superposition. Superposition is what happens when features have fractional dimensionality. That is to say – superposition isn't just one thing!

How can we relate this to our original understanding of the phase change? We often think of water as only having three phases: ice, water and steam. But this is a simplification: there are actually many phases of ice, often corresponding to different crystal structures (eg. hexagonal vs cubic ice). In a vaguely similar way, neural network features seem to also have many other phases within the general category of "superposition."

Why these geometric structures?

In the previous diagram, we found that there are distinct lines corresponding to dimensionality of: ¾ (tetrahedron), ⅔ (triangle), ½ (antipodal pair), ⅖ (pentagon), ⅜ (square antiprism), and 0 (feature not learned). We believe there would also be a 1 (dedicated dimension for a feature) line if not for the fact that basis features are indistinguishable from other directions in the dense regime.

Several of these configurations may jump out as solutions to the famous Thomson problem. (In particular, square antiprisms are much less famous than cubes and are primarily of note for their role in molecular geometry due to being a Thomson problem solution.) As we saw earlier, there is a very real sense in which our model can be understood as solving a generalized version of the Thomson problem. When our model chooses to represent a feature, the feature is embedded as a point on an m-dimensional sphere.

A second clue as to what's going on is that there are lines for the Thomson solutions which are uniform polyhedra (e.g. tetrahedron), but there seem to be split lines where we'd expect to see non-uniform solutions (e.g. instead of a ⅗ line for triangular bipyramids we see a co-occurence of points at ⅔ for triangles and points at ½ for a antipodes). In a uniform polyhedron, all vertices have the same geometry, and so if we embed features as them each feature has the same dimensionality. But if we embed features as a non-uniform polyhedron, different features will have more or less interference with others.

In particular, many of the Thomson solutions can be understood as tegum products (an operation which constructs polytopes  by embedding two polytopes in orthogonal subspaces) of smaller uniform polytopes. (In the earlier graph visualizations of feature geometry, two subgraphs are disconnected if and only if they are in different tegum factors.) As a result, we should expect their dimensionality to actually correspond to the underlying factor uniform polytopes.

This also suggests a possible reason why we observe 3D Thomson problem solutions, despite the fact that we're actually studying a higher dimensional version of the problem. Just as many 3D Thomson solutions are tegum products of 2D and 1D solutions, perhaps higher dimensional solutions are often tegum products of 1D, 2D, and 3D solutions.

The orthogonality of factors in tegum products has interesting implications. For the purposes of superposition, it means that there can't be any "interference" across tegum-factors. This may be preferred by the toy model: having many features interfere simultaneously could be really bad for it. (See related discussion in our earlier mathematical analysis.)

Aside: Polytopes and Low-Rank Matrices

At this point, it's worth making explicit that there's a correspondence between polytopes and symmetric, positive-definite, low-rank matrices (i.e. matrices of the form W^TW). This correspondence underlies the results we saw in the previous section, and is generally useful for thinking about superposition.

In some ways, the correspondence is trivial. If one has a rank-m n\!\times\!n-matrix of the form W^TW, then W is a n\!\times\!m-matrix. We can interpret the columns of W as n points in a m-dimensional space. The place where this starts to become interesting is that it makes it clear that W^TW is driven by the geometry. In particular, we can see how the off-diagonal terms are driven by the geometry of the points.

Put another way, there's an exact correspondence between polytopes and strategies for superposition. For example, every strategy for putting three features in superposition in a 2-dimensional space corresponds to a triangle, and every triangle corresponds to such a strategy. From this perspective, it doesn't seem surprising that if we have three equally important and equally sparse features, the optimal strategy is an equilateral triangle.

This correspondence also goes the other direction. Suppose we have a rank-(n\!-\!i) matrix of the form W^TW. We can characterize it by the dimensions W did not represent – that is, which directions are orthogonal to W? For example, if we have a rank-(n\!-\!1) matrix, we might ask which single direction W did not represent. This is especially informative if we assume that W^TW will be as "identity-like" as possible, given the constraint of not representing certain vectors.

In fact, given such a set of orthogonal vectors, we can construct a polytope by starting with n basis vectors and projecting them onto the space orthogonal to the given vectors. For example, if we start in three dimensions and then project such that W \perp (1,1,1), we get a triangle. More generally, setting W \perp (1,1,1,...) gives us the n vertices of a regular (n\!-\!1)-simplex. This is interesting because it's in some sense the "minimal possible superposition." Assuming that features are equally important and sparse, the best possible direction to not represent is the fully dense vector (1,1,1,...)!
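
As a small sanity check of this construction, the following NumPy sketch projects n=3 basis vectors orthogonally to (1,1,1) and confirms the projected points are equidistant, as the regular-simplex picture predicts (the choice n=3 is just for illustration):

```python
import numpy as np

# Project n basis vectors onto the orthogonal complement of the all-ones
# direction; for n = 3 the resulting points form an equilateral triangle.
n = 3
v = np.ones(n) / np.sqrt(n)                      # the direction W does not represent
P = np.eye(n) - np.outer(v, v)                   # projector onto v's orthogonal complement
points = P @ np.eye(n)                           # projected basis vectors, as columns

# All pairwise distances are equal, confirming the regular-simplex structure.
dists = [np.linalg.norm(points[:, i] - points[:, j])
         for i in range(n) for j in range(i + 1, n)]
print(np.round(dists, 6))                        # [1.414214 1.414214 1.414214]
```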

Non-Uniform Superposition

So far, this section has focused on the geometry of uniform superposition, where all features are of equal importance, equal sparsity, and independent. The model is essentially solving a variant of the Thomson problem. Because all features are the same, solutions corresponding to uniform polyhedra get especially low loss. In this subsection, we'll study non-uniform superposition, where features are somehow not uniform. They may vary in importance and sparsity, or have a correlational structure that makes them not independent. This distorts the uniform geometry we saw earlier.

In practice, it seems like superposition in real neural networks will be non-uniform, so developing an understanding of it seems important. Unfortunately, we're far from a comprehensive theory of the geometry of non-uniform superposition at this point. As a result, the goal of this section will merely be to highlight some of the more striking phenomena we observe.

We attempt to illustrate these phenomena with some representative experiments below.

Perturbing a Single Feature

The simplest kind of non-uniform superposition is to vary one feature and leave the others uniform. As an experiment, let's represent n=5 features in m=2 dimensions. In the uniform case, with importance I=1 and activation density 1-S=0.05, we get a regular pentagon. But if we vary one point – in this case by making it more or less sparse – we see the pentagon stretch to account for the new value. If we make it denser, activating more frequently (yellow), the other features repel from it, giving it more space. On the other hand, if we make it sparser, activating less frequently (blue), it takes up less space and the other points push towards it.
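
For readers who want to reproduce this kind of experiment, here is a minimal PyTorch sketch of the setup as we've described it – the ReLU output model x' = \text{ReLU}(W^TWx + b) with uniform importance – where the learning rate, step count, batch size, and the particular density assigned to feature 0 are illustrative choices rather than the exact settings behind the figures:

```python
import torch

# Minimal sketch of the perturbed-feature experiment: n = 5 features, m = 2
# hidden dimensions, uniform importance, with the density of feature 0 varied.
# Hyperparameter values here are illustrative, not the settings behind the figures.
n, m, batch = 5, 2, 1024
density = torch.full((n,), 0.05)   # activation density 1 - S of the uniform features
density[0] = 0.10                  # make feature 0 denser (or smaller to make it sparser)

W = torch.nn.Parameter(0.1 * torch.randn(m, n))
b = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Each feature is zero with probability S_i and otherwise uniform on [0, 1].
    x = torch.rand(batch, n) * (torch.rand(batch, n) < density)
    x_hat = torch.relu(x @ W.T @ W + b)   # ReLU(W^T W x + b), batched
    loss = ((x - x_hat) ** 2).mean()      # uniform importance
    opt.zero_grad()
    loss.backward()
    opt.step()

# The columns of W are the learned feature directions; plotting them shows the
# stretched-pentagon and digon-pair geometries discussed above.
print(W.detach().T)
```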

If we make it sufficiently sparse, there's a phase change, and it collapses from a pentagon to a pair of digons, with the sparser feature's embedding at zero. The phase change corresponds to the loss curves of the two different geometries crossing over. (This observation allows us to directly confirm that it is genuinely a first-order phase change.)

To visualize the solutions, we canonicalize them, rotating them to align with each other in a consistent manner.

These results seem to suggest that, at least in some cases, non-uniform superposition can be understood as a deformation of uniform superposition, together with jumps between uniform configurations, rather than as a totally different regime. Since uniform superposition has a lot of understandable structure, but real-world superposition is almost certainly non-uniform, this seems very promising!

The reason pentagonal solutions are not on the unit circle is that the model reduces the effect of positive interference, setting a slight negative bias to cut off noise and setting its weights to ||W_i|| = 1 / (1-b_i) to compensate. Distance from the unit circle can be interpreted as primarily driven by the amount of positive interference.

A note for reimplementations: optimizing with a two-dimensional hidden space makes this easier to study, but the actual optimization problem turns out to be really challenging for gradient descent – a lot harder than even the three-dimensional case. Getting clean results required fitting each model multiple times and taking the solution with the lowest loss. However, there's a silver lining: visualizing the sub-optimal solutions on a scatter plot as above allows us to see the loss curves for different geometries and gain greater insight into the phase change.

Correlated and Anticorrelated Features

A more complicated form of non-uniform superposition occurs when there are correlations between features. This seems essential for understanding superposition in the real world, where many features are correlated or anti-correlated.

For example, one very pragmatic question to ask is whether we should expect polysemantic neurons to group the same features together across models. If the groupings were random, you could use this to detect polysemantic neurons, by comparing across models! However, we'll see that correlational structure strongly influences which features are grouped together in superposition.

The behavior seems to be quite nuanced, with a kind of "order of preferences" for how correlated features behave in superposition. The model ideally represents correlated features orthogonally, in separate tegum factors with no interactions between them. When that fails, it prefers to arrange them so that they're as close together as possible – it prefers positive interference between correlated features over negative interference. Finally, when there isn't enough space to represent all the correlated features, it will collapse them and represent their principal component instead! Conversely, when features are anti-correlated, models prefer to have them interfere, especially with negative interference. We'll demonstrate this with a few experiments below.

Setup for Exploring Correlated and Anticorrelated Features

Throughout this section we'll refer to "correlated feature sets" and "anticorrelated feature sets".

Correlated Feature Sets. Our correlated feature sets can be thought of as "bundles" of co-occurring features. One can imagine a highly idealized version of what might happen in an image classifier: there could be a bundle of features used to identify animals (fur, ears, eyes) and another bundle used to identify buildings (corners, windows, doors). Features from one of these bundles are likely to appear together. Mathematically, we represent this by linking together the decision of whether the features in a correlated feature set are zero. Recall that we originally defined our synthetic distribution so that each feature is zero with probability S and otherwise uniformly distributed on [0,1]. For a correlated feature set, we simply have the same sample determine whether all of its features are zero.

Anticorrelated Feature Sets. One could also imagine anticorrelated features which are extremely unlikely to occur together. To simulate these, we use anticorrelated feature sets in which only one feature in the set can be active at a time: the set is entirely zero with probability S, and otherwise a single randomly selected feature is sampled uniformly from [0,1] while the others are zero.
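
The following sketch shows one way to sample these two kinds of feature sets (the function names and exact parameterization are ours, introduced only for illustration):

```python
import torch

def sample_correlated(batch, k, S):
    # A single draw per example decides whether the whole set of k features is
    # zero; when active, the individual values are independent U[0, 1].
    active = (torch.rand(batch, 1) < (1 - S)).float()
    return active * torch.rand(batch, k)

def sample_anticorrelated(batch, k, S):
    # The whole set is zero with probability S; otherwise exactly one randomly
    # chosen feature in the set takes a value sampled from U[0, 1].
    active = (torch.rand(batch, 1) < (1 - S)).float()
    which = torch.nn.functional.one_hot(torch.randint(k, (batch,)), k).float()
    return active * which * torch.rand(batch, 1)
```

A training batch containing several such sets can then be assembled by concatenating them along the feature dimension, e.g. torch.cat([sample_correlated(1024, 2, S) for _ in range(3)], dim=1) for the three correlated pairs used later in this section.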

Organization of Correlated and Anticorrelated Features

For our initial investigation, we simply train a number of small toy models with correlated and anticorrelated features and observe what happens. To make this easy to study, we limit ourselves to the m=2 case, where we can explicitly visualize the weights as points in 2D space. In general, such solutions can be understood as collections of points on the unit circle. To make solutions easy to compare, we rotate and flip them to align with each other.
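
One plausible canonicalization – our own sketch of the rotate-and-flip step, not necessarily the exact procedure used to produce the figures – is to rotate each solution so that its first feature lies on the positive x-axis and then reflect so that the second feature has a non-negative y-coordinate:

```python
import numpy as np

def canonicalize(W):
    # W has shape (2, n); columns are the feature directions of one solution.
    # Rotate so feature 0 lies on the positive x-axis, then reflect if needed
    # so feature 1 has a non-negative y-coordinate.
    theta = np.arctan2(W[1, 0], W[0, 0])
    c, s = np.cos(-theta), np.sin(-theta)
    W = np.array([[c, -s], [s, c]]) @ W
    if W.shape[1] > 1 and W[1, 1] < 0:
        W = np.diag([1.0, -1.0]) @ W
    return W
```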

Local Almost-Orthogonal Bases

It turns out that the tendency of models to arrange correlated features to be orthogonal is actually quite a strong phenomenon. In particular, for larger models, it seems to generate a kind of "local almost-orthogonal basis" where, even though the model as a whole is in superposition, the correlated feature sets considered in isolation are (nearly) orthogonal and can be understood as having very little superposition.

To investigate this, we train a larger model with two sets of correlated features and visualize W^TW.
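
For instance, assuming the trained weight matrix W is available as an (m, n) NumPy array with each correlated set occupying contiguous columns, a plot along these lines makes the block structure visible (the plotting details are our own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# `W` should be the trained (m, n) weight matrix; the random stand-in below is
# only here so the snippet runs end to end.
W = 0.3 * np.random.randn(6, 40)

gram = W.T @ W
plt.imshow(gram, cmap="RdBu", vmin=-1, vmax=1)
plt.colorbar(label="(W^T W)_ij")
plt.xlabel("feature j")
plt.ylabel("feature i")
plt.show()
```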

If this result holds in real neural networks, it suggests we might be able to make a kind of "local non-superposition" assumption, where for certain sub-distributions we can assume that the activating features are not in superposition. This could be a powerful result, allowing us to confidently use methods such as PCA, which might not otherwise be principled to use in the context of superposition.

Collapsing of Correlated Features

One of the most interesting properties is that there seems to be a trade-off between Principal Components Analysis (PCA) and superposition. If there are two correlated features a and b, but the model only has capacity to represent one, the model will represent their principal component (a+b)/\sqrt{2} – a sparse variable that has more impact on the loss than either feature individually – and ignore the second principal component (a-b)/\sqrt{2}.
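
As a quick numerical check of this claim – with a sparsity level chosen arbitrarily for illustration, and sampling that mirrors the correlated-pair setup below – the top principal component of such a pair does indeed point along (a+b)/\sqrt{2} (up to sign):

```python
import numpy as np

# Two features that always activate together (values independent when active).
rng = np.random.default_rng(0)
active = rng.random((100_000, 1)) < 0.2          # illustrative density
x = active * rng.random((100_000, 2))            # columns: features a and b

cov = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
print(eigvecs[:, -1])                            # ~ +/- (0.707, 0.707), i.e. (a + b)/sqrt(2)
```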

As an experiment, we consider six features, organized into three sets of correlated pairs. Features in each correlated pair are represented by a given color (red, green, and blue). The correlation is created by having both features always activate together – they're either both zero or neither zero. (The exact non-zero values they take when they activate is uncorrelated.)

As we vary the sparsity of the features, we find that in the very sparse regime, we observe superposition as expected, with features arranged in a hexagon and correlated features side-by-side. As we decrease sparsity, the features progressively "collapse" into their principal components. In very dense regimes, the solution becomes equivalent to PCA.

These results seem to hint that PCA and superposition are in some sense complementary strategies which trade off with one another. As features become more correlated, PCA becomes a better strategy. As features become sparser, superposition becomes a better strategy. When features are both sparse and correlated, mixtures of each strategy seem to occur. It would be nice to more deeply understand this space of tradeoffs.

It's also interesting to think about this in the context of continuous equivariant features, such as features which occur in different rotations.







Superposition and Learning Dynamics

The focus of this paper is how superposition contributes to the functioning of fully trained neural networks, but as a brief detour it's interesting to ask how our toy models – and the resulting superposition – evolve over the course of training.

There are several reasons why these models seem like a particularly interesting case for studying learning dynamics. Firstly, unlike most neural networks, the fully trained models converge to a simple but non-trivial structure that rhymes with an emerging thread of evidence that neural network learning dynamics might have geometric weight structure that we can understand. One might hope that understanding the final structure would make it easier for us to understand the evolution over training. Secondly, superposition hints at surprisingly discrete structure (regular polytopes of all things!). We'll find that the underlying learning dynamics are also surprisingly discrete, continuing an emerging trend of evidence that neural network learning might be less continuous than it seems. Finally, since superposition has significant implications for interpretability, it would be nice to understand how it emerges over training – should we expect models to use superposition early on, or is it something that only emerges later in training, as models struggle to fit more features in?

Unfortunately, we aren't able to give these questions the detailed investigation they deserve within the scope of this paper. Instead, we'll limit ourselves to a couple of particularly striking phenomena we've noticed, leaving more detailed investigation for future work.

Phenomenon 1: Discrete "Energy Level" Jumps

Perhaps the most striking phenomenon we've noticed is that the learning dynamics of toy models with large numbers of features appear to be dominated by "energy level jumps," where features jump between different feature dimensionalities. (Recall that a feature's dimensionality is the fraction of a dimension dedicated to representing that feature.)
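
For reference, here is a short helper for computing these per-feature dimensionalities from the weight matrix; it assumes the fractional-dimensionality definition from the earlier geometry section, D_i = ||W_i||^2 / \sum_j (\hat{W}_i \cdot W_j)^2, with \hat{W}_i the unit vector along W_i:

```python
import numpy as np

def feature_dimensionality(W):
    # W has shape (m, n); its columns W_i are the feature embedding vectors.
    # D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2, with W_i_hat = W_i / ||W_i||.
    norms = np.linalg.norm(W, axis=0)
    W_hat = W / np.maximum(norms, 1e-12)
    overlaps = (W_hat.T @ W) ** 2
    return norms ** 2 / overlaps.sum(axis=1)

# Sanity checks: a feature orthogonal to everything else gets dimensionality 1,
# and each feature of an antipodal pair (a digon) gets dimensionality 1/2.
print(feature_dimensionality(np.eye(2)))                 # [1. 1.]
print(feature_dimensionality(np.array([[1.0, -1.0]])))   # [0.5 0.5]
```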

Let's consider the problem setup we studied when investigating the geometry of uniform superposition in the previous section, where we have a large number of features of equal importance and sparsity. As we saw previously, the features ultimately arrange themselves into a small number of polytopes with fractional dimensionalities.

A natural question to ask is what happens to these feature dimensionalities over the course of training. Let's pick one model where all the features converge into digons and observe. In the first plot, each colored line corresponds to the dimensionality of a single feature. The second plot shows how the loss curve changes over the same duration.