Transformer Circuits Thread

Toy Models of Superposition

Authors

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, Christopher Olah

Affiliations

Anthropic, Harvard

Published

Sept 14, 2022
* Core Research Contributor; ‡ Correspondence to colah@anthropic.com; Author contributions statement below.


It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?

In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition. When features are sparse, superposition allows compression beyond what a linear model would do, at the cost of "interference" that requires nonlinear filtering.

Consider a toy model where we train an embedding of five features of varying importance (where "importance" is a scalar multiplier on mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features. With dense features, the model learns to represent an orthogonal basis of the most important two features (similar to what Principal Component Analysis might give us), and the other three features are not represented. But if we make the features sparse, this changes:

This figure and a few others can be reproduced using the toy model framework Colab notebook in our Github repo

Not only can models store additional features in superposition by tolerating some interference, but we'll show that, at least in certain limited cases, models can perform computation while in superposition. (In particular, we'll show that models can put simple circuits computing the absolute value function in superposition.) This leads us to hypothesize that the neural networks we observe in practice are in some sense noisily simulating larger, highly sparse networks. In other words, it's possible that models we train can be thought of as doing “the same thing as” an imagined much-larger model, representing the exact same features but with no interference.

Feature superposition isn't a novel idea. A number of previous interpretability papers have considered it, and it's very closely related to the long-studied topic of compressed sensing in mathematics, as well as the ideas of distributed, dense, and population codes in neuroscience and deep learning. What, then, is the contribution of this paper?

For interpretability researchers, our main contribution is providing a direct demonstration that superposition occurs in artificial neural networks given a relatively natural setup, suggesting this may also occur in practice. That is, we show a case where interpreting neural networks as having sparse structure in superposition isn't just a useful post-hoc interpretation, but actually the "ground truth" of a model. We offer a theory of when and why this occurs, revealing a phase diagram for superposition. This explains why neurons are sometimes "monosemantic", responding to a single feature, and sometimes "polysemantic", responding to many unrelated features. We also discover that, at least in our toy model, superposition exhibits complex geometric structure.

But our results may also be of broader interest. We find preliminary evidence that superposition may be linked to adversarial examples and grokking, and might also suggest a theory for the performance of mixture of experts models. More broadly, the toy model we investigate has unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar to the fractional quantum Hall effect in physics, among other striking phenomena. We originally investigated the subject to gain understanding of cleanly-interpretable neurons in larger models, but we've found these toy models to be surprisingly interesting in their own right.

Key Results From Our Toy Models

In our toy models, we are able to demonstrate that:

Our toy models are simple ReLU networks, so it seems fair to say that neural networks exhibit these properties in at least some regimes, but it's very unclear what to generalize to real networks.







Definitions and Motivation: Features, Directions, and Superposition

In our work, we often think of neural networks as having features of the input represented as directions in activation space. This isn't a trivial claim. It isn't obvious what kind of structure we should expect neural network representations to have. When we say something like "word embeddings have a gender direction" or "vision models have curve detector neurons", one is implicitly making strong claims about the structure of network representations.

Despite this, we believe this kind of "linear representation hypothesis" is supported both by significant empirical findings and theoretical arguments. One might think of this as two separate properties (decomposability and linearity), which we'll explore in more detail shortly.

If we hope to reverse engineer neural networks, we need a property like decomposability. Decomposability is what allows us to reason about the model without fitting the whole thing in our heads! But it's not enough for things to be decomposable: we need to be able to access the decomposition somehow. In order to do this, we need to identify the individual features within a representation. In a linear representation, this corresponds to determining which directions in activation space correspond to which independent features of the input.

Sometimes, identifying feature directions is very easy because features seem to correspond to neurons. For example, many neurons in the early layers of InceptionV1 clearly correspond to features (e.g. curve detector neurons). Why is it that we sometimes get this extremely helpful property, but in other cases don't? We hypothesize that there are really two countervailing forces driving this:

Superposition has been hypothesized in previous work, and in some cases, assuming something like superposition has been shown to help find interpretable structure. However, we're not aware of feature superposition having been unambiguously demonstrated to occur in neural networks before (although a closely related phenomenon of model superposition has been demonstrated). The goal of this paper is to change that, demonstrating superposition and exploring how it interacts with privileged bases. If superposition occurs in networks, it deeply influences what approaches to interpretability research make sense, so unambiguous demonstration seems important.

The goal of this section will be to motivate these ideas and unpack them in detail.

It's worth noting that many of the ideas in this section have close connections to ideas in other lines of interpretability research (especially disentanglement), neuroscience (distributed representations, population codes, etc), compressed sensing, and many other lines of work. This section will focus on articulating our perspective on the problem. We'll discuss these other lines of work in detail in Related Work.

Empirical Phenomena

When we talk about "features" and how they're represented, this is ultimately theory building around several observed empirical phenomena. Before describing how we conceptualize those results, we'll simply describe some of the major results motivating our thinking:

As a result, we tend to think of neural network representations as being composed of features which are represented as directions. We'll unpack this idea in the following sections.

What are Features?

Our use of the term "feature" is motivated by the interpretable properties of the input we observe neurons (or word embedding directions) responding to. There's a rich variety of such observed properties! In the context of vision, these have ranged from low-level neurons like curve detectors and high-low frequency detectors, to more complex neurons like oriented dog-head detectors or car detectors, to extremely abstract neurons corresponding to famous people, emotions, geographic regions, and more. In language models, researchers have found word embedding directions such as a male-female or singular-plural direction, low-level neurons disambiguating words that occur in multiple languages, much more abstract neurons, and "action" output neurons that help produce certain words. We'd like to use the term "feature" to encompass all these properties.

But even with that motivation, it turns out to be quite challenging to create a satisfactory definition of a feature. Rather than offer a single definition we're confident about, we consider three potential working definitions:

We've written this paper with the final "neurons in sufficiently large models" definition in mind. But we aren't overly attached to it, and actually think it's probably important to not prematurely attach to a definition. (A famous book by Lakatos illustrates the importance of uncertainty about definitions, and how important rethinking definitions often is in the context of research.)

Features as Directions

As we've mentioned in previous sections, we generally think of features as being represented by directions. For example, in word embeddings, "gender" and "royalty" appear to correspond to directions, allowing arithmetic like V("king") - V("man") + V("woman") = V("queen"). Examples of interpretable neurons are also cases of features as directions, since the amount a neuron activates corresponds to a basis direction in the representation.

Let's call a neural network representation linear if features correspond to directions in activation space. In a linear representation, each feature f_i has a corresponding representation direction W_i. The presence of multiple features f_1, f_2… activating with values x_{f_1}, x_{f_2}… is represented by x_{f_1}W_{f_1} + x_{f_2}W_{f_2}.... To be clear, the features being represented are almost certainly nonlinear functions of the input. It's only the map from features to activation vectors which is linear. Note that whether something is a linear representation depends on what you consider to be the features.
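As a minimal numerical illustration of this kind of linear encoding (our own sketch in numpy; the feature names, directions, and activation values are invented purely for illustration):

import numpy as np

# Two hypothetical feature directions in a 3-dimensional activation space.
W_gender  = np.array([1.0, 0.0, 0.0])   # direction for a "gender" feature (illustrative)
W_royalty = np.array([0.0, 1.0, 0.0])   # direction for a "royalty" feature (illustrative)

# A linear representation of both features active, with values 0.9 and 0.5:
activation = 0.9 * W_gender + 0.5 * W_royalty

# With orthogonal directions each value can be read back out by a dot product;
# with merely almost-orthogonal directions (superposition) the readout is approximate.
print(activation @ W_gender, activation @ W_royalty)   # 0.9, 0.5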

We don't think it's a coincidence that neural networks empirically seem to have linear representations. Neural networks are built from linear functions interspersed with non-linearities. In some sense, the linear functions are the vast majority of the computation (for example, as measured in FLOPs). Linear representations are the natural format for neural networks to represent information in! Concretely, there are three major benefits:

It is possible to construct non-linear representations, and retrieve information from them, if you use multiple layers (although even these examples can be seen as linear representations with more exotic features). We provide an example in the appendix. However, our intuition is that non-linear representations are generally inefficient for neural networks.

One might think that a linear representation can only store as many features as it has dimensions, but it turns out this isn't the case! We'll see that the phenomenon we call superposition will allow models to store more features – potentially many more features – in linear representations.

For discussion on how this view of features squares with a conception of features as being multidimensional manifolds, see the appendix “What about Multidimensional Features?”.

Privileged vs Non-privileged Bases

Even if features are encoded as directions, a natural question to ask is which directions? In some cases, it seems useful to consider the basis directions, but in others it doesn't. Why is this?

When researchers study word embeddings, it doesn't make sense to analyze basis directions. There would be no reason to expect a basis dimension to be different from any other possible direction. One way to see this is to imagine applying some random linear transformation M to the word embedding, and applying M^{-1} to the following weights. This would produce an identical model where the basis dimensions are totally different. This is what we mean by a non-privileged basis. Of course, it's possible to study activations without a privileged basis; you just need to identify interesting directions to study somehow, such as creating a gender direction in a word embedding by taking the difference vector between "man" and "woman".

But many neural network layers are not like this. Often, something about the architecture makes the basis directions special, such as applying an activation function. This "breaks the symmetry", making those directions special, and potentially encouraging features to align with the basis dimensions. We call this a privileged basis, and call the basis directions "neurons." Often, these neurons correspond to interpretable features.

From this perspective, it only makes sense to ask if a neuron is interpretable when it is in a privileged basis. In fact, we typically reserve the word "neuron" for basis directions which are in a privileged basis. (See longer discussion here.)

Note that having a privileged basis doesn't guarantee that features will be basis-aligned – we'll see that they often aren't! But it's a minimal condition for the question to even make sense.

The Superposition Hypothesis

Even when there is a privileged basis, it's often the case that neurons are "polysemantic", responding to several unrelated features. One explanation for this is the superposition hypothesis. Roughly, the idea of superposition is that neural networks "want to represent more features than they have neurons", so they exploit a property of high-dimensional spaces to simulate a model with many more neurons.

Several results from mathematics suggest that something like this might be plausible:

Concretely, in the superposition hypothesis, features are represented as almost-orthogonal directions in the vector space of neuron outputs. Since the features are only almost-orthogonal, one feature activating looks like other features slightly activating. Tolerating this "noise" or "interference" comes at a cost. But for neural networks with highly sparse features, this cost may be outweighed by the benefit of being able to represent more features! (Crucially, sparsity greatly reduces the costs since sparse features are rarely active to interfere with each other, and non-linear activation functions create opportunities to filter out small amounts of noise.)
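As a quick numerical illustration of why this is plausible (our own sketch, not an experiment from this paper): random unit vectors in a modest number of dimensions already have small pairwise overlaps, so many more "feature directions" than dimensions can coexist with limited interference.

import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 1000                                   # pack 1000 directions into 100 dimensions
W = rng.normal(size=(m, n))
W /= np.linalg.norm(W, axis=0, keepdims=True)      # unit-norm columns: one direction per feature

overlaps = W.T @ W                                 # pairwise cosine similarities
np.fill_diagonal(overlaps, 0.0)
print(np.abs(overlaps).max())                      # well below 1, and it shrinks as m grows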

One way to think of this is that a small neural network may be able to noisily "simulate" a sparse larger model:

Although we've described superposition with respect to neurons, it can also occur in representations with an unprivileged basis, such as a word embedding. Superposition simply means that there are more features than dimensions.

Summary: A Hierarchy of Feature Properties

The ideas in this section might be thought of in terms of four progressively more strict properties that neural network representations might have: decomposability, linearity, non-superposition, and basis-alignment.

The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter two (non-superposition and basis-alignment) are properties we believe only sometimes occur.







Demonstrating Superposition

If one takes the superposition hypothesis seriously, a natural first question is whether neural networks can actually noisily represent more features than they have neurons. If they can't, the superposition hypothesis may be comfortably dismissed.

The intuition from linear models would be that this isn't possible: the best a linear model can do is to store the principal components. But we'll see that adding just a slight nonlinearity can make models behave in a radically different way! This will be our first demonstration of superposition. (It will also be an object lesson in the complexity of even very simple neural networks.)

Experiment Setup

Our goal is to explore whether a neural network can project a high dimensional vector x \in R^n into a lower dimensional vector h \in R^m and then recover it. (This experiment setup could also be viewed as an autoencoder reconstructing x.)

The Feature Vector (x)

We begin by describing the high-dimensional vector x: the activations of our idealized, disentangled larger model. We call each element x_i a "feature" because we're imagining features to be perfectly aligned with neurons in the hypothetical larger model. In a vision model, this might be a Gabor filter, a curve detector, or a floppy ear detector. In a language model, it might correspond to a token referring to a specific famous person, or a clause being a particular kind of description.

Since we don't have any ground truth for features, we need to create synthetic data for x which simulates any important properties we believe features have from the perspective of modeling them. We make three major assumptions:

Concretely, our synthetic data is defined as follows: The input vectors x are synthetic data intended to simulate the properties we believe the true underlying features of our task have. We consider each dimension x_i to be a "feature". Each one has an associated sparsity S_i and importance I_i. We let x_i=0 with probability S_i, and otherwise x_i is uniformly distributed between [0,1]. (The choice to have features distributed uniformly is arbitrary; an exponential or power law distribution would also be very natural.) In practice, we focus on the case where all features have the same sparsity, S_i = S.
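A minimal sketch of this data distribution in numpy (the function name and the choice of a single shared sparsity S follow the text above; everything else is our own):

import numpy as np

def generate_batch(batch_size, n_features, sparsity, rng=np.random.default_rng()):
    # Each x_i is 0 with probability S, and otherwise uniform on [0, 1].
    values = rng.uniform(0.0, 1.0, size=(batch_size, n_features))
    active = rng.uniform(0.0, 1.0, size=(batch_size, n_features)) >= sparsity
    return values * active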

The Model (x \to x')

We will actually consider two models, which we motivate below. The first "linear model" is a well understood baseline which does not exhibit superposition. The second "ReLU output model" is a very simple model which does exhibit superposition. The two models vary only in the final activation function.

Linear Model
h = Wx
x' = W^T h + b
x' = W^T W x + b

ReLU Output Model
h = Wx
x' = \text{ReLU}(W^T h + b)
x' = \text{ReLU}(W^T W x + b)
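A sketch of the two models in PyTorch (our own translation of the equations above; the class and argument names are not from the paper, and the initialization details are illustrative):

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, n_features, m_hidden, relu_output=True):
        super().__init__()
        self.W = nn.Parameter(torch.randn(m_hidden, n_features) * 0.1)  # columns W_i are feature directions
        self.b = nn.Parameter(torch.zeros(n_features))
        self.relu_output = relu_output

    def forward(self, x):                  # x: (batch, n_features)
        h = x @ self.W.T                   # h  = W x
        out = h @ self.W + self.b          # x' = W^T h + b
        return torch.relu(out) if self.relu_output else out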

Why these models?

The superposition hypothesis suggests that each feature in the higher-dimensional model corresponds to a direction in the lower-dimensional space. This means we can represent the down projection as a linear map h=Wx. Note that each column W_i corresponds to the direction in the lower-dimensional space that represents a feature x_i.

To recover the original vector, we'll use the transpose of the same matrix, W^T. This has the advantage of avoiding any ambiguity regarding what direction in the lower-dimensional space really corresponds to a feature. It also seems relatively mathematically principled (recall that W^T = W^{-1} if W is orthonormal; although W can't be literally orthonormal, our intuition from compressed sensing is that it will be "almost orthonormal" in the sense of Candes & Tao), and it empirically works.

We also add a bias. One motivation for this is that it allows the model to set features it doesn't represent to their expected value. But we'll see later that the ability to set a negative bias is important for superposition for a second set of reasons – roughly, it allows models to discard small amounts of noise.

The final step is whether to add an activation function. This turns out to be critical to whether superposition occurs. In a real neural network, when features are actually used by the model to do computation, there will be an activation function, so it seems principled to include one at the end.

The Loss

Our loss is mean squared error, weighted by the feature importances I_i described above:
L = \sum_x \sum_i I_i (x_i - x'_i)^2
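A sketch of the importance-weighted loss and a bare-bones training loop, reusing the generate_batch and ToyModel sketches above (the optimizer, learning rate, and batch size are illustrative choices, not the paper's):

import torch

n, m, sparsity = 20, 5, 0.9
importance = torch.tensor([0.7 ** i for i in range(n)])        # I_i = 0.7^i, as in the example below
model = ToyModel(n, m, relu_output=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(10_000):
    x = torch.tensor(generate_batch(1024, n, sparsity), dtype=torch.float32)
    loss = (importance * (x - model(x)) ** 2).sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()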

Basic Results

Our first experiment will simply be to train a few ReLU output models with different sparsity levels and visualize the results. (We'll also train a linear model – if optimized well enough, the linear model solution does not depend on sparsity level.)

The main question is how to visualize the results. The simplest way is to visualize W^TW (a features by features matrix) and b (a feature length vector). Note that features are arranged from most important to least, so the results have a fairly nice structure. Here's an example of what this type of visualization might look like, for a small model (n=20, m=5) which behaves in the "expected linear model-like" way, only representing as many features as it has dimensions:

But the thing we really care about is this hypothesized phenomenon of superposition – does the model represent "extra features" by storing them non-orthogonally? Is there a way to get at it more explicitly? Well, one question is just how many features the model learns to represent. For any feature, whether or not it is represented is determined by ||W_i||, the norm of its embedding vector.

We'd also like to understand whether a given feature shares its dimension with other features. For this, we calculate \sum_{j\neq i} (\hat{W_i}\cdot W_j)^2, projecting all other features onto the direction vector of W_i. It will be 0 if the feature is orthogonal to other features (dark blue below). On the other hand, values \geq 1 mean that there is some group of other features which can activate W_i as strongly as feature i itself!
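Both diagnostics can be computed directly from the learned weights (a sketch following the notation above; W here is the (m, n) weight matrix from the model sketch):

W = model.W.detach()                              # shape (m_hidden, n_features)
norms = W.norm(dim=0)                             # ||W_i||: how strongly feature i is represented
W_hat = W / (norms + 1e-8)                        # unit feature directions
overlaps = (W_hat.T @ W) ** 2                     # (\hat{W_i} . W_j)^2 for all pairs i, j
interference = overlaps.sum(dim=1) - norms ** 2   # \sum_{j != i} (\hat{W_i} . W_j)^2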

We can visualize the model we looked at previously this way:

Now that we have a way to visualize models, we can start to actually do experiments. We'll start by considering models with only a few features (n=20, m=5, I_i=0.7^i). This will make it easy to visually see what happens. We consider a linear model, and several ReLU-output models trained on data with different feature sparsity levels:

As our standard intuitions would expect, the linear model always learns the top-m most important features, analogous to learning the top principal components. The ReLU output model behaves the same on dense features (1-S=1.0), but as sparsity increases, we see superposition emerge. The model represents more features by having them not be orthogonal to each other. It starts with less important features, and gradually affects the most important ones. Initially this involves arranging them in antipodal pairs, where one feature’s representation vector is exactly the negative of the other’s, but we observe it gradually transition to other geometric structures as it represents more features.  We'll discuss feature geometry further in the later section, The Geometry of Superposition.

The results are qualitatively similar for models with more features and hidden dimensions. For example, if we consider a model with m=20 hidden dimensions and n=80 features (with importance increased to I_i=0.9^i to account for having more features), we observe essentially a rescaled version of the visualization above:

Mathematical Understanding

In the previous section, we observed a surprising empirical result: adding a ReLU to the output of our model allowed a radically different solution – superposition – which doesn't occur in linear models.

The model where it occurs is still quite mathematically simple. Can we analytically understand why superposition is occurring? And for that matter, why does adding a single non-linearity make things so different from the linear model case? It turns out that we can get a fairly satisfying answer, revealing that our model is governed by balancing two competing forces – feature benefit and interference – which will be useful intuition going forwards. We'll also discover a connection to the famous Thomson Problem in chemistry.

Let's start with the linear case. This is well understood by prior work! If one wants to understand why linear models don't exhibit superposition, the easy answer is to observe that linear models essentially perform PCA. But this isn't fully satisfying: if we set aside all our knowledge and intuition about linear functions for a moment, why exactly is it that superposition can't occur?

A deeper understanding can come from the results of Saxe et al., who study the learning dynamics of linear neural networks – that is, neural networks without activation functions. Such models are ultimately linear functions, but because they are the composition of multiple linear functions the dynamics are potentially quite complex. The punchline of their paper reveals that neural network weights can be thought of as optimizing a simple closed-form solution. We can tweak their problem to be a bit more similar to our linear case (we have the model be x' = W^TWx, but leave x Gaussian distributed as in Saxe), revealing the following equation:
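Our rendering of that equation, derived directly from the importance-weighted loss with x standard Gaussian (the grouping into the two terms below is ours):

L = \sum_i I_i \left[ \left(1 - ||W_i||^2\right)^2 + \sum_{j \neq i} (W_i \cdot W_j)^2 \right]

The first term falls as the model represents feature i more fully (it vanishes when ||W_i|| = 1), while the second penalizes overlap between feature directions.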

The Saxe results reveal that there are fundamentally two competing forces which control learning dynamics in the considered model. Firstly, the model can attain a better loss by representing more features (we've labeled this "feature benefit"). But it also gets a worse loss if it represents more than it can fit orthogonally, due to "interference" between features. (As a brief aside, it's interesting to contrast the linear model interference, \sum_{i\neq j}|W_i \cdot W_j|^2, with the notion of coherence in compressed sensing, \max_{i\neq j}|W_i \cdot W_j|. We can see them as the L^2 and L^\infty norms of the same vector.) In fact, this makes it never worthwhile for the linear model to represent more features than it has dimensions. (To prove that superposition is never optimal in a linear model, solve for the gradient of the loss being zero, or consult Saxe et al.)

Can we achieve a similar kind of understanding for the ReLU output model? Concretely, we'd like to understand L=\int_x ||I(x-\text{ReLU}(W^TWx+b))||^2 d\textbf{p}(x) where x is distributed such that x_i=0 with probability S.

The integral over x decomposes into a term for each sparsity pattern according to the binomial expansion of ((1\!-\!S)+S)^n. We can group terms with the same sparsity together, rewriting the loss as L = (1\!-\!S)^n L_n +\ldots+ (1\!-\!S)S^{n-1} L_1+ S^n L_0, with each L_k corresponding to the loss when the input is a k-sparse vector. Note that as S\to 1, L_1 and L_0 dominate. The L_0 term, corresponding to the loss on a zero vector, is just a penalty on positive biases, \sum_i \text{ReLU}(b_i)^2. So the interesting term is L_1, the loss on 1-sparse vectors:

This new equation is vaguely similar to the famous Thomson problem in chemistry. In particular, if we assume uniform importance and that there are a fixed number of features with ||W_i|| = 1 and the rest have ||W_i|| = 0, and that b_i = 0, then the feature benefit term is constant and the interference term becomes a generalized Thomson problem – we're just packing points on the surface of the sphere with a slightly unusual energy function. (We'll see this can be a productive analogy when we resume our empirical investigation in the following sections!)

Another interesting property is that ReLU makes negative interference free in the 1-sparse case. This explains why the solutions we've seen prefer to only have negative interference when possible. Further, using a negative bias can convert small positive interferences into essentially being negative interferences.

What about the terms corresponding to less sparse vectors? We leave explicitly writing these out to the reader, but the main idea is that there are multiple compounding interferences, and the "active features" can experience interference. In a later section, we'll see that features often organize themselves into sparse interference graphs such that only a small number of features interfere with another feature – it's interesting to note that this reduces the probability of compounding interference and makes the 1-sparse loss term more important relative to others.







Superposition as a Phase Change

The results in the previous section seem to suggest that there are three outcomes for a feature when we train a model: (1) the feature may simply not be learned; (2) the feature may be learned, and represented in superposition; or (3) the model may represent a feature with a dedicated dimension. The transitions between these three outcomes seem sharp. Possibly, there's some kind of phase change. Here, we use “phase change” in the generalized sense of “discontinuous change”, rather than in the more technical sense of a discontinuity arising in the limit of infinite system size.

One way to understand this better is to explore if there's something like a "phase diagram" from physics, which could help us understand when a feature is expected to be in one of these regimes.  Although we can see hints of this in our previous experiment, it's hard to really isolate what's going on because many features are changing at once and there may be interaction effects. As a result, we set up the following experiment to better isolate the effects.

As an initial experiment, we consider models with 2 features but only 1 hidden layer dimension. We still consider the ReLU output model, \text{ReLU}(W^T W x + b). The first feature has an importance of 1.0. On one axis, we vary the importance of the 2nd "extra" feature from 0.1 to 10. On the other axis, we vary the sparsity of all features from 1.0 to 0.01. We then plot whether the 2nd "extra" feature is not learned, learned in superposition, or learned and represented orthogonally. To reduce noise, we train ten models for each point and average over the results, discarding the model with the highest loss.

We can compare this to a theoretical "toy model of the toy model" where we can get closed form solutions for the loss of different weight configurations as a function of importance and sparsity. There are three natural ways to store 2 features in 1 dimension: W=[1,0] (ignore [0,1], throwing away the extra feature), W=[0,1] (ignore [1,0], throwing away the first feature to give the extra feature a dedicated dimension), and W=[1,-1] (store the features in superposition, losing the ability to represent [1,1], the combination of both features at the same time). We call this last solution “antipodal” because the two basis vectors [1, 0] and [0, 1] are mapped in opposite directions. It turns out we can analytically determine the loss for these solutions (details can be found in this notebook).
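A quick numerical sanity check of how these three configurations compare (our own sketch; for simplicity it fixes b = 0, whereas the notebook also optimizes the bias, so the exact crossover points will differ):

import numpy as np

def loss(W, importance, sparsity, n_samples=200_000, rng=None):
    rng = rng or np.random.default_rng(0)
    # W is 1 x 2: one hidden dimension, two features; the model is ReLU(W^T W x).
    x = rng.uniform(0.0, 1.0, size=(n_samples, 2))
    x *= rng.uniform(0.0, 1.0, size=(n_samples, 2)) >= sparsity
    x_hat = np.maximum(x @ W.T @ W, 0.0)
    return (importance * (x - x_hat) ** 2).sum(axis=1).mean()

importance = np.array([1.0, 3.0])       # importance of the first feature and the "extra" feature
for name, W in [("ignore extra", np.array([[1.0, 0.0]])),
                ("ignore first", np.array([[0.0, 1.0]])),
                ("antipodal",    np.array([[1.0, -1.0]]))]:
    print(name, loss(W, importance, sparsity=0.99))

As importance and sparsity are varied, the configuration with the lowest loss changes, which is the crossover behind the phase diagram.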

As expected, sparsity is necessary for superposition to occur, but we can see that it interacts in an interesting way with relative feature importance. But most interestingly, there appears to be a real phase change, observed in both the empirical and theoretical diagrams! The optimal weight configuration discontinuously changes in magnitude and superposition. (In the theoretical model, we can analytically confirm that there's a first-order phase change: there's crossover between the functions, causing a discontinuity in the derivative of the optimal loss.)

We can ask this same question of embedding three features in two dimensions. This problem still has a single "extra feature" (now the third one) we can study, asking what happens as we vary its importance relative to the other two and change sparsity.

For the theoretical model, we now consider four natural solutions. We can describe solutions by asking "what feature direction did W ignore?" For example, W might just not represent the extra feature – we'll write this W \perp [0, 0, 1]. Or W might ignore one of the other features, W \perp [1, 0, 0]. But the interesting thing is that there are two ways to use superposition to make antipodal pairs. We can put the "extra feature" in an antipodal pair with one of the others (W \perp [0, 1, 1]) or put the other two features in superposition and give the extra feature a dedicated dimension (W \perp [1, 1, 0]). Details on the closed form losses for these solutions can be found in this notebook. We do not consider a last solution of putting all the features in joint superposition, W \perp [1, 1, 1].

These diagrams suggest that there really is a phase change between different strategies for encoding features. However, we'll see in the next section that there's much more complex structure this preliminary view doesn't capture.







The Geometry of Superposition

We've seen that superposition can allow a model to represent extra features, and that the number of extra features increases as we increase sparsity. In this section, we'll investigate this relationship in more detail, discovering an unexpected geometric story: features seem to organize themselves into geometric structures such as pentagons and tetrahedrons! In some ways, the structure described in this section seems "too elegant to be true" and we think there's a good chance it's at least partly idiosyncratic to the toy model we're investigating. But it seems worth investigating because if anything about this generalizes to real models, it may give us a lot of leverage in understanding their representations.

We'll start by investigating uniform superposition, where all features are identical: independent, equally important and equally sparse. It turns out that uniform superposition has a surprising connection to the geometry of uniform polytopes! Later, we'll move on to investigate non-uniform superposition, where features are not identical. It turns out that this can be understood, at least to some extent, as a deformation of uniform superposition.

Uniform Superposition

As mentioned above, we begin our investigation with uniform superposition, where all features have the same importance and sparsity. We'll see later that this case has some unexpected structure, but there's also a much more basic reason to study it: it's much easier to reason about than the non-uniform case, and has fewer variables we need to worry about in our experiments.

We'd like to understand what happens as we change feature sparsity, S. Since all features are equally important, we will assume without loss of generality (scaling the importance of all features by the same amount simply scales the loss, and does not change the optimal solutions) that each feature has importance I_i = 1. We'll study a model with n=400 features and m=30 hidden dimensions, but it turns out the number of features and hidden dimensions doesn't matter very much. In particular, it turns out that the number of input features n doesn't matter as long as it's much larger than the number of hidden dimensions, n \gg m. And it also turns out that the number of hidden dimensions doesn't really matter as long as we're interested in the ratio of features learned to hidden features. Doubling the number of hidden dimensions just doubles the number of features the model learns.

A convenient way to measure the number of features the model has learned is to look at the Frobenius norm, ||W||_F^2. Since ||W_i||^2\simeq 1 if a feature is represented and ||W_i||^2\simeq 0 if it is not, this is roughly the number of features the model has learned to represent. Conveniently, this norm is basis-independent, so it still behaves nicely in the dense regime S=0 where the feature basis isn't privileged by anything and the model represents features with arbitrary directions instead.

We'll plot D^* = m / ||W||_F^2, which we can think of as the "dimensions per feature":
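In code this is immediate (a sketch; W is the (m, n) weight matrix as in the earlier snippets):

frob_sq = (W ** 2).sum()          # ||W||_F^2, roughly the number of represented features
D_star = W.shape[0] / frob_sq     # m / ||W||_F^2: average "dimensions per feature"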

Surprisingly, we find that this graph is "sticky" at 1 and 1/2. (This very vaguely resembles the fractional quantum Hall effect – see e.g. this diagram.) Why is this? On inspection, the 1/2 "sticky point" seems to correspond to a precise geometric arrangement where features come in "antipodal pairs", each being exactly the negative of the other, allowing two features to be packed into each hidden dimension. It appears that antipodal pairs are so effective that the model preferentially uses them over a wide range of the sparsity regime.

It turns out that antipodal pairs are just the tip of the iceberg. Hiding underneath this curve are a number of extremely specific geometric configurations of features.

Feature Dimensionality

In the previous section, we saw that there's a sticky regime where the model has "half a dimension per feature" in some sense. This is an average statistical property of the features the model represents, but it seems to hint at something interesting. Is there a way we could understand what "fraction of a dimension" a specific feature gets?

We'll define the dimensionality of the ith feature, D_i, as:

D_i ~=~ \frac{||W_i||^2}{\sum_j (\hat{W_i} \cdot W_j)^2}

where W_i is the weight vector column associated with the ith feature, and \hat{W_i} is the unit version of that vector.

Intuitively, the numerator represents the extent to which a given feature is represented, while the denominator is "how many features share the dimension it is embedded in" by projecting each feature onto its dimension. In the antipodal case, each feature participating in an antipodal pair will have a dimensionality of D = 1 / (1+1) = 1/2 while features which are not learned will have a dimensionality of 0. Empirically, it seems that the dimensionality of all features add up to the number of embedding dimensions when the features are "packed efficiently" in some sense.
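This per-feature quantity is also easy to compute (a sketch reusing the norms and overlaps from the earlier diagnostic snippet):

norms_sq = (W ** 2).sum(dim=0)                      # ||W_i||^2 for each feature
W_hat = W / (norms_sq.sqrt() + 1e-8)                # unit feature directions
D = norms_sq / ((W_hat.T @ W) ** 2).sum(dim=1)      # D_i; the denominator includes the j = i term
print(D.sum())                                      # empirically close to m when features are packed efficiently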

We can now break the above plot down on a per-feature basis. This reveals many more of these "sticky points"! To help us understand this better, we're going to create a scatter plot annotated with some additional information:

Let's look at the resulting plot, and then we'll try to figure out what it's showing us:

What is going on with the points clustering at specific fractions?? We'll see shortly that the model likes to create specific weight geometries and kind of jumps between the different configurations.

In the previous section, we developed a theory of superposition as a phase change. But everything on this plot between 0 (not learning a feature) and 1 (dedicating a dimension to a feature) is superposition. Superposition is what happens when features have fractional dimensionality. That is to say – superposition isn't just one thing!

How can we relate this to our original understanding of the phase change? We often think of water as only having three phases: ice, water and steam. But this is a simplification: there are actually many phases of ice, often corresponding to different crystal structures (eg. hexagonal vs cubic ice). In a vaguely similar way, neural network features seem to also have many other phases within the general category of "superposition."

Why these geometric structures?

In the previous diagram, we found that there are distinct lines corresponding to dimensionality of: ¾ (tetrahedron), ⅔ (triangle), ½ (antipodal pair), ⅖ (pentagon), ⅜ (square antiprism), and 0 (feature not learned). We believe there would also be a 1 (dedicated dimension for a feature) line if not for the fact that basis features are indistinguishable from other directions in the dense regime.

Several of these configurations may jump out as solutions to the famous Thomson problem. (In particular, square antiprisms are much less famous than cubes and are primarily of note for their role in molecular geometry due to being a Thomson problem solution.) As we saw earlier, there is a very real sense in which our model can be understood as solving a generalized version of the Thomson problem. When our model chooses to represent a feature, the feature is embedded as a point on an m-dimensional sphere.

A second clue as to what's going on is that there are lines for the Thomson solutions which are uniform polyhedra (e.g. tetrahedron), but there seem to be split lines where we'd expect to see non-uniform solutions (e.g. instead of a ⅗ line for triangular bipyramids we see a co-occurrence of points at ⅔ for triangles and points at ½ for antipodal pairs). In a uniform polyhedron, all vertices have the same geometry, and so if we embed features as them each feature has the same dimensionality. But if we embed features as a non-uniform polyhedron, different features will have more or less interference with others.

In particular, many of the Thomson solutions can be understood as tegum products (an operation which constructs polytopes  by embedding two polytopes in orthogonal subspaces) of smaller uniform polytopes. (In the earlier graph visualizations of feature geometry, two subgraphs are disconnected if and only if they are in different tegum factors.) As a result, we should expect their dimensionality to actually correspond to the underlying factor uniform polytopes.

This also suggests a possible reason why we observe 3D Thomson problem solutions, despite the fact that we're actually studying a higher dimensional version of the problem. Just as many 3D Thomson solutions are tegum products of 2D and 1D solutions, perhaps higher dimensional solutions are often tegum products of 1D, 2D, and 3D solutions.

The orthogonality of factors in tegum products has interesting implications. For the purposes of superposition, it means that there can't be any "interference" across tegum-factors. This may be preferred by the toy model: having many features interfere simultaneously could be really bad for it. (See related discussion in our earlier mathematical analysis.)

Aside: Polytopes and Low-Rank Matrices

At this point, it's worth making explicit that there's a correspondence between polytopes and symmetric, positive-definite, low-rank matrices (i.e. matrices of the form W^TW). This correspondence underlies the results we saw in the previous section, and is generally useful for thinking about superposition.

In some ways, the correspondence is trivial. If one has a rank-m n\!\times\!n matrix of the form W^TW, then W is an m\!\times\!n matrix. We can interpret the columns of W as n points in an m-dimensional space. The place where this starts to become interesting is that it makes it clear that W^TW is driven by the geometry. In particular, we can see how the off-diagonal terms are driven by the geometry of the points.

Put another way, there's an exact correspondence between polytopes and strategies for superposition. For example, every strategy for putting three features in superposition in a 2-dimensional space corresponds to a triangle, and every triangle corresponds to such a strategy. From this perspective, it doesn't seem surprising that if we have three equally important and equally sparse features, the optimal strategy is an equilateral triangle.

This correspondence also goes the other direction. Suppose we have a rank (n\!-\!i)-matrix of the form W^TW. We can characterize it by the dimensions W did not represent – that is, which directions are orthogonal to W? For example, if we have a (n\!-\!1)-matrix, we might ask what single direction did W not represent? This is especially informative if we assume that W^TW will be as "identity-like" as possible, given the constraint of not representing certain vectors.

In fact, given such a set of orthogonal vectors, we can construct a polytope by starting with n basis vectors and projecting them onto the space orthogonal to the given vectors. For example, if we start in three dimensions and then project such that W \perp (1,1,1), we get a triangle. More generally, setting W \perp (1,1,1,...) gives us a regular (n\!-\!1)-simplex with n vertices. This is interesting because it's in some sense the "minimal possible superposition." Assuming that features are equally important and sparse, the best possible direction to not represent is the fully dense vector (1,1,1,...)!
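
As a quick sanity check of this construction, here is a small numpy sketch (ours, not from the paper's notebook) that projects the n standard basis vectors onto the subspace orthogonal to (1,1,\ldots,1) and confirms the resulting Gram matrix has the equal-diagonal, equal-off-diagonal structure of a regular simplex:

```python
import numpy as np

n = 4
ones = np.ones(n) / np.sqrt(n)
# Projection onto the subspace orthogonal to the all-ones direction.
P = np.eye(n) - np.outer(ones, ones)

# Columns of W are the projected basis vectors (the polytope's vertices).
W = P @ np.eye(n)

gram = W.T @ W
print(np.round(gram, 3))
# All diagonal entries are equal and all off-diagonal entries are equal,
# so the n points are pairwise equidistant: a regular simplex.
```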

Non-Uniform Superposition

So far, this section has focused on the geometry of uniform superposition, where all features are of equal importance, equal sparsity, and independent. The model is essentially solving a variant of the Thomson problem. Because all features are the same, solutions corresponding to uniform polyhedra get especially low loss. In this subsection, we'll study non-uniform superposition, where features are somehow not uniform. They may vary in importance and sparsity, or have a correlational structure that makes them not independent. This distorts the uniform geometry we saw earlier.

In practice, it seems like superposition in real neural networks will be non-uniform, so developing an understanding of it seems important. Unfortunately, we're far from a comprehensive theory of the geometry of non-uniform superposition at this point. As a result, the goal of this section will merely be to highlight some of the more striking phenomena we observe.

We attempt to illustrate these phenomena with some representative experiments below.

Perturbing a Single Feature

The simplest kind of non-uniform superposition is to vary one feature while leaving the others uniform. As an experiment, let's consider representing n=5 features in m=2 dimensions. In the uniform case, with importance I=1 and activation density 1-S=0.05, we get a regular pentagon. But if we vary one point – in this case by making it more or less sparse – we see the pentagon stretch to account for the new value. If we make it denser, activating more frequently (yellow), the other features repel from it, giving it more space. On the other hand, if we make it sparser, activating less frequently (blue), it takes up less space and the other points push towards it.

If we make it sufficiently sparse, there's a phase change, and it collapses from a pentagon to a pair of digons with the sparser point at zero. The phase change corresponds to the loss curves of the two different geometries crossing over. (This observation allows us to directly confirm that it is genuinely a first order phase change.)

To visualize the solutions, we canonicalize them, rotating them to align with each other in a consistent manner.

These results seem to suggest that, at least in some cases, non-uniform superposition can be understood as deformations of uniform superposition and jumps between uniform superposition configurations, rather than as a totally different regime. Since uniform superposition has a lot of understandable structure, but real world superposition is almost certainly non-uniform, this seems very promising!

The reason pentagonal solutions are not on the unit circle is that models reduce the effect of positive interference, setting a slight negative bias to cut off noise and setting their weights to ||W_i|| = 1 / (1-b_i) to compensate. Distance from the unit circle can be interpreted as primarily driven by the amount of positive interference.

A note for reimplementations: optimizing with a two-dimensional hidden space makes this easier to study, but the actual optimization problem turns out to be really challenging for gradient descent – a lot harder than even just having three dimensions. Getting clean results required fitting each model multiple times and taking the solution with the lowest loss. However, there's a silver lining to this: visualizing the sub-optimal solutions on a scatter plot as above allows us to see the loss curves for different geometries and gain greater insight into the phase change.
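
To make the restart procedure concrete, here is a minimal PyTorch sketch (ours, with arbitrary hyperparameters; the paper's actual training setup may differ) of fitting the uniform-importance ReLU output model from multiple random seeds and keeping the lowest-loss run:

```python
import torch

def train_toy_model(n=5, m=2, sparsity=0.95, steps=5_000, seed=0):
    # One ReLU output toy model, x' = ReLU(W^T W x + b), fit to synthetic sparse data.
    torch.manual_seed(seed)
    W = torch.nn.Parameter(0.1 * torch.randn(m, n))
    b = torch.nn.Parameter(torch.zeros(n))
    opt = torch.optim.Adam([W, b], lr=1e-3)
    for _ in range(steps):
        x = torch.rand(1024, n) * (torch.rand(1024, n) > sparsity).float()
        x_hat = torch.relu(x @ (W.T @ W) + b)
        loss = ((x - x_hat) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return W.detach(), b.detach(), loss.item()

# Fit many seeds and keep the lowest-loss solution; the sub-optimal runs are
# still useful for scatter-plotting the losses of the competing geometries.
runs = [train_toy_model(seed=s) for s in range(20)]
best_W, best_b, best_loss = min(runs, key=lambda r: r[2])
```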

Correlated and Anticorrelated Features

A more complicated form of non-uniform superposition occurs when there are correlations between features. This seems essential for understanding superposition in the real world, where many features are correlated or anti-correlated.

For example, one very pragmatic question to ask is whether we should expect polysemantic neurons to group the same features together across models. If the groupings were random, you could use this to detect polysemantic neurons, by comparing across models! However, we'll see that correlational structure strongly influences which features are grouped together in superposition.

The behavior seems to be quite nuanced, with a kind of "order of preferences" for how correlated features behave in superposition. The model ideally represents correlated features orthogonally, in separate tegum factors with no interactions between them. When that fails, it prefers to arrange them so that they're as close together as possible – it prefers positive interference between correlated features over negative interference. Finally, when there isn't enough space to represent all the correlated features, it will collapse them and represent their principal component instead! Conversely, when features are anti-correlated, models prefer to have them interfere, especially with negative interference. We'll demonstrate this with a few experiments below.

Setup for Exploring Correlated and Anticorrelated Features

Throughout this section we'll refer to "correlated feature sets" and "anticorrelated feature sets".

Correlated Feature Sets. Our correlated feature sets can be thought of as "bundles" of co-occurring features. One can imagine a highly idealized version of what might happen in an image classifier: there could be a bundle of features used to identify animals (fur, ears, eyes) and another bundle used to identify buildings (corners, windows, doors). Features from one of these bundles are likely to appear together. Mathematically, we represent this by tying together, across all features in a correlated feature set, the choice of whether they are zero. Recall that we originally defined our synthetic distribution to have features be zero with probability S and otherwise uniformly distributed between [0,1]. We simply have the same sample determine whether all features in the set are zero.

Anticorrelated Feature Sets. One could also imagine anticorrelated features which are extremely unlikely to occur together. To simulate these, we'll have anticorrelated feature sets in which only one feature can be active at a time: the feature set is entirely zero with probability S; otherwise, a single randomly selected feature in the set is sampled uniformly from [0,1], and the others are zero.
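
The following is a small numpy sketch (ours, not from the paper's notebook) of both data generation schemes; shapes and batch sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def correlated_batch(batch, set_sizes, S):
    # Each correlated set is zero or active together; active features are U[0, 1].
    cols = []
    for k in set_sizes:
        active = (rng.random((batch, 1)) > S)          # one on/off draw per set
        cols.append(rng.random((batch, k)) * active)   # shared mask across the bundle
    return np.concatenate(cols, axis=1)

def anticorrelated_batch(batch, set_sizes, S):
    # Each set is entirely zero with probability S; otherwise exactly one
    # randomly chosen feature in the set is active, sampled from U[0, 1].
    cols = []
    for k in set_sizes:
        active_set = (rng.random(batch) > S)
        chosen = rng.integers(k, size=batch)
        one_hot = np.eye(k)[chosen]                    # which feature in the set is active
        cols.append(one_hot * rng.random((batch, 1)) * active_set[:, None])
    return np.concatenate(cols, axis=1)

x = correlated_batch(1024, set_sizes=[3, 3], S=0.9)    # e.g. two bundles of three features
```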

Organization of Correlated and Anticorrelated Features

For our initial investigation, we simply train a number of small toy models with correlated and anti-correlated features and observe what happens. To make this easy to study, we limit ourselves to the m=2 case where we can explicitly visualize the weights as points in 2D space. In general, such solutions can be understood as a collection of points on a unit circle. To make solutions easy to compare, we rotate and flip solutions to align with each other.

Local Almost-Orthogonal Bases

It turns out that the tendency of models to arrange correlated features to be orthogonal is actually quite a strong phenomenon. In particular, for larger models, it seems to generate a kind of "local almost-orthogonal basis" where, even though the model as a whole is in superposition, the correlated feature sets considered in isolation are (nearly) orthogonal and can be understood as having very little superposition.

To investigate this, we train a larger model with two sets of correlated features and visualize W^TW.

If this result holds in real neural networks, it suggests we might be able to make a kind of "local non-superposition" assumption, where for certain sub-distributions we can assume that the activating features are not in superposition. This could be a powerful result, allowing us to confidently use methods such as PCA which might not be principled to generally use in the context of superposition.

Collapsing of Correlated Features

One of the most interesting properties is that there seems to be a trade-off between Principal Components Analysis (PCA) and superposition. If there are two correlated features a and b, but the model only has the capacity to represent one direction, the model will represent their principal component (a+b)/\sqrt{2}, a sparse variable that has more impact on the loss than either feature individually, and ignore the second principal component (a-b)/\sqrt{2}.
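
As a tiny numerical illustration (ours), the top principal component of a pair of features that always co-occur is indeed approximately the (a+b)/\sqrt{2} direction:

```python
import numpy as np

rng = np.random.default_rng(0)
active = rng.random((10000, 1)) > 0.7        # the pair is on or off together
ab = rng.random((10000, 2)) * active         # a and b sampled independently when on

cov = np.cov(ab, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
print(eigvecs[:, -1])   # top principal component, close to (1, 1)/sqrt(2) up to sign
```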

As an experiment, we consider six features, organized into three sets of correlated pairs. Features in each correlated pair are represented by a given color (red, green, and blue). The correlation is created by having both features always activate together – they're either both zero or neither zero. (The exact non-zero values they take when they activate is uncorrelated.)

As we vary the sparsity of the features, we find that in the very sparse regime, we observe superposition as expected, with features arranged in a hexagon and correlated features side-by-side. As we decrease sparsity, the features progressively "collapse" into their principal components. In very dense regimes, the solution becomes equivalent to PCA.

These results seem to hint that PCA and superposition are in some sense complementary strategies which trade off with one another. As features become more correlated, PCA becomes a better strategy. As features become sparser, superposition becomes a better strategy. When features are both sparse and correlated, mixtures of each strategy seem to occur. It would be nice to more deeply understand this space of tradeoffs.

It's also interesting to think about this in the context of continuous equivariant features, such as features which occur in different rotations.







Superposition and Learning Dynamics

The focus of this paper is how superposition contributes to the functioning of fully trained neural networks, but as a brief detour it's interesting to ask how our toy models – and the resulting superposition – evolve over the course of training.

There are several reasons why these models seem like a particularly interesting case for studying learning dynamics. Firstly, unlike most neural networks, the fully trained models converge to a simple but non-trivial structure that rhymes with an emerging thread of evidence that neural network learning dynamics might have geometric weight structure that we can understand. One might hope that understanding the final structure would make it easier for us to understand the evolution over training. Secondly, superposition hints at surprisingly discrete structure (regular polytopes of all things!). We'll find that the underlying learning dynamics are also surprisingly discrete, continuing an emerging trend of evidence that neural network learning might be less continuous than it seems. Finally, since superposition has significant implications for interpretability, it would be nice to understand how it emerges over training – should we expect models to use superposition early on, or is it something that only emerges later in training, as models struggle to fit more features in?

Unfortunately, we aren't able to give these questions the detailed investigation they deserve within the scope of this paper. Instead, we'll limit ourselves to a couple particularly striking phenomena we've noticed, leaving more detailed investigation for future work.

Phenomenon 1: Discrete "Energy Level" Jumps

Perhaps the most striking phenomenon we've noticed is that the learning dynamics of toy models with large numbers of features appear to be dominated by "energy level jumps" where features jump between different feature dimensionalities. (Recall that a feature's dimensionality is the fraction of a dimension dedicated to representing a feature.)

Let's consider the problem setup we studied when investigating the geometry of uniform superposition in the previous section, where we have a large number of features of equal importance and sparsity. As we saw previously, the features ultimately arrange themselves into a small number of polytopes with fractional dimensionalities.

A natural question to ask is what happens to these feature dimensionalities over the course of training. Let's pick one model where all the features converge into digons and observe. In the first plot, each colored line corresponds to the dimensionality of a single feature. The second plot shows how the loss curve changes over the same duration.

Note how the dimensionality of some features "jump" between different values and swap places. As this happens, the loss curve also undergoes a sudden drop (a very small one at the first jump, and a larger one at the second jump).

These results make us suspect that seemingly smooth decreases of the loss curve in larger models are in fact composed of many small jumps of features between different configurations. (For similar results of sudden mechanistic changes, see Olsson et al.'s induction head phase change, and Nanda and Lieberum's results on phase changes in modular arithmetic. More broadly, consider the phenomenon of grokking.)

Phenomenon 2: Learning as Geometric Transformations

Many of our toy model solutions can be understood as corresponding to geometric structures. This is especially easy to see and study when there are only m=3 hidden dimensions, since we can just directly visualize the feature embeddings as points in 3D space forming a polyhedron.

It turns out that, at least in some cases, the learning dynamics leading to these structures can be understood as a sequence of simple, independent geometric transformations!

One particularly interesting example of this phenomenon occurs in the context of correlated features, as studied in the previous section. Consider the problem of representing n=6 features in superposition within m=3 dimensions. If we have the 6 features be 2 sets of 3 correlated features, we observe a really interesting pattern. The learning proceeds in distinct regimes which are visible in the loss curve, with each regime corresponding to a distinct geometric transformation:

(Although the last solution – an octahedron with features from different correlated sets arranged in antipodal pairs – seems to be a strong attractor, the learning trajectory visualized above appears to be one of a few different learning trajectories that attract the model. The different trajectories vary at step C: sometimes the model gets pulled directly into the antiprism configuration from the start, or first organizes features into antipodal pairs. Presumably this depends on which feature geometry the model is closest to when step B ends.)

The learning dynamics we observe here seem directly related to previous findings on simple models. Prior work found that two-layer neural networks, in early stages of training, tend to learn a linear approximation to a problem. Although the technicalities of our data generation process do not precisely match the hypotheses of their theorem, it seems likely that the same basic mechanism is at work. In our case, we see the toy network learns a linear PCA solution before moving to a better nonlinear solution. A second related finding comes from work that looked at hierarchical sets of features, with a data generation process similar to the one we consider. They find empirically that certain networks (nonlinear and deep linear) "split" embedding vectors in a manner very much like what we observed. They also provide a theoretical analysis in terms of the underlying dynamical system. A key difference is that they focus on the topology – the branching structure of the emerging feature representations – rather than the geometry. Despite this difference, it seems likely that their analysis could be generalized to our case.







Relationship to Adversarial Robustness

Although we're most interested in the implications of superposition for interpretability, there appears to be a connection to adversarial examples. If one gives it a little thought, this connection can actually be quite intuitive.

In a model without superposition, the end-to-end weights for the first feature are:

(W^TW)_0 ~~=~~ (1,~ 0,~ 0,~ 0,~ ...)

But in a model with superposition, it's something like:

(W^TW)_0 ~~=~~ (1,~ \epsilon,~ -\epsilon,~ \epsilon,~ ...)

The \epsilon entries (which are solely an artifact of superposition "interference") create an obvious way for an adversary to attack the most important feature. Note that this may remain true even in the infinite data limit: the optimal behavior of the model fit to sparse infinite data is to use superposition to represent more features, leaving it vulnerable to attack.

To test this, we generated L2 adversarial examples (allowing a max L2 attack norm of 0.1 of the average input norm). We originally generated attacks with gradient descent, but found that for extremely sparse examples where ReLU neurons are in the zero regime 99% of the time, attacks were difficult, effectively due to gradient masking. Instead, we found it worked better to analytically derive adversarial attacks by considering the optimal L2 attacks for each feature (\lambda (W^TW)_i / ||(W^TW)_i||_2) and taking the one of these attacks which most harms model performance.
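
A minimal numpy sketch of this analytic attack (our reconstruction of the procedure described above, applied to the ReLU output model's reconstruction loss) looks like:

```python
import numpy as np

def analytic_l2_attack(W, b, x, importance, eps):
    # Candidate attacks: for each feature i, perturb the input along the i-th row
    # of W^T W, rescaled to the attack budget eps, in both the + and - directions.
    # x: a single sparse input of shape (n,).
    WtW = W.T @ W
    best_delta, worst_loss = None, -np.inf
    for i in range(WtW.shape[0]):
        direction = WtW[i] / np.linalg.norm(WtW[i])
        for sign in (+1.0, -1.0):
            delta = sign * eps * direction
            x_hat = np.maximum(0.0, (x + delta) @ WtW + b)
            loss = np.mean(importance * (x_hat - x) ** 2)   # error against the clean target
            if loss > worst_loss:
                worst_loss, best_delta = loss, delta
    return best_delta, worst_loss
```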

We find that vulnerability to adversarial examples sharply increases as superposition forms (increasing by >3x), and that the level of vulnerability closely tracks the number of features per dimension (the reciprocal of feature dimensionality).

We're hesitant to speculate about the extent to which superposition is responsible for adversarial examples in practice. There are compelling theories for why adversarial examples occur that make no reference to superposition. But it is interesting to note that if one wanted to try to argue for a "superposition maximalist stance", it does seem like many interesting phenomena related to adversarial examples can be predicted from superposition. As seen above, superposition can be used to explain why adversarial examples exist. It also predicts that adversarially robust models would have worse performance, since making models robust would require giving up superposition and representing fewer features. It predicts that more adversarially robust models might be more interpretable. Finally, it could arguably predict that adversarial examples transfer, if the arrangement of features in superposition is heavily influenced by which features are correlated or anti-correlated (see earlier results on this). It might be interesting for future work to see how far the hypothesis that superposition is a significant contributor to adversarial examples can be driven.

In addition to observing that superposition can cause models to be vulnerable to adversarial examples, we briefly experimented with adversarial training to see if the relationship could be used in the other direction to reduce superposition. To keep training reasonably efficient, we used the analytic optimal attack against a random feature. We found that this did reduce superposition, but attacks had to be made unreasonably large (80% input L2 norm) to fully eliminate it, which didn't seem satisfying. Perhaps stronger adversarial attacks would work better. We didn't explore this further since the increased cost and complexity of adversarial training made us want to prioritize other lines of attack on superposition first.







Superposition in a Privileged Basis

So far, we've explored superposition in a model without a privileged basis. We can rotate the hidden activations arbitrarily and, as long as we rotate all the weights, have the exact same model behavior. That is, for any ReLU output model with weights W, we could take an arbitrary orthogonal matrix O and consider the model W' = OW. Since (OW)^T(OW) = W^TW, the result would be an identical model!
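
This invariance is easy to check numerically; for example (a small numpy sketch of the identity above):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 5
W = rng.normal(size=(m, n))

# Random orthogonal matrix O from the QR decomposition of a Gaussian matrix.
O, _ = np.linalg.qr(rng.normal(size=(m, m)))

W_rot = O @ W
print(np.allclose(W_rot.T @ W_rot, W.T @ W))   # True: the model is identical
```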

Models without a privileged basis are elegant, and can be an interesting analogue for certain neural network representations which don't have a privileged basis – word embeddings, or the transformer residual stream. But we'd also (and perhaps primarily) like to understand neural network representations where there are neurons which do impose a privileged basis, such as transformer MLP layers or conv net neurons.

Our goal in this section is to explore the simplest toy model which gives us a privileged basis. There are at least two ways we could do this: we could add an activation function or apply L1 regularization to the hidden layer. We'll focus on adding an activation function, since the representation we are most interested in understanding is hidden layers with neurons, such as the transformer MLP layer.

This gives us the following "ReLU hidden layer" model:

h ~=~ \text{ReLU}(Wx), \qquad x' ~=~ \text{ReLU}(W^Th+b)

We'll train this model on the same data as before.
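
In code, the only change from the earlier model is one extra ReLU on the hidden layer; a minimal PyTorch sketch (ours, with arbitrary initialization) is:

```python
import torch

class ReLUHiddenLayerModel(torch.nn.Module):
    # h = ReLU(W x), x' = ReLU(W^T h + b), with tied weights W of shape (m, n).
    def __init__(self, n_features, m_hidden):
        super().__init__()
        self.W = torch.nn.Parameter(0.1 * torch.randn(m_hidden, n_features))
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = torch.relu(x @ self.W.T)           # hidden neurons: the privileged basis
        return torch.relu(h @ self.W + self.b)
```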

Adding a ReLU to the hidden layer radically changes the model from an interpretability perspective. The key thing is that while W in our previous model was challenging to interpret (recall that we visualized W^TW rather than W), W in the ReLU hidden layer model can be directly interpreted, since it connects features to basis-aligned neurons.

We'll discuss this in much more detail shortly, but here's a comparison of weights resulting from a linear hidden layer model and a ReLU hidden layer model:

Recall that we think of basis elements in the input as "features," and basis elements in the middle layer as "neurons". Thus W is a map from features to neurons.

What we see in the above plot is that the features are aligning with neurons in a structured way! Many of the neurons are simply dedicated to representing a feature! (This is the critical property that justifies why neuron-focused interpretability approaches – such as much of the work in the original Circuits thread – can be effective in some circumstances.)

Let's explore this in more detail.

Visualizing Superposition in Terms of Neurons

Having a privileged basis opens up new possibilities for visualizing our models. As we saw above, we can simply inspect W. We can also make a per-neuron stacked bar plot where, for every neuron, we visualize its weights as a stack of rectangles on top of each other:

This stack plot visualization can be nice as models get bigger. It also makes polysemantic neurons obvious: they simply correspond to having more than one weight.
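
One simple way to produce such a stack plot (a sketch of the visualization idea with matplotlib, not the paper's figure code) is to stack each neuron's absolute weights for every feature:

```python
import numpy as np
import matplotlib.pyplot as plt

def neuron_stack_plot(W, feature_colors=None):
    # W has shape (m_neurons, n_features); each neuron gets one stacked bar.
    m, n = W.shape
    mags = np.abs(W)
    bottom = np.zeros(m)
    for i in range(n):
        color = None if feature_colors is None else feature_colors[i]
        plt.bar(np.arange(m), mags[:, i], bottom=bottom, color=color)
        bottom += mags[:, i]
    plt.xlabel("neuron"); plt.ylabel("stacked |weight|")
    plt.show()
```

A neuron whose bar contains more than one sizable segment is polysemantic; a bar dominated by a single segment corresponds to a monosemantic neuron.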

We'll now visualize a ReLU hidden layer toy model with n=10;~ m=5; I_i = 0.75^i and varying feature sparsity levels. We chose a very small model (only 5 neurons) both for ease of visualization, and to circumvent some issues with this toy model we'll discuss below.

However, we found that these small models were harder to optimize. For each model shown, we trained 1000 models and visualized the one with the lowest loss. Although the typical solutions are often similar to the minimal loss solutions shown, selecting the minimal loss solutions reveals even more structure in how features align with neurons. It also reveals that there are ranges of sparsity values where the optimal solution for all models trained on data with that sparsity have the same weight configurations.

The solutions are visualized below, both visualizing the raw W and a neuron stacked bar plot. We color features in the stacked bar plot based on whether they're in superposition, and color neurons as being monosemantic or polysemantic depending on whether they store more than one feature. Neuron order was chosen by hand (since it's arbitrary).

The most important thing to pay attention to is how there's a shift from monosemantic to polysemantic neurons as sparsity increases. Monosemantic neurons do exist in some regimes! Polysemantic neurons exist in others. And they can both exist in the same model! Moreover, while it's not quite clear how to formalize this, it looks a great deal like there's a neuron-level phase change, mirroring the feature phase changes we saw earlier.

It's also interesting to examine the structure of the polysemantic solutions, which turn out to be surprisingly structured and neuron-aligned. Features typically correspond to sets of neurons (monosemantic neurons might be seen as the special case where features correspond to singleton sets). There's also structure in how polysemantic the neurons are. They transition from monosemantic, to only representing a few features, to gradually representing more. However, it's unclear how much of this is generalizable to real models.

Limitations of The ReLU Hidden Layer Toy Model Simulating Identity

Unfortunately, the toy model described in this section has a significant weakness, which limits the regimes in which it shows interesting results. The issue is that the model doesn't benefit from the ReLU hidden layer – it has no role except limiting how the model can encode information. If given any chance, the model will circumvent it. For example, given a hidden layer bias, the model will set all the biases to be positive, shifting the neurons into a positive regime where they behave linearly. If one removes the bias, but gives the model enough features, it will simulate a bias by averaging over many features. The model will only use the ReLU activation function if absolutely forced, which is a significant mark against studying this toy model.

We'll introduce a model without this issue in the next section, but wanted to study this model as a simpler case study.







Computation in Superposition

So far, we've shown that neural networks can store sparse features in superposition and then recover them. But we actually believe superposition is more powerful than this – we think that neural networks can perform computation entirely in superposition rather than just using it as storage. This model will also give us a more principled way to study a privileged basis where features align with basis dimensions.

To explore this, we consider a new setup where we imagine our input and output layer to be the layers of our hypothetical disentangled model, but have our hidden layer be a smaller layer we're imagining to be the observed model which might use superposition. We'll then try to compute a simple non-linear function and explore whether it can use superposition to do this. Since the model will have (and need to use) the hidden layer non-linearity, we'll also see features align with a privileged basis.

Specifically, we'll have the model compute y=\text{abs}(x). Absolute value is an appealing function to study because there's a very simple way to compute it with ReLU neurons: \text{abs}(x) = \text{ReLU}(x) + \text{ReLU}(-x). This simple structure will make it easy for us to study the geometry of how the hidden layer is leveraged to do computation.

Since this model needs ReLU to compute absolute value, it doesn't have the issues the model in the previous section had with trying to avoid the activation function.

Experiment Setup

The input feature vector, x, is still sparse, with each feature x_i having probability S_i of being 0. However, since we want to have the model compute absolute value, we need to allow it to take on negative values for this to be a non-trivial task. As a result, if it is non-zero, its value is now sampled uniformly from [-1,1]. The target output y is y=\text{abs}(x).

Following the previous section, we'll consider the "ReLU hidden layer" toy model variant, but no longer tie the two weights to be identical:

h ~=~ \text{ReLU}(W_1x), \qquad y' ~=~ \text{ReLU}(W_2h+b)

The loss is still the mean squared error weighted by feature importances I_i as before.
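
A minimal end-to-end PyTorch sketch of this setup (ours, with arbitrary small hyperparameters roughly matching the first example below) is:

```python
import torch

n, m, batch, S = 3, 6, 1024, 0.9             # small values chosen only for illustration
importance = torch.ones(n)                    # could instead be e.g. 0.8 ** torch.arange(n)

W1 = torch.nn.Parameter(0.1 * torch.randn(m, n))
W2 = torch.nn.Parameter(0.1 * torch.randn(n, m))
b  = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W1, W2, b], lr=1e-3)

for step in range(10_000):
    # Sparse inputs in [-1, 1]; the target is the elementwise absolute value.
    x = (2 * torch.rand(batch, n) - 1) * (torch.rand(batch, n) > S).float()
    y = x.abs()
    h = torch.relu(x @ W1.T)                  # hidden neurons
    y_hat = torch.relu(h @ W2.T + b)
    loss = (importance * (y - y_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```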

Basic Results

With this model, it's a bit less straightforward to study how individual features get embedded; because of the ReLU on the hidden layer, we can't just study W_2W_1. And because W_2 and W_1 are now learned independently, we can't just study columns of W_1. We believe that with some manipulation we could recover much of the simplicity of the earlier model by considering "positive features" and "negative features" independently, but we're going to focus on another perspective instead.

As we saw in the previous section, having a hidden layer activation function means that it makes sense to visualize the weights in terms of neurons. We can visualize the weights directly or as a neuron stack plot as we did before. We can also visualize the model as a graph, which can sometimes be helpful for understanding computation.

Let's look at what happens when we train a model with n=3 features to perform absolute value on m=6 hidden layer neurons. Without superposition, the model needs two hidden layer neurons to implement absolute value on one feature.

The resulting model – modulo a subtle issue about rescaling input and output weights (there's a degree of freedom for the model in learning W_1: we can rescale any hidden unit by scaling its row of W_1 by \alpha and its column of W_2 by \alpha^{-1} and arrive at the same model; for consistency in the visualization, we rescale each hidden unit before visualizing so that the largest-magnitude weight to that neuron from W_1 has magnitude 1) – performs absolute value exactly as one might expect. For each input feature x_i, it constructs a "positive side" neuron \text{ReLU}(x_i) and a "negative side" neuron \text{ReLU}(-x_i). It then adds these together to compute absolute value:

Superposition vs Sparsity

We've seen that – as expected – our toy model can learn to implement absolute value. But can it use superposition to compute absolute value for more features? To test this, we train models with n=100 features and m=40 neurons and a feature importance curve I_i = 0.8^i, varying feature sparsity. (These specific values were chosen to illustrate the phenomenon we're interested in: the absolute value model learns more easily when there are more neurons, but we wanted to keep the numbers small enough that it could be easily visualized.)

A couple of notes on visualization: Since we're primarily interested in understanding superposition and polysemantic neurons, we'll show a stacked weight plot of the absolute values of weights. The features are colored by superposition. To make the diagrams easier to read, neurons are faintly colored based on how polysemantic they are (as judged by eye based on the plots). Neuron order is sorted by the importance of the largest feature.

Much like we saw in the ReLU hidden layer models, these results demonstrate that activation functions, under the right circumstances, create a privileged basis and cause features to align with basis dimensions. In the dense regime, we end up with each neuron representing a single feature, and we can read feature values directly off of neuron activations.

However, once the features become sufficiently sparse, this model, too, uses superposition to represent more features than it has neurons. This result is notable because it demonstrates the ability of neural networks to perform computation even on data that is represented in superposition. (One question you might ask is whether we can quantify the ability of superposition to enable extra computation by examining the loss. Unfortunately, we can't easily do this. Superposition occurs when we change the task, making it sparser. As a result, the losses of models with different amounts of superposition are not comparable – they're measuring the loss on different tasks!) Remember that the model is required to use the hidden layer ReLU in order to compute an absolute value; gradient descent manages to find solutions that usefully approximate the computation even when each neuron encodes a mix of multiple features.

Focusing on the intermediate sparsity regimes, we find several additional qualitative behaviors that are fascinatingly reminiscent of behavior that has been observed in real, full-scale neural networks:

To begin, we find that in some regimes, many of the model's neurons will encode pure features, but a subset of them will be highly polysemantic. This is similar to the phase change we saw earlier in the ReLU output model. However, in that case, the phase change was with respect to features, with more important features not being put in superposition. In this experiment, the neurons don't have any intrinsic importance, but we see that the neurons representing the most important features (on the left) tend to be monosemantic.

We find this to bear a suggestive resemblance to some previous work in vision models, which found some layers that contained "mostly pure" feature neurons, but with some neurons representing additional features on a different scale.

We also note that many neurons appear to be associated with a single "primary" feature – encoded by a relatively large weight – coupled with one or more "secondary" features encoded with smaller-magnitude weights to that neuron. If we were to observe the activations of such a neuron over a range of input examples, we would find that the largest activations of that neuron were all or nearly-all associated with the presence of the "primary" feature, but that the lower-magnitude activations were much more polysemantic.

Intriguingly, that description closely matches what researchers have found in previous work on language models – many neurons appear interpretable when we examine their strongest activations over a dataset, but can be shown on further investigation to activate for other meanings or patterns, often at a lower magnitude. While only suggestive, the ability of our toy model to reproduce these qualitative features of larger neural networks offers an exciting hint that these models are illuminating general phenomena.

The Asymmetric Superposition Motif

If neural networks can perform computation in superposition, a natural question is to ask how exactly they're doing so. What does that look like mechanically, in terms of the weights? In this subsection, we'll (mostly) work through one such model and see an interesting motif of asymmetric superposition. (We use the term "motif" in the sense of the original Circuits thread, inspired by its use in systems biology.)

The model we're trying to understand is shown below on the left, visualized as a neuron weight stack plot, with features corresponding to colors. The model is only doing a limited amount of superposition, and many of the weights can be understood as simply implementing absolute value in the expected way.

However, there are a few neurons doing something else…

These other neurons implement two instances of asymmetric superposition and inhibition. Each instance consists of two neurons:

One neuron does asymmetric superposition. In normal superposition, one might store features with equal weights (e.g. W=[1,-1]) and then have equal output weights (W=[1,1]). In asymmetric superposition, one stores the features with different magnitudes (e.g. W=[2,-\frac{1}{2}]) and then has reciprocal output weights (e.g. W=[\frac{1}{2}, 2]). This causes one feature to heavily interfere with the other, while avoiding the other interfering with the first!

To avoid the consequences of that interference, the model has another neuron heavily inhibit the feature in the case where there would have been positive interference. This essentially converts positive interference (which could greatly increase the loss) into negative interference (which has limited consequences due to the output ReLU).

There are a few other weights this doesn't explain. (We believe they're effectively small conditional biases.) But this asymmetric superposition and inhibition pattern appears to be the primary story.
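
To make the asymmetric part of this motif concrete, here is a tiny numeric illustration (ours, using the idealized weights from the description above) showing how the interference becomes one-sided; the large leftover interference on the second feature is what the inhibitory neuron then cancels:

```python
import numpy as np

# One hidden neuron storing two features with asymmetric magnitudes.
w_in  = np.array([2.0, -0.5])    # how strongly each feature writes to the neuron
w_out = np.array([0.5,  2.0])    # reciprocal read-out weights

def outputs(x):
    h = max(0.0, w_in @ x)       # ReLU hidden neuron
    return w_out * h             # contribution to each feature's output

print(outputs(np.array([1.0, 0.0])))   # feature 0 alone -> [1., 4.]: heavy interference on feature 1
print(outputs(np.array([0.0, 1.0])))   # feature 1 alone -> [0., 0.]: no interference on feature 0
```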







The Strategic Picture of Superposition

Although superposition is scientifically interesting, much of our interest comes from a pragmatic motivation: we believe that superposition is deeply connected to the challenge of using interpretability to make claims about the safety of AI systems. In particular, it is a clear challenge to the most promising path we see for being able to say that neural networks won't perform certain harmful behaviors, or for catching "unknown unknowns" safety problems. This is because superposition is deeply linked to the ability to identify and enumerate over all features in a model, and the ability to enumerate over all features would be a powerful primitive for making claims about model behavior.

We begin this section by describing how "solving superposition" in a certain sense is equivalent to many strong interpretability properties which might be useful for safety. Next, we'll describe three high level strategies one might take to "solving superposition." Finally, we'll describe a few other additional strategic considerations.

Safety, Interpretability, & "Solving Superposition"

We'd like a way to have confidence that models will never do certain behaviors such as "deliberately deceive" or "manipulate." Today, it's unclear how one might show this, but we believe a promising tool would be the ability to identify and enumerate over all features. The ability to have a universal quantifier over the fundamental units of neural network computation is a significant step towards saying that certain types of circuits don't exist. (Ultimately we want to say that a model doesn't implement some class of behaviors. Enumerating over all features makes it easy to say a feature doesn't exist (e.g. "there is no 'deceptive behavior' feature") but that isn't quite what we want. We expect models that need to represent the world to represent unsavory behaviors. But it may be possible to build more subtle claims such as "all 'deceptive behavior' features do not participate in circuits X, Y and Z.") It also seems like a powerful tool for addressing "unknown unknowns", since it's a way that one can fully cover network behavior, in a sense.

How does this relate to superposition? It turns out that the ability to enumerate over features is deeply intertwined with superposition. One way to see this is to imagine a neural network with a privileged basis and without superposition (like the monosemantic neurons found in early InceptionV1): features would simply correspond to neurons, and you could enumerate over features by enumerating over neurons. Superposition also makes it harder to find interpretable directions in a model without a privileged basis. Without superposition, one could try to do something like the Gram–Schmidt process, progressively identifying interpretable directions and then removing them to make future features easier to identify. But with superposition, one can't simply remove a direction even if one knows that it is a feature direction. The connection also goes the other way: if one has the ability to enumerate over features, one can perform compressed sensing using the feature directions to (with high probability) "unfold" a superposition model's activations into those of a larger, non-superposition model.
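
For instance, if the feature directions (the columns of W) were somehow known, recovering the imagined larger model's activations from a hidden activation h is a standard sparse recovery problem; a rough sketch (ours) using an off-the-shelf L1-regularized solver:

```python
import numpy as np
from sklearn.linear_model import Lasso

def unfold_activation(W, h, alpha=1e-3):
    # W: (m, n) matrix whose columns are known feature directions; h: (m,) activation.
    # Find a sparse, non-negative x_hat with W @ x_hat ≈ h.
    solver = Lasso(alpha=alpha, positive=True, fit_intercept=False)
    solver.fit(W, h)
    return solver.coef_            # (n,) activations of the "unfolded" model
```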

For this reason, we'll call any method that gives us the ability to enumerate over features – and equivalently, unfold activations – a "solution to superposition". Any solution is on the table, from creating models that just don't have superposition, to identifying what directions correspond to features after the fact. We'll discuss the space of possibilities shortly.

We've motivated "solving superposition" in terms of feature enumeration, but it's worth noting that it's equivalent to (or necessary for) many other interpretability properties one might care about.

Three Ways Out

At a very high level, there seem to be three potential approaches to resolving superposition: creating models without superposition in the first place, finding an overcomplete basis that describes the features of a model with superposition after the fact, and hybrid approaches combining the two.

Our sense is that all of these approaches are possible if one doesn't care about having a competitive model. For example, we believe it's possible to accomplish any of these for the toy models described in this paper. However, as one starts to consider serious neural networks, let alone modern large language models, all of these approaches begin to look very difficult. We'll outline the challenges we see for each approach in the following sections.

With that said, it's worth highlighting one bright spot before we focus on the challenges. You might have believed that superposition was something you could never fully get rid of, but that doesn't seem to be the case. All our results seem to suggest that superposition and polysemanticity are phases with sharp transitions. That is, there may exist a regime for every model where it has no superposition or polysemanticity. The question is largely whether the cost of getting rid of or otherwise resolving superposition is too high.

Approach 1: Creating Models Without Superposition

It's actually quite easy to get rid of superposition in the toy models described in this paper, albeit at the cost of a higher loss. Simply apply an L1 regularization term to the hidden layer activations (i.e. add \lambda ||h||_1 to the loss). This actually has a nice interpretation in terms of killing features below a certain importance threshold, especially if they're not basis aligned. Generalizing this to real neural networks isn't trivial, but we expect it can be done. (This approach would be similar to work attempting to use sparsity to encourage basis-aligned word embeddings.)
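
As a concrete (if simplistic) sketch of what that looks like for the ReLU output toy model, one can add the penalty directly to the importance-weighted reconstruction loss; l1_coeff here plays the role of \lambda and is a hyperparameter we're inventing for illustration:

```python
import torch

def loss_with_l1(W, b, x, importance, l1_coeff):
    # ReLU output toy model, x' = ReLU(W^T W x + b), plus an L1 penalty on the
    # hidden activations h = W x to discourage superposition.
    h = x @ W.T                                     # (batch, m) hidden activations
    x_hat = torch.relu(h @ W + b)
    recon = (importance * (x - x_hat) ** 2).mean()
    return recon + l1_coeff * h.abs().sum(dim=-1).mean()
```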

However, it seems likely that models are significantly benefitting from superposition. Roughly, the sparser features are, the more features can be squeezed in per neuron. And many features in language models seem very sparse! For example, language models know about individuals with only modest public presences, such as several of the authors of this paper. Presumably we only occur with frequency significantly less than one in a million tokens. As a result, it may be the case that superposition effectively makes models much bigger.

All of this paints a picture where getting rid of superposition may be fairly achievable, but doing so will have a large performance cost. For a model with a fixed number of neurons, superposition helps – potentially a lot.

But this is only true if the constraint is thought of in terms of neurons. That is, a superposition model with n neurons likely has the same performance as a significantly larger monosemantic model with kn neurons. But neurons aren't the fundamental constraint: flops are. In the most common model architectures, flops and neurons have a strict correspondence, but this doesn't have to be the case and it's much less clear that superposition is optimal in the broader space of possibilities.

One family of models which changes the flop-neuron relationship is Mixture of Experts (MoE) models. The intuition is that most neurons are for specialized circumstances and don't need to activate most of the time. For example, German-specific neurons don't need to activate on French text. Harry Potter neurons don't need to activate on scientific papers. So MoE models organize neurons into blocks or experts, which only activate a small fraction of the time. This effectively allows the model to have k times more neurons for a similar flop budget, given the constraint that only 1/k of the neurons activate in a given example and that they must activate in a block. Put another way, MoE models can recover neuron sparsity as free flops, as long as the sparsity is organized in certain ways.

It's unclear how far this can be pushed, especially given difficult engineering constraints. But there's an obvious lower bound, which is likely too optimistic but is interesting to think about: what if models only expended flops on neuron activations, and recovered the compute of all non-activating neurons? In this world, it seems unlikely that superposition would be optimal: you could always split a polysemantic neuron into dedicated neurons for each feature with the same cost, except for the cases where there would have been interference that hurt the model anyways. Our preliminary investigations comparing various types of superposition in terms of "loss reduction per activation frequency" seem to suggest that superposition is not optimal on these terms, although it asymptotically becomes as good as dedicated feature dimensions. Another way to think of this is that superposition exploits a gap between the sparsity of neurons and the sparsity of the underlying features; MoE eats that same gap, and so we should expect MoE models to have less superposition.

To be clear, MoE models are already well studied, and we don't think this changes the capabilities case for them. (If anything, superposition offers a theory for why MoE models have not proven more effective for capabilities when the case for them seems so initially compelling!) But if one's goal is to create competitive models that don't have superposition, MoE models become interesting to think about. We don't necessarily think that they specifically are the right path forward – our goal here has been to use them as an example of why we think it remains plausible there may be ways to build competitive superposition-free models.

Approach 2: Finding an Overcomplete Basis

The opposite strategy to creating a superposition-free model is to take a regular model, which has superposition, and find an overcomplete basis describing how features are embedded after the fact. This appears to be a relatively standard sparse coding (or dictionary learning) problem, where we want to take the activations of neural network layers and find out which directions correspond to features. More formally, given a matrix H \sim [d,m] = [h_0, h_1, \ldots] of hidden layer activations h \sim [m] sampled over d stimuli, if we believe there are n underlying features, we can try to find matrices A \sim [d,n] and B \sim [n,m] such that H \approx AB and A is sparse. This approach has been explored by some prior work.
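As a hedged illustration of the kind of dictionary-learning setup this describes (the activation matrix below is random stand-in data, and the hyperparameters are arbitrary), one could use off-the-shelf sparse coding:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# H: hidden-layer activations, shape [d stimuli, m neurons]. We look for sparse
# codes A (shape [d, n]) and a dictionary B (shape [n, m]) with H ~ A B, where
# n may exceed m (an overcomplete basis). The random H and all hyperparameters
# below are placeholders.
d, m, n = 10_000, 64, 512
H = np.random.randn(d, m)

learner = MiniBatchDictionaryLearning(n_components=n, alpha=1.0, batch_size=256)
A = learner.fit_transform(H)   # sparse codes: which candidate features fire per stimulus
B = learner.components_        # candidate feature directions in activation space

relative_error = np.linalg.norm(H - A @ B) / np.linalg.norm(H)
```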

The advantage of this is that we don't need to worry about whether we're damaging model performance. On the other hand, many other things are harder:

Approach 3: Hybrid Approaches

In addition to approaches which address superposition purely at training time, or purely after the fact, it may be possible to take "hybrid approaches" which do a mixture. For example, even if one can't create models without superposition, it may be possible to produce models with less superposition, which are then easier to decode. In particular, it seems like we should expect to be able to reduce superposition at least a little bit with essentially no effect on performance, just by doing something like L1 regularization without any architectural changes. Note that models should have a level of superposition where the derivative of loss with respect to the amount of superposition is zero – otherwise, they'd use more or less superposition. As a result, there should be at least some margin within which we can reduce the amount of superposition without affecting model performance. Alternatively, it may be possible for architecture changes to make finding an overcomplete basis easier or more computationally tractable in large models, separately from trying to reduce superposition.

Additional Considerations

Phase Changes as Cause For Hope. Is totally getting rid of superposition a realistic hope? One could easily imagine a world where it can only be asymptotically reduced, and never fully eliminated. While the results in this paper seem to suggest that superposition is hard to get rid of because it's actually very useful, the upshot of it corresponding to a phase change is that there's a regime where it totally doesn't exist. If we can find a way to push models into the non-superposition regime, it seems likely it can be totally eliminated.

Any superposition-free model would be a powerful tool for research. We believe that most of the research risk is in whether one can make performant superposition-free models, rather than whether it's possible to make superposition-free models at all. Of course, ultimately, we need to make performant models. But a non-performant superposition-free model could still be a very useful research tool for studying superposition in normal models. At present, it's challenging to study superposition in models because we have no ground truth for what the features are. (This is also the reason why the toy models described in this paper can be studied – we do know what the features are!) If we had a superposition-free model, we may be able to use it as a ground truth to study superposition in regular models.

Local bases are not enough. Earlier, when we considered the geometry of non-uniform superposition, we observed that models often form local orthogonal bases, where co-occurring features are orthogonal. This suggests a strategy for locally understanding models on sufficiently narrow sub-distributions. However, if our goal is to eventually make useful statements about the safety of models, we need mechanistic accounts that hold for the full distribution (and off distribution). Local bases seem unlikely to give this to us.







Discussion

To What Extent Does Superposition Exist in Real Models?

Why are we interested in toy models? We believe they are useful proxies for studying the superposition we suspect might exist in real neural networks. But how can we know if they're actually useful toy models? Our best validation is whether their predictions are consistent with empirical observations regarding polysemanticity. To the best of our knowledge they are. In particular:

This doesn't mean that everything about our toy model reflects real neural networks. Our intuition is that some of the phenomena we observe (superposition, monosemantic vs polysemantic neurons, perhaps the relationship to adversarial examples) are likely to generalize, while other phenomena (especially the geometry and learning dynamics results) are much more uncertain.

Open Questions

This paper has shown that the superposition hypothesis is true in certain toy models. But if anything, we're left with many more questions about it than we had at the start. In this final section, we review some of the questions which strike us as most important: what do we know, and what would we like future work to clarify?







Related Work

Interpretable Features

Our work is inspired by research exploring the features that naturally occur in neural networks. Many models form at least some interpretable features. Word embeddings have semantic directions. There is evidence of interpretable neurons in RNNs, in convolutional neural networks (both in general surveys and in studies of individual neuron families), and, in some limited cases, in transformer language models (see detailed discussion in our previous paper). However, this work has also found many "polysemantic" neurons which are not interpretable as a single concept.

Superposition

The earliest reference to superposition in artificial neural networks that we're aware of is Arora et al.'s work, which suggests that the word embeddings of words with multiple different word senses may be superpositions of the vectors for the distinct meanings. Arora et al. extend this idea to there being many sparse "atoms of discourse" in superposition, an idea which was generalized to other kinds of embedding vectors and explored in more detail by Goh.

In parallel with this, investigations of individual neurons in models with privileged bases were beginning to grapple with "polysemantic" neurons which respond to unrelated inputs . A natural hypothesis was that these polysemantic neurons are disambiguated by the combined activation of other neurons. This line of thinking eventually became the "superposition hypothesis" for circuits .

Separate from all of this, Cheung et al.  explore a slightly different idea one might describe as "model level" superposition: can neural network parameters represent multiple completely independent models? Their investigation is motivated by catastrophic forgetting, but seems quite related to the questions investigated in this paper. Model level superposition can be seen as feature level superposition for highly correlated sets of features, similar to the "almost orthogonal bases" experiment we considered above.

Disentanglement

The goal of learning disentangled representations arises from Bengio et al.'s influential position paper on representation learning : "we would like our representations to disentangle the factors of variation…  to learn representations that separate the various explanatory sources." Since then, a literature has developed motivated by this goal, tending to focus on creating generative models which separate out major factors of variation in their latent spaces. This research touches on questions related to superposition, but is also quite different in a number of ways.

Concretely, disentanglement research often explores whether one can train a VAE or GAN where basis dimensions correspond to the major features one might use to describe the problem (e.g. rotation, lighting, gender… as relevant). Early work often focused on semi-supervised approaches where the features were known in advance, but fully unsupervised approaches started to develop around 2016 .

Put another way, the goal of disentanglement might be described as imposing a strong privileged basis on representations which are rotationally invariant by default. This helps get at ways in which the questions of polysemanticity and superposition are a bit different from disentanglement. Consider that when we deal with neurons, rather than embeddings, we have a privileged basis by default. It varies by model, but many neurons just cleanly respond to features. This means that polysemanticity arises as a kind of anomalous behavior, and superposition arises as a hypothesis for explaining it. The question then isn't how to impose a privileged basis, but how to remove superposition as a fundamental problem to accessing features.

Of course, if the superposition hypothesis is true, there are still a number of connections to disentanglement. On the one hand, it seems likely superposition occurs in the latent spaces of generative models, even though that isn't an area we've investigated. If so, it may be that superposition is a major reason why disentanglement is difficult. Superposition may allow generative models to be much more effective than they would be without it. Put another way, disentanglement often assumes a small number of important latent variables to explain the data. There are clearly examples of such variables, like the orientation of objects – but what if a large number of sparse, rare, individually unimportant features are collectively very important? Superposition would be the natural way for models to represent this. A more subtle issue is that GANs and VAEs often assume that their latent space is Gaussianly distributed. Sparse latent variables are very non-Gaussian, but the central limit theorem means that the superposition of many such variables will gradually look more Gaussian. So the latent spaces of some generative models may in fact force models to use superposition! On the other hand, one could imagine ideas from the disentanglement literature being useful in creating architectures that resist superposition by creating an even more strongly privileged basis.

Compressed Sensing

The toy problems we consider are quite similar to the problems considered in the field of compressed sensing, which is also known as compressive sensing and sparse recovery. However, there are some important differences:

In general, our toy model is solving a similar problem using less powerful methods than compressed sensing algorithms, especially because the computational model is so much more restricted (to just a single linear transformation and a non-linearity) compared to the arbitrary computation that might be used by a compressed sensing algorithm.

As a result, compressed sensing lower bounds—which give lower bounds on the dimension of the embedding such that recovery is still possible—can be interpreted as giving an upper bound on the amount of superposition in our toy model. In particular, in various compressed sensing settings, one can recover an n-dimensional k-sparse vector from an m dimensional projection if and only if m = \Omega(k \log (n/k)) . While the connection is not entirely straightforward, we apply one such result to the toy model in the appendix.

At first, this bound appears to allow a number of features that is exponential in m to be packed into the m-dimensional embedding space. However, in our setting, the integer k for which all vectors have at most k non-zero entries is determined by the fixed density parameter S as k = O((1 - S)n). As a result, our bound is actually m = \Omega(-n (1 - S) \log(1 - S)). Therefore, the number of features is linear in m but modulated by the sparsity. Note that this has a nice information-theoretic interpretation: \log(1 - S) is the surprisal of a given dimension being non-zero, and is multiplied by the expected number of non-zeros. This is good news if we are hoping to eliminate superposition as a phenomenon! However, these bounds also allow for the amount of superposition to increase dramatically with sparsity – hopefully this is an artifact of the techniques in the proofs and not an inherent barrier to reducing or eliminating superposition.
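To get a rough feel for this scaling (treating the constant hidden in the \Omega as 1, purely for illustration; n and the sparsity values below are arbitrary), a quick calculation:

```python
import numpy as np

# Illustrative only: evaluate m >= -n * (1 - S) * log(1 - S), taking the
# constant in the Omega bound to be 1. n and the sparsity values are arbitrary.
n = 10_000
for S in [0.5, 0.9, 0.99, 0.999, 0.9999]:
    density = 1.0 - S
    m_bound = -n * density * np.log(density)
    print(f"S={S:<7}  expected non-zeros={density * n:8.1f}  m >= {m_bound:8.1f}")
```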

A striking parallel between our toy model and compressed sensing is the existence of phase changes. Note that in the compressed sensing case, the phase transition is in the limit as the number of dimensions becomes large - for finite-dimensional spaces, the transition is fast but not discontinuous. In compressed sensing, if one considers a two-dimensional space defined by the sparsity and dimensionality of the vectors, there are sharp phase changes where the vector can almost surely be recovered in one regime and almost surely not in the other. It isn't immediately obvious how to connect these phase changes in compressed sensing – which apply to recovery of the entire vector, rather than one particular component – to the phase changes we observe in features and neurons. But the parallel is suspicious.

Another interesting line of work has tried to build useful sparse recovery algorithms using neural networks . While we find it useful for analysis purposes to view the toy model as a sparse recovery algorithm, so that we may apply sparse recovery lower bounds, we do not expect that the toy model is useful for the problem of sparse recovery. However, there may be an exciting opportunity to relate our understanding of the phenomenon of superposition to these and other techniques.

Sparse Coding and Dictionary Learning

Sparse Coding studies the problem of finding a sparse representation of dense data. One can think of it as being like compressed sensing, except the matrix projecting sparse vectors into the lower dimensional space is also unknown. This topic goes by many different names including sparse coding (most common in neuroscience), dictionary learning (in computer science), and sparse frame design (in mathematics). For a general introduction, we refer readers to a textbook by Michael Elad .

Classic sparse coding algorithms take an expectation-maximization approach (this includes Olshausen et al.'s early work, the MOD algorithm, and the k-SVD algorithm). More recently, new methods based on gradient descent and autoencoders have begun building on these ideas.

From our perspective, sparse coding is interesting because it's probably the most natural mathematical formulation of trying to "solve superposition" by discovering which directions correspond to features. Interestingly, this is the reverse of how sparse coding is typically thought of in neuroscience. Neuroscience often thinks of biological neurons as sparse coding their inputs, whereas we're interested in applying it in the opposite direction, to find features in superposition over neurons. But can we actually use these methods to solve superposition in practice? Previous work has attempted to use sparse coding to find sparse structure. More recently, research by Sharkey et al., following up on the original publication of this paper, has had preliminary success in extracting features out of superposition in toy models using a sparse autoencoder. In general, we're only in the very preliminary investigations of using sparse coding and dictionary learning in this way, but the situation seems quite optimistic. See the section Approach 2: Finding an Overcomplete Basis for more discussion.

Theories of Neural Coding and Representation

Our work explores representations in artificial “neurons”. Neuroscientists study similar questions in biological neurons. There are a variety of theories for how information could be encoded by a group of neurons. At one extreme is a local code, in which every individual stimulus is represented by a separate neuron. At the other extreme is a maximally-dense distributed code, in which the information-theoretic capacity of the population is fully utilized, and every neuron in the population plays a necessary role in representing every input.

One challenge in comparing our work with the neuroscience literature is that a “distributed representation” seems to mean different things. Consider an overly-simplified example of a population of neurons, each taking a binary value of active or inactive, and a stimulus set of sixteen items: four shapes, with four colors (an example borrowed from prior work). A “local code” would be one with a “red triangle” neuron, a “red square” neuron, and so on. In what sense could the representation be made more “distributed”? One sense is by representing independent features separately — e.g. four “shape” neurons and four “color” neurons. A second sense is by representing more items than neurons — i.e. using a binary code over four neurons to encode 2^4 = 16 stimuli. In our framework, these senses correspond to decomposability (representing stimuli as compositions of independent features) and superposition (representing more features than neurons, at cost of interference if features co-occur).

Decomposability doesn’t necessarily mean each feature gets its own neuron. Instead, it could be that each feature corresponds to a “direction in activation-space”, given scalar “activations” (which in biological neurons would be firing rate). (We haven’t encountered a specific term in the distributed coding literature that corresponds to this hypothesis specifically, although the idea of a “direction in activation-space” is common in the literature, which may be due to ignorance on our part. We call this hypothesis linearity.) Then, only if there is a privileged basis are “feature neurons” incentivized to develop. In biological neurons, metabolic considerations are often hypothesized to induce a privileged basis, and thus a “sparse code”. This would be expected if the nervous system’s energy expenditure increases linearly or sublinearly with firing rate; experimental evidence seems to support this. Additionally, neurons are the units by which biological neural networks can implement non-linear transformations, so if a feature needs to be non-linearly transformed, a “feature neuron” is a good way to achieve that.

Any decomposable linear code that uses orthogonal feature vectors is functionally equivalent from the viewpoint of a linear readout. So, a code can both be “maximally distributed” — in the sense that every neuron participates in representing every input, making each neuron extremely polysemantic — and also have no more features than it has dimensions. In this conception, it’s clear that a code can be fully “distributed” and also have no superposition.

A notable difference between our work and the neuroscience literature we have encountered is that we consider as a central concept the likelihood that features co-occur with some probability. (A related, but different, concept in the neuroscience literature is the “binding problem”: e.g. a red triangle is a co-occurrence of exactly one shape and exactly one color, which is not a representational challenge, but a binding problem arises if a decomposed code needs to simultaneously represent a blue square as well — which shape feature goes with which color feature? Our work does not engage with the binding question, merely treating this as a co-occurrence of “blue”, “red”, “triangle”, and “square”.) A “maximally-dense distributed code” makes the most sense in the case where items never co-occur; if the network only needs to represent one item at a time, it can tolerate a very extreme degree of superposition. By contrast, a network that could plausibly need to represent all the items at once can do so without interference between the items if it uses a code with no superposition. One example of high feature co-occurrence could be encoding spatial frequency in a receptive field; these visual neurons need to be able to represent white noise, which has energy at all frequencies. An example of limited co-occurrence could be a motor “reach” task to discrete targets, far enough apart that only one can be reached at a time.

One hypothesis in neuroscience is that highly compressed representations might have an important use in long-range communication between brain areas. Under this theory, sparse representations are used within a brain area to do computation, and then are compressed for transmission across a small number of axons. Our experiments with the absolute value toy model shows that networks can do useful computation even under a code with a moderate degree of superposition. This suggests that all neural codes, not just those used for efficient communication, could plausibly be “compressed” to some degree; the regional code might not necessarily need to be decompressed to a fully sparse one.

It's worth noting that the term "distributed representation" is also used in deep learning, and has the same ambiguities of meaning there. Our sense is that some influential early works may have primarily meant the "independent features are represented independently" decomposability sense, but we believe that other work intends to suggest something similar to what we call superposition.

Additional Connections

After publishing the original version of this paper, a number of readers generously brought to our attention additional connections to prior work. We don't have a sufficiently deep understanding of this work to offer a detailed review, but we offer a brief overview below:







Comments & Replications

Inspired by the original Circuits Thread and Distill's Discussion Article experiment, the authors invited several external researchers, with whom we had previously discussed our preliminary results, to comment on this work. Their comments are included below.

Replication & Forthcoming Paper

Kshitij Sachan is a research intern at Redwood Research.

Redwood Research has been working on toy models of polysemanticity, inspired by Anthropic's work. We plan to separately publish our results, and during our research we replicated many of the experiments in this paper. Specifically, we replicated all plots in the Demonstrating Superposition and Superposition as a Phase Change sections (visualizations of the relu models with different sparsities and the phase diagrams) as well as the plot in The Geometry of Superposition – Uniform Superposition. We found the phase diagrams look quite different depending on the activation function, suggesting that in this toy model some activation functions induce more polysemanticity than others.


Original Authors' Response:  Redwood's further analysis of the superposition phase change significantly advanced our own understanding of the issue – we're very excited for their analysis to be shared with the world. We also appreciate the independent replication of our basic results.


Update: The research by Redwood mentioned in the previous comment, Polysemanticity and Capacity in Neural Networks (Alignment Forum, Arxiv) is out! They study a slightly different toy model, and get some really interesting results. Highlights include analytical traction on understanding a variant of the toy model, understanding superposition in terms of constrained optimization, and analysis of the role different activation functions play.

Replication & Further Results

Tom McGrath is a research scientist at DeepMind.

The results in this paper are an important contribution - they really further our theoretical understanding of a phenomenon that may be central to interpretability research and understanding network representations more generally. It’s surprising that such simple settings can produce these rich phenomena. We’ve reproduced the experiments in the Demonstrating Superposition  and Superposition as a Phase Change sections and have a minor additional result to contribute.

It is possible to exactly solve the expected loss for the n=2, m=1 case of the basic ReLU output toy model (ignoring bias terms). The derivation is mathematically simple but somewhat long-winded: the ‘tricks’ are to (1) represent the sparse portion of the input distribution with delta functions, and (2) replace the ReLU with a restriction of the domain of integration:

\int_D \text{ReLU}(f(x)) dx = \int_{D \cap f(x)>0} f(x) dx

Making this substitution renders the integral analytically tractable, which allows us to plot the full loss surface and solve for the loss minima directly. We show some example loss surfaces below:
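For readers who want to see such a surface without re-deriving the closed form, a minimal Monte Carlo sketch (assuming the same setup as the toy model: each feature nonzero with probability 1-S, uniform on [0,1] when present, importance-weighted MSE; the sparsity and importance values below are arbitrary):

```python
import numpy as np

# Monte Carlo estimate of the expected loss over a grid of (W1, W2) for the
# n=2, m=1 ReLU-output toy model with no bias. Assumes each feature is nonzero
# with probability (1 - S) and uniform on [0, 1] when present, with an
# importance-weighted MSE loss; S and the importances are arbitrary here.
S, importance = 0.8, np.array([1.0, 0.5])
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(50_000, 2)) * (rng.uniform(size=(50_000, 2)) > S)

def expected_loss(w1, w2):
    W = np.array([w1, w2])                       # the single row of W
    h = x @ W                                    # scalar hidden activation per sample
    x_hat = np.maximum(0.0, np.outer(h, W))      # ReLU(W^T h), shape [batch, 2]
    return (importance * (x - x_hat) ** 2).sum(axis=1).mean()

grid = np.linspace(-1.5, 1.5, 41)
loss_surface = np.array([[expected_loss(w1, w2) for w2 in grid] for w1 in grid])
# loss_surface can then be plotted (e.g. plt.contourf) to see the minima directly.
```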

Although many of these loss surfaces (Figure 1a, 1b) have minima qualitatively similar to one of the network weights used in the section Superposition as a Phase Change, we also find a new phase where W_1\simeq W_2 \simeq \frac{1}{\sqrt{2}}: weights are similar rather than antipodal. This ‘confused feature’ regime occurs when sparsity is low and both features are important (Figure 1c). (This is slightly similar to the behavior described in The Geometry of Superposition – Collapsing of Correlated Features, but occurs without the features being correlated!) Further, although the solutions we find are often qualitatively similar to the weights used in Superposition as a Phase Change, they can be quantitatively different, as Figure 1a shows. The transition from Figure 1a to Figure 1b is continuous: the minima moves smoothly in weight space as the degree of sparsity alters. This explains the ‘blurry’ region around the triple point in the phase diagram.

As Figure 1c shows, some combinations of sparsity and relative feature importance lead to loss surfaces with two minima (once the symmetry (W_1, W_2) \to (-W_1, -W_2) has been accounted for). If this pattern holds for larger values of n and m (and we see no reason why it would not) this could account for the Discrete "Energy Level" Jumps phenomenon as solutions hop between minima. In some cases (e.g. when parameters approach those needed for a phase transition) the global minimum can have a considerably smaller basin of attraction than local minima. The transition between the antipodal and confused-feature solutions appears to be discontinuous.


Original Authors' Response:  This closed form analysis of the n=2, m=1 case is fascinating. We hadn't realized that W_1\simeq W_2 \simeq \frac{1}{\sqrt{2}} could be a solution without correlated features! The clarification of the "blurry behavior" and the observation about local minima are also very interesting. More generally, we're very grateful for the independent replication of our core results.

Replication

Jeffrey Wu and Dan Mossing are members of the Alignment team at OpenAI.

We are very excited about these toy models of polysemanticity. This work sits at a rare intersection of being plausibly very important for training more interpretable models and being very simple and elegant. The results have been surprisingly easy to replicate -- we have reproduced (with very little fuss) plots similar to those in the Demonstrating Superposition – Basic Results, Geometry – Feature Dimensionality, and Learning Dynamics – Discrete "Energy Level" Jumps sections.


Original Authors' Response:  We really appreciate this replication of our basic results. Some of our findings were quite surprising to us, and this gives us more confidence that they aren't the result of an idiosyncratic quirk or bug in our implementations.

Replication

Spencer Becker-Kahn is a senior research scholar at the Future of Humanity Institute and a SERI Machine Learning Alignment Theory Scholar.

After seeing preliminary results, I independently replicated some of the key graphs from Demonstrating Superposition – Basic Results and, using very small toy models, produced a series of plots consistent with the conceptual picture emerging in Geometry – Feature Dimensionality and Superposition and Learning Dynamics.

See also Twitter thread.

Engineering Monosemanticity in Toy Models

Adam Jermyn is an independent researcher focused on AI alignment and interpretability. He was previously a Research Fellow at the Flatiron Institute’s Center for Computational Astrophysics. Evan Hubinger is a Research Fellow at MIRI. Nicholas Schiefer is a member of the technical staff at Anthropic and an author of the original paper.

Inspired by the results in this paper and the previous paper introducing the SoLU activation, we have been investigating whether changes to the model architecture or training process can reduce superposition in toy models. After replicating several of these results independently, we made the following extensions in that direction:

  • A modified toy model that tries to more accurately represent the non-toy case where sparse features are projected down into a non-sparse input to the model with fewer dimensions.
  • A “many neuron” architecture that gives the model enough capacity to avoid polysemanticity entirely.
  • A training and initialization method that actually makes these toy models monosemantic.
  • A systematic exploration of the polysemantic neurons that tend to appear in these models, which guided our explorations of the training methods.

At least in some limits, this suggests there may not be a price to be paid for monosemanticity. Detailed results can be found in our paper, Engineering Monosemanticity in Toy Models (Alignment Forum, ArXiV).

Fractional Dimensionality and "Pressure"

Tom Henighan and Chris Olah are authors of the original paper.

In the "Feature Dimensionality" section, we found that features organized into clean polytopes when there are more features than can be easily represented in the embedding dimensions.
"特征维度"部分,我们发现当特征数量超过在嵌入维度中易于表示的数量时,特征会组织成整洁的多面体。

We briefly investigated this further and found that the number of features competing to be represented significantly influences this phenomenon. Cleaner structure often seems to emerge when there is more "pressure" – more features competing to be represented. This is especially true at high sparsity levels. Additionally, training longer seems to also produce cleaner structure.

More investigation would be needed to really understand this phenomenon.

Replication

Marius Hobbhahn is a PhD student at the University of Tuebingen.

I replicated the “Basic Results” in Section 2 (“Demonstrating Superposition”) and all of Section 7 (“Superposition in a Privileged Basis”) of the “Toy Models of Superposition” paper. All of my findings are identical to the ones described in the paper. I replicated most findings in the follow-up “Superposition, Memorization, and Double Descent” paper.

The details of my replication can be found in my write-up “More Findings on Memorization and Double Descent”.

Extracting Features with Sparse Autoencoders

Lee Sharkey, Dan Braun, and Beren Millidge are researchers at Conjecture.

The results from this paper, and the strategic picture it paints, inspired our preliminary follow-up work that aimed to address some of the challenges described in the section titled 'Approach 2: Finding an overcomplete basis'.

Before studying the activations of real neural networks, where we're not sure what the 'ground truth' features are, we studied a toy example. We generated a set of toy ground truth features and created a dataset using sparse combinations of them. We found that a one-layer sparse autoencoder with an L_1 penalty on its hidden activations learned features that were almost identical to the ground truth features. This is encouraging because it indicates that a relatively simple method should be able to recover the features used by neural networks, inasmuch as they have them.
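As a rough sketch of the kind of one-layer sparse autoencoder described here (the sizes, activation, and penalty coefficient are our own illustrative choices, not necessarily those used by the commenters):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One hidden layer autoencoder with an L1 penalty on its hidden activations."""
    def __init__(self, n_inputs: int, n_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(n_inputs, n_hidden)
        self.decoder = nn.Linear(n_hidden, n_inputs)

    def forward(self, x):
        h = torch.relu(self.encoder(x))      # hidden code
        return self.decoder(h), h

def sae_loss(x, x_hat, h, l1_coeff=1e-3):
    # Reconstruction error plus a sparsity penalty on the code.
    return ((x - x_hat) ** 2).sum(dim=-1).mean() + l1_coeff * h.abs().sum(dim=-1).mean()
```

After training on the dataset, the rows of the decoder weight (or the encoder directions) can be compared against the known ground-truth features.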

For the toy dataset, we knew how many ground truth features there were. But we ultimately want to count the features used by real neural networks, where the number of features is unknown. We explored three ways to count the features in the toy dataset: a) Counting dead neurons in the autoencoders; b) looking at autoencoder losses; and c) comparing the features learned by autoencoders of different sizes. We found indications that these methods might be suitable to count the number of features in superposition in real neural data.

We also applied our method to real activations from a small language model. Our initial, preliminary investigations led to inconclusive results, possibly resulting from having used autoencoders that were either too small or undertrained. At the time of writing, investigations are ongoing.

Linear Representation in Othello

Neel Nanda is an external researcher in mechanistic interpretability. This is a description of his blog post, Actually, Othello-GPT Has A Linear Emergent World Representation.

I describe a natural experiment of the linear representation hypothesis described in this paper -- the idea that features correspond to directions in neural networks.

Background: Martin Wattenberg (an author on this paper) and colleagues recently found that, if you train a transformer language model to predict the next token in synthetic Othello games (where each move is a randomly chosen legal move), it forms an emergent model of the board state (despite only being trained to predict the next move!). They showed that the state of the board (whether each cell was empty, black or white) could be recovered with high accuracy by a one-hidden-layer MLP probe. They further showed that you could use the world model to causally intervene on the model’s residual stream. By choosing another board state, and changing the residual stream (with gradient descent) such that the probe indicates the new board state, they caused the model to output legal moves in the new board state, even if the edited board state was impossible to reach via legal Othello play!

Pre-Registered Hypothesis: The probing and causal intervention together provided strong evidence that the model had learned to represent features corresponding to the state of each square on the board. Yet, notably, linear probes were not able to recover the board state. Since linear features should be recoverable with a one-layer probe, and the causal intervention suggests the model both computes and uses the board state, this seemed like significant evidence against the linear representation hypothesis.

However, Chris Olah (an author on this paper) argued the model might still be representing features linearly if it used a different set of features, and that the probe and causal intervention may be picking up on this different set of features. This created an informal pre-registered prediction of the hypothesis which was contrary to the evidence at the time.

Results: I independently came to the same conclusion as Chris and investigated the Othello-playing model. I found that the model does form an emergent model of the board state that is linearly represented and can be extracted with a linear probe. However, because the model plays both black and white moves, it represents the state of a cell as “my colour” vs “opponent’s colour”. Further, I found circumstantial evidence that these features are used by the model, as we can linearly intervene on the residual stream using the directions given by the probe to edit the represented board state, and the model plays legal moves in the new board state.

I consider these results to be notable, as the paper’s results provided evidence against the linear representation hypothesis and the hypothesis faced genuine risk of falsification. And the hypothesis made non-trivial predictions that were contrary to where the evidence pointed, but these turned out to be true. This is both a proof of concept that there are underlying principles of neural networks which have predictive power about models, and a natural experiment supporting the linear representation hypothesis.

I think that there is further work interpreting the Othello playing model that could test other hypotheses in this paper and our broader conceptual frameworks about neural networks and transformers, such as by looking for monosemantic vs superposed neurons in its MLP layers. The model is both complex enough to be interesting and expose principles of how transformers learn algorithms, yet the algorithmic nature of the task and existence of the probe suggests that finding circuits will be tractable. I elaborate on what I consider promising directions of future work in a follow-up post.

Leverage Score and Feature Dimensionality

Fred Zhang is a PhD student in the Theory Group of the EECS Department at UC Berkeley.

In the Geometry of Superposition section, the paper defines a notion of feature dimensionality,
叠加的几何部分,论文定义了特征维度的概念,

D_i ~=~ \frac{||W_i||^2}{\sum_j (\hat{W_i} \cdot W_j)^2},

where W_i is the ith feature and \hat{W_i}=W_i / \| W_i \|_2 is its normalized version. For simplicity of notation, we assume that there are n features, each being a d-dimensional real vector.
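As an illustrative implementation of this formula (our own code; W is taken to be an array whose rows are the feature embedding vectors W_i):

```python
import numpy as np

def feature_dimensionality(W: np.ndarray) -> np.ndarray:
    """Compute D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2 for each row W_i of W."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)       # ||W_i||
    W_hat = W / np.clip(norms, 1e-12, None)                 # normalized features
    overlaps = (W_hat @ W.T) ** 2                            # (W_i_hat . W_j)^2
    return (norms[:, 0] ** 2) / overlaps.sum(axis=1)

# Example: an antipodal pair plus an orthogonal feature in d = 2 dimensions.
W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
D = feature_dimensionality(W)
print(D, D.sum())   # dimensionalities [0.5, 0.5, 1.0]; they sum to d = 2
```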

Following this definition, the paper makes the remark that "Empirically, it seems that the dimensionality of all features add up to the number of embedding dimensions when the features are 'packed efficiently' in some sense." In this comment, I point out a natural, theoretical explanation of this observation. The argument is via the notion of leverage score in matrix approximation. I’ll define it first, then explain how it connects to feature dimensionality.

At a conceptual level, leverage score is a measure of the importance of a row in composing the row space of a matrix. For instance, if a row is orthogonal to all other rows, its leverage score is 1, meaning that it’s maximally important. This is natural, since removing it would decrease the rank of the matrix and completely change the row space. Formally, if W is an n by d matrix (think of it as tall and thin, so n > d), then the leverage score of the ith row W_i is

\tau_i = \max_{x : \|x\|_2 = 1} \frac{\langle W_i , x \rangle^2}{\sum_{j=1}^n \langle W_j , x \rangle^2}.

Notice that the denominator term equals \| Wx \|_2^2 . Hence, the leverage score of W_i is the maximum contribution \langle W_i, x\rangle^2 can make towards || Wx ||^2 (over all possible directions x). This is a fractional number between 0 and 1. Also observe that this indeed has the nice property that if W_i is orthogonal to all other rows, \tau_i equals 1, because all except the ith term within the summation of the denominator is 0.

Three quick remarks on this definition:

  • It coincides with the notion of maximal data dimensionality in a comment by Chris Olah on Data Dimensionality of MNIST in the follow-up work of Superposition, Memorization, and Double Descent.
  • There are other equivalent definitions of leverage score; see section 3.3 of here.
  • Leverage score arises in numerical linear algebra. It is well-known that sampling rows proportional to their leverage score yields an almost square matrix that approximates the original, potentially very tall matrix. For a formal statement, see Theorem 17 of https://arxiv.org/abs/1411.4357.

Returning to my main point, another nice fact about leverage scores is that they sum up to the rank of the matrix. In the tall and thin case above, they sum up to d (if the matrix is full-rank). Given that, it is natural that this paper makes the empirical observation that the sum of D_i roughly equals the embedding dimension d, if the vectors are "efficiently packed". Specifically, one formal way of stating that they are packed efficiently is that they are roughly in isotropic position, that is, the sum \sum_j \langle W_j, x\rangle^2 along any unit direction x is close to 1. In other words, the covariance structure of the embedding vectors is close to identity. (Geometrically, it means they are well spread out. This would be the case, for example, if the vectors form a regular simplex, just as happens in the experiments of this paper.) Then in the above definition of leverage score, the summation within the denominator is 1 for any x. So to maximize the numerator, we just take x to be W_i itself---and this precisely recovers the definition of feature dimensionality D_i. Thus, in this well-spread-out case, these two notions are roughly the same, and they both sum up to d.
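A small numerical check of this correspondence (illustrative code; it uses the standard fact that \tau_i equals the squared norm of the ith row of U in the thin SVD W = U \Sigma V^T):

```python
import numpy as np

def leverage_scores(W: np.ndarray) -> np.ndarray:
    # tau_i is the squared norm of the i-th row of U in the thin SVD W = U S V^T.
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return (U ** 2).sum(axis=1)

def feature_dimensionality(W: np.ndarray) -> np.ndarray:
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.linalg.norm(W, axis=1) ** 2 / ((W_hat @ W.T) ** 2).sum(axis=1)

# n = 5 unit vectors spread evenly in d = 2 dimensions (a near-isotropic packing).
angles = np.linspace(0, np.pi, 5, endpoint=False)
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)

print(leverage_scores(W).sum())         # equals rank(W) = 2
print(feature_dimensionality(W).sum())  # also 2 when the features are well spread out
```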

Code 代码

We provide a notebook to reproduce some of the core diagrams in this article here. (It isn't comprehensive, since we needed to rewrite code for our experiments to run outside our codebase.) We provide a separate notebook for the theoretical phase change diagrams.

Note that the reproductions by other researchers mentioned in comments above were not based on this code, but are instead fully independent replications with clean code from the description in an early draft of this article.

Acknowledgments

We're extremely grateful to a number of colleagues across several organizations for their invaluable support in our writing of this paper.

Jeff Wu, Daniel Mossing, Tom McGrath, and Kshitij Sachan did independent replications of many of our experiments, greatly increasing our confidence in our results. Kshitij Sachan's and Tom McGrath's additional investigations and insightful questions both pushed us to clarify our understanding of the superposition phase change (both as reflected in this paper, and in further understanding which we learned from them not captured here). Buck Shlegeris, Adam Scherlis, and Adam Jermyn shared valuable insights into the mathematical nature of the toy problem and related work.  Adam Jermyn also coined the term "virtual neurons."

Gabriel Goh, Neel Nanda, Vladimir Mikulik, and Nick Cammarata gave detailed feedback which improved the paper, in addition to being motivating. Alex Dimakis, Piotr Indyk, Dan Yamins generously took time to discuss these results with us and give advice on how they might connect to their area of expertise. Finally, we benefited from the feedback and comments of James Bradbury, Sebastian Farquhar, Shan Carter, Patrick Mineault, Alex Tamkin, Paul Christiano, Evan Hubinger, Ian McKenzie, and Sid Black. We're additionally grateful to Trenton Bricken and Manjari Narayan for referring us to valuable related work we originally missed. Thanks to Ken Kahn for typo corrections.

Finally, we're very grateful to all our colleagues at Anthropic for their advice and support: Daniela Amodei, Jack Clark, Tom Brown, Ben Mann, Nick Joseph, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones,, Timothy Telleen-Lawton, Anna Chen, Yuntao Bai, Jeffrey Ladish, Deep Ganguli, Liane Lovitt, Nova DasSarma, Jia Yuan Loke, Jackson Kernion, Tom Conerly, Scott Johnston, Jamie Kerr, Sheer El Showk, Stanislav Fort, Rebecca Raible, Saurav Kadavath, Rune Kvist, Jarrah Bloomfield, Eli Tran-Johnson, Rob Gilson, Guro Khundadze, Filipe Dobreira, Ethan Perez, Sam Bowman, Sam Ringer, Sebastian Conybeare, Jeeyoon Hyun, Michael Sellitto, Jared Mueller, Joshua Landau, Cameron McKinnon, Sandipan Kundu, Jasmine Brazilek, Da Yan, Robin Larson, Noemí Mercado, Anna Goldie, Azalia Mirhoseini, Jennifer Zhou, Erick Galankin, James Sully, Dustin Li, James Landis.

Author Contributions

Basic Results - The basic toy model results demonstrating the existence of superposition were done by Nelson Elhage and Chris Olah. Chris suggested the toy model and Nelson ran the experiments.

Phase Change - Chris Olah ran the empirical phase change experiments, with help from Nelson Elhage. Martin Wattenberg introduced the theoretical model where exact losses for specific weight configurations can be computed.

Geometry - The uniform superposition geometry results were discovered by Nelson Elhage and Nicholas Schiefer, with help from Chris Olah. Nelson discovered the original mysterious "stickiness" of m/||W||_F^2. Chris introduced the definition of feature dimensionality. Nicholas and Nelson then investigated the polytopes that formed. As for non-uniform superposition, Martin Wattenberg performed the initial investigations of the resulting geometry, focusing on the behavior of correlated features. Chris extended this with an investigation of the role of relative feature importance and sparsity.

Learning Dynamics - Nelson Elhage discovered the "energy level jump" phenomenon, in collaboration with Nicholas Schiefer and Chris Olah. Martin Wattenberg discovered the "geometric transformations" phenomenon.

Adversarial Examples - Chris Olah and Catherine Olsson found evidence of a connection between superposition and adversarial examples.

Superposition with a Privileged Basis / Doing Computation - Chris Olah did the basic investigation of superposition in a privileged basis. Nelson Elhage, with help from Chris, investigated the "absolute value" model which provided a more principled demonstration of superposition and showed that computation could be done while in superposition. Nelson discovered the "asymmetric superposition" motif.

Theory - The theoretical picture articulated over the course of this paper (especially in the "mathematical understanding" section) was developed in conversations between all authors, but especially Chris Olah, Jared Kaplan, Martin Wattenberg, Nelson Elhage, Tristan Hume, Tom Henighan, Catherine Olsson, Nicholas Schiefer, Dawn Drain, Shauna Kravec, Roger Grosse, Robert Lasenby, and Sam McCandlish. Jared introduced the strategy of rewriting the loss by grouping terms with the number of active features. Both Jared and Martin independently noticed the value of investigating the n=2; m=1 case as the simplest case to understand. Nicholas and Dawn clarified our understanding of the connection to compressed sensing.

Strategic Picture - The strategic picture articulated in this paper – What does superposition mean for interpretability and safety? What would a suitable solution be? How might one solve it? – was developed in extensive conversations between the authors, in particular Chris Olah, Tristan Hume, Nelson Elhage, Dario Amodei, and Jared Kaplan. Nelson Elhage recognized the potential importance of "enumerative safety", further articulated by Dario. Tristan brainstormed extensively about ways one might solve superposition and pushed Chris on this topic.

Writing - The paper was primarily drafted by Chris Olah, with some sections by Nelson Elhage, Tristan Hume, Martin Wattenberg, and Catherine Olsson. All authors contributed to editing, with particularly significant contributions from Zac Hatfield-Dodds, Robert Lasenby, Kipply Chen, and Roger Grosse.

Illustration - The paper was primarily illustrated by Chris Olah, with help from Tristan Hume, Nelson Elhage, and Catherine Olsson.

Citation Information

Please cite as:

Elhage, et al., "Toy Models of Superposition", Transformer Circuits Thread, 2022.

BibTeX Citation:

@article{elhage2022superposition,
   title={Toy Models of Superposition},
   author={Elhage, Nelson and Hume, Tristan and Olsson, Catherine and Schiefer, Nicholas and Henighan, Tom and Kravec, Shauna and Hatfield-Dodds, Zac and Lasenby, Robert and Drain, Dawn and Chen, Carol and Grosse, Roger and McCandlish, Sam and Kaplan, Jared and Amodei, Dario and Wattenberg, Martin and Olah, Christopher},
   year={2022},
   journal={Transformer Circuits Thread},
   note={https://transformer-circuits.pub/2022/toy_model/index.html}
}

Footnotes

  1. Where “importance” is a scalar multiplier on mean squared error loss.[↩]
  2. In the context of vision, these have ranged from low-level neurons like curve detectors and high-low frequency detectors, to more complex neurons like oriented dog-head detectors or car detectors, to extremely abstract neurons corresponding to famous people, emotions, geographic regions, and more. In language models, researchers have found word embedding directions such as a male-female or singular-plural direction, low-level neurons disambiguating words that occur in multiple languages, much more abstract neurons, and "action" output neurons that help produce certain words.[↩]
  3. This definition is trickier than it seems. Specifically, something is a feature if there exists a large enough model size such that it gets a dedicated neuron. This creates a kind of "epsilon-delta"-like definition. Our present understanding – as we'll see in later sections – is that arbitrarily large models can still have a large fraction of their features be in superposition. However, for any given feature, assuming the feature importance curve isn't flat, it should eventually be given a dedicated neuron. This definition can be helpful in saying that something is a feature – curve detectors are a feature because you find them across a range of models larger than some minimal size – but unhelpful for the much more common case of features we only hypothesize about or observe in superposition.[↩]
  4. A famous book by Lakatos illustrates the importance of uncertainty about definitions and how important rethinking definitions often is in the context of research. [↩]
  5. This experiment setup could also be viewed as an autoencoder reconstructing x.[↩]
  6. A vision model of sufficient generality might benefit from representing every species of plant and animal and every manufactured object which it might potentially see. A language model might benefit from representing each person who has ever been mentioned in writing. These are only scratching the surface of plausible features, but already there seem to be more such features than any model has neurons. In fact, large language models demonstrably do know about people of very modest prominence – presumably more such people than they have neurons. This point is a common argument in discussions of the plausibility of "grandmother neurons" in neuroscience, but seems even stronger for artificial neural networks.[↩]
  7. For computational reasons, we won't focus on it in this article, but we often imagine an infinite number of features with importance asymptotically approaching zero.[↩]
  8. The choice to have features distributed uniformly is arbitrary. An exponential or power law distribution would also be very natural.[↩]
  9. Recall that W^T = W^{-1} if W is orthonormal. Although W can't be literally orthonormal, our intuition from compressed sensing is that it will be "almost orthonormal" in the sense of Candes & Tao.[↩]
  10. We have the model be x' = W^TWx, but leave x Gaussianly distributed as in Saxe.[↩]
  11. As a brief aside, it's interesting to contrast the linear model interference, \sum_{i\neq j}|W_i \cdot W_j|^2, to the notion of coherence in compressed sensing, \max_{i\neq j}|W_i \cdot W_j|. We can see them as the L^2 and L^\infty norms of the same vector.[↩]
  12. To prove that superposition is never optimal in a linear model, set the gradient of the loss to zero and solve, or consult Saxe et al.[↩]
  13. Here, we use “phase change” in the generalized sense of “discontinuous change”, rather than in the more technical sense of a discontinuity arising in the limit of infinite system size.[↩]
  14. Scaling the importance of all features by the same amount simply scales the loss, and does not change the optimal solutions.[↩]
  15. Note that there's a degree of freedom for the model in learning W_1: We can rescale any hidden unit by scaling its row of W_1 by \alpha, and its column of W_2 by \alpha^{-1}, and arrive at the same model. For consistency in the visualization, we rescale each hidden unit before visualizing so that the largest-magnitude weight to that neuron from W_1 has magnitude 1.[↩]
  16. These specific values were chosen to illustrate the phenomenon we're interested in: the absolute value model learns more easily when there are more neurons, but we wanted to keep the numbers small enough that it could be easily visualized.[↩]
  17. One question you might ask is whether we can quantify the ability of superposition to enable extra computation by examining the loss. Unfortunately, we can't easily do this. Superposition occurs when we change the task, making it sparser. As a result, the losses of models with different amounts of superposition are not comparable – they're measuring the loss on different tasks![↩]
  18. Ultimately we want to say that a model doesn't implement some class of behaviors. Enumerating over all features makes it easy to say a feature doesn't exist (e.g. "there is no 'deceptive behavior' feature") but that isn't quite what we want. We expect models that need to represent the world to represent unsavory behaviors. But it may be possible to build more subtle claims such as "all 'deceptive behavior' features do not participate in circuits X, Y and Z."[↩]
  19. Superposition also makes it harder to find interpretable directions in a model without a privileged basis. Without superposition, one could try to do something like the Gram–Schmidt process, progressively identifying interpretable directions and then removing them to make future features easier to identify. But with superposition, one can't simply remove a direction even if one knows that it is a feature direction.[↩]
  20. More formally, given a matrix H \sim [d,m] ~=~[h_0, h_1, …] of hidden layer activations h \sim [m] sampled over d stimuli, if we believe there are n underlying features, we can try to find matrices A\sim [d,n] and B \sim [n,m] such that H \approx AB and A is sparse.[↩]
  21. In particular, it seems like we should expect to be able to reduce superposition at least a little bit with essentially no effect on performance, just by doing something like L1 regularization without any architectural changes.  Note that models should have a level of superposition where the derivative of loss with respect to the amount of superposition is zero – otherwise, they'd use more or less superposition. As a result, there should be at least some margin within which we can reduce the amount of superposition without affecting model performance.[↩]
  22. A more subtle issue is that GANs and VAEs often assume that their latent space is Gaussianly distributed. Sparse latent variables are very non-Gaussian, but the central limit theorem means that the superposition of many such variables will gradually look more Gaussian. So the latent spaces of some generative models may in fact force models to use superposition![↩]
  23. Note that this has a nice information-theoretic interpretation: \log(1 - S) is the surprisal of a given dimension being non-zero, and is multiplied by the expected number of non-zeros.[↩]
  24. Note that in the compressed sensing case, the phase transition is in the limit as the number of dimensions becomes large - for finite-dimensional spaces, the transition is fast but not discontinuous.[↩]
  25. Interestingly, this is the reverse of how sparse coding is typically thought of in neuroscience. Neuroscience often thinks of biological neurons as sparse coding their inputs, whereas we're interested in applying it the opposite direction, to find features in superposition over neurons.[↩]
  26. We haven’t encountered a specific term in the distributed coding literature that corresponds to this hypothesis specifically, although the idea of a “direction in activation-space” is common in the literature, which may be due to ignorance on our part. We call this hypothesis linearity.[↩]
  27. Experimental evidence seems to support this[↩]
  28. A related, but different, concept in the neuroscience literature is the “binding problem” in which e.g. a red triangle is a co-occurrence of exactly one shape and exactly one color, which is not a representational challenge, but a binding problem arises if a decomposed code needs to represent simultaneously also a blue square — which shape feature goes with which color feature? Our work does not engage with the binding question, merely treating this as a co-occurrence of “blue”, “red”, “triangle”, and “square”. [↩]

References

  1. Linear algebraic structure of word senses, with applications to polysemy
    Arora, S., Li, Y., Liang, Y., Ma, T. and Risteski, A., 2018. Transactions of the Association for Computational Linguistics, Vol 6, pp. 483--495. MIT Press.
  2. Decoding The Thought Vector[link]
    Goh, G., 2016.
  3. Zoom In: An Introduction to Circuits
    Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M. and Carter, S., 2020. Distill. DOI: 10.23915/distill.00024.001
  4. Softmax Linear Units
    Elhage, N., Hume, T., Olsson, C., Nanda, N., Henighan, T., Johnston, S., ElShowk, S., Joseph, N., DasSarma, N., Mann, B., Hernandez, D., Askell, A., Ndousse, K., Jones, A., Drain, D., Chen, A., Bai, Y., Ganguli, D., Lovitt, L., Hatfield-Dodds, Z., Kernion, J., Conerly, T., Kravec, S., Fort, S., Kadavath, S., Jacobson, J., Tran-Johnson, E., Kaplan, J., Clark, J., Brown, T., McCandlish, S., Amodei, D. and Olah, C., 2022. Transformer Circuits Thread.
  5. Compressed sensing
    Donoho, D.L., 2006. IEEE Transactions on information theory, Vol 52(4), pp. 1289--1306. IEEE.
  6. Local vs. Distributed Coding
    Thorpe, S.J., 1989. Intellectica, Vol 8, pp. 3--40.
  7. Representation learning: A review and new perspectives
    Bengio, Y., Courville, A. and Vincent, P., 2013. IEEE transactions on pattern analysis and machine intelligence, Vol 35(8), pp. 1798--1828. IEEE.
  8. Feature Visualization[link]
    Olah, C., Mordvintsev, A. and Schubert, L., 2017. Distill. DOI: 10.23915/distill.00007
  9. Curve Detectors[link]
    Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M. and Olah, C., 2020. Distill.
  10. Superposition of many models into one
    Cheung, B., Terekhov, A., Chen, Y., Agrawal, P. and Olshausen, B., 2019. Advances in neural information processing systems, Vol 32.
  11. Linguistic regularities in continuous space word representations[PDF]
    Mikolov, T., Yih, W. and Zweig, G., 2013. Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pp. 746--751.
  12. Linguistic regularities in sparse and explicit word representations
    Levy, O. and Goldberg, Y., 2014. Proceedings of the eighteenth conference on computational natural language learning, pp. 171--180.
  13. Unsupervised representation learning with deep convolutional generative adversarial networks
    Radford, A., Metz, L. and Chintala, S., 2015. arXiv preprint arXiv:1511.06434.
  14. Visualizing and understanding recurrent networks[PDF]
    Karpathy, A., Johnson, J. and Fei-Fei, L., 2015. arXiv preprint arXiv:1506.02078.
  15. Learning to generate reviews and discovering sentiment[PDF]
    Radford, A., Jozefowicz, R. and Sutskever, I., 2017. arXiv preprint arXiv:1704.01444.
  16. Object detectors emerge in deep scene cnns[PDF]
    Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A., 2014. arXiv preprint arXiv:1412.6856.
  17. Network Dissection: Quantifying Interpretability of Deep Visual Representations[PDF]
    Bau, D., Zhou, B., Khosla, A., Oliva, A. and Torralba, A., 2017. Computer Vision and Pattern Recognition.
  18. Understanding the role of individual units in a deep neural network
    Bau, D., Zhu, J., Strobelt, H., Lapedriza, A., Zhou, B. and Torralba, A., 2020. Proceedings of the National Academy of Sciences, Vol 117(48), pp. 30071--30078. National Acad Sciences.
  19. On the importance of single directions for generalization[PDF]
    Morcos, A.S., Barrett, D.G., Rabinowitz, N.C. and Botvinick, M., 2018. arXiv preprint arXiv:1803.06959.
  20. On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron
    Donnelly, J. and Roegiest, A., 2019. European Conference on Information Retrieval, pp. 795--802.
  21. High-Low Frequency Detectors
    Schubert, L., Voss, C., Cammarata, N., Goh, G. and Olah, C., 2021. Distill. DOI: 10.23915/distill.00024.005
  22. Multimodal Neurons in Artificial Neural Networks
    Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A. and Olah, C., 2021. Distill. DOI: 10.23915/distill.00030
  23. Convergent learning: Do different neural networks learn the same representations?
    Li, Y., Yosinski, J., Clune, J., Lipson, H., Hopcroft, J.E. and others, 2015. FE@ NIPS, pp. 196--212.
  24. Adversarial examples are not bugs, they are features
    Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. Advances in neural information processing systems, Vol 32.
  25. Proofs and refutations
    Lakatos, I., 1963. Nelson London.
  26. Sparse coding with an overcomplete basis set: A strategy employed by V1?
    Olshausen, B.A. and Field, D.J., 1997. Vision research, Vol 37(23), pp. 3311--3325. Elsevier.
  27. Decoding by linear programming
    Candes, E.J. and Tao, T., 2005. IEEE transactions on information theory, Vol 51(12), pp. 4203--4215. IEEE.
  28. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
    Saxe, A.M., McClelland, J.L. and Ganguli, S., 2014.
  29. In-context Learning and Induction Heads[HTML]
    Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S. and Olah, C., 2022. Transformer Circuits Thread.
  30. A Mechanistic Interpretability Analysis of Grokking[link]
    Nanda, N. and Lieberum, T., 2022.
  31. Grokking: Generalization beyond overfitting on small algorithmic datasets
    Power, A., Burda, Y., Edwards, H., Babuschkin, I. and Misra, V., 2022. arXiv preprint arXiv:2201.02177.
  32. The surprising simplicity of the early-time learning dynamics of neural networks
    Hu, W., Xiao, L., Adlam, B. and Pennington, J., 2020. Advances in Neural Information Processing Systems, Vol 33, pp. 17116--17128.
  33. A mathematical theory of semantic development in deep neural networks
    Saxe, A.M., McClelland, J.L. and Ganguli, S., 2019. Proceedings of the National Academy of Sciences, Vol 116(23), pp. 11537--11546. National Acad Sciences.
  34. Towards the science of security and privacy in machine learning
    Papernot, N., McDaniel, P., Sinha, A. and Wellman, M., 2016. arXiv preprint arXiv:1611.03814.
  35. Adversarial spheres
    Gilmer, J., Metz, L., Faghri, F., Schoenholz, S.S., Raghu, M., Wattenberg, M. and Goodfellow, I., 2018. arXiv preprint arXiv:1801.02774.
  36. Adversarial robustness as a prior for learned representations
    Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B. and Madry, A., 2019. arXiv preprint arXiv:1906.00945.
  37. Delving into transferable adversarial examples and black-box attacks
    Liu, Y., Chen, X., Liu, C. and Song, D., 2016. arXiv preprint arXiv:1611.02770.
  38. An introduction to systems biology: design principles of biological circuits
    Alon, U., 2019. CRC press. DOI: 10.1201/9781420011432
  39. The Building Blocks of Interpretability[link]
    Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K. and Mordvintsev, A., 2018. Distill. DOI: 10.23915/distill.00010
  40. Visualizing Weights[link]
    Voss, C., Cammarata, N., Goh, G., Petrov, M., Schubert, L., Egan, B., Lim, S.K. and Olah, C., 2021. Distill. DOI: 10.23915/distill.00024.007
  41. Learning effective and interpretable semantic models using non-negative sparse embedding
    Murphy, B., Talukdar, P. and Mitchell, T., 2012. Proceedings of COLING 2012, pp. 1933--1950.
  42. Word2Sense: sparse interpretable word embeddings
    Panigrahi, A., Simhadri, H.V. and Bhattacharyya, C., 2019. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5692--5705.
  43. Spine: Sparse interpretable neural embeddings
    Subramanian, A., Pruthi, D., Jhamtani, H., Berg-Kirkpatrick, T. and Hovy, E., 2018. Proceedings of the AAAI Conference on Artificial Intelligence, Vol 32(1).
  44. A Review of Sparse Expert Models in Deep Learning
    Fedus, W., Dean, J. and Zoph, B., 2022. arXiv preprint arXiv:2209.01667.
  45. A Mathematical Framework for Transformer Circuits[HTML]
    Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S. and Olah, C., 2021. Transformer Circuits Thread.
  46. An Overview of Early Vision in InceptionV1
    Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M. and Carter, S., 2020. Distill. DOI: 10.23915/distill.00024.002
  47. beta-vae: Learning basic visual concepts with a constrained variational framework
    Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S. and Lerchner, A., 2016.
  48. Infogan: Interpretable representation learning by information maximizing generative adversarial nets
    Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I. and Abbeel, P., 2016. Advances in neural information processing systems, Vol 29.
  49. Disentangling by factorising
    Kim, H. and Mnih, A., 2018. International Conference on Machine Learning, pp. 2649--2658.
  50. Uncertainty principles and ideal atomic decomposition
    Donoho, D.L., Huo, X. and others, 2001. IEEE transactions on information theory, Vol 47(7), pp. 2845--2862. Citeseer.
  51. Compressed sensing and best 𝑘-term approximation
    Cohen, A., Dahmen, W. and DeVore, R., 2009. Journal of the American mathematical society, Vol 22(1), pp. 211--231.
  52. A remark on compressed sensing
    Kashin, B.S. and Temlyakov, V.N., 2007. Mathematical notes, Vol 82(5), pp. 748--755. Springer.
  53. Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting
    Wainwright, M., 2007. 2007 IEEE International Symposium on Information Theory, pp. 961--965.
  54. Lower bounds for sparse recovery
    Do Ba, K., Indyk, P., Price, E. and Woodruff, D.P., 2010. Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pp. 1190--1197.
  55. Neighborly polytopes and sparse solution of underdetermined linear equations
    Donoho, D.L., 2005.
  56. Compressed sensing: How sharp is the RIP
    Blanchard, J.D., Cartis, C. and Tanner, J., 2009. SIAM Rev., accepted, Vol 10, pp. 090748160.
  57. A deep learning approach to structured signal recovery
    Mousavi, A., Patel, A.B. and Baraniuk, R.G., 2015. 2015 53rd annual allerton conference on communication, control, and computing (Allerton), pp. 1336--1343.
  58. Learned D-AMP: Principled neural network based compressive image recovery
    Metzler, C., Mousavi, A. and Baraniuk, R., 2017. Advances in Neural Information Processing Systems, Vol 30.
  59. Compressed Sensing using Generative Models[HTML]
    Bora, A., Jalal, A., Price, E. and Dimakis, A.G., 2017. Proceedings of the 34th International Conference on Machine Learning, Vol 70, pp. 537--546. PMLR.
  60. Sparse and redundant representations: from theory to applications in signal and image processing
    Elad, M., 2010. , Vol 2(1). Springer.
  61. Method of optimal directions for frame design
    Engan, K., Aase, S.O. and Husoy, J.H., 1999. 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), Vol 5, pp. 2443--2446.
  62. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation
    Aharon, M., Elad, M. and Bruckstein, A., 2006. IEEE Transactions on signal processing, Vol 54(11), pp. 4311--4322. IEEE.
  63. Learning Fast Approximations of Sparse Coding
    Gregor, K. and LeCun, Y., 2010. International Conference on Machine Learning.
  64. Sparse-Coding Variational Auto-Encoders
    Barello, G., Charles, A.S. and Pillow, J.W., 2018. bioRxiv.
  65. [Interim research report] Taking features out of superposition with sparse autoencoders[link]
    Sharkey, L., Braun, D. and Millidge, B., 2022.
  66. Average firing rate rather than temporal pattern determines metabolic cost of activity in thalamocortical relay neurons
    Yi, G. and Grill, W., 2019. Scientific reports, Vol 9(1), pp. 6940. DOI: 10.1038/s41598-019-43460-8
  67. Distributed representations
    Plate, T., 2003. Cognitive Science, pp. 1-15.
  68. Compressed Sensing, Sparsity, and Dimensionality in Neuronal Information Processing and Data Analysis[link]
    Ganguli, S. and Sompolinsky, H., 2012. Annual Review of Neuroscience, Vol 35(1), pp. 485-508. DOI: 10.1146/annurev-neuro-062111-150410
  69. ?
  70. A Survey on Hyperdimensional Computing aka Vector Symbolic Architectures, Part I: Models and Data Transformations
    Kleyko, D., Rachkovskij, D.A., Osipov, E. and Rahimi, A., 2021. ACM Computing Surveys (CSUR). ACM New York, NY.
  71. Life beyond bases: The advent of frames (Part I)
    Kovacevic, J. and Chebira, A., 2007. IEEE Signal Processing Magazine, Vol 24(4), pp. 86--104. IEEE.

Nonlinear Compression

This paper focuses on the assumption that representations are linear. But what if models don’t use linear feature directions to represent information? What might such a thing concretely look like?

Neural networks have nonlinearities that make it theoretically possible to compress information even more compactly than a linear superposition. There are reasons we think models are unlikely to pervasively use nonlinear compression schemes:

Regardless of whether large models end up using nonlinear compression, it should be possible to view directions being used with nonlinear compression as linear feature directions and reverse engineer the computation being used for compression like any other circuit. If this kind of encoding is pervasive throughout the network then it may merit some kind of automated decoding. It shouldn’t pose a fundamental challenge to interpretability unless the model learns a scheme for doing complex computation while staying in a complicated nonlinear representation, which we suspect is unlikely.

To help provide intuition, the simplest example of what a nonlinear compression scheme might look like is compressing two [0,1) dimensions x and y into a single [0,1) dimension t:

t = \frac{\lfloor Zx \rfloor + y}{Z}

This works by quantizing the x dimension using some integer Z such that the floating point precision of t is split between x and y. This particular function needs the discontinuous floor function to compute, and the discontinuous fmod function to invert, but models can’t compute discontinuous functions. However it’s possible to replace the discontinuities with steep linear segments that are only some epsilon value wide.
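
For a quick worked example (with illustrative values Z = 3, x = 0.6, y = 0.25 chosen here for concreteness): \lfloor Zx \rfloor = \lfloor 1.8 \rfloor = 1, so t = (1 + 0.25)/3 = 0.4166\ldots. Decoding computes Zt = 1.25: the floor recovers the quantized x cell \lfloor Zt \rfloor / Z = 1/3 (so x is known up to an error of at most 1/Z), and the fractional part Zt - \lfloor Zt \rfloor = 0.25 recovers y exactly.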

We can compare the mean squared error loss on random uniform dense values of x and y and see that even with epsilons as large as 0.1 and Z values as small as 3 the nonlinear compression outperforms linear compression such as picking one of the dimensions or using the average:
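
As a rough numerical check, the sketch below in Python/NumPy compares the two. The particular soft_floor smoothing, the centred decoding of x, and the two linear baselines are our own illustrative choices, not the exact setup used for the comparison.

import numpy as np

rng = np.random.default_rng(0)

def soft_floor(u, eps):
    # Continuous, piecewise-linear stand-in for floor(u): equal to floor(u)
    # except on a width-eps ramp just below each integer, where it rises
    # linearly to the next integer. (np.floor is only used to construct the
    # piecewise-linear function for simulation; as a function of u it is
    # continuous, so a ReLU network could represent it.)
    n = np.floor(u)
    frac = u - n
    return n + np.clip((frac - (1.0 - eps)) / eps, 0.0, 1.0)

def compress(x, y, Z, eps):
    # t = (floor(Z x) + y) / Z, with the discontinuous floor replaced by its
    # steep continuous approximation.
    return (soft_floor(Z * x, eps) + y) / Z

def decompress(t, Z, eps):
    u = Z * t
    x_hat = (soft_floor(u, eps) + 0.5) / Z               # centre of the recovered x cell
    y_hat = np.clip(u - soft_floor(u, eps), 0.0, 1.0)    # continuous stand-in for fmod(u, 1)
    return x_hat, y_hat

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Dense, uniform inputs, as in the comparison described above.
x = rng.uniform(size=100_000)
y = rng.uniform(size=100_000)

Z, eps = 3, 0.1
x_hat, y_hat = decompress(compress(x, y, Z, eps), Z, eps)
nonlinear_loss = 0.5 * (mse(x, x_hat) + mse(y, y_hat))

# Linear baselines that store a single dimension: keep x exactly and guess the
# mean for y, or store the average of x and y.
keep_one_loss = 0.5 * (0.0 + mse(y, np.full_like(y, 0.5)))
avg = 0.5 * (x + y)
average_loss = 0.5 * (mse(x, avg) + mse(y, avg))

print(f"nonlinear (Z={Z}, eps={eps}): {nonlinear_loss:.4f}")
print(f"keep one dimension:           {keep_one_loss:.4f}")
print(f"store the average:            {average_loss:.4f}")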

Connection between compressed sensing lower bounds and the toy model

Here, we formalize the relationship between a compressed sensing lower bound and the toy model.

Let T(x) : \mathbb{R}^n \to \mathbb{R}^n be the complete toy model autoencoder defined by T(x) = \mathrm{ReLU}(W_2 W_1 x - b) for an m \times n matrix W_1 and an n \times m matrix W_2.  

We derive the following theorem:

Theorem 1. Suppose that the toy model recovers all x, in the sense that \|T(x) - x\|_2 \leq \varepsilon for sufficiently small \varepsilon, and that W_1 has the (\delta, k) restricted isometry property. Then the inner dimension of the projection matrix W_1 satisfies m = \Omega(k \log(n/k)).

We prove this result by framing our toy model as a compressed sensing algorithm. The primary barrier to doing so is that our optimization only searches for vectors that are close in \ell_2 distance to the original vector, and the recovered vector may not itself be exactly k-sparse. The following lemma resolves this concern through a denoising step:

Lemma 1. Suppose that we have a toy model T(x) with the properties in Theorem 1. Then there exists a compressed sensing algorithm f(y) : \mathbb{R}^m \to \mathbb{R}^n for the measurement matrix W_1.

Proof. We construct f(y) as follows. First, compute \tilde{x} = \mathrm{ReLU}(W_2 y - b), as in T(x). This produces the vector \tilde{x} = T(x), and so by supposition \|T(x) - x\|_2 \leq \varepsilon. Next, we threshold \tilde{x} to obtain \tilde{x}' by dropping all but its k largest entries. Lastly, we solve the optimization problem \min_{x'} \|x' - \tilde{x}'\| subject to W_1 x' = y, which is convex because x' and \tilde{x}' have the same support. For sufficiently small \varepsilon (specifically, \varepsilon smaller than the (k + 1)th largest entry in x), both \tilde{x}' and the nearest k-sparse vector to x have the same support, and so the convex optimization problem has a unique solution: the nearest k-sparse vector to x. Therefore, f is a compressed sensing algorithm for W_1 with approximation factor 1. \qed
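
For intuition, here is a minimal NumPy sketch of this construction. The function name recover and the use of an ordinary least-squares solve on the recovered support are our own illustrative choices, not code from the paper.

import numpy as np

def recover(y, W1, W2, b, k):
    # Sketch of the compressed sensing algorithm f(y) built in Lemma 1,
    # where y = W1 @ x are the measurements, W1 is m x n and W2 is n x m.
    x_tilde = np.maximum(W2 @ y - b, 0.0)        # step 1: ReLU(W2 y - b), the decoder half of T
    support = np.argsort(np.abs(x_tilde))[-k:]   # step 2: keep only the k largest entries
    # Step 3: re-solve for the coefficients on that support so that W1 @ x_hat
    # reproduces the measurements. Under the restricted isometry property the
    # k selected columns of W1 are well-conditioned, so this solution is unique.
    coeffs, *_ = np.linalg.lstsq(W1[:, support], y, rcond=None)
    x_hat = np.zeros_like(x_tilde)
    x_hat[support] = coeffs
    return x_hat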

Lastly, we use the deterministic compressed sensing lower bound of Do Ba, Indyk, Price, and Woodruff:

Theorem 2 (Corollary 3.1 of Do Ba et al.). Given an m \times n matrix A with the restricted isometry property, a sparse recovery algorithm finds a k-sparse approximation \hat{x} of x \in \mathbb{R}^n from Ax such that

\|x - \hat{x}\|_1 \leq C(k) \min_{x', \|x'\|_0 \leq k} \|x - x'\|_1

for an approximation factor C(k). If C(k) = O(1), then a sparse recovery algorithm exists only if m = \Omega(k \log (n/k)).

Theorem 1 follows directly from Lemma 1 and Theorem 2.