It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?
In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition.
Consider a toy model where we train an embedding of five features of varying importance, where "importance" is a scalar multiplier on the mean squared error loss, in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features. With dense features, the model learns to represent an orthogonal basis of the two most important features (similar to what Principal Component Analysis might give us), and the other three features are not represented. But if we make the features sparse, this changes:
Not only can models store additional features in superposition by tolerating some interference, but we'll show that, at least in certain limited cases, models can perform computation while in superposition. (In particular, we'll show that models can put simple circuits computing the absolute value function in superposition.) This leads us to hypothesize that the neural networks we observe in practice are in some sense noisily simulating larger, highly sparse networks. In other words, it's possible that models we train can be thought of as doing “the same thing as” an imagined much-larger model, representing the exact same features but with no interference.
Feature superposition isn't a novel idea. A number of previous interpretability papers have considered it.
For interpretability researchers, our main contribution is providing a direct demonstration that superposition occurs in artificial neural networks given a relatively natural setup, suggesting this may also occur in practice. That is, we show a case where interpreting neural networks as having sparse structure in superposition isn't just a useful post-hoc interpretation, but actually the "ground truth" of a model. We offer a theory of when and why this occurs, revealing a phase diagram for superposition. This explains why neurons are sometimes "monosemantic", responding to a single feature, and sometimes "polysemantic", responding to many unrelated features. We also find that, at least in our toy model, superposition exhibits complex geometric structure.
But our results may also be of broader interest. We find preliminary evidence that superposition may be linked to adversarial examples and grokking, and might also suggest a theory for the performance of mixture of experts models. More broadly, the toy model we investigate has unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar to the fractional quantum Hall effect in physics, among other striking phenomena. We originally investigated the subject to gain understanding of cleanly-interpretable neurons in larger models, but we've found these toy models to be surprisingly interesting in their own right.
In our toy models, we are able to demonstrate that:

- Superposition is a real, observed phenomenon.
- Both monosemantic and polysemantic neurons can form.
- At least some kinds of computation can be performed in superposition.
- Whether features are stored in superposition is governed by a phase change.
- Superposition organizes features into geometric structures such as digons, triangles, pentagons, and tetrahedrons.
Our toy models are simple ReLU networks, so it seems fair to say that neural networks exhibit these properties in at least some regimes, but it's very unclear how much of this generalizes to real networks.
In our work, we often think of neural networks as having features of the input represented as directions in activation space. This isn't a trivial claim. It isn't obvious what kind of structure we should expect neural network representations to have. When we say something like "word embeddings have a gender direction" or "vision models have curve detector neurons", one is implicitly making strong claims about the structure of network representations.
Despite this, we believe this kind of "linear representation hypothesis" is supported both by significant empirical findings and theoretical arguments. One might think of this as two separate properties, which we'll explore in more detail shortly:

- Decomposability: network representations can be described in terms of independently understandable features.
- Linearity: features are represented by directions.
If we hope to reverse engineer neural networks, we need a property like decomposability. Decomposability is what allows us to reason about the model without fitting the whole thing in our heads! But it's not enough for things to be decomposable: we need to be able to access the decomposition somehow. In order to do this, we need to identify the individual features within a representation. In a linear representation, this corresponds to determining which directions in activation space correspond to which independent features of the input.
Sometimes, identifying feature directions is very easy because features seem to correspond to neurons. For example, many neurons in the early layers of InceptionV1 clearly correspond to features (e.g. curve detector neurons).
Superposition has been hypothesized in previous work.
The goal of this section will be to motivate these ideas and unpack them in detail.
It's worth noting that many of the ideas in this section have close connections to ideas in other lines of interpretability research (especially disentanglement), neuroscience (distributed representations, population codes, etc), compressed sensing, and many other lines of work. This section will focus on articulating our perspective on the problem. We'll discuss these other lines of work in detail in Related Work.
When we talk about "features" and how they're represented, this is ultimately theory building around several observed empirical phenomena. Before describing how we conceptualize those results, we'll simply describe some of the major results motivating our thinking:
V("king") - V("man") + V("woman") = V("queen")
(but see V("king") - V("man") + V("woman") = V("queen")
这样的嵌入向量运算(但请参见 As a result, we tend to think of neural network representations as being composed of features which are represented as directions. We'll unpack this idea in the following sections.
因此,我们往往认为神经网络的表示由特征构成,这些特征表现为方向。我们将在接下来的章节中详细阐述这一概念。
Our use of the term "feature" is motivated by the interpretable properties of the input we observe neurons (or word embedding directions) responding to. There's a rich variety of such observed properties!
(In the vision domain, these observed properties range from those of low-level neurons, such as curve detectors, up to far more abstract ones.)
But even with that motivation, it turns out to be quite challenging to create a satisfactory definition of a feature. Rather than offer a single definition we're confident about, we consider three potential working definitions:
We've written this paper with the final "neurons in sufficiently large models" definition in mind. But we aren't overly attached to it, and actually think it's probably important to not prematurely attach to a definition.
(Avoiding premature commitment to definitions is a theme of a famous book by Lakatos, Proofs and Refutations.)
As we've mentioned in previous sections, we generally think of features as being represented by directions. For example, in word embeddings, "gender" and "royalty" appear to correspond to directions, allowing arithmetic like V("king") - V("man") + V("woman") = V("queen").
Let's call a neural network representation linear if features correspond to directions in activation space. In a linear representation, each feature f_i has a corresponding representation direction W_i; several features activating with values x_{f_1}, x_{f_2}, … are represented by the sum x_{f_1} W_{f_1} + x_{f_2} W_{f_2} + …
We don't think it's a coincidence that neural networks empirically seem to have linear representations. Neural networks are built from linear functions interspersed with non-linearities. In some sense, the linear functions are the vast majority of the computation (for example, as measured in FLOPs). Linear representations are the natural format for neural networks to represent information in! Concretely, there are three major benefits:
It is possible to construct non-linear representations, and retrieve information from them, if you use multiple layers (although even these examples can be seen as linear representations with more exotic features). We provide an example in the appendix. However, our intuition is that non-linear representations are generally inefficient for neural networks.
One might think that a linear representation can only store as many features as it has dimensions, but it turns out this isn't the case! We'll see that the phenomenon we call superposition will allow models to store more features – potentially many more features – in linear representations.
For discussion on how this view of features squares with a conception of features as being multidimensional manifolds, see the appendix “What about Multidimensional Features?”.
Even if features are encoded as directions, a natural question to ask is which directions? In some cases, it seems useful to consider the basis directions, but in others it doesn't. Why is this?
When researchers study word embeddings, it doesn't make sense to analyze basis directions. There would be no reason to expect a basis dimension to be different from any other possible direction. One way to see this is to imagine applying some random invertible linear transformation to the embedding: the transformed representation carries exactly the same information, yet has entirely different basis dimensions.
But many neural network layers are not like this. Often, something about the architecture makes the basis directions special, such as applying an activation function. This "breaks the symmetry", making those directions special, and potentially encouraging features to align with the basis dimensions. We call this a privileged basis, and call the basis directions "neurons." Often, these neurons correspond to interpretable features.
From this perspective, it only makes sense to ask if a neuron is interpretable when it is in a privileged basis. In fact, we typically reserve the word "neuron" for basis directions which are in a privileged basis. (See longer discussion here.)
Note that having a privileged basis doesn't guarantee that features will be basis-aligned – we'll see that they often aren't! But it's a minimal condition for the question to even make sense.
Even when there is a privileged basis, it's often the case that neurons are "polysemantic", responding to several unrelated features. One explanation for this is the superposition hypothesis.
Several results from mathematics suggest that something like this might be plausible: although an n-dimensional space can contain only n orthogonal vectors, it can contain exponentially many "almost orthogonal" vectors (as in the Johnson–Lindenstrauss lemma), and compressed sensing shows that sparse high-dimensional vectors can be recovered from much lower-dimensional projections.
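To make the "almost orthogonal vectors" point concrete, here is a minimal numpy sketch (the dimension and vector count are arbitrary choices of ours) showing that many random unit vectors in a moderately high-dimensional space have small pairwise cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 2048                 # hypothetical: 2048 vectors in only 512 dimensions
V = rng.normal(size=(k, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # normalize to unit vectors

# Off-diagonal cosine similarities measure pairwise "interference".
G = V @ V.T
off_diag = G[~np.eye(k, dtype=bool)]
print(f"max |cos|  = {np.abs(off_diag).max():.3f}")   # far below 1
print(f"mean |cos| = {np.abs(off_diag).mean():.3f}")  # concentrates near 1/sqrt(d)
```

Even with four times as many vectors as dimensions, no pair is remotely parallel, which is exactly the room superposition exploits.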
Concretely, in the superposition hypothesis, features are represented as almost-orthogonal directions in the vector space of neuron outputs. Since the features are only almost-orthogonal, one feature activating looks like other features slightly activating. Tolerating this "noise" or "interference" comes at a cost. But for neural networks with highly sparse features, this cost may be outweighed by the benefit of being able to represent more features! (Crucially, sparsity greatly reduces the costs since sparse features are rarely active to interfere with each other, and non-linear activation functions create opportunities to filter out small amounts of noise.)
One way to think of this is that a small neural network may be able to noisily "simulate" a sparse larger model:
Although we've described superposition with respect to neurons, it can also occur in representations with an unprivileged basis, such as a word embedding. Superposition simply means that there are more features than dimensions.
The ideas in this section might be thought of in terms of four progressively more strict properties that neural network representations might have: decomposability, linearity, non-superposition, and basis alignment.
The first two (decomposability and linearity) are properties we hypothesize to be widespread, while the latter two (non-superposition and basis-aligned) are properties we believe only sometimes occur.
If one takes the superposition hypothesis seriously, a natural first question is whether neural networks can actually noisily represent more features than they have neurons. If they can't, the superposition hypothesis may be comfortably dismissed.
The intuition from linear models would be that this isn't possible: the best a linear model can do is to store the principal components. But we'll see that adding just a slight nonlinearity can make models behave in a radically different way! This will be our first demonstration of superposition. (It will also be an object lesson in the complexity of even very simple neural networks.)
Our goal is to explore whether a neural network can project a high-dimensional vector x into a lower-dimensional vector h, and then recover it.
We begin by describing the high-dimensional vector x, whose components we think of as the "features" the model should represent.
Since we don't have any ground truth for features, we need to create synthetic data for x that captures the properties we believe real features have, most importantly sparsity and varying importance.
Concretely, our synthetic data is defined as follows: each component x_i of the input vectors is zero with probability S (the sparsity), and otherwise drawn uniformly from [0, 1].
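A minimal sketch of this sampling procedure (the function name and arguments are our own):

```python
import numpy as np

def sample_features(batch, n, sparsity, rng):
    """Each feature is 0 with probability `sparsity`, else Uniform[0, 1]."""
    x = rng.uniform(0.0, 1.0, size=(batch, n))
    inactive = rng.uniform(size=(batch, n)) < sparsity   # True -> feature is zero
    return np.where(inactive, 0.0, x)

rng = np.random.default_rng(0)
x = sample_features(batch=8, n=5, sparsity=0.9, rng=rng)
print(x)   # mostly zeros at S = 0.9
```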
We will actually consider two models, which we motivate below. The first "linear model" is a well understood baseline which does not exhibit superposition. The second "ReLU output model" is a very simple model which does exhibit superposition. The two models vary only in the final activation function.
Why these models?
The superposition hypothesis suggests that each feature in the higher-dimensional model corresponds to a direction in the lower-dimensional space. This means we can represent the down projection as a linear map h = Wx.
To recover the original vector, we'll use the transpose of the same matrix, so the reconstruction is x' = W^T W x.
We also add a bias. One motivation for this is that it allows the model to set features it doesn't represent to their expected value. But we'll see later that the ability to set a negative bias is important for superposition for a second set of reasons – roughly, it allows models to discard small amounts of noise.
The final step is whether to add an activation function. This turns out to be critical to whether superposition occurs. In a real neural network, when features are actually used by the model to do computation, there will be an activation function, so it seems principled to include one at the end.
Our loss is mean squared error weighted by the feature importances, I_i, described above: L = Σ_x Σ_i I_i (x_i − x'_i)².
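To make the setup concrete, here is one way the two models and the importance-weighted loss could be implemented in PyTorch. This is a sketch under our reading of the setup (class name, initialization scale, and defaults are our own choices), not reference code from the paper:

```python
import torch

class ToyModel(torch.nn.Module):
    """x' = (optionally ReLU)(W^T W x + b), with W of shape (m, n).
    Columns of W are the feature embedding directions."""
    def __init__(self, n_features, m_hidden, final_relu=True):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(m_hidden, n_features) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(n_features))
        self.final_relu = final_relu

    def forward(self, x):
        h = x @ self.W.T            # down-project: (batch, m)
        out = h @ self.W + self.b   # up-project:   (batch, n)
        return torch.relu(out) if self.final_relu else out

def weighted_mse(x, x_hat, importance):
    # importance: (n,) scalar multiplier on each feature's squared error
    return (importance * (x - x_hat) ** 2).sum(dim=-1).mean()
```

Setting `final_relu=False` gives the linear baseline; `final_relu=True` gives the ReLU output model.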
Our first experiment will simply be to train a few ReLU output models with different sparsity levels and visualize the results. (We'll also train a linear model – if optimized well enough, the linear model solution does not depend on sparsity level.)
The main question is how to visualize the results. The simplest way is to visualize W^T W (a features-by-features matrix describing how each feature is recovered) and b (a feature-length bias vector).
But the thing we really care about is this hypothesized phenomenon of superposition – does the model represent "extra features" by storing them non-orthogonally? Is there a way to get at it more explicitly? Well, one question is just how many features the model learns to represent. For any feature, whether or not it is represented is determined by ||W_i||, the norm of its embedding vector.
We'd also like to understand whether a given feature shares its dimension with other features. For this, we calculate Σ_{j≠i} (Ŵ_i · W_j)², the total squared projection of the other features onto the direction of feature i.
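Both diagnostics can be computed directly from W (columns taken as feature embeddings); a sketch with our own helper names:

```python
import numpy as np

def feature_norms(W):
    """||W_i|| for each feature column i: ~1 means represented, ~0 means not."""
    return np.linalg.norm(W, axis=0)

def superposition(W):
    """For each feature i, sum over j != i of (W_i_hat . W_j)^2:
    how much the other features project onto feature i's direction."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    W_hat = W / np.maximum(norms, 1e-9)
    proj = W_hat.T @ W              # entry (i, j) = W_i_hat . W_j
    np.fill_diagonal(proj, 0.0)
    return (proj ** 2).sum(axis=1)
```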
We can visualize the model we looked at previously this way:
Now that we have a way to visualize models, we can start to actually do experiments. We'll start by considering models with only a few features (n = 5) and a small hidden dimension (m = 2), trained at several sparsity levels.
As our standard intuitions would expect, the linear model always learns the top-m most important features, representing them orthogonally and dropping the rest – much like PCA.
The results are qualitatively similar for models with more features and hidden dimensions. For example, in larger models we see the same progression: as sparsity increases, more and more features are stored in superposition.
In the previous section, we observed a surprising empirical result: adding a ReLU to the output of our model allowed a radically different solution – superposition – which doesn't occur in linear models.
The model where it occurs is still quite mathematically simple. Can we analytically understand why superposition is occurring? And for that matter, why does adding a single non-linearity make things so different from the linear model case? It turns out that we can get a fairly satisfying answer, revealing that our model is governed by balancing two competing forces – feature benefit and interference – which will be useful intuition going forwards. We'll also discover a connection to the famous Thomson Problem in chemistry.
Let's start with the linear case. This is well understood by prior work! If one wants to understand why linear models don't exhibit superposition, the easy answer is to observe that linear models essentially perform PCA. But this isn't fully satisfying: if we set aside all our knowledge and intuition about linear functions for a moment, why exactly is it that superposition can't occur?
A deeper understanding can come from the results of Saxe et al., who studied the learning dynamics of deep linear networks.
The Saxe results reveal that there are fundamentally two competing forces which control learning dynamics in the considered model. Firstly, the model can attain a better loss by representing more features (we've labeled this "feature benefit"). But it also gets a worse loss if it represents more than it can fit orthogonally due to "interference" between features.
(As an aside, it is interesting to compare the notion of interference in linear models with the notion of coherence in compressed sensing.)
Can we achieve a similar kind of understanding for the ReLU output model? Concretely, we'd like to understand the expected loss of a given weight configuration in closed form.
The integral over the input distribution decomposes into terms according to how many features are active at once; in the high-sparsity limit, the term for 1-sparse vectors (exactly one feature active) dominates.
This new equation is vaguely similar to the famous Thomson problem in chemistry. In particular, if we assume uniform importance and that there are a fixed number of features with unit-norm embeddings, minimizing the loss amounts to spreading points over a hypersphere so as to minimize their pairwise interference – much as the Thomson problem spreads mutually repelling electrons over a sphere.
Another interesting property is that ReLU makes negative interference free in the 1-sparse case. This explains why the solutions we've seen prefer to only have negative interference when possible. Further, using a negative bias can convert small positive interferences into essentially being negative interferences.
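A quick numeric check of this claim, using a hypothetical pair of feature embeddings:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

w_i = np.array([1.0, 0.0])        # feature i's embedding
w_j_neg = np.array([-0.6, 0.8])   # negative interference: w_i . w_j < 0
w_j_pos = np.array([0.6, 0.8])    # positive interference: w_i . w_j > 0

x_i = 1.0                         # only feature i active (the 1-sparse case)
h = x_i * w_i
# Feature j was inactive, so its reconstructed value should be 0:
print(relu(h @ w_j_neg))   # 0.0 -> ReLU filters negative interference for free
print(relu(h @ w_j_pos))   # 0.6 -> positive interference leaks through as error
```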
What about the terms corresponding to less sparse vectors? We leave explicitly writing these out to the reader, but the main idea is that there are multiple compounding interferences, and the "active features" can experience interference. In a later section, we'll see that features often organize themselves into sparse interference graphs such that only a small number of features interfere with another feature – it's interesting to note that this reduces the probability of compounding interference and makes the 1-sparse loss term more important relative to others.
The results in the previous section seem to suggest that there are three outcomes for a feature when we train a model: (1) the feature may simply not be learned; (2) the feature may be learned, and represented in superposition; or (3) the model may represent a feature with a dedicated dimension. The transitions between these three outcomes seem sharp. Possibly, there's some kind of phase change.
(Here we use "phase change" in the generic sense of a discontinuous change, rather than the more technical sense of a discontinuity that appears only in the limit of infinite system size.)
One way to understand this better is to explore if there's something like a "phase diagram" from physics, which could help us understand when a feature is expected to be in one of these regimes. Although we can see hints of this in our previous experiment, it's hard to really isolate what's going on because many features are changing at once and there may be interaction effects. As a result, we set up the following experiment to better isolate the effects.
As an initial experiment, we consider models with 2 features but only 1 hidden layer dimension. We still consider the ReLU output model, x' = ReLU(W^T W x + b).
We can compare this to a theoretical "toy model of the toy model" where we can get closed-form solutions for the loss of different weight configurations as a function of importance and sparsity. There are three natural ways to store 2 features in 1 dimension: dedicate the dimension to the more important feature (W = [1, 0]), dedicate it to the less important feature (W = [0, 1]), or store both features in antipodal superposition (W = [1, −1]).
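Rather than deriving the closed forms here, a Monte Carlo sketch can compare the three configurations (the bias is omitted and the importance values are hypothetical, so this is an approximation of the "toy model of the toy model", not the paper's exact analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def expected_loss(W, importance, sparsity, n_samples=200_000):
    """Monte Carlo loss of x' = ReLU(W^T W x) for 2 features in 1 dimension."""
    x = rng.uniform(size=(n_samples, 2))
    x[rng.uniform(size=(n_samples, 2)) < sparsity] = 0.0
    h = x @ W                       # scalar hidden activation per sample
    x_hat = relu(np.outer(h, W))    # W^T W x, then ReLU
    return (importance * (x - x_hat) ** 2).sum(axis=1).mean()

importance = np.array([1.0, 0.7])  # hypothetical relative importance
configs = {"first only": np.array([1.0, 0.0]),
           "second only": np.array([0.0, 1.0]),
           "antipodal": np.array([1.0, -1.0])}
for S in (0.0, 0.9, 0.99):
    losses = {name: expected_loss(W, importance, S) for name, W in configs.items()}
    print(f"S={S}: best = {min(losses, key=losses.get)}  {losses}")
```

At low sparsity the dedicated-dimension solution wins; at high sparsity the antipodal solution takes over, which is exactly the crossover behind the phase change discussed below.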
As expected, sparsity is necessary for superposition to occur, but we can see that it interacts in an interesting way with relative feature importance. But most interestingly, there appears to be a real phase change, observed in both the empirical and theoretical diagrams! The optimal weight configuration discontinuously changes in magnitude and superposition. (In the theoretical model, we can analytically confirm that there's a first-order phase change: there's crossover between the functions, causing a discontinuity in the derivative of the optimal loss.)
We can ask this same question of embedding three features in two dimensions. This problem still has a single "extra feature" (now the third one) we can study, asking what happens as we vary its importance relative to the other two and change sparsity.
For the theoretical model, we now consider four natural solutions. We can describe solutions by asking "what feature direction, if any, did each feature get embedded as?"
These diagrams suggest that there really is a phase change between different strategies for encoding features. However, we'll see in the next section that there's much more complex structure this preliminary view doesn't capture.
We've seen that superposition can allow a model to represent extra features, and that the number of extra features increases as we increase sparsity. In this section, we'll investigate this relationship in more detail, discovering an unexpected geometric story: features seem to organize themselves into geometric structures such as pentagons and tetrahedrons! In some ways, the structure described in this section seems "too elegant to be true" and we think there's a good chance it's at least partly idiosyncratic to the toy model we're investigating. But it seems worth investigating because if anything about this generalizes to real models, it may give us a lot of leverage in understanding their representations.
We'll start by investigating uniform superposition, where all features are identical: independent, equally important and equally sparse. It turns out that uniform superposition has a surprising connection to the geometry of uniform polytopes! Later, we'll move on to investigate non-uniform superposition, where features are not identical. It turns out that this can be understood, at least to some extent, as a deformation of uniform superposition.
As mentioned above, we begin our investigation with uniform superposition, where all features have the same importance and sparsity. We'll see later that this case has some unexpected structure, but there's also a much more basic reason to study it: it's much easier to reason about than the non-uniform case, and has fewer variables we need to worry about in our experiments.
We'd like to understand what happens as we change the feature sparsity, S, holding everything else constant.
A convenient way to measure the number of features the model has learned is to look at the squared Frobenius norm, ||W||_F² = Σ_i ||W_i||². Since each represented feature contributes a norm of roughly 1, this approximately counts the number of represented features.
We'll plot D = m / ||W||_F², the number of hidden dimensions per represented feature, as a function of increasing sparsity.
Surprisingly, we find that this graph is "sticky" at 1 and ½: at 1, every represented feature gets a dedicated dimension, while at ½, features pair up into antipodal pairs, with two features sharing a single dimension with opposite signs.
It turns out that antipodal pairs are just the tip of the iceberg. Hiding underneath this curve are a number of extremely specific geometric configurations of features.
In the previous section, we saw that there's a sticky regime where the model has "half a dimension per feature" in some sense. This is an average statistical property of the features the model represents, but it seems to hint at something interesting. Is there a way we could understand what "fraction of a dimension" a specific feature gets?
We'll define the dimensionality of the i-th feature, D_i, as:

D_i = ||W_i||² / Σ_j (Ŵ_i · W_j)²

where Ŵ_i = W_i / ||W_i|| is the unit vector in the direction of feature i.
Intuitively, the numerator represents the extent to which a given feature is represented, while the denominator measures "how many features share the dimension it is embedded in", by projecting each feature onto its direction. In the antipodal case, each feature participating in an antipodal pair will have a dimensionality of ½ = 1 / (1 + 1).
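A sketch of this computation (our own helper name), with the antipodal case as a sanity check:

```python
import numpy as np

def feature_dimensionality(W):
    """D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2, columns of W as features."""
    norms = np.linalg.norm(W, axis=0)
    W_hat = W / np.maximum(norms, 1e-9)
    proj = W_hat.T @ W                # entry (i, j) = W_i_hat . W_j
    return norms ** 2 / (proj ** 2).sum(axis=1)

# Antipodal pair in 1 dimension: each feature should get dimensionality 1/2.
W = np.array([[1.0, -1.0]])
print(feature_dimensionality(W))      # -> [0.5, 0.5]
```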
We can now break the above plot down on a per-feature basis. This reveals many more of these "sticky points"! To help us understand this better, we're going to create a scatter plot annotated with some additional information:
Let's look at the resulting plot, and then we'll try to figure out what it's showing us:
What is going on with the points clustering at specific fractions?? We'll see shortly that the model likes to create specific weight geometries and kind of jumps between the different configurations.
In the previous section, we developed a theory of superposition as a phase change. But everything on this plot between 0 (not learning a feature) and 1 (dedicating a dimension to a feature) is superposition. Superposition is what happens when features have fractional dimensionality. That is to say – superposition isn't just one thing!
How can we relate this to our original understanding of the phase change? We often think of water as only having three phases: ice, water and steam. But this is a simplification: there are actually many phases of ice, often corresponding to different crystal structures (eg. hexagonal vs cubic ice). In a vaguely similar way, neural network features seem to also have many other phases within the general category of "superposition."
In the previous diagram, we found that there are distinct lines corresponding to dimensionality of: ¾ (tetrahedron), ⅔ (triangle), ½ (antipodal pair), ⅖ (pentagon), ⅜ (square antiprism), and 0 (feature not learned). We believe there would also be a 1 (dedicated dimension for a feature) line if not for the fact that basis features are indistinguishable from other directions in the dense regime.
Several of these configurations may jump out as solutions to the famous Thomson problem. (In particular, square antiprisms are much less famous than cubes and are primarily of note for their role in molecular geometry due to being a Thomson problem solution.) As we saw earlier, there is a very real sense in which our model can be understood as solving a generalized version of the Thomson problem. When our model chooses to represent a feature, the feature is embedded as a point on an m-dimensional hypersphere, and the interference terms in the loss push the embedded points apart, much as the repulsion between electrons does in the Thomson problem.
A second clue as to what's going on is that there are lines for the Thomson solutions which are uniform polyhedra (e.g. tetrahedron), but there seem to be split lines where we'd expect to see non-uniform solutions (e.g. instead of a ⅗ line for triangular bipyramids, we see a co-occurrence of points at ⅔ for triangles and points at ½ for antipodal pairs). In a uniform polyhedron, all vertices have the same geometry, and so if we embed features as them each feature has the same dimensionality. But if we embed features as a non-uniform polyhedron, different features will have more or less interference with others.
In particular, many of the Thomson solutions can be understood as tegum products (an operation which constructs polytopes by embedding two polytopes in orthogonal subspaces) of smaller uniform polytopes. (In the earlier graph visualizations of feature geometry, two subgraphs are disconnected if and only if they are in different tegum factors.) As a result, we should expect their dimensionality to actually correspond to the underlying factor uniform polytopes.
This also suggests a possible reason why we observe 3D Thomson problem solutions, despite the fact that we're actually studying a higher dimensional version of the problem. Just as many 3D Thomson solutions are tegum products of 2D and 1D solutions, perhaps higher dimensional solutions are often tegum products of 1D, 2D, and 3D solutions.
The orthogonality of factors in tegum products has interesting implications. For the purposes of superposition, it means that there can't be any "interference" across tegum-factors. This may be preferred by the toy model: having many features interfere simultaneously could be really bad for it. (See related discussion in our earlier mathematical analysis.)
At this point, it's worth making explicit that there's a correspondence between polytopes and symmetric, positive semi-definite, low-rank matrices (i.e. matrices of the form W^T W).
In some ways, the correspondence is trivial. If one has a rank-m matrix of the form W^T W, the columns of W are n points in an m-dimensional space, and these points (defined up to rotation) can be taken as the vertices of a polytope.
Put another way, there's an exact correspondence between polytopes and strategies for superposition. For example, every strategy for putting three features in superposition in a 2-dimensional space corresponds to a triangle, and every triangle corresponds to such a strategy. From this perspective, it doesn't seem surprising that if we have three equally important and equally sparse features, the optimal strategy is an equilateral triangle.
This correspondence also goes the other direction. Suppose we have a rank-m symmetric positive semi-definite matrix. We can always factor it as W^T W, recovering a set of feature embedding vectors that is unique up to rotation.
In fact, given such a set of vectors, we can construct a polytope by taking the points they define as vertices.
So far, this section has focused on the geometry of uniform superposition, where all features are of equal importance, equal sparsity, and independent. The model is essentially solving a variant of the Thomson problem. Because all features are the same, solutions corresponding to uniform polyhedra get especially low loss. In this subsection, we'll study non-uniform superposition, where features are somehow not uniform. They may vary in importance and sparsity, or have a correlational structure that makes them not independent. This distorts the uniform geometry we saw earlier.
In practice, it seems like superposition in real neural networks will be non-uniform, so developing an understanding of it seems important. Unfortunately, we're far from a comprehensive theory of the geometry of non-uniform superposition at this point. As a result, the goal of this section will merely be to highlight some of the more striking phenomena we observe:
We attempt to illustrate these phenomena with some representative experiments below.
The simplest kind of non-uniform superposition is to vary one feature and leave the others uniform. As an experiment, let's represent five equally important features in two dimensions and vary the sparsity of just one of them.
If we make it sufficiently sparse, there's a phase change, and it collapses from a pentagon to a pair of digons with the sparser point at zero. The phase change corresponds to loss curves corresponding to the two different geometries crossing over. (This observation allows us to directly confirm that it is genuinely a first order phase change.)
To visualize the solutions, we canonicalize them, rotating them to align with each other in a consistent manner.
These results seem to suggest that, at least in some cases, non-uniform superposition can be understood as a deformation of uniform superposition and jumping between uniform superposition configurations rather than a totally different regime. Since uniform superposition has a lot of understandable structure, but real world superposition is almost certainly non-uniform, this seems very promising!
The reason pentagonal solutions are not on the unit circle is that models reduce the effect of positive interference, setting a slight negative bias to cut off noise and setting their weight norms slightly above 1 to compensate.
A note for reimplementations: optimizing with a two-dimensional hidden space makes this easier to study, but the actual optimization problem turns out to be really challenging for gradient descent – much harder than even just having three dimensions. Getting clean results required fitting each model multiple times and taking the solution with the lowest loss. However, there's a silver lining to this: visualizing the sub-optimal solutions on a scatter plot as above allows us to see the loss curves for different geometries and gain greater insight into the phase change.
A more complicated form of non-uniform superposition occurs when there are correlations between features. This seems essential for understanding superposition in the real world, where many features are correlated or anti-correlated.
For example, one very pragmatic question to ask is whether we should expect polysemantic neurons to group the same features together across models. If the groupings were random, you could use this to detect polysemantic neurons, by comparing across models! However, we'll see that correlational structure strongly influences which features are grouped together in superposition.
The behavior seems to be quite nuanced, with a kind of "order of preferences" for how correlated features behave in superposition. The model ideally represents correlated features orthogonally, in separate tegum factors with no interactions between them. When that fails, it prefers to arrange them so that they're as close together as possible – it prefers positive interference between correlated features over negative interference. Finally, when there isn't enough space to represent all the correlated features, it will collapse them and represent their principal component instead! Conversely, when features are anti-correlated, models prefer to have them interfere, especially with negative interference. We'll demonstrate this with a few experiments below.
Throughout this section we'll refer to "correlated feature sets" and "anticorrelated feature sets".
Correlated Feature Sets. Our correlated feature sets can be thought of as "bundles" of co-occurring features. One can imagine a highly idealized version of what might happen in an image classifier: there could be a bundle of features used to identify animals (fur, ears, eyes) and another bundle used to identify buildings (corners, windows, doors). Features from one of these bundles are likely to appear together. Mathematically, we represent this by linking the choice of whether all the features in a correlated feature set are zero or not. Recall that we originally defined our synthetic distribution to have each feature be zero with probability S; for a correlated feature set, we instead make this choice once for the entire set, so its features are either all zero or all active together.
Anticorrelated Feature Sets. One could also imagine anticorrelated features which are extremely unlikely to occur together. To simulate these, we'll have anticorrelated feature sets where only one feature in the set can be active at a time. Concretely, the feature set is entirely zero with probability S; otherwise, exactly one feature in the set, chosen uniformly at random, takes a uniform [0, 1] value.
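A sketch of both sampling schemes (function names are our own; we assume the whole-set activation probability mirrors the per-feature sparsity S, which is our reading rather than a detail confirmed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_correlated_set(batch, set_size, sparsity):
    """All features in the set are zero together, else each is Uniform[0, 1]."""
    active = rng.uniform(size=(batch, 1)) >= sparsity    # one decision per set
    vals = rng.uniform(size=(batch, set_size))
    return np.where(active, vals, 0.0)

def sample_anticorrelated_set(batch, set_size, sparsity):
    """Set is entirely zero with probability `sparsity`; otherwise exactly
    one randomly chosen feature in the set is active."""
    out = np.zeros((batch, set_size))
    active = rng.uniform(size=batch) >= sparsity
    which = rng.integers(set_size, size=batch)
    out[np.arange(batch), which] = rng.uniform(size=batch)
    return out * active[:, None]
```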
For our initial investigation, we simply train a number of small toy models with correlated and anti-correlated features and observe what happens. To make this easy to study, we limit ourselves to models with a two-dimensional hidden space, so the feature embeddings can be plotted directly.
It turns out that the tendency of models to arrange correlated features to be orthogonal is actually quite a strong phenomenon. In particular, for larger models, it seems to generate a kind of "local almost-orthogonal basis" where, even though the model as a whole is in superposition, the correlated feature sets considered in isolation are (nearly) orthogonal and can be understood as having very little superposition.
To investigate this, we train a larger model with two sets of correlated features and visualize the resulting W^T W matrix.
If this result holds in real neural networks, it suggests we might be able to make a kind of "local non-superposition" assumption, where for certain sub-distributions we can assume that the activating features are not in superposition. This could be a powerful result, allowing us to confidently use methods such as PCA which might not be principled to generally use in the context of superposition.
One of the most interesting properties is that there seems to be a trade-off between Principal Components Analysis (PCA) and superposition. If there are two correlated features, the model can either represent both in superposition, or collapse them and dedicate a dimension to their shared principal component.
As an experiment, we consider six features, organized into three sets of correlated pairs. Features in each correlated pair are represented by a given color (red, green, and blue). The correlation is created by having both features always activate together – they're either both zero or neither zero. (The exact non-zero values they take when they activate is uncorrelated.)
As we vary the sparsity of the features, we find that in the very sparse regime, we observe superposition as expected, with features arranged in a hexagon and correlated features side-by-side. As we decrease sparsity, the features progressively "collapse" into their principal components. In very dense regimes, the solution becomes equivalent to PCA.
These results seem to hint that PCA and superposition are in some sense complementary strategies which trade off with one another. As features become more correlated, PCA becomes a better strategy. As features become sparser, superposition becomes a better strategy. When features are both sparse and correlated, mixtures of each strategy seem to occur. It would be nice to more deeply understand this space of tradeoffs.
It's also interesting to think about this in the context of continuous equivariant features, such as features which occur in different rotations.
The focus of this paper is how superposition contributes to the functioning of fully trained neural networks, but as a brief detour it's interesting to ask how our toy models – and the resulting superposition – evolve over the course of training.
There are several reasons why these models seem like a particularly interesting case for studying learning dynamics. Firstly, unlike most neural networks, the fully trained models converge to a simple but non-trivial structure that rhymes with an emerging thread of evidence that neural network learning dynamics might have geometric weight structure that we can understand. One might hope that understanding the final structure would make it easier for us to understand the evolution over training. Secondly, superposition hints at surprisingly discrete structure (regular polytopes of all things!). We'll find that the underlying learning dynamics are also surprisingly discrete, continuing an emerging trend of evidence that neural network learning might be less continuous than it seems. Finally, since superposition has significant implications for interpretability, it would be nice to understand how it emerges over training – should we expect models to use superposition early on, or is it something that only emerges later in training, as models struggle to fit more features in?
Unfortunately, we aren't able to give these questions the detailed investigation they deserve within the scope of this paper. Instead, we'll limit ourselves to a couple particularly striking phenomena we've noticed, leaving more detailed investigation for future work.
Perhaps the most striking phenomenon we've noticed is that the learning dynamics of toy models with large numbers of features appear to be dominated by "energy level jumps" where features jump between different feature dimensionalities. (Recall that a feature's dimensionality is the fraction of a dimension dedicated to representing a feature.)
Let's consider the problem setup we studied when investigating the geometry of uniform superposition in the previous section, where we have a large number of features of equal importance and sparsity. As we saw previously, the features ultimately arrange themselves into a small number of polytopes with fractional dimensionalities.
A natural question to ask is what happens to these feature dimensionalities over the course of training. Let's pick one model where all the features converge into digons and observe. In the first plot, each colored line corresponds to the dimensionality of a single feature. The second plot shows how the loss curve changes over the same duration.
Note how the dimensionality of some features "jump" between different values and swap places. As this happens, the loss curve also undergoes a sudden drop (a very small one at the first jump, and a larger one at the second jump).
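To see these jumps directly, one could log each feature's dimensionality throughout training. The sketch below is our own (reusing the hypothetical ToyModel class from the earlier sketch), not the paper's training code:

```python
import torch

def train_and_track(model, importance, sparsity, steps=5000, batch=1024):
    """Train the ReLU output toy model, recording each feature's
    dimensionality over training to look for "energy level" jumps."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    history = []
    for step in range(steps):
        x = torch.rand(batch, model.W.shape[1])
        x = x * (torch.rand_like(x) >= sparsity)         # sparsify features
        loss = (importance * (x - model(x)) ** 2).sum(-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 50 == 0:
            with torch.no_grad():
                W = model.W                               # (m, n), columns = features
                norms = W.norm(dim=0).clamp_min(1e-9)
                proj = (W / norms).T @ W
                D = norms ** 2 / (proj ** 2).sum(dim=1)
                history.append(D.numpy().copy())
    return history   # plot one line per feature to see the discrete jumps
```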
These results make us suspect that seemingly smooth decreases of the loss curve in larger models are in fact composed of many small jumps of features between different configurations. (For similar results of sudden mechanistic changes, see Olsson et al.'s induction head phase change.)
Many of our toy model solutions can be understood as corresponding to geometric structures. This is especially easy to see and study when there are only two or three hidden dimensions, where the feature embeddings can be visualized directly.
It turns out that, at least in some cases, the learning dynamics leading to these structures can be understood as a sequence of simple, independent geometric transformations!
One particularly interesting example of this phenomenon occurs in the context of correlated features, as studied in the previous section. Consider the problem of representing two sets of correlated features in a three-dimensional hidden space.
(Although the last solution – an octahedron with features from different correlated sets arranged in antipodal pairs – seems to be a strong attractor, the learning trajectory visualized above appears to be one of a few different learning trajectories that attract the model. The different trajectories vary at step C: sometimes the model gets pulled directly into the antiprism configuration from the start, or organizes features into antipodal pairs. Presumably this depends on which feature geometry the model is closest to when step B ends.)
The learning dynamics we observe here seem directly related to previous findings on simple models.
Although we're most interested in the implications of superposition for interpretability, there appears to be a connection to adversarial examples. If one gives it a little thought, this connection can actually be quite intuitive.
In a model without superposition, the end-to-end weights for the first feature are:

(W^T W)_0 = (1, 0, 0, …, 0)

But in a model with superposition, it's something like:

(W^T W)_0 = (1, ε, −ε, ε, …)

The small ε interference terms give an adversary leverage: by slightly perturbing many other features at once, they can produce a large change in the recovered value of the first feature.
To test this, we generated L2 adversarial examples (allowing a max L2 attack norm of 0.1 of the average input norm). We originally generated attacks with gradient descent, but found that for extremely sparse examples where ReLU neurons are in the zero regime 99% of the time, attacks were difficult, effectively due to gradient masking.
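As an illustration of what an analytic attack on this toy model could look like (our own reconstruction, not necessarily the exact attack used in the experiments; the function name and the first-order, pre-ReLU approximation are our choices):

```python
import numpy as np

def analytic_l2_attack(W, b, x, feature, eps):
    """Perturb x to maximally change the recovered value of `feature`
    within an L2 budget `eps`, using the row of end-to-end weights."""
    v = (W.T @ W)[feature]                 # pre-ReLU sensitivity of the feature
    delta = eps * v / np.linalg.norm(v)    # flip sign to decrease instead
    relu = lambda z: np.maximum(z, 0.0)
    clean = relu(W.T @ W @ x + b)[feature]
    adv = relu(W.T @ W @ (x + delta) + b)[feature]
    return x + delta, clean, adv
```

In a model without superposition that row is a one-hot vector, so the budget can only be spent on the feature itself; with superposition, the ε entries let the attack recruit every interfering feature at once.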
We find that vulnerability to adversarial examples sharply increases as superposition forms (increasing by >3x), and that the level of vulnerability closely tracks the number of features per dimension (the reciprocal of feature dimensionality).
We're hesitant to speculate about the extent to which superposition is responsible for adversarial examples in practice. There are compelling theories for why adversarial examples occur without reference to superposition (e.g. the hypothesis that they arise from non-robust but predictive features of the data).
In addition to observing that superposition can cause models to be vulnerable to adversarial examples, we briefly experimented with adversarial training to see if the relationship could be used in the other direction to reduce superposition. To keep training reasonably efficient, we used the analytic optimal attack against a random feature. We found that this did reduce superposition, but attacks had to be made unreasonably large (80% input L2 norm) to fully eliminate it, which didn't seem satisfying. Perhaps stronger adversarial attacks would work better. We didn't explore this further since the increased cost and complexity of adversarial training made us want to prioritize other lines of attack on superposition first.
So far, we've explored superposition in a model without a privileged basis. We can rotate the hidden activations arbitrarily and, as long as we rotate all the weights, have the exact same model behavior. That is, for any ReLU output model with weights W, the model with weights OW (for any orthogonal matrix O) behaves identically, since (OW)^T (OW) = W^T W.
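This invariance is easy to verify numerically; a minimal check (our own):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 8
W = rng.normal(size=(m, n))
O, _ = np.linalg.qr(rng.normal(size=(m, m)))   # random orthogonal matrix

# (OW)^T (OW) = W^T O^T O W = W^T W, so model behavior is unchanged.
print(np.allclose((O @ W).T @ (O @ W), W.T @ W))   # True
```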
Models without a privileged basis are elegant, and can be an interesting analogue for certain neural network representations which don't have a privileged basis – word embeddings, or the transformer residual stream. But we'd also (and perhaps primarily) like to understand neural network representations where there are neurons which do impose a privileged basis, such as transformer MLP layers or conv net neurons.
Our goal in this section is to explore the simplest toy model which gives us a privileged basis. There are at least two ways we could do this: we could add an activation function or apply L1 regularization to the hidden layer. We'll focus on adding an activation function, since the representation we are most interested in understanding is hidden layers with neurons, such as the transformer MLP layer.
This gives us the following "ReLU hidden layer" model:

h = ReLU(Wx)
x' = ReLU(W^T h + b)
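A minimal PyTorch sketch of this model, under the same assumptions as our earlier ToyModel sketch (initialization scale and class name are our own choices):

```python
import torch

class ReluHiddenModel(torch.nn.Module):
    """h = ReLU(W x); x' = ReLU(W^T h + b). The hidden ReLU gives the
    m neuron directions a privileged basis."""
    def __init__(self, n_features, m_hidden):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(m_hidden, n_features) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = torch.relu(x @ self.W.T)            # (batch, m) neuron activations
        return torch.relu(h @ self.W + self.b)  # (batch, n) reconstruction
```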
We'll train this model on the same data as before.
Adding a ReLU to the hidden layer radically changes the model from an interpretability perspective. The key thing is that while the hidden activations of our previous model could be freely rotated without changing behavior, the hidden-layer ReLU breaks that rotational symmetry, making the individual neuron directions meaningful.
We'll discuss this in much more detail shortly, but here's a comparison of weights resulting from a linear hidden layer model and a ReLU hidden layer model:
Recall that we think of basis elements in the input as "features," and basis elements in the middle layer as "neurons". Thus W describes which features map to which neurons.
What we see in the above plot is that the features are aligning with neurons in a structured way! Many of the neurons are simply dedicated to representing a feature! (This is the critical property that justifies why neuron-focused interpretability approaches – such as much of the work in the original Circuits thread – can be effective in some circumstances.)
Let's explore this in more detail.
Having a privileged basis opens up new possibilities for visualizing our models. As we saw above, we can simply inspect W directly: its rows correspond to neurons and its columns to features.
This stack plot visualization can be nice as models get bigger. It also makes polysemantic neurons obvious: they are simply the neurons with more than one significant weight.
We'll now visualize ReLU hidden layer toy models with a small number of features and neurons, trained across a range of sparsities.
However, we found that these small models were harder to optimize. For each model shown, we trained 1000 models and visualized the one with the lowest loss. Although the typical solutions are often similar to the minimal-loss solutions shown, selecting the minimal-loss solutions reveals even more structure in how features align with neurons. It also reveals that there are ranges of sparsity values where the optimal solutions for all models trained on data with that sparsity share the same weight configuration.
The solutions are visualized below, both as the raw weights $W$ and as per-neuron stack plots.
The most important thing to pay attention to is how there's a shift from monosemantic to polysemantic neurons as sparsity increases. Monosemantic neurons do exist in some regimes! Polysemantic neurons exist in others. And they can both exist in the same model! Moreover, while it's not quite clear how to formalize this, it looks a great deal like there's a neuron-level phase change, mirroring the feature phase changes we saw earlier.
It's also interesting to examine the structure of the polysemantic solutions, which turn out to be surprisingly structured and neuron-aligned. Features typically correspond to sets of neurons (monosemantic neurons might be seen as the special case where features correspond to singleton sets). There's also structure in the degree of polysemanticity: neurons transition from monosemantic, to representing only a few features, to gradually representing more. However, it's unclear how much of this generalizes to real models.
Unfortunately, the toy model described in this section has a significant weakness, which limits the regimes in which it shows interesting results. The issue is that the model doesn't benefit from the ReLU hidden layer – it has no role except limiting how the model can encode information. If given any chance, the model will circumvent it. For example, given a hidden layer bias, the model will set all the biases to be positive, shifting the neurons into a positive regime where they behave linearly. If one removes the bias, but gives the model enough features, it will simulate a bias by averaging over many features. The model will only use the ReLU activation function if absolutely forced, which is a significant mark against studying this toy model.
We'll introduce a model without this issue in the next section, but wanted to study this model as a simpler case study.
So far, we've shown that neural networks can store sparse features in superposition and then recover them. But we actually believe superposition is more powerful than this – we think that neural networks can perform computation entirely in superposition rather than just using it as storage. This model will also give us a more principled way to study a privileged basis where features align with basis dimensions.
To explore this, we consider a new setup where we imagine our input and output layer to be the layers of our hypothetical disentangled model, but have our hidden layer be a smaller layer we're imagining to be the observed model which might use superposition. We'll then try to compute a simple non-linear function and explore whether it can use superposition to do this. Since the model will have (and need to use) the hidden layer non-linearity, we'll also see features align with a privileged basis.
Specifically, we'll have the model compute the absolute value of its inputs, $y = |x|$.
Since this model needs ReLU to compute absolute value, it doesn't have the issues the model in the previous section had with trying to avoid the activation function.
The input feature vector, $x$, is sparse as before, but each active feature is now sampled uniformly from $[-1, 1]$, so that computing the absolute value is a genuinely non-linear task.
Following the previous section, we'll consider the "ReLU hidden layer" toy model variant, but no longer tie the two weights to be identical: $h = \mathrm{ReLU}(W_1 x)$, $y' = \mathrm{ReLU}(W_2 h + b)$.
The loss is still the mean squared error weighted by feature importances, $L = \sum_i I_i\,(y'_i - |x_i|)^2$, averaged over the input distribution.
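For concreteness, a minimal PyTorch sketch of this setup, including the importance-weighted loss; the feature count, hidden size, sparsity, and importance schedule are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AbsValueModel(nn.Module):
    """h = ReLU(W1 x);  y' = ReLU(W2 h + b), trained so that y' ≈ |x|."""
    def __init__(self, n_features=8, n_hidden=16):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.W2 = nn.Parameter(torch.randn(n_features, n_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = torch.relu(x @ self.W1.T)
        return torch.relu(h @ self.W2.T + self.b)

n_features, sparsity = 8, 0.9
importance = 0.8 ** torch.arange(n_features)   # decaying feature importances

model = AbsValueModel(n_features)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(10_000):
    x = torch.rand(1024, n_features) * 2 - 1                       # uniform in [-1, 1]
    x = x * (torch.rand(1024, n_features) < 1 - sparsity).float()  # sparsify
    loss = (importance * (model(x) - x.abs()) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```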
With this model, it's a bit less straightforward to study how individual features get embedded; because of the ReLU on the hidden layer, we can't just study a single product of the weight matrices the way we studied $W^\top W$ before.
As we saw in the previous section, having a hidden layer activation function means that it makes sense to visualize the weights in terms of neurons. We can visualize each neuron's incoming weights (a row of $W_1$) and outgoing weights (a column of $W_2$) side by side.
Let's look at what happens when we train a model with enough hidden neurons to compute absolute value without superposition – two per feature, since $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$ – on relatively dense data.
The resulting model – modulo a subtle issue about rescaling input and output weights – implements absolute value in exactly this way, dedicating a pair of neurons to each feature: one responds to its positive values and one to its negative values.
15. Because ReLU is positively homogeneous, the model has some freedom during learning: a neuron's input weights can be scaled down by any positive factor so long as its output weights are scaled up correspondingly, leaving the computation unchanged.
We've seen that – as expected – our toy model can learn to implement absolute value. But can it use superposition to compute absolute value for more features? To test this, we train models with more features than hidden neurons, across a range of sparsities.
A couple of notes on visualization: Since we're primarily interested in understanding superposition and polysemantic neurons, we'll show a stacked weight plot of the absolute values of weights. The features are colored by whether they are in superposition. To make the diagrams easier to read, neurons are faintly colored based on how polysemantic they are (as judged by eye from the plots). Neurons are ordered by the importance of their largest feature.
Much like we saw in the ReLU hidden layer models, these results demonstrate that activation functions, under the right circumstances, create a privileged basis and cause features to align with basis dimensions. In the dense regime, we end up with each neuron representing a single feature, and we can read feature values directly off of neuron activations.
However, once the features become sufficiently sparse, this model, too, uses superposition to represent more features than it has neurons. This result is notable because it demonstrates the ability of neural networks to perform computation even on data that is represented in superposition.
17. You might ask whether we can quantify how much extra computational power superposition provides by inspecting the loss. Unfortunately, this isn't easy: superposition appears as the task becomes sparser, so the losses of models with different amounts of superposition aren't directly comparable – they measure loss on different tasks! Keep in mind that the model needs to use the hidden-layer ReLU to compute absolute value; gradient descent finds solutions that approximate this computation effectively even when each neuron encodes a mixture of several features.
Focusing on the intermediate sparsity regimes, we find several additional qualitative behaviors that we find fascinatingly reminiscent of behavior that has been observed in real, full-scale neural networks:
To begin, we find that in some regimes, many of the model's neurons will encode pure features, but a subset of them will be highly polysemantic. This is similar to the phase change we saw earlier in the ReLU output model. However, in that case, the phase change was with respect to features, with more important features not being put in superposition. In this experiment, the neurons don't have any intrinsic importance, but we see that the neurons representing the most important features (on the left) tend to be monosemantic.
We find this to bear a suggestive resemblance to some previous work in vision models, which found some layers that contained "mostly pure" feature neurons, but with some neurons representing additional features on a different scale.
We also note that many neurons appear to be associated with a single "primary" feature – encoded by a relatively large weight – coupled with one or more "secondary" features encoded by smaller-magnitude weights on the same neuron. If we were to observe the activations of such a neuron over a range of input examples, we would find that the largest activations of that neuron were all or nearly all associated with the presence of the "primary" feature, but that the lower-magnitude activations were much more polysemantic.
Intriguingly, that description closely matches what researchers have found in previous work on language models.
If neural networks can perform computation in superposition, a natural question is to ask how exactly they're doing so. What does that look like mechanically, in terms of the weights? In this subsection, we'll (mostly) work through one such model and see an interesting motif of asymmetric superposition. (We use the term "motif" in the sense of the original Circuits thread, inspired by its use in systems biology.)
The model we're trying to understand is shown below on the left, visualized as a neuron weight stack plot, with features corresponding to colors. The model is only doing a limited amount of superposition, and many of the weights can be understood as simply implementing absolute value in the expected way.
However, there are a few neurons doing something else…
These other neurons implement two instances of asymmetric superposition and inhibition. Each instance consists of two neurons:
One neuron does asymmetric superposition. In normal superposition, one might store two features with equal weights (e.g. both at magnitude $1/\sqrt{2}$). Here, the neuron instead stores one feature with a large weight and the other with a small weight (with the corresponding output weights scaled inversely), so that one feature is represented faithfully while the other is exposed to one-sided interference.
To avoid the consequences of that interference, the model has another neuron heavily inhibit the feature in the case where there would have been positive interference. This essentially converts positive interference (which could greatly increase the loss) into negative interference (which has limited consequences due to the output ReLU).
There are a few other weights this doesn't explain. (We believe they're effectively small conditional biases.) But this asymmetric superposition and inhibition pattern appears to be the primary story.
Although superposition is scientifically interesting, much of our interest comes from a pragmatic motivation: we believe that superposition is deeply connected to the challenge of using interpretability to make claims about the safety of AI systems. In particular, it is a clear challenge to what we see as the most promising path toward saying that neural networks won't perform certain harmful behaviors, or toward catching "unknown unknowns" safety problems. This is because superposition is deeply linked to the ability to identify and enumerate over all features in a model, and the ability to enumerate over all features would be a powerful primitive for making claims about model behavior.
We begin this section by describing how "solving superposition" in a certain sense is equivalent to many strong interpretability properties which might be useful for safety. Next, we'll describe three high level strategies one might take to "solving superposition." Finally, we'll describe a few other additional strategic considerations.
We'd like a way to have confidence that models will never do certain behaviors such as "deliberately deceive" or "manipulate." Today, it's unclear how one might show this, but we believe a promising tool would be the ability to identify and enumerate over all features. The ability to have a universal quantifier over the fundamental units of neural network computation is a significant step towards saying that certain types of circuits don't exist.
18. Ultimately, we want to assert that models won't perform certain classes of behaviors. Enumerating all features makes it easy to assert that a given feature doesn't exist (e.g. "there is no 'deceptive behavior' feature"), but that isn't quite what we want: we expect models that represent the real world to also represent undesirable behaviors. However, it may be possible to construct subtler claims, such as "no 'deceptive behavior' feature participates in circuits X, Y, and Z." Feature enumeration also seems like a powerful tool against "unknown unknowns", since in some sense it exhaustively covers the network's behavior.
How does this relate to superposition? It turns out that the ability to enumerate over features is deeply intertwined with superposition. One way to see this is to imagine a neural network with a privileged basis and without superposition (like the monosemantic neurons found in early InceptionV1). In such a model, enumerating over features would reduce to simply enumerating over neurons.
For this reason, we'll call any method that gives us the ability to enumerate over features – and equivalently, unfold activations – a "solution to superposition". Any solution is on the table, from creating models that just don't have superposition, to identifying what directions correspond to features after the fact. We'll discuss the space of possibilities shortly.
We've motivated "solving superposition" in terms of feature enumeration, but it's worth noting that it's equivalent to (or necessary for) many other interpretability properties one might care about:
At a very high level, there seem to be three potential approaches to resolving superposition: (1) creating models that simply don't have superposition; (2) finding an overcomplete basis that describes how features are represented in models that do have superposition; and (3) hybrid approaches which change models, not to eliminate superposition, but to make it easier to find such an overcomplete basis after the fact.
Our sense is that all of these approaches are possible if one doesn't care about having a competitive model. For example, we believe it's possible to accomplish any of these for the toy models described in this paper. However, as one starts to consider serious neural networks, let alone modern large language models, all of these approaches begin to look very difficult. We'll outline the challenges we see for each approach in the following sections.
With that said, it's worth highlighting one bright spot before we focus on the challenges. You might have believed that superposition was something you could never fully get rid of, but that doesn't seem to be the case. All our results seem to suggest that superposition and polysemanticity are phases with sharp transitions. That is, there may exist a regime for every model where it has no superposition or polysemanticity. The question is largely whether the cost of getting rid of or otherwise resolving superposition is too high.
It's actually quite easy to get rid of superposition in the toy models described in this paper, albeit at the cost of a higher loss. Simply apply an L1 regularization term to the hidden layer activations (i.e. add $\lambda \sum_i |h_i|$ to the loss).
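Concretely, this is a one-term change to the training objective. A sketch, continuing the tied-weight model above (the penalty coefficient is an illustrative assumption):

```python
import torch

# Continuing the tied-weight ReLU hidden layer sketch above: add an L1
# penalty on the hidden activations h to push the model out of superposition.
l1_coeff = 1e-3   # penalty strength (illustrative; too large destroys performance)

def loss_fn(model, x):
    h = torch.relu(x @ model.W.T)
    x_hat = torch.relu(h @ model.W + model.b)
    return ((x_hat - x) ** 2).mean() + l1_coeff * h.abs().mean()
```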
However, it seems likely that models are significantly benefitting from superposition. Roughly, the sparser features are, the more features can be squeezed in per neuron. And many features in language models seem very sparse! For example, language models know about individuals with only modest public presences, such as several of the authors of this paper. Presumably we only occur with frequency significantly less than one in a million tokens. As a result, it may be the case that superposition effectively makes models much bigger.
All of this paints a picture where getting rid of superposition may be fairly achievable, but doing so will have a large performance cost. For a model with a fixed number of neurons, superposition helps – potentially a lot.
But this is only true if the constraint is thought of in terms of neurons. That is, superposition lets a model with a fixed number of neurons behave like a much larger, sparser model; but since the neurons of that imagined larger model would activate only rarely, the comparison looks quite different if one instead holds the number of flops fixed.
One family of models which change the flop-neuron relationship are Mixture of Experts (MoE) models (see the reviews of this literature). MoE models run only a subset of "expert" blocks on each input, so they can have many more neurons for the same number of flops per token.
It's unclear how far this can be pushed, especially given difficult engineering constraints. But there's an obvious lower bound, which is likely too optimistic but is interesting to think about: what if models only expended flops on neuron activations, and recovered the compute of all non-activating neurons? In this world, it seems unlikely that superposition would be optimal: you could always split a polysemantic neuron into dedicated neurons for each feature with the same cost, except for the cases where there would have been interference that hurt the model anyways. Our preliminary investigations comparing various types of superposition in terms of "loss reduction per activation frequency" seem to suggest that superposition is not optimal on these terms, although it asymptotically becomes as good as dedicated feature dimensions. Another way to think of this is that superposition exploits a gap between the sparsity of neurons and the sparsity of the underlying features; MoE eats that same gap, and so we should expect MoE models to have less superposition.
To be clear, MoE models are already well studied, and we don't think this changes the capabilities case for them. (If anything, superposition offers a theory for why MoE models have not proven more effective for capabilities when the case for them seems so initially compelling!) But if one's goal is to create competitive models that don't have superposition, MoE models become interesting to think about. We don't necessarily think that they specifically are the right path forward – our goal here has been to use them as an example of why we think it remains plausible there may be ways to build competitive superposition-free models.
The opposite strategy of creating a superposition-free model is to take a regular model, which has superposition, and find an overcomplete basis describing how features are embedded after the fact. This appears to be a relatively standard sparse coding (or dictionary learning) problem, where we want to take the activations of neural network layers and find out which directions correspond to features. More formally, given a matrix of hidden-layer activations sampled over many stimuli, if we hypothesize some number of underlying features, we can try to find a dictionary of candidate feature directions together with a sparse matrix of coefficients whose product reconstructs the activations. This approach has been explored in prior work.
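As a rough sketch of what this could look like in practice, here is standard dictionary learning applied to synthetic stand-in "activations" using scikit-learn; all sizes, and the choice of `DictionaryLearning` specifically, are illustrative assumptions rather than a recommended recipe:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
n_samples, n_neurons, n_true = 2000, 32, 64

# Synthetic stand-in for a layer's activations: sparse ground-truth features
# embedded as random directions in neuron space.
directions = rng.normal(size=(n_true, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
coeffs = rng.random((n_samples, n_true)) * (rng.random((n_samples, n_true)) < 0.05)
activations = coeffs @ directions

# Look for an overcomplete basis: more dictionary atoms than neurons.
dl = DictionaryLearning(n_components=n_true, alpha=0.1,
                        transform_algorithm="lasso_lars", random_state=0)
codes = dl.fit_transform(activations)   # sparse coefficients per sample
atoms = dl.components_                  # candidate feature directions
```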
The advantage of this is that we don't need to worry about whether we're damaging model performance. On the other hand, many other things are harder:
In addition to approaches which address superposition purely at training time, or purely after the fact, it may be possible to take "hybrid approaches" which do a mixture. For example, even if one can't train models that are entirely free of superposition, it may be possible to produce models with less superposition, which are then easier to decode.
21. In particular, it seems reasonable to expect that measures like L1 regularization, applied without architectural changes, could at least slightly reduce superposition without hurting performance. Note that there should be some level of superposition at which the derivative of the loss with respect to the amount of superposition is zero – otherwise the model would prefer to use more or less superposition. So, at least over some range, we should be able to reduce superposition without affecting model performance. Alternatively, architectural changes might make finding an overcomplete basis in large models easier or more computationally tractable, separately from trying to reduce superposition.
Phase Changes as Cause For Hope. Is totally getting rid of superposition a realistic hope? One could easily imagine a world where it can only be asymptotically reduced, and never fully eliminated. While the results in this paper seem to suggest that superposition is hard to get rid of because it's actually very useful, the upshot of it corresponding to a phase change is that there's a regime where it totally doesn't exist. If we can find a way to push models into the non-superposition regime, it seems likely it can be totally eliminated.
Any superposition-free model would be a powerful tool for research. We believe that most of the research risk is in whether one can make performant superposition free models, rather than whether it's possible to make superposition free models at all. Of course, ultimately, we need to make performant models. But a non-performant superposition free model could still be a very useful research tool for studying superposition in normal models. At present, it's challenging to study superposition in models because we have no ground truth for what the features are. (This is also the reason why the toy models described in this paper can be studied – we do know what the features are!) If we had a superposition-free model, we may be able to use it as a ground truth to study superposition in regular models.
Local bases are not enough. Earlier, when we considered the geometry of non-uniform superposition, we observed that models often form local orthogonal bases, where co-occurring features are orthogonal. This suggests a strategy for locally understanding models on sufficiently narrow sub-distributions. However, if our goal is to eventually make useful statements about the safety of models, we need mechanistic accounts that hold for the full distribution (and off distribution). Local bases seem unlikely to give this to us.
Why are we interested in toy models? We believe they are useful proxies for studying the superposition we suspect might exist in real neural networks. But how can we know if they're actually a useful toy model? Our best validation is whether their predictions are consistent with empirical observations regarding polysemanticity. To the best of our knowledge they are. In particular:
This doesn't mean that everything about our toy model reflects real neural networks. Our intuition is that some of the phenomena we observe (superposition, monosemantic vs polysemantic neurons, perhaps the relationship to adversarial examples) are likely to generalize, while other phenomena (especially the geometry and learning dynamics results) are much more uncertain.
This paper has shown that the superposition hypothesis is true in certain toy models. But if anything, we're left with many more questions about it than we had at the start. In this final section, we review some of the questions which strike us as most important: what do we know, and what would we like future work to clarify?
Our work is inspired by research exploring the features that naturally occur in neural networks. Many models form at least some interpretable features. Word embeddings have semantic directions (see the well-known word-analogy results).
The earliest reference to superposition in artificial neural networks that we're aware of is Arora et al.'s work on linear structure in word embeddings, which models each embedding as a sparse combination of underlying sense vectors.
In parallel with this, investigations of individual neurons in models with privileged bases were beginning to grapple with "polysemantic" neurons which respond to unrelated inputs.
Separate from all of this, Cheung et al. studied how multiple models can be stored in superposition within a single network's parameters.
The goal of learning disentangled representations arises from Bengio et al.'s influential position paper on representation learning.
Concretely, disentanglement research often explores whether one can train a VAE or GAN where basis dimensions correspond to the major features one might use to describe the problem (e.g. rotation, lighting, gender… as relevant). Early work often focused on semi-supervised approaches where the features were known in advance, but fully unsupervised approaches started to develop around 2016.
Put another way, the goal of disentanglement might be described as imposing a strong privileged basis on representations which are rotationally invariant by default. This helps get at ways in which the questions of polysemanticity and superposition are a bit different from disentanglement. Consider that when we deal with neurons, rather than embeddings, we have a privileged basis by default. It varies by model, but many neurons just cleanly respond to features. This means that polysemanticity arises as a kind of anomalous behavior, and superposition arises as a hypothesis for explaining it. The question then isn't how to impose a privileged basis, but how to remove superposition as a fundamental problem to accessing features.
Of course, if the superposition hypothesis is true, there are still a number of connections to disentanglement. On the one hand, it seems likely superposition occurs in the latent spaces of generative models, even though that isn't an area we've investigated. If so, it may be that superposition is a major reason why disentanglement is difficult. Superposition may allow generative models to be much more effective than they would otherwise be without. Put another way, disentanglement often assumes a small number of important latent variables to explain the data. There are clearly examples of such variables, like the orientation of objects – but what if a large number of sparse, rare, individually unimportant features are collectively very important? Superposition would be the natural way for models to represent this.
22. A subtler issue is that GANs and VAEs typically assume Gaussian latent spaces. Sparse latent variables are highly non-Gaussian, but the central limit theorem implies that the superposition of many such variables will look increasingly Gaussian. So the latent spaces of some generative models may actually force models into superposition strategies! On the other hand, one can imagine ideas from disentanglement playing a role in building architectures that resist superposition, by creating a more strongly privileged basis.
The toy problems we consider are quite similar to the problems considered in the field of compressed sensing, which is also known as compressive sensing and sparse recovery. However, there are some important differences:
In general, our toy model is solving a similar problem using less powerful methods than compressed sensing algorithms, especially because the computational model is so much more restricted (to just a single linear transformation and a non-linearity) compared to the arbitrary computation that might be used by a compressed sensing algorithm.
As a result, compressed sensing lower bounds—which give lower bounds on the dimension of the embedding such that recovery is still possible—can be interpreted as giving an upper bound on the amount of superposition in our toy model. In particular, in various compressed sensing settings, one can recover an $n$-dimensional, $k$-sparse vector from an $m$-dimensional projection if and only if $m$ is at least on the order of $k \log(n/k)$.
At first, this bound appears to allow a number of features that is exponential in the number of embedding dimensions: rearranging $m \gtrsim k \log(n/k)$ shows that, for fixed sparsity $k$, the number of recoverable features can grow like $n \approx k\,e^{m/k}$.
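To get a feel for the scale of the bound, consider an illustrative worked instance (the numbers are our own, chosen only for concreteness): with $n = 10^4$ candidate features of which at most $k = 5$ are simultaneously active, recovery requires only $m \gtrsim k \log(n/k) = 5 \ln(2000) \approx 38$ embedding dimensions. Conversely, a few dozen dimensions could in principle support thousands of sufficiently sparse features.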
A striking parallel between our toy model and compressed sensing is the existence of phase changes.
24. Notably, in the compressed sensing case, the phase change occurs in the limit as the number of dimensions goes to infinity – for finite-dimensional spaces, the transition is rapid but not discontinuous. In compressed sensing, if one considers the two-dimensional space defined by a vector's sparsity and dimensionality, there are sharp phase transitions where vectors are recoverable almost surely in one regime and almost surely unrecoverable in the other.
Another interesting line of work has tried to build useful sparse recovery algorithms using neural networks.
Sparse Coding studies the problem of finding a sparse representation of dense data. One can think of it as being like compressed sensing, except the matrix projecting sparse vectors into the lower dimensional space is also unknown. This topic goes by many different names including sparse coding (most common in neuroscience), dictionary learning (in computer science), and sparse frame design (in mathematics). For a general introduction, we refer readers to a textbook by Michael Elad.
Classic sparse coding algorithms take an expectation-maximization approach (this includes Olshausen et al.'s early work).
From our perspective, sparse coding is interesting because it's probably the most natural mathematical formulation of trying to "solve superposition" by discovering which directions correspond to features. Interestingly, this is the reverse of how sparse coding is usually applied in neuroscience, which tends to treat biological neurons as computing a sparse code of their inputs; we are instead interested in running it in the other direction, to find features in superposition over neurons. But can these methods actually resolve superposition in practice? Prior work has attempted to use sparse coding to discover sparse structure, and recently, work by Sharkey et al., building on the original version of this paper, has had preliminary success using sparse autoencoders to extract features from superposition in toy models. Overall, we are only at the earliest stages of using sparse coding and dictionary learning in this way, but the outlook seems quite promising. See the section "Approach 2: Finding an overcomplete basis" for more discussion.
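A minimal sketch of a sparse autoencoder of the general kind used in that line of work; the architecture and penalty here are our illustrative assumptions, not a reproduction of Sharkey et al.'s code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose activations into an overcomplete set of candidate features."""
    def __init__(self, n_neurons=32, n_dict=256):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_dict)
        self.decoder = nn.Linear(n_dict, n_neurons, bias=False)

    def forward(self, acts):
        code = torch.relu(self.encoder(acts))   # candidate feature activations
        return self.decoder(code), code

def sae_loss(sae, acts, l1_coeff=1e-3):
    recon, code = sae(acts)
    return ((recon - acts) ** 2).mean() + l1_coeff * code.abs().mean()
```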
Our work explores representations in artificial “neurons”. Neuroscientists study similar questions in biological neurons. There are a variety of theories for how information could be encoded by a group of neurons. At one extreme is a local code, in which every individual stimulus is represented by a separate neuron. At the other extreme is a maximally-dense distributed code, in which the information-theoretic capacity of the population is fully utilized, and every neuron in the population plays a necessary role in representing every input.
One challenge in comparing our work with the neuroscience literature is that a "distributed representation" seems to mean different things. Consider an overly-simplified example of a population of neurons, each taking a binary value of active or inactive, and a stimulus set of sixteen items: four shapes, with four colors (an example borrowed from prior work).
Decomposability doesn’t necessarily mean each feature gets its own neuron. Instead, it could be that each feature corresponds to a “direction in activation-space”
26. Although the notion of a "direction in activation space" is common in the distributed-coding literature, we haven't encountered a specific term for this hypothesis, perhaps through oversight on our part. We call this hypothesis linearity, given scalar "activations" (firing rates, in biological neurons). "Feature neurons" are only incentivized to develop when there is a privileged basis. In biological neurons, metabolic factors are often hypothesized to induce a privileged basis, producing "sparse coding": this is what one would expect if a neural system's energy consumption grows linearly or sublinearly with firing rate, and experimental evidence seems to support this. Moreover, neurons are the basic unit by which biological neural networks implement non-linear transformations, so if a feature needs to be transformed non-linearly, a "feature neuron" is an effective way to do it.
Any decomposable linear code that uses orthogonal feature vectors is functionally equivalent from the viewpoint of a linear readout. So, a code can both be “maximally distributed” — in the sense that every neuron participates in representing every input, making each neuron extremely polysemantic — and also have no more features than it has dimensions. In this conception, it’s clear that a code can be fully “distributed” and also have no superposition.
A notable difference between our work and the neuroscience literature we have encountered is that we treat the probability with which features co-occur as a central concept.
28. A related but distinct concept in the neuroscience literature is the "binding problem".
One hypothesis in neuroscience is that highly compressed representations might have an important use in long-range communication between brain areas.
It's worth noting that the term "distributed representation" is also used in deep learning, and has the same ambiguities of meaning there. Our sense is that some influential early works (e.g. in the connectionist literature) use the term without clearly distinguishing these meanings.
After publishing the original version of this paper, a number of readers generously brought to our attention additional connections to prior work. We don't have a sufficiently deep understanding of this work to offer a detailed review, but we offer a brief overview below:
Inspired by the original Circuits Thread and Distill's Discussion Article experiment, the authors invited several external researchers who we had previously discussed our preliminary results with to comment on this work. Their comments are included below.
Tom McGrath is a research scientist at DeepMind.
The results in this paper are an important contribution - they really further our theoretical understanding of a phenomenon that may be central to interpretability research and understanding network representations more generally. It’s surprising that such simple settings can produce these rich phenomena. We’ve reproduced the experiments in the Demonstrating Superposition and Superposition as a Phase Change sections and have a minor additional result to contribute.
It is possible to exactly solve the expected loss for the basic ReLU output toy model by making an appropriate substitution in the loss integral.
Making this substitution renders the integral analytically tractable, which allows us to plot the full loss surface and solve for the loss minima directly. We show some example loss surfaces below:
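We don't reproduce the closed-form derivation here, but the expected loss it solves can also be estimated numerically. A Monte Carlo sketch for the two-feature, one-dimensional ReLU output model (the parameterization, and the omission of the bias, are simplifying assumptions):

```python
import numpy as np

def expected_loss(w1, w2, sparsity, importance=(1.0, 1.0), n_mc=20_000, seed=0):
    """Monte Carlo estimate of the expected loss of the n=2, m=1 ReLU output
    model x' = ReLU(W^T W x) (bias omitted for simplicity)."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_mc, 2))
    x = x * (rng.random((n_mc, 2)) < 1 - sparsity)   # each feature zero w.p. sparsity
    W = np.array([[w1, w2]])                          # 1 x 2 embedding matrix
    x_hat = np.maximum(x @ W.T @ W, 0.0)
    return (np.array(importance) * (x_hat - x) ** 2).sum(axis=1).mean()

# Trace the loss surface over a grid of weights.
grid = np.linspace(-1.5, 1.5, 31)
surface = [[expected_loss(w1, w2, sparsity=0.9) for w1 in grid] for w2 in grid]
```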
Although many of these loss surfaces (Figure 1a, 1b) have minima qualitatively similar to one of the network weights used in the section Superposition as a Phase Change, we also find a new phase, shown in Figure 1c.
As Figure 1c shows, some combinations of sparsity and relative feature importance lead to loss surfaces with two minima (once the symmetry of the loss surface is accounted for).
Original Authors' Response: This closed-form analysis of the expected loss is a welcome sharpening of the phase-change picture described in this paper.
Jeffrey Wu and Dan Mossing are members of the Alignment team at OpenAI.
We are very excited about these toy models of polysemanticity. This work sits at a rare intersection of being plausibly very important for training more interpretable models and being very simple and elegant. The results have been surprisingly easy to replicate -- we have reproduced (with very little fuss) plots similar to those in the Demonstrating Superposition – Basic Results, Geometry – Feature Dimensionality, and Learning Dynamics – Discrete "Energy Level" Jumps sections.
Original Authors' Response: We really appreciate this replication of our basic results. Some of our findings were quite surprising to us, and this gives us more confidence that they aren't the result of an idiosyncratic quirk or bug in our implementations.
Spencer Becker-Kahn is a senior research scholar at the Future of Humanity Institute and a SERI Machine Learning Alignment Theory Scholar.
After seeing preliminary results, I independently replicated some of the key graphs from Demonstrating Superposition – Basic Results and, using very small toy models, produced a series of plots consistent with the conceptual picture emerging in Geometry – Feature Dimensionality and Superposition and Learning Dynamics.
See also Twitter thread.
Adam Jermyn is an independent researcher focused on AI alignment and interpretability. He was previously a Research Fellow at the Flatiron Institute’s Center for Computational Astrophysics. Evan Hubinger is a Research Fellow at MIRI. Nicholas Schiefer is a member of the technical staff at Anthropic and an author of the original paper.
Inspired by the results in this paper and the previous paper introducing the SoLU activation, we have been investigating whether changes to the model architecture or training process can reduce superposition in toy models. After replicating several of these results independently, we made the following extensions in that direction:
At least in some limits, this suggests there may not be a price to be paid for monosemanticity. Detailed results can be found in our paper, Engineering Monosemanticity in Toy Models (Alignment Forum, ArXiV).
Tom Henighan and Chris Olah are authors of the original paper.
In the "Feature Dimensionality" section, we found that features organized into clean polytopes when there are more features than can be easily represented in the embedding dimensions.
We briefly investigated this further and found that the number of features competing to be represented significantly influences this phenomenon. Cleaner structure often seems to emerge when there is more "pressure" – more features competing to be represented. This is especially true at high sparsity levels. Additionally, training longer seems to also produce cleaner structure.
More investigation would be needed to really understand this phenomenon.
Marius Hobbhahn is a PhD student at the University of Tuebingen.
I replicated the “Basic Results” in Section 2 (“Demonstrating Superposition”) and all of Section 7 (“Superposition in a Privileged Basis”) of the “Toy Models of Superposition” paper. All of my findings are identical to the ones described in the paper. I replicated most findings in the follow-up “Superposition, Memorization, and Double Descent” paper.
The details of my replication can be found in my write-up “More Findings on Memorization and Double Descent”.
Lee Sharkey, Dan Braun, and Beren Millidge are researchers at Conjecture.
The results from this paper, and the strategic picture it paints, inspired our preliminary follow-up work that aimed to address some of the challenges described in the section titled 'Approach 2: Finding an overcomplete basis'.
Before studying the activations of real neural networks, where we're not sure what the 'ground truth' features are, we studied a toy example. We generated a set of toy ground truth features and created a dataset using sparse combinations of them. We found that a one-layer sparse autoencoder with an L1 penalty on its hidden-layer activations could recover the ground-truth features.
For the toy dataset, we knew how many ground truth features there were. But we ultimately want to count the features used by real neural networks, where the number of features is unknown. We explored three ways to count the features in the toy dataset: a) Counting dead neurons in the autoencoders; b) looking at autoencoder losses; and c) comparing the features learned by autoencoders of different sizes. We found indications that these methods might be suitable to count the number of features in superposition in real neural data.
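As an illustration of the first of these counting methods, a sketch that counts the dictionary units which ever activate; the threshold, and the `sae` interface from the sketch above, are assumptions:

```python
import torch

@torch.no_grad()
def count_live_features(sae, acts, threshold=1e-6):
    """Count dictionary units that ever activate above a small threshold;
    the remainder are 'dead neurons' in the autoencoder."""
    _, code = sae(acts)                          # (n_samples, n_dict)
    return int((code.max(dim=0).values > threshold).sum())
```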
We also applied our method to real activations from a small language model. Our initial, preliminary investigations led to inconclusive results, possibly resulting from having used autoencoders that were either too small or undertrained. At the time of writing, investigations are ongoing.
Neel Nanda is an external researcher in mechanistic interpretability. This is a description of his blog post, Actually, Othello-GPT Has A Linear Emergent World Representation.
I describe a natural experiment testing the linear representation hypothesis described in this paper – the idea that features correspond to directions in neural networks.
Background: Martin Wattenberg (an author on this paper) and colleagues recently found that, if you train a transformer language model to predict the next token in synthetic Othello games (where each move is a randomly chosen legal move), it forms an emergent model of the board state (despite only being trained to predict the next move!). They showed that the state of the board (whether each cell was empty, black, or white) could be recovered with high accuracy by a one-hidden-layer MLP probe. They further showed that you could use the world model to causally intervene on the model's residual stream. By choosing another board state, and changing the residual stream (with gradient descent) such that the probe indicates that new board state, they caused the model to output legal moves in the new board state, even if the edited board state was impossible to reach via legal Othello play!
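As a sketch of what such a probe looks like, here is a minimal linear probe; the shapes and the per-cell three-class setup are illustrative assumptions, not the actual Othello-GPT code:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: d_model residual-stream activations, 64 board cells,
# three classes per cell (empty / black / white, or "mine" / "theirs").
d_model, n_cells, n_classes = 512, 64, 3
probe = nn.Linear(d_model, n_cells * n_classes)

def probe_loss(resid, labels):
    """resid: (batch, d_model) activations; labels: (batch, n_cells) class ids."""
    logits = probe(resid).view(-1, n_cells, n_classes)
    return nn.functional.cross_entropy(logits.reshape(-1, n_classes),
                                       labels.reshape(-1))
```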
Pre-Registered Hypothesis: The probing and causal intervention together provided strong evidence that the model had learned to represent features corresponding to the state of each square on the board. Yet, notably, linear probes were not able to recover the board state. Since linear features should be recoverable with a one-layer probe, and the causal intervention suggests the model both computes and uses the board state, this seemed like significant evidence against the linear representation hypothesis.
However, Chris Olah (an author on this paper) argued the model might still be representing features linearly if it used a different set of features, and that the probe and causal intervention may be picking up on this different set of features. This created an informal pre-registered prediction of the hypothesis which was contrary to the evidence at the time.
Results: I independently came to the same conclusion as Chris and investigated the Othello playing model. I found that the model does form an emergent model of the board state that is linearly represented and can be extracted with a linear probe. But that as the model plays both black and white moves, the model represents the state of a cell as whether it has “my colour” vs “opponent’s colour”. Further, I found circumstantial evidence that these features are used by the model, as we can linearly intervene on the residual stream using the directions given by the probe to edit the represented board state, and the model plays legal moves in the new board state.
I consider these results to be notable, as the paper’s results provided evidence against the linear representation hypothesis and the hypothesis faced genuine risk of falsification. And the hypothesis made non-trivial predictions that were contrary to where the evidence pointed, but these turned out to be true. This is both a proof of concept that there are underlying principles of neural networks which have predictive power about models, and a natural experiment supporting the linear representation hypothesis.
I think that there is further work interpreting the Othello playing model that could test other hypotheses in this paper and our broader conceptual frameworks about neural networks and transformers, such as by looking for monosemantic vs superposed neurons in its MLP layers. The model is both complex enough to be interesting and expose principles of how transformers learn algorithms, yet the algorithmic nature of the task and existence of the probe suggests that finding circuits will be tractable. I elaborate on what I consider promising directions of future work in a follow-up post.
Fred Zhang is a PhD student in the Theory Group of the EECS Department at UC Berkeley.
In the Geometry of Superposition section, the paper defines a notion of feature dimensionality, $D_i = \|W_i\|^2 / \sum_j (\hat{W_i} \cdot W_j)^2$, where $W_i$ is the embedding vector of the $i$-th feature and $\hat{W_i} = W_i / \|W_i\|$ is its unit vector.
Following this definition, the paper makes the remark that "Empirically, it seems that the dimensionality of all features add up to the number of embedding dimensions when the features are 'packed efficiently' in some sense." In this comment, I point out a natural, theoretical explanation of this observation. The argument is via the notion of leverage score in matrix approximation. I’ll define it first, then explain how it connects to feature dimensionality.
At a conceptual level, leverage score is a measure of the importance of a row in composing the row space of a matrix. For instance, if a row is orthogonal to all other rows, its leverage score is 1, meaning that it's maximally important. This is natural, since removing it would decrease the rank of the matrix and completely change the row space. Formally, if $A$ is a tall, full-rank $n \times d$ matrix ($n \geq d$) with rows $a_1, \dots, a_n$, the leverage score of row $i$ is $\tau_i = a_i^\top (A^\top A)^{-1} a_i$.
Notice that the denominator term equals $\hat{W_i}^\top W^\top W \hat{W_i}$ (writing $W$ for the matrix whose rows are the feature embedding vectors), which is precisely the kind of quadratic form that appears in the leverage score.
Three quick remarks on this definition:
Returning to my main point, another nice fact about leverage scores is that they sum up to the rank of the matrix. In the tall and thin case above, they sum up to $d$ (if the matrix is full-rank). Given that, it is natural that this paper makes the empirical observation that the sum of the feature dimensionalities $D_i$ equals the number of embedding dimensions when the vectors are "packed efficiently".
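This fact is easy to check numerically; a small sketch with random data and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))   # a tall, full-rank matrix

# Leverage score of row i: tau_i = a_i^T (A^T A)^{-1} a_i.
G_inv = np.linalg.inv(A.T @ A)
tau = np.einsum("ij,jk,ik->i", A, G_inv, A)

print(tau.sum())   # ~5.0: leverage scores sum to rank(A) = d
```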
We provide a notebook to reproduce some of the core diagrams in this article here. (It isn't comprehensive, since we needed to rewrite code for our experiments to run outside our codebase.) We provide a separate notebook for the theoretical phase change diagrams.
Note that the reproductions by other researchers mentioned in comments above were not based on this code, but are instead fully independent replications with clean code from the description in an early draft of this article.
We're extremely grateful to a number of colleagues across several organizations for their invaluable support in our writing of this paper.
Jeff Wu, Daniel Mossing, Tom McGrath, and Kshitij Sachan did independent replications of many of our experiments, greatly increasing our confidence in our results. Kshitij Sachan's and Tom McGrath's additional investigations and insightful questions both pushed us to clarify our understanding of the superposition phase change (both as reflected in this paper, and in further understanding which we learned from them not captured here). Buck Shlegeris, Adam Scherlis, and Adam Jermyn shared valuable insights into the mathematical nature of the toy problem and related work. Adam Jermyn also coined the term "virtual neurons."
Gabriel Goh, Neel Nanda, Vladimir Mikulik, and Nick Cammarata gave detailed feedback which improved the paper, in addition to being motivating. Alex Dimakis, Piotr Indyk, Dan Yamins generously took time to discuss these results with us and give advice on how they might connect to their area of expertise. Finally, we benefited from the feedback and comments of James Bradbury, Sebastian Farquhar, Shan Carter, Patrick Mineault, Alex Tamkin, Paul Christiano, Evan Hubinger, Ian McKenzie, and Sid Black. We're additionally grateful to Trenton Bricken and Manjari Narayan for referring us to valuable related work we originally missed. Thanks to Ken Kahn for typo corrections.
Finally, we're very grateful to all our colleagues at Anthropic for their advice and support: Daniela Amodei, Jack Clark, Tom Brown, Ben Mann, Nick Joseph, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Timothy Telleen-Lawton, Anna Chen, Yuntao Bai, Jeffrey Ladish, Deep Ganguli, Liane Lovitt, Nova DasSarma, Jia Yuan Loke, Jackson Kernion, Tom Conerly, Scott Johnston, Jamie Kerr, Sheer El Showk, Stanislav Fort, Rebecca Raible, Saurav Kadavath, Rune Kvist, Jarrah Bloomfield, Eli Tran-Johnson, Rob Gilson, Guro Khundadze, Filipe Dobreira, Ethan Perez, Sam Bowman, Sam Ringer, Sebastian Conybeare, Jeeyoon Hyun, Michael Sellitto, Jared Mueller, Joshua Landau, Cameron McKinnon, Sandipan Kundu, Jasmine Brazilek, Da Yan, Robin Larson, Noemí Mercado, Anna Goldie, Azalia Mirhoseini, Jennifer Zhou, Erick Galankin, James Sully, Dustin Li, James Landis.
Basic Results - The basic toy model results demonstrating the existence of superposition were done by Nelson Elhage and Chris Olah. Chris suggested the toy model and Nelson ran the experiments.
Phase Change - Chris Olah ran the empirical phase change experiments, with help from Nelson Elhage. Martin Wattenberg introduced the theoretical model where exact losses for specific weight configurations can be computed.
Geometry - The uniform superposition geometry results were discovered by Nelson Elhage and Nicholas Schiefer, with help from Chris Olah. Nelson discovered the original geometric structures.
Learning Dynamics - Nelson Elhage discovered the "energy level jump" phenomenon, in collaboration with Nicholas Schiefer and Chris Olah. Martin Wattenberg discovered the "geometric transformations" phenomenon.
Adversarial Examples - Chris Olah and Catherine Olsson found evidence of a connection between superposition and adversarial examples.
Superposition with a Privileged Basis / Doing Computation - Chris Olah did the basic investigation of superposition in a privileged basis. Nelson Elhage, with help from Chris, investigated the "absolute value" model which provided a more principled demonstration of superposition and showed that computation could be done while in superposition. Nelson discovered the "asymmetric superposition" motif.
Theory - The theoretical picture articulated over the course of this paper (especially in the "mathematical understanding" section) was developed in conversations between all authors, but especially Chris Olah, Jared Kaplan, Martin Wattenberg, Nelson Elhage, Tristan Hume, Tom Henighan, Catherine Olsson, Nicholas Schiefer, Dawn Drain, Shauna Kravec, Roger Grosse, Robert Lasenby, and Sam McCandlish. Jared introduced the strategy of rewriting the loss by grouping terms with the number of active features. Both Jared and Martin independently noticed the value of investigating analytically tractable special cases of the model.
Strategic Picture - The strategic picture articulated in this paper – What does superposition mean for interpretability and safety? What would a suitable solution be? How might one solve it? – developed in extensive conversations between authors, and in particular Chris Olah, Tristan Hume, Nelson Elhage, Dario Amodei, Jared Kaplan. Nelson Elhage recognized the potential importance of "enumerative safety", further articulated by Dario. Tristan brainstormed extensively about ways one might solve superposition and pushed Chris on this topic.
Writing - The paper was primarily drafted by Chris Olah, with some sections by Nelson Elhage, Tristan Hume, Martin Wattenberg, and Catherine Olsson. All authors contributed to editing, with particularly significant contributions from Zac Hatfield-Dodds, Robert Lasenby, Kipply Chen, and Roger Grosse.
Illustration - The paper was primarily illustrated by Chris Olah, with help from Tristan Hume, Nelson Elhage, and Catherine Olsson.
Please cite as:
Elhage, et al., "Toy Models of Superposition", Transformer Circuits Thread, 2022.
BibTeX Citation:
@article{elhage2022superposition,
  title={Toy Models of Superposition},
  author={Elhage, Nelson and Hume, Tristan and Olsson, Catherine and Schiefer, Nicholas and Henighan, Tom and Kravec, Shauna and Hatfield-Dodds, Zac and Lasenby, Robert and Drain, Dawn and Chen, Carol and Grosse, Roger and McCandlish, Sam and Kaplan, Jared and Amodei, Dario and Wattenberg, Martin and Olah, Christopher},
  year={2022},
  journal={Transformer Circuits Thread},
  note={https://transformer-circuits.pub/2022/toy_model/index.html}
}
This paper relies on the assumption that representations are linear. But what if models don't use linear feature directions to represent information? What might that concretely look like?
Neural networks have nonlinearities that make it theoretically possible to compress information even more compactly than a linear superposition. There are reasons we think models are unlikely to pervasively use nonlinear compression schemes:
Regardless of whether large models end up using nonlinear compression, it should be possible to view directions being used with nonlinear compression as linear feature directions and reverse engineer the computation being used for compression like any other circuit. If this kind of encoding is pervasive throughout the network then it may merit some kind of automated decoding. It shouldn’t pose a fundamental challenge to interpretability unless the model learns a scheme for doing complex computation while staying in a complicated nonlinear representation, which we suspect is unlikely.
To help provide intuition, the simplest example of what a nonlinear compression scheme might look like is compressing two [0,1) dimensions
This works by quantizing the
We can compare the mean squared error loss on random uniform dense values of
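The surrounding sentences are truncated in this extraction, so what follows is only a minimal sketch of one scheme consistent with the description: quantize the first coordinate into k cells, for some integer k, and pack the second coordinate into the residual precision of a single [0,1) value. The function names and the exact packing below are our own illustration, not necessarily the construction the authors had in mind.

import numpy as np

def compress(x, y, k):
    # Quantize x into k cells and store y in the leftover precision of one scalar.
    return (np.floor(x * k) + y) / k

def decompress(z, k):
    zk = z * k
    cell = np.floor(zk)
    x_hat = (cell + 0.5) / k   # midpoint of x's quantization cell
    y_hat = zk - cell          # y round-trips exactly, up to floating-point error
    return x_hat, y_hat

# Mean squared error on dense, uniformly random inputs:
rng = np.random.default_rng(0)
x, y = rng.uniform(size=(2, 100_000))
x_hat, y_hat = decompress(compress(x, y, k=16), k=16)
mse = 0.5 * (np.mean((x - x_hat) ** 2) + np.mean((y - y_hat) ** 2))
# mse is roughly 1/(24 * k**2), about 1.6e-4 for k = 16: y is recovered almost
# exactly, while x pays quantization error with variance 1/(12 * k**2).

On dense inputs this beats any linear map from two dimensions to one, which must discard an entire direction and so incurs a per-feature mean squared error on the order of 1/24 — illustrating why nonlinear schemes are more compact in principle even though, as argued above, models seem unlikely to use them pervasively.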
Here, we formalize the relationship between a compressed sensing lower bound and the toy model.
Let
We derive the following theorem:
Theorem 1. Suppose that the toy model recovers all
We prove this result by framing our toy model as a compressed sensing algorithm. The primary barrier to doing so is that our optimization only searches for vectors that are close in
Lemma 1. Suppose that we have a toy model
Proof. We construct
Lastly, we use the deterministic compressed sensing lower bound of Do Ba, Indyk, Price, and Woodruff
Theorem 2 (Corollary 3.1 in
for an approximation factor
Theorem 1 follows directly from Lemma 1 and Theorem 2.
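Since the precise statements above are truncated in this extraction, we note for orientation the standard form of the Do Ba–Indyk–Price–Woodruff lower bound (a paraphrase of their published result, not the exact corollary quoted here): if a matrix $A \in \mathbb{R}^{m \times n}$ admits a recovery algorithm $\mathcal{A}$ with $\|x - \mathcal{A}(Ax)\|_1 \le C \cdot \min_{k\text{-sparse } x'} \|x - x'\|_1$ for every $x$ and a constant approximation factor $C$, then

$$m = \Omega\!\left(k \log \frac{n}{k}\right).$$

Read through Lemma 1, this is presumably the sense in which a toy model that recovers all sufficiently sparse feature vectors cannot compress them into too few hidden dimensions.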
Replication & Forthcoming Paper
Kshitij Sachan is a research intern at Redwood Research.
Redwood Research has been working on toy models of polysemanticity, inspired by Anthropic's work. We plan to publish our results separately; during our research we replicated many of the experiments in this paper. Specifically, we replicated all plots in the Demonstrating Superposition and Superposition as a Phase Change sections (visualizations of the ReLU models with different sparsities, and the phase diagrams), as well as the plot in The Geometry of Superposition – Uniform Superposition. We found that the phase diagrams look quite different depending on the activation function, suggesting that in this toy model some activation functions induce more polysemanticity than others.
Original Authors' Response: Redwood's further analysis of the superposition phase change significantly advanced our own understanding of the issue – we're very excited for their analysis to be shared with the world. We also appreciate the independent replication of our basic results.
Update: The research by Redwood mentioned in the previous comment, Polysemanticity and Capacity in Neural Networks (Alignment Forum, arXiv), is out! They study a slightly different toy model and get some really interesting results. Highlights include analytical traction on understanding a variant of the toy model, understanding superposition in terms of constrained optimization, and analysis of the role different activation functions play.