
Measuring the Intrinsic Dimension
of Objective Landscapes

Chunyuan Li
Duke University
cl319@duke.edu

Heerad Farkhoor, Rosanne Liu, and Jason Yosinski
Uber AI Labs
{heerad,rosanne,yosinski}@uber.com
Work performed as an intern at Uber AI Labs.
Abstract

Many recently trained neural networks employ large numbers of parameters to achieve good performance. One may intuitively use the number of parameters required as a rough gauge of the difficulty of a problem. But how accurate are such notions? How many parameters are really needed? In this paper we attempt to answer this question by training networks not in their native parameter space, but instead in a smaller, randomly oriented subspace. We slowly increase the dimension of this subspace, note at which dimension solutions first appear, and define this to be the intrinsic dimension of the objective landscape. The approach is simple to implement, computationally tractable, and produces several suggestive conclusions. Many problems have smaller intrinsic dimensions than one might suspect, and the intrinsic dimension for a given dataset varies little across a family of models with vastly different sizes. This latter result has the profound implication that once a parameter space is large enough to solve a problem, extra parameters serve directly to increase the dimensionality of the solution manifold. Intrinsic dimension allows some quantitative comparison of problem difficulty across supervised, reinforcement, and other types of learning where we conclude, for example, that solving the inverted pendulum problem is 100 times easier than classifying digits from MNIST, and playing Atari Pong from pixels is about as hard as classifying CIFAR-10. In addition to providing new cartography of the objective landscapes wandered by parameterized models, the method is a simple technique for constructively obtaining an upper bound on the minimum description length of a solution. A byproduct of this construction is a simple approach for compressing networks, in some cases by more than 100 times.

1 Introduction

Training a neural network to model a given dataset entails several steps. First, the network designer chooses a loss function and a network architecture for a given dataset. The architecture is then initialized by populating its weights with random values drawn from some distribution. Finally, the network is trained by adjusting its weights to produce a loss as low as possible. We can think of the training procedure as traversing some path along an objective landscape. Note that as soon as a dataset and network architecture are specified, the landscape in its entirety is completely determined. It is instantiated and frozen; all subsequent parameter initialization, forward and backward propagation, and gradient steps taken by an optimizer are just details of how the frozen space is explored.

Consider a network parameterized by $D$ weights. We can picture its associated objective landscape as a set of “hills and valleys” in $D$ dimensions, where each point in $\mathbb{R}^{D}$ corresponds to a value of the loss, i.e., the elevation of the landscape. If $D=2$, the map from two coordinates to one scalar loss can be easily imagined and intuitively understood by those living in a three-dimensional world with similar hills. However, in higher dimensions, our intuitions may not be so faithful, and generally we must be careful, as extrapolating low-dimensional intuitions to higher dimensions can lead to unreliable conclusions. The difficulty of understanding high-dimensional landscapes notwithstanding, it is the lot of neural network researchers to spend their efforts leading (or following?) networks over these multi-dimensional surfaces. Therefore, any interpreted geography of these landscapes is valuable.

Several papers have shed valuable light on this landscape, particularly by pointing out flaws in common extrapolation from low-dimensional reasoning. Dauphin et al. (2014) showed that, in contrast to conventional thinking about getting stuck in local optima (as one might be stuck in a valley in our familiar $D=2$), local critical points in high dimension are almost never valleys but are instead saddle points: structures which are “valleys” along a multitude of dimensions with “exits” in a multitude of other dimensions. The striking conclusion is that one has less to fear becoming hemmed in on all sides by higher loss but more to fear being waylaid nearly indefinitely by nearly flat regions. Goodfellow et al. (2015) showed another property: that paths directly from the initial point to the final point of optimization are often monotonically decreasing. Though dimension is high, the space is in some sense simpler than we thought: rather than winding around hills and through long twisting corridors, the walk could just as well have taken a straight line without encountering any obstacles, if only the direction of the line could have been determined at the outset.

In this paper we seek further understanding of the structure of the objective landscape by restricting training to random slices through it, allowing optimization to proceed in randomly generated subspaces of the full parameter space. Whereas standard neural network training involves computing a gradient and taking a step in the full parameter space ($\mathbb{R}^{D}$ above), we instead choose a random $d$-dimensional subspace of $\mathbb{R}^{D}$, where generally $d<D$, and optimize directly in this subspace. By performing experiments with gradually larger values of $d$, we can find the subspace dimension at which solutions first appear, which we call the measured intrinsic dimension of a particular problem. Examining intrinsic dimensions across a variety of problems leads to a few new intuitions about the optimization problems that arise from neural network models.

We begin in Sec. 2 by defining more precisely the notion of intrinsic dimension as a measure of the difficulty of objective landscapes. In Sec. 3 we measure intrinsic dimension over a variety of network types and datasets, including MNIST, CIFAR-10, ImageNet, and several RL tasks. Based on these measurements, we draw a few insights on network behavior, and we conclude in Sec. 4.

2 Defining and Estimating Intrinsic Dimension

We introduce the intrinsic dimension of an objective landscape with an illustrative toy problem. Let $\theta^{(D)} \in \mathbb{R}^{D}$ be a parameter vector in a parameter space of dimension $D$, let $\theta^{(D)}_{0}$ be a randomly chosen initial parameter vector, and let $\theta^{(D)}_{*}$ be the final parameter vector arrived at via optimization.

Consider a toy optimization problem where $D=1000$ and where $\theta^{(D)}$ is optimized to minimize a squared error cost function that requires the first 100 elements to sum to 1, the second 100 elements to sum to 2, and so on until the vector has been divided into 10 groups with their requisite 10 sums. We may start from a $\theta^{(D)}_{0}$ that is drawn from a Gaussian distribution and optimize in $\mathbb{R}^{D}$ to find a $\theta^{(D)}_{*}$ that solves the problem with cost arbitrarily close to zero.
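For concreteness, a minimal NumPy sketch of this toy objective and its direct optimization in $\mathbb{R}^{D}$ is given below; the learning rate and number of steps are illustrative choices rather than values from the paper.

    import numpy as np

    D = 1000
    targets = np.arange(1, 11, dtype=np.float64)        # group i must sum to i + 1

    def loss(theta):
        # Squared error between each group-of-100 sum and its required sum.
        group_sums = theta.reshape(10, 100).sum(axis=1)
        return np.sum((group_sums - targets) ** 2)

    def grad(theta):
        # Each element's gradient is 2 * (its group's sum minus that group's target).
        group_sums = theta.reshape(10, 100).sum(axis=1)
        return np.repeat(2.0 * (group_sums - targets), 100)

    theta = np.random.randn(D)                           # theta_0 drawn from a Gaussian
    for _ in range(2000):                                # plain gradient descent in R^D
        theta -= 1e-3 * grad(theta)
    print(loss(theta))                                   # cost approaches zero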

Solutions to this problem are highly redundant. With a little algebra, one can find that the manifold of solutions is a 990 dimensional hyperplane: from any point that has zero cost, there are 990 orthogonal directions one can move and remain at zero cost. Denoting as $s$ the dimensionality of the solution set, we define the intrinsic dimensionality $d_{\mathrm{int}}$ of a solution as the codimension of the solution set inside of $\mathbb{R}^{D}$:

$D = d_{\mathrm{int}} + s$    (1)

Here the intrinsic dimension $d_{\mathrm{int}}$ is 10 (1000 = 10 + 990), with 10 corresponding intuitively to the number of constraints placed on the parameter vector. Though the space is large ($D=1000$), the number of things one needs to get right is small ($d_{\mathrm{int}}=10$).
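The codimension in Eq. 1 can be checked directly for this example: the ten sum requirements are linear constraints, so the solution set is the affine nullspace of a 10 × 1000 constraint matrix. A small NumPy sketch of that counting argument:

    import numpy as np

    D = 1000
    # Row i of A selects the i-th group of 100 coordinates, so the solution set
    # is {theta : A @ theta = b} with b = (1, 2, ..., 10).
    A = np.kron(np.eye(10), np.ones((1, 100)))           # shape (10, 1000)

    s = D - np.linalg.matrix_rank(A)                     # dimension of the solution hyperplane
    print(s)                                             # 990
    print(D - s)                                         # d_int = 10, the codimension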

2.1 Measuring Intrinsic Dimension via Random Subspace Training

The above example had a simple enough form that we obtained $d_{\mathrm{int}}=10$ by calculation. But in general we desire a method to measure or approximate $d_{\mathrm{int}}$ for more complicated problems, including problems with data-dependent objective functions, e.g. neural network training. Random subspace optimization provides such a method.

Standard optimization, which we will refer to hereafter as the direct method of training, entails evaluating the gradient of a loss with respect to $\theta^{(D)}$ and taking steps directly in the space of $\theta^{(D)}$. To train in a random subspace, we instead define $\theta^{(D)}$ in the following way:

$\theta^{(D)} = \theta^{(D)}_{0} + P \theta^{(d)}$    (2)

where $P$ is a randomly generated $D \times d$ projection matrix[1] and $\theta^{(d)}$ is a parameter vector in a generally smaller space $\mathbb{R}^{d}$. $\theta^{(D)}_{0}$ and $P$ are randomly generated and frozen (not trained), so the system has only $d$ degrees of freedom. We initialize $\theta^{(d)}$ to a vector of all zeros, so initially $\theta^{(D)}=\theta^{(D)}_{0}$. This convention serves an important purpose for neural network training: it allows the network to benefit from beginning in a region of parameter space designed by any number of good initialization schemes (Glorot & Bengio, 2010; He et al., 2015) to be well-conditioned, such that gradient descent via commonly used optimizers will tend to work well.[2]

[1] This projection matrix can take a variety of forms, each with different computational considerations. In later sections we consider dense, sparse, and implicit $P$ matrices.

[2] A second, more subtle reason to start away from the origin and with a randomly oriented subspace is that this puts the subspace used for training and the solution set in general position with respect to each other. Intuitively, this avoids pathological cases where both the solution set and the random subspace contain structure oriented around the origin or along axes, which could bias toward non-intersection or toward intersection, depending on the problem.

Training proceeds by computing gradients with respect to $\theta^{(d)}$ and taking steps in that space. Columns of $P$ are normalized to unit length, so steps of unit length in $\theta^{(d)}$ chart out unit length motions of $\theta^{(D)}$. Columns of $P$ may also be orthogonalized if desired, but in our experiments we relied simply on the approximate orthogonality of high dimensional random vectors. By this construction $P$ forms an approximately orthonormal basis for a randomly oriented $d$ dimensional subspace of $\mathbb{R}^{D}$, with the origin of the new coordinate system at $\theta^{(D)}_{0}$. Fig. 1 (left and middle) shows an illustration of the related vectors.
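A minimal PyTorch sketch of training in a random subspace via Eq. 2 is shown below. The 784–200–200–10 architecture matches the FC network of Sec. 3.1 and $d=750$ is the value measured for it there, but the dense projection, the simple Gaussian draw for $\theta^{(D)}_{0}$ (the paper relies on standard initialization schemes such as Glorot or He), and the optimizer settings are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn.functional as F

    # (weight, bias) shapes of a 784-200-200-10 fully connected network.
    shapes = [(200, 784), (200,), (200, 200), (200,), (10, 200), (10,)]
    sizes = [int(torch.Size(s).numel()) for s in shapes]
    D, d = sum(sizes), 750

    theta0 = 0.05 * torch.randn(D)                   # frozen random initialization
    P = torch.randn(D, d)
    P /= P.norm(dim=0, keepdim=True)                 # unit-length, approximately orthogonal columns
    theta_d = torch.zeros(d, requires_grad=True)     # the only trainable degrees of freedom

    def forward(x):
        # Reconstruct theta^(D) = theta0 + P theta^(d), then slice it into layers.
        theta_D = theta0 + P @ theta_d
        params, offset = [], 0
        for shape, n in zip(shapes, sizes):
            params.append(theta_D[offset:offset + n].view(shape))
            offset += n
        w1, b1, w2, b2, w3, b3 = params
        h = F.relu(F.linear(x, w1, b1))
        h = F.relu(F.linear(h, w2, b2))
        return F.linear(h, w3, b3)

    opt = torch.optim.Adam([theta_d], lr=1e-3)
    # Per minibatch (x, y) from an MNIST loader:
    #   loss = F.cross_entropy(forward(x), y)
    #   opt.zero_grad(); loss.backward(); opt.step()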

Figure 1: (left) Illustration of parameter vectors for direct optimization in the $D=3$ case. (middle) Illustration of parameter vectors and a possible random subspace for the $D=3, d=2$ case. (right) Plot of performance vs. subspace dimension for the toy example of Sec. 2. The problem becomes both 90% solvable and 100% solvable at random subspace dimension 10, so $d_{\mathrm{int90}}$ and $d_{\mathrm{int100}}$ are 10.

Consider a few properties of this training approach. If $d=D$ and $P$ is a large identity matrix, we recover exactly the direct optimization problem. If $d=D$ but $P$ is instead a random orthonormal basis for all of $\mathbb{R}^{D}$ (just a random rotation matrix), we recover a rotated version of the direct problem. Note that for some “rotation-invariant” optimizers, such as SGD and SGD with momentum, rotating the basis will not change the steps taken nor the solution found, but for optimizers with axis-aligned assumptions, such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014), the path taken through $\theta^{(D)}$ space by an optimizer will depend on the rotation chosen. Finally, in the general case where $d<D$ and solutions exist in $D$, solutions will almost surely (with probability 1) not be found if $d$ is less than the codimension of the solution. On the other hand, when $d \geq D-s$, if the solution set is a hyperplane, the solution will almost surely intersect the subspace, but for solution sets of arbitrary topology, intersection is not guaranteed. Nonetheless, by iteratively increasing $d$, re-running optimization, and checking for solutions, we obtain one estimate of $d_{\mathrm{int}}$. We try this sweep of $d$ for our toy problem laid out in the beginning of this section, measuring (by convention as described in the next section) the positive performance (higher is better) instead of loss.[3] As expected, the solutions are first found at $d=10$ (see Fig. 1, right), confirming our intuition that for this problem, $d_{\mathrm{int}}=10$.

[3] For this toy problem we define $\mathrm{performance}=\exp(-\mathrm{loss})$, bounding performance between 0 and 1, with 1 being a perfect solution.

2.2 Details and conventions

In the rest of this paper, we measure intrinsic dimensions for particular neural network problems and draw conclusions about the associated objective landscapes and solution sets. Because modeling real data is more complex than the above toy example, and losses are generally never exactly zero, we first choose a heuristic for classifying points on the objective landscape as solutions vs. non-solutions. The heuristic we choose is to threshold network performance at some level relative to a baseline model, where generally we take as baseline the best directly trained model. In supervised classification settings, validation accuracy is used as the measure of performance, and in reinforcement learning scenarios, the total reward (shifted up or down such that the minimum reward is 0) is used. Accuracy and reward are preferred to loss to ensure results are grounded to real-world performance and to allow comparison across models with differing scales of loss and different amounts of regularization included in the loss.

We define $d_{\mathrm{int100}}$ as the intrinsic dimension of the “100%” solution: solutions whose performance is statistically indistinguishable from baseline solutions. However, when attempting to measure $d_{\mathrm{int100}}$, we observed it to vary widely, for a few confounding reasons: $d_{\mathrm{int100}}$ can be very high — nearly as high as $D$ — when the task requires matching a very well-tuned baseline model, but can drop significantly when the regularization effect of restricting parameters to a subspace boosts performance by tiny amounts. While these are interesting effects, we primarily set out to measure the basic difficulty of problems and the degrees of freedom needed to solve (or approximately solve) them rather than these subtler effects.

Thus, we found it more practical and useful to define and measure $d_{\mathrm{int90}}$ as the intrinsic dimension of the “90%” solution: solutions with performance at least 90% of the baseline.

We chose 90% after looking at a number of dimension vs. performance plots (e.g. Fig. 2) as a reasonable trade-off between guaranteeing that solutions are as good as possible and keeping measured $d_{\mathrm{int}}$ values robust to small noise in measured performance. If too high a threshold is used, then the dimension at which performance crosses the threshold changes a lot for only tiny changes in accuracy, and we always observe tiny changes in accuracy due to training noise.

If a somewhat different (higher or lower) threshold were chosen, we expect most of the conclusions in the rest of the paper to remain qualitatively unchanged. In the future, researchers may find it useful to measure $d_{\mathrm{int}}$ using higher or lower thresholds.
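In code, the measurement procedure of this section reduces to a sweep over $d$: train in a random subspace at each candidate dimension and report the smallest $d$ whose performance reaches 90% of the directly trained baseline. A sketch, assuming a user-supplied train_in_subspace routine (a hypothetical name) that trains one model at dimension d and returns its performance:

    def measure_dint90(train_in_subspace, baseline_performance, d_values):
        # Performance means validation accuracy for supervised tasks, or total
        # reward shifted so that the minimum reward is 0 for RL tasks.
        threshold = 0.9 * baseline_performance
        for d in sorted(d_values):                   # increasing sweep, oversampled near the crossing
            if train_in_subspace(d) >= threshold:
                return d
        return None                                  # d_int90 exceeds the largest d tried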

3 Results and Discussion

3.1 MNIST

We begin by analyzing a fully connected (FC) classifier trained on MNIST. We choose a network with layer sizes 784–200–200–10, i.e. a network with two hidden layers of width 200; this results in a total number of parameters $D=199{,}210$. A series of experiments with gradually increasing subspace dimension $d$ produce monotonically increasing performances, as shown in Fig. 2 (left). By checking the subspace dimension at which performance crosses the 90% mark, we measure this network's intrinsic dimension $d_{\mathrm{int90}}$ at about 750.

Figure 2: Performance (validation accuracy) vs. subspace dimension $d$ for two networks trained on MNIST: (left) a 784–200–200–10 fully-connected (FC) network ($D=$ 199,210) and (right) a convolutional network, LeNet ($D=$ 44,426). The solid line shows performance of a well-trained direct (FC or conv) model, and the dashed line shows the 90% threshold we use to define $d_{\mathrm{int90}}$. The standard deviation of validation accuracy and of measured $d_{\mathrm{int90}}$ are visualized as the blue vertical and red horizontal error bars. We oversample the region around the threshold to estimate the dimension of crossing more exactly. We use one-run measurements for $d_{\mathrm{int90}}$ of 750 and 290, respectively.
Some networks are very compressible.

A salient initial conclusion is that 750 is quite low. At that subspace dimension, only 750 degrees of freedom (0.4%) are being used and 198,460 (99.6%) unused to obtain 90% of the performance of the direct baseline model. A compelling corollary of this result is a simple, new way of creating and training compressed networks, particularly networks for applications in which the absolute best performance is not critical. To store this network, one need only store a tuple of three items: (i) the random seed to generate the frozen $\theta^{(D)}_{0}$, (ii) the random seed to generate $P$, and (iii) the 750 floating point numbers in $\theta^{(d)}_{*}$. It leads to compression (assuming 32-bit floats) by a factor of 260× from 793kB to only 3.2kB, or 0.4% of the full parameter size. Such compression could be very useful for scenarios where storage or bandwidth are limited, e.g. including neural networks in downloaded mobile apps or on web pages.
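A sketch of this storage scheme is given below, assuming the training code records the two seeds it used; the function names and the rebuild helper are hypothetical, not part of a released implementation.

    import numpy as np

    def save_compressed(path, init_seed, proj_seed, theta_d):
        # The whole network is recoverable from two integers and d float32 values.
        np.savez(path, init_seed=init_seed, proj_seed=proj_seed,
                 theta_d=np.asarray(theta_d, dtype=np.float32))

    def load_compressed(path, rebuild):
        # rebuild(init_seed, proj_seed, theta_d) is assumed to regenerate theta_0 and P
        # from the seeds and return theta^(D) = theta_0 + P theta_d.
        data = np.load(path)
        return rebuild(int(data["init_seed"]), int(data["proj_seed"]), data["theta_d"])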

This compression approach differs from other neural network compression methods in the following aspects. (i) While it has previously been appreciated that large networks waste parameters (Dauphin & Bengio, 2013) and weights contain redundancy (Denil et al., 2013) that can be exploited for post-hoc compression (Wen et al., 2016), this paper's method constitutes a much simpler approach to compression, where training happens once, end-to-end, and where any parameterized model is an allowable base model. (ii) Unlike layerwise compression models (Denil et al., 2013; Wen et al., 2016), we operate in the entire parameter space, which could work better or worse, depending on the network. (iii) Compared to methods like that of Louizos et al. (2017), who take a Bayesian perspective and consider redundancy on the level of groups of parameters (input weights to a single neuron) by using group-sparsity-inducing hierarchical priors on the weights, our approach is simpler but not likely to lead to compression as high as the levels they attain. (iv) Our approach only reduces the number of degrees of freedom, not the number of bits required to store each degree of freedom, e.g. as could be accomplished by quantizing weights (Han et al., 2016). Both approaches could be combined. (v) There is a beautiful array of papers on compressing networks such that they also achieve computational savings during the forward pass (Wen et al., 2016; Han et al., 2016; Yang et al., 2015); subspace training does not speed up execution time during inference. (vi) Finally, note the relationships between weight pruning, weight tying, and subspace training: weight pruning is equivalent to finding, post-hoc, a subspace that is orthogonal to certain axes of the full parameter space and that intersects those axes at the origin. Weight tying, e.g. by random hashing of weights into buckets (Chen et al., 2015), is equivalent to subspace training where the subspace is restricted to lie along the equidistant “diagonals” between any axes that are tied together.

Robustness of intrinsic dimension.

Next, we investigate how intrinsic dimension varies across FC networks with a varying number of layers and varying layer width.[4]

[4] Note that here we used a global baseline of 100% accuracy to compare simply and fairly across all models. See Sec. S5 for similar results obtained using instead 20 separate baselines for each of the 20 models.
We perform a grid sweep of networks with number of hidden layers $L$ chosen from {1, 2, 3, 4, 5} and width $W$ chosen from {50, 100, 200, 400}. Fig. S6 in the Supplementary Information shows performance vs. subspace dimension plots in the style of Fig. 2 for all 20 networks, and Fig. 3 shows each network's $d_{\mathrm{int90}}$ plotted against its native dimension $D$. As one can see, $D$ changes by a factor of 24.1 between the smallest and largest networks, but $d_{\mathrm{int90}}$ changes over this range by a factor of only 1.33, with much of this possibly due to noise.

Thus it turns out that the intrinsic dimension changes little even as models grow in width or depth! The striking conclusion is that every extra parameter added to the network — every extra dimension added to $D$ — just ends up adding one dimension to the redundancy of the solution, $s$.

Often the most accurate directly trained models for a problem have far more parameters than needed (Zhang et al., 2017); this may be because they are just easier to train, and our observation suggests a reason why: with larger models, solutions have greater redundancy and in a sense “cover” more of the space.[5]

[5] To be precise, we may not conclude “greater coverage” in terms of the volume of the solution set — volumes are not comparable across spaces of different dimension, and our measurements have only estimated the dimension of the solution set, not its volume. A conclusion we may make is that as extra parameters are added, the ratio of solution dimension to total dimension, $s/D$, increases, approaching 1. Further research could address other notions of coverage.
To our knowledge, this is the first time this phenomenon has been directly measured. We should also be careful not to claim that all FC nets on MNIST will have an intrinsic dimension of around 750; instead, we should just consider that we have found for this architecture/dataset combination a wide plateau of hyperparameter space over which intrinsic dimension is approximately constant.

Figure 3: Measured intrinsic dimension $d_{\mathrm{int90}}$ vs. number of parameters $D$ for 20 FC models of varying width (from 50 to 400) and depth (number of hidden layers from 1 to 5) trained on MNIST. The red interval is the standard deviation of the measurement of $d_{\mathrm{int90}}$. Though the number of native parameters $D$ varies by a factor of 24.1, $d_{\mathrm{int90}}$ varies by only 1.33, with much of that factor possibly due to noise, showing that $d_{\mathrm{int90}}$ is a fairly robust measure across a model family and that each extra parameter ends up adding an extra dimension directly to the redundancy of the solution. Standard deviation was estimated via bootstrap; see Sec. S5.1.
Figure 4: Performance vs. number of trainable parameters for (left) FC networks and (right) convolutional networks trained on MNIST. Randomly generated direct networks are shown (gray circles) alongside all random subspace training results (blue circles) from the sweep shown in Fig. S6. FC networks show a persistent gap in dimension, suggesting general parameter inefficiency of FC models. The parameter efficiency of convolutional networks varies, as the gray points can be significantly to the right of or close to the blue manifold.
Are random subspaces really more parameter-efficient for FC nets?

One might wonder to what extent claiming 750 parameters is meaningful given that performance achieved (90%) is far worse than a state of the art network trained on MNIST. With such a low bar for performance, could a directly trained network with a comparable number of trainable parameters be found that achieves the same performance? We generated 1000 small networks (depth randomly chosen from {1, 2, 3, 4, 5}, layer width randomly from {2, 3, 5, 8, 10, 15, 20, 25}, seed set randomly) in an attempt to find high-performing, small FC networks, but as Fig. 4 (left) shows, a gap still exists between the subspace dimension and the smallest direct FC network giving the same performance at most levels of performance.

Measuring $d_{\mathrm{int90}}$ on a convolutional network.

Next we measure $d_{\mathrm{int90}}$ of a convolutional network, LeNet ($D=$ 44,426). Fig. 2 (right) shows validation accuracy vs. subspace dimension $d$, and we find $d_{\mathrm{int90}}=290$, or a compression rate of about 150× for this network. As with the FC case above, we also do a sweep of random networks, but notice that for convnets the performance gap between direct and subspace training narrows at fixed budgets, i.e., for a fixed number of trainable parameters. Further, the performance of direct training varies significantly, depending on the extrinsic design of convnet architectures. We interpret these results in terms of the Minimum Description Length below.

Relationship between Intrinsic Dimension and Minimum Description Length (MDL).

As discussed earlier, the random subspace training method leads naturally to a compressed representation of a network, where only $d$ floating point numbers need to be stored. We can consider this $d$ as an upper bound on the MDL of the problem solution.[6]

[6] We consider MDL in terms of number of degrees of freedom instead of bits. For degrees of freedom stored with constant fidelity (e.g. float32), these quantities are related by a constant factor (e.g. 32).
We cannot yet conclude the extent to which this bound is loose or tight, and tightness may vary by problem. However, to the extent that it is tighter than previous bounds (e.g., just the number of parameters $D$) and to the extent that it is correlated with the actual MDL, we can use this interpretation to judge which solutions are more well-suited to the problem in a principled way. As developed by Rissanen (1978) and further by Hinton & Van Camp (1993), holding accuracy constant, the best model is the one with the shortest MDL.

Thus, there is some rigor behind our intuitive assumption that LeNet is a better model than an FC network for MNIST image classification, because its intrinsic dimension is lower ($d_{\mathrm{int90}}$ of 290 vs. 750). In this particular case we are led to a predictable conclusion, but as models become larger, more complex, and more heterogeneous, conclusions of this type will often not be obvious. Having a simple method of approximating MDL may prove extremely useful for guiding model exploration, for example, for the countless datasets less well-studied than MNIST and for models consisting of separate sub-models that may be individually designed and evaluated (Ren et al., 2015; Kaiser et al., 2017). In this latter case, considering the MDL for a sub-model could provide a more detailed view of that sub-model's properties than would be available by just analyzing the system's overall validation performance.

Finally, note that although our approach is related to a rich body of work on estimating the “intrinsic dimension of a dataset” (Camastra & Vinciarelli, 2002; Kégl, 2003; Fukunaga & Olsen, 1971; Levina & Bickel, 2005; Tenenbaum et al., 2000), it differs in a few respects. Here we do not measure the number of degrees of freedom necessary to represent a dataset (which requires representation of a global $p(X)$ and per-example properties and thus grows with the size of the dataset), but those required to represent a model for part of the dataset (here $p(y|X)$, which intuitively might saturate at some complexity even as a dataset grows very large). That said, in the following section we do show measurements for a corner case where the model must memorize per-example properties.

Are convnets always better on MNIST? Measuring $d_{\mathrm{int90}}$ on shuffled data.

Zhang et al. (2017) provocatively showed that large networks normally thought to generalize well can nearly as easily be trained to memorize entire training sets with randomly assigned labels or with input pixels provided in random order. Consider two identically sized networks: one trained on a real, non-shuffled dataset and another trained with shuffled pixels or labels. As noted by Zhang et al. (2017), externally the networks are very similar, and the training loss may even be identical at the final epoch. However, the intrinsic dimension of each may be measured to expose the differences in problem difficulty. When training on a dataset with shuffled pixels — pixels for each example in the dataset subject to a random permutation, chosen once for the entire dataset — the intrinsic dimension of an FC network remains the same at 750, because FC networks are invariant to input permutation. But the intrinsic dimension of a convnet increases from 290 to 1400, even higher than an FC network. Thus while convnets are better suited to classifying digits given images with local structure, when this structure is removed, violating convolutional assumptions, our measure can clearly reveal that many more degrees of freedom are now required to model the underlying distribution. When training on MNIST with shuffled labels — the label for each example is randomly chosen — we redefine our measure of $d_{\mathrm{int90}}$ relative to training accuracy (validation accuracy is always at chance). We find that memorizing random labels on the 50,000 example MNIST training set requires a very high dimension, $d_{\mathrm{int90}}=190{,}000$, or 3.8 floats per memorized label. Sec. S5.2 gives a few further results, in particular that the more labels are memorized, the more efficient memorization is in terms of floats per label. Thus, while the network obviously does not generalize to an unseen validation set, it would seem “generalization” within a training set may be occurring as the network builds a shared infrastructure that makes it possible to more efficiently memorize labels.
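Both controls are straightforward to reproduce: a single pixel permutation is drawn once and applied to every image, and for the shuffled-label case each training label is redrawn at random. A NumPy sketch, assuming images flattened to shape (N, 784) and integer labels of shape (N,):

    import numpy as np

    rng = np.random.RandomState(0)

    def shuffle_pixels(images):
        # One permutation, chosen once for the entire dataset, applied to every example.
        perm = rng.permutation(images.shape[1])
        return images[:, perm]

    def shuffle_labels(labels, num_classes=10):
        # Each example receives an independently drawn random label.
        return rng.randint(num_classes, size=labels.shape[0])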

3.2 CIFAR-10 and ImageNet

We scale to larger supervised classification problems by considering CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2015). When scaling beyond MNIST-sized networks with $D$ on the order of 200k and $d$ on the order of 1k, we find it necessary to use more efficient methods of generating and projecting from random subspaces. This is particularly true in the case of ImageNet, where the direct network can easily require millions of parameters. In Sec. S7, we describe and characterize scaling properties of three methods of projection: dense matrix projection, sparse matrix projection (Li et al., 2006), and the remarkable Fastfood transform (Le et al., 2013). We generally use the sparse projection method to train networks on CIFAR-10 and the Fastfood transform for ImageNet.
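For networks of this size a dense $D \times d$ matrix becomes expensive to store and multiply, which is what motivates the sparse and implicit projections. A sketch of a sparse $P$ with unit-length columns built with SciPy follows; the density value is an illustrative choice, not the construction used in the paper.

    import numpy as np
    import scipy.sparse as sp

    def sparse_projection(D, d, density=0.01, seed=0):
        # Only the nonzeros of P are stored, so memory and matrix-vector cost
        # scale with density * D * d instead of D * d.
        rng = np.random.RandomState(seed)
        P = sp.random(D, d, density=density, random_state=rng,
                      data_rvs=rng.standard_normal, format="csc")
        col_norms = np.sqrt(np.asarray(P.multiply(P).sum(axis=0)).ravel())
        return P @ sp.diags(1.0 / (col_norms + 1e-12))   # normalize each column to unit length

    # theta_D = theta0 + sparse_P @ theta_d, with theta_d a dense vector of length d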

Measured $d_{\mathrm{int90}}$ values for CIFAR-10 and ImageNet are given in Table 1, next to all previous MNIST results and RL results to come. For CIFAR-10 we find qualitatively similar results to MNIST, but with generally higher dimension (9k vs. 750 for FC and 2.9k vs. 290 for LeNet). It is also interesting to observe the difference of $d_{\mathrm{int90}}$ across network architectures. For example, to achieve a global $>50\%$ validation accuracy on CIFAR-10, FC, LeNet and ResNet approximately require $d_{\mathrm{int90}}=$ 9k, 2.9k, and 1k, respectively, showing that ResNets are more efficient. Full results and experiment details are given in Sec. S8 and Sec. S9. Due to limited time and memory issues, training on ImageNet has not yet given a reliable estimate for $d_{\mathrm{int90}}$ except that it is over 500k.

Table 1: Measured $d_{\mathrm{int90}}$ on various supervised and reinforcement learning problems.

Dataset                Network Type   Parameter Dim. $D$   Intrinsic Dim. $d_{\mathrm{int90}}$
MNIST                  FC             199,210              750
MNIST                  LeNet          44,426               290
MNIST (Shuf Pixels)    FC             199,210              750
MNIST (Shuf Pixels)    LeNet          44,426               1,400
MNIST (Shuf Labels)    FC             959,610              190,000
CIFAR-10               FC             656,810              9,000
CIFAR-10               LeNet          62,006               2,900
ImageNet               SqueezeNet     1,248,424            > 500k
Inverted Pendulum      FC             562                  4
Humanoid               FC             166,673              700
Atari Pong             ConvNet        1,005,974            6,000

3.3 Reinforcement Learning environments

Measuring intrinsic dimension allows us to perform some comparison across the divide between supervised learning and reinforcement learning. In this section we measure the intrinsic dimension of three control tasks of varying difficulties using both value-based and policy-based algorithms. The value-based algorithm we evaluate is the Deep Q-Network (DQN) (Mnih et al., 2013), and the policy-based algorithm is Evolutionary Strategies (ES) (Salimans et al., 2017). Training details are given in Sec. S6.2. For all tasks, performance is defined as the maximum-attained (over training iterations) mean evaluation reward (averaged over 30 evaluations for a given parameter setting). In Fig. 5, we show results of ES on three tasks: InvertedPendulum-v1 and Humanoid-v1 in MuJoCo (Todorov et al., 2012), and Pong-v0 in Atari. Dots in each plot correspond to the (noisy) median of observed performance values across many runs for each given $d$, and the vertical uncertainty bar shows the maximum and minimum observed performance values. The dotted horizontal line corresponds to the usual 90% baseline derived from the best directly-trained network (the solid horizontal line). A dot is darkened signifying the first $d$ that allows a satisfactory performance. We find that the inverted pendulum task is surprisingly easy, with $d_{\mathrm{int100}}=d_{\mathrm{int90}}=4$, meaning that only four parameters are needed to perfectly solve the problem (see Stanley & Miikkulainen (2002) for a similarly small solution found via evolution). The walking humanoid task is more difficult: solutions are found reliably by dimension 700, a similar complexity to that required to model MNIST with an FC network, and far less than modeling CIFAR-10 with a convnet. Finally, to play Pong on Atari (directly from pixels) requires a network trained in a 6k dimensional subspace, making it on the same order of modeling CIFAR-10. For an easy side-by-side comparison we list all intrinsic dimension values found for all problems in Table 1. For more complete ES results see Sec. S6.2, and Sec. S6.1 for DQN results.
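The performance measure used for these RL tasks (the maximum, over training iterations, of the mean reward across 30 evaluations, with rewards shifted so the minimum is 0) reduces to a few lines; the array layout below is an assumed convention for illustration.

    import numpy as np

    def rl_performance(eval_rewards, min_possible_reward):
        # eval_rewards: shape (num_iterations, 30), one row of evaluation rewards
        # per training iteration for a given parameter setting.
        shifted = np.asarray(eval_rewards) - min_possible_reward   # minimum reward becomes 0
        return shifted.mean(axis=1).max()                          # best mean evaluation reward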

Figure 5: Results using the policy-based ES algorithm to train agents on (left column) InvertedPendulum-v1, (middle column) Humanoid-v1, and (right column) Pong-v0. The intrinsic dimensions found are 4, 700, and 6k. This places the walking humanoid task on a similar level of difficulty as modeling MNIST with an FC network (far less than modeling CIFAR-10 with a convnet), and Pong on the same order of modeling CIFAR-10.

4 Conclusions and Future Directions

In this paper, we have defined the intrinsic dimension of objective landscapes and shown a simple method — random subspace training — of approximating it for neural network modeling problems. We use this approach to compare problem difficulty within and across domains. We find that in some cases the intrinsic dimension is much lower than the direct parameter dimension, which enables network compression, and in other cases the intrinsic dimension is similar to that of the best tuned models, suggesting those models are better suited to the problem.

Further work could also identify better ways of creating subspaces for reparameterization: here we chose random linear subspaces, but one might carefully construct other linear or non-linear subspaces to be even more likely to contain solutions. Finally, as the field departs from single stack-of-layers image classification models toward larger and more heterogeneous networks (Ren et al., 2015; Kaiser et al., 2017) often composed of many modules and trained by many losses, methods like measuring intrinsic dimension that allow some automatic assessment of model components might provide much-needed greater understanding of individual black-box module properties.

Acknowledgments

The authors gratefully acknowledge Zoubin Ghahramani, Peter Dayan, Sam Greydanus, Jeff Clune, and Ken Stanley for insightful discussions, Joel Lehman for initial idea validation, Felipe Such, Edoardo Conti and Xingwen Zhang for helping scale the ES experiments to the cluster, Vashisht Madhavan for insights on training Pong, Shrivastava Anshumali for conversations about random projections, and Ozan Sener for discussion of second order methods. We are also grateful to Paul Mikesell, Leon Rosenshein, Alex Sergeev and the entire OpusStack Team inside Uber for providing our computing platform and for technical support.

References

• Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
• Francesco Camastra and Alessandro Vinciarelli. Estimating the intrinsic dimension of data with a fractal-based method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(10):1404–1407, 2002.
• Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. CoRR, abs/1504.04788, 2015. URL http://arxiv.org/abs/1504.04788.
• Yann Dauphin and Yoshua Bengio. Big neural networks waste capacity. CoRR, abs/1301.3583, 2013. URL http://arxiv.org/abs/1301.3583.
• Yann Dauphin, Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. CoRR, abs/1406.2572, 2014. URL http://arxiv.org/abs/1406.2572.
• Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
• Keinosuke Fukunaga and David R. Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, 100(2):176–183, 1971.
• Raja Giryes, Guillermo Sapiro, and Alexander M. Bronstein. Deep neural networks with random Gaussian weights: a universal classification strategy? IEEE Transactions on Signal Processing, 2016.
• Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics, 2010.
• Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations (ICLR 2015), 2015. URL http://arxiv.org/abs/1412.6544.
• Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR, 2016.
• Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In CVPR, 2015.
• Geoffrey E. Hinton and Drew Van Camp. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5–13. ACM, 1993.
• Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
• Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. CoRR, abs/1706.05137, 2017. URL http://arxiv.org/abs/1706.05137.
• Balázs Kégl. Intrinsic dimension estimation using packing numbers. In Advances in Neural Information Processing Systems, pp. 697–704, 2003.
• Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
• James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017. doi: 10.1073/pnas.1611835114. URL http://www.pnas.org/content/114/13/3521.abstract.
• Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
• Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: Approximating kernel expansions in loglinear time. In ICML, 2013.
• Elizaveta Levina and Peter J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, pp. 777–784, 2005.
• Ping Li, T. Hastie, and K. W. Church. Very sparse random projections. In KDD, 2006.
• C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. ArXiv e-prints, May 2017.
• V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. ArXiv e-prints, December 2013.
• Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99, 2015.
• Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
• Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, 等人. ImageNet 大规模视觉识别挑战赛。国际计算机视觉杂志,2015 年。
  • Salimans et al. (2017) Salimans 等(2017) T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. ArXiv e-prints, March 2017.
    T. Salimans, J. Ho, X. Chen, 和 I. Sutskever. 进化策略作为强化学习的可扩展替代方案。ArXiv 电子预印本,2017 年 3 月。
  • Stanley & Miikkulainen (2002)
    斯坦利 & 米库拉宁(2002)
    Kenneth Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99–127, 2002.
    肯尼斯·斯坦利和里斯特·米基拉伊宁。通过增强拓扑结构进化神经网络。进化计算,10(2):99–127,2002。
  • Tenenbaum et al. (2000) 滕纳鲍姆等(2000) Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
    Joshua B Tenenbaum, Vin De Silva, John C Langford. 非线性降维的全局几何框架。科学,290(5500):2319–2323,2000。
  • Tieleman & Hinton (2012)
    蒂尔曼和辛顿(2012)
    Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
    蒂姆·蒂埃勒曼和杰弗里·辛顿。讲座 6.5-rmsprop:将梯度除以其最近大小的移动平均值。COURSERA:机器学习的神经网络,4(2):26–31,2012。
  • Todorov et al. (2012) 托多罗夫等(2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp.  5026–5033. IEEE, 2012.
    托多罗夫,E.,埃雷兹,T.,塔萨,Y.。Mujoco:基于模型的控制物理引擎。在智能机器人与系统(IROS)2012 IEEE/RSJ 国际会议上,第 5026-5033 页。IEEE,2012。
  • Wen et al. (2016) 文等(2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
    魏文,吴春鹏,王艳丹,陈一然,李海。在深度神经网络中学习结构化稀疏性。NIPS,2016。
  • Yang et al. (2015) 楊等(2015) Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp.  1476–1483, 2015.
    杨子超,莫丘斯基,米沙·德尼尔,南多·德·弗雷塔斯,亚历克斯·斯莫拉,宋磊,王紫宇. 深度油炸卷积神经网络。IEEE 国际计算机视觉会议论文集,第 1476-1483 页,2015 年。
  • Zhang et al. (2017) 张等(2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.
    张驰原,萨米·本吉奥,莫里茨·哈特,本杰明·雷克,奥里奥尔·维尼亚尔斯。理解深度学习需要重新思考泛化。ICLR,2017。

 

Supplementary Information for:

Measuring the Intrinsic Dimension of Objective Landscapes

 

S5 Additional MNIST results and insights

S5.1 Sweeping depths and widths; multiple runs to estimate variance

In the main paper, we attempted to find $d_{\mathrm{int90}}$ across 20 FC networks with various depths and widths. A grid sweep over the number of hidden layers in {1, 2, 3, 4, 5} and the width of each hidden layer in {50, 100, 200, 400} is performed, and all 20 plots are shown in Fig. S6. For each $d$ we take 3 runs and plot the mean and variance with blue dots and blue error bars. $d_{\mathrm{int90}}$ is indicated in the plots (darkened blue dots) by the dimension at which the median of the 3 runs passes the 90% performance threshold. The variance of $d_{\mathrm{int90}}$ is estimated using 50 bootstrap samples. Note that the variances of both the accuracy and the measured $d_{\mathrm{int90}}$ for a given hyperparameter setting are generally small, and the mean performance increases monotonically (very similar to the single-run result) as $d$ increases. This illustrates that the difference between lucky and unlucky random projections has little impact on the quality of solutions, while the subspace dimensionality has a great impact. We hypothesize that the variance due to different $P$ matrices will be smaller than the variance due to different random initial parameter vectors $\theta^{(D)}_{0}$, because there are $dD$ i.i.d. samples used to create $P$ (at least in the dense case) but only $D$ samples used to create $\theta^{(D)}_{0}$, and aspects of the network depending on smaller numbers of random samples will exhibit greater variance. Hence, in some other experiments we rely on single runs to estimate the intrinsic dimension, though slightly more accurate estimates could be obtained via multiple runs.
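To make the estimation procedure concrete, the following minimal NumPy sketch reads $d_{\mathrm{int90}}$ off a sweep in the way described above: it takes hypothetical (dimension, run) accuracy data, finds the first $d$ whose median accuracy over runs crosses 90% of a baseline, and bootstraps the estimate. The numbers and helper names are illustrative, not our actual experimental pipeline.

    import numpy as np

    # Hypothetical sweep: rows are subspace dimensions d, columns are repeated runs.
    d_values = np.array([100, 200, 400, 600, 800, 1000])
    acc_runs = np.array([[0.62, 0.60, 0.63],
                         [0.78, 0.80, 0.77],
                         [0.86, 0.88, 0.87],
                         [0.90, 0.91, 0.89],
                         [0.93, 0.92, 0.93],
                         [0.94, 0.95, 0.94]])
    baseline = 0.98   # direct-training accuracy for this architecture

    def dint(acc_runs, frac=0.90):
        # First d whose median accuracy over runs crosses frac * baseline.
        medians = np.median(acc_runs, axis=1)
        crossing = np.where(medians >= frac * baseline)[0]
        return d_values[crossing[0]] if len(crossing) else None

    rng = np.random.default_rng(0)
    estimate = dint(acc_runs)
    # 50 bootstrap resamples over runs to gauge the variance of the estimate.
    boot = [dint(acc_runs[:, rng.integers(0, acc_runs.shape[1], acc_runs.shape[1])])
            for _ in range(50)]
    print(estimate, np.std([b for b in boot if b is not None]))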

In a similar manner to the above, in Fig. S7 we show the relationship between $d_{\mathrm{int90}}$ and $D$ across the 20 networks, but using a per-model, directly trained baseline. Most baselines are slightly below 100% accuracy. This is in contrast to Fig. 3, which used a simpler global baseline of 100% across all models. Results are qualitatively similar but with slightly lower intrinsic dimension due to the slightly lower thresholds.

Figure S6: A sweep of FC networks on MNIST. Each column contains networks of the same depth, and each row networks with the same number of hidden nodes in each layer. The mean and variance at each $d$ are shown by blue dots and blue bars. $d_{\mathrm{int90}}$ is marked by dark blue dots, and its variance is indicated by red bars spanning the $d$ axis.
Figure S7: Measured intrinsic dimension $d_{\mathrm{int90}}$ vs. number of parameters $D$ for 20 models of varying width (from 50 to 400) and depth (number of hidden layers from 1 to 5) trained on MNIST. The vertical red interval is the standard deviation of the measured $d_{\mathrm{int90}}$. As opposed to Fig. 3, which used a global, shared baseline across all models, here a per-model baseline is used. The number of native parameters varies by a factor of 24.1, but $d_{\mathrm{int90}}$ varies by only 1.42. The per-model baseline results in higher measured $d_{\mathrm{int90}}$ for larger models because they have a higher baseline performance than the shallower models.

S5.2 Additional details on shuffled MNIST datasets

Two kinds of shuffled MNIST datasets are considered:

  • The shuffled pixel dataset: the label for each example remains the same as the normal dataset, but a random permutation of pixels is chosen once and then applied to all images in the training and test sets. FC networks solve the shuffled pixel datasets exactly as easily as the base dataset, because there is no privileged ordering of input dimension in FC networks; all orderings are equivalent.


  • The shuffled label dataset: the images remain the same as the normal dataset, but labels are randomly shuffled for the entire training set. Here, as in (Zhang et al., 2017), we only evaluate training accuracy, as test set accuracy remains forever at chance level (the training set $X$ and $y$ convey no information about the test set $p(y|X)$, because the shuffled relationship in test is independent of that of training).

On the full shuffled-label MNIST dataset (50k images), we trained an FC network ($L=5$, $W=400$, which had $d_{\mathrm{int90}}=750$ on standard MNIST); it yields $d_{\mathrm{int90}}=190$k. We can interpret this as requiring 3.8 floats to memorize each random label (at 90% accuracy). Wondering how this scales with dataset size, we estimated $d_{\mathrm{int90}}$ on shuffled-label versions of MNIST at different scales and found curious results, shown in Table S2 and Fig. S8. As the dataset being memorized becomes smaller, the number of floats required to memorize each label becomes larger. Put another way, as dataset size increases, the intrinsic dimension also increases, but not as fast as linearly. The best interpretation is not yet clear, but one possible interpretation is that networks required to memorize large training sets make use of shared machinery for memorization. In other words, though performance does not generalize to a validation set, generalization within the training set is non-negligible even though labels are random.

Table S2: $d_{\mathrm{int90}}$ required to memorize shuffled MNIST labels. As dataset size grows, memorization becomes more efficient, suggesting a form of “generalization” from one part of the training set to another, even though labels are random.
Fraction of MNIST training set | $d_{\mathrm{int90}}$ | Floats per label
100% | 190k | 3.8
50% | 130k | 5.2
10% | 90k | 18.0
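Reading the “Floats per label” column as the measured $d_{\mathrm{int90}}$ divided by the number of training labels being memorized (50k, 25k, and 5k examples for the three rows), the entries follow directly:

$190{,}000 / 50{,}000 = 3.8, \qquad 130{,}000 / 25{,}000 = 5.2, \qquad 90{,}000 / 5{,}000 = 18.0.$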
Figure S8: Training accuracy vs. subspace dimension $d$ for an FC network ($W=400$, $L=5$) trained on shuffled-label versions of MNIST containing 100%, 50%, and 10% of the dataset.

S5.3 Training stability

An interesting tangential observation is that random subspace training can in some cases make optimization more stable. First, it helps in the case of deeper networks. Fig. S9 shows training results for FC networks with up to 10 layers, using SGD with step size 0.1 and ReLUs with He initialization. Multiple networks failed at depth 4, and all failed at depths higher than 4, despite the activation function and initialization being designed to make learning stable (He et al., 2015). Second, for MNIST with shuffled labels, we noticed that it is difficult to reach high training accuracy using the direct training method with SGD, though both subspace training with SGD and either type of training with Adam reliably reach 100% memorization as $d$ increases (see Fig. S8).

Because each random basis vector projects across all $D$ direct parameters, the optimization problem may be far better conditioned in the subspace case than in the direct case. A related potential downside is that projecting across $D$ parameters which may have widely varying scale could result in ignoring parameter dimensions with tiny gradients. This situation is similar to that faced by methods like SGD, but ameliorated by RMSProp, Adam, and other methods that rescale per-dimension step sizes to account for individual parameter scales. Though convergence of the subspace approach seems robust, further work may be needed to improve network amenability to subspace training: for example, by ensuring direct parameters are similarly scaled via clever initialization, or by inserting a pre-scaling layer between the projected subspace and the direct parameters themselves.
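As an illustration only, one hypothetical form such a pre-scaling layer could take is a fixed per-parameter scale vector applied between the projected subspace and the direct parameters, for example matching each block's initialization scale; the NumPy sketch below shows the idea under that assumption and is not something evaluated in this work.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 100

    # Two hypothetical parameter blocks (e.g. two layers) with very different init scales.
    sizes, stds = [8_000, 2_000], [0.05, 0.5]
    theta_0 = np.concatenate([rng.normal(0.0, sd, n) for n, sd in zip(sizes, stds)])
    scale = np.concatenate([np.full(n, sd) for n, sd in zip(sizes, stds)])  # fixed per-parameter scale
    D = theta_0.size

    P = rng.normal(size=(D, d))
    P /= np.linalg.norm(P, axis=0, keepdims=True)   # approximately unit-norm columns

    def to_direct(v):
        # Pre-scaling: each direct parameter moves in proportion to its own natural scale,
        # so dimensions with tiny magnitudes are not drowned out by the projection.
        return theta_0 + scale * (P @ v)

    print(to_direct(np.zeros(d)).shape)             # training starts exactly at theta_0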

(a) Accuracy   (b) Negative log likelihood (NLL)
Figure S9: Results of subspace training versus the number of layers in a fully connected network trained on MNIST. The direct method always fails to converge when $L>5$, while subspace training yields stable performance across all depths.

S5.4 The role of optimizers

Another finding from our experiments with MNIST FC networks has to do with the role of optimizers. The same set of experiments is run with both SGD (learning rate 0.1) and Adam (learning rate 0.001), allowing us to investigate the impact of the stochastic optimizer on the intrinsic dimension achieved.

The intrinsic dimension $d_{\mathrm{int90}}$ is reported in Fig. S10 (a)(b). In addition to the two optimizers, we also use two baselines: a global baseline, set at 90% of the best performance achieved across all models, and an individual baseline, defined with respect to the performance of the same model under direct training.

(a) Global baseline   (b) Individual baseline
Figure S10: The role of optimizers on MNIST FC networks. The transparent dots indicate SGD results, and opaque dots indicate Adam results. Adam generally yields higher intrinsic dimensions because higher baselines are achieved, especially when the individual baselines in (b) are used. Note that the Adam points are slightly different between Fig. 3 and Fig. S7, because in the former we average over three runs, and in the latter we show one run each for all optimization methods.

S6 Additional Reinforcement Learning results and details

S6.1 DQN experiments

(a) CartPole ($d_{\mathrm{int90}}=25$)   (b) Pole ($d_{\mathrm{int90}}=23$)   (c) Cart ($d_{\mathrm{int90}}=7$)
Figure S11: Subspace training of DQN on the CartPole game. Dots show rewards collected over a game run, averaged over the last 100 episodes, for each subspace dimension and each game environment. The line connects mean rewards across different $d$s.
DQN on CartPole

We start with a simple classic control game, CartPole-v0, in OpenAI Gym (Brockman et al., 2016). A pendulum starts upright, and the goal is to prevent it from falling over. The system is controlled by applying a force of LEFT or RIGHT to the cart. The full game ends when one of two failure conditions is satisfied: the cart moves more than 2.4 units from the center (where it started), or the pole is more than 15 degrees from vertical (where it started). A reward of +1 is provided for every time step as long as the game is going. We further created two easier environments, Pole and Cart, each confined by one of the failure modes only.

A DQN is used, where the value network is parameterized by an FC network ($L=2$, $W=400$). For each subspace dimension $d$ at least 5 runs are conducted, the mean of which is used to compute $d_{\mathrm{int90}}$, and the baseline is set as 195.0 (CartPole-v0 is considered “solved” when the average reward over the last 100 episodes is 195). The results are shown in Fig. S11. The solid line connects the mean rewards within a run over the last 100 episodes, across different $d$s. Due to the noise sensitivity of RL games the curve is no longer monotonic. The intrinsic dimension for CartPole, Pole, and Cart is $d_{\mathrm{int90}}=25$, $23$, and $7$, respectively. This reveals that the difficulty of the optimization landscape of these games is remarkably low, and yields interesting insights, such as that driving a cart is much easier than keeping a pole straight, the latter being the major cause of difficulty when trying to do both.
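For reference, the “solved” criterion used here is just a running mean over the most recent 100 episode returns; a small sketch with a hypothetical reward stream:

    from collections import deque

    def solved(episode_rewards, threshold=195.0, window=100):
        # CartPole-v0-style criterion: mean reward over the last `window` episodes >= threshold.
        recent = deque(maxlen=window)
        for r in episode_rewards:
            recent.append(r)
            if len(recent) == window and sum(recent) / window >= threshold:
                return True
        return False

    # Hypothetical reward stream: 50 poor episodes followed by 150 near-perfect ones.
    print(solved([20.0] * 50 + [200.0] * 150))   # True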

S6.2 Evolutionary Strategies (ES) complete results

We carry out three RL tasks with ES: InvertedPendulum-v1, Humanoid-v1, and Pong-v0. The hyperparameter settings for training are given in Table S3.

Task | $\ell_{2}$ penalty | Adam LR | ES $\sigma$ | Iterations
InvertedPendulum-v1 | $1\times 10^{-8}$ | $3\times 10^{-1}$ | $2\times 10^{-2}$ | 1000
Humanoid-v1 | $5\times 10^{-3}$ | $3\times 10^{-2}$ | $2\times 10^{-2}$ | 2000
Pong-v0 | $5\times 10^{-3}$ | $3\times 10^{-2}$ | $2\times 10^{-2}$ | 500
Table S3: Hyperparameters used in training RL tasks with ES. $\sigma$ refers to the parameter perturbation noise used in ES. Default Adam parameters of $\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=1\times 10^{-7}$ were used.
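As a rough illustration of how the entries of Table S3 enter training, the toy loop below performs an antithetic ES gradient estimate (in the spirit of Salimans et al., 2017) on the low-dimensional subspace parameters and applies an Adam step with an $\ell_{2}$ penalty. The reward function, population size, and problem size are placeholders; this is not the distributed implementation actually used for these tasks.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 10                       # subspace dimension being probed (placeholder)
    sigma = 2e-2                 # ES perturbation noise, as in Table S3
    lr, l2 = 3e-2, 5e-3          # Adam LR and l2 penalty (Humanoid-v1 row of Table S3)
    beta1, beta2, eps = 0.9, 0.999, 1e-7
    n_pairs, iterations = 32, 100

    def reward(v):
        # Placeholder standing in for an episode return under direct parameters theta_0 + P v.
        return -np.sum((v - 1.0) ** 2)

    v = np.zeros(d)                      # low-dimensional parameters; start at theta_0
    m, u = np.zeros(d), np.zeros(d)      # Adam moment estimates
    for t in range(1, iterations + 1):
        noise = rng.normal(size=(n_pairs, d))
        r_pos = np.array([reward(v + sigma * n) for n in noise])
        r_neg = np.array([reward(v - sigma * n) for n in noise])
        grad = (r_pos - r_neg) @ noise / (2 * n_pairs * sigma)   # antithetic ES gradient estimate
        grad -= l2 * v                                           # l2 penalty on subspace parameters
        m = beta1 * m + (1 - beta1) * grad
        u = beta2 * u + (1 - beta2) * grad ** 2
        v += lr * (m / (1 - beta1 ** t)) / (np.sqrt(u / (1 - beta2 ** t)) + eps)  # Adam ascent step

    print(round(float(reward(v)), 3))    # approaches 0 as v approaches the optimum at 1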
Inverted pendulum

The InvertedPendulum-v1 environment uses the MuJoCo physics simulator (Todorov et al., 2012) to instantiate the same problem as CartPole-v0 in a realistic setting. We expect that even with richer environment dynamics, as well as a different RL algorithm (ES), the intrinsic dimensionality should be similar. As seen in Fig. 5, the measured intrinsic dimensionality $d_{\mathrm{int90}}=4$ is of the same order of magnitude, but smaller. Interestingly, although the environment dynamics are more complex than in CartPole-v0, using ES rather than DQN seems to induce a simpler objective landscape.

Learning to walk

A more challenging problem is Humanoid-v1 in the MuJoCo simulator. Intuitively, one might believe that learning to walk is a more complex task than classifying images. Our results show the contrary: the learned intrinsic dimensionality of $d_{\mathrm{int90}}=700$ is similar to that of MNIST on a fully connected network ($d_{\mathrm{int90}}=650$) but significantly less than that of even a convnet trained on CIFAR-10 ($d_{\mathrm{int90}}=2{,}500$). Fig. 5 shows the full results. Interestingly, we begin to see training runs reach the threshold as early as $d=400$, with the median performance steadily increasing with $d$.

Atari Pong

Finally, we use a base convnet of approximately $D=1$M parameters in the Pong-v0 pixels-to-actions environment (with 4-frame stacking). The agent receives an image frame (of size $210\times 160\times 3$) and the action is to move the paddle UP or DOWN. We were able to determine $d_{\mathrm{int90}}=6{,}000$.

S7 Three Methods of random projection

Scaling the random subspace training procedure to large problems requires an efficient way to map from $\mathbb{R}^{d}$ into a random $d$-dimensional subspace of $\mathbb{R}^{D}$ that does not necessarily include the origin. Algebraically, we need to left-multiply a vector of parameters $v\in\mathbb{R}^{d}$ by a random matrix $M\in\mathbb{R}^{D\times d}$, whose columns are orthonormal, then add an offset vector $\theta_{0}\in\mathbb{R}^{D}$. If the low-dimensional parameter vector in $\mathbb{R}^{d}$ is initialized to zero, then specifying an offset vector is equivalent to choosing an initialization point in the original model parameter space $\mathbb{R}^{D}$.

A naïve approach to generating the random matrix $M$ is to use a dense $D\times d$ matrix of independent standard normal entries, then scale each column to be of length 1. The columns will be approximately orthogonal if $D$ is large because of the independence of the entries. Although this approach is sufficient for low-rank training of models with few parameters, we quickly run into scaling limits because both the matrix-vector multiply time and the storage of the matrix scale as $\mathcal{O}(Dd)$. We were able to successfully determine the intrinsic dimensionality of MNIST ($d=225$) using a LeNet ($D=44{,}426$), but were unable to increase $d$ beyond 1,000 when applying a LeNet ($D=62{,}006$) to CIFAR-10, a dimension that did not meet the performance criterion required to identify the problem's intrinsic dimensionality.
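A minimal NumPy sketch of this dense construction (shapes taken from the MNIST LeNet above; the initialization scale and everything outside the projection itself are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    D, d = 44_426, 225                          # MNIST LeNet: native and intrinsic dimensions

    theta_0 = rng.normal(0.0, 0.05, size=D)     # stand-in for the frozen random initialization
    P = rng.normal(size=(D, d))                 # dense i.i.d. standard normal entries
    P /= np.linalg.norm(P, axis=0, keepdims=True)   # unit-length, approximately orthogonal columns

    v = np.zeros(d)                             # the only trainable parameters
    theta = theta_0 + P @ v                     # direct parameters handed to the network
    print(theta.shape, P.nbytes / 1e6, "MB")    # storage and multiply time grow as O(D d)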

Random matrices need not be dense for their columns to be approximately orthonormal. In fact, a method exists for “very sparse” random projections (Li et al., 2006), which achieves a density of $\frac{1}{\sqrt{D}}$. To construct the $D\times d$ matrix, each entry is chosen to be nonzero with probability $\frac{1}{\sqrt{D}}$. If chosen, then with equal probability the entry is either positive or negative, with the same magnitude in either case. The density of $\frac{1}{\sqrt{D}}$ implies $\sqrt{D}d$ nonzero entries, or $\mathcal{O}(\sqrt{D}d)$ time and space complexity. Implementing this procedure allowed us to find the intrinsic dimension of $d=2{,}500$ for CIFAR-10 using the LeNet mentioned above. Unfortunately, when using Tensorflow's SparseTensor implementation we did not achieve the theoretical $\sqrt{D}$-factor improvement in time complexity (closer to a constant 10x). Nonzero elements also have a significant memory footprint of 24 bytes each, so we could not scale to larger problems with millions of model parameters and large intrinsic dimensionalities.
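A SciPy sketch of the very sparse construction follows; the magnitude of the nonzero entries is handled here by normalizing each column to unit length, which is one reasonable choice rather than a detail fixed by the description above.

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    D, d = 62_006, 2_500                        # CIFAR-10 LeNet

    density = 1.0 / np.sqrt(D)                  # each entry nonzero with probability 1/sqrt(D)
    P = sp.csc_matrix(sp.random(D, d, density=density, random_state=0,
                                data_rvs=lambda n: rng.choice([-1.0, 1.0], size=n)))
    # Normalize columns to unit length so the basis is roughly orthonormal.
    col_norms = np.sqrt(P.multiply(P).sum(axis=0)).A1
    col_norms[col_norms == 0] = 1.0
    P = P @ sp.diags(1.0 / col_norms)

    v = np.zeros(d)
    offset = P @ v                              # O(sqrt(D) d) nonzeros instead of O(D d)
    print(P.nnz, offset.shape)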

We need not explicitly form and store the transformation matrix. The Fastfood transform (Le et al., 2013) was initially developed as an efficient way to compute a nonlinear, high-dimensional feature map $\phi(x)$ for a vector $x$. A portion of the procedure involves implicitly generating a $D\times d$ matrix with approximately uncorrelated standard normal entries, using only $\mathcal{O}(D)$ space, which can be multiplied by $v$ in $\mathcal{O}(D\log{d})$ time using a specialized method. The method relies on the fact that Hadamard matrices multiplied by Gaussian vectors behave like dense Gaussian matrices. In detail, to implicitly multiply $v$ by a random square Gaussian matrix $M$ with side lengths equal to a power of two, the matrix is factorized into multiple simple matrices: $M=HG\Pi HB$, where $B$ is a random diagonal matrix with entries $\pm 1$ with equal probability, $H$ is a Hadamard matrix, $\Pi$ is a random permutation matrix, and $G$ is a random diagonal matrix with independent standard normal entries. Multiplication by a Hadamard matrix can be done via the Fast Walsh-Hadamard Transform in $\mathcal{O}(d\log{d})$ time and takes no additional space. The other matrices have linear time and space complexities. When $D>d$, multiple independent samples of $M$ can be stacked to increase the output dimensionality. When $d$ is not a power of two, we can zero-pad $v$ appropriately. Stacking $\frac{D}{d}$ samples of $M$ results in an overall time complexity of $\mathcal{O}(\frac{D}{d}d\log{d})=\mathcal{O}(D\log{d})$, and a space complexity of $\mathcal{O}(\frac{D}{d}d)=\mathcal{O}(D)$. In practice, the reduction in space footprint allowed us to scale to much larger problems, including the Pong RL task using a 1M-parameter convolutional network for the policy function.
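A compact, unoptimized NumPy sketch of the implicit multiply is given below, assuming $d$ is a power of two and $D$ a multiple of $d$; the final $1/\sqrt{d}$ rescaling is one simple choice for making each implicit block behave like a matrix of roughly unit-variance Gaussian entries, not a prescription from the text.

    import numpy as np

    def fwht(x):
        # Unnormalized Fast Walsh-Hadamard Transform; len(x) must be a power of two.
        x = x.copy()
        h = 1
        while h < len(x):
            y = x.reshape(-1, 2 * h)
            a, b = y[:, :h].copy(), y[:, h:].copy()
            y[:, :h], y[:, h:] = a + b, a - b
            h *= 2
        return x

    def fastfood_block(v, rng):
        # Implicitly multiply v (length d) by one d x d block M = H G Pi H B,
        # which behaves approximately like a dense Gaussian matrix.
        d = len(v)
        B = rng.choice([-1.0, 1.0], size=d)     # random sign flips (diagonal B)
        perm = rng.permutation(d)               # random permutation Pi
        G = rng.normal(size=d)                  # Gaussian diagonal G
        t = fwht(B * v)[perm]                   # H B v, then permute
        return fwht(G * t) / np.sqrt(d)         # H G Pi H B v, rescaled to ~unit-variance entries

    rng = np.random.default_rng(0)
    d, D = 256, 1024                            # d a power of two; D a multiple of d here
    v = rng.normal(size=d)
    # Stack D/d independent blocks: O(D log d) time, O(D) space, no explicit D x d matrix stored.
    out = np.concatenate([fastfood_block(v, rng) for _ in range(D // d)])
    print(out.shape)                            # (1024,)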

Table S4 summarizes the performance of each of the three methods theoretically and empirically.

Method | Time complexity | Space complexity | $D$ = 100k | $D$ = 1M | $D$ = 60M
Dense | $\mathcal{O}(Dd)$ | $\mathcal{O}(Dd)$ | 0.0169 s | 1.0742 s* | 4399.1 s*
Sparse | $\mathcal{O}(\sqrt{D}d)$ | $\mathcal{O}(\sqrt{D}d)$ | 0.0002 s | 0.0019 s | 0.5307 s*
Fastfood | $\mathcal{O}(D\log{d})$ | $\mathcal{O}(D)$ | 0.0181 s | 0.0195 s | 0.7949 s
Table S4: Comparison of theoretical complexity and average duration of a forward+backward pass through $M$ (in seconds). $d$ was fixed to 1% of $D$ in each measurement. $D$ = 100k is approximately the size of an MNIST fully-connected network, and $D$ = 60M is approximately the size of AlexNet. The Fastfood timings are based on a Tensorflow implementation of the Fast Walsh-Hadamard Transform, and could be drastically reduced with an efficient CUDA implementation. Asterisks mean that we encountered an out-of-memory error, and the values are extrapolated from the largest successful run (a few powers of two smaller). For example, we expect sparse to outperform Fastfood if it did not run into memory issues.

Figure S4 compares the computational time of the direct and subspace training (various projections) methods for each update. Our subspace training is more computationally expensive, because the subspace training method has to propagate signals through two modules: the layers of the neural network and the projection between the two spaces. Direct training only propagates signals through the layers of the neural network. We have made efforts to reduce the extra computational cost; for example, the sparse projection less than doubles the time cost over a large range of subspace dimensions.

Figure S12: MNIST compute time for direct vs. various projection methods for 100k parameters (left) and 1M parameters (right).

S8 Additional CIFAR-10 Results

FC networks

We consider the CIFAR-10 dataset and test the same set of FC and LeNet architectures as on MNIST. For FC networks, $d_{\mathrm{int90}}$ values for all 20 networks are shown in Fig. S13 (a), plotted against the native dimension $D$ of each network; $D$ changes by a factor of 12.16 between the smallest and largest networks, but $d_{\mathrm{int90}}$ changes over this range by a factor of 5.0. However, much of this change is due to the change of baseline performance. In Fig. S13 (b), we instead compute the intrinsic dimension with respect to a global baseline: 50% validation accuracy. $d_{\mathrm{int90}}$ then changes over this range by a factor of only 1.52. This indicates that the various FC networks share a similar intrinsic dimension ($d_{\mathrm{int90}}=5000\sim 8000$) to achieve the same level of task performance. For LeNet ($D=62{,}006$), the validation accuracy vs. subspace dimension $d$ is shown in Fig. S14 (b); the corresponding $d_{\mathrm{int90}}=2900$. It yields a compression rate of 5%, which is 10 times larger than that of LeNet on MNIST. This shows that CIFAR-10 images are significantly more difficult to classify correctly than MNIST. In other words, CIFAR-10 is a harder problem than MNIST, especially given that the notion of “problem solved” (baseline performance) is defined as 99% accuracy on MNIST and 58% accuracy on CIFAR-10. On the CIFAR-10 dataset, as $d$ increases, subspace training tends to overfit; we study the role of subspace training as a regularizer below.
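For concreteness, the quoted 5% compression rate is simply the ratio of intrinsic to native dimension for this LeNet:

$d_{\mathrm{int90}} / D = 2{,}900 / 62{,}006 \approx 0.047 \approx 5\%.$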

(a) Individual baseline   (b) Global baseline
Figure S13: Intrinsic dimension of FC networks with various widths and depths on the CIFAR-10 dataset. In (b), we use a simple 50% baseline globally.
ResNet vs. LeNet

We test ResNets, compare them to LeNet, and find they make efficient use of parameters. We adopt the smallest, 20-layer structure of ResNet with 280k parameters, and find in Fig. S14 (b) that it reaches the LeNet baseline with $d_{\mathrm{int90}}=1000\sim 2000$ (lower than the $d_{\mathrm{int90}}$ of LeNet), while it takes a larger $d_{\mathrm{int90}}$ ($20{,}000\sim 50{,}000$) to reach its own, much higher baseline.

(a) FC ($W=200$, $L=2$)   (b) LeNet   (c) ResNet
Figure S14: Validation accuracy of an FC network, LeNet, and ResNet on CIFAR with different subspace dimensions $d$. In (a)(b), the variance of the validation accuracy and the measured $d_{\mathrm{int90}}$ are visualized as blue vertical and red horizontal error bars, respectively. The subspace method surpasses the 90% baseline on LeNet at $d$ between 1000 and 2000, and 90% of the ResNet baseline between 20k and 50k.
The role of regularizers

Our subspace training can be considered a regularization scheme, as it restricts the solution set. We study and compare its effects with those of two traditional regularizers, using an FC network ($L=2$, $W=200$) on the CIFAR-10 dataset: an $\ell_{2}$ penalty on the weights (i.e., weight decay) and Dropout. A configuration sketch of this setup follows the list below.

  • $\ell_{2}$ penalty   Various amounts of $\ell_{2}$ penalty from $\{10^{-2},10^{-3},5\times 10^{-4},10^{-4},10^{-5},0\}$ are considered. The accuracy and negative log-likelihood (NLL) are reported in Fig. S15 (a) and (b), respectively. As expected, larger amounts of weight decay reduce the gap between training and testing performance for both the direct and subspace training methods, and eventually close the gap (i.e., $\ell_{2}$ penalty = 0.01). Subspace training itself exhibits strong regularization ability, especially when $d$ is small, at which point the performance gap between training and testing is smaller.
  • Dropout   Various dropout rates from $\{0.5,0.4,0.3,0.2,0.1,0\}$ are considered. The accuracy and NLL are reported in Fig. S16. Larger dropout rates reduce the gap between training and testing performance for both the direct and subspace training methods. When observing the testing NLL, subspace training tends to overfit the training dataset less.
  • Subspace training as implicit regularization   The subspace training method performs implicit regularization, as it restricts the solution set. We visualize the testing NLL in Fig. S17. The subspace training method outperforms the direct method when $d$ is properly chosen (when the $\ell_{2}$ penalty is $<5\times 10^{-4}$, or the dropout rate is $<0.1$), suggesting the potential of this method as a better alternative to traditional regularizers. When $d$ is large, the method also overfits the training dataset. Note that these methods perform regularization in different ways: weight decay forces the learned weights to concentrate around zero, while subspace training directly reduces the number of dimensions of the solution space.
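The following Keras-style sketch shows where the two regularizers enter the FC baseline used in this comparison ($L=2$ hidden layers of width $W=200$ on CIFAR-10). The layer sizes and example regularizer values come from the text above; the optimizer, activation, and other details are illustrative assumptions rather than the exact training configuration.

    import tensorflow as tf

    def fc_cifar_model(weight_decay=5e-4, dropout_rate=0.1):
        # FC baseline (L = 2 hidden layers, W = 200 units) with the two regularizers under study.
        reg = tf.keras.regularizers.l2(weight_decay) if weight_decay else None
        layers = [tf.keras.Input(shape=(32, 32, 3)), tf.keras.layers.Flatten()]
        for _ in range(2):
            layers.append(tf.keras.layers.Dense(200, activation="relu", kernel_regularizer=reg))
            if dropout_rate:
                layers.append(tf.keras.layers.Dropout(dropout_rate))
        layers.append(tf.keras.layers.Dense(10, activation="softmax"))
        model = tf.keras.Sequential(layers)
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    model = fc_cifar_model(weight_decay=5e-4, dropout_rate=0.1)
    model.summary()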
(a) Accuracy   (b) NLL
Figure S15: Comparing regularization induced by the $\ell_{2}$ penalty and subspace training. Weight decay interacts with $d_{\mathrm{int90}}$ since it changes the objective landscapes through various loss functions.
(a) Accuracy   (b) NLL
Figure S16: Comparing regularization induced by Dropout and subspace training. Dropout interacts with $d_{\mathrm{int90}}$ since it changes the objective landscapes by randomly removing hidden units of the extrinsic neural network.
(a) $\ell_{2}$ penalty   (b) Dropout
Figure S17: Comparing regularization induced by the $\ell_{2}$ penalty, Dropout, and subspace training. The gray surface and black line indicate direct training.

S9 ImageNet

To investigate even larger problems, we attempted to measure $d_{\mathrm{int90}}$ for an ImageNet classification network. We use a relatively small network, SqueezeNet by Iandola et al. (2016), with 1.24M parameters; larger networks suffered from memory issues. Direct training produces a Top-1 accuracy of 55.5%. We vary the subspace dimension over 50k, 100k, 200k, 500k, and 800k, and record the validation accuracies shown in Fig. S18. Training each subspace dimension takes about 6 to 7 days, distributed across 4 GPUs. Due to limited time, training on ImageNet has not yet produced a reliable estimate for $d_{\mathrm{int90}}$, except that it is over 500k.

Figure S18: Validation accuracy of SqueezeNet on ImageNet with different $d$. At $d=500$k the accuracy reaches 34.34%, which is not yet past the threshold required to estimate $d_{\mathrm{int90}}$.

S10 Investigation of Convolutional Networks

Since the learned $d_{\mathrm{int90}}$ can be used as a robust measure to study the fitness of neural network architectures for specific tasks, we further apply it to understand the contribution of each component in convolutional networks for the image classification task. The convolutional network is a special case of the FC network in two respects: local receptive fields and weight tying. Local receptive fields force each filter to “look” only at a small, localized region of the image or layer below. Weight tying enforces that each filter shares the same weights across locations, which reduces the number of learnable parameters. We performed control experiments to investigate the degree to which each component contributes. Four variants of LeNet are considered:
  • Standard LeNet   6 kernels ($5\times 5$) – max-pooling ($2\times 2$) – 16 kernels ($5\times 5$) – max-pooling ($2\times 2$) – 120 FC – 84 FC – 10 FC
  • Untied-LeNet   The same architecture as the standard LeNet is employed, except that weights are unshared, i.e., a different set of filters is applied at each different patch of the input. For example, in Keras the LocallyConnected2D layer is used to replace the Conv2D layer (see the sketch after this list).
  • FCTied-LeNet   The same set of filters is applied at each different patch of the input; we break local connections by applying filters to global patches of the input. Assuming the image size is $H\times H$, the architecture is 6 kernels ($(2H\!-\!1)\times(2H\!-\!1)$) – max-pooling ($2\times 2$) – 16 kernels ($(H\!-\!1)\times(H\!-\!1)$) – max-pooling ($2\times 2$) – 120 FC – 84 FC – 10 FC. The padding type is same.
  • FC-LeNet   Neither local connections nor tied weights are employed; we mimic LeNet with its FC implementation. The same number of hidden units as the standard LeNet is used at each layer.
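A Keras-style sketch of the tied vs. untied construction is shown below. It assumes a TF/Keras version that still provides the LocallyConnected2D layer (it has been removed from recent Keras releases), and it only illustrates the Standard and Untied variants; the other two variants change the kernel sizes as described above.

    import tensorflow as tf
    from tensorflow.keras import layers

    def lenet(untied=False, input_shape=(28, 28, 1)):
        # Standard LeNet vs. Untied-LeNet: identical layout, with weight tying toggled.
        # LocallyConnected2D applies a *different* filter at every spatial location
        # (local receptive fields without weight sharing).
        Conv = layers.LocallyConnected2D if untied else layers.Conv2D
        return tf.keras.Sequential([
            tf.keras.Input(shape=input_shape),
            Conv(6, (5, 5), activation="relu"),
            layers.MaxPooling2D((2, 2)),
            Conv(16, (5, 5), activation="relu"),
            layers.MaxPooling2D((2, 2)),
            layers.Flatten(),
            layers.Dense(120, activation="relu"),
            layers.Dense(84, activation="relu"),
            layers.Dense(10, activation="softmax"),
        ])

    standard, untied = lenet(untied=False), lenet(untied=True)
    print(standard.count_params(), untied.count_params())  # untying the weights inflates D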

The results are shown in Fig. S19 (a)(b). We set a crossing-line accuracy (i.e., a threshold) for each task and investigate the $d_{\mathrm{int90}}$ needed to achieve it. For MNIST and CIFAR-10, the threshold is 90% and 45%, respectively. For the above LeNet variants, $d_{\mathrm{int90}}=290, 600, 425, 2000$ on MNIST, and $d_{\mathrm{int90}}=1000, 2750, 2500, 35000$ on CIFAR-10. The experiments show that both tied weights and local connections are important to the model. That tied weights should matter seems sensible. One might also have expected models with maximal convolutions (convolutions covering the whole image) to have the same intrinsic dimension as those with smaller convolutions, but this turns out not to be the case.

(a) MNIST   (b) CIFAR-10
Figure S19: Validation accuracy of LeNet variants with different subspace dimensions $d$. The conclusion is that convnets are more efficient than FC nets both due to local connectivity and due to weight tying.

S11 Summarization of $d_{\mathrm{int90}}$

We summarize the $d_{\mathrm{int90}}$ of the objective landscape for all of the different problems and neural network architectures in Table S5 and Fig. S20. “SP” indicates shuffled pixels, “SL” shuffled labels, and “FC-5” a 5-layer FC network. $d_{\mathrm{int90}}$ indicates the minimum number of dimensions of trainable parameters required to properly solve the problem, and thus reflects the difficulty level of each problem.

Table S5: Intrinsic dimension of different objective landscapes, determined by dataset and network.
Figure S20: Intrinsic dimension of the objective landscapes created by all combinations of dataset and network tried in this paper.