
Coherent Gradients: An Approach to
Understanding Generalization
in Gradient Descent-based Optimization
"Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization" 翻译为 "一致梯度:理解基于梯度下降优化的泛化的方法"

Satrajit Chatterjee

Google AI
Mountain View, CA 94043, USA
schatter@google.com

Abstract

An open question in the Deep Learning community is why neural networks trained with Gradient Descent generalize well on real datasets even though they are capable of fitting random data.

We propose an approach to answering this question based on a hypothesis about the dynamics of gradient descent that we call Coherent Gradients: Gradients from similar examples are similar and so the overall gradient is stronger in certain directions where these reinforce each other.

Thus changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples when such similarity exists.

We support this hypothesis with heuristic arguments and perturbative experiments and outline how this can explain several common empirical observations about Deep Learning. Furthermore, our analysis is not just descriptive, but prescriptive.

It suggests a natural modification to gradient descent that can greatly reduce overfitting.

1 Introduction and Overview

Neural networks used in practice often have sufficient effective capacity to learn arbitrary maps from their inputs to their outputs. This is typically demonstrated by training a classification network that achieves good test accuracy on a real dataset $S$, on a modified version of $S$ (call it $S'$) where the labels are randomized, and observing that the training accuracy on $S'$ is very high, though, of course, the test accuracy is no better than chance (Zhang et al., 2017). This leads to an important open question in the Deep Learning community (Zhang et al. (2017); Arpit et al. (2017); Bartlett et al. (2017); Kawaguchi et al. (2017); Neyshabur et al. (2018); Arora et al. (2018); Belkin et al. (2019); Rahaman et al. (2019); Nagarajan & Kolter (2019), etc.): Among all maps that fit a real dataset, how does Gradient Descent (GD) find one that generalizes well? This is the question we address in this paper.

We start by observing that this phenomenon is not limited to neural networks trained with GD but also applies to Random Forests and Decision Trees. However, there is no mystery with trees: A typical tree construction algorithm splits the training set recursively into similar subsets based on input features.

If no similarity is found, eventually, each example is put into its own leaf to achieve good training accuracy (but, of course, at the cost of poor generalization). Thus, trees that achieve good accuracy on a randomized dataset are much larger than those on a real dataset (e.g.  Chatterjee & Mishchenko (2019, Expt. 5)).

Is it possible that something similar happens with GD? We believe so. The randomized-label experiments described above show that if there are common patterns to be found, then GD finds them. If not, it fits each example on a case-by-case basis.
The question then is, what is it about the dynamics of GD that makes it possible to extract common patterns from the data? And what does it mean for a pattern to be common?

Since the only change to the network parameters in GD comes from the gradients, the mechanism to detect commonality amongst examples must be through the gradients. We propose that this commonality detection can be explained as follows:

  1. Gradients are coherent, i.e., similar examples (or parts of examples) have similar gradients (or similar components of gradients) and dissimilar examples have dissimilar gradients.

  2. Since the overall gradient is the sum of the per-example gradients, it is stronger in directions where the per-example gradients are similar and reinforce each other and weaker in other directions where they are different and do not add up.

  3. Since network parameters are updated proportionally to gradients, they change faster in the direction of stronger gradients.

  4. Thus the changes to the network during training are biased towards those that simultaneously benefit many examples instead of a few (or one example).

For convenience, we refer to this as the Coherent Gradients hypothesis.

It is instructive to work through the proposed mechanism in the context of a simple thought experiment. Consider a training set with two examples $a$ and $b$. At some point in training, suppose the gradient of $a$, $g_a$, can be decomposed into two orthogonal components $g_{a_1}$ and $g_{a_2}$ of roughly equal magnitude, i.e., there are two, equally good, independent ways in which the network can better fit $a$ (by using say two disjoint parts of the network). Likewise, for $b$. Now, further suppose that one of the two ways is common to both $a$ and $b$, i.e., say $g_{a_2} = g_{b_2} = g_{ab}$, whereas, the other two are example specific, i.e., $\langle g_{a_1}, g_{b_1} \rangle = 0$. Now, the overall gradient is

$$g = g_a + g_b = g_{a_1} + 2\, g_{ab} + g_{b_1}.$$

Observe that the gradient is stronger in the direction that simultaneously helps both examples, and thus the corresponding parameter changes are bigger than those that benefit only one example.¹

¹ While the mechanism is easiest to see with full or large minibatches, we believe it holds even for small minibatches (though there one has to consider the bias in updates over time).
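To make the thought experiment concrete, the following small sketch (our illustration, not from the paper; the 3-dimensional parameter space and the component vectors are made up) builds two per-example gradients that share one component and differ in another, and checks that the summed gradient is largest along the shared direction:

```python
import numpy as np

# Hypothetical 3-dimensional parameter space.
g_ab = np.array([0.0, 0.0, 1.0])   # component shared by examples a and b
g_a1 = np.array([1.0, 0.0, 0.0])   # component specific to example a
g_b1 = np.array([0.0, 1.0, 0.0])   # component specific to example b (orthogonal to g_a1)

g_a = g_a1 + g_ab                  # gradient of example a
g_b = g_b1 + g_ab                  # gradient of example b
g = g_a + g_b                      # overall gradient = g_a1 + 2*g_ab + g_b1

# The shared direction gets twice the weight of either example-specific direction,
# so a gradient step moves the parameters mostly along what helps both examples.
print(g)                                   # [1. 1. 2.]
print(np.dot(g, g_ab), np.dot(g, g_a1))    # 2.0 vs 1.0
```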

It is important to emphasize that the notion of similarity used above (i.e., which examples are considered similar) is not a constant but changes in the course of training as network parameters change.

It starts from a mostly task independent notion due to random initialization and is bootstrapped in the course of training to be task dependent.

We say “mostly” because even with random initialization, examples that are syntactically close are treated similarly (e.g., two images differing in the intensities of some pixels as opposed to two images where one is a translated version of the other).

The relationship between strong gradients and generalization can also be understood through the lens of algorithmic stability (Bousquet & Elisseeff, 2002): strong gradient directions are more stable since the presence or absence of a single example does not impact them as much, as opposed to weak gradient directions which may altogether disappear if a specific example is missing from the training set. With this observation, we can reason inductively about the stability of GD: since the initial values of the parameters do not depend on the training data, the initial function mapping examples to their gradients is stable.

Now, if all parameter updates are due to strong gradient directions, then stability is preserved. However, if some parameter updates are due to weak gradient directions, then stability is diminished. Since stability (suitably formalized) is equivalent to generalization (Shalev-Shwartz et al., 2010), this allows us to see how generalization may degrade as training progresses. Based on this insight, we shall see later how a simple modification to GD to suppress the weak gradient directions can dramatically reduce overfitting.

In addition to providing insight into why GD generalizes in practice, we believe that the Coherent Gradients hypothesis can help explain several other empirical observations about deep learning in the literature:

  (a) Learning is slower with random labels than with real labels (Zhang et al., 2017; Arpit et al., 2017)

  (b) Robustness to large amounts of label noise (Rolnick et al., 2017)

  (c) Early stopping leads to better generalization (Caruana et al., 2000)

  (d) Increasing capacity improves generalization (Caruana et al., 2000; Neyshabur et al., 2018)

  (e) The existence of adversarial initialization schemes (Liu et al., 2019)

  (f) GD detects common patterns even when trained with random labels (Chatterjee & Mishchenko, 2019)

A direct experimental verification of the Coherent Gradients hypothesis is challenging since the notion of similarity between examples depends on the parameters of the network and thus changes during training. Our approach, therefore, is to design intervention experiments where we establish a baseline and compare it against variants designed to test some aspect or prediction of the theory. As part of these experiments, we replicate the observations (a)–(c) in the literature noted above, and analyze the corresponding explanations provided by Coherent Gradients (§2), and outline for future work how (d)–(f) may be accounted for (§5).

In this paper, we limit our study to simple baselines: vanilla Stochastic Gradient Descent (SGD) on MNIST using fully connected networks. We believe that this is a good starting point, since even in this simple setting, with all frills eliminated (e.g., inductive bias from architecture or explicit regularization, or a more sophisticated optimization procedure), we are challenged to find a satisfactory explanation of why SGD generalizes well. Furthermore, our prior is that the difference between weak and strong directions is small at any one step of training, and therefore having a strong learning signal as in the case of MNIST makes a direct analysis of gradients easier. It also has the benefit of having a smaller carbon footprint and being easier to reproduce. Finally, based on preliminary experiments on other architectures and datasets we are optimistic that the insights we get from studying this simple setup apply more broadly.

2 Effect of Reducing Similarity Between Examples

Our first test of the Coherent Gradients hypothesis is to see what happens when we reduce similarity between examples.

Although, at any point during training, we do not know which examples are similar, and which are different, we can (with high probability) reduce the similarity among training examples simply by injecting label noise.

In other words, under any notion of similarity, adding label noise to a dataset that has clean labels is likely to make similar examples less similar. Note that this perturbation does not reduce coherence since gradients still depend on the examples.

(To break coherence, we would have to make the gradients independent of the training examples, which would require perturbing SGD itself and not just the dataset.)

2.1 Setup

For our baseline, we use the standard MNIST dataset of 60,000 training examples and 10,000 test examples. Each example is a 28x28 pixel grayscale handwritten digit along with a label (‘0’–‘9’). We train a fully connected network on this dataset.

The network has one hidden layer with 2048 ReLUs and an output layer with a 10-way softmax. We initialize it with Xavier and train using vanilla SGD (i.e., no momentum) using cross entropy loss with a constant learning rate of 0.1 and a minibatch size of 100 for $10^5$ steps (i.e., about 170 epochs). We do not use any explicit regularizers.

We perturb the baseline by modifying only the dataset and keeping all other aspects of the architecture and learning algorithm fixed. The dataset is modified by adding various amounts of noise (25%, 50%, 75%, and 100%) to the labels of the training set (but not the test set). This noise is added by taking, say in the case of 25% label noise, 25% of the examples at random and randomly permuting their labels.

Thus, when we add 25% label noise, we still expect about 75% + 0.1 * 25%, i.e., 77.5% of the examples to have unchanged (i.e. “correct”) labels which we call the proper accuracy of the modified dataset. In what follows, we call examples with unchanged labels, pristine, and the remaining, corrupt. Also, from this perspective, it is convenient to refer to the original MNIST dataset as having 0% label noise.
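As an illustration of this perturbation, the following sketch (ours, not the paper's code; the function and array names are hypothetical) randomly permutes the labels of a chosen fraction of the training examples and records which examples remain pristine:

```python
import numpy as np

def add_label_noise(labels, noise_frac, seed=0):
    """Select a noise_frac fraction of examples at random and randomly permute
    their labels among themselves (so roughly 1/10 keep their label by chance)."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n = len(labels)
    idx = rng.choice(n, size=int(round(noise_frac * n)), replace=False)
    noisy[idx] = rng.permutation(noisy[idx])
    pristine = noisy == labels          # unchanged ("correct") labels
    return noisy, pristine

# Example: 25% label noise on MNIST-sized, roughly class-balanced labels.
labels = np.random.default_rng(1).integers(0, 10, size=60_000)
noisy, pristine = add_label_noise(labels, 0.25)
print(pristine.mean())  # proper accuracy, roughly 0.775 = 0.75 + 0.1 * 0.25
```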

We use a fully connected architecture instead of a convolutional one to mitigate concerns that some of the difference in generalization between the original MNIST and the noisy variants could stem from architectural inductive bias. We restrict ourselves to only 1 hidden layer to have the gradients be as well-behaved as possible. Finally, the network width, learning rate, and the number of training steps are chosen to ensure that exactly the same procedure is usually able to fit all 5 variants to 100% training accuracy.

2.2 Qualitative Predictions

Before looking at the experimental results, it is useful to consider what Coherent Gradients can qualitatively say about this setup.

In going from 0% label noise to 100% label noise, as per experiment design, we expect examples in the training set to become more dissimilar (no matter what the current notion of similarity is). Therefore, we expect the per-example gradients to be less aligned with each other.

This in turn causes the overall gradient to become more diffuse, i.e., stronger directions become relatively weaker, and consequently, we expect it to take longer to reach a given level of accuracy as label noise increases, i.e., to have a lower realized learning rate.

This can be made more precise by considering the following heuristic argument. Let $\theta_t$ be the vector of trainable parameters of the network at training step $t$. Let $\mathcal{L}$ denote the loss function of the network (over all training examples). Let $g_t$ be the gradient of $\mathcal{L}$ at $\theta_t$ and let $\alpha$ denote the learning rate. By Taylor expansion, to first order, the change $\Delta\mathcal{L}_t$ in the loss function due to a small gradient descent step $h_t = -\alpha \cdot g_t$ is given by

$$\Delta\mathcal{L}_t := \mathcal{L}(\theta_t + h_t) - \mathcal{L}(\theta_t) \approx \langle g_t, h_t \rangle = -\alpha \cdot \langle g_t, g_t \rangle = -\alpha \cdot \|g_t\|^2 \qquad (1)$$

where $\|\cdot\|$ denotes the $l_2$-norm. Now, let $g_{te}$ denote the gradient of training example $e$ at step $t$. Since the overall gradient is the sum of the per-example gradients, we have,

$$\|g_t\|^2 = \langle g_t, g_t \rangle = \Big\langle \sum_e g_{te}, \sum_e g_{te} \Big\rangle = \sum_{e, e'} \langle g_{te}, g_{te'} \rangle = \sum_e \|g_{te}\|^2 + \sum_{\substack{e, e' \\ e \neq e'}} \langle g_{te}, g_{te'} \rangle \qquad (2)$$

Now, heuristically, let us assume that all the $\|g_{te}\|$ are roughly the same and equal to $\|g_t^\circ\|$, which is not entirely unreasonable (at least at the start of training, if the network has no a priori reason to treat different examples very differently). If all the per-example gradients are approximately orthogonal (i.e., $\langle g_{te}, g_{te'} \rangle \approx 0$ for $e \neq e'$), then $\|g_t\|^2 \approx m \cdot \|g_t^\circ\|^2$ where $m$ is the number of examples. On the other hand, if they are approximately the same (i.e., $\langle g_{te}, g_{te'} \rangle \approx \|g_t^\circ\|^2$), then $\|g_t\|^2 \approx m^2 \cdot \|g_t^\circ\|^2$. Thus, we expect that the greater the agreement in per-example gradients, the faster the loss should decrease.
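The scaling with $m$ versus $m^2$ is easy to check numerically. The following sketch (ours, with made-up dimensions) compares the squared norm of the summed gradient when the per-example gradients are nearly orthogonal (random directions in a high-dimensional space) versus when they all point the same way:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 10_000          # number of examples, number of parameters (hypothetical)

# Dissimilar examples: per-example gradients are random unit vectors,
# which are nearly orthogonal in high dimensions.
G_orth = rng.standard_normal((m, d))
G_orth /= np.linalg.norm(G_orth, axis=1, keepdims=True)

# Similar examples: all per-example gradients equal the same unit vector.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
G_same = np.tile(u, (m, 1))

print(np.linalg.norm(G_orth.sum(axis=0)) ** 2)   # roughly m   (about 100)
print(np.linalg.norm(G_same.sum(axis=0)) ** 2)   # exactly m^2 (10000)
```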

Finally, for datasets that have significant fractions of both pristine and corrupt examples (i.e., the 25%, 50%, and 75% noise variants), we can make a more nuanced prediction. Since, in those datasets, the pristine examples as a group are still more similar than the corrupt ones, we expect the pristine gradients to continue to align well and sum up to a strong gradient. Therefore, we expect them to be learned faster than the corrupt examples, and at a rate closer to the realized learning rate in the 0% label noise case. Likewise, we expect the realized learning rate on the corrupt examples to be closer to the 100% label noise case. Finally, as the proportion of pristine examples falls with increasing noise, we expect the realized learning rate for pristine examples to degrade.

Note that this provides an explanation for the observation in the literature that networks can learn even when examples with noisy labels greatly outnumber the clean examples, as long as the number of clean examples is sufficiently large (Rolnick et al., 2017): with too few clean examples the pristine gradients are not strong enough to dominate.

2.3 Agreement with Experiment

Figure 1: Results of the experiment to reduce similarity by adding label noise (§2). (a) Training accuracy. (b) Validation accuracy. (c) Training loss. (d) First training step when an example is learned.

Figure 1(a) and (b) show the training and test curves for the baseline and the 4 variants. We note that for all 5 variants, at the end of training, we achieve 100% training accuracy but different amounts of generalization.

As expected, SGD is able to fit random labels, yet when trained on real data, generalizes well. Figure 1(c) shows the reduction in training loss over the course of training, and Figure 1(d) shows the fraction of pristine and corrupt labels learned as training progresses.

The results are in agreement with the qualitative predictions made above:

  1. In general, as noise increases, the time taken to reach a given level of accuracy (i.e., realized learning rate) increases.

  2. Pristine examples are learned faster than corrupt examples. They are learned at a rate closer to the 0% label noise rate whereas the corrupt examples are learned at a rate closer to the 100% label noise rate.

  3. With fewer pristine examples, their learning rate reduces. This is most clearly seen in the first few steps of training by comparing say 0% noise with 25% noise.

Using Equation 1, note that the magnitude of the slope of the training loss curve is a good measure of the square of the $l_2$-norm of the overall gradient. Therefore, from the loss curves of Figure 1(c), it is clear that in early training, the more the noise, the weaker the $l_2$-norm of the gradient. If we assume that the per-example $l_2$-norm is the same in all variants at the start of training, then from Equation 2, it is clear that with greater noise, the gradients are more dissimilar.

Finally, we note that this experiment is an instance where early stopping (e.g., Caruana et al. (2000)) is effective. Coherent gradients and the discussion in §2.2 provide some insight into this: Strong gradients both generalize well (they are stable since they are supported by many examples) and they bring the training loss down quickly for those examples. Thus early stopping maximizes the use of strong gradients and limits the impact of weak gradients. (The experiment in §3 discusses a different way to limit the impact of weak gradients and is an interesting point of comparison with early stopping.)

2.4 Analyzing Strong and Weak Gradients

Figure 2: Relative contributions of pristine (similar) and corrupt (dissimilar) examples to point-in-time loss reduction. To get a sense of statistical significance, we show the actual statistic as well as 3 simulations under the null assuming there is no difference. See §2.4.

Within each noisy dataset, we expect the pristine examples to be more similar to each other and the corrupt ones to be less similar. In turn, based on the training curves (particularly, Figure 1(d)), during the initial part of training, this should mean that the gradients from the pristine examples should be stronger than the gradients from the corrupt examples. We can study this effect via a different decomposition of the square of the $l_2$-norm of the gradient (or equivalently, up to a constant, the change in the loss function):

$$\langle g_t, g_t \rangle = \langle g_t, g_t^p + g_t^c \rangle = \langle g_t, g_t^p \rangle + \langle g_t, g_t^c \rangle$$

where $g_t^p$ and $g_t^c$ are the sum of the gradients of the pristine examples and corrupt examples respectively. (We cannot decompose the overall norm into a sum of norms of pristine and corrupt due to the cross terms $\langle g_t^p, g_t^c \rangle$. With this decomposition, we attribute the cross terms equally to both.) Now, set $f_t^p = \frac{\langle g_t, g_t^p \rangle}{\langle g_t, g_t \rangle}$ and $f_t^c = \frac{\langle g_t, g_t^c \rangle}{\langle g_t, g_t \rangle}$. Thus, $f_t^p$ and $f_t^c$ represent the fraction of the loss reduction due to pristine and corrupt examples at each time step respectively (and we have $f_t^p + f_t^c = 1$), and based on the foregoing, we expect the pristine fraction to be a larger fraction of the total when training starts and to diminish as training progresses and the pristine examples are fitted.
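Given a matrix of sampled per-example gradients at one step and the pristine/corrupt designations, these fractions (and the permutation null used below) can be estimated in a few lines. This is an illustrative sketch under our own naming, not the paper's instrumentation code:

```python
import numpy as np

def loss_reduction_fractions(per_example_grads, is_pristine):
    """per_example_grads: (num_examples, num_sampled_weights) array of gradients
    at one training step; is_pristine: boolean mask over examples.

    Returns (f_p, f_c), the fractions of <g, g> attributable to the pristine and
    corrupt groups (the cross terms are split between the two by construction)."""
    g = per_example_grads.sum(axis=0)                 # overall gradient g_t
    g_p = per_example_grads[is_pristine].sum(axis=0)  # pristine group gradient
    g_c = per_example_grads[~is_pristine].sum(axis=0) # corrupt group gradient
    total = np.dot(g, g)
    return np.dot(g, g_p) / total, np.dot(g, g_c) / total

def permutation_null(per_example_grads, is_pristine, rng):
    """Same statistic in a null world where the pristine/corrupt designations
    are randomly reassigned (keeping the group sizes fixed)."""
    shuffled = rng.permutation(is_pristine)
    return loss_reduction_fractions(per_example_grads, shuffled)
```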

The first row of Figure 2 shows a plot of estimates of $f_t^p$ and $f_t^c$ for 25%, 50% and 75% noise. These quantities were estimated by recording a sample of 400 per-example gradients for 600 weights (300 from each layer) in the network. We see that for 25% and 50% label noise, $f_t^p$ initially starts off higher than $f_t^c$ and after a few steps they cross over. This happens because at that point all the pristine examples have been fitted and for most of the rest of training the corrupt examples need to be fitted and so they largely contribute to the $l_2$-norm of the gradient (or equivalently by Equation 1 to loss reduction). Only at the end, when the corrupt examples have also been fit, do the two curves reach parity. In the case of 75% noise, we see that the cross over doesn't happen, but there is a slight slope downwards for the contribution from pristine examples.

We believe this is because of the sheer number of corrupt examples, and so even though the individual corrupt example gradients are weak, their sum dominates.

To get a sense of statistical significance in our hypothesis that there is a difference between the pristine and corrupt examples as a group, in the remaining rows of Figure 2, we construct a null world where there is no difference between pristine and corrupt. We do that by randomly permuting the "corrupt" and "pristine" designations among the examples (instead of using the actual designations) and replotting.

Although the null pristine and corrupt curves are mirror images (as they must be even in the null world since each example is given one of the two designations), we note that for 25% and 50% they do not cross over as they do with the real data. This increases our confidence that the null may be rejected.

The 75% case is weaker but only the real data shows the slight downward slope in pristine which none of the nulls typically show.

However, all the nulls do show that corrupt is more than pristine which increases our confidence that this is due to the significantly differing sizes of the two sets.

(Note that this happens in reverse in the 25% case: pristine is always above corrupt, but they never cross over in the null worlds.)

Figure 3: Contributions of mean pristine (similar) and corrupt (dissimilar) examples to loss reduction accumulated over the first few steps of training. To get a sense of statistical significance, we show the actual statistic as well as 3 simulations under the null assuming there is no difference. See §2.4.

To get a stronger signal for the difference between pristine and corrupt in the 75% case, we can look at a different statistic that adjusts for the different sizes of the pristine and corrupt sets. Let $|p|$ and $|c|$ be the number of pristine and corrupt examples respectively. Define

$$i_t^p := \frac{1}{|p|} \sum_{t'=0}^{t} \langle g_{t'}, g_{t'}^p \rangle \quad \textrm{and} \quad i_t^c := \frac{1}{|c|} \sum_{t'=0}^{t} \langle g_{t'}, g_{t'}^c \rangle$$

which represents, to first order and up to a scale factor ($\alpha$), the mean cumulative contribution of a pristine or corrupt example up until that point in training (since the total change in loss from the start of training to time $t$ is approximately the sum of first order changes in the loss at each time step).
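A minimal sketch of this per-example-normalized cumulative statistic, under the same assumptions and naming as the previous sketch (ours, not the paper's code):

```python
import numpy as np

def mean_cumulative_contributions(grads_per_step, is_pristine):
    """grads_per_step: list of (num_examples, num_sampled_weights) arrays, one
    per training step. Returns arrays i_p[t], i_c[t]: the cumulative contribution
    per pristine / corrupt example up to step t (up to the learning-rate factor)."""
    num_p = is_pristine.sum()
    num_c = (~is_pristine).sum()
    contrib_p, contrib_c = [], []
    for G in grads_per_step:
        g = G.sum(axis=0)
        contrib_p.append(np.dot(g, G[is_pristine].sum(axis=0)) / num_p)
        contrib_c.append(np.dot(g, G[~is_pristine].sum(axis=0)) / num_c)
    return np.cumsum(contrib_p), np.cumsum(contrib_c)
```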

The first row of Figure 3 shows $i_t^p$ and $i_t^c$ for the first 10 steps of training, where the difference between pristine and corrupt is the most pronounced.

As before, to give a sense of statistical significance, the remaining rows show the same plots in null worlds where we randomly permute the pristine or corrupt designations of the examples. The results appear somewhat significant but not overwhelmingly so.

It would be interesting to redo this on the entire population of examples and trainable parameters instead of a small sample.

3 Effect of Suppressing Weak Gradient Directions
"3Effect of Suppressing Weak Gradient Directions" 翻译为 "抑制弱梯度方向的效果"

In the second test of the Coherent Gradients hypothesis, we change GD itself in a very specific (and to our knowledge, novel) manner suggested by the theory. Our inspiration comes from random forests.

As noted in the introduction, by building sufficiently deep trees a random forest algorithm can get perfect training accuracy with random labels, yet generalize well when trained on real data. However, if we limit the tree construction algorithm to have a certain minimum number of examples in each leaf, then it no longer overfits. In the case of GD, we can do something similar by suppressing the weak gradient directions.

3.1 Setup

Our baseline setup is the same as before (§2.1) but we add a new dimension by modifying SGD to update each parameter with a "winsorized" gradient where we clip the most extreme values (outliers) among all the per-example gradients. Formally, let $g_{we}$ be the gradient for the trainable parameter $w$ for example $e$. The usual gradient computation for $w$ is

$$g_w = \sum_e g_{we}$$

Now let $c \in [0, 50]$ be a hyperparameter that controls the level of winsorization. Define $l_w$ to be the $c$-th percentile of $g_{we}$ taken over the examples. Similarly, let $u_w$ be the $(100 - c)$-th percentile. Now, compute the $c$-winsorized gradient for $w$ (denoted by $g_w^c$) as follows:

$$g_w^c := \sum_e \mathrm{clip}(g_{we}, l_w, u_w)$$

The change to gradient descent is to simply use $g_w^c$ instead of $g_w$ when updating $w$ at each step.

Note that although this is conceptually a simple change, it is computationally very expensive due to the need for per-example gradients. To reduce the computational cost we only use the examples in the minibatch to compute $l_w$ and $u_w$. Furthermore, instead of using 1 hidden layer of 2048 ReLUs, we use a smaller network with 3 hidden layers of 256 ReLUs each, and train for 60,000 steps (i.e., 100 epochs) with a fixed learning rate of 0.1. We train on the baseline dataset and the 4 noisy variants with $c \in \{0, 1, 2, 4, 8\}$. Since we have 100 examples in each minibatch, the value of $c$ immediately tells us how many outliers are clipped in each minibatch. For example, $c = 2$ means the 2 largest and 2 lowest values of the per-example gradient are clipped (independently for each trainable parameter in the network), and $c = 0$ corresponds to unmodified SGD.
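A minimal NumPy sketch of the per-minibatch winsorized gradient for a single step (ours, not the paper's implementation; the percentile-based clipping approximates dropping the $c$ most extreme values on each side of a 100-example minibatch):

```python
import numpy as np

def winsorized_gradient(per_example_grads, c):
    """per_example_grads: (batch_size, num_params) array with one gradient row
    per example; c: winsorization level in [0, 50].

    Independently for each parameter, clips per-example gradient values below
    the c-th percentile or above the (100 - c)-th percentile, then sums."""
    lower = np.percentile(per_example_grads, c, axis=0)
    upper = np.percentile(per_example_grads, 100 - c, axis=0)
    return np.clip(per_example_grads, lower, upper).sum(axis=0)

# Hypothetical usage inside a training step (learning rate 0.1 as in the setup):
# params -= 0.1 * winsorized_gradient(per_example_grads, c=2)
```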

3.2 Qualitative Predictions

If the Coherent Gradient hypothesis is right, then the strong gradients are responsible for making changes to the network that generalize well since they improve many examples simultaneously.

On the other hand, the weak gradients lead to overfitting since they only improve a few examples. By winsorizing each coordinate, we suppress the most extreme values and thus ensure that a parameter is only updated in a manner that benefits multiple examples. Therefore:

  • Since $c$ controls which examples are considered extreme, the larger $c$ is, the less we expect the network to overfit.

  • But this also makes it harder for the network to fit the training data, and so we expect the training accuracy to fall as well.

  • Winsorization will not completely eliminate the weak directions. For example, for small values of $c$ we should still expect overfitting to happen over time though at a reduced rate since only the most egregious outliers are suppressed.

3.3 Agreement with Experiment

Figure 4: Effect of suppressing weak gradient directions by eliminating outlier per-example gradients. This is done independently for each trainable parameter. Overfit is measured after accounting for the fact that test labels are not randomized (§3.3).

The resulting training and test curves are shown in Figure 4. The columns correspond to different amounts of label noise and the rows to different amounts of winsorization. In addition to the training and test accuracies ($\mathsf{ta}$ and $\mathsf{va}$, respectively), we show the level of overfit, which is defined as $\mathsf{ta} - [\epsilon \cdot \frac{1}{10} + (1 - \epsilon) \cdot \mathsf{va}]$ to account for the fact that the test labels are not randomized.
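As a small worked example of this overfit measure (assuming, as the context suggests, that $\epsilon$ denotes the label-noise fraction of the training set; the function name is ours):

```python
def overfit(train_acc, test_acc, noise_frac):
    """Training accuracy minus the noise-adjusted target: corrupted labels are
    matched about 1/10 of the time, pristine labels at roughly the test accuracy."""
    return train_acc - (noise_frac * 0.1 + (1.0 - noise_frac) * test_acc)

# e.g., 25% label noise, 100% training accuracy, 95% test accuracy:
print(overfit(1.00, 0.95, 0.25))   # 0.2625
```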

We see that the experimental results are in agreement with the predictions above. In particular,

  • For $c > 1$, training accuracies do not exceed the proper accuracy of the dataset, though they may fall short, especially for large values of $c$.

  • The rate at which the overfit curve grows goes down with increasing $c$.

Additionally, we notice that with a large amount of winsorization, the training and test accuracies reach a maximum and then go down. Part of the reason is that as a result of winsorization, each step is no longer in a descent direction, i.e., this is no longer gradient descent.

4 Discussion and Related Work

Although there has been a lot of work in recent years in trying to understand generalization in Deep Learning, no entirely satisfactory explanation has emerged so far.

There is a rich literature on aspects of the stochastic optimization problem such as the loss landscape and minima (e.g., Choromanska et al. (2015); Zhu et al. (2018)), the curvature around stationary points (e.g., Hochreiter & Schmidhuber (1997); Keskar et al. (2016); Dinh et al. (2017); Wu et al. (2018)), and the implications of stochasticity due to sampling in SGD (e.g., Simsekli et al. (2019)). However, we believe it should be possible to understand generalization without a detailed understanding of the optimization landscape.

For example, since stopping early typically leads to small generalization gap, the nature of the solutions of GD (e.g., stationary points, the limit cycles of SGD at equilibrium) cannot be solely responsible for generalization. In fact, from this observation, it would appear that an inductive argument for generalization would be more natural. Likewise, there is reason to believe that stochasticity is not fundamental to generalization (though it may help).

For example, modifying the experiment in §2.1 to use full batch leads to similar qualitative generalization results. This is consistent with other small scale studies (e.g., Figure 1 of Wu et al. (2018)) though we are not aware of any large scale studies on full batch.

Our view of optimization is a simple, almost combinatorial, one: gradient descent is a greedy search with some hill-climbing thrown in (due to sampling in SGD and finite step size).

Therefore, we worry less about the quality of the solutions reached and more about staying "feasible" at all times during the search.

In our context, feasibility means being able to generalize; and this naturally leads us to look at the transition dynamics to see if that preserves generalizability.

Another approach to understanding generalization, is to argue that gradient-based optimization induces a form of implicit regularization leading to a bias towards models of low complexity. This is an extension of the classical approach where bounding a complexity measure leads to bounds on the generalization gap.

As is well known, classical measures of complexity (also called capacity) do not work well. For example, sometimes adding more parameters to a net can help generalization (see for e.g. Lawrence et al. (1996); Neyshabur et al. (2018)) and, as we have seen, VC-Dimension and Rademacher Complexity-based bounds must be vacuous since networks can memorize random labels and yet generalize on real data. This has led to a lot of recent work in identifying better measures of complexity such as spectrally-normalized margin (Bartlett et al., 2017), path-based group norm (Neyshabur et al., 2018), a compression-based approach (Arora et al., 2018), etc. However, to our knowledge, none of these measures is entirely satisfactory for accounting for generalization in practice. Please see Nagarajan & Kolter (2019) for an excellent discussion of the challenges.

We rely on a different classical notion to argue generalization: algorithmic stability (see Bousquet & Elisseeff (2002) for a historical overview). We have provided only an informal argument in Section 1, but there has been prior work by Hardt et al. (2016) in looking at GD and SGD through the lens of stability, but their formal results do not explain generalization in practical settings (e.g., multiple epochs of training and non-convex objectives).

In fact, such an attempt appears unlikely to work since our experimental results imply that any stability bounds for SGD that do not account for the actual training data must be vacuous! (This was also noted by Zhang et al. (2017).) That said, we believe stability is the right way to think about generalization in GD for a few reasons. First, since by Shalev-Shwartz et al. (2010) stability, suitably formalized, is equivalent to generalization. Therefore, in principle, any explanation of generalizability for a learning problem must—to borrow a term from category theory—factor through stability.

Second, a stability based analysis may be more amenable to taking the actual training data into account (perhaps by using a “stability accountant” similar to a privacy accountant) which appears necessary to get non-vacuous bounds for practical networks and datasets.

Finally, as we have seen with the modification in §3, a stability based approach is not just descriptive but prescriptive² and can point the way to better learning algorithms.

² See https://www.offconvex.org/2017/12/08/generalization1/ for a nice discussion of the difference.

Finally, we look at two relevant lines of work pointed out by a reviewer. First, Rahaman et al. (2019) compute the Fourier spectrum of ReLU networks and argue based on heuristics and experiments that these networks learn low frequency functions first. In contrast, we focus not on the function learnt, but on the mechanism in GD to detect commonality.

This leads to a perspective that is at once simpler and more general (for e.g., it applies equally to networks with other activation functions, with attention, LSTMs, and discrete (combinatorial) inputs).

Furthermore, it opens up a path to analyzing generalization via stability. It is not clear if Rahaman et al. (2019) claim a causal mechanism, but their analysis does not suggest an obvious intervention experiment such as ours of §3 to test causality. There are other experimental results that show biases towards linear functions (Nakkiran et al., 2019) and functions with low descriptive complexity (Valle-Perez et al., 2019) but these papers do not posit a causal mechanism. It is interesting to consider if Coherent Gradients can provide a unified explanation for these observed biases.

Second, Fort et al. (2019) propose a descriptive statistic stiffness based on pairwise per-example gradients and show experimentally that it can be used to characterize generalization. Sankararaman et al. (2019) propose a very similar statistic called gradient confusion but use it to study the speed of training. Unlike our work, these do not propose causal mechanisms for generalization, but these statistics (which are different from those in §2.4) could be useful for the further study of Coherent Gradients.

5 Directions for Future Work

Does the Coherent Gradients hypothesis hold in other settings such as BERT, ResNet, etc.? For that we would need to develop more computationally efficient tests.

Can we use the state of the network to explicitly characterize which examples are considered similar and study this evolution in the course of training? We expect non-parametric methods for similarity such as those developed in Chatterjee & Mishchenko (2019) and their characterization of “easy” examples (i.e., examples learnt early as per Arpit et al. (2017)) as those with many others like them, to be useful in this context.

Can Coherent Gradients explain adversarial initializations (Liu et al., 2019)? The adversarial initial state makes semantically similar examples purposefully look different. Therefore, during training, they continue to be treated differently (i.e., their gradients share less in common than they would if starting from a random initialization).

Thus, fitting is more case-by-case and while it achieves good final training accuracy, it does not generalize.

Can Coherent Gradients along with the Lottery Ticket Hypothesis (Frankle & Carbin, 2018) explain the observation in Neyshabur et al. (2018) that wider networks generalize better? By Lottery Ticket, wider networks provide more chances to find initial gradient directions that improve many examples, and by Coherent Gradients, these popular hypotheses are learned preferentially (faster).

Can we use the ideas behind Winsorized SGD from §3 to develop a computationally efficient learning algorithm with generalization (and even privacy) guarantees? How do winsorized gradients compare in practice to the algorithm proposed in Abadi et al. (2016) for privacy? Last, but not least, can we use the insights from this work to design learning algorithms that operate natively on discrete networks?

Acknowledgments

I thank Alan Mishchenko, Shankar Krishnan, Piotr Zielinski, Chandramouli Kashyap, Sergey Ioffe, Michele Covell, and Jay Yagnik for helpful discussions.

References

  • Abadi et al. (2016)
    Abadi et al. (2016) -> Abadi等人(2016年)
    Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pp.  308–318, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4139-4. doi: 10.1145/2976749.2978318 . URL http://doi.acm.org/10.1145/2976749.2978318 .
  • Arora et al. (2018)
    Arora 等人 (2018)
    Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  254–263. PMLR, 2018. URL http://proceedings.mlr.press/v80/arora18b.html.
    Sanjeev Arora, Rong Ge, Behnam Neyshabur, 和 Yi Zhang. 通过压缩方法加强深度神经网络的更强泛化界限. 在 Jennifer G. Dy 和 Andreas Krause (eds.) 主编的《第 35 届国际机器学习大会论文集, ICML 2018》, 于 2018 年 7 月 10-15 日在瑞典斯德哥尔摩的 Stockholmsmässan 举行, 机器学习研究论文集第 80 卷, 页码 254–263. PMLR, 2018. URL http://proceedings.mlr.press/v80/arora18b.html.
  • Arpit et al. (2017)
    Arpit 等人 (2017)
    Devansh Arpit, Stanislaw K. Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 233–242, 2017. URL http://proceedings.mlr.press/v70/arpit17a.html.
    Devansh Arpit, Stanislaw K. Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. 深度网络中记忆化的更近距离观察。在第 34 届国际机器学习会议(ICML 2017)论文集中,2017 年 8 月 6-11 日,澳大利亚新南威尔士州悉尼,第 233-242 页,2017 年。网址 http://proceedings.mlr.press/v70/arpit17a.html。
  • Bartlett et al. (2017)
    Bartlett 等人 (2017)
    Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 6240–6249. Curran Associates, Inc., 2017.
  • Belkin et al. (2019)
    Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1903070116. URL https://www.pnas.org/content/116/32/15849.
  • Bousquet & Elisseeff (2002)
    Olivier Bousquet and André Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499–526, March 2002. ISSN 1532-4435. doi: 10.1162/153244302760200704. URL https://doi.org/10.1162/153244302760200704.
  • Caruana et al. (2000)
    Rich Caruana, Steve Lawrence, and Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS’00, pp. 381–387, Cambridge, MA, USA, 2000. MIT Press. URL http://dl.acm.org/citation.cfm?id=3008751.3008807.
  • Chatterjee & Mishchenko (2019)
    Satrajit Chatterjee and Alan Mishchenko. Circuit-based intrinsic methods to detect overfitting. CoRR, abs/1907.01991, 2019. URL http://arxiv.org/abs/1907.01991.
  • Choromanska et al. (2015)
    Anna Choromanska, Mikael Henaff, Michaël Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Guy Lebanon and S. V. N. Vishwanathan (eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, volume 38 of JMLR Workshop and Conference Proceedings. JMLR.org, 2015. URL http://proceedings.mlr.press/v38/choromanska15.html.
  • Dinh et al. (2017)
    Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. CoRR, abs/1703.04933, 2017. URL http://arxiv.org/abs/1703.04933.
  • Fort et al. (2019)
    Stanislav Fort, Pawel Krzysztof Nowak, and Srini Narayanan. Stiffness: A new perspective on generalization in neural networks. CoRR, abs/1901.09491, 2019. URL http://arxiv.org/abs/1901.09491.
  • Frankle & Carbin (2018)
    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Training pruned neural networks. CoRR, abs/1803.03635, 2018. URL http://arxiv.org/abs/1803.03635.
  • Hardt et al. (2016)
    Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1225–1234. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045520.
  • Hochreiter & Schmidhuber (1997)
    Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Comput., 9(1):1–42, January 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.1.1. URL http://dx.doi.org/10.1162/neco.1997.9.1.1.
  • Kawaguchi et al. (2017)
    K. Kawaguchi, L. Pack Kaelbling, and Y. Bengio. Generalization in Deep Learning. ArXiv e-prints, December 2017. URL https://arxiv.org/abs/1710.05468v2.
  • Keskar et al. (2016)
    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836, 2016. URL http://arxiv.org/abs/1609.04836.
  • Lawrence et al. (1996)
    Steve Lawrence, C. Lee Giles, and Ah Chung Tsoi. What size neural network gives optimal generalization? Convergence properties of backpropagation. Technical report, 1996.
  • Liu et al. (2019)
    Shengchao Liu, Dimitris S. Papailiopoulos, and Dimitris Achlioptas. Bad global minima exist and SGD can reach them. CoRR, abs/1906.02613, 2019. URL http://arxiv.org/abs/1906.02613.
  • Nagarajan & Kolter (2019)
    Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 11611–11622, 2019.
  • Nakkiran et al. (2019)
    Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. CoRR, abs/1905.11604, 2019. URL http://arxiv.org/abs/1905.11604.
  • Neyshabur et al. (2018)
    Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. CoRR, abs/1805.12076, 2018. URL http://arxiv.org/abs/1805.12076.
  • Rahaman et al. (2019)
    Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5301–5310, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/rahaman19a.html.
  • Rolnick et al. (2017)
    David Rolnick, Andreas Veit, Serge J. Belongie, and Nir Shavit. Deep learning is robust to massive label noise. CoRR, abs/1705.10694, 2017. URL http://arxiv.org/abs/1705.10694.
  • Sankararaman et al. (2019)
    Karthik Abinav Sankararaman, Soham De, Zheng Xu, W. Ronny Huang, and Tom Goldstein. The impact of neural network overparameterization on gradient confusion and stochastic gradient descent. CoRR, abs/1904.06963, 2019. URL http://arxiv.org/abs/1904.06963.
  • Shalev-Shwartz et al. (2010)
    Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635–2670, December 2010. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1756006.1953019.
  • Simsekli et al. (2019)
    Umut Simsekli, Levent Sagun, and Mert Gürbüzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 5827–5837. PMLR, 2019. URL http://proceedings.mlr.press/v97/simsekli19a.html.
  • Valle-Perez et al. (2019)
    Guillermo Valle-Perez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rye4g3AqFm.
  • Wu et al. (2018)
    Lei Wu, Chao Ma, and Weinan E. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 8279–8288. Curran Associates, Inc., 2018.
  • Zhang et al. (2017)
    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations ICLR, 2017.
  • Zhu et al. (2018)
    Zhihui Zhu, Daniel Soudry, Yonina C. Eldar, and Michael B. Wakin. The global optimization geometry of shallow linear neural networks. CoRR, abs/1805.04938, 2018. URL http://arxiv.org/abs/1805.04938.