
Explaining and Harnessing
Adversarial Examples

Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy

Google Inc., Mountain View, CA
{goodfellow,shlens,szegedy}@google.com

Abstract

Several machine learning models, including neural networks, consistently misclassify adversarial examples—inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks’ vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.

1 Introduction

Szegedy et al. (2014b) made an intriguing discovery: several machine learning models, including state-of-the-art neural networks, are vulnerable to adversarial examples. That is, these machine learning models misclassify examples that are only slightly different from correctly classified examples drawn from the data distribution. In many cases, a wide variety of models with different architectures trained on different subsets of the training data misclassify the same adversarial example. This suggests that adversarial examples expose fundamental blind spots in our training algorithms.

The cause of these adversarial examples was a mystery, and speculative explanations have suggested it is due to extreme nonlinearity of deep neural networks, perhaps combined with insufficient model averaging and insufficient regularization of the purely supervised learning problem. We show that these speculative hypotheses are unnecessary. Linear behavior in high-dimensional spaces is sufficient to cause adversarial examples. This view enables us to design a fast method of generating adversarial examples that makes adversarial training practical. We show that adversarial training can provide an additional regularization benefit beyond that provided by using dropout (Srivastava et al., 2014) alone. Generic regularization strategies such as dropout, pretraining, and model averaging do not confer a significant reduction in a model’s vulnerability to adversarial examples, but changing to nonlinear model families such as RBF networks can do so.

Our explanation suggests a fundamental tension between designing models that are easy to train due to their linearity and designing models that use nonlinear effects to resist adversarial perturbation. In the long run, it may be possible to escape this tradeoff by designing more powerful optimization methods that can successfully train more nonlinear models.

2 Related work

Szegedy et al. (2014b) demonstrated a variety of intriguing properties of neural networks and related models. Those most relevant to this paper include:

  • Box-constrained L-BFGS can reliably find adversarial examples.
  • On some datasets, such as ImageNet (Deng et al., 2009), the adversarial examples were so close to the original examples that the differences were indistinguishable to the human eye.
  • The same adversarial example is often misclassified by a variety of classifiers with different architectures or trained on different subsets of the training data.
  • Shallow softmax regression models are also vulnerable to adversarial examples.
  • Training on adversarial examples can regularize the model—however, this was not practical at the time due to the need for expensive constrained optimization in the inner loop.

These results suggest that classifiers based on modern machine learning techniques, even those that obtain excellent performance on the test set, are not learning the true underlying concepts that determine the correct output label. Instead, these algorithms have built a Potemkin village that works well on naturally occurring data, but is exposed as a fake when one visits points in space that do not have high probability in the data distribution. This is particularly disappointing because a popular approach in computer vision is to use convolutional network features as a space where Euclidean distance approximates perceptual distance. This resemblance is clearly flawed if images that have an immeasurably small perceptual distance correspond to completely different classes in the network’s representation.

These results have often been interpreted as being a flaw in deep networks in particular, even though linear classifiers have the same problem. We regard the knowledge of this flaw as an opportunity to fix it. Indeed, Gu & Rigazio (2014) and Chalupka et al. (2014) have already begun the first steps toward designing models that resist adversarial perturbation, though no model has yet successfully done so while maintaining state of the art accuracy on clean inputs.

3 The linear explanation of adversarial examples

We start with explaining the existence of adversarial examples for linear models.

In many problems, the precision of an individual input feature is limited. For example, digital images often use only 8 bits per pixel so they discard all information below $1/255$ of the dynamic range. Because the precision of the features is limited, it is not rational for the classifier to respond differently to an input $\bm{x}$ than to an adversarial input $\tilde{\bm{x}} = \bm{x} + \bm{\eta}$ if every element of the perturbation $\bm{\eta}$ is smaller than the precision of the features. Formally, for problems with well-separated classes, we expect the classifier to assign the same class to $\bm{x}$ and $\tilde{\bm{x}}$ so long as $||\bm{\eta}||_\infty < \epsilon$, where $\epsilon$ is small enough to be discarded by the sensor or data storage apparatus associated with our problem.

Consider the dot product between a weight vector $\bm{w}$ and an adversarial example $\tilde{\bm{x}}$:

$$\bm{w}^\top \tilde{\bm{x}} = \bm{w}^\top \bm{x} + \bm{w}^\top \bm{\eta}.$$

The adversarial perturbation causes the activation to grow by $\bm{w}^\top \bm{\eta}$. We can maximize this increase subject to the max norm constraint on $\bm{\eta}$ by assigning $\bm{\eta} = \text{sign}(\bm{w})$. If $\bm{w}$ has $n$ dimensions and the average magnitude of an element of the weight vector is $m$, then the activation will grow by $\epsilon m n$. Since $||\bm{\eta}||_\infty$ does not grow with the dimensionality of the problem but the change in activation caused by perturbation by $\bm{\eta}$ can grow linearly with $n$, then for high dimensional problems, we can make many infinitesimal changes to the input that add up to one large change to the output. We can think of this as a sort of “accidental steganography,” where a linear model is forced to attend exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude.

This explanation shows that a simple linear model can have adversarial examples if its input has sufficient dimensionality. Previous explanations for adversarial examples invoked hypothesized properties of neural networks, such as their supposed highly non-linear nature. Our hypothesis based on linearity is simpler, and can also explain why softmax regression is vulnerable to adversarial examples.
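The following minimal NumPy sketch (not from the paper; the random weight vector and dimensionalities are arbitrary placeholders) illustrates the argument above: the perturbation $\epsilon\,\text{sign}(\bm{w})$ never exceeds $\epsilon$ in any coordinate, yet the resulting change in $\bm{w}^\top \bm{x}$ grows with the dimension roughly as $\epsilon m n$.

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_gap(n, epsilon=0.007):
    """Change in w^T x caused by the worst-case max-norm perturbation eta = epsilon * sign(w)."""
    w = rng.normal(size=n)           # a random linear model with n input dimensions
    x = rng.normal(size=n)           # an arbitrary clean input
    eta = epsilon * np.sign(w)       # each element is at most epsilon in absolute value
    gap = w @ (x + eta) - w @ x      # equals epsilon * ||w||_1
    return gap, epsilon * np.abs(w).mean() * n   # epsilon * m * n with m = mean |w_i|

for n in (10, 1_000, 100_000):
    gap, emn = activation_gap(n)
    print(f"n = {n:>6}: activation gap = {gap:10.2f}, epsilon*m*n = {emn:10.2f}")
```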

4 Linear perturbation of non-linear models

The linear view of adversarial examples suggests a fast way of generating them. We hypothesize that neural networks are too linear to resist linear adversarial perturbation. LSTMs (Hochreiter & Schmidhuber, 1997), ReLUs (Jarrett et al., 2009; Glorot et al., 2011), and maxout networks (Goodfellow et al., 2013c) are all intentionally designed to behave in very linear ways, so that they are easier to optimize. More nonlinear models such as sigmoid networks are carefully tuned to spend most of their time in the non-saturating, more linear regime for the same reason. This linear behavior suggests that cheap, analytical perturbations of a linear model should also damage neural networks.

Let $\bm{\theta}$ be the parameters of a model, $\bm{x}$ the input to the model, $y$ the targets associated with $\bm{x}$ (for machine learning tasks that have targets) and $J(\bm{\theta}, \bm{x}, y)$ be the cost used to train the neural network. We can linearize the cost function around the current value of $\bm{\theta}$, obtaining an optimal max-norm constrained perturbation of

$$\bm{\eta} = \epsilon\, \text{sign}\left(\nabla_{\bm{x}} J(\bm{\theta}, \bm{x}, y)\right).$$

We refer to this as the “fast gradient sign method” of generating adversarial examples. Note that the required gradient can be computed efficiently using backpropagation.
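As an illustrative sketch (ours, not the released experimental code), the fast gradient sign method can be written down directly for a softmax regression model, where the input gradient of the cross-entropy cost has the closed form $W^\top(p - \text{onehot}(y))$; for a deep network the same gradient would instead be obtained by backpropagation. The weights and label below are placeholders.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fgsm_softmax_regression(W, b, x, y, epsilon):
    """Return x + epsilon * sign(dJ/dx) for a softmax regression model.

    For the cross-entropy cost J, the input gradient is W^T (p - onehot(y)),
    so no explicit backpropagation code is needed in this special case.
    """
    p = softmax(W @ x + b)
    p[y] -= 1.0                       # p - onehot(y)
    grad_x = W.T @ p                  # dJ/dx
    return x + epsilon * np.sign(grad_x)

# Hypothetical usage on an MNIST-sized input with placeholder weights:
rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.01, size=(10, 784)), np.zeros(10)
x, y = rng.uniform(size=784), 3
x_adv = np.clip(fgsm_softmax_regression(W, b, x, y, epsilon=0.25), 0.0, 1.0)
```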

We find that this method reliably causes a wide variety of models to misclassify their input. See Fig. 1 for a demonstration on ImageNet. We find that using $\epsilon = .25$, we cause a shallow softmax classifier to have an error rate of 99.9% with an average confidence of 79.3% on the MNIST (LeCun+98) test set. (This is using MNIST pixel values in the interval [0, 1]. MNIST data does contain values other than 0 or 1, but the images are essentially binary. Each pixel roughly encodes “ink” or “no ink”. This justifies expecting the classifier to be able to handle perturbations within a range of width 0.5, and indeed human observers can read such images without difficulty.) In the same setting, a maxout network misclassifies 89.4% of our adversarial examples with an average confidence of 97.6%. Similarly, using $\epsilon = .1$, we obtain an error rate of 87.15% and an average probability of 96.6% assigned to the incorrect labels when using a convolutional maxout network on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 2009) test set. (See https://github.com/lisa-lab/pylearn2/tree/master/pylearn2/scripts/papers/maxout for the preprocessing code, which yields a standard deviation of roughly 0.5.) Other simple methods of generating adversarial examples are possible. For example, we also found that rotating $\bm{x}$ by a small angle in the direction of the gradient reliably produces adversarial examples.

The fact that these simple, cheap algorithms are able to generate misclassified examples serves as evidence in favor of our interpretation of adversarial examples as a result of linearity. The algorithms are also useful as a way of speeding up adversarial training or even just analysis of trained networks.

[Figure 1: three image panels showing $\bm{x}$ (“panda”, 57.7% confidence), $+\,.007\,\times\,\text{sign}(\nabla_{\bm{x}} J(\bm{\theta}, \bm{x}, y))$ (“nematode”, 8.2% confidence), and $\bm{x} + \epsilon\,\text{sign}(\nabla_{\bm{x}} J(\bm{\theta}, \bm{x}, y))$ (“gibbon”, 99.3% confidence).]
Figure 1: A demonstration of fast adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet’s classification of the image. Here our $\epsilon$ of .007 corresponds to the magnitude of the smallest bit of an 8 bit image encoding after GoogLeNet’s conversion to real numbers.

5 Adversarial training of linear models versus weight decay

Perhaps the simplest possible model we can consider is logistic regression. In this case, the fast gradient sign method is exact. We can use this case to gain some intuition for how adversarial examples are generated in a simple setting. See Fig. 2 for instructive images.

If we train a single model to recognize labels $y \in \{-1, 1\}$ with $P(y=1) = \sigma\left(\bm{w}^\top \bm{x} + b\right)$ where $\sigma(z)$ is the logistic sigmoid function, then training consists of gradient descent on

$$\mathbb{E}_{\bm{x}, y \sim p_{\text{data}}}\, \zeta(-y(\bm{w}^\top \bm{x} + b))$$

where $\zeta(z) = \log\left(1 + \exp(z)\right)$ is the softplus function. We can derive a simple analytical form for training on the worst-case adversarial perturbation of $\bm{x}$ rather than $\bm{x}$ itself, based on gradient sign perturbation. Note that the sign of the gradient is just $-\text{sign}(\bm{w})$, and that $\bm{w}^\top \text{sign}(\bm{w}) = ||\bm{w}||_1$. The adversarial version of logistic regression is therefore to minimize

$$\mathbb{E}_{\bm{x}, y \sim p_{\text{data}}}\, \zeta(y(\epsilon ||\bm{w}||_1 - \bm{w}^\top \bm{x} - b)).$$

This is somewhat similar to $L^1$ regularization. However, there are some important differences. Most significantly, the $L^1$ penalty is subtracted off the model’s activation during training, rather than added to the training cost. This means that the penalty can eventually start to disappear if the model learns to make confident enough predictions that $\zeta$ saturates. This is not guaranteed to happen—in the underfitting regime, adversarial training will simply worsen underfitting. We can thus view $L^1$ weight decay as being more “worst case” than adversarial training, because it fails to deactivate in the case of good margin.
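A small numerical sketch of this difference (our illustration; the weight vector and data point are made up) shows the adversarial term $\epsilon||\bm{w}||_1$ sitting inside the softplus, where it can saturate away on a well-classified point, while an $L^1$ penalty added to the cost never switches off:

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)          # zeta(z) = log(1 + exp(z)), numerically stable

def adversarial_logistic_loss(w, b, x, y, epsilon):
    """Worst-case loss from the displayed expression: zeta(y * (epsilon*||w||_1 - w.x - b))."""
    return softplus(y * (epsilon * np.abs(w).sum() - w @ x - b))

def l1_penalized_loss(w, b, x, y, lam):
    """Ordinary logistic loss with an L1 penalty that is added to the cost unconditionally."""
    return softplus(-y * (w @ x + b)) + lam * np.abs(w).sum()

# A confidently, correctly classified point: the adversarial term saturates toward zero,
# while the L1 penalty remains no matter how good the margin is.
w, b = np.array([2.0, -3.0, 1.0]), 0.0
x, y = np.array([1.0, -1.0, 1.0]), 1      # margin y * (w.x + b) = 6
print(adversarial_logistic_loss(w, b, x, y, epsilon=0.25))   # ~0.01
print(l1_penalized_loss(w, b, x, y, lam=0.25))               # ~1.50
```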

If we move beyond logistic regression to multiclass softmax regression, $L^1$ weight decay becomes even more pessimistic, because it treats each of the softmax’s outputs as independently perturbable, when in fact it is usually not possible to find a single $\bm{\eta}$ that aligns with all of the class’s weight vectors. Weight decay overestimates the damage achievable with perturbation even more in the case of a deep network with multiple hidden units. Because $L^1$ weight decay overestimates the amount of damage an adversary can do, it is necessary to use a smaller $L^1$ weight decay coefficient than the $\epsilon$ associated with the precision of our features. When training maxout networks on MNIST, we obtained good results using adversarial training with $\epsilon = .25$. When applying $L^1$ weight decay to the first layer, we found that even a coefficient of .0025 was too large, and caused the model to get stuck with over 5% error on the training set. Smaller weight decay coefficients permitted successful training but conferred no regularization benefit.

[Figure 2: four image panels, (a)-(d).]
Figure 2: The fast gradient sign method applied to logistic regression (where it is not an approximation, but truly the most damaging adversarial example in the max norm box). a) The weights of a logistic regression model trained on MNIST. b) The sign of the weights of a logistic regression model trained on MNIST. This is the optimal perturbation. Even though the model has low capacity and is fit well, this perturbation is not readily recognizable to a human observer as having anything to do with the relationship between 3s and 7s. c) MNIST 3s and 7s. The logistic regression model has a 1.6% error rate on the 3 versus 7 discrimination task on these examples. d) Fast gradient sign adversarial examples for the logistic regression model with $\epsilon = .25$. The logistic regression model has an error rate of 99% on these examples.

6 Adversarial training of deep networks

The criticism of deep networks as vulnerable to adversarial examples is somewhat misguided, because unlike shallow linear models, deep networks are at least able to represent functions that resist adversarial perturbation. The universal approximator theorem (Hornik et al., 1989) guarantees that a neural network with at least one hidden layer can represent any function to an arbitrary degree of accuracy so long as its hidden layer is permitted to have enough units. Shallow linear models are not able to become constant near training points while also assigning different outputs to different training points.

Of course, the universal approximator theorem does not say anything about whether a training algorithm will be able to discover a function with all of the desired properties. Obviously, standard supervised training does not specify that the chosen function be resistant to adversarial examples. This must be encoded in the training procedure somehow.

Szegedy et al. (2014b) showed that by training on a mixture of adversarial and clean examples, a neural network could be regularized somewhat. Training on adversarial examples is somewhat different from other data augmentation schemes; usually, one augments the data with transformations such as translations that are expected to actually occur in the test set. This form of data augmentation instead uses inputs that are unlikely to occur naturally but that expose flaws in the ways that the model conceptualizes its decision function. At the time, this procedure was never demonstrated to improve beyond dropout on a state of the art benchmark. However, this was partially because it was difficult to experiment extensively with expensive adversarial examples based on L-BFGS.

We found that training with an adversarial objective function based on the fast gradient sign method was an effective regularizer:

$$\tilde{J}(\bm{\theta}, \bm{x}, y) = \alpha J(\bm{\theta}, \bm{x}, y) + (1 - \alpha) J\left(\bm{\theta}, \bm{x} + \epsilon\, \text{sign}\left(\nabla_{\bm{x}} J(\bm{\theta}, \bm{x}, y)\right)\right).$$

In all of our experiments, we used $\alpha = 0.5$. Other values may work better; our initial guess of this hyperparameter worked well enough that we did not feel the need to explore more. This approach means that we continually update our supply of adversarial examples, to make them resist the current version of the model. Using this approach to train a maxout network that was also regularized with dropout, we were able to reduce the error rate from 0.94% without adversarial training to 0.84% with adversarial training.
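A minimal sketch of one step of this objective, written for a plain softmax regression model rather than the maxout-with-dropout networks used in the experiments; $\alpha$, $\epsilon$, and the learning rate below are placeholders:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grads(W, b, x, y):
    """Cross-entropy gradients for softmax regression: dJ/dW, dJ/db, dJ/dx."""
    delta = softmax(W @ x + b)
    delta[y] -= 1.0                          # p - onehot(y)
    return np.outer(delta, x), delta, W.T @ delta

def adversarial_training_step(W, b, x, y, alpha=0.5, epsilon=0.25, lr=0.1):
    """One SGD step on J~ = alpha * J(x) + (1 - alpha) * J(x + epsilon * sign(dJ/dx))."""
    dW_clean, db_clean, dx = grads(W, b, x, y)
    x_adv = x + epsilon * np.sign(dx)        # fresh adversarial example for the current parameters
    dW_adv, db_adv, _ = grads(W, b, x_adv, y)
    W -= lr * (alpha * dW_clean + (1 - alpha) * dW_adv)
    b -= lr * (alpha * db_clean + (1 - alpha) * db_adv)
    return W, b
```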

We observed that we were not reaching zero error rate on adversarial examples on the training set. We fixed this problem by making two changes. First, we made the model larger, using 1600 units per layer rather than the 240 used by the original maxout network for this problem. Without adversarial training, this causes the model to overfit slightly, and get an error rate of 1.14% on the test set. With adversarial training, we found that the validation set error leveled off over time, and made very slow progress. The original maxout result uses early stopping, and terminates learning after the validation set error rate has not decreased for 100 epochs. We found that while the validation set error was very flat, the adversarial validation set error was not. We therefore used early stopping on the adversarial validation set error. Using this criterion to choose the number of epochs to train for, we then retrained on all 60,000 examples. Five different training runs using different seeds for the random number generators used to select minibatches of training examples, initialize model weights, and generate dropout masks result in four trials that each had an error rate of 0.77% on the test set and one trial that had an error rate of 0.83%. The average of 0.782% is the best result reported on the permutation invariant version of MNIST, though statistically indistinguishable from the result obtained by fine-tuning DBMs with dropout (Srivastava et al., 2014) at 0.79%.

The model also became somewhat resistant to adversarial examples. Recall that without adversarial training, this same kind of model had an error rate of 89.4% on adversarial examples based on the fast gradient sign method. With adversarial training, the error rate fell to 17.9%. Adversarial examples are transferable between the two models but with the adversarially trained model showing greater robustness. Adversarial examples generated via the original model yield an error rate of 19.6% on the adversarially trained model, while adversarial examples generated via the new model yield an error rate of 40.9% on the original model. When the adversarially trained model does misclassify an adversarial example, its predictions are unfortunately still highly confident. The average confidence on a misclassified example was 81.4%. We also found that the weights of the learned model changed significantly, with the weights of the adversarially trained model being significantly more localized and interpretable (see Fig. 3).

[Figure 3: two weight-visualization panels.]
Figure 3: Weight visualizations of maxout networks trained on MNIST. Each row shows the filters for a single maxout unit. Left) Naively trained model. Right) Model with adversarial training.

The adversarial training procedure can be seen as minimizing the worst case error when the data is perturbed by an adversary. That can be interpreted as learning to play an adversarial game, or as minimizing an upper bound on the expected cost over noisy samples with noise from $U(-\epsilon, \epsilon)$ added to the inputs. Adversarial training can also be seen as a form of active learning, where the model is able to request labels on new points. In this case the human labeler is replaced with a heuristic labeler that copies labels from nearby points.

We could also regularize the model to be insensitive to changes in its features that are smaller than the $\epsilon$ precision simply by training on all points within the $\epsilon$ max norm box, or sampling many points within this box. This corresponds to adding noise with max norm $\epsilon$ during training. However, noise with zero mean and zero covariance is very inefficient at preventing adversarial examples. The expected dot product between any reference vector and such a noise vector is zero. This means that in many cases the noise will have essentially no effect rather than yielding a more difficult input. In fact, in many cases the noise will actually result in a lower objective function value. We can think of adversarial training as doing hard example mining among the set of noisy inputs, in order to train more efficiently by considering only those noisy points that strongly resist classification. As control experiments, we trained a maxout network with noise based on randomly adding $\pm\epsilon$ to each pixel, or adding noise in $U(-\epsilon, \epsilon)$ to each pixel. These obtained an error rate of 86.2% with confidence 97.3% and an error rate of 90.4% with a confidence of 97.8% respectively on fast gradient sign adversarial examples.

Because the derivative of the sign function is zero or undefined everywhere, gradient descent on the adversarial objective function based on the fast gradient sign method does not allow the model to anticipate how the adversary will react to changes in the parameters. If we instead use adversarial examples based on small rotations or addition of the scaled gradient, then the perturbation process is itself differentiable and the learning can take the reaction of the adversary into account. However, we did not find nearly as powerful of a regularizing result from this process, perhaps because these kinds of adversarial examples are not as difficult to solve.

One natural question is whether it is better to perturb the input or the hidden layers or both. Here the results are inconsistent. Szegedy et al. (2014b) reported that adversarial perturbations yield the best regularization when applied to the hidden layers. That result was obtained on a sigmoidal network. In our experiments with the fast gradient sign method, we find that networks with hidden units whose activations are unbounded simply respond by making their hidden unit activations very large, so it is usually better to just perturb the original input. On saturating models such as the Rust model we found that perturbation of the input performed comparably to perturbation of the hidden layers. Perturbations based on rotating the hidden layers solve the problem of unbounded activations growing to make additive perturbations smaller by comparison. We were able to successfully train maxout networks with rotational perturbations of the hidden layers. However, this did not yield nearly as strong of a regularizing effect as additive perturbation of the input layer. Our view of adversarial training is that it is only clearly useful when the model has the capacity to learn to resist adversarial examples. This is only clearly the case when a universal approximator theorem applies. Because the last layer of a neural network, the linear-sigmoid or linear-softmax layer, is not a universal approximator of functions of the final hidden layer, this suggests that one is likely to encounter problems with underfitting when applying adversarial perturbations to the final hidden layer. We indeed found this effect. Our best results with training using perturbations of hidden layers never involved perturbations of the final hidden layer.

7 Different kinds of model capacity

One reason that the existence of adversarial examples can seem counter-intuitive is that most of us have poor intuitions for high dimensional spaces. We live in three dimensions, so we are not used to small effects in hundreds of dimensions adding up to create a large effect. There is another way that our intuitions serve us poorly. Many people think of models with low capacity as being unable to make many different confident predictions. This is not correct. Some models with low capacity do exhibit this behavior. For example shallow RBF networks with

$$p(y=1 \mid \bm{x}) = \exp\left((\bm{x} - \mu)^\top \bm{\beta} (\bm{x} - \mu)\right)$$

are only able to confidently predict that the positive class is present in the vicinity of $\mu$. Elsewhere, they default to predicting the class is absent, or have low-confidence predictions.

RBF networks are naturally immune to adversarial examples, in the sense that they have low confidence when they are fooled. A shallow RBF network with no hidden layers gets an error rate of 55.4% on MNIST using adversarial examples generated with the fast gradient sign method and $\epsilon = .25$. However, its confidence on mistaken examples is only 1.2%. Its average confidence on clean test examples is 60.6%. We can’t expect a model with such low capacity to get the right answer at all points of space, but it does correctly respond by reducing its confidence considerably on points it does not “understand.”
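A small sketch of why such a unit is conservative (an illustration with arbitrary numbers, not the experimental model; $\bm{\beta}$ is assumed to be negative definite so the expression is a valid probability): its confidence decays exponentially with the squared distance from $\mu$, so a perturbation that moves the input away from the template can only lower the reported probability.

```python
import numpy as np

def rbf_unit_confidence(x, mu, beta):
    """p(y=1 | x) = exp((x - mu)^T beta (x - mu)); here beta is a hypothetical
    isotropic negative-definite matrix -0.5 * I."""
    d = x - mu
    return np.exp(d @ beta @ d)

n = 784
mu = np.zeros(n)                              # the template this unit responds to
beta = -0.5 * np.eye(n)

near = mu + 0.01 * np.ones(n)                 # a point close to the template
far = mu + 0.25 * np.ones(n)                  # e.g. an epsilon = .25 sign perturbation away
print(rbf_unit_confidence(near, mu, beta))    # close to 1: confident
print(rbf_unit_confidence(far, mu, beta))     # ~2e-11: the unit effectively abstains
```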

RBF units are unfortunately not invariant to any significant transformations so they cannot generalize very well. We can view linear units and RBF units as different points on a precision-recall tradeoff curve. Linear units achieve high recall by responding to every input in a certain direction, but may have low precision due to responding too strongly in unfamiliar situations. RBF units achieve high precision by responding only to a specific point in space, but in doing so sacrifice recall. Motivated by this idea, we decided to explore a variety of models involving quadratic units, including deep RBF networks. We found this to be a difficult task—every model with sufficient quadratic inhibition to resist adversarial perturbation obtained high training set error when trained with SGD.

8 Why do adversarial examples generalize?

An intriguing aspect of adversarial examples is that an example generated for one model is often misclassified by other models, even when they have different architectures or were trained on disjoint training sets. Moreover, when these different models misclassify an adversarial example, they often agree with each other on its class. Explanations based on extreme non-linearity and overfitting cannot readily account for this behavior—why should multiple extremely non-linear models with excess capacity consistently label out-of-distribution points in the same way? This behavior is especially surprising from the view of the hypothesis that adversarial examples finely tile space like the rational numbers among the reals, because in this view adversarial examples are common but occur only at very precise locations.

Under the linear view, adversarial examples occur in broad subspaces. The direction $\bm{\eta}$ need only have positive dot product with the gradient of the cost function, and $\epsilon$ need only be large enough. Fig. 4 demonstrates this phenomenon. By tracing out different values of $\epsilon$ we see that adversarial examples occur in contiguous regions of the 1-D subspace defined by the fast gradient sign method, not in fine pockets. This explains why adversarial examples are abundant and why an example misclassified by one classifier has a fairly high prior probability of being misclassified by another classifier.
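The kind of sweep plotted in Fig. 4 can be sketched as follows (a simplified illustration using a placeholder softmax regression model rather than a maxout network, so the logits are exactly, rather than piecewise, linear in $\epsilon$):

```python
import numpy as np

def sweep_epsilon(W, b, x, direction, epsilons):
    """Trace the logits and predicted class along x + epsilon * direction."""
    for eps in epsilons:
        logits = W @ (x + eps * direction) + b     # exactly linear in eps for this model
        yield eps, int(np.argmax(logits))

rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.01, size=(10, 784)), np.zeros(10)
x, y = rng.uniform(size=784), 3

p = np.exp(W @ x + b); p /= p.sum(); p[y] -= 1.0   # p - onehot(y)
direction = np.sign(W.T @ p)                       # fast gradient sign direction
for eps, pred in sweep_epsilon(W, b, x, direction, np.linspace(-0.5, 0.5, 11)):
    print(f"epsilon = {eps:+.2f}: predicted class {pred}")
```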

To explain why multiple classifiers assign the same class to adversarial examples, we hypothesize that neural networks trained with current methodologies all resemble the linear classifier learned on the same training set. This reference classifier is able to learn approximately the same classification weights when trained on different subsets of the training set, simply because machine learning algorithms are able to generalize. The stability of the underlying classification weights in turn results in the stability of adversarial examples.

To test this hypothesis, we generated adversarial examples on a deep maxout network and classified these examples using a shallow softmax network and a shallow RBF network. On examples that were misclassified by the maxout network, the RBF network predicted the maxout network’s class assignment only 16.0% of the time, while the softmax classifier predicted the maxout network’s class correctly 54.6% of the time. These numbers are largely driven by the differing error rate of the different models though. If we restrict our attention to cases where both models being compared make a mistake, then softmax regression predicts maxout’s class 84.6% of the time, while the RBF network is able to predict maxout’s class only 54.3% of the time. For comparison, the RBF network can predict softmax regression’s class 53.6% of the time, so it does have a strong linear component to its own behavior. Our hypothesis does not explain all of the maxout network’s mistakes or all of the mistakes that generalize across models, but clearly a significant proportion of them are consistent with linear behavior being a major cause of cross-model generalization.

[Figure 4: two panels, a line plot (left) and the corresponding input images (right).]
Figure 4: By tracing out different values of $\epsilon$, we can see that adversarial examples occur reliably for almost any sufficiently large value of $\epsilon$ provided that we move in the correct direction. Correct classifications occur only on a thin manifold where $\bm{x}$ occurs in the data. Most of $\mathbb{R}^n$ consists of adversarial examples and rubbish class examples (see the appendix). This plot was made from a naively trained maxout network. Left) A plot showing the argument to the softmax layer for each of the 10 MNIST classes as we vary $\epsilon$ on a single input example. The correct class is 4. We see that the unnormalized log probabilities for each class are conspicuously piecewise linear with $\epsilon$ and that the wrong classifications are stable across a wide region of $\epsilon$ values. Moreover, the predictions become very extreme as we increase $\epsilon$ enough to move into the regime of rubbish inputs. Right) The inputs used to generate the curve (upper left = negative $\epsilon$, lower right = positive $\epsilon$, yellow boxes indicate correctly classified inputs).

9 Alternative hypotheses

We now consider and refute some alternative hypotheses for the existence of adversarial examples. First, one hypothesis is that generative training could provide more constraint on the training process, or cause the model to learn how to distinguish “real” from “fake” data and be confident only on “real” data. The MP-DBM (Goodfellow et al., 2013a) provides a good model to test this hypothesis. Its inference procedure gets good classification accuracy (a 0.88% error rate) on MNIST. This inference procedure is differentiable. Other generative models either have non-differentiable inference procedures, making it harder to compute adversarial examples, or require an additional non-generative discriminator model to get good classification accuracy on MNIST. In the case of the MP-DBM, we can be sure that the generative model itself is responding to adversarial examples, rather than the non-generative classifier model on top. We find that the model is vulnerable to adversarial examples. With an $\epsilon$ of 0.25, we find an error rate of 97.5% on adversarial examples generated from the MNIST test set. It remains possible that some other form of generative training could confer resistance, but clearly the mere fact of being generative is not alone sufficient.

Another hypothesis about why adversarial examples exist is that individual models have strange quirks but averaging over many models can cause adversarial examples to wash out. To test this hypothesis, we trained an ensemble of twelve maxout networks on MNIST. Each network was trained using a different seed for the random number generator used to initialize the weights, generate dropout masks, and select minibatches of data for stochastic gradient descent. The ensemble gets an error rate of 91.1% on adversarial examples designed to perturb the entire ensemble with $\epsilon = .25$. If we instead use adversarial examples designed to perturb only one member of the ensemble, the error rate falls to 87.9%. Ensembling provides only limited resistance to adversarial perturbation.

10 Summary and discussion

As a summary, this paper has made the following observations:

  • Adversarial examples can be explained as a property of high-dimensional dot products. They are a result of models being too linear, rather than too nonlinear.
  • The generalization of adversarial examples across different models can be explained as a result of adversarial perturbations being highly aligned with the weight vectors of a model, and different models learning similar functions when trained to perform the same task.
  • The direction of perturbation, rather than the specific point in space, matters most. Space is not full of pockets of adversarial examples that finely tile the reals like the rational numbers.
  • Because it is the direction that matters most, adversarial perturbations generalize across different clean examples.
  • We have introduced a family of fast methods for generating adversarial examples.
  • We have demonstrated that adversarial training can result in regularization; even further regularization than dropout.
  • We have run control experiments that failed to reproduce this effect with simpler but less efficient regularizers including $L^1$ weight decay and adding noise.
  • Models that are easy to optimize are easy to perturb.
  • Linear models lack the capacity to resist adversarial perturbation; only structures with a hidden layer (where the universal approximator theorem applies) should be trained to resist adversarial perturbation.
  • RBF networks are resistant to adversarial examples.
  • Models trained to model the input distribution are not resistant to adversarial examples.
  • Ensembles are not resistant to adversarial examples.

Some further observations concerning rubbish class examples are presented in the appendix:

  • Rubbish class examples are ubiquitous and easily generated.
  • Shallow linear models are not resistant to rubbish class examples.
  • RBF networks are resistant to rubbish class examples.

Gradient-based optimization is the workhorse of modern AI. Using a network that has been designed to be sufficiently linear–whether it is a ReLU or maxout network, an LSTM, or a sigmoid network that has been carefully configured not to saturate too much– we are able to fit most problems we care about, at least on the training set. The existence of adversarial examples suggests that being able to explain the training data or even being able to correctly label the test data does not imply that our models truly understand the tasks we have asked them to perform. Instead, their linear responses are overly confident at points that do not occur in the data distribution, and these confident predictions are often highly incorrect. This work has shown we can partially correct for this problem by explicitly identifying problematic points and correcting the model at each of these points. However, one may also conclude that the model families we use are intrinsically flawed. Ease of optimization has come at the cost of models that are easily misled. This motivates the development of optimization procedures that are able to train models whose behavior is more locally stable.

Acknowledgments

We would like to thank Geoffrey Hinton and Ilya Sutskever for helpful discussions. We would also like to thank Jeff Dean, Greg Corrado, and Oriol Vinyals for their feedback on drafts of this article. We would like to thank the developers of Theano(Bergstra et al., 2010; Bastien et al., 2012), Pylearn2(Goodfellow et al., 2013b), and DistBelief (Dean et al., 2012).

References

  • Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian J., Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
  • Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.
  • Chalupka, K., Perona, P., and Eberhardt, F. Visual Causal Feature Learning. ArXiv e-prints, December 2014.
  • Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc’Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.
  • Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), April 2011.
  • Goodfellow, Ian J., Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Multi-prediction deep Boltzmann machines. In Neural Information Processing Systems, December 2013a.
  • Goodfellow, Ian J., Warde-Farley, David, Lamblin, Pascal, Dumoulin, Vincent, Mirza, Mehdi, Pascanu, Razvan, Bergstra, James, Bastien, Frédéric, and Bengio, Yoshua. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013b.
  • Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout networks. In Dasgupta, Sanjoy and McAllester, David (eds.), International Conference on Machine Learning, pp. 1319–1327, 2013c.
  • Gu, Shixiang and Rigazio, Luca. Towards deep neural network architectures robust to adversarial examples. In NIPS Workshop on Deep Learning and Representation Learning, 2014.
  • Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hornik, Kurt, Stinchcombe, Maxwell, and White, Halbert. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.
  • Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc’Aurelio, and LeCun, Yann. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pp. 2146–2153. IEEE, 2009.
  • Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Nguyen, A., Yosinski, J., and Clune, J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. ArXiv e-prints, December 2014.
  • Rust, Nicole, Schwartz, Odelia, Movshon, J. Anthony, and Simoncelli, Eero. Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6):945–956, 2005.
  • Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. Technical report, arXiv preprint arXiv:1409.4842, 2014a.
  • Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian J., and Fergus, Rob. Intriguing properties of neural networks. ICLR, abs/1312.6199, 2014b. URL http://arxiv.org/abs/1312.6199.

Appendix A Rubbish class examples

A concept related to adversarial examples is the concept of examples drawn from a “rubbish class.” These examples are degenerate inputs that a human would classify as not belonging to any of the categories in the training set. If we call the classes in the training set “the positive classes,” then we want to be careful to avoid false positives on rubbish inputs; that is, we do not want to classify a degenerate input as being something real. In the case of separate binary classifiers for each class, we want every classifier to output a probability near zero of its class being present, and in the case of a multinoulli distribution over only the positive classes, we would prefer that the classifier output a high-entropy (nearly uniform) distribution over the classes. The traditional approach to reducing vulnerability to rubbish inputs is to introduce an extra, constant output to the model representing the rubbish class (LeCun+98). Nguyen et al. (2014) recently re-popularized the concept of the rubbish class in the context of computer vision under the name “fooling images.” As with adversarial examples, there has been a misconception that rubbish class false positives are hard to find, and that they are primarily a problem faced by deep networks.

Our explanation of adversarial examples as the result of linearity and high dimensional spaces also applies to analyzing the behavior of the model on rubbish class examples. Linear models produce more extreme predictions at points that are far from the training data than at points that are near the training data. In order to find high confidence rubbish false positives for such a model, we need only generate a point that is far from the data, with larger norms yielding more confidence. RBF networks, which are not able to confidently predict the presence of any class far from the training data, are not fooled by this phenomenon.
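To make the norm argument concrete, the following minimal NumPy sketch (ours, not from the paper) uses a random weight matrix as a stand-in for a trained linear classifier and shows how scaling the norm of a fixed random input drives a linear softmax model toward near-certain predictions.

```python
# Minimal sketch: linear models become more confident as an input moves
# farther from the origin, because the logits scale linearly with the input.
# The weights here are a random stand-in, not a trained model.
import numpy as np

rng = np.random.RandomState(0)
n_features, n_classes = 784, 10
W = rng.randn(n_features, n_classes) * 0.01   # hypothetical linear weights
b = np.zeros(n_classes)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

x = rng.randn(n_features)                      # a Gaussian "rubbish" input
for scale in (1.0, 10.0, 100.0):
    p = softmax((scale * x) @ W + b)
    print(f"input norm scaled by {scale:6.1f} -> max class probability {p.max():.3f}")
```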

We generated 10,000 samples from $\mathcal{N}(0, \bm{I}_{784})$ and fed them into various classifiers on the MNIST dataset. In this context, we consider assigning a probability greater than 0.5 to any class to be an error. A naively trained maxout network with a softmax layer on top had an error rate of 98.35% on Gaussian rubbish examples, with an average confidence of 92.8% on mistakes. Changing the top layer to independent sigmoids dropped the error rate to 68%, with an average confidence on mistakes of 87.9%. On CIFAR-10, using 1,000 samples from $\mathcal{N}(0, \bm{I}_{3072})$, a convolutional maxout net obtains an error rate of 93.4%, with an average confidence of 84.4%.
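The evaluation above can be sketched as follows; `predict_proba` is a hypothetical hook returning per-class probabilities from a trained model (the paper used maxout networks), and the 0.5 threshold matches the error criterion stated above.

```python
# Sketch of the Gaussian rubbish-example evaluation: sample from N(0, I_dim),
# count an error whenever any class receives probability > 0.5, and record
# the mean confidence on those mistakes.
import numpy as np

def evaluate_rubbish(predict_proba, n_samples=10000, dim=784, threshold=0.5, seed=0):
    rng = np.random.RandomState(seed)
    mistake_confidences = []
    for _ in range(n_samples):
        x = rng.randn(dim)                   # one sample from N(0, I_dim)
        p = predict_proba(x)                 # hypothetical model hook
        if p.max() > threshold:              # confident prediction on rubbish = error
            mistake_confidences.append(p.max())
    error_rate = len(mistake_confidences) / n_samples
    mean_conf = float(np.mean(mistake_confidences)) if mistake_confidences else float("nan")
    return error_rate, mean_conf             # mean_conf undefined (NaN) if no mistakes
```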

These experiments suggest that the optimization algorithms employed by Nguyen et al. (2014) are overkill (or perhaps are only needed on ImageNet), and that the rich geometric structure in their fooling images is due to the priors encoded in their search procedures, rather than those structures being uniquely able to cause false positives.

Though Nguyen et al. (2014) focused their attention on deep networks, shallow linear models have the same problem. A softmax regression model has an error rate of 59.8% on the rubbish examples, with an average confidence on mistakes of 70.8%. If we use instead an RBF network, which does not behave like a linear function, we find an error rate of 0%. Note that when the error rate is zero the average confidence on a mistake is undefined.

Nguyen et al. (2014) focused on the problem of generating fooling images for a specific class, which is a harder problem than simply finding points that the network confidently classifies as belonging to any one class despite being defective. The false positives produced by the above methods on MNIST and CIFAR-10 tend to have a very skewed distribution over classes. On MNIST, 45.3% of a naively trained maxout network’s false positives were classified as 5s, and none were classified as 8s. Likewise, on CIFAR-10, 49.7% of the convolutional network’s false positives were classified as frogs, and none were classified as airplanes, automobiles, horses, ships, or trucks.

To solve the problem introduced by Nguyen et al. (2014) of generating a fooling image for a particular class, we propose adding $\epsilon \nabla_{\bm{x}} p(y=i \mid \bm{x})$ to a Gaussian sample $\bm{x}$ as a fast method of generating a fooling image classified as class $i$. If we repeat this sampling process until it succeeds, we obtain a randomized algorithm with variable runtime. On CIFAR-10, we found that one sampling step had a 100% success rate for frogs and trucks, and the hardest class was airplanes, with a success rate of 24.7% per sampling step. Averaged over all ten classes, the method has an average per-step success rate of 75.3%. We can thus generate any desired class with a handful of samples and no special priors, rather than tens of thousands of generations of evolution. To confirm that the resulting examples are indeed fooling images, and not images of real classes rendered by the gradient sign method, see Fig. 5. The success rate of this method in terms of generating members of class $i$ may degrade for datasets with more classes, since the risk of inadvertently increasing the activation of a different class $j$ increases in that case.
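A sketch of this procedure is given below, under the assumption of hypothetical `predict_proba(x)` and `grad_prob(x, i)` hooks exposing the model's class probabilities and their gradients with respect to the input; the text above uses the raw gradient, while Fig. 5 takes its sign, so the sign variant is noted in a comment.

```python
# Sketch: generate a fooling image for a chosen class by taking one gradient
# step on p(y = target | x) from a fresh Gaussian sample, resampling until a
# confident false positive is found (a randomized algorithm with variable runtime).
import numpy as np

def fool_class(predict_proba, grad_prob, target, dim=3072, epsilon=0.1,
               threshold=0.5, max_tries=100, seed=0):
    rng = np.random.RandomState(seed)
    for _ in range(max_tries):
        x = rng.randn(dim)                          # fresh Gaussian sample
        step = epsilon * grad_prob(x, target)       # eps * d p(y=target | x) / dx
        # Fig. 5 variant: step = epsilon * np.sign(grad_prob(x, target))
        x_fool = x + step
        if predict_proba(x_fool)[target] > threshold:
            return x_fool                           # confident false positive for `target`
    return None                                     # no success within max_tries
```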

Figure 5: Randomly generated fooling images for a convolutional network trained on CIFAR-10. These examples were generated by drawing a sample from an isotropic Gaussian, then taking a gradient sign step in the direction that increases the probability of the “airplane” class. Yellow boxes indicate samples that successfully fool the model into believing an airplane is present with at least 50% confidence. “Airplane” is the hardest class to construct fooling images for on CIFAR-10, so this figure represents the worst case in terms of success rate.

We found that we were able to train a maxout network to have a zero percent error rate on Gaussian rubbish examples (it was still vulnerable to rubbish examples generated by applying a fast gradient sign step to a Gaussian sample) with no negative impact on its ability to classify clean examples. Unfortunately, unlike training on adversarial examples, this did not result in any significant reduction of the model’s test set error rate.
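The text does not spell out the exact training objective used here; one plausible recipe (an assumption on our part, not the authors' stated method) is to augment each training batch with Gaussian noise inputs whose target distribution is uniform, so that cross-entropy training penalizes confident predictions on rubbish while leaving clean examples unaffected.

```python
# Hypothetical batch augmentation for rejecting Gaussian rubbish: append noise
# inputs with uniform targets so the usual cross-entropy loss discourages
# confident predictions on them.  This is a sketch, not the paper's recipe.
import numpy as np

def rubbish_augmented_batch(x_batch, y_onehot, rng, noise_fraction=0.5):
    n_noise = int(noise_fraction * len(x_batch))
    n_classes = y_onehot.shape[1]
    noise = rng.randn(n_noise, x_batch.shape[1])        # rubbish drawn from N(0, I)
    uniform = np.full((n_noise, n_classes), 1.0 / n_classes)
    x_aug = np.concatenate([x_batch, noise], axis=0)
    y_aug = np.concatenate([y_onehot, uniform], axis=0)
    return x_aug, y_aug                                 # train with cross-entropy as usual
```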

In conclusion, it appears that a randomly selected input to deep or shallow models built from linear parts is overwhelmingly likely to be processed incorrectly, and that these models only behave reasonably on a very thin manifold encompassing the training data.