
EXPLAINING AND HARNESSING ADVERSARIAL EXAMPLES

Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy
Google Inc., Mountain View, CA
{goodfellow, shlens, szegedy}@google.com

Abstract

Several machine learning models, including neural networks, consistently misclassify adversarial examples: inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks’ vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.

1 Introduction

Szegedy et al. (2014b) made an intriguing discovery: several machine learning models, including state-of-the-art neural networks, are vulnerable to adversarial examples. That is, these machine learning models misclassify examples that are only slightly different from correctly classified examples drawn from the data distribution. In many cases, a wide variety of models with different architectures trained on different subsets of the training data misclassify the same adversarial example. This suggests that adversarial examples expose fundamental blind spots in our training algorithms.
The cause of these adversarial examples was a mystery, and speculative explanations have suggested it is due to extreme nonlinearity of deep neural networks, perhaps combined with insufficient model averaging and insufficient regularization of the purely supervised learning problem. We show that these speculative hypotheses are unnecessary. Linear behavior in high-dimensional spaces is sufficient to cause adversarial examples. This view enables us to design a fast method of generating adversarial examples that makes adversarial training practical. We show that adversarial training can provide an additional regularization benefit beyond that provided by using dropout (Srivastava et al., 2014) alone. Generic regularization strategies such as dropout, pretraining, and model averaging do not confer a significant reduction in a model’s vulnerability to adversarial examples, but changing to nonlinear model families such as RBF networks can do so.
Our explanation suggests a fundamental tension between designing models that are easy to train due to their linearity and designing models that use nonlinear effects to resist adversarial perturbation. In the long run, it may be possible to escape this tradeoff by designing more powerful optimization methods that can successfully train more nonlinear models.

2 RELATED WORK

Szegedy et al. (2014b) demonstrated a variety of intriguing properties of neural networks and related models. Those most relevant to this paper include:
  • Box-constrained L-BFGS can reliably find adversarial examples.
  • On some datasets, such as ImageNet (Deng et al., 2009), the adversarial examples were so close to the original examples that the differences were indistinguishable to the human eye.
  • The same adversarial example is often misclassified by a variety of classifiers with different architectures or trained on different subsets of the training data.
  • Shallow softmax regression models are also vulnerable to adversarial examples.
  • Training on adversarial examples can regularize the model—however, this was not practical at the time due to the need for expensive constrained optimization in the inner loop.
These results suggest that classifiers based on modern machine learning techniques, even those that obtain excellent performance on the test set, are not learning the true underlying concepts that determine the correct output label. Instead, these algorithms have built a Potemkin village that works well on naturally occurring data, but is exposed as a fake when one visits points in space that do not have high probability in the data distribution. This is particularly disappointing because a popular approach in computer vision is to use convolutional network features as a space where Euclidean distance approximates perceptual distance. This resemblance is clearly flawed if images that have an immeasurably small perceptual distance correspond to completely different classes in the network’s representation.

These results have often been interpreted as being a flaw in deep networks in particular, even though linear classifiers have the same problem. We regard the knowledge of this flaw as an opportunity to fix it. Indeed, Gu & Rigazio (2014) and Chalupka et al. (2014) have already begun the first steps toward designing models that resist adversarial perturbation, though no model has yet successfully done so while maintaining state-of-the-art accuracy on clean inputs.

3 THE LINEAR EXPLANATION OF ADVERSARIAL EXAMPLES

We begin by explaining the existence of adversarial examples for linear models.

In many problems, the precision of an individual input feature is limited. For example, digital images often use only 8 bits per pixel, so they discard all information below $1/255$ of the dynamic range. Because the precision of the features is limited, it is not rational for the classifier to respond differently to an input $\boldsymbol{x}$ than to an adversarial input $\tilde{\boldsymbol{x}}=\boldsymbol{x}+\boldsymbol{\eta}$ if every element of the perturbation $\boldsymbol{\eta}$ is smaller than the precision of the features. Formally, for problems with well-separated classes, we expect the classifier to assign the same class to $\boldsymbol{x}$ and $\tilde{\boldsymbol{x}}$ so long as $\|\boldsymbol{\eta}\|_{\infty}<\epsilon$, where $\epsilon$ is small enough to be discarded by the sensor or data storage apparatus associated with our problem.
Consider the dot product between a weight vector $\boldsymbol{w}$ and an adversarial example $\tilde{\boldsymbol{x}}$:

$$\boldsymbol{w}^{\top} \tilde{\boldsymbol{x}}=\boldsymbol{w}^{\top} \boldsymbol{x}+\boldsymbol{w}^{\top} \boldsymbol{\eta}$$
The adversarial perturbation causes the activation to grow by $\boldsymbol{w}^{\top} \boldsymbol{\eta}$. We can maximize this increase subject to the max norm constraint on $\boldsymbol{\eta}$ by assigning $\boldsymbol{\eta}=\epsilon \operatorname{sign}(\boldsymbol{w})$. If $\boldsymbol{w}$ has $n$ dimensions and the average magnitude of an element of the weight vector is $m$, then the activation will grow by $\epsilon m n$. Since $\|\boldsymbol{\eta}\|_{\infty}$ does not grow with the dimensionality of the problem, but the change in activation caused by perturbation by $\boldsymbol{\eta}$ can grow linearly with $n$, for high dimensional problems we can make many infinitesimal changes to the input that add up to one large change to the output. We can think of this as a sort of “accidental steganography,” where a linear model is forced to attend exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude.
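The scaling argument above is easy to check numerically. The sketch below is not from the paper; the dimensions, random weights, and value of $\epsilon$ are arbitrary illustrative choices. It verifies that under the constraint $\|\boldsymbol{\eta}\|_{\infty} \leq \epsilon$, the perturbation $\boldsymbol{\eta}=\epsilon \operatorname{sign}(\boldsymbol{w})$ increases the activation by $\epsilon\|\boldsymbol{w}\|_{1}=\epsilon m n$, which grows linearly with the dimensionality $n$.

```python
import numpy as np

# Illustrative sketch of the dimensionality argument (values are arbitrary choices):
# under ||eta||_inf <= eps, the perturbation eta = eps * sign(w) maximizes w^T eta,
# and the resulting change in activation grows linearly with n.
rng = np.random.default_rng(0)
eps = 0.01  # assumed max-norm budget, below the precision of typical features

for n in [10, 1_000, 100_000]:
    w = rng.normal(size=n)           # weight vector with n dimensions
    x = rng.normal(size=n)           # an arbitrary input
    eta = eps * np.sign(w)           # worst-case max-norm-bounded perturbation
    delta = w @ (x + eta) - w @ x    # change in activation; equals eps * ||w||_1
    m = np.abs(w).mean()             # average magnitude of a weight element
    print(f"n={n:>7d}  activation change={delta:10.3f}  eps*m*n={eps * m * n:10.3f}")
```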

This explanation shows that a simple linear model can have adversarial examples if its input has sufficient dimensionality. Previous explanations for adversarial examples invoked hypothesized properties of neural networks, such as their supposed highly non-linear nature. Our hypothesis based on linearity is simpler, and can also explain why softmax regression is vulnerable to adversarial examples.

4 LINEAR PERTURBATION OF NON-LINEAR MODELS

The linear view of adversarial examples suggests a fast way of generating them. We hypothesize that neural networks are too linear to resist linear adversarial perturbation. LSTMs (Hochreiter & Schmidhuber, 1997), ReLUs (Jarrett et al., 2009; Glorot et al., 2011), and maxout networks (Goodfellow et al., 2013c) are all intentionally designed to behave in very linear ways, so that they are easier to optimize. More nonlinear models such as sigmoid networks are carefully tuned to spend most of their time in the non-saturating, more linear regime for the same reason. This linear behavior suggests that cheap, analytical perturbations of a linear model should also damage neural networks.

Figure 1: A demonstration of fast adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet’s classification of the image. Here our $\epsilon$ of .007 corresponds to the magnitude of the smallest bit of an 8-bit image encoding after GoogLeNet’s conversion to real numbers.
Let $\boldsymbol{\theta}$ be the parameters of a model, $\boldsymbol{x}$ the input to the model, $y$ the targets associated with $\boldsymbol{x}$ (for machine learning tasks that have targets) and $J(\boldsymbol{\theta}, \boldsymbol{x}, y)$ be the cost used to train the neural network. We can linearize the cost function around the current value of $\boldsymbol{\theta}$, obtaining an optimal max-norm constrained perturbation of
$$\boldsymbol{\eta}=\epsilon \operatorname{sign}\left(\nabla_{\boldsymbol{x}} J(\boldsymbol{\theta}, \boldsymbol{x}, y)\right)$$
We refer to this as the “fast gradient sign method” of generating adversarial examples. Note that the required gradient can be computed efficiently using backpropagation.
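As a concrete illustration, a minimal PyTorch sketch of the fast gradient sign method might look like the following. The `model`, `loss_fn`, and the clamp to [0, 1] are assumptions (any differentiable classifier, its training loss, and image inputs scaled to [0, 1]) rather than details fixed by the paper.

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, epsilon):
    """Return x + epsilon * sign(grad_x J(theta, x, y)), a sketch of the
    fast gradient sign method. `model` and `loss_fn` are assumed to be any
    differentiable classifier and its training loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)        # J(theta, x, y)
    loss.backward()                    # gradient w.r.t. the input via backpropagation
    eta = epsilon * x.grad.sign()      # max-norm-constrained perturbation
    x_adv = (x + eta).detach()
    return x_adv.clamp(0.0, 1.0)       # assumes inputs scaled to [0, 1]
```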
We find that this method reliably causes a wide variety of models to misclassify their input. See Fig. 1 for a demonstration on ImageNet. We find that using $\epsilon=.25$, we cause a shallow softmax classifier to have an error rate of 99.9% with an average confidence of 79.3% on the MNIST (?) test set¹. In the same setting, a maxout network misclassifies 89.4% of our adversarial examples with an average confidence of 97.6%. Similarly, using $\epsilon=.1$, we obtain an error rate of 87.15% and an average probability of 96.6% assigned to the incorrect labels when using a convolutional maxout network on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 2009) test set². Other simple methods of generating adversarial examples are possible. For example, we also found that rotating $\boldsymbol{x}$ by a small angle in the direction of the gradient reliably produces adversarial examples.
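For readers who want to reproduce this kind of measurement on their own models, a hypothetical evaluation loop built on the `fgsm_perturb` sketch above could look like the following; the data loader, model, and loss are placeholders, and the confidence is averaged over the misclassified examples, in the spirit of the numbers reported above.

```python
import torch

def adversarial_error_rate(model, loss_fn, loader, epsilon):
    """Hypothetical helper (not from the paper): error rate of `model` on fast
    gradient sign adversarial examples and the mean confidence assigned to the
    incorrect labels. `loader` yields (x, y) batches; uses fgsm_perturb above."""
    wrong, confidences, total = 0, [], 0
    for x, y in loader:
        x_adv = fgsm_perturb(model, loss_fn, x, y, epsilon)
        with torch.no_grad():
            probs = torch.softmax(model(x_adv), dim=1)
        conf, pred = probs.max(dim=1)
        mis = pred != y
        wrong += int(mis.sum())
        confidences.append(conf[mis])   # confidence on misclassified examples
        total += y.numel()
    mean_conf = torch.cat(confidences).mean().item() if wrong else float("nan")
    return wrong / total, mean_conf
```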
The fact that these simple, cheap algorithms are able to generate misclassified examples serves as evidence in favor of our interpretation of adversarial examples as a result of linearity. The algorithms are also useful as a way of speeding up adversarial training or even just analysis of trained networks.

5 ADVERSARIAL TRAINING OF LINEAR MODELS VERSUS WEIGHT DECAY

Perhaps the simplest possible model we can consider is logistic regression. In this case, the fast gradient sign method is exact. We can use this case to gain some intuition for how adversarial examples are generated in a simple setting. See Fig. 2 for instructive images.
If we train a single model to recognize labels $y \in\{-1,1\}$ with $P(y=1)=\sigma\left(\boldsymbol{w}^{\top} \boldsymbol{x}+b\right)$ where $\sigma(z)$ is the logistic sigmoid function, then training consists of gradient descent on
$$\mathbb{E}_{\boldsymbol{x}, y \sim p_{\text {data }}} \zeta\left(-y\left(\boldsymbol{w}^{\top} \boldsymbol{x}+b\right)\right)$$
where $\zeta(z)=\log (1+\exp (z))$ is the softplus function. We can derive a simple analytical form for training on the worst-case adversarial perturbation of $\boldsymbol{x}$ rather than $\boldsymbol{x}$ itself, based on the gradient sign.
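A brief sketch of where this derivation leads (completing the argument above, with the algebra assuming $y \in\{-1,1\}$ as defined earlier): the gradient of the loss with respect to $\boldsymbol{x}$ has sign $-y \operatorname{sign}(\boldsymbol{w})$, so the worst-case max-norm perturbation is $\boldsymbol{\eta}=-\epsilon y \operatorname{sign}(\boldsymbol{w})$. Since $\boldsymbol{w}^{\top} \operatorname{sign}(\boldsymbol{w})=\|\boldsymbol{w}\|_{1}$, the adversarial version of logistic regression is to minimize

$$\mathbb{E}_{\boldsymbol{x}, y \sim p_{\text {data }}} \zeta\left(\epsilon\|\boldsymbol{w}\|_{1}-y\left(\boldsymbol{w}^{\top} \boldsymbol{x}+b\right)\right)$$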

¹ This is using MNIST pixel values in the interval [0, 1]. MNIST data does contain values other than 0 or 1, but the images are essentially binary. Each pixel roughly encodes “ink” or “no ink”. This justifies expecting the classifier to be able to handle perturbations within a range of width 0.5, and indeed human observers can read such images without difficulty.
² See https://github.com/lisa-lab/pylearn2/tree/master/pylearn2/scripts/papers/maxout for the preprocessing code, which yields a standard deviation of roughly 0.5.