ImageNet Classification with Deep Convolutional Neural Networks
By Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully connected layers we employed a recently developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
1. PROLOGUE
Four years ago, a paper by Yann LeCun and his collaborators was rejected by the leading computer vision conference on the grounds that it used neural networks and therefore provided no insight into how to design a vision system. At the time, most computer vision researchers believed that a vision system needed to be carefully hand-designed using a detailed understanding of the nature of the task. They assumed that the task of classifying objects in natural images would never be solved by simply presenting examples of images and the names of the objects they contained to a neural network that acquired all of its knowledge from this training data.
What many in the vision research community failed to appreciate was that methods that require careful hand-engineering by a programmer who understands the domain do not scale as well as methods that replace the programmer with a powerful general-purpose learning procedure. With enough computation and enough data, learning beats programming for complicated tasks that require the integration of many different, noisy cues.
Four years ago, while we were at the University of Toronto, our deep neural network called SuperVision almost halved the error rate for recognizing objects in natural images and triggered an overdue paradigm shift in computer vision. Figure 4 shows some examples of what SuperVision can do.
SuperVision evolved from the multilayer neural networks that were widely investigated in the 1980s. These networks used multiple layers of feature detectors that were all learned from the training data. Neuroscientists and psychologists had hypothesized that a hierarchy of such feature detectors would provide a robust way to recognize objects but they had no idea how such a hierarchy could be learned. There was great excitement in the 1980s because several different research groups discovered that multiple layers of feature detectors could be trained efficiently using a relatively straightforward algorithm called backpropagation^{18,22,27,33} to compute, for each image, how the classification performance of the whole network depended on the value of the weight on each connection.
Backpropagation worked well for a variety of tasks, but in the 1980s it did not live up to the very high expectations of its advocates. In particular, it proved to be very difficult to learn networks with many layers and these were precisely the networks that should have given the most impressive results. Many researchers concluded, incorrectly, that learning a deep neural network from random initial weights was just too difficult. Twenty years later, we know what went wrong: for deep neural networks to shine, they needed far more labeled data and hugely more computation.
2. INTRODUCTION
Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small, on the order of tens of thousands of images (e.g., NORB,^{19} Caltech-101/256,^{8,10} and CIFAR-10/100^{14}). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current-best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance.^{5} But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Ref. 25), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe,^{28} which consists of hundreds of thousands of fully segmented images, and ImageNet,^{7} which consists of over 15 million labeled high-resolution images in over 22,000 categories.
The original version of this paper was published in the Proceedings of the 25th International Conference on Neural Information Processing Systems (Lake Tahoe, NV, Dec. 2012), 1097-1105.
To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we do not have. Convolutional neural networks (CNNs) constitute one such class of models.^{9,15,17,19,21,26,32} Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly sized layers, CNNs have far fewer connections and parameters and so they are easier to train, while their theoretically best performance is likely to be only slightly worse.
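As a rough illustration of the parameter savings, the sketch below counts the weights and biases needed to map a 32 x 32 x 3 input to a 32 x 32 x 16 output with a fully connected layer and with a convolutional layer that shares one 5 x 5 kernel per output map across all positions. The layer sizes and function names are hypothetical and are not those of our network.

```python
def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units


def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a convolutional layer.

    Each output map shares one kernel across all spatial positions,
    so the count does not depend on the image size.
    """
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Hypothetical sizes chosen only for illustration.
in_units = 32 * 32 * 3       # flattened 32 x 32 RGB input
out_units = 32 * 32 * 16     # 16 output maps of 32 x 32

dense_count = dense_params(in_units, out_units)
conv_count = conv_params(kernel_size=5, in_channels=3, out_channels=16)

print("fully connected:", dense_count)
print("convolutional:  ", conv_count)
```

Under these assumptions the convolutional layer needs several orders of magnitude fewer parameters for the same input and output sizes.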
Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply at large scale to high-resolution images. Luckily, current GPUs, paired with a highly optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.
The specific contributions of this paper are as follows: we trained one of the largest CNNs to date on the subsets of ImageNet used in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)-2010 and ILSVRC-2012 competitions^{2} and achieved by far the best results ever reported on these datasets. We wrote a highly optimized GPU implementation of 2D convolution and all the other operations inherent in training CNNs, which we make available publicly.^{a} Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 4. The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 5. Our final network contains five convolutional and three fully connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model's parameters) resulted in inferior performance.
In the end, the network's size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between 5 and 6 days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
3. THE DATASET
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon's Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.
ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competition, in Section 7 we report our results on this version of the dataset as well, for which test set labels are unavailable. On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
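The two error rates can be computed as in the following minimal sketch. The function name and the toy scores are hypothetical; with only four classes in the toy data, k = 2 stands in for the k = 5 used on ImageNet.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k
    classes the model scores highest."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by decreasing score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy data (hypothetical): 3 examples, 4 classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: ranked first
    [0.5, 0.1, 0.3, 0.1],  # true class 2: ranked second
    [0.4, 0.3, 0.2, 0.1],  # true class 3: ranked last
]
labels = [1, 2, 3]

top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
print(top1, top2)
```

By construction the top-k error can only decrease or stay the same as k grows, which is why the top-5 rate is always at most the top-1 rate.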
ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256 × 256 patch from the resulting image. We did not preprocess the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the (centered) raw RGB values of the pixels.
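The geometry of this preprocessing can be sketched as follows. The function names are hypothetical, the rounding of the rescaled size is an assumption, and the mean subtraction is interpreted here as subtracting a per-position mean computed over the training set; the actual resampling of pixels is omitted.

```python
def rescale_and_crop_box(width, height, target=256):
    """Return the rescaled (width, height) whose shorter side equals
    `target`, and the (left, top, right, bottom) box of the central
    target x target patch in the rescaled image."""
    scale = target / min(width, height)
    # Rounding is an assumption; max() guards against float error.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


def subtract_mean(images):
    """Subtract the training-set mean at each pixel position.

    `images` is a list of equally long, flattened pixel lists.
    """
    n = len(images)
    mean = [sum(column) / n for column in zip(*images)]
    return [[p - m for p, m in zip(image, mean)] for image in images]


# A hypothetical 500 x 375 image: the shorter side (375) becomes 256.
size, box = rescale_and_crop_box(500, 375)
print(size, box)

# Two toy "images" of two pixels each.
print(subtract_mean([[0, 10], [2, 30]]))
```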
4. THE ARCHITECTURE
The architecture of our network is summarized in Figure 2. It contains eight learned layers: five convolutional and three fully connected. Below, we describe some of the novel or unusual features of our network's architecture. Sections 4.1-4.4 are sorted according to our estimation of their importance, with the most important first.
4.1. Rectified Linear Unit nonlinearity
The standard way to model a neuron's output $f$ as a function of its input $x$ is with $f(x) = \tanh(x)$ or $f(x) = (1 + e^{-x})^{-1}$. In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity $f(x) = \max(0, x)$. Following Nair and Hinton,^{24} we refer to neurons with this nonlinearity as Rectified Linear Units (ReLUs). Deep CNNs with ReLUs train several times faster than their equivalents with tanh units. This is demonstrated in Figure 1, which shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network. This plot shows that we would not have been able to experiment with such large neural networks for this work if we had used traditional saturating neuron models.
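One common way to see the difference between saturating and non-saturating units is to compare their derivatives for large inputs, as in the minimal sketch below. The function names are hypothetical, and the link between small derivatives and slow gradient-descent training is offered as the usual explanation rather than as a result measured here.

```python
import math


def tanh_grad(x):
    """Derivative of tanh(x); approaches 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


def sigmoid(x):
    """Logistic function (1 + e^{-x})^{-1}."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the logistic function; approaches 0 as |x| grows."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    """Rectified linear unit max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of the ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


for x in (0.5, 2.0, 5.0):
    print(x, tanh_grad(x), sigmoid_grad(x), relu_grad(x))
```

For positive inputs the ReLU derivative stays at 1 however large the input becomes, whereas the tanh and logistic derivatives shrink toward 0.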
We are not the first to consider alternatives to traditional neuron models in CNNs. For example, Jarrett et al.^{13} claim that the nonlinearity $f(x) = |\tanh(x)|$ works particularly well with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset. However, on this dataset the primary concern is preventing overfitting, so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.
Figure 1. A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.
4.2. Training on multiple GPUs
A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU. Therefore we spread the net across two GPUs. Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another's memory directly, without going through host machine memory. The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU. Choosing the pattern of connectivity is a problem for cross-validation, but this allows us to precisely tune the amount of communication until it is an acceptable fraction of the amount of computation.
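The connectivity pattern described above can be sketched as follows. The function name, the flag, and the number of kernel maps per GPU are hypothetical and chosen only for illustration; the sketch lists which kernel maps of the previous layer a kernel on a given GPU reads from, and does not perform any computation on a GPU.

```python
def input_maps_for(gpu, maps_per_gpu, num_gpus=2, cross_gpu=True):
    """Indices of the previous layer's kernel maps read by kernels on `gpu`.

    Maps 0 .. maps_per_gpu - 1 are assumed to reside on GPU 0, the next
    maps_per_gpu on GPU 1, and so on.
    """
    if cross_gpu:
        # Layer-3 style: every kernel sees the maps on all GPUs.
        return list(range(maps_per_gpu * num_gpus))
    # Layer-4 style: a kernel sees only the maps on its own GPU.
    start = gpu * maps_per_gpu
    return list(range(start, start + maps_per_gpu))


# Hypothetical layer with 4 kernel maps on each of two GPUs.
print(input_maps_for(gpu=1, maps_per_gpu=4, cross_gpu=True))
print(input_maps_for(gpu=1, maps_per_gpu=4, cross_gpu=False))
```

With same-GPU connectivity each kernel reads half as many input maps, so no activations need to cross between GPUs for that layer.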
The resultant architecture is somewhat similar to that of the "columnar" CNN employed by Cireşan et al.,^{4} except that our columns are not independent (see Figure 2). This scheme reduces our top-1 and top-5 error rates by 1.7% and 1.2%, respectively, as compared with a net with half as many kernels in each convolutional layer trained on one GPU. The two-GPU net takes slightly less time to train than the one-GPU net.^{b}
4.3. Local response normalization
ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, we still find that the following local normalization scheme aids generalization. Denoting by $a_{x,y}^{i}$ the activity of a neuron computed by applying kernel $i$ at position $(x, y)$ and then applying the ReLU nonlinearity, the response-normalized activity $b_{x,y}^{i}$ is given by the expression

$$b_{x,y}^{i} = a_{x,y}^{i} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a_{x,y}^{j} \right)^{2} \right)^{\beta}$$

where the sum runs over $n$ "adjacent" kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer. The ordering of the kernel maps is of course arbitrary and determined before training begins. This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities among neuron outputs computed using different kernels. The constants $k$, $n$, $\alpha$, and $\beta$ are hyperparameters whose values are determined using a validation set; we used $k = 2$, $n = 5$, $\alpha = 10^{-4}$, and $\beta = 0.75$. We applied this normalization after applying the ReLU nonlinearity in certain layers (see Section 4.5).
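A minimal sketch of this normalization at a single spatial position is given below. It assumes that each activity is divided by (k + alpha * S) raised to the power beta, where S is the sum of squared activities over the kernel maps from max(0, i - n/2) to min(N - 1, i + n/2), and it assumes that n/2 is rounded down for odd n. The function name is hypothetical.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize the activities of N kernel maps at one position (x, y).

    `a[i]` is the post-ReLU activity of kernel map i. The default
    hyperparameters are the values quoted in the text.
    """
    num_maps = len(a)
    half = n // 2  # assumption: n/2 rounded down
    b = []
    for i in range(num_maps):
        lo = max(0, i - half)
        hi = min(num_maps - 1, i + half)
        # Sum of squared activities over the adjacent kernel maps.
        s = sum(a[j] ** 2 for j in range(lo, hi + 1))
        b.append(a[i] / (k + alpha * s) ** beta)
    return b


# Toy activities for a hypothetical layer with 5 kernel maps.
activities = [1.0, 1.0, 1.0, 1.0, 1.0]
print(local_response_norm(activities))
```

Because the window is clipped at the first and last kernel maps, maps near the ends of the ordering are normalized by fewer neighbors than maps in the middle.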
This scheme bears some resemblance to the local contrast normalization scheme of Jarrett et al.,^{13} but ours would be more correctly termed "brightness normalization," since we do not subtract the mean activity. Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively. We also verified the effectiveness of this scheme on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test error rate without normalization and 11% with normalization.^{c}
4.4. Overlapping pooling
Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., Refs. 5, 13, 20). To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced $s$ pixels apart, each summarizing a neighborhood of size $z \times z$ centered at the location of the pooling unit. If we set $s = z$, we obtain traditional local pooling as commonly employed in CNNs. If we set $s < z$, we obtain overlapping
^{b} The one-GPU net actually has the same number of kernels as the two-GPU net in the final convolutional layer. This is because most of the net's parameters are in the first fully connected layer, which takes the last convolutional layer as input. So to make the two nets have approximately the same number of parameters, we did not halve the size of the final convolutional layer (nor the fully connected layers which follow). Therefore this comparison is biased in favor of the one-GPU net, since it is bigger than "half the size" of the two-GPU net.
^{c} We cannot describe this network in detail due to space constraints, but it is specified precisely by the code and parameter files provided here: http://code.google.com/p/cuda-convnet/.