Rethinking the Inception Architecture for Computer Vision
Christian Szegedy, Google Inc., szegedy@google.com
Vincent Vanhoucke, vanhoucke@google.com
Sergey Ioffe, sioffe@google.com
Jonathon Shlens, shlens@google.com
Zbigniew Wojna, University College London, zbigniewwojna@gmail.com
Abstract
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim to utilize the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and with less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
1. Introduction
Since the 2012 ImageNet competition [16] winning entry by Krizhevsky et al [9], their network "AlexNet" has been successfully applied to a larger variety of computer vision tasks, for example to object-detection [5], segmentation [12], human pose estimation [22], video classification [8], object tracking [23], and superresolution [3].
These successes spurred a new line of research that focused on finding higher performing convolutional neural networks. Starting in 2014, the quality of network architectures significantly improved by utilizing deeper and wider networks. VGGNet [18] and GoogLeNet [20] yielded similarly high performance in the 2014 ILSVRC [16] classification challenge. One interesting observation was that gains in the classification performance tend to transfer to significant quality gains in a wide variety of application domains. This means that architectural improvements in deep convolutional architecture can be utilized for improving performance for most other computer vision tasks that are increasingly reliant on high quality, learned visual features. Also, improvements in the network quality resulted in new application domains for convolutional networks in cases where AlexNet features could not compete with hand engineered, crafted solutions, e.g. proposal generation in detection [4].
Although VGGNet [18] has the compelling feature of architectural simplicity, this comes at a high cost: evaluating the network requires a lot of computation. On the other hand, the Inception architecture of GoogLeNet [20] was also designed to perform well even under strict constraints on memory and computational budget. For example, GoogLeNet employed only 5 million parameters, which represented a $12\times$ reduction with respect to its predecessor AlexNet, which used 60 million parameters. Furthermore, VGGNet employed about $3\times$ more parameters than AlexNet.
The computational cost of Inception is also much lower than that of VGGNet or its higher performing successors [6]. This has made it feasible to utilize Inception networks in big-data scenarios [17], [13], where huge amounts of data need to be processed at reasonable cost, or in scenarios where memory or computational capacity is inherently limited, for example in mobile vision settings. It is certainly possible to mitigate parts of these issues by applying specialized solutions to target memory use [2], [15] or by optimizing the execution of certain operations via computational tricks [10]. However, these methods add extra complexity. Furthermore, these methods could be applied to optimize the Inception architecture as well, widening the efficiency gap again.
Still, the complexity of the Inception architecture makes
it more difficult to make changes to the network. If the architecture is scaled up naively, large parts of the computational gains can be immediately lost. Also, [20] does not provide a clear description of the contributing factors that led to the various design decisions of the GoogLeNet architecture. This makes it much harder to adapt it to new use-cases while maintaining its efficiency. For example, if it is deemed necessary to increase the capacity of some Inception-style model, the simple transformation of just doubling the number of all filter bank sizes will lead to a $4\times$ increase in both computational cost and number of parameters. This might prove prohibitive or unreasonable in a lot of practical scenarios, especially if the associated gains are modest. In this paper, we start with describing a few general principles and optimization ideas that proved to be useful for scaling up convolutional networks in efficient ways. Although our principles are not limited to Inception-type networks, they are easier to observe in that context as the generic structure of the Inception style building blocks is flexible enough to incorporate those constraints naturally. This is enabled by the generous use of dimensional reduction and parallel structures of the Inception modules which allows for mitigating the impact of structural changes on nearby components. Still, one needs to be cautious about doing so, as some guiding principles should be observed to maintain high quality of the models.
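To make the scaling example above concrete, the following back-of-the-envelope sketch (with assumed layer sizes, not GoogLeNet's actual filter-bank dimensions) counts the weights and multiply-adds of a single convolutional layer before and after doubling all filter banks; since both the input depth and the output depth double, both quantities grow roughly fourfold.

```python
# Rough cost model for one convolutional layer; the sizes below are assumptions
# chosen only to illustrate the 4x scaling, not the paper's actual dimensions.
def conv_layer_cost(height, width, kernel, c_in, c_out):
    """Return (parameters, multiply-adds) of a kernel x kernel convolution."""
    params = kernel * kernel * c_in * c_out
    multiply_adds = height * width * params  # one multiplication per weight per output position
    return params, multiply_adds

params_base, madds_base = conv_layer_cost(35, 35, 3, 288, 288)
params_2x, madds_2x = conv_layer_cost(35, 35, 3, 2 * 288, 2 * 288)
print(params_2x / params_base, madds_2x / madds_base)  # 4.0 4.0
```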
2. General Design Principles
Here we will describe a few design principles based on large-scale experimentation with various architectural choices with convolutional networks. At this point, the utility of the principles below is speculative and additional future experimental evidence will be necessary to assess their accuracy and domain of validity. Still, grave deviations from these principles tended to result in deterioration in the quality of the networks, and fixing situations where those deviations were detected resulted in improved architectures in general.
Avoid representational bottlenecks, especially early in the network. Feed-forward networks can be represented by an acyclic graph from the input layer(s) to the classifier or regressor. This defines a clear direction for the information flow. For any cut separating the inputs from the outputs, one can assess the amount of information passing through the cut. One should avoid bottlenecks with extreme compression. In general the representation size should gently decrease from the inputs to the outputs before reaching the final representation used for the task at hand. Theoretically, information content cannot be assessed merely by the dimensionality of the representation as it discards important factors like correlation structure; the dimensionality merely provides a rough estimate of information content.
Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster.
Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power. For example, before performing a more spread out (e.g. $3 \times 3$) convolution, one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects. We hypothesize that the reason for this is that the strong correlation between adjacent units results in much less loss of information during dimension reduction, if the outputs are used in a spatial aggregation context. Given that these signals should be easily compressible, the dimension reduction even promotes faster learning.
Balance the width and depth of the network. Optimal performance of the network can be reached by balancing the number of filters per stage and the depth of the network. Increasing both the width and the depth of the network can contribute to higher quality networks. However, the optimal improvement for a constant amount of computation can be reached if both are increased in parallel. The computational budget should therefore be distributed in a balanced way between the depth and width of the network.
Although these principles might make sense, it is not straightforward to use them to improve the quality of networks out of the box. The idea is to use them judiciously in ambiguous situations only.
3. Factorizing Convolutions with Large Filter Size
Much of the original gains of the GoogLeNet network [20] arise from a very generous use of dimension reduction. This can be viewed as a special case of factorizing convolutions in a computationally efficient manner. Consider for example the case of a $1 \times 1$ convolutional layer followed by a $3 \times 3$ convolutional layer. In a vision network, it is expected that the outputs of nearby activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.
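As a concrete illustration of this kind of factorization, the sketch below (channel counts are illustrative assumptions, not the exact filter-bank sizes of any Inception module) reduces a 256-deep representation to 64 channels with a $1 \times 1$ convolution before the $3 \times 3$ spatial aggregation, and compares the resulting multiply-adds against a direct $3 \times 3$ convolution on the full depth.

```python
# A minimal sketch (assumed sizes): a 1x1 "reduction" convolution followed by a
# 3x3 convolution, versus a direct 3x3 convolution on the full 256-deep input.
import jax

key = jax.random.PRNGKey(0)
k_x, k_r, k_3 = jax.random.split(key, 3)

x = jax.random.normal(k_x, (1, 256, 35, 35))        # NCHW activations
w_reduce = jax.random.normal(k_r, (64, 256, 1, 1))  # 1x1: 256 -> 64 channels
w_3x3 = jax.random.normal(k_3, (192, 64, 3, 3))     # 3x3: 64 -> 192 channels

h = jax.nn.relu(jax.lax.conv(x, w_reduce, (1, 1), 'SAME'))  # cheap reduction
y = jax.nn.relu(jax.lax.conv(h, w_3x3, (1, 1), 'SAME'))     # spatial aggregation
print(y.shape)  # (1, 192, 35, 35)

# Multiply-adds per spatial position: factorized vs direct 3x3 on 256 channels.
factorized = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 192
direct = 3 * 3 * 256 * 192
print(factorized, direct)  # 126976 vs 442368, roughly a 3.5x saving
```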
Here we explore other ways of factorizing convolutions in various settings, especially in order to increase the computational efficiency of the solution. Since Inception networks are fully convolutional, each weight corresponds to one multiplication per activation. Therefore, any reduction in computational cost results in a reduced number of parameters. This means that with suitable factorization, we can end up with more disentangled parameters and therefore with faster training. Also, we can use the computational and memory savings to increase the filter-bank sizes of our network while maintaining our ability to train each model replica on a single computer.

Figure 1. Mini-network replacing the $5 \times 5$ convolutions.
3.1. Factorization into smaller convolutions
Convolutions with larger spatial filters (e.g. $5 \times 5$ or $7 \times 7$) tend to be disproportionally expensive in terms of computation. For example, a $5 \times 5$ convolution with $n$ filters over a grid with $m$ filters is $25/9 = 2.78$ times more computationally expensive than a $3 \times 3$ convolution with the same number of filters. Of course, a $5 \times 5$ filter can capture dependencies between signals of activations of units further away in the earlier layers, so a reduction of the geometric size of the filters comes at a large cost of expressiveness. However, we can ask whether a $5 \times 5$ convolution could be replaced by a multi-layer network with fewer parameters, the same input size and the same output depth. If we zoom into the computation graph of the $5 \times 5$ convolution, we see that each output looks like a small fully-connected network sliding over $5 \times 5$ tiles over its input (see Figure 1). Since we are constructing a vision network, it seems natural to exploit translation invariance again and replace the fully connected component by a two layer convolutional architecture: the first layer is a $3 \times 3$ convolution, the second is a fully connected layer on top of the $3 \times 3$ output grid of the first layer (see Figure 1). Sliding this small network over the input activation grid boils down to replacing the $5 \times 5$ convolution with two layers of $3 \times 3$ convolution (compare Figure 4 with 5).
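A minimal sketch of this replacement is given below, assuming stride-1, 'SAME'-padded convolutions and illustrative channel counts (not the exact filter-bank dimensions of the paper's networks); two stacked $3 \times 3$ convolutions cover the same $5 \times 5$ receptive field as the original layer while using fewer weights.

```python
# Sketch of the factorization of Figure 1 (assumed sizes, alpha = 1).
import jax

key = jax.random.PRNGKey(0)
k_x, k_5, k_a, k_b = jax.random.split(key, 4)

m = n = 64                                   # input / output channels
x = jax.random.normal(k_x, (1, m, 17, 17))   # NCHW activations

# Original: a single 5x5 convolution with 5*5*m*n weights.
w_5x5 = jax.random.normal(k_5, (n, m, 5, 5))
y_5x5 = jax.nn.relu(jax.lax.conv(x, w_5x5, (1, 1), 'SAME'))

# Replacement: two stacked 3x3 convolutions with (9 + 9)*m*n weights,
# covering the same 5x5 receptive field.
w_a = jax.random.normal(k_a, (n, m, 3, 3))
w_b = jax.random.normal(k_b, (n, n, 3, 3))
h = jax.nn.relu(jax.lax.conv(x, w_a, (1, 1), 'SAME'))
y_3x3 = jax.nn.relu(jax.lax.conv(h, w_b, (1, 1), 'SAME'))

print(y_5x5.shape, y_3x3.shape)         # both (1, 64, 17, 17)
print(w_5x5.size, w_a.size + w_b.size)  # 102400 vs 73728 weights
```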
This setup clearly reduces the parameter count by sharing the weights between adjacent tiles. To analyze the expected computational cost savings, we will make a few simplifying assumptions that apply for the typical situations: We can assume that $n = \alpha m$, that is, that we want to change the number of activations per unit by a constant factor $\alpha$. Since the $5 \times 5$ convolution is aggregating, $\alpha$ is typically slightly larger than one (around 1.5 in the case of GoogLeNet). Having a two layer replacement for the $5 \times 5$ layer, it seems reasonable to reach this expansion in two steps: increasing the number of filters by $\sqrt{\alpha}$ in both steps. In order to simplify our estimate, we choose $\alpha = 1$ (no expansion). If we would naively slide a network without reusing the computation between neighboring grid tiles, we would increase the computational cost. Sliding this network can instead be represented by two $3 \times 3$ convolutional layers which reuse the activations between adjacent tiles. This way, we end up with a net $\frac{9+9}{25}\times$ reduction of computation, resulting in a relative gain of 28% by this factorization. The exact same saving holds for the parameter count as each parameter is used exactly once in the computation of the activation of each unit. Still, this setup raises two general questions: Does this replacement result in any loss of expressiveness? If our main goal is to factorize the linear part of the computation, would it not suggest to keep linear activations in the first layer? We have run several control experiments (for example see Figure 2) and using linear activation was always inferior to using rectified linear units in all stages of the factorization. We attribute this gain to the enhanced space of variations that the network can learn, especially if we batch-normalize [7] the output activations. One can see similar effects when using linear activations for the dimension reduction components.

Figure 2. One of several control experiments between two Inception models, one of them uses factorization into linear + ReLU layers, the other uses two ReLU layers. After 3.86 million operations, the former settles at 76.2%, while the latter reaches 77.2% top-1 accuracy on the validation set.
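Spelling out the cost estimate from the preceding paragraph, the short sketch below (with an assumed channel count $m$) compares the per-output-position multiply-adds of a $5 \times 5$ convolution that expands the depth by $\alpha$ against two $3 \times 3$ convolutions that each expand it by $\sqrt{\alpha}$.

```python
# Per-output-position multiply-adds of the original layer vs its two-step
# replacement; m is an assumed channel count, alpha the depth expansion factor.
import math

def cost_5x5(m, alpha):
    return 5 * 5 * m * (alpha * m)

def cost_two_3x3(m, alpha):
    mid = math.sqrt(alpha) * m
    return 3 * 3 * m * mid + 3 * 3 * mid * (alpha * m)

for alpha in (1.0, 1.5):
    m = 64
    ratio = cost_two_3x3(m, alpha) / cost_5x5(m, alpha)
    print(alpha, round(ratio, 3))  # 1.0 -> 0.72 (the 28% saving); 1.5 -> ~0.735
```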
3.2. Spatial Factorization into Asymmetric Convolutions
The above results suggest that convolutions with filters larger than $3 \times 3$ might not be generally useful as they can always be reduced into a sequence of $3 \times 3$ convolutional