
FaceNet: A Unified Embedding for Face Recognition and Clustering

Florian Schroff, fschroff@google.com, Google Inc.

Dmitry Kalenichenko, dkalenichenko@google.com, Google Inc.

James Philbin, jphilbin@google.com, Google Inc.

Abstract

Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale still presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face.

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.

We also introduce the concept of harmonic embeddings and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.

1. Introduction

In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity:
Figure 1. Illumination and pose invariance. Pose and illumination have been long standing problems in face recognition. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations. A distance of 0.0 means the faces are identical, 4.0 corresponds to the opposite spectrum, two different identities. You can see that a threshold of 1.1 would correctly classify every pair.
faces of the same person have small distances and faces of distinct people have large distances.
Once this embedding has been produced, the aforementioned tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
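As a concrete illustration of how the three tasks reduce to simple operations on the learned embeddings, the following NumPy sketch is purely illustrative and not the authors' implementation; the helper names and the 1.1 threshold borrowed from Figure 1 are assumptions.

```python
import numpy as np

def verify(emb_a, emb_b, threshold=1.1):
    """Face verification: same identity iff the (squared) distance is below a threshold."""
    return float(np.sum((emb_a - emb_b) ** 2)) <= threshold

def identify(query_emb, gallery_embs, gallery_labels, k=3):
    """Face recognition as a k-NN classification problem in embedding space."""
    dists = np.sum((gallery_embs - query_emb) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [gallery_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)        # majority vote

def cluster(embs, n_people):
    """Clustering with off-the-shelf k-means (agglomerative clustering is used in Sec. 5.8)."""
    from scipy.cluster.vq import kmeans2
    _, assignments = kmeans2(np.asarray(embs, dtype=float), n_people, minit="++")
    return assignments
```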
Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (thousands of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.
In contrast to these approaches, FaceNet directly trains its output to be a compact 128-dimensional embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area, with no 2D or 3D alignment other than scale and translation.
Choosing which triplets to use turns out to be very important for achieving good performance. Inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
As an illustration of the incredible variability that our method can handle, see Figure 1. Shown are image pairs from PIE [13] that were previously considered to be very difficult for face verification systems.
An overview of the rest of the paper is as follows: Section 2 reviews the literature in this area; Section 3.1 defines the triplet loss and Section 3.2 describes our novel triplet selection and training procedure; Section 3.3 describes the model architectures used. Finally, in Sections 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.
Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.
In this paper we explore two different deep network architectures that have recently achieved great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model, which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. Inspired by [9], we additionally add several 1×1×d convolutional layers.
The second architecture is based on the Inception model of Szegedy et al., which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
2. Related Work

There is a vast corpus of face verification and recognition work. Reviewing it is out of the scope of this paper, so we will only briefly discuss the most relevant recent work.
The works of [15, 17, 23] all employ a complex system of multiple stages, combining the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.
Zhenyao et al. [23] employ a deep network to "warp" faces into a canonical frontal view and then learn a CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.
Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so-called Siamese network where they directly optimize the $L_1$-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the $\chi^2$ kernel) of those networks are combined using a non-linear SVM.
Sun et al. [14, 15] propose a compact and therefore relatively cheap-to-compute network. They use an ensemble of 25 of these networks, each operating on a different face patch. For their final performance on LFW (99.47% [15]) the authors combine 50 responses (regular and flipped). Both PCA and a Joint Bayesian model [2], which effectively correspond to a linear transform in the embedding space, are employed. Their method does not require explicit 2D/3D alignment. The networks are trained with a combination of classification and verification loss. The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the $L_2$-distance between faces of the same identity and enforces a margin between the distance of faces of different identities. The main difference is that only pairs of images are compared, whereas the triplet loss encourages a relative distance constraint.
A loss similar to the one used here was explored in Wang et al. [18] for ranking images by semantic and visual similarity.
Figure 2. Model structure. Our network consists of a batch input layer and a deep CNN followed by $L_2$ normalization, which results in the face embedding. This is followed by the triplet loss during training.
Figure 3. The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

3. Method

FaceNet uses a deep convolutional network. We discuss two different core architectures: the Zeiler&Fergus [22] style networks and the recent Inception [16] type networks. The details of these networks are described in Section 3.3.
Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss, which directly reflects what we want to achieve in face verification, recognition and clustering. Namely, we strive for an embedding $f(x)$, from an image $x$ into a feature space $\mathbb{R}^d$, such that the squared distance between all faces of the same identity is small, independent of imaging conditions, whereas the squared distance between a pair of face images from different identities is large.
Although we did not directly compare to other losses, e.g. the one using pairs of positives and negatives as in Eq. (2) of [14], we believe that the triplet loss is more suitable for face verification. The motivation is that the loss of [14] encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person and all other faces. This allows the faces of one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities.
The following section describes this triplet loss and how it can be learned efficiently at scale.

3.1. Triplet Loss

The embedding is represented by $f(x) \in \mathbb{R}^d$. It embeds an image $x$ into a $d$-dimensional Euclidean space. Additionally, we constrain this embedding to live on the $d$-dimensional hypersphere, i.e. $\|f(x)\|_2 = 1$. This loss is
motivated in [19] in the context of nearest-neighbor classification. Here we want to ensure that an image $x_i^a$ (anchor) of a specific person is closer to all other images $x_i^p$ (positive) of the same person than it is to any image $x_i^n$ (negative) of any other person. This is visualized in Figure 3.
Thus we want
$$\left\|f(x_i^a) - f(x_i^p)\right\|_2^2 + \alpha < \left\|f(x_i^a) - f(x_i^n)\right\|_2^2, \quad \forall\left(f(x_i^a), f(x_i^p), f(x_i^n)\right) \in \mathcal{T}. \tag{1}$$
Here $\alpha$ is a margin that is enforced between positive and negative pairs, and $\mathcal{T}$ is the set of all possible triplets in the training set, which has cardinality $N$.
The loss that is then being minimized is
$$L = \sum_i^N \left[\left\|f(x_i^a) - f(x_i^p)\right\|_2^2 - \left\|f(x_i^a) - f(x_i^n)\right\|_2^2 + \alpha\right]_+ . \tag{2}$$
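For concreteness, a minimal NumPy sketch of the loss in Eq. (2) could look as follows; the batch layout (one row per triplet) and the function name are assumptions, not the authors' code.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss of Eq. (2) over a batch of triplets.

    anchor, positive, negative: arrays of shape (N, 128) holding the
    L2-normalized embeddings f(x_i^a), f(x_i^p), f(x_i^n) of triplet i.
    """
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)   # ||f(x_a) - f(x_p)||_2^2
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)   # ||f(x_a) - f(x_n)||_2^2
    # The hinge [.]_+ zeroes out easy triplets that already satisfy Eq. (1).
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))
```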
Generating all possible triplets would result in many triplets that are easily satisfied (i.e. fulfill the constraint in Eq. (1)). These triplets would not contribute to the training and result in slower convergence, as they would still be passed through the network. It is crucial to select hard triplets that are active and can therefore contribute to improving the model. The following section describes the different approaches we use for triplet selection.

3.2. Triplet Selection

In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (1). This means that, given $x_i^a$, we want to select an $x_i^p$ (hard positive) such that $\operatorname{argmax}_{x_i^p}\|f(x_i^a) - f(x_i^p)\|_2^2$ and similarly an $x_i^n$ (hard negative) such that $\operatorname{argmin}_{x_i^n}\|f(x_i^a) - f(x_i^n)\|_2^2$.
It is infeasible to compute the argmin and argmax across the whole training set. Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:
• Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
• Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.
Here, we focus on the online generation and use large mini-batches in the order of a few thousand exemplars and only compute the argmin and argmax within a mini-batch. 
To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch. In our experiments we sample the training data such that around 40 faces are selected per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.
Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. We don't have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive method was more stable and converged slightly faster at the beginning of training.
We also explored the offline generation of triplets in conjunction with the online generation and it may allow the use of smaller batch sizes, but the experiments were inconclusive. 
Selecting the hardest negatives can in practice lead to bad local minima early on in training, specifically it can result in a collapsed model (i.e. $f(x) = 0$). In order to mitigate this, it helps to select $x_i^n$ such that
$$\left\|f(x_i^a) - f(x_i^p)\right\|_2^2 < \left\|f(x_i^a) - f(x_i^n)\right\|_2^2. \tag{3}$$
We call these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance. Those negatives lie inside the margin $\alpha$.
As mentioned before, correct triplet selection is crucial for fast convergence. On the one hand we would like to use small mini-batches as these tend to improve convergence during Stochastic Gradient Descent (SGD) [20]. On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient. The main constraint with regards to the batch size, however, is the way we select hard relevant triplets from within the mini-batches. In most experiments we use a batch size of around 1,800 exemplars. 
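The in-batch selection of semi-hard negatives described above can be sketched as follows; this is an illustrative NumPy version that assumes the whole mini-batch has already been embedded, and the fall-back to the hardest negative when no semi-hard candidate exists is our own simplification.

```python
import numpy as np

def semi_hard_negative(embeddings, labels, a, p, alpha=0.2):
    """Pick a semi-hard negative index for the anchor-positive pair (a, p).

    embeddings: (B, 128) mini-batch embeddings, labels: (B,) identity labels.
    Semi-hard means farther from the anchor than the positive (Eq. (3)) but
    still within the margin alpha.
    """
    d = np.sum((embeddings - embeddings[a]) ** 2, axis=1)   # distances to the anchor
    neg_mask = labels != labels[a]                          # different identities only
    semi = neg_mask & (d > d[p]) & (d < d[p] + alpha)
    candidates = np.where(semi)[0] if semi.any() else np.where(neg_mask)[0]
    return int(candidates[np.argmin(d[candidates])])        # hardest among the candidates
```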

3.3. Deep Convolutional Networks 

In all our experiments we train the CNN using Stochastic Gradient Descent (SGD) with standard backprop [8, 11] and AdaGrad [5]. In most experiments we start with a learning rate of 0.05 which we lower to finalize the model. The models are initialized from random, similar to [16], and trained on a CPU cluster for 1,000 to 2,000 hours. The decrease in the loss (and increase in accuracy) slows down drastically after 500h of training, but additional training can still significantly improve performance. The margin $\alpha$ is set to 0.2.
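For reference, a minimal sketch of the AdaGrad [5] per-parameter update is given below; the learning rate 0.05 comes from the text, while the epsilon constant is a conventional numerical-stability assumption. This is not the DistBelief training code.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.05, eps=1e-8):
    """One AdaGrad update: scale each weight's step by the root of its
    accumulated squared gradients."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum
```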
We used two types of architectures and explore their trade-offs in more detail in the experimental section. Their practical differences lie in the difference of parameters and FLOPS. The best model may be different depending on the application. E.g. a model running in a datacenter can have many parameters and require a large number of FLOPS, whereas a model running on a mobile phone needs to have few parameters, so that it can fit into memory. All our 
| layer | size-in | size-out | kernel | params | FLOPS |
| :--- | :--- | :--- | :--- | :--- | :--- |
| conv1 | 220×220×3 | 110×110×64 | 7×7×3, 2 | 9K | 115M |
| pool1 | 110×110×64 | 55×55×64 | 3×3×64, 2 | 0 | |
| rnorm1 | 55×55×64 | 55×55×64 | | 0 | |
| conv2a | 55×55×64 | 55×55×64 | 1×1×64, 1 | 4K | 13M |
| conv2 | 55×55×64 | 55×55×192 | 3×3×64, 1 | 111K | 335M |
| rnorm2 | 55×55×192 | 55×55×192 | | 0 | |
| pool2 | 55×55×192 | 28×28×192 | 3×3×192, 2 | 0 | |
| conv3a | 28×28×192 | 28×28×192 | 1×1×192, 1 | 37K | 29M |
| conv3 | 28×28×192 | 28×28×384 | 3×3×192, 1 | 664K | 521M |
| pool3 | 28×28×384 | 14×14×384 | 3×3×384, 2 | 0 | |
| conv4a | 14×14×384 | 14×14×384 | 1×1×384, 1 | 148K | 29M |
| conv4 | 14×14×384 | 14×14×256 | 3×3×384, 1 | 885K | 173M |
| conv5a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M |
| conv5 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M |
| conv6a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M |
| conv6 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M |
| pool4 | 14×14×256 | 7×7×256 | 3×3×256, 2 | 0 | |
| concat | 7×7×256 | 7×7×256 | | 0 | |
| fc1 | 7×7×256 | 1×32×128 | maxout p=2 | 103M | 103M |
| fc2 | 1×32×128 | 1×32×128 | maxout p=2 | 34M | 34M |
| fc7128 | 1×32×128 | 1×1×128 | | 524K | 0.5M |
| L2 | 1×1×128 | 1×1×128 | | 0 | |
| total | | | | 140M | 1.6B |
Table 1. NN1. This table shows the structure of our Zeiler&Fergus [22] based model with 1×1 convolutions inspired by [9]. The input and output sizes are described in rows × cols × #filters. The kernel is specified as rows × cols, stride, and the maxout [6] pooling size as p = 2.
models use rectified linear units as the non-linear activation function. 
The first category, shown in Table 1, adds 1×1×d convolutional layers, as suggested in [9], between the standard convolutional layers of the Zeiler&Fergus [22] architecture and results in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.
The second category we use is based on GoogLeNet style Inception models [16]. These models have 20× fewer parameters (around 6.6M-7.5M) and up to 5× fewer FLOPS (between 500M-1.6B). Some of these models are dramatically reduced in size (both depth and number of filters), so that they can be run on a mobile phone. One, NNS1, has 26M parameters and only requires 220M FLOPS per image. The other, NNS2, has 4.3M parameters and 20M FLOPS. Table 2 describes NN2, our largest network, in detail. NN3 is identical in architecture but has a reduced input size of 160×160. NN4 has an input size of only 96×96, thereby drastically reducing the CPU requirements (285M FLOPS vs 1.6B for NN2). In addition to the reduced input size it does not use 5×5 convolutions in the higher layers as the receptive field is already too small by then. Generally we found that the 5×5 convolutions can be removed throughout
with only a minor drop in accuracy. Figure 4 compares all our models. 

4. Datasets and Evaluation 

We evaluate our method on four datasets and, with the exception of Labelled Faces in the Wild and YouTube Faces, we evaluate our method on the face verification task. I.e. given a pair of two face images, a squared $L_2$ distance threshold $D(x_i, x_j)$ is used to determine the classification of same and different. All face pairs $(i, j)$ of the same identity are denoted with $\mathcal{P}_{\text{same}}$, whereas all pairs of different identities are denoted with $\mathcal{P}_{\text{diff}}$.
We define the set of all true accepts as 
$$\mathrm{TA}(d) = \left\{(i, j) \in \mathcal{P}_{\text{same}},\ \text{with}\ D(x_i, x_j) \leq d\right\}$$
These are the face pairs $(i, j)$ that were correctly classified as same at threshold $d$. Similarly,
$$\mathrm{FA}(d) = \left\{(i, j) \in \mathcal{P}_{\text{diff}},\ \text{with}\ D(x_i, x_j) \leq d\right\}$$
is the set of all pairs that were incorrectly classified as same (false accept).
The validation rate $\operatorname{VAL}(d)$ and the false accept rate $\operatorname{FAR}(d)$ for a given face distance $d$ are then defined as
$$\operatorname{VAL}(d) = \frac{|\mathrm{TA}(d)|}{|\mathcal{P}_{\text{same}}|}, \quad \operatorname{FAR}(d) = \frac{|\mathrm{FA}(d)|}{|\mathcal{P}_{\text{diff}}|}$$
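These two rates can be computed directly from the pairwise distances. The following NumPy sketch (illustrative names, not the evaluation code used here) mirrors the definitions above.

```python
import numpy as np

def val_far(dists, same, d):
    """VAL(d) and FAR(d) from pairwise squared distances.

    dists: (P,) array with D(x_i, x_j) for every evaluated pair.
    same:  (P,) boolean array, True for pairs in P_same, False for P_diff.
    """
    accept = dists <= d
    val = np.sum(accept & same) / np.sum(same)      # |TA(d)| / |P_same|
    far = np.sum(accept & ~same) / np.sum(~same)    # |FA(d)| / |P_diff|
    return val, far
```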

4.1. Hold-out Test Set

We keep a hold-out set of around one million images that has the same distribution as our training set, but disjoint identities. For evaluation we split it into five disjoint sets of 200k images each. The FAR and VAL rates are then computed on 100k × 100k image pairs. Standard error is reported across the five splits.
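A small illustrative sketch of how the reported numbers are aggregated across the five splits, assuming one validation rate per split:

```python
import numpy as np

def mean_and_sem(split_val_rates):
    """Mean validation rate and its standard error across the evaluation splits."""
    v = np.asarray(split_val_rates, dtype=float)
    return v.mean(), v.std(ddof=1) / np.sqrt(v.size)
```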

4.2. Personal Photos

This is a test set with a similar distribution to our training set, but which has been manually verified to have very clean labels. It consists of three personal photo collections with a total of around 12k images. We compute the FAR and VAL rates across all 12k squared pairs of images.

4.3. Academic Datasets

Labeled Faces in the Wild (LFW) is the de-facto academic test set for face verification [7]. We follow the standard protocol for unrestricted, labeled outside data and report the mean classification accuracy as well as the standard error of the mean.
Youtube Faces DB [21] is a new dataset that has gained popularity in the face recognition community [17, 15]. The setup is similar to LFW, but instead of pairs of images, pairs of videos are used.
Figure 4. FLOPS vs. Accuracy trade-off. Shown is the trade-off between FLOPS and accuracy for a wide range of different model sizes and architectures. Highlighted are the four models that we focus on in our experiments.

5. Experiments

If not mentioned otherwise we use between 100M-200M training face thumbnails consisting of about 8M different identities. A face detector is run on each image and a tight bounding box around each face is generated. These face thumbnails are resized to the input size of the respective network. Input sizes range from 96×96 pixels to 224×224 pixels in our experiments.

5.1. Computation Accuracy Trade-off

Before diving into the details of more specific experiments, we discuss the trade-off between the accuracy of a particular model and the number of FLOPS it requires. Figure 4 shows the FLOPS on the x-axis and the accuracy at 0.001 false accept rate (FAR) on our user-labelled test set from Section 4.2. It is interesting to see the strong correlation between the computation a model requires and the accuracy it achieves. The figure highlights the five models (NN1, NN2, NN3, NNS1, NNS2) that we discuss in more detail in our experiments.
We also looked into the accuracy trade-off with regards to the number of model parameters. However, the picture is not as clear in that case. For example, the Inception based model NN2 achieves a comparable performance to NN1, but only has a 20th of the parameters; the number of FLOPS is comparable, though. Obviously at some point the performance is expected to decrease if the number of parameters is reduced further. Other model architectures may allow further reductions without loss of accuracy, just like Inception [16] did in this case.
| type | output size | depth | #1×1 | #3×3 reduce | #3×3 | #5×5 reduce | #5×5 | pool proj (p) | params | FLOPS |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| conv1 (7×7×3, 2) | 112×112×64 | 1 | | | | | | | 9K | 119M |
| max pool + norm | 56×56×64 | 0 | | | | | | m 3×3, 2 | | |
| inception (2) | 56×56×192 | 2 | | 64 | 192 | | | | 115K | 360M |
| norm + max pool | 28×28×192 | 0 | | | | | | m 3×3, 2 | | |
| inception (3a) | 28×28×256 | 2 | 64 | 96 | 128 | 16 | 32 | m, 32p | 164K | 128M |
| inception (3b) | 28×28×320 | 2 | 64 | 96 | 128 | 32 | 64 | L2, 64p | 228K | 179M |
| inception (3c) | 14×14×640 | 2 | 0 | 128 | 256, 2 | 32 | 64, 2 | m 3×3, 2 | 398K | 108M |
| inception (4a) | 14×14×640 | 2 | 256 | 96 | 192 | 32 | 64 | L2, 128p | 545K | 107M |
| inception (4b) | 14×14×640 | 2 | 224 | 112 | 224 | 32 | 64 | L2, 128p | 595K | 117M |
| inception (4c) | 14×14×640 | 2 | 192 | 128 | 256 | 32 | 64 | L2, 128p | 654K | 128M |
| inception (4d) | 14×14×640 | 2 | 160 | 144 | 288 | 32 | 64 | L2, 128p | 722K | 142M |
| inception (4e) | 7×7×1024 | 2 | 0 | 160 | 256, 2 | 64 | 128, 2 | m 3×3, 2 | 717K | 56M |
| inception (5a) | 7×7×1024 | 2 | 384 | 192 | 384 | 48 | 128 | L2, 128p | 1.6M | 78M |
| inception (5b) | 7×7×1024 | 2 | 384 | 192 | 384 | 48 | 128 | m, 128p | 1.6M | 78M |
| avg pool | 1×1×1024 | 0 | | | | | | | | |
| fully conn | 1×1×128 | 1 | | | | | | | 131K | 0.1M |
| L2 normalization | 1×1×128 | 0 | | | | | | | | |
| total | | | | | | | | | 7.5M | 1.6B |
Table 2. NN2. Details of the NN2 Inception incarnation. This model is almost identical to the one described in [16]. The two major differences are the use of $L_2$ pooling instead of max pooling (m), where specified. I.e. instead of taking the spatial max, the $L_2$ norm is computed. The pooling is always 3×3 (aside from the final average pooling) and in parallel to the convolutional modules inside each Inception module. If there is a dimensionality reduction after the pooling it is denoted with p. 1×1, 3×3, and 5×5 pooling are then concatenated to get the final output.
Figure 5. Network Architectures. This plot shows the complete ROC curves of the four different models from Section 4.2 on our personal photos test set. The sharp drop at 10E-4 FAR can be explained by noise in the groundtruth labels. The models in order of performance are: NN2: Inception based model with 224×224 input; NN1: Zeiler&Fergus based network with 1×1 convolutions; NNS1: small Inception style model with only 220M FLOPS; NNS2: tiny Inception model with only 20M FLOPS.
| architecture | VAL |
| :--- | :---: |
| NN1 (Zeiler&Fergus 220×220) | 87.9% ± 1.9 |
| NN2 (Inception 224×224) | 89.4% ± 1.6 |
| NN3 (Inception 160×160) | 88.3% ± 1.7 |
| NN4 (Inception 96×96) | 82.0% ± 2.3 |
| NNS1 (mini Inception 165×165) | 82.4% ± 2.4 |
| NNS2 (tiny Inception 140×116) | 51.9% ± 2.9 |
Table 3. Network Architectures. This table compares the performance of our model architectures on the hold-out test set (see Section 4.1). Reported is the mean validation rate VAL at 10E-3 false accept rate, along with the standard error of the mean across the five test splits.

5.2. Effect of CNN Model

We now discuss the performance of our four selected models in more detail. On the one hand we have the traditional Zeiler&Fergus based architecture with 1×1 convolutions [22, 9] (see Table 1). On the other hand we have Inception [16] based models that dramatically reduce the model size. Overall, in the final performance the top models of both architectures perform comparably. However, some of our Inception based models, such as NN3, still achieve good performance while significantly reducing both the FLOPS and the model size.
The detailed evaluation on our personal photos test set is
| jpeg q | val-rate |
| :---: | :---: |
| 10 | 67.3% |
| 20 | 81.4% |
| 30 | 83.9% |
| 50 | 85.5% |
| 70 | 86.1% |
| 90 | 86.5% |

| #pixels | val-rate |
| :---: | :---: |
| 1,600 | 37.8% |
| 6,400 | 79.5% |
| 14,400 | 84.5% |
| 25,600 | 85.7% |
| 65,536 | 86.4% |
Table 4. Image Quality. The table on the left shows the effect on the validation rate at 10E-3 precision with varying JPEG quality. The one on the right shows how the image size in pixels affects the validation rate at 10E-3 precision. This experiment was done with NN1 on the first split of our test hold-out dataset.
| #dims | VAL |
| :---: | :---: |
| 64 | 86.8% ± 1.7 |
| 128 | 87.9% ± 1.9 |
| 256 | 87.7% ± 1.9 |
| 512 | 85.6% ± 2.0 |
Table 5. Embedding Dimensionality. This table compares the effect of the embedding dimensionality of our model NN1 on the hold-out set from Section 4.1. In addition to the VAL at 10E-3 we also show the standard error of the mean computed across five splits.
shown in Figure 5. While the largest model achieves a dramatic improvement in accuracy compared to the tiny NNS2, the latter can run at 30ms / image on a mobile phone and is still accurate enough to be used in face clustering. The sharp drop in the ROC for FAR $< 10^{-4}$ indicates noisy labels in the test data groundtruth. At extremely low false accept rates a single mislabeled image can have a significant impact on the curve.

5.3. Sensitivity to Image Quality

Table 4 shows the robustness of our model across a wide range of image sizes. The network is surprisingly robust with respect to JPEG compression and performs very well down to a JPEG quality of 20. The performance drop is very small for face thumbnails down to a size of 120×120 pixels, and even at 80×80 pixels it shows acceptable performance. This is notable, because the network was trained on 220×220 input images. Training with lower resolution faces could improve this range further.
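A hedged sketch of how such a robustness probe could be reproduced with Pillow; the paper does not specify the tooling, so the library choice and parameters below are assumptions.

```python
import io
from PIL import Image

def degrade(path, jpeg_quality=20, size=(80, 80)):
    """Downscale a face thumbnail and re-encode it at a low JPEG quality."""
    img = Image.open(path).convert("RGB").resize(size)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=jpeg_quality)
    buf.seek(0)
    return Image.open(buf)
```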

5.4. Embedding Dimensionality

We explored various embedding dimensionalities and selected 128 for all experiments other than the comparison reported in Table 5. One would expect the larger embeddings to perform at least as well as the smaller ones, however, it is possible that they require more training to achieve the same accuracy. That said, the differences in the performance
| #training images | VAL |
| :---: | :---: |
| 2,600,000 | 76.3% |
| 26,000,000 | 85.1% |
| 52,000,000 | 85.1% |
| 260,000,000 | 86.2% |
Table 6. Training Data Size. This table compares the performance after 700h of training for a smaller model with 96×96 pixel inputs. The model architecture is similar to NN2, but without the 5×5 convolutions in the Inception modules.
reported in Table 5 are statistically insignificant.
It should be noted that during training a 128 dimensional float vector is used, but it can be quantized to 128 bytes without loss of accuracy. Thus each face is compactly represented by a 128 dimensional byte vector, which is ideal for large scale clustering and recognition. Smaller embeddings are possible at a minor loss of accuracy and could be employed on mobile devices.
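The paper does not state the quantization scheme, so the sketch below simply maps each L2-normalized coordinate uniformly from [-1, 1] to one unsigned byte; it is an assumption meant only to illustrate the 128-byte representation.

```python
import numpy as np

def quantize(embedding):
    """Map each L2-normalized coordinate from [-1, 1] to one unsigned byte."""
    q = np.clip(np.round((embedding + 1.0) * 127.5), 0, 255)
    return q.astype(np.uint8)                 # 128 bytes per face

def dequantize(q):
    return q.astype(np.float32) / 127.5 - 1.0
```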

5.5. Amount of Training Data 

Table 6 shows the impact of large amounts of training data. Due to time constraints this evaluation was run on a smaller model; the effect may be even larger on larger models. It is clear that using tens of millions of exemplars results in a clear boost of accuracy on our personal photo test set from section 4.2. Compared to only millions of images the relative reduction in error is 60%. Using another order of magnitude more images (hundreds of millions) still gives a small boost, but the improvement tapers off.

5.6. Performance on LFW

We evaluate our model on LFW using the standard protocol for unrestricted, labeled outside data. Nine training splits are used to select the $L_2$-distance threshold. Classification (same or different) is then performed on the tenth test split. The selected optimal threshold is 1.242 for all test splits except split eight (1.256).
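The leave-one-split-out threshold selection can be sketched as follows; the candidate grid is an assumption and the code is illustrative, not the evaluation harness used here.

```python
import numpy as np

def best_threshold(dists, same, candidates=None):
    """Pick the squared-L2 threshold maximizing accuracy on the training splits."""
    if candidates is None:
        candidates = np.arange(0.5, 2.0, 0.002)      # illustrative search grid
    accs = [np.mean((dists <= t) == same) for t in candidates]
    return float(candidates[int(np.argmax(accs))])

def evaluate_fold(train_dists, train_same, test_dists, test_same):
    t = best_threshold(train_dists, train_same)      # e.g. 1.242 in our runs
    return t, float(np.mean((test_dists <= t) == test_same))
```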
Our model is evaluated in two modes:
1. Fixed center crop of the LFW provided thumbnail.
  2. A proprietary face detector (similar to Picasa [3]) is run on the provided LFW thumbnails. If it fails to align the face (this happens for two images), the LFW alignment is used. 
Figure 6 gives an overview of all failure cases. It shows false accepts on the top as well as false rejects at the bottom. We achieve a classification accuracy of 98.87% ± 0.15 when using the fixed center crop described in (1) and the record breaking 99.63% ± 0.09 standard error of the mean when using the extra face alignment (2). This reduces the error reported for DeepFace in [17] by more than a factor
Figure 6. LFW errors. This shows all pairs of images that were incorrectly classified on LFW. Only eight of the 13 false rejects shown here are actual errors; the other five are mislabeled in LFW.
of 7 and the previous state-of-the-art reported for DeepId2+ in [15] by 30%. This is the performance of model NN1, but even the much smaller NN3 achieves performance that is not statistically significantly different.

5.7. Performance on Youtube Faces DB 

We use the average similarity of all pairs of the first one hundred frames that our face detector detects in each video. This gives us a classification accuracy of 95.12% ± 0.39. Using the first one thousand frames results in 95.18%. Compared to [17] (91.4%), who also evaluate one hundred frames per video, we reduce the error rate by almost half. DeepId2+ [15] achieved 93.2% and our method reduces this error by 30%, comparable to our improvement on LFW.
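A NumPy sketch of the video-pair score described above, with the squared L2 distance standing in for the similarity; this is an illustration, not the authors' evaluation code.

```python
import numpy as np

def video_pair_distance(frames_a, frames_b, max_frames=100):
    """Average squared L2 distance over all cross-pairs of the first
    `max_frames` detected frame embeddings of two videos."""
    a = frames_a[:max_frames]                        # (n_a, 128)
    b = frames_b[:max_frames]                        # (n_b, 128)
    d2 = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return float(np.mean(np.maximum(d2, 0.0)))       # clamp tiny negatives from rounding
```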

5.8. Face Clustering 

Our compact embedding lends itself to be used in order to cluster a user's personal photos into groups of people with the same identity. The constraints in assignment imposed by clustering faces, compared to the pure verification task,
Figure 7. Face Clustering. Shown is an exemplar cluster for one user. All these images in the user's personal photo collection were clustered together.
lead to truly amazing results. Figure 7 shows one cluster in a user's personal photo collection, generated using agglomerative clustering. It is a clear showcase of the incredible invariance to occlusion, lighting, pose and even age.
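A minimal sketch of clustering embeddings with off-the-shelf agglomerative clustering via SciPy; the average linkage and the distance threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_faces(embeddings, distance_threshold=1.0):
    """Group face embeddings with average-linkage agglomerative clustering."""
    Z = linkage(np.asarray(embeddings, dtype=float), method="average", metric="euclidean")
    return fcluster(Z, t=distance_threshold, criterion="distance")
```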

6. Summary 

We provide a method to directly learn an embedding into a Euclidean space for face verification. This sets it apart from other methods [15, 17], which use the CNN bottleneck layer, or require additional post-processing such as concatenation of multiple models and PCA, as well as SVM classification.
Figure 8. Harmonic Embedding Compatibility. These ROCs show the compatibility of the harmonic embeddings of NN2 to the embeddings of NN1. NN2 is an improved model that performs much better than NN1. When comparing embeddings generated by NN1 to the harmonic ones generated by NN2 we can see the compatibility between the two. In fact, the mixed mode performance is still better than NN1 by itself.
Our end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance.
Another strength of our model is that it only requires minimal alignment (tight crop around the face area). [17], for example, performs a complex 3D alignment. We also experimented with a similarity transform alignment and notice that this can actually improve performance slightly. It is not clear if it is worth the extra complexity. 
Future work will focus on better understanding of the error cases, further improving the model, and also reducing model size and reducing CPU requirements. We will also look into ways of improving the currently extremely long training times, e.g. variations of our curriculum learning with smaller batch sizes and offline as well as online positive and negative mining. 

7. Appendix: Harmonic Embedding 

In this section we introduce the concept of harmonic embeddings. By this we denote a set of embeddings that are generated by different models v1 and v2 but are compatible in the sense that they can be compared to each other. 
This compatibility greatly simplifies upgrade paths. E.g. in a scenario where embedding v1 was computed across a large set of images and a new embedding model v2 is being rolled out, this compatibility ensures a smooth transition without the need to worry about version incompatibilities. Figure 8 shows results on our 3G dataset. It can be seen that the improved model NN2 significantly outper-
Figure 9. Learning the Harmonic Embedding. In order to learn a harmonic embedding, we generate triplets that mix the v1 embeddings with the v2 embeddings that are being trained. The semi-hard negatives are selected from the whole set of both v1 and v2 embeddings.
forms NN1, while the comparison of NN2 embeddings to NN1 embeddings performs at an intermediate level. 

7.1. Harmonic Triplet Loss 

In order to learn the harmonic embedding, we mix embeddings of v1 together with the embeddings of v2 that are being learned. This is done inside the triplet loss and results in additionally generated triplets that encourage the compatibility between the different embedding versions. Figure 9 visualizes the different combinations of triplets that contribute to the triplet loss.
We initialized the v2 embedding from an independently trained NN2 and retrained the last layer (embedding layer) from random initialization with the compatibility encouraging triplet loss. First only the last layer is retrained, then we continue training the whole v2 network with the harmonic loss. 
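The mixed triplet construction of Figure 9 can be sketched as follows; this illustrative NumPy version assumes the v1 and v2 embeddings of the same images are given in corresponding row order and is not the authors' implementation.

```python
import numpy as np

def harmonic_triplets(v2_emb, v1_emb, labels, alpha=0.2):
    """Return (anchor, positive, negative) index triples over the pooled set.

    Anchors are taken from the v2 embeddings being trained; positives and
    semi-hard negatives may come from either version, which is what
    encourages compatibility between v1 and v2.
    """
    pooled = np.concatenate([v2_emb, v1_emb])            # rows 0..n-1: v2, n..2n-1: v1
    pooled_labels = np.concatenate([labels, labels])
    triplets = []
    for a in range(len(v2_emb)):                         # anchor index into the v2 half
        d = np.sum((pooled - pooled[a]) ** 2, axis=1)    # squared distances to the anchor
        for p in np.where(pooled_labels == pooled_labels[a])[0]:
            if p == a:
                continue
            semi = (pooled_labels != pooled_labels[a]) & (d > d[p]) & (d < d[p] + alpha)
            negs = np.where(semi)[0]
            if negs.size:
                triplets.append((a, int(p), int(negs[np.argmin(d[negs])])))
    return triplets
```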
Figure 10 shows a possible interpretation of how this compatibility may work in practice. The vast majority of v2 embeddings may be embedded near the corresponding v1 embedding, however, incorrectly placed v1 embeddings can be perturbed slightly such that their new location in embedding space improves verification accuracy. 

7.2. Summary 

These are very interesting findings and it is somewhat surprising that it works so well. Future work can explore how far this idea can be extended. Presumably there is a limit as to how much the v2 embedding can improve over v1, while still being compatible. Additionally it would be interesting to train small networks that can run on a mobile phone and are compatible to a larger server side model. 
Figure 10. Harmonic Embedding Space. This visualisation sketches a possible interpretation of how harmonic embeddings are able to improve verification accuracy while maintaining compatibility to less accurate embeddings. In this scenario there is one misclassified face, whose embedding is perturbed to the "correct" location in v2.

Acknowledgments 

We would like to thank Johannes Steffens for his discussions and great insights on face recognition and Christian Szegedy for providing new network architectures like [16] and discussing network design choices. Also we are indebted to the DistBelief [4] team for their support especially to Rajat Monga for help in setting up efficient training schemes. 
Also our work would not have been possible without the support of Chuck Rosenberg, Hartwig Adam, and Simon Han. 

References 

[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proc. of ICML, New York, NY, USA, 2009. 2 
[2] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Proc. ECCV, 2012. 2 
[3] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In Proc. ECCV, 2014. 7 
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, NIPS, pages 1232-1240. 2012. 10 
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121-2159, July 2011. 4
[6] I. J. Goodfellow, D. Warde-farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In In ICML, 2013. 4 
[7] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007. 5 
[8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation 
applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, Dec. 1989. 2, 4 
[9] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013. 2, 4, 6 
[10] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with gaussianface. CoRR, abs/1404.3840, 2014. 1 
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986. 2, 4 
[12] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In S. Thrun, L. Saul, and B. Schölkopf, editors, NIPS, pages 41-48. MIT Press, 2004. 2 
[13] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In In Proc. FG, 2002. 2 
[14] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. CoRR, abs/1406.4773, 2014. 1, 2, 3 
[15] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. CoRR, abs/1412.1265, 2014. 1, 2, 5, 8 
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. 2, 3, 4, 5, 6, 10 
[17] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In IEEE Conf. on CVPR, 2014. 1, 2, 5, 7, 8, 9 
[18] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. CoRR, abs/1404.4661, 2014. 2 
[19] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS. MIT Press, 2006. 2, 3 
[20] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429-1451, 2003. 4 
[21] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In IEEE Conf. on CVPR, 2011. 5 
[22] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. 2, 3, 4, 6
[23] Z. Zhu, P. Luo, X. Wang, and X. Tang. Recover canonicalview faces in the wild with deep neural networks. CoRR, abs/1404.3543, 2014. 2 