Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.
Here, we focus on the online generation and use large mini-batches on the order of a few thousand exemplars, and only compute the argmin and argmax within a mini-batch.
To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch. In our experiments we sample the training data such that around 40 faces are selected per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.
Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. We don't have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive method was more stable and converged slightly faster at the beginning of training.
We also explored the offline generation of triplets in conjunction with the online generation and it may allow the use of smaller batch sizes, but the experiments were inconclusive.
Selecting the hardest negatives can in practice lead to bad local minima early on in training; specifically, it can result in a collapsed model (i.e. $f(x) = 0$). In order to mitigate this, it helps to select $x_i^n$ such that
$$\left\|f(x_i^a) - f(x_i^p)\right\|_2^2 < \left\|f(x_i^a) - f(x_i^n)\right\|_2^2 .$$
We call these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance. Those negatives lie inside the margin $\alpha$.
As mentioned before, correct triplet selection is crucial for fast convergence. On the one hand we would like to use small mini-batches as these tend to improve convergence during Stochastic Gradient Descent (SGD) [20]. On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient. The main constraint with regards to the batch size, however, is the way we select hard relevant triplets from within the mini-batches. In most experiments we use a batch size of around 1,800 exemplars.
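The selection described above can be sketched as follows. This is an illustrative NumPy version, not our training implementation; the random choice among semi-hard candidates and the fallback when none exist are assumptions of the sketch.

```python
import numpy as np

def select_triplets(embeddings, labels, alpha=0.2):
    """Illustrative online triplet selection within one mini-batch:
    use all anchor-positive pairs and pick a semi-hard negative for each,
    i.e. a negative with d_ap < d_an < d_ap + alpha (squared L2)."""
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)
    # Pairwise squared L2 distances within the mini-batch.
    sq_dists = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    triplets = []
    for a in range(len(labels)):
        for p in range(a + 1, len(labels)):
            if labels[a] != labels[p]:
                continue  # keep every anchor-positive pair
            d_ap = sq_dists[a, p]
            neg_mask = labels != labels[a]
            # Semi-hard negatives: further than the positive but inside the margin.
            semi_hard = np.where(neg_mask & (sq_dists[a] > d_ap)
                                 & (sq_dists[a] < d_ap + alpha))[0]
            # Fallback to any negative if no semi-hard one exists (our choice).
            candidates = semi_hard if len(semi_hard) else np.where(neg_mask)[0]
            if len(candidates) == 0:
                continue
            triplets.append((a, p, int(np.random.choice(candidates))))
    return triplets
```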
3.3. Deep Convolutional Networks
In all our experiments we train the CNN using Stochastic Gradient Descent (SGD) with standard backprop [8, 11] and AdaGrad [5]. In most experiments we start with a learning rate of 0.05, which we lower to finalize the model. The models are initialized from random, similar to [16], and trained on a CPU cluster for 1,000 to 2,000 hours. The decrease in the loss (and increase in accuracy) slows down drastically after 500 h of training, but additional training can still significantly improve performance. The margin $\alpha$ is set to 0.2.
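For concreteness, the quantities mentioned here can be written down directly: the triplet loss with margin $\alpha = 0.2$ on squared $L_2$ distances (defined earlier in the paper and not repeated in this excerpt), and the standard AdaGrad update with the initial learning rate of 0.05. The sketch below is illustrative NumPy, not the training code.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss on batches of embeddings:
    sum over triplets of max(0, ||a-p||^2 - ||a-n||^2 + alpha)."""
    d_ap = np.sum((anchor - positive) ** 2, axis=1)
    d_an = np.sum((anchor - negative) ** 2, axis=1)
    return np.sum(np.maximum(0.0, d_ap - d_an + alpha))

def adagrad_step(param, grad, cache, lr=0.05, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, scale the step."""
    cache += grad ** 2
    param -= lr * grad / (np.sqrt(cache) + eps)
    return param, cache
```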
We used two types of architectures and explore their trade-offs in more detail in the experimental section. Their practical differences lie in the number of parameters and FLOPS. The best model may be different depending on the application. E.g. a model running in a datacenter can have many parameters and require a large number of FLOPS, whereas a model running on a mobile phone needs to have few parameters, so that it can fit into memory. All our
Table 1. NN1. This table shows the structure of our Zeiler&Fergus [22] based model with $1 \times 1$ convolutions inspired by [9]. The input and output sizes are described in rows $\times$ cols $\times$ #filters. The kernel is specified as rows $\times$ cols, stride, and the maxout [6] pooling size as $p = 2$.
models use rectified linear units as the non-linear activation function.
The first category, shown in Table 1, adds $1 \times 1 \times d$ convolutional layers, as suggested in [9], between the standard convolutional layers of the Zeiler&Fergus [22] architecture and results in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.
The second category we use is based on GoogLeNet style Inception models [16]. These models have $20\times$ fewer parameters (around 6.6M to 7.5M) and up to $5\times$ fewer FLOPS (between 500M and 1.6B). Some of these models are dramatically reduced in size (both depth and number of filters), so that they can be run on a mobile phone. One, NNS1, has 26M parameters and only requires 220M FLOPS per image. The other, NNS2, has 4.3M parameters and 20M FLOPS. Table 2 describes NN2, our largest network, in detail. NN3 is identical in architecture but has a reduced input size of $160 \times 160$. NN4 has an input size of only $96 \times 96$, thereby drastically reducing the CPU requirements (285M FLOPS vs. 1.6B for NN2). In addition to the reduced input size it does not use $5 \times 5$ convolutions in the higher layers as the receptive field is already too small by then. Generally we found that the $5 \times 5$ convolutions can be removed throughout
with only a minor drop in accuracy. Figure 4 compares all our models.
4. Datasets and Evaluation
We evaluate our method on four datasets and, with the exception of Labelled Faces in the Wild and YouTube Faces, we evaluate our method on the face verification task. I.e. given a pair of two face images, a squared $L_2$ distance threshold $D(x_i, x_j)$ is used to determine the classification of same and different. All face pairs $(i, j)$ of the same identity are denoted with $\mathcal{P}_{\text{same}}$, whereas all pairs of different identities are denoted with $\mathcal{P}_{\text{diff}}$.
We define the set of all true accepts as
$$\mathrm{TA}(d) = \left\{(i, j) \in \mathcal{P}_{\text{same}},\ \text{with } D(x_i, x_j) \leq d\right\}$$
These are the face pairs $(i, j)$ that were correctly classified as same at threshold $d$. Similarly,
$$\mathrm{FA}(d) = \left\{(i, j) \in \mathcal{P}_{\text{diff}},\ \text{with } D(x_i, x_j) \leq d\right\}$$
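These two sets can be counted directly from a batch of labelled embeddings. The sketch below follows the definitions above; the normalized rates returned at the end are a natural addition for picking an operating point and are not defined in this excerpt.

```python
import numpy as np

def accept_counts(embeddings, labels, d):
    """Count true accepts TA(d) and false accepts FA(d) at threshold d,
    using squared L2 distances over all pairs (i, j) with i < j."""
    emb = np.asarray(embeddings)
    labels = np.asarray(labels)
    ta = fa = n_same = n_diff = 0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            dist = np.sum((emb[i] - emb[j]) ** 2)
            same = labels[i] == labels[j]
            n_same += same
            n_diff += not same
            if dist <= d:
                ta += same       # pair in P_same accepted -> TA(d)
                fa += not same   # pair in P_diff accepted -> FA(d)
    # Normalizing by |P_same| and |P_diff| gives accept rates (our addition).
    return ta, fa, ta / max(n_same, 1), fa / max(n_diff, 1)
```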
Table 2. NN2. Details of the NN2 Inception incarnation. This model is almost identical to the one described in [16]. The two major differences are the use of $L_2$ pooling instead of max pooling (m), where specified. I.e. instead of taking the spatial max, the $L_2$ norm is computed. The pooling is always $3 \times 3$ (aside from the final average pooling) and in parallel to the convolutional modules inside each Inception module. If there is a dimensionality reduction after the pooling it is denoted with p. $1 \times 1$, $3 \times 3$, and $5 \times 5$ pooling are then concatenated to get the final output.
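To make the $L_2$ pooling concrete: instead of the spatial maximum, the $L_2$ norm of each pooling window is taken. A naive single-channel NumPy sketch (illustrative only, not the network code):

```python
import numpy as np

def l2_pool(x, k=3, stride=1):
    """Naive L2 pooling over k x k windows of a 2D feature map:
    each output value is the L2 norm of the values in its window."""
    h, w = x.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sqrt(np.sum(window ** 2))
    return out
```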
Table 6. Training Data Size. This table compares the performance after 700 h of training for a smaller model with $96 \times 96$ pixel inputs. The model architecture is similar to NN2, but without the $5 \times 5$ convolutions in the Inception modules.
…the differences reported in Table 5 are statistically insignificant.
It should be noted that during training a 128-dimensional float vector is used, but it can be quantized to 128 bytes without loss of accuracy. Thus each face is compactly represented by a 128-dimensional byte vector, which is ideal for large scale clustering and recognition. Smaller embeddings are possible at a minor loss of accuracy and could be employed on mobile devices.
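The quantization scheme is not spelled out here; one plausible way to obtain a 128-byte representation is a simple per-dimension linear mapping of each float to one byte, as in the sketch below. This is an assumption for illustration, not the published method.

```python
import numpy as np

def quantize_embedding(e, lo=-1.0, hi=1.0):
    """Map a float embedding with values assumed in [lo, hi] to uint8,
    one byte per dimension (128 bytes for a 128-d embedding)."""
    q = np.round((np.clip(e, lo, hi) - lo) / (hi - lo) * 255.0)
    return q.astype(np.uint8)

def dequantize_embedding(q, lo=-1.0, hi=1.0):
    """Approximate inverse of quantize_embedding."""
    return q.astype(np.float32) / 255.0 * (hi - lo) + lo
```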
5.5. Amount of Training Data
Table 6 shows the impact of large amounts of training data. Due to time constraints this evaluation was run on a smaller model; the effect may be even larger on larger models. It is clear that using tens of millions of exemplars results in a clear boost of accuracy on our personal photo test set from section 4.2. Compared to only millions of images the relative reduction in error is 60%. Using another order of magnitude more images (hundreds of millions) still gives a small boost, but the improvement tapers off.
A proprietary face detector (similar to Picasa [3]) is run on the provided LFW thumbnails. If it fails to align the face (this happens for two images), the LFW alignment is used.
Figure 6 gives an overview of all failure cases. It shows false accepts on the top as well as false rejects at the bottom. We achieve a classification accuracy of $98.87\% \pm 0.15$ when using the fixed center crop described in (1) and the record breaking $99.63\% \pm 0.09$ standard error of the mean when using the extra face alignment (2). This reduces the error reported for DeepFace in [17] by more than a factor
Figure 6. LFW errors. This shows all pairs of images that were incorrectly classified on LFW. Only eight of the 13 false rejects shown here are actual errors; the other five are mislabeled in LFW.
of 7 and the previous state-of-the-art reported for DeepId2+ in [15] by 30%. This is the performance of model NN1, but even the much smaller NN3 achieves performance that is not statistically significantly different.
5.7. Performance on YouTube Faces DB
We use the average similarity of all pairs of the first one hundred frames that our face detector detects in each video. This gives us a classification accuracy of $95.12\% \pm 0.39$. Using the first one thousand frames results in $95.18\%$. Compared to [17] ($91.4\%$), who also evaluate one hundred frames per video, we reduce the error rate by almost half. DeepId2+ [15] achieved $93.2\%$ and our method reduces this error by 30%, comparable to our improvement on LFW.
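One plausible reading of this per-video scoring is sketched below: embed the first one hundred detected faces of each video, average the squared $L_2$ distance over all cross-video pairs, and threshold the result as in verification. The function name and exact pairing are assumptions of the sketch.

```python
import numpy as np

def video_pair_distance(frames_a, frames_b, max_frames=100):
    """Average squared L2 distance between all cross-video pairs of the
    first `max_frames` face embeddings of each video; compare the result
    against the verification threshold d for a same/different decision."""
    a = np.asarray(frames_a)[:max_frames]   # (Na, 128) embeddings
    b = np.asarray(frames_b)[:max_frames]   # (Nb, 128) embeddings
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return sq_dists.mean()
```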
5.8. Face Clustering
Our compact embedding lends itself to be used in order to cluster a user's personal photos into groups of people with the same identity. The constraints in assignment imposed by clustering faces, compared to the pure verification task,
Figure 7. Face Clustering. Shown is an exemplar cluster for one user. All these images in the user's personal photo collection were clustered together.
lead to truly amazing results. Figure 7 shows one cluster in a user's personal photo collection, generated using agglomerative clustering. It is a clear showcase of the incredible invariance to occlusion, lighting, pose and even age.
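A minimal sketch of such clustering with SciPy's hierarchical (agglomerative) clustering on the face embeddings follows; the linkage method and distance cutoff are illustrative choices, not the settings used here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_faces(embeddings, cutoff=1.0):
    """Group face embeddings into identity clusters with agglomerative
    clustering; `cutoff` is a distance threshold (illustrative value)."""
    emb = np.asarray(embeddings)
    # Average-linkage hierarchical clustering on Euclidean distances.
    z = linkage(emb, method="average", metric="euclidean")
    # Cut the dendrogram at `cutoff` to obtain flat cluster labels.
    return fcluster(z, t=cutoff, criterion="distance")
```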
6. Summary
We provide a method to directly learn an embedding into a Euclidean space for face verification. This sets it apart from other methods [15, 17] which use the CNN bottleneck layer, or require additional post-processing such as concatenation
Figure 8. Harmonic Embedding Compatibility. These ROCs show the compatibility of the harmonic embeddings of NN2 to the embeddings of NN1. NN2 is an improved model that performs much better than NN1. When comparing embeddings generated by NN1 to the harmonic ones generated by NN2 we can see the compatibility between the two. In fact, the mixed mode performance is still better than NN1 by itself.
of multiple models and PCA, as well as SVM classification. Our end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance.
Another strength of our model is that it only requires minimal alignment (a tight crop around the face area). [17], for example, performs a complex 3D alignment. We also experimented with a similarity transform alignment and noticed that this can actually improve performance slightly. It is not clear if it is worth the extra complexity.
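To illustrate what a similarity transform alignment can look like: given detected landmarks such as the two eye centers, a rotation, scale, and translation mapping them to canonical positions is estimated and the crop is warped accordingly. The landmark positions, crop size, and library in the sketch below are illustrative assumptions, not the pipeline used here.

```python
import numpy as np
from skimage.transform import SimilarityTransform, warp

def align_face(image, landmarks, out_size=220):
    """Warp a face crop so that detected landmarks (e.g. eye centers)
    land on fixed canonical positions; values are purely illustrative."""
    # Canonical landmark positions in the output crop (assumed).
    canonical = np.array([[0.35, 0.4], [0.65, 0.4]]) * out_size
    tform = SimilarityTransform()
    tform.estimate(np.asarray(landmarks, dtype=float), canonical)
    # warp() expects the inverse mapping from output to input coordinates.
    return warp(image, tform.inverse, output_shape=(out_size, out_size))
```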
Future work will focus on better understanding of the error cases, further improving the model, and also reducing model size and reducing CPU requirements. We will also look into ways of improving the currently extremely long training times, e.g. variations of our curriculum learning with smaller batch sizes and offline as well as online positive and negative mining.
7. Appendix: Harmonic Embedding
In this section we introduce the concept of harmonic embeddings. By this we denote a set of embeddings that are generated by different models v1 and v2 but are compatible in the sense that they can be compared to each other.
This compatibility greatly simplifies upgrade paths. E.g. in a scenario where embedding v1 was computed across a large set of images and a new embedding model v2 is being rolled out, this compatibility ensures a smooth transition without the need to worry about version incompatibilities. Figure 8 shows results on our 3G dataset. It can be seen that the improved model NN2 significantly outper-
Figure 9. Learning the Harmonic Embedding. In order to learn a harmonic embedding, we generate triplets that mix the v1 embeddings with the v2 embeddings that are being trained. The semi-hard negatives are selected from the whole set of both v1 and v2 embeddings.
forms NN1, while the comparison of NN2 embeddings to NN1 embeddings performs at an intermediate level.
7.1. Harmonic Triplet Loss
In order to learn the harmonic embedding, we mix embeddings of v1 together with the embeddings of v2 that are being learned. This is done inside the triplet loss and results in additionally generated triplets that encourage compatibility between the different embedding versions. Figure 9 visualizes the different combinations of triplets that contribute to the triplet loss.
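A sketch of one way such mixed triplets could be generated, following the description of Figure 9: v2 anchors paired with v1 positives, with semi-hard negatives drawn from the pooled v1 and v2 embeddings. The exact set of combinations used is not spelled out here, so this is an illustrative reading rather than the training code.

```python
import numpy as np

def harmonic_triplets(emb_v2, emb_v1, labels, alpha=0.2):
    """Illustrative mixed-triplet generation for the harmonic loss.
    Returns (anchor index into emb_v2, positive index into emb_v1,
    negative index into the pooled [emb_v1; emb_v2] array)."""
    labels = np.asarray(labels)
    pool = np.concatenate([emb_v1, emb_v2])          # candidate negatives
    pool_labels = np.concatenate([labels, labels])
    triplets = []
    for a in range(len(labels)):
        for p in range(len(labels)):
            if p == a or labels[p] != labels[a]:
                continue
            d_ap = np.sum((emb_v2[a] - emb_v1[p]) ** 2)
            d_an = np.sum((emb_v2[a][None, :] - pool) ** 2, axis=1)
            # Semi-hard negatives from the combined v1/v2 set.
            semi = np.where((pool_labels != labels[a])
                            & (d_an > d_ap) & (d_an < d_ap + alpha))[0]
            if len(semi):
                triplets.append((a, p, int(np.random.choice(semi))))
    return triplets
```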
We initialized the v2 embedding from an independently trained NN2 and retrained the last layer (the embedding layer) from random initialization with the compatibility-encouraging triplet loss. First only the last layer is retrained; then we continue training the whole v2 network with the harmonic loss.
Figure 10 shows a possible interpretation of how this compatibility may work in practice. The vast majority of v2 embeddings may be embedded near the corresponding v1 embedding; however, incorrectly placed v1 embeddings can be perturbed slightly such that their new location in embedding space improves verification accuracy.
7.2. Summary
These are very interesting findings and it is somewhat surprising that it works so well. Future work can explore how far this idea can be extended. Presumably there is a limit as to how much the v2 embedding can improve over v1 while still being compatible. Additionally, it would be interesting to train small networks that can run on a mobile phone and are compatible with a larger server-side model.
Figure 10. Harmonic Embedding Space. This visualisation sketches a possible interpretation of how harmonic embeddings are able to improve verification accuracy while maintaining compatibility to less accurate embeddings. In this scenario there is one misclassified face, whose embedding is perturbed to the “correct” location in v2.
Acknowledgments
We would like to thank Johannes Steffens for his discussions and great insights on face recognition and Christian Szegedy for providing new network architectures like [16] and discussing network design choices. Also we are indebted to the DistBelief [4] team for their support especially to Rajat Monga for help in setting up efficient training schemes.
Also our work would not have been possible without the support of Chuck Rosenberg, Hartwig Adam, and Simon Han.
References
[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proc. of ICML, New York, NY, USA, 2009. 2
[2] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Proc. ECCV, 2012. 2
[3] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In Proc. ECCV, 2014. 7
[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, NIPS, pages 1232-1240. 2012. 10
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121-2159, July 2011. 4
[6] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013. 4
[7] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007. 5
[8] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation
applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, Dec. 1989. 2, 4
[9] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013. 2, 4, 6
[10] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with gaussianface. CoRR, abs/1404.3840, 2014. 1
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986. 2, 4
[12] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In S. Thrun, L. Saul, and B. Schölkopf, editors, NIPS, pages 41-48. MIT Press, 2004. 2
[13] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression (PIE) database. In Proc. FG, 2002. 2
[14] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. CoRR, abs/1406.4773, 2014. 1, 2, 3
[15] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. CoRR, abs/1412.1265, 2014. 1, 2, 5, 8
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. 2, 3, 4, 5, 6, 10
[17] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In IEEE Conf. on CVPR, 2014. 1, 2, 5, 7, 8, 9
[18] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. CoRR, abs/1404.4661, 2014. 2
[19] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS. MIT Press, 2006. 2, 3
[20] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429-1451, 2003. 4
[21] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In IEEE Conf. on CVPR, 2011. 5
[22] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. 2, 3, 4, 6
[23] Z. Zhu, P. Luo, X. Wang, and X. Tang. Recover canonicalview faces in the wild with deep neural networks. CoRR, abs/1404.3543, 2014. 2