1 Systems & Technology Research, Woburn, MA 01801, USA
https://www.stresearch.com
Email: {jonathan.williford,brandon.may}@stresearch.com

2 Visym Labs, Cambridge, MA 02140, USA
Email: jeff@visym.com

Explainable Face Recognition

Jonathan R. Williford¹ 0000-0002-9178-2647, Brandon B. May¹ 0000-0002-9914-2441, Jeffrey Byrne¹,² 0000-0001-8973-0322
Abstract

Explainable face recognition (XFR) is the problem of explaining the matches returned by a facial matcher, in order to provide insight into why a probe was matched with one identity over another.

In this paper, we provide the first comprehensive benchmark and baseline evaluation for XFR.

We define a new evaluation protocol called the “inpainting game”: a curated set of 3648 triplets (probe, mate, nonmate) from 95 subjects, in which each nonmate is synthesized by inpainting a chosen facial characteristic of the mate, such as the nose, eyebrows or mouth.

An XFR algorithm is tasked with generating a network attention map that best explains which regions in a probe image match the mated image, and not the inpainted nonmate, for each triplet.

This provides ground truth for quantifying what image regions contribute to face matching. Finally, we provide a comprehensive benchmark on this dataset comparing five state-of-the-art XFR algorithms on three facial matchers.

This benchmark includes two new algorithms called subtree EBP and Density-based Input Sampling for Explanation (DISE) which outperform the state-of-the-art XFR by a wide margin.

1 Introduction

Explainable AI [29] is the problem of interpreting, understanding and visualizing machine learning models.

Deep convolutional networks trained at large scale are traditionally considered blackbox systems, where designers have an understanding of the dataset and loss functions used for training, but limited understanding of the learned model.

Furthermore, predictions generated by the system are often not explainable as to why the system generated this output for that input. An explainable AI system would enable interpretation of what the ML model has learned [26][2], enable transparency to understand and identify biases or failure modes in the system [3][15][13][25] and provide user friendly visualizations to build user trust in critical applications [31][33][42].

Explainable face recognition (XFR) is the problem of explaining why a face matching system matches faces. Human adjudicators have a long history of explaining face recognition in the field of forensic face matching. Professional facial analysts follow the FISWG standards [11], which leverage comparing facial morphology, measuring facial landmarks and matching scars, marks and blemishes. These features are used to match a controlled mugshot of a proposed candidate to an uncontrolled probe, such as a security camera image.

However, these approaches require a candidate list for human adjudication, and a candidate list in a modern workflow is returned from a facial matching system [24]. Why did the face matching system return that candidate list for this probe? What facial features did the face matching system use, and are they the same as the FISWG standards? Is the face matcher biased or noisy?

The goal of XFR is to explore such questions, and answer why a system matched a pair of faces.

A successful explainable system would increase confidence in a face matching system for professional examiners, enable interpretation of the internal face representations by machine learning researchers, and generate trust in the user community.

What is an “explanation” in face recognition? Explainable AI has explored various forms of explanation for machine learning systems: activation maximization [32], synthesizing optimal images [21], network attention [39][23][12], network dissection [2] or synthesizing linguistic explanations [16]. However, a key challenge in explainable AI is the lack of ground truth to compare and quantify explainable results across networks. XFR is especially challenging because the difference between near-mates or doppelgangers is subtle, the explanations are non-obvious, and differences are rarely well localized in a compact facial feature [6].

Figure 1: Explainable Face Recognition (XFR). Given an image triplet of (probe, mate, nonmate), an explainable face recognition algorithm is tasked with estimating which pixels belong to a region that is discriminative for the mate, i.e. a region more similar to the mate than the nonmate. These estimates are given as a saliency map. The nonmate has been synthesized by inpainting a given region (e.g. eyebrows) that changes the identity according to the given network. This provides ground truth for a quantitative evaluation of XFR algorithms using the “inpainting game” protocol.

In this paper, we provide the first comprehensive benchmark for explainable face recognition (XFR). Fig. 1 shows the structure of this problem. An XFR system is given a triplet of (probe, mate, nonmate) images. The XFR system is tasked with generating a saliency map that best captures the regions of the probe image that increase similarity to the mate and decrease similarity to the nonmate.

This provides an explanation for why the matcher provides a high verification score for the pair (probe, mate) and a low verification score for (probe, nonmate). This explanation can be quantitatively evaluated by synthesizing nonmates that differ from the mate only in specific regions (e.g. nose, eyes, mouth), such that if the saliency algorithm selects these regions, then it performs well on this metric. This paper makes the following contributions:

1. XFR baseline. We provide a baseline for XFR based on five algorithms for network attention, evaluated on three publicly available convolutional networks trained for face recognition: LightCNN [35], VGGFace2 [5] and ResNet-101. These baselines include two new algorithms for network attention called subtree EBP (Sec. 3.1) and DISE (Sec. 3.2).

2. Inpainting game protocol and dataset. We provide a standardized evaluation protocol (Sec. 4.1) and dataset (Sec. 4.2) for fine-grained discriminative visualization of faces. This provides a quantitative metric for objectively comparing XFR systems.

3. XFR evaluation. We provide the first comprehensive evaluation of XFR using the baseline algorithms on the inpainting game protocol, providing a benchmark for future research (Sec. 5.1). Furthermore, in the supplemental material, we show a qualitative evaluation on novel (non-inpainted) images to draw conclusions about the utility of the methods for explanation on real images.

2 Related Work

The related work most relevant for our proposed approach to XFR can be broadly categorized into two areas: network attention models for convolutional networks and interpretable face recognition.

Network attention is the problem of generating an image-based saliency map which visualizes the input regions that best explain a class activation output of a network. Gradient-based methods [31][33][42] attempt to compute the derivative of the class signal with respect to the input image, while other approaches [4] modify network architectures to capture these signals or localize attribution [17]. Excitation backprop [39], contrastive EBP [39] and truncated contrastive EBP [6] formulate the saliency map as marginal probabilities in a probabilistic absorbing Markov chain. Layerwise relevance propagation [18][28][1] provides network attention through a set of layerwise propagation rules theoretically justified by deep Taylor decomposition. Latent attention networks learn an auxiliary network to map input to attention, rather than exploring the network directly [14]. Inversion methods [19] seek to recover natural images that have the same feature representation as a given image. However, the same insights have not yet been applied to fine-grained categorization for face recognition.

Finally, black box methods have explored network attention for systems that do not have an exposed convolutional network [8][23][12][4]. The approaches to XFR explored in this paper are most closely related to EBP [39], RISE [23], and methods for network attention for pairwise similarity [34].

Recently, there has been emerging research on the interpretation of face recognition systems [37][38][6][36][27][9][41]. Visual psychophysics [27] provides a set of tools for the controlled manipulation of input stimuli and metrics for the output responses evoked in a face matching system. This approach was inspired by the Cambridge Face Memory Test [10], and involves progressively perturbing face images using a chosen transformation function (e.g. adding noise) to investigate controlled degradation of matching performance [27]. This approach enables detailed studies of the failure modes of a face matcher, and exploration of how facial attributes are expressed in a network [9][41]. In contrast, our approach generates controlled degradations using inpainting, to provide localized ground truth for evaluation of network attention models. In [37], the authors propose a novel loss function to encourage part separability during network dissection of parts in a convolutional network for face matching.

This approach is primarily concerned with training new networks to maximize interpretability, rather than studying existing networks. In [38], the authors study pairwise matching of faces, to visualize features that lead towards classification decisions.

This is similar in spirit to our proposed approach; however, we provide a performance metric for evaluating a saliency approach, as well as extending visualizations to mated and nonmated triplets. Finally, in [36], the authors visualize the features of shape and texture that underlie subject identity decisions. This approach uses 3D modeling to generate a controlled dataset, rather than inpainting.

However, given the authors' conclusion that texture has a much larger effect on matching than morphology, a ground truth dataset that includes texture variation is an appropriate basis for evaluating explainable face recognition.

3 Explainable Face Recognition (XFR)

XFR is the problem of explaining why a face matcher matches faces. Fig. 1 shows the structure of this problem. Given a triplet of (probe, mate, nonmate), the XFR algorithm is tasked with generating a saliency map that explains the regions of the probe image that maximize the similarity to the mate and minimize the similarity to the nonmate.

This provides an explanation for why the matcher returns this image for the mated identity.

Why triplets? Previous work has shown that pairwise similarity between faces is heavily dominated by the periocular region and nose [6], as confirmed by the qualitative visualization study in the supplementary material. The periocular region and nose are almost always used for facial classification, but this level of XFR is not very helpful in explaining finer levels of discrimination. Our goal is to highlight those regions of a probe that are more similar to a presumptive mate and simultaneously less similar to a nonmate. This triplet of (probe, mate, nonmate) provides a deeper explanation, beyond facial class activation maps, of the relative importance of facial regions.

In this section, we describe five approaches for network attention in XFR. These approaches are all whitebox methods, which assume access to the underlying convolutional network used for facial matching.

The objective of XFR is to generate a non-negative saliency map that captures the underlying image regions of the probe that are most similar to the mate and least similar to the nonmate.

The XFR algorithm can use any property of the convolutional network to generate this saliency map.

For our benchmark evaluation, we selected three state-of-the-art approaches for network attention (excitation backprop, contrastive excitation backprop and truncated contrastive excitation backprop), following the survey and evaluation results in [6]. In this section, we introduce two new methods that improve upon these published approaches: subtree EBP (Sec. 3.1) and DISE (Sec. 3.2).

Figure 2: Subtree EBP. Given a triplet of (probe, mate, nonmate), subtree EBP explores the activations of individual nodes in a convolutional network that minimize a triplet loss, which maximizes similarity to the mate and minimizes similarity to the nonmate. The excitatory regions for each node are visualized independently, sorted by loss and combined into a saliency map that best explains how to discriminate the probe.

3.0.1 Excitation Backprop (EBP).

Excitation backprop (EBP) [39] models network attention as a probabilistic winner-take-all (WTA) process. EBP calculates the probability of traversing to a given node in the convolutional network, with the probabilities being defined by the positive weights and non-negative activations.

The output of EBP is a saliency map that localizes regions in the image that are excitatory for a given class.

In our approach, we replace the cross-entropy loss for EBP with a triplet loss [30]. The original formulation of EBP considers a cross-entropy loss to optimize softmax classification over the set of classes in the training set. In this new formulation, given three embeddings for a mate $m$, nonmate $n$ and probe $p$, the triplet loss function is the max-margin hinge loss

$L(p,n,m) = \max(0,\ \|p-m\|^2 - \|p-n\|^2 + \alpha)$.   (1)

This uses the squared Euclidean distance between embeddings to capture similarity, such that the loss is minimized when the distance from the probe to the mate is small (similarity is high) and the distance from the probe to the nonmate is large (similarity is low), with margin term $\alpha$. This loss function extends EBP to cases where a new subject is observed at test time that was not present in the training set, as is commonly the case with face matching systems.

3.0.2 Contrastive EBP (cEBP).

Contrastive EBP was introduced [39] to handle fine-grained network attention for closely related classes. This approach discards activations common to a pair of classes, to provide network attention specific to one class and not the other. In our approach, contrastive EBP [39] is combined with the triplet loss (Eq. 1).

3.0.3 Truncated Contrastive EBP (tcEBP).

Truncated contrastive EBP was introduced [6] as an extension of cEBP that considers the contrastive EBP attention map only within the kth percentile of the EBP saliency map. This addresses an observed instability of cEBP [6] that results in noisy attention maps for faces.

3.1 Subtree EBP

In this section, we introduce Subtree EBP, a novel method for whitebox XFR. This approach uses the triplet loss function (Eq. 1), with the following extension. Given a triplet of (probe, mate, nonmate) images, we compute the gradient $\frac{\partial L}{\partial x_i}$ of the triplet loss function with respect to every node $x_i$ in the network. This approach uses standard triplet-based learning, where the mate and nonmate embeddings are assumed constant and the gradient is computed relative to the probe image. Next, we sort the gradients at every node $x_i$ in decreasing order, and select the top-k nodes with the largest positive gradients. These are the top-k nodes in the network that most affect the triplet loss, increasing the similarity to the mate while simultaneously decreasing the similarity to the nonmate. Finally, we construct $k$ EBP saliency maps $S_i$, starting from each of the selected interior nodes; the $S_i$ are then combined in a weighted convex combination with weights $w_i = \frac{\partial L}{\partial x_i}$ and

$S = \frac{1}{\sum_j w_j} \sum_i \frac{\partial L}{\partial x_i} S_i$   (2)

where the weights $w_i$ are given by the loss gradient, normalized to sum to one. This forms the final subtree EBP saliency map $S$.

Fig. 2 shows an example of the subtree EBP method. This montage shows the top 27 nodes with the largest triplet loss gradient for the shown triplet. Each node results in a saliency map corresponding to the excitory subtree rooted at this node.

The weight of the saliency map is proportional to the gradient sorted rowwise, so that the nodes in the bottom right affect the loss more than the nodes in the upper left. Each of these saliency maps are combined into a convex combination (eq. 2) forming the final network attention map.

In this example, the nonmate was synthesized to differ with the mate only in the nose region, and our method is able to correctly localize this region. The supplementary material shows a more detailed example of this selection process starting from the largest excitation node at each layer of a ResNet-101 network.

This result shows that nodes selected close to the image are well localized, nodes in the middle of the network correspond to parts, and nodes selected close to the embedding correspond to the whole nose and eyes of the face.

Figure 3: Density-based Input Sampling for Explanation (DISE). Our approach is an extension of RISE [23] for XFR. This approach occludes small regions in the probe image with grey (i.e. masked pixels), sampled according to a prior density derived from excitation backprop, and computes a numerical gradient of the triplet loss for (probe, mate, nonmate) given these masked probes. Masks with a large numerical gradient are more heavily weighted in the accumulated saliency map.

3.2 Density-based Input Sampling for Explanation (DISE)

Density-based Input Sampling for Explanation (DISE) is a second novel approach for whitebox XFR introduced in this paper. DISE is an extension of Randomized Input Sampling for Explanation (RISE) [23] using a prior density to aid in sampling. Previous work [23][12] has constructed a saliency map associated with a particular class by randomly perturbing the input image by masking selected pixels, evaluating it using a blackbox system, and accumulating those perturbations based on how confident the system is that the modified input image corresponds to the target class.

However, these approaches generate masks to occlude the input image uniformly at random. This sampling process is inefficient, and can be improved by introducing a prior distribution to guide the sampling. In this section, we describe the extension to RISE [23] where the prior density for input sampling is derived from a whitebox EBP with triplet loss.

Fig. 3 shows an overview of this approach. Our approach extends RISE [23] for XFR as follows:

1. Using a non-uniform prior for generating the random binary masks.

2. Restricting the masks to use a sparse, fixed number of mask elements.

3. Defining a numerical gradient of the triplet loss to weight each mask.

3.2.1 Non-Uniform Prior.

Prior research on discriminative features for facial recognition showed that the most important regions of the face were generally located in and around the eyes and nose (Sec. 3.0.1). Fig. 3 shows an example of this saliency map computed for a probe image of Taylor Swift using the VGG-16 [22] network as the whitebox face matcher. Using this saliency map as our prior probability for generating random masks allows us to sample the space of most salient masks that will affect the loss more efficiently than assuming a uniform probability across the entire image.

Further limiting this prior to the upper 50th percentile of the mean EBP map effectively eliminates the possibility of masking out unimportant background elements.
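A minimal sketch of this sampling step follows, assuming a precomputed mean EBP map on the low-resolution mask grid as the prior; the function name and parameter values are illustrative assumptions.

import numpy as np

def sample_mask_locations(prior, num_masks=1000, elements_per_mask=2, pct=50):
    # prior: (h, w) non-negative mean EBP saliency map on the mask grid.
    p = prior.astype(float).copy()
    p[p < np.percentile(p, pct)] = 0.0  # zero out the lower percentile (background)
    p /= p.sum()                        # normalize to a sampling density
    idx = np.random.choice(p.size, size=(num_masks, elements_per_mask), p=p.ravel())
    # Return (num_masks, elements_per_mask, 2) grid coordinates of mask elements.
    return np.stack(np.unravel_index(idx, p.shape), axis=-1)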

3.2.2 Sparse Masks.

Next, we restrict the number of masked elements so that each mask is sparse. RISE considered random binary masks covering the entire input image.

In contrast, we use a sparse mask to highlight the effect of a small localized region of the face on the loss. We used two mask elements per mask, upsampled by a factor of 12 (to avoid pixel-level adversarial effects).

We found that filling the masks with a blurred version of the image performed quantitatively better on the inpainting game than using grey masks.

3.2.3 Numerical Gradient.

Finally, given the probe image which has been masked with the sparse mask sampled from the non-uniform prior, we can compute a numerical gradient of the triplet loss. Let $p$ be an embedding of the probe, $m$ the mated image embedding, $n$ the nonmated image embedding, and $\hat{p}$ the masked probe embedding. Then, the numerical gradient of the triplet loss (Eq. 1) can be approximated as:

$\frac{\partial L_{dise}}{\partial p} \approx \max(0,\ L(p,m,n) - L(\hat{p},m,n))$   (3)

The numerical gradient is an approximation to the true loss gradient computed by perturbing the input by occluding the probe with a pixel mask, and computing the corresponding change in the triplet loss.

In other words, when masking out a region increases the similarity between the probe and mate and decreases the similarity between the probe and nonmate, the numerical gradient will be large. This allows for a loss-weighted accumulation of these masks into a saliency map.

The final saliency map is accumulated following Eq. 2, where the saliency maps $S_i$ are the pairwise binary masks, with non-negative gradient weights (Eq. 3).
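Putting the pieces together, the following is a minimal sketch of the DISE accumulation loop, reusing the triplet loss of Eq. 1 and the mask weights of Eq. 3; the embed callable, the grey fill (Sec. 3.2.2 notes a blurred fill performed better) and the image conventions are illustrative assumptions.

import numpy as np

def triplet_loss(p, m, n, alpha=0.3):
    return max(0.0, np.sum((p - m) ** 2) - np.sum((p - n) ** 2) + alpha)  # Eq. 1

def dise_saliency(embed, probe_img, m, n, masks, alpha=0.3):
    # probe_img: (H, W, 3) float image in [0, 1]; masks: (N, H, W), 1 = occluded.
    # embed: callable mapping an image to an embedding (the whitebox matcher).
    p = embed(probe_img)
    base = triplet_loss(p, m, n, alpha)
    S = np.zeros(masks.shape[1:])
    total = 0.0
    for mask in masks:
        occluded = probe_img * (1.0 - mask[..., None]) + 0.5 * mask[..., None]
        w = max(0.0, base - triplet_loss(embed(occluded), m, n, alpha))  # Eq. 3
        S += w * mask  # loss-weighted accumulation of masks (Eq. 2)
        total += w
    return S / max(total, 1e-12)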

4 Experimental Protocol

Recent explainable AI research has focused on class activation maps [23][12][31][33][42][39][4], which visualize salient regions used for classification. For facial recognition, prior work has shown that these regions are almost always the eyes, nose, and upper lip of the face [6]. In facial identification, a probe image is given to a face matching system, which returns the top $K$ identities from a gallery. A natural question is why the matching system picked the top match instead of the second match (or the remaining top $K$ matches). One way to answer this question is to highlight the region(s) that match a given identity more than the second identity or other identities. This saliency map should be larger for the regions that contribute the most to the identity, and not others. In this paper, our goal is to highlight the regions that are responsible for matching a given image to one identity versus a similar identity.

A key challenge for evaluating the performance of an XFR approach is generating ground truth.

For XFR, ground truth not only depends on the selection of probes, mates, and nonmates, but can also depend on a target network for evaluation. We address this issue by synthesizing inpainted nonmates or doppelgangers, where a select region of the face is changed from the original identity. Only the inpainted region differs between the two images and therefore only the inpainted region can be used to discriminate between them.

Furthermore, we synthesize doppelgangers based on their ability to reduce the match score for a target network. We call our overall strategy for quantitative evaluation the inpainting game.

4.1 The Inpainting Game

Figure 4: Inpainting game overview. The XFR algorithm is given triplets (probe, mate, nonmate), labeled in the figure as (mated probe, mated references, inpainted non-mates), and is tasked with estimating a discriminative saliency map: the likelihood that each pixel belongs to a region that is discriminative for the mate. A threshold is applied to the saliency map to classify each pixel as discriminatively salient (inset, blue squares, left). A high performing XFR algorithm will correctly classify the discriminative pixels within the inpainted region (green, right) while avoiding classifying pixels that are identical between the mated references and the probes as discriminative (red, right). See Sec. 4.1 for more details.

An overview of the inpainting game evaluation is given in Fig. 4. The inpainting game uses four (or more) images for each evaluation: a probe image, mate image(s), an inpainted probe and inpainted nonmate(s). The inpainted probe or probe doppelganger is subtly different from the probe in a fixed region of the face, such as the eyes, nose or mouth. Similarly, the inpainted nonmate or mate doppelganger is subtly different from the mate image, such that the doppelgangers are a different identity. The inpainted probe and inpainted nonmate are constrained to be the same new identity. Sec. 4.2 discusses the construction of this dataset.

The XFR algorithm is given triplets of probes, mates and nonmates, labeled in Fig. 4 as (“mated probe”, “mated references” and “inpainted non-mates”). For each triplet, the XFR algorithm is tasked with estimating the likelihood that each pixel belongs to a region that is discriminative for matching the probe to the mated identity over the nonmated/inpainted identity. These discriminative pixel estimates form a saliency map. Each pixel is classified as discriminatively salient by applying a threshold, which forms a binary saliency map. For each binary saliency map, pixels in the probe are replaced with the pixels from an inpainted probe, forming a blended probe. The inpainted probe is generated by inpainting the same facial region as the inpainted nonmates; it is not provided to the XFR algorithm, but is sequestered and used for evaluation only. The saliency map is evaluated by how quickly it can flip the identity of the blended probe from the mate to the non-mate, maximizing saliency (green) within the ground truth region (grey) while minimizing false alarms (red).

See Sec. 4.3 for additional details, including the metrics for the inpainting game.

4.2 Inpainting Dataset for Facial Recognition

Figure 5: Example facial inpainted images. This montage shows the first four of eight inpainting regions (cheeks, mouth, nose, eyebrows, left face, right face, left eye, right eye), and the synthesized inpainted doppelganger images. The first seven columns show seven subjects, with the same image repeated along each column. The middle column shows a binary inpainting mask that defines the inpainting region. The last seven columns show the inpainted doppelganger images using the mask region for that row, such that the inpainted image differs from the original image only in the mask region. Observe that the identity may change subtly while looking down a column.

The inpainting dataset for face recognition is based on images from the IJB-C dataset [20]. The inpainting dataset contains 561 images of 95 subjects selected from IJB-C, for an average of 5.9 images per subject. We defined eight facial regions for evaluation: 1) cheeks and jaw, 2) mouth, 3) nose, 4) left eye, 5) right eye, 6) eyebrows, 7) left face, and 8) right face.

Each image is inpainted for each of the eight regions forming a total of 4488 inpainted doppelgangers.

From this set, we define 3648 triplets, such that each triplet is a combination of (probe, mate, inpainted doppelganger nonmate). The XFR algorithms should not be evaluated on triplets for networks that cannot distinguish the original and inpainted identities.

Hence, only the triplets that contain identities discriminable by the network under evaluation are included.

The inpainted doppelgangers are generated as follows. In order to systematically mask the regions, we use the pix2face algorithm [7] to fit a 3D face mesh onto each facial image. We then project the facial region masks onto the images. We use pluralistic inpainting [40] to synthesize an image completion in each masked region. Fig. 5 shows examples of these inpainted doppelgangers.

A key challenge in constructing the inpainting dataset is to enforce that the inpainted nonmate is in fact a different identity. Many of the inpainted images are not sufficiently different from the original mated identity for a specific network. A given triplet of (probe, mate, inpainted nonmate) is only included in the dataset if the given target network can distinguish the two identities for both the mate/mate doppelganger and the probe/probe doppelganger. The network is required to distinguish these identities both under a nearest-match protocol and under a verification protocol, where the verification match threshold for the target network is calibrated to a false alarm rate of 1e-4. Specifically, each triplet has to fulfill the following criteria in order to be included in the dataset for a given network:

1. The original probe must be: (i) more similar to the original/mated identity than the corresponding inpainted/nonmated identity, and (ii) correctly verified as the original/mated identity at the calibrated verification threshold.

2. The inpainted probe must be: (i) more similar to the corresponding inpainted/nonmated identity than the original identity, and (ii) correctly verified as the same identity as the inpainted/nonmated identity at the calibrated verification threshold.

The inpainting dataset is filtered for each target network according to the above criteria, resulting in a dataset specific to that target network.
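A minimal sketch of this per-triplet filter follows, under the assumptions that similarity is negative squared embedding distance and that the calibrated verification threshold is given; all names here are illustrative.

import numpy as np

def keep_triplet(probe, inpainted_probe, mate, nonmate, embed, verify_thresh):
    # True only if the target network satisfies both criteria of Sec. 4.2.
    def sim(a, b):
        return -np.sum((embed(a) - embed(b)) ** 2)
    # 1. Original probe is nearest to, and verifies as, the mated identity.
    c1 = sim(probe, mate) > sim(probe, nonmate) and sim(probe, mate) > verify_thresh
    # 2. Inpainted probe is nearest to, and verifies as, the inpainted identity.
    c2 = (sim(inpainted_probe, nonmate) > sim(inpainted_probe, mate)
          and sim(inpainted_probe, nonmate) > verify_thresh)
    return c1 and c2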

For example, for the ResNet-101 based system, the final filtered dataset includes 84 identities and 543 triplets, filtered down from 95 identities and 3648 triplets. Lower performing networks will generally have fewer triplets satisfying the selection criteria than higher performing networks, because they will not be able to discriminate as many of the subtle changes in the inpainted probes.

4.3 Evaluation Metrics

The XFR algorithms estimate the likelihood that each pixel belongs to a region that is discriminative for matching the probe to the mated identity over the nonmated/inpainted identity. These discriminative pixel estimations form a saliency map, where the brightest pixels are estimated to be most likely to belong to the discriminative region. Fig. 4 shows an example and saliency predictions at two thresholds, where the saliency prediction is shown at different thresholds as a binary mask.

Figure 6: (left) Inpainting game analysis using VGGFace2 ResNet-50. (right) Inpainting game analysis using LightCNN. Refer to Table 1 for a summary of performance at fixed operating points on these curves.

In order to motivate our proposed metric, first consider using a classic receiver operating characteristic (ROC) curve for evaluation of the inpainting game, rather than our proposed metric.

A ROC curve can be generated by sweeping a threshold for the pixel saliency estimations, and computing the true accept rate and false alarm rate by using the inpainted region as the positive/salient region and the non-inpainted region as the negative/non-salient region (i.e. the middle column in Fig. 5). However, not all pixels within the inpainted region contribute equally to the identity, and the saliency algorithm should be neither penalized nor credited for this selection.

To address this key challenge, we use mean nonmate classification rate instead of true positive rate for saliency classification.

We play a game where the pixels classified as being salient by sweeping the saliency threshold are replaced with the pixels from the “inpainted probe”, which is not provided to the saliency algorithm. These “blended probes” can then be classified as original identity or inpainted nonmate identity by the network being tested. High performing XFR algorithms will correctly assign more saliency for the inpainted regions that will change the identity of the blended probes without increasing the false alarm rate of the pixel salience classification.

This is the key idea behind our evaluation metric. The false positive rate is calculated from salient pixel classification across all triplets, using the ground truth masks for the blended probe. The mean nonmate classification rate is weighted by the number of triplets within each facial region for a filtered dataset, to avoid bias towards subprotocols with more examples. Examples of the output curves for this metric are shown in Fig. 6.
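A minimal sketch of the threshold sweep for a single triplet follows; nearest-neighbor identity assignment by squared embedding distance is an illustrative assumption, and the names are our own.

import numpy as np

def nonmate_flip_curve(saliency, probe, inpainted_probe, mate_emb, nonmate_emb,
                       embed, thresholds):
    # For each saliency threshold, replace the salient probe pixels with the
    # inpainted probe's pixels and test whether the blended probe flips to
    # the nonmate identity under the target network.
    flips = []
    for t in thresholds:
        mask = (saliency >= t)[..., None]                 # binary saliency map
        blended = np.where(mask, inpainted_probe, probe)  # blended probe
        e = embed(blended)
        flips.append(np.sum((e - nonmate_emb) ** 2) < np.sum((e - mate_emb) ** 2))
    return np.array(flips)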

Table 1: Inpainting game evaluation results. This table summarizes the performance at two operating points of false alarm rate (1E-2, 5E-2) for the performance curves in Fig. 6 (ResNet-50 and LightCNN) and in the supplementary material. Mean nonmate classification rate is the proportion of triplets where the identity of the blended image was correctly “flipped” to the doppelganger.

Results show that our new methods (DISE, Subtree EBP) outperform the state of the art by a wide margin on three matchers. Detailed subprotocol results and curves are provided in the supplemental material.

5 Experimental Results

5.1 Inpainting Game Quantitative Evaluation

We ran the inpainting game evaluation protocol on the inpainting dataset using three target networks: LightCNN [35], VGGFace2 ResNet-50 [5] and a custom trained ResNet-101. We considered the five XFR algorithms described in Sec. 3 forming the benchmark for XFR evaluation.

The evaluation results are summarized in Table 1 and plotted in Fig. 6 and in the supplementary material. The summary table shows, for each combination of network and XFR algorithm, performance at two false alarm rates (1E-2, 5E-2) for the full protocol and three subprotocols: eyes, nose and eyebrows only. Additional results in the supplementary material show the results for the individual facial region subprotocols.

Overall, results show that for deeper networks (ResNet-101, ResNet-50), the top performing XFR algorithm is DISE. However, for shallower networks (LightCNN), the top performing algorithm is Subtree EBP.

Both of these new approaches outperform the state of the art (EBP, cEBP, tcEBP) by a wide margin. We believe that DISE outperforms Subtree EBP since subtree EBP cannot localize image regions any better than the underlying network represents faces.

For example, consider the eyebrows subprotocol result in the supplementary material, which shows that subtree EBP cannot represent eyebrows independently from the eyes.

DISE can mask image regions independently from the underlying target network and correctly localize eyebrow effects.

6 Conclusions

In this paper, we introduced the first comprehensive benchmark for XFR. We motivated the need for XFR and described a new quantitative method for comparing XFR algorithms using the inpainting game.

The results show that the DISE and subtree EBP methods provide a significant performance improvement over the state of the art, which provides a new baseline for visualizing discriminative features for face recognition.

This evaluation protocol provides a means to compare different approaches to network saliency, and we believe this form of quantitative evaluation will help encourage research in this emerging area of explainable AI for face recognition.

All software and datasets for reproducible research are available for download at http://stresearch.github.io/xfr.

Acknowledgement. This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) under contract number 2019-19022600003. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government.

The U.S. Government is authorized to reproduce and distribute reprints for Governmental purpose notwithstanding any copyright annotation thereon.

References

  • [1] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Muller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. In PLOS ONE, vol. 10, no. 7,p. e0130140, 2015.
  • [2] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3319–3327, 2017.
  • [3] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In S. A. Friedler and C. Wilson, editors, Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 77–91, New York, NY, USA, 23–24 Feb 2018. PMLR.
  • [4] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, and Others. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2956–2964, 2015.
  • [5] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition, 2018.
  • [6] G. Castanon and J. Byrne. Visualizing and quantifying discriminative features for face recognition. In International Conference on Automatic Face and Gesture Recognition, 2018.
  • [7] D. Crispell and M. Bazik. Pix2face: Direct 3d face model estimation. In 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), pages 2512–2518, Oct. 2017.
  • [8] P. Dabkowski and Y. Gal. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pages 6967–6976, 2017.
  • [9] P. Dhar, A. Bansal, C. D. Castillo, J. Gleason, P. J. Phillips, and R. Chellappa. How are attributes expressed in face dcnns? ArXiv, abs/1910.05657, 2019.
  • [10] B. Duchaine and K. Nakayama. The cambridge face memory test: Results for neurologically intact individuals and an investigation of its validity using inverted face stimuli and prosopagnosic participants. Neuropsychologia, 44:576–585, 2006.
  • [11] Facial Identification Scientific Working Group. FISWG guidelines for facial comparison methods. In FISWG standards version 1.0 - 2012-02-02, 2012.
  • [12] R. Fong and A. Vedaldi. Interpretable Explanations of Black Boxes by Meaningful Perturbation. arXiv preprint arXiv, 2017.
  • [13] C. Garvie, A. Bedoya, and J. Frankle. The perpetual line-up: Unregulated police face recognition in america. In Technical report, Georgetown University Law School, 2018.
  • [14] C. Grimm, D. Arumugam, S. Karamcheti, D. Abel, L. L. Wong, and M. L. Littman. Latent attention networks. In arXiv:1706.00536v1, 2017.
  • [15] P. Grother, M. Ngan, and K. Hanaoka. Face recognition vendor test (frvt) part 3: Demographic effects. In NISTIR 8280, 2019.
  • [16] R. Hu, J. Andreas, T. Darrell, and K. Saenko. Explainable neural computation via stack neural module networks. In ECCV, 2018.
  • [17] P.-J. Kindermans, K. T. Schütt, M. Alber, K.-R. Müller, D. Erhan, B. Kim, and S. Dähne. Learning how to explain neural networks: Patternnet and patternattribution. arXiv preprint arXiv:1705.05598, 2017.
  • [18] H. Li, K. Mueller, and X. Chen. Beyond saliency: understanding convolutional neural networks from saliency prediction on layer-wise relevance propagation. Image Vision Comput., 83-84:70–86, 2017.
  • [19] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [20] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, et al. Iarpa janus benchmark-c: Face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pages 158–165. IEEE, 2018.
  • [21] A. M. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In NIPS, 2016.
  • [22] O. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
  • [23] V. Petsiuk, A. Das, and K. Saenko. Rise: Randomized input sampling for explanation of black-box models. British Machine Vision Conference (BMVC), 2018.
  • [24] P. J. Phillips, A. N. Yates, Y. Hu, C. A. Hahn, E. Noyes, K. Jackson, J. G. Cavazos, G. Jeckeln, R. Ranjan, S. Sankaranarayanan, J.-C. Chen, C. P. del Castillo, R. Chellappa, D. White, and A. J. O’Toole. Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms. In Proceedings of the National Academy of Sciences of the United States of America, 2018.
  • [25] I. D. Raji and J. Buolamwini. Actionable auditing: Investigating the impact of publicly naming biased performance results of commercial ai products. In AIES ’19, 2019.
  • [26] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In KDD ’16, 2016.
  • [27] B. RichardWebster, S. Y. Kwon, C. Clarizio, S. E. Anthony, and W. J. Scheirer. Visual psychophysics for making face recognition algorithms more explainable. In European Conference on Computer Vision (ECCV), 2018.
  • [28] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28:2660–2673, 2015.
  • [29] W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K. Müller, editors. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, 2019.
  • [30] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
  • [31] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
  • [32] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
  • [33] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. ICLR, 2014.
  • [34] A. Stylianou, R. Souvenir, and R. Pless. Visualizing deep similarity networks. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2029–2037. IEEE, 2019.
  • [35] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
  • [36] T. Xu, J. Zhan, O. G. B. Garrod, P. H. S. Torr, S.-C. Zhu, R. A. A. Ince, and P. G. Schyns. Deeper interpretability of deep networks. ArXiv, abs/1811.07807, 2018.
  • [37] B. Yin, L. Tran, H. Li, X. Shen, and X. Liu. Towards interpretable face recognition. In In Proceeding of International Conference on Computer Vision, Seoul, South Korea, October 2019.
  • [38] T. Zee, G. Gali, and I. Nwogu. Enhancing human face recognition with an interpretable neural network. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
  • [39] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. In European Conference on Computer Vision (ECCV), 2016.
  • [40] C. Zheng, T.-J. Cham, and J. Cai. Pluralistic image completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1438–1447, 2019.
  • [41] Y. Zhong and W. Deng. Exploring features and attributes in deep face recognition using visualization techniques. In IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 2019.
  • [42] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.

Appendix 0.A Supplementary Material

0.A.1 Qualitative Visualization Study

Figure S1: Qualitative visualization study. This figure shows the XFR saliency maps generated using the Light-CNN Subtree EBP method for 16 probes (columns) of 7 subjects (rows), each with 16 mates (not shown) and a common set of 8000 nonmates (not shown), all sampled from VGGFace2 [5]. Results show that the discriminative features used to distinguish a subject from the entire nonmate population vary across probes, but are primarily the nose and mouth for frontal probes, with the eyes added for non-frontal probes. See supplemental Fig. S15 for additional examples.

The inpainting game provides a quantitative comparison of XFR algorithms; however, it does not provide insight into how useful these algorithms are on novel face images. In this section, we provide a qualitative study of XFR algorithms visualized on a standard set of triplets. We consider two target networks, ResNet-101 and Light-CNN [35], and provide visualizations for the whitebox methods referenced in the main submission. This analysis includes the following figures showing qualitative visualization results for combinations of (target network, XFR method): (ResNet-101, EBP, Fig. S3), (ResNet-101, cEBP, Fig. S4), (ResNet-101, tcEBP, Fig. S5), (ResNet-101, Subtree EBP, Fig. S6), (Light-CNN, EBP, Fig. S8), (Light-CNN, cEBP, Fig. S9), (Light-CNN, tcEBP, Fig. S10), (Light-CNN, Subtree EBP, Fig. S11). Finally, we show results for the Light-CNN using only single probes (Fig. S12) or repeated probes (Fig. S13) to highlight the effect of non-mates in the triplet visualization.
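To make the triplet setup concrete, the following is a minimal sketch of a gradient-based triplet saliency map, offered only as a simple stand-in for the whitebox EBP-family methods above (which use excitation backprop [39] rather than raw input gradients). All names (net, probe, mate, nonmate) are illustrative assumptions: net is any PyTorch face embedder mapping a 1xCxHxW image to a 1xD embedding.

    import torch
    import torch.nn.functional as F

    def triplet_saliency(net, probe, mate, nonmate):
        # probe, mate, nonmate: 1xCxHxW tensors, preprocessed for `net`.
        probe = probe.clone().detach().requires_grad_(True)
        with torch.no_grad():
            e_mate = F.normalize(net(mate), dim=1)
            e_nonmate = F.normalize(net(nonmate), dim=1)
        e_probe = F.normalize(net(probe), dim=1)
        # The quantity a triplet XFR method explains: why the probe matches
        # the mate and not the nonmate.
        margin = (e_probe * e_mate).sum() - (e_probe * e_nonmate).sum()
        margin.backward()
        saliency = probe.grad.abs().sum(dim=1).squeeze(0)  # collapse channels to HxW
        return saliency / (saliency.max() + 1e-12)         # normalize to [0, 1]

Pixels with large gradient magnitude are those whose perturbation most changes the mate-vs-nonmate margin, which is the same contrastive question the EBP-family methods answer via attention.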

From this visualization study, we draw the following conclusions:

  1. Non-localized. Unlike facial examiners, who leverage the complete FISWG standards for facial comparison, there is no evidence that modern face matchers leverage localized discriminating features such as scars, marks and blemishes. All visualizations are centered on the facial interior, and almost no activation falls on the shape of the head. The systems also tend to overgeneralize, representing all faces in a standard manner using the eyes, nose, brow and mouth, while ignoring localized features such as moles or facial markings.

  2. Pose variant. The target networks tested are not truly pose invariant. When considering different probes of the same subject that differ in pose, the whitebox systems can generate different visualizations, suggesting that the underlying networks remain pose variant.

  3. Triplet specific. The features used for recognition depend on the selection of the triplet, notably the choice of the non-mate for comparison. The visualized features are more consistent when considering a larger non-mate set (Fig. S1); see the aggregation sketch after this list.

  4. Network specific. The visualized features depend on the target network selected for visualization. A higher-performing network (Light-CNN) tends to use more facial features, adding the brow and mouth to the eyes and nose, than a lower-performing network (ResNet-101). No network tested so far uses the hair or chin.
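Conclusion 3 suggests stabilizing the visualization by aggregating over many non-mates, as in the 8000-nonmate set of Fig. S1. A minimal sketch, reusing the hypothetical triplet_saliency above and assuming simple averaging (the Subtree EBP aggregation used for Fig. S1 may differ):

    def population_saliency(net, probe, mate, nonmates):
        # nonmates: list of preprocessed 1xCxHxW tensors. Averaging over a
        # large nonmate population washes out triplet-specific idiosyncrasies.
        total = None
        for nonmate in nonmates:
            s = triplet_saliency(net, probe, mate, nonmate)
            total = s if total is None else total + s
        return total / len(nonmates)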

Figure S2: Whitebox visualization overview. This montage shows a set of 16 randomly selected subjects from IJB-C, such that every row has the same identity. Images $(i,j)$ in this montage define a triplet $(m_i, p_{ij}, n_j)$ for probe $p_{ij}$, mate $m_i$ in the first entry of column $i$, and non-mate $n_j$ in the first entry of row $j$. Non-mates are ordered such that the diagonal contains the nearest non-mated subject in IJB-C. In other words, for triplet $(m_i, p_{ii}, n_i)$, non-mate $n_i$ is more similar to $m_i$ than any other non-mate $n_j$, using a ResNet-101 matching system. This montage is used to visualize how the whitebox saliency map changes when considering different triplets.
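The diagonal ordering described above can be computed directly from the matcher's embeddings. A minimal sketch, assuming L2-normalized mate embeddings stacked into an NxD tensor; this is a simplification, since the full montage additionally requires a consistent row/column permutation of subjects:

    import torch

    def nearest_nonmate(embeddings):
        # embeddings: NxD L2-normalized mate embeddings, one row per subject.
        # Returns, for each subject i, the index of the most similar other
        # subject, i.e. the doppelganger placed on the Fig. S2 diagonal.
        sim = embeddings @ embeddings.t()   # cosine similarity matrix
        sim.fill_diagonal_(float('-inf'))   # exclude self-matches
        return sim.argmax(dim=1)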
Figure S3: EBP (ResNet-101). This montage shows the same images as in Fig. S2, but with a whitebox saliency map derived from excitation backprop for a whitebox ResNet-101 system. Observe that EBP always selects the eyes and nose, no matter which non-mated subject is being considered. This does not provide subtle distinctions between the regions that are discriminative for a mate vs. a non-mate, but it does visualize the regions of the probe that are used for classification. This visualization should be compared with Fig. S8, which shows the same subjects and whitebox method on a different underlying trained network (Light-CNN).
Figure S4: Contrastive triplet EBP (ResNet-101). This montage shows contrastive EBP for a ResNet-101 whitebox. Observe that this saliency map is unstable, and at times places saliency on the background of the image (e.g. probe (16,15)). This is a known challenge of contrastive EBP, which led to the development of truncated contrastive EBP.
Figure S5: Truncated contrastive triplet EBP (ResNet-101). This montage shows truncated contrastive triplet EBP.
Figure S6: Subtree Triplet EBP (ResNet-101). This montage shows subtree triplet EBP.
Figure S7: Single probe montage (ResNet-101). This montage compares four whitebox methods on a common set of probes, with the non-mates now ordered in decreasing similarity to the mate. This shows how the different methods compare for real-world doppelgangers.
Figure S8: EBP (Light-CNN [35]). This montage generates the EBP saliency map for the Light-CNN network. This should be compared with Fig. S3, which shows that this network exhibits more saliency around the mouth and brow than the ResNet-101 network.
Figure S9: Contrastive triplet EBP (Light-CNN [35]). This montage should be compared with Fig. S4 to show the differences in contrastive triplet EBP between ResNet-101 and Light-CNN.
Figure S10: Truncated contrastive triplet EBP (Light-CNN [35]). This montage should be compared with Fig. S5 to show the differences in tcEBP between ResNet-101 and Light-CNN.
Figure S11: Subtree triplet EBP (Light-CNN [35]). This montage should be compared with Fig. S6 to compare subtree triplet EBP for ResNet-101 vs. Light-CNN.
Figure S12: Single probe montage (Light-CNN [35]). This montage should be compared with Fig. S7 to compare the effect of top-k non-mates for ResNet-101 vs. Light-CNN.
Figure S13: Repeated probe montage (Light-CNN [35]). This montage shows the same probe repeated across each row to highlight the effect of the non-mate in the triplet on the resulting saliency map.
Figure S14: Layerwise EBP. This montage shows the EBP saliency map generated starting from the maximum excitation of each layer in a ResNet-101 network. The layers are ordered rowwise, from the embedding layer in the upper left down to the image layer in the bottom right. The visualization encodes the saliency map as the alpha channel of a cropped face image, so that non-zero saliency results in a more opaque (less transparent) region; this style is useful for accentuating small activations. The result shows that saliency maps starting from layers close to the embedding produce holistic regions covering the eyes and nose; layers in the middle show parts such as the eyes, nose and mouth; layers close to the image are highly localized on specific regions; and some layers provide no excitation at all.
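The alpha-channel encoding of Fig. S14 is straightforward to reproduce. A minimal sketch with NumPy and PIL; the floor parameter, which keeps a faint base image visible, is our illustrative choice rather than a detail taken from the figure:

    import numpy as np
    from PIL import Image

    def saliency_as_alpha(face_rgb, saliency, floor=0.1):
        # face_rgb: HxWx3 uint8 face crop; saliency: HxW map in [0, 1].
        # Map saliency to opacity so that non-zero saliency is more opaque,
        # which accentuates small activations.
        alpha = ((floor + (1.0 - floor) * saliency) * 255.0).astype(np.uint8)
        return Image.fromarray(np.dstack([face_rgb, alpha]), mode="RGBA")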
Figure S15: Qualitative visualization study. This figure shows the XFR saliency maps generated using the Light-CNN Subtree EBP method for 16 probes (columns) of 16 subjects (rows), each with 16 mates (not shown) and a common set of 8000 nonmates (not shown), all sampled from VGGFace2 [5]. Results show that the discriminative features used to distinguish a subject from the entire nonmate population are primarily the nose and mouth for frontal probes and the eyes for non-frontal probes. These network attention maps are remarkably consistent across probes and provide insight into the features a network uses to distinguish a subject from a large set of nonmates (i.e., what makes you unique).
Figure S16: Cheek/Chin Mask (ResNet-101): Evaluation plot and classification on saliency maps from Subtree EBP and DISE at identity flip.
Figure S17: Mouth Mask (ResNet-101): Evaluation plot and classification on saliency maps from Subtree EBP and DISE at identity flip.
Figure S18: Nose Mask (ResNet-101): Evaluation plot and classification on saliency maps from Subtree EBP and DISE at identity flip.
Figure S19: Eyebrow Mask (ResNet-101): Evaluation plot and classification on saliency maps from Subtree EBP and DISE at identity flip.
Figure S20: Left-/Right-Face Mask (ResNet-101): Evaluation plot and classification on saliency maps from Subtree EBP and DISE at identity flip.
Figure S21: Left-/Right-Eye Mask (ResNet-101): Evaluation plot and classification on saliency maps from Subtree EBP and DISE at identity flip.
Figure S22: Inpainting game analysis using the ResNet-101 network.