Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. Besides, due to the substantial gap between modalities, existing methods embed the original modal features into the same latent space for cross-modal alignment. However, feature embedding may lead to intra-modal information distortion. Recently, Contrastive Language-Image Pre-training (CLIP) has attracted extensive attention from researchers due to its powerful semantic concept learning capacity and rich multi-modal knowledge, which can help us solve the above problems. Accordingly, in this paper, we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we perform fine-grained information excavation to mine intra-modal discriminative clues and inter-modal correspondences. Specifically, we first design a multi-grained global feature learning (MGF) module to fully mine the discriminative local information within each modality, which can emphasize identity-related discriminative clues by enhancing the interactions between the global image (text) and informative local patches (words). MGF generates a set of multi-grained global features for later inference. Secondly, cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules are proposed to establish cross-grained and fine-grained interactions (image-word, sentence-patch, word-patch) between modalities, which can filter out unimportant and non-modality-shared image patches/words and mine cross-modal correspondences from coarse to fine. CFR and FCD are removed during inference to save computational costs. Note that the above process is performed in the original modality space without further feature embedding. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method on TIReID.
Index Terms-Text-Image Person Re-identification, Multi-modal Correspondence Information, Intra-modal Information Distortion, Fine-grained Information Excavation.
I. INTRODUCTION
Person Re-identification (ReID) is a popular and challenging task in computer vision. In the past decade, ReID has made remarkable progress [1]-[4] and has been successfully applied in some practical scenarios. Most existing ReID approaches assume that pedestrian images can be captured across disjoint cameras, and tend to ignore the situation that pedestrian images cannot be obtained in some complex or special scenes, such as remote roads without cameras or places where pedestrians are completely occluded. Although pedestrian images are not available, we can find witnesses at the scene and search for the target pedestrian through a witness's language description, that is, text-image person re-identification (TIReID) [5]. Due to its great practical value, TIReID has attracted increasing attention from both academia and industry.

S. Yan, N. Dong, and J. Tang are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: shuanglinyan@njust.edu.cn; neng.dong@njust.edu.cn; jinhuitang@njust.edu.cn).

L. Zhang is with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China (e-mail: zhangliyan@nuaa.edu.cn).

Corresponding author: Liyan Zhang.

Fig. 1. The motivation for our proposed method. (a) CLIP learns visual representations with natural language supervision using web-scale image-text data, and the learned visual representations contain rich semantic information and cross-modal correspondence information. (b) We explore leveraging the powerful knowledge of CLIP for TIReID. CLIP is trained to focus only on instance-level representations, while TIReID requires fine-grained discriminative clues. To take full advantage of the rich prior knowledge from CLIP, our CFine introduces three innovative modules (MGF, CFR, and FCD) to mine intra-modal discriminative clues and inter-modal fine-grained correspondences.
As a fine-grained cross-modal retrieval task, the key of TIReID is to mine the fine-grained information of images and texts and establish their correspondences. In recent years, many effective methods [6]-[11] have been proposed, all of which follow the same structural design of "image/text backbone + feature embedding", where the image/text backbone first extracts image/text features, and then a method-specific feature embedding module embeds the extracted image and text features into a joint space for cross-modal alignment. The image/text backbone generally utilizes external knowledge to facilitate learning, and is mainly initialized by single-modality pre-training (e.g., ResNet [12] and ViT [13] pre-trained on ImageNet, or the pre-trained language model BERT [14]), which lacks multi-modal correspondence information. In addition, existing methods generally believe that feature embedding is the core module of the TIReID task, which is crucial for narrowing the semantic gap between modalities. However, several works [15], [16] have shown that projecting different modalities into a joint space may lead to intra-modal information distortion due to the distinct data properties of images and texts. So we need to reconsider whether feature embedding is necessary for the TIReID task.
Recently, the remarkable success of visual-language pre-training (VLP) has shown its ability to learn semantically rich and high-quality visual concepts with natural language supervision, and the most representative work is Contrastive Language-Image Pre-training (CLIP) [17]. Compared with single-modal pre-training, CLIP contains abundant multi-modal knowledge. In light of the power of CLIP, some recent works have attempted to exploit its ample knowledge for various tasks and achieved impressive results, such as video-text retrieval [18], referring image segmentation [19], dense prediction [20], and video understanding [21]. Besides, the semantic-level visual concept representation capacity of CLIP makes it possible to align image and text in the original modality space. Inspired by this, we explore how to transfer the CLIP model to TIReID in this paper. We found that fine-tuning CLIP directly on TIReID is effective. However, CLIP is trained to pay attention only to instance-level representations (image-level, sentence-level), while TIReID requires the model to focus on fine-grained information and inter-modal correspondences to distinguish the subtle differences between pedestrians (see Figure 1). Thus, the direct usage of CLIP can be sub-optimal for TIReID due to the task gap.
To fully exploit the powerful knowledge of the CLIP model, we propose a novel CLIP-driven Fine-grained information excavation framework, namely CFine, for TIReID. As shown in Figure 1, CFine mainly includes two parts: modality-specific feature extraction and fine-grained information excavation. To be specific, CFine first adopts modality-specific encoders to extract image and text representations, and then the fine-grained information excavation part is exploited to mine intra-modal discriminative details and inter-modal fine-grained correspondences for better transferring the knowledge of CLIP to TIReID. The fine-grained information excavation includes three main components. First, we propose a multi-grained global feature learning (MGF) module to fully mine the identity-related subtle clues within each modality. In this module, we first design a token selection process to pick out a set of informative tokens (discriminative patches/words) based on the self-attention scores between the class token and local tokens for each modality. Then, the informative token set is split into multiple subsets and fed into a global-local decoder (GLD) to generate a set of multi-grained global features by enhancing the interactions between the global image (text) and local discriminative patches (words). Second, we design a cross-grained feature refinement (CFR) module to filter out non-modality-shared information in the selected tokens by computing cross-grained similarities (image-word, sentence-patch) between modalities, and to establish rough cross-modal correspondences. Third, to establish inter-modal fine-grained correspondences, we propose a fine-grained correspondence discovery (FCD) module to discover the relationships between words and image patches. Note that the entire learning process of CFine is performed in the original feature space without further feature embedding, and is optimized in an end-to-end manner. During inference, CFR and FCD are removed, and the multi-grained global image and text features generated by MGF are used for cross-modal retrieval. Our main contributions are summarized as follows:
We propose a CLIP-driven fine-grained information excavation framework to transfer the knowledge of CLIP to TIReID, achieving fine-grained text-image alignment without further feature embedding. To the best of our knowledge, we are the first to leverage the ample cross-modal knowledge from VLP to facilitate learning for TIReID.
We take full advantage of rich multi-modal knowledge from CLIP via three innovative modules, i.e., multi-grained global feature learning, cross-grained feature refinement, and fine-grained correspondence discovery.
We conduct extensive experiments on three benchmarks to validate the effectiveness of our CFine. CFine performs significantly better than previous methods, reaching 69.57% Rank-1 on CUHK-PEDES, 60.83% Rank-1 on ICFG-PEDES, and 50.55% Rank-1 on RSTPReid, outperforming the previous SOTA methods on all three benchmarks.
The remainder of the paper is organized as follows. We first review the related works in Section II; Section III describes the proposed CFine in detail; Section IV reports extensive experimental results and analysis; and finally the paper is summarized in Section V.
II. RELATED WORK
A. Text-Image Person Re-identification
TIReID is a class of multi-modal tasks [22], [23], which was first proposed by [5]. Compared with general cross-modal retrieval tasks, TIReID is more challenging due to its fine-grained property. The key to TIReID is cross-modal alignment, and existing methods can be broadly classified into two classes according to the alignment strategy: cross-modal interaction-based and cross-modal interaction-free methods. Cross-modal interaction-based methods [5], [8], [9], [24]-[28] focus on mining local correspondences (e.g., patch-word, patch-phrase) between images and texts by using attention mechanisms to predict the matching score for image-text pairs. Gao et al. [8] designed a contextual non-local attention, which adaptively aligns image and text features across all scales in a coarse-to-fine way according to their semantics. The cross-modal interaction mechanism is a double-edged sword: its advantage lies in better aligning image-text pairs and reducing the modality gap through sufficient cross-modal interaction, so such methods can achieve superior performance. Its disadvantage, however, is the high computational cost, which greatly reduces the practicability of such methods.
For cross-modal interaction-free methods, some early works [6], [7], [29]-[34] mainly focused on designing networks and optimization losses to learn aligned image and text embeddings in a joint latent space. Zhang et al. [30] designed a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss to learn discriminative image-text embeddings, which are among the most commonly used cross-modal matching losses for TIReID. These early methods are efficient, but their performance is not satisfactory. In recent years, some effective and lightweight models [10], [11], [35] have been proposed, which can achieve better performance than the first type of methods without complex cross-modal interaction. With the success of Transformer in various vision and language tasks, several Transformer-based methods [36], [37] have been proposed recently and achieved state-of-the-art performance. All in all, TIReID has achieved remarkable progress over the past few years.
However, existing methods initialize the network with single-modality pre-trained model parameters, ignoring multi-modal correspondence information. Besides, the image and text embeddings extracted from the above-initialized networks over-mine the information within a single modality, which increases the difficulty of cross-modal alignment and network optimization. Recently, visual-language pre-training (VLP) has attracted growing attention, especially CLIP [17], whose advantages can effectively alleviate the above problems. Thus, in this paper, we explore leveraging the powerful knowledge of the CLIP model for TIReID.
B. Vision-Language Pre-Training
The paradigm of "pre-training and fine-tuning" is one of the most important paradigms driving the development of the computer vision community. Its basic process is that the model is initialized with parameters pre-trained on large-scale datasets and then fine-tuned on various downstream tasks. In this paradigm, the quality of the pre-training model plays a vital role in the optimization difficulty and performance of the model during fine-tuning. In the past decade, pre-training models in the single-modality domain [12]-[14] have achieved great success. Recently, many works [17], [38]-[42] have attempted to extend pre-training models to the multi-modal field, that is, visual-language pre-training (VLP), and made remarkable progress. Mainstream VLP models can be divided into two categories according to their pre-training tasks: (1) image-text contrastive learning tasks, which align images and texts in a shared space through a cross-modal contrastive learning loss, e.g., CLIP [17], ALIGN [38], FILIP [39]; (2) language-modeling-based tasks, where auxiliary tasks (masked language/region modeling, image captioning, text-grounded image generation) are used to establish the correspondences between images and texts, e.g., VisualBERT [41], UNITER [42]. VLP models can bring greater performance gains for cross-modal tasks and fine-grained visual tasks, which is confirmed by a large number of follow-up works. We also expect to leverage the ample multi-modal knowledge of VLP to further advance the TIReID task.
C. CLIP-Based Fine-tuning
As one of the most representative VLP models, Contrastive Language-Image Pre-training (CLIP) [17] has attracted much attention. Different from traditional image-based supervised pre-training models, CLIP employs natural language to supervise the learning of visual features via contrastive learning on web-scale image-text data. Benefiting from semantic-level language supervision, the visual network can learn high-quality visual features with rich semantic information, which has an impressive positive impact on cross-modal tasks and fine-grained visual tasks. Recently, a lot of follow-up works [18]-[21], [43]-[48] have been put forward to fine-tune CLIP for various downstream tasks. The most common is to adapt CLIP to cross-modal tasks, such as video-text retrieval [18], [43]-[45], video captioning [46], and referring image segmentation [19]. Besides, some efforts have recognized the semantic-level, high-quality visual concept representation capacity of CLIP and applied it to fine-grained visual tasks, including dense prediction [20], point cloud understanding [48], and video recognition [21], achieving impressive results. As both a cross-modal retrieval task and a fine-grained recognition task [49], [50], TIReID can also benefit from CLIP. Thus, in this paper, we try to explore an effective framework to fully transfer the CLIP model to the TIReID task.
III. METHODS
A. Motivation and Overview of CFine
Existing methods in TIReID utilize external knowledge from single-modality pre-training to facilitate learning, which lacks multi-modal correspondence information. However, it is unaffordable to directly learn a visual-language pre-training model for TIReID from scratch, as doing so requires inaccessible large-scale image-text data and expensive training resources. Recently, visual-language pre-training (VLP) has made significant progress and shown rich cross-modal correspondence information and powerful visual representation capacity. To leverage the powerful capacity of VLP, several efforts have attempted to transfer the prior knowledge of VLP to various downstream tasks, achieving impressive results. Inspired by these works, our method builds upon the currently most popular VLP model, namely CLIP [17], and extends it with fine-grained information excavation to better adapt to the TIReID task, instead of learning a new pre-training model from scratch. An overview of the proposed CFine is illustrated in Figure 2.
Fig. 2. Overview of the proposed CFine. Given image-text pairs, we first extract global image/text features and local patch/word features with the image/text encoder. After that, token selection is performed to select informative patch/word features from the local patch/word features. The global image/text features, local patch/word features, and selected informative patch/word features are sent to the multi-grained global feature learning module, which helps reinforce fine-grained clues and generates a set of multi-grained global image/text features. The cross-grained feature refinement and fine-grained correspondence discovery modules take the global image/text features and the selected informative patch/word features as input to filter out non-modality-shared patches/words and mine cross-modal correspondences from coarse to fine; they generate two similarity matrices for a batch of image-text pairs and are removed during inference. Finally, the multi-grained global features and the similarity matrices are supervised by the CMPM+CMPC loss and the triplet ranking loss, respectively, for cross-modal alignment.

Given a set of pedestrian images and text descriptions, we first feed them into the dual encoders to extract the image and text features. Second, to better adapt to TIReID, fine-grained information excavation is performed to mine intra-modal fine-grained information and inter-modal fine-grained correspondences. For fine-grained information excavation, three modules are proposed: (1) multi-grained global feature learning (MGF) mines discriminative local clues according to informative tokens at different levels and generates a set of multi-grained features; (2) cross-grained feature refinement (CFR) is proposed to filter out unnecessary information in images/texts and ensure the confidence of the informative tokens; (3) fine-grained correspondence discovery (FCD) is proposed to establish local fine-grained correspondences between patches and words. Finally, the similarity between the learned image and text representations is computed by the cosine similarity function, whose goal is to maximize the similarity if the image and text are matched and minimize it otherwise.
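To make this matching objective concrete, the following is a minimal sketch (assumed tensor shapes and names, not the released implementation) of the batch-wise cosine similarity computation between final image and text features:

```python
# A minimal sketch (assumed shapes, not the authors' code): cosine similarity
# between a batch of image features and a batch of text features.
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
    """img_feats: (B, D), txt_feats: (B, D) -> similarity matrix of shape (B, B)."""
    img_feats = F.normalize(img_feats, dim=-1)   # unit-length image features
    txt_feats = F.normalize(txt_feats, dim=-1)   # unit-length text features
    return img_feats @ txt_feats.t()             # entry (i, j): sim(image_i, text_j)

# Training pushes the diagonal (matched pairs) up and the off-diagonal entries down.
sim = cosine_similarity_matrix(torch.randn(8, 768), torch.randn(8, 768))
```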
B. Dual Encoders
The structure of CLIP is shown in Figure 1; it includes an image encoder and a text encoder, both of which are composed of a feature extractor and a projector. The image and text feature extractors extract features through a ViT with a width of 768 and a Transformer with a width of 512, respectively, while the projectors map image and text features into a 512-dimensional latent space, in which image and text are aligned through a contrastive objective. The most direct way is to fine-tune CLIP on the TIReID dataset. However, several works [15], [16] have shown that the projectors may lead to intra-modal information distortion. This is unacceptable for TIReID, which relies on fine-grained information, especially for images. Yet if the projector is removed, the dimensions of CLIP's two encoders cannot be unified. Thus, in this paper, we only use the image encoder of CLIP with the projector removed as our image encoder. For text, we use another pre-trained language model, BERT [14], as the text encoder. In addition, in order to make a fair comparison with existing methods, we also use ViT pre-trained on ImageNet [51] as the image encoder.
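As a concrete illustration, one possible way to instantiate such dual encoders with the Hugging Face Transformers library is sketched below; the specific checkpoints ("openai/clip-vit-base-patch16", "bert-base-uncased") and the use of CLIPVisionModel (which omits CLIP's projection layer) are our assumptions rather than details taken from the paper:

```python
# A minimal sketch (assumed checkpoints, not the authors' release): take CLIP's
# vision transformer without its projection head so image features stay in the
# original 768-d space, and use pre-trained BERT for 768-d text features.
from transformers import BertModel, BertTokenizer, CLIPImageProcessor, CLIPVisionModel

# Image encoder: CLIPVisionModel exposes only the ViT trunk (no projection layer).
image_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Text encoder: pre-trained BERT with its lower-cased 30522-token vocabulary.
text_encoder = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
```

This pairing keeps both encoders at a 768-dimensional width, which is consistent with the motivation above for replacing CLIP's 512-dimensional text branch with BERT.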
Image Representation. Given an image, a visual tokenization process is first performed to convert the image into a discrete sequence of patch tokens. A learnable class token is prepended to the sequence as an image-level representation. Finally, the resulting token sequence is fed into the transformer of ViT. The output of the image encoder consists of an image-level global feature (from the class token) and patch-level local features (from the patch tokens).
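The forward pass below (again a hedged illustration with a placeholder input image, reusing the assumed encoders from the previous sketch) separates the image-level global feature from the patch-level local features:

```python
# A minimal sketch: forward an image batch through the CLIP vision trunk and
# split the output into the class-token global feature and patch-level features.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

image_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")      # as above
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")  # as above

pixel_values = image_processor(images=Image.new("RGB", (224, 224)), return_tensors="pt").pixel_values

with torch.no_grad():
    out = image_encoder(pixel_values=pixel_values)

tokens = out.last_hidden_state   # (B, 1 + N, 768): class token followed by N patch tokens
global_feat = tokens[:, 0]       # image-level global feature
local_feats = tokens[:, 1:]      # patch-level local features
```

The text side is symmetric: the output at the prepended special token serves as the sentence-level global feature, and the remaining token outputs serve as the word-level local features.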
Text Representation. For a text description, we directly use the pre-trained BERT [14] as the text encoder to generate the text representation. Specifically, lower-cased byte pair encoding (BPE) with a 30522 vocabulary size is first used to tokenize the text. Then, a special start token is prepended to the textual token sequence. Finally, the resulting token sequence is fed into the text encoder to generate the sentence-level global feature and the word-level local features.

Fig. 3. (a) Illustration of the token selection process and (b) structure of the multi-grained global feature learning (MGF) module, whose output is the multi-grained global feature set.
Due to the substantial gap between the upstream pre-training task and the downstream TIReID task, these features only involve instance-level modality information and cross-modal correspondences, lacking the fine-grained information that is critical to TIReID. Accordingly, in the following, we conduct fine-grained information excavation (MGF, CFR, and FCD) to fully mine intra-modal discriminative local clues and inter-modal fine-grained correspondences.
C. Multi-Grained Global Feature Learning
Due to the subtle visual variations among different pedestrians in ReID, it is crucial to fully mine the fine-grained information of images/texts to distinguish different pedestrians. Most existing TIReID methods mine fine-grained information by learning a set of local features. Unlike them, in this paper, we mine local information at different levels to learn global features at multiple granularities. Benefiting from the global dependency modeling capability of self-attention, Transformer achieves impressive results on various tasks. However, self-attention treats each local token in the same way when calculating the attention weights and then computes a weighted sum of all local tokens to generate a global feature. The global feature is dominated by all local tokens, and this way of considering all local tokens simultaneously reduces the influence of some important local tokens. Especially for fine-grained recognition tasks, this brings serious discriminative information loss. To solve this problem, instead of using all tokens to learn global features, we only select informative tokens to form multiple token sequences, and then send them to the global-local decoder to learn a set of multi-grained global features.
Token Selection. The class token, used as the output of the Transformer for classification or recognition, is obtained by a weighted aggregation of all local tokens. The weights reflect the correlation between the class token and each local token: the larger the weight, the greater the contribution of that local token to the class token, and the more important it is to the task. Therefore, we select informative tokens based on the correlation between local tokens and the class token [52]-[54]. Specifically, the self-attention of each Transformer block generates an attention map over the input tokens (the first of which is the class token), which reflects the correlations among them. The first row of the attention map represents the dependency between the class token and the local tokens. In this paper, we take the attention map generated by the self-attention of the last Transformer block, whose first row gives the correlation scores between the class token and the local tokens. We select the local tokens output by the Transformer that correspond to the highest correlation scores to construct a new discriminative local token sequence. The token selection process is shown in Figure 3 (a).
We perform the token selection process separately for images and texts: informative patch tokens are chosen from the patch-level local features, and informative word tokens are chosen from the word-level local features. The selected image and text token sequences are ordered by the amount of information contained in each token, and their lengths are determined by the selection ratios of image and text tokens, respectively.
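The selection step can be sketched as follows; averaging the attention heads and the exact gathering logic are our assumptions about details the text leaves open:

```python
# A minimal sketch of attention-based token selection (assumed detail: attention
# heads are averaged): keep the local tokens that the class token attends to
# most strongly in the last Transformer block.
import torch

def select_informative_tokens(tokens: torch.Tensor, last_attn: torch.Tensor, ratio: float) -> torch.Tensor:
    """tokens: (B, 1 + N, D) encoder outputs; last_attn: (B, heads, 1 + N, 1 + N)
    self-attention map of the last block; ratio: fraction of local tokens kept."""
    cls_to_local = last_attn.mean(dim=1)[:, 0, 1:]           # (B, N): class-token row of the attention map
    k = max(1, int(ratio * cls_to_local.size(-1)))           # number of local tokens to keep
    idx = cls_to_local.topk(k, dim=-1).indices               # indices of the top-k scoring local tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, k, D) gather indices
    return torch.gather(tokens[:, 1:], dim=1, index=idx)     # selected local tokens, ordered by score

# Toy usage with random tensors (with the CLIP vision encoder, pass
# output_attentions=True and use out.last_hidden_state and out.attentions[-1]):
toks = torch.randn(2, 1 + 196, 768)
attn = torch.softmax(torch.randn(2, 12, 197, 197), dim=-1)
selected = select_informative_tokens(toks, attn, ratio=0.5)   # (2, 98, 768)
```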
Global-Local Decoder. We design a global-local decoder (GLD) to highlight the discriminative local information in images and texts and improve the discriminability of the global features. As shown in Figure 3 (b), the GLD, consisting of stacked blocks, generates a set of multi-grained discriminative global features with the above-selected token sequence as input. The selected token sequence only contains informative tokens, discarding redundant ones. To fully mine discriminative fine-grained information, we split the selected token sequence into two sub-sequences corresponding to different discriminant granularities: a high-level discriminant sequence containing the most informative tokens, and a middle-level discriminant sequence containing the remaining informative tokens. The two sub-sequences are fed into the GLD to highlight different-grained local information. Specifically, taking the image modality as an example, the selected token sequence is divided into a high-level sequence and a middle-level sequence, each prepended with its own learnable class token, and then fed into the GLD. In one GLD block, the sub-sequence is first sent into the multi-head self-attention layer, with Layer Normalization applied beforehand, to propagate the information of these informative tokens into the class token. After that, the self-attention output and the full local token sequence are fed into the multi-head cross-attention layer, which can highlight not only the informative tokens themselves but also other associated contextual information. Finally, the output of the cross-attention layer is further fed into the multi-layer perceptron layer to generate multiple discriminative global features at different granularities. The class tokens output by the last block of the high-level and middle-level branches are used as the high-level global image feature and the middle-level global image feature, respectively. Besides, the image-level feature from the image encoder, which treats all local tokens in the same way, is regarded as a low-level global feature. The above features constitute a multi-grained global image feature set.
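To illustrate the structure described above, the following is a schematic sketch of one possible GLD block under stated assumptions (pre-norm residual connections, standard multi-head attention, an MLP expansion ratio of 4); it is a reconstruction, not the authors' exact implementation:

```python
# Schematic GLD block (assumed pre-norm/residual layout): self-attention over a
# class token plus selected informative tokens, cross-attention from that
# sub-sequence to the full local token sequence, then an MLP.
import torch
import torch.nn as nn

class GLDBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, sub_seq: torch.Tensor, local_tokens: torch.Tensor) -> torch.Tensor:
        # sub_seq: (B, 1 + k, D) = class token + selected informative tokens
        # local_tokens: (B, N, D) = full local token sequence from the encoder
        x = self.norm1(sub_seq)
        sub_seq = sub_seq + self.self_attn(x, x, x, need_weights=False)[0]                        # propagate info into the class token
        q = self.norm2(sub_seq)
        sub_seq = sub_seq + self.cross_attn(q, local_tokens, local_tokens, need_weights=False)[0]  # attend to contextual local tokens
        return sub_seq + self.mlp(self.norm3(sub_seq))                                             # refined sub-sequence; [:, 0] is the global feature

block = GLDBlock()
high = block(torch.randn(2, 1 + 6, 768), torch.randn(2, 196, 768))  # high-level branch example
```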