Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. Besides, due to the substantial gap between modalities, existing methods embed the original modal features into the same latent space for cross-modal alignment. However, feature embedding may lead to intra-modal information distortion. Recently, Contrastive Language-Image Pre-training (CLIP) has attracted extensive attention from researchers due to its powerful semantic concept learning capacity and rich multi-modal knowledge, which can help us solve the above problems. Accordingly, in this paper, we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we perform fine-grained information excavation to mine intra-modal discriminative clues and inter-modal correspondences. Specifically, we first design a multi-grained global feature learning (MGF) module to fully mine the discriminative local information within each modality, which can emphasize identity-related discriminative clues by enhancing the interactions between the global image (text) and informative local patches (words). MGF generates a set of multi-grained global features for later inference. Secondly, cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules are proposed to establish cross-grained and fine-grained interactions (image-word, sentence-patch, word-patch) between modalities, which can filter out unimportant and non-modality-shared image patches/words and mine cross-modal correspondences from coarse to fine. CFR and FCD are removed during inference to save computational costs. Note that the above process is performed in the original modality space without further feature embedding. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method on TIReID.
Index Terms-Text-Image Person Re-identification, Multi-modal Correspondence Information, Intra-modal Information Distortion, Fine-grained Information Excavation.
I. INTRODUCTION
Person Re-identification (ReID) is a popular and challenging task in computer vision. In the past decade, ReID has made remarkable progress [1]-[4] and has been successfully applied in some practical scenarios. Most existing ReID approaches assume that pedestrian images can be captured across disjoint cameras, and tend to ignore the situation that pedestrian images cannot be obtained in some complex or special scenes, such as remote roads without cameras or places where pedestrians are completely occluded. Although pedestrian images are not available, we can find witnesses at the scene and search for the target pedestrian through a witness's language description, that is, text-image person re-identification (TIReID) [5]. Due to its great practical value, TIReID has attracted increasing attention from both academia and industry.

S. Yan, N. Dong, and J. Tang are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: shuanglinyan@njust.edu.cn; neng.dong@njust.edu.cn; jinhuitang@njust.edu.cn).

L. Zhang is with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China (e-mail: zhangliyan@nuaa.edu.cn).

Corresponding author: Liyan Zhang.

Fig. 1. The motivation for our proposed method. (a) CLIP learns visual representations with natural language supervision using web-scale image-text data, and the learned visual representations contain rich semantic information and cross-modal correspondence information. (b) We explore leveraging the powerful knowledge of CLIP for TIReID. CLIP is trained to focus only on instance-level representations, while TIReID requires fine-grained discriminative clues. To take full advantage of the rich prior knowledge from CLIP, our CFine introduces three innovative modules (MGF, CFR, and FCD) to mine intra-modal discriminative clues and inter-modal fine-grained correspondences.
As a fine-grained cross-modal retrieval task, the key of TIReID is to mine the fine-grained information of images and texts and establish their correspondences. In recent years, many effective methods [6]-[11] have been proposed, all of which follow the same structural design of "image/text backbone + feature embedding", where the image/text backbone first extracts image/text features, and then a method-specific feature embedding module embeds the extracted image and text features into a joint space for cross-modal alignment. The image/text backbone generally utilizes external knowledge to facilitate learning, and is mainly initialized by single-modality pre-training (e.g., ResNet [12] and ViT [13] pre-trained on ImageNet, or the pre-trained language model BERT [14]), which lacks multi-modal correspondence information. In addition, existing methods generally believe that feature embedding is the core module of the TIReID task, which is crucial for narrowing the semantic gap between modalities. However, several works [15], [16] have shown that projecting different modalities into a joint space may lead to intra-modal information distortion due to the distinct data properties of images and texts. So we need to reconsider whether feature embedding is necessary for the TIReID task.
Recently, the remarkable success of visual-language pre-training (VLP) has shown its ability to learn semantically rich and high-quality visual concepts with natural language supervision, and the most representative work is Contrastive Language-Image Pre-training (CLIP) [17]. Compared with single-modal pre-training, CLIP contains abundant multi-modal knowledge. In light of the power of CLIP, some recent works have attempted to exploit its ample knowledge for various tasks and achieved impressive results, such as video-text retrieval [18], referring image segmentation [19], dense prediction [20], and video understanding [21]. Besides, the semantic-level visual concept representation capacity of CLIP makes it possible to align image and text in the original modality space. Inspired by this, we explore how to transfer the CLIP model to TIReID in this paper. We found that fine-tuning CLIP directly on TIReID is effective. However, CLIP is trained to pay attention only to instance-level representations (image-level, sentence-level), while TIReID requires the model to focus on fine-grained information and inter-modal correspondences to distinguish the subtle differences between pedestrians (see Figure 1). Thus, the direct usage of CLIP can be sub-optimal for TIReID due to the task gap.
To fully exploit the powerful knowledge of the CLIP model, we propose a novel CLIP-driven Fine-grained information excavation framework, namely CFine, for TIReID. As shown in Figure 1, CFine mainly includes two parts: modality-specific feature extraction and fine-grained information excavation. To be specific, CFine first adopts modality-specific encoders to extract image and text representations, and then the fine-grained information excavation part is exploited to mine intra-modal discriminative details and inter-modal fine-grained correspondences for better transferring the knowledge of CLIP to TIReID. The fine-grained information excavation includes three main components. First, we propose a multi-grained global feature learning (MGF) module to fully mine the identity-related subtle clues within each modality. In this module, we first design a token selection process to pick out a set of informative tokens (discriminative patches/words) based on the self-attention scores between the class token and local tokens for each modality. Then, the informative token set is split into multiple subsets and fed into a global-local decoder (GLD) to generate a set of multi-grained global features by enhancing the interactions between the global image (text) and local discriminative patches (words). Second, we design a cross-grained feature refinement (CFR) module to filter out non-modality-shared information in the selected tokens by computing cross-grained similarities (image-word, sentence-patch) between modalities, and to establish rough cross-modal correspondences. Third, to establish inter-modal fine-grained correspondences, we propose a fine-grained correspondence discovery (FCD) module to discover the relationships between words and image patches. Note that the entire learning process of CFine is performed in the original feature space without further feature embedding, and is optimized in an end-to-end manner. During inference, CFR and FCD are removed, and the multi-grained global image and text features generated by MGF are used for cross-modal retrieval. Our main contributions are summarized as follows:
We propose a CLIP-driven fine-grained information excavation framework to transfer the knowledge of CLIP to TIReID, achieving fine-grained text-image alignment without further feature embedding. To the best of our knowledge, we are the first to leverage the ample cross-modal knowledge from VLP to facilitate learning for TIReID.
We take full advantage of rich multi-modal knowledge from CLIP via three innovative modules, i.e., multi-grained global feature learning, cross-grained feature refinement, and fine-grained correspondence discovery.
We conduct extensive experiments on three benchmarks to validate the effectiveness of our CFine. CFine performs significantly better than previous methods, reaching 69.57% Rank-1 on CUHK-PEDES, 60.83% Rank-1 on ICFG-PEDES, and 50.55% Rank-1 on RSTPReid, outperforming the previous SOTA methods on all three benchmarks.
The remainder of the paper is organized as follows. We first review the related works in Section II; Section III describes the proposed CFine in detail; Section IV reports extensive experimental results and analysis; and finally the paper is summarized in Section V.
II. RELATED WORK
A. Text-Image Person Re-identification
TIReID is a class of multi-modal tasks [22], [23], which was first proposed by [5]. Compared with general cross-modal retrieval tasks, TIReID is more challenging due to its fine-grained property. The key to TIReID is cross-modal alignment, and existing methods can be broadly classified into two classes according to the alignment strategy: cross-modal interaction-based and cross-modal interaction-free methods. Cross-modal interaction-based methods [5], [8], [9], [24]-[28] focus on mining local correspondences (e.g., patch-word, patch-phrase) between images and texts by using attention mechanisms to predict the matching score for image-text pairs. Gao et al. [8] designed a contextual non-local attention, which adaptively aligns image and text features across all scales in a coarse-to-fine way according to their semantics. The cross-modal interaction mechanism is a double-edged sword: its advantage lies in better aligning image-text pairs and reducing the modality gap through sufficient cross-modal interaction, so such methods can achieve superior performance. Its disadvantage, however, is the high computational cost, which greatly reduces the practicability of such methods.
For cross-modal interaction-free methods, some early works [6], [7], [29]-[34] mainly focused on designing networks and optimization losses to learn aligned image and text embeddings in a joint latent space. Zhang et al. [30] designed a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss to learn discriminative image-text embeddings, which are among the most commonly used cross-modal matching losses for TIReID. These early methods are efficient, but their performance is not satisfactory. In recent years, some effective and lightweight models [10], [11], [35] have been proposed, which can achieve better performance than the first type of methods without complex cross-modal interaction. With the success of Transformer in various vision and language tasks, several Transformer-based methods [36], [37] have been proposed recently and achieved state-of-the-art performance. All in all, TIReID has achieved remarkable progress over the past few years.
However, existing methods initialize the network with single-modality pre-trained model parameters, ignoring multi-modal correspondence information. Besides, the image and text embeddings extracted from the above-initialized networks over-mine the information within a single modality, which increases the difficulty of cross-modal alignment and network optimization. Recently, visual-language pre-training (VLP) has attracted growing attention, especially CLIP [17], whose advantages can effectively alleviate the above problems. Thus, in this paper, we explore leveraging the powerful knowledge of the CLIP model for TIReID.
B. Vision-Language Pre-Training
The paradigm of "pre-training and fine-tuning" is one of the most important paradigms driving the development of the computer vision community. Its basic process is that the model is initialized with parameters pre-trained on large-scale datasets and then fine-tuned on various downstream tasks. In this paradigm, the quality of the pre-training model plays a vital role in the optimization difficulty and performance of the model during fine-tuning. In the past decade, pre-training models in the single-modality domain [12]-[14] have achieved great success. Recently, many works [17], [38]-[42] have attempted to extend pre-training models to the multi-modal field, that is, visual-language pre-training (VLP), and made remarkable progress. Mainstream VLP models can be divided into two categories according to their pre-training tasks: (1) image-text contrastive learning tasks, which align images and texts in a shared space through a cross-modal contrastive learning loss, e.g., CLIP [17], ALIGN [38], FILIP [39]; (2) language-modeling-based tasks, where auxiliary tasks (masked language/region modeling, image captioning, text-grounded image generation) are used to establish the correspondences between images and texts, e.g., VisualBERT [41], UNITER [42]. VLP models can bring greater performance gains for cross-modal tasks and fine-grained visual tasks, which is confirmed by a large number of follow-up works. We also expect to leverage the ample multi-modal knowledge of VLP to further advance the TIReID task.
C. CLIP-Based Fine-tuning
As one of the most representative VLP models, Contrastive Language-Image Pre-training (CLIP) [17] has attracted much attention. Different from traditional image-based supervised pre-training models, CLIP employs natural language to supervise the learning of visual features via contrastive learning on web-scale image-text data. Benefiting from semantic-level language supervision, the visual network can learn high-quality visual features with rich semantic information, which has an impressive positive impact on cross-modal tasks and fine-grained visual tasks. Recently, a lot of follow-up works [18]-[21], [43]-[48] have been put forward to fine-tune CLIP for various downstream tasks. The most common is to adapt CLIP to cross-modal tasks, such as video-text retrieval [18], [43]-[45], video captioning [46], and referring image segmentation [19]. Besides, some efforts have recognized the semantic-level, high-quality visual concept representation capacity of CLIP and applied it to fine-grained visual tasks, including dense prediction [20], point cloud understanding [48], and video recognition [21], achieving impressive results. As both a cross-modal retrieval task and a fine-grained recognition task [49], [50], TIReID can also benefit from CLIP. Thus, in this paper, we try to explore an effective framework to fully transfer the CLIP model to the TIReID task.
III. METHODS
A. Motivation and Overview of CFine
Existing methods in TIReID utilize external knowledge from single-modality pre-training to facilitate learning, which lacks multi-modal correspondence information. However, it is unaffordable to directly learn a visual-language pre-training model for TIReID from scratch, as doing so requires inaccessible large-scale image-text data and expensive training resources. Recently, visual-language pre-training (VLP) has made significant progress and shown rich cross-modal correspondence information and powerful visual representation capacity. To leverage the powerful capacity of VLP, several efforts have attempted to transfer the prior knowledge of VLP to various downstream tasks, achieving impressive results. Inspired by these works, our method builds upon the currently most popular VLP model, namely CLIP [17], and extends it with fine-grained information excavation to better adapt to the TIReID task, instead of learning a new pre-training model from scratch. An overview of the proposed CFine is illustrated in Figure 2.
Fig. 2. Overview of the proposed CFine. Given image-text pairs, we first extract global image/text features and local patch/word features with the image/text encoder. After that, token selection is performed to select informative patch/word features from the local patch/word features. The global image/text features, local patch/word features, and selected informative patch/word features are sent to the multi-grained global feature learning module, which helps reinforce fine-grained clues and generates a set of multi-grained global image/text features. The cross-grained feature refinement and fine-grained correspondence discovery modules take the global image/text features and the selected informative patch/word features as input to filter out non-modality-shared patches/words and mine cross-modal correspondences from coarse to fine; they generate two similarity matrices for a batch of image-text pairs and are removed during inference. Finally, the multi-grained global features and the similarity matrices are supervised by the CMPM+CMPC loss and the triplet ranking loss, respectively, for cross-modal alignment.

Given a set of pedestrian images and text descriptions, we first feed them into the dual encoders to extract the image and text features. Second, to better adapt to TIReID, fine-grained information excavation is performed to mine intra-modal fine-grained information and inter-modal fine-grained correspondences. For fine-grained information excavation, three modules are proposed: (1) multi-grained global feature learning (MGF) mines discriminative local clues according to informative tokens at different levels and generates a set of multi-grained features; (2) cross-grained feature refinement (CFR) is proposed to filter out unnecessary information in images/texts and ensure the confidence of the informative tokens; (3) fine-grained correspondence discovery (FCD) is proposed to establish local fine-grained correspondences between patches and words. Finally, the similarity between the learned image and text representations is computed by the cosine similarity function, whose goal is to maximize the similarity if the image and text are matched and minimize it otherwise.
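To make this matching objective concrete, the following is a minimal sketch (assumed tensor shapes and names, not the released implementation) of the batch-wise cosine similarity computation between final image and text features:

```python
# A minimal sketch (assumed shapes, not the authors' code): cosine similarity
# between a batch of image features and a batch of text features.
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
    """img_feats: (B, D), txt_feats: (B, D) -> similarity matrix of shape (B, B)."""
    img_feats = F.normalize(img_feats, dim=-1)   # unit-length image features
    txt_feats = F.normalize(txt_feats, dim=-1)   # unit-length text features
    return img_feats @ txt_feats.t()             # entry (i, j): sim(image_i, text_j)

# Training pushes the diagonal (matched pairs) up and the off-diagonal entries down.
sim = cosine_similarity_matrix(torch.randn(8, 768), torch.randn(8, 768))
```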
B. Dual Encoders
The structure of CLIP is shown in Figure 1; it includes an image encoder and a text encoder, both of which are composed of a feature extractor and a projector. The image and text feature extractors extract features through a ViT with a width of 768 and a Transformer with a width of 512, respectively, while the projectors map image and text features into a 512-dimensional latent space, in which image and text are aligned through a contrastive objective. The most direct way is to fine-tune CLIP on the TIReID dataset. However, several works [15], [16] have shown that the projectors may lead to intra-modal information distortion. This is unacceptable for TIReID, which relies on fine-grained information, especially for images. Yet if the projector is removed, the dimensions of CLIP's two encoders cannot be unified. Thus, in this paper, we only use the image encoder of CLIP with the projector removed as our image encoder. For text, we use another pre-trained language model, BERT [14], as the text encoder. In addition, in order to make a fair comparison with existing methods, we also use ViT pre-trained on ImageNet [51] as the image encoder.
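As a concrete illustration, one possible way to instantiate such dual encoders with the Hugging Face Transformers library is sketched below; the specific checkpoints ("openai/clip-vit-base-patch16", "bert-base-uncased") and the use of CLIPVisionModel (which omits CLIP's projection layer) are our assumptions rather than details taken from the paper:

```python
# A minimal sketch (assumed checkpoints, not the authors' release): take CLIP's
# vision transformer without its projection head so image features stay in the
# original 768-d space, and use pre-trained BERT for 768-d text features.
from transformers import BertModel, BertTokenizer, CLIPImageProcessor, CLIPVisionModel

# Image encoder: CLIPVisionModel exposes only the ViT trunk (no projection layer).
image_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Text encoder: pre-trained BERT with its lower-cased 30522-token vocabulary.
text_encoder = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
```

This pairing keeps both encoders at a 768-dimensional width, which is consistent with the motivation above for replacing CLIP's 512-dimensional text branch with BERT.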
Image Representation. Given an image, a visual tokenization process is first performed to convert the image into a discrete sequence of patch tokens. A learnable class token is prepended to the sequence as an image-level representation. Finally, the resulting token sequence is fed into the transformer of ViT. The output of the image encoder consists of an image-level global feature (from the class token) and patch-level local features (from the patch tokens).
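The forward pass below (again a hedged illustration with a placeholder input image, reusing the assumed encoders from the previous sketch) separates the image-level global feature from the patch-level local features:

```python
# A minimal sketch: forward an image batch through the CLIP vision trunk and
# split the output into the class-token global feature and patch-level features.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

image_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")      # as above
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")  # as above

pixel_values = image_processor(images=Image.new("RGB", (224, 224)), return_tensors="pt").pixel_values

with torch.no_grad():
    out = image_encoder(pixel_values=pixel_values)

tokens = out.last_hidden_state   # (B, 1 + N, 768): class token followed by N patch tokens
global_feat = tokens[:, 0]       # image-level global feature
local_feats = tokens[:, 1:]      # patch-level local features
```

The text side is symmetric: the output at the prepended special token serves as the sentence-level global feature, and the remaining token outputs serve as the word-level local features.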
Text Representation. For a text description, we directly use the pre-trained BERT [14] as the text encoder to generate the text representation. Specifically, lower-cased byte pair encoding (BPE) with a 30522 vocabulary size is first used to tokenize the text. Then, a special start token is prepended to the textual token sequence. Finally, the resulting token sequence is fed into the text encoder to generate the sentence-level global feature and the word-level local features.

Fig. 3. (a) Illustration of the token selection process and (b) structure of the multi-grained global feature learning (MGF) module, whose output is the multi-grained global feature set.
Due to the substantial gap between the upstream pre-training task and the downstream TIReID task, these features only involve instance-level modality information and cross-modal correspondences, lacking the fine-grained information that is critical to TIReID. Accordingly, in the following, we conduct fine-grained information excavation (MGF, CFR, and FCD) to fully mine intra-modal discriminative local clues and inter-modal fine-grained correspondences.
C. Multi-Grained Global Feature Learning
Due to the subtle visual variations among different pedestrians in ReID, it is crucial to fully mine the fine-grained information of images/texts to distinguish different pedestrians. Most existing TIReID methods mine fine-grained information by learning a set of local features. Unlike them, in this paper, we mine local information at different levels to learn global features at multiple granularities. Benefiting from the global dependency modeling capability of self-attention, Transformer achieves impressive results on various tasks. However, self-attention treats each local token in the same way when calculating the attention weights and then computes a weighted sum of all local tokens to generate a global feature. The global feature is dominated by all local tokens, and this way of considering all local tokens simultaneously reduces the influence of some important local tokens. Especially for fine-grained recognition tasks, this brings serious discriminative information loss. To solve this problem, instead of using all tokens to learn global features, we only select informative tokens to form multiple token sequences, and then send them to the global-local decoder to learn a set of multi-grained global features.
Token Selection. The class token, used as the output of the Transformer for classification or recognition, is obtained by a weighted aggregation of all local tokens. The weights reflect the correlation between the class token and each local token: the larger the weight, the greater the contribution of that local token to the class token, and the more important it is to the task. Therefore, we select informative tokens based on the correlation between local tokens and the class token [52]-[54]. Specifically, the self-attention of each Transformer block generates an attention map over the input tokens (the first of which is the class token), which reflects the correlations among them. The first row of the attention map represents the dependency between the class token and the local tokens. In this paper, we take the attention map generated by the self-attention of the last Transformer block, whose first row gives the correlation scores between the class token and the local tokens. We select the local tokens output by the Transformer that correspond to the highest correlation scores to construct a new discriminative local token sequence. The token selection process is shown in Figure 3 (a).
We perform the token selection process separately for images and texts: informative patch tokens are chosen from the patch-level local features, and informative word tokens are chosen from the word-level local features. The selected image and text token sequences are ordered by the amount of information contained in each token, and their lengths are determined by the selection ratios of image and text tokens, respectively.
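The selection step can be sketched as follows; averaging the attention heads and the exact gathering logic are our assumptions about details the text leaves open:

```python
# A minimal sketch of attention-based token selection (assumed detail: attention
# heads are averaged): keep the local tokens that the class token attends to
# most strongly in the last Transformer block.
import torch

def select_informative_tokens(tokens: torch.Tensor, last_attn: torch.Tensor, ratio: float) -> torch.Tensor:
    """tokens: (B, 1 + N, D) encoder outputs; last_attn: (B, heads, 1 + N, 1 + N)
    self-attention map of the last block; ratio: fraction of local tokens kept."""
    cls_to_local = last_attn.mean(dim=1)[:, 0, 1:]           # (B, N): class-token row of the attention map
    k = max(1, int(ratio * cls_to_local.size(-1)))           # number of local tokens to keep
    idx = cls_to_local.topk(k, dim=-1).indices               # indices of the top-k scoring local tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, k, D) gather indices
    return torch.gather(tokens[:, 1:], dim=1, index=idx)     # selected local tokens, ordered by score

# Toy usage with random tensors (with the CLIP vision encoder, pass
# output_attentions=True and use out.last_hidden_state and out.attentions[-1]):
toks = torch.randn(2, 1 + 196, 768)
attn = torch.softmax(torch.randn(2, 12, 197, 197), dim=-1)
selected = select_informative_tokens(toks, attn, ratio=0.5)   # (2, 98, 768)
```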
Global-Local Decoder. We design a global-local decoder (GLD) to highlight the discriminative local information in images and texts and improve the discriminability of the global features. As shown in Figure 3 (b), the GLD, consisting of stacked blocks, generates a set of multi-grained discriminative global features with the above-selected token sequence as input. The selected token sequence only contains informative tokens, discarding redundant ones. To fully mine discriminative fine-grained information, we split the selected token sequence into two sub-sequences corresponding to different discriminant granularities: a high-level discriminant sequence containing the most informative tokens, and a middle-level discriminant sequence containing the remaining informative tokens. The two sub-sequences are fed into the GLD to highlight different-grained local information. Specifically, taking the image modality as an example, the selected token sequence is divided into a high-level sequence and a middle-level sequence, each prepended with its own learnable class token, and then fed into the GLD. In one GLD block, the sub-sequence is first sent into the multi-head self-attention layer, with Layer Normalization applied beforehand, to propagate the information of these informative tokens into the class token. After that, the self-attention output and the full local token sequence are fed into the multi-head cross-attention layer, which can highlight not only the informative tokens themselves but also other associated contextual information. Finally, the output of the cross-attention layer is further fed into the multi-layer perceptron layer to generate multiple discriminative global features at different granularities. The class tokens output by the last block of the high-level and middle-level branches are used as the high-level global image feature and the middle-level global image feature, respectively. Besides, the image-level feature from the image encoder, which treats all local tokens in the same way, is regarded as a low-level global feature. The above features constitute a multi-grained global image feature set.
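To illustrate the structure described above, the following is a schematic sketch of one possible GLD block under stated assumptions (pre-norm residual connections, standard multi-head attention, an MLP expansion ratio of 4); it is a reconstruction, not the authors' exact implementation:

```python
# Schematic GLD block (assumed pre-norm/residual layout): self-attention over a
# class token plus selected informative tokens, cross-attention from that
# sub-sequence to the full local token sequence, then an MLP.
import torch
import torch.nn as nn

class GLDBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, sub_seq: torch.Tensor, local_tokens: torch.Tensor) -> torch.Tensor:
        # sub_seq: (B, 1 + k, D) = class token + selected informative tokens
        # local_tokens: (B, N, D) = full local token sequence from the encoder
        x = self.norm1(sub_seq)
        sub_seq = sub_seq + self.self_attn(x, x, x, need_weights=False)[0]                        # propagate info into the class token
        q = self.norm2(sub_seq)
        sub_seq = sub_seq + self.cross_attn(q, local_tokens, local_tokens, need_weights=False)[0]  # attend to contextual local tokens
        return sub_seq + self.mlp(self.norm3(sub_seq))                                             # refined sub-sequence; [:, 0] is the global feature

block = GLDBlock()
high = block(torch.randn(2, 1 + 6, 768), torch.randn(2, 196, 768))  # high-level branch example
```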