• A multi-view dataset is proposed to facilitate view-based person ReID research.
• A grouping mechanism and cross-modal fusion enable efficient person ReID.
• It achieves competitive performance for occluded and holistic person ReID.
Abstract
Re-identification (ReID) of occluded persons is a challenging task due to the loss of information in scenes with occlusions. Most existing methods for occluded ReID use 2D-based network structures to directly extract representations from 2D RGB (red, green, and blue) images, which can result in reduced performance in occluded scenes. However, since a person is a 3D non-grid object, learning semantic representations in a 2D space can limit the ability to accurately profile an occluded person. Therefore, it is crucial to explore alternative approaches that can effectively handle occlusions and leverage the full 3D nature of a person. To tackle these challenges, in this study, we employ a 3D view-based approach that fully utilizes the geometric information of 3D objects while leveraging advancements in 2D-based networks for feature extraction. Our study is the first to introduce a 3D view-based method in the areas of holistic and occluded ReID. To implement this approach, we propose a random rendering strategy that converts 2D RGB images into 3D multi-view images. We then use a 3D Multi-View Transformation Network for ReID (MV-ReID) to group and aggregate these images into a unified feature space. Compared to 2D RGB images, multi-view images can reconstruct occluded portions of a person in 3D space, enabling a more comprehensive understanding of occluded individuals. The experiments on benchmark datasets demonstrate that the proposed method achieves state-of-the-art results on occluded ReID tasks and exhibits competitive performance on holistic ReID tasks. These results also suggest that our approach has the potential to solve occlusion problems and contribute to the field of ReID. The source code and dataset are available at https://github.com/yuzaiyang123/MV-Reid.
1. Introduction

Person re-identification (ReID) is a computer vision technique that involves searching for the same individual from a collection of images captured by different cameras [50]. ReID has been successfully applied to numerous computer vision tasks [48]. However, in real-world scenarios, individuals may be obstructed by various objects, such as cars, trees, and crowds, which makes identifying them a challenging task. When an individual is occluded, the image representation extracted from the occluded image may contain distracting information [24], resulting in a loss of crucial information for ReID [49]. Therefore, occluded ReID poses a significant challenge for computer vision practitioners.
In recent years, 2D-based ReID methods have made remarkable progress in holistic (non-occluded) ReID tasks, owing to the development of computing power and the increased scale of ReID datasets [4,16,19,33,38,41,45]. These methods have even surpassed human-level performance [47]. However, their efficacy is limited in cases where individuals are occluded, as evidenced by the performance of state-of-the-art approaches on the Occluded-Duke dataset [24], which is below 63% in terms of mean average precision (mAP) and is deemed unacceptable for deployment. The primary reasons for the poor performance of 2D-based methods in occluded scenarios are two-fold: (1) a person exists in a 3D non-grid space [50], and RGB images lack clear contour boundaries and geometry information for occluded individuals, making it challenging to depict them accurately in 2D space; (2) 2D networks focus solely on pixel variations in RGB images while ignoring crucial 3D information such as depth or relative 3D position [2]. The 3D space provides valuable 3D geometry and structural information about occluded objects [28], which makes the 3D-based strategy a potential solution to address occlusion challenges.
To perform 3D person ReID, person RGB images need to be converted into 3D models using single-view 3D human reconstruction techniques [22,26,39]. These methods use prior knowledge of human geometry structure to reconstruct reasonable geometry and structure results [50]. Zheng et al. [50] exploited complementary information of 3D appearance and structure for person identification, exhibiting good scalability for person ReID tasks. Meanwhile, Chen et al. [2] proposed an end-to-end architecture that integrates person ReID and 3D human reconstruction methods to construct a texture-insensitive 3D person embedding. Previous studies show the efficacy of integrating 3D technology in ReID research, confirming that 3D human models can provide sufficient information to address ReID problems.
However, previous 3D person ReID methods are limited to point-based approaches [37], where networks are directly defined on a point cloud or mesh. These point-based methods have several limitations: (1) They do not include surface information, as point clouds are sets of data points in space [6], leading to the surface structure being ignored. (2) The scale of point cloud datasets is far smaller than that of 2D image datasets, limiting the generalization of pre-trained point-based networks due to a lack of sufficient data for pre-training. Therefore, while 3D space has the advantage of depicting occluded geometry and structure, the performance of point-based methods is inferior to that of 2D-based methods.
The 3D-based method for semantic representation learning is not limited to 3D point-based learning. Another approach, called 3D view-based learning, can also be used, as shown in a recent study [37]. In this approach, RGB images of a person are converted into 3D human models, and multiple views are rendered for each model from different viewpoints. This allows the 3D geometry and structure information to be mapped into 2D space, and the occluded portions of a person to be captured from different angles. In contrast to RGB information, multi-view information encompasses details about the pedestrian's body shape and obscured geometric structure, as depicted in Fig. 1. These types of information could potentially enhance the performance in occluded ReID tasks.
The view-based strategy has several advantages for occluded person ReID. First, it bridges the gap between 2D and 3D learning by using a 2D network structure to solve complex 3D tasks [10]. As a result, the view-based method can leverage pre-training on large-scale 2D datasets such as ImageNet and COCO, and then fine-tune the network for the specific task of occluded person ReID. Second, the occlusion problem in surveillance cameras arises because they can only capture a partial view of a person in RGB images. The 3D view-based method can overcome this challenge, since multi-view images rendered from different viewpoints can complement one another's detailed features, thereby providing a complete interpretation of the occluded object [28].
The view-based strategy has shown promise in addressing occlusion challenges [28]. This is attributed to the ability of multi-view images to recover occluded portions of objects from 2D RGB images [31]. It can be noted that occluded person ReID falls within the realm of occlusion challenges. Therefore, integrating the view-based strategy into occluded person ReID has the potential to bolster the identification of occluded pedestrians, thereby improving recognition accuracy. However, the application of view-based methods in the occluded person ReID task remains largely unexplored. The 3D Multi-View Transformation Network for ReID (MV-ReID) is the first work to introduce a 3D view-based method in the ReID area, addressing the challenges posed by domain gaps and the aggregation of multi-view features into a unified space. To improve the retrieval accuracy of holistic and occluded ReID, we propose the MV-ReID, as illustrated in Fig. 2, which utilizes a random rendering strategy to convert 2D RGB images to 3D multi-view images. To strengthen the connection of internal features between multi-view images, a grouping mechanism is employed. Finally, an aggregation method is utilized to combine the cross-modal features into a unified space. By leveraging these techniques, MV-ReID aims to enhance the accuracy of occluded and holistic ReID.
Our main contributions are summarized as follows:
1) We introduce a random rendering strategy to generate large-scale multi-view datasets, including MV-Market-1501, MV-DukeMTMC, MV-Occluded-Duke, and MV-Occluded-ReID (derived from Market-1501, DukeMTMC, Occluded-Duke, and Occluded-ReID). These datasets facilitate view-based research in occluded and holistic ReID.

2) To enable view-based strategies in occluded and holistic ReID, we propose a grouping mechanism to gather discriminative information from multi-view images. Moreover, we propose an aggregation strategy to bridge the gap between 2D and 3D spaces, enabling MV-ReID to utilize both 2D texture and 3D geometry information about occluded individuals.

3) We quantitatively evaluate MV-ReID on two fundamental ReID tasks: occluded person ReID and holistic person ReID. The experimental results indicate that MV-ReID outperforms the state-of-the-art (SOTA) methods for occluded person ReID while maintaining competitive performance for holistic person ReID. These results demonstrate the potential of MV-ReID for handling occlusion challenges.
The rest of this paper is structured as follows. In Section 2, we provide a review of related work on person ReID and 3D multi-view learning. Section 3 outlines our approach for generating the multi-view dataset. Section 4 introduces the proposed MV-ReID framework. In Section 5, we analyze and discuss the results of our experiments. In Section 6, we conclude the paper and suggest avenues for future research.
2. Related work

2.1. Occluded person ReID
Recognizing people in images where they are occluded by objects with undetermined placements, sizes, forms, and colors, or even by incomplete objects, is a difficult task known as occluded person ReID [25]. This task was first introduced by Zhou et al. [49] and has since received further attention from the research community. For example, He et al. [14] proposed pose-guided feature alignment (PGFA), which leverages pose landmarks to extract relevant information from occlusion noise. Li et al. [20] developed a part-aware transformer model to identify learnable part prototypes for occluded person ReID. Jia et al. [17] approached occluded person ReID as a set-matching problem and introduced the matching-on-sets (MoS) technique to assess the similarity of occluded persons without spatial alignment.
Previous research on occluded person ReID has primarily focused on semantic representation learning in the 2D space [50]. However, the limitations of the 2D space in representing a 3D non-grid object like a person can result in inadequate coverage of geometric attributes and surface contours. To address this, the MV-ReID approach utilizes a view-based strategy to reconstruct the occluded person in 3D space, enabling the capture of essential geometry and structure information. In contrast to previous 2D-based approaches, MV-ReID provides a more comprehensive representation of the occluded person for ReID purposes.
2.2. 3D point-based person ReID
To perform 3D point-based person ReID, the RGB image is first converted into a 3D human model using person reconstruction techniques [1,22,26,39,46]. Next, a point-based network [27,36] is used to extract feature representations from the Skinned Multi-Person Linear (SMPL) model [22]. These reconstruction methods utilize prior knowledge of the human geometry structure and yield reasonable reconstruction results [50]. Some recent studies have explored the use of 3D point-based strategies for occluded and holistic ReID. For example, Cheng et al. [2] proposed an end-to-end architecture that combines person ReID with the 3D SMPL human model to compute a texture-insensitive 3D shape embedding. Meanwhile, Zheng et al. [50] developed a parameter-efficient model based on a graph neural network (GNN) that learns representations from the 3D SMPL human model. Although the SMPL human model [22] does not contain the texture of the human body, previous works demonstrate that SMPL can provide sufficient information for ReID purposes. Overall, these advancements demonstrate the potential of 3D point-based person ReID as a promising direction for improving person identification in challenging scenarios.
Detecting occluded objects in 2D RGB images can be difficult, whereas in 3D space, occlusion is more easily captured. However, traditional 3D point clouds have limitations such as redundancy in depth information and a lack of surface geometry details, leading to challenges in accurately identifying occluded objects [44]. In addition, point-based 3D methods can be computationally intensive. However, 3D person ReID methods do not necessarily rely solely on point-based learning [28]. A recent study proposes a 3D view-based strategy to address the occlusion problem, which offers a promising approach for improving identification accuracy while reducing computational complexity.
2.3. 3D multi-view learning
View-based methods involve generating multiple views of a given 3D model and using the resulting images to solve downstream tasks [10]. Su et al. [34] suggested that view-based methods align with how the human brain associates 2D appearances with prior knowledge of a 3D shape, which could contribute to their superior performance. Wei et al. [37] claimed that view-based methods are effective at revealing the essential structure and contour changes of 3D objects. Feng et al. [7] incorporated a grouping mechanism into multi-view learning, mimicking the way the human brain learns content relationships and discriminative information from multi-view images. These studies highlight the potential advantages of view-based methods and suggest that they may offer a promising approach to improve the effectiveness of 3D modeling and related tasks.
Recent research has demonstrated that 3D multi-view information has its own advantages in solving occlusion challenges: multi-view images rendered from different viewpoints can complement one another's detailed features, thereby providing a complete interpretation of the occluded object [28]. It can be noted that occluded person ReID falls within the realm of occlusion challenges. Therefore, integrating multi-view information into occluded person ReID has the potential to provide additional patterns and cues that can be leveraged to improve accuracy. Specifically, multi-view information can capture and utilize several potential patterns of an occluded person: (1) 3D body shapes and poses, which are valuable for distinguishing individuals, as they encode unique features like body proportions, posture, and the way a person carries themselves; (2) the restored geometric structures that are occluded, as multi-view images can easily capture the occluded portion of a person from various viewpoints; and (3) a comprehensive feature representation, as multi-view images rendered from different viewpoints complement each other, and by learning this complementary information, a more complete interpretation of pedestrians can be obtained. To make full use of multi-view information, we present MV-ReID, an initial attempt to implement the 3D view-based strategy for handling the challenges of occluded ReID.
3. The generation of multi-view dataset

3.1. Data generation
We propose a random rendering strategy to convert 2D RGB images into 3D multi-view images, as illustrated in Fig. 3. This strategy consists of two phases: (1) reconstructing 2D RGB images into 3D point clouds using single-view 3D reconstruction, and (2) generating multi-view images of 3D human models through multi-view rendering. We generate four multi-view datasets, namely MV-Market-1501, MV-DukeMTMC, MV-Occluded-Duke, and MV-Occluded-ReID, using this random rendering strategy. The quality of the rendered images plays a crucial role in the retrieval performance of 3D objects in multi-view learning [10]. We experimented with various reconstruction methods and rendering strategies to obtain representative multi-view images.
3.1.1. 3D human body reconstruction
To obtain informative 3D human models, we experimented with different 3D human reconstruction methods, including the statistical method [22] that only reconstructs the geometry and structure of the person and current methods [30,39] that also reconstruct skin and clothing textures. However, current methods have limitations when applied to ReID tasks for two reasons: (1) they require high-resolution RGB images as input, while the surveillance cameras used in ReID scenarios are low-resolution cameras; (2) the occluded parts of a person are invisible, and while prior knowledge of 3D body shape can reconstruct the geometry and structure of an occluded person [50], it is hard to restore the variation in clothing and skin texture. Therefore, we exploit a statistics-based reconstruction method, the Skinned Multi-Person Linear (SMPL) model [22]. The SMPL model is a widely used computer graphics model that can represent human body shapes and poses by capturing the 3D geometry of human bodies [39]. SMPL facilitates the transformation of RGB images into point clouds through a process that includes estimating the human pose, reconstructing the body shape, and sampling points. Although the SMPL model has limitations in representing texture information for pedestrians, many 3D point-based studies [50,2,11] have shown its effectiveness in providing adequate 3D information to tackle ReID challenges. Therefore, MV-ReID adopts the SMPL model as the 3D human body reconstruction method to convert RGB images into 3D human models.
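For concreteness, the sketch below shows one way the RGB-to-point-cloud step could look, assuming the SMPL shape and pose parameters have already been estimated from the RGB crop by an off-the-shelf single-view estimator (that estimation step is not shown). The `smplx` and `trimesh` calls, the model path, and the 4,096-point sample size are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch
import smplx      # SMPL body model wrapper
import trimesh

# Shape (betas) and pose parameters are assumed to come from a single-view
# pose/shape estimator; zeros are used here only as placeholders.
betas = torch.zeros(1, 10)
body_pose = torch.zeros(1, 69)          # 23 joints x 3 axis-angle values
global_orient = torch.zeros(1, 3)

smpl = smplx.create("models", model_type="smpl", gender="neutral")  # path is assumed
out = smpl(betas=betas, body_pose=body_pose, global_orient=global_orient)

verts = out.vertices.detach().cpu().numpy().squeeze()               # (6890, 3)
mesh = trimesh.Trimesh(vertices=verts, faces=smpl.faces, process=False)

# Sample a point cloud from the mesh surface (the point-cloud format of the dataset).
points, _ = trimesh.sample.sample_surface(mesh, count=4096)
```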
3.1.2. Multi-view image generation
To train and evaluate the MV-ReID, the 3D human model needs to be rendered into multi-view images. However, the choice of rendering viewpoints for generating multi-view images has not been extensively explored [10]. To address this issue, we propose a random rendering strategy to select viewpoints. Specifically, we set a dodecahedron as the rendering domain and use its 20 vertices as candidate rendering viewpoints. Then, we randomly select 12 vertices from the domain to generate multi-view images, as shown in Fig. 4. This approach enables us to derive a diverse set of viewpoints for better representation learning in MV-ReID.
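A minimal sketch of the random viewpoint selection is given below, assuming the 20 candidates are the vertices of a regular dodecahedron centred on the reconstructed body. The camera radius, the seed, and the function names are illustrative; the actual renderer (camera intrinsics, lighting, rasterizer) is not shown.

```python
import numpy as np

def dodecahedron_vertices():
    """Return the 20 vertices of a regular dodecahedron centred at the origin."""
    phi = (1 + 5 ** 0.5) / 2                 # golden ratio
    verts = []
    # the 8 cube vertices (+-1, +-1, +-1)
    for x in (-1.0, 1.0):
        for y in (-1.0, 1.0):
            for z in (-1.0, 1.0):
                verts.append((x, y, z))
    # the 12 vertices from cyclic permutations of (0, +-1/phi, +-phi)
    for s1 in (-1, 1):
        for s2 in (-1, 1):
            verts.append((0.0, s1 / phi, s2 * phi))
            verts.append((s1 / phi, s2 * phi, 0.0))
            verts.append((s1 * phi, 0.0, s2 / phi))
    return np.asarray(verts)

def sample_viewpoints(num_views=12, radius=2.0, seed=None):
    """Randomly pick `num_views` of the 20 candidate vertices as camera centres."""
    rng = np.random.default_rng(seed)
    verts = dodecahedron_vertices()
    idx = rng.choice(len(verts), size=num_views, replace=False)
    cams = verts[idx]
    # place each camera on a sphere of the given radius, looking at the origin
    return radius * cams / np.linalg.norm(cams, axis=1, keepdims=True)

cameras = sample_viewpoints(num_views=12, seed=0)   # 12 x 3 camera centres
```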
3.2. Properties of multi-view dataset
The multi-view datasets have several properties that are worth highlighting. First of all, in order to facilitate the development of more view-based algorithms in the field of person ReID research, we provide four large-scale multi-view datasets, including MV-Market-1501, MV-DukeMTMC, MV-Occluded-Duke, and MV-Occluded-ReID, rendered from Market-1501 [48], DukeMTMC [51], Occluded-Duke [24], and Occluded-ReID [52], respectively. These datasets have been carefully selected to provide a comprehensive range of scenarios and challenges, thereby enabling researchers to test and refine their algorithms under various conditions. These multi-view datasets are designed to promote robustness and versatility in future view-based person ReID methods. Secondly, the occluded parts of a person are often hard to capture in RGB images, while they can be easily observed in 3D multi-view images. This indicates that the multi-view datasets can provide more informative geometry information, particularly in occlusion scenarios. Lastly, we provide not only the multi-view format but also the point cloud format, which enables future studies to explore 3D person ReID using both view-based and point-based methodologies. The scale of each multi-view dataset is detailed in Table 1.
Table 1. The scale of the multi-view datasets.

Dataset             Split      #ID     #3D model   #MV image
MV-Market-1501      Training   751     12,936      155,232
                    Gallery    752     19,732      236,784
                    Query      750     3,368       40,416
MV-DukeMTMC         Training   702     16,522      198,264
                    Gallery    1,110   17,661      211,932
                    Query      702     2,228       26,736
MV-Occluded-Duke    Training   702     15,618      187,146
                    Gallery    1,110   17,661      211,932
                    Query      519     2,210       26,520
MV-Occluded-ReID    Training   -       -           -
                    Gallery    200     1,000       12,000
                    Query      200     1,000       12,000
4. Methods
The MV-ReID algorithm consists of two modules: (1) Grouping Mechanism (Section 4.1), which is responsible for grouping and fusing the multi-view images into a unified space, and (2) Cross-Modal Feature Fusion (Section 4.2), which aggregates the 3D multi-view features and 2D RGB texture features into a global feature representation space. The training process of the MV-ReID algorithm is described in detail in Algorithm 1.
Algorithm 1. The training process of MV-ReID.
Input: the original RGB image and its multi-view images, a pre-trained multi-view descriptor, a pre-trained texture extractor, a classifier layer, the number of training epochs, and the label of the input person.
Output: the ID of the input person.
1: for each training epoch do
2:  Extract the view-level features from the multi-view images with the multi-view descriptor, calculate the discrimination score of each view-level feature, and divide the features into groups according to their discrimination scores, as detailed in Eqs. (1) and (2);
3:  Conduct view-level feature fusion to fuse the features within each group into a group-level feature, as detailed in Eqs. (3) and (4);
4:  Conduct the attention-based strategy to obtain the multi-view feature by fusing the group-level features, as detailed in Eqs. (5), (6), and (7);
5:  Conduct cross-modal feature fusion to aggregate the multi-view feature and the texture feature into the global person feature, as detailed in Eqs. (8), (9), and (10);
6:  Predict the person ID with the classifier and calculate the loss between the prediction and the label to update the weights of the descriptor, the extractor, and the classifier;
7: end for
8: Predict the person ID with the trained network.
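A skeletal PyTorch-style training step corresponding to Algorithm 1 is sketched below. The module names (`mv_descriptor`, `grouping_fusion`, `cross_modal_fusion`, and so on) are illustrative placeholders for the components described in Sections 4.1 and 4.2, not identifiers from the released code.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def train_step(rgb, views, labels, modules, optimizer):
    """One training iteration. rgb: (B, 3, 256, 128); views: (B, 12, 3, 256, 128)."""
    mv_descriptor, texture_extractor, grouping_fusion, cross_modal_fusion, classifier = modules
    view_feats = mv_descriptor(views.flatten(0, 1))          # (B*12, D) view-level features
    view_feats = view_feats.view(views.size(0), views.size(1), -1)
    f_mv = grouping_fusion(view_feats)                        # grouping + group-level fusion (Sec. 4.1)
    f_rgb = texture_extractor(rgb)                            # 2D texture feature
    f_global = cross_modal_fusion(f_mv, f_rgb)                # unified person feature (Sec. 4.2)
    logits = classifier(f_global)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```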
4.1. Grouping mechanism
The grouping mechanism proposed in this study aims to effectively group the multi-view images based on their content and discriminative information, as illustrated in Fig. 5. By doing so, the connections between views within the same group as well as the differences between groups can be utilized to learn the similarity and dissimilarity attributes among multi-view images [21,8]. This is crucial for maximizing the potential of multi-view images in multi-view learning. Accordingly, we present view-level feature grouping, view-level feature fusion, and group-level feature fusion in Sections 4.1.1, 4.1.2, and 4.1.3, respectively.
4.1.1. View-level feature grouping
The goal of view-level feature grouping is to learn group information that can aid in exploring the relationships among views. Firstly, view-level features are extracted from the multi-view images using a pre-trained descriptor. Next, view-level feature grouping divides the view-level features into different groups based on their similarity. To achieve this, a discrimination score is calculated for each view-level feature to evaluate the differences and similarities among the features, as defined in Eq. (1).
In Eq. (1), the discrimination score measures the discriminability of each view-level feature, and ReLU denotes the activation function. The L2 normalization is defined in Eq. (2).
When the input of the ReLU function lies within the range of 0 to 1, its output is also limited to 0 to 1, and the ReLU function demonstrates evident discriminative capability within this interval. This function space with high discriminative capability reflects the differences among the views, which makes it suitable for grouping view features within this range. To leverage this characteristic of the ReLU function, we employ the L2 normalization function to ensure that the inputs of the ReLU function fall within the range of 0 to 1. After that, we utilize the ReLU function to obtain a discrimination score for each view-level feature. Since the discrimination scores fall within the range of 0 to 1, we allocate the features to different groups in subsequent steps by setting specific thresholds within this range. Features with similar discrimination scores are thus grouped together, while those with different scores are placed in separate groups. Specifically, we partition the view-level features into groups, where each feature is indexed by its group index and its position within that group.
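The following sketch illustrates one possible reading of this grouping step, since Eqs. (1) and (2) are not reproduced here: a learnable scoring head applied to L2-normalized view features, a ReLU to keep the scores in [0, 1], and the three threshold buckets reported in Section 5.3.2. The scoring head is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewGrouping(nn.Module):
    """Assign each view-level feature to one of three groups by its discrimination score."""

    def __init__(self, dim, thresholds=(0.3, 0.6)):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)     # assumed learnable scoring head
        self.thresholds = thresholds

    def forward(self, view_feats):              # view_feats: (V, D), V = 12 views
        normed = F.normalize(view_feats, p=2, dim=-1)               # L2 normalization
        scores = F.relu(self.score_head(normed)).squeeze(-1).clamp(max=1.0)
        boundaries = torch.tensor(self.thresholds, device=scores.device)
        # right=True puts a score of exactly 0.3 (or 0.6) into the next bucket,
        # matching the reported ranges [0, 0.3), [0.3, 0.6), [0.6, 1]
        group_ids = torch.bucketize(scores, boundaries, right=True)
        return scores, group_ids                # group_ids in {0, 1, 2}
```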
4.1.2. View-level feature fusion
Once the view-level features are grouped, views within the same group are expected to share similar content and have close discrimination scores. The view-level feature fusion stage aims to combine these intra-group features into a unified space. To achieve this, we define an aggregation function, given in Eq. (3), that maps the features within a group to a single group-level feature, where the group size denotes the number of features in that group. The weighting term used in Eq. (3) is defined in Eq. (4).
The exponential function increases monotonically with its argument. This property is utilized to amplify the attention given to the representative characteristics of multi-view features [21]. As a result, the internal features of multi-view images are more closely connected.
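As a hedged illustration of Eqs. (3) and (4), which are not reproduced here, the intra-group fusion could take a softmax-like form in which each view feature is weighted by the exponential of its discrimination score; this exact form is an assumption.

```python
import torch

def fuse_group(view_feats, scores):
    """Fuse one group's view features into a group-level feature, weighting
    each view by the exponential of its discrimination score (assumed form)."""
    w = torch.exp(scores)                                # amplify discriminative views
    w = w / w.sum()
    return (w.unsqueeze(-1) * view_feats).sum(dim=0)     # (D,) group-level feature
```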
4.1.3. Group-level feature fusion
To aggregate the group-level features into a unified space, we propose an attention-based strategy that preserves the representative information of multi-view images. Each group-level feature is first transformed by an activation function with learnable weights, as in Eq. (5). After that, a neural attention mechanism is used to calculate the attention weight of each group-level feature over the groups, as in Eq. (6). Finally, the multi-view feature is aggregated from the group-level features according to these attention weights, as in Eq. (7).
The attention weight of each group-level feature determines how much attention the model assigns to the images in the respective group. By employing group-level feature fusion, the discriminative information present in group-level features is preserved in the feature representation. This facilitates the integration of multi-view features into a unified feature space while maintaining the inter-correlation among the views.
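A possible instantiation of Eqs. (5) to (7) is sketched below; the Tanh transform and the single-layer attention head are chosen for illustration rather than taken from the paper's equations.

```python
import torch
import torch.nn as nn

class GroupAttentionFusion(nn.Module):
    """Fuse group-level features into one multi-view feature with per-group attention."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())  # Eq. (5), assumed form
        self.attn = nn.Linear(hidden, 1, bias=False)                       # Eq. (6), assumed form

    def forward(self, group_feats):                  # group_feats: (K, D), K groups
        h = self.transform(group_feats)
        alpha = torch.softmax(self.attn(h).squeeze(-1), dim=0)   # attention weight per group
        return (alpha.unsqueeze(-1) * group_feats).sum(dim=0)    # Eq. (7): multi-view feature
```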
4.2. Cross-modal feature fusion
To fully utilize both the 3D geometry and 2D RGB information, we propose a cross-modal fusion strategy. Since the 3D space can capture the geometry information of occluded objects, while RGB images provide important texture and color information, we use a 2D-based network to extract the texture feature from the original input RGB image. Next, we fuse the texture feature and the multi-view feature into a unified representation space, creating a global feature representation of the person. This process is depicted in Fig. 6.
The texture features are extracted from the RGB image using the pre-trained network. Subsequently, the texture features and the multi-view features are aggregated using the cross-modal feature fusion strategy. To achieve this, we vertically concatenate the two features, as in Eq. (8). The two features represent the multi-view and RGB modalities, respectively. However, since they are extracted from different modalities, a domain gap exists between them, and integrating these cross-modal features into a unified representation space presents a significant challenge. Drawing inspiration from a cross-modality fusion strategy [23], we introduce a residual design into the cross-modal feature fusion module, enhancing the ability of MV-ReID to describe inter-dependencies between the multi-view and RGB modalities. The global person feature is defined in Eq. (9).
The weights in Eq. (9) adjust the contribution of each term. The non-learnable hyperparameter β acts as the balance factor of the residual connection, similar to ResNetV2 [12]. The learnable weights of the residual connection are dot-multiplied with the concatenated feature, exploiting the explanatory power of the residual structure to force MV-ReID to focus on information at different levels. These learnable weights are defined in Eqs. (10) and (11), where linear-layer weights are learned, global average pooling is applied, and the inputs are normalized over the feature dimension.
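The sketch below gives one assumed reading of Eqs. (8) to (11): the two modality features are concatenated, the fixed balance factor β scales the identity branch, and two learnable gates produced from the normalized input scale the residual branches. The exact form of the paper's equations (including its use of global average pooling on spatial features) is not reproduced here, so this is a sketch under stated assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Residual fusion of the multi-view feature and the RGB texture feature (assumed form)."""

    def __init__(self, dim, beta=1.5):
        super().__init__()
        self.beta = beta                              # non-learnable balance factor
        self.norm = nn.LayerNorm(2 * dim)             # normalization over the feature dimension
        self.gate1 = nn.Linear(2 * dim, 2 * dim)      # learnable weight, cf. Eq. (10) (assumed)
        self.gate2 = nn.Linear(2 * dim, 2 * dim)      # learnable weight, cf. Eq. (11) (assumed)

    def forward(self, f_mv, f_rgb):                   # each: (B, D)
        f = torch.cat([f_mv, f_rgb], dim=-1)          # Eq. (8): vertical concatenation
        g1 = torch.sigmoid(self.gate1(self.norm(f)))
        g2 = torch.sigmoid(self.gate2(self.norm(f)))
        # residual-style combination balanced by beta, cf. Eq. (9) (assumed)
        return self.beta * f + g1 * f + g2 * f
```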
5. Experiments

5.1. Experimental setup
Datasets and Evaluation Metrics. We evaluate MV-ReID on four widely used benchmarks for occluded and holistic person ReID: Market-1501 [48], DukeMTMC [51], Occluded-Duke [24], and Occluded-ReID [52]. In these benchmarks, our method takes both 2D RGB and 3D multi-view images as input. However, it is noteworthy that the Occluded-ReID dataset does not have a training set. Therefore, we train the MV-ReID network on the Occluded-Duke training set and evaluate its performance on the Occluded-ReID gallery and query set. In accordance with ReID community standards, we evaluate each method using the mean Average Precision (mAP) and Cumulative Matching Characteristic (CMC) curves.
Implementation details and computation cost. The pre-trained ResNet-50 [13] is chosen to construct the backbone network of MV-ReID. Twelve multi-view images are generated from one RGB image, and all multi-view images are resized to 256 × 128. Cross-entropy loss is used as the loss function. The network is trained using the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4. MV-ReID is implemented in PyTorch and runs on an NVIDIA RTX 5000. MV-ReID performs 0.0097 GFLOPs of floating-point operations per forward pass. Person ID retrieval costs 0.23 ms on an NVIDIA RTX 5000, which makes deployment affordable.
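The reported optimizer and input settings translate into a configuration sketch like the following; the learning rate is an assumption, as it is not stated above.

```python
import torch
import torchvision

# ResNet-50 backbone pre-trained on ImageNet, cross-entropy loss, and SGD with
# momentum 0.9 and weight decay 1e-4, as reported; lr=0.01 is an assumed value.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

# All multi-view images are resized to 256 x 128 before being fed to the network.
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((256, 128)),
    torchvision.transforms.ToTensor(),
])
```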
5.2. Comparison with state-of-the-arts
Results on Occluded ReID Datasets. The performance of previous methods and our proposed method on two occluded ReID tasks is compared in Table 2. Three kinds of methods are compared: 2D-based ReID methods [20,35,40,3,32,42], 3D point-based ReID methods [50,2], and the 3D view-based ReID method (MV-ReID). OGNet [50] and ASSP [2] (the second group) do not conduct experiments on occluded datasets, and we retrain their networks to obtain their performance. MV-ReID is slightly inferior to PFD [35] in the mAP metric on Occluded-ReID, as there is no uniform training set for the Occluded-ReID task and therefore the evaluation results can vary considerably. The Rank-1/mAP of MV-ReID reaches 76.2%/63.4% and 83.7%/76.4% on the Occluded-Duke and Occluded-ReID datasets, respectively. Except for the mAP metric on Occluded-ReID, MV-ReID sets a new state-of-the-art performance. Compared to the 2D-based methods, our model achieves a Rank-1/mAP improvement of more than 1.1%/0.9% on Occluded-Duke. This is because the 3D space can restore the missing parts of the occluded person from 2D RGB images, and therefore the 3D view-based method can produce a complete interpretation of the occluded person. Compared to the 3D point-based methods, MV-ReID achieves a large-margin Rank-1/mAP improvement of more than 33.6%/29.7% on Occluded-Duke, owing to the fact that the view-based method can employ an effective 2D-based network structure to learn 3D geometry and structure information [28]. The above results demonstrate the potential of our method for handling occlusion challenges.
Table 2. Comparison with state-of-the-art methods in occluded person ReID. The symbols #1,2,3,4 indicate performance rankings, established by comparing the results of using the 2D RGB image versus the 3D multi-view image as the network input type. The symbols ×1,2,3 indicate performance rankings, established by comparing the results of using the 3D point cloud versus the 3D multi-view image as the network input type.
Results on Holistic ReID Datasets. Recent occluded person ReID methods are often unsatisfactory on holistic datasets [40]. In view of this, we validate the effectiveness of MV-ReID on the Market-1501 and DukeMTMC datasets, as shown in Table 3. Four kinds of methods are compared: occluded ReID methods [42,20,35,40], 3D point-based ReID methods [50,2], holistic person ReID methods [15,43,29,5,9,18], and the 3D view-based ReID method (MV-ReID). The results indicate that the proposed MV-ReID outperforms the occluded ReID methods and the 3D point-based ReID methods on holistic ReID tasks. This is because the proposed method can fully utilize the geometry information of 3D objects and adopt advances in 2D-based networks to extract feature representations. Meanwhile, the proposed MV-ReID is slightly inferior to current holistic person ReID methods. The reason is that low-resolution 3D human reconstruction methods are immature, and an inaccurate 3D human model can lead to misidentification. Despite these limitations, the above results validate the robustness of MV-ReID on holistic ReID tasks.
Table 3. Performance comparison on holistic ReID datasets. The symbols #1,2,3,4 indicate performance rankings, derived from comparisons between occluded ReID methods and MV-ReID. The symbols ×1,2,3 indicate performance rankings, derived from comparisons between 3D point-based ReID methods and MV-ReID. The symbols $1,2,3,4 indicate performance rankings, derived from comparisons between holistic ReID methods and MV-ReID.
5.3. Ablation study

To demonstrate the advantages of MV-ReID in resolving the occlusion problem, each component of MV-ReID is evaluated on the Occluded-Duke ReID task.
5.3.1. Analysis of the effectiveness of 3D and 2D features
In this section, the effect of 3D multi-view and 2D RGB features for the occluded ReID task is analyzed, as shown in Table 4. Index-1 represents purely utilizing 3D multi-view features. Index-2 represents purely utilizing 2D RGB features. Index-3 represents utilizing both 3D multi-view and 2D RGB features.
Table 4. Analysis of the effect of 3D multi-view and 2D RGB features in MV-ReID performance.

Index   Multi-view   RGB   Occluded-Duke
                           Rank-1   Rank-5   mAP
1       ✓            ×     62.1     69.7     54.4
2       ×            ✓     60.4     62.9     53.2
3       ✓            ✓     76.2     81.7     63.4
From Index-1 and Index-2, it can be observed that feeding only 3D or only 2D features into MV-ReID still maintains a moderate level of performance. This is because 3D features provide information about a pedestrian's body shape and occluded geometric structure, while 2D features provide information about the pedestrian's clothing texture and color. Therefore, either 3D features or 2D features can provide effective information, thereby enhancing the final result of MV-ReID. Comparing Index-1 and Index-2, when multi-view images are utilized, the performance is improved by 1.7%/6.8%/1.2% in Rank-1/Rank-5/mAP. This indicates that 3D multi-view images can restore the missing portion of the occluded object. Therefore, the application of 3D multi-view features can effectively improve retrieval accuracy in occluded person ReID compared to 2D RGB features. From the comparison between Index-1 and Index-3 or Index-2 and Index-3, it can be observed that feeding both 3D and 2D features into MV-ReID obtains the optimal performance. Specifically, Index-3 surpasses Index-1 by 14.1%/12.0%/9.0% and exceeds Index-2 by 15.8%/18.8%/10.2% in terms of Rank-1/Rank-5/mAP. This is because the 3D and 2D features are complementary: 3D features describe the geometry and structure information, while 2D features capture the variation of color and texture at the pixel level. Fusing these cross-modality features into a unified space can yield a more comprehensive interpretation of a pedestrian, thus enhancing the final result of MV-ReID.
5.3.2. Analysis of the number of groups
In grouping learning, we calculate the discrimination scores of the twelve multi-view features and then allocate these features based on thresholds set within the range of 0 to 1, as detailed in Section 4.1.1. We further investigate the number of groups in grouping learning, as shown in Table 5. From Index-1 and Index-4, we can observe that the performance suffers from degradation when the number of groups is quite small or large. This is because too few or too many groups lose much information about the 3D shape [21], therefore leading to performance degradation in grouping learning. From Index-2 and Index-3, we can observe that the performance is maintained at a high level. The reason is that grouping learning aims to force the views in the same group to share similar content with closer discriminability [8]. In this regard, a proper number of groups contributes to providing a complete interpretation of the relations between different multi-view images. According to the experimental results, our method adopts the setting of three groups (Index-2) and sets the threshold ranges to [0, 0.3), [0.3, 0.6), and [0.6, 1].
Table 5. Performance comparison with different group numbers.

Index   Group number   Occluded-Duke
                       Rank-1   Rank-5   mAP
1       2              62.3     64.3     54.2
2       3              76.2     81.7     63.4
3       4              73.4     77.1     61.8
4       6              62.2     66.7     55.7
5.3.3. Analysis of each component
Table 6 illustrates the results of our ablation study on the grouping mechanism (G) and cross-modal feature fusion (C). Without G, we simply use a max-pooling layer [34] to aggregate the twelve multi-view features. Without C, we concatenate the multi-view and texture features vertically and resize the feature dimension using a linear layer.
Table 6. Performance comparison with different components.

Index   G   C   Occluded-Duke
                Rank-1   Rank-5   mAP
1       ×   ×   56.7     63.1     52.8
2       ✓   ×   69.2     75.4     61.6
3       ×   ✓   68.3     73.7     57.8
4       ✓   ✓   76.2     81.7     63.4
Analysis of the grouping mechanism: Adopting the grouping mechanism (G) results in significant improvements in the performance of multi-view person re-identification (MV-ReID); comparing Index-3 and Index-4, the gains are 7.9%/8.0%/5.6% in Rank-1/Rank-5/mAP. This improvement is attributed to the ability of G to utilize content relationships and mine discriminative information from multi-view images [8]. Additionally, comparing Index-1 and Index-2, G improves performance by 12.5%/12.3%/8.8% in Rank-1/Rank-5/mAP, as it can learn discriminative multi-view visual characteristics [21].
Analysis of the cross-modal feature fusion: Adopting the residual-based design (C) results in significant improvements in the performance of multi-view person re-identification (MV-ReID); comparing Index-2 and Index-4, the gains are 7.0%/6.3%/1.8% in Rank-1/Rank-5/mAP. This improvement is attributed to the ability of C to learn inter-dependencies between 3D geometries and 2D textures, enhancing the model's ability to summarize information at different levels. Additionally, comparing Index-1 and Index-3, we observe an 11.6%/10.6%/5.0% improvement in Rank-1/Rank-5/mAP when using C. This improvement is due to the ability of C to capture variations at both the 2D RGB pixel and 3D space levels.
5.3.4. Parameter sensitivity analysis of cross-modal feature fusion
In cross-modal feature fusion, the hyperparameter β functions as a balancing factor in the residual connection, as detailed in Eq. (9). Specifically, the non-trainable hyperparameter β is manually set prior to the initialization of MV-ReID and remains unchanged during the entire training process. Consequently, we conduct experiments with a range of β values to test the ability of the cross-modal feature fusion module to fuse cross-modal features. As shown in Table 7, we take five different values of β in the range [0.5, 2.5]. From the experimental results, the optimal range of β is [1.5, 2.0], and we set the hyperparameter β to 1.5.
Table 7. Performance comparison with different values of the hyperparameter. 表 7.hyperparameter 的不同值的性能比较。
β
Occluded-Duke
Rank-1 第 1 名
Rank-5 第 5 名
mAP 地图
0.5
73.9
74.6
60.9
1
73.2
75.3
61.5
1.5
76.2
81.7
63.4
2.0
74.3
77.7
62.1
2.5
63.3
69.6
57.9
5.4. Visualization result

5.4.1. Visualization of the grouping mechanism
The grouping mechanism is crucial to our pipeline, and this section offers a qualitative analysis to understand its role in discerning the connections between internal features of multi-view images. The grouping mechanism's primary aim is to ascertain the discriminative nature of the content with respect to the corresponding labels, with the expectation that views within the same group exhibit greater discriminability and share similar content. Fig. 7 provides several illustrative examples of grouping.
The figure demonstrates that different groups encompass diverse multi-view image types. The grouping primarily consists of three categories: 1) human body parts visible in 2D RGB images and reconstructed in 3D space within multi-view images (red box), 2) human body parts occluded by other objects in 2D RGB images and reconstructed in 3D space within multi-view images (blue box), and 3) human body parts not present in 2D RGB images but reconstructed in 3D space within multi-view images (green box). For instance, in the first example, all 12 views are divided into three groups, each primarily consisting of one of the aforementioned categories. In the third example, where the person is unoccluded, two groups are formed: the first mainly contains human body parts visible in 2D RGB images and reconstructed in 3D space, while the second primarily consists of human body parts not present in 2D RGB images but reconstructed in 3D space. These results indicate that multi-view images with similar content and shape can be effectively grouped, showcasing the proposed grouping mechanism's efficacy in learning similarities between different views and strengthening internal feature connections among multi-view images. The group weight, calculated using Eq. (6), represents the attention allocated to each group. It is observed that when a person is occluded (e.g., the second example), the network assigns comparable weights to both occluded and unoccluded views, facilitating simultaneous learning of occluded and unoccluded representations. When the person is not occluded by obstacles (e.g., the third example), the network focuses more on unoccluded representations (red box), as human bodies not captured in 2D RGB images but reconstructed in 3D space (green box) are estimated based on the parameterized SMPL model and contain more noise interference. In conclusion, the grouping mechanism enables the network to extract discriminative information from multi-view images and preserve representative information among them.
5.4.2. Visualization of the alternative 3D reconstruction methods
In this section, we justify our decision to employ the SMPL method [22], which concentrates on reconstructing human body geometry and structure, as opposed to recent approaches such as ICON [39] and Pifu [30], which are capable of restoring skin and clothing textures. Fig. 8 displays the reconstruction outcomes of each method. It becomes apparent that, although recent reconstruction methods (ICON [39], Pifu [30]) generate textures for pedestrians, their reconstructed results notably diverge from the precise human body shape. These deviations can be ascribed to the following factors: (1) Recent 3D human reconstruction methods that restore skin and clothing textures necessitate high-resolution images as input, whereas surveillance cameras employed in ReID tasks generally possess lower resolutions. As a result, reconstruction outcomes derived from low-resolution images are inaccurate, as shown in Fig. 8(a). (2) The SMPL model is devised using parametric methods, facilitating the restoration of occluded pedestrian portions through the parameterized model. In contrast, existing texture-aware reconstruction techniques predominantly rely on non-parametric models, posing difficulties in reconstructing occluded portions when pedestrians are occluded, as shown in Fig. 8(b-d). In conclusion, despite the SMPL model's inability to supply texture information for pedestrians, a substantial number of studies [50,2,11] based on the SMPL model have evidenced its capacity to furnish adequate information for tackling ReID challenges. Consequently, we propose a pipeline that first converts the RGB image into a 3D SMPL model and subsequently renders it into multi-view images. This pipeline enables the network to harness the geometric and structural data offered by the 3D SMPL human model while capitalizing on advancements in 2D-based networks for feature extraction.
6. Conclusion
This paper introduces MV-ReID, an initial attempt to apply a 3D view-based strategy to the challenges of occluded ReID. Our contribution includes providing four multi-view datasets that enable further research on view-based occluded and holistic person ReID. To fully utilize the multi-view datasets, we propose a grouping mechanism that extracts the discriminative information between multi-view images. Furthermore, we propose a cross-modal strategy to learn the inter-dependencies between 3D geometries and 2D textures. We conduct experiments on both holistic and occluded ReID tasks and achieve state-of-the-art performance on the latter, which validates the potential of MV-ReID for handling occluded person ReID.
MV-ReID has promising applications in various occlusion-related tasks. For instance, it can enhance video surveillance systems by improving the tracking and recognition of individuals across multiple cameras, even when they move in and out of view or are partially obscured. MV-ReID can also support public safety applications, such as identifying suspects in crowded public spaces or tracking individuals in emergencies such as search and rescue operations. In retail settings, MV-ReID can track customers as they move through a store, helping businesses analyze customer behavior and shopping patterns; it can also help prevent shoplifting by identifying individuals who may be attempting to conceal stolen items. Future research could explore alternative rendering configurations. MV-ReID uses predefined rendering configurations to transform 3D human models into multi-view images, whereas recent view-based research employs differentiable rendering, which allows the rendering settings to be optimized in a data-driven manner and thus produces more informative multi-view images. Consequently, integrating differentiable rendering configurations into MV-ReID has the potential to improve the quality of the rendered images and, in turn, the retrieval accuracy. A speculative sketch of this direction follows.
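The sketch below illustrates the future-work idea only: the rendering viewpoints are treated as learnable parameters so that a ReID loss back-propagated through a differentiable renderer can refine them. It is not part of MV-ReID, and the class name, initialization, and camera distance are hypothetical.

```python
# Speculative sketch: learnable viewpoints for differentiable rendering.
import torch
from pytorch3d.renderer import look_at_view_transform, FoVPerspectiveCameras

class LearnableViews(torch.nn.Module):
    def __init__(self, num_views: int = 12):
        super().__init__()
        # Initialize to a fixed ring of views, then let gradients refine them.
        self.azim = torch.nn.Parameter(torch.linspace(0, 360, num_views + 1)[:-1])
        self.elev = torch.nn.Parameter(torch.full((num_views,), 10.0))

    def cameras(self, device: str = "cpu") -> FoVPerspectiveCameras:
        R, T = look_at_view_transform(dist=2.5, elev=self.elev, azim=self.azim)
        return FoVPerspectiveCameras(device=device, R=R, T=T)

# During training, the ReID loss computed on views rendered with these cameras
# would flow back through the differentiable renderer into `azim` and `elev`.
```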
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 62373343) and the Beijing Natural Science Foundation (No. L233036).