Tracking without Label: Unsupervised Multiple Object Tracking via Contrastive Similarity Learning
无标签跟踪:基于对比相似度学习的无监督多目标跟踪

Sha Meng*, Dian Shao*, Jiacheng Guo, Shan Gao $^{†}$
沙蒙*、刁少*、郭家成、高山 $^{†}$ Northwestern Polytechnical University, Xi'an, China
西北工业大学,西安,中国{mengsha,gjc1}@mail.nwpu.edu.cn, {shaodian, gaoshan}@nwpu.edu.cn

Abstract 摘要

Unsupervised learning is a challenging task due to the lack of labels. Multiple Object Tracking (MOT), which inevitably suffers from mutual object interference, occlusion, etc., is even more difficult without label supervision. In this paper, we explore the latent consistency of sample features across video frames and propose an Unsupervised Contrastive Similarity Learning method, named UCSL, including three contrast modules: self-contrast, cross-contrast, and ambiguity contrast. Specifically, i) selfcontrast uses intra-frame direct and inter-frame indirect contrast to obtain discriminative representations by maximizing self-similarity. ii) Cross-contrast aligns cross- and continuous-frame matching results, mitigating the persistent negative effect caused by object occlusion. And iii) ambiguity contrast matches ambiguous objects with each other to further increase the certainty of subsequent object association through an implicit manner. On existing benchmarks, our method outperforms the existing unsupervised methods using only limited help from ReID head, and even provides higher accuracy than lots of fully supervised methods.
无监督学习是一项充满挑战的任务,因为缺乏标签。多目标跟踪(MOT)不可避免地遭受互斥物体干扰、遮挡等问题,在缺乏标签监督的情况下更加困难。本文我们探索了跨视频帧样本特征的潜在一致性,提出了一种无监督对比相似度学习方法 UCSL,包括三个对比模块:自对比、交叉对比和模糊对比。具体来说,i)自对比利用帧内直接对比和帧间间接对比,通过最大化自相似性来获得具有区分性的表示。ii)交叉对比将跨帧和连续帧的匹配结果对齐,从而消除物体遮挡造成的持续负面影响。iii)模糊对比以隐式方式将模糊的物体进行匹配,进一步增加后续物体关联的确定性。在现有基准上,我们的方法在仅使用有限的 ReID 头辅助的情况下,性能优于现有的无监督方法,甚至超过了许多完全监督的方法。

1. Introduction 1. 简介

As a basic task in computer vision, Multiple Object Tracking (MOT) is widely applied in a variety of fields, including robot navigation, intelligent surveillance, and other aspects [36,33]. Currently, one of the most popular tracking paradigms is joint detection and re-identification (ReID) embeddings. In the case of supervision, ReID is regarded as a classification task. To keep track of objects, many works [39, 45] utilize appearance features for object association, where the representation ability of the ReID head will directly affect the accuracy of the object association.
作为计算机视觉的一项基本任务,多目标跟踪(MOT)被广泛应用于机器人导航、智能监控等多个领域[36,33]。目前最流行的跟踪范式之一是联合检测和重识别(ReID)嵌入。在有监督的情况下,ReID 被视为一项分类任务。为了跟踪对象,许多研究[39,45]利用外观特征进行对象关联,ReID 头的表示能力将直接影响对象关联的准确性。

However, due to limitations in various conditions such as labeled datasets, to meet the needs of researchers, there has been a growing requirement to annotate tracking datasets,
然而,由于标注数据集等各种条件的限制,为了满足研究人员的需求,对跟踪数据集进行标注的需求正在不断增加

Figure 1. Supervised and Unsupervised MOT. In the joint detection and ReID embeddings framework, to obtain discriminative embeddings for tracking, the left branch is a usual method of supervised MOT training, i.e., given labels, it is trained as an object classification task. The middle branch is a common method of unsupervised training, i.e., it is processed by clustering, and targets with high similarity are regarded as the same class. The right branch is the proposed method with contrast similarity learning to improve the similarity of the same objects without label information.
图 1. 有监督和无监督的多目标跟踪。在联合检测和 ReID 嵌入框架中,为了获得用于跟踪的有区分性嵌入,左侧分支是通常的有监督多目标跟踪训练方法,即给定标签,它被训练为一个物体分类任务。中间分支是一种常见的无监督训练方法,即通过聚类处理,将高度相似的目标视为同一类。右侧分支是提出的方法,采用对比相似性学习来提高同一物体的相似性,而无需标签信息。
which is costly and time-consuming. Therefore, unsupervised learning of visual representation has attracted great attention in tracking. Some works [35, 30] have demonstrated that training the network in the real direction can also be done even without ground truth. Some works [5, 6] directly use ReID features to cluster objects with high similarity
这既耗时又成本高。因此,无监督的视觉表征学习在跟踪中引起了广泛关注。一些研究[35, 30]已经证明,即使没有真实标签,也可以训练网络在实际方向上工作。一些研究[5, 6]直接使用 ReID 特征对高相似度的目标进行聚类。
into the same class, then generate pseudo-labels to train the network, as shown in the middle branch of Figure 1. But these cluster-based methods are easy to accumulate errors during the training process. In contrast, we consider using an Unsupervised Contrastive Similarity Learning (UCSL) to train the ReID branch without generating pseudo-labels, as shown in the right branch of Figure 1.
放入同一个类别中,然后生成伪标签来训练网络,如图 1 中间分支所示。但是这些基于聚类的方法在训练过程中容易积累错误。相比之下,我们考虑使用无监督对比相似性学习(UCSL)来训练 ReID 分支而不生成伪标签,如图 1 右侧分支所示。

As a video task, the objects in MOT are always changing over time, which leads to inevitable problems of mutual occlusion between objects and objects, and between objects and non-objects, as well as the disappearance of old objects and the appearance of new ones. Occluded objects are not represented consistently from frame to frame due to additional interference features. Lost and emerging objects theoretically cannot be matched with other objects, and they are almost always negative for the current association stage. Thus, it is difficult for the model to determine whether arbitrary two objects are the same or not. In supervised cases, ID labels can be used to make the training more explicitly directed, while in the unsupervised case, the dual limitation of unlabeling and inherent problems makes unsupervised MOT even more challenging. So we manage to find potential connections between objects to determine if the objects are identical.
作为一个视频任务,MOT 中的物体随时间而变化,这导致了物体与物体之间以及物体与非物体之间不可避免的互遮问题,同时也出现了旧物体消失和新物体出现的情况。由于额外的干扰特征,遮挡的物体无法从帧到帧中保持一致的表示。理论上,丢失的物体和新出现的物体无法与其他物体匹配,这对当前的关联阶段几乎都是负面影响。因此,模型很难确定任意两个物体是否相同。在监督的情况下,可以使用 ID 标签来使训练更加明确有针对性,而在无监督的情况下,无标签和固有问题的双重限制使得无监督 MOT 更加具有挑战性。所以我们设法找到物体之间的潜在联系,以确定这些物体是否相同。

In this paper, we propose an Unsupervised Contrastive Similarity Learning (UCSL) method to solve the inherent object association problems of unsupervised MOT. Specifically, UCSL consists of three modules, self-contrast, crosscontrast and ambiguity contrast, designed to address different issues respectively. For the self-contrast, we first match between objects within frames and between objects in adjacent frames. Correspondingly, we get the direct and indirect matching results of the intra-frame objects. Then we maximize the matching probability of self-to-self to maximize the similarity of the same objects. For cross-contrast, considering theoretically the cross-frame matching results should be consistent with the final results of continuous matching, we improve the similarity of the occluded objects by making these two matching results as close as possible. For ambiguity contrast, we match between ambiguous objects, mainly containing occluded, lost, and emerging objects whose final similarity is generally low, to further determine the object identity. Our proposed method is simple but effective, which achieves outstanding performance by utilizing only the ReID embeddings without adding any additional branch such as the occlusion handling or opticalflow based cue to the detection branch.
在本文中，我们提出了一种无监督对比相似度学习(UCSL)方法来解决无监督多目标跟踪(MOT)中固有的对象关联问题。具体而言，UCSL 由三个模块组成,即自对比、交叉对比和歧义对比,分别用于解决不同的问题。对于自对比,我们首先在帧内匹配对象,并在相邻帧之间匹配对象。相应地,我们获得了帧内对象的直接和间接匹配结果。然后我们最大化自对自的匹配概率,从而最大化相同对象的相似度。对于交叉对比,考虑到理论上跨帧匹配结果应该与连续匹配的最终结果一致,我们通过使这两个匹配结果尽可能接近来提高被遮挡对象的相似度。对于歧义对比,我们匹配歧义对象,主要包含通常相似度较低的遮挡、丢失和出现的对象,以进一步确定对象身份。我们提出的方法简单但有效,仅利用 ReID 嵌入即可实现出色的性能,无需为检测分支添加任何额外的分支,如遮挡处理或光流提取。

We implement the method on the basis of FairMOT [45] using the pre-trained model on the COCO dataset [19]. Our experiments on the MOT15 [15], MOT17 [24] and MOT20 [7] datasets are conducted to evaluate the effectiveness of the proposed method. The performance of our unsupervised approach is comparable with, or outperforms, that of some supervised methods using expensive annotations.
我们基于 FairMOT [45] 使用在 COCO 数据集 [19] 上预训练的模型实现了该方法。我们在 MOT15 [15]、MOT17 [24] 和 MOT20 [7] 数据集上进行了实验,以评估所提出方法的有效性。我们无监督方法的性能可与一些利用昂贵注释的监督方法相媲美或优于它们。
Overall, our contributions are summarized as follows:
总的来说,我们的贡献总结如下:

We propose a contrastive similarity learning method for unsupervised MOT task, which pursues latent object consistency based only on the sample features in the ReID module given without the ID information.
我们提出一种用于无监督多目标跟踪任务的对比相似性学习方法,该方法仅基于 ReID 模块中的样本特征(不包含 ID 信息)追求潜在的目标一致性。
We design three useful modules to model associations between objects in different cases. To elaborate, self-contrast module matches intra-frame objects, cross-contrast module associates cross-frame objects, and ambiguity contrast module deals with those hard/corner cases (e.g., occluded objects, lost objects, etc.)
我们设计了三个有用的模块来建模不同情况下对象之间的联系。具体来说,自对比模块匹配帧内对象,跨帧对比模块关联跨帧对象,而歧义对比模块处理那些困难/角落案例(例如,遮挡的对象、丢失的对象等)。
Experiments on MOT15[15], MOT17[24] and MOT20 [7] demonstrate the effectiveness of the proposed UCSL method. As an unsupervised method, UCSL outperforms state-of-the-art unsupervised MOT methods and even achieves similar performance as the fully supervised MOT methods.
在 MOT15[15]、MOT17[24]和 MOT20[7]上的实验证明了所提出的 UCSL 方法的有效性。作为一种无监督方法,UCSL 优于目前最先进的无监督 MOT 方法,甚至可以达到完全监督的 MOT 方法的类似性能。

Multi-Object Tracking. Multi-object tracking is a task that localizes objects from consecutive frames and then associates them according to their identity. Thus, for a long time, the most classic tracking paradigm is tracking-bydetection [27], i.e., firstly, an object detector is used to detect objects from every frame, and secondly, a tracker is used to associate these objects across frames. A large number of works [3, 40, 32, 4] in this paradigm have achieved decent performance, but the paradigm relies too much on the performance of detectors. In the past two years, the joint detection and tracking or embedding paradigm has become stronger. Some transformer-based MOT architectures [31, 23, 41] designed two decoders to perform detection and object propagation respectively. JDE [39] and FairMOT [45] directly incorporated the appearance model into a one-stage detector, and then the model can simultaneously output detection results and the corresponding embeddings. These simple but effective frameworks have been what we are looking for, so we take FairMOT [45] as our baseline.
多目标跟踪。多目标跟踪是一项任务,它首先从连续帧中定位对象,然后根据它们的身份将它们关联起来。因此,很长一段时间以来,最经典的跟踪范式是跟踪-检测 [27],即首先使用物体探测器从每一帧中检测物体,然后使用追踪器将这些物体关联起来。这一范式中有大量研究 [3,40,32,4]取得了不错的性能,但该范式过于依赖探测器的性能。在过去两年中,联合检测和跟踪或嵌入范式变得更加强大。一些基于变换器的多目标跟踪架构 [31,23,41]设计了两个解码器,分别执行检测和对象传播。JDE [39]和 FairMOT [45]直接将外观模型纳入单级检测器,然后该模型可以同时输出检测结果和相应的嵌入。这些简单但有效的框架是我们一直在寻找的,所以我们将 FairMOT [45]作为我们的基线。

Unsupervised Tracking. For some tasks, existing datasets or other resources cannot meet the needs of researchers. In this condition, unsupervised learning has been a popular solution and its efficiency has been demonstrated in related studies [12, 21, 35, 30]. SimpleReID [12] first used unlabeled videos and the corresponding detection sets, and generated tracking results using SORT [3] to simulate the labels, and trained the ReID network to predict the labels of the given images. It is the first demonstration of the effectiveness of the simple unsupervised ReID network
无监督跟踪。对于某些任务,现有数据集或其他资源无法满足研究人员的需求。在这种情况下,无监督学习一直是一种流行的解决方案,其效率已在相关研究中得到证明[12、21、35、30]。SimpleReID[12]首先使用未标记的视频和相应的检测集,并使用 SORT[3]生成跟踪结果来模拟标签,然后训练 ReID 网络来预测给定图像的标签。这是关于简单无监督 ReID 网络有效性的首次示范。

Figure 2. The overall pipeline of our proposed unsupervised contrastive similarity learning model (UCSL), which learns representations with self-contrast, cross-contrast and ambiguity contrast.
图 2. 我们提出的无监督对比相似性学习模型(UCSL)的整体流程,它通过自对比、交叉对比和模糊对比来学习表示。
for MOT. Liu et al. [21] proposed a model, named OUTrack, using an unsupervised ReID learning module and a supervised occlusion estimation module together to improve tracking performance.
刘等人[21]提出了一种名为 OUTrack 的模型,使用无监督的重新识别学习模块和有监督的遮挡估计模块相结合,以提高跟踪性能。

Re-Identification. In the field of re-identification, which is more relevant to MOT, unsupervised learning has been widely used through various means including domain adaption, clustering, etc. Considering the visual similarity and cycle consistency of labels, MMCL [34] predicted pseudo labels and regarded each person as a class, transforming ReID into a multi-classification problem. Some other works [5, 6, 20] also utilized clustering algorithms to generate pseudo labels and take them as ground truth to train the network. However, error accumulation is easy to occur during the iterative process. Recent methods propose selfsupervised learning, Wang et al. [38] proposed CycAs inspired by the data association concept in multi-object tracking. By using the self-supervised signal as a constraint on the data, networks gradually strengthen the feature expression ability during the training process.
重新识别。在重新识别这个领域中,这与多目标跟踪更为相关,无监督学习已被广泛使用,包括通过域自适应、聚类等各种方式。考虑到标签的视觉相似性和循环一致性,MMCL[34]预测了伪标签,将每个人视为一个类别,将 ReID 转化为一个多分类问题。一些其他工作[5,6,20]也利用聚类算法生成伪标签,并将其作为实际标签来训练网络。然而,在迭代过程中很容易出现误差累积。最近的方法提出了自监督学习,Wang 等人[38]提出了 CycAs,受多目标跟踪中数据关联概念的启发。通过将自监督信号用作对数据的约束,网络在训练过程中逐步增强了特征表达能力。

Cycle Consistency. Cycle consistency is originally proposed in Generative Adversarial Network (GAN), and widely used in segmentation, tracking, etc. Jabri et al. [10] constructed a space-time graph from the video, and cast correspondence as prediction of links. By cycle consistency, the single path-level constraint implicitly supervised chains of intermediate comparisons. Wang et al. [37] used cycle consistency in time as the free supervisory signal for learning visual representations from scratch. Then they used the acquired representation to find nearest neighbors across space and time in a range of visual correspondence tasks.
循环一致性。循环一致性最初在生成对抗网络(GAN)中提出,并广泛应用于分割、跟踪等领域。Jabri 等人[10]从视频中构建了时空图,并将对应作为链接预测的问题。通过循环一致性,单一路径级约束隐式监督了中间比较的链条。Wang 等人[37]在时间上使用循环一致性作为从头学习视觉表征的免费监督信号。然后他们使用获得的表征在空间和时间上找到各种视觉对应任务的最近邻。

Contrastive Learning. Contrastive learning has shown great potential in self-supervised learning. Pang et al. [26] proposed QDTrack, which densely sampled hundreds of region proposals on a pair of images for contrastive learning. And they directly combined this with existing detection methods. Yu et al. [42] proposed multi-view trajectory con- trastive learning and designed a trajectory-level contrastive loss to explore the inter-frame information in the whole trajectories. Bastani et al. [1] proposed to construct two different inputs for the same video sequence by hiding different information. Then they computed the trajectory of that sequence by applying the RNN model independently on each input, and trained the model using contrastive learning to produce consistent tracks.
对比学习。对比学习在自监督学习中表现出巨大潜力。Pang 等人[26]提出了 QDTrack,它在一对图像上密集采样了数百个区域建议进行对比学习。他们直接将其与现有的检测方法相结合。Yu 等人[42]提出了多视图轨迹对比学习,设计了轨迹级对比损失来探索整个轨迹中的帧间信息。Bastani 等人[1]提出构建同一视频序列的两个不同输入,通过隐藏不同信息实现。然后他们独立地在每个输入上应用 RNN 模型计算该序列的轨迹,并使用对比学习训练模型以产生一致的跟踪。

3. Method 3.方法

In this section, we first introduce the overall pipeline, as illustrated in Figure 2, and then describe the corresponding specific concepts in detail in the subsequent parts. Finally, we introduce the whole steps of training and inference.
在本节中，我们首先介绍整体流程,如图 2 所示,然后在后续部分详细描述相应的具体概念。最后,我们介绍整个训练和推理的步骤。

3.1. Contrast Similarity Learning
3.1. 对比相似性学习

Given consecutive three images

I_{1}, I_{2}, I_{3} \in R^{H \times W \times 3}

, we first feed them to the backbone, then through detection branches and ReID heads, we could get detection results and ReID feature maps, as shown in Figure 2. Based on the position of the bounding box in the ground truth, the feature embedding corresponding to each object is obtained from the corresponding feature map, which forms embedding matrices

X_{1} = [x_{1}^{0}, x_{1}^{1}, \dots \dots, x_{1}^{N - 1}] \in R^{D \times N}

X_{2} = [x_{2}^{0}, x_{2}^{1}, \dots \dots, x_{2}^{M - 1}] \in R^{D \times M}

and

X_{3} =

[x_{3}^{0}, x_{3}^{1}, \dots \dots, x_{3}^{K - 1}] \in R^{D \times K}

, where

N, M

, and

K

are the object numbers in

I_{1}, I_{2}

and

I_{3}

, respectively, and

D

is the embedding dimension.
给定三个连续的图像

I_{1}, I_{2}, I_{3} \in R^{H \times W \times 3}

，我们首先将它们输入到主干网络中，然后通过检测分支和 ReID 头部，我们可以获得检测结果和 ReID 特征映射,如图 2 所示。基于地面实况中边界框的位置,从相应的特征映射中获取每个目标的特征嵌入,形成嵌入矩阵

X_{1} = [x_{1}^{0}, x_{1}^{1}, \dots \dots, x_{1}^{N - 1}] \in R^{D \times N}

、

X_{2} = [x_{2}^{0}, x_{2}^{1}, \dots \dots, x_{2}^{M - 1}] \in R^{D \times M}

和

X_{3} =

。其中

N, M

和

K

分别是

I_{1}, I_{2}

和

I_{3}

中的目标数量,

D

是嵌入维度。

The ReID branch is connected to three contrast similarity learning branches, in which (1) Self-contrast uses intraframe direct and inter-frame indirect self-matching to obtain discriminative representations and reduce feature interference from other objects by maximizing self-similarity. (2) Cross-contrast uses cross- and continuous-frame matching, and then adjusts similarity between objects to extract more beneficial features for object association. (3) Ambi-
重识别分支连接到三个对比相似度学习分支,其中(1)自对比使用帧内直接和帧间间接自匹配来获得判别式表征,并通过最大化自相似度来减少其他物体的特征干扰。(2)交叉对比使用帧间和连续帧匹配,然后调整物体之间的相似度以提取更有益的物体关联特征。(3)双
guity contrast takes into account occluded, lost, and emerging objects simultaneously, and these ambiguous objects are matched with each other again to further increase the certainty of subsequent object association. We will describe the specific operation in Section 3.1.1, 3.1.2 and 3.1.3, respectively.
基于内疚对比考虑到被遮挡、丢失和新出现的物体,这些模棱两可的物体相互匹配,从而进一步提高后续物体关联的确定性。我们将在第 3.1.1 节、第 3.1.2 节和第 3.1.3 节分别描述具体操作。

3.1.1 Self-Contrast Module
3.1.1 自对比模块

According to the latent knowledge that objects from the same frame must belong to different classes, we can determine that the similarity between self-to-self should be large enough. So the proposed self-contrast finally lands on a self-to-self comparison, which is a strong, deterministic self-supervised restriction. This strong restriction allows us to improve the similarity of the same targets and reduce the interference from other objects by direct and indirect selfcontrast learning, as shown in the first column of Figure 3.
根据同一帧中的物体必须属于不同类别的潜在知识,我们可以确定自身到自身的相似性应该足够大。因此,所提出的自对比最终落在自身到自身的比较上,这是一种强有力的、确定性的自监督约束。这种强大的约束使我们能够通过直接和间接的自对比学习来提高同一目标的相似度,并减少其他物体的干扰,如图 3 第一列所示。

Direct Self-Contrast. We use current feature matrix

X_{1} = [x_{1}^{0}, x_{1}^{1}, \dots \dots, x_{1}^{N - 1}] \in R^{D \times N}

to directly compute the self-similarity matrix

S_{d s} = X_{1}^{T} X_{1} \in R^{N \times N}

, where

T

represents transpose operation. Then we compute the assignment matrix with a softmax operation, as
直接自对比。我们使用当前特征矩阵

X_{1} = [x_{1}^{0}, x_{1}^{1}, \dots \dots, x_{1}^{N - 1}] \in R^{D \times N}

直接计算自相似性矩阵

S_{d s} = X_{1}^{T} X_{1} \in R^{N \times N}

，其中

T

表示转置运算。然后我们用 softmax 操作计算分配矩阵，如

S_{d s c} = ψ_{row} (S_{d s})

where

ψ_{row}

is row-wise softmax operation.

ψ_{row}

Indirect Self-Contrast. MOT itself operates on multiple frames, so we further perform our self-contrast similarity learning by indirect self-to-self matching. To measure similarity between objects, we calculate cosine similarity to get a similarity matrix between objects of different frames

S_{i s} = X_{1}^{T} X_{2} \in R^{N \times M}

. And similar to Eq.1, we calculate the association matrix

S^{1 \to 2} = ψ_{row} (S_{i s})

and

S^{2 \to 1} = ψ_{row} (S_{i s}^{T})

. The corresponding results

S^{1 \to 2}

and

S^{2 \to 1}

are considered to match the targets in

I_{1}

I_{2}

, and the targets in

I_{2}

I_{1}

, respectively. Each element of

S^{1 \to 2}

and

S^{2 \to 1}

in the

i

-th row and

j

-th column are as follows, respectively:
间接自对比。 MOT 本身在多个帧上运行,因此我们进一步通过间接自对自匹配来执行自对比相似性学习。为了衡量对象之间的相似性,我们计算余弦相似性以获得不同帧之间对象的相似性矩阵

S_{i s} = X_{1}^{T} X_{2} \in R^{N \times M}

。类似于公式 1,我们计算关联矩阵

S^{1 \to 2} = ψ_{row} (S_{i s})

和

S^{2 \to 1} = ψ_{row} (S_{i s}^{T})

。相应的结果

S^{1 \to 2}

和

S^{2 \to 1}

被认为与

I_{1}

到

I_{2}

中的目标相匹配,而

I_{2}

到

I_{1}

中的目标相匹配。

i

行

j

列中的

S^{1 \to 2}

和

S^{2 \to 1}

的每个元素如下:

\begin{aligned} s_{i j}^{1 \to 2} & = \frac{\exp ({(x_{1}^{i})}^{T} \cdot x_{2}^{j} / τ)}{\sum_{j = 0}^{M - 1} \exp ({(x_{1}^{i})}^{T} \cdot x_{2}^{j} / τ)} \\ s_{i j}^{2 \to 1} = & \frac{\exp ({(x_{2}^{j})}^{T} \cdot x_{1}^{i} / τ)}{\sum_{i = 0}^{N - 1} \exp ({(x_{2}^{j})}^{T} \cdot x_{1}^{i} / τ)} \end{aligned}

where

τ

is a temperature hyper-parameter [38].
其中

τ

是温度超参数[38]。
According to the cycle association consistency, after forward association

S^{1 \to 2}

and backward association

S^{2 \to 1}

, each object will match itself again ideally,
根据循环联想的一致性,在前向联想

S^{1 \to 2}

和后向联想

S^{2 \to 1}

之后,每个对象理想情况下都会再次与自己匹配

S_{i s c} = S^{1 \to 2} S^{2 \to 1}

Figure 3. Self-Contrast and Cross-Contrast. We use three sets of indirect self-contrast and two sets of cross-contrast methods using different inputs. For the sake of brevity, we only show a set of specific feature calculation in each contrast.
图 3. 自我对比和交叉对比。我们使用三组间接自我对比和两组交叉对比方法,采用不同的输入。为简明起见,我们仅在每个对比中展示一组特定特征计算。

The corresponding self-contrast loss can be formulated as:
相应的自对比损失可以表述为:

L_{sc} = - \frac{1}{N} (\sum \log (diag (S_{d s c})) + \sum \log (diag (S_{i s c})))

,
where

diag ()

is to get a diagonal matrix.
其中

diag ()

用于获取对角矩阵。
Due to the self-contrast, it is obvious that the similarity between the same targets should be the largest, i.e., the diagonal elements of

S_{d s c}

and

S_{i s c}

obtained above are the largest and should be as close to 1 as possible.
由于自对比,很明显,同一目标之间的相似度应该最大,即,上文获得的

S_{d s c}

和

S_{i s c}

的对角线元素是最大的,应该尽可能接近于 1。

3.1.2 Cross-Contrast Module
3.1.2 交叉对比模块

In almost all scenes of MOT, there is more or less object occlusion, and the similarity of these objects is generally low. Since MOT is an operation on multiple consecutive frames, the negative impact of these occluded objects could last for a long time. Considering theoretically the cross-frame matching results should be the same as the final results of continuous matching, we use a weaker unsupervised restriction, i.e., direct (cross-frame) vs. indirect
几乎在所有 MOT 场景中,都存在或多或少的物体遮挡,且这些物体的相似度通常较低。由于 MOT 是在多个连续帧上进行的操作,这些被遮挡的物体的负面影响可能会持续很长时间。从理论上讲,跨帧匹配结果应该与连续匹配的最终结果相同,我们使用了更弱的无监督约束,即直接(跨帧)与间接。
(continuous-frame) association similarity comparison, to alleviate the above issue.
(连续帧)关联相似性比较,以缓解上述问题。

Specifically, we take three frames

I_{1}, I_{2}, I_{3} \in

R^{H \times W \times 3}

as inputs, similar with Section 3.1.1, we calculate the target matching matrices between different frames, i.e.,

S^{1 \to 2}, S^{2 \to 1}, S^{2 \to 3}, S^{3 \to 2}, S^{1 \to 3}, S^{3 \to 1}

. As shown in the second column of Figure 3, we utilize

S^{2 \to 1}

and

S^{3 \to 2}

to compute the association matrix of

3 \to 1

, similarly use

S^{1 \to 2}

and

S^{2 \to 3}

to compute the association matrix of

1 \to 3

, as
具体而言,我们以三个帧

I_{1}, I_{2}, I_{3} \in

R^{H \times W \times 3}

作为输入,类似于第 3.1.1 节,我们计算不同帧之间的目标匹配矩阵,即

S^{1 \to 2}, S^{2 \to 1}, S^{2 \to 3}, S^{3 \to 2}, S^{1 \to 3}, S^{3 \to 1}

。如图 3 的第二列所示,我们利用

S^{2 \to 1}

和

S^{3 \to 2}

计算

3 \to 1

的关联矩阵,同样使用

S^{1 \to 2}

和

S^{2 \to 3}

计算

1 \to 3

的关联矩阵,如

\begin{aligned} S_{*}^{1 \to 3} = ψ_{row} (S^{1 \to 2} S^{2 \to 3}) \\ S_{*}^{3 \to 1} = ψ_{row} (S^{3 \to 2} S^{2 \to 1}) \end{aligned}

These matching matrices, which are generated indirectly through a middle frame, should be the same as directgenerated matching results.
这些通过中间帧间接生成的匹配矩阵应该与直接生成的匹配结果相同。

We use relative entropy to measure the difference between the two matching distributions. KL divergence [14] is often used to compute the difference between two distributions P and Q ,
我们使用相对熵来衡量两个匹配分布之间的差异。KL 散度[14]通常用于计算两个分布 P 和 Q 之间的差异。

K L (P ‖ Q) = \sum p (x) \log \frac{p (x)}{q (x)}

but it is asymmetrical. We further utilize JS divergence [18] with symmetrical properties,
但它是非对称的。我们进一步利用具有对称性质的 JS 散度[18]。

J S D (P ‖ Q) = \frac{1}{2} K L (P ‖ T) + \frac{1}{2} K L (Q ‖ T)

where

T = (P + Q) / 2

. The corresponding cross-contrast loss is as follows,

T = (P + Q) / 2

。相应的交叉对比损失如下，

L_{c c} = \frac{1}{N} J S D (S_{*}^{1 \to 3} ‖ S^{1 \to 3}) + \frac{1}{K} J S D (S_{*}^{3 \to 1} ‖ S^{3 \to 1})

By enabling the continuous and cross-frame matching results to be close together, we use the different association results to mainly mitigate the differences in the same target caused by occlusion.
通过使连续和跨框匹配结果相互接近,我们利用不同的关联结果主要缓解遮挡造成的同一目标的差异。

3.1.3 Ambiguity Contrast 3.1.3 歧义对比

There are occluded, lost, and emerging objects in the MOT, which will interfere with the whole learning process. We explore this problem and propose the ambiguity contrast module.
在 MOT 中存在遮挡、丢失和新兴的物体,这些会干扰整个学习过程。我们探索了这个问题,并提出了歧义对比模块。

Based on the similarity between objects, we assume that objects with similarity greater than

θ

are the same object. The remaining objects with lower similarity are defined as ambiguous objects here. The low similarity mainly due to occlusion or the disappearance and appearance. In the occlusion case, objects of the same ID do exist, but the similarity is decreased due to the absence of original features and involvement of unrelated features. In the latter case, the similarity between the lost object and the newly emerged
根据对象之间的相似性，我们假定相似度大于

θ

的对象是同一个对象。剩余具有较低相似度的对象在此定义为模糊对象。低相似度主要由于遮挡或消失和出现所导致。在遮挡的情况下，同一 ID 的对象确实存在,但由于缺少原有特征和出现不相关特征,从而导致相似度降低。在后一种情况下,失去的对象与新出现的对象之间的相似度

Figure 4. Ambiguity Contrast. For brevity, we only give the maximum similarity for each row, where certain objects have lower similarity to all other targets, i.e., even the maximum similarity is below the threshold value, which is indicated by red circles in the figure. The corresponding feature embeddings are extracted and then matched again.
图 4. 歧义对比。为了简洁起见,我们仅给出每一行的最大相似度,其中某些物体与所有其他目标的相似度都较低,即使最大相似度也低于阈值值,在图中用红色圆圈表示。提取相应的特征嵌入,然后再次进行匹配。
object is lower because there is really no target that can match it.
对象较低,因为确实没有与之匹配的目标。

Our proposed method for ambiguous objects in the unsupervised training process is shown in Figure 4. We find the ambiguous objects in

I_{1}

according to the matching matrix of

S^{1 \to 2}

based on the similarity. Similarly, we get ambiguous objects in

I_{2}

based on the matching result of

S^{2 \to 1}

. Then these objects are again subjected to similarity calculation to get the similarity matrix

S_{r}^{1 \to 2} \in R^{D \times N_{r}}

and

S_{r}^{2 \to 1} \in R^{D \times M_{r}}

, where

N_{r}

and

M_{r}

are the number of ambiguous objects in

I_{1}

and

I_{2}

, respectively. Finally, the loss of the module is calculated by minimum entropy:
我们提出的用于无监督训练过程中模棱两可对象的方法如图 4 所示。我们根据相似度得到的匹配矩阵来找出

I_{1}

中的模棱两可对象。同样地,我们根据

S^{2 \to 1}

的匹配结果得到了

I_{2}

中的模棱两可对象。然后这些对象再次进行相似性计算以得到相似矩阵

S_{r}^{1 \to 2} \in R^{D \times N_{r}}

和

S_{r}^{2 \to 1} \in R^{D \times M_{r}}

,其中

N_{r}

和

M_{r}

分别是

I_{1}

和

I_{2}

中的模棱两可对象的数量。最后,通过最小熵计算该模块的损失。

\begin{matrix} L_{a c} = - \frac{1}{| N_{r} - M_{r} | + 1} (\frac{1}{N_{r}} S_{r}^{1 \to 2} \log (S_{r}^{1 \to 2}) \\ + \frac{1}{M_{r}} S_{r}^{2 \to 1} \log (S_{r}^{2 \to 1})) \end{matrix}

When the number of ambiguous objects in two frames is equal, considering the two frames are very close to each other, we can assume that there are no disappearing or emerging objects, only occlusion exists, and the entropy should be as small as possible at this time. When the number of ambiguous objects in the two frames is not equal, it must contain disappearing or emerging objects. They are less similar to other objects because they cannot be matched, so we dynamically weaken the loss by adaptive coefficients.
当两个画面中含有等量模糊物体时,考虑到两个画面很接近,我们可以假设没有消失或出现的物体,只存在遮挡,此时熵应尽可能小。当两个画面中模糊物体数量不等时,必然包含消失或出现的物体。由于它们无法与其他物体匹配,因此与其他物体的相似性较低,我们通过自适应系数动态减弱损失。

3.2. UCSL for Unsupervised MOT
3.2. 无监督多目标跟踪的 UCSL

We apply UCSL on FairMOT [45], which composes of a backbone network, a detection head, and a re-identification head. For simplicity, the setting of the backbone and detection head follows FairMOT [45]. The overall architecture
我们在 FairMOT [45]上应用 UCSL,FairMOT [45]由主干网络、检测头和重识别头组成。为简单起见,主干网络和检测头的设置遵循 FairMOT [45]。整体架构
of UCSL is illustrated in Figure 2.
UCSL 的结构如图 2 所示。
In the training stage, we follow the three contrast learning sub-modules in Section 3.1, and the complete loss function of ReID can be defined as follows:
在训练阶段,我们遵循第 3.1 节中的三个对比学习子模块,ReID 的完整损失函数可以定义如下:

L (I_{t}, I_{t - 1}, I_{t - 2}) = L_{s c} + L_{c c} + L_{a c}

where three consecutive frames

I_{t}, I_{t - 1}

and

I_{t - 2}

denote inputs.

L_{s c}, L_{c c}

, and

L_{a c}

denote self-contrast, cross-contrast, and ambiguity contrast losses mentioned above, respectively.
其中三个连续的框

I_{t}, I_{t - 1}

和

I_{t - 2}

表示输入。

L_{s c}, L_{c c}

和

L_{a c}

分别表示上述提到的自对比、交叉对比和歧义对比损失。

In the inference stage, video frames are fed into the network one by one. Then we obtain the corresponding detection results and ReID embeddings. We use the detection bounding boxes in the first frame to initialize multiple trajectories, and then use two-stage matching to complete object association. The overall association idea is also similar to FairMOT [45], using Kalman Filter [11] to predict the position of the objects and match bounding boxes with existing trajectories using embedding distance. For the trajectories and detections that are not matched, we match them using iou distance. Finally, the remaining unmatched detections are initialized as new objects, and the unmatched trajectories are saved for 30 frames and matched when they appear again.
在推理阶段，视频帧逐一输入网络。然后我们获得相应的检测结果和 ReID 嵌入。我们使用第一帧中的检测边界框来初始化多个轨迹,然后使用双阶段匹配来完成目标关联。整体关联思路也与 FairMOT[45]类似,使用卡尔曼滤波器[11]预测目标位置,并使用嵌入距离匹配现有轨迹的边界框。对于未匹配的轨迹和检测,我们使用 iou 距离进行匹配。最后,剩余未匹配的检测被初始化为新目标,未匹配的轨迹被保存 30 帧,在再次出现时进行匹配。

4. Experiments 4. 实验

In this section, the proposed UCSL is evaluated on the MOT17 [24], MOT15 [15] and MOT20 [15]. The description of the datasets and the experimental setup is as follows, and next, we compare UCSL with the advanced approaches. Then, we show the evaluation of the effect of our model with ablation experiments.
在本节中，提出的 UCSL 在 MOT17 [24]、MOT15 [15]和 MOT20 [15]上进行了评估。数据集描述和实验设置如下,接下来我们将 UCSL 与先进方法进行比较。然后,我们展示了通过消融实验对我们模型的影响的评估。

4.1. Datasets 4.1. 数据集

The proposed method is evaluated on MOT15, MOT17 and MOT20. MOT15 is the first dataset provided by MOT Challenge. It contains 22 video sequences, 11 of which are used for training and 11 for testing. The MOT15 is derived from older datasets and has different characteristics, such as fixed or moving cameras, different lighting environments, etc. MOT17 consists of 14 video sequences in total, 7 of which are used for training and 7 for testing, which is the most frequently used in MOT by far. MOT20 contains 4 training videos and 4 testing videos with more complex environments and greater crowd density, so MOT20 is more challenging than any previous datasets.
该提出的方法在 MOT15、MOT17 和 MOT20 上进行了评估。MOT15 是 MOT 挑战赛提供的第一个数据集。它包含 22 个视频序列,其中 11 个用于训练,11 个用于测试。MOT15 从较旧的数据集衍生而来,具有不同的特征,如固定或移动摄像头、不同的照明环境等。MOT17 总共包含 14 个视频序列,其中 7 个用于训练,7 个用于测试,这是迄今为止 MOT 中使用最频繁的。MOT20 包含 4 个训练视频和 4 个测试视频,具有更复杂的环境和更大的人群密度,因此相比于之前的数据集,MOT20 更具挑战性。

To evaluate our method, we use the standard MOT challenge metrics [2, 16, 22], mainly including Multi-Object Tracking Accuracy (MOTA), ID F1 Score (IDF1), Higher Order Tracking Accuracy (HOTA), Mostly Tracked objects (MT), Mostly Lost objects (ML), Number of False Positives (FP), Number of False Negatives (FN) and Number of Identity Switches (IDS), where the higher the first four items the better, the lower the last four items the better, and we use "

↑

" and "

↓

" to represent respectively.
为评估我们的方法,我们使用标准的 MOT 挑战指标[2,16,22],主要包括多目标跟踪精度(MOTA)、ID F1 得分(IDF1)、高阶跟踪精度(HOTA)、主要跟踪对象(MT)、主要丢失对象(ML)、假阳性数量(FP)、假阴性数量(FN)和身份切换数量(IDS),其中前四项数值越高越好,后四项数值越低越好,我们分别使用"

↑

"和"

↓

"来表示。

4.2. Implementation Details
4.2. 实现细节

By default, UCSL is implemented based on the basis of the original FairMOT [45]. We take DLA-34 [43] as the backbone of the model and take the detection branch of the COCO dataset [19] pre-trained model to initialize our model parameters. We follow the most hyper-parameters settings of FairMOT [45].
默认情况下,UCSL 是基于原始 FairMOT [45]的基础实现的。我们采用 DLA-34 [43]作为模型的主干网络,并使用 COCO 数据集 [19]预训练模型的检测分支来初始化我们的模型参数。我们遵循 FairMOT [45]的大部分超参数设置。

We use conventional data enhancement approaches such as rotation, random cropping and horizontal flip, scale transformation, color jittering, etc., and resize the input image size to

1088 \times 608

. We use the Adam optimizer [13] with the initial learning rate set to

10^{- 4}

, and the batch size set to 8 . The similarity threshold

θ

in ambiguity contrast is 0.7. The model iterates 60 epochs on the MOT17 training set in the internal ablation experiments. We eventually train the corresponding dataset for 30 epochs on the basis of a pre-trained model of the CrowdHuman [29] dataset. The learning rate decays to

10^{- 5}

at the 20th epoch. Finally, we train our model on 4 RTX2080ti GPUs in about 10 hours.
我们使用常规的数据增强方法,如旋转、随机裁剪和水平翻转、缩放变换、颜色抖动等,并将输入图像大小调整为

1088 \times 608

。我们使用 Adam 优化器[13],初始学习率设置为

10^{- 4}

,批大小设置为 8。模糊对比度中的相似度阈值

θ

为 0.7。在内部消融实验中,模型在 MOT17 训练集上迭代 60 个 epoch。最终,我们在 CrowdHuman[29]数据集预训练模型的基础上训练对应的数据集 30 个 epoch。学习率在第 20 个 epoch 时衰减至

10^{- 5}

。最后,我们在 4 个 RTX2080ti GPU 上训练我们的模型,耗时约 10 小时。

4.3. Performance and Comparison
4.3. 性能和比较

Comparison on MOT17. In this part, we compare our method with some other supervised and unsupervised methods on MOT17. In general, the performance of supervised methods is more advantageous purely in terms of metrics. As an unsupervised approach, we expect it to be as close as possible to state-of-the-art results. As shown in Table 1, we list some popular methods of joint detection and tracking or embeddings, and our method achieves considerable results, especially on IDF1 and HOTA. As the results provided by SimpleReID [12] are based on public detections, for a fairer comparison, we use the detection results of the same detector, i.e., CenterNet [46], to obtain the corresponding private detection-based results of simpleReID [12]. Since UTrack [21] is not tested on MOT17 test set, we replace it with our designed UCSL and conduct experiments under the same hardware conditions on MOT17. The results are shown in the tenth result row of Table 1. Based on the same FairMOT+CycAs model, although UTrack [21] and UCSL are very close on IDF1 and HOTA, our model improves 1.2 in terms of MOTA. Our model outperforms UTrack [21] in terms of ReID feature extraction with the same detection branch. We notice that IDS is not better compared to other methods, which may attribute to that UCSL tracks more trajectories and has a higher recall.
在 MOT17 上的比较。在这一部分中，我们将我们的方法与其他一些监督和无监督方法在 MOT17 上进行了比较。总的来说，监督方法在纯粹的指标上的性能更有优势。作为一种无监督的方法，我们希望它能尽可能接近当前最先进的结果。如表 1 所示，我们列出了一些流行的联合检测和跟踪或嵌入的方法,我们的方法取得了相当不错的结果,特别是在 IDF1 和 HOTA 指标上。由于 SimpleReID[12]提供的结果是基于公开检测的,为了更公平的比较,我们使用了相同检测器(即 CenterNet[46])的检测结果,得到了 simpleReID[12]相应的私有检测结果。由于 UTrack[21]未在 MOT17 测试集上进行测试,我们用我们设计的 UCSL 替换它,并在 MOT17 上进行了相同的硬件条件下的实验。结果如表 1 第十行所示。基于相同的 FairMOT+CycAs 模型,尽管 UTrack[21]和 UCSL 在 IDF1 和 HOTA 上很接近,但我们的模型在 MOTA 指标上提高了 1.2。我们的模型在 ReID 特征提取方面优于 UTrack[21],采用了相同的检测分支。我们注意到,IDS 并不比其他方法更好,这可能归因于 UCSL 跟踪更多的轨迹并有更高的召回率。

Performance on Other Datasets. In addition to MOT17, we also conduct experiments on MOT15 and MOT20, as shown in Table 1. Since FairMOT [45] uses additional MIX datasets for training besides the CrowdHuman
除 MOT17 外,我们还在 MOT15 和 MOT20 数据集上进行了实验,如表 1 所示。由于 FairMOT[45]在训练时使用了除 CrowdHuman 以外的其他 MIX 数据集。

Method 方法	Unsup 无监督	MOTA $↑$ 模拟器 $↑$	IDF1 $↑$	HOTA $↑$
MOT17
TrackFormer [23]	No 没有	74.1	68.0	57.3
TransTrack [31] 传道导接[31]	No 没有	75.2	63.5	54.1
TransCenter [41] 交通中心 [41]	No 没有	73.2	62.2	54.5
QDTrack [26]	No 没有	68.7	66.3	53.9
JDE [39]	No 没有	56.7	55.0	45.1
CSTrack [17]	No 没有	74.9	72.6	59.3
FairMOT [45]	No 没有	73.7	72.3	59.3
	$\overset{―}{Yes}$	$6 \overset{―}{1} . \overset{―}{7}$	$5 \overset{―}{8} . \overset{―}{1}$	$4 \overset{―}{6} . \overset{―}{9}$
SimpleReID [12] 简单人脸识别[12]	Yes 是	69.0	60.7	50.4
UTrack [21]	Yes 是	71.8	70.3	58.4
UCSL (ours) UCSL (我们的)	Yes 是	73.0	70.4	58.4
MOT15
EAMTT [28] 国际特殊教育和矫正治疗期刊 [28]	No 没有	53.0	54.0	42.5
TubeTK [25]	No 没有	58.4	53.1	42.7
RAR15 [8]	No 没有	56.5	61.3	46.0
MTrack [42]	No 没有	58.9	62.1	$\underset{―}{47.9}$
FairMOT [45]	No 没有	55.0	60.2	45.9
$^{\bar{U}} \overset{―}{UC} \bar{S} \bar{L}$ (ours) $^{\bar{U}} \overset{―}{UC} \bar{S} \bar{L}$ (我们的)	Yes 是	$5 \overset{―}{9} . \overset{―}{1}$	$5 \overset{―}{9} . \overset{―}{2}$	$4 \overset{―}{6} . \overset{―}{3}$
MOT20
TransCenter [41] 交通中心 [41]	No 没有	58.5	49.6	54.1
MTrack [42]	No 没有	63.5	$\underset{―}{69.2}$	55.3
FairMOT [45]	No 没有	55.7	64.6	52.5
	Yes 是	$- 5 \overset{―}{3} . \overset{―}{6}$	$5 \overset{―}{0} . \overset{―}{6}$	$4 \overset{―}{1} . \overset{―}{7}$
SimpleReID [12] 简单人脸识别[12]	Yes 是	61.8	54.8	45.5
UCSL (ours) UCSL (我们的)	Yes 是	62.4	63.0	52.3

Table 1. Performance on MOT17, MOT15 and MOT20 test sets. “Unsup” means unsupervised training. “*” denotes using public detections. Bold and underline indicate unsupervised and supervised best metrics, respectively.
表 1. MOT17、MOT15 和 MOT20 测试集的性能。 "Unsup"表示无监督训练。 "*"表示使用公共检测。粗体和下划线分别表示无监督和监督最佳指标。
dataset, we train and test this method under the same conditions for a fair comparison. On MOT15, the performance of our unsupervised UCSL is metrically stronger than the supervised methods on MOTA, and achieves comparable overall performance on other metrics.
数据集,我们在相同的条件下训练和测试这种方法,以进行公平比较。在 MOT15 上,我们的无监督 UCSL 在 MOTA 指标上的性能优于监督方法,并在其他指标上实现了可比的整体表现。

MOT20 is more complex than the scenarios in MOT15 relatively and has a larger amount of data, so the results of MOT20 are improved over those on MOT15. Our model outperforms the unsupervised SimpleReID [12] largely, especially on IDF1 and HOTA. Compared with supervised methods, the results show that our method is already comparable to them.
MOT20 比 MOT15 相对更复杂,并且具有更大的数据量,因此 MOT20 的结果优于 MOT15。我们的模型大幅优于无监督的 SimpleReID [12]，尤其在 IDF1 和 HOTA 方面。与监督方法相比,结果显示我们的方法已经可与之相媲美。

Performance under JDE paradigm. Our method is based on the JDE paradigm, considering FairMOT [45] as the baseline by default. We show the results of classical and our methods under the same paradigm, as shown in Table 2. Since the JDE [39] does not provide results on the MOT17 test set, we retest them under the same conditions. Due to the same paradigm, our approach also can be applied in other JDE-based methods, e.g., JDE [39].
根据 JDE 范式的性能。我们的方法基于 JDE 范式,默认以 FairMOT [45]作为基线。如表 2 所示,我们在同一范式下展示了传统方法和我们的方法的结果。由于 JDE [39]没有在 MOT17 测试集上提供结果,我们在相同的条件下对其进行了重新测试。由于采用了相同的范式,我们的方法也可应用于其他基于 JDE 的方法,例如 JDE [39]。

Method 方法	MOTA $↑$ 模拟器 $↑$	IDF1 $↑$	HOTA $↑$
JDE(yolov5s) [39]	70.2	66.6	54.1
FairMOT [45]	73.7	72.3	59.3
CSTrack [17]	74.9	72.6	59.3
JDE(yolov5s) + Ours 基于 JDE(yolov5s)和我们自己的方法	69.6	68.0	55.7
FairMOT + Ours FairMOT + 我们的	73.0	70.4	58.4

Table 2. Mehtods on MOT17 under the same paradigm, JDE (joint detection and embeddings). “yolov5s” denotes detection branch baseline. The upper and lower parts are supervised and unsupervised methods, respectively.
表 2. 在同一范式(JDE-联合检测和嵌入)下 MOT17 的方法。"yolov5s"表示检测分支基线。上下两部分分别是有监督和无监督的方法。

Method 方法	MOTA $↑$ 模拟器 $↑$	IDF1 $↑$	HOTA $↑$
YOLOX [9] + BYTE [44]	78.8	77.0	62.7
CenterNet [46] + BYTE [44]	73.1	70.0	58.9
CenterNet [46] + UCSL(ours) CenterNet [46] + UCSL(我们的)	73.0	70.4	58.4

Table 3. Comparison with ByteTrack [44] on MOT17. For a more intuitive comparison, we use YOLOX + BYTE to represent ByteTrack directly.
表 3. 与 ByteTrack [44] 在 MOT17 上的比较。为了更直观的比较,我们使用 YOLOX + BYTE 来直接表示 ByteTrack。

Comparison with TBD. Due to the contradiction between detection and ReID, compared with JDE, indeed, TBD (tracking-by-detection) paradigm could achieve a higher performance limit. But joint training methods output detections and embeddings simultaneously, balancing the accuracy and speed. So under JDE paradigm, we focus on exploring the impact of the unsupervised approach on it, rather than aiming for the state-of-the-art performance. To compare our method with TBD, we consider ByteTrack [44] as the representative for advanced TBD methods, First, it should be noticed that ByteTrack [44] uses trajectories interpolation on MOT17 dataset, which turns it into an offline approach. So we test ByteTrack [44] without interpolation on MOT17, as shown in the first result row of Table 3. In our approach, the detection branch uses CenterNet [46] by default, so the comparison between the second and third result lines of Table 3 demonstrates that the performance impact of our unsupervised approach is comparable to that of BYTE [44] with the same detector.
与 TBD 的比较。由于检测与 ReID 之间的矛盾,与 JDE 相比,TBD(检测跟踪)范式确实可以达到更高的性能限制。但联合训练方法同时输出检测结果和嵌入特征,平衡了准确性和速度。因此在 JDE 范式下,我们专注于探讨无监督方法对其的影响,而不是针对最先进的性能。为了将我们的方法与 TBD 进行比较,我们将 ByteTrack [44]作为先进 TBD 方法的代表。首先需要注意的是,ByteTrack [44]在 MOT17 数据集上使用了轨迹插补,这使其变成了一种离线方法。因此,我们在 MOT17 上测试了不使用插补的 ByteTrack [44],如表 3 第一行结果所示。在我们的方法中,检测分支默认使用 CenterNet [46],因此表 3 第二行和第三行结果的比较,表明了我们的无监督方法的性能影响与使用相同检测器的 BYTE [44]相当。

4.4. Ablation Studies 4.4. 消融研究

We conduct ablation experiments on the MOT17 test set, in which we test all contrast losses mentioned above as well as some settings about input frame interval and output ReID dimension.
我们在 MOT17 测试集上进行消融实验,其中我们测试了上述所有对比损失以及一些关于输入帧间隔和输出 ReID 维度的设置。

Baseline. We are inspired by the method CycAs [38]. As shown in the first row in Table 4, only the triple loss of the original CycAs [38] is used for modeling, aiming to make the probability of the object matching back to itself reach a credible level to ensure cycle consistency. In this method, IDF1 and HOTA are 59.1 and 49.6, respectively.
基线。我们受到 CycAs[38]方法的启发。如表 4 第一行所示，仅使用原始 CycAs[38]的三元损失进行建模，旨在使对象匹配回到自身的概率达到可信水平,以确保循环一致性。在这种方法中,IDF1 和 HOTA 分别为 59.1 和 49.6。

Self-Contrast Loss. We use both the direct and indirect
自对比损失。我们同时使用直接和间接的

$L_{s c}$		$L_{c c}$	$L_{a c}$	IDF1 $↑$	HOTA $↑$	MOTA $↑$ 模拟器 $↑$	MT $↑$ 机器翻译 $↑$	ML $↓$ 机器学习 $↓$	FP $↓$ 前端 $↓$	FN $↓$	IDS $↓$ 身份证 $↓$
$L_{d s c}$	$L_{i s c}$	$L_{c c}$	$L_{a c}$
	CycAs [38] 番杞属[38]			59.1	49.6	69.5	$38.7 %$	$19.0 %$	31309	132816	7839
$✓$				61.5	50.8	69.9	$39.3 %$	$19.8 %$	$29475$	132954	7314
	$✓$			66.6	54.9	69.8	$38.6 %$	$18.1 %$	31341	133173	5997
$✓$	$✓$			67.2	55.0	69.4	$38.6 %$	$18.2 %$	32502	134808	5544
$✓$	$✓$	$✓$		$68.4$	$55.6$	69.8	$39.6 %$	$19.2 %$	32619	132228	5595
$✓$	$✓$	$✓$	$✓$	68.2	55.5	$70.5$	$40.8 %$	$16.2 %$	40569	$125004$	$5208$

Table 4. Performance with different losses on MOT17 test set. “CycAs” represents utilizing original loss function in CycAs [38].

L_{s c}

represents self-contrast loss, where

L_{d s c}

and

L_{i s c}

represents direct and indirect self-contrast loss, respectively.

L_{c c}

represents crosscontrast loss,

L_{a c}

represents ambiguity contrast loss.
表 4. 在 MOT17 测试集上使用不同损失函数的性能。"CycAs"表示使用 CycAs[38]中的原始损失函数。

L_{s c}

表示自对比损失,其中

L_{d s c}

和

L_{i s c}

分别表示直接和间接自对比损失。

L_{c c}

表示交叉对比损失,

L_{a c}

表示歧义对比损失。

Interval 区间	MOTA $↑$ 模拟器 $↑$	IDF1 $↑$	MT $↑$ 机器翻译 $↑$	ML $↓$ 机器学习 $↓$
7	68.5	66.5	$38.2 %$	$20.7 %$
3	68.7	67.7	$37.8 %$	$20.6 %$
1	$70.5$	$68.2$	$40.8 %$	$16.2 %$

Table 5. Comparison of different input frame intervals. Based on the current frame, three consecutive frames are taken as input according to the number of frame intervals listed in the table. For the current frame

t

, for example, when the interval is 1 , the inputs are frames

t, t - 1

and

t - 2

.
表 5. 不同输入帧间隔的比较。基于当前帧,根据表中列出的帧间隔数量,将三个连续帧作为输入。例如,对于当前帧

t

,当间隔为 1 时,输入为帧

t, t - 1

和

t - 2

。

ReID Dim 重新识别暗	MOTA $↑$ 模拟器 $↑$	IDF1 $↑$	MT $↑$ 机器翻译 $↑$	ML $↓$ 机器学习 $↓$
64	69.8	$68.9$	$38.2 %$	$20.2 %$
128	70.5	68.2	$40.8 %$	$16.2 %$
256	$70.7$	68.4	$37.7 %$	$20.6 %$

Table 6. Comparison of different output ReID dimensions.
表 6. 不同输出 ReID 维度的比较。
self-contrast losses to construct the model to extract ReID embeddings better. In the direct and indirect self-contrast subparts, we both use the intra-frame cross-entropy loss to construct the loss function, bringing the same objects closer together in the feature space and different targets further apart in the feature space. As seen from the second, third and fourth rows of Table 4, both direct and indirect selfcontrast learning have little effect on MOTA, while have significantly improved the IDF1 and HOTA metrics and reduced IDS, demonstrating that our self-contrast similarity learning extracts more discriminative ReID embeddings.
自对比损失来构建模型,更好地提取人员重识别嵌入。在直接和间接自对比子部分,我们都使用帧内交叉熵损失来构建损失函数,使同一目标在特征空间中更靠近,不同目标在特征空间中更远离。从表 4 的第 2、3、4 行可以看出,直接和间接自对比学习对 MOTA 几乎没有影响,但大幅提高了 IDF1 和 HOTA 指标,并减少了 IDS,证明我们的自对比相似性学习提取了更具辨别力的人员重识别嵌入。

Cross-Contrast Loss. We use cross- and consecutiveframes matching for cross-contrast similarity learning, with the aim of reducing the effect caused by mutual occlusion between objects. As can be seen from the fifth row in Table 4 , on basis of self-contrast similarity learning, the crosscontrast improves IDF1 and HOTA metrics to 68.4 and 55.6 respectively, and there are also different degrees of improvement on other metrics.
跨对比损失。我们使用跨帧和连续帧匹配进行跨对比相似性学习，旨在减少对象之间相互遮挡造成的影响。从表 4 的第五行可以看出，在基于自对比相似性学习的基础上，跨对比使得 IDF1 和 HOTA 指标分别提高到 68.4 和 55.6，其他指标也有不同程度的提升。

Ambiguity Contrast Loss. In order to consider both the occluded, disappearing and emerging objects in MOT, we use ambiguity contrast to match these ambiguous objects again. From the last row of Table 4, one can see that after adding the ambiguity contrast based on the above two losses, the result has a more obvious improvement mainly in MOTA, MT and IDS, indicating that the method does have a positive effect on maintaining the object’s trajectory.
歧义对比损失。为了同时考虑多目标跟踪中被遮挡、消失和出现的物体,我们使用歧义对比来重新匹配这些不确定的物体。从表 4 的最后一行可以看出,在添加了基于上述两个损失的歧义对比后,结果在 MOTA、MT 和 IDS 等指标上有更明显的改善,表明该方法确实对维持物体轨迹有积极作用。

Input Frame Interval. In our model, the default input is three consecutive frames. To show its superiority, we set different input intervals to train and test the corresponding model on MOT17, as shown in Table 5. Generally speaking, occlusion will last for a long time, but we find the larger the frame interval the weaker the performance, which may be surprising but explainable. Large interval is more suitable for supervised settings, where objects between any frames can be well matched with annotated ID labels. However, long intervals may cause drastic object changes without ID labels, making matching hard and errors accumulated. In addition, during training, there is an intersection between each input frame group. So, long-term temporal relation is taken into consideration just in an implicit manner.
输入帧间隔。在我们的模型中,默认输入是三个连续帧。为了证明其优越性,我们在 MOT17 数据集上设置了不同的输入间隔来训练和测试相应的模型,如表 5 所示。一般来说,遮挡会持续很长时间,但我们发现帧间隔越大,性能越弱,这可能是令人惊讶的,但却可以解释。大间隔更适合有监督的设置,其中任何帧之间的物体都可以与标注的 ID 标签很好地匹配。然而,长间隔可能会导致没有 ID 标签的物体发生剧烈变化,从而使匹配很困难并导致错误积累。此外,在训练期间,每个输入帧组之间都有重叠。因此,长期时间关系只是以隐式的方式考虑进去。

Output ReID Dimension. In Table 6, we compare three different ReID embedding dimensions. As we can see, compared to the 64-dimension ReID embeddings, the 128dimension performs better in terms of MOTA and MT metrics. The 256 -dimension features have a similar improvement effect on the MOTA and IDF1 as the 128-dimension but consume more space and slow down the training and inference speed. For all these reasons, we choose the 128dimension as the output dimension of the ReID branch.
输出 ReID 尺寸。在表 6 中,我们比较了三种不同的 ReID 嵌入尺寸。正如我们所看到的,与 64 维 ReID 嵌入相比,128 维的性能在 MOTA 和 MT 指标方面更好。256 维的特征在 MOTA 和 IDF1 方面也有相似的改善效果,但需要更多的空间,并降低训练和推理速度。出于这些原因,我们选择 128 维作为 ReID 分支的输出尺寸。

5. Conclusions 5. 结论

We propose a simple but effective unsupervised method based on Contrastive Similarity Learning (UCSL). Specifically, we construct three learning types: self-contrast, cross-contrast and ambiguity contrast learning. Combining these sub-modules, the network is able to learn discriminative features consistently and reliably, and handle with occluded, lost and emerging objects simultaneously. Our unsupervised method outperforms existing unsupervised methods, and even surpasses some advanced supervised methods.
我们提出了一种基于对比相似性学习(UCSL)的简单但有效的无监督方法。具体而言,我们构建了三种学习类型:自我对比、交叉对比和模糊对比学习。结合这些子模块,网络能够持续可靠地学习区分性特征,并同时处理遮挡、丢失和新出现的对象。我们的无监督方法优于现有的无监督方法,甚至超过了一些先进的监督方法。

References 参考文献

[1] Favyen Bastani, Songtao He, and Samuel Madden. Selfsupervised multi-object tracking with cross-input consistency. Advances in Neural Information Processing Systems, 34:13695-13706, 2021. 3
巴斯塔尼·法维恩、何松滔、马登·塞缪尔. 基于交叉输入一致性的无监督多目标跟踪. 神经信息处理系统进展, 34:13695-13706, 2021.
[2] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008:110, 2008. 6
[2] Keni Bernardin 和 Rainer Stiefelhagen. 评估多目标跟踪性能:CLEAR MOT 指标. EURASIP 图像与视频处理杂志, 2008:110, 2008. 6
[3] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In IEEE international conference on image processing, pages 34643468, 2016. 2
[3] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos 和 Ben Upcroft. 简单的在线和实时跟踪. 在 IEEE 国际图像处理会议上, 第 3464-3468 页, 2016 年. 2
[4] Guillem Brasó and Laura Leal-Taixé. Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6247-6257, 2020. 2
[4] 吉尔·布拉索和劳拉·莱尔-塔克谢. 学习用于多目标跟踪的神经求解器. 在 2020 年 IEEE/CVF 计算机视觉和模式识别会议论文集中, 第 6247-6257 页.
[5] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision, pages 132-149, 2018. 1, 3
马蒂尔德·卡隆、皮奥特·博亚诺夫斯基、阿尔芒·朱林和马蒂斯·道兹。用于视觉特征无监督学习的深度聚类。见《2018 年欧洲计算机视觉会议论文集》,第 132-149 页。 1, 3
[6] Zuozhuo Dai, Guangyuan Wang, Weihao Yuan, Xiaoli Liu, Siyu Zhu, and Ping Tan. Cluster contrast for unsupervised person re-identification. arXiv preprint arXiv:2103.11568, 2021. 1, 3
[6] 曹卓, 王广源, 袁威浩, 刘晓丽, 朱思雨, 谭萍。无监督的人物再识别的簇对比。arXiv 预印本 arXiv:2103.11568,2021.1,3
[7] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003, 2020. 2
[7] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé。 MOT20:在人群密集场景中对多目标跟踪的基准。 arXiv 预印本 arXiv:2003.09003, 2020。2
[8] Kuan Fang, Yu Xiang, Xiaocheng Li, and Silvio Savarese. Recurrent autoregressive networks for online multi-object tracking. In IEEE Winter Conference on Applications of Computer Vision, pages 466-475, 2018.7
[8] 关芳、余翔、李晓程、Silvio Savarese。用于在线多目标跟踪的递归自回归网络。2018 年 IEEE 冬季计算机视觉应用大会论文集,第 466-475 页。
[9] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.7
[9] 郑戈, 刘松涛, 王锋, 李泽明, 孙建. Yolox: 在 2021 年超越 yolo 系列. arXiv 预印本 arXiv:2107.08430, 2021.7
[10] Allan Jabri, Andrew Owens, and Alexei Efros. Space-time correspondence as a contrastive random walk. Advances in neural information processing systems, 33:19545-19560, 2020. 3
[10] Allan Jabri、Andrew Owens 和 Alexei Efros。空间-时间对应性作为对比性随机游走。神经信息处理系统进展,33:19545-19560,2020。3
[11] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. 1960. 6
[11] 鲁道夫·艾米尔·卡尔曼. 线性滤波和预测问题的新方法. 1960. 6
[12] Shyamgopal Karthik, Ameya Prabhu, and Vineet Gandhi. Simple unsupervised multi-object tracking. arXiv preprint arXiv:2006.02609, 2020. 2, 6, 7
[13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6
[13] Diederik P Kingma 和 Jimmy Ba. Adam: 一种随机优化方法. arXiv 预印本 arXiv:1412.6980, 2014. 6
[14] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79-86, 1951. 5
[14] 索洛门·库尔巴克和理查德·A·莱布勒. 关于信息和充分性. 数理统计年鉴, 22(1):79-86, 1951. 5
[15] Laura Leal-Taixé, Anton Milan, Ian Reid, Stefan Roth, and Konrad Schindler. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015. 2, 6
[15] 劳拉·利尔-泰克斯韦, 安东·米兰, 伊恩·里德, 斯蒂芬·罗斯, 和康拉德·申德勒. MotChallenge 2015: 迈向多目标跟踪的基准测试. arXiv 预印本 arXiv:1504.01942, 2015. 2, 6
[16] Yuan Li, Chang Huang, and Ram Nevatia. Learning to associate: Hybridboosted multi-target tracker for crowded scene. In IEEE conference on computer vision and pattern recognition, pages 2953-2960, 2009. 6
[16]袁力、黄昌和拉姆·内瓦蒂亚。学习相联:用混合增强的多目标跟踪器进行拥挤场景跟踪。在 2009 年 IEEE 计算机视觉和模式识别会议上,第 2953-2960 页。6
[17] Chao Liang, Zhipeng Zhang, Xue Zhou, Bing Li, Shuyuan Zhu, and Weiming Hu. Rethinking the competition between detection and reid in multiobject tracking. IEEE Transactions on Image Processing, 31:3182-3196, 2022. 7
【17】赵亮、张志鹏、周雪、李冰、朱书远、胡维明。在多目标跟踪中,重新思考检测和 Reid 之间的竞争关系。《IEEE 图像处理论文集》,2022,31:3182-3196。
[18] Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145151, 1991. 5
[18] 林建华。基于 shannon 熵的散度度量。IEEE 信息理论汇刊, 37(1):145-151, 1991. 5
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740-755, 2014. 2, 6
[19] 林宗一、迈克尔·梅尔、塞尔吉·贝朗热、詹姆斯·黑斯、皮耶特罗·佩罗纳、德瓦·拉曼、彼得·德尔和 C·劳伦斯·齐特尼克。微软 coco:上下文中的普通物体。 2014 年在欧洲计算机视觉会议上,第 740-755 页。 2, 6
[20] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8738-8745, 2019. 3
林裕天、董萱屹、郑亮、颜艳、杨毅。一种自下而上的聚类方法用于无监督的行人重识别。在人工智能协会年会论文集上,第 33 卷,第 8738-8745 页,2019 年。3
[21] Qiankun Liu, Dongdong Chen, Qi Chu, Lu Yuan, Bin Liu, Lei Zhang, and Nenghai Yu. Online multi-object tracking with unsupervised re-identification learning and occlusion estimation. Neurocomputing, 483:333-347, 2022. 2, 3, 6, 7
[21] 刘乾坤、陈东东、初琪、袁璐、刘彬、张磊和俞能海。基于无监督重识别学习和遮挡估计的在线多目标跟踪。Neurocomputing, 483:333-347, 2022。2, 3, 6, 7
[22] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. Hota: A higher order metric for evaluating multiobject tracking. International journal of computer vision, 129(2):548-578, 2021. 6
[22] Jonathon Luiten、Aljosa Osep、Patrick Dendorfer、Philip Torr、Andreas Geiger、Laura Leal-Taixé和 Bastian Leibe。 HOTA:一种用于评估多目标跟踪的更高阶度量。国际计算机视觉学报,129(2):548-578,2021。 6
[23] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8844-8854, 2022. 2, 7
[23] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer.Trackformer: 使用变换器进行多目标跟踪。载于 2022 年计算机视觉和模式识别国际会议论文集,第 8844-8854 页。2, 7
[24] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016. 2, 6
[24] 安东尼·米兰,劳拉·列阿尔-泰克斯,伊恩·里德,斯特凡·罗斯,康拉德·谢勒。Mot16:一个多目标跟踪基准。ArXiv 预印本 arXiv:1603.00831, 2016.2, 6
[25] Bo Pang, Yizhuo Li, Yifan Zhang, Muchen Li, and Cewu Lu. Tubetk: Adopting tubes to track multi-object in a onestep training model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages

6308 - 6318.7

[25] 庞波, 李依卓, 张易凡, 李沐晨, 陆策武. TubeTK: 在一步训练模型中采用管道跟踪多目标. 在 IEEE/CVF 计算机视觉与模式识别会议论文集中, 第

6308 - 6318.7

页.
[26] Jiangmiao Pang, Linlu Qiu, Xia Li, Haofeng Chen, Qi Li, Trevor Darrell, and Fisher Yu. Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages

164 - 173, 2021.3, 7

[26] 江淼·庞, 林璐·邱, 李夏, 昊丰·陈, 琪·李, Trevor Darrell 和 Fisher Yu. 用于多目标跟踪的准密集相似性学习. 在 IEEE/CVF 计算机视觉和模式识别会议论文集中, 第

164 - 173, 2021.3, 7

页.
[27] Deva Ramanan and David A Forsyth. Finding and tracking people from the bottom up. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages II-II, 2003. 2
[27] 德瓦·拉曼和戴维·A·福赛斯。从底部开始寻找和跟踪人。在 IEEE 计算机学会计算机视觉和模式识别会议上,第二卷,页码 II-II,2003 年。2
[28] Ricardo Sanchez-Matilla, Fabio Poiesi, and Andrea Cavallaro. Online multi-target tracking with strong and weak detections. In European Conference on Computer Vision, pages

84 - 99, 2016.7

[28] 里卡多·桑切斯-马蒂拉、法比奥·波耶西和安德烈亚·卡瓦拉罗。在线多目标跟踪与强检测和弱检测。在欧洲计算机视觉会议上,第

84 - 99, 2016.7

页。
[29] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018. 6
[30] Qiuhong Shen, Lei Qiao, Jinyang Guo, Peixia Li, Xin Li, Bo Li, Weitao Feng, Weihao Gan, Wei Wu, and Wanli Ouyang. Unsupervised learning of accurate siamese tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8101-8110, 2022. 1, 2
沈秋红、乔磊、郭金阳、李培霞、李鑫、李波、冯伟涛、甘伟豪、吴伟、欧阳万里。无监督学习准确的孪生网络跟踪。在 2022 年计算机视觉和模式识别会议上。
[31] Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020. 2, 7
[31] 孙培泽、曹锦坤、江毅、张如枫、谢恩泽、袁澤寰、王昌虎和罗平。TransTrack:基于 Transformer 的多目标跟踪。预印本 arXiv:2012.15460,2020。2,7
[32] ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal Mian, and Mubarak Shah. Deep affinity network for multiple object tracking. IEEE transactions on pattern analysis and machine intelligence, 43(1):104-119, 2019. 2
[32]孙世杰、纳维德·阿克塔尔、宋焕升、阿杰马尔·米安和穆巴拉克·沙。用于多目标跟踪的深度亲和力网络。IEEE 模式分析与机器智能汇刊, 43(1):104-119, 2019.2
[33] Hideaki Uchiyama and Eric Marchand. Object detection and pose tracking for augmented reality: Recent approaches. In 18th Korea-Japan Joint Workshop on Frontiers of Computer Vision, 2012. 1
内山英明和埃里克·马尔尚。增强现实的目标检测和姿态跟踪:最新方法。在第 18 届韩日计算机视觉前沿联合研讨会上,2012 年。
[34] Dongkai Wang and Shiliang Zhang. Unsupervised person reidentification via multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10981-10990, 2020. 3
[34]王东凯和张世良。无监督的人员重新识别基于多标签分类。在 2020 年 IEEE/CVF 计算机视觉和模式识别会议论文集上,第 10981-10990 页。3
[35] Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, and Houqiang Li. Unsupervised deep tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1308-1317, 2019. 1, 2
王宁，宋亦冰，马超，周文刚，刘伟，李厚强。无监督深度跟踪。在 2019 年 IEEE/CVF 计算机视觉与模式识别会议论文集中，第 1308-1317 页。1, 2
[36] Xiaogang Wang. Intelligent multi-camera video surveillance: A review. Pattern recognition letters, 34(1):3-19, 2013. 1
王小刚.多智能相机视频监控:综述.模式识别快报,34(1):3-19,2013.
[37] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2566-2576, 2019.3
王小龙、Allan Jabri 和 Alexei A. Efros。从时间的循环一致性中学习对应。在 2019 年 IEEE/CVF 计算机视觉和模式识别会议上,第 2566-2576 页。
[38] Zhongdao Wang, Jingwei Zhang, Liang Zheng, Yixuan Liu, Yifan Sun, Yali Li, and Shengjin Wang. Cycas: Selfsupervised cycle association for learning re-identifiable descriptions. In European Conference on Computer Vision, pages 72-88, 2020. 3, 4, 7, 8
王忠道、张京伟、郑亮、刘毅轩、孙毅帆、李亚立、王圣金。Cycas:用于学习可重新识别描述的自监督循环关联。在 2020 年的欧洲计算机视觉会议上,第 72-88 页。3,4,7,8
[39] Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, and Shengjin Wang. Towards real-time multi-object tracking. In European Conference on Computer Vision, pages 107-122, 2020. 1, 2, 7
王中道、郑亮、刘一轩、李雅丽、王胜金. 面向实时多目标跟踪. 欧洲计算机视觉会议, 第 107-122 页, 2020 年. 1, 2, 7
[40] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In IEEE international conference on image processing, pages 3645-3649, 2017. 2
[40] 尼古拉·沃伊克，亚历克斯·贝利，迪特里希·保罗斯。利用深度关联度量进行简单的在线和实时跟踪。在 2017 年 IEEE 国际图像处理会议上,第 3645-3649 页。
[41] Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, and Xavier Alameda-Pineda. Transcenter: Transformers with dense queries for multiple-object tracking. arXiv preprint arXiv:2103.15145, 2021. 2, 7
[41] 徐一宏、班宇童、盖亚·德洛姆、甘创、丹妮拉·拉斯和泽维尔·阿拉梅达-皮内达。 TransCenter:具有密集查询的变压器用于多目标跟踪。 arXiv 预印本 arXiv:2103.15145,2021.2,7
[42] En Yu, Zhuoling Li, and Shoudong Han. Towards discriminative representation: Multi-view trajectory contrastive learning for online multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8834-8843, 2022. 3, 7
[42]En Yu、李卓玲和韩寿东。面向判别表征：在线多目标跟踪的多视图轨迹对比学习。2022 年 IEEE/CVF 计算机视觉和模式识别会议论文集,8834-8843 页。3,7
[43] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2403-2412, 2018. 6
于博文、王德全、伊冯·什尔瀚默和特雷弗·达雷尔。深层聚合。在 2018 年 IEEE 计算机视觉和模式识别会议论文集中,第 2403-2412 页。
[44] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. Bytetrack: Multi-object tracking by associating every detection box. In Computer Vision-ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXII, pages 1-21, 2022.7
[44] 张义富、孙培泽、姜毅、余东东、翁复成、袁泽桓、罗平、刘文宇、王兴刚。ByteTrack：通过关联每个检测框的多目标跟踪。在 2022 年 10 月 23 日至 27 日于以色列特拉维夫举行的第 17 届欧洲计算机视觉会议(ECCV 2022)第 22 辑论文集中,第 1 至 21 页,2022 年 7 月。
[45] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11):3069-3087, 2021. 1,

2, 5, 6, 7

[45] 张艺甫, 王春瑜, 王星岗, 曾文君, 刘文宇. Fairmot: 在多目标追踪中检测和 re-identification 的公平性. 计算机视觉国际期刊, 129(11):3069-3087, 2021. 1,

2, 5, 6, 7

[46] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019. 6, 7
[46] 周兴怡、王德泉、Philipp Krähenbühl。物体检测为点。arXiv 预印本 arXiv:1904.07850, 2019。6, 7

*Equal Contribution. †Corresponding Author
对等贡献。†通讯作者

Tracking without Label: Unsupervised Multiple Object Tracking via Contrastive Similarity Learning 无标签跟踪:基于对比相似度学习的无监督多目标跟踪

Abstract 摘要

1. Introduction 1. 简介

2. Related Work 2. 相关工作

3. Method 3.方法

3.1. Contrast Similarity Learning3.1. 对比相似性学习

3.1.1 Self-Contrast Module3.1.1 自对比模块

3.1.2 Cross-Contrast Module3.1.2 交叉对比模块

3.1.3 Ambiguity Contrast 3.1.3 歧义对比

3.2. UCSL for Unsupervised MOT3.2. 无监督多目标跟踪的 UCSL

4. Experiments 4. 实验

4.1. Datasets 4.1. 数据集

4.2. Implementation Details4.2. 实现细节

4.3. Performance and Comparison4.3. 性能和比较

4.4. Ablation Studies 4.4. 消融研究

5. Conclusions 5. 结论

References 参考文献

Tracking without Label: Unsupervised Multiple Object Tracking via Contrastive Similarity Learning
无标签跟踪:基于对比相似度学习的无监督多目标跟踪

3.1. Contrast Similarity Learning
3.1. 对比相似性学习

3.1.1 Self-Contrast Module
3.1.1 自对比模块

3.1.2 Cross-Contrast Module
3.1.2 交叉对比模块

3.2. UCSL for Unsupervised MOT
3.2. 无监督多目标跟踪的 UCSL

4.2. Implementation Details
4.2. 实现细节

4.3. Performance and Comparison
4.3. 性能和比较