\title{
Modality Translation for Object Detection Adaptation Without Forgetting Prior Knowledge
}

\author{
Heitor Rapela Medeiros ${ }^{\star}$, Masih Aminbeidokhti, \\ Fidel Guerrero Pena, David Latortue, \\ Eric Granger, and Marco Pedersoli \\ LIVIA, Dept. of Systems Engineering, ETS Montreal, Canada
}

\begin{abstract}
A common practice in deep learning consists of training large neural networks on massive datasets to perform accurately for different domains and tasks. While this methodology may work well in numerous application areas, it often fails to transfer across modalities due to the larger distribution shift in data captured using different sensors. This paper focuses on the problem of adapting a large object detection model to one or multiple modalities while being efficient. To do so, we propose ModTr as an alternative to the common approach of fine-tuning large models. ModTr consists of adapting the input with a small transformation network trained to minimize the detection loss directly. The original model can therefore work on the translated inputs without any further change or fine-tuning of its parameters. Experimental results on translating from IR to RGB images on two well-known datasets show that this simple ModTr approach provides detectors that can perform comparably or better than standard fine-tuning, without forgetting the original knowledge. This opens the doors to a more flexible and efficient service-based detection pipeline in which, instead of using a different detector for each modality, a unique and unaltered server is constantly running, and multiple modalities with the corresponding translations can query it. Code: https://github.com/heitorrapela/ModTr.
\end{abstract}

\section*{1 Introduction}

Powerful pre-trained models have become essential in the field of computer vision, particularly in object detection (OD) tasks $[29,30]$. These OD models are typically pretrained on extensive natural-image RGB datasets, such as COCO [26]. Moreover, the knowledge encoded by these models can be leveraged for various tasks in a zero-shot way or with additional fine-tuning for downstream tasks [39]. However, adding new modalities to these models, such as infrared (IR), without losing the intrinsic knowledge of the detector remains a challenge [27].

These additional modalities, though not as common as RGB images, are still important in various tasks, like surveillance [5], autonomous driving [37], and robotics [35], which strive to achieve robust performance across environmental changes such as different illumination conditions [2]. The dominant way to adapt pre-trained detectors to
\footnotetext{
* Email: heitor.rapela-medeiros.1@ens.etsmtl.ca
}

![](https://cdn.mathpix.com/cropped/2024_07_04_7d22c98a5c842f9680b9g-02.jpg?height=528&width=1244&top_left_y=392&top_left_x=430)

Fig. 1: Bounding box predictions of different OD methods for infrared images on two benchmarks: LLVIP and FLIR. Yellow and red boxes show the ground truth and predicted detections, respectively. FastCUT is an unsupervised image translation approach that takes infrared (IR) images as input and produces pseudo-RGB images. It does not focus on detection and requires both modalities for training. Fine-tuning is the standard approach for adapting the detector to the new modality. It requires only IR data but forgets the original knowledge of the pre-trained detector. Finally, ModTr, our approach, focuses the translation on detection, requires only IR data, and does not forget the original knowledge, so the detector can be reused for other tasks.

these novel modalities is by fine-tuning the model. However, fine-tuning often results in catastrophic forgetting and can destroy the intrinsic knowledge of the detector [22]. Ideally, we would like to adapt the detector to new modalities without changing the original model. This is most useful for server-side applications, where a single model runs uninterrupted and can be queried with different inputs, possibly from different modalities. The main challenge here is the significant distribution shift between modalities, such as the difference between the visual information in RGB images and the thermal data in IR images. This shift can degrade the performance of models when applied directly to a new input, as the features learned from one modality may not be relevant or present in another. This can ultimately impact the final performance of the detector [40].
Image translation methods have emerged as a powerful tool to overcome the downsides of fine-tuning by narrowing the gap between source and target modalities [16]. These methods do not directly operate on the weight space of the original detector but rather adapt the input values to reduce the discrepancy between the source and target modalities. However, such methods often require access to the source data, or some statistics about it, during training. Furthermore, their primary focus is on image reconstruction quality rather than the final detection task, which can cause a significant drop in performance. For instance, Figure 1 illustrates some detections (in red) obtained with such generated images in b) and d), and with fine-tuning in c). Additionally, results for different methods are provided in the supplementary material.
Our work aims to improve the image translation paradigm while addressing its limitations. Our proposed approach, Modality Translation for OD (ModTr), incorporates the detector's knowledge into the translation module by training it directly for the final detection task.
Fig. 2: Different approaches to deal with multiple modalities and/or domains. (a) The simplest approach is to use a different detector adapted to each modality. This can lead to a high level of accuracy but requires storing multiple models in memory. (b) Our proposed solution is based on using a single pre-trained model, normally trained on the more abundant data (RGB), and then adapting the input through our ModTr model. (c) A single detector is trained on all modalities jointly. This allows the use of a single model but requires access to all modalities jointly, which is often not possible, especially when dealing with large pre-trained models.
Unlike traditional image translation methods, ModTr does not require any source data. It is a simple approach that can be easily integrated with any detector, be it a one-stage or a two-stage detector. This approach also has applications in many settings. By incorporating new modalities for a pre-trained detector, the detector can act as a server that receives images translated by a ModTr block trained for each modality. The detector then generates the desired output with performance comparable to full fine-tuning, while requiring much less computational power for the entire system. In Figure 2, we present several options for integrating new modalities into our system. Figure 2a illustrates the N-Detectors approach, where each detector is trained on a specific modality. While this method is effective, it requires a significant amount of computational resources. Alternatively, we have the N-ModTr-1-Detector approach in Figure 2b, which is less resource-intensive as it only requires training a small translation model for each modality while sharing a single detector. Finally, Figure 2c shows a single detector trained jointly on all modalities, which is not memory-intensive. However, it may not be as effective as the other methods due to the challenges of multimodal learning. In this work, we focus on the effectiveness of our approach for the IR modality, which is commonly used in surveillance and robotics, and on the incremental-modality, server-based detection application, which is important for many settings that require uninterrupted detection predictions.

Our main contributions can be summarized as follows:

(1) We present ModTr, a method for adapting pre-trained ODs from large RGB datasets to new scarce modalities like IR, without requiring access to any source dataset, by translating the input signal.
(2) In contrast to standard fine-tuning, our approach does not modify the original detector weights. This allows the detector to retain the knowledge of the source data while adapting to a new modality. As a result, a single model can be used to handle multiple modalities across various translators. For instance, the same model can be used to process RGB during the daytime and IR at nighttime.
(3) Our proposed approach, ModTr, is evaluated in several scenarios, showcasing its advantages and flexibility. In particular, ModTr achieves competitive OD accuracy compared with image translation methods on two challenging visible/infrared datasets (LLVIP and FLIR).

\section*{2 Related Work}

Object detection. OD is a computer vision task whose objective is to provide labels and localization for the objects in an image [43]. The two main categories of OD are two-stage and one-stage detectors. Two-stage detectors, exemplified by Faster R-CNN [36], first generate regions of interest and then use a second classifier to confirm object presence within those regions. One-stage detectors, on the other hand, streamline the detection process by eliminating the proposal-generation stage, aiming for end-to-end training and real-time inference speeds. RetinaNet [25] is a one-stage OD model that utilizes a focal loss function to address class imbalance during training. Models like FCOS [38] have also emerged in this category, eliminating predefined anchor boxes to potentially enhance inference efficiency. The proposed work investigates these three traditional and powerful detectors: the two-stage Faster R-CNN detector, which is widely used in benchmarks, and the one-stage RetinaNet and FCOS. These detectors were chosen for their simplicity of implementation and integration with other methods, as well as for the range of available pre-trained backbone weights, such as ResNet [12] and MobileNet [14].

Image Translation. Image translation is a pivotal task in computer vision, aiming to map images from a source domain to a target domain while preserving inherent content [32]. The goal is to discover a transformation function such that the distribution of translated images is aligned with the distribution of images in the target domain. The commonly used approaches for image translation are based on variational autoencoders (VAEs) [21] and generative adversarial networks (GANs) [10, 32]. Isola et al. developed Pix2Pix [19], a method consisting of a generator (based on a U-Net) and a discriminator (based on a GAN architecture) that work together to generate images from input data and labels. Zhu et al. then proposed CycleGAN [46], a GAN-based method for unsupervised domain translation. Even though CycleGAN can produce visually convincing results, it is hard to optimize due to the adversarial mechanism and the memory footprint required. In contrast, VAEs are easier to train than GANs but require more constraints in the optimization to produce images of comparable quality. Recent advances include diffusion models, known for their high-quality image generation, although they may not inherently suit domain-translation tasks. To enhance models such as CycleGAN, novel methods like Contrastive Unpaired Translation (CUT) [33] and FastCUT [33] have been introduced. CUT, in particular, accelerates the image translation process by maximizing mutual information between image patches, achieving competitive results quickly. In the context of the RGB/IR modalities, InfraGAN presents an image-level adaptation for RGB-to-IR conversion, prioritizing image quality [31]. This approach is distinct in its focus on optimizing image-quality losses. Moreover, Herrmann et al. explored OD in the RGB/IR modality by adapting IR images to RGB using traditional image preprocessing techniques, allowing the use of RGB object detectors without parameter modification [13]. Recently, there have been many advances in image translation, but they do not target OD tasks.
Therefore, Medeiros et al. proposed HalluciDet [27], which uses an image translation mechanism for OD; however, their method assumes access to the source data in order to pre-train the detector on the RGB domain corresponding to the modality the model needs to adapt to.
Adapting without forgetting. Catastrophic forgetting (CF) is the phenomenon by which a neural network tends to forget previously acquired knowledge when sequentially trained on a different task, replacing it with knowledge tailored to the new objective [41]. CF can be harmful or beneficial. It is considered harmful in situations where retaining the original knowledge while adapting to a different task is necessary; in that case, it is imperative to mitigate the risk of CF. However, some CF can also be beneficial, for instance, to prevent privacy leakage from large pre-trained models, to enhance generalization, or to remove noisy information from the originally acquired knowledge that negatively affects the new tasks. In our case, knowledge forgetting is harmful. There are different ways to address this issue, including simple techniques like decreasing the learning rate [15], using weight decay [4,45] or mixout regularization [23] during fine-tuning, or more complex approaches like Recall and Learn [6], Robust Information Fine-tuning [42], or CoSDA [9]. Some adaptation methods rely on replaying the source data or even on the weights of the initial model to keep some prior information [28]. However, these methods may still lose knowledge since the original parameters are not frozen. Furthermore, in adapting without forgetting, an Adapter, which uses a frozen pre-trained backbone to generate a representation followed by a different classifier for each downstream task [41], can be seen as a powerful method for preserving knowledge. Even though ModTr shares some similarities, we work in the input space to adapt to new modalities, and we address incremental modality adaptation by optimizing the translation directly for the final OD task.

\section*{3 Proposed Method}

Preliminary Definitions. We denote the training set for the OD task as $\mathcal{D}=\{(x, \mathcal{Y})\}$, where $x \in \mathbb{R}^{W \times H \times C}$ represents an image in the dataset, with dimensions $W \times H$ and $C$ channels. Subsequently, the object detection model aims to identify $N$ regions of interest within these images, denoted as $\mathcal{Y}=\left\{\left(b_{i}, c_{i}\right)\right\}_{i=1}^{N}$. Each region of interest $b_{i}$ is defined by the top-left corner coordinates and the width and height of the object. Additionally, a classification label $c_{i}$ is assigned to each detected object, indicating its corresponding class within the dataset. In this study, the number of input channels for the detector is fixed at three, corresponding to RGB-like inputs. In terms of optimization, the primary goal of this task is to maximize detection accuracy, often measured using the average precision (AP) metric across all classes. An OD is formally represented as the mapping $f_{\theta}: \mathbb{R}^{W \times H \times C} \rightarrow \mathcal{Y}$, where $\theta$ denotes the parameter vector. To train such a detector effectively, a differentiable surrogate for the AP metric, referred to as the detection cost function, $\mathcal{C}_{\text{det}}(\theta)$, is employed. The typical structure of such a cost function involves computing the average detection loss over dataset $\mathcal{D}$, denoted as $\mathcal{L}_{\text{det}}$:
\begin{equation*}
\mathcal{L}_{\text{det}}(\theta)=\frac{1}{|\mathcal{D}|} \sum_{(x, \mathcal{Y}) \in \mathcal{D}} \mathcal{L}_{\text{det}}\left[f_{\theta}(x), \mathcal{Y}\right] \tag{1}
\end{equation*}
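For concreteness, below is a minimal PyTorch-style sketch (not the authors' released implementation) of how such a dataset-averaged detection loss can be evaluated with a torchvision detector; the dataloader, device, and variable names are illustrative, and the targets follow the standard torchvision detection format.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pre-trained detector f_theta (COCO weights). In train() mode, torchvision
# detection models return a dictionary of loss terms instead of predictions.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.train()

@torch.no_grad()
def detection_cost(detector, dataloader, device="cpu"):
    """Approximate the dataset-averaged detection loss of Eq. (1), batch by batch."""
    total, n_batches = 0.0, 0
    for images, targets in dataloader:
        # images: list of CxHxW float tensors in [0, 1];
        # targets: list of dicts with "boxes" (Nx4) and "labels" (N,) tensors.
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = detector(images, targets)   # classification + regression losses
        total += float(sum(loss_dict.values()))
        n_batches += 1
    return total / max(n_batches, 1)
```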
Modality Translation Module (ModTr). Our approach primarily consists of an image-to-image translation network responsible for converting the input modality into an RGB-like space intelligible to the detector. These networks typically adopt an encoder-decoder structure to synthesize and reconstruct knowledge in a pixel-wise manner. While we employ U-Net as the translation network in this work, our framework is general and not limited by the translation architecture. In general terms, this mapping is denoted as $h_{\vartheta}^{d}: \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^{W \times H \times 3}$, with a translation network assigned to each available input modality $d$. Unlike the detection network, the number of input channels varies depending on the modality, for instance, $C=1$ for IR and depth images. It is important to note that, being a pixel-level architecture, the output of such a network retains the spatial resolution of the input. However, the number of output channels is consistently fixed at three, corresponding to RGB-like images ($C=3$).
Unlike other image-to-image translation approaches, we drive the translation process using the aforementioned detection cost. Thus, the underlying optimization problem is formulated as $\vartheta^{*}=\arg \min \mathcal{L}_{\text{det}}(\vartheta)$, incorporating the output of the composition $\left(f_{\theta} \circ h_{\vartheta}^{d}\right)(x)$ at the loss-function level. To streamline the learning process, we utilize a residual learning strategy in which the function $h_{\vartheta}^{d}$ focuses on capturing the small variations in the input that are necessary to solve the task. This approach is similar to the one employed in diffusion models, which served as inspiration for our work. For the sake of simplicity, we separate the fusion step from the translation mapping in our notation, as various types of fusion are investigated. Consequently, the proposed image-to-image translation loss function is defined as:
\begin{equation*}
\mathcal{L}_{\mathrm{ModTr}}(x, \mathcal{Y} ; \vartheta)=\mathcal{L}_{\text{det}}\left[f_{\theta}\left(\Phi\left(h_{\vartheta}^{d}(x), x\right)\right), \mathcal{Y}\right] \tag{2}
\end{equation*}
where $\Phi(\cdot, \cdot)$ is a non-parametric fusion function. Note that the output of $h_{\vartheta}^{d}(x)$ is an RGB-like image, whereas $x$ may only consist of a single channel, depending on the input modality. We have chosen this definition to simplify the notation, but appropriate reshaping should be performed during implementation to ensure compatibility.
In addition, note that, while a detection loss is employed to update the translation network, the weight vector $\theta$ remains constant. This constraint is consistent with the premise of this study, where a pre-trained detector is solely available on the server side and remains unaltered. An overview of the proposed approach can be seen in Fig. 2b.
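To make this training procedure concrete, the following is a minimal PyTorch-style sketch of one ModTr optimization step, assuming a segmentation_models_pytorch U-Net as the translator, a torchvision FCOS detector with frozen weights, and an abstract `fuse` function implementing $\Phi$; hyperparameters and names are illustrative rather than the released code.

```python
import torch
import segmentation_models_pytorch as smp
from torchvision.models.detection import fcos_resnet50_fpn

# Translation network h^d: 1-channel IR in, 3-channel RGB-like out in [0, 1].
translator = smp.Unet(encoder_name="resnet34", in_channels=1,
                      classes=3, activation="sigmoid")

# Pre-trained detector f_theta stays frozen; only the translator is updated.
detector = fcos_resnet50_fpn(weights="DEFAULT")
for p in detector.parameters():
    p.requires_grad_(False)
detector.train()  # train mode so the detector returns its loss dictionary

optimizer = torch.optim.Adam(translator.parameters(), lr=1e-4)

def modtr_step(ir_batch, targets, fuse):
    """One optimization step of the ModTr loss; `fuse` implements Phi(h(x), x)."""
    rgb_like = fuse(translator(ir_batch), ir_batch)   # B x 3 x H x W tensor
    loss_dict = detector(list(rgb_like), targets)     # detection loss on translated images
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()      # gradients reach only the translator parameters
    optimizer.step()
    return loss.item()
```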
Fusion strategy. As previously mentioned, we utilize a non-parametric fusion of the intermediate representation $h_{\vartheta}(x)$ and the original input $x$ to simplify the learning process of the translation network. In this context, we investigate three straightforward non-parametric fusion functions inspired by previous works. The first approach involves element-wise summation, commonly used in ResNet and diffusion models. The second approach employs an element-wise product, also known as the Hadamard product, which is particularly interesting for attention mechanisms and has been explored previously for re-calibrating feature maps based on their importance [17]. The final approach is based on DenseFuse [24]; however, instead of applying it at the feature level, we apply it between the input and output of the translation network.

a) $\mathrm{ModTr}_{+}$: The addition mechanism involves forwarding the input modality and summing it with the output of the translation network. This residual connection of the input serves as regularization for the model, aiding the network in learning the missing information necessary for detector operation. This operator learns the new representation by amplifying pixel values when the weights of the translation representation tend toward 1, or preserving the original information when they tend toward 0. Such a range of values is due to our modification of the U-Net, in which we changed the last layer to a sigmoid layer so that we can better control the generated image to be closer to a real RGB-like image:
\begin{equation*}
\mathcal{L}_{\mathrm{ModTr}_{+}}(x, \mathcal{Y} ; \vartheta)=\mathcal{L}_{\text{det}}\left[f_{\theta}\left(h_{\vartheta}^{d}(x)+x\right), \mathcal{Y}\right] \tag{3}
\end{equation*}
b) $\mathrm{ModTr}_{\odot}$: The Hadamard product-based fusion serves as a gating mechanism to filter or highlight information from the input image. In this approach, the output of the translation network acts as a weight map for the input, and they are fused using pixel-wise multiplication, $\odot$. Consequently, the translation network tends to highlight information from the input when a pixel value tends toward 1 and to discard it when it approaches 0. Additionally, the output translation modality can be interpreted as an attention map:
\begin{equation*}
\mathcal{L}_{\mathrm{ModTr}_{\odot}}(x, \mathcal{Y} ; \vartheta)=\mathcal{L}_{\text{det}}\left[f_{\theta}\left(h_{\vartheta}^{d}(x) \odot x\right), \mathcal{Y}\right] \tag{4}
\end{equation*}
c) $\mathrm{ModTr}_{\oplus}$: The last fusion mechanism draws inspiration from DenseFuse [24], which employs the relative importance of pixels as an attention mechanism over both the translation network's output and its input. This attention mechanism operates by providing a weighted average of the channels. The implementation details of this operator can be found in [24]. The proposed loss function is then given by:
\begin{equation*}
\mathcal{L}_{\mathrm{ModTr}_{\oplus}}(x, \mathcal{Y} ; \vartheta)=\mathcal{L}_{\text{det}}\left[f_{\theta}\left(\Phi\left(h_{\vartheta}^{d}(x), x\right)\right), \mathcal{Y}\right] \tag{5}
\end{equation*}

with

$$
\Phi(x, \hat{x})=\frac{x \odot e^{x}+\hat{x} \odot e^{\hat{x}}}{e^{x}+e^{\hat{x}}}
$$
In our design choices, we opt to utilize these straightforward non-parametric functions to assist in optimization while maintaining low inference costs.
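As an illustration, the three fusion variants reduce to simple tensor operations; the sketch below assumes the single-channel input has already been repeated to three channels to match the translator output, and the weighted variant follows the DenseFuse-style formula above.

```python
import torch

def fuse_add(hx, x):
    """Addition fusion (ModTr_+): residual sum of input and translated image, Eq. (3)."""
    return hx + x

def fuse_hadamard(hx, x):
    """Hadamard fusion (ModTr_odot): the translated image gates the input pixel-wise, Eq. (4)."""
    return hx * x

def fuse_weighted(hx, x):
    """DenseFuse-style fusion (ModTr_oplus): softmax-weighted average of input and output, Eq. (5)."""
    wx, wh = torch.exp(x), torch.exp(hx)
    return (x * wx + hx * wh) / (wx + wh)

# Example: translator output in [0, 1] and an IR input repeated to three channels.
hx = torch.rand(1, 3, 256, 256)
x = torch.rand(1, 1, 256, 256).repeat(1, 3, 1, 1)
fused = fuse_weighted(hx, x)
```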

\section*{4 Results and Discussion}

\subsection*{4.1 Experimental Methodology}

(a) Datasets: LLVIP: The LLVIP dataset is a surveillance dataset composed of 30,976 images, of which 24,050 (12,025 IR and 12,025 RGB paired images) are used for training and 6,926 for testing (3,463 IR and 3,463 RGB paired images), with only pedestrians annotated. FLIR ALIGNED: For FLIR, we used the sanitized and aligned paired sets provided by Zhang et al. [44], which contain 10,284 images: 8,258 for training (4,129 IR and 4,129 RGB) and 2,026 for testing (1,013 IR and 1,013 RGB). The FLIR images are taken from the perspective of a camera mounted at the front of a car, with a resolution of 640 by 512. The dataset contains the classes bicycle, dog, car, and person. It has been found that, in the case of FLIR, the "dog" objects are inadequate for training [3], so we removed them.
(b) Implementation details: In our experiments, we used 80% of the training set for training and the rest for validation. All results reported are on the test set. For the FLIR dataset, we removed the dogs, as there are not many annotations. As starting pre-trained weights for the detectors, we used Torchvision models with COCO [26] weights, and for the U-Net translation network, we used PyTorch Segmentation Models [18], changing the last layer to a 3-channel (RGB-like) output with a sigmoid activation so that the output is closer to an image with values between 0 and 1, performing translation instead of traditional segmentation. For the translation network backbones, we explored our default ResNet34, and for subsequent studies on reducing parameters, we dive into the ResNet and MobileNet families. All the code is available on GitHub for reproducibility of the experiments. To ensure fairness, we trained the detectors with the same library versions and the same experimental design, i.e., data order, augmentations, etc. Furthermore, we trained with the PyTorch Lightning [8] training framework, evaluated the APs with TorchMetrics [7], and logged all experiments with the WandB [1] logging tool. The different measured APs can be found in the supplementary material as additional metrics provided with this work.
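As a sketch of these implementation choices (library calls as documented, hyperparameters illustrative), the translation network with a 3-channel sigmoid output and the COCO pre-trained detectors can be instantiated roughly as follows.

```python
import segmentation_models_pytorch as smp
from torchvision.models.detection import (fasterrcnn_resnet50_fpn,
                                           fcos_resnet50_fpn,
                                           retinanet_resnet50_fpn)

# U-Net translator: IR (1 channel) -> RGB-like output (3 channels, sigmoid in [0, 1]).
translator = smp.Unet(
    encoder_name="resnet34",      # default backbone; lighter encoders reduce parameters
    encoder_weights="imagenet",
    in_channels=1,
    classes=3,
    activation="sigmoid",
)

# COCO pre-trained detectors that remain frozen and act as the shared "server".
detectors = {
    "fcos": fcos_resnet50_fpn(weights="DEFAULT"),
    "retinanet": retinanet_resnet50_fpn(weights="DEFAULT"),
    "faster_rcnn": fasterrcnn_resnet50_fpn(weights="DEFAULT"),
}
```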

\subsection*{4.2 Comparison with Translation Approaches}

In this section, we compare the ModTr method with different image-to-image methods employing different learning strategies. These include basic image processing strategies [13]; reconstruction strategies such as CycleGAN [47]; CUT [34] and FastCUT [34], which employ a contrastive learning approach; and HalluciDet [27], which utilizes a detection-based loss. As outlined in Table 1, we evaluated the methods based on their final detection performance across three commonly used detectors: FCOS, RetinaNet, and Faster R-CNN. The reported results are derived from the infrared test set and are averaged over three different seeds, which helps mitigate the impact of randomness across runs and splits of the training and validation datasets.
For each method, we also consider its dependency on the source data (RGB) and ground truth bounding boxes on the IR images (Box). Methods that rely on reconstruction techniques do not require box annotations on IR images but cannot provide accurate translations for detection purposes. However, HalluciDet and ModTr require box annotations to adjust the input image in a discriminative manner. The main difference between HalluciDet and ModTr is the use of source images. HalluciDet requires RGB images for an initial fine-tuning of the model, while our approach can work without that fine-tuning by reusing the detector's zero-shot knowledge.
The proposed ModTr method demonstrates robustness across the three detectors and consistently exhibits improvements on two different datasets: LLVIP [20] and FLIR aligned [11]. It is worth noting that each algorithm described in Table 1 employs a different training supervision. For instance, CycleGAN employs an adversarial mechanism with both RGB and infrared modalities in an unpaired setting. Similarly, CUT and FastCUT operate with positive and negative patches in an unpaired setting. In contrast, HalluciDet does not require the presence of both modalities during training but employs a detection mechanism during training similar to ours. Note that our approach solely requires examples from the target modality. In this section, we present the performance of our best variant, $\mathrm{ModTr}_{\odot}$. For additional results, refer to the supplementary materials.
Table 1: Comparison of the detection performance (AP) of three different detectors (FCOS, RetinaNet, and Faster R-CNN) over different image-to-image methods that translate infrared images into RGB-like images, and our proposed method (ModTr). The methods were evaluated on the infrared test sets of the LLVIP and FLIR datasets. The RGB column indicates whether the method needs access to RGB images during training, and Box refers to the use of ground-truth bounding boxes during training.
| Image translation | RGB | Box | FCOS | RetinaNet | Faster R-CNN |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Test Set IR (Dataset: LLVIP) | | | | | |
| Histogram Equal. [13] | | | 31.69 ± 0.00 | 33.16 ± 0.00 | 38.33 ± 0.02 |
| CycleGAN [47] | ✓ | | 23.85 ± 0.76 | 23.34 ± 0.53 | 26.54 ± 1.20 |
| CUT [34] | ✓ | | 14.30 ± 2.25 | 13.12 ± 2.07 | 14.78 ± 1.82 |
| FastCUT [34] | ✓ | | 19.39 ± 1.52 | 18.11 ± 0.79 | 22.91 ± 1.68 |
| HalluciDet [27] | ✓ | ✓ | 28.00 ± 0.92 | 19.95 ± 2.01 | 57.78 ± 0.97 |
| $\mathrm{ModTr}_{\odot}$ (ours) | | ✓ | 57.63 ± 0.66 | 54.83 ± 0.61 | 57.97 ± 0.85 |
| Test Set IR (Dataset: FLIR) | | | | | |
| Histogram Equal. [13] | | | 22.76 ± 0.00 | 23.06 ± 0.00 | 24.6 |
| CycleGAN [47] | ✓ | | 23.92 ± 0.97 | 23.71 ± 0.70 | 26.85 ± 1.23 |
| CUT [34] | ✓ | | 18.16 ± 0.75 | 17.84 ± 0.75 | 20.29 ± 0.48 |
| FastCUT [34] | ✓ | | 24.02 ± 2.37 | 22.00 ± 2.73 | 26.68 ± 2.59 |
| HalluciDet [27] | ✓ | ✓ | 23.74 ± 2.09 | 22.29 ± 0.45 | 29.91 ± 1.18 |
| $\mathrm{ModTr}_{\odot}$ (ours) | | ✓ | 35.49 ± 0.94 | 34.27 ± 0.27 | 37.21 ± 0.46 |
As depicted in Table 1, the detection performance of ModTr on the LLVIP dataset exhibits significant improvements. Specifically, it surpasses HalluciDet, the second best, by more than 29.0 AP with both the FCOS and RetinaNet architectures, while obtaining comparable results with Faster R-CNN. Such a disparity with the previous technique can be attributed to the loss of prior knowledge inherent to HalluciDet, which necessitates a pre-fine-tuning strategy on the source modality. Although performance on the FLIR dataset also improves, the dataset's inherent challenges, such as the changing background of a moving-car setup, make detection more difficult. Nonetheless, our proposal consistently enhances results, with improvements of more than 11 AP for FCOS and RetinaNet, and over 7 AP for Faster R-CNN. We also observed improvements on $\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$; because of space constraints, we include these in the Supplementary Materials. These promising results indicate that our proposal can effectively translate images from the original IR modality to an RGB-like representation that is sufficiently close to the source data to be usable by the detector.

\subsection*{4.3 Translator and Fine-tuning of the Detector}

In this section, we further demonstrate that the proposed approach, in which the translation network is trained through the frozen detector, preserves the detector's knowledge. Here, we compare the performance of the three proposed fusion functions with that of standard fine-tuning of the detector.
As depicted in Tab. 2, we include the three fusion variants of ModTr and the fine-tuning baseline. The table reports AP on the LLVIP and FLIR datasets, with a consistent trend across all detectors (FCOS, RetinaNet, and Faster R-CNN). Furthermore, in the case of the FLIR dataset, we observe improvements over standard detector fine-tuning with all of the fusion techniques. As demonstrated, our approach surpasses standard fine-tuning while maintaining the detector's performance on the original modality. It is worth noting that our method also improves performance in terms of localization metrics such as $\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$ compared to fine-tuning alone; we provide detailed results in the Supplementary Materials.
Table 2: Comparison of the fine-tuning of the detector and the different non-parametric fusion strategies for ModTr. The detection performance of three different detectors (FCOS, RetinaNet, and Faster R-CNN) was evaluated on the infrared test set of LLVIP and FLIR datasets.
| Method | FCOS | RetinaNet | Faster R-CNN |
| :--- | :---: | :---: | :---: |
| Test Set IR (Dataset: LLVIP) | | | |
| Fine-Tuning (FT) | 57.37 ± 2.19 | 53.79 ± 1.79 | 59.62 ± 1.23 |
| $\mathrm{ModTr}_{+}$ | 56.44 ± 0.75 | 53.18 ± 1.03 | 57.14 ± 0.50 |
| $\mathrm{ModTr}_{\oplus}$ | 57.01 ± 0.71 | 54.43 ± 0.35 | 56.95 ± 0.37 |
| $\mathrm{ModTr}_{\odot}$ | 57.63 ± 0.66 | 54.83 ± 0.61 | 57.97 ± 0.85 |
| Test Set IR (Dataset: FLIR) | | | |
| Fine-Tuning (FT) | 27.97 ± 0.59 | 28.46 ± 0.50 | 30.93 ± 0.46 |
| $\mathrm{ModTr}_{+}$ | 34.63 ± 0.24 | 33.70 ± 0.59 | 37.09 ± 0.74 |
| $\mathrm{ModTr}_{\oplus}$ | 34.94 ± 0.52 | 33.72 ± 0.22 | 37.16 ± 0.47 |
| $\mathrm{ModTr}_{\odot}$ | 34.60 ± 0.38 | 33.85 ± 0.34 | 37.01 ± 0.15 |

\subsection*{4.4 Different Backbones for ModTr}

In this context, we focus on evaluating the ModTr architecture and examining the trade-off between performance and parameter cost. It is widely recognized that increasing the number of parameters can enhance performance, but this relationship is not strictly linear. We demonstrate that models with fewer parameters can still achieve good performance; for example, $\mathrm{MobileNet}_{v2}$, with fewer parameters than $\mathrm{ResNet}_{18}$, sometimes outperformed it. This trade-off highlights the versatility of the model, which can be deployed with MobileNet-based architectures and utilized in low-cost devices. In Table 3, we successfully reduced the default number of parameters from 24.4M ($\mathrm{ResNet}_{34}$) to 6.6M using $\mathrm{MobileNet}_{v2}$ while maintaining similar performance. For instance, on LLVIP, $\mathrm{MobileNet}_{v2}$ achieved a mean AP of 56.15, comparable to the 56.35 AP of $\mathrm{ResNet}_{34}$ (other APs and detectors are reported in the supplementary material).
This approach opens up new possibilities, particularly in scenarios where using one translation network and one detector (e.g., one ModTr and one detector for IR/RGB) proves advantageous. This setup requires a total of 44.9M parameters, compared to 83.6M parameters when employing two detectors, one for each modality (for example, with Faster R-CNN). Similar reductions in parameter cost were observed for FCOS (from 66.4M to 36.3M) and RetinaNet (from 68M to 37.1M) when using one detector for both modalities, while preserving the knowledge of the previous modality and incorporating a new one. These numbers are based on $\mathrm{MobileNet}_{v3s}$, which strikes a balance between performance and the number of parameters, making it suitable for memory-restricted systems. The complete evaluations for FCOS and RetinaNet are included in the supplementary material. A rough parameter-counting sketch is given below.
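To make the parameter accounting above concrete, the following sketch (not part of the original experiments) counts parameters with PyTorch. The torchvision Faster R-CNN and MobileNetV3-small models are only stand-ins for the detector and the ModTr translation backbone, so the printed totals are indicative rather than exact reproductions of the numbers quoted in the text.

```python
# Minimal sketch: compare the parameter budget of "one detector per modality"
# versus "one shared detector plus one small translation network". The models
# below are stand-ins (the actual ModTr translator is a U-Net-style network),
# so the totals only approximate the figures reported in the paper.
import torch
from torchvision.models import mobilenet_v3_small
from torchvision.models.detection import fasterrcnn_resnet50_fpn

def count_params(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

detector = fasterrcnn_resnet50_fpn(weights=None)    # stand-in RGB detector
translator = mobilenet_v3_small(weights=None)       # stand-in ModTr backbone

two_detectors = 2 * count_params(detector)          # one detector per modality
shared_setup = count_params(detector) + count_params(translator)

print(f"Two detectors:            {two_detectors / 1e6:.1f}M parameters")
print(f"One detector + one ModTr: {shared_setup / 1e6:.1f}M parameters")
```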

4.5 Knowledge Preservation through Input Modality Translation

In this section, we show how the various adaptation paradigms depicted in Figure 2 are capable of solving the final task while preserving intrinsic knowledge. We compare our proposed method, ModTr, to two other fine-tuning baselines. The first baseline consists of N detectors, each fine-tuned individually on its target modality. The second baseline is a single detector trained on the joint modalities with balanced sampling.
Table 4 reports the final performance. While both the N-detectors and the single detector exhibit relatively similar final performance, ModTr stands out as the only method capable of preserving knowledge while excelling at the final task. Specifically, both fine-tuning baselines compromise the detector's zero-shot capability. This trend is consistent across detectors and datasets. A sketch of how the shared-detector setting can be served at inference time follows.
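To illustrate how the N-ModTr-1-Det. configuration can be served in practice, here is a minimal routing sketch under stated assumptions: `detector` is the single frozen RGB detector, `translators` maps each non-RGB modality to its ModTr network, and all names are illustrative placeholders rather than the released API.

```python
# Sketch of inference routing for the "N-ModTr-1-Detector" setting: a single
# unaltered detector serves every modality, and each non-RGB input is first
# mapped to a pseudo-RGB representation by its own translation network.
# The argument names are placeholders, not the official ModTr interface.
from typing import Dict
import torch

@torch.no_grad()
def detect(image: torch.Tensor, modality: str,
           detector: torch.nn.Module,
           translators: Dict[str, torch.nn.Module]):
    if modality != "rgb":
        image = translators[modality](image)  # e.g., IR -> pseudo-RGB
    # torchvision-style detectors take a list of CHW tensors and return
    # one dict of boxes/labels/scores per image.
    return detector([image])
```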

4.6 Visualization of ModTr Translated Images

In Figure 3, we present qualitative results for LLVIP and FLIR, alongside a comparison with fine-tuning. Panel a) displays the ground-truth RGB, while b) showcases the results of fine-tuning using IR. Subsequently, in c), we present ModTr with addition-based fusion, in d) ModTr with Hadamard product-based fusion, and in e) ModTr with
Table 3: Study of the translation backbone, describing the trade-off between the number of parameters of the translation network and the detection performance of each detector. The results comprise the three default detectors (FCOS, RetinaNet, and Faster R-CNN), reported in AP, under different MobileNet and ResNet backbones for the translation network.
Test Set IR (Dataset: LLVIP)

| Method | Params. | $\mathbf{AP} \uparrow$ |
| :--- | :---: | :---: |
| Faster R-CNN | 41.8 M | |
| MobileNet$_{v3s}$ | +3.1 M | $54.51 \pm 0.28$ |
| MobileNet$_{v2}$ | +6.6 M | $56.15 \pm 0.51$ |
| ResNet$_{18}$ | +14.3 M | $55.53 \pm 1.14$ |
| ResNet$_{34}$ | +24.4 M | $56.35 \pm 0.65$ |

Test Set IR (Dataset: FLIR)

| Method | Params. | $\mathbf{AP} \uparrow$ |
| :--- | :---: | :---: |
| Faster R-CNN | 41.8 M | |
| MobileNet$_{v3s}$ | +3.1 M | $32.06 \pm 0.75$ |
| MobileNet$_{v2}$ | +6.6 M | $36.77 \pm 0.67$ |
| ResNet$_{18}$ | +14.3 M | $36.68 \pm 0.22$ |
| ResNet$_{34}$ | +24.4 M | $37.21 \pm 0.46$ |
attention-based fusion, for each dataset, using the Faster R-CNN detector. Due to space constraints, we provide additional visualizations for the other detectors and the competing methods in the supplementary material. Notably, the IR input yields some false positives, especially when detected objects overlap, which is mitigated by our method. As described, the product-based fusion yields a darker image, contrasting with the additive character of the other fusion strategies. Further insights into the visualizations are also provided in the supplementary material, showing how our method effectively blurs or removes regions that do not contain elements of the target classes, thereby facilitating detection. Although the obtained intermediate representations are not visually pleasing, they prove more effective at encoding the information the detector needs. Additionally, we conducted experiments with loss terms aimed at enhancing the visual quality of the image, but they were not conclusive in terms of improving detection performance.
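As a rough sketch of the non-parametric fusion operators discussed above, the snippet below combines the translator output with the channel-replicated IR input using addition and a Hadamard product; the attention-based variant is deliberately omitted because its exact form is not reproduced here. Treat this as an assumption-laden illustration, not the authors' implementation; the clamping to [0, 1] is our own choice.

```python
# Illustrative fusion operators: x is the 3-channel IR input in [0, 1] and
# t_x is the output of the translation network for the same image. The
# attention-based fusion used in the paper is not reproduced here.
import torch

def fuse_additive(x: torch.Tensor, t_x: torch.Tensor) -> torch.Tensor:
    # Addition-based fusion; clamping keeps the result a valid image tensor.
    return torch.clamp(x + t_x, 0.0, 1.0)

def fuse_hadamard(x: torch.Tensor, t_x: torch.Tensor) -> torch.Tensor:
    # Hadamard (element-wise) product fusion, which tends to produce darker
    # images than the additive variant, as noted in the text.
    return x * t_x
```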

4.7 Fine-tuning of ModTr and the Detector

The main reason to use ModTr is to avoid fine-tuning the detector for a specific task, so that it preserves its knowledge and can be used for multiple modalities. However, in this section, we consider what happens if we jointly learn the ModTr and detector weights. Results are reported in Figure 4. We see that fine-tuning the detector can further boost performance. Thus, another application of ModTr could be to use it to improve the fine-tuning of a detector at a reduced additional computational cost.
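A minimal sketch of the difference between the two settings compared in Figure 4 is given below: with ModTr alone the detector is frozen and only the translator is optimized, while in ModTr + FT both sets of weights are passed to the optimizer. The variable names (`modtr`, `detector`) and the choice of AdamW are illustrative assumptions, not the paper's exact training recipe.

```python
# Sketch of building the optimizer for "ModTr only" (frozen detector) versus
# "ModTr + FT" (joint training). Names and hyperparameters are placeholders.
import itertools
import torch

def make_optimizer(modtr: torch.nn.Module, detector: torch.nn.Module,
                   joint: bool, lr: float = 1e-4) -> torch.optim.Optimizer:
    if joint:
        # ModTr + FT: translator and detector are updated together.
        params = itertools.chain(modtr.parameters(), detector.parameters())
    else:
        # ModTr only: freeze the detector so its original knowledge is kept.
        for p in detector.parameters():
            p.requires_grad_(False)
        params = modtr.parameters()
    return torch.optim.AdamW(params, lr=lr)
```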
Fig. 3: Illustration of a sequence of 8 images from the LLVIP and FLIR datasets for Faster R-CNN. For each dataset, the first row is the RGB modality, followed by the IR modality, then FastCUT, and the different representations created by ModTr and its variations. For visualizations with other detectors, see the Suppl. Material.

5 Conclusion

In this work, we present ModTr, a novel method for adapting ODs without changing their parameters. The proposed approach performs well in different settings and yields better results than powerful image-to-image models and previous competitors. The proposed ModTr method was evaluated on different tasks, such as detection based on image
Table 4: Average Precision (AP) comparison of the knowledge-preserving techniques N-Detectors, 1-Detector, and N-ModTr-1-Detector. The results comprise three detectors (FCOS, RetinaNet, and Faster R-CNN) across diverse datasets: LLVIP, FLIR, and COCO.
| Detector | Dataset | N-Detectors | 1-Detector | N-ModTr-1-Det. |
| :--- | :--- | :---: | :---: | :---: |
| FCOS | LLVIP | $57.37 \pm 2.19$ | $\mathbf{58.55 \pm 0.89}$ | $57.63 \pm 0.66$ |
| | FLIR | $27.97 \pm 0.59$ | $26.70 \pm 0.48$ | $\mathbf{35.49 \pm 0.94}$ |
| | COCO | $00.18 \pm 0.01$ | $00.33 \pm 0.04$ | $\mathbf{38.41 \pm 0.00}$ |
| RetinaNet | LLVIP | $53.79 \pm 1.79$ | $53.26 \pm 3.02$ | $\mathbf{54.83 \pm 0.61}$ |
| | FLIR | $28.46 \pm 0.50$ | $25.19 \pm 0.72$ | $\mathbf{34.27 \pm 0.27}$ |
| | COCO | $00.22 \pm 0.02$ | $00.29 \pm 0.01$ | $\mathbf{35.48 \pm 0.00}$ |
| Faster R-CNN | LLVIP | $59.62 \pm 1.23$ | $\mathbf{62.50 \pm 1.29}$ | $57.97 \pm 0.85$ |
| | FLIR | $30.93 \pm 0.46$ | $28.90 \pm 0.33$ | $\mathbf{37.21 \pm 0.46}$ |
| | COCO | $00.31 \pm 0.01$ | $00.40 \pm 0.00$ | $\mathbf{39.78 \pm 0.00}$ |
Fig. 4: Comparison of the performance of fine-tuning with ModTr and normal fine-tuning on the FLIR dataset for the three different detectors (FCOS, RetinaNet, and Faster R-CNN). In blue, Fine-Tuning; in orange, $\mathrm{ModTr}_{\odot}$; and in green, $\mathrm{ModTr}_{\odot}+\mathrm{FT}$.
translation, comparison with traditional fine-tuning, and incremental modality addition. Experimental results showed the high performance and versatility of our method. Our approach also benefits from preserving the full knowledge of the detector, which opens the possibility of using the translation network as a node that changes the modality for an unaltered detector. This is much more convenient in terms of flexibility and computation than having a specialized OD for each modality. Future work will expand the knowledge of the OD by incorporating additional text embeddings to help ModTr perform few-shot and zero-shot learning in the image translation space targeting detection.
Acknowledgments: This work was supported by Distech Controls Inc., the Natural Sciences and Engineering Research Council of Canada, the Digital Research Alliance of Canada, and MITACS.

Supplementary Material: Modality Translation for Object Detection Adaptation Without Forgetting Prior Knowledge

Heitor Rapela Medeiros$^{\star}$, Masih Aminbeidokhti, Fidel Guerrero Pena, David Latortue, Eric Granger, and Marco Pedersoli
LIVIA, Dept. of Systems Engineering. ETS Montreal, Canada

In this supplementary material, we provide additional information to reproduce our work. The source code is provided alongside the supplementary material, and the official repository will be released. This supplementary material is divided into the following sections: Detailed Diagrams (Section 1); Quantitative Results (Section 2), in which we provide additional numerical results in terms of AP metrics; and Qualitative Results (Section 3), in which we provide additional visualizations.

1 Detailed Diagrams

In this section, we expand on and detail the diagrams provided in the main manuscript. In Figure 1, we describe the traditional approach of employing specialized detectors for individual modalities. For example, we depict an RGB detector (highlighted in purple) and two IR detectors (highlighted in green and yellow), each trained on a different dataset.
In Figure 2, we illustrate the proposed ModTr. This method uses a single pre-trained detector model, typically trained on the more prevalent data (i.e., RGB), and an additional input adaptation network per modality. For clarity, the RGB modality (along with the RGB detector) is depicted in purple, while the adaptation block of ModTr is shown in green for one IR modality and in red for another IR modality with a different distribution.
Lastly, we present a final diagram (Figure 3), depicting a detector trained on the joint distribution of all modalities. The detector, shown in purple, undergoes training with all available modalities. While this approach enables the model to learn shared features, it may not be optimal. Nevertheless, it incurs a lower memory cost compared to employing one detector for each modality.
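One simple way to realize the balanced sampling used by this joint-training baseline (Section 4.5 of the main paper) is sketched below; it is an assumption about the setup, not the authors' exact data pipeline, and the dataset objects are placeholders.

```python
# Sketch of balanced sampling over modalities for the single joint detector of
# Fig. 3: datasets are concatenated and each sample is weighted inversely to
# the size of its source dataset, so every modality is drawn about equally.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def balanced_loader(datasets, batch_size: int = 8) -> DataLoader:
    joint = ConcatDataset(datasets)
    weights = torch.cat(
        [torch.full((len(d),), 1.0 / len(d)) for d in datasets])
    sampler = WeightedRandomSampler(weights, num_samples=len(joint),
                                    replacement=True)
    return DataLoader(joint, batch_size=batch_size, sampler=sampler)
```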

2 Quantitative Results

In this section, we provide further details for the different AP metrics of the experiments in the main manuscript. In Table 1, we compare the best ModTr with different image translation strategies. Some of the competitors use RGB data for the translation, while others utilize bounding box annotations. Here, we can see that ModTr outperforms the competitors in terms of AP metrics across different detectors, demonstrating
Fig. 1: The simplest approach is to use a different detector adapted to each modality. This can lead to a high level of accuracy but requires storing models in memory multiple times. In purple is the RGB detector, in green is one IR detector for one dataset, and in yellow is another detector for another IR dataset.
Fig. 2: Our proposed solution is based on using a single pre-trained detector model normally trained on the more abundant data (RGB) and then adapting the input through our ModTr block.
Fig. 3: A single detector is trained on all modalities jointly. This allows the use of a single model but requires access to all modalities jointly, which is often not possible, especially when dealing with large pre-trained models.
its superiority in terms of classification and localization. For instance, when compared with HalluciDet [27] using FCOS, which was the second best, our method shows an improvement of approximately 28 $\mathrm{AP}_{50}$ points, with a similar trend observed for RetinaNet on the LLVIP dataset. The gap is narrower with Faster R-CNN, with an improvement of only around 2 $\mathrm{AP}_{50}$. While the gap is smaller for the FLIR dataset, it remains consistent across all AP metrics and detectors; for example, an improvement of 6 $\mathrm{AP}_{50}$ for Faster R-CNN, approximately 11 $\mathrm{AP}_{50}$ for RetinaNet, and 11 $\mathrm{AP}_{50}$ for FCOS. In terms of localization for the FLIR dataset, significant improvements are observed, such as around 7 AP for Faster R-CNN, 11 AP for RetinaNet, and 11 AP for FCOS when compared with HalluciDet. For the other methods, which do not rely on bounding boxes, substantial improvements are evident; for instance, our method shows an improvement of more than 11 AP over FastCUT [33], with similar trends observed for the other competitors.
In Table 2, we show that, compared with fine-tuning (FT), ModTr performs on par or better even without modifying the parameters of the detector. Thus, it preserves the detector's knowledge for further tasks while improving performance. For instance, in terms of localization with $\mathrm{AP}_{75}$ and overall AP for FCOS and RetinaNet on LLVIP, we reach the performance of FT, while on $\mathrm{AP}_{50}$ we are comparable. On the FLIR dataset, we outperform FT for all detectors across the different AP metrics and with all fusion strategies.
In Table 3, we explore the potential of achieving comparable detection performance while significantly reducing the number of parameters by employing a smaller backbone for the translation network. $\mathrm{MobileNet}_{v2}$, with only 6.6 million additional parameters, achieves performance on par with $\mathrm{ResNet}_{34}$, which has 24.4 million parameters. This reduction in parameters is even more pronounced when compared with training a new detector on the desired modality. For example, a new FCOS detector would
Table 1: Comparison of the detection performance of three different detectors (FCOS, RetinaNet, and Faster R-CNN) over different image-to-image methods that translate infrared into RGB-like images, and our proposed method (ModTr). The methods were evaluated on the infrared test sets of the LLVIP and FLIR datasets. The column RGB indicates whether the method needs access to RGB images during training, and Box refers to the use of ground-truth bounding boxes during training.
Test Set IR (Dataset: LLVIP)

| Image translation | RGB | Box | FCOS $\mathbf{AP}_{50} \uparrow$ | FCOS $\mathbf{AP}_{75} \uparrow$ | FCOS $\mathbf{AP} \uparrow$ | RetinaNet $\mathbf{AP}_{50} \uparrow$ | RetinaNet $\mathbf{AP}_{75} \uparrow$ | RetinaNet $\mathbf{AP} \uparrow$ | Faster R-CNN $\mathbf{AP}_{50} \uparrow$ | Faster R-CNN $\mathbf{AP}_{75} \uparrow$ | Faster R-CNN $\mathbf{AP} \uparrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Histogram Equal. [13] | | | $53.74 \pm 0.00$ | $32.57 \pm 0.00$ | $31.69 \pm 0.00$ | $59.93 \pm 0.00$ | $33.04 \pm 0.00$ | $33.16 \pm 0.00$ | $65.70 \pm 0.04$ | $39.02 \pm 0.11$ | $38.33 \pm 0.02$ |
| CycleGAN [47] | $\checkmark$ | | $41.72 \pm 1.63$ | $23.83 \pm 0.78$ | $23.85 \pm 0.76$ | $43.17 \pm 1.52$ | $22.34 \pm 0.88$ | $23.34 \pm 0.53$ | $45.44 \pm 1.89$ | $26.82 \pm 1.59$ | $26.54 \pm 1.20$ |
| CUT [34] | $\checkmark$ | | $26.48 \pm 2.88$ | $13.68 \pm 2.72$ | $14.30 \pm 2.25$ | $25.64 \pm 3.77$ | $11.74 \pm 2.33$ | $13.12 \pm 2.07$ | $27.96 \pm 1.70$ | $13.59 \pm 2.77$ | $14.78 \pm 1.82$ |
| FastCUT [34] | $\checkmark$ | | $34.92 \pm 3.63$ | $19.07 \pm 1.33$ | $19.39 \pm 1.52$ | $35.73 \pm 2.53$ | $16.36 \pm 0.44$ | $18.11 \pm 0.79$ | $42.09 \pm 3.51$ | $21.44 \pm 1.57$ | $22.91 \pm 1.68$ |
| HalluciDet [27] | $\checkmark$ | $\checkmark$ | $64.17 \pm 0.61$ | $18.80 \pm 1.45$ | $28.00 \pm 0.92$ | $60.38 \pm 3.59$ | $06.75 \pm 1.38$ | $19.95 \pm 2.01$ | $90.07 \pm 0.72$ | $51.23 \pm 1.81$ | $57.78 \pm 0.97$ |
| ModTr (ours) | | $\checkmark$ | $92.04 \pm 0.47$ | $63.84 \pm 0.93$ | $57.63 \pm 0.66$ | $91.56 \pm 0.64$ | $59.49 \pm 1.11$ | $54.83 \pm 0.61$ | $91.82 \pm 0.49$ | $62.51 \pm 0.87$ | $57.97 \pm 0.85$ |

Test Set IR (Dataset: FLIR)

| Image translation | RGB | Box | FCOS $\mathbf{AP}_{50} \uparrow$ | FCOS $\mathbf{AP}_{75} \uparrow$ | FCOS $\mathbf{AP} \uparrow$ | RetinaNet $\mathbf{AP}_{50} \uparrow$ | RetinaNet $\mathbf{AP}_{75} \uparrow$ | RetinaNet $\mathbf{AP} \uparrow$ | Faster R-CNN $\mathbf{AP}_{50} \uparrow$ | Faster R-CNN $\mathbf{AP}_{75} \uparrow$ | Faster R-CNN $\mathbf{AP} \uparrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Histogram Equal. [13] | | | $52.09 \pm 0.00$ | $16.44 \pm 0.00$ | $22.76 \pm 0.00$ | $53.13 \pm 0.00$ | $16.50 \pm 0.00$ | $23.06 \pm 0.00$ | $56.50 \pm 0.10$ | $17.62 \pm 0.04$ | $24.61 \pm 0.01$ |
| CycleGAN [47] | $\checkmark$ | | $49.01 \pm 1.28$ | $21.16 \pm 0.71$ | $23.92$ | $49.04$ | $19.93 \pm 0.54$ | — | $54.4$ | $23.08 \pm 1.54$ | $26.85 \pm 1.23$ |
| CUT [34] | $\checkmark$ | | $38.70 \pm 1.05$ | $14.85 \pm 0.49$ | $18.16 \pm 0.75$ | $39.08 \pm 1.42$ | $13.69 \pm 0.61$ | $17.84 \pm 0.75$ | $43.34 \pm 1.53$ | $16.09 \pm 0.38$ | $20.29 \pm 0.48$ |
| FastCUT [34] | $\checkmark$ | | $45.19 \pm 4.46$ | $22.93 \pm 2.09$ | $24.02 \pm 2.37$ | $43.04 \pm 4.95$ | $19.82 \pm 2.78$ | $22.00 \pm 2.73$ | $49.98 \pm 4.57$ | $25.52 \pm 2.85$ | $26.68 \pm 2.59$ |
| HalluciDet [27] | $\checkmark$ | $\checkmark$ | $54.20 \pm 2.50$ | $17.36 \pm 2.23$ | $23.74 \pm 2.09$ | $52.06 \pm 1.47$ | $16.21 \pm 0.31$ | $22.29 \pm 0.45$ | $63.11 \pm 1.54$ | $23.91 \pm 1.10$ | $29.91 \pm 1.18$ |
| ModTr (ours) | | $\checkmark$ | $65.99 \pm 0.78$ | $33.73 \pm 1.74$ | $35.49 \pm 0.94$ | $66.31 \pm 0.93$ | $31.22 \pm 0.69$ | $34.27 \pm 0.27$ | $69.20 \pm 0.36$ | $34.58 \pm 0.56$ | $37.21 \pm 0.46$ |
Table 2: Comparison of fine-tuning of the detector with the different non-parametric fusion strategies of ModTr. The detection performance of three different detectors (FCOS, RetinaNet, and Faster R-CNN) was evaluated on the infrared test sets of the LLVIP and FLIR datasets.
Test Set IR (Dataset: LLVIP)

| Method | FCOS $\mathbf{AP}_{50} \uparrow$ | FCOS $\mathbf{AP}_{75} \uparrow$ | FCOS $\mathbf{AP} \uparrow$ | RetinaNet $\mathbf{AP}_{50} \uparrow$ | RetinaNet $\mathbf{AP}_{75} \uparrow$ | RetinaNet $\mathbf{AP} \uparrow$ | Faster R-CNN $\mathbf{AP}_{50} \uparrow$ | Faster R-CNN $\mathbf{AP}_{75} \uparrow$ | Faster R-CNN $\mathbf{AP} \uparrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Fine-Tuning | $93.14 \pm 1.06$ | $62.08 \pm 3.37$ | $57.37 \pm 2.19$ | $93.61 \pm 0.59$ | $56.20 \pm 3.78$ | $53.79 \pm 1.79$ | $94.64 \pm 0.84$ | $62.50 \pm 0.79$ | $59.62 \pm 1.23$ |
| $\mathrm{ModTr}_{+}$ | $90.57 \pm 1.46$ | $62.38 \pm 0.31$ | $56.44 \pm 0.75$ | $91.09 \pm 0.73$ | $55.06 \pm 1.81$ | $53.18 \pm 1.03$ | $91.89 \pm 0.39$ | $61.44 \pm 0.72$ | $57.14 \pm 0.50$ |
| $\mathrm{ModTr}_{\oplus}$ | $91.11 \pm 0.84$ | $62.69 \pm 1.53$ | $57.01 \pm 0.71$ | $90.49 \pm 1.11$ | $58.73 \pm 0.55$ | $54.43 \pm 0.35$ | $91.20 \pm 0.46$ | $61.31 \pm 0.73$ | $56.95 \pm 0.37$ |
| $\mathrm{ModTr}_{\odot}$ | $92.04 \pm 0.47$ | $63.84 \pm 0.93$ | $57.63 \pm 0.66$ | $91.56 \pm 0.64$ | $59.49 \pm 1.11$ | $54.83 \pm 0.61$ | $91.82 \pm 0.49$ | $62.51 \pm 0.87$ | $57.97 \pm 0.85$ |

Test Set IR (Dataset: FLIR)

| Method | FCOS $\mathbf{AP}_{50} \uparrow$ | FCOS $\mathbf{AP}_{75} \uparrow$ | FCOS $\mathbf{AP} \uparrow$ | RetinaNet $\mathbf{AP}_{50} \uparrow$ | RetinaNet $\mathbf{AP}_{75} \uparrow$ | RetinaNet $\mathbf{AP} \uparrow$ | Faster R-CNN $\mathbf{AP}_{50} \uparrow$ | Faster R-CNN $\mathbf{AP}_{75} \uparrow$ | Faster R-CNN $\mathbf{AP} \uparrow$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Fine-Tuning | $60.22 \pm 0.97$ | $21.94 \pm 0.42$ | $27.97 \pm 0.59$ | $61.77 \pm 1.02$ | $22.37 \pm 0.45$ | $28.46 \pm 0.50$ | $66.15 \pm 0.94$ | $24.48 \pm 0.71$ | $30.93 \pm 0.46$ |
| $\mathrm{ModTr}_{+}$ | $64.90 \pm 0.48$ | $32.78 \pm 0.27$ | $34.63 \pm 0.24$ | $65.30 \pm 0.66$ | $30.00 \pm 0.81$ | $33.70 \pm 0.59$ | $68.64 \pm 0.77$ | $34.96 \pm 0.90$ | $37.09 \pm 0.74$ |
| $\mathrm{ModTr}_{\oplus}$ | $65.46 \pm 0.61$ | $33.21 \pm 0.55$ | $34.94 \pm 0.52$ | $63.87 \pm 0.51$ | $30.93 \pm 0.38$ | $33.72 \pm 0.22$ | $68.64 \pm 1.29$ | $35.48 \pm 0.33$ | $37.16 \pm 0.47$ |
| $\mathrm{ModTr}_{\odot}$ | $65.25 \pm 0.33$ | $32.24 \pm 0.95$ | $34.60 \pm 0.38$ | $64.96 \pm 0.68$ | $30.93 \pm 0.50$ | $33.85 \pm 0.34$ | $68.84 \pm 0.40$ | $34.77 \pm 0.22$ | $37.01 \pm 0.15$ |
require an additional 33.2 million parameters. This trend is similar across the various detectors and remains consistent for the FLIR dataset (Table 4).
Table 3: Study of the translation backbone, describing the trade-off between the number of parameters of the translation network and the detection performance of each detector on the LLVIP dataset. The results comprise the three default detectors (FCOS, RetinaNet, and Faster R-CNN), reported in AP metrics, under different MobileNet and ResNet backbones for the translation network.
Test Set IR (Dataset: LLVIP)

| Method | Params. | $\mathbf{AP}_{50} \uparrow$ | $\mathbf{AP}_{75} \uparrow$ | $\mathbf{AP} \uparrow$ |
| :--- | :---: | :---: | :---: | :---: |
| FCOS | 33.2 M | | | |
| MobileNet$_{v3s}$ | +3.1 M | $83.34 \pm 1.76$ | $53.34 \pm 0.75$ | $50.05 \pm 1.01$ |
| MobileNet$_{v2}$ | +6.6 M | $90.36 \pm 0.54$ | $60.17 \pm 0.56$ | $55.33 \pm 0.62$ |
| ResNet$_{18}$ | +14.3 M | $89.53 \pm 1.03$ | $58.54 \pm 1.57$ | $54.25 \pm 0.90$ |
| ResNet$_{34}$ | +24.4 M | $90.90 \pm 1.29$ | $62.96 \pm 2.44$ | $56.93 \pm 1.44$ |
| RetinaNet | 34.0 M | | | |
| MobileNet$_{v3s}$ | +3.1 M | $87.67 \pm 0.18$ | $49.99 \pm 0.57$ | $49.65 \pm 0.07$ |
| MobileNet$_{v2}$ | +6.6 M | $90.21 \pm 0.82$ | $54.60 \pm 2.62$ | $52.42 \pm 1.33$ |
| ResNet$_{18}$ | +14.3 M | $89.53 \pm 1.82$ | $52.68 \pm 2.06$ | $51.40 \pm 1.40$ |
| ResNet$_{34}$ | +24.4 M | $90.35 \pm 0.60$ | $57.18 \pm 0.29$ | $53.60 \pm 0.41$ |
| Faster R-CNN | 41.8 M | | | |
| MobileNet$_{v3s}$ | +3.1 M | $89.14 \pm 0.63$ | $56.85 \pm 0.69$ | $54.51 \pm 0.28$ |
| MobileNet$_{v2}$ | +6.6 M | $91.32 \pm 0.73$ | $60.34 \pm 0.66$ | $56.15 \pm 0.51$ |
| ResNet$_{18}$ | +14.3 M | $90.81 \pm 0.46$ | $59.39 \pm 2.06$ | $55.53 \pm 1.14$ |
| ResNet$_{34}$ | +24.4 M | $91.04 \pm 0.29$ | $60.53 \pm 2.16$ | $56.35 \pm 0.65$ |

3 Qualitative Results

In this section, we provide additional visualizations for all the methods, including images generated by our method and the corresponding detection results. First, in Figure 4, we present the detections in more detail, highlighting issues with methods that rely solely on translation, such as FastCUT, and some false positives that occur with FT alone. Subsequently, we provide additional visualizations for a batch of images processed by the various detectors: Figure 5 for FCOS, Figure 6 for RetinaNet, and Figure 7 for Faster R-CNN, for both datasets.
Table 4: Study of the translation backbone, describing the trade-off between the number of parameters of the translation network and the detection performance of each detector on the FLIR dataset. The results comprise the three default detectors (FCOS, RetinaNet, and Faster R-CNN), reported in AP metrics, under different MobileNet and ResNet backbones for the translation network.
Test Set IR (Dataset: FLIR)

| Method | Params. | $\mathbf{AP}_{50} \uparrow$ | $\mathbf{AP}_{75} \uparrow$ | $\mathbf{AP} \uparrow$ |
| :---: | :---: | :---: | :---: | :---: |
| FCOS | 33.2M | | | |
| MobileNet$_{v3s}$ | +3.1M | $56.73 \pm 0.34$ | $27.57 \pm 0.65$ | $29.66 \pm 0.14$ |
| MobileNet$_{v2}$ | +6.6M | $64.49 \pm 0.99$ | $32.17 \pm 0.38$ | $32.17 \pm 0.38$ |
| ResNet$_{18}$ | +14.3M | $64.39 \pm 1.68$ | $32.72 \pm 1.50$ | $34.44 \pm 1.13$ |
| ResNet$_{34}$ | +24.4M | $65.99 \pm 0.78$ | $33.73 \pm 1.74$ | $35.49 \pm 0.94$ |
| RetinaNet | 34.0M | | | |
| MobileNet$_{v3s}$ | +3.1M | $47.30 \pm 0.54$ | $18.63 \pm 0.10$ | $22.67 \pm 0.18$ |
| MobileNet$_{v2}$ | +6.6M | $64.01 \pm 1.51$ | $29.70 \pm 0.62$ | $33.12 \pm 0.68$ |
| ResNet$_{18}$ | +14.3M | $64.20 \pm 0.58$ | $30.84 \pm 0.47$ | $33.44 \pm 0.47$ |
| ResNet$_{34}$ | +24.4M | $66.31 \pm 0.93$ | $31.22 \pm 0.69$ | $34.27 \pm 0.27$ |
| Faster R-CNN | 41.8M | | | |
| MobileNet$_{v3s}$ | +3.1M | $61.03 \pm 1.26$ | $29.87 \pm 0.86$ | $32.06 \pm 0.75$ |
| MobileNet$_{v2}$ | +6.6M | $68.64 \pm 0.56$ | $34.76 \pm 1.27$ | $36.77 \pm 0.67$ |
| ResNet$_{18}$ | +14.3M | $68.49 \pm 0.53$ | $34.52 \pm 0.23$ | $36.68 \pm 0.22$ |
| ResNet$_{34}$ | +24.4M | $69.20 \pm 0.36$ | $34.58 \pm 0.56$ | $37.21 \pm 0.46$ |
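Since Table 4 varies only the encoder of the translation network while the detector stays frozen, a minimal sketch of how translation networks of different sizes can be instantiated and their parameters counted is given below. It assumes a U-Net-style translator built with the Segmentation Models PyTorch package cited in the references; the encoder names, the 3-channel IR input, and the pseudo-RGB output are illustrative assumptions, and the printed counts need not match the +3.1M/+6.6M/+14.3M/+24.4M figures in the tables exactly.

```python
# Minimal sketch: instantiating translation networks of different sizes and
# counting their parameters. Encoder names and the IR -> pseudo-RGB (3-channel)
# output are illustrative assumptions, not the paper's exact configuration.
import segmentation_models_pytorch as smp

encoders = ["mobilenet_v2", "resnet18", "resnet34"]  # assumed encoder choices

for name in encoders:
    translator = smp.Unet(
        encoder_name=name,
        encoder_weights="imagenet",
        in_channels=3,   # IR replicated to 3 channels before translation (assumption)
        classes=3,       # pseudo-RGB output consumed by the frozen detector
    )
    n_params = sum(p.numel() for p in translator.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```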

References

1. Biewald, L.: Experiment tracking with Weights and Biases (2020), software available from wandb.com
2. Bustos, N., Mashhadi, M., Lai-Yuen, S.K., Sarkar, S., Das, T.K.: A systematic literature review on object detection using near infrared and thermal images. Neurocomputing p. 126804 (2023)
3. Cao, Y., Bin, J., Hamari, J., Blasch, E., Liu, Z.: Multimodal object detection by channel switching and spatial attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 403-411 (2023)
4. Chelba, C., Acero, A.: Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech and Language 20(4), 382-399 (2006). https://doi.org/10.1016/j.csl.2005.05.005, https://www.sciencedirect.com/science/article/pii/S0885230805000276
5. Chen, J., Li, K., Deng, Q., Li, K., Philip, S.Y.: Distributed deep learning model for intelligent video surveillance systems with edge computing. IEEE Transactions on Industrial Informatics (2019)
6. Chen, S., Hou, Y., Cui, Y., Che, W., Liu, T., Yu, X.: Recall and learn: Fine-tuning deep pretrained language models with less forgetting. CoRR abs/2004.12651 (2020), https://arxiv.org/abs/2004.12651
7. Detlefsen, N.S., Borovec, J., Schock, J., Jha, A.H., Koker, T., Di Liello, L., Stancl, D., Quan, C., Grechkin, M., Falcon, W.: TorchMetrics - measuring reproducibility in PyTorch. Journal of Open Source Software 7(70), 4101 (2022)
8. Falcon, W., The PyTorch Lightning team: PyTorch Lightning (Mar 2019). https://doi.org/10.5281/zenodo.3828935
9. Feng, H., Yang, Z., Chen, H., Pang, T., Du, C., Zhu, M., Chen, W., Yan, S.: CoSDA: Continual source-free domain adaptation (2023)
[Fig. 4 panel headers: Detector: RetinaNet; Detector: Faster R-CNN]
Fig. 4: Bounding box predictions of different OD methods for infrared images on two benchmarks: LLVIP and FLIR. Yellow and red boxes show the ground-truth and predicted detections, respectively. FastCUT [33] is an unsupervised image translation approach that takes infrared (IR) images as input and produces pseudo-RGB images. It does not focus on detection and requires both modalities for training. Fine-tuning is the standard approach for adapting the detector to the new modality. It requires only IR data but forgets the original knowledge of the pre-trained detector. Finally, our approach, ModTr, focuses the translation on detection, requires only IR data, and does not forget the original knowledge, so it can be reused for other tasks.
Fig. 5: Illustration of a sequence of 8 images from the LLVIP and FLIR datasets for FCOS. For each dataset, the first row shows the RGB modality, followed by the IR modality, FastCUT [33], and the different representations created by ModTr and its variants.

Fig. 6: Illustration of a sequence of 8 images from the LLVIP and FLIR datasets for RetinaNet. For each dataset, the first row shows the RGB modality, followed by the IR modality, FastCUT [33], and the different representations created by ModTr and its variants.
Fig. 7: Illustration of a sequence of 8 images from the LLVIP and FLIR datasets for Faster R-CNN. For each dataset, the first row shows the RGB modality, followed by the IR modality, FastCUT [33], and the different representations created by ModTr and its variants.
10. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014)
11. Group, F., et al.: FLIR thermal dataset for algorithm training (2018)
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770-778 (2016)
13. Herrmann, C., Ruf, M., Beyerer, J.: CNN-based thermal infrared person detection by domain adaptation. In: Autonomous Systems: Sensors, Vehicles, Security, and the Internet of Everything. vol. 10643, p. 1064308. International Society for Optics and Photonics (2018)
14. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
15. Howard, J., Ruder, S.: Fine-tuned language models for text classification. CoRR abs/1801.06146 (2018), http://arxiv.org/abs/1801.06146
16. Hsu, H.K., Yao, C.H., Tsai, Y.H., Hung, W.C., Tseng, H.Y., Singh, M., Yang, M.H.: Progressive domain adaptation for object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 749-757 (2020)
17. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132-7141 (2018)
18. Iakubovskii, P.: Segmentation Models PyTorch (2019)
19. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on (2017)
20. Jia, X., Zhu, C., Li, M., Tang, W., Zhou, W.: LLVIP: A visible-infrared paired dataset for low-light vision. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3496-3504 (2021)
21. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes (2022)
22. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), 3521-3526 (2017)
23. Lee, C., Cho, K., Kang, W.: Mixout: Effective regularization to finetune large-scale pretrained language models (2020)
24. Li, H., Wu, X.J.: DenseFuse: A fusion approach to infrared and visible images. IEEE Transactions on Image Processing 28(5), 2614-2623 (2018)
25. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980-2988 (2017)
26. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740-755. Springer (2014)
27. Medeiros, H.R., Pena, F.A.G., Aminbeidokhti, M., Dubail, T., Granger, E., Pedersoli, M.: HalluciDet: Hallucinating RGB modality for person detection through privileged information. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1444-1453 (2024)
28. Menezes, A.G., de Moura, G., Alves, C., de Carvalho, A.C.: Continual object detection: A review of definitions, strategies, and challenges. Neural Networks (2023)
29. Minderer, M., Gritsenko, A., Houlsby, N.: Scaling open-vocabulary object detection. Advances in Neural Information Processing Systems 36 (2024)
30. Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al.: Simple open-vocabulary object detection. In: European Conference on Computer Vision. pp. 728-755. Springer (2022)
31. Özkanoğlu, M.A., Ozer, S.: InfraGAN: A GAN architecture to transfer visible images to infrared domain. Pattern Recognition Letters 155, 69-76 (2022)
32. Pang, Y., Lin, J., Qin, T., Chen, Z.: Image-to-image translation: Methods and applications (2021)
33. Park, T., Efros, A.A., Zhang, R., Zhu, J.Y.: Contrastive learning for unpaired image-to-image translation. In: Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IX. pp. 319-345. Springer (2020)
34. Park, T., Efros, A.A., Zhang, R., Zhu, J.Y.: Contrastive learning for unpaired image-to-image translation. In: European Conference on Computer Vision (2020)
35. Pierson, H.A., Gashler, M.S.: Deep learning in robotics: a review of recent research. Advanced Robotics 31(16), 821-835 (2017)
36. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28, 91-99 (2015)
37. Stilgoe, J.: Machine learning, social learning and the governance of self-driving cars. Social Studies of Science 48(1), 25-56 (2018)
38. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9627-9636 (2019)
39. Vasconcelos, C., Birodkar, V., Dumoulin, V.: Proper reuse of image classification features improves object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13628-13637 (2022)
40. Wang, Q., Chi, Y., Shen, T., Song, J., Zhang, Z., Zhu, Y.: Improving RGB-infrared object detection by reducing cross-modality redundancy. Remote Sensing 14(9), 2020 (2022)
41. Wang, Z., Yang, E., Shen, L., Huang, H.: A comprehensive survey of forgetting in deep learning beyond continual learning (2023)
42. Wortsman, M., Ilharco, G., Kim, J.W., Li, M., Kornblith, S., Roelofs, R., Lopes, R.G., Hajishirzi, H., Farhadi, A., Namkoong, H., et al.: Robust fine-tuning of zero-shot models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7959-7971 (2022)
43. Zhang, A., Lipton, Z.C., Li, M., Smola, A.J.: Dive into deep learning. arXiv preprint arXiv:2106.11342 (2021)
44. Zhang, H., Fromont, E., Lefèvre, S., Avignon, B.: Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 276-280. IEEE (2020)
45. Zhang, T., Wu, F., Katiyar, A., Weinberger, K.Q., Artzi, Y.: Revisiting few-sample BERT fine-tuning (2021)
46. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Computer Vision (ICCV), 2017 IEEE International Conference on (2017)
47. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2223-2232 (2017)
