
A New Learning Paradigm for Foundation Model-Based Remote-Sensing Change Detection

Kaiyu Li, Member, IEEE, Xiangyong Cao, and Deyu Meng

Abstract

Change detection (CD) is a critical task to observe and analyze dynamic processes of land cover.
Although numerous deep-learning (DL)-based CD models have performed excellently, their further performance improvements are constrained by the limited knowledge extracted from the given labeled data.
On the other hand, the foundation models that emerged recently contain a huge amount of knowledge by scaling up across data modalities and proxy tasks.
In this article, we propose a bi-temporal adapter network (BAN), which is a universal foundation model-based CD adaptation framework aiming to extract the knowledge of foundation models for CD. The proposed BAN contains three parts, that is, the frozen foundation model (e.g., CLIP), the bi-temporal adapter branch (Bi-TAB), and the bridging modules between them.
Specifically, BAN extracts general features through a frozen foundation model, which are then selected, aligned, and injected into Bi-TAB via the bridging modules.
Bi-TAB is designed as a model-agnostic concept to extract task/domain-specific features, which can be either an existing arbitrary CD model or some hand-crafted stacked blocks.
Beyond current customized models, BAN is the first extensive attempt to adapt the foundation model to the CD task. Experimental results show the effectiveness of our BAN in improving the performance of existing CD methods (e.g., up to IoU improvement) with only a few additional learnable parameters. More importantly, these successful practices show us the potential of foundation models for remote-sensing (RS) CD. The code is available at https://github.com/likyoo/BAN and will be supported in our Open-CD.

Index Terms-Change detection (CD), deep learning (DL), foundation model, remote-sensing (RS) image processing, visual tuning.

I. INTRODUCTION

The dynamic properties of the Earth's surface are influenced by a wide range of natural and anthropogenic factors, resulting in constant change.
Therefore, analyzing time-series remote-sensing (RS) data is an important topic in Earth vision, and change detection (CD) techniques have become an indispensable tool for this task, contributing to the comprehensive interpretation of surface changes. The basic goal of CD is to detect targets or pixels with semantic changes or specific changes between bi-temporal RS images of the same region.
Furthermore, CD can detect, quantify, and analyze changes in land cover and has rich applications in several scenarios, for example, urban expansion [1], [2], natural disaster assessment [3], cropland protection [4], and so on.

Manuscript received 2 December 2023; revised 11 January 2024; accepted 10 February 2024. Date of publication 16 February 2024; date of current version 23 February 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2021ZD0112902 and in part by the China NSFC Projects under Contract 62272375 and Contract 12226004. (Corresponding author: Xiangyong Cao.)

Kaiyu Li and Xiangyong Cao are with the School of Computer Science and Technology and the Ministry of Education Key Laboratory for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an 710049, China (e-mail: likyoo.ai@gmail.com; caoxiangyong@xjtu.edu.cn).

Deyu Meng is with the School of Mathematics and Statistics and the Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China, and also with Pazhou Laboratory (Huangpu), Guangzhou, Guangdong 510555, China (e-mail: dymeng@mail.xjtu.edu.cn).

Digital Object Identifier 10.1109/TGRS.2024.3365825

Traditional CD methods are based on manual feature extraction, making it difficult to achieve fast and accurate detection in complex scenarios [5]. In the last decade, deep-learning (DL) technologies have greatly contributed to the development of the field of CD. For example, convolutional neural networks (CNNs) have achieved practical success in CD, and their superior learning ability and automatic feature extraction capability make them powerful in handling complex scenarios [1], [6], [7], [8], [9].
Recently, the vision transformer (ViT) [10] has further energized the field of CD [11], [12] since it can handle arbitrary unstructured data and capture global dependencies in the whole image, which opens up more possibilities for CD.
Furthermore, some studies have combined CNNs and ViTs in series or parallel form to explore some more powerful CD capabilities [13], [14], [15], [16], [17]. In summary, these models have achieved practical success to some extent.

However, as the model scale grows, the limited data restricts the performance improvement of CD models. For instance, on the LEVIR-CD dataset [1], a CD benchmark containing 637 pairs of bi-temporal RS images, DL-based methods could achieve an F1-score of […] by 2020. However, in the following three years, the improvement was only […], while the number of model parameters has grown to hundreds of millions. We consider this to be a performance bottleneck under limited data.
To alleviate this problem, studies have explored several directions, for example, generating more simulation data using generative models [18], constructing more image pairs through data augmentation strategies and single-temporal images [19], [20], and so on.
However, compared to the millions or even billions of annotated samples or image-text pairs available for natural images, these generated data are still unable to drive models toward the emergence of capabilities and homogenization [21].
Recently, the significance of the foundation model has been increasingly recognized in computer vision.
Compared to small models customized for specific tasks and datasets, the foundation models [22], [23], [24] can accumulate general knowledge through large-scale pretraining, thus reducing the dependence on specific labeled data, which inspires us to introduce the foundation models into the CD task.

The foundation model refers to models that have been trained on large-scale datasets and are capable of capturing a wide range of knowledge and understanding.
The foundation model started with natural language processing (NLP) tasks, typically exemplified by OpenAI's GPT series [25], and has been widely explored in computer vision and vision-language tasks, for example, CLIP [22], BLIP [26], SAM [27], and so on.
These large foundation models allow for cross-domain knowledge transfer and sharing and reduce the demand for task-specific training data [28].
In other words, even though in most tasks (e.g., CD) it is difficult to train a task-specific large model from scratch due to limited data and computational resources, we can resort to the existing foundation models and adapt them to specific tasks.
Some researchers have tried to tune or adapt foundation models to some downstream tasks in natural images and have achieved some success [29], [30].
For RS CD, visual tuning raises two particular issues: 1) the domain gap between natural images and RS images, and 2) the bi-temporal input structure, which changes the overall pipeline of foundation models.

To alleviate the two issues, in this article, we propose a universal framework, that is, the bi-temporal adapter network (BAN for short), for tuning and adapting foundation models to the CD task. By bridging and combining the foundation model with customized models or modules for CD, BAN can dramatically improve the performance of existing CD models with only a few additional learnable parameters.
Specifically, BAN feeds the bi-temporal images in parallel into a frozen foundation model (i.e., in a Siamese fashion), which can be a ViT trained on ImageNet-21k [31], the image encoder of CLIP [22], [23], or RemoteCLIP trained on RS data [24], as well as any other pretrained image model.
Simultaneously, the bi-temporal images are fed into a bi-temporal adapter branch (Bi-TAB), which can be either an existing CD model, for example, BiT [13], ChangeFormer [11], and so on, or some hand-crafted stacked blocks.
In addition, to alleviate the conflicts in input resolutions [32] of the foundation model and Bi-TAB, the asymmetric resolution input strategy (ARIS) is proposed.
Between the foundation model and the Bi-TAB, we design a series of bridging modules that constantly select, align, and inject the general features in the foundation model space into the learnable Bi-TAB at various hierarchies.
Through these bridging modules, BAN achieves knowledge transfer from general features to the CD task in the RS image domain. The schematic of BAN is shown in Fig. 1.

In summary, the contributions of this work are threefold.

1) To reduce the dependence on CD-specific data, we propose a universal framework (i.e., BAN) to adapt the foundation model to the CD task, which efficiently reuses the general knowledge of the foundation model and transfers it to the CD task. Additionally, as a side-tuning framework, BAN allows tuning that is efficient in parameters, memory, and time. To our knowledge, this is the first universal framework to introduce foundation models into the CD task.

Fig. 1. Schematic of BAN. The frozen pretrained model can be any foundation model (ImageNet-21k pretrained models [31], CLIP [22], [23], RemoteCLIP [24], etc.), the Bi-TAB can be any CD model (BiT [13], ChangeFormer [11], etc.), and the bridging modules can inject the knowledge extracted from the foundation model into the Bi-TAB.

2) To better select general features and align them with task/domain-specific features, we propose a bridging module between the foundation model and Bi-TAB. Through cross-domain dot-product attention, the bridging module can resample general domain knowledge and then inject it into features of the RS CD domain.

3) Bi-TAB is proposed as a model-agnostic concept, and this branch can be either an existing CD model or some hand-crafted stacked blocks. Benefiting from its plug-and-play property, BAN can be equipped with almost any existing CD model. Experimental results on the benchmark CD datasets demonstrate that BAN can bring about improvement on average for the corresponding customized CD models.

The rest of this article is organized as follows. Section II presents a brief review of related work. Section III describes the specifics of our proposed framework. Section IV provides a series of experimental results and analysis. Finally, Section V concludes this article.

II. RELATED WORK

A. DL-Based CD

Accurate and reliable CD is essential in the analysis of RS images. Over the past few years, numerous CD approaches have been proposed. In particular, CNN-based methods have shown remarkable success in this task. Daudt et al. [6] proposed three U-Net-based fully convolutional Siamese networks, FC-EF, FC-Siam-conc, and FC-Siam-diff, in one of the earliest studies of DL-based CD. Expanding on the FC-Siam-conc model, Zhang et al. [8] used the attention mechanism to construct stronger basic blocks and additionally introduced deep supervision for faster convergence and better performance of the model.
Different from the former, STANet [1] uses a metric learning-based approach for CD, which projects the bi-temporal images into a high-dimensional space with a neural network and then applies the L2 distance to the bi-temporal high-dimensional mappings to generate the change mask. Fang et al. [7] considered that continuous downsampling in the backbone of previous models leads to loss of spatial information, resulting in ambiguity in the pixels at the edges of the change targets, and therefore proposed SNUNet, which maintains high-resolution, fine-grained representations through dense skip connections between each node in the encoder and the decoder.
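
The metric-learning idea described for STANet can be illustrated with a minimal sketch, assuming per-pixel feature vectors have already been produced by some network. The function name `change_mask`, the toy features, and the threshold value are hypothetical and are not taken from the STANet implementation.

```python
import math

def change_mask(feat_t1, feat_t2, threshold=1.0):
    """Threshold the per-pixel L2 distance between two feature maps.

    feat_t1, feat_t2: H x W grids whose cells are feature vectors (lists of floats).
    threshold: hypothetical cut-off; a real system would tune or learn it.
    """
    mask = []
    for row1, row2 in zip(feat_t1, feat_t2):
        mask_row = []
        for v1, v2 in zip(row1, row2):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
            mask_row.append(1 if dist > threshold else 0)
        mask.append(mask_row)
    return mask

# Toy 2 x 2 feature maps with 2-D features per pixel.
f1 = [[[0.0, 0.0], [1.0, 1.0]], [[0.5, 0.5], [2.0, 0.0]]]
f2 = [[[0.0, 0.1], [1.0, 1.0]], [[3.0, 0.5], [0.0, 2.0]]]
mask = change_mask(f1, f2)  # pixels whose features moved far apart are marked 1
```

Only the two bottom pixels, whose features differ strongly between the two phases, are marked as changed in this toy input.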

The emergence of ViT provides an alternative path for CD tasks. Specifically, ViT splits an image into a series of patches, takes all patches as a sequence, and uses the attention mechanism to capture the global dependencies between patches in the image. Zhang et al. [12] used Swin Transformer blocks [33] to construct the encoder and the decoder and designed a pure Transformer network with a Siamese U-shaped structure.
ChangeFormer [11] is a Transformer-based CD model that references the construction of SegFormer [34], a model for semantic segmentation.
In the basic module of ChangeFormer, spatial downsampling of the query, key, and value is performed in the self-attention to reduce computation.

Several studies have found that in CD, due to limited data, the pure Transformer model may not reach its full potential without some inductive bias incorporated. Chen et al. [13] believed that high-level change features can be represented by a few semantic tokens and proposed BiT.
In BiT, the bi-temporal features obtained through ResNet [35] are represented as several tokens, then the context is modeled in the token-based space-time using the Transformer encoder, and finally, the tokens with the learned context are fed back into the pixel space to refine the original features through the Transformer decoder. Wu et al. [36] also used ResNet first to extract the bi-temporal features and then passed them to the designed cross Swin Transformer for difference feature extraction. On the other hand, some studies use parallel CNN and Transformer structures. Tang et al. [17] proposed WNet, which uses a Siamese CNN and a Siamese Transformer as encoders and fuses all features in the decoder. Similarly, Feng et al. [37] proposed ICIF-Net, which constructs a dual-branch structure of the CNN and Transformer to capture multiscale local and global features, respectively, and uses cross-attention to fuse them.
Fu et al. [16] argued that there is no interaction between the dual parallel branches in the feature extraction processes of ICIF-Net, so they designed a semantic information aggregation module to dynamically implement the interaction between the CNN and Transformer branches in their proposed SLDDNet.

In this work, beyond the above customized CD models, we propose a universal CD framework (i.e., BAN) based on the foundation model. In general, all the above-mentioned CD models can be used as the Bi-TAB in BAN, and BAN can help improve the performance of these existing CD models.

B. Foundation Model and Visual Tuning

Through various pretraining techniques, the foundation model can be scaled up in data modalities, proxy tasks, and so on, thus accumulating knowledge [38], which has gradually become a consensus among researchers.
Following its success, ViT swept the field of computer vision and became the preferred choice for vision foundation models. Steiner et al. [31] conducted the first systematic study of the interplay between data, augmentation, and regularization when pretraining ViT.
CLIP is a pretraining method based on contrastive language-image learning; it is trained to maximize the similarity between the vision and language encodings of matching image-text pairs while minimizing it for nonmatching pairs.
With strong generalization ability and extensibility, CLIP can handle multiple types of image and text data without retraining for specific tasks [32]. Inspired by CLIP, Liu et al. [24] proposed RemoteCLIP, the first vision-language foundation model for RS.
Although RemoteCLIP has achieved some improvement in some RS tasks, it is still limited by the volume of data, that is, it contains only […] image-text pairs. To address this, Zhang et al. [39] proposed the RS5M dataset, which contains 5 million RS images with English descriptions, by filtering image-text pair datasets and generating captions for RS images [40].
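
The contrastive objective described above for CLIP can be sketched in a few lines. This is a minimal illustration under stated assumptions, not CLIP's actual training code: the embeddings are toy vectors, the function names are hypothetical, and the temperature of 0.07 is an assumed value.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the k-th image should match the k-th text."""
    n = len(image_embs)
    sim = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    loss_i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    sim_t = [[sim[r][c] for r in range(n)] for c in range(n)]  # text-to-image view
    loss_t2i = sum(cross_entropy_row(sim_t[k], k) for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

images = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]    # text k is close to image k
swapped = [matched[1], matched[0]]    # pairs deliberately mismatched
loss_matched = clip_style_loss(images, matched)
loss_swapped = clip_style_loss(images, swapped)
```

The loss is small when each image is most similar to its own text and large when the pairing is swapped, which is the behavior the training objective rewards.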

How to adapt the above-mentioned foundation models to downstream-specific tasks is an important topic currently.
Typical transfer learning methods, such as fully fine-tuning the whole model or fine-tuning only the task head, lead to high training costs or insufficient reuse, respectively. One tradeoff is the parameter-efficient transfer learning (PETL) method [38], which selects a subset of pretrained parameters and/or introduces a limited number of trainable parameters into the foundation model while freezing the majority of the original parameters [41].
Typical PETL methods include prompt tuning, which concatenates some learnable embeddings with inputs, and adapters, which insert some learnable components into the pretrained model. Hu et al. [42] proposed LoRA, which initializes a low-rank adaptation matrix and inserts it into self-attention in a residual-connected form. Due to the simplicity and effectiveness of LoRA, numerous related studies and practices have emerged [30].
However, parameter efficiency does not mean memory efficiency, and foundation model tuning remains difficult. To address this issue, Sung et al. [41] proposed the ladder side-tuning (LST) method, which freezes the foundation model and builds a side network for training [32], [43], [44]. In this article, our proposed BAN can be seen as the practice and extension of side-tuning for the CD task.
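
The low-rank, residual-connected update used by LoRA can be sketched as follows. This is a minimal illustration, assuming a row-vector convention in which the frozen weight is a d x d matrix and the update is the product of a d x r and an r x d matrix; the function names and toy values are hypothetical. Zero-initializing the second factor, so that training starts from the frozen model's behavior, follows the initialization described in the LoRA paper.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_forward(x, w_frozen, a, b, scale=1.0):
    """y = x W + scale * (x A) B, where W stays frozen and only A and B are trained."""
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, a), b)
    return [[base[i][j] + scale * update[i][j] for j in range(len(base[0]))]
            for i in range(len(base))]

d, r = 4, 1
w = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weight (identity here)
a = [[0.1] for _ in range(d)]            # d x r factor
b_zero = [[0.0] * d for _ in range(r)]   # r x d factor, zero-initialized
x = [[1.0, 2.0, 3.0, 4.0]]

y_init = lora_forward(x, w, a, b_zero)        # equals the frozen output
b_trained = [[1.0, 0.0, 0.0, 0.0]]            # pretend training changed B
y_adapted = lora_forward(x, w, a, b_trained)  # first output channel is shifted

frozen_params = d * d
trainable_params = d * r + r * d
```

With a small rank r, the trainable factors hold far fewer values than the frozen matrix once d is large, which is where the parameter efficiency comes from.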

Most relevant to this article is a concurrent work, SAM-CD [45], which uses a frozen Fast-SAM [46] as the encoder and fine-tunes the feature pyramid network and prediction head for CD.
However, compared to BAN, SAM-CD is more of a customized Fast-SAM-based model within a traditional transfer learning paradigm. Therefore, we claim that BAN should be the first universal framework to adapt the foundation model to the CD task.

III. METHOD

As illustrated in Figs. 1 and 2, BAN can be divided into three parts: the frozen foundation model, Bi-TAB, and the bridging modules between them. The foundation model is generally a ViT at various scales. Next, we review the structure of ViT. Then, we introduce Bi-TAB and explain why Bi-TAB can be an existing arbitrary CD model.
Finally, we present the details of the bridging module.

A. Review of ViT

For an image $x \in \mathbb{R}^{H \times W \times C}$, ViT first divides it into a series of nonoverlapping $P \times P$ patches, where $P$ is typically 14, 16, or 32. Each patch is encoded into a $D$-dimensional vector using a convolutional layer, and the output $x_p \in \mathbb{R}^{N \times D}$ is termed patch embedding, where $N = HW / P^2$, and $D$ denotes the constant dimension of ViT. Then, a learnable class token is concatenated into $x_p$ as the whole image representation. Because self-attention is insensitive to sequence order, a learnable position encoding is subsequently added to the patch embedding.
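
As a quick check of how the patch size determines the length of the token sequence, the following sketch counts the tokens ViT produces for an image. The helper name `vit_token_count` and the 224 x 224 input size are assumptions made for illustration.

```python
def vit_token_count(height, width, patch_size, class_token=True):
    """Number of tokens for an image split into nonoverlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    n_patches = (height // patch_size) * (width // patch_size)
    return n_patches + (1 if class_token else 0)

tokens_p14 = vit_token_count(224, 224, 14)  # 16 * 16 patches + 1 class token
tokens_p16 = vit_token_count(224, 224, 16)  # 14 * 14 patches + 1 class token
tokens_p32 = vit_token_count(224, 224, 32)  # 7 * 7 patches + 1 class token
```

A smaller patch size gives a longer sequence and a finer spatial grid, at the cost of more attention computation.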

Fig. 2. Detailed illustration of BAN. The foundation model (blue area) accumulates general knowledge through pretraining techniques, and the bridging modules (orange area) select, align, and inject this knowledge into the Bi-TAB. The Bi-TAB (green area) is a model-agnostic concept, which can be an arbitrary customized CD model or even some hand-crafted stacked blocks. These three major components are detailed in Sections III-A to III-C, respectively. (a) Detailed structure of BAN. (b) Detailed structure of the bridging module. Legend: basic block of Bi-TAB; basic block of foundation model; element-wise addition; dot product; stop gradient; resize (bilinear interpolation).

The basic block of ViT consists of alternating multihead self-attention (MSA) and feed-forward network (FFN), with layer normalization (LN) and residual connection applied in each layer. Formally, a basic block can be expressed as

$$x'_l = \mathrm{MSA}(\mathrm{LN}(x_{l-1})) + x_{l-1}, \quad x_l = \mathrm{FFN}(\mathrm{LN}(x'_l)) + x'_l, \quad l = 1, \ldots, L$$

where $L$ denotes the number of basic blocks in ViT, $x_l$ denotes the output of the $l$th block, the FFN consists of two linear layers with a GELU nonlinearity, and the MSA can be formulated, for each head, as

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \quad Q = xW_q, \; K = xW_k, \; V = xW_v$$

where $Q$, $K$, and $V$ denote query, key, and value, respectively, $W_q$, $W_k$, and $W_v$ are projection parameters, and $d$ is the feature dimension used for scaling.
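
The structure of a ViT basic block (LN, self-attention, residual connection, then LN, FFN, residual connection) can be sketched in plain Python. This is a simplified illustration under stated assumptions: a single attention head, no bias terms, no learnable LN scale or shift, and toy weights; all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (the residual connection)."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-6):
    """Normalize each token to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the token sequence x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def ffn(x, w1, w2):
    """Two linear layers with a GELU nonlinearity in between."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def vit_block(x, w_q, w_k, w_v, w1, w2):
    h = add(attention(layer_norm(x), w_q, w_k, w_v), x)  # attention sub-layer + residual
    return add(ffn(layer_norm(h), w1, w2), h)            # FFN sub-layer + residual

identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens of dimension 2
out = vit_block(tokens, identity, identity, identity, identity, identity)
```

Because both sub-layers are residual, a block whose value and output projections are zero passes its input through unchanged, which is a convenient sanity check for an implementation.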

In BAN, since two images (bi-temporal images) need to be received as inputs, we extract bi-temporal features simultaneously using a parallel frozen foundation model with shared parameters, that is, a Siamese foundation model. High-resolution images are important for accurate CD; however, most foundation models are trained on low-resolution images (e.g., […]). Moreover, the position embedding of ViT is highly correlated with the image resolution, and if the image resolution differs at inference, a new approximate position encoding needs to be recomputed by interpolation.
In addition, self-attention does not have the translation equivariance that exists in convolution and thus is more sensitive to resolution. Therefore, ARIS is proposed in BAN.
Specifically, for the foundation model, the bi-temporal images are resized to match its pretraining resolution, while for Bi-TAB, the resolution is kept as $256 \times 256$ or […] according to the dataset.
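
To make the cost of a resolution mismatch concrete, the sketch below shows the kind of interpolation needed to adapt a grid of position encodings to a different token grid. It is an illustration only: it handles one scalar channel, uses corner-aligned bilinear interpolation as an assumed scheme, and the function name `bilinear_resize` is hypothetical.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2-D grid of scalars with corner-aligned bilinear interpolation."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row index to a fractional input coordinate.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

pos_2x2 = [[0.0, 1.0], [2.0, 3.0]]         # one channel of a 2 x 2 position grid
pos_3x3 = bilinear_resize(pos_2x2, 3, 3)   # approximate encoding for a 3 x 3 grid
```

The interpolated values are only an approximation of what pretraining would have learned at the new resolution, which is the motivation for feeding the foundation model images at its pretraining resolution instead.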

Different from task-specific models that are trained on specific datasets, foundation models are trained in a self-supervised or semi-supervised manner on large-scale data and can adapt their knowledge and capabilities to several downstream tasks.
Under similar ViT architectures, a crucial difference between various foundation models is the training data and training strategy. In this article, we mainly use CLIP's image encoder (ViT-L/14) [22], [23] as the foundation model part in BAN.
In the experimental part, we compare the foundation models under different weights and architectures, and the details can be found in Section IV-E2.

B. Bi-TAB

Since the foundation model part is frozen, it can only extract general information from the image. However, this information is incomplete, and Bi-TAB is thus needed for domain-specific and task-specific feature extraction. Bi-TAB is not designated as an explicit and concrete network, but rather as a model-agnostic concept.
Therefore, Bi-TAB can be either a customized CD model, including existing CNN-based and Transformer-based models, or several hand-crafted stacked blocks. In our work, we recommend the former since it allows BAN to better benefit from evolving customized CD models.

A modern customized CD model usually contains a backbone (encoder) and a prediction head (decoder), and its backbone is always a Siamese network. Thus, a customized CD model can simply be represented as

$$M = \mathrm{Head}\big(\mathrm{Backbone}(T_1), \mathrm{Backbone}(T_2)\big)$$

where $T_1$ and $T_2$ denote the bi-temporal images, $M$ denotes the change mask, and the Backbone and the Head consist of a sequence of basic blocks as shown in Fig. 2(a). From the perspective of Bi-TAB, as shown in Fig. 3, in addition to the original bi-temporal images, an additional general feature bank is provided by the foundation model, from which Bi-TAB can arbitrarily take out features.
Since both are Siamese structures, information from the foundation model can be easily injected into the Bi-TAB's backbone of the corresponding temporal phase, which is why almost any CD model can be chosen as the Bi-TAB. With the support of the foundation model, Bi-TAB can easily obtain general information about the image, which is typically obtained from a large amount of training data. In this way, the model's requirement for data is reduced.

Fig. 3. Illustration of the Bi-TAB perspective in BAN, with BiT [13] and ChangeFormer [11] as examples. For better presentation, the color renderings follow their original literature [11], [13]. (a) BiT as Bi-TAB. (b) ChangeFormer as Bi-TAB.
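
The backbone-plus-head view of a customized CD model can be sketched as a small higher-order function. This is a structural illustration only: `toy_backbone` and `toy_head` are hypothetical stand-ins for real networks, and the threshold is an arbitrary choice.

```python
def siamese_cd(t1, t2, backbone, head):
    """Apply one shared backbone to both temporal phases, then predict the mask."""
    f1 = backbone(t1)  # the same function (shared weights) serves both phases
    f2 = backbone(t2)
    return head(f1, f2)

def toy_backbone(img):
    """Hypothetical feature extractor: a fixed per-pixel scaling."""
    return [[2.0 * p for p in row] for row in img]

def toy_head(f1, f2, threshold=1.0):
    """Hypothetical prediction head: threshold the absolute feature difference."""
    return [[1 if abs(a - b) > threshold else 0 for a, b in zip(r1, r2)]
            for r1, r2 in zip(f1, f2)]

t1 = [[0.0, 0.2], [0.5, 0.5]]
t2 = [[0.0, 0.9], [0.5, 1.5]]
mask = siamese_cd(t1, t2, toy_backbone, toy_head)
```

Because the backbone is applied once per temporal phase, features from another Siamese model can be added to the matching phase without changing this overall structure.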

As presented in Figs. 2(a) and 3, modern customized CD models can be easily embedded into BAN. Specifically, in this article, we mainly use two typical customized CD models as Bi-TAB: the CNN-Transformer-based BiT [13] and the Transformer-based ChangeFormer [11] (denoted by "CF").
In the original CF, the backbone network is a customized mix transformer (MiT) [34], which limits the scalability of the model. Therefore, in this article, we use the scalable MiT-b0 to MiT-b5 settings in SegFormer [34].
Furthermore, to validate the performance of BAN in the semantic CD (SCD) task, we remodel CF into CF-SCD by adding two additional semantic segmentation prediction heads.
此外，为了验证 BAN 在语义 CD（SCD）任务中的性能，我们将 CF 改造为 CF-SCD，增加了两个额外的语义分割预测头。

C. Bridging Module

Next, a natural question is how to inject the general features from the foundation model into Bi-TAB. We found two critical issues.

First, not all of the general knowledge is needed, so we should consider how to filter out the useless information and select the valuable information.

Second, with ARIS and patch embedding, the features in the foundation model have extremely low resolution after heavy downsampling, so aligning the features from both sources is an important issue.

To address these two issues, we propose the bridging module, which is shown in Fig. 2(b).

There are two types of features in our framework, that is, the general feature of the foundation model and the task-specific feature from the Bi-TAB. In the bridging module, to avoid inconsistency between the distributions of the two, we first apply an LN layer for normalization and then a linear layer for projection.

To effectively discriminate valuable characteristics in the general knowledge, we calculate an affinity matrix between the two features using the cosine metric, and obtain the attention weights by scaling the affinity matrix and applying a Softmax function. The attention weights are then applied through a dot product to obtain the filtered feature.

It is worth noting that the foundation model features are resampled to the scale of the Bi-TAB features in this process, which also solves the scale misalignment problem. Finally, residual connections are adopted to get the ultimate output of the bridging module, which is also the input of the next basic block of Bi-TAB. Here, Resize() denotes upsampling using bilinear interpolation.
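The bridging steps described above (normalize, project, cosine affinity, Softmax, weighted aggregation, residual) can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the `scale` parameter, the choice to normalize and project the general feature, and the single residual from the task-specific feature are all hypothetical, and the Resize()-based residual path is omitted.

```python
import math

def layer_norm(vec, eps=1e-5):
    # Normalize one token vector to zero mean and unit variance.
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def linear(vec, weight):
    # weight has one row per output channel (hypothetical projection matrix).
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def cosine(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def bridge(general_tokens, task_tokens, weight, scale=1.0):
    """Inject general features into task-specific features.

    general_tokens: M token vectors from the frozen foundation model.
    task_tokens:    N token vectors from Bi-TAB.
    Returns N vectors, so the general features are implicitly
    resampled to the Bi-TAB token grid.
    """
    # Assumption: LN and the linear projection act on the general feature.
    projected = [linear(layer_norm(g), weight) for g in general_tokens]
    outputs = []
    for task in task_tokens:
        # Cosine affinity to every general token, scaled, then Softmax.
        attn = softmax([cosine(task, p) / scale for p in projected])
        # Attention-weighted sum of general tokens: the filtered feature.
        filtered = [sum(a * p[c] for a, p in zip(attn, projected))
                    for c in range(len(task))]
        # Assumption: one residual connection from the task-specific feature.
        outputs.append([t + f for t, f in zip(task, filtered)])
    return outputs
```

Because every Bi-TAB token attends over all foundation-model tokens, the output always has the Bi-TAB resolution, which is one way to read the remark that the affinity step also resolves the scale misalignment.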

D. Loss Function

The loss function of BAN can directly follow the plain loss function of Bi-TAB. Specifically, the pixel-wise cross-entropy loss is adopted for BiT and ChangeFormer in this article. It is computed from the label and the prediction at each pixel position over the height and width of the output. For SCD with auxiliary tasks, the same loss is also applied to the semantic segmentation subtask.
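A pixel-wise cross-entropy loss of the kind described above can be sketched as follows. This is a minimal illustration, assuming the per-pixel losses are averaged over the output height and width and that the predictions are already class probabilities; the function name and the `eps` guard are hypothetical and not taken from the paper.

```python
import math

def pixelwise_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy over all pixels.

    probs[h][w]  is a list of class probabilities for pixel (h, w).
    labels[h][w] is the integer class index of the ground truth.
    """
    height, width = len(labels), len(labels[0])
    total = 0.0
    for h in range(height):
        for w in range(width):
            # Negative log-probability assigned to the true class.
            total -= math.log(probs[h][w][labels[h][w]] + eps)
    # Assumption: average over the H x W output positions.
    return total / (height * width)
```

For a binary change mask, `labels` would hold 0 (unchanged) or 1 (changed) at each position; for the SCD auxiliary heads, the same function would apply with more classes.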

IV. EXPERIMENTS

A. Datasets

To sufficiently validate the performance of BAN, we conducted extensive experiments on the following five CD datasets consisting of RGB images. Note that BAN and all compared methods are trained and evaluated on these datasets without using any extra data.

LEVIR-CD dataset [1] comprises a collection of 637 pairs of bi-temporal RS images. These images were obtained from Google Earth and are accompanied by over 31333 annotated instances of changes. Each image in pairs is with a spatial resolution of pixels. To ensure standardized evaluation, the dataset is divided into three subsets: training, validation, and testing, containing 445, 64, and 128 image pairs.
S2Looking [47] is a building CD dataset that contains large-scale side-looking satellite images captured at varying angles from the nadir. It consists of 5000 coregistered bi-temporal image pairs of rural areas around the world, with the size of and spatial resolution of , and in total 65920 annotated change instances. It provides two annotated maps for each pair of images, indicating new and demolished building areas, which we use to generate a change map.
WHU-CD dataset [2] contains a pair of images taken in the same area from 2012 and 2016, each with a size of pixels. The original images are first cropped into several nonoverlapping image patches of size . Following [48], these patches are divided into 5947/743/744 for training, validation, and testing, and part of the annotated training data are selected for the semi-supervised case.
BANDON dataset [49] contains a large number of off-nadir aerial images with a spatial resolution of pixels, collected mainly from Google Earth, Microsoft Virtual Earth, and ArcGIS, and the data are divided into 1689/202/392 for training, validation, and testing, respectively, with a size of . To better evaluate the generalization ability of models, the testing set is divided into an in-domain set and an out-domain set (collected from different cities) containing 207 and 185 image pairs.
With multiple types of annotations, the BANDON dataset can also be used for the SCD task.
The Landsat-SCD dataset [50] consists of 8468 image pairs collected between 1990 and 2020, each with a size of and a spatial resolution of . This dataset is annotated with a "no-change" category and ten types of semantic changes, which contain four classes of land cover (farmland, desert, building, and water).
Since there is no official division, we divided the dataset into training, validation, and testing sets in the ratio of .
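For S2Looking, the text states that the two annotation maps (new and demolished buildings) are used to generate a single change map, without giving the rule. A minimal sketch follows, assuming the binary change map is the pixel-wise union of the two masks; the function name and this union rule are assumptions for illustration.

```python
def merge_change_maps(new_mask, demolished_mask):
    """Pixel-wise union of two binary masks given as nested lists of 0/1."""
    return [
        [1 if (new_px or old_px) else 0
         for new_px, old_px in zip(new_row, old_row)]
        for new_row, old_row in zip(new_mask, demolished_mask)
    ]
```

With this rule, a pixel counts as changed if a building appeared or disappeared there, which matches the binary CD setting used for the other datasets.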

B. Implementation Details

We implement our BAN using our Open-CD, which is a PyTorch-based CD toolkit. During training, we use the cross-entropy loss and the AdamW optimizer. The learning rate is set to 0.0001, with a separate value for Bi-TAB's prediction head, for all models and datasets. The learning rate is decayed using the poly schedule. Following [51], we use random crop (for the LEVIR-CD, S2Looking, and BANDON datasets; no extra crop for the WHU-CD and Landsat-SCD datasets), flip, and photometric distortion for data augmentation. Following some common settings, training is iteration-based on the LEVIR-CD, S2Looking, and BANDON datasets, and runs for 100 and 50 epochs on the WHU-CD and Landsat-SCD datasets, respectively. All experiments are performed on NVIDIA GeForce RTX 4090 and NVIDIA RTX A6000 GPUs, and the batch size is set to 8. If not specified, CLIP's ViT-L/14 is used as the foundation model part of BAN, and ARIS is enabled by default.
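The poly learning-rate schedule mentioned above is commonly written as a polynomial decay from the base rate toward a floor. The sketch below assumes that common form; the `power` and `min_lr` defaults are hypothetical, since the section does not state them.

```python
def poly_lr(base_lr, step, max_steps, power=1.0, min_lr=0.0):
    """Polynomially decayed learning rate at a given training step."""
    progress = min(step, max_steps) / max_steps
    # Decay from base_lr at step 0 to min_lr at max_steps.
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr
```

With the base rate of 0.0001 used in this article and `power=1.0`, the rate falls linearly and reaches half its initial value at the midpoint of training.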

C. Evaluation Metrics

For binary change detection (BCD) tasks, we use the intersection over union (IoU), F1 score, and overall accuracy (OA) as evaluation metrics. They are calculated from TP, TN, FP, and FN, which indicate the numbers of true positives, true negatives, false positives, and false negatives and are counted on the change category to avoid the class imbalance problem. Precision and recall are also calculated on the change category.
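The BCD metrics above follow the standard confusion-matrix definitions, which can be sketched as below. The function name and the zero-denominator guard are illustrative choices rather than part of the paper.

```python
def bcd_metrics(tp, tn, fp, fn):
    """Precision, recall, F1, IoU, and OA for the change category."""
    def ratio(numerator, denominator):
        # Guard against empty denominators (e.g., no predicted changes).
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "iou": ratio(tp, tp + fp + fn),
        "oa": ratio(tp + tn, tp + tn + fp + fn),
    }
```

Since F1 and IoU are both functions of the same three counts, they are linked by IoU = F1 / (2 - F1), which gives a quick consistency check on reported table rows.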

For the SCD task, we use mIoU and the separated kappa coefficient (Sek) [52] as evaluation metrics. Here, one of the IoU terms in mIoU is calculated on the nonchange category. Kappa is calculated for bi-temporal semantic segmentation from two quantities: the number of correctly predicted samples divided by the total number of samples, and the sum over all categories of the products of the number of samples and the number of predictions, divided by the square of the total number of samples. Then, a weighted Score of Sek and mIoU is calculated with weights of 0.7 and 0.3, respectively.

Fig. 4. Visualization results of different methods on the LEVIR-CD testing set. The rendered colors represent true positives (green), false positives (yellow), false negatives (red), and true negatives (black). (a)-(g) and (a)-(c) are just to separate different samples for easier reference in the text.
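The two kappa ingredients and the weighted Score described above can be sketched as follows. This sketch assumes the standard Cohen's kappa on a full confusion matrix; the extra adjustments that turn kappa into Sek in [52] are not reproduced here, and the function names are hypothetical.

```python
def kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Correctly predicted samples divided by the total number of samples.
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Sum over classes of (number of samples x number of predictions),
    # divided by the square of the total number of samples.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)

def weighted_score(sek, miou):
    """Weighted Score with weights 0.7 (Sek) and 0.3 (mIoU)."""
    return 0.7 * sek + 0.3 * miou
```

As a check against Table IV, `weighted_score(52.69, 83.99)` evaluates to about 62.08, which matches the Score listed for CF-SCD-b0.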

D. Comparison and Analysis

Binary Change Detection: BCD is the fundamental task of CD, and we conducted fully supervised BCD experiments on three datasets: LEVIR-CD, S2Looking, and BANDON using only BCD labels ("BANDON-BCD" for short). As listed in Table I, we reimplemented BiT and the results obtained are better than those reported in the original paper, but still not better than the latest six customized models. After using BiT as the Bi-TAB in our BAN framework, it achieves a clear improvement, while the learnable parameters grow only from 2.99M to 3.80M. Similarly, after injecting general knowledge, CF achieves the best performance among the compared methods, and our implementation of CF-b0 has far fewer parameters than the original CF (4.34M for BAN-CF-b0 versus 41.02M in Table I). Some results in Tables I-III follow [13], [47], and [49].

In addition, we present a simple lightweight CD model, stacked-blocks, built from several hand-crafted stacked blocks with only 1.44M parameters, whose F1 score improves from 89.73% to 91.66% under the framework of BAN (Table I). This improvement further demonstrates that general knowledge can provide considerable gains under limited conditions.

For the more challenging S2Looking dataset, BiT and CF-b1 also achieve the best composite metrics after being equipped with BAN, as listed in Table II. BANDON-BCD also targets off-nadir imagery, but it focuses on urban scenes rather than the rural scenes of S2Looking. On the BANDON-BCD dataset,

Fig. 5. Visualization results (semantic segmentation auxiliary task) of different methods on the BANDON dataset. (a)-(g) and (a)-(c) are just to separate different samples for easier reference in the text.

TABLE I

COMPARISONS OF BAN WITH OTHER BCD METHODS ON LEARNABLE PARAMETERS (PARAM (M)), PRECISION (%), RECALL (%), F1 (%), AND IOU (%) ON THE LEVIR-CD DATASET. THE SYMBOL "*" MEANS OUR REIMPLEMENTED RESULTS

Method	Param (M)	Precision (%)	Recall (%)	F1 (%)	IoU (%)
FC-EF [6]	1.35	86.91	80.17	83.4	71.53
FC-Siam-Diff [6]	1.35	89.53	83.31	86.31	75.92
FC-Siam-Conc [6]	1.54	91.99	76.77	83.69	71.96
DTCDSCN [9]	41.07	88.53	86.83	87.67	78.05
STANet [1]	16.93	83.81		87.26	77.40
IFNet [8]	50.71	94.02	82.93	88.13	78.77
SNUNet [7]	12.03	89.18	87.17	88.16	78.83
ConvTransNet [13]	7.13	91.57	90.26	90.91	N/A
WNet [17]	43.07	91.16	90.18	90.67	82.93
CSTSUNet [36]	42.17	92.51	89.16	90.81	83.16
BiT* [13]	2.99	91.97	88.62	90.26	82.25
BAN-BiT	3.80		90.89
CF [11]	41.02	92.05	88.80	90.40	82.48
BAN-CF-b0	4.34		90.30
stacked-blocks	1.44	91.72	87.83	89.73	81.38
BAN-stacked-blocks	2.14	92.72	90.64	91.66

as listed in Table III, the performance of the plain BiT is inferior to that of almost all the compared methods. Driven
TABLE II

COMPARISONS OF BAN WITH OTHER BCD METHODS ON LEARNABLE PARAMETERS (PARAM (M)), PRECISION (%), RECALL (%), F1 (%), AND IOU (%) ON THE S2LOOKING DATASET. THE SYMBOL "*" MEANS OUR REIMPLEMENTED RESULTS

Method	Param (M)	Precision (%)	Recall (%)	F1 (%)	IoU (%)
FC-EF [6]	1.35		8.95	7.65	N/A
FC-Siam-Diff [6]	1.35	68.27	18.52	13.54	N/A
FC-Siam-Conc [6]	1.54		15.76	13.19	N/A
DTCDSCN [9]	41.07	68.58	49.16	57.27	N/A
STANet [1]	16.93	38.75	56.49	45.97	N/A
IFNet [8]	50.71	66.46		64.13	N/A
SNUNet [7]	12.03	71.94	56.34	63.19	46.19
HCGMNet [53]	47.32	70.51	56.87	62.96	45.94
CGNet [54]	33.99	70.18	59.38	64.33	47.41
BiT* [13]	2.99	74.80	55.56	63.76	46.80
BAN-BiT	3.80	75.06	58.00
CF*[11]	41.02	77.68	55.25	64.57	47.68
BAN-CF-b1	15.88	74.63

by BAN, the performance of BiT is dramatically improved, with the in-domain F1 score rising from 61.52% to 69.35%. For CF, the in-domain F1 score is improved from 65.63% to 70.14%, which achieves
TABLE III

COMPARISONS OF BAN WITH OTHER CD METHODS ON LEARNABLE PARAMETERS (PARAM (M)), FLOPS (G), PRECISION (%), RECALL (%), F1 (%), AND IOU (%) ON THE BANDON DATASET. THE FLOPS ARE COMPUTED AT A FIXED INPUT RESOLUTION. THE SYMBOL "*" MEANS OUR REIMPLEMENTED RESULTS AND THE SYMBOL "†" MEANS THAT MORE AUXILIARY TASKS (LABELS) ARE USED

Method	Backbone	Param (M)	FLOPs (G)	In-domain Precision	In-domain Recall	In-domain F1	In-domain IoU	Out-domain Precision	Out-domain Recall	Out-domain F1	Out-domain IoU
FC-EF [6]	ResNet-50	30.75	285.50		34.83	41.65	23.30	40.76	30.76	35.04
FC-Siam-Diff [6]	ResNet-50	30.75	386.49	65.11	67.39	66.23	49.51	60.33	60.06	60.20	43.06
FC-Siam-Conc [6]	ResNet-50	41.90	434.55	65.49	66.17	65.83	49.06	60.81	58.65	59.71	42.56
DTCDSCN [9]	SE-ResNet-34	41.07	60.87	69.23	57.90	63.06	46.05	54.22	56.89	55.53	38.43
STANet [1]	ResNet-50	25.96	525.43	66.24	69.42	67.79	51.28	60.35	61.92	61.12	44.01
Change-Mix [20]	ResNet-50	-	-	69.29	70.80	70.04	53.89	65.60	63.26	64.41	47.50
MTGCD-Net [49]	ResNet-50	91.85	503.53	73.92	76.55	75.21	60.27	71.52	67.30	69.35	53.08
BiT* [13]	ResNet-18	2.99	35.00	74.00	52.65	61.52	44.43	68.78	41.95	52.11	35.24
BAN-BiT	ResNet-18	3.80	330.32	76.84	63.19	69.35		85.29	51.38	64.13
CF*[11]	MiT-b0	3.85	11.38	76.73	57.33	65.63	48.84	74.26	46.79	57.41	40.26
BAN-CF	MiT-b0	4.34	300.17	77.32	64.18	70.14		75.65	54.35	63.25
CF-SCD* [34]	MiT-b0	4.24	20.69	77.93	61.35	68.65	52.27	75.77	53.71	62.86	45.84
BAN-CF-SCD	MiT-b0	5.40	316.26	78.19	67.71	72.57		75.49	59.68	66.66
CF-SCD* [34]	MiT-b2	25.51	55.93	79.53	67.14	72.81	57.24	76.58	56.70	65.15	48.32
BAN-CF-SCD	MiT-b2	28.31	354.59	79.66	70.44	74.77		76.08	61.33	67.91

TABLE IV

COMPARISONS OF BAN WITH OTHER SCD METHODS ON LEARNABLE PARAMETERS (PARAM (M)), MIOU (%), SEK (%), AND SCORE (%) ON THE LANDSAT-SCD DATASET. THE SYMBOL "*" MEANS OUR REIMPLEMENTED RESULTS

Method	Param (M)	mIoU (%)	Sek (%)	Score (%)
HRSCDstr3 [55]	N/A	78.15	31.64	45.60
HRSCDstr4 [55]	62.68	79.15	37.03	49.66
SCDNet [56]	30.01	70.17	39.69	51.83
Bi-SRNet [57]	23.38	82.02	42.60	54.43
SSCDI [57]	23.31	82.10	42.75	54.56
CF-SCD-b0* [34]	4.24	83.99	52.69	62.08
BAN-CF-SCD-b0	5.40	84.76	54.56
CF-SCD-b1* [34]	14.70	85.99	55.69	64.78
BAN-CF-SCD-b1	17.27	86.41	58.56

the best performance when not using additional auxiliary annotations.

To better analyze where BAN brings improvements, we visualized results on the LEVIR-CD dataset, as shown in Fig. 4, where red and yellow render missed detections and false detections, respectively. We found that BAN has better detection performance on buildings with a similar appearance to the background, such as the circled areas in Fig. 4(a)-(e). In addition, some changes in buildings with a special appearance are better detected after applying BAN, as shown by the purple buildings in Fig. 4(f). A common feature of the above scenes is that they have relatively few samples in the dataset, which suggests that, to some extent, the injection of general knowledge gives the model a stronger ability to learn from fewer samples. Another interesting observation is that in Fig. 4(g), models without BAN often misrecognize cargo box changes as building changes due to their similar appearance. With only limited CD datasets, it is difficult for the model to learn to distinguish between cargo boxes and buildings, but the introduction of general knowledge makes it possible.

Semantic Change Detection: SCD is a derivative task of BCD that not only detects where a change has occurred, but also predicts the classes before and after the change. We conducted SCD experiments on two datasets: the Landsat-SCD dataset and the BANDON dataset using SCD annotation (i.e., additional semantic segmentation labels). As listed in
TABLE V

COMPARISONS OF BAN WITH OTHER SCD METHODS ON LEARNABLE PARAMETERS (PARAM (M)), MIOU (%), SEK (%), AND SCORE (%) ON THE BANDON IN-DOMAIN TESTING SET. THE SYMBOL "*" MEANS OUR REIMPLEMENTED RESULTS

Method	Param (M)	mIoU (%)	Sek (%)	Score (%)
CF-SCD-b0* [34]	4.24	75.27	27.95	42.15
BAN-CF-SCD-b0	5.40	77.68	31.48
CF-SCD-b2* [34]	25.51	77.84	33.49	46.79
BAN-CF-SCD-b2	28.31	79.11	35.20

Table IV, on the Landsat-SCD dataset, CF-SCD-b0 and CF-SCD-b1 are used as the Bi-TAB of BAN and both improve in the weighted Score. It is worth noting that these two models improve by 1.87 and 2.87 points in the Sek metric (from 52.69% to 54.56% and from 55.69% to 58.56%), indicating that BAN is not only effective in the CD task, but also drives the semantic segmentation subtask to obtain better performance. The same observation can be made in the experiments on the BANDON-SCD dataset in Table V. The results of the auxiliary semantic segmentation task of BANDON-SCD are visualized in Fig. 5, where we found that BAN detects ambiguous targets more accurately; such ambiguity may be caused by blurring, noise, and entanglement of the target with the background, as shown in Fig. 5(a)-(c).

The metrics in Table III focus only on CD performance, ignoring the performance of the semantic segmentation auxiliary task in Table V. On the BANDON-SCD dataset, our reconstructed CF-SCD-b0 alone obtains 68.65% F1 and 52.27% IoU on the in-domain test set, which is inferior to Change-Mix [20]. After being equipped with BAN, BAN-CF-SCD-b0 achieves 72.57% F1, an improvement of 3.92 points. Compared to BAN-CF-b0 without SCD annotation (70.14% F1), there is a 2.43-point improvement. With a larger Bi-TAB, that is, MiT-b2, BAN-CF-SCD achieves 74.77% F1, which is only 0.44 points below MTGCD-Net [49]. Note that MTGCD-Net uses more auxiliary tasks with annotations, including the building roof-to-footprint offsets and the bi-temporal matching flows between identical roofs, so it is reasonable that it achieves the best performance.

Cross-Domain Change Detection: Cross-domain CD is a critical issue in practical applications of CD. In practice,

TABLE VI

COMPARISONS OF BAN WITH OTHER SEMI-SUPERVISED CD METHODS ON IOU (%) AND OA (%) ON THE WHU-CD DATASET

Method	IoU (%)	OA (%)	IoU (%)	OA (%)	IoU (%)	OA (%)	IoU (%)	OA (%)
Sup. only [48]	50.00	97.48	55.70	97.53	65.40	98.20	76.10	98.04
AdvNet [58]	55.10	97.90	61.60	98.11	73.80	98.80	76.60	98.94
s4GAN [59]	18.30	96.69	62.60	98.15	70.80	98.60	76.40	98.96
SemiCDNet [60]	51.70	97.71	62.00	98.16	66.70	98.28	75.90	98.93
SemiCD [48]				98.47	74.80	98.84	77.20	98.96
ours		98.32

it is expensive to annotate specialized data for each scene and train scene-specific models. Therefore, we generally expect the model to perform well not only on in-domain datasets, but also on out-domain datasets, that is, to have strong generalizability. To explore the effectiveness of BAN on out-domain data, we validated it on the out-domain testing set of the BANDON dataset. As listed in Table III, the plain BiT achieves only 52.11% F1 on the out-domain data, which is 9.41 points lower than on the in-domain data (61.52%); this indicates that cross-domain inference drastically affects the customized CD model. After being equipped with BAN, BAN-BiT improves the out-domain F1 score from 52.11% to 64.13%; although the cross-domain damage still exists, it is effectively mitigated. Similarly, BAN-CF improves the out-domain F1 score over CF from 57.41% to 63.25%. Compared to the in-domain testing set, the magnitude of the improvement brought by BAN is larger, which suggests that BAN contributes to the generalization ability of the model. We also conducted cross-domain CD experiments on the BANDON-SCD dataset, and after being equipped with BAN, BAN-CF-SCD-b0 and BAN-CF-SCD-b2 obtained improvements similar to those on the in-domain testing set, 3.80 and 2.76 points in F1, respectively.

Semi-Supervised Change Detection: In Section IV-D1, we found through visualization that BAN has a strong ability to learn from a few samples, and we further validate this observation here with some semi-supervised CD settings. Semi-supervised CD refers to training the CD model with a small amount of labeled data and a large amount of unlabeled data. Here, we did not train BAN on unlabeled data but only followed the semi-supervised settings, that is, training with partial labels. As listed in Table VI, we compared BAN with several semi-supervised learning methods on the WHU-CD dataset, where AdvNet [58] and s4GAN [59] are semi-supervised semantic segmentation methods, SemiCDNet [60] and SemiCD [48] are state-of-the-art semi-supervised CD methods, and "Sup. only" is the baseline reported in [48] that only performs fully supervised learning on the labeled data. At the two smallest labeled-data ratios, BAN is superior to the other fully supervised and semi-supervised methods except for SemiCD, which is acceptable because BAN does not use any unlabeled data. At the third ratio, BAN is superior to SemiCD, and at the largest ratio BAN again achieves superior results. These results show the possibility and potential of BAN for future applications in semi-supervised learning frameworks.

E. Ablation Studies

ARIS: To avoid recomputation of the position encoding in the foundation model and the resolution gap during training
TABLE VII

PERFORMANCE OF BAN AT DIFFERENT INPUT RESOLUTIONS FOR FOUNDATION MODELS (VIT-L/14 OF CLIP AND REMOTECLIP) ON F1 (%), IOU (%), AND FPS (IMG/S) ON THE LEVIR-CD DATASET

Resolution	F1 (%), ViT-L/14 (CLIP)	IoU (%), ViT-L/14 (CLIP)	F1 (%), ViT-L/14 (RemoteCLIP)	IoU (%), ViT-L/14 (RemoteCLIP)	FPS (img/s)
	91.68	84.63	91.92	85.05	2.86
	91.86	84.94	91.80	84.85	1.79
	91.77	84.79	91.84	84.92	0.71

TABLE VIII

PERFORMANCE OF BAN WITH DIFFERENT FOUNDATION MODELS ON PRECISION (%), RECALL (%), F1 (%), AND IOU (%) ON THE LEVIR-CD DATASET

Foundation Model	Pre-train	Precision (%)	Recall (%)	F1 (%)	IoU (%)
ViT-B/16	IN-21k, sup [31]	93.59	89.80	91.66	84.60
ViT-L/16	IN-21k, sup [31]	93.27	90.11	91.67	84.61
ViT-B/16	CLIP [23]	93.25	90.21	91.71	84.68
ViT-L/14	CLIP [23]	93.47	90.30	91.86	84.94
ViT-B/32	RemoteCLIP [24]	93.28	90.26	91.75	84.75
ViT-L/14	RemoteCLIP [24]	93.44	90.46	91.92	85.05
ViT-B/32	GeoRSCLIP [39]	93.35	90.24	91.77	84.79
ViT-L/14	GeoRSCLIP [39]	93.50	90.48	91.96	85.13
InternImage-XL [61]	IN-21k, sup	93.53	90.41	91.94	85.09

and inference, ARIS is adopted. We conducted this ablation on the LEVIR-CD dataset using CLIP's ViT-L/14 and RemoteCLIP's ViT-L/14 as the foundation models, each with its own pretraining image resolution, while the input resolution of the Bi-TAB is kept fixed. As listed in Table VII, the lowest input resolution gives the best frames per second (FPS), with 2.86 images per second inferred using a sliding window when the batch size is 1 on a single A6000 GPU. When the input resolution matches that of the pretraining for the respective ViT-L/14, the model achieves the best performance. When the input resolution is instead kept constant, that is, without ARIS, the FPS is only 0.71 and the performance is inferior to that obtained with ARIS.
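The sliding-window inference mentioned above amounts to tiling a large image into fixed-size crops. The sketch below generates the crop boxes; the function names and the border-handling rule (adding one extra window flush with the image edge) are assumptions, and the sizes in the usage note are hypothetical.

```python
def window_starts(length, window, stride):
    """Start offsets whose windows of size `window` cover [0, length)."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:
        # Add one extra window aligned with the border.
        starts.append(length - window)
    return starts

def sliding_windows(height, width, window, stride):
    """Crop boxes as (top, left, bottom, right), clipped to the image."""
    return [
        (top, left, min(top + window, height), min(left + window, width))
        for top in window_starts(height, window, stride)
        for left in window_starts(width, window, stride)
    ]
```

For example, `sliding_windows(1024, 1024, 512, 512)` yields four non-overlapping crops, so the cost of the foundation model per image grows with the number of crops, which is consistent with the FPS trend in Table VII.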

Different Foundation Models: With differences in training data and training strategies, various foundation models contain different types of general knowledge. Therefore, we used different foundation models to explore the impact of different general knowledge sources on the performance of BAN. Specifically, for the ViT-based foundation models, we used ViTs trained with full supervision on ImageNet-21k, ViTs trained on the LAION-2B dataset using contrastive language-image learning (CLIP), and ViTs trained on RS data (RemoteCLIP, GeoRSCLIP) as the general knowledge sources in BAN. As listed in Table VIII, for the foundation models using the same pretraining, the large-scale models generally perform better than the

TABLE IX

PERFORMANCE OF BAN WITH SCALABLE BI-TABS ON LEARNABLE PARAMETERS (PARAM (M)), PRECISION (%), RECALL (%), F1 (%), AND IOU (%) ON THE LEVIR-CD DATASET

Method	Param (M)	Precision (%)	Recall (%)	F1 (%)	IoU (%)
CF [11]	41.02	92.05	88.80	90.40	82.48
BAN-CF-b0	4.34	93.47	90.30	91.86
BAN-CF-b1	15.88	93.48	90.76	92.10
BAN-CF-b2	26.92	93.61	91.02	92.30

small-scale models. For the same scale, the model trained using CLIP is superior to the model trained with full supervision on ImageNet-21k. In addition, the CLIP models pretrained on RS data are superior to models of the corresponding scale trained on non-RS data, in part because general knowledge within the RS domain is easier to transfer and adapt to the CD task of RS images. We also attempted to use a deformable convolution-based model (i.e., InternImage), trained in a supervised manner on ImageNet-21k, as the foundation model in BAN. The results demonstrate that the structure of the foundation model in BAN is not limited to ViT and that BAN can benefit from stronger foundation models.

Scalable Bi-TAB: In Section III-B, it is mentioned that, rather than some stacked blocks, we recommend using the existing customized CD models as Bi-TAB, which allows BAN to benefit from these evolving models. In Table IX, we used scalable CFs as Bi-TAB and found that the performance of BAN improves steadily with more complex Bi-TABs. Compared to CF (90.40% F1), BAN-CF-b0, BAN-CF-b1, and BAN-CF-b2 achieve improvements of 1.46, 1.70, and 1.90 points in F1, respectively. This trend further suggests that it is reasonable to consider Bi-TAB as a model-agnostic concept.

V. CONCLUSION

In this article, we propose a universal framework (i.e., BAN) that utilizes general knowledge from large foundation models to reduce the dependence of existing CD models on a large amount of labeled data. Experimental results demonstrate that with only a few additional learnable parameters, BAN can effectively improve the performance of existing CD methods and enable them to learn better from fewer samples. Since both the foundation model and Bi-TAB in BAN are model-agnostic, BAN is highly extensible and can benefit from stronger foundation models, thus helping boost the existing customized CD models. More importantly, our BAN confirms the feasibility of foundation model adaptation in CD tasks and provides a research basis for subsequent investigations. In the future, we suggest further exploration of BAN or foundation model-based CD methods in several aspects, including, but not limited to, better PETL methods, more effective bridging of general and task-specific features, and more suitable task-specific modules. In addition, as a universal framework, BAN is not limited to CD on RGB images, and it can be easily extended to spectral data, that is, multispectral and hyperspectral images.

REFERENCES 参考文献

[1] H. Chen and Z. Shi, "A spatial-temporal attention-based method and a new dataset for remote sensing image change detection," Remote Sens. vol. 12, no. 10, p. 1662, 2020 .
[2] S. Ji, S. Wei, and M. Lu, "Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 1, pp. 574-586, Jan. 2018.
[2] S. Ji, S. Wei, and M. Lu, "Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set," IEEE Trans.Geosci.遥感》，第 57 卷，第 1 期，第 574-586 页，2018 年 1 月。

[3] R. Hänsch et al., "SpaceNet 8-The detection of flooded roads and buildings," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2022, pp. 1471-1479.
[3] R. Hänsch 等人，"SpaceNet 8-The detection of flooded roads and buildings"，Proc. IEEE/CVF Conf.Comput.Vis.Pattern Recognit.Pattern Recognit. Workshops (CVPRW), Jun. 2022, pp.

[4] M. Liu, Z. Chai, H. Deng, and R. Liu, "A CNN-transformer network with multiscale context aggregation for fine-grained cropland change detection," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 15, pp. 4297-4306, 2022.
[4] M. Liu, Z. Chai, H. Deng, and R. Liu, "A CNN-transformer network with multiscale context aggregation for fine-grained cropland change detection," IEEE J. Sel.Topics Appl.遥感》，第 15 卷，第 4297-4306 页，2022 年。

[5] L. Bruzzone and D. F. Prieto, "Automatic analysis of the difference image for unsupervised change detection," IEEE Trans. Geosci. Remote Sens., vol. 38, no. 3, pp. 1171-1182, May 2000
[5] L. Bruzzone 和 D. F. Prieto，"用于无监督变化检测的差分图像自动分析"，IEEE Trans.Geosci.遥感》，第 38 卷，第 3 期，第 1171-171 页。3, pp.

[6] R. C. Daudt, B. Le Saux, and A. Boulch, "Fully convolutional Siamese networks for change detection," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 4063-4067.
[6] R. C. Daudt, B. Le Saux, and A. Boulch, "Fully convolutionalSiamese networks for change detection," in Proc. 25th IEEE Int.Conf.Image Process.(ICIP），2018 年 10 月，第 4063-4067 页。

[7] S. Fang, K. Li, J. Shao, and Z. Li, "SNUNet-CD: A densely connected Siamese network for change detection of VHR images," IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1-5, 2021.
[7] S. Fang, K. Li, J. Shao, and Z. Li, "SNUNet-CD: A densely connectedSiamese network for change detection of VHR images," IEEE Geosci.Remote Sens.Lett., vol. 19, pp.

[8] C. Zhang et al., "A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images," ISPRS J. Photogramm. Remote Sens., vol. 166, pp. 183-200, Aug. 2020.
[8] C. Zhang 等人，"A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images，" ISPRS J. Photogramm.遥感》，第 166 卷，第 183-200 页，2020 年 8 月。

[9] Y. Liu, C. Pang, Z. Zhan, X. Zhang, and X. Yang, "Building change detection for remote sensing images using a dual-task constrained deep Siamese convolutional network model," IEEE Geosci. Remote Sens. Lett., vol. 18, no. 5, pp. 811-815, May 2021.
[9] Y. Liu, C. Pang, Z. Zhan, X. Zhang, and X. Yang, "Building change detection for remote sensing images using a dual-task constrained deepSiamese convolutional network model," IEEE Geosci.Remote Sens.Lett.5, pp.

[10] A. Dosovitskiy et al., "An image is worth

words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[10] A. Dosovitskiy 等人，"一幅图像胜过

个单词：规模图像识别变换器》，2020 年，arXiv:2010.11929。

[11] W. G. C. Bandara and V. M. Patel, "A transformer-based Siamese network for change detection," in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2022, pp. 207-210.
[11] W. G. C. Bandara and V. M. Patel, "A transformer-basedSiamese network for change detection," in Proc. IEEE Int.Geosci.遥感研讨会，2022 年 7 月，第 207-210 页。

[12] C. Zhang, L. Wang, S. Cheng, and Y. Li, "SwinSUNet: Pure transformer network for remote sensing image change detection," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5224713, doi: 10.1109/TGRS.2022.3160007.
[12] C. Zhang、L. Wang、S. Cheng 和 Y. Li，"SwinSUNet：用于遥感图像变化检测的纯变压器网络"，IEEE Trans.Geosci.60, 2022, Art.5224713, doi: 10.1109/TGRS.2022.3160007.

[13] H. Chen, Z. Qi, and Z. Shi, "Remote sensing image change detection with transformers," IEEE Trans. Geosci. Remote Sens., vol. 60, 2021, Art. no. 5607514, doi: 10.1109/TGRS.2021.3095166.
[13] H. Chen, Z. Qi, and Z. Shi, "Remote sensing image change detection with transformers," IEEE Trans.Geosci.遥感》，第 60 卷，2021 年，艺术编号：5607514。5607514, doi: 10.1109/TGRS.2021.3095166.

[14] Q. Li, R. Zhong, X. Du, and Y. Du, "TransUNetCD: A hybrid transformer network for change detection in optical remote-sensing images," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5622519, doi: 10.1109/TGRS.2022.3169479.
[14] Q. Li, R. Zhong, X. Du, and Y. Du, "TransUNetCD: A hybrid transformer network for change detection in optical remote-sensing images," IEEE Trans.Geosci.60, 2022, Art.5622519, doi: 10.1109/TGRS.2022.3169479.

[15] W. Li, L. Xue, X. Wang, and G. Li, "ConvTransNet: A CNNtransformer network for change detection with multiscale global-local representations," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5610315, doi: 10.1109/TGRS.2023.3272694.
[15] W. Li、L. Xue、X. Wang 和 G. Li，"ConvTransNet：A CNNtransformer network for change detection with multiscale global-local representation," IEEE Trans.Geosci.Remote Sens., vol. 61, 2023, Art.5610315, doi: 10.1109/TGRS.2023.3272694.

[16] Z. Fu, J. Li, L. Ren, and Z. Chen, "SLDDNet: Stagewise short and long distance dependency network for remote sensing change detection," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 3000319, doi: 10.1109/TGRS.2023.3305554.
[16] Z. Fu、J. Li、L. Ren 和 Z. Chen，"SLDDNet：Stagewise short and long distance dependency network for remote sensing change detection," IEEE Trans.Geosci.61, 2023, Art.3000319, doi: 10.1109/TGRS.2023.3305554.

[17] X. Tang, T. Zhang, J. Ma, X. Zhang, F. Liu, and L. Jiao, "WNet: W-shaped hierarchical network for remote-sensing image change detection," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5615814, doi: 10.1109/TGRS.2023.3296383.
[17] X. Tang、T. Zhang、J. Ma、X. Zhang、F. Liu 和 L. Jiao，"WNet：W-shaped hierarchical network for remote-sensing image change detection," IEEE Trans.Geosci.61, 2023, Art.5615814, doi: 10.1109/TGRS.2023.3296383.

[18] Z. Zheng, S. Tian, A. Ma, L. Zhang, and Y. Zhong, "Scalable multi-temporal remote sensing change data generation via simulating stochastic change process," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 21818-21827.
[18] Z. Zheng, S. Tian, A. Ma, L. Zhang, and Y. Zhong, "Scalable multi-temporal remote sensing change data generation via simulating stochastic change process," in Proc.Conf.Comput.(ICCV), Oct. 2023, pp.

[19] H. Chen, W. Li, and Z. Shi, "Adversarial instance augmentation for building change detection in remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5603216, doi: 10.1109/TGRS.2021.3066802.
[19] H. Chen, W. Li, and Z. Shi, "Adversarial instance augmentation for building change detection in remote sensing images," IEEE Trans.Geosci.遥感》，第 60 卷，2022 年，艺术编号：5603216。5603216, doi: 10.1109/TGRS.2021.3066802.

[20] Z. Zheng, A. Ma, L. Zhang, and Y. Zhong, "Change is everywhere: Single-temporal supervised object change detection in remote sensing imagery," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 15193-15202.
[20] Z. Zheng、A. Ma、L. Zhang 和 Y. Zhong，"变化无处不在：IEEE/CVF Int. Conf.Conf.Comput.(ICCV), Oct. 2021, pp.

[21] R. Bommasani et al., "On the opportunities and risks of foundation models," 2021, arXiv:2108.07258.
[21] R. Bommasani 等：《论基础模型的机遇与风险》，2021 年，arXiv:2108.07258。

[22] A. Radford et al., "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748-8763.
[22] A. Radford 等人，"Learning transferable visual models from natural language supervision"，in Proc. Int.Conf.Mach.Learn.，2021 年，第 8748-8763 页。

[23] M. Cherti et al., "Reproducible scaling laws for contrastive languageimage learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 2818-2829.
[23] M. Cherti 等人，"Reproducible scaling laws for contrastive languageimage learning"，Proc. IEEE/CVF Conf.Comput.Vis.Pattern Recognit.(CVPR), Jun. 2023, pp.

[24] F. Liu et al., "RemoteCLIP: A vision language foundation model for remote sensing," 2023, arXiv:2306.11029.
[24] F. Liu 等人，《RemoteCLIP：遥感视觉语言基础模型》，2023，arXiv:2306.11029。

[25] OpenAI et al., "GPT-4 technical report," 2023, arXiv:2303.08774.
[25] OpenAI等人，《GPT-4技术报告》，2023，arXiv:2303.08774。

[26] J. Li, D. Li, C. Xiong, and S. Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation," in Proc. Int. Conf. Mach. Learn., 2022, pp. 12888-12900.
[26] J. Li, D. Li, C. Xiong, and S. Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation," in Proc. Int.Conf.Mach.Learn.，2022 年，第 12888-12900 页。

[27] A. Kirillov et al., "Segment anything," 2023, arXiv:2304.02643.
[27] A. Kirillov 等人，"Segment anything"，2023，arXiv:2304.02643。

[28] G. Mai et al., "On the opportunities and challenges of foundation models for geospatial artificial intelligence," 2023, arXiv:2304.06798.
[28] G. Mai 等人，《论地理空间人工智能基础模型的机遇与挑战》，2023，arXiv:2304.06798。

[29] Z. Chen et al., "Vision transformer adapter for dense predictions," 2022, arXiv:2205.08534
[29] Z. Chen 等：《用于密集预测的视觉变换器适配器》，2022 年，arXiv:2205.08534

[30] D. Yin, Y. Yang, Z. Wang, H. Yu, K. Wei, and X. Sun, "

Parameter-efficient low rank adapter for dense predictions," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 20116-20126.
[30] D. Yin, Y. Yang, Z. Wang, H. Yu, K. Wei, and X. Sun, "

Parameter-efficient low rank adapter for dense predictions," in Proc. IEEE/CVF Conf.计算。Vis.Pattern Recognit.(CVPR), Jun. 2023, pp.

[31] A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer, "How to train your ViT? Data, augmentation, and regularization in vision transformers," 2021, arXiv:2106.10270
[31] A. Steiner、A. Kolesnikov、X. Zhai、R. Wightman、J. Uszkoreit 和 L. Beyer，《如何训练你的 ViT？视觉转换器中的数据、增强和正则化》，2021 年，arXiv:2106.10270

[32] M. Xu, Z. Zhang, F. Wei, H. Hu, and X. Bai, "Side adapter network for open-vocabulary semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 2945-2954.
[32] M. Xu、Z. Zhang、F. Wei、H. Hu 和 X. Bai，"用于开放词汇语义分割的侧适配器网络"，Proc. IEEE/CVF Conf.计算。Vis.Pattern Recognit.(CVPR), Jun. 2023, pp.

[33] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012-10022.
[33] Z. Liu 等人，"Swin transformer：Hierarchical vision transformer using shifted windows"，in Proc. IEEE/CVF Int.Conf.Comput.Vis.（ICCV），2021 年 10 月，第 10012-10022 页。

[34] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 12077-12090.
[34] E. Xie、W. Wang、Z. Yu、A. Anandkumar、J. M. Alvarez 和 P. Luo，"SegFormer：简单高效的变压器语义分割设计"，Proc.Adv.Process.Syst., vol. 34, 2021, pp.

[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf.Comput.Vis.Pattern Recognit.(CVPR), Jun. 2016, pp.

[36] Y. Wu et al., "CSTSUNet: A cross Swin transformer-based Siamese Ushape network for change detection in remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 61, 2023, Art. no. 5623715, doi: 10.1109/TGRS.2023.3326813.
[36] Y. Wu 等人，"CSTSUNet：A cross Swin transformer-basedSiamese Ushape network for change detection in remote sensing images，" IEEE Trans.Geosci.61, 2023, Art.5623715, doi: 10.1109/TGRS.2023.3326813.

[37] Y. Feng, H. Xu, J. Jiang, H. Liu, and J. Zheng, "ICIF-Net: Intrascale cross-interaction and inter-scale feature fusion network for bitemporal remote sensing images change detection," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no.
[37] Y. Feng、H. Xu、J. Jiang、H. Liu 和 J. Zheng，"ICIF-Net：Intrascale cross-interaction and inter-scale feature fusion network for bitemporal remote sensing images change detection," IEEE Trans.Geosci.60, 2022, Art.
4410213, doi 10.1109/TGRS.2022.3168331

[38] B. X. B. Yu et al., "Visual tuning," 2023, arXiv:2305.06061.
[38] B. X. B. Yu 等人，《视觉调谐》，2023，arXiv:2305.06061。

[39] Z. Zhang, T. Zhao, Y. Guo, and J. Yin, "RS5M and GeoRSCLIP: A large scale vision-language dataset and a large vision-language model for remote sensing," 2023, arXiv:2306.11300.

[40] C. Schuhmann, A. Köpf, R. Vencu, T. Coombes, and R. Beaumont (2022). Laion COCO: 600 m Synthetic Captions From LAION2B-EN. [Online]. Available: https://laion.ai/blog/laion-coco/
[40] C. Schuhmann、A. Köpf、R. Vencu、T. Coombes 和 R. Beaumont (2022)。Laion COCO: 600 m Synthetic Captions From LAION2B-EN.[Online].Available: https://laion.ai/blog/laion-coco/

[41] Y.-L. Sung, J. Cho, and M. Bansal, "LST: Ladder side-tuning for parameter and memory efficient transfer learning," in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 12991-13005.
[41] Y.-L. Sung、J. Cho 和 M. Bansal，"LST：用于参数和内存高效迁移学习的梯形边调谐"，《Proc.Adv.Process.Syst., vol. 35, 2022, pp.

[42] E. J. Hu et al., "LoRA: Low-rank adaptation of large language models," 2021, arXiv:2106.09685.
[42] E. J. Hu 等：《LoRA：大型语言模型的低秩适应》，2021，arXiv:2106.09685.

[43] J. O. Zhang, A. Sax, A. Zamir, L. Guibas, and J. Malik, "Side-tuning A baseline for network adaptation via additive side networks," in Proc. Eur. Conf. Comput. Vis., Glasgow, U.K., 2020, pp. 698-714.
[43] J. O. Zhang、A. Sax、A. Zamir、L. Guibas 和 J. Malik，"通过加边网络进行网络适应的边调谐基线"，《欧洲网络会议》（Proc.Eur.Conf.Comput.Vis., Glasgow, U.K., 2020, pp.

[44] W. Lin et al., "Hierarchical side-tuning for vision transformers," 2023, arXiv:2310.05393
[44] W. Lin 等人，《视觉变换器的分层侧调》，2023，arXiv:2310.05393

[45] L. Ding, K. Zhu, D. Peng, H. Tang, K. Yang, and L. Bruzzone, "Adapting segment anything model for change detection in HR remote sensing images," 2023, arXiv:2309.01429.

[46] X. Zhao et al., "Fast segment anything," 2023, arXiv:2306.12156.
[46] X. Zhao 等人，"Fast segment anything"，2023，arXiv:2306.12156.

[47] L. Shen et al., "S2Looking: A satellite side-looking dataset for building change detection," Remote Sens., vol. 13, no. 24, p. 5094, Dec. 2021.
[47] L. Shen 等人，"S2Looking：用于建筑物变化探测的卫星侧视数据集》，《遥感》，第 13 卷，第 24 期，第 5094 页，2021 年 12 月。

[48] W. Gedara Chaminda Bandara and V. M. Patel, "Revisiting consistency regularization for semi-supervised change detection in remote sensing images," 2022, arXiv:2204.08454.
[48] W. Gedara Chaminda Bandara 和 V. M. Patel，《重新审视遥感图像中半监督变化检测的一致性正则化》，2022，arXiv:2204.08454。

[49] C. Pang, J. Wu, J. Ding, C. Song, and G.-S. Xia, "Detecting building changes with off-nadir aerial images," Sci. China Inf. Sci., vol. 66, no. 4, Apr. 2023, Art. no. 140306
[49] C. Pang、J. Wu、J. Ding、C. Song 和 G.-S.Xia, "Detecting building changes with off-nadir aerial images," Sci.66, no.4, Apr. 2023, Art.

[50] P. Yuan, Q. Zhao, X. Zhao, X. Wang, X. Long, and Y. Zheng, "A transformer-based Siamese network and an open optical dataset for semantic change detection of remote sensing images," Int. J. Digit. Earth, vol. 15, no. 1, pp. 1506-1525, Dec. 2022.
[50] P. Yuan, Q. Zhao, X. Zhao, X. Wang, X. Long, and Y. Zheng, "A transformer-basedSiamese network and an open optical dataset for semantic change detection of remote sensing images," Int. J. Dig.J. Digit.15, no. 1, pp.

[51] S. Fang, K. Li, and Z. Li, "Changer: Feature interaction is what you need for change detection," 2022, arXiv:2209.08290.
[51] S. Fang、K. Li 和 Z. Li，"Changer：变化检测需要特征交互"，2022 年，arXiv:2209.08290。
[52] K. Yang et al., "Asymmetric Siamese networks for semantic change detection in aerial images," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1-18, 2021
[52] K. Yang 等人，"AsymmetricSiamese networks for semantic change detection in aerial images"，IEEE Trans.Geosci.遥感》，第 60 卷，第 1-18 页，2021 年。

[53] C. Han, C. Wu, and B. Du, "HCGMNET: A hierarchical change guiding map network for change detection," 2023, arXiv:2302.10420.
[53] C. Han、C. Wu 和 B. Du，"HCGMNET：用于变化检测的分层变化引导图网络"，2023，arXiv:2302.10420。

[54] C. Han, C. Wu, H. Guo, M. Hu, J. Li, and H. Chen, "Change guiding network: Incorporating change prior to guide change detection in remote sensing imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 16, pp. 8395-8407, 2023.
[54] C. Han, C. Wu, H. Guo, M. Hu, J. Li, and H. Chen, "Change guiding network：Incorporating change prior to guide change detection in remote sensing imagery," IEEE J. Sel.Topics Appl.遥感》，第 16 卷，第 8395-8407 页，2023 年。

[55] R. Caye Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, "Multitask learning for large-scale semantic change detection," Comput. Vis. Image Understand., vol. 187, Oct. 2019, Art. no. 102783.
[55] R. Caye Daudt、B. Le Saux、A. Boulch 和 Y. Gousseau，"大规模语义变化检测的多任务学习"，Comput.Vis.图像理解》，第 187 卷，2019 年 10 月，艺术编号 102783。

[56] D. Peng, L. Bruzzone, Y. Zhang, H. Guan, and P. He, "SCDNET: A novel convolutional network for semantic change detection in high resolution optical remote sensing imagery," Int. J. Appl. Earth Observ Geoinformation, vol. 103, Dec. 2021, Art. no. 102465.
[56] D. Peng、L. Bruzzone、Y. Zhang、H. Guan 和 P. He，"SCDNET：A novel convolutional network for semantic change detection in high resolution optical remote sensing imagery," Int. J. Applied Earth Observations Geoinformation, vol. 103.应用地球观测地理信息》，第 103 卷，2021 年 12 月，艺术编号 102465。

[57] L. Ding, H. Guo, S. Liu, L. Mou, J. Zhang, and L. Bruzzone, "Bitemporal semantic reasoning for the semantic change detection in HR remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5620014, doi: 10.1109/TGRS.2022.3154390.
[57] L. Ding, H. Guo, S. Liu, L. Mou, J. Zhang, and L. Bruzzone, "Bitemporal semantic reasoning for the semantic change detection in HR remote sensing images," IEEE Trans.Geosci.60, 2022, Art.5620014, doi: 10.1109/TGRS.2022.3154390.

[58] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez, "Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 2517-2526.
[58] T.-H.Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez, "Advent：Adversarial entropy minimization for domain adaptation in semantic segmentation," in Proc. IEEE/CVF Conf.Comput.Vis.Pattern Recognit., Jun. 2019, pp.

[59] S. Mittal, M. Tatarchenko, and T. Brox, "Semi-supervised semantic segmentation with high-and low-level consistency," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 4, pp. 1369-1379, Apr. 2019
[59] S. Mittal、M. Tatarchenko 和 T. Brox，"具有高低级一致性的半监督语义分割"，IEEE Trans.Pattern Anal.机器。Intell.4, pp.

[60] D. Peng, L. Bruzzone, Y. Zhang, H. Guan, H. Ding, and X. Huang, "SemiCDNet: A semisupervised convolutional neural network for change detection in high resolution remote-sensing images," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 7, pp. 5891-5906, Jul. 2021.
[60] D. Peng、L. Bruzzone、Y. Zhang、H. Guan、H. Ding 和 X. Huang，"SemiCDNet：用于高分辨率遥感图像变化检测的半监督卷积神经网络"，IEEE Trans.Geosci.遥感》，第 59 卷，第 7 期，第 5891-5906 页，2021 年 7 月。

[61] W. Wang et al., "InternImage: Exploring large-scale vision foundation models with deformable convolutions," in Proc. IEEE/CVF Conf. Com put. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 14408-14419.
[61] W. Wang 等人，"InternImage：用可变形卷积探索大规模视觉基础模型"，Proc. IEEE/CVF Conf.Com put.Vis.Pattern Recognit.(CVPR), Jun. 2023, pp.

Kaiyu Li (Member, IEEE) received the B.E. and M.E. degrees from Shandong University of Science and Technology, Qingdao, China, in 2020 and 2023, respectively.

He is currently a Research Assistant with the School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China. His research interests include image processing and remote-sensing visual perception.

Xiangyong Cao received the B.Sc. and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 2012 and 2018, respectively. From 2016 to 2017, he was a Visiting Scholar with Columbia University, New York, NY, USA.

He is currently an Associate Professor with the School of Computer Science and Technology, Xi'an Jiaotong University. His research interests include statistical modeling and image processing.

Deyu Meng received the B.Sc., M.Sc., and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 2001, 2004, and 2008, respectively.

From 2012 to 2014, he took a two-year sabbatical leave at Carnegie Mellon University, Pittsburgh, PA, USA. He is currently a Professor with the School of Mathematics and Statistics, Xi'an Jiaotong University, and an Adjunct Professor with the Faculty of Information Technology, The Macau University of Science and Technology, Macau, China. His research interests include model-based deep learning, variational networks, and meta-learning.

An open source CD toolbox: https://github.com/likyoo/open-cd
