
A New Learning Paradigm for Foundation Model-Based Remote-Sensing Change Detection

Kaiyu Li, Member, IEEE, Xiangyong Cao, and Deyu Meng

Abstract

Change detection (CD) is a critical task to observe and analyze dynamic processes of land cover.

Although numerous deep-learning (DL)-based CD models have performed excellently, their further performance improvements are constrained by the limited knowledge extracted from the given labeled data.

On the other hand, the foundation models that emerged recently contain a huge amount of knowledge by scaling up across data modalities and proxy tasks.

In this article, we propose a bi-temporal adapter network (BAN), a universal foundation model-based CD adaptation framework that aims to extract the knowledge of foundation models for CD. The proposed BAN contains three parts: the frozen foundation model (e.g., CLIP), the bi-temporal adapter branch (Bi-TAB), and the bridging modules between them.

Specifically, BAN extracts general features through a frozen foundation model, which are then selected, aligned, and injected into Bi-TAB via the bridging modules.

Bi-TAB is designed as a model-agnostic concept to extract task/domain-specific features, which can be either an existing arbitrary CD model or some hand-crafted stacked blocks.

Beyond current customized models, BAN is the first extensive attempt to adapt the foundation model to the CD task. Experimental results show the effectiveness of our BAN in improving the performance of existing CD methods (e.g., up to IoU improvement) with only a few additional learnable parameters. More importantly, these successful practices show us the potential of foundation models for remote-sensing (RS) CD. The code is available at https://github.com/likyoo/BAN and will be supported in our Open-CD.

Index Terms: Change detection (CD), deep learning (DL), foundation model, remote-sensing (RS) image processing, visual tuning.


I. Introduction

The dynamic properties of the Earth's surface are influenced by a wide range of natural and anthropogenic factors, resulting in constant change.

Manuscript received 2 December 2023; revised 11 January 2024; accepted 10 February 2024. Date of publication 16 February 2024; date of current version 23 February 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2021ZD0112902 and in part by the China NSFC Projects under Contract 62272375 and Contract 12226004. (Corresponding author: Xiangyong Cao.)
Kaiyu Li and Xiangyong Cao are with the School of Computer Science and Technology and the Ministry of Education Key Laboratory for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an 710049, China (e-mail: likyoo.ai@gmail.com; caoxiangyong@xjtu.edu.cn).
Deyu Meng is with the School of Mathematics and Statistics and the Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China, and also with Pazhou Laboratory (Huangpu), Guangzhou, Guangdong 510555, China (e-mail: dymeng@mail.xjtu.edu.cn).
Digital Object Identifier 10.1109/TGRS.2024.3365825

Therefore, analyzing the time-series remote-sensing (RS) data is an important topic in Earth vision, and change detection (CD) techniques have become an indispensable tool for this task, contributing to the comprehensive interpretation of surface changes. The basic goal of CD is to detect targets or pixels with semantic changes or specific changes between bi-temporal RS images of the same region.

Furthermore, CD can detect, quantify, and analyze changes in land cover and has rich applications in several scenarios, for example, urban expansion [1], [2], natural disaster assessment [3], and cropland protection [4].

Traditional CD methods are based on manual feature extraction, making it difficult to achieve fast and accurate detection in complex scenarios [5]. In the last decade, deep-learning (DL) technologies have greatly contributed to the development of the field of CD. For example, convolutional neural networks (CNNs) have achieved practical success in CD, and their superior learning ability and automatic feature extraction capability make them powerful in handling complex scenarios [1], [6], [7], [8], [9].

Recently, the vision transformer (ViT) [10] has further energized the field of CD [11], [12] since it can handle arbitrary unstructured data and capture global dependencies in the whole image, which opens up more possibilities for CD.

Furthermore, some studies have combined CNNs and ViTs in series or parallel form to explore some more powerful CD capabilities [13], [14], [15], [16], [17]. In summary, these models have achieved practical success to some extent.
However, as the model scale grows, the limited data restricts the performance improvement of the CD models. For instance, the LEVIR-CD dataset [1], a CD benchmark containing 637 pairs of bi-temporal RS images, on which the DL-based methods can achieve -score by 2020 . However, in the following three years, the improvement is only , while the model parameter has grown to hundreds of millions. We consider this to be a performance bottleneck under limited data.

To alleviate this problem, several directions have been explored, for example, generating more simulated data using generative models [18] and constructing more image pairs through data augmentation strategies and single-temporal images [19], [20].

However, compared to the millions or even billions of annotated samples or image-text pairs available for natural images, these generated data are still not able to drive the models toward the emergence of capabilities and homogenization [21].

Recently, the significance of the foundation model has been increasingly recognized in computer vision.

Compared to small models customized for specific tasks and datasets, the foundation models [22], [23], [24] can accumulate general knowledge through large-scale pretraining, thus reducing the dependence on specific labeled data, which inspires us to introduce the foundation models into the CD task.
The foundation model refers to models that have been trained on large-scale datasets and are capable of capturing a wide range of knowledge and understanding.

The foundation model started with natural language processing (NLP) tasks, typically exemplified by OpenAI's GPT series [25], and has been widely explored in computer vision and vision-language tasks, for example, CLIP [22], BLIP [26], SAM [27], and so on.

These large foundation models allow for cross-domain knowledge transfer and sharing and reduce the demand for task-specific training data [28].

In other words, even though in most tasks (e.g., CD) it is difficult to train a task-specific large model from scratch due to the limited data and computational resources, we can resort to the existing foundation models and adapt them to specific tasks.

Some researchers have tried to tune or adapt foundation models to some downstream tasks in natural images and have achieved some success [29], [30].

For RS CD, there are two particular issues for its visual tuning: 1) the transfer of the natural image domain to the RS image domain, and 2) the data structure of the input bi-temporal images changes the overall pipeline of foundation models.
To alleviate the two issues, in this article, we propose a universal framework, that is, the bi-temporal adapter network (BAN), for tuning and adapting foundation models to the CD task. By bridging and combining the foundation model with customized models or modules for CD, BAN can dramatically improve the performance of existing CD models with only a few additional learnable parameters.

Specifically, BAN feeds the bi-temporal images in parallel into a frozen foundation model (i.e., in a Siamese fashion), which can be a ViT trained on ImageNet-21k [31], the image encoder of CLIP [22], [23], RemoteCLIP trained on RS data [24], or any other pretrained image model.

Simultaneously, the bi-temporal images are fed into a bi-temporal adapter branch (Bi-TAB), which can be either an existing CD model, for example, BiT [13], ChangeFormer [11], and so on, or some hand-crafted stacked blocks.

In addition, to alleviate the conflict between the input resolutions [32] of the foundation model and Bi-TAB, the asymmetric resolution input strategy (ARIS) is proposed.

Between the foundation model and the Bi-TAB, we design a series of bridging modules that constantly select, align, and inject the general features in the foundation model space into the learnable Bi-TAB at various hierarchies.

Through these bridging modules, BAN achieves knowledge transfer from general features to the CD task in the RS image domain. The schematic of BAN is shown in Fig. 1.
In summary, the contributions of this work are threefold.
  1. To reduce the dependence on CD-specific data, we propose a universal framework (i.e., BAN) to adapt the foundation model to the CD task, which efficiently reuses the general knowledge of the foundation model and transfers it to the CD task. Additionally, as a side-tuning framework, BAN allows tuning that is efficient in parameters, memory, and time. To our knowledge, this should be the first universal framework to introduce the foundation models into the CD task.
  2. To better select general features and align them with task/domain-specific features, we propose a bridging module between the foundation model and Bi-TAB. By cross-domain dot-product attention, the bridging module can resample general domain knowledge and then inject it into features of the RS CD domain.
  3. Bi-TAB is proposed as a model-agnostic concept, and this branch can be either an existing CD model or some hand-crafted stacked blocks. Benefiting from its plug-and-play property, BAN can be equipped with almost any CD model that exists currently. Experimental results on the benchmark CD datasets demonstrate that BAN improves the corresponding customized CD models on average.

Fig. 1. Schematic of BAN. The frozen pretrained model can be any foundation model (ImageNet-21k pretrained models [31], CLIP [22], [23], RemoteCLIP [24], etc.), the Bi-TAB can be any CD model (BiT [13], ChangeFormer [11], etc.), and the bridging modules can inject the knowledge extracted from the foundation model into the Bi-TAB.
The rest of this article is organized as follows. In Section II, a brief review of related work is presented. Section III describes the specifics of our proposed framework. Section IV provides a series of experimental results and analysis. Finally, Section V concludes this article.

II. Related Work

A. DL-Based CD

Accurate and reliable CD is essential in the analysis of RS images. Over the past few years, numerous CD approaches have been proposed. In particular, CNN-based methods have shown remarkable success in this task. Daudt et al. [6] proposed three U-Net-based fully convolutional Siamese networks, FC-EF, FC-Siam-conc, and FC-Siam-diff, in one of the earliest studies on DL-based CD. Expanding on the FC-Siam-conc model, Zhang et al. [8] used the attention mechanism to construct stronger basic blocks and additionally introduced deep supervision for faster convergence and better performance of the model. Different from the former, STANet [1] uses a metric learning-based approach for CD, which projects the bi-temporal images into a high-dimensional space by a neural network and then applies the L2-distance metric to the bi-temporal high-dimensional mappings to generate the change mask. Fang et al. [7] considered that continuous downsampling in the backbone of previous models leads to loss of spatial information, resulting in ambiguity in the pixels at the edges of the change targets, and therefore proposed SNUNet, which maintains high-resolution, fine-grained representations through dense skip connections between each node in the encoder and the decoder.
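The metric-learning idea described above for STANet can be illustrated with a minimal NumPy sketch. It is not the STANet implementation: the feature maps are random stand-ins for network outputs, and the function name `change_mask_from_features` and the threshold value are assumptions chosen for illustration only.

```python
import numpy as np

def change_mask_from_features(feat_t1, feat_t2, threshold=1.0):
    """Threshold the per-pixel L2 distance between two (H, W, C) feature maps."""
    distance = np.linalg.norm(feat_t1 - feat_t2, axis=-1)  # shape (H, W)
    return distance > threshold, distance

# Toy bi-temporal features: identical except for one "changed" pixel.
rng = np.random.default_rng(0)
feat_t1 = rng.normal(size=(4, 4, 8))
feat_t2 = feat_t1.copy()
feat_t2[1, 2] += 3.0  # simulate a change at row 1, column 2

mask, dist = change_mask_from_features(feat_t1, feat_t2)
```

In a trained model, the embedding network is learned so that unchanged pixel pairs end up close together and changed pairs far apart; the threshold is then a tunable decision boundary.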
The emergence of ViT provides an alternative path for CD tasks. Specifically, ViT splits an image into a series of patches, takes all patches as a sequence, and uses the attention mechanism to capture the global dependencies between patches in the image. Zhang et al. [12] used Swin Transformer blocks [33] to construct the encoder and the decoder and designed a pure Transformer network with a Siamese U-shaped structure. ChangeFormer [11] is a Transformer-based CD model that follows the construction of SegFormer [34], a model for semantic segmentation. In the basic module of ChangeFormer, spatial downsampling of the query, key, and value is performed in the self-attention to reduce computation.
Several studies have found that in CD, due to limited data, the pure Transformer model may not reach its full potential without some inductive bias incorporated. Chen et al. [13] believed that high-level change features can be represented by a few semantic tokens and proposed BiT. In BiT, the bi-temporal features obtained through ResNet [35] are represented as several tokens, then the context is modeled in the token-based space-time using the Transformer encoder, and finally, the tokens with the learned context are fed back into the pixel space to refine the original features through the Transformer decoder. Wu et al. [36] also used ResNet first to extract the bi-temporal features and then passed them to the designed cross Swin Transformer for difference feature extraction. On the other hand, some studies use parallel CNN and Transformer structures. Tang et al. [17] proposed WNet, which uses a Siamese CNN and a Siamese Transformer as encoders and fuses all features in the decoder. Similarly, Feng et al. [37] proposed ICIF-Net, which constructs a dual-branch structure of the CNN and Transformer to capture multiscale local and global features, respectively, and uses cross-attention to fuse them. Fu et al. [16] argued that there is no interaction between the dual parallel branches in the feature extraction processes of ICIF-Net, so they designed a semantic information aggregation module to dynamically implement the interaction between the CNN and Transformer branches in their proposed SLDDNet.
In this work, beyond the customized CD models above, we propose a universal CD framework (i.e., BAN) based on the foundation model. In general, all the above-mentioned CD models can be used as the Bi-TAB in BAN, and BAN can help improve the performance of these existing CD models.

B. Foundation Model and Visual Tuning

Through various pretraining techniques, the foundation model can be scaled up in data modalities, proxy tasks, and so on, thus accumulating knowledge [38], which has gradually become a consensus among researchers.

Following its success, ViT swept the field of computer vision and became the preferred choice for vision foundation models. Steiner et al. [31] conducted the first systematic study of the interplay between data, augmentation, and regularization when pretraining ViT.

CLIP is a pretraining method based on contrastive language-image learning; it is trained to maximize the similarity between the vision and language encodings of matching image-text pairs while minimizing it for nonmatching pairs.
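The contrastive objective described above can be sketched as a symmetric cross-entropy over cosine similarities. This is a simplified illustration rather than CLIP's training code: the function name `clip_style_loss` is hypothetical, the embeddings are random stand-ins for encoder outputs, and the fixed `temperature` is an assumption (in practice it is usually a learned parameter).

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
loss_matched = clip_style_loss(emb, emb)           # pairs aligned
loss_mismatched = clip_style_loss(emb, emb[::-1])  # pairs deliberately misaligned
```

Aligned pairs yield a lower loss than misaligned ones, which is the signal that pulls matching image and text encodings together during pretraining.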

With strong generalization ability and extensibility, CLIP can handle multiple types of image and text data without retraining for specific tasks [32]. Inspired by the CLIP, Liu et al. [24] proposed RemoteCLIP, the first visual-language foundation model for RS.

Although RemoteCLIP has achieved some improvement in some RS tasks, it is still limited by the volume of data, that is, it contains only image-text pairs. For this, Zhang et al. [39] proposed the RS5M dataset containing 5 million RS images with English descriptions by filtering the image-text pair dataset and generating captions for RS images [40].
How to adapt the above-mentioned foundation models to downstream-specific tasks is an important topic currently.

Typical transfer learning methods, such as fully fine-tuning the whole model or fine-tuning only the task head, lead to high training costs or insufficient reuse, respectively. One tradeoff is the parameter-efficient transfer learning (PETL) method [38], which selects a subset of pretrained parameters and/or introduces a limited number of trainable parameters into the foundation model while freezing the majority of the original parameters [41].

Typical PETL methods include prompt tuning, which concatenates some learnable embeddings with the inputs, and adapters, which insert some learnable components into the pretrained model. Hu et al. [42] proposed LoRA, which initializes a low-rank adaptation matrix and inserts it into self-attention in the residual-connected form. Due to the simplicity and effectiveness of LoRA, numerous related studies and practices have emerged [30]. However, parameter efficiency does not mean memory efficiency, and the foundation model tuning remains difficult. To address this issue, Sung et al. [41] proposed the ladder side-tuning (LST) method, which freezes the foundation model and builds a side network for training [32], [43], [44]. In this article, our proposed BAN can be seen as the practice and extension of side-tuning for the CD task.
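To make the low-rank adaptation idea concrete, the following NumPy sketch wraps a frozen weight matrix with a trainable low-rank update added in residual form. The class name `LoRALinear` and the hyperparameters `rank` and `alpha` are illustrative assumptions, and no training loop is shown.

```python
import numpy as np

class LoRALinear:
    """Frozen linear map plus a trainable low-rank residual update (sketch)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                     # trainable up-projection, zero at start
        self.scale = alpha / rank

    def __call__(self, x):
        frozen_out = x @ self.weight.T
        low_rank_out = (x @ self.A.T) @ self.B.T
        return frozen_out + self.scale * low_rank_out

weight = np.ones((6, 5))
layer = LoRALinear(weight, rank=2)
x = np.ones((3, 5))
y = layer(x)  # equals the frozen output at initialization because B is zero
```

Only `A` and `B` would receive gradients, so the number of trainable parameters is `rank * (in_dim + out_dim)` instead of `in_dim * out_dim`. Gradients still flow through the frozen backbone, however, which is the memory cost that side-tuning methods such as LST aim to avoid.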
Most relevant to this article is a concurrent work, SAM-CD [45], which uses frozen Fast-SAM [46] as the encoder and fine-tunes the feature pyramid network and prediction head for CD.

However, compared to BAN, SAM-CD is more of a customized Fast-SAM-based model within a traditional transfer learning paradigm. Therefore, we claim that BAN should be the first universal framework to adapt the foundation model to the CD task.

III. Method

As illustrated in Figs. 1 and 2, BAN can be divided into three parts: the frozen foundation model, Bi-TAB, and the bridging modules between them. The foundation model is generally a ViT at one of various scales. Next, we will review the structure of ViT. Then, we will introduce Bi-TAB and explain why Bi-TAB can be an existing arbitrary CD model.

Finally, we will present the details of the bridging module.

A. Review of ViT

For an image $I \in \mathbb{R}^{H \times W \times 3}$, ViT first divides it into a series of nonoverlapping $P \times P$ patches, where $P$ is typically 14, 16, or 32. Each patch is encoded into a $D$-dimensional vector using a convolutional layer, and the output $x_0 \in \mathbb{R}^{L \times D}$ is termed the patch embedding, where $L = HW/P^2$ and $D$ denotes the constant dimension of ViT. Then, a learnable class token is concatenated to $x_0$ as the whole-image representation. Because self-attention is insensitive to sequence order, a learnable position encoding is subsequently added to the patch embedding.
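The patch-embedding step can be sketched in NumPy as follows. A convolution whose kernel size and stride both equal the patch size is equivalent to a linear projection of each flattened patch, which is what the sketch uses. The function names `patchify` and `embed`, the small embedding dimension, and the random parameters are assumptions for illustration.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened nonoverlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, P, P, C)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed(image, patch_size, proj, cls_token, pos_embed):
    """Project patches, prepend the class token, add the position encoding."""
    tokens = patchify(image, patch_size) @ proj            # (L, D)
    tokens = np.concatenate([cls_token, tokens], axis=0)   # (L + 1, D)
    return tokens + pos_embed

rng = np.random.default_rng(0)
patch_size, dim = 16, 32
image = rng.normal(size=(224, 224, 3))
num_patches = (224 // patch_size) ** 2                     # 14 * 14 = 196
proj = rng.normal(scale=0.02, size=(patch_size * patch_size * 3, dim))
cls_token = np.zeros((1, dim))
pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, dim))

tokens = embed(image, patch_size, proj, cls_token, pos_embed)
```

Because `pos_embed` has one row per token, its length is tied to the input resolution, which is why a change of resolution requires interpolating the position encoding.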
Fig. 2. Detailed illustration of BAN. (a) The detailed structure of BAN. (b) The detailed structure of the bridging module. The foundation model (blue area) accumulates general knowledge through pretraining techniques, and the bridging modules (orange area) select, align, and inject this knowledge into the Bi-TAB. The Bi-TAB (green area) is a model-agnostic concept, which can be an arbitrary customized CD model or even some hand-crafted stacked blocks. These three major components are detailed in Sections III-A to III-C, respectively. Legend: basic block of Bi-TAB; basic block of the foundation model; element-wise addition; dot product; stop gradient; resize (bilinear interpolation).
The basic block of ViT consists of alternating multihead self-attention (MSA) and feed-forward network (FFN), with layer normalization (LN) and residual connection applied in each layer. Formally, a basic block can be expressed as

$$x'_l = \mathrm{MSA}(\mathrm{LN}(x_{l-1})) + x_{l-1}, \qquad x_l = \mathrm{FFN}(\mathrm{LN}(x'_l)) + x'_l, \qquad l = 1, \ldots, N$$

where $N$ denotes the number of basic blocks in ViT, $x_l$ denotes the output of the $l$th block, the FFN consists of two linear layers with a GELU nonlinearity, and the MSA can be formulated as

$$\mathrm{MSA}(x) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \qquad Q = x W_q, \quad K = x W_k, \quad V = x W_v$$

where $Q$, $K$, and $V$ denote query, key, and value, respectively, $W_q$, $W_k$, and $W_v$ are projection parameters, and $d$ is the dimension of the query and key vectors.
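The block structure above can be written out as a short NumPy sketch. Several simplifications are assumed: a single attention head, no biases or learnable LN parameters, the tanh approximation of GELU, and random weights. The parameter names in `params` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token over its channel dimension (no learnable affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def vit_block(x, params):
    """Pre-LN block: attention and FFN, each wrapped in a residual connection."""
    x = x + self_attention(layer_norm(x), params["w_q"], params["w_k"], params["w_v"])
    hidden = gelu(layer_norm(x) @ params["w_1"])
    return x + hidden @ params["w_2"]

rng = np.random.default_rng(0)
dim = 16
shapes = {"w_q": (dim, dim), "w_k": (dim, dim), "w_v": (dim, dim),
          "w_1": (dim, 4 * dim), "w_2": (4 * dim, dim)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

tokens = rng.normal(size=(10, dim))  # 10 tokens of dimension 16
out = vit_block(tokens, params)
```

The output keeps the shape of the input, so blocks can be stacked; a multihead version would split the channels into heads, apply the same attention per head, and concatenate the results.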
In BAN, since two images (bi-temporal images) need to be received as inputs, we extract bi-temporal features simultaneously using a parallel frozen foundation model with shared parameters, that is, a Siamese foundation model. High-resolution images are important for accurate CD; however, most foundation models are trained on low-resolution images. On the other hand, the position embedding of ViT is highly correlated with the image resolution, and if the image resolution differs at inference, a new approximate position encoding needs to be recomputed by interpolation.

In addition, self-attention does not have the translation equivariance that exists in convolution and thus is more sensitive to resolution. Therefore, ARIS is proposed in BAN.

Specifically, for the foundation model, the bi-temporal images are resized to match its pretraining resolution, while for Bi-TAB, the resolution is kept at the input size dictated by the dataset (e.g., 256).
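A minimal sketch of ARIS is given below: the foundation branch receives resized copies of the bi-temporal images, while Bi-TAB receives them at their original size. The function names `resize_bilinear` and `aris_inputs` are hypothetical, the bilinear resize is a plain NumPy implementation written for this example, and the pretraining size of 224 is an assumed value.

```python
import numpy as np

def resize_bilinear(image, out_h, out_w):
    """Bilinear resize of an (H, W, C) array, sampling at pixel centers."""
    in_h, in_w = image.shape[:2]
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, in_h - 1), np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = image[y0][:, x0] * (1 - wx) + image[y0][:, x1] * wx
    bottom = image[y1][:, x0] * (1 - wx) + image[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def aris_inputs(img_t1, img_t2, pretrain_size=224):
    """Return (foundation-branch inputs, Bi-TAB inputs) for a bi-temporal pair."""
    foundation_inputs = tuple(
        resize_bilinear(img, pretrain_size, pretrain_size) for img in (img_t1, img_t2)
    )
    bitab_inputs = (img_t1, img_t2)  # kept at the dataset resolution
    return foundation_inputs, bitab_inputs

img_t1 = np.full((256, 256, 3), 0.5)
img_t2 = np.full((256, 256, 3), 0.5)
(found_t1, found_t2), (bitab_t1, bitab_t2) = aris_inputs(img_t1, img_t2)
```

Keeping the foundation branch at its pretraining resolution avoids interpolating the position encoding, and the spatial detail lost there is supplied by the full-resolution Bi-TAB branch.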
Different from task-specific models that are trained on specific datasets, foundation models are trained in a self-supervised or semi-supervised manner on large-scale data and can adapt their knowledge and capabilities to several downstream tasks.

Under similar ViT architectures, a crucial difference between various foundation models is the training data and training strategy. In this article, we mainly use CLIP's image encoder (ViT-L/14) [22], [23] as the foundation model part in BAN.

In the experimental part, we compare the foundation models under different weights and architectures, and the details can be found in Section IV-E2.

B. Bi-TAB

Since the foundation model part is frozen, it can only extract general information from the image. However, this information is incomplete, and Bi-TAB is thus needed for domain-specific and task-specific feature extraction. Bi-TAB is not designated as an explicit and concrete network, but rather as a model-agnostic concept.

Therefore, Bi-TAB can be either a customized CD model, including existing CNN-based and Transformer-based models, or several hand-crafted stacked blocks. In our work, we recommend the former since it allows BAN to better benefit from evolving customized CD models.
A modern customized CD model usually contains a backbone (encoder) and a prediction head (decoder), and its backbone is always a Siamese network. Thus, a customized CD model can simply be represented as

$$M = \mathrm{Head}(\mathrm{Backbone}(T_1), \mathrm{Backbone}(T_2))$$

where $T_1$ and $T_2$ denote the bi-temporal images, $M$ denotes the change mask, and the Backbone and the Head consist of a sequence of basic blocks as shown in Fig. 2(a). From the perspective of Bi-TAB, as shown in Fig. 3, in addition to the original bi-temporal images, an additional general feature bank is provided by the foundation model, from which Bi-TAB can arbitrarily take out features.

Since both are Siamese structures, information from the foundation model can be easily injected into the Bi-TAB's backbone of the corresponding temporal phase, which is why almost any CD model can be chosen as the Bi-TAB. With the support of the foundation model, Bi-TAB can easily obtain general information about the image, which is typically learned from a large amount of data. In this way, the model's requirement for data is reduced.

Fig. 3. Illustration of the Bi-TAB perspective in BAN, with BiT [13] and ChangeFormer [11] as examples. For better presentation, the color renderings follow their original literature [11], [13]. (a) BiT as Bi-TAB. (b) ChangeFormer as Bi-TAB.
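The Siamese formulation above, a shared backbone applied to both dates followed by a head that compares the two feature maps, can be sketched with toy components. The per-pixel linear backbone, the mean-absolute-difference head, and the threshold are placeholders chosen for illustration and do not correspond to any particular CD model.

```python
import numpy as np

def backbone(image, weight):
    """Toy shared backbone: per-pixel linear projection followed by ReLU."""
    return np.maximum(image @ weight, 0.0)  # (H, W, C_out)

def head(feat_t1, feat_t2, threshold):
    """Toy prediction head: threshold the mean absolute feature difference."""
    return np.abs(feat_t1 - feat_t2).mean(axis=-1) > threshold  # (H, W) mask

def cd_model(img_t1, img_t2, weight, threshold=0.5):
    """M = Head(Backbone(T1), Backbone(T2)) with weights shared across dates."""
    return head(backbone(img_t1, weight), backbone(img_t2, weight), threshold)

rng = np.random.default_rng(0)
weight = np.full((3, 4), 0.1)           # the same weights serve both temporal inputs
img_t1 = rng.uniform(size=(8, 8, 3))
img_t2 = img_t1.copy()
img_t2[0, 0] += 10.0                     # simulate a change at one pixel

mask = cd_model(img_t1, img_t2, weight)
```

Because the backbone is applied separately to each date, features coming from a Siamese foundation model can be added to the intermediate features of the matching date without altering this interface.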
As presented in Figs. 2(a) and 3, modern customized CD models can be easily embedded into BAN. Specifically, in this article, we mainly use two typical customized CD models as Bi-TAB: the CNN-Transformer-based BiT [13] and the Transformer-based ChangeFormer [11] (denoted by "CF"). In the original CF, the backbone network is a customized mix transformer (MiT) [34], which limits the scalability of the model. Therefore, in this article, we use the scalable MiT-b0 to MiT-b5 settings in SegFormer [34]. Furthermore, to validate the performance of BAN in the semantic CD (SCD) task, we remodel CF into CF-SCD by adding two additional semantic segmentation prediction heads.

C. Bridging Module

Next, a natural question is how to inject the general features from the foundation model into , and we found that there are two critical issues.
接下来,一个自然的问题是如何将基础模型的一般特征注入 ,我们发现有两个关键问题。
  1. Not all the general knowledge is needed, so we should consider how to filter out the useless information from the general knowledge and select the valuable information.
  2. With ARIS and patch embedding, the features in the foundation model have extremely low resolution after heavy downsampling, so aligning the features from the two sources is an important issue.
To address these two issues, we propose the bridging module, which is shown in Fig. 2(b).
There are two types of features in our framework: the general feature from the foundation model and the task-specific feature from the Bi-TAB. In the bridging module, to avoid inconsistency between the distributions of the two, we first normalize the general feature with a layer normalization (LN) layer and then project it with a linear layer.
To effectively select the valuable characteristics in the general knowledge, we calculate an affinity matrix between the projected general feature and the task-specific feature using the cosine metric, and obtain the attention weights by scaling the affinity matrix and applying a Softmax function. The attention weights are then multiplied with the projected general feature to obtain the filtered feature.
It is worth noting that the foundation model feature is resampled to the size of the task-specific feature in this process, which also solves the scale misalignment problem. Finally, residual connections from the task-specific feature and the projected general feature are adopted to get the final output of the bridging module, which is also the input of the next basic block of Bi-TAB. In this step, Resize() denotes upsampling the projected general feature to the size of the task-specific feature using bilinear interpolation.
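The select-and-align step of the bridging module can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `x_f`, `x_b`, `weight`, and `tau` are hypothetical, the scaling factor `tau` stands in for the divisor applied to the affinity matrix, the linear layer has no bias, and the residual from the bilinearly resized foundation feature is omitted for brevity.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no affine terms)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def linear(v, weight):
    """Project a token with a weight of shape (out_dim, in_dim); bias omitted."""
    return [sum(w * a for w, a in zip(row, v)) for row in weight]


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bridge(x_f, x_b, weight, tau=1.0):
    """x_f: foundation-model tokens; x_b: Bi-TAB tokens; tau: assumed scale."""
    # LN followed by a linear projection into the Bi-TAB channel dimension.
    x_f_proj = [linear(layer_norm(t), weight) for t in x_f]
    out = []
    for q in x_b:
        # One row of the affinity matrix: cosine similarity to every general token.
        attn = softmax([cosine(q, k) / tau for k in x_f_proj])
        # Attention-weighted sum of the projected general tokens (filtered feature).
        filtered = [sum(a * k[c] for a, k in zip(attn, x_f_proj))
                    for c in range(len(q))]
        # Residual connection from the Bi-TAB feature.
        out.append([qb + f for qb, f in zip(q, filtered)])
    return out
```

Because each Bi-TAB position produces exactly one output token, the result always has the spatial size of the Bi-TAB feature regardless of how many foundation-model tokens are supplied, which is how the attention step also handles the resolution mismatch.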

D. Loss Function

The loss function of BAN can directly follow the plain loss function of Bi-TAB. Specifically, the pixel-wise cross-entropy loss is adopted for BiT and ChangeFormer in this article. It is computed from the label and the prediction of each pixel and taken over the height and width of the output. For SCD with auxiliary tasks, the same loss is also applied to the semantic segmentation subtask.
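As an illustration of the loss described above, the following is a minimal sketch of a pixel-wise cross-entropy averaged over the height and width of the output. It assumes the standard form with per-pixel class probabilities and a mean reduction; the function and argument names are hypothetical and not taken from Open-CD.

```python
import math


def pixelwise_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy over an H x W output.

    probs[h][w][k] is the predicted probability of class k at pixel (h, w);
    labels[h][w] is the integer class index of that pixel.
    """
    height, width = len(labels), len(labels[0])
    total = 0.0
    for h in range(height):
        for w in range(width):
            # Negative log-likelihood of the labeled class at this pixel.
            total -= math.log(probs[h][w][labels[h][w]] + eps)
    return total / (height * width)
```

For binary CD the class index is 0 (no change) or 1 (change); for the SCD auxiliary heads the same function would be applied with the semantic class indices.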


A. Datasets

To thoroughly validate the performance of BAN, we conducted extensive experiments on the following five datasets consisting of RGB images. Note that BAN and all compared methods are trained and evaluated on these datasets without using any extra data.
  1. LEVIR-CD dataset [1] comprises a collection of 637 pairs of bi-temporal RS images. These images were obtained from Google Earth and are accompanied by over 31333 annotated instances of changes. Each image in pairs is with a spatial resolution of pixels. To ensure standardized evaluation, the dataset is divided into three subsets: training, validation, and testing, containing 445, 64, and 128 image pairs.
  2. S2Looking [47] is a building CD dataset that contains large-scale side-looking satellite images captured at varying angles from the nadir. It consists of 5000 coregistered bi-temporal image pairs of rural areas around the world, with the size of and spatial resolution of , and in total 65920 annotated change instances. It provides two annotated maps for each pair of images, indicating new and demolished building areas, which we use to generate a change map.
  3. WHU-CD dataset [2] contains a pair of images taken in the same area from 2012 and 2016, each with a size of pixels. The original images are first cropped into several nonoverlapping image patches of size . Following [48], these patches are divided into 5947/743/744 for training, validation, and testing, and part of the annotated training data are selected for the semi-supervised case.
  4. BANDON dataset [49] contains a large number of off-nadir aerial images with a spatial resolution of pixels, collected mainly from Google Earth, Microsoft Virtual Earth, and ArcGIS, and the data are divided into 1689/202/392 for training, validation, and testing, respectively, with a size of . To better evaluate the generalization ability of models, the testing set is divided into an in-domain set and an out-domain set (collected from different cities) containing 207 and 185 image pairs.

    With multiple types of annotations, the BANDON dataset can also be used for the SCD task.
  5. The Landsat-SCD dataset [50] consists of 8468 image pairs collected between 1990 and 2020, each with a size of and a spatial resolution of . This dataset is annotated with a "no-change" category and ten types of semantic changes, which contain four classes of land cover (farmland, desert, building, and water).

    Since there is no official division, we divided the dataset into training, validation, and testing sets in the ratio of .

B. Implementation Details

We implement BAN using Open-CD, our PyTorch-based CD toolkit. During training, we use the cross-entropy loss and the AdamW optimizer. The learning rate is set to 0.0001, with a separate value for Bi-TAB's prediction head, for all models and datasets, and it is decayed using the poly schedule. Following [51], we use random crop (for the LEVIR-CD, S2Looking, and BANDON datasets; no extra crop for the WHU-CD and Landsat-SCD datasets), flip, and photometric distortion for data augmentation. Following common settings, we train for a fixed number of iterations on the LEVIR-CD, S2Looking, and BANDON datasets, and for 100 and 50 epochs on the WHU-CD and Landsat-SCD datasets, respectively. All experiments are performed on NVIDIA GeForce RTX 4090 and NVIDIA RTX A6000 GPUs, and the batch size is set to 8. Unless otherwise specified, CLIP's ViT-L/14 is used as the foundation model part of BAN, and ARIS is enabled by default.
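For readers unfamiliar with the poly schedule mentioned above, the sketch below shows its usual form. The base learning rate of 0.0001 comes from the text; the exponent `power`, the floor `min_lr`, and the function name are assumptions chosen for illustration, since the text does not state them.

```python
def poly_lr(step, max_steps, base_lr=1e-4, power=1.0, min_lr=0.0):
    """Polynomial ("poly") learning-rate decay.

    The rate falls from base_lr at step 0 to min_lr at max_steps.
    power and min_lr are illustrative defaults, not values from the paper.
    """
    # Fraction of training completed, clamped so late steps stay at min_lr.
    progress = min(step, max_steps) / max_steps
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr
```

With `power=1.0` the decay is linear; larger exponents keep the rate higher early on and drop it faster toward the end of training.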

C. Evaluation Metrics

For binary change detection (BCD) tasks, we use the intersection over union (IoU), the F1 score, and overall accuracy (OA) as evaluation metrics, which are calculated as follows:
$$\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad \mathrm{OA} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, which are calculated on the change category to avoid the class imbalance problem. Precision, $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$, and recall, $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, are also calculated on the change category.
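The BCD metrics can be computed directly from the four confusion counts of the change category. The sketch below uses the standard definitions; the function name and the convention of returning 0.0 for an empty denominator are assumptions, not part of the paper.

```python
def binary_cd_metrics(tp, tn, fp, fn):
    """Precision, recall, F1, IoU, and OA from change-category counts."""

    def safe_div(num, den):
        # Assumed convention: a metric with an empty denominator is 0.0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    iou = safe_div(tp, tp + fp + fn)
    oa = safe_div(tp + tn, tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "iou": iou, "oa": oa}
```

Note that OA is dominated by the unchanged pixels when changes are rare, which is why IoU and F1 on the change category are the more informative scores for CD.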
For the SCD task, we use mIoU and the separated kappa coefficient (SeK) [52] as evaluation metrics, which are calculated as follows:
where is calculated on the nonchange category, corresponding to