
ChangeMamba: Remote Sensing Change Detection with Spatio-Temporal State Space Model

Hongruixuan Chen, Student Member, IEEE, Jian Song, Chengxi Han, Student Member, IEEE,
Junshi Xia, Senior Member, IEEE, Naoto Yokoya, Member, IEEE

Abstract

Convolutional neural networks (CNN) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have their inherent shortcomings.

Recently, the Mamba architecture, based on state space models, has shown remarkable performance in a series of natural language processing tasks and can effectively compensate for the shortcomings of the above two architectures.

In this paper, we explore for the first time the potential of the Mamba architecture for remote sensing change detection tasks.

We tailor the corresponding frameworks, called MambaBCD, MambaSCD, and MambaBDA, for binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA), respectively.

All three frameworks adopt the cutting-edge visual Mamba architecture as the encoder, which allows full learning of global spatial contextual information from the input images.

For the change decoder, which is present in all three architectures, we propose three spatio-temporal relationship modeling mechanisms that can be naturally combined with the Mamba architecture and fully exploit its attributes to achieve spatio-temporal interaction of multi-temporal features and obtain accurate change information.

On five benchmark datasets, our proposed frameworks outperform current CNN- and Transformer-based approaches without using any complex strategies or tricks, fully demonstrating the potential of the Mamba architecture. Specifically, our frameworks achieve leading F1 scores on the three BCD datasets SYSU, LEVIR-CD+, and WHU-CD, on the SCD dataset SECOND, and in terms of the overall F1 score on the xBD dataset. The source code will be available at https://github.com/ChenHongruixuan/MambaCD

Index Terms-State space model, Mamba, Binary change detection, semantic change detection, building damage assessment, spatio-temporal relationship, optical high-resolution images

I. INTRODUCTION

CHANGE detection (CD) has been a popular field within the remote sensing community since the inception of remote sensing technology. It aims to detect changes in objects on the Earth's surface from multi-temporal remote sensing images acquired at different times. Depending on the desired result from the change detector, CD tasks can be categorized into three types, namely binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA). Now, CD techniques play an important role in many fields, including land cover change analysis, urban sprawl studies, disaster response, geographic information system (GIS) updating, and ecological monitoring [1]-[6].
Manuscript submitted April xx, 2024. This work was supported in part by the Council for Science, Technology and Innovation (CSTI), the Cross-ministerial Strategic Innovation Promotion Program (SIP), Development of a Resilient Smart Network System against Natural Disasters (Funding agency: NIED), JSPS KAKENHI under Grant Number 22H03609, JST FOREST under Grant Number JPMJFR206S, Microsoft Research Asia, and the Next Generation AI Research Center, The University of Tokyo.

(+: equal contribution; Corresponding author: Naoto Yokoya)
Hongruixuan Chen, Jian Song, and Naoto Yokoya are with Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan (e-mail: qschrx@gmail.com; song@ms.k.u-tokyo.ac.jp; yokoya@k.u-tokyo.ac.jp).
Chengxi Han is with State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China (e-mail: chengxihan@whu.edu.cn).
Junshi Xia is with the RIKEN Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, 103-0027, Japan (e-mail: junshi.xia@riken.jp).
Optical high-resolution remote sensing imagery has become one of the most applied and researched data sources in the field of CD. It provides detailed textural and geometric structural information about surface features, allowing changes to be detected at a finer level of detail.

Due to the increased heterogeneity within the same land-cover class brought about by the increase in spatial resolution, it is difficult for traditional pixel-based CD methods [7]-[9] to achieve satisfactory detection results.

To cope with this, researchers proposed object-based CD methods [10], which take objects consisting of homogeneous pixels as the basic unit for CD.

However, these methods rely on hand-designed "shallow" features, which are not representative and robust enough to cope with complex ground conditions in multi-temporal high-resolution images [11].
The emergence of deep learning has brought new models and paradigms for CD, greatly improving its efficiency and accuracy. After the early days of developing image patch-based approaches with tiny deep learning models [12]-[14], the release of a number of large-scale benchmark datasets [15] allowed us to start designing and introducing even larger and deeper models.

Convolutional neural network (CNN)-based methods have been dominant within the field of CD after the introduction of fully convolutional networks (FCNs) to the field by Daudt et al. [16].

By employing models within the computer vision field as well as a priori knowledge of the CD task, many representative methods have been proposed [3], [4], [17]-[19].

Despite the decent results achieved by these methods, the inherent shortcomings of the CNN model structure, i.e., the inability of the limited receptive field to capture long-range dependencies between pixels, make these methods still fall short when dealing with complex and diverse multi-temporal scenes in images with different spatial-temporal resolutions [20].

The emergence of the visual Transformer [21] provides an effective way to solve the above problems. By means of the self-attention operation, the Transformer architecture can efficiently model the relationships between all the pixels of the whole image. Currently, more and more CD architectures adopt Transformer architectures as the encoder to extract representative and robust features [22]-[24] and utilize them in the decoder to capture the spatio-temporal relationship between multi-temporal features [20], [25], exhibiting superior performance.
Nonetheless, the self-attention operation has quadratic complexity with respect to image size, resulting in expensive computational overhead, which is detrimental for dense prediction tasks, e.g., land-cover mapping, object detection, and CD, on large-scale remote sensing datasets. Some available solutions, such as limiting the size of the computational window or the step size to improve the efficiency of attention, come at the cost of imposing limits on the size of the receptive field [26], [27].

As a viable alternative to the Transformer architecture, state space sequence models (SSMs), especially structured state space sequence models (S4) [28], have shown cutting-edge performance in continuous long-sequence data analysis, with the promising attribute of scaling linearly in sequence length.

The Mamba architecture [29] further improves on S4 by utilizing a selection mechanism that allows the model to select relevant information in an input-dependent manner.

By combining it with hardware-aware implementations, Mamba outperforms Transformers on a number of downstream tasks with high efficiency. Very recently, the Mamba architecture has been extended to image data and has shown promising results on some visual tasks [30], [31].

However, its potential for high-resolution optical images and different CD subtasks is still under-explored. How to design a suitable architecture based on a priori knowledge of CD tasks is undoubtedly valuable for subsequent research in the community.
In this paper, we present the ChangeMamba architecture, an efficient Mamba model for remote sensing CD tasks. ChangeMamba can efficiently model the global spatial context, thereby showing very effective results on the three subtasks of CD, namely BCD, SCD, and BDA.

Specifically, ChangeMamba is based on the recently proposed VMamba architecture [30], which adopts a Cross-Scan Module (CSM) to unfold image patches in different spatial directions to achieve effective modeling of global contextual information of images.

In addition, since CD tasks require the detector to adequately capture the spatio-temporal relationships between multi-temporal images, we design a cross-spatio-temporal state space module (CST-Mamba), which can adequately model the spatio-temporal relationships between multi-temporal features by combining them in different ways, and thus efficiently detect the different categories of changes, including binary changes, semantic changes, and building damage levels.
In summary, the main contributions of this paper are as follows:
  1. For the first time, we explore the application of the Mamba architecture in the field of remote sensing change detection, thereby achieving accurate and efficient CD.
  2. Based on the Mamba architecture and incorporating the unique characteristics of CD tasks, we design corresponding network frameworks for each of the three CD tasks: BCD, SCD, and BDA.
  3. Three spatio-temporal relationship modeling mechanisms are proposed, tailored for CD tasks, with the aid of the Mamba architecture to fully learn spatio-temporal features.
  4. The proposed three frameworks show very competitive and even SOTA performance on BCD, SCD, and BDA tasks on five benchmark datasets, demonstrating their superiority. The source code for this work is publicly available to support subsequent research.
The remainder of this paper is organized as follows. Section II reviews related work. Section III elaborates on the three ChangeMamba architectures proposed for BCD, SCD, and BDA, respectively. The experimental results and discussion are provided in Section IV.

Section V draws the conclusion.
II. RELATED WORK

In this section, we focus on reviewing deep learning-based CD methods that use optical high-resolution remote sensing images as the data source.

A. CNN-Based Method

The rapid evolution of CNNs and their prowess in extracting local features led to their early application in CD tasks. In the realm of CNN-based BCD, an initial and simple approach [16] involved combining bi-temporal images into a single input for processing by an FCN.

In addition, it introduced two variations that employ siamese encoders to handle dual inputs.

Given the roots of FCNs in semantic segmentation, [17] sought to tailor these networks more closely to BCD tasks by integrating feature change maps as a deep supervision signal, thus refining the final prediction.

Addressing the limited receptive field of CNNs, [13] expanded this framework by combining long short-term memory (LSTM) networks with CNNs. Furthermore, Fang et al. [32] designed a Siamese architecture with denser connections to facilitate more effective shallow information exchange for BCD. To combat the challenge of uneven foreground-background distributions within BCD, Han et al. [33] employed a novel sampling approach and an attention mechanism to integrate global information more effectively.

After that, they further introduced an innovative self-attention module that not only enhances long-distance dependency, but also specifically targets the refinement of BCD predictions, significantly improving the delineation of change area boundaries and reducing the occurrence of inner holes [34].
Compared to BCD, which aims to identify "where" changes have occurred, SCD seeks to determine "what" the changes are. Although this poses a greater challenge, it is more important for downstream applications in remote sensing.

In the field of CNN-based SCD, FCNs remain the core structure.

Early explorations in SCD include [35], which adopts a multitask learning framework to simultaneously predict a binary change mask and land cover mapping, using the latter to assist in generating the final semantic change prediction.

The approach in [36] represents an early attempt to combine recurrent neural networks (RNNs) and CNNs for SCD, where the CNN extracts semantic features and the RNN models temporal dependencies to classify multiple types of changes.

In [37] and [11], unsupervised SCD methods were developed, using pretrained CNNs or unsupervised learning techniques to extract temporal features, then comparing these features to classify changes. [38] introduced a benchmark dataset for SCD along with a new evaluation metric.

Building on [35], Ding et al. [39] proposed a more innovative architecture to address the insufficient exchange of information between the semantic and change branches, introducing new loss functions and supervision signals to force the network to correlate two branches.

Zheng et al. [18] continues to strengthen the connection between the two branches, designing a temporal-symmetric transformer (TST) module to model the causal relationship between semantic and change representations.

In [40], channel attention is used to embed change information into temporal features. In [41], temporal salient features are used and fused to extract and improve the representation of changes.
Compared to the advances in CNN-based methods for BCD and SCD, the progression of BDA methodologies has been relatively gradual. BDA intricately categorizes areas affected by disasters into four distinct classes: "No damage", "Minor damage", "Major damage", and "Destroyed".

The procedural approach to BDA typically encompasses two primary phases: identifying the location of the building and assessing the level of damage incurred. This bifurcation essentially positions BDA as a nuanced subset of SCD, with a focus on quantifying damage levels. Among established approaches, the xView2 challenge [42] introduced a baseline method that employs a UNet+ResNet cascade for building localization and damage classification, trained with pre- and post-disaster imagery.

The Siamese-UNet, the winning approach of the xView2 challenge [42], utilizes a UNet framework for both pixel-level building localization and damage-level classification.

It innovatively initializes the damage classification UNet with the parameters from the building localization UNet, enhancing performance through a multi-model ensemble of four UNets.

Further advancements [43] draw from SCD solutions and the MaskRCNN architecture to address BDA, incorporating Siamese encoders, multitask heads, and instance-level predictions.

An evolved approach [3] introduces a hybrid encoder architecture that merges shared and task-specific weights, alongside an FCN-based multitask decoder. This development enhances task integration with an object-based post-processing strategy, resulting in more refined outcomes.
The development of CNN-based CD methods has matured significantly over time. However, despite their excellent ability to extract local features, CNN architectures inherently struggle to capture long-distance dependencies.

This limitation becomes particularly critical in CD tasks, where change regions, compared to background areas, are often sparse and scattered. The network's ability to model relationships beyond local areas is essential for accurately identifying these change regions.

In this context, Mamba, evolved from RNN architectures, stands out for its exceptional ability to model non-local relationships, making it ideally suited for CD tasks.

Its architecture leverages the sequential processing strength of RNNs, enabling it to effectively understand the spatial context and dependencies over larger areas, thus overcoming one of the key limitations of traditional CNN-based approaches in CD.

B. Transformer-Based Method

Recently, with the rise of transformers in computer vision tasks, their superior long-distance modeling capabilities, compared to CNNs, have made them particularly effective in the field of BCD, SCD, and BDA.

In the realm of BCD, [20] emerges as a pioneering approach, transforming images into semantic tokens to model spatial-temporal contexts within a token-based framework, enhancing the change detection process. Bandara et al. [22] employ a pure transformer-based Siamese network that uses a structured transformer encoder coupled with an MLP decoder, eliminating the need for a CNN-based feature extractor. At the same time, Zhang et al. [24] propose a pure transformer-driven model, adopting a Siamese U-shaped structure and processing images in parallel to extract and merge multiscale features.

It leverages Swin Transformer blocks across its architecture, further demonstrating the Transformer's capability to refine change detection outcomes. Li et al. [25] introduce a hybrid model that blends the global-context understanding of Transformers with the semantic detail resolution of UNet.

It eliminates UNet's redundant information and enhances feature quality through a difference enhancement module, leading to more precise change area delineation.
As for transformer-based SCD, Niu et al. [44] uses a PRTB backbone, combining preactivated residual and transformer blocks, to extract semantic features from image pairs, which is crucial for detecting subtle changes.
至于基于变换器的 SCD,Niu 等人[44]使用 PRTB 骨干,结合预激活的残差块和变换器块,从图像对中提取语义特征,这对于检测微妙的变化至关重要。

Its Multi-Content Fusion Module (MCFM) precisely differentiates between the foreground and background. The model's efficacy is further amplified by multitask prediction branches and a custom loss function, ensuring nuanced semantic change detection with heightened accuracy.

Alternatively, Ding et al. [45] introduces the Semantic Change Transformer (SCanFormer), a CSWin Transformer adaptation, to explicitly chart semantic transitions.

The model utilizes a triple encoder-decoder architecture to improve spatial features, and it is distinguished by incorporating spatio-temporal dynamics and task-specific expertise, leading to a decrease in learning disparities.

This approach significantly boosts accuracy on benchmarks, adeptly managing scenarios with sparse change samples and semantic labels.
As for Transformer-based BDA, Chen et al. [23] use a Siamese Mix Transformer (MiT) encoder and a lightweight All-MLP decoder to efficiently process bi-temporal image pairs.

It features deep feature extraction and a multitemporal fusion module for improved information integration, and a dual-task decoder for precise building localization and damage classification.

By adopting strategies like weight sharing and innovative fusion mechanisms, it efficiently handles high-resolution imagery with lower computational costs, making it effective and streamlined for BDA.
Although transformer-based methods have significantly enhanced CD accuracy through their powerful non-local modeling capabilities, they come with substantial computational resource demands. Specifically, as the number of tokens increases, the consumption of computational resources escalates quadratically, posing challenges for processing large-scale input.

However, Mamba, an architecture inherited from RNNs, not only rivals transformers in non-local modeling capabilities but also maintains computational resource consumption at a linear growth rate.
Moreover, spatio-temporal fusion remains a fundamental aspect of CD tasks, where seamless and efficient integration of time and space information is essential to achieve accurate detection results.

Traditional methods, whether they employ CNNs or transformers, tend to follow a rigid, modular stacking approach for spatio-temporal fusion.

In contrast, the Mamba framework introduces a novel paradigm by treating tokens as an inherently ordered sequence, offering a dynamic range of spatio-temporal fusion possibilities.

By manipulating the order in which tokens are processed, Mamba can generate a spectrum of fusion outcomes, marking a significant departure from conventional methods.

This innovative approach to spatiotemporal fusion provides Mamba with unparalleled flexibility and effectiveness in addressing CD tasks.

C. State Space Model

The SSM concept first gained attention with the S4 model [28], offering a new way to handle contextual information globally, notable for its attractive property of scaling linearly in sequence length. Based on S4, Smith et al. [46] proposed the S5 model by introducing MIMO SSM and efficient parallel scan into the S4 model. After that, the H3 model [47] further advanced these concepts, achieving performance on par with Transformers in language modeling tasks.

Recently, Gu et al. [29] proposed a data-dependent SSM layer and built a generic language model backbone called Mamba, which outperforms Transformers at different scales on large-scale real data and scales linearly in sequence length.

Very recently, VMamba [30] and Vision Mamba [31] have extended the Mamba architecture to 2D image data, showing superior performance on many computer vision tasks.

Inspired by this progress, some pioneering work has introduced the Mamba architecture to the field of remote sensing image processing [48], [49].

Their studies show that the Mamba architecture can yield performance comparable to advanced CNN and Transformer architectures on scene classification [49] and pansharpening [48] tasks.

However, these works still focus on either low-level tasks or classification tasks, and all of these tasks are single-temporal tasks [50].

The potential of the Mamba architecture for multi-temporal remote sensing image-related scenarios and dense prediction tasks remains to be explored.

III. METHODOLOGY

A. Preliminaries

The SSM-based models and Mamba are inspired by linear time-invariant systems, which map a 1-D function or sequence $x(t) \in \mathbb{R}$ to a response $y(t) \in \mathbb{R}$ through a hidden state $h(t) \in \mathbb{R}^{N}$. These systems are usually formulated as linear ordinary differential equations (ODEs):

$$h'(t) = \mathbf{A} h(t) + \mathbf{B} x(t), \quad y(t) = \mathbf{C} h(t), \tag{1}$$

where $\mathbf{A} \in \mathbb{R}^{N \times N}$ is the evolution parameter and $\mathbf{B} \in \mathbb{R}^{N \times 1}$, $\mathbf{C} \in \mathbb{R}^{1 \times N}$ are the projection parameters.

The S4 and Mamba models are discrete counterparts of this continuous system, since continuous systems face significant challenges when integrated into deep learning algorithms. They contain a time-scale parameter $\mathbf{\Delta}$ that is used to convert the continuous parameters $\mathbf{A}$ and $\mathbf{B}$ into their discrete counterparts $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$. A prevalent way to achieve this transformation is the zero-order hold (ZOH) approach:

$$\overline{\mathbf{A}} = \exp(\mathbf{\Delta}\mathbf{A}), \quad \overline{\mathbf{B}} = (\mathbf{\Delta}\mathbf{A})^{-1}\left(\exp(\mathbf{\Delta}\mathbf{A}) - \mathbf{I}\right) \cdot \mathbf{\Delta}\mathbf{B}. \tag{2}$$

After that, the discretization of Eq. (1) can be formulated as

$$h_{t} = \overline{\mathbf{A}} h_{t-1} + \overline{\mathbf{B}} x_{t}, \quad y_{t} = \mathbf{C} h_{t}. \tag{3}$$

Finally, the output of a sequence with length $L$ can be calculated directly by the following global convolution operation:

$$\mathbf{y} = \mathbf{x} \circledast \overline{\mathbf{K}}, \quad \overline{\mathbf{K}} = \left(\mathbf{C}\overline{\mathbf{B}}, \mathbf{C}\overline{\mathbf{A}}\overline{\mathbf{B}}, \ldots, \mathbf{C}\overline{\mathbf{A}}^{L-1}\overline{\mathbf{B}}\right), \tag{4}$$

where $\overline{\mathbf{K}} \in \mathbb{R}^{L}$ is a structured convolutional kernel and $\circledast$ denotes the convolution operation.
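To make the discretized recurrence concrete, the sketch below unrolls Eq. (3) step by step for a single sequence. This is our own reference illustration, not the hardware-aware parallel-scan implementation used by Mamba, and the tensor layout and the simplified discretization of B are assumptions for illustration.

```python
import torch

def selective_scan_reference(x, A, B, C, delta):
    """Reference (non-parallel) unrolling of the discretized SSM recurrence.
    x: (L, D) input sequence, A: (D, N) evolution parameter,
    B, C: (L, N) input-dependent projections, delta: (L, D) time-scale.
    Returns y: (L, D). Illustrative only; Mamba uses a hardware-aware parallel scan."""
    L, D = x.shape
    h = torch.zeros(D, A.shape[1])
    outputs = []
    for t in range(L):
        A_bar = torch.exp(delta[t].unsqueeze(-1) * A)        # zero-order hold: exp(delta * A)
        B_bar = delta[t].unsqueeze(-1) * B[t].unsqueeze(0)    # simplified (first-order) discretization of B
        h = A_bar * h + B_bar * x[t].unsqueeze(-1)            # h_t = A_bar h_{t-1} + B_bar x_t
        outputs.append((h * C[t].unsqueeze(0)).sum(dim=-1))   # y_t = C h_t
    return torch.stack(outputs)
```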

B. Problem Statement

In this paper, we focus on three sub-tasks within the CD field: binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA). The definitions of the three tasks are as follows.
  1. Binary Change Detection: BCD is the basic and most intensively studied task in the CD field. It focuses on "where" change occurs. According to the class of interest, BCD can be further categorized into class-agnostic CD, focusing on general land-cover change information, and single-class object CD, e.g., building CD. Given a training set of multi-temporal image pairs acquired at two times, each with a corresponding binary change label, the goal of BCD is to train a change detector on this set that can predict change maps reflecting accurate "change / non-change" information on new image pairs.
  2. Semantic Change Detection: Compared to BCD, SCD focuses not only on "where" the change occurs, but also on "what" the change is, i.e., the "from-to" semantic change information [18], [39], [45]. Compared to BCD, the training set additionally requires the corresponding land-cover labels of the two acquisition times. The goal of SCD is to train a semantic change detector on this set that can predict the land-cover maps of the multi-temporal images and the binary change maps between them on new sets as accurately as possible. By combining the information in the predicted land-cover maps and binary change maps, the "from-to" semantic change information can be derived.

Fig. 1: The network framework of the proposed (a) MambaBCD for binary change detection, (b) MambaSCD for semantic change detection, and (c) MambaBDA for building damage assessment.
  3. Building Damage Assessment: BDA is a special "one-to-many" SCD task [3]. BDA requires recognizing not only where the change (damage) occurs but also the post-event state of the object (building), under the condition that the pre-event state of the object is unique. The training set in the BDA task therefore consists of pre- and post-event image pairs with a label of the object of interest (building) at the pre-event time and the post-event state (damage level) of the object of interest.
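As a concrete illustration of these task definitions, the snippet below shows what the inputs and labels of one training sample might look like in tensor form. It is our own example; the patch size and class counts are assumptions (e.g., six land-cover classes as in SECOND and four damage levels plus background as in xBD).

```python
import torch

# Illustrative tensor shapes for one training sample of each task
# (the patch size 256 and the class counts below are assumed for illustration).
H = W = 256
x_t1 = torch.rand(3, H, W)            # pre-event image
x_t2 = torch.rand(3, H, W)            # post-event image

# BCD: one binary "change / non-change" map
y_bcd = torch.randint(0, 2, (H, W))

# SCD: two land-cover maps (e.g., 6 classes as in SECOND) plus a binary change map
y_lc_t1 = torch.randint(0, 6, (H, W))
y_lc_t2 = torch.randint(0, 6, (H, W))
y_change = torch.randint(0, 2, (H, W))

# BDA: a building localization map and a damage-level map
# (background / no damage / minor / major / destroyed, as in xBD)
y_loc = torch.randint(0, 2, (H, W))
y_dmg = torch.randint(0, 5, (H, W))
```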

C. Network Architecture

For these three CD subtasks, we design corresponding network frameworks based on the Mamba architecture, called MambaBCD, MambaSCD, and MambaBDA, respectively. Fig. 1 shows the network framework of these three architectures.

Among them, the encoders of the three networks are weight-sharing Siamese networks based on the Visual State Space Model (VMamba) architecture [30].

VMamba can adequately extract the robust and representative features of the input images benefiting from the Mamba architecture and an efficient 2D cross-scan mechanism (as shown in Fig. 3). Its specific structure will be introduced in Section III-D.
  1. MambaBCD: MambaBCD is the architecture designed for the BCD task. First, the Siamese encoder network extracts multi-level features from the input bi-temporal images. Next, these multi-level features are fed into a tailored change decoder. Based on the Mamba architecture, the change decoder fully learns the spatio-temporal relationships from the multi-level features through three different mechanisms and gradually obtains an accurate BCD result, from which the binary change map is derived.
  2. MambaSCD: MambaSCD is the architecture designed for the SCD task. As shown in Fig. 1-(b), MambaSCD adds two semantic decoders for the land-cover mapping tasks on top of MambaBCD. In addition to being fed into the change decoder to learn spatio-temporal relationships and predict the change result, the multi-level features extracted by the encoder are also fed into the two semantic decoders to predict the land-cover map of the corresponding temporal image. After obtaining the two land-cover maps and the binary change map, the semantic change information can be obtained by masking the land-cover maps with the binary change map.
Fig. 2: The specific structure of the encoder network based on the visual state space model used in our three architectures.
  3. MambaBDA: MambaBDA is the architecture presented for the BDA task. Unlike the typical SCD task, the BDA task only needs to predict land-cover maps (building localization maps) for pre-event images. Therefore, in the MambaBDA architecture shown in Fig. 1-(c), there is only one semantic decoder for predicting the building localization map. Like MambaBCD and MambaSCD, a change decoder then learns spatio-temporal relationships from the multi-temporal features to classify the damage level of buildings. The obtained building localization map can be used for further post-processing of the damage classification map to improve accuracy, e.g., with object-based post-processing methods [3].
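The forward logic shared by the three frameworks can be summarized by the sketch below. It is an illustrative outline under assumed module names (`encoder`, `change_decoder`, `semantic_decoder_t1/t2`), not the released implementation.

```python
import torch.nn as nn

class ChangeMambaSketch(nn.Module):
    """Illustrative outline of the MambaBCD / MambaSCD / MambaBDA pipelines."""
    def __init__(self, encoder, change_decoder, semantic_decoder_t1=None, semantic_decoder_t2=None):
        super().__init__()
        self.encoder = encoder                      # weight-sharing VMamba-based encoder
        self.change_decoder = change_decoder        # spatio-temporal change decoder
        self.semantic_decoder_t1 = semantic_decoder_t1
        self.semantic_decoder_t2 = semantic_decoder_t2

    def forward(self, x_t1, x_t2):
        feats_t1 = self.encoder(x_t1)               # multi-level features of the pre-event image
        feats_t2 = self.encoder(x_t2)               # multi-level features of the post-event image
        outputs = {"change": self.change_decoder(feats_t1, feats_t2)}
        if self.semantic_decoder_t1 is not None:    # MambaSCD / MambaBDA: building or land-cover map
            outputs["semantic_t1"] = self.semantic_decoder_t1(feats_t1)
        if self.semantic_decoder_t2 is not None:    # MambaSCD only: post-event land-cover map
            outputs["semantic_t2"] = self.semantic_decoder_t2(feats_t2)
        return outputs
```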

D. Encoder Based on Visual State Space Model

The S6 model in Mamba causally processes the input data and thus can only capture information within the scanned part of the data. This attribute naturally aligns with natural language data, but poses significant challenges when adapting to non-causal data such as images.

Directly flattening image data along a certain spatial dimension and feeding it into the S6 model results in spatial context information not being adequately modeled. Recently, visual Mamba solved this problem by proposing a 2D cross-scan mechanism. As shown in Fig. 3, before inputting the tokens into the S6 model, the cross-scan mechanism rearranges the tokens along four spatial routes, i.e., top-left to bottom-right, bottom-right to top-left, top-right to bottom-left, and bottom-left to top-right. Finally, the resulting features are merged.

In this way, any pixel can obtain spatial context information from different directions. Moreover, the computational complexity of the Mamba model under the cross-scan mechanism remains linear in the number of tokens, in contrast to the quadratic complexity of the self-attention technique in the Transformer.
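The snippet below sketches how such a cross-scan unfolding can be realized. It is an illustrative re-implementation of the idea rather than the official VMamba code: it produces row-major and column-major scans together with their reversed counterparts, which together cover the four scanning routes described above.

```python
import torch

def cross_scan(feat):
    """feat: (B, C, H, W) feature map. Returns four 1-D token sequences of
    shape (B, C, H*W), each corresponding to one scanning route."""
    row_major = feat.flatten(2)                          # top-left -> bottom-right, row by row
    col_major = feat.transpose(2, 3).flatten(2)          # top-left -> bottom-right, column by column
    row_major_rev = torch.flip(row_major, dims=[-1])     # bottom-right -> top-left, row by row
    col_major_rev = torch.flip(col_major, dims=[-1])     # bottom-right -> top-left, column by column
    return row_major, col_major, row_major_rev, col_major_rev
```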
Fig. 3: Illustration of the self-attention mechanism [21], [51] and the cross-scan mechanism [30] for capturing global contextual information.

Based on the Visual Mamba architecture, the specific structure of the encoder network in our three proposed architectures is shown in Fig. 2. There are four stages, each of which first downsamples the input data, then fully models the spatial contextual information using a number of visual state space (VSS) blocks, and then outputs the features of that stage. The structure of the VSS block is also depicted in Fig. 2. The input first passes through a linear embedding layer, and the output splits into two information flows. One flow passes through a depth-wise convolution (DWConv) layer, followed by a SiLU activation function [52], before entering the core SS2D module (i.e., the integration of S6 with the cross-scan mechanism).

The output of the SS2D module passes through a layer normalization (LN) layer and is then summed with the output of the other flow, which has been activated by SiLU. This combination produces the final output of the VSS block. Finally, the features from the four stages are used in the subsequent decoders responsible for the specific tasks.
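A high-level sketch of the four-stage encoder is given below. It is our illustrative outline: the `downsample_layers` and `stages` modules and their configuration are assumed rather than taken from the released code, and the multi-level features it returns are what the task-specific decoders consume.

```python
import torch.nn as nn

class VSSEncoderSketch(nn.Module):
    """Illustrative four-stage encoder: each stage downsamples and then applies
    several VSS blocks, and the per-stage outputs form the multi-level features."""
    def __init__(self, downsample_layers, stages):
        super().__init__()
        # downsample_layers: list of 4 modules (patch embedding / patch merging)
        # stages: list of 4 nn.Sequential containers of VSS blocks
        self.downsample_layers = nn.ModuleList(downsample_layers)
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        features = []
        for down, stage in zip(self.downsample_layers, self.stages):
            x = down(x)          # reduce spatial resolution, increase channels
            x = stage(x)         # model global spatial context with VSS blocks
            features.append(x)   # keep the feature map of this stage
        return features          # four feature maps at decreasing resolutions
```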

Fig. 4: Illustration of the three mechanisms proposed in this paper for learning spatio-temporal relationships based on the state space model.

E. Task-Specific Decoders

  1. Spatio-Temporal Relationship Modeling Mechanism: Although the employed encoder can extract robust and representative features, learning the spatio-temporal relationships of multi-temporal images is also important for CD tasks [18], [36].

    To this end, we propose three mechanisms for modeling spatio-temporal relationships that align with the ability of the S6 model to adequately model the global contextual information of long sequential data. Fig. 4 illustrates the three modeling mechanisms: spatio-temporal sequential modeling, spatio-temporal cross modeling, and spatio-temporal parallel modeling.

    Sequential modeling unfolds the tokens of the two temporal phases and arranges them one after the other in temporal order. The cross modeling mechanism, on the other hand, interleaves the tokens of the two temporal phases in an alternating order. Finally, parallel modeling concatenates the tokens of the two temporal phases along the channel dimension and then performs joint modeling (see the sketch after this list). Through these three mechanisms and the Mamba architecture, the spatio-temporal relationships intrinsic to the multi-temporal features are fully explored, helping the decoder to obtain accurate change detection results.
  2. Change Decoder: Based on the proposed three spatio-temporal learning mechanisms, the specific structure of the change decoder is shown in Fig. 5. It fully learns the spatio-temporal relationships from the extracted multi-temporal features in four stages and obtains accurate binary change maps. At the beginning of each stage, the spatio-temporal relationship of the multi-temporal features is first modeled using a spatio-temporal state space (STSS) block.

    In the STSS block, a spatio-temporal token generator module rearranges the input multi-temporal features, which are then fed into three VSS blocks. Each block is responsible for learning one of the spatio-temporal relationships in Fig. 4. The output of the STSS block of the current stage is then integrated with the information from the feature map of the previous stage through a fusion module.

    After passing through an upsampling layer, the feature map is fed to the next stage.
  3. Semantic Decoder: The specific structure of the semantic decoder is shown in Fig. 6. It is mainly responsible for gradually recovering the class-agnostic or object-specific land-cover maps from the corresponding multi-level features extracted by the encoder.

    It also has four stages. At the beginning of each stage, the global spatial context information of input data is first modeled using a VSS block.

    Then the feature map is upsampled and integrated with information about the lower-level feature map with higher resolution through a fusion module.

    In the fusion module, the number of channels of the low-level feature map is mapped to be consistent with the high-level feature map by a convolutional layer. Then the high-level and low-level feature maps are summed. Finally, the resulting feature map is smoothed by a residual layer.
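The spatio-temporal token generator at the heart of the STSS block can be illustrated with the following sketch. It is our simplified interpretation of the three mechanisms; the tensor layouts and function name are assumptions.

```python
import torch

def build_spatiotemporal_tokens(tokens_t1, tokens_t2):
    """tokens_t1, tokens_t2: (B, L, C) token sequences of the two temporal phases.
    Returns the three arrangements fed to the three VSS blocks of an STSS block."""
    # Sequential modeling: all tokens of T1 followed by all tokens of T2
    sequential = torch.cat([tokens_t1, tokens_t2], dim=1)            # (B, 2L, C)

    # Cross modeling: tokens of T1 and T2 interleaved position by position
    B, L, C = tokens_t1.shape
    cross = torch.stack([tokens_t1, tokens_t2], dim=2).reshape(B, 2 * L, C)

    # Parallel modeling: tokens of T1 and T2 concatenated along the channel dimension
    parallel = torch.cat([tokens_t1, tokens_t2], dim=2)              # (B, L, 2C)
    return sequential, cross, parallel
```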

F. Loss Function

Since this paper focuses on exploring the potential of the Mamba architecture for CD, we optimize the networks using loss functions commonly used in the field. Tricks for improving accuracy such as the focal loss [53], deep supervision [17], and multi-scale training/testing [6] are not adopted.

Nevertheless, in the experimental section, we can find that the proposed architecture achieves very competitive and even SOTA performance on all three subtasks.
  1. BCD: Since BCD can be regarded as a special semantic segmentation task [50], we directly optimize the network using the cross-entropy loss

$$\mathcal{L}_{ce} = -\sum_{i} y_{i} \log \hat{y}_{i},$$

where $y$ is the one-hot form of the reference change label and $\hat{y}$ is the output of the binary change detector after the softmax activation.
In addition, the Lovasz-softmax loss [54] is introduced to alleviate the imbalance in the number of samples between changed and unchanged pixels. The final loss function for the BCD task combines the cross-entropy loss with the Lovasz-softmax loss.
  2. SCD: For the SCD task, in addition to optimizing the BCD task, it is also necessary to optimize the land-cover mapping task for both the pre-event and post-event images, which is likewise done with the cross-entropy loss. The probabilistic land-cover classification maps for the pre-event and post-event images are output by the two semantic decoders of the MambaSCD architecture, and each has its corresponding cross-entropy loss.
Similarly, the Lovasz-softmax loss function is used to mitigate the sample imbalance problem. Thus, the final loss function for the SCD task combines the BCD loss with the cross-entropy and Lovasz-softmax losses of the two land-cover mapping branches.

Fig. 5: The specific structure of the change decoder used in the MambaBCD, MambaSCD, and MambaBDA architectures.

Fig. 6: The specific structure of the semantic decoder in the MambaSCD and MambaBDA architectures.
  3. BDA: BDA contains a building localization task and a building damage classification task. Each task is optimized with a cross-entropy loss computed on the output of the localization branch and the classification branch of the MambaBDA architecture, respectively.
By further utilizing the Lovasz-softmax loss function to address the sample imbalance problem, the final loss function of BDA combines the cross-entropy and Lovasz-softmax losses of the two branches.
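Putting the pieces together, the training objectives could be assembled as in the sketch below. The equal weighting of the terms and the `lovasz_softmax` helper are assumptions for illustration; the Lovasz-softmax loss itself comes from [54] and is not part of PyTorch.

```python
import torch.nn.functional as F

def bcd_loss(change_logits, change_label, lovasz_softmax):
    # Cross-entropy plus Lovasz-softmax, assumed to be combined with equal weights.
    return F.cross_entropy(change_logits, change_label) + \
           lovasz_softmax(F.softmax(change_logits, dim=1), change_label)

def scd_loss(change_logits, lc_logits_t1, lc_logits_t2,
             change_label, lc_label_t1, lc_label_t2, lovasz_softmax):
    # BCD loss plus cross-entropy and Lovasz-softmax terms of the two land-cover branches.
    loss = bcd_loss(change_logits, change_label, lovasz_softmax)
    for logits, label in [(lc_logits_t1, lc_label_t1), (lc_logits_t2, lc_label_t2)]:
        loss = loss + F.cross_entropy(logits, label) + \
               lovasz_softmax(F.softmax(logits, dim=1), label)
    return loss

def bda_loss(loc_logits, dmg_logits, loc_label, dmg_label, lovasz_softmax):
    # Building localization and damage classification branches, each with CE + Lovasz-softmax.
    loss = 0.0
    for logits, label in [(loc_logits, loc_label), (dmg_logits, dmg_label)]:
        loss = loss + F.cross_entropy(logits, label) + \
               lovasz_softmax(F.softmax(logits, dim=1), label)
    return loss
```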

IV. EXPERIMENTS AND DISCUSSIONS

A. Datasets

  1. SYSU-CD [15]: This dataset introduces a comprehensive collection of 20,000 pairs of 0.5-meter resolution aerial images from Hong Kong, spanning 2007 to 2014, to advance the field of CD. It is distinguished by its focus on urban and coastal changes, featuring high-rise buildings and infrastructure developments, areas where CD poses significant challenges due to shadow and deviation effects. For effective deep learning (DL) application, the dataset is structured into training, validation, and test sets following a 6:2:2 ratio, with 20,000 patches in total. Covering diverse change scenarios such as urban construction, suburban expansion, groundwork, vegetation changes, road expansion, and sea construction, the SYSU-CD dataset is positioned as a pivotal resource for CD research.
  2. WHU-CD [55]: The WHU-CD dataset, a segment of the larger WHU Building dataset, is tailored for the task of building change detection. It comprises two aerial datasets from Christchurch, New Zealand, captured in April 2012 and 2016, at a high image resolution. This dataset is particularly focused on detecting changes in large and sparse building structures. The 2012 imagery features 12,796 buildings, while the 2016 imagery shows an increase to 16,077 buildings within the same area, reflecting significant urban development over the four-year period. We cropped these large training and testing images into non-overlapping patches for our experiments.
  3. LEVIR-CD+ [56]: This advanced version of the LEVIR dataset introduces a comprehensive resource for CD research. It comprises 985 pairs of very high-resolution Google Earth images. Spanning a time interval of 5 to 14 years, these bi-temporal images document significant building construction changes. The dataset encompasses a wide array of building types, including villa residences, tall apartments, small garages, and large warehouses, with a focus on both the emergence of new buildings and the decline of existing structures. It is meticulously annotated with binary labels by experts in remote sensing image interpretation. Each image pair undergoes a rigorous two-step quality assurance process involving initial annotation and subsequent verification to ensure the accuracy of change instances. The final dataset features 31,333 instances of building changes, making LEVIR-CD+ a valuable benchmark for evaluating CD methodologies.
  4. SECOND [38]: The SECOND dataset introduces a collection of 4,662 pairs of aerial images, annotated for SCD in cities such as Hangzhou, Chengdu, and Shanghai. It focuses on six primary land-cover classes: non-vegetated ground surfaces, trees, low vegetation, water, buildings, and playgrounds, to capture a wide range of geographical changes. Annotations are done by experts, ensuring high accuracy at the pixel level. The dataset uses land-cover map pairs and non-change masks to accurately represent change, facilitating the differentiation of changed from unchanged pixels within the same class. This methodological approach ensures its utility in addressing SCD challenges by providing a diverse and accurately labeled dataset for model training and evaluation.
  5. xBD [42]: The xBD dataset, pivotal for advancing CD and BDA in disaster response, integrates pre- and post-disaster satellite imagery with over 850,000 building annotations. It stands as the largest dataset of its kind, enriched by environmental data and detailed damage levels for a broad spectrum of disaster events. Developed in collaboration with disaster response agencies, xBD is engineered to refine emergency response strategies, including resource allocation and aid distribution. The dataset adheres to stringent criteria, featuring high-resolution imagery, encapsulating multiple levels of damage, and covering a diverse range of disaster types and building structures. By including both damaged and undamaged sites, xBD is crafted to foster the creation of adaptive models that effectively navigate the complexities of various disaster scenarios.
The information of these five datasets is summarized in Table I.

B. Experimental Setup

  1. Implementation Details: All three of our proposed architectures are implemented in PyTorch. Depending on the size and depth of the encoder network, our proposed MambaBCD, MambaSCD, and MambaBDA architectures are available in Tiny, Small, and Base versions [30]. As shown in Table II, the difference between them lies in the number of VSS blocks within each stage and the number of channels in the feature maps. For datasets other than the SYSU dataset, the multi-temporal image pairs and associated labels are cropped into patches for input to the network, and the data at the original size are then inferred using the trained networks on the test set. During training, we optimize the networks using the AdamW optimizer [57] with fixed learning rate and weight decay, without any learning-rate decay strategy. The batch size is set to 16. Except for the SYSU dataset, where the number of training iterations is set to 20,000, we set the number of training iterations to 50,000 on the remaining four datasets. Random rotation and left-right and top-bottom flips are used as training data augmentation. Our source code will be open-sourced for community reproduction and subsequent research.
  2. Assessment Metrics: For evaluating model performance across the different subtasks, we employ a set of metrics tailored to the specific requirements of BCD, SCD, and BDA.
In assessing BCD, we use the following metrics (a short computation sketch follows the list):
  • Recall (Rec): The proportion of actual positives that were correctly identified, defined as $\mathrm{Rec} = TP / (TP + FN)$, where $TP$ represents true positives and $FN$ denotes false negatives.
  • Precision (Pre): Reflecting the accuracy of positive predictions, precision is defined as $\mathrm{Pre} = TP / (TP + FP)$, with $FP$ standing for false positives.
  • Overall Accuracy (OA): The ratio of correctly predicted observations to the total observations, $\mathrm{OA} = (TP + TN) / (TP + TN + FP + FN)$, where $TN$ indicates true negatives.
  • F1 Score (F1): Balancing both precision and recall, the F1 score is their harmonic mean, given by $\mathrm{F1} = 2 \cdot \mathrm{Pre} \cdot \mathrm{Rec} / (\mathrm{Pre} + \mathrm{Rec})$.
TABLE I: Information of the five benchmark datasets used for experiments.
Dataset | Study site | Number of image pairs | Image size | Spatial resolution | Evaluation task
SYSU | Hong Kong | 20,000 | | | BCD (class-agnostic)
LEVIR-CD+ | Texas of the US | 985 | | | BCD (single-object)
WHU-CD | Christchurch, New Zealand | 2 | | | BCD (single-object)
SECOND | Hangzhou, Chengdu, and Shanghai | 4,662 | | | SCD
xBD | 15 countries | 22,068 | | | BDA
TABLE II: Number of layers and feature channels for encoders in the ChangeMamba series.
Stage | Tiny | Small | Base
I | | |
II | | |
III | | |
IV | | |
  • Intersection over Union (IoU): Also known as the Jaccard index, this metric measures the overlap between the predicted and actual positives: $\mathrm{IoU} = TP / (TP + FP + FN)$.
  • Cohen's Kappa ($\kappa$): Assessing the agreement between two raters beyond chance, Cohen's Kappa is given by $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the expected agreement by chance.
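For reference, the BCD metrics listed above can all be computed from the entries of the binary confusion matrix, as in the short sketch below (our own helper function, not part of the evaluation protocol of any specific benchmark):

```python
def bcd_metrics(tp, fp, fn, tn):
    """Compute Rec, Pre, OA, F1, IoU, and Cohen's Kappa from confusion-matrix counts."""
    total = tp + fp + fn + tn
    rec = tp / (tp + fn)
    pre = tp / (tp + fp)
    oa = (tp + tn) / total
    f1 = 2 * pre * rec / (pre + rec)
    iou = tp / (tp + fp + fn)
    # Expected agreement by chance for the two-class case
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (total ** 2)
    kappa = (oa - pe) / (1 - pe)
    return {"Rec": rec, "Pre": pre, "OA": oa, "F1": f1, "IoU": iou, "Kappa": kappa}
```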
For SCD, the metrics used include:
  • Overall Accuracy (OA) and F1 Score (F1): As defined for BCD.
  • Mean Intersection over Union (mIoU): Averaging the IoU across classes, $\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_{c}$ for $C$ classes, aiming to provide a balanced measure across different classes.
  • Separated Kappa (SeK) coefficient: Specifically proposed in [38] to alleviate the influences of label imbalance in SCD model evaluation, is defined in a cited work.
    分离卡帕(SeK)系数: 是在 [38] 中为减轻 SCD 模型评估中标签不平衡的影响而专门提出的,在一篇被引用的著作中进行了定义。
For BDA, we follow the protocol of the xView2 Challenge and split BDA into two subtasks, i.e., the building localization and damage classification tasks. The following metrics are adopted to measure the performance of each subtask and the overall BDA performance:
• F1 Score for Localization (F1_loc) and Classification (F1_cls): These metrics assess accuracy in building localization and damage classification, respectively.
• Overall F1 Score (F1_oa): This reflects the combined performance in both the building localization and damage classification tasks.
• F1 Score for Different Damage Level Classes: This evaluates the model's performance across the various damage extents.
Employing this comprehensive set of metrics ensures a balanced and thorough evaluation, offering a detailed assessment of the model's capabilities across a range of CD tasks.
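As a concrete illustration, the sketch below combines the subtask scores in the xView2 manner with which the numbers in Table IX are consistent: the damage classification score is the harmonic mean of the per-class F1 scores, and the overall score weights localization and classification by 0.3 and 0.7. It is a minimal reference example, not the official scoring script.

```python
def harmonic_mean(scores):
    """Harmonic mean of per-class F1 scores (all values assumed > 0)."""
    return len(scores) / sum(1.0 / s for s in scores)

def bda_overall_f1(f1_loc, per_class_f1):
    """xView2-style overall score: 0.3 * localization F1 + 0.7 * damage F1."""
    f1_cls = harmonic_mean(per_class_f1)
    return 0.3 * f1_loc + 0.7 * f1_cls, f1_cls

# Example with the xView2 baseline numbers from Table IX (percentages):
overall, cls = bda_overall_f1(80.47, [66.31, 14.35, 0.94, 46.57])
print(round(cls, 2), round(overall, 2))   # ~3.42 and ~26.54
```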

C. Comparison Methods

To rigorously assess the performance of Mamba in CD tasks, we strategically selected a diverse range of methods based on CNN and Transformer architectures for benchmarking across three distinct tasks: BCD, SCD, and BDA.
For BCD, our comparison includes a suite of CNN-based methodologies: FC-EF [16], FC-Siam-Diff [16], FC-SiamConc [16], and SiamCRNN [13] (utilizing ResNet architectures ranging from ResNet-18 to ResNet-101), alongside SNUNet [32], DSIFN [17], HANet [33], and CGNet [34].

On the Transformer front, we compare against ChangeFormer (versions 1 to 4) [22], BIT (with ResNet-18 to ResNet-101 as the backbone for feature extraction) [20], TransUNetCD [25], and SwinSUNet [24].
In the SCD arena, CNN-based competitors encompass HRSCD (variants S1 to S4) [35], ChangeMask [18], SSCD-1 [39], Bi-SRNet [39], and TED [45]. For Transformer-based methodologies, we evaluate against SMNet [44] and ScanNet [45].
Lastly, for the BDA task, we selected the xView2 baseline [42], Siamese-UNet (the winning solution of the xView2 challenge) [42], MaskRCNN [43], and ChangeOS [3] (employing ResNet-18 to ResNet-101) as CNN-based benchmarks. DamFormer [23] stands as the solitary Transformer-based comparison method.
For the above methods, if the detection performance was reported in the original paper on the corresponding dataset, we directly adopt the accuracy in the original paper.

If not, we train and test the method on the corresponding dataset based on the hyperparameters recommended in the original paper, using the loss function and data augmentation methods consistent with our method.

D. Benchmark Comparison in Three CD Subtasks

1. BCD: Tables III to V list the performance of the proposed MambaBCD architecture and the comparison methods on the three benchmark datasets.
TABLE III: Accuracy assessment for different binary CD models on the SYSU dataset, where indicates CNN-based models, indicates Transformer-based models, and indicates Mamba-based models. The table highlights the highest values in red, and the second-highest results in blue
Type Method Rec Pre OA F1 IoU KC
FC-EF [16] 75.17 76.47 88.69 75.81 61.04 68.43
FC-Siam-Diff [16] 75.30 76.28 88.65 75.79 61.01 68.38
FC-Siam-Conc [16] 76.75 73.67 88.05 75.18 60.23 67.32
SiamCRNN-18 [13] 76.83 84.80 91.29 80.62 67.54 75.02
SiamCRNN-34 [13] 76.85 85.13 91.37 80.78 67.75 75.23
SiamCRNN-50 [13] 78.40 83.41 91.23 80.83 67.82 75.15
SiamCRNN-101 [13] 80.40 90.77 80.44 67.28 74.40
SNUNet [32] 72.21 74.09 87.49 73.14 57.66 64.99
DSIFN [17] 75.83 89.59 78.80 65.02 71.92
HANet [33] 76.14 78.71 89.52 77.41 63.14 70.59
CGNet [34] 74.37 91.19 79.92 66.55 74.31
ChangeFormerV1 [22] 75.82 79.65 89.73 77.69 63.52 71.02
ChangeFormerV2 [22] 75.62 78.14 89.26 76.86 62.42 69.87
ChangeFormerV3 [22] 75.24 79.46 89.57 77.29 62.99 70.53
ChangeFormerV4 [22] 77.90 79.74 90.12 78.81 65.03 72.37
BIT-18 [20] 76.42 84.85 91.22 80.41 67.24 74.78
BIT-34 [20] 74.63 82.40 90.26 78.32 64.37 72.06
BIT-50 [20] 77.90 81.42 90.60 79.62 66.14 73.51
BIT-101 [20] 75.58 83.64 90.76 79.41 65.84 73.47
TransUNetCD [25] 77.73 82.59 90.88 80.09 66.79 74.18
SwinSUNet [24] 79.75 83.50 91.51 81.58 68.89 76.06
In these tables, we differentiate the BCD architectures into CNN-based methods, Transformer-based methods, and our Mamba-based methods.

It can be seen that our approaches significantly outperform either the CNN-based approaches or the Transformer-based architectures.

The proposed MambaBCD-Base and MambaBCD-Small architectures achieve the highest and second-highest OA, F1, IoU, and KC for both the class-agnostic CD task (SYSU dataset) and the single-object CD tasks (LEVIR-CD+ and WHU-CD datasets), fully demonstrating the potential of the Mamba architecture for the BCD task. Fig. 7 to Fig. 9 show some binary change maps predicted by our three architectures on the test sets of the three datasets.

It can be seen that the proposed methods accurately detect changes with varying types, scales, and numbers contained in these image pairs, very close to the change reference maps.
Table VI lists the network parameters and GFLOPs of these BCD architectures. Benefiting from the S6 model, the MambaBCD architecture has only linear computational complexity while modeling global context information.

It can be seen that even the largest model, MambaBCD-Base, has 179.32 GFLOPs, which is lower than many CNN and Transformer architectures, such as DSIFN, CGNet, and ChangeFormerV4, and comparable to CNN methods such as SNUNet and SiamCRNN-50. However, its accuracy advantage is obvious.

Fig. 7: Some binary change maps obtained by our methods on the test set of the SYSU dataset.
In particular, the MambaBCD-Tiny architecture has a clear accuracy advantage over other architectures with a similar number of parameters and computational consumption. For example, compared to BIT-18, MambaBCD-Tiny has lower GFLOPs yet higher F1 scores on all three datasets.
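For reproducibility, parameter counts and GFLOPs such as those in Table VI can be obtained with standard profiling tools. The sketch below uses fvcore on a toy bi-temporal network and an assumed 256x256 input purely for illustration; substituting the actual change detector gives comparable measurements, subject to how the profiler counts operations.

```python
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis, parameter_count

# `BiTemporalNet` is a tiny stand-in for any change detector taking (t1, t2);
# replace it with the actual network to reproduce numbers like those in Table VI.
class BiTemporalNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(6, 16, kernel_size=3, padding=1)
        self.head = nn.Conv2d(16, 2, kernel_size=1)
    def forward(self, t1, t2):
        return self.head(self.backbone(torch.cat([t1, t2], dim=1)))

model = BiTemporalNet().eval()
t1 = torch.rand(1, 3, 256, 256)   # a 256x256 input size is assumed for illustration
t2 = torch.rand(1, 3, 256, 256)

params_m = parameter_count(model)[""] / 1e6                 # parameters in millions
gflops = FlopCountAnalysis(model, (t1, t2)).total() / 1e9   # FLOPs in GFLOPs

print(f"Params: {params_m:.2f} M, GFLOPs: {gflops:.2f}")
```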
2. SCD: Table VII compares the SCD performance of the proposed MambaSCD architectures and representative SCD approaches.

Compared to the BCD task, the SCD task requires the change detector not only to effectively model the spatio-temporal relationships of multi-temporal images, but also to learn representative and discriminative representations for accurate land-cover mapping.
TABLE IV: Accuracy assessment for different binary CD models on the LEVIR-CD+ dataset, where indicates CNN-based models, indicates Transformer-based models, and indicates Mamba-based models. The table highlights the highest values in red, and the second-highest results in blue
Type Method Rec Pre OA F1 IoU KC
FC-EF [16] 71.77 69.12 97.54 70.42 54.34 69.14
FC-Siam-Diff [16] 74.02 81.49 98.26 77.57 63.36 76.67
FC-Siam-Conc [16] 78.49 78.39 98.24 78.44 64.53 77.52
SiamCRNN-18 [13] 84.25 81.22 98.56 82.71 70.52 81.96
SiamCRNN-34 [13] 83.88 82.28 98.61 83.08 71.05 82.35
SiamCRNN-50 [13] 81.61 85.39 98.68 83.46 71.61 82.77
SiamCRNN-101 [13] 80.96 85.56 98.67 83.20 71.23 82.50
DSIFN [17] 84.36 83.78 98.70 84.07 72.52 83.39
SNUNet [32] 78.73 71.07 97.83 74.70 59.62 73.57
HANet [33] 75.53 79.70 98.22 77.56 63.34 76.63
CGNet [34] 86.02 81.46 98.63 83.68 71.94 82.97
ChangeFormerV1 [22] 77.00 82.18 98.38 79.51 65.98 78.66
ChangeFormerV2 [22] 81.32 79.10 98.36 80.20 66.94 79.35
ChangeFormerV3 [22] 79.97 81.34 98.44 80.65 67.58 79.84
ChangeFormerV4 [22] 76.68 75.07 98.01 75.87 61.12 74.83
BIT-18 [20] 80.86 83.76 98.58 82.28 69.90 81.54
BIT-34 [20] 80.96 85.87 98.68 83.34 71.44 82.66
BIT-50 [20] 81.84 85.02 98.67 83.40 71.53 82.71
BIT-101 [20] 81.20 83.91 98.60 82.53 70.26 81.80
TransUNetCD [25] 84.18 83.08 98.66 83.63 71.86 82.93
SwinSUNet [24] 85.85 85.34 98.92 85.60 74.82 84.98
Fig. 8: Some binary change maps obtained by our methods on the test set of the LEVIR-CD+ dataset.
From Table VII, we can see that the MambaSCD architecture satisfies both requirements well. The MambaSCD-Small architecture surpasses the Transformer-based SOTA approach, ScanNet, on all four metrics of the SCD task. Note that our methods achieve this accuracy without employing loss functions such as those proposed for the SCD task in ScanNet [45]. Table VIII further lists the number of parameters and GFLOPs of these methods.
Fig. 9: Some binary change maps obtained by our methods on the test set of the WHU-CD dataset.
Compared to the Transformer-based ScanNet method, the GFLOPs of our three architectures are all lower. Compared to SMNet, another Transformer-based method, the proposed MambaSCD-Tiny achieves a higher SeK value with a lower number of parameters and GFLOPs.

These results demonstrate the effectiveness of the proposed methods and the potential of the Mamba architecture on the SCD task.
Fig. 10 shows some semantic change maps on the test set of the SECOND dataset.
TABLE V: Accuracy assessment for different binary CD models on the WHU-CD dataset, where indicates CNN-based models, indicates Transformer-based models, and indicates Mamba-based models. The table highlights the highest values in red, and the second-highest results in blue
Type Method Rec Pre OA F1 IoU KC
FC-EF [16] 86.33 83.50 98.87 84.89 73.74 84.30
FC-Siam-Diff [16] 84.69 90.86 99.13 87.67 78.04 87.22
FC-Siam-Conc [16] 87.72 84.02 98.94 85.83 75.18 85.28
SiamCRNN-18 [13] 90.48 91.56 99.34 91.02 83.51 90.68
SiamCRNN-34 [13] 89.10 93.88 99.39 91.42 84.20 91.11
SiamCRNN-50 [13] 91.45 86.70 99.30 90.57 82.76 90.20
SiamCRNN-101 [13] 90.45 87.79 99.19 89.10 80.34 88.68
DSIFN [17] 83.45 99.31 89.91 81.67 89.56
SNUNet [32] 87.36 88.04 99.10 87.70 78.09 87.23
HANet [33] 88.30 88.01 99.16 88.16 78.82 87.72
CGNet [34] 90.79 94.47 99.48 92.59 86.21 92.33
ChangeFormerV1 [22] 84.30 90.80 99.11 87.43 77.67 86.97
ChangeFormerV2 [22] 83.41 88.77 99.00 86.00 75.45 85.49
ChangeFormerV3 [22] 85.55 88.25 99.05 86.88 76.80 86.39
ChangeFormerV4 [22] 84.85 90.09 99.10 87.39 77.61 86.93
BIT-18 [20] 90.36 90.30 99.29 90.33 82.37 89.96
BIT-34 [20] 90.10 89.14 99.23 89.62 81.19 89.22
BIT-50 [20] 90.33 89.70 99.26 90.01 81.84 89.63
BIT-101 [20] 90.24 89.83 99.27 90.04 81.88 89.66
TransUNetCD [25] 90.50 85.48 99.09 87.79 78.44 87.44
SwinSUNet [24] 92.03 94.08 99.50 93.04 87.00 92.78
Fig. 10: Some semantic change maps obtained by the proposed MambaSCD architectures on the test set of the SECOND dataset.
Due to land-cover category combinations, SCD often involves a considerable number of categories.

For ease of visualization, we follow the visualization approach in [18], [38], [58], namely plotting only the land-cover categories of the changed regions in both the pre- and post-event images. In this way, the "from-to" semantic change information can be effectively reflected. From Fig. 10, we can observe that the MambaSCD architecture is not only capable of distinguishing changed features at different scales, but also accurately identifies the specific semantic categories of these changes.
3. BDA: Table IX reports the accuracy of the MambaBDA architecture and some Transformer- and CNN-based architectures on the xBD dataset. The MambaBDA architectures significantly outperform these representative BDA approaches. Compared to the current Transformer-based SOTA method, DamFormer [23], our three MambaBDA architectures all achieve higher overall F1 scores.

    Moreover, it can be noticed that the accuracy improvement compared to DamFormer mainly comes from the more difficult damage classification task rather than the building localization task, which demonstrates that the proposed MambaBDA architecture can adequately learn the spatio-temporal relationships between the multi-temporal images so as to effectively distinguish different building damage levels.
Fig. 11 shows the visualization results of building damage caused by different disaster events obtained by MambaBDA on the xBD dataset. MambaBDA can accurately localize buildings and then differentiate their damage levels despite the varying building types and disaster events, implying the potential of the MambaBDA architecture for practical disaster response applications.
TABLE VI: Comparison of different binary CD models in computational cost, where indicates CNN-based models, indicates Transformer-based models, and indicates Mamba-based models
Type Method Params (M) GFLOPs
FC-EF [16] 1.35 14.13
FC-Siam-Diff [16] 1.35 18.66
FC-Siam-Conc [16] 1.54 21.07
SiamCRNN-18 [13] 18.85 86.64
SiamCRNN-34 [13] 28.96 113.28
SiamCRNN-50 [13] 44.45 185.30
SiamCRNN-101 [13] 63.44 224.30
DSIFN [17] 35.73 329.03
SNUNet [32] 10.21 176.36
HANet [33] 2.61 70.68
CGNet [34] 33.68 329.58
ChangeFormerV1 [22] 29.84 46.62
ChangeFormerV2 [22] 24.30 44.54
ChangeFormerV3 [22] 24.30 33.68
ChangeFormerV4 [22] 33.61 852.53
TransUNetCD [25] 28.37 244.54
SwinSUNet [24] 39.28 43.50
BIT-18 [20] 11.50 106.14
BIT-34 [20] 21.61 190.83
BIT-50 [20] 24.28 224.61
BIT-101 [20] 43.27 380.62
MambaBCD-Tiny 17.13 45.74
MambaBCD-Small 49.94 114.82
MambaBCD-Base 84.70 179.32
TABLE VII: Accuracy assessment for different SCD models on the SECOND dataset, where indicates CNN-based models, indicates Transformer-based models, and indicates Mamba-based models. The table highlights the highest values in red, and the second-highest results in blue
Type Method OA F1 mIoU SeK
HRSCD-S1 [35] 45.77 38.44 62.72 5.90
HRSCD-S2 [35] 85.49 49.22 64.43 10.69
HRSCD-S3 [35] 84.62 51.62 66.33 11.97
HRSCD-S4 [35] 86.62 58.21 71.15 18.80
ChangeMask [18] 86.93 59.74 71.46 19.50
SSCD-1 [39] 87.19 61.22 72.60 21.86
Bi-SRNet [39] 87.84 62.61 73.41 23.22
TED [45] 87.39 60.34 72.79 22.17
SMNet [44] 86.68 60.34 71.95 20.29
ScanNet [45] 87.86 63.66 73.42 23.94
MambaSCD-Tiny 87.22 60.92 72.18 20.92
MambaSCD-Small
MambaSCD-Base

TABLE VIII: Comparison of different SCD models in computational cost, where indicates CNN-based models, indicates Transformer-based models, and indicates Mamba-based models.
Type Method Params GFLOPs
HRSCD-S1 [35] 3.36 8.02
HRSCD-S2 [35] 6.39 14.29
HRSCD-S3 [35] 12.77 42.67
HRSCD-S4 [35] 13.71 43.69
ChangeMask [18] 2.97 37.16
SSCD-1 [39] 23.31 189.57
Bi-SRNet [39] 23.39 189.91
TED [45] 24.19 204.29
SMNet [44] 42.16 75.79
ScanNet [45] 27.90 264.95
MambaSCD-Tiny 19.44 63.72
MambaSCD-Small 51.82 137.10
MambaSCD-Base 87.47 201.85
Fig. 11: Some building damage assessment results obtained by our methods on the test set of the xBD dataset. From top to bottom, the disaster events are typhoons, tsunamis, and wildfires.

E. Different Spatio-Temporal Fusion Methods

Modeling spatio-temporal relationships is important for CD tasks [13], [18], [36]. To demonstrate the effectiveness of our proposed CST-Mamba module, we compare it with several spatio-temporal modeling methods commonly used in CD.

They are the concatenation operation with a feature pyramid network (FPN) [18] (baseline), the temporal-symmetric transformer based on 3D convolution layers (TST) [18], an RNN-based method [13], and a Transformer-based method [20].
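To make the contrast concrete, the sketch below compares the concatenation baseline with a simple cross-temporal token arrangement, in which tokens from the two dates are interleaved before a sequence model scans them. It is a simplified, hypothetical illustration of the general idea (a GRU stands in for a Mamba block) rather than the actual CST-Mamba design.

```python
import torch
import torch.nn as nn

B, L, C = 2, 64, 96                      # batch, tokens per image, channels
feat_t1 = torch.rand(B, L, C)            # pre-event token sequence
feat_t2 = torch.rand(B, L, C)            # post-event token sequence

# (a) Baseline: channel-wise concatenation followed by a linear projection,
#     i.e., no explicit token-level interaction across time.
concat_fusion = nn.Linear(2 * C, C)
fused_baseline = concat_fusion(torch.cat([feat_t1, feat_t2], dim=-1))   # (B, L, C)

# (b) Cross-temporal arrangement: interleave the two dates token by token so a
#     causal sequence model (e.g., an SSM) sees both dates while scanning.
interleaved = torch.stack([feat_t1, feat_t2], dim=2).reshape(B, 2 * L, C)
seq_model = nn.GRU(C, C, batch_first=True)             # stand-in for a Mamba block
fused_cross, _ = seq_model(interleaved)
fused_cross = fused_cross.reshape(B, L, 2, C).mean(dim=2)               # back to (B, L, C)

print(fused_baseline.shape, fused_cross.shape)
```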

Table X lists the performance of these modeling methods on the three CD subtasks.
TABLE IX: Accuracy assessment for different building damage assessment models on the xBD dataset, where indicates CNN-based models, indicates Transformer-based models, and indicates Mamba-based models. The table highlights the highest values in red, and the second-highest results in blue. The suffix PPS means post-processing.
Type Method F1_loc F1_cls F1_oa No damage Minor damage Major damage Destroyed
xView2 Baseline 80.47 3.42 26.54 66.31 14.35 0.94 46.57
Siamese-UNet 85.92 65.58 71.68 86.74 50.02 64.43 71.68
MaskRCNN [43] 83.60 70.02 74.10 90.60 49.30 72.20 83.70
ChangeOS-18 [3] 84.62 69.87 74.30 88.61 52.10 70.36 79.65
ChangeOS-34 [3] 85.16 70.28 74.74 88.63 52.38 71.16 80.08
ChangeOS-50 [3] 85.41 70.88 75.24 88.98 53.33 71.24 80.60
ChangeOS-101 [3] 85.69 71.14 75.50 89.11 53.11 72.44 80.79
ChangeOS-18-PPS [3] 84.62 73.89 77.11 92.38 57.41 72.54 82.62
ChangeOS-34-PPS [3] 85.16 74.25 77.52 92.19 58.07 72.84 82.79
ChangeOS-50-PPS [3] 85.41 75.64 78.57 92.66 60.14 74.18 83.45
ChangeOS-101-PPS [3] 85.69 75.44 78.52 92.81 59.38 74.65 83.29
DamFormer [23] 86.86 72.81 77.02 89.86 56.78 72.56 80.51
MambaBDA-Tiny 84.76 77.50 79.68 95.33 60.15 75.94 88.27
MambaBDA-Small 86.61 78.80 81.14 95.99 62.82 76.26 88.37
MambaBDA-Base 87.38 78.84 81.41 95.94 62.74 76.46 88.58
TABLE X: Comparison of different approaches for modeling spatio-temporal relationships on three tasks. The table highlights the highest values in red, and the second-highest results in blue.
Method SYSU (OA F1 IoU KC) SECOND (OA F1 mIoU SeK) xBD (F1_loc F1_cls F1_oa)
Baseline 91.71 81.11 68.22 75.84 87.85 62.71 72.33 22.14 84.85 74.74 77.77
STT [18] 92.09 82.22 69.80 77.16 88.09 63.36 73.01 23.11 86.74 75.92 79.17
RNN [13] 92.97 82.02 69.52 76.94 88.20 63.63 72.86 23.09 85.74 76.48 79.26
Transformer [20] 92.14 82.17 69.74 77.17 88.05 64.05 73.02 23.44 86.09 77.45 80.04
Ours 92.35 82.83 70.70 77.94 88.38 64.10 73.61 24.04 86.61 78.80 81.14
On the simpler BCD task, the Transformer does not yield a significant performance advantage over the RNN. However, on the more difficult SCD and BDA tasks, the advantages of the Transformer architecture become apparent.

Compared with these current representative spatio-temporal relationship modeling approaches, our proposed cross-temporal modeling approach based on Mamba, which inherits the advantages of RNN and Transformer, shows superior performance on the three CD tasks of BCD, SCD, and BDA.

Compared to the Transformer-based spatio-temporal modeling approach, our method improves the F1 value on BCD, the SeK on the SCD task, and the overall F1 on the BDA task.
In addition, we further visualize the features extracted by the MambaBCD architecture in Fig. 12. In the features extracted by the encoder, the regions with high response values are not necessarily the regions of interest, i.e., the changed regions.

In the decoder, by contrast, as the spatio-temporal relationships are progressively modeled by the proposed CST-Mamba modules, the changed regions gradually exhibit high response values.

F. Comparison to Other Backbone Networks

Finally, we compare Mamba with some representative backbone networks for CD tasks on the SYSU dataset in Table XI.
Fig. 12: The visualization of features from different layers of MambaBCD.
It can be seen that Mamba outperforms the two Transformer backbone networks, Swin-Transformer and MixFormer.
TABLE XI: Comparison of different commonly used backbones in BCD on the SYSU dataset. The table highlights the highest values in red, and the second-highest results in blue.
Encoder OA F1 IoU KC Params GFLOPs
ResNet-101 91.10 81.10 68.22 75.28 46.90 131.36
EfficientNet-B5 91.60 76.84 6.55 53.62
MixFormer-v3 87.70 75.00 60.00 66.87 47.03 82.98
Swin-Small 82.22 69.80 36.22 81.72
Mamba-Small 49.94 114.82
This is because, in order to reduce the computational overhead of the self-attention operation, these two Transformer backbone networks either adopt a local attention mechanism (Swin-Transformer) [26] or strided convolution layers (MixFormer) [27] to reduce the size of the feature maps, which negatively affects the learning of global contextual information.

The Mamba architecture, on the other hand, reduces its computational cost to a linear relationship with the number of tokens, thus eliminating the need for such information-losing designs and allowing the full learning of global context information.
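The complexity argument can be made explicit. For a token sequence of length $N$ with feature dimension $d$ (and SSM state dimension $d_{\text{state}}$), the dominant costs scale roughly as
$$\mathrm{Cost}_{\text{self-attention}} \propto N^{2} d, \qquad \mathrm{Cost}_{\text{selective scan}} \propto N\, d\, d_{\text{state}},$$
so doubling the number of tokens roughly quadruples the cost of global self-attention but only doubles that of the selective scan; the constants are omitted here and the expressions are indicative rather than exact operation counts.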

As for the remote sensing change detection task, due to the multi-scale phenomenon of land-cover features and the differences in spatial resolution of different sensors, the sizes of changed objects often vary greatly.
至于遥感变化检测任务,由于土地覆盖特征的多尺度现象和不同传感器空间分辨率的差异,变化对象的大小往往相差很大。

Therefore, the features extracted by Mamba, which fully learn the global contextual information, are particularly important for the subsequent detection of these changed objects with different sizes.

V. CONCLUSION

In this paper, we explore the potential of the emerging Mamba architecture for remote sensing image change detection tasks.

Three architectures, MambaBCD, MambaSCD, and MambaBDA, are developed for the binary change detection, semantic change detection, and building damage assessment tasks, respectively.

The encoders of all three architectures adopt the cutting-edge visual Mamba architecture, which can adequately learn the global contextual information of the input images with linear complexity.

For the decoder, in order to adequately learn spatio-temporal relationships, we propose three spatio-temporal relationship modeling mechanisms that fully exploit the attributes and advantages of the Mamba architecture.

Experiments on five benchmark datasets have fully revealed the potential of Mamba for multi-temporal remote sensing image data processing tasks.

Compared to the CNN-based and Transformer-based approaches, the proposed three architectures can achieve SOTA performance on their respective tasks.

Compared to some commonly used backbone networks, the Mamba architecture can extract features that are more suitable for downstream CD tasks.

The combination of our proposed three spatio-temporal relationship modeling mechanisms and the Mamba architecture provides a new perspective on modeling spatio-temporal relationships, showing much better detection results on all three subtasks.
Our future work includes, but is not limited to, developing Mamba architectures more suited to the characteristics of remote sensing data and exploring the potential of Mamba for multimodal remote sensing tasks.

REFERENCES

[1] D. Lu, P. Mausel, E. Brondízio, and E. Moran, "Change detection techniques," Int. J. Remote Sens., vol. 25, no. 12, pp. 2365-2407, 2004.
[2] P. Coppin, I. Jonckheere, K. Nackaerts, B. Muys, and E. Lambin, "Digital change detection methods in ecosystem monitoring: A review," Int. J. Remote Sens., vol. 25, no. 9, pp. 1565-1596, 2004.
[3] Z. Zheng, Y. Zhong, J. Wang, A. Ma, and L. Zhang, "Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to manmade disasters," Remote Sens. Environ., vol. 265, p. 112636, 2021.
[4] H. Guo, Q. Shi, A. Marinoni, B. Du, and L. Zhang, "Deep building footprint update network: A semi-supervised method for updating existing building footprint from bi-temporal remote sensing images," Remote Sens. Environ., vol. 264, p. 112589, 2021.
[5] H. Chen, N. Yokoya, C. Wu, and B. Du, "Unsupervised Multimodal Change Detection Based on Structural Relationship Graph Representation Learning," IEEE Trans. Geosci. Remote Sens., pp. 1-18, 2022
[6] H. Chen, C. Lan, J. Song, C. Broni-Bediako, J. Xia, and N. Yokoya, "Land-cover change detection using paired openstreetmap data and optical high-resolution imagery via object-guided transformer," arXiv preprint arXiv:2310.02674, 2023.
[7] C. Wu, B. Du, and L. Zhang, "Slow feature analysis for change detection in multispectral imagery," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2858-2874, 2014.
[8] A. A. Nielsen, "The regularized iteratively reweighted MAD method for change detection in multi- and hyperspectral data," IEEE Trans. Image Process., vol. 16, no. 2, pp. 463-478, 2007.
[9] L. Bruzzone and Diego Fernàndez Prieto, "Automatic Analysis of the Difference Image for Unsupervised Change Detection," IEEE Trans. Geosci. Remote Sens., vol. 38, no. 3, pp. 1171-1182, 2000.
[10] M. Hussain, D. Chen, A. Cheng, H. Wei, and D. Stanley, "Change detection from remotely sensed images: From pixel-based to objectbased approaches," ISPRS J. Photogramm. Remote Sens., vol. 80, pp. 91-106, 2013 .
[11] C. Wu, H. Chen, B. Du, and L. Zhang, "Unsupervised change detection in multitemporal vhr images based on deep kernel pca convolutional mapping network," IEEE Trans. Cybern., vol. 52, no. 11, pp. 12084-12098, 2022.
[12] M. Gong, T. Zhan, P. Zhang, and Q. Miao, "Superpixel-based difference representation learning for change detection in multispectral remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2658-2673, 2017.
[13] H. Chen, C. Wu, B. Du, L. Zhang, and L. Wang, "Change Detection in Multisource VHR Images via Deep Siamese Convolutional MultipleLayers Recurrent Neural Network," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 4, pp. 2848-2864, 2020.
[14] H. Chen, C. Wu, B. Du, and L. Zhang, "Deep Siamese Multi-scale Convolutional Network for Change Detection in Multi-Temporal VHR Images," in 2019 10th International Workshop on the Analysis of Multitemporal Remote Sensing Images (MultiTemp), 2019, pp. 1-4.
[15] Q. Shi, M. Liu, S. Li, X. Liu, F. Wang, and L. Zhang, "A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1-16, 2022.
[16] R. Caye Daudt, B. Le Saux, and A. Boulch, "Fully convolutional siamese networks for change detection," in Proceedings - International Conference on Image Processing, ICIP, 2018, pp. 4063-4067.
[17] C. Zhang, P. Yue, D. Tapete, L. Jiang, B. Shangguan, L. Huang, and G. Liu, "A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images," ISPRS J. Photogramm. Remote Sens., vol. 166, no. June, pp. 183-200, 2020.
[18] Z. Zheng, Y. Zhong, S. Tian, A. Ma, and L. Zhang, "ChangeMask: Deep multi-task encoder-transformer-decoder architecture for semantic change detection," ISPRS J. Photogramm. Remote Sens., vol. 183, no. March 2021, pp. 228-239, 2022.
[19] Y. Cao and X. Huang, "A full-level fused cross-task transfer learning method for building change detection using noise-robust pretrained networks on crowdsourced labels," Remote Sens. Environ., vol. 284, p. 113371, 2023.
[20] H. Chen, Z. Qi, and Z. Shi, "Remote sensing image change detection with transformers," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1-14, 2022.
[21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[22] W. G. C. Bandara and V. M. Patel, "A transformer-based siamese network for change detection," in IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium, 2022, pp. 207-210.
[23] H. Chen, E. Nemni, S. Vallecorsa, X. Li, C. Wu, and L. Bromley, "Dualtasks siamese transformer framework for building damage assessment," in International Geoscience and Remote Sensing Symposium (IGARSS), 2022, pp. 1600-1603
[24] C. Zhang, L. Wang, S. Cheng, and Y. Li, "Swinsunet: Pure transformer network for remote sensing image change detection," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1-13, 2022.
[25] Q. Li, R. Zhong, X. Du, and Y. Du, "Transunetcd: A hybrid transformer network for change detection in optical remote-sensing images," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1-19, 2022
[26] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 10012-10022.
[27] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "Segformer: Simple and efficient design for semantic segmentation with transformers," in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 12077-12090.
[28] A. Gu, K. Goel, and C. Ré, "Efficiently modeling long sequences with structured state spaces," arXiv preprint arXiv:2111.00396, 2021.
[29] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.
[30] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, and Y. Liu, "Vmamba: Visual state space model," arXiv preprint arXiv:2401.10166, 2024.
[31] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, "Vision mamba: Efficient visual representation learning with bidirectional state space model," arXiv preprint arXiv:2401.09417, 2024.
[32] S. Fang, K. Li, J. Shao, and Z. Li, "Snunet-cd: A densely connected siamese network for change detection of vhr images," IEEE Geosci. Remote Sens. Lett., vol. 19, pp. 1-5, 2022.
[33] C. Han, C. Wu, H. Guo, M. Hu, and H. Chen, "Hanet: A hierarchical attention network for change detection with bitemporal very-highresolution remote sensing images," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 38673878, 2023.
[34] C. Han, C. Wu, H. Guo, M. Hu, J. Li, and H. Chen, "Change guiding network: Incorporating change prior to guide change detection in remote sensing imagery," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 8395-8407, 2023.
[35] R. Caye Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, "Multitask learning for large-scale semantic change detection," Computer Vision and Image Understanding, vol. 187, p. 102783, 2019.
[36] L. Mou, L. Bruzzone, and X. X. Zhu, "Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 2, pp. 924-935, 2019.
[37] S. Saha, F. Bovolo, and L. Bruzzone, "Unsupervised deep change vector analysis for multiple-change detection in VHR Images," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 6, pp. 3677-3693, 2019.
[38] K. Yang, G.-S. Xia, Z. Liu, B. Du, W. Yang, M. Pelillo, and L. Zhang, "Asymmetric siamese networks for semantic change detection in aerial images," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1-18, 2022.
[39] L. Ding, H. Guo, S. Liu, L. Mou, J. Zhang, and L. Bruzzone, "Bitemporal semantic reasoning for the semantic change detection in remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1-14, 2022.
[40] D. Peng, L. Bruzzone, Y. Zhang, H. Guan, and P. He, "Scdnet: A novel convolutional network for semantic change detection in high resolution optical remote sensing imagery," International Journal of Applied Earth Observation and Geoinformation, vol. 103, p. 102465, 2021.
[41] S. Tian, X. Tan, A. Ma, Z. Zheng, L. Zhang, and Y. Zhong, "Temporalagnostic change region proposal for semantic change detection," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 204, pp. 306-320, 2023.
[42] R. Gupta, B. Goodman, N. Patel, R. Hosfelt, S. Sajeev, E. Heim, J. Doshi, K. Lucas, H. Choset, and M. Gaston, "Creating xBD: A dataset for assessing building damage from satellite imagery," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
[43] E. Weber and H. Kané, "Building disaster damage assessment in satellite imagery with multi-temporal fusion," arXiv preprint arXiv:2004.05525, 2020 .
[44] Y. Niu, H. Guo, J. Lu, L. Ding, and D. Yu, "Smnet: Symmetric multitask network for semantic change detection in remote sensing images based on cnn and transformer," Remote Sensing, vol. 15, no. 4, 2023.
[45] L. Ding, J. Zhang, H. Guo, K. Zhang, B. Liu, and L. Bruzzone, "Joint spatio-temporal modeling for semantic change detection in remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-14, 2024.
[46] J. T. Smith, A. Warrington, and S. W. Linderman, "Simplified state space layers for sequence modeling," arXiv preprint arXiv:2208.04933, 2022.
[47] D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Ré, "Hungry hungry hippos: Towards language modeling with state space models," arXiv preprint arXiv:2212.14052, 2022.
[48] X. He, K. Cao, K. Yan, R. Li, C. Xie, J. Zhang, and M. Zhou, "Pan-mamba: Effective pan-sharpening with state space model," arXiv preprint arXiv:2402.12192, 2024.
[49] K. Chen, B. Chen, C. Liu, W. Li, Z. Zou, and Z. Shi, "Rsmamba: Remote sensing image classification with state space model," arXiv preprint arXiv:2403.19654, 2024.
[50] H. Chen, J. Song, C. Wu, B. Du, and N. Yokoya, "Exchange means change: An unsupervised single-temporal change detection framework based on intra- and inter-image patch exchange," ISPRS J. Photogramm. Remote Sens., vol. 206, pp. 87-105, 2023.
[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30, 2017.
[52] S. Elfwing, E. Uchibe, and K. Doya, "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning," Neural networks, vol. 107, pp. 3-11, 2018.
[53] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980-2988.
[54] M. Berman, A. R. Triki, and M. B. Blaschko, "The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[55] S. Ji, S. Wei, and M. Lu, "Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set," IEEE Transactions on geoscience and remote sensing, vol. 57, no. 1, pp. 574-586, 2018.
[56] H. Chen and Z. Shi, "A spatial-temporal attention-based method and a new dataset for remote sensing image change detection," Remote Sens., vol. 12, no. 10, 2020.
[57] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
[58] S. Tian, Y. Zhong, Z. Zheng, A. Ma, X. Tan, and L. Zhang, "Large-scale deep learning based binary and semantic change detection in ultra high resolution remote sensing imagery: From benchmark datasets to urban application," ISPRS J. Photogramm. Remote Sens., vol. 193, pp. 164-186, 2022.