
ChangeMamba: Remote Sensing Change Detection with Spatio-Temporal State Space Model

Hongruixuan Chen, Student Member, IEEE, Jian Song, Chengxi Han, Student Member, IEEE, Junshi Xia, Senior Member, IEEE, Naoto Yokoya, Member, IEEE

Abstract

Convolutional neural networks (CNN) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have their inherent shortcomings.

Recently, the Mamba architecture, based on state space models, has shown remarkable performance in a series of natural language processing tasks and can effectively compensate for the shortcomings of the above two architectures.

In this paper, we explore for the first time the potential of the Mamba architecture for remote sensing change detection tasks.

We tailor the corresponding frameworks, called MambaBCD, MambaSCD, and MambaBDA, for binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA), respectively.

All three frameworks adopt the cutting-edge visual Mamba architecture as the encoder, which allows full learning of global spatial contextual information from the input images.

For the change decoder, which is available in all three architectures, we propose three spatio-temporal relationship modeling mechanisms, which can be naturally combined with the Mamba architecture and fully utilize its attribute to achieve spatio-temporal interaction of multitemporal features and obtain accurate change information.

On five benchmark datasets, our proposed frameworks outperform current CNN- and Transformer-based approaches without using any complex strategies or tricks, fully demonstrating the potential of the Mamba architecture. Specifically, we obtained , and F1 scores on the three BCD datasets SYSU, LEVIR-CD+, and WHU-CD; on the SCD dataset SECOND, we obtained ; and on the xBD dataset, we obtained overall F1 score. The source code will be available at https://github.com/ChenHongruixuan/MambaCD

Index Terms: State space model, Mamba, binary change detection, semantic change detection, building damage assessment, spatio-temporal relationship, optical high-resolution images


Manuscript submitted April xx, 2024. This work was supported in part by the Council for Science, Technology and Innovation (CSTI), the Cross-ministerial Strategic Innovation Promotion Program (SIP), Development of a Resilient Smart Network System against Natural Disasters (Funding agency: NIED), the JSPS, KAKENHI under Grant Number 22H03609, JST, FOREST under Grant Number JPMJFR206S, Microsoft Research Asia, and Next Generation AI Research Center, The University of Tokyo.
(+: equal contribution; Corresponding author: Naoto Yokoya)
Hongruixuan Chen, Jian Song, and Naoto Yokoya are with Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan (e-mail: qschrx@gmail.com; song@ms.k.u-tokyo.ac.jp; yokoya@k.u-tokyo.ac.jp).
Chengxi Han is with State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, China (e-mail: chengxihan@whu.edu.cn).
Junshi Xia is with RIKEN Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, 103-0027, Japan (e-mail: junshi.xia@riken.jp).

I. Introduction

Change detection (CD) has been a popular field within the remote sensing community since the inception of remote sensing technology. It aims to detect changes in objects on the Earth's surface from multi-temporal remote sensing images acquired at different times. Depending on the desired result from the change detector, CD tasks can be categorized into three types, namely binary CD (BCD), semantic CD (SCD), and building damage assessment (BDA). CD techniques now play an important role in many fields, including land cover change analysis, urban sprawl studies, disaster response, geographic information system (GIS) updating, and ecological monitoring [1]-[6].

Optical high-resolution remote sensing imagery has become one of the most applied and researched data sources in the field of CD. It provides detailed textural and geometric structural information about surface features, allowing more refined changes to be detected.

Because higher spatial resolution increases the heterogeneity within the same land-cover feature, traditional pixel-based CD methods [7]-[9] struggle to achieve satisfactory detection results.

To cope with this, researchers proposed object-based CD methods [10], which take an object consisting of homogeneous pixels as the basic unit for CD.

However, these methods rely on hand-designed "shallow" features, which are not representative and robust enough to cope with complex ground conditions in multi-temporal high-resolution images [11].
The emergence of deep learning has brought new models and paradigms for CD, greatly improving the efficiency and accuracy of CD. After the early days of developing image patch-based approaches with tiny deep learning models [12]-[14], the release of a number of large-scale benchmark datasets [15] allowed us to start designing and introducing even larger and deeper models.

Convolutional neural network (CNN)-based methods have been dominant within the field of CD after the introduction of fully convolutional networks (FCNs) to the field by Daudt et al. [16].

By employing models within the computer vision field as well as a priori knowledge of the CD task, many representative methods have been proposed [3], [4], [17]-[19].

Despite the decent results achieved by these methods, the inherent shortcomings of the CNN model structure, i.e., the inability of the limited receptive field to capture long-range dependencies between pixels, make these methods still fall short when dealing with complex and diverse multi-temporal scenes in images with different spatial-temporal resolutions [20].

The emergence of the visual Transformer [21] provides an effective way to solve the above problems. By means of the self-attention operation, the Transformer architecture can efficiently model the relationships between all pixels of the whole image. Currently, more and more CD architectures adopt Transformers as the encoder to extract representative and robust features [22]-[24] and use them in the decoder to capture the spatio-temporal relationships between multi-temporal features [20], [25], exhibiting superior performance.
Nonetheless, the cost of the self-attention operation grows quadratically with image size, resulting in expensive computational overhead, which is detrimental in dense prediction tasks on large-scale remote sensing datasets, e.g., land-cover mapping, object detection, and CD. Some available solutions, such as limiting the size or stride of the computational window to improve the efficiency of attention, come at the cost of limiting the size of the receptive field [26], [27].
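The scaling argument above can be made concrete with a small counting sketch. It is only an illustration: the patch size of 4 and the 49-token window (a 7×7 window, as in Swin-style attention) are assumptions, and the function names are hypothetical. The counts are pairwise token interactions, not measured FLOPs.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for an image (assumed tokenization)."""
    return (height // patch) * (width // patch)


def global_attention_pairs(n_tokens):
    """Token pairs scored by global self-attention: quadratic in n_tokens."""
    return n_tokens * n_tokens


def windowed_attention_pairs(n_tokens, window_tokens):
    """Token pairs when each token only attends within a fixed-size window."""
    return n_tokens * window_tokens


for side in (256, 512, 1024):
    n = num_tokens(side, side, patch=4)
    print(side, n, global_attention_pairs(n), windowed_attention_pairs(n, 49))
```

Doubling the image side quadruples the token count and multiplies the global-attention pair count by 16, whereas the windowed count only grows by 4, at the price of a receptive field bounded by the window.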

As a viable alternative to the Transformer architecture, state space sequence models (SSMs), especially structured state space sequence models (S4) [28], have shown cutting-edge performance in continuous long-sequence data analysis, with the promising attribute of scaling linearly in sequence length.

The Mamba architecture [29] further improves on S4 by utilizing a selection mechanism that allows the model to select relevant information in an input-dependent manner.

By combining it with hardware-aware implementations, Mamba outperforms Transformers on a number of downstream tasks with high efficiency. Very recently, the Mamba architecture has been extended to image data and has shown promising results on some visual tasks [30], [31].

However, its potential for high-resolution optical images and different CD subtasks is still under-explored. How to design a suitable architecture based on a priori knowledge of CD tasks is undoubtedly valuable for subsequent research in the community.
In this paper, we present the ChangeMamba architecture, an efficient Mamba model for remote sensing CD tasks. ChangeMamba can efficiently model the global spatial context, thereby showing very effective results on the three subtasks of CD, namely BCD, SCD, and BDA.

Specifically, ChangeMamba is based on the recently proposed VMamba architecture [30], which adopts a Cross-Scan Module (CSM) to unfold image patches in different spatial directions to achieve effective modeling of global contextual information of images.
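A minimal sketch of the idea of unfolding image patches along different spatial directions is given below. It assumes four traversal orders (row-major, column-major, and the reverse of each) and merges the per-direction results by summation; both choices, and the names `cross_scan` and `cross_merge`, are illustrative assumptions rather than the exact VMamba implementation. In a real model each sequence would pass through a state space layer between the two steps.

```python
def cross_scan(grid):
    """Unfold a 2-D grid of patch tokens into four 1-D sequences.

    Assumed orders: row-major, column-major, and the reverse of each.
    """
    h, w = len(grid), len(grid[0])
    row_major = [grid[r][c] for r in range(h) for c in range(w)]
    col_major = [grid[r][c] for c in range(w) for r in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def cross_merge(sequences, h, w):
    """Map the four sequences back to grid positions and sum them per patch."""
    row_major, col_major, row_rev, col_rev = sequences
    n = h * w
    out = [[0] * w for _ in range(h)]
    for i in range(n):
        r, c = divmod(i, w)          # position of the i-th row-major token
        out[r][c] += row_major[i] + row_rev[n - 1 - i]
    for i in range(n):
        c, r = divmod(i, h)          # position of the i-th column-major token
        out[r][c] += col_major[i] + col_rev[n - 1 - i]
    return out


grid = [[1, 2, 3],
        [4, 5, 6]]
seqs = cross_scan(grid)
merged = cross_merge(seqs, 2, 3)
```

Because every patch appears at a different position in each of the four sequences, a causal sequence model applied to all four can, in combination, let each patch see context from every other patch.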

In addition, since CD tasks require the detector to adequately capture the spatio-temporal relationships between multi-temporal images, we design a cross-spatio-temporal state space module (CST-Mamba), which can adequately model the spatio-temporal relationships between multi-temporal features by combining them in different ways, and thus efficiently detect the different categories of changes, including binary changes, semantic changes, and building damage levels.
In summary, the main contributions of this paper are as follows:
  1. For the first time, we explore the application of the Mamba architecture in the field of remote sensing change detection, achieving accurate and efficient CD.
  2. Based on the Mamba architecture and incorporating the unique characteristics of CD, we design a corresponding network framework for each of the three CD tasks: BCD, SCD, and BDA.
  3. Three spatio-temporal relationship modeling mechanisms, tailored for CD tasks, are proposed to fully learn spatio-temporal features with the aid of the Mamba architecture.
  4. The three proposed frameworks show very competitive and even SOTA performance on BCD, SCD, and BDA tasks on five benchmark datasets, demonstrating their superiority. The source code for this work is publicly available to support subsequent research.
The remainder of this paper is organized as follows. Section II reviews related work. Section III elaborates on the three ChangeMamba architectures proposed for BCD, SCD, and BDA, respectively. The experimental results and discussion are provided in Section IV. Section V draws the conclusion.

II. Related Work

In this section, we focus on reviewing deep learning-based CD methods that use optical high-resolution remote sensing images as data sources.

A. CNN-Based Method

The rapid evolution of CNNs and their prowess in extracting local features led to their early application in CD tasks. In the realm of CNN-based BCD, an initial and simple approach [16] combined bi-temporal images into a single input for processing by an FCN.

In addition, it introduced two variations that employ siamese encoders to handle dual inputs.

Given the roots of FCNs in semantic segmentation, [17] sought to tailor these networks more closely to BCD tasks by integrating feature change maps as a deep supervised signal, thus refining the final prediction.

Addressing the limited receptive field of CNNs, [13] expanded this framework by combining long short-term memory (LSTM) networks with CNNs. Furthermore, Fang et al. [32] designed a Siamese architecture with denser connections to facilitate more effective shallow information exchange for BCD. To combat the challenge of uneven foreground-background distributions within BCD, Han et al. [33] employed a novel sampling approach and an attention mechanism to integrate global information more effectively.

After that, they further introduced an innovative self-attention module that not only enhances long-distance dependency, but also specifically targets the refinement of BCD predictions, significantly improving the delineation of change area boundaries and reducing the occurrence of inner holes [34].
Compared to BCD, which aims to identify "where" changes have occurred, SCD seeks to determine "what" the changes are. Although this poses a greater challenge, it is more important for downstream applications in remote sensing.

In the field of CNN-based SCD, FCNs remain the core structure.

Early explorations in SCD include [35], which adopts a multitask learning framework to simultaneously predict a binary change mask and land cover mapping, using the latter to assist in generating the final semantic change prediction.

The approach in [36] represents an early attempt to combine a recurrent neural network (RNN) and a CNN for SCD, where the CNN extracts semantic features and the RNN models temporal dependencies to classify multiple types of changes.

In [37] and [11], unsupervised SCD methods were developed, using pretrained CNNs or unsupervised learning techniques to extract temporal features, then comparing these features to classify changes. [38] introduced a benchmark dataset for SCD along with a new evaluation metric.

Building on [35], Ding et al. [39] proposed a more innovative architecture to address the insufficient exchange of information between the semantic and change branches, introducing new loss functions and supervision signals to force the network to correlate two branches.

Zheng et al. [18] continue to strengthen the connection between the two branches, designing a temporal-symmetric transformer (TST) module to model the causal relationship between semantic and change representations.

In [40], channel attention is used to embed change information into temporal features. In [41], temporal salient features are used and fused to extract and improve the representation of changes.
Compared to the advances in CNN-based methods for BCD and SCD, the progression of BDA methodologies has been relatively gradual. BDA intricately categorizes areas affected by disasters into four distinct classes: "No damage", "Minor damage", "Major damage", and "Destroyed".

The procedural approach to BDA typically encompasses two primary phases: identifying the location of the building and assessing the level of damage incurred. This bifurcation essentially positions BDA as a nuanced subset of SCD, with a focus on quantifying damage levels. Among established approaches, the xView2 challenge [42] introduced a baseline method that employs a UNet+ResNet cascade for building localization and damage classification, trained with pre- and post-disaster imagery.

The Siamese-UNet, the winning approach of the challenge [42], utilizes a UNet framework for both pixel-level building localization and damage level classification.

It innovatively initializes the damage classification UNet with the parameters from the building localization UNet, enhancing performance through a multi-model ensemble of four UNets.

Further advancements [43] draw from SCD solutions and the MaskRCNN architecture to address BDA, incorporating Siamese encoders, multitask heads, and instance-level predictions.

An evolved approach [3] introduces a hybrid encoder architecture that merges shared and task-specific weights, alongside an FCN-based multitask decoder. This development enhances task integration with an object-based post-processing strategy, resulting in more refined outcomes.
The development of CNN-based CD methods has matured significantly over time. However, despite their excellent ability to extract local features, CNN architectures inherently struggle to capture long-distance dependencies.

This limitation becomes particularly critical in CD tasks, where change regions, compared to background areas, are often sparse and scattered. The network's ability to model relationships beyond local areas is essential for accurately identifying these change regions.

In this context, Mamba, evolved from RNN architectures, stands out for its exceptional ability to model non-local relationships, making it ideally suited for CD tasks.

Its architecture leverages the sequential processing strength of RNNs, enabling it to effectively understand the spatial context and dependencies over larger areas, thus overcoming one of the key limitations of traditional CNN-based approaches in CD.

B. Transformer-Based Method

Recently, with the rise of transformers in computer vision tasks, their superior long-distance modeling capabilities, compared to CNNs, have made them particularly effective in the field of BCD, SCD, and BDA.

In the realm of BCD, [20] emerges as a pioneering approach, transforming images into semantic tokens to model spatial-temporal contexts within a token-based framework, enhancing the change detection process. Bandara et al. [22] employ a pure transformer-based Siamese network that uses a structured transformer encoder coupled with an MLP decoder. This architecture eliminates the need for a CNN-based feature extractor. At the same time, Zhang et al. [24] also propose a pure transformer-driven model, adopting a Siamese U-shaped structure that processes images in parallel to extract and merge multiscale features. It leverages Swin transformer blocks across its architecture, further demonstrating the transformer's capability to refine change detection outcomes. Li et al. [25] introduce a hybrid model that blends the global context understanding of transformers with the semantic detail resolution of UNet. It eliminates UNet's redundant information and enhances feature quality through a difference enhancement module, leading to more precise change area delineation.
As for transformer-based SCD, Niu et al. [44] use a PRTB backbone, combining preactivated residual and transformer blocks, to extract semantic features from image pairs, which is crucial for detecting subtle changes.

Its Multi-Content Fusion Module (MCFM) precisely differentiates between the foreground and background. The model's efficacy is further amplified by multitask prediction branches and a custom loss function, ensuring nuanced semantic change detection with heightened accuracy.

Alternatively, Ding et al. [45] introduce the Semantic Change Transformer (SCanFormer), a CSWin Transformer adaptation, to explicitly chart semantic transitions.

The model utilizes a triple encoder-decoder architecture to improve spatial features, and it is distinguished by incorporating spatio-temporal dynamics and task-specific expertise, leading to a decrease in learning disparities.

This approach significantly boosts accuracy on benchmarks, adeptly managing scenarios with sparse change samples and semantic labels.
As for transformer-based BDA, Chen et al. [23] use a siamese Mix Transformer (MiT) encoder and a lightweight All-MLP decoder to efficiently process bitemporal image pairs.

It features deep feature extraction and a multitemporal fusion module for improved information integration, and a dual-task decoder for precise building localization and damage classification.

By adopting strategies like weight sharing and innovative fusion mechanisms, it efficiently handles high-resolution imagery with lower computational costs, making it effective and streamlined for BDA.
Although transformer-based methods have significantly enhanced CD accuracy through their powerful non-local modeling capabilities, they come with substantial computational resource demands. Specifically, as the number of tokens increases, the consumption of computational resources escalates quadratically, posing challenges for processing large-scale input.

However, Mamba, an architecture inherited from RNNs, not only rivals transformers in non-local modeling capabilities but also maintains computational resource consumption at a linear growth rate.
Moreover, spatio-temporal fusion remains a fundamental aspect of CD tasks, where seamless and efficient integration of time and space information is essential to achieve accurate detection results.

Traditional methods, whether they employ CNNs or transformers, tend to follow a rigid, modular stacking approach for spatio-temporal fusion.

In contrast, the Mamba framework introduces a novel paradigm by treating tokens as an inherently ordered sequence, offering a flexible range of spatio-temporal fusion possibilities.

By manipulating the order in which tokens are processed, Mamba can generate a spectrum of fusion outcomes, marking a significant departure from conventional methods.

This innovative approach to spatiotemporal fusion provides Mamba with unparalleled flexibility and effectiveness in addressing CD tasks.
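To illustrate how token order alone can change what a sequence model fuses, the sketch below builds two orderings of the tokens from two acquisition times. Both orderings and the function names are hypothetical illustrations, not a description of the specific mechanisms proposed in this paper.

```python
def sequential_order(t1_tokens, t2_tokens):
    """All tokens of the first image, followed by all tokens of the second."""
    return list(t1_tokens) + list(t2_tokens)


def interleaved_order(t1_tokens, t2_tokens):
    """Alternate tokens of the two images position by position."""
    if len(t1_tokens) != len(t2_tokens):
        raise ValueError("both images must yield the same number of tokens")
    ordered = []
    for a, b in zip(t1_tokens, t2_tokens):
        ordered.extend([a, b])
    return ordered


t1 = ["a1", "a2", "a3"]   # tokens from the first acquisition time
t2 = ["b1", "b2", "b3"]   # tokens from the second acquisition time
seq = sequential_order(t1, t2)
mix = interleaved_order(t1, t2)
```

In the sequential ordering, a causal scan has seen the entire first image before it reaches the second; in the interleaved ordering, each location of the second image is processed immediately after the same location of the first.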

C. State Space Model

The SSM concept first appeared with the S4 model [28], which offers a new way to handle contextual information globally and attracted attention because it scales linearly in sequence length. Based on S4, Smith et al. [46] proposed the S5 model by introducing a MIMO SSM and an efficient parallel scan into the S4 model. After that, the H3 model [47] further advanced these concepts, achieving performance on par with Transformers in language modeling tasks.

Recently, Gu et al. [29] proposed a data-dependent SSM layer and built a generic language model backbone called Mamba, which outperforms Transformers at different scales on large-scale real data and scales linearly in sequence length.

Very recently, VMamba [30] and Vision Mamba [31] have extended Mamba architecture to 2D image data, showing superior performance on many computer vision tasks.

Inspired by this progress, some pioneering work has introduced the Mamba architecture to the field of remote sensing image processing [48], [49].

Their studies show that the Mamba architecture can yield performance comparable to advanced CNN and Transformer architectures on scene classification [49] and pansharpening [48] tasks.

However, these works still focus on either low-level tasks or classification tasks, and all of these tasks are single-temporal tasks [50].

The potential of the Mamba architecture for multi-temporal remote sensing image-related scenarios and dense prediction tasks remains to be explored.

III. Methodology

A. Preliminaries

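The SSM-based models and Mamba are inspired by linear time-invariant systems, which map a 1-D function or sequence $x(t) \in \mathbb{R}$ to a response $y(t) \in \mathbb{R}$ through a hidden state $h(t) \in \mathbb{R}^{N}$. These systems are usually formulated as linear ordinary differential equations (ODEs):

$$h'(t) = \mathbf{A} h(t) + \mathbf{B} x(t), \qquad y(t) = \mathbf{C} h(t), \tag{1}$$

where $\mathbf{A} \in \mathbb{R}^{N \times N}$ is the evolution parameter and $\mathbf{B} \in \mathbb{R}^{N \times 1}$ and $\mathbf{C} \in \mathbb{R}^{1 \times N}$ are the projection parameters.

The S4 and Mamba models are discrete counterparts of the continuous system, since continuous systems face significant challenges when integrated into deep learning algorithms. They contain a time scale parameter $\Delta$, which is used to convert the continuous parameters $\mathbf{A}$ and $\mathbf{B}$ into their discrete counterparts $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$. A prevalent way to achieve this transformation is the zero-order hold (ZOH) approach:

$$\bar{\mathbf{A}} = \exp(\Delta \mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta \mathbf{A})^{-1} \left( \exp(\Delta \mathbf{A}) - \mathbf{I} \right) \cdot \Delta \mathbf{B}. \tag{2}$$

After that, the discretization of Eq. (1) can be formulated as

$$h_k = \bar{\mathbf{A}} h_{k-1} + \bar{\mathbf{B}} x_k, \qquad y_k = \mathbf{C} h_k. \tag{3}$$

Finally, the output $y$ for an input $x$ with sequence length $L$ can be calculated directly by the following global convolution operation:

$$\bar{\mathbf{K}} = \left( \mathbf{C} \bar{\mathbf{B}},\ \mathbf{C} \bar{\mathbf{A}} \bar{\mathbf{B}},\ \ldots,\ \mathbf{C} \bar{\mathbf{A}}^{L-1} \bar{\mathbf{B}} \right), \qquad y = x * \bar{\mathbf{K}}, \tag{4}$$

where $\bar{\mathbf{K}} \in \mathbb{R}^{L}$ is a structured convolutional kernel and $*$ denotes the convolution operation.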
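The relationship between the discretized recurrence and the global convolution can be checked numerically. The sketch below is a minimal illustration under a simplifying assumption that is not stated in the text: the evolution matrix is diagonal with non-zero entries, so it is stored as a list and the matrix exponential reduces to an element-wise exponential. All function names and the parameter values are hypothetical.

```python
import math


def zoh_discretize(a, b, delta):
    """Zero-order hold for a diagonal SSM (assumes every a_n != 0).

    a_bar_n = exp(delta * a_n);  b_bar_n = (exp(delta * a_n) - 1) / a_n * b_n
    """
    a_bar = [math.exp(delta * an) for an in a]
    b_bar = [(math.exp(delta * an) - 1.0) / an * bn for an, bn in zip(a, b)]
    return a_bar, b_bar


def ssm_recurrent(x, a_bar, b_bar, c):
    """Step-by-step form: h_k = a_bar * h_{k-1} + b_bar * x_k, y_k = c . h_k."""
    h = [0.0] * len(a_bar)
    ys = []
    for xk in x:
        h = [ab * hn + bb * xk for ab, bb, hn in zip(a_bar, b_bar, h)]
        ys.append(sum(cn * hn for cn, hn in zip(c, h)))
    return ys


def ssm_kernel(a_bar, b_bar, c, length):
    """Convolution kernel K_j = sum_n c_n * a_bar_n**j * b_bar_n, j = 0..length-1."""
    return [sum(cn * ab ** j * bb for cn, ab, bb in zip(c, a_bar, b_bar))
            for j in range(length)]


def ssm_convolutional(x, kernel):
    """Causal convolution: y_k = sum_{j=0..k} K_j * x_{k-j}."""
    return [sum(kernel[j] * x[k - j] for j in range(k + 1)) for k in range(len(x))]


a, b, c = [-1.0, -0.5], [1.0, 1.0], [0.5, -0.25]   # illustrative parameters
x = [1.0, 0.0, 2.0, -1.0]
a_bar, b_bar = zoh_discretize(a, b, delta=0.1)
y_rec = ssm_recurrent(x, a_bar, b_bar, c)
y_conv = ssm_convolutional(x, ssm_kernel(a_bar, b_bar, c, len(x)))
```

Both forms produce the same output up to floating-point error. The recurrent form costs a constant amount of work per step, while the convolutional form allows the whole sequence to be processed at once during training.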

B. Problem Statement

In this paper, we focus on three sub-tasks within the CD field: binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA). The three tasks are defined as follows.
  1. Binary Change Detection: BCD is the most basic and most intensively studied task in the CD field. It focuses on "where" change occurs. According to the class of interest, BCD can be further categorized into class-agnostic CD, which focuses on general land-cover change information, and single-class object CD, e.g., building CD. Given a training set of multi-temporal image pairs acquired at two different times, each paired with a corresponding change label, the goal of BCD is to train a change detector on this set that can predict change maps reflecting accurate "change / non-change" binary information on new sets.
  2. Semantic Change Detection: Compared to BCD, SCD focuses not only on "where" the change occurs but also on "what" the change is, i.e., the "from-to" semantic change information [18], [39], [45]. Compared to BCD, the SCD training set additionally requires land-cover labels for the images at both acquisition times. The goal of SCD is to train a semantic change detector on this set that can predict, as accurately as possible, the land-cover maps of the multi-temporal images and the binary change maps between them on new sets. By combining the information in the predicted land-cover maps and binary change maps, the "from-to" semantic change information can be derived.
  3. Building Damage Assessment: BDA is a special "one-to-many" SCD task [3]. BDA requires recognizing not only where the change (damage) occurs but also the post-event state of the object (building), under the condition that the pre-event state of the object is unique. The BDA training set therefore pairs each multi-temporal image pair with a label of the object of interest (building) in the pre-event image and a label of the post-event state (damage level) of that object in the post-event image.

Fig. 1: The network framework of the proposed (a) MambaBCD for binary change detection, (b) MambaSCD for semantic change detection, and (c) MambaBDA for building damage assessment.
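The way "from-to" information follows from two land-cover maps and a binary change map in the SCD definition can be sketched as a simple masking step. This is only an illustration on nested lists; the function name, the class strings, and the use of `None` for unchanged pixels are assumptions made for the example.

```python
def from_to_map(landcover_t1, landcover_t2, change_mask):
    """Combine two land-cover maps with a binary change map.

    Changed pixels hold a (class_at_t1, class_at_t2) pair; unchanged pixels hold None.
    """
    return [
        [(c1, c2) if changed else None
         for c1, c2, changed in zip(row1, row2, row_mask)]
        for row1, row2, row_mask in zip(landcover_t1, landcover_t2, change_mask)
    ]


landcover_t1 = [["water", "tree"],
                ["ground", "ground"]]
landcover_t2 = [["water", "building"],
                ["building", "ground"]]
change_mask = [[0, 1],
               [1, 0]]
semantic_change = from_to_map(landcover_t1, landcover_t2, change_mask)
```

Masking with the binary change map means that a disagreement between the two predicted land-cover maps is only reported as a semantic change where the change branch also detects a change.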

C. Network Architecture

For these three subtasks, we design corresponding network frameworks based on the Mamba architecture, called MambaBCD, MambaSCD, and MambaBDA, respectively. Fig. 1 shows the network framework of these three architectures.

Among them, the encoders of the three networks are weight-sharing siamese networks based on the Visual State Space Model (VMamba) architecture [30].

VMamba can adequately extract the robust and representative features of the input images benefiting from the Mamba architecture and an efficient 2D cross-scan mechanism (as shown in Fig. 3). Its specific structure will be introduced in Section III-D.
  1. MambaBCD: MambaBCD is the architecture designed for the BCD task. First, the siamese encoder network extracts multi-level features from each of the two input images. Next, these multi-level features are fed into a tailored change decoder. Based on the Mamba architecture, the change decoder fully learns the spatio-temporal relationships in the multi-level features through three different mechanisms and gradually obtains an accurate BCD result, from which the binary change map is derived.
  2. MambaSCD: MambaSCD is the architecture designed for SCD tasks. As shown in Fig. 1-(b), MambaSCD adds to MambaBCD two semantic decoders for the land-cover mapping task, one per acquisition time. In addition to being fed to the change decoder, which learns spatio-temporal relationships to predict the BCD result, the multi-level features extracted by the encoder are also fed into the two semantic decoders to predict the land-cover map of the corresponding temporal image. After obtaining the two land-cover maps and the binary change map, the semantic change information can be obtained by masking the land-cover maps with the binary change map.