Multi-modal learning for geospatial vegetation forecasting
多模态学习用于地理空间植被预测

Vitus Benson^1,2,3,* Claire Robin^1,2 Christian Requena-Mesa^1,2 Lazaro Alonso¹ Nuno Carvalhais^1,2 José Cortés¹ Zhihan Gao⁴ Nora Linscheid¹ Mélanie Weynants¹ Markus Reichstein^1,2
¹ Max-Planck-Institute for Biogeochemistry ² ELLIS Unit Jena ³ ETH Zürich
⁴ Hong Kong University of Science and Technology ^* vbenson@bgc-jena.mpg.de

Abstract 摘要

The innovative application of precise geospatial vegetation forecasting holds immense potential across diverse sectors, including agriculture, forestry, humanitarian aid, and carbon accounting. To leverage the vast availability of satellite imagery for this task, various works have applied deep neural networks for predicting multispectral images in photorealistic quality. However, the important area of vegetation dynamics has not been thoroughly explored. Our study breaks new ground by introducing GreenEarthNet, the first dataset specifically designed for high-resolution vegetation forecasting, and Contextformer, a novel deep learning approach for predicting vegetation greenness from Sentinel 2 satellite images with fine resolution across Europe. Our multi-modal transformer model Contextformer leverages spatial context through a vision backbone and predicts the temporal dynamics on local context patches incorporating meteorological time series in a parameter-efficient manner. The GreenEarthNet dataset features a learned cloud mask and an appropriate evaluation scheme for vegetation modeling. It also maintains compatibility with the existing satellite imagery forecasting dataset EarthNet2021, enabling cross-dataset model comparisons. Our extensive qualitative and quantitative analyses reveal that our methods outperform a broad range of baseline techniques. This includes surpassing previous state-of-the-art models on EarthNet2021, as well as adapted models from time series forecasting and video prediction. To the best of our knowledge, this work presents the first models for continental-scale vegetation modeling at fine resolution able to capture anomalies beyond the seasonal cycle, thereby paving the way for predicting vegetation health and behaviour in response to climate variability and extremes. We provide open source code and pre-trained weights to reproduce our experimental results under https://github.com/vitusbenson/greenearthnet [11].
精确地理空间植被预测的创新应用在农业、林业、人道主义援助和碳核算等多个领域展现出巨大潜力。为了利用卫星图像的广泛可用性来完成此任务，已有多种工作采用深度神经网络来预测多光谱图像，达到逼真质量。然而，植被动态这一重要领域尚未得到充分探索。我们的研究开创了先河，推出了 GreenEarthNet，这是首个专为高分辨率植被预测设计的数据集，以及 Contextformer，一种新颖的深度学习方法，用于从覆盖欧洲的 Sentinel 2 卫星图像中以精细分辨率预测植被绿度。我们的多模态变压器模型 Contextformer 通过视觉骨干网络利用空间上下文，并以参数高效的方式在包含气象时间序列的局部上下文补丁上预测时间动态。GreenEarthNet 数据集具备学习型云掩膜和适用于植被建模的评估方案。它还保持与现有卫星影像预测数据集 EarthNet2021 的兼容性，支持跨数据集模型比较。我们广泛的定性和定量分析表明，我们的方法优于多种基线技术，包括在 EarthNet2021 上超越先前的最先进模型，以及来自时间序列预测和视频预测的适配模型。据我们所知，这项工作首次提出了能够捕捉季节周期之外异常现象的大陆尺度高分辨率植被建模模型，从而为预测植被健康状况及应对气候变化和极端情况的行为铺平了道路。我们提供了开源代码和预训练权重，以在 https://github.com/vitusbenson/greenearthnet [11]下复现我们的实验结果。

Refer to caption — Figure 1: Future vegetation status $\hat{V}$ is predicted with deep learning models $f$ from past satellite imagery $X$ , past and future weather $C$ and elevation $E$ . The dataset GreenEarthNet spans across Europe with minicubes split into train (red markers), temporal OOD test (ood-t, yellow) and spatio-temporal OOD test (ood-st, blue) subsets.
图 1：利用深度学习模型 $f fitalic_f$ ，基于过往卫星图像 $X Xitalic_X$ 、历史及未来气象数据 $C Citalic_C$ 以及海拔信息 $E Eitalic_E$ ，预测未来植被状况 $\hat{V}over^ start_ARG italic_V end_ARG$ 。数据集 GreenEarthNet 覆盖整个欧洲，其迷你立方体被划分为训练集（红色标记）、时间外分布测试集（ood-t，黄色）及时空外分布测试集（ood-st，蓝色）。

1 Introduction 1 引言

Optical satellite images have been proven useful for monitoring vegetation status. This is essential for a variety of applications in agricultural planning, forestry advisory, humanitarian assistance or carbon monitoring. In all these cases, prognostic information is relevant: Farmers want to know how their farmland may react to a given weather scenario [87]. Humanitarian organisations need to understand the localized impact of droughts on pastoral communities for mitigation of famine with anticipatory action [52]. Afforestation efforts need to consider how their forests react to future climate [75]. However, providing such prognostic information through fine resolution vegetation forecasts is challenging as it requires a model that considers ecological memory effects [37], spatial interactions and the influence of weather variations. Deep neural networks have proven successful at modeling relationships in space, time or across modalities. Hence, their application to vegetation forecasting given a sufficiently large dataset seems natural.
光学卫星图像已被证明对监测植被状况非常有用。这对于农业规划、林业咨询、人道主义援助或碳监测等众多应用至关重要。在所有这些情况下，预测信息都具有相关性：农民希望了解他们的农田如何对特定天气情景做出反应[87]。人道主义组织需要了解干旱对牧区社区的局部影响，以便通过预防性行动缓解饥荒[52]。植树造林工作需要考虑其森林如何应对未来气候[75]。然而，通过高分辨率植被预测提供此类预测信息具有挑战性，因为它需要一个考虑生态记忆效应[37]、空间相互作用和天气变化影响的模型。深度神经网络在空间、时间或跨模态建模关系方面已证明成功。因此，在拥有足够大数据集的情况下，将其应用于植被预测似乎是自然而然的选择。

So far, deep learning in the domain of vegetation forecasting can be roughly grouped into two categories: low-resolution global vegetation forecasting and high-resolution local satellite imagery forecasting. The former [32, 37, 9, 40, 97, 69, 49] builds upon long-term measurements of vegetation status from the coarse resolution AVHRR and MODIS satellites. These methods overlook the heterogeneity within each pixel, e.g. a grassland will react very different than a forest, two neighbouring fields can have almost opposite dynamics depending on the type of crops (e.g. winter wheat vs. maize), and vegetation on a north-facing slope close to a river will generally react different to drought stress than on a rocky south-facing slope. The latter [64, 35, 18, 65, 72, 79], in contrast, aims at modeling the field-scale heterogeneity as observed from the Sentinel 2 and Landsat satellites through self-supervised learning. However, these approaches have so far focused on perceptual image quality instead of vegetation dynamics, which renders their suitability for vegetation forecasting uncertain.
到目前为止，植被预测领域的深度学习大致可分为两类：低分辨率全球植被预测和高分辨率局部卫星图像预测。前者[32, 37, 9, 40, 97, 69, 49]基于来自粗分辨率 AVHRR 和 MODIS 卫星的长期植被状态测量数据构建。这些方法忽略了每个像素内的异质性，例如，草地与森林的反应截然不同，相邻的两块田地可能因作物类型（如冬小麦与玉米）而呈现出几乎相反的动态变化，靠近河流的北坡植被通常对干旱压力的反应与岩石南坡上的植被不同。相比之下，后者[64, 35, 18, 65, 72, 79]旨在通过自监督学习对从 Sentinel 2 和 Landsat 卫星观测到的田间尺度异质性进行建模。然而，这些方法迄今为止主要关注感知图像质量而非植被动态，这使得它们在植被预测中的适用性尚不确定。

The largest dataset for high resolution vegetation forecasting [91] is called EarthNet2021 [64]. In theory, a model trained on EarthNet2021 can forecast satellite images of high perceptual quality. It is, however, harder to assess if this skill also translates to predicting derived vegetation dynamics. Here, the suitability of EarthNet2021 is limited by a faulty cloud mask, insufficient baselines and a poorly interpretable evaluation protocol. For instance, natural vegetation follows a strong seasonal cycle, making a vegetation climatology a necessary baseline to compare with, which is lacking on EarthNet2021.
用于高分辨率植被预报的最大数据集[91]称为 EarthNet2021[64]。理论上，在 EarthNet2021 上训练的模型能够预测具有高感知质量的卫星图像。然而，更难评估这种技能是否也能转化为预测衍生植被动态的能力。在此，EarthNet2021 的适用性受到错误云掩膜、不足的基线以及难以解释的评估协议的限制。例如，自然植被遵循强烈的季节周期，使得植被气候学成为必要的比较基线，而 EarthNet2021 中缺乏这一点。

In this paper, we approach continental-scale modeling of vegetation dynamics. To achieve this, we predict remotely sensed vegetation greenness at 20m conditioned on coarse-scale weather. For this, we introduce the GreenEarthNet dataset. It includes Sentinel 2 bands red, green, blue and near-infrared and a high quality deep learning-based cloud mask, which allows to distinguish between anomalies due to data corruption and those due to meteorological and anthropogenic influence. The training locations and spatial and temporal dimensions of GreenEarthNet are kept consistent with the EarthNet2021 dataset, which enables the re-use of the leading models ConvLSTM [18], SGED-ConvLSTM [35] and Earthformer [23] as baselines for vegetation forecasting. In other words, GreenEarthNet is a complete remake of the EarthNet2021 dataset, removing all its weaknesses and enabling self-supervised learning for geospatial vegetation forecasting. To advance state-of-the-art on the new dataset, we present a light-weight transformer model: the Contextformer. It utilizes a Pyramid Vision Transformer [82, 83] as vision backbone and local context patches to make use of spatial interactions. It then models temporal dynamics conditioned on coarse-scale meteorology with a temporal transformer encoder. Finally, it puts a strong prior on persistence through a delta-prediction scheme starting from an initial, gap filled observation. Fig. 1 presents a sketch of our GreenEarthNet approach: Future vegetation state ( $\hat{V}$ ) is predicted from past satellite image spectra ( $X$ ), past and future weather data ( $C$ ), and elevation information ( $E$ ) via a deep neural network: the Contextformer ( $f$ ).
本文探讨了大陆尺度植被动态建模。为此，我们预测以粗尺度天气为条件的 20 米分辨率遥感植被绿度。为此，我们引入了 GreenEarthNet 数据集。该数据集包含 Sentinel 2 的红、绿、蓝和近红外波段，以及基于高质量深度学习的云掩膜，能够区分数据损坏引起的异常与气象和人为影响引起的异常。GreenEarthNet 的训练地点以及空间和时间维度与 EarthNet2021 数据集保持一致，使得领先的模型 ConvLSTM [18]、SGED-ConvLSTM [35]和 Earthformer [23]可以作为植被预测的基线模型。换言之，GreenEarthNet 是 EarthNet2021 数据集的全面重制，消除了其所有弱点，并实现了地理空间植被预测的自监督学习。为了在新数据集上推进最先进的技术，我们提出了一种轻量级 transformer 模型：Contextformer。它利用 Pyramid Vision Transformer [82, 83]作为视觉骨干，并结合局部上下文块以利用空间交互。然后，它通过一个时间变换器编码器对基于粗尺度气象学的时序动态进行建模。最后，通过从初始的、填补空缺的观测开始的增量预测方案，对持续性施加强先验。图 1 展示了我们的 GreenEarthNet 方法的示意图：未来的植被状态（ $\hat{V}over^ start_ARG italic_V end_ARG$ ）是从过去的卫星图像光谱（ $X Xitalic_X$ ）、过去和未来的天气数据（ $C Citalic_C$ ）以及高程信息（ $E Eitalic_E$ ）通过深度神经网络——Contextformer（ $f fitalic_f$ ）——预测得到的。

Our major contributions can be summarized as follows. (1) We present the GreenEarthNet dataset, the first large-scale dataset suitable for prediction of within-year vegetation dynamics, including a learned cloud mask and a new evaluation scheme. (2) We introduce the Contextformer, a novel multi-modal transformer model suitable for vegetation forecasting, leveraging spatial context through its vision backbone, and forecasting the temporal evolution of small context patches with a temporal transformer. (3) We compare the Contextformer against a previously unseen variety of state-of-the-art models from related tasks and find it outperforms all of them both across metrics.
我们的主要贡献可概括如下。(1) 我们提出了 GreenEarthNet 数据集，这是首个适用于年内植被动态预测的大规模数据集，包括一个学习型云掩膜和一种新的评估方案。(2) 我们引入了 Contextformer，这是一种新颖的多模态 transformer 模型，适用于植被预测，通过其视觉骨干网络利用空间上下文，并使用时间 transformer 预测小上下文块的时间演变。(3) 我们将 Contextformer 与一系列先前未见过的相关任务的最先进模型进行了比较，发现它在各项指标上均优于所有这些模型。

2 Related Work 2 相关工作

Vegetation forecasting There is a growing interest in vegetation growth forecasting driven by the democratization of machine learning techniques, the availability of remote sensing data, and the urgency to address climate change [22, 33, 16, 41]. Numerous studies in vegetation modeling use coarse resolution data from satellites like AVHRR or MODIS [32, 37, 40, 97, 49]. Since 2015, Sentinel-2 has provided high-resolution satellite imagery (up to $10$ m), enabling more localized modeling. The introduction of EarthNet2021 [64] marked the first dataset for self-supervised Earth surface forecasting, which contains predicting satellite imagery and derived vegetation state with a focus on perceptual quality. Subsequently, the ConvLSTM model [70] has been widely used for satellite imagery prediction [18, 35, 1, 65, 48], hence we are including it as a baseline.
植被预测随着机器学习技术的普及、遥感数据的可用性增强以及应对气候变化的紧迫性，人们对植被生长预测的兴趣日益增长[22, 33, 16, 41]。众多植被建模研究利用了 AVHRR 或 MODIS 等卫星提供的粗分辨率数据[32, 37, 40, 97, 49]。自 2015 年以来，Sentinel-2 卫星提供了高分辨率卫星影像（最高可达 10 米），使得更精细化的建模成为可能。EarthNet2021[64]的推出标志着首个用于自监督地球表面预测的数据集诞生，该数据集专注于预测卫星影像及其衍生的植被状态，强调感知质量。随后，ConvLSTM 模型[70]被广泛应用于卫星影像预测[18, 35, 1, 65, 48]，因此我们将其作为基线模型纳入研究。

Spatio-temporal learning Learning spatio-temporal dynamics (as in the case of vegetation forecasting) is a challenge across many disciplines. Often, temporal dynamics dominate, so local time series models can be effective. For instance in traffic, weather or electricity forecasting, time series models such as LSTM [29], Prophet [78], Autoformer [88] or NBeats [96] yield useful performance. Still, often spatial interactions are important or at least offer additional predictive capacity. For instance in video prediction, ConvNets [6, 24], ConvLSTM [70], ConvLSTM successors [85, 89], PredRNN [84, 85], SimVP [77] and transformers [25, 53, 23] have been found skillful. Often, the necessity of modeling the spatial component translates to Earth science: spatio-temporal deep learning is being applied for precipitation nowcasting [61, 71], weather forecasting [12, 38, 57], climate projection [55], and wildfire modeling [36]. Hence, when evaluating our Contextformer model, we need to do so against strong baselines from video prediction [70, 23, 85, 77], as a priori one might expect them to outperform also on vegetation forecasting. However, vegetation forecasting does present some unique challenges, it builds upon multi-modal data fusion and requires capturing across-scale relationships (in time and space), which may prove challenging for existing video prediction models and thus interesting to the computer vision community.
时空学习学习时空动态（如植被预测的情况）是跨多个学科的挑战。通常，时间动态占主导地位，因此局部时间序列模型可以有效。例如，在交通、天气或电力预测中，LSTM [29]、Prophet [78]、Autoformer [88] 或 NBeats [96] 等时间序列模型能产生有用的性能。然而，空间交互往往也很重要，至少能提供额外的预测能力。例如，在视频预测中，ConvNets [6, 24]、ConvLSTM [70]、ConvLSTM 后继者 [85, 89]、PredRNN [84, 85]、SimVP [77] 和 transformers [25, 53, 23] 已被证明有效。通常，建模空间成分的必要性转化为地球科学领域：时空深度学习正应用于降水临近预报 [61, 71]、天气预报 [12, 38, 57]、气候预测 [55] 和野火建模 [36]。因此，在评估我们的 Contextformer 模型时，我们需要将其与视频预测中的强基线 [70, 23, 85, 77] 进行比较，因为先验上可能预期它们在植被预测中也能表现出色。然而，植被预测确实带来了一些独特的挑战，它建立在多模态数据融合的基础上，并需要捕捉跨尺度的关系（在时间和空间上），这对现有的视频预测模型可能构成挑战，因此引起了计算机视觉领域的兴趣。

Multi-modal transformers for data fusion Levering remote sensing data often means multi-modal data fusion. Recently, machine learning methods have shown significant advancements in fusing different satellite sensors compared to traditional approaches [73, 86, 3, 17, 42]. This includes recent work on combining Sentinel 2 and SAR data to impute cloudy Sentinel 2 images [81, 51, 93]. Gapfilling vegetation time series could also be done with the models presented in this study, as they leverage meteorology to inform the imputation [74]. However, as gapfilling is just done in retrospective, one should rather resort to complementary satellite data like SAR.
多模态变换器用于数据融合利用遥感数据通常意味着多模态数据融合。最近，与传统方法相比，机器学习方法在融合不同卫星传感器方面取得了显著进展 [73, 86, 3, 17, 42]。这包括最近关于结合哨兵 2 号和 SAR 数据来填补云覆盖的哨兵 2 号图像的工作 [81, 51, 93]。本研究中提出的模型也可以用于填补植被时间序列的空白，因为它们利用气象数据来指导填补 [74]。然而，由于填补工作仅在回顾性中进行，因此更应求助于 SAR 等互补卫星数据。

Transformers [80] offer a compelling approach to handle multi-modal data [31]. Their efficacy in remote sensing has been shown multiple times [47, 23, 13, 79]. In particular, geospatial foundation models [50, 62, 13, 94, 72, 79] make use of this approach, often through masked token modeling [26] with Vision Transformers [20]. Our Contextformer follows this line of research, yet in contrast to geospatial foundation models, it is more targeted for vegetation forecasting and only utilizes a pre-trained vision transformer as a vision backbone.
Transformer [80] 提供了一种引人注目的方法来处理多模态数据 [31]。其在遥感领域的有效性已被多次证明 [47, 23, 13, 79]。特别是，地理空间基础模型 [50, 62, 13, 94, 72, 79] 利用了这种方法，通常通过带有 Vision Transformer [20] 的掩码令牌建模 [26] 来实现。我们的 Contextformer 遵循这一研究路线，但与地理空间基础模型不同，它更专注于植被预测，并且仅使用预训练的视觉 Transformer 作为视觉骨干。

3 Methods 3 方法

3.1 Task 3.1 任务

We predict the future NDVI, a remote sensing proxy of vegetation state ( $V^{t}\in\mathbb{R}^{H\times W},t\in[T+1,T+K]$ ) conditioned on past satellite imagery ( $X^{t}\in\mathbb{R}^{H\times W},t\in[1,T]$ ), past and future weather ( $C^{t}\in\mathbb{R},t\in[1,T+K]$ ) and static elevation maps ( $E\in\mathbb{R}^{H\times W}$ ). Hence, denoting a model $f(.;\theta)$ with parameters $\theta$ , we obtain vegetation forecasts as:
我们预测未来的归一化植被指数（NDVI），这是植被状态的遥感代理（ $V^{t}\in\mathbb{R}^{H\times W},t\in[T+1,T+K]italic_V start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W end_POSTSUPERSCRIPT , italic_t ∈ [ italic_T + 1 , italic_T + italic_K ]$ ），基于过去的卫星图像（ $X^{t}\in\mathbb{R}^{H\times W},t\in[1,T]italic_X start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W end_POSTSUPERSCRIPT , italic_t ∈ [ 1 , italic_T ]$ ）、过去和未来的天气数据（ $C^{t}\in\mathbb{R},t\in[1,T+K]italic_C start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∈ blackboard_R , italic_t ∈ [ 1 , italic_T + italic_K ]$ ）以及静态的高程地图（ $E\in\mathbb{R}^{H\times W}italic_E ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W end_POSTSUPERSCRIPT$ ）。因此，用参数 $\thetaitalic_θ$ 表示模型 $f(.;\theta)italic_f ( . ; italic_θ )$ ，我们得到的植被预测如下：

\displaystyle\hat{V}^{T+1:T+K}=f(X^{1:T},C^{1:T+K},E;\theta)

(1)

In this paper most models are deep neural networks, trained with stochastic gradient descent to maximize a Gaussian Likelihood. More specifically, the optimal parameters $\theta^{*}$ are obtained by minimizing the mean squared error over valid pixels $V_{*}^{t}=V^{t}\odot M_{Q}^{t}\odot M_{L}$ , where $M_{Q}\in\{0,1\}^{H\times W}$ masks pixels that are cloudy, cloud shadow or snow, $M_{L}\in\{0,1\}^{H\times W}$ masks pixels that are not cropland, forest, grassland or shrubland and $\odot$ denotes elementwise multiplication. Hence the training objective (leaving out dimensions for simplicity) is
本文中，大多数模型采用深度神经网络，通过随机梯度下降法训练以最大化高斯似然。更具体地说，最优参数 $\theta^{*}italic_θ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT$ 通过最小化有效像素上的均方误差 $V_{*}^{t}=V^{t}\odot M_{Q}^{t}\odot M_{L}italic_V start_POSTSUBSCRIPT * end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = italic_V start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⊙ italic_M start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⊙ italic_M start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT$ 获得，其中 $M_{Q}\in\{0,1\}^{H\times W}italic_M start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_H × italic_W end_POSTSUPERSCRIPT$ 屏蔽了云、云影或雪的像素， $M_{L}\in\{0,1\}^{H\times W}italic_M start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_H × italic_W end_POSTSUPERSCRIPT$ 屏蔽了非农田、森林、草地或灌木丛的像素， $\odot⊙$ 表示逐元素乘法。因此，训练目标（为简化起见省略维度）为

\displaystyle\theta^{*}=\underset{\theta}{argmin}\frac{\sum(V-\hat{V})^{2}% \odot M_{Q}\odot M_{L}}{\sum M_{Q}\odot M_{L}}

(2)

In this work $H=W=128\text{px}$ , $T=10$ and $K=20$ .
在这项工作中 $H=W=128\text{px}italic_H = italic_W = 128 px$ ， $T = 10 T=10italic_T = 10$ 和 $K = 20 K=20italic_K = 20$ 。

3.2 Our proposed Contextformer model
3.2 我们提出的 Contextformer 模型

To tackle the vegetation forecasting task, we develop the Contextformer, a multi-modal transformer model operating on local spatial context patches (hence the name) and trained with self-supervised learning for predicting vegetation state across Europe. Next to historical satellite imagery, it leverages an elevation map and meteorological data.
为应对植被预测任务，我们开发了 Contextformer，这是一种多模态 Transformer 模型，基于局部空间上下文块（因此得名）进行操作，并通过自监督学习训练，以预测欧洲范围内的植被状态。除了历史卫星图像外，它还利用了高程图和气象数据。

Overview Our proposed Contextformer follows an encode-process-decode [10] configuration. Encoders and decoders operate in the spatial domain without temporal fusion, while the processor translates latent features temporally in local context patches. We use two encoders (meteo and vision), a temporal transformer processor and a decoder that predicts the delta from the last cloud free NDVI observation (see Fig. 2).
概述我们提出的 Contextformer 遵循编码-处理-解码[10]配置。编码器和解码器在空间域中运行，不进行时间融合，而处理器则在局部上下文块中对潜在特征进行时间上的转换。我们使用了两个编码器（气象和视觉），一个时间变换处理器，以及一个预测自上次无云 NDVI 观测以来变化的解码器（见图 2）。

Encoders The meteo encoder (for weather) and the delta decoders are parameterized as multi-layer perceptrons (MLPs) (Fig. 2 red boxes). For the vision encoder (Fig. 2 yellow boxes), we follow the MMST-ViT model for crop yield prediction [43] and use a Pyramid Vision Transformer (PVT) v2 B0 [82, 83], which is particularly suitable for dense prediction tasks. It divides the images of each time step into patches of $4\times 4$ px and then creates an embedding for each of the patches with a global receptive field. In other words, information is exchanged spatially at each time step, but not across time steps. We merge multi-scale features from the different stages of an ImageNet pre-trained PVT v2 B0, upscale them to our patches, concatenate and project. The resulting features for each image stack (satellite & elevation) contain multi-scale and spatial context information.
编码器气象编码器（用于天气）和增量解码器被参数化为多层感知器（MLPs）（图 2 红色框）。对于视觉编码器（图 2 黄色框），我们遵循 MMST-ViT 模型进行作物产量预测[43]，并采用金字塔视觉变换器（PVT）v2 B0 [82, 83]，该模型特别适合密集预测任务。它将每个时间步的图像分割成 $4\times 44 × 4$ 像素的块，然后为每个块创建一个具有全局感受野的嵌入。换句话说，信息在每个时间步内进行空间交换，但不在时间步之间交换。我们将来自 ImageNet 预训练的 PVT v2 B0 不同阶段的多种尺度特征合并，将其上采样至我们的块大小，连接并投影。每个图像堆栈（卫星与高程）的最终特征包含多尺度和空间上下文信息。

Masked Token Modeling During training, we drop out patches in a masked token modeling approach [26] and replace them with a learned masking token. We randomly ( $p=0.5$ ) switch between inference mode, where we drop all patches for time steps $10$ to $30$ , and random dropout mode, where we mask $70\%$ of the patches for time steps $3$ to $30$ . At test time, we only use inference mode: the model never sees future satellite imagery. In addition, we use the cloud mask to drop every patch with at least one cloudy pixel. After applying the masking, a sinusoidal temporal positional encoding [79] and the weather embeddings from the meteo encoder are added to each patch.
掩码令牌建模在训练过程中，我们采用掩码令牌建模方法[26]对部分图像块进行丢弃，并用学习到的掩码令牌替换它们。我们随机（ $p = 0.5 p=0.5italic_p = 0.5$ ）在推理模式和随机丢弃模式之间切换：在推理模式下，我们在时间步 $101010$ 到 $303030$ 丢弃所有图像块；在随机丢弃模式下，我们在时间步 $333$ 到 $303030$ 对 $70\%70 %$ 的图像块进行掩码处理。在测试时，我们仅使用推理模式：模型永远不会看到未来的卫星图像。此外，我们利用云掩码丢弃至少包含一个云像素的每个图像块。应用掩码后，正弦时间位置编码[79]和来自气象编码器的天气嵌入被添加到每个图像块中。

Processor The temporal transformer (Fig. 2 green box) processes patches in parallel, i.e. it exchanges information across the $30$ time steps, but spatially only within each $4\times 4$ px patch.The idea here is that for ecosystem processes, spatial context is crucial but does not change dynamically. Therefore, separating spatial and temporal processing enhances efficiency. We maintain a small local context ( $4\times 4$ px) within the temporal encoder due to Sentinel 2’s sub-pixel inaccuracies causing slight pixel shifts over time. This approach significantly reduces the model’s memory cost during training (by $16\times$ ), enabling larger batch sizes. Our temporal transformer is implemented following Presto’s transformer encoder [79], which is based on the standard vision transformer [20].
处理器时间变换器（图 2 绿色框）并行处理补丁，即它在 $303030$ 时间步长之间交换信息，但在每个 $4\times 44 × 4$ 像素补丁内仅在空间上进行。这里的想法是，对于生态系统过程，空间上下文至关重要但不会动态变化。因此，分离空间和时间处理提高了效率。由于哨兵 2 号的亚像素不准确性导致像素随时间轻微偏移，我们在时间编码器中保持了一个小的局部上下文（ $4\times 44 × 4$ 像素）。这种方法显著降低了模型训练期间的内存成本（减少了 $16\times16 ×$ ），从而能够使用更大的批量大小。我们的时间变换器是根据 Presto 的变换器编码器[79]实现的，该编码器基于标准的视觉变换器[20]。

Output Our Contextformer leverages the persistence within the vegetation dynamics by predicting only a deviation from an initial state. More specifically, we compute the last cloud free NDVI observation from the historical period ( $10$ time steps) using the cloud mask and use it as initial prediction $\hat{V}^{0}$ . Then, the delta decoder predicts a deviation $\delta^{i}$ for each of the future period token embeddings (Fig. 2 right side). The final NDVI prediction is computed as $\hat{V}^{i}=\hat{V}^{0}+\delta^{i}$ . A similar delta framework was used in training the SGED-ConvLSTM [35]. However, that model predicts only one step ahead in an iterative fashion, predicting the deviation to the previous time step. In our multi-step prediction setting, this would result in a cumulative sum on the outputs, which is undesirable for training gradients.
输出我们的 Contextformer 通过仅预测偏离初始状态的偏差来利用植被动态中的持久性。更具体地说，我们使用云掩码计算历史时期（ $101010$ 时间步）的最后一个无云 NDVI 观测值，并将其作为初始预测 $\hat{V}^{0}over^ start_ARG italic_V end_ARG start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT$ 。然后，delta 解码器为每个未来时期令牌嵌入预测一个偏差 $\delta^{i}italic_δ start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT$ （图 2 右侧）。最终的 NDVI 预测计算为 $\hat{V}^{i}=\hat{V}^{0}+\delta^{i}over^ start_ARG italic_V end_ARG start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = over^ start_ARG italic_V end_ARG start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT + italic_δ start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT$ 。在训练 SGED-ConvLSTM [35]时使用了类似的 delta 框架。然而，该模型仅以迭代方式预测下一步，预测相对于前一时间步的偏差。在我们的多步预测设置中，这将导致输出上的累积和，这对于训练梯度是不理想的。

3.3 GreenEarthNet Dataset 3.3GreenEarthNet 数据集

We present GreenEarthNet, a tailored dataset for high-resolution geospatial vegetation forecasting. It contains spatio-temporal minicubes [44], that are a collection of $30$ 5-daily satellite images ( $10$ historical, $20$ future), $150$ daily meteorological observations and an elevation map. Spatial dimensions are $128\times 128$ px ( $2.56\times 2.56$ km). To enable cross-dataset model comparisons, we re-use the training locations and predictor dimensions from the EarthNet2021 [64] dataset for Earth surface forecasting.
我们呈现 GreenEarthNet，这是一个专为高分辨率地理空间植被预测量身定制的数据集。它包含时空微立方体[44]，这些微立方体是 $303030$ 5 天间隔的卫星图像集合（ $101010$ 历史数据， $202020$ 未来数据）， $150150150$ 每日气象观测数据以及一张高程图。空间维度为 $128\times 128128 × 128$ 像素（ $2.56\times 2.562.56 × 2.56$ 公里）。为了实现跨数据集模型比较，我们沿用了 EarthNet2021[64]数据集用于地球表面预测的训练位置和预测维度。

	Works /w 工作/w
Algorithm 算法	GreenEarthNet 绿地球网	Prec 预	Rec 记录	F1
Sen2Cor	Yes 是的	0.83	0.60	0.70
FMask	No 不	0.85	0.85	0.85
KappaMask	No 不	0.74	0.88	0.81
UNet RGBNir	Yes 是的	0.91	0.90	0.90
UNet+Sen2CorSnow UNet+Sen2Cor 雪	Yes 是	0.83	0.93	0.88
UNet 13Bands UNet 13 波段	No 不	0.94	0.92	0.93

Table 1: Precision, recall and F1-score of different Sentinel 2 cloud masking algorithms.
表 1：不同哨兵 2 号云掩膜算法的精度、召回率和 F1 得分。

Satellite and Meteo Layers GreenEarthNet includes Sentinel 2 [46] satellite bands blue, green, red, and near-infrared at $20$ m (consistent with EarthNet2021) and E-OBS [14] interpolated meteorological station data, which represents high quality meteorology over Europe [7]. More specifically, the meteorological drivers wind speed, relative humidity, and shortwave downwelling radiation, alongside the rainfall, sea-level pressure, and temperature (daily mean, min & max) are included. To enable reproducible research and minicube generation anywhere on Earth, we open source a Python package called EarthNet Minicuber¹¹1https://github.com/earthnet2021/earthnet-minicuber, which generates multi-modal minicubes in a cloud native manner: only downloading the data chunks actually needed, instead of a full Sentinel 2 tile.
卫星与气象层 GreenEarthNet 包含 Sentinel 2 [46] 卫星的蓝、绿、红及近红外波段，分辨率为 $202020$ 米（与 EarthNet2021 一致），以及 E-OBS [14] 插值气象站数据，这些数据代表了欧洲地区高质量的气象信息 [7]。更具体地说，气象驱动因素包括风速、相对湿度、短波下行辐射，以及降雨量、海平面气压和温度（日均、最低与最高）。为实现可重复研究并能在全球任何地点生成迷你立方体，我们开源了一个名为 EarthNet Minicuber 的 Python 包，该工具以云原生方式生成多模态迷你立方体：仅下载实际所需的数据块，而非整个 Sentinel 2 瓦片。

Cloud mask Vegetation proxies derived from optical satellite imagery are only meaningful if observations with clouds, shadows and snow are excluded such that anomalies due to clouds can be distinguished from vegetation anomalies. We train a UNet with Mobilenetv2 encoder [68] on the CloudSEN12 dataset [4] to detect clouds and cloud shadows from RGB and Nir bands. Tab. 1 compares precision, recall and F1 scores for detecting faulty pixels. Our approach outperforms Sen2Cor [46] (used in EarthNet2021), FMask [59] and KappaMask [19] baselines by a large margin. If using Sen2Cor in addition, to allow for snow masking, precision drops, but recall increases: i.e. the cloud mask gets more conservative. Using all 13 Sentinel 2 L2A bands is better than just using 4 bands, however such a model is not directly applicable to GreenEarthNet.
从光学卫星图像中提取的云掩膜植被代理只有在排除云、阴影和雪的观测后才有意义，从而可以区分由云引起的异常与植被异常。我们在 CloudSEN12 数据集[4]上使用 Mobilenetv2 编码器[68]训练了一个 UNet，以从 RGB 和 Nir 波段检测云和云阴影。表 1 比较了检测错误像素的精度、召回率和 F1 分数。我们的方法大幅优于 Sen2Cor[46]（用于 EarthNet2021）、FMask[59]和 KappaMask[19]基线。如果额外使用 Sen2Cor 进行雪掩膜，精度会下降，但召回率会增加，即云掩膜变得更加保守。使用所有 13 个 Sentinel 2 L2A 波段比仅使用 4 个波段更好，然而这样的模型并不直接适用于 GreenEarthNet。

Test sets Due to meso-scale circulation patterns, weather has high spatial correlation lengths. For GreenEarthNet, we design test sets ensuring independence not only in the high-resolution satellite data but also in the coarse-scale meteorology between training and test minicubes. More specifically, we introduce the subsets
测试集由于中尺度环流模式，天气具有较高的空间相关长度。对于 GreenEarthNet，我们设计了测试集，确保不仅在训练和测试微立方体之间的高分辨率卫星数据中保持独立性，而且在粗尺度气象数据中也保持独立性。更具体地说，我们引入了子集。

•

Train, 23816 minicubes in years 2017-2020

• 训练，2017-2020 年间共 23816 个微立方体
•

Val 245 minicubes close to training locations, year 2020

• 2020 年，靠近训练地点的 Val 245 迷你立方体
•

OOD-t test same locations as Val, years 2021-2022

• OOD-t 测试与 Val 相同的位置，年份为 2021-2022
•

OOD-s test, 800 minicubes stratified over $1\degree\times 1\degree$ lat-lon grid cells outside training regions, years 2017-2019

• OOD-s 测试，在训练区域外的 $1\degree\times 1\degree1 ° × 1 °$ 经纬度网格单元中分层分布的 800 个微立方体，年份为 2017-2019 年
•

OOD-st test same as OOD-s, but years 2021-2022

• OOD-st 测试与 OOD-s 相同，但年份为 2021-2022 年

OOD-t is the main test set used throughout this study. It tests the models’ ability to extrapolate in time: i.e. we allow it to learn from past information about a location and want to know how it would perform in the future. val follows the same reasoning and hence allows for early stopping of models according to their temporal extrapolation skill. OOD-s and OOD-st test spatial extrapolation, as well as spatio-temporal extrapolation. The test sets are compatible with EarthNet2021: the Val/OOD-t locations are inside the EarthNet2021 IID tests set and the OOD-s/OOD-st locations are far away from EarthNet2021 training data. For all test sets, we create minicubes over four periods during the European growing season [67] each year: Predicting March-May (MAM), May-July (MJJ), July-September (JAS) and September-November (SON).
OOD-t 是本研究中使用的主要测试集。它测试模型在时间上的外推能力：即我们允许其学习某个地点的过去信息，并希望了解它在未来的表现如何。val 遵循相同的推理逻辑，因此可以根据模型的时间外推技能进行早期停止。OOD-s 和 OOD-st 测试空间外推，以及时空外推。测试集与 EarthNet2021 兼容：Val/OOD-t 的位置位于 EarthNet2021 IID 测试集内，而 OOD-s/OOD-st 的位置远离 EarthNet2021 训练数据。对于所有测试集，我们在每年欧洲生长季的四个时期创建了迷你立方体：预测 3 月至 5 月（MAM）、5 月至 7 月（MJJ）、7 月至 9 月（JAS）和 9 月至 11 月（SON）。

Additional Layers We add the ESA Worldcover Landcover map [95] for selecting only vegetated pixels during evaluation, the Geomorpho90m Geomorphons map [2] for further evaluation and the ALOS [76], Copernicus [21] and NASA [15] DEMs, to provide uncertainty in the elevation maps. Finally, we provide georeferencing for each minicube, enabling their extension with further data.
附加层我们添加 ESA Worldcover 土地覆盖图[95]，用于在评估过程中仅选择植被像素，添加 Geomorpho90m 地貌图[2]以进行进一步评估，并添加 ALOS[76]、Copernicus[21]和 NASA[15]的 DEM，以提供高程图的不确定性。最后，我们为每个迷你立方体提供地理配准，使其能够扩展更多数据。

3.4 Evaluation 3.4 评估

We resort to traditional metrics in environmental modeling:
我们采用环境建模中的传统指标：

•

$R^{2}$ , the squared pearson correlation coefficient

• $R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT$ , 平方皮尔逊相关系数
•

RMSE, the root mean squared error

• RMSE，均方根误差
•

$\text{NSE}=1-\dfrac{MSE(V,\hat{V})}{Var[V]}$ , the nash-sutcliffe efficiency [54], a measure of relative variability

• $\text{NSE}=1-\dfrac{MSE(V,\hat{V})}{Var[V]}NSE = 1 - divide start_ARG italic_M italic_S italic_E ( italic_V , over^ start_ARG italic_V end_ARG ) end_ARG start_ARG italic_V italic_a italic_r [ italic_V ] end_ARG$ ，纳什-萨特克利夫效率[54]，一种相对变异性的度量
•

$|\text{bias}|=|\overline{V}-\overline{\hat{V}}|$ , the absolute bias

• $|\text{bias}|=|\overline{V}-\overline{\hat{V}}|| bias | = | over¯ start_ARG italic_V end_ARG - over¯ start_ARG over^ start_ARG italic_V end_ARG end_ARG |$ , 绝对偏差

In addition, we propose to measure if a model is better than the NDVI climatology, by computing the Outperformance score: the percentage of minicubes, for which the model is better in at least 3 out of the 4 metrics. Here, better means their score difference (ordering s.t. higher=better) exceeds $0.01$ for RMSE and $|\text{bias}|$ and $0.05$ for NSE and $R^{2}$ . We also report the RMSE over only the first 25 days (5 time steps) of the target period.
此外，我们建议通过计算超越分数来衡量模型是否优于 NDVI 气候学：即在至少 3 项指标上表现更优的微立方体所占的百分比。此处，“更优”意味着其得分差异（排序使得更高=更优）在 RMSE 上超过 $0.01 0.010.01$ ，在 NSE 和 $R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT$ 上分别超过 $|\text{bias}|| bias |$ 和 $0.05 0.050.05$ 。我们还会报告目标周期前 25 天（5 个时间步）内的 RMSE。

We compute all metrics per pixel over clear-sky timesteps. We then consider only pixels with vegetated landcover (cropland, grassland, forest, shrubland), no seasonal flooding (minimum NDVI $>0$ ), enough observations ( $\geq 10$ during target period, $\geq 3$ during context period) and considerable variation (NDVI std. dev $>0.1$ ). All these pixelwise scores are grouped by minicube and landcover, and then aggregated to account for class imbalance. Finally, the macro-average of the scores per landcover class is computed. In this way, the scores represent a conservative estimate of the expected performance of dynamic vegetation modeling during a new year or at a new location.
我们在晴朗天气的时间步长上计算每个像素的所有指标。然后，仅考虑具有植被覆盖（农田、草地、森林、灌木地）、无季节性洪涝（最小 NDVI $> 0 >0> 0$ ）、足够观测数据（目标期间 $\geq 10≥ 10$ ，上下文期间 $\geq 3≥ 3$ ）以及显著变异（NDVI 标准差 $> 0.1 >0.1> 0.1$ ）的像素。所有这些逐像素得分按微立方体和土地覆盖类型分组，并进行聚合以考虑类别不平衡。最后，计算每种土地覆盖类别的得分宏平均值。通过这种方式，得分反映了在新一年或新地点动态植被建模预期性能的保守估计。

	Model 模型	$R^{2}$ $\uparrow$	RMSE $\downarrow$ 均方根误差 $\downarrow↓$	NSE $\uparrow$	$\|\text{bias}\|$ $\downarrow$	$\underset{\text{Climatology}}{\text{Outperform}}$ $\uparrow$	$\underset{25\text{ days}}{\text{RMSE}}$ $\downarrow$	#Params
non-ML 非机器学习	Persistence 持久性	0.00	0.23	-1.28	0.17	21.8%	0.09	0
	Previous year 上一年	0.56	0.20	-0.40	0.14	19.3%	0.18	0
	Climatology 气候学	0.58	0.18	-0.34	0.13	n.a. 不适用	0.16	0
local TS 本地 TS	Kalman filter 卡尔曼滤波器	0.41	0.19	-0.57	0.13	27.0%	0.16	$\mathcal{O}$ (10)
	LightGBM	0.51	0.17	-0.22	0.12	42.2%	0.11	n.a. 不适用
	Prophet 先知	0.57	0.16	-0.05	0.11	60.6%	0.13	$\mathcal{O}$ (10)
EN21	ConvLSTM [18]	0.51	0.18	-0.37	0.12	43.9%	0.12	0.2M
	SG-ConvLSTM [35]	0.53	0.19	-0.33	0.14	45.8%	0.11	0.7M
	Earthformer [23]	0.49	0.17	-0.27	0.12	47.2%	0.11	60.6M
This Study 本研究	ConvLSTM [65]	0.58 $\pm$ 0.01	0.16 $\pm$ 0.00	-0.13 $\pm$ 0.02	0.11 $\pm$ 0.00	53.1% $\pm$ 1.2%	0.11 $\pm$ 0.00	1.0M
	Earthformer [23]	0.52	0.16	-0.13	0.10	56.5%	0.09	60.6M
	PredRNN [85]	0.62 $\pm$ 0.00	0.15 $\pm$ 0.00	0.03 $\pm$ 0.00	0.10 $\pm$ 0.00	64.7% $\pm$ 1.2%	0.10 $\pm$ 0.00	1.4M
	SimVP [77]	0.60 $\pm$ 0.00	0.15 $\pm$ 0.00	0.03 $\pm$ 0.01	0.09 $\pm$ 0.00	64.1% $\pm$ 1.0%	0.10 $\pm$ 0.00	6.6M
	Contextformer (Ours) 上下文变换器（我们的）	0.62 $\pm$ 0.00	0.14 $\pm$ 0.00	0.09 $\pm$ 0.01	0.09 $\pm$ 0.00	66.8% $\pm$ 0.3%	0.08 $\pm$ 0.00	6.1M

Table 2: Quantitative Results. Mean (

\pm

std. dev.) are computed from three different random seeds.
表 2：定量结果。平均值（

\pm

标准差）由三个不同的随机种子计算得出。

3.5 Baselines 3.5 基线

We evaluate Contextformer against diverse baselines representative of various model classes, including non-ML methods, time series forecasting models, the top 3 performers on the EarthNet2021 benchmark [64], classical, and two state-of-the-art video prediction models. This choice aims to account for uncertainty regarding the optimal models for vegetation forecasting, considering factors such as the relevance of spatial context. While existing spatio-temporal earth surface forecasting models are expected to serve as strong baselines due to task similarity, recent advancements in video prediction, leveraging the perspective of satellite image time series as a video, may also offer a competitive advantage.
我们评估了 Contextformer 与多种基线模型的对比，这些基线代表了不同的模型类别，包括非机器学习方法、时间序列预测模型、EarthNet2021 基准测试中表现前三的模型、经典模型以及两种最先进的视频预测模型。这一选择旨在考虑到影响植被预测最佳模型选择的不确定性因素，如空间上下文的相关性。尽管现有的时空地球表面预测模型因其任务相似性而被视为强有力的基线，但利用卫星图像时间序列作为视频视角的最新视频预测技术也可能提供竞争优势。

Non-ML baselines We evaluate three non-ML baselines related to ecological memory: persistence [64] (last cloud free NDVI pixel), previous year [65] (linearly interpolated) and climatology (mean NDVI seasonal cycle).
非机器学习基线我们评估了三种与生态记忆相关的非机器学习基线：持久性[64]（最后一个无云的 NDVI 像素）、前一年[65]（线性插值）和气候学（平均 NDVI 季节周期）。

Local time series models We compare against three common time series models: Kalman filter, LightGBM [34] and Prophet [78] from the Python library darts [28]. These are trained on timeseries from a single pixel and applied to forecast this pixel, given future weather as covariates. They are expensive to run: a single minicube takes $\sim 3$ h on an 8-CPU machine, $\mathcal{O}(10^{4})$ slower than deep learning. We also evaluate a global timeseries model: the LSTM (implemented as ConvLSTM with 1x1 kernel). The time series models should be strong if spatial interactions are less predictive for vegetation evolution.
本地时间序列模型我们对比了三种常见的时间序列模型：卡尔曼滤波器、LightGBM [34] 和 Prophet [78]，均来自 Python 库 darts [28]。这些模型在单个像素的时间序列上进行训练，并应用于预测该像素，给定未来的天气作为协变量。它们的运行成本较高：在一个 8 核 CPU 机器上，单个 minicube 需要 $\sim 3∼ 3$ 小时，比深度学习慢 $\mathcal{O}(10^{4})caligraphic_O ( 10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT )$ 。我们还评估了一个全局时间序列模型：LSTM（实现为 1x1 卷积核的 ConvLSTM）。如果空间交互对植被演变的预测性较低，时间序列模型应表现强劲。

EarthNet2021 models We also evaluate the Top-3 models from the EarthNet2021 challenge leaderboard²²2https://web.archive.org/web/20230228215255/https://www.earthnet.tech/en21/ch-leaderboard/ using their trained weights: a regular ConvLSTM [18], an encode-process-decode ConvLSTM called SGED-ConvLSTM [35] and the Earthformer [23], a transformer model using cuboid-attention.
EarthNet2021 模型我们还评估了 EarthNet2021 挑战排行榜上的前三名模型 ² ，使用其训练权重：一个常规的 ConvLSTM [18]，一个称为 SGED-ConvLSTM [35]的编码-处理-解码 ConvLSTM，以及 Earthformer [23]，一个使用立方体注意力的 Transformer 模型。

Additionally, we train and fine-tune both the ConvLSTM and Earthformer on GreenEarthNet. For the ConvLSTM, we follow the original Shi et al. [70] encoding-forecasting setup, which is different from ConvLSTM flavors studied on EarthNet2021 [18, 35] but has demonstrated improved performance on a similar problem in Africa [65]. We condition the Earthformer [23] through early fusion during historical steps and latent fusion during future steps.
此外，我们在 GreenEarthNet 上训练并微调了 ConvLSTM 和 Earthformer。对于 ConvLSTM，我们遵循 Shi 等人[70]的原始编码-预测设置，这与在 EarthNet2021 上研究的 ConvLSTM 变体[18, 35]不同，但在非洲类似问题上已显示出性能提升[65]。我们通过在历史步骤中进行早期融合，在未来的步骤中进行潜在融合来调整 Earthformer[23]。

Video prediction models We adapt two state-of-the-art video prediction models (PredRNN and SimVP) and two basic UNet-based approaches. The next-frame UNet [60] predicts autoregressively one step ahead. The next-cuboid UNet [64] directly predicts all time steps, taking the historical time steps stacked in the channel dimension. PredRNN [84, 85] is an autoregressive model with improved information flow. We generalize the action-conditioned PredRNN [85] by using feature-wise linear modulation [58] for weather conditioning on the inputs. SimVP [77] performs direct multi-step prediction through an encode-process-decode ConvNet, we adapt it with weather conditioning by feature-wise linear modulation [58] on the latent embeddings at each stage of the processor.
视频预测模型我们采用了两种最先进的视频预测模型（PredRNN 和 SimVP）以及两种基于 UNet 的基础方法。下一帧 UNet [60] 自回归地预测下一步。下一立方体 UNet [64] 直接预测所有时间步，采用在通道维度上堆叠的历史时间步。PredRNN [84, 85] 是一种改进信息流的自回归模型。我们通过在输入上使用特征线性调制 [58] 进行天气条件化，推广了动作条件化的 PredRNN [85]。SimVP [77] 通过编码-处理-解码 ConvNet 进行直接多步预测，我们通过在处理器的每个阶段的潜在嵌入上使用特征线性调制 [58] 进行天气条件化来适应它。

3.6 Implementation details
3.6 实现细节

We build all of our ConvNets with a PatchMerge-style architecture similar to the one in Earthformer [23]. For SimVP and PredRNN, such encoders and decoders are more powerful, but also slightly more parameter-intensive, than the variants used in the original papers. We use GroupNorm [90] and LeakyReLU activation [92] for the ConvNets, and and ConvLSTMs. For the Contextformer, we use LayerNorm [5] and GELU activation [27]. For ConvNets, skip connections preserve high-fidelity content between encoders and decoders. Our framework is implemented in PyTorch, and models are trained on Nvidia A40 and A100 GPUs. We use the AdamW [45] optimizer and tune the learning rate and a few hyperparameters per model. More implementation details can be found in the supplementary materials.
我们构建的所有卷积网络（ConvNets）采用类似于 Earthformer [23]中的 PatchMerge 风格架构。对于 SimVP 和 PredRNN，此类编码器和解码器比原始论文中使用的变体更为强大，但参数略多。我们为 ConvNets 和 ConvLSTMs 采用 GroupNorm [90]和 LeakyReLU 激活函数 [92]。对于 Contextformer，我们使用 LayerNorm [5]和 GELU 激活函数 [27]。在 ConvNets 中，跳跃连接在编码器和解码器之间保留了高保真内容。我们的框架基于 PyTorch 实现，并在 Nvidia A40 和 A100 GPU 上训练模型。我们采用 AdamW [45]优化器，并针对每个模型调整学习率和少量超参数。更多实现细节可在补充材料中找到。

4 Experiments 4 实验

4.1 Baseline comparison 4.1 基线比较

We conduct experiments for predicting vegetation state across Europe in 2021 and 2022 at $20m$ resolution and compare the Contextformer against a wide range of baselines. The quantitative results are shown in table 2. For Contextformer, ConvLSTM, PredRNN and SimVP, we report the mean ( $\pm$ std. dev.) from three different random seeds. Earthformer has an order of magnitude more parameters, making training more expensive, which is why we only report one random seed. We find the Contextformer outperforms (or performs on par with) every baseline on all metrics. It achieves $R^{2}=0.62$ and $0.14$ RMSE on the full 100 days lead time, which is further improved to $0.08$ RMSE during the first 25 days lead time. The closest competitors are PredRNN and SimVP, with PredRNN having on par $R^{2}=0.62$ and SimVP on par $|\text{bias}|=0.09$ .
我们在 $20 m 20m20 italic_m$ 分辨率下对 2021 年和 2022 年欧洲植被状态预测进行实验，并将 Contextformer 与一系列基线方法进行比较。定量结果如表 2 所示。对于 Contextformer、ConvLSTM、PredRNN 和 SimVP，我们报告了来自三个不同随机种子的平均值（ $\pm$ 标准差）。Earthformer 的参数量级更大，使得训练成本更高，因此我们仅报告了一个随机种子的结果。我们发现 Contextformer 在所有指标上均优于（或与）所有基线方法。它在完整的 100 天提前期上实现了 $R^{2}=0.62italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.62$ 和 $0.14 0.140.14$ 的 RMSE，并在前 25 天提前期内进一步改善至 $0.08 0.080.08$ RMSE。最接近的竞争对手是 PredRNN 和 SimVP，其中 PredRNN 在 $R^{2}=0.62italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.62$ 上表现相当，SimVP 在 $|\text{bias}|=0.09| bias | = 0.09$ 上表现相当。

The Contextformer and the other video prediction baselines trained in this study are the first models to outperform the Climatology baseline: the ConvLSTM reaches $53.1\%$ outperformance score, while the Contextformer achieves $66.8\%$ (with consistent ranking across thresholds, see sec. E). For the top-3 models (PredRNN, SimVP and Contextformer) and all metrics, differences to the climatology are highly significant when tested for all pixels (with Wilcoxon signed-rank test, $\alpha=0.001$ ), but also for each land cover or for smaller subsets of $100$ minicubes. ConvLSTM and Earthformer have overall lower skill. They mostly excels at RMSE and $|\text{bias}|$ , where they can perform similar to other methods, yet have way lower performance for NSE and $R^{2}$ .
本研究中训练的 Contextformer 及其他视频预测基线模型首次超越了 Climatology 基线：ConvLSTM 达到 $53.1\%53.1 %$ 的超越分数，而 Contextformer 则实现了 $66.8\%66.8 %$ （在各阈值下排名一致，详见第 E 节）。对于前三名的模型（PredRNN、SimVP 和 Contextformer）以及所有指标，在测试所有像素时（采用 Wilcoxon 符号秩检验， $\alpha=0.001italic_α = 0.001$ ），与气候学的差异极为显著，同样适用于每种土地覆盖类型或更小的 $100100100$ 迷你立方体子集。总体而言，ConvLSTM 和 Earthformer 的技能较低。它们主要在 RMSE 和 $|\text{bias}|| bias |$ 方面表现出色，能与其他方法相媲美，但在 NSE 和 $R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT$ 方面的表现则远逊于其他方法。

The models trained on EarthNet2021 (ConvLSTM [18], SGED-ConvLSTM [35] and Earthformer [23]) perform poorly. None of the approaches consistently beats the Climatology, particularly the $R^{2}$ is much lower (from $0.49$ for Earthformer up to $0.53$ for SGED-ConvLSTM). Likely, this is a result of the focus on perceptual quality that was reflected in the EarthNet2021 metrics, as well as the overall lower data quality due to a faulty cloud mask.
在 EarthNet2021 上训练的模型（ConvLSTM [18]、SGED-ConvLSTM [35] 和 Earthformer [23]）表现不佳。没有任何方法能持续超越气候学基准，尤其是 $R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT$ 显著偏低（Earthformer 为 $0.49 0.490.49$ ，SGED-ConvLSTM 则高达 $0.53 0.530.53$ ）。这可能是因为 EarthNet2021 指标侧重于感知质量，以及由于云掩膜错误导致的整体数据质量较低所致。

Finally, other local time series baselines and non-ML baselines also underperform the Contextformer. The strongest pixelwise model is Prophet [78], with an outperformance score of $60.6\%$ , followed by the climatology. Note, all of these baselines have access to a lot more information than the deep learning-based models (6 years vs. 50 days). Hence, this model comparison gives an indication, that spatial context is useful for vegetation forecasting, but leveraging them is challenging, as temporal dynamics are more dominant. In addition, the local time series models are all very slow, compared to the deep learning solutions presented in this work, which perform predictions within seconds (see sec. F).
最后，其他本地时间序列基线和非机器学习基线也均逊色于 Contextformer。其中，表现最强的逐像素模型是 Prophet [78]，其超越分数为 $60.6\%60.6 %$ ，紧随其后的是气候学模型。值得注意的是，这些基线所掌握的信息量远超基于深度学习的模型（6 年对比 50 天）。因此，此模型对比表明，空间上下文对植被预测具有价值，但利用这些上下文颇具挑战，因为时间动态更为主导。此外，与本研究中提出的深度学习解决方案相比，本地时间序列模型均显得非常缓慢，后者能在数秒内完成预测（详见第 F 节）。

	Original 原文	Shuffled 洗牌
Model 模型	$R^{2}$ $\uparrow$	$R^{2}$ $\uparrow$	Diff $\uparrow$ 差异 $\uparrow↑$
Climatology 气候学	0.58	-	-
1x1 LSTM	0.53	0.53	0.00
Next-frame UNet 下一帧 UNet	0.51	0.48	-0.03
Next-cuboid UNet 下一长方体 UNet	0.56	0.43	-0.13
ConvLSTM	0.58	0.46	-0.12
PredRNN 预测递归神经网络	0.62	0.45	-0.17
SimVP	0.60	0.49	-0.11
Contextformer 上下文变换器	0.62	0.55	-0.07

Table 3: Model skill when spatial interactions are broken through shuffling.
表 3：通过打乱空间交互后的模型技能表现。

Qualitative results of the Contextformer model for one minicube are reported in fig. 3. The model clearly learns the complex dynamics of vegetation, with a strong seasonal evolution of the crop fields. It interpolates faithfully those pixels, which are masked in the target, and contains strong temporal consistency. However, as the lead time increases, predictions become less explicit, with a tendency towards oversmoothing.
Contextformer 模型在一分钟立方体上的定性结果如图 3 所示。该模型清晰地学习了植被的复杂动态，作物田地的季节性演变显著。它能忠实地插值目标中被掩码的像素，并展现出强烈的时间一致性。然而，随着提前期的增加，预测变得不那么明确，有过度平滑的趋势。

4.2 Weather guidance 4.2 天气指导

Our meteo-guided models benefit from the weather conditioning. Fig. 4 compares ConvLSTM, PredRNN, SimVP and Contextformer (blue) against a variant without weather conditioning (orange). For all metrics, using weather outperforms not using it. The ConvLSTM has the largest performance gain due to meteo-guidance, yet it is also the weakest model. This could possibly be due to the ConvLSTMs smaller receptive field and hence lower capacity at leveraging spatial context, which may to some degree compensate predictive capacity from weather.
我们的气象引导模型得益于天气条件的影响。图 4 将 ConvLSTM、PredRNN、SimVP 和 Contextformer（蓝色）与无天气条件的变体（橙色）进行了比较。在所有指标上，使用天气条件的表现均优于不使用。ConvLSTM 由于气象引导获得了最大的性能提升，但它也是最弱的模型。这可能是因为 ConvLSTM 的感受野较小，因此在利用空间上下文方面的能力较低，这在一定程度上可能弥补了来自天气的预测能力。

For PredRNN and SimVP, we conduct an extended ablation study on weather guidance (see supplementary material). Weather conditioning methods (concatenation, FiLM [58], and cross-attention [66]) have a minor impact on performance when applied appropriately: cross-attention is most useful with latent fusion, FiLM outperforms concatenation, and is suitable for early fusion.
对于 PredRNN 和 SimVP，我们在天气引导方面进行了扩展的消融研究（详见补充材料）。天气条件方法（拼接、FiLM [58] 和交叉注意力 [66]）在适当应用时对性能影响较小：交叉注意力在潜在融合中最有用，FiLM 优于拼接，并适用于早期融合。

4.3 The role of spatial interactions
4.3 空间交互的作用

Unlike video prediction, satellite images show minimal spatial movement. Field and forest boundaries remain mostly fixed, with the largest variations occurring within these edges over time. Hence, it is unclear whether spatio-temporal models, accounting for interactions, are suitable for modeling vegetation dynamics. However, at $20m$ resolution, lateral processes may occur, not captured by predictors. For example, grasslands near a river or on a north-facing slope may react differently to meteorological drought. Additionally, weather affects trees at the forest edge differently from those in the center.
与视频预测不同，卫星图像显示的空间运动极少。田野和森林边界大多保持固定，最大的变化发生在这些边缘随时间推移的过程中。因此，尚不清楚是否时空模型（考虑相互作用）适合于建模植被动态。然而，在 $20 m 20m20 italic_m$ 分辨率下，可能发生横向过程，而这些过程未被预测因子捕捉到。例如，靠近河流或位于北坡的草地可能对气象干旱有不同的反应。此外，天气对森林边缘的树木影响与对中心树木的影响不同。

We compare model performance with spatially shuffled input, i.e. explicitly breaking spatial interactions [63]. We shuffle across batch and space, to also destroy image statistics. We evaluate Contextformer, ConvLSTM, PredRNN, and SimVP, skipping Earthformer due to high training cost. In addition we also study three baselines: a pixelwise (1x1) LSTM, the next-frame UNet and the next-cuboid UNet (see sec. 3.5). The pixelwise LSTM is a global timeseries model unable to capture spatial interactions. The next-frame UNet models spatial interactions, but does not consider temporal memory. All other models can leverage spatio-temporal dependencies, though the ConvLSTM only has a small local receptive field ( $\sim 100$ m around each pixel). The results are reported in tab. 3. As can be expected, the pixelwise LSTM can be trained with spatial shuffled pixels without performance loss. All other models, though, exhibit a drop in performance under pixel shuffling. For Contextformer, ConvLSTM, PredRNN and SimVP, $R^{2}$ drops by at least $0.07$ and $RMSE$ increases by at least $0.04$ .
我们将模型性能与空间打乱的输入进行比较，即明确打破空间交互[63]。我们在批次和空间维度上进行打乱，以同时破坏图像统计特性。我们评估了 Contextformer、ConvLSTM、PredRNN 和 SimVP，由于训练成本过高，跳过了 Earthformer。此外，我们还研究了三个基线模型：像素级（1x1）LSTM、下一帧 UNet 和下一立方体 UNet（详见第 3.5 节）。像素级 LSTM 是一个全局时间序列模型，无法捕捉空间交互。下一帧 UNet 模型考虑了空间交互，但不涉及时间记忆。所有其他模型都能利用时空依赖性，尽管 ConvLSTM 的感受野仅限于每个像素周围的小局部区域（约 0#米）。结果如表 3 所示。不出所料，像素级 LSTM 在训练时使用空间打乱的像素不会损失性能。然而，其他所有模型在像素打乱的情况下性能均有所下降。对于 Contextformer、ConvLSTM、PredRNN 和 SimVP， $R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT$ 至少下降了 $0.07 0.070.07$ ，而 $R M S E RMSEitalic_R italic_M italic_S italic_E$ 至少增加了 $0.04 0.040.04$ 。

Ablation 消融	$R^{2}$ $\uparrow$	RMSE $\downarrow$ 均方根误差 $\downarrow↓$	$\underset{\text{Climatology}}{\text{Outperf}}$ $\uparrow$
MLP vision encoder MLP 视觉编码器	0.58	0.15	58.3%
PVT encoder (frozen) PVT 编码器（冻结）	0.57	0.17	46.1%
PVT encoder PVT 编码器	0.62	0.15	62.3%
/w cloud mask token /w 云掩码令牌	0.61	0.16	61.8%
/w learned $\hat{V}^{0}$ /w 已学习 $\hat{V}^{0}over^ start_ARG italic_V end_ARG start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT$	0.62	0.16	60.6%
/w last pixel $\hat{V}^{0}$ /w 最后像素 $\hat{V}^{0}over^ start_ARG italic_V end_ARG start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT$	0.62	0.15	65.1%
Contextformer-6M	0.62	0.14	66.8%
Contextformer-16M	0.61	0.14	67.3%

Table 4: Model ablations. The Contextformer uses a PVT encoder, a cloud mask token and the last cloud free pixel as

\hat{V}_{0}

.
表 4：模型消融。Contextformer 使用 PVT 编码器、云掩码标记以及最后一个无云像素作为

\hat{V}_{0}over^ start_ARG italic_V end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT

。

4.4 Ablation Study of Contextformer components
4.4 Contextformer 组件的消融研究

We conduct experiments to show how each key component in our Contextformer affects predictive skill. Tab. 4 lists the results of our ablation studies. First, we find that continued training of a pre-trained PVT vision encoder (outperformance score $62.3$ %) outperforms both a MLP vision encoder and a frozen pre-trained PVT. Second, adding the delta-prediction scheme with an initial vegetation state estimate $\hat{V}^{0}$ constructed by the last historical cloud free NDVI pixel further improves the outperformance to $65.1\%$ – the version directly predicting NDVI is PVT encoder. Instead using a learned MLP decoder to estimate $\hat{V}^{0}$ is inferior. Third, using the cloud mask to drop out faulty tokens from the PVT encoder decreases model skill, if used alone, but if used on top of the delta-prediction scheme in the final model Contextformer-6M, it gives another boost to $66.8\%$ outperformance. Finally, scaling the model size of the Contextformer to 16M parameters is not helpful when trained on GreenEarthNet, indicating the need for an even larger dataset for further performance gains.
我们进行实验以展示 Contextformer 中每个关键组件如何影响预测技能。表 4 列出了我们的消融研究结果。首先，我们发现继续训练预训练的 PVT 视觉编码器（超越分数 $62.3 62.362.3$ %）优于 MLP 视觉编码器和冻结的预训练 PVT。其次，添加初始植被状态估计 $\hat{V}^{0}over^ start_ARG italic_V end_ARG start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT$ 的增量预测方案，该估计由最后一个历史无云 NDVI 像素构建，进一步将超越提升至 $65.1\%65.1 %$ ——直接预测 NDVI 的版本为 PVT 编码器。相反，使用学习到的 MLP 解码器来估计 $\hat{V}^{0}over^ start_ARG italic_V end_ARG start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT$ 则表现较差。第三，使用云掩码从 PVT 编码器中剔除故障令牌会降低模型技能，如果单独使用，但如果结合最终模型 Contextformer-6M 中的增量预测方案使用，则会再次提升 $66.8\%66.8 %$ 超越分数。最后，将 Contextformer 的模型规模扩展到 1600 万参数在 GreenEarthNet 上训练时并无帮助，表明需要更大的数据集以进一步提高性能。

	OOD-s		OOD-st
Model 模型	$R^{2}$ $\uparrow$	RMSE $\downarrow$	$R^{2}$ $\uparrow$	RMSE $\downarrow$ 均方根误差 $\downarrow↓$
Climatology 气候学	0.50	0.15	0.56	0.19
Contextformer 上下文变换器	0.54	0.15	0.58	0.14

Table 5: Model skill at spatial (OOD-s) and spatio-temporal (OOD-st) extrapolation.
表 5：模型在空间（OOD-s）和时空（OOD-st）外推中的技能表现。

4.5 Contextformer Strengths and Limitations
4.5 Contextformer 的优势与局限性

The OOD-t test set includes minicubes from four 3-month periods over two years. Fig. 5 shows Contextformer’s model skill. Yearly variations are significant. Growing season prediction was better in 2022 until September, then it switched, and 2021 performed better. First half of the season is usually better predicted than the second half, likely due to anthropogenic influences (harvest, mowing, cutting, and forest fires). These events are challenging to predict from weather covariates and may be interpreted as random noise.
OOD-t 测试集包含来自两年内四个 3 个月期间的微型立方体。图 5 展示了 Contextformer 的模型技能。年度变化显著。2022 年生长季节的预测在 9 月前表现较好，之后情况逆转，2021 年表现更佳。通常，季节前半段的预测优于后半段，这可能归因于人为影响（如收获、割草、砍伐及森林火灾）。这些事件难以从天气协变量中预测，并可能被视为随机噪声。

We assess the performance at spatio-(temporal) extrapolation of the Contextformer on the OOD-s and OOD-st test sets and report in tab. 5. The Contextformer can extrapolate in space and time. However, the margin to the climatology does shrink. Here, more training data might help: spatial extrapolation is theoretically not necessary for modeling vegetation dynamics (only temporal extrapolation is). Practically speaking, however, it does help to increase inference speed and enable potential applicability over large areas.
我们评估了 Contextformer 在 OOD-s 和 OOD-st 测试集上的时空外推性能，并在表 5 中报告了结果。Contextformer 能够在空间和时间上进行外推。然而，与气候学的差距有所缩小。这里，更多的训练数据可能会有所帮助：理论上，建模植被动态并不需要空间外推（仅需时间外推）。但实际上，空间外推有助于提高推理速度，并使模型在大面积范围内具有潜在适用性。

For these practical situations another aspect needs to be studied in future work: at inference time weather comes from uncertain weather forecasts. Here, we first wanted to learn the impact of weather on vegetation and thus took the historical meteo data which has the least error. We expect the weather forecast uncertainty (represented by realizations / scenarios) to mostly propagate, but not present a covariate shift larger than the inter-annual variability, which our models are robust to (OOD-t evaluation).
对于这些实际应用场景，未来的工作中还需要研究另一个方面：在推理阶段，天气数据来源于不确定的天气预报。在此，我们首先希望了解天气对植被的影响，因此采用了误差最小的历史气象数据。我们预期天气预报的不确定性（以实现/情景表示）主要会传播，但不会呈现出比年际变异性更大的协变量偏移，我们的模型对此具有鲁棒性（OOD-t 评估）。

5 Conclusion 5 结论

We proposed Contextformer, a multi-modal transformer model designed for fine-resolution vegetation greenness forecasting. It leverages spatial context through a Pyramid Vision Transformer backbone while maintaining parameter efficiency. The temporal component is a transformer that independently models the dynamics of local context patches over time, incorporating meteorological data. We additionally introduce the novel GreenEarthNet dataset tailored for self-supervised vegetation forecasting and compare Contextformer against an extensive set of baselines.
我们提出了 Contextformer，这是一种专为高分辨率植被绿度预测设计的多模态 Transformer 模型。它通过金字塔视觉 Transformer 骨干网络利用空间上下文，同时保持参数效率。时间组件是一个 Transformer，独立地对局部上下文块随时间的动态进行建模，并结合气象数据。此外，我们还引入了专为自监督植被预测定制的新型 GreenEarthNet 数据集，并将 Contextformer 与一系列广泛的基线模型进行了比较。

Contextformer outperforms the previous state-of-the-art, especially on nash-sutcliffe efficiency and surpasses even strong freshly trained video prediction baselines. To our knowledge, we are the first to consider a climatology baseline and outperforming it with models. Given the pronounced seasonality of vegetation dynamics, this suggests real-world applicability for our models, particularly the Contextformer, in crucial scenarios like humanitarian anticipatory action or carbon monitoring.
Contextformer 在 nash-sutcliffe 效率等指标上显著超越了先前的最先进水平，并优于新近训练的视频预测基线。据我们所知，我们是首个考虑气候学基线并使用模型超越它的研究。鉴于植被动态的季节性显著，这表明我们的模型，尤其是 Contextformer，在人道主义预警行动或碳监测等关键场景中具有实际应用价值。

Code. 代码。

We provide code and pre-trained weights to reproduce our experimental results under https://github.com/vitusbenson/greenearthnet [11].
我们提供了代码和预训练权重，以在 https://github.com/vitusbenson/greenearthnet [11]下重现我们的实验结果。
Author contributions. VB experiments, figures, writing. CR experiments, writing. CRM supervision, figures, writing. ZG experiments. LA figures. NC, JC, NL, MW writing. MR funding, supervision, writing. All authors contributed to discussing the results.
作者贡献。VB 实验、图表、撰写。CR 实验、撰写。CRM 监督、图表、撰写。ZG 实验。LA 图表。NC、JC、NL、MW 撰写。MR 资金、监督、撰写。所有作者均参与了结果讨论。
Acknowledgments. We are thankful for invaluable help, comments and discussions to Reda ElGhawi, Christian Reimers, Annu Panwar and Xingjian Shi. MW thanks the European Space Agency for funding the project DeepExtremes (AI4Science ITT). CRM and LA are thankful to the European Union’s DeepCube Horizon 2020 (research and innovation programme grant agreement No 101004188). NL and JC acknowledge funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101003469.
致谢。我们衷心感谢 Reda ElGhawi、Christian Reimers、Annu Panwar 和 Xingjian Shi 提供的宝贵帮助、意见和讨论。MW 感谢欧洲航天局资助 DeepExtremes 项目（AI4Science ITT）。CRM 和 LA 感谢欧盟的 DeepCube Horizon 2020（研究与创新计划，资助协议编号 101004188）。NL 和 JC 感谢欧盟 Horizon 2020 研究与创新计划（资助协议编号 101003469）的资助。

References 参考文献

Ahmad et al. [2023] 艾哈迈德等人 [2023] Rehaan Ahmad, Brian Yang, Guillermo Ettlin, Andrés Berger, and Pablo Rodríguez-Bocca. A machine-learning based ConvLSTM architecture for NDVI forecasting. International Transactions in Operational Research, 30(4):2025–2048, 2023.
Rehaan Ahmad, Brian Yang, Guillermo Ettlin, Andrés Berger, 和 Pablo Rodríguez-Bocca. 基于机器学习的 ConvLSTM 架构用于 NDVI 预测。国际运筹学交易，30(4):2025–2048, 2023.
Amatulli et al. [2020] 阿马图利等人 [2020] Giuseppe Amatulli, Daniel McInerney, Tushar Sethi, Peter Strobl, and Sami Domisch. Geomorpho90m, empirical evaluation and accuracy assessment of global high-resolution geomorphometric layers. Scientific Data, 7(1):162, 2020.
朱塞佩·阿马图利，丹尼尔·麦金纳尼，图沙尔·塞蒂，彼得·斯特罗布尔，萨米·多米施。地貌 90 米，全球高分辨率地貌层级的实证评估与精度评定。科学数据，7(1):162, 2020 年。
Audebert et al. [2018] 奥德伯特等人 [2018] Nicolas Audebert, Bertrand Le Saux, and Sébastien Lefèvre. Beyond rgb: Very high resolution urban remote sensing with multimodal deep networks. ISPRS journal of photogrammetry and remote sensing, 140:20–32, 2018.
尼古拉斯·奥德伯特，伯特兰·勒·索克斯，塞巴斯蒂安·勒费弗尔。超越 RGB：多模态深度网络在高分辨率城市遥感中的应用。国际摄影测量与遥感杂志，140 卷：20-32 页，2018 年。
Aybar et al. [2022] 阿伊巴尔等人 [2022] Cesar Aybar, Luis Ysuhuaylas, Jhomira Loja, Karen Gonzales, Fernando Herrera, Lesly Bautista, Roy Yali, Angie Flores, Lissette Diaz, Nicole Cuenca, Wendy Espinoza, Fernando Prudencio, Valeria Llactayo, David Montero, Martin Sudmanns, Dirk Tiede, Gonzalo Mateo-García, and Luis Gómez-Chova. CloudSEN12, a global dataset for semantic understanding of cloud and cloud shadow in Sentinel-2. Scientific Data, 9(1):782, 2022.
Cesar Aybar, Luis Ysuhuaylas, Jhomira Loja, Karen Gonzales, Fernando Herrera, Lesly Bautista, Roy Yali, Angie Flores, Lissette Diaz, Nicole Cuenca, Wendy Espinoza, Fernando Prudencio, Valeria Llactayo, David Montero, Martin Sudmanns, Dirk Tiede, Gonzalo Mateo-García, 和 Luis Gómez-Chova. CloudSEN12，一个用于 Sentinel-2 云及云影语义理解的全球数据集。科学数据，9(1):782，2022 年。
Ba et al. [2016] 巴等 [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv, 1607.06450, 2016.
Jimmy Lei Ba, Jamie Ryan Kiros, 和 Geoffrey E. Hinton. 层归一化. arXiv, 1607.06450, 2016.
Babaeizadeh et al. [2021]
巴贝扎德等人 [2021] Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. FitVid: Overfitting in Pixel-Level Video Prediction. arxiv, 2106.13195, 2021.
Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, 和 Dumitru Erhan. FitVid: 像素级视频预测中的过拟合现象. arxiv, 2106.13195, 2021.
Bandhauer et al. [2022] 班德霍尔等 [2022] Moritz Bandhauer, Francesco Isotta, Mónika Lakatos, Cristian Lussana, Line Båserud, Beatrix Izsák, Olivér Szentes, Ole Einar Tveito, and Christoph Frei. Evaluation of daily precipitation analyses in e-obs (v19. 0e) and era5 by comparison to regional high-resolution datasets in european regions. International Journal of Climatology, 42(2):727–747, 2022.
Moritz Bandhauer, Francesco Isotta, Mónika Lakatos, Cristian Lussana, Line Båserud, Beatrix Izsák, Olivér Szentes, Ole Einar Tveito, 和 Christoph Frei. 通过与欧洲区域高分辨率数据集的比较，评估 e-obs（v19.0e）和 era5 中的日降水分析。国际气候学杂志，42(2):727–747，2022 年。
Bao et al. [2022] 包等人 [2022] Shanning Bao, Andreas Ibrom, Georg Wohlfahrt, Sujan Koirala, Mirco Migliavacca, Qian Zhang, and Nuno Carvalhais. Narrow but robust advantages in two-big-leaf light use efficiency models over big-leaf light use efficiency models at ecosystem level. Agricultural and Forest Meteorology, 326:109185, 2022.
Shanning Bao, Andreas Ibrom, Georg Wohlfahrt, Sujan Koirala, Mirco Migliavacca, Qian Zhang, 和 Nuno Carvalhais. 双大叶光能利用效率模型在生态系统层面相对于大叶光能利用效率模型具有狭窄但稳健的优势。农业与森林气象学, 326:109185, 2022.
Barrett et al. [2020] 巴雷特等人 [2020] Adam B. Barrett, Steven Duivenvoorden, Edward E. Salakpi, James M. Muthoka, John Mwangi, Seb Oliver, and Pedram Rowhani. Forecasting vegetation condition for drought early warning systems in pastoral communities in Kenya. Remote Sensing of Environment, 248:111886, 2020.
亚当·B·巴雷特、史蒂文·杜伊文沃登、爱德华·E·萨拉基、詹姆斯·M·穆索卡、约翰·姆旺吉、塞布·奥利弗和佩德拉姆·罗哈尼。肯尼亚牧民社区干旱早期预警系统的植被状况预测。环境遥感，248:111886，2020 年。
Battaglia et al. [2018] 巴塔利亚等人 [2018] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. arxiv, 1806.01261, 2018.
彼得·W·巴塔利亚, 杰西卡·B·哈姆里克, 维克多·巴普斯特, 阿尔瓦罗·桑切斯-冈萨雷斯, 维尼修斯·赞巴尔迪, 马特乌什·马林诺夫斯基, 安德烈亚·塔切蒂, 大卫·拉波索, 亚当·桑托罗, 瑞安·福克纳, 卡格拉尔·古尔切赫雷, 弗朗西斯·宋, 安德鲁·鲍尔, 贾斯汀·吉尔默, 乔治·达尔, 阿什什·瓦斯瓦尼, 凯尔西·艾伦, 查尔斯·纳什, 维多利亚·兰斯顿, 克里斯·戴尔, 尼古拉斯·希斯, 达恩·维尔斯特拉, 普什米特·科利, 马特·博特维尼克, 奥里奥尔·维尼亚尔斯, 于佳, 拉兹万·帕斯卡努. 关系归纳偏置、深度学习与图网络. arxiv, 1806.01261, 2018.
Benson [2024] 本森 [2024] Vitus Benson. Code and pre-trained model weights for benson et. al., CVPR (2024) - multi-modal learning for geospatial vegetation forecasting. Zenodo, 10.5281/zenodo.10793870, 2024.
维图斯·本森。本森等人，CVPR（2024）——用于地理空间植被预测的多模态学习的代码和预训练模型权重。Zenodo，10.5281/zenodo.10793870，2024 年。
Bi et al. [2023] Bi 等人 [2023] Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Accurate medium-range global weather forecasting with 3d neural networks. Nature, 619(7970):533–538, 2023.
开封璧, 谢凌曦, 张恒恒, 陈鑫, 顾晓涛, 田奇. 基于三维神经网络的精确中长期全球天气预报. 自然, 619(7970):533–538, 2023.
Cong et al. [2022] 丛等 [2022] Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems, 35:197–211, 2022.
丛业珍, Samar Khanna, 孟晨琳, Patrick Liu, Erik Rozi, 何宇彤, Marshall Burke, David Lobell, 和 Stefano Ermon. Satmae: 时空与多光谱卫星影像的预训练变换器. 神经信息处理系统进展, 35:197–211, 2022.
Cornes et al. [2018] 科恩斯等人 [2018] Richard C. Cornes, Gerard van der Schrier, Else J. M. van den Besselaar, and Philip D. Jones. An Ensemble Version of the E-OBS Temperature and Precipitation Data Sets. Journal of Geophysical Research: Atmospheres, 123(17):9391–9409, 2018.
理查德·C·科恩斯，杰拉德·范德施里尔，埃尔斯·J·M·范登贝塞拉尔，以及菲利普·D·琼斯。E-OBS 温度和降水数据集的集合版本。《地球物理研究杂志：大气》，123(17):9391–9409，2018 年。
Crippen et al. [2016] 克里彭等人 [2016] R. Crippen, S. Buckley, P. Agram, E. Belz, E. Gurrola, S. Hensley, M. Kobrick, M. Lavalle, J. Martin, M. Neumann, Q. Nguyen, P. Rosen, J. Shimada, M. Simard, and W. Tung. NASADEM global elevation model: Methods and progress. In The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pages 125–128. Copernicus GmbH, 2016.
R. Crippen, S. Buckley, P. Agram, E. Belz, E. Gurrola, S. Hensley, M. Kobrick, M. Lavalle, J. Martin, M. Neumann, Q. Nguyen, P. Rosen, J. Shimada, M. Simard, 和 W. Tung. NASADEM 全球高程模型：方法与进展。发表于《国际摄影测量、遥感与空间信息科学档案》，第 125–128 页。Copernicus GmbH, 2016 年。
Cui et al. [2020] 崔等人 [2020] Changlu Cui, Wen Zhang, ZhiMing Hong, and LingKui Meng. Forecasting ndvi in multiple complex areas using neural network techniques combined feature engineering. International Journal of Digital Earth, 13(12):1733–1749, 2020.
崔昌禄, 张文, 洪志明, 孟令奎. 结合特征工程的神经网络技术在多复杂区域 NDVI 预测中的应用. 国际数字地球学报, 13(12):1733–1749, 2020.
Dalla Mura et al. [2015] 达拉·穆拉等人 [2015] Mauro Dalla Mura, Saurabh Prasad, Fabio Pacifici, Paulo Gamba, Jocelyn Chanussot, and Jón Atli Benediktsson. Challenges and opportunities of multimodality and data fusion in remote sensing. Proceedings of the IEEE, 103(9):1585–1601, 2015.
Mauro Dalla Mura, Saurabh Prasad, Fabio Pacifici, Paulo Gamba, Jocelyn Chanussot, 和 Jón Atli Benediktsson. 遥感中的多模态与数据融合：挑战与机遇. IEEE 会刊, 103(9):1585–1601, 2015.
Diaconu et al. [2022] Diaconu 等人 [2022] Codruț-Andrei Diaconu, Sudipan Saha, Stephan Günnemann, and Xiao Xiang Zhu. Understanding the Role of Weather Data for Earth Surface Forecasting Using a ConvLSTM-Based Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1362–1371, 2022.
Codruț-Andrei Diaconu, Sudipan Saha, Stephan Günnemann, 和 Xiao Xiang Zhu. 利用基于 ConvLSTM 的模型理解天气数据在地球表面预测中的作用。在 IEEE/CVF 计算机视觉与模式识别会议论文集，第 1362-1371 页，2022 年。
Domnich et al. [2021] 多姆尼奇等人 [2021] Marharyta Domnich, Indrek Sünter, Heido Trofimov, Olga Wold, Fariha Harun, Anton Kostiukhin, Mihkel Järveoja, Mihkel Veske, Tanel Tamm, Kaupo Voormansik, Aire Olesk, Valentina Boccia, Nicolas Longepe, and Enrico Giuseppe Cadau. KappaMask: AI-Based Cloudmask Processor for Sentinel-2. Remote Sensing, 13(20):4100, 2021.
Marharyta Domnich, Indrek Sünter, Heido Trofimov, Olga Wold, Fariha Harun, Anton Kostiukhin, Mihkel Järveoja, Mihkel Veske, Tanel Tamm, Kaupo Voormansik, Aire Olesk, Valentina Boccia, Nicolas Longepe, 和 Enrico Giuseppe Cadau。KappaMask：基于 AI 的 Sentinel-2 云掩膜处理器。遥感，13(20):4100，2021 年。
Dosovitskiy et al. [2020]
多索维茨基等人 [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2020.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, 和 Neil Houlsby. 一张图像值 16x16 个词：用于大规模图像识别的 Transformer。在国际学习表征会议上，2020 年。
ESA [2021] ESA. Copernicus DEM - Global and European Digital Elevation Model (COP-DEM). 2021.
欧空局. 哥白尼 DEM - 全球与欧洲数字高程模型（COP-DEM）. 2021 年.
Ferchichi et al. [2022] 费尔奇奇等人 [2022] Aya Ferchichi, Ali Ben Abbes, Vincent Barra, and Imed Riadh Farah. Forecasting vegetation indices from spatio-temporal remotely sensed data using deep learning-based approaches: A systematic literature review. Ecological Informatics, page 101552, 2022.
Aya Ferchichi, Ali Ben Abbes, Vincent Barra, 和 Imed Riadh Farah. 基于深度学习方法的时空遥感数据植被指数预测：系统性文献综述。生态信息学，第 101552 页，2022 年。
Gao et al. [2022a] 高等人 [2022a] Zhihan Gao, Xingjian Shi, Hao Wang, Yi Zhu, Bernie Wang, Mu Li, and Dit-Yan Yeung. Earthformer: Exploring Space-Time Transformers for Earth System Forecasting. In Advances in Neural Information Processing Systems, 2022a.
高志涵, 施兴建, 王浩, 朱毅, 王伯尼, 李沐, 叶迪-严. Earthformer: 探索地球系统预测的空间-时间变换器. 在神经信息处理系统进展中, 2022a.
Gao et al. [2022b] 高等人 [2022b] Zhangyang Gao, Cheng Tan, Lirong Wu, and Stan Z. Li. SimVP: Simpler Yet Better Video Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3170–3180, 2022b.
张阳高, 谭成, 吴立荣, 李子敬. SimVP: 更简单却更优的视频预测. 在 IEEE/CVF 计算机视觉与模式识别会议论文集, 页码 3170–3180, 2022b.
Gupta et al. [2023] 古普塔等人 [2023] Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, and Li Fei-Fei. Maskvit: Masked visual pre-training for video prediction. In The Eleventh International Conference on Learning Representations, 2023.
Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, 和 Li Fei-Fei. Maskvit: 用于视频预测的掩码视觉预训练。在第十一届国际学习表征会议上，2023 年。
He et al. [2022] He 等人 [2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
何恺明, 陈鑫磊, 谢思宁, 李彦豪, Piotr Dollár, 和 Ross Girshick. 掩码自编码器是可扩展的视觉学习器. 在 IEEE/CVF 计算机视觉与模式识别会议论文集, 第 16000–16009 页, 2022 年.
Hendrycks and Gimpel [2023]
亨德里克斯和金佩尔 [2023] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv, 1606.08415, 2023.
Dan Hendrycks 和 Kevin Gimpel. 高斯误差线性单元 (GELU). arXiv, 1606.08415, 2023.
Herzen et al. [2022] 赫尔岑等人 [2022] Julien Herzen, Francesco Lässig, Samuele Giuliano Piazzetta, Thomas Neuer, Léo Tafti, Guillaume Raille, Tomas Van Pottelbergh, Marek Pasieka, Andrzej Skrodzki, Nicolas Huguenin, Maxime Dumonal, Jan Kościsz, Dennis Bader, Frédérick Gusset, Mounir Benheddi, Camila Williamson, Michal Kosinski, Matej Petrik, and Gaël Grosch. Darts: User-Friendly Modern Machine Learning for Time Series. Journal of Machine Learning Research, 23(124):1–6, 2022.
Julien Herzen, Francesco Lässig, Samuele Giuliano Piazzetta, Thomas Neuer, Léo Tafti, Guillaume Raille, Tomas Van Pottelbergh, Marek Pasieka, Andrzej Skrodzki, Nicolas Huguenin, Maxime Dumonal, Jan Kościsz, Dennis Bader, Frédérick Gusset, Mounir Benheddi, Camila Williamson, Michal Kosinski, Matej Petrik, 和 Gaël Grosch. Darts：面向用户的现代机器学习时间序列工具。机器学习研究杂志，23(124):1-6, 2022.
Hochreiter and Schmidhuber [1997]
Hochreiter 和 Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
Sepp Hochreiter 和 Jürgen Schmidhuber. 长短期记忆. 神经计算, 9(8):1735–1780, 1997.
ICOS RI [2022] ICOS RI. Ecosystem final quality (l2) product in etc-archive format - release 2022-1. station de-gri. 2022.
ICOS RI. 生态系统最终质量（l2）产品，采用 etc-archive 格式 - 2022-1 版本。站点去极化。2022 年。
Jaegle et al. [2021] Jaegle 等人[2021] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–4664. PMLR, 2021.
安德鲁·杰格勒, 费利克斯·吉梅诺, 安迪·布洛克, 奥里奥尔·维尼亚尔斯, 安德鲁·齐瑟曼, 和若昂·卡雷拉. 感知器: 通过迭代注意力实现通用感知. 在机器学习国际会议中, 第 4651–4664 页. PMLR, 2021 年.
Ji and Peters [2004] 季和彼得斯 [2004] Lei Ji and A.J. Peters. Forecasting vegetation greenness with satellite and climate data. IEEE Geoscience and Remote Sensing Letters, 1(1):3–6, 2004.
雷吉与 A.J. 彼得斯。利用卫星与气候数据预测植被绿度。IEEE 地球科学与遥感快报，1(1):3–6, 2004 年。
Kang et al. [2016] 康等 [2016] Lingjun Kang, Liping Di, Meixia Deng, Eugene Yu, and Yang Xu. Forecasting vegetation index based on vegetation-meteorological factor interactions with artificial neural network. In 2016 Fifth International Conference on Agro-Geoinformatics (Agro-Geoinformatics), pages 1–6. IEEE, 2016.
康凌俊, 邸立平, 邓美霞, Eugene Yu, 徐洋. 基于植被-气象因子交互作用的人工神经网络植被指数预测. 在 2016 年第五届国际农业地理信息学会议(Agro-Geoinformatics)上, 第 1-6 页. IEEE, 2016.
Ke et al. [2017] 柯等人 [2017] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
柯国霖, 孟奇, Thomas Finley, 王泰峰, 陈伟, 马维东, 叶启威, 刘铁岩. LightGBM: 一种高效的梯度提升决策树. 在神经信息处理系统进展中. Curran Associates, Inc., 2017.
Kladny et al. [2024] 克拉德尼等人 [2024] Klaus-Rudolf Kladny, Marco Milanta, Oto Mraz, Koen Hufkens, and Benjamin D Stocker. Enhanced prediction of vegetation responses to extreme drought using deep learning and earth observation data. Ecological Informatics, 80:102474, 2024.
克劳斯-鲁道夫·克拉德尼，马可·米兰特，奥托·姆拉兹，科恩·赫夫肯斯，本杰明·D·斯托克。利用深度学习和地球观测数据增强对极端干旱下植被响应的预测。生态信息学，80:102474，2024 年。
Kondylatos et al. [2022] Kondylatos 等人 [2022] Spyros Kondylatos, Ioannis Prapas, Michele Ronco, Ioannis Papoutsis, Gustau Camps-Valls, María Piles, Miguel-Ángel Fernández-Torres, and Nuno Carvalhais. Wildfire Danger Prediction and Understanding With Deep Learning. Geophysical Research Letters, 49(17):e2022GL099368, 2022.
Spyros Kondylatos, Ioannis Prapas, Michele Ronco, Ioannis Papoutsis, Gustau Camps-Valls, María Piles, Miguel-Ángel Fernández-Torres, 和 Nuno Carvalhais. 利用深度学习进行野火危险预测与理解. 地球物理研究通讯, 49(17):e2022GL099368, 2022.
Kraft et al. [2019] 克拉夫特等人 [2019] Basil Kraft, Martin Jung, Marco Körner, Christian Requena Mesa, José Cortés, and Markus Reichstein. Identifying Dynamic Memory Effects on Vegetation State Using Recurrent Neural Networks. Frontiers in Big Data, 2, 2019.
巴兹尔·克拉夫特, 马丁·荣格, 马可·科尔纳, 克里斯蒂安·雷奎纳·梅萨, 何塞·科尔特斯, 以及马库斯·赖希施泰因. 利用循环神经网络识别植被状态的动态记忆效应. 大数据前沿, 2, 2019.
Lam et al. [2023] Lam 等人 [2023] Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, page eadi2336, 2023.
Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, 等. 学习熟练的中期全球天气预报. 科学, 页码 eadi2336, 2023.
Lee et al. [2018] 李等人 [2018] Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic Adversarial Video Prediction. arxiv, 1804.01523, 2018.
李亚历克斯·X., 张理查德, 弗雷德里克·埃伯特, 皮耶特·阿贝勒, 切尔西·芬恩, 谢尔盖·莱文. 随机对抗视频预测. arxiv, 1804.01523, 2018.
Lees et al. [2022a] 李等人 [2022a] Thomas Lees, Gabriel Tseng, Clement Atzberger, Steven Reece, and Simon Dadson. Deep Learning for Vegetation Health Forecasting: A Case Study in Kenya. Remote Sensing, 14(3):698, 2022a.
托马斯·利斯、加布里埃尔·曾、克莱门特·阿茨贝格、史蒂文·里斯和西蒙·达德森。深度学习在植被健康预测中的应用：肯尼亚案例研究。遥感，14(3):698，2022a。
Lees et al. [2022b] 李斯等人 [2022b] Thomas Lees, Gabriel Tseng, Clement Atzberger, Steven Reece, and Simon Dadson. Deep learning for vegetation health forecasting: a case study in kenya. Remote Sensing, 14(3):698, 2022b.
托马斯·利斯、加布里埃尔·曾、克莱门特·阿茨贝格、史蒂文·里斯和西蒙·达德森。深度学习在植被健康预测中的应用：肯尼亚案例研究。遥感，14(3):698，2022b。
Li et al. [2022] 李等人 [2022] Jiaxin Li, Danfeng Hong, Lianru Gao, Jing Yao, Ke Zheng, Bing Zhang, and Jocelyn Chanussot. Deep learning in multimodal remote sensing data fusion: A comprehensive review. International Journal of Applied Earth Observation and Geoinformation, 112:102926, 2022.
李家欣, 洪丹枫, 高连如, 姚静, 郑珂, 张冰, 和 Jocelyn Chanussot. 多模态遥感数据融合中的深度学习：全面综述. 国际应用地球观测与地理信息学杂志, 112:102926, 2022.
Lin et al. [2023] 林等人 [2023] Fudong Lin, Summer Crawford, Kaleb Guillot, Yihe Zhang, Yan Chen, Xu Yuan, Li Chen, Shelby Williams, Robert Minvielle, Xiangming Xiao, Drew Gholson, Nicolas Ashwell, Tri Setiyono, Brenda Tubana, Lu Peng, Magdy Bayoumi, and Nian-Feng Tzeng. MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5774–5784, 2023.
林福东, Summer Crawford, Kaleb Guillot, 张一鹤, 陈燕, 袁旭, 陈力, Shelby Williams, Robert Minvielle, 肖向明, Drew Gholson, Nicolas Ashwell, Tri Setiyono, Brenda Tubana, 彭璐, Magdy Bayoumi, 曾念峰. MMST-ViT: 基于多模态时空视觉 Transformer 的气候变化感知作物产量预测. 在 IEEE/CVF 国际计算机视觉会议论文集, 页码 5774–5784, 2023 年.
Loaiza et al. [2023] 洛伊萨等人 [2023] David Montero Loaiza, Guido Kraemer, Anca Anghelea, Cesar Luis Aybar Camacho, Gunnar Brandt, Gustau Camps-Valls, Felix Cremer, Ida Flik, Fabian Gans, Sarah Habershon, Chaonan Ji, Teja Kattenborn, Laura Martínez-Ferrer, Francesco Martinuzzi, Martin Reinhardt, Maximilian Söchting, Khalil Teber, and Miguel Mahecha. Data Cubes for Earth System Research: Challenges Ahead. EarthArXiv, 5649, 2023.
大卫·蒙特罗·洛艾萨, 圭多·克雷默, 安卡·安格莱亚, 塞萨尔·路易斯·阿亚巴·卡马乔, 冈纳·布兰特, 古斯塔乌·坎普斯-瓦尔斯, 费利克斯·克雷默, 伊达·弗利克, 法比安·甘斯, 莎拉·哈伯肖恩, 季超南, 泰贾·卡滕伯恩, 劳拉·马丁内斯-费雷尔, 弗朗切斯科·马蒂努齐, 马丁·莱茵哈特, 马克西米利安·索赫廷, 哈利勒·特贝, 米格尔·马赫查. 地球系统研究的数据立方体：前方的挑战. EarthArXiv, 5649, 2023.
Loshchilov and Hutter [2022]
洛什奇洛夫和胡特尔 [2022] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2022.
伊利亚·洛希洛夫和弗兰克·赫特。解耦权重衰减正则化。在国际学习表征会议上，2022 年。
Louis et al. [2016] 路易斯等人 [2016] Jérôme Louis, Vincent Debaecker, Bringfried Pflug, Magdalena Main-Knorn, Jakub Bieniarz, Uwe Mueller-Wilm, Enrico Cadau, and Ferran Gascon. SENTINEL-2 SEN2COR: L2A Processor for Users. In Proceedings Living Planet Symposium 2016, pages 1–8, Prague, Czech Republic, 2016. Spacebooks Online.
Jérôme Louis, Vincent Debaecker, Bringfried Pflug, Magdalena Main-Knorn, Jakub Bieniarz, Uwe Mueller-Wilm, Enrico Cadau, 和 Ferran Gascon. SENTINEL-2 SEN2COR：面向用户的 L2A 处理器。在 2016 年 Living Planet Symposium 会议论文集，第 1-8 页，捷克共和国布拉格，2016 年。Spacebooks Online。
Ma et al. [2022a] 马等人 [2022a] Xianping Ma, Xiaokang Zhang, and Man-On Pun. A crossmodal multiscale fusion network for semantic segmentation of remote sensing data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15:3463–3474, 2022a.
马先平，张小康，潘文安。用于遥感数据语义分割的跨模态多尺度融合网络。IEEE 应用地球观测与遥感选题期刊，15 卷：3463-3474 页，2022a。
Ma et al. [2022b] 马等人 [2022b] Yue Ma, Yingjie Hu, Glenn R. Moncrieff, Jasper A. Slingsby, Adam M. Wilson, Brian Maitner, and Ryan Zhenqi Zhou. Forecasting vegetation dynamics in an open ecosystem by integrating deep learning and environmental variables. International Journal of Applied Earth Observation and Geoinformation, 114:103060, 2022b.
岳马, 胡英杰, Glenn R. Moncrieff, Jasper A. Slingsby, Adam M. Wilson, Brian Maitner, 周振奇. 通过整合深度学习与环境变量预测开放生态系统中的植被动态. 国际应用地球观测与地理信息学杂志, 114:103060, 2022b.
Martinuzzi et al. [2023] 马蒂努齐等人 [2023] Francesco Martinuzzi, Miguel D Mahecha, Gustau Camps-Valls, David Montero, Tristan Williams, and Karin Mora. Learning extreme vegetation response to climate forcing: A comparison of recurrent neural network architectures. EGUsphere, 2023:1–32, 2023.
弗朗切斯科·马蒂努齐，米格尔·D·马赫查，古斯塔夫·坎普斯-瓦尔斯，大卫·蒙特罗，特里斯坦·威廉姆斯，以及卡琳·莫拉。学习极端植被对气候强迫的响应：循环神经网络架构的比较。EGUsphere, 2023:1–32, 2023.
Mendieta et al. [2023] 门迪塔等人 [2023] Matías Mendieta, Boran Han, Xingjian Shi, Yi Zhu, and Chen Chen. Towards Geospatial Foundation Models via Continual Pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16806–16816, 2023.
马蒂亚斯·门迪塔，博兰·韩，石兴建，朱毅，陈晨。通过持续预训练迈向地理空间基础模型。在 IEEE/CVF 国际计算机视觉会议论文集，第 16806-16816 页，2023 年。
Meraner et al. [2020] 梅拉纳等人 [2020] Andrea Meraner, Patrick Ebel, Xiao Xiang Zhu, and Michael Schmitt. Cloud removal in Sentinel-2 imagery using a deep residual neural network and SAR-optical data fusion. ISPRS Journal of Photogrammetry and Remote Sensing, 166:333–346, 2020.
Andrea Meraner, Patrick Ebel, Xiao Xiang Zhu, 和 Michael Schmitt. 利用深度残差神经网络与 SAR-光学数据融合去除 Sentinel-2 影像中的云层。ISPRS 摄影测量与遥感杂志, 166:333–346, 2020.
Meshesha et al. [2020] Meshesha 等人[2020] Derege Tsegaye Meshesha, Muhyadin Mohammed Ahmed, Dahir Yosuf Abdi, and Nigussie Haregeweyn. Prediction of grass biomass from satellite imagery in Somali regional state, eastern Ethiopia. Heliyon, 6(10), 2020.
Derege Tsegaye Meshesha, Muhyadin Mohammed Ahmed, Dahir Yosuf Abdi, 和 Nigussie Haregeweyn. 利用卫星影像预测埃塞俄比亚东部索马里州草本生物量。Heliyon, 6(10), 2020.
Nash et al. [2022] 纳什等人 [2022] Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, and Peter W. Battaglia. Transframer: Arbitrary frame prediction with generative models. Trans. Mach. Learn. Res., 2023, 2022.
Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, 和 Peter W. Battaglia. Transframer: 使用生成模型进行任意帧预测。机器学习研究，2023，2022。
Nash and Sutcliffe [1970]
纳什与萨克利夫 [1970] J. E. Nash and J. V. Sutcliffe. River flow forecasting through conceptual models part I — A discussion of principles. Journal of Hydrology, 10(3):282–290, 1970.
J. E. 纳什和 J. V. 萨奇克利夫。通过概念模型进行河流流量预报第一部分——原理探讨。《水文学杂志》，10(3):282–290，1970 年。
Nguyen et al. [2023] 阮氏等人 [2023] Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, and Aditya Grover. Climax: A foundation model for weather and climate. In 1st Workshop on the Synergy of Scientific and Machine Learning Modeling @ ICML2023, 2023.
Tung Nguyen, Johannes Brandstetter, Ashish Kapoor, Jayesh K Gupta, 和 Aditya Grover. Climax: 天气与气候的基础模型。在 2023 年 ICML2023 上首届科学计算与机器学习建模协同研讨会，2023 年。
Pabon-Moreno et al. [2022]
帕邦-莫雷诺等人 [2022] Daniel E. Pabon-Moreno, Mirco Migliavacca, Markus Reichstein, and Miguel D. Mahecha. On the potential of Sentinel-2 for estimating Gross Primary Production. IEEE Transactions on Geoscience and Remote Sensing, pages 1–1, 2022.
丹尼尔·E·帕邦-莫雷诺, 米尔科·米利亚瓦卡, 马库斯·赖希斯坦, 米格尔·D·马埃查. 关于哨兵-2 在估算总初级生产力方面的潜力. IEEE 地球科学与遥感汇刊, 页码 1–1, 2022 年.
Pathak et al. [2022] 帕塔克等人 [2022] Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, and Animashree Anandkumar. FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators. AI for Earth and Space Science, workshop at ICLR 2022, 2202.11214, 2022.
Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, 和 Animashree Anandkumar. FourCastNet: 基于全球数据驱动的高分辨率天气模型，采用自适应傅里叶神经算子。AI for Earth and Space Science, ICLR 2022 研讨会, 2202.11214, 2022.
Perez et al. [2018] 佩雷斯等人 [2018] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual Reasoning with a General Conditioning Layer. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.
伊桑·佩雷斯, 弗洛里安·斯特鲁布, 哈姆·德弗里斯, 文森特·杜穆兰, 和阿伦·库维尔. FiLM: 使用通用条件层进行视觉推理. 人工智能 AAAI 会议论文集, 32(1), 2018.
Qiu et al. [2019] 邱等人 [2019] Shi Qiu, Zhe Zhu, and Binbin He. Fmask 4.0: Improved cloud and cloud shadow detection in Landsats 4–8 and Sentinel-2 imagery. Remote Sensing of Environment, 231:111205, 2019.
施秋, 朱哲, 何彬彬. Fmask 4.0: 改进的 Landsat 4-8 和 Sentinel-2 影像云及云影检测. 环境遥感, 231:111205, 2019.
Rasp et al. [2020] 拉斯普等人 [2020] Stephan Rasp, Peter D. Dueben, Sebastian Scher, Jonathan A. Weyn, Soukayna Mouatadid, and Nils Thuerey. WeatherBench: A Benchmark Data Set for Data-Driven Weather Forecasting. Journal of Advances in Modeling Earth Systems, 12(11):e2020MS002203, 2020.
Stephan Rasp, Peter D. Dueben, Sebastian Scher, Jonathan A. Weyn, Soukayna Mouatadid, 和 Nils Thuerey. WeatherBench: 用于数据驱动天气预报的基准数据集. 地球系统建模进展杂志, 12(11):e2020MS002203, 2020.
Ravuri et al. [2021] 拉武里等人 [2021] Suman Ravuri, Karel Lenc, Matthew Willson, Dmitry Kangin, Remi Lam, Piotr Mirowski, Megan Fitzsimons, Maria Athanassiadou, Sheleem Kashem, Sam Madge, Rachel Prudden, Amol Mandhane, Aidan Clark, Andrew Brock, Karen Simonyan, Raia Hadsell, Niall Robinson, Ellen Clancy, Alberto Arribas, and Shakir Mohamed. Skilful precipitation nowcasting using deep generative models of radar. Nature, 597(7878):672–677, 2021.
苏曼·拉武里, 卡雷尔·伦茨, 马修·威尔逊, 德米特里·坎金, 雷米·拉姆, 皮奥特·米罗夫斯基, 梅根·菲茨西蒙斯, 玛丽亚·阿萨纳西阿杜, 谢利姆·卡希姆, 萨姆·马杰, 雷切尔·普鲁登, 阿莫尔·曼达内, 艾丹·克拉克, 安德鲁·布鲁克, 卡伦·西蒙扬, 拉亚·哈德塞尔, 尼尔·罗宾逊, 埃伦·克兰西, 阿尔贝托·阿里巴斯, 沙基尔·穆罕默德. 利用深度生成模型进行雷达降水临近预报. 自然, 597(7878):672–677, 2021.
Reed et al. [2023] 里德等人 [2023] Colorado J. Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023.
科罗拉多·J·里德, Ritwik 古普塔, 李舒凡, 莎拉·布罗克曼, 克里斯托弗·芬克, 布莱恩·克利普, 库尔特·克伊特泽, 萨尔瓦托雷·坎迪多, 马特·乌伊滕达尔, 特雷弗·达雷尔. 尺度-MAE: 一种用于多尺度地理空间表示学习的尺度感知掩码自编码器. 在 IEEE/CVF 国际计算机视觉会议论文集, 第 4088-4099 页, 2023 年.
Requena-Mesa et al. [2019]
雷克纳-梅萨等人 [2019] Christian Requena-Mesa, Markus Reichstein, Miguel Mahecha, Basil Kraft, and Joachim Denzler. Predicting Landscapes from Environmental Conditions Using Generative Networks. In Pattern Recognition, pages 203–217, Cham, 2019. Springer International Publishing.
克里斯蒂安·雷奎纳-梅萨, 马库斯·赖希施泰因, 米格尔·马赫查, 巴兹尔·克拉夫特, 约阿希姆·登茨勒. 利用生成网络从环境条件预测景观. 模式识别, 第 203-217 页, 查姆, 2019 年. 斯普林格国际出版.
Requena-Mesa et al. [2021]
雷克纳-梅萨等人 [2021] Christian Requena-Mesa, Vitus Benson, Markus Reichstein, Jakob Runge, and Joachim Denzler. EarthNet2021: A large-scale dataset and challenge for Earth surface forecasting as a guided video prediction task. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1132–1142, 2021.
克里斯蒂安·雷奎纳-梅萨, 维图斯·本森, 马库斯·赖希施泰因, 雅各布·伦格, 约阿希姆·登茨勒. EarthNet2021: 一个大规模数据集及挑战, 面向地球表面预测的引导视频预测任务. 在 2021 年 IEEE/CVF 计算机视觉与模式识别会议研讨会(CVPRW)上, 第 1132-1142 页, 2021 年.
Robin et al. [2022] 罗宾等人 [2022] Claire Robin, Christian Requena-Mesa, Vitus Benson, Lazaro Alonso, Jeran Poehls, Nuno Carvalhais, and Markus Reichstein. Learning to forecast vegetation greenness at fine resolution over Africa with ConvLSTMs. Climate Change AI workshop at NeurIPS 2022, 2022.
克莱尔·罗宾，克里斯蒂安·雷奎纳-梅萨，维图斯·本森，拉萨罗·阿尔瓦雷斯，杰兰·波尔斯，努诺·卡瓦利亚斯，以及马库斯·赖希施泰因。利用 ConvLSTMs 在非洲精细尺度上学习预测植被绿度。2022 年 NeurIPS 气候变化 AI 研讨会，2022 年。
Rombach et al. [2022] 隆巴赫等人 [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, 和 Björn Ommer. 基于潜在扩散模型的高分辨率图像合成. 在 IEEE/CVF 计算机视觉与模式识别会议论文集, 页码 10684–10695, 2022 年.
Rötzer and Chmielewski [2001]
罗策尔与希姆莱夫斯基 [2001] Thomas Rötzer and Frank-M. Chmielewski. Phenological maps of Europe. Climate Research, 18(3):249–257, 2001.
托马斯·罗策尔和弗兰克-M·施米莱夫斯基。欧洲物候图。气候研究，18(3):249–257，2001 年。
Sandler et al. [2018] 桑德勒等人 [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
马克·桑德勒, 安德鲁·霍华德, 孟龙·朱, 安德烈·日莫金诺夫, 梁杰·陈. MobileNetV2: 倒残差与线性瓶颈. 在 IEEE 计算机视觉与模式识别会议论文集, 第 4510–4520 页, 2018 年.
Shams Eddin and Gall [2023]
沙姆斯·丁和加尔 [2023] Mohamad Hakam Shams Eddin and Juergen Gall. Focal-TSMP: Deep learning for vegetation health prediction and agricultural drought assessment from a regional climate simulation. EGUsphere, pages 1–50, 2023.
Mohamad Hakam Shams Eddin 和 Juergen Gall。Focal-TSMP：基于深度学习的植被健康预测与农业干旱区域气候模拟评估。EGUsphere，第 1-50 页，2023 年。
Shi et al. [2015] 石等人 [2015] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun WOO. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2015.
施星建，陈周荣，王浩，Yeung Dit-Yan，黄伟健，WOO 王春. 卷积 LSTM 网络：一种用于降水临近预报的机器学习方法. 在神经信息处理系统进展中. Curran Associates, Inc., 2015.
Shi et al. [2017] 石等人 [2017] Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun WOO. Deep Learning for Precipitation Nowcasting: A Benchmark and A New Model. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
施兴建，高志涵，Leonard Lausen，王浩，Yeung Dit-Yan，黄伟健，WOO 王春. 深度学习用于降水临近预报：基准与新模型。发表于《神经信息处理系统进展》。Curran Associates, Inc., 2017 年。
Smith et al. [2023] 史密斯等人 [2023] Michael J. Smith, Luke Fleming, and James E. Geach. EarthPT: A foundation model for Earth Observation. arxiv, 2309.07207, 2023.
迈克尔·J·史密斯，卢克·弗莱明，詹姆斯·E·吉奇。EarthPT：地球观测的基础模型。arxiv，2309.07207，2023 年。
Steinhausen et al. [2018]
斯坦豪森等人 [2018] Max J Steinhausen, Paul D Wagner, Balaji Narasimhan, and Björn Waske. Combining sentinel-1 and sentinel-2 data for improved land use and land cover mapping of monsoon regions. International journal of applied earth observation and geoinformation, 73:595–604, 2018.
马克斯·J·斯坦豪森、保罗·D·瓦格纳、巴拉吉·纳拉西曼和比约恩·瓦斯克。结合哨兵 1 号和哨兵 2 号数据，提升季风区土地利用与土地覆盖制图精度。国际应用地球观测与地理信息学杂志，73 期：595–604 页，2018 年。
[74] Corinne Stucker, Vivien Sainte Fare Garnot, and Konrad Schindler. U-TILISE: A Sequence-to-Sequence Model for Cloud Removal in Optical Satellite Time Series. 61:1–16.
Corinne Stucker, Vivien Sainte Fare Garnot, 和 Konrad Schindler. U-TILISE: 一种用于光学卫星时间序列中云去除的序列到序列模型。61:1–16.
Sturm et al. [2022] 施图姆等人 [2022] Joan Sturm, Maria J. Santos, Bernhard Schmid, and Alexander Damm. Satellite data reveal differential responses of Swiss forests to unprecedented 2018 drought. Global Change Biology, 28(9):2956–2978, 2022.
琼·斯特姆, 玛丽亚·J·桑托斯, 伯恩哈德·施密德, 亚历山大·达姆. 卫星数据揭示瑞士森林对 2018 年空前干旱的差异性响应. 全球变化生物学, 28(9):2956–2978, 2022.
Tadono et al. [2016] 田野等人 [2016] T. Tadono, H. Nagai, H. Ishida, F. Oda, S. Naito, K. Minakawa, and H. Iwamoto. Generation of the 30 m mesh global digital surface model by ALOS Prism. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLI-B4:157–162, 2016.
T. Tadono, H. Nagai, H. Ishida, F. Oda, S. Naito, K. Minakawa, 和 H. Iwamoto. 利用 ALOS Prism 生成 30 米网格全球数字地表模型。ISPRS - 国际摄影测量、遥感与空间信息科学档案, XLI-B4:157–162, 2016.
Tan et al. [2023] 谭等人 [2023] Cheng Tan, Zhangyang Gao, and Stan Z. Li. SimVP: Towards Simple yet Powerful Spatiotemporal Predictive Learning. arxiv, 2211.12509, 2023.
程坦, 高张阳, 李子敬. SimVP: 迈向简单而强大的时空预测学习. arxiv, 2211.12509, 2023.
Taylor and Letham [2018] 泰勒和莱瑟姆 [2018] Sean J. Taylor and Benjamin Letham. Forecasting at Scale. The American Statistician, 72(1):37–45, 2018.
肖恩·J·泰勒和本杰明·莱瑟姆。规模化预测。《美国统计学家》，72(1):37–45, 2018 年。
Tseng et al. [2023] 曾等人 [2023] Gabriel Tseng, Ruben Cartuyvels, Ivan Zvonkov, Mirali Purohit, David Rolnick, and Hannah Kerner. Lightweight, Pre-trained Transformers for Remote Sensing Timeseries. arxiv, 2304.14065, 2023.
Gabriel Tseng, Ruben Cartuyvels, Ivan Zvonkov, Mirali Purohit, David Rolnick, 和 Hannah Kerner. 轻量级、预训练的遥感时间序列 Transformer 模型。arxiv, 2304.14065, 2023.
Vaswani et al. [2017] Vaswani 等人[2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, 和 Illia Polosukhin. 注意力就是你所需要的。在神经信息处理系统进展中。Curran Associates, Inc., 2017.
Wang et al. [2019] 王等人 [2019] Lei Wang, Xin Xu, Yue Yu, Rui Yang, Rong Gui, Zhaozhuo Xu, and Fangling Pu. SAR-to-Optical Image Translation Using Supervised Cycle-Consistent Adversarial Networks. IEEE Access, 7:129136–129149, 2019.
王磊, 徐鑫, 余越, 杨锐, 桂荣, 徐兆卓, 蒲方玲. 基于监督循环一致对抗网络的 SAR 到光学图像转换. IEEE Access, 7:129136–129149, 2019.
Wang et al. [2021] 王等人 [2021] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
王文海、谢恩泽、李翔、范登平、宋凯涛、梁丁、卢彤、罗平、邵凌。金字塔视觉 Transformer：无卷积密集预测的多功能骨干网络。在 IEEE/CVF 国际计算机视觉会议论文集，第 568-578 页，2021 年。
Wang et al. [2022] 王等人 [2022] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. PVT v2: Improved baselines with Pyramid Vision Transformer. Computational Visual Media, 8(3):415–424, 2022.
王文海、谢恩泽、李翔、范登平、宋凯涛、梁丁、卢彤、罗平、邵凌。PVT v2：基于金字塔视觉变换器的改进基线。计算视觉媒体，8(3):415–424，2022 年。
Wang et al. [2017] 王等人 [2017] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
王运波, 龙明盛, 王建民, 高志峰, 以及 Philip S. Yu. PredRNN: 基于时空 LSTM 的预测学习递归神经网络. 发表于神经信息处理系统进展. Curran Associates, Inc., 2017.
Wang et al. [2023] 王等人 [2023] Yunbo Wang, Haixu Wu, Jianjin Zhang, Zhifeng Gao, Jianmin Wang, Philip S. Yu, and Mingsheng Long. PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2208–2225, 2023.
王云波, 吴海旭, 张建金, 高志峰, 王建民, Philip S. 余, 龙明盛. PredRNN: 一种用于时空预测学习的循环神经网络. IEEE 模式分析与机器智能汇刊, 45(2):2208–2225, 2023.
Whyte et al. [2018] 怀特等人 [2018] Andrew Whyte, Konstantinos P Ferentinos, and George P Petropoulos. A new synergistic approach for monitoring wetlands using sentinels-1 and 2 data with object-based machine learning algorithms. Environmental Modelling & Software, 104:40–54, 2018.
安德鲁·怀特，康斯坦丁诺斯·P·费伦蒂诺斯，乔治·P·佩特罗普洛斯。一种利用哨兵-1 和 2 数据结合基于对象的机器学习算法监测湿地的新协同方法。环境建模与软件，104 卷：40-54 页，2018 年。
Wolanin et al. [2019] 沃兰宁等人 [2019] Aleksandra Wolanin, Gustau Camps-Valls, Luis Gómez-Chova, Gonzalo Mateo-García, Christiaan van der Tol, Yongguang Zhang, and Luis Guanter. Estimating crop primary productivity with Sentinel-2 and Landsat 8 using machine learning methods trained with radiative transfer simulations. Remote Sensing of Environment, 225:441–457, 2019.
亚历山德拉·沃兰宁，古斯塔乌·坎普斯-瓦尔斯，路易斯·戈麦斯-乔瓦，贡萨洛·马特奥-加西亚，克里斯蒂安·范德托尔，张永光，路易斯·瓜南特。利用辐射传输模拟训练的机器学习方法，通过哨兵 2 号和 Landsat 8 估算作物初级生产力。环境遥感，225:441–457，2019 年。
Wu et al. [2021a] 吴等人 [2021a] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Advances in Neural Information Processing Systems, pages 22419–22430. Curran Associates, Inc., 2021a.
吴海旭, 徐杰辉, 王建民, 龙明盛. Autoformer: 基于自相关分解的长序列预测 Transformer. 在神经信息处理系统进展中, 第 22419–22430 页. Curran Associates, Inc., 2021a.
Wu et al. [2021b] 吴等人 [2021b] Haixu Wu, Zhiyu Yao, Jianmin Wang, and Mingsheng Long. MotionRNN: A Flexible Model for Video Prediction With Spacetime-Varying Motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15435–15444, 2021b.
吴海旭, 姚志宇, 王建民, 龙明盛. MotionRNN: 时空变动的视频预测灵活模型. 在 IEEE/CVF 计算机视觉与模式识别会议论文集, 页码 15435–15444, 2021b.
Wu and He [2018] 吴和何 [2018] Yuxin Wu and Kaiming He. Group Normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
吴育昕和何恺明. 组归一化. 在欧洲计算机视觉会议(ECCV)论文集, 第 3-19 页, 2018 年.
Xiong et al. [2022] 熊等人 [2022] Zhitong Xiong, Fahong Zhang, Yi Wang, Yilei Shi, and Xiao Xiang Zhu. EarthNets: Empowering AI in Earth Observation. arxiv, 2210.04936, 2022.
智通熊, 张发红, 王毅, 石一磊, 朱晓祥. EarthNets: 赋能地球观测中的 AI. arxiv, 2210.04936, 2022.
Xu et al. [2015] 徐等人 [2015] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical Evaluation of Rectified Activations in Convolutional Network. arxiv, 1505.00853, 2015.
徐冰, 王乃岩, 陈天奇, 李沐. 卷积网络中整流激活函数的实证评估. arxiv, 1505.00853, 2015.
Yang et al. [2022] 杨等人 [2022] Xian Yang, Yifan Zhao, and Ranga Raju Vatsavai. Deep Residual Network with Multi-Image Attention for Imputing Under Clouds in Satellite Imagery. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 643–649, 2022.
咸阳、赵一凡和 Ranga Raju Vatsavai。基于多图像注意力的深度残差网络用于填补卫星图像中的云层空缺。在 2022 年第 26 届国际模式识别会议（ICPR）上，第 643-649 页，2022 年。
Yao et al. [2023] 姚等人 [2023] Fanglong Yao, Wanxuan Lu, Heming Yang, Liangyu Xu, Chenglong Liu, Leiyi Hu, Hongfeng Yu, Nayu Liu, Chubo Deng, Deke Tang, Changshuo Chen, Jiaqi Yu, Xian Sun, and Kun Fu. RingMo-Sense: Remote Sensing Foundation Model for Spatiotemporal Prediction via Spatiotemporal Evolution Disentangling. IEEE Transactions on Geoscience and Remote Sensing, pages 1–1, 2023.
姚方龙, 陆万轩, 杨鹤鸣, 徐良宇, 刘成龙, 胡磊毅, 俞洪峰, 刘纳宇, 邓楚波, 唐德科, 陈昌硕, 俞嘉琪, 孙贤, 傅昆. RingMo-Sense: 基于时空演化解耦的时空预测遥感基础模型. IEEE 地球科学与遥感汇刊, 页码 1–1, 2023.
Zanaga et al. [2021] 扎纳加等人 [2021] Daniele Zanaga, Ruben Van De Kerchove, Wanda De Keersmaecker, Niels Souverijns, Carsten Brockmann, Ralf Quast, Jan Wevers, Alex Grosu, Audrey Paccini, Sylvain Vergnaud, Oliver Cartus, Maurizio Santoro, Steffen Fritz, Ivelina Georgieva, Myroslava Lesiv, Sarah Carter, Martin Herold, Linlin Li, Tsendbazar, Nandin-Erdene, Fabrizio Ramoino, and Olivier Arino. ESA WorldCover 10 m 2020 v100. 2021.
丹尼尔·扎纳加, 鲁本·范德克霍夫, 万达·德·凯尔施马克, 尼尔斯·苏韦尔扬斯, 卡斯滕·布罗克曼, 拉尔夫·夸斯特, 扬·韦弗斯, 亚历克斯·格罗苏, 奥黛丽·帕西尼, 西尔万·韦尔尼亚杜, 奥利弗·卡图斯, 毛里齐奥·桑托罗, 斯特凡·弗里茨, 伊韦利娜·格奥尔基耶娃, 米罗斯拉瓦·莱西夫, 莎拉·卡特, 马丁·赫罗尔德, 李琳琳, 策登巴尔, 南迪恩-额尔德尼, 法布里齐奥·拉莫伊诺, 奥利维尔·阿里诺. ESA 世界覆盖 10 米 2020 版 v100. 2021 年.
Zeng et al. [2023] 曾等人 [2023] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are Transformers Effective for Time Series Forecasting? Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):11121–11128, 2023.
曾爱玲, 陈慕曦, 张磊, 徐强. 变压器对时间序列预测有效吗? 人工智能 AAAI 会议论文集, 37(9):11121–11128, 2023.
Zeng et al. [2022] 曾等人 [2022] Yelu Zeng, Dalei Hao, Alfredo Huete, Benjamin Dechant, Joe Berry, Jing M. Chen, Joanna Joiner, Christian Frankenberg, Ben Bond-Lamberty, Youngryel Ryu, Jingfeng Xiao, Ghassem R. Asrar, and Min Chen. Optical vegetation indices for monitoring terrestrial ecosystems globally. Nature Reviews Earth & Environment, pages 1–17, 2022.
耶鲁·曾，郝大力，阿尔弗雷多·胡埃特，本杰明·迪安特，乔·贝瑞，陈静，乔安娜·乔伊纳，克里斯蒂安·弗兰肯伯格，本·邦德-兰伯蒂，柳永烈，肖金峰，加塞姆·R·阿斯拉尔，陈敏。光学植被指数在全球范围内监测陆地生态系统。《自然评论：地球与环境》，第 1-17 页，2022 年。

Appendix A Model details 附录 A 模型详情

A.1 Cloud masking A.1 云掩膜

Baselines (Table 1) 基线（表 1）

The baselines reported in table 1 are taken from CloudSEN12 [4]. Sen2Cor [46] is the processing software from ESA used to produce the Scene Classification Layer (SCL) mask, which was also introduced in EarthNet2021 [64]. FMask [59] is a processing software originally designed for NASA Landsat imagery, but now repurposed to also work with Sentinel 2 imagery. It requires L1C top-of-atmosphere reflectance from all bands to be produced (EarthNet2021 only containes L2A bottom-of-atmosphere reflectance from four bands). KappaMask [19] is a cloud mask based on deep learning, in table 1 we reported scores from the L2A version, which uses all 13 L2A bands as input.
表 1 中报告的基线数据来自 CloudSEN12 [4]。Sen2Cor [46] 是 ESA 用于生成场景分类层（SCL）掩码的处理软件，该掩码也在 EarthNet2021 [64]中引入。FMask [59] 最初是为 NASA Landsat 影像设计的处理软件，现已被重新用于处理 Sentinel 2 影像。它需要所有波段的 L1C 大气层顶反射率数据（EarthNet2021 仅包含四个波段的 L2A 大气层底反射率数据）。KappaMask [19] 是一种基于深度学习的云掩码，在表 1 中我们报告了使用所有 13 个 L2A 波段作为输入的 L2A 版本的评分。

UNet Mobilenetv2 (Table 1)
UNet Mobilenetv2（表 1）

Our UNet with Mobilenetv2 encoder [68] was trained in two variants, one with RGB and near-infrared bands of L2A imagery (i.e. works with EarthNet2021) and one with all 13 bands of L2A imagery. We adopted the exact same implementation that was benchmarked in the CloudSEN12 paper [4], with the only difference being that in the paper, L1C imagery was used (which is often not useful in practical use-cases). In detail, this means we trained the UNet with Mobilenetv2 encoder using the Segmentation Models PyTorch Python library³³3https://segmentation-models-pytorch.readthedocs.io/en/latest/. We used a batch size of 32, random horizontal and vertical flipping, random 90 degree rotations, random mirroring, unweighted cross entropy loss, early stopping with a patience of 10 epochs, AdamW optimizer, learning rate of $1e^{-3}$ , and a learning rate schedule reducing the learning rate by a factor of 10 if validation loss did not decrease for 4 epochs.
我们采用 Mobilenetv2 编码器的 UNet 进行了两种变体的训练，一种使用 L2A 影像的 RGB 和近红外波段（即适用于 EarthNet2021），另一种则使用 L2A 影像的所有 13 个波段。我们采用了与 CloudSEN12 论文[4]中基准测试完全相同的实现，唯一的区别在于论文中使用了 L1C 影像（这在实际应用场景中往往不实用）。具体而言，这意味着我们使用 Segmentation Models PyTorch Python 库 ³ 训练了带有 Mobilenetv2 编码器的 UNet。我们采用了批量大小为 32，随机水平和垂直翻转，随机 90 度旋转，随机镜像，未加权的交叉熵损失，早停机制（耐心期为 10 个 epoch），AdamW 优化器，学习率为 $1e^{-3}1 italic_e start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT$ ，以及一个学习率调度策略，若验证损失连续 4 个 epoch 未下降，则将学习率降低 10 倍。

A.2 Vegetation modeling A.2 植被建模

Contextformer (Table 2,3,4,5 Figure 3,4,5)
Contextformer（表 2、3、4、5，图 3、4、5）

Our Contextformer is a combination of a spatial vision encoder: Pyramid Vision Transformer (PVT) v2 B0 [82, 83] with pre-trained ImageNet1k weights from the PyTorch Image Models library⁴⁴4https://github.com/rwightman/pytorch-image-models and a temporal transformer encoder. We use a patch size of $4\times 4$ px. We use an embedding size of $256$ and the temporal transformer has three self-attention layers with $8$ heads, each followed by an MLP with $1024$ hidden channels. We use LayerNorm [5] for normalization and GELU [27] as non-linear activation function. The model is trained with masked token modeling, randomly ( $p=0.5$ ) flipping between inference mask (future token masked) and random dropout mask ( $70\%$ of image patches masked, except for the first 3 time steps). We train for 100 epochs with a batch size of 32, a learning rate of $4e^{-5}$ and with AdamW optimizer on $2$ NVIDIA A100 GPUs. We train three models from the random seeds 42, 97 and 27.
我们的 Contextformer 结合了空间视觉编码器：金字塔视觉 Transformer（PVT）v2 B0 [82, 83]，其预训练的 ImageNet1k 权重来自 PyTorch 图像模型库 ⁴ ，以及一个时间 Transformer 编码器。我们采用 $4\times 44 × 4$ 像素的补丁大小。使用 $256256256$ 的嵌入尺寸，时间 Transformer 包含三个自注意力层，每层有 $888$ 个头，其后跟随一个具有 $102410241024$ 隐藏通道的多层感知机（MLP）。我们采用 LayerNorm [5]进行归一化，并以 GELU [27]作为非线性激活函数。模型通过掩码令牌建模进行训练，随机（ $p = 0.5 p=0.5italic_p = 0.5$ ）在推理掩码（未来令牌被掩码）和随机丢弃掩码（ $70\%70 %$ 的图像补丁被掩码，但前 3 个时间步除外）之间切换。我们以 32 的批量大小、 $4e^{-5}4 italic_e start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT$ 的学习率，使用 AdamW 优化器在 $222$ 个 NVIDIA A100 GPU 上训练 100 个周期。我们从随机种子 42、97 和 27 开始训练三个模型。

Local timeseries models (Table 2)
本地时间序列模型（表 2）

We train the local timeseries models (table 2) at each pixel. For a given pixel we extract the full timeseries of NDVI and weather variables at 5-daily resolution. All variables are linearly gapfilled and weather is aggregated with min, mean, max, and std to 5-daily. The whole timeseries before each target period is used to train a timeseries model, for the target period the model only receives weather. The Kalman Filter runs with default parameters from darts [28]. The LightGBM model gets lagged variables from the last 10 time steps and predicts a full 20 time step chunk at once. For Prophet we again use default parameters.
我们在每个像素点上训练本地的时间序列模型（表 2）。对于给定的像素，我们提取 NDVI 和天气变量的完整时间序列，分辨率为 5 天。所有变量均进行线性插值填补缺失值，天气数据则通过最小值、均值、最大值和标准差聚合到 5 天分辨率。每个目标期之前的整个时间序列用于训练时间序列模型，而在目标期内，模型仅接收天气数据。卡尔曼滤波器使用 darts 库[28]的默认参数运行。LightGBM 模型获取过去 10 个时间步长的滞后变量，并一次性预测未来 20 个时间步长的完整块。对于 Prophet 模型，我们同样使用默认参数。

EarthNet models (Table 2)
EarthNet 模型（表 2）

For running the leading models from EarthNet2021 we utilize the code from the respective github repositories: ConvLSTM [18]⁵⁵5https://github.com/dcodrut/weather2land, SGED-ConvLSTM [35]⁶⁶6https://github.com/rudolfwilliam/satellite_image_forecasting and Earthformer [23] ⁷⁷7https://github.com/amazon-science/earth-forecasting-transformer/tree/main/scripts/cuboid_transformer/earthnet_w_meso. We derive the NDVI from the predicted satellite bands red and near-infrared:
为了运行 EarthNet2021 中的领先模型，我们利用了各自 GitHub 仓库中的代码：ConvLSTM [18] ⁵ ，SGED-ConvLSTM [35] ⁶ 和 Earthformer [23] ⁷ 。我们从预测的卫星波段红色和近红外中推导出 NDVI：

\displaystyle NDVI=\frac{NIR-Red}{NIR+Red+1e^{-8}}

(3)

ConvLSTM (Table 2,3, Figure 4,5)
ConvLSTM（表 2,3，图 4,5）

Our ConvLSTM contains four ConvLSTM-cells [70] in total, two for processing context frames and two for processing target frames. Each has convolution kernels with bias, hidden dimension of 64 and kernel size of 3. We train for 100 epochs with a batch size of 32, a learning rate of $4e^{-5}$ and with AdamW optimizer. We train three models from the random seeds 42, 97 and 27.
我们的 ConvLSTM 总共包含四个 ConvLSTM 单元[70]，其中两个用于处理上下文帧，另外两个用于处理目标帧。每个单元均配备有偏置的卷积核，隐藏维度为 64，卷积核大小为 3。我们以批量大小 32、学习率 $4e^{-5}4 italic_e start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT$ 以及 AdamW 优化器进行 100 轮训练。我们基于随机种子 42、97 和 27 训练了三个模型。

PredRNN (Table 2,3, Figure 4)
PredRNN（表 2,3，图 4）

Our PredRNN contains two ST-ConvLSTM-cells [84] Each has convolution kernels with bias, hidden dimension of 64 and kernel size of 3 and residual connections. We use a PatchMerge encoder decoder with GroupNorm (16 groups), convolutions with kernel size of 3 and hidden dimension of 64, LeakyReLU activation and downsampling rate of 4x. We train for 100 epochs with a batch size of 32, a learning rate of $3e^{-4}$ and with AdamW optimizer. We use a spatio-temporal memory decoupling loss term with weight 0.1 and reverse exponential scheduling of true vs. predicted images (as in the PredRNN journal version [85]). We train three models from the random seeds 42, 97 and 27.
我们的 PredRNN 包含两个 ST-ConvLSTM 单元[84]，每个单元均配备有偏置的卷积核，隐藏维度为 64，核大小为 3，并带有残差连接。我们采用 PatchMerge 编码器解码器，结合 GroupNorm（16 组），卷积核大小为 3，隐藏维度为 64，LeakyReLU 激活函数，以及 4 倍的降采样率。训练过程持续 100 个周期，批量大小为 32，学习率为 $3e^{-4}3 italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT$ ，并使用 AdamW 优化器。我们引入了一个权重为 0.1 的时空记忆解耦损失项，并采用真实图像与预测图像的反向指数调度策略（如 PredRNN 期刊版本[85]中所述）。我们基于随机种子 42、97 和 27 训练了三个模型。

SimVP (Table 2,3, Figure 4)
SimVP（表 2、3，图 4）

Our SimVP has a PatchMerge encoder decoder with GroupNorm (16 groups), convolutions with kernel size of 3 and hidden dimension of 64, LeakyReLU activation and downsampling rate of 4x. The encoder processes all 10 context time steps at once (stacked along the channel dimension). The decoder processes 1 target time step at a time. The gated spatio-temporal attention processor [77] translates between both in the latent space, we use two layers and 64 hidden channels. We train for 100 epochs with a batch size of 64, a learning rate of $6e^{-4}$ and with AdamW optimizer. We train three models from the random seeds 42, 97 and 27.
我们的 SimVP 采用带有 GroupNorm（16 组）的 PatchMerge 编码器解码器，卷积核大小为 3，隐藏维度为 64，LeakyReLU 激活函数，以及 4 倍的降采样率。编码器一次性处理所有 10 个上下文时间步（沿通道维度堆叠）。解码器则逐个处理 1 个目标时间步。门控时空注意力处理器[77]在潜在空间中进行两者间的转换，我们使用两层和 64 个隐藏通道。我们以 64 的批量大小、 $6e^{-4}6 italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT$ 的学习率，使用 AdamW 优化器训练 100 个周期。我们从随机种子 42、97 和 27 中训练了三个模型。

Earthformer (Table 2) Earthformer（表 2）

Our Earthformer is a transformer combined with an initial PatchMerge encoder (and a final decoder) to reduce the dimensionality. The encoder and decoder use LeakyReLU activation, hidden size of 64 and 256 and downsample 2x. In between, the transformer processor has a UNet-type architecture, with cross-attention to merge context frame information with target frame embeddings. GeLU activation and LayerNorm, axial self-attention, 0.1 dropout and 4 attention heads are used. Weather information is regridded to match the spatial resolution of satellite imagery and used as input during context and target period. We train for 100 epochs with a batch size of 32, a maximum learning rate of $1e^{-4}$ , linear learning rate warm up, cosine learning rate shedule and with AdamW optimizer.
我们的 Earthformer 结合了初始 PatchMerge 编码器（及最终解码器）以降低维度。编码器与解码器采用 LeakyReLU 激活函数，隐藏层大小分别为 64 和 256，并进行 2 倍下采样。在两者之间，Transformer 处理器采用 UNet 型架构，通过交叉注意力机制融合上下文帧信息与目标帧嵌入。使用 GeLU 激活、LayerNorm、轴向自注意力、0.1 的 dropout 以及 4 个注意力头。天气信息被重新网格化以匹配卫星图像的空间分辨率，并在上下文和目标时段作为输入。我们训练 100 个 epoch，批量大小为 32，最大学习率为 $1e^{-4}1 italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT$ ，线性学习率预热，余弦学习率调度，并采用 AdamW 优化器。

1x1 LSTM (Table 4) 1x1 LSTM（表 4）

Our 1x1 LSTM is implemented as a ConvLSTM with kernel size of 1. We train for 100 epochs with a batch size of 32, a learning rate of $4e^{-5}$ and with AdamW optimizer.
我们的 1x1 LSTM 实现为核大小为 1 的 ConvLSTM。我们训练 100 个周期，批量大小为 32，学习率为 $4e^{-5}4 italic_e start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT$ ，并使用 AdamW 优化器。

Next-frame UNet (Table 4)
下一帧 UNet（表 4）

Our next-frame UNet has a depth of 5, latent weather conditioning with FiLM, a hidden size 128, kernel size 3, LeakyReLU activation, GroupNorm (16 groups), PatchMerge downsampling and nearest upsampling. We train for 100 epochs with a batch size of 64, a learning rate of $6e^{-4}$ and with AdamW optimizer.
我们的下一帧 UNet 深度为 5，采用 FiLM 进行潜在天气条件控制，隐藏层大小为 128，卷积核大小为 3，激活函数为 LeakyReLU，归一化方式为 GroupNorm（16 组），下采样使用 PatchMerge，上采样采用最近邻插值。我们训练 100 个 epoch，批量大小为 64，学习率为 $6e^{-4}6 italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT$ ，优化器选用 AdamW。

Next-cuboid UNet (Table 4)
下一长方体 UNet（表 4）

Our next-cuboid UNet has a depth of 5, latent weather conditioning with FiLM, a hidden size 256, kernel size 3, LeakyReLU activation, GroupNorm (16 groups), PatchMerge downsampling and nearest upsampling. We train for 100 epochs with a batch size of 64, a learning rate of $6e^{-4}$ and with AdamW optimizer.
我们的下一个长方体 UNet 深度为 5，采用 FiLM 进行潜在天气调节，隐藏层大小为 256，核大小为 3，LeakyReLU 激活函数，GroupNorm（16 组），PatchMerge 下采样和最近邻上采样。我们以 64 的批量大小、 $6e^{-4}6 italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT$ 的学习率和 AdamW 优化器训练 100 个周期。

Appendix B Weather ablations
附录 B 天气消融

B.1 Methods B.1 方法

Most of our baseline approaches have been originally proposed to handle only past covariates. Here, we condition forecasts on future weather. A-priori it is not known how to best achieve this weather conditioning. For PredRNN and SimVP, we compare three approaches, each fused at three different locations. The approaches operate pixelwise, taking features $x_{in}\in\mathbb{R}^{d}$ and conditioning input $c_{i}\in\mathbb{R}^{n}$ for weather variable $i$ . The conditioning layers $g(\cdot,\cdot;\phi)$ with parameters $\phi$ then operate as
我们的大多数基线方法最初仅设计用于处理历史协变量。在此，我们基于未来天气条件进行预测。事先并不清楚如何最佳实现这种天气条件化。对于 PredRNN 和 SimVP，我们比较了三种方法，每种方法在三个不同位置融合。这些方法逐像素操作，提取特征 $x_{in}\in\mathbb{R}^{d}italic_x start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT$ 并根据天气变量 $i iitalic_i$ 对输入 $c_{i}\in\mathbb{R}^{n}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT$ 进行条件化。随后，带有参数 $\phiitalic_ϕ$ 的条件化层 $g(\cdot,\cdot;\phi)italic_g ( ⋅ , ⋅ ; italic_ϕ )$ 开始运作。

\displaystyle x_{out}=g(x_{in},c;\phi)\in\mathbb{R}^{d}

(4)

We parameterize $g$ with neural networks.
我们用神经网络参数化 $g gitalic_g$ 。

CAT

First concatenates $x_{in}$ and a flattened $c$ along the channel dimension, and then performs a linear projection to obtain $x_{out}$ of same dimensionality as $x_{in}$ . In practice we implement this with a 1x1 Conv layer.
首先将 $x_{in}italic_x start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT$ 与展平后的 $c citalic_c$ 沿通道维度拼接，然后进行线性投影，以获得与 $x_{in}italic_x start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT$ 相同维度的 $x_{out}italic_x start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT$ 。在实际操作中，我们通过 1x1 卷积层来实现这一点。

FiLM

Feature-wise linear modulation [58] generalizes the concatenation layer before. It produces $x_{out}$ with linear modulation:
特征线性调制[58]推广了之前的连接层。它通过线性调制生成 $x_{out}italic_x start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT$ ：

\displaystyle x_{out}=x_{in}+\sigma(\gamma(c;\phi_{\gamma})\odot N(f(x_{in};% \phi_{f}))+\beta(c;\phi_{\beta}))

(5)

Here, $f$ is a linear layer, $\gamma$ and $\beta$ are MLPs, $N$ is a normalization layer and $\sigma$ is a pointwise non-linear activation function.
这里， $f fitalic_f$ 是一个线性层， $\gammaitalic_γ$ 和 $\betaitalic_β$ 是多层感知机（MLPs）， $N Nitalic_N$ 是一个归一化层，而 $\sigmaitalic_σ$ 是一个逐点非线性激活函数。

xAttn x 注意

Cross-attention is an operation commonly found in the Transformers architecture. In recent works on image generation with diffusion models it is used to condition the generative process on a text embedding [66]. Inspired from this, we propose a pixelwise conditioning layer based on multi-head cross-attention. The input $x_{in}$ is treated as a single token query $Q$ . Each weather variable $c_{i}$ is treated as individual tokens, from which we derive keys $K$ and values $V$ . The result is then just regular multi-head attention $MHA$ in a residual block:
交叉注意力是 Transformer 架构中常见的操作。在最近关于使用扩散模型进行图像生成的工作中，它被用于根据文本嵌入来调节生成过程[66]。受此启发，我们提出了一种基于多头交叉注意力的逐像素条件层。输入 $x_{in}italic_x start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT$ 被视为单个 token 查询 $Q Qitalic_Q$ 。每个天气变量 $c_{i}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT$ 被视为独立 token，从中我们推导出键 $K Kitalic_K$ 和值 $V Vitalic_V$ 。结果就是残差块中的常规多头注意力 $M H A MHAitalic_M italic_H italic_A$ ：

	$\displaystyle x_{out}$	$\displaystyle=x_{in}$		(6)
		$\displaystyle+f(N(MHA(Q(x_{in};\phi_{Q}),K(c;\phi_{K}),V(c;\phi_{V})));\phi_{f})$		(7)

Here, $f$ is either a linear projection or a MLP and $N$ is a normalization layer.
在这里， $f fitalic_f$ 可以是线性投影或 MLP，而 $N Nitalic_N$ 是一个归一化层。

Each of the three approaches we apply at three locations throughout the network:
我们在网络中的三个位置应用了三种方法：

Early fusion 早期融合

Just fusing all data modalities before passing it to a model. This Early CAT has been previously used for weather conditioning in satellite imagery forecasting
在将所有数据模态融合后直接传递给模型。这种早期 CAT 技术先前已用于卫星图像预测中的天气条件处理。

Latent fusion 潜在融合

In the encode-process-decode framework, encoders are meant to capture spatial, and not temporal, relationships. Hence, latent fusion conditions the encoded spatial inputs twice: right after leaving the encoder and before entering the decoder.
在编码-处理-解码框架中，编码器旨在捕捉空间而非时间关系。因此，潜在融合对编码后的空间输入进行了两次条件处理：一次是在离开编码器之后，另一次是在进入解码器之前。

All (fusion everywhere) 所有（无处不在的融合）

In addition, we compare against conditioning at every stage of the encoders, processors and decoders. All CAT has been applied to condition stochastic video predictions on random latent codes [39].
此外，我们在编码器、处理器和解码器的每个阶段都进行了条件对比。所有 CAT 均已应用于基于随机潜在代码对随机视频预测进行条件化处理[39]。

B.2 Results B.2 结果

Fig. 6 summarizes the findings by looking at the RMSE over the prediction horizon. For the first 50 days, most models are better than the climatology, afterwards, most are worse. If using early fusion, FiLM is the best conditioning method. For latent fusion and fusion everywhere (all), xAttn is a consistent choice, but FiLM may sometimes be better (and sometimes a lot worse). CAT in general should be avoided, which is consistent with the theoretical observation, that CAT is a special case of FiLM.
图 6 通过观察预测范围内的均方根误差（RMSE）总结了研究结果。在前 50 天内，大多数模型优于气候学基准，之后则大多表现较差。若采用早期融合，FiLM 是最佳的调节方法。对于潜在融合和全局融合（all），xAttn 是一个一致的选择，但 FiLM 有时可能更好（有时则差很多）。总体而言，应避免使用 CAT，这与理论观察一致，即 CAT 是 FiLM 的一个特例。

For SimVP, the best weather guiding method is latent fusion with FiLM. For PredRNN, the best method is early fusion with FiLM. This is likely due to the difference in treatment of the temporal axis. For SimVP, early fusion would merge all time steps, hence, latent fusion is a better choice. For PredRNN on the other hand, early fusion handles only a single timestep.
对于 SimVP，最佳的天气引导方法是与 FiLM 进行潜在融合。而对于 PredRNN，最佳方法则是与 FiLM 进行早期融合。这很可能是由于对时间轴处理方式的差异所致。对于 SimVP，早期融合会将所有时间步合并，因此，潜在融合是更好的选择。另一方面，对于 PredRNN，早期融合仅处理单个时间步。

Appendix C Contextformer Strengths and Limitations continued
附录 C Contextformer 优势与局限性（续）

We show spatial extrapolation skills for more models in table 6.
我们在表 6 中展示了更多模型的空间外推能力。

	OOD-s		OOD-st
Model 模型	$R^{2}$ $\uparrow$	RMSE $\downarrow$ 均方根误差 $\downarrow↓$	$R^{2}$ $\uparrow$	RMSE $\downarrow$ 均方根误差 $\downarrow↓$
Climatology 气候学	0.50	0.15	0.56	0.19
ConvLSTM	0.47	0.17	0.52	0.16
Earthformer 地球构造者	0.47	0.15	0.47	0.16
PredRNN 预测递归神经网络	0.54	0.15	0.58	0.15
SimVP	0.50	0.15	0.54	0.15
Contextformer 上下文形式化器	0.54	0.15	0.58	0.14

Table 6: Same as table 5, but extended. Model skill at spatial (OOD-s) and spatio-temporal (OOD-st) extrapolation.
表 6：与表 5 相同，但有所扩展。模型在空间（OOD-s）和时空（OOD-st）外推方面的技能。

Reassured by spatial extrapolation capabilities, we present a map of $R^{2}$ for the Contextformer in fig. 7a. Cropland regions on the Iberian peninsula and in northern France, as well as forests in the Balkans are regions with great applicability of the model. For the former two, this may be explained by many training samples in those regions, for the last, it cannot. Grasslands and forests in Poland and highly heterogenous regions (mountains, near cities, near coasts) are more challenging for the model.
借助空间外推能力，我们在图 7a 中为 Contextformer 呈现了 $R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT$ 的地图。伊比利亚半岛和法国北部的农田区域，以及巴尔干半岛的森林，都是该模型适用性极强的区域。对于前两者，这可能归因于这些区域丰富的训练样本；而对于后者，则无法如此解释。波兰的草原和森林，以及高度异质区域（如山区、城市附近、海岸附近），对模型来说更具挑战性。

Geomorphons capture local terrain features, derived from first and second spatial derivatives of elevation. Fig. 7b shows densities of RMSE of the ConvLSTM for different geomorphons from the Geomorpho90m map [2]. Generally, the model performs well across all classes. Summits and Depressions, two rather extreme types, seem to be slightly easier to predict. Homogeneous terrain (red: flat, shoulder, footslope) has a larger tail towards high error. This may be as those regions are typically where there is a lot of anthropogenic activity, possibly leading to dynamics less covered by the predictors (harvest, clear-cut, etc.).
地貌形态捕捉局部地形特征，源自高程的一阶和二阶空间导数。图 7b 展示了 ConvLSTM 在 Geomorpho90m 地图[2]中不同地貌形态的 RMSE 密度。总体而言，该模型在所有类别中表现良好。山顶和洼地这两种较为极端的类型似乎更容易预测。均质地形（红色：平坦、肩部、山麓）在高误差方向上具有较大的尾部。这可能是因为这些区域通常是人类活动密集的地方，可能导致预测因子（如收割、皆伐等）较少涵盖的动态变化。

Appendix D Performance per landcover type
附录 D 按土地覆盖类型划分的性能

Fig. 8 shows the model performance per landcover type.
图 8 展示了每种土地覆盖类型的模型性能。

Appendix E Robustness of Outperformance Score
附录 E 超额表现评分的稳健性

The choice of thresholds in the outperformance score (the percentage of samples where a model outperforms the climatology baseline by at least the threshold on at least 3 out of 4 metrics) is a heuristic. To assess its robustness, we re-evaluated five of our models over a wide range of possible threshold values. Fig. 9 shows a consistency of the ranking, in particular our Contextformer outperforms all other models in all settings.
超额表现分数中阈值的选择（即模型在至少四项指标中的三项上，至少以该阈值超越气候学基线样本的百分比）是一种启发式方法。为评估其稳健性，我们对五种模型在广泛的阈值范围内进行了重新评估。图 9 显示了排名的稳定性，特别是我们的 Contextformer 在所有设置下均优于所有其他模型。

Appendix F Inference speed
附录 F 推理速度

Computing inference speed is highly platform and batch size dependent. To make it somewhat fair, we compare models by running 1024 samples on an A40 GPU (48GB), with the largest batch size (bs) fitting in memory, we perform 10 repetitions and report the mean and std. dev.: Contextformer $29.3$ s $\pm 0.4$ (bs 72), SimVP $6.7$ s $\pm 0.8$ (bs 96), PredRNN $16.2$ s $\pm 0.2$ (bs 512), ConvLSTM $37.1$ s $\pm 1.8$ (bs 256). For comparison predicting a single sample with one of the local time series models takes $>$ 1h on a single CPU.
计算推理速度高度依赖于平台和批量大小。为了使其相对公平，我们通过在 A40 GPU（48GB）上运行 1024 个样本，使用内存中能容纳的最大批量大小（bs），进行 10 次重复并报告平均值和标准差来比较模型：Contextformer $29.3 29.329.3$ s $\pm 0.4± 0.4$ （bs 72），SimVP $6.7 6.76.7$ s $\pm 0.8± 0.8$ （bs 96），PredRNN $16.2 16.216.2$ s $\pm 0.2± 0.2$ （bs 512），ConvLSTM $37.1 37.137.1$ s $\pm 1.8± 1.8$ （bs 256）。作为对比，使用本地时间序列模型预测单个样本在单个 CPU 上需要 $>$ 1 小时。

Appendix G Downstream task: carbon monitoring
附录 G 下游任务：碳监测

Carbon monitoring is of great importance for climate change mitigation, especially in relation to nature-based solutions. The gross primary productivity (GPP) represents the amount of carbon that is taken up by plants through photosynthesis and subsequently stored. It is not directly observable. At a few hundred research stations around the world with eddy covariance measurement technology, it can be indirectly measured. For carbon monitoring, it would be beneficial to measure this quantity everywhere on the globe. It has been shown [56] that Sentinel 2 NDVI is correlated to GPP measured with eddy covariance. We build on this correlation to show how our models could potentially be leveraged to give near real-time estimates of GPP and to study weather scenarios.
碳监测对于气候变化缓解至关重要，尤其是在基于自然的解决方案方面。总初级生产力（GPP）代表了植物通过光合作用吸收并储存的碳量，这一指标无法直接观测。在全球数百个配备涡度协方差测量技术的研究站，GPP 可以通过间接方式测得。为了进行碳监测，在全球范围内测量这一数值将大有裨益。已有研究表明[56]，哨兵 2 号 NDVI 与涡度协方差测得的 GPP 存在相关性。我们基于这一相关性，展示了如何利用我们的模型来实现 GPP 的近实时估算，并研究不同天气情景下的表现。

Fig. 10 compares modeled with observed GPP at the Fluxnet site Grillenburg (identifier DE-Gri) in eastern Germany distributed by ICOS [30]. First, we fit a linear model between observed NDVI and GPP for the years 2017-2019. Here, interpolated grassland NDVI pixels (fig. 10b, inside red boundaries) are used. Next, we perform an out-of-sample analysis and find an $R^{2}=0.53$ for 2020-01 to 2021-04 (fig. 10a, blue line). Finally, we forecast GPP with our PredRNN model from May to July 2021(fig. 10a, orange line). The resulting forecast has decent quality at short prediction horizons, but low skill after 75 days (fig. 10c). These results show a way to leverage models from this paper for near real-time carbon monitoring. However, for application at scale, it is likely beneficial to use a more powerful GPP model (e.g. random forest [56] or light-use efficiency [8]), fitted across many Fluxnet sites.
图 10 比较了在德国东部由 ICOS[30]分发的 Fluxnet 站点 Grillenburg（标识符 DE-Gri）处模拟的与观测到的 GPP。首先，我们针对 2017-2019 年拟合了一个线性模型，该模型基于观测到的 NDVI 与 GPP 之间的关系。此处使用了插值后的草地 NDVI 像素（图 10b，红色边界内）。接着，我们进行了样本外分析，并发现 2020-01 至 2021-04 期间的 $R^{2}=0.53italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.53$ （图 10a，蓝线）。最后，我们使用 PredRNN 模型预测了 2021 年 5 月至 7 月的 GPP（图 10a，橙线）。所得预测在短期预测范围内质量尚可，但在 75 天后技能较低（图 10c）。这些结果展示了一种利用本文模型进行近实时碳监测的方法。然而，对于大规模应用，使用更强大的 GPP 模型（例如随机森林[56]或光能利用效率[8]），并在多个 Fluxnet 站点上进行拟合，可能会更有益。

Multi-modal learning for geospatial vegetation forecasting多模态学习用于地理空间植被预测