Information Fusion
Volume 97, September 2023, 101819
Long sequence time-series forecasting with deep learning: A survey

https://doi.org/10.1016/j.inffus.2023.101819

Highlights

  • Long sequence time-series forecasting (LSTF) is defined from two perspectives.
  • We propose a new taxonomy and give a comprehensive review of LSTF.
  • A Kruskal–Wallis test based LSTF performance evaluation method is proposed.
  • Abundant resources for TSF and LSTF are collected, including an open-source library.
  • We summarize four possible future research directions.

Abstract

The development of deep learning technology has brought great improvements to the field of time series forecasting. Short sequence time-series forecasting no longer satisfies the needs of the research community, and long-term future prediction has become the hotspot, a task known as long sequence time-series forecasting (LSTF). LSTF has been widely studied in the extant literature, but few reviews of its development have been reported. In this article, we provide a comprehensive survey of LSTF studies based on deep learning technology. We propose rigorous definitions of LSTF and summarize its evolution under a proposed taxonomy based on network structure. Next, we discuss three key problems and their corresponding solutions: long-dependency modeling, computation cost, and evaluation metrics. In particular, we propose a Kruskal–Wallis test based evaluation method to address the evaluation-metric problem. We further synthesize the applications, datasets, and open-source code of LSTF. Moreover, we conduct extensive case studies comparing the proposed Kruskal–Wallis test based evaluation method with existing metrics, and the results demonstrate its effectiveness. Finally, we propose potential research directions in this rapidly growing field. All resources and code are assembled and organized under a unified framework that is available online at https://github.com/Masterleia/TSF_LSTF_Compare.

Keywords

Time series forecasting
Long time series forecasting
Transformer
Data mining
Deep learning


1. Introduction

Time series forecasting (TSF) is a classical forecasting task that predicts the future trend changes of a time series, and it has been widely used in real-world applications such as energy [1], transportation [2], and meteorology [3]. Traditionally, statistical methods were the main tools for forecasting time series. As early as 1927, Yule proposed the autoregressive (AR) [4] model for predicting univariate stationary time series. Since then, the moving average (MA) [5] model proposed by Walker and the autoregressive moving average (ARMA) [6] model, a combination of the two, have been used to solve univariate stationary TSF problems. However, these traditional statistical methods rest on strong prior assumptions about the data, such as stationarity, linear correlation, normality, and independence, which limit their effectiveness in real-world applications; in particular, they struggle to capture the nonlinear relationships within and between time series. Traditional machine learning (ML) algorithms offer effective alternatives. Two classic algorithms, support vector machines (SVMs) [7] and adaptive boosting (AdaBoost) [8], were developed in 1995 and have achieved great success in the field of TSF. They rely on feature engineering, which reorganizes the data with a sliding time window: statistics such as the minimum, maximum, mean, and variance are calculated within each window and used as new features for prediction. These models solve, to some extent, the problem of predicting multivariate, heteroskedastic time series with nonlinear relationships. However, they suffer from poor generalization and limited predictive accuracy, which can often only be improved through complex manual feature engineering. The rapid development of deep learning (DL) in recent years, as shown in Fig. 1, has greatly improved the nonlinear modeling capability of TSF methods. DL has become an effective solution for TSF and is widely used in the finance, energy, meteorology, transportation, and medical fields. It has also been extended to many related problems, such as hierarchical time series forecasting [9], intermittent time series forecasting [10], sparse multivariate time series forecasting [11], and asynchronous time series forecasting [12], [13], and even to multiobjective and multigranular forecasting scenarios [14] and multimodal time series forecasting scenarios [15], [16]. With the development of technology, attention has gradually turned to using more historical data to predict the longer-term future, a task known as long sequence time-series forecasting (LSTF) [17], [18], [19], and researchers have explored commonly feasible solutions to it.
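The sliding-window feature engineering just described can be sketched as follows; the window length and the statistics computed are illustrative choices, not prescriptions from the cited works.

```python
import numpy as np

def window_features(x, w):
    """Summary statistics over each sliding window of length w (illustrative sketch)."""
    feats = []
    for i in range(len(x) - w + 1):
        win = x[i : i + w]
        # new features for a downstream SVM / AdaBoost regressor
        feats.append([win.min(), win.max(), win.mean(), win.var()])
    return np.asarray(feats)  # shape: (len(x) - w + 1, 4)

x = np.arange(10.0)
F = window_features(x, w=4)
# F[0] describes the window [0, 1, 2, 3]: min 0, max 3, mean 1.5, variance 1.25
```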
LSTF is in growing demand in financial, energy, meteorological, transportation, and medical scenarios, such as electricity usage planning [25] and long-term financial strategic guidance [26]. By forecasting outcomes further in advance, LSTF helps practitioners plan better for the future. Nevertheless, prior algorithms have performed unsatisfactorily when forecasting further ahead from longer input sequences: they are prone to significant declines in inference performance and predictive accuracy as the sequence length grows, and they struggle to efficiently extract effective dependencies from the data, which poses an urgent problem in time series forecasting. With the remarkable advancement of computational resources and deep learning techniques, however, this challenge is gradually being addressed.

Table 1. Recent survey comparison.

Survey | Theme | Contribution
Bianchi F. M. [20] | Focus on RNNs in short-term load forecasting | A survey on short-term load forecasting; a case study of the reviewed models.
Manibardo E. L. [21] | Focus on DL in road traffic forecasting | A critical analysis of the state of the art in DL for Intelligent Transportation Systems research; extensive experimentation.
Lara-Benítez P. [22] | Focus on DL in TSF | An exhaustive review of DL in TSF; an open-source DL framework for TSF and a comparative analysis.
Deb C. [23] | Focus on ML-based energy consumption forecasting | A comprehensive review and comparison of ML in energy consumption forecasting; constructive future directions.
Benidis K. [24] | Focus on educating, reviewing, and popularizing the latest developments in NN-driven forecasting | A survey of DL for TSF in breadth and depth; deeper analysis and future directions.
Ours | Focus on DL in long sequence time-series forecasting | Multi-perspective explanations of the LSTF definition; a new taxonomy and comprehensive review of LSTF; a new performance evaluation method for LSTF; abundant resources for TSF and particularly LSTF (datasets, metrics, an open-source library, etc.); future directions for LSTF.

Fig. 1. Technological Development of DL in TSF.

Many researchers have studied LSTF from the perspectives of traditional statistics and machine learning [27], [28], [29], [30], [31]. With the development of DL, and especially the emergence of Transformers [32], the ability to handle long-sequence problems has attracted wide attention. Researchers began to study the application of Transformers to TSF, and works such as [17], [33] demonstrated the effectiveness of Transformer-based models for LSTF. The recent work of [34] summarized the applications of Transformers in TSF. The parallel computation in Transformer-based architectures speeds up inference in LSTF and increases the length of series a model can handle, while the number of operations self-attention needs to relate two time points does not grow with the distance between them, making it easier to capture long-term dependencies in the data. However, the expensive training and deployment costs of these models make them unaffordable for many real LSTF problems, and Transformer-based models with large parameter counts are also prone to overfitting on some data distributions. Thus, many other non-Transformer works [29], [35], [36], [37], [38], [39], [40], [41] also study TSF in general and LSTF in particular. As LSTF receives more and more attention, related research keeps increasing. In particular, in view of the current status of existing work, we identified the following three problems in the LSTF field, which have not been standardized and sorted out in other works:
  • Problem 1: To our knowledge, long sequence time-series forecasting still lacks a comprehensive overview in the data mining and deep learning fields. In particular, the rapid development of LSTF based on deep learning technology has created a pressing need for an overview of methods, data, and metrics, both to guide researchers entering this rapidly developing field and to help experts compare LSTF methods. Many existing studies survey time series forecasting [20], [21], [22], [23], [24]; as shown in Table 1, these reviews tend to categorize and summarize TSF research under one overarching theme without distinguishing LSTF from other forecasting approaches.
  • Problem 2: Long sequence time-series forecasting lacks deep explanations from multiple perspectives. Generally, LSTF is defined as forecasting the distant future [17]. However, such a definition is macroscopic, and the existing literature lacks a comprehensive and nuanced explanation of LSTF; it remains unclear which specific forecasting tasks qualify as LSTF.
  • Problem 3: Long sequence time-series forecasting still lacks technical analysis and performance comparison of different neural networks under a unified framework. The traditional MSE and MAE metrics only reflect the error between the true and predicted values. In LSTF, however, as the prediction horizon grows, comparing the accuracy of models purely by the numerical magnitude of these metrics is unreliable. For example, the reported results [42] of PatchTST/42 (MSE = 0.202) and DLinear (MSE = 0.203) cannot establish the superiority of one model over the other on the numbers alone; that superiority must be demonstrated in a statistical sense, yet no such performance evaluation exists.
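As a concrete illustration of the statistical comparison this survey advocates, the sketch below computes the tie-corrected Kruskal–Wallis H statistic over per-window errors of two hypothetical models; the error samples and sizes are invented for illustration, and H would then be compared with the chi-square upper quantile at k − 1 degrees of freedom.

```python
import numpy as np

def average_ranks(pooled):
    """Ranks 1..N over the pooled sample; tied values share their average rank."""
    order = np.argsort(pooled, kind="mergesort")
    ranks = np.empty(len(pooled))
    ranks[order] = np.arange(1, len(pooled) + 1)
    for v in np.unique(pooled):
        tied = pooled == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def kruskal_wallis(*samples):
    """Tie-corrected Kruskal-Wallis H statistic for k independent samples."""
    sizes = [len(s) for s in samples]
    pooled = np.concatenate(samples)
    N = len(pooled)
    ranks = np.split(average_ranks(pooled), np.cumsum(sizes)[:-1])
    # H = 12 / (N (N+1)) * sum_i R_i^2 / n_i  -  3 (N+1)
    H = 12.0 / (N * (N + 1)) * sum(r.sum() ** 2 / n for r, n in zip(ranks, sizes)) - 3.0 * (N + 1)
    _, counts = np.unique(pooled, return_counts=True)  # tie correction factor
    return H / (1.0 - (counts**3 - counts).sum() / (N**3 - N))

# hypothetical per-window squared errors of two LSTF models
rng = np.random.default_rng(0)
err_a = rng.normal(0.202, 0.05, 100) ** 2
err_b = rng.normal(0.203, 0.05, 100) ** 2
H = kruskal_wallis(err_a, err_b)
```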
To address the above challenges, we provide a comprehensive overview of LSTF; this article serves both interested researchers who want to enter this fast-growing field and experts who want to compare LSTF models. To cover a broader range of approaches, we consider LSTF to be the task of forecasting the more distant future in TSF; in other words, we treat as LSTF literature any work whose contribution, motivation, and addressed problem is predicting a more distant future in TSF. Thus, LSTF is part of TSF, and spatio-temporal forecasting and LSTF differ in that they view the problem from different perspectives while sharing an intersection. We retrieved data in a predefined manner from the Web of Science (WoS) Core Collection database and DataBase systems and Logic Programming (DBLP). Document types were confined to articles or reviews. After filtering the articles by reading their abstracts, we also selected closely related references to complement them.
The contributions of this paper are summarized as follows:
  • (1)
    Multi-perspective explanations of the LSTF definition: The existing definition of LSTF is quite macroscopic and ambiguous, and clearly defining LSTF is challenging. To address this, we provide deep explanations from multiple perspectives: relative and absolute concepts.
  • (2)
    New Taxonomy and Comprehensive Review: We propose a new taxonomy of LSTF research, classifying works into RNN-based, CNN-based, GNN-based, Transformer-based, compound, and miscellaneous models according to the characteristics of their main predictive network structures. In addition, we summarize several key problems of LSTF and investigate the related work and effective solutions for each.
  • (3)
    New Performance Evaluation: To enable fair and effective performance evaluation of LSTF, we apply the Kruskal–Wallis test and propose a new LSTF prediction performance evaluation method from two perspectives.
  • (4)
    Abundant Resources: We have collected abundant resources for TSF and LSTF in particular. We classify five application domains and collect relevant datasets for each, giving open-source links for every dataset. We also collect the relevant evaluation metrics, classify them into scale-dependent, scale-independent, and scaled errors, and discuss the properties of each metric. Finally, we select recent SOTA models (Pyraformer, FEDformer, Autoformer, Informer, Reformer, Transformer, MTGNN, Graph WaveNet, LSTNet) and, based on the datasets and evaluation metrics of our research, give a technical analysis and performance comparison of different neural networks under a unified framework in LSTF, together with an open-source library.
  • (5)
    Future Directions: Based on the results of our research and experiments, we summarize four possible future research directions.
The motivation and organization of this paper are shown in Fig. 2. Section 2 gives the definitions of time series data and TSF tasks, together with multi-perspective explanations of the LSTF definition. In Section 3, we review the relevant research and the evolution of DL networks in the LSTF field in recent years and classify the related studies according to the technical contribution of each work. Section 4 summarizes the key problems encountered in LSTF, sorts out the research directions and effective solutions for each key problem in current works, and proposes our new LSTF evaluation approach for the performance evaluation problem. Section 5 summarizes the relevant open-source datasets across the five application areas. Section 6 summarizes, discusses, and classifies the predictive performance evaluation metrics. In Section 7, we conduct a case study comparing recent SOTA models to evaluate their LSTF performance. In Section 8, we discuss future research directions.

Fig. 2. Article Structure.


Fig. 3. Univariate Forecasting.


Fig. 4. Covariate Forecasting.

2. Problem definition

In this section, we present the definitions of time series data and their characteristics, then define TSF tasks and classify them according to their properties. Note that the terms "variable" and "feature" are used interchangeably in this paper. Unless otherwise noted, the notation used in this study is shown in Table 2.

2.1. Time series data

Time series data exist in many fields of real life, and it is particularly important to understand their distribution characteristics and predict their future behavior. Predicting future gold price or stock market fluctuations can yield large benefits, and predicting future weather or traffic flows benefits people's daily lives, so grasping the future evolution of time series data is very meaningful. So what are time series data?
Time series data are series of finite or infinite random variables (sequence data) arranged on the basis of time. This type of data reflects the changing state of a certain thing or phenomenon over time. Beyond numeric data in the traditional sense, sequence data such as video, audio, and voice can also be regarded as time series data when arranged by time; that is, any data observed and recorded in time order can be regarded as time series data. An N-dimensional time series matrix X̂_{1:N, t−n:t} with n samples per series can be expressed formally as follows:
(1) X̂_{1:N, t−n:t} = {x_{1, t−n:t}, x_{2, t−n:t}, …, x_{N, t−n:t}}
where x_{1, t−n:t} is a vector, x_{1, t−n:t} = {x_{1, t−n}, x_{1, t−n+1}, …, x_{1, t}}, and x_{1, t−n} denotes the point of the first time series at time t−n. Thus, X̂_{1:N, t−n:t} denotes time series data consisting of multiple time series.
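In NumPy terms, the matrix in Eq. (1) is simply a 2-D array whose rows are the individual series; the sizes and values below are arbitrary illustrative choices.

```python
import numpy as np

N, n = 3, 4  # N series, each observed from t-n to t (n + 1 points)
rng = np.random.default_rng(0)
X = rng.normal(size=(N, n + 1))  # row X[i - 1] holds the vector x_{i, t-n:t}
x1 = X[0]           # x_{1, t-n:t}, one full series
x1_first = X[0, 0]  # the single point x_{1, t-n}
```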

Table 2. A summary of notations.

Notation | Description
X | A multidimensional time series matrix
x | A time series vector
f(⋅) | A time series forecasting model
X̂_{i,t+1} | The predicted value of the ith feature at time step t+1
X_{i,t−n:t} | The historical values of the ith feature from time step t−n to t
t | The current time step
l | The interval between time series data points
m | The sample size
ŷ | The predicted value
y | The true value
ȳ | The mean of the real sample data
y_max | The maximum of the real sample data
y_min | The minimum of the real sample data
mean(⋅) | The mean value
H | The statistic of the Kruskal–Wallis (K-W) test
R_i | The sum of the ranks of the n_i observations {X_1, …, X_{n_i}} of the ith sample in the pooled ranking
H′ | The corrected K-W test statistic
α | The significance level
V | Degrees of freedom
χ² | The chi-square distribution
χ²_{α,v} | The upper quantile of the chi-square distribution
Time series data exist in major fields such as finance, energy, meteorology, transportation, and medical care. They all have the following common characteristics.
  • Trend (T): A trend is the change of a time series as time (or an independent variable) progresses; the data may present a relatively slow, long-term tendency of continuous rising or falling.
  • Periodicity (C): Periodicity means that the time series rises or falls regularly as time progresses, but at a frequency that is not fixed.
  • Seasonality (S): Seasonality means that the time series changes with natural seasons and time periods, and its data exhibit regular rises or falls as seasons change. The definition of seasonality is very similar to that of periodicity, and the two are often confused; in fact, there is a significant difference between them. Seasonality means that the time series changes at a known, fixed frequency under the influence of seasons, while periodicity means that the series rises and falls at an unfixed frequency.
  • Randomness (I): Randomness is manifested in time series data points showing irregular changes while the whole series still satisfies statistical laws; these changes are sometimes called residuals.
In summary, time series have four characteristics: trend, periodicity, seasonality, and randomness. In addition, time series can be classified by other characteristics of TSF tasks, such as their data distribution: time series data may be stationary or nonstationary, where stationarity means that the mean and variance of the series are constant over time. In many cases, methods such as time series decomposition are applied before forecasting to convert nonstationary time series into stationary ones, which can improve forecasting accuracy. Notably, according to their nature, time series can also be divided into hierarchical time series [9], fuzzy time series [43], and so on; however, this paper only discusses LSTF on precise time series. In the next section, we introduce the definition of TSF tasks.
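As one concrete example of the decomposition step mentioned above, the sketch below performs a naive additive decomposition x = T + S + I with a moving-average trend and per-position seasonal means; the period and series are invented for illustration, and production work would typically use a library routine instead.

```python
import numpy as np

def additive_decompose(x, period):
    """Naive additive decomposition x = T + S + I (a sketch, not a library routine)."""
    trend = np.convolve(x, np.ones(period) / period, mode="same")  # moving-average trend
    detrended = x - trend
    # seasonal component: mean of each position within the period, tiled over the series
    pattern = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(pattern, len(x) // period + 1)[: len(x)]
    residual = detrended - seasonal  # the irregular (random) part
    return trend, seasonal, residual

t = np.arange(200)
x = 0.05 * t + np.sin(2 * np.pi * t / 20) + np.random.default_rng(1).normal(0, 0.1, 200)
T, S, I = additive_decompose(x, period=20)  # the three components sum back to x
```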

2.2. Time series forecasting

Many analysis and processing tasks exist for time series data, such as time series classification and TSF; this paper focuses on the TSF task. The TSF task is a classic problem that uses one or more time-ordered features of something over a past period to predict the features of that thing over a future period. This task differs from general regression analysis: a TSF model needs to capture the order of the series data to grasp its future trend, and feeding the same data into the model in a different order produces a completely different result. Given the ith time series with n sample points X_{i,t−n:t} = {x_{i,t−n}, x_{i,t−n+1}, …, x_{i,t}}, we take the historical data of this series to predict its future value at time step t+1; the result X̂_{i,t+1} is:
(2) X̂_{i,t+1} = f(X_{i,t−n:t})
where f(⋅) represents the forecasting model and x_{i,t} ∈ ℝ.
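The mapping f in Eq. (2) can be any model; as a minimal sketch, the following fits a linear autoregressive f by least squares on a toy sine series (the lag order and the data are illustrative assumptions, not from the paper).

```python
import numpy as np

def fit_ar(x, n):
    """Fit a linear f by least squares: predict x[t+1] from the previous n values."""
    A = np.asarray([x[i : i + n] for i in range(len(x) - n)])  # lag windows
    b = x[n:]                                                  # one-step-ahead targets
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

def forecast_one_step(history, coef):
    # X_hat_{i, t+1} = f(X_{i, t-n:t}) with f a fitted linear model
    return float(history[-len(coef):] @ coef)

t = np.arange(300)
x = np.sin(2 * np.pi * t / 25)           # toy univariate series
coef = fit_ar(x[:-1], n=8)               # train on all but the last point
pred = forecast_one_step(x[:-1], coef)   # one-step forecast of the held-out point
```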
However, TSF tasks vary widely, because they differ in the number of input features required, the number of output features predicted, the length of the output series, and the distribution and characteristics of each feature's data [44], [45]. Classification methods for time series tasks are given below according to the characteristics of the different tasks.
  • (1)
    Univariate/Covariate/Multivariate Forecasting: According to the number of features a forecasting task takes as input and the number of features it predicts, forecasting tasks can be classified into univariate, covariate, and multivariate forecasting tasks.
    • Univariate: In a univariate forecasting task, both the input and the prediction consist of a single feature, as shown in Fig. 3: the future value of a single variable is predicted from that variable's historical data, and the size of the predicted future window is referred to as the prediction horizon in this paper. The mathematical expression of the univariate forecasting task is as follows:
      (3) X̂_{t+1} = f(X_{t−n:t})
      where t is the current time step and n is the input window size; the task uses the historical data of the previous n time steps at time t to predict the feature value at time step t+1.
    • Covariate: When predicting the feature of a single variable, it is sometimes necessary to model the associations among multiple variables; we refer to these variables as covariates, and such a task is called a covariate forecasting task. Note that the number of predicted features is still one, as in univariate forecasting, but there are multiple input features, as shown in Fig. 4. For single-step forecasting, the covariate forecasting task is expressed as:
      (4) X̂_{1,t+1} = f(X_{i,t−n:t})
      where i is the number of input features (i > 1), n is the input window size, and t is the current time step.
    • Multivariate: Beyond the two cases above, some tasks model the feature associations among multiple variables in order to predict the feature values of multiple variables; we refer to these as multivariate forecasting tasks. Unlike the previous two task types, both the inputs and outputs are multivariate, as shown in Fig. 5. For single-step forecasting, the multivariate forecasting task is expressed as:
      (5) X̂_{j,t+1} = f(X_{i,t−n:t})
      where i is the number of input features, n is the input window size, j is the number of output features, j ≤ i, and t is the current time step.
  • (2)
    Single-Step Forecasting/Multi-Step Forecasting We refer to the task that predicts a single time step in the future as a single-step forecasting task, and the task that predicts multiple time steps in the future as a multi-step forecasting task, which is shown in Fig. 6. The mathematical expressions of the single-step forecasting task and the multi-step forecasting task in the context of multivariate forecasting are as follows.
    • Single-Step Forecasting: (6) X̂_{j,t+1} = f(X_{i,t−n:t})
    • Multi-Step Forecasting: (7) X̂_{j,t+1:t+h+1} = f(X_{i,t−n:t})
      where i is the number of input features, j is the number of output features, j ≤ i, n denotes the size of the input window, t is the current time step, and h denotes the prediction horizon.
      Recursive forecasting is one way to achieve multi-step forecasting: a single-step forecast produces the prediction at time t+1, this prediction and the preceding historical sequence are fed back into the model to obtain the prediction at time t+2, and the iteration continues until the prediction at time t+h is obtained, as shown in Fig. 7. The mathematical definition of recursive forecasting is as follows. (8) X̂_{i,t+1} = f(X_{i,t−n:t}), X̂_{i,t+2} = f(X_{i,t−n+1:t+1}), …, X̂_{i,t+h+1} = f(X_{i,t+h−n:t+h}), where i is the number of input features, n denotes the size of the input window, t is the current time step, and h denotes the prediction horizon. However, this method often accumulates prediction errors, resulting in low prediction accuracy for the final model.
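The feedback loop in Eq. (8) can be made concrete with a minimal Python sketch. The one-step model here (a window mean) is purely a toy stand-in, not any specific model from the surveyed literature; the point is only the feed-back structure through which errors accumulate.

```python
def recursive_forecast(history, one_step_model, horizon):
    """Recursive multi-step forecasting (Eq. (8)): each one-step
    prediction is appended to the sequence and fed back into the model."""
    n = len(history)                         # input window size
    window = list(history)
    predictions = []
    for _ in range(horizon):
        y_hat = one_step_model(window[-n:])  # single-step forecast on last n points
        predictions.append(y_hat)
        window.append(y_hat)                 # prediction errors accumulate here
    return predictions

# Toy one-step model: forecast the mean of the input window.
window_mean = lambda w: sum(w) / len(w)
print(recursive_forecast([1.0, 2.0, 3.0], window_mean, 2))  # [2.0, 2.333...]
```

Because each prediction becomes an input for the next step, any bias in `one_step_model` compounds over the horizon, which is exactly the error-accumulation drawback noted above.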
  • (3)
    Isometric/Nonisometric Interval Forecasting
    In addition to the above classification methods, the developed approaches can also be divided according to whether the prediction step size of the forecasting task is equidistant. We refer to tasks with equal prediction steps as isometric interval forecasting tasks, and tasks with unequal prediction steps as nonisometric interval forecasting tasks. In a case with a univariate single-step forecasting, the mathematical expressions of isometric interval forecasting tasks and nonisometric interval forecasting tasks are as follows:
    • Isometric Forecasting: (9) X̂_{t+1} = f(x_{i,t−n}, x_{i,t−n+1}, x_{i,t−n+2}, …, x_{i,t}) = f(X_{t−n:t})
    • Nonisometric Forecasting: (10) X̂_{t+1} = f(x_{i,t−n}, x_{i,t−n+l_1}, x_{i,t−n+l_2}, …, x_{i,t}) = f(X_{t−n:t}), where t is the current time step and {l_n | l_1, l_2, l_3, …, l_n ∈ ℕ⁺}. There are few nonisometric forecasting cases, so we only discuss isometric forecasting.
  • (4)
    Homogeneous/Heterogeneous Forecasting
    All the above classification methods are based on homogeneous time series, while a large number of heterogeneous time series exist in fields such as unmanned systems, smart homes, robotic servants, and aviation. For example, Li et al. [45] combined synchronous time series and asynchronous event sequences for prediction. Thus, forecasting tasks can also be divided into homogeneous and heterogeneous time series forecasting tasks. Heterogeneous time series contain different types of embedded statistical patterns [44], such as time series consisting of continuous and categorical variables; we refer to the task of combining different heterogeneous data for forecasting as heterogeneous time series forecasting. In the context of single-step forecasting, heterogeneous time series forecasting is expressed as (11) X̂_{t+1} = f(X_{t−n:t}, S_{i,τ}), where S_{i,τ} is the ith heterogeneous input series, S_{i,τ} = {s_{i,τ_k} | τ_k < t}, and t is the current time step. In this paper, only homogeneous time series prediction is discussed.
With reference to the above classification, a time series forecasting task can be defined as univariate single-step/multi-step forecasting, covariate single-step/multi-step forecasting, or multivariate single-step/multi-step forecasting according to its characteristics. In this paper, the default is isometric homogeneous time series forecasting.

2.3. Long sequence time-series forecasting

The definitions of time series and TSF tasks are introduced above; this section discusses the definition of LSTF based on an investigation of the literature and industry standards in various fields.
To define the LSTF, we have reviewed recent studies of LSTF and its applications in various areas; it is usually defined as forecasting a more distant future [17], [46]. However, this definition is quite macroscopic and ambiguous, and clearly defining LSTF is challenging. To address this, we discuss it from both relative and absolute perspectives.
From the relative perspective, a task with a long prediction horizon is usually defined as LSTF. "Long" or "short" is a relative concept, usually based on a comparison. Currently, many LSTF definitions rest on such comparison results [18], [47], and no clear definitions are provided [33], [47], [48]. These works elucidate the LSTF problem by comparing the prediction accuracy and inference performance of models under different prediction horizons [17], [49]: for longer prediction horizons, prediction accuracy and inference performance tend to decrease compared with shorter ones. On this basis, the prediction task with the longer prediction horizon is usually defined as LSTF.
From the absolute perspective, the definition of LSTF is usually derived from the application area [21] and varies across industry sectors. In most cases, it is highly related to the cycle of the data, which is a critical predictor for forecasting tasks [50]: a task can be considered LSTF when its prediction range approaches or exceeds the maximum cycle of the data. For instance, in finance, Queensland1 and St. Helena2 defined forecasting tasks of over 10 years as LSTF in their local fiscal reports, while the cycle of economic data is usually between 2 and 10 years [51]. In meteorology, the "Dictionary of Atmospheric Sciences" [52] defined weather forecasting tasks with a prediction horizon over 1 month as LSTF, and the World Meteorological Organization (WMO)3 stipulated that the prediction horizon of LSTF must be over 30 days and less than 2 years [53]; nevertheless, the maximum cycle of the meteorological data in these tasks is usually a month or a quarter. In electrical load forecasting, Khuntia S. R. [54] and many real-world practices [55], [56], [57], [58] termed prediction horizons from a few years (> 1 year) up to 10–20 years as LSTF, where the maximum cycle of the data is only up to 1 year. In traffic prediction, James J. Q. [59] and Peng H. [60] categorized tasks with prediction horizons of several days as LSTF, while the maximum cycle is upper bounded by a few hours or 1 day. In blood glucose prediction, LSTF is implemented as forecasting at least 60 minutes ahead, although the maximum cycle is usually less than one hour.
Therefore, based on this investigation of real-world practices and research works across industry sectors, we conclude that LSTF is defined by the comparison between the prediction horizon and the cycle: in most tasks, if the prediction horizon approaches or exceeds the maximal cycle, the forecasting task is viewed as an LSTF problem.
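As a rough illustration of this horizon-versus-cycle criterion, the sketch below estimates a series' dominant cycle from its sample autocorrelation and compares it with the prediction horizon. The function names and the simple arg-max rule are our own illustrative choices under this paper's criterion, not a method taken from the surveyed works.

```python
import math

def dominant_cycle(series, max_lag=None):
    """Estimate the dominant cycle length as the lag with the
    highest sample autocorrelation."""
    n = len(series)
    max_lag = max_lag or n // 2
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    best_lag, best_ac = 1, float("-inf")
    for lag in range(1, max_lag + 1):
        ac = sum((series[i] - mean) * (series[i - lag] - mean)
                 for i in range(lag, n)) / var
        if ac > best_ac:
            best_lag, best_ac = lag, ac
    return best_lag

def looks_like_lstf(horizon, series):
    """Treat a task as LSTF when the prediction horizon approaches
    or exceeds the maximal cycle of the data (Section 2.3)."""
    return horizon >= dominant_cycle(series)

monthly = [math.sin(2 * math.pi * t / 12) for t in range(120)]  # 12-step cycle
print(dominant_cycle(monthly))       # 12
print(looks_like_lstf(24, monthly))  # True: horizon exceeds the cycle
```

On this toy monthly series with a yearly cycle, a 24-step horizon qualifies as LSTF while a 6-step horizon does not, matching the absolute-perspective definition above.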

3. Review and classification of DL techniques for LSTF

In this section, we review the latest results in the field of DL for LSTF and classify the developed models based on the different technical routes of their effective architectures. Model feature extraction capability and architecture are crucial in LSTF. Therefore, based on the dominant feature extraction architectures used in the models, we classify them into RNN-based, CNN-based, GNN-based, Transformer-based, and compound models. Some special and novel methods in the field that do not belong to the above categories are classified as miscellaneous methods. Finally, we review the evolution of the networks in the LSTF field to identify the popular feature extraction architectures in LSTF. Note that the criterion for collecting literature is whether a work enhances the predictive performance of LSTF: for example, whether it improves the capture of long-term dependencies, or whether the specific task context of the study is long-term prediction.

3.1. RNN-based model  3.1. 基于 RNN 的模型

RNNs were first proposed by Elman et al. [61]. RNNs store past information by introducing state variables and use them together with the current input to determine the current output; their structure, shown in Fig. 8, is innately well suited for processing temporal information. Therefore, RNNs and their associated family of models have been widely utilized for processing sequential data. However, because vanilla RNNs suffer from vanishing and exploding gradients on long sequences, Graves et al. [62] designed long short-term memory (LSTM), which greatly alleviated the problems of early RNN training by using gating units and memory mechanisms. Later, due to the computational complexity of LSTM, the large number of required computational resources, and the slow training on large amounts of data, researchers [63] proposed an improved LSTM variant called the gated recurrent unit (GRU), which combines the forget gate and the input gate of LSTM into a single "update gate". This simplifies the model and reduces the number of model parameters, allowing it to converge faster during training. Table 3 summarizes the recent works on RNN-based models for LSTF.
RNN variant. Vanilla LSTM has a strong ability to capture long-term dependencies. Jung et al. [35] proposed an RNN model with LSTM (a monthly PV power generation prediction model) for predicting the amount of PV solar power that could be generated at a new location. The model uses an RNN with an LSTM layer on monthly data series (monthly solar radiation) to learn temporal and topographic variations in solar radiation and weather conditions, capturing temporal patterns observed at multiple sites. A multivariate time series forecasting method was developed based on bidirectional LSTM (BiLSTM) [73]; the model was trained using several features, such as the number of coronavirus detections, wind speed, ambient temperature, biomass, solar and wind energy supply, and historical electricity demand, and achieved satisfactory results. To improve the capture of long-term future dependencies, the winning entry of the M4 prediction competition, ES-LSTM [68], combines exponential smoothing (ES) with LSTM: ES allows the method to efficiently capture the main components of individual sequences (for deseasonalization and normalization), while the LSTM network handles nonlinear trends and cross-learning. This network is essentially hierarchical and utilizes both global (applied to a large subset of all sequences) and local (applied to each sequence individually) neural network parameters to enable cross-learning. Further, to address the problems of training long sequences in RNNs (capturing complex dependencies, vanishing and exploding gradients, and effective parallelization), the dilated RNN [64] constructs a model that learns temporal dependencies at different scales and levels. This is done by combining multiple recurrent dilation layers with hierarchical dilation, stacked through multiresolution dilated skip connections, for flexible combination with different RNN units, effectively alleviating these three problems of the RNN structure in LSTF.
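To see how the GRU's single update gate replaces LSTM's separate forget and input gates, here is a scalar-valued sketch. Real implementations use weight matrices and bias vectors; the tiny two-weight parameterization below is purely illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h_prev, w_z, w_r, w_h):
    """Scalar GRU step; each w_* is an (input, recurrent) weight pair."""
    z = sigmoid(w_z[0] * x + w_z[1] * h_prev)               # update gate
    r = sigmoid(w_r[0] * x + w_r[1] * h_prev)               # reset gate
    h_cand = math.tanh(w_h[0] * x + w_h[1] * (r * h_prev))  # candidate state
    # One gate does double duty: (1 - z) "forgets" the old state
    # exactly as much as z "inputs" the new candidate.
    return (1 - z) * h_prev + z * h_cand
```

The convex combination on the last line is what merges LSTM's forget and input gates: the cell cannot forget more than it writes, which cuts the parameter count and is one reason GRUs converge faster in training.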

Fig. 8. Unrolled RNN structure. At each time point t, the network receives the external input x_t and the hidden state h_{t−1} from the previous time point; the hidden state at time t is updated to h_t, and y_t is the network output.

Table 3. RNN-based Model. Remark denotes the research directions of each work in enhancing LSTF. †: Explicitly revealing the correlations between variables; ‡: Implicitly increasing the intersequence dependency learning abilities of models; §: Few parameters and low computational complexity.

| Ref | Year | Technique | Baseline | Comments | Remark |
| --- | --- | --- | --- | --- | --- |
| Dilated RNN [64] | 2017 | Dilated connection + RNN | Vanilla RNN, Vanilla LSTM, Vanilla GRU, Stacked RNN, Stacked LSTM, Stacked GRU, Skip RNN | Dilated skip connections address the problems of training long sequences in RNNs: 1. capturing long-term and complex dependencies; 2. vanishing and exploding gradients; 3. effective parallelization. | § |
| DA-RNN [65] | 2017 | Attention + LSTM | ARIMA, NARX RNN, Encoder–Decoder, Attention RNN, Input-Attn-RNN | Two stages of attention select the most relevant feature variables and hidden states in LSTM. | |
| MQ-RNN [66] | 2017 | LSTM + Encoder–Decoder + MLP | | Multi-horizon LSTM with an encoder–decoder can cold-start in real-life LSTF. | |
| Fan et al. [67] | 2019 | LSTM + Encoder–Decoder + Attention | Gradient-Boosting, POS-RNN, MQ-RNN, TRMF | Attention to fuse multimodal features of LSTM. | |
| ES-LSTM [68] | 2020 | ES + LSTM | | ES and LSTM capture linear and nonlinear relationships, respectively. | |
| Jung et al. [35] | 2020 | LSTM | | Achieved long-term power generation forecasts for PV plants with LSTM. | |
| MTSMFF [69] | 2020 | LSTM + Encoder–Decoder + Attention | SVR, RNN, CNN, LSTM, GRU, seq2seq, seq2seq-BI, seq2seq-ATT | Attention to select the hidden states of BiLSTM. | |
| LGnet [70] | 2020 | GAN + LSTM | LR, XGBoost, MICE, GRUI, GRU-D, BRITS | GAN for adversarial training and to compensate for missing data in LSTM. | |
| ATFN [71] | 2020 | NN + S2S + GRU + DFT | S2S, S2SAttn, SFM, mLSTM, ND | DFT for periodicity, GRU for trend, NN to fuse multiple features. | |
| Yoshimi et al. [72] | 2020 | LSTM + Attention | LSTM, DSTP | Spatial-attention LSTM to extract spatial and event correlations. | |
| Liu et al. [73] | 2021 | BiLSTM | | Long-term electricity demand forecast under the COVID-19 impact. | |
| Gangopadhyay et al. [74] | 2021 | LSTM + Attention | SVR-RBF, Enc-Dec, LSTM-Att, DA-RNN | STAM to capture the most relevant variables at each time step. | |
Early work was mainly conducted by optimizing RNN networks themselves and combining them with traditional statistical methods. Since 2014, the emergence of structures such as attention mechanisms, sequence-to-sequence (Seq2Seq) structures and generative adversarial networks (GANs) has brought new directions regarding the optimization of RNN models in LSTF.
RNNs with Attention. The attention mechanism was proposed in 2014 and has since been widely used in the CV field; in recent years, attention mechanisms have started to be combined with RNNs to capture more relevance from the data, making it possible to predict the more distant future. An RNN model based on an attention mechanism and LSTM was designed in [72]. The model first uses a spatial-attention LSTM to extract the spatial correlations among multiple time series; after applying the attention mechanism to the whole time series, the spatial-attention LSTM is then used to extract the relationships between the dependent-variable and independent-variable time series to capture event correlations. Accurate LSTF for multivariate time series is not easy because a portion of the independent-variable series may not help prediction and may even impair accuracy; this approach improves data quality by filtering the segmented independent-variable series through an attention mechanism, thereby improving prediction accuracy. Similarly, to improve the ability to capture long-term correlations in the data, Y. Xiang et al. [65] presented DA-RNN, a two-stage RNN based on an attention mechanism. In the first stage, the model adaptively extracts correlations and hidden features among the input data at each time step through an attention mechanism. In the second stage, the hidden states of the encoder are selected across all time steps by a temporal attention mechanism. Thus, the model adaptively extracts the most relevant features from the input data and captures the long-term temporal correlations in the input sequence. Sarkar et al. [74] similarly combined LSTM and attention mechanisms, proposing a spatio-temporal attention mechanism called STAM. The model uses an LSTM layer in the encoder to learn temporal correlations in the input data. At each output time step, the model first searches for the most relevant time step and the most important variable, then makes predictions based on the computed spatial and temporal context vectors.
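The temporal-attention step these models share (scoring encoder hidden states against a decoder query and taking a softmax-weighted context) can be sketched as follows. Plain dot-product scoring is a simplification of the learned alignment networks used in DA-RNN and STAM.

```python
import math

def temporal_attention(hidden_states, query):
    """Softmax-weighted context over encoder hidden states.

    hidden_states: list of T vectors (one per input time step)
    query: decoder state vector of the same dimension
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    peak = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context
```

The weights show which past time steps the decoder attends to; the context vector is what gets fed into the prediction head, letting the model reach arbitrarily far back regardless of how much the RNN state has decayed.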
RNNs with Seq2Seq. In addition to attention mechanisms, Seq2Seq has made a large splash in natural language processing (NLP) since 2014. Compared with the vanilla RNN, it effectively improves the ability to handle long sequences in multi-step forecasting. Fan et al. [67] borrowed the encoder–decoder structure and combined it with LSTM for LSTF. Their model uses a bidirectional LSTM decoder to first propagate future information in both the forward and reverse directions while taking dynamic future information, such as promotions and calendar events, into account. Then, at each future time step, several different periods of history are attended to, and attention vectors are generated separately. The work treats different historical periods as different modalities and combines them by learning the relative importance of each modality for predicting the current time step. The study demonstrated that jointly learning temporal attention values for multiple historical periods and fusing them with multimodal attention weights can improve prediction accuracy further into the future. The MQ-RNN [66] is a multi-horizon prediction model that simultaneously predicts the values of multiple future time steps. It uses an encoder–decoder structure and replaces the RNN decoder with multiple fully connected layers. The model directly concatenates the context of the last encoder step with all future features through a full connection to generate a context for each moment plus a global overall context, interweaving coarse- and fine-grained features to improve the accuracy of predicting the distant future.
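The contrast with recursive forecasting can be made concrete: an MQ-RNN-style decoder emits every horizon in one shot from a shared context through per-horizon heads, so one step's error never feeds the next. The linear heads and toy weights below are hypothetical simplifications, not taken from the paper.

```python
def direct_multi_horizon(context, heads):
    """Direct multi-horizon decoding: one (weights, bias) head per future
    step, all applied to the same encoder context vector."""
    return [sum(w * c for w, c in zip(weights, context)) + bias
            for weights, bias in heads]

context = [1.0, 2.0]            # last encoder state (toy values)
heads = [([0.5, 0.5], 0.0),     # head for step t+1
         ([0.0, 1.0], 0.1)]     # head for step t+2
print(direct_multi_horizon(context, heads))  # [1.5, 2.1]
```

Because all horizons are decoded from the same context, errors cannot accumulate across steps as in the recursive strategy of Eq. (8), at the cost of one head per horizon.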
RNNs with Seq2Seq and Attention. The MTSMFF model [69] similarly combines LSTMs, encoder–decoder structures, and attention mechanisms. It uses BiLSTM with temporal attention as the encoder network, adaptively learns the long-term correlations and implied correlation features of multivariate temporal data, and then decodes them with a stacked vanilla LSTM, improving both the ability to handle long sequences and the ability to capture long-term correlations.
RNNs with GAN. The GAN, also introduced in 2014, revolutionized the field of DL. It was brought to this setting by Trirat et al. [70], who proposed a new framework called LGnet for solving the multivariate time series (MTS) prediction problem in the presence of missing values. The framework constructs representative local features for each variable of the MTS problem and then uses a memory module to explicitly utilize knowledge derived from external sequences to generate global estimates for the missing values. In addition, the modeling of the global temporal distribution is enhanced by adversarial training. By handling data with missing values, this work ensures the validity of the prediction results, which is often undermined as the prediction horizon grows in LSTF, and reduces the data integrity requirement of LSTF.
RNNs with Time–frequency Analysis. Time–frequency analysis can explicitly reveal the relationships between features rather than relying on the model to capture them, and making these relationships explicit can effectively improve accuracy when forecasting the distant future. Yang et al. [71] proposed an adaptive time–frequency network (ATFN) to solve the LSTF problem. The model first uses an augmented Seq2Seq (AugS2S) model based on a GRU to learn the trend features of complex nonstationary time series, then captures the complex periodic patterns of the series using frequency-domain blocks, and finally combines the trend and periodic features to produce the final forecast with a fully connected neural network. This study demonstrated the advantage of pure time-domain models for LSTF problems involving weakly periodic time series, while frequency-domain methods have significant advantages for strongly periodic series. Therefore, hybrid time-domain/frequency-domain models form a future direction for LSTF research.
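A frequency-domain block of the kind ATFN uses rests on the discrete Fourier transform. The brute-force sketch below (O(n²), stdlib only; a real system would use an FFT) finds the strongest non-DC frequency, i.e., the periodic pattern such a block would model; the function name is our own.

```python
import cmath
import math

def strongest_frequency(series):
    """Return the DFT bin (cycles per window) with the largest
    magnitude, ignoring the DC component."""
    n = len(series)
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(series[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k

wave = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
print(strongest_frequency(wave))  # 5 cycles per window
```

Once the dominant frequencies are known, the periodic component can be modeled in closed form, leaving the network (the GRU trend block in ATFN's case) to learn only the aperiodic remainder.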

3.2. CNN-based model

CNNs are very widely used in the field of computer vision. Their feature-reduction and parallel-processing capabilities on long sequences have led to attempts to apply them to LSTF. However, CNNs naturally capture short-term local dependencies, so how to effectively capture long-term global dependencies in LSTF is a key concern. It was not until causal convolution [75] and dilated causal convolution [76] were proposed that this problem was effectively alleviated, as shown in Fig. 9. In this section, we summarize the recent research works on CNN-based models in Table 4 and below.
CNN variant. TCN [77] is a classic model for modeling and analyzing sequential data. It uses causal convolution to avoid information leakage and dilated causal convolution to handle long sequential inputs; residual connections are introduced to deepen the network. This demonstrates that a simple convolutional structure can perform as effectively as an RNN model such as LSTM when processing sequential data, and its parallelism and feature-compression properties make it more suitable for LSTF than a vanilla RNN. HyDCNN [80] also addresses problems such as the inability of ordinary CNNs to capture long-term correlations. It is a hybrid framework based on dilated causal convolution: it captures the nonlinear dynamics between sequences with dilated causal convolutions and the linear dependencies with autoregressive modules. In addition, HyDCNN consists of several hybrid modules, each targeting a trend pattern or a kind of periodic pattern. To further capture periodic temporal patterns, a new jumping scheme was introduced in the hybrid modules, capturing multi-granularity cyclicality and trends to improve accuracy when predicting further ahead. Time series data have two characteristics, long-term trends and short-term fluctuations, which together constitute the complex structure of time series data. PFNet [79] was designed as a framework that captures the long-term trends and short-term fluctuations of time series in parallel. The framework consists of three submodules: a long-term trend forecasting module (LTPM), a short-term fluctuation forecasting module (SFPM), and an information fusion module (IFM). The LTPM uses a Highway-CNN to capture long-term trend features, while the SFPM uses the differences between time series values as inputs; the Highway-CNN structure is also used for feature extraction here, but this part extracts the short-term fluctuation features in the data with an MLP. The IFM combines the features extracted by the other two modules to obtain the final prediction result. Compared with other baselines, this framework more accurately captures the long-term trends of time series and obtains higher prediction accuracy.
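The mechanics of dilated causal convolution used by TCN and HyDCNN fit in a few lines: the output at step t only touches inputs at t, t−d, t−2d, …, so no future information leaks, and the receptive field grows with the dilation d (and, when layers are stacked, exponentially with depth). The sketch below is a single illustrative layer, not the full TCN residual block.

```python
def dilated_causal_conv1d(x, kernel, dilation=1):
    """1-D dilated causal convolution with implicit left zero-padding."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation   # look only at the past
            if idx >= 0:             # positions before the series are zero-padded
                acc += w * x[idx]
        out.append(acc)
    return out

print(dilated_causal_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], dilation=2))
# [1.0, 2.0, 4.0, 6.0]
```

With kernel size k and dilation d, each output sees k inputs spread over (k−1)·d + 1 past steps; doubling d at each layer is what lets TCN cover very long histories with few parameters.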

Fig. 9. The structure of a CNN with two causal convolutional layers. The input layer has dilation = 1; the hidden layer has dilation = 2.

Table 4. CNN-based Model. Remark denotes the research directions of each work in enhancing LSTF. †: Explicitly revealing the correlations between variables; ‡: Implicitly increasing the intersequence dependency learning abilities of models; §: Few parameters and low computational complexity.

| Ref | Year | Technique | Baseline | Comments | Remark |
| --- | --- | --- | --- | --- | --- |
| TCN [77] | 2018 | Causal CNN + Dilated CNN | LSTM, GRU, RNN | Dilated causal convolution can handle long sequence inputs; residual connections deepen the network. | |
| DSANet [78] | 2019 | Self-Attention + CNN + AR | VAR, LRidge, LSVR, GP, GRU, LSTNet-S, LSTNet-A, TPA | Parallel convolution for global and local temporal correlation; AR for robustness. | |
| PFNet [79] | 2020 | Highway-CNN + MLP | VAR, RNN (AR+GRU), MHA, LSTNet, MLCNN, MTGNN | LTPM for long-term trends, SFPM for short-term fluctuations. | |
| HyDCNN [80] | 2021 | Dilated CNN + AR | AR, VAR-MLP, GRU, TCN, LSTNet, TPA-LSTM, MTGNN | Autoregression and dilated causal convolution capture cyclicality and trend, respectively. | |
| SCINet [81] | 2022 | CNN + Encoder–Decoder | Autoformer, Informer, Transformer, TCN, LSTNet, TPA-LSTM | A hierarchical downsample-convolve-interact framework that effectively models time series with complex temporal dynamics. | § |

Table 5. GNN-based Model. Remark denotes the research directions of each work in enhancing LSTF. : Explicitly revealing the correlations between variables; : Implicitly increasing the intersequence dependency learning abilities of models; §: Few parameters and low computational complexity.

Ref | Year | Technique | Baseline | Comments | Remark
MTGNN [82] | 2020 | GCN + TCN + ResNet | AR, VAR-MLP, GP, RNN-GRU, LSTNet, TPA-LSTM, DCRNN, STGCN, Graph WaveNet, ST-MetaNet, GMAN, MRA-BGCN | Graph learning module for adaptive extraction of a sparse graph adjacency matrix, with TC and GC modules. |
AutoSTG [83] | 2021 | GCN + Meta Learning | HA, GBRT, GAT-Seq2Seq, DCRNN, Graph WaveNet, ST-MetaNet+, RANDOM, DARTS | Meta learning of the adjacency matrix and convolution kernels for dynamic graph learning. |
DMSTGCN [84] | 2021 | Dynamic Graph Constructor + DGCN + TCN | HA, VAR, LR, XGBoost, DCRNN, ASTGCN, GMAN, Graph WaveNet, MTGNN | Dynamic graph constructor based on tensor decomposition. |
REST [85] | 2021 | DCRNN + GCN | ARIMA, VAR, SVR, FC-LSTM, WaveNet, DCRNN, Graph WaveNet | EINS to infer multimodal directed weighted graphs. |
TPGNN [86] | 2022 | GNN + Encoder–Decoder + Attention | VARMLP, GP, RNN-GRU, LSTNet, TPA-LSTM, MTGNN, TPGNN, Informer, Graph WaveNet, DCRNN, STGCN | A TPG module to represent correlations as a time-varying matrix polynomial. |
CNN with Attention. Although LSTM, GRU, and other combinations with attention perform better at capturing long-term predictions, they do not perform well on multivariate long-term forecasting. To better capture long-term correlations in multivariate long-term prediction, Huang et al. [78] proposed a dual self-attention network (DSANet) for dynamic periodic or aperiodic sequences. The model considers both global and local temporal patterns, omits the recursive structure of traditional RNNs, and employs two parallel convolutions (global temporal convolution and local temporal convolution) to capture a complex mixture of global and local temporal patterns. In addition, the model uses a self-attention mechanism to learn the dependencies among multiple time series. To further improve robustness, it also integrates a traditional AR linear model in parallel with the nonlinear neural network.
CNNs with Seq2Seq. Similarly, there have been attempts to combine CNNs and Seq2Seq structures to optimize long time-series prediction. Liu et al. [81] proposed SCINet, which uses multiple convolutional filters to extract distinct and valuable temporal features from downsampled subsequences or features. By aggregating these rich features from multiple resolutions, SCINet effectively models complex temporal dynamics at multiple resolutions and can better capture both short-term (local temporal dynamics) and long-term (trend, seasonality) dependencies to make accurate forecasts in LSTF.
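The downsample-interact idea behind SCINet can be sketched in a few lines. This is a toy NumPy illustration only: the real SCINet uses learned convolutional modules and multiplicative modulation, whereas `phi` and `psi` here are hand-picked placeholder transforms.

```python
import numpy as np

def split_even_odd(x):
    """SCINet-style downsampling: split a sequence into even- and odd-indexed
    subsequences, each at half the original temporal resolution."""
    return x[::2], x[1::2]

def interact(even, odd, phi=np.tanh, psi=np.tanh):
    """A toy interactive-learning step: each branch is updated using a
    transform of the other branch (learned modules in the actual model)."""
    return even + psi(odd), odd + phi(even)

x = np.arange(8, dtype=float)
even, odd = split_even_odd(x)    # even = [0, 2, 4, 6], odd = [1, 3, 5, 7]
e2, o2 = interact(even, odd)     # information exchanged across resolutions
```

Applying the split recursively yields the binary-tree, multi-resolution structure that lets the model see both local dynamics and coarse trends.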

3.3. GNN-based model

The GNN is a network structure commonly used for spatio-temporal multivariate time series forecasting. Spatio-temporal data is a special kind of time series data: unlike general time series data, it exhibits both temporality and spatiality, with dependencies among its features. Therefore, predicting such data requires models that capture not only the temporal dynamic features of the data but also the spatial static features among the data. GNNs have developed rapidly in recent years by combining ideas from graph theory with earlier DL ideas, producing networks such as GCNs, GATs, GAEs, GGNs, and GSTMs. Many existing works have already reviewed GNNs in terms of aspects such as model structures and application areas [21], [87], [88], [89], [90]. Therefore, in this paper, we do not revisit them but only provide a complementary introduction to the challenges and work of GNNs on spatio-temporal long-term forecasting. Table 5 summarizes the discussed complementary GNN-based models.
In LSTF, a predefined static graph structure is often not very effective for spatio-temporal long-term forecasting: as the prediction horizon increases, the hidden topological graph structure usually changes. Researchers have therefore begun to study adaptive and dynamic graph structures to address this problem of long-term spatio-temporal forecasting.
GNN with adaptive dynamic graph structure. As the prediction horizon increases, the influence between nodes usually changes with traffic flow, and this change in graph structure is especially significant in long-term traffic forecasting. Han et al. [84] designed DMSTGCN, based on dynamic GNNs, to predict traffic speed with a dynamic graph convolution module. The model uses a dynamic graph constructor based on tensor decomposition to model the dynamic spatial relationships between road segments while considering the periodic and dynamic characteristics of traffic; in addition, dynamic graph convolution captures the spatial correlations among different road segments. This approach alleviates the problem of changing graph structure in long-term traffic forecasting through interpretable dynamic spatial relationships among road segments. Other researchers have explored adaptive graph structures to alleviate this problem. AutoSTG [83] is an automatic prediction framework for spatio-temporal graphs in which spatial graph convolution and temporal convolution operations capture complex spatio-temporal correlations. In particular, the framework uses meta-learning to learn the adjacency matrix of the spatial graph convolution layers and the kernels of the temporal convolution layers from the meta-knowledge of the attribute graph. Similarly, MTGNN [82] is a general GNN structure. Instead of requiring a well-defined predetermined graph structure, it uses a graph learning module to adaptively extract a sparse graph adjacency matrix from the input data and then captures the spatial and temporal correlations of the data by interleaving a graph convolution module with a temporal convolution module.
Furthermore, to avoid the vanishing gradient problem, residual and skip connections are added between the temporal convolution module and the graph convolution module. High-quality graphs can improve accuracy in spatio-temporal long-term prediction, and the authors of REST [85] similarly concluded that the effectiveness of GCNs in spatio-temporal prediction is closely related to the quality of the graph structure. REST can automatically mine and infer multimodal directed weighted graphs. The model combines an inference edge network (EINS) with a GCNS. The EINS projects spatial dependencies between time series nodes from the temporal side to the spatial side and generates multimodal directed weighted graphs for the GCNS; the GCNS uses these spatial correlations to make predictions and then provides feedback to optimize the EINS. The spatial dependencies inferred by the EINS help the GCNS make more accurate predictions, and the supervised training of the GCNS helps the EINS learn better distance measurements. The two components collaborate to build high-quality graphs, improving accuracy in long-term spatio-temporal forecasting. Also addressing the dynamic graph problem in LSTF, Liu et al. [86] proposed TPGNN, which represents the correlations of dynamic variables as temporal matrix polynomials in two steps: first, a static matrix basis captures the overall correlations; then, a set of time-varying coefficients and the matrix basis construct matrix polynomials for each time step. TPGNN achieves state-of-the-art performance on long-term prediction.
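A simplified NumPy sketch of this kind of graph learning module follows; it omits MTGNN's subtraction term and trainable scaling, which are details of the paper, and uses random embeddings in place of trained ones. It derives a sparse, row-normalized directed adjacency matrix from two node-embedding matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, emb_dim = 5, 8

# Two node-embedding matrices; randomly initialized here, but trained
# jointly with the forecasting loss in MTGNN.
E1 = rng.standard_normal((num_nodes, emb_dim))
E2 = rng.standard_normal((num_nodes, emb_dim))

def adaptive_adjacency(E1, E2, k=2):
    """Derive a sparse directed adjacency matrix from node embeddings,
    keeping only the top-k neighbors per node and row-normalizing."""
    A = np.maximum(np.tanh(E1 @ E2.T), 0.0)   # ReLU(tanh(E1 E2^T))
    for i in range(A.shape[0]):               # top-k sparsification per row
        thresh = np.sort(A[i])[-k]
        A[i][A[i] < thresh] = 0.0
    row_sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

A = adaptive_adjacency(E1, E2)   # fed to the graph convolution module
```

Because the adjacency matrix is a function of learnable embeddings rather than a fixed input, the graph structure can adapt as the data (and the forecast horizon) change.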

3.4. Transformer-based model

In 2017, Google proposed the original Transformer [32], a model that completely discards traditional CNNs and RNNs, to handle sequence tasks. A Transformer captures correlations among sequence data mainly through a self-attention mechanism; it has better parallelism than RNN structures and, like CNNs, can interact with global information. Compared with traditional DL architectures, the self-attention mechanism in a Transformer is more interpretable and has been highly influential. However, Transformers also have drawbacks: the time complexity of the self-attention computation is O(n²), leading to high computational costs; the utilization of positional information is limited, since the position embedding used in the embedding process is not fully effective; and very long-distance information cannot be captured. The original Transformer was mainly used for tasks in the NLP domain, such as machine translation. Owing to the similarity of sequence prediction tasks, its large model capacity, and the ease with which it models long-time dependencies, it was soon adopted for LSTF in time series prediction, and its shortcomings have gradually been addressed, as shown in Table 6.
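To make the quadratic cost concrete, here is a minimal NumPy sketch of scaled dot-product self-attention (the generic mechanism, not any specific model's implementation); the L x L score matrix is where the O(L²) time and memory arise:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Vanilla scaled dot-product self-attention over a sequence of length L.
    The (L, L) score matrix makes the cost quadratic in L."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # shape (L, L): O(L^2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
L, d = 16, 4
X = rng.standard_normal((L, d))                        # toy embedded sequence
W = [rng.standard_normal((d, d)) for _ in range(3)]    # random projections
out, attn = self_attention(X, *W)
```

Doubling L quadruples the size of `attn`, which is exactly the bottleneck the sparse-attention variants below attack.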
Transformer variant. Wu et al. [92] introduced the vanilla Transformer to temporal prediction for influenza disease forecasting. Instead of processing data in an ordered sequential manner as an RNN does, the Transformer processes the entire given data sequence and uses a self-attention mechanism to learn the dependencies in the temporal data. The model therefore relies heavily on embeddings to obtain the positional relationships between sequences. However, it often fails to adequately encode long sequences when processing long-sequence inputs and tends to lose long-term dependency items. To address the long-term dependency items of repeatable fluctuation patterns, Lin et al. [94] built SpringNet, a Transformer-based model for solar prediction. They proposed a DTW attention layer to capture the local correlations of time series data; this attention mechanism helps capture repeatable fluctuation patterns and provides accurate predictions, especially in the presence of fluctuations. With the same purpose, Chu et al. [99] combined Autoformer, Informer, and Reformer in a forecasting model based on stacking ensemble learning, where the results of the base learners are used to train a meta-learner (MLP) that generates the future forecasts. In contrast, TCCT [98] designs a CSPAttention module that merges CSPNet with a self-attention mechanism and replaces the typical convolutional layer with a dilated causal convolutional layer, modifying the distillation operation employed by Informer to attain exponential receptive-field growth. In addition, the model develops a pass-through mechanism for stacking self-attention blocks, which helps Transformer-like models obtain finer-grained information at negligible additional computational cost.
To mitigate the high computational cost caused by the time complexity of self-attention in LSTF, the LogSparse Transformer [91] was designed to reduce the time complexity to O(L(log L)²) and break the memory bottleneck. The model employs convolutional self-attention, which uses causal convolution to generate Q and K and thus better integrates local context into the attention mechanism. Under the LogSparse scheme, each cell only needs to compute dot products with O(log L) cells in each layer, so stacking O(log L) layers allows the model to access all the information. The prediction accuracy achieved on fine-grained, long-term-dependent time series can thus be improved under limited memory budgets. The Informer [17] further reduces the computational complexity and memory usage of the attention mechanism with ProbSparse self-attention, which efficiently replaces regular self-attention and incurs O(L log L) time complexity and O(L log L) memory usage. At the same time, a self-attention distillation operation significantly reduces the spatial complexity of the model computation, enabling the model to handle longer sequence inputs. In addition, the decoder uses one-step generative output to prevent the cumulative propagation of errors. Lee et al. [95] built a partial-correlation-based attention mechanism on a Transformer to conduct partial correlation analyses of attention scores and applied it to multivariate time series forecasting. In addition, a data-driven series-wise multiresolution convolutional layer was designed to represent the input time series data for domain-agnostic learning. However, their paper did not provide an experimental validation of the proposed model, instead discussing its theory and innovations in greater detail.
The effectiveness of this model in practical prediction applications has yet to be tested. Pyraformer [96], one of the latest works in the field (2022), further pushes the limits of Transformers in computational time complexity. It utilizes a pyramidal attention mechanism to form multiresolution representations of time series; the maximum length of the signal traversal path in Pyraformer is a constant with respect to the sequence length L, and the time and space complexity reach O(L), enabling the effective capture of long-range interdependencies in LSTF. Similarly to Pyraformer, Cirstea et al. [19] proposed Triformer, a triangular, variable-specific attention architecture that achieves linear complexity through patch attention while providing a lightweight approach to enable variable-specific model parameters. Moreover, Li et al. [100] proposed a generalizable memory-driven Transformer that integrates multiple time series features to drive the prediction procedure through a global-level memory component. In addition, an incremental approach is used to train the model by gradually introducing Bernoulli noise to improve its generalizability. This work provides a plug-and-play component that improves the accuracy of LSTF and alleviates the tendency of models to overfit in LSTF.
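One simple instantiation of a LogSparse-style attention pattern (an illustration of the idea, not the exact index set used by LogTrans) restricts each position to itself and to past positions at power-of-two distances, so each row of the attention mask has O(log L) entries instead of O(L):

```python
import numpy as np

def logsparse_mask(L):
    """Causal log-sparse attention pattern: position t attends to itself and
    to positions t - 2^0, t - 2^1, t - 2^2, ... within the sequence."""
    mask = np.zeros((L, L), dtype=bool)
    for t in range(L):
        mask[t, t] = True            # every position attends to itself
        step = 1
        while t - step >= 0:         # exponentially spaced past positions
            mask[t, t - step] = True
            step *= 2
    return mask

mask = logsparse_mask(16)            # row t has at most 1 + ceil(log2(L)) True entries
```

Because any cell can still be reached through O(log L) stacked layers, the full receptive field is preserved at a fraction of the dense cost.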

Table 6. Transformer-based Model. Remark denotes the research directions of each work in enhancing LSTF. : Explicitly revealing the correlations between variables; : Implicitly increasing the intersequence dependency learning abilities of models; §: Few parameters and low computational complexity.

Ref | Year | Technique | Baseline | Comments | Remark
LogTrans [91] | 2019 | LogSparse self-atten + Transformer | ARIMA, ETS, TRMF, DeepAR, DeepState | Convolutional self-attention to reduce time complexity; demonstrated the Transformer's ability to handle long-term dependencies. | §
Wu et al. [92] | 2020 | Transformer | ARIMA, LSTM, Seq2Seq(GRU)+Attn | Transformer for influenza LSTF. |
AST [93] | 2020 | GAN + Transformer | ARIMA, ETS, TRMF, DeepAR, DSSM, ConvTrans | Sparse attention with GAN to reduce error accumulation. | §
SpringNet [94] | 2020 | Spring Attention + Transformer | DeepAR, LogSparse Transformer | Spring attention for repeatable long-term-dependency fluctuation patterns. |
Lee et al. [95] | 2020 | Partial Correlation-based Attention + Series-wise Multi-resolution Convolution + Transformer | | Partial-correlation-based attention to improve on the disadvantages of pairwise-comparison-based attention. |
Informer [17] | 2021 | ProbSparse Self-attention + Self-attention Distilling + Generative Style Decoder | LogTrans, Reformer, LSTMa, DeepAR, ARIMA, Prophet, LSTnet | Sparse and computationally efficient Transformer architecture. | §
Autoformer [33] | 2021 | Sequence decomposition + Auto-Correlation + Transformer | Informer, LogTrans, Reformer, LSTNet, LSTM, TCN | Auto-correlation and decomposition architecture. | §
Pyraformer [96] | 2022 | Pyramidal Attention Module + Transformer | Informer, LogTrans, Longformer, Reformer, ETC | Pyramidal attention module for multi-resolution representation. | §
FEDformer [97] | 2022 | Fourier Enhanced + Wavelet Enhanced + Transformer | Autoformer, Informer, LogTrans, Reformer | Time-complexity reduction with frequency-domain decomposition based on the Autoformer architecture. | §
TCCT [98] | 2022 | CNN + CSPNet Attention + Transformer | Informer, ARIMA, Prophet, LSTMa | CSPAttention reduces computational costs. | §
Chu et al. [99] | 2022 | Autoformer + Informer + Reformer + MLP | Autoformer, Informer, Reformer, Transformer | Stacking-ensemble forecasting model incorporating multiple Transformer variants and a meta-learner. |
Li et al. [100] | 2022 | Transformer + Gate Mechanism | Informer, Transformer, ProbTrans | A generalizable memory-driven Transformer plugin with progressive training. |
Quatformer [101] | 2022 | Transformer + learning-to-rotate attention + trend normalization | Autoformer, Informer, LogTrans, Reformer, LSTM, TCN | A quaternion architecture with learning-to-rotate attention. | §
Muformer [18] | 2022 | Multi-granularity attention + Transformer + Kullback–Leibler | Informer, LogTrans, Reformer, LSTMa, DeepAR, ARIMA, Prophet | Muformer architecture for multi-perceptual-domain feature enhancement and multi-headed attention expression enhancement. |
Sepformer [102] | 2022 | Transformer + Discrete wavelet transform | Informer, LogTrans, Reformer, LSTMa, LSTnet | Sepformer architecture combining discrete wavelet transforms to enhance feature extraction and reduce computational complexity. | §
Triformer [19] | 2022 | Transformer + Patch Attention | Reformer, LogTrans, StemGNN, AGCRN, Informer, Autoformer | A triangular, variable-specific attention architecture that achieves linear complexity and captures different temporal dynamic patterns. | §
Transformer with time–frequency analysis methods. Usually, series decomposition methods such as STL are used as preprocessing steps for TSF. In contrast, the Autoformer [33] framework breaks away from the traditional idea of using sequence decomposition only for preprocessing and designs a deep decomposition architecture that progressively decomposes time series into two aspects: periodicity and trend. This architecture embeds the sequence decomposition strategy into the encoder–decoder structure as an internal unit of Autoformer. During prediction, the model alternates between prediction-result optimization and sequence decomposition; i.e., progressive decomposition is achieved by gradually separating the trend and period terms from the hidden variables. In addition, the model establishes an auto-correlation mechanism based on stochastic process theory to replace the self-attention mechanism of the vanilla Transformer, realizing series-wise connections with O(L log L) computational time complexity. FEDformer [97], also one of the latest works in this field (2022), takes an alternative approach and optimizes the vanilla Transformer from the frequency-domain perspective. By adding a Fourier enhancement module and a wavelet enhancement module to the Transformer, FEDformer captures the significant structures of a time series through frequency-domain mapping; it achieves O(L) linear computational complexity and storage overhead by randomly selecting a fixed number of Fourier components, substantially improving prediction accuracy while effectively capturing long-term dependencies. Also addressing the LSTF time-complexity optimization problem, Chen et al. [101] proposed the Quatformer framework, in which learning-to-rotate attention (LRA) introduces learnable period and phase information to describe complex periodic patterns, trend normalization models the normalization of the sequence representation in the hidden layers, and decoupling the LRA with global memory achieves linear complexity without loss of prediction accuracy while effectively fitting the multi-periodic complex patterns in LSTF. To address the high computational complexity of Transformers in LSTF, their limited capture of fine-grained data features, and their poor interpretability, Fan et al. [102] proposed the Sepformer architecture, which enhances feature extraction and reduces the computational complexity of the model to O(CN). Sepformer uses independent networks to extract data-stream features in parallel, and its SWformer and Mini-SWformer variants use the discrete wavelet transform to separate high and low frequencies. Whereas the above works optimize time complexity, Zeng et al. [18] targeted the redundant-information problem of the Transformer and proposed Muformer, in which an input multi-perceptual-domain processing mechanism enhances features, while a multi-granularity attention-head mechanism and an attention-head pruning mechanism enhance the expressiveness of multi-headed attention, alleviating the problem of redundant information input in LSTF.
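At its core, Autoformer's inner series-decomposition unit is a moving average that splits a series into trend and seasonal parts. A minimal NumPy sketch follows (edge padding is used here as a simplification of the padding in the paper):

```python
import numpy as np

def series_decomp(x, kernel=5):
    """Autoformer-style series decomposition: a moving average extracts the
    trend component, and the remainder is treated as the seasonal part."""
    pad = kernel // 2
    # replicate the endpoints so the moving average keeps the input length
    xp = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    trend = np.convolve(xp, np.ones(kernel) / kernel, mode="valid")
    seasonal = x - trend
    return seasonal, trend

t = np.arange(64)
x = 0.1 * t + np.sin(2 * np.pi * t / 8)     # linear trend + period-8 seasonality
seasonal, trend = series_decomp(x, kernel=9)
```

By construction the two components sum back to the original series; in Autoformer this unit is applied repeatedly inside the encoder and decoder rather than once as preprocessing.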
Transformer with GAN. In recent years, GANs have been widely studied in the imaging field. Wu et al. [93] introduced GANs to LSTF and designed a GAN-based adversarial sparse Transformer (AST) to address the inability to predict long series due to error accumulation. The model uses a Transformer with sparse attention as the generator and an adversary to alleviate the error accumulation of multi-step prediction. The adversarial training process improves the robustness and generalization of the model and regularizes the prediction model to improve its sequence-level capability.

3.5. Compound model

As seen above, different base-model backbones have their own advantages and disadvantages for LSTF. For example, RNNs are good at capturing long-term macro dependencies in sequence data but suffer from problems such as gradient explosion and slow inference, whereas CNNs parallelize well but are better at capturing short-term local dependencies in sequence data. By combining the strengths and weaknesses of each type of model, researchers have targeted specific forecasting problems and difficulties. The compound models categorized in this paper are models whose backbones combine the techniques of the previous classes of models, as shown in Table 7.
RNNs + CNNs. RNNs are better at capturing long-term macroscopic dependencies between sequence data but have slow inference, while CNNs parallelize well and are better at capturing short-term local dependencies. Researchers have therefore combined the advantages of both to better handle long-time dependencies in LSTF. A classic, concise, and effective hybrid model is LSTNet [105]. This model integrates CNN, RNN, and AR techniques, using CNNs and RNNs to extract short-term local dependency patterns between variables and to discover long-term patterns for determining time series trends. A traditional AR linear model is joined in parallel with the nonlinear neural network parts, making the nonlinear DL model more robust to time series with large changes in data scale. In addition, LSTNet introduces a new recurrent structure, the recurrent-skip structure, which exploits the periodicity of the input time series signals and can capture very long time-dependent patterns. DAQFF [108] is a spatio-temporal prediction model that combines the advantages of RNNs and CNNs to solve the long-term spatio-temporal correlation capture problem in multivariate forecasting. The model addresses the dynamic, spatio-temporal, and nonlinear characteristics of multivariate air-quality time series by first extracting local trend features and spatial features through one-dimensional CNNs and then learning spatio-temporal dependencies through a BiLSTM. The spatio-temporal correlations in the given data are effectively captured by a simple model structure. The MLCNN [110] also integrates CNN, RNN, and AR techniques; inspired by psychological construal-level theory, it captures long-term dependence by fusing prediction information from different future visions.
The model first extracts a multilevel abstract representation of the data with a CNN, then feeds the extracted features into a shared LSTM to perform long-range and short-term future prediction. A main LSTM also performs feature extraction for the main task to make more accurate predictions: the shared LSTM works as the encoder of the main-task features, while the main LSTM works as the decoder. In addition, a traditional AR linear model is combined with the nonlinear part of the neural network. Finally, the future-vision aspects are fused by this improved encoder–decoder architecture.
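The parallel AR "highway" shared by LSTNet and MLCNN reduces to a one-line combination: the final forecast is the sum of the nonlinear network's output and a linear AR forecast over the last p observations. The sketch below shows the idea in plain Python; the function names, weights, and the stand-in scalar network output are illustrative, not taken from the papers.

```python
def ar_forecast(history, weights, bias=0.0):
    """Linear AR(p) forecast from the last p observations (p = len(weights)).

    Weights are ordered oldest-to-newest over the lookback window.
    """
    p = len(weights)
    assert len(history) >= p, "need at least p observations"
    return bias + sum(w * x for w, x in zip(weights, history[-p:]))

def combine_with_ar(nn_output, history, ar_weights, ar_bias=0.0):
    """LSTNet-style parallel highway: nonlinear prediction plus linear AR prediction.

    The AR term keeps the final output sensitive to the raw scale of the input,
    which makes the model more robust to large scale changes in the series.
    """
    return nn_output + ar_forecast(history, ar_weights, ar_bias)

# Toy usage: a hypothetical neural output of 0.4 plus an AR(2) term.
pred = combine_with_ar(0.4, [1.0, 2.0, 3.0], ar_weights=[0.5, 0.5])
```

Because the AR term is linear in the raw inputs, the combined output tracks shifts in the input scale even when the saturating nonlinear part does not.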

Table 7. Compound Model. Remark denotes the research directions of each work in enhancing LSTF. : Explicitly revealing the correlations between variables; : Implicitly increasing the intersequence dependency learning abilities of models; §: Few parameters and low computational complexity.

| Ref | Year | Technique | Baseline | Comments | Remark |
|---|---|---|---|---|---|
| DCRNN [103] | 2017 | Diffusion Conv + RNN + Encoder–Decoder + Graph | HA, ARIMA, VAR, SVR, FNN, FC-LSTM | New form of convolution to optimize RNNs to capture spatial correlation using bi-directional random walks on the graph. | |
| MTNet [104] | 2018 | CNN + Atten + RNN + AR | AR, LRidge, LSVR, GP, VAR-MLP, RNN-GRU, DA-RNN, LSTNet | Large memory component with CNN, RNN, and attention for storage of long-term historical data. | |
| LSTNet [105] | 2018 | GRU + CNN + AR | AR, LRidge, LSVAR, TRMF, GP, VAR-MLP, RNN-GRU | CNN for short-term local dependencies, RNN for long-term trend dependencies, AR for robustness improvement. | |
| Zhao et al. [106] | 2018 | Wavelet Transform + CNN + LSTM + Atten | ARIMA, SVR, ANN, LSTM, CNN, Ensemble of LSTM & CNN | CNN and LSTM for long- and short-term dependence capture after frequency-domain transformation. | |
| RESTFul [107] | 2018 | GNN + CNN + GRU + MLP | SVR, ARIMA, MLP, GRU, Deep-GRU, Dipole, DA-RNN | Multi-resolution forecasting framework with CNN and GRU alternation. | |
| DAQFF [108] | 2019 | CNN + LSTM | ARIMA, SVR-POLY, SVR-RBF, SVR-LINEAR, LSTM, GRU, CNN, RNN | CNN for local trend features, LSTM for long-term spatial–temporal dependence. | |
| GC-LSTM [109] | 2019 | GCN + LSTM | MLR, FNN, LSTM | GCN for spatial correlation to assist LSTM long-term correlation capture. | |
| MLCNN [110] | 2020 | CNN + RNN + AR | VAR, RNN-LSTM, MTCNN, AECRNN, LSTNet | CNN combined with LSTM for multi-resolution feature extraction, AR for improving robustness. | |
| Forecaster [111] | 2020 | Markov + Graph + Transformer | VAR, DCRNN, Transformer (same width/best width) | Gauss–Markov random field theory to simplify the Transformer. | § |
| STNN [112] | 2020 | Transformer + GCN | HA, ARIMA, LSVR, FNN, FC-LSTM, STGCN, DRCNN, Graph WaveNet | Spatial Transformer and temporal Transformer to capture directed spatial dependence and long-term temporal dependence. | § |
| DeepFEC [36] | 2021 | ResNet + BiLSTM | MR, SVR, ANNs, RNN, GRU, LSTM, Bi-LSTM, DCNN, ST-ResNet, LC-RNN | Dynamic aggregation of spatial and temporal correlations. | |
| NAST [113] | 2021 | Spatial–temporal Attention + Transformer | RNN-ED, Transformer, TST-NOLOGSPARSE | New spatial–temporal attention and non-autoregressive architectures. | § |
| NET3 [114] | 2021 | TGCN + TRNN | DynaMMo, MLDS, DCRNN, STGCN | TGCN and TRNN collaboration improves long-term dependency capture and reduces the number of model parameters. | § |
| GCLSTM + GCTrafo [115] | 2021 | GCN + LSTM + Transformer | STAR, STCNN, EDLSTM, SVR | GCLSTM and GCTrafo for long-term dependency capture and reduction of the number of parameters. | § |
| MS-LSTM [116] | 2022 | LSTM + Attention + Seq2Seq + CNN | AR, DeepAR, Transformer, LSTM, LST-Attn | MS-LSTM structure for LSTM with multi-scale feature extraction. | |
| ST-KMRN [117] | 2022 | GNN + CNN + Koopman + Encoder–Decoder + Self-attention | HA, Static, GRU, Informer, Graph WaveNet, MTGNN, KoopmanAE | A multi-resolution long-term feature modeling framework incorporating self-attention and Koopman theory for physical information modules. | |
To solve the problem of capturing extremely long-term patterns while effectively incorporating other variable information in multivariate time series forecasting, MTNet [104] integrates CNN, RNN, and attention techniques and mainly consists of a large memory component, three independent encoders, and a jointly trained AR component. The memory component is used to store long-term historical data, while the encoders encode the input and memory data into feature values. After calculating the similarity between the input data and the data in the memory component, an attention mechanism is used to compute the attention weights of all memory blocks. In addition, the model borrows from the LSTNet structure by combining a traditional linear AR model with the nonlinear neural network to improve the model's sensitivity to data scale changes. Zhao et al. [106] designed a new model to capture long-term dependence from a frequency-domain perspective. Their work considered the fact that real-life time series are driven by multiple potential components occurring at different frequencies. Therefore, drawing on signal processing, they explicitly revealed the frequency-domain information in the input data using a wavelet transform and then used a CNN and an LSTM on the transformed data to capture local time–frequency features and global long-term trends, respectively. Attention mechanisms were further used to distinguish the importance levels of different local features and to fuse the local features with the global features. DeepFEC [36] was established as a new deep spatio-temporal contextual association network for predicting the energy consumption levels of roads. This approach uses ResNet to capture the spatial correlations between different road segments, while BiLSTM learns the temporal correlations between sequential data.
The method reduces the number of parameters and increases accuracy, improving the processing of long-series data and the ability to capture correlations in the data. Finally, Wang et al. [116] proposed MS-LSTM, a long sequence time-series prediction method using multi-scale feature extraction and a sequence-to-sequence (seq2seq) attention mechanism over LSTM hidden states; it uses an LSTM and a CNN to extract long-term and short-term features at different scales, respectively, and uses the multi-scale information to assist long-term forecasting.
RNNs+CNNs+GNNs. Graph neural networks are also often fused with models such as CNNs and RNNs, where they capture spatial correlations between serial data to improve the accuracy of long-term forecasts. Qi et al. [109] combined the advantages of a GCN and an LSTM network to design the GC-LSTM model and used it for meteorological data prediction. The model first introduces a GCN to capture the spatial correlations between different meteorological stations and then captures the temporal correlations among the observations at different moments via LSTM, finally achieving effective LSTF for meteorological data. Similarly, many researchers have started to introduce variants of GCN, RNN and CNN models in hybrid-model experiments. The DCRNN [103] was designed as another new convolutional network form for optimizing RNNs, alleviating the LSTF problem in traffic flow forecasting. It uses bidirectional random walks on the input graph to capture spatial correlations and uses an encoder–decoder structure and scheduled sampling to capture temporal correlations. The relationships between traffic sensors are modeled by directed graphs, the traffic flow dynamics are modeled as a diffusion process, and a diffusion convolution operation is utilized to capture spatial correlations. Jing et al. [114] proposed a new tensor time series network model called NET3. The model consists of two modules: a tensor GCN (TGCN) and a tensor RNN (TRNN). The TGCN captures the relationships between multiple graphs by generalizing the GCN from planar graphs to tensor graphs, and the TRNN models the implicit dependencies between time series using tensor decomposition. In this way, the noise in the data and the number of parameters in the model can be effectively reduced, improving the ability to handle long-sequence data. RESTFul [107] was then established as a multi-resolution framework that uses time series data at different granularities.
The framework starts with a recursive component (a GRU) to encode the temporal patterns at each resolution. A convolutional component (a CNN) is then used to fuse the temporal patterns of the data at the different resolutions captured by the GRU. Finally, the learned feature vectors are fed into an MLP to obtain the prediction output. Multi-resolution feature representation thus provides more feature information for long-term prediction.
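The multi-resolution input views used by RESTFul (and, similarly, the multi-scale features in MS-LSTM) can be illustrated with simple non-overlapping mean pooling; the scales chosen below are illustrative, not taken from the papers.

```python
def downsample(series, scale):
    """Non-overlapping mean pooling: one coarse point per `scale` fine points."""
    return [sum(series[i:i + scale]) / scale
            for i in range(0, len(series) - scale + 1, scale)]

series = [1, 2, 3, 4, 5, 6, 7, 8]
# One view per resolution: views[1] is the raw series,
# views[2] and views[4] are progressively coarser summaries.
views = {scale: downsample(series, scale) for scale in (1, 2, 4)}
```

A recurrent encoder applied to each view sees the same history at different granularities, so the coarse views expose long-range trends with far fewer steps.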
Transformers+GNNs. Transformers show better performance in LSTF, and researchers have explored combining Transformers and GNNs to solve the long-horizon prediction problem in spatio-temporal forecasting. The STNN [112] is a new spatio-temporal Transformer network that improves the accuracy of long-term traffic flow prediction by combining a Transformer and a GCN to dynamically model directional spatial dependencies and long-term temporal dependencies; the network is mainly divided into a spatial Transformer and a temporal Transformer. The spatial Transformer dynamically models the directional spatial dependencies between data by using a self-attention mechanism to obtain the real-time status and direction of a traffic flow and captures different spatial dependencies with a multi-headed attention mechanism. The temporal Transformer, on the other hand, captures remote bidirectional temporal dependencies across multiple time steps. This reduces the accumulation of prediction errors and enables parallel training and prediction. Chen et al. [113] established a Transformer-based NAST model for spatio-temporal forecasting problems. This model was the first non-AR Transformer architecture and is used to overcome the output delay and cumulative error of the decoder in the traditional Transformer. The model has the QGB as its core module; the QGB generates queries in one step, and these queries have the same length as the target sequence. Using the generated queries Q, the temporal Transformer decoder is able to execute the prediction process in parallel, thus preventing error accumulation. In addition, the model utilizes a new spatio-temporal attention mechanism that first embeds the input features into a high-dimensional tensor and predicts attention maps in both the spatial and temporal directions.
Then, the temporal attention map is permuted to obtain a temporal influence map, which is multiplied with the spatial attention map to obtain the overall spatio-temporal attention map for joint learning, so that the spatial and temporal dependencies can be processed together. From another perspective, Forecaster [111], a graph-based Transformer framework for spatio-temporal data prediction, has also been proposed. The framework first learns the topology of the given graph through Markov random field theory and then, using this dependency graph structure, replaces the linear layers of the Transformer with sparse linear layers, allowing the sparse Transformer to simultaneously capture the potential spatio-temporal dependencies, non-smoothness and heterogeneity in the data while reducing the computational complexity of the Transformer to some extent. It effectively mitigates the impact of non-stationarity problems in LSTF.
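The graph-convolution step these hybrids embed can be summarized as "average each node's features with those of its neighbors, then apply a linear map". The minimal sketch below uses a row-normalized adjacency matrix with self-loops; this is a simplification of the symmetric normalization used in standard GCNs, and all matrices are illustrative.

```python
def matmul(a, b):
    """Plain-Python matrix multiplication for small examples."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: average each node's features with its
    neighbors' (self-loops added, rows normalized), then apply a linear map."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own signal.
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Row-normalize so aggregation is an average rather than a sum.
    a = [[v / sum(row) for v in row] for row in a]
    return matmul(matmul(a, feats), weight)

# Two connected nodes smooth their features toward each other before the linear map.
h = gcn_layer([[0.0, 1.0], [1.0, 0.0]], [[1.0], [3.0]], [[2.0]])
```

Stacking such layers before (or inside) a recurrent or attention module is what lets the hybrids above inject spatial structure into the temporal model.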
Transformers+RNNs+GNNs. Moreover, Carrillo et al. [115] established a graph convolutional LSTM (GCLSTM) model and a graph convolutional Transformer (GCTrafo) model by fusing the ideas of LSTMs, GCNs, and Transformers. The GCLSTM uses a recursive structure to dynamically combine the GCN and LSTM, whereas GCTrafo embeds a GCN into the Transformer framework for spatio-temporal dependency capture. Both effectively achieve long-term spatio-temporal prediction.

3.6. Miscellaneous

In addition to the above-mentioned baseline models and the associated compound methods, researchers have explored the possibility of using other, nontraditional techniques for LSTF, as shown in Table 8. In this paper, only some of the more interesting techniques are characterized. For example, Cui et al. [46] argued that it is wrong to ignore or underestimate one of the most natural and fundamental temporal properties of time series, historical inertia (HI), in the LSTF problem. Therefore, they developed a new baseline LSTF model, the HI model, which uses the most recent historical data points in the input time series as the prediction result. The final results showed that this method is highly competitive with SOTA models, proving its validity. Similarly, Li et al. [119] proposed the use of a simpler statistical method as a benchmark, introducing SNaive as an effective nonparametric method, since consistently low prediction errors are difficult to achieve in LSTF. To address this problem, Zhou et al. [120] argued that it is caused by overfitting the noise present in the history, so they designed a frequency-improved Legendre Memory model (FiLM) architecture, which applies Legendre polynomial projections to approximate the historical information, uses Fourier projections to remove the noise, and adds a low-rank approximation to speed up the computation. It better alleviates overfitting to noise in LSTF while speeding up inference.
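The two naive baselines above are simple enough to state in a few lines. The sketch below follows the descriptions in the text: HI reuses the tail of the input window directly, and SNaive repeats the last full seasonal cycle; the exact window handling in the original papers may differ, so treat this as an illustration.

```python
def hi_forecast(history, horizon):
    """Historical-inertia baseline: reuse the most recent observations.

    The tail of the input window serves directly as the prediction
    (falling back to repeating the last value when history is short).
    """
    if len(history) >= horizon:
        return list(history[-horizon:])
    return [history[-1]] * horizon

def snaive_forecast(history, horizon, season):
    """Seasonal-naive baseline: repeat the last full seasonal cycle."""
    return [history[-season + (t % season)] for t in range(horizon)]
```

Despite having no parameters at all, such baselines set the lower error bar that any deep LSTF model must beat to justify its cost.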

Table 8. Miscellaneous Methods. Remark denotes the research directions of each work in enhancing LSTF. : Explicitly revealing the correlations between variables; : Implicitly increasing the intersequence dependency learning abilities of models; §: Few parameters and low computational complexity.

| Ref | Year | Technique | Baseline | Comments | Remark |
|---|---|---|---|---|---|
| HI [46] | 2021 | HI | Prophet, ARIMA, DeepAR, LSTMa, Reformer, LogTrans, Informer, Informer- | Simple historical inertia (HI) as a baseline model for LSTF. | § |
| Oreshkin et al. [118] | 2021 | Meta learning | – | Meta-learning alone enables prediction, addressing the scarcity of available historical samples in LSTF. | |
| N-HiTS [49] | 2022 | MLP + Residual Connection | N-BEATS, FEDformer, Autoformer, Informer, LogTrans, Reformer, DilRNN, ARIMA | A novel way of hierarchically synchronizing the rate of input sampling with the scale of output interpolation across blocks. | § |
| SNaive [119] | 2022 | Linear Regression | LSTNet, LSTMa, Reformer, LogTrans, Informer, TCN, SCINet | An effective non-parametric method achieving a lower error limit than deep learning methods. | § |
| FiLM [120] | 2022 | Legendre projection + Fourier analysis | FEDformer, Autoformer, S4, Informer, LogTrans, Reformer | A Frequency-improved Legendre Memory model (FiLM) architecture. | § |
Moreover, meta-learning is widely used as a popular technique in major fields, and some of the works presented above also introduced meta-learning to assist the model prediction process. Oreshkin et al. [118] proposed a generic meta-learning framework in which neural networks are trained on source time series datasets and then deployed on different target time series datasets without retraining. This addresses the problem that the number of available historical samples for some target time series is too small, thereby alleviating the problem of insufficient historical data in LSTF.
Oreshkin et al. [121] investigated how to perform complex time-series prediction tasks with simple, interpretable yet strong DL models, proposing the N-BEATS model based on MLP stacking and residual links. Their subsequent work, N-HiTS [49], improves on N-BEATS for the LSTF task by incorporating novel hierarchical interpolation and multi-rate data sampling techniques to emphasize components with different frequencies and scales while decomposing the input signal and synthesizing predictions.
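The hierarchical interpolation idea can be pictured with its simplest case: a block predicts a handful of coarse points, which are then stretched to the full horizon by linear interpolation. The real N-HiTS interpolates learned basis coefficients per block, so this sketch only shows the core mechanism.

```python
def linear_interpolate(coarse, horizon):
    """Stretch a short coarse forecast to `horizon` points by linear interpolation."""
    n = len(coarse)
    if n == 1:
        return [coarse[0]] * horizon
    if horizon == 1:
        return [coarse[-1]]
    out = []
    for t in range(horizon):
        x = t * (n - 1) / (horizon - 1)   # position in coarse coordinates
        i = min(int(x), n - 2)            # left knot index
        frac = x - i
        out.append(coarse[i] * (1 - frac) + coarse[i + 1] * frac)
    return out

# A block that emits only 3 coarse points still yields a 5-step forecast.
fine = linear_interpolate([0.0, 10.0, 20.0], 5)
```

Predicting few knots and interpolating keeps the per-block output dimension small, which is what lets the stacked blocks scale to very long horizons cheaply.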

3.7. Evolution of networks in LSTF

In this subsection, we summarize the popular feature extraction network structures used in LSTF along a timeline since 2017 and attempt to identify the trends and problems of popular feature extraction structures in recent years, as shown in Fig. 10. A more fine-grained review graph is shown in Appendix B.
We can see that in 2017, researchers paid more attention to RNN-based models; at this time, the time series prediction task was still inseparable from RNNs, and researchers were more inclined toward encoder–decoder architectures. However, due to the inherently cyclic AR structures of RNNs, they suffer from problems such as vanishing and exploding gradients as the network deepens, and a shallow network often does not sufficiently exploit data dependencies when dealing with long sequences. Researchers therefore also started to explore other mechanisms, such as convolution, for time series applications.
TCNs, which rely only on dilated CNNs, emerged in 2018. Traditional CNNs are generally considered less suitable for modeling time-series problems, mainly due to the size limitations of their convolutional kernels, which cannot effectively capture long-term dependency information. However, the unique convolutional structures of TCNs avoid the vanishing- and exploding-gradient problems experienced by RNNs and achieve excellent results. Compared with RNNs, the parallelized processing, flexible receptive fields and lower memory footprints of TCNs have produced good results, and researchers have started to combine CNNs with RNNs to build compound models. However, when processing long-sequence data, TCNs are still limited by the use of dilated convolution to expand their receptive fields; the receptive fields remain insufficiently large, leading to poor model results.
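The receptive-field limitation can be made concrete. For stacked dilated causal convolutions with kernel size k and dilations 1, 2, 4, ..., each layer extends the receptive field by (k − 1) · dilation. This sketch assumes one convolution per layer; actual TCN residual blocks typically stack two convolutions, which doubles each per-layer term.

```python
def tcn_receptive_field(kernel_size, num_layers):
    """Receptive field of stacked dilated causal convolutions with
    dilations 1, 2, 4, ..., 2**(num_layers-1), one conv per layer:
    each layer adds (kernel_size - 1) * dilation past time steps."""
    return 1 + (kernel_size - 1) * sum(2 ** i for i in range(num_layers))

# Even with dilation doubling, covering very long inputs needs many layers:
# a kernel-3 stack with 8 layers sees 511 time steps into the past.
depth_needed = tcn_receptive_field(3, 8)
```

The field grows exponentially in depth but only linearly in kernel size, so very long LSTF inputs force either deep stacks or larger kernels, both of which raise the training cost.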
The inherent sequential nature of RNNs hinders parallelization over training samples, the receptive fields of CNNs limit their ability to capture global dependencies, and an attention mechanism enables the dependencies of input–output sequences to be modeled without regard to their distances in the input sequence. Therefore, in 2019, researchers started to explore the applications of Transformers in time series prediction. One such method was LogTrans, whose excellent global dependency capture ability achieved good prediction results, verifying the effectiveness of Transformer models in LSTF. Furthermore, GNNs also started to demonstrate their effectiveness in multivariate time series forecasting.
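The distance-independence of attention is visible in a minimal scaled dot-product implementation: every query is scored against every key, so positions one step or one thousand steps apart are treated identically. All vectors below are toy values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention. Each output row is a weighted mix of
    all value rows; the weights depend only on query-key similarity, never
    on how far apart the positions are in the sequence."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)  # attention weights, sum to 1
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out
```

With a sharply matching key, nearly all the weight concentrates on one value row, regardless of where that row sits in the sequence.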
From 2020 to 2021, research on Transformer-based models, GNN-based models, and compound models flourished. Transformers have the following drawbacks with respect to time series prediction: (1) O(L²) time complexity; (2) inadequate positional information encoding and representation; and (3) weaker local information acquisition than RNNs and CNNs. Researchers of GNN-based models prefer to study how to construct a suitable graph adjacency matrix to enhance multivariate prediction. By 2022, LSTF studies favored Transformer-based models, whose excellent long-term dependency capture capabilities have yielded good results in LSTF. Nevertheless, Transformer-based models also have some inherent drawbacks, and using compound models to complement each other is also a good solution.
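The O(L²) cost in drawback (1), and the motivation for sparse variants such as LogTrans, can be made concrete by counting query-key score evaluations. The log-sparse budget below is a stylized O(L log L) count for illustration, not LogTrans's exact sparsity pattern.

```python
import math

def full_attention_scores(L):
    """Canonical self-attention: every position scores against every position."""
    return L * L

def logsparse_attention_scores(L):
    """Stylized log-sparse budget: each position attends to O(log2 L) others."""
    return L * (int(math.log2(L)) + 1)

# Score-matrix sizes for typical LSTF input lengths.
growth = {L: (full_attention_scores(L), logsparse_attention_scores(L))
          for L in (96, 384, 1536)}
```

At L = 1536 the full score matrix already holds over 2.3 million entries while the log-sparse budget stays under 17 thousand, which is why sub-quadratic attention became a central LSTF research thread.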

Fig. 10. Literature Statistics. Statistics on the number of research works on mainstream feature extraction network structures based on timeline.

4. Key problems and effective solutions of DL in LSTF

In Table 3, Table 4, Table 5, Table 6, Table 7, we summarize the research directions of each work in enhancing LSTF. Based on the above review, in this section we summarize the key problems encountered by these works in LSTF. Note that these problems may not be unique to LSTF, but LSTF exacerbates them. We categorize them into three points. In Section 4.1, we present the key problems encountered by LSTF in three directions. In Sections 4.2 and 4.3, we summarize the solutions and effective components provided in current works for the first two key problems. In Section 4.4, we present our LSTF performance evaluation approach for the last key problem. The key LSTF problems and the effective solutions presented in this part are shown in Fig. 11.

Fig. 11. A Summary of the Key Problems and Solutions in LSTF.

4.1. The key problems of LSTF

LSTF is a challenging problem in both theory and practice. Below, we summarize the key problems encountered in LSTF in terms of three aspects.
  • (1)
    More and longer-term dependency. “Capturing dependency” refers to the ability of a forecasting model to capture the dependencies and patterns that exist within time series data [122], [123]; in other words, its ability to accurately model the temporal relationships between past and future observations [78]. Long sequence time series exhibit long-term dependencies, where future values depend on past values that are far away in time [100], [116], [119]. This poses a challenge to the model’s ability to capture dependencies [17]. Moreover, LSTF is characterized by inherent cyclicality and non-stationarity [124]. Thus, LSTF models need to learn a mixture of the short-term and long-term repeated patterns in the given time series [33]. An example is the hourly occupancy rate of a highway, which clearly exhibits two repeating patterns: a daily pattern (the morning peak versus the evening peak) and a weekly pattern (weekdays versus weekends). An effective model should capture both repeating patterns to make accurate long-term forecasts. Therefore, LSTF imposes more stringent requirements on the forecasting model in terms of dependency learning.
  • (2)
    Computational complexity and memory requirements. Long sequence time-series forecasting requires processing a large amount of data [101], which can be computationally intensive [33], resulting in longer training times and greater computing resource demands [18]. Low computational complexity is therefore critical for LSTF. In addition, storing the entire sequence in memory can be challenging given the limited memory available in computers [96]. Some models, such as recurrent neural networks, require a large amount of memory to process long sequences [64], which can limit the length of the time series that can be used for forecasting.
  • (3)
    Performance evaluation method for LSTF. As the prediction horizon grows, the length of the model’s prediction increases, requiring the prediction results not only to be close to the true values but also to match the distribution of the true data. In addition, with long horizons it is difficult to judge which model has better LSTF performance when the difference between the predictions of two models is very small, e.g., when the mean squared errors (MSE) of the two predictions differ by 0.001. For example, we cannot meaningfully rank the results [42] of PatchTST/42 (MSE=0.202) and DLinear (MSE=0.203). We need to prove the superiority of one model over the other in a statistical sense; however, no such performance evaluation method exists. Therefore, a key problem involves systematically evaluating the LSTF performance of a model. In this paper, we apply the Kruskal–Wallis test to evaluate the LSTF performance of a model from two perspectives: model-model and model-self.
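The Kruskal–Wallis test underlying this evaluation ranks all error values from the compared groups together (two models' error samples in the model-model view, or one model's errors across horizons in the model-self view) and measures how far each group's mean rank deviates from the overall mean rank. A minimal pure-Python computation of the tie-corrected H statistic is sketched below; in practice H is compared against a chi-squared critical value, or a library routine such as scipy.stats.kruskal is used, which this sketch omits.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic over k samples (e.g. per-model error
    distributions). Ties receive average ranks; the standard tie
    correction is applied to H."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        # Find the run of tied values and give them their average rank.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j + 1) / 2.0  # 1-based ranks i+1 .. j averaged
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    # Sum of ranks per group.
    rank_sums = [0.0] * len(groups)
    for (v, gi), r in zip(pooled, ranks):
        rank_sums[gi] += r
    h = 12.0 / (n * (n + 1)) * sum(rs ** 2 / len(g)
                                   for rs, g in zip(rank_sums, groups)) - 3 * (n + 1)
    correction = 1.0 - tie_term / (n ** 3 - n)
    return h / correction if correction > 0 else h
```

Two fully separated samples yield a large H (evidence the error distributions differ), while identical samples yield H near zero, which is exactly the distinction MSE differences of 0.001 cannot make.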
Now, we proceed to possible solutions to the key problems of LSTF listed above; the details are as follows.

Fig. 12. Model-Self.


Fig. 13. Model-Model.

Table 9. Evaluation measures.

| Category | Performance measure | Remark | Ref |
| --- | --- | --- | --- |
| Scale-dependent | Mean Absolute Error (MAE) | Highlights the outliers even more | [11], [33], [36], [46], [71], [78], [82], [83], [84], [85], [104], [107], [109], [125], [126], [127], [128], [129], [130], [131], [132] |
| Scale-dependent | Mean Square Error (MSE) | Expected value of the square of the difference between predicted and true values | [15], [33], [35], [46], [72], [106], [133] |
| Scale-dependent | Largest Absolute Error (LAE) | Reflects only the largest error in the results; susceptible to outliers and does not reflect the overall fit | [134], [135], [136], [137] |
| Scale-dependent | Root Mean Square Error (RMSE) | For cases where the error is not very obvious | [11], [27], [35], [36], [70], [71], [73], [82], [83], [84], [85], [92], [104], [107], [109], [111], [125], [126], [127], [128], [129], [130], [131], [132], [138], [139], [140], [141] |
| Scale-dependent | Mean Squared Logarithmic Error (MSLE) | Cares only about the relative difference between the true and predicted values, reflecting the percentage difference between them | [73] |
| Scale-dependent | Mean Absolute Log-scaled Error (MALE) | | [142] |
| Scale-dependent | Root Mean Squared Log Error (RMSLE) | Penalizes under-prediction significantly more than over-prediction | [85], [128], [142] |
| Scale-independent | Mean Absolute Percentage Error (MAPE) | Uses absolute values to prevent positive and negative errors from canceling each other out | [27], [35], [36], [70], [71], [82], [84], [85], [111], [127], [128], [131], [133], [140], [141] |
| Scale-independent | IA | A non-dimensional, bounded measure of the degree of model prediction error | [109] |
| Scale-independent | Root Relative Squared Error (RRSE) | Reflects the mean offset between the target and the true regression line; used to estimate the standard deviation of the residuals | [78], [79], [80], [82], [104], [105], [143], [144] |
| Scale-independent | R-Squared (R2) | Intuitively reflects the fit between the true and predicted values as a percentage | [35], [126] |
| Scale-independent | Symmetric Mean Absolute Percentage Error (sMAPE) | Corrects the asymmetry of MAPE, but is unstable when both the true and predicted values are very close to zero | [85], [121], [138], [145], [146] |
| Scale-independent | Normalized Root Mean-square Error (NRMSE) | Normalizes the value of RMSE to the (0,1) interval | [28], [35], [115] |
| Scale-independent | Root Mean Square Percentage Error (RMSPE) | Reflects the error rate and avoids the effect of magnitude differences among the true values; when the predicted value is more than twice the true value, RMSPE may become so large that the evaluation is meaningless | [107], [147] |
| Scale-independent | Relative Absolute Error (RAE) | Expressed as a ratio comparing the mean error (residual) to the errors produced by a trivial or naive model | [79], [80] |
| Scale-independent | Normalized Mean Absolute Error (NMAE) | Normalizes the MAE | [115] |
| Scaled-error | Mean Absolute Scaled Error (MASE) | Compares the prediction results with the output of a naive prediction method; susceptible to outliers and does not visualize the error in the prediction results | [145], [146] |
| Scaled-error | Median Absolute Scaled Error (MdASE) | More robust than MASE, but cannot visualize the error of the prediction results | [148] |
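As a quick reference, several of the measures in Table 9 reduce to NumPy one-liners (y is the ground truth, f the forecast; the definitions below follow the standard formulas, with MASE scaled by the error of a one-step naive forecast):

```python
import numpy as np

def mae(y, f):   return np.mean(np.abs(y - f))
def mse(y, f):   return np.mean((y - f) ** 2)
def rmse(y, f):  return np.sqrt(mse(y, f))
def mape(y, f):  return np.mean(np.abs((y - f) / y)) * 100      # undefined when y == 0
def smape(y, f): return np.mean(2 * np.abs(y - f) / (np.abs(y) + np.abs(f))) * 100
def mase(y, f):  # scale by the mean absolute error of the one-step naive forecast
    return mae(y, f) / np.mean(np.abs(y[1:] - y[:-1]))

y = np.array([2.0, 4.0, 6.0, 8.0])
f = np.array([2.5, 3.5, 6.5, 7.5])
print(mae(y, f), rmse(y, f))   # 0.5 0.5
```

The sMAPE instability noted in the table is visible directly in its denominator, which vanishes when both the true and predicted values approach zero.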

4.2. More and longer-term dependency

Most of the existing work on LSTF has concentrated on achieving improved prediction accuracy by capturing more and longer-term dependencies between sequences. The majority of these studies proceed either by explicitly exposing the relationships between variables or by implicitly increasing the ability of the model to capture serial correlations.

4.2.1. Methods for explicitly revealing the relationships between variables

Previously, feature engineering was used to explicitly expose the relationships between variables: complex features were constructed manually to reveal the correlations between serial data. In recent years, many novel techniques have also emerged in LSTF to explain the correlations between variables from other perspectives. GNNs have made a splash in the field of TSF in the last two years by modeling spatial correlations among data through graph structures as a way to capture more and longer-term dependencies. AutoSTG [83] automatically searches for suitable graph structures through a meta-learning strategy, MTGNN [82] adaptively extracts sparse graph adjacency matrices from the data through a graph learning module, DGNNs [84] use dynamic graph structures to model spatial correlations, and CGM [149] even introduces multi-view information, constructing a multi-view information-based heterogeneous graph to model the relationships among variables more expressively. REST [85], on the other hand, exposes the linkages between variables by means of a multimodal directed weighted graph.
Time–frequency analysis methods are also effective techniques; ATFN [71] demonstrated the significant advantage of frequency-domain methods for the LSTF of strongly periodic time series through a time–frequency adaptive network. FEDformer [97] similarly starts from the frequency domain and uses the Fourier and wavelet transforms to expose potential periodic information in the input data. Zhao et al. [106] also explicitly revealed the frequency-domain information in the given data via the wavelet transform before further learning the long-term correlations between the series with their model.
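A minimal sketch of the frequency-domain idea these works share: the dominant period of a series can be exposed with a single FFT before any dependency modeling. The 24-step cycle and the noise level below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(24 * 30)                               # 30 "days" of hourly data
x = np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=t.size)

spectrum = np.abs(np.fft.rfft(x - x.mean()))         # drop the DC component
freqs = np.fft.rfftfreq(t.size, d=1.0)               # cycles per step
period = 1.0 / freqs[spectrum.argmax()]
print(f"dominant period ~ {period:.0f} steps")       # ~ 24
```

FEDformer-style models go further, learning directly on a subset of such frequency components rather than only reading off the dominant period.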
The multi-granular data input approach reveals the relationships between variables in terms of amounts of information, and different granularities reflect the relevance levels of the different cycles of a feature. Lee et al. [95] used multi-resolution convolution to extract feature information at different granularities, Pyraformer [96] uses a pyramidal attention module for multi-resolution time series representation, and the MLCNN [110] similarly extracts multilevel abstraction representations from the data to aid the prediction process. HTML [15], on the other hand, uses multiple views of a multi-source data input to reveal potentially complex long-term correlations among the same features from different dimensions.

4.2.2. Methods for implicitly increasing the intersequence dependency learning abilities of models

How to effectively increase the ability of models to learn long-term correlations between sequences has been a hot research topic, and DL studies in this area can be traced back to LSTM, which was proposed to solve the memory problem of RNNs when dealing with long sequences. DL has evolved since then, and many efficient and effective components that can enhance the learning abilities of models have emerged. Jung et al. [35] performed LSTF for photovoltaic power generation cases by using vanilla LSTM. The dilated RNN [110] further mitigates, via skip connections, the difficulty faced by RNNs when capturing complex correlations in long sequences.
Attention mechanisms solve the information loss problem caused by excessive sequence length via screening, which can effectively improve a model's ability to learn from long series data. To solve the problems of data redundancy and noise in multivariate time series LSTF, Yoshimi and Eguchi [72] combined an attention mechanism and LSTM, improving the quality of the input data by screening and segmenting the time series of multiple independent variables through the attention mechanism. The DA-RNN [65] more effectively captures long-term event correlations by selecting hidden states at all time steps through a two-stage attention-based RNN. Gangopadhyay et al. [74] also combined LSTM and an attention mechanism and proposed a spatio-temporal attention mechanism (STAM) that improves the LSTF capability of their model based on the computed spatial and temporal context vectors.
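The screening idea is easy to see in isolation: scaled dot-product attention assigns a softmax weight to every past hidden state and aggregates them into one context vector, so informative steps are amplified rather than diluted by sequence length. This toy sketch (random hidden states, last state as the query) is illustrative only:

```python
import numpy as np

def attention(query, keys, values):
    scores = keys @ query / np.sqrt(query.size)   # one score per time step
    w = np.exp(scores - scores.max())
    w /= w.sum()                                  # softmax weights over the T steps
    return w @ values, w                          # context vector, weights

T, d = 50, 16
rng = np.random.default_rng(0)
h = rng.normal(size=(T, d))            # encoder hidden states over T steps
context, w = attention(h[-1], h, h)    # screen all states with the last as query
print(context.shape, round(float(w.sum()), 6))   # (16,) 1.0
```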
Encoder–decoder structures are also effective for enhancing the LSTF capability of a model, as they can efficiently handle long sequences while achieving sequence dimensionality reduction. Fan et al. [67] combined an encoder–decoder structure with LSTM to jointly learn temporal attention values from multiple historical periods and fuse them with multimodal attention weights to improve the prediction accuracy of LSTF. The MQRNN [66] is a multihorizon prediction model with an encoder–decoder structure that can cold-start in real-life long-sequence forecasting problems. Notably, some researchers have combined attention mechanisms and encoder–decoder structures. MTSMFF [69] adaptively learns the long-term correlations and implied correlation features of multivariate temporal data through temporal attention, with a BiLSTM as the encoder network.
GANs can solve some of the missing-data problems encountered in LSTF. The authors of LGnet [70] introduced a GAN into TSF to overcome missing values and ensure the validity of LSTF, whose input size often grows with the prediction horizon. This work relaxed the data-integrity requirement of LSTF.
Researchers have also started to explore the possibilities of CNNs in LSTF. The TCN [77] was proposed with causal convolution to prevent information leakage, and the network uses dilated causal convolution to handle long-sequence inputs. HyDCNN [80] utilizes a hybrid framework based on dilated causal convolution to address the long-term dependencies that ordinary CNNs cannot capture in sequences. PFNet [79], on the other hand, improves the LSTF capability of the associated model by capturing long-term trend features and short-term fluctuation features through its Highway-CNN structure. Recently, although LSTM, GRUs and other models combined with attention have exhibited better performance in terms of capturing long-term dependencies, they perform poorly on multivariate LSTF problems. DSANet [78] captures complex mixed global and local temporal patterns via two parallel convolutions and incorporates an attention mechanism to optimize long-term correlation capture for multivariate LSTF.
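The dilated causal convolution that TCN-style models build on can be sketched in a few lines: the input is left-padded so that output[t] depends only on x[t], x[t-d], x[t-2d], ..., and stacking layers with doubling dilation grows the receptive field exponentially. The kernel and signal below are illustrative, not any published model's weights.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    k = len(w)
    pad = (k - 1) * dilation                    # left padding keeps the op causal
    xp = np.concatenate([np.zeros(pad), x])
    # output[t] mixes x[t], x[t - dilation], ..., never future values
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

x = np.arange(8, dtype=float)
y = dilated_causal_conv(x, w=np.array([1.0, 1.0]), dilation=2)
print(y)   # each y[t] = x[t] + x[t-2], with zeros where t - 2 < 0
```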
The Transformer [32], on the other hand, is a model that combines various mechanisms, such as attention, embedding and encoder–decoder structures, for the field of NLP. Later studies improved the Transformer and gradually applied it to TSF, imaging and other fields, making Transformers a genre of their own. Therefore, this paper summarizes this class of models as the Transformer class. The performance of Transformers on LSTF is impressive, and they have gradually become the current mainstream approach. The LogSparse Transformer [91] first introduced the Transformer to the field of TSF and improved the prediction accuracy achieved for fine-grained, long-term dependent time series under a limited memory budget, verifying the effectiveness of Transformers in LSTF. Subsequently, much work regarding LSTF has revolved around improving the Transformer. When dealing with time-series data [92], the embedding mechanism of the vanilla Transformer often fails to fully encode long sequences and is prone to losing long-term dependencies. To address this problem, SpringNet [94] improves the model's ability to capture long-term repeatable fluctuating sequence correlations in the embedded information by means of a DTW attention layer. TCCT [98] integrates CSPNet with the self-attention mechanism and distills it with a dilated causal convolution layer to obtain more fine-grained information at a negligible additional computational cost, enhancing the LSTF capability of the model without increasing its computational burden. In contrast, Autoformer [33] uses an Auto-Correlation mechanism to break through the information utilization bottleneck of Transformers and achieve an LSTF accuracy improvement. AST [93] combines a GAN and a Transformer to optimize the Transformer model in LSTF and address the problem of cumulative error.

Table 10. Dataset details.

| Dataset | Time Horizon and Data Granularity | Features |
| --- | --- | --- |
| ETTh1 | 2016.07–2018.07, 1 h | 7 features |
| Stock | 2000.01.01–2022.02.28, 1 d | 12 features |
| PEMS03 | 2018.01.09–2018.11.30, 5 min | 358 features |
| WTH | 2010.01.01–2013.12.31, 1 h | 12 features |
| COVID19 | 2020.1.22–now, 1 d | 4 features |

Table 11. Case models.

| Model | Year | Published Journals/Conferences |
| --- | --- | --- |
| Pyraformer | 2022 | ICLR |
| FEDformer | 2022 | ICML |
| Autoformer | 2021 | NeurIPS |
| Informer | 2021 | AAAI |
| Reformer | 2020 | ICLR |
| Transformer | 2017 | NeurIPS |
| MTGNN | 2020 | SIGKDD |
| Graph WaveNet | 2019 | IJCAI |
| LSTNet | 2018 | SIGIR |

Table 12. Multivariate experimental forecasting results obtained on ETTh1. The MAE and RMSE mark the top two models with the best prediction results under each horizon, where a lower RMSE or MAE indicates a better prediction effect.

| Model | Metric | 12 | 48 | 192 | 384 | 768 |
| --- | --- | --- | --- | --- | --- | --- |
| Informer | MAE | 0.474 | 0.631 | 0.809 | 0.884 | 0.878 |
| | RMSE | 0.676 | 0.829 | 1.010 | 1.090 | 1.084 |
| | MAPE | 10.501 | 13.583 | 12.135 | 13.221 | 14.752 |
| Autoformer | MAE | 0.422 | 0.444 | 0.479 | 0.489 | 0.519 |
| | RMSE | 0.616 | 0.654 | 0.706 | 0.711 | 0.726 |
| | MAPE | 11.158 | 11.272 | 11.998 | 12.425 | 14.027 |
| Reformer | MAE | 0.532 | 0.613 | 0.716 | 0.742 | 0.825 |
| | RMSE | 0.733 | 0.839 | 0.956 | 0.983 | 1.084 |
| | MAPE | 12.300 | 12.081 | 8.601 | 13.077 | 23.151 |
| Pyraformer | MAE | 0.444 | 0.547 | 0.676 | 0.747 | 0.780 |
| | RMSE | 0.630 | 0.743 | 0.881 | 0.950 | 0.978 |
| | MAPE | 10.188 | 9.166 | 10.972 | 13.801 | 23.171 |
| FEDformer | MAE | 0.368 | 0.390 | 0.444 | 0.466 | 0.506 |
| | RMSE | 0.541 | 0.581 | 0.649 | 0.675 | 0.708 |
| | MAPE | 9.709 | 9.694 | 11.738 | 12.512 | 13.878 |
| Transformer | MAE | 0.485 | 0.668 | 0.779 | 0.845 | 0.817 |
| | RMSE | 0.68 | 0.863 | 0.979 | 1.051 | 1.019 |
| | MAPE | 10.908 | 10.629 | 9.767 | 13.96 | 20.209 |
| MTGNN | MAE | 0.35 | 0.413 | 0.515 | 0.661 | 0.694 |
| | RMSE | 0.531 | 0.609 | 0.722 | 0.863 | 0.895 |
| | MAPE | 8.18 | 8.508 | 8.776 | 11.825 | 14.032 |
| LSTNet | MAE | 0.609 | 0.631 | 0.71 | 0.781 | 0.829 |
| | RMSE | 0.827 | 0.852 | 0.931 | 0.998 | 1.031 |
| | MAPE | 10.237 | 10.23 | 11.006 | 14.646 | 20.746 |
| Graph WaveNet | MAE | 0.401 | 0.432 | 0.499 | 0.533 | 0.581 |
| | RMSE | 0.61 | 0.653 | 0.728 | 0.759 | 0.793 |
| | MAPE | 8.775 | 8.29 | 8.556 | 11.146 | 14.119 |

Table 13. Multivariate experimental forecasting results obtained on Stock and COVID19. The MAE, RMSE, and MAPE mark the top two models with the best prediction results under each horizon, and a lower RMSE or MAE indicates a better prediction effect.

| Model | Metric | Stock 12 | Stock 48 | Stock 96 | Stock 288 | Stock 480 | COVID19 12 | COVID19 24 | COVID19 48 | COVID19 60 | COVID19 72 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Informer | MAE | 9.695 | 10.193 | 10.268 | 9.839 | 9.897 | 1.556 | 1.591 | 1.548 | 1.534 | 1.573 |
| | RMSE | 12.653 | 13.171 | 13.260 | 12.738 | 12.582 | 1.805 | 1.891 | 1.789 | 1.746 | 1.803 |
| | MAPE | 0.824 | 0.823 | 0.859 | 0.856 | 0.928 | 0.957 | 0.974 | 1.024 | 1.047 | 1.046 |
| Autoformer | MAE | 1.041 | 1.335 | 1.636 | 2.893 | 3.664 | 0.176 | 0.212 | 0.269 | 0.170 | 0.205 |
| | RMSE | 1.567 | 2.006 | 2.510 | 4.375 | 5.517 | 0.274 | 0.355 | 0.418 | 0.241 | 0.316 |
| | MAPE | 0.249 | 0.285 | 0.315 | 0.419 | 0.450 | 0.161 | 0.185 | 0.221 | 0.161 | 0.198 |
| Reformer | MAE | 9.319 | 9.704 | 9.744 | 9.917 | 9.745 | 1.579 | 1.693 | 1.608 | 1.533 | 1.574 |
| | RMSE | 12.209 | 12.606 | 12.654 | 12.721 | 12.437 | 1.855 | 2.001 | 1.878 | 1.769 | 1.824 |
| | MAPE | 0.753 | 0.853 | 0.924 | 0.929 | 0.958 | 0.939 | 0.972 | 0.933 | 0.948 | 0.942 |
| Pyraformer | MAE | 9.662 | 10.015 | 10.033 | 10.183 | 10.194 | 1.747 | 1.781 | 1.756 | 1.702 | 1.713 |
| | RMSE | 12.581 | 13.004 | 13.074 | 13.073 | 12.938 | 2.054 | 2.080 | 2.032 | 1.981 | 1.984 |
| | MAPE | 0.866 | 0.922 | 0.929 | 0.944 | 0.980 | 1.155 | 1.214 | 1.201 | 1.133 | 1.181 |
| FEDformer | MAE | 1.082 | 1.350 | 1.649 | 2.899 | 3.683 | 0.161 | 0.195 | 0.256 | 0.174 | 0.218 |
| | RMSE | 1.593 | 2.019 | 2.523 | 4.381 | 5.531 | 0.269 | 0.339 | 0.406 | 0.243 | 0.325 |
| | MAPE | 0.291 | 0.303 | 0.327 | 0.424 | 0.461 | 0.129 | 0.160 | 0.206 | 0.164 | 0.197 |
| Transformer | MAE | 9.394 | 9.611 | 9.564 | 9.350 | 9.563 | 1.586 | 1.689 | 1.634 | 1.609 | 1.657 |
| | RMSE | 12.28 | 12.501 | 12.501 | 12.292 | 12.13 | 1.806 | 1.944 | 1.909 | 1.845 | 1.909 |
| | MAPE | 0.765 | 0.766 | 0.780 | 0.788 | 0.866 | 1.085 | 1.143 | 0.977 | 1.245 | 1.208 |
| MTGNN | MAE | 8.664 | 8.953 | 9.359 | 9.483 | 9.704 | 0.994 | 1.167 | 1.488 | 1.501 | 1.586 |
| | RMSE | 11.391 | 11.758 | 12.175 | 12.272 | 12.398 | 1.340 | 1.559 | 1.924 | 1.886 | 1.989 |
| | MAPE | 0.713 | 0.751 | 0.807 | 0.822 | 0.889 | 0.431 | 0.548 | 0.713 | 0.759 | 0.802 |
| LSTNet | MAE | 7.734 | 7.917 | 8.001 | 8.404 | 8.393 | 2.098 | 2.122 | 1.971 | 1.939 | 2.081 |
| | RMSE | 10.552 | 10.728 | 10.837 | 11.162 | 11.193 | 2.658 | 2.636 | 2.397 | 2.389 | 2.551 |
| | MAPE | 1.034 | 1.11 | 1.211 | 1.167 | 1.117 | 1.322 | 1.528 | 1.334 | 1.268 | 1.399 |
| Graph WaveNet | MAE | 10.011 | 10.377 | 10.464 | 10.605 | 10.646 | 0.641 | 0.778 | 0.873 | 0.879 | 1.002 |
| | RMSE | 13.424 | 13.650 | 13.602 | 13.547 | 13.385 | 1.008 | 1.195 | 1.255 | 1.203 | 1.365 |
| | MAPE | 0.739 | 0.795 | 0.849 | 0.904 | 0.939 | 0.216 | 0.267 | 0.321 | 0.351 | 0.406 |

Table 14. Multivariate experimental forecasting results obtained on PEMS03 and WTH. The MAE, RMSE, and MAPE mark the top two models with the best prediction results under each horizon, and a lower RMSE or MAE indicates a better prediction effect.

| Model | Metric | PEMS03 12 | PEMS03 48 | PEMS03 192 | PEMS03 384 | PEMS03 768 | WTH 12 | WTH 48 | WTH 192 | WTH 384 | WTH 768 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Informer | MAE | 0.211 | 0.246 | 0.269 | 0.261 | 0.283 | 0.317 | 0.427 | 0.551 | 0.592 | 0.584 |
| | RMSE | 0.327 | 0.373 | 0.405 | 0.400 | 0.449 | 0.517 | 0.622 | 0.763 | 0.799 | 0.787 |
| | MAPE | 2.903 | 3.452 | 4.048 | 3.914 | 4.355 | 1.086 | 1.491 | 1.914 | 2.108 | 2.002 |
| Autoformer | MAE | 0.282 | 0.409 | 0.615 | 0.676 | 0.673 | 0.392 | 0.494 | 0.566 | 0.581 | 0.582 |
| | RMSE | 0.389 | 0.549 | 0.796 | 0.885 | 0.869 | 0.583 | 0.699 | 0.774 | 0.791 | 0.797 |
| | MAPE | 3.119 | 3.906 | 5.488 | 6.867 | 6.309 | 1.368 | 1.715 | 1.968 | 2.085 | 2.122 |
| Reformer | MAE | 0.208 | 0.261 | 0.269 | 0.271 | 0.275 | 0.303 | 0.411 | 0.493 | 0.507 | 0.505 |
| | RMSE | 0.324 | 0.405 | 0.431 | 0.437 | 0.447 | 0.503 | 0.611 | 0.695 | 0.709 | 0.705 |
| | MAPE | 2.754 | 4.047 | 4.188 | 4.212 | 4.333 | 1.03 | 1.382 | 1.644 | 1.717 | 1.728 |
| Pyraformer | MAE | 0.212 | 0.243 | 0.273 | 0.274 | 0.291 | 0.304 | 0.406 | 0.511 | 0.529 | 0.530 |
| | RMSE | 0.329 | 0.371 | 0.408 | 0.412 | 0.443 | 0.501 | 0.602 | 0.712 | 0.731 | 0.727 |
| | MAPE | 2.872 | 3.495 | 4.081 | 4.096 | 4.628 | 1.023 | 1.389 | 1.776 | 1.946 | 1.931 |
| FEDformer | MAE | 0.229 | 0.269 | 0.332 | 0.35 | 0.364 | 0.345 | 0.434 | 0.539 | 0.563 | 0.572 |
| | RMSE | 0.328 | 0.384 | 0.462 | 0.496 | 0.517 | 0.533 | 0.632 | 0.752 | 0.779 | 0.786 |
| | MAPE | 0.893 | 3.376 | 4.401 | 4.487 | 4.809 | 1.184 | 1.519 | 1.868 | 1.979 | 2.019 |
| Transformer | MAE | 0.199 | 0.241 | 0.258 | 0.253 | 0.265 | 0.321 | 0.435 | 0.567 | 0.553 | 0.547 |
| | RMSE | 0.313 | 0.369 | 0.391 | 0.387 | 0.421 | 0.514 | 0.637 | 0.741 | 0.762 | 0.749 |
| | MAPE | 2.684 | 3.403 | 3.949 | 3.804 | 3.936 | 1.087 | 1.557 | 1.883 | 2.097 | 1.991 |
| MTGNN | MAE | 0.163 | 0.218 | 0.285 | 0.316 | 0.356 | 0.304 | 0.404 | 0.505 | 0.526 | 0.562 |
| | RMSE | 0.249 | 0.333 | 0.418 | 0.459 | 0.513 | 0.508 | 0.606 | 0.703 | 0.723 | 0.756 |
| | MAPE | 2.309 | 3.026 | 4.536 | 4.681 | 5.260 | 1.001 | 1.384 | 1.776 | 1.932 | 2.121 |
| LSTNet | MAE | 0.259 | 0.302 | 0.360 | 0.362 | 0.380 | 0.361 | 0.439 | 0.522 | 0.544 | 0.569 |
| | RMSE | 0.383 | 0.431 | 0.499 | 0.507 | 0.533 | 0.547 | 0.630 | 0.717 | 0.738 | 0.763 |
| | MAPE | 3.787 | 4.479 | 5.258 | 5.189 | 5.479 | 1.211 | 1.489 | 1.786 | 1.937 | 1.997 |
| Graph WaveNet | MAE | 0.198 | 0.334 | 0.449 | 0.426 | 0.448 | 0.336 | 0.456 | 0.546 | 0.571 | 0.608 |
| | RMSE | 0.295 | 0.486 | 0.619 | 0.597 | 0.62 | 0.545 | 0.661 | 0.749 | 0.769 | 0.804 |
| | MAPE | 2.679 | 4.256 | 5.862 | 5.472 | 5.816 | 1.074 | 1.483 | 1.782 | 1.976 | 1.966 |

Table 15. Kruskal–Wallis test results. P < 0.05 denotes a significant difference; Statistic measures the degree of variation between samples, where larger values indicate greater variation between samples.

| Model comparison | ETTh1 Statistic | ETTh1 P-value | Stock Statistic | Stock P-value |
| --- | --- | --- | --- | --- |
| Autoformer-Informer | 445.97 | 0.013 | 493.54 | 0.002 |
| Autoformer-FEDformer | 99.85 | 0.083 | 5.24 | 0.611 |
| Autoformer-Pyraformer | 446.48 | 0.003 | 295.78 | 0.023 |
| Autoformer-Transformer | 341.68 | 0.023 | 516.95 | 0.001 |
| Autoformer-Reformer | 361.18 | 0.044 | 494.34 | 0.009 |
| Autoformer-LSTnet | 402.81 | 0.034 | 311.62 | 0.012 |

Table A.16. Finance application and dataset.

| Dataset | Ref | Time range | Min-Granularity | Information |
| --- | --- | --- | --- | --- |
| Gold prices | [125] | 2014.1–2018.4 | Day | Daily gold prices (U.S. dollars) from January 2014 to April 2018 from http://finance.yahoo.com, including the minimum, mean, maximum, median, standard deviation (SD), skewness and kurtosis used to describe the nature of the distribution. |
| GEFCom2014 Electricity Price | [150] | | Hour | Published in the GEFCom 2014 Forecasting Competition, which consists of four tracks: electricity load forecasting, electricity price forecasting, and two further tracks on wind and solar power generation. The electricity price forecasting track has 21552 time points at hourly granularity. Data source: https://www.dropbox.com/s/pqenrr2mcvl0hk9/GEFCom2014.zip?dl=0. |
| Exchange-Rate | [33], [82], [104], [143] | 1990–2016 | Day | Daily exchange rates from 1990 to 2016 for eight countries: Australia, the United Kingdom, Canada, Switzerland, China, Japan, New Zealand and Singapore. Data source: https://github.com/laiguokun/multivariate-time-series-data. |
| S&P 500 | [126] | 2001.1–2017.5 | Day | Daily S&P 500 index from 2001.01–2017.05, with a total of 4506 records. Data source: http://finance.yahoo.com. |
| Shanghai Composite | [126] | 2005.1–2017.6 | Day | Daily SSE indices from 2005.01–2017.06, with 2550 records. Data source:. |
| S&P500 Stocks | [71] | 2013.2.8–2018.2.7 | Day | 505 common stocks issued by 500 large-cap companies and traded on the American Stock Exchange, recording historical daily stock prices from 2013.2.8–2018.2.7 for all companies currently included in the S&P 500 index. Data source: https://www.kaggle.com/camnugent/sandp500. |
| CRSP's Stocks | [151] | | Day | From CRSP; includes individual stock returns and prices, S&P 500 index returns, industry categories, number of shares outstanding, ticker symbols, exchange codes and trading volume. Data source: http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html. |
| Gas Station Revenue | [78] | 2015.12.1–2018.12.1 | Day | Daily revenue of five gas stations, geographically close to each other and with interrelated effects, from December 1, 2015 to December 1, 2018. Data source: https://github.com/bighuang624/DSANet/tree/master/data. |
| Finance Japan | [72] | 2003.1–2016.12 | 4 months | Collected by the Ministry of Finance of Japan; records general partnerships, limited partnerships, limited liability companies and joint stock companies from the first quarter of 2003 to the fourth quarter of 2016, with annual and quarterly granularity. 57,775 companies are surveyed per quarter and 60,516 per year. Data source: https://www.mof.go.jp/english/pri/reference/ssc/outline.htm. |
| Stock Opening Prices | [106] | 2007–2016 | Day | Daily opening prices for 50 stocks in 10 sectors from Yahoo Finance over 2007–2016; each sector involves 5 top companies and each stock contains 2518 records. Data source: https://github.com/z331565360/State-Frequency-Memory-stock-prediction. |

Table A.17. Industrial energy application and dataset.

Dataset  数据集Ref  Ref 参考文献Time range  时间范围Min-Granularity  最小粒度Information  信息
Power Consumption  功耗[106]  [106] 翻译文本:2006.12–2010.11  2006.12-2010.11Minute  分钟This dataset records the electricity consumption of a household over a period of nearly 4 years, including voltage, electricity consumption and other characteristics, at a data granularity of min. Data source: https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption.
此数据集记录了一个家庭在近 4 年内的电力消耗情况,包括电压、电力消耗和其他特征,数据粒度为分钟。数据来源:https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption。
Solar-Energy  太阳能[82], [91], [93], [104], [143]
[82]、[91]、[93]、[104]、[143]
20065 min  5 分钟This dataset records the highest solar power production from 137 photovoltaic plants in Alabama in 2006, with a data granularity of 5-minute sampling. Data source: https://www.nrel.gov/grid/solar-power-data.html.
此数据集记录了 2006 年阿拉巴马州 137 个光伏电站的最高太阳能发电量,数据粒度为 5 分钟采样。数据来源:https://www.nrel.gov/grid/solar-power-data.html。
Electricity  电力[33], [82], [93], [104], [143]2011–2014  2011-201415 min  15 分钟This dataset records the electricity consumption of 321 customers from 2011 to 2014, with electricity consumption measured in kWh and data granularity of 15 min. Data source: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014.
此数据集记录了 2011 年至 2014 年间 321 名客户的电力消耗情况,电力消耗以千瓦时(kWh)为单位,数据粒度为 15 分钟。数据来源:https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014。
Wind  [91], [93]  [91],[93]1986–2015  1986-2015Hour  小时This dataset records hourly estimates of energy potential as a percentage of the maximum output of power plants for a European region for the period 1986–2015. Data source: https://www.kaggle.com/sohier/30-years-of-european-wind-generation.
此数据集记录了 1986-2015 年期间欧洲地区每小时能源潜力的估算值,占发电厂最大输出功率的百分比。数据来源:https://www.kaggle.com/sohier/30-years-of-european-wind-generation。
ETT  ETT 翻译文本:ETT[17], [33]2016.7–2018.7  2016.7-2018.715 min  15 分钟This dataset records the load and oil temperature of power transformers recorded every 15 min from July 2016 to July 2018. Has ETTh1, ETTh2 at hourly granularity and ETTm1 at 15 min granularity. Data source: https://github.com/zhouhaoyi/ETDataset.
此数据集记录了从 2016 年 7 月到 2018 年 7 月每 15 分钟记录的电力变压器的负载和油温。具有 ETTh1、ETTh2 的时分辨率和 ETTm1 的 15 分钟分辨率。数据来源:https://github.com/zhouhaoyi/ETDataset。
Sanyo | [94] | 2011.1.1–2017.12.31 | Day | This dataset records solar power generation data for two photovoltaic power plants in Alice Springs, Northern Territory, Australia, covering January 1, 2011 to December 31, 2017. Data source: http://dkasolarcentre.com.au/source/alice-springs/dka-m4-b-phase.
Hanergy | [94] | 2011.1.1–2016.12.31 | Day | This dataset records solar power generation data from two photovoltaic plants in Alice Springs, Northern Territory, Australia, covering January 1, 2011 to December 31, 2016. Data source: https://dkasolarcentre.com.au/source/alice-springs/dka-m16-b-phase.
SGSMEPC | [141] | 2014.1.1–2015.2.19 | Day | This dataset records grid data of the State Grid Shanghai Electric Power Company from 2014.01.01 to 2015.02.19, with a data granularity of 1 day.

4.3. Computational complexity and memory requirements

The length of a sequence can impede efficient data processing in LSTF [33]: the cost of vanilla self-attention grows quadratically with the sequence length, and storing the entire sequence in memory can be challenging given limited computer memory [96]. Numerous studies have sought to improve the efficiency of LSTF. The time complexity of self-attention in the vanilla Transformer is O(L2), which inevitably leads to elevated computational expense; subsequent Transformer-based LSTF works have optimized this complexity. LogTrans [91] reduces the time complexity to O(L(logL)2) via convolutional self-attention, while Informer [17] attains O(LlogL) time complexity and O(LlogL) memory usage through ProbSparse self-attention, and further reduces the spatial complexity of the model's computations with a self-attention distillation operation. Autoformer [33] achieves O(LlogL) time complexity through series decomposition and the Auto-Correlation mechanism, breaking the information-utilization bottleneck of the vanilla Transformer. Pyraformer [96] uses pyramidal attention to make the maximum length of the signal traversal path a constant with respect to the sequence length L, achieving time and space complexity on the order of O(L). FEDformer [97] likewise attains O(L) linear computational complexity and storage overhead through frequency-domain mapping. In other works, NET3 [114] effectively reduces the number of model parameters using tensor decomposition, and STRN [132] significantly reduces the required parameters through meta-learning and a dedicated framework design for CNNs and GCNs.
Each work optimizes the parametric part of the model from its own perspective, and a general architecture or component that reduces the number of required model parameters has not yet emerged.
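To make the quadratic term concrete: it comes from the attention score matrix itself. The following minimal numpy sketch (illustrative only, not code from any surveyed model) materializes that L × L matrix, which is exactly the O(L²) time and memory cost the works above try to reduce:

```python
import numpy as np

def self_attention(X):
    """Vanilla scaled dot-product self-attention with queries = keys = values = X."""
    L, d = X.shape
    scores = X @ X.T / np.sqrt(d)                  # L x L score matrix: O(L^2) memory
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
for L in (96, 384, 1536):
    out = self_attention(rng.normal(size=(L, 16)))
    # The score matrix holds L*L entries, so a 4x longer input costs 16x more.
    print(L, out.shape, L * L)
```

Sparse-attention variants such as LogTrans or Informer keep only O(L log L) of these score entries, which is where their complexity reductions originate.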

4.4. Performance evaluation method for LSTF

To address the third key problem regarding the LSTF evaluation methodology, we are not currently aware of any work in this area. Therefore, we apply the Kruskal–Wallis test to evaluate the performance of an LSTF model from two perspectives: model-model and model-self.
From the model-self perspective, in the LSTF task we consider that each model has a prediction-horizon threshold beyond which its accuracy degrades significantly; the larger this threshold, the better the model's LSTF performance. How to determine this threshold thus becomes an important problem, and its value differs from model to model. Consider the example in Fig. 12: for model 1, with an input time-step size of 48, the prediction accuracy starts to degrade significantly once the prediction horizon exceeds 24, so horizons > 24 can be considered LSTF for model 1. In contrast, the accuracy of model 2 only degrades significantly when the prediction horizon exceeds 120, so a horizon of 24 clearly lies below model 2's LSTF threshold. Therefore, model 2 has the better LSTF performance.
From the model-model perspective, when different models are evaluated on the same LSTF task with the same prediction horizon, the model whose prediction accuracy is statistically significantly better than the other's can be considered to have superior LSTF performance. For instance, in Fig. 13, model 1 and model 2 are evaluated on the same LSTF task with identical input time steps and prediction horizons, and model 2 exhibits significantly higher prediction accuracy than model 1. Hence, model 2 outperforms model 1 in terms of LSTF performance under these conditions.
Furthermore, we believe that comparing models purely by the numerical magnitudes of their evaluation metrics is not reliable; one should demonstrate that the prediction accuracy of one model is superior to that of another in a statistical sense. We therefore propose using the Kruskal–Wallis test [152] to assess whether the distributions of the errors between the predicted and true values of different models differ significantly. Compared with common hypothesis tests such as the standard t-test and ANOVA, the Kruskal–Wallis test does not assume that the data follow a normal distribution [153], making it suitable for comparative analysis of non-normally distributed data. It can also detect differences of any form between groups without assuming equal variances or identical distributions [154], making it widely applicable. The Kruskal–Wallis test has its limitations, however: as a non-parametric method it requires a larger sample size to achieve high statistical power [155], a requirement that the long sequences in LSTF readily satisfy.
The Kruskal–Wallis test is applied to compare multiple continuous samples when the conditions of the normal distribution are not satisfied. Its idea is to pool the n groups of samples into one dataset (under the null hypothesis that they come from the same population) and rank the pooled data from smallest to largest. Each data point receives a rank in the pooled dataset; tied observations receive the average of their ranks. If the n groups are drawn from the same population, then the mean rank of each group should not differ much from the overall mean rank of the pooled data. If the differences are large, the groups do not come from the same overall distribution; i.e., there is a significant difference between them.

Table A.18. Urban transportation application and dataset.

Dataset | Ref | Time range | Min-Granularity | Information
Paris metro line | [27] | 2009–2010 | Hour | This dataset records the passenger flow on Paris metro lines 3 and 13 and consists of 1742 observations with a data granularity of 1 h. The dataset is derived from the artificial neural network and computational intelligence forecasting competition http://www.neural-forecasting-competition.com/.
PeMS03, PeMS04, PeMS07, PeMS08 | [33], [82], [83], [84], [93], [104], [131], [143] | | 30 s | The data come from the PeMS system, which records traffic flow data for areas such as the Bay Area in California; traffic data are collected every 30 s and aggregated over five-minute periods. Each freeway sensor reports traffic flow, occupancy, and speed at each time stamp. The sensors are inductive-loop traffic detectors at mainline, exit-ramp, or entrance-ramp locations. Users can download data for the desired time period on demand from http://pems.dot.ca.gov/.
Birmingham Parking | [71] | 2016.10.4–2016.12.19 | 30 min | This dataset records the parking lot ID, capacity, occupancy, and update time for 30 parking lots operated by Birmingham National Car Park. The data cover 8:00–16:30 each day from 2016.10.4 to 2016.12.19, with a data granularity of 30 min. Data source: http://archive.ics.uci.edu/ml/datasets/parking+birmingham.
METR-LA | [82], [83] | 2012.3.1–2012.6.30 | 5 min | This dataset records traffic information collected by loop detectors on Los Angeles County freeways. Data source: https://drive.google.com/drive/folders/10FOTa6HXPqX8Pf5WRoRwcFnW9BrNZEIX.
PEMS-BAY | [82], [83] | 2017.1.1–2017.5.13 | 30 s | This dataset records traffic speed readings from 325 sensors collected by PeMS, the California Transit Agency Performance Measurement System. Data source: https://drive.google.com/drive/folders/10FOTa6HXPqX8Pf5WRoRwcFnW9BrNZEIX.
SPMD | [36] | 2015.5.10–2015.10.18 | Hour | This dataset was extracted by the University of Michigan Transportation Research Institute and records the driving records of approximately 3,000 drivers in Ann Arbor, Michigan between May 10, 2015 and October 18, 2015. Data source: https://github.com/ElmiSay/DeepFEC.
VED | [36] | 2017.11–2018.11 | Hour | This dataset contains the fuel and energy consumption of various personal vehicles (cars, convertibles, pickup trucks, SUVs, etc.) operating under different realistic driving conditions in Michigan, USA from November 2017 to November 2018. Data source: https://github.com/ElmiSay/DeepFEC.
England | [84] | 2014.1–2014.6 | 15 min | This dataset is derived from UK freeway traffic data made public by the UK government, covering January 2014 to June 2014; it contains national average speeds and traffic volumes. Data source: http://tris.highwaysengland.co.uk/detail/trafficflowdata.
TaxiBJ+ | [132] | | 30 min | This dataset records the distribution and trajectories of more than 3000 cabs in Beijing with a data granularity of 30 min. The time spans are P1: 2013.07.01–2013.10.31; P2: 2014.02.01–2014.06.30; P3: 2015.03.01–2015.06.30; P4: 2015.11.01–2016.03.31.
HappyValley | [132] | 2018.1.1–2018.10.31 | Hour | This dataset records the hourly population density of popular theme parks in Beijing from 01.01.2018 to 31.10.2018.
NYC Taxi | [111] | 2009.1.1–2016.6.30 | Hour | This dataset records details of every cab trip in New York City from 2009.01.01 to 2016.06.30.

Table A.19. Meteorological application and dataset.

Dataset | Ref | Time range | Min-Granularity | Information
Weather pollutants | [109] | 2015.1.1–2016.4.1 | Hour | This dataset contains pollutant data (PM2.5, PM10, NO2, SO2, O3, and CO) sampled hourly at 76 stations in the Beijing–Tianjin–Hebei region from January 1, 2015 to April 1, 2016 (10944 h in total). Hour-by-hour meteorological observations for the same period, including wind speed, wind direction, temperature, barometric pressure, and relative humidity, were also downloaded from the China Weather website platform, which is maintained by the China Meteorological Administration (CMA).
Beijing PM2.5 | [104] | 2010.1.1–2014.12.31 | Hour | This dataset contains hourly PM2.5 data and associated meteorological data for Beijing, China; the PM2.5 measurements are the target series. The data include dew point, temperature, barometric pressure, combined wind direction, cumulative wind speed, hours of snowfall, and hours of rainfall, for a total of 43,824 multivariate series. The dataset was obtained from https://archive.ics.uci.edu/ml/datasets.html.
Hangzhou Temperature | [126] | 2011.1–2017.1 | Day | This dataset records the daily average temperature of Hangzhou from 2011.01 to 2017.01, with a total of 2820 records. Data source: http://data.cma.cn/data/.
WTH | [33] | 2020 | 10 min | This dataset records weather conditions throughout 2020, with data recorded every 10 min, and contains 21 meteorological indicators such as temperature and humidity. Data source: https://www.bgc-jena.mpg.de/wetter/.
USHCN | [11] | 1887–2019 | Day | This dataset records continuous daily meteorological records from 1219 stations collected from each state. The data include features such as average temperature and total daily precipitation, at a data granularity of days. Data source: https://www.ncdc.noaa.gov/ushcn/introduction.
KDD-CUP | [11] | 2017.1–2017.12 | Hour | This dataset is derived from the 2018 KDD CUP Challenge air quality dataset, which recorded PM2.5 measurements from 35 monitoring stations in Beijing from January 2017 to December 2017. Data source: https://www.kdd.org/kdd2018/kdd-cup.
US | [131] | 2012–2017 | Hour | This dataset records weather data from 36 weather stations in the U.S. from 2012 to 2017. The data granularity is hourly, and the data contain features such as temperature, humidity, pressure, wind direction, wind speed, and weather description. Data source: https://www.kaggle.com/selfishgene/historical-hourly-weather-data.

Table A.20. Medicine application and dataset.

Dataset | Ref | Time range | Min-Granularity | Information
ILI | [33], [92] | 2002–2021 | Week | This dataset records weekly data on patients with influenza-like illness (ILI) from the Centers for Disease Control and Prevention from 2002 to 2021, describing the ratio of ILI patients to the total number of patients. Data source: https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html.
COVID-19 | [128] | 2020.1.22–2020.6.17 | Day | This dataset records daily data on confirmed and recovered cases collected from January 22, 2020 to June 17, 2020 in six countries, including Italy, Spain, China, the United States, and Australia. Data source: https://github.com/CSSEGISandData/COVID-19, accessed on 17/06/2020.
2020 OhioT1DM | [156] | 2020 | 5 min | This dataset is the glycemic prediction dataset of the 2nd BGLP Challenge 2020. It records 8 weeks of continuous glucose monitoring, insulin, physiological sensor, and self-reported life-event data for each of 12 patients with type 1 diabetes. Specifically, it contains CGM glucose levels every 5 min; glucose levels from regular self-monitoring of blood glucose (finger sticks); bolus and basal insulin doses; self-reported meal times and carbohydrate estimates; self-reported exercise, sleep, work, stress, and illness times; and data from Basis Peak or Empatica Embrace bands. Data source: http://smarthealth.cs.ohio.edu/OhioT1DM-dataset.html.
MIMIC-III | [11] | 2001–2012 | Hour | This dataset is a public clinical dataset with over 58,000 admission records. It includes several clinical characteristics such as ICU stay data, glucose, and heart rate.
The Kruskal–Wallis test procedure is as follows.
  • (1)
    Suppose that there are $m$ mutually independent simple random samples $\{X_1,\dots,X_{n_i}\}$ $(i=1,\dots,m)$ and that all $N=\sum_{i=1}^{m} n_i$ observations are pooled and arranged in increasing order.
  • (2)
    Denote by $R_i$ $(i=1,\dots,m)$ the sum of the ranks of the $n_i$ observations $\{X_1,\dots,X_{n_i}\}$ of the $i$th sample in this arrangement.
  • (3)
    Calculate the statistic $H$: $$H=\frac{12}{N(N+1)}\sum_{i=1}^{m}\frac{R_i^{2}}{n_i}-3(N+1)\tag{12}$$ If the samples contain $r$ tied values and $t_i$ $(i=1,\dots,r)$ is the number of times the $i$th tied observation occurs among all $N$ observations, then the following modified statistic $H'$ is calculated ($H$ and $H'$ approximately follow a chi-square distribution with $v=m-1$ degrees of freedom when $N$ is sufficiently large): $$H'=\frac{H}{1-\sum_{i=1}^{r}\left(t_i^{3}-t_i\right)\big/\left(N^{3}-N\right)}\tag{13}$$
  • (4)
    For a given significance level $\alpha$ and $v=m-1$ degrees of freedom, the upper quantile $\chi^{2}_{\alpha,v}$ of the chi-square distribution is obtained, and the hypothesis of no significant difference between the samples is rejected when $H\ge\chi^{2}_{\alpha,v}$ (or $H'\ge\chi^{2}_{\alpha,v}$), meaning that the $m$ samples are considered statistically significantly different from each other.
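The steps above can be sketched directly in numpy. The following toy implementation (Eqs. (12)–(13); sample values and model names are hypothetical) compares the absolute-error distributions of two forecasting models, as in the proposed model-model evaluation:

```python
import numpy as np

def kruskal_wallis_H(*samples):
    """Kruskal-Wallis H statistic with tie correction, following Eqs. (12)-(13)."""
    samples = [np.asarray(s, dtype=float) for s in samples]
    data = np.concatenate(samples)
    N = data.size
    # Rank all N observations jointly; ties get the average of their ranks.
    vals, inv, counts = np.unique(data, return_inverse=True, return_counts=True)
    avg_ranks = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_ranks[inv]
    # Eq. (12): H from the per-sample rank sums R_i.
    H, start = 0.0, 0
    for s in samples:
        R_i = ranks[start:start + s.size].sum()
        H += R_i ** 2 / s.size
        start += s.size
    H = 12.0 / (N * (N + 1)) * H - 3.0 * (N + 1)
    # Eq. (13): correction factor when ties are present (1.0 if no ties).
    correction = 1.0 - (counts ** 3 - counts).sum() / (N ** 3 - N)
    return H / correction

# Toy model-model comparison: absolute errors of two hypothetical models
# on the same LSTF task with the same prediction horizon.
err_model1 = [0.9, 1.1, 1.4, 1.6, 1.8, 2.0]
err_model2 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
H = kruskal_wallis_H(err_model1, err_model2)
# chi-square upper 0.05 quantile with v = m - 1 = 1 degree of freedom is 3.841,
# so here H (about 8.31) rejects the hypothesis of no significant difference.
print(f"H = {H:.3f}; significant at 0.05: {H >= 3.841}")
```

In practice `scipy.stats.kruskal` implements the same test and additionally returns the p-value; the hand-rolled version here only mirrors the procedure listed above.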

5. Applications and datasets

In this section, we summarize the applications of LSTF in the financial, industrial energy, urban transportation, medicine, and meteorological fields, as shown in Fig. 14, as well as the publicly available datasets in each application field. We collected commonly used datasets from the literature and present the basic information of each dataset for reference. The survey results are presented in Appendix A.

5.1. Finance

In finance, LSTF is often used for predicting economic cycles [51], fiscal cycles, and long-term trends in stocks [157]; it can help financial traders make more favorable decisions or help companies make better financial decisions based on predicted revenue and expenditure. For example, in the stock market, LSTF can help investors develop more accurate investment strategies by predicting trends and fluctuations in future stock prices [1]. In financial planning, LSTF can assist individuals and businesses [26] by predicting future financial conditions such as income, expenses, and profitability, enabling better planning of financial goals and fund operations. In addition, LSTF can assist financial institutions in loan risk assessment by predicting borrowers' future repayment ability and credit risk [158], or predict future interest-rate trends to support monetary and interest-rate policy. We summarize recent prediction tasks and open-source datasets in the finance field in Appendix A.1.

5.2. Industrial energy

In the industrial energy field, LSTF is often used to assist in formulating long-term strategic resource plans [25]. As shown in Appendix A.2, there are many works and open-source datasets related to industrial energy prediction. LSTF can help companies and governments predict future energy demands, such as electricity, oil, and natural gas, in order to better plan energy production and supply. It can also help power companies predict future power generation [54] to ensure an adequate and stable supply. In addition, LSTF can help governments and companies develop energy policy plans or manage energy supply chains [159]. Long-term forecasting in the industrial energy field thus has applications in many aspects, helping companies and governments plan and manage better, reduce risks, improve efficiency, and achieve sustainable development.

5.3. Urban transportation

In the field of urban transportation, LSTF has been applied in various aspects. It can assist urban transportation management departments in predicting future traffic flow [160] for better traffic planning and management. LSTF can also be used to forecast future traffic congestion, which can help in formulating strategies and measures to alleviate congestion [2]. In addition, long-term forecasting can assist urban transportation management departments in predicting future traffic accident risks and traffic safety issues [161], enabling better traffic safety management and accident prevention. The datasets are shown in Appendix A.3.

5.4. Meteorology

The application of LSTF in meteorology is generally focused on long-term trend forecasting of climate, providing decision support for industries such as agriculture, marine transportation, and natural weather disaster warning. For example, LSTF can be used to forecast long-term climate change [162], including global climate change trends, temperature, precipitation, and other indicators, providing scientific basis for national policy-making in response to climate change. LSTF can also be used to issue early warnings for natural weather disasters such as heavy rain, floods, and typhoons [3], allowing for adequate preparation and mitigation of potential harm to human life and property. Additionally, LSTF can predict information such as sea surface temperature and marine meteorology for the next few months or years [163], providing decision support for industries such as fishing and marine transportation. The datasets are shown in Appendix A.4.

5.5. Medicine

The development of the medical field bears the responsibility of safeguarding and advancing human society. LSTF can be applied to various stages of drug development. For instance, it can predict the toxicity, pharmacokinetics, pharmacodynamics, and other parameters of a drug, helping researchers optimize its design and screening process [164]. Additionally, long-term forecasting can be used to predict the market prospects and sales of a particular drug, aiding in the development of marketing strategies. LSTF can also predict future medical needs over a certain period of time, such as the population structure and incidence of chronic diseases in a particular region [165]. These forecasting results can be utilized for the rational allocation and planning of medical resources. The datasets are shown in Appendix A.5.

6. Performance evaluation

In this section, we discuss prediction performance evaluation metrics in the field of TSF based on nearly 80 pieces of literature. Here, we employ the classification method introduced by [166] to divide the prediction accuracy metrics into three groups: scale-dependent, scale-independent, and scaled error metrics, which are based on whether the evaluation metrics are affected by the data scale and the way in which the data scale effects are eliminated. The evaluation methods used in these papers are summarized and described in detail in Table 9. Moreover, we analyze the characteristics of each metric appearing in the papers in terms of the above classification metrics.

6.1. Scale-dependent measures

Scale-dependent measures are evaluation metrics whose values depend on the scale of the original data; they are the most widely used evaluation metrics in forecasting. The advantage of this type of metric is that it is computationally simple and can be used to compare different methods applied to the same dataset. Notably, its limitation in TSF is that algorithm models cannot be compared across datasets with different scales, and such metrics are dominated by datasets with particularly large data scales, which hinders subsequent comparative studies. We summarize the currently popular scale-dependent measures as follows.
  • $\mathrm{MAE}=\frac{1}{m}\sum_{i=1}^{m}\left|y_i-\hat{y}_i\right|$
  • $\mathrm{MSE}=\frac{1}{m}\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^{2}$
  • $\mathrm{LAE}=\max\left\{\left|y_i-\hat{y}_i\right|,\; i=1,\dots,m\right\}$
  • $\mathrm{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i-\hat{y}_i\right)^{2}}$
  • $\mathrm{MSLE}=\frac{1}{m}\sum_{i=1}^{m}\left(\log\left(y_i+1\right)-\log\left(\hat{y}_i+1\right)\right)^{2}$
  • $\mathrm{MALE}=\frac{1}{m}\sum_{i=1}^{m}\left|\log y_i-\log\hat{y}_i\right|$
  • $\mathrm{RMSLE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\log y_i-\log\hat{y}_i\right)^{2}}$
Here $\hat{y}_i$ denotes the predicted value, $y_i$ the true value, $m$ the sample size, and $\bar{y}$ the mean of the true sample data.
Among all scale-dependent measures, the root mean square error (RMSE) and mean absolute error (MAE) are the most widely used metrics. The LAE metric reflects only the maximum error in the prediction results; it is susceptible to outliers and cannot reflect the overall fit, so it has been used less in DL work in recent years. The RMSE is effective due to its simplicity and close relationship with statistical modeling; for unbiased forecasting methods, it gives exactly the same value as the standard deviation of the forecast error. The main difference between the RMSE and MSLE is their sensitivity to outliers in the given dataset: the MSLE damps the effect of outliers, while the RMSE is relatively more sensitive to outlying data points, so the MSLE is more robust on datasets containing outliers. In contrast, the MAE reflects the true error magnitude better than the RMSE. The RMSLE is useful in situations where under-predictions should be penalized significantly more heavily than over-predictions.
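The definitions above translate directly into numpy. The following sketch (helper name and toy values are illustrative, not from the surveyed works) also shows how a single large-scale point dominates the MSE/RMSE while the log-based metrics damp its influence:

```python
import numpy as np

def scale_dependent_metrics(y, y_hat):
    """Illustrative helper implementing the scale-dependent measures listed above."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    err = y - y_hat
    log_err = np.log(y) - np.log(y_hat)  # MALE/RMSLE require positive values
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": np.mean(err ** 2),
        "LAE": np.max(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MSLE": np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2),  # log(x + 1)
        "MALE": np.mean(np.abs(log_err)),
        "RMSLE": np.sqrt(np.mean(log_err ** 2)),
    }

# Toy series whose last point has a much larger scale than the others.
print(scale_dependent_metrics([10.0, 20.0, 1000.0], [12.0, 18.0, 900.0]))
```

On this toy input the RMSE is pulled toward the error of the large-scale point, whereas the MAE and the log-scaled metrics remain comparatively moderate, matching the sensitivity discussion above.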

6.2. Scale-independent measures

Scale-independent measures, in contrast to scale-dependent measures, are evaluation metrics that are not affected by the scale of the original data. These metrics also have some drawbacks: some are undefined when the actual value is zero (e.g., the MAPE), and some have a skewed distribution when the actual value is close to zero. We collect and list the most popular scale-independent measures:
  • $\mathrm{MAPE}=\frac{100}{m}\sum_{i=1}^{m}\left|\frac{y_i-\hat{y}_i}{y_i}\right|$;
  • $\mathrm{IA}=1-\frac{\sum_{i=1}^{m}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{m}\left(|\hat{y}_i-\bar{y}|+|y_i-\bar{y}|\right)^2}$;
  • $\mathrm{RRSE}=\sqrt{\frac{\sum_{i=1}^{m}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{m}(y_i-\bar{y})^2}}$;
  • $R^2=1-\frac{\sum_{i=1}^{m}(\hat{y}_i-y_i)^2}{\sum_{i=1}^{m}(y_i-\bar{y})^2}$;
  • $\mathrm{sMAPE}=\frac{1}{m}\sum_{i=1}^{m}\frac{|\hat{y}_i-y_i|}{\left(|\hat{y}_i|+|y_i|\right)/2}$;
  • $\mathrm{NRMSE}=\frac{\sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i-\hat{y}_i)^2}}{y_{\max}-y_{\min}}$, where $y_{\max}$ is the maximum value of the real sample data and $y_{\min}$ is the minimum value of the real sample data;
  • $\mathrm{RAE}=\frac{\sum_{i=1}^{m}|y_i-\hat{y}_i|}{\sum_{i=1}^{m}|y_i-\bar{y}|}$;
  • $\mathrm{RMSPE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\frac{y_i-\hat{y}_i}{y_i}\right)^2}$;
  • $\mathrm{NMAE}=\frac{\sum_{i=1}^{m}|y_i-\hat{y}_i|}{\sum_{i=1}^{m}y_i}$.
By scaling the error according to the scale of the real data, the value of a scale-independent measure may be limited to an interval range, such as [0,1]. The use of scale-independent measures can therefore largely avoid the influence of the real data's scale on the value of the evaluation metric. However, such metrics also have defects due to the presence of the scale factor: (1) when the true value is 0, some metrics are undefined because of a zero denominator, such as the MAPE; (2) when the true value is close to 0, some metrics produce skewed distributions, such as the MAPE. Makridakis [167] argued that the skewed distribution of the MAPE stems from the unequal penalties on positive and negative prediction errors and proposed the sMAPE to address this problem, while Goodwin and Lawton [168] pointed out that the sMAPE is still asymmetric.
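The zero-denominator and near-zero problems above can be seen directly in code. A hedged sketch with toy values and illustrative function names (both scores are expressed in percent here; the per-point contribution of the sMAPE is bounded by 200):

```python
import numpy as np

def mape(y, y_hat):
    """MAPE: undefined when any y_i == 0, and it explodes as y_i -> 0."""
    return 100.0 / len(y) * np.sum(np.abs((y - y_hat) / y))

def smape(y, y_hat):
    """sMAPE [167]: symmetric denominator bounds each term by 200,
    though Goodwin and Lawton [168] note it remains asymmetric."""
    return 100.0 / len(y) * np.sum(
        np.abs(y_hat - y) / ((np.abs(y_hat) + np.abs(y)) / 2))

y     = np.array([0.01, 10.0, 20.0])  # first true value is near zero
y_hat = np.array([0.50, 10.5, 19.5])

print(mape(y, y_hat))   # the near-zero y_i inflates the score
print(smape(y, y_hat))  # bounded per-point contribution
```

A single near-zero true value pushes the MAPE into the thousands of percent, while the sMAPE stays bounded; this is the skew the text describes.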

6.3. Scaled errors

Scaled errors were first proposed by Hyndman and Koehler [166]; they eliminate the effect of data scale by comparing the prediction results with those of a baseline method (usually the naive method). The following scaled errors are commonly used:
  • $\mathrm{MASE}=\frac{\mathrm{mean}\left(|y_i-\hat{y}_i|\right)}{\frac{1}{m-1}\sum_{i=2}^{m}|y_i-y_{i-1}|}$;
  • $\mathrm{MdASE}=\frac{\mathrm{median}\left(|y_i-\hat{y}_i|\right)}{\frac{1}{m-1}\sum_{i=2}^{m}|y_i-y_{i-1}|}$.
There is a simple way to understand these metrics: the denominator can be read as the in-sample average error of one-step-ahead naive forecasts. If the MASE > 1, the method under evaluation is worse than the naive forecast, and vice versa. The MdASE is similar to the MASE, but because the MASE is computed with the mean it is more susceptible to outliers, while the MdASE, computed with the median, is more robust. However, such metrics only reflect a comparison with the baseline method and cannot convey the absolute error of the prediction results.
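A sketch of the MASE computed against the one-step naive benchmark (toy data; the $\frac{1}{m-1}$ in-sample scaling follows the Hyndman and Koehler [166] definition):

```python
import numpy as np

def mase(y, y_hat):
    """MASE: forecast MAE scaled by the in-sample MAE of the
    one-step-ahead naive forecast y_i <- y_{i-1} [166]."""
    # (1/(m-1)) * sum_{i=2}^{m} |y_i - y_{i-1}|
    naive_mae = np.mean(np.abs(np.diff(y)))
    return np.mean(np.abs(y - y_hat)) / naive_mae

y    = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
good = y + 0.1               # small constant bias
bad  = np.full_like(y, 8.0)  # crude constant forecast

print(mase(y, good))  # < 1: better than the naive benchmark
print(mase(y, bad))   # > 1: worse than the naive benchmark
```

Swapping `np.mean` for `np.median` in the numerator gives the MdASE, which is less sensitive to a few large errors.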
As seen from Table 9, the MAE, RMSE and MAPE are the most commonly used evaluation metrics. The MAE reflects the true error situation, while the RMSE magnifies the gaps between larger errors; the MAPE can be used to compare forecasts at different data scales and is easy to calculate. The combination of the three can thus reflect the overall fitting effect, including the real error situation and the outlier situation, so the MAE, RMSE and MAPE are used as evaluation indicators in the subsequent sections.

7. Case study

In this section, based on the works reviewed above, we select the algorithm models of nine excellent works; from Section 5, we select five well-known benchmark datasets in the fields of finance, energy, meteorology, transportation and medicine for experiments; and based on the analysis in Section 6, we use the MAE, RMSE and MAPE as evaluation metrics. Finally, we give a performance comparison for LSTF.
The selected datasets are as follows.
  • ETT: The ETT (Electricity Transformer Temperature) dataset is a long-term electricity dataset proposed by [17], which comprises several subsets with different temporal granularities, i.e., ETTh1, ETTh2 and ETTm1. The ETTh1 dataset is used in this paper; it includes load and oil temperature values recorded every hour from July 2016 to July 2018.
  • Stock: The Stock dataset from http://finance.yahoo.com contains multiple features, such as the opening price, high price, low price, closing price, and trading volume, for each day from January 1, 2000, to February 28, 2022.
  • PEMS03: This dataset includes traffic data from the California Highway Network PEMS, a database collected by Caltrans on California freeways, as well as datasets from other Caltrans and partner agencies. We select 358 nodes in the San Francisco Bay Area freeways with sensor-measured road occupancy values collected every 5 minutes from January 9, 2018, to November 30, 2018.
  • WTH: This dataset is derived from weather data collected from more than 1,600 areas in the United States from 2010 to 2013. The data were collected every hour and contain 12 weather characteristics, including local humidity, for each area.
  • COVID19: This dataset contains confirmed coronavirus disease (COVID-19) cases, reported deaths, and reported recoveries for each day from 2020 to the present. It includes time series tracking the number of people affected by COVID-19 globally: reported confirmed infections, deaths while infected, and recoveries. The dataset is maintained by a team at the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, who collate data from around the world and update it daily.
The data granularities and feature information of these five datasets are shown in Table 10. Our implementation details are as follows.
Based on the review in Section 3 of this paper, we select six Transformer-based models from excellent recent work at top venues: Pyraformer [96], FEDformer [97], Informer [17], Autoformer [33], Reformer [172], and the Transformer [32], together with three other excellent models: the MTGNN [82], Graph WaveNet [173], and LSTNet [105], as shown in Table 11.

Table A.21. Competition applications and datasets.

Dataset | Ref | Time range | Min-Granularity | Information
M5 | [169] | – | Hour | The M5 contest is the latest in the M competition series and ran from March 2 to June 30, 2020. The contest dataset uses hierarchical sales data generously provided by Walmart, starting at the item level and then aggregated to departments, product categories and stores in three geographic regions of the U.S. (California, Texas and Wisconsin). In addition to the time-series data, it includes explanatory variables that affect prices, such as price, promotions, day of the week and special events, which are used to improve forecast accuracy. Data source: https://github.com/Mcompetitions/M5-methods.
M4 | [91], [170] | – | Hour | The M4 dataset was created by randomly selecting 100,000 time series from the ForeDeCk database. It consists of yearly, quarterly, monthly and other (weekly, daily and hourly) series, divided into a training set and a test set. Data source: https://github.com/Mcompetitions/M4-methods/tree/master/Dataset.
M3 | [171] | – | Hour | The M3 competition dataset contains 3003 time series spanning micro, industry, macro, finance, demographic and other categories. The granularities include yearly, quarterly, monthly and other (weekly, daily and hourly) series. Data source: https://forecastingdata.org/.
We adopt the MSE as the loss function and use the adaptive moment estimation (Adam) optimizer with an initial learning rate of $10^{-4}$ and a batch size of 32; early stopping with a patience of 3 serves as the termination condition for training. Z-score normalization is applied to all datasets. The input_len is set to 96 for the ETTh1, PEMS03, WTH, and Stock datasets and 36 for the COVID19 dataset. The predict_horizon settings are {12, 48, 192, 384, 768} for the ETTh1, PEMS03, and WTH datasets, {12, 48, 96, 288, 480} for the Stock dataset, and {12, 24, 48, 60, 72} for the COVID19 dataset. The training, validation, and test sets of each dataset are divided according to a 7:1:2 ratio. The hyperparameters of each model follow the settings in its source work. All models are implemented in PyTorch using the open-source code of the original papers; every experiment is repeated 5 times and trained on dual NVIDIA TITAN X (Pascal) 12-GB GPU workstations. Following the analysis in Section 6, the MAE, RMSE, and MAPE are used as evaluation metrics.
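A minimal sketch of the data pipeline described above: chronological 7:1:2 split, Z-score normalization, and sliding input/horizon windows. Taking the normalization statistics from the training split only is a common leakage-avoiding choice that we assume here rather than quote from the paper's code; the function names are illustrative:

```python
import numpy as np

def split_and_normalize(series, ratios=(0.7, 0.1, 0.2)):
    """Chronological 7:1:2 split; Z-score statistics come from the
    training split only (assumed, to avoid test-set leakage)."""
    m = len(series)
    n_train = round(ratios[0] * m)   # round() avoids float truncation
    n_val = round(ratios[1] * m)
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    mu, sigma = train.mean(), train.std()
    z = lambda x: (x - mu) / sigma
    return z(train), z(val), z(test)

def make_windows(series, input_len=96, horizon=48):
    """Slide (input_len, horizon) input/target pairs over a 1-D series."""
    xs, ys = [], []
    for i in range(len(series) - input_len - horizon + 1):
        xs.append(series[i:i + input_len])
        ys.append(series[i + input_len:i + input_len + horizon])
    return np.stack(xs), np.stack(ys)

series = np.sin(np.linspace(0, 50, 1000))   # synthetic stand-in series
train, val, test = split_and_normalize(series)
X, Y = make_windows(train, input_len=96, horizon=48)
print(X.shape, Y.shape)
```

The same windowing is applied per dataset with the input_len and predict_horizon values listed above.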

7.1. Main results and analysis

After conducting experiments on the ETTh1 dataset, we obtain the results shown in Table 12, where the two best results under each horizon are bolded. On the ETTh1 dataset, the MTGNN and FEDformer perform better when the prediction horizon is small, but as the prediction horizon grows, Autoformer gradually outperforms the MTGNN, which shows that on ETTh1, the FEDformer and Autoformer models are more suitable for LSTF.
After conducting experiments on the Stock and COVID19 datasets, as shown in Table 13, we find that the progressive trend and seasonal decomposition mechanism yields good results when a dataset has few relevant features and contains multiple hidden periodicities and trends. Autoformer and FEDformer achieve far better performance than the other models on both datasets, which contain multiple repeated cycles and trends that are hard to capture. The Stock dataset exhibits daily, weekly, monthly, quarterly and yearly cyclical changes and is also affected by unusual events such as the 2008 financial crisis; COVID19, on the other hand, has a low data volume but shows an extremely strong exponential growth trend at the beginning, followed by volatility and a decreasing trend with seasonal and yearly changes. Both Autoformer and FEDformer apply progressive trend and seasonal decomposition and exploit the characteristics of time series in the frequency domain, and we believe this mechanism is the key to their far superior performance on the Stock and COVID19 datasets.
Experimenting on the PEMS03 dataset, as shown in Table 14, we find that Transformer-type models are more suitable for LSTF than the MTGNN-type model when many data feature variables and complex correlations are present. In the PEMS03 dataset, the spatial topology information is pronounced; we take the data of each node as a feature, so the dataset has 358 features, more than the other datasets (ETTh1 has 7 feature variables, and Stock has 12 feature variables). The MTGNN spatio-temporal prediction model and the original Transformer perform better on this dataset, where the correlations between feature variables are complicated owing to their number. The MTGNN achieves good prediction results by adaptively learning the topological graph structure of the data, while the Transformer does so via full self-attentive dot-product operations. The remaining Transformer-class models, because they reduce the computational complexity of self-attention, inevitably lose some of the dependencies among the data when the correlations among feature variables are complex. This effect is not obvious on datasets with few feature variables but is gradually revealed on data with many feature variables, such as PEMS03. As the prediction horizon grows, Reformer gradually outperforms the MTGNN, and the models with the best LSTF performance on this dataset are Reformer and the Transformer, the two models with the highest computational time complexity among the selected Transformer-class models.

Fig. B.15. The Evolution of LSTF Networks.

On the WTH dataset, we find that Reformer and Pyraformer deliver good LSTF performance. Since the WTH weather data also contain obvious spatial topological information and spatial correlations, the MTGNN performs well when the prediction horizon is short, but as on the other datasets, it gradually lags behind the Transformer-type models as the prediction horizon increases.
In summary, we obtain the following conclusions from our experiments.
  • (1)
    Spatio-temporal prediction models such as the MTGNN achieve great results on spatially correlated datasets such as traffic datasets and weather datasets, but their LSTF results on these datasets are not as good as those of Transformer-type models.
  • (2)
    The progressive trend and seasonal decomposition mechanism is more suitable for cases where few relevant features are contained in the dataset and there are multiple repeated cycles and trends among the data. The characteristics of time series in the frequency domain are worth studying.
  • (3)
When reducing the computational complexity of self-attention, Transformer-based models inevitably lose some of the dependencies between data points on datasets with many feature variables and complex correlations, leading to a decrease in prediction accuracy.
  • (4)
Transformer-based models are more practical for LSTF. The above experimental results show that Transformer-based models achieve better LSTF results on most of the datasets.

7.2. Performance comparisons among long sequence time-series forecasting models

After the above experiments with nine models on each of the five described datasets, how can we compare their LSTF performance? We use the theory proposed in Section 4.4 to assess performance from two perspectives: model-model and model-self. Furthermore, we believe that comparing models by the numerical magnitude of an evaluation metric alone is not sufficient; we must demonstrate a statistically significant difference between the predictions of model 1 and model 2 for a given prediction scenario. Therefore, we use the Kruskal–Wallis test to determine whether the distributions of the differences between the true and predicted values of the different models differ significantly.
As examples, we illustrate the results of Informer, Autoformer, FEDformer, Pyraformer, Reformer, the Transformer, and LSTNet on the ETTh1 and Stock datasets. When the models make predictions with predict_horizon = 768 on the ETTh1 dataset and predict_horizon = 480 on the Stock dataset, we perform a Kruskal–Wallis test on the differences between each model's predicted values and the true values. We use a two-sided test at a significance level of 0.05 to check whether the distributions of the prediction errors of two models differ significantly. The results are shown in Table 15.
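The test itself is standard; the following self-contained sketch implements the two-sample Kruskal–Wallis H test (two groups, so df = 1; ties are ignored since the residuals are continuous) with NumPy and the standard library. The residual series here are simulated stand-ins, not the paper's experimental outputs:

```python
import math
import numpy as np

def kruskal_wallis_two(a, b):
    """Kruskal-Wallis H test for two samples.
    With two groups, H follows a chi^2 distribution with df = 1."""
    pooled = np.concatenate([a, b])
    n = len(pooled)
    ranks = np.empty(n)
    ranks[np.argsort(pooled)] = np.arange(1, n + 1)  # ranks 1..n, no ties
    r_a, r_b = ranks[:len(a)].sum(), ranks[len(a):].sum()
    h = 12.0 / (n * (n + 1)) * (r_a**2 / len(a) + r_b**2 / len(b)) - 3 * (n + 1)
    p = math.erfc(math.sqrt(h / 2.0))  # chi^2(df=1) survival function
    return h, p

rng = np.random.default_rng(0)
# Simulated prediction errors (true - predicted) of two hypothetical models:
res_model1 = rng.normal(loc=0.0, scale=0.3, size=500)  # unbiased, small errors
res_model2 = rng.normal(loc=0.5, scale=0.3, size=500)  # systematically biased

h, p = kruskal_wallis_two(res_model1, res_model2)
print(f"H = {h:.1f}, p = {p:.3g}")
print("significant" if p < 0.05 else "not significant", "at the 0.05 level")
```

In practice one would feed each model's error series on the same test window into the test; a p-value below 0.05 indicates that the two models' error distributions differ significantly, as in the Table 15 comparisons.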
From the model-model perspective, Table 15 shows that, for a prediction horizon of 768 on ETTh1, Autoformer's predictions differ significantly from those of every other model, suggesting that it outperforms them in long-term prediction on this task; the significance level of the difference between Autoformer and FEDformer exceeds 0.05, indicating no significant difference, so we conclude that their long-term predictive performance is similar here. Likewise, on the Stock dataset, Autoformer significantly outperforms the other models with statistical differences, and its LSTF performance is again similar to that of FEDformer, with no significant difference observed in this prediction task.

8. Conclusion and future outlook

Based on our findings, we propose tentative future research directions for the field of LSTF as follows.
  • (1)
At present, the compound model and the Transformer-based model represent the main research directions in LSTF. As evidenced by our findings in Section 3, the majority of LSTF-related works in 2021 and 2022 were built on Transformer-based and compound models, which are designed to capture long-term dependencies [18]. The Transformer-based model currently appears to be one of the most effective solutions for LSTF, as indicated by previous studies [17], [33], and the compound model, which combines the strengths of various techniques to offset their individual limitations, has also been shown to be an effective approach.
  • (2)
Progressive trend and seasonal decomposition mechanisms should be introduced. In real scenarios, datasets often contain few relevant features [174], and multiple cycles and trends are hidden and repeated in the data. We experimentally find that the progressive trend and seasonal decomposition mechanism [47] may be helpful in this case and merits in-depth investigation in the future.
  • (3)
Research on Transformer-based models still has inherent limitations in LSTF. The Transformer architecture is known for its high model capacity and ability to model long-term dependencies [111], making it a popular choice for LSTF, but it suffers from high time complexity [91], and most current optimization efforts for Transformers focus on this issue [96], [101]. Our experiments show that when self-attention is simplified to reduce complexity, Transformer-based models may unavoidably lose some dependencies between data points in situations with many feature variables and complex correlations, reducing prediction accuracy. Transformer-based models are therefore not suitable for all LSTF tasks, and other feature extraction frameworks can achieve performance comparable to Transformers [81]; the choice of an appropriate feature extraction model should be based on the specific requirements of the task.
  • (4)
Capturing dynamic graphs and long-term dependencies is critical for spatio-temporal long-term prediction. In this setting, the hidden topological structure in the data typically undergoes significant changes as the prediction horizon increases [175], so using static graphs for prediction can introduce significant biases [82], making the modeling of dynamic graphs essential. In addition, we have found that spatio-temporal prediction models that incorporate dynamic graphs, such as the MTGNN, perform better on datasets with significant spatial correlations, such as transportation and weather, but their long-term prediction performance is not as good as that of Transformer-based models, suggesting that their architecture is not optimal for extracting long-term dependencies. How to combine the strengths of these two types of models is therefore one of the future research directions for improving spatio-temporal long-term prediction.
In this article, we present a comprehensive overview of DL in LSTF. We provide multi-angle definitions of LSTF and propose a new taxonomy of LSTF research based on the mainstream feature extraction structures of current models, classifying them into RNN-based, CNN-based, GNN-based, Transformer-based, compound, and miscellaneous methods according to their main predictive network structures. To evaluate LSTF performance fairly, we apply the Kruskal–Wallis test and propose a new LSTF performance evaluation method from two perspectives. Additionally, we have collected abundant resources on TSF and LSTF, including datasets, application fields, evaluation metrics, and model comparison code libraries, all provided via the open-source link. Finally, based on our research and experimental results, we summarize four possible future research directions.
Our review provides a comprehensive overview of the recent developments in LSTF and offers researchers valuable insights into improving their models with appropriate feature extraction methods. Our 'Abundant Resources' provide researchers with easy access to critical information on application domains, datasets, evaluation metrics, and LSTF method codes. Additionally, our newly proposed metrics can facilitate researchers in further analyzing and comparing the performance of LSTF models.

CRediT authorship contribution statement

Zonglei Chen: Conceptualization, Methodology, Investigation, Formal analysis, Writing – original draft, Writing – review & editing, Visualization. Minbo Ma: Conceptualization, Methodology, Investigation, Formal analysis, Writing – original draft, Writing – review & editing. Tianrui Li: Conceptualization, Methodology, Investigation, Formal analysis, Writing – original draft, Writing – review & editing, Funding acquisition. Hongjun Wang: Conceptualization, Methodology, Investigation, Formal analysis, Writing – original draft, Writing – review & editing, Funding acquisition. Chongshou Li: Conceptualization, Methodology, Investigation, Formal analysis, Writing – original draft, Writing – review & editing, Funding acquisition, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant Nos. 62202395, 62176221, and 62276216), Natural Science Foundation of Sichuan Province (Grant No. 2022NSFSC0930), and Fundamental Research Funds for the Central Universities, China (Grant No. 2682022CX067).

Appendix A. Datasets

A '–' in the time range column means that the data span multiple different time ranges or that the dataset is maintained and dynamically updated.

A.1. Finance

See Table A.16.

A.2. Industrial energy

See Table A.17.

A.3. Urban transportation

See Table A.18.

A.4. Meteorological

See Table A.19.

A.5. Medicine

See Table A.20.

A.6. Competition

See Table A.21.

Appendix B. LSTF networks evolution

A finer-grained LSTF Networks Evolution is shown in Fig. B.15.

Data availability

We give open-source links to the data used in this article.

References
