1 Introduction

Big Data and the new tools and techniques it has enabled, such as Big Data Analytics (BDA), have completely changed how organisations and businesses work, opening up enormous new opportunities for corporations, professionals, and academics (Sun et al. 2018). In addition to businesses and academic institutions, governmental and non-governmental organisations now regularly generate vast volumes of data of unique breadth and complexity (Debortoli et al. 2014). As a result, it is now vital for organisations all over the world to extract relevant data and valuable advantages from these vast data resources (Sarkar 2017). However, as the research reveals, it can be challenging to quickly and expertly extract valuable insights from massive data (Zakir et al. 2015). BDA is now undeniably required for most organisations to realise the full value of big data, improve business performance, and increase market share. Recent studies concentrate on deep learning methods to draw reliable conclusions from huge volumes of data.

Deep learning (DL) is a branch of machine learning (ML). Big data, which receives a great deal of attention when such decision-making systems are created, may be exploited in its entirety using DL algorithms. The idea is to automatically extract features using massive artificial neural networks and then use these features to inform decisions (Nguyen et al. 2019). The term “deep” here refers to the number of hidden layers in the neural network; as the network grows deeper, the model’s performance increases. The ability to train on unlabelled data is one of deep learning’s distinguishing characteristics (Gheisari and ZakirulAlam 2017). Deep learning algorithms can process both structured and unstructured data and can learn without human supervision. Many different industries, including healthcare, finance, banking, and e-commerce, can benefit from deep learning. Any data whose elements are arranged in sequences is referred to as sequential data. To model problems involving time-dependent and sequential data, such as text generation, artificial intelligence, and stock market prediction, recurrent neural networks (RNN) are employed to analyse sequential data (Sarker 2021; Akbal and Ünlü 2022b). RNN and its variants, such as gated recurrent units (GRU) and long short-term memory (LSTM), have become the key building blocks for understanding sequential internet data in a number of academic fields, such as natural language processing and voice data analysis (Diao et al. 2019).

LSTM, an implementation of the recurrent neural network, was originally proposed by Hochreiter and Schmidhuber in 1997 (Hochreiter and Schmidhuber 1997). Unlike the feed-forward network designs described earlier, LSTM can learn to perform tasks that require memory or state awareness. By allowing gradients to pass unmodified, LSTM partially overcomes the vanishing-gradient issue, a significant RNN restriction. Gated recurrent units (GRU) are a more compact variant of the LSTM (Körner and Marc 2021). Owing to the absence of the output gate, GRUs are smaller than LSTMs and can outperform them only on certain simpler datasets (Cho et al. 2014b). Recurrent neural networks using LSTMs are capable of tracking long-term dependencies. Naul et al. developed GRU- and LSTM-based autoencoders for automatically extracting features (Naul et al. 2018). An autoencoder, which learns a representation of the input data set, is used to reduce dimensionality and reconstruct the original data set in an unsupervised manner; backpropagation serves as the foundation of its learning algorithm (Ng and Auto encoders 2018).
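As a rough illustration of the autoencoder idea described above, the following PyTorch sketch pairs a GRU encoder with a GRU decoder to compress sequences into a low-dimensional code and reconstruct them. All layer names and sizes are illustrative assumptions, not the architecture of Naul et al. (2018).

```python
import torch
import torch.nn as nn

class GRUAutoencoder(nn.Module):
    """Minimal sequence autoencoder: a GRU encoder compresses the input
    sequence into a low-dimensional code; a GRU decoder reconstructs it."""
    def __init__(self, n_features: int, code_size: int):
        super().__init__()
        self.encoder = nn.GRU(n_features, code_size, batch_first=True)
        self.decoder = nn.GRU(code_size, code_size, batch_first=True)
        self.output = nn.Linear(code_size, n_features)

    def forward(self, x):                       # x: (batch, time, features)
        _, code = self.encoder(x)               # code: (1, batch, code_size)
        seq_len = x.size(1)
        # Repeat the code at every time step and decode it back to a sequence.
        repeated = code.transpose(0, 1).repeat(1, seq_len, 1)
        decoded, _ = self.decoder(repeated)
        return self.output(decoded)             # reconstruction of x

model = GRUAutoencoder(n_features=3, code_size=8)
x = torch.randn(16, 50, 3)                      # 16 sequences, 50 steps, 3 features
loss = nn.functional.mse_loss(model(x), x)      # reconstruction loss for backprop
loss.backward()
```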

This article is structured as follows. Section 2 briefly reviews related work on sequence data, RNN, LSTM, GRU, attention modules, and ensemble classifiers. Section 3 discusses these techniques in detail. The manuscript is concluded in Sect. 4.

2 Related works

2.1 Sequence data

Modern industrial applications place a high value on effective anomaly identification and diagnosis in multivariate time-series data. Nevertheless, it is difficult to create a system that can rapidly and precisely identify unusual observations, owing to the dearth of anomaly labels, significant data volatility, and the need for extremely fast inference in contemporary applications. Despite recent advances in deep learning algorithms for anomaly identification, only a select fraction of them fully address these issues. Tuli et al. (2022) propose TranAD, a deep transformer network-based anomaly detection and diagnosis model that uses attention-based sequence encoders to capture broader temporal trends in the data and performs inference quickly. TranAD employs adversarial training to achieve stability and focus score-based self-conditioning to enable robust multi-modal feature extraction. Moreover, model-agnostic meta learning (MAML) allows the model to be trained with little data. Comprehensive empirical experiments on six publicly accessible datasets show that TranAD outperforms state-of-the-art baseline approaches in detection and diagnosis with data- and time-efficient training. Both the training times and the anomaly-detection scores are affected by the window size. As smaller inputs need less inference time, TranAD can discover abnormalities more quickly when the window size is smaller; if the window is too small, however, the local contextual information is not represented accurately. A window that is too large hurts both the area under the receiver operating characteristic curve (ROC/AUC) and the F1 score, since brief abnormalities may be concealed among large amounts of data. On the other hand, predicting accuracy may decrease as a result of the conflict, duplication, and inconsistency of massive time-series data. To increase performance, Jin et al. (2022) suggest a deep network that carefully selects and interprets the data. A data self-screening layer (DSSL) with a maximum information distance coefficient (MIDC) is first designed to filter input data with high correlation and low redundancy; a variational Bayesian gated recurrent unit (VBGRU) is then utilised to increase the anti-noise capacity and resilience of the model. This effectively raises forecast resilience and accuracy, although, compared to GRU and LSTM, the model needs more training time.

Ünlü (2022) provides a deep learning algorithm to model and estimate multistep daily Turkish power demand using data from 5 January 2015 to 26 December 2021. The development of novel and inventive deep neural network topologies and considerable processing breakthroughs are two key factors in deep learning’s rising appeal. Convolutional neural networks, gated recurrent networks, and long short-term memory (LSTM) are trained and compared to forecast daily power usage one to seven days in advance. The effectiveness of the suggested methods was assessed using three separate performance criteria: the coefficient of determination (R²), root mean squared error, and mean absolute error. According to the test set’s forecasting results, LSTM exhibits the best performance. The model makes no assumptions about the stationarity of the studied series or about the normal distribution and unidirectional correlation of its residual terms; this characteristic demonstrates the strength of the suggested model. Modelling long-term electrical loads with univariate time-series approaches may be more difficult, so it is important to take the other affecting factors into account in this task.

To estimate and predict the power output of the Manisa wind farms, Akbal and Ünlü (2022a) employed a univariate model based on sequence-to-sequence learning. The study covered a short- to medium-term forecasting window. The advantage of the proposed model is that predictions may be obtained from the model’s own lagged values alone. According to empirical results, the model has a high coefficient of determination (R²) for the short-term prediction and a moderate R² for the mid-term forecast. The model is still considered generally trustworthy even though the mean squared error and mean absolute error of the mid-term estimates show a slight reduction in R². With a small adjustment, the recommended model may also be used to predict the lowest, highest, and average electricity production over a certain period in addition to hourly power output. The study concludes with two unique and intriguing suggestions for further investigation.

Nowadays, there is a significant increase in the volume of air traffic, which has created a rising need for air traffic monitoring systems. Future air-traffic-density demands cannot be met by secondary surveillance radar and primary surveillance radar, two traditional surveillance technologies. As in wireless communications scenarios, air traffic likewise operates in a dynamic environment and is subject to a variety of dynamic influences. It is therefore worthwhile to deploy machine learning models that anticipate flight delays by fully exploiting the aviation data lake. To integrate the benefits of all the available data types, Gui et al. (2020) and Wang et al. (2020) fed the complete dataset into a particular DL model, allowing the best outcome to be discovered in a broader and more precise solution space and thus increasing flight-delay forecast performance. RNNs have been widely utilised in natural language processing (NLP) because they are suitable for sequential input. The LSTM network, one of the most powerful RNNs with a more complex cell organisation, specifically resolves the gradient-vanishing issue in RNNs. Given the input vector \(x = [x_{1} ,x_{2} , \ldots ,x_{T} ]\) and the state vector \(h = [h_{1} ,h_{2} , \ldots ,h_{T} ]\), the LSTM approach computes at every time step to produce the output sequence \(y = [y_{1} ,y_{2} , \ldots ,y_{T} ]\).

Bai et al. (2018) systematically evaluate generic convolutional and recurrent architectures for sequence modelling with the temporal convolutional network (TCN). The models are assessed on a wide variety of conventional benchmarking tasks for recurrent networks. The results show that a simple convolutional architecture outperforms traditional RNNs such as LSTMs while displaying longer effective memory over a wide range of tasks and datasets. Any function \(f:{\mathcal{X}}^{T + 1} \to {\mathcal{Y}}^{T + 1}\) that produces such a mapping is a sequence modelling network.

$$\hat{y}_{0} , \ldots ,\hat{y}_{T} = f\left( {x_{0} , \ldots ,x_{T} } \right)$$
(1)

The aim of sequence modelling is to discover a network f that minimises the expected loss between the actual outcomes and the predictions, \(L\left( {y_{0} , \ldots ,y_{T} ,f\left( {x_{0} , \ldots ,x_{T} } \right)} \right)\), where the sequences and outputs are drawn from a specific distribution. The dilated convolution operation F on element s of the sequence is computed as,

$$F\left( s \right) = \left( {X*_{d} f} \right)\left( s \right) = \mathop \sum \limits_{i = 0}^{k - 1} f\left( i \right) \cdot X_{s - d \cdot i}$$
(2)

A residual block has a branch that leads to a set of transformations F (He et al. 2016), whose results are combined with the block’s input x: 

$$o = Activation\left( {x + F\left( x \right)} \right)$$
(3)

According to the experimental findings, TCN models perform significantly better than generic recurrent architectures such as LSTMs and GRUs, and they retain longer memory than recurrent systems do for the same amount of processing power.
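The dilated causal convolution of Eq. (2) and the residual block of Eq. (3) can be sketched as follows in PyTorch; the channel counts and kernel size are illustrative assumptions, not Bai et al.'s exact configuration.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated causal convolution: the output at step s depends only on
    x[s - d*i] for i = 0..k-1, matching Eq. (2)."""
    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # left-pad so no future leaks in
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                         # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))   # pad on the left only
        return self.conv(x)

class ResidualBlock(nn.Module):
    """o = Activation(x + F(x)), as in Eq. (3)."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.f = CausalConv1d(channels, kernel_size, dilation)

    def forward(self, x):
        return torch.relu(x + self.f(x))

x = torch.randn(8, 16, 100)                       # 8 samples, 16 channels, 100 steps
block = ResidualBlock(16, kernel_size=3, dilation=4)
print(block(x).shape)                             # torch.Size([8, 16, 100])
```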

2.2 Vanilla recurrent neural networks

Another well-known neural network is the recurrent neural network (RNN), which uses time-series or sequential data and feeds the results of the previous stage as input to the current stage (Dupond 2019; Mandic and Chambers 2001). Like feed-forward networks and CNNs, recurrent networks learn from training input, but they set themselves apart by having a “memory” that lets them use data from earlier inputs to influence the current input and output. The output of an RNN depends on earlier elements in the sequence, unlike a normal DNN, which presumes that inputs and outputs are independent of one another. Learning long data sequences can be difficult with ordinary recurrent networks because of the vanishing-gradient problem. Kim et al. (2016) presented a two-stage approach. First, a deep RNN model is used to forecast the daily delay status of a specific airport, where the status is defined as the mean delay of all arriving aircraft. Next, a layered neural network methodology anticipates the delay of each specific aircraft from the first stage’s daily delay status and other data. The model’s two stages had accuracies of 85% and 87.42%, respectively. According to this study, the DL approach needs a lot of data; otherwise, the model will either perform badly or overfit.

A novel approach is described by Diao et al. (2019) to considerably reduce the number of parameters in RNNs while retaining performance equivalent to or better than that of standard RNNs. The suggested idea restricts parameter sharing between the weight matrices corresponding to the input data and hidden states at each time step. The design may be viewed as a compression of the conventional RNN; however, in contrast to the majority of previous compression techniques, pre-training or complex parameter tweaking is not necessary. Figure 1 depicts the structure for imposing parameter restrictions in an RNN.

Fig. 1 Framework for setting parameter restrictions in an RNN

The findings demonstrate that neither extreme of sharing all or none of the parameters is the best way to describe various dependent input data. Sharing partial parameters can drastically reduce the number of RNN parameters by exploiting the relationships between inputs.

Academics have historically used statistical techniques, such as the autoregressive integrated moving average (ARIMA), to forecast traffic. However, ARIMA-based models are unable to make accurate predictions in a highly dynamic cellular environment. Researchers are therefore looking at LSTM- and RNN-based deep learning techniques to build autonomous cellular traffic forecast models. Jaffry and Hasan (2020) offer an LSTM-based cellular traffic forecast model built on real-world call data records. The LSTM-based prediction was contrasted with the ARIMA model and a straightforward feed-forward neural network (FFNN).

In a standard FFNN, information can only travel in one direction: forward. The input layer provides the data for the calculations and operations performed in the hidden layers, and the output layer takes the information from the hidden layers and performs regression or classification predictions. In contrast, an RNN uses the outcome of one layer as input for the following layers, so input from the very first layer is present in each layer. An RNN takes both the input from the most recent time step and the input from earlier time steps into account; as a result, it may learn from all past time steps, which enhances the accuracy of sequence predictions. Second, the learning process is halted when the gradient either totally vanishes or explodes to a very high value; this is an issue with pure RNNs. To resolve this problem, LSTM, an RNN variant, was suggested by Hochreiter and Schmidhuber (1997). LSTMs were designed to overcome the long-term dependence issue, which is the cause of the vanishing-gradient problem in vanilla RNNs (Goodfellow et al. 2016). The gates and cell states that build up the hidden layer’s final prediction are expressed as:

$$\varphi_{f}^{t} = \sigma \left( {W_{f} \left[ {a^{t - 1} ,X^{t} } \right] + b_{f} } \right)$$
(4)
$$\varphi_{i}^{t} = \sigma \left( {W_{i} \left[ {a^{t - 1} ,X^{t} } \right] + b_{i} } \right)$$
(5)
$$CC_{u}^{t} = \varphi \left( {W_{c} \left[ {a^{t - 1} ,X^{t} } \right] + b_{c} } \right)$$
(6)
$$\varphi_{o}^{t} = \sigma \left( {W_{o} \left[ {a^{t - 1} ,X^{t} } \right] + b_{o} } \right)$$
(7)
$$CC^{t} = \varphi_{f}^{t} *CC^{t - 1} + \varphi_{i}^{t} *CC_{u}^{t}$$
(8)
$$a^{t} = \varphi_{o}^{t} *\varphi \left( {CC^{t} } \right)$$
(9)

Finally, the output \(y^{t}\) is computed as:

$$y^{t} = \sigma \left( {W_{y} a^{t} + b_{y} } \right)$$ 
(10)

In these equations, \(\sigma\) stands for the sigmoid operation, frequently called the squashing operation since it limits the output to values between 0 and 1. The sigmoid operation is formally defined as \(\frac{1}{{1 + e^{ - x} }}\). \(\varphi\) is another squashing function, frequently instantiated with the tanh or rectified linear unit (relu) functions. The findings demonstrate the accuracy of LSTM and FFNN in predicting cellular traffic; when training a model for prediction, LSTM models were found to converge more quickly.
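A minimal NumPy transcription of Eqs. (4)-(9) may clarify the gate interplay; the dimensions and random weights are illustrative, and \(\varphi\) is taken to be tanh, as the text suggests.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, a_prev, cc_prev, W, b):
    """One LSTM step following Eqs. (4)-(9); phi is taken to be tanh.
    W and b hold the weight matrices/biases for the f, i, c, o gates."""
    za = np.concatenate([a_prev, x_t])           # [a^{t-1}, x^t]
    f = sigmoid(W["f"] @ za + b["f"])            # forget gate,  Eq. (4)
    i = sigmoid(W["i"] @ za + b["i"])            # input gate,   Eq. (5)
    cc_u = np.tanh(W["c"] @ za + b["c"])         # candidate,    Eq. (6)
    o = sigmoid(W["o"] @ za + b["o"])            # output gate,  Eq. (7)
    cc = f * cc_prev + i * cc_u                  # cell state,   Eq. (8)
    a = o * np.tanh(cc)                          # hidden state, Eq. (9)
    return a, cc

n_in, n_hid = 4, 8
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(n_hid, n_hid + n_in)) for g in "fico"}
b = {g: np.zeros(n_hid) for g in "fico"}
a, cc = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(10, n_in)):          # run over a 10-step sequence
    a, cc = lstm_step(x_t, a, cc, W, b)
```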

Long et al. (2019) formalised the concept of “memory length” for recurrent networks and identified a general family of recurrent networks with maximal memory lengths. These networks may be stacked into several layers to create efficient models, including gated convolutional networks. Owing to the architecture of such networks, there is no gradient vanishing or explosion during back-propagation, which could allow for a more principled design approach in practice. Additionally, the work provides a novel member of this family, the attentive activation recurrent unit (AARU). The framework of the AARU network is depicted in Fig. 2.

Fig. 2 Framework of the AARU network

Because they include internal memory, RNNs are among the most effective and reliable ANN types currently in use (Park et al. 2022). Thanks to its internal memory, which can recall its inputs, an RNN can identify solutions for a wide range of problems (Ma and Principe 2018). Through back-propagation, which depends on the RNN’s gradient, the RNN’s weights are optimised to fit the training data. The gradient of the RNN, however, may vanish or explode during this optimisation routine, which impairs the RNN’s capacity to learn from lengthy data sequences (Allen-Zhu et al. 2019). The LSTM architecture (Hochreiter and Schmidhuber 1997), a particular kind of RNN, is frequently utilised as a solution to these two issues (Le and Zuidema 2016). By retaining information for extended periods of time, LSTMs are specifically created to learn long-term dependencies in time-dependent data. In applications like speech recognition (Tian et al. 2017; Kim et al. 2017) and text processing (Shih et al. 2017; Simistira et al. 2015), LSTM carries out faithful learning. Moreover, because it has internal memory, the flexibility to be customised, and is unaffected by gradient-related problems, LSTM is especially appropriate for complicated data sequences such as stock time series retrieved from financial markets.

2.3 Long short-term memory

The vanishing-gradient problem was first described by Hochreiter and Schmidhuber (1997) and is now addressed by the widely used LSTM type of RNN design. An LSTM unit’s memory cell has the capacity to retain data for extended periods of time, and three gates control the movement of data into and out of the cell. The “forget gate” decides what information from the previous cell state will be memorised and what no-longer-helpful information will be erased, while the “input gate” decides which information should enter the cell state and the “output gate” decides and controls the outputs. The LSTM network is one of the most effective RNNs because it addresses the problems associated with recurrent network training. Bala and Preeti (2019) utilised LSTM to forecast non-stationary financial time series. Their study demonstrates that the LSTM-based time-series prediction model is more accurate than the conventional regression model when applied to the associated feature prediction of time series. The model training time is lengthy, though, since there are too many LSTM parameters. To maximise the sensitivity of model learning and boost prediction precision, Hu and Zheng (2020) suggested a multi-stage attention network for a multivariate time-series prediction model. However, the issue of this model’s prolonged training period was not fully resolved.

Mujeeb et al. (2019) proposed a DL methodology to estimate price and demand for huge data using Deep LSTM (DLSTM). Processing vast volumes of data with LSTM is simpler than with purely data-driven approaches because of the adaptive and automatic feature-learning mechanism of DNNs. Data from well-known real power markets was used to assess the suggested model, with all months used for the day- and week-ahead forecasting tests. Mean Absolute Error (MAE) and Normalized Root Mean Square Error (NRMSE) were used to evaluate forecast performance. The suggested DLSTM approach was compared to two traditional ANN time-series forecasting techniques, the Nonlinear Autoregressive Network with Exogenous Variables and the Extreme Learning Machine (ELM).

When the gradient vanishes, the network overfits the training set; overfitting is memorising inputs rather than learning, and an overfitted model cannot be relied upon to perform well on test or new data. The hidden layer of the LSTM has one neuron per cell, each representing a memory cell with a self-connected recurrent edge. This edge allows the gradient to pass through a number of steps without exploding or fading when its weight is set to 1. Figure 3 depicts the construction of one LSTM unit.

Fig. 3 One-layer LSTM (Mujeeb et al. 2019)

In terms of accuracy, DLSTM fared better than the benchmark forecasting techniques. The experimental findings demonstrate the effectiveness of the suggested approach for predicting power prices and load.

Yang et al. (2019a) tested five word embeddings and two custom methods while de-identifying 10 PHI categories using a Bi-LSTM-CRF approach. Word embedding can discover properties in a low-dimensional matrix by converting words into vectors of real values, removing or minimising the need for feature engineering. Word embeddings were retrieved from Google News, Common Crawl, MIMIC-word2vec, MIMIC-fastText, and MADE. The highest recall and accuracy scores for Common Crawl (94.98 and 97.97, respectively) were linked with the highest F-measure score (96.46). There is no denying that this is an exciting piece of research; however, the strategy used to adjust training techniques may be constrained by data from the test source.

The smart workshop makes extensive use of intelligent sensors and the IoT, resulting in the collection of a sizable quantity of real-time production data. The data gathered may be thoroughly assessed to aid producers in making informed decisions. In contrast to traditional data-processing methods, artificial intelligence (AI), the main big data analysis strategy, is being used more and more in the industrial sector. But AI models differ in how well they can decipher real-time data from smart job-shop operations. Traditional ML and DL techniques cannot effectively exploit the temporal correlation of the data, whereas LSTM is widely used in machine translation, dialogue generation, coding, and decoding technology precisely because it is particularly suited to problems strongly connected to time series. Wang et al. (2020) take the predicted production plan as an example and create LSTM and GRU models to handle the production-process data. This assists the production manager in meeting the deadline of the initial production-plan job and in deciding whether to postpone the production plan. It uses the two-layer LSTM system model of Fig. 4: with a DLSTM, the output vector of one LSTM hidden layer becomes the input vector of the following layer, giving the entire model extra processing capacity.

Fig. 4 Two-layer LSTM (Wang et al. 2020)
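The two-layer stacking of Fig. 4, in which one layer's hidden-output sequence feeds the next layer, corresponds to PyTorch's num_layers argument; the sketch below is generic, with illustrative sizes rather than the authors' exact model.

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers: the hidden-state sequence of layer 1 is the
# input sequence of layer 2, as in the two-layer DLSTM of Fig. 4.
dlstm = nn.LSTM(input_size=6, hidden_size=32, num_layers=2, batch_first=True)
head = nn.Linear(32, 1)                  # map the last hidden state to a forecast

x = torch.randn(4, 24, 6)                # 4 samples, 24 time steps, 6 features
out, (h_n, c_n) = dlstm(x)               # out: (4, 24, 32)
forecast = head(out[:, -1, :])           # prediction from the final step
```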

The GRU is a variant of the LSTM; despite its less complex structure, its effectiveness is unaffected. The update gate and the reset gate are the only two gate operations in the GRU model. The update gate regulates how much of the prior time step’s state information is incorporated into the present state: the larger the update gate’s value, the more prior state information is carried over. The amount of information written from the prior state depends on the size of the reset gate. The GRU computation is represented in the following equations,

$$a_{t} = \sigma \left( {Wi_{a} S_{t} + u_{a} hl_{t - 1} } \right)$$
(11)
$$k_{t} = \sigma \left( {Wi_{r} S_{t} + u_{r} hl_{t - 1} } \right)$$
(12)
$$\widetilde{{hl_{t} }} = \tanh \left( {k_{t} \circ u\,hl_{t - 1} + Wi\,S_{t} } \right)$$
(13)
$$hl_{t} = \left( {1 - a_{t} } \right)\widetilde{{hl_{t} }} + a_{t} hl_{t - 1}$$
(14)

here, \(\sigma\) represents the sigmoid operation, \(a_{t}\) represents the activation sequence of the input gate, the hidden layers are represented as \(hl\), and the weight matrix is denoted as \(Wi\). While LSTM and GRU perform similarly on various data sets, the GRU has a simpler internal storage-unit structure than the LSTM; therefore, while training the model, the GRU runs significantly more quickly. When the accuracy of the predictions is almost equal, it can be concluded from the experimental comparison that the GRU model takes far less time than the LSTM model. As a result, the GRU model outperforms the LSTM model when analysing industrial data. Owing to local optima, both GRU and LSTM suffer loss when training the dataset; this problem may be resolved by improving the models.
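The claim that the GRU's simpler unit yields fewer parameters than the LSTM can be checked directly; in this PyTorch snippet the layer sizes are arbitrary, and the counts in the comments follow from PyTorch's default parameterisation.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=64, hidden_size=128)
gru = nn.GRU(input_size=64, hidden_size=128)

# The LSTM keeps 4 gate blocks (input, forget, cell, output); the GRU keeps
# 3 (reset, update, candidate), so the GRU is roughly 25% smaller.
print(n_params(lstm))   # 99328
print(n_params(gru))    # 74496
```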

Worldwide, there are more and more different types of financial fraud, which causes enormous financial losses and is a significant issue. This issue has been thoroughly addressed over the past few years using methods from machine learning and data mining. These techniques still need to be developed further to cope with massive data, compute quickly, and spot new attack patterns. Yara et al. (2020) suggested a DL-based solution for identifying financial fraud based on the LSTM methodology. With regard to massive data, this approach seeks to increase the effectiveness and precision of existing detection techniques. The suggested approach is assessed on a real dataset of credit card fraud, and the results are compared with an existing deep learning model, the autoencoder model, and many other ML strategies. The test outcomes showed that the LSTM performed flawlessly, achieving 99.95% accuracy in less than a minute.

Making decisions based on many criteria is one of the most important problems to address when dealing with concerns relating to alternative impacts in big data research. Establishing exact characteristics for the distance measure that determines similarity across tests is challenging, and noisy, needless data complicate it further. To deal with this, Papineni et al. (2021) propose a deep learning-based hierarchical clustering approach for huge sequence data, fused with multi-criteria decision making and organised as a staggered, progressive, decision-tree-like structure. However, unnecessary data severely weaken the viability of the intrusion-finding methodology. The following steps illustrate how this approach works,

Pseudocode steps to design MC-DL for BD:

  1. Input: the model takes into account the model parameters a, b, c, and d, established using datasets gathered from running records.

  2. Output: the observations listed against the issue being resolved.

  3. Set the data stored in the batch.

  4. For every dataset, repeat \(B\left( {a_{i} k} \right) = b\left( {a_{i} } \right) + b\left( {N_{i} } \right)\) over the whole data, where \(B\left( {N_{i} } \right)\) is the neighbouring set of values. If there are influencing relationships between the agents of one initial value and those of other initial values in the global set of issues, they are used to compute the overall efficacy of the \(a_{i}\) coefficient.

  5. The system’s starting value is stated as \(\mu_{ij} = \left( {b\left( {a_{i} } \right) + b\left( {a_{j} } \right)} \right)/K\). The positive agent factors are the constant normalisation parameters, \(b\left( N \right) = \mathop \sum \nolimits_{j \in N} \mu_{ij} b\left( {a_{j} } \right)\).

  6. The mapping procedure established for all of the DL’s hidden layers is \(a_{i + 1} = f_{k} \left( {W_{i} *a_{i} + b_{i} } \right)\) with its activation; i.e., the weights and hidden layers linking the DL network may be cumulatively combined with the sigmoid function. The decision outcome for the normalised x-variable is \(Decisionvalue\left( x \right) = \frac{{f_{k} \left( x \right) - mean\;value}}{{\delta_{k} }}\).

  7. End for.

  8. Return.

Mutual coefficients under the common agents can be derived based on the weighting index when computing the benefit factors of the nearby agents.

$$\mu_{ij} = \frac{{b\left( {a_{i} } \right) + b\left( {a_{j} } \right)}}{k}*\left( {C_{aj} + 1} \right)$$
(15)

As a result, the benefit plan’s entire evaluation will be looked into. The suggested study improves our understanding of this stream of research, enabling more effective, robust, and precise forecasts of the values for various criteria-based assisted decision-modelling systems and their frameworks.
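Since the pseudocode above is only loosely specified, the following Python sketch is a hedged guess at steps 4-6: mutual coefficients per Eq. (15), a neighbour-benefit sum, and the normalised decision value. The function names and sample values are invented for illustration and are not the authors' implementation.

```python
import numpy as np

def mutual_coeff(b, i, j, K, C):
    """Mutual coefficient between agents i and j, Eq. (15); C holds the
    common-agent counts and K is the normalisation constant (assumed)."""
    return (b[i] + b[j]) / K * (C[j] + 1)

def neighbour_benefit(b, i, neighbours, K, C):
    """Benefit sum over the neighbouring set, b(N) = sum_j mu_ij * b(a_j)."""
    return sum(mutual_coeff(b, i, j, K, C) * b[j] for j in neighbours)

def decision_value(x, mean, delta):
    """Normalised decision outcome, (f_k(x) - mean) / delta (step 6)."""
    return (x - mean) / delta

b = np.array([0.4, 0.7, 0.2, 0.9])          # illustrative agent-benefit values
print(neighbour_benefit(b, 0, [1, 2, 3], K=2.0, C=np.ones(4)))
print(decision_value(0.7, mean=b.mean(), delta=b.std()))
```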

Bidirectional recurrent neural networks are used in the innovative method of Chadha et al. (2020) for time series-based condition monitoring and fault diagnostics. The use of bidirectional recurrent neural networks fundamentally changes how fault detection is approached and enables handling fault relationships across longer time horizons, preventing crucial process failures and boosting overall system productivity. The capacity is further improved by a unique data preprocessing and restructuring method that enforces generalisation, maximises data use, and produces more effective network training, particularly for sequential fault-classification tasks. The proposed Bi-LSTM outperforms standard recurrent designs such as LSTM, GRU, and vanilla recurrent neural networks.

$$\mathop {\min }\limits_{\theta } L_{\theta } \left( {{\mathbf{X}},{\mathbf{f}}} \right) = - \mathop \sum \limits_{t = 0}^{T} f\left( t \right)\log \left( {g_{\theta } \left( {{\mathbf{X}}\left( t \right)} \right)} \right)$$
(16)

The testing findings for both binary and multi-class classification demonstrate the bidirectional LSTM networks’ higher average fault-discovery capability compared to alternative designs.

Bian et al. (2021) propose an approach based on particle swarm optimisation and an attention-augmented LSTM (PSO-Attention-LSTM) in response to consumers’ aberrant power-consumption behaviour. By setting up comparative experiments against LSTM, GRU, SVR, RF, LR, CNN-LSTM, and Attention-LSTM, they verify that the PSO-Attention-LSTM model has advantages in positive rate and false-positive rate and stronger anomaly-detection ability. To fully analyse the model’s detection efficacy for diverse electricity-theft behaviours, four composite modes are produced by merging the six frequent electricity-theft modes, which are first specified in accordance with real electricity-stealing behaviour. Second, a detection model based on PSO-Attention-LSTM is built using the TensorFlow framework. By employing the attention mechanism to apply varying weights to the hidden states of the LSTM, the model may minimise the loss of past data while boosting critical information and suppressing irrelevant information. PSO is utilised to choose the ideal model parameters, and the hyperparameters are then optimised to increase the model’s output. The LSTM model has strong time-series data-fitting regression skills and can take into account the time-series features of user power usage. To improve the model’s detection and prediction accuracy, the attention mechanism is also included: it gives the LSTM hidden states different probability weights and amplifies the impact of important information. PSO-optimised hyperparameters are used to find the optimal parameters and raise the model’s detection performance.

In this study, window-sliding processing is done using one-day data, which means that the data from the previous 24 time steps are used to forecast the data at the subsequent time step. The timing and pattern of consumers’ power-consumption behaviour are taken into account in this process (Wang et al. 2018a; Deng et al. 2020). Figure 5 displays the prediction principle.

Fig. 5 LSTM sliding-window forecast principle
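The one-day sliding-window scheme of Fig. 5 can be reproduced with a few lines of NumPy; the window length of 24 follows the text, while the synthetic series below merely stands in for real consumption data.

```python
import numpy as np

def sliding_windows(series, window=24):
    """Turn a 1-D load series into (X, y) pairs: the previous `window`
    readings predict the next one, as in the one-day sliding scheme."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

series = np.sin(np.linspace(0, 20, 500))   # stand-in for consumption data
X, y = sliding_windows(series, window=24)
print(X.shape, y.shape)                    # (476, 24) (476,)
```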

The LSTM layer’s prediction outcomes are successfully highlighted using the attention mechanism, which also enhances the model’s effectiveness as a predictor and its ability to identify changes in user power use. The input layer, LSTM layer, attention layer, and output layer make up the majority of the Attention-LSTM model. The LSTM feature vector acts as the input to the attention layer, which follows it. Following the weight-distribution concept, improved weight parameters may be created through continued updating. The final estimate of user power consumption is output by the fully connected layer. Figure 6 shows the framework of the Attention-LSTM approach.

Fig. 6 Attention-LSTM model construction
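A minimal PyTorch sketch of the layer ordering just described, with LSTM hidden states scored, softmax-weighted, and fed to a fully connected output, is given below; the scoring layer and sizes are assumptions for illustration, not Bian et al.'s exact design.

```python
import torch
import torch.nn as nn

class AttentionLSTM(nn.Module):
    """Sketch of an Attention-LSTM regressor: attention scores weight the
    LSTM hidden states before the fully connected output layer."""
    def __init__(self, n_features, hidden):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)        # one score per time step
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, time, features)
        h, _ = self.lstm(x)                      # h: (batch, time, hidden)
        w = torch.softmax(self.score(h), dim=1)  # attention weights over time
        context = (w * h).sum(dim=1)             # weighted sum of hidden states
        return self.out(context)

model = AttentionLSTM(n_features=1, hidden=32)
print(model(torch.randn(8, 24, 1)).shape)        # torch.Size([8, 1])
```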

This study’s detection model performs poorly on “burr” points brought on by high levels of noise and unpredictable behaviour, because it fails to take the relevant practical application considerations into account.

Kong et al. (2019) developed short-term residential load forecasting, which exhibits high volatility and uncertainty, employing an LSTM; in the interim, a method of mixing different projections was applied. The proposed LSTM framework generated the highest prediction performance on the data set. Jiao et al. (2018) proposed an LSTM-based method to forecast the load of non-residential users using a variety of relevant sequence information. K-means was utilised to analyse the daily load curve of non-residential customers, and test results revealed that it was also more accurate than earlier load-forecasting methods. Load forecasting is, however, impossible without taking a sizable number of relevant factors and historical data into consideration. Single RNN algorithms like LSTM or GRU (Chung et al. 2014) may take the history of temporal data into account, but the feature connections need to be created manually (Sak et al. 2014).

2.4 Gated recurrent unit 

A Gated Recurrent Unit (GRU), developed by Cho et al. (2014a), is a popular variant of the recurrent network that uses gating techniques to regulate and manage the information flow between neural network cells. As seen in Fig. 7, the GRU is similar to an LSTM but has fewer parameters, because it contains just a reset gate and an update gate. The main distinction between a GRU and an LSTM is therefore that a GRU has two gates (reset and update), whereas an LSTM has three (input, output, and forget). The structure of the GRU makes it possible to capture dependencies from lengthy data sequences adaptively without losing data from earlier portions of the sequence. Because of this, the GRU is a slightly simplified variant that frequently provides equivalent performance and is significantly faster to compute (Chung et al. 2014). A technique for forecasting short-term wind power based on wavelet packet decomposition and an enhanced GRU (WPD-GRU-SELU) was proposed by Zu and Song (2018). The wind-power time series is first divided into many sub-sequences of different frequencies using WPD. Then, a modified GRU neural network is used to forecast the sequences of the various frequency components; this network employs scaled exponential linear units (SELU) as the activation function to compress the hidden states and determine the output. Reconstructing the GRU neural network’s output data yields the final wind-power prediction. By including SELU as the output activation function, this approach revises the activation function of the GRU. The activation functions are illustrated in the following equations,

$$r_{t} = \sigma_{sig} \left( {W_{r} x_{t} + U_{r} h_{t - 1} } \right)$$
(17)
$$z_{t} = \sigma_{sig} \left( {W_{z} x_{t} + U_{z} h_{t - 1} } \right)$$
(18)
$$\tilde{h}_{t} = \phi_{tanh} \left( {W_{h} x_{t} + r_{t} \circ U_{h} h_{t - 1} } \right)$$
(19)
$$h_{t} = \left( {1 - z_{t} } \right) \circ h_{t - 1} + z_{t} \circ \tilde{h}_{t}$$
(20)
Fig. 7 Framework of the GRU

The candidate value of the active hidden node is represented by \(\tilde{h}_{t}\) in the equations above, and the activation value of the current hidden-node output is represented by \(h_{t}\). The reset and update gates are denoted by \(r_{t}\) and \(z_{t}\), and the element-wise multiplication by \(\circ\). The control gates and candidate states are activated by the activation functions \(\sigma_{sig}\) and \(\phi_{tanh}\). The sigmoid and tanh expressions are,

$$sig\left( x \right) = \left( {1 + e^{ - x} } \right)^{ - 1}$$
(21)
$$\tanh \left( x \right) = 2*sig\left( {2x} \right) - 1$$
(22)

Thus, the researchers apply an updated GRU model to estimate wind power and increase prediction correctness; SELU is introduced as the activation function of the output \(h_{t}\). The SELU equation is:

$$selu\left( x \right) = \lambda \left\{ {\begin{array}{ll} {x,\;x > 0} \\ {\alpha e^{x} - \alpha ,\;x \le 0} \\ \end{array} } \right.$$
(23)

here, \(\lambda = 1.0507009873554804934193349852946\) and \(\alpha = 1.6732632423543772848170429916717\). The equations for the GRU after integrating SELU are depicted below:

$$r_{t} = \sigma_{sig} \left( {W_{r} x_{t} + U_{r} \phi_{SELU} \left( {h_{t - 1} } \right)} \right)$$
(24)
$$z_{t} = \sigma_{sig} \left( {W_{z} x_{t} + U_{z} \phi_{SELU} \left( {h_{t - 1} } \right)} \right)$$
(25)
$$\tilde{h}_{t} = \phi_{tanh} \left( {W_{h} x_{t} + r_{t} \circ U_{h} \phi_{SELU} \left( {h_{t - 1} } \right)} \right)$$
(26)
$$h_{t} = \left( {1 - z_{t} } \right) \circ h_{t - 1} + z_{t} \circ \tilde{h}_{t}$$
(27)
$$Output = \phi_{SELU} \left( {h_{t} } \right)$$
(28)
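Equation (23) with the quoted constants is easy to verify numerically; the following NumPy function is a direct transcription.

```python
import numpy as np

LAMBDA = 1.0507009873554804934193349852946
ALPHA = 1.6732632423543772848170429916717

def selu(x):
    """SELU activation of Eq. (23) with the constants quoted above."""
    return LAMBDA * np.where(x > 0, x, ALPHA * np.exp(x) - ALPHA)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(selu(x))
```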

The hybrid DL predictor of Jin et al. (2020) separates the climatic information into fixed component groups with unique frequency characteristics using the empirical mode decomposition (EMD) method, trains a GRU as a sub-predictor for each group, and then sums the GRU results into the prediction result. The GRU in this model is trained using stochastic gradient descent on the data designated as input and output, enabling the determination of the ideal weights. Each group’s combined IMF sequences were utilised to train the GRU network. The GRU is made up of several GRU cells, in this instance with two hidden layers. In Fig. 7, \(IP_{t}\), t = 1, 2,…, n are the inputs to the GRU model and \(OP_{t}\), t = 1, 2,…, n the outputs.

The forward propagation of the GRU is computed as (Yang et al. 2019b):

$$u_{t} = \sigma \left( {a_{t} U^{z} + ch_{t - 1} W^{z} + b^{z} } \right)$$
(29)
$$rs_{t} = \sigma \left( {a_{t} U^{r} + ch_{t - 1} W^{r} + b^{r} } \right)$$
(30)
$$\widetilde{ch}_{t} = \tanh \left( {a_{t} U^{h} + \left( {ch_{t - 1} \circ rs_{t} } \right)W^{h} + b^{h} } \right)$$
(31)
$$ch_{t} = \left( {1 - u_{t} } \right) \circ \widetilde{ch}_{t} + u_{t} \circ ch_{t - 1}$$
(32)

here, \(a_{t} \in R^{d}\) is the input vector for every GRU cell; \(u_{t}\), \(ch_{t}\), \(rs_{t}\), and \(\widetilde{ch}_{t}\) reflect the update gate, activation state, reset gate, and candidate state of the hidden node at time t. Bias vectors are symbolised by b, whereas the weight matrices U and W are determined during model training. The gradient descent approach is used to train the GRU, and the parameters are adjusted until convergence. Experiments employing meteorological data from an agricultural IoT system confirm the proposed model; the prediction results show that the proposed predictor can offer more accurate forecasts of temperature, wind speed, and humidity data to meet the needs of precision-agriculture output.
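The gradient-descent training just described can be sketched as a standard PyTorch loop; the two hidden layers follow the text, while the feature count (e.g., temperature, wind speed, humidity) and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUForecaster(nn.Module):
    def __init__(self, n_features, hidden):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):                        # x: (batch, time, features)
        h, _ = self.gru(x)
        return self.out(h[:, -1, :])             # one-step-ahead forecast

model = GRUForecaster(n_features=3, hidden=16)   # e.g. temp., wind, humidity
opt = torch.optim.SGD(model.parameters(), lr=0.01)
X, y = torch.randn(32, 24, 3), torch.randn(32, 3)  # stand-in sensor windows
for _ in range(100):                             # adjust until convergence
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
```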

Recurrent neural networks were proposed by a number of researchers as a technique for predicting time series (Ahmad et al. 2017). The RNN’s shortcoming is that it can only remember recent correlations and dependencies because of the decaying gradient (Song et al. 2019). The output, input, and forget gates of the LSTM approach, an enhanced RNN, may be used to regulate how much information in a sequence is retained or discarded; additionally, it prevents the RNN model from overlooking long-term states (Lin et al. 2020). In the GRU model, a modified version of the LSTM model, the forget gate and input gate are integrated into one update gate, but the cell state is dropped (Chen and Chou 2022); therefore, there are fewer parameters. The prediction performance of the GRU model greatly surpassed that of the other two dynamic models, RNN and LSTM.

To effectively exploit the link between temporal components in load data, increase the precision and effectiveness of short-term load forecasting, and overcome the challenges brought on by load volatility and nonlinearity, Shi et al. (2021) suggested a hybrid neural network model for estimating short-term demand based on the temporal convolutional network (TCN) and GRU. After the characteristics are reconstructed using a fixed-length sliding time window, the distance correlation coefficient is employed to determine the link between the load and the climatic variables. The last phase carries out prediction by using a temporal convolutional network to unearth hidden time correlations and historical information, such as weather data and electricity pricing. To improve prediction efficiency and accuracy, the cutting-edge AdaBelief optimiser and an attention mechanism are used. Data on the Spanish load, weather, and the PJM power system are used to demonstrate the usefulness and superiority of the suggested model. According to data from several short-term load-forecasting periods and detailed analyses of the performance of various models, the recommended model can produce accurate load-forecasting results extremely quickly.

2.5 Auto encoder and decoder

To overcome the aforementioned issues and improve the accuracy of predictions for specific times, Hou et al. (2021) and Chenyu et al. (2021) advise utilising a Self-attention based Time-Varying (STV) prediction model. Using an encoder-decoder module with a multi-head self-attention mechanism, the researchers first examine how successive series are interconnected and hunt for common patterns. With the help of the multi-head self-attention mechanism, the model is able to project the historical series into several subspaces and extract detailed, high-level properties for prediction. Then, using the primary outcome of the encoding vector as input, a decoder forecasts the upcoming multi-step outcomes. Using this encoder-decoder module, the model may link input and prediction in order to seek patterns that are consistent throughout a range of time periods in a series. However, the decoder’s outputs cannot be utilised as the final predictions, since they are slanted toward the distribution of normal periods.

Let \(Z \in {\varvec{R}}\) be the encoder’s sub-input; then it is computed as:

$$Z^{\left( 1 \right)} = LayerNorm\left( {Z + MultiHeadSelfAttn\left( Z \right)} \right)$$
(33)
$$Z^{out} = LayerNorm\left( {Z^{\left( 1 \right)} + FeedForward\left( {Z^{\left( 1 \right)} } \right)} \right)$$
(34)

Given \(c \in R^{{l \times d_{e} }}\) formulated from the encoder, the decoder is computed as:

$$z^{\left( 1 \right)} = ReLu\left( {W_{1}^{T} c + b_{1} } \right)$$
(35)
$$z^{\left( 2 \right)} = ReLu\left( {W_{2}^{T} z^{\left( 1 \right)} + b_{2} } \right)$$
(36)
$$d = W_{3}^{T} z^{\left( 2 \right)} + b_{3}$$
(37)

here, the weight matrices \(W_{1} \in {\varvec{R}}^{{\left( {l \times d_{e} } \right) \times \left( {2l \times d_{e} } \right)}}\), \(W_{2} \in {\varvec{R}}^{{\left( {2l \times d_{e} } \right) \times \left( {l \times d_{e} } \right)}}\), and \(W_{3} \in {\varvec{R}}^{{\left( {l \times d_{e} } \right) \times H}}\), and the bias vectors \(b_{1 } \in {\varvec{R}}^{{2l \times d_{e} }}\), \(b_{2 } \in {\varvec{R}}^{{l \times d_{e} }}\), and \(b_{3} \in {\varvec{R}}^{H}\) form the fully-connected layers.

$$MultiHead\left( {Q,K,V} \right) = Concat\left( {head_{1} , \ldots ,head_{n} } \right)W^{L} ,\;{\text{s}}{\text{.t}}.\;head_{i} = {\text{Attention }}\left( {{\text{Q}} \times {\text{W}}_{{\text{i}}}^{{\text{Q}}} ,{\text{K}} \times W_{i}^{K} , V \times W_{i}^{V} } \right)$$
(38)

where \(W_{i}^{Q} \in {\varvec{R}}^{{d_{e} \times n}} ,W_{i}^{K} \in {\varvec{R}}^{{d_{e} \times n}} , W_{i}^{V} \in {\varvec{R}}^{{d_{e} \times n}}\), and \(W^{L} \in {\varvec{R}}^{{n \times d_{e} }}\) are learnable parameters.

A softmax operation is then conducted so that the attention scores sum to 1,

$$a_{ij} = \frac{{{\text{exp}}\left( {e_{ij} } \right)}}{{\mathop \sum \nolimits_{{{\text{k}} = 1}}^{l} {\text{exp}}\left( {e_{ik} } \right)}}$$
(39)

here, \(l\) is the length of the time series, and

$$e_{ij} = 1 - \left| {q_{i} - k_{j} } \right|$$
(40)

The significance of each past value is symbolized by this score in a semantic way.

The final outcome is computed as,

$$o_{i} = \mathop \sum \limits_{k = 1}^{m} a_{ik} \times v_{k} ,\;i = 1,2, \ldots ,l$$
(41)

The forecasting outcome \(\hat{y}_{t}\) at a specific time \(t\) is computed as,

$$\hat{y}_{t} = (1 + s_{t} ) \times d_{t}$$
(42)

With this method, the regression problem’s unbalanced data are resolved. On the Call Traffic dataset, it improves over baselines by a median of 18.64% and 20.87%, while on the Elec CONS dataset it improves over baselines by 5.74% and 20%.
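Equations (39)-(42) can be traced with scalar queries, keys, and values in NumPy; the numbers below, and the seasonal factor \(s_{t}\) applied in Eq. (42), are invented purely for illustration.

```python
import numpy as np

def softmax(e):
    e = np.exp(e - e.max())
    return e / e.sum()

# Eqs. (39)-(41): distance-based scores e_ij = 1 - |q_i - k_j|, softmax
# weights, and a weighted sum over the values; scalar q/k/v for simplicity.
q = np.array([0.2, 0.5, 0.9])
k = np.array([0.1, 0.6, 0.8])
v = np.array([1.0, 2.0, 3.0])

o = np.empty_like(q)
for i in range(len(q)):
    e_i = 1.0 - np.abs(q[i] - k)     # Eq. (40)
    a_i = softmax(e_i)               # Eq. (39)
    o[i] = (a_i * v).sum()           # Eq. (41)

s_t, d_t = 0.1, o[-1]                # assumed scaling factor and decoder output
y_hat = (1 + s_t) * d_t              # Eq. (42)
print(o, y_hat)
```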

Traditional health advice is risky because it lacks security safeguards against intrusions and is unsuitable for offering useful advice. As a result, people are reluctant to provide critical medical information. It is essential to create a privacy-preserving health recommendation system that offers the user the top-N options based on their preferences and previous input while also protecting their privacy. These issues are addressed by Selvi and Kavitha (