
Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning

Cheng Tan, Zhangyang Gao, Lirong Wu, Yongjie Xu, Jun Xia, Siyuan Li, Stan Z. Li
AI Lab, Research Center for Industries of the Future, Westlake University
{tancheng,gaozhangyang,wulirong,xuyongjie,xiajun,lisiyuan,stan.zq.li}@westlake.edu.cn
Equal contribution. Corresponding author.
Abstract

Spatiotemporal predictive learning aims to generate future frames by learning from historical frames. In this paper, we investigate existing methods and present a general framework of spatiotemporal predictive learning, in which the spatial encoder and decoder capture intra-frame features and the middle temporal module catches inter-frame correlations. While the mainstream methods employ recurrent units to capture long-term temporal dependencies, they suffer from low computational efficiency due to their unparallelizable architectures. To parallelize the temporal module, we propose the Temporal Attention Unit (TAU), which decomposes temporal attention into intra-frame statical attention and inter-frame dynamical attention. Moreover, while the mean squared error loss focuses on intra-frame errors, we introduce a novel differential divergence regularization to take inter-frame variations into account. Extensive experiments demonstrate that the proposed method enables the derived model to achieve competitive performance on various spatiotemporal prediction benchmarks.

1 Introduction

The last decade has witnessed revolutionary advances in deep learning across various supervised learning tasks such as image classification [34, 87, 52], object detection [75, 73], and computational biology [43, 84, 83, 21, 22], among others. Despite significant breakthroughs in supervised learning, which relies on large-scale labeled datasets, the potential of unsupervised learning remains largely untapped. Self-supervised learning, which designs pretext tasks to produce labels derived from the data itself, is recognized as a subset of unsupervised learning. In the context of self-supervised learning, contrastive self-supervised learning [33, 7, 28, 8, 92, 104, 86, 85] predicts the noise contrastive estimation from predefined positive or negative pairs, and masked self-supervised learning [45, 102, 51, 17, 32, 57] predicts the masked patches from the visible patches. Unlike these image-level self-supervised learning methods, spatiotemporal predictive learning predicts future frames from past frames at the video level [19, 60, 27, 63, 46, 6].

Figure 1: Comparison of model architectures among common spatiotemporal predictive learning methods. Note that we denote 2D convolutional neural networks as Conv while 3D Conv means 3D convolutional neural networks.

Accurate spatiotemporal predictive learning can benefit broad practical applications in climate change [77, 74], human motion forecasting [107, 91], traffic flow prediction [18, 97], and representation learning [71, 39]. The significance of spatiotemporal predictive learning primarily lies in its potential of exploring both spatial correlation and temporal evolution in the physical world. Moreover, the self-supervised nature of spatiotemporal predictive learning aligns well with human learning styles without a large amount of labeled data. Massive videos can provide a rich source of visual information, enabling spatiotemporal predictive learning to serve as a generative pre-training strategy [60, 35] for feature representation learning towards diverse downstream visual supervised tasks.

Most of the existing methods [78, 95, 93, 94, 98, 29, 96, 103, 19, 79, 82, 44, 3, 100, 24, 88, 89, 65, 11, 12, 9, 2, 41] in spatiotemporal predictive learning employ hybrid architectures of convolutional neural networks and recurrent units in which spatial correlation and time evolution can be learned, respectively. Inspired by the success of long short-term memory (LSTM) [36] in sequential modeling, ConvLSTM [78] is a seminal work on the topic of spatiotemporal predictive learning that extends fully connected LSTM to convolutional LSTM towards accurate precipitation nowcasting. PredRNN [95] is an admirable work that proposes Spatiotemporal LSTM (ST-LSTM) units to model spatial appearances and temporal variations in a unified memory pool. This work provides insights on designing typical recurrent units for spatiotemporal predictive learning and inspires a series of subsequent works [93, 4, 98, 96]. E3D-LSTM [94] integrates 3D convolutional neural networks into recurrent units towards good representations with both short-term frame dependencies and long-term high-level relations. PhyDNet [29] introduces a two-branch architecture that involves physical-based PhyCells and ConvLSTMs for performing partial differential equation constraints in latent space. CrevNet [103] proposes an invertible two-way autoencoder based on flow [14, 15] and a conditionally reversible architecture for spatiotemporal predictive learning. As shown in Figure 1 (a), we present a general framework consisting of the spatial encoder/decoder and the middle temporal module, abstracted from these methods. Though these spatiotemporal predictive learning methods have different temporal modules and spatial encoders/decoders, they basically share a similar framework.

Based on the general framework, we argue that the temporal module plays an essential role in spatiotemporal predictive learning. While the common choice of the temporal module is recurrent-based units, we explore a novel parallelizable attention module named Temporal Attention Unit (TAU) to capture the time evolution. The proposed temporal attention is decomposed into intra-frame statical attention and inter-frame dynamical attention. Furthermore, we argue that the mean squared error loss only focuses on intra-frame differences, and we propose a differential divergence regularization that also accounts for inter-frame variations. Keeping the spatial encoder and decoder as simple as 2D convolutional neural networks, we deliberately implement our proposed TAU modules and surprisingly find that the derived model achieves performance competitive with recurrent-based models. This observation provides a new perspective on improving spatiotemporal predictive learning with parallelizable attention networks instead of commonly used recurrent units.

We conduct experiments on various datasets with different experimental settings: (1) Standard spatiotemporal predictive learning. Quantitative results on various datasets demonstrate that our proposed method achieves competitive performance on standard spatiotemporal predictive learning. (2) Generalization ability. To verify the generalization ability, we train our model on KITTI and test it on the Caltech Pedestrian dataset, which comes from a different domain. (3) Predicting future frames with flexible lengths. We tackle long-length frame prediction by feeding the predicted frames back as input, and find that the performance remains consistently good. Through the superior performance in the above three experimental settings, we demonstrate that our proposed model provides a novel way to learn temporal dependencies without recurrent units.

2 Related works

2.1 Self-supervised learning

Deep learning has been well developed and applied in various fields [55, 5, 56, 110, 109]. Learning from massive data enables tremendous progress in supervised learning. By designing pretext tasks and generating labels from the data itself, self-supervised learning obtains supervisory signals. The model learns valuable representations by solving pretext tasks that leverage the underlying structure of the data. Early works on visual self-supervised learning design pretext tasks such as colorization [108], inpainting [69], rotation prediction [26], and jigsaw puzzles [66]. Contrastive self-supervised learning [33, 7, 28, 8, 92, 104] is a dominant approach in visual self-supervised learning that aims at a pretext task of pulling similar samples closer and pushing diverse samples away from each other. However, contrastive self-supervised learning is limited by the need to build pairs from multiple images, which affects its ability on small-scale datasets. Masked self-supervised learning [45, 102, 51, 17, 32, 57], which predicts the masked patches from the visible ones, is another research direction. Although masked pretraining has had great success in natural language processing, its application to visual tasks is challenging. Spatiotemporal predictive learning is another promising branch of self-supervised learning that focuses on video-level information and predicts future frames conditioned on past frames.

In contrast to the above image-level methods, spatiotemporal predictive learning focuses on video-level information and predicts future frames conditioned on past frames. By learning the intrinsic motion dynamics, the model is able to easily decouple the foreground and background.

2.2 Spatiotemporal predictive learning

Recurrent models have achieved remarkable advances in spatiotemporal predictive learning. Inspired by recurrent neural networks, VideoModeling [62] adopts language modeling and quantizes the image patches into a large dictionary for recurrent units. ConvLSTM [78] leverages convolutional neural networks to model the LSTM architecture. PredNet [61] persistently predicts future video frames using deep recurrent convolutional neural networks with bottom-up and top-down connections. PredRNN [95] proposes a Spatiotemporal LSTM unit that extracts and memorizes spatial and temporal representations simultaneously, and its follow-up work PredRNN++ [93] further introduces the gradient highway unit and Causal LSTM to adaptively capture temporal dependencies. E3D-LSTM [94] designs eidetic memory transition in recurrent convolutional units. Conv-TT-LSTM [82] employs a higher-order ConvLSTM to predict by combining convolutional features across time. MotionRNN [100] focuses on motion trends and transient variations. LMC-Memory [50] introduces a long-term motion context memory using memory alignment learning. PredRNN-v2 [96] extends PredRNN by leveraging a memory decoupling loss and a curriculum learning strategy.

Instead of using recurrent-based methods that are computationally expensive for spatiotemporal predictive learning, we introduce TAU, a model that uses a visual attention mechanism to parallelize the temporal evolution without recurrent structures. There is prior art that shares some similarities with our proposed model. PredCNN [101] and TrajectoryCNN [54] implement pure convolutional neural networks as the temporal module. SimVP [23] is a seminal work that applies blocks of Inception modules with a UNet architecture to learn the temporal evolution. Though their temporal modules are parallelizable, we argue that convolutions alone cannot capture long-term dependencies. Moreover, SimVP provides a simple baseline with few complex attachments but large room for further improvement. In general, SimVP first downsamples video sequences to reduce the computation, then uses an Inception-UNet to learn essential spatiotemporal relationships, and finally upsamples the representations to predict future frames. Our work aims to replace the pivotal Inception-UNet with efficient attention modules that promote prediction performance. In this paper, we employ a simple yet effective attention mechanism that makes the temporal module not only parallelizable but also capable of capturing long-term time evolution.

3 Methods

3.1 Preliminaries

We formally define the spatiotemporal predictive learning problem as follows. Given a video sequence $\mathcal{X}^{t,T}=\{\mathbf{x}^{i}\}_{t-T+1}^{t}$ at time $t$ with the past $T$ frames, we aim to predict the subsequent $T^{\prime}$ frames $\mathcal{Y}^{t+1,T^{\prime}}=\{\mathbf{x}^{i}\}_{t+1}^{t+1+T^{\prime}}$ from time $t+1$, where $\mathbf{x}^{i}\in\mathbb{R}^{C\times H\times W}$ is usually an image with channels $C$, height $H$, and width $W$. In practice, we represent the video sequences as tensors, i.e., $\mathcal{X}^{t,T}\in\mathbb{R}^{T\times C\times H\times W}$ and $\mathcal{Y}^{t+1,T^{\prime}}\in\mathbb{R}^{T^{\prime}\times C\times H\times W}$.

The model with learnable parameters $\Theta$ learns a mapping $\mathcal{F}_{\Theta}:\mathcal{X}^{t,T}\mapsto\mathcal{Y}^{t+1,T^{\prime}}$ by exploring both spatial and temporal dependencies. In our case, the mapping $\mathcal{F}_{\Theta}$ is a neural network model trained to minimize the difference between the predicted future frames and the ground-truth future frames. The optimal parameters $\Theta^{*}$ are:

$$\Theta^{*}=\arg\min_{\Theta}\mathcal{L}(\mathcal{F}_{\Theta}(\mathcal{X}^{t,T}),\mathcal{Y}^{t+1,T^{\prime}}), \qquad (1)$$

where $\mathcal{L}$ is a loss function that evaluates such differences.

3.2 Overview

We illustrate the overall model architecture in Figure 2, using the Moving MNIST [81] data as an example input.

Figure 2: The overview architecture of our proposed model.

Striving for simplicity, the model follows the general framework in Figure 1, while the spatial encoder consists of four vanilla 2D convolutional layers, and the spatial decoder consists of four 2D transposed convolutional layers ('Conv2d' and 'ConvTranspose2d' in PyTorch, respectively). We add a residual connection from the first convolutional layer to the last transposed convolutional layer for preserving the spatial-dependent features. Stacks of TAU modules are placed between the spatial encoder and decoder to extract temporal-dependent features. Though our model is simple, it can efficiently learn both spatial-dependent and temporal-dependent features without recurrent architectures.
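To make the layout above concrete, the following is a minimal PyTorch sketch of such an encoder/decoder pair. The channel width, kernel sizes, strides, and the absence of normalization and activation layers are illustrative assumptions rather than the exact configuration; only the overall structure (four Conv2d layers, four ConvTranspose2d layers, and a residual skip from the first encoder layer to the last decoder layer) follows the description above.

```python
import torch
from torch import nn

class SpatialEncoder(nn.Module):
    """Four plain Conv2d layers; widths and strides here are illustrative assumptions."""
    def __init__(self, in_ch=1, hid_ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, hid_ch, 3, stride=1, padding=1)
        self.rest = nn.Sequential(
            nn.Conv2d(hid_ch, hid_ch, 3, stride=2, padding=1),  # downsample
            nn.Conv2d(hid_ch, hid_ch, 3, stride=1, padding=1),
            nn.Conv2d(hid_ch, hid_ch, 3, stride=2, padding=1),  # downsample
        )

    def forward(self, x):
        skip = self.conv1(x)           # kept for the residual connection to the decoder
        return self.rest(skip), skip

class SpatialDecoder(nn.Module):
    """Four ConvTranspose2d layers mirroring the encoder."""
    def __init__(self, hid_ch=64, out_ch=1):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(hid_ch, hid_ch, 3, stride=2, padding=1, output_padding=1),
            nn.ConvTranspose2d(hid_ch, hid_ch, 3, stride=1, padding=1),
            nn.ConvTranspose2d(hid_ch, hid_ch, 3, stride=2, padding=1, output_padding=1),
        )
        self.readout = nn.ConvTranspose2d(hid_ch, out_ch, 3, stride=1, padding=1)

    def forward(self, z, skip):
        # Residual connection from the first encoder layer to the last decoder layer.
        return self.readout(self.up(z) + skip)
```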

3.3 Temporal Attention Unit

Suppose a batch of input video tensors $\mathcal{B}\in\mathbb{R}^{B\times T\times C\times H\times W}$ with the number of videos $B=|\mathcal{B}|$ is given. In the spatial encoder and decoder, we reshape the sequential input data $B\times T\times C\times H\times W$ as $(B\times T)\times C\times H\times W$ so that only spatial correlations are taken into account. In the temporal module, we reshape the feature $B\times T\times C\times H\times W$ as $B\times(T\times C)\times H\times W$ so that frames are arranged in order along the channel dimension.
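A small sketch of these two reshaping conventions, with illustrative shapes (the actual channel and spatial sizes depend on the dataset and the encoder):

```python
import torch

B, T, C, H, W = 16, 10, 64, 16, 16            # illustrative shapes of a hidden feature
z = torch.randn(B, T, C, H, W)

# Spatial encoder/decoder: fold time into the batch so each frame is processed independently.
z_spatial = z.reshape(B * T, C, H, W)

# Temporal module: stack frames along the channel axis so TAU attends across time.
z_temporal = z.reshape(B, T * C, H, W)
```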

We decompose the temporal attention into the intra-frame statical attention and the inter-frame dynamical attention, as shown in Figure 3. Inspired by the recent progress of vision Transformers (ViTs) [17, 57] and large kernel convolutions [58, 13, 30], we propose to employ small-kernel depth-wise convolutions (DW Conv), depth-wise convolutions with dilations (DW-D Conv), and 1×1 convolutions to model large kernel convolutions. Through the obtained large receptive field on intra-frames, the statical attention is able to capture long-range dependencies. However, the statical attention alone is not enough for learning temporal evolution along the timeline. Thus, we employ the dynamical attention that learns the attention weights of channels in a squeeze-and-excitation manner [38]. The final attention is the product of the dynamical attention and the statical attention.

Figure 3: The intra-frame statical attention and the inter-frame dynamical attention.

We show the detailed scheme of our model in Figure 4. The proposed TAU module can be formally expressed as:

$$\text{SA}=\text{Conv}_{1\times 1}(\text{DW-D Conv}(\text{DW Conv}(H))), \qquad (2)$$
$$\text{DA}=\text{FC}(\text{AvgPool}(H)),$$
$$H^{\prime}=(\text{SA}\otimes\text{DA})\odot H,$$

where $H\in\mathbb{R}^{B\times(T\times C^{\prime})\times H\times W}$ is the hidden feature that will be fed into the TAU module, $\text{SA}\in\mathbb{R}^{B\times(T\times C^{\prime})\times H\times W}$ and $\text{DA}\in\mathbb{R}^{B\times(T\times C^{\prime})\times 1\times 1}$ denote the statical and dynamical attention, and FC and AvgPool are the fully connected layer and the average pooling. We represent the Kronecker product by $\otimes$ and the Hadamard product by $\odot$.
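The following PyTorch sketch follows Eq. (2). The specific kernel sizes, dilation, reduction ratio, and the sigmoid gate in the dynamical branch are assumptions borrowed from common large-kernel and squeeze-and-excitation designs, not values confirmed by the text above.

```python
import torch
from torch import nn

class TAU(nn.Module):
    """Minimal sketch of the Temporal Attention Unit, Eq. (2)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        # Statical attention: DW Conv -> dilated DW Conv -> 1x1 Conv approximates a large kernel.
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.dwd_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=9, dilation=3, groups=dim)
        self.conv1x1 = nn.Conv2d(dim, dim, kernel_size=1)
        # Dynamical attention: squeeze-and-excitation style channel weighting.
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, h):                                   # h: (B, T*C', H, W)
        sa = self.conv1x1(self.dwd_conv(self.dw_conv(h)))   # intra-frame statical attention
        da = self.fc(self.avg_pool(h).flatten(1))           # inter-frame dynamical attention
        da = da[..., None, None]                            # (B, T*C', 1, 1), broadcast over H and W
        return (sa * da) * h                                # broadcasted product realizes SA ⊗ DA, then ⊙ H
```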

Figure 4: The detailed schema of our model.

3.4 Differential Divergence Regularization

To improve the prediction of our model, we further propose a differential divergence regularization that forces the model to learn the differences between consecutive frames and be aware of the inherent variation.

Given the predicted frames $\widehat{\mathcal{Y}}=\mathcal{F}_{\Theta}(\mathcal{X})\in\mathbb{R}^{T^{\prime}\times C\times H\times W}$ and the corresponding ground-truth frames $\mathcal{Y}$, we first calculate their forward differences $\Delta\widehat{\mathcal{Y}},\Delta\mathcal{Y}\in\mathbb{R}^{(T^{\prime}-1)\times C\times H\times W}$, where:

$$\Delta\widehat{\mathcal{Y}}_{i}=\widehat{\mathcal{Y}}_{i+1}-\widehat{\mathcal{Y}}_{i}, \qquad (3)$$
$$\Delta\mathcal{Y}_{i}=\mathcal{Y}_{i+1}-\mathcal{Y}_{i}.$$

Then, we transform the differences into probabilities by applying the softmax function over the channel, height, and width dimensions, obtaining $\sigma(\Delta\widehat{\mathcal{Y}})$ and $\sigma(\Delta\mathcal{Y})$, where:

$$\sigma(\Delta\widehat{\mathcal{Y}})_{i,j,k,l}=\frac{\exp(\Delta\widehat{\mathcal{Y}}_{i,j,k,l}/\tau)}{\sum_{j^{\prime}=1}^{C}\sum_{k^{\prime}=1}^{H}\sum_{l^{\prime}=1}^{W}\exp(\Delta\widehat{\mathcal{Y}}_{i,j^{\prime},k^{\prime},l^{\prime}}/\tau)}, \qquad (4)$$
$$\sigma(\Delta\mathcal{Y})_{i,j,k,l}=\frac{\exp(\Delta\mathcal{Y}_{i,j,k,l}/\tau)}{\sum_{j^{\prime}=1}^{C}\sum_{k^{\prime}=1}^{H}\sum_{l^{\prime}=1}^{W}\exp(\Delta\mathcal{Y}_{i,j^{\prime},k^{\prime},l^{\prime}}/\tau)},$$

and $\tau$ is the temperature parameter, which we empirically set to 0.1 to sharpen the probability distribution. Through the competition mechanism of the softmax function [10, 70, 99], frames with high differences are penalized.

Thus, the differential divergence regularization $\mathcal{L}_{reg}$ is defined as the Kullback-Leibler divergence between the probability distributions $\sigma(\Delta\widehat{\mathcal{Y}})$ and $\sigma(\Delta\mathcal{Y})$:

$$\mathcal{L}_{reg}(\widehat{\mathcal{Y}},\mathcal{Y})=D_{KL}\big(\sigma(\Delta\widehat{\mathcal{Y}})\,\|\,\sigma(\Delta\mathcal{Y})\big)=\sum_{i=1}^{T^{\prime}-1}\sigma(\Delta\widehat{\mathcal{Y}}_{i})\log\frac{\sigma(\Delta\widehat{\mathcal{Y}}_{i})}{\sigma(\Delta\mathcal{Y}_{i})}. \qquad (5)$$

Our model is trained end-to-end in a fully unsupervised manner, and the overall objective function consists of the mean squared error loss and the differential divergence regularization weighted by a constant $\alpha$:

$$\mathcal{L}=\sum_{i=1}^{T^{\prime}}\|\widehat{\mathcal{Y}}_{i}-\mathcal{Y}_{i}\|^{2}+\alpha\,\mathcal{L}_{reg}(\widehat{\mathcal{Y}},\mathcal{Y}), \qquad (6)$$

where the first term focuses on intra-frame-level differences, and the second regularization term focuses on inter-frame-level variations.
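A minimal PyTorch sketch of Eqs. (3)-(6) is given below. It assumes single-sequence tensors of shape (T', C, H, W); the reduction over frames and positions, and the value of alpha, are illustrative choices rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def differential_divergence(y_pred, y_true, tau=0.1):
    """Eqs. (3)-(5): KL divergence between softmax-normalized forward differences.
    y_pred, y_true: (T', C, H, W) predicted and ground-truth frame sequences."""
    d_pred = y_pred[1:] - y_pred[:-1]          # forward differences, (T'-1, C, H, W)
    d_true = y_true[1:] - y_true[:-1]
    # Softmax over the flattened (C, H, W) dimensions of each difference frame, sharpened by tau.
    log_p_pred = F.log_softmax(d_pred.flatten(1) / tau, dim=1)
    log_p_true = F.log_softmax(d_true.flatten(1) / tau, dim=1)
    p_pred = log_p_pred.exp()
    # KL(sigma(dY_hat) || sigma(dY)), summed over difference frames and positions, Eq. (5).
    return (p_pred * (log_p_pred - log_p_true)).sum()

def total_loss(y_pred, y_true, alpha=0.1):
    """Eq. (6): MSE (intra-frame errors) plus the weighted regularization (inter-frame variations).
    alpha = 0.1 is an illustrative weight, not a value stated in the text."""
    mse = ((y_pred - y_true) ** 2).sum()
    return mse + alpha * differential_divergence(y_pred, y_true)
```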

4 Experiments

In this section, we present experiments that demonstrate the effectiveness of our proposed method. The experiments are conducted on various datasets with different settings to validate our proposed model from three aspects:

  • Standard spatiotemporal predictive learning (Section 4.2). We regard the prediction problem with the same number of input and output frames as standard spatiotemporal predictive learning. We evaluate the performance on standard spatiotemporal predictive learning and compare our model with state-of-the-art methods on the Moving MNIST [81] and TaxiBJ [106] datasets.


  • Generalization ability across different datasets (Section 4.3). Generalizing the learned knowledge to other domains is a challenge in unsupervised learning. We investigate such ability of our method by training the model on the KITTI [25] dataset and evaluating it on the Caltech Pedestrian [16] dataset.


  • Predicting frames with flexible lengths (Section 4.4). One of the advantages of recurrent units is that they can easily handle flexible-length frames, as in the KTH dataset [76]. Our work tackles long-length frame prediction by imitating recurrent units, feeding predicted frames back as input and recursively producing long-term predictions.



4.1 Experimental Setups

Datasets

We quantitatively evaluate our model on the following datasets for both synthetic and real-world scenarios:

  • Moving MNIST [81] is a synthetic dataset consisting of two digits independently moving within the 64×64 grid and bouncing off the boundary. It is a standard benchmark in spatiotemporal predictive learning.


  • TaxiBJ contains trajectory data in Beijing collected from taxicab GPS, with two channels, i.e., inflow and outflow as defined in [106]. Following the previous works [98, 29], we normalize the data into [0, 1].


  • KTH [76] contains 25 individuals performing six types of actions. Following [90, 94], we use persons 1-16 for training and 17-25 for testing. Models are trained to predict the next 20 or 40 frames from the previous 10 observations.


  • Caltech Pedestrian is a driving dataset focusing on detecting pedestrians. It consists of approximately 10 hours of 640×480 videos taken from vehicles driving through regular traffic in an urban environment. We follow the same protocol as PredNet [61] and CrevNet [103] for pre-processing, training, and evaluation.



We summarize the statistics of the above datasets in Table 1, including the number of training samples $N_{train}$ and the number of testing samples $N_{test}$.

Table 1: The statistics of datasets. The training or testing set has $N_{train}$ or $N_{test}$ samples, composed of $T$ or $T^{\prime}$ images with the shape $(C,H,W)$.

Dataset   N_train   N_test   (C, H, W)       T    T'
MMNIST    10000     10000    (1, 64, 64)     10   10
TaxiBJ    19627     1334     (2, 32, 32)     4    4
KTH       5200      3167     (1, 128, 128)   10   20 or 40
Caltech   2042      1983     (3, 128, 160)   10   1
Measurement

Following [29, 103], we employ Mean Squared Error (MSE), Mean Absolute Error (MAE), Structural Similarity Index Measure (SSIM), and Peak Signal-to-Noise Ratio (PSNR) to evaluate the quality of predictions. MSE and MAE estimate the absolute pixel-wise errors, SSIM measures the similarity of structural information within the spatial neighborhoods, and PSNR expresses the ratio between the maximum possible power of a signal and the power of the distorting noise.
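As a rough illustration of how these metrics can be computed for a single frame, a sketch using NumPy and scikit-image is given below; the data range and channel-axis conventions are assumptions, and the exact aggregation over frames and sequences is not specified here.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def frame_metrics(pred, true, data_range=1.0):
    """pred, true: (C, H, W) arrays of one predicted and one ground-truth frame."""
    mse = float(np.mean((pred - true) ** 2))
    mae = float(np.mean(np.abs(pred - true)))
    psnr = 10.0 * np.log10(data_range ** 2 / mse)            # peak signal-to-noise ratio
    ssim_val = ssim(pred, true, data_range=data_range, channel_axis=0)
    return mse, mae, psnr, ssim_val
```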

Implementation details

We implement the proposed method with the PyTorch framework and conduct experiments on a single NVIDIA V100 GPU. The model is trained with a mini-batch of 16 video sequences, using the AdamW optimizer with a learning rate of 0.01 and a weight decay of 0.05.
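A minimal sketch of this optimization setup is shown below; `model` and `train_loader` are placeholders assumed to be defined elsewhere (e.g., a network built from the encoder, TAU stack, and decoder sketched above), and `total_loss` refers to the loss sketch in Section 3.4.

```python
import torch

def train(model, train_loader, num_epochs, device="cuda"):
    """Illustrative training loop: AdamW with lr=0.01 and weight_decay=0.05, batches of 16 sequences."""
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=0.05)
    for _ in range(num_epochs):
        for past_frames, future_frames in train_loader:     # each batch holds 16 video sequences
            past_frames = past_frames.to(device)
            future_frames = future_frames.to(device)
            pred_frames = model(past_frames)
            loss = total_loss(pred_frames, future_frames)   # MSE + differential divergence, Eq. (6)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```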

4.2 Standard spatiotemporal predictive learning

Figure 5: Qualitative visualization of predicted results on Moving MNIST dataset. The differences between the ground truth and the predicted frames are visualized in the last row.

4.2.1 Moving MNIST

This dataset is a standard benchmark in spatiotemporal predictive learning. We evaluate our proposed method against strong recent baselines, including competitive recurrent architectures: ConvLSTM [78], PredRNN [95], PredRNN++ [93], MIM [98], LMC [50], E3D-LSTM [94], Conv-TT-LSTM [82], and CrevNet [103]. We also compare with DDPAE [37], a method specifically designed for this dataset. The quantitative results are reported in Table 2, and qualitative visualizations of the predicted results are shown in Figure 5.

Our proposed method significantly outperforms all the baselines above under three different metrics. The performance gain is large with respect to state-of-the-art recurrent methods.

Table 2: Quantitative results of different methods on the Moving MNIST dataset (10 → 10 frames).

Method              MSE↓    MAE↓    SSIM↑
ConvLSTM [78]       103.3   182.9   0.707
VPN [44]            64.1    -       0.870
PredRNN [95]        56.8    126.1   0.867
PredRNN++ [93]      46.5    106.8   0.898
MIM [98]            44.2    101.1   0.910
LMC [50]            41.5    -       0.924
E3D-LSTM [94]       41.3    87.2    0.910
Conv-TT-LSTM [82]   53.0    -       0.915
DDPAE [37]          38.9    90.7    0.922
PhyDNet [29]        24.4    70.3    0.947
SimVP [23]          23.8    68.9    0.948
CrevNet [103]       22.3    -       0.949
Ours                19.8    60.3    0.957

4.2.2 TaxiBJ

We evaluate our proposed model on a complicated real-world dataset, TaxiBJ [106]. Driven by human consciousness, complex real-world traffic flows require modeling transport phenomena and traffic diffusion for prediction. Due to the spatiotemporal nature of the traffic forecasting task, we straightforwardly apply our model to it.

Figure 6: Qualitative visualization of predicted results on TaxiBJ dataset. The differences between the ground truth and the predicted frames are visualized in the last row.

The qualitative visualizations of the predicted results are shown in Figure 6, and the quantitative results are reported in Table 3. Though the given frames are quite different from the future frames, our model can still accurately produce reliable frames. The difference between target frames and predicted frames is mainly located in central spots, but the overall trend approximates the ground truth. It can also be observed that the quantitative results are consistently good, which demonstrates the practical application value in real-world scenarios.

Table 3: Quantitative results of different methods on the TaxiBJ dataset (4 → 4 frames).

Method           MSE × 100↓   MAE↓   SSIM↑
ConvLSTM [78]    48.5         17.7   0.978
PredRNN [95]     46.4         17.1   0.971
PredRNN++ [93]   44.8         16.9   0.977
MIM [98]         42.9         16.6   0.971
E3D-LSTM [94]    43.2         16.9   0.979
PhyDNet [29]     41.9         16.2   0.982
SimVP [23]       41.4         16.2   0.982
Ours             34.4         15.6   0.983

4.3 Generalization across different datasets

The generalization ability is one of the fundamental problems in artificial intelligence. Traditional supervised learning suffers from poor generalization to labeled datasets from different domains. Self-supervised learning aims to learn robust representations from unlabeled data, and the generalization ability of the learned model is then evaluated. While contrastive self-supervised learning and masked self-supervised learning in visual tasks usually evaluate such generalization ability by downstream tasks, we evaluate it by the prediction results across different datasets in spatiotemporal predictive learning.

Figure 7: Qualitative visualization of predicted results on Caltech dataset. The differences between the ground truth and the predicted frames are visualized in the last row.

Following the previous works [61, 103, 53], we train the model using the raw video sequences from the KITTI dataset and evaluate it on the Caltech Pedestrian dataset, which is processed to match the frame rate and image size (128×160) of the KITTI dataset. The final prediction is made for the next frame after a 10-frame warm-up.

We show the qualitative visualization in Figure 7 and report the quantitative results in Table 4. Note that some of the baseline results are copied from [68]. It can be seen that our proposed method achieves state-of-the-art performance under both SSIM and PSNR metrics in this generalization evaluation task. Moreover, our model produces notably robust predictions under variations of illumination and lane lines, suggesting practical applications in autonomous vehicles.

Table 4: Quantitative results of different methods on the Caltech Pedestrian dataset (10 → 1 frame).

Method            SSIM↑   PSNR↑
BeyondMSE [64]    0.847   -
MCnet [90]        0.879   -
DVF [59]          0.897   26.2
Dual-GAN [53]     0.899   -
CtrlGen [31]      0.900   26.5
PredNet [61]      0.905   27.6
ContextVP [4]     0.921   28.7
GAN-VGG [80]      0.916   -
G-VGG [80]        0.917   -
SDC-Net [72]      0.918   -
rCycleGan [47]    0.919   29.2
DPG [20]          0.923   28.2
G-MAE [80]        0.923   -
GAN-MAE [80]      0.923   -
CrevNet [103]     0.925   29.3
STMFANet [41]     0.927   29.1
SimVP [23]        0.940   33.1
Ours              0.946   33.7

4.4 Predicting frames with flexible lengths

Though recurrent units are adept at handling flexible-length frames, our model can also easily tackle such problems by imitating recurrent units, feeding predicted frames back as input to recursively produce predictions. For the KTH dataset, our model is trained to predict the next 20 or 40 frames from the given 10 observations. Moreover, this dataset contains six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with scale variations, outdoors with different clothes, and indoors. The difficulty of this human motion prediction task lies not only in the flexible lengths of the predicted frames but also in the complex dynamics involving the randomness of human consciousness. Following [95], we use the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) as evaluation metrics to measure the frame-wise prediction quality from a perceptual perspective. The detailed quantitative results are shown in Table 5.

Table 5: Quantitative results of different methods on the KTH dataset (10 → 20 and 10 → 40 frames).

                   KTH (10 → 20)      KTH (10 → 40)
Method             SSIM↑    PSNR↑     SSIM↑    PSNR↑
MCnet [90]         0.804    25.95     0.73     23.89
ConvLSTM [78]      0.712    23.58     0.639    22.85
SAVP [48]          0.746    25.38     0.701    23.97
VPN [44]           0.746    23.76     -        -
DFN [40]           0.794    27.26     0.652    23.01
fRNN [67]          0.771    26.12     0.678    23.77
Znet [105]         0.817    27.58     -        -
SV2Pv [1]          0.838    27.79     0.789    26.12
PredRNN [95]       0.839    27.55     0.703    24.16
VarNet [42]        0.843    28.48     0.739    25.37
SAVP-VAE [48]      0.852    27.77     0.811    26.18
PredRNN++ [93]     0.865    28.47     0.741    25.21
MSNET [49]         0.876    27.08     -        -
E3D-LSTM [94]      0.879    29.31     0.810    27.24
STMFANet [41]      0.893    29.85     0.851    27.56
SimVP [23]         0.905    33.72     0.886    32.93
Ours               0.911    34.13     0.897    33.01

4.5 Empirical Running Time

TAU benefits from the parallelizable computation architecture, which leads to fast convergence and high training speed. We evaluate our efficiency by measuring the running time against state-of-the-art spatiotemporal predictive learning methods.

Figure 8: The learning curve comparison between state-of-the-art methods and ours (evaluated by MSE). The red dotted line denotes MSE 35.0, and only the first 50 epochs are shown.

The experiments are conducted on a single Tesla V100 GPU, and the batch size is set as 16. For the Moving MNIST dataset, CrevNet [103] needs about 30 minutes per epoch, and PhyDNet [29] needs about 7 minutes per epoch. Our model only requires 2.5 minutes per epoch. Furthermore, our method is able to converge at a rapid rate. As shown in Figure 8, on the Moving MNIST dataset, our model can achieve MSE 35.0 with only 50 epochs, while CrevNet and PhyDNet are far from this performance.

4.6 Computational Cost and Ablation Study

We compare the performance and computational cost with state-of-the-art methods in the first several rows of Table 6. Our model achieves superior results with better performance and much lower FLOPs. We also conduct ablation studies and summarize the results in Table 6. It can be seen that replacing the TAU modules with the same number of convolutional blocks built from vanilla 3×3 convolutions (Conv Baseline) significantly degrades the performance. Training our model without the differential divergence regularization (w/o DDR) also weakens the prediction results. Both the TAU module and the differential divergence regularization are useful. We also find that the SA and DA branches of the TAU module play important roles in achieving better performance.

Table 6: Ablation study of our proposed method.

Method          M-MNIST   TaxiBJ (× 100)   FLOPs (G)
PredRNN         56.8      46.4             115.94
PredRNN++       46.4      44.8             171.73
MIM             44.2      42.9             179.17
E3D-LSTM        41.3      43.2             298.85
SimVP           23.8      41.4             19.43
Conv Baseline   58.9      43.5             6.11
Ours w/o SA     23.2      40.8             15.30
Ours w/o DA     22.4      38.4             15.96
Ours w/o DDR    21.1      37.7             15.96
Ours            19.8      34.4             15.96

5 Conclusion

In this paper, we present a general framework of spatiotemporal predictive learning and propose an attention-based temporal module to replace the commonly used recurrent units. By decomposing the temporal module into the intra-frame statical attention and the inter-frame dynamical attention, our proposed TAU module achieves competitive performance across various experimental settings and datasets. Moreover, a novel differential divergence regularization is proposed to overcome the drawback of the MSE loss, which only considers intra-frame errors. In summary, our work highlights the importance of both intra-frame and inter-frame variations, enabling the model to capture long-term relations, and provides a new paradigm for efficient spatiotemporal predictive learning.

Acknowledgement.

We thank the anonymous reviewers for their constructive and helpful reviews. This work was supported by the National Key R&D Program of China (2022ZD0115100), the National Natural Science Foundation of China (U21A20427), and the Competitive Research Fund (WU2022A009) from the Westlake Center for Synthetic Biology and Integrated Bioengineering.

References

  • [1] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. In International Conference on Learning Representations, 2018.
  • [2] Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. Fitvid: Overfitting in pixel-level video prediction. arXiv preprint arXiv:2106.13195, 2021.
  • [3] Sarthak Bhagat, Shagun Uppal, Zhuyun Yin, and Nengli Lim. Disentangling multiple features in video sequences using gaussian processes in variational autoencoders. In European Conference on Computer Vision, pages 102–117. Springer, 2020.
  • [4] Wonmin Byeon, Qin Wang, Rupesh Kumar Srivastava, and Petros Koumoutsakos. Contextvp: Fully context-aware video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 753–769, 2018.
  • [5] Hanqun Cao, Cheng Tan, Zhangyang Gao, Guangyong Chen, Pheng-Ann Heng, and Stan Z Li. A survey on generative diffusion model. arXiv preprint arXiv:2209.02646, 2022.
  • [6] Lluis Castrejon, Nicolas Ballas, and Aaron Courville. Improved conditional vrnns for video prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7608–7617, 2019.
  • [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
  • [8] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33:22243–22255, 2020.
  • [9] Yu Cheng, Bo Wang, Bo Yang, and Robby T Tan. Graph and temporal convolutional networks for 3d multi-person pose estimation in monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 4, page 12, 2021.
  • [10] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations, 2020.
  • [11] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, pages 1174–1183. PMLR, 2018.
  • [12] Emily L Denton et al. Unsupervised learning of disentangled representations from video. Advances in Neural Information Processing Systems, 30, 2017.
  • [13] Xiaohan Ding, Xiangyu Zhang, Yizhuang Zhou, Jungong Han, Guiguang Ding, and Jian Sun. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. arXiv preprint arXiv:2203.06717, 2022.
  • [14] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. In International Conference on Learning Representations Workshop, 2015.
  • [15] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. In International Conference on Learning Representations, 2017.
  • [16] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In 2009 IEEE conference on computer vision and pattern recognition, pages 304–311. IEEE, 2009.
  • [17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • [18] Shen Fang, Qi Zhang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Gstnet: Global spatial-temporal network for traffic flow prediction. In IJCAI, pages 2286–2293, 2019.
  • [19] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. Advances in Neural Information Processing Systems, 29, 2016.
  • [20] Hang Gao, Huazhe Xu, Qi-Zhi Cai, Ruth Wang, Fisher Yu, and Trevor Darrell. Disentangling propagation and generation for video prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9006–9015, 2019.
  • [21] Zhangyang Gao, Cheng Tan, Stan Li, et al. Alphadesign: A graph protein design method and benchmark on alphafolddb. arXiv preprint arXiv:2202.01079, 2022.
  • [22] Zhangyang Gao, Cheng Tan, and Stan Z Li. Pifold: Toward effective and efficient protein inverse folding. arXiv preprint arXiv:2209.12643, 2022.
  • [23] Zhangyang Gao, Cheng Tan, Lirong Wu, and Stan Z. Li. Simvp: Simpler yet better video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3170–3180, June 2022.
  • [24] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In International Conference on Machine Learning, pages 1243–1252. PMLR, 2017.
  • [25] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • [26] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, 2018.
  • [27] Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In International Conference on Machine Learning, pages 2424–2433. PMLR, 2019.
  • [28] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, koray kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21271–21284. Curran Associates, Inc., 2020.
  • [29] Vincent Le Guen and Nicolas Thome. Disentangling physical dynamics from unknown factors for unsupervised video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11474–11484, 2020.
  • [30] Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, and Shi-Min Hu. Visual attention network. arXiv preprint arXiv:2202.09741, 2022.
  • [31] Zekun Hao, Xun Huang, and Serge Belongie. Controllable video generation with sparse trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7854–7863, 2018.
  • [32] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • [33] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
  • [34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [35] Geoffrey Hinton. How to represent part-whole hierarchies in a neural network. arXiv preprint arXiv:2102.12627, 2021.
  • [36] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [37] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. Advances in Neural Information Processing Systems, 31, 2018.
  • [38] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [39] Simon Jenni, Givi Meishvili, and Paolo Favaro. Video representation learning by recognizing temporal transformations. In European Conference on Computer Vision, pages 425–442. Springer, 2020.
  • [40] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. Advances in neural information processing systems, 29, 2016.
  • [41] Beibei Jin, Yu Hu, Qiankun Tang, Jingyu Niu, Zhiping Shi, Yinhe Han, and Xiaowei Li. Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4554–4563, 2020.
  • [42] Beibei Jin, Yu Hu, Yiming Zeng, Qiankun Tang, Shice Liu, and Jing Ye. Varnet: Exploring variations for unsupervised video prediction. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5801–5806. IEEE, 2018.
  • [43] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
  • [44] Nal Kalchbrenner, Aäron Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In International Conference on Machine Learning, pages 1771–1779. PMLR, 2017.
  • [45] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
  • [46] Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvarinen. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207–2217. PMLR, 2020.
  • [47] Yong-Hoon Kwon and Min-Gyu Park. Predicting future frames using retrospective cycle gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1811–1820, 2019.
  • [48] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
  • [49] Jungbeom Lee, Jangho Lee, Sungmin Lee, and Sungroh Yoon. Mutual suppression network for video prediction using disentangled features. arXiv preprint arXiv:1804.04810, 2018.
  • [50] Sangmin Lee, Hak Gu Kim, Dae Hwi Choi, Hyung-Il Kim, and Yong Man Ro. Video prediction recalling long-term motion context via memory alignment learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3054–3063, 2021.
  • [51] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, 2020.
  • [52] Siyuan Li, Zedong Wang, Zicheng Liu, Cheng Tan, Haitao Lin, Di Wu, Zhiyuan Chen, Jiangbin Zheng, and Stan Z Li. Efficient multi-order gated aggregation network. arXiv preprint arXiv:2211.03295, 2022.
  • [53] Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P Xing. Dual motion gan for future-flow embedded video prediction. In proceedings of the IEEE international conference on computer vision, pages 1744–1752, 2017.
  • [54] Xiaoli Liu, Jianqin Yin, Jin Liu, Pengxiang Ding, Jun Liu, and Huaping Liu. Trajectorycnn: a new spatio-temporal feature learning network for human motion prediction. IEEE Transactions on Circuits and Systems for Video Technology, 31(6):2133–2146, 2020.
  • [55] Zicheng Liu, Siyuan Li, Ge Wang, Cheng Tan, Lirong Wu, and Stan Z Li. Decoupled mixup for data-efficient learning. arXiv preprint arXiv:2203.10761, 2022.
  • [56] Zicheng Liu, Siyuan Li, Di Wu, Zihan Liu, Zhiyuan Chen, Lirong Wu, and Stan Z Li. Automix: Unveiling the power of mixup for stronger classifiers. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 441–458. Springer, 2022.
  • [57] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  • [58] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. arXiv preprint arXiv:2201.03545, 2022.
  • [59] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 4463–4471, 2017.
  • [60] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pages 4114–4124. PMLR, 2019.
  • [61] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. In International Conference on Learning Representations, 2017.
  • [62] Marc’Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michaël Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. CoRR, abs/1412.6604, 2, 2014.
  • [63] Emile Mathieu, Tom Rainforth, Nana Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pages 4402–4412. PMLR, 2019.
  • [64] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
  • [65] Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin P Murphy, and Honglak Lee. Unsupervised learning of object structure and dynamics from videos. Advances in Neural Information Processing Systems, 32, 2019.
  • [66] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69–84. Springer, 2016.
  • [67] Marc Oliu, Javier Selva, and Sergio Escalera. Folded recurrent neural networks for future video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 716–731, 2018.
  • [68] Sergiu Oprea, Pablo Martinez-Gonzalez, Alberto Garcia-Garcia, John Alejandro Castro-Vargas, Sergio Orts-Escolano, Jose Garcia-Rodriguez, and Antonis Argyros. A review on deep learning techniques for video prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [69] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  • [70] Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. In International Conference on Learning Representations, 2020.
  • [71] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. Spatiotemporal contrastive video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6964–6974, 2021.
  • [72] Fitsum A Reda, Guilin Liu, Kevin J Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, and Bryan Catanzaro. Sdc-net: Video prediction using spatially-displaced convolution. In Proceedings of the European Conference on Computer Vision (ECCV), pages 718–733, 2018.
  • [73] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
  • [74] Markus Reichstein, Gustau Camps-Valls, Bjorn Stevens, Martin Jung, Joachim Denzler, Nuno Carvalhais, et al. Deep learning and process understanding for data-driven earth system science. Nature, 566(7743):195–204, 2019.
  • [75] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28:91–99, 2015.
  • [76] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 3, pages 32–36. IEEE, 2004.
  • [77] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 28, 2015.
  • [78] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 28, 2015.
  • [79] Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. Advances in Neural Information Processing Systems, 30, 2017.
  • [80] Osamu Shouno. Photo-realistic video prediction on natural videos of largely changing frames. arXiv preprint arXiv:2003.08635, 2020.
  • [81] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852. PMLR, 2015.
  • [82] Jiahao Su, Wonmin Byeon, Jean Kossaifi, Furong Huang, Jan Kautz, and Anima Anandkumar. Convolutional tensor-train lstm for spatio-temporal learning. Advances in Neural Information Processing Systems, 33:13714–13726, 2020.
  • [83] Cheng Tan, Zhangyang Gao, and Stan Z Li. Rfold: Towards simple yet effective rna secondary structure prediction. arXiv preprint arXiv:2212.14041, 2022.
  • [84] Cheng Tan, Zhangyang Gao, and Stan Z Li. Target-aware molecular graph generation. arXiv preprint arXiv:2202.04829, 2022.
  • [85] Cheng Tan, Zhangyang Gao, Lirong Wu, Siyuan Li, and Stan Z Li. Hyperspherical consistency regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7244–7255, 2022.
  • [86] Cheng Tan, Jun Xia, Lirong Wu, and Stan Z Li. Co-learning: Learning from noisy labels with self-supervision. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1405–1413, 2021.
  • [87] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
  • [88] Ruben Villegas, Dumitru Erhan, Honglak Lee, et al. Hierarchical long-term video prediction without supervision. In International Conference on Machine Learning, pages 6038–6046. PMLR, 2018.
  • [89] Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V Le, and Honglak Lee. High fidelity video prediction with large stochastic recurrent neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • [90] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. In International Conference on Learning Representations, 2017.
  • [91] Pichao Wang, Wanqing Li, Philip Ogunbona, Jun Wan, and Sergio Escalera. Rgb-d-based human motion recognition with deep learning: A survey. Computer Vision and Image Understanding, 171:118–139, 2018.
  • [92] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
  • [93] Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S Yu. Predrnn++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In International Conference on Machine Learning, pages 5123–5132. PMLR, 2018.
  • [94] Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei-Fei. Eidetic 3d lstm: A model for video prediction and beyond. In International conference on learning representations, 2018.
  • [95] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Advances in Neural Information Processing Systems, 30, 2017.
  • [96] Yunbo Wang, Haixu Wu, Jianjin Zhang, Zhifeng Gao, Jianmin Wang, Philip S Yu, and Mingsheng Long. Predrnn: A recurrent neural network for spatiotemporal predictive learning. arXiv preprint arXiv:2103.09504, 2021.
  • [97] Yunbo Wang, Jianjin Zhang, Hongyu Zhu, Mingsheng Long, Jianmin Wang, and Philip S Yu. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9154–9162, 2019.
  • [98] Yunbo Wang, Jianjin Zhang, Hongyu Zhu, Mingsheng Long, Jianmin Wang, and Philip S Yu. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9154–9162, 2019.
  • [99] Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Flowformer: Linearizing transformers with conservation flows. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 24226–24242. PMLR, 17–23 Jul 2022.
  • [100] Haixu Wu, Zhiyu Yao, Jianmin Wang, and Mingsheng Long. Motionrnn: A flexible model for video prediction with spacetime-varying motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15435–15444, 2021.
  • [101] Ziru Xu, Yunbo Wang, Mingsheng Long, and Jianmin Wang. Predcnn: Predictive learning with cascade convolutions. In IJCAI, pages 2940–2947, 2018.
  • [102] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32, 2019.
  • [103] Wei Yu, Yichao Lu, Steve Easterbrook, and Sanja Fidler. Efficient and information-preserving future frame prediction and beyond. In International Conference on Learning Representations, 2019.
  • [104] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.
  • [105] Jianjin Zhang, Yunbo Wang, Mingsheng Long, Jianmin Wang, and Philip S Yu. Z-order recurrent neural networks for video prediction. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 230–235. IEEE, 2019.
  • [106] Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Thirty-first AAAI conference on artificial intelligence, 2017.
  • [107] Liang Zhang, Guangming Zhu, Peiyi Shen, Juan Song, Syed Afaq Shah, and Mohammed Bennamoun. Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 3120–3128, 2017.
  • [108] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European conference on computer vision, pages 649–666. Springer, 2016.
  • [109] Jiangbin Zheng, Yidong Chen, Chong Wu, Xiaodong Shi, and Suhail Muhammad Kamal. Enhancing neural sign language translation by highlighting the facial expression information. Neurocomputing, 464:462–472, 2021.
  • [110] Jiangbin Zheng, Yile Wang, Ge Wang, Jun Xia, Yufei Huang, Guojiang Zhao, Yue Zhang, and Stan Li. Using context-to-vector with graph retrofitting to improve word embeddings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8154–8163, 2022.