
STAR: Sparse Transformer-based Action Recognition

Feng Shi1, Chonghan Lee2, Liang Qiu1, Yizhou Zhao1, Tianyi Shen2, Shivran Muralidhar2, Tian Han3, Song-Chun Zhu1, Vijaykrishnan Narayanan2

Abstract

The cognitive system for human action and behavior has evolved into a deep learning regime, and the advent of Graph Convolutional Networks in particular has transformed the field in recent years. However, previous works have mainly focused on over-parameterized and complex models based on dense graph convolution networks, resulting in low efficiency in training and inference. Meanwhile, Transformer architecture-based models have not yet been well explored for cognitive applications in human action and behavior estimation. This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of the data. Our model can also process video clips of variable length grouped into a single batch. Experiments show that our model achieves comparable performance while using far fewer trainable parameters, and trains and infers at high speed. Compared with the baseline models at competitive accuracy, our model achieves a $4\sim18\times$ speedup and is $\frac{1}{7}\sim\frac{1}{15}$ of their model size.

Introduction

Human action recognition plays a crucial role in many real-world applications, such as holistic scene understanding, video surveillance, and human-computer interaction (Htike et al. 2014; Choi et al. 2008). In particular, skeleton-based human action recognition has attracted much attention in recent years and has shown its effectiveness. The skeleton representation contains a time series of 2D or 3D coordinates of human key joints, providing dynamic body-movement information that, in contrast to the raw RGB representation, is robust to variations in lighting conditions and background noise.

Earlier skeleton-based human action recognition methods focus on designing hand-crafted features extracted from the joint coordinates (Vemulapalli, Arrate, and Chellappa 2014; Wang et al. 2012) and aggregating learned features using RNNs and CNNs (Yong Du, Wang, and Wang 2015; Xie et al. 2018; Zhang et al. 2017; Liu, Tu, and Liu 2017; Ke et al. 2017). However, these methods rarely explore the relations between body joints and result in unsatisfactory performance. Recent methods focus on exploring the natural connection of human body joints and have successfully adopted Graph Convolutional Networks (GCNs), which, similarly to Convolutional Neural Networks (CNNs) but on non-Euclidean domain data, execute convolutional operations to aggregate the features of connected and related joints. Yan et al. (Yan, Xiong, and Lin 2018) proposed the ST-GCN model to extract discriminative features from spatial and temporal graphs of body joints. Following the success of ST-GCN, many works proposed optimizations to ST-GCN to improve its performance and network capacity (Shi et al. 2019; Li et al. 2019; Liu et al. 2020).

However, the existing GCN-based models are often impractical for real-time applications due to their vast computational complexity and memory usage. The baseline GCN model, e.g., ST-GCN, consists of more than 3.09 million parameters and costs at least 16.2 GFLOPs to run inference on a single action video sample (Yan, Xiong, and Lin 2018). DGNN, which is composed of incremental GCN modules, even contains 26 million model parameters (Shi et al. 2019). Such high model complexity makes training and inference difficult and renders the models unsuitable for deployment on edge devices. Furthermore, these GCN-based models process fixed-size action videos by padding repetitive frames and zeros to match the maximum number of frames and persons depicted in the videos. These additional paddings increase the required latency and memory, hindering their adoption in real-time and embedded applications.

This paper proposes a sparse transformer-based action recognition (STAR) model as a novel baseline for skeleton action modeling to address the above shortcomings. Transformers have been a popular choice in natural language processing. Recently, they have been employed in computer vision to attain competitive results compared to convolutional networks, while requiring fewer computational resources to train (Dosovitskiy et al. 2020; Huang et al. 2019). Inspired by these Transformer architectures, our model consists of spatial and temporal encoders, which apply sparse attention and segmented linear attention to skeleton sequences along the spatial and temporal dimensions, respectively. Our sparse attention module along the spatial dimension performs sparse matrix multiplications to extract correlations of connected joints, whereas previous approaches utilize dense matrix multiplications in which most of the entries are zeros, causing extra computation. The segmented linear attention mechanism along the temporal dimension further reduces computation and memory usage by processing sequences of variable length. We also apply segmented positional encoding to the data embedding to provide the notion of time-series ordering along the temporal dimension of variable-length skeleton data. Additionally, segmented context attention performs weighted summarization across the entire set of video frames, making our model robust compared to GCN-based models with their fixed-length receptive field on the temporal dimension.

Compared to the baseline GCN model (ST-GCN), our model (STAR) achieves higher performance with a much smaller model size on the two datasets, NTU RGB+D 60 and 120. The major contributions of this work are listed below:

  • We focus on designing an efficient model purely based on the self-attention mechanism. We propose the sparse transformer-based action recognition (STAR) model, which processes skeleton action sequences of variable length without additional preprocessing or zero padding. The flexibility of our model is beneficial for real-time applications or edge platforms with limited computational resources.

  • We propose a sparse self-attention module that efficiently performs sparse matrix multiplications to capture spatial correlations between human skeleton joints.

  • We propose a segmented linear self-attention module that effectively captures temporal correlations of dynamic joint movements across the time dimension.

  • Experiments show that our model is $5\sim7\times$ smaller than the baseline models while providing a $4\sim18\times$ execution speedup.

Related works

Skeleton-Based Action Recognition

Recently, skeleton-based action recognition has attracted much attention since its compact skeleton data representation makes the models more efficient and free from variations in lighting conditions and other environmental noise. Earlier approaches to skeleton-based action modeling mainly worked on designing hand-crafted features and relations between joints. More recently, by exploiting the inherent connectivity of the human body, Graph Convolutional Networks (GCNs), and ST-GCN in particular, have achieved massive success on this task. The model consists of spatial and temporal convolution modules similar to the conventional convolutional filters used for images (Yan, Xiong, and Lin 2018). The graph adjacency matrix encodes the skeleton joints' connections and extracts high-level spatial representations from the skeleton action sequence. On the temporal dimension, 1D convolutional filters facilitate extracting dynamic information.

Many subsequent works have proposed improvements to ST-GCN to boost its performance. Li et al. (Li et al. 2019) proposed AS-GCN, which leveraged the potential of adjacency matrices to scale the human skeleton's connectivity. Furthermore, they generated semantic links to better capture structural and action semantics with additional information aggregation. Lei et al. (Shi et al. 2019a) proposed Directed Graph Neural Networks (DGNNs), which incorporate joint and bone information to represent the skeleton data as a directed acyclic graph. Liu et al. (Liu et al. 2020) proposed a unified spatial-temporal graph convolution module (G3D) to aggregate information across space and time for effective feature learning.

Some studies have focused on the computational complexity of GCN-based methods. Cheng et al. (Cheng et al. 2020) proposed Shift-GCN, which leverages shift graph operations and point-wise convolutions to reduce the computational complexity. Song et al. (Song et al. 2020) proposed a multi-branch ResGCN that fuses different spatio-temporal features from multiple branches and uses residual bottleneck modules to obtain competitive performance with fewer parameters. Compared to these methods, our spatial and temporal self-attention modules have several essential distinctions: our model can process skeleton sequences of variable length without zero-padding preprocessing, and it can retrieve global context on the temporal dimension by applying self-attention to the entire set of frames of the input sequence.

Transformers and Self-Attention Mechanism

Vaswani et al. (Vaswani et al. 2017) first introduced Transformers for machine translation, and they have since become the state-of-the-art method in various NLP tasks. For example, GPT and BERT (Radford et al. 2018; Devlin et al. 2019) are currently the Transformer-based language models that have achieved the best performance. The core component of Transformer architectures is a self-attention mechanism that learns the relationships between the elements of a sequence. In contrast to recurrent networks, which process sequences in a recursive fashion and are limited to attending to short-term context, Transformer architectures enable modeling long dependencies in a sequence. Furthermore, the multi-head self-attention operations can be easily parallelized. Recently, Transformer-based models have attracted much attention in the computer vision community. The convolution operation has been the core of conventional deep learning models for computer vision tasks. However, the operation has drawbacks: convolution operates on a fixed-size window, which only captures short-range dependencies. The same applies to GCNs, where the graph convolution operation is incapable of capturing long-range relations between joints in both the spatial and temporal dimensions.

Vision Transformer (ViT) (Dosovitskiy et al. 2020) is the first work to completely replace standard convolutions in deep neural networks on large-scale image recognition tasks. Huang et al. (Huang et al. 2019) explored sparse attention to study the trade-off between computational efficiency and performance of a Transformer model on the image classification task. A recent study (Plizzari, Cannici, and Matteucci 2020) proposed a hybrid model consisting of Transformer encoder and GCN modules for the skeleton-based human action recognition task. Nevertheless, to the best of our knowledge, no prior study has completely replaced GCNs with the Transformer architecture.

Methodology

In this section, we present the algorithms used in our model and its overall architecture.

Section Spatial domain: Sparse MHSA describes the sparse multi-head self-attention (MHSA) mechanism used in the spatial Transformer encoder module; Section Temporal domain: Segmented Linear MHSA introduces the novel data format and the corresponding linear multi-head self-attention mechanism for the temporal Transformer encoder; Section STAR Framework shows the overall framework of our model and the related auxiliary modules.

Figure 1: Illustration of our sparse attention module: given the queries $Q$ and the keys $K$ of the skeleton embedding, the feature vectors of joints $i$ and $j$ are correlated with attention weight $\alpha_{i,j}$. The solid black lines on the skeleton represent the physical connections of the human skeleton. The dashed lines connecting two joints represent the artificial attention between joints.

Spatial domain: Sparse MHSA

The crucial component of our spatial Transformer encoder is the sparse multi-head self-attention module. GCN-based models and previous Transformer models, such as ST-GCN and ST-TR, utilize a dense skeleton representation to aggregate the features of neighboring nodes. This dense adjacency matrix representation contains 625 entries for the NTU dataset, while the actual number of joint connections representing the skeleton is only 24. This means that 96% of the matrix multiplications are unnecessary computations on zero entries. We therefore propose a sparse attention mechanism, which only performs matrix multiplications on the sparse node connections. This allows each joint to aggregate information only from its neighboring joints based on the attention coefficients, which are dynamically assigned to the corresponding connections.

The joint connections are based on the topology of the skeleton, which is a tree structure. The attentions inherited from this topology are seen as physical attention (or real attention), as illustrated in Figure 1. To enlarge the attending field, we also artificially add more links between joints according to the logical relations of body parts, and we call these artificially created attentions artificial attention, as shown by the dashed yellow arrows in Figure 1. For simplicity, suppose that the skeleton adjacency matrix is $A$; then the artificial links for additional spatial attention are obtained through $A^2$ and $A^3$. Hence, in our model, the spatial attention maps are evaluated based on the topology representation of $A+A^2+A^3$.
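
As a concrete illustration, the sketch below derives such an augmented attention topology from the physical skeleton edges by accumulating the reachability of $A$, $A^2$, and $A^3$. It is a minimal example in PyTorch under our own assumptions (a toy 5-joint chain and a hypothetical `augmented_edges` helper), not the exact preprocessing used in the paper.

```python
import torch

def augmented_edges(edge_index: torch.Tensor, num_joints: int, hops: int = 3) -> torch.Tensor:
    """Return the attention links implied by A + A^2 + ... + A^hops."""
    # Dense symmetric adjacency from the physical (tree-structured) skeleton edges.
    A = torch.zeros(num_joints, num_joints)
    A[edge_index[0], edge_index[1]] = 1.0
    A = (A + A.t()).clamp(max=1.0)

    reach = torch.zeros_like(A)
    P = torch.eye(num_joints)
    for _ in range(hops):
        P = P @ A              # P becomes A^k at iteration k
        reach = reach + P      # accumulate A + A^2 + ... + A^k
    src, dst = (reach > 0).nonzero(as_tuple=True)
    return torch.stack([src, dst], dim=0)   # (2, num_links) sparse topology

# Hypothetical toy skeleton: a chain of 5 joints 0-1-2-3-4.
chain = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
print(augmented_edges(chain, num_joints=5).shape)
```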

The sparse attention is calculated according to the connectivity between joints, as described in the equations below: after the embedding in Equation 1, the joint-to-joint attention between a pair of connected joints is computed by taking an exponential score of the dot product of the two joints' feature vectors and normalizing it by the sum of the exponential scores over all neighboring joints (Equation 2); the normalized weights are then used to aggregate the neighboring value vectors (Equation 3).

$$Q = XW_q,\quad K = XW_k,\quad V = XW_v \qquad (1)$$
$$\alpha_{i,j} = \frac{\left<q_i, k_j\right>}{\sum_{n\in N(i)}\left<q_i, k_n\right>} \qquad (2)$$
$$v_i' = \sum_{j\in N(i)}\alpha_{i,j}\, v_j, \quad\text{or}\quad V' = \mathcal{A}V \qquad (3)$$

where $Q$, $K$, and $V$ are the queries, keys, and values in Transformer terminology, respectively; $q_i = Q(i)$, $k_j = K(j)$, $v_j = V(j)$, and $\left<q,k\right> = \exp\left(\frac{q^T k}{\sqrt{d}}\right)$. Finally, we obtain the attention maps $\mathcal{A}$ as multi-dimensional (multi-head) sparse matrices sharing the identical topology described by a single adjacency matrix (including the links for artificial attention), where the attention coefficients are $\mathcal{A}(i,j) = \alpha_{i,j}$. The sparse operation can be fulfilled with tensor gathering and scattering operations for parallelism.
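
To make the gather/scatter formulation concrete, here is a minimal single-head sketch of Equations 1-3 in PyTorch. It uses plain `index_add_` in place of the PyTorch Scatter primitives mentioned later, processes one frame of joints at a time, and the class name and shapes are our own assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class SparseAttention(nn.Module):
    """Single-head sketch of Eqs. (1)-(3): attention restricted to the given joint links."""
    def __init__(self, in_dim: int, d: int):
        super().__init__()
        self.Wq = nn.Linear(in_dim, d, bias=False)
        self.Wk = nn.Linear(in_dim, d, bias=False)
        self.Wv = nn.Linear(in_dim, d, bias=False)
        self.d = d

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (V, C) joint features of one frame; edge_index: (2, E) attention links (src -> dst).
        Q, K, V = self.Wq(x), self.Wk(x), self.Wv(x)
        src, dst = edge_index
        # <q_i, k_j> = exp(q_i^T k_j / sqrt(d)) for every link (i = dst, j = src): Eq. (2) numerator.
        score = torch.exp((Q[dst] * K[src]).sum(-1) / self.d ** 0.5)
        # Normalize over each joint's neighborhood with a scatter-add: Eq. (2) denominator.
        denom = torch.zeros(x.size(0), device=x.device).index_add_(0, dst, score)
        alpha = score / denom[dst]
        # Aggregate neighboring values weighted by alpha (Eq. (3)), again via scatter-add.
        return torch.zeros_like(V).index_add_(0, dst, alpha.unsqueeze(-1) * V[src])

# Usage sketch: 25 joints with 3-channel coordinates and a hypothetical link list.
joints = torch.randn(25, 3)
links = torch.randint(0, 25, (2, 100))
print(SparseAttention(3, 64)(joints, links).shape)   # torch.Size([25, 64])
```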

Temporal domain: Segmented Linear MHSA

Figure 2: Illustration of the data format used in our framework: previous works used the upper data format, which has fixed-size time and person dimensions. Our work adopts the new data format at the bottom, which combines the batch, person, and time dimensions into a single variable-length sequence.

The most apparent drawbacks of the previous approaches (Yan, Xiong, and Lin 2018; Shi et al. 2019b) are (1) the fixed number of frames for each video clip and (2) the zero-filling for the non-existing second person. The first drawback constrains their scalability to video clips longer than the predefined length and their flexibility on shorter video clips. The second drawback degrades latency because the zero entries still participate in the computation; moreover, a significant amount of memory space is allocated to this zero-valued data during the computation. We therefore propose a compact data format to bypass these drawbacks, together with a segmented linear MHSA to process it.

Figure 3: Illustration of different attention operations: (a) standard attention, obtained from $\mathrm{Softmax}(QK^T)V$, has a complexity of $\mathcal{O}(n^2)$; (b) linearized attention $\phi(Q)(\phi(K^T)V)$ with kernel function $\phi(\cdot)$ reduces the complexity to $\mathcal{O}(n)$; (c) we extend the linearized attention in (b) to process segments of sequences.

Variable Frame Length Data Format

Figure 2 shows the comparison between our data format and the format used by previous works. In the data format adopted by previous works, longer videos are cut to the predefined length and shorter videos are padded with repeated frames. Furthermore, frames with a single person are zero-padded to match the fixed number of persons. The upper data format in Figure 2 illustrates the NTU RGB+D data format used by previous works. In each fixed-length video $V^{(i)}$, $P_1^{(i)}$ and $P_2^{(i)}$ represent the two persons. In the NTU RGB+D 120 dataset, only 26 out of 120 actions are mutual actions, which means that the second person's skeleton is just zeros ($P_2^{(i)}=0$ in Figure 2) in most data samples. In contrast to the previous data format, the proposed format maintains the original length of each video clip. Additionally, when a video clip contains two persons, we concatenate them along the frame dimension. Instead of keeping an individual dimension for a batch of video clips, we further concatenate the video clips in a batch along the frame dimension, and an auxiliary vector stores the batch indices indicating to which video clip each frame belongs, as shown in the bottom data format of Figure 2. Moreover, given the new dimensions ($N$, $V$, $C$) shown in Figure 2, where $N$ is the total number of frames after concatenating the video clips along the temporal dimension and $V$ is the number of skeleton joints, we regard dimension $N$ as the logical batch size for spatial attention and dimension $V$ as the logical batch size for temporal attention.
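
The sketch below shows one way such a packed batch could be built: clips of different lengths (and person counts) are flattened along the frame dimension and a batch-index vector records their clip membership. The helper name `pack_clips` and the toy shapes are our own assumptions for illustration.

```python
import torch

def pack_clips(clips):
    """Pack variable-length skeleton clips into one (N, V, C) tensor plus batch indices.

    clips: list of tensors of shape (T_i, M_i, V, C) -- frames, persons, joints, channels.
    Persons are concatenated along the frame dimension, so a two-person clip of T frames
    contributes 2*T rows; the auxiliary `batch` vector records which clip each row came from.
    """
    frames, batch = [], []
    for idx, clip in enumerate(clips):
        T, M, V, C = clip.shape
        rows = clip.permute(1, 0, 2, 3).reshape(T * M, V, C)   # (T*M, V, C)
        frames.append(rows)
        batch.append(torch.full((T * M,), idx, dtype=torch.long))
    return torch.cat(frames, dim=0), torch.cat(batch, dim=0)

# Hypothetical batch: a 40-frame single-person clip and a 25-frame two-person clip.
x, batch = pack_clips([torch.randn(40, 1, 25, 3), torch.randn(25, 2, 25, 3)])
print(x.shape, batch.shape)   # torch.Size([90, 25, 3]) torch.Size([90])
```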

Figure 4: Segmented attention: directly applying the linearized attention to our new data format would compute unwanted attention between two irrelevant video clips, which is error-prone. Therefore, we use segmented attention corresponding to each video sequence.

Segmented Linear Attention

With the new data format introduced in the previous section, we propose a novel linear multi-head attention tailored to this format, which we call segmented linear attention. As stated in the previous sections, Transformers were originally designed for sequential data. In a human skeleton sequence, each joint's dynamic movement across the frames can be regarded as a time series. Therefore, the 3D coordinates, i.e., $(x,y,z)$, of every joint can be processed individually along its trajectory over the time dimension, and applying attention extracts the interactions among the time steps represented by frames.

Linear Attention. The standard dot-product attention mechanism (Vaswani et al. 2017) (Equation 4) with a global receptive field over $N$ inputs is prohibitively slow due to its quadratic time and memory complexity $\mathcal{O}(N^2)$. The quadratic complexity also makes Transformers hard to train and limits the context. Recent research on linearized attention mechanisms derives approximations of Softmax-based attention. The most appealing ones are linear Transformers (Katharopoulos et al. 2020; Choromanski et al. 2021; Shen et al. 2021) based on kernel functions approximating the Softmax. Linearized Transformers can improve inference speed by up to three orders of magnitude without much loss in predictive performance (Tay et al. 2020). Given the projected embeddings $Q$, $K$, and $V$ for the input tensors of queries, keys, and values, respectively, and observing the accumulated value $V_i'\in\mathbf{R}^{d}$ for the query $Q_i\in\mathbf{R}^{d}$ at position $i$, where $d$ is the channel dimension, the attention can be transformed from Equation 4 into the linearized form of Equation 5. The computational complexity is reduced to $\mathcal{O}(Nd)$; when $d$ is much smaller than $N$, the computational complexity approaches linear $\mathcal{O}(N)$:

$$V_i' = \frac{\sum_{j=1}^{N}\left<Q_i, K_j\right>V_j}{\sum_{j=1}^{N}\left<Q_i, K_j\right>} \qquad (4)$$
$$V_i' = \frac{\phi(Q_i)^T\sum_{j=1}^{N}\phi(K_j)V_j^T}{\phi(Q_i)^T\sum_{j=1}^{N}\phi(K_j)} = \frac{\phi(Q_i)^T U}{\phi(Q_i)^T Z},\qquad U = \sum_{j=1}^{N}\phi(K_j)V_j^T,\quad Z = \sum_{j=1}^{N}\phi(K_j) \qquad (5)$$

where $\phi(\cdot)$ is the kernel function. In the work of (Katharopoulos et al. 2020), the kernel function is simply realized with an ELU, $\phi(x) = \mathrm{elu}(x) + 1$; whereas (Choromanski et al. 2021) introduces Fast Attention via Orthogonal Random Feature (FAVOR) maps as the kernel function, $\phi(x) = \frac{c}{\sqrt{M}}f(Wx+b)^T$, where $c>0$ is a constant, $W\in\mathbf{R}^{M\times d}$ is a Gaussian random feature matrix, and $M$ is the dimensionality of this matrix, which controls the number of random features.
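
As a minimal sketch of Equation 5 with the $\phi(x) = \mathrm{elu}(x)+1$ kernel, the snippet below computes $U$ and $Z$ once and reuses them for every query, giving the $\mathcal{O}(Nd)$ cost discussed above; the function name and the small numerical-stability epsilon are our own additions.

```python
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    """Sketch of Eq. (5): linearized attention with the kernel phi(x) = elu(x) + 1.

    Q, K: (N, d) projected queries/keys, V: (N, d_v) values of one trajectory.
    Cost is O(N * d * d_v) instead of the O(N^2) of Softmax(Q K^T) V.
    """
    phi_q = F.elu(Q) + 1.0
    phi_k = F.elu(K) + 1.0
    U = phi_k.t() @ V                          # (d, d_v)  = sum_j phi(K_j) V_j^T
    Z = phi_k.sum(dim=0)                       # (d,)      = sum_j phi(K_j)
    numer = phi_q @ U                          # (N, d_v)  = phi(Q_i)^T U
    denom = (phi_q @ Z.unsqueeze(-1)) + eps    # (N, 1)    = phi(Q_i)^T Z
    return numer / denom
```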

Figure 5: Illustration of the overall pipeline of our approach (STAR)

Segmented Linear Attention. Since we concatenate video clips of various lengths within a single batch along the time dimension, directly applying linear attention would cause cross-clip attention, taking irrelevant information from one video clip into account for another, as shown in Figure 4. Therefore, we treat the frames of a video clip as a segment and design the segmented linear attention by reformulating Equation 5 with segment indices. Thus, for each $V_i$ in segment $\mathcal{S}_m$, we summarize

$$V_{i\in\mathcal{S}_m}' = \frac{\phi(Q_{i\in\mathcal{S}_m})^T\sum_{j\in\mathcal{S}_m}\phi(K_j)V_j^T}{\phi(Q_{i\in\mathcal{S}_m})^T\sum_{j\in\mathcal{S}_m}\phi(K_j)} = \frac{\phi(Q_{i\in\mathcal{S}_m})^T U_{\mathcal{S}_m}}{\phi(Q_{i\in\mathcal{S}_m})^T Z_{\mathcal{S}_m}},\qquad U_{\mathcal{S}_m} = \sum_{j\in\mathcal{S}_m}\phi(K_j)V_j^T,\quad Z_{\mathcal{S}_m} = \sum_{j\in\mathcal{S}_m}\phi(K_j) \qquad (6)$$

where $\mathcal{S}_m$ is the $m$-th segment, and the reduction operation $\sum_{j\in\mathcal{S}_m}f(x)$ can be easily implemented through indexing into segments; with the help of gathering and scattering operations (Fey 2021a), the segmented linear attention maintains highly parallel computation. Figure 3 illustrates the comparison of the different attention operations.
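
The sketch below extends the previous linear-attention snippet to the segmented case of Equation 6: per-clip statistics $U_{\mathcal{S}_m}$ and $Z_{\mathcal{S}_m}$ are accumulated with `index_add_` (standing in for the scatter primitives) and gathered back per frame. The function name, shapes, and epsilon term are our own assumptions.

```python
import torch
import torch.nn.functional as F

def segmented_linear_attention(Q, K, V, segment, num_segments, eps=1e-6):
    """Sketch of Eq. (6): linear attention restricted to each clip (segment).

    Q, K: (N, d), V: (N, d_v) stacked over all frames in the packed batch;
    segment: (N,) long tensor holding the clip index of each frame.
    """
    phi_q = F.elu(Q) + 1.0
    phi_k = F.elu(K) + 1.0
    d, d_v = Q.size(1), V.size(1)
    # Per-segment U_{S_m} = sum_{j in S_m} phi(K_j) V_j^T and Z_{S_m} = sum_{j in S_m} phi(K_j).
    U = torch.zeros(num_segments, d, d_v, device=Q.device)
    U.index_add_(0, segment, phi_k.unsqueeze(-1) * V.unsqueeze(1))
    Z = torch.zeros(num_segments, d, device=Q.device)
    Z.index_add_(0, segment, phi_k)
    numer = torch.einsum('nd,ndv->nv', phi_q, U[segment])        # phi(Q_i)^T U_{S_m}
    denom = (phi_q * Z[segment]).sum(-1, keepdim=True) + eps     # phi(Q_i)^T Z_{S_m}
    return numer / denom

# Usage sketch: 90 packed frames from 2 clips, d = d_v = 64.
Q = torch.randn(90, 64); K = torch.randn(90, 64); V = torch.randn(90, 64)
segment = torch.cat([torch.zeros(40, dtype=torch.long), torch.ones(50, dtype=torch.long)])
print(segmented_linear_attention(Q, K, V, segment, num_segments=2).shape)  # torch.Size([90, 64])
```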

STAR Framework

In this work, we propose the Sparse Transformer-based Action Recognition (STAR) framework. Figure 5 (c) shows an overview of the STAR framework. The framework is built upon several spatial-temporal Transformer blocks (ST-blocks) followed by context-aware attention and an MLP head for classification. Each ST-block comprises two pipelines: the spatial Transformer encoder and the temporal Transformer encoder. Each Transformer encoder consists of several key components, including multi-head self-attention (MHSA), a skip connection (the Add & Norm part in Figure 5 (c)), and a feed-forward network. The spatial Transformer encoder utilizes sparse attention to capture the topological correlation of connected joints in each frame. The temporal Transformer encoder utilizes segmented linear attention to capture the correlation of joints along the time dimension. The output sum of the two encoder layers is fed to the context-aware attention module, which performs a weighted summarization over the sequence of frames. Positional encoding is also applied before the ST-blocks to provide the ordering context of the input sequence. Below is a brief introduction to each of these components.

Context-aware attention

Figure 6: The context-aware attention is utilized to summarize each video clip.

In previous works (Yan, Xiong, and Lin 2018; Shi et al. 2019b), before connecting to the final fully-connected layer for classification, the summarization of the video clip embedding along the temporal dimension is implemented by global average pooling. Instead, we utilize a probabilistic approach through context-aware attention, extended from the work of (Bai et al. 2019), to enhance the robustness of this step, as demonstrated in Figure 6. Denote the input tensor embedding of video clip $S_m$ as $V\in R^{F\times N\times D}$, where $F$ is the number of frames in video clip $S_m$, $N$ is the number of skeleton joints, and each joint possesses $D$ features; $v_i\in R^{N\times D}$ is the embedding of frame $i$ of $V$. First, a global context $c\in R^{N\times D}$ is computed as a simple average of the frame embeddings followed by a nonlinear transformation: $c = \tanh\left(\frac{1}{F}W\sum_{i\in S_m}^{F}v_i\right)$, where $W\in R^{D\times D}$ is a learnable weight matrix. The context $c$ provides the global structural and feature information of the video clip and, through the learned weight matrix, adapts to the similarity between frames in video clip $S_m$. Based on $c$, we compute one attention weight for each frame. For frame $i$, to make its attention aware of the global context, we take the inner product between $c$ and its embedding. The intuition is that frames similar to the global context should receive higher attention weights. A sigmoid function $\sigma(x) = \frac{1}{1+\exp(-x)}$ is applied to the result to ensure the attention weights are in the range $(0,1)$. Finally, the video clip embedding $v'\in R^{N\times D}$ is the weighted sum of the frame embeddings, $v' = \sum_{i\in S_m}^{F}a_i v_i$. The following equations summarize the proposed context-aware attention mechanism:

$$\begin{aligned} c &= \tanh\left(\frac{1}{F}W\sum_{j\in S_m}^{F}v_j\right) = \tanh\left(\frac{1}{F}\left(V^T\cdot\mathbf{1}\right)W\right)\\ v' &= \sum_{i\in S_m}^{F}\sigma\left(v_i^T\left[\tanh\left(\frac{1}{F}W\sum_{j\in S_m}^{F}v_j\right)\right]\right)v_i = \sum_{i\in S_m}^{F}\sigma\left(v_i^T c\right)v_i = \left[\sigma(Vc)\right]^T V \end{aligned} \qquad (7)$$
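
A minimal single-clip sketch of Equation 7 is given below. Here the attention weight is computed per frame and joint, which is one plausible reading of the inner product $v_i^T c$, and the class name is our own; it is a sketch under those assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """Sketch of Eq. (7) for a single clip: context-aware weighted summary over frames."""
    def __init__(self, d: int):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)   # learnable D x D weight matrix

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (F, N, D) frame embeddings of one clip (F frames, N joints, D features).
        c = torch.tanh(self.W(v.mean(dim=0)))          # global context, (N, D)
        a = torch.sigmoid((v * c).sum(dim=-1))         # attention weight per frame/joint, (F, N)
        return (a.unsqueeze(-1) * v).sum(dim=0)        # weighted sum over frames, (N, D)

# Usage sketch: 40 frames, 25 joints, 64 features.
clip = torch.randn(40, 25, 64)
print(ContextAttention(64)(clip).shape)   # torch.Size([25, 64])
```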

Positional Encoding

The attention mechanism is order-agnostic to permutations of the input sequence (Vaswani et al. 2017; Tsai et al. 2019) and treats the input as an unordered bag of elements. Therefore, an extra positional embedding is necessary to maintain the data order, since time-series data are inherently sequentially ordered. These positional embeddings then participate in the evaluation of the attention weight and value between tokens $i$ and $j$ in the input sequence.

Segmented Sequential Positional Encoding. However, as we arrange the variable-length video clips into a batch along the temporal dimension, it is not feasible to directly apply positional encoding to the whole batch. Therefore, we introduce the segmented positional encoding, where each video clip gets its own positional encoding according to the batch indices. An example of such encoding is shown in Figure 7.
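
One possible realization is a standard sinusoidal encoding that restarts at position zero for every clip in the packed batch, keyed off the batch-index vector; the function below is a hedged sketch under those assumptions (even embedding dimension, frames grouped by clip), not the paper's exact implementation.

```python
import torch

def segmented_positional_encoding(batch: torch.Tensor, d: int) -> torch.Tensor:
    """Sinusoidal encoding that restarts at position 0 for every clip in the packed batch.

    batch: (N,) clip index of every frame (frames assumed grouped by clip); d: even embedding dim.
    """
    counts = torch.bincount(batch)
    pos = torch.cat([torch.arange(int(c)) for c in counts]).float()   # position within own clip
    i = torch.arange(0, d, 2).float()
    angles = pos.unsqueeze(1) / (10000 ** (i / d))                    # (N, d/2)
    pe = torch.zeros(batch.size(0), d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Usage sketch: two clips of 40 and 50 frames, 64-dimensional embedding.
batch = torch.cat([torch.zeros(40, dtype=torch.long), torch.ones(50, dtype=torch.long)])
print(segmented_positional_encoding(batch, 64).shape)   # torch.Size([90, 64])
```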

Structural Positional Encoding. We also attempted to apply structural positional encoding, e.g., tree-based positional encoding (Shiv and Quirk 2019; Omote, Tamura, and Ninomiya 2019), to the spatial dimension, i.e., the tree topology of the skeleton. Experiments show that the approach we used does not improve our model's performance significantly. Hence, to reduce our model's complexity, we decided not to apply structural positional encoding in this work and leave it for future research.

Figure 7: Illustration of segmented positional encoding for a batch of 4 video clips. The x-axis represents the number of frames and the y-axis represents the feature dimension.
Method           | NTU-60 X-subject | NTU-60 X-view | NTU-120 X-subject | NTU-120 X-setup
ST-GCN           | 81.5             | 88.3          | 72.4              | 71.3
ST-TR            | 88.7             | 95.6          | 81.9              | 84.1
STAR-64 (ours)   | 81.9             | 88.9          | 75.4              | 78.1
STAR-128 (ours)  | 83.4             | 89.0          | 78.3              | 80.2

Table 1: Comparison of models' accuracy on the NTU RGB+D 60 and 120 datasets

Model            | CUDA time (ms) | Num. of parameters | GMACs
ST-GCN           | 333.89         | 3.1M               | 261.49
ST-TR            | 1593.05        | 6.73M              | 197.55
STAR-64 (ours)   | 86.54          | 0.42M              | 15.58
STAR-128 (ours)  | 191.23         | 1.26M              | 73.33

Table 2: Comparison of models' efficiency

Experiments

In this section, we conduct experiments and ablation studies to verify the effectiveness and efficiency of our proposed sparse spatial and segmented linear temporal self-attention operations. The comparison is made with ST-GCN, the baseline GCN model, and ST-TR, one of the state-of-the-art hybrid models, which utilizes full attention coupled with graph convolutions. The corresponding analysis demonstrates the potential of our model and the possible room for improvement.

Datasets

In the experiments, we evaluate our model on the two largest-scale 3D skeleton-based action recognition datasets, NTU RGB+D 60 and 120.

NTU RGB+D 60

This dataset contains 56,880 video clips covering 60 human action classes. The samples are performed by 40 volunteers and captured by three Microsoft Kinect v2 cameras (Shahroudy et al. 2016). It contains four modalities, including RGB videos, depth sequences, infrared frames, and 3D skeleton data. Our experiments are conducted only on the 3D skeleton data. The length of the action samples varies from 32 frames to 300 frames. Each frame contains at most 2 subjects, and each subject contains 25 3D joint coordinates. The dataset follows two evaluation criteria, Cross-Subject and Cross-View. In the Cross-View evaluation (X-View), there are 37,920 training samples captured from cameras 2 and 3 and 18,960 test samples captured from camera 1. In the Cross-Subject evaluation (X-Sub), there are 40,320 training samples from 20 subjects and 16,560 test samples from the rest. We follow the two original benchmarks and report the Top-1 accuracy as well as the profiling metrics.

NTU RGB+D 120

The dataset (Liu et al. 2019) extends NTU RGB+D 60 and is currently the largest dataset with 3D joint annotations. It adds 57,600 new skeleton sequences representing 60 new actions, for a total of 114,480 videos covering 120 classes, performed by 106 subjects and captured from 32 different camera setups. The dataset follows two criteria, Cross-Subject and Cross-Setup. The Cross-Subject evaluation, similar to the previous dataset, splits the subjects in half into training and testing sets. In the Cross-Setup evaluation, the samples are divided by the 32 camera setup IDs, where the even setup IDs are used for training and the odd setup IDs for testing. As with the previous dataset, no preprocessing is done to set a uniform video length for all the samples. We follow the two criteria and report the Top-1 accuracy and the profiling metrics.

Unlike GCN-based models, where the length of all samples and the number of subjects need to be fixed (e.g., 300 frames and 2 subjects), our model can process input samples of varying length and with varying numbers of subjects, so no further padding preprocessing is applied to the samples.

Configuration of experiments

Implementation details. As the original Transformer framework (Vaswani et al. 2017) employs a unified model size $d$ for every layer, we follow the same notion and keep the hidden channel size uniform across the attention heads and the feed-forward networks. We run the experiments with two different hidden channel sizes, 64 and 128, for our Transformer encoders (STAR-64 and STAR-128), respectively. The hidden channel size of the MLP head is also proportional to that of the attention heads. Our model consists of 5 layers; each layer comprises one spatial encoder and one temporal encoder in parallel, and the output sum of the two encoders is fed to the next layer. Drop rates are set to 0.5 for every module. We also replace the ReLU non-linear activation function with SiLU (or Swish) (Elfwing, Uchibe, and Doya 2018; Ramachandran, Zoph, and Le 2017) to increase the stability of gradients in the back-propagation phase (GELU or SELU bring a similar effect). Our model is implemented with the deep learning framework PyTorch (Paszke et al. 2019) and its extension PyTorch Geometric (Fey and Lenssen 2019). The scattering/gathering operations and sparse matrix multiplications are based on PyTorch Scatter (Fey 2021a) and PyTorch Sparse (Fey 2021b), respectively.

Model    | MACs                           | Parameters         | Latency
ST-GCN   | Conv2d: 260.4 GMACs            | Conv2d: 3.06M      | Conv2d: 149.92 ms
         | BatchNorm2d: 737.3 MMACs       | BatchNorm2d: 6.4K  | BatchNorm2d: 19.92 ms
         | ReLU: 184.3 MMACs              | Linear: 15.4K      | ReLU: 4.49 ms
ST-TR    | Conv2d: 810.57 GMACs           | Conv2d: 2.7M       | Conv2d: 692.39 ms
         | MatMul: 161.1 GMACs            | BatchNorm2d: 10.5K | MatMul: 161.38 ms
         | BatchNorm2d: 138.4 MMACs       | Linear: 30.8K      | BatchNorm2d: 38.97 ms
STAR-64  | MatMul (attention): 24.4 GMACs | Linear: 83.2K      | MatMul: 25.27 ms
         | Mul (sparse): 12.3 GMACs       | LayerNorm: 1.3K    | Mul: 12.81 ms
         | Linear: 6.2 GMACs              |                    | Linear: 6.53 ms

Table 3: The breakdown analysis and top-3 components in each metric

Training setting. The maximum number of training epochs is set to 100. We used the Adam optimizer (Kingma and Ba 2014) with $\beta_1=0.9$, $\beta_2=0.98$ and $\epsilon=10^{-9}$. Following the setting of the original Transformer paper (Vaswani et al. 2017), the learning rate is adjusted throughout the training:

$$lrate = d^{-0.5}\cdot\min\left(t^{-0.5},\ t\cdot w^{-1.5}\right) \qquad (8)$$

where $d$ is the model dimension, $t$ is the step number, and $w$ is the number of warmup steps. According to Equation 8, the learning rate increases linearly for the first $w$ training steps and then decreases proportionally to the inverse square root of the step number. We keep the original settings of the baseline models from their papers (Yan, Xiong, and Lin 2018; Plizzari, Cannici, and Matteucci 2020) and use their code provided online. All our training experiments are performed on a system with two GTX TITAN X GPUs and a system with one TITAN RTX GPU, while inference is executed on a single GPU.
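
For reference, Equation 8 can be implemented as a small scheduling function; the warmup value below is an assumption (the 4000 steps used in the original Transformer paper), since the paper does not state its exact choice.

```python
def transformer_lrate(step: int, d_model: int, warmup: int = 4000) -> float:
    """Eq. (8): warm up linearly for `warmup` steps, then decay with 1/sqrt(step)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Example: the peak learning rate is reached at step == warmup.
print(transformer_lrate(4000, d_model=64))
```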

Results and Analysis

We evaluate the accuracy and efficiency of the baseline GCN model (ST-GCN), our model (STAR), and the hybrid model (ST-TR), which utilizes both the Transformer and GCN frameworks.

Accuracy

We first evaluate the effectiveness of our Transformer encoder-based model compared to the ST-TR and ST-GCN models. Each model's accuracy is evaluated on the NTU RGB+D 60 and 120 testing datasets. As shown in Table 1, our model outperforms ST-GCN in both the cross-view (cross-setup) and cross-subject benchmarks of the two datasets. Our model achieves $3.6\sim7.7$ percent lower accuracy compared to ST-TR, which heavily relies on convolution-based key components inherited from ST-GCN and utilizes them in both the spatial and temporal pipelines. Our model yields modest performance compared to the state-of-the-art models on NTU RGB+D 60 and 120 when trained from scratch. Figure 11 shows that there is a performance gap between training and testing. Transformer architectures' lack of inductive biases, especially the translation equivariance and locality that are essential to convolution networks, could result in weak generalization. In NLP, Transformers are usually pretrained on a large corpus of text and fine-tuned on a smaller task-specific dataset to boost performance. We would like to conduct extensive experiments on pre-training and fine-tuning our model on a larger dataset in the future to bring its accuracy closer to that of the state-of-the-art models. In our future study, we also want to address effective generalization methods for Transformer models, which resolve the overfitting issues and improve the overall performance.

Efficiency

In this section, we evaluate the efficiency of the different models. As shown in Table 2, our model (STAR) is significantly more efficient in terms of model size, number of multiply-accumulate operations (GMACs), and latency. Each metric is evaluated by running inference on a sample dataset. Our model is fed the original skeleton sequences of varying length, while the other two models are fed fixed-size skeleton sequences padded to 300 frames and 2 persons. We use the official profiler of PyTorch (v1.8.1) (Paszke et al. 2019) and the Flops-Profiler of DeepSpeed (Rasley et al. 2020) to measure the benchmarks. The results are summarized with the following metrics:

Figure 8: The breakdown of MACs for the top-3 modules
Figure 9: The breakdown of the number of parameters for the top-3 modules
Figure 10: The breakdown of latency for the top-3 modules

MACs. The number of Multiply-Accumulate (MAC) operations is used to determine the efficiency of deep learning models. Each MAC operation is counted as two floating-point operations. With the same hardware configuration, more efficient models require fewer MACs than other models to fulfill the same task. As shown in Table 2, both of our models with different channel sizes execute only $\frac{1}{3}\sim\frac{1}{17}$ of the GMACs (i.e., Giga MACs) of the ST-GCN and ST-TR models, respectively.

Model size. Model size is another metric to measure the efficiency of a machine learning model. Given the same task, a smaller model delivering the same or very close performance is preferable. A smaller model is not only beneficial for higher speedup and fewer memory accesses but also gives better energy consumption, especially for embedded systems and edge devices with scarce computational resources and small storage volume. The number-of-parameters column in Table 2 depicts the size of the models; these parameters are the trainable weights of the model. Among all the models, STAR has the smallest model size, 0.42M and 1.26M parameters for STAR-64 and STAR-128, respectively.

Figure 11: The training and testing curves for STAR-64 (left) and STAR-128 (right) on the NTU RGB+D 60 X-subject benchmark.

Breakdown analysis. The breakdown analysis is used to identify potential bottlenecks within the different models (STAR-64, ST-TR, and ST-GCN). Table 3 provides the detailed profiling results for the top-3 computation modules that are dominant in each model. According to Figures 8, 9, and 10, the convolution operations cost a significant number of MAC operations and make the models computation-bound. ST-GCN and ST-TR mainly consist of convolution operations followed by batch normalization, which requires relatively large computational resources. Our Transformer model is based on sparse and linear attention mechanisms. It only produces relatively small attention weights from sparse attention, and it performs low-rank matrix multiplication for linear attention ($\mathcal{O}(n)$). This replaces the huge dynamic weights of attention coefficients in the standard attention mechanism, which has quadratic time and space complexity ($\mathcal{O}(n^2)$).

Ablation Study

In this section, we evaluate the effectiveness and efficiency of the sparse self-attention operation in our spatial encoder compared to the standard Transformer encoder with the full-attention operation. Table 4 and Table 5 show that our model with the sparse self-attention operation achieves higher accuracy on both the X-subject and X-view benchmarks and uses a significantly smaller number of GMACs and less runtime. This shows that the additional correlations of distant joints calculated by full attention do not improve the performance but rather contribute noise to the prediction. To handle this issue, learnable masks consistent with the skeleton adjacency matrix can be integrated into the full-attention calculation to avoid accuracy degradation, but this requires extra computation involving the learnable masks.

Method         | NTU-60 X-subject | NTU-60 X-view | NTU-120 X-subject | NTU-120 X-setup
STAR (sparse)  | 83.4             | 84.2          | 78.3              | 78.5
STAR (full)    | 80.7             | 81.9          | 77.4              | 77.7

Table 4: Classification accuracy comparison between sparse attention and full attention on the NTU RGB+D 60 and 120 skeleton datasets.

Model        | CUDA time (ms) | GMACs
STAR-sparse  | 105.7          | 15.58
STAR-full    | 254.7          | 73.33

Table 5: Efficiency comparison between sparse attention and full attention on the NTU RGB+D 60 skeleton dataset.

Conclusion

In this paper, we propose an efficient Transformer-based model with sparse attention and segmented linear attention mechanisms applied to the spatial and temporal dimensions of action skeleton sequences. We demonstrate that our model can replace graph convolution operations with self-attention operations and yield modest performance while requiring significantly less computational and memory resources. We also designed a compact data representation that is much smaller than the fixed-size, zero-padded representation utilized by previous models. This work was supported in part by the Semiconductor Research Corporation (SRC).
